International Handbook of Educational Evaluation
EDUCATIONAL EVALUATION
Kluwer International Handbooks of Education
VOLUME 9
A list of titles in this series can be found at the end of this volume.
International Handbook
of Educational Evaluation
Part One: Perspectives
Editors:
Thomas Kellaghan
Educational Research Centre, St. Patrick's College, Dublin, Ireland
Daniel L. Stufflebeam
The Evaluation Center, Western Michigan University, Kalamazoo, MI, U.S.A.
Lori A. Wingate
The Evaluation Center, Western Michigan University, Kalamazoo, MI, U.S.A.
ISBN 1-4020-0849-X
Introduction
Thomas Kellaghan, Daniel L. Stufflebeam & Lori A. Wingate 1
Introduction
Ernest R. House - Section editor 9
Introduction
Richard M. Wolf - Section editor 103
6 Randomized Field Trials in Education
Robert F. Boruch 107
7 Cost-Effectiveness Analysis as an Evaluation Tool
Henry M. Levin & Patrick J. McEwan 125
8 Educational Connoisseurship and Educational Criticism:
An Arts-Based Approach to Educational Evaluation
Elliot Eisner 153
9 In Living Color: Qualitative Methods in Educational Evaluation
Linda Mabry 167
Introduction
Marvin C. Alkin - Section editor 189
10 Evaluation Use Revisited
Carolyn Huie Hofstetter & Marvin C. Alkin 197
11 Utilization-Focused Evaluation
Michael Quinn Patton 223
12 Utilization Effects of Participatory Evaluation
J. Bradley Cousins 245
Introduction
M. F. Smith - Section editor 269
13 Professional Standards and Principles for Evaluations
Daniel L. Stufflebeam 279
14 Ethical Considerations in Evaluation
Michael Morris 303
15 How can we call Evaluation a Profession if there are
no Qualifications for Practice?
Blaine R. Worthen 329
16 The Evaluation Profession and the Government
Lois-Ellin Datta 345
17 The Evaluation Profession as a Sustainable Learning Community
Hallie Preskill 361
18 The Future of the Evaluation Profession
M. F. Smith 373
Introduction
H. S. Bhola - Section editor 389
19 Social and Cultural Contexts of Educational Evaluation:
A Global Perspective
H. S. Bhola 397
20 The Context of Educational Program Evaluation in the United States
Carl Candoli & Daniel L. Stufflebeam 417
21 Program Evaluation in Europe: Between Democratic and
New Public Management Evaluation
Ove Karlsson 429
22 The Social Context of Educational Evaluation in Latin America
Fernando Reimers 441
23 Educational Evaluation in Africa
Michael Omolewa & Thomas Kellaghan 465
Introduction
Marguerite Clarke & George Madaus - Section editors 485
24 Psychometric Principles in Student Assessment
Robert J. Mislevy, Mark R. Wilson, Kadriye Ercikan
& Naomi Chudowsky 489
25 Classroom Student Evaluation
Peter W. Airasian & Lisa M. Abrams 533
26 Alternative Assessment
Caroline Gipps & Gordon Stobart 549
27 External (Public) Examinations
Thomas Kellaghan & George Madaus 577
Introduction
Daniel L. Stufflebeam - Section editor 603
28 Teacher Evaluation Practices in the Accountability Era
Mari Pearlman & Richard Tannenbaum 609
29 Principal Evaluation in the United States
Naftaly S. Glasman & Ronald H. Heck 643
30 Evaluating Educational Specialists
James H. Stronge 671
Introduction
James R. Sanders - Section editor 697
31 Evaluating Educational Programs and Projects in the Third World
Gila Garaway 701
32 Evaluating Educational Programs and Projects in the USA
Jean A. King 721
33 Evaluating Educational Programs and Projects in Canada
Alice Dignard 733
34 Evaluating Educational Programs and Projects in Australia
John M. Owen 751
Introduction
Gary Miron - Section editor 771
35 Institutionalizing Evaluation in Schools
Daniel L. Stufflebeam 775
36 A Model for School Evaluation
James R. Sanders & E. Jane Davidson 807
37 The Development and Use of School Profiles
Robert L. Johnson 827
38 Evaluating the Institutionalization of Technology in Schools
and Classrooms
Catherine Awsumb Nelson, Jennifer Post & William Bickel 843
Introduction
Thomas Kellaghan - Section editor 873
39 National Assessment in the United States: The Evolution of a
Nation's Report Card
Lyle V. Jones 883
40 Assessment of the National Curriculum in England
Harry Torrance 905
41 State and School District Evaluation in the United States
William J. Webster, Ted O. Almaguer & Tim Orsak 929
42 International Studies of Educational Achievement
Tjeerd Plomp, Sarah Howie & Barry McGaw 951
43 Cross-National Curriculum Evaluation
William H. Schmidt & Richard T. Houang 979
List of Authors 997
Index of Authors 1001
Index of Subjects 1021
Introduction
Thomas Kellaghan
Educational Research Centre, St. Patrick's College, Dublin, Ireland
Daniel L. Stufflebeam
The Evaluation Center, Western Michigan University, MI, USA
Lori A. Wingate
The Evaluation Center, Western Michigan University, MI, USA
Evaluation Theory
Introduction
same program. The evaluator's task is to collect the views of people in and
around the program, the stakeholders.
These diverse views should be represented in the evaluation, both to do honor
to the stakeholders (honor is a traditional virtue for Stake) and so that readers
of the report can draw their own conclusions about the program. Value pluralism
applies to the audiences of the evaluation as well as to the stakeholders. No
impartial or objective value judgments are possible. The method par excellence
for representing the beliefs and values of stakeholders in context is the case
study. In Stake's view, much of the knowledge conveyed is tacit, in contrast to
propositional knowledge rendered through explicit evaluative conclusions.
Yvonna Lincoln's constructivist evaluation approach takes up where Stake's
approach leaves off, and Lincoln explicitly recognizes the debt to responsive
evaluation. But she and Egon Guba have pushed these ideas further in some
ways. The evaluator constructs knowledge in cooperation with the stakeholders
involved in the evaluation. There is also an increasingly strong localist theme in
Lincoln's later theorizing. Knowledge is created by and for those immediately in
contact, not for distant policy makers. She contrasts the local demands and
construction of knowledge with the demands of "rationalist" science, the more
common justification for evaluation studies.
My own deliberative democratic evaluation approach, developed with Ken
Howe, a philosopher, is the newest of these theoretical approaches. We have
reformulated the so-called fact-value dichotomy and constructed an evaluation
approach based on our new conception of values. In our analysis, fact and value
claims are not totally different kinds of claims, but rather blend into one another
in evaluation studies so that evaluative conclusions are mixtures or fusions of
both. To say that Follow Through is a good educational program is to make fact-
value claims simultaneously.
Our way of conducting evaluations is to include all relevant stakeholders, have
extensive dialogue with them, and engage in extended deliberations to reach the
conclusions of the study. In other words, we try to process the perspectives,
values, and interests of the various stakeholder groups impartially with a view to
sorting them out, in addition to collecting and analyzing data in more traditional
ways. The result is that the evaluative conclusions are impartial in their value
claims as well as their factual claims.
There are many possible ways of distinguishing among the four theoretical
positions presented in this section. The distinction I find most interesting is the
way the different theorists conceive and handle value claims, an issue which touches
on every evaluation. The fact-value distinction relates to a basic conundrum the
evaluation field has been caught in since its inception: How can you have scien-
tific evaluations if facts are objective and values are not? After all, evaluations
necessarily contain value claims at some level, values being present especially in
the criteria adopted for the evaluation and in the conclusions of the study.
Early evaluation theorists, like Don Campbell, were strongly influenced by the
positivists, who argued that facts were one thing and values quite another.
Evaluators could discover the facts objectively but values (value claims) were
participants. In some works Lincoln has argued that there are no objective values
or facts, just those constructed by people. She has gone so far as to assert there
is no reality, only multiple realities as seen by individuals. In a sense, everything
is subjective.
House and Howe's deliberative democratic evaluation approach tackles this
problem by reformulating the underlying value issue and the notion of objec-
tivity. We assert that evaluative claims can be objective or impartial. (Impartial,
in the sense of being unbiased, is a better term since objective means so many
different things, but objective is in more general use.) In our view the evaluator
arrives at impartial conclusions by somewhat different processes than Scriven.
Like Stake and Lincoln, we take the perspectives, values, and interests of stake-
holders as being central to the evaluation. But we attempt to sort through these
stakeholder views by seriously considering them and processing them impartially.
Rather than trying to isolate or insulate the evaluator, which we see as
impractical and futile, we advocate that the evaluator become engaged with the
stakeholders by trying to understand and seriously consider their positions. This
does not mean accepting stakeholder views at face value necessarily. The reason
for engagement is that the danger of being misinformed or misunderstanding
various claims is too great otherwise. The risk of engagement is that the evalu-
ator will become too partial and bias the conclusions. To prevent bias evaluators
must draw on both the traditional tools of evaluation as well as new ways of
processing perspectives, values, and interests. This is a more collegial view of
evaluation, as opposed to the lone scientist image. Nonetheless, unlike Lincoln,
the evaluator does not seek consensus among stakeholders.
Finally, Stufflebeam is the most practical of the theorists represented here. He
has been the least inclined to engage in philosophic theorizing about the nature
of value claims. My reading of his position on values must come from inferring
what he believes implicitly from his work over the years. In his Handbook
chapter he has presented a key diagram in which "core values" are at the center
of his definition of evaluation, with goals, plans, actions, and outcomes derived
from these core values, and with context, input, process, and product evaluation
emerging as ways to examine the goals, plans, actions, and outcomes.
I infer that he recognizes the central role that values must play in evaluation
and that the evaluator can and does arrive at objective evaluative claims. His
evaluation practice over the years suggests such a position. Less clear is where
these core values come from and whether the core values are potentially
challengeable and subject to rational analysis. Values are actually value claims,
and I see even core values as being subject to rational discussion and analysis. In
Stufflebeam's view, core values might derive from somewhere outside the purview
of inspection or challenge, such as from the client's values or the politician or
developer's values, or they might be jointly derived. Clearly, the goals, plans,
actions, and outcomes derived from the core values are challengeable in the
CIPP approach. In fact, they are the focus of the evaluation.
In clarifying his position on values, Stufflebeam says the core values derive
from democratic principles - equality of opportunity, human rights, freedom,
social and individual responsibility, the common good, creativity, excellence,
service, and conservation. Practically, in assigning value meaning to outcomes,
he places the heaviest emphasis on the extent to which beneficiaries' individual
and collective needs have been met.
In summary, although the term "value" constitutes the core of the concept of
"evaluation", the nature of values and value claims remains in dispute after more
than three decades of professional practice and development, even among
leading evaluation theorists. Needless to say, it is a tough issue, an issue that
extends well beyond evaluation into the world of social research. In spite of
disagreements, I do believe that evaluators collectively have a better grasp of the
value issue than most anyone else.
1
Evaluation Theory and Metatheory¹
MICHAEL SCRIVEN
Claremont Graduate University, CA, USA
DEFINITIONS
What is evaluation? Synthesizing what the dictionaries and common usage tell
us, it is the process of determining the merit, worth, or significance of things
(near-synonyms are quality/value/importance). Reports on the results of this
process are called evaluations if complex, evaluative claims if simple sentences,
and we here use the term evaluand for whatever it is that is evaluated (optionally,
we use evaluee to indicate that an evaluand is a person).
An evaluation theory (or theory of evaluation) can be of one or the other of
two types. Normative theories are about what evaluation should do or be, or how
it should be conceived or defined. Descriptive theories are about what evalua-
tions there are, or what types of evaluations there are (classificatory theories), and
what they in fact do, or have done, or why or how they did or do that (explanatory
theories).
A metatheory is a theory about theories, in this case about theories of evalua-
tion. It may be classificatory and/or explanatory. That is, it may suggest ways of
grouping evaluation theories and/or provide explanations of why they are the way
that they are. In this essay we provide a classification of evaluation theories, and
an explanatory account of their genesis.
DESCRIPTIONS
Of course, the professional practice of evaluation in one of its many fields, such
as program or personnel evaluation, and in one of its subject-matter areas, such
as education or public health or social work, involves a great many skills that are
not covered directly in the literal or dictionary definition. Determining the merit
of beginning reading programs, for example, requires extensive knowledge of the
type of evaluand - reading programs - and the methods of the social sciences and
often those of the humanities as well. To include these and other related matters
in the definition is attractive in the interest of giving a richer notion of serious
evaluation, so it's tempting to define evaluation as "whatever evaluators do." But
Evaluators play many roles in the course of doing what is loosely said to be
evaluation, and, like actors, they sometimes fall into the trap of thinking that
their most common role represents the whole of reality - or at least its essential
core. There seem to be about eight usefully distinguishable cases in the history
of evaluation in which this has happened. I list them here, calling them models
in order to bypass complaints that they are mostly lacking in the detailed
apparatus of a theory, and I append a note or two on each explaining why I see
it as providing a highly distorted image of the real nature of evaluation. In most
cases, the conception is simply a portrayal of one activity that evaluators often
perform, one function that evaluation can serve - no more an account of evalua-
tion's essential nature than playing the prince of Denmark provides the essence
of all acting. Then I go on to develop the general theory that is implicit in these
criticisms, one that I suggest is a more essential part of the truth than the others,
and the only one built on the standard dictionary definition and common usage
of the concept.
Model 1: Quasi-Evaluation
In the early days of program evaluation theory, most evaluators were working for
agencies or other funders who had commissioned evaluations to assist them in
their role as decision makers about the future funding of the programs. That led
some evaluators and philosophers of science to think that this was the essential
nature of evaluation - to be a decision-support process, by contrast with the main
run of scientific activity, which was said to be knowledge-generating. The
decisions to be supported in these cases were the typical funders' or managers'
decisions about whether the program should get increased or decreased funding,
or be abandoned or replicated. These decisions were mainly about success in
management terms, i.e., that the program was on time, on target, and on budget;
hence those became the three dimensions of data to be gathered by the
evaluator. Each of these dimensions, and hence the report on their achievement
(which in these cases was the whole evaluation report) is purely empirical, not
evaluative. In this model, the real values were in the eye of the funder or
manager, who had already decided that the program's target was a good one and
didn't want any comments from the evaluator on that score. So, the de facto merit
of the program was taken to be simply a matter of it being on time, on target, and
on budget. Today we call these the proper topics for monitoring, rather than
evaluation; and we have added a fourth requirement, that the program be "in
compliance", meaning consistent with the many special federal and/or state
special stipulations for idiosyncratic treatment of certain minorities or women or
local residents, etc. There is a sense in which we can perhaps say that such studies
are derivatively evaluative, since they can be said to adopt, implicitly, the
standards of punctuality, contractual and legal relevance, and fiscal responsibil-
ity. But there is no serious critique of these values in terms of their consistency
with other major values that apply to the program, let alone several other
matters that are routinely considered in serious program evaluation.
Normally, the evaluators were working directly for managers, not directly for
the funders, and the managers were principally interested in whether the program
was successful in the key sense for them, i.e., whether the program was achieving
what the managers had undertaken to achieve. The same kind of data was called
for in these cases. (These data could be used for either formative or summative
decisions.) This led to a model of evaluation which we can refer to as the goal-
achievement model; Malcolm Provus (1971), perhaps its best known advocate,
called it "discrepancy evaluation", the discrepancy being that between the goals
of the program and the actual achievements. This approach probably originated
in the boom era of educational evaluation which began in mid-century with
Ralph Tyler (see Madaus & Stufflebeam, 1998), but Stufflebeam et al. (1971),
Suppes, and Cronbach (1963, 1982) supported the general approach. In truth,
however, this kind of decision support is only one type of decision support (see
Models 2 and 3 below), and decision support in general is only one of evalua-
tion's roles (see Models 4 to 7 below).
It was a common part of the value-free doctrine, held by most of the supporters
of the quasi-evaluation model, to talk as if there was a clear distinction between
facts and values: the subject matter, it was said, of science on the one hand, and
ethics or politics on the other. But evaluation can be validly said to be a fact-
finding/fact-developing endeavor, if one understands that the facts it gathers and
develops include evaluative facts, something whose existence we all take for
granted in everyday life. "Michael Jordan is a great basketball player", we say,
and we're willing to prove it; or "The Mercedes E series is a good sedan, and isn't
that the truth;" or "Mother Teresa is a good person, if ever there was one." While
evaluative facts are sometimes observable (e.g., "That was a beautifully executed
dive"), they are often only established by long-chain inferences, and in profes-
sional program evaluation, or student performance evaluation, that is nearly
always the case. In this respect they are just like facts about causation or atomic
structure, which are amongst the most important scientific facts and call for the
most sophisticated methodology to establish. While there are contexts in which
we usefully distinguish (mere or bare) facts from value claims, in the general run
of discussion facts are identified by their certainty, and evaluative facts like the
examples above are up there with the best of them by that standard. It's
important to distinguish empirical facts from evaluative facts in most contexts;
but this is not the distinction between facts and values.
What was really happening in Model 1 evaluation was that the values of the
client (i.e., the person or organization commissioning the evaluation, this being
in most cases a manager) were being accepted without examination as the
ultimate definition of merit, worth, or significance. The evaluator in fact did no
assessing of the value of the program, hence the term "quasi-evaluation." But if
that's the "glass is half empty" version, there is another way of looking at it: the
"glass is half full" version. In this interpretation, one creates real values out of
the things one is looking at, the more or less universal managerial values of punc-
tuality, relevance, and frugality. In Model 2, which is only another way of looking
at the same process, people simply generalized this so that the client's values,
whatever they were (short of outrageous impropriety) were bypassed entirely and
merit, worth, or significance were taken to inhere entirely in punctuality/relevance/
frugality. The evaluator was thereby relieved of the responsibility for values-
critique of the implicit values by giving him or her the job of determining the
presence or absence of these managerial values. This was, and still is, very
attractive to social scientists brought up on the value-free doctrine. But it simply
misses the point of distinguishing evaluation from empirical research, which is
that in the former, the investigator "signs off on" an explicit evaluative conclusion.
Anything less is merely "What's so?" research.
With increasing recognition of: (i) the management bias in the first three models;
and (ii) the importance of evaluation as a consumer service, which typically
means summative evaluation (as in Consumer Reports), some theorists (myself
amongst them) began to talk as if summative evaluation for consumers was the
essential duty of evaluation. It is indeed an important aspect of evaluation that
tends to be ignored by the management-oriented approaches. Consumers, by
and large, have no interest in whether the program or product designer's goals
have been met, or indeed in what those goals are, and only a long-term, if any,
interest in improving the program. Hence they have little interest in formative
evaluation. Consumers are mainly interested in whether their own needs are met
(a different definition of "success" from the managers'). In this respect, they
are just as interested in all effects of the program, including what are side
effects from the point of view of the managers, instigators, and planners, effects
that were considerably underplayed in the first models listed above. However, despite
this shift to needs-meeting instead of goal-achieving, it was incorrect to act as if
the information needs of program funders and managers were illegitimate,
though it was important to stress that they were often not the same as those of
consumers.
Model 5: The Formative-Only Model
Then the pendulum swung full arc in the other direction and we had the
Cronbach group trying to convince us that there was no such thing in the real
world as summative evaluation (Cronbach, 1963, 1982). In reality, they said,
evaluation was essentially always formative. This may well have been true of their
experiences, but not of a reality containing Consumer Reports and the results of
jury trials and appeals; and, in the program area, the final reviews that precede
the termination of many programs when money becomes tight. While lessons
may be learnt from the summative crunch of these programs, they are lessons
from applied research for other programs, not lessons from an intermediate
evaluation of a particular program (the definition of formative evaluation).
More recently, a strong focus has emerged on taking the role of consumers, not
just their interests, more seriously. (There were precursors to this approach's
current popularity, even in Provus (1971) and certainly in the U.K. school of
transactional evaluation.) This has involved two different steps: (i) identifying
the groups that were not being consulted when the top-down models ruled; (ii)
encouraging or ensuring that those who are being evaluated participate in the
evaluation process itself. This kind of approach has become increasingly popular
for both ethical and political reasons, and various versions of it are called partici-
patory (the evaluees have input to the evaluation process) or collaborative (the
evaluees are coauthors) or empowerment evaluation (in empowered evaluations,
the evaluees are the sole authors and the evaluator is just a coach). So-called
"fourth-generation evaluation" shares much of this approach and draws an
interesting methodological conclusion: if the consumers are to be treated with
appropriate respect, the methods of investigation must be expanded to include
some that were previously dismissed as excessively subjective, e.g., empathy,
participant observation.
It is sometimes suggested by advocates of these positions, notably Fetterman
in talking about empowerment evaluation, that the more traditional models are
outdated, and evaluation, properly done, should always be seen as a collabora-
tive or even an empowerment exercise. Sometimes yes, often no; there are often
requirements of confidentiality or credibility or cost or time that legitimately
require distancing of the evaluator from the evaluated. There may or may not be
a cost in validity, since close association with the evaluees often leads to softening
of criticism, though it can also lead to the opposite if the chemistry does not
work. But, when possible and strictly controlled, the participatory approach can
improve the consumer sample representativeness, the ethical mandate, the
quality and span of relevant data gathering, the probability of implementing
recommendations, the avoidance of factual errors, and other aspects of the
quality of the evaluation. The empowerment approach, full-blown, is essentially
another way to avoid the evaluator doing evaluation; in this case, becoming a
teacher rather than a doer of evaluations.
Model 7 - Subtheme A
Bias is always present, he thinks, so don't try to hide it. But evaluation bias can
be prevented, or reduced to insignificance, and doing so is an appropriate and
frequently achievable ideal. Consider, for example, the scoring guide to a well-
developed mathematics test, or the rating of Einstein as a brilliant scientist, or
the evaluation of certain reading programs as poor investments (when others are
cheaper and better in the relevant respects). The denial of objectivity is, to
repeat, either self-refuting (if a universal claim) or something that has to be
documented in specific cases and applied only to them. When bias can be iden-
tified it can usually be minimized by redesign of the evaluation or even rephrasing
of its conclusions. We can get to a good approximation of the truth by succes-
sively reducing the errors in our approach. The game is not lost before it starts.
Our last example concerns the power of evaluators and evaluations. Not too
surprisingly, this is something that evaluators often exaggerate. On this view, an
extension of Model 6, fueled by what is seen as the force of Model 7's subtheme
A, it is said to be inevitable that evaluations exert social power, and any serious
approach to evaluation must begin with this fact and proceed to treat evaluation
as a social force whose primary task is to maximize the social value of its impact.
Here we have Ernest House and Kenneth Howe's (1999) "democratic deliber-
ative evaluation", and the transformative evaluation of many theorists, well
summarized by Mertens (1998) - who introduced that term - in several recent
works. Much of the discussion of this model refers to "increased stakeholder
representation", but we find that in fact they nearly always mean just increased
impactee representation, sometimes just increased representation from the even
narrower group of the targeted population. There's not much talk of bringing
other stakeholders to the party. However, the whole approach would be both
more ethical and more likely to lead to valid conclusions if it did actually and
explicitly recommend a convocation of all substantial stakeholders, including
representatives of taxpayers who paid for publicly funded programs and those
who were targeted but not impacted.
Here, too, we must place, albeit at one remove from activism, those who
define evaluation as a process aimed at the solution of social problems; Rossi
and Freeman's (1993) leading text is now joined in this view by the latest one, by
Mark, Julnes, and Henry (2000). But evaluation, even program evaluation,
should not be defined as something committed to social engineering. One might
as well define a tape measure as something committed to house building. That's
just one use for it, often not present, as in bridge building. Evaluation is often
concerned only with determining the extent and value of the impact of an
intervention on, for example, computer literacy or epidemiological variables.
This investigation may or may not be tied to social problem solving by some
stakeholder. While it's a bonus if it is, we can never make this part of its
definition. Evaluations are often done because we're interested in the results,
not because we have any reason in advance to think there is a problem. We may
in fact think there is something to be proud of, or at least relieved about, as in
the huge reduction of homicide rates in the urban U.S. in the last few years.
We could add to this set of eight positions: for example, the advocate-adversary
(a.k.a. jurisprudential) model focused on a process that it recommended in
which evaluation is treated as analogous to the judicial procedure in which
advocates for two or more parties contend in front of a jury. Intriguing though
this was, it is best thought of as a possible methodology rather than an analytical
insight into evaluation in general. For example, there are many cases in which
there are no funds or time for setting up the apparatus required, and many
others where the format of debate is more useful.
But the list provides enough examples, I think, to remind us of the kind of
conclusions drawn by members of a blindfolded committee investigating the nature
of an elephant. The reality is both simpler and more complex than their
individual conclusions: simpler at the meta-level, more complex in detail. At the
meta-level, the truth of the matter - if we may be so bold as to try for something
so controversial - is surely that these positions all represent an exaggeration of
one or another important role or aspect of evaluation into an oversimplified
conception of its entire nature. There is a hard core at the heart of these
different conceptions, and it is just the conception of evaluation as the process of
developing an analysis of an evaluand in terms of the specific properties of merit,
worth, or significance. This is analogous to the task undertaken by disciplines like
measurement or statistics or causal analysis, each of which owns the domain of
application of a certain set of predicates and has the task of describing certain
phenomena of the world in terms of that set - a task of great value in appropriate
circumstances, a great waste of time in others (although it may be enjoyable
recreation in those cases, e.g., in attempting serious measurement of auras and
horoscopes, or the evaluation of wine or wall art). In the case of evaluation, the
methodology for doing this well varies from case to case; it uses large slices of
social science methodology in some cases, the methodology of the law or the
humanities or mathematics in others, and in yet others the less prestigious but
highly reliable methodology of crafts and skills.
Much of the hardest work in evaluation theory involves unpacking the way in
which evaluation plays out as a pervasive multi-function, multi-role, multi-player,
multi-method enterprise that is context-dependent here, context-independent
there, biased here, objective there. It is just one part of humanity's great
knowledge-seeking effort that includes substantial parts of science, technology,
law, ethics, and other humanistic disciplines. In this sense, evaluation is simply
providing us with a certain kind of description of the things it studies - an
evaluative description. This is one amongst many types of description, including
for example the very important category of causal descriptions: someone may be
described as having died of cancer, a market as having suffered from a shortage of
raw materials, etc. Causal and evaluative descriptions are often very hard work
to establish; and, not coincidentally, extremely valuable.
WHY THE MULTIPLICITY OF MODELS?
ENDNOTES
1 An earlier version of parts of this paper appeared in Evaluation Journal of Australasia, New Series,
vol. 1, no. 2, December, 2001, and is reprinted here by courtesy of Doug Fraser, the editor of that
journal. This paper, and the list of references, has also been greatly improved as a result of
comments on a late draft by Daniel Stufflebeam.
REFERENCES
Cronbach, L.J. (1963). Course improvement through evaluation. Teachers College Record, 64, 672-683.
Cronbach, L.J. (1982). Designing evaluations of educational and social programs. San Francisco:
Jossey-Bass.
Guba, E.G., & Lincoln, Y.S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage
Publications.
House, E.R., & Howe, KR. (1999). T-1llues in evaluation. Thousand Oaks, CA: Sage.
Mabry, L. (Ed.). (1997). Advances in program evaluation: Vol. 3. Evaluation and the post-modern
dilemma. Greenwich, CT: JAI Press.
Madaus, G.F., & Stufflebeam, D.L. (1998). Educational evaluation: The classical writings of Ralph W.
Tyler. Boston: Kluwer-Nijhoff.
Mark, M.M., Julnes, G., & Henry, G. (2000). Evaluation: An integrated framework for understanding,
guiding, and improving public and nonprofit policies and programs. San Francisco: Jossey-Bass.
Mertens, D.M. (1998). Research methods in education and psychology: Integrating diversity with
quantitative and qualitative approaches. Thousand Oaks, CA: Sage.
Provus, M. (1971). Discrepancy evaluation. Berkeley, CA: McCutchan.
Rossi, P.H., & Freeman, H.E. (1993). Evaluation: A systematic approach (5th ed.). Newbury Park, CA:
Sage.
Stufflebeam, D.L. et al. (1971). Educational evaluation and decision-making. Itasca, IL: Peacock.
2
The CIPP Model for Evaluation
DANIEL L. STUFFLEBEAM
The Evaluation Center, Western Michigan University, MI, USA
A GENERAL SCHEMA
Figure 1 portrays the basic elements of the CIPP Model in three concentric circles.
The inner circle represents the core values that provide the foundation for one's
evaluations. The wheel surrounding the values is divided into four evaluative foci
associated with any program or other endeavor: goals, plans, actions, and
outcomes. The outer wheel denotes the type of evaluation that serves each of the
four evaluative foci. These are context, input, process, and product evaluation.
Each double arrow denotes a two-way relationship between a particular
evaluative focus and a type of evaluation. The task of setting goals raises
questions for a context evaluation, which in turn provides information for
validating or improving goals. Planning improvement efforts generates questions
for an input evaluation, which correspondingly provides judgments of plans and
Figure 1: Key Components of the CIPP Evaluation Model and Associated Relationships
with Programs
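For readers who find it helpful to see the schema in operational terms, the sketch below restates the Figure 1 correspondences as a simple data structure. This is a minimal illustration in Python, not part of the CIPP Model's formal apparatus; the names and the wording of each relationship are ours.

```python
# A minimal sketch of the Figure 1 relationships; the structure and wording
# are illustrative, not part of the CIPP Model's formal apparatus.
CORE_VALUES = ["equality of opportunity", "human rights", "the common good"]

# Each evaluative focus on the inner wheel is served by one type of
# evaluation on the outer wheel; the double arrow runs both ways.
CIPP_SCHEMA = {
    "goals":    ("context", "raises questions for, and is validated or improved by"),
    "plans":    ("input",   "raises questions for, and is judged and strengthened by"),
    "actions":  ("process", "raises questions for, and is guided and documented by"),
    "outcomes": ("product", "raises questions for, and is judged and interpreted by"),
}

for focus, (evaluation_type, relationship) in CIPP_SCHEMA.items():
    print(f"{focus}: {relationship} {evaluation_type} evaluation")
```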
The CIPP Model is designed to serve needs for both formative and summative
evaluations. CIPP evaluations are formative when they proactively key the
collection and reporting of information to improvement. They are summative
when they look back on completed project or program activities or performances
of services, pull together and sum up the value meanings of relevant information,
and focus on accountability.
The relationships of improvement/formative and accountability/summative
roles of evaluation to context, input, process, and product evaluations are repre-
sented in Table 1. This table shows that evaluators may use context, input,
process, and product evaluations both to guide development and improvement
of programs, projects, or materials - the formative role - and to supply informa-
tion for accountability - the summative role. Based on this scheme, the evaluator
would design and conduct an evaluation to help the responsible teachers,
principals, or other service providers plan and carry out a program, project, or
service. They would also organize and store pertinent information from the
formative evaluation for later use in compiling an accountability/summative
evaluation report.
While improvement/formative-oriented information might not answer all the
questions of accountability/summative evaluation, it would help answer many of
them. In fact, external evaluators who arrive at a program's end often cannot
produce an informative accountability/summative evaluation if the project has
no evaluative record from the developmental period. A full implementation of
the CIPP approach includes documentation of the gathered formative
evaluation evidence and how the service providers used it for improvement.
This record helps the external summative evaluator address the following
questions:
1. What student or other beneficiary needs were targeted, how pervasive and
important were they, how varied were they, how validly were they assessed,
As seen in Table 1, applying the CIPP Model proactively to guide decisions yields
much of the information needed to complete a retrospective summative
evaluation. However, it might omit some important data.
To forestall that possibility, evaluators can apply a checklist designed to cover
all variables involved in a comprehensive summative evaluation. An example
checklist follows:
the IEPs to serve as the basis for more specific planning or could merge the most
appropriate elements from several plans into a hybrid plan. Next the teacher
would add detail to the plan in terms of a schedule and resources and, usually,
more specific lesson plans. Subsequently, the teacher could go over the overall
plan with the student's parents and the school psychologist and/or special
education expert. These exchanges would serve to inform the parents about the
draft plan and to obtain their input and that of the school psychologist and/or
special education expert for finalizing the plan.
Next, the teacher would conduct a process evaluation in the course of putting
the IEP into action. The aim here is to assure that the IEP actually gets imple-
mented and periodically adjusted as needed rather than being set aside and
forgotten. The teacher could maintain a dated log of the respective activities of
the student, the parents, and the teacher in carrying out the action plan.
Periodically, the teacher could meet with the parents to review their child's
progress. The teacher would use information from such steps to keep the
instructional and home activity process on track, to modify the instructional plan
as needed, and to maintain an accountability record of the actual classroom
instruction and home support processes.
Throughout the instructional period, the teacher would also conduct a product
evaluation. The main purpose would be to assess the extent to which the instruc-
tion and learning goals are being achieved and the student's needs met. The
teacher would obtain and assess the student's homework products, classroom
participation and products, and test results. He or she could also ask the school's
psychologist to administer appropriate tests to determine whether the student is
overcoming previously assessed deficiencies and whether new needs and problems
have surfaced. Also, the teacher periodically could ask the student's parents to
report and give their judgments of the student's educational progress. Periodic
discussions of such product evaluation information with the parents and school
psychologist and/or special education expert would be useful in deciding whether
the instructional goals should be modified and how the guiding instructional plan
should be strengthened.
Near the end of each marking period and the school year, the teacher could
compile all relevant context, input, process, and product evaluation information
for this student and write an overall summative evaluation report. Such reports
could be much more useful to the student's parents, the school's principal, and
subsequent teachers than the simple sets of letter grades.
This example is focused on the most basic elements of educational evaluation
- the teacher and a student. After reviewing this illustration, my wife - a former
elementary school teacher - said it basically characterizes what good teachers
already do. Despite her possible opinion to the contrary, my purpose in including
this example is not to belabor classroom practices that are widespread and
obvious but to show how the CIPP Model is designed to fit within and support
an excellent process of teaching and learning.
The basic notions in this simple illustration can be extrapolated to CIPP
evaluations at the classroom, school, and school system levels. In the remainder
of this chapter, the discussion and examples are focused mainly on group rather
than individual applications of the CIPP Model.
Context Evaluation
Whatever the target group, administrators and staff can use a context evaluation
to set defensible goals and priorities or confirm that present goals and priorities
are sound. The context evaluation information also provides the essential criteria
for judging an intervention's success. For example, a school's staff may use scores
from a diagnostic reading test to later judge whether a reading improvement
project corrected the previously identified reading deficiencies of a targeted
group of students. As another example, a community health organization might
use statistics on the incidence of influenza among a targeted group of senior
citizens to assess whether a program of administering flu shots in area super-
markets helped lower the incidence of influenza among these seniors. In these
examples, the context information on reading proficiency and influenza incidence
provided the baseline information for judging postintervention measures.
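As a concrete, if simplified, illustration of how such baseline information supports later judgments, the sketch below compares influenza incidence before and after the flu-shot program described above; all of the figures are invented.

```python
# Illustrative only: baseline versus postintervention influenza incidence
# among a hypothetical group of targeted seniors. All figures are invented.
seniors_in_group = 1200
cases_before = 180  # cases in the season before the flu-shot program
cases_after = 95    # cases in the season after the program

rate_before = cases_before / seniors_in_group
rate_after = cases_after / seniors_in_group

print(f"baseline incidence:         {rate_before:.1%}")
print(f"postintervention incidence: {rate_after:.1%}")
print(f"change:                     {rate_after - rate_before:+.1%}")
```

A simple before-and-after difference of this kind is, of course, only a starting point for judging the program; it does not by itself establish that the program caused the change.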
Context evaluations may be initiated before, during, or even after a project,
course, classroom session, or other enterprise. In the before case, institutions
may carry them out as discrete studies to help set goals and priorities. When started
during or after a project or other enterprise, institutions will often conduct and
Table 2. Four Types of Evaluation

Context Evaluation
Objective: To identify the target population and assess their needs, diagnose barriers to meeting the needs, identify resources for addressing the needs, judge whether goals and priorities sufficiently reflect the assessed needs, and provide needs-based criteria for judging outcomes.
Method: By using such methods as system analysis; diagnostic tests; checklists; secondary data analysis; surveys; document review; literature review; hearings; problem-focused conferences; town meetings; interviews; focus groups; the Delphi technique; school/institution profiles; expert panel site visits; advisory groups; and institutional, program, or service databases.
Relation to decision making in the improvement process: For determining and documenting the setting to be served; the target group of beneficiaries; the goals for improvement; the priorities for budgeting time and resources; and the criteria for judging outcomes.

Input Evaluation
Objective: To identify and assess system capabilities, alternative program strategies, the procedural design for implementing the chosen strategy, the staffing plan, the schedule, and the budget, and to document the case for pursuing a particular course of action.
Method: By using such methods as literature search, visits to exemplary programs, expert consultants, advocate teams, panel review, and pilot trials to inventory and assess available human and material resources and solution strategies, and to assess the work plan for relevance, feasibility, cost, and economy.
Relation to decision making in the improvement process: For determining and documenting sources of support, a solution strategy, a procedural design, a staffing plan, a schedule, and a budget, i.e., for structuring change activities and providing a basis for judging both the chosen course of action and its implementation.
Context evaluations are needed to guide and assess the performance of individual
professionals as well as programs. A technique of use in conducting a context
evaluation related to improvement needs of individual physicians is what might
be labeled the Compilation and Analysis of Patient Records (see Manning &
DeBakey, 1987). Many such records are routinely completed and stored as a part
of the doctor-patient process, including patient files, hospital charts, and insur-
ance forms. In addition, a physician might maintain a card file on unusual, little
understood, or otherwise interesting patient problems. This helps the physician
gain a historical perspective on such cases. Patient records are a valuable source
of context evaluation information. A doctor can use such records to:
The physician can also compare baseline measures with later measures to
evaluate improvement efforts. Context evaluation questions that doctors might
answer by analyzing patient records include the following:
• What illnesses and types of accidents are most prevalent among the doctor's
patients?
• What are the important systematic variations in illnesses and accidents, aligned
with seasons and with the patients' age, gender, and occupation?
• To what extent do the doctor's patients evidence chronic problems that
treatments help only temporarily?
• What diagnostic tests and procedures does the doctor use most frequently?
• What are relative levels of usefulness and cost-effectiveness of the diagnostic
tests frequently ordered by the doctor?
• What types of problems does the doctor typically treat without referral?
• What types of problems does the doctor typically refer to other professionals?
• What are the success rates (or at least the relative absence of complaints) of referrals
to the different referral services?
• To what extent are patients' records complete, clear, and up to date?
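To suggest how a few of these questions might be answered once patient records have been abstracted into tabular form, the sketch below tallies prevalence and seasonal variation. The record fields (diagnosis, season, age band) are hypothetical, and any real analysis would require careful, privacy-aware handling of patient data.

```python
# A minimal sketch of analyzing abstracted patient records. The fields are
# hypothetical; real records would need careful, privacy-aware coding.
from collections import Counter

records = [
    {"diagnosis": "influenza",    "season": "winter", "age_band": "65+"},
    {"diagnosis": "influenza",    "season": "winter", "age_band": "65+"},
    {"diagnosis": "hypertension", "season": "summer", "age_band": "40-64"},
    {"diagnosis": "fracture",     "season": "winter", "age_band": "0-17"},
]

# Which illnesses are most prevalent among the doctor's patients?
prevalence = Counter(r["diagnosis"] for r in records)
print(prevalence.most_common())

# Are there systematic variations aligned with seasons?
by_season = Counter((r["diagnosis"], r["season"]) for r in records)
for (diagnosis, season), n in sorted(by_season.items()):
    print(f"{diagnosis:12} {season:8} {n}")
```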
Input Evaluation
Evaluators would set up a file to facilitate storage and retrieval of the informa-
tion. They might engage a study group to investigate it. They might conduct a
special planning seminar to analyze the material. The evaluators would use the
information to locate potentially acceptable solution strategies. They would rate
promising approaches on relevant criteria. Example criteria are listed below:
Next the evaluators could advise the clients about whether they should seek a
novel solution. In seeking an innovation, the clients and evaluators might docu-
ment the criteria the innovation should meet, structure a request for proposal,
obtain competing proposals, and rate them on the chosen criteria. Subsequently,
the evaluators might rank the potentially acceptable proposals and suggest how
the client group could combine their best features. The evaluators might conduct
a hearing or panel discussion to obtain additional information. They could ask
staff, administrators, and potential beneficiaries to react and express any concerns.
They would also appraise resources and barriers that should be considered when
installing the intervention. The clients could then use the accumulated informa-
tion to design what they see as the best combination strategy and action plan.
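One way to make the rating of promising approaches concrete is a weighted-criteria tally, as in the hypothetical sketch below. The criteria names echo those mentioned in Table 2 (relevance, feasibility, cost), but the weights and scores are invented and would in practice come from the clients, the evaluators, and the review of the literature.

```python
# A hypothetical weighted-criteria rating of competing solution strategies.
# The weights and 1-5 scores are invented for illustration.
criteria_weights = {"relevance": 0.4, "feasibility": 0.3, "cost": 0.3}

proposals = {
    "strategy A": {"relevance": 4, "feasibility": 3, "cost": 5},
    "strategy B": {"relevance": 5, "feasibility": 2, "cost": 3},
}

def weighted_score(scores):
    """Weighted sum of a proposal's ratings across the chosen criteria."""
    return sum(criteria_weights[c] * scores[c] for c in criteria_weights)

# Rank the potentially acceptable proposals, best first.
ranked = sorted(proposals.items(), key=lambda kv: weighted_score(kv[1]), reverse=True)
for name, scores in ranked:
    print(f"{name}: {weighted_score(scores):.2f}")
```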
Input evaluations have several applications. A chief one is in preparing a
proposal for submission to a funding agency or policy board. Another is to assess
one's existing practice, whether or not it seems satisfactory, against what is being
done elsewhere and proposed in the literature. Input evaluations have been used
Additional information, including a technical manual and the results of five field
tests of the technique, is available in a doctoral dissertation by Diane Reinhard
(1972).
Process Evaluation
With questions and concerns such as those mentioned above in mind, the
process evaluator could develop a general schedule of data collection activities
and begin carrying them out. Initially, these probably should be informal and as
unobtrusive as possible so as not to threaten staff, get in their way, or constrain
or interfere with the process. Subsequently, as rapport develops, the process
evaluator can use a more structured approach. At the outset, the process
evaluator should get an overview of how the work is going. He or she could visit
and observe centers of activity; review pertinent documents (especially the work
plans, budgets, expenditure records, and minutes of meetings); attend staff meet-
ings; interview key staff; and interview students, parents, and other beneficiaries.
The process evaluator then could prepare a brief report that summarizes the
data collection plan, findings, and observed issues. He or she should highlight
existing or impending process problems that the staff should address. The
evaluator could then report the findings at a staff meeting and invite discussion.
He or she might invite the staff's director to lead a discussion of the report.
The project team could then use the report for reflection and planning as
they see fit. Also, the process evaluator could review plans for further data
collection and subsequent reports with the staff and ask them to react to the
plans. Staff members could say what information they would find most useful at
future meetings. They could also suggest how the evaluator could best collect
certain items of information. These might include observations, staff-kept
diaries, interviews, or questionnaires. The evaluator should also ask the staff to
say when they could best use subsequent reports. Using this feedback, the
evaluator would schedule future feedback sessions. He or she would modify the
data collection plan as appropriate and proceed accordingly. The evaluator
should continually show that process evaluation helps staff carry out their work
through a kind of quality assurance and ongoing problem-solving process. He
or she should also sustain the effort to document the actual process and
lessons learned.
The evaluator should periodically report on how well the staff carried out the
work plan. He or she should describe main deviations from the plan and should
point out noteworthy variations concerning how different persons, groups, and/or
sites are carrying out the plan. He or she should also characterize and assess the
ongoing planning activity.
Staff members use process evaluation to guide activities, correct faulty plans,
maintain accountability records, enhance exchange and communication, and
foster collaboration. Some managers use regularly scheduled process evaluation
feedback sessions to keep staff "on their toes" and abreast of their responsi-
bilities. Process evaluation records are useful for accountability, since funding
agencies, policy boards, and constituents typically want objective and substantive
confirmation of whether grantees did what they had proposed. Process evalua-
tions can also help external audiences learn what was done in an enterprise in
case they want to conduct a similar one. Such information is also useful to new
staff, as a part of their orientation to what has gone before. Moreover, process
evaluation information is vital for interpreting product evaluation results. One
needs to learn what was done in a project before deciding why program out-
comes turned out as they did.
Over the years, The Evaluation Center has developed and employed a procedure
labeled the Traveling Observer Technique (Evers, 1980; Reed, 1991; Thompson,
1986). This technique most heavily addresses process evaluation data require-
ments but, like other techniques, also provides data of use in context, input, and
product evaluations. The technique involves sending a preprogrammed investi-
gator into a program's field sites. This evaluator investigates and characterizes
how the staffs are carrying out the project at the different sites. He or she reports
the findings to the other evaluation team members. This investigator may
participate in feedback sessions provided to the client group.
The traveling observer (TO) follows a set schedule of data collection and
writes and delivers reports according to preestablished formats and reporting
specifications. Before entering the field, the TO develops a traveling observer
handbook (Alexander, 1974; Nowakowski, 1974; Reed, 1989; Sandberg, 1986;
Sumida, 1994). The TO develops the handbook under the principal evaluator's
supervision. They tailor this evaluation tool to the particular evaluation's
questions. This handbook includes the following:
In an early application of this technique, The Evaluation Center sent out travel-
ing observers as "advance persons" to do initial investigation on two $5 million
statewide National Science Foundation programs. The Center assigned the TOs
to prepare the way for follow-up site visits by teams of experts. These teams
included national experts in science, mathematics, technology, evaluation, and
education. Each program included many projects at many sites across the state.
The evaluation budget was insufficient to send the five-member teams of "high
priced" experts to all the potentially important sites. Instead, the Center pro-
grammed and sent TOs to study the program in each state. Each TO spent two
weeks investigating the program and prepared a report. Their reports included a
tentative site visit agenda for the follow-up teams of experts. The TOs also
contacted program personnel to prepare them for the follow-up visits and gain
their understanding and support for the evaluation. On the first day of the team
site visits, each TO distributed the TO report and explained the results. The TOs
also oriented the teams to the geography, politics, personalities, etc., in the
program. They presented the teams with a tentative site visit agenda and
answered their questions. The TO's recommended plans for the site visit team
included sending different members of the site team to different project sites and
some total team meetings with key program personnel. During the week-long
team visits, the TOs remained accessible by phone so that they could address the
needs of the site visit team. At the end of this study, the Center engaged Michael
Scriven to evaluate the evaluation. He reported that the TO reports were so
informative that, except for the credibility added by the national experts, the TOs
could have successfully evaluated the programs without the experts. Overall, The
Evaluation Center has found that the Traveling Observer technique is a powerful
evaluation tool; it is systematic, flexible, efficient, and inexpensive.
Product Evaluation
Del Schalock and a team at Western Oregon University (Schalock, Schalock, &
Girod, 1998) are employing the Work Sample Technique to evaluate student
teachers. They require each student teacher to develop and apply a work sample
assessment exercise keyed to an instructional unit's goals. Work samples are
supposed to give the student clear learning goals and performance exercises for
showing mastery of the goals. A student teacher develops a work sample
according to specifications and administers it to each student before instruction
and following instruction. Teachers might employ a parallel form of the work
sample after instruction to help reduce effects of teaching the test. The super-
visor then examines pretest-posttest gains for each part of the work sample. They
do so at the level of individual students; at the level of high, medium, and low
ability groups; and overall. The teacher and his or her supervisor then carefully
examine the results. They assess the teacher's effectiveness in helping every
student achieve the learning goals. They also assess the validity of the teacher's
assessment exercises. Supervisors use these assessments to help teachers gauge
teaching competence, set improvement goals, and improve their abilities to
prepare classroom assessment materials.
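To make the gain analysis concrete, the following minimal sketch in Python shows the kind of pretest-posttest computation just described, at the individual, ability-group, and overall levels. The student names, scores, and groupings are hypothetical illustrations, not data from Schalock's work.

# Illustrative sketch only (not part of Schalock's methodology): hypothetical
# scores for one part of a work sample, on a 0-100 scale.
from collections import defaultdict

# Each record: (student, ability_group, pretest, posttest). All values invented.
records = [
    ("Ana",  "high",   72, 91),
    ("Ben",  "high",   68, 88),
    ("Cara", "medium", 55, 74),
    ("Dev",  "medium", 51, 69),
    ("Ely",  "low",    34, 58),
    ("Finn", "low",    30, 49),
]

# Gain for each individual student.
gains = {name: post - pre for name, _, pre, post in records}

# Mean gain within each ability group (high, medium, low).
by_group = defaultdict(list)
for _, group, pre, post in records:
    by_group[group].append(post - pre)
group_means = {g: sum(v) / len(v) for g, v in by_group.items()}

# Overall mean gain across all students.
overall_mean = sum(gains.values()) / len(gains)

print(gains)         # per-student gains, e.g. {'Ana': 19, ...}
print(group_means)   # mean gain per ability group
print(overall_mean)  # overall mean gain

The same computation would be repeated for each part of the work sample, giving the teacher and supervisor the three levels of analysis the technique calls for.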
The work sample product evaluation technique is strong in instructional
validity. It directly reflects instructional goals. It helps the teacher determine
whether students mastered the learning goals and how much they gained. The
technique also helps the teacher develop facility in developing and using perfor-
mance assessments keyed to instruction.
A further bonus is that the technique provides a basis for examining other
aspects of teachers' instructional and assessment practice.
DESIGNING EVALUATIONS
Once the evaluator and client have decided to conduct a context, input, process,
or product evaluation (or some combination), the evaluator needs to design the
needed work. This involves preparing the preliminary plans and subsequently
modifying and explicating them as the evaluation proceeds. Decisions about such
evaluation activities form the basis for contracting and financing the evaluation,
working out protocols with the involved institutions, staffing the evaluation,
scheduling and guiding staff activities, and assessing the evaluation plans. The
design process also provides opportunities for developing rapport, effecting
communication, and involving the evaluation's stakeholder groups.
Table 3 outlines points to be considered in designing an evaluation. These
points are applicable when developing the initial design or later when revising or
explicating it.
The formulation of the design requires that the client and evaluators collabo-
rate from the outset, beginning with agreement on the charge. The client needs to
identify the course, project, program, institution, or other object they will evaluate.
The evaluator should help the client define clear and realistic boundaries for the
study. The client is a prime source for identifying the various groups to be served
by the evaluation and projecting how they would use it. The evaluator should ask
clarifying questions to sort out different (perhaps conflicting) purposes. He or
she should also get the client to assign priorities to different evaluation questions.
The evaluator should recommend the most appropriate general type(s) of study
(context, input, process, and/or product). The client should confirm this general
choice or help to modify it. In rounding out the charge, the evaluator should
emphasize that the evaluation should meet professional standards for sound
evaluations.
The evaluator should define the data collection plan. He or she should provide
an overview of the general evaluation strategies. These could include surveys,
case studies, site visits, advocacy teams, goal-free searches for effects, adversary
hearings, a field experiment, etc. The evaluator should also write technical plans
for collecting, organizing, and analyzing the needed information. He or she
should obtain and consider stakeholders' reactions to the data collection plan.
The evaluator and client should anticipate that the data collection plan will
likely change and expand during the evaluation. This will happen as they
identify new audiences and as information requirements evolve.

Table 3. Outline for Documenting Evaluation Designs

Review of the Charge
Identification of the course or other object of the evaluation
Identification of the client, intended users, and other right-to-know audiences
Purpose(s) of the evaluation (i.e., program improvement, accountability, dissemination, and/or increased understanding of the involved phenomena)
Type of evaluation (e.g., context, input, process, or product)
Values and criteria (i.e., basic societal values, merit and worth, CIPP criteria, institutional values, technical standards, duties of personnel, and ground-level criteria)
Principles of sound evaluation (e.g., utility, feasibility, propriety, and accuracy) to be observed
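Purely as an illustration of how the "Review of the Charge" portion of Table 3 might be instantiated, the sketch below (in Python) records those elements as a structured document. All field names and sample values are invented for illustration; none are prescribed by the CIPP Model.

# Hypothetical sketch: the "Review of the Charge" elements of an evaluation
# design captured as a simple structured record, following Table 3's headings.
charge_review = {
    "object": "District K-8 science curriculum",
    "client": "District superintendent",
    "intended_users": ["curriculum committee", "school board"],
    "right_to_know_audiences": ["teachers", "parents"],
    "purposes": ["program improvement", "accountability"],
    "evaluation_types": ["context", "process", "product"],
    "values_and_criteria": ["merit and worth", "CIPP criteria"],
    "standards": ["utility", "feasibility", "propriety", "accuracy"],
}

# Because a design evolves, a quick completeness check can flag elements
# still to be negotiated with the client.
missing = [field for field, value in charge_review.items() if not value]
print("Charge elements still to be specified:", missing or "none")

A record of this kind can be reviewed, updated, and documented at each design revision, in keeping with the ongoing design process described above.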
Evaluators should gear reporting plans toward achieving use of the evaluation findings.
They should involve clients and other audiences in deciding the contents and
timing of needed reports. Stakeholders should also help in planning how the
evaluator will disseminate the findings. The reporting plan should consider
report formats and contents, audiovisual supports, review and revision, means of
presentation, and right-to-know audiences. Appropriate procedures to promote
use of findings might include oral reports and hearings, multiple reports targeted
to specified audiences, press releases, sociodramas to portray and explore the
findings, and feedback workshops aimed at applying the findings. The client and
evaluator should seriously consider whether the evaluator might play an impor-
tant role beyond the delivery of the final report. For example, the client might
engage the evaluator to conduct follow-up workshops on applying the findings.
Such follow-up work can be as important for helping audiences avoid
misinterpretation and misuse of findings as for helping them understand and make
appropriate use of the results. Also, only neophyte evaluators are surprised when
some person(s) or group(s) that don't like the evaluation's message attack and
otherwise try to discredit the work. Throughout the design and reporting processes
evaluators should be sensitive to the politics attending the evaluation and make
tentative plans to address unwarranted and unfair attacks on the evaluation.
The final part of the design is the plan for administering the evaluation. The
evaluator should identify and schedule the evaluation tasks consistent with the
needs of the client and other audiences for reports and in consideration of the
relevant practical constraints. The evaluator needs to define staff assignments
and needed special resources. The latter might include office space and
computer hardware and software. He or she should also assure that the proposed
evaluation personnel will be credible to the program's stakeholders. The evalua-
tor and client need to agree on who will assess the evaluation plans, processes,
and reports against appropriate standards. They also should agree on a mechanism
by which to periodically review, update, and document the evolving evaluation
design. They need to lay out a realistic budget. Also, they should formalize
contractual agreements including authority for editing and releasing findings and
rules for terminating the agreement.
The discussion of Table 3 has been necessarily general, but it shows that
designing an evaluation is a complex and ongoing task. It recommends that the
evaluator should continually communicate with the client and other audiences
and emphasizes the importance of evolving the evaluation design to serve emerg-
ing information requirements. Also, it stresses the need to maintain professional
integrity and contractual viability in the evaluation work. Readers are referred to
www.wmich.edu/evalctr/checklists/, where they can find a collection of checklists
to use in designing and contracting various kinds of evaluations.
CONCLUSION
This chapter has presented the CIPP Evaluation Model, which provides direc-
tion for evaluations of context, inputs, process, and products. The chapter
describes the CIPP Model's role in improving, researching, disseminating, and
accounting for school programs and other evaluands; explains its main concepts;
discusses its uses for guiding improvement efforts and for accountability;
provides illustrations of application; describes techniques particularly suited to
the model; and outlines the elements of sound evaluation designs. The CIPP
Model is shown to be adaptable and widely applicable in many areas, including
elementary, secondary, and higher education. It is recommended for use by
individual educators, groups of educators, schools, and systems of schools and
similar groups in disciplines outside education. Evaluators are advised to validly
assess the merit of a program, service, product, or institution and determine its
worth in serving all the rightful beneficiaries. The chapter's key themes are that
(1) evaluation involves assessing something's merit and worth; (2) the most
important purpose of evaluation is not to prove, but to improve; (3) evaluations
should be both proactive in guiding improvements and retrospective in producing
accountability reports; (4) evaluators should assess goals, strategies, plans, activi-
ties, and outcomes; (5) evaluations should be grounded in sound, clear values;
(6) evaluators should be interactive in effectively communicating with and serving
clients and other right-to-know audiences; (7) evaluation design and reporting
are ongoing processes that should be tailored to meeting the audiences'
information needs; (8) evaluators should be sensitive to and appropriately resis-
tant to attempts by persons or groups to corrupt or discredit the evaluation; (9)
a program's success should be judged on how well it meets the assessed needs of
targeted beneficiaries; (10) evaluations should employ multiple approaches to
gather relevant information, including both quantitative and qualitative methods;
(11) whatever the methods employed, the evaluation should meet appropriate
standards for sound evaluations; and (12) evaluations themselves should be
evaluated through internal and external metaevaluations.
APPENDIX:
Community Development
Higher Education
International
Personnel Evaluation
10. evaluation and design of a personnel evaluation system for the U.S. Marine
Corps
11. evaluation and advice for improving teacher evaluation systems in Hawaii,
Alaska, and Ohio
12. development of an evaluation criteria shell to guide the National Board for
Professional Teaching Standards' development and validation of assessment
systems to certify highly accomplished K-12 teachers
20. evaluations for the National Science Foundation of Delaware and Oregon
system projects in science education and mathematics education
21. evaluation of the impact and effectiveness of the National Science
Foundation's Advanced Technological Education (ATE) program
22. evaluation of the National Science Foundation-sponsored Rural Systemic
Initiatives Program
23. evaluation of educational programs for the Environmental Protection
Agency
24. evaluation of science education training provided by the Argonne National
Laboratory
Social/Youth
25. evaluation of Michigan's Life Services project for coordinating welfare services
26. evaluation of Michigan programs in supported employment, housing, and
transition from school to work
27. evaluation of the W.K. Kellogg Foundation-sponsored Kellogg Youth
Initiatives Program
28. evaluation of gambling addiction in Michigan for the Michigan Lottery
29. survey of female athletes in Michigan high schools about the possible realign-
ment of high school sports seasons to conform to intercollegiate seasons
Testing
Metaevaluation
ENDNOTES
1 The CIPP Model has withstood the test of time and practice over many years. The chapter's appendix
lists examples of the wide variety of evaluations that employed this model.
2 The Evaluation Center was established in 1963 at The Ohio State University and has been at
Western Michigan University since 1973. Since its inception The Evaluation Center has conducted
a wide range of projects aimed at advancing the theory and practice of evaluation and has provided
a learning laboratory for many graduate students, visiting scholars, and practitioners. It is the home
base of the North American Joint Committee on Standards for Educational Evaluation and from
1990 through 1995 housed the federally funded national research and development center on
teacher evaluation and educational accountability. Among the Center's experiences are applica-
tions in elementary and secondary education, continuing medical education, community and
economic development, self-help housing, community programs for children and youth,
administrator evaluation, and military personnel evaluation.
REFERENCES
Alexander, D. (1974). Handbook for traveling observers, National Science Foundation systems project.
Kalamazoo: Western Michigan University Evaluation Center.
Brickell, H.M. (1976). Needed: Instruments as good as our eyes. Occasional Paper Series, #7.
Kalamazoo: Western Michigan University Evaluation Center.
Cook, D.L., & Stufflebeam, D.L. (1967). Estimating test norms from variable size item and examinee
samples. Educational and Psychological Measurement, 27, 601-610.
Evers, J. (1980). A field study of goal-based and goal-free evaluation techniques. Unpublished doctoral
dissertation. Kalamazoo: Western Michigan University.
House, E.R., Rivers, W., & Stufflebeam, D.L. (1974). An assessment of the Michigan accountability
system. Phi Delta Kappan, 60(10).
Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of
educational programs, projects, and materials. New York: McGraw-Hill.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation standards.
Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards.
Thousand Oaks, CA: Sage.
Manning, P.R., & DeBakey, L. (1987). Medicine: Preserving the passion. New York: Springer-Verlag.
National Commission on Excellence in Education. (1983). A Nation at Risk: The Imperative of
Educational Reform. Washington, DC: U.S. Government Printing Office. (No. 065-000-00177-2).
Nowakowski, A. (1974). Handbook for traveling observers, National Science Foundation systems
project. Kalamazoo: Western Michigan University Evaluation Center.
Owens, T.R., & Stufflebeam, D.L. (1964). An experimental comparison of item sampling and
examinee sampling for estimating test norms. Journal of Educational Measurement, 6(2), 75-82.
Pedulla, J.J., Haney, W., Madaus, G.F., Stufflebeam, D.L., & Linn, R.L. (1987). Response to assessing
the assessment of the KEST. Kentucky School Boards Association Journal, 6(1), 7-9.
Reed, M. (1989). Fifth edition WMU traveling observer handbook: MacArthur project. Kalamazoo:
Western Michigan University Evaluation Center.
Reed, M. (1991). The evolution of the traveling observer (TO) role. Presented at the annual meeting of
the American Educational Research Association, Chicago.
Reinhard, D. (1972). Methodology development for input evaluation using advocate and design teams.
Unpublished doctoral dissertation, The Ohio State University, Columbus.
Sandberg, J. (1986). Alabama educator inservice training centers traveling observer handbook.
Kalamazoo: Western Michigan University Evaluation Center.
Sanders, W., & Horn, S. (1994). The Tennessee value-added assessment system (TVAAS): Mixed-
model methodology in educational assessment. Journal of Personnel Evaluation in Education, 8(3),
299-312.
Schalock, D., Schalock, M., & Girod, J. (1997). Teacher work sample methodology as used at
Western Oregon State College. In J. Millman (Ed.), Using student achievement to evaluate teachers
and schools (pp. 15-45). Newbury Park, CA: Corwin.
Scriven, M. (1991). Evaluation thesaurus (4th ed.). Newbury Park, CA: Sage.
Stufflebeam, D.L. (1997). The Oregon work sample methodology: Educational policy review. In J.
Millman (Ed.), Grading teachers, grading schools (pp. 53-61). Thousand Oaks, CA: Corwin.
Stufflebeam, D.L., Nitko, A., & Fenster, M. (1995). An independent evaluation of the Kentucky
Instructional Results Information System (KlRIS). Kalamazoo: Western Michigan University
Evaluation Center.
Sumida, J. (1994). The Waianae self-help housing initiative: Ke Aka Ho'ona: Traveling observer
handbook. Kalamazoo: Western Michigan University Evaluation Center.
Thompson, T.L. (1986). Final synthesis report of the life services project traveling observer procedure.
Kalamazoo: Western Michigan University Evaluation Center.
Webster, W.J., Mendro, R.L., & Almaguer, T.O. (1994). Effectiveness indices: A "value added"
approach to measuring school effect. Studies in Educational Evaluation, 20(1), 113-146.
3
Responsive Evaluation
ROBERT STAKE
Center for Instructional Research and Curriculum Evaluation, University of Illinois at
Champaign-Urbana, IL, USA
especially during the evaluation period. And still further, evaluation require-
ments are sometimes made more for the purpose of promulgating hoped-for
standards than for seeing if they are being attained. Responsive evaluators
expect to be working in political, competitive, and self-serving situations and the
better ones expose the meanness they find.
By seeking out stakeholder issues, responsive evaluators try to see how
political and commercial efforts extend control over education and social service.
They are not automatically in favor of activist and legitimate reform efforts, but
they tend to feature the issues they raise. Responsive evaluation was not
conceived as an instrument of reform. Some activists find it democratic; others
find it too conservative (Shadish, Cook, & Leviton, 1991). It has been used to
serve the diverse people most affected personally and culturally by the program
at hand - though it regularly produces some findings they do not like.
REFERENCES
Atkin, J.M. (1963). Some evaluation problems in a course content improvement project. Journal of
Research in Science Teaching, 1, 129-132.
Cronbach, L.J. (1963). Course improvement through evaluation. Teachers College Record, 64, 672-683.
Easley, J.A., Jr. (1966). Evaluation problems of the UICSM curriculum project. Paper presented at the
National Seminar for Research in Vocational Education. Champaign, IL: University of Illinois.
Greene, J.C. (1997). Evaluation as advocacy. Evaluation Practice, 18, 25-35.
Guba, E., & Lincoln, Y. (1981). Effective evaluation. San Francisco: Jossey-Bass.
Hamilton, D., Jenkins, D., King, C., MacDonald, B., & Parlett, M. (Eds.). (1977). Beyond the numbers
game. London: Macmillan.
House, E.R. (1980). Evaluating with validity. Beverly Hills, CA: Sage.
Kushner, S. (1992). A musical education: Innovation in the Conservatoire. Victoria, Australia: Deakin
University Press.
Mabry, L. (1998). Portfolios plus: A critical guide to alternative assessment. Newbury Park, CA: Corwin
Press.
MacDonald, B. (1976). Evaluation and the control of education. In D.A. Tawney (Ed.), Curriculum
evaluation today: Trends and implications. London: Falmer.
MacDonald, B., & Kushner, S. (Eds.). (1982). Bread and dreams. Norwich, England: CARE,
University of East Anglia.
McKee, A., & Watts, M. (2000). Protecting Space? The Case of Practice and Professional Development
Plans. Norwich, England: Centre of Applied Research in Education, University of East Anglia.
Parlett, M., & Hamilton, D. (1977). Evaluation as illumination: A new approach to the study of
innovatory programmes. In D. Hamilton, D. Jenkins, C. King, B. MacDonald, & M. Parlett (Eds.),
Beyond the numbers game. London: Macmillan.
Schwandt, T.A. (1989). Recapturing moral discourse in evaluation. Educational Researcher, 18(8),
11-16.
Shadish, W.R., Cook, T.D., & Leviton, L.C. (1991). Foundations of program evaluation. Newbury Park,
CA: Sage.
Stake, R.E. (1974). Program evaluation, particularly responsive evaluation. Reprinted in W.B. Dockrell
& D. Hamilton (Eds.). (1980). Rethinking educational research. London: Hodder and Stoughton.
Stake, R.E. (1986). Quieting reform: Social science and social action in an urban youth program.
Champaign, IL: University of Illinois Press.
Stake, R.E. (1998). Hoax? Paper presented at the Stake Symposium on Educational Evaluation, May 9, 1998.
Proceedings edited by Rita Davis, 363-374.
Stake, R.E., & Easley, J. (Eds.). (1978). Case studies in science education (Vols. 1-16). Urbana, IL:
University of Illinois.
Stake, R.E., Migotsky, C., Davis, R., Cisneros, E.J., DePaul, G., Dunbar, C., et al. (1997). The
evolving synthesis of program value. Evaluation Practice, 18(2), 89-103.
Stufflebeam, D.L., & Shinkfield, A.J. (1985). Systematic evaluation: A self-instructional guide to theory
and practice. Boston: Kluwer-Nijhoff Publishing.
4
Constructivist Knowing, Participatory Ethics and
Responsive Evaluation: A Model for the 21st Century
YVONNA S. LINCOLN
Texas A&M University, Department of Educational Administration, TX, USA
I know of no safe depository of the ultimate powers of the society but the
people themselves; and if we think them not enlightened enough to
exercise their control with a wholesome discretion, the remedy is not to
take it from them, but to inform their discretion. (Jefferson, letter to
William Charles Jarvis, September 28, 1820)
A part of the evaluation discourse discontent and disconnect has been over
methodology. Many practitioners of both inquiry, more broadly, and evaluation,
more narrowly, frame the debate about discourses as a debate about methods,
particularly the sharp distinction between quantitative, experimental and conven-
tional methodologies whose contours are virtually always attuned to causality,
and the less rigid, more experiential and open-ended focus of qualitative,
phenomenological, naturalistic, constructivist, and interpretivist methodologies
and strategies, which are focused less on causal relationships and more on social
patternings; networks; linkages; connections; nexuses; nodes; branches; and mutual,
interactive causality, rather than linear causality. While many conventional
techniques emphasize an abstracted, idealized, and frequently linear connection
between cause and effect, interpretive techniques attempt to recognize and
preserve the complex interrelationships of social life and "lived experience",
largely by focusing on interactive and mutually-shaping webs of influence
between and among people, events, and situations.
For constructivists (and indeed, for all sophisticated evaluators of whatever
sort), the question is not either/or, it is both. Despite assertions that are made
about constructivist inquiry, in fact, sophisticated constructivist evaluators, like
most inclusive evaluators, use both qualitative and quantitative data, and deploy
methods and techniques that are appropriate to the questions which need answers.
Those questions occasionally demand quantitative methods; they occasionally
demand qualitative methods. The use of qualitative methods is orthogonal to the
rise in calls for alternative models, philosophies and paradigms of inquiry
(Lincoln, 2002). Methods and models are not necessarily linked, although there
are frequently conceptual congruities between them. Quite likely, the reason for
the assumption of parallelism between the two has been what we called long ago
the "congeniality" and "fit" of qualitative methods with the evaluator's interest
in gaining access to stakeholders' meaning-making activities (Guba & Lincoln,
1981; 1989). Somehow, the terms "naturalistic" and "qualitative" became conflated,
although that was not our intention (nor has it been the perception of our most
discerning critics).
There is, however, another aspect of methodology that has not been fully
explored. If we may truly argue that there has been a conjoining of ontology and
epistemology, and another of epistemology and methodology (Lincoln, 2001a),
then it is also possible to apprehend the growing conviction that there is a
conjoining between the formerly apparently unrelated issues of methods and
ethics. The recognition is compelling that "method" is not only not necessarily a
"scientific" choice (Le., "fit" between question and method may be a more
rigorous criterion in the choice of method than widespread acceptability, as in
the case of quantitative methods), but that methods themselves vastly dictate what
kinds of data are open to be gathered, and irretrievably signal the evaluator's
intent vis-à-vis both data and respondents. Thus, the choice of methodology - the
overall design strategy which incorporates choices about methods - is
inescapably an ethical decision (Lincoln, 1991a; 1991b; 2001a; 2001b). It is an
ethical decision because the design strategy denotes which data forms are to be
considered meaningful; what processes for analyzing, compressing, rearranging
and arraying data will be considered rigorous, systematic and scientific; and how
the human subjects of any inquiry or evaluation are to be treated - whether as
repositories of needed data which must be "extracted", to use Oakley's (1981)
term, or whether as active, cognizing human beings. In the former sense, the
evaluator's task is merely to locate "subjects."¹ In the latter sense, the evaluator's
task is to interact with others who might, as a part of the evaluation effort,
engage in co-creating knowledge and extending understanding, and to do so with
respect, dignity and a full recognition of active agency on the part of those
respondents.
Methodology precisely reflects these multiple stances toward stakeholders. To
the extent that evaluation participants are treated as mere vessels, to be emptied
of data as the evaluator moves on, the objectification of human beings has been
reimprinted on the evaluation effort. To the extent that evaluation participants
are honored as active agents in the processing, analyzing, interpreting and re-
narration of evaluation data, they have regained status as non-objectified con-
scious beings. This is not to make the argument that all quantitative methods
dehumanize. Rather, it is to reinforce the argument that the socially constructed
and meaning-making activities of program and community members enjoy
equal status with statistics on that program or community. This would, in turn,
argue for having many, if not most, evaluation efforts be mixed-methods
evaluation efforts.
The central issue between and among these relationships is the ethical posture
of the practitioner, for constructivist knowing and inquiry, as well as responsive
evaluation, call for a different way of being in the world. This different way of
"being in the world" - a term borrowed from Reason and Heron's cooperative
inquiry model - springs from the consciousness of a different relationship,
assumed and then enacted, between evaluators and the multiple stakeholders
they serve. It is less a practice of one specific evaluation model than it is a prac-
tice of democratic, communitarian, inclusive, and often feminist principles writ
on the programmatic landscape (Christians, 2000; English, 1997; Greene, Lincoln,
Mathison, Mertens, & Ryan, 1998). Thus, a variety of models and philosophies
of evaluation practice fit comfortably within the responsive framework: partici-
patory evaluation, collaborative evaluation, co-generative evaluation (Greenwood
& Levin, 1998), democratic evaluation, constructivist or fourth-generation
evaluation, and the post-positivist, communitarian, inclusive and feminist forms
of evaluation outlined by Donna Mertens, Katherine Ryan, Anna Madison,
Joanne Farley, and others (Greene, Lincoln, Mathison, Mertens, & Ryan, 1998;
Mertens, Farley, Madison, & Singleton, 1994; Patton, 1997; but also see
Stufflebeam, 1994; and Perry & Backus, 1995, for cautions).
Constructivist (or naturalistic) evaluation models (there are several, under
different names) are premised on these communitarian, dialogic, care ethic-
oriented, and inclusionary ethical systems, committed to democratic life and
participatory civic dialogue. Activities in the evaluation process which support
such principles are consistent with the participatory ethics which constructivist
evaluation attempts to uphold. Activities which undermine dialogue and active
and meaningful stakeholder participation are those which cannot be said to be
constructivist (or inclusionary).
There are three main domains where the rigor of any constructivist evaluation may
be questioned and appraised. The first arena concerns the extent to which
stakeholder voices are represented in the evaluation report. Are stakeholder
voices present? Do they represent a solid base of stakeholders, extending far
beyond program managers and funders? Is there evidence in the evaluation
report that stakeholders have been assiduously sought out and invited to
participate in the evaluation effort? Have appropriate data collection methods -
matched to stakeholder needs - been utilized with all groups? Do stakeholders
themselves assent to the portrayals of themselves? Do they feel the repre-
sentations made are fair, honest, balanced?
The second arena in which rigor and trustworthiness may be appraised is the
extent to which the evaluation effort contributed to the overall
information level of all stakeholders. This is in direct counter-position to the
more conventional evaluation effort, where few, if any efforts are made to share
evaluation results with stakeholders, or, if such efforts are made, they are made
in terms and discourses which have little meaning for some groups of
stakeholders. The central point of constructivist and inclusionary evaluation
forms is the presentation of data in many forms and formats, formal and highly
informal, in ordinary and "natural" language, so that information for decision
making is available to all stakeholders equally. The central issue here is whether
or not the playing field has been leveled by the thoughtful dissemination of
information and data which address as many of the claims, concerns, and issues
as are possible within the time and fiscal limits of the evaluation contract.
This means, at a minimum (Guba & Lincoln, 1989), information which attends
to various claims which are made; data which resolve the concerns of various
stakeholding audiences; and data, information and even possibly research findings
which permit the development of an agenda for negotiation around serious issues
with the program (about which reasonable people might disagree). Information
is not withheld; full and meaningful participation cannot be guaranteed when
critical data are held back, or shared only as executive reports. Once stakeholding
groups have determined that the representations of them are fair, honest,
balanced, then all stakeholders have a right to those representations.
Withholding critical representations and/or concerns leaves the playing field
unleveled; sharing all representations negotiated as fair and balanced creates
awareness in all stakeholding groups of the positions of others (which is termed,
in fourth generation evaluation, educative authenticity, or the effort to create
understanding of all perspectives, viewpoints, and constructions involved in the
evaluation effort).
The third arena where rigor and authenticity may be discerned is in whether
or not the evaluation actually effects meaningful action and decision making, or
whether it serves as a prompt to action on the part of stakeholders. This is not to
imply that resolution of all conflicts will occur. It is rather to suggest that issues
will have been opened to negotiation, to dialogue, and to informed decision
making. (It is this prompt to action which leaves many more conventional
evaluators wary. It is also this tilt toward taking action which causes traditional
evaluators to label constructivist, fourth generation, and inclusionary approaches
"activist" or "advocacy-oriented.")
Briefly, these three major rigor concerns may be described as ontological
criteria, epistemological criteria, and praxis criteria. There are, of course, other
criteria proposed, including methodological criteria (criteria which address the
adequacy and thoroughness of evaluation design strategy and the ways of
deploying methods); tactical criteria (the extent to which the evaluator is willing
to help those unfamiliar with the forms and venues of participation become
familiar with them, and utilize those forms and venues to address their own
stakeholder needs and perspectives); and discursive criteria (the extent to which
the evaluator can speak to and be understood by the multiple stakeholding
audiences, especially via utilizing multiple reporting forms and portrayals).
In summary, fourth generation, constructivist, and inclusionary approaches to
evaluation are designed, first, to compensate for the inadequacies of conven-
tional evaluation theories and practices by expanding the range of activities,
attention, and involvement, and second, to recognize the inherently political
nature of social program evaluation. Evaluation is never "innocent", impartial or
disinterested. Rather, evaluation results and recommendations compete for
attention in a wider political arena where competition for resources and hence,
programs, is fierce. Inevitably, however "neutral", objective or value-free
evaluators may claim to be, they have alliances and allegiances. Some of those
allegiances - especially among constructivist, participatory, and inclusionary
evaluators - are open and aboveboard. Some allegiances are less transparent,
but they exist. For instance, the claim to neutrality itself is a value claim, linked
to a particular paradigm and a particular model of knowledge which asserts that
objectivity is not only possible, but desirable (a possibility denied not only by
constructivist evaluators, but by prominent and thoughtful philosophers of
science). Inclusionary evaluators try to make their political stances clear, their
alliances known, and their values open, although this goal is not always achieved
(see McTaggart, 1991).
Responsive approaches, openness to the socially constructed nature of much
of social life, and participatory ethics all characterize the fourth generation or
inclusionary evaluator. These stances are linked together by a liberal and plural-
istic theory of democracy and democratic action, wherein citizens are empowered
rather than excluded (see Gill & Zimmerman, 1990, for a stakeholder-based
evaluation effort which addresses precisely this issue). Such a posture seems to
be quite consonant with a society becoming increasingly pluralistic even as it
strives to maintain democratic and participatory traditions of civic life.
ENDNOTES
1 The move away from referring to those who provide data as subjects and toward the more
egalitarian and autopoietic respondents, or research participants, is welcome. More than twenty-five
years ago, Robert L. Wolf objected to the usage of the term "subjects", given the term's roots in the
Latin sub juga, meaning to "go under the yoke", or be taken into slavery. "Respondents", he
pointed out, from the Latin respondo, -ere, had its roots in the freedom to "answer back", as to an
equal. Respondents, or research participants, fits more nearly the mindset interpretivists utilize
when they seek out those who have access to meaningful data.
REFERENCES
Abma, T.A., & Stake, R.E. (2001). Stake's responsive evaluation: Core ideas and evolution. New
Directions for Evaluation, 92, 7-22.
Christians, C. (2000). Ethics and politics in qualitative research. In N.K. Denzin & Y.S. Lincoln
(Eds.), Handbook of qualitative research (2nd ed., pp. 133-154). Thousand Oaks, CA: Sage
Publications.
English, B. (1997). Conducting ethical evaluations with disadvantaged and minority target groups.
Evaluation Practice, 18(1), 49-54.
Geertz, C. (1971). Local knowledge. New York: Basic Books.
Gill, S.J., & Zimmerman, N.R. (1990). Racial/ethnic and gender bias in the courts: A stakeholder-
focused evaluation. Evaluation Practice, 11(2), 103-108.
Greene, J.C., Lincoln, Y.S., Mathison, S., Mertens, D., & Ryan, K. (1998). Advantages and challenges
of using inclusive evaluation approaches in evaluation practice. Evaluation Practice, 19(1),
101-122.
Greenwood, D.J., & Levin, M. (1998). Introduction to action research: Social research for social
change. Thousand Oaks, CA: Sage Publications.
Guba, E.G., & Lincoln, Y.S. (1981). Effective evaluation: Improving the usefulness of evaluation results
through responsive and naturalistic approaches. San Francisco: Jossey-Bass, Inc.
Guba, E.G., & Lincoln, Y.S. (1989). Fourth generation evaluation. Thousand Oaks, CA: Sage
Publications.
Lackey, J.F., Moberg, D.P., & Balistrieri, M. (1997). By whose standards? Reflections on empower-
ment evaluation and grassroots groups. Evaluation Practice, 18(2), 137-146.
Lincoln, Y.S. (1991a). Methodology and ethics in naturalistic and qualitative research: The
interaction effect. In M.J. McGee Brown (Ed.), Processes, applications and ethics in qualitative
research. Athens, GA: University of Georgia Press.
Lincoln, Y.S. (1991b). The arts and sciences of program evaluation: A moral tale for practitioners.
Evaluation Practice, 12(1), 1-7.
Lincoln, Y.S. (2001a). Varieties of validity: Quality in qualitative research. In J.S. Smart & W.G.
Tierney (Eds.), Higher education: Handbook of theory and research, 16, 25-72. New York: Agathon
Press.
Lincoln, Y.S. (2001b, February). The future of fourth generation evaluation: Visions for a new
millennium. Paper presented at the Stauffer Symposium on Evaluation, Claremont Graduate
University, Claremont, CA.
Lincoln, Y.S. (2002). The fourth generation view of evaluation. The future of evaluation in a new
millennium. In S.I. Donaldson & M. Scriven (Eds.), Evaluating social programs and problems:
Visions for the new millennium. Hillsdale, NJ: Erlbaum.
McTaggart, R. (1991). When democratic evaluation doesn't seem democratic. Evaluation Practice,
12(1), 9-21.
Mertens, D.M., Farley, J., Madison, A.M., & Singleton, P. (1994). Diverse voices in evaluation
practice. Evaluation Practice, 15(2), 123-129.
Oakley, A. (1981). Interviewing women: A contradiction in terms. In H. Roberts (Ed.), Doing feminist
research (pp. 30-61). London: Routledge.
Patton, M.Q. (1997). Toward distinguishing empowerment evaluation and placing it in a larger
context. Evaluation Practice, 18(2), 147-163.
Perry, P.D., & Backus, C.A. (1995). A different perspective on empowerment in evaluation: Benefits
and risks to the evaluation process. Evaluation Practice, 16(1), 37-46.
Stufflebeam, D.L. (1994). Empowerment evaluation, objectivist evaluation, and evaluation
standards: Where the future of evaluation should not go, and where it needs to go. Evaluation
Practice, 15(3), 321-338.
5
Deliberative Democratic Evaluation
ERNEST R. HOUSE
University of Colorado, School of Education, CO, USA
KENNETH R. HOWE
University of Colorado, School of Education, CO, USA
To the left of the continuum are statements like, "Diamonds are harder than
steel," which seems to have nothing to do with personal taste. The statement may
or may not be true. To the right of the continuum are statements such as,
"Cabernet Sauvignon is better than Chardonnay," which has much to do with
personal taste. In the center are statements such as, "A is more intelligent than
B," or, "This test is valid," or, "X is a good program." These statements are neither
brute facts nor bare values; in these statements facts and values blend together.
The statements can be right or wrong and they can also have strong value
implications.
Whether a claim counts as a "brute fact" or "bare value" or something in between
depends partly on context. For example, "John Doe is dead" used to look much
like a brute fact without much valuing. Yet what if John Doe is hooked up to a
life support system that maintains his breathing and heart beat, and he is in an
irreversible coma? The judgment that he is dead requires a shift in the meaning
of the concept, one spurred by judgments of value, such as what constitutes good
"quality of life." What was once pretty much a brute fact has become more value-
laden because of medical technology.
Professional evaluative statements fall towards the center of the fact-value
continuum for the most part. They are statements derived from a particular
institution, the evolving institution of evaluation. Whether a statement is true
and objective is determined by the procedures established by professional
evaluators according to the rules and concepts of their profession. Some human
judgments are involved, constructed according to the criteria of the institution,
but their human origin doesn't make them objectively untrue necessarily. They
are facts of a certain kind - institutional facts.
For example, consider a statement like, "My bank account is low." This can be
objectively true or false. Yet the statement depends on the institution of banking,
on concepts such as "banks," "bank accounts," "money," and being "low." Several
hundred years ago the statement would not have made sense because there were
no banks. Yet now we give our money to strangers for safekeeping and expect
them to return it. The institution of banking is socially maintained. The state-
ment is a fact embedded within the institution of banking, a part of our lives, and
it has strong value implications, even while capable of being true or false.
In general, evaluation and social research are based on such "institutional facts"
(Searle, 1995) rather than on brute ones, on statements such as, "This is a good
program," or "This policy is effective," or "Socioeconomic class is associated with
higher intelligence," or "Industrial society decreases the sense of community."
What these statements mean is determined within the institutions they are
part of, and the statements can be true or false the way we ordinarily understand
that term.
Evaluative statements can be objective (impartial is a better term) in the sense
that we can present evidence for their truth or falsity, just as for other statements.
In these statements factual and value aspects blend together. For example, a
statement like, "Social welfare increases helplessness," has value built in. It
requires an understanding of social welfare and helplessness, which are value-
laden concepts. Is the statement true? We decide its truth within the evidential
procedures of the discipline. We present evidence and make arguments that are
accepted in the discipline. There are disciplinary frameworks for drawing such
conclusions, frameworks different from banking.
So objectivity (and impartiality) in our sense means that a statement can be
true or false, that it is objective or impartial if it is unbiased (Scriven, 1972), that
evidence can be collected for or against the statement's truth or falsity, and that
procedures defined by the discipline provide rules for handling such evidence so
that it is unbiased. Of course, if one wants to contend that any human judgment
or concept of any kind is subjective, then all these statements are subjective. But
then so is the statement, "Diamonds are harder than steel," which depends on
concepts of diamonds and steel, though the referents are physical entities.
The sense of objectivity and impartiality we wish to reject explicitly is the
positivist notion that objectivity depends on stripping away all conceptual and
value aspects and getting down to the bedrock of pristine facts. Being objective
in our sense means working towards unbiased conclusions through the proce-
dures of the disciplines, observing the canons of proper argument and
methodology, maintaining healthy skepticism, and being vigilant to eradicate
sources of bias. We deny that it is the evaluator's job to determine only factual
claims and leave the value claims to someone else. Evaluators can determine
value claims too. Indeed, they can hardly avoid doing so.
SOME EXAMPLES
In the 1970s and 1980s the Program Evaluation and Methodology Division of the
General Accounting Office (GAO) was one of the most highly regarded
evaluation units in Washington. Eleanor Chelimsky was its director for many
years. She summarized what she learned over her years as head of the office
(Chelimsky, 1998).
According to her, Congress rarely asked serious policy questions about
Defense Department programs. And this was especially true with questions
about chemical warfare. In 1981 when she initiated studies on chemical warfare
programs, she found that there were two separate literatures. One literature was
classified information, favorable to chemical weapons, and presented by the
Pentagon in a one-sided way to the Congress. The other literature was critical of
chemical weapons, dovish, public, and not even considered by Congressional
policymakers.
On discovering this situation, her office conducted a synthesis of all the
literature, she says, which had an electrifying effect on members of Congress,
who were confronting certain data for the first time. The initial document led to
more evaluations, publicity, and eventually contributed to the international
chemical weapons agreement - a successful evaluation by any standard. This
chemical warfare work was predicated on analyzing the patterns of partisanship
of the previous research, understanding the political underpinnings of the
program and of the evaluation, and integrating conflicting perspectives, values,
and interests into the evaluation - which she recommended for all such studies.
We don't know what framework she used in conducting the study but we think
the deliberative democratic framework could produce somewhat similar results.
Deliberative democratic evaluation would include conflicting values and
stakeholder groups in the study, make certain all major views were represented
authentically, and bring conflicting views together so there could be dialogue and
deliberation. It would not only make sure there was sufficient room for dialogue
to resolve conflicting claims, but also help policymakers and the media resolve
the claims by sorting through the good and bad information.
All this data collection, analysis, and interpretation requires many judgments
on the part of the evaluators as to who is relevant, what is important, what is
good information, what is bad, how to manage the deliberations, how to handle
the media, and what the political implications are. Evaluators unavoidably
become heavily implicated in the findings, even when they don't draw the final
conclusions.
There are several points to be made. One is that some framework is necessary
to guide the evaluation, even if the framework is implicit. Second, the framework
is a combination of facts and values. How the various groups thought about
chemical warfare was an important consideration, as well as whose interests were
behind which position. Facts and values were joined together. The study was inclu-
sive so as to represent all relevant perspectives, values, and interests. Originally,
views critical of chemical warfare programs were omitted and only the Pentagon
views were included, thus biasing the conclusions. Also, there should be sufficient
dialogue with relevant groups so that all views are authentically represented. In
this case the potential victims of chemical warfare can hardly be present.
Someone had to represent their interests.
Finally, there was sufficient deliberation to arrive at the conclusions. In this
case deliberation was long and productive, involving evaluators, policy makers,
and even the media eventually. Deliberation might necessitate ways of protecting
evaluators and stakeholders from powerful stakeholder pressures that might
inhibit discussion. Proper deliberation cannot be a free-for-all among
stakeholders. If it is, the powerful win. Designing and managing all this involves
considerable judgment on the part of evaluators.
How does this chemical warfare case differ from evaluation of social programs?
Not much. In their evaluation of health care services on the Texas Gulf Coast,
Madison and Martinez (1994) identified the major stakeholders as the recipients of
the services (elderly African-Americans), the providers of the services (white
physicians and nurses), and representatives from African-American advocacy
groups. Each group had different views, with the elderly saying the services were
not sufficiently accessible, and the medical providers saying the elderly lacked
knowledge about the services.
In such an evaluation, there was no grand determination of the rights of
elderly African-Americans versus those of white professionals in society at large.
That was beyond the scope of the evaluation. Evaluators determined what was
happening with these services in this place at this time, a more modest task. The
study informed public opinion impartially by including views and interests,
promoting dialogue, and fostering deliberation. Impartiality was supported by
inclusion, dialogue, and deliberation and the expertise of the professional evalua-
tors. Just as evaluations can be biased by poor data collection and errors of
omission or commission, they can be biased by incorporating too narrow a range
of perspectives, values, and interests.
In a seminal study in Sweden, Karlsson (1996) conducted an evaluation that
illustrates how to handle dialogue among groups. He evaluated a five-year program
that provided care and leisure services for school-age children aged 9-12 in
Eskilstuna. The program aimed for more efficient organization of such services
and new pedagogical content to be achieved through School Age Care Centers.
Politicians wanted to know how such services could be organized, what the
pedagogical content would be, what the centers would cost, and what children
and parents wanted, in essence a formative evaluation.
Karlsson's first step was to identify the stakeholder groups and choose represen-
tatives from them, including politicians, managers, professionals, parents, and
children. He then surveyed parents and interviewed other stakeholder groups on
these issues: Politicians - What is the aim of the program? Parents - What do
parents want the program to be? Management - What is required to manage
such a program? Staff union - What do the staff unions require? Cooperating
professionals - What expectations are there from others who work in this field?
Children - What expectations do the children have? These survey data were
summarized and communicated to stakeholder groups in the form of four
different metaphors of ideal types of school age care centers. The metaphors
for the centers were the workshop, the classroom, the coffee bar, and the
living room.
In the second stage of the study, Karlsson focused on the implementation of
the centers, twenty-five altogether, serving five hundred students. This stage of
the study employed a "bottom-up" approach by first asking children how they
experienced the centers, as opposed to the "top-down" approach of the first stage.
Next, parents and cooperating professionals, then managers and politicians,
were interviewed. Dialogue was achieved by presenting to later groups what
prior groups had said.
In the first two stages of the study, the dialogue admitted distance and space
among participants. In the third stage the goal was to have face-to-face dialogue
and the establishment of a more mutual and reciprocal relationship. The aim was
to develop critical deliberation that could stimulate new thoughts among
different stakeholder groups and bring conflicts into open discussion.
Four meetings were arranged with representatives from the stakeholder
groups. To ensure that everyone could have a say, four professional actors played
out short scenes that illustrated critical questions and conflicts to be discussed.
The actors involved the audiences in dialogues through scenarios showing the
essence of problems (identified from the data), and enlisting audiences to help
the actors solve or develop new ways to see the problems. About 250 represen-
tatives participated in four performances. These sessions were documented by
video camera and edited to twenty-minute videos, which were used in later
meetings with parents, politicians, and staff and made available to other
municipalities.
In Karlsson's view, the aim of such critical evaluation dialogues is to develop
deeper understanding of program limitations and possibilities, especially for
disadvantaged groups. In this process the important thing is to enable the
powerless and unjustly treated stakeholders to have influence. Of course, the
Karlsson study is far beyond what most of us could accomplish with our modest
studies with shorter time frames. But his study illustrates the ingenuity with
which inclusion, dialogue, and deliberation can be handled.
In general, there are ten questions the evaluator might ask to ascertain
whether an evaluation incorporates a deliberative democratic approach. (See the
Deliberative Democratic Checklist, Appendix A, for more details; an illustrative
rendering of the questions as a simple review aid follows the list.) The ten
questions are these:
1. Whose interests are represented? The interests of all relevant parties should
be considered in the evaluation. Normally, this means the views and interests
of all those who have a significant stake in the program or policy under
review.
2. Are major stakeholders represented? Of course, not every single, individual
stakeholder can be involved. Usually, evaluators must settle for representa-
tives of stakeholder groups, imperfect though this might be. And no doubt
there are occasions when not all stakeholders can be represented.
Representation may mean that the evaluators bring the interests of such
stakeholders to the study in their absence.
3. Are any major stakeholders excluded? Sometimes important groups will be
excluded, and most often these will be those without power or voice, that is,
the poor, powerless, and minorities. It is a task of evaluators to include these
interests as best they can.
4. Are there serious power imbalances? It is often the case that particular
interests are so powerful that they threaten the impartiality of the findings.
Often the clients or powerful stakeholders dominate the terms of the study
or the views represented in the study.
5. Are there procedures to control the imbalances? It must fall within the
evaluator's purview to control power imbalances. Just as teachers must be
responsible for creating the conditions for effective discussion in classrooms,
evaluators must establish the conditions for successful data collection,
dialogue, and deliberation. Admittedly, this requires refined judgment.
6. How do people participate in the evaluation? The mode of participation is
critical. Direct involvement is expensive and time-consuming. Still, getting
the correct information requires serious participation from stakeholder groups.
7. How authentic is the participation? Respondents are used to filling out
surveys they care little about. As respondents become swamped with account-
ability procedures, they can become careless about their answers. This seems
to be an increasing problem as cosmetic uses of evaluations result in inau-
thentic findings.
8. How involved is their interaction? Again, while interaction is critical,
perhaps there can be too much. Should stakeholders be involved in highly
technical data analyses? This could bias the findings.
9. Is there reflective deliberation? Typically, evaluators finish their studies
behind schedule, rushing to meet an established deadline. The findings are
not mulled over as long as they should be, nor is sufficient deliberation built
into the study. There is a temptation to cut short the involvement of others
at this stage of the study to meet the timelines.
10. How considered and extended is the deliberation? In general, the more
extensive the deliberation, the better findings we would expect. For the most
part, there is not enough deliberation in evaluation rather than too much.
The deliberative democratic view is demanding here. But this should be weighed
against the fact that the most common error in evaluation studies is that the
conclusions don't match the data very well.
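As a purely illustrative aid (the authoritative instrument is the Deliberative Democratic Checklist in Appendix A), the sketch below, in Python, treats the ten questions as a simple review instrument that records a reviewer's note for each question and flags those not yet addressed. The sample answer is hypothetical.

# Illustrative only: the ten deliberative democratic questions as a simple
# review instrument. Answers are hypothetical free-text reviewer notes.
QUESTIONS = [
    "Whose interests are represented?",
    "Are major stakeholders represented?",
    "Are any major stakeholders excluded?",
    "Are there serious power imbalances?",
    "Are there procedures to control the imbalances?",
    "How do people participate in the evaluation?",
    "How authentic is the participation?",
    "How involved is their interaction?",
    "Is there reflective deliberation?",
    "How considered and extended is the deliberation?",
]

def review(answers):
    """Print each question with the reviewer's note, flagging gaps."""
    for question in QUESTIONS:
        note = answers.get(question, "").strip()
        print(f"- {question}")
        print(f"    {note if note else 'NOT YET ADDRESSED'}")

# Hypothetical partial review of a study.
review({"Whose interests are represented?": "All district stakeholder groups."})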
PROCEDURES FOR CONDUCTING STUDIES
Any study is a compromise, of course. We can never accomplish all that we want.
We can never fully implement idealized principles such as those we have
expounded. We need to be grounded in the real world even as we maintain
democratic aspirations. Experimentation is called for. A number of studies have
tried out procedures that might be useful in deliberative democratic evaluations.
Greene (2000) reported a study involving a new science curriculum in a local
high school. The new curriculum called for all students to be placed in general
science courses. Parents of the honor roll students were adamantly opposed to
the new curriculum because they wanted their children in traditional physics and
chemistry classes that were segregated by ability.
Greene agreed to chair a committee to oversee the curriculum's introduction.
She began by having open meetings in which everyone was invited. This proved
to be a disaster. The rebellious parents disrupted the meetings and argued others
down vociferously. (Many parents were from the local university.) Not much
productive discussion occurred. Greene set up a qualitative evaluation, which
her graduate students conducted. She helped in the actual data collection. The
rebellious parents wanted test scores as part of the evaluation study. Greene
refused.
In ensuing meetings the parents attacked the on-going evaluation study,
disguising their interests and opposition to the science curriculum as an attack
on the unscientific qualitative methodology of the evaluation. The authority of
the evaluation and the evaluator were compromised. Greene was accused of
siding with the faction that wanted the new curriculum, rather than being an
impartial evaluator. The enterprise dissolved into an embittered mess.
There are several things we can learn from this episode. First, the evaluator
must analyze the political situation at the beginning of the study, taking care to
note the major factions, how people line up on issues, and where trouble is likely
to erupt. Such a careful political analysis is what Chelimsky did in her chemical
warfare study. If the evaluator enters a highly politicized situation, some mapping of factions and defense strategies is required to negotiate through the political
minefield.
Analyzing the political situation involves learning about the history of the
program, including hidden controversies and potential controversies. Where are
the vested interests? Where are the hidden commitments disguised as something
else? In this case, the self-interests of the elitist parents would have been
apparent with some forethought, as well as how they might react to the program
and the study.
Given the politicized nature of the situation, it is not possible ordinarily to have
open forums where everyone speaks their mind. Open forums get out of hand
and result in negative consequences, as these did. The evaluator should remain
in control of the stakeholder interaction so that rational dialogue and delibera-
tion can prevail. This means structuring the interactions, possibly partitioning off
opposing factions from one another, which is not possible in open forums.
In this case I might have met with the different factions separately, honestly
trying to understand their perspectives. I would have taken the issues each
faction put forth seriously, making those issues part of the data collection. Most likely, I would have collected test scores, as the elite faction wanted, if they insisted, as well as collecting other kinds of data. The evaluator is in control of the study and can collect, analyze, and interpret test scores properly. It could well
be that test score analyses would show that the new program did not reduce the
scores of the good students, as the parents feared. By refusing the rebellious
faction's demand for tests, Greene diminished the credibility of the evaluation in
the eyes of this faction.
When disagreeing groups are able to deal rationally with issues, which was not the case here, I would bring them together to discuss the issues with other groups.
However, I would delay face-to-face confrontations with groups not ready to
discuss issues rationally. The evaluator should not take sides with any faction.
Rather the evaluator should work to structure dialogue and deliberation with
each group and among groups. In this case, Greene was seen as siding with one
faction, whatever her true sentiments.
Taking on multiple roles in the study was also a mistake, as Greene realized
later. She was chair of the curriculum introduction committee, head of the evalua-
tion data collection, and a data collector as well. Taking on these conflicting roles
diminished her authority and credibility. The evaluator should not operate on
the same level as everyone else. The evaluator role is special, and she must use
her authority and expertise to structure things.
Ryan & Johnson (2000) conducted a study investigating teaching evaluation in
a large university. Originally, evaluation of teaching was formative in this
university, aimed at improving the teaching skills of the faculty. The university
relied heavily on the type of student rating forms common in higher education
institutions. As the years went by, the student rating data came to be used for
promotion and tenure decisions. Faculty contended that they had no opportu-
nities for dialogue or deliberation on the evaluation of their teaching.
The evaluators involved other stakeholders, such as faculty, students, women, and minorities, in order to incorporate perspectives and interests in the evaluation of teaching beyond those of the provost. They tried several
ways of enlisting faculty participation. They sent mailed questionnaires to faculty
but received a poor return. They held focus groups with faculty around the campus
and established a faculty review committee. However, they were uncertain how
representative of the faculty the findings from these groups were. When open
meetings were held in departments, the presence of deans increased faculty
participation but tended to silence some faculty members.
One of the best opportunities for participation came from an on-line evalua-
tion discussion group the evaluators established. This on-line group engaged in
extended discussion of the relevant issues, though the interaction had to be
supplemented with face-to-face meetings when the issues became too complex to
be handled in brief emails. The evaluators reported that there were trade-offs
among these methods between depth and representation, and that the role of the
evaluator and the evaluation were changed significantly by using democratic
procedures.
MacNeil (2000) conducted an unusual study by employing democratic proce-
dures in an undemocratic institution, a psychiatric hospital. She evaluated a
program in which peers who had been successful in recovering provided assistance
to other patients. She began her study by interviewing multiple stakeholders.
Realizing the hierarchical nature of the institution, she constituted small forums
that were over-weighted with lower status stakeholders (patients) and provided
special protocol questions to elicit their views. The over-weighted forums
encouraged the patients to speak out.
To circumvent the fear that current employees had of saying anything critical,
she invited two former employees to express dissident views, views that were held
by current employees but which they were afraid to express. She also invited
members of the board of directors of the hospital to participate in the discussions
in order to hold the administrators accountable. One patient who was deemed
unsafe but who wanted to participate had his own forum, which included him
and two program staff used to dealing with him. Later, MacNeil wrote a report
and had feedback meetings with different groups. All in all, she used procedures
to circumvent the hierarchical nature of the institution.
A final example is more straightforward and closer to what most evaluators
might attempt. Torres et al. (2000) conducted an evaluation of a program in which the evaluators, program managers, and staff met frequently with each
other, thus developing trust and cooperation. Other stakeholders were not
involved, except that data were collected from them. In reflecting on the study
later, Torres said that several things were needed to make the study work well,
which it did.
One ingredient was commitment from the administrators. A second require-
ment was sufficient scheduled time for dialogue and deliberation to occur, which
the evaluators built in. The dialogue and deliberation tended to blur the lines
between the program and the evaluation at times. Torres thought it would have
been a good idea to set boundaries between the two at the beginning. In
retrospect, the study balanced depth in participation with breadth in inclusion.
All in all, evaluators have not had much experience dealing with inclusion,
dialogue, and deliberation. No doubt, they will become more competent at it
once new procedures are tried and tested. Traditional data collection and
analysis procedures are sophisticated now compared to those employed thirty
years ago. We might expect a similar learning curve for deliberative democratic
evaluators.
Some scholars raise the objection that the deliberative democratic approach
introduces too much politics into evaluation (Stake, 2000). In our view, there is
no evaluation without politics. We accept this as an empirical matter. In every
study we have conducted, considerations of what political forces are in play, who the major stakeholder groups are, and how the results are likely to be used have been major concerns for the evaluator. To ignore the politics of the evaluation and
stakeholders is to risk the success of the study. The evaluator should analyze and
adjust to political factors, which influence all aspects of an evaluation, including
its design, conduct, and implementation.
Evaluators sometimes pretend these political concerns are not relevant, while
at the same time taking account of them in the study - necessarily so. The image
of the ivory-tower scholar working on theoretical problems may never have been
true of social science research at any time. The history of the social sciences sug-
gests the image was not accurate (Ross, 1991). In any case, such isolation is definitely
not true of evaluation, which must grapple with the real world. We propose
facing up to the idea that political concerns are a part of evaluation studies. We
advocate examining such concerns analytically and critically, and dealing with them.
Some contend that deliberative democracy is not the only form of democracy
(Stake, 2000). We agree that deliberative democracy is not the only model of demo-
cracy. It is not even the dominant model. The "preferential" model of democracy is most common (Dryzek, 2000). The difference between the two is significant.
In the preferential model participants bring their preferences to the discussion
without any expectation they will change these preferences (views, values, and
interests) during the discussion and deliberation. Participants want to maximize
the satisfaction of their preferences. In our view, this attitude leads to strategic
gaming to obtain advantages, not to genuine discussion leading to joint solutions.
In the deliberative democratic approach the presumption is that participants
may change their preferences as a result of serious rational discussion and
deliberation. Such an expectation entails a different orientation to discussion,
deliberation, and resolution of issues. Jointly searching for correct answers is the
anticipated pathway, though we do not assume that consensus among all partici-
pants is always likely, possible, or desirable. Dialogue and deliberation inform
the conclusions of the study, but that does not mean that everyone agrees.
Some claim that we are proposing that evaluators be advocates for particular
groups in some way (Stake, 2000). The validity of this criticism depends on what
is meant by being an advocate. The evaluator is indeed advocating a deliberative
democratic approach to evaluation, just as we believe others are advocating
implicit or explicit approaches too, the most common being preferential demo-
cracy, even when they do not realize it.
However, the deliberative democratic evaluator is not advocating a particular
party or interest in the evaluation. In this sense of advocacy, the evaluator is not
an advocate. There are other approaches to evaluation that do call for evaluators
to be advocates for particular participants, such as for the client or for the
powerless. We appreciate their rationale, but we do not subscribe to that view.
Deliberative democratic evaluation advances a framework for inclusion,
discussion, and deliberation. Certainly, this framework is not value neutral, but
then no approach is value neutral, in our view. The question becomes, which
democratic (or non-democratic) framework is most defensible? We believe ours
has strong advantages.
Some scholars have raised the question of why the evaluator doesn't adopt a
position, such as favoring the poor, without relying on methods and procedures
(Kushner, personal communication, 2000). We draw a parallel here with collect-
ing and analyzing data so that evaluation findings will be as free of bias as
possible. For example, evaluators would not collect interview data from just a
few participants without checking such information against the views of others.
Evaluators would use systematic methods of sampling in collecting survey data
so the data would accurately reflect the population involved. The methods
reduce bias in the collection and analysis of data and in the resulting findings.
Similarly, when collecting and processing the perspectives, values, and
interests of participants for the purpose of arriving at valid conclusions, we need
procedures for collecting and processing such information so the conclusions will
be as free of bias as we can make them. Currently, we have the three principles
but do not have a set of methods. We have ideas about some methods that might
be employed. Traditional data collection and analysis procedures took decades
of development to reach their current level of sophistication. The same would be
true of procedures developed for deliberative democratic evaluation, which are
in addition to traditional methods, not in place of them.
Nunnelly (2001) presents, as an example, the difficulty of reaching any kind of agreement with a group composed of Hassidic Jews, radical lesbian separatists, Hayekian free-market libertarian economists, and Southern Baptist ministers. The
beliefs of these groups are so constitutive of their identity that the people are not
likely to offer up such beliefs to rational analysis, in Nunnelly's view. We agree
this would be a tough group to deal with.
However, in general, the fundamental differences among these groups are not the issues that evaluators face in normal practice. Our issues tend to be
more practical, though no doubt differences in religious beliefs can intrude on
practical matters. We can imagine debates over the public schools that involve
religious ideas, for example. However, we believe enough agreement exists in the
society so that the society maintains itself, and that this underlying social
agreement forms the basis for evaluation. Such agreement includes the notion of
dialogue and deliberation with others in the society.
We admit there might be issues too hot to handle, though we think well-
trained evaluators can do more than they think they can with conflicting beliefs.
We should remember that the ultimate object is not to get diverse groups to
agree with each other but to inform the evaluator and the evaluation so that valid
conclusions can be reached after all relevant views have been presented and
discussed. It may be that no group agrees with the conclusions of the study.
Evaluation is a weighing mechanism, not a voting mechanism.
Stufflebeam then goes on to pose a critical challenge: "In view of the very
ambitious demands of the deliberative democratic approach, House and Howe
have proposed it as an ideal to be kept in mind even if evaluators will seldom, if
ever, be able to achieve it" (Stufflebeam, 2001, p. 71). As of now, we don't know
the degree to which deliberative democratic evaluation is practically attainable,
however philosophically sophisticated it might be. What has been tried so far
seems encouraging. We also recognize that there might be other, better evalua-
tion practices that emerge from our analysis of facts and values than what we
have enumerated. We are better theoreticians than practitioners, and it is likely
to be the gifted practitioners who invent better ways to do things.
APPENDIX A
October 2000
Principle 1: Inclusion
The evaluation study should consider the interests, values, and views of major
stakeholders involved in the program or policy under review. This does not
mean that every interest, value, or view need be given equal weight, only that
all relevant ones should be considered in the design and conduct of the
evaluation.
Principle 2: Dialogue
Principle 3: Deliberation
REFERENCES
Chelimsky, E. (1998). The role of experience in formulating theories of evaluation practice. American
Journal of Evaluation, 19, 35-55.
Dryzek, J.S. (2000). Deliberative democracy and beyond. Oxford, UK: Oxford University Press.
Elster, J. (Ed.). (1998). Deliberative democracy. Cambridge, UK: Cambridge University Press.
Greene, J.C. (1997). Evaluation as advocacy. Evaluation Practice, 18, 25-35.
Greene, J.C. (2000). Challenges in practicing deliberative democratic evaluation. In Ryan, K.E., &
DeStefano, L. (Eds.) Evaluation as a democratic process: Promoting inclusion, dialogue and
deliberation. New Directions for Evaluation, 85, 13-26.
Gutmann, A., & Thompson, D. (1996). Democracy and disagreement. Cambridge, MA: Belknap Press.
House, E.R., & Howe, K.R. (1999). Values in evaluation and social research. Thousand Oaks, CA:
Sage.
Karlsson, O. (1996). A critical dialogue in evaluation: How can interaction between evaluation and
politics be tackled? Evaluation, 2, 405-416.
Madison, A., & Martinez, V. (1994, November). Client participation in health planning and evaluation:
An empowerment evaluation strategy. Paper presented at the annual meeting of the American
Evaluation Association, Boston.
MacNeil, C. (2000). Surfacing the Realpolitik: Democratic evaluation in an antidemocratic climate.
In Ryan, K.E., & DeStefano, L. (Eds.), Evaluation as a democratic process: Promoting inclusion,
dialogue and deliberation. New Directions for Evaluation, 85, 51-62.
Nunnelly, R. (2001, May). The world of evaluation in an imperfect democracy. Paper presented at the
Minnesota Evaluation Conference, St. Paul.
Ryan, K.E., & DeStefano, L. (Eds.). (2000). Evaluation as a democratic process: Promoting
inclusion, dialogue and deliberation. New Directions for Evaluation, 85.
Ryan, K.E., & Johnson, T.D. (2000). Democratizing evaluation: Meanings and methods from
practice. In Ryan, K.E., & DeStefano, L. (Eds.), Evaluation as a democratic process: Promoting
inclusion, dialogue and deliberation. New Directions for Evaluation, 85, 39-50.
Ross, D. (1991). The origins of American social science. Cambridge: Cambridge University Press.
Scriven, M. (1972). Objectivity and subjectivity in educational research. In L.G. Thomas (Ed.),
Philosophical redirection of educational research (pp. 94-142). Chicago: National Society for the
Study of Education.
Scriven, M. (1973). Goal-free evaluation. In E.R. House (Ed.), School evaluation. Berkeley, CA:
McCutchan Publishing.
Searle, J.R. (1995). The construction of social reality. New York: Free Press.
Stake, R.E. (2000). A modest commitment to the promotion of democracy. In Ryan, K.E., &
DeStefano, L. (Eds.), Evaluation as a democratic process: Promoting inclusion, dialogue and
deliberation. New Directions for Evaluation, 85, 97-106.
Stufflebeam, D.L. (2001). Evaluation models. New Directions for Evaluation, 89.
Torres, R.T., Stone, S.P., Butkus, D.L., Hook, B.B., Casey, J., & Arens, S.A. (2000). Dialogue and
reflection in a collaborative evaluation: Stakeholder and evaluator voices. In Ryan, K.E. &
DeStefano, L. (Eds.). Evaluation as a democratic process: Promoting inclusion, dialogue and
deliberation. New Directions for Evaluation, 85, 27-39.
Section 2
Evaluation Methodology
Introduction
RICHARD M. WOLF
Columbia University, Teachers College, New York, USA
In 1965, the legislative branch of the United States government passed the
Elementary and Secondary Education Act. This piece of legislation provided
considerable sums of money to schools to improve educational services, espe-
cially to poor children. Along the way to passage of this landmark legislation,
Senator Robert Kennedy of New York proposed an amendment that required
annual evaluation of programs funded under this act. The amendment was
quickly adopted since it made good sense to the legislators that if the govern-
ment was going to spend huge sums of money on something, it had a right, if not
a duty, to determine whether the money being spent produced the desired effect.
In fact, this "evaluation requirement" quickly spread to all kinds of legislation in
other areas such as social service programs, health programs, etc.
While the evaluation requirement made good political and educational sense,
it caught the educational community completely by surprise. Few educators had
any idea of what it meant to evaluate a program and even fewer were prepared
to carry out an evaluation. However, since large sums of money were involved,
there was a strong incentive for people to conduct evaluation studies. In fact, there was a rush not only of people in education but also of people from sociology and psychology to take on evaluation work, because there were strong economic incentives to do so.
What passed for evaluation in those early days of the Elementary and Secondary Education Act was sometimes laughable and sometimes distressing. It was soon
recognized that a better notion of what evaluation was and was not was needed.
Consequently, over the next ten to twenty years, writers came up with an almost
dizzying array of theories and models of evaluation and methods for conducting
evaluation studies. The various approaches to evaluation proposed by writers
tended to reflect their background and training. Thus, sociologists tended to
propose theories and models that emphasized sociological phenomena and
concerns while psychologists proposed theories and models that emphasized
psychological concerns. Educators generally tended to focus on educational
concerns but these could range from administrative considerations to curricular
and instructional concerns.
The ferment in the field of evaluation was rather confusing, but it reflected a
genuine concern to come to terms with what evaluation was and how it should
be carried out. Over the past ten years, there has been some settling down in the
field of evaluation although there is still a wide array of approaches to choose
from. How much consolidation will occur in the future is impossible to predict.
Accompanying the proliferation of theories and models of evaluation has been
an enormous expansion in the methods of evaluation. In fact, the array of
methods and techniques may be even greater than the number of theories and
models of evaluation. This section presents several contributions in the area of
evaluation methodology. The nature and variety of these contributions would
have been unthinkable a generation ago. At that time, soon after the legislation
containing an evaluation requirement was passed, the general bent in the field of
educational research was toward almost exclusive reliance on quantitative studies
involving either true or quasi-experiments. The reason for this was the large
impact of an article by Donald Campbell and Julian Stanley in the Handbook of
Research on Teaching (1963), edited by N. Gage. That article made a strong case
for reliance on experimental studies to establish causality in studies of educa-
tional treatments. Boruch's chapter in this section is in the tradition of the work
of Campbell and Stanley, arguing for the use of randomized field trials in evalua-
tion studies that seek to establish causal relationships between educational
programs and student outcomes.
There were several reasons why experimental studies involving randomized
field trials did not remain the premier method for evaluating educational programs.
First, legislation often required that all individuals who were eligible for a parti-
cular program had to receive it. This obviously knocked out the use of control
groups since this would have meant having to withhold treatment from eligible
individuals. Second, randomized field trials cannot be equated with laboratory
experiments in many respects. Even Campbell and Stanley recognized the prob-
lems inherent with experimental studies and identified a number of threats to
internal validity that could arise during the planning or conduct of an experiment.
Since then a number of other such threats have been identified. Two will be men-
tioned here to give the reader a sense of how even carefully planned experiments
can go awry. Resentful demoralization is a threat to internal validity that can occur
when individuals who are in the control group learn that they are not receiving
the experimental treatment and become resentful and eventually demoralized
and, consequently, do not put forth the level of effort needed to succeed. This
can result in the difference between the experimental and control group being
greater than it really should be. This is not due to any superiority in the experi-
mental treatment but rather to a depressed level of success in the control group.
A second threat to internal validity is often referred to as the John Henry
effect. It is named after the mythical railroad worker who battled a machine that
put spikes into railroad ties. John Henry performed heroically, but died of exhaus-
tion after the race against the machine. In the John Henry effect, individuals who
are not receiving the experimental treatment and realize this put forth extra
effort to succeed. This can result in an unnaturally elevated level of performance
in the control group, thus reducing the difference between experimental and
control groups.
Resentful demoralization and the John Henry effect are exact opposites of
one another. Each can threaten the internal validity of a randomized field trial.
One has no way of knowing in advance which, if any, of these threats might occur
in a study. Only careful monitoring by investigators conducting randomized field
trials can identify such threats. How they can or should be dealt with remains an
open issue in evaluation work.
The purpose of the above discussion was to identify just two of the many
problems that can arise in randomized field trials. This does not mean that
randomized field trials should not be undertaken. They should, when possible.
Rather, it is to point out that randomized field trials, which once held sway in the
field of evaluation, are no longer regarded as the only or even the main approach
to evaluating educational programs.
As the concept of evaluation has expanded over the past thirty-five years, so
has the number of methods and techniques. The chapter by Levin and McEwan
is a case in point. In the early years of educational evaluation virtually no atten-
tion was paid to costs. It was a gross oversight. As the costs of education have
soared over the past half-century, costs have become a central issue in educational
planning and evaluation. Levin and McEwan have adapted a considerable amount
of material from economics and fashioned it into a useful tool of evaluation.
They have eschewed cost-benefit analysis for a far more salient approach called
cost-effectiveness analysis. Their chapter serves as a useful primer to this
important but neglected topic. They provide a useful set of references, including
one to their latest book, which should be of great use to evaluation workers who
want to incorporate cost information into an evaluation.
The chapters by Eisner and Mabry reflect the expansion of evaluation
approaches and methods of recent years. Both chapters emphasize the use of
qualitative methods in the planning and conduct of an evaluation. As recently as
two decades ago, such approaches would not even have received a hearing in the
field of evaluation. They do now. Qualitative evaluation approaches have come
of age due to the strong efforts of people such as Eisner and Mabry. To be fair,
Eisner has been advocating his qualitative approach to evaluation for several
decades. However, it has been only recently that such approaches have gained a
measure of acceptance. Eisner began his career in the field of art and art
education where qualitative judgments were the only way to appraise the work
of novices. He expanded his early notions of evaluation into their present form of "connoisseurship evaluation." Such approaches are clearly highly appropriate forms of evaluation for some types of programs. How these ideas apply in general to the
entire field of evaluation is a matter that is still being considered.
Mabry's chapter on qualitative methods goes further than much of the work in
this area. Mabry exposes the notions of qualitative evaluation to traditional research
concepts of validity, reliability and objectivity. She shows how qualitative methods
can meet rigorous scientific criteria. This stands in marked contrast to some
writers who advocate the use of qualitative methods in a purely subjective way.
Taken together, the chapters in this section provide a sampling of methods
used in evaluation studies. The range of methods reflects the expansion in
theories and models of evaluation that have evolved over the past thirty-five
years. The references at the end of each chapter allow the reader to pursue each
method of interest.
6
Randomized Field Trials in Education¹
ROBERT F. BORUCH
University of Pennsylvania, Philadelphia, USA
In the simplest randomized field trial (RFT), individuals are randomly assigned
to one of two or more groups, each group being given a different educational
intervention that purports to improve the achievement level of children. The
groups so composed do not differ systematically. Roughly speaking, the groups
are equivalent.
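To make the mechanics concrete, the following is a minimal illustrative sketch, not drawn from the chapter itself; the student identifiers, the seed, and the group sizes are all hypothetical.

```python
# A sketch of simple random assignment: a fixed pool of participants is
# shuffled and split into two equal groups that differ only by chance.
import random

students = [f"student_{i:03d}" for i in range(1, 41)]  # 40 hypothetical participants

rng = random.Random(2024)      # fixed seed so the assignment can be audited and reproduced
shuffled = list(students)
rng.shuffle(shuffled)

treatment, control = shuffled[:20], shuffled[20:]
print(len(treatment), len(control))   # 20 20: two groups with no systematic difference
```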
The first of two principal benefits of randomized field trials is that they permit
fair comparison. Estimates of the relative differences in outcome among the
regimens that are compared will be statistically unbiased. That is, the estimates
will not be tangled with competing explanations of what caused the difference in
the observed outcome. This is because the groups being compared will not differ
systematically before the education interventions are employed.
The second benefit of a randomized field trial is that it permits the researcher
to make a statistical statement of the researcher's confidence in the results. That
is, the trial's results are subject to ordinary variability in institutional and human
behavior. Any given result can come about on account of chance, so it is impor-
tant to take chance into account on scientific grounds. This variability can be
taken into account using conventional statistical methods.
In large-scale RFTs, entire institutions or jurisdictions may be randomly
assigned to different regimens. For instance, a sample of 20 schools might be split
randomly in half, one group being assigned to a new teacher development system
and the remaining group using conventional systems. This is in the interest of
estimating the new system's effect on the achievement of teachers and their
students.
In more complex randomized trials, a sample of individuals or institutions or
both may be matched first, then randomly assigned to interventions. Matching
and other strategies usually enhance the experiment's statistical power beyond
randomization. That is, small treatment effects are rendered more detectable in
a trial that employs matching, or blocking, or other statistical tactics. Regardless
of these tactics, the randomization helps to assure that unknown influences on
behavior are equalized across the intervention groups.
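As an illustration of matching followed by randomization, the sketch below pairs hypothetical units on a baseline score and then randomizes within each pair; every name and number is invented.

```python
# A sketch of matched-pair randomization: units are sorted on a baseline
# covariate (e.g., a pretest score), adjacent units form pairs, and a coin
# flip within each pair decides who receives the new regimen.
import random

rng = random.Random(7)

# Hypothetical (unit_id, pretest_score) records; in practice these come from data.
units = [("school_%02d" % i, rng.gauss(50, 10)) for i in range(1, 21)]
units.sort(key=lambda u: u[1])          # order by baseline score

treatment, control = [], []
for a, b in zip(units[0::2], units[1::2]):  # adjacent units form matched pairs
    if rng.random() < 0.5:                  # randomize within each pair
        treatment.append(a[0]); control.append(b[0])
    else:
        treatment.append(b[0]); control.append(a[0])

print(len(treatment), len(control))  # 10 10, balanced on the pretest by design
```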
As Nave, Miech, and Mosteller (2000) point out, randomized trials are not yet
common in education. Their rarity makes them all the more valuable. And
identifying good trials is also important.
Randomized trials have been carried out at times at different levels of education and in different countries. Education is also an important component of the programs that have been tested in trials of some training and employment initiatives.
At times, randomized trials have been used to evaluate programs that are
directed toward youth and adults who are at risk of chronic low education and
unemployment. The programs often involve education components, to enhance
reading and arithmetic skills, for example. The Rockefeller Foundation, for
instance, has supported randomized controlled field tests of integrated
education and employment programs under its Female Single Parent Program.
The objective was to understand whether a program involving a constellation of
childcare, job skills training, general education, and support would enhance the
economic well-being of single mothers with low education and low employment
skills (Cottingham, 1991). These trials involved randomly assigning eligible women
to either the new program or to a control group whose members had access to
other training, employment, and services that were generally available in the
community. More recent Foundation-sponsored trials direct attention to
housing projects that are randomly assigned to different interventions which
include educational components (Kato & Riccio, 2001).
Related large-scale examples include multi-state evaluations of welfare
initiatives (Gueron & Pauly, 1991) and regional tests of new programs designed
to retrain and employ workers displaced from their jobs by technology and
competition (Bloom, 1990). These studies involve a substantial effort to generate
good evidence for labor policy and administration. Welfare-to-work trials in
Canada include education and training components that are tied to employment.
See publications by the Social Research and Demonstration Corporation (2001).
In juvenile criminal justice, some of the studies concern the effects of education
and training. Dennis's (1988) dissertation updated Farrington's (1983) exami-
nation of the rationale, conduct, and results of such randomized experiments in
Europe and North America. The range of intervention programs whose effec-
tiveness has been evaluated in these trials is remarkable. They have included
juvenile diversion and family systems intervention, probation rules, work-release programs for prisoners, and sanctions that involve community service rather than
incarceration. Petrosino et al.'s (Petrosino, Boruch, Rounding, McDonald, &
Chalmers, 2000; Petrosino, Turpin-Petrosino, & Finkenauer, 2000) systematic
review of Scared Straight programs in the U.S. and other countries is an
important addition to the literature on educating youth about crime; it reports
negative effects of the program despite good intentions.
One object of research on abused and neglected children is to enhance the posi-
tive behaviors of withdrawn, maltreated preschool children in the context of
preschool, school, and other settings. For example, Fantuzzo et al. (1988) employed
a randomized trial to evaluate the effectiveness of a new approach that involved
training highly interactive and resilient preschoolers to play with their
maltreated peers. The program involved these children in initiating activity and
sharing during play periods with children who had been withdrawn and timid. In
Philadelphia, the authors compared the withdrawn children who were randomly
assigned to this approach against children who had been randomly assigned to
specialized adult-based activity.
A different stream of controlled experiments has been undertaken mainly to
understand how to prevent out-of-home placement of neglected and abused
children. In Illinois, for instance, the studies involve randomly assigning children
at risk of foster care to either the conventional placement route, which includes
foster care, or a special Family First program, which leaves the child with the
parents but provides intensive services from counselors and family caseworkers.
Related research has been undertaken by other agencies involved in childcare in
California, Utah, Washington, New York, New Jersey, and Michigan. Schuerman,
Rzepnicki, and Littell (1994), who executed the Illinois trial, found that the
program was actually targeted at low-risk, rather than high-risk, families. This
virtually guaranteed that no difference between the foster care group and the family
preservation group would be discerned.
The Federal Judicial Center (FJC), the research arm of the federal courts in the
United States, laid out simple conditions that help determine when a
randomized trial is ethical. These FJC conditions have no legal standing. But
they are useful in understanding when randomized trials meet reasonable social
standards for ethical propriety.
The FJC's threshold conditions for deciding whether a randomized trial ought
to be considered involve addressing the following questions:
The word "population" refers to people or institutions about which one would
like to make inferences. In the Tennessee class size study, for instance, the object
was to make general statements about whether reduced class sizes worked in the
target population of relevant children, teachers, and schools in Tennessee. See
Mosteller (1995) and Finn & Achilles (1990). In such trials, the target population
is confined to all children or people or entities who were eligible to receive an
intervention. The Tennessee study, for example, randomly assigned children and
teachers to different class sizes in early grades of the school, and only in schools
whose classes were large.
Most field experiments randomly allocate individuals to alternative interventions. Individuals are then technically regarded as the units of random allocation and analysis. Less frequently, institutions or other entities are allocated randomly to different education regimens. Entities are then the units of analysis. The statistical requirement is that the units are chosen so as to be independent of one another; each responds in a way that is not influenced by any other unit in the study.
For instance, children in a classroom might be randomly allocated to different
tutoring programs. They are not independent because they talk to one another.
Indeed, their talking to one another is part of any tutoring intervention. So, the
trialist may then design a study in which entire classrooms are allocated to one
tutoring program or another, or design trials in which entire schools are ran-
domly allocated to one or another program. All this is in the interest of generating
a statistically defensible estimate of the effect of the programs. For reviews of
randomized experiments in which entities are the units of allocation and
statistical analysis, see Boruch and Foley (2000), Murray (1998), and Donner and
Klar (2000).
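A hypothetical sketch of this design choice follows: classrooms, not children, are shuffled into the two tutoring programs, and the classroom mean serves as the unit of analysis. All identifiers and outcome values are invented.

```python
# A sketch of cluster (group) randomization: entire classrooms are the units
# of allocation, and the analysis compares classroom-level means.
import random
import statistics

rng = random.Random(11)
classrooms = [f"class_{i:02d}" for i in range(1, 13)]  # 12 hypothetical classrooms

rng.shuffle(classrooms)
program_a, program_b = classrooms[:6], classrooms[6:]

# Hypothetical classroom-mean outcomes; the classroom mean is the unit of analysis.
outcomes = {c: rng.gauss(70, 5) for c in classrooms}
mean_a = statistics.mean(outcomes[c] for c in program_a)
mean_b = statistics.mean(outcomes[c] for c in program_b)
print(round(mean_a - mean_b, 2))  # estimated program difference at the cluster level
```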
Eligibility and exclusionary criteria define the target population. And this in turn helps to characterize the experiment's generalizability (see the section on
analysis). The criteria also influence the statistical power of the experiment by
producing a heterogeneous or homogeneous sample and a restriction of sample
size. Statistical power refers to the trial's capacity to detect treatment differ-
ences. Good statisticians and education researchers calculate statistical power to
assure that sample sizes are adequate. The main rules of thumb in assuring
statistical power are: (a) do a statistical power analysis, (b) match on everything
possible prior to randomization and then randomize, (c) get as many units as
possible, and (d) collect covariate/background data and time series data to
increase the trial's power (see Lipsey, 1990).
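As one illustration of rule (a), the sketch below computes the standard closed-form approximation for the per-group sample size needed to detect a standardized mean difference d in a two-group trial. It assumes normal outcomes, equal variances, and equal group sizes, and is not taken from Boruch's chapter.

```python
# A sketch of a statistical power analysis: per-group n for a two-sided
# two-sample comparison of means at significance level alpha and desired power.
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = norm.ppf(power)           # quantile corresponding to desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(0.5))   # about 63 per group for a moderate standardized effect
print(n_per_group(0.2))   # about 393 per group for a small effect
```

Small effects, in other words, demand much larger samples, which is one reason rules (b) and (c) emphasize matching and recruiting as many units as possible.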
Interventions
Random Assignment
Management
At least three features of the management of RFTs are important. The first involves identifying and recruiting partners. A second important feature is the formation of advisory groups. Third, and most obviously, experiments depend on good people who can plan and manage the tasks that such trials engender.
No textbooks on any of these topics exist in education literature. However,
fine descriptions appear, at times, in reports issued by experiment teams. See, for
instance, Dolittle and Traeger (1990) on the Job Training Partnership Act study.
Advice based on the experience of able managers of evaluation research can be
found in Hedrick, Bickman, and Rog (1993), who for instance, counsel researchers
about how to think about resources, data, time, personnel, and money in
planning applied social research and assuring that it is done well. Their advice is
pertinent to running RFTs.
"Partners" includes people who have associated with the schools or other
education entities, including parents, and whose cooperation is essential for exe-
cuting the study. "Partnerships" refers to the constellation of agreements, incen-
tives, and mutual trust that must be developed to run a high quality. In many
trials, the partners, sometimes called stakeholders, are represented in the trials
118 Boruch
official advisory groups. They contribute to understanding incentives and
disincentives, the local context and culture, and to informing other aspects of the
trial's management.
Understanding what tasks need to be done, by whom, when, and how is basic
to management in this arena as in others. The tasks usually fall to the trial's
sponsors, the trial design team, and advisory group members. Their jobs are to
clarify the role of each and to develop partnerships needed to generate high quality
evidence. Part of the trialists' responsibilities includes scouting and recruiting sites for the trial, for not all sites will be appropriate, on account of small sample size within the site, for instance, or the site's unwillingness to cooperate in a trial.
The tasks include outreach to identify and screen individuals who are eligible to
participate in the trial.
Contact with the individuals who have agreed to participate in the trial must
be maintained over time, of course. This requires forward and backward tracing
and contact maintenance methods of the kinds used in the High/Scope Perry
Preschool Project over a 20-year period (Schweinhart, Barnes, & Weikart, 1993),
among others. This must often be coupled to related efforts to capitalize on
administrative record systems.
Treatments must be randomly allocated, and so, as suggested earlier, randomization must be handled so as to insulate it from staff who are responsible for delivering the service or intervention. And, of course, management requires attention to treatment delivery. The management burden is usually low for randomized control conditions because the people who are ordinarily responsible for the education services often continue to take responsibility. The burden is usually higher for new interventions that are tested against one another or against a control condition. New health risk reduction programs, for instance, require retraining teachers or hiring and supervising new teachers, scheduling such training and, more important, scheduling such education within a complex timetable of courses and classes, assuring that the new intervention is supported by stakeholders, and so on.
Analysis
In randomized field trials, at least four classes of analysis are essential: quality
assurance, core analysis, subgroup and other internal analyses, and generaliz-
ability or concordance analyses. Only the core analysis usually appears in
published journal articles. The others are crucial but are typically relegated to reports.
Assuring quality depends on information about which interventions were
assigned to whom and on which treatments were actually received by whom, and
analyses of the departures from the random assignment. Quality assurance also
entails examination of baseline (pretest) data to establish that indeed the ran-
domized groups do not differ systematically prior to treatment. Presenting tables
on the matter in final reports is typical. Quality assurance may also include side
studies on the quality of measurement and on preliminary core analysis. See, for
example, Schuerman et al. (1994) on Families First programs.
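A hypothetical sketch of one such baseline check follows: comparing pretest scores across the randomized groups, where a large p-value is consistent with (though it does not prove) baseline equivalence. The scores below are invented, and a real check would tabulate many covariates.

```python
# A sketch of a quality-assurance baseline check: test whether the randomized
# groups differ systematically on a pretest measure before treatment.
from scipy import stats

pretest_treatment = [52, 48, 55, 61, 47, 50, 58, 49, 53, 56]  # hypothetical scores
pretest_control   = [51, 49, 54, 60, 45, 52, 57, 50, 55, 53]

t, p = stats.ttest_ind(pretest_treatment, pretest_control)
print(f"baseline difference: t = {t:.2f}, p = {p:.2f}")  # a large p suggests balance
```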
The phrase "core analysis" refers to the basic comparisons among interven-
tions that were planned prior to the randomized trial. The fundamental rule
underlying the core analysis is to "analyze them as you have randomized them."
That is, the individuals or entities that were randomly assigned to each interven-
tion are compared regardless of which intervention was actually received. This
rule is justified on the basis of the statistical theory that underlies formal tests of
hypotheses and the logic of comparing equivalent groups. The rule also has a
policy justification under real field conditions. Comparing randomly assigned
groups regardless of actual treatment delivered recognizes a reality in medical and clinical trials (e.g., Friedman et al., 1985) as in the social and behavioral sciences (Riecken et al., 1974). The products of the core analysis are
estimates of the relative difference in the outcomes of the treatments and a
statistical statement of confidence in the result, based on randomized groups.
Policy people and scientists usually want analyses that are deeper than a core
analysis that compares the outcomes of one intervention to the outcomes of
another. Comparisons that are planned in advance can, for example, show that
the effect of an intervention is larger for some subgroups (e.g., ethnic minority children) than for others, as in the Tennessee class size trial. So-called "internal
analyses" that are correlational cannot depend on the randomization to make
unequivocal statements. However, these analyses can produce hypotheses that
can be explored more deeply in future studies.
Further, one must expect "no difference" findings because they often appear.
Post facto analyses of why no difference appears are fraught with uncertainty but
are nonetheless important to building hypotheses. A finding of no differences in
outcomes may be a consequence of using treatments that were far less different
from one another than the researcher anticipated. It may be caused by inade-
quate sample size, or unreliable or invalid measures of the outcomes for each
group. Yeaton and Sechrest (1986, 1987) and Julnes and Mohr (1989) provided good advice; St. Pierre et al. (1998) have generated an excellent case study.
A final class of analysis directs attention to how the current trial's results relate
to other similar studies and other populations to which one might wish to
generalize. Determining how a given study fits into the larger scientific literature
on related studies is often difficult. One approach lies in disciplined meta-analyses.
That is, the researcher does a conscientious accounting for each study of who or
what was the target (eligibility for treatments, target samples and population),
what variables were measured and how, the character of the treatments and
control conditions, how the specific experiment was designed, and so on. For
example, the U.S. General Accounting Office (1994) formalized such an approach
to understand the relative effectiveness of mastectomy and lumpectomy on
5-year survival rates of breast cancer victims. See Lipsey (1992, 1993) in the
juvenile delinquency arena, Light and Pillemer (1984) in education, and Cordray
and Fischer (1994) and the U.S. General Accounting Office (1992, 1994) on the general topic of synthesizing the results of studies. Each contains useful guidance.
Reporting
The topic of reporting on the results of randomized trials has had an uneven
history. The medical sciences led the way in developing standards for reporting
(e.g., Chalmers et al., 1981). In education research, no standards for reporting have been promulgated officially, but see Boruch (1997) for guidelines based on
good practice. The fundamental questions that should be addressed in reports on
a trial in any sector are as follows:
CONCLUDING REMARKS
ENDNOTE
1 This paper is abbreviated and modified from Boruch (1998) and depends heavily on Boruch (1997).
The Mosteller and Boruch (2001) book contains papers by other experts on specific aspects of
randomized trials.
REFERENCES
Barnett, W.S. (1985). Benefit-cost analysis of the Perry Preschool program and its long-term effects.
Educational Evaluation and Policy Analysis, 7, 333-342.
Bloom, H.S. (1990). Back to work: Testing reemployment services for displaced workers. Kalamazoo,
MI: W.E. Upjohn Institute for Employment Research.
Boruch, R.F. (1994). The future of controlled experiments: A briefing. Evaluation Practice, 15, 265-274.
Boruch, R.F. (1997). Randomized controlled experiments for planning and evaluation: A practical guide.
Thousand Oaks, CA: Sage.
Boruch, R.F. (1998). Randomized controlled experiments for evaluation and planning. In L. Bickman
& D. Rog (Eds.), Handbook of applied social research methods (pp. 161-191). Thousand Oaks, CA:
Sage.
Boruch, R.F., & Foley, E. (2000). The honestly experimental society: Sites and other entities as the
units of allocation and analysis in randomized experiments. In L. Bickman (Ed.), Validity and social
experimentation: Donald T. Campbell's legacy (pp. 193-238). Thousand Oaks, CA: Sage.
Burghardt, J., & Gordon, A. (1990). More jobs and higher pay: How an integrated program compares
with traditional programs. New York: Rockefeller Foundation.
Campbell, D.T., & Stanley, J.C. (1966). Experimental and quasi-experimental designs for research.
Chicago: Rand McNally.
Chalmers, T.C., Smith, H., Blackburn, B., Silverman, B., Schroeder, B., Reitman, D., & Ambroz, A.
(1981). A method for assessing the quality of a randomized controlled trial. Controlled Clinical
Trials, 2(1), 31-50.
Cochran, W.G. (1983). Planning and analysis of observational studies (L.E. Moses & F. Mosteller,
Eds.). New York: John Wiley.
Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis issues for field
settings. Chicago: Rand McNally.
Cordray, D.S., & Fischer, RL. (1994). Synthesizing evaluation findings. In J.S. Wholey, H.H. Hatry,
& K. Newcomer (Eds.), Handbook of practical program evaluation (pp. 198-231). San Francisco:
Jossey-Bass.
Cottingham, P.H. (1991). Unexpected lessons: Evaluation of job-training programs for single
mothers. In RS. Turpin, & J .M. Sinacore (Eds.) Multisite Evaluations. New Directions for Program
Evaluation, 50, (pp. 59-70).
Crain, R.L., Heebner, A.L., & Si, Y. (1992). The effectiveness of New York City's career magnet schools:
An evaluation of ninth grade performance using an experimental design. Berkeley, CA: National
Center for Research in Vocational Education.
Dennis, M.L. (1988). Implementing randomized field experiments: An analysis of criminal and civil
justice research. Unpublished Ph.D. dissertation, Northwestern University, Department of Psychology.
Dolittle, F., & Traeger, L. (1990). Implementing the national JTPA study. New York: Manpower
Demonstration Research Corporation.
Donner, A., & Klar, N. (2000). Design and analysis of cluster randomization trials in health research.
New York: Oxford University Press.
Dynarski, M., Gleason, P., Rangarajan, A, & Wood, R (1995). Impacts of dropout prevention
programs. Princeton, NJ: Mathematica Policy Research.
Ellickson, P.L., & Bell, RM. (1990). Drug prevention in junior high: A multi-site longitudinal test.
Science, 247, 1299-1306.
Fantuzzo, J.F., Jurecic, L., Stovall, A., Hightower, A.D., Goins, C., & Schachtel, K.A. (1988). Effects
of adult and peer social initiations on the social behavior of withdrawn, maltreated preschool
children. Journal of Consulting and Clinical Psychology, 56(1), 34-39.
Farrington, D.P. (1983). Randomized experiments on crime and justice. Crime and Justice: Annual
Review of Research, 4, 257-308.
Federal Judicial Center. (1983). Social experimentation and the law. Washington, DC: Author.
Finn, J.D., & Achilles, C.M. (1990). Answers and questions about class size: A statewide experiment.
American Educational Research Journal, 27, 557-576.
Friedman, L.M., Furberg, CD., & DeMets, D.L. (1985). Fundamentals of clinical trials. Boston: John
Wright.
Fuchs, D., Fuchs, L.S., Mathes, P.G., & Simmons, D.C. (1997). Peer-assisted learning strategies:
Making classrooms more responsive to diversity. American Educational Research Journal, 34(1),
174-206.
Gramlich, E.M. (1990). Guide to cost benefit analysis. Englewood Cliffs, New Jersey: Prentice Hall.
Granger, R.C., & Cytron, R. (1999). Teenage parent programs. Evaluation Review, 23(2),
107-145.
Gueron, J.M., & Pauly, E. (1991). From welfare to work. New York: Russell Sage Foundation.
Hedrick, T.E., Bickman, L., & Rog, D. (1993). Applied research design: A practical guide. Newbury
Park, CA: Sage.
Howell, W.G., Wolf, P.J., Peterson, P.E., & Campbell, D.E. (2001). Vouchers in New York, Dayton,
and D.C. Education Matters, 1(2), 46-54.
Julnes, G., & Mohr, L.B. (1989). Analysis of no difference findings in evaluation research. Evaluation
Review, 13, 628-655.
Kato, L.Y., & Riccio, J.A. (2001). Building new partnerships for employment: Collaboration among
agencies and public housing residents in the Jobs Plus demonstration. New York: Manpower
Demonstration Research Corporation.
Light, R.J., & Pillemer, D.B. (1984). Summing up: The science of reviewing research. Cambridge, MA:
Harvard University Press.
Lipsey, M.W. (1990). Design sensitivity: Statistical power for experimental design. Newbury Park, CA: Sage.
Lipsey, M.W. (1992). Juvenile delinquency treatment: A meta-analytic inquiry into the variability of
effects. In T.D. Cook, H.M. Cooper, D.S. Cordray, H. Hartmann, L.V. Hedges, R.J. Light,
T. Louis, & F. Mosteller, (Eds.), Meta-analysis for explanation: A casebook (pp. 83-127). New York:
Russell Sage Foundation.
Lipsey, M.W. (1993). Theory as method: Small theories of treatments. In L.B. Sechrest & A.G. Scott
(Eds.), Understanding causes and generalizing about them (pp. 5-38). San Francisco: Jossey-Bass.
Mosteller, F. (1986). Errors: Nonsampling errors. In W.H. Kruskal & J.M. Tanur (Eds.), International
Encyclopedia of Statistics (Vol. 1, pp. 208-229). New York: Free Press.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The Future of
Children, 5, 113-127.
Mosteller, F., & Boruch, RF. (2001). Evidence matters: Randomized trials in education research.
Washington, DC: Brookings Institution Press.
Mosteller, F., Light, R.J., & Sachs, J. (1995). Sustained inquiry in education: Lessons from ability
grouping and class size. Cambridge, MA: Harvard University, Center for Evaluation of the
Program on Initiatives for Children.
Murray, D.M. (1998) Design and analysis of group randomized trials. New York: Oxford University
Press.
Myers, D., & Schirm, A. (1999). The impacts of Upward Bound: Final report of phase 1 of the national
evaluation. Princeton, NJ: Mathematica Policy Research.
Myers, D., Peterson, P., Mayer, D., Chou, J., & Howell, W. (2000). School choice in New York City after
two years: An evaluation of the School Choice Scholarships Program: Interim report. Princeton, NJ:
Mathematica Policy Research.
Nave, B., Miech, E.J., & Mosteller, F. (2000). The role of field trials in evaluating school practices: A
rare design. In D.L. Stufflebeam, G.F. Madaus, & T. Kellaghan (Eds.), Evaluation models:
Viewpoints on educational and human services evaluation (2nd ed.) (pp. 145-161). Boston, MA:
Kluwer Academic Publishers.
Petrosino, A., Boruch, R., Rounding, C., McDonald, S., & Chalmers, I. (2000). The Campbell
Collaboration Social, Psychological, Educational, and Criminological Trials Registry (C2-
SPECTR) to facilitate the preparation and maintenance of systematic reviews of social and
educational interventions. Evaluation and Research in Education (UK), 14(3-4), 206-219.
Petrosino, A.J., Turpin-Petrosino, C., & Finkenauer, J.O. (2000). Well-meaning programs can have
harmful effects! Lessons from "Scared Straight" experiments. Crime and Delinquency, 46(3),
354-379.
Riecken, H.W., Boruch, R.F., Campbell, D.T., Caplan, N., Glennan, T.K., Pratt, J.W., Rees, A., &
Williams, W.W. (1974). Social experimentation: A method for planning and evaluating social
programs. New York: Academic Press.
Rosenbaum, P.R. (1995) Observational studies. New York: Springer Verlag.
Social Research and Demonstration Corporation. (2001, Spring). Découvrir les approches efficaces :
L'expérimentation et la recherche en politique sociale à la SRSA, 1(2).
St. Pierre, R., Swartz, J., Murray, S., Deck, D., & Nickel, P. (1995). National evaluation of Even Start
Family Literacy Program (USDE Contract LC 90062001). Cambridge, MA: Abt Associates.
St. Pierre, R., et al. (1998). The comprehensive child development experiment. Cambridge, MA:
Abt Associates.
Schuerman, J.R, Rzepnicki, T.L., & Littell, J. (1994). Putting families first: An experiment in family
preservation. New York: Aldine de Gruyter.
Schweinhart, L.J., Barnes, H.V., & Weikart, D.P. (1993). Significant benefits: The High/Scope Perry
Preschool study through age 27. Ypsilanti, MI: High/Scope Press.
Sieber, J.E. (1992). Planning ethically responsible research: A guide for students and institutional review
boards. Newbury Park, CA: Sage.
Standards of Reporting Trials Group. (1994). A proposal for structured reporting of randomized
controlled trials. Journal of the American Medical Association, 272, 1926-1931.
Stanley, B., & Sieber, J.E. (Eds.). (1992). Social research on children and adolescents: Ethical issues.
Newbury Park, CA: Sage.
Toroyan, T., Roberts, I., & Oakley, A. (2000). Randomisation and resource allocation: A missed
opportunity for evaluating health-care and social interventions. Journal of Medical Ethics, 26,
319-322.
U.S. General Accounting Office. (1992). Cross-design synthesis: A new strategy for medical effec-
tiveness research (Publication No. GAO/PEMD-92-18). Washington, DC: Government Printing
Office.
U.S. General Accounting Office. (1994). Breast conservation versus mastectomy: Patient survival data
in daily medical practice and in randomized studies (Publication No. PEMD-95-9). Washington,
DC: Government Printing Office.
Yeaton, W.H., & Sechrest, L. (1986). Use and misuse of no difference findings in eliminating threats
to validity. Evaluation Review, 10, 836-852.
Yeaton, W.H., & Sechrest, L. (1987). No difference research. New Directions for Program Evaluation,
34, 67-82.
7
Cost-Effectiveness Analysis as an Evaluation Tool
HENRY M. LEVIN
Teachers College, Columbia University, New York, USA
PATRICK J. McEWAN
Wellesley College, MA, USA
INTRODUCTION
Cost-effectiveness analysis refers to evaluations that consider both the costs and
consequences of alternatives. It is a decision-oriented tool that is designed to
ascertain the most efficient means of attaining particular educational goals. For
example, there are many alternative approaches for pursuing such goals as
raising reading or mathematics achievement. These include the adoption of new
materials or curriculum, teacher training, educational television, computer-
assisted instruction, smaller class sizes, and so on. It is possible that all of these
alternatives, when well-implemented, have a positive effect on student achieve-
ment. Often the one that is recommended is the one with the largest apparent
effect.
Yet, consider the situation when the most effective intervention is 50 percent
more effective than its alternative, but has a cost that is three times as high. In
such a case, the effectiveness per unit of cost is twice as high for the "less
effective" alternative. To determine this, however, both costs and effects need to
be jointly assessed.
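To make the arithmetic explicit (the symbols E and C are ours, introduced only for this illustration): if the less effective alternative yields effectiveness E at cost C, the more effective one yields 1.5E at a cost of 3C, so its effectiveness per unit of cost is

\frac{1.5E}{3C} = 0.5 \cdot \frac{E}{C}

or half that of the less effective alternative; equivalently, the "less effective" alternative delivers twice the effectiveness per unit of cost.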
Most evaluation studies in education are limited to a determination of com-
parative effectiveness. Costs are rarely taken into account. Consider, however,
that many educational recommendations that show promise in terms of greater
effectiveness are so costly relative to their contribution to effectiveness that they
may not be appropriate choices. That is, they may require far greater resources
to get the same effect than an intervention with somewhat lower effectiveness
but higher effectiveness relative to its costs.
At the same time, cost analyses without effectiveness measures are equally
inappropriate for decision making. For example, budgetary studies in education
show that costs per student decrease as the size of schools increases. In New York
City, it was found that the cost per student was greater in small high schools than
in large ones (Stiefel, Iatarola, Fruchter, & Berne, 1998).
Advocates of particular programs often make the claim that the program that
they are recommending is highly cost-effective. The frequent mention of
cost-effectiveness in policy discussions tends to obscure the fact that there is
little or no analysis of cost-effectiveness behind such claims, or even a clear
definition of what is meant.3 Clune (2002) carried out a search of the ERIC database and
found that there were over 9,000 studies in that repository that were identified
by the keywords "cost-effectiveness." Even when limiting the search to the years
1991-96, he identified 1,329 titles. Abstracts of the latter titles were obtained to
see how many had done cost-effectiveness studies as opposed to using the term
in a rhetorical way. To make the analysis of abstracts manageable, only the 541
studies that addressed elementary and secondary education were utilized. Of
these, some 83 percent appeared to use the term cost-effectiveness without any
substantial attempt to actually carry out such a study. Only 1 percent seemed to
suggest that a substantial study was done. Ten percent of the full studies were
obtained, over-sampling those studies that appeared to be more rigorous. This
revealed an even larger portion that were rhetorical or superficial with none
meeting the criteria established by economists for rigorous cost-effectiveness
studies.
The conclusion that might be drawn from this is that even when the term
"cost-effectiveness" is used to describe conclusions of educational evaluations,
there may be no underlying and systematic cost-effectiveness study. Accordingly,
educating both evaluators and policy analysts on the fundamentals of cost-
effectiveness analysis should be a high priority.
Chapter Organization
The first section of this chapter discusses the definition of costs and the methods
of cost estimation. The second section reviews three modes of cost
analysis: cost-effectiveness, cost-utility, and cost-benefit analysis. It also reviews
issues that are particular to each mode, including the estimation of outcomes.
The third section reviews how to jointly assess cost and outcome measures in
order to rank the desirability of educational alternatives. The concluding section
discusses the use and interpretation of cost studies in a decision context.
ESTIMATION OF COSTS
Every intervention uses resources that have valuable alternative uses. For example,
a program for raising student achievement will require personnel, facilities, and
materials that can be applied to other educational and noneducational
endeavors. By devoting these resources to a particular activity we are sacrificing
the gains that could be obtained from using them for some other purpose.
The "cost" of pursuing the intervention is the value of what we must give up
or sacrifice by not using these resources in some other way. Technically, then, the
cost of a specific intervention will be defined as the value of all of the resources
that it utilizes had they been assigned to their most valuable alternative uses. In
this sense, all costs represent the sacrifice of an opportunity that has been
forgone. It is this notion of opportunity cost that lies at the base of cost analysis
in evaluation. By using resources in one way, we are giving up the ability to use
them in another way, so a cost has been incurred.
Although this may appear to be a peculiar way to view costs, it is probably
more familiar to each of us than appears at first glance. It is usually true that
when we refer to costs, we refer to the expenditure that we must make to
purchase a particular good or service as reflected in the statement, "The cost of
the meal was $15.00." In cases in which the only cost is the expenditure of funds
that could have been used for other goods and services, the sacrifice or cost can
be stated in terms of expenditure. However, in daily usage we also make
statements like, "It cost me a full day to prepare for my vacation," or "It cost me
two lucrative sales," in the case of a salesperson who missed two sales appoint-
ments because he or she was tied up in a traffic jam. In some cases we may even
find that the pursuit of an activity "cost us a friendship."
In each of these cases a loss is incurred, which is viewed as the value of
opportunities that were sacrificed. Thus the cost of a particular activity was
viewed as its "opportunity cost." Of course, this does not mean that we can
always easily place a dollar value on that cost. In the case of losing a day of work,
one can probably say that the sacrifice or opportunity cost was equal to what
could have been earned. In the case of the missed appointments, one can
probably make some estimate of what the sales and commissions would have
been had the appointments been kept. However, in the case of the lost friend-
ship, it is clearly much more difficult to make a monetary assessment of costs.
In cost analysis a similar approach is taken, in that we wish to ascertain the cost
of an intervention in terms of the value of the resources that were used or lost by
applying them in one way rather than in another. To do this we use a
straightforward approach called the "ingredients" method.
The ingredients method relies upon the notion that every intervention uses
ingredients that have a value or cost.4 If specific ingredients can be identified and
their costs can be ascertained, we can estimate the total costs of the intervention
as well as the cost per unit of effectiveness, utility, or benefit. We can also
ascertain how the cost burden is distributed among the sponsoring agency,
funding agencies, donors, and clients.
The first step in applying the method is to identify the ingredients that are
used. This entails the determination of what ingredients are required to create
or replicate the interventions that are being evaluated, casting as wide a net as
possible. It is obvious that even contributed or donated resources such as
volunteers must be included as ingredients according to such an approach, for
such resources will contribute to the outcome of the intervention, even if they are
not included in budgetary expenditures.
In order to identify the ingredients that are necessary for cost estimation, it is
important to be clear about the scope of the intervention. One type of confusion
that sometimes arises is the difficulty of separating the ingredients of a specific
intervention from the ingredients required for the more general program that
contains the intervention. This might be illustrated by the following situation.
Two programs for reducing school dropouts are being considered by a school
district. The first program emphasizes the provision of additional counselors
for dropout-prone youngsters. The second program is based upon the provision
of tutorial instruction by other students for potential dropouts as well as special
enrichment courses to stimulate interest in further education. The question that
arises is whether one should include all school resources in the analysis as well as
those required for the interventions, or just the ingredients that comprise the
interventions. We are concerned with the additional or incremental services that
will be needed in order to provide the alternative dropout-reduction programs.
Thus, in this case one should consider only the incremental ingredients that are
required for the interventions that are being evaluated.
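To illustrate the bookkeeping this implies, a minimal sketch in Python follows; the ingredients, quantities, and dollar values are hypothetical, invented for the dropout-counseling example above rather than drawn from any actual study:

    from collections import defaultdict

    # Hypothetical incremental ingredients for the counseling-based dropout
    # program: (ingredient, annual value in dollars, who bears the cost).
    ingredients = [
        ("counselors (2.0 FTE)",        120_000, "school district"),
        ("volunteer mentors (0.5 FTE)",  15_000, "community"),  # donated, but still a cost
        ("office space (shared)",         8_000, "school district"),
        ("materials and supplies",        3_000, "school district"),
    ]

    # Total cost is the sum of the values of all ingredients.
    total_cost = sum(value for _, value, _ in ingredients)
    print(f"Total annual cost: ${total_cost:,}")  # Total annual cost: $146,000

    # The ingredients method also shows how the cost burden is distributed
    # among the sponsoring agency, donors, and other constituencies.
    burden = defaultdict(int)
    for _, value, bearer in ingredients:
        burden[bearer] += value
    for bearer, value in sorted(burden.items()):
        print(f"  {bearer}: ${value:,}")

Note that the donated mentor time carries a value even though it never appears in the program's budget, which is precisely the point of costing ingredients rather than expenditures.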
Specification of Ingredients
Personnel

Personnel ingredients include all of the human resources required for each of the
alternatives that will be evaluated. This category includes not only full-time
personnel, but part-time employees, consultants, and volunteers. All personnel
should be listed according to their roles, qualifications, and time commitments.
Roles refer to their responsibilities, such as administration, coordination, teaching,
teacher training, curriculum design, secretarial services, and so on. Qualifications
refer to the nature of training, experience, and specialized skills required for the
positions. Time commitments refer to the amount of time that each person
devotes to the intervention in terms of percentage of a full-time position. In the
latter case there may be certain employees, consultants, and volunteers who
allocate only a portion of a full work week or work year to the intervention.
Facilities
Facilities refer to the physical space required for the intervention. This category
includes any classroom space, offices, storage areas, play or recreational
facilities, and other building requirements, whether paid for by the project or
not. Even donated facilities must be specified. All such requirements must be
listed according to their dimensions and characteristics, along with other
information that is important for identifying their value. For example, facilities
that are air conditioned have a different value than those that are not. Any
facilities that are jointly used with other programs should be identified according
to the portion of use that is allocated to the intervention.
Equipment and Materials

These refer to furnishings, instructional equipment, and materials that are used
for the intervention, whether covered by project expenditures or donated by
other entities. Specifically, they would include classroom and office furniture as
well as such instructional equipment as computers, audiovisual equipment,
scientific apparatus, books and other printed materials, office machines, paper,
commercial tests, and other supplies. Both the specific equipment and materials
solely allocated to the intervention and those that are shared with other activities
should be noted.
Other Inputs
This category refers to all other ingredients that do not fit readily into the
categories set out above. For example, it might include any extra liability or theft
insurance that is required beyond that provided by the sponsoring agency; or it
might include the cost of training sessions at a local college or university. Other
possible ingredients might include telephone service, electricity, heating, internet
access fees, and so forth. Any ingredients that are included in this category
should be specified clearly with a statement of their purpose.
Required Client Inputs

This category of ingredients includes any contributions that are required of the
clients or their families (Tsang, 1988). For example, if an educational alternative
requires the family to provide transportation, books, uniforms, equipment, food,
or other student services, these should be included under this classification. The
purpose of including such inputs is that the success of some interventions
depends crucially on such resources, while that of others does not. To provide an
accurate picture of the resources that are required to replicate any intervention
that requires client inputs, it is important to include them in the analysis.
Detailing Ingredients
The level of detail with which each ingredient is specified should reflect its share
of total costs. Even a modest percentage error in estimating personnel will distort the
total cost estimate because of the importance of personnel in the overall picture.
However, a 100 percent error in office supplies will create an imperceptible
distortion, because office supplies are usually an inconsequential contributor to
overall costs.
Sources of Information
Cost-Effectiveness
Cost-Utility
Cost-Benefit
Estimating Effectiveness
Estimating Utility
Estimating Benefits
Standard Evaluations
Evaluation designs are overwhelmingly used to evaluate outcomes that are not
expressed in pecuniary terms, such as academic achievement. Yet, in many cases,
it is possible to directly measure pecuniary outcomes such as labor market
earnings. There are several well-known evaluations of job training programs that
are expressly aimed at improving earnings. For example, an experimental
evaluation of the Job Training Partnership Act (JTPA) randomly assigned
individuals to receive training or serve in a control group (Orr et al., 1996). The
earnings of each group were traced during the ensuing months, and the
difference provided a useful measure of program benefits. In economics, there is
an extensive non-experimental literature that explores the links between
measures of school quality and eventual labor market earnings of individuals (for
reviews, see Betts, 1996; Card & Krueger, 1996).
Even when outcomes are not measured in monetary terms, they can often be
readily converted. In an evaluation of a dropout prevention program, outcomes
were initially measured in the number of dropouts prevented (Stern, Dayton,
Paik, & Weisberg, 1989). Given the well-known relationship between earnings
and high school graduation, the evaluators then derived a monetary estimate of
program benefits. In his cost-benefit analysis of the Perry Preschool Program,
Steven Barnett (1996b) concluded that participants reaped a variety of positive
outcomes. Among others, student performance improved in K-12 education and
participants were less likely to commit crimes later in life. In both cases, Barnett
obtained estimates of monetary benefits (or averted costs), because individuals
were less likely to use costly special education services or inflict costly crimes
upon others.
Contingent Valuation
Observed Behavior
Many evaluations are confined to a relatively short time period of a year or less,
and estimates of outcomes and costs are confined to that period. Yet, many
evaluations occur over two or more years. When that is the case, outcomes and
costs that occur in the future should be appropriately discounted to their
"present value." The purpose of this procedure is to reflect the desirability of
receiving outcomes sooner (or, similarly, incurring costs later).
To calculate present value, one uses the following general formula:
PV = \sum_{t=0}^{n} \frac{A_t}{(1 + r)^t}

where A_t is the cost or outcome amount occurring in year t, r is the discount rate,
and n is the number of years over which costs and outcomes extend.
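A small Python sketch of this formula (the three-year cost stream and the 5 percent discount rate are hypothetical):

    def present_value(amounts, r):
        """Discount a stream of yearly amounts (year 0 first) to present value."""
        return sum(a / (1 + r) ** t for t, a in enumerate(amounts))

    # $10,000 per year for three years, discounted at 5 percent:
    print(round(present_value([10_000, 10_000, 10_000], 0.05), 2))  # 28594.1

The undiscounted total is $30,000; discounting shrinks the later payments because resources in hand now are worth more than the same resources later.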
There are two additional steps in a cost analysis. First, costs and outcomes must
be jointly interpreted in order to rank alternatives from most desirable to least
desirable. Second, one must assess whether these conclusions are robust to
variations in key assumptions of the analysis. This is typically accomplished with
a sensitivity analysis.
Which alternative provides a given outcome for the lowest cost? This is deter-
mined by calculating a ratio of costs to outcomes. In a cost-effectiveness analysis,
the cost-effectiveness ratio (CER) of each alternative is obtained by dividing the
cost of each alternative (C) by its effectiveness (E):
\mathrm{CER} = \frac{C}{E}
It is interpreted as the cost of obtaining an additional unit of effectiveness
(however this is defined by the evaluator). When ratios are calculated for each
alternative, they should be rank-ordered from smallest to largest. Those
alternatives with smaller ratios are relatively more cost-effective; that is, they
provide a given effectiveness at a lower cost than others and are the best
candidates for new investments.8
Depending on the mode of cost analysis, the evaluator may calculate cost-
utility (C/U) or cost-benefit (C/B) ratios for each alternative.9 They are
interpreted in a similar fashion to the cost-effectiveness ratio. The goal is to
choose the alternatives that exhibit the lowest cost per unit of utility or benefits.
Because a cost-benefit analysis expresses outcomes in monetary terms, however,
the C/B ratio has an additional interpretation. We should not implement any
alternative for which the costs outweigh the benefits (i.e., C/B > 1). Thus, we can
assess the overall worth of an alternative, in addition to its desirability relative to
other alternatives.
The interpretation of these ratios is subject to a caveat. Comparisons are valid
for alternatives that are roughly similar in scale. However, when one alternative
is vastly larger or smaller, it can produce skewed interpretations. For example, a
program to reduce high school dropouts may cost $10,000 and reduce dropouts
by 20 (a ratio of $500). Another program may cost $100,000 and reduce dropouts
by 160 (a ratio of $625). The first appears more attractive because it costs less for
each dropout prevented. If the program is scaled up, however, it is unlikely that
either its costs or effectiveness will remain static, perhaps due to economies of
scale or implementation problems. This could potentially alter conclusions
regarding cost-effectiveness.
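The ratio calculation and ranking for these two hypothetical programs can be sketched as follows:

    # (cost, dropouts prevented) for the two programs described in the text.
    programs = {"smaller program": (10_000, 20), "larger program": (100_000, 160)}

    # Cost-effectiveness ratio: cost per dropout prevented.
    cers = {name: cost / prevented for name, (cost, prevented) in programs.items()}

    # Rank from smallest (most cost-effective) to largest ratio.
    for name, cer in sorted(cers.items(), key=lambda item: item[1]):
        print(f"{name}: ${cer:.0f} per dropout prevented")
    # smaller program: $500 per dropout prevented
    # larger program: $625 per dropout prevented

As the caveat above indicates, such a ranking is trustworthy only when the alternatives operate at comparable scale.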
Because a cost-benefit analysis expresses outcomes in pecuniary terms, there
are several alternative measures of project worth that can be employed,
including the net present value and the internal rate of return. Each has its own
decision rules, and is subject to strengths and weaknesses. For a summary, see
Levin and McEwan (2001, Chapter 7) and Boardman et al. (1996).
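As a point of reference, the standard definitions (from the general cost-benefit literature, not formulas given in this chapter) are

\mathrm{NPV} = \sum_{t=0}^{n} \frac{B_t - C_t}{(1 + r)^t}

where B_t and C_t are the benefits and costs occurring in year t; the internal rate of return is the value of r at which the NPV equals zero. The usual decision rules are to consider only alternatives with a positive NPV, or with an internal rate of return above the appropriate discount rate.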
It is a rare (or perhaps nonexistent) cost analysis that is not subject to uncertainty.
This can stem from any component of the analysis, including the estimates of
effectiveness, benefits, or utility; the cost ingredients that comprise each alternative;
and the discount rate. In some cases, uncertainty is a natural component of the
analysis, as with estimates of program impact that are derived from statistical sam-
ples of individuals. In other cases, uncertainty is a direct reflection of our ignorance
regarding a component of the analysis, such as the price of a cost ingredient.
There are multiple techniques for assessing whether uncertainty may
invalidate the conclusions of a cost analysis. The simplest is known as one-way
sensitivity analysis, in which a single uncertain parameter is varied across a
plausible range, with all others held constant, to see whether the ranking of
alternatives changes.
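A minimal sketch of a one-way sensitivity analysis in Python, under invented assumptions (the two alternatives, their cost streams, effectiveness figures, and the range of discount rates are all hypothetical):

    def present_value(amounts, r):
        """Discount a stream of yearly amounts (year 0 first) to present value."""
        return sum(a / (1 + r) ** t for t, a in enumerate(amounts))

    # Two hypothetical alternatives: (yearly costs, units of effectiveness).
    alternatives = {
        "A (costs spread over time)": ([50_000, 50_000, 50_000], 300),
        "B (costs paid up front)":    ([130_000, 5_000, 5_000], 300),
    }

    # Vary the uncertain parameter (here, the discount rate) while holding
    # everything else constant, and check whether the ranking changes.
    for r in (0.00, 0.05, 0.10):
        cers = {name: present_value(costs, r) / effect
                for name, (costs, effect) in alternatives.items()}
        best = min(cers, key=cers.get)
        print(f"discount rate {r:.0%}: preferred alternative is {best}")
    # At 0 and 5 percent, B has the lower cost-effectiveness ratio;
    # at 10 percent the ranking flips to A.

If a conclusion survives such variation across the plausible range of each parameter, it can be reported with more confidence.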
Presumably cost analysis is not conducted for its own sake, but rather to improve
educational policy. Hence, this section reviews several issues in its use. These
include the current uses of cost analysis in policy making, several guidelines to
incorporating cost analysis in such uses, and the emerging use of cost-
effectiveness "league tables" to rank investment alternatives.
Cost analysis is not always incorporated in decision making. There are, however,
some notable exceptions to this rule, many of them in health care.11 In the
United States, the state of Oregon attempted to use a wide-ranging cost-utility
analysis to rank medical interventions and determine which would be available
to Medicaid recipients (Eddy, 1991).
However cost analyses are used, there are several factors that should always be
considered. These include the overall quality of the analysis, the generalizability
of the findings, and the need to incorporate outside information in the final
decision.
Quality
Generalizability
Cost analyses are frequently applied beyond the immediate context in which they
were conducted. Before doing so, one must assess to what extent these
generalizations are appropriate (i.e., whether the cost analysis has external
validity). Even if the outcomes of an alternative are known to be constant across
several contexts - itself a strong assumption - it is possible that costs will differ,
thus altering the cost-effectiveness rankings. For example, the values of
ingredients such as teacher salaries, textbook prices, and even parent time could
vary drastically from one setting to another (Rice, 1997). This is especially
problematic in comparing results across countries (Lockheed & Hanushek,
1988). In these cases, it may be possible to modify the cost analysis with a
different set of ingredient values that is more context-appropriate.
League tables are an attempt to combine the results of many different cost
studies, often from quite different contexts. They are now a common feature of
health research, in which researchers estimate a cost-effectiveness or cost-utility
ratio for a particular treatment and then compare it to ratios from many
different studies (for a general discussion, see Drummond et al., 1997). One
health league table compares the cost-effectiveness of government regulations at
saving lives, ranging from steering column protection to cattle feed regulation
(Morrall, 1986). Cost-effectiveness league tables are still rare in education, but
will probably become more popular as the number of cost evaluations increases
(Fletcher et al., 1990; Lockheed & Hanushek, 1988).
In a comparative endeavor such as cost-effectiveness analysis, the advantage
of league tables is that we can evaluate the relative efficiency of a much broader
array of investment alternatives. However, there are weaknesses to consider
(Drummond, Torrance, & Mason, 1993; Levin & McEwan, 2001). Foremost is
that studies can vary enormously in their adherence to basic methodological
standards. For example, some may discount costs and outcomes over time, and
some may not. Some may exclude important cost ingredients, and others may
not. Of course, methodological decisions like these can drastically affect cost-
effectiveness ratios and, therefore, the ranking of alternatives. The important
point is that users of league tables must be cognizant of the strengths and
weaknesses of the individual studies that are summarized.
SUMMARY
ENDNOTES
1 A recent study for Chicago provides more detail on the apparent advantages of small schools
(Wasley et al., 2000). Note the focus on effectiveness and the absence of any cost analysis.
2 Although this is an excellent illustration of the error of considering only costs, the particular study
that is cited uses only budgetary costs and not the full range of costs that are captured by the
ingredients method that will be presented in the next section. It also uses available dropout and
graduation data, but does not control for variables other than size that might affect outcomes. It
is a provocative study, but not a rigorous one.
3 On the dearth of cost analysis in education, see Levin (1991); Monk and King (1993); and Smith
and Smith (1985).
4 For similar descriptions of the ingredients method, see Levin (1975, 1988, 1995) and Levin and
McEwan (2001, Chapters 3-5). The reader should be aware that the approach goes by other names
in the literature on cost analysis. For example, it is often referred to as the resource cost model
(for a methodological exposition, see Chambers and Parrish, 1994a, 1994b). At their core, the
ingredients and resource cost approaches are very similar. Both require that each intervention be
exhaustively described in terms of the ingredients or resources that are required to produce the
outcomes that will be observed. All these ingredients must be carefully identified for purposes of
placing a value or cost on them.
5 We remind the reader that this chapter can only provide a brief sketch of methods and that the
details can be found in Levin and McEwan (2001).
6 This is especially the case in health (Saint, Veenstra, & Sullivan, 1999).
7 In health, a similar approach has been used repeatedly to place a monetary value on human life (and
not without some controversy). For reviews of this literature, see Jones-Lee (1989) and Viscusi (1992).
8 It is also common for cost-effectiveness studies to calculate effectiveness-cost ratios (E/C) for each
alternative. This ratio indicates the units of effectiveness that are obtained for each unit of cost
that is incurred (generally a dollar or a multiple of a dollar). While the interpretation of these
ratios differs, they will lead to the same conclusions about the relative cost-effectiveness
of alternatives. For purposes of consistency, it is generally preferable to present cost-effectiveness
ratios (C/E). This is the recommended procedure of a set of national guidelines in health cost
analysis (Weinstein, Siegel, Gold, Kamlet, & Russell, 1996).
9 It is worth noting, however, that the literature in cost-benefit analysis has typically advocated the
calculation of benefit-cost (B/C) ratios. The difference is merely one of interpretation.
10 For more details on the method, as well as practical advice on implementing it with a spreadsheet,
see Boardman et al. (1996) and Clemen (1996).
11 See Sloan and Conover (1995) for an overview.
REFERENCES
Angrist, J., & Lavy, V. (1999). Using Maimonides' rule to estimate the effect of class size on scholastic
achievement. Quarterly Journal of Economics, 114(2), 533-576.
Barnett, W.S. (1985). Benefit-cost analysis of the Perry Preschool Program and its policy implications.
Educational Evaluation and Policy Analysis, 7(4), 333-342.
Barnett, W.S. (1996a). Economics of school reform: Three promising models. In H.F. Ladd (Ed.),
Holding schools accountable: Performance-based reform in education (pp. 299-326). Washington,
DC: The Brookings Institution.
Barnett, W.S. (1996b). Lives in the balance: Age-27 benefit-cost analysis of the High/Scope Perry
Preschool Program. Ypsilanti, MI: High/Scope Press.
Barnett, W.S., & Escobar, C.M. (1987). The economics of early educational intervention: A review.
Review of Educational Research, 57(4), 387-414.
Barnett, W.S., Escobar, C.M., & Ravsten, M.T. (1988). Parent and clinic early intervention for
children with language handicaps: A cost-effectiveness analysis. Journal of the Division for Early
Childhood, 12(4), 290-298.
Bartell, E. (1968). Costs and benefits of Catholic elementary and secondary schools. Notre Dame:
Notre Dame Press.
Bedi, A.S., & Marshall, J.H. (1999). School attendance and student achievement: Evidence from
rural Honduras. Economic Development and Cultural Change, 657-682.
Betts, J.R. (1996). Is there a link between school inputs and earnings? Fresh scrutiny of an old
literature. In G. Burtless (Ed.), Does money matter? The effect of school resources on student
achievement and adult success (pp. 141-191). Washington, DC: Brookings Institution Press.
Black, S.E. (1998). Measuring the value of better schools. Federal Reserve Bank of New York
Economic Policy Review, 87-94.
Black, S.E. (1999). Do better schools matter? Parental valuation of elementary education. Quarterly
Journal of Economics, 114(2), 577-599.
Boardman, A.E., Greenberg, D.H., Vining, A.R., & Weimer, D.L. (1996). Cost-benefit analysis:
Concepts and practice. Upper Saddle River, NJ: Prentice Hall.
Boruch, R.F. (1997). Randomized experiments for planning and evaluation: A practical guide. Thousand
Oaks, CA: Sage.
Brewer, D.J., Krop, C., Gill, B.P., & Reichardt, R. (1999). Estimating the cost of national class size
reductions under different policy alternatives. Educational Evaluation and Policy Analysis, 21(2),
179-192.
Card, D., & Krueger, A.B. (1996). Labor market effects of school quality: Theory and evidence. In
G. Burtless (Ed.), Does money matter? The effect of school resources on student achievement and
adult success. Washington, DC: Brookings Institution Press.
Chambers, J., & Parrish, T. (1994a). Developing a resource cost database. In W.S. Barnett (Ed.), Cost
analysis for education decisions: Methods and examples (Vol. 4, pp. 23-44). Greenwich, CT: JAI
Press.
Chambers, J., & Parrish, T. (1994b). Modeling resource costs. In W.S. Barnett (Ed.), Cost analysis for
education decisions: Methods and examples (Vol. 4, pp. 7-21). Greenwich, CT: JAI Press.
Clemen, R.T. (1996). Making hard decisions: An introduction to decision analysis (2nd ed.). Pacific
Grove, CA: Duxbury Press.
Clune, W.H. (2002). Methodological strength and policy usefulness of cost-effectiveness research. In
H.M. Levin & P.J. McEwan (Eds.), Cost-effectiveness and educational policy (pp. 55-68).
Larchmont, NY: Eye on Education.
Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis for field studies.
Chicago: Rand McNally.
Crone, T.M. (1998). House prices and the quality of public schools: What are we buying? Federal
Reserve Bank of Philadelphia Business Review, 3-14.
Cummings, R.G., Brookshire, D.S., & Schultze, W.D. (1986). Valuing environmental goods: An
assessment of the contingent valuation method. Totowa, NJ: Rowman & Allanheld.
Dalgaard, B.R., Lewis, D.R., & Boyer, C.M. (1984). Cost and effectiveness considerations in the
use of computer-assisted instruction in economics. Journal of Economic Education, 15(4), 309-323.
Drummond, M., Torrance, G., & Mason, J. (1993). Cost-effectiveness league tables: More harm than
good? Social Science in Medicine, 37(1), 33-40.
Drummond, M.F., O'Brien, B., Stoddart, G.L., & Torrance, G.W. (1997). Methods for the economic
evaluation of health care programmes (2nd ed.). Oxford: Oxford University Press.
Eddy, D.M. (1991). Oregon's methods: Did cost-effectiveness analysis fail? Journal of the American
Medical Association, 266(15), 2135-2141.
Eiserman, W.D., McCoun, M., & Escobar, C.M. (1990). A cost-effectiveness analysis of two
alternative program models for serving speech-disordered preschoolers. Journal of Early
Intervention, 14(4), 297-317.
Escobar, C.M., Barnett, W.S., & Keith, J.E. (1988). A contingent valuation approach to measuring
the benefits of preschool education. Educational Evaluation and Policy Analysis, 10(1), 13-22.
Fletcher, J.D., Hawley, D.E., & Piele, P.K. (1990). Costs, effects, and utility of microcomputer
assisted instruction in the classroom. American Educational Research Journal, 27(4), 783-806.
Fuller, B., Hua, H., & Snyder, C.W. (1994). When girls learn more than boys: The influence of time
in school and pedagogy in Botswana. Comparative Education Review, 38(3), 347-376.
Gerard, K. (1992). Cost-utility in practice: A policy maker's guide to the state of the art. Health
Policy, 21, 249-279.
Glewwe, P. (1999). The economics of school quality investments in developing countries: An empirical
study of Ghana. London: St. Martin's Press.
Gold, M.R., Patrick, D.L., Torrance, G.W., Fryback, D.G., Hadorn, D.C., Kamlet, M.S., et al. (1996).
Identifying and valuing outcomes. In M.R. Gold, L.B. Russell, J.E. Siegel, & M.C. Weinstein
(Eds.), Cost-effectiveness in health and medicine (pp. 82-134). New York: Oxford University Press.
Harbison, R.W., & Hanushek, E.A. (1992). Educational performance of the poor: Lessons from rural
northeast Brazil. Oxford: Oxford University Press.
Jamison, D.T., Mosley, W.H., Measham, A.R., & Bobadilla, J.L. (Eds.). (1993). Disease control
priorities in developing countries. Oxford: Oxford University Press.
Johannesson, M. (1996). Theory and methods of economic evaluation of health care. Dordrecht:
Kluwer Academic Publishers.
Jones-Lee, M.W. (1989). The economics of safety and risk. Oxford: Basil Blackwell.
Kaplan, R.M. (1995). Utility assessment for estimating quality-adjusted life years. In F.A. Sloan
(Ed.), Valuing health care: Costs, benefits, and effectiveness of pharmaceuticals and other medical
technologies (pp. 31-60). Cambridge: Cambridge University Press.
Keeler, E.B., & Cretin, S. (1983). Discounting of life-saving and other nonmonetary effects.
Management Science, 29, 300-306.
Keeney, R.L., & Raiffa, H. (1976). Decisions with multiple objectives. New York: Wiley.
King, J.A. (1994). Meeting the educational needs of at-risk students: A cost analysis of three models.
Educational Evaluation and Policy Analysis, 16(1), 1-19.
Krueger, A. (1999). Experimental estimates of education production functions. Quarterly Journal of
Economics, 114(2), 497-532.
Levin, H.M. (1975). Cost-effectiveness in evaluation research. In M. Guttentag & E. Struening
(Eds.), Handbook of evaluation research (Vol. 2). Beverly Hills, CA: Sage Publications.
Levin, H.M. (1988). Cost-effectiveness and educational policy. Educational Evaluation and Policy
Analysis, 10(1), 51-69.
Levin, H.M. (1991). Cost-effectiveness at quarter century. In M.W. McLaughlin & D.C. Phillips (Eds.),
Evaluation and education: At quarter century (pp. 188-209). Chicago: University of Chicago Press.
Levin, H.M. (1995). Cost-effectiveness analysis. In M. Carnoy (Ed.), International encyclopedia of
economics of education (2nd ed., pp. 381-386). Oxford: Pergamon.
Levin, H.M. (2002). Issues in designing cost-effectiveness comparisons of whole-school reforms. In
H.M. Levin & P.J. McEwan (Eds.), Cost-effectiveness and educational policy (pp. 71-96). Larchmont,
NY: Eye on Education.
Levin, H.M., Glass, G.V., & Meister, G.R. (1987). Cost-effectiveness of computer-assisted
instruction. Evaluation Review, 11(1), 50-72.
Levin, H.M., Leitner, D., & Meister, G.R. (1986). Cost-effectiveness of alternative approaches to
computer-assisted instruction (87-CERAS-1). Stanford, CA: Center for Educational Research at
Stanford.
Levin, H.M., & McEwan, P.J. (2001). Cost-effectiveness analysis: Methods and applications (2nd ed.).
Thousand Oaks, CA: Sage.
Lewis, D.R. (1989). Use of cost-utility decision models in business education. Journal of Education
for Business, 64(6), 275-278.
Lewis, D.R., Johnson, D.R., Erickson, R.N., & Bruininks, R.H. (1994). Multiattribute evaluation of
program alternatives within special education. Journal of Disability Policy Studies, 5(1), 77-112.
Lewis, D.R., & Kallsen, L.A. (1995). Multiattribute evaluations: An aid in reallocation decisions in
higher education. The Review of Higher Education, 18(4), 437-465.
Lewis, D.R., Stockdill, S.J., & Turner, T.C. (1990). Cost-effectiveness of micro-computers in adult
basic reading. Adult Literacy and Basic Education, 14(2), 136-149.
Light, R.J., Singer, J.D., & Willett, J.B. (1990). By design: Planning research on higher education.
Cambridge, MA: Harvard University Press.
Lipscomb, J., Weinstein, M.C., & Torrance, G.W. (1996). Time preference. In M.R. Gold, L.B.
Russell, J.E. Siegel, & M.C. Weinstein (Eds.), Cost-effectiveness in health and medicine (pp.
214-246). New York: Oxford University Press.
Lockheed, M.E., & Hanushek, E. (1988). Improving educational efficiency in developing countries:
What do we know? Compare, 18(1), 21-37.
Manning, W.G., Fryback, D.G., & Weinstein, M.C. (1996). Reflecting uncertainty in cost-
effectiveness analysis. In M.R. Gold, L.B. Russell, J.E. Siegel, & M.C. Weinstein (Eds.), Cost-
effectiveness in health and medicine (pp. 247-275). New York: Oxford University Press.
McEwan, P.J. (2000). The potential impact of large-scale voucher programs. Review of Educational
Research, 70(2), 103-149.
McEwan, P.J. (2002). Are cost-effectiveness methods used correctly? In H.M. Levin & P.J. McEwan
(Eds.), Cost-effectiveness and educational policy (pp. 37-53). Larchmont, NY: Eye on Education.
McMahon, W.W. (1998). Conceptual framework for the analysis of the social benefits of lifelong
learning. Education Economics, 6(3), 309-346.
Mitchell, R.C., & Carson, R.T. (1989). Using surveys to value public goods: The contingent valuation
method. Washington, DC: Resources for the Future.
Monk, D.H., & King, J.A. (1993). Cost analysis as a tool for education reform. In S.L. Jacobson &
R. Berne (Eds.), Reforming education: The emerging systemic approach (pp. 131-150). Thousand
Oaks, CA: Corwin Press.
Morrall, J.F. (1986). A review of the record. Regulation, 25-34.
Orr, L.L. (1999). Social Experiments. Thousand Oaks, CA: Sage.
Orr, L.L., Bloom, H.S., Bell, S.H., Doolittle, F., Lin, W., & Cave, G. (1996). Does job training for the
disadvantaged work? Evidence from the National JTPA Study. Washington, DC: Urban Institute Press.
Psacharopoulos, G. (1994). Returns to investment in education: A global update. World Development,
22(9), 1325-1343.
Quinn, B., Van Mondfrans, A., & Worthen, B.R. (1984). Cost-effectiveness of two math programs as
moderated by pupil SES. Educational Evaluation and Policy Analysis, 6(1), 39-52.
Rice, J.K. (1997). Cost analysis in education: Paradox and possibility. Educational Evaluation and
Policy Analysis, 19(4), 309-317.
Rossi, P.H., & Freeman, H.E. (1993). Evaluation: A systematic approach. (5th ed.). Newbury Park,
CA: Sage Publications.
Rouse, C.E. (1998). Schools and student achievement: More evidence from the Milwaukee Parental
Choice Program. Federal Reserve Bank of New York Economic Policy Review, 4(1), 61-76.
Saint, S., Veenstra, D.L., & Sullivan, S.D. (1999). The use of meta-analysis in cost-effectiveness
analysis. Pharmacoeconomics, 15(1), 1-8.
Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for
generalized causal inference. Boston: Houghton-Mifflin.
Sloan, F.A., & Conover, C.J. (1995). The use of cost-effectiveness/cost-benefit analysis in actual
decision making: Current status and prospects. In F.A. Sloan (Ed.), Valuing health care: Costs,
benefits, and effectiveness of pharmaceuticals and other medical technologies (pp. 207-232).
Cambridge: Cambridge University Press.
Smith, M.L., & Glass, G.V. (1987). Research and evaluation in education and the social sciences.
Englewood Cliffs, NJ: Prentice-Hall.
Smith, N.L., & Smith, J.K. (1985). State-level evaluation uses of cost analysis: A national descriptive
survey. In J.S. Catterall (Ed.), Economic evaluation of public programs (pp. 83-97). San Francisco:
Jossey-Bass.
Stern, D., Dayton, C., Paik, I.-W., & Weisberg, A. (1989). Benefits and costs of dropout prevention
in a high school program combining academic and vocational education: Third-year results from
replications of the California Peninsula Academies. Educational Evaluation and Policy Analysis,
11(4),405-416.
Stiefel, L., Iatarola, P., Fruchter, N., & Berne, R. (1998). The effects of size of student body on school
costs and performance in New York City high schools. New York: Institute for Education and Social
Policy, New York University.
Tan, J.-P., Lane, J., & Coustere, P. (1997). Putting inputs to work in elementary schools: What can be
done in the Philippines? Economic Development and Cultural Change, 45(4), 857-879.
Tsang, M.C. (1988). Cost analysis for educational policymaking: A review of cost studies in education
in developing countries. Review of Educational Research, 58(2), 181-230.
Udvarhelyi, I.S., Colditz, G.A., Rai, A., & Epstein, A.M. (1992). Cost-effectiveness and cost-benefit
analyses in the medical literature: Are the methods being used correctly? Annals of Internal
Medicine, 116, 238-244.
Viscusi, W.K. (1992). Fatal trade-offs. Oxford: Oxford University Press.
Viscusi, W.K. (1995). Discounting health effects for medical decisions. In F.A. Sloan (Ed.), Valuing
health care: Costs, benefits, and effectiveness of pharmaceuticals and other medical technologies (pp.
125-147). Cambridge: Cambridge University Press.
von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. Cambridge:
Cambridge University Press.
Warfield, M.E. (1994). A cost-effectiveness analysis of early intervention services in Massachusetts:
implications for policy. Educational Evaluation and Policy Analysis, 16(1), 87-99.
Wasley, P.A., Fine, M., Gladden, M., Holland, N.E., King, S.P., Mosak, E., & Powell, L. (2000). Small
schools: Great strides. New York: Bank Street College of Education.
Weinstein, M.C., Siegel, J.E., Gold, M.R., Kamlet, M.S., & Russell, L.B. (1996). Recommendations
of the panel on cost-effectiveness in health and medicine. Journal of the American Medical
Association, 276(15), 1253-1258.
Weiss, C.H. (1998). Evaluation. (2nd ed.). Upper Saddle River, NJ: Prentice-Hall.
World Bank (1993). World Development Report 1993: Investing in health. New York: Oxford University
Press.
World Bank (1996). India: Primary education achievement and challenges (Report No. 15756-IN).
Washington, DC: World Bank.
8
Educational Connoisseurship and Educational Criticism:
An Arts-Based Approach to Educational Evaluation
ELLIOT EISNER
Stanford University, School of Education, CA, USA
Political criticism on news hours is ubiquitous in our culture. The content of such poli-
tical interpretations depends, as I have suggested earlier, on the refinement of
one's sensibilities to the phenomena in that domain and in the application of
theoretical structures and values to those phenomena for purposes of analysis.
Those whose comments we find most credible and illuminating are regarded, in
the main, as excellent critics who deepen our understanding of the political
behavior of those whose decisions influence our lives.
The transformation of the contents of connoisseurship into the public sphere
through an act of criticism requires the "magical" feat of transforming qualities
into a language that will edify others. It requires an act of representation.
Normally such acts take place in the context of language; critics usually write or
speak about what they have paid attention to. However, although I will not
address the matter here, representation is not, in principle, restricted to what is
linguistic. Photographers, for example, also have a potent tool for revealing the
complexities and subtleties of a situation. Press photographers can help us grasp
what is fleeting but significant. Videographers and filmmakers have at their dis-
posal powerful tools for helping us understand in ways that language alone could
never reveal features of a situation that we want to know about. But as I said, I
will leave such matters aside in the context of this chapter. I only wish to point
out that the particular form of representation one uses has distinctive attributes
and therefore has the capacity to engender meanings that are distinctive to it.
The countenance of a face may be indelibly etched on one's cortex as a func-
tion of a particular portrait by a particular photographer. The arts help us see
what we had not noticed. In any case the dominant form for criticism in virtually
all fields is the written or spoken word. It is the ability to use words in ways that
somehow capture and display what is itself nonlinguistic that I referred to as
a magical feat. It is an accomplishment that we take much too much for granted.
The selection of one word rather than another can color meaning in the most
subtle of ways; the cadence and melody of one's prose can communicate more
poignantly and precisely than the specific words one chooses to use. In the arts
how something is conveyed is an inextricable part of its content. Critics, when
they are at the top of their form, are engaged in an artistic practice. That practice
has to do with the transformation of their own experience through the
affordances and within the constraints of the written or spoken word.
In writing an educational criticism, say, of a classroom, four dimensions can be
identified. These dimensions of educational criticism are generic features of the
structure of educational criticism. The features that I will describe, I describe
seriatim. However, these features are inextricably related to each other in the
actual context of seeing and saying that mark the aims of connoisseurship and
criticism.
The first of these features is description. At its most fundamental level, the
educational critic is interested in enabling a reader to envision or imagine the
situation the critic is attempting to reveal. Description is a process of making
vivid not only the factual particulars of the situation, but also its tone, its feel, its
emotionally pervaded features. To achieve such powerful and subtle ends, the
critic draws upon the expressive capacities of language and upon a wide range
of symbolic devices, including what has been artistically crafted. These symbolic
devices find their most acute expression in the arts. This perspective on the
sources of knowledge and understanding is fundamentally different from the
perspective that is traditionally applied to language claims.
A second dimension of educational criticism is interpretation. If description
can be regarded as an effort to give an account of a state of affairs, interpretation
represents an effort to account for what has been given an account of.
Interpretation is an effort to explain, at times through theoretical notions rooted
in the social sciences, the relationships one has described. Thus, if a writer like
Powell (1985) describes the treaties that high school students and teachers
sometimes create, if he describes the kind of collusion that exists between them
so that they can share a life with as little stress as possible, the interpretation of
such a described set of relationships would pertain to the reasons why they occur
in the first place and what their social functions and consequences are in the
classroom. Put another way, in the interpretive aspect of educational criticism,
there is an effort to explicate, not only to describe.
I wish to remind the reader that the distinctions I am drawing and the form
that I am using to describe these distinctions may give the illusion that the
distinctions are rigid and pertain to utterly separate processes. Nothing could be
farther from the truth. Interpretation is always, to some degree, at work in the
observation of a state of affairs; we are interested in making meaning out of what
we see. The distinctions that we make visually are typically rooted in structures
of meaning that assign them not only to a set of qualitative categories, but to
interpretive ones. Interpretation feeds description and description feeds
perception. As E.H. Gombrich once pointed out, artists do not paint what they
are able to see, rather they see what they are able to paint. The representational
form that we choose to use affects what it is that we look for, and what it is that
we look for and find is affected by the interpretive structures we bring to the
situation. It's all exceedingly interactive and one separates them at one's peril.
Nevertheless, it is difficult, if at all possible, to write or talk about a state of
affairs without drawing distinctions even when, as I have already indicated, the
situation is characterized by its interactive features. We dissect in order to reveal,
but we need to be careful that in dissection we do not kill what we are interested
in understanding.
A third feature of educational criticism pertains to evaluation. Evaluation is, as
the editors of this volume know quite well, more than the quantification of
information. Statistics and the quantities that go into them are descriptors having
to do with the magnitude of relationships; they are not evaluations unless they
are directly related to evaluative criteria. The essential and fundamental feature
of evaluation is that it is evaluative; values are at its heart. And they are as well
in the process of education. Education is a normative undertaking. Children
come to school not merely to change, but to strengthen their cognitive abilities
and to develop their potentialities. They come to school to learn how to do things
that are valuable so that they can lead personally satisfying and socially
constructive lives. Thus, the educational evaluator cannot engage in evaluation
without making some assessment of the educational value of what students have
encountered and learned.
Now values in education are of several varieties. As in politics, there are differ-
ent versions of the good society, just as there are different versions of the good
school. Regardless of the version embraced, some judgment needs to be made
about the value of what one sees. These value judgments should themselves be
subject to discussion and should be made vivid in the writing.
There is a tendency among some theorists to want writers of qualitative
approaches to evaluation to lay their cards out up front. That is, to indicate to
the reader "where they're coming from," to be explicit about the values that they
embrace. Although I think this might be done, I am frankly skeptical about an
effort to describe the educational values that an individual holds in a matter of a
page or two of narrative. I believe that these values come through much more
authentically in the writing of educational criticism itself. At the same time, I
have no objection to individuals providing a normative preface to enable the
reader to get some sense of the values that the educational critic holds most dear.
The important point, however, is that one needs to be in a position to indicate to
the reader what one makes of what has been described and interpreted. What is
its educational merit? What value does it possess? Educational criticism should
enable a reader to understand the educational virtue or the absence thereof of
what the critic has addressed in his or her writing.
The fourth dimension of educational criticism is called thematics. Thematics is
the distillation from the situation observed of a general idea or conclusion that
sums up the salient or significant feature or features of that situation. Thematics
is an effort to identify a common and important thread running through the
processes observed. Returning to the work of Arthur Powell and his colleagues
(1985), the observation that teachers and students at the secondary level develop
agreements that enable them to live comfortably together inside a classroom is
the result of many observations of this phenomenon during the course of their
work in school. What Powell and his colleagues then do is to distill these observa-
tions and then to "name the baby," that is, to call it a "treaty." The concept treaty
is a distillation of the essential features of the relationships between teachers and
students that exist in the classrooms that Powell and his colleagues observed. The
selection of the term treaty implies an agreement, often among warring parties,
to accept certain conditions as a way to terminate hostilities. "Treaty" captures
the essential features of those relationships and conveys to the reader a certain
kind of significance.
The formulation of a theme that pervades a classroom process, a school, or an
approach to the management of teachers and students through administrative
edicts and educational policies need not culminate in a theme represented by a
single word. Themes can be distilled and represented by phrases or sentences.
Describing the use of technically rational procedures as a way to manage students
and staff in a school district can function as a theme that distills many practices
and policies formulated by school boards and state agencies to improve schools.
The term "technical rationality" as a part of an assertion concerning the character
of organizational life in a school has much to tell us about the basis for such
management. Concepts in sociology such as Tonnies' (1963) "Gesellschaft" and
"Gemeinschaft" similarly are efforts to characterize forms of affiliation by describ-
ing the differences between communities and societies.
The distillation of a theme functions, in an interesting way, as a kind of gener-
alization. I know full well that within the strictures of standard statistical theory
single case studies cannot meet the conditions necessary for generalization.
After all, in addition to being single, case studies are often convenience samples
rather than random selections from a universe. How can a nonrandomly selected
single case yield generalizations? It is to that question we now turn.
The answer to the question I have posed can be found in the cognitive func-
tions of art and in the particulars of ordinary experience. In art there is a
distillation of experience into an icon, whether written, as in literature, or visual,
as in film, that articulates an idea or set of relationships about the subject matter
the work addresses. But it does more than that. It also provides a schema which
allows a reader or viewer to search more efficiently in other areas of life in order
to determine whether the schema secured through the prior work applies to
other situations. In a significant sense, this is a function that theory in the social
sciences performs. We secure ideas about cognitive development, about
socialization, about the structure of participation in communities, and we use
these distilled notions as a means through which the situations to which they apply
can be tested. Schemata help us locate elsewhere the phenomena that were
revealed earlier.
Ordinary daily experience performs a similar function. Our experience is not
a random selection from a universe of possibilities. It is an event from which we,
in a sense, extract a lesson. That lesson is used not as a predictor of the future, but
as a way to think about the situations that we are likely to encounter. The con-
cept of treaties that I mentioned earlier was invented to stand for relationships
seen in particular schools, but we have the wisdom to use the concept as a general
condition for thinking about other classrooms, classrooms not a part of the array
from which the concept was derived. Put more simply, the generalizations that
are secured from the arts represent schemata through which we come to
understand what we can seek and how we can behave. We learn lessons from
particulars that extend well beyond the particulars from which those lessons were
derived. It is in this sense rather than in the statistical sense that generalizations
from case studies are useful, even when there is only one case and even when
that case is not a random selection from a universe. In fact, even in studies that
employ methods for the random selection of the sample, generalizations derived
from such randomly selected samples are extended well beyond the universe
from which the sample was secured. We wisely use the research findings as a kind
of heuristic device, a sort of efficiency in our search in situations that have very
little to do, if anything, with the original populations. Even more simply, we test
ideas rather than apply them. We try them out to see if they shed light on areas
in which we are interested. Case studies do generalize, but they do not generalize
in the way in which statistical studies do (Goodman, 1978).
ENDNOTE
1 Belief in the ability of the human being to eventually know the world as it really is depends upon a
correspondence theory of truth. This conception of truth, and hence of knowledge, assumes that
when beliefs are isomorphic with reality, they are true. Thus, correspondence between belief and
reality is critical. The only problem with this view is that if correspondence is a criterion, the
individual needs to know two things: first, the individual needs to know what he or she believes and
second, he or she needs to know what reality is like in order to know that there is a correspondence.
If the individual knew the latter, the former would be unnecessary.
9
In Living Color: Qualitative Methods in Educational Evaluation
LINDA MABRY
Washington State University Vancouver, Vancouver, Washington
I know a Puerto Rican girl who became the first in her family to finish high
school and who then got a nursing degree. She started at the gallery by
participating in an art program, then worked at the front desk. I know an
Hispanic girl, a participant in one of our drama programs and later a
recipient of our housing resource program, who was the first in her family
to earn a college degree. She now works in the courts with our battered
women program .... Yesterday, I had lunch at a sandwich shop and met a
young woman ... who got a full scholarship to the University of Illinois
and was the first in her family to speak English or go to college. She said,
"That art program was the best thing that ever happened to me."
(P. Murphy, personal communication, December 2, 1994, quoted in
Mabry, 1998b, p. 154)
Epistemological Orientation
Understanding a program is less like snapping a photograph of it and more like
impression of it in changing natural light.
Coming to understand a program's quality requires sustained attention from
an array of vantage points, analysis more complex and contextual than can be
anticipated by prescriptive procedures. Representation of a program requires
portrayal of subtle nuances and multiple lines of perspective. Qualitative
evaluators have responded to these challenges with stakeholder-oriented
approaches which prioritize variety in viewpoints. These approaches also offer a
variety of conceptualizations of the evaluator's role and responsibility. Guba and
Lincoln (1989) take as their charge the representation of a spectrum of
stakeholder perceptions and experiences in natural settings and the construction
of an evaluative judgment of program quality by the evaluator, an irreplicable
construction because of the uniqueness of the evaluator's individual rendering
but trustworthy because of sensitive, systematic methods verified by member-
checking, audit trails, and the like. Also naturalistic and generally but not
exclusively qualitative, Stake's (1973) approach requires the evaluator to
respond to emergent understandings of program issues and reserves judgment to
stakeholders, a stance which critics feel sidesteps the primary responsibility of
rendering an evaluation conclusion (e.g., Scriven, 1998). Eisner (1991) relies on
the enlightened eye of the expert to recognize program quality in its critical and
subtle emanations, not easily discerned from the surface even by engaged stake-
holders. These three variations on the basic qualitative theme define respectively
naturalistic, responsive, and connoisseurship approaches.
Some evaluation approaches, generally but not exclusively qualitative, orient
not only to stakeholder perspectives but also to stakeholder interests. Patton's
approach prioritizes the utilization of evaluation results by "primary intended
users" (1997, p. 21) as the prime goal and merit of an evaluation. More internally
political, Greene (1997) presses for local participation in evaluation processes to
transform working relationships among stakeholder groups, especially
relationships with program managers. Fetterman (1996) takes the empowerment
of stakeholders as an explicit and primary goal of evaluation. More specifically
ideological, House (1993) urges evaluation as instrumental to social justice,
House and Howe (1999) as instrumental to deliberative democracy, and Mertens
(1999) as instrumental to the inclusion of the historically neglected, including
women, the disabled, and racial and cultural minorities.
This family of approaches, varying in their reliance on qualitative methods,
holds in common the view that a program is inseparable from subjective percep-
tions and experiences of it. For most qualitative practitioners, evaluation is a
process of examination of stakeholders' subjective perceptions leading to
evaluators' subjective interpretations of program quality. As these interpretations
take shape, the design and progress of a qualitative evaluation emerge, not
preordinate but adaptive, benefiting from and responding to what is learned
about the program along the way.
The palette of qualitative approaches in evaluation reflects the varied origins
of a methodology involving critical trade-offs.
Data Collection
Three data collection methods are the hallmarks of qualitative work: observa-
tion, interview, and review and analysis of documents and artifacts. These methods
provide the empirical bases for colorful accounts highlighting occurrences and
the experiences and perceptions of participants in the program studied.
Observation is generally unstructured, based on the realization that predetermined
protocols, as they direct focus, also introduce blinders and preconceptions into
the data. The intent of structured protocols may be to reduce bias, but bias can
be seen in the prescriptive categories defined for recording observational data,
categories that predict what will be seen and prescribe how it will be documented,
categories that preempt attention to the unanticipated and sometimes more
meaningful observable matters. Structured observation in qualitative work is
reserved for relatively rare program aspects regarded as requiring single-minded
attention. Interviews tend, for similar reasons, to be semi-structured, featuring
flexible use of prepared protocols to maximize both issue-driven and emergent
information gathering (see also Rubin & Rubin, 1995). Review of relevant
documents and artifacts (see Hodder, 1994) provides another and more
unobtrusive means of triangulation by both method and data source (Denzin,
1989) to strengthen data quality and descriptive validity (Maxwell, 1992).
Hybrid, innovative, and other types of methods increasingly augment these
three data collection mainstays. Bibliometrics may offer insight into scholarly
impact, as the number of citations indicates breadth of program impact (e.g.,
House, Marion, Rastelli, Aguilera, & Weston, 1996). Videotapes may document
observation, serve as stimuli for interviews, and facilitate repeated or collective
analysis. Finding vectors into informants' thinking, read-aloud and talk-aloud
methods may attempt to convert cognition into language while activities are
observed and documented. "Report-and-respond forms" (Stronach, Allan, &
Morris, 1996, p. 497) provide data summaries and preliminary interpretations to
selected stakeholders for review and revision, offering simultaneous opportunity
for further data collection, interpretive validation, and multi-vocal analysis.
Technology opens new data collection opportunities and blurs some long-
standing distinctions: observation of asynchronous discussion, of online and
distance education classes, and interactions in virtual space; documentation of
process through capture of computerized records; interview by electronic mail,
and so forth.
Restraining the impulse toward premature design creates possibilities for
discovery of foci and issues during data collection. Initial questions are refined
in light of incoming information and, reciprocally, refined questions focus new
data collection. The relationship between data collection and analysis is similarly
reciprocal; preliminary interpretations are drawn from data and require
verification and elaboration in further data collection. In the constant comparative
method articulated by Glaser and Strauss (1967), the usual goal is grounded
theory, that is, theory arising inductively from and grounded in empirical data.
For qualitative evaluators, the goal is grounded interpretations of program
quality. Evaluation by qualitative methods involves continual shaping and reshaping
through parallel dialogues involving design and data, data and interpretation,
evaluator perspectives and stakeholder perspectives, internal perceptions and
external standards. These dialogues demand efforts to confirm and disconfirm,
to search beyond indicators and facts which may only weakly reflect meaning.
Interpretation, like data collection, responds to emergent foci and issues and
tends to multiply meanings, giving qualitative methods their expansionist character.
Are qualitative methods, revised on the fly in response to the unanticipated,
sufficient to satisfy the expectations of science and professionalism? Observing that
qualitative work lacks the procedural guarantees of quality presumed by quantitative
inquirers, Smith (1994) claimed that "in assuming no connection between correct methods
and true accounts, extreme constructivists have seemingly abandoned the search
for the warrant for qualitative accounts" (p. 41). But the seeming abandonment
of warrant is actually a redirection of efforts - qualitative practitioners seek
substantive warrants rather than procedural ones. Quality in qualitative work is
more a matter of whether the account is persuasive on theoretical, logical, and
empirical grounds, less a matter of strict adherence to generalized, decon-
textualized procedures.
Validity is enhanced by triangulation of data, deliberate attempts to confirm,
elaborate, and disconfirm information by seeking out a variety of data sources,
applying additional methods, checking for similarities and dissimilarities across
time and circumstance. The data, the preliminary interpretations, and drafts of
reports may be submitted to diverse audiences, selected on the bases of expertise
and sensitivity to confidentiality, to try to ensure "getting it right" (Geertz, 1973,
p. 29). Critical review by evaluation colleagues and external substantive experts
may also be sought, and metaevaluation is advisable (as always) as an additional
strategy to manage subjective bias, which cannot be eliminated whatever one's
approach or method.
Interpretation
The data collected by qualitative methods are typically so diverse and ambiguous
that even dedicated practitioners often feel overwhelmed by the interpretive
task. The difficulty is exacerbated by the absence of clear prescriptive
procedures, making it necessary not only to determine the quality of a program
but also to figure out how to determine the quality of a program. The best advice
available regarding the interpretive process is rather nebulous (Erickson, 1986;
Wolcott, 1994), but some characteristics of qualitative data analysis are
foundational.
Reporting
Feasibility
Propriety
Utility
ENDNOTES
1 Dilthey (1883) prescribed hermeneutical or interpretive research to discover the meanings and
perspectives of people studied, a matter he referred to as Verstehen.
2 Anthropologists have distinguished etic accounts, which prioritize the meanings and
explanations of outside observers, from emic accounts, which prioritize indigenous meanings and
understandings (see Seymour-Smith, 1986, p. 92).
3 Worthen, Sanders, and Fitzpatrick note that, "among professional evaluators, there is no uniformly
agreed-upon definition of precisely what the term evaluation means. It has been used by various
evaluation theorists to refer to a great many disparate phenomena" (1997, p. 5, emphasis in the
original). However, the judgmental aspect, whether the judgment is expected of the evaluator or
someone else, is consistent across evaluators' definitions of evaluation: (1) Worthen, Sanders, &
Fitzpatrick: "Put most simply, evaluation is determining the worth or merit of an evaluation object"
(1997, p. 5). (2) Michael Scriven: "The key sense of the term 'evaluation' refers to the process of
determining the merit, worth, or value of something, or the product of that process" (1991, p. 139,
emphasis in the original). (3) Ernest House: "Evaluation is the determination of the merit or worth
of something, according to a set of criteria, with those criteria (often but not always) explicated and
justified" (1994, p. 14, emphasis added).
4 The positivist correspondence theory of truth holds that a representation is true if it corresponds to reality.
REFERENCES
Abma, T. (1997). Sharing power, facing ambiguity. In L. Mabry (Ed.), Advances in program evaluation: Vol. 3. Evaluation and the post-modern dilemma (pp. 105-119). Greenwich, CT: JAI Press.
Brandt, R.S. (Ed.). (1981). Applied strategies for curriculum evaluation. Alexandria, VA: ASCD.
Brodkey, L. (1989). On the subjects of class and gender in "The literacy letters." College English, 51, 125-141.
Campbell, D.T., & Stanley, J.C. (1963). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.
Carter, K. (1993). The place of story in the study of teaching and teacher education. Educational Researcher, 22(1), 5-12, 18.
Chelimsky, E. (1994). Evaluation: Where we are. Evaluation Practice, 15(3), 339-345.
Cohen, L.J. (1981). Can human irrationality be experimentally demonstrated? Behavioral and Brain Sciences, 4, 317-331.
Datta, L. (1994). Paradigm wars: A basis for peaceful coexistence and beyond. In C.S. Reichardt & S.F. Rallis (Eds.), The qualitative-quantitative debate: New perspectives. New Directions for Program Evaluation, 61, 153-170.
Datta, L. (1997). Multimethod evaluations: Using case studies together with other methods. In E. Chelimsky & W.R. Shadish (Eds.), Evaluation for the 21st century: A handbook (pp. 344-359). Thousand Oaks, CA: Sage.
Denzin, N.K. (1989). The research act: A theoretical introduction to sociological methods (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Denzin, N.K. (1997). Interpretive ethnography: Ethnographic practices for the 21st century. Thousand Oaks, CA: Sage.
Denzin, N.K., & Lincoln, Y.S. (2000). Handbook of qualitative research (2nd ed.). Thousand Oaks, CA: Sage.
Derrida, J. (1976). Of grammatology (trans. G. Spivak). Baltimore, MD: Johns Hopkins University Press.
Dilthey, W. (1883). The development of hermeneutics. In H.P. Rickman (Ed.), W. Dilthey: Selected writings. Cambridge: Cambridge University Press.
Eisner, E.W. (1981). On the differences between scientific and artistic approaches to qualitative research. Educational Researcher, 10(4), 5-9.
Eisner, E.W. (1985). The art of educational evaluation: A personal view. London: Falmer.
Eisner, E.W. (1991). The enlightened eye: Qualitative inquiry and the enhancement of educational practice. New York: Macmillan.
Erickson, F. (1986). Qualitative methods in research on teaching. In M.C. Wittrock (Ed.), Handbook of research on teaching (3rd ed.) (pp. 119-161). New York: Macmillan.
Fetterman, D.M. (1996). Empowerment evaluation: Knowledge and tools for self-assessment and accountability. Thousand Oaks, CA: Sage.
Foucault, M. (1979). What is an author? Screen, Spring.
Geertz, C. (1973). The interpretation of cultures: Selected essays. New York: Basic Books.
Glaser, B.G., & Strauss, A.L. (1967). The discovery of grounded theory. Chicago, IL: Aldine.
Greene, J.C. (1994). Qualitative program evaluation: Practice and promise. In N.K. Denzin & Y.S. Lincoln (Eds.), Handbook of qualitative research (pp. 530-544). Newbury Park, CA: Sage.
Greene, J.C. (1997). Participatory evaluation. In L. Mabry (Ed.), Advances in program evaluation: Vol. 3. Evaluation and the post-modern dilemma (pp. 171-189). Greenwich, CT: JAI Press.
Greene, J.C., Caracelli, V., & Graham, W.F. (1989). Toward a conceptual framework for multimethod evaluation designs. Educational Evaluation and Policy Analysis, 11, 255-274.
Greene, J.C., & Schwandt, T.A. (1995). Beyond qualitative evaluation: The significance of "positioning" oneself. Paper presented at the International Evaluation Conference, Vancouver, Canada.
Guba, E.G. (1978). Toward a methodology of naturalistic inquiry in educational evaluation. Monograph 8. Los Angeles: UCLA Center for the Study of Evaluation.
Guba, E.G. (1981). Criteria for assessing the trustworthiness of naturalistic inquiries. Educational Communication and Technology Journal, 29, 75-92.
Guba, E.G., & Lincoln, Y.S. (1989). Fourth generation evaluation. Thousand Oaks, CA: Sage.
Hedrick, T.E. (1994). The quantitative-qualitative debate: Possibilities for integration. In C.S. Reichardt & S.F. Rallis (Eds.), The qualitative-quantitative debate: New perspectives. New Directions for Program Evaluation, 61, 145-152.
Hodder, I. (1994). The interpretation of documents and material culture. In N.K. Denzin & Y.S. Lincoln (Eds.), Handbook of qualitative research (pp. 403-412). Thousand Oaks, CA: Sage.
House, E.R. (1993). Professional evaluation: Social impact and political consequences. Newbury Park, CA: Sage.
House, E.R. (1994). Integrating the quantitative and qualitative. In C.S. Reichardt & S.F. Rallis (Eds.), The qualitative-quantitative debate: New perspectives. New Directions for Program Evaluation, 61, 113-122.
House, E.R., & Howe, K.R. (1998). The issue of advocacy in evaluations. American Journal of Evaluation, 19(2), 233-236.
House, E.R., & Howe, K.R. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.
House, E.R., Marion, S.F., Rastelli, L., Aguilera, D., & Weston, T. (1996). Evaluating R&D impact. University of Colorado at Boulder: Unpublished report.
Howe, K. (1992). Getting over the quantitative-qualitative debate. American Journal of Education, 100(2), 236-256.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards: How to assess evaluations of educational programs (2nd ed.). Thousand Oaks, CA: Sage.
Jones, P. (1992). World Bank financing of education: Lending, learning and development. London: Routledge.
LeCompte, M.D., & Goetz, J.P. (1982). Problems of reliability and validity in ethnographic research. Review of Educational Research, 52, 31-60.
LeCompte, M.D., & Preissle, J. (1993). Ethnography and qualitative design in educational research (2nd ed.). San Diego: Academic Press.
Lincoln, Y.S., & Guba, E.G. (1985). Naturalistic inquiry. Newbury Park, CA: Sage.
Lyotard, J.-F. (1984). The postmodern condition: A report on knowledge. Minneapolis: University of Minnesota Press.
Mabry, L. (Ed.). (1997). Advances in program evaluation: Vol. 3. Evaluation and the post-modern dilemma. Greenwich, CT: JAI Press.
Mabry, L. (1998a). Case study methods. In H.J. Walberg & A.J. Reynolds (Eds.), Advances in educational productivity: Vol. 7. Evaluation research for educational productivity (pp. 155-170). Greenwich, CT: JAI Press.
Mabry, L. (1998b). A forward LEAP: A study of the involvement of Beacon Street Art Gallery and Theatre in the Lake View Education and Arts Partnership. In D. Boughton & K.G. Congdon (Eds.), Advances in program evaluation: Vol. 4. Evaluating art education programs in community centers: International perspectives on problems of conception and practice. Greenwich, CT: JAI Press.
Mabry, L. (1999a). Circumstantial ethics. American Journal of Evaluation, 20(2), 199-212.
Mabry, L. (1999b, April). On representation. Paper presented at an invited symposium at the annual meeting of the American Educational Research Association, Montreal.
Mabry, L. (1999c, November). Truth and narrative representation. Paper presented at the annual meeting of the American Evaluation Association, Orlando, FL.
Maxwell, J.A. (1992). Understanding and validity in qualitative research. Harvard Educational Review, 62(3), 279-300.
McLean, L.D. (1997). If in search of truth an evaluator. In L. Mabry (Ed.), Advances in program evaluation: Vol. 3. Evaluation and the post-modern dilemma (pp. 139-153). Greenwich, CT: JAI Press.
Mertens, D.M. (1999). Inclusive evaluation: Implications of transformative theory for evaluation. American Journal of Evaluation, 20, 1-14.
Miles, M.B., & Huberman, A.M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Thousand Oaks, CA: Sage.
Newman, D.L., & Brown, R.D. (1996). Applied ethics for program evaluation. Thousand Oaks, CA: Sage.
Patton, M.Q. (1997). Utilization-focused evaluation (3rd ed.). Thousand Oaks, CA: Sage.
Phillips, D.C. (1987). Validity in qualitative research: Why the worry about warrant will not wane. Education and Urban Society, 20, 9-24.
Polanyi, M. (1958). Personal knowledge: Towards a post-critical philosophy. Chicago, IL: University of Chicago Press.
Psacharopoulos, G., & Woodhall, M. (1991). Education for development: An analysis of investment choices. New York: Oxford University Press.
Reichardt, C.S., & Rallis, S.F. (Eds.). (1994). The qualitative-quantitative debate: New perspectives. New Directions for Program Evaluation, 61.
Rossi, P.H. (1994). The war between the quals and the quants: Is a lasting peace possible? In C.S. Reichardt & S.F. Rallis (Eds.), The qualitative-quantitative debate: New perspectives. New Directions for Program Evaluation, 61, 23-36.
Rubin, H.J., & Rubin, I.S. (1995). Qualitative interviewing: The art of hearing data. Thousand Oaks, CA: Sage.
Saito, R. (1999). A phenomenological-existential approach to instructional social computer simulation. Unpublished doctoral dissertation, Indiana University, Bloomington, IN.
Scriven, M. (1991). Evaluation thesaurus (4th ed.). Newbury Park, CA: Sage.
Scriven, M. (1994). The final synthesis. Evaluation Practice, 15(3), 367-382.
Scriven, M. (1997). Truth and objectivity in evaluation. In E. Chelimsky & W.R. Shadish (Eds.), Evaluation for the 21st century: A handbook (pp. 477-500). Thousand Oaks, CA: Sage.
Scriven, M. (1998, November). An evaluation dilemma: Change agent vs. analyst. Paper presented at the annual meeting of the American Evaluation Association, Chicago.
Scriven, M., Greene, J., Stake, R., & Mabry, L. (1995, November). Advocacy for our clients: The necessary evil in evaluation? Panel presentation to the International Evaluation Conference, Vancouver, BC.
Seymour-Smith, C. (1986). Dictionary of anthropology. Boston: G.K. Hall.
Shadish, W.R., Jr., Cook, T.D., & Leviton, L.C. (1991). Foundations of program evaluation: Theories of practice. Newbury Park, CA: Sage.
Smith, M.L. (1994). Qualitative plus/versus quantitative: The last word. In C.S. Reichardt & S.F. Rallis (Eds.), The qualitative-quantitative debate: New perspectives. New Directions for Program Evaluation, 61, 37-44.
Stake, R.E. (1973). Program evaluation, particularly responsive evaluation. Paper presented at the conference on New Trends in Evaluation, Göteborg, Sweden. Reprinted in G.F. Madaus, M.S. Scriven, & D.L. Stufflebeam (Eds.) (1987), Evaluation models: Viewpoints on educational and human services evaluation (pp. 287-310). Boston: Kluwer-Nijhoff.
Stake, R.E. (1978). The case study method in social inquiry. Educational Researcher, 7(2), 5-8.
Stake, R.E. (1997). Advocacy in evaluation: A necessary evil? In E. Chelimsky & W.R. Shadish (Eds.), Evaluation for the 21st century: A handbook (pp. 470-476). Thousand Oaks, CA: Sage.
Stake, R.E. (2000). Case studies. In N.K. Denzin & Y.S. Lincoln (Eds.), Handbook of qualitative research (2nd ed.) (pp. 236-247). Thousand Oaks, CA: Sage.
Stake, R., Migotsky, C., Davis, R., Cisneros, E., DePaul, G., Dunbar, C., Jr., et al. (1997). The evolving synthesis of program value. Evaluation Practice, 18(2), 89-103.
Stronach, I., Allan, J., & Morris, B. (1996). Can the mothers of invention make virtue out of necessity? An optimistic deconstruction of research compromises in contract research and evaluation. British Educational Research Journal, 22(4), 493-509.
Stufflebeam, D.L. (1997). A standards-based perspective on evaluation. In L. Mabry (Ed.), Advances in program evaluation: Vol. 3. Evaluation and the post-modern dilemma (pp. 61-88). Greenwich, CT: JAI Press.
von Wright, G.H. (1971). Explanation and understanding. London: Routledge & Kegan Paul.
Wolcott, H.F. (1990). On seeking - and rejecting - validity in qualitative research. In E.W. Eisner & A. Peshkin (Eds.), Qualitative inquiry in education: The continuing debate (pp. 121-152). New York: Teachers College Press.
Wolcott, H.F. (1994). Transforming qualitative data: Description, analysis, and interpretation. Thousand Oaks, CA: Sage.
Wolcott, H.F. (1995). The art of fieldwork. Walnut Creek, CA: AltaMira.
Worthen, B.R., Sanders, J.R., & Fitzpatrick, J.L. (1997). Program evaluation: Alternative approaches and practical guidelines (2nd ed.). New York: Longman.
Section 3
Evaluation Utilization
Introduction
Issues related to the use of evaluation are at the very heart of the theoretical
writings and practice of evaluation. Evaluation, after all, is concerned with
providing information about identifiable programs - and, hopefully, their use.
Using the distinction between conclusion-oriented and decision-oriented inquiry
made by Cronbach and Suppes (1969), evaluation is decision oriented.
Its purpose is not necessarily the creation of knowledge - conclusions about the
nature of an entity. Rather, the intended purpose of evaluation is to make
judgements about a specific program or programs at a particular point in time.
Implicit in evaluation as decision-oriented inquiry is the presence of a user (or
multiple users) for whom it is hoped the evaluation will have relevance.
CONTEXT
A brief historical perspective within the United States context will help to explain
the development of thinking about evaluation use. The Russian launching of
Sputnik in 1957 sparked massive mathematics and science curriculum
development projects in the United States. There was an urgent need in the
minds of many American politicians and educators to "catch up." Each of the
national curriculum projects that resulted included a required evaluation
component.
Practice: The evaluations of these curriculum projects generally focused on
making a determination of the adequacy of the curriculum. After the curriculum
had been developed, a judgment was to be made about its effect on student
learning.
Models/Writings: A classic article by Cronbach (1963) commented on the
nature of evaluation, focusing in particular on the manner in which evaluation
was conducted in these national curriculum projects. In essence, Cronbach
argued that an important purpose of evaluation in these projects was for "course
improvement." In referring to a course improvement objective for evaluation, he
meant that the appropriate role of evaluation was not only to make final
judgments about the efficacy of the curricula but also to provide information that
would assist in making modifications of the courses under development. Cronbach's
idea that course improvement was an appropriate outcome of the evaluation
activity was the basis for the formative/summative distinction subsequently made
by Scriven (1967). In essence, Cronbach's prescriptive formulation extended the
notion of appropriate evaluation use.
Practice: A few years after the establishment of the national curriculum
project, political events catalyzed another example of the impact of evaluation
practice on prescriptive evaluation models. In 1965, the United States Congress
enacted the Elementary and Secondary Education Act. This piece of legislation
provided additional federal resources to local school districts for serving the
programmatic needs of educationally disadvantaged youth. The Act required that
school district recipients of funding engage in, or contract for, evaluations of
these programs in order to demonstrate the impact of the federal funds. At the
insistence of Senator Robert Kennedy, School Site Councils, consisting of local
administrators, teachers, and parents, had to be founded at each school. The
Councils and the federal government were both viewed as required recipients of
the evaluation information.
Senator Kennedy and others embraced the idea that an important purpose of
evaluations is their use by parents and local educators to improve school
programs. The notion of evaluation use as being relevant to various people who
have a stake in the program received an important political boost.
Models/Writings: The recognition of a broadened range of potential users
came to be reflected in subsequent writings about evaluation. While no ideas are
ever really new (see for example, Tyler, 1942), the recognition of parents and
others as appropriate users in the 1960s may have provided the stimulus for
subsequent writings about evaluation stakeholders (see Bryk, 1983; Stake, 1976).
Practice: Where were all of these "evaluators" of local school Title programs to
come from? School districts turned to testing specialists and to other university
researchers for required evaluations of school programs. Testing experts were
employed because evaluation was viewed as primarily measurement of outcomes.
Alternatively, administrators inexperienced with evaluation tended to view it as
akin to research, and university researchers were called upon. But, attempts at
doing evaluation - all too frequently considered as research within field settings
- proved to be of little help to potential users. At the federal level, the evalu-
ations had a very minor impact on decision making, in part due to methodology
not sufficiently attuned to capturing small increments of "value added," and in
part due to lack of timeliness of reporting. At the local level, attention to federal
reporting requirements led to a general lack of responsiveness to the information
needs of local users. Rippey (1973) commented on the extent of the failure of
these efforts.
A recent publication seeks to replace the word use with influence (Kirkhart,
2000). This is an interesting but, unfortunately, misguided suggestion. There is
a very important difference between influence and use that should be preserved.
As noted earlier in this introduction, Cronbach and Suppes (1969) provide the
important distinction between conclusion- and decision-oriented inquiry.
Evaluation is intended to be decision-oriented, not conclusion-oriented. It is an
act that has as its reference a specified program (or set of programs) at a
specified time. It does not seek to generalize about other programs in other
places and times. The preservation of a term ("use") that captures the evaluative
setting and time frame is important. The evaluative setting consists of the
program (or programs) being evaluated. The evaluative time frame begins when
an evaluation is contemplated and concludes well beyond the completion of the
evaluation report. How far? Generally for a period of several years - as long as
the information (generally speaking) is still thought of as emanating from the
evaluation.
Influence is a term that includes other places and time frames. Leviton
and Hughes (1981) rightfully argue that influence of an evaluation should be
referred to as evaluation impact. In the accompanying chapter by Hofstetter
and AIkin, the point is made that the further removed from the place and time
of the evaluative encounter one gets, the more one may view the results or
processes as knowledge, and the influence might well be considered as
knowledge use.1
CHAPTERS IN THIS SECTION
A broad context has been presented for considering evaluation use. These
historical perspectives show the intertwining of evaluation practice, research,
and models/writings. Further elaboration of the use issue is provided in the three
accompanying chapters. Carolyn Hofstetter and Marvin Alkin present a discussion
of evaluation use research, writings, and issues; Michael Patton discusses his
utilization-focused evaluation model; and, J. Bradley Cousins considers partici-
patory evaluation and recent research on evaluation's role in organizational
learning.
ENDNOTE
1 In recent years there has been a focus on a number of knowledge-generating "evaluation" activities
such as "cluster evaluations," "meta-analyses," "lessons learned," and "best practices," all of which
look for patterns of effectiveness across programs. However, in considering these, it is important
to differentiate between pre-planned cross-program evaluation and post hoc aggregation of
individual evaluations. Cluster evaluations are intended at the outset to examine cross-program
findings; the specified programs for the cluster evaluation are pre-set. The effect of the process or
of the findings is properly called "evaluation use." Meta-analysis, on the other hand, is a derivative
activity. While evaluation related, such analyses attempt to gain understanding or knowledge from
independently conducted, uncoordinated evaluations. It is appropriate to talk about "use" from
such evaluation-derived activities, but it is "knowledge use."
REFERENCES
Alkin, M.C. (1969). Evaluation theory development. Evaluation Comment, 2, 2-7. Also excerpted in B.R. Worthen & J.R. Sanders (1973), Educational evaluation: Theory and practice. Belmont, CA: Wadsworth.
Alkin, M.C., Daillak, R., & White, P. (1979). Using evaluations: Does evaluation make a difference? Beverly Hills, CA: Sage.
Alkin, M.C., & Ellett, F. (1985). Educational models and their development. In T. Husen & T.N. Postlethwaite (Eds.), The international encyclopedia of education. Oxford: Pergamon.
Braskamp, L.A., Brown, R.D., & Newman, D.L. (1982). Studying evaluation utilization through simulations. Evaluation Review, 6(1), 114-126.
Bryk, A.S. (Ed.). (1983). Stakeholder-based evaluation. New Directions for Program Evaluation, 17. San Francisco, CA: Jossey-Bass.
Cousins, J.B., & Earl, L.M. (Eds.). (1995). The case for participatory evaluation: Participatory evaluation in education. London: Falmer.
Cronbach, L.J. (1963). Course improvement through evaluation. Teachers College Record, 64, 672-683.
Cronbach, L.J., & Suppes, P. (1969). Research for tomorrow's schools: Disciplined inquiry for education. New York: Macmillan.
Forss, K., Cracknell, B., & Samset, K. (1994). Can evaluation help an organization learn? Evaluation Review, 18(5), 574-591.
Greene, J.C. (1988). Stakeholder participation and utilization in program evaluation. Evaluation Review, 12(2), 91-116.
Guba, E.G., & Lincoln, Y.S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.
Hammond, R.L. (1973). Evaluation at the local level. In B.R. Worthen & J.R. Sanders, Educational evaluation: Theory and practice. Belmont, CA: Wadsworth.
Huberman, M. (1987). Steps towards an integrated model of research utilization. Knowledge: Creation, Diffusion, Utilization, 8(4), 586-611.
Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of educational programs, projects, and materials. New York: McGraw-Hill.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards (2nd ed.). Thousand Oaks, CA: Sage.
King, J.A., & Thompson, B. (1983). Research on school use of program evaluation: A literature review and research agenda. Studies in Educational Evaluation, 9, 5-21.
Kirkhart, K.E. (2000). Reconceptualizing evaluation use: An integrated theory of influence. New Directions for Evaluation, 88, 5-23.
Leviton, L.C., & Hughes, E.F.X. (1981). Research on the utilization of evaluations: A review and synthesis. Evaluation Review, 5(4), 525-548.
Nevo, D. (1993). The evaluation-minded school: An application of perceptions from program evaluation. Evaluation Practice, 14(1), 39-47.
Owen, J.M., & Rogers, P. (1999). Program evaluation: Forms and approaches. St. Leonards, Australia: Allen & Unwin.
Patton, M.Q. (1997). Utilization-focused evaluation (3rd ed.). Thousand Oaks, CA: Sage.
Patton, M.Q., Grimes, P.S., Guthrie, K.M., Brennan, N.J., French, B.D., & Blyth, D.A. (1978). In search of impact: An analysis of the utilization of federal health evaluation research. In T.D. Cook et al. (Eds.), Evaluation studies review annual (Vol. 3). Beverly Hills, CA: Sage.
Provus, M.M. (1971). Discrepancy evaluation. Berkeley, CA: McCutchan.
Rippey, R.M. (1973). The nature of transactional evaluation. In R.M. Rippey (Ed.), Studies in transactional evaluation. Berkeley, CA: McCutchan.
Rossi, P.H., & Freeman, H.E. (1985). Evaluation: A systematic approach. Beverly Hills, CA: Sage.
Scriven, M. (1967). The methodology of evaluation. In R.E. Stake (Ed.), Curriculum evaluation. American Educational Research Association monograph series on evaluation, 1. Chicago: Rand McNally.
Scriven, M. (1974). Standards for the evaluation of educational programs and products. In G.D. Borich (Ed.), Evaluating educational programs and products. Englewood Cliffs, NJ: Educational Technology Publications.
Stake, R.E. (1976). A theoretical statement of responsive evaluation. Studies in Educational Evaluation, 2(1), 19-22.
Stufflebeam, D.L., Foley, W.J., Gephart, W.J., Guba, E.G., Hammond, R.L., Merriman, H.O., et al. (1971). Educational evaluation and decision making. Itasca, IL: Peacock.
Tyler, R.W. (1942). General statement on evaluation. Journal of Educational Research, 35, 492-501.
Weiss, C. (1972). Utilization of evaluation: Toward comparative study. In C.H. Weiss (Ed.), Evaluating action programs. Boston: Allyn & Bacon.
10
Evaluation Use Revisited
CAROLYN HUIE HOFSTETTER & MARVIN C. ALKIN
University of California at Los Angeles, CA, USA
Evaluation use1 has been, and remains today, a pressing issue in the evaluation
literature. Are evaluations used? Do they make a difference in the decision-
making process or at all? If not, why?
For decades, little or no evidence of evaluation use was documented. More
encouraging studies followed in the 1980s, acknowledging the complexity of
defining and measuring evaluation use. With a broadened conception of use,
studies suggested that evaluations were used in the policy- or decision-making
process, but not as the sole information informing the process; rather, decision-
makers used evaluations as one of many sources of information. Additionally,
these studies helped social researchers, evaluators, and decision-makers to
temper unrealistic expectations and recognize the multi-faceted realities of the
policy development process.
This chapter revisits the notion of evaluation use. We incorporate literature on
knowledge utilization that preceded and informed research on evaluation
utilization - literature that we believe merits greater mention than found in
earlier reviews. The chapter has several goals: (1) review studies examining the
role of knowledge utilization to inform social science and policy, and their
linkage to current studies of evaluation utilization; (2) summarize previous
studies of evaluation utilization; (3) discuss prominent themes that emerge, such
as the definition of use and factors associated with use; and (4) address a variety of
related issues, such as impact, utility, and misutilization.
KNOWLEDGE UTILIZATION
This section discusses some of the studies which examined the issue of knowledge
utilization. In doing so, we do not argue that issues related to knowledge use are
synonymous with those related to evaluation use. There are subtle differences
between social science research and evaluation, largely based on their purposes
and the potential impact of the political contexts within which they operate.
Rather, we argue that the knowledge use literature greatly informed the
evolution of the evaluation utilization literature and that the former had a far
greater impact on the latter than has been reported in earlier literature reviews.
Knowledge utilization in the social sciences refers to the use of social research
to inform decisions on policies. Generally, social scientists hope their efforts
influence the policy development process and ultimately contribute to the
improvement of societal functioning and human welfare. In the 1960s and 1970s,
despite increased federal funding for social research and programs nationwide,
social scientists and government officials remained concerned that research
efforts went largely unnoticed and that policies were often debated and passed
without consideration of, or in spite of, research findings. Speculations about the
mismatch between expectations and performance emerged, but no empirical
studies regarding the prevailing state of social science research utilization among
government officials were available (Caplan, Morrison, & Stambaugh, 1975).
Recognition of problems in this area compelled the scientific community to
examine the use of social science information and its impact on policy-related
decisions. Schools of public policy emerged in the early 1970s, as did the Center
for Research on Utilization of Scientific Knowledge, housed at the University of
Michigan, and new academic journals like Knowledge: Creation, Diffusion,
Utilization.
Early contributions to knowledge utilization research emerged from landmark
work by Caplan (1974), then Caplan, Morrison, and Stambaugh (1975). In the
latter study, conducted in 1973-74, Caplan and his colleagues interviewed 204
upper-level officials in ten major departments of the executive branch of the
United States government to see how they used social science information to
inform policy formation and program planning. The surveys included direct and
indirect questions related to social science utilization, including self-reported use
of knowledge, specific information involved, policy issue involved, and impact of
the information on decisions.
Caplan et al. (1975) noted that the definition of social science knowledge one
employed greatly influenced how much it was used in federal decision-making
processes. "Soft" knowledge (non-research based, qualitative, couched in lay
language) appeared to be used more by decision-makers than "hard" knowledge
(research based, quantitative, couched in scientific language). Hard knowledge,
once retrieved, may or may not prove crucial to outcomes of a particular
decision-making situation, but it could serve as a validity check of pre-existing
beliefs as well as provide new information.
Interestingly, Caplan found hard knowledge yielded little impact on policy
formulation. Widespread use of soft knowledge among government officials,
although difficult to assess, suggests "that its impact on policy, although often
indirect, may be great or even greater than the impact of hard information" (Caplan
et al., 1975, p. 47).1 Simultaneously, Rich (1975) examined independently how
knowledge and information were used over time, yielding results congruent with
Caplan. This research, conducted by the National Opinion Research Center
(NORC) and funded by the National Science Foundation (NSF), involved
interviewing executives in seven federal agencies on a monthly basis for 1.5 years,
to ascertain how they used data generated by the Continuous National Survey
(CNS). Rich (1977) differentiated between instrumental use (knowledge for action)
and conceptual use (knowledge for understanding) stating: "'Instrumental' use
refers to those cases where respondents cited and could document ... the specific
way in which the CNS information was being used for decision-making or
problem-solving purposes. 'Conceptual' use refers to influencing a policymaker's
thinking about an issue without putting information to any specific,
documentable use" (p. 200). Specifically, instrumental utilization is "knowledge
for action," while conceptual utilization is "knowledge for understanding." Hard
and soft knowledge may be used either instrumentally or conceptually.
Caplan and Rich conducted the above studies separately, but the research
commonalities prompted a later collaboration that expanded the research on
knowledge utilization. Pelz (1978) synthesized the research to date, particularly
that by Caplan and Rich, highlighting the conceptual difficulties in measuring
knowledge use:
As both authors observed, policy makers found it difficult to identify
particular actions or decisions as outputs (instrumental uses) with parti-
cular pieces of knowledge as inputs, in a one-to-one matching between
input and output. They characterized the shortcomings of such an input/
output model (Rich & Caplan, 1976): "First, because knowledge accumu-
lates and builds within organizational memories, some decisions (outputs)
are made which seem to be independent of any identifiable, discrete
inputs ... Secondly, because knowledge produces [multiple] effects, it is
often impossible to identify the universe of informational inputs." They
argued that "conceptual uses ... of social science knowledge should not be
viewed as failures of policy makers to translate research findings into
action," and that other significant functions of knowledge include
organizational learning and planning, and the way in which problems are
defined. (Pelz, 1978, p. 350)
In addition to conceptual and instrumental use, Knorr (1977) identified a third
mode of use of social science data known as symbolic or legitimate use. Knorr
drew from 70 face-to-face interviews with medium-level decision makers in
Austrian federal and municipal government agencies and noted four roles that
social science results may serve in decision making: (1) decision-preparatory role,
where data serve as an "information base" or foundation for actual decisions to
occur but without a direct linkage to action, similar to "conceptual use";
(2) decision-constitutive role, where data are translated into practical measures
and action strategies, much like "instrumental" use; (3) substitute role, where a
government official signals by announcing an evaluation, that something is being
done about a problem, while proper actions that should be taken are postponed
or ignored altogether; and (4) legitimating role, the most common, where data
are used selectively to support publicly a decision made on a different basis or on
an already held opinion. The last two categories are distinctly different from
conceptual and instrumental use. Symbolic use occurs when social science data
are used selectively or are distorted to legitimate an existing view derived from
alternative sources of information and primarily for personal gain.
Carol Weiss (1977) argued that rather than playing a deliberate, measurable
role in decision making, social science knowledge "enlightens" policymakers.
According to Weiss, policymakers valued studies that prompted them to look at
issues differently, justified their ideas for reforms, challenged the status quo and
suggested the need for change. Thus, the term enlightenment became part of the
standard verbiage in the utilization research.
Weiss (1977) also introduced a broader, more accurate typology of research
utilization that forced the field to rethink the meaning of knowledge "use" and
to recognize that the knowledge-policy link must include all "uses." She identi-
fied six ways that research could be used in the policy process: (1) knowledge-
driven model, where research findings prompt new ways of doing things;
(2) problem-solving (instrumental) model, where research is a linear, step-by-step
process; (3) enlightenment model, the most common, where research informs,
albeit indirectly, policy decisions; (4) tactical model, where research is used to
gain legitimacy and build a constituency; (5) political model, where research
findings supporting a particular view are chosen to persuade others; and
(6) interactive model, where knowledge is used in conjunction with a decision-
maker's insights, experiences, and other communications, not in a linear fashion.
Pelz (1978) concluded that the disappointing lack of influence of social science
knowledge on policy stemmed from conceptual difficulties in measuring
knowledge utilization. Offering a more optimistic perspective, he stated, "Some
recent investigations, however, suggest that this dismal picture may have arisen
from an overly narrow definition of what is meant by social science knowledge
and by 'utilization' of that knowledge" (p. 346). And, like other researchers, Pelz
makes a similar distinction between instrumental and conceptual use. Drawing
from Weiss's and others' work, Pelz crossed three modes of utilization with two
types of knowledge (Figure 1):
                                  Type of Knowledge
Mode of Utilization               Hard          Soft
Instrumental/engineering           A             B
Conceptual/enlightenment           C             D
Symbolic/legitimate                E             F

Figure 1: Modes of Knowledge Use. Reprinted from Pelz, 1978, p. 347, Some Expanded
Perspectives on Use of Social Science in Public Policy
Extending the conception of ordinary knowledge to the work environment,
"working knowledge" is the "organized body of knowledge that administrators
and policy makers use spontaneously and routinely in the context of their work"
(Kennedy, 1983, p. 193). Working knowledge is context dependent, functioning
as a filter to interpret incoming, new social science evidence and to judge the
validity and usefulness of the new evidence. This contrasts with the use of social
science knowledge presented earlier (by Weiss, for example) where incoming
social science evidence is considered to inform and expand working knowledge,
thus "enlightening" the decision maker.
The above distinction led to new examinations of the linkage between working
knowledge and social science knowledge. In one study, Kennedy (1983) observed
and interviewed public school administrators and teachers in sixteen school
districts to examine the relation between evidence and the working knowledge
these participants had about substantive issues within their districts. Kennedy
identified three analytically distinct, though in practice interdependent, processes
that involved the use of evidence. The processes were: (1) seeking out new
evidence and attending to it; (2) incorporating new evidence into existing
working knowledge; and (3) applying this new evidence to working situations as
they arise. To a great extent, these processes overlap and are difficult to
distinguish from one another (Kennedy, 1983).
EVALUATION USE
The middle and late 1960s was a period of enormous growth in evaluation
activities. Large federal programs had been implemented during this time, often
requiring an evaluation. Indeed, it was a period in which many evaluations were
conducted simply to fulfill governmental reporting requirements. Conduct of
numerous poorly constructed, often naive evaluations, insensitive to field con-
ditions, exacerbated existing frustrations. At that time, there was little theoretical
literature available to guide these evaluations, so evaluators "worked to bend
their social science research skills to the next task of program evaluation" (Alkin,
Daillak, & White, 1979, p. 17).
In 1972, Weiss published "Utilization of Evaluation: Toward Comparative
Study," the first call for a focus on evaluation utilization issues (Weiss, 1972).
Weiss noted that although some instances of effective utilization existed, the rate
of non-utilization was much greater. She outlined two factors that account for
the underutilization: organizational systems and (then) current evaluation
practice. Generally, the evaluator must understand that hidden goals and
objectives underlie all organizations and that these agendas provide contexts that
moderate the impact of evaluations. Weiss (1972) stated that, "Much evaluation
is poor, more is mediocre" (p. 526). She hypothesized factors that influence
evaluation practice, including inadequate academic preparation of evaluators to
work in action settings, low status of evaluation in academic circles, program
ambiguities and fluidity, organizational limitations on boundaries for study,
limited access to data that restricts the rigor of the evaluation design, and
inadequate time, money, and staffing. Her paper provided the initial impetus for
further research on evaluation utilization.
Into the early 1970s, concerns about non-utilization and underutilization
remained prevalent in the evaluation literature.3 Despite technical and
theoretical advances, the level of dissatisfaction regarding evaluation use grew.
In fact, some of the harshest critics of the lack of evaluation use were highly
respected theorists and practitioners in the field itself. Weiss (1972) tactfully
noted, "Evaluation research is meant for immediate and direct use in improving
the quality of social programming. Yet a review of evaluation experience suggests
that evaluation results have not exerted significant influence on program decisions."
In simulation studies of evaluation use, respondents indicated, on a scale ranging
from Strongly Agree to Strongly Disagree (1), whether the program's budget should include money specifically
for evaluation. The responses were then divided in half using a median split, with
the top half defined as "high" perceived need for evaluation, and the bottom half
defined as "low" perceived need.
The simulation studies also found that how evaluation findings are reported
may influence their level of use. Use of percentages and graphs was more
favorably received than no data at all (Brown & Newman, 1982). The addition
of statistical analyses or references (e.g., statistically significant at p < 0.05),
in contrast, elicited negative reactions. Data-rich and jargon-free
evaluation reports were seen as more desirable than jargon-filled reports, but were
not necessarily used more frequently than were their jargonized counterparts
(Brown, Braskamp, & Newman, 1978).
Alkin, Daillak, and White (1979) conducted intensive case studies of five Title
I or Title IV-C (formerly Title III) programs, locating only one textbook example
of "mainstream utilization." Instead, most of the case studies represented examples
of "alternative utilization," a broader conception of evaluation use that emphasizes
the dynamic processes (e.g., people interactions, contextual constraints), rather
than the more static factors that characterize mainstream utilization. From this
work, Alkin and colleagues (1979) developed an analytic framework consisting
of eight categories of factors presumed to influence evaluation use. Research
evidence was presented on several factors strongly related to use. One of the
dynamic processes impacting utilization was the "personal factor" (identified by
Patton et al., 1975), which was typically a decision maker/primary user who takes
direct personal responsibility for using the information or for getting it to
someone who will use it. Alkin et al. (1979) further extended the notion of the
personal factor with particular emphasis on the role of an evaluator specifically
committed to facilitating and stimulating the use of evaluation information.
Another important factor in this study was evaluation credibility - that is, was
the evaluator involved a credible source of information? This credibility related
not only to the evaluator's technical capabilities but also to the extent to which
the evaluator was non-partisan and unbiased. It is important to note, however,
that credibility is not totally a function of the expertise the evaluator possesses or
is initially perceived to possess; credibility is, in part, "acquired." As Alkin et al.
(1979) note: "Changes in credibility can come about when an evaluator has
opportunities to demonstrate his (her) abilities and skills in relation to new
issues and for new audiences" (p. 274).
Further studies focusing on individual schools and school districts were
conducted. Kennedy, Neumann, and Apling (1980) intensively studied eighteen
school districts where staff reportedly used Title I evaluation or test information
in decision making. Respondents such as school district policymakers, program
managers, and school building personnel (e.g., teachers) were asked "to discuss
those issues that most concerned them in their current positions, and to describe
what information, if any, they relied upon in making decisions with respect to
these issues" (Kennedy, Neumann, & Apling, 1980, p. 3). For these user groups,
mandated summative Title I evaluations were of no use. However, several locally
designed evaluation components were useful in increasing program knowledge,
meeting funding, compliance, and other decision-making needs.
In another example of naturalistic research on evaluation utilization, King and
Pechman (1982) conducted a year-long case study of an urban Research and
Evaluation (R & E) unit in New Orleans, Louisiana (USA). They documented
the numerous activities and functions that the department served, including
collecting, processing, and distributing basic status information about the school
district (e.g., state mandated statistical reports, student enrollment data and
records, and state and federal evaluation reports). They found that such informa-
tion, often taken for granted, was essential for maintaining a well-functioning
school system. This realization led King and Pechman to identify two types of
evaluation use: "signalling" and "charged use."
Drawing from work by Zucker (1981), King and Pechman (1984) developed a
model of the evaluation use process in local school systems, incorporating these two types of use.
The former, signalling, refers to "the use of evaluation information as signals
from the local school district to funding and legislative support agencies that all
is well" in another to support maintenance of the institution (King & Pechman,
1984, p. 243). In contrast, charged use was less clearly defined, not predeter-
mined, and unpredictable in its effect. Evaluation information, be it positive or
negative, had the potential to create a disruption or change within the system
when an individual with the potential to act on that information gave it serious
consideration. Generally, evaluation activities produced information, such as
formal or informal documents, which in turn led to signalling use (institution
maintenance) or to charged use (responsive actions). Thus, the context in which
the evaluation occurs (and the persons involved in it) may greatly influence its level
and type of use.
In developing this conceptualization of evaluation use, King and Pechman
(1984) also synthesized what the evaluation field had learned up to that point,
most notably that evaluations were, in fact, used by decision makers and other
interested parties. This challenged the more traditional view that evaluations had
no impact. The types of use, however, were not as dramatic, direct and readily
apparent as evaluators had once hoped. Instead, evaluators realized, as had
social scientists with regard to knowledge utilization, that evaluations were often used
in subtle, meaningful, yet largely undocumentable ways. The type and level of
evaluation use was also context and user dependent. King and Pechman (1984), in
summarizing evaluation use in school systems, emphasize two important points:
IN SEARCH OF "UTILIZATION"
The literature offers many insights into the nature of utilization - and a good
deal of uncertainty and lack of clarity as well. We note that after 30 years of
research on knowledge and evaluation utilization, the definition of "utilization"
has never been agreed on, and there is a lack of widespread agreement as to what
Defining Use
Even in the most recent evaluation use studies, there are instances of narrowly
defined "use," requiring evidence "that in the absence of the research informa-
tion, those engaged in policy or program activities would have thought or acted
differently" (Leviton & Hughes, 1981, p. 527). Cook and Pollard (1977) place
similar constraints on the definition of use, indicating that instances of use must
involve "serious discussion of the results in debates about a particular
policy or program" (p. 161). In other studies, the notion of use was broadened
considerably to include all sorts of conceptual and symbolic use, a notion more
in line with early knowledge use studies. Such a definition may include, for
example, the "mere psychological processing of evaluation results ... without
necessarily informing decisions, dictating actions, or changing thinking"
(Cousins & Leithwood, 1986, p. 332).
The multiplicity of definitions has been a major impediment to the interpreta-
tion of this collective body of research. In fact, Weiss (1979) has indicated that
until we resolve questions about the definition of use, we face a future of
incompatible studies of use and scant hope of cumulative understanding of how
evaluation and decision-making intersect. Alkin, Daillak, and White (1979)
attempted to reconcile these seemingly disparate definitions. Their earlier
definition, in the form of a Guttman mapping sentence, now serves as a basis for
an evolving current definition of use.
As the first component of an evaluation utilization definition, Alkin et al.
(1979) noted that evaluation information must be communicated. There must be
an "it" that is the potential entity to be utilized. They indicated that this "might
include quantitative test results, accounts of interviews, budget analyses, imple-
mentation checks, or vignettes on important program aspects." The information
potentially to be used typically referred to the results or findings of the
evaluation. Such information includes not only the specific findings in the
evaluation report but also the reported data, descriptions, narratives, etc., that
led to those results. It should be noted that evaluation findings might derive from
formative as well as summative reports. Formative evaluation, in turn, may speak
both to the implementation of a program and to process outcome measures
- short-term indicators of desired outcomes.
The definition of evaluation use is here broadened to include the evaluation
process itself. Note the important distinction between process as previously used
and process as it is used here. In this instance, we refer to the act of conducting
the evaluation rather than to its interim or on-going results. This notion of
process use was noted by King and Thompson (1983) and Greene (1988) and has
more recently been fully articulated by Patton (1997). Patton suggested four
different ways that the evaluation process might affect use - most notably,
changes in the thinking and behavior of the individuals and the organizations in
which they reside. Attention to process use led to a concern for evaluation's use
in organizational development (see Owen & Lambert, 1995; Preskill, 1994;
Preskill & Torres, 1999) and to the development of evaluation theoretic models
designed to enhance organizational development and organizational learning.
(These are discussed more fully in the accompanying chapter by J. Bradley Cousins.)
A second element in Alkin et al.'s (1979) earlier definition referred to
individuals who might potentially make use of the information. They called this
group "admissible users." The implication of this term is that employment of the
information in certain contexts is not always "use," but is dependent on the
relevance of the individual. For use to occur, individuals within the context of the
program being evaluated must be the users. Patton (1997) considers this issue
and distinguishes between "intended users" and others. He defines intended
users as individuals (and not stakeholder groups) who are the intended primary
recipients of the evaluation - individuals who are interested in the evaluation
and likely to use it. He focuses his evaluation efforts on meeting the information
needs of these individuals and says that their use is the only use for which he
takes responsibility (Patton, in Alkin, 1990).
Notwithstanding this point of view, it seems appropriate to define "admissible
users" as both intended and unintended users. However, each must in some way
be directly related to the program under study. Thus, in a program within a
school district, the following might be considered admissible users: program
personnel, school and district administrators, parents and students, program
funders, and individuals in community organizations having a relationship with the
program. So, generally, who are admissible users? We would conclude that
admissible users are all individuals who directly have a stake in the program -
that is, are stakeholders (see Alkin, Hofstetter, & Ai, 1998).
So what of those in another setting who in some way employ the findings of
the evaluation (or its process)? Consider, for example, individuals who take the
findings of a study conducted in another context and use them, in part, to change
attitudes or to make program changes. Clearly, something was learned from the
evaluation that took place elsewhere - knowledge or understanding was
acquired. Since evaluation, by intent, refers to the study of a specific program (or
policy), this action is not evaluation utilization. Instead, it may be thought of as
use of knowledge derived from an evaluation (further discussion is found in a
subsequent section on "impact").
Thus, we take evaluation use to refer to instances in which the evaluation
process or findings lead to direct "benefits" to the admissible users. The "benefits"
must have the program presently under evaluation as their referent. Moreover, the
"benefit" must occur within an appropriate time frame related to the reporting
of evaluation results. What is an appropriate time frame? One must conclude
that if an evaluation leads to changes in attitudes or understanding (conceptual
use) in the year following the evaluation report it would be an appropriate time
frame. And if these conceptual views lead to program changes in the next year
Much of the research demonstrates that evaluation efforts rarely, if ever, lead to
major, direction-changing decisions. This led to an extensive early literature dis-
paraging the extent of use of evaluation (Guba, 1969; Rippey, 1973). However,
the research suggests that evaluations are, in fact, "used" albeit differently and
in much more complex ways than originally envisioned by utilization theorists (or
observed by Guba & Rippey). Employing an expanded view of utilization
("alternative utilization," Aikin et aI., 1979) there is convincing evidence that
utilization occurs. This is particularly true with reference to evaluations conducted
in local settings. (See Aikin et aI., 1974, 1979; Aikin & Stecher, 1983; Cousins,
1995; David, 1978; Dickey, 1980; Greene, 1988; Kennedy, 1984).
The case for extent of utilization in state or national arenas is more tenuous.
Indeed, differences between the views of Michael Patton (1988) and Carol Weiss
(1988) in the now famous "debate" about the extent of evaluation use may be
largely a function of the scope of the evaluation. Patton typically focuses on smaller
scale programs and finds ample evidence of utilization. Weiss, whose typical
encounters with evaluation are in large-scale, policy-oriented settings, finds less
direct evidence of use. This is reasonable based upon the evaluation use factors
just discussed, and the current views on the importance of context in evaluation
use. In small-scale projects or programs, evaluators have the opportunity to
develop a more personal relationship with potential users, there are fewer
"actors" in the political scene, and the context is less politically charged.
In addition to scope, the evaluation research suggests that the purpose of the
evaluation influences its level of use. Formative data (particularly in local contexts)
tend to be more useful and informative to potential users than summative evalu-
ations. (And large-scale evaluations tend to focus more heavily on summative
concerns.) The literature on formative evaluation provides ample evidence that
program directors and staff use information to guide program development and
they and their organizations benefit from the evaluation process. This is
particularly true in cases where evaluators and program staff are in close contact
with one another so that the evaluation process and findings are readily
accessible for potential use. In contrast, summative information may typically not
be available until a program has already been refunded, regardless of program
effectiveness, in part due to the necessary lead time for funding.
and their professional experience levels; (2) context factors reflecting the setting
of the project being evaluated, including organizational and programmatic
arrangements, social and political climate, fiscal constraints, and relationships
between the program being evaluated and other segments of its broader
organization and the surrounding community; and (3) evaluation factors
referring to the actual conduct of the evaluation, the procedures used in the
conduct of the evaluation, the information collected, and how that information
is reported.
Several years later, using a strict definition of utilization,5 Leviton and Hughes
(1981) identified five "clusters" of variables that affect utilization: (1) relevance
of evaluation to the needs of potential users, which includes evaluation reports
arriving in a timely fashion, prior to the decision at hand; (2) extent of
communication between potential users and producers of evaluations;
(3) translation of evaluations into their implications for policy and programs;
(4) credibility of both the evaluator and users, and the trust placed in evaluations; and
(5) commitment or advocacy by individual users.
Using a broadened conception of use, Cousins and Leithwood (1986) studied
65 evaluation use studies from various fields to examine what factors influenced
evaluation use. They found twelve factors that influence use, six associated with
evaluation implementation and six associated with the decision or policy setting.
Factors related to the implementation of the evaluation were: (1) evaluation
quality (for example, sophistication of methods, rigor, type of evaluation model);
(2) credibility of evaluator and/or evaluation process, defined in terms of
objectivity and believability; (3) relevance of the evaluation to information needs
of potential users; (4) communication quality, including clarity of report to
audience, breadth of dissemination; (5) degree of consistency of findings with
audience expectations; and (6) timeliness of dissemination to users.
Cousins and Leithwood (1986) note that the factors related to the decision or
policy setting were expanded beyond organizational characteristics to include the
information needs of all evaluation users. The six factors were: (1) information
needs of the evaluation audience; (2) decision characteristics, including impact
area, type of decision, significance of decision; (3) political climate, such as
political orientations of evaluation sponsors, relationships between stakeholders;
(4) competing information outside of the evaluation; (5) personal characteristics,
for example, users' organizational role, experience; and (6) commitment and/or
receptiveness to evaluation, including attitudes of users toward evaluation,
resistance, open-mindedness to evaluation generally.
This study yielded several interesting findings. Cousins and Leithwood (1986)
found that evaluation use seemed strongest when: evaluations were appropriate
in approach, methodological sophistication, and intensity; decisions to be made
were significant to users; evaluation findings were consistent with the beliefs and
expectations of the users; users were involved in the evaluation process and had
prior commitment to the benefits of evaluation; users considered the data
reported in the evaluation to be relevant to their problems; and a minimum
of information from other sources conflicted with the results of the evaluation.
Building on this work, Shulha and Cousins (1997) emphasized that the evaluation
context is critical to understanding and explaining evaluation use more generally
(see also King, 1988). Such contextual information is best captured through
in-depth naturalistic research rather than more static forms of data collection;
without it, social science data and evaluation data remain static entities
until they are accessed, interpreted, and used by a potential user.
Finally, Preskill and Caracelli (1997) conducted a survey about perceptions of
and experiences with evaluation use with members of the Evaluation Use Topical
Interest Group (TIG) of the American Evaluation Association (AEA). Based on
a sample of 282 respondents (54 percent of the TIG membership at that time),
they found several strategies that may influence and facilitate evaluation use.
The most important factors were: planning for use at the beginning of an
evaluation, identifying and prioritizing intended users and intended uses of
evaluation, designing the evaluation within resource limitations, involving stake-
holders in the evaluation process, communicating findings to stakeholders as the
evaluation progresses, and developing a communication and reporting plan.
In sum, numerous factors influence use. The "personal factor" appears to be
the most important determinant of both the extent and the type of impact of
a given evaluation. The evaluator could enhance use by engaging and involving
intended users early in the evaluation, ensuring strong communication between
the producers and users of evaluations, reporting evaluation findings effectively
so users can understand and use them for their purposes, and maintaining
credibility with the potential users. Potential users were more likely to use an
evaluation if they felt it was relevant to their needs, were committed to the
particular evaluation effort, believed that evaluations were useful and informative,
and became active participants in the evaluation process. The context
surrounding the evaluation also affected how and why the evaluations were used,
including fiscal constraints, political considerations, and how the program fit
into a broader organizational arena. Finally, evaluation quality proved to be
another key factor in predicting use (quality does not refer simply to technical
sophistication; it also incorporates methodological approach, report readability,
and jargon-free language).
What are the differences between evaluation use, impact and utility? As Leviton
and Hughes (1981) so cogently put it: "impact may be defined as modification of
policy or programs to which evaluation findings have contributed" (p. 527).
While most writers agree that use and impact are not the same, the differences
have not been well articulated. As noted in the previous section, impact implies
a more far-reaching effect and is not necessarily limited to those who were the
recipients of the evaluation. Impact refers to indirect benefits to the larger
system, or to benefits that are more long term in nature. For example, the results
of an evaluation might lead to changes in the larger system, which are incidental
Evaluation Misuse
FINAL COMMENTS
For as long as modern-day evaluations have existed, so too have questions about
evaluation use. As with social science research and knowledge, there remain
ongoing concerns about the importance of conducting evaluations and producing
information that leads to action, be it in creating policy, making decisions,
or changing how someone regards a particular issue. We assume that
our efforts are rewarded with some importance or action, although measuring
the effects of evaluation has proved, and will continue to prove, problematic.
In this chapter we have shown the relationship between research on
knowledge use and evaluation use but have sought to differentiate between the
two. In contrast to knowledge use, evaluation use has a particular user (or set of
users) and a particular program as its referent. We have presented a definition
of evaluation use that we believe to be complete, but which might form the basis
for further clarifications. Evidence has been presented that, in fact, utilization
does occur and there are factors associated with that use. Other research has
shown that a program's context has a strong influence on use, and work is
currently under way investigating the role of context, particularly as it relates
to evaluation's role in fostering organizational learning.
ENDNOTES
1 The terms "utilization" and "use" are used interchangeably in this chapter. The authors recog-
nize that some may consider one term or another preferable but both have been used almost
synonymously in the use/utilization literature.
2 Researchers, however, were unable to measure the level of impact of such knowledge, in part
because the researchers and respondents had different opinions about the meaning of knowledge
and knowledge use (Innes, 1990).
3 In many cases, however, non-use of evaluation results is rational and deliberate (King & Pechman,
1982).
4 However, the evaluation world today is quite different than the one in which the classic studies of
use were conducted. First, many of the earlier writings deploring the lack of use were not based on
research findings. Moreover, "both the descriptive research on use and the prescriptive writings
about use, including but not limited to the Standards, have changed the way that evaluators go about
doing evaluations so that there is now greater focus on use, greater intentionality about achieving
use, and therefore greater evidence of use" (M. Patton, personal communication, December 11,
2000).
5 Leviton and Hughes (1981, p. 527) identified two bottom-line criteria for use: serious discussion of
results in debates about a particular policy or program (information processing); and evidence that,
in the absence of research information, those engaged in policy or program activities would have
thought or acted differently.
6 Kirkhart (2000) has suggested that the term "influence" be adopted in place of "use." Influence, as
we understand the term, would incorporate both use and impact as defined here. We strongly disagree
with such a proposed change because it seemingly ignores the basic purpose of evaluation - to help
improve the specific programs under study.
7 The Program Evaluation Standards serve as an excellent guide to avoiding misuse situations: "One
of the main reasons for the Evaluation Impact standard is to guide evaluators to take steps following
the delivery of reports to help users not only make appropriate uses of findings, but also to help
them avoid misusing the findings" (D. Stufflebeam, personal communication, February 14, 2001).
REFERENCES
Alkin, M.C. (1975). Evaluation: Who needs it? Who cares? Studies in Educational Evaluation, 1(3),
201-212.
Alkin, M.C. (1985). A guide for evaluation decision makers. Beverly Hills, CA: Sage Publications.
Alkin, M.C. (1990). Debates on evaluation. Newbury Park, CA: Sage Publications.
Alkin, M.C., & Coyle, K. (1988). Thoughts on evaluation utilization, misutilization and non-utiliza-
tion. Studies in Educational Evaluation, 14, 331-340.
Alkin, M.C., Daillak, R., & White, P. (1979). Using evaluations: Does evaluation make a difference?
Beverly Hills, CA: Sage Publications.
Alkin, M.C., Hofstetter, C., & Ai, X. (1998). Stakeholder concepts. In A. Reynolds, & H. Walberg
(Eds.), Educational productivity. Greenwich, CT: JAI Press, Inc.
Alkin, M.C., Kosecoff, J., Fitz-Gibbon, C., & Seligman, R. (1974). Evaluation and decision-making:
The Title VII experience. Los Angeles, CA: Center for the Study of Evaluation.
Alkin, M.C., & Stecher, B. (1983). Evaluation in context: Information use in elementary school
decision making. Studies in Educational Evaluation, 9, 23-32.
Braskamp, L.A., Brown, R.D., & Newman, D.L. (1978). The credibility of a local educational
program evaluation report: Author source and client audience characteristics. American
Educational Research Journal, 15(3), 441-450.
Braskamp, L.A., Brown, R.D., & Newman, D.L. (1982). Studying evaluation utilization through
simulations. Evaluation Review, 6(1), 114-126.
Brown, R.D., Braskamp, L.A., & Newman, D.L. (1978). Evaluator credibility as a function of report
style: Do jargon and data make a difference? Evaluation Quarterly, 2(2), 331-341.
Brown, R.D., & Newman, D.L. (1982). An investigation of the effect of different data presentation
formats and order of arguments in a simulated adversary evaluation. Educational Evaluation and
Policy Analysis, 4(2), 197-203.
Brown, R.D., Newman, D.L., & Rivers, L. (1980). Perceived need for evaluation and data usage as
influencers on an evaluation's impact. Educational Evaluation and Policy Analysis, 2(5), 67-73.
Campbell, D.T. (1969). Reforms as experiments. American Psychologist, 24(4), 409-429.
Campbell, D.T. (1974). Evolutionary epistemology. In P.A. Schilpp (Ed.), The philosophy of Karl
Popper, Vol. 14-1 (pp. 413-463). La Salle, IL: Open Court Publishing.
Caplan, N. (1974). The use of social science information by federal executives. In G.M. Lyons (Ed.),
Social research and public policies - The Dartmouth/OECD Conference (pp. 46-67). Hanover, NH:
Public Affairs Center, Dartmouth College.
Caplan, N., Morrison, A., & Stambaugh, R.J. (1975). The use of social science knowledge in policy
decisions at the national level: A report to respondents. Ann Arbor, MI: Institute for Social Research,
University of Michigan.
Christie, C.A., & Alkin, M.C. (1999). Further reflections on evaluation misutilization. Studies in
Educational Evaluation, 25, 1-10.
Cook, T.D., & Pollard, W.E. (1977). Guidelines: How to recognize and avoid some common problems
of mis-utilization of evaluation research findings. Evaluation, 4, 162-164.
Cousins, J.B. (1995). Assessing program needs using participatory evaluation: A comparison of high
and marginal success cases. In J.B. Cousins, & L.M. Earl (Eds.), Participatory evaluation in
education: Studies in evaluation use and organizational learning (pp. 55-71). London: Falmer.
Cousins, J.B., & Leithwood, K.A. (1986). Current empirical research on evaluation utilization.
Review of Educational Research, 56(3), 331-364.
Crawford, E.T., & Biderman, A.D. (1969). The functions of policy-oriented social science. In E.T.
Crawford, & A.D. Biderman (Eds.), Social scientists and international affairs (pp. 233-243). New
York: Wiley.
Datta, L. (2000). Seriously seeking fairness: Strategies for crafting non-partisan evaluations in a
partisan world. American Journal of Evaluation, 21(1), 1-14.
David, J.L. (1978). Local uses of Title I evaluation. Menlo Park, CA: SRI International.
Dickey, B. (1980). Utilization of evaluations of small scale educational projects. Educational Evaluation
and Policy Analysis, 2, 65-77.
Emmert, M.A. (1985). Ordinary knowing and policy science. Knowledge: Creation, Diffusion, Utilization,
7(1), 97-112.
Greene, J.G. (1988). Stakeholder participation and utilization in program evaluation. Evaluation
Review, 12(2), 91-116.
Greene, J.G. (1990). Technical quality versus user responsiveness in evaluation practice. Evaluation
and Program Planning, 13, 267-274.
Guba, E.G. (1969). The failure of educational evaluation. Educational Technology, 9(5), 29-38.
Innes, J.E. (1990). Knowledge and public policy: The search for meaningful indicators (2nd Ed.). New
Brunswick, NJ: Transaction Publishers.
Janowitz, M. (1970). Political conflict: Essays in political sociology. Chicago: Quadrangle Books.
Joint Committee on Standards for Educational Evaluation (1994). The program evaluation standards
(2nd ed.). Thousand Oaks, CA: Sage Publications.
Kennedy, M.M. (1983). Working knowledge. Knowledge: Creation, Diffusion, Utilization, 5(2), 193-211.
Kennedy, M.M. (1984). How evidence alters understanding and decisions. Educational Evaluation
and Policy Analysis, 6(3), 207-226.
Kennedy, M., Neumann, W., & Apling, R. (1980). The role of evaluation and testing programs in Title
I programs. Cambridge, MA: The Huron Institute.
King, J.A. (1988). Evaluation use. Studies in Educational Evaluation, 14(3), 285-299.
King, J.A., & Pechman, E.M. (1982). The process of evaluation use in local school settings (Final Report
of NIE Grant 81-0900). New Orleans, LA: Orleans Parish School Board. (ERIC Document
Reproduction Service No. ED 233 037).
King, J.A., & Pechman, E.M. (1984). Pinning a wave to the shore: Conceptualizing evaluation use in
school systems. Educational Evaluation and Policy Analysis, 6(3), 241-251.
King, J.A., & Thompson, B. (1983). Research on school use of program evaluation: A literature
review and research agenda. Studies in Educational Evaluation, 9, 5-21.
Kirkhart, K.E. (2000). Reconceptualizing evaluation use: An integrated theory of influence. In V.J.
Caracelli, & H. Preskill (Eds.), The expanding scope of evaluation use. New Directions for
Evaluation, 88, 5-23.
Knorr, K.D. (1977). Policymakers' use of social science knowledge: Symbolic or instrumental? In
C.H. Weiss (Ed.), Using social research in public policy making (pp. 165-182). Lexington, MA:
Lexington Books.
Leviton, L.C., & Hughes, E.F.X. (1981). Research on the utilization of evaluations: A review and
synthesis. Evaluation Review, 5(4), 525-548.
Lindblom, C., & Cohen, D.K. (1979). Usable knowledge: Social science and social problem solving.
New Haven, CT: Yale University Press.
Newman, D.L., Brown, R.D., & Littman, M. (1979). Evaluator report and audience characteristics
which influence the impact of evaluation reports: Does who say what to whom make a difference?
CEDR Quarterly, 12(2), 14-18.
Owen, J.M., & Lambert, F.C. (1995). Roles for evaluation in learning organizations. Evaluation, 1(2),
237-250.
Owen, J.M., & Rogers, P.J. (1999). Program evaluation: Forms and approaches. St. Leonards,
Australia: Allen & Unwin.
Patton, M.Q. (1975). Alternative evaluation research paradigm. North Dakota Study Group on Evaluation
Monograph, Center for Teaching and Learning, University of North Dakota, Grand Forks, ND.
Patton, M.Q. (1988). Six honest serving men for evaluation. Studies in Educational Evaluation, 14,
301-330.
Patton, M.Q. (1997). Utilization-focused evaluation (3rd Ed.). Thousand Oaks, CA: Sage
Publications.
Patton, M.Q., Grimes, P.S., Guthrie, K.M., Brennan, N.J., French, B.D., & Blyth, D.A. (1977). In
search of impact: An analysis of the utilization of federal health evaluation research. In C.H.
Weiss (Ed.), Using social research in public policy making (pp. 141-164). Lexington, MA: Lexington
Books.
Pelz, D.C. (1978). Some expanded perspectives on use of social science in public policy. In J.M.
Yinger, & S.J. Cutler (Eds.), Major social issues: A multidisciplinary view. New York: Macmillan.
Popper, K.R. (1963). Conjectures and refutations. London: Routledge & Kegan Paul.
Popper, K.R. (1966). Of clouds and clocks: An approach to the problem of rationality and the freedom
of man. St. Louis, MO: Washington University Press.
Preskill, H. (1994). Evaluation's role in enhancing organizational learning. Evaluation and Program
Planning, 17(3), 291-297.
Preskill, H., & Caracelli, V. (1997). Current and developing conceptions of use: Evaluation Use TIG
survey results. Evaluation Practice, 18(3), 209-225.
Preskill, H., & Torres, R. (1999). Evaluative inquiry for organizational learning. Thousand Oaks, CA: Sage.
Rein, M., & White, S.H. (1977). Policy research: Belief and doubt. Policy Analysis, 3, 239-271.
Rich, R. (1975). An investigation of information gathering and handling in seven federal bureaucracies:
A case study of the Continuous National Survey. Unpublished doctoral dissertation, University of
Chicago.
Rich, R.F. (1977). Uses of social science information by federal bureaucrats: Knowledge for action
versus knowledge for understanding. In C.H. Weiss (Ed.), Using social research in public policy
making (pp. 199-211). Lexington, MA: Lexington Books.
Rippey, R.M. (1973). The nature of transactional evaluation. In R.M. Rippey (Ed.), Studies in
transactional evaluation. Berkeley, CA: McCutchan.
Schon, D. (1983). The reflective practitioner: How professionals think in action. New York: Basic Books.
Shulha, L.M., & Cousins, J.B. (1997). Evaluation use: Theory, research, and practice since 1986.
Evaluation Practice, 18(3), 195-208.
Stevens, C.J., & Dial, M. (1994). What constitutes misuse? In C.J. Stevens, & M. Dial (Eds.), Preventing
the misuse of evaluation. New Directions for Program Evaluation, 64, 3-14.
Weiss, C.H. (1972). Evaluation research: Methods of assessing program effectiveness. Englewood Cliffs,
NJ: Prentice-Hall.
Weiss, C.H. (1972). Utilization of evaluation: Toward comparative study. In C.H. Weiss (Ed.),
Evaluating action programs: Readings in social action and education (pp. 318-326). Boston, MA:
Allyn and Bacon, Inc.
Weiss, C.H. (Ed.). (1977). Using social research in public policy making. Lexington, MA: Lexington
Books.
Weiss, C.H. (1978). Improving the linkage between social research and public policy. In L.E. Lynn
(Ed.), Knowledge and policy: The uncertain connection. Washington, DC: National Research
Council.
Weiss, C.H. (1979). The many meanings of research utilization. Public Administration Review,
September/October.
Weiss, C.H. (1980). Knowledge creep and decision accretion. Knowledge: Creation, Diffusion, Utilization,
1(3), 381-404.
Weiss, C. H. (1988). Evaluation for decisions: Is anybody there? Does anybody care? Evaluation
Practice, 9(1), 5-19.
Weiss, C.H., & Bucuvalas, M.J. (1977). The challenge of social research in decision making. In C.H.
Weiss (Ed.), Using social research in public policy making (pp. 213-234). Lexington, MA: Lexington
Books.
Wholey, J.S., & White, B.F. (1973). Evaluation's impact on Title I elementary and secondary
education program management. Evaluation, 1(3), 73-76.
Young, C.J., & Comptois, J. (1979). Increasing congressional utilization of evaluation. In F. Zweig
(Ed.), Evaluation in legislation. Beverly Hills, CA: Sage Publications.
Zucker, L.G. (1981). Institutional structure and organizational processes: The role of evaluation
units in schools. In A. Bank, & R.C. Williams (Eds.), Evaluation in school districts: Organizational
perspectives (CSE Monograph Series in Evaluation, No. 10). Los Angeles, CA: University of
California, Center for the Study of Evaluation.
11
Utilization-Focused Evaluation
MICHAEL QUINN PATTON
BASIC DEFINITIONS
Many decisions must be made in any evaluation. The purpose of the evaluation
must be determined. Concrete evaluative criteria for judging program success
will usually have to be established. Methods will have to be selected and
time lines agreed on. All of these are important issues in any evaluation. The
question is: Who will decide these issues? The utilization-focused answer is:
primary intended users of the evaluation.
Clearly and explicitly identifying people who can benefit from an evaluation is
so important that evaluators have adopted a special term for potential evaluation
users: stakeholders. Evaluation stakeholders are people who have a stake - a
vested interest - in evaluation findings. For any evaluation there are multiple
possible stakeholders: program funders, staff, administrators, and clients or
program participants. Others with a direct, or even indirect, interest in program
effectiveness may be considered stakeholders, including journalists and members
of the general public, or, more specifically, taxpayers, in the case of public
programs. Stakeholders include anyone who makes decisions or desires
information about a program. However, stakeholders typically have diverse and
often competing interests. No evaluation can answer all potential questions
equally well. This means that some process is necessary for narrowing the range
of possible questions to focus the evaluation. In utilization-focused evaluation
this process begins by narrowing the list of potential stakeholders to a much
shorter, more specific group of primary intended users. Their information needs,
i.e., their intended uses, focus the evaluation.
Different people see things differently and have varying interests and needs.
This can be taken as a truism. The point is that this truism is regularly and
consistently ignored in the design of evaluation studies. To target an evaluation
at the information needs of a specific person or a group of identifiable and
interacting persons is quite different from what has been traditionally
recommended as "identifying the audience" for an evaluation. Audiences are
amorphous, anonymous entities. Nor is it sufficient to identify an agency or
organization as a recipient of the evaluation report. Organizations are impersonal
collections of hierarchical positions. People, not organizations, use
evaluation information - thus the importance of the personal factor.
The personal factor is the presence of an identifiable individual or group of
people who personally care about the evaluation and the findings it generates.
Research on use (Patton, 1997) has shown that where a person or group is
actively involved with and interested in an evaluation, evaluations are more likely
to be used; where the personal factor is absent, there is a correspondingly
marked absence of evaluation impact.
The personal factor represents the leadership, interest, enthusiasm,
determination, commitment, assertiveness, and caring of specific, individual
people. These are people who actively seek information to make judgments and
reduce decision uncertainties. They want to increase their ability to predict the
outcomes of programmatic activity and thereby enhance their own discretion as
decision makers, policy makers, consumers, program participants, funders, or
whatever roles they play. These are the primary users of evaluation.
Though the specifics vary from case to case, the pattern is markedly clear:
Where the personal factor emerges, where some individuals take direct, personal
responsibility for getting findings to the right people, evaluations have an impact.
Where the personal factor is absent, there is a marked absence of impact. Use is
not simply determined by some configuration of abstract factors; it is determined
in large part by real, live, caring human beings.
USER-FOCUSED
First, intended users of the evaluation are identified. These intended users are
brought together or organized in some fashion, if possible (e.g., an evaluation
task force of primary stakeholders), to work with the evaluator and share in
making major decisions about the evaluation.
Second, the evaluator and intended users commit to the intended uses of the
evaluation and determine the evaluation's focus, for example, formative,
summative, or knowledge generating. Prioritizing evaluation questions will often
include considering the relative importance of focusing on attainment of goals,
program implementation, and/or the program's theory of action (logic model).
The menu of evaluation possibilities is vast, so many different types of
evaluations may need to be discussed. The evaluator works with intended users
to determine priority uses with attention to political and ethical considerations.
In a style that is interactive and situationally responsive, the evaluator helps
intended users answer these questions: Given expected uses, is the evaluation
worth doing? To what extent and in what ways are intended users committed to
intended use?
The third overall stage of the process involves methods, measurement and
design decisions. Primary intended users are involved in making methods
decisions so that they fully understand the strengths and weaknesses of the
findings they will use. A variety of options may be considered: qualitative and
quantitative data; naturalistic, experimental, and quasi-experimental designs;
purposeful and probabilistic sampling approaches; greater and lesser emphasis
on generalizations; and alternative ways of dealing with potential threats to
validity, reliability and utility. More specifically, the discussion at this stage will
include attention to issues of methodological appropriateness, believability of
the data, understandability, accuracy, balance, practicality, propriety and cost.
As always, the overriding concern will be utility. Will results obtained from these
methods be useful - and actually used?
Once data have been collected and organized for analysis, the fourth stage of
the utilization-focused process begins. Intended users are actively and directly
involved in interpreting findings, making judgments based on the data, and
generating recommendations. Specific strategies for use can then be formalized
in light of actual findings and the evaluator can facilitate following through on
actual use.
Finally, decisions about dissemination of the evaluation report can be made
beyond whatever initial commitments were made earlier in planning for intended
use. This reinforces the distinction between intended use by intended users
(planned utilization) and more general dissemination for broad public
accountability (where both hoped for and unintended uses may occur).
While in principle there is a straightforward, one-step-at-a-time logic to the
unfolding of a utilization-focused evaluation, in reality the process is seldom
simple or linear. For example, the evaluator may find that new users become
important or new questions emerge in the midst of methods decisions. Nor is
there necessarily a clear and clean distinction between the processes of focusing
evaluation questions and making methods decisions; questions inform methods,
and methodological preferences can inform questions.
PROCESS USE
Most discussions about evaluation use focus on use of findings. However, being
engaged in the processes of evaluation can be useful quite apart from the findings
that may emerge from those processes. Reasoning processes are evaluation's
donkeys: they carry the load. If, as a result of being involved in an evaluation,
primary intended users learn to reason like an evaluator and operate in accordance
with evaluation's values, then the evaluation has generated more than findings.
It has been useful beyond the findings in that it has increased the participants'
capacity to use evaluative logic and reasoning. "Process use," then, refers to
using the logic, employing the reasoning, and being guided by the values that
undergird the evaluation profession.
Those trained in the methods of research and evaluation can easily take for
granted the logic that undergirds those methods. Like people living daily inside
any culture, those inside the research culture find its way of thinking natural
and easy. However, to practitioners, decision makers, and policy
makers, this logic can be hard to grasp and quite unnatural. Thinking in terms of
what's clear, specific, concrete and observable does not come easily to people
who thrive on, even depend on, vagueness, generalities and untested beliefs as
the basis for action. Learning to see the world as an evaluator sees it often has a
lasting impact on those who participate in an evaluation - an impact that can be
greater and last longer than the findings that result from that same evaluation.
Process use refers to and is indicated by individual changes in thinking and
behavior, and program or organizational changes in procedures and culture, that
occur among those involved in evaluation as a result of the learning that occurs
during the evaluation process. Evidence of process use is represented by the
following kind of statement after an evaluation: The impact on our program came
not so much from the findings but from going through the thinking process that the
evaluation required.
Any evaluation can, and often does, have these kinds of effects. What's
different about utilization-focused evaluation is that the process of actively
involving intended users increases these kinds of evaluation impacts. Furthermore,
the possibility and desirability of learning from evaluation processes as well as
findings can be made intentional and purposeful. In other words, instead of
treating process use as an informal offshoot, explicit and up-front attention to
the potential impacts of evaluation logic and processes can increase those
impacts and make them a planned purpose for undertaking the evaluation. In
that way the evaluation's overall utility is increased.
The groundwork for process use is laid in working with intended users to help
them think about the potential and desired impacts of how the evaluation will be
conducted. Questions about who will be involved take on a different degree of
importance when considering that those most directly involved will not only play
a critical role in determining the content of the evaluation, and therefore the
focus of findings, but they also will be the people most affected by exposure to
evaluation logic and processes. The degree of internal involvement, engagement
and ownership will affect the nature and degree of impact on the program's
culture.
How funders and users of evaluation think about and calculate the costs and
benefits of evaluation also are affected. The cost-benefit ratio changes on both
sides of the equation when the evaluation produces not only findings, but also
serves longer term programmatic needs like staff development and organizational
learning.
Four primary types of process use have been differentiated: (1) enhancing
shared understandings, especially about results; (2) supporting and reinforcing the
program through intervention-oriented evaluation; (3) increasing participants'
engagement, sense of ownership and self-determination (participatory and
empowerment evaluation); and (4) program or organizational development
(Patton, 1997).
An example of process use can be found in the framework of Cousins and Earl
(1995) who have advocated participatory and collaborative approaches primarily
to increase use of findings. Yet, they go beyond increased use of findings when
they discuss how involvement in evaluation can help create a learning organiza-
tion. Viewing participatory evaluation as a means of creating an organizational
culture committed to ongoing learning has become an important theme in recent
literature linking evaluation to "learning organizations" (e.g., King, 1995;
Sonnichsen, 1993).
Utilization-focused evaluation is inherently participatory and collaborative in
actively involving primary intended users in all aspects of the evaluation as a
strategy for increasing use of findings. What attention to process use adds is a
recognition that participation and collaboration can lead to an ongoing, longer term commitment
to using evaluation logic and building a culture of learning in a program or
organization. Making this kind of process use explicit enlarges the menu of
potential evaluation uses. How important this use of evaluation should be in any
given evaluation is a matter for negotiation with intended users. The practical
implication of an explicit emphasis on creating a learning culture as part of the
process will mean building into the evaluation attention to and training in
evaluation logic and skills.
(1) Commitment to intended use by intended users should be the driving force
in an evaluation. At every decision point - whether the decision concerns
purpose, focus, design, methods, measurement, analysis, or reporting - the
evaluator asks intended users, "How would that affect your use of this
evaluation?"
(2) Strategizing about use is ongoing and continuous from the very beginning
of the evaluation. Use isn't something one becomes interested in at the end
of an evaluation. By the end of the evaluation, the potential for use has
been largely determined. From the moment stakeholders and evaluators
begin interacting and conceptualizing the evaluation, decisions are being
made that will affect use in major ways.
(3) The personal factor contributes significantly to use. The personal factor
refers to the research finding that the personal interests and commitments
of those involved in an evaluation undergird use. Thus, evaluations should
be specifically user-oriented - aimed at the interests and information needs
of specific, identifiable people, not vague, passive audiences.
(4) Careful and thoughtful stakeholder analysis should inform identification of
primary intended users, taking into account the varied and multiple
interests that surround any program, and therefore, any evaluation. Staff,
program participants, directors, public officials, funders, and community
leaders all have an interest in evaluation, but the degree and nature of their
interests will vary. Political sensitivity and ethical judgments are involved in
identifying primary intended users and uses.
(5) Evaluations must be focused in some way - focusing on intended use by
intended users is the most useful way. Resource and time constraints will
make it impossible for any single evaluation to answer everyone's questions
or to give full attention to all possible issues. Because no evaluation can
serve all potential stakeholders' interests equally well, stakeholders
representing various constituencies should come together to negotiate what
issues and questions deserve priority.
(6) Focusing on intended use requires making deliberate and thoughtful
choices. Purposes for evaluation vary and include: judging merit or worth
(summative evaluation); improving programs (instrumental use); and
generating knowledge (conceptual use). Primary information needs and
evaluation uses can change and evolve over time as a program matures.
(7) Useful evaluations must be designed and adapted situationally.
Standardized recipe approaches won't work. The relative value of a
particular utilization focus can only be judged in the context of a specific
program and the interests of intended users. Situational factors affect use.
These factors include community variables, organizational characteristics,
the nature of the evaluation, evaluator credibility, political considerations,
and resource constraints. In conducting a utilization-focused evaluation,
the active-reactive-adaptive evaluator works with intended users to assess
how various factors and conditions may affect the potential for use.
(8) Intended users' commitment to use can be nurtured and enhanced by
actively involving them in making significant decisions about the
evaluation. Involvement increases relevance, understanding and ownership
of the evaluation, all of which facilitate informed and appropriate use.
(9) High quality participation is the goal, not high quantity participation. The
quantity of group interaction time can be inversely related to the quality of
the process. Evaluators conducting utilization-focused evaluations must be
skilled group facilitators.
(10) High quality involvement of intended users will result in high quality, useful
evaluations. Many researchers worry that methodological rigor may be
sacrificed if nonscientists collaborate in making methods decisions. But,
decision makers want data that are useful and accurate. Validity and utility
are interdependent. Threats to utility are as important to counter as threats
to validity. Skilled evaluation facilitators can help nonscientists understand
methodological issues so that they can judge for themselves the trade-offs
involved in choosing among the strengths and weaknesses of design options
and methods alternatives.
(11) Evaluators have a rightful stake in an evaluation in that their credibility and
integrity are always at risk, thus the mandate for evaluators to be active-
reactive-adaptive. Evaluators are active in presenting to intended users
their own best judgments about appropriate evaluation focus and methods;
they are reactive in listening attentively and respectfully to others' concerns;
and they are adaptive in finding ways to design evaluations that incorporate
diverse interests, including their own, while meeting high standards of
professional practice. Evaluators' credibility and integrity are factors
affecting use as well as the foundation of the profession. In this regard,
evaluators should be guided by the profession's standards and principles.
(12) Evaluators committed to enhancing use have a responsibility to train users
in evaluation processes and the uses of information. Training stakeholders
in evaluation methods and processes attends to both short-term and long-
term evaluation uses. Making decision makers more sophisticated about
evaluation can contribute to greater use of evaluation over time. Any
particular evaluation, then, offers opportunities to train evaluation users
and enhance organizational capacity for use - what has come to be called
"process use" - using the evaluation process to support longer term
program and organizational development.
(13) Use is different from reporting and dissemination. Reporting and
dissemination may be means to facilitate use, but they should not be
confused with such intended uses as making decisions, improving
programs, changing thinking, and generating knowledge.
(14) Serious attention to use involves financial and time costs that are far from
trivial. The benefits of these costs are manifested in greater use. These costs
should be made explicit in evaluation proposals and budgets so that
utilization follow-through is not neglected for lack of resources.
learning are necessary conditions for evaluation use. Valuing evaluation cannot
be taken for granted. Nor does it happen naturally. Users' commitment to
evaluation is typically fragile, often whimsical, and must be cultivated like a
hybrid plant that has the potential for enormous yields, but only if properly cared
for, nourished, and appropriately managed. Utilization-focused evaluation makes
such nurturing a priority, not only to increase use of a particular evaluation but
also to build capacity (process use) for utilization of future evaluations.
Different types of and purposes for evaluation call for varying evaluator roles.
Gerald Barkdoll (1980), as associate commissioner for planning and evaluation
of the U.S. Food and Drug Administration, identified three contrasting
evaluator roles. His first type, evaluator as scientist, he found was best fulfilled
by aloof academics who focus on acquiring technically impeccable data while
studiously staying above the fray of program politics and utilization relationships.
His second type he called "consultative" in orientation; these evaluators were
comfortable operating in a collaborative style with policymakers and program
analysts to develop consensus about their information needs and decide jointly
the evaluation's design and uses. His third type he called the "surveillance and
compliance" evaluator, a style characterized by aggressively independent and
highly critical auditors committed to protecting the public interest and assuring
accountability. These three types reflect evaluation's historical development
from three different traditions: (1) social science research, (2) pragmatic field
practice, especially by internal evaluators and consultants, and (3) program and
financial auditing.
When evaluation research aims to generate generalizable knowledge about
causal linkages between a program intervention and outcomes, rigorous
application of social science methods is called for and the evaluator's role as
methodological expert will be primary. When the emphasis is on determining a
program's overall merit or worth, the evaluator's role as judge takes center stage.
If an evaluation has been commissioned because of and is driven by public
accountability concerns, the evaluator's role as independent auditor, inspector,
or investigator will be spotlighted for policymakers and the general public. When
program improvement is the primary purpose, the evaluator plays an advisory
and facilitative role with program staff. As a member of a design team, a
developmental evaluator will play a consultative role. If an evaluation has a
social justice agenda, the evaluator becomes a change agent.
In utilization-focused evaluation, the evaluator is always a negotiator -
negotiating with primary intended users what other roles he or she will play.
Beyond that, all roles are on the table, just as all methods are options. Role
selection follows from and is dependent on intended use by intended users.
Consider, for example, a national evaluation of Food Stamps, a program to feed
low-income families. For purposes of accountability and policy review, the primary
intended users are members of the program's oversight committees in Congress
(including staff to those committees). The program is highly visible, costly, and
controversial, especially because special interest groups differ about its intended
outcomes and who should be eligible. Under such conditions, the evaluation's
credibility and utility will depend heavily on the evaluators' independence,
ideological neutrality, methodological expertise, and political savvy.
Contrast such a national accountability evaluation with an evaluator's role in
helping a small, rural leadership program of the Cooperative Extension Service
increase its impact. The program operates in a few local communities. The
primary intended users are the county extension agents, elected county
commissioners, and farmer representatives who have designed the program.
Program improvement to increase participant satisfaction and behavior change
is the intended purpose. Under these conditions, the evaluation's use will depend
heavily on the evaluator's relationship with design team members. The evaluator
will need to build a close, trusting, and mutually respectful relationship to
effectively facilitate the team's decisions about evaluation priorities and methods
of data collection, and then take them through a consensus-building process as
results are interpreted and changes agreed on.
These contrasting case examples illustrate the range of contexts in which
program evaluations occur. The evaluator's role in any particular study will
depend on matching her or his role with the context and purposes of the
evaluation as negotiated with primary intended users. This is especially true
where the utilization-focused evaluator and primary intended users agree to
include explicit attention to one or more of the four kinds of process use
identified earlier: (1) enhancing shared understandings, (2) reinforcing inter-
ventions, (3) supporting participant engagement, and (4) developing programs
and organizations. Process use goes beyond the traditional focus on findings and
reports as the primary vehicles for evaluation impact. Any evaluation can, and
often does, have these kinds of effects unintentionally or as an offshoot of using
findings. What's different about utilization-focused evaluation is that the
possibility and desirability of learning from evaluation processes as well as from
findings can be made intentional and purposeful - an option for intended users
to consider building in from the beginning. In other words, instead of treating
process use as an informal ripple effect, explicit and up-front attention to the
potential impacts of evaluation logic and processes can increase those impacts
and make them a planned purpose for undertaking the evaluation. In this way
the evaluation's overall utility is increased.
But the utilization-focused evaluator who presents to intended users options
that go beyond narrow and traditional uses of findings has an obligation to
disclose and discuss objections to such approaches. As evaluators explore new
and innovative options, they must be clear that dishonesty, corruption, data
distortion, and selling out are not on the menu. When primary intended users
want and need an independent, summative evaluation, that is what they should
get. When they want the evaluator to act independently in bringing forward
improvement-oriented findings for formative evaluation, that is what they should
get. But those are no longer the only options on the menu of evaluation uses.
New, participatory, collaborative, intervention-oriented, and developmental
approaches are already being used. In utilization-focused evaluation the new
challenge is working with primary intended users to understand when such
approaches are appropriate and helping intended users make informed decisions
about their appropriateness for a specific evaluation endeavor.
Where possible and practical, an evaluation task force can be organized to make
major decisions about the focus, methods and purpose of the evaluation. The
task force is a vehicle for actively involving key stakeholders in the evaluation.
Moreover, the very processes involved in making decisions about an evaluation
will typically increase stakeholders' commitment to use results while also
increasing their knowledge about evaluation, their sophistication in conducting
evaluations, and their ability to interpret findings. The task force allows the
evaluator to share responsibility for decision making by providing a forum for the
political and practical perspectives that best come from those stakeholders who
will ultimately be involved in using the evaluation.
Utilization Effects of Participatory Evaluation

J. BRADLEY COUSINS
University of Ottawa, Canada
Participatory Practice
Previously we concluded that practical participatory evaluation (P-PE) and transformative participatory evaluation (T-PE) differ in their primary functions - practical problem solving versus empowerment - and in their ideological and historical roots, but overlap with one another in their secondary functions and in other areas (Cousins & Whitmore, 1998). Despite differences that are evident
at first blush, P-PE and T-PE have substantial similarities. It is for this reason that
an examination of the relationship between PE and utilization effects should not
be limited to the more utilization-oriented P-PE stream.
Inasmuch as PE requires a partnership between members of the evaluation
community and other stakeholder groups, it is logical that partners bring
different perspectives, knowledge, and expertise to the evaluation. Evaluators
will, in the main, contribute expertise and practical knowledge of evaluation logic
and methods whereas program practitioners and intended beneficiaries will be
more likely to contribute their knowledge of the program, its intended and
unintended effects, and the context in which the program operates. Both sets of
strengths usefully inform PE technical decision making and implementation.
Some time ago we cast participatory evaluation as an extension of the
"conventional" stakeholder-based model and proposed three basic distin-
guishing features: control of technical evaluation decision making, stakeholder
selection, and depth of participation (Cousins & Earl, 1992). Practical
participatory evaluation, it seemed to us, is most potent when stakeholder
participation is limited to "primary users" or to those with a vital interest in the
program (Alkin, 1991); there is provision for partnership between evaluators and
program practitioners and shared control over evaluation technical decision
making; and non-evaluator participants are involved in all phases of the research
including planning, data collection, analysis, interpretation, reporting, and
follow-up. We later observed that these distinguishing features correspond to
basic dimensions or continua along which any given collaborative research
project might be located and made a case for using these dimensions for
differentiating among various forms of collaborative evaluation and between
collaborative and non-collaborative evaluation (Cousins, Donohue, & Bloom,
1996; Cousins & Whitmore, 1998).
[Figure 1 links three panels: (a) factors and conditions, comprising the evaluation context (evaluator expertise, communication and instructional skills, time, resources, role flexibility) and the decision/policy setting (administrative support, micro politics, organizational culture, information needs, impetus, skills); (b) participatory practice processes (control of evaluation technical decision making, stakeholder selection for participation, depth of participation); and (c) knowledge utilization consequences, including evaluation utilization (process use; conceptual, instrumental, and symbolic use of findings), evaluation knowledge production (individual, group, organizational; responsiveness, credibility, enlightenment, sophistication), and liberation (emancipation, empowerment, self-determination).]

Figure 1: Conceptual framework linking participatory evaluation and knowledge utilization
If it is plausible to accept that these three dimensions are useful for
differentiating collaborative approaches to systematic inquiry, it might be useful
to consider the possibility that they may be orthogonal. That is to say, decisions
about who participates, to what extent they participate, and who controls
evaluation technical decision making can, in theory, be made independently
from one another. Empirically, such independence seems unlikely but
heuristically the distinction is a useful one. Figure 2 represents each of the
described dimensions of form in collaborative inquiry in three-dimensional space.
This device may be used to consider and explicate the collaborative processes
associated with a variety of genres of collaborative inquiry. Any given
example may be considered in terms of its location on each of the dimensions or
continua, thereby yielding its geometric coordinate location in the figure. The
sectors are defined by axes (a) control of evaluation process, (b) stakeholder selec-
tion for participation, and (c) depth of participation. Each dimension is divided at
the point of intersection with the other dimensions.
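The notion of a geometric coordinate location can be made concrete by treating each continuum as a numeric scale. The short sketch below (in Python) is purely illustrative: the class name, the -1 to +1 scaling, and the example values are hypothetical devices of this presentation, not part of the original framework.

from dataclasses import dataclass

# Illustrative only: each dimension is scored on a -1.0 to +1.0 continuum,
# with 0.0 marking the point where the three axes intersect.
@dataclass
class CollaborativeInquiryProfile:
    control: float    # -1.0 researcher controlled ... +1.0 practitioner controlled
    selection: float  # -1.0 primary users only ... +1.0 all legitimate groups
    depth: float      # -1.0 limited involvement ... +1.0 deep participation

    def coordinates(self):
        """Return the project's location in the three-dimensional space."""
        return (self.control, self.selection, self.depth)

# A hypothetical P-PE project: shared control, primary users, deep participation.
p_pe = CollaborativeInquiryProfile(control=0.0, selection=-0.8, depth=0.9)
print(p_pe.coordinates())  # (0.0, -0.8, 0.9)

Locating two or more projects in this way makes explicit where, and on which continuum, they differ.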
In differentiating aspects of form and process of P-PE and T-PE, it may be
concluded that the approaches are quite similar, with the exception of deciding
who participates in the evaluation exercise. In the Cousins and Earl (1992, 1995)
approach to P-PE, the emphasis is on fostering program or organizational
decision making and problem solving, and evaluators tend to work in partnership with program practitioners.
[Figure 2 locates collaborative inquiry in three-dimensional space: the control axis runs from researcher controlled to practitioner controlled, the stakeholder selection axis from primary users to all legitimate groups, and the depth axis toward deep participation.]
Several variables fall into this panel of the conceptual framework. These
variables are shown in Figure 1 to be located in one of two components, one
associated with the context within which the evaluation takes place, the other
with the characteristics of the evaluator or trained evaluation team (see also
Cousins & Earl, 1992, 1995). Regarding the former, the most potent variables
are likely to be administrative or organizational support for the evaluation; the
micro-political processes and influences; organizational culture; the information
needs of the organization; the impetus of staff to participate or otherwise
embrace the evaluation enterprise; and the level of evaluation skill development
among staff. Important characteristics of the evaluator or team include level of
expertise, communication skills, instructional skills, availability of resources
including time and support functions, and the flexibility of evaluators in the face
of the challenges of participatory models.
Most of these variables would be considered antecedent, although it is
important to recognize the recursive nature of the framework. For example, if
participatory evaluation leads to the development of evaluation learning systems
within an organization, and ultimately enhances organizational learning, naturally
the conditions associated with the decision or policy setting component of this
panel will be affected. It might also be noted that similar consequences influence
the ongoing nature of participation in evaluation. As program practitioner staff
and intended beneficiaries, for example, develop their research skills and
capacity for systematic inquiry, they are more likely to participate in evaluation
activities in a deeper, more penetrating way.
Let us now turn to a review of the scholarly and professional literature
concerning PE and its effects on utilization. The basis for the ensuing sections is
a recently completed comprehensive review and synthesis of the empirical
literature for the period 1996-2000 (Cousins, 2000). In that document I
systematically integrated and synthesized knowledge from 29 empirical studies
reporting on PE processes and effects. I now summarize the main findings of this
prior work.
The orienting questions for this section are as follows. What are the salient
features of PE implementation? What are the main problems or issues
associated with implementation or factors and conditions influencing it? What
are the supporting conditions and contexts for implementation?
In a survey of evaluators, Preskill and Caracelli (1997) observed that most wanted to maintain control over evaluation decision making, but experienced evaluators were more likely to acknowledge the potential contribution of stakeholders and the importance of sharing control. Similar findings emerged from a study of collaborative evaluation practices by Cousins et al. (1996). Evaluators were
decidedly of the view that the control of evaluation decision making should rest
with the evaluator. Yet proponents of both P-PE and T-PE argue for striking a
balance of control among evaluators and members of other stakeholder groups.
In a case narrative on a community services intervention for children and families reported by Schnoes, Murphy-Berman, and Chambers (2000), difficulties in striking this balance emerged. The authors observed that an initial positive reception to the approach and feelings of optimism about effecting change diminished over time. Often program people were more interested in discussing program management issues at meetings because they did not often get the opportunity to do so. Compared with their interest in program implementation issues, evaluation was not a priority. "As evaluators we ended up being more directive and acting more autonomously in this effort than had been originally envisioned" (p. 59). In contrast to this result, Lee (1999) described the
devolution of control of the evaluation function by the evaluator in a large-scale
school improvement program in Manitoba, Canada. Given the expanding nature
of the program, it was necessary for Lee as evaluation consultant to become less
involved at the individual project level. As she puts it, "it has been possible to
move teachers from a dependence on the evaluation consultant ... to a
relationship where the evaluation consultant is a resource and support for the
school's ongoing inquiry and reflection" (p. 174).
In a multiple case study of P-PE in school districts I found that too much control on the part of the evaluator could be detrimental (Cousins, 1996). When involvement was too close, the danger of creating false expectations for the evaluation among members of the intended audience was high. On the
other hand, in cases where involvement was less direct, impact was quite good. I
concluded that transfer of research skills and knowledge to non-evaluator
stakeholders appears to be possible in indirect ways.
Finally, in a longitudinal study by Robinson (1998) the disparate perceptions
of evaluators and non-evaluator participants are highlighted. In this case, the
evaluator was committed to balanced control in the process and technical
decision making. The focus for the evaluation was a Canadian national training
organization with members of the participatory evaluation team located in
various regions across the country. Participant observations revealed that
Robinson, as the principal evaluator, was comfortable that control of technical
decision making had been shared as was intended. However, independently
collected interview data were discordant with this perception. When these data
were examined at a later point in time, the author was surprised to learn that
non-evaluator members of the evaluation team tended to defer to him as the
evaluation expert and perceived him to be in control of technical decisions.
Although in some ideal sense the PE process can involve negotiation and
involvement by all levels, in reality such involvement is difficult because of
pre-existing distrust or conflict. In some communities, involving local citi-
zens in a process that they could consider their own meant excluding other
stakeholders who were seen as always being in charge or not to be trusted.
However, once these other stakeholders were excluded, or not actively
included, it became difficult to gain their support for the process. (p. 91)
Here may be seen the power of pre-existing tensions to set the tone for both the
participatory process and, ultimately, its implications for outcomes. In contrast,
however, Rowe and Jacobs (1998) describe a successful case application in the
context of a substance abuse prevention program in a Native American
community. In this case, the issue of power differentials did not surface as being
particularly problematic.
Depth of Participation
• Intended users have learned (or will learn) about their practice;
• Intended users have based (or will base) significant decisions on this
information;
• Data have helped (or will help) intended users incrementally improve their performance.2
Similar instrumental and conceptual effects were observed in other studies, for example, by Torres et al. (2000).
Process Use
Process use was by far the most evident type of use arising from PE
activities. Several studies identified the development of research skills among
participating stakeholders as an effect attributable to the evaluation. Participant
benefits were observed in capacity building and the development of research
skills as a result of the collaborative learning teams described by Gaventa et al.
(1998) in a community revitalization initiative context. According to the authors,
"Each team had members who grew in confidence and skill and who became
involved in other public roles in the community" (p. 88). Similarly, Johnson et al.
(1998) reflected on the development of research acumen of some participating
members. Further, Lee (1999) observed a developing commitment on the part of
teachers to integrating evaluation into the culture.
Despite the rewarding professional development experience for many non-
evaluator participants, tensions can arise as well. I noted that teachers in more
than one school system site appreciated the opportunity to participate in
constructing relevant knowledge for decision making and program improvement
but at the same time were extremely overburdened by the responsibility in view
of all of the other routine demands on their time (Cousins, 1996). The tension
between labor intensiveness and personal empowerment is a very real issue for
many participants.
Other studies reported different forms of empowerment for stakeholders such
as enriched understanding of circumstances, the development of a deeper
understanding of the perspectives of others, and the willingness to take action or
to move in new directions (Abma, 2000; Coupal & Simoneau, 1998; Lau &
LeMahieu, 1997; MacNeil, 2000). As an example, Lau and LeMahieu (1997)
described how teachers no longer feared experimenting with new ways or being
accountable for outcomes after having participated in the program evaluation.
Reportedly teachers embraced the notion that evaluation aids rather than
inhibits the construction of new knowledge within the project.
Still other studies reported changes and effects at the organizational level.
Such studies typically had a P-PE focus, although this was not exclusively the
case. Robinson (1998) documented quite remarkable organizational changes to a national training organization as a consequence of a two-year participatory curriculum renewal project. The changes were not only structural but
suggested deep levels of organizational learning. Similarly, I observed
organizational structural changes in a school system that followed a participatory
evaluation project (Cousins, 1996). The rural east-central Ohio school district
created a research officer position within the central administration and hired into it the chair (a special education teacher) of the participatory evaluation project.
In our survey of evaluators we observed that questions about organizational
effects of a specific collaborative evaluation project yielded responses that were
less salient than those inquiring about instrumental or conceptual utilization of
findings (Cousins et al., 1996). Nevertheless, we did observe some
indication of process use. Lee (1999) described how schools have developed
capacity to the extent that evaluation is integrated into the culture of the
organization. Her remarks about the link of evaluation to accountability
demands are particularly revealing:
Schools that have taken hold of the role of evaluator - those who really
own and use their school data - are best able to tie internal accountability
to external accountability. The concept is that if people in schools can see
the relevance of data collection in relation to their own goals and
outcomes, they will begin to value and use the evaluation process ... [and]
have an array of information to support demands for external account-
ability (pp. 175-176).
The present review yielded many issues and tensions that will serve as a basis for
ongoing research and reflection on evaluator practical responsibilities and
choices. I selected a few that struck me as particularly interesting and potentially
useful for practical consideration. Given the limited development of research-
based knowledge in the domain, for each issue I offer cautious recommendations
for practice.
Quality Standards
Some evidence suggested that participation may wane as the project continues.
This may be due to limited skills, discomfort arising from conflict and power
differentials, competing demands on time, and the like. Evaluators need to be
sensitive to the perspectives held by varying non-evaluator participants and to
develop strategies to enhance interest and engagement (e.g., specialized training
in evaluation methods and logic, division of labour according to expertise and
interest, provision of suitable forums for open discussion and deliberation). It
may not always be desirable or suitable to involve stakeholders in highly
technical evaluation tasks. In some instances, in fact, it may be troubling to do so
since diminishing motivation to participate may be the result. There will be
implications for divesting control of evaluation, however, in situations where
evaluators take charge of the more technical duties. It will be useful also to
consider variation in non-evaluator participants' role both within and between
participatory evaluation projects. As shared by an anonymous reviewer of this
chapter:
... some examples that come to mind are participant observer, translator, and conveyor of findings to particular groups, direct users of findings, liaison between the evaluation team and those who respond to evaluation queries, members of advocacy teams engaged in generating competing alternative program designs, and report writer. (Anonymous personal communication, February 12, 2001)
ENDNOTES
1 I wish to thank Paul Brandon, Jean King and Lyn Shulha for helpful comments on a prior draft of
this chapter. Critiques by Marvin Alkin and an anonymous reviewer were also helpful in shaping
the arguments herein. I assume full responsibility for any lingering shortcomings.
2 It should be noted that similar ratings from program practitioners who worked on the same
collaborative evaluation project were systematically lower, suggesting that evaluator ratings may
have been somewhat inflated (Cousins, 2001).
REFERENCES
Abma, T.A. (2000). Stakeholder conflict: A case study. Evaluation and Program Planning, 23, 199-210.
Alkin, M.C. (1991). Evaluation theory development: II. In M.W. McLaughlin, & D.C. Phillips (Eds.), Evaluation and education: At quarter century. Chicago, IL: The University of Chicago Press.
Barrington, G.V. (1999). Empowerment goes large scale: The Canada prenatal nutrition experience. Canadian Journal of Program Evaluation [Special Issue], 179-192.
Brandon, P.R. (1999). Involving program stakeholders in reviews of evaluators' recommendations for program revisions. Evaluation and Program Planning, 22(3), 363-372.
Brandon, P. (1998). Stakeholder participation for the purpose of helping ensure evaluation validity: Bridging the gap between collaborative and non-collaborative evaluations. American Journal of Evaluation, 19(3), 325-337.
Chambers, R. (1997). Whose reality counts? Putting the last first. London: Intermediate Technology Publications.
Coupal, F., & Simoneau, M. (1998). A case study of participatory evaluation in Haiti. In E. Whitmore (Ed.), Understanding and practicing participatory evaluation. New Directions in Evaluation, 80, 69-80.
Cousins, J.B. (2001). Do evaluator and program practitioner perspectives converge in collaborative evaluation? Canadian Journal of Program Evaluation, 16(2), 113-133.
Cousins, J.B. (2000). Understanding participatory evaluation for knowledge utilization: A review and synthesis of current research-based knowledge. Working paper. Ottawa: University of Ottawa.
Cousins, J.B. (1996). Consequences of researcher involvement in participatory evaluation. Studies in Educational Evaluation, 22(1), 3-27.
Cousins, J.B., Donohue, J.J., & Bloom, G.A. (1996). Collaborative evaluation in North America: Evaluators' self-reported opinions, practices, and consequences. Evaluation Practice, 17(3), 207-226.
Cousins, J.B., & Earl, L.M. (1992). The case for participatory evaluation. Educational Evaluation and Policy Analysis, 14(4), 397-418.
Cousins, J.B., & Earl, L.M. (Eds.). (1995). Participatory evaluation in education: Studies in evaluation use and organizational learning. London: Falmer Press.
Cousins, J.B., & Leithwood, K.A. (1986). Current empirical research in evaluation utilization. Review of Educational Research, 56(3), 331-364.
Cousins, J.B., & Walker, C.A. (2000). Predictors of educators' valuing of systematic inquiry in schools. Canadian Journal of Program Evaluation [Special Issue], 25-52.
Cousins, J.B., & Whitmore, E. (1998). Framing participatory evaluation. In E. Whitmore (Ed.), Understanding and practicing participatory evaluation. New Directions in Evaluation, 80, 3-23.
Dunn, W.N., & Holzner, B. (1988). Knowledge in society: Anatomy of an emergent field. Knowledge in Society, 1(1), 6-26.
Fetterman, D.M. (2000). Foundations of empowerment evaluation. Thousand Oaks, CA: Sage.
Fetterman, D.M., Kaftarian, S., & Wandersman, A. (Eds.). (1996). Empowerment evaluation: Knowledge and tools for self-assessment and accountability. Thousand Oaks, CA: Sage.
Folkman, D.V., & Rai, K. (1997). Reflections on facilitating a participatory community self-evaluation. Evaluation and Program Planning, 20(4), 455-465.
Gaventa, J. (1993). The powerful, the powerless and the experts: Knowledge struggles in an information age. In P. Park, M. Brydon-Miller, B.L. Hall, & T. Jackson (Eds.), Voices of change: Participatory research in the United States and Canada. Toronto: OISE Press.
Gaventa, J., Creed, V., & Morrissey, J. (1998). Scaling up: Participatory monitoring and evaluation of a federal empowerment program. In E. Whitmore (Ed.), Understanding and practicing participatory evaluation. New Directions in Evaluation, 80, 81-94.
Greene, J.C. (1990). Technical quality vs. responsiveness in evaluation practice. Evaluation and Program Planning, 13, 267-274.
Greene, J. (2000). Challenges in practicing deliberative democratic evaluation. In K.E. Ryan, & L. DeStefano (Eds.), Evaluation as a democratic process: Promoting inclusion, dialogue and deliberation. New Directions in Evaluation, 85, 13-26.
Johnson, R.L., Willeke, M.J., & Steiner, D.J. (1998). Stakeholder collaboration in the design and implementation of a family literacy portfolio assessment. American Journal of Evaluation, 19(3), 339-353.
King, J. (1998). Making sense of participatory evaluation practice. In E. Whitmore (Ed.), Understanding and practicing participatory evaluation. New Directions in Evaluation, 80, 56-68.
Labrecque, M. (1999). Development and validation of a needs assessment model using stakeholders involved in a university program. Canadian Journal of Program Evaluation, 14(1), 85-102.
Lau, G., & LeMahieu, P. (1997). Changing roles: Evaluator and teacher collaborating in school change. Evaluation and Program Planning, 20, 7-15.
Lee, L. (1999). Building capacity for school improvement through evaluation: Experiences of the Manitoba School Improvement Program Inc. Canadian Journal of Program Evaluation, 14(2), 155-178.
Lehoux, P., Potvin, L., & Proulx, M. (1999). Linking users' views with utilization processes in the evaluation of interactive software. Canadian Journal of Program Evaluation, 14(1), 117-140.
MacDonald, B. (1977). A political classification of evaluation studies. In D. Hamilton, D. Jenkins, C. King, B. MacDonald, & M. Parlett (Eds.), Beyond the numbers game. London: Macmillan.
MacNeil, C. (2000). Surfacing the realpolitik: Democratic evaluation in an antidemocratic climate. In K.E. Ryan, & L. DeStefano (Eds.), Evaluation as a democratic process: Promoting inclusion, dialogue and deliberation. New Directions in Evaluation, 85, 51-62.
Mark, M., & Shotland, R.L. (1985). Stakeholder-based evaluation and value judgments. Evaluation Review, 9(5), 605-626.
Mathie, A., & Greene, J.C. (1997). Stakeholder participation in evaluation: How important is diversity? Evaluation and Program Planning, 20(3), 279-285.
McTaggart, R. (1991). When democratic evaluation doesn't seem democratic. Evaluation Practice, 12(1), 9-21.
Owen, J.M., & Lambert, F.C. (1995). Roles for evaluation in learning organizations. Evaluation, 1(2), 237-250.
Patton, M.Q. (1999). Organizational development and evaluation. Canadian Journal of Program Evaluation [Special Issue], 93-114.
Patton, M.Q. (1998). Discovering process use. Evaluation, 4(2), 225-233.
Patton, M.Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Newbury Park, CA: Sage.
Preskill, H. (1994). Evaluation's role in enhancing organizational learning. Evaluation and Program Planning, 17(3), 291-297.
Preskill, H., & Caracelli, V. (1997). Current and developing conceptions of use: Evaluation use TIG survey results. Evaluation Practice, 18(3), 209-225.
Preskill, H., & Torres, R. (1998). Evaluative inquiry for organizational learning. Thousand Oaks, CA: Sage.
Pursley, L.A. (1996). Empowerment and utilization through participatory evaluation. Unpublished doctoral dissertation, Cornell University, Ithaca, NY.
Robinson, T. (1998). An investigation of the organizational effects of internal participatory evaluation. Unpublished doctoral dissertation, University of Ottawa, Ottawa, Canada.
Rowe, W.E., & Jacobs, N.F. (1998). Principles and practices of organizationally integrated evaluation. Canadian Journal of Program Evaluation, 13(1), 115-138.
Schnoes, C.J., Murphy-Berman, V., & Chambers, J.M. (2000). Empowerment evaluation applied: Experiences, analysis, and recommendations from a case study. American Journal of Evaluation, 21(1), 53-64.
Shulha, L., & Cousins, J.B. (1997). Evaluation utilization: Theory, research and practice since 1986. Evaluation Practice, 18(3), 195-208.
Torres, R., & Preskill, H. (1999). Ethical dimensions of stakeholder participation and evaluation use. In J.L. Fitzpatrick, & M. Morris (Eds.), Current and emerging ethical challenges in evaluation. New Directions in Evaluation, 82, 57-66.
Torres, R., Preskill, H., & Piontek, M.E. (1996). Evaluation strategies for communication and reporting. Thousand Oaks, CA: Sage.
Torres, R.T., Stone, S.P., Butkus, D.L., Hook, B.B., Casey, J., & Arens, S.A. (2000). Dialogue and reflection in a collaborative evaluation: Stakeholder and evaluator voices. In K.E. Ryan, & L. DeStefano (Eds.), Evaluation as a democratic process: Promoting inclusion, dialogue and deliberation. New Directions in Evaluation, 85, 27-38.
Turnbull, B. (1999). The mediating effect of participation efficacy on evaluation use. Evaluation and Program Planning, 22, 131-140.
Weiss, C.H. (1983). Ideology, interests, and information. In D. Callahan, & B. Jennings (Eds.), Ethics, the social sciences and policy analysis. New York: Plenum Press.
Section 4
M. F. SMITH
The Evaluators' Institute, DE, and University of Maryland (Emerita), MD, USA
One might argue that this entire Handbook defines what this one section claims
as its territory, i.e., the profession of evaluation. The theories we espouse and the
methods and procedures we use clearly shape the direction in which our profession moves. As old theories are challenged and new ones developed and as
procedures change and new ones are practiced, the profession changes. What
sets (or should set) the tone and tolerance for directional flow are our values,
beliefs, goals, standards, and ethics. That is why this section starts with a focus on
professional standards and moves immediately to ethical considerations in evalu-
ation. The standards and principles described by Daniel Stufflebeam are more
like external barometers by which to judge evaluation practice; they represent
"commonly agreed to" ideas about practice. Ethical considerations as discussed
by Michael Morris, on the other hand, seem to be decided more by internal
norms. Somewhat like beauty, what one sees as ethical, another may not. What
one sees as a moral or ethical dilemma may not be problematic at all for others,
or may be regarded as methodological considerations by others. Both of these
types of norm are critical to and problematic for the profession. Blaine Worthen's chapter follows, in which he identifies challenges the evaluation community must
pursue if it is "ever to become a mature, widely recognized and understood profes-
sion." To the ideas of standards and ethics, he adds the challenge of establishing
qualifications for practice - that is, inaugurating quality control mechanisms to limit
those who practice evaluation to those who are competent to do so - what any
"true" profession seeks to do.
The next three chapters in the section select a different focus. In simple
terms, one might characterize them as "what influenced what we now are"
(where Lois-ellin Datta explores the impact of the government on evaluation);
"how to become something different" (where Hallie Preskill proposes that
evaluation strive to become a "sustainable learning community" and offers a
method to get there); and "where we may be headed" (where I look for lessons
learned from two efforts by leading experts in the field to preview the future of
evaluation). I draw upon the writings of 15 leaders published in 1994 as a special issue of the journal Evaluation Practice and of 23 published in 2001 as a sequel issue of The American Journal of Evaluation.
• increase the common ground within which ethical discussion can take place;
• seek to understand stakeholder views of ethical problems in evaluation and
seek a shared vision of what is ethically required in the stakeholder involve-
ment/utilization arenas, e.g., during the entry/contracting stage of an evaluation
they should raise ethical issues, solicit concerns of stakeholders, and develop
a framework for how these will be addressed; and
• examine evaluations to assure that they "live up to" professional guidelines for
ethical practice.
She proposes a set of such questions to ask to determine the extent to which
the profession is maturing and "taking us in the right direction." She admits that
coming up with answers will not be a simple task, but we must start somewhere.
The choice, as she noted earlier in her chapter, is either "to become a sustainable learning community where we grow and learn together, or ... allow the profession to mature in unplanned, uncoordinated, and haphazard ways that in the end creates Lone Rangers and disconnected small enclaves of evaluation expertise."
The final words in this section on the profession are given to soothsayers in
1994 and 2001 who wrote essays on the future of the field of evaluation for the
first special issue of AEA's journal called Evaluation Practice (EP) (1994) and the sequel seven years later (2001) for the American Journal of Evaluation (AJE). I
generated the first set of predictions from 16 eminent leaders in the field of
evaluation and synthesized their thoughts as the first paper in the 1994 EP
volume. Mel Mark, the current editor of AJE, and I worked together on the
sequel with a cross-section of 23 evaluators. The 1994 effort included evaluators
in the United States only, but in 2001, several writers were from outside the U.S.,
a statement in itself about how the profession has expanded globally. The
purpose of my paper in this Handbook, The Future of the Evaluation Profession,
is to describe what I see as the most salient issues raised by both sets of writers
and to offer a few observations of my own about the importance some of these
have for our field.
The first topic discussed is the professionalization of the field. Authors in 1994
seemed to be more concerned about whether evaluation had attained the status
of "profession" than did those in 2001, hardly any of whom mentioned the topic.
I surmised that the reason cannot be that we have solved all the problems
identified in 1994, for we have not. For example, we still have not figured out
how to stem the flow of the untrained/inexperienced/unqualified into the field, a
topic that Worthen discusses in detail in his chapter here.
The second topic was about issues that (continue to) divide us. For example,
most 2001 authors thought the qualitative vs. quantitative debate was all but
gone. But it seems to have "mutated rather than muted" and is alive and well
under another banner, this time as objectivity vs. advocacy, or independence vs.
involvement. I discuss these controversies, along with evaluation vs. program
development, and conclude that the advocacy issue is the one that scares me
most for the future of evaluation. Recommendations are made for reducing the
divisiveness in the field, but not much hope is offered for a permanent truce.
Evaluation's purpose was mentioned "as a topic" by writers more in 1994 than
in 2001, but in both situations the disparities in points of view were extreme.
Observers in 1994 noted a shift in focus of evaluation from outcomes/impact to
implementation. In 2001 the shift might be described as movement from inquiry
to advocacy.
Much is new in evaluation: new fields have emerged (e.g., metaevaluation),
old views have re-emerged (e.g., constructivism), and many new/different theories
and approaches have been developed (e.g., collaborative and participatory).
Other changes also present challenges for our future: demand for evaluation
services has increased and is expected to continue; expectations for level of
evaluator performance are higher; professionals from disciplines outside the
classic social sciences are performing more evaluation tasks; more people who
have no knowledge or experience in evaluation are being involved in all aspects
of evaluation design and implementation; more end users of data are demanding
help in understanding, interpreting, and applying evaluation findings; new
technology is driving service delivery and collection of data for decision making;
evaluation has spread around the globe with a 1000 percent increase in the
number of regional/national evaluation organizations; politics have become
more prominent with heightened interest in transparency and accessibility of
evaluators' products and processes and in accountability for evaluation
outcomes; the corporate mantra for "best practices" is affecting the knowledge
stakeholders find acceptable; and interconnected with most all of the above is
the growth in the performance measurement movement.
Evaluators were criticized for not doing more research and creating more
knowledge about evaluation processes, about program effectiveness, and methods
for routinely assessing the impact of everyday practical programs on the social
conditions they address. Related to the latter was the concern that so many -
without questioning the technique - are jumping onto the bandwagon of using
simplistic indicators of performance to portray complex, contextual knowledge
of programs.
Evaluator education/training was given little prominence by 1994 and 2001
writers, even though the situations and challenges they identified - all those I've
mentioned above - sum to an astounding need for evaluator training. One
prediction was that there will be little if any demand for graduate training
programs in the future and those that are supported will likely not include
doctoral level preparation.
As this summary of my summary of all the papers from 1994 and 2001 attests,
seven years have made a difference in evaluation's outward and inward expansion.
Our increasing diversity can become strengths for unification or agents for
polarization - and we could move as easily in one direction as another. Some
think we are at a pivotal point in our field. I guess the only advice I have is to
note that it is not what all the contributors to these two volumes think that
decides evaluation's direction but rather it is the choices each one of us makes
every time we perform an evaluation task.
NOTES
1 The other was called the Evaluation Network (ENet). ERS and ENet combined forces in 1985 to
become the American Evaluation Association (AEA).
2 This was in the 1980s and 1990s when Eleanor Chelimsky gave leadership to the Program Evaluation
and Methodology Division.
13
Professional Standards and Principles for Evaluations
DANIEL L. STUFFLEBEAM
The Evaluation Center, Western Michigan University, MI, USA
Members of most professions and many other public service fields are expected
to comply with given standards or codes of performance and service. The
standards and codes have several important functions.
Feasibility
The feasibility standards are intended to ensure that an evaluation will be realistic, prudent,
diplomatic, and frugal.
F1 Practical Procedures.
The evaluation procedures should be practical, to keep disruption to a minimum while needed
information is obtained.
F2 Political Viability.
The evaluation should be planned and conducted with anticipation of the different positions of
various interest groups, so that their cooperation may be obtained and so that possible attempts by
any of these groups to curtail evaluation operations or to bias or misapply the results can be averted
or counteracted.
F3 Cost Effectiveness.
The evaluation should be efficient and produce information of sufficient value, so that the resources
expended can be justified.
Propriety
The propriety standards are intended to ensure that an evaluation will be conducted legally, ethically,
and with due regard for the welfare of those involved in the evaluation, as well as those affected by
its results.
P1 Service Orientation.
Evaluations should be designed to assist organizations to address and effectively serve the needs of
the full range of targeted participants.
P2 Formal Obligations.
Obligations of the formal parties to an evaluation (what is to be done, how, by whom, when) should
be agreed to in writing, so that these parties are obliged to adhere to all conditions of the agreement
or formally to renegotiate it.
P3 Rights of Human Subjects.
Evaluations should be designed and conducted to respect and protect the rights and welfare of
human subjects.
P4 Human Interactions.
Evaluators should respect human dignity and worth in their interactions with other persons
associated with an evaluation, so that participants are not threatened or harmed.
P5 Complete and Fair Assessment.
The evaluation should be complete and fair in its examination and recording of strengths and
weaknesses of the program being evaluated, so that strengths can be built upon and problem areas
addressed.
P6 Disclosure of Findings.
The formal parties to an evaluation should ensure that the full set of evaluation findings along with
pertinent limitations are made accessible to the persons affected by the evaluation and any others
with expressed legal rights to receive the results.
P7 Conflict of Interest.
Conflict of interest should be dealt with openly and honestly, so that it does not compromise the
evaluation processes and results.
P8 Fiscal Responsibility.
The evaluator's allocation and expenditure of resources should reflect sound accountability
procedures and otherwise be prudent and ethically responsible, so that expenditures are accounted
for and appropriate.
Accuracy
The accuracy standards are intended to ensure that an evaluation will reveal and convey technically
adequate information about the features that determine worth or merit of the program being
evaluated.
Al Program Documentation.
The program being evaluated should be described and documented clearly and accurately, so that
the program is clearly identified.
A2 Context Analysis.
The context in which the program exists should be examined in enough detail, so that its likely
influences on the program can be identified.
A3 Described Purposes and Procedures.
The purposes and procedures of the evaluation should be monitored and described in enough detail,
so that they can be identified and assessed.
A4 Defensible Information Sources.
The sources of information used in program evaluation should be described in enough detail, so that
the adequacy of the information can be assessed.
A5 Valid Information.
The information-gathering procedures should be chosen or developed and then implemented, so
that they will ensure that the interpretation arrived at is valid for the intended use.
A6 Reliable Information.
The information-gathering procedures should be chosen or developed and then implemented, so
that they will ensure that the information obtained is sufficiently reliable for the intended use.
A7 Systematic Information.
The information collected, processed, and reported in an evaluation should be systematically
reviewed, and any errors found should be corrected.
A8 Analysis of Quantitative Information.
Quantitative information in an evaluation should be appropriately and systematically analyzed, so
that evaluation questions are effectively answered.
A9 Analysis of Qualitative Information.
Qualitative information in an evaluation should be appropriately and systematically analyzed, so that
evaluation questions are effectively answered.
A10 Justified Conclusions.
The conclusions reached in an evaluation should be explicitly justified, so that stakeholders can
assess them.
A11 Impartial Reporting.
Reporting procedures should guard against distortion caused by personal feelings and biases of any
party to the evaluation, so that evaluation reports fairly reflect the evaluation findings.
A12 Metaevaluation.
The evaluation itself should be formatively and summatively evaluated against these and other
pertinent standards, so that its conduct is appropriately guided and, on completion, stakeholders can
closely examine its strengths and weaknesses.
Source: Joint Committee on Standards for Educational Evaluation (1994).
Table 2. Correspondence of the program evaluation standards to ten evaluation tasks: (1) deciding whether to evaluate; (2) defining the evaluation problem; (3) designing the evaluation; (4) collecting information; (5) analyzing information; (6) reporting the evaluation; (7) budgeting the evaluation; (8) contracting for the evaluation; (9) managing the evaluation; and (10) staffing the evaluation.

[The body of Table 2 marks with an x the tasks to which each standard applies. The rows recoverable here are U5 Report Clarity; U6 Report Timeliness and Dissemination; U7 Evaluation Impact; F1 Practical Procedures; F2 Political Viability; F3 Cost Effectiveness; P1 Service Orientation; P2 Formal Agreements; P3 Rights of Human Subjects; P4 Human Interactions; P5 Complete and Fair Assessment; P6 Disclosure of Findings; P7 Conflict of Interest; P8 Fiscal Responsibility; A1 Program Documentation; A2 Context Analysis; A3 Described Purposes and Procedures; A4 Defensible Information Sources; A5 Valid Information; A6 Reliable Information; A7 Systematic Information; A8 Analysis of Quantitative Information; A9 Analysis of Qualitative Information; A10 Justified Conclusions; A11 Impartial Reporting; and A12 Metaevaluation, the last of which applies to all ten tasks.]
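The logic of the table - each standard mapped to the evaluation tasks it informs - can also be expressed as a simple lookup structure. The sketch below (in Python) is hypothetical: only the A12 row is encoded, since it is the only row whose full pattern of marks survives in this copy.

# Hypothetical encoding of Table 2: each standard maps to the set of
# numbered evaluation tasks (1-10) to which it applies.
TASKS = {
    1: "Deciding whether to evaluate",
    2: "Defining the evaluation problem",
    3: "Designing the evaluation",
    4: "Collecting information",
    5: "Analyzing information",
    6: "Reporting the evaluation",
    7: "Budgeting the evaluation",
    8: "Contracting for the evaluation",
    9: "Managing the evaluation",
    10: "Staffing the evaluation",
}

STANDARD_TASKS = {
    "A12 Metaevaluation": set(range(1, 11)),  # applies to all ten tasks
    # ... remaining rows would be transcribed from the full table
}

def standards_for_task(task_number):
    """Return the standards marked as applicable to the given task."""
    return [s for s, tasks in STANDARD_TASKS.items() if task_number in tasks]

print(standards_for_task(4))  # ['A12 Metaevaluation']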
and accuracy. But if there is a good prospect for utility, the evaluator should
systematically turn to consideration of the full set of standards.
The original ERS Standards for Program Evaluations (ERS Standards Committee,
1982) were developed to address program evaluations across a broad spectrum,
e.g., community development, control and treatment of substance abuse, educa-
tion, health, labor, law enforcement, licensing and certification, museums, nutrition,
public media, public policy, public safety, social welfare, and transportation. In
July 1977, the ERS president appointed a seven-member committee to develop
the ERS standards. All committee members were evaluation specialists, with
Scarvia B. Anderson serving as chair. This committee collected and studied
pertinent materials, such as the draft standards then being developed by the
Joint Committee on Standards for Educational Evaluation. Since ERS's focus
was considerably wider than educational evaluations, the ERS Standards
Committee decided to prepare a set of general standards that this committee
deemed to be broader in applicability than those being devised by the Joint
Committee on Standards for Educational Evaluation. The ERS Standards
Committee then produced a draft set of standards and circulated it mainly to
ERS evaluation specialists. Using the obtained reactions, the ERS committee
finalized and published its standards in September 1982.
The ERS standards are 55 admonitory, brief statements presented in about 9
pages of text. An example is "1. The purposes and characteristics of the program
or activity to be addressed in the evaluation should be specified as precisely as
possible." The 55 standards are divided into the following 6 categories.
Structure and Design

The six standards concerned with structure and design note that evaluation plans
should both prescribe a systematic, defensible inquiry process and take into
account the relevant context. The key requirement here is to design the
evaluation to produce defensible inferences about the value of the program
being studied. The plan should clearly present and justify the basic study design,
sampling procedures, data collection instruments, and arrangements for the
needed cooperation of program personnel and other participants in the
evaluation.
Data Collection and Preparation

The 12 standards in this section call for advance planning of the data collection
process. The plan should provide for selecting and training data collectors; pro-
tecting the rights of data sources and human subjects; monitoring, controlling,
and documenting data collection; controlling bias; assessing validity and relia-
bility of procedures and instruments; minimizing interference and disruption to
the program under study; and controlling access to data.
Data Analysis and Interpretation

Nine standards essentially call for tempering the data analysis and interpretation
within the constraints of the evaluation design and data actually collected. These
standards require evaluators to match the analysis procedures to the evaluation
purposes; describe and justify use of the particular analysis procedures; employ
appropriate units of analysis; investigate both practical and statistical significance
of quantitative findings; bolster cause-and-effect interpretations by referencing
the design and by eliminating plausible rival explanations; and clearly distinguish
among objective findings, opinions, judgments, and speculation.
Use of Results
The concluding "Use of Results" section includes six standards. These empha-
size that evaluators should carefully attend to the information needs of potential
users throughout all phases of the evaluation. Accordingly, evaluators should
issue reports before pertinent decisions have to be made; anticipate and thwart,
as much as possible, misunderstandings and misuses of findings; point up suspected
side effects of the evaluation process; distinguish sharply between evaluation
findings and recommendations; be cautious and circumspect in making
recommendations; and carefully distinguish between their evaluative role and
any advocacy role they might be playing.
The ERS standards are not the official standards of any group at this time.
Their inclusion reflects their historical significance. Also, like the AEA guiding
principles, they address a wide range of evaluations outside as well as inside
education. Furthermore, the ERS standards still are judged to be valuable, since
they apply to the full range of evaluation tasks, whereas the AEA guiding
principles mainly propose a code of ethics for the behavior of evaluators.
Following the 1986 merger of the Evaluation Network and the Evaluation Research Society to create the American Evaluation Association, the amalgamated organization revisited the issue of professional standards for evaluators. After considerable discussion at both board and membership levels, the AEA leaders decided to supplement the ERS standards summarized above with an updated statement of evaluation principles. In November 1992, AEA created a task force and charged it to develop general guiding principles for evaluators rather than standards for evaluation practice. The task force, chaired by William R. Shadish, subsequently drafted the Guiding Principles for Evaluators. Following a review process made available to the entire AEA membership, the task force finalized the principles document. After an affirmative vote by the AEA membership, the AEA board adopted the task force's recommended principles as the official AEA evaluation principles. AEA then published the principles in a special issue of AEA's New Directions for Program Evaluation periodical (Task Force on Guiding Principles for Evaluators, 1995). The "guiding principles" are presented as a 6-page chapter in this special issue. The AEA guiding principles are consistent with the prior ERS standards but shorter in the number of presented statements. Essentially, the AEA principles comprise 5 principles and 23
underlying normative statements to guide evaluation practice. The principles,
with a summary of the associated normative statements, are as follows.
"A. Systematic Inquiry: Evaluators conduct systematic, data-based inquiries
about whatever is being evaluated." This principle is supported by three norma-
tive statements. These charge evaluators to meet the highest available technical
standards pertaining to both quantitative and qualitative inquiry. Evaluators are
also charged to work with their clients to ensure that the evaluation employs
appropriate procedures to address clear, important questions. The evaluators
are charged further to communicate effectively, candidly, and in sufficient detail
throughout the evaluation process, so that audiences will understand and be able
to critique the evaluation's procedures, strengths, weaknesses, limitations, and
underlying value and theoretical assumptions and also make defensible
interpretations of findings.
"B. Competence: Evaluators provide competent performance to stakeholders."
Three normative statements charge evaluators to develop and appropriately
apply their expertise. Evaluator(s) should be qualified by education, abilities,
skills, and experience to competently carry out proposed evaluations, or they
should decline to do them. They should practice within the limits of their capabil-
ities. Throughout their careers, evaluators should constantly use pertinent
opportunities to upgrade their evaluation capabilities, including professional
development and subjecting their evaluations to metaevaluations.
"c. Integrity/Honesty: Evaluators ensure the honesty and integrity of the entire
evaluation process." Five normative statements are provided to assure that
evaluations are ethical. Evaluators are charged to be honest and candid with
their clients and other users in negotiating all aspects of an evaluation. These
include costs, tasks, limitations of methodology, scope of likely results, and uses
of data. Modifications in the planned evaluation activities should be recorded,
and clients should be consulted as appropriate. Possible conflicts of interest
should be forthrightly reported and appropriately addressed. Any misrepre-
sentation of findings is strictly forbidden, and evaluators are charged to do what
they can to prevent or even redress misuses of findings by others.
"D. Respect for People: Evaluators respect the security, dignity, and self-worth
of the respondents, program participants, clients, and other stakeholders with
whom they interact." The five normative statements associated with this standard
require evaluators to show proper consideration to all parties to the evaluation.
In focusing the evaluation, collecting information, and reporting findings, the
evaluator should identify and respect differences among participants, e.g., age,
disability, ethnicity, gender, religion, and sexual orientation. Pertinent codes of
ethics and standards are to be observed in all aspects of the evaluation. The
evaluator should maximize benefits to stakeholders and avoid unnecessary harms;
observe informed consent policies; deal proactively, consistently, and fairly with
issues of anonymity and confidentiality; and do whatever is appropriate and
possible to help stakeholders benefit from the evaluation.
"E. Responsibilities for General and Public Welfare: Evaluators articulate and
take into account the diversity of interests and values that may be related to the
general and public welfare." Five normative statements are given to support this
principle. Evaluators are charged not to be myopic but to show broad concern
for the evaluation's social relevance. Evaluators have professional obligations to
serve the public interest and good, as well as the local need for evaluative feedback.
They should consider the program's long-range as well as short-term effects,
should search out side effects, and should present and assess the program's broad
assumptions about social significance. They should balance their obligation to
serve the client with services to the broader group of stakeholders. They should
involve and inform the full range of right-to-know audiences and, within the
confines of contractual agreements, give them access to the information that may
serve their needs. In interpreting findings evaluators should take into account all
relevant value perspectives or explain why one or some of these were excluded.
Keeping in mind the interests and technical capabilities of their audiences,
evaluators should report findings clearly and accurately.
It is a mark of progress that North American evaluators and their audiences have
standards and principles to guide and assess evaluation practices. But what
differences have these standards and principles made? The proof is not in the
pudding, but in the eating. Unfortunately, no definitive study has been made of
the uses and impacts of the standards and principles. Nevertheless, the editor of
this book's Evaluation Profession section (M.F. Smith) inveigled me to give my
perspective on these matters, and I agreed.
I do so based on more than 25 years of relevant experience. This includes
leading the development of the first editions of the Joint Committee program
and personnel evaluation standards, helping develop the second edition of the
Joint Committee program evaluation standards, offering training in the stan-
dards, publishing articles and other materials on the different sets of standards,
developing checklists for applying standards, using the standards to conduct
numerous metaevaluations, and citing the standards in affidavits for legal
proceedings. Despite this considerable experience, I am not an authority on how
much and how well the standards are being used. I can only give my impressions.
To organize the discussion I will reference the material in Table 3. The column
headings denote the Evaluation Standards and the AEA Guiding Principles. The
row headings are the functions of standards and principles with which I began
the chapter. Basically, the table's contents speak for themselves. I will offer a few
general comments, then suggest what I believe evaluators need to do in order to
better use standards and principles to further professionalize the evaluation
enterprise and improve its contributions.
One important message of Table 3 is that both the Program Evaluation
Standards and the Guiding Principles are appropriately grounded and have strong
potential. Both instrumentalities have been carefully developed, and AEA supports
both. The standards' depth of definition, inclusion of practical guidelines and
common errors to avoid, and illustrative cases enhance their usefulness in
instruction, evaluation practice, and metaevaluation. Guidance is also provided
on how the standards apply to the full range of evaluation tasks. AEA's
promotion of both tools through a special section of the American Journal of
Evaluation (AJE) on metaevaluation brings visibility and practical examples of
use. The standards are backed up with
supporting checklists and illustrative cases and are subject to periodic reviews
and updates. Both the standards and the principles are regularly taught in
evaluation training programs, such as The Evaluators' Institute in Maryland and
the National Science Foundation-funded evaluation training institutes at
Western Michigan University. In combination, both tools bring clarity to the issue
of technically competent, ethical, practical, and useful evaluation service and
provide evaluators and their constituents with a common evaluation language.
Beyond being well grounded and having strong potential for use, both the
evaluation standards and the guiding principles have made some notable impacts.
Increasingly, one sees references to both the standards and the principles in
accounts and assessments of evaluation studies. For both, these are seen across
a variety of disciplines. While the Joint Committee standards are intended only
for application in education in North America, their use has been demonstrated
in fields outside education and in a number of countries. Also, I know of a few
examples where interest in meeting the requirements of the standards led to the
development of new evaluation techniques, such as feedback workshops, and
tools, e.g., various evaluation checklists included on the Web site of the Western
Michigan University Evaluation Center (www.wmich.edu/evalctr/checklists/).
However, the chilling message of Table 3 is that the standards and principles
are limited in both kinds and extent of impacts. Only AEA has officially
endorsed the principles. While the Joint Committee is sponsored by about 15
organizations that span different areas of education, none of the organizations
has officially endorsed the standards. It seems clear that few teachers, school
principals, superintendents, counselors, school board members, etc., are aware
of, let alone invested in using, the standards. AEA has defined its constituency
much more narrowly to include the AEA membership and has ratified the
guiding principles, but these principles are little known outside the relatively
small group of AEA members.
Table 3. Observations on the Uses and Impact of the Joint Committee Program Evaluation
Standards and the AEA Guiding Principles

Protect consumers and society from harmful practices
  Program Evaluation Standards: Probably little impact here. The standards are
  voluntary and not ratified. No enforcement mechanism exists.
  AEA Guiding Principles: Probably little impact here. The principles are very
  general. While ratified by AEA, no enforcement mechanism exists.

Provide a basis for accountability by the service providers
  Program Evaluation Standards: Some impact. Increasingly, evaluators are
  referencing their attempts to comply with the standards; and the standards
  are quite detailed.
  AEA Guiding Principles: Probably little impact, because the principles are
  highly general.

Provide an authoritative basis for assessing professional services
  Program Evaluation Standards: Some impact. Examples of the standards being
  used in metaevaluations are increasing. Also, the standards are backed up by
  supporting checklists, and AJE includes a regular section on metaevaluation
  that publishes uses of both the standards and the AEA principles.
  AEA Guiding Principles: Some impact. AEA is strongly promoting dissemination
  and use of the Guiding Principles through its conventions and the AJE
  section on metaevaluation.

Provide a basis for adjudicating claims of malpractice
  Program Evaluation Standards: No evidence that the standards have been used
  in this way.
  AEA Guiding Principles: No evidence that the Guiding Principles have been
  used in this way.

Help assure that service providers will employ their field's currently best
available practices
  Program Evaluation Standards: Some assistance here. The Joint Committee
  periodically reviews and updates the standards. The standards are probably
  used widely in training programs. However, there are no firm requirements
  to apply the standards.
  AEA Guiding Principles: Some assistance here. The principles are probably
  used widely in training programs. However, there are no firm requirements
  to apply the principles.

Identify needs for improved technologies
  Program Evaluation Standards: Some obscure impacts. The WMU Evaluation
  Center has developed techniques, such as feedback workshops, and checklists
  to help evaluators better meet the standards.
  AEA Guiding Principles: Probably little impact here. The principles are
  general and focus mainly on evaluator behaviors rather than evaluation
  procedures.

Provide a conceptual framework and working definitions to guide research and
development in the service area
  Program Evaluation Standards: Some impact here. The standards have helped
  evaluators and others to view evaluation as a much broader enterprise than
  research and have stimulated development of new evaluation tools.
  AEA Guiding Principles: Impact on research and development in evaluation is
  unknown. These principles bring clarity to the issue of ethical service in
  evaluation, but I am not aware that the principles have guided research and
  development.

Provide general principles for addressing a variety of practical issues in the
service area
  Program Evaluation Standards: Considerable help here. The standards provide
  30 principles, show how they apply to the different parts of an evaluation
  process, and also delineate procedural suggestions and pitfalls to avoid.
  AEA Guiding Principles: Considerable help here. Ethical positions of
  evaluator behavior are clearly presented along with some more specific
  recommendations.

Present service providers and their constituents with a common language to
facilitate communication and collaboration
  Program Evaluation Standards: Good progress here. The standards were
  developed jointly by evaluation specialists and users of evaluation.
  AEA Guiding Principles: Some progress here. AEA is stressing that all
  members should learn and apply the principles.

Provide core content for training and educating service providers
  Program Evaluation Standards: Good and increasing impact. The standards
  include detailed suggestions and illustrative cases as well as defined
  principles and are being used in many training programs. But many
  evaluators are still to be reached.
  AEA Guiding Principles: Good and increasing impact. The principles include
  general advice on evaluation ethics and a few specific guidelines.

Earn and maintain the public's confidence in the field of practice
  Program Evaluation Standards: The standards have enhanced credibility where
  applied, but many audiences have not been reached.
  AEA Guiding Principles: The principles have good potential that is largely
  unrealized.
Probably the standards and principles have had
little impact in protecting clients from poor evaluation service, since there are no
mechanisms to screen evaluation plans and head off predictably bad evaluation
projects. In my view, the standards and principles are far from realizing their
potential to help strengthen evaluation services, to prevent or expose bad
evaluation practices, and to build public confidence in evaluation.
Members of the evaluation profession should take pride in their strides to help
professionalize evaluation through the development, dissemination, and use of
standards and guiding principles. They should build on their substantial record
of contributions by learning, regularly applying, and helping to improve the
standards and principles. But they have to do much more. Far too little effort has
gone into disseminating the standards and principles. AEA and other evaluation
bodies should redouble their efforts to publicize and offer training in the
standards and principles. AJE should vigorously continue its practice of
publishing metaevaluations keyed to both the standards and the principles, and
AEA should periodically review and update the principles, involving the total
membership in this effort. The Joint Committee should seek funds to expand
and strengthen its efforts to engage all of the sponsoring organizations in
promoting and supporting the use of the standards. Both the Joint Committee
and AEA should investigate possible means of enforcing compliance with the
standards and principles or at least publicly criticizing evaluations that are clearly
unethical and/or otherwise bad. They should find some way of bringing sanctions
against evaluators who violate the standards and principles and consequently
misserve their clients. In other words, there is little point to having standards and
principles if they don't help enhance evaluation practice and thwart or castigate
clear instances of malpractice.
CLOSING COMMENTS
This chapter has provided an overview of the state of standards and principles in
the field of evaluation as practiced in the U.S. Professional standards and
principles are seen as important for assessing and strengthening evaluation practices.
It is a mark of American evaluators' move toward professionalism that two
separate but complementary standards/principles-development movements are
now more than two decades old and continuing. It is fortunate that multiple sets
of standards/principles have been developed. They provide cross-checks on each
other, even though they are appropriately aimed at different constituencies. The
two sets of presentations have proved to be complementary rather than compet-
itive. The ERS/AEA standards and principles address evaluations across a wide
range of disciplines and service areas, while the Joint Committee standards have
focused on education. It should be reassuring to educational evaluators that all
the important points in the ERS/AEA standards and principles are also covered
in the Joint Committee standards, and that the standards have proved useful in
fields outside education and in countries outside North America. There seem to be
no conflicts about what principles evaluators should follow in the different sets
of materials. Moreover, evaluators outside education find that the details in the
Joint Committee standards can help to buttress the general principles in the
ERS/AEA standards and principles.
For the future the two groups should continue to review and update the
standards and principles as needed. They should also put a lot more effort into
promoting effective use of the standards and principles. Especially, they should
encourage evaluation educators to build the standards and principles into evalu-
ation degree programs and into special training sessions for evaluation users as
well as specialists. Evaluators should employ the evaluation standards and
principles to conduct and report metaevaluations. They should also take on some
of the difficult questions related to enforcement of standards and principles.
If the standards and principles become well established and if they are
regularly applied, then both evaluation consumers and producers will benefit.
Adherence to the evaluation standards and principles will improve the quality of
evaluations and should increase their value for improving programs and services.
These points seem to provide an ample rationale for evaluators to obtain, study,
apply, and help improve the Joint Committee standards for program and for
personnel evaluations and the AEA principles. The fact that these efforts
developed independently gives added credibility to the consensus reflected in
their reports about what constitutes good and acceptable evaluation practice.
Now it is time for collaboration as the Joint Committee and AEA move ahead
to advance the professional practice of evaluation in the U.S. through adherence
to high standards and principles of practice. It is also time for a lot more effort
to disseminate the standards and principles and at least to explore the develop-
ment of enforcement mechanisms.
It is noteworthy that the standards and principles discussed in this chapter are
distinctly American. They reflect U.S. government laws and the North American
culture. This chapter in no way recommends that evaluators outside the U.S. be
held to the American standards and principles. Based on my experience in
leading the development of the initial sets of Joint Committee standards, I am
convinced that the participatory process leading to standards is essential. Groups
in countries outside the U.S. wanting evaluation standards are advised to appoint
their own standards-setting group and to proceed through a culturally acceptable
participatory process designed to reach creditable decisions about what
principles and standards should be used as the basis for judging evaluation work
in the particular culture. It may be useful to remember that evaluation standards
are principles commonly agreed to by the evaluators and other relevant
stakeholders in a particular defined setting for judging the quality of evaluations.
At best, the American evaluation standards and principles and the employed
standards-setting arrangements and processes provide examples that may be
useful to standard setters in other cultures.
ENDNOTES
REFERENCES
American Evaluation Association Task Force on Guiding Principles for Evaluators. (1995). Guiding
principles for evaluators. New Directions for Program Evaluation, 66, 19-26.
Cordray, D.S. (1982). An assessment of the utility of the ERS standards. In P.H. Rossi (Ed.),
Standards for evaluation practice. New Directions for Program Evaluation, 15, 67-81.
Covert, R.W. (1995). A twenty-year veteran's reflections on the guiding principles for evaluators. In
W.R. Shadish, D.L. Newman, M.A. Scheirer, & C. Wye (Eds.), Guiding principles for evaluators.
New Directions for Program Evaluation, 66, 35-45.
ERS Standards Committee. (1982). Evaluation Research Society standards for program evaluation.
In P.H. Rossi (Ed.), Standards for evaluation practice. New Directions for Program Evaluation, 15,
7-19.
Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of
educational programs, projects, and materials. New York: McGraw-Hill.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation standards.
Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards:
How to assess evaluations of educational programs. Thousand Oaks, CA: Sage.
Sanders, J.R. (1995). Standards and principles. In W.R. Shadish, D.L. Newman, M.A. Scheirer, & C.
Wye (Eds.), Guiding principles for evaluators. New Directions for Program Evaluation, 66, 47-53.
Shadish, W.R., Newman, D.L., Scheirer, M.A., & Wye, C. (Eds.). (1995). Guiding principles for
evaluators. New Directions for Program Evaluation, 66.
Stufflebeam, D.L. (1982). A next step: Discussion to consider unifying the ERS and Joint Committee
standards. In P.H. Rossi (Ed.), Standards for evaluation practice. New Directions for Program
Evaluation, 15, 27-36.
14
Ethical Considerations in Evaluation
MICHAEL MORRIS
University of New Haven, Department of Psychology, CT, USA
Every profession confronts ethical issues as its members engage in practice, and
the ways in which individuals respond to these challenges play a major role in (1)
shaping the profession's public image and credibility, and (2) increasing (or
decreasing) the profession's internal social cohesion. Simply put, "ethics matter"
to professions, and evaluation is no exception.
According to the dictionary, ethics deal with "what is good and bad and with
moral duty and obligation" (Webster's, 1988, p. 426). Thus, to behave ethically is
to "do the (morally) right thing." As Stufflebeam points out in Chapter 13, one
strategy used by professions to ensure that their members do the right thing is to
develop standards, principles, and/or codes. Given that evaluation is, at its core,
a research enterprise, these standards, principles, and codes represent attempts
to articulate for evaluators what Merton (1973) describes as the "normative
structure of science." This structure is "expressed in the form of prescriptions,
proscriptions, preferences, and permissions ... legitimized in terms of institu-
tional values," and "internalized by the scientist, thus fashioning his [sic] scientific
conscience" (p. 269).
In this chapter the focus is on the distinctive implications, for evaluation as a
profession, of empirical research and commentary relevant to the normative
structure of the field. Within this context, five key areas are addressed, followed
by recommendations for the future:
The answers to these questions are not always as straightforward as one might
wish. As we shall see, however, the lack of a definitive answer can be instructive
The preceding discussion should not distract us from the fact that most evalu-
ators do report that they have faced ethical challenges. In the Morris and Cohn
(1993) survey, respondents were asked in an open-ended question to describe
"the ethical problems or conflicts that you have encountered most frequently in
your work as an evaluator." A review of the results suggests that, in the eyes of
evaluators, ethical challenges are not evenly distributed throughout the course of
a project. Rather, they are concentrated at the beginning (i.e., the entry/
contracting stage) and toward the end (i.e., communication of results and utiliza-
tion of results). In the entry/contracting stage, the following conflicts were most
often reported:
• A stakeholder has already decided what the findings "should be" or plans to
use the findings in an ethically questionable fashion.
• A stakeholder declares certain research questions "off-limits" in the evalua-
tion despite their substantive relevance.
It is clear from these findings that initiating an evaluation can be akin to walking
through a minefield. Evaluation clients may simply want the evaluation to
confirm what they are sure they already "know," or to provide them with ammu-
nition to dispatch a vulnerable and disliked program director. The gathering of
data on important questions may be discouraged or prohibited. The participation
of relevant stakeholders may not be obtained for a variety of reasons, ranging
from ignorance to deliberate exclusion. And even if all the relevant stakeholders
are included in the planning process, unresolved disagreements between them
can present the evaluator with hard choices regarding the group(s) to which
he/she is going to be most responsive.
With respect to communicating results, respondents identified the following
conflicts as occurring most frequently:
What is clear from this list is that presenting an honest, objective, and impartial
report to an evaluation client may not prevent subsequent ethical problems.
Indeed, there are a number of troublesome fates that can befall a set of evalu-
ation findings. The findings might simply be ignored or, if they are threatening
to powerful vested interests, actively suppressed. One might argue that such a
fate does not necessarily represent an ethical challenge for the evaluator, even
though the outcome is an unfortunate one, and in defense of this argument cite
the assertion from the Program Evaluation Standards that evaluators should
avoid "taking over the client's responsibilities for acting on evaluation findings"
(Joint Committee, 1994, p. 69). The findings suggest, however, that a significant
subgroup of evaluators believe that utilization is a shared responsibility, and that
nonuse of findings can represent a type of misuse (Stevens & Dial, 1994). It may
be that the growing popularity of utilization-based models of evaluation (e.g.,
Patton, 1997) over the past two decades has fostered a normative climate in
which the application of results is increasingly seen as part of the evaluator's job
(Caracelli & Preskill, 2000; Preskill & Caracelli, 1997; Shulha & Cousins, 1997).
Misinterpretation and deliberate modification of findings by stakeholders
represent two use-related problems that are probably viewed nearly universally
by evaluators as ethical challenges they should address. Likewise for the
plagiarism/misrepresentation issue, where the respondents' most frequent
complaint was that others (usually their supervisors) had taken inappropriate
credit for the respondents' work in reports or publications. All three of these
offenses compromise either the accuracy or honesty of the evaluation, two
qualities that have been found to be at the core of the normative structure of
science more generally (Korenman, Berk, Wenger, & Lew, 1998).
Using results to punish individuals, such as program directors or staff, violates
the principle of do no harm (nonmaleficence), which "according to many
ethicists ... takes precedence over all other principles" (Newman & Brown, 1996,
p. 41). This finding reflects one of the distinctive characteristics of evaluation:
Program clients are not the only stakeholders who can be placed at considerable
risk by a study. Indeed, some individuals, such as program directors, can be hurt
by an evaluation without ever having served as a formal source of data for the
investigation. Unfortunately, analyses of the rights of these stakeholders (i.e.,
directors and staff) tend to be quite general (e.g., see the Program Evaluation
Standards) when compared with the high level of detail that typically character-
izes discussions of "human subject" issues such as informed consent, privacy, and
confidentiality (e.g., U.S. Department of Health and Human Services, 2001).
The final end stage issue identified by Morris and Cohn's respondents involves
the question of "Who owns (and can distribute) what?" This finding highlights
a reality that we will return to in the last section of this chapter: Ethical
challenges that surface late in an evaluation often reflect issues that were not
effectively addressed in the project's initial stages. To the extent that various stake-
holders' "property rights" are explicitly discussed during the entry/contracting
stage, the chances of ethical mischief occurring at later stages are significantly
reduced. Moreover, any conflicts that do arise are likely to be handled more
effectively, given that a set of parameters for the discussion has presumably been
established already.
Implications
The challenges we have reviewed thus far represent just a subset of the myriad
ethical problems that evaluators encounter in their work. To be sure, they consti-
tute a very important subset, since research indicates that evaluators perceive
these conflicts as the ones they are most likely to face. Evaluators who develop
an in-depth appreciation of these challenges will be better equipped to "manage
their ethical lives" as evaluators than those who fail to do so.
It is nevertheless crucial for evaluators to recognize that ethical questions can
arise at virtually any point in a project. The nature of applied research is such
that many of these questions can be anticipated by experienced evaluators. For
example, how "informed" does informed consent need to be? When can passive
consent serve as an acceptable alternative to active consent (e.g., see Jason,
Pokorny, & Katz, 2001)? What ethical considerations need to be taken into
account when randomly assigning needy participants to research conditions?
How should evaluators respond if administrators wish to alter program
operations in the middle of a study due to early findings indicating serious
program deficiencies? Under what circumstances, if any, should evaluation of
individual staff members take place within the context of a program evaluation?
Indeed, in certain cases the concerns raised by questions such as these may be
serious enough to cause an evaluator to turn down an evaluation assignment, or
terminate a relationship that has already been entered into.
The methodology employed in the research examined in this section imposes
at least three limitations on the findings that should not be ignored. First, the
data were collected through a survey. It is possible, and perhaps even likely, that
interviews, analysis of organizational documents/records, and case-study
observation would have generated a different array of ethical conflicts than that
produced by the survey approach. Second, asking evaluators to describe the
ethical conflicts they have personally experienced maximizes the likelihood that
problems will be identified which do not implicate the respondent (i.e., the
evaluator) as the source of the difficulty. Thus, ethical problems which result
from the respondent's behavior will be underrepresented in the findings. As
discussed earlier, such an outcome would be predicted by Attribution Theory
(Jones et al., 1971). A third, and closely related, limitation is the restriction of this
research to evaluators, who comprise just one stakeholder group whose views of
ethical concerns might be solicited. All three of these limitations underscore the
Table 1. Most Frequent Violations, by Respondent Category (adapted from Newman and
Brown, 1996)

Evaluators
  Evaluator loses interest in the evaluation when the final report is delivered
  Evaluator fails to find out what the values are of right-to-know audiences
  Evaluator writes a highly technical report for a technically unsophisticated
  audience

Program administrators
  Evaluation is conducted because it is "required" when it obviously cannot
  yield useful results
  Limitations of the evaluation are not described in the report
  Evaluator conducts an evaluation when he or she lacks sufficient skills and
  experience

Program staff
  Evaluation is conducted because it is "required" when it obviously cannot
  yield useful results
  Evaluation is so general that it does not address differing audience needs
  Evaluation reflects the self-interest of the evaluator
Of the three remaining violations ranked highly by Newman and Brown's group
of evaluators, two appear to overlap considerably with the conflicts identified by
Morris and Cohn's respondents. Specifically, "Evaluator fails to find out what
the values are of right-to-know audiences" reflects the entry/contracting concerns
articulated in the Morris and Cohn study dealing with stakeholder inclusion.
Similarly, "Evaluator loses interest in the evaluation when the final report is
delivered" sets the stage for many of the use-related problems cited by Morris
and Cohn's respondents. Taken as a whole, then, the two studies provide com-
pelling evidence for the salience, in the minds of evaluators, of stakeholder-
related ethical challenges at the beginning of evaluations, and utilization-oriented
challenges at the end.
Newman and Brown's findings also strongly suggest that different stakeholders
bring different ethical "sensitivities" to the evaluation process. Indeed, the majority
of violations that received the highest rankings from program administrators and
staff did not make the "top five" list generated by the evaluators. Thus, violations
such as "Evaluator conducts the evaluation when he or she lacks sufficient skills
or experience" and "Evaluation report reflects the self-interest of the evaluator,"
which administrators and staff, respectively, perceived as occurring very
frequently, did not appear on the evaluators' most-frequent list.
The practical implications of such differences are not trivial. Stakeholders who
believe that evaluators are unskilled and/or inappropriately self-interested are
likely to participate in evaluations in contentious and ineffective ways.
Evaluators who make an effort to surface stakeholders' ethical concerns early in
a project are usually in a much better position to address them constructively
than those who do not. This will be true, regardless of whether or not the
perceptions underlying the stakeholders' concerns are accurate.
Surfacing issues in this fashion sensitizes evaluators to the fact that different
types of evaluation projects tend to raise different types of ethical issues in the
minds of stakeholders. For example, Nee and Mojica (1999), who are philanthropic
foundation officers, have discussed the tension generated by "comprehensive
community initiatives," which are "programmatic interventions and collabora-
tive strategies designed to increase community capacity, and require high levels
of community engagement in initiative design and decision-making" (p. 36).
Foundations are increasingly facilitating and participating in such endeavors,
and in these efforts evaluators are frequently asked to serve as "teachers,
consultants, and technical assistants" (p. 42), in addition to their more traditional
evaluator role. As a result, evaluators can end up evaluating initiatives that they
helped to design. As Nee and Mojica observe, the evaluator is put
It is probably safe to say that the vast majority of evaluators would rather avoid
ethical conflicts than have to cope with them once they have arisen. Hence, this
section will begin with a discussion of strategies for prevention and then proceed
to a consideration of what evaluators can do when they are placed in a more
reactive position.
In essence, an ethics cost-benefit analysis (ECBA) examines the ethical risks and
hazards associated with a study and weighs them against the social good that the
study can be reasonably expected to accomplish (Mark, Eyssell, & Campbell,
1999; U.S. Department of Health and Human Services, 2001). The focus here is
typically on current and future program clients and involves such issues as
informed consent, privacy, confidentiality, anonymity, data monitoring, and
incentives for participation. It is important to remember, however, that a variety
of nonclient stakeholders can be affected by an evaluation's risks and benefits,
such as program staff and administration, taxpayers, and clients' significant
others. Thus, the task of performing a comprehensive ECBA for a specific evalu-
ation can be daunting, given the complexities, uncertainties, and predictions that
are inevitably involved. Nevertheless, the fundamental principles underlying
such an analysis - that risks should be minimized and that benefits should signifi-
cantly exceed risks - provide a sound ethical compass for designing evaluations.
An ECBA sensitizes evaluators and stakeholders to the centrality of the "First,
Do No Harm" principle in professional practice (Newman & Brown, 1996) and
reinforces the notion that, if potential harm is foreseen, it must be justified by the
anticipation of a substantial, clearly articulated social good. Detailed guidelines
for conducting an ECBA focused on research participants can be found in the
U.S. Department of Health and Human Services' IRB Guidebook (2001).
The strategies covered thus far have been primarily preventive in nature.
However, as any experienced evaluator knows, ethical challenges can arise in
even the most conscientiously planned projects. The options available to the
evaluator in these situations include the following.
Review the Program Evaluation Standards and the Guiding Principles for
Evaluators
Both the Guiding Principles and the Standards are extremely useful for
establishing a framework within which ethical conflicts can be addressed (see
Chapter 13 by Stufflebeam in this volume). Indeed, there may even be occasions
when a particular guideline directly applies to, and helps resolve, the problem at
hand. For example, the Program Evaluation Standards include a "functional
table of contents" (Joint Committee, 1994, pp. vii-x) to help evaluators link the
various stages and tasks of an evaluation to the specific Standards that are most
likely to be relevant to them. In most cases, however, the primary value of standards
is to serve as a general orienting device for the evaluator, rather than to provide
a specific "answer" to the ethical question in situ. Numerous commentators have
observed that professional guidelines, by their very nature, are usually too
general and abstract to supply detailed guidance (e.g., House, 1995; Mabry,
1999). Mabry notes, for example, that "codes of professional practice cannot
anticipate the myriad particularities of ordinary endeavors .... Of necessity,
practitioners must interpret and adapt them in application" (1999, p. 199).
Contributing to this situation is the fact that, in any given instance, fulfilling
one professional guideline may require violating another professional guideline.
In this context the Advisory Committee on Human Radiation Experiments
(1995) has asserted that:
What major lessons concerning ethics should evaluators, and the evaluation
profession, draw from this chapter? The following six would appear to hold
special value.
2000). Indeed, to the extent that evaluators follow the Joint Committee's guide-
lines for conducting metaevaluations of their own work, the "raw material"
for many such case studies will be potentially available to researchers. As has
been previously noted, the fullest, most meaningful picture of evaluation
ethics is likely to emerge when a variety of methodological approaches are
employed. And there is no group of professionals who should appreciate this
truth more than evaluators.
ENDNOTES
1 This list of models, of course, is far from complete. For a much more comprehensive analysis of
evaluation approaches, see Stufflebeam (2001).
2 Portions of this section are drawn from Morris (2000b).
3 For an alternative conceptualization of advocacy in evaluation, see Sonnichsen (2000).
REFERENCES
Advisory Committee on Human Radiation Experiments. (1995). Final report. Washington, DC: U.S.
Government Printing Office.
Adams, K.A. (1985). Gamesmanship for internal evaluators: Knowing when to "hold 'em" and when
to "fold 'em." Evaluation and Program Planning, 8, 53-57.
American Evaluation Association, Task Force on Guiding Principles for Evaluators. (1995). Guiding
principles for evaluators. In W.R. Shadish, D.L. Newman, M.A. Scheirer, & C. Wye (Eds.),
Guiding principles for evaluators (pp. 19-26). New Directions for Program Evaluation, 66.
Bamberger, M. (1999). Ethical issues in conducting evaluation in international settings. In J.L.
Fitzpatrick, & M. Morris (Eds.), Current and emerging ethical challenges in evaluation (pp. 89-97).
New Directions for Evaluation, 82.
Block, P. (2000). Flawless consulting: A guide to getting your expertise used (2nd ed.). San Francisco:
Jossey-Bass/Pfeiffer.
Caracelli, V.J., & Preskill, H. (Eds.). (2000). The expanding scope of evaluation use. New Directions
for Evaluation, 88.
Cooksy, L.J. (2000). Commentary: Auditing the Off-the-Record Case. American Journal of Evaluation,
21, 122-128.
Cousins, J.B., & Whitmore, E. (1998). Framing participatory evaluation. In E. Whitmore (Ed.),
Understanding and practicing participatory evaluation (pp. 5-23). New Directions for Evaluation, 80.
Datta, L. (1999). The ethics of evaluation neutrality and advocacy. In J.L. Fitzpatrick, & M. Morris
(Eds.), Current and emerging ethical challenges in evaluation (pp. 77-88). New Directions for
Evaluation, 82.
Ferris, L.E. (2000). Legal and ethical issues in evaluating abortion services. American Journal of
Evaluation, 21, 329-340.
Fetterman, D.M. (1994). Empowerment evaluation. Evaluation Practice, 15, 1-15.
Fetterman, D.M. (1997). Empowerment evaluation: A response to Patton and Scriven. Evaluation
Practice, 18, 253-266.
Fetterman, D.M., Kaftarian, S.J., & Wandersman, A. (Eds.). (1996). Empowerment evaluation:
Knowledge and tools for self-assessment and accountability. Newbury Park, CA: Sage.
Guba, E., & Lincoln, Y. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.
Hendricks, M., & Conner, R.F. (1995). International perspectives on the guiding principles. In
W.R. Shadish, D.L. Newman, M.A. Scheirer, & C. Wye (Eds.), Guiding principles for evaluators
(pp. 77-90). New Directions for Program Evaluation, 66.
Honea, G.E. (1992). Ethics and public sector evaluators: Nine case studies. Unpublished doctoral
dissertation, University of Virginia.
House, E.R. (1995). Principled evaluation: A critique of the AEA Guiding Principles. In
W.R. Shadish, D.L. Newman, M.A. Scheirer, & C. Wye (Eds.), Guiding principles for evaluators
(pp. 27-34). New Directions for Program Evaluation, 66.
House, E.R., & Howe, K.R. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.
House, E.R., & Howe, K.R. (2000). Deliberative democratic evaluation. In K.E. Ryan &
L. DeStefano (Eds.), Evaluation as a democratic process: Promoting inclusion, dialogue, and
deliberation (pp. 3-12). New Directions for Evaluation, 85.
Jason, L.A., Pokorny, S., & Katz, R. (2001). Passive versus active consent: A case study in school
settings. Journal of Community Psychology, 29, 53-68.
Joint Committee on Standards for Educational Evaluation. (1994). The Program Evaluation
Standards: How to assess evaluations of educational programs (2nd ed.). Thousand Oaks, CA: Sage.
Jones, E.E., Kanouse, D.E., Kelley, H.H., Nisbett, R.E., Valins, S., & Weiner, B. (1971). Attribution:
Perceiving the causes of behavior. Morristown, NJ: General Learning Press.
Knott, T.D. (2000). Commentary: It's illegal and unethical. American Journal of Evaluation, 21,
129-130.
Korenman, S.G., Berk, R., Wenger, N.S., & Lew, V. (1998). Evaluation of the research norms of
scientists and administrators responsible for academic research integrity. JAMA, 279, 41-47.
Love, A.J. (1991). Internal evaluation: Building organizations from within. Newbury Park, CA: Sage.
Lovell, R.G. (1995). Ethics and internal evaluators. In W.R. Shadish, D.L. Newman, M.A. Scheirer,
& C. Wye (Eds.), Guiding principles for evaluators (pp. 61-67). New Directions for Program
Evaluation, 66.
Mabry, L. (1999). Circumstantial ethics. American Journal of Evaluation, 20, 199-212.
Mark, M.M. (2000). Toward a classification of different evaluator roles. Paper presented at the annual
meeting of the American Evaluation Association, Honolulu, Hawaii.
Mark, M.M., Eyssell, K.M., & Campbell, B. (1999). The ethics of data collection and analysis. In
J.L. Fitzpatrick, & M. Morris (Eds.), Current and emerging ethical challenges in evaluation
(pp. 47-56). New Directions for Evaluation, 82.
Mark, M.M., Henry, G.T., & Julnes, G. (2000). Evaluation: An integrated framework for understanding,
guiding, and improving policies and programs. San Francisco: Jossey-Bass.
Mathison, S. (1991). Role conflicts for internal evaluators. Evaluation and Program Planning, 14,
173-179.
Mathison, S. (1999). Rights, responsibilities, and duties: A comparison of ethics for internal and
external evaluators. In J.L. Fitzpatrick, & M. Morris (Eds.), Current and emerging ethical
challenges in evaluation (pp. 25-34). New Directions for Evaluation, 82.
Mathison, S. (2000). Deliberation, evaluation, and democracy. In K.E. Ryan & L. DeStefano (Eds.),
Evaluation as a democratic process: Promoting inclusion, dialogue, and deliberation (pp. 85-89).
New Directions for Evaluation, 85.
McKillip, J., & Garberg, R. (1986). Demands of the Joint Committee's Standards for Educational
Evaluation. Evaluation and Program Planning, 9, 325-333.
Mertens, D.M. (1999). Inclusive evaluation: Implications of transformative theory for evaluation.
American Journal of Evaluation, 20, 1-14.
Merton, R.K. (1973). The normative structure of science. In N.W. Storer (Ed.), The sociology of
science: Theoretical and empirical investigations (pp. 267-278). Chicago: University of Chicago
Press.
Morris, M. (1999). Research on evaluation ethics: What have we learned and why is it important?
In J.L. Fitzpatrick, & M. Morris (Eds.), Current and emerging ethical challenges in evaluation
(pp. 15-24). New Directions for Evaluation, 82.
Morris, M. (2000a). The off-the-record case. American Journal of Evaluation, 21, 121.
Morris, M. (2000b). Increasing evaluation's capacity for mischief: An ethical issue? Paper presented at
the annual meeting of the American Evaluation Association, Honolulu, Hawaii.
Morris, M., & Cohn, R. (1993). Program evaluators and ethical challenges: A national survey.
Evaluation Review, 17, 621-642.
Morris, M., & Jacobs, L. (2000). You got a problem with that? Exploring evaluators' disagreements
about ethics. Evaluation Review, 24, 384-406.
Nee, D., & Mojica, M.I. (1999). Ethical challenges in evaluation with communities: A manager's
perspective. In J.L. Fitzpatrick, & M. Morris (Eds.), Current and emerging ethical challenges in
evaluation (pp. 35-45). New Directions for Evaluation, 82.
Newman, D.L. (1999). Education and training in evaluation ethics. In J.L. Fitzpatrick, & M. Morris
(Eds.), Current and emerging ethical challenges in evaluation (pp. 67-76). New Directions for
Evaluation, 82.
Newman, D.L., & Brown, R.D. (1996). Applied ethics for program evaluation. Thousand Oaks, CA:
Sage.
Palumbo, D.J. (Ed.). (1987). The politics of program evaluation. Newbury Park, CA: Sage.
Patton, M.Q. (1987). Evaluation's political inherency: Practical implications for design and use. In
D.J. Palumbo (Ed.), The politics of program evaluation (pp. 100-145). Newbury Park, CA: Sage.
Patton, M.Q. (1997). Utilization-focused evaluation: The new century text (3rd ed.). Thousand Oaks,
CA: Sage.
Posavac, E.J., & Carey, R.G. (1997). Program evaluation: Methods and case studies (5th ed.). Upper
Saddle River, NJ: Prentice Hall.
Preskill, H., & Caracelli, V. (1997). Current and developing conceptions of evaluation use:
Evaluation Use TIG survey results. Evaluation Practice, 18, 209-225.
Rosenthal, R. (1994). Science and ethics in conducting, analyzing, and reporting psychological
research. Psychological Science, 5, 127-134.
Russon, C. (Ed.). (2000). The Program Evaluation Standards in international settings. Kalamazoo: The
Evaluation Center, Western Michigan University.
Ryan, K., Greene, J., Lincoln, Y., Mathison, S., & Mertens, D.M. (1998). Advantages and challenges
of using inclusive evaluation approaches in evaluation practice. American Journal of Evaluation,
19, 101-122.
Shulha, L.M., & Cousins, J.B. (1997). Evaluation use: Theory, research, and practice since 1986.
Evaluation Practice, 18, 195-208.
Sonnichsen, R.C. (2000). High impact internal evaluation: A practitioner's guide to evaluating and
consulting inside organizations. Thousand Oaks, CA: Sage.
Stevens, C.J., & Dial, M. (1994). What constitutes misuse? In C.J. Stevens, & M. Dial (Eds.),
Preventing the misuse of evaluation (pp. 3-13). New Directions for Program Evaluation, 64.
Stufflebeam, D.L. (2001). Evaluation models. New Directions for Program Evaluation, 89.
Torres, R.T., & Preskill, H. (1999). Ethical dimensions of stakeholder participation and evaluation
use. In J.L. Fitzpatrick, & M. Morris (Eds.), Current and emerging ethical challenges in evaluation
(pp. 57-66). New Directions for Evaluation, 82.
U.S. Department of Health and Human Services Office for Human Subjects Protections. (2001). IRB
guidebook. Retrieved from http://ohrp.osophs.dhhs.gov/irb/irb_guidebook.htm
Webster's ninth new collegiate dictionary. (1988). Springfield, MA: Merriam-Webster.
Weiss, C. (1998). Evaluation: Methods for studying programs and policies. Upper Saddle River, NJ:
Prentice Hall.
Worthen, B.R., Sanders, J.R., & Fitzpatrick, J.L. (1997). Program evaluation: Alternative approaches
and practical guidelines. New York: Longman.
15
How can we call Evaluation a Profession if there are no
Qualifications for Practice?1
BLAINE R. WORTHEN
Utah State University, Western Institute for Research and Evaluation, UT, USA
Yes, if one looks beyond religious connotations such as "profession of faith" and
considers only the most general and rudimentary dictionary definitions of
profession, such as:
• "The occupation ... to which one devotes oneself; a calling in which one
professes to have acquired some special knowledge"
• "A principal calling, vocation, or employment"
• "The collective body of persons engaged in a calling"
Although I've not returned to these authors' later editions or combed their
subsequent publications to see if they have altered their views, two conclusions
seem safe: (1) evaluators are not univocal about whether evaluation is truly a
profession, and (2) the trend seems to be more and more toward calling it one,
albeit somewhat carefully, with terms such as becoming, tending toward, and
nearly awash in our conversations and pronouncements on the matter. For
example, the coeditors of this volume, in their prospectus for this handbook,
opined that "Evaluation is now growing into a full-fledged profession with
national and international conferences, journals, and professional associations"
[italics added].
My views on this issue are somewhat more cautious. During the last decade,
James Sanders and I developed ten touchstones for judging when a field has
become a fully mature profession (Worthen, 1994; Worthen & Sanders, 1991;
Worthen, Sanders, & Fitzpatrick, 1997). Those ten characteristics, shown in
Table 1, would seem essential to any field aspiring to call itself a profession.
On these ten criteria, evaluation qualified on seven, but lacked three essential
dimensions: (a) licensure and/or certification, (b) controlled entry into the field,
and (c) accreditation of preservice preparation programs. If you examine those
three criteria carefully, you will see that the thread that binds them together is
quality control, which professions seek to attain by careful decisions about inclusion
or exclusion of individuals (admissions screening by professional associations;
and certification, licensure, or credentialing as a prelude to practice), and
programs, through traditional use of accreditation processes. Also, even though
we placed #10 in Table 1 in the "yes" column, that placement may be somewhat
misleading. Evaluation does have printed and widely promulgated standards and
guidelines, but there is evidence to suggest they are seldom used in actual
evaluation studies (Worthen, Jones, & Goodrick, 1998).
Thus judged, if precision of language is important, then it would seem
premature to term evaluation a profession at this point.
At this point, one could well ask why so many evaluators have devoted time to
whether or not we term evaluation a profession, as opposed to other terms we
might well use, such as a speciality, a field, a discipline,3 and the like. Does it
really matter? Does the fact that a number of evaluators have addressed the
issue suggest that some broader problem underlies this question of terminology?
Should we really care? Can't we advance our field just as readily whatever
description we append to it, just so we continue to work to make evaluation serve
its clients well?
Perhaps, but the impact of mere terminology on a field is not always trivial.
Indeed, it would probably matter little whether or not we considered evaluation
a profession were it not for the fact that our conceptions - and even our
semantics - influence how we prepare personnel for evaluation roles. If we think
of evaluation as a discipline, then preservice programs for evaluators will be
patterned after those used to train academicians in other disciplines. If we think
of it as a profession, the coursework and internships in our evaluator preparation
programs will resemble more closely the methods courses and practica used to
prepare practitioners for other professions. If we think of evaluation as a hybrid
between a discipline and a profession, or a transdiscipline, then our evaluation
programs will combine elements of programs aimed at training practitioners with
those of programs used to prepare academicians.
Perhaps an even more important reason for pondering whether or not
evaluation should be called a profession is that the issue of quality control is
patently critical to the future of this particular area of human endeavor, and such
control is the hallmark of professions, professionals, and professional associations,
not of disciplines, fields, and specialities. In short, without the influence of some
well-organized professional group to worry collectively about the qualifications
of evaluators, it would seem likely that our field would suffer from unqualified
evaluators practicing evaluation. Although Darwinian forces may operate to
weed out many incompetent individuals (and perhaps, unfortunately, some who
are very competent practitioners but timid or tentative in their practice), this
seems likely to do little to advance the endeavor, since nothing prohibits their
being replaced by others just as incompetent, albeit in different ways. This is a
serious problem, for anyone who has perused very many evaluation reports can
scarcely avoid the conclusion that the flaws evident in many of them are
attributable to their perpetrators' lack of competence in evaluation.
In many instances, this lack of competence is traceable to inadequate or
irrelevant academic preparation. In others, lack of evaluation experience is the
causal culprit. In other cases, poor evaluation practice seems to stem more from
the practitioners' lack of conscience than from any lack of competence.
Evaluation has too many evaluator wannabes who suffer from the erroneous but
sincere belief that they are conducting good evaluation studies, whereas the
advice they give or the work they do may actually be useless, or worse, mis-
leading. In addition, there are self-styled evaluators or evaluation consultants
who are knowingly unscrupulous, committing atrocities in the name of evalu-
ation, banking on the hope they can remain undetected because their clients
likely know even less than they about evaluation. Unfortunately, there is simply
no way at present to prevent such incompetent or unscrupulous charlatans from
fleecing clients by proclaiming themselves to be competent evaluators. Without
some type of quality control mechanism, such unprincipled hucksters can do
much mischief, providing poor evaluation services and greatly tarnishing the
image of competent evaluators and evaluation in the process.
Because it really isn't quite so simple. Proposals have been made to address two
of these questions and some significant groundwork has been laid, but there is as
yet little clear-cut evidence that any substantial progress has been made toward
resolving the lack of competence assurance among evaluators, or will be made
anytime soon. Let me comment briefly on two of these issues and elaborate
somewhat more on the third.
The membership criteria for all of the professional evaluation associations are
rather lenient, and none would effectively exclude from membership individuals
who are unqualified as evaluators. Understandably so, for young associations
often struggle to attract sufficient numbers of dues-paying members to assure
financial stability necessary for survival and continuity. No clear-thinking
pragmatist would urge such associations to effect restrictive membership criteria
that would make it difficult for applicants to squeeze through their membership
gate until they knew they at least had enough members within their walls to
sustain the association. The very act of trying to legitimize or distinguish a
professional body by setting or raising the bar for admission too early would
obviously be self-defeating, if not suicidal, to that body. Having served as either
an ex officio or elected member of the American Evaluation Association (AEA)
board for several years, I can attest that AEA, like most associations for evaluators,
is appropriately more concerned at present with reaching out to attract new
members and retain current members than in establishing exclusive membership
criteria. Indeed, despite the presumption that those chosen to govern evaluation
associations are competent evaluators, there is no real check on their evaluation
qualifications, leading one to hope we never find ourselves in the regrettable
position of Groucho Marx, who once mused that he invariably refused to join any
group that would have him as a member.
In addition to this admittedly pragmatic reason that no meaningful member-
ship criteria exist for evaluation associations, there is another rationale that is
both pragmatic and idealistic, and that is that professional associations for
evaluators can and do play a vital role in training new evaluators through (1) the
services and learning opportunities they provide for novice student members
through their journals and conferences and (2) the opportunities students have
to be exposed to (and socialized by) experienced evaluators. The importance of
this function is underscored by the way the National Science Foundation (NSF)
has used both AEA and the American Educational Research Association as pillars
under and resources to the several doctoral training programs for evaluators it
recently funded.
In summary, it is too early in the field's evolution for evaluation associations to attempt
to establish any restrictive criteria for membership. Any such efforts would
clearly be foolhardy until evaluation associations are much larger and stronger,
and have thought through ways to do so without sacrificing critical services they
offer both evaluation students and regular members.
One of Bickman's first acts as AEA president was to form an accreditation task
force, headed by AEA Board member Bill Trochim. This group's efforts included
involving AEA's membership in establishing accreditation standards and
development of a proposed set of guidelines and a procedures manual (Trochim
& Riggin, 1996). Bickman developed a process to share these drafts with the
membership for comment after the AEA Board considered them. It appears that
somehow the momentum of this effort was lost, however, and plans to carry the
draft forward to the AEA membership have evaporated.
That is unfortunate, for even though I found some aspects of the proposed
process contrary to what I believe best practice and the limited evaluation
training literature would suggest, it would be regrettable if other insightful and
creative aspects of Trochim and Riggin's proposal were lost. And quite aside
from specifics of their particular proposal, their arguments for moving forward
with accreditation of evaluation training programs are compelling, and would be
more so if there were more evaluation training programs (as I will address
shortly). Logically, accreditation is an important precursor to evaluation certi-
fication because of the role it would play in establishing the essential content for
evaluation. Similarly, accreditation typically precedes all credentialing efforts,
where only coursework or practicum experience gained in an accredited program
is counted.
Accreditation also has the advantage of being a much less controversial process
than certification. Decisions to deny accreditation normally lead to deep regret
on the part of evaluation trainers, whereas decisions to deny evaluators
certification could well lead to lawsuits.
Yet there are possible explanations for why more voices haven't joined
Bickman's call for accreditation or picked up on Trochim and Riggin's proposed
accreditation system. One is that the number of genuine evaluation training
programs may be too small to warrant the cost and time of setting up the proposed
system. In conducting an international survey of evaluation training programs to
use in developing a directory of such programs, Altschuld, Engle, Cullen, Kim,
& Macce (1994) reported that their survey had turned up only 49 programs, and
the term evaluation appeared in the titles of only 25. While that is not the litmus
test for whether a program really intends to prepare evaluators, the definition of
evaluation used in the survey instrument "gave respondents a great deal of
latitude in depicting their program" (Altschuld et al., 1994, p. 77). The resulting
directory includes programs that consist of only a couple of introductory
methods courses and others that include only such courses plus a policy course
or two, hardly what most would think of as well-rounded preparation in
evaluation. The directory also includes at least one university which NSF
selected in 1996 as one of four "Evaluation Training Programs," then deferred
funding for that institution for one year because there were no evaluation experts
on the faculty. NSF released funds to that university when it hired an evaluator
to work in the program.
These comments are not intended as criticism of Altschuld and his colleagues,
for they accomplished a daunting task as well as circumstances and constraints
permitted. Rather, they are intended merely to suggest that although 49 programs
responded to this survey - thus reflecting that they see themselves as somehow
involved in producing graduates who may serve in evaluation roles - the number
of such programs which would ever be likely to notice or even care if they were
listed as accredited evaluation training programs may really be quite small,
perhaps only 20 or so. This is hardly the critical mass of programs that have
traditionally spawned accreditation efforts. Further, many of the programs in the
directory are in disciplines and fields not traditionally much involved with
accreditation, and they are likely to consider accreditation an intrusion into their
disciplinary freedom.
If these hunches are even partially accurate, it may explain why there has been
no clamor to implement an accreditation mechanism for evaluation training
programs at this time. That is unfortunate in that one advantage to moving
forward more quickly with accreditation than with certification is that it is a
much easier system to implement, and success in this arena would show that
AEA can handle a process closely related to certification, thus lending credibility
to any subsequent certification efforts AEA may propose. Which leads us into
the next, decidedly thornier, thicket.
The need for some type of certification, credentialing, or licensure for evaluators
has been, in my opinion, patently clear for the past three decades (Worthen,
1972). For all the reasons cited earlier, a call to certify or license evaluators
seems eminently sensible, at least on its surface. Requiring evaluators to acquire
essential evaluation competencies before beginning to practice evaluation seems
an obvious way to prevent well-intentioned but naive novices from committing
inadvertent evaluation malpractice. Until some accepted form of certification,
credentialing, or licensure is established, evaluation will not progress beyond its
present status as a near-profession. Without it, evaluation clients will continue to
lack reliable information about the evaluation competencies of prospective
evaluators.
At this point, the careful reader may well be murmuring, "But I thought earlier
you argued strongly for keeping the gate to professional evaluation associations
wide open so as to keep their numbers and funding base at a viable level. Now
you seem to want to fight malpractice via certification, credentialing, or licen-
sure, which would require that you narrow and monitor the evaluation gate very
closely. Aren't you guilty of a bit of doublespeak?" Perhaps, although I could
possibly lessen the dilemma by noting that my earlier point referred to qualifi-
cations for belonging to an evaluation association, whereas here my focus is on
qualifications for practicing evaluation, which are not necessarily synonymous.
(After all, there are several thousand members of the American Educational
Research Association whom I suspect have never conducted a piece of research.)
But if that distinction is strained, and my two arguments still seem to teeter on
the brink of doublespeak, then that reflects the conundrum we presently face in
our field. We face the practical reality of keeping the gate to evaluation
associations wide enough for them to be viable in advancing the field of evaluation,
while also finding ways to narrow the gate to evaluation practice sufficiently to
assure that evaluation practitioners are competent to ply their trade. The tension
this produces is one of the challenges associated with certifying the competence
of individual evaluators.
Before proceeding further to discuss certification, credentialing, and licensure
as three integrally related ways to try to assure evaluators' competence, we
should define and differentiate more clearly among these three processes. As
used in this chapter, in relation to evaluation:
Credentialing refers to the fact that a person has studied a field whereas
certification indicates that the individual has attained a certain level of
knowledge and skill in it .... To be credentialed an individual must com-
plete a set of courses and/or field experiences or practicums. To be certified
their level of skill must be ascertained (Altschuld, 1999b, pp. 507-509).

Certification deals with an individual's ability to perform professional
duties and licensure represents the legal right to perform them. A person
may be certified ... based on specific competencies and skills as deter-
mined by a certification board or professional association, but not licensed
to practice (legally prevented from doing so) for a variety of reasons.
Using medicine as an illustration, a physician might be certified to
perform a certain speciality, but not permitted to do so because of fraud,
illegal prescriptions for drugs, and so forth (Altschuld, 1999a, p. 482).

In arguing the necessity of the second of those three for evaluation generally,
and for AEA specifically, Bickman (1997) positioned himself as a vigorous
champion of certification.
Back in simpler days (or perhaps it was only I who was simpler) it seemed quite
feasible to identify and agree upon a core of critical competencies needed by all
evaluators; indeed, a review of early writings shows that their authors were much
more unified in their views about what evaluation is, and the differences that
existed were not of a magnitude that posed any real obstacle to thinking about
basic competencies that might reasonably be expected of all evaluators.
But then there was a series of major shifts in evaluation thought, and the past
quarter century has been marked by proliferation and legitimization of a plethora
of new paradigms of evaluative inquiry. The specific evaluation knowledge and
skills that tend to be associated with each of these various paradigms are
increasingly seen as a legitimate part of the repertoire of some evaluators, albeit
not all evaluators. And because the philosophical and methodological tenets of
several of these paradigms are virtually opposed to one another, there is no
longer even an illusion that the field of evaluation can claim a unified content -
a body of knowledge and skills that would be agreed upon as essential by even a
majority of those who call themselves evaluators.
While we thankfully seem to be healing some of the most divisive and unpro-
ductive paradigm schisms, we are doing so more by enjoining tolerance and
eclecticism than by unifying around what we deem to be essential evaluation
knowledge or skills. Indeed, evaluation is becoming so splintered that we evalua-
tors can no longer even agree on what evaluation is. With so little agreement
about the methods and techniques - or even the paradigms - evaluators should
use, it will be difficult to get a majority of practicing evaluators to agree on any
common template for judging all evaluators' qualifications or even on the variety
of templates of various sizes, shapes, and properties that would be required for
separate certification of evaluators associated with the myriad evaluator
specialities.
Given this situation, proposed evaluator certification systems face serious
challenges that, though in my opinion not impossible to overcome, are daunting
enough to suggest that a careful, incremental climb over the barriers we see
ahead seems more likely to succeed than would a rash headlong effort to leap
them in one bound. Later I reference an earlier proposal for a patient, incre-
mental approach that offers some hope here, I think, although it is admittedly
ambitious and will require at least a modicum of trust on the part of all involved.
That brings us to the second major cultural challenge certification faces today.
In such a context, it would be naive to believe that any system that certified
evaluators could operate long before some disgruntled applicant who failed a
certification test would mount a legal challenge to the process that led to that
decision. Indeed, in today's litigious environment, it would be foolhardy for any
evaluation association to plunge into any certification effort without complete
confidence that it is fully defensible on both professional and legal grounds.
Which suggests, for reasons outlined in the next section, that we should move
cautiously and incrementally rather than attempting to leap immediately to a
full-blown certification effort.
ENDNOTES
1 Part of this chapter draws on an earlier treatment of critical challenges confronting efforts to
establish certification systems for evaluators (Worthen, 1999).
2 At least some colleagues in evaluation have privately confided that they are not at all sure we
should pursue this aspiration, feeling it could lead us toward standardization of evaluation philo-
sophy, approaches, and methods, if we adopted a view of profession that is simplistic and restricted.
3 Discussions of whether evaluation qualifies as a discipline or should rather be termed a transdis-
cipline appear in Scriven (1991), Worthen and Van Dusen (1994), and Worthen, Sanders, and
Fitzpatrick (1997).
4 Worthen (1999) contains an extensive treatment of problems and potential in addressing each of
these challenges.
5 Those who might find a comprehensive discussion of these items interesting are referred to
Chapter 2 ("Coming to Grips with Current Social, Legal, and Ethical Issues in Measurement") of
Worthen, White, Fan, and Sudweeks (1999).
6 Readers should note also Sawin's (2000) thoughtful effort to clarify and improve upon my
suggestions.
REFERENCES
Altschuld, J.W. (1999a). Certification of evaluators: Highlights from a report submitted to the board
of the American Evaluation Association. American Journal of Evaluation, 20(3), 481-493.
Altschuld, J. W. (1999b). A case for a voluntary system for credentialing evaluators. American Journal
of Evaluation, 20(3), 507-517.
Altschuld, J.W., Engle, M., Cullen, C., Kim, I., & Macce, B.R. (1994). The 1994 directory of evaluation
training programs. In J.W. Altschuld, & M. Engle (Eds.), The preparation of professional
evaluators: Issues, perspectives, and programs. New Directions for Program Evaluation, 62, 71-94.
Bickman, L. (1997). Evaluating evaluations: Where do we go from here? Evaluation Practice, 18(1),
1-16.
Bickman, L. (1999). AEA, bold or timid? American Journal of Evaluation, 20(3), 519-520.
Chelimsky, E. (1994). Evaluation: Where are we? Evaluation Practice, 15, 339-345.
House, E.R. (1994). The future perfect of evaluation. Evaluation Practice, 15, 239-247.
Jones, S., & Worthen, B.R. (1999). AEA members' opinions concerning evaluator certification.
American Journal of Evaluation, 20(3), 495-506.
Love, A.J. (1994). Should evaluators be certified? In J.W. Altschuld, & M. Engle (Eds.), The prepa-
ration of professional evaluators: Issues, perspectives, and programs. New Directions for Program
Evaluation, 62, 29-40.
Merwin, J.C., & Weiner, P.H. (1985). Evaluation: A profession? Educational Evaluation and Policy
Analysis, 7(3), 253-259.
Patton, M.Q. (1990). The challenge of being a profession. Evaluation Practice, 11, 45-51.
Rossi, P.H., & Freeman, H.E. (1985). Evaluation: A systematic approach (3rd ed.). Beverly Hills,
CA: Sage.
Sawin, E.I. (2000). Toward clarification of program evaluation: A proposal with implications for the
possible certification of evaluators. American Journal of Evaluation, 21(2), 231-238.
Scriven, M. (1991). Introduction: The nature of evaluation. Evaluation Thesaurus (4th ed.) Newbury
Park, CA: Sage.
Shadish, W.R., Cook, T.D., & Leviton, L.C. (1991). Foundations of program evaluation. Newbury Park,
CA: Sage.
Smith, M. (1999). Should AEA begin a process for restricting membership in the profession of
evaluation? American Journal of Evaluation, 20(3), 521-531.
Trochim, W., & Riggin, L. (1996). AEA accreditation report. Paper presented at the meeting of the
American Educational Research Association, Chicago.
Worthen, B.R. (1994). Is evaluation a mature profession that warrants the preparation of evaluation
professionals? In J.W. Altschuld, & M. Engle (Eds.), The preparation of professional evaluators:
Issues, perspectives, and programs. New Directions for Program Evaluation, 62, 3-16.
Worthen, B.R. (1999). Critical challenges confronting certification of evaluators. American Journal of
Evaluation, 20(3), 533-555.
Worthen, B.R., Jones, S.C., & Goodrick, D. (1998). What are we really learning - and not learning -
from our evaluation journals? Paper presented at the 19th Annual Conference of the Canadian
Evaluation Society, St. John's, Newfoundland.
Worthen, B.R., & Sanders, J.R. (1991). The changing face of educational evaluation. In J.W.
Altschuld (Ed.), Educational evaluation: An evolving field. Theory Into Practice, 30(1), 3-12.
Worthen, B.R., Sanders, J.R., & Fitzpatrick, J.L. (1997). Program evaluation: Alternative approaches
and practical guidelines. White Plains, NY: Longman.
Worthen, B.R., & Van Dusen, L.M. (1994). The nature of evaluation. In T. Husen, & T.N.
Postlethwaite (Eds.), International Encyclopedia of Education (2nd ed.), Vol. 4, 2109-2119. Oxford,
England: Pergamon.
Worthen, B.R., White, K.R., Fan, X., & Sudweeks, R.R. (1999). Measurement and assessment in
schools (2nd ed.). New York: Addison Wesley Longman.
16
The Evaluation Profession and the Government
LOIS-ELLIN DATTA1
Datta Analysis, Waikoloa
INTRODUCTION
This, too, could be plural. The evaluation profession inheres in its professional
associations, such as the American Evaluation Association; in the organizations
which practice it, such as for-profit evaluation consultant companies; in the
institutions which develop and transmit evaluation knowledge and skills, such as
colleges and universities and professional development training organizations;
and in the individuals who profess it, as practitioners, teachers, and writers. And
it inheres in its artifacts: its journals, conferences, reports, and standards and
guiding principles. One could examine the relationship individually in all these
areas, but for the purposes of this chapter the evaluation profession will mean
both evaluators and their products.
Caveat Emptor
These definitional choices make for analytic focus. Their price, however, is some
notable limitations. One limitation in this analysis, for example, is that lessons learned
from these choices may not be representative of conclusions that might have
been reached had the focus been local governments or foundations. Some foun-
dations have become in loco federalis in funding large demonstrations intended
to change national practice. They probably are having a correspondingly large
influence on the evaluation profession but it may be in quite different directions
than that of the federal government. One could study, for example, the impact
on the evaluation profession in the health area of the Robert Wood Johnson
Foundation, the Kellogg Foundation, the Colorado Trust, and many more.
Another limitation is that experiences in other nations and settings could
differ profoundly from those in the United States, particularly settings which
have fewer evaluation resources, programs of a different scale, or evaluation
influences that include more external agencies such as the World Bank and the
International Monetary Fund. Yet another potential limitation is that informa-
tion on the evaluation profession in some important governmental sectors is
restricted due to privacy and security concerns, such as evaluations of certain tax
programs and programs in the defense sector of government.
Thus, this chapter is a work-in-progress. It may be, however, among the
relatively few analyses of the ways in which the government and the evaluation
profession have influenced each other. (For prior analyses primarily in the area
of education, see House, 1993; Worthen & Sanders, 1987; and Worthen, Sanders,
& Fitzpatrick, 1997.)
Using the definitions noted above to focus this discussion, I see eight types of
governmental actions that have influenced the evaluation profession and three
ways in which the evaluation profession has influenced the government.
Over the past 35 years, governmental actions have influenced the evaluation
profession in just about every possible way. The eight influences include: (1)
demands for internal evaluation within government; (2) demands for internal
evaluations among recipients of federal support; (3) demands for external
evaluations among recipients of federal support; (4) specification of methods,
designs, and measures to be used; (5) support for the development of evaluation
as a profession; (6) creation of opportunities for employment of evaluators; (7)
leadership for evaluation emulated by nongovernmental entities such as private
foundations; and (8) support of evaluation capacity such as evaluator training,
development of evaluation standards and a definition of evaluation, and a
professional infrastructure. Each of these is examined here with examples.
(1) Influences on Demand for Internal Evaluation within Government
Prior to 1975, there were few highly-placed internal evaluation units in federal
cabinet departments. That changed with the passage of the Inspector General
Act of 1978 (Siedman, 1989). All federal cabinet departments (ministries) now
have an Office of the Inspector General. The office is intended to be an inde-
pendent watchdog, a reviewer of department program operations and results,
protected against politicization, a department-level parallel to the U.S. General
Accounting Office (GAO). These offices, like radar, are constantly scanning,
evaluating, and assessing government operations. The results of their evaluative
studies are used internally for program improvement, are seized upon quadren-
nially when administrations change as part of the road map to government
improvement, and, particularly, are used in Congressional hearings as part of
broad oversight.
These studies include more than financial audits. In 1979, leaders in GAO,
under the guidance of then Comptroller General Elmer Staats, decided to
incorporate program evaluation as part of GAO's tool chest for oversight studies.
In 1980, GAO's Institute for Program Evaluation, later designated as the Program
Evaluation and Methodology Division (PEMD), began its work. PEMD, in turn,
focused on demonstrating and promoting extensive use of program evaluation
methodology for program results, effects, and impact studies. As Chelimsky (1994)
reported, incorporating evaluation units in an auditing agency is not smooth sailing;
however, evaluation principles and methods have been at least partly infused
into GAO's own studies. Seen as a leader in government evaluation, PEMD's
work has added to other influences on the methodologies of the Offices of the
Inspector General, creating one of the larger bodies of internal evaluative work.
(2) Influences on Demand for Internal Evaluation for Recipients of Federal Funds
(3) Influences on Demand for External Evaluation for Recipients of Federal Funds
In the 1960s, for example, the National Institute of Mental Health (NIMH) received
enthusiastic support for evaluation and dissemination from Senator Harrison
Williams. Susan Salasin and Howard Davis, evaluators who were close colleagues
of Senator Williams and of the well-known evaluator Marcia Guttentag, were at that time
leading NIMH work on evaluation utilization and dissemination. Together,
Williams, Guttentag, Salasin, and Davis expanded the importance of program
evaluation in the mental health field. NIMH resources funded many conferences
and workshops in the area of mental health which brought evaluators together.
Further, Salasin and Davis published a professional-caliber evaluation journal
before current evaluation journals were well established. The first national
evaluation conferences held in the District of Columbia and Chicago were
generously sponsored through awards from this NIMH evaluation group, in part
directly and in part indirectly through awards to evaluators to participate in the
meetings and workshops. Without such support, the field might have developed
considerably more slowly.
An example of an indirect influence was the decision in the 1960s by Clark Abt
of Abt Associates to bring together leaders in the field of evaluation to discuss
common issues and problems associated with the design, implementation, and
analysis of major national evaluations funded through RFPs by the federal govern-
ment. Absent these RFPs, the resources to bring the leaders together would
have been lacking. The meeting led to the formation of the Evaluation Research
Society (ERS). Later, Phi Delta Kappa International (a professional association)
supported the development of The Evaluation Network (EN), an evaluation
association whose early membership came primarily from the areas of justice,
education, and housing. The development of the ERS, EN, and other evaluation
organizations in turn influenced publishers such as Pergamon Press, Sage, and
Jossey-Bass to support journals and publish books on evaluation methodology.
The government also influenced issues discussed at evaluation association
meetings. A scan of some of the annual programs of the American Evaluation
Association (and one of its earlier source organizations, ERS) showed that about
25 percent to 40 percent of the sessions (between 1975 and 1995) were based on
the major national evaluations funded by the government or on the more decen-
tralized evaluations required by national funding. At present, relative influence
of the government may be shifting as foundations become major evaluation
funders, and international organizations participate more extensively in the
conferences.
The federal government, in part, follows the "pilot studies" of state, local, and
foundation initiatives and, in part, leads them. What the government requires and
does is one influence. Most states have an Office of the Auditor or Inspector
General; many have evaluation units. Program evaluation tends to be part of the
methodological repertoire of the audit units, rather than freestanding. Across
the 50 states, and in the hundreds of cities with equivalent units, there is an
enormous body of evaluation activities. These tend to follow the federal govern-
ment emphases and requirements, both to comply with conditions of federal
moneys and to service their own legislative and administrative demands.
Another influence is through the leadership of evaluators in the government.
It is no coincidence that in its initial distinguished award - the Alva and Gunnar
Myrdal award - the Evaluation Research Society recognized equally leadership
in evaluation in government and leadership in evaluation in non-government
sectors. The Myrdal award has been received by internationally recognized
evaluators such as Joseph Wholey and Eleanor Chelimsky. Chelimsky was the
Director of the Program Evaluation and Methodology Division of the U.S.
General Accounting Office (GAO) and twice president of the American
Evaluation Association. Her articles are among the most widely cited on topics
such as evaluation and politics.
In 1988, the GAO urged that the budget process be more tightly connected to
evidence that federal programs work, i.e., to program results. Richard Darman,
then Director of the Office of Management and Budget, was familiar with and
sympathetic to the use of evaluations. When then Comptroller General Charles
Bowsher and PEMD Director Eleanor Chelimsky met with Darman, his
question was not whether to do evaluation, but how. This 1988 meeting and follow-
up papers were among the influences leading to the Government Performance
and Results Act (GPRA) of 1993, which mandated presentation of program results data
for all major programs. GPRA may stand as one of the single largest influences
on the evaluation field, focusing attention on outcomes, results, and impacts.
As public sector funding declined and private sector funding increased in areas
such as human services, accountability became increasingly valuable in improving
Though often indirectly, the federal government has influenced evaluation capacity,
e.g., evaluator training, development of evaluation standards and definitions of
evaluation quality, and a professional infrastructure. For example, the widely
accepted Program Evaluation Standards (Joint Committee, 1994) were
developed in part with federal support. Federal support of centers for evaluation
has waxed and waned, and most centers now have many funding sources.
However, an initial impetus and in some instances long-term support has made
for internationally visible centers specifically designated for evaluation. Having
such centers traditionally has been seen as essential for capacity building.
In addition, some agencies have directly funded evaluator training. For
example, the National Science Foundation currently encourages the training of
evaluators in the areas of science, mathematics, engineering, and technology
through support of university degree programs in several states and through
support of short term professional development opportunities like those
provided by The Evaluators' Institute 2 and the annual NSF-sponsored Summer
Institute at the Western Michigan University Evaluation Center. Klein reminds
us of the Educational Research and Development personnel program, supported
between 1971 and 1974 by the National Institute of Education (NIE):
This program helped develop the emerging fields of evaluation and
dissemination. For example, NIE funded a big evaluation training consor-
tium under Stufflebeam and evaluation materials development projects by
Scriven, Popham and Baker. The Far West Laboratory had a Research,
Development, Demonstration and Evaluation training consortium, which
developed in-service training materials for Laboratory staff and others,
using competence assessment and practicum strategies. (S. Klein,
personal communication, 2000)
In general, government support for evaluation infrastructure development has
tended to be methodologically broad, but focused on an area such as education,
or on a perceived need, such as authentic measurement in education or quali-
tative techniques. This has given the evaluation profession more scope than a
more highly structured RFP would. For example, two centers in the area of educational
evaluation have received fairly extensive and consistent federal support, one for
as long as 35 years. The Center for the Study of Evaluation located at the
University of California at Los Angeles and now directed by Eva Baker and
Robert Linn has changed names several times, but has consistently focused on
issues of evaluation design, utilization, and assessment. The Center has been part
of a national network of laboratories and centers funded since 1964 by the U.S.
Department of Education in part to build and sustain evaluation capacity. Other
centers that received fairly sustained federal funding and devoted part of their
efforts to evaluation include facilities at Northwestern University, formerly
under the leadership of the late Donald Campbell, The Evaluation Center at
Western Michigan University under the leadership of Dan Stufflebeam, and
CIRCE (Center for Instructional Research and Curriculum Evaluation) at the
University of Illinois, Champaign, under the leadership of Robert Stake.
The evaluation profession has not been a passive recipient of these tidal waves
of influence. What the government believes should be done in all of the areas
described above is influenced by the profession itself. For example, the ideas
reflected in the government's use of and criteria for evaluation come from a
number of sources, the most notable of which have been professors from acade-
mia who sit on innumerable evaluation advisory panels and who train staff who
serve in internal offices of governmental agencies. Another, albeit less certain,
influence on the government is that of evaluation results found in published
literature.
From the beginning of the evaluation movement, all major evaluations have
been guided by panels of experts from academia who give advice on evaluation
designs, methodologies, measures, and analyses. The advice of these experts is
often reflected in the RFPs and in the regulations developed by federal agencies.
Evaluators sit on many important panels, such as the experts who meet as a
standing body to advise the Comptroller General and those who serve on many
of the National Research Council panels convened as authoritative, impartial
advisory bodies to answer questions of national concern. Much of what emerges
as a federal requirement, mandate, design, or decision reflects strongly the
influences of these academics.
For example, Boruch and Foley (2000) describe direct academic influences on
evaluation design by the National Academy of Sciences Panel on Evaluating
AIDS Prevention Programs, conferences of the National Institute of Allergy and
Infectious Diseases, the National Research Council's Panel on Understanding and
Control of Violent Behavior, and the National Institutes of Health conference
on Data Analysis Issues in Community Trials. As another example, in re-
authorizing legislation for the Department of Education, Congress required the
Department to seek out, select, and recognize programs that were exemplary in
their results in areas such as gender equity, mathematics instruction, and
education for the handicapped. The Department created a large, complex system
of expert panels and commissioned many papers on issues of evaluating evidence
to recognize exemplary and promising programs. The advisors further recom-
mended elements of federal regulations specifying how the system would work.
Virtually all the decisions for this system were made by these panels of experts
and the panel system itself was built not in Washington, but through the decisions
and advice of the experts themselves (Klein, 2000).
Evaluators who work for the federal government - writing the RFPs, drafting
regulations, shaping evaluation requirements in grants, and funding evaluations
- come from colleges and universities that provide courses and award degrees in
program evaluation. Leaders in the academic evaluation community have enor-
mous influence on the federal government through the students they train who
later accept federal employment.
Another influence, well documented, has been through former Senator Daniel
Moynihan, himself a distinguished social scientist from Harvard University. In
the late 1960s and early 1970s, while serving at the White House as a domestic
policy advisor, he energetically supported the use of evaluation studies for policy-
making and has consistently been an advocate for grounding policy in federally
supported social science research.
An uncertain influence is that of published results of evaluation studies on
government action. It has been a matter of some controversy whether, how, and
how much evaluations affect action. Weiss (1998) has seen the influences as tidal
or illuminatory, rather than immediate. Chelimsky (1998), focusing on national
studies and national decisions, argues that effects can be direct and notable.
Certainly the GAO evaluation reports, which often contain specific recommen-
dations, are systematically tracked by the agency to document impacts. These
analyses indicate that most recommendations are acted upon by the agencies,
and overall, yield savings annually that far outweigh the costs of producing the
studies. Other analyses show broad policy impacts, such as the effect of the GAO
Chemical Warfare reports (U.S. General Accounting Office, 1983; Chelimsky,
1998) or the well-known Westinghouse-Ohio evaluation of Head Start
(Campbell & Erlebacher, 1970) that closed down summer programs or the
effects of state level studies of waivers that contributed to passage of the major
policy shift in welfare reform (Gueron, 1997). Perhaps equally interesting is the
influence of evaluations on meta-policy on the support for evaluation in the
government and on methodologies and measures. This has not been studied
systematically. Observations suggest that key legislative aides are well aware of
issues in evaluation methodology, and have become sophisticated consumers of
evaluation, the evaluation journals, and of evaluative studies.
Overall then, since at least 1964-65, the federal government has had consider-
able influence on evaluation as a profession, through creating demand, providing
employment, supporting training, and showing the way for state, local, and
private sectors. And the evaluation profession has had considerable influence on
the government, through the recommendations of academic experts, training
evaluators who later work within the government, and theoretical and method-
ological developments within the profession itself. There are some lessons to be
learned about how well these mutual influences have worked.
Where the evaluation profession is concerned, two considerations seem
particularly important: an insufficient awareness of nuances in evaluation design
and undue influences torquing the field. An example with regard to nuances of
design is a GAO report that blasted prior Head Start evaluations as inconclusive
because randomized designs were not used (GAO, 2000). Congress mandated a
national evaluation requiring examination of child development in treatment
and non-treatment conditions to test whether Head Start works (Head Start
Amendments of 1988). In its RFP, Head Start called for a randomized national
experiment. This is an instance, in my opinion, where there may be no meaning-
ful control group possible in part because of the demands for childcare created
by welfare reform. Procedures are available to ameliorate this problem, e.g.,
studying control, comparison, and treatment group experiences in depth. Then
comparisons could sort out true control conditions where children received no
service of any kind from experiences that may approximate Head Start to varying
degrees, an approach used by Abt in its tests of the effectiveness of Comer's
approach (Datta, 2001).
With regard to torquing the field, infusion of a great deal of money, mandates,
and interest can draw attention away from areas of practical and methodological
importance for evaluation. As a result, evaluation methodology may under-
invest in areas of developing significance or miss out entirely in grappling with
some issues that could be of benefit to development of more robust evaluation
theory and methodology.
For example, almost all evaluations have focused on direct expenditure
programs - entitlements, grants, demonstrations - often in the human service
areas. This is in spite of the fact that as much money is spent through tax
expenditures as through direct expenditures, the former of which typically
benefits corporations and wealthier individuals. There has been little discussion
in the evaluation literature on the methodology for assessing results, impacts,
and outcomes of tax expenditures. That has been left to the economists even
though there is equal need for credible evidence of program results, attribution,
and value and worth, for programs funded through tax expenditures and those
funded through direct expenditures. Programs from both expenditure sources
should have equal claims on methodological attention from evaluation theorists
and practitioners.
In most instances, government and the evaluation profession have both bene-
fited where the influences have been mutual. Perhaps the most salient lesson
learned is the benefit of a healthy circulation between federal and nonfederal
sectors, so that important developments in the field of evaluation, such as theory-
based approaches, can infuse federal evaluation thinking where appropriate. A
limitation in this circulation of personnel is that the job transitions from outside
to inside government have been more common than the reverse. It seems easier
and more common for an evaluator in academia to move to a government
position than for federal evaluators to find comparable positions in academia
unless these federal evaluators have been unusually diligent in maintaining a
strong publications record and have been acquiring teaching expertise. However,
even though personnel from government are not as likely to move to academia,
some who work in academia move to the government for short periods of time,
e.g., on temporary assignments. They may return to their university jobs with a
better understanding of evaluation from a government perspective. So academia
may be changed even though the number of persons moving there from govern-
ment may not be as great as the number moving in the other direction.
Where this circulation has occurred - as in the example of Senator Moynihan,
in the movement of some alumni of key federal agencies to important academic
positions, and through Congressional and other fellowships bringing evaluation
theorists to the federal government for sustained periods - the benefits to both
theory and practice have clearly been important and mutually positive for
our profession.
SUMMARY
This analysis has documented at least eight ways in which the profession has
been influenced by one particular funder, the U.S. federal government. One
could attempt to grasp the extent and depth of the influence by imagining what
the profession would be like if, since about 1960, there were no federal influence
on evaluation, i.e., no federal requirement for internal use of evaluation; no
external funding requirements; no support of training, publications, associations,
standards development; no grants and RFPs; no specification of designs,
measures and methodology.
While federal influence has been monumental, I believe that it has itself been
shaped by nongovernmental forces. Organizations and individuals in the field of
evaluation have influenced the nature and content of many federal actions
through the advice of academic experts, through the federal evaluation staff they have trained,
and through their written books, articles, and reports. I see these influences as
benefiting government (the people) and the profession (evaluators). However, it
seems well worth systematic examination, as part of our development as a
profession, to determine how our essence is being affected by different types of
funders. It may be that the influences of one type of funder, which could limit our
essence, are counteracted by the influences of others, whose mandates and
concerns move in different ways. Nonetheless, we should be aware of where the
influences have been benign; where they have been limiting; and how our
profession can grow even more fully into its own.
ENDNOTES
1 Many thanks to Drs. Ella Kelly, Sue Klein, Midge Smith, Daniel Stufflebeam and Blaine Worthen
for their help. Special thanks to Liesel Ritchie.
2 www.EvaluatorsInstitute.com
REFERENCES
Bernstein, D.J. (1999). Comments on Perrin's "Effective Use and Misuse of Performance
Measurement." American Journal of Evaluation, 20(1), 85-94.
Boruch, R., & Foley, E. (2000). The honestly experimental society: Sites and other entities as the
units of allocation and analysis in randomized trials. In L. Bickman (Ed.), Validity and social
experimentation: Donald Campbell's legacy (Vol. 1, pp. 193-238). Thousand Oaks, CA: Sage.
Campbell, D.T., & Erlebacher, A. (1970). How regression artifacts in quasi-experimental evaluations
can mistakenly make compensatory education look harmful. In J. Hellmuth (Ed.), Compensatory
education: A national debate: Vol. 3. The disadvantaged child (pp. 185-210). New York:
Brunner/Mazel.
Chelimsky, E. (1994). Making evaluation units effective. In J.S. Wholey, H.P. Hatry, & K.E.
Newcomer (Eds.), Handbook of practical program evaluation (pp. 493-509). San Francisco: Jossey-
Bass.
Chelimsky, E. (1998). The role of experience in formulating theories of evaluation practice. American
Journal of Evaluation, 19(1), 35-58.
Datta, L. (2001). Avoiding death by evaluation in studying pathways through middle childhood: The Abt
evaluation of the Comer approach. Paper presented at the MacArthur Foundation Conference on
Mixed Methods in the Study of Childhood and Family Life. Santa Monica, CA.
Gueron, J.M. (1997). Learning about welfare reform: Lessons from state-based evaluations. In D.J.
Rog, & D. Fournier (Eds.), Progress and future directions in evaluation: Perspectives on theory,
practice, and methods. New Directions for Evaluation, 76, 79-94.
House, E.R. (1993). Professional evaluation: Social impact and political consequences. Newbury Park:
Sage.
Huebner, R.B., & Crosse, S.B. (1991). Challenges in evaluating a national demonstration program
for homeless persons with alcohol and other drug problems. In D. Rog (Ed.), Evaluating programs
for the homeless. New Directions for Program Evaluation, 52, 33-46.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards:
How to assess evaluations of educational programs (2nd ed.). Thousand Oaks, CA: Sage.
Klein, S. (2000). The historical roots of the system of expert panels. Washington, DC: U.S. Department
of Education.
Orwin, R.G., Cordray, D.S., & Huebner, R.B. (1994). Judicious application of randomized designs.
In K.J. Conrad (Ed.), Critically evaluating the role of experiments. New Directions for Program
Evaluation, 63, 73-86.
Perrin, B. (1998). Effective use and misuse of performance measurement. American Journal of
Evaluation, 19(3), 367-379.
Perrin, B. (1999). Performance measurement: Does the reality match the rhetoric? A rejoinder to
Bernstein and Winston. American Journal of Evaluation, 20(1), 101-114.
Siedman, E. (1989). Toothless junk-yard dogs? The Bureaucrat, Winter, 6-8.
U.S. General Accounting Office. (1983). Chemical warfare: Many unanswered questions. Washington,
DC: U.S. General Accounting Office, GAO/IPE 83-6.
U.S. General Accounting Office. (1997). Performance budgeting: Past initiatives offer insights for GPRA
implementation. Washington, DC: U.S. General Accounting Office, GAO/GGD-97-46.
U.S. General Accounting Office. (2000). Preschool education: Federal investment for low-income
children significant but effectiveness unclear. Washington, DC: U.S. General Accounting Office.
GAO/T-HEHS-00-83.
United Way of America. (1996). Focusing on program outcomes: A guide for United Ways. Alexandria,
VA: United Way of America.
The Evaluation Profession and the Government 359
Weiss, C. (1998). Have we learned anything new about the use of evaluation? American Journal of
Evaluation, 19(1), 21-34.
Winston, J. (1999). Understanding performance measurement: A response to Perrin. American
Journal of Evaluation, 20(1), 85-94.
Worthen, B., & Sanders, J. (1987). Educational evaluation: Alternative approaches and practical
guidelines. New York: Longman.
Worthen, B.R., Sanders, J.R., & Fitzpatrick, J.L. (1997). Program evaluation: Alternative approaches
and practical guidelines (2nd ed.). New York: Longman.
U.S. General Accounting Office. (1998). Performance measurement and evaluation: Definitions and
relationships. Washington, DC: U.S. General Accounting Office. GAO/GGD-98-26.
17
The Evaluation Profession as a Sustainable Learning
Community
HALLIE PRESKILL
University of New Mexico, Albuquerque, NM, USA
After nearly four decades of study and practice, the evaluation profession
appears on the brink of moving from adolescence into early adulthood and to
becoming a "major social force" (House, 1994, p. 239). I believe this move into
adulthood signals not only another phase in the profession's life, but a significant
transformation in what it means to be an evaluator. In this chapter, I propose
that the evaluation profession must strive to become a Sustainable Learning
Community if it wishes to grow and develop in ways that add value and impor-
tance to organizations and society. In 1990, Patton suggested that for evaluation
to develop as a profession worldwide it "needs three things: vision, quality
products and processes, and skilled, trained evaluators" (p. 47). I will argue that
in addition to these general goals, there are other goals as significant and
potentially defining for the evaluation profession. These include:
Implicit in these goals are values and beliefs that together represent a philosophy
of what the profession might become. As can be seen in Table 1, these values and
beliefs reflect a commitment to inclusion, collaboration, learning, deliberation,
dialogue, communication, equity, and experimentation. If agreed upon, these
values and beliefs would guide the evaluation profession's growth, activities, and
members' roles and behaviors (see Table 1). Before discussing how the evalu-
ation profession might become a sustainable learning community, it is useful to
consider where the profession has come from and where it currently stands.
Over the last four decades, the evaluation profession has grown from being mono-
lithic in its definition and methods, to being highly pluralistic. It now incorporates
multiple methods, measures, criteria, perspectives, audiences, and interests. It
has shifted to emphasizing mixed method approaches in lieu of only randomized
control group designs, and has embraced the notion that evaluations are value-
based and, by their very nature, politically charged.
Finally, and perhaps most telling is how evaluators identify themselves.
Whereas in 1989 only 6 percent of all American Evaluation Association (AEA)
members declared evaluation as their primary discipline on their membership or
renewal applications (Morell, 1990), 32 percent of AEA members identified
"evaluation" as their primary discipline in 2000 ("Education" ranked second with
23 percent) (AEA internal document, July 2000).
In many ways, the growth of the evaluation profession has occurred in spurts
and unplanned, uncharted ways. Using the metaphor of human development
again, I believe that the evaluation profession is currently akin to the gangly
teenager, who has all of the parts, intelligence, and motivation, but is not quite
sure what to do with it all. As the teen moves towards adulthood, he or she begins
planning a future, making decisions that will affect his or her life. Given the
teen's level of maturity, however, there is a certain amount of ambiguity, lack of
clarity, and anxiety about what the future holds. It may be that the evaluation
profession is at a similar stage. It has a rich history, an immense amount of talent,
a great deal of motivation, and yet, it is unsure of where it is going or how it will
get there. Given this scenario, it seems reasonable to consider what the profes-
sion's next steps might be as we look toward the future. What would it mean for
Other authors write about the importance of shared values, vision, caring, creating
a common destiny, diversity, openness, and inclusion in developing communities
(Gardner, 1995; Jarman & Land, 1995; Whitney, 1995). Palmer's (1987) defini-
tion of community also focuses on relationships - not only to people, but "to
events in history, to nature, to the world of ideas, and yes, to things of the spirit"
(p. 24). What these definitions suggest is that the development of community is
not something that happens overnight; nor is it an event that has a beginning and
an end. Rather, it reflects an ongoing process that evolves and reconfigures itself
as members come and go and as times change. And, community is as much about
relationships as it is about location.
Grantham (2000) suggests that the value of community lies in its ability to help
us answer essential questions about our lives. These include:
• Who am I?
• What am I part of?
• What connects me to the rest of the world?
• What relationships matter to me in the world?
• Who am I as an evaluator?
• How and where do I fit in as a member of the evaluation profession?
• How am I connected to other evaluators and the profession in general?
• What relationships ground me in the profession? Which relationships are
important to me?
Ultimately, the answers to these questions may help the evaluation profession con-
sider what values it will exemplify. Such answers would not only define the who,
what, and why of the evaluation profession, but they would also communicate the
vision, mission, and future of the profession. Moreover, the answers would exter-
nalize the values, beliefs, and assumptions of the evaluation profession to the
larger society. Inherently, professional communities provide a sense of identifi-
cation, unity, involvement, and relatedness (Chelimsky, 1997; Grantham, 2000).
Sustainable Communities
• For individual evaluators, it means that learning about the theory and practice of
evaluation is an issue of engaging in and contributing to the dissemination of
what is learned with others in the evaluation community.
• For the evaluation community, it means that learning is an issue of continuously
refining our practice and ensuring that new generations of evaluators are
provided with the necessary evaluation knowledge and skills.
CONCLUSIONS
My passion for the evaluation profession and its future is in large part due to the
increasing recognition that evaluation is my professional home. As I've talked with
colleagues about our common evaluation interests and concerns, about our
evaluation friends, about our hopes for the profession, it has become clear to me
that the evaluation community is critically important to my life. As I look back
over my 20 years in the field, I realize that many of my best friends are other
evaluators, many of whom I've known since graduate school. And, like many of
us, I look forward to the AEA conference every year, because it is the only place
where people really understand what I do (many of us have laughed about how
hard it is to explain to non-evaluation friends and family our work as evaluators).
Somehow, we share a common bond that survives many miles of separation and
time between face-to-face visits. Yet, this distance does not diminish our sense of
community in the least. We still find meaningful ways to connect with
one another. In a recent conversation I had with an evaluation colleague/friend,
who was experiencing a difficult time in her life, and had received much support
from her other evaluator colleagues/friends, she said how wonderful it was to be
part of this evaluation community. Though it sounds terribly idealistic, it is this
kind of community I believe the evaluation profession should aspire to. If what
is presented in this chapter resonates with other evaluators, we must then ask the
question, "how do we continue to support and develop the evaluation profession
into a sustainable learning community so that new and veteran members of the
community feel that they belong and can contribute?" To this end, I challenge
the evaluation profession to begin a process, such as Appreciative Inquiry, to
determine the means for exploring the possibility and desirability of becoming a
sustainable learning community.
Returning to the metaphor of development with which I began this chapter,
there are certain questions we can ask to determine the extent to which the
I do not presume, even for a moment, that answering these questions will be a
simple task. But if the evaluation profession is to grow and prosper, and if we
want our work to add value to organizations and society in general, then we must
begin somewhere. The development of community is without a doubt a
complicated undertaking. It requires a delicate balance between autonomy and
connectedness, between self-identity and group identity. It cannot be imposed,
but must be co-created. As Levey and Levey (1995) suggest, "The strength of a
relationship or community can be measured by the magnitude of challenge it can
sustain. When faced with challenges, relationships and communities either come
together or fall apart" (p. 106). If we wish to embody the foundational principles
outlined earlier in this chapter, then we must embrace this struggle together and
help the profession grow into its adulthood.
M. F. SMITH
University of Maryland (Emerita), MD, USA; The Evaluators' Institute, DE, USA
Once in 1994 and twice in 2001, I had the opportunity to work with groups of
evaluators to think about the field or profession of evaluation. In 1994, while I
was editor of Evaluation Practice (EP),1 I invited 16 eminent leaders in the field
to "take a turn at the crystal ball" to tell how they thought "the future of evalua-
tion will and/or should go." That effort resulted in the first special issue of EP and
was entitled Past, Present, Future Assessments of the Field of Evaluation (Smith,
1994b). In 2001, Mel Mark2 and I repeated that process for the first special issue
of the American Journal of Evaluation (Smith & Mark, 2001). While the wording
of the request to this second group of forecasters was somewhat different than it
had been to the earlier group, the charge was much the same. They were asked
to indicate where they thought the field of evaluation would be or where they
wanted it to be in the year 2010 and to discuss any significant challenges the field
might face in getting there. In the latter effort, 23 theorists and practitioners, in
addition to Mel and me, offered observations about what the destination of
evaluation might be. Then, the Evaluation Profession section of this Handbook
came together with six contributors in 2001.
One difference between the two journal efforts and this section of the Handbook is
that in the former, authors were given wide latitude in the selection of the
content of their papers whereas in the latter, writers were asked to write about
specific topics. In the introduction to the Handbook section on the profession, I
have summarized the six papers written for the section. Here my purpose is to
describe issues raised by the 1994 and 2001 journal contributors that seem to
have the most relevance for the future of the evaluation field (in the course of
which I sometimes offer my own views). The issues comprise the professionaliza-
tion of evaluation, issues that divide the profession and strategies to reduce
divisiveness, the purpose of evaluation, what is new in the field and other changes
that present new challenges, criticisms of the field, and evaluator education/
training. The reader should realize that I am not a disinterested reporter. For
more than 20 years my work has been intertwined closely with that of the
profession, and I care deeply about what we do and how we are perceived.
PROFESSIONALIZATION OF EVALUATION
About half of the contributors to the 1994 EP volume registered concern that
evaluation has not progressed to the point of being a true profession (Smith,
1994a), compared to less than 5 percent of the 2001 writers, i.e., one of 23. A
review of the reasons for the concerns given in 1994 compared to the situation in
our field today suggests that not much has changed. So the fact that so few of the
2001 writers mentioned "professionalization" would seem to indicate that the
topic just simply isn't one they are worried about. But I am ... and when you read
Worthen's chapter in this Handbook you will see that he too has concerns.
A number of factors, from my perspective, have not changed. First, we still
have not clearly identified what evaluation is - its tasks and functions and the
specialized knowledge needed to carry out an evaluation (Shadish, 1994) - and,
in fact, may be further from doing so than in the past (Datta, 2001). Second, evalu-
ation is still being done in ad hoc ways by persons with no particular training in
the area (Sechrest, 1994) and we have no way to keep them from doing so
(Worthen, 2001). Third, we still have little success in developing and maintaining
academic training programs targeted to prepare evaluators (Sechrest, 1994),
though some success has been achieved through short-term nonformal programs
such as those offered by the American Evaluation Association (AEA) at its
annual conferences, the annual summer program provided by The Evaluation
Center at Western Michigan University, and the courses offered biannually by
The Evaluators' Institute.3 Fourth, the tensions between standards and diversity
of approaches (Bickman, 1994) are only increasing as more and different
approaches to evaluation proliferate (e.g., Mertens, 2001; Mark, 2001). Finally,
the standards we do have are not consistently applied (Stufflebeam, 1994); one
of the reasons is that many evaluators and stakeholders are not aware of their
existence (Stufflebeam, 2001).
ISSUES THAT DIVIDE THE PROFESSION
In 1994, several issues were identified as factious for the profession. One of
these, the false dichotomy between facts and values, is discussed elsewhere in
this Handbook by Ernest House and Kenneth Howe. And since House was the
author who addressed this issue in the 2001 AJE volume, I will not focus on it
here. Three other issues that have a lot in common with each other are the quali-
tative vs. quantitative debate, evaluators as program advocates, and evaluators as
program developers. All three were still very much on the radar screen of the
2001 writers, though often under different guises, or as Worthen (2001) said
about the qualitative/quantitative debate: "this debate has not muted as much as
it has mutated" (p. 413).
The Qualitative-Quantitative Debate
In 1994, half of the writers observed (and rightly so) that this topic was creating
quite an uproar in the profession. Mark (2001) said that if this issue had not
received the most attention in the past 20 years, at least it had the status of the
loudest. Ostensibly, the discussion was framed around the two types of methods.
But Yvonna Lincoln (1994) observed that this is much more than a simple
disagreement about evaluation models. In my opinion, what underlies the
continued - and often heated - divisiveness is a basic, deep-seated difference in
philosophy or epistemology or world view and these are not aspects of a person's
life fabric that are easily changed.
Most of the 1994 authors, while noting the level of concern the debate was
generating, seemed of the opinion that it did not deserve so much attention.
Most thought both types of methods were needed and would be used together
more in the future. House (1994) said this debate "is the most enduring schism
in the field," but I think he, too, was looking at it as more than a discussion on
methods. What was noticeable about the qualitative/quantitative topic among
2001 writers was that not many mentioned it, and most of those who did basically
concluded that it is no longer an issue. Worthen was a notable exception, saying
the debate has just changed its name (e.g., as objectivity vs. advocacy).
Evaluators as Program Advocates
This controversy was alive in 1994 and reached new levels of debate in 2001. It is
what gives me the most worry about the ability of our field to be a profession, for
the issue threatens the very foundation or essence of what evaluation is and any
hopes for external credibility it can ever achieve. And, without external credibil-
ity, there can be no profession. This is a very important issue for our field.
Greene (2001) talked about the independence (or not) of evaluators in terms
of a difference in the value stance evaluators bring with them to the evaluation
situation. She said those with a value-neutral stance try to "provide ... objective,
impartial and thereby trustworthy information," while the value-interested "strive
for inclusiveness and fairness" and make claims "that rest on some particular set
of criteria for judgment (and) represent some interests better than others"
(p. 400). However, her assessment of the real difference between the two types
of evaluators seems to be one of whether the evaluator admits to having a value
position. All evaluators, she says, operate from a "value-committed" position,
serving some interests better than others, and it is time "to say so and to assume
responsibility" for our actions.
Sechrest (1994) sees the issue as a split between an ideological liberal group
who want to use evaluation to forward a preferred political agenda and an essen-
tially nonideological group who wish evaluation to be objective and scientific.
The "preferred political agenda" as identified in 2001 by writers such as Greene,
Mertens, and Fetterman is "social justice." However, I sense a different or
broadened view now on how the advocacy position is described by its proponents,
that is, as one of inclusivity, of getting the "disenfranchised" and "powerless"
included in all aspects of the evaluation process - with less interest in the findings
of the study than with who was involved in producing the findings, and with more
interest in the acceptability of the findings (by the many and varied stakeholders)
than with the verifiability of the findings. I agree with House (1994) who said
"the notion that truth is always and only a condition of power relationships can
mislead" (p. 244). If there are only power relationships and no "correct and
incorrect expressions and no right or wrong" (p. 244), then evaluation can be
used by one group or another for its own self interests. When/if that is the case,
then it seems to me "the profession" of evaluation has abdicated its respon-
sibilities to the public.
PURPOSE OF EVALUATION
Nearly all the authors in both the 1994 and the 2001 special journal issues on the
future of evaluation would concur with the statement that the intent for
evaluation is ultimately to help to improve the human condition. But is that done
as an observer by measuring, as accurately as possible, whether promised improve-
ments of social programs are actually delivered (i.e., by assessing program
effects), as Lipsey (2001) proposes? Or, is it done by being a constitutive part of
the social programs with the aim of transforming society toward greater justice
and equality by advocating for the interests of some (traditionally disenfran-
chised) more than others, as Greene (2001) and Mertens (2001) suggest? These
two views represent two ends of a continuum along which the different contribu-
tors' views about purpose would fall, and the differences in the views are profound,
not in what I believe to be the goal of ultimately doing good (as different ones
might define it), nor in specific methods used (e.g., qualitative or not), but rather
in the presence/absence of a predetermined outcome goal for the evaluation.
In 1994, both Chen and Sechrest observed that the focus of evaluation had
changed from its original concern for the later stages of policy processes, such as
evaluating outcomes or impact, to a focus on implementation processes. They
gave two reasons for the change. First, so many social programs had been
observed to fail because of faulty implementation, and second, measuring program
outcomes requires a great deal of rigor and is difficult and demanding work. The
writings of many in 2001 suggest that another shift may be taking place in which
the field moves from a discipline of inquiry and practice to a sociopolitical move-
ment, from a concern for program effectiveness to a concern for social justice. I
do not think that anyone would argue about the importance of having public
conversations about important public issues and of ensuring that legitimate
stakeholder voices are a part of these conversations. However, discussion and
debate are not the same as evaluation, nor substitutes for it. There are many
pressing problems that evaluation and evaluators can help to solve by identifying
programs that work and ones that do not.
WHAT IS NEW IN THE FIELD
The 2001 authors reported several changes in evaluation in recent years. These
are described briefly below.
Two new fields of evaluation that have emerged (Scriven, 2001) are metaevalua-
tion (evaluation of evaluations) and intradisciplinary evaluation (examining
whether disciplines use valid and replicable methods for evaluating data,
hypotheses, and theories).
Some older views have re-emerged and gained traction in today's settings. One
example given by Stake (2001) is that of the constructivists (e.g., Yvonna Lincoln,
Ernest House, Jennifer Greene, Thomas Schwandt) whose advocacies "change
not just the aim but the meaning of impact, effectiveness, and quality ... [where]
merit and worth is not a majoritarian or scholarly view, but one of consequence
to stakeholders with little voice" (p. 353). Another example is theory-based eval-
uation, which according to Stake (2001) and Greene (2001), is one of the latest
rejuvenations of equating evaluation with social science where experimental
methods are used to test interventions apparently based on substantive theory.
They assert that there is a greater need for studies to concentrate on explaining
the special contexts of programs to understand what Greene calls their "situated
meaning."
Demand for evaluations and evaluator expertise is on the rise and is expected to
remain so for some time to come (Datta, 2001; Newcomer, 2001; Worthen, 2001).
As an example of demand, Stufflebeam (2001) reported finding 100 job announce-
ments for evaluators in the job bank of AEA on a single day in August 2001. Such
growth is attributed mostly to pressures on the public, private, and non-profit
sectors to measure performance (the performance measurement movement) and to
government use of technology to deliver services (Love, 2001; Mertens, 2001;
Newcomer, 2001; Wholey, 2001).
Evaluators will need to operate at a much higher level of capacity than most
currently do in terms of technically high quality products that serve the needs of
multiple users on multiple issues across a variety of disciplines while dealing with
380 Smith
constraints in time and other resources (Datta, 2001). Those with whom
evaluators come into contact may already have a basic knowledge of evaluation,
a heightened expectation of receiving meaningful performance data, and a
tendency to underestimate the challenges of measuring program activities and outcomes
(Fetterman, 2001; Newcomer, 2001). Thus, evaluators will be called upon to
educate politicians, program leaders, and the public about what we cannot know
(Newcomer, 2001). Some of the needed evaluator skills are strategies for coping
with the electronic information revolution (e.g., assisting governments with
electronic delivery of information and evaluation of its consequences) and inter-
personal and group dynamic skills for working in collaborative relationships.
There has been an expansion in the number and types of professionals from
other disciplines (e.g., business, management science, environmental science,
law, engineering) working in the evaluation arena (Love, 2001; Newcomer, 2001;
Rogers, 2001; Stufflebeam, 2001; Worthen, 2001). This influx has negative and
positive consequences. Negative consequences occur when those entering the
field lack evaluation expertise and produce poor quality work, shortchanging
clients and damaging the profession's image in the mind of the public. Positive
consequences can occur when
new and creative ideas are brought in by those in non-typical evaluation disci-
plines who may also question the basic assumptions, models, and tools of the
evaluation trade. As these new disciplines increase their involvement in the
evaluation field, so must evaluators develop the ability to cross disciplinary lines
to confidently and competently provide evaluation services in a variety of fields
(Stufflebeam, 2001).
More end users of data are seeking help in understanding, interpreting, and
applying findings (Datta, 2001; Torres & Preskill, 2001; Wholey, 2001). This need
arises from the preceding situation, in which persons/groups who do not know
evaluation theory and methodology are involved in studies, and from the proliferation
of collaborative evaluation approaches, which themselves sparked the influx of
non-evaluators into evaluation planning and deliberations. Organizational
performance measurement activities have also affected the need for assisting
end users in making use of their monitoring and measurement efforts.
Evaluation has spread around the globe, from five regional/national evaluation
organizations prior to 1995 to more than 50 in 2001 (Mertens, 2001). Evaluators
in more and less developed countries will learn from each other, existing evalu-
ation approaches will be modified or abandoned through efforts to tailor them
to multiple cultural contexts, and the result will be to broaden the theoretical and
practice base of evaluation and make it more resilient (Love, 2001). Efforts to
adapt the Joint Committee standards to non-U.S. cultures were described by
Mertens (2001) and Hopson (2001).
The corporate world's mantra of "best practices" as the most sought after form
of knowledge is a lead that evaluators should be wary of accepting (Patton,
2001). The assumption of a best practice is that there is a single best way to do
something in contrast to the assumption that what is "best" will vary across
situations and contexts. Patton said this corporate rhetoric is already affecting
stakeholder expectations about what kinds of findings should be produced, and
he challenged evaluators to educate users about such concepts and their deeper
implications.
Mark (2001) said that regardless of what one thinks about the performance
measurement movement, it may be with us for a while, so we might as well figure
out how we can contribute. Can we, for example, help determine where it works
better and where it works worse? And can we help ferret out abuses? In other
words, will evaluators be perceived as having something unique in the way of
knowledge/skills to offer or will they be indistinguishable from others who offer
their services? The answers to such questions will help decide whether the
movement is a boom or a bust for the field.
CRITICISMS OF EVALUATORS/EVALUATION
The evaluation enterprise was criticized on a number of grounds. The first is for
creating so little useful knowledge about itself (Datta, 2001; Lipsey, 2001). We
know precious little about the recurring patterns in methods and outcomes of the
thousands of evaluations conducted annually and make little use of what
knowledge we do have when designing future studies. Most of us see only our
own studies. This dearth of information on details of evaluations enables shoddy
practices to be repeated over and over again. The profession suffers, as do the
clients for whom these flawed studies are produced. More metaevaluations are
needed (Wholey, 2001).
Second, little useful knowledge has been created about program effectiveness
- what effects programs do and do not have, for whom they work, under what
circumstances, how they bring about change, etc. (Lipsey, 2001; Rogers, 2001).
Some of the newer evaluation approaches (using qualitative methods, partic-
ipatory methods, and program theory, for example) contribute to this dearth of
information about successful program development when they treat every
evaluation as if it were unique, with atypical logic, goals and objectives,
implementation criteria, domains of outcomes, effects, and the like, and as they
focus their concerns entirely on the local situation without concern for some kind
of greater good. Lipsey (2001) believes there are stable aspects between program
development and implementation and the effects programs create on social
conditions they address, and that having consistent information across programs
will lead to better programs, better evaluations, more effective use of resources,
and increased public well being.
Third, viable methods have not been found for routinely assessing the impact
of everyday practical programs on the social conditions they address. Lipsey
(2001) thinks this is the most critical issue we face as evaluators. While large scale
experimental studies are not the answer to all needs for evaluation, recognition
of that does not absolve us, as a profession, from the responsibility of carrying
out research to discover techniques and procedures that will work. While that
work is going on, we can do more with what we currently know how to do, e.g.,
by acquiring more knowledge about participant characteristics and program
settings, and by making evaluation studies more transparent and more
accessible.
Stake (2001) echoed this concern for finding ways to measure outcomes and
impacts but what is bothersome to him stems from what he sees as the growing
reliance on measurement indicators as criteria of value and well being. He
criticizes evaluators for "going along with" the proliferation of the use of these
simplistic indicators to portray the complex, contextual knowledge of programs.
EVALUATOR EDUCATION/TRAINING
The education of evaluators was given almost no attention by either set of writers,
in 1994 or in 2001. I was particularly sensitive to this void when I was reading the
2001 papers because I had spent so much of my time since 1994 thinking about
and planning evaluator training experiences, and because these authors
identified so many situations which require evaluators to increase their
repertoire of skills and competencies. For example, they pointed to the current
and projected increase in demand for evaluation work (e.g., because of the
performance measurement movement); the increased need to work with end
users of data; the need for evaluators to operate at higher levels of capacity (e.g.,
because of increase in knowledge and expectations of stakeholders); the plethora
of ostensibly new evaluation theories and approaches that have emerged; the
influx of new disciplines performing evaluation tasks (again, primarily because of
the performance measurement movement); the calls for evaluation and
accountability by many governments around the globe and the concurrent need
to adapt evaluation procedures and goals to fit different cultural contexts; and
the increased prominence of politics in evaluation that brings heightened
expectations for transparency in our methods, procedures, and outcomes. All of
these together sum to an astounding list of needs for evaluator training.
Stufflebeam (2001) was one of only a few who did talk about training. He said
the future success of the field is dependent upon sound education programs that
provide a continuing flow of "excellently qualified and motivated evaluators"
(p. 445). A current void, in his view, is evaluators who can view programs in the
multidisciplinary context in which problems occur and who can work effectively
with the multiple disciplines that now take on evaluation tasks. His paper provided
a detailed blueprint for an interdisciplinary, university-wide Ph.D. degree
program that he said would fill the void.
Worthen (2001) predicted little if any future increase in the demand for
graduate training programs in evaluation, and those that are supported, he said,
will focus on the master's level. His views were similar to those of both Bickman
and Sechrest in 1994 who were of the opinion that most evaluation, as practiced
at the local level, does not require a doctoral degree, especially, Bickman said, as
more and more private-sector evaluations are conducted.
Scriven's (2001) idealized view of the future had evaluation being taught at all
levels of academe, including K-12 grades. However, Scriven did not talk about
training needs per se or how they might be assuaged; rather the discussion was
in the context of what the situation would be if evaluation were more readily
accepted as a discipline in its own right.
CONCLUSION
The 2001 authors painted a picture of a field that is more eclectic in methods,
approaches, and practitioners than was described by the writers in 1994. Seven
years made a difference in evaluation's outward and inward expansion as other
disciplines and other parts of the world became intermingled with our own. The
more diverse we become, the more opportunity there is to have debates about
our differences. It is up to us to decide whether the differences become sources
of strength for unifying the field or agents for polarization. Current writers suggest
the future could as easily take us in one direction as the other. I believe the
direction will be positive for the field if we follow the same rules for debates that
we do for making recommendations and drawing conclusions in our studies, that
is, use data (empirical examples) to support our statements and assertions.
As a result of performing my role as Director of The Evaluators' Institute
since 1996, I have had contact with over 1,200 evaluators from nearly every U.S.
state and from more than 31 other countries. Sometimes the contact is
through e-mail exchanges and sometimes it is in person when individuals attend
classes. These experiences have affected my outlook about the future of our
field. If these people are any indication of the caliber of those shaping our future,
then we are in good hands. They are serious seekers of knowledge and they
demand excellence from themselves and others.
Finally, I want to say that the contributors to these two journal volumes can
comment on the future, and what they say, as Scriven (2001) pointed out, can
exert influence on future events; but it is the choices each one of us makes that
will ultimately decide the direction in which the collective association
(profession) of evaluators will move.
Introduction
H.S. BHOLA
Indiana University, School of Education, IN, USA
These questions bring us to the paradigm debate and dialogue between posi-
tivists and constructivists carried out during the last few decades (see Guba,
1990). Positivists believe that reality exists "out there" and that universally gener-
alizable statements about all reality are possible. The fact that rival assertions of
truth about a reality exist is no more than an invitation to continue with the
search for the one single truth. Experimental and related systematic method-
ologies are suggested for testing research hypotheses in search of generalizable
knowledge (Phillips, 1987).
Constructivists, on the other hand, suggest that the search for universally gener-
alizable knowledge is in vain, since reality is individually and socially constructed
within the context of particular environments and circumstances (Berger &
Luckmann, 1967). Constructions, of course, are not completely idiosyncratic. We
are born in a world already "half-constructed" and individual and social
constructions are organically connected with what the community and culture
already know. At the same time, new individual constructions are validated as
collective social constructions through a process of interaction, reflection, and
praxis (Bernstein, 1983).
Between the search for universal generalizations and the absolute rejection of
"grand narratives," a sensible position is emerging: that while universal generali-
zations may be unattainable, knowledge can still be carried from one context into
larger contexts.
Systems thinking by itself, however, will not suffice to fully elaborate the
concept of context, since contexts as systems are constructed in a dialectical
relationship with their related social text. Thus, it is necessary to locate the
concept of context within an epistemic triangle (Bhola, 1996) formed by systems
thinking (von Bertalanffy, 1968), constructivist thinking (Bernstein, 1983; Berger
& Luckmann, 1967), and dialectical thinking (Mitroff & Mason, 1981). The
epistemic triangle does not exclude positivist thinking but views it as one
particular instance of the construction of reality, in which conditions of relative
certainty can reasonably be assumed (Bhola, 1996).
Contexts can be seen as constructions themselves, and the process of con-
structing boundaries for them could be more or less intelligent. A context could
be too narrowly defined, thereby putting a dynamic social situation in the
proverbial Procrustean bed, distorting reality. Or context could be bounded to be
so large as to become a cloud on the distant horizon.
Stakeholders might or might not be conscious of the context that surrounds
them. Furthermore, contexts defined by one group of stakeholders may be con-
tested by others. Like systems, contexts are hierarchical when a more immediate
context is surrounded by several layers of larger and larger contexts. The same
dynamic social action may be resonating at the same time to different and con-
tradictory contexts. Contexts as emergences from structures and their particular
"substances" may have different characters and qualities. Finally, it is possible to
speak of local, regional, national, and international contexts; family, community,
institutional, and associational contexts; political, social, economic, educational,
and cultural contexts.
Context, as has already been stated, influences both the theory and practice of
evaluation. It was also suggested that it emerges from a multiplicity of dimen-
sions: local, regional, national, and global; family, community, institution, and
association; and political, social, economic, educational, and cultural.
Being fully aware of the limitations (if not the impossibility) of "grand
narratives," and of the necessity of caution and qualification in dealing with
increasingly abstract categories, it was considered useful for the purposes of this
handbook to discuss the social and cultural contexts of evaluation in terms of the
various cultural regions of the world. Thus, the chapters in this section reflect
North America, Latin America, Europe, and Africa. The opening chapter
provides an overall international perspective.
In the dictionary, the social and the cultural are often defined in terms of each
other, and are used interchangeably in our discussion. In social science discourses,
the terms culture and context are also used interchangeably. At our particular
level of discussion context is culture - with its many layers.
In the first chapter in the section, "Social and Cultural Contexts of Educational
Evaluation: A Global Perspective," H.S. Bhola points out that "culture as
context" can accommodate several part-whole relationships among and between
cultures: culture in the general anthropological sense, political culture, bureau-
cratic culture, media culture, educational culture, and professional cultures of
doctors, accountants, and, of course, educational evaluators. Cultures can also
be looked at across levels of family and community, and as minority culture,
mainstream culture, and cultural regions covering multiple nations. Finally,
cultures in themselves exist in two components: the yin-yang of the normative
(the idealized version of the culture as coded by the sages in scriptures) and the
praxiological, which one sees acted out in existential realities.
In a broad historical sweep, Bhola talks of the slow but steady Westernization
of the world's ancient cultures and describes how the processes of globalization
have further accelerated and intensified the integration of world economies,
political systems, professional discourses, and knowledge systems. Educational
evaluation, which is an invention of the United States, has now been accepted as
a normative-professional culture and incorporated in practice by evaluators all
around the world. It is only at the level of field realities where evaluators go to
collect data that non-Western cultures and their traditions - no longer pristine,
and indeed already hybridized - come into play.
CONCLUSION
To conclude, we should ask: What are the crucial contextual questions in relation
to educational evaluation? And how will the chapters included in this section of
the handbook help in answering them? The following set of questions can be
raised to assist an ongoing discourse: How is one to bound the context as a
"sphere of influence"? Related to this is the question of choice of identifying
sources of influence: What particular aspect or combination of aspects in the con-
text should be selected - economic, political, social, intellectual, or all together?
How is one to keep the context and text relationship clear and articulated in
our own evaluations? This question will help us see how the existing context
might affect the processes of designing, implementing, and utilizing the results of
an educational evaluation. It will raise our awareness in regard to how the
context could limit or enhance the possibilities of an evaluation. Also, how might
the context inhibit or compromise the objectives or processes of an evaluation?
How is one to "read" the context and text relationship in evaluation studies
when it is not well articulated? Much too often text-context relationships are not
made clear by evaluators and researchers; and sometimes obvious relativities of
results may be deliberately hidden from view. We need to learn to "read"
text-context relationships.
How is one to read across a set of evaluations, each with its own text-context
relationships and how is one to develop insights that make sense in larger and
larger contexts? All knowledge comes with its context. But how large is the
context? If grand narratives with pretensions to universal generalization are
impossible, then in a world of a billion narrow contexts knowledge will be
unusable. We need to expand what we learn in one context to other contexts,
thereby seeking to touch the clouds with our heads but keeping our feet on the
ground. The chapters in this section may help us do that!
REFERENCES
Berger, P.L., & Luckmann, T. (1967). The social construction of reality: A treatise in the sociology of
knowledge. New York: Doubleday/Anchor Books.
Bernstein, R. (1983). Beyond objectivism and relativism: Science, hermeneutics, and praxis.
Philadelphia, PA: University of Pennsylvania Press.
Bhola, H.S. (1996). Between limits of rationality and possibilities of praxis: Purposive action from the
heart of an epistemic triangle. Entrepreneurship, Innovation and Change, 5, 33-47.
Guba, E.G. (Ed.). (1990). The paradigm dialog. Newbury Park, CA: Sage.
Harding, S. (1987). Introduction: Is there a feminist method? In S. Harding (Ed.), Feminism and
methodology: Social science issues (pp. 1-14). Bloomington, IN: Indiana University Press.
Harris, M. (1998). Theories of culture in postmodern times. Walnut Creek, CA: AltaMira Press.
Mitroff, I., & Mason, R.O. (1981). Creating a dialectical social science: Concepts, methods, and models.
Boston, MA: Reidel.
Peng, F.C.C. (1986). On the context of situation. International Journal of Sociology of Language, 58,
91-105.
Phillips, D. (1987). Philosophy, science, and social inquiry. Oxford: Pergamon.
Sanderson, S.K. (2000). Theories and epistemology [Review of the book Theories of culture in
postmodern times]. Contemporary Sociology, 29, 429-430.
Schudson, M. (1994). Culture and the integration of national societies. International Social Science
Journal, 46, 63-81.
von Bertalanffy, L. (1968). General systems theory. New York: Braziller.
19
Social and Cultural Contexts of Educational Evaluation:
A Global Perspective
H.S. BHOLA1
Tucson, AZ, USA
The cultures of the world are located in history and, over long periods of time,
have changed in transmission and by transformation. In dialectical relationships
between and among themselves, they have declined or ascended, and merged to
emerge anew.
During the centuries of colonization by the West of Africa, the Middle East,
Asia, the Americas, and Australia and New Zealand, the cultures of the
colonized peoples came to be situated within multiple histories at the same time;
local history, the history of the region, and the history of the Western colonizing
power all began to converge. In the process, cultures, both Western and non-
Western, came to be irreversibly transformed. In these intercultural encounters,
the dominance of the West was overwhelming, but the Western influence was by
no means a total substitution of the old by the new. Yet Westernization left no
one untouched. Even those who lived in far-flung habitations across colonized
lands, or were unable to communicate in the language of the colonizer, were
affected through the local surrogates of the colonial state and the churches.
Local elites that the colonizers used as mediators to govern the colonies became
quite Westernized. They learned the language of the colonizers for use in all
important discourses, and became more or less Westernized in habit and thought.
Most importantly, the colonizers transplanted their institutions of governance
and education which have survived everywhere, more or less adapting to
endogenous cultures, and today continue to be used as instruments of nation-
building and economic development by the new governing classes.
By the time the postcolonial period began after the second world war, Western
ideologies and epistemologies had been internalized by the "Westernized non-
Western" governing elites in all of the newly independent nations - with more or
less contextualization, hybridization, and vernacularization (Burbules & Torres,
2000). Those who came to lead independence movements rode the waves of
nationalism, not rejecting Western ideals and institutions, but rather fighting
against their exclusion from the networks of wealth, status, and power.
From post-modern discourses (Harvey, 1990; Lyotard, 1984), we learn that phe-
nomenal reality, with all its multiple aspects, features, and appearances, is not
amenable to perception and conception by one single paradigm. Thus, dialog
between and among paradigms is necessary (Guba, 1990). In working toward a
"practical epistemology", we can consider ourselves as being located in the
epistemic space formed by the three paradigms of systems thinking, dialectical
thinking, and constructivist thinking. It should be made clear that this epistemic
triangle does not exclude or reject empiricism; its assertions, however, are con-
sidered as a particular construction based on empirical warrants, available and
usable in conditions of relative control (Bhola, 1996).
The systemic view embedded in the epistemic triangle allows us to see cultures
as formations that emerge from several sub-cultures and mediating cultures, for
example, the culture of the inner city; media culture; youth culture; cultures of
institutions such as the army, business, and church; and cultures of professional
communities of lawyers, doctors, accountants, and evaluators. It enables us also
to separate the ideal from the actual in cultures: the normative part from its
praxiological projection. The constructivist view in the epistemic triangle enables
us to look at cultures as processes in perpetual reconstruction: not something to
be saved intact for ever, but to be renewed and enriched. As also anticipated in
the epistemic triangle, cultures will inevitably merge and emerge through
dialectical relationships with each other, in ways that are impossible to fully anti-
cipate. In the context of the theory and practice of evaluation in the world, this
view privileges an epistemological stance that prefers paradigm dialog to para-
digm debate, and seeks to develop evaluation discourses that use multiple
models and methods of inquiry to make warranted assertions (Bhola, 2000b).
A chapter that weaves a grand narrative from a global perspective cannot avoid
some pretentiousness in discourse or some over-generalization. But then neither
absoluteness of assertions, nor homogeneity in structures and systems, is claimed.
Inevitable pitfalls will be avoided by qualifying statements in context, and by
indicating the level of determination of boundaries and meanings.
While at the theoretical level, anthropology (the study of cultures) can be
distinguished from sociology (the study of societies), in the practical everyday
world, cultural and social perspectives overlap. In the discussion that follows, the
terms cultural and social are considered as mutually definitional and often
interchangeable.
The two larger cultural contexts of the West and the Rest (that is, the non-
West) have already become a familiar part of discourses on the new world order.
The use of these categories in discussion should, therefore, not require
justification. It is also clear that America today both leads and represents the
West (O'Meara, Mehlinger, & Krain, 2000), even though some countries or
regions within the non-West have rather long histories of Westernization, and
some pockets of highly Western sub-cultures exist outside of the geographical
West.
The terms education and evaluation both present problems of definition. All
cultures during their long histories of emergence have used some organized
modes of socialization and education for intergenerational transfer of collective
experience and knowledge. But "education" offered in old times in Buddhist
Vats, Hindu Pathshalas, Christian churches, Islamic Madrasas, and Sikh
Gurudwaras was quite different in purpose and in practice from education as we
know it today. While the term "education" will be used to refer to these two
different types of activity, the distinction between "old education" and
"education today" will be maintained in the text as needed.
A similar problem arises in the case of the term "evaluation." As individual
beings, we have been perpetual evaluators: allocating particular or relative values
to material goods, nonmaterial ideals, social connections, and actions in the
world. Socially institutionalized mechanisms of evaluation go back to the civil
service examinations administered by the Chinese to the literati, and cover the
much more recent critically evaluative reports written by individuals, commit-
tees, and commissions on various issues of interest to the state and civil society.
But again this old "generic evaluation" differs from the new "professional
evaluation" as we know it today, which can be defined as the systematic
development of information based on data, both empirical and experiential, to
determine the merit, worth or significance of some purposive action or aspects
thereof (see Madaus & Stufflebeam, 2000).
At one level, today's professional evaluation is a category that could cover
evaluation of all social programs and human services. Many of the assertions
made at this level may, of course, have to be qualified as evaluation is applied to
particular domains, such as education, health, or agricultural extension.
[Figure 1: Mediations between Western and non-Western cultures of education and evaluation
and the dialectics between their normative and praxiological components]
In this section, three major international projects that have helped induce,
articulate, and sustain the professional cultures of educational evaluation around
the world are described.
There have been many projects of training for evaluation for non-Western
professionals supported by almost all bilateral and multilateral agencies, but I
take the example of an extremely useful project focused on built-in evaluation
that was conducted by the German Foundation for International Development
(DSE) between 1975 and 1995. Program implementers were trained to conduct
evaluation of their own work in the workplace, and were expected to use the
results to improve their own program performance. The idea was to engender a
particular culture of evaluation or rather a new culture of information within
program organizations (Bhola, 1992a, 1995a).
The Project resulted from the foresight of two individuals, John W. Ryan, then
Director, UNESCO International Institute for Adult Literacy Methods (IIALM)
located in Tehran, Iran, and Josef Mueller, then a Program Officer in the German
Foundation for International Development (DSE) in Bonn. It was determined
that adult literacy work around the world was being neglected for several
reasons; and that perhaps the most important need was for professional materials
that middle- and lower-level functionaries could read and learn from, to do their
day to day work. A series of 10 to 12 monographs dealing with various topics of
adult literacy was planned. I was invited to be the series editor.
It was decided that before publication each monograph would be tested-in-use
in one or more workshops specially organized for middle-level literacy workers
for whom the monographs were intended. IIALM was an international institute
with interests in developing countries in all of the world regions. The DSE,
however, at the time, was focusing its work on Anglophone Africa, and so, at
first, the test-in-use workshops came to be located in Africa. Several countries in
the region were contacted by Josef Mueller to see if they would be interested in
406 Bhola
workshops for their middle-level workers providing training in planning,
curriculum development, writing reading materials for new readers, and
evaluation. The response was overwhelming. Over the years, evaluation
workshops were organized in Tanzania, Kenya, Zambia, Malawi, Zimbabwe, and
Botswana. As the emphasis was on built-in evaluation, pretentious stand-alone
studies requiring separate evaluation teams, extra budgets, and resources were
discouraged (Bhola, 1983, 1989).
The development and testing of written materials for the training of these
"barefoot evaluators" was an important part of the project. Materials developed
for use in training workshops attracted attention outside the project, and requests
to duplicate and use them were received from other development groups, such
as those dealing with health education and cooperatives. After systematic test-in-
use, the materials were collected for publication as a book (Bhola, 1979). The
further accumulation of experience on training evaluators was represented in a
book written with international audiences in mind (Bhola, 1990), which was
translated into Spanish to be distributed by UNESCO/OREALC in Latin America
(Bhola, 1991), translated and published in French for professionals in French-
speaking Africa (Bhola, 1992b), and later translated into Arabic (Bhola, 1995b)
and Farsi (Bhola, 1998a) for distribution in Middle Eastern countries.
The cross-regional dissemination of these materials is noteworthy. The
materials continued to contribute to the emergence of a culture of evaluation as
they came to be widely used for evaluation training in adult literacy programs
and projects around the world. The evaluation and other
monographs were adopted as texts in master's level programs in adult education
and adult literacy in several universities.
The training model used in delivering the workshops itself attracted attention,
both for its choice of content and for its process of delivery (Bhola, 1983). It was
later published for wider dissemination (Bhola, 1989).
The model of training for evaluation described above was brought to South and
East Asian countries in a workshop held at Chiang Rai, Thailand during
November 1994 with sponsorship of DSE and UIE, with the local cooperation of
the Department of Nonformal Education, Ministry of Education, Bangkok. This
Regional Training Workshop on Monitoring and Evaluation of Nonformal Basic
Education (NFBE) attracted participants working in governments and NGOs
from Afghanistan, Bangladesh, India, Indonesia, Nepal, Pakistan, the Philippines,
Sri Lanka, and Thailand.
Within the context of a short workshop, any serious enculturation of partici-
pants would have been impossible, but the seeds of several good ideas were planted.
The most significant idea was to rise above mere techniques of monitoring and
evaluation and to think in terms of "the culture of information" within organiza-
tions to enable them to develop their "organizational intelligence" (Bhola, 1995a).
Concrete objectives were to analyze existing explicit and implicit formal and
nonformal evaluation practices in use within the Non-Formal Basic Education
(NFBE) sector in participating countries; to assist participants to reflect on the
information culture in the contexts in which they worked; and to encourage the
acceptance of an expanded vision of evaluation to include MIS (Management
Information Systems), rationalistic (quantitative) evaluation, and naturalistic
(qualitative) evaluation.
A variety of evaluation proposals were developed. Development of manage-
ment information systems for monitoring was a serious concern for most
participants. Some wanted to computerize their systems, while others wished to
accommodate qualitative data. Interest was expressed in conducting evaluations
of the impact of literacy on women's empowerment. One proposal was to evaluate
learners' achievement; another to evaluate facilitators' training; yet another
proposed an evaluation of the effects of literacy and the Mother and Child
Health Program on refugees of the civil war in Afghanistan; and one on the use
of quality of life indicators in general. Another proposal looked ahead in sug-
gesting a base-line survey of learners and their families for perspective planning.
Some of the country teams had brought impressive work that they and their
colleagues had already done in their countries (Canieso-Doronila & Espiriu-
Acuna, 1994; India, 1994).
What the Experimental World Literacy Program (EWLP) did for creating a
culture of evaluation within the nonformal education sector including adult
literacy, adult education, and other development extension initiatives around the
world, the Education for All (EFA) initiative (UNESCO, 1990) did for formal
basic education. The initiative was world-wide in scope, and had the support of
an alliance of UNESCO, UNICEF, UNDP, and the World Bank.
Once again, evaluation of outcomes was an important, indeed non-negotiable,
component of all country projects under EFA. A separate Monitoring and
Evaluation section was established within the EFA Secretariat at UNESCO
Headquarters, with Vinayagum Chinapah as Director of the Monitoring Learning
Achievement Project (see Chinapah, 1995). Having been trained and socialized
himself in the tradition of international comparisons of educational achievement
(see Husen, 1987; Purves, 1989), he established a similar monitoring system at
UNESCO in consultation with participants from all the world regions. The
UNESCO unit undertook the task of assisting the establishment of endogenous
and sustainable systems for monitoring the quality of basic education in coun-
tries; established frameworks for conceptual, methodological and analytical
approaches within which data were to be collected and analyzed; and conducted
workshops and other programs for capacity building within countries with the
objective of creating monitoring cultures (UNESCO, 1995).
The design of assessment instruments and frameworks for management
information systems met generally accepted criteria, but inevitably there were
several breakdowns in the process of implementation both for lack of initial
professional capacity in countries and for lack of physical infrastructures and
institutional resources.
The EFA project was evaluated in 1995 (UNESCO, 1996) and in 2000 (UNESCO,
2000). In each case, a pyramid model was used. Local evaluations were used to
develop country studies, which were then used to develop regional studies, one
each for the Africa region, Arab States, Latin America and the Caribbean, Asia
and the Pacific region, and Europe and North America. The regional studies
were finally integrated into an international evaluation report.
Did the EFA-related assessments make any worthwhile contributions toward
creating a culture of evaluation within formal basic education around the world?
This is not an easy question to answer. Yet some optimism is permissible. A few
hundred educators from the 47 participating countries were exposed to an
expanded view of evaluation, to methods of item and test design, approaches to
developing management information systems, and uses of data for making
educational decisions. Surely, it was not all in vain. Unfinished business does influ-
ence future agendas. Failure changes the landscapes of professions and cultures.
Knowing what we know of the acceleration of history and integration of all nations
into one global network, economically, politically, culturally, and educationally,
what kind or kinds of "cultures of educational evaluation" are emerging in
societies and states around the world? Grand narratives are always problematic,
yet a general sense of the ethos of the cultures of educational evaluation can be
advanced.
In the West, a rich and robust professional culture of educational evaluation has
emerged. Its normative and methodological ideals are clearly codified, and it is
well interfaced with the educational culture and with the education policy culture.
That does not mean, however, that there are no distortions, no contradictions
in the culture. Manifestos are rarely implemented as intended. In real life, ideal
norms are not actualized on the ground. Even in the mother country of the Joint
Committee on Standards for Educational Evaluation (1981, 1994), though the
standards have been formally accepted, they are by no means fully incorporated
within the culture. Disagreement and dissent are in evidence when it comes to
their elaboration and application in real-life settings. Ideological positions,
epistemological choices, paradigm debates, and even ethical conceptions are
contested. What constitutes systematic, data-based inquiry is not always agreed
upon, while a shared and usable definition of competent evaluator performance
does not exist. Critical commitments are hard to keep in focus if evaluation is
conducted as business for profit. Ethics become situational. Feminists complain
of gender biases and disempowerment. Minorities have complained of racial
biases and exclusion (Armstrong, 1995; Stanfield, 1999).
When these normative standards are taken from the U.S. to other cultures and
self-consciously adapted to the political, economic, and cultural context, that is
not the end of the story, but its beginning. The power holders in non-Western
governments may think of evaluation as an imposition by the donors. In "economies
of scarcities" typical of the developing world, the providers of education and
development services may be interested more in providing services than in
spending resources on evaluation. Highly placed, Western-educated evaluation
specialists will understand the standards, but they will also understand the
realities of their local politics and culture. At the middle levels of government
and non-government organizations, those contextualized standards would neither
be fully accepted nor faithfully applied. At the lower levels of the culture of
educational evaluation - at the field level - these may not even be fully under-
stood. It is at this level that the greatest divergences and largest drifts away from
the original norms will be seen.
As the practice of educational evaluation filters downward within the
institutional structures of education and spreads across various levels of the
system and culture, multiple distortions will most likely appear. Evaluators may
be "evaluators by designation": that is, because they happen to have landed jobs
as evaluators, without necessarily knowing anything much about evaluation.
They may have a lot of learning to do on the job.
Economies of scarcities also have scarcities of morality. The battle for survival
in a job may require the practice of ambiguity rather than acquiring, collecting,
and sharing information. Feedback can be dangerous where power-holders are
unable to differentiate between causes and culprits. Information may be held
against those who supply it. Writing of periodic reports may be an exercise in
writing prose without information, and in diffusing personal responsibility for
conditions which are clearly not right. There may be no monitoring system in
place, and record-keeping may be in disarray. People may be unwilling to leave
the comforts of their offices to go to collect data in the field. Or organizational
structures may take too long to approve field visits and travel advances to enable
workers to go to the field.
Much has been written about the real-life world of evaluation work (Asamoah, 1988; Bamberger, 1989, 1991; Chapman &
Boothroyd, 1988; Russon & Russon, 2000; Smith, 1991). Fieldwork is never easy,
whether the evaluator comes from a foreign culture or from another subculture
nearer home. It is here that the evaluator encounters the "cultural other"
(Schweder, 1991). It is at this level that Westernization/globalization becomes
rarified, and we move into the thickets of endogenous cultures with their
particular world views: about self and other, of individual and community, and
need for achievement versus ethics of frugality bordering on asceticism. Here we
may meet people with their particular ways of relating across ethnicity, religion,
class, caste, age, and gender; and their taboos about animals, clothing, foods -
quite a bit hybridized because of the onslaught of media, yet quite a bit different
from our own. Questions of motivation and representation both cloud communi-
cation. The evaluator needs to learn from the people: by observing, eliciting,
asking them directly or asking others about them. But more often than not, we
cannot read their world or their thoughts directly. We do not even understand
their language. We end up using informants and local assistants to act as our eyes
and ears. We get second-hand views of their world and mediated meanings of
their experienced realities.
The very presence of the outsider creates perturbations: in avoidances,
greetings, seating, eating and drinking, style and content of conversation, ques-
tioning and insisting. Problems posed by lack of infrastructure and comforts
(even safety) of life are by no means trivial. Malaria can strike. Tree cobras are a
reality. Walking by the riverside, you do have to be careful about the crocodiles.
Field workers have been eaten by lions!
CONCLUSION
The seeds of "educational evaluation" have been spread far and wide in almost
all of the cultures of the world. In the cultures of the West, they have taken hold
and have grown into healthy plants. In the non-Western world, they have
sprouted in only a few places, and many more seasons of seeding and sowing will
be needed for educational evaluation to cover more ground and to become an
organic component of education. Thus, the tasks of evaluators, both the Western
and the Westernized, are by no means done.
In the meantime, professional evaluators in the West and non-West will
continue to find themselves in situations where clients want the evaluation work
to be done by outsiders, and the evaluators have to do it all by themselves, or not
at all. Suitable counterparts in ministries or field offices of host cultures may not
have been appointed or may not be available. The challenge for the evaluator is
to do one's best, in less than perfect conditions, on the basis of clearly insufficient
knowledge.
The challenge can be met if detailed preparation is made for an evaluation by
reading and research in the library; obtaining and reading all relevant project
documentation; choosing informants and evaluation assistants with care; providing
appropriate training to counterparts; and using sense and sensibilities. It will be
important not to rush to judgment. It will be necessary to use hermeneutic strate-
gies for constructing and testing meanings. It will be important to remember that
it is not the ultimate truth we will find, but only a construction that those we
worked with found credible and workable.
ENDNOTE
1 My thanks to Professor Fernando Reimers, Graduate School of Education, Harvard University, for
reading an earlier draft of this paper and for providing most insightful comments from which I
greatly profited.
REFERENCES
AEA International and Cross-Cultural Evaluation TIG. (2000). Links to international organizations.
Retrieved from http://home.wmis.net/~russon/icce/links.htm
African Evaluation Association. (2001). AfrEA membership and consultant's database. Retrieved from
http://www.geocities.com/afreval/database.htm
Armstrong, I. (1995). New feminist discourses. London: Routledge.
Asamoah, Y. (1988). Program evaluation in Africa. Evaluation and Program Planning, 11, 169-177.
Bamberger, M. (1989). The monitoring and evaluation of public sector programs in Asia. Evaluation
Review, 13, 223-242.
Bamberger, M. (1991). The politics of evaluation in developing countries. Evaluation and Program
Planning, 14, 325-339.
Bhola, H.S. (1979). Evaluating functional literacy. Amersham, Bucks, UK: Hulton Educational
Publications.
Bhola, H.S. (1983). Action training model (ATM) - An innovative approach to training literacy workers.
N.S. 128. Notes and Comments. Paris: UNESCO Unit for Cooperation with UNICEF and WFP.
Bhola, H.S. (1989). Training evaluators in the third world: Implementation of an action training
model (ATM). Evaluation and Program Planning, 12, 249-258.
Bhola, H.S. (1990). Evaluating "literacy for development": Projects, programs and campaigns.
Hamburg: UNESCO Institute of Education (UIE)/Bonn: German Foundation for International
Development (DSE).
Bhola, H.S. (1991). La evaluación de proyectos, programas y campañas de "alfabetización para el
desarrollo." Santiago, Chile: Oficina Regional de Educación de la UNESCO para América Latina
y el Caribe (OREALC).
Bhola, H.S. (1992a). A model of evaluation planning, implementation, and management: Towards a
"culture of information" within organizations. International Review of Education, 38, 103-115.
Bhola, H.S. (1992b). Évaluation des projets, programmes et campagnes d'"alphabétisation pour le
développement". Hamburg: Institut de l'Unesco pour l'Éducation/Bonn: Fondation allemande
pour le développement international (DSE).
Bhola, H.S. (1995a). Informed decisions within a culture of information: Updating a model of
information development and evaluation. Adult Education for Development, 44, 75-83.
Bhola, H.S. (1995b). Taqwim hamlat wa-baramij wa-mashruat "mahw al-ummiyah min ajl
al-tanmiyah". Cairo, Egypt: Arab Literacy and Adult Education Organization.
Bhola, H.S. (1996). Between limits of rationality and possibilities of praxis: Purposive action from the
heart of an epistemic triangle. Entrepreneurship, Innovation and Change, 5(1), 33-47.
Bhola, H.S. (1998a). Arzshiabi tareha wa barnamahai amozishi barai tawasay. Tehran, Iran:
International Institute for Adult Education Methods.
Bhola, H.S. (1998b). Constructivist capacity building and systemic program evaluation: One discourse,
two scripts. Evaluation: The International Journal of Theory, Research and Practice, 4, 329-350.
Bhola, H.S. (1998c). Program evaluation for program renewal: A study of the national literacy
program in Namibia (NLPN). Studies in Educational Evaluation, 24, 303-330.
Bhola, H.S. (1998d). They are learning and discerning: An evaluative status review. Studies in
Educational Evaluation, 24,153-177.
Bhola, H.S. (2000a). A discourse on impact evaluation: A model and its application to a literacy
intervention in Ghana. Evaluation: The International Journal of Theory, Research and Practice, 6,
161-178.
Bhola, H.S. (2000b). Developing discourses on evaluation. In D.L. Stufflebeam, G.F. Madaus, & T.
Kellaghan (Eds.), Evaluation models: Viewpoints on educational and human services evaluation
(2nd ed.), pp. 383-394. Boston: Kluwer Academic Publishers.
Bhola, H.S. (in press). Reclaiming old heritage for proclaiming future history: The "knowledge for
development" debate in the African context. Africa Today, 49(3).
Binnendijk, AL. (1989). Donor agency experience with the monitoring and evaluation of developing
projects. Evaluation Review 13, 206-223.
Botswana, Republic of. (1977). Education for Kagisano: Report of the national commission on education.
Gaborone: Government Printer.
Burbules, N.C., & Torres, C.A (Eds). (2000). Globalization and education: Critical perspectives. New
York: Routledge.
Canieso-Doronila, M.L., & Espiriu-Acuna, J. (1994). Learning from life: An ethnographic study of
functional literacy in fourteen Philippines communities. (Volume 1: Main Report). Manila: University
of the Philippines/Department of Education, Culture and Sports.
Carr-Hill, R.A., Kweka, A.N., Rusimbi, M., & Chengelele, R. (1991). The functioning and effects of
the Tanzanian literacy program. Paris: International Institute for Educational Planning.
Carron, G., Mwiria, K, & Righa, G. (1989). The functioning and effects of the Kenyan literacy program.
Paris: International Institute for Educational Planning.
Chapman, D.W., & Boothroyd, R.A. (1988). Evaluation dilemmas: Conducting evaluation studies in
developing countries. Evaluation and Program Planning, 11(1), 37-42.
Chinapah, V. (1995). Monitoring learning achievement: Toward capacity building. A to Z handbook.
Paris: UNESCO-UNICEF.
Cracknell, B. (1988). Evaluating development assistance: A review of literature. Public Administration
and Development, 8, 75-83.
Delors, J. et al. (1996). Learning: The treasure within. Report to UNESCO of the International
Commission on Education for the Twenty-first Century. Paris: UNESCO.
Faure, E. et al. (1972). Learning to be: The world of education today and tomorrow. Paris: UNESCO.
Foucault, M. (1976). Two lectures. In C. Gordon (Ed.), Michel Foucault: Power/knowledge. London:
Harvester Wheatsheaf.
Guba, E.G. (Ed). (1990). The paradigm dialog. Newbury Park, CA: Sage.
Hageboeck, M., & Snyder, M. (1991). Coverage and quality of U.S.A.I.D. evaluation in FY 1989 and FY
1990. Washington, DC: Management Systems International.
Harding, S. (1993). Rethinking standpoint epistemology: What is "strong objectivity"? In L. Alcoff,
& E. Potter (Eds.), Feminist epistemologies (pp. 49-82). New York: Routledge.
Harvey, D. (1990). The condition of post-modernity: An inquiry into the origins of cultural change.
Oxford: Blackwell.
Husen, T. (1987). Policy impact of lEA research. Comparative Educational Review, 31, 29-46.
India, Government of. (1994). Literacy campaigns - an evaluation digest (Vol. I). New Delhi: Literacy
Mission, Directorate of Adult Education, Ministry of Human Resource Development.
India, Government of. (1966). Report of the Education Commission, 1964-66. Education and national
development. New Delhi: Ministry of Education.
Inter-Agency Commission. (1990). World conference on education for all. Meeting basic learning needs,
5-9 March 1990, Jomtien, Thailand. Final report. New York: Inter-agency Commission, WCEFA
UNDP, UNESCO, UNICEF, World Bank.
Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of
educational programs, projects, and materials. New York: McGraw-Hill.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards.
Thousand Oaks, CA: Sage.
Lyotard, J. (1984). The post-modern condition. Minneapolis, MN: University of Minnesota Press.
Madaus, G.F., & Stufflebeam, D.L. (2000). Program evaluation: A historical overview. In D.L.
Stufflebeam, G.F. Madaus, & T. Kellaghan (Eds.), Evaluation models: Viewpoints on educational
and human services evaluation (2nd. ed) (pp. 3-18). Boston: Kluwer Academic Publishers.
NIEPA (National Institute of Educational Planning and Administration). (1985). Towards restructuring
Indian education: citizens' perception. New Delhi: Author.
NIEPA (National Institute of Educational Planning and Administration). (1986a). Towards restructuring
Indian education: A viewpoint of the press. New Delhi: Author.
NIEPA (National Institute of Educational Planning and Administration). (1986b). Towards restructuring
Indian education: Perceptions from states. New Delhi: Author.
NLM (National Literacy Mission). (1994). Evaluation of literacy campaigns: Summary reports (Vol. I:
23 Studies). New Delhi: Directorate of Adult Education, Ministry of Human Resource
Development, Department of Education, Government of India.
NLM (National Literacy Mission). (1995a). Evaluation of literacy campaigns: Summary reports (Vol.
II: 25 Studies). New Delhi: National Literacy Mission, Directorate of Adult Education, Ministry of
Human Resource Development, Department of Education, Government of India.
NLM (National Literacy Mission). (1995b). Evaluation of literacy campaigns: Summary reports (Vol. III:
21 Studies). New Delhi: Directorate of Adult Education, Ministry of Human Resource Development,
Department of Education, Government of India.
OECD (Organization for Economic Cooperation and Development). (1988). Evaluation in
developing countries: A step in a dialogue. Paris: Author.
OECD (Organisation for Economic Cooperation and Development). (1990). Evaluation in Africa.
Paris: Author.
O'Meara, P., Mehlinger, H.D., & Krain, M. (Eds.). (2000). Globalization and the challenges of a new
century: A reader. Bloomington, IN: Indiana University Press.
Purves, A.C. (Ed.) (1989). International comparisons and educational reform. ASCD Books.
Russon, C., & Russon, K. (2000). The annotated bibliography of international program evaluation.
Boston: Kluwer Academic Publishers.
Scheurich, J.J. (1996). The masks of validity: A deconstructive investigation. Qualitative Studies in
Education, 9(1), 49-60.
Schweder, R.A. (1991). Thinking through cultures: Expeditions in cultural psychology. Cambridge, MA:
Harvard University Press.
Smith, N.L. (1991). The context of investigation in cross-cultural evaluations. Studies in Educational
Evaluation, 17, 3-21.
Smith, N.L., Chircop, S., & Mukherjee, P. (1993). Considerations on the development of culturally
relevant evaluation standards. Studies in Educational Evaluation, 19, 3-13.
Snyder, M.M., & Doan, P.L. (1995). Who participates in the evaluation of international development
aid? Evaluation Practice, 16(2), 141-152.
Stanfield II, J.H. (1999). Slipping through the front door: relevant social scientific evaluation in the
people-of-color century. American Journal of Evaluation, 20, 415-431.
Stokke, O. (Ed.) (1991). Evaluating development assistance: Policies and performance. London: Frank
Cass.
UNESCO. (1976). The experimental world literacy program: A critical assessment. Paris: UNESCO/
New York: UNDP.
UNESCO. (1990). The World Conference on Education for All. Meeting basic learning needs. 5-9 March,
Jomtien, Thailand. Final report. Paris: UNESCO.
UNESCO. (1995). Monitoring learning achievement: Toward capacity building. Final report of interna-
tional workshop. Paris: UNESCO-UNICEF.
UNESCO. (1996). Education for all: Achieving the goal. Mid-decade meeting of the international consul-
tative forum on education for all. 16-19 June 1996. Amman, Jordan. Final report. Paris: UNESCO.
UNESCO. (1997). Fifth international conference on adult education (Adult education: A key to 21st
century), Hamburg, 14-18. Paris: UNESCO.
UNESCO. (1998a). Framework for priority action for change and development in higher education.
Retrieved from http://www.unesco.org/education/wche/declaration.shtml
UNESCO. (1998b). World declaration on higher education in the twenty-first century: Vision and action.
Retrieved from: http://www.unesco.org/education/wche/declaration.shtml
UNESCO. (2000). The Dakar framework for action. Education for all: Meeting our collective commit-
ments. Text adopted by the World Education Forum, Dakar, Senegal, 26-28 April. Paris: UNESCO.
Retrieved from http://www.unesco.org/education/efa/ed_for_all/dakfram_eng.shtml
Uphoff, N. (1992). Monitoring and evaluating popular participation in World Bank-assisted projects.
In B. Bhatnagar, & A.C. Williams (Eds.), Participatory development and the World Bank: Potential
directions for change (pp. 135-153). World Bank Discussion Paper 183. Washington, DC: World
Bank.
USAID. (U.S. Agency for International Development). (1989). U.S.A.I.D. evaluation handbook. In
U.S.A.I.D. program design and evaluation methodology report. Washington, DC: Agency for
International Development.
USAID. (U.S. Agency for International Development). (1990). The U.S.A.I.D. evaluation system: Past
performance and future directions. Washington, D.C.: Bureau for Program and Policy
Coordination, Agency for International Development.
Werge, R., & Haag, R.A. (1989). Three techniques for helping international program managers use
evaluation findings. In J.B. Barkdoll, & G.L. Bell (Eds.), Evaluation and the federal decision maker
(pp. 35-48). San Francisco, CA: Jossey-Bass.
20
The Context of Educational Program Evaluation in the
United States
CARL CANDOLI
University of Texas, Texas, USA
DANIEL L. STUFFLEBEAM
The Evaluation Center, Western Michigan University, MI, USA
FOUNDATIONAL PRINCIPLES
When school districts accept federal aid for school improvement, they usually are
required to meet program evaluation and accountability requirements. Thus, federal aid to
schools can be a political tool; by requiring accountability reports the federal
government can expand its influence on state education departments and local
school districts. The fact that strings are attached to federal support of schools is
an important factor in the evaluation of federally funded education programs. Of
course, local school districts can refuse federal funds and thus keep federal
intrusion in the districts' affairs to a minimum. Because of their financial needs
and pressures from their communities, however, since 1965 most districts have
sought and accepted as much federal money as they could get.
Despite this situation, the governance of U.S. school districts remains
fundamentally the responsibility of states. The states have discretion to evaluate
and even close public schools, as well as to require schools to provide evaluation
reports on state-funded special programs. The federal government, in general,
has no authority to evaluate the public schools. It acquires control and can
initiate evaluations to the extent that school districts violate federal statutes,
such as equality of opportunity, and when school districts accept federal grants.
Increasingly, since the launching of the massive federal programs to help reform
public education, schools have received more and more federal grant money and
have had to meet increasing levels of associated federal program evaluation and
accountability requirements.
Intermediate school districts have no control over public schools, unless a school
district transfers certain areas of responsibility and authority to the intermediate
district. On the other hand, intermediate school districts are subject to
evaluations by the state education department, by the schools they serve and,
when they accept federal funds, by the federal government. It should also be
noted that school districts have authority to evaluate individual schools. Finally,
in some school districts, authority and responsibility have been decentralized to
individual schools. In such districts, the schools may be governed both by the
district school board and a local school council, with the latter often having
authority to hire and fire the school principal.
The five-tiered system of public education in the United States - federal, state,
intermediate, district, school - is a significant factor in the evaluation of school
programs. When evaluating a program, it is necessary to determine which persons
and groups should help identify the most important evaluation questions, supply
needed information, and receive and act upon evaluation reports. It is also cru-
cially important to note and differentiate the kind and level of detail of evaluative
feedback that audiences at each level require. Often program evaluators must
prepare an evaluation report for each level of the education system that is involved
and interested in a particular program. Some of the main failures of program
evaluations occur when evaluators present a report that addresses the questions
of only one audience. For example, a report that looks only at average, district-
wide student achievement scores is of little use to the staff of individual schools.
The nature of reports and the identification of the audiences that will receive
them depend on who hires the evaluator. Clearly, the evaluator should address
the client's questions, but often should also tailor special reports for audiences
at other levels of the education system.
Given the five-tiered nature of the U.S. education system, it is not surprising that
program evaluation in school districts can be intensely political. Indeed, federal
and state government agencies often employ evaluation as an instrument for
seeking and winning power over schools. State education departments require
evaluations of state-funded programs to help ensure that schools will meet the
state's objectives. Federal agencies require evaluations of federally funded
projects to help ensure that districts will address federal priorities, such as affir-
mative action and desegregation. Local school boards and superintendents
obtain evaluations of school district programs to help ensure that the district's
teachers and administrators will effectively carry out district policies. They also
use program evaluation reports to help improve and sustain the community's
respect for the schools, and sometimes to help preserve their leadership positions.
The mechanisms by which a variety of groups employ evaluation for political
gain include controlling the questions that will be addressed, the criteria that will
be applied, the information that will be obtained, what will be included in reports,
and the audiences that will receive reports. The last of these has been mitigated
by freedom of information laws. Whereas politically oriented clients
used to be able to control the distribution of reports (e.g., release the news if it
is good, and otherwise suppress it), now any citizen can make a freedom of infor-
mation request and obtain almost any program evaluation report that was
produced with public funds. Freedom of information requests are common and
have tended to extend the political advantages of having access to evaluation
findings to all interested stakeholders, including citizens, parents, students, and
political action groups.
Another political reality in U.S. program evaluation stems from certain groups'
abilities to resist evaluations that might cast their work in a bad light or influence
it in an unwanted direction. By refusing to cooperate with evaluators, groups
such as teachers' unions, parents, and students can prevent the collection of
needed information. Further, unfounded attacks on evaluation reports by such
groups as parent-teacher organizations and teachers' and administrators' unions
can have the effect of reducing the credibility of evaluation findings. Such attacks
are not uncommon.
American educators have learned that it is crucially important to involve repre-
sentatives of the full range of stakeholder groups in planning evaluations. These
may include teachers, parents, students, administrators, school board members,
community members, state education department officials, and federal govern-
ment representatives. When representatives of interested and right-to-know
audiences have the opportunity to make an early input, they are more likely to
develop confidence in, and use, evaluation findings. Consultation with stake-
holders also helps ensure that evaluators will effectively address the information
requirements of different audiences.
In 1965, U.S. school districts could find little relevant support to meet their
program evaluation needs. At the time, the traditions of program evaluation
included objectives-oriented evaluation, experimental and quasi-experimental
design, standardized testing, and site visits by accreditation teams. All of these
approaches were found largely inapplicable and unresponsive to the evaluation
needs posed by the new federal programs (Stufflebeam et al., 1971).
The objectives-oriented evaluation model, which had been developed by
Ralph Tyler (see Madaus & Stufflebeam, 1989), required too much time to define
behavioral objectives and to develop valid measures and, even then, provided
findings only at the end of a project. The schools needed an evaluation process
that could be integrated into the development and operation of projects, and
that would help them improve project operations.
The experimental (and quasi-experimental) design approach, which was based
on agricultural and laboratory research, required assumptions about the alloca-
tion of students to competing project conditions and controlled treatments, and
also was limited to outcome data. However, it was illegal to assign entitled students
to a nontreatment (control) group. Furthermore, because of the urgency to
launch projects, schools needed to constantly assess and improve their projects,
which was anathema to the conduct of sound experiments.
At first, many educators saw a quick fix in the use of published standardized
tests. However, these had been normed on average students and usually proved
to be only remotely relevant to assessing the needs and achievements of disad-
vantaged children. Further, the tests had been designed primarily to discriminate
among examinees rather than to assess a group's achievement of curriculum
objectives.
The federal government made extensive use of the site visit approach.
Preannouncement of intended visits stimulated school district staffs to compile
data on their projects. Visitors also had the opportunity not only to examine the
case presented by the district, but also to observe projects at first hand. The approach
also gave the federal government a means of obtaining evaluative feedback from
a third party. However, the site visit provided a far from adequate response to
the program evaluation needs of diverse audiences. Visits were infrequent and
often only reported on staged, untypical project activities. Sometimes local
district staffs co-opted the visitors and influenced them to provide overly positive
reports. And this approach did not, of course, provide the continuous feedback
that school staffs needed to improve their projects.
Overall, it is now clear that the evaluation field was just as unprepared to meet
the program evaluation needs of school district staff as were the districts to meet
the needs of children from disadvantaged backgrounds. After many failed or
unsatisfactory attempts to apply existing evaluation technology, evaluators and
their clients took steps to significantly improve the theory and practice of educa-
tional program evaluation.
Other chapters in this book present evaluation approaches that were subse-
quently developed and have proved useful. These include Michael Scriven's
formative-summative approach to evaluation, Robert Stake's responsive
evaluation approach, Michael Patton's utilization-focused evaluation model,
Daniel Stufflebeam's CIPP evaluation model, the Guba-Lincoln constructivist
approach, and the recently developed House and Howe democratic-deliberative
evaluation model. All of these approaches, except Scriven's, require extensive
involvement of stakeholders in planning evaluations. Scriven's approach
provides for minimal involvement with the purpose of protecting the evaluator's
independent perspective. Whereas the other approaches stress the importance
of formative or improvement-oriented evaluation, Scriven emphasizes the impor-
tance of independent, hard-hitting, summative evaluation. His approach appeals
most to external audiences, such as the federal government and taxpayers; the
other approaches appeal to school staffs that need systematic feedback to help
them carry out projects.
Another response to the need for better program evaluation in schools was
provided by accreditation agencies whose leaders became concerned with the
lack of achievement data available from school program accreditation reports.
Basically, the accrediting organizations had a tradition of measuring only inputs,
such as teacher credentials, books available to students, curricula, and per-pupil
expenditures. A trend developed to place greater emphasis on student achieve-
ment as a crucial component of the accreditation design.
In recognition of the importance of stakeholder involvement in school district
program evaluations, and to make the evaluation process more equitable and
understandable to practitioners, a national committee representing the main
professional societies concerned with education developed professional standards
for guiding and assessing educational program evaluations. Membership on this
joint committee was by appointment from one of the national education
organizations, and was supported through grants from a number of funding
sources including several foundations. The organizations that provided members
to the committee included American Association of School Administrators,
American Educational Research Association, American Federation of Teachers,
American Personnel and Guidance Association, American Psychological
Association, Association for Supervision and Curriculum Development, Council
for American Private Education, Education Commission of the States, National
Association of Elementary School Principals, National Council on Measurement
in Education, National Education Association, and National School Boards
Association. Later, representatives from the National Association of Secondary
School Principals, the Council on Postsecondary Accreditation, the Canadian
Society for the Study of Education, and the American Evaluation Association
were added.
In 1981, the Joint Committee issued Standards for Evaluations of Educational
Programs, Projects, and Materials, which were intended to define the attributes of
good and acceptable educational program evaluation and to provide schools with
practical guidance for conducting evaluations. It is noteworthy that the commit-
tee included an equal number of evaluation specialists and of evaluation users.
This is consistent with the position that the full range of stakeholders should be
involved in all aspects of program evaluation if they are to help shape evaluation
plans and subsequently value, understand, and use evaluation findings. Consistent
with the notion that the development of standards is a continuing activity, the
Joint Committee published a second edition in 1994 (Joint Committee on
Standards for Educational Evaluation, 1994). Both sets of standards require
evaluations to satisfy conditions of utility, feasibility, propriety, and accuracy.
The Joint Committee's experience and standards are more fully described in
Chapter 13 in this Handbook.
Since the 1960s, evaluation practices in U.S. school districts have changed and
improved, but only marginally. A few large urban districts (e.g., Lansing,
Michigan; Los Angeles, California; Pittsburgh, Pennsylvania; Lincoln, Nebraska;
Des Moines, Iowa; Dade County, Florida; and Dallas, Texas) developed and have
operated well-staffed offices of evaluation. Furthermore, school evaluators devel-
oped a division on school evaluation within the American Educational Research
Association. Some intermediate school districts have developed and delivered
excellent evaluation services to the schools, while school districts and other eval-
uation clients often contract evaluation services from specialized evaluation centers
in universities and evaluation companies. Overall, however, program evaluation
in U.S. school districts is not dramatically advanced from where it was in 1965.
When the conveyers of external grants require evaluation, the schools jump.
Otherwise, the dream of systematic evaluation in schools is just that - a dream.
CONCLUSION
The discipline of educational program evaluation in the U.S. is still quite new
and evolving. New and specific models are being developed for specialized
evaluation tasks. The field has developed professional standards for educational
program evaluation.
Educational evaluation has closely followed the democratization of the private
sector. Organizations have become much more responsive to demands for
greater stakeholder participation in their governance. Educational organizations
also have become more participatory as clients and employees of the school
organization expressed their desire to take part in the decision making process,
particularly those decisions that affect their children. As publicly funded bodies,
school districts must be responsible to the citizens of the particular district, and
must reflect the best interests of the clients of the district as decisions are made.
Education needs, and to some degree is obtaining, evaluation that can assist
program development and decision making at different levels. The identification
and involvement of representatives of stakeholders in the process of educational
program evaluation is necessary to help ensure the political viability of programs,
responsiveness to stakeholders' questions, and utilization of findings.
A pervasive criterion in U.S. educational evaluation involves equality of educa-
tional opportunity. There has also been an increasing trend to assess student
outcomes, especially in basic skill areas. Generally, educators want responsive
formative evaluations and resonate to the notion that evaluation's most impor-
tant purpose is not to prove, but to improve. On the whole, though, the practice
of educational evaluation has not advanced greatly from its state of low use in
the mid-1960s. While schools do evaluate programs when external sponsors
require evaluation as a condition for funding, school districts on the whole
continue to place a low priority on evaluation, possibly because school adminis-
trators and teachers are not very familiar with program evaluation. Furthermore,
they may regard the use of standardized tests, which are strongly entrenched in
the system, as a grossly inadequate means of accountability for school programs.
It would seem that American schools, in addressing the problems that face them,
tend to continue to adopt, and subsequently discard, one educational fad after
the other, rather than securing the continual improvement that the use of
systematic program evaluation could support.
REFERENCES
Brown v. Board of Education of Topeka (1954, May 17), Chief Justice Earl Warren, the United States
Supreme Court. (Reprinted in S.A. Rippa (Ed.), Educational ideas in America. A documentary
history (pp. 429-432). New York: David McKay, 1969.)
Campbell, R.L., Cunningham, L.L., McPhee, R.F., & Nystrand, R.O. (1970). The organization and
control of American schools. Columbus, OH: Charles E. Merrill.
Candoli, I. (1995). Site-based management: How to make it work in your school. Lancaster, PA:
Technomic Publishing.
Coleman, J.S. (1968). The concept of equality of educational opportunity. Harvard Educational
Review, 38, 7-22.
Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of
educational programs, projects, and materials. New York: McGraw-Hill.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation
standards. Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards
(2nd ed.). Thousand Oaks, CA: Sage.
Madaus, G.F., & Stufflebeam, D.L. (1989). Educational evaluation: Classic works of Ralph W. Tyler.
Boston: Kluwer.
Ravitch, D. (1983). The troubled crusade. American education, 1945-1980. New York: Basic Books.
Stufflebeam, D.L., Foley, W.J., Gephart, W.J., Guba, E.G., Hammond, R.L., Merriman, H.O., &
Provus, M. (1971). Educational evaluation and decision making. Itasca, IL: Peacock.
Tyack, D.B. (1974). The one best system. A history of American urban education. Cambridge, MA:
Harvard University Press.
21
Program Evaluation in Europe: Between Democratic
and New Public Management Evaluation
OVE KARLSSON1
Mälardalen University, Mälardalens högskola, Sweden
The European impetus for evaluation signifies two things: First, public
administration, which has to organize itself on the basis of the Structural
Fund programmes, needs to learn to develop certain functions (calls for
proposals, project selection, monitoring, calls for evaluation bids, etc.)
and it has to adapt to the requirements for accountability and expenditure
management .... Second, there is a need to create evaluation competence,
both internal and external to public administrations. (News from the
Community, 1999, p. 360).
The main reason for the popularity of NPM concepts in many countries would
seem to be the demand for budget cuts in the 1990s, which increased the need
for rationalization and a reduction of the public sector. NPM, which represents
a solution based on the introduction of management ideas from the private
sector, is supportive of the view that institutional structures for control and
"accountability" should be strengthened and that evaluation in the first instance
should be defined as a steering instrument for management. Through measure-
ment and the control of quality, politicians are in a position to demonstrate
"value for money" to tax payers. Furthermore, NPM provides a rationale for
reducing public sector spending through its support for private solutions rather
than politically controlled activities.
Countries have varied in the aspects of NPM they have chosen to emphasize.
For example, in Scandinavian countries, the focus has been on performance
management, while in France it has been on internal decentralization ("flatter"
hierarchies) (Pollitt, 1995).
The important role ascribed to evaluation in management is illustrated in an
interview with Alan Pratley, Financial Controller and Acting Director General of
the European Commission (Visits to the world of practice, 1995). Pratley stresses
that the Commission clearly must show that it is managing things properly, using
money correctly, that it has effective control of how money is being used, and has
systems in place to ensure that people cannot misuse funds. When asked about
stakeholder perspectives, Pratley says that he is aware of them, but that the main
question for the Commission right now is to make sure that the money spent in
the Community is managed properly and with control: "Obviously we are taking
very tough, clear measures to fight fraud. It's not just the citizens, but also obviously
the citizen as taxpayer who is vitally concerned: it goes through parliaments; it
goes through all public opinion" (Visits, 1995, p. 261). When asked if there
would be a convergence between the way these policies are evolving (i.e.,
improved understanding of program results and better financial management)
and the way the evaluation community is evolving (which often looks at things
from a pluralistic perspective as well as from a top-down perspective), Pratley
states that the Commission takes account of these issues in formulating policy.
It is clear from this interview that financial control is a major concern in the
context of evaluation in Europe. The Commission feels the need to legitimize its
taxation of member states. At the same time, the need remains to develop forms
of evaluation that are sensitive and responsive to what interest groups have to
say, and that include the voices of people with little or no power. The tensions
involved in this situation will inevitably give rise to debate about how the top-
down control perspective will be strengthened, while at the same time allowing a
more stakeholder-based and democratic perspective to develop.
DEMOCRATIC EVALUATION
One of the main questions that arises when evaluation is considered as a part of
a democratic process relates to the position to be taken by the evaluator. Is the
evaluator to be value-neutral in describing the concerns of different groups, or
should he/she adopt certain fundamental democratic values and let these
determine the nature of the evaluation? Should the evaluator represent socially
disadvantaged groups who have not been able to make their interests heard? The
traditional answer to questions of this kind is to assert that the evaluator's task is
to describe the facts collected by scientifically proven methods. The goal of such
an approach is to distinguish between facts and values, and thereby establish that
the evaluation is not influenced by politics.
House and Howe (1999) challenge this view, pointing out that since investi-
gators face value-laden issues, it is not possible to remain value-neutral. They
argue that fact and value statements, rather than being dichotomous, blend
together. Evaluative statements consist of intertwined facts and values, as do
most claims in evaluation. Their conclusion is that in order to perform their work,
evaluators need some conception of their role and how it is compatible with
democracy. Thus, evaluators must have some broad conception of how their
studies will be used in democratic societies, even if these views are only implicit.
Others, who also urge the evaluator to involve interest groups in an evaluation,
designate the evaluator's role as that of consultant for these interest groups
(Fetterman, 1994; Patton, 1994, 1996). In this situation, the evaluator becomes a
"facilitator" and throughout the evaluation adopts a neutral position with respect
to the interests of different groups as they strive to empower themselves as indi-
viduals and as a group. For example, Shadish, Cook and Leviton (1991) argue
that the evaluator's role is to describe the values of stakeholders in a neutral way.
A Critical Dialogue
CONCLUSION
There are obvious unsolved problems in the various kinds of democratic evalua-
tion that aim to involve stakeholders in the evaluation process. The model could,
for example, be seen as a way to legitimate political power: it is claimed that if
the starting point has been the opinions of interest groups, politicians can use the
stakeholder evaluation model (with its strong emphasis on users) to avoid other
kinds of evaluation. On the other hand, the stakeholder model can be seen as a
challenge to the existing power structure as it allows the current process of
decision making to be problematized.
One could add to the problems of stakeholder evaluation by asking how stake-
holders are to be identified. Furthermore, how are the representatives of groups
to be chosen, or how do differences in power among stakeholder groups influence
the evaluation? These questions highlight the dilemma facing the evaluator.
Should he/she strengthen powerless stakeholders? From what value position
could such a decision be motivated? And what effect is the standpoint taken
likely to have on confidence in the evaluation among other groups? (Karlsson,
1990, 1995, 2001).
If pluralistic approaches are to become alternatives to traditional (top-down)
evaluation models, a further question arises: to what extent can the approaches
If this turns out to be the case, then the reasons for developing a form of evalu-
ation that has as its focus critical dialogue and learning, and for strengthening
democratic forms of work in opposition to the use of market mechanisms in
public sector activity, are all the stronger. It is important that the democratic
worldview be presented as an alternative to the view that regards individuals as
consumers who are considered solely in terms of their purchasing power.
ENDNOTES
1 My special thanks to Joyce Kemuma, Mehdi Sedigh, and Anna Lundin for their helpful criticism
of earlier drafts of this chapter.
2 The concept NPM is taken from the discussion of management in the private sector (see Osborne
& Gaebler, 1992).
REFERENCES
Benhabib, S. (1994). Autonomi och gemenskap. Kommunikativ etik, feminism och postmodemism.
[Situating the self. Gender, community and postmodernism in contemporary ethics]. Goteborg:
Bokforlaget Daidalos AB.
Bryk, A.S. (Ed.). (1983). Stakeholder-based evaluation. New Directions for Program Evaluation.
Dabinett, G., & Richardson, T. (1999). The European spatial approach. The role of power and
knowledge in strategic planning and policy evaluation. Evaluation, 5, 220-236.
Davies, I.C. (1999). Evaluation and performance management in government. Evaluation, 5, 150-159.
Duran, P., Monnier, E., & Smith, A. (1995). Évaluation à la française. Towards a new relationship
between social science and public action. Evaluation, 1, 45-63.
Fetterman, D. (1994). Empowerment evaluation. Evaluation Practice, 15, 1-15.
Floc'hlay, B., & Plottu, E. (1998). Democratic evaluation. From empowerment evaluation to public
decision-making. Evaluation, 4, 261-277.
Franke-Wikberg, S., & Lundgren, U.P. (Red.). (1980). Att värdera utbildning. Del 1. [To evaluate
education. Part 1]. Stockholm: Wahlström & Widstrand.
Furubo, J-E., Sandahl, R., & Rist, R. (2002). A diffusion perspective on global developments in evaluation.
International Evaluation Atlas. New Brunswick: Transaction Publishers.
Greene, J.C. (1999). The inequality of performance measurements. Evaluation, 5, 160-172.
Gunnarsson, R. (1980). Svensk utbildningshistoria. Skola och samhälle förr och nu. [Swedish history of
education. School and society then and now]. Lund: Studentlitteratur.
Habermas, J. (1984). The theory of communicative action. Reason and the rationalization of society.
Boston, MA: Beacon Press.
Hansen, H.F., & Borum, F. (1999). The construction and standardization of evaluation. The case of
the Danish university sector. Evaluation, 5, 303-329.
Hansson, F. (1997). Evaluation traditions in Denmark. Critical comments and perspectives.
Evaluation, 3, 85-96.
House, E., & Howe, K. (1999). Values in evaluation and social research. London: Sage.
Karlsson, O. (1990). Utvärdering av fritidsklubbar. [Evaluation of school-age childcare/youth clubs].
Eskilstuna: Socialtjänsten.
Karlsson, O. (1995). Att utvärdera - mot vad? Om kriterieproblemet vid intressentutvärdering. [To evaluate
- to what purpose? On the problem of criteria in stakeholder evaluation]. Stockholm: HLS Förlag.
Karlsson, O. (1996). A critical dialogue in evaluation. How can the interaction between evaluation
and politics be tackled? Evaluation, 2, 405-416.
Karlsson, O. (1998). Socratic dialogue in the Swedish political context. In T.A. Schwandt (Ed.),
Scandinavian perspectives on the evaluator's role in informing policy. New Directions for
Evaluation, 77, 21-39.
Karlsson, O. (2001). Critical dialogue: Its value and meaning. Evaluation, 7, 211-227.
MacDonald, B. (1973). Briefing decision makers. In E.R. House (Ed.), School evaluation: The politics
and process (pp. 174-187). Berkeley, CA: McCutchan.
MacDonald, B. (1977). A political classification of evaluation studies. In D. Hamilton, D. Jenkins, C.
King, B. MacDonald, & M. Parlett, Beyond the numbers game. A reader in educational evaluation
(pp. 124-227). London: Macmillan Education.
News from the Community. (1999). 2nd Conference of the Italian Association of Evaluation (AIV),
Naples, 15-17 April 1999. Evaluation, 5, 359.
Osborne, D., & Gaebler, T. (1992). Reinventing government. Reading, MA: Addison Wesley.
Patton, M.Q. (1994). Developmental evaluation. Evaluation Practice, 15, 311-319.
Patton, M.Q. (1996). Utilization-focused evaluation. London: Sage.
Pollitt, C. (1995). Justification by works or by faith? Evaluating the new public management.
Evaluation, 1, 133-154.
Russon, C., & Russon, K. (2001). The annotated bibliography of international program evaluation.
Boston, MA: Kluwer Academic.
Shadish, W.R. Jr., Cook, T.D., & Leviton, L.C. (1991). Foundations of program evaluation: Theories of
practice. Newbury Park, CA: Sage.
Stame, N. (1998). Evaluation in Italy. Experience and prospects. Evaluation, 4, 91-103.
Visits to the world of practice. (1995). Auditing, financial management and evaluation: An interview
with Alan Pratley. Evaluation, 1, 251-263.
Welch, S.D. (1990). A feminist ethic of risk. Minneapolis, MN: Fortress Press.
22
The Social Context of Educational Evaluation in Latin
America
FERNANDO REIMERS
Harvard Graduate School of Education, Harvard University, MA, USA
Education reform during the 20th century in Latin America was shaped by the
struggle between conservatism and progressivism (Reimers, in press). The con-
servative approach, with deep roots extending back to the establishment of the
Spanish colonies in the 15th and 16th centuries, saw education as a means to
preserve the privileges of the elites. This approach was challenged during the
20th century by a progressive alternative that sought to expand the learning
chances of marginalized groups. The roots of the progressive approach extend
back in time to independence movements and to the creation of the modern
republics. The main gains of progressivism were the creation of public education
systems, the universalization of primary education, an unprecedented educa-
tional expansion at all levels with consequent intergenerational educational
mobility, and a significant reduction in the gender gap in educational opportu-
nity. The progressive movement was contested by conservatives, resulting in the
social segregation of educational institutions as disadvantaged groups gained
access and in neglect of the quality of education in institutions serving disadvan-
taged groups. The battles between these approaches took place at the level of
policy formation as well as policy implementation, and competition between the
approaches, representing powerful competing ideologies and visions for education
and society, shaped all education activities, including educational evaluation.
The practice of educational evaluation must thus be understood by reference
to how the purposes of evaluation aligned with the ideologies espoused by
competing approaches. This alignment defined a rich pattern of the influences of
competing groups on the actors involved in the evaluation task. The purposes
and ideologies influenced what was evaluated, who conducted the evaluation,
and how the results would be used and received. When educational evaluators
used evaluation results to mobilize a particular constituency, the discussion of
the results was framed in the context of the competition between conservatism
and progressivism. For example, a series of studies commissioned by the Economic
Commission for Latin America in the late 1970s to examine the rela-
tionship between education and social development supported the progressive
aspiration to identify the constraints on the education of disadvantaged children
(Nassif, Rama & Tedesco, 1984; UNESCO/CEPAL/PNUD, 1981). Similarly, a
series of ethnographies on the quality of instruction in schools attended by
marginalized children, funded by the International Development Research
Centre in the 1970s, served to highlight the extent to which schools serving
disadvantaged children failed to provide adequate opportunities for learning
(Avalos, 1980).
The development of student testing systems in several countries of Latin
America also illustrates how ideologies shape the practice of evaluation. In
Chile, the introduction of student testing during military rule in the 1980s served
to legitimize the movement towards privatization of education by showing the
systematic underachievement of students in public schools, without taking
proper account of differences in the social backgrounds of students attending
public and private schools. In contrast, the development of student testing in
Uruguay took place in a context of strong commitment of the State to public
education and to reducing disparities in the quality of education offered to
children of different social classes. The analysis of test results sought to identify
the school and teacher practices that enhanced students' learning; results were fed
back to each school in the country, along with resources and technical assistance
to use the information as part of a process of school improvement. The broad
dissemination of individual schools' results through national newspapers, which
had been practiced in Chile and Costa Rica, was avoided (Administración Nacional de
Educación Pública, 2000; Benveniste, 1999, 2000).
In general, research documenting the limited learning chances of marginalized
groups was supported by those advocating a progressive approach and opposed
by those partial to a conservative alternative. Evaluation of the impact of various
During the 20th century there was a remarkable expansion of public education
in Latin America. Consolidating trends initiated in the latter part of the 19th
century, the State became the dominant actor in the provision of education,
challenging the dominance of the Church in previous centuries2. State domi-
nance has been characterized as the policy of the "teaching state" (estado docente).
Table 1 shows how, during the second half of the 20th century, enrollments at all
levels increased significantly. Most of this expansion took place in primary
institutions.
The expansion of the role of the State in the provision of education was
reinforced by growing urbanization and by Import Substituting Industrialization,
the development strategy adopted by many Latin American economies since the
late 1940s. Also supporting the expansion of education was the activism of
UNESCO and other United Nations agencies which saw education as a funda-
mental human right and mobilized policy elites in support of its expansion.
Another process supporting quantitative expansion was the development of
human capital theory and the intellectual legitimacy gained by the idea that
economic development could be planned.
In line with the expectations that national development could be planned by
the State, Latin American nations established national planning units and began
to carry out education development plans to align the expansion of education
with broader economic and social objectives. From the 1950s, the State, and the
executive branch in particular, became powerful and controlling, not just in educa-
tion but as a dominant actor in society at large. The power of the executive was
unmatched by organizations of civil society, Congress, the Judiciary, or the Press.
Only the military, which frequently seized the executive under authoritarian rule
between mid-century and the 1980s, could challenge Presidents and Ministers
of Education, and they often did!
The power and dominance of the executive in governing education were
supported by multiple international agencies. Educational planning and the
ensuing educational expansion were supported by United Nations and bilateral
and multilateral development agencies. The emergence of educational planning
created a favorable context for the utilization of evaluation on a grand scale. This
early partnership between planning and evaluation, as both were emerging as
professional fields, defined the context of evaluation for the remaining half
century. The kind of evaluation which supported educational planning in the
1950s and 1960s primarily involved the assessment of the number of children of
different age groups enrolled in school and the number not enrolled, and the use
of demographic forecasting techniques to estimate the expansion requirements
consistent with the desired objectives of manpower planning2. Because a black
box model of education dominated most educational planning, the emphasis was
on the numbers and flows of students through different levels of the system. An
important issue that emerged as a result of the evaluation of the internal effi-
ciency of student flows was the recognition of high rates of retention (repetition)
in the early grades. This topic would assume growing importance as evaluation
began to look inside schools and classrooms in the 1970s, and as it began to
examine the impact of policy later on.
Educational planning, and consequently evaluation, in the region developed
as an activity that involved relatively few actors, experts, and decision makers.
International cooperation supported an elitist approach to using evaluation results
to estimate the quantitative needs for educational development. This mode of
educational evaluation (focused on very broad quantitative enrollment needs
and involving a very narrow group of stakeholders) dominated during the 1950s
and 1960s.
There were few institutions conducting evaluation other than government
agencies during this period. Paralleling the highly centralized nature of educational
decision making, educational research elites were few, and were concentrated in
capital cities and in a few institutions. Very modest efforts to contest State
dominance were made by universities and non-governmental organizations.
Universities in Latin America, traditionally elite institutions, reflected two
broad trends during the second half of the century. The first was towards greater
autonomy from the State and political radicalization, particularly under author-
itarian rule. The second trend was in part a response to the first: authoritarian
governments assaulted the universities or allowed them to deteriorate under
benign neglect. A strong private university sector developed as a result of the
growing demand by students for higher education, which public universities
could not meet. The consequences of these processes for educational evaluation
were varied. A few public universities were able to preserve the conditions
required to conduct quality research, but this occurred in a context of growing
tensions between universities and governments. The stance adopted by
"evaluators" was not one conducive to the formulation of recommendations for
policy reform. A more denunciatory and critical stance was more common. In
response, public decision makers were less than inclined to use the results of the
research and evaluation to inform their thinking. They were even less inclined to
be supportive of research that did not comply with very narrow terms of refer-
ence or that addressed issues outside the zone of acceptance defined by the
agenda of the executive branch of government. The result of this separation and
conflict between States and universities was a growing gap between policy and
programmatic decision making and evaluation. On the other hand, private
universities, by and large, focused more on teaching than on research and kept
some distance from public policy. None had the institutional strength and
autonomy to adopt a head-on critical stance vis-à-vis the efforts of the State.
As a result, the professional evaluation of government programs and policies,
and the activity of policy analysis more generally, have been under governmental
control in Latin America during the last 50 years. Governments have been able
to regulate the flow, focus, and timing of evaluation, and thus to contain its impact.
The channels through which the executive exerted this control were manifold.
Ministries of education conducted evaluation in-house, and occasionally
Ministries of Finance and Planning also conducted or contracted educational
evaluations. This was particularly the case for the quantitative assessments of
social demand for education and, more recently, for the establishment of student
assessment systems. Ministries of Education also contracted evaluations of
programs with public and private research units. In this way, the executive held
tight control over the definition of the terms of reference of evaluation.
Governments exercised some control over two additional sources of support
for evaluation. Much research conducted in public and private research institu-
tions was funded publicly either by ministries of education, or other government
agencies, such as the National Councils of Scientific Research. This allowed
governments to influence research and evaluation communities through research
development plans, funding priorities, and programs to stimulate and monitor
researcher productivity. Lastly, foundations and development agencies supported
educational research and evaluation carried out by public and private agencies.
These provided the most autonomy from governmental influence, but were
hardly impenetrable to the long arm of the State.
Because university communities were often targeted by authoritarian govern-
ments for political persecution as part of larger political struggles, the roots of
independent research are to be found in a few independent institutions. An
example of these is the Centro de Investigación y Desarrollo de la Educación in
Chile, which was established with support of the Catholic Church when the
military regime closed down entire programs of social studies in universities and
prosecuted social scientists. Similar examples of independent education research
centers are the Centro de Estudios Educativos in Mexico and the Centro de
Estudios para la Reflexión y Planificación de la Educación in Venezuela, also estab-
lished with support of the Catholic Church as alternatives to State hegemony.
Representing the best examples of a counter-hegemonic and independent prac-
tice of educational evaluation, these centers were financially and institutionally
fragile and had to negotiate a precarious equilibrium with States unreceptive to
critical dissent. They had limited staff and were highly dependent on soft funding
from foundations, international agencies, and State contracts.
In an attempt to overcome this institutional fragility and to facilitate exchanges
of experience, several centers undertook the organization of a network of
research centers, which had as its main mission the systematization of existing
research-based knowledge in a regional database. The Red Latinoamericana de
Educación (REDUC) was thus created in 1972. In spite of the success at net-
working between national centers and in exchanging research information, much
of the research conducted and summarized in the regional database had limited
relevance to policy. A former senior researcher at the Centro de Investigación y
Desarrollo de la Educación in Chile explained how, on assuming a senior manage-
ment position in the Ministry of Education with the democratic administration
in 1990, he found most of the research on the relationship between education
and poverty to have very limited use in the design of policies aimed at improving
the learning chances of poor children (Juan Eduardo García Huidobro, personal
communication, March 1996).
The chief architect of REDUC, Father Patricio Cariola, commented that in
spite of the success in building a database of 20,000 research abstracts, "our
achievement in terms of use of the products of REDUC by people in decision-
making positions, and even by academics, are far more modest" (Cariola, 1996b).
In addition to the role of ideology and of factors influencing the "demand" and
the environment for educational evaluation (the role of the State and public
dominance, the attitudes of elite decision makers, and national and international
politics), the practice of educational evaluation in Latin America has been shaped
State dominance in the educational expansion that took place in Latin America
during the 20th century need not per se have constrained the possibility of
conducting independent or critical educational evaluation. Indeed, centers with
some independence and capacity for critical inquiry developed in some of the
countries with a strong public education system, such as Argentina, Chile, and
Mexico. However, that so few countries developed strong independent research
communities, and that even those that did faced serious constraints on their
professional autonomy, reflects a culture of decision making that devalues
accountability, transparency, and critical dissent.
In Latin America, cultural and institutional norms of elite rule and pervasive
authoritarian practices generally make public decision-making a private affair.
Even the decisions of modern political parties under democratic rule are often
made by a very limited number of elites. It is not uncommon for members of a
political party in Congress to vote en masse after the leadership of the party has
decided a position on the issue in question. As a result, much educational deci-
sion making is not transparent. That the utilization of public resources for
private purposes is not uncommon further reflects the difficulty elites have in
treating the public sphere as a trust for which they are ultimately accountable to
the public or to the intended beneficiaries of their policies. Some argue that the
systematic underperformance of the public education system in Latin America
reflects not problems of implementation, but rather the capture of the system by
the private interests of politicians and bureaucrats (Plank, 1996).
Some have argued that the Spanish colonizers brought with them a feudal
mindset of cultural values that was elitist and exclusionary, and that that is at the
root of the political underdevelopment of Latin America.³ The generation of
wealth and the maintenance of political power on the basis of major asymmetries
between sections of the population reinforced deep-seated attitudes and prac-
tices of exploitation and exclusion. During the 20th century, there was much
political instability and military rule in Latin America which was not conducive
to democratic debate of education reform options or to holding officials account-
able to the public. Paradoxically, some of the patterns of secrecy in State decision
making were maintained as democratic regimes and more competitive politics
replaced military rule. Experiments to engage a wide range of stakeholders in the
discussion of education policy were generally resisted by senior government
officials. This included resistance to very basic practices common in a research com-
munity, such as disseminating a research report (see Reimers & McGinn, 1997).
Arrogant elites tend to see little need to bother reading evaluation research or
asking those who do for advice, believing they know the answers because their
position confers authority on them. They also believe they know the problems
and the questions which call for answers. Since they operate in a context where
they are not held accountable for their actions, the systematic underperformance
of the education systems of Latin America has little consequence for the
perpetuation of their ineffective practices.
ENDNOTES
1 Cariola (1996a) argues that the relationship between research and policy in Latin America emerges
in the late 1950s. Educational research "was intended to help planning. It was basically quantitative
and related to a decision to expand the system. It was done in ministries of education and planning
and in schools of economics and sociology. It was used to produce plans for the ministries and plans
to adopt foreign aid, and was to be simple research for simple policy decisions" (p. 150).
2 Teacher unions emerged as a significant stakeholder of educational change and in national politics
during the 20th century; as their power extended, however, many of them were coopted or
demobilized by the State.
3 For an extended discussion of the contribution of elite values, reflecting a feudal structure, to the
under-development of Latin American societies see Lipset (1967) and Wiarda (2002).
4 The restructuring literature points to the limits to changing the "core" of teaching through control
mechanisms, incentives, and learning opportunities in schools where organizational structure,
culture, and norms maintain compliance with existing practices. In the United States restructuring
represents the most recent approach to school change, following a journey of four decades that has
included innovation and diffusion, organizational development, knowledge utilization and transfer,
creating new schools, managing local reform, training change agents, and managing systemic
reform (Miles, 1998).
REFERENCES
MICHAEL OMOLEWA
Nigerian Permanent Delegation to UNESCO, Paris, France
THOMAS KELLAGHAN
Educational Research Centre, St. Patrick's College, Dublin
EDUCATION IN AFRICA
Education in Africa has served, more or less cogently, several purposes at varying
times. In pre-colonial days, it helped to sustain rich cultural heritages, promoting
and cultivating the values, norms, and interests of communities. These purposes
shifted significantly as a result of external influences.
During the 11th century, when Africa came in contact with foraging Islamic
religious groups, its education systems were significantly altered as many commu-
nities that were Islamized adjusted to an Islamic educational system and values.
When European explorers, along with Christian missionaries, began to arrive
during the middle of the 15th century, western education was introduced. Today's
education systems, together with their evaluation contexts, are profoundly
influenced by western values and practices.
The systems of education that were brought to Africa from Europe differed
markedly from the type of education in the examples of traditional African
education and training described above in several ways: in their separation from
real life, their formality, their abstract nature, their rules of interaction and
communication, their learning strategies, and their focus on cognitive perfor-
mance (see Kellaghan, Sloane, Alvarez, & Bloom, 1993). Ogbu (1982) draws on
Gay and Cole's (1967) study of the Kpelle in Liberia to illustrate the kinds of
problem that this can give rise to for children going to a western school. For
example, since learning in the Kpelle is largely nonverbal, children may find it
difficult to adapt to a medium that is highly verbal. In the specific area of mathe-
matics, the fact that the Kpelle language does not have words for zero or number
may also create problems. Because of such differences, western schooling will be
foreign to the cultural system of students, making it difficult for them to adapt to
it and for homes to reinforce what is taught in school.
There is a dearth of research on colonial systems of education in Africa. In
general terms, we can say that the colonists began to build up European-type
education systems on the continent towards the end of the 19th century and early
part of the 20th century, particularly following the first world war. Christian
churches played a major role in providing schools, although other institutions -
trading companies and later local colonial governments - often provided moral
and financial support. Initially, the main aim of education systems was to provide
personnel for the activities of government, the churches, and commerce. In
pursuing this aim, the frame of reference was entirely western, though efforts were
made from time to time to adapt educational provision to local needs (see Jones,
1922, 1928). However, curricula, languages of instruction, management systems,
and examinations were basically European. Furthermore, the systems were
elitist, and to some extent meritocratic, focusing their resources on students who
could successfully negotiate them and take their place in the colonial structure
(Kellaghan, 1992).
While there is some evidence that European countries varied in their commit-
ment to education in Africa, and at the educational levels at which development
was encouraged, it is an over-simplification to say, as has been said, that the
Portuguese built churches, the British built trading stations, and the French built
schools (Scanlon, 1960). It also seems naive to assert that the concern of adminis-
trators in francophone Africa was to guarantee that the curriculum would be
equal to that of metropolitan France (Scanlon, 1960). Recent research based on
an analysis of information in archives indicates that the schooling students
received was much inferior to that provided in France, and that the main concern
was to ensure that the curriculum was oriented to basic colonial realities and
would inculcate loyalty and subservience (Kelly & Kelly, 2000).
By the time independence was achieved by African countries in the 1950s and
1960s, educational provision throughout the continent was sporadic. For example,
in 1960, the average gross enrolment ratio in primary school (the numbers
enrolled regardless of age as a percentage of the relevant age group) in 39 coun-
tries in Sub-Saharan Africa was 36 percent, and varied between 10 and 90 percent.
Secondary education hardly existed at all (Kellaghan, 1992). By 1995-97, provi-
sion had improved considerably, but in many countries was still totally inadequate.
While net primary enrolment (percentage of the age group corresponding to the
official school age of primary education) in a number of countries was in excess
of 90 percent (Mauritius, Namibia, Swaziland, South Africa), it was 40 percent
or lower in Angola, Eritrea, Ethiopia, Mali, and Mozambique (UNDP, 2001;
UNESCO, 2000). (Figures were not available for several countries.) Provision of
schools (primary and secondary) also varies within countries. Urban areas are
better served than isolated rural areas, while the problem of providing
educational facilities for migrant populations has proved particularly intractable.
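Both indicators cited above are simple ratios once enrolment counts and population-by-age estimates are available. The following sketch uses invented figures purely for illustration; none of the numbers come from the sources cited here.

    # Gross vs. net enrolment ratio, as defined in the text.
    # All figures below are invented for illustration.

    def gross_enrolment_ratio(total_enrolled, official_age_population):
        # counts everyone enrolled, regardless of age
        return 100.0 * total_enrolled / official_age_population

    def net_enrolment_ratio(enrolled_of_official_age, official_age_population):
        # counts only pupils who are of the official primary-school age
        return 100.0 * enrolled_of_official_age / official_age_population

    pop_6_to_11 = 1_200_000          # children of official primary age
    enrolled_total = 950_000         # includes over- and under-age pupils
    enrolled_official_age = 700_000

    print(f"GER = {gross_enrolment_ratio(enrolled_total, pop_6_to_11):.1f}%")
    print(f"NER = {net_enrolment_ratio(enrolled_official_age, pop_6_to_11):.1f}%")

The gap between the two ratios is itself diagnostic, since it is produced by over-age and under-age enrolment, much of it traceable to late entry and grade repetition.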
Much remains to be accomplished if western education is to play a major role in
quickening the rate at which the continent is developing and in challenging the
status quo regarding the structures that have kept so many in Africa below the
poverty line and excluded them from participation in the development processes
of their nations.
The stakes are indeed high. African countries devote a significant percentage
of public expenditure to education - from a low of 18.1 percent in Swaziland to
a high of 33.1 percent in Senegal (1995-1997 figures) (UNDP, 2001) - in the
hope that it will accelerate the rate of social and economic change. The
importance attached to education is also very high in local communities who
look to it as the key instrument of development. They expect it to assist in
building a new and entirely different society, in which the abilities and energies
of people will be released so that they can be the subject, and not merely the
object, of development theories and actions.
To meet these aspirations, a gargantuan effort will be required. It is estimated
that 41 million school-age children in Africa are not at school (UNESCO, 2000).
Furthermore, at a regional conference on Education for All (EFA) for Sub-
Saharan Africa, held in Johannesburg in December 1999, attended by ministers of
education, representatives of civil society and international development
agencies (UNESCO, 2000), it was concluded that for those who are in the edu-
cation system, the quality of what is provided is "poor and the curricula often
irrelevant to the needs of the learners and of social, cultural and economic
development" (p. 26); that "education planning and management capacity ...
remain largely underdeveloped" (p. 26); and that "only a small proportion of
children are reaching the minimum required competencies and our education
systems are not performing to the standards we expect of them" (p. 28).
To ensure that education policies, plans, and programs in the future serve
national interests and do not belie the hopes of the general population, it will be
Policy and practice relating to the evaluation of students have developed over the
years. The former European colonial powers exported their examination systems
as part and parcel of their education systems. Just as western education systems
represented a discontinuity with traditional systems in Africa, so did evaluation
procedures. While traditional procedures were formative, performance-based,
and involved a range of stakeholders, western-type examinations were summa-
tive and relied on instruments administered in artificial
situations. The examinations reflected those of the metropolitan country in
format and content; they were subject-based, externally set, "closed-book," and
taken under controlled conditions at the end of a period of formal study.
Countries in francophone Africa used the French baccalauréat system at the end
of secondary school, while anglophone countries have had close ties with British
examining boards. In many cases the examinations were actually set and marked
The World Bank (1988) report, Education in Sub-Saharan Africa, which was concerned with
proposals to revitalize existing educational infrastructures, recommended the
strengthening of examination systems in the interest of raising academic standards.
A variety of specific recommendations have been made about the role exami-
nations might play in improving education: widening the range of knowledge and
skills assessed and methods of assessment; including an appropriate distribution
of recall, application, synthesis, and evaluation-type questions; providing feed-
back information on student achievement to schools, administrators, and the
community; contributing to greater efficiency by identifying students most likely
to benefit from further education; incorporating elements of school-based assess-
ment into the examinations that will reflect a wider range of competencies than
it is possible to assess in terminal written examinations; and ensuring compara-
bility of standards between schools and between school-based and non-school-
based candidates if education systems diversify to include non-school-based ways
of delivering educational services (Bude & Lewin, 1997a, 1997b; Heyneman &
Ransom, 1990; World Bank, 1988).
There can be little quarrel with the view that the quality of examinations
should be improved. As examinations are held at three key points in the educa-
tional systems of most countries, they have the potential to influence the whole
system. If they influence what goes on in schools, then obviously high quality
examinations should have a more beneficial impact on learning and teaching
than ones of poor quality. Further, providing information on student perform-
ance to schools is likely to be beneficial, though making public information about
individual school performance could be counterproductive. Again, while
examinations might be used to improve selection, too great a focus on this
function could have serious and damaging effects on the educational experiences
of many students if it ignores the fact that for many - in fact the majority -
learning has to have utility beyond that of qualifying individuals for the next level
of education (World Bank, 1988).
EVALUATION OF PROGRAMS
CONCLUSION
Education in Africa faces many problems. Many children do not receive any
formal education and for those that do it is often inadequate as a preparation for
life. There is an obvious need to improve access, to expand and upgrade provi-
sion, and to develop the professional and technical skills of education managers
and teachers. One may hope that evaluation will have a role to play in guiding
the development of policy and of strategies to address these needs.
If it is to do so, further development of evaluation capacity will be required, whether
evaluation has as its focus the achievements of individual students, the achieve-
ments of education systems, or the value of education programs. Considerable
expertise has been built up over a long period of time in examination bodies
which is being applied to improving methods of assessing individual students.
Some of this expertise will transfer to system assessment since some of the skills
it requires are similar to those involved in the assessment of individual students.
However, system assessment also requires additional skills in sampling and
statistical analysis.
The tradition of program evaluation does not extend as far back in time, in
Africa or elsewhere, as that of individual student evaluation. While it has begun
to take root, it is still many years behind what obtains in the developed world. If
it is to make a significant contribution to improving education in Africa, it will
have to move more rapidly towards ensuring that a profession - with its own
standards, objectives, expectations, funding, training, and operating mechanisms
- is firmly in place. That profession will be all the stronger and more effective if
it takes cognizance of African traditions of evaluation in which formative
functions and the involvement of stakeholders were accorded key roles.
ENDNOTE
1 Michael Omolewa is currently the Nigerian Ambassador and Permanent Delegate of Nigeria to
UNESCO in Paris. He received considerable support from Akpovire Oduaran of the University of
Botswana in the preparation of this paper.
REFERENCES
ADEA (Association for Development of Education in Africa). (1999). 1999 prospective, stock-taking
review of education in Africa. Retrieved from http://www.adeanet.org/programs/pstr99/enJlstr99.html
Afemikhe, O.A. (1989). Consumer evaluation of chemistry curricula in a Nigerian university.
Assessment and Evaluation in Higher Education, 14(2), 77-86.
Akpe, C.S. (1988). Using consumer evaluation to improve college curricula in Nigerian teacher
training. Journal of Education for Teaching, 14(1), 85-90.
Akpe, C.S. (1989). Using respondents' free comments to improve college curricula: A case study of
River State College of Education NCE primary programme. Studies in Educational Evaluation, 15,
183-191.
Ali, A., & Akubue, A. (1988). Nigerian primary schools' compliance with Nigeria national policy on
education. An evaluation of continuous assessment practices. Evaluation Review, 12, 625-637.
Arubayi, E. (1985). Subject disciplines and student evaluation of a degree program in education.
Higher Education, 14, 683-691.
Arubayi, E. (1987a). Assessing academic programs in college of education: Perceptions of final year
students. Assessment and Evaluation in Higher Education, 12(2), 105-114.
480 Omolewa and Kellaghan
Arubayi, E. (1987b). Programme evaluation: An empirical study of a degree programme in educa-
tion. Studies in Educational Evaluation, 13, 159-162.
Bajah, S.T. (1987). Evaluation of the impact of curriculum institutes: A Nigerian case study. Studies
in Educational Evaluation, 13, 319-328.
Bhola, H.S. (1990). Evaluating "Literacy for Development" projects, programs and campaigns. Hamburg:
UNESCO Institute for Education/Bonn: Deutsche Stiftung für Internationale Entwicklung.
Bray, M., Clarke, P.B., & Stephens, D. (1986). Education and society in Africa. London: Edward Arnold.
Bude, U., & Lewin, K.M. (Eds.). (1997a). Improving test design. Vol. 1. Constructing test instruments,
analysing results, and improving assessment quality in primary schools in Africa. Bonn: Deutsche
Stiftung für Internationale Entwicklung.
Bude, U., & Lewin, K.M. (Eds.). (1997b). Improving test design. Vol. 2. Assessment of science and
agriculture in primary schools in Africa; 12 country cases reviewed. Bonn: Deutsche Stiftung für
Internationale Entwicklung.
Chinapah, V. (1997). Handbook on monitoring learning achievement. Towards capacity building. Paris:
UNESCO.
Gay, J., & Cole, M. (1967). The new mathematics and an old culture: A study of learning among the
Kpelle of Liberia. New York: Holt, Rinehart & Winston.
Graham, D. (1992). An overview of ERA + 3. In H. Tomlinson (Ed.), The search for standards
(pp. 1-14). Harlow, Essex: Longman.
Heyneman, S.P., & Ransom, A.W. (1990). Using examinations and testing to improve educational
quality. Educational Policy, 4, 177-192.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards.
Thousand Oaks, CA: Sage.
Jones, T.J. (1922). Education in Africa. New York: Phelps-Stokes Fund.
Jones, T.J. (1928). Education in Africa. New York: Phelps-Stokes Fund.
Kellaghan, T. (1992). Examination systems in Africa: Between internationalization and
indigenization. In M.A. Eckstein, & H.J. Noah (Eds), Examinations: Comparative and international
studies (pp. 95-104). Oxford: Pergamon.
Kellaghan, T., & Greaney, V. (1992). Using examinations to improve learning. A study in fourteen
African countries. Washington, DC: World Bank.
Kellaghan, T., & Greaney, V. (2001). The globalisation of assessment in the 20th century. Assessment
in Education, 8, 87-102.
Kellaghan, T., Sloane, K, Alvarez, B., & Bloom, B.S. (1993). The home environment and school
learning. Promoting parental involvement in the education of children. San Francisco: Jossey-Bass.
Kelly, G.P., & Kelly, D.H. (2000). French colonial education: Essays on Vietnam and west Africa. New
York: AMS Press.
Kelly, M., & Kanyika, J. (2000). Learning achievement at the middle basic level. Summary report on
Zambia's national assessment project 1999. Lusaka: Examinations Council of Zambia.
Kgobe, M. (1997). The National Qualifications Framework in South Africa and "out-of-school
youth": Problems and possibilities. International Review of Education, 43, 317-330.
Kulpoo, D. (1998). The quality of education: Some policy suggestions based on a survey of schools.
Mauritius. Paris: International Institute for Educational Planning.
Lewin, K., & Dunne, M. (2000). Policy and practice in assessment in anglophone Africa: Does
globalisation explain convergence? Assessment in Education, 7, 379-399.
Little, K.L. (1960). The role of the secret society in cultural specialization. In S. Ottenberg &
P. Ottenberg (Eds), Cultures and societies of Africa (pp. 199-213). New York: Random House.
Machingaidze, T., Pfukani, P., & Shumba, S. (1998). The quality of education: Some policy suggestions
based on a survey of schools. Zimbabwe. Paris: International Institute for Educational Planning.
Motlotle, N.P. (1998). Bojale in Kgatleng. Zebra's Voice, 25(3/4), 22-25.
Nkamba, M., & Kanyika, J. (1998). The quality of education: Some policy suggestions based on a survey
of schools. Zambia. Paris: International Institute for Educational Planning.
Ogbu, J. U. (1982). Cultural discontinuities and schooling. Anthropology and Education Quarterly, 13,
290-307.
Ottenberg, S. (1959). Ibo receptivity to change. In W.R. Bascom & M.J. Herskovits (Eds), Continuity
and change in African cultures (pp. 130-143). Chicago: University of Chicago Press.
Parsons, O.N. (1984). Education and development in pre-colonial Botswana to 1965. In M. Crowder
(Ed.), Education for development (pp. 21-52). Gaborone, Botswana: The Botswana Society.
Educational Evaluation in Africa 481
Pennycuick, D. (1990). The introduction of continuous assessment systems at secondary level in
developing countries. In P. Broadfoot, R. Murphy, & H. Torrance (Eds), Changing educational
assessment. International perspectives and trends (pp. 106-118). London: Routledge.
Putsoa, B. (1981). A review of inservice training of primary school teachers in Swaziland. In B. Otaala
& S.N. Pandey (Eds.), Educational research in Boleswa countries. Proceedings of the Educational
Research Seminar, University College, Gaborone, Botswana, May 4-7 (pp. 160-161). Gaborone,
Botswana: University College.
Ross, K.N. (2000). Translating educational assessment findings into educational policy and reform
measures: Lessons from the SACMEQ initiative in Africa. Paper presented at World Education
Forum, Dakar, Senegal, April.
Scanlon, D. (1960). Education. In R.A. Lystad (Ed.), The African world. A survey of social research
(pp. 199-220). London: Pall Mall Press.
Schapera, I. (1934). Western civilization and the natives of South Africa. London: Routledge & Kegan
Paul.
Schneider, H.K. (1959). Pakot resistance to change. In W.R. Bascom & M.J. Herskovits (Eds.),
Continuity and change in African cultures (pp. 144-167). Chicago: University of Chicago Press.
Sebatane, E.M. (1981). The role of educational evaluation in Lesotho. Past, present and future. In B.
Otaala & S.N. Pandey (Eds.), Educational research in Boleswa countries. Proceedings of the
Educational Research Seminar, University College, Gaborone, Botswana, May 4-7 (pp. 68-84).
Gaborone, Botswana: University College.
Sebatane, M. (2000). International transfers of assessment: Recent trends and strategies. Assessment
in Education, 7, 401-416.
Townsend Cole, E.K. (1985). The story of education in Botswana. Gaborone, Botswana: Macmillan.
Umar, A., & Tahir, G. (2000). Researching nomadic education: A Nigerian perspective. International
Journal of Educational Research, 33, 231-240.
UNDP (United Nations Development Programme). (2001). Making new technologies work for human
development. Human development report 2001. New York: Oxford University Press.
UNESCO (United Nations Educational, Scientific and Cultural Organization). (2000). World educa-
tion report 2000. The right to education. Towards education for all throughout life. Paris: Author.
UNESCO. (1999). Monitoring Learning Achievement (MLA) Project. Follow-up on Jomtien. Paris:
Author.
UNESCO. (2000). The Dakar Framework for Action. Education for All: Meeting our collective commit-
ments. Paris: Author.
Voigts, F. (1998). The quality of education: Some policy suggestions based on a survey of schools.
Namibia. Paris: International Institute for Educational Planning.
Watkins, M.H. (1943). The west African bush school. American Journal of Sociology, 43, 666-674.
World Bank. (1988). Education in Sub-Saharan Africa. Policies for adjustment, revitalization, and
expansion. Washington DC: Author.
World Declaration on Education for All. (1990). Adopted by the World Conference on Education for
All, Meeting Basic Learning Needs. New York: UNICEF House.
Worthen, B.R., & Sanders, J.R. (1987). Educational evaluation: Alternative approaches and practical
guidelines. New York: Longman.
Section 6

NEW AND OLD IN STUDENT EVALUATION
MARGUERITE CLARKE
Boston College, Center for the Study of Testing, Evaluation and Educational Policy, MA, USA
GEORGE MADAUS
Boston College, Center for the Study of Testing, Evaluation and Educational Policy, MA, USA
Student evaluation involves making judgments about the merit or worth of student
achievement in a particular area. Tests, assessments, and examinations are
commonly used to collect the information on which these judgments are based.
The result may be a decision about an individual student or students, or about
the way in which instruction takes place. The four chapters in this section cover
several key topics in this area of educational evaluation. Robert Mislevy, Mark
Wilson, Kadriye Ercikan, and Naomi Chudowsky discuss psychometric principles
in student assessment; Peter Airasian and Lisa Abrams provide an overview of
evaluation in the classroom; Caroline Gipps and Gordon Stobart consider alter-
native assessment; and Thomas Kellaghan and George Madaus describe external
(public) examinations.
While the term "evaluation" has been defined elsewhere in this volume, it is also
worth looking at some definitions of the terms "assessment," "test," and "examina-
tion" since they figure prominently in the chapters that follow. The Oxford
English Dictionary (OED) (1989) reveals interesting etymologies for the terms.
While "assessment," modified by the adjectives "authentic," "alternative," or
"performance," has only become popular in educational discourse in recent
years, the OED reveals that the word, meaning estimation or evaluation, actually
dates back to 1626. However, its first link to education, according to the OED,
did not occur until 1956 when it was used in England in contradistinction to the
more traditional term "examination." The term "test" is considered an
Americanism by the OED. The first education-related reference is from 1910
when the term was used to imply a "simpler, less formal, procedure than an exami-
nation." "Examination" is the oldest of the three terms and the one preferred in
Europe. The first reference to an academic examination (defined as the process
of testing, by questions oral or written, the knowledge or ability of pupils)
identified by the OED was in 1612, and reads as follows: "Which worke of
continuali examination, is a notable quickner and nourisher of all good learning."
There are a number of subtle and interesting differences between the OED
definitions of the verbs "examine," "test," and "assess." To "examine" implies an
attempt to investigate, inquire, and probe; to understand and learn about the
capacity or knowledge of a student. To "test" implies putting students through an
ordeal or trial designed to reveal their achievements. To "assess" implies deter-
mining or fixing an amount; the process or means of evaluating academic work.
Whichever term is used, certain fundamental principles underpin the process
of collecting and evaluating information on students (AERA, APA, NCME,
1999; Madaus, Raczek, & Clarke, 1997; National Research Council, 1999).
Mislevy, Wilson, Ercikan, and Chudowsky lay out these principles in the first
chapter in this section. The authors describe the assessment situation as one in
which information is gathered on what a student says, does, or makes in a few
particular circumstances in order to make inferences about what the student
knows or can do more generally. Whether this situation involves a teacher asking
questions to find out if a student is ready to move on to the next page in her
reading book, or policymakers mandating a national assessment of fourth grade
students' reading ability, the assessment can be viewed as an evidentiary
argument, of which the principles of validity, reliability, comparability, and fairness
are regarded as desirable properties. While the focus of the Mislevy et al. chapter
is on the statistical tools and techniques used to establish these properties in
large-scale standardized testing, the authors note that the properties - since they
are also social values - apply to other forms of student evaluation, although their
operationalization may vary with context.
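To give one concrete instance of what such operationalization looks like in large-scale testing: reliability is often summarized there by an internal-consistency index such as Cronbach's alpha. The sketch below is our illustration in Python, with invented scores rather than data from any chapter in this section.

    # Cronbach's alpha, one classical index of score reliability.
    # Scores below are invented: rows = students, columns = items.

    def cronbach_alpha(scores):
        k = len(scores[0])                      # number of items
        def var(xs):                            # population variance
            m = sum(xs) / len(xs)
            return sum((x - m) ** 2 for x in xs) / len(xs)
        item_vars = [var([row[i] for row in scores]) for i in range(k)]
        total_var = var([sum(row) for row in scores])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    data = [
        [3, 4, 3, 4],
        [2, 2, 1, 2],
        [4, 4, 4, 3],
        [1, 2, 2, 1],
        [3, 3, 4, 4],
    ]
    print(round(cronbach_alpha(data), 2))

Values near 1 indicate that the items rank students consistently. The same concern arises in classroom and alternative assessment, but, as the chapters note, it is rarely quantified in this way there.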
The three chapters that follow look at some of the contexts in which student
evaluation takes place, as well as some of the uses to which information may be
put. Airasian and Abrams describe evaluation in the classroom. Five areas of
evaluation, which have as their common purpose to aid the teacher in making
decisions about teaching and learning processes and outcomes, are outlined. The
evaluation activities discussed include not only tests - either teacher-made or
externally developed - but also more informal activities such as questioning or
monitoring of a class during a lesson or activity. External examinations, as
described by Kellaghan and Madaus, provide a strong contrast to these classroom-
based evaluation practices, since they represent a form of evaluation in which
control over purpose, content, length, timing, and scoring is external to the
classroom. Furthermore, the data obtained are often used to make important
decisions about students, including ones relating to entry to the next level of
education, graduation, and certification. Given the high stakes that are attached
to these decisions, issues of validity, reliability, comparability, and fairness are of
greater concern than in many other kinds of student evaluation. The focus of the
chapter by Gipps and Stobart is on alternative forms of assessment. While the
purposes (i.e., assessment for learning) and characteristics of these forms (i.e., an
integral part of the teaching/learning process, tends to take place under relatively
uncontrolled conditions, seeks evidence of best rather than typical perform-
ances) seem more in tune with classroom-based student evaluation, the drive to
include these kinds of information-rich approaches in large-scale assessment
systems has given them, in these contexts, characteristics, purposes, and effects
similar to those of external (public) examinations.
As Kellaghan and Madaus note, student evaluation has a long history. External
testing systems can be traced back to the Western Zhou Dynasty in China
(1027-771 BC). In addition to longevity, these kinds of evaluation activities
affirm Ecclesiastes' moral (i.e., "What has been is what will be, and what has
been done is what will be done; and there is nothing new under the sun") in
terms of the formats used for collecting information on student achievement
(Madaus, Raczek, and Clarke, 1997). For example, supply-type items were used
by the medieval guilds and universities in Europe. The former required
apprentices to supply a product as proof of competence; the latter used the viva
voce, or oral disputation, to evaluate a student's knowledge and determine
whether he was worthy of being included in the community that the judge had
been authorized to represent (Hoskins, 1968). Selection-type items have a
similarly long lineage. For example, the true-false test is believed to have
forerunners in the Chinese civil service examination system (Loewe, 1986).
Student evaluation is a paradox. In some ways, it remains the same; in others,
it is constantly evolving. Reflecting the latter reality, the chapters in this section
refer to several developments in the areas of computer technology, cognitive
psychology, and psychometrics that suggest exciting new possibilities and
improvements (see also Pellegrino, Chudowsky, & Glaser, 2001). Despite these
advances, it is important to keep in mind that while the tools, and even the func-
tions, of student evaluation may change, certain key principles remain the same.
These include the need to attend to issues of validity, reliability, comparability, and
fairness; the need to balance instructional (usually classroom-based) and account-
ability uses of evaluation information; and the key role of the teacher as an
influence on the context and outcomes of the evaluation process.
REFERENCES
ROBERT J. MISLEVY
University of Maryland, MD, USA
MARKR. WILSON
University of California, Berkeley, CA, USA
KADRIYE ERCIKAN
University of British Columbia, BC, USA
NAOMI CHUDOWSKY
National Research Council, Washington DC, USA
"Validity, reliability, comparability, and fairness are not just measurement issues,
but social values that have meaning and force outside of measurement wherever
evaluative judgments and decisions are made" (Messick, 1994, p. 2).
What are psychometric principles? Why are they important? How do we attain
them? We address these questions from the perspective of assessment as eviden-
tiary reasoning; that is, how we draw inferences about what students know, can
do, or understand from the handful of particular things they say, do, or make in
an assessment setting. Messick (1989), Kane (1992), and Cronbach and Meehl
(1955) show the deep insights that can be gained from examining validity from
such a perspective. We aim to extend the approach to additional psychometric
principles and bring out connections with assessment design and probability-
based reasoning.
Seen through this lens, validity, reliability, comparability, and fairness (as in
the quotation from Messick, above) are properties of an argument - not formulae,
models, or statistics per se. We will do two things before we even introduce
statistical models. First, we will look at the nature of evidentiary arguments in
assessment, paying special attention to the role of standardization. And secondly,
we will describe a framework that structures the evidentiary argument in a given
assessment, based on an evidence-centered design framework (Mislevy, Steinberg,
& Almond, in press). In this way, we may come to appreciate psychometric
principles without tripping over psychometric details.
AN INTRODUCTORY EXAMPLE
[Figure 1 (graphic not reproduced): the chain of models in the assessment framework, with the student model at the left linked through the measurement model and the scoring model to students' performances on tasks.]
Understanding Concepts (U) - Understanding scientific concepts (such as properties and inter-
actions of materials, energy, or thresholds) in order to apply the relevant scientific concepts to the
solution of problems. This variable is the IEY version of the traditional "science content", although
this content is not just "factoids".
Designing and Conducting Investigations (I) - Designing a scientific experiment, carrying through
a complete scientific investigation, performing laboratory procedures to collect data, recording and
organizing data, and analyzing and interpreting results of an experiment. This variable is the IEY
version of the traditional "science process".
Evidence and Tradeoffs (E) - Identifying objective scientific evidence as well as evaluating the
advantages and disadvantages of different possible solutions to a problem based on the available
evidence.
Communicating Scientific Information (C) - Organizing and presenting results in a way that is free
of technical errors and effectively communicates with the chosen audience.
Figure 2: The Variables in the Student Model for the BEAR "Issues, Evidence, and You" Example
designed to prompt student responses that relate to the "Evidence and Tradeoffs"
variable defined in Figure 2. Note that this variable is a somewhat unusual one
in a science curriculum; the IEY developers think of it as representing the sorts
of cognitive skills one would need to evaluate the importance of, say, an
environmental impact statement, something that a citizen might need to do that
is directly related to science's role in the world. An example of a student
response to this task is shown in Figure 5.
How do we extract from this particular response some evidence about the
unobservable student-model variable we have labeled Evidence and Tradeoffs?
What we need is in the second model from the right in Figure 1 - the scoring
model. This is a procedure that allows one to focus on aspects of the student
response and assign them to categories, in this case ordered categories, that
suggest higher levels of proficiency along the underlying latent variable. A scoring
model can take the form of what is called a "rubric" in the jargon of assessment,
and in IEY does take that form (although it is called a "scoring guide"). The
rubric for the Evidence and Tradeoffs variable is shown in Figure 6. It enables a
teacher or a student to recognize and evaluate two distinct aspects of responses
to the questions related to the Evidence and Tradeoffs variable. In addition to
the rubric, scorers have exemplars of student work available to them, complete
with adjudicated scores and explanations of the scores. They also have a method
(called "assessment moderation") for training people to use the rubric. All these
elements together constitute the scoring model. So, what we put in to the scoring
model is a student's performance; what we get out is one or more scores for each
task, and thus a set of scores for a set of tasks. In addition to embedded tasks,
the IEY assessments also include so-called "link-items", which are more like
regular achievement test items (though scored with the same scoring guides).
2. You are a public health official who works in the Water Department. Your supervisor has asked
you to respond to the public's concern about water chlorination at the next City Council meeting.
Prepare a written response explaining the issues raised in the newspaper articles. Be sure to
discuss the advantages and disadvantages of chlorinating drinking water in your response, and
then explain your recommendation about whether the water should be chlorinated.
As an edjucated employee of the Grizzelyville water company, I am well aware of the controversy
surrounding the topic of the chlorination of our drinking water. I have read the two articles
regarding the pro's and cons of chlorinated water. I have made an informed decision based on the
evidence presented the articles entitled "The Peru Story" and "700 Extra People May bet Cancer
in the US". It is my recommendation that our towns water be chlorine treated. The risks of infect-
ing our citizens with a bacterial disease such as cholera would be inevitable if we drink non-treated
water. Our town should learn from the country of Peru. The article "The Peru Story" reads thousands
of inocent people die of cholera epidemic. In just months 3,500 people were killed and more
infected with the diease. On the other hand if we do in fact chlorine treat our drinking water a risk
is posed. An increase in bladder and rectal cancer is directly related to drinking chlorinated water.
Specifically 700 more people in the US may get cancer. However, the cholera risk far outweighs
the cancer risk for 2 very important reasons. Many more people will be effected by cholera where
as the chance of one of our citizens getting cancer due to the water would be very minimal. Also
cholera is a spreading disease whereas cancer is not. If our town was infected with cholera we could
pass it on to millions of others. And so, after careful consideration it is my opinion that the citizens
of Grizzelyville drink chlorine treated water.

Figure 5: A Sample Task and Student Response Relating to the Evidence and Tradeoffs Variable
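Conceptually, a scoring guide of this kind can be thought of as a small data structure: a set of ordered levels, each carrying the descriptor a trained rater checks a response against. The Python sketch below is our illustration, not part of the IEY materials; the level texts are abbreviated paraphrases of Figure 6 (levels lost in reproduction are omitted), and the function merely records a rater's judgment rather than automating it.

    # A rubric ("scoring guide") rendered as a data structure. Descriptors are
    # abbreviated paraphrases; real scoring rests on trained human judgment,
    # exemplars, and moderation, not on this lookup.

    RUBRIC_NAME = "Evidence and Tradeoffs"
    MAX_SCORE = 4
    DESCRIPTORS = {
        4: "accomplishes Level 3 AND questions the source, validity, and/or quantity of evidence",
        2: "some objective reasons and supporting evidence, but incomplete",
        1: "only subjective reasons and/or inaccurate or irrelevant evidence",
    }

    def record_score(task_id, rater_score):
        # Store a rater's judgment as one observable variable for the task.
        if not 0 <= rater_score <= MAX_SCORE:
            raise ValueError("score outside the rubric's range")
        return {"task": task_id, "variable": RUBRIC_NAME, "score": rater_score,
                "descriptor": DESCRIPTORS.get(rater_score, "")}

    print(record_score("chlorination letter", 4)["descriptor"])

The point of pairing the rubric with exemplars and moderation, as the text notes, is precisely that the judgment stays human; the structure only fixes what the score categories mean.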
What now remains? We need to connect the student model on the left hand
side of Figure 1 with the scores that have come out of the scoring model. In what
way, and with what value, should these nuggets of evidence affect our beliefs
about the student's knowledge? For this we have another model, which we will
call the measurement model. This single component is commonly known as a
psychometric model. Now this is somewhat of a paradox, as we have just explained
that the framework for psychometrics actually involves more than just this one
model. The measurement model has indeed traditionally been the focus of
psychometrics, but it is not sufficient to understand psychometric principles. The
complete set of elements, the full evidentiary argument, must be addressed.
Figure 7 shows the relationships in the measurement model for the sample
IEY task. Here the student model (first shown in Figure 3) has been augmented
with a set of boxes. The boxes are intended to indicate observable variables
rather than latent ones; these are in fact the scores from the scoring model for this
task. They are connected to the Evidence and Tradeoffs student-model variable
with straight lines, meant to indicate a causal (though probabilistic) relationship
between the variable and the observed scores, and the causality is posited to run
from the student model variables to the scores. Put another way, what the
student knows and can do, as represented by the variables of the student model,
determines how likely it is that students will make right answers rather than wrong
ones, carry out sound inquiry rather than founder, and so on, in each particular
task they encounter. In this example, both observable variables are posited to
depend on the same aspect of knowledge, namely Evidence and Tradeoffs. A
different task could have more or fewer observables, and each would depend on
one or more student-model variables, all in accordance with the knowledge or
skills the task is designed to evoke.
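The chapter does not commit to a particular statistical form for this probabilistic link, but one common choice for ordered rubric scores is a Rasch-family partial credit model, in which the probability of each score category depends on the latent student-model variable. The following Python sketch is purely illustrative; the step parameters are invented, not estimated from BEAR data.

    import math

    def partial_credit_probs(theta, deltas):
        # P(score = k | theta) under a Rasch partial credit model, where
        # deltas[j] is the difficulty of the step from category j to j + 1.
        logits = [0.0]                    # cumulative sum for k = 0 is empty
        for d in deltas:
            logits.append(logits[-1] + (theta - d))
        denom = sum(math.exp(l) for l in logits)
        return [math.exp(l) / denom for l in logits]

    deltas = [-1.0, 0.0, 1.0, 2.0]        # invented step difficulties, scores 0-4
    for theta in (-1.0, 0.0, 1.5):
        probs = partial_credit_probs(theta, deltas)
        print(theta, [round(p, 2) for p in probs])

Raising theta shifts probability mass toward the higher rubric categories, which is exactly the causal-though-probabilistic direction posited in Figure 7.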
It is important for us to say that the student model in this example (indeed in
most psychometric applications) is not proposed as a realistic explanation of the
thinking that takes place when a student works through a problem. It is a piece
Score 4
Using Evidence: Response accomplishes Level 3 AND goes beyond in some significant way, such as questioning or justifying the source, validity, and/or quantity of evidence.
Using Evidence to Make Tradeoffs: Response accomplishes Level 3 AND goes beyond in some significant way, such as suggesting additional evidence beyond the activity that would further influence choices in specific ways, OR questioning the source, validity, and/or quantity of evidence & explaining how it influences choice.

Score 2
Using Evidence: Response provides some objective reasons AND some supporting evidence, BUT at least one reason is missing and/or part of the evidence is incomplete.
Using Evidence to Make Tradeoffs: Response states at least one perspective of the issue AND provides some objective reasons using some relevant evidence BUT reasons are incomplete and/or part of the evidence is missing; OR only one complete & accurate perspective has been provided.

Score 1
Using Evidence: Response provides only subjective reasons (opinions) for choice and/or uses inaccurate or irrelevant evidence from the activity.
Using Evidence to Make Tradeoffs: Response states at least one perspective of the issue BUT only provides subjective reasons and/or uses inaccurate or irrelevant evidence.

Figure 6: The Scoring Model for Evaluating Two Observable Variables from Task Responses in
the BEAR Assessment
[Graphic not reproduced: the Evidence and Tradeoffs student-model variable linked to two observable variables, Using Evidence and Using Evidence to Make Tradeoffs.]
Figure 7: Graphical Representation of the Measurement Model for the BEAR Sample Task for
the item in Figure 2
We have seen, through a quick example, how assessments can be viewed as eviden-
tiary arguments, and how psychometric principles can be viewed as desirable
properties of those arguments. Let's go back to the beginning and develop this
line of reasoning more carefully.
And the task of establishing the relevance of data and its weight as evidence
depends on the chain of reasoning we construct from the evidence to those
conjectures.
Both conjectures and an understanding of what constitutes evidence about
them arise from the concepts and relationships of the field under consideration.
We use the term "substantive" to refer to these content- or theory-based aspects
of reasoning within a domain, in contrast to structural aspects such as logical
structures and statistical models. In medicine, for example, physicians frame
diagnostic hypotheses in terms of what they know about the nature of diseases,
and the signs and symptoms that result from various disease states. The data are
patients' symptoms and physical test results, from which physicians reason back
to likely disease states. In history, hypotheses concern what happened and why.
Letters, documents, and artifacts are the historian's data, which she/he must fit
into a larger picture of what is known and what is supposed.
Philosopher Stephen Toulmin (1958) provided terminology for talking about
how we use substantive theories and accumulated experience (say, about algebra
and how students learn it) to reason from particular data (Joe's solutions) to a
particular claim (what Joe understands about algebra). Figure 8 outlines the
structure of a simple argument. The claim is a proposition we wish to support
with data. The arrow represents inference, which is justified by a warrant, or a
generalization that justifies the inference from the particular data to the
particular claim. Theory and experience provide backing for the warrant. In any
particular case we reason back through the warrant, so we may need to qualify
our conclusions because there are alternative hypotheses, or alternative explanations
for the data.
[Figure 8: The structure of a simple argument. Data support a claim ("Sue can use specifics to
illustrate a description of a fictional character") since a warrant ("Students who know how to use
writing techniques will do so in an assignment that calls for them"), itself backed on account of
theory and experience, licenses the inference, unless an alternative hypothesis ("The student has
not actually produced the work") accounts for the data instead.]
Evidence rarely comes without a price. An obvious factor in the total cost of an
evidentiary argument is the expense of gathering the data, but figuring out what
data to gather and how to make sense of it can also be expensive. In legal cases,
these latter tasks are usually carried out after the fact. Because each case is
unique, at least parts of the argument must be uniquely fashioned. Marshalling
evidence and constructing arguments in the O.J. Simpson case took more than a
year and cost prosecution and defense millions of dollars.
If we foresee that the same kinds of data will be required for similar purposes
on many occasions, we can achieve efficiencies by developing standard proce-
dures both for gathering the data and reasoning from it (Schum, 1994, p. 137). A
well-designed protocol for gathering data addresses important issues in its
interpretation, such as making sure the right kinds and right amounts of data are
obtained, and heading off likely or pernicious alternative explanations. Following
standard procedures for gathering biological materials from crime scenes, for
example, helps investigators avoid contaminating a sample and allows them to
keep track of everything that happens to it from collection to testing. Furthermore,
merely confirming that they have followed the protocols immediately commu-
nicates to others that these important issues have been recognized and dealt with
responsibly.
A major way to make large-scale assessment practicable in education is to
think these issues through up front: laying out the argument for what data to
gather from each of the many students that will be assessed and the reason for
gathering them. The details of the data will vary from one student to another,
and so will the claims. But the same kind of data will be gathered for each
student, the same kind of claim will be made, and, most importantly, the same
argument structure will be used in each instance. This strategy offers great
efficiencies, but it allows the possibility that cases that do not accord with the
common argument will be included. Therefore, establishing the credentials of
the argument in an assessment that is used with many students entails the two
distinct responsibilities listed below. We shall see that investigating them and
characterizing the degree to which they hold can be described in terms of
psychometric principles.
Establishing the Credentials of the Common Argument

This is where efficiency is gained. To the extent that the same argument structure
holds for all the students it will be used with, the specialization to any particular
student inherits the backing that has been marshaled for the general form. We
will discuss below how the common argument is framed. Both rational analyses
and large-scale statistical analyses can be used to test its fidelity at this macro
level. These tasks can be arduous, and they can never really be considered
complete because we could always refine the argument or test additional
alternative hypotheses (Messick, 1989). The point is, though, that this effort does
not increase in proportion to the number of examinees who are assessed.
Detecting Individuals for Whom the Common Argument Does Not Hold
Inevitably, the theories, the generalizations, the empirical grounding for the
common argument will not hold for some students. The usual data arrive, but the
usual inference does not follow, even if the common argument does support
validity and reliability in the main. These instances call for additional data or
different arguments, often on a more expensive case-by-case basis. An assess-
ment system that is both efficient and conscientious will minimize the frequency
with which these situations occur, but routinely draw attention to them when
they do.
It is worth emphasizing that the standardization we are discussing here
concerns the structure of the argument, not necessarily the form of the data.
Some may think that this form of standardization is only possible with so-called
objective item forms such as multiple-choice items. Few large-scale assessments
are more open-ended than the Advanced Placement (AP) Studio Art portfolio
assessment (Myford & Mislevy, 1995); students have an almost unfettered choice
of media, themes, and styles. But the AP program provides a great deal of
information about the qualities students need to display in their work, what they
need to assemble as work products, and how raters will evaluate them. This
structure allows for a common argument, heads off alternative explanations
about unclear evaluation standards in the hundreds of AP Studio Art classrooms
across the country, and, most happily, helps the students come to understand the
nature of good work in the field (Wolf, Bixby, Glenn, & Gardner, 1991).
Validity
Reliability
Reliability concerns the adequacy of the data to support a claim, presuming the
appropriateness of the warrant and the satisfactory elimination of alternative
hypotheses. Even if the reasoning is sound, there may not be enough information
in the data to support the claim. Later we will see how reliability is expressed
quantitatively when probability-based measurement models are employed. It
may be noted here, however, that the procedures by which data are gathered can
involve multiple steps or features, each of which can affect the evidentiary value
of data. Relying on Jim's rating of Sue's essay rather than evaluating it
ourselves adds a step of reasoning to the chain, introducing the need to establish
an additional warrant, examine alternative explanations, and assess the value of
the resulting data.
How can we gauge the adequacy of evidence? Brennan (2001) has noted that,
since the work of Spearman (1904), the idea of repeating the measurement
process has played a central role in characterizing an assessment's reliability
much as it does in physical sciences. If you weigh a stone ten times and get a
slightly different answer each time, the variation among the measurements is a
good index of the uncertainty associated with that measurement procedure. It is
less straightforward to know just what repeating the measurement procedure
means, though, if the procedure has several steps that could each be done
differently (different occasions, different task, different raters), or if some of the
steps cannot be repeated at all (if a person learns something by working through
a task, a second attempt is not measuring the same level of knowledge). We will
see that the history of reliability is one of figuring out how to characterize the
value of evidence in increasingly wider ranges of assessment situations.
Comparability
Fairness
Fairness is a term that encompasses more territory than we can address in this
paper. Many of its senses relate to social, political, and educational perspectives
on the uses to which assessment results are put (Willingham & Cole, 1997).
These all give rise to legitimate questions, which would exist even if the chain of
reasoning from observations to constructs contained no uncertainty whatsoever.
Like Wiley (1991), we focus our attention here on construct meaning rather than
use or consequences, and consider aspects of fairness that bear directly on this
portion of the evidentiary argument.
Fairness in this sense concerns alternative explanations of assessment per-
formances in light of other characteristics of students that we could and should
take into account. Ideally, the same warrant backs inferences about many students,
reasoning from their particular data to a claim about what each individual knows
or can do. This is never strictly the case in practice, since factors such as
language background, instructional background, and familiarity with represen-
tations surely influence performance. When the same argument is to be applied
with many students, considerations of fairness require us to examine the impact
of such factors on performance, and to identify the ranges of their values beyond
502 Mislevy, Wilson, Ercikan and Chudowsky
which the common warrant can no longer be justified. Drawing the usual inference
from the usual data for a student who lies outside this range leads to inferential
errors. If they are errors we should have foreseen and avoided, they are unfair.
Ways of avoiding errors involve the use of additional knowledge about students
to condition our interpretation of what we observe under the same procedures,
and the gathering of data from different students in different ways, such as
providing accommodations or allowing students to choose among ways of providing
data (and accepting the responsibility as assessors to establish the comparability
of data so obtained!).
In this section, a schema for the evidentiary argument that underlies educational
assessments, incorporating both its substantive and statistical aspects, is presented.
It is based on the "evidence-centered" framework for assessment design illus-
trated in Mislevy, Steinberg, and Almond (in press) and Mislevy, Steinberg,
Breyer, Almond, and Johnson (1999, in press). We will use it presently to examine
psychometric principles from a more technical perspective. The framework
formalizes another proposition from Messick (1994).
Figure 1, presented with the introductory example, depicts elements and rela-
tionships that must be present, at least implicitly, and coordinated, at least
functionally, if an assessment is to serve an inferential function effectively.
Making this structure explicit helps an evaluator understand how to first gather,
then reason from, data that bear on what students know and can do.
In brief, the student model specifies the variables in terms of which we wish to
characterize students. Task models are schemas for ways to get data that provide
evidence about them. There are two components of the evidence model, which
act as links in the chain of reasoning from students' work to their knowledge and
skill. The scoring component contains procedures for extracting the salient
features of students' performances in task situations (observable variables), and
the measurement component contains machinery for updating beliefs about
student-model variables in light of this information. These models are discussed
in more detail below. Taken together, they make explicit the evidentiary
grounding of an assessment and guide the choice and construction of particular
tasks, rubrics, statistical models, and so on. An operational assessment will
generally have one student model, which may contain many variables, but may
use several task and evidence models to provide data of different forms or with
different rationales.
Task Models

A task model provides a framework for constructing and describing the situa-
tions in which examinees act. We use the term "task" in the sense proposed by
Haertel and Wiley (1993) to refer to a "goal-directed human activity to be pursued
in a specified manner, context, or circumstance." A task can thus be an open-
ended problem in a computerized simulation, a long-term project such as a term
paper, a language-proficiency interview, or a familiar multiple-choice or short-
answer question.
A task model specifies the environment in which the student will say, do, or
produce something; for example, characteristics of stimulus material, instruc-
tions, help, tools, and so on. It also specifies the work product, or the form in
which what the student says, does, or produces will be captured. But again it is
substantive theory and experience that determine the kinds of situations that can
evoke behaviors that provide clues about the targeted knowledge and skill, and
the forms in which those clues can be expressed and captured.
To create a particular task, an assessment designer, explicitly or implicitly,
assigns specific values to task model variables, provides materials that suit those
specifications, and sets the conditions required for interacting
with the student. A task thus describes particular circumstances meant to provide
the examinee an opportunity to act in ways that produce evidence about what
he or she knows or can do more generally. For a particular task, the values of its task
model variables constitute data for the evidentiary argument, characterizing the
situation in which the student is saying, doing, or making something.
It is useful to distinguish task models from the scoring models discussed in the
next section, as the latter concern what to attend to in the resulting performance
and how to evaluate what we see. Distinct and possibly quite different evaluation
rules could be applied to the same work product from a given task. Distinct and
possibly quite different student models, designed to serve different purposes, or
derived from different conceptions of proficiency, could be informed by
performances on the same tasks. The substantive arguments for the evidentiary
value of behavior in the task situation will overlap in these cases, but the specifics
of the claims, and thus the specifics of the statistical links in the chain of
reasoning, will differ.
Evidence Models

An evidence model lays out the part of the evidentiary argument that concerns
reasoning from the observations in a given task situation to revising beliefs about
student model variables. Figure 1 shows there are two parts to the evidence model.
The scoring component contains "evidence rules" for extracting the salient
features of whatever the student says, does, or creates in the task situation - i.e.,
the "work product" that is represented by the jumble of shapes in the rectangle
at the far right of the evidence model. A work product is a unique human pro-
duction, perhaps as simple as a response to a multiple-choice item or as complex
as repeated cycles of treating and evaluating patients in a medical simulation.
The squares coming out of the work product represent "observable variables," or
evaluative summaries of what the assessment designer has determined are the
key aspects of the performance (as captured in one or more work products) to
serve the assessment's purpose. Different aspects could be captured for different
purposes. For example, a short impromptu speech contains information about a
student's subject matter knowledge, presentation capabilities, or English
language proficiency; any of these, or any combination of them, could be the
basis of one or more observable variables. As a facet of fairness, however, the
student should be informed of which aspects of her/his performance are being
evaluated, and by what criteria. Students' failure to understand how their work
will be scored is an alternative hypothesis for poor performance which can and
should be avoided.
Scoring rules map unique performances into a common interpretative frame-
work, thus laying out what is important in a performance. These rules can be as
simple as determining whether the response to a multiple-choice item is correct,
or as complex as an expert's holistic evaluation of multiple aspects of an uncon-
strained patient-management solution. They can be automated, demand human
judgment, or require both. Values of the observable variables describe properties of
the particular things a student says, does, or makes. As such, they constitute data
about what the student knows, can do, or has accomplished as more generally
construed in the standard argument.
It is important to note that substantive concerns drive the definition of observ-
able variables. Statistical analyses can be used to refine definitions, compare
alternatives, or improve data-gathering procedures, again looking for patterns
that call a scoring rule into question. But it is the conception of what to observe
that concerns validity directly, and raises questions of alternative explanations
that bear on comparability and fairness.
The measurement component of the Evidence Model tells how the observable
variables depend, in probabilistic terms, on student-model variables, another
essential link in the evidentiary argument. This is the foundation for the reason-
ing that is needed to synthesize evidence across multiple tasks or from different
performances. Figure 1 shows how the observables are modeled as depending on
some subset of the student model variables. The familiar models from test theory
that we discuss in a following section, including classical test theory and item
response theory, are examples. We can adapt these ideas to suit the nature of the
student model and observable variables in any given application (Almond &
Mislevy, 1999). Again, substantive considerations must underlie why these posited
relationships should hold; the measurement model formalizes the patterns
they imply.
It is a defining characteristic of psychometrics to model observable variables
as probabilistic functions of unobservable student variables. The measurement
model is almost always a probability model. The probability-based framework
may extend to the scoring model as well, as when judgments are required
to ascertain the values of observable variables from complex performances.
Questions of accuracy, agreement, leniency, and optimal design arise, and can be
addressed with a measurement model that includes the rating link as well as the
synthesis link in the chain of reasoning. The generalizability and rater models
discussed below are examples of this.
PSYCHOMETRIC PRINCIPLES AND PROBABILITY-BASED
REASONING
This section looks more closely at what is perhaps the most distinctive charac-
teristic of psychometrics, namely, the use of statistical models. Measurement
models are a particular form of reasoning from evidence; they provide explicit,
formal rules for how to integrate the many pieces of information that may be
relevant to a particular inference about what students know and can do. Statistical
modeling, probability-based reasoning more generally, is an approach to solving
the problem of "reverse reasoning" through a warrant, from particular data to a
particular claim. Just how can we reason from data to claim for a particular student,
using a measurement model established for general circumstances - usually far
less than certain, typically with qualifications, perhaps requiring side conditions
that may or may not be satisfied? How can we synthesize the evidentiary value of
multiple observations, perhaps from different sources, often in conflict?
The essential idea is to approximate the important substantive relationships in
some real world problem in terms of relationships among variables in a
probability model. A simplified picture of the real world situation results. A
useful model does not explain all the details of actual data, but it does capture
the significant patterns among them. What is important is that in the space of the
model, the machinery of probability-based reasoning indicates exactly how
reverse reasoning is to be carried out (specifically, through Bayes theorem), and
how different kinds and amounts of data should affect our beliefs. The trick is to
build a probability model that both captures the important real world patterns
and suits the purposes of the problem at hand.
Measurement models concern the relationships between students' knowledge
and their behavior. A student is modeled in terms of variables (θ) that represent
the facets of skill or knowledge that suit the purpose of the assessment, and the
data (X) are values of variables that characterize aspects of the observable
behavior. We posit that the student-model variables account for observable
variables in the following sense: We do not know exactly what any student will do
on a particular task, but for people with any given value of θ, there is a probability
distribution of possible values of X, say p(X | θ). This is a mathematical expression
of what we might expect to see in data, given any possible values of student-
model variables. The way is open for reverse reasoning, from observed Xs to
likely θs, as long as different values of θ produce different probability
distributions for X. We do not know the values of the student-model variables in
practice; we observe "noisy" data presumed to have been determined by them and,
through the probability model, reason back to what their values are likely to be.
Choosing to manage information and uncertainty with probability-based reason-
ing, with its numerical expressions of belief in terms of probability distributions,
does not constrain one to any particular forms of evidence or psychological
frameworks. That is, it says nothing about the number or nature of elements of
X, or about the character of the performances, or about the conditions under
which performances are produced. And it says nothing about the number or
nature of elements of θ, such as whether they are number values in a differential
psychology model, production-rule mastery in a cognitive model, or tendencies
to use resources effectively in a situative model. In particular, using probability-
based reasoning does not commit us to long tests, discrete tasks, or large samples
of students. For example, probability-based models have been found useful in
modeling patterns of judges' ratings in the previously mentioned AP Studio Art
portfolio assessment (Myford & Mislevy, 1995), which is about as open-ended as
large-scale, high-stakes educational assessments get, and in modeling individual
students' use of production rules in a tutoring system for solving physics
problems (Martin & vanLehn, 1995).
It should be borne in mind that a measurement model is not intended to
account for every detail of data; it is only meant to approximate the important
patterns. The statistical concept of conditional independence formalizes the
working assumption that if the values of the student model variables were
known, there would be no further information in the details. The fact that every
detail of a student's responses could in principle contain information about what
a student knows or how he/she is thinking underscores the constructive and
purposive nature of modeling. We use a model at a given grain size or with certain
kinds of variables, not because we think that is somehow "true," but rather
because it adequately expresses the patterns in the data in light of the purpose of
the assessment. Adequacy in a given application depends on validity, reliability,
comparability, and fairness in ways we shall discuss further, but characterized in
ways, and demanded in degrees, that depend on that application: the purpose of
the assessment, the resources that are available, the constraints that must be
accommodated. We might model the same troubleshooting performances in
terms of individual problem steps for an intelligent tutoring system, in terms of
general areas of strength and weakness for a diagnostic assessment, and simply
in terms of overall success rate for a pass/fail certification test.
Since we never fully believe the statistical model we are reasoning through, we
bear the responsibility of assessing model fit, in terms of both persons and items.
We must examine the ways and the extent to which the real data depart from the
patterns implied by the model, calling attention to failures of conditional independence -
places where our simplifying assumptions miss relationships that are surely
systematic, and possibly important, in the data. Finding substantial misfit causes us
to re-examine the arguments that tell us what to observe and how to evaluate it.
Classical Test Theory

This section illustrates the ideas from the preceding discussion in the context of
classical test theory (CTT). In CTT, the student model is represented as a
single continuous unobservable variable, the true score θ. The measurement
model simply tells us to think of an observed score X as the true score plus an
error term. If a CTT measurement model were used in the BEAR example, it
would address the sum of the student scores on a set of assessment tasks as the
observed score.
Figure 9 pictures the situation in a case that concerns Sue's (unobservable)
true score and her three observed scores on parallel forms of the same test; that
is, they are equivalent measures of the same construct, and have the same means
and variances. The probability distribution p(θ) expresses our belief about Sue's
θ before we observe her test scores, the Xs. The conditional distributions p(X | θ)3
indicate the probabilities of observing different values of X if θ took any given
particular value. Modeling the distribution of each X to depend on θ but not on the
other Xs is an instance of conditional independence; more formally, we write
p(X₁, X₂, X₃ | θ) = p(X₁ | θ) p(X₂ | θ) p(X₃ | θ). Under CTT we may obtain a form for
the p(X | θ)s by proposing that:
$$X = \theta + E, \qquad (1)$$

where the error E is taken as normally distributed with mean zero and standard deviation σ_E.
[Figure 9: Statistical Representation for Classical Test Theory. The prior distribution p(θ) for
Sue's true score is linked to the conditional distributions p(X₁ | θ), p(X₂ | θ), and p(X₃ | θ) for her
three observed scores.]
Bayes theorem then combines the prior distribution with the likelihood to yield a
posterior distribution, the machinery behind classical estimates of true scores
(Kelley, 1927). Suppose that from a large number of students like
Sue, we have estimated that the measurement error variance is σ_E² = 25, and for
the population of students, θ follows a normal distribution with a mean of 50 and
a standard deviation of 10. We now observe Sue's three scores, which take the
values 70, 75, and 85. We see that the posterior distribution for Sue's θ is a
normal distribution with mean 74.6 and standard deviation 2.8.
The additional backing that was used to bring the probability model into the
evidentiary argument was an analysis of data from students like Sue. Spearman's
(1904) seminal insight was that if their structure is set up in the right way5, it is
possible to estimate the quantities the model requires, such as the error variance,
from observed data alone.
Theorem. Let N(μ, σ) denote the normal (Gaussian) distribution with mean μ and standard
deviation σ. If the prior distribution of θ is N(μ₀, σ₀) and X is N(θ, σ_E), then the
distribution for θ posterior to observing X is N(μ_post, σ_post), where

$$\sigma_{post} = \left(\sigma_0^{-2} + \sigma_E^{-2}\right)^{-1/2} \quad \text{and} \quad \mu_{post} = \frac{\sigma_0^{-2}\mu_0 + \sigma_E^{-2}X}{\sigma_0^{-2} + \sigma_E^{-2}}. \qquad (2)$$
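To make the theorem concrete, here is a minimal sketch in Python (ours, not the chapter's) that applies Equation 2 sequentially to Sue's three scores, using the prior N(50, 10) and error standard deviation 5 from the example above; it reproduces the posterior mean of 74.6 and standard deviation of 2.8.

```python
def normal_update(mu0, sigma0, x, sigma_e):
    """One conjugate normal update (Equation 2): prior N(mu0, sigma0),
    observation x with error standard deviation sigma_e."""
    prec0, prec_e = sigma0 ** -2, sigma_e ** -2
    mu_post = (prec0 * mu0 + prec_e * x) / (prec0 + prec_e)
    sigma_post = (prec0 + prec_e) ** -0.5
    return mu_post, sigma_post

mu, sigma = 50.0, 10.0        # population prior for theta
sigma_e = 25 ** 0.5           # error SD implied by error variance 25
for x in (70, 75, 85):        # Sue's three parallel-form scores
    mu, sigma = normal_update(mu, sigma, x, sigma_e)

print(round(mu, 1), round(sigma, 1))   # 74.6 2.8
```

Because the posterior after one score serves as the prior for the next, the order of the updates does not matter; the same result follows from treating the three scores jointly.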
Because of its history, the very term "psychometrics" connotes a fusion of the
inferential logic underlying Spearman's reasoning with his psychology (trait
psychology, in particular intelligence, which he considered an inherited and stable
characteristic) and his data-gathering methods (many short, "objectively-scored,"
largely decontextualized tasks). The connection, while historically grounded, is
logically spurious, however. For the kinds of problems that CTT grew to solve
are not just Spearman's problems, but ones that ought to concern anybody who
is responsible for making decisions about students, evaluating the effects of
instruction, or spending scarce educational resources, whether or not Spearman's
psychology or methodology is relevant to the problem at hand.
And indeed, the course of development of test theory over the past century has
been to continually extend the range of problems to which this inferential approach
can be applied: to claims cast in terms of behavioral, cognitive, or situative
psychology; to data that may be embedded in context, require sophisticated
evaluations, or address multiple interrelated aspects of complex activities. We
will look at some of these developments in the next section. But it is at a higher
level of abstraction that psychometric principles are best understood, even
though, in practice, they are investigated with particular models and indices.
When it comes to examining psychometric properties, embedding the assess-
ment argument in a probability model offers several advantages. First, using the
calculus of probability-based reasoning, once we ascertain the values of the
variables in the data, we can express our beliefs about the likely values of the
student-model variables in terms of probability distributions - given that the model is
both generally credible and applicable to the case at hand. Second, the machinery
of probability-based reasoning is rich enough to handle many recurring
challenges in assessment, such as synthesizing information across multiple tasks,
characterizing the evidentiary importance of elements or assemblages of data,
assessing comparability across different bodies of evidence, and exploring the
implications of judgment, including different numbers and configurations of
raters. Third, global model-criticism techniques allow us not only to fit models to
data, but to determine where and how the data do not accord well with the models.
Substantive considerations suggest the structure of the evidentiary argument,
while statistical analyses of ensuing data through the lens of a mathematical model
help us assess whether the argument matches up with what we actually see in the
world. For instance, detecting an unexpected interaction between performance
and students' backgrounds can call the common argument into question.
Validity
Reliability
Reliability, historically, was used to quantify the amount of variation in test scores
that reflected "true" differences among students, as opposed to noise (Equation
2). Correlations between parallel test forms used in classical test theory are one
way to estimate reliability in this sense. Internal consistency among test items, as
gauged by the KR-20 formula (Kuder & Richardson, 1937) or Cronbach's (1951)
alpha coefficient, is another. A contemporary view sees reliability as the evi-
dentiary value that a given realized or prospective body of data would provide for
a claim. That is, reliability is not a universally applicable number or equation that
characterizes a test, but an index that quantifies how much information test data
contain about specified conjectures framed in terms of student-model variables
(e.g., an estimate for a given student, a comparison among students, or a
determination of whether a student has attained some criterion of performance).
A wide variety of specific indices or parameters can be used to characterize
evidentiary value. Carrying out a measurement procedure two or more times
with supposedly equivalent alternative tasks and raters will not only ground an
estimate of its accuracy, as in Spearman's original procedures, but will demon-
strate convincingly that there is some uncertainty to deal with in the first place
(Brennan, 2001). The KR-20 and Cronbach's alpha apply the idea of replication
to tests that consist of multiple items, by treating subsets of the items as repeated
measures. These CTT indices of reliability appropriately characterize the
amount of evidence for comparing students in a particular population with one
another. This is indeed often an important kind of inference. But these indices
do not indicate how much information the same scores contain about other
inferences, such as comparing students against a fixed standard, comparing
students in other populations, or estimating group performance for purposes of
evaluating schools or instructional programs.
Since reasoning about reliability takes place in the realm of the measurement
model (assuming that it is both correct and appropriate), it is possible to
approximate the evidentiary value of not only the data in hand, but the value of
similar data gathered in somewhat different ways. Under CTT, the Spearman-
Brown formula (Brown, 1910; Spearman, 1910) can be used to approximate the
reliability coefficient that would result from doubling the length of a test:
$$\rho_{double} = \frac{2\rho}{1+\rho}. \qquad (3)$$
If ρ is the reliability of the original test, then ρ_double is the reliability of an
otherwise comparable test with twice as many items. Empirical checks have shown that
these predictions can hold up quite well, but not if the additional items differ in
content or difficulty, or if the new test is long enough to fatigue students. In these
cases, the real-world counterparts of the modeled relationships are stretched so
far that the results of reasoning through the model fail.
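As a small added illustration (not in the original), the Spearman-Brown relationship generalizes from doubling to lengthening a test by any factor k, with Equation 3 as the case k = 2; a minimal sketch:

```python
def spearman_brown(rho, k=2.0):
    """Projected reliability of a test lengthened by factor k, assuming
    the added items are parallel to the originals (Equation 3 is k = 2)."""
    return k * rho / (1 + (k - 1) * rho)

print(round(spearman_brown(0.70), 2))         # doubling the test: 0.82
print(round(spearman_brown(0.70, k=0.5), 2))  # halving the test: 0.54
```

The caveats in the text apply directly: the projection holds only so far as the added items really are parallel and examinees are not fatigued.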
Extending this thinking to a wider range of inferences, generalizability theory
(g-theory) (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) permits predictions
of the accuracy of similar tests with different numbers and configurations of
raters, items, and so on. And once the parameters of tasks have been estimated
under an item response theory (IRT) model, one can even assemble tests item by
item for individual examinees on the fly, to maximize the accuracy with which each
is assessed. (Later we will point to some "how-to" references for g-theory and IRT.)
Typical measures of accuracy used in CTT do not adequately examine the
accuracy of decisions concerning criteria of performance discussed above. In the
CTT framework, classification accuracy is defined as the extent to which
classifications of students based on their observed test scores agree with those
based on their true scores (Traub & Rowley, 1980). One of the two commonly
used measures of classification accuracy is a simple measure of agreement, p₀,
defined as

$$p_0 = \sum_{l=1}^{L} p_{ll},$$

where p_ll represents the proportion of examinees who were classified into the
same proficiency level (l = 1, ..., L) according to their true score and observed
score. The second is Cohen's (1960) κ coefficient, which is similar to the
proportion agreement p₀, except that it is corrected for the agreement that is due
to chance. The coefficient is defined as

$$\kappa = \frac{p_0 - p_c}{1 - p_c},$$

where

$$p_c = \sum_{l=1}^{L} p_{l\cdot}\, p_{\cdot l}.$$
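As a sketch of how these indices are computed (the cross-classification proportions below are hypothetical, not from the chapter):

```python
import numpy as np

# Rows: proficiency level by true score; columns: level by observed score.
# Entries are proportions of examinees; the whole table sums to 1.
p = np.array([[0.15, 0.05, 0.00],
              [0.05, 0.30, 0.05],
              [0.00, 0.05, 0.35]])

p0 = np.trace(p)                             # simple agreement: sum of p_ll
pc = np.sum(p.sum(axis=1) * p.sum(axis=0))   # chance agreement: sum of p_l. * p_.l
kappa = (p0 - pc) / (1 - pc)                 # Cohen's kappa

print(round(p0, 2), round(pc, 2), round(kappa, 2))   # 0.8 0.36 0.69
```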
Comparability
Fairness
The tools of classical test theory have been continually extended and refined
since Spearman's time, to the extensive toolkit by Gulliksen (1950/1987), and to
the sophisticated theoretical framework of Lord and Novick (1968). Lord and
Novick aptly titled their volume Statistical Theories of Mental Test Scores,
underscoring their focus on the probabilistic reasoning aspects in the measurement-
model links of the assessment argument rather than the purposes, substantive
aspects, and evaluation rules that produce the data. Models that extend the same
fundamental reasoning for this portion of assessment arguments to wider varieties
of data and student models include generalizability theory, item response theory,
latent class models, and multivariate models.
Each of these extensions offers more options for characterizing students and
collecting data in a way that can be embedded in a probability model. The models
do not concern themselves directly with substantive aspects of an assessment
argument, but substantive considerations often have much to say about how one
should think about students' knowledge, and what observations should contain
evidence about it. The more measurement models that are available and the
more kinds of data that can be handled, the better assessors can match rigorous
models with their theories and needs. Models will bolster evidentiary arguments
(validity), extend quantitative indices of accuracy to more situations (reliability),
enable more flexibility in observational settings (comparability), and enhance the
518 Mislevy, Wilson, Ercikan and Chudowsky
prospects of detecting students whose data are at odds with the standard
argument (fairness).
Generalizability Theory
In a crossed design in which raters score each examinee's performance on
several item types, for example, the observed-score variance can be partitioned as

$$\sigma_X^2 = \sigma_p^2 + \sigma_i^2 + \sigma_r^2 + \sigma_e^2,$$

where σ_p², σ_i², σ_r², and σ_e² are variance components for examinees, item-types,
raters, and error, respectively.
The information resulting from a generalizability study can thus guide deci-
sions about how to design procedures for making observations; relating, for
example, to the way to assign raters to performances, the number of tasks and
raters, and whether to average across raters, tasks, etc. In the BEAR Assessment
Psychometric Principles in Student Assessment 519
System, a generalizability study could be carried out to see which type of assess-
ment, embedded tasks or link items, resulted in more reliable scores. It could
also be used to examine whether teachers were as consistent as external raters.
G-theory offers two important practical advantages over CTT. First, gen-
eralizability models allow us to characterize how the particulars of the evaluation
rules and task model variables affect the value of the evidence we gain about the
student for various inferences. Second, this information is expressed in terms
that allow us to project these evidentiary-value considerations to designs we have
not actually used, but which could be constructed from elements similar to the
ones we have observed. G-theory thus provides far-reaching extensions of the
Spearman-Brown formula (Equation 3), for exploring issues of reliability and
comparability in a broader array of data-collection designs than CTT can.
Generalizability theory was developed by Lee Cronbach and his colleagues,
and their monograph The Dependability of Behavioral Measurements (Cronbach
et al., 1972) remains a valuable source of information and insight. More recent
sources such as Shavelson and Webb (1991) and Brennan (1983) provide the
practitioner with friendlier notation and examples to build on.
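To suggest what such a projection looks like in practice, here is a rough sketch using the components named above (examinees, item-types, raters, error); the component values are hypothetical, and the formula assumes a fully crossed design in which these are the only sources of error:

```python
def dependability(var_p, var_i, var_r, var_e, n_items, n_raters):
    """Project a dependability coefficient for absolute decisions:
    error components shrink as they are averaged over more items and raters."""
    abs_error = var_i / n_items + var_r / n_raters + var_e / (n_items * n_raters)
    return var_p / (var_p + abs_error)

# Hypothetical variance component estimates from a G-study:
comps = dict(var_p=0.50, var_i=0.10, var_r=0.05, var_e=0.20)

for n_items, n_raters in [(4, 1), (8, 1), (8, 2)]:
    phi = dependability(**comps, n_items=n_items, n_raters=n_raters)
    print(n_items, n_raters, round(phi, 2))
```

Runs of this kind answer design questions directly: moving from four tasks and one rater to eight tasks and two raters raises the projected coefficient from about .80 to about .91 under these assumed components.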
Item Response Theory

Under the Rasch model for right/wrong items, for example, the probability of a
correct response is

$$P(X_{ij} = 1 \mid \theta_i, \beta_j) = \Psi(\theta_i - \beta_j), \qquad (5)$$

where X_ij is the response of Student i to Item j, 1 if right and 0 if wrong; θ_i is the
proficiency parameter of Student i; β_j is the difficulty parameter of Item j; and
Ψ(·) is the logistic function, Ψ(x) = exp(x)/[1 + exp(x)]. The probability of an
incorrect response is then

$$P(X_{ij} = 0 \mid \theta_i, \beta_j) = 1 - \Psi(\theta_i - \beta_j). \qquad (6)$$
Taken together, Equations 5 and 6 specify a particular form for the item response
function, Equation 4. Figure 11 depicts Rasch item response curves for two
items, Item 1 an easy one, with β₁ = -1, and Item 2 a hard one, with β₂ = 2. It
shows the probability of a correct response to each of the items for different
values of θ. For both items, the probability of a correct response increases toward
one as θ increases. Conditional independence means that for a given value of θ,
the probability of Student i making responses X_i1 and X_i2 to the two items is the
product of terms like Equations 5 and 6:

$$P(X_{i1}, X_{i2} \mid \theta) = P(X_{i1} \mid \theta)\, P(X_{i2} \mid \theta). \qquad (7)$$
All this is reasoning from the model and given parameters to probabilities of not-
yet-observed responses; as such, it is part of the warrant in the assessment
argument, to be backed by empirical estimates and model criticism. In applica-
tions we need to reason in the reverse direction. Item parameters will have been
estimated and responses observed, and we need to reason from an examinee's Xs
to the value of θ. Equation 7 is then calculated as a function of θ with X₁ and X₂
fixed at their observed values; this is the likelihood function. Figure 12 shows the
likelihood function for one such response pattern.
[Figure 11: Rasch item response curves for the two items, Item 1 with β₁ = -1 and Item 2 with
β₂ = 2, plotting the probability of a correct response against θ.]
[Figure 12: The likelihood function for θ induced by the observed responses, with its maximum
likelihood estimate at θ = -.75.]

[Figure 13: An IRT Test Information Curve for the Two-Item Example, showing the information
contributed by Item 1 and Item 2 and the total test information function across θ.]

Concerning Validity
Model checking lets us see how well the common argument holds up, and
where it breaks down, now item by item, student by student. The IRT model does
not address the substance of the tasks, but by highlighting tasks that are
operating differently than others, or proving harder or easier than expected, it
helps test designers improve their work.
Concerning Reliability
Once item parameters have been estimated, a researcher can gauge the precision
of measurement that would result from different configurations of tasks. Precision
of estimation can be gauged uniquely for any matchup between a person and a
set of items. We are no longer bound to measures of reliability that are tied to
specific populations and fixed test forms.
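The sketch below (ours, with a hypothetical response pattern of Item 1 right and Item 2 wrong) pulls Equations 5 through 7 together and adds the information calculation behind Figure 13: under the Rasch model, an item's information at θ is P(θ)[1 - P(θ)], test information is the sum over items, and the standard error of estimation is the reciprocal of its square root.

```python
import numpy as np

def psi(x):
    """Logistic function from Equation 5."""
    return np.exp(x) / (1 + np.exp(x))

betas = np.array([-1.0, 2.0])        # item difficulties from Figure 11
theta = np.linspace(-5, 5, 201)      # grid of proficiency values

p = psi(theta[:, None] - betas[None, :])   # P(correct) per theta, per item (Eq. 5)

x = np.array([1, 0])                       # hypothetical responses: right, wrong
likelihood = np.prod(np.where(x == 1, p, 1 - p), axis=1)  # Eq. 7 as a function of theta
mle = theta[np.argmax(likelihood)]         # maximum likelihood estimate on the grid

item_info = p * (1 - p)                    # Rasch item information
test_info = item_info.sum(axis=1)          # the test information curve of Figure 13
se = 1 / np.sqrt(test_info)                # standard error of estimation at each theta

i0 = np.argmin(np.abs(theta))              # index of theta = 0
print(round(float(mle), 2), round(float(se[i0]), 2))
```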
Concerning Comparability
IRT offers strategies beyond the reach of CTT and g-theory for assembling tests
that "measure the same thing." These strategies capitalize on the above-men-
tioned capability to predetermine the precision of estimation from different sets
of items at different levels of θ. Tests that provide optimal measurement for mastery
decisions can be designed, for example, or tests that provide targeted amounts of
precision at specified levels of proficiency (van der Linden, 1998). Large content
domains can be covered in educational surveys by giving each student only a
sample of the tasks, yet using IRT to map all performances onto the same scale.
The National Assessment of Educational Progress, for example, has made good
use of the efficiencies of this item sampling in conjunction with reporting based
on IRT (Messick, Beaton, & Lord, 1983). Tests can even be assembled on the fly
in light of a student's previous responses as assessment proceeds, a technique
Psychometric Principles in Student Assessment 523
called adaptive testing (see Wainer et al., 2000, for practical advice on construct-
ing computerized adaptive tests).
Concerning Fairness
Differential item functioning (DIF) analysis, based on IRT and related methods,
has enabled both researchers and large-scale assessors to routinely and
rigorously test for a particular kind of unfairness (e.g., Holland & Thayer, 1988;
Lord, 1980). The idea is that a test score, such as a number correct or an IRT θ
estimate, is a summary over a large number of item responses, and a comparison
of students at the level of scores implies that they are similarly comparable across
the domain being assessed. But what if some items are systematically harder for
students from a group defined by cultural or educational background, for reasons
that are not related to the knowledge or skill that is meant to be measured? This
is DIF, and it can be formally represented in a model containing interaction
terms for items by groups by overall proficiency, an interaction whose presence
can threaten score meaning and distort comparisons across groups. A finding of
significant DIF can imply that the observation framework needs to be modified
or, if the DIF is common to many items, that the construct-representation argu-
ment is oversimplified.
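A minimal sketch of the Mantel-Haenszel approach of Holland and Thayer (1988) may make the logic plain: examinees are stratified by total score, and within each stratum the odds of success on the studied item are compared between the reference and focal groups. The counts below are hypothetical.

```python
import math

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio. Each stratum is a tuple
    (a, b, c, d): reference right, reference wrong, focal right, focal wrong."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

strata = [(30, 20, 25, 25),   # low total scores
          (50, 15, 40, 25),   # middle total scores
          (60, 5, 50, 15)]    # high total scores

alpha = mh_odds_ratio(strata)
mh_d_dif = -2.35 * math.log(alpha)   # ETS delta-scale DIF index
print(round(alpha, 2), round(mh_d_dif, 2))
```

A common odds ratio near 1 (an index near 0) indicates no DIF; here the negative index would flag an item relatively harder for the focal group, inviting the substantive review described above.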
DIF methods have been used in examining differential response patterns for
gender and ethnic groups for the last two decades and for language groups more
recently. They are now being used to investigate whether different groups of
examinees of approximately the same ability appear to be using differing cog-
nitive processes to respond to test items. Such uses include examining whether
differential difficulty levels are due to differential cognitive processes, language
differences (Ercikan, 1998), solution strategies and instructional methods (Lane,
Wang, & Magone, 1996), and skills required by the test that are not uniformly
distributed across examinees (O'Neil & McPeek, 1993).
Extensions of IRT
We have just seen how IRT extends statistical modeling beyond the constraints
of classical test theory and generalizability theory. The simple elements in the
basic equation of IRT (Equation 4) can be elaborated in several ways, each time
expanding the range of assessment situations to which probability-based
reasoning can be applied in the pursuit of psychometric principles.
Multiple-Category Responses
The basic dichotomous model can be extended to items scored in multiple ordered
categories. This is particularly useful for performance assessment tasks that are
evaluated by raters on, say, 0-5 scales. Samejima (1969) carried out pioneering
work in this area. Thissen and Steinberg (1986) explain the mathematics of the
extension and provide a useful taxonomy of multiple-category IRT models, and
Wright and Masters (1982) offer a readable introduction to their use.
Rater Models
Conditional Dependence
Standard IRT assumes that responses to different items are independent once
we know the item parameters and examinee's O. This is not strictly true when
several items concern the same stimulus, as in paragraph comprehension tests.
Knowledge of the content tends to improve performance on all items in the set,
while misunderstandings tend to depress all, in ways that do not affect items
from other sets. Ignoring these dependencies leads one to overestimate the
information in the responses. The problem is more pronounced in complex tasks
when responses to one subtask depend on results from an earlier subtask or
when multiple ratings of different aspects of the same performance are obtained.
Wainer and his colleagues (e.g., Bradlow, Wainer, & Wang, 1999; Wainer &
Kiely, 1987) have studied conditional dependence in the context of IRT. This line
of work is particularly important for tasks in which several aspects of the same
complex performance must be evaluated (Yen, 1993).
Embretson (1983) not only argued for paying greater attention to construct
representation in test design, she argued for how to do it: Incorporate task model
variables into the statistical model, and make explicit the ways that features of
tasks impact on examinees' performance. A signal article in this regard was
Fischer's (1973) linear logistic test model (LLTM). The LLTM is a simple extension
of the Rasch model shown above in Equation 5, with the further requirement
that each item difficulty parameter β_j is the sum of effects that depend on the
features of that particular item:
$$\beta_j = \sum_{k=1}^{K} q_{jk}\,\eta_k,$$

where η_k is the contribution to item difficulty from Feature k, and q_jk is the extent
to which Feature k is represented in Item j. Some of the substantive consider-
ations that drive task design can thus be embedded in the statistical model, and
the tools of probability-based reasoning are available to examine how well they
hold up in practice (validity), how they affect measurement precision (reliability),
how they can be varied while maintaining a focus on targeted knowledge
(comparability), and whether some items prove hard or easy for unintended
reasons (fairness). Embretson (1998) walks through a detailed example of test
design, psychometric modeling, and construct validation from this point of view.
Additional contributions along these lines can be found in the work of Tatsuoka
(1990), Falmagne and Doignon (1988), Pirolli and Wilson (1998), and DiBello,
Stout, and Roussos (1995).
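A tiny sketch of the LLTM decomposition (the feature effects and Q-matrix below are hypothetical): item difficulties are built as weighted sums of feature effects, and would then enter the Rasch model of Equation 5.

```python
import numpy as np

# Q[j, k] = extent to which Feature k figures in Item j.
Q = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]])
eta = np.array([-0.5, 0.8, 1.2])   # hypothetical contributions of each feature to difficulty

beta = Q @ eta                     # item difficulties implied by the task features
print(beta.round(2))               # [-0.5  0.3  2.   1.5]
```

Fitting the model means estimating the feature effects from response data; comparing the constrained difficulties with freely estimated ones tests how well the construct-representation argument holds.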
Latent Class Models

Research on learning suggests that knowledge and skills in some domains could
be characterized as discrete states (e.g., the "production rule" models John
Anderson uses in his intelligent tutoring systems [Anderson, Boyle, & Corbett,
1990]). Latent class models characterize an examinee as a member of one of a
number of classes, rather than as a position on a continuous scale (Dayton, 1999;
Haertel, 1989; Lazarsfeld, 1950). The classes themselves can be considered ordered
or unordered. The key idea is that students in different classes have different
probabilities of responding in designated ways in assessment settings, depending
on their values on the knowledge and skill variables that define the classes. When
this is what theory suggests and purpose requires, using a latent class model offers
the possibility of a more valid interpretation of assessment data. The probability-
based framework of latent class modeling again enables us to rigorously test this
hypothesis, and to characterize the accuracy with which observed responses
identify students with classes. Reliability in latent class models is therefore
expressed in terms of correct classification rates.
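A minimal sketch of reverse reasoning in a latent class model, with two hypothetical classes and four items: Bayes theorem converts the class proportions and class-conditional response probabilities into a posterior probability of class membership.

```python
import numpy as np

# Probability of a correct answer to each of four items, by class (hypothetical).
p_correct = np.array([[0.90, 0.80, 0.85, 0.70],   # "masters"
                      [0.30, 0.20, 0.40, 0.10]])  # "nonmasters"
prior = np.array([0.4, 0.6])                      # class proportions

x = np.array([1, 1, 0, 1])                        # one student's responses

# Likelihood of the response pattern within each class
# (conditional independence given class membership).
lik = np.prod(np.where(x == 1, p_correct, 1 - p_correct), axis=1)

posterior = prior * lik / np.sum(prior * lik)     # Bayes theorem
print(posterior.round(3))                         # e.g., [0.933 0.067]
```

The correct-classification rates mentioned above can be estimated by simulating response patterns from each class and tallying how often the posterior identifies the generating class.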
CONCLUSION
ENDNOTES
1 This work draws in part on the authors' work on the National Research Council's Committee on
the Foundations of Assessment. The first author received support under the Educational Research
and Development Centers Program, PR/Award Number R305B60002, as administered by the
Office of Educational Research and Improvement, U.S. Department of Education. The second
author received support from the National Science Foundation under grant No. ESI-9910154. The
findings and opinions expressed in this report do not reflect the positions or policies of the National
Research Council, the National Institute on Student Achievement, Curriculum, and Assessment,
the Office of Educational Research and Improvement, the National Science Foundation, or the
U.S. Department of Education.
2 We are indebted to Professor David Schum for our understanding, such as it is, of evidentiary
reasoning. The first part of this section draws on Schum (1987, 1994) and Kadane and Schum (1996).
3 p(X | θ) is the probability density function for the random variable X, given that θ is fixed at a
specified value.
4 Strictly speaking, CTT does not address the full distributions of true and observed scores, only
means, variances, and covariances. But we want to illustrate probability-based reasoning and
review CTT at the same time. Assuming normality for θ and E is the easiest way to do this, since
the first two moments are sufficient for normal distributions.
5 In statistical terms, if the parameters are identified. Conditional independence is key, because CI
relationships enable us to make multiple observations that are assumed to depend on the same
unobserved variables in ways we can model. This generalizes the concept of replication that
grounds reliability analysis.
6 See Greeno, Collins, & Resnick (1996) for an overview of these three perspectives on learning and
knowing, and discussion of their implications for instruction and assessment.
7 Knowing What Students Know (National Research Council, 2001), a report by the Committee on the
Foundations of Assessment, surveys these developments.
REFERENCES
Adams, R., Wilson, M.R., & Wang, W.-C. (1997). The multidimensional random coefficients
multinomial logit model. Applied Psychological Measurement, 21, 1-23.
Almond, R.G., & Mislevy, R.J. (1999). Graphical models and computerized adaptive testing. Applied
Psychological Measurement, 23, 223-237.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and psychological testing.
Washington, DC: American Educational Research Association.
Anderson, J.R., Boyle, C.F., & Corbett, A.T. (1990). Cognitive modeling and intelligent tutoring.
Artificial Intelligence, 42, 7-49.
Bennett, R.E. (2001). How the internet will help large-scale assessment reinvent itself. Education
Policy Analysis Archives, 9(5). Retrieved from http://epaa.asu.edu/epaa/v9n5.html
Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets.
Psychometrika, 64, 153-168.
Brennan, R.L. (1983). The elements of generalizability theory. Iowa City, IA: American College Testing
Program.
Brennan, R.L. (2001). An essay on the history and future of reliability from the perspective of
replications. Journal of Educational Measurement, 38(4), 295-317.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of
Psychology, 3, 296-322.
Bryk, A.S., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis
methods. Newbury Park: Sage.
Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-
multimethod matrix. Psychological Bulletin, 56, 81-105.
Cohen, J.A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement, 20, 37-46.
Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L.J. (1989). Construct validation after thirty years. In R.L. Linn (Ed.), Intelligence:
Measurement, theory, and public policy (pp. 147-171). Urbana, IL: University of Illinois Press.
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin,
52, 281-302.
Dayton, C.M. (1999). Latent class scaling analysis. Thousand Oaks, CA: Sage.
DiBello, L.V., Stout, W.F., & Roussos, L.A. (1995). Unified cognitive/psychometric diagnostic
assessment likelihood-based classification techniques. In P. Nichols, S. Chipman, & R. Brennan
(Eds.), Cognitively diagnostic assessment (pp. 361-389). Hillsdale, NJ: Erlbaum.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span.
Psychological Bulletin, 93, 179-197.
Embretson, S.E. (1998). A cognitive design systems approach to generating valid tests: Application
to abstract reasoning. Psychological Methods, 3, 380-396.
Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational
Research, 29, 543-553.
Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to profi-
ciency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269-294.
Falmagne, J.-C., & Doignon, J.-P. (1988). A class of stochastic procedures for the assessment of
knowledge. British Journal of Mathematical and Statistical Psychology, 41, 1-23.
Fischer, G.H. (1973). The linear logistic test model as an instrument in educational research. Acta
Psychologica, 37, 359-374.
Gelman, A., Carlin, J., Stern, H., & Rubin, D.B. (1995). Bayesian data analysis. London: Chapman &
Hall.
Greeno, J.G., Collins, A.M., & Resnick, L.B. (1996). Cognition and learning. In D.C. Berliner, &
R.C. Calfee (Eds.), Handbook of educational psychology (pp. 15-46). New York: Macmillan.
Gulliksen, H. (1950/1987). Theory of mental tests. New York: John Wiley/Hillsdale, NJ: Lawrence
Erlbaum.
Haertel, E.H. (1989). Using restricted latent class models to map the skill structure of achievement
test items. Journal of Educational Measurement, 26, 301-321.
Haertel, E.H., & Wiley, D.E. (1993). Representations of ability structures: Implications for testing.
In N. Frederiksen, R.J. Mislevy, & I.I. Bejar (Eds.), Test theory for a new generation of tests.
Hillsdale, NJ: Lawrence Erlbaum.
Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn
(Ed.), Educational measurement (3rd ed.) (pp. 147-200). Phoenix, AZ: American Council on
Education/Oryx Press.
Hambleton, R.K., & Slater, S.C. (1997). Reliability of credentialing examinations and the impact of
scoring models and standard-setting policies. Applied Measurement in Education, 10, 19-39.
Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel
procedure. In H. Wainer, & H.I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ:
Lawrence Erlbaum.
Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
Jöreskog, K.G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models.
Cambridge, MA: Abt Books.
Kadane, J.B., & Schum, D.A. (1996).Aprobabilistic analysis of the Sacco and Vanzetti evidence. New
York: Wiley.
Kane, M.T. (1992). An argument-based approach to validity. Psychological Bulletin, 112,
527-535.
Kelley, T.L. (1927). Interpretation of educational measurements. New York: World Book.
Kuder, G.F., & Richardson, M.W. (1937). The theory of estimation of test reliability. Psychometrika,
2, 151-160.
Lane, w., Wang, N., & Magone, M. (1996). Gender-related differential item functioning on a middle-
school mathematics performance assessment. Educational Measurement: Issues and Practice, 15(4),
21-27,31.
Lazarsfeld, P.F. (1950). The logical and mathematical foundation of latent structure analysis. In S.A.
Stouffer, L. Guttman, E.A Suchman, P.R. Lazarsfeld, S.A. Star, & J.A Clausen (Eds.),
Measurement and prediction (pp. 362-412). Princeton, NJ: Princeton University Press.
Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating
studies. British Journal of Mathematical and Statistical Psychology, 35, 42-56.
Linacre, J.M. (1989). Many faceted Rasch measurement. Doctoral dissertation, University of Chicago.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ:
Lawrence Erlbaum.
Lord, R.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-
Wesley.
Martin, J.D., & VauLehn, K. (1995). A Bayesian approach to cognitive assessment. In P. Nichols, S.
Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 141-165). Hillsdale, NJ:
Erlbaum.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13-103).
New York: American Council on EducationlMacmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance
assl!ssments. Education Researcher, 32, 13-23.
Messick, S., Beaton, AE., & Lord, F.M. (1983). National Assessment of Educational Progress
reconsidered: A new design for a new era. NAEP Report 83-I. Princeton, NJ: National Assessment
for Educational Progress. ,
Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives.
Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the roles of task model variables in assessment design. In S. Irvine, & P. Kyllonen (Eds.), Generating items for cognitive tests: Theory and practice. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G., & Penuel, W. (in press). Leverage points
for improving educational assessment. In B. Means, & G. Haertel (Eds.), Evaluating the effects of
technology in education. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers in Human Behavior, 15, 335-374.
Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (in press). Making sense of data from complex assessments. Applied Measurement in Education.
Myford, C.M., & Mislevy, R.J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.
National Research Council (1999). How people learn: Brain, mind, experience, and school. Committee on Developments in the Science of Learning. Bransford, J.D., Brown, A.L., & Cocking, R.R. (Eds.). Washington, DC: National Academy Press.
National Research Council (2001). Knowing what students know: The science and design of educational
assessment. Committee on the Foundations of Assessment. Pellegrino, J., Chudowsky, N., &
Glaser, R. (Eds.). Washington, DC: National Academy Press.
O'Neill, K.A., & McPeek, W.M. (1993). Item and test characteristics that are associated with differential item functioning. In P.W. Holland, & H. Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Erlbaum.
Patz, R.J., & Junker, B.W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342-366.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 221-262). New York: American Council on Education/Macmillan.
Pirolli, P., & Wilson, M. (1998). A theory of the measurement of knowledge content, access, and learning. Psychological Review, 105, 58-82.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research/Chicago: University of Chicago Press (reprint).
Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.
Rogosa, D.R., & Ghandour, G.A. (1991). Statistical models for behavioral observations (with discussion). Journal of Educational Statistics, 16, 157-252.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34 (No. 4, Part 2).
Samejima, F. (1973). Homogeneous case of the continuous response level. Psychometrika, 38, 203-219.
Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.
Schum, D.A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.
SEPUP (1995). Issues, evidence, and you: Teacher's guide. Berkeley: Lawrence Hall of Science.
Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3, 271-295.
Spiegelhalter, D.J., Thomas, A., Best, N.G., & Gilks, W.R. (1995). BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.
Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M.G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453-488). Hillsdale, NJ: Erlbaum.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.
Toulmin, S. (1958). The uses of argument. Cambridge, England: Cambridge University Press.
Traub, R.E., & Rowley, G.L. (1980). Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517-545.
van der Linden, W.J. (1998). Optimal test assembly. Applied Psychological Measurement, 22, 195-202.
van der Linden, W.J., & Hambleton, R.K. (1997). Handbook of modern item response theory. New York: Springer.
Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., & Kiely, G.L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 195-201.
Wiley, D.E. (1991). Test validity and invalidity reconsidered. In R.E. Snow, & D.E. Wiley (Eds.), Improving inquiry in social science (pp. 75-107). Hillsdale, NJ: Erlbaum.
Willingham, W.W., & Cole, N.S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.
Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13, 181-208.
Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. In G. Grant (Ed.), Review of Research in Education, Vol. 17 (pp. 31-74). Washington, DC: American Educational Research Association.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Yen, W.M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.
25
Classroom Student Evaluation
PETER W. AIRASIAN
Boston College, School of Education, MA, USA
LISA M. ABRAMS
Boston College, Center for the Study of Testing, Evaluation and Educational Policy, MA, USA
Four classroom realities which define the richness and complexity of classrooms
and teachers' lives provide a useful context for considering the nature and
dimensions of classroom evaluation. First, classrooms are both social and
academic environments. To survive and succeed, teachers need to know their
students' social, personal, linguistic, cultural, and emotional characteristics, as
well as their academic ones, since decisions about managing, monitoring, and
planning require knowledge of a broad range of student characteristics. Second,
the classroom is an ad hoc, informal, person-centered, and continually changing
environment that calls for constant teacher decision-making about classroom
pace, instruction, learning, and the like (Doyle, 1986). Third, many of the
decisions that confront teachers are immediate, practical ones that focus on
particular students and particular contexts. Fourth, teachers are both evaluators
and participants in the classroom society, both of which can influence their
objectivity in planning, implementing, and applying classroom evaluations
(Airasian, 2001).
The general, overriding purpose of evaluation in the classroom is to help
teachers make decisions in the complex social and instructional context in which
it takes place (see AFT/NCME/NEA, 1990). In this chapter, five areas of class-
room evaluation are described which encompass evaluations that take place
before, during, and after instruction, and represent the vast majority of
evaluations that a teacher will carry out. The areas are: evaluation in establishing
and maintaining social equilibrium in the classroom; evaluation in planning
instruction; evaluation during instruction; evaluation of assessments and tests;
and evaluation for judging and grading academic learning.
A classroom is more than a group of students who are together in the same place.
It is a complex social situation characterized by rules and routines that foster
order, discipline, civility, and communication. At the start of each school year,
teachers learn about their new students by gathering information and making
judgments about their individual cognitive, affective, and psychomotor character-
istics. These initial evaluations are necessary and important in establishing
classroom order, routines, and structure (Cohen, 1972; First & Mizell, 1980;
Henry, 1971; Jackson, 1990; Waller, 1967).
Teachers will seek answers to a variety of questions about their students. Will
they get along? Will they be cooperative? Are they academically ready for this
grade or subject? What intellectual and emotional strengths and weaknesses do
they have? What will interest them? How long are they likely to stay focused on
tasks? What is the best way to organize the classroom to promote learning?
Usually by the second or third week of school, teachers will have collected and
evaluated information about their class that enables them to provide a viable
social and academic class characterized by rules, routines, and common under-
standings (Boostrom, 1991; Dreeben, 1967; Henry, 1971).
Teachers rely on many sources of information to learn about their students.
These may include comments of prior teachers, special education teachers, or
parents; comments in the teachers' room; official school records; the performance
of prior siblings; formal and informal classroom observations; attitude inventory
scores; and conversations with the students themselves. Teachers will also focus
on what students say, do, and write. While the student characteristics that teachers
focus on will vary by grade level, all strive to learn about and evaluate their
students' characteristics early in the school year. The knowledge they acquire will
form a set of perceptions and expectations that influence the manner in which
they establish rules, plan for instruction, interact with students, carry out
instruction, and interpret and judge student performance (McCaslin & Good,
1996).
A great deal of research suggests that teachers' expectations, defined as "the
inferences that teachers make about the future behavior or academic achieve-
ment of their students, based on what they know about these students now"
(Good & Brophy, 1991, p. 110) can have both positive and negative effects on
teacher-student interactions and student outcomes (Beez, 1968; Brophy, 1983;
Cooper & Good, 1983; Dusek, 1985; Eden & Shani, 1982; Jones, 1986; Jussim,
1986; Kellaghan, Madaus & Airasian, 1982; Rosenthal & Jacobson, 1968; Schrank,
1968; Simon & Willcocks, 1981). The expectation effect, commonly referred to
as the "self-fulfilling prophecy," was popularized by Robert Rosenthal and
Lenore Jacobson (1968). The notion that false perceptions can lead to behavior
that over time causes the initial expectation to come true can be especially
damaging to students perceived to be low achieving. Furthermore, teachers'
standards for academic success are often rooted in their own social or cultural
background (Rist, 1977). The more congruent a student's background characteristics
are with those of the teacher, the more likely the teacher is to establish
high expectations for that student. It is vitally important that teachers base their
expectations on accurate perceptions of students while maintaining a high
degree of awareness of how their own characteristics may influence classroom
evaluations.
Much of the process of student and class evaluation is based on informal,
observational, and idiosyncratic information. There are two major threats to its
validity: stereotyping and logical errors. Stereotyping occurs when a teacher's
beliefs or personal prejudices influence her/his evaluations (Steele, 1997).
Judgments such as "girls are not good at science" and "students who come from
that area are poorly prepared for learning" are stereotypes. Stereotyping becomes
a particular concern when students with unique cultural, racial, language, or
disability characteristics are being evaluated. Too often, such characteristics are
erroneously judged and interpreted by teachers as student deficits rather than as
student differences and assets (Delpit, 1995; Ladson-Billings, 1994). Logical
errors occur when a teacher selects wrong or inappropriate evidence to evaluate
a student's performance. For example, judging a student's ability in math by
observing his/her attention in class or the neatness of his/her writing is a logical
error.
The perspectives that teachers develop early in the school year can also create
reliability problems. Informal, often fleeting, information gathered about students
leads to subjective and idiosyncratic evaluation that later information might well
change. Beginning-of-year perceptions are probably best regarded as hypotheses
to be confirmed by additional observations or more formal information, which
might be obtained from diagnostic tests or by having the student read aloud to
judge his/her facility. The use of a variety of assessments will help teachers to
confirm or reject initial perceptions. Strategies that teachers can use include:
drawing on a variety of data sources to learn about students; not relying solely
on informal evidence; observing a behavior or activity two or three times before
associating it with a student; not judging students' needs and abilities prematurely
(how students act in the first few days of school may not represent their typical
performance); and observing student performance in structured activities such
as spelling or math games.
While discussed separately here, it is important to bear in mind that planning and
delivering instruction are integrally related. The instructional process constantly
cycles from planning to delivery, to revision of planning, to delivery, and so on.
The link between the two processes should be logical, ongoing, and natural. Both
are necessary for successful teaching and evaluation.
However, a teacher's evaluation activities when stating objectives and planning
instruction are very different from those when actually teaching. The most
obvious difference is in the time at which evaluation occurs. Planning evaluations
are developed during quiet time, when the teacher can reflect and try to identify
appropriate objectives, content topics, and activities for instruction. Evaluations
during instruction, on the other hand, take place on the firing line during teaching,
and require quick decisions to keep instruction flowing smoothly. The fast pace
at which evaluation occurs during instruction requires the teacher to understand
the "language of the classroom" or to interpret informal student cues such as
attention, facial expressions, and posture (Jackson, 1968). A more formal and
important evaluation during instruction is questioning by both students and teacher.
Once instruction begins, teachers have a two-fold task: to teach the planned
lesson and to constantly evaluate the progress of instruction, modifying it if
necessary (Arends, 1997). To carry out these tasks, teachers "read" their
students minute by minute, sense their mood, and make decisions about how to
maintain or alter instruction. They attend to student answers, detect their
attention or confusion, choose who to call on to answer a question, judge
individual and class participation, and watch for possible misbehavior. During
these activities, they need to maintain a viable instructional pace and judge the
quality of each answer. When students are divided into small groups, the number
of events to observe, monitor, and regulate increases (Doyle, 1986).
During instruction, teachers seek information to help judge a variety of
factors. These include the interest level of individual students and of the class as
a whole; students' retention of previously learned content; apparent or potential
behavior problems; the appropriateness of the instructional technique or activity
employed; the most appropriate student to call on next to answer a question; the
appropriateness of the answer a student provides; pace of instruction; the
amount of time being spent on the lesson and its activities; the usefulness and
consequences of students' questions; the smoothness of transition from one
concept to another, and from one activity to another; the usefulness of examples
used to explain concepts; the degree of comprehension on the part of individual
students and of the class as a whole; and the desirability of starting or ending a
particular activity (Airasian, 2001). Information gathered about these factors
leads to decisions about other activities. For example, depending upon the
appropriateness of the answer a student gives, the teacher may try to draw out a
better answer from the student; seek another student to provide an answer; or
ask the next logical question in the content sequence.
Three sources threaten the validity of teachers' classroom evaluations during
instruction: lack of objectivity when observing and judging the quality of the
instructional process; the limited range of indicators used to make decisions
about instruction and student learning; and the difficulty of assessing and
addressing students' individual differences. Being a participant in the instruc-
tional process can make it difficult for a teacher to be a dispassionate, detached
observer about learning and instruction. Teachers have a stake in the success of
their instruction and derive their primary rewards from it. They also have a
strong personal and professional investment in the success of their work.
Furthermore, they frequently gauge their instructional success on the basis of
indicators that focus on the process of instruction rather than on student
learning outcomes. While teachers are correct in focusing some attention on
process factors, student involvement and attention are only intermediate goals
that it is hoped, but not guaranteed, will result in student learning.
One of the difficulties in collecting reliable evidence about the instructional
process stems from the fact that the teacher's opportunity to corroborate
instructional observations and interpretations is limited. The briefer the
opportunity to observe instruction, the less confidence a teacher can have in the
reliability of observations. Thus, an important issue in assessing the reliability of
instructional evaluations is the difficulty of obtaining an adequate sampling of
student reactions and understandings, and especially their configuration over
time.
One strategy to overcome this problem is to use questioning during instruction.
Questioning students can serve to promote attention, encourage learning from
peers, deepen understanding, provide reinforcement, control the pace and
content of instruction, and yield diagnostic information. There are a number of
types of questions that teachers can use, including open-ended ones (what is your
reaction to the story?), diagnostic questions (what is the theme of this poem?),
sequence questions (order these American presidents from earliest to latest),
prediction questions (what do you think will happen to the red color when it is
mixed with a green color?), and extension or follow-up questions (explain to me
why you think that?). Questions can target lower-level skills requiring students
to recall factual information, or they can involve higher-level skills when students
are asked to predict, analyze, explain, or interpret. By questioning a number of
students, teachers can improve the reliability and validity of their evaluations
during instruction.
Strategies that can improve classroom questioning, minimize student anxiety,
and inform student evaluation include: asking questions that are related to
lesson objectives; avoiding overly vague questions; involving the entire class in
the questioning process; being aware of patterns when calling on students;
avoiding calling on the same students repeatedly; providing five seconds of "wait
time" for students to form their responses; probing students who answer "yes" or
"no" with follow-up questions asking "why?" or "can you give another example?";
bearing in mind that questioning is a social process in which students should be
treated with encouragement and respect; and recognizing that good questioning
also involves good listening and responding by the teacher (Morgan & Saxton,
1991).
• Traditional paper-and-pencil tests, which can include: (i) Selection items, which
require students to choose the best answer from among several options.
Multiple-choice, true or false, and matching are frequently used formats for
selection items. (ii) Supply items, which require students to write in or supply
the answer, and include short answer, sentence completion, and essays. (iii)
Interpretive exercises, which can be used to assess higher-level skills. Students
are presented with material such as a map, political cartoon, poem, chart,
graph, or data table, and are required to answer questions by analyzing or
interpreting it. This item type can be formatted so that answer choices are
provided (selection item) or students write in their responses (supply item).
• Performance assessments, in which students carry out an activity or create a
work product to demonstrate their learning, and which include activities such
as projects, simulations, lab experiments, music performances, and presentations.
This type of assessment may be called "authentic" since students are
showing what they can do in real situations (Wiggins, 1992). Teachers usually
develop checklists, rating scales, or descriptive rubrics for scoring (a simple
rubric is sketched after this list).
• Portfolio assessments are a unique type of performance assessment. Portfolios
involve a systematic collection of selected student work compiled over a
significant period of time such as a month, the school year, or throughout an
instructional unit. Typically, portfolios involve strategies that require students
to interpret and reflect on their work (in writing assignments, journal entries,
artwork, book reviews, rough and final drafts), thus demonstrating their growth
and development toward curriculum goals. Both performance and portfolio
assessments are considered "alternatives" to the traditional form of classroom
assessment, the paper-and-pencil test.
• Narrative assessments are based on formal teacher observations of specific
learning behaviors, and can involve measuring the ability of students to read
out loud or present an oral report to the class. Like portfolios, they are also
examples of performance assessments.
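By way of illustration, the following minimal sketch (in Python) shows one way the descriptive rubrics mentioned above might be represented and applied. The criteria, level descriptors, and function are invented for the example; they are not drawn from any particular published scoring scheme.

# Illustrative analytic rubric for scoring a performance assessment
# (e.g., an oral report). Criteria and level descriptors are hypothetical.
RUBRIC = {
    "organization": {1: "ideas disconnected", 2: "some structure", 3: "clear, logical flow"},
    "evidence": {1: "no supporting detail", 2: "some detail", 3: "rich, relevant detail"},
    "delivery": {1: "halting, inaudible", 2: "mostly clear", 3: "fluent and engaging"},
}

def score_performance(ratings):
    """Validate the awarded levels and sum them into a total score."""
    for criterion, level in ratings.items():
        if criterion not in RUBRIC or level not in RUBRIC[criterion]:
            raise ValueError(f"invalid rating: {criterion}={level}")
    return sum(ratings.values())

# A student rated 3, 2, 3 earns 8 of a possible 9 points.
print(score_performance({"organization": 3, "evidence": 2, "delivery": 3}))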
CONCLUSION
REFERENCES
Ahmann, J.S., & Glock, M.D. (1963). Evaluating pupil growth. Boston: Allyn & Bacon.
Airasian, P. (2001). Classroom assessment: Concepts and applications (4th ed.). New York: McGraw-Hill.
Alexander, L., Frankiewicz, R., & Williams, R. (1979). Facilitation of learning and retention of oral instruction using advance and post organizers. Journal of Educational Psychology, 71(5), 701-707.
American Federation of Teachers, National Council on Measurement in Education, and National Education Association. (1990). Standards for teacher competence in educational assessment of students. Washington, DC: The National Council on Measurement in Education.
Anderson, L.W., Krathwohl, D.R., Airasian, P.W., Cruikshank, K.A., Mayer, R.E., Raths, J., & Wittrock, M.C. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. New York: Longman.
Arends, R.I. (1997). Classroom instruction and management. New York: McGraw-Hill.
Arter, J. (1999). Teaching about performance assessment. Educational Measurement: Issues and Practice, 18(2), 30-44.
Ausubel, D., Novak, J., & Hanesian, H. (1978). Educational psychology: A cognitive view. New York: Holt.
Beez, W. (1968). Influence of biased psychological reports on teacher behavior and pupil performance. Proceedings of the 76th Annual Convention of the American Psychological Association, 3, 605-606.
Bloom, B., Madaus, G.F., & Hastings, J.T. (1981). Evaluation to improve learning. New York: McGraw-Hill.
Boostrom, R. (1991). The value and function of classroom rules. Curriculum Inquiry, 21, 194-216.
Brophy, J. (1983). Research on the self-fulfilling prophecy and teacher expectations. Journal of Educational Psychology, 75, 631-661.
Brookhart, S.M. (1999). Teaching about communicating assessment results and grading. Educational
Measurement: Issues and Practice, 18(1), 5-13.
Canner, J., Fisher, T., Fremer, J., Haladyna, T., Hall, J., Mehrens, W., Perlman, C., Roeber, E., &
Sandifer, P. (1991). Regaining trust: Enhancing the credibility of school testing programs. A report
from The National Council on Measurement in Education Task Force. Washington, DC: National
Council on Measurement in Education.
Cohen, E.G. (1972). Sociology and the classroom: Setting the conditions for teaching-student
interaction. Review of Educational Research, 42, 441-452.
Cooper, H.M., & Good, T. L. (1983). Pygmalion grows up: Studies in the expectation communication
process. New York: Longman.
Delpit, L. (1995). Other people's children: Cultural conflict in the classroom. New York: New Press.
Doyle, W. (1986). Classroom organization and management. In M.C. Wittrock (Ed.), Handbook of
research on teaching (pp. 392-431). New York: Macmillan.
Dreeben, R. (1967). The contribution of schooling to the learning of norms. Harvard Educational
Review, 37, 211-237.
Dusek, J.B. (Ed.). (1985). Teacher expectancies. Hillsdale, NJ: Erlbaum.
Ebel, R.L. (1965). Measuring educational achievement. Englewood Cliffs, NJ: Prentice-Hall.
Eden, D., & Shani, A. (1982). Pygmalion goes to boot camp: Expectancy, leadership, and trainee performance. Journal of Applied Psychology, 67, 194-199.
First, J., & Mizell, M.H. (1980). Everybody's business: A book about school discipline. Columbia, SC: Southeastern Public Education Program, American Friends Service Committee.
Friedman, S.J., & Frisbie, D.A. (1993). The validity of report cards as indicators of student performance. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta.
Frisbie, D.A., & Waltman, K.K. (1992). An NCME instructional module on developing a personal grading plan. Educational Measurement: Issues and Practice, 11(3), 35-42.
Good, T.L., & Brophy, J.E. (1991). Looking in classrooms (5th ed.). New York: Harper Collins.
Gronlund, N.E. (1998). Assessment of student achievement. Boston, MA: Allyn & Bacon.
Henry, J. (1957). Attitude organization in elementary school classrooms. American Journal of
Orthopsychiatry, 27, 117-123.
Henry, J. (1971). Essays on education. New York: Penguin.
Iowa Tests of Basic Skills. (2000). Itasca, IL: Riverside Publishing, Houghton Mifflin.
Jackson, P.w. (1968). Life in classrooms. New York: Holt, Rinehart, & Winston.
Jackson, P.W. (1990). Life in classrooms. New York: Teachers College Press.
Jones, E. (1986). Interpreting interpersonal behavior: The effects of expectancies. Science, 234,
41-46.
Jussim, L. (1986). Teacher expectancies: Self-fulfilling prophecies, perceptual biases, and accuracy.
Journal of Personality and Social Psychology, 57, 469-480.
Joint Advisory Committee. (1993). Principles for fair students' assessment practices for education in
Canada. Edmonton, Alberta: University of Alberta.
Kellaghan, T., Madaus, G.F., & Airasian, P.w. (1982). The effects of standardized testing. Boston:
Kluwer-Nijhoff.
Ladson-Billings, G. (1994). The dreamkeepers: Successful teachers of African American children. San
Francisco: Jossey Bass.
McCaslin, M., & Good, T. (1996). Listening to students. New York: Harper Collins.
Mehrens, W.A. (1998). Consequences of assessment: What is the evidence? Education Policy Analysis
Archives, 6(13). Retrieved from http://epaa.asu.edu/epaav6n13.html
Merrow, J. (2001). Undermining standards. Phi Delta Kappan, 82(9), 652-659.
Morgan, N., & Saxton, J. (1991). Teaching, questioning, and learning. New York: Routledge.
Overton, T. (1996). Assessment in special education: An applied approach. Englewood Cliffs, NJ:
Merrill.
Polloway, E.A., Epstein, M.H., Bursuck, W.D., Roderique, R.W., McConeghy, J.L., & Jayanthi, M.
(1994). Classroom grading: A national survey of policies. Remedial and Special Education, 15,
162-170.
Rist, R.C. (1977). On understanding the process of schooling: The contribution of labeling theory. In
J. Karabel, & A.H. Halsey (Eds.), Power and ideology in education. New York: Oxford University
Press.
Rosenthal, R., & Jacobson, L. (1968). Pygmalion in the classroom: Teacher expectations and pupils'
intellectual development. New York: Holt.
Schrank, W. (1968). The labeling effect of ability grouping. Journal of Educational Research, 62,
51-52.
Simon, B., & Willcocks, J. (Eds.). (1981). Research and practice in the primary classroom. London:
Routledge & Kegan Paul.
Smith, M.L., & Rottenberg, C. (1991). Unintended consequences of external testing in elementary
schools. Educational Measurement: Issues and Practice, 10(4), 7-11.
Steele, C.M. (1997). A threat in the air: How stereotypes shape intellectual identity and performance.
American Psychologist, 52, 613-629.
Taylor, C., & Nolen, S. (1996). A contextualized approach to teaching students about assessment.
Educational Psychologist, 31, 77-88.
Terra Nova. (1997). Monterey, CA: CTB/McGraw-Hill.
Tombari, M., & Borich, G. (1999). Authentic assessment in the classroom. Upper Saddle River, NJ:
Prentice Hall.
Waller, W. (1967). The sociology of teaching. New York: Wiley.
Wiggins, G. (1992). Creating tests worth taking. Educational Leadership, 44(8), 26-33.
Wiggins, G., & McTighe, J. (1998). Understanding by design. Alexandria, VA: Association for
Supervision and Curriculum Development.
26
Alternative Assessment
CAROLINE GIPPS
Kingston University, UK
GORDON STOBART
Institute of Education, University of London, UK
"When I use a word" Humpty Dumpty said ... "it means just what I
choose it to mean" ... "The question is," said Alice, "whether you can
make words mean so many different things." "The question is," said
Humpty Dumpty, "which is to be master - that's all." (Lewis Carroll, Alice
Through the Looking-Glass)
As the work on the English SATs for 7-year-olds showed, interviews with small
groups of students (though time-consuming and demanding) were highly
effective in developing teachers' knowledge and understanding of both in-depth
assessment and the children's learning in the subject areas (Gipps, Brown,
McCallum, & McAlister, 1995).
There are some useful examples of the dynamic use of diagnostic assessments
in the assessment of young children. A variety of approaches involving both
observation and interview are used in the assessment of reading (Raban, 1983).
Informal reading inventories have been developed which help teachers to assess
the skills that students have mastered. These inventories can include checklists
to assess attitudes to reading, checklists of examples to assess phonic skills (e.g.,
knowledge of initial and final letters), lists of books read by the student, activities
to assess comprehension, and "miscue" techniques for analysing reading aloud.
To assess comprehension, teachers will often ask questions after the student has
read aloud about the content of the story, or about what may happen next, while
the "running record" (Clay, 1985) uses a notation system to record miscues
during reading aloud. Miscue analysis is essentially a formalized structure for
assessment while hearing children read, a device which primary schools tradi-
tionally use in the teaching of reading. When miscue analysis is used, reading
aloud develops from a practice activity to a more diagnostic one. The teacher has
a list of the most common errors children make, with a set of probable causes.
By noticing and recording in a systematic way the child's errors and analysing the
possible causes, teachers can correct misunderstandings, help the child develop
appropriate strategies, and reinforce skills. In other words, miscue analysis is
essentially an informal diagnostic assessment.
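The systematic recording that miscue analysis calls for can be sketched briefly in code. The coding categories below (substitution, omission, insertion, self-correction) are common ones in the reading literature, but the records and the accuracy calculation are invented for the example; this is not Clay's running record notation itself.

from collections import Counter

# Hypothetical miscues noted while a child reads a 120-word passage aloud:
# (text word, what the child said, category).
miscues = [
    ("house", "horse", "substitution"),
    ("the", None, "omission"),
    (None, "a", "insertion"),
    ("was", "saw", "self-correction"),
    ("pond", "pound", "substitution"),
]

counts = Counter(category for _, _, category in miscues)
words_read = 120

# Self-corrections are conventionally not counted as errors.
errors = sum(n for category, n in counts.items() if category != "self-correction")
print(counts)
print(f"accuracy: {1 - errors / words_read:.1%}")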
Building on these approaches in England, the Primary Language Record was
developed in the late 1980s by teachers and researchers. This is a detailed
recording and observation document which is filled in at several points during
the school year to provide information that can inform teaching. Thus, it is
designed as a formative assessment rather than as an administrative record. The
assessment includes, indeed begins with, the parent's comments about the child's
speech, writing, and reading activities. There is also a section for recording the
child's view of him/herself as a reader, writer, and language user. A "conference"
between teacher and child encourages the child to begin the process of self-
assessment, something which is a key feature of profiles and records of achieve-
ment. The reading assessment is about as different from a standardized reading
test as it is possible to be: it is based on teacher assessment, and it requires
predominantly a written account of the child's interest in reading, enjoyment,
strategies, difficulties, etc. This was quite a forward-looking approach to alternative
assessment, involving as it did discussion between teacher and student, collection
of samples of work, self-assessment, etc.
Self-Evaluation
Teachers commonly bring with them to the assessment setting an elaborate and
extensive knowledge base, dispositions towards teaching as an activity, and
sophisticated assessment skills. This expertise, particularly knowledge of the
criteria and standards that apply and of how to make evaluative judgments,
should be shared with students. Feedback can assist in this.
Feedback
Feedback from teacher to student is the process that embeds assessment in the
teaching and learning cycle (Black & Wiliam, 1998; Crooks, 1988; Gipps,
McCallum, & Hargreaves, 2000). It is a key aspect of assessment for learning and
of formative assessment in particular. Formative assessment is the process of
appraising, judging, or evaluating students' work or performance, and using this
The teacher's role, rather than being diminished by sharing responsibility for
assessment with the student, is augmented.
Performance Assessment
The aim of performance assessments is to model the real learning activities that
we wish students to engage with (e.g., oral and written communication skills,
problem solving activities) rather than to fragment them, as happens in multiple-
choice tests (Kane et al., 1999; Linn et al., 1991; Shepard & Bliem, 1995). In the
United States, performance assessment may include portfolio assessment and
what in Britain is called classroom assessment or teacher assessment (teachers'
assessment of students' performance).
We have not sought to distinguish performance assessment from authentic
assessment. We treat authentic assessment as performance assessment carried
out in an authentic context (i.e., it is produced in the classroom as part of normal
work, rather than as a specific task for assessment). While not all performance
assessments are authentic, it is difficult to imagine an authentic assessment that
would not also be a performance assessment; thus performance assessment
incorporates authentic assessment. Meyer (1992) suggests that in using the term
"authentic assessment" the respects in which the assessment is authentic should
be specified: the stimulus, task complexity, locus of control, motivation,
spontaneity, resources, conditions, criteria, standards, and consequences.
Educational measurement specialists were writing about performance
assessment as early as the 1950s (Stiggins & Bridgeford, 1982), and an evaluation
of this early work led to a working definition:
Portfolios
The account so far has focused largely on the use of alternative assessment to
improve the quality of students' learning. We now address the issues raised when
alternative assessments are incorporated into large-scale assessment systems,
where they are used for accountability purposes.
Before considering the validity of alternative assessments, we will summarize
current thinking on validity, which incorporates "reliability" and "equity." While
recognizing that alternative assessments are subject to the same basic
considerations as more conventional assessments (Messick, 1994), we take the
view that there is a legitimate debate about whether increased validity in terms
of "authenticity" (Newmann, 1990), "cognitive complexity" (Linn et al., 1991) or
"directness" (Frederiksen & Collins, 1989) can provide a challenge to the
narrowness of standard applications of validity and reliability (Moss, 1992). We
are confident that this challenge is effective when the uses and consequences of
alternative assessment relate to the improvement of learning in the classroom.
However, when used in large-scale, particularly high-stakes programs, the
consequences of assessment differ, and are more problematic. In the same way
that alternative assessments may have particular strengths, they may also have
particular weaknesses in terms of generalizability (Kane et al., 1999).1
Current thinking treats validity as an integrated concept, organized around a
broader view of construct validity (Shepard, 1993) and regards it as a property of
test scores rather than of the test itself. As early as 1985, the American Standards
for Educational and Psychological Testing were explicit about this: "validity always
refers to the degree to which ... evidence supports the inferences that are made
from the scores" (AERA, APA, NCME, 1985, p. 9). This unitary approach to
validity also subsumes considerations about reliability which has traditionally
been treated as a separate construct. Confidence in inferences must include
confidence in the results themselves which inconsistencies in administration,
marking, or grading will undermine.
More recent debate has focused on whether the concept of validity should also
include the consequences of an assessment. Messick's (1989) view was that it
should.
For a fully unified view of validity, it must also be recognized that the
appropriateness, meaningfulness, and usefulness of score-based inferences
depend as well on the social consequences of the testing. Therefore, social
values and social consequences cannot be ignored in considerations of
validity. (p. 19)
While there is support for this position both in America (Linn, 1997; Shepard,
1997) and in England (Gipps, 1994a; Wiliam, 1993), it has been challenged by
those who regard it as "a step too far." Critics such as Popham (1997) and
Mehrens (1997) argue that while the impact of test use should be evaluated, this
should be kept separate from validity arguments. "The attempt to make the
social consequences of test use an aspect of validity introduces unnecessary
confusion into what is meant by measurement validity" (Popham, 1997, p. 13).
Part of Popham's concern is that the test developer would be responsible for
what others do with test results. His claim that this approach will "deflect us from
the clarity we need when judging tests and the consequences of test use"
(Popham, 1997, p. 9) merits attention. What is important here is the emphasis on
uses and consequences.
Messick (1994) has raised the issue of whether performance assessments should
"adhere to general validity standards addressing content, substantive, structural,
external, generalizability and consequential aspects of construct validity" (p. 13).
His position is that they should. "These general validity criteria can be
specialized for apt application to performance assessments, if need be, but none
should be ignored" (p. 22). However, alternative assessment gives rise to a
variety of problems that cannot be easily accommodated within conventional
interpretations of validity and reliability. For example, students may be permitted
"substantial latitude in interpreting, responding to, and perhaps designing
tasks"; further, "alternative assessments result in fewer independent responses,
each of which is complex, reflecting integration of multiple skills and knowledge;
and they require expert judgment for evaluation" (Moss, 1992, p. 230). The
dilemma is how much these "specialized" criteria can be used to offset limitations
in terms of conventional requirements of reliability.
Moss (1994) has suggested that "by considering hermeneutic alternatives for
serving the important epistemological and ethical purposes that reliability serves,
we expand the range of viable high-stakes assessment practices to include those
that honor the purposes that students bring to their work and the contextualized
judgements of teachers" (p. 5). She draws telling examples from higher education
to demonstrate alternative approaches: how we confer higher degrees, grant
tenure, and make decisions about publishing articles. This approach is in line
with Mishler's (1990) call to reformulate validation "as the social discourse
through which trustworthiness is established," thus allowing us to see conventional
approaches to reliability as particular ways of warranting validity claims, "rather
than as universal, abstract guarantors of truth" (p. 420).
The importance of this line of argument lies in its challenge to narrow views
of reliability which constrain what can be assessed and, therefore, what will be
taught (Resnick & Resnick, 1992). Wolf et al. (1991) sum up this approach by
suggesting that we need "to revise our notions of high-agreement reliability as a
cardinal symptom of a useful and viable approach to scoring student performance"
and "seek other sorts of evidence that responsible judgment is unfolding" (p. 63).
Wiliam (1992) has proposed the concepts of disclosure ("the extent to which an
assessment produces evidence of attainment from an individual in the area being
assessed") and fidelity ("the extent to which evidence of attainment that has been
disclosed is recorded faithfully") as an alternative way of thinking about the
reliability of assessment of performance (p. 13).
Advocates of alternative assessment make other "specialized" claims which
reflect the stronger validity argument that the directness of the relationship
between assessment and the goals of instruction implies. For example, there is
the appeal to "directness" and "systemic validity" by Frederiksen & Collins
(1989), to "authenticity" by Newmann (1990), and to "cognitive complexity" by
Linn et al. (1991).
The dilemma is whether such strengths can compensate for weakness in
relation to other aspects of validity. Kane et al. (1999) make the case that any
validity argument has to focus on the weakest link, as this is where validity will
break down. For example, in multiple-choice tests, a key weakness is that of
extrapolation from results to interpretation - a link that is often neglected (Feldt
& Brennan, 1989). Kane et al. (1999) argue that for "performance assessment,
this rule of thumb argues for giving special attention to the generalizability of the
results over raters, tasks, occasions, and so forth" (p. 15).
Put more simply, alternative assessments are particularly vulnerable to reliability
problems relating to sampling, standardization, and scoring. The concern with
sampling is that because high-fidelity tasks are often time-consuming, assessment
will be based on a restricted number of performances. The consequence is that
"generalization from the small sample of high-fidelity performances to a broadly
defined universe of generalization may be quite undependable" (Kane et al.,
1999, p. 11) - Messick's "construct under-representation." Concern with sampling
error associated with a small number of tasks has been repeatedly raised
(Brennan & Johnson, 1995; Mehrens, 1992; Swanson, Norman, & Linn, 1995).
Shavelson, Baxter, and Pine (1992) concluded that "task-sampling variability is
considerable. In order to estimate a student's achievements a substantial number
of tasks may be needed" (p. 26). Shavelson, Baxter, and Gao (1993) estimated
that between 10 and 23 tasks would be needed to achieve acceptable levels of
reliability. Kane et al. (1999) provide some ideas on how this might be done, for
example by requiring students to complete, for assessment purposes, only one
part of a lengthy task.
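The dependence of score reliability on the number of tasks can be made concrete with the Spearman-Brown prophecy formula, a standard result in the measurement literature (though it assumes parallel tasks and ignores the rater and occasion facets that generalizability theory also models). The single-task reliability of .30 used below is invented for illustration.

def spearman_brown(r_single, k):
    """Projected reliability of a score averaged over k parallel tasks."""
    return k * r_single / (1 + (k - 1) * r_single)

# If one performance task yields a reliability of .30, how many tasks
# are needed to approach a conventional .80?
for k in (1, 5, 10, 15, 23):
    print(f"{k:2d} tasks -> reliability {spearman_brown(0.30, k):.2f}")
# About 10 tasks reach .81, broadly in line with the 10-23 task
# estimates of Shavelson, Baxter, and Gao (1993) cited above.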
It has been our experience in national assessment in England that marker
reliability is what most concerns policymakers, particularly if the marker is the
student's teacher. Interestingly, the unreliability related to raters' scoring of
performance assessments is seen by technical experts as less of a problem than
Task x Person interaction (Shavelson et al., 1992). This is in part because rater
reliability is susceptible to rapid improvement where care is taken with the
scoring system and raters are well-trained (Baker, 1992; Dunbar, Koretz, &
Hoover, 1991; Shavelson et al., 1993). Improvement can be achieved by
standardizing tasks and procedures (asking all students to take the same tasks
under similar conditions) and by providing detailed scoring rubrics (Swanson et
al., 1995). The dilemma is that increased standardization - and reliability - will
be accompanied by "a shrinking of the universe of generalization relative to the
target domain" (Kane et al., 1999, p. 12). The issue is how to strengthen scoring
without weakening the sampling of the target domain.
Another approach to enhancing scorer reliability is moderation, which is used
widely in national examination systems in which teachers' assessments of
students contribute to the final grade (Broadfoot, 1994; Harlen, 1994; Pitman &
Allen, 2000). Moderation comes in a number of forms, ranging from work with
teachers prior to final assessment to increase consensus on marking standards
(Gipps, 1994c) to statistical adjustment of marks by means of a reference test. In
the United Kingdom, moderation often involves external moderators reviewing
a sample of work from each school and deciding whether the marks need to be
adjusted.
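A common statistical form of such adjustment is linear scaling, in which a school's internal marks are rescaled to the mean and spread of its students' scores on the reference test. The sketch below, with invented marks, illustrates the idea only; it does not describe the procedure of any particular examination system.

import statistics as st

def moderate(school_marks, ref_scores):
    """Rescale school marks to the mean and spread of the reference test.

    Preserves each student's rank within the school while aligning the
    school's mark distribution with the external benchmark.
    """
    m_mean, m_sd = st.mean(school_marks), st.pstdev(school_marks)
    r_mean, r_sd = st.mean(ref_scores), st.pstdev(ref_scores)
    return [round(r_mean + (x - m_mean) * r_sd / m_sd, 1) for x in school_marks]

# A generously marking school (mean 84) is pulled back toward the
# standard implied by its students' reference-test performance (mean 62).
print(moderate([78, 82, 86, 90], [55, 60, 62, 71]))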
Equity is an important consideration in any type or form of assessment, and
one of the advantages attributed to alternative approaches to assessment is that
they can help redress inequities and biases in assessment (Garcia & Pearson,
1991). According to Wolf et al. (1991), "there is a considerable hope that new
modes of assessment will provide one means for exposing the abilities of less
traditionally skilled students by giving a place to world knowledge, social processes,
and a great variety of excellence" (p. 60). Neill (1995) argues that performance
assessment and evaluation of culturally sensitive classroom-based learning have
the potential to foster multicultural inclusion and facilitate enhanced learning.
Others are more cautious, pointing out that alternative assessment on its own will
not resolve issues of equity. Consideration will still have to be given to students'
opportunity to learn (Linn, 1993), the knowledge and language demands of tasks
(Baker & O'Neil, 1995), and the criteria for scoring (Linn et al., 1991).
Furthermore, the more informal and open-ended assessment becomes, the
greater will be the reliance on the judgment of the teacher/assessor. Neither will
alternative forms of assessment, of themselves, alter power relationships and
cultural dominance in the classroom. However, what we can say is that a broadening
of approaches to assessment will offer students alternative opportunities to
demonstrate achievement if they are disadvantaged by any particular form of
assessment. Linn (1993) makes the point that "multiple indicators are essential
so that those who are disadvantaged on one assessment have an opportunity to
offer alternative evidence of their expertise" (p. 44). The best defence against
inequitable assessment is openness (Gipps & Murphy, 1994). Openness about
design, constructs, and scoring, will reveal the values and biases of the test design
process, offer an opportunity for debate about cultural and social influences, and
open the relationship between assessor and learner.
In summary, strong validity claims may be made for alternative assessment
when uses and consequences are linked to classroom instruction and student
learning. When the purpose shifts to certification and accountability in large-
scale systems, the arguments are harder to sustain. While there are ways of
increasing the reliability of alternative assessments, the need to increase
standardization may begin to undermine the fidelity of the tasks. Our involvement
in national programs in England, where forms of alternative assessment have
been used, has taught us that arguments on validity rarely win the day, since
policymakers, who often see the primary purpose of an assessment system as one
of managerial accountability, operate with conventional views of reliability
(assessment must be "externally marked" with a single grade/mark outcome,
and "scientific" levels of accuracy are expected). We now turn to some examples
of alternative assessment in practice that exemplify these pressures and
concerns.
The evidence for reliability and generalizability from the state-wide portfolio
programs (Koretz, 1998) suggests that this approach is generally not suitable for
accountability assessment in the U.S.A. Anderson and Bachor (1998) echo this
view on the basis of their experience in Canada, and see the portfolio as much
more likely to be used for assessment at classroom level than at provincial level
for accountability.
This is not the end of the story, however. There is good evidence that teachers
have the capacity to contribute to high stakes assessment. In Germany, where
teachers enjoy high professional status, there is virtually no external assessment
in the school system (Broadfoot, 1994). In England, also, performance assessment
plays an accepted part in high stakes assessment in three public examinations:
the General Certificate of Secondary Education (GCSE), a subject specific
examination taken by 16-year-olds at the end of compulsory schooling, in which
students typically take an examination in seven to nine subjects; the General
Certificate of Education Advanced Level (GCE A-level), normally taken by
18-year-olds in two or three subjects, and the basis of most university admissions;
and the Vocational Certificate of Education Advanced Level (VCE A-level).
Each incorporates, to varying degrees, an element of teacher assessment of
students. Coursework may take the form of a project (e.g., a historical
investigation, a geography field trip, a mathematics task), a portfolio of work
(English, all VCEs), an oral examination (modern languages), or a performance
(in drama, music, physical education). The weighting of the component will vary
in terms of "fitness-for-purpose" though the maximum contribution to a
student's overall grade has been politically determined, in the case of the GCSE
by a surprise announcement by the then Prime Minister, John Major, of a ceiling
of 25 percent (Stobart & Gipps, 1997), which has subsequently been relaxed.
Typically marks for coursework in GCE A-levels contribute around 20-30
percent, and VCE A-levels nearer 70 percent. The point is that these
performance assessments are an accepted part of the assessment culture.
Concerns about reliability have been addressed by increased standardization of
tasks and by providing exemplification and training to teachers. Coursework is
seen as both motivating and providing a more valid assessment within a subject;
for example, GCSE English includes the assessment by teachers of students'
speaking and listening, as well as their reading and writing. Furthermore, there
is a well-established system of moderation and checking in the GCSE and the
GCE, though not yet in the VCE.
An even greater commitment to performance assessment is demonstrated in
the Australian state of Queensland, which abolished examinations over 30 years
ago and replaced them with criteria-based subject achievement assessments by
teachers. The system involves extensive moderation by over 400 district review
panels, composed mainly of teachers. These panels review samples of work
across subjects and schools and provide advice to schools (Pitman & Allen,
2000). The process is seen as an effective form of professional development in
assessment. The only external assessment is the Queensland Core Skills Test
(cognitive skills based on the "common curriculum elements") which is taken by
17-year-olds. This, together with teacher assessment, provides the basis for entry
to tertiary education.
It seems, therefore, that it is possible to use alternative assessment in large-
scale programs, even for certification purposes, in some educational and cultural
contexts. It is interesting to speculate on what the social and cultural conditions
are that make these approaches possible. The professional status of the teacher
and confidence in the school system (as in Germany) are no doubt important
factors. An emphasis on accountability and selection is likely to deter such develop-
ments, as are legal challenges to the fairness of teachers' assessments. The role
and power of test agencies may also play a part (Madaus & Raczek, 1996).
CONCLUSION
ENDNOTE
1 These, and other authors, prefer the term generalizability to reliability. This reflects concern with
the use of findings. However, generalizability is determined by the reliability of an assessment;
unreliable measures limit it (Salvia & Ysseldyke, 1998).
REFERENCES
27

External (Public) Examinations

THOMAS KELLAGHAN
Educational Research Centre, St. Patrick's College, Dublin, Ireland
GEORGE MADAUS
Boston College, MA, USA
VALIDITY
Content Validity
rank order will be represented by its achieved weight. It might be assumed that
the outcome will reflect the intended weighting. However, this may not be the
case. Even when less marks are available for a particular component, adding
them to the marks for another component can completely reverse the order of
merit of the examinees on the components to which the greater number of marks
were assigned if the dispersion of the marks for the components differs. If this
happens, the validity of the examination will be reduced insofar as the weightings
will not operate as specified. To address this situation, it is necessary to take into
account the proportion of the variance of the total marks attributable to a
component and the correlation of the component marks with the total score (see
Delap, 1994).
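A small numerical example, with marks invented for the purpose, makes the dispersion effect concrete: an examinee who leads on the heavily weighted component can still finish behind overall when the marks on the lightly weighted component are much more widely spread.

# Two examinees; component A is intended to carry 70 of 100 marks, B only 30.
#            A (out of 70)  B (out of 30)
alice = (60, 5)
brendan = (58, 25)

# Alice leads on the dominant component A (60 vs. 58), but B's marks are
# far more dispersed (a 20-mark gap against a 2-mark gap on A), so the
# totals reverse the intended order of merit.
print("Alice:", sum(alice), "Brendan:", sum(brendan))  # Alice: 65 Brendan: 83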
Criterion-Related Validity
To assess criterion-related validity, the degree of empirical relationship between
test scores and criterion scores is usually expressed in terms of correlations or
regressions. While content validity is usually viewed as a singular property of a
test or examination, an examination or test could be assessed in the context of
many criterion-related validities. This might involve comparing performance in
a job with performance on a test that assessed the knowledge and skills required
for the job. It is difficult to identify a criterion with which performance on a
school-leaving examination might be compared, apart from performance in later
education (considered below under predictive validity).
Construct Validity
Predictive Validity
Messick's (1989) definition of validity points to the need to take account of the
potential and actual consequences of test use, something that has frequently
been overlooked in the past. Positive consequences of the use of information
derived from a public examination may be said to occur if the examination fulfils
the functions for which it is designed (i.e., certification, selection, motivation,
and the control of teaching and learning in schools). The most obvious of these
is perhaps selection (e.g., the extent to which the examination provides
information that can be used to make adequate and equitable decisions in
selecting students for higher education), but consequences also relate to the
other facets. These may not all be positive; there is evidence that examinations
to which high stakes are attached can have negative, if unintended, effects on
teaching, on students' cognitive development, and on their motivation to achieve
(see Disadvantages of External Examinations below).
RELIABILITY
The issue of reliability arises from the fact that human performance (physical or
mental) is variable. In the case of examinations or tests, quantification of the
extent to which an examinee's performance is consistent or inconsistent con-
stitutes the essence of reliability analysis (Feldt & Brennan, 1989). When
performance on a public examination is used for selection, the main issue is
replicability. Would the same subset of individuals be selected if they took the
examination on another occasion (Cresswell, 1995)?
A number of factors contribute to unreliability (Feldt & Brennan, 1989;
Wiliam, 1996). Some are common to all types of assessment, while some are
more likely to arise in the case of free-response assessments. First, examinees
will perform differently on different occasions. This could be due to variation in
the examinee relating to health, motivation, concentration, lapse of memory, or
carelessness. Some students may consistently perform poorly under examination
conditions (e.g., an anxious student), thus introducing a systematic error into the
examination process. However, this is not the kind of error addressed in
reliability concerns, in which error is considered random, not systematic. Second,
fluctuations in external conditions can affect an examinee's performance. These
may be physical (e.g., the examination hall may be too hot) or they may be
psychological (e.g., an examinee may have experienced a traumatic event
recently, such as the death of a relative).
Third, variation in the specific tasks required of an examinee in the assessment
may affect his or her performance. Since an examination consists of questions or
tasks that represent only a sample from a larger domain, the specific tasks in an
examination might unintentionally relate more closely to the knowledge of one
examinee than to that of another. The situation might be reversed if
a different sample of tasks had been included.
Fourth, different examiners will give different marks to the same piece of
work, and indeed, the same examiner may give different marks to the work if
marked on a different occasion. This is not a problem with objective closed-
response tests. In the case of free-response examinations, however, examiner
inconsistency can constitute a major source of error. It has two components.
First, there is the general bias of an examiner, that is, a tendency to mark "hard"
or "easy" relative to the "average" marker. The extent of this may be calculated
External (Public) Examinations 587
if examination scripts have been randomly allocated to examiners by subtracting
the average of a set of marks assigned by a particular examiner from the average
mark awarded by the entire group of examiners to the same questions or papers.
A second source of error arises from random fluctuation in an examiner's
marking. This is separate from any bias that may exist and relates to the marking
of all examiners, whether or not they exhibit bias.
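As a hypothetical illustration of the bias calculation just described (the marks and examiner labels below are invented), each examiner's bias can be estimated as the difference between the mean of the marks he or she assigned and the mean mark awarded by all examiners:

```python
# Hypothetical marks; random allocation of scripts is assumed, so
# differences between examiners' means reflect marking bias rather
# than differences in the quality of the scripts each received.
marks_by_examiner = {
    "examiner_1": [52, 61, 48, 70, 55],
    "examiner_2": [44, 50, 41, 58, 47],  # tends to mark "hard"
    "examiner_3": [60, 66, 57, 72, 63],  # tends to mark "easy"
}

all_marks = [m for ms in marks_by_examiner.values() for m in ms]
overall_mean = sum(all_marks) / len(all_marks)

for examiner, ms in marks_by_examiner.items():
    bias = sum(ms) / len(ms) - overall_mean
    print(f"{examiner}: bias of {bias:+.1f} marks relative to all examiners")
```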
Early studies of examination marker reliability revealed considerable
disagreement between separate scorings of essays (Hartog & Rhodes, 1935;
Stalnaker, 1951; Starch & Elliott, 1912, 1913). Disagreement was likely to be
greater in an English essay where imagination and style might be important
qualities to be assessed than in a science examination where content will be more
important. Madaus and Macnamara (1970) addressed both the issue of the
general bias of examiners (the extent to which different examiners agreed in the
marks they awarded to the same answers) and the inconsistency of individual
examiners (the extent to which two sets of marks awarded at different times by
the same examiner to the same answers agreed). Their first finding was that across
subjects, there was generally a 1 in 20 chance that marks would swing up or down
by about 10 percent. In an extreme case, for example in the case of English, it
could result in an examinee being awarded 50 percent by one examiner and 28
percent (a fail) by another. Second, some examiners were more harsh than
others. Third, the unreliability associated with the marks assigned by a single
examiner on two different occasions was scarcely lower than that associated with
the marks of two different readers. The findings are not greatly dissimilar to
those of other studies (e.g., Good & Cresswell, 1988; Hewitt, 1967).
In the case of tasks used in state-mandated assessments in the United States,
the reliability of classification decisions is an important but generally overlooked
issue. While standard errors for total tests are provided, they are not provided
for individual cut-score points.
The requirement for high reliability in examinations has had profound effects
on assessment practice. Highly reliable assessments tend to be associated with
standardized conditions of administration, the use of detailed marking schemes,
marker standardization to ensure that criteria and standards are uniform, and
multiple marking of scripts. The downside of these conditions is that limitations
are imposed on the kind of knowledge, skills, and understanding that can be
assessed (Cresswell, 1995). Alternative approaches in Britain that address this
issue by having teachers assess students' work, using non-standard tasks taken
under non-standard conditions and supervised by someone who knows the
students well, have run into problems of bias, reliability, and comparability from
the point of view of the selection function of examinations.
CRITERION-REFERENCING
EQUITY
The issue of equity, reflected in the fact that examinations are carried out under
standard conditions and that examinee responses are scored anonymously, has
always been a central concern, and indeed justification, for external (public)
examinations. However, given the importance of examinations in determining life
chances, it is inevitable that questions have arisen about possible bias in the proce-
dure which may limit the chances of particular groups (boys, girls, ethnic groups,
language groups, urban-rural candidates, public/private school candidates) (see
Mathews, 1985). While differences in performance have frequently been associated
with the socioeconomic status, location, ethnicity, and gender of examinees, it
has proved extremely difficult to determine whether these are reflections of the
"achievements" of the groups or are a function of the examining instrument.
A consistent finding in a number of countries is that method of measurement
may contribute to gender differences. Males obtain higher scores on standardized
multiple-choice tests, while females do better on essay-type examinations (Bolger
& Kellaghan, 1990; Mathews, 1985; Stage, 1995). A number of possible explana-
tions have been put forward for this finding: variables unrelated to content
(e.g., quality of writing, expectations of readers) may affect the score awarded by
examiners to an essay test, and males may have a greater tendency than
females to guess in answering multiple-choice questions.
SCHOOL-BASED ASSESSMENT
CONCLUSION
All education systems are faced with two major tasks. The first is the certification
of the achievements of students, both to maintain standards and to provide
evidence which individual students may need to use in the marketplace. The
second is the selection of students for further education in a situation which
prevails everywhere, in which the number of applicants, at least for certain
courses, exceeds the number of available places. A variety of procedures have
been developed to meet these purposes, which attempt to satisfy standards of
fairness and comparability, while ensuring that the objectives represented in
school curricula are adequately reflected. The procedures have taken examina-
tions in two opposite directions. Concern with fairness and equity is represented
in the use of assessment procedures in which common tasks are presented to all
students, administration conditions are standardized, and the student is unknown
to the examiner. Concern with adequate assessment of all aspects of the curricula
to which students have been exposed, on the other hand, is represented in
procedures in which the performance of students is assessed using non-standard
tasks under non-standard conditions by a teacher who knows the students well.
Sensitivity to these contrasting conditions and the need to meet the purposes
they represent are apparent, on the one hand, in the movement in education
systems in which assessment has relied heavily on external terminal examinations
to incorporate some aspects of assessment by students' own teachers into the
grade a student is awarded in an examination, and on the other, by the adoption
of external examinations in systems that have traditionally relied entirely on
school-based assessment for student certification. While these movements
represent a convergence in practice between education systems, a more radical
approach than one that combines existing practices may be required if assessment
practices are to contribute to meeting the challenges facing education in the
future, in particular fostering the development in students of the knowledge and
skills required in contemporary knowledge-based economies. These involve a
broader view of achievement than has traditionally been represented in most
assessment procedures. Curricula in secondary and tertiary institutions are
already responding to the needs of the changing workplace, as well as to the need
to accommodate students who, as participation in secondary and higher education
increases, are becoming more diverse in their abilities and interests. Unless
radical steps to adjust assessment procedures are also taken, there is a danger
that these efforts will be inhibited.
While external public examinations have been in operation in some European
countries for as long as two centuries, the United States is a newcomer to the
scene. Although ostensibly inspired by practice elsewhere, the manner in which
testing programs are being implemented in American states suggests that they
are based on a very superficial knowledge of the nature, variety, and history of
public examinations. American programs differ from practice in most European
countries in their total reliance on evidence obtained from an external test; in the
testing of elementary school students; in the narrow range of curriculum areas
assessed; in lack of choice (either to take a test or in the test taken); and in the
wide range of consequences attached to test performance for students, teachers,
and schools. It would seem that the reformers were intent on adding a new set of
problems to the many problems associated with public examinations in Europe
and elsewhere.
Perhaps the lesson to be drawn from the American experience by education
systems that are introducing or contemplating the introduction of new forms of
assessment is the need to consider the range of options available, and indeed to
look beyond well-established options, and then to assess carefully the advantages
and disadvantages associated with each, before deciding on the form of assess-
ment that best fits their educational traditions and expectations. It would indeed
be ironic at a time when higher standards of achievement, such as deep
conceptual understanding, an interest in and commitment to learning, a valuing
of education, and student confidence in their capacities and attributes, are being
proposed for all students, if the means adopted to meet these objectives
inhibited, rather than facilitated, their attainment.
REFERENCES
Amano, I. (1990). Education and examination in modern Japan (W.K. Cummings & F. Cummings,
Trans.). Tokyo: University of Tokyo Press.
Ames, C. (1992). Classrooms: Goals, structures, and student motivation. Journal of Educational
Psychology, 84, 261-271.
Ames, C., & Archer, J. (1988). Achievement goals in the classroom: Students' learning strategies and
motivation processes. Journal of Educational Psychology, 80, 260-267.
Bloom, B.S. (Ed.). (1956). Taxonomy of educational objectives. Handbook 1. Cognitive domain. London:
Longmans Green.
Bolger, N., & Kellaghan, T. (1990). Method of measurement and gender differences in scholastic
achievement. Journal of Educational Measurement, 27, 165-174.
Bray, M., Clarke, P.B., & Stephens, D. (1986). Education and society in Africa. London: Edward
Arnold.
Bray, M., & Steward, L. (1998). Examination systems in small states. Comparative perspectives on
policies, models and operations. London: Commonwealth Secretariat.
Brereton, J.L. (1944). The case for examinations. An account of their place in education with some
proposals for their reform. Cambridge: University Press.
Choppin, B., & Orr, L. (1976). Aptitude testing at eighteen-plus. Windsor, Berks: National Foundation
for Educational Research.
Christie, T., & Forrest, G.M. (1981). Defining public examination standards. London: Schools Council.
Clarke, M., Haney, W., & Madaus, G.F. (2000). High stakes testing and high school completion.
National Board of Educational Testing and Public Policy Statements (Boston College), 1(3), 1-11.
Cresswell, M.J. (1988). Combining grades from different assessments: How reliable is the result?
Educational Review, 40, 361-383.
Cresswell, M.J. (1995). Technical and educational implications of using public examinations for
selection to higher education. In T. Kellaghan (Ed.), Admission to higher education: Issues and
practice (pp. 38-48). Princeton, NJ: International Association for Educational Assessment/Dublin:
Educational Research Centre.
Cresswell, M. (1996). Defining, setting and maintaining standards in curriculum-embedded
examinations: Judgemental and statistical approaches. In H. Goldstein & T. Lewis (Eds.), Assessment:
Problems, developments and statistical issues (pp. 57-84). Chichester: Wiley.
Cresswell, M. (2000). The role of public examinations in defining and monitoring standards. In H.
Goldstein & A. Heath (Eds.), Educational standards (pp. 69-104). Oxford: Oxford University Press.
Crooks, T.J. (1988). The impact of classroom evaluation procedures on students. Review of
Educational Research, 58, 438-481.
Dearing, R. (1996). Review of qualifications for 16-19 year olds. London: School Curriculum and
Assessment Authority.
Delap, M.R. (1994). The interpretation of achieved weights of examination components. Statistician,
43,505-511.
DeLuca, C. (1993). The impact of examination systems on curriculum development. An international
study. Dalkeith, Scotland: Scottish Examinations Board.
DuBois, P.H. (1970). A history of psychological testing. Boston: Allyn & Bacon.
Eckstein, M.A., & Noah, H.J. (1993). Secondary school examinations. International perspectives on
policies and practice. New Haven, CT: Yale University Press.
Edwards, T., & Whitty, G. (1994). Education: Opportunity, equality and efficiency. In A. Glyn & D.
Miliband (Eds.), Paying for inequality (pp. 44-64). London: Rivers Oram Press.
Eisemon, T.O. (1990). Examination policies to strengthen primary schooling in African countries.
International Journal of Educational Development, 10, 69-82.
Elley, W.B., & Livingstone, I.D. (1972). External examinations and internal assessments. Alternative
plans for reform. Wellington, New Zealand: New Zealand Council for Educational Research.
EURYDICE. (1999). European glossary on education. Vol. 1. Examinations, qualifications and titles.
Brussels: Author.
Feldt, L.S., & Brennan, R.L. (1989). Reliability. In R.L. Linn (Ed.), Educational measurement (3rd
ed.; pp. 105-146). New York: American Council on Education/Macmillan.
Fitz-Gibbon, C.T., & Vincent, L. (1995). Candidates' performance in public examinations in
mathematics and science. London: School Curriculum and Assessment Authority.
Foden, F. (1989). The examiner. James Booth and the origins of common examinations. Leeds: School
of Continuing Education, University of Leeds.
Frederiksen, N. (1984). The real test bias: Influence of testing on teaching and learning. American
Psychologist, 39, 193-202.
Goldstein, H. (1986). Models for equating test scores and for studying the comparability of public
examinations. In D.L. Nuttall (Ed.), Assessing educational achievement (pp. 168-184). London:
Falmer.
Goldstein, H., & Cresswell, M.J. (1996). The comparability of different subjects in public examinations:
A theoretical and practical critique. Oxford Review of Education, 22, 435-442.
Good, F.J., & Cresswell, M.J. (1988). Grading the GCSE. London: Secondary Schools Examination
Council.
Greaney, V., & Kellaghan, T. (1996). The integrity of public examinations in developing countries. In
H. Goldstein & T. Lewis (Eds.), Assessment: Problems, developments and statistical issues
(pp. 167-181). Chichester: Wiley.
Grolnick, W.S., & Ryan, R.M. (1987). Autonomy in children's learning: An experimental and individual
difference investigation. Journal of Personality and Social Psychology, 52, 890-898.
Hambleton, R.K. (2001). Setting performance standards on educational assessments and criteria for
evaluating the process. In G.J. Cizek (Ed.), Setting performance standards. Concepts, methods and
perspectives (pp. 89-116). Mahwah, NJ: Lawrence Erlbaum.
Hambleton, R.K., Swaminathan, H., & Rogers, H. (1991). Fundamentals of item response theory.
Newbury Park, CA: Sage.
Han Min, & Yang Xiuwen. (2001). Educational assessment in China: Lessons from history and future
prospects. Assessment in Education, 8, 5-10.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives,
8(41). Retrieved from http://epaa.asu.edu/epaa/.
Hargreaves, A. (1989). The crisis of motivation in assessment. In A. Hargreaves & D. Reynolds
(Eds.), Educational policies: Controversies and critiques (pp. 41-63). New York: Falmer.
Harlen, W. (Ed.). (1994). Enhancing quality in assessment. London: Chapman.
Hartog, P., & Rhodes, E.C. (1935). An examination of examinations. London: Macmillan.
Hewitt, E. (1967). Reliability of GCE O-level examinations in English language. Manchester: Joint
Matriculation Board.
Heyneman, S.P., & Ransom, A.W. (1992). Using examinations and testing to improve educational
quality. In M.A. Eckstein & H.J. Noah (Eds.), Examinations: Comparative and international studies
(pp. 105-120). Oxford: Pergamon.
Horn, C., Ramos, M., Blumer, I., & Madaus, G.F. (2000). Cut scores: Results may vary. National
Board on Educational Testing and Public Policy Monographs (Boston College), 1(1), 1-31.
Kane, M.T. (2001). So much remains the same: Conception and status of validation in setting
standards. In G.J. Cizek (Ed.), Setting performance standards. Concepts, methods, and perspectives
(pp. 53-88). Mahwah, NJ: Lawrence Erlbaum.
Kariyawasam, T. (1993). Learning, selection and monitoring: Resolving the roles of assessment in Sri
Lanka. Paper presented at Conference on Learning, Selection and Monitoring: Resolving the
Roles of Assessment, Institute of Education, University of London.
Keeves, J.P. (1994). National examinations: Design, procedures and reporting. Paris: International
Institute for Educational Planning.
Kellaghan, T. (1992). Examination systems in Africa: Between internationalization and indigeniza-
tion. In M.A. Eckstein & H.J. Noah (Eds.), Examinations: Comparative and international studies
(pp. 95-104). Oxford: Pergamon.
Kellaghan, T., & Dwan, B. (1995). The 1994 Leaving Certificate Examination: A review of results.
Dublin: National Council for Curriculum and Assessment.
Kellaghan, T., & Greaney, V. (1992). Using examinations to improve education. A study in fourteen
African countries. Washington, DC: World Bank.
Kellaghan, T., Madaus, G.F., & Raczek, A. (1996). The use of external examinations to improve student
motivation. Washington, DC: American Educational Research Association.
Lewis, M. (1989). Differentiation-polarization in an Irish post-primary school. Unpublished master's
thesis, University College Dublin.
Little, A. (1982). The role of examinations in the promotion of "the Paper Qualification Syndrome."
In International Labour Office, Jobs and Skills Programme for Africa (Ed.), Paper qualifications
syndrome (PQS) and unemployment of school leavers. A comparative sub-regional study (pp. 176-195).
Addis Ababa: International Labour Office.
Loomis, S.C., & Bourque, M.L. (2001). From tradition to innovation: Standard setting on the
National Assessment of Educational Progress. In G.J. Cizek (Ed.), Setting performance standards.
Concepts, methods, and perspectives (pp. 175-217). Mahwah, NJ: Lawrence Erlbaum.
Madaus, G.F. (1988). The influence of testing on the curriculum. In L.N. Tanner (Ed.), Critical issues
in curriculum (pp. 83-121). Eighty-seventh Yearbook of the National Society for the Study of
Education, Part 1. Chicago: University of Chicago Press.
Madaus, G.F., & Kellaghan, T. (1991). Student examination systems in the European Community:
Lessons for the United States. In G. Kulm & S.M. Malcolm (Eds), Science assessment in the
service of reform (pp. 189-232). Washington, DC: American Association for the Advancement of
Science.
Madaus, G.F., & Kellaghan, T. (1992). Curriculum evaluation and assessment. In P.W. Jackson (Ed.),
Handbook of research on curriculum (pp. 119-154). New York: Macmillan.
Madaus, G.F., Kellaghan, T., & Rakow, E.A. (1975). A study of the sensitivity of measures of school
effectiveness. Report submitted to the Carnegie Corporation, New York. Dublin: Educational
Research Centre/Chestnut Hill, MA: School of Education, Boston College.
Madaus, G.F., & Macnamara, J. (1970). Public examinations. A study of the Irish Leaving Certificate.
Dublin: Educational Research Centre.
Madaus, G.F., & O'Dwyer, L.M. (1999). A short history of performance assessment. Phi Delta
Kappan, 80, 688-695.
Martin, M., & O'Rourke, B. (1984). The validity of the DAT as a measure of scholastic aptitude in
Irish post-primary schools. Irish Journal of Education, 18, 3-22.
Mathews, J.C. (1985). Examinations. A commentary. Boston: Allen & Unwin.
McCurry, D. (1995). Common curriculum elements within a scaling test for tertiary entrance in Australia.
In T. Kellaghan (Ed.), Admission to higher education: Issues and practice (pp. 253-360). Princeton,
NJ: International Association for Educational Assessment/Dublin: Educational Research Centre.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.; pp. 13-103). New
York: American Council on Education/Macmillan.
Millar, D., & Kelly, D. (1999). From Junior to Leaving Certificate. A longitudinal study of 1994 Junior
Certificate candidates who took the Leaving Certificate Examination in 1997. Dublin: National
Council for Curriculum and Assessment.
Mislevy, RJ. (1992). Linking educational assessments: Concepts, issues, methods, and prospects.
Princeton, NJ: Educational Testing Service.
Miyazaki, I. (1976). China's examination hell. The civil service examinations of Imperial China
(C. Schirokauer, Trans.). New York: Weatherhill.
Montgomery, R. (1965). Examinations. An account of their evolution as administrative devices in
England. London: Longmans Green.
Montgomery, R. (1978). A new examination of examinations. London: Routledge & Kegan Paul.
Morgan, M., Flanagan, R., & Kellaghan, T. (2001). A study of non-completion in undergraduate
university courses. Dublin: Higher Education Authority.
Morris, N. (1961). An historian's view of examinations. In S. Wiseman (Ed.), Examinations and
English education (pp. 1-43). Manchester: Manchester University Press.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological
Measurement, 14, 3-19.
Newbould, C.A., & Massey, A.J. (1979). Comparability using a common element. Occasional
Publication 7. Cambridge: Oxford Delegacy of Local Examinations, University of Cambridge
Local Examinations Syndicate, Oxford and Cambridge Schools Examinations Board.
Newton, P.E. (1997). Measuring comparability of standards between subjects: Why our statistical
techniques do not make the grade. British Educational Research Journal, 23, 433-449.
Nuttall, D.L., Backhouse, J.K., & Willmott, A.S. (1974). Comparability of standards between
subjects. Schools Council Examinations Bulletin 29. London: Evans/Methuen Educational.
O'Rourke, B., Martin, M., & Hurley, J.J. (1989). The Scholastic Aptitude Test as a predictor of third-
level academic performance. Irish Journal of Education, 23, 22-39.
Orr, L., & Nuttall, D.L. (1983). Determining standards in the proposed single system of examinations at
16+. London: Schools Council.
Pennycuick, D. (1990). The introduction of continuous assessment systems at secondary level in
developing countries. In P. Broadfoot, R. Murphy & H. Torrance (Eds), Changing educational
assessment. International perspectives and trends (pp. 106-118). London: Routledge.
Perreiah, A.R. (1984). Logic examinations in Padua circa 1400. History of Education, 13, 85-103.
Powell, J. L. (1973). Selection for university in Scotland. London: University of London Press.
Rashdall, H. (1936). The universities of Europe in the middle ages (Ed. F.M. Powicke & A.B. Emden)
(3 vols). Oxford: Oxford University Press.
Raymond, M.R., & Reid, J.B. (2001). Who made thee a judge? Selecting and training participants for
standard setting. In G.J. Cizek (Ed.), Setting performance standards. Concepts, methods, and
perspectives (pp. 119-157). Mahwah, NJ: Lawrence Erlbaum.
Roach, J. (1971). Public examinations in England 1850-1900. Cambridge: Cambridge University Press.
Ross, K.N., & Mählck, L. (Eds.). (1990). Planning the quality of education. The collection and use of
data for informed decision making. Oxford: Pergamon/Paris: UNESCO.
SEC (Secondary Examinations Council). (1985). Reports of the Grade-Related Criteria Working Parties.
London: Author.
Shiel, G., Cosgrove, J., Sofroniou, N., & Kelly, A. (2001). Ready for life? The literacy achievements of
Irish 15-year olds in an international context. Dublin: Educational Research Centre.
Shore, A., Pedulla, J., & Clarke, M. (2001). The building blocks of state testing programs. Chestnut Hill,
MA: National Board on Educational Testing and Public Policy, Boston College.
Somerset, A. (1996). Examinations and educational quality. In A. Little & A. Wolf (Eds.), Assessment
in transition. Learning, monitoring and selection in international perspective (pp. 263-284). Oxford:
Pergamon.
Stage, C. (1995). Gender differences on admission instruments for higher education. In T. Kellaghan
(Ed.), Admission to higher education: Issues and practice (pp. 105-114). Princeton, NJ:
International Association for Educational Assessment/Dublin: Educational Research Centre.
Stalnaker, J.M. (1951). The essay type of examination. In E.F. Lindquist (Ed.), Educational
measurement (pp. 495-530). Washington, DC: American Council on Education.
Starch, D., & Elliott, E.C. (1912). Reliability of grading high school work in English. School Review,
20, 442-457.
Starch, D., & Elliott, E.C. (1913). Reliability of grading work in mathematics. School Review, 21,
254-259.
U.S. Department of Education. (2002). No Child Left Behind Act of 2001: Reauthorization of the
Elementary and Secondary Education Act. Retrieved from http://www.ed.gov/offices/OESE/esea/
West, A., Edge, A., & Stokes, E. (1999). Secondary education across Europe: Curricula and school
examination systems. London: Centre for Educational Research, London School of Economics and
Political Science.
Wiliam, D. (1996). Standards in examinations: A matter of trust? Curriculum Journal, 7, 293-306.
Willis, P. (1977). Learning to labor. How working class kids get working class jobs. New York: Columbia
University Press.
Willmott, A.S., & Nuttall, D.L. (1975). The reliability of examinations at 16+. London: Macmillan.
Wiseman, S. (Ed.). (1961). Examinations and English education. Manchester: Manchester University
Press.
Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms
of student assessment. Review of Research in Education, 17, 31-74.
World Bank. (2002). Toolkit on public examination systems in developing countries. Retrieved from
http://www.worldbank.org/exams.
Section 7
Personnel Evaluation
Introduction
DANIEL L. STUFFLEBEAM
The Evaluation Center, Western Michigan University, MI, USA
TEACHER EVALUATION
Glasman and Heck focus on the crucial function of evaluating the performance
of school leaders, a function that, they note, has been performed poorly in the
past. Despite their misgivings about past practices, they say that the fundamental
stages of principal evaluation are agreed upon and sound, and should proceed in
three areas: baseline employment evaluation, fine-tuning evaluation, and
secondary employment evaluation.
They state that recent changes in schools and the principalship have led to a
need for new and different models for evaluating principals, noting that the
changing role of the principal has been brought about by the devolution of
responsibility and authority from a central office to the local school for directing
and controlling schools. Under site-based management approaches, schools have
been given more authority to determine curriculum, to hire and fire personnel,
to allocate resources, etc. They have also been assigned greater responsibility for
effecting and being accountable for student learning and other school functions.
More than ever, principals are seen to be accountable to a broad group of
stakeholders (e.g., parents, community, district administrators, teachers, staff,
and school councils).
A problem that arises when district/central office supervisors and school board
members fulfil their role as evaluators of principals is that they typically are
not adequately trained in personnel evaluation. To address this issue, Glasman
and Heck recommend that administrative training programs should include a
course in formative personnel evaluation and that advanced courses (e.g., in-service
training) could focus on summative evaluation, which would attend to issues of
accountability, equity, and efficiency. (While such training would strengthen
administrators' command of evaluation, there is also a clear need to train the
other persons who evaluate school principals, especially members of school
councils.)
In the face of poor models and tools for assessing principals' performances,
Glasman and Heck call for an improved principal evaluation approach, in which
a principal evaluation's purposes, objects, criteria, and evaluator qualifications
would be clarified. They point out that the purposes of an evaluation may be
formative (e.g., professional development) or summative (e.g., preparation,
certification, and selection of principals; granting tenure or promotion;
reassignment or dismissal; fixing accountability for student outcomes). Its objects
are defined as the aspects of the principal's work or role that are to be assessed
(e.g., leadership, attitudes, behavior, decision making, competence, performance,
effectiveness), with evaluative criteria providing a further breakdown of the
objects (e.g., goal achievement, with an emphasis on efficiency and effectiveness,
and taking into account what the school's stakeholders expect of the principal).
Finally, they call for a careful determination of the authorized evaluator(s), who
might include the school district's superintendent, teachers, parents, peers,
and/or members of the school's council.
Clearly, these authors see principal evaluation as a primitive area of practice
in need of redefinition and great improvement. They point to the need for sound
research and development, and have provided some useful ideas about how
needed improvement can be pursued.
SUPPORT PERSONNEL EVALUATION
Dwyer, C.A., & Stufflebeam, D.L. (1996). Teacher evaluation. In D.C. Berliner & R.C. Calfee (Eds.),
Handbook of educational psychology (pp. 765-786). New York: Simon & Schuster Macmillan.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation standards.
Newbury Park, CA: Sage.
Sanders, W.L., & Horn, S.P. (1998). Research findings from the Tennessee Value-Added Assessment
System (TVAAS) database: Implications for educational evaluation and research. Journal of
Personnel Evaluation in Education, 12, 247-256.
Tymms, P. (1994, July). Evaluating the long term impact of schools. Paper presented at the National
Evaluation Institute of the Center for Research on Educational Accountability and Teacher
Evaluation (CREATE), Gatlinburg, TN.
Tymms, P. (1995). Setting up a national "value added" system for primary education in England: Problems
and possibilities. Paper presented at the National Evaluation Institute of the Center for Research
on Educational Accountability and Teacher Evaluation (CREATE), Kalamazoo, MI.
Webster, W.J. (1995). The connection between personnel evaluation and school evaluation. Studies
in Educational Evaluation, 21, 227-254.
28
Teacher Evaluation Practices in the Accountability Era
MARI PEARLMAN
Educational Testing Service, NJ, USA
RICHARD TANNENBAUM
Educational Testing Service, NJ, USA
In 1996, when Dwyer and Stufflebeam last surveyed teacher evaluation practices
and philosophies, they made the following assertion:
They then reported on a broad range of issues and challenges raised by the nexus
of the burgeoning reform movement in education, the dynamic and complicated
nature of teaching as a profession, and the technical standards for high quality
personnel evaluation of any kind. As their introductory matrix of "Types of
Evaluations and Decisions Involved in Preparation, Licensing, Employment, and
Professionalization of Teachers" makes clear (see Appendix A), there was little
consensus about what practices worked best in ensuring the overall quality of the
education workforce. In addition, that teacher evaluation practices were deficient
and inadequate was presented as a fact; the remainder of their survey in one way
or another dealt with various aspects of improvement in these practices.
Years have passed, and nothing has occurred that would undermine the
assertions in the quotation above. There are, however, two significant areas of
change, in the public policy arena and in assessment methodologies, that could
have profound effects on teacher evaluation practices in the next decade. The
single most important shift in the public policy arena since Dwyer and
Stufflebeam's survey is the emergence of a tidal wave of support for what is
• The measurement does not take into account teaching context as a performance
variable.
• The measurement is unreliable, in part because it does not include time as a
variable, both the teacher's time with a cohort of students and some model
of sufficient time to see learning effects in students.
• The measures used to reflect student achievement are not congruent with best
practice and philosophy of instruction in modern education.
Several notable efforts to address these deficiencies have been launched in the
past decade; we discuss those below.
The second area of change has been in teacher assessment. Here
there is both good news and bad news. Teacher testing is now more than ever a
high stakes enterprise. The most striking example of the potential stringency of
teacher testing practices is provided by the experience of teachers in Massachusetts,
which introduced a new teacher licensing testing program in 1998, the
Massachusetts Educator Certification Tests (MECT). At its first administration
in April 1998, only 41 percent of all test takers met the passing standard
established by the state on all three parts of the test: reading literacy, writing
literacy, and subject matter knowledge (Haney et al., 1999). There have been and
continue to be numerous analyses of the technical and policy implications of the
development, implementation, and continuing use of the MECT (Haney et al.,
1999; Ludlow, 2001; Melnick & Pullin, 1999; Melnick & Pullin, 2000; Luna,
Solsken, & Kutz, 2000), but the essential point of the very public suffering of
Massachusetts educators is how congruent this testing initiative and the
responses to its results are with the rhetoric of "teacher accountability." All of
the furor over the Massachusetts approach to teacher testing emphasizes in one
way or another the critical validity concerns that the Massachusetts approach to
evaluation raises for all stakeholders in the educational reform arena.
The Massachusetts teacher testing program began as a legislative and policy
initiative that was intended to ensure the quality of teachers entering
Massachusetts's education workforce. On the face of it, Massachusetts was merely
implementing a quality assurance step in the licensing process for new teachers,
something states do, and have both a right and responsibility to do, on a regular
basis. It was both the very high stakes imposed on this new step in the licensing
process and the methods policy makers used to impose this step that exploded
Massachusetts teacher testing into headlines all over the U.S. (Sterling, 1998).
All aspects of validation of the constructs that underlie testing teachers at the
beginning of their careers - the technical standards the measurement profession
believes that all test makers are obliged to meet - appeared to be negotiable and
subject to policymakers', not test makers', quality standards. As Haney et al.
make clear, all aspects of the development - from the contractor chosen, the
time allotted to test development (six months), the policy decisions regarding
consequences communicated to candidates and then rescinded and replaced, and
the methods for setting passing score standards, to the technical information
provided by the test maker - were fraught with errors and violations of
professional measurement standards (Haney et al., pp. 1-4). Furthermore, a
critical and potentially divisive issue was joined by the Massachusetts con-
troversy: who has the authority to set the standard for teachers' knowledge and
skill? What is the proper balance of power among policymakers, legislators,
teacher educators, teacher candidates, and the general public whose children are
the reason for all of this concern (Brabeck, 1999)?
If licensure testing for beginning teachers appears to be fraught with
controversy, the good news in teacher evaluation is that several of the assessment
and evaluation ventures mentioned as promising but unproven by Dwyer and
Stufflebeam have been developed and elaborated and are now more than
promises. There has been effective application of theoretical knowledge to
aspects of teacher evaluation, with some actual practical results. Some essential
foundations for a degree of consensus about what teachers should know (one
part of the assessment and evaluation challenge) and be able to do (the other -
and far more vexing - part of the challenge) have been laid. Some actual
decisions and actions that affect teachers' and students' interconnected lives are
being taken as a result of this growing consensus. And there are innovative new
assessments being designed and used to evaluate teachers in many locations.
In the past decade considerable theoretical and empirical work has been
accomplished in three areas: what student testing can and should do as a part of
the instructional system; what teachers should know and be able to do and how
to incorporate those values and standards into the preparation program for
beginning teachers; and the links between teacher performance and student
achievement. In addition, innovative evaluation strategies for both beginning
and in-service teachers have been implemented in several states in the U.S.
Many of these evaluation strategies combine assessment of teachers'
performance with development and enhancement of teaching skills.
Beginning with what is known about current practice in student testing is an
increasingly important part of any consideration of teacher evaluation practices,
given the current political and policy climate. How teachers are prepared for
their professional work and how they should be prepared to do that work well is
the foundation for any evaluation system for professional performance. Clearly,
with the emphasis on links between teacher performance and student
achievement, how teachers are prepared for their instructional and motivational
roles is essential. And the growing body of evidence about links between teacher
performance and student achievement must be an influential element in
evaluation of teachers' work and support for their ongoing learning to improve
that work.
Student Testing
The most recent evaluative commentary on the use of student tests for the
purpose of high stakes accountability decisions is a survey by Robert Linn,
incoming (2002) President of the American Educational Research Association
(AERA), of 50 years of student testing in the U.S. education system and the
effects of that testing (Linn, 2000). Linn concludes:
Given the current policy climate, this is a sobering and cautionary conclusion,
coming as it does from such a major figure in the measurement community, and
one known for his evenhanded and judicious treatment of measurement issues.
It is clear that the most widely used current measures of student achievement,
primarily standardized norm-referenced multiple choice tests developed and
sold off-the-shelf by commercial test publishers, are useful for many educational
purposes, but not valid for school accountability. Indeed, they may be positively
misleading at the school level, and certainly a distortion of teaching effectiveness
at the individual teacher level (Madaus & O'Dwyer, 1999; Barton, 1999). Concerns
about the increased dependence on high-stakes testing have prompted a number
of carefully worded technical cautions from important policy bodies as well in the
past two to three years (Heubert & Hauser, 1999; Feuer et al., 1999; Elmore &
Rothman, 1999; AERA, 2000). While it is possible to imagine a program of student
testing that aligns the assessments used to the standards for learning and to the
curriculum actually taught, and that employs multiple methods and occasions to
evaluate student learning, the investment such a program would demand would
increase the cost of student assessment significantly. Furthermore, involving
teachers in the conceptual development and interpretation of assessment measures
that would be instructionally useful, particularly when those measures may have
a direct effect on the teachers' performance evaluation and livelihood, is no
closer to the realities of assessment practice than it has ever been, which is to say
that it is, in general, simply not part of the practice of school districts in the U.S.
The emphasis on teacher quality has gained considerable momentum from the
body of empirical evidence substantiating the linkage between teacher competence
and student achievement. The "value-added" research, typified by the work of
Sanders et al. (Sanders & Rivers, 1996; Wright, Horn, & Sanders, 1997; Sanders
& Horn, 1998), reinforces the assumption that the teacher is the most significant
factor impacting student achievement. Sanders' work in this area is the best
known and, increasingly, most influential among policymakers. In the measure-
ment community, however, independent analyses of Sanders' data and methods
have just begun (Koretz, 2001; Kupermintz, Shepard, & Linn, 2001; Hamilton,
Klein, & McCaffrey, 2001; Stecher & Arkes, 2001).
The "value added" research typified by Sanders et aI. clearly points to the
importance of teacher quality in one kind of student achievement, standardized
multiple choice testing. What is not clear from such research, however, is an
understanding of the factors that necessarily contribute to teacher quality.
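The chapter does not set out the value-added methodology itself. As a rough sketch of the underlying idea only (the teacher labels and scores below are invented, and operational systems such as Tennessee's use far more elaborate longitudinal mixed models), a teacher's contribution is estimated from students' score gains rather than from raw end-of-year scores:

```python
# Rough, hypothetical sketch of the gain-score idea behind
# value-added estimates; real systems model many more variables.
students = [
    # (teacher, last year's score, this year's score)
    ("teacher_A", 48, 60), ("teacher_A", 55, 66), ("teacher_A", 40, 49),
    ("teacher_B", 70, 74), ("teacher_B", 62, 65), ("teacher_B", 58, 63),
]

gains = {}
for teacher, before, after in students:
    gains.setdefault(teacher, []).append(after - before)

mean_gain = sum(g for gs in gains.values() for g in gs) / len(students)

for teacher, gs in gains.items():
    effect = sum(gs) / len(gs) - mean_gain
    print(f"{teacher}: estimated value added of {effect:+.1f} score points")
```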
Teacher Preparation
The question of "What are the qualifications of a good teacher?" has been
addressed by Linda Darling-Hammond (1999). Darling-Hammond conducted
an extensive review of state policies and institutional practices related to factors
presumed to contribute to teacher quality. Such factors included teachers' subject
matter knowledge, knowledge of teaching and learning, teaching experience, and
certification status. Prominent among her findings was a consistent and signifi-
cant positive relationship between the proportion of well qualified teachers in a
state and student achievement on National Assessment of Educational Progress
(NAEP) reading and mathematics assessments, even when controlling for
students' characteristics such as language background, poverty, and minority
status. A well-qualified teacher was defined as being state certified and holding
a college degree with a major in the field being taught.
What emerges from Darling-Hammond's research is an appreciation for the impact
that teacher education has on teacher quality, and, subsequently and presumably,
student achievement. Appropriately prepared teachers are necessary for
students to achieve. This is a particularly significant finding given the education
labor market dynamics currently in force in the U.S. While schools and
departments of education at U.S. universities are preparing more than enough
undergraduates and graduates to fill the available jobs (Feistritzer, 1998), those
students are either not entering the teaching profession or not willing to work in
geographical locations that have teaching jobs available. Over the next decade,
approximately two million teaching positions will need to be filled as the current
cadre of teachers retires. As Education Week's Quality Counts 2000 makes clear
(Olson, 2000a, 2000b), unless state policies regarding teacher licensure and
evaluation are radically revised, many of those positions will be filled with
teachers whose subject matter knowledge and professional teacher education is
at best unknown, as states grant emergency licenses and waive requirements so
that there will be an adult in every classroom in the role of teacher. Thus, a
widening gap between research findings regarding effective teaching and actual
hiring practices is emerging.
While the evaluation of teachers is certainly not a new phenomenon, the context
of evaluation has changed. Traditionally, teacher evaluation focused primarily on
what Danielson and McGreal (2000) label "inputs." Inputs refer to what teachers
do, the tasks they perform. That is, the philosophy guiding this evaluative model
is that teachers should be doing, behaving, interacting, demonstrating, and
reinforcing those teaching and learning practices and attitudes valued by the
educational system in which they are teaching. Assuming the presence of such
factors as clearly defined and communicated values (evaluation criteria), well-
trained evaluators, and a supportive professional climate, such an evaluation
orientation contributes to teachers' professional development and enhances
professional practice. The obvious merit of such systems is that they can provide
constructive formative feedback to teachers. Typically absent from such
traditional teacher evaluation systems are student achievement measures; that is,
indicators of student achievement do not generally factor into the teachers'
performance evaluation.
More recently, however, teacher evaluation systems have shifted attention to
student achievement measures. The philosophy guiding these systems is that it
matters less what teachers do if their actions do not lead to student achievement;
student success is what matters most (Danielson & McGreal, 2000). H.D. Schalock
(1998) summarizes the prevailing sentiment this way: "Put simply, if the purpose
of teaching is to nurture learning, then both teachers and schools should be
judged for their effectiveness on the basis of what and how much students learn"
(p. 240). This movement towards student outcomes as the benchmark of teacher
quality has gained momentum from the often publicized lack of perceived
academic competitiveness of students in this country compared to students in
other industrialized nations (see, for example, Manzo, 1997; Sommerfeld, 1996;
Walberg & Paik, 1997) and the mounting empirical evidence that teachers are
the most significant factor in student achievement (see, for example, Greenwald,
Hedges, & Laine, 1996; Sanders & Rivers, 1996; Wright, Horn, & Sanders,
1997). Teachers and, by extension, schools are being evaluated on the outcomes
of their students. Teacher and school effectiveness are measured by the progress
that students make in terms of their learning and achievement; "outputs" matter
most (Danielson & McGreal, 2000). The use of student achievement as a gauge
of teaching effectiveness is reasonable and appropriate, and some would argue
that student learning is the most important criterion by which to evaluate
teachers (Popham, 1997). As we report above, however, the measures by which
student learning is currently assessed are far from adequate in most cases to the
task of yielding trustworthy information by which teacher effectiveness can be
gauged.
For the most part, Danielson and McGreal report, student achievement has
been measured by standardized, multiple-choice assessments. Multiple choice
assessments are neither inherently valid nor invalid. Validity is not a property of
a test, but of the inferences and consequences tied to the scores of the test
(Messick, 1990). Validity arguments should be framed around the question,
"What does the testing practice claim to do?" (Shepard, 1993, p. 408). Widely
used multiple choice-based tests developed by commercial publishers for the K-
12 market primarily measure factual knowledge and procedural knowledge. In
most cases they do not, nor do they claim to, measure complex cognitive skills or
deep levels of subject matter understanding. Independent reviewers of accounta-
bility programs relying mainly on the use of standardized, multiple-choice tests,
such as the Dallas Value-Added Accountability System, are quick to underscore
the misalignment of such testing with current teaching and learning standards
stressing, for example, construction of knowledge and higher-order learning
(Sykes, 1997). At the present time there are very few teacher evaluation systems
that actually employ student achievement data as a part of the performance
appraisal process. According to the Education Commission of the States (ECS),
"Although no state has yet gone so far as to hold individual teachers accountable
for the performance of their students, several states have implemented direct,
performance-based assessments of teacher effectiveness" (Allen, 1999, p. 1). That
is, the data are being gathered and evaluated, though not yet employed as
appraisal measures. The states currently gathering and analyzing such data are
Tennessee, Texas, Minnesota, and Colorado. However, these states contrast
dramatically in their methods of using these data. Furthermore, this is a fast-
moving target, and changes in states' requirements for and uses of data are rapid;
the ECS website is a valuable source of up-to-date policy information from states
(Education Commission of the States, 2001).
OBSERVATIONS ON THE CURRENT LANDSCAPE
OF TEACHER EVALUATION
Teacher evaluation today is no less controversial than it was in 1992, but the stakes
are much higher. An ever-increasing number of politicians, from the President of
the United States to senators and representatives, to governors and state legis-
lators, have made educational accountability a war cry. And by accountability,
politicians most particularly mean teacher accountability for student learning.
Demanding it, figuring out how to measure it and what a high "it" looks like, and
then attaching rewards to it is the pattern of key political initiatives in many
states.
This particular interpretation of accountability emerges from a national focus
on education reform that has increasingly occupied center stage over the past
decade. What followed the statement of purpose and commitment that marked
President Clinton's 1996 National Education Summit (Achieve, 1997) was a
notable surge of attention in states to both content and performance standards for
students, and accompanying standards-based testing initiatives to help instantiate
these standards for learning throughout the education system (U.S. Department
of Education, 1999; National Research Council, 1999; Council of Chief State
School Officers, 1998; Baker & Linn, 1997; Massell, Kirst, & Hoppe, 1997). This
has, in turn, led to links between student learning standards and standards for
teacher preparation (Cohen & Spillane, 1993; Smith & O'Day, 1997; National
Association of State Directors of Teacher Education and Certification, 2000).
Talking tough about teacher quality appears to be a requirement for all state
and national political figures, and the federal government has done more than
talk. In Title II of the Higher Education Reauthorization Act of 1998 (Public
Law 105-244), the Teacher Quality Enhancement Act, the Congress entered the
fray, creating a law that specified that all states must report the passing rates of
teacher education students on state-required teacher licensing tests, and that all
institutions achieve a 70 percent passing rate in order to qualify for continued
federal student loan subsidies to all of their students, not just students in the
school or department of education. This requirement has led to complicated new
reporting methods that require coordination and collaboration among testing
organizations, state regulatory agencies, and teacher education institutions. It
has also led to charges, primarily from institutions of higher education (Interstate
New Teacher Assessment and Support Consortium [INTASC], 1999; American
Association of Colleges for Teacher Education, 2000; Teacher Education
Accreditation Council, 2000) that the information reported will be a distortion
of the realities of teacher preparation and that the effort is punitive rather than
supportive. The measure demonstrates, however, the seriousness of the
movement to require empirical evidence that speaks to teacher evaluation.
There is more student testing today than ever before, though there are very
few signs that such testing is any more effective or informative than it was 10 or
20 years ago, as the remarks by Robert Linn discussed above make clear. Nor are
there signs that such testing is any more closely connected to learning than it ever
was. There is, however, both more talk and more action on the use of tests for
very high stakes: promotion to the next grade, high school graduation, access to
special programs (see ECS, 2001; Olson, 2001). And student testing for high
stakes inevitably focuses attention on teacher effectiveness and its measurement.
There has been an exponential increase in the number of standards documents,
both for student learning and for teacher knowledge and performance. Both
state student standards and national standards for teachers (INTASC, 1992;
National Board for Professional Teaching Standards [NBPTS], 2002; National
Council for Accreditation of Teacher Education [NCATE], 1997) have more or
less thoroughly articulated the content standards in both professional and
subject matter knowledge. What is harder to articulate are the performance
standards, or the answer(s) to the question, "How much of all of this is enough
for a particular purpose?" The growing acceptance of standards documents that
define the content of teaching as a profession, convergence among the standards
documents, in both language and content, and the powerful effects of a common
vocabulary that can define "knowing enough" and "doing enough" as a teacher
mean that it is now possible to achieve some consensus on a definition of
effective teaching. The INTASC, NBPTS, and NCATE standards - for beginning
teachers, for advanced and accomplished teachers, and for accreditation of
teacher education institutions - form the basis of the "three-legged stool" cited
as the essential foundation for the reform of teacher preparation and develop-
ment in the report of the National Commission on Teaching and America's
Future (National Commission on Teaching and America's Future [NCTAF],
1996). Citing the extensive and painstaking consensus methodology that all three
organizations have used to create and validate these standards, the National
Research Council (NRC) report on teacher licensure also uses these sets of
standards as the basis for the working definition of teacher quality (NRC, 2001).
There has been remarkable progress as well in sophisticated methodologies
for assessing teachers, from entry level to advanced practice. And there are signs
from these methodologies that "testing" is coming much closer to its rightful
place, as part of something larger and more important, "learning."
Links between student achievement and teacher effectiveness, and measurement
or assessment methodologies to track such links, are increasingly connected, in
the rhetoric of "teacher accountability," to pay-for-performance teacher evaluation
systems. Under discussion in several states, these evaluation systems integrate
teacher development with monetary rewards for achieving new levels of
professional achievement (see Blair, 2001, p. 1).
The first decade of the 21st century opens with education reform as arguably the
most critical political issue facing states and the federal government. Increasingly
insistent calls for teacher accountability and documented increases in student
achievement; arguments over vouchers and charter schools; the growing
shortages of qualified and fully licensed teachers in mathematics, science, and
special education; the nationwide crisis in securing qualified school principals for
the available open positions; and the continuing difficulty in staffing schools in
high-poverty urban and rural areas with qualified educators are all contentious
issues that dominate both politics and the media in unprecedented measure at
the start of the new century. And all of this attention makes the issue of teacher
evaluation much more complex than ever before. Teacher evaluation, a key
component in any accountability system, is both more important and more
beleaguered than ever as a result of the rising pressure to "fix education" from
federal, state, and local officials and constituents. It remains to be seen whether
or not states and local districts will be able to muster the resources and the will
to implement what has been learned about teacher performance and assessment
of both teacher and student achievement in the past decade. What has not
changed - and, from a measurement perspective, is not likely to change - is that
sound and valid assessment of complex behavior requires time and sustained
resource allocation. As Linn's article (Linn, 2000) makes clear, support for this
kind of assessment has not been a hallmark of K-12 education policy in the U.S.
to date.
In addition, there is much more insistence on the necessary connection
between teacher performance evaluation and student achievement, but no more
clarity about how such a connection can be credibly and validly made. Educators,
politicians, and parents all agree that student learning is a key
outcome of effective teaching. The problem is what constitutes valid evidence of
the outcome. Today parents and legislators insist in ever-louder voices that such
evidence must be primarily gains in student achievement, which they see as a
criterion for student learning. Educators and researchers are much less con-
vinced of the clarity and credibility of such evidence, given the variability of
school contexts and the almost innumerable influences on student achievement
(Berk, 1988; Brandt, 1989; Frederiksen, 1984; Haas, Haladyna, & Nolen, 1990;
Haladyna, 1992; Haertel, 1986; Haertel & Calfee, 1983; Linn, 1987; Madaus,
1988; Messick, 1987; Shepard, 1989).
This disagreement, now more acute than ever before, has direct consequences
for teacher evaluation methods. Historically, the perspective of educators - that
the only way to include student achievement in teacher evaluation is to talk about
the "likelihood" of student achievement, given certain teacher variables like
content knowledge and pedagogical skill - has dominated teacher evaluation
practices. That is likely to be forcibly changed in the next decade.
Widely accepted standards for what teachers should know and be able to do -
essentially a definition of effective teaching - are one of the important accomplish-
ments in teacher evaluation over the past decade. The recognition that
meaningful teacher evaluation systems must begin with evaluation of teacher
preparation institutions is part of that accomplishment. The standards developed
by the Council of Chief State School Officers' Interstate New Teacher
Assessment and Support Consortium (INTASC) for beginning teachers and the
National Board for Professional Teaching Standards for experienced and accom-
plished teachers have gained broad acceptance across the U.S. This acceptance
crosses political and ideological lines, for the most part - virtually every notable
education policy and reform organization, along with both national teacher
organizations and virtually all discipline-specific membership organizations have
supported these sets of standards and participated actively in their development
and promulgation.
The thrust of the NCATE 2000 reform was to change the focus of accredita-
tion from "inputs" to "outputs" or results. Performance-based accreditation, as
expressed in the NCATE 2000 standards for schools and departments (NCATE
2000 Unit Standards) requires that institutions demonstrate the results of what
they do. These results need not be standardized test scores, though some such
scores are also considered acceptable evidence. Acceptable evidence does include,
however, documentation of the kinds of work students are required to complete;
of assessment methodologies and practices; and of student learning and the
instructional supports for that learning. As a culminating record, documentation
of the proportion of students who pass licensing tests that are required by the
state in which the institution is located is also required. The convergence of these
accreditation requirements with the Federal Title II reporting requirements is
noteworthy, as it lends credibility to the use of test scores as a means of program
evaluation.
The publication in 1996 of the report of the National Commission on Teaching
and America's Future (NCTAF), What Matters Most: Teaching for America's Future
(NCTAF, 1996), solidified the influence of these standards and the organizations
that championed them. Chaired by leading education reform spokesp~rson
Linda Darling-Hammond, NCTAF articulated an approach to state policies for
teacher evaluation and quality assurance that integrates the preparation, initial
assessment, ongoing professional development, and advanced certification of
teachers. Each of these stages in a professional educator's career would be
supported by standards-based evaluations, in the Commission's schema. The
report's metaphor of the "three-legged stool" on which a quality teaching work-
force rests has become well known as an expression of this vision: the INTASC
standards generate an approach to preservice teacher preparation and the initial
induction period of a teacher's career; the NCATE 2000 performance-based
accreditation standards and requirements generate an approach to formative
and summative assessment of preservice teachers and their institutions; and the
NBPTS standards generate, by implication, a professional development and
career plan for teaching professionals that culminates in the NBPTS certification
assessment (described in detail below). Each of these "legs" is to be supported
by sophisticated assessment methodologies that yield a complex array of informa-
tion about what a teacher at any stage of his or her career knows and is able to
do.
Concurrent with the increased emphasis on teacher quality and teacher education
has been the proliferation of discussion and debate regarding teacher evaluation.
There is an emerging focus of legislative interest and funding on
evaluation processes that are linked to teacher professional development. This
new emphasis is particularly apparent in the growth of programs for the
induction period - usually defined as the first two years of a new teacher's career.
There is also increased attention in states to performance-based assessments for
teachers that would lead to a career ladder in which compensation is based on
skills and performance, not seniority. These links are motivated by the realities
of the labor market for teachers (and principals) in the U.S. A very large
proportion of the current cadre of teachers began teaching in the 1960s and will
reach retirement within the next five years. There is already a shortage of teachers
in certain subject matter fields like mathematics, the sciences, and exceptional
needs. Rural and inner city schools are chronically understaffed. It has become
increasingly difficult for school districts to recruit enough graduates who have
completed a teacher preparation program to fill all available jobs (Bradley,
1999). All of these factors have focused attention on improvement of the quality
of a teacher's working life and on creating possibilities for growth and advance-
ment within a career as an educator. Thus increased interest in evaluation of
teachers for accountability purposes is closely allied with the formative use of
evaluation for recruitment, retention, and reward.
California's CFASST program, for example, which offers beginning teachers a
structured program of formative evaluation with individual mentors during the
first two years of teaching, appears to have cut attrition rates
among new teachers by as much as 40 percent in certain school districts (Briggs
et al., 2001). Cincinnati is implementing a program of teacher evaluation that is
tied to a career ladder based on performance rather than the traditional time on
the job and number of credits earned (Cincinnati Federation of Teachers &
Cincinnati Public Schools, 2000). Iowa, responding to its own analysis of the
teacher labor market in the region, its difficulty attracting Iowa teacher prepara-
tion program graduates to in-state teaching jobs, and its concerns about the
disparities in achievement between its rural and urban students, has begun work
on a restructuring of teacher compensation that would evaluate teachers at
several points in their careers and base salary increases on achievement of
increased skills (Blair, 2001, p. 1). In all of these programs, teacher professional
development and learning are an integral part of the evaluation structure.
The first stage in teacher evaluation remains licensure testing, which is subject to
the same professional and technical regulations as all licensure testing regardless
of the scope or sphere of practice in a job or profession. There has been much
debate in the past three years over the type, quality, and scope of licensure tests
for teachers, prompted in many cases by the Massachusetts teacher testing
controversy. The perceived importance of efforts to improve teacher quality,
evidenced by the Teacher Quality Enhancement Act (Title II) discussed above,
led the U.S. Department of Education to request that the National Academy of
Sciences establish a panel of experts, called the Committee on Assessment and
Teacher Quality, early in 1999. The committee's work and the structure of its
final report provide a useful framework for a discussion of current practice in
initial licensure testing for teachers.
According to the committee's final report, it "was asked to examine the
appropriateness and technical quality of teacher licensure tests currently in use,
the merits of using licensure test results to hold states and institutions of higher
education accountable for the quality of teacher preparation and licensure, and
alternatives for developing and assessing beginning teacher competence" (NRC,
2001, Executive Summary, pp. 2-3).
The NRC report on licensure testing explores five major areas: the appro-
priateness and technical soundness of current measures of beginning teacher
competence; the methods and consequences of decisions made about candidates
based on licensure tests; a technical evaluation of current licensure tests; the use
of teacher licensure tests to hold states and institutions of higher education
accountable for the quality of teacher preparation and licensure; and the value
of innovative measures of beginning teacher competency in helping to improve
teacher quality (NRC, 2001). The analyses and explorations in the full report are
informative and important, but the conclusions the Committee reached are not
startling. While the committee found that the current licensure tests they examined
were, in general, technically sound and appropriately developed for their purpose,
they asserted that "even a set of well-designed tests cannot measure all of the
prerequisites of competent beginning teaching. Current paper-and-pencil tests
provide only some of the information needed to evaluate the competencies of
teacher candidates" (NRC, 2001, Executive Summary, p. 5). They recommended
that states and regulatory bodies making decisions based on licensure tests follow
best-practice guidelines concerning standard setting and examine the impact of
their decisions not only on the competence of the licensed pool of teachers, but
on the diversity of the teacher work force, on teacher education programs, and
on the supply of new teachers (NRC, 2001, Executive Summary, p. 8). On the
issue of the technical quality of licensure tests currently in use, the committee's
conclusions were guarded, primarily because they were able to examine only
licensure tests developed and delivered by one of the two major developers of
such tests, Educational Testing Service. The committee report says in this regard,
Solid technical characteristics and fairness are key to effective use of tests.
The work of measurement specialists, test users, and policy makers suggests
criteria for judging the appropriateness and technical quality of initial
teacher licensure tests. The committee drew on these to develop criteria
it believes users should aspire to in developing and evaluating initial
teacher licensure tests. The committee used these evaluation criteria to
evaluate a sample of five widely used tests produced by ETS. The tests the
committee reviewed met most of its criteria for technical quality, although
there were some areas for improvement. The committee also attempted
to review a sample of NES tests. Despite concerted and repeated efforts,
the committee was unable to obtain sufficient information on the technical
characteristics of tests produced by NES and thus could draw no conclu-
sions about their technical quality. (NRC, 2001, Executive Summary, p. 9)
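The best-practice standard-setting guidelines the committee urges on states can be made concrete. The report does not prescribe a particular method, so the following Python sketch of a modified Angoff procedure, with invented judgments, is purely illustrative: each panelist estimates, item by item, the probability that a minimally competent candidate would answer correctly, and the recommended cut score is the sum of the per-item mean judgments.

    # Panelists' probability judgments for each item (invented numbers).
    judgments = {
        "item_1": [0.60, 0.70, 0.65],
        "item_2": [0.40, 0.50, 0.45],
        "item_3": [0.80, 0.75, 0.85],
    }

    # Modified Angoff cut score: sum over items of the mean judgment.
    cut_score = sum(sum(ps) / len(ps) for ps in judgments.values())
    print(f"Recommended raw cut score: {cut_score:.2f} of {len(judgments)} points")

Operational standard-setting studies add panelist training, impact data, and multiple rounds of judgment, but the arithmetic core is this simple.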
On the controversial issue of using licensure test scores to hold states and insti-
tutions of higher education accountable for the quality of teacher preparation,
the committee recommended that the federal government "should not use
passing rates on initial teacher licensing tests as the sole basis for comparing
states and teacher education programs or for withholding funds, imposing other
sanctions, or rewarding teacher education programs" (NRC, 2001, Executive
Summary, p. 15). Because this directly criticizes the federal Title II Teacher
Quality Improvement provisions, this recommendation is likely to be among the
most debated in the report. Finally, and not surprisingly, the committee recom-
mended investigation of new methods and systems of evaluation of beginning
teacher competencies (NRC, 2001, Executive Summary, p. 16). Several of these
innovative assessment methods are discussed below. One critical element in
teacher licensure, and its potential role in teacher quality improvement, that the
NRC does not discuss is the role of regulatory agencies in states, which have the
responsibility for setting the licensure requirements. More performance-based
measures exist for beginning teachers; they are not being used by states because
the regulations do not require them.
One issue in the debate over teacher licensure tests as a means to improve
teacher quality has been both how rigorous these tests need to be and how
"knowledge" for teachers should be defined and described (NRC, 2001). On the
question of rigor, one side of the debate is represented by The Education Trust's
1999 report on teacher tests entitled Not Good Enough (Mitchell & Barth, 1999).
The authors of that report contend that the teacher licensure tests they examined
are far too easy and that they do not represent what they call "college-level
understanding" of content (p. 3). The underlying assumption in this argument is
that all teachers need to demonstrate a command of discipline-specific
knowledge that is commensurate with academic tasks appropriate for college
majors in a discipline, as such tasks are defined by the authors of the report
(p. 17). The other side of the debate is represented operationally by the actual
content of the ETS licensure tests the NRC Committee on Assessment and
Teacher Quality examined. If the AERA/NCME/APA Joint Technical Standards
for test development are used to guide the determination of the test
specifications and content of test forms, then a long chain of evidence leads to
the actual test forms. This begins, in the case of ETS licensure test development,
with a careful cross analysis of relevant national and state standards, moves
through a job analysis and test specification draft, and culminates in test items
that are assembled into a form that is further validated by individual panels in
states that are considering adoption of the test as a licensing instrument (NRC,
2001, Executive Summary, pp. 6-7). Since these are the same tests decried as far
too easy by the proponents of more rigor, it is clear that the practitioner- and
job-based approach required by professional standards is currently far separated
from the approach of the proponents of "higher standards."
Another issue has been the call for predictive validity evidence for these
licensing tests. That is, there is mounting demand for a licensure test whose scores
can predict who will be an effective, or at least adequate, teacher (Dunkin, 1997;
Schaeffer, 1996; Pearlman, 1999). The licensing tests state regulatory agencies
have been willing to require are typically limited to tests of knowledge - basic
skills, professional knowledge (pedagogy, developmental psychology, etc.), and
content knowledge. Test scores are required before the test taker assumes a
job. What a prospective teacher will do with this knowledge is unknown on
the basis of this kind of test. Only a direct measure of the kinds of skills and
knowledge as they are used in the work of teaching could afford any predictive
validity, and that only after stringent analysis of the connections between what is
tested and what the job requires. Thus, demands for predictive validity for
current licensing tests seem unreasonable. The inherent difference between
"job-related" knowledge and knowledge that, if held, would predict performance
in a complex domain of professional judgment and behavior like teaching has
been largely ignored in this debate, as has the virtually universal absence of such
evidence for licensure tests across other jobs and professions. It is an accepted
assumption that teachers must know the content they will have to teach and also
know some principles of pedagogy, and most states require evidence of
that knowledge in licensure tests for teachers. Similarly, the legal profession
believes that lawyers must know some basic principles of the law and its
interpretation over the centuries, and the bar examinations, both the Multi-State
Bar Examination and state-specific bar examinations reflect that assumption.
The medical profession believes that doctors must know about human
anatomy, blood chemistry, and the like, and the medical licensure examination
reflects that belief. What is not known or asserted in any of these cases,
however, is how the newly licensed teacher, lawyer, or doctor will perform
in the actual job he or she is licensed to do after reaching the passing standard
on the licensing test. The measurement profession implicitly regards licensure
testing as necessary but not sufficient evidence of competence to
practice.
Teacher licensure testing represents a complex and often chaotic picture for
those interested in quality assurance for the national teacher workforce.
Regulating professional licenses in the U.S. is a right reserved for state govern-
ments or their designated regulatory arms. This means, for teacher licensure,
that there are 50 sets of rules, requirements, and regulations and a broad canvas
for individualized and often idiosyncratic licensure testing practices. When many
of these regulations were enacted, teachers tended to take jobs in the same states
in which they were trained for the profession. Thus, regulation of the quality of
the "pipeline" was largely a local concern, and state-specific regulations for
licensing could be said, at least in theory, to represent a considered weighing of
the usefulness of an independent criterion (the test) in addition to the trust-
worthiness of the preparation institutions' recommendations of graduates for
licensure.
The situation over the past 10 years has changed dramatically. Teachers have
become more mobile, states - and districts within states - compete for beginning
teachers, and there is a critical shortage of teachers in certain fields and certain
geographical areas (Edwards, 2000). Tests that offer some uniform and indepen-
dent view of a beginning teacher's knowledge have thus grown more important
as an indicator of one critical job-related criterion.
Sophisticated Tools for Teacher Assessment
The NRC Report on Assessment and Teacher Quality (NRC, 2001) ends with the
following recommendation concerning what it calls "innovative measures of
beginning teacher competence":
If one is to trust an inference based on a test score, then attention to the nature and
quality of the evidence gathered in the course of the assessment process, and
clear and explicit links between that evidence and what is purportedly being
measured are critical. Evidence-centered design strategies have been discussed
or at least written about in the research literature for more than a decade, as
cognitive scientists and psychometricians have discussed ways to apply
increasingly sophisticated theories of learning and performance to assessment,
and to much more closely couple assessment and learning (see Glaser, Lesgold,
& Lajoie, 1987; Mislevy, 1987, 1994, 2000; Mislevy et al., 1999). It is only in the
past five years, however, that some serious attempts to change the way
assessments are conceptualized and then used in operational settings have taken
place.
NATIONAL BOARD FOR PROFESSIONAL TEACHING STANDARDS
CONCLUSION
ENDNOTES
1 See the whole September/October 2000 issue of Journal of Teacher Education (Cochran-Smith,
2000). It is an attack on testing of students and teachers.
2 Consider, for example, that in 1996 there were approximately 700 first-time candidates, and by
2000 that number rose to nearly 7,000; approximately 14,000 candidates will participate in 2001.
3 The detailed knowledge of the development and implementation of this assessment program
evinced in this section is based on the authors' direct and ongoing roles in the development and
operational delivery of this assessment program on behalf of the NBPTS.
4 See the NBPTS website (http://www.nbpts.org) for a continuously updated list of all financial
incentives and support offerings for NBPTS candidates in the 50 states.
REFERENCES
Achieve. (1997). A review of the 1996 National Education Summit. Washington, DC: Author. (ERIC
Document Reproduction Service No. ED407705)
Allen, M. (1999, May). Student results and teacher accountability, ECS Policy Brief. Retrieved from
http://www.ecs.org/clearinghouse/12/28/1228.htm
American Association of Colleges for Teacher Education. (2000). Counterpoint Article: USA Today
January 31, 2000: "We Need Time to Do the Job Right." Retrieved from
http://www.aacte.org/governmentaIJelations/time_dojobJight.htm
American Educational Research Association. (2000). AERA position statement concerning high-stakes
testing in preK-12 education. Retrieved from http://www.aera.net/about/policy/stakes.htm
Baker, E.L., & Linn, R.L. (1997). Emerging educational standards of performance in the United States.
(CSE Technical Report 437). Los Angeles: University of California, National Center for Research
on Evaluation, Standards, and Student Testing.
Barton, P. (1999). Too much testing of the wrong kind: Too little of the right kind in K-12 education.
Retrieved from http://www.ets.org/research/pic/204928tmt.pdf
Berk, R.A. (1988). Fifty reasons why student achievement gain does not mean teacher effectiveness.
Journal of Personnel Evaluation in Education, 1(4), 345-364.
Blair, J. (2001, May 16). Iowa approves performance pay for its teachers. Education Week, 20(36),
1, 24, 25.
Blake, J., & Bachman, J. (1995, October). A portfolio-based assessment model for teachers:
Encouraging professional growth. NASSP Bulletin, 79, 37-47.
Bond, L., Smith, T., Baker, W.K., & Hattie, J.A. (2000, September). The certification system of the
National Board for Professional Teaching Standards: A construct and consequential validity study.
(Center for Educational Research and Evaluation, The University of North Carolina at
Greensboro). Washington, DC: National Board for Professional Teaching Standards.
Brabeck, M.M. (1999). Between Scylla and Charybdis: Teacher education's odyssey. Journal of
Teacher Education, 50, 346-351.
Bradley, A. (1999, March 10). States' uneven teacher supply complicates staffing of schools.
Education Week, 18(26), 1, 10-11.
Brandt, R. (1989). On misuse of testing: A conversation with George Madaus. Educational Leadership,
46(7), 26-30.
Briggs, D., Elliott, J., Kuwahara, Y., Rayyes, N., & Tushnet, N. (2001, March). The effect of BTSA on
employment retention rates of participating teachers. (WestEd Task 2A Report). San Francisco:
WestEd.
California Commission on Teacher Credentialing and the California Department of Education.
(1999). The CFASST Process. [Brochure]. Sacramento, CA: Author.
Cincinnati Federation of Teachers and Cincinnati Public Schools. (2000). Teacher evaluation system.
Cincinnati, Ohio: Author.
Cochran-Smith, M. (Ed.). (2000). Journal of Teacher Education, 51(4).
Cohen, D.K., & Spillane, J.P. (1993). Policy and practice: The relations between governance and
instruction. In S.H. Fuhrman (Ed.), Designing coherent education policy: Improving the system
(pp. 35-95). San Francisco: Jossey-Bass Publishers.
Connecticut State Department of Education. (1996a). The Beginning Educator Support and Training
Program (BEST). [Brochure]. Hartford, CT: Author.
Connecticut State Department of Education. (1996b). A guide to the BEST program for beginning
teachers and mentors. [Brochure]. Hartford, CT: Author.
Connecticut State Department of Education. (1998). Handbook for the development of an English
language arts teaching portfolio. [Brochure]. Hartford, CT: Author.
Council of Chief State School Officers. (1998). Key state education policies on K-12 education:
Standards, graduation, assessment, teacher licensure, time and attendance: A 50-state report. Washington,
DC: Author.
Curry, S. (2000). Portfolio-based teacher assessment. Thrust for Educational Leadership, 29, 34-38.
Danielson, C., & McGreal, T.L. (2000). Teacher evaluation to enhance professional practice.
Alexandria, VA: Association for Supervision and Curriculum Development.
Darling-Hammond, L. (1999). Teacher quality and student achievement: A review of state policy
evidence. (Tech. Rep. No. 99-1). University of Washington, Center for the Study of Teaching Policy.
Dunkin, M.J. (1997). Assessing teachers' effectiveness. Issues in Educational Research, 7, 37-51.
Dwyer, C.A., & Stufflebeam, D.L. (1996). Evaluation for effective teaching. In D. Berliner & R.
Calfee (Eds.), Handbook of research in educational psychology. New York: Macmillan.
The Education Commission of the States. (2001). A closer look: State policy trends in three key areas
of the Bush Education Plan - Testing, accountability, and school choice. Retrieved from
http://www.ecs.org/clearinghouse/24/19/2419.htm
Educational Testing Service. (2000). Candidate information bulletin: Test for teaching knowledge.
Princeton, NJ: Author.
Edwards, V.B. (2000, January 13). Quality Counts 2000 [Special issue]. Education Week, 19(18).
Elmore, R.F., & Rothman, R. (Eds.). (1999). Testing, teaching, and learning: A guide for states and
school districts. Washington, DC: National Academy Press.
Feistritzer, E.C. (1998, January 28). The truth behind the "teacher shortage." The Wall Street Journal,
p. A18.
Feuer, M.J., Holland, P.W., Green, B.F., Bertenthal, M.W., & Hemphill, F.C. (Eds.). (1999). Uncommon
Measures: Equivalence and Linkage Among Educational Tests. Washington, DC: National Academy
Press.
Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. American
Psychologist, 39, 193-202.
Frederiksen, J.R., & Collins, A. (1989). A systems approach to educational testing. Educational
Researcher, 18(9), 27-32.
Glaser, R., Lesgold, A., & Lajoie, S.P. (1987). Toward a cognitive theory for the measurement of
achievement. In R. Ronning, J. Glover, J.C. Conoley, & J.C. Witt (Eds.), The influence of cognitive
psychology on testing, Buros/Nebraska symposium on measurement, 3 (pp. 41-85). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Greenwald, R., Hedges, L.V., & Laine, R.D. (1996). The effect of school resources on student
achievement. Review of Educational Research, 66, 361-396.
Haertel, E. (1986). The valid use of student performance measures for teacher evaluation.
Educational Evaluation and Policy Analysis, 8(1), 45-60.
Haertel, E., & Calfee, R. (1983). School achievement: Thinking about what to test. Journal of
Educational Measurement, 20(2), 119-132.
Haladyna, T. (1992). Test score pollution: Implications for limited English proficient students. In
Proceedings of the Second National Research Symposium on limited English proficient student issues:
Focus on evaluation and measurement, 2, (pp. 135-163). Washington, DC: U.S. Department of
Education, Office of Bilingual Education and Minority Language Affairs.
Hamilton, L., Klein, S., & McCaffrey, D. (2001, April). Exploring claims about student achievement
using statewide tests. In Daniel Koretz (Organizer/Moderator), New work on the evaluation of
high-stakes testing programs. Symposium conducted at the National Council on Measurement in
Education's Annual Meeting, Seattle, WA.
Haney, W., Fowler, C., Wheelock, A., Bebell, D., & Malec, N. (1999). Less truth than error?: An
independent study of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(4).
Retrieved from http://epaa.asu.edu/epaa/v7n4/
Hartocollis, A. (2000, September 2). Hiring of teachers is more than a matter of decree. The New
York Times, p. B1.
Haas, N.S., Haladyna, T.M., & Nolen, S.B. (1990). Standardized testing in Arizona: Interviews and
written comments from teachers and administrators. (Technical Report 89-3). Phoenix, AZ: Arizona
State University West.
Heubert, J.P., & Hauser, R.M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and
graduation. Washington, DC: National Academy Press.
Hewitt, G. (2001). The writing portfolio: Assessment starts with a. Clearing House, 74, 187-191.
Interstate New Teacher Assessment and Support Consortium. (1992). Model standards for beginning
teacher licensing and development: A resource for state dialogue. Washington, DC: Council for Chief
State School Officers.
Interstate New Teacher Assessment and Support Consortium. (1999). INTASC tackles challenges of
new Title II reporting requirements. INTASC in Focus, 2(1), 1-6.
Interstate New Teacher Assessment and Support Consortium. (2002). Retrieved from
http://www.ccsso.org/intasc.html.
Jerald, C.D. (2000, January 13). The state of states. Education Week, 19(18), 62-65.
Jerald, C.D., & Boser, U. (2000, January 13). Setting policies for new teachers. Education Week,
19(18), 44-45, 47.
Koretz, D. (2001, April). Toward a framework for evaluating gains on high-stakes tests. In Daniel Koretz
(Organizer/Moderator), New work on the evaluation of high-stakes testing programs. Symposium
conducted at the National Council on Measurement in Education's Annual Meeting, Seattle, WA.
Kupermintz, H., Shepard, L., & Linn, R. (2001, April). Teacher effects as a measure of teacher
effectiveness: Construct validity considerations in TVAAS (Tennessee Value Added Assessment
System). In Daniel Koretz (Organizer/Moderator), New work on the evaluation of high-stakes
testing programs. Symposium conducted at the National Council on Measurement in Education's
Annual Meeting, Seattle, WA.
Land, R., & Herman, J. (1994). Report on preliminary observation scheme. Greensboro, NC: National
Board for Professional Teaching Standards, Technical Analysis Group, Center for Educational
Research and Evaluation, University of North Carolina-Greensboro.
LeMahieu, P.G., Gitomer, D.H., & Eresh, J. (1995). Portfolios in large-scale assessment: Difficult but
not impossible. Educational Measurement: Issues and Practice, 14, 11-16, 25-28.
Linn, R.L. (1987). Accountability: The comparison of educational systems and the quality of test
results. Educational Policy, 1, 181-198.
Linn, R.L. (2000). Assessments and accountability. Educational Researcher, 29, 4-16.
Linn, R.L., & Herman, J.L. (1997). A policymaker's guide to standards-led assessment. (ERIC
Document ED 408 680). Denver, CO: Education Commission of the States.
Linn, R.L., Baker, E.L., & Dunbar, S.B. (1991). Complex, performance-based assessment: Expectations
and validation criteria. Educational Researcher, 20(8), 15-21.
Lomask, M.S., & Pecheone, R.L. (1995). Assessing new science teachers. Educational Leadership, 52,
62-66.
Ludlow, L.H. (2001, February). Teacher test accountability: From Alabama to Massachusetts.
Education Policy Analysis Archives. Retrieved from http://epaa.asu.edu/epaa/v9n6.html
Luna, C., Solsken, J., & Kutz, E. (2000). Defining literacy: Lessons from high-stakes teacher testing.
Journal of Teacher Education, 51(4), 276-288.
Madaus, G.F. (1988). The influence of testing on curriculum. In L.N. Tanner (Ed.), Critical issues in
curriculum: Eighty-seventh yearbook of the National Society for the Study of Education (pp. 83-121).
Chicago: University of Chicago Press.
Madaus, G.F., & O'Dwyer, L.M. (1999). A short history of performance assessment: Lessons
learned. Phi Delta Kappan, 80, 688-695.
Manzo, K.K. (1997). U.S. falls short in a 4-nation study of math tests. Education Week. Retrieved
from http://www.edweek.org/ew/newstory.cfm?slug=35aft.h16
Massell, D., Kirst, M.W., & Hoppe, M. (1997). Persistence and change: Standards-based reform in nine
states. CPRE Research Report No. RR-037. Philadelphia: University of Pennsylvania, Consortium
for Policy Research in Education.
Melnick, S., & Pullin, D. (1999). Teacher education & testing in Massachusetts: The issues, the facts, and
conclusions for institutions of higher education. Boston: Association of Independent Colleges and
Universities of Massachusetts.
Melnick, S.L., & Pullin, D. (2000). Can you take dictation? Prescribing teacher quality through
testing. Journal of Teacher Education, 51(4), 262-275.
Messick, S. (1987). Assessment in the schools: Purposes and consequences (Research Report RR-87-
51). Princeton, NJ: Educational Testing Service. Also appears as a chapter in P.W. Jackson (Ed.)
(1988) Educational change: Perspectives on Research and Practice. Berkeley, CA: McCutchan.
Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103).
Washington, DC: American Council on Education.
Messick, S. (1990). Validity of test interpretation and use (Rep. No. 90-11). Princeton, NJ: Educational
Testing Service.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance
assessments. Educational Researcher, 23(2), 13-23.
Millman, J. (Ed.). (1997). Grading teachers, grading schools. Is student achievement a valid evaluation
measure? Thousand Oaks, CA: Corwin Press, Inc.
Millman, J., & Sykes, G. (1992). The assessment of teaching based on evidence of student learning: An
analysis. Research monograph prepared for the National Board for Professional Teaching
Standards, Washington, DC.
Mislevy, RJ. (1987). Recent developments in item response theory with implications for teacher
certification. In E.Z. Rothkopf (Ed.), Review of Research in Education, 14 (pp. 239-275).
Washington, DC: American Educational Research Association.
Mislevy, RJ. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 439-483.
Mislevy, R.J. (2000). Validity of interpretations and reporting results - Evidence and inference in
assessment (U.S. Department of Education, Office of Educational Research and Improvement
Award # R305B60002). Los Angeles, CA: National Center for Research on Evaluation,
Standards, and Student Testing (CRESST); Center for the Study of Evaluation (CSE); and
Graduate School of Education & Information Studies at University of California, Los
Angeles.
Mislevy, R.J., Steinberg, L.S., Breyer, P.J., Almond, R.G., & Johnson, L. (1999). Making sense of data
from complex assessment (Educational Research and Development Centers Programs, PR/Award
Number R305B60002). Princeton, NJ: Educational Testing Service, The Chauncey Group
International, and Dental Interactive Simulations Corporation.
Mitchell, R., & Barth, P. (1999). Not good enough: A content analysis of teacher licensing
examinations. Thinking K-16, 3(1).
National Association of State Directors of Teacher Education and Certification. (2000). T. Andrews,
& L. Andrews (Eds.), The NASDTEC manual on the preparation and certification of educational
personnel (5th ed.). Dubuque, IA: Kendall/Hunt Publishing Company.
National Board for Professional Teaching Standards. (2000). Assessments Analysis Report: 1999-2000
Administration. Southfield, MI: Author.
National Board for Professional Teaching Standards. (2002). Standards and Certificate Overviews.
Retrieved from www.nbpts.org
National Center for Education Statistics. (1997). NAEP 1996 mathematics report card for the nation
and the states: Findings from the national assessment of educational progress. Washington, DC:
Author.
National Center for Education Statistics. (2000). Reference and reporting guide for preparing state and
institutional reports on the quality of teacher preparation: Title II, higher education act (NCES 2000-
89). Washington, DC: Author.
National Commission on Teaching and America's Future. (1996). What matters most: Teaching for
America's future. New York: Author.
National Council for Accreditation of Teacher Education. (1997). Technology and the new professional
teacher: Preparing for the 21st century classroom. Washington, DC: Author.
National Council for Accreditation of Teacher Education. (2001). Professional standards for the
accreditation of schools, colleges, and departments of education. Washington, DC: Author.
National Research Council. (1999). M. Singer & J. Tuomi (Eds). Selecting Instructional Materials: A
Guide for K-12 Science. Washington, DC: National Academy Press.
National Research Council. (2001). K.J. Mitchell, D.Z. Robinson, & K.T. Knowles (Eds.), Testing
teacher candidates: The role of licensure tests in improving teacher quality. Washington, DC: National
Academy Press.
Ohio Department of Education. (2001). Ohio FIRST. Retrieved from
http://www.ode.state.oh.us/Teaching-Profession/Teacher/Certification_Licensure/ohio/
Olebe, M. (1999). California Formative Assessment and Support System for Teachers (CFASST):
Investing in teachers' professional development. Teaching and Change, 6(3), 258-271.
Olson, L. (2000a, January 13). Finding and keeping competent teachers. Education Week, 19(18),
12-16, 18.
Olson, L. (2000b, January 13). Taking a different road to teaching. Education Week, 19(18), p. 35.
Olson, L. (2001, January 24). States adjust high-stakes testing plans. Education Week, 20(19), 1, 18, 19.
Pearlman, M. (1999). K-12 math and science education - Testing and licensing teachers. Testimony of
Educational Testing Service before the House Science Committee, August 4. Princeton, NJ:
Educational Testing Service.
Popham, W.J. (1997). What's wrong - and what's right - with rubrics. Educational Leadership, 55(2),
72-75.
Sanders, W.L., & Horn, S.P. (1998). Research findings from the Tennessee Value-Added Assessment
System (TVAAS) database: Implications for educational evaluation and research. Journal of
Personnel Evaluation in Education, 12, 247-256.
Sanders, W.L., & Rivers, J.C. (1996). Cumulative and residual effects of teachers on future student
academic achievement. (Research Progress Report). Knoxville, TN: University of Tennessee Value-
Added Research and Assessment Center.
Schaeffer, B. (1996). Standardized tests and teacher competence. Retrieved from
www.fairtest.org/empl/ttcomp.htm
Schalock, H.D. (1998). Student progress in learning: Teacher responsibility, accountability, and
reality. Journal of Personnel Evaluation in Education, 12, 237-246.
Shepard, L.A. (1989). Why we need better assessments. Educational Leadership, 46(7), 4-9.
Shepard, L. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of research in
education (pp. 405-450). Washington, DC: American Educational Research Association.
Shinkfield, A.J., & Stufflebeam, D. (1995). Teacher evaluation: Guide to effective practice. Boston:
Kluwer Academic Publishers.
Smith, M., & O'Day, J. (1997). Systemic school reform. In S. Fuhrman & B. Malen (Eds.), The politics
of curriculum and testing. New York: Falmer Press.
Sommerfeld, M. (1996, March 27). High-level science exams in 5 nations studied. Education Week.
Retrieved from http://www.edweek.org/ew/newstory.cfm?slug=27aft.h15
Stecher, B. (1998). The local benefits and burdens of large-scale portfolio assessment. Assessment in
Education: Principles, Policy & Practice, 5, 335-352.
Stecher, B., & Arkes, J. (2001, April). Rewarding schools based on gains: It's all in how you throw
the dice. In Daniel Koretz (Organizer/Moderator), New work on the evaluation of high-stakes
testing programs. Symposium conducted at the National Council on Measurement in Education's
Annual Meeting, Seattle, WA.
Sterling, W. (1998, December 9). What is the Massachusetts teacher exam really testing?
Education Week, 19(15), p. 37.
Supovitz, J.A., & Brennan, R.T. (1997). Mirror, mirror on the wall, which is the fairest test of all? An
examination of the equitability of portfolio assessment relative to standardized tests. Harvard
Educational Review, 67, 472-506.
Sykes, G. (1997). The Dallas value-added accountability system. In J. Millman (Ed.), Grading teachers,
grading schools: Is student achievement a valid evaluation measure? (pp. 110-119). Thousand Oaks,
CA: Corwin Press.
Teacher Education Accreditation Council. (2000). Prospectus for a new system of teacher education
accreditation. Retrieved from http://www.teac.org
U.S. Department of Education (1999). The condition of education, 1999. Washington, DC: U.S.
Government Printing Office.
Valencia, S.W., & Au, K.H. (1997). Portfolios across educational contexts: Issues of evaluation,
teacher development, and system validity. Educational Assessment, 4, 1-35.
Walberg, H.J., & Paik, S.J. (1997). Assessment requires incentives to add value: A review of the
Tennessee Value-Added Assessment System. In J. Millman (Ed.), Grading teachers, grading schools:
Is student achievement a valid evaluation measure? (pp. 169-178). Thousand Oaks, CA: Corwin
Press.
Wolfe, E.W., & Gitomer, D.H. (2000). The influence of changes in assessment design on the
psychometric quality of scores. Applied Measurement in Education, 14(1), 91-107.
Wright, S.P., Horn, S.P., & Sanders, W.L. (1997). Teacher and classroom context effects on student
achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in Education, 11,
57-67.
APPENDIX A

Types of Evaluations and Decisions Involved in Preparation, Licensing, Employment, and Professionalization of Teachers

[Table: for the activities in each career stage, across the four stages in the career of a teacher - Preparation, Licensing, Practice, and Professionalization - the appendix tabulates the associated Evaluations and Decisions.]
29
Principal Evaluation in the United States
NAFTALY S. GLASMAN
University of California at Santa Barbara, CA, USA
RONALD H. HECK
University of Hawaii at Manoa, HI, USA
THE CONTEXT
CONCEPTUAL FRAMEWORKS
Role-Based Evaluation
Standard-Based Evaluation
Outcome-Based Evaluation
A PRESCRIPTIVE STRUCTURE
Although the field of principal evaluation has been labeled as fuzzy (Heck &
Marcoulides, 1992), our understanding of this role and its responsibilities is more
advanced than it was only two decades ago (Hallinger & Heck, 1996; Murphy,
1988). In this section of the chapter, we outline a prescriptive structure for
principal evaluation. Major components of this structure include the evaluation
purposes, the objects being evaluated, evaluation criteria, and the authority and
qualification to evaluate. Principals' actions constitute specific objects of
evaluation and may be observed in an effort to assess the effectiveness of their
performance.
In designing principal evaluation systems within any of the conceptual frame-
works that we have outlined, we must view the pertinent action in regard to some
specific evaluation purpose and collect the right data about principal actions with
respect to this evaluation purpose (Glasman, 1992). The type of data collected
and the manner in which it is reported back to principals will thus vary. Moreover,
the entire set of relationships including evaluation purposes, objects, criteria,
and method is embedded within two important contexts which themselves
overlap somewhat (Glasman & Glasman, 1990). One is the political context
within which policies are made at the state, district, and building levels. The
other is the evaluation context which has been politicized at the federal, state,
and local levels (Heck & Glasman, 1993). These contexts help determine the
conceptual frameworks that underlie the approach taken toward principal evalu-
ation.
Evaluation Objectives
Evaluation Objects
Evaluation Criteria
Once evaluation purposes and objects are determined, evaluation criteria are
needed in order to judge the quality of the performance. Criteria concern how
the merit of the evaluation object will be assessed. Judging the merit of evalua-
tion objects is one of the most difficult tasks in educational evaluation (Glasman
& Nevo, 1988). Debates concerning the most appropriate types of criteria have
existed since the study of the field of principal evaluation emerged (Ginsberg &
Thompson, 1992).
One approach is that evaluation should attempt to determine whether goals
have been achieved (e.g., Provus, 1971; Tyler, 1950). It depends, however, on
what is meant by "goal achievement." Some goals may be trivial, while other
stated objectives may not be worth achieving. The achievement of "important"
goals is one possible way to deal with evaluation criteria (Glasman & Nevo,
1988). At one time, evaluators looked at this as the effectiveness versus efficiency
argument. Effectiveness is concerned with the extent to which a goal is attained
(accountability), whereas efficiency is concerned with how much one gets with
how many resources. Currently, accountability and effectiveness appear to be
dominating.
A second approach concerns meeting standards that are set. This could
involve developing some type of criteria-based standards derived from the
performance that is desired on given tasks or functions of the role. More
specifically, an example might be a role-based approach that emphasizes the
managerial conceptualization of principal leadership. In this approach, standards
could be set by experts or other relevant groups about the important aspects of
the job. For example, the National Association of Secondary School Principals
and the National Association of Elementary School Principals established the
National Commission for the Principalship for preparing and licensing principals.
They argued that most principal licensing requirements were irrelevant to
current demands and that preparation should reflect the day-to-day realities of
operating schools (National Commission for the Principalship, 1990). The National
Commission identified 21 functional domains and attempted to delineate the
knowledge and skill base for principals in each domain (Thompson, 1993).
The approach is not without problems, however. There has been considerable
debate over what should be the core knowledge and skills required for school
leaders. While the National Commission for the Principalship laid out functional
domains, some have questioned the wisdom of attempting to develop a
knowledge and skill base at all (Bartell, 1994; Donmoyer, 1995). Given the varied
nature of the principal's work, the problems of specificity of tasks and the
situational nature of the job (Ginsberg & Thompson, 1992; Glasman & Heck,
1992; Hart, 1992), it is difficult to specify what domains, tasks, and behaviors
principals should pursue beyond very broad areas such as decision making,
teacher supervision, facilities management, and school improvement. Because
organizations such as schools are socially constructed, participants must share in
the defining and solving of problems. The evaluation criteria for each principal
may be closely related to the problems of each specific school. Because of the
complexity of interrelationships among the school's context, its learning-related
processes, and student outcomes, it is likely that the assessment of the
performance of one principal would be quite different from that of the
performance of another principal. It is also difficult to specify how "good"
someone should be in each of these defined domains (i.e., what level of skill one
should possess).
Whatever evaluation criteria are chosen should meet the evaluation standards
described in the Joint Committee on Standards for Educational Evaluation
(1988) publication on the evaluation of teachers and other educational personnel.
As we have suggested, these include standards addressing utility, propriety,
feasibility, and accuracy. Glasman and Martens (1993) found documentation of
activities related to principal performance to be relatively informal and lacking
in detail. Perceived weaknesses included an absence of formal observation
formats, a lack of written comments and conferencing following each observa-
tion or other data collection method, and few formative reports during the
evaluation process. Principals perceived as problematic the lack of formal training
for evaluators, little time and emphasis on evaluation process development, and
poor psychometric qualities of the measurements taken. Principals perceived
that utility evaluation standards had the top priority, possibly because they are
intended to make evaluation informative, timely, and influential.
As Ginsberg and Thompson (1992) argue, principal evaluation criteria may
not necessarily be based on a preset standardized set of goals for all principals.
Criteria could also be developed for principal accountability for school outcomes.
As we have suggested, a number of states have passed legislation that allows the
state to move to replace principals and staff in schools that are identified as low
producing over time. In these states, this legislation tends to move the authority
for evaluation away from districts and to the state level. Because of the variety of
influences on student achievement, however, it is unclear exactly to what extent
principals should be held accountable for outcomes. This makes the delineation
of criteria and associated evaluation standards more difficult to determine.
Significant questions remain with regard to principal evaluation criteria in
light of the fact that this type of evaluation typically requires more sophisticated
analyses to determine, for example, how much educational value schools contri-
bute to student learning (Heck, 2000). The types of criteria could include how
well the school has done (i.e., the level of school performance in terms of raw
scores), how much the school has added (i.e., educational value added by the
school over and above student composition effects), or how well the school
compares to other schools (i.e., how well it has met a standard set of criteria like
graduation rates, percent of students who achieve a particular standard).
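The difference between these criteria can be illustrated with a deliberately simplified Python sketch (the data are invented, and operational systems such as TVAAS rest on far more elaborate longitudinal mixed models). Current scores are regressed on prior scores across all students, and a school's value-added estimate is its students' mean residual, that is, performance over and above what intake would predict.

    import numpy as np

    # Invented data: prior- and current-year scores for six students in two schools.
    prior   = np.array([48.0, 52.0, 55.0, 60.0, 45.0, 58.0])
    current = np.array([53.0, 55.0, 61.0, 66.0, 47.0, 65.0])
    school  = np.array(["A", "A", "A", "B", "B", "B"])

    # Ordinary least squares fit of current = a + b * prior over all students.
    X = np.column_stack([np.ones_like(prior), prior])
    (a, b), *_ = np.linalg.lstsq(X, current, rcond=None)

    # A school's value-added estimate: the mean residual of its students.
    residuals = current - (a + b * prior)
    for s in ("A", "B"):
        print(s, round(float(residuals[school == s].mean()), 2))

A raw-score criterion would rank these schools by mean current score alone; the residual criterion adjusts for intake and, in this invented example, reverses that ranking.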
Before one deals with evaluation methodology, one needs to ask who is to be
involved in the appraisal of principals. In the past, most systems have been
hierarchical (Fletcher & McInerney, 1995; Newman, 1985). This means that
principals are evaluated by superiors (assistant superintendents or super-
intendents). Other evaluators have also been proposed (Garrett & Flanigan,
1991). For example, Blase (1987) and Ballou and Podgursky (1995) argue that
teacher feedback would be an important part of the process. Garrett and
Flanigan suggest that information from parents could also be utilized in the
evaluation. These groups traditionally have not had the authority to evaluate
educational professionals. Duke and Stiggins (1985) and Brown, Irby, and
Neumeyer (1998) suggest that principal peer evaluation would be desirable.
Others have contended that self-assessment be a part of the process (Brown,
Irby, & Neumeyer, 1998). For example, the use of portfolios has gained popu-
larity in the 1990s in programs to prepare administrators for licensing (Barnett,
1995). The portfolios include materials documenting what the individual has
learned over a period of time - ability to apply learning concepts to completing
complex tasks, personal reflections, letters of support from others with whom the
trainee has worked (McCarthy, 1999). The strength of this method is that the
principal is the one who knows what he or she did. On the other hand, this
assessment is subjective.
The choice of evaluator has both political (authority to evaluate) and technical
(knowledge and skills) implications for how the evaluation is to be conducted. It
can also affect the types of information that will be collected (Clayton-Jones et
al., 1993). Earlier work about the authority to evaluate educational personnel
derived from the definition of authority as "legislated power to act" (Gorton &
Snowdon, 1993). With specific regard to authority to evaluate, then, the assump-
tion used is that she or he who has the legislated power to act (e.g., guide, give
instructions) has the legislated power to evaluate that person whose work s/he
supervises (Dornbush, 1975). If the authorized evaluator is the sole data source
for the evaluation of personnel, then the evaluator also has the legislated power
to choose the criteria for evaluation.
In the domain of principal evaluation, there are several data sources, however.
This is so because of the complexity of the role of the principal and the multiple
interactions the role incumbent has with other people who could serve as
evaluation data sources. The choice of these sources is up to the authorized
evaluator. Typically, the evaluator chooses sources which qualify as relevant data
sources, that is, individuals who are exposed to actions and behaviors of the
principal and who have clearly articulated criteria which they can use to describe
what they see and hear and over which they can render judgment about the
worth of their data (Glasman, Killait, & Gmelch, 1974). These sources also need
to understand the unique set of the principal's job expectations and the context
within which s/he works (Glasman & Nevo, 1988; Hoy & Miskel, 1982).
METHODOLOGY
Types of Information
Many kinds of information can be collected regarding each object that is being
evaluated (Glasman & Nevo, 1988). Because of the complexity of assessing the
principal's role (e.g., cognitive processes, actions, effects), evaluation should
reflect critical perspectives on the role, be flexible enough to allow for variation
in how the role is perceived, be multidimensional enough to cover central
aspects or purposes of the evaluation, and include multiple information sources
(e.g., different types of evaluators, different methods of data collection).
Inconsistencies in evaluation can develop because of differences in the principal's
work across school levels and specific school sites and the nature of evaluation
decoupled from context.
As we have suggested, one issue to be considered is the types of purposes on
which to focus - role-based aspects such as the principal's skills and tasks (e.g.,
communication, monitoring instruction, disciplining students), processes in which
the principal engages (e.g., school planning decisions, resolving instructional
problems, planning staff development), or outcomes (e.g., implementation of
programs, effects of programs). Stufflebeam's (1972) CIPP (context, input,
process, product) evaluation approach, for example, suggests focusing on four
sets of information pertaining to each evaluation object: the goals of the object;
its design; its implementation; and its outcomes. From this standpoint, the
evaluation of the principal might include information collected about his or her
goals in terms of improving the school, the quality of the improvement plan, the
extent to which the plan is carried out, and its results.
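Expressed compactly, the mapping might be sketched as follows (an illustrative Python rendering of this example, not Stufflebeam's own notation; the facet labels follow the CIPP acronym):

    # CIPP-style information sets for one evaluation object (the principal's
    # school improvement effort, paraphrasing the example above).
    cipp_information = {
        "context": "the principal's goals for improving the school",
        "input": "the quality of the improvement plan",
        "process": "the extent to which the plan is carried out",
        "product": "the results of the plan",
    }

    for facet, information in cipp_information.items():
        print(f"{facet:>7}: {information}")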
A second issue is what type of information to collect about the principal's
tasks, skills and behavior. One type of information is that which is collected
directly about the principal's performance (e.g., ratings, observations, interviews,
checklists of tasks completed). Regarding principal ratings, a major concern in
the literature on principal evaluation has been the development of valid and
reliable instruments to assess principal effectiveness (Pitner & Hocevar, 1987).
There have been criticisms raised about the instruments used to collect data
about principal performance (Heck & Marcoulides, 1992). Often, instruments
used to examine administrative job performance are constructed in a manner
that makes it difficult to measure actual job performance (Marcoulides, 1988). A
supervisor is most often asked to judge performance as "below average," "average,"
or "above average," generally without operational definitions of these terms or
guidance as to the observations on which the judgments should be based. Early
attempts typically used ratings concerned with competencies (Ellett, 1978; Gale &
McCleary, 1972; Walters, 1980). Typically, such ratings are not directed at
professional growth (Fontana, 1994).
Another type of information collected that has been used as part of the
principal evaluation process is data about the school (e.g., its processes and
outcomes). In this type of evaluation, the data collection tends to be more
indirect, in that it focuses on the results of efforts, as opposed to the efforts
themselves. The data could include student achievement information, dropout
rates, patterns of course taking, and discipline. Some of this type of information
is collected as part of states' "report card" or educational indicators systems
(Heck, 2000). This approach can help ascertain how much educational value is
added (e.g., improvement). It can also enhance comparisons among schools in
terms of how well schools have reached performance standards. Currently,
however, there is little consistency in the types of information collected for
principal evaluation or in how this information is used.
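To make the value-added idea concrete, the following minimal sketch uses invented data and a simple one-variable least-squares adjustment (far cruder than the models discussed by Heck, 2000): a school's value added is treated as the gap between its observed mean score and the score predicted from student composition alone.

    # Hypothetical illustration: estimate "value added" as the residual of
    # school mean achievement after adjusting for one composition measure.
    import numpy as np

    rng = np.random.default_rng(0)
    n_schools = 50
    pct_low_income = rng.uniform(0, 100, n_schools)   # composition measure
    school_effect = rng.normal(0, 3.0, n_schools)     # unobserved school contribution
    mean_score = 80 - 0.2 * pct_low_income + school_effect

    # Fit mean_score = a + b * pct_low_income by ordinary least squares.
    b, a = np.polyfit(pct_low_income, mean_score, 1)
    predicted = a + b * pct_low_income

    value_added = mean_score - predicted              # above/below expectation
    print("School 0 value-added estimate:", round(float(value_added[0]), 2))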
Evaluators
There are other issues involved with who is providing the information used in the
evaluation. Some empirical work exists on the quality of assessments using
various evaluators. In one examination of teacher ratings of their principal,
Marcoulides and Heck (1992) determined that teacher ratings were subject to
considerable error (almost half of the observed variance in the leadership
effectiveness score was due to combined teacher error components). They
concluded that the number of teachers providing information about the principal's
performance is extremely important in obtaining reliable estimates. There may
be other rating biases due to ethnicity, gender, or experience. For example,
Ballou and Podgursky (1995) determined that teachers rated principals of the
same ethnicity and gender higher. In another exploratory study of others'
assessments of principal leadership, Heck (2000) found considerable consistency
among parent, student, and staff ratings.
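The dependence of rating reliability on the number of raters can be illustrated with the standard Spearman-Brown projection; the single-rater reliability value below is hypothetical, not a figure from the studies cited.

    # Reliability of the mean of k independent ratings, given the reliability
    # of a single rating (Spearman-Brown). Illustrative values only.
    def mean_rating_reliability(single_rater_reliability: float, k: int) -> float:
        r = single_rater_reliability
        return k * r / (1 + (k - 1) * r)

    for k in (1, 5, 10, 20):
        print(k, round(mean_rating_reliability(0.50, k), 2))
    # With a single-rater reliability of .50, averaging over 1, 5, 10, and 20
    # teachers yields roughly .50, .83, .91, and .95, respectively.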
Another evaluator is the principal him- or herself. An obvious strength of this
approach is that principals are close to the source regarding evaluation objects.
A weakness is that the information is more subjective. In one study of self-
assessment versus teacher assessment of the principal's leadership effectiveness,
Heck and Marcoulides (1996) determined that principals rated their own
performance systematically higher across several dimensions of the role, as
compared to their teachers' assessments. Marcoulides and Heck (1993) also
identified inconsistencies in how assistant principals rated their leadership skills
in several domains versus how their supervising principals rated their assistant
principals' leadership skills. No systematic rating bias emerged in this latter
study, however, across the two sets of evaluators.
Despite the various problems associated with principal evaluation
methodologies, the fundamental stages of the evaluation process generally are
agreed upon (Stufflebeam & Webster, 1988). First, the choice of an evaluation
purpose is made. Then, the evaluation objects (evaluands in the evaluation jargon)
are determined in the direction "from the general to the most specific." Once the
specific objects are determined, choices of data sources are made as well as choices
of data collection procedures. Analyses of the data and data reporting follow.
With regard to principal evaluation, several basic issues have been raised and
dealt with, some successfully, and some are still being debated. For example, as
we have noted, the rating scales which are used in some questionnaires are
insufficiently descriptive. The rating is too abstract as a result. Other instruments
may not meet various evaluation standards (e.g., accuracy). The "growth
contracting" which has emerged with outcome-based evaluation has left unresolved
the question of whose responsibility it is to define the levels of possible outcomes
(Glasman & Heck, 1996). For the future, such contracting may include self-
evaluation in the main, which will focus on areas for improvement, plans for
improvement, and long-range personal goals.
Finally, a proposed use of ad hoc committees (e.g., site councils) for principal
evaluation is only emerging in selected districts. The outstanding issues here are
ways of keeping the evaluation confidential, the ways to agree on evaluation
criteria, the ways to respond to demands for open disclosure of the evaluation
process, the ways to increase validity, and the ways to decrease the time spent on
the evaluation.
Evaluation Consequences
In this section, we provide five examples of principal evaluation systems that are
currently in use in different parts of the United States. These examples are by no
means inclusive of all evaluation systems in use. They represent, however,
different evaluation approaches. Table 1 summarizes these five evaluation systems
in some detail.
Three of the five are formative systems. One is role-based. Another is also
role-based, but is in the process of adding standard-based elements as well.
These are based mainly on performance of duties and job responsibilities. The
elements do not appear to be tied to the meeting of specific goals.
The third is standard-based. In comparison, the other two of the five evalu-
ation systems are more summative and accountability centered. One employs a
standards-outcomes approach in combination. The other is outcome-based.
These latter two systems have been recently reconfigured with a stronger
emphasis on outcomes as part of the evaluation objects. These include elements
of standard setting for principals, in terms of identifying goals and working
toward the achievement of those goals. The assessment appears to be increasingly
tied to meeting these goals. These systems have stronger sanctions for failure to meet
goals (e.g., removal of the principal) than the formative systems.
Table 1 suggests that principal evaluation purposes clearly determine which
objects are chosen for evaluation. For example, school districts which restrict
their purpose to that of improvement seem to choose evaluation objects such as
the principals' execution of formally assigned responsibilities. Districts which are
in need of making summative decisions seem to focus the evaluation on
principals' efforts associated with overcoming deficiencies found in the school
(e.g., curriculum, teachers, student achievement).

Table 1. Evaluation Practices in Selected School Districts

District 1
  Size: Midsized. Location: Rural.
  Evaluation purposes: 1. Improve leadership; 2. Assess competence.
  Evaluation objects: 1. Decision-making behaviors; 2. Interpersonal relations; 3. Supervision of students and learning environment; 4. Managing school plant; 5. Monitoring student progress.
  Evaluator: Superordinate.
  Evaluation methodology decisions: Made by evaluator.
  Evaluation consequences: 1. Evaluatee-evaluator discussions and follow-up; 2. Rating.
  Overall evaluation approach: Role-based.

District 2
  Size: Very large. Location: System.
  Evaluation purposes: 1. Improve leadership; 2. Communicate expectations.
  Evaluation objects: 1. Management of all assigned responsibilities; 2. Unifying staff; 3. Commitment to school improvement.
  Evaluator: Superordinate.
  Evaluation methodology decisions: Made by evaluator.
  Evaluation consequences: 1. Evaluatee-evaluator discussions and follow-up; 2. Rating.
  Overall evaluation approach: Role-based with increased inclusion of standards.

District 3
  Size: Midsized. Location: Suburban.
  Evaluation purposes: 1. Improve leadership; 2. Meet performance standards.
  Evaluation objects: 1. Instructional leadership; 2. School management; 3. Focus on achievement; 4. School-parent relations.
  Evaluator: Superordinate.
  Evaluation methodology decisions: Made by a committee headed by the evaluator.
  Evaluation consequences: 1. Evaluatee-evaluator discussions and follow-up; 2. Rating; 3. Sharing results with school board.
  Overall evaluation approach: Standard-based.

District 4
  Size: Large. Location: Urban.
  Evaluation purposes: 1. Reassign; 2. Place on probation; 3. Remove.
  Evaluation objects: 1. Leadership in developing curricular frameworks; 2. Attracting and retaining good teachers; 3. Student achievement scores.
  Evaluator: Superordinate.
  Evaluation methodology decisions: Made by the evaluator on the basis of recommendations made by an external and, later, an internal committee.
  Evaluation consequences: 1. Reassignment; 2. Placement on probation; 3. Removal.
  Overall evaluation approach: A standard-outcome combination.

District 5
  Size: Very large. Location: Urban.
  Evaluation purposes: 1. Remove; 2. Provide incentives.
  Evaluation objects: 1. Student achievement scores; 2. Efforts to enhance student achievement scores.
  Evaluator: Superordinate.
  Evaluation methodology decisions: Included in contract agreements by the school board and the administrator association.
  Evaluation consequences: 1. Removal; 2. Providing incentives; 3. Loss of tenure as administrators.
  Overall evaluation approach: Outcome-based.
The evaluation methodologies follow suit. In the formative principal evaluation
systems included in Table 1, the principal's superordinate alone makes the design
choices, with only minimal informational consultation. In the outcome-based
systems, years are needed to study the relevant design parameters and their
acceptability to all relevant stakeholders. While the final design choices are
made here, too, by the principal's superordinate (e.g., superintendents, boards of
education), the input s/he receives is formal and known to the public.
FURTHER THOUGHTS
So far, this chapter has addressed the increased significance of principal evaluation
as a function of changing conditions of education. We also provided a conceptual
framework for viewing principal evaluation systems and a variety of related
considerations found in the literature pertaining to school principals as evaluatees.
A brief outline of selected evaluation systems in current use then followed. We
now close this chapter with further thoughts about principal evaluation practice,
training and research.
Practice
(a) data gathered and judgments made in the process of making the employ-
ment decision;
(b) organizational expectations regarding vision and task;
(c) indicators of the extent to which the expectations will be met.
The extent to which school districts need to, and/or wish to, follow this logic is
unknown at this time. One could contend that improvement of training (pre-
service and inservice) of both evaluators and evaluatees in this case may lead to
improvement of principal evaluation itself.
Training
Research
30
Evaluating Educational Specialists
JAMES H. STRONGE
College of William and Mary, VA, USA
INTRODUCTION
How do you evaluate the array of professional support staff in today's schools
who are neither administrators nor teachers? Counselors, school psychologists,
librarians/media specialists, school nurses, and other professional support
personnel represent a growing and invaluable group of educators who fulfill an
array of duties and responsibilities which are fundamental to the support of
students, teachers, and, indeed, the entire educational enterprise (Stronge, 1994;
Stronge & Tucker, 1995a). "As American schools in recent decades have labored
to include and educate all children, the role of these specialists has expanded to
serve these students' many and diverse needs" (Stronge & Tucker, 1995a, p. 123).
Despite their growing importance in contemporary schooling, evaluation of
educational specialists has been relatively rare, uneven, and inadequate (Gorton
& Ohlemacher, 1987; Helm, 1995; Lamb & Johnson, 1989; Norton & Perlin,
1989; Stronge & Helm, 1991; Stronge, Helm, & Tucker, 1995; Tucker, Bray, &
Howard, 1988).
Strategic plans, mission statements, and school improvement efforts are
important documents for defining current priorities and future goals but they are
not sufficient alone. There must be quality people to implement those plans and
programs, make improvements, and work toward fulfilling the school's mission.
No school or program is better than the people who deliver its services. And
related to the delivery of quality services, both programs and the people
responsible for implementing them must be assessed on a regular basis to
maintain and improve performance. Evaluation, therefore, can be a catalyst for
school improvement and effectiveness (Stronge & Tucker, 1995b).
As with teachers and administrators, the basic need for educational specialists
in a quality personnel evaluation system is for a fair and effective evaluation
based on performance and designed to encourage improvement in both the
employee being evaluated and the school (Stronge, 1997). The purpose of this
chapter is to explore key elements for constructing and implementing quality
evaluations for all educational specialists. Specifically, the chapter addresses the
following questions:
A common thread that runs through the job expectations of these varied educa-
tional specialist positions is that they include "those educators who work with
students or faculty in a support capacity" (Helm, 1995, p. 105). In essence, these
individuals play a fundamental role in fulfilling the school's mission.
Over the last two decades, as schools have become responsible for providing
and coordinating a wider array of educational and related services, the number
and types of specialized support staff have increased dramatically. In a typical
elementary school, 30-45 percent of the professional school staff who walk
through the schoolhouse door each day likely are educational specialists of one
variety or another (Stronge & Tucker, 1995b).1
"As the most visible professional within the school environment, the classroom
teacher has been evaluated for as long as we have had schools" (Stronge &
Tucker, 1995a, p. 123). As Shinkfield and Stufflebeam (1995) noted, "Socrates'
pupils undoubtedly had opinions about his teaching skills in the 5th century BC"
(p. 9). While formal evaluation of teachers may not have occurred on a
systematic and large-scale basis until the turn of the 20th century, there is a deep
and growing base of knowledge regarding the ways and means of effectively
evaluating teaching (see, for example, texts such as Millman & Darling-
Hammond, 1990; Peterson, 1995; Shinkfield & Stufflebeam, 1995; Stronge,
1997). Unfortunately, research and knowledge of best practice related to the
evaluation of educational specialists lags far behind that of teachers (Stronge &
Tucker, 1995a). To illustrate this point, a search of journal articles and manuscripts
indexed in the Educational Resources Information Center (ERIC)
revealed the following frequency counts (shown in Table 1) during the selected
time periods.
Research from disciplines such as counseling, school psychology, and library
science repeatedly cites the use of inappropriate evaluation criteria and
instruments with educational specialists in the schools and the lack of valid,
conceptually sound evaluation procedures (Kruger, 1987; Norton & Perlin, 1989;
Turner & Coleman, 1986; Vincelette & Pfister, 1984). This is in sharp contrast to
the "wide consensus on the general goals of evaluation and areas of competence
for teachers" (Sclan, 1994, p. 25). Only in recent years have the major job
responsibilities for educational specialists begun to be defined by various state
education agencies and national organizations representing the respective
professional groups.
A number of studies have documented the inappropriate or inadequate
evaluation of educational specialists. Helm (1995) noted that in two educational
specialist positions for which there exists a significant body of literature and
extensive evaluative activity (the school counselor and the school library-media
specialist), "we still find inappropriate evaluation a common complaint" (p. 106).
In a study of counseling practices in the state of Nevada, Loesch-Griffin,
Spielvogel, and Ostler (1989) found that 63 percent of the counselors were
evaluated using an instrument not specifically designed for school counselors.
Similarly, Gorton and Ohlemacher (1987) found that approximately 38 percent
of secondary school counselors in Wisconsin were evaluated with a teacher
evaluation form with most of the criteria either inappropriate or inapplicable to
the evaluation of counselors. Only 17 percent of the counselors were evaluated
on the basis of explicit, written criteria specific to their professional roles.
Table 1. Frequency of Selected Uses of the Descriptor "Evaluation" in ERIC

ERIC Search Descriptors           1966-1989    1990-1999
Teacher Evaluation                    4,441        2,816
Counselor Evaluation                    327          136
Media Specialist Evaluation              29           12
School Psychologist Evaluation           99           41
Regarding library media specialists, Scott and Klasek (1980) found in a study
of 80 northern Illinois high schools that almost 98 percent of the criteria on the
evaluation forms were the same criteria used for classroom teachers. Coleman
and Turner (1985) found only marginal improvement in their national study, in
which "slightly more than one-third of the state education agencies either
mandated or recommended procedures or forms for evaluating school library
media specialists" (cited in Helm, 1995, p. 106).
As shown in Table 2, more than half (55 percent) of the responding states provided
guidelines to local educational agencies regarding teacher evaluation in com-
parison to 38 percent or less that provided guidelines for the evaluation of
selected educational specialist positions. The evaluation of counselors received
the most attention of all the educational specialists (38 percent), while the fewest
number of states provided evaluation guidelines for school social workers
(17 percent) and school nurses (12 percent).
Formal Training
Technical Assistance
[Figure 1, a bar chart not reproducible here, compares for each position (teacher, counselor, school psychologist, school social worker, school nurse, media specialist) the percentage of states (N = 42) providing evaluation guidelines with the percentage providing formal training.]
Figure 1: A National Survey of Guidelines Versus Formal Training for Educational Specialists
students. Given the diverse nature of their jobs, however, a pure classroom
instructional model of evaluation which samples a restricted set of job duties
simply does not satisfy their evaluative needs.
These itinerant personnel, while part of a school's staff, are shared with
other schools. They are here today and gone tomorrow and, under such
an amorphous work schedule, fall outside the normal control loop of the
school. Problems related to this change in organizational structure are
especially acute when it comes to evaluation. Who is responsible for
monitoring performance and conducting the evaluation? Is it the principal
in School A or School B? Or is it both? Or is evaluation now charged to
someone entirely outside both schools, such as the special education
director or school nurse supervisor? Under any one of these ... evaluative
scenarios, one point is clear: evaluating many educational specialists is
especially complex. (Stronge, Helm, & Tucker, 1995, p. 18)
• input from client groups (i.e., informal or formal surveys from parents, students,
classroom teachers, or other collaborating peers);
• collection and assessment of job artifacts (portfolios, dossiers, copies of
pertinent correspondence, relevant news clippings, etc.); and
• documentation of the educational specialist's impact on
student success.
Observation
As previously mentioned, observation does not hold the same potential for
acquiring useful information about the job performance of educational specialists
that it does for documenting teachers' classroom performance. Specialists such
as librarians/media specialists, nurses, counselors, school psychologists, and
others spend much of their time engaged in activities that would either be:
Client Surveys
Surveying and interviewing - both formally and informally - those with whom an
educational specialist works about their perceptions of that employee's
effectiveness constitutes an important source of documentation. This is
particularly true in view of the fact that data collection through traditional
observational channels is limited. A more complete picture of an employee's
performance can be obtained by surveying a representative sampling from the
various constituencies with whom the specialist works. The following example
may help illustrate the use of 360-degree evaluation (Manatt, 2000; Manatt &
Benway, 1998; Manatt & Kemis, 1997).
A librarian/media specialist's work can impact the teaching staff, the students
in the school, and the library/media aides (adult or student) who assist the
individual in his or her job. Therefore, it would be appropriate to survey the
teaching staff with special attention to those who use the library and media
equipment most heavily in connection with their classes. Surveys of teachers
could elicit perceptions about the library/media specialist's effectiveness in
meeting their needs, ability to keep equipment in working order, and so forth.
Students might be surveyed to determine their use of the library, their experiences
in receiving help when needed, and their perceptions of the responsiveness of the
library staff to their questions. Above all, however, the questions should be
designed with the employee's specific job responsibilities in mind and with
recognition that these data merely constitute one source of evaluative informa-
tion (Stronge & Helm, 1991, pp. 180-182).
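As a minimal sketch of how such multi-source results might be combined, the following uses invented group names, weights, and scores; in practice, the items and weights would be tied to the specialist's documented job responsibilities, and the composite would remain only one data source among several.

    # Hypothetical 360-degree composite for a librarian/media specialist.
    surveys = {
        "teachers": {"mean_score": 4.2, "n": 18},
        "students": {"mean_score": 3.8, "n": 120},
        "aides": {"mean_score": 4.5, "n": 3},
    }
    weights = {"teachers": 0.5, "students": 0.3, "aides": 0.2}  # invented weights

    composite = sum(weights[group] * data["mean_score"]
                    for group, data in surveys.items())
    print(f"Weighted composite rating: {composite:.2f}")  # 4.14 on a 5-point scale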
Development Phase
1. Identify system needs;
2. Develop roles and responsibilities for the job;
3. Set standards for job performance;
Implementation Phase
4. Document job performance;
5. Evaluate performance;
6. Improve/Maintain professional service.
Development Phase
Each school has specific needs that relate to its mission and that are met through
the collective performance of all personnel (e.g., principals, classroom teachers,
resource specialists, counselors). A systematic examination of the needs of the
school's constituents will help clarify its mission and purpose. Goals should be
developed within the context of the greater community and in consideration of
relevant variables such as financial and personnel resources. School or district-
wide goals often are found in a mission statement, a set of educational goals, a
multi-year school plan, or a strategic plan.
Once school goals have been established, attention should turn to the matter
of translating those goals into operational terms. One logical way of accomplish-
ing this task is to consider all programs (e.g., math curriculum, guidance-
counseling services, athletic program) in light of whether they contribute to the
accomplishment of the school's goals, and, then, to relate program objectives to
position expectations (Stronge & Tucker, 1995a). In essence, a domino effect is
initiated in this line of planning and evaluation:
• school goals dictate which programs are relevant to the school's mission;
• program objectives dictate what positions are needed in order to fulfill
those objectives; and finally,
• individual positions are implemented (and, ultimately, evaluated)
in light of the duties and responsibilities dictated by the program
objectives (see the sketch below).
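A minimal sketch of this chain, using invented names that loosely echo the school counselor example appearing later in the chapter, might look as follows (illustrative only, not an actual district's planning data):

    # Hypothetical goals -> programs -> positions chain; all entries illustrative.
    school_goals = {
        "address the needs of the whole child to ensure academic success": [
            "guidance-counseling services",
        ],
    }
    program_objectives = {
        "guidance-counseling services": [
            "meet the developmental, preventive, and remedial needs of students",
        ],
    }
    program_positions = {
        "guidance-counseling services": ["school counselor"],
    }

    # Each position is implemented, and ultimately evaluated, in light of the
    # program objectives that trace back to a school goal.
    for goal, programs in school_goals.items():
        print("Goal:", goal)
        for program in programs:
            print("  Program:", program)
            print("  Objectives:", program_objectives[program])
            print("  Positions evaluated against them:", program_positions[program])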
Development Phase
Implementation Phase

School Counselor

System Goal: The school district will address the needs of the whole child to ensure academic success.

Area of Responsibility: Intervention/Direct Service (I)

Duty I-1: Provides individual and group counseling services to meet the developmental, preventive, and remedial needs of students.
1. Provides effective services in assisting the student to cope with family, interpersonal, and educational issues.
Implementation Phase
Although there may be unique aspects to the nature and needs of educational
specialists' work, they have in common with all educational personnel the need
for fair, job-relevant, and meaningful evaluations. Unfortunately, as with teachers
and administrators, educational specialists have suffered from numerous
systemic problems with the state-of-the-art of personnel evaluation. The Joint
Committee on Standards for Educational Evaluation (1988) asserted that
personnel evaluation has been ineffectively conducted in educational organiza-
tions, despite the centrality of the process. They specifically identified the
following personnel evaluation practices as commonly not being satisfied:
In order for evaluation to be valid and valued for any educator, these issues must
be adequately addressed. A brief discussion of how the four basic attributes of
sound evaluation can be incorporated in educational specialist evaluation
follows.
Propriety Standards
Utility Standards
"Utility Standards are intended to guide evaluations so that they will be informa-
tive, timely, and influential" (Joint Committee, 1988, p. 45) as reflected in
Constructive Orientation, Defined Uses of the Evaluation, Evaluator Credibility,
Functional Reporting, and Follow-up and Impact. As with teachers and other
educators, educational specialists' evaluation should provide an informative and
timely evaluation service delivery through an emphasis on formal and informal
communication. Additionally, the evaluation should inform decision makers
regarding goal accomplishment. Although evaluative feedback is provided to
individuals, the collective evaluations of all employees should relate individual
performance to the overarching organizational goals (Stronge & Helm, 1991,
pp. 68-69).
Feasibility Standards
The Feasibility Standards require that evaluation systems be "as easy to implement
as possible, efficient in their use of time and resources, adequately funded, and
viable from a number of other standpoints" (Joint Committee, 1988, p. 71).
These standards include Practical Procedures, Political Viability, and Fiscal
Viability. The utility of any personnel evaluation system can be found in its
ability to be simultaneously effective for its intended purposes and efficient in its
use of resources. Thus, a constructive evaluation system will be applicable
specifically to educational specialists and yet remain mindful of the role of
evaluation within the educational organization (Stronge & Helm, 1991, p. 69).
Accuracy Standards
CONCLUSIONS
Despite the fact that evaluation in education has a history as long as public
schooling, its use has been restricted largely to teachers and, to a lesser degree,
administrators. In recent years, primarily in response to a push for public
accountability, performance evaluation has been extended more systematically
to include all professional employees. As schools become responsible for
providing and coordinating a wider range of services, specialized support staff
are beginning to constitute a larger proportion of a school staff and play a more
central role in fulfilling the school's mission. Many school systems are beginning
to recognize the need to maximize services and expertise offered by these
professionals. To do so requires an evaluation process which recognizes and
rewards the varied and diverse contributions of educational specialists.

Table 5. Accuracy Standards Applied to Educational Specialist Evaluation

A1 Defined Role: Educational specialist evaluation should be based on clearly
defined job responsibilities.
A2 Work Environment: Evaluation is an open system in which contextual issues
must be considered in the evaluation.
A3 Documentation of Procedures: The use of multiple data sources in educational
specialists' evaluation should be emphasized.
A4 Valid Measurement: A fundamental requirement for any good evaluation is
that it be valid for its intended audience, in this instance, educational specialists.
A5 Reliable Measurement: Valid evaluation cannot occur in the absence of
consistency; therefore, specialists' evaluation must provide a methodology that
encourages procedural standardization and consistent performance evaluations.
A6 Systematic Data Control: Systematic and accurate analysis of data is
considered fundamental for a fair evaluation system.
A7 Bias Control: Fairness in both evaluation processes and outcomes can be
enhanced by using trained evaluators who are knowledgeable about the roles and
responsibilities of educational specialists.
A8 Monitoring Evaluation Systems: By monitoring the application of the
educational specialists' evaluation system, it can be refined throughout each
evaluation cycle.
Ultimately, evaluation is nothing more than a process for determining how an
individual, program, or school is performing its assigned functions in relation to
a given set of circumstances. When evaluation is viewed as more than this
process (i.e., evaluation as an end within itself), it gets in the way of progress and,
thus, becomes irrelevant. "When evaluation is treated as less than it deserves
(i.e., superficial, inadequate evaluator training, invalid evaluation systems,
flawed implementation designs), the school, its employees, and the public at
large are deprived of opportunities for improvement and the benefits that
accountability afford. All of us, whatever our relationship to the educational
enterprise, deserve high quality evaluation" (Stronge, 1997, p. 18).
ENDNOTES
1 Percentages are based on the number of people who provide support services, including special
education personnel, and not full-time equivalents. In a similar analysis of a high school, it was
found that 20-25 percent of the staff members provided support services.
Evaluating Educational Specialists 691
2 This section is adapted from Stronge and Tucker (1995a).
3 This section is adapted from Stronge and Helm (1991).
4 This section is adapted from Stronge (1997) and Stronge, Helm, and Tucker (1995).
5 Initial work on the model was in the context of evaluation for educational specialists (e.g., school
counselor, school psychologist) described as "professional support personnel." The model is
intended to provide an evaluation paradigm that can be adapted to each individual and institution.
Permission is granted for use in this text.
6 This section is adapted from Stronge and Helm (1991).
REFERENCES
American School Counselor Association. (1990). The school counselor and their evaluation. In
Position statements of the American School Counselor Association. Alexandria, VA: American
School Counselor Association.
Bolton, D.L. (1980). Evaluating administrative personnel in school systems. New York: Teachers
College Press.
Borg, W.R., & Gall, M.D. (1989). Educational research: An introduction. New York: Longman.
Bridges, E.M. (1990). Evaluation for tenure and dismissal. In J. Millman & L. Darling-Hammond
(Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers
(pp. 147-157). Newbury Park, CA: Sage Publications.
Castetter, W.B. (1981). The personnel function in educational administration (3rd ed.). New York:
Macmillan.
Coleman, J., Jr., & Turner, P. (1986). Teacher evaluation: What's in it for us? School Library Journal,
32(10), 42.
Connellan, T.K. (1978). How to improve human performance: Behaviorism in business and industry.
New York: Harper & Row.
Dietz, M.E. (1995). Using portfolios as a framework for professional development. Journal of Staff
Development, 16(2), 40-43.
Educational Research Service. (1988). Teacher evaluation: Practices and procedures. Arlington, VA:
Author.
Frels, K., & Horton, J.L. (1994). A documentation system for teacher improvement and termination.
Topeka, KS: National Organization on Legal Problems in Education.
Gareis, C.R. (1999). Teacher portfolios: Do they enhance teacher evaluation? (located in CREATE
News section). Journal of Personnel Evaluation in Education, 13, 96-99.
Goodale, J.G. (1992). Improving performance appraisal. Business Quarterly, 5(2), 65-70.
Gorton, R., & Ohlemacher, R. (1987). Counselor evaluation: A new priority for the principal's
agenda. NASSP Bulletin, 71(496), 120-124.
Gulick, L., & Urwick, L. (Eds.). (1937). Papers on the science of administration. New York: Institute of
Public Administration, Columbia University.
Helm, V.M. (1995). Evaluating professional support personnel: A conceptual framework. Journal of
Personnel Evaluation in Education, 9(2), 105-122.
Hunter, M. (1988). Create rather than await your fate in teacher evaluation. In S.J. Stanley & W.J.
Popham (Eds.), Teacher evaluation: Six prescriptions for success (pp. 32-54). Alexandria, VA:
Association for Supervision and Curriculum Development.
Hurst, Wilson, & Cramer. (1998). Professional teaching portfolios: Tools for reflecting, growth, and
advancement. Phi Delta Kappan, 79, 578-582.
Iwanicki, E.F. (1990). Teacher evaluation for school improvement. In J. Millman & L. Darling-
Hammond (Eds.), The new handbook of teacher evaluation. Newbury Park, CA: Sage Publications.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation
standards: How to assess systems of evaluating educators. Newbury Park, CA: Sage Publications.
Kruger, L.J. (1987, March). A functional approach to performance appraisal of school psychologists.
Paper presented at the Annual Meeting of the National Association of School Psychologists, New
Orleans, LA.
Lamb, A., & Johnson, L. (1989). An opportunity, not an albatross. The Book Report, 8(3), 11-24.
Little, J.W. (1993). Teachers' professional development in a climate of educational reform. Educational
Evaluation and Policy Analysis, 15, 129-151.
Loesch-Griffin, D.A., Spielvogel, S., & Ostler, T. (1989). Report on guidance and counseling personnel
and programs in Nevada. Reno, NV: University of Nevada, College of Education. (ERIC Document
Reproduction Service No. ED 317 340)
Manatt, R.P. (1988). Teacher performance evaluation: A total systems approach. In S.J. Stanley &
W.J. Popham (Eds.), Teacher evaluation: Six prescriptions for success (pp. 79-108). Alexandria, VA:
Association for Supervision and Curriculum Development.
Manatt, R.P. (2000). Feedback at 360 degrees. The School Administrator, 9(57), 10-11.
Manatt, R.P., & Benway, M. (1998). Teacher and administrator performance: Benefits of 360-degree
feedback. ERS Spectrum: Journal of Research and Information, 16(2), 18-23.
Manatt, R.P., & Kemis, M. (1997). 360 degree feedback: A new approach to evaluation. Principal,
77(1), 24, 26-27.
McConney, A.A., Schalock, M.D., & Schalock, H.D. (1997). Indicators of student learning in teacher
evaluation. In J.H. Stronge (Ed.), Evaluating teaching: A guide to current thinking and best practice
(pp. 162-192). Thousand Oaks, CA: Corwin Press, Inc.
McGreal, T.L. (1988). Evaluation for enhancing instruction: Linking teacher evaluation and staff
development. In S.J. Stanley & w.J. Popham (Eds.), Teacher evaluation: Six prescriptions for success
(pp. 1-29). Alexandria, VA: Association for Supervision and Curriculum Development.
Medley, D.M., Coker, H., & Soar, R.S. (1984). Measurement-based evaluation of teacher performance.
New York: Longman.
Mendro, R.L. (1998). Student achievement and school and teacher accountability. Journal of
Personnel Evaluation in Education, 12, 257-267.
Millman, J., & Darling-Hammond, L. (1990). The new handbook of teacher evaluation: Assessing
elementary and secondary school teachers. Newbury Park, CA: Corwin Press, Inc.
Norton, M., & Perlin, R. (1989). Here's what to look for when evaluating school psychologists.
Executive Educator, 11(8), 24-25.
Olsen, L.D., & Bennett, A.C. (1980). Performance appraisal: Management technique as social
process? In D.S. Beach (Ed.), Managing people at work: Readings in personnel (3rd ed.). New York:
Macmillan.
Patton, M.Q. (1986). Utilization-focused evaluation (2nd ed.). Newbury Park, CA: Sage Publications.
Peterson, K.D. (1995). Teacher evaluation: A comprehensive guide to new directions and practices.
Thousand Oaks, CA: Corwin Press.
Redfern, G.B. (1980). Evaluating teachers and administrators: A perfonnance objectives approach.
Boulder, CO: Westview.
Sanders, W.L., & Horn, S.P. (1998). Research findings from the Tennessee Value-Added Assessment
System (TVAAS) database: Implications for education evaluation and research. Journal of
Personnel Evaluation in Education, 12, 247-256.
Sclan, E.M. (1994, July). Performance evaluation for experienced teachers: An overview of state policies.
Presented at the annual conference of the National Evaluation Institute, Gatlinburg, TN.
Scott, J., & Klasek, C. (1980). Evaluating certified school media personnel: A useful evaluation
instrument. Educational Technology, 20(5), 53-55.
Scriven, M.S. (1981). Summative teacher evaluation. In J. Millman (Ed.), Handbook of teacher
evaluation (pp. 244-271). Beverly Hills, CA: Sage Publications.
Scriven, M.S. (1988). Duties-based teacher evaluation. Journal of Personnel Evaluation in Education,
1, 319-334.
Scriven, M. (1995). A unified theory approach to teacher evaluation. Studies in Educational
Evaluation, 21, 111-129.
Shinkfield, AJ., & Stufflebeam, D. (1995). Teacher evaluation: Guide to effective practice. Boston:
Kluwer Academic Publishers.
Stronge, J.H. (1988). Evaluation of ancillary school personnel training model (Illinois Administrators'
Academy). Springfield, IL: Illinois State Board of Education.
Stronge, J.H. (1993). Evaluating teachers and support personnel. In B.S. Billingsley (Ed.), Program
leadership for serving students with disabilities (pp. 445-479). Richmond, VA: Virginia Department
of Education.
Stronge, J.H. (1994). How do you evaluate everyone who isn't a teacher: A practical evaluation
model for professional support personnel. The School Administrator, 11(51), 18-23.
Stronge, J.H. (1995). Balancing individual and institutional goals in educational personnel evaluation:
A conceptual framework. Studies in Educational Evaluation, 21, 131-151.
Stronge, J.H. (1997). Improving schools through teacher evaluation. In J.H. Stronge (Ed.),
Evaluating teaching: A guide to current thinking and best practice (pp. 1-23). Thousand Oaks, CA:
Corwin Press, Inc.
Stronge, J.H., & Helm, V.M. (1990). Evaluating educational support personnel: A conceptual and
legal framework. Journal of Personnel Evaluation in Education, 4, 145-156.
Stronge, J.H., & Helm, V.M. (1991). Evaluating professional support personnel in education. Newbury
Park, CA: Sage Publications.
Stronge, J.H., & Helm, V.M. (1992). A performance evaluation system for professional support
personnel. Educational Evaluation and Policy Analysis, 14, 175-180.
Stronge, J.H., Helm, Y.M., & Tucker, P.D. (1995). Evaluation Handbook for Professional Support
Personnel. Kalamazoo: Center for Research on Educational Accountability and Teacher Evaluation,
Western Michigan University.
Stronge, J.H., & Tucker, P.D. (1995a). Performance evaluation of professional support personnel: A
survey of the states. Journal of Personnel Evaluation in Education, 9, 123-138.
Stronge, J.H., & Tucker, P.D. (1995b). The principal's role in evaluating professional support
personnel. NASSP Practitioner, 21(3),1-4.
Stronge, J.H., & Tucker, P.D. (2000). Teacher evaluation and student achievement. Washington, DC:
National Education Association.
Stufflebeam, D.L. (1983). The CIPP model for program evaluation. In G. Madaus, M.S. Scriven, &
D.L. Stufflebeam (Eds.), Evaluation models: Viewpoints on educational and human services in
evaluation (pp. 117-141). Boston: Kluwer-Nijhoff.
Stufflebeam, D.L., Foley, W.J., Gephart, W.J., Guba, E.G., Hammond, R.L., Merriman, H.O., & Provus, M.M.
(Phi Delta Kappa National Study Committee on Evaluation). (1971). Educational evaluation and
decision making. Itasca, IL: F.E. Peacock.
Tucker, N.A., Bray, S.W., & Howard, K.C. (1988). Using a client-centered approach in the principal's
evaluation of counselors. Paper presented at the annual meeting of the Assessment Conference.
(ERIC Document Reproduction Service No. ED 309 341)
Turner, P.M., & Coleman, J.G., Jr. (1986, January). State education agencies and the evaluation of
school library media specialists: A report. Paper presented at the Annual Meeting of the Association
for Educational Communications and Technology, Las Vegas, NV.
Valentine, J.W. (1992). Principles and practices for effective teacher evaluation. Boston, MA: Allyn and
Bacon.
Vincelette, J.P., & Pfister, F.C. (1984). Improving performance appraisal in libraries. Library and
Information Science Research, An International Journal, 6, 191-203.
Webster, W.J., & Mendro, R.L. (1997). The Dallas value-added accountability system. In J. Millman
(Ed.), Grading teachers, grading schools: Is student achievement a valid evaluation measure?
(pp. 81-99). Thousand Oaks, CA: Corwin Press.
Wheeler, P.H. (1993). Using portfolios to assess teacher performance (EREAPA Publication Series No.
93-7). (ERIC Document Reproduction Service No. ED 364 967) (TEMP Abstract No. 11231).
Wolf, K. (1996). Developing an effective teaching portfolio. Educational Leadership, 53(6), 34-37.
Wolf, K., Lichtenstein, G., & Stevenson, C. (1997). Portfolios in teacher evaluation. In J.H. Stronge
(Ed.), Evaluating teaching: A guide to current thinking and best practice (pp. 193-214). Thousand
Oaks, CA: Corwin Press, Inc.
Wright, S.P., Horn, S.P., & Sanders, W.L. (1997). Teacher and classroom context effects on student
achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in Education, 11,
57-67.
Section 8
Program/Project Evaluation
Introduction
In her chapter, Dr Jean King defined program evaluation and project evaluation
in the following way:
This is the definition that will be used in this section. Evaluation is defined as a
systematic process of determining the merit or worth of an object (Joint
Committee, 1994): in this case, programs or projects.
Program and project evaluations can be viewed alternatively as a recently
developed methodology and practice or one that dates back to the beginnings of
recorded history. By other names, program and project evaluations have been
with us for thousands of years as new ways of doing things are tried and then
accepted or rejected. Examples abound in every international context. The
biggest challenge of this section is where in time and geography to begin.
Four international settings where educational program and project evaluation
experiences have developed were selected for this section. Three are "Western"
settings while the fourth, in developing countries, offers a provocative contrast.
The four authors of the chapters in this section chose different starting points.
Dr Gila Garaway described unique characteristics of program and project
evaluations in developing countries. In her chapter, she used illustrative cases to
build the reader's understanding of contextual differences between educational
evaluations conducted in the West and those conducted in Third World countries.
She contrasted them in educational systems and context, sociopolitical and
cultural contexts, players and their roles, evaluation impact, and ethical con-
siderations in evaluation. Having made these contrasts, Garaway proceeded to
layout a view of future development of educational evaluation in developing
countries. The foremost issue is the extent to which program evaluation in
Drs Garaway, King, Dignard, and Owen have provided an important agenda for
the development of program and project evaluation throughout the world. Each
agenda item can be considered to be a significant project by itself. It is incumbent
upon the evaluation communities found locally, nationally, regionally, and globally
to take these challenges seriously and to continue the evolution of program and
project evaluation.
REFERENCE
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards:
How to assess evaluations of educational programs. Thousand Oaks, CA: Sage.
31
Evaluating Educational Programs and Projects
in the Third World
GILA GARAWAY
Moriah Africa, Israel
Sample #1
Sample #2
HOW IS IT DIFFERENT?
Educational Systems, Structures, and the Context in which They Are Found
The concept of hierarchical distance - the degree of verticality - goes a long way
in helping to describe some of the differences between western socio-political
relationships and those found in many developing nations. Bollinger and
Hofstede (1987) maintain that in high verticality societies, wealth and power
tend to be concentrated in the hands of an elite, the society tends to be static and
politically centralized, and subordinates consider their superiors to be better
than themselves, having a right to privilege. There is minimal attention to the
future because life will go on regardless. Contrasted to this is the more horizontal
society in which planning for the future is deemed necessary for survival, the
political system is decentralized, wealth is widely distributed and people are
deemed more or less equal. Bollinger and Hofstede maintain that high verticality
prevails in warmer climates, where the survival of the group is less dependent on
saving, future planning, and technological advancements. While this thesis
may be debatable, the point is that countries around the world have vastly
different social, political, and cultural systems, and that differences are not just
a function of isolated discrete manners and customs but rather of systems that
penetrate behavior at all levels.
In a treatise on African culture, Etounga-Manguelle (2000) maintains that in
spite of differences between nations, tribes, and ethnic groups across Africa,
there is a common ground of shared values, attitudes and institutions. He notes
several: subordination of the individual by the community, a resistance to changes
in social standing, and thought processes that avoid skepticism, skepticism being
a mental endeavor tied to individualism. In addition, there is a high level of
acceptance of uncertainty about the future. Given the very high value placed on
sociability, social cohesion, and consensus, differences in opinion, thought, or
viewpoint are often masked or rejected in both a conscious and subconscious
move to avoid conflict and promote conviviality. This is contrasted to societies
where individualism reigns and conquering the future is related to aggressiveness,
anxiety, and institutions oriented to change. Western thought and evaluation, as
concept and practice, fall quite nicely into the latter category.
Few of us are aware of the degree to which cultural values pervade all aspects
of our practice. A brief look at our operational definitions provides just part of
the picture. In the West, poverty is equated with lack of material wealth. When
needing to determine the impact of poverty on a program we thus might look at
family income or condition of the family dwelling rather than at the locally
defined measure, distance from the community water tap, or we might be
inclined to give a higher value to the little hut in the midst of a lush looking
garden than the dusty tent in the midst of barren ground even though by local
definition it is the pastoralist tent dweller who has "wealth" because only the
"poor" work the ground. Our biases can also be seen in our choice of approach.
Because of the multiplicity and complexity of intervening variables in a developing
nation context, the evaluator usually feels a greater need to involve a variety of
people in his or her attempt to get at truth. Added to this is often a desire to
"help change things," to promote the development which seems so needed.
Participatory approaches and their proliferation are a natural response to this.
This seems good and is in fact often an indication that we are attempting to
realize the principles of fairness outlined in the Guiding Principles for
Evaluators (AEA, 1995). What we fail to realize however is that the way we
design and implement our participatory approach may be rooted in our own
cultural view of things; that is, "fairness" is an issue of voice; time is to be
maximized, "extraneous" social banter minimized; getting it all "out on the
table" is the most efficient and effective way to begin processes of learning and
change; women should have an equal representation. Quite clearly, the more
participatory type approaches assume a democratic frame, or are an attempt to
create one.
And what about our definition? Evaluation as defined by many in the field is
all about judgment, and judgment speaks criticism. And criticism, as any
evaluator knows, breeds defensiveness and wall building; it hinders learning. In
many developing nations, where people are not used to open criticism at all
because of social and cultural constraints to anything which might appear to be
stirring up conflict, this wall building is magnified. In fact "dialogue" as critical
communication may not only be avoided but resisted and frowned upon. Rather,
consensus building is the expected and acceptable norm. Hence the failure of
many focus group interviews to produce the interactions that the evaluator wants
and expects.
In much of the developing world the concept of evaluation is little understood.
Evaluations are generally called for by foreign development agencies to meet
their donor needs, be it information for decision making or information to satisfy
their donor population's demand for accountability - did you succeed in doing
what you said you were going to do? While these aims are fully justified, the
information gained usually leaves with the evaluation, leaving those locally
involved in the evaluation with little understanding of its purposes or conclusions.
In cases where more participatory approaches have been implemented, social
interaction patterns employed may have generated negative reactions and a
resistance to any form of western evaluative activity, with control and "social
transformation" perceived as its primary focus.
In many ways, this same statement about lack of understanding of evaluation
could be made about developed countries, where the stated (or unstated) role
and purposes of evaluation may be as varied as the evaluation approaches
themselves. Within the variation however are common threads, perhaps common
understandings. Many if not most (Cronbach et al., 1980; House, 1993; Worthen,
1990) would support the statement that evaluation is essentially political
activity. With an eye on issues of ethics, justice, and cultural fairness, House
(1993) points to evaluation's authority, to its role as a strong and significant force
in the process of social change, and to the responsibility of the evaluator
to conduct professional, fair, democratic practice. Clearly, in the West evaluation
is a growing source of social authority in societies with increasingly fragmented
social structures.
What then does this mean for program evaluation in developing countries
where social systems and structures may be far more intact and functioning and
where people may be far less willing to concede any authority to evaluation?
Surely the evaluator is still accountable and responsible for ethical, professional
practice. But what do we assume in the design and conduct of practice? In terms
of design are we assuming certain social structures, roles, and modes of
interaction? Do we include women and children in inappropriate ways? Do we
pattern our discourse along critical dimensions or do we seek to understand and
implement local patterns? And if we respect local patterns and give voice accord-
ing to local custom, are we aware of the implications in terms of subsequent
expectations? Have we accounted for the power of social obligation? In terms of
conduct, do we extend ourselves to step outside our own limited perception of
the way things should be? In one study done in an African nation, the rural
dwellers could not quite understand why this stranger had come to ask questions
and stir things up. Was he just a troublemaker? Didn't he know that it is the
leaders who know? It is they who are responsible to know and provide for us. Is
this outsider trying to undermine that? Does our design or approach relate to the
way things are?
Players and Their Roles
How is the client, the evaluator, or the stakeholder different in developing nation
evaluation? In terms of the client, in developed nations most educational programs
(and thus their evaluations) are funded by national or state government entities.
In contrast in developing nations the program donors are very often foreign
international development agencies, external non-national sources like UNESCO,
World Bank, an NGO or others with social and political cultures, agendas and
perspectives quite different than those of the program population. Large donors
often have very little, if any, exposure to the program and its context and thus are
far more dependent on an evaluator/evaluation to be their eyes and ears in the
field.
In terms of the evaluator, in developing nations there is both far more
freedom for creativity and a far greater responsibility to hold oneself
accountable to professional practice, given the distance from a context with
defined expectations of practice. This is significant given that the individual
evaluator's ultimate power and authority is usually far greater than in the West.
There are several reasons for this. In the West, if a program gets a poor report, it
may be revised or discarded; the education system, however, continues to
function. In contrast, in countries where there is heavy dependence on outside
resources to support or provide significant amendments to national educational
system budgets, a poor report can potentially have dramatic and far-reaching
consequences for the maintenance of the educational system itself. Combined with
this reality is the power of a single evaluation. In the West, a large program will
very possibly include several evaluation studies; in developing nations, there will be
at best one. Thus the report of even a single evaluation has the power to
significantly affect future funding.
With respect to foreign evaluators, the greater disparities and differences
between the donor population, the evaluator, and the program population mean
that there is a greater need to establish common understandings and to examine
the assumptions underlying not only evaluation practice but also those under-
lying educational practice in that context. What is the big picture, extending even
beyond the program? Who are the educational service providers, beyond those
assumed? What are the roles of the education ministry? What are the
assumptions underpinning educational action? Although such inquiry may be
critically important in a particular setting, it carries with it real costs in terms of
time, effort and resources.
With respect to national evaluators, differences can be found in the realm of
the socio-cultural pressures under which they may be operating. They may be
subject to strong cultural mores that demand loyalties that can interfere both
with objectivity and with a full presentation of findings.
Stakeholder differences include teachers with limited understanding of the
concept of program evaluation and its role and purposes, and students and their
families with no understanding of the concept of evaluation, particularly in less
democratic societies where there is a tremendous dependency on "leaders." In
addition to these factors are the issues of gender and class and their interplay
with a program. This latter issue in particular can pose a dilemma in the
identification of participants for specific evaluative activity.
Local authority structures can be another area of difference. While there may
be a clearly defined and established educational system with its own set of authority
and power brokers, there are often parallel but less apparent traditional authority
structures that may be the important ones to reckon with - the area chief, for
example, or community elders. Showing deference to local authority and custom
requires care and sensitivity, however, as misunderstandings can easily arise
with respect to control of the evaluation and/or the program. This is important
in all aspects of the evaluation, from data gathering to dissemination of
information.
Evaluation Impact
supported by new donors with new agendas and little interaction between donors
- as well as a lack of use of available evaluation document resources (World
Bank, UNESCO, etc.).
Ethical Considerations
With reference to the Guiding Principles, we see that a central ethical issue in
conducting evaluation relates to the need to respect differences in culture. It has
been said that evaluators frequently fail to do so, either intentionally or uninten-
tionally. Yet often we are faced with dilemmas, the dilemmas most often arising
when one or more of the Guiding Principles come into conflict. A common
dilemma revolves around how and when to give voice to women, children, and
vulnerable groups. We want to honor the principle of meeting the needs of
different stakeholders, particularly when the stakeholders are women and
children with serious needs. Yet to do so may mean disregarding and disrespecting
local customs and consequently causing the opposite of the effect hoped for.
An irate husband may severely beat his wife for simply attending a meeting
where participatory voice is promoted. Or a village headman seeking to be
responsible in his position of maintaining social order may throw bureaucratic
obstacles in the way of program functioning, leaving children or the vulnerable
without the services intended. As Bamberger (1999) points out, a western unidimensional
focus on gender or apparent vulnerability may fail to recognize and factor in
deeper and more complex issues of ethnicity, class, and social structures.
Another issue is approach. To what degree should process be defined by local
social custom? Close adherence to local custom often requires greatly increased
expenditures of time to attend to social protocol and its definitions of whom to
visit and how long to spend at each visit. The big question for the
evaluator is where to draw the line. How much is warranted and valid? This can
be an especially difficult dilemma given, on one hand, the costs to the donor in
terms of evaluator time and to local hosts in terms of the time and money required
to provide for the evaluator hospitably, and, on the other hand, the potential
negative effect on the program if the evaluator fails to observe local social
etiquette.
With reference again to the Program Evaluation Standards and Guiding Principles
for Evaluators, a number of areas of accountability are suggested and addressed
here.
A familiar response to "it's the donor who's paying" goes something like, "but
isn't everyone else paying too?" Stakeholder and participatory approaches, discussions
of ethics, and the establishment of the Guiding Principles are each part of the
expression of the profession's concern over participant rights. It is in fact in the
third world that participatory approaches have found their greatest expression
(Feuerstein, 1986; Garaway, 1995; Narayan-Parker, 1991; Rebien, 1996), where
the separation in every sense of the word between donor and recipient seems
greatest and where perceptions are most widespread that the donor (be it a
multilateral agency like the World Bank, the UN, or others) is the bad guy and the
recipient the victim. The tendency or temptation to support the underdog is
strong. In contrast to this, however, are findings like those of Morris (1999), who
reports that empirical research on ethics in evaluation shows concern over
differential attention to stakeholders. Donors in fact may have well-founded
accountability concerns. Given poor performance records, there are often
tremendous external pressures on donors to find out why.
Looking at the issues raised above, there is first and foremost the fundamental
issue of the form and roles of program evaluation and its alignment with non-
western realities. Who defines it and how is it perceived? Is it a welcome
newcomer on the block of social authority? What and where is the value of
evaluation and where and how can it interweave and become a more beneficial
element in the educational arena? There is also the issue of the costs of
evaluation - limited resources for paying trained local evaluation staff or for
conducting locally generated studies on one hand, and the high costs of foreign
consultant evaluation to foreign donors on the other. And in both cases,
there is a need for evaluations to pay for themselves in terms of real savings to
donors, improvements in programs, and/or positive impact. There are also the
issues of lacking or inaccessible information and of lack of knowledge and
understanding related to evaluation. These issues in turn speak to the
following needs: locally equipped practitioners; effective and efficient practice (a
positive cost/benefit relationship); relevant and timely information and access to
same; relevant and appropriate practice; common understandings with respect to
aims, purposes, and process; and, finally, common understandings with respect to
the values underpinning it all.
Definition
Aims
There are a number of questions that can be asked here which suggest possible
directions: Is social authority an appropriate or acceptable role of evaluation? Is
judgment the construct within which evaluation is to be defined and confined? Is
the ultimate underlying aim of practice only the determination of merit and
worth or is it both the determination and discovery of same?
Looking at the last question: in strictly semantic terms, determination of merit
and worth implies measurement, judgment, and rather close-ended conclusions.
Discovery of merit and worth implies measurement, an exploration for and of
value, and open-ended learning. Measurement is an acknowledgement of
accountability concerns and of the need for scientific authority. Discovery is an
acknowledgement of the need to learn, and of colors that might be outside
the range I am accustomed to seeing - I may see only the leg of the elephant,
but that does not mean that the leg is all there is. Nor does it mean that the
reality of the elephant is necessarily constructed by you or me.
The exploration for and definition of underlying aims must be a joint process
conducted in-country by those directly involved in evaluation for that context.
Purposes
Practice
What might practice look like? If one were to follow the suggestions given
above, there is measurement (as much as possible using multiple methods) in
order to convey key and timely information, and discovery of merit and worth,
with a focus on values exploration, in order to promote learning, growth and
improvement. Further, there is practice that can stand the tests of "relevance,"
"effectiveness," "efficiency," and "impact."
In terms of relevance, practice could be the development of common
understandings, gained via an examination of assumptions and the creation of a
common language for evaluation. As a value-tied endeavor, evaluation demands
explication and at least some examination of who, where, and what we are and,
with respect to evaluation, what it is, what it is for, how it is done, and who is doing
it - and then, given all those understandings, how the process and product can fulfill
defined purposes. This implies the need for both insider and outsider
perspectives, along the lines of "dialogue" evaluation as suggested by Nevo (2000).
In terms of effectiveness, practice could be appropriate and understandable
language, in process and presentation. In addition, it could be trained and
mentored local practitioners, definitions of effectiveness and culturally appropriate
discourse patterns, and the identification of strengths - social, cultural, and
political - and the building of bridges to them. Further, it could be clearly understood
and appropriate approaches that support rather than detract from program
purposes and that promote the thorough gathering of adequate, accurate,
relevant information. Possible approaches include the systems approach
put forth by Bhola (1997), in which the educational program is seen as part of a
greater system made up of many subsystems. In this approach, the evaluation
process includes a definition and examination of the evaluand's system
components, and the evaluation report becomes a "situated discourse," the
rendition of the evaluation findings being framed within a presentation of the
system as a whole. Or the approach could involve the use of a model conducive to
cross-cultural study (Garaway, 1996), in which case a big picture is developed and
methodically and systematically examined over time and across programs to
provide an extended time-frame picture of an educational entity and its health.
In terms of efficiency, practice can be more than providing needed information
in a timely and easily accessible format to donors and other decision makers.
Efficiency can be practice that makes the most of the time spent, addressing the
needs of both the participants and the donors. It can be practice that brings
about a range of benefits that help outweigh costs. It can be a tool for learning,
an embedded part of existing systems and structures, with the evaluation
designed not only to gather information but also to help address learning needs
and form bridges to new understandings. If the local need is for increased skill
capacity to deliver educational services, then evaluation could be
designed as a deliberative part of the curricular process (i.e., definition of
aims → design → planning → implementation → evaluation → back to
aims), thus allowing participants an opportunity to expand their base of skills in
the process of gathering information.
In terms of impact, practice could be addressing local needs for adequate knowledge
to process evaluation information and presenting information in locally
meaningful, useable ways. Further, it could be practice designed specifically to
support rather than detract from program purposes. In terms of client or
decision-making processes, it could be practice that maintains elements of
scientific authority in the gathering of data rather than discarding the more
"objective" measures as meaningless. It could also be systematicity: following on
previous studies and taking the time at the beginning of a study to refer to
existing bodies of knowledge. In terms of "process" impact - the impact of the
process of an evaluation on the program being evaluated (impact being enhance-
ment, improvement and learning) - it could be an evaluation process that
engenders a greater understanding of program elements. It could be a process
intertwined with program in ways that improve delivery of educational services.
And finally, it could be a process that of itself brings about a paradigm shift
related to evaluation: evaluation process becoming equated with endeavor that
promotes learning rather than with endeavor that measures it.
Given the very real educational needs in many developing nations, the con-
straints to educational delivery, and the multiple costs of conducting evaluations,
ever greater efforts should be put into aligning evaluation with the overall needs
of a particular educational setting. It is both a responsibility and an opportunity. As
a responsibility, it is to design and conduct our activity according to mutually
agreed-upon standards and common understandings. It is also to create a fair,
representative picture of the entity being studied that will benefit and meet the
needs of both the donor population and the program population. As an opportunity,
it is a chance to expand the expression and understanding of our practice to meet
the challenges and realities of a myriad of settings. The potential is tremendous
for expressions of professional practice as richly varied as the world's nations
themselves, a rich global heritage.
REFERENCES
American Evaluation Association Task Force on Guiding Principles for Evaluators. (1995). Guiding
principles for evaluators. New Directions for Program Evaluation, 66, 19-26.
Bamberger, M. (1999). Ethical issues in conducting evaluation in international settings.
In J.L. Fitzpatrick, & M. Morris (Eds.) Current & emerging ethical challenges in evaluation. New
Directions for Program Evaluation, 82, 89-99.
Bhola, H.S. (1997). Systems thinking, literacy practice: Planning a literacy campaign in Egypt.
Entrepreneurship, Innovation and Change, 6(1), 21-35.
Bhola, H.S. (1998). Program evaluation for program renewal: A study of the National Literacy
Program in Namibia (NLPN). Studies in Educational Evaluation, 24(4), 303-330.
Bollinger, D., & Hofstede, G. (1987). Les differences culturelles dans Ie managements. Paris: Les
Editions Organisation.
Brunner, L., & Guzman, A. (1989). Participatory evaluation: A tool to assess projects and empower
people. In R.F. Conner, & M. Hendricks (Eds.), International innovations in evaluation methodology.
New Directions for Program Evaluation, 42, 9-18.
Cookingham, F. (1992). Defining evaluation. Together, 3, 21.
Crespo, M., Soares, J.F., & de Mello e Souza, A. (2000). The Brazilian national evaluation system of
basic education: Context, process and impact. Studies in Education Evaluation, 26(2), 105-125.
Cronbach, L.J., et al. (1980). Toward reform of program evaluation. San Francisco: Jossey-Bass.
Datta, L. (1999). The ethics of evaluation neutrality and advocacy. In J.L. Fitzpatrick, & M. Morris
(Eds.), Current & emerging ethical challenges in evaluation. New Directions for Program Evaluation,
82, 77-89.
Etounga-Manguelle, D. (2000). Does Africa need a cultural adjustment program? In L.E. Harrison,
& S.P. Huntington (Eds.), Culture matters: How values shape human progress (pp. 65--77).
Huntington, NY: Basic Books.
Feuerstein, M.T. (1986). Partners in evaluation: Evaluating development and community programmes
with participants. London: Macmillan.
Garaway, G. (1995). Participatory evaluation. Studies in Educational Evaluation, 21, 85-102.
Garaway, G. (1996). The case-study model: An organizational strategy for cross-cultural evaluation.
Evaluation, 2(2), 201-212.
Giddens, A. (1987). Social theory and modern sociology. Cambridge: Polity.
House, E.R. (1993). Professional evaluation. Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards:
How to assess evaluations of educational programs. Thousand Oaks, CA: Sage.
Lapan, S.D. (in press). Evaluation research. In K.B. deMarrais, & S.D. Lapan (Eds.), Approaches to
research in education and the social sciences. Mahwah, NJ: Lawrence Erlbaum.
Morris, M. (1999). Research on evaluation ethics: What have we learned and why is it important? In
J.L. Fitzpatrick & M. Morris (Eds.) Current & emerging ethical challenges in evaluation. New
Directions for Evaluation, 82, 89-99.
Narayan-Parker, D. (1991). Participatory evaluation: Tools for managing-change in water/sanitation.
New York: PROWESS/UNDP/IBRD/UNICEF/WHO.
Nevo, D. (2000, June). School evaluation: Internal or external? Paper presented at the International
Seminar on School Evaluation, Seoul, Republic of Korea.
OECD. (1992). Development assistance manual: DAC principles for effective aid. Paris: OECD.
Rebien, C. (1996). Participatory evaluation of development assistance: Dealing with power and
facilitative learning, Evaluation, 2(2), 151-172.
Worthen, B.R. (1990). Program evaluation. In H.J. Walberg & G.D. Haertel (Eds.), International
encyclopedia of educational evaluation (pp. 42-46). Oxford: Pergamon.
32
Evaluating Educational Programs and Projects in the USA
JEAN A. KING
University of Minnesota, MN, USA
The past forty years have witnessed a dramatic increase in the amount and
variety of educational evaluation in the USA. Program and project evaluations
in this country come in many shapes and sizes, as the following examples suggest:
National events triggered a rise in program evaluation from the late 1950s to the
late 1970s, due in large part to the creation of federally funded education
programs that mandated evaluation. Since 1980, a focus on accountability and
standardized testing to measure student achievement has reduced the emphasis
on program evaluation in many educational settings, despite the fact that the
evaluation function is exercised to some degree in most American school districts.
This chapter will discuss the evolution of American educational evaluation, its
current practice, and likely future.
To begin, it may be helpful to distinguish between educational programs
and projects. Their difference lies in the scope of activities and the resources
committed to them (Payne, 1994). Programs are typically long-term, "ongoing
activities representing coordinated efforts planned to achieve major educational
goals," for example, a district's integrated language arts program or a university's
The impetus for increased evaluation activity stemmed both from federal
accountability mandates and from a dramatic increase in federal funding for
such work. "These funds were earmarked for two distinctly different purposes at
two slightly overlapping points in time: massive curriculum development efforts
preceding social intervention programs" (Talmage, 1982, p. 593). In response to
the Russian launch of Sputnik in 1957, the National Defense Education Act
(NDEA) led to the development of the federally supported "alphabet soup
curricula" of the late 1950s and early 1960s (e.g., those developed by the
Biological Sciences Curriculum Study [BSCS], the Physical Sciences Study
Committee [PSSC], and the Chemical Education Material Study [CHEM
Study]). The nationwide implementation of these curricula created demands for
large-scale evaluation efforts to demonstrate that they actually improved
American students' achievement in science and mathematics. NDEA program
evaluations "revealed the conceptual and methodological impoverishment of
evaluation in that era" (Worthen et aI., 1997, p. 30). The results of the curricular
horse races between old and new curricula suggested that students learned
whatever the curriculum they studied taught them; there were no universal
winners or losers (Walker & Schaffarzick, 1974).
President Lyndon Johnson's War on Poverty and Great Society programs
marked a second area that spurred the development of program and project
evaluation. In federal offices, the mandatory implementation - at least on paper
- of the Planning, Programming, and Budgeting System (PPBS) that Secretary
Robert McNamara introduced to the Department of Defense under President
John F. Kennedy initiated an approach to systematic evaluation over time
(Cronbach et al., 1980, pp. 30-31). It was a legislative action, however, that yielded
more dramatic outcomes. Worthen, Sanders, and Fitzpatrick (1997) write, "The
one event that is most responsible for the emergence of contemporary program
evaluation is the passage of the Elementary and Secondary Education Act (ESEA)
of 1965" (p. 32). Prior to this act, only two federal programs had congressionally
mandated evaluation components. By insisting that the ESEA include
mandatory evaluation that would track the results of federal funding, Senator
Robert F. Kennedy created an instant market for program evaluators, coupled
with extensive funding to support their practice (McLaughlin, 1975).
"Through a coordinated program of federal grants and contracts supporting
key university groups [during the 1960s], hotbeds of educational evaluative
activity were created throughout the country .... Predictably, academics used
what they knew as the basis for their strategies" (Baker & Niemi, 1996, p. 928),
i.e., traditional research designs and methods. Unfortunately, "the educational
community was ill-prepared to undertake such a massive amount of evaluation
work" (Wolf, 1990, p. 10), and "evaluators of curriculum development efforts
and those engaged in evaluation of social intervention programs soon recognized
the limitations' of their respective research traditions in addressing pertinent
evaluation questions" (Talmage, 1982, p. 593). Cook and Shadish (1986, cited in
Aikin & House, 1992) summarize the lessons learned during this stage of the
field's development:
Late in the 1960s and early in the 1970s evaluation assumed the task of
discovering which programs ... worked best. Most evaluators thought that
social science theory would point to the clear causes of social problems
and to interventions for overcoming them, that these interventions would
be implemented and evaluated in ways that provided unambiguous
answers, that these findings would be greeted enthusiastically by
managers and policymakers, and that the problems would be solved or
significantly improved. They were disappointed on all counts. (p. 463)
From the perspective of the intended users - the Congress and educators at all
levels - "viewed collectively, these evaluations were of little use" (Worthen et aI.,
1997, p. 33). From the perspective of educational evaluation, this broad social
change - the increased role of federal and state government in education with
accompanying evaluation requirements - dramatically increased and helped to
institutionalize the practice of educational evaluation in school districts across
the United States. As Nevo (1994) notes, "local school districts became the focal
point for the commitment to evaluation in the complex relationship among the
various levels of the American educational system" (p. 89).
Ronald Reagan's election as President was important to the history of
program evaluation in this country because, following his move to the White
House, the burgeoning evaluation practice connected with increased federal
funding of public education slowed dramatically. However, earlier research
conducted in the late 1970s - research on program evaluation funded by the
National Institute of Education (e.g., Lyon, Doscher, McGranahan, & Williams,
1978; Kennedy, Apling, & Neumann, 1980) and the deliberations of the Program
Evaluation Committee of the National Research Council (National Academy of
Sciences) that resulted in a report (Raizen & Rossi, 1981) - documented the
status of educational evaluation shortly before Reagan took office. King,
Thompson, and Pechman (1982) summarize the outcomes of three nationwide
surveys of the evaluation units of large city school districts (Lyon et al., 1978; Rose
& Birdseye, 1982; Webster & Stufflebeam, 1978): district personnel supported by
local funds - not external consultants - conducted the evaluations; "evaluation"
in these district contexts meant testing programs; and there was little emphasis
on instructional or program improvement. At the local level, in large and small
districts alike, federal program evaluations were largely compliance-driven
(Holley, 1981) and funded with program grant funds (Reisner, 1981).
Holley (1981) described the work of a local district evaluator. In many
districts, especially those that were not large or urban, "public school evaluation
is an all-consuming role. An evaluator works 12 months, with summer bringing
heaviest work load; because resources are often inadequate, the workday and
workweek are far longer than those of the average worker" (p. 261). People
given the work of evaluation might have "neither training nor experience in
evaluation methodology, measurement, or statistical analysis" (p. 258), and
"enormous variation both from state to state and from district to district" was the
norm (p. 246).
As noted previously, the rapid expansion of program evaluation slowed
dramatically after 1980. The fiscal conservatism Reagan directed at social
programs shifted the field's focus to matters financial - to "assessing the
expenditures of social programs in comparison to their benefits and to
demonstrating fiscal accountability and effective management" (Rossi et aI.,
1999, p. 18; see also House, 1993). Much of the responsibility for evaluation
simultaneously shifted to the states, many of which did "not have the capabilities
or the will to undertake the rigorous evaluation needed" (Rossi et al., 1999,
p. 19). Webster (1987) summarized the effect of these shifts:
The golden age of evaluation, from a funding standpoint, is over. The level
of formative and summative program evaluation in the nation's public
schools has declined precipitously. The testing function remains as a major
consumer of resources and a major barometer by which school districts
are judged (appropriately or inappropriately) .... What has happened is
that the data for program improvement emphasis of the late sixties and
seventies has been largely replaced by the accountability movement of the
eighties. (p. 49)
More than a decade after the publication of Webster's article, this educational
accountability movement has not only remained, but intensified. President
George W. Bush's proposed national testing program for public school students
in the USA is a highly visible reminder of the centrality of testing to educational
evaluation in the first decade of the 21st century.
Program evaluations in the USA can focus on different types of objects. Baker
and Niemi (1996) identify two types of programs that are most commonly
evaluated: innovative programs and complex, large-scale programs. "The most
commonly evaluated programs are stand-alone, innovative programs," they
write. "Thousands of evaluations of specific interventions have been conducted,
probably accounting for the lion's share of evaluation efforts undertaken in the
past 20 years [mid 1970s-mid 1990s]" (Baker & Niemi, 1996, p. 933). The reasons
for evaluating these innovative programs and projects are typically externally
motivated, often as a condition of funding, and deal simultaneously with formative
and summative evaluation. Because they are innovative - outside of traditional
practice - these programs and their evaluations "do not threaten the educational
status quo directly, nor do they immediately impinge on mainstream educational
practice" (Baker & Niemi, 1996, p. 934). Either internal or external evaluators
may conduct such studies, examples of which include the innovative curriculum
projects of the 1960s (e.g., BSCS biology, which continues its evolution), many
federally funded Title III innovations, local implementations of the Drug Abuse
Resistance Education (DARE) program, and projects funded by the Fund for
the Improvement of Post-Secondary Education (FIPSE).
The second type of evaluation object identified by Baker and Niemi (1996) is
the complex, large-scale program like Headstart or Title I - programs that have
multiple sites, often across the country.¹ Given the multiple actors engaged in
such programs, these evaluations take on a complexity that unavoidably
challenges those involved, evaluators and politicians alike:
(Stufflebeam, 1971). The CIPP model, for example, identified four areas of
potential study: context, input, process, and product. More than twenty years
later, the Center for Research on Educational Accountability and Teacher
Evaluation (CREATE) used these as the basis for four core recommendations
for "achieving sound educational evaluation of U.S. education" (Stufflebeam,
1994; see also Chapter 2).
• In the consumer-oriented approach to evaluation, programs or projects are
evaluated using set criteria and standards, just as products are in a Consumer
Reports study. One way to do this is by using a Key Evaluation Checklist
(Scriven, 1991).
• Michael Quinn Patton's utilization-focused evaluation encourages evaluators
to frame studies around" intended use of data by intended users," targeting
specific individuals throughout the evaluation in an effort to increase the
likelihood of use (Patton, 1997; see also Chapter 11).
• Responsive evaluation emerged in the early 1970s and emphasized responding
to the concerns and issues of program or project stakeholders, rather than
attending solely to "preordinate" design issues (Stake, 1975; see also Chapter 3).
Over the years, this notion of responsiveness has evolved to include the active
involvement of program staff and participants in participatory forms of
evaluation, including fourth generation evaluation (Guba & Lincoln, 1989)
and empowerment evaluation (Fetterman, 1994).
• Theory-driven evaluation (Chen, 1990; Chen & Rossi, 1983) bases studies on
the theory underlying programs, enabling evaluators to address both informal
and formal theories in developing program descriptions and studying effects.
In light of the numerous approaches extant in the field, Baker and Niemi (1996)
document the "striking range of conceptual schemes that have been turned to
the purpose of sorting and comparing evaluation models" - they describe seven
such efforts - but label the existing typologies "somewhat limited in their
practical applications" (p. 931).
The absence of an overarching typology or a single integrative approach
potentially creates both flexibility and frustration for evaluators and their clients.
On the one hand, the wide diversity of approaches suggests that a well-versed
and skilled evaluator can custom tailor an appropriate approach to virtually any
evaluation context. On the other hand, this diversity may prove challenging to
novice evaluators and confusing to evaluation clients as they seek to commission
a useful study. Although Michael Scriven has framed the "transdiscipline" of
evaluation (Scriven, 1994), the absence of a generally accepted, unified framework
is a notable challenge in the field's theoretical development.
The implications for practice are, fortunately, less daunting. As Stufflebeam
(1994) notes, "the importance of the different viewpoints on educational evalu-
ation is not in their distinctiveness but in what particular concepts, strategies, and
procedures they can contribute collectively to the planning, execution, reporting,
and use of sound evaluation" (p. 5). Baker and Niemi (1996) provide a pragmatic
question for evaluators choosing among competing approaches: "What is likely
to be the most effective approach to serve particular purposes in a particular
context?" (p. 932). What is clearly evident is the importance of evaluators being
well-grounded in the full range of available approaches, which may be a
challenge given the lack of systematic training and credentialing of program
evaluators in the USA.
Regardless of the approach selected, program and project evaluators use a
variety of data collection methods as they conduct studies. From an early
reliance on quantitative data, evaluators over the past two decades have
broadened their methodological perspective so that qualitative data are now
commonplace and mixed methods are widely accepted in practice (Greene,
Caricelli, & Graham, 1989). Evaluators use a variety of procedures for collecting
data, including the following: tests and psychological scales; surveys; individual
or group interviews, focus group interviews, and polls; observations and site
visits; unobtrusive measures; content analysis of existing documents and data;
and less common procedures such as the Delphi technique or the Q-sort.
The first years of the new millennium find district-based educational program
and project evaluation in the USA on the horns of a dilemma. On the one hand,
evaluation processes of a sort have been institutionalized in schools across the
nation where a limited version of evaluation related to standardized testing
programs is steady work due to increasing numbers of data-driven accountability
mandates, both internal and external. Despite a decline in federally supported
evaluation for the past two decades, "the function of evaluation appears to have
been institutionalized in enough agencies that the career opportunities for
evaluators once again look promising" (Worthen et al., 1997, p. 43). The rapid
development of technology (e.g., management information systems, inexpensive
personal computers, the Internet, and hand-held devices) has enabled districts to
compile, communicate, and use data in ways that were unimaginable even ten
years ago. Externally funded innovations and federal programs continue to
require evaluation, and evaluation is surely alive and well and living in district
evaluation units, often staffed (at least in larger districts) by trained professionals
with expertise in evaluation, measurement, and statistics. Professional
organizations with an evaluation focus provide networks of support for these
individuals.
On the other hand, given their standardized testing and accountability focus,
these evaluation units risk serving a "signal function," providing compliance
indicators to people both inside and outside the district (King & Pechman, 1984;
Stufflebeam & Welch, 1986) rather than generating information directly relevant
to improving instructional programs and practices. Program evaluation activities
distinct from norm referenced and/or standards-based testing programs may find
little support in the ongoing competition for district funds, a striking parallel to
the situation of large city school evaluation units twenty years ago (Lyon et al.,
1978; King, 2002). Current efforts in "value added" teacher evaluation, for
example - seeking to identify the specific learning outcomes attributable to
individual teachers - may provide excellent information to administrators, but
prove less useful to teachers seeking to improve their practice (Mendro &
Bembry, 2000). School-based evaluation and accreditation efforts may offer
some hope by involving school communities in collecting and analyzing relevant
data, but, as Baker and Niemi (1996) note, "the ability to provide the time and
resources necessary for true local engagement in evaluation may be limited in
the short term by both will and economics" (p. 938). Sanders (2000) speaks to the
need for school districts to find the resources and the will to conduct program
evaluations in this environment.
This is not to say, however, that program and project evaluation is likely to
fade from the evaluation landscape in the USA. Triggered by events at the
national level, the emergence of evaluation as a profession has created an
ongoing role for evaluation in the development of educational programs and
projects in a number of settings, funded in a variety of ways. The concept of
organizational learning - building evaluation capacity within organizations over
time (Preskill & Torres, 1999; King, 2000) - further expands evaluation's role and
potential for supporting change and may well mark the integration of program
and project evaluation into the ongoing work of educational organizations.
ENDNOTE
1 Such large-scale program evaluations are often part of educational policy studies conducted to help
policymakers determine their course of action. While policy evaluation may differ from program
evaluation in its scope, it clearly shares conceptual and methodological similarities.
REFERENCES
Aikin, M.C., & House, E.R (1992). Evaluation of programs. In M.C. Aikin (Ed.), Encyclopedia of
educational research (6th ed.) (Vol. 2, pp. 462--467). New York: MacIl!illan Publishing Co.
Baker, E.L., & Niemi, D. (1996). School and program evaluation. In D.C. Berliner & RC. Calfee
(Eds.), The handbook of educational psychology (pp. 926-942). New York: Macmillan.
Bloom, B.S., Engelhart, M.D., Furst, E.J., Hill, W.H., & Krathwohl, D.R. (1956). Taxonomy of
educational objectives: Handbook I. Cognitive domain. New York: David McKay.
Chen, H. (1990). Theory-driven evaluation. Newbury Park, CA: Sage.
Chen, H., & Rossi, P.H. (1983). Evaluating with sense: The theory-driven approach. Evaluation
Review, 7, 283-302.
Cronbach, L.J., Ambron, S.R., Dornbusch, S.M., Hess, R.D., Hornik, R.C., Phillips, D.C., Walker,
D.F., & Weiner, S.S. (1980). Toward reform of program evaluation. San Francisco: Jossey-Bass
Publishers.
Fetterman, D.M. (1994). Empowerment evaluation. Evaluation Practice, 15, 1-15.
Gallegos, A. (1994). Meta-evaluation of school evaluation models. Studies in Educational Evaluation,
20,41-54.
Greene, J.C., Caracelli, V.J., & Graham, W.F. (1989). Toward a conceptual framework for mixed-
method evaluation designs. Educational Evaluation and Policy Analysis, 11, 255-274.
Guba, E.G., & Lincoln, Y.S. (1989). Fourth generation evaluation. Thousand Oaks, CA: Sage.
Holley, F.M. (1981). How the evaluation system works: The state and local levels. In Raizen, S.A., &
Rossi, P.H. (Eds.), Program evaluation in education: When? how? to what ends? (pp. 246-274).
Washington, DC: National Academy Press.
House, E.R. (1993). Professional evaluation: Social impact and political consequences. Newbury Park,
CA: Sage Publications.
Kennedy, M.M., Apling, R, & Neumann, WF. (1980). The role of evaluation and test information in
public schools. Cambridge, MA: The Huron Institute.
King, J.A. (1999, June). Making evaluations useful. Paper presented at the annual meeting of the
American Cancer Society's Collaborative Evaluation Fellows Program, Atlanta, GA.
King, J.A. (2000, November). Building evaluation capacity in Lake Wobegone ISD: How many
Minnesotans does it take to use data? Paper presented at the annual meeting of the American
Evaluation Association, Honolulu, HI.
King, J.A. (2002, Spring). Building evaluation capacity in a school district. In Compton, D.W.,
Baizerman, M., & Stockdill, S.H. (Eds.), The art, craft, and science of evaluation capacity building.
New Directions for Evaluation, 93, 63-80.
King, J.A., & Pechman, E.M. (1984). Pinning a wave to the shore: Conceptualizing school
evaluation use. Educational Evaluation and Policy Analysis, 6(3), 241-251.
King, J.A., Thompson, B., & Pechman, E.M. (1982). Improving evaluation use in local schools. (Final
report of NIE Grant G-80-0082). New Orleans, LA: New Orleans Public Schools. (ERIC
Document Reproduction Service No. ED 214 998).
Kliebard, H.M. (1995). The struggle for the American curriculum, 1893-1958 (2nd ed.). New York:
Routledge.
Krathwohl, D.R., Bloom, B.S., & Masia, B.B. (1964). Taxonomy of educational objectives: Handbook
II. Affective domain. New York: David McKay.
Lyon, C.D., Doscher, L., McGranahan, P., & Williams, R. (1978). Evaluation and school districts. Los
Angeles: Center for the Study of Evaluation.
McLaughlin, M.W. (1975). Evaluation and reform: The Elementary and Secondary Education Act of
1965. Cambridge, MA: Ballinger.
Mendro, R, & Bembry, K. (2000, April). School evaluation: A change in perspective. Paper presented
at the annual meeting of the American Educational Research Association. New Orleans, LA.
Nevo, D. (1994). Combining internal and external evaluation: A case for school-based evaluation.
Studies in Educational Evaluation, 20, 87-98.
Nevo, D. (1995). School-based evaluation: A dialogue for school improvement. New York: Elsevier
Science Inc.
Norris, N. (1990). Understanding educational evaluation. New York: St. Martin's Press.
Patton, M.Q. (1997). Utilization-focused evaluation (3rd ed.). Thousand Oaks, CA: Sage Publications.
Payne, D.A. (1994). Designing educational project and program evaluation: A practical overview based
on research and experience. Norwell, MA: Kluwer Academic Publishers.
Preskill, H., & Torres, RT. (1999). Evaluative inquiry for learning in organizations. Thousand Oaks,
CA: Sage Publications.
Raizen, SA., & Rossi, P.H. (Eds.). (1981). Program evaluation in education: When? how? to what
ends? Washington, DC: National Academy Press.
Reisner, E.R. (1981). Federal evaluation activities in education: An overview. In Raizen, S.A., &
Rossi, P.H. (Eds.), Program evaluation in education: When? how? to what ends? (Appendix A,
pp. 195-216). Washington, DC: National Academy Press.
Rose, J.S., & Birdseye, AT. (1982, April). A study of research and evaluation offices and office
functions in the public schools. Paper presented at the annual meeting ofthe American Educational
Research Association, New York, NY.
Rossi, P.H., Freeman, H.E., & Lipsey, M.W. (1999). Evaluation: A systematic approach (6th ed.).
Thousand Oaks, CA: Sage Publications.
Sanders, J.R. (2000, April). Strengthening evaluation use in public education. Paper presented at the
annual meeting of the American Educational Research Association, New Orleans, LA.
Scriven, M. (1991). Evaluation thesaurus (4th ed.). Newbury Park, CA: Sage.
Scriven, M. (1994). Evaluation as a discipline. Studies in Educational Evaluation, 20,147-166.
Sharp, L.M. (1981). Performers of federally funded evaluation studies. In Raizen, S.A., & Rossi, P.H.
(Eds.), Program evaluation in education: When? how? to what ends? (Appendix B, pp. 217-245).
Washington, DC: National Academy Press.
Stake, R.E. (1975). Evaluating the arts in education: A responsive approach. Columbus, OH: Merrill.
Stufflebeam, D.L. (1971). The relevance of the CIPP evaluation model for educational
accountability. Journal of Research and Development in Education, 5, 19-25.
Stufflebeam, D.L. (1994). Introduction: Recommendations for improving evaluations in the U.S.
public schools. Studies in Educational Evaluation, 20, 3-21.
Stufflebeam, D.L., & Webster, WJ. (1988). Evaluation as an administrative function. In N.J. Boyan
(Ed.), Handbook of research on educational administration (pp. 569-601). New York: Longman.
Stufflebeam, D.L., & Welch, WL. (1986). Review of research on program evaluation in United States
school districts. Educational Administration Quarterly, 22, 150-170.
Talmage, H. (1982). Evaluation of programs. In H.E. Mitzel (Ed.), Encyclopedia of educational
research (5th ed.) (pp. 592-611). New York: The Free Press.
Walker, D.F., & Schaffarzick, J. (1974). Comparing curricula. Review of Educational Research, 44,
83-111.
Webster, W.J. (1987). The practice of evaluation in the public schools. In T.C. Dunham (Ed.), The
practice of evaluation: Proceedings of the Minnesota Evaluation Conference. Minneapolis, MN:
Minnesota Research and Evaluation Center.
Webster, WJ., & Stufflebeam, D.L. (1978, April). The state of theory and practice in educational
evaluation in large urban school districts. Paper presented at the annual meeting of the American
Educational Research Association, Toronto, Ontario.
Wolf, R.M. (1990). Evaluation in education: Foundations of competency assessment and program
review. New York: Praeger Publishers.
Worthen, B.R, Sanders, J.R, & Fitzpatrick, J.L. (1997). Program evaluation: Alternative approaches
and practical guidelines (2nd ed.). New York: Longman.
33
Evaluating Educational Programs and
Projects in Canada
ALICE DIGNARD
Ministere des Ressources naturelles, Quebec, Canada
Like other countries, Canada cannot escape from the worldwide movement
toward increased accountability of public services, including the education
system. All education institutions - public or private, large or small, from
elementary-level up to university-level instruction - aim at producing the highest quality
learning experience for every Canadian. Furthermore, they must demonstrate
that they are executing their educational mission effectively in a more and more
competitive climate. How did we get here?
Let's recap some historical facts. Dennison and Gallagher (1986) stated that
"the decade of the 1960's was truly a 'golden age' for public education in
Canada" (p. 11). The public demand for more advanced education, the need to
adapt to changes in technologies, and the need to meet the demands of the
labour market and industry coincided with the financial capability of
governments in Canada to respond favourably to these demands. It resulted in
the introduction of new concepts of accessibility to higher or postsecondary
education and the establishment of new institutions and new curricula,
including technical and vocational training. According to these authors, the
term accountability was not frequently used:
Then, in the early 1980s, came the financial problems facing all governments.
Education was definitely in competition with other public priorities. The
reductions in government grants at both the federal and provincial levels
required more responsibilities to meet the challenge of ensuring the highest
The next step is to introduce some of the evaluation bodies whose values and
practices have contributed to the establishment of education evaluation in
Canada. These structures can operate at a national or provincial level and
involve high school education and higher education. In this second part, two
agencies working at a provincial level are described. These agencies, put in place
in the 1990s, have obtained substantial results.
In the summer of 1993, the National Assembly of Quebec set up the Commission
d'évaluation de l'enseignement collégial - College Education Evaluation
Commission (CEEC). This happened in the context of the reform called College
Education Renewal, which sought to provide all Quebecers with "access to a
high-caliber, top-quality college education that enables them to attain the
highest possible skills standards" (Ministère de l'enseignement supérieur et de la
science, 1993, p. 13). In fact, this Renewal caused an increase in the colleges'
academic responsibilities which was counterbalanced by more rigorous
mechanisms for external evaluation, the CEEC for instance.
In the process of creating the Commission, an autonomous body, the
Government simultaneously abolished the Colleges Council and the Universities
Council. This evaluation body consists of three commissioners,² one of whom is
the chairperson. For the financial year 2000-2001, the CEEC had a budget of
$1.9 million and 28 permanent employees, 16 of them professional analysts or
evaluators.
The Commission's mandate includes evaluating the quality of the implementa-
tion of the educational programs provided in Quebec colleges, and it also involves
evaluating the institutional policies adopted by the colleges on evaluation of
student achievement and on program evaluation. To do this, the legislation
attributes three main powers to the Commission: the power to verify, the power
to make recommendations to the colleges and to the Minister, and a declaratory
power. More precisely it can:
The Commission enjoys considerable autonomy in its work. Its mission focuses
on developing the quality of college education and, in this perspective, one of its
main enterprises is to evaluate the implementation of programs established by
the Minister of Education, as well as those established by the 146 public and
private college educational institutions.
To carry out its mission as rigorously as possible, the Commission uses
techniques and procedures that are widely used in higher education evaluation
agencies in Canada, the United States, and Europe. The evaluation model is
based on a self-evaluation process conducted by the colleges with the help of a
guide provided by the Commission. The evaluation approach chosen enables
colleges and their academic staff to be directly involved in the process of
evaluation. Self-evaluation also gives colleges the opportunity to take into
account their culture, their structures, and the context in which the program is
implemented.
Furthermore, the Commission attaches great importance to teachers'
involvement in the process. They can be members of an advisory body at the
Commission or members of a self-evaluation committee at their college; they
can voice their opinions on program strengths and weaknesses; they can sit, as
experts in their fields of study, on a CEEC visiting committee to an institution;
and, more recently, they can even work at the Commission to acquire the
technical expertise to perform evaluation when they return to their college. For
this last purpose, the Commission keeps a few remunerated positions for
teachers who are loaned by their colleges, generally for a term of two years.
There is a sincere belief in the idea that the teachers and executive staff are a
deciding factor in the improvement of a program. With the assistance of the
Commission, the colleges are in fact developing their own evaluation expertise.
Each evaluation process essentially includes the following steps:
(1) on the one hand, it will help the colleges fulfill even more fully their
educational mission;
(2) on the other hand, it will testify to the efforts the colleges invest in reaching
their objectives and to the results they obtain;
(3) finally, in the medium term, it will use the institutional evaluation to justify
its decision to recommend that the college be authorized to grant the Diploma
of College Studies (CEEC, 2000, p. 1).
As mentioned previously, this project was undertaken because the legislation
establishing the Commission stipulates that it "may recommend to the
educational institution [... ] measures concerning the organization, operation
and academic management of the institution" (Gouvernement du Quebec, 1993,
article 17). Furthermore, the Commission can "recommend to the Minister that
an educational institution be authorized to award the Diploma of College
Studies" (Gouvernement du Quebec, 1993, article 17). For the moment, this
sanction is an exclusive right of the Minister. After developing its own culture of
evaluation and through self-evaluation of a certain number of programs, the
Commission believes that it's now time to introduce "a continuous and concerted
process of analysis and appreciation of the carrying out of the educational
mission of an institution": institutional evaluation (CEEC, 2000, p.3). The
Commission has released guidelines outlining this process.
The Commission, through an interactive process, has covered a lot of ground
and over the last seven years has implemented program evaluation in the
cegep system based on the following main features:
• The program evaluation process relies on the participation of people who are
involved in local implementation of the program to be evaluated (teachers,
management, support staff). It is clear to the Commission that the teachers
are central to pedagogical activities and they have a key role to play in any
evaluation process.
• The Commission uses the following six criteria to evaluate a program: the
program's relevance; the program's coherence; the value of teaching methods
and student supervision including appropriate support; the appropriateness of
human, material and financial resources according to education needs; the
program's effectiveness; and the quality of program management.
• The Commission uses its power of recommendation to identify ways of
improving the quality of the programs.
• The program evaluation reports on each college are put on the Commission's
Web site; this gives the public access to the results.
• The follow-up is taken seriously; the college has a year to implement the
Commission's recommendations and to send in a follow-up report.
• With the adoption of an Institutional Policy on Program Evaluation (IPPE),
each college has to put in place not only an evaluation process but also an
information system to monitor their programs. Qualitative and quantitative
indicators have been developed by the colleges, and efforts are being made to
incorporate a variety of assessment information, including periodic follow-ups
with graduates and data from the province and other sources.
The colleges are in the process of developing more comprehensive systems.
• From an organizational perspective, the institutional evaluation will commit
the colleges to a process of continuous improvement. This new process will
not reduce the importance of program evaluation.
It is clear that the colleges in the province of Quebec have more responsibilities
toward making college education a success. Working within budgetary constraints,
they are finding new ways to fulfill their primary role, and the search for quality
in education is an endless movement.
Over time, schools built their own internal capacity in the area of evaluation and
were becoming continuously improving organizations. But how well did they
succeed? In 1997, an overall evaluation of the Manitoba School Improvement
Program was undertaken by Earl and Lee. They measured individual school
success using various forms of evidence. To do so, they constructed four individual
indexes of success or impact to measure the extent to which MSIP schools
achieved their project goals, influenced student learning, influenced student
engagement, and established school improvement processes (Earl & Lee, 1999, p. iv).
For example, student learning was estimated using indicators such as grades/
marks, graduation rates, credits, retention rates, and quality of student work.
Using all the factors included in their evaluation model, the authors classified
schools by success score on a range of overall scores from 4 through 16.
Each of the four indexes had a scale of 1 to 4, with 4 being the highest score. Seven
schools got a score of 11 or above (on the 16-point scale); 10 schools got a score of 8 to
10; 4 schools got a score below 8; and one school had insufficient data. Earl and
Lee made three general observations (a simple sketch of the scoring arithmetic follows the observations below):
• First, success is not easy, but possible. Many schools made progress on all four
dimensions - 17 schools if one takes scores of 8 and above; some schools went
quite far.
• Second, success is not linear or static. The 10 schools in the second category,
for example, could go one way or the other in the future [...]; time is a factor.
Depending on new developments some schools will push forward, others may
lose ground [...].
• Third, this is a collective effort, not 22 isolated initiatives [... ]. Plans are
underway for MSIP to go on for many years. The Foundation has committed
resources through the year 2000, as MSIp, now a locally based entity, is
working to establish a basic infrastructure to continue to work beyond the year
2000. The commitment to keep going, based on the experiences over the past
seven years, is one of the strongest indicators of the worth of the program.
(Earl & Lee, 1999, p. v)
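The index arithmetic Earl and Lee describe is simple enough to sketch in code.
The following Python fragment is a minimal illustration, not their instrument:
the four index names follow the dimensions described above, while the scoring
rules shown and the example school are assumptions added here.

    # Minimal sketch of the Earl & Lee success-score arithmetic.
    # Index names follow the four dimensions described in the text;
    # the example school and its scores are invented.

    INDEXES = ("project_goals", "student_learning",
               "student_engagement", "improvement_process")

    def overall_score(indexes):
        # Sum four 1-4 index scores into a 4-16 overall score.
        assert all(1 <= indexes[name] <= 4 for name in INDEXES)
        return sum(indexes[name] for name in INDEXES)

    def band(score):
        # The three bands reported by Earl and Lee.
        if score >= 11:
            return "11 or above"
        if score >= 8:
            return "8 to 10"
        return "below 8"

    # Hypothetical school: strong on goals, weaker on engagement.
    example = {"project_goals": 4, "student_learning": 3,
               "student_engagement": 2, "improvement_process": 3}
    print(overall_score(example), band(overall_score(example)))  # 12, "11 or above"

Under this reading, the 17 schools scoring 8 and above are simply those whose
four index ratings sum to at least 8.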
ESTABLISHING POLICIES
In this section we will see how the new Quebec Policy on University Funding,
with its associated accountability measures, will also have a positive impact on
the need for effective educational evaluation. A variety of similar projects were
also put in place by other provinces.
Quebec Policy on University Funding (Quebec, 2000)
In the 1990s, when federal transfer cuts combined with budget restraint at the
provincial level, provinces across Canada began to re-evaluate their university
systems and to examine their funding. Through several consultations, discussion
papers, reports, policies, and strategic plans, a trend developed towards
creating new frameworks for change, with a clear view toward making institutions
more efficient. A research analysis done in 1996 by the Legislative
Research Service of the Ontario Legislative Library drew a picture of the
post-secondary education reforms taking place in the provinces of
Canada. The conclusion of this paper revealed some constant themes:
DISCUSSION
The public and governments are asking for evidence that educational
institutions are providing the highest quality learning experience for every
student and value for money. Since the 1990s, there has been a variety of
responses to the growing demands for public accountability in Canada. This
chapter presented three different answers: (1) the establishment of a govern-
mental agency, the College Education Evaluation Commission (CEEC, Quebec,
1993); (2) the establishment of a non-governmental agency, the Manitoba School
Improvement Program Inc. (MSIP, Manitoba, 1991); and (3) the establishment of
a new funding policy for educational institutions, the Quebec Policy on University
Funding (Quebec, 2000). As these three examples show, educational
evaluation has various purposes. The characteristics of the different processes
are presented in Table 1.
In different places and times, the responses to the demands for public
accountability were adapted to the historical, social, cultural, and economic
contexts in which they took root. High schools in Manitoba and cegeps and
universities in Quebec have different backgrounds.
For example, interest in evaluation at the college level in Quebec took a more
concrete form in 1978 with the creation of the College Council, including its own
evaluation Commission. After ten years, a study concluded that
Table 1. Characteristics of the different processes presented

(1) CEEC (Quebec)
• Purposes: compulsory/state regulation (Act); accountability for the
institution's performance, effectiveness, and results.
• Approaches: formative and learning; participatory; evaluation use; use of
experts (academic peers from colleges and universities, and experts from the
working fields); internal and external procedures (CEEC).
• Results: access to the results and public statement.

(2) MSIP (Manitoba)
• Purposes: voluntary/self-regulation; self-improvement.
• Approaches: formative and learning; participatory/self-evaluation;
empowerment evaluation; assistance and support of an external evaluator;
internal procedures.
• Results: results are communicated to the stakeholders.

(3) Funding Policy (Quebec)
• Purposes: related to funding; external control; accountability to inform, to
explain, and to justify decisions and choices, and to prove efficient
management and results (performance contract).
• Approaches: universities were allowed considerable freedom to run their own
affairs; tension between autonomy and external accountability; internal and
external procedures.
• Results: access to some results.
Action does not always lead directly to a traditional evaluation process but to
different mechanisms, where performance, accountability, transparency,
responsibilities, continuous improvement, outcomes, and stakeholder information
are the main focus. The evaluation issue is more about establishing an effective
performance management process in each school, college, or university.
Furthermore, this movement, which is gaining ground everywhere in Canada,
does include evaluation activities based on facts: developing a local information
system, collecting data, producing performance indicators, evaluating student
performance, monitoring satisfaction of the stakeholders, and producing and
releasing a report on the performance of an educational institution. These activities
are taking place in an organizational culture that encourages improvement
efforts. At the moment, the implementation of evaluation varies from one
province to another but also from one education level to another and from one
institution to another.
Local institutions must find ways to take charge of their own performance. With
this in view, Koffi, Laurin and Moreau (2000) proposed a new management model
for educational institutions entering the 21st century. Their model is based
on concrete strategies such as school-focused management; collegial work as a
school team; teacher and management staff empowerment; team decision
making; teacher professionalism, partnership, self-leadership, and super-
leadership of the school executive; and accountability in carrying out the
educational mission.
Finally, we must recognize the need for continual evaluation of all Canadian
educational institutions if they are to maintain a competitive position in the
world. I sincerely believe that success will only be possible with appropriate
resources, training, tools, support, and incentives to carry out evaluation
activities.
ENDNOTES
1 Alberta, British Columbia, Manitoba, New Brunswick, Newfoundland and Labrador, Nova Scotia,
Ontario, Prince Edward Island, Quebec, Saskatchewan, Nunavut, Northwest Territories, and Yukon.
2 The three commissioners are appointed by the Government for a five-year term, which may be
renewed once. In fact, their term was renewed in September 1998.
3 This part was largely extracted from sections 13 and 19 of the Act constituting the Commission.
4 http://www.ceec.gouv.qc.ca
5 Regular education, both pre-university education (2 years) and technical education (3 years), is
provided for the vast majority of college students; this leads to the Diploma of College Studies
(DEC, from the French Diplôme d'études collégiales). Under continuing education, colleges offer
technical programs leading to an Attestation of College Studies (AEC, from the French Attestation
d'études collégiales). These programs are provided for people who have interrupted their studies for
more than one academic year or left the labour market and who wish to resume studying either
part-time or on an intensive basis.
6 Note: only 7 out of 15 directions are mentioned in this text.
7 For instance, the Fraser Institute, an independent Canadian economic and social research and
educational organization, publishes Annual Report Cards on several provinces' secondary schools
(or high schools): British Columbia's (since 1998), Alberta's (since 1999), Quebec's (since 2000,
with the Montreal Economic Institute), and Ontario's (since 2001). The role of these reports is to
collect objective indicators of school performance, compare the performance and improvement of
individual schools, and produce an overall rating out of 10. The schools are ranked according to
their performance. The Fraser Institute Report Cards are planned for all ten Canadian provinces
by 2002. For more information visit their Web site at http://www.fraserinstitute.ca
REFERENCES
Bryant, C., Lee, L.E., & Zimmerman, M. (1997). Who is the evaluator anyway? How the Manitoba
School Improvement Program has used participatory evaluation to help improve schools. Paper
presented at the Canadian Evaluation Society Conference, Ottawa, Canada.
Canadian Education Statistics Council. (1996). Education indicators in Canada, Pan-Canadian
Education Indicators Program. Toronto, Canada: Author.
Chouinard, M.-A. (2000, October 26). Grande remise en question à l'Université de Sherbrooke. Le
«contrat de performance» pourrait signifier la disparition d'une douzaine de programmes.
Le Devoir, pp. 1 & 8.
Commission d'évaluation de l'enseignement collégial. (2000). The institutional evaluation guide.
Québec: Gouvernement du Québec.
Commission d'évaluation de l'enseignement collégial. (1997). Evaluating programs of study in
Quebec, case study: Technical education programs in early childhood education at the Cégep de Saint-
Jérôme. Québec: Gouvernement du Québec.
Commission d'évaluation de l'enseignement collégial. (1994). The Commission d'évaluation de
l'enseignement collégial: Its mission and directions. Québec: Gouvernement du Québec.
Conseil supérieur de l'éducation. (1999). L'évaluation institutionnelle en éducation: Une dynamique
propice au développement. Rapport annuel 1998-1999 sur l'état et les besoins de l'éducation. Sainte-
Foy: Gouvernement du Québec.
Corriveau, L. (1991). Les cégeps, question d'avenir. Québec: Institut québécois de recherche sur la
culture.
Dennison, J.D., & Gallagher, P. (1986). Canada's community colleges: A critical analysis. Vancouver,
Canada: University of British Columbia Press.
Drummond, A. (1996). Post-secondary education reforms in other provinces. Toronto: Government of
Ontario. Retrieved from Ontario Legislative Library Web site: http://www.ontia.on.ca/library/
repository/mon/1000/l0262792.htm
Earl, L.M., & Lee, L.E. (1998). Evaluation of the Manitoba School Improvement Program. Toronto:
Walter and Duncan Gordon Foundation. Retrieved from http://www.gordonfn.ca/resfiles/MSIP-
evaluation.pdf
Fetterman, D.M. (1996). Empowerment evaluation: An introduction to theory and practice. In D.M.
Fetterman, S.J. Kaftarian, & A. Wandersman (Eds.), Empowerment evaluation: Knowledge and
tools for self-assessment and accountability (pp. 3-46). Thousand Oaks, CA: Sage.
Gouvernement du Québec. (1993). An act respecting the Commission d'évaluation de l'enseignement
collégial and amending certain legislative provisions.
Koffi, V., Laurin, P., & Moreau, A. (2000). Quand l'école se prend en main. Sainte-Foy, Quebec,
Canada: Presses de l'Université du Québec.
Lee, L.E. (1999). Building capacity for school improvement through evaluation: Experiences of the
Manitoba School Improvement Program Inc. The Canadian Journal of Program Evaluation,
pp. 155-178.
Ministère de l'Éducation. (2000). Quebec policy on university funding. Québec: Gouvernement du
Québec. Retrieved December 5, 2000 from: http://www.meq.gouv.qc.ca/reforme/poljinanc/index.htm
Ministère de l'Enseignement supérieur et de la Science. (1993). Colleges for the 21st century. Québec:
Gouvernement du Québec.
Université de Sherbrooke. (2000). Constat et contrat de performance. Une proposition de l'Université
de Sherbrooke au ministre de l'Éducation du Québec, monsieur François Legault. Retrieved
December 6, 2000 from Sherbrooke University website: http://www.usherb.ca
34
Evaluating Educational Programs and Projects in
Australia1
JOHN M. OWEN
The University of Melbourne, Centre for Program Evaluation, Victoria, Australia
INTRODUCTION
Demand Side
On the demand side, there has been a genuine interest by government and the
helping professions in evaluation work. One indication is the large number of
I come back to these as they apply to educational settings later in the chapter.
Taking a broader view, it is possible to link the degree of evaluation work and
the nature of this work to the prevailing political climate. As a democratic
country, Australia has a strong commitment to openness and access to information,
and in recent times, an increasing propensity to use what might be called
"systematic enquiry" in decision making (Lindblom & Cohen, 1979). This is
manifest in trends such as the use of evidence-based practice in the medical and
nursing professions and an increased interest in the role of research in the
development of social and educational policy (Holbrook et al., 1999). It is also
interesting to note that support for evaluation has varied across states and
over time with the preferences of various Commonwealth and State governments
and their openness to evidence over ideology.
Supply Side
To address the brief of this chapter, I had to make some key decisions about
emphasis. Given my earlier statement that there was a strong body of evaluation
practice in Australia, how could I go about representing what has been achieved?
One option would be to describe the entire range of recent educational
evaluation activities in as much detail as possible. But a systematic database of
educational evaluation in Australia over the past five years does not exist.
There is no doubt that such a description would have benefits, as such a
compilation would be useful to those on both the demand and supply side of
evaluation practice. But the compilation of a directory of practice in educational
evaluation would be a daunting task. The compilation would need to identify
evaluation in sectors such as early learning and kindergartens, primary and
secondary schooling, and university education. In addition, one can add in the
technical and further education sector and the adult learning sector, each of
which has burgeoned over the past decade or so.
What to do? I decided that, in the light of the lack of readily available
systematic information to describe the population of evaluation practice, I would
be selective. I decided to use my own informal knowledge of evaluation practice
in Australia to describe and examine a small number of initiatives which should
be of interest to evaluators worldwide.
I will call these initiatives evaluation perspectives. These perspectives stand out
on the evaluation landscape for one reason or another. They can be regarded as
initiatives which have been noticed as signalling new developments in evaluation
in Australia over recent times. They should not necessarily be regarded as good
practice, although I believe that there are elements of desirability in each of
them. A more apt description would be that they represent examples of new
ground being broken, signalling an element of risk in program provision, and a
belief that the boundaries of evaluation practice can be expanded through trying
new ways of doing things.
I will now discuss three perspectives in turn.
The evaluation found that implementation was generally successful with the
exception of the family/community links component (see Figure 1). While it was
too early to be definite about outcomes, there were indications that the TTIS had
major influences on the thinking of students about drug issues where the school
had developed an integrated approach to harm minimisation (McLeod, 1999).
The report also alerted policymakers to the need for additional support for many
schools to ensure that widespread implementation would be achieved. DEET has
committed additional resources to maintain consultants to achieve this objective.
[Figure 1. The Turning the Tide strategy. At the system level, a major report on
drug issues led to the development of the Turning the Tide strategy and of
ISDES; at the school level, drug education units in the curriculum, a drug
education emphasis in welfare, and family/community links; supported by
external support (outside consultants) and internal support (internal change
strategies, documentation, time release); leading to changes in the
understanding and knowledge of students.]
As the title implies, the focus here is on the use of evaluation for accountability
purposes. In general terms accountability is predicated on the assumption that
government and citizens have the right to know whether programs funded from
the public purse are making a difference. As I have said elsewhere,
(1) The development of a planning document, known as the school charter. This
was a plan for the next three years of the work of the school. The charter was
designed to enable schools to identify long-term goals and implementation
strategies. While the original charters concentrated on school priorities, those
developed later in the 1990s were also required to incorporate government
policy directions and priorities, for example a statewide focus on literacy.
It is important to note that four areas had to be given attention in the
school charter: educational programs, school environment, financial
management, and human resource management. The list reflected the interest
of government in school management issues. This was perceived by many
teachers as taking the emphasis away from the schools' core business of teaching
and learning.
The charter document was used as the "first order" accountability strategy by
the department of education. It was signed by the president of the school council
and the principal, and countersigned by the Director of Schools.
[Figure: the accountability cycle - school charter developed, annual reports, and
a triennial review comprising school self-assessment and independent
verification.]
(2) School annual reports. These were designed to be interim evaluative reports
for the school community and the Department of Education. They often
featured major school and student achievements, and commented on progress
towards the achievement of charter objectives. Schools also included scores on a
set of common performance indicators, some of which were provided from a
central source (the Office of Review), which reflected government priorities,
e.g., student literacy achievement.
(3) Triennial school reviews. The review process had two stages: (i) school
self-assessment conducted within guidelines developed by the Office of Review and
(ii) external verification conducted by an independent external review team or an
individual. In some cases, university faculty have been involved in the external
verification. The original aim of the independent review was to verify that the
school's self-assessment represented a fair statement of the school's
achievements over the three-year period. In this sense the reviewer played the
role of an external inspector on behalf of the central authority. However, as the
framework has matured, there has been an increased emphasis in the review
process on formative, improvement-focussed input from reviewers.
The process has been modified during the years since its inception. This can
be understood with reference to competing needs for information at different
times. There is some evidence to support a claim that the original reason for the
development of the accountability processes was to enable the Department of
Education to account to the State Treasury for the expenditure on school educa-
tion. In order to obtain this information, school-level data had to be aggregated
to provide a system-level picture. It is widely held that the development of the
original school charter arrangements and the measures which were to be
collected were driven more by the information needs of the central educational
authority than by the information needs of individual schools. This was
reinforced by the education department introducing student tests of basic skills
to put alongside information which had to be obtained from schools. By the end
of the 1990s the Department of Education was able to show trends on statewide
indicators that the achievement of students was on the increase.
Once the imperative to set up gross statewide indicators had been achieved,
the Office of Review gave more of its attention to ways in which data collected
centrally and by individual schools could be used for individual school
improvement.
By 1999, the triennial school review had taken on a much more formative
perspective. This had been brought about by a combination of factors, including
the sympathy of most external reviewers towards the information needs of the
schools and the provision of indicator information that was meaningful to
principals and school review panel members. This included information which
allowed them to compare the performance of the school with "like schools" -
those with similar student populations on socio-economic grounds (incorrectly
labelled benchmarks in the Office of Review literature).
There has not to this date been a systematic metaevaluation of the Accountability
Framework. However, there is some anecdotal information available from
colleagues who have been close to the action. An interesting reflection is that, by
and large, there were concerns with the relevance and quality of much of the data
assembled by schools as part of their charter monitoring, and with its
interpretation (an example is the use of parent opinion surveys). Yet, there has
been widespread satisfaction in the latter years of the Framework with the
availability of comparative data provided by the government and the feedback
and consultancy processes in which external evaluators play a leading role. One
explanation for this is that schools see systematic data as important if they
confirm or explain knowledge of a more tacit nature which the schools accumulate
through more informal methods. This knowledge is also used to make alterations
to the new charter which formalised the school plan for the subsequent three-year
period.
However, 1999 also brought in a change of government in Victoria and the
almost inevitable formal review of educational provision. In regard to the
Accountability Framework just described, the review made the following
statement:
"Several initiatives are now in place that might strengthen the focus of school
planning on continuous improvement while retaining core elements of
accountability. These initiatives include:
While these activities are valuable in their own right, there is a need for the
Department to develop and articulate a coherent policy framework which
establishes a commitment to school improvement and identifies the strategies
the Department intends to adopt to support schools in their improvement
efforts. This is reflected in the issues raised in both the focus groups and
submissions to the review in the general area of school improvement. These
included:
information, the evaluator might assist decision makers in setting directions and,
in some cases, actually assisting with change and improvement strategies.
Evaluations based on this perspective have been a strong feature of the
educational landscape in Australia over the last two decades. Some education
authorities have actively encouraged action research among schools.
Schools were provided with small amounts of money, about AUD $5000, and
encouraged to:
Over a period of four months, CPE staff worked intensively in the school. The
approach taken was to provide feedback about the conduct of the evaluation to
staff and management; for the latter, this involved regular face-to-face meetings.
In addition the CPE staff assisted the principal to disseminate information to
parents.
Key "products" of the evaluation, which was completed in mid-September
1998, were:
DISCUSSION
What can we learn from these perspectives in terms of their potential influence
on educational evaluation in Australia and around the world?
Perspective 1 sees evaluation in terms of the determination of systemic impact
of an innovation. The success of evaluation work within this perspective relies on
the evaluator having a detailed knowledge of educational change theory. Ideally,
this theory should be available to developers when they are preparing
educational policy. This is most likely to happen if the evaluation is planned at
the same time the policy is being developed. However, even if this is not the case,
evaluators can base post hoc evaluations on the relevant theory, as a basis for
pointing out any shortcomings of a given program or innovation. There is also an
opportunity for evaluators to act as educators to inform policy makers about
theory with a view to improving their general conceptual understanding about
innovation impact.
In the Australian context, I believe that there is an emerging trend to
undertake educational impact evaluations using these guidelines.
Perspective 2 sees evaluation in terms of setting up monitoring systems and
using the findings of these systems to make decisions about programs and organi-
sations. The evaluator must have expertise in translating organisational goals
into meaningful performance information, encompassing both qualitative and quantitative
indicators. Those engaged in evaluations of this kind must understand the
limitations of indicators in the synthesis of information and the development of
conclusions about the state or condition of a given organisation at a given time.
In the Australian context, I believe that educational evaluations based on good
practice within this perspective are difficult to find, either because they have not
been widely attempted or have not been published.
Perspective 3 sees evaluation as an activity which complements local decision
making and program delivery. The evaluator must have a grasp of organisational
change principles and group dynamics in addition to more conventional evalu-
ation skills, and be prepared to work with the agenda set by the organisation. In
the Australian context, I believe that there is a significant set of evaluators who
have worked within this perspective and that there will be a continuing increase
in the need for evaluation consultants to work in this way.
While the perspectives are in themselves an indication of the diversity of evalu-
ation practice in Australia, it is useful to now examine what is common across
them. In all cases, the program under review and the context are interlinked. As
evaluators, we would not do justice to providing a full understanding of any of
these situations if the task was merely perceived as evaluating a single entity. In
Perspective 1, the fate of Turning the Tide was dependent on school- and
system-level effects and the interaction between them. In Perspective 2, an
important issue was the competing needs of the schools and the central authority
for information about the effects of the school charter policy. In Perspective 3,
the small-scale in-school programs were also designed to have a
whole-school impact. The interactions described here point to the need for
systems level thinking. Put simply, in social and educational settings, there is an
interlinkage of programs and organisational arrangements. If one makes a
change in one part of an organisational system, other parts are also affected by
that change.
Neophyte evaluators have often been advised to become familiar with the
evaluand or program under review as part of the evaluation process. It follows
from the previous paragraph that evaluators need to do more than this. An
essential requisite for a well-founded evaluation is to become familiar, not only
with the program, but with a broader framework which includes the organisa-
tional setting or the educational system within which the program is nested. At
the planning and negotiation phase of an evaluation, this knowledge can be
invaluable in determining the direction of the evaluation study and the key issues
which focus the data collection and analysis.
Finally, an issue which suggests itself is whether evaluators can or wish to work
across a range of evaluation situations. The perspectives presented here
indicated the variety of evaluation practice which one can expect to encounter in
Australia and elsewhere. There is also an associated issue of preference. An
evaluator may be at home working closely with providers but feel uncomfortable
working in evaluations with a strongly managerial orientation. There are argu-
ments for plurality of practice and specialisation. What is important, however, is
that prospective evaluators in training programs have an opportunity to
appreciate the range of potential evaluation practice, and make informed
choices of the area or areas in which they decide to work.
ENDNOTES
1 The editors of this handbook asked me to examine the evaluation of educational programs and
projects within the Australian context. I note that other writers have made this distinction; see, for
example, Shadish, Cook, and Leviton (1991). The implication is that there is a hierarchy of
interventions, with projects seen as the "on the ground" manifestation of the program. There is also
the implication that providers have little influence on the design of the program. This is generally
not the case in Australia, where there is more of a tradition of local program design and direct
funding of local action. A more appropriate distinction in the Australian context is between "big P"
and "little p" programs. "Big P" programs are typically those provided by federal and state
governments, while "little p" programs are managed by agencies directly responsible for their
implementation. For simplicity I use the generic term "program" in the text to refer to both levels
of program development and provision.
2 An innovation, for the purposes of this discussion, is a policy or program that is regarded as new by
those who develop it and/or in the eyes of the potential users of the entity (Rogers, 1995).
REFERENCES
Elmore, R. (1996). Getting to scale with successful educational practices. In S. Fuhrman, & J. O'Day
(Eds.), Rewards and reforms: Creating educational incentives that work (pp. 249-329). San Francisco:
Jossey-Bass.
Funnell, S. (1997). Program logic: An adaptable tool for designing and evaluating programs. Evaluation
News and Comment, 6(1), 5-17.
Government of Victoria. (2000). The next generation: Report of the Ministerial Working Party.
Accountability and Development Division, Department of Education, Employment and Training,
November 2000.
Harvey, G. (2000). Sustaining a policy innovation: The case of turning the tide in schools drug education
initiative. Unpublished doctor of education thesis proposal, Melbourne, Australia, Centre for
Program Evaluation.
Havelock, R.G. (1971). Planning for innovation through dissemination and utilization of knowledge.
Ann Arbor, MI: Center for Research on Utilization of Scientific Knowledge.
Holbrook, A., et al. (1999). Mapping educational research and its impact on Australian schools.
Camberwell, Victoria: Australian Council for Educational Research.
Knapp, M.S. (1997). Between systemic reforms and the mathematics and science classroom: The
dynamics of innovation, implementation and professional learning. Review of Educational Research,
67(2), 227-266.
Leeuw, F.L. (1996). Auditing and evaluation: Bridging a gap, worlds to meet? In C. Wisler (Ed.),
Evaluation and auditing: Prospects for convergence. New Directions for Evaluation, 71, 51-60.
Lindblom, C.E., & Cohen, D.K. (1979). Usable knowledge. New Haven and London: Yale University
Press.
Livingston, J.J., Owen, J.M., & Andrew, P.F. (1998). The computer enhanced curriculum: Use of
notebook computers in the middle school curriculum. Melbourne, Victoria: Centre for Program
Evaluation, The University of Melbourne.
Mader, P. (1998). School based research and reform. Proceedings of the Annual Conference of the
Australasian Evaluation Society, 2, 714-731.
McLeod, J. (1999). Turning the tide in schools. (Report to the Department of Education, Employment
and Training, Melbourne, Vic, November, 1999). Melbourne, Victoria: McLeod Nelson &
Associates.
Owen, J.M., & Lambert, F.C. (1998). Evaluation and the information needs of organisational leaders.
American Journal of Evaluation, 19(3), 355-365.
Owen, J.M., & Rogers, P. (1999). Program evaluation: Forms and approaches (2nd ed.). London:
Sage.
Pollitt, C., & Summa, H. (1996). Performance audit and evaluation: Similar tools, different
relationships? In C. Wisler (Ed.), Evaluation and auditing: Prospects for convergence. New
Directions for Evaluation, 71, 29-50.
Preskill, H., & Torres, R.T. (1996, November). From evaluation to evaluative enquiry for organisational
learning. Paper presented at the annual meeting of the American Evaluation Association, Atlanta,
GA.
Rogers, E.M. (1995). Diffusion of innovations. (4th ed.). New York: The Free Press.
Rogers, P., Hacsi, T.A., Petrosino, A., & Huebner, T.A. (Eds.). (2000). Program theory in evaluation:
Challenges and opportunities. New Directions for Evaluation, 87.
Rossi, P.H., & Freeman, H.E. (1989). Evaluation: A systematic approach. Newbury Park, CA: Sage.
Rowe, W., & Jacobs, N. (1996). Principles and practices of organisationally integrated evaluation.
Unpublished paper.
Shadish, W.R., Cook, T.D., & Leviton, L.C. (1991). Foundations of program evaluation. Newbury
Park, CA: Sage.
Sowell, T. (1996). Knowledge and decisions. New York: Basic Books.
Yeatman, A., & Sachs, J. (1995). Making the links: A formative evaluation of the first year of the
Innovative Links Project between universities and schools for teacher professional development.
Murdoch University, School of Education.
Section 9

Old and New Challenges for Evaluation in Schools
Education systems and schools have changed considerably in recent years. There
has been a shift from steering schools by planning inputs and monitoring
processes to steering through the measurement of outcomes. Connected to this
have been trends toward decentralization, increasing autonomy for schools,
increasing reliance on market mechanisms to determine funding levels, and the
creation or expansion of school choice. These changes can also be seen across a
large number of countries.
Side by side with the changes, the demands and needs for evaluation in schools
have also changed. These include (i) increasing pressure on schools to demon-
strate success and to market themselves; (ii) increasing demands for evaluative
information from parents (consumers) and policymakers; and (iii) increasing
demands on school leaders to demonstrate effectiveness as measured by
standardized tests.
This section of the handbook contains a collection of chapters that address old
and new challenges for evaluation in schools. Broadly defined, the challenges
include institutionalizing the practice of evaluation in schools; developing
evaluation practices to address new demands for accountability and to serve the
information needs of consumers; and evaluating new aspects of schools, such as
the implementation of technology and the use of new technologies in instruction.
A problem of long standing for schools that has not gone away is the institu-
tionalization or mainstreaming of the practice of evaluation. While some school
leaders have come to appreciate, and benefit from, evaluation, few have been
able to incorporate it into the daily practice of schools, and those who have
succeeded have struggled to sustain its use and application. Evaluators and
practitioners alike remain doubtful that we will be able to institutionalize evaluation
practices in a satisfactory manner in schools.1 This skepticism, however, does not
mean that efforts to achieve it have subsided. Two academics who together have
spent nearly a half century wrestling with this issue are Daniel Stufflebeam and
James Sanders. Both have contributed chapters to this section of the handbook.
Changes in education have made it more important that evaluations should focus
on outcomes and serve the emerging information needs of consumers as well as
of practitioners and policymakers. For example, increasing school autonomy,
school choice, and demands for accountability have spurred the need for easily
accessible information about school performance. This information is important
for policymakers and school administrators who steer the schools and for parents
who want easily accessible and comparable information about schools to make
informed decisions on where to enroll their child(ren).
In the third chapter in the section, Robert Johnson examines the development
and uses of school report cards, or school profiles, as he refers to them, and
draws on his earlier work with Richard Jaeger and Barbara Gorney to provide a
complete and thorough review of what is known about them.
Indicators of school performance, both quantitative (such as student
attendance rates, pupil-teacher ratio, and grade-level achievement scores) and
qualitative (such as information about school and community partnerships and
awards received by the school), are examined. Other important topics covered
include potential stakeholder audiences for school-level profiles, changes and
trends in profiles over time, advice in the selection of indicators for profiles,
guidance in the formatting and presentation of data, as well as selection of
comparison groups and dissemination of the information profiles provide.
Examples of school profiles from a variety of countries are provided in the
chapter.
In the last 15 years there has been enormous public and private investment in
technology for primary and secondary schools. Early studies have focused on
quantifying the available equipment by school or classroom, while later studies
have attempted to link the technology with changes in instructional practices and
learning processes that could be measured through observed behavior, changed
attitudes, and improved student performance.
The last chapter in the section, prepared by Catherine Awsumb Nelson,
Jennifer Post, and William Bickel, examines one of the new challenge areas for
school evaluation - the evaluation of technology programs. The authors argue
that it is necessary to develop human capital in order to successfully
institutionalize technology in the culture and practices of schools, and propose
an evaluation framework in which the growth of school-level ownership of
technology over time is assessed as a series of three interrelated learning curves:
maintaining the technology infrastructure; building teacher technology skills;
and integrating technology into teaching and learning. Through the development
of building-level human capital, schools gradually move up each learning curve
and (ideally) from one learning curve to the next, increasing their ownership of
the technology and the extent to which it is integrated into the core educational
processes of teaching and learning.
The authors found that to use the learning curves as an evaluation framework,
it was necessary to further specify that movement on each curve proceed along
three dimensions: depth (quality of technology use); breadth (quantity and variety
of technology use); and sustainability (infrastructure to support technology experi-
ences over time). For each dimension of the three learning curves, the evaluation
generated a set of factors related to successful technology implementation and
eventual institutionalization in school learning environments.
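Read this way, the two-part framework amounts to a 3 x 3 rubric: three learning
curves, each rated along the three dimensions. The following Python sketch
illustrates only that structure; the curve and dimension names come from the
chapter summary above, while the 0-4 rating scale, the averaging, and the
example school are assumptions added here for illustration.

    # Sketch of the learning-curve framework as a 3 x 3 rubric.
    # Curve and dimension names follow the chapter summary;
    # the 0-4 scale, averaging, and ratings are illustrative.

    CURVES = ("maintaining the technology infrastructure",
              "building teacher technology skills",
              "integrating technology into teaching and learning")
    DIMENSIONS = ("depth", "breadth", "sustainability")

    def curve_position(ratings):
        # Average the three dimension ratings for one curve.
        return sum(ratings[d] for d in DIMENSIONS) / len(DIMENSIONS)

    # Hypothetical school: solid infrastructure, early integration.
    school = {
        CURVES[0]: {"depth": 3, "breadth": 3, "sustainability": 2},
        CURVES[1]: {"depth": 2, "breadth": 3, "sustainability": 2},
        CURVES[2]: {"depth": 1, "breadth": 2, "sustainability": 1},
    }
    for curve in CURVES:
        print(f"{curve}: {curve_position(school[curve]):.1f} of 4")

A profile like this one, with higher positions on the infrastructure curve than
on the integration curve, is the pattern the framework would lead one to expect
early in an implementation.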
To illustrate the successful application of the model, the authors described a
multi-year evaluation of school technology implementation in a large urban
district in the U.S. The two-part framework for evaluating implementation
presented in the chapter has a number of demonstrated and potential uses for
educators and policymakers interested in understanding and enhancing the
potential for technology to become institutionalized in schools.
OBSTACLES TO INSTITUTIONALIZING EVALUATION IN SCHOOLS
Schools are involved in evaluation activities whether they wish to or not. In fact,
many school leaders and educators are largely unaware of the extensive
evaluation that goes on at their school because it is often done by persons outside
the school and it serves the needs of external stakeholders rather than those of
school staff. For example, most externally funded programs are likely to be
evaluated. Likewise, schools are often the focus of state or district level evalu-
ations, as well as of studies conducted by universities or private sector organizations,
but may not receive results of the studies. In order to institutionalize evaluation,
and to serve their information needs, school staff will need to become more
actively engaged in evaluating their school, and school leaders will need to
coordinate and adapt the various evaluation activities already taking place that
involve their school. Schools will also need to find ways to overcome the most
commonly cited obstacles to institutionalizing evaluation such as a lack of human
and financial resources and the fact that evaluation is low on lists of priorities.
Recent reforms that emphasize greater accountability are placing increased
emphasis on evaluation in schools, yet these reforms also introduce new
obstacles such as the increasing involvement of private groups and a tendency to
narrowly evaluate schools based on student performance alone. While leaders in
the field will continue to think about ways to overcome the many obstacles that
inhibit the use of evaluation in schools, any long-term solution will be likely to
require a number of changes that involve preservice and inservice education of
educators, the provision of technical assistance, and specific mandates that are
supported by adequate funding and steered with a balance of "carrots and
sticks."
ENDNOTE
1 At the 2000 annual meeting of the American Evaluation Association, I chaired a session entitled
"Institutionalization of Evaluation in Schools." The presenters identified obstacles to the
institutionalization of evaluation, and none appeared optimistic about succeeding in this endeavor.
At the close of the session, I asked the large audience a direct question: "How many believe that in
5 or 10 years we will still be wrestling with how to incorporate evaluation into the mainstream of
work activities in schools?" Three-quarters answered in the affirmative.
35
Institutionalizing Evaluation in Schools1
DANIEL L. STUFFLEBEAM
The Evaluation Center, Western Michigan University, Ml, USA
Evaluation, which is the systematic process of assessing the merit and/or worth of
a program or other object, is essential to the success of any school or other social
enterprise. A functional school evaluation system assesses all important aspects
of the school, provides direction for improvement, maintains accountability
records, and enhances understanding of teaching, learning, and other school
processes. The focus of this chapter is on how schools can institutionalize a
sound system of evaluation. Institutionalization of school-based evaluation is a
complex, goal-directed process that includes conceptualizing, organizing, funding,
staffing, installing, operating, using, assessing, and sustaining systematic evaluation
of a school's operations. The chapter is directed to persons and groups dedicated
to improving a current evaluation system or installing a new one. These include
school councils, principals, teachers, and those university, government, and
service personnel engaged in helping schools increase their effectiveness.
Advice in the chapter is conveyed in "Type I" and "Type II" models. A Type I
Model is defined as a representation of the requirements of a sound and fully
functional evaluation system, a Type II Model as an idealized process for institu-
tionalizing such a system. The example of a Type I Model that is presented is the
CIPP Model (Stufflebeam, 2000), which is widely used throughout the U.S. and
in many other countries. CIPP is an acronym that denotes four main types of
evaluation: Context to assess needs, problems, and opportunities; Inputs to assess
planned means of addressing targeted needs; Process to assess implementation
of plans; and Products to assess intended and unintended outcomes.
Work in developing learning organizations (Fullan, 1991; Kline & Saunders,
1998; Senge, 1990; Senge, Kleiner, Roberts, Ross, & Smith, 1994) is worth consid-
ering in efforts to install evaluation systems in individual schools. However,
Senge's five proposed disciplines of a learning organization (personal mastery,
shared vision, mental models, team learning, and systems thinking) need to be
extended to include a sixth and most fundamental discipline: evaluation.
Furthermore, since in Bloom's (1956) Taxonomy of Educational Objectives,
evaluation was accorded the highest level in the learning process, the process for
institutionalizing evaluation in a school must include evaluation of the
evaluation system, or what evaluators call metaevaluation (Stufflebeam, 2001).
The chapter is divided into three parts. The first lays out a Type I model for a
school evaluation system. In part two, the literature on learning organizations
and metaevaluation is used to suggest a Type II model for systematically adopting,
installing, operating, and assessing a school evaluation syst~m. The final part
considers how schools can acquire the leadership, expertise, and assistance
needed to successfully institutionalize evaluation.
Why Should Schools Conduct Evaluations?

This question poses an issue that every educator needs to settle firmly in order
to proceed confidently with evaluations and convince others that the evaluations
are worth doing. All schools have problems, and evaluation is needed to delin-
eate and help solve them. Furthermore, evaluation is inevitable, since grading
homework and other student products is part and parcel of the teaching process.
Moreover, schools are learning organizations and should therefore promulgate
a process that helps their staff continually study, assess, and improve services.
School personnel need to constantly improve their understanding of the needs of
their students and how much they are learning. Current practices should not be
considered fixed but instead subject to innovation and improvement. School staff
should value mistakes for what they can teach, should share lessons learned with
colleagues, and should archive the lessons for later use by new generations of
school personnel. They should conduct holistic assessments to assure that the
school's various components are interacting functionally to produce the best
possible educational outcomes. Clearly, a school's professionals need evaluation
in order to increase their understanding of their students and school services;
strengthen the services accordingly; function as a coherent, improving system;
identify intended and unintended outcomes; and earn reputations as highly
successful, responsible professionals.
What Should Schools Evaluate?

The general response to this question is that school personnel should evaluate all
important aspects of the school and constantly seek to improve them individually
and collectively. The most important objects for school evaluation are students,
personnel, and programs. This response sets priorities rather than discounting
the importance of evaluating buildings, grounds, vehicles, equipment, bus routes,
policies, and other objects. As appropriate, all important aspects of the school
should be evaluated. However, a three-part system of student, personnel, and
program evaluation is a good place to begin. The school's most important clients
are the students. Its most important resources are its teachers, administrators,
and other personnel. And it makes its most important contributions by delivering
effective programs. Thus, all three parts of this crucial triad should be subjected
to evaluation.
Student evaluation involves assessing the needs, efforts, and achievements of
each student. These evaluations should be carried out to help plan and guide
instruction and other school services, to give feedback to students and their
parents or guardians, and to document students' accomplishments.
Personnel evaluations include assessments of job candidates, clarification of
staff members' duties, assessments of their competence, and determination of
performance strengths and weaknesses. These evaluations are vital in selecting,
assigning, developing, supervising, rewarding, and retaining or dismissing school
staff. An effective personnel evaluation system addresses all these aspects of a
professional's involvement with the school.
Program evaluations cover all of the instructional and other goal-directed
services offered to students, in, for example, math, science, and language
programs, and recreation, health, food, and transportation services. Curricular
program evaluations examine the fit of instructional offerings to assessed student
needs, the quality of materials, the coverage of content, the appropriateness and
comprehensiveness of offerings for the full range of students' abilities and special
needs, the offerings' cost-effectiveness compared with other available approaches,
the vertical and horizontal coherence of programs, and their individual and
collective effectiveness.
Noncurricular program evaluations include assessments of such endeavors as
counseling, home visitation, health services, professional development, and
school breakfasts. In general, evaluations of noncurricular programs and services
look at many of the concerns that are studied in curricular programs (e.g.,
comparative costs and effects of options and responsiveness to students' assessed
needs). They may also look at safety, maintenance requirements, efficiency, and
other such general matters. Program evaluations are important for assessing
new, special projects and ongoing programs and services. Sanders, Horn, Thomas,
Tuckett, and Yang (1995) provided a useful manual for assessing the full range
of school programs, facilities, and services.
It is crucial to evaluate at all school levels. Basically, these are the students,
classroom, and school as a whole. From the perspectives of students and parents,
student-level evaluation is an essential means of highlighting and effectively
addressing each student's special needs. At the classroom level, the teacher needs
to look not only at the needs and progress of individuals, but also of the full
group of students and pertinent subgroups. This is necessary to plan instruction
that is both efficient and effective and to assess the impacts of both individ-
ualized and group instructional approaches. Evaluation across the entire school
is especially important in assessing the extent to which the school's different parts
are effectively integrated and mutually reinforcing. For example, instruction at
each grade level should effectively and efficiently build on instruction in the
previous grade and prepare the student for the next grade level's offerings. A
systems concept should govern the evaluation process, so that the school's parts
fit well together and reinforce each other and so that the overall achievements
are greater than the sum of achievements of the school's individual parts.
According to the CIPP Model (Stufflebeam, 2000), there are four types of evalu-
ation questions. Each may be presented in proactive and retrospective forms so
that evaluations are appropriately applied both to guide a school's efforts and to
assess them after the fact.
• What assessed needs should be, or should have been, addressed? (context)
• By what means can targeted needs best be addressed, or were the plans for
services appropriately responsive to assessed needs and better than other
available courses of action? (inputs)
• To what extent is the planned service being implemented appropriately, or to
what extent was it well executed? (process)
• What results are being or were produced, and how good and important are
they? (products)
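Since each of the four types pairs a proactive with a retrospective form, the
whole scheme can be laid out as a small two-way lookup, as in the Python sketch
below. The sketch is illustrative only; the question wordings paraphrase the list
above, and the data structure is not part of the CIPP Model itself.

    # The four CIPP evaluation types, each in proactive and
    # retrospective form, as a simple lookup table. Wordings
    # paraphrase the questions listed in the text.

    CIPP_QUESTIONS = {
        "context": {
            "proactive": "What assessed needs should be addressed?",
            "retrospective": "What assessed needs should have been addressed?",
        },
        "inputs": {
            "proactive": "By what means can targeted needs best be addressed?",
            "retrospective": ("Were the plans responsive to assessed needs and "
                              "better than other available courses of action?"),
        },
        "process": {
            "proactive": "Is the planned service being implemented appropriately?",
            "retrospective": "To what extent was the service well executed?",
        },
        "products": {
            "proactive": "What results are being produced, and how good are they?",
            "retrospective": "What results were produced, and how important are they?",
        },
    }

    # Print the forward-looking question for each evaluation type.
    for evaluation_type, forms in CIPP_QUESTIONS.items():
        print(evaluation_type, "-", forms["proactive"])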
Table 1. Four Types of Evaluation: Illustrative Questions for Programs

Programs (proactive)
Context: How is the program related to meeting the school's standards? To what
students is the program targeted? What are the assessed needs of the targeted
students? What program goals would be responsive to students' assessed needs
and consonant with the school's standards? To what extent are program goals in
need of revision?
Inputs: Is the program plan sufficiently responsive to assessed student needs and
school standards? Is the program plan better than alternatives being implemented
in other schools? Is the program plan understood and accepted by those who will
have to carry it out? Does the program plan include a defensible complement of
content and a reasonable sequence across the grades? Is the plan workable and
affordable? Does the plan provide sufficiently for assessment of process and
product?
Process: Is the program being carried out according to the plan? Does the plan
need to be modified? Is there a need to further train staff to carry out the
program?
Products: To what extent are student needs being met? To what extent are the
school's student outcome standards being met?

Programs (retroactive)
Context: Were program objectives and priorities based on assessed needs of the
students to be served? Were program objectives and priorities consonant with the
school's student outcome standards? Were program objectives appropriately
revised as more was learned about the students?
Inputs: Was the strategy for this program fully responsive to program goals,
assessed student needs, and school standards? Was the strategy fully functional,
affordable, acceptable to the staff, and workable? Was the strategy better than
alternatives being used by other schools? Did the program plan include an
adequate evaluation design?
Process: Was the program adequately implemented? Was the plan appropriately
improved? Were staff adequately trained to carry out the program? Did all the
staff fulfill their responsibilities?
Products: To what extent were all the students' targeted needs met? To what
extent were the school's program outcome standards fulfilled? To what extent
were the program outcomes worth the investment of time and resources? To what
extent was the program free of negative side effects?
The proactive questions are oriented to getting the evaluation input needed to
plan and carry out services successfully. The retroactive questions are focused
on responding after the fact to audiences that want to know what services were
offered, why, according to what plans, how well, and with what effects.
Who Should Conduct the Evaluations?

The general answer to this question is that all the school's professionals and
policymakers should conduct the evaluations needed to fulfill their responsibilities.
Teachers regularly have to obtain evidence on the needs, efforts, and achieve-
ments of their students and should, periodically, examine the effectiveness of
their teaching methods. Principals/headmasters need to evaluate how well each
teacher is performing and, with their faculties, how the school as a whole is
functioning. School councils must evaluate principals, school policies, resource
requirements, and the appropriateness of resource allocations. These examples
confirm that evaluation is both a collective and an individual responsibility of all
those persons charged with making the school work for the benefit of the
students and community.
In certain cases it will be important to engage independent evaluators. Examples
include district evaluation offices in countries that have school districts,
intermediate educational service centers, state or national government offices of
evaluation, evaluation companies, and university-based evaluation centers. These
specialized organizations can add objectivity, technical competence, and credi-
bility to school evaluations.
When Should Evaluations Be Conducted?

Ideally, evaluation will be integrated into all the school's processes and will be an
ongoing activity. Insofar as possible, teachers should embed evaluative questions
and feedback into their teaching materials and daily instructional processes.
Developing curriculum-embedded evaluation should be a primary objective of
all schools that aspire to be effective learning organizations. Evaluations that are
command performances and only yield results weeks or months following data
collection may be necessary to support the purposes of public accountability and
understanding in a research sense, but are of limited utility in guiding and
improving instruction for particular students and groups of students. For that
purpose, evaluations should be scheduled to provide timely and relevant feed-
back to users, in consideration of their agenda of decisions and other functions
that require evaluative input.
During the year, the principal should evaluate or engage others to evaluate
designated teachers' performances and provide feedback for improvement.
When feasible, each year all teachers should be provided with feedback they can
use to strengthen performance. But each teacher need not be formally evaluated
every year. It is best to limit the formal evaluations to those teachers who most
need attention, to those who obviously have problems, or to those who are being
considered for tenure and/or promotion. Following this idea, the evaluations can
be targeted where and when they will do the most good and conducted with
rigor as opposed to being conducted as a superficial ritual involving all teachers
every year.
At least once, and preferably twice a year, the principal and the total school
staff and/or logical groupings of staff should be engaged in evaluating selected
programs and planning for improvement. As in the case of teacher evaluations,
not every program should be evaluated every year. Rather, the staff should target
programs of particular concern and over time should periodically evaluate all the
programs with the central goal of program improvement. Program staffs period-
ically should compare their programs with others that might be adopted or that
might be developed locally. While innovation and creativity are to be encouraged
and supported, evaluations should be planned and conducted in concert with the
innovative activities. Evaluations should be conducted when plans for new
programs are submitted to screen those that are worthy of execution or tryout
from those that predictably would fail, be dysfunctional, and/or waste resources.
In the school council's evaluation of the principal, it can be useful to have a
phased process across the year. For example, in a first quarter (it might be either
summer or autumn) the principal and council could examine student achieve-
ment results from the previous year and clarify school priorities for the current
school year. In the second quarter, they could examine the school's plans for
achieving the new priorities and define the performance criteria for which the
principal will be held accountable. In the third quarter, the council and principal
could meet to review and discuss the principal's progress in leading the school to
carry out the year's plans of action. Finally, in the fourth quarter, they could
examine school and student outcomes, and the council could compile and present
its evaluation of the principal's service and accomplishments. As appropriate,
during this fourth quarter the council might also prescribe an improvement plan
for the principal. Most important, the principal and council should use the year's
principal evaluation results to plan for improvements in the coming year. This
example shows how evaluation can be integrated into a school's functions. The
point of the illustrative process for evaluating the principal is not just to issue
judgments of her or his performance, but to do so in consideration of the school's
ongoing planning, review, and improvement process.
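To show how such a phased cycle might be kept in front of the council, here is a minimal Python sketch that encodes the four quarters described above as data; the quarter labels and one-line focus statements are illustrative paraphrases, not part of the chapter's model.

    # A minimal sketch of the phased principal-evaluation cycle described
    # above, encoded as an ordered list of (quarter, focus) pairs.
    PRINCIPAL_REVIEW_CYCLE = [
        ("Quarter 1", "Examine last year's student achievement; clarify school priorities."),
        ("Quarter 2", "Review plans for the new priorities; define accountability criteria."),
        ("Quarter 3", "Review the principal's progress on the year's action plans."),
        ("Quarter 4", "Examine outcomes; compile the council's evaluation; plan improvements."),
    ]

    def agenda(quarter):
        """Look up the review focus for a quarter, e.g., agenda('Quarter 3')."""
        return dict(PRINCIPAL_REVIEW_CYCLE)[quarter]

    for quarter, focus in PRINCIPAL_REVIEW_CYCLE:
        print(f"{quarter}: {focus}")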
School evaluations should not be carried out in the abstract, or as a ritual, but
should serve the functional information needs of particular persons or groups.
Key audiences include parents, students, teachers, the principal, the school
council, government agencies, funding organizations, and others as appropriate.
Sometimes the evaluator and audience are the same, such as when the teacher
gathers evaluative information on the effectiveness of teaching procedures as a
basis for planning future instruction. In other cases, the audience is distinct from
the evaluator, such as when the teacher provides student report cards for review
by the student and parents/guardians. In general, the school staff should identify
the full range of audiences for evaluation findings, project and periodically update
their information needs, and design and schedule evaluations to meet the infor-
mation needs. Ideally, evaluations during the year will be guided by a basic
annual evaluation plan developed at the year's beginning.
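As a concrete, purely hypothetical illustration of such a basic annual plan, the following Python sketch maps each audience to its projected information needs and reporting dates; the audiences, needs, and dates are invented examples, not prescriptions.

    # A minimal sketch of a basic annual evaluation plan: each audience is
    # mapped to its information needs and scheduled report dates.
    # All entries below are hypothetical examples.
    from datetime import date

    annual_plan = {
        "parents": {
            "needs": ["student progress reports"],
            "report_dates": [date(2024, 11, 15), date(2025, 3, 15)],
        },
        "school council": {
            "needs": ["program outcomes", "principal's progress"],
            "report_dates": [date(2024, 10, 1), date(2025, 6, 30)],
        },
        "teachers": {
            "needs": ["classroom assessment summaries"],
            "report_dates": [date(2024, 9, 30), date(2025, 1, 31)],
        },
    }

    def reports_due(plan, by_date):
        """List the audiences with a scheduled report on or before by_date."""
        return sorted(aud for aud, info in plan.items()
                      if any(d <= by_date for d in info["report_dates"]))

    print(reports_due(annual_plan, date(2024, 12, 31)))
    # -> ['parents', 'school council', 'teachers']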
The specific criteria for given evaluations depend on the particular audiences,
objects of the evaluation, etc. These should be carefully determined in the plan-
ning process. To assist in this process, the school's personnel may find it useful to
consider eight classes of generic criteria: basic societal values, especially equity
and excellence; criteria inherent in the definition of evaluation, particularly merit
and worth; criteria associated with context, inputs, process, and products; defined
institutional values and goals; areas of student progress, including intellectual,
social, emotional, vocational, physical, aesthetic, and moral; pertinent technical
standards, especially in the cases of the safety and durability of facilities, vehicles,
and equipment; duties and qualifications of personnel; and more specific,
idiosyncratic criteria.
In the United States and Canada, the question of what standards evaluations should
meet is answered by invoking the standards produced by the Joint Committee on
Standards for Educational Evaluation.
The Committee has produced standards for personnel evaluations (1988) and
program evaluations (1994) and currently is developing standards for student
evaluations. All the Joint Committee's standards are grounded in four basic
requirements of a sound evaluation: utility, feasibility, propriety, and accuracy.
The utility standards require evaluators to address the evaluative questions of
the right-to-know audiences with credible, timely, and relevant information and
to assist audiences to correctly interpret and apply the findings. According to the
feasibility standards, evaluations must employ workable procedures and be
politically viable and cost-effective. The propriety standards require evaluators
to meet pertinent ethical and legal requirements, including working from clear,
formal agreements; upholding the human rights of persons affected by the
evaluation; treating participants and evaluatees with respect; honestly reporting
all the findings; and dealing appropriately with conflicts of interest. The accuracy
standards direct evaluators to produce valid, reliable, and unbiased information
- about the described object of the evaluation - and to interpret it defensibly
within the relevant context (see Chapter 13 for an explication of the Joint
Committee's standards).
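As a rough illustration of how a school might key a metaevaluation summary to these four attributes, here is a minimal Python sketch; the 0-4 rating scale, the flagging threshold, and the example ratings are invented for illustration and are not part of the Joint Committee's standards.

    # A minimal sketch of a standards-based metaevaluation summary, assuming
    # each of the four attributes is rated on a hypothetical 0-4 scale.
    ATTRIBUTES = ("utility", "feasibility", "propriety", "accuracy")

    def summarize(ratings):
        """Flag any attribute rated below 2 (on the assumed 0-4 scale)."""
        missing = [a for a in ATTRIBUTES if a not in ratings]
        if missing:
            raise ValueError(f"unrated attributes: {missing}")
        flagged = [a for a in ATTRIBUTES if ratings[a] < 2]
        overall = sum(ratings[a] for a in ATTRIBUTES) / len(ATTRIBUTES)
        return {"overall": overall, "needs_attention": flagged}

    # Example: an evaluation judged strong on accuracy but weak on utility
    # (say, its findings arrived too late to guide instruction).
    print(summarize({"utility": 1, "feasibility": 3, "propriety": 3, "accuracy": 4}))
    # -> {'overall': 2.75, 'needs_attention': ['utility']}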
Experience has shown that the utility, feasibility, and accuracy standards are
quite applicable across cultures. However, some cultures would not accept the
American propriety standards that deal with such matters as freedom of infor-
mation, conflicts of interest, human rights, and contracting. While American
educators should meet the Joint Committee standards, school personnel in other
countries can probably best consult these standards as a potentially useful
reference. Ultimately, they should determine and apply evaluation standards
that are appropriate in their cultures.
At the outset, the school's principal should appropriately staff the institution-
alization effort. Staffing should provide for both widespread participation and
the needed evaluation expertise.
To begin, the principal should consider appointing a representative joint
committee of about 15 members, who might include the principal, at least three
teachers, two parents, two teacher's aides, two support staff, two sixth grade
students, two representatives of the business community, and the school's evalu-
ation and improvement specialist. The point is to include all perspectives that
will be involved in conducting and using the evaluations. Except for the principal
and evaluation/improvement specialist, it is probably wise to include at least two
members from each category of committee members, both for comfort in numbers
and to provide significant voice in the committee's decision making. Perhaps the
most problematic aspect of this recommendation is the inclusion of two sixth
grade students. However, their input can be invaluable in that they have had
regular opportunities during more than five years in elementary school to
observe and think about teaching and learning in their classrooms. Even so,
they should be included only if their participation can be efficient and enhance
rather than detract from their learning process in the school. If students are not
included on the committee, their input might be gathered in other ways, such as
in discussions by an evaluation improvement committee member with small
groups of students. Whatever procedure is adopted, it is important to find some
way to give students a voice in the evaluation improvement process, since they
are the school's key stakeholders.
The principal should prepare, or engage someone else on the committee to
prepare, a charge for the committee. This charge might be to devise general
models for program, personnel, and student evaluation; to acquire the know-
ledge necessary to develop the models; to review and examine the school's past
evaluation experiences as one basis for model development; to develop a state-
ment on how the school should interrelate the three evaluation models; to obtain
assessments of the committee's composition and planned work from the school's
other stakeholders; and to develop a budget for the committee's first year or two
of projected work. Such a charge will be demanding and require that the com-
mittee essentially be a standing body. It could be expected that, on average, each
member would need to devote about eight hours a month to the evaluation
improvement process. After the initial evaluation improvement process, the
committee's membership should be updated. This could be on a three-year cycle.
With the exception of the principal and evaluation/school improvement specialist,
about a third of the membership should be replaced each year, with each member
serving a three-year term. The evaluation/school improvement specialist will
have a key role in helping the rest of the committee learn what they need to know
about evaluation standards, models, procedures, and uses in schools. The
principal and committee should also consider the possibility and appropriateness
of obtaining additional assistance from outside evaluation experts.
2. Adopt Standards for Guiding and Judging Evaluations
For the projected evaluation system to work, it must be a shared and valued
process and be integral to what the school does. Since evaluation is often an
anxiety-provoking and sometimes a contentious process, its successful applica-
tion requires development and maintenance of trust throughout the school,
collaboration, and firm grounds for settling disputes. One of the strongest ways
to address these needs is through adoption, internalization, and application of
professional standards for evaluations.
When firmly established and employed, such standards provide criteria and
direction for planning, judging, and applying evaluations in a fair, effective, and
accountable manner. They also provide a shared vision about good evaluation
that is central to an effective learning organization. By firmly learning and applying
standards, school staffs will have an objective basis for deliberating and resolving
issues in their work, assessing and improving evaluation operations, and making
sure they yield valid and reliable assessments. In the Type I model part of the
paper, the North American Joint Committee standards were advanced as an
exemplar for schools to study as they decide what standards should undergird
their evaluation work.
Having adopted a set of evaluation standards, the school's staff will be in a strong
position to begin examining and improving or replacing current evaluation
systems. The principal should engage the staff to reach agreement on whether to
evaluate current evaluation approaches in the school and, if so, whether the
evaluations should focus on students, personnel, and/or programs. The staff's
recommendation should then be presented to the school council or other
authority for approval as appropriate.
For each evaluation system that is to be examined (student, personnel, program),
the evaluation improvement committee should compile the evaluation materials
that are being used, such as sample guidelines, instruments, data sets, and
reports. They should also characterize how the evaluation approach is intended
to operate and how it actually operates. As this work proceeds, the principal
should encourage each committee member to use the evaluation improvement
experience to gain deeper insights into the school's new evaluation standards.
Once the relevant materials have been compiled and the needed character-
izations have been made, the principal should assign members to task forces for
each evaluation area to be examined. The task forces may be made up entirely
of evaluation improvement committee members or include additional members
from the school and/or community. The charge to each task force should be to
assess the assigned evaluation model or approach presently in use against the
school's new evaluation standards. If two or more task forces are involved, they
...
Once the diagnostic work of examining existing evaluation practices has been
completed, the evaluation improvement committee should turn to the develop-
ment of new or improved systems for evaluating students, personnel, and
programs. As a first step, the committee should reach consensus on whether any
or all of the school's existing evaluation approaches are acceptable as they are or
are in need of revision or of replacement. If crucial standards such as evaluation
use and valid measurement are not being met, then the systems should be revised
and possibly replaced.
When new or improved models are needed, the principal should appoint task
forces to proceed with the needed work. These would likely be the same task
groups that evaluated the existing evaluation systems but could be reconstituted
as appropriate. Each task force would be charged to use the school's evaluation
standards and the evaluations of the existing evaluation systems to develop new
or improved models of evaluation. The principal should also encourage each task
force member to constantly consider how the task force's work in developing new
models would apply in the member's specific area of responsibility. This will provide
constant reality checks for the task force and also accelerate each member's
preparation to apply the new models.
To launch the task forces' work, the principal should train or arrange for a
consultant to train the task force members in the requirements of sound evalu-
ation. This training should look at the formal requirements of sound operating
models, review applicable models found in the literature, and revisit the school's
evaluation standards.
The task forces then should deliberate, conceptualize the needed improve-
ments or new models, and write them up. In this process, the task forces should
meet periodically in plenary sessions to compare, contrast, and critique each
other's progress. They should especially strive to ensure that the models will be
complementary and mutually reinforcing. The model development process should
culminate with a commissioned independent evaluation of each produced evalu-
ation model. The external evaluation should examine the models, individually
and collectively, for their soundness as working models, compatibility with each
other and the school environment, and their compliance with the school's
evaluation standards.
Using the improved or new evaluation models and the external evaluations of
these, the task forces should next develop manuals to guide school personnel in
implementing the models. Each manual should be grounded in the school's
standards; provide a mental model for thinking about and sequencing the steps
in the particular evaluation approach; clearly outline responsibilities for carrying
out the evaluation work; and provide sample tools, charts, etc. Tables 2, 3, and 4
contain sample outlines of evaluation manuals for students, personnel, and
programs respectively.
Once manuals have been drafted, it is time to communicate the committee's
progress to the entire school staff. In distributing the draft manuals, the com-
mittee should encourage the individual school staff members to study them in
the context of how they can or cannot use them in their own evaluation work. In
addition to the manuals, it can be helpful to set up a resource library of materials
relevant to carrying out effective student, personnel, and program evaluations.
The committee should also make sure that each school staff member has and is
attending to a copy of the school's evaluation standards. Furthermore, the com-
mittee should provide orientation and training sessions focused on developing
understanding of the draft evaluation manuals for all school staff members.
Technical assistance should be made available to staff members who plan to
conduct or participate in field tests of the manuals.
Follow-up group training should also be provided for school members who
want to exchange information on their views of the new models. This can provide
each member with a better understanding of the models. In these various group
meetings on the new models, participants should closely examine how the differ-
ent models fit together and how well they do or do not support the school's
overall mission.
It is crucial that the new models themselves be evaluated - not just in the testing
stage, but also in their future regular use. Thus, metaevaluation (provisions for peri-
odically evaluating the evaluations) should be built into each evaluation manual.
Following the efforts to acquaint the school's staff members with the draft evalu-
ation manuals, it is next appropriate to thoroughly evaluate and improve each
Table 2. A Sample Table of Contents for a Student Evaluation Manual
1. The School's Mission in Serving Students
2. Nature and Importance of Student Evaluation
3. Standards for Sound Student Evaluation
4. Areas of Student Growth and Development
5. Key Audiences for Student Evaluation Findings
6. Responsibilities for Student Evaluation
7. The School's General Approach to Student Evaluation
8. Schoolwide Evaluations of Students
9. Evaluation of Students in the Classroom
10. Resources to Support Student Evaluation
11. Focusing the Student Evaluation
12. Collecting, Storing, and Analyzing Student Evaluation Information
13. Reporting Evaluation Findings
14. Using Evaluation Findings
15. A Sample Calendar of Student Evaluation Activities
16. A Matrix of Example Student Evaluation Questions
17. Sample Data Collection Instruments
18. Sample Student Evaluation Reports
19. Provisions for Metaevaluation
20. Student Evaluation References
manual. The committee should engage all or most staff members in reviewing
each manual and submitting a written critique, including strengths, weaknesses,
and recommendations for each. The staff members should be asked to key their
critiques to the school's evaluation standards and to indicate how well they think
Table 4. A Sample Table of Contents for a Program Evaluation Manual
1. The School's Mission in Serving Students
2. The Nature of Programs and their Role in Fulfilling the School's Mission
3. Nature and Importance of Program Evaluation
4. Standards for Sound Program Evaluation
5. The School's General Approach to Program Evaluation
6. Defined Users and Uses of Program Evaluations
7. Responsibilities for Program Evaluations
8. School Priority Areas for Improvement
9. Needs Assessments
10. Evaluation of Program Plans
11. Evaluation of Program Implementation
12. Evaluation of Program Outcomes
13. Selection of Programs for Evaluation
14. Resources to Support Program Evaluation
15. Contracting for Outside Evaluation Services
16. The Annual Focus for Program Evaluations
17. A Matrix of Example Program Evaluation Questions
18. Provisions for Metaevaluation
19. Program Evaluation References
the approaches and operating manuals will function in their areas of respon-
sibility. They also should be invited to note what added inservice training and
support they will need to implement the new or revised approaches. They should
especially be asked what additional materials are needed in each manual and
what existing materials need to be clarified. It would be well for the committee
to convene focus groups to assess the new manuals and systematically develop
recommendations for strengthening them. Such groups could be productively
engaged in estimating the time and resources required to implement the new
evaluation approaches.
It is highly desirable that each evaluation manual be tested in actual evalu-
ations in addition to the individual and group critiques. This, itself, could require
a year or more; institutionalizing new evaluation approaches involves time-
consuming, complex activity. The more the school can make the needed investment,
the more likely it will succeed in producing a functional, valuable evaluation
system that will work over time and help staff improve the school.
7. Revise the Evaluation Manuals Based on the Reviews and Field Tests
After completing the reviews and tests of the evaluation manuals, the evaluation
improvement committee should revise them for use throughout the school. The
committee should subsequently print and distribute copies of the finalized
manuals to all school staff members. This should be followed by arranging for,
scheduling, and publicizing the training and support to be provided to help all
school staff effectively implement the evaluations. Shortly after distributing the
manuals, the committee should hold discussions to help school staff begin
preparing to implement the new evaluation approaches. It will be essential that
the school council firm up a resource plan for implementing the new models. In
the course of arranging for resources, the committee should evaluate the evalu-
ation improvement effort to date for use by the school council and the entire
school faculty.
As the implementation stage gets under way, the school's leaders must match the
rhetoric with action. They must provide school staff with the time and resources
needed to do the evaluations. The principal should engage staff to make known
their needs for help in carrying out the evaluations and in collaboration with the
school council should allocate the needed time and resources in accordance with
the needs and provisions in the evaluation manuals. As staff members run into
challenges and difficulties in conducting the evaluations, the principal should
encourage them and creatively support them to increase their repertoires for
doing sound evaluation work. Early in the installation process it can be especially
cost effective to engage external evaluation experts to support and train staff to
meet their individual evaluation responsibilities and to help the principal and
evaluation improvement committee coordinate student, personnel, and program
evaluations. The first year of installation is a good time to institute the regular
practice of annually evaluating and, as appropriate, strengthening the evaluation
operations. It can be highly cost effective to annually engage one or more exter-
nal evaluation experts to review and provide recommendations for strengthening
the student, personnel, and program evaluation systems.
9. Annually Plan and Carry Out Student, Personnel, and Program Evaluations
11. Administer a System of Recognitions and Rewards for Effective Evaluation Work
Institutionalization Themes
Table 5. Continued. Institutionalization steps keyed to the six disciplines of learning organizations (Shared Vision, Mental Models, Personal Mastery, Team Learning, Systems Thinking, Evaluation).

2. Adopt Evaluation Standards
Shared Vision: Determine and adopt standards as school policies to guide and assess evaluations.
Mental Models: Plan steps to ingrain such standards concepts as utility, feasibility, propriety, and accuracy in all the school's evaluations.
Personal Mastery: Help each staff member assess his/her evaluation plans and reports against the standards.
Team Learning: Teach the standards to all school professionals.
Systems Thinking: Configure evaluations so that they are compatible with the system and yield useful findings.
Evaluation: Periodically examine the school's evaluations against the school's evaluation standards.

3. Evaluate the school's current approaches for evaluating students, personnel, and programs
Shared Vision: Agree to evaluate the school's existing evaluation approaches against the school's new evaluation standards.
Mental Models: Compile pertinent artifacts of the school's current evaluation practices and characterize the existing models/approaches for student, personnel, and program evaluation.
Personal Mastery: Encourage each committee member to use this experience to gain deeper insights into and more proficiency in the school's new evaluation standards.
Team Learning: Form three task forces and engage them, respectively, to examine the school's existing models/approaches for evaluating students, personnel, and programs.
Systems Thinking: Periodically engage the task forces to review one another's work and to look together at how the school's existing models/approaches do or do not complement each other.
Evaluation: Engage one or more outside evaluation experts to oversee and evaluate the effectiveness of the evaluation task forces' efforts.

4. Develop improved or new models for evaluating students, personnel, and programs
Shared Vision: Agree on whether new or improved models are needed for evaluating students, personnel, and programs.
Mental Models: Engage task forces to construct or improve models for evaluating students, personnel, and programs as appropriate.
Personal Mastery: Assign each task force member to constantly consider how the new model being developed would work in her or his sphere of responsibility.
Team Learning: Teach the requirements of sound evaluation models to each task force and engage them in reviewing pertinent evaluation models found in the literature.
Systems Thinking: Periodically convene the task force members to assess the compatibility of the models under development and their fit to the school's evaluation standards.
Evaluation: Engage one or more external evaluators to assess the soundness of the new or improved models, their compatibility, and their fit to the school's evaluation standards.

5. Summarize the new/improved models and operating guidelines in draft users' manuals
Shared Vision: Develop school-wide users' manuals for student, personnel, and program evaluation.
Mental Models: Encourage and support each teacher and other professional staff member to adapt and internalize the school's new evaluation models.
Personal Mastery: Provide a resource library and technical assistance for school staff to use in upgrading their knowledge of evaluation. Train all the school's staff members in the new models for evaluating students, personnel, and programs.
Team Learning: Conduct in-service education so that school staff can share and obtain help in improving their own approaches to evaluation, including especially adaptation and use of the school's new or improved evaluation models.
Systems Thinking: Conceptualize the school's overall approach to evaluation in a general manual. Include school, class, and student levels; show their interconnections; and show how they all embrace a common set of evaluation standards.
Evaluation: Include metaevaluation in all school evaluation models and periodically evaluate the evaluation models and their implementation.

6. Thoroughly review and test the new evaluation manuals
Shared Vision: Engage all the school's staff in critiquing the proposed evaluation models.
Mental Models: Ask the staff to indicate how they think they could make the models function in their areas of responsibility.
Personal Mastery: Ask the staff to indicate what in-service training and support they would need to implement the new/revised models.
Team Learning: Convene focus groups to assess the new models and develop recommendations for improvement.
Systems Thinking: Analyze the time and resources required to implement the new models.
Evaluation: Formally field-test the new or improved models in selected settings in the school.

7. Revise the evaluation manuals based on the reviews and field tests
Shared Vision: Engage the evaluation improvement committee to make the needed revisions.
Mental Models: Update the users' manual for each evaluation model and disseminate the manuals.
Personal Mastery: Project the types of training and support to be provided to help all school staff effectively implement the new models.
Team Learning: Hold discussions to help school staff begin preparing to implement the new models.
Systems Thinking: Firm up a resource plan for implementing the new models.
Evaluation: Review the finalized models to identify any areas still to be addressed.

8. Provide the resources and time necessary to implement the school's new or improved evaluation models
Shared Vision: Determine with staff a common understanding of what level of resources and time should and will be invested in evaluation.
Mental Models: Allocate time and purchase technical support to help school staff members develop and apply the needed conceptualizations of evaluation.
Personal Mastery: Provide time and resources for individual staff to use in developing their repertoire of evaluation procedures. Consider hiring an evaluation specialist to work with school staff and coordinate school-level evaluation.
Team Learning: Provide time and resources for needed team learning sessions.
Systems Thinking: Examine how evaluation can work toward a goal of being "cost free" by helping to increase efficiency and effectiveness.
Evaluation: Provide resources and time for annual metaevaluations.

9. Annually schedule and carry out a plan of student, personnel, and program evaluations
Shared Vision: Key each evaluation system to the evaluation standards.
Mental Models: Annually adapt and improve the school's models for student, personnel, and program evaluations as appropriate.
Personal Mastery: Key each type of evaluation to strengthening both individual performance and overall school improvement.
Team Learning: Engage groups of staff to assess the different evaluation systems against the evaluation standards.
Systems Thinking: Ensure that the different evaluation systems are compatible and functionally appropriate for the school.
Evaluation: Periodically evaluate the different evaluation models and systems and the extent to which evaluation ...

10. Apply the evaluation findings to strengthen the school
Shared Vision: Key every evaluation to a school-wide view of how evaluation is to help staff improve services at every level.
Mental Models: Specify in every evaluation plan who will use the evaluation and how they will use it.
Personal Mastery: Train school staff not only in the conduct of evaluation but in the use of results.
Team Learning: Conduct workshops in which groups of school staff go over evaluation findings and formulate pertinent recommendations for school improvement.
Systems Thinking: Always plan and conduct evaluations in consideration of how they can help school staff rethink and strengthen the school's functioning.
Evaluation: In evaluating evaluations, pay close attention to their impacts and functionality compared with their costs and disruptive qualities.

11. Administer a system of recognitions and rewards for effective evaluation work
Shared Vision: Involve the school staff in defining what level of evaluation contributions should be rewarded.
Mental Models: Consider what exemplary aspects of the award-winning evaluations the school should encourage and use.
Personal Mastery: Use the rewards to motivate staff to conduct sound evaluations and upgrade their skills.
Team Learning: Engage award winners to conduct team learning sessions in evaluation.
Systems Thinking: Regularly work to strengthen the overall evaluation system.
Evaluation: Periodically evaluate the recognitions and awards program to assure that it is fair, constructive, and effective for the school.

12. Provide for obtaining external evaluations as needed
Shared Vision: Reach a consensus about when external evaluations should be commissioned and what they should yield for the school.
Mental Models: Clarify the models of evaluation to be implemented by external evaluators.
Personal Mastery: Ensure that external evaluations provide information that individual staff members can use to strengthen their services.
Team Learning: Ensure that affected and interested school staff are schooled in the results of external evaluations.
Systems Thinking: Ensure that external evaluations are functionally related to helping the school improve its services and accountability to the public.
Evaluation: Provide for evaluating all external evaluations against the school's evaluation standards.
CONCLUSION
Schooling is one of the most vital processes in any society. It should be done well,
meet the needs of students, and constantly be improved. Every school should be
an effective learning organization. Ongoing, defensible evaluation is essential to
help schools identify strengths and weaknesses and obtain direction for school
improvement, especially in areas of student learning, teaching and other per-
sonnel functions, and programs.
If evaluation is essential to effective schooling, it is also a monumental task.
This chapter illustrates the extensive work needed to effectively conceptualize and
institutionalize sound evaluation. Many schools could not succeed in conducting
the needed planning and institutionalization processes by themselves. They need
resources, technical support, external perspectives, and strong leadership.
Ministries of education, universities, and educational resource centers exist to
assist schools in these ways. The examples given in this chapter's final section
demonstrate that such organizations can effectively help schools install sound
and effective evaluation systems.
Readers should consider the ideas, models, and strategies discussed in the
chapter as something like a set of empty boxes that they should fill in as they see
best. The important point is to take steps to assure that their efforts to develop
and sustain their schools as learning organizations be grounded in the develop-
ment and use of sound evaluation systems. For it is only through the use of sound
evaluation at all levels of the school that true learning can take place.
ENDNOTE
1 This chapter is based on a paper, titled "Evaluation and Schools as Learning Organizations,"
presented at the III International Conference on School Management, University of Deusto,
Bilbao, Spain, September 12-15, 2000.
REFERENCES
Bloom, B.S. (Ed.). (1956). Taxonomy of educational objectives. Handbook I: The cognitive domain.
New York: David McKay.
Fullan, M. (1991). The meaning of educational change. New York: Teachers College Press.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation standards.
Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards.
Thousand Oaks, CA: Sage.
Kline, P., & Saunders, B. (1998). Ten steps to a learning organization (2nd ed.). Arlington, VA: Great
Ocean Publishers.
Nevo, D. (1994). Combining internal and external evaluation: A case for school-based evaluation.
Studies in Educational Evaluation, 20(1), 87-98.
Sanders, J., Horn, J., Thomas, R., Tuckett, M., & Yang, H. (1995). A model for school evaluation.
Kalamazoo, MI: CREATE, The Evaluation Center, Western Michigan University.
Senge, P. (1990). The fifth discipline: The art and practice of the learning organization. New York:
Doubleday.
Senge, P., Kleiner, A., Roberts, C., Ross, R., & Smith, B. (1994). The fifth discipline: The art and
practice of the learning organization. New York: Doubleday.
Stufflebeam, D. (1993, September). Toward an adaptable new model for guiding evaluations of educa-
tional administrators. CREATE Evaluation Perspectives, 3(3). Kalamazoo, MI: The Evaluation
Center, Western Michigan University.
Stufflebeam, D.L. (2000). The CIPP model for evaluation. In D.L. Stufflebeam, G.F. Madaus, &
T. Kellaghan (Eds.), Evaluation models: Viewpoints on educational and human services evaluations
(2nd ed.) (pp. 279-317). Boston: Kluwer Academic.
Stufflebeam, D.L. (2001, Spring-Summer). The metaevaluation imperative. American Journal of
Evaluation, 22(2), 183-209.
Tyler, R., Smith, E., & the Evaluation Staff. (1942). Purposes and procedures of the evaluation staff.
In E.R. Smith & R.W. Tyler, Appraising and recording student progress, III (Chapter 1). New York:
Harper.
U.S. Department of Education. (1991). America 2000: An education strategy. Washington, DC: Author.
36
A Model for School Evaluation1
JAMES R. SANDERS
The Evaluation Center, Western Michigan University, MI, USA
E. JANE DAVIDSON
The Evaluation Center, Western Michigan University, MI, USA
Ad hoc school evaluations appeared in the 1970s in response to citizen needs for
information about schools. Documents published by local boards of realtors,
by citizens councils concerned with local schools, and by organizations such as
the National School Public Relations Association (NSPRA) in the United States
have served as guides for evaluating a school.
There are occasions when external groups or individuals organize to evaluate
a school, an approach that adds the important perspective of citizens to school
evaluation. Their motives may vary, but they often want an answer to the
question, "How good is this school?" In the U.S., the NSPRA published a review
of what the effective schools literature says about the characteristics of effective
schools and describes schools that have these characteristics (Tiersman, 1981).
They concluded that schools are only as effective as the people and ideas that
shape them. The Northwest Regional Educational Laboratory (1990) also
brought together a list of effective schooling practices based on the effective
schools literature. Instruments such as the School Assessment Survey (Research
for Better Schools, Inc., 1985), Connecticut School Effectiveness Questionnaire
(Gauthier, 1983), the San Diego County Effective Schools Survey (San Diego
County Office of Education, 1987), and The Effective School Battery
(Gottfredson, 1991) all grew out of this literature. The criteria identified in these
publications can be used to establish evaluation standards for different aspects
of a school, but we must be careful not to apply these criteria blindly. The criteria
are merely heuristics that suggest aspects of a school to study, and there will
always be exceptional schools that fall short on one or more of them.
Some organizations have published criteria to guide citizens in evaluating
schools. Again, these criteria can be used to establish standards for judging differ-
ent aspects of a school. Noteworthy publications have been prepared by the
Citizens Research Council of Michigan (1979), The National Committee for
Citizens in Education (Thomas, 1982), and the New Zealand Educational Review
Office (2000a, 2000b). The Appalachia Educational Laboratory has developed a
school study guide for use by external evaluators. Their Profile of School Excellence
(ProISe) can be used to generate a report to guide school improvement planning.
There is a wealth of information, approaches, ideas, and experiences with
school evaluation, as the preceding review indicates. The challenge is to identify
the dominant themes and strengths from the idiosyncratic and somewhat limited
elements of current practice in order to improve school evaluation. Van Bruggen
(1998) noted the following broad international trends in the evaluation of schools
and other institutions: (i) increased use of self-evaluation by schools (some man-
dated, some voluntary); (ii) increased prevalence of systems for comprehensive
school inspection; (iii) greater realization of the importance of national (and
perhaps international) consensus regarding quality indicators for stimulating a
clear dialog about school quality and its improvement; (iv) a growing emphasis
on accountability, including public reporting of self- and external evaluations,
and action plans for improvement; (v) the realization that school evaluation
should be frequent, ongoing, and linked to continuous improvement; and (vi) an
understanding of the value added by both self- and external evaluation.
There are many aspects of a school that may fall under the heading of school
evaluation. School evaluation is broader than program evaluation, personnel
evaluation, and evaluation of student performance, and can include evaluation
of school facilities, school finances, school climate, school policies, and school
record keeping.
may be included in an evaluation, we have drawn entries from many different
sources. The resulting menu is presented as Table 1.
[Figure: excerpt from the school evaluation flowchart, showing two steps - "Define Criteria for Evaluating Each Component (U4)" and "Collect Information About and Related to the Criteria for Each Component (A1, A5, A6)".]
School evaluation leaders may select the components they want to include in
their school evaluation plan from the list in Table 1. If something in the menu is
not relevant to a situation, it should not be included. If something important
cannot be found in the menu, it should be added. However, only aspects of the
school that are of immediate interest should be included. Often a scanning of
Table 1 can serve to highlight the components most in need of review and
change. Comprehensive lists of questions for all components set out in the table
are available at http://evaluation.wmich.edu/tools/.
Among the first questions that school evaluation leaders should ask themselves
are, "Who will use the evaluation findings?" and "How do they plan to use them?"
Answers to these questions will clarify the purpose of the evaluation and identify
Table 1. Aspects of a School That May be Included in a School Evaluation

School Context and Input: School Climate; Personnel Qualifications; Parent Involvement; School Safety
School Design: School Organization; School Practices; School Facilities; School Curriculum; School Policy
School Services: Instruction; School Counseling Programs; School Record Keeping; Student Assessment; School Finance; Food Service; Custodial Service; Transportation Services; Extracurricular Activities; School Professional Development Programs
Student Outcomes: Cognitive Development; Social Development; Emotional Development; Physical Development; Aesthetic Development; Vocational Development; Moral Development
the intended users of the information it will provide. Discussion with users -
people or groups - will help to identify important questions that need to be
addressed and the best ways to obtain information. The scope of an evaluation
will be determined in part by the expectations of intended users, and its role in
the school in part by those for whom it is intended. That role can
be formative, giving direction to school improvement activities, or summative,
involving accountability reporting to outside audiences.
judgment and how information (observe, ask, test, survey, compile records, scan
documents) will be collected.
7. Compile Findings
Information collected for a school evaluation can become overwhelming if not
filed properly and summarized as it comes in. One of the best ways to file
information is by criteria that are being used to evaluate the school (see step 4).
Each criterion could have a separate file folder in which information about
context (see step 6), activities and outcomes (see step 5) is deposited. When
compiling information, it will be a matter of summarizing what has been found
on each criterion. Organize and summarize what you are finding as you go along
with the evaluation, rather than waiting until the end. A daily summary memo of
findings would help.
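A minimal Python sketch of this filing discipline follows, assuming each incoming finding is tagged with the criterion it bears on as it arrives; the criterion names and example findings are hypothetical.

    # A minimal sketch of filing school evaluation findings by criterion as
    # they come in, then producing a brief daily summary memo.
    from collections import defaultdict
    from datetime import date

    files = defaultdict(list)  # one "folder" per criterion

    def file_finding(criterion, source, note):
        files[criterion].append({"date": date.today(), "source": source, "note": note})

    def daily_summary():
        """Summarize today's findings, grouped by criterion."""
        today = date.today()
        lines = []
        for criterion, entries in sorted(files.items()):
            todays = [e for e in entries if e["date"] == today]
            if todays:
                lines.append(f"{criterion}: {len(todays)} new finding(s)")
                lines.extend(f"  - [{e['source']}] {e['note']}" for e in todays)
        return "\n".join(lines) or "No new findings today."

    # Hypothetical entries for one day's work.
    file_finding("school climate", "staff survey", "Most teachers report improved morale.")
    file_finding("instruction", "classroom observation", "Pacing uneven in grade 5 mathematics.")
    print(daily_summary())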
Good evaluation does not stop with evaluation reports. There is a need to act on
the reports. Individuals who were involved in an evaluation are often in a good
position to help recipients use the results. For example, school evaluation leaders
or a steering committee might plan follow-up meetings with recipients to help
them understand and apply what has come out of the evaluation process. As the
Standards note, evaluation is a poor investment of resources if it is not used.
By formalizing the process of school evaluation, a school staff can accomplish
more than simply evaluating the school. Several benefits have been reported in
the literature. First, stakeholders will be encouraged to work together on behalf
of students, rather than in isolation. Second, a time schedule for the evaluation
will facilitate getting it done. Third, evaluators will make better use of existing
information and will be in a position to assess whether other routine but unused
information really needs to be collected. Fourth, new ideas about change in a
school will surface that might ordinarily be mentioned but die an untimely
death due to lack of follow-up. Fifth, information that teachers and others have
collected through everyday experiences will be transmitted to others. Thus an
evaluation can provide a forum for communicating important information and
knowledge that might otherwise never be shared. Sixth, stakeholders will develop
ownership in the school and its students. Seventh, "drift" in the school (i.e. gradual
changes that occur unknowingly and without reasons) will be identified. Changes
based on evaluation occur knowingly and with justification.
CONCLUSION
The 11 steps in the model (or a subset of them) may be applied to many of the
types of school evaluation that are practiced around the world. For example, a
school self-study project (which may be facilitated by an external evaluation
consultant) could apply the steps just as easily as a fully external evaluation team,
or even an individual school inspector using more of an expertise-oriented (or
"connoisseur") approach.
The model may also be applied in the context of a considerable range of other
evaluation approaches in use today, an application that has the potential to lead
to some extremely fruitful innovations in the practice of evaluation, both in
schools and elsewhere. One area of application might be program theory in par-
ticipatory evaluation, which is becoming an important tool for allowing schools
(and other organizations) to experiment with new ideas and learn about what
works and why (Rogers, Hacsi, Petrosino, & Huebner, 2000). Although the
benefits of using program theory do not always outweigh its expense, it has
huge potential for engaging stakeholders in the evaluation process, helping them
articulate their assumptions, and allowing not only organizational learning, but
also organizational unlearning, to occur.
With some adjustment to the nature of stakeholder input into evaluation
design and the interpretation of results, many of the steps could also be applied
to a fully consumer-oriented and/or goal-free evaluation (Scriven, 1974). In this
case, the evaluator would gather extensive input from direct and indirect down-
stream impactees (e.g., students, parents, future employers), but would minimize
the role of staff, teacher, and administrator agendas in the process.
The evaluation criteria, questions, and information sources listed in the 11-step
model could also be blended with some of the ideas from the Balanced Scorecard
approach to organizational improvement (Kaplan & Norton, 1992) and used to
augment the work being done in the development of school profiles and report
cards (e.g., Johnson, 2000). Again, when a comprehensive system for reporting on
school merit/worth is combined with an inquiry-based approach (e.g., Preskill &
Torres, 1999), the move to transform schools from the more mechanistic
structures of yesteryear into the more organic, living and learning organisms that
are starting to emerge in the new millennium is enhanced.
ENDNOTES
1 This chapter is based on the book, A Model for School Evaluation, (Sanders, Horn, Thomas,
Tuckett, & Yang, 1995). We wish to acknowledge the contributions of Jerry L. Horn, Rebecca
Thomas, D. Mark Tuckett, and Huilang Yang to the book and to the evaluation tools, which can be
found at http://evaluation.wmich.edu/tools/
2 Published by the National Catholic Educational Association (e.g., Self-Study Guide for Catholic
High Schools) and the National Independent Private Schools Association, Accreditation Program.
REFERENCES
Ackland, J.W. (1992, April). Collaborative school-based curriculum evaluation: A model in action.
Paper presented at the annual meeting of the American Educational Research Association, San
Francisco, CA.
Alberta Department of Education. (1991). Achieving the vision: 1991 report. Edmonton, Alberta:
Author.
Argyris, C. (1982). Reasoning, learning and action: Individual and organizational. San Francisco:
Jossey-Bass.
Argyris, C., & Schön, D.A. (1974). Theory in practice: Increasing professional effectiveness. San
Francisco: Jossey-Bass.
Argyris, C., & Schön, D.A. (1978). Organizational learning. Reading, MA: Addison-Wesley.
Argyris, C., & Schön, D.A. (1996). Organizational learning II: Theory, method, and practice. Reading,
MA: Addison-Wesley.
Blasczyk, J. (1993). The aphorisms about the relationships of self-study evaluation to school improve-
ment. Paper presented at the 1993 annual meeting of the American Evaluation Association,
Dallas, TX.
Brown, P.R. (1990). Accountability in education: Policy briefs, No. 14. San Francisco, CA: Far West
Laboratory. (ERIC Document Reproduction Service No. 326 949).
Cahan, S., & Elbaz, J.G. (2000). The measurement of school effectiveness. Studies in Educational
Evaluation, 26, 127-142.
Cheng, Y. (1990). Conception of school effectiveness and models of school evaluation: A dynamic
perspective. CUHK Education Journal, 18(1), 47-61.
Citizens Research Council of Michigan. (1979). Evaluating the educational outcomes of your local
schools: A manual for parents and citizens. (Report No. 257). Lansing, MI: Author.
Climaco, C. (1992). Getting to know schools using performance indicators: Criteria, indicators and
processes. Educational Review, 44, 295-308.
Davidson, E.J. (2001). The meta-learning organization: A model and methodology for evaluating
organizational learning capacity. Dissertation Abstracts International, 62(05), 1882. (UMI No.
3015945).
Davis, R. (Ed.) (1998, May). Proceedings of the Stake Symposium on Educational Evaluation.
Champaign: University of Illinois at Urbana-Champaign.
Deketelaere, A. (1999). Indicators for good schools: Comparative analysis of the instruments for full
inspection of primary and secondary schools in England, Flanders, the Netherlands and Scotland.
Report from a workshop given at the Standing International Conference of Central and General
Inspectorates of Education (SICI), Bruges, Belgium-Flanders.
DeMoulin, D.P., & Guyton, J.W. (1989, November). Attributes for measuring equity and excellence in
district operation. Paper presented at the annual meeting of the Southern Regional Council on
Educational Administration, Columbia, SC.
Elliot, W.L. (1991). RE-FOCUS program - redefine efforts: Focus on change under supervision. (Report
No. CS507609). Kansas City, MO: Hickman Mills Consolidated School District. (ERIC Document
Reproduction Service No. ED 337 830).
Fetterman, D.M., Kaftarian, S.J., & Wandersman, A. (1996). Empowerment evaluation: Knowledge
and tools for self-assessment and accountability. Thousand Oaks, CA: Sage.
Gaines, G.P., & Cornett, L.M. (1992). School accountability reports: Lessons learned in SREB states.
Atlanta, GA: Southern Regional Education Board. (ERIC Document Reproduction Service No.
ED 357 471).
Gauthier, W.J., Jr. (1983). Instructionally effective schools: A model and a process (Monograph No. 1).
Hartford, CT: Connecticut State Department of Education. (ERIC Document Reproduction
Service No. ED290216).
Glass, G.V. (1977). Standards and criteria (Occasional Paper #10). Kalamazoo, MI: The Evaluation
Center, Western Michigan University. (Also available at http://www.wmich.edu/evalctr/pubs/ops/).
Gottfredson, G.D. (1991). The effective school battery. Odessa, FL: Psychological Assessment
Resources.
Gottfredson, G.D., & Castenada, ?? (1986). School climate assessment instruments: A review. (ERIC
Document Reproduction Service No. ED 278702).
Gray, S.T. (1993). A vision of evaluation. A report of learnings from Independent Sector's work on
evaluation. Washington, DC: Independent Sector.
Howard, E., Howell, B., & Brainard, E. (1987). Handbook for conducting school climate improvement
projects. Bloomington, IN: Phi Delta Kappa Educational Foundation.
Jaeger, R.M., Gorney, B., Johnson, R., Putnam, S.E., & Williamson, G. (1994). A consumer report on
school report cards. A report of the Center for Educational Research and Evaluation.
Greensboro: University of North Carolina.
Jaeger, R.M., Gorney, B., Johnson, R.L., Putnam, S.E., & Williamson, G. (1993). Designing and
developing effective school report cards: A research synthesis. A research report. Greensboro: Center
for Educational Research and Evaluation, University of North Carolina.
Johnson, R.L. (2000). Framing the issues in the development of school profiles. Studies in
Educational Evaluation, 26, 143-169.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation standards.
Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards
(2nd ed.). Thousand Oaks, CA: Sage.
Kaplan, R.S., & Norton, D.P. (1992). The balanced scorecard: Measures that drive performance.
Harvard Business Review, 70(1), 71-79.
Kelley, E.A., Glover, J.A., Keefe, J.W., Halderson, C., Sorenson, C., & Speth, C. (1986). School
climate survey. Reston, VA: The National Association of Secondary School Principals.
Lafleur, C. (1991). Program review model. Midhurst, Ontario: The Simcoe County Board of
Education. (ERIC Document Reproduction Service No. 348 388).
Learmonth, J. (2000). Inspection: What's in it for schools? London: Routledge Falmer.
Leithwood, K., Aitken, R., & Jantzi, D. (2001). Making schools smarter: A system for monitoring school
and district progress (2nd ed.). Thousand Oaks, CA: Corwin Press.
Lindsay, G., & Desforges, M. (1998). Baseline assessment: Practice, problems and possibilities. London:
David Fulton.
National Study of School Evaluation. (1987). Evaluative criteria. Falls Church, VA: Author.
Nelson, D.E. (1976). Formulating instrumentation for student assessment of systematic instruction
(Report No. JC 760324). Texas City, TX: College of the Mainland. (ERIC Document Reproduction
Service No. ED 124 243).
New Zealand Education Review Office. (2000a). Evaluation criteria. Retrieved from
http://www.ero.govt.nz/AcctRevInfo/ERO_BOOK_AS_HTML/ERO_CONTENTS_HTML/Contents.html.
New Zealand Education Review Office. (2000b). Choosing a school for a five year old. Retrieved from
http://www.ero.govt.nz/Publications/pubs2000/choosingaschoolfive.htm.
Northwest Regional Educational Laboratory. (1990). Effective schooling practices: A research synthesis.
Portland, OR: Author. (ERIC Document Reproduction Service No. ED 285 511).
OFSTED (Office for Standards in Education) (2000). Handbook for inspecting secondary schools. Her
Majesty's Chief Inspector of Schools in England. Retrieved from
http://www.official-documents.co.uk/document/ofsted/inspect/secondary/sec-00.htm.
Patton, M.Q. (1997). Utilization-focused evaluation (3rd ed.). Thousand Oaks, CA: Sage.
Preskill, H., & Torres, RT. (1999). Evaluative inquiry for learning in organizations. Thousand Oaks,
CA: Sage.
Reinhartz, J., & Beach, D.M. (1983). Improving middle school instruction: A research-based self-
assessment system. Analysis action series. (Report No. OEG-7-9-780085-0131-010). New Orleans,
LA: Xavier University of Louisiana. (ERIC Document Service No. ED 274 051).
Research for Better Schools, Inc. (1985). School assessment survey. Information for school improve-
ment. A technical manual. Philadelphia, PA: Author.
Rogers, P.J., Hacsi, T.A., Petrosino, A., & Huebner, T.A. (Eds.) (2000). Program theory in evaluation:
Challenges and opportunities. New Directions for Evaluation, 87.
San Diego County Office of Education. (1987). Effective schools survey. San Diego: Author.
Sanders, J.R., Horn, J.L., Thomas, R.A., Tuckett, D.M., & Yang, H. (1995). A model for school evalu-
ation. Kalamazoo: Center for Research on Educational Accountability and Teacher Evaluation
(CREATE), The Evaluation Center, Western Michigan University.
Scheerens, J., Gonnie van Amelsvoort, H.We., & Donoughue, C. (1999). Aspects of the
organizational and political context of school evaluation in four European countries. Studies in
Educational Evaluation, 25, 79-108.
Scriven, M. (1974). Prose and cons about goal-free evaluation. In W.J. Popham (Ed.), Evaluation in
education: Current applications (pp. 34-42). Berkeley, CA: McCutchan Publishing.
Scriven, M. (1994). The final synthesis. Evaluation Practice, 15, 367-382.
Scriven, M. (2000). The logic and methodology of checklists. Retrieved from http://www.wmich.edu/
evalctr/checklists/.
Senge, P.M. (1990). The fifth discipline. New York: Doubleday Currency.
Sirotnik, KA. (1987). The information side of evaluation for local school improvement. International
Journal of Educational Research, 11, 77-90.
Sirotnik, KA., & Kimball, K (1999). Standards for standards-based accountability systems. Phi Delta
}(appa, 81, 209-214.
Stake, R.E. (1976). A theoretical statement of responsive evaluation. Studies in Educational Evaluation,
2,19-22.
Stufflebeam, D.L. (1994). Evaluation of superintendent performance: Toward a general model. In
A. McConney (Ed). Toward a unified model: The foundations of educational personnel evaluation
(pp. 35-90). Kalamazoo: Center for Research on Educational Accountability and Teacher
Evaluation (CREATE), The Evaluation Center, Western Michigan University.
Stufflebeam, D.L. (2001). Guidelines for developing evaluation checklists: The Checklists development
checklist (CDC). Retrieved from http://www.wmich.edu/evalctr/checklists/.
Thomas, M.D. (1982). Your school: How well is it working? A citizens guide to school evaluation.
Columbia, MD: National Committee for Citizens in Education.
Tiana, A. (in press). INAP - Innovative approaches in school evaluation. Socrates Programme Project
14. Brussels: The European Commission.
Tiersman, C. (1981). Good schools: What makes them work. (Report No. EA 014276). Education USA,
special report. Arlington, VA: National School Public Relations Association. (ERIC Document
Reproduction Service No. ED 210797).
van Bruggen, J. (1998). Evaluating and reporting educational pet[ormance and achievement:
International trends, main themes and approaches. London: The Standing International Conference
of Inspectorates.
Wayson, W.W. (1981). School climate and context inventory. Paper presented at the Annual Meeting
of the Mid-South Educational Research Association, Lexington, KY.
Weiss, J., & Louden, W (1989). Images of reflection. Unpublished manuscript. Toronto, Canada:
Ontario Institute for Studies in Education, University of Toronto.
Wilkinson, J .E., & N apuk, A. (1997). Baseline assessment: A review of literature. Glasgow: Department
of Education, University of Glasgow.
37
The Development and Use of School Profiles
ROBERT L. JOHNSON
University of South Carolina, Department of Educational Psychology, SC, USA
Selection of Indicators
Clear Definition
Accuracy of stakeholders' decisions about the status of schools requires that each
indicator has a specific, unambiguous definition. Comparability of data across
schools, and across time, is contingent on the extent to which an indicator is clearly
defined. Stakeholders assume that the statistic accompanying the indicator is
created in the same manner across schools. However, a study of indicators used
by state departments of education in the United States found that comparison of
graduation rates across governing bodies was not warranted: states used three
different formulas to calculate such rates (Clements
& Blank, 1997). To establish common meanings for indicators, some education
agencies have included a glossary of terms in compilations of profiles (Jaeger,
Johnson, & Gorney, 1993; United Kingdom Department for Education and
Employment, 2001).
The meaning of an indicator to stakeholders should be investigated in order to
operationalize it in a form congruent with stakeholders' understanding (Kernan-
Schloss, 1999). For example, when stakeholders express the need for information
about safety in a school profile, the profile development committee should review
potential methods for operationalizing the indicator to select a form congruent
with stakeholders' understanding of the indicator. In the instance of a safety
indicator, would information about disciplinary actions address stakeholder con-
cerns? Or, does the public want survey data about how safe students feel in their
schools?
For school report cards to serve the decision making of stakeholders, the content
should reflect consumer information needs. In Alberta, Canada, stakeholders
(i.e., students, educators, administrators, support staff, parents, and the general
public) selected indicators for review of the Grande Prairie School District's educa-
tional programs (Mestinsek, 1993). Stakeholders selected resource indicators
relating to instruction (e.g., professional attributes, inservice, and substitute
days) and funding (e.g., cost efficiency, and staffing) and process indicators
relating to school climate (e.g., student morale, teacher morale, and attendance).
Stakeholders selected such outcome indicators of student achievement as scores
on provincial achievement tests, diploma exams, and retention rates.
Although stakeholder audiences may be treated as a generalized group, it may
be more appropriate to determine informational needs of specific stakeholder
groups (Nevo, 1974). Across studies of school profiles, parents consistently
endorsed such resource indicators as class size (Kochan, Franklin, Crone, &
Glascock, 1994; Olson, 1999) and teacher qualifications (Jaeger, Gorney, Johnson,
Putnam, & Williamson, 1993; Kochan et al., 1994; Olson, 1999). Parents also
endorsed process indicators such as attendance (Kochan et al., 1994; Olson,
1999) and types of programs offered (e.g., special education courses and
advanced courses for the gifted) (Jaeger, Gorney et al., 1993; Olson, 1999). In
addition, parents consistently indicated a desire for information about the school
environment, such as safety and percentage of students suspended/expelled
(Jaeger, Gorney et al., 1993; Kochan et al., 1994; Olson, 1999).
Across studies, parents endorsed outcome indicators, such as student achieve-
ment (Jaeger, Gorney et al., 1993; Kochan et al., 1994; Lee & McGrath, 1991;
Olson, 1999), graduation rates (Jaeger, Gorney et al., 1993; Olson, 1999), and
dropout rates (Kochan et al., 1994; Olson, 1999). Parents questioned reliance on
test scores alone to determine school performance. However, when interpreting
school profiles, these same parents found student performance results important
in rating school quality (Jaeger, Gorney et al., 1993; Olson, 1999).
School board members considered it important to include context indicators
about student characteristics (Jaeger, Gorney et al., 1993) and such resource
information as school facilities, school finance, and characteristics of teachers
and staff. Board members indicated a preference for process indicators that
provided information about the types of services available to students, a school's
programmatic offerings, and school environment, as well as for indicators based
on standardized testing information and school success information (e.g.,
graduation and promotion rates, number of diplomas awarded, and students' post-
graduation plans).
School faculty endorsed the inclusion in school profiles of resource indicators
such as teacher qualifications and class size and the process indicator of atten-
dance (Kochan et al., 1994; Olson, 1999). They also supported test scores as an
outcome indicator (Kochan et al., 1994; Lee & McGrath, 1991; Olson, 1999), but
at a lower rate than the general public (Lee & McGrath, 1991; Olson, 1999).
Dropout rate was also endorsed by educators across studies (Kochan et al., 1994;
Olson, 1999).
The previous paragraphs examined the indicators that various stakeholder
groups endorsed; however, across studies some stakeholders questioned the
reporting of some indicators. Parents considered gender and ethnicity data to be
useless information when reviewing school profiles to judge the quality of schools
(Jaeger, Gorney, et al., 1993), and expressed reservations about reporting student
socioeconomic status (Franklin, 1995). Some stakeholders expressed the
viewpoint that demographic information is possibly divisive or may prejudice
stakeholder expectations for students in a school, whereas others simply con-
sidered the information irrelevant (Olson, 1999).
School outcomes may be compared with those of the nation, the school district,
or similar schools. However, reports that compare student outcomes for a school
with all other schools ignore the natural grouping of students in schools and the
fact that students do not begin their school experience at the same educational
level. In recognition of this situation, a large number of states compare school
outcomes with student achievement in similar schools.
The selection of indicators that will be used to form comparison groups (also
referred to as grouping indicators) should meet two criteria: the indicators
should be factors beyond the control of the schools, and research should indicate
that there is a relationship between the grouping indicators and student
outcomes (Salganik, 1994; Webster & Mendro, 1995; Willms, 1992). A grouping
indicator that meets both criteria is socioeconomic status. Thus, equity would
require that the achievement levels in schools be compared with achievement in
schools of similar socioeconomic status.
Other context indicators used to form comparison groups include parent educa-
tion, unit size (including urbanicity and population density), limited-English
proficiency, and student mobility. Comparison groups also have taken into account
parent occupation, ethnicity, mean teacher salary, percentage of female students,
time watching television, number of books and magazines in the home, average
daily membership change, funding formula, and first grade test scores. Assessment
of student achievement levels at entry to a school, such as the readiness test
formerly used in the South Carolina accountability model (Mandeville, 1994), is
also considered necessary for forming fair comparison groups (Willms, 1992).
Approaches for grouping that take into account contextual information typi-
cally compare a school's performance with clusters of similar schools or compare
a school's performance with its predicted performance through use of regression
or mixed-effects models (Fetler, 1989; Salganik, 1994; Sanders, Saxton, & Horn,
1997; Webster & Mendro, 1995; Willms, 1992; Willms & Kerckhoff, 1995). An
advantage of comparing schools grouped on a clustering variable is that stake-
holders will more readily understand the basis for the comparison. Regression
and mixed-effects models eliminate the need for grouping schools; however, the
more complex analyses are likely to impede stakeholder understanding of the
basis for the comparison.
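To make the regression approach concrete, the following sketch (in Python, with invented school data; the figures are illustrative and are not drawn from any of the studies cited above) fits an ordinary least squares line predicting each school's mean test score from a single context indicator and reports how far each school falls above or below the performance expected of schools with a similar student population.

    # A minimal sketch (hypothetical data) of the regression approach described
    # above: each school's mean test score is predicted from a context indicator
    # (here, percentage of low-income students), and the school is judged against
    # its predicted score rather than against all schools.
    import numpy as np

    pct_low_income = np.array([10, 25, 40, 55, 70, 85], dtype=float)
    mean_score = np.array([82, 78, 74, 72, 65, 63], dtype=float)

    # Ordinary least squares fit: score = slope * pct_low_income + intercept
    slope, intercept = np.polyfit(pct_low_income, mean_score, deg=1)
    predicted = slope * pct_low_income + intercept
    residual = mean_score - predicted  # positive: above expectation for similar schools

    for pct, score, pred, res in zip(pct_low_income, mean_score, predicted, residual):
        print(f"{pct:3.0f}% low-income: score {score:.1f}, expected {pred:.1f} ({res:+.1f})")

A school with a positive residual is performing above what its student population would predict, even if its raw score ranks low; mixed-effects models extend the same logic with school-level random effects, at the cost of the interpretability noted above.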
While the rationale for forming comparison groups is to provide fair compar-
isons of student outcomes for schools operating with similar student populations
and resources, to some stakeholders the practice of adjusting outcomes accord-
ing to the differential impact of student demographics may suggest that lower
expectations for some students are acceptable (Fetler, 1989; Linn, 2001; Olson,
1999). To avoid this implicit assumption, Fetler (1989) suggested that profiles
should provide trend data for the school and comparisons based on school
clusters and all schools. Linn (2001) concurs with the use of trend data; however,
he indicates that ranking schools based on comparisons with similar schools may
result in confusion. He provides an example of performance ranking for five
elementary schools in a district. When compared to all other schools, the perform-
ance ranking of one elementary school was the lowest in the district. However,
when compared to similar schools, the ranking of the school was the highest in
the district. Such a reversal in rankings may create doubt for stakeholders about
the credibility of the information. The situation is avoided if Linn's counsel is
taken and school improvement, rather than current performance, is used as the
measure of school effectiveness, since improvement allows for different starting
points while holding similar expectations for growth for all schools.
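The reversal Linn describes is easy to reproduce with toy numbers. In the sketch below (invented figures, not Linn's data), School E posts the lowest raw score in its district yet ranks first within a statewide cluster of demographically similar schools:

    # A toy illustration of the ranking reversal discussed above: School E is
    # lowest in its district but highest among demographically similar schools
    # drawn from a wider, statewide pool.
    district = {"A": 85, "B": 83, "C": 80, "E": 64}
    similar_cluster = {"E": 64, "X": 60, "Y": 58, "Z": 55}

    def rank(scores, name):
        """1-based rank of `name` when scores are ordered highest to lowest."""
        ordered = sorted(scores, key=scores.get, reverse=True)
        return ordered.index(name) + 1, len(ordered)

    print("E within its district: rank %d of %d" % rank(district, "E"))          # 4 of 4
    print("E among similar schools: rank %d of %d" % rank(similar_cluster, "E"))  # 1 of 4

Both rankings are arithmetically correct; the confusion arises only when they are reported side by side without explanation of the comparison group.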
As with the selection of indicators, the design of profiles should take into consid-
eration audience preference in terms of format and data presentation. For
example, lengthy school profiles are considered problematic by stakeholders
(Olson, 1999). Participants in one study approved of the report cards published
for schools in New York until they realized that the report was about 12 pages in
length. Extremely short profiles, however, do not appear to be a requirement of
stakeholders: Jaeger, Gorney et al. (1993) reported that parents and school
board members preferred four-page rather than two-page profiles. In another
study, parents indicated a preference for report cards that are specific to school
level (i.e., elementary, middle, and high school) to avoid having empty tables,
such as student dropouts at the elementary level (Kochan et al., 1994).
Data Presentation
Presentation of data should consider consumer needs; however, research on the
preferences of stakeholders in this area is not conclusive. One study found that
school board members preferred tabular presentation of data over narrative and
that parents also had a slight preference for tabular reports rather than narrative
(Jaeger, Gorney et al., 1993). However, in a review of Louisiana school profiles,
parents wanted more information presented in narrative (Kochan et al., 1994).
To assist stakeholders in the interpretation of profile data, Rhode Island
provided short narratives below tables and charts to explain "what you're looking
at" and "what you're looking for" (Olson, 1999, p. 36). Parents have also indi-
cated that the narrative in school profiles should be monitored for readability
(Caldas & Mossavat, 1994; Kochan et al., 1994). To address these stakeholder
needs, the Louisiana school profiles supplemented the tabular presentation of
data with explanatory text that is written at an 8th grade reading level.
To facilitate interpretation of data, Louisiana parents suggested that data in
tables be modified so that frequencies accompany percentages (Kochan et al.,
1994). Stakeholders generally favor the inclusion of trend data for student
outcomes (Jaeger, Gorney et al., 1993; Olson, 1999). Furthermore, school board
members have indicated that pupil-teacher ratios and attendance rates should be
compared with district information (Jaeger, Gorney et al., 1993). A desire for
points of comparison was the focus of some comments in Delaware (Lee &
McGrath, 1991) where respondents indicated that high school results should be
compared with other schools in the state and nation.
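As a concrete, hypothetical illustration of these presentation preferences, the sketch below renders an indicator line with a frequency alongside its percentage and a district figure as a point of comparison; the labels and numbers are invented:

    # A minimal sketch (invented figures) of the presentation conventions noted
    # above: frequencies alongside percentages, with a district comparison value.
    def indicator_row(label, count, total, district_pct):
        pct = 100.0 * count / total
        return f"{label:<24}{count:>5} ({pct:5.1f}%)   district: {district_pct:5.1f}%"

    print(indicator_row("Daily attendance", 438, 460, 93.2))
    print(indicator_row("Students suspended", 18, 460, 5.7))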
School profiles have existed at least since 1969 when individual school report
cards were published in a compilation entitled The Columbus School Profile: A
Report of the Columbus Public Schools to the Community (Merriman, 1969).
However, the fact that many of the profiles published in the late 1980s and early
1990s appeared in compilations raises doubt about their distribution to a broad
audience and questions about their role in providing information about the
schools to the general public. Use of profiles as accountability tools assumes that
the reports will be distributed to stakeholder groups; that stakeholders will be
prepared to use the information contained in the reports; and that profile use
will be integrated into school policy and procedures.
The effective use of report cards requires that stakeholders be able to interpret
the information; however, it is likely that stakeholders will require training in
order to understand the indicators and the implications of the information they
provide. If administrators and teachers are to use profile information to plan for
school improvement, and if they will be expected to assist parents in the inter-
pretation of the information, then staff development will need to address these
skills (Kernan-Schloss, 1999; Soholt & Kernan-Schloss, 1999). Training should
not be reserved only for school employees; for example, some states offer
workshops for news reporters (Olson, 1999).
If school profiles are conceptualized as a tool for parents to effect change
(Center for Community Change, 2001), then education agencies will need to assist
parents in understanding how such information can be used in school reform.
The preparation of parents to use school profiles is required because parents
have indicated that, although they would use the information to select schools, it
would be unlikely that such information could be used to effect change (Olson,
1999). One method for assisting parents is offered by Ohio where suggestions are
provided about how stakeholders can follow up on report card information and
the type of questions they might ask their public schools (Olson, 1999).
Some policies for school profiles will be mandated by national, state, or provin-
cial governing bodies; however, many issues in the publication, distribution, and
utilization of profiles will be left to local school boards to set policy. Local boards
may specify additional avenues for distribution of profiles to the public, they may
identify areas for staff development, and they may elect to set policy about the
use of profiles in planning school improvement (Soholt & Kernan-Schloss,
1999). If the profile is to be an accountability tool, schools should be required to
write an action plan based on report card information (Brown, 1999).
In addition to incorporating profiles into school planning, education agencies
will want to establish review cycles for the revision of profiles (Johnson, 2000).
Stakeholder information needs are likely to change over time and can be identi-
fied and addressed in such a cycle. The inclusion of profiles in a review cycle also
allows the education agency to assess the quality of indicators once the publica-
tion of profiles begins. Indicators identified as of little public interest, or that
bear little relationship to school outcomes, can be re-evaluated by the review
committee as profiles are revised.
CONCLUSION
Based on the state of the art in research on school profiles, the following issues
appear relevant in the design and use of profiles. Educators and the public gen-
erally approve of the inclusion of achievement measures in profiles; however,
stakeholders also express reservations about sole reliance on this type of indica-
tor. Other outcome indicators endorsed by stakeholders include graduation and
dropout rates. Stakeholders consistently express an interest in information that
provides a broader base for drawing conclusions about school effectiveness.
Across studies of school profiles, stakeholders indicate an interest in information
about safety and teacher qualifications. Stakeholders also indicate that
comparisons and trend data should be provided in profiles. Agencies in the
process of developing or revising school profiles should determine whether the
information needs of their stakeholders parallel those expressed in these studies.
The differing purposes of profiles, and the unique information requirements
of stakeholders, may require tailoring profiles for different stakeholder
audiences, such as parents and school board members. School profiles will also
need to be tailored to each school level to allow inclusion of indicators relevant
for each student population, such as dropout rates for high schools. To promote
understanding of the information, publishers of profiles should consider accom-
panying data that are presented in tables with explanatory text, and percentages
should be accompanied by frequencies. Length of profiles should be approxi-
mately four pages and the narrative written at the stakeholders' reading level.
Finally, education agencies will need to establish multiple avenues for the
distribution of profiles. Effective use of profiles will require staff development
for school personnel and opportunities for the general public to interpret profile
information with teachers and administrators. Establishment of a review cycle
for the revision of school profiles will allow education agencies to keep indicators
current with stakeholders' information needs and to study the relationship
between indicators and student outcomes.
ENDNOTE
1 Examples of school profiles are readily available on the Internet, so readers who wish to see and
study sample profiles can easily do so. The Center for Education Reform and the Heritage
Foundation both maintain lists of links to school report cards across the United States:
<http://www.edreform.com/education_reform_resources/school_report_cards.htm> and
<http://www.heritage.org/reportcards>. Standard & Poor's, which traditionally reviews and evalu-
ates corporations, has also developed school evaluation services; it currently offers distinctive
school profiles for two states and hopes to expand these services in the future:
<http://www.ses.standardandpoors.com>. Profiles for schools in Ontario, Canada, are available at
<http://esip.edu.gov.on.ca/English>.
REFERENCES
Bobbett, G., Achilles, C., & French, R. (1994). Can Arkansas school districts' report cards on schools
be used by educators, community members, or administrators to make a positive impact on student
outcome? Paper presented at annual meeting of the Southern Regional Council on Educational
Administration, Atlanta, GA.
Bobbett, G., French, R., & Achilles, C. (1993). The impact of community/school characteristics on
student outcomes: An analysis of report cards on schools. Paper presented at annual meeting of
the American Educational Research Association, Atlanta, GA.
Bobbett, G., French, R., Achilles, C., & Bobbett, N. (1995a). An analysis of Nevada's report cards on
high schools. Paper presented at annual meeting of the Mid-South Educational Research
Association, Biloxi, MS.
Bobbett, G., French, R., Achilles, C., & Bobbett, N. (1995b). Texas' high school report cards on
schools: What parents, educators, or policymakers can glean from them. Paper presented at annual
meeting of the Mid-South Educational Research Association, Biloxi, MS.
Bobbett, G., French, R., Achilles, C., McNamara, J., & Trusty, F. (1992). What policymakers can learn
from school report cards: Analysis of Tennessee's report cards on schools. Paper presented at annual
meeting of the American Educational Research Association, San Francisco, CA.
Brown, R. (1999). Creating school accountability reports. School Administrator, 56(10), 12-14, 16-17.
Caldas, S., & Mossavat, M. (1994). A statewide program assessment survey of parents', teachers', and
principals' perceptions of school report cards. Paper presented at annual meeting of the American
Educational Research Association, New Orleans, LA.
Center for Community Change. (2001). Individual school report cards: Empowering parents and
communities to hold schools accountable. Washington, DC: Author.
Clements, B., & Blank, R. (1997). What do we know about education in the states? Education indicators
in SEA reports. Paper presented at annual meeting of the American Educational Research
Association, Chicago, IL.
Fetler, M. (1989). Assessing educational performance: California's school quality indicator system.
Paper presented at annual meeting of the American Educational Research Association, San
Francisco, CA.
Franklin, B. (1995). Improving Louisiana's school report cards: Public opinion. AERA SIG: School
Indicators and Report Cards Newsletter, 1-3.
Franklin, B., Glascock, C., & Freeman, J. (1997). Teacher retention as an indicator: Useful or not? Paper
presented at annual meeting of the American Educational Research Association, Chicago, IL.
Gorney, B. (1994). An integrative analysis of the content and structure of school report cards: What
school systems report to the public. Paper presented at annual meeting of the American Educational
Research Association, New Orleans, LA.
Howell, J. (1995). Using profiles to improve schools. Paper presented at annual meeting of the
American Educational Research Association, San Francisco, CA.
Jaeger, R., Gorney, B., Johnson, R., Putnam, S., & Williamson, G. (1993). Designing and developing
effective school report cards: A research synthesis. Kalamazoo: Center for Research on Educational
Accountability and Teacher Evaluation (CREATE), The Evaluation Center, Western Michigan
University.
Jaeger, R., Johnson, R., & Gorney, B. (1993). The nation's schools report to the public: An analysis of
school report cards. Kalamazoo: Center for Research on Educational Accountability and Teacher
Evaluation (CREATE), The Evaluation Center, Western Michigan University.
Johnson, R. (2000). Framing the issues in the development of school profiles. Studies in Educational
Evaluation, 26, 143-169.
Kernan-Schloss, A. (1999). School report cards: How to tell your district's story - and tell it well.
American School Board Journal, 186(5), 46-49.
Kochan, S., Franklin, B., Crone, L., & Glascock, C. (1994). Improving Louisiana's school report cards
with input from parents and school staff. Paper presented at annual meeting of the American
Educational Research Association, New Orleans, LA.
Lee, S., & McGrath, E. (1991). High school profile impact study. Dover, DE: Delaware Department
of Public Instruction.
Lingen, G., Knack, R., & Belli, G. (1993). The development and use of school profiles based on
opinion survey results. Measurement and Evaluation in Counseling and Development, 26, 81-92.
Linn, R. (2001). Reporting school quality in standards-based accountability systems. (CRESST Policy
Brief 3). Los Angeles, CA: Center for Research on Evaluation, Standards, and Student Testing,
University of California.
Mandeville, G. (1994). The South Carolina experience with incentives. In T. Downes & W. Testa
(Eds.), Midwest approaches to school reform: Proceedings of a conference held at the Federal Reserve
Bank of Chicago (pp. 69-91). Chicago: Federal Reserve Bank of Chicago.
Merriman, H. (1969). The Columbus school profile: A report of the Columbus Public Schools to the
community. Columbus, OH: Columbus Public Schools.
Mestinsek, R. (1993). District and school profiles for quality education. Alberta Journal of Educational
Research, 39(2), 179-189.
National Association of Secondary School Principals Committee on School-College Relations.
(1992). Guidelines for designing a school profile. Reston, VA: Author.
Nevo, D. (1974). Evaluation priorities of students, teachers, and principals (Doctoral dissertation,
Ohio State University, 1974). Dissertation Abstracts International, 35, 11A. (University Microfilms
No. AAI7511402).
New Zealand Education Review Office (2000). Accountability review process. Retrieved from
http://www.ero.govt.nz/AcctRevInfo/ARprocesscontents.htm.
Oakes, J. (1989). What educational indicators? The case for assessing school context. Educational
Evaluation and Policy Analysis, 11, 181-199.
Olson, L. (1999). Report cards for schools: No two states report exactly the same information.
Education Week, 18(17), pp. 27-28, 32-36.
Pelowski, B. (1994). Developing a school profile in an overseas setting. International Schools Journal,
27, 15-17.
Rafferty, E., & Treff, A. (1994). School-by-school test score comparisons: Statistical issues and
pitfalls. ERS Spectrum, 12(2), 16-19.
Salganik, L. (1994). Apples and apples: Comparing performance indicators for places with similar
demographic characteristics. Educational Evaluation and Policy Analysis, 16, 125-141.
Sanders, W., Saxton, A., & Horn, S. (1997). The Tennessee Value-Added Assessment System: A
quantitative outcomes-based approach to educational assessment. In J. Millman (Ed.), Grading
teachers, grading schools: Is student achievement a valid evaluation measure? (pp. 137-162).
Thousand Oaks, CA: Corwin Press.
Soholt, S., & Kernan-Schloss, A. (1999). School report cards: The board's policy role. American
School Board Journal, 186(5), 48.
United Kingdom Department for Education and Employment. (2001). School and college
performance tables. Retrieved from http://www.dfee.gov.uk/perform.shtml.
Webster, W. (1995). The connection between personnel evaluation and school evaluation. Studies in
Educational Evaluation, 21, 227-254.
Webster, W., & Mendro, R. (1995). Evaluation for improved school level decision-making and
productivity. Studies in Educational Evaluation, 21, 361-399.
Willms, J. (1992). Monitoring school performance: A guide for educators. Washington, DC: Falmer
Press.
Willms, J., & Kerckhoff, A. (1995). The challenge of developing new educational indicators.
Educational Evaluation and Policy Analysis, 17, 113-131.
38
Evaluating the Institutionalization of Technology in
Schools and Classrooms1
JENNIFER POST
Learning Research and Development Center, University of Pittsburgh, PA, USA
WILLIAM BICKEL
Learning Research and Development Center, University of Pittsburgh, PA, USA
Schools around the globe are scrambling to make sure that their students do not
end up on the wrong side of the "digital divide." As technology assumes an
increasingly central role in economic development, no nation can allow its next
generation, or any segment of it, to be left behind. Under pressure from parents,
politicians, and the public, schools have invested heavily in the most visible
accoutrements of the "wired" or "high tech" school: hardware, software, and
connectivity. Finding the resources to invest in technology has been one of the
dominant concerns of schools over the last decade. As the installed base of tech-
nology in schools continues to grow, a second, and perhaps more educationally
significant, digital divide looms. While equity in basic access to technology remains
a problem within and between nations, an increasing gap is appearing even
among those schools that are well equipped with basic technological tools. This
second digital divide is the gap in the ability to use technology in educationally
significant ways to achieve specific, meaningful learning goals. In this chapter, we
propose a framework that evaluators and school personnel can use to assess capac-
ity building and institutionalization efforts that increasingly have been identified
as essential to the effective implementation and use of educational technology.
Scriven (1967) argued that evaluators must always be explicit about their
evaluative criteria, that is, the standards of value they are using to judge the merit
or worth of the program under study. Our framework for evaluating educational
technology implementation focuses on the crucial issues of human capital (the
extent to which a school's personnel have developed the knowledge, skills, and
confidence to use the school's technology in educationally valuable ways), and
institutionalization (the extent to which technology is self-sustaining and has
become embedded in the school's culture and practice).
The worldwide flood of public and private investment in technology for primary
and secondary schools shows no signs of abating.2 In the United States, for
example, Meyer (1999) reported that "by 1994, 98 percent of American schools
had one or more computers for instructional use ... in contrast to fewer than
20 percent in 1981" (p. 3). A 1999 Education Week survey noted that American
schools currently have an average of one instructional computer for every
5.7 students (Smerdon et al., 2000, p. 5). Internationally, among the 12 developed
countries participating in the 1998-99 Information Technology in Education
Study, the student-to-computer ratio ranged from 11.1 to 157.1 at the primary
level, 8.8 to 43.7 at the lower secondary level, and 8 to 34.1 at the upper second-
ary level (OECD, 2001, p. 260).
The growing investment in hardware and software that these statistics repre-
sent has been further bolstered by the more recent commitment in school systems
to connect schools to the Internet. In 1994, only 35 percent of U.S. schools had
access to the Internet (Coley, Cradler, & Engle, 1997). By 1999, 89 percent of
schools and more than 50 percent of classrooms were connected (Jerald & Orlofsky,
1999). Internationally, the 1998-99 Information Technology in Education Study
showed that in almost all participating countries more than 50 percent of schools
were connected to the Internet at all levels of education, with many countries in
the 70-90 percent range. Even more significantly, nearly every country had a
goal approaching 100 percent of schools being connected to the Internet by 2001,
with the notable exception of Japan, which appears to be taking a much more
cautious approach to this investment (OECD, 2001, p. 260). Perhaps the most
compelling statistics are those that show ratios of student to Internet-connected
computers. According to Jerald and Orlofsky (1999), there were 19.7 students
for every connected computer in United States schools in 1998. One year later,
that ratio was 13.6 to 1. Clearly, schools around the world are investing heavily
in both the sheer numbers of computers available for use and in the resources to
which computers are connected.
While it is admittedly difficult to get accurate information on costs, it is estimat-
ed that the cost of technology in U.S. schools in 1994-1995 was about $3.2 billion
or about $70 per pupil. One estimate is that the additional investment to provide
"technology rich learning environments" would be in the neighborhood of $300
per student (Coley et al., 1997, p. 58). As U.S. public schools worked to build such
environments, they spent an estimated $5.67 billion on technology in 1999-2000
(Weiner, 2000, citing Market Data Retrieval's annual survey), almost double, in
a single year, the cost of the entire installed base of just five years earlier.
Why the investment in technology? There are two broad justifications. The first
rests with a vision of the workplace and the implications that this has for student
preparation. The Organisation for Economic Co-operation and Development
(OECD) speaks of the need for schools around the world to "prepare students
for the information society" (OECD, 2001, p. 246). Similarly, a presidential advi-
sory panel in the United States noted that "information technologies have had
an enormous impact within America's offices, factories, and stores over the past
several decades" (President's Committee of Advisors on Science and Technology,
1997, p. 4). The implication is clear: "to get the high paying job, a graduate of
high school or college must be technologically literate" (Cuban, 2000, p. 42). The
second broad rationale for investment in technology concerns anticipated
efficiencies in the organizational and teaching/learning processes of educational
institutions: "such technologies have the potential to transform our schools ...
[providing the justification] for the immediate and widespread incorporation
within all of our schools" (President's Committee, 1997, p. 4). The OECD argues
that technology is "potentially a powerful catalyst for growth and efficiency"
(2001, p. 246). The projected transformations take many forms, but essentially
include support for "individual and group learning, instructional management,
communication (among all players), and a wide range of administration functions"
(Wenglinsky, 1998, p. 7). The International Society for Technology in Education
(ISTE) characterized the "new learning environments" to be created with tech-
nology as "student-centered learning," "multipath progression," "collaborative
work," "inquiry-based learning," "critical thinking," and "authentic, real world
context" (ISTE, 1998). The desired outcome domains involve, among others,
enhanced student learning (both in core curriculum subjects and in terms of
technical applications), more powerful instructional processes, and increased
organizational efficiencies. The vision of the promise of technology among its
advocates is both comprehensive and shining, with use resulting "in the pene-
tration of technology into every aspect of the school day, theoretically making
Here's a puzzle for both cheerleaders and skeptics of using new technolo-
gies in the classroom. Out of every 10 teachers in this country, fewer than
two are serious users of computers and other information technologies in
their classrooms ... three to four are occasional users ... and the rest ...
never use the machines at all. (p. 68)
Indeed, in a 1998 survey, only 20 percent of U.S. teachers felt "very well prepared"
to integrate educational technology into their teaching (U.S. Department of
Education, 1999). What is equally disturbing is the fact that too often when
teachers are using technology, the applications are relatively low level and
traditional:
The advent of computers and the Internet has not dramatically changed
how teachers teach and how students learn. Computers have typically been
used for ... drill and practice and computer education (as opposed to
solve problems and conduct research) (Smerdon et al., 2000, p. 101).
The lack of sound teacher pre- and inservice preparation for applying technology
has been identified as a crucial barrier to more effective use of technology in
schools (Bickel, Nelson, & Post, 2000a; Glennan & Melmed, 1996; Schofield, 1995;
Smerdon et al., 2000). In the past few years, however, there has been some evidence that
educational organizations are getting the "human capital building" message in
the technology arena. Recent surveys indicate that school spending for staff
development rose a modest 3 percent in 2000 compared to 1999 (Editorial
Projects in Education, 2001, p. 52). However, for 1998-99, U.S. schools projected
spending only $5.65 per student on teacher technology training, as compared to
$88.19 per student on hardware, software, and connectivity, far from the 30
percent of their technology budgets the U.S. Department of Education recom-
mends allocating to professional development (CEO Forum, 1999, p. 6). Research
identifies key elements of human capital development, such as adequate time,
focused and timely assistance, broad institutional support, and a clear vision of
the educational purposes of technology, as important in successful training and
staff development efforts (Bickel, Nelson, & Post, 2000a;
Nelson & Bickel, 1999; OTA, 1995; Schofield, 1995).
As investment in technology and professional development related to its
implementation continue, the hard and still largely unanswered questions about
its educational value are increasingly difficult to ignore. Thus, the evaluation of
investments in human capital within technology initiatives is critical both for
providing continuous, formative feedback to improve implementation, and for
informing the field about the unique factors that influence the institutionalization of
technology in schools and classrooms. Evaluators need both a broad conceptual
framework and knowledge of the factors that research has shown affect the
successful investment in human capital to institutionalize technology use. While
we have found no single model of evaluation specific to the institutionalization
of technology via investments in human capital, we can draw on information
gleaned from at least two distinct "pockets" in the literature base: characteristics
of effective staff development (both generally and technology-specific) and
frameworks for evaluating staff development. The contributions and limitations
of each of these to the evaluation of human capital investments in technology are
discussed in turn.
Much has been learned about the general characteristics of effective staff devel-
opment (both overall and content-specific). The professional development
literature provides both general and specific recommendations for creating and
implementing professional development that may lead to improved instructional
practices and enhanced student learning (Asayesh, 1993; Guskey, 1994; Guskey
& Huberman, 1995; Joyce & Showers, 1980; Joyce, Showers, & Rolheiser-Bennett,
1987; Kennedy, 1999; Loucks-Horsley, Hewson, Love, & Stiles, 1998). Numerous
organizations have attempted on the basis of such research to establish broad
principles and characteristics of effective staff development (e.g., American
Federation of Teachers, U.S. Department of Education, National Partnership for
Excellence and Accountability in Teaching, National Staff Development Council,
National Center for Research on Teacher Learning, and Northwest Regional
Educational Laboratory). These characteristics are now being translated into
state- and district-level standards for professional development. Research-based
characteristics of effective staff development can aid evaluators in identifying
factors within the school or district that should be examined to provide formative
feedback. However, these are global staff development characteristics and are
not situated within a conceptual or temporal framework. Evaluators must be
aware of these principles of effective staff development but also can benefit from
a broad conceptual framework so that key variables are not overlooked.
Furthermore, these characteristics are not specific to a content area and must be
"translated" into the context of technology professional development. Numerous
researchers have conducted studies to identify the characteristics of effective
technology staff development (e.g., Brand, 1997; Bray, 1999; Meyer, 1999; Norton
& Gonzales, 1998).
Brand's (1997) review of the staff development and educational technology
literature identified several characteristics of effective staff development specific
to technology, which are presented in such a way that they provide program
developers with key issues to consider as an initiative is developed and imple-
mented. Key issues cited include provision of time, consideration of individual
differences in teacher learning needs, provision of flexible staff development
opportunities, provision of continued support, provision of opportunities to
collaborate, provision of incentives, provision of on-going staff development,
explicating the link between technology and educational objectives, communi-
cation of a clear administrative message, and provision of intellectual and
professional stimulation. Missing from this list are indications of relationships
between their presence or absence and key "milestones" along the path to success-
ful technology institutionalization. Further, they are not tied to a larger framework
for organizing and evaluating the development of human capital over time.
Bray (1999) presents 10 steps for installing effective technology staff develop-
ment which include the creation of a staff development subcommittee, provision
of examples of how technology can be used, utilization of a needs assessment
instrument, development of individual learning plans, identification of
technology leaders, creation of a list of on-site learning opportunities, circulation
of the list of off-site learning opportunities, provision of time for grade-level or
departmental meetings, exchange of success stories, and continuous planning
and evaluation. As with Brand's characteristics, Bray's steps are most useful in
considering how to develop a technology staff development program and may
signal elements that evaluators should look for, but they do not provide a full-
fledged framework for evaluators in technology staff development contexts.
Hasselbring et al.'s (2000) literature review lists and explains the major barriers
to the integration of technology into curriculum. It addresses preservice teacher
education, teacher licensure, staff development characteristics, administrative
policies, and technological infrastructure. Specific barriers to and enablers of
technology infusion that have been identified in the research base are described.
Such barriers and enablers are extremely useful for program developers and
policymakers but would be more useful for evaluators if they were situated
within a larger conceptual framework of the phases in building human capital.
The framework we present in this paper was developed in the course of our work
on a small-scale evaluation project taking place within the context of a compre-
hensive, district-wide technology integration effort in a large urban school district
in the United States. District plans originally envisioned a $100 million technology
enhancement effort. In the face of an ongoing deficit crisis in the district, the
technology plan was downsized through several rounds of board negotiations,
finally settling at about a quarter of its original size. The plan ultimately adopted
focused almost exclusively on hardware and software, was targeted only to
specific grade levels and subject areas, and was to be phased in over three years.
One of the most significant implications of the downsized plan was the
elimination of training funds from core district funding of the plan. Grant funds
from the state and a local foundation were used to support a much reduced train-
ing agenda. Rather than district-wide staff development, technology training was
to be provided only to a team of six to eight teachers and administrators within
each school who were then expected to disseminate their new skills to the rest of
their school's faculty. The training initially focused largely on developing teachers'
basic computer and software skills, with relatively little attention to how to
integrate technology into the curriculum to help students reach learning objectives.
The slimmed-down plan also delegated almost all responsibility for technical
support to the school level. Schools were to designate "technical support
teachers" to maintain and support both lab and classroom computers, although
no explicit provisions were made at the district level to free up time for them to
perform these functions.
Political and fiscal constraints left the district with a technology plan that no
one would have deliberately designed, one that was focused entirely on hardware
and software with no funds for training or support. As many others have
recognized (e.g., Meyer, 1999; Norton & Gonzales, 1998; OTA, 1995), adequate
hardware is necessary but hardly sufficient for the infusion of technology into
instructional processes. But even orphan programs - with designs so unlovable
no one would claim them - have a logic of their own. As evaluators it is important
to spell out the theory of a program after it has been through the political mill as
it will actually be enacted, not as it would be in some ideal world. In doing so, we
of course recognize that the reality of program implementation can fall far short
of sound program theory. By identifying up front the district's critical
assumptions about how this pared-down plan was intended to produce
improvements in student learning, we were in a stronger position to understand
implementation.
The key implementation vehicle became the "school technology team," which
included the principal, the librarian, classroom teachers from the grade levels
and subject area targeted for technology integration (language arts), and one or
two "technology support teachers" (classroom teachers who volunteered for the
role because they had the skills or at the very least the interest to support others
in the use of instructional technology). Using grant funds, the team from each
school was provided with a baseline of training (totaling about 90 hours) designed
to enable members to take the lead in disseminating and supporting technology
use in their schools.
The study was structured by one overarching question: How did the district
technology implementation (including staff development) affect teachers'
capacities to apply technology to strengthen learning processes in classrooms?
Our work in answering this question was informed by a human capital approach
to investing in technology which underscores that it is how the technology is
implemented and supported that conditions the classroom payoff. We had to
track crucial intermediate variables between the intervention (providing small
teams in each school with some basic computer skills training) and the desired
outcome (integrating technology into classroom practice in meaningful ways). In
practice, this meant that our evaluation questions centered on documenting how
the program's basic change theory assumptions were actually being realized in
the field:
• How was the training provided to each school's tech team being disseminated
to the rest of the school?
• How and to what extent were the technology skills that teachers were learning
being integrated into the curriculum?
• How could the implementation of the program's key assumptions about the
process of change be strengthened to shore up the connection between the
training intervention and student learning effects?
The evaluation framework presented in this chapter was developed and refined
in the course of our two-year experience documenting and providing formative
feedback to the instructional technology capacity-building effort in a large school
district. Having found the framework to be a useful intellectual structure for
analyzing data and reporting results, we then validated and further refined it
based on our review of the literature on effective professional development for
educational technology use. Because of district budgetary pressures, the frame-
work was developed in a program context where there was an extreme imbalance
between investment in the physical and human capital needed for successful
educational technology implementation. Indeed, with the exception of private
grant funds, institutionalization at the school level was left almost entirely to
school-level initiative and resources, making this an extreme test case for evalua-
ting the factors needed for school-level capacity building. While other educational
technology implementation contexts may have a more deliberate human capital
development component, we contend that the issues of human capital develop-
ment highlighted by this framework will be salient in the evaluation of almost any
educational technology intervention. Certainly the political, social, educational,
and investment contexts of school technology use will vary widely among the
international readers of this handbook, and we do not claim that every element
of the framework is universally applicable to every school in every country.
However, we do argue that, as a set, they comprise a useful framework of factors
for evaluators to attend to in assessing whether schools have the capacity to use
their technology in educationally productive ways.
THE FRAMEWORK
Our review of the relevant literature and our evaluation work in this field suggest
to us the need for evaluators to focus on the question of how human capital is
developed in schools to support the investment in the physical capital of tech-
nology. As part of the district-wide but school-focused instructional technology
implementation evaluation described above, we proposed and refined a concep-
tual framework showing how the growth of school-level ownership of technology
over time can be assessed as progress on three interrelated learning curves
(Bickel, Nelson, & Post, 2000b). In Figure 1 we attempt to show how the institu-
tionalization of technology in a school's culture and practice can be understood
as movement up the learning curves of three aspects of human capital relevant
to successful instructional technology implementation: growing and maintaining
technical infrastructure; building teacher technology skills; and achieving
curriculum integration.
The curves show how as a school builds each of these aspects of human capital
there is movement both over time (along the X axis) and in terms of increasing
building-level ownership of the technology (along the Y axis). We also propose
that the curves are sequential from left to right, with development on one laying
the ground for movement on the next. A functioning technical infrastructure is a
necessary precondition for the development of teachers' basic software skills,
which in turn are required for the effective integration of technology into the
curriculum, the necessary precursor to the ultimate goal of investments in
instructional technology: enhanced student learning.
[Figure 1: Three learning curves for the development of human capital and the
institutionalization of technology in a school. The x axis represents time; the
y axis represents increasing building-level ownership of the technology.]
While we found this broad conceptual framework of the three learning curves
extremely useful in structuring our evaluation work, organizing data collection,
and communicating findings to evaluation clients, our experience showed that to
maximize the utility of the learning curves as an evaluation framework we needed
to further specify how schools move up the curves. Thus we added a second level
to the framework, positing that movement on each of the three learning curves
proceeds along three dimensions: depth (quality of technology use), breadth
(quantity and variety of technology use), and sustainability (infrastructure to
support technology experiences over time).
Our fieldwork indicated that for growth on a given learning curve to occur, all
three dimensions had to be in place. Thus, even if technology is spread broadly
across grades and curriculum areas and is deeply integrated with the design of
instruction, technology integration cannot be said to be institutionalized in the
school's curriculum if those experiences are not sustainable over time.
Our general framework consists of the three interrelated learning curves and
the three dimensions of growth necessary to climb each. For each dimension the
evaluation project described above also generated a set of factors related to
successful technology implementation and eventual institutionalization in school
learning environments as these were evidenced in our case context. We present
these factors as examples, although by no means a complete set, of what educa-
tors and evaluators should be aware of when using this framework since they
were found to be important in influencing the program under investigation. As
others apply the framework, we fully expect that additional experiences will both
extend the structure presented and elaborate on or modify specific factors.
The evaluation framework can be conceptualized as the intersections of each
of the three learning curves with each of the three dimensions of growth. In this
section of the chapter, we examine each intersection in turn, presenting examples
of the factors that most affect the ability of schools to achieve growth towards
that aspect of the institutionalization of technology in their school culture and
classroom practice. Because the factors emerge from the specific context of one
evaluation study they should not be construed as a comprehensive list that is
automatically generalizable to every school and every situation. Indeed, we are
certain that variables including school and district size, national goals for
technology use, and levels and kinds of technology investments would influence
the factors that shape a given school's technology implementation. However,
given the central role of human capital development in the success of technology
implementation in any context, we believe the factors presented will be
suggestive of issues that most schools will face. We also note a high degree of
overlap between the factors we identified in our evaluation work and those cited
in other studies as crucial to successful educational technology implementation
(Editorial Projects in Education, 1999, 2001; Ertmer, 1999; Meyer, 1999; OTA,
1995; Schofield, 1995). We thus present the factors that follow not as an absolute
and definitive checklist for every school, but as a range of issues that evaluators
and school personnel may find useful to probe (and as a springboard for
identifying other factors specific to their context) as they seek to assess how well
a given school is moving up the successive learning curves towards successful
institutionalization of technology.4
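One way an evaluator might operationalize the framework for data collection is as a simple lookup structure crossing the three learning curves with the three dimensions of growth. The sketch below (in Python) seeds only the technical infrastructure cells with factors discussed in the sections that follow; the entries are illustrative, not the definitive checklist this chapter explicitly disclaims.

    # A minimal sketch of the 3 x 3 evaluation framework: each learning curve
    # crossed with each dimension of growth. Only the technical infrastructure
    # cells are seeded here, with example factors from this chapter; the other
    # cells would be filled in from an evaluator's own fieldwork.
    CURVES = ("technical infrastructure", "teacher technology skills",
              "curriculum integration")
    DIMENSIONS = ("depth", "breadth", "sustainability")

    framework = {(curve, dim): [] for curve in CURVES for dim in DIMENSIONS}
    framework[("technical infrastructure", "depth")] = [
        "teachers handle routine glitches (frozen screens, jammed printers)"]
    framework[("technical infrastructure", "breadth")] = [
        "specialized technical roles among staff",
        "students serving as technology assistants"]
    framework[("technical infrastructure", "sustainability")] = [
        "preventative maintenance routines", "adequate supply budget",
        "expertise not concentrated in a few vulnerable individuals"]

    for (curve, dim), factors in framework.items():
        if factors:
            print(f"{curve} / {dim}:")
            for factor in factors:
                print(f"  - {factor}")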
Depth
As school personnel gain experience and confidence that they can handle
common problems such as a frozen screen or a jammed printer, they become
more autonomous in managing the technology in their own classrooms. This
comfort with basic technical problems is an important aspect of human capital
development with regard to technology because it means that routine glitches
during a technology-based lesson no longer bring instruction to a halt until
"technical support" can be found. The more teachers can provide their own
technical support, the higher the school climbs on this learning curve.
Breadth
Specialization of Roles
Broadening the base of school personnel with basic technical expertise allows for
the complementary strategy of developing specialized roles. Although each
teacher should have the capacity to manage the most common technical glitches
that could interrupt instruction in the classroom, it is more efficient if people
develop different kinds of specialized knowledge. When one person in the school
becomes known as the person to go to with networking problems, another as an
expert in designing Internet searches, and a third as a whiz in spreadsheets,
everyone in the school knows who to go to with a particular issue, and no one
person is overwhelmed with requests for help from colleagues.
Most teachers today are aware that there are students in their classroom whose
knowledge of technology exceeds their own. While some teachers see student
expertise as intimidating, it can be put to good use to fill the void in technical
support that exists in many schools. Giving students who have an interest in
technology formal roles as "technology assistants" or "lab monitors" not only
keeps technology-based instruction flowing more smoothly, it can sometimes
draw in previously disengaged students. Schools that are not afraid to give
students these kinds of roles broaden their ability to manage their technical
infrastructure.
Sustainability
Very few schools can count on the specialized corps of technical support per-
sonnel that keep computers running in the business world. Instead, teachers
must take on many basic technical support tasks themselves, or see computers go
unused. Schools where technical expertise stays heavily concentrated in just a
few individuals put the sustainability of their technical infrastructure at risk. The
few teachers known to be "techies" will be besieged by often urgent requests for
help from their colleagues, leaving them with the choice of either leaving a fellow
teacher stranded or cutting into their own teaching and planning time. These
individuals are often zealous enthusiasts for the educational uses of technology
and will make great personal sacrifices to support their school's infrastructure
for a period of time, but they are highly vulnerable to burn-out.
To maintain their technical infrastructure over the long haul, schools need to
move away from a crisis mode in which ad hoc solutions are organized in res-
ponse to particular problems and towards a predictable routine of preventative
maintenance. For example, in many schools teachers and groups of students
shuffle in and out of heavily used computer labs all day long, with no continuity
of oversight to report and deal with technical problems. In schools that are devel-
oping the human capital to take ownership of their infrastructure, evaluators will
likely see the development of systems such as lab monitors or problem tracking
forms to identify and address problems before they derail the instruction of the
next unsuspecting teacher to bring a class to the lab.
Supply Budget
Adequate supply budgets prevent hardware and software from sitting idle for
lack of basic replenishable supplies such as printer ribbons and disks, and are
thus a key component to look for in a sustainable school-level technology
infrastructure.
Flexible Time
Technical glitches have the potential to cause a carefully designed lesson to break
down, wasting the valuable time of teachers and students unless they can be
remedied quickly. Indeed, the fear of such glitches is a major reason many teachers
are wary of relying on technology for instructional delivery. Keeping learning
going requires on-the-spot technical support, which requires flexible time from
those who know how to solve the problems. Schools that are developing a
sustainable technical infrastructure build flexible time for technical support into
their schedule, either through establishing a formal technical support role with
some time freed from teaching duties, or at least by ensuring that the planning
periods of technically expert teachers are distributed so that one is always avail-
able to get their peers out of a jam.
Stable Funding
Almost every factor listed so far as something to look for in evaluating the
sustainability of a school's ability to maintain its own technical infrastructure
implies the need for time: time for training, time for troubleshooting, time freed
from teaching to provide technical support. And time, of course, implies money.
Although schools will probably never approach the ratio of spending on ongoing
support versus spending on hardware and software that is typical in corporations,
they cannot hope to rely entirely on improvisation to keep their infrastructure
functioning. One estimate put the "total cost of ownership" (including upgrades,
support, and training) to a school of a $2,000 personal computer at $1,944 per
year (Fitzgerald, 1999). A realistic financial commitment to ongoing technical
support is a zero factor in assessing the sustainability of a school's technical
infrastructure.
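As a back-of-the-envelope illustration of what the Fitzgerald (1999) estimate implies, the short calculation below totals the cost of one machine over its service life; the five-year lifespan is our hypothetical assumption, not a figure from the estimate.

```python
# Illustrative arithmetic for the "total cost of ownership" estimate
# cited above: a $2,000 computer with $1,944 per year in upgrades,
# support, and training. The five-year lifespan is assumed here.
purchase_price = 2_000
annual_ownership = 1_944   # upgrades, support, and training per year
lifespan_years = 5         # hypothetical service life

total_cost = purchase_price + annual_ownership * lifespan_years
ongoing_share = (annual_ownership * lifespan_years) / total_cost
print(f"Total cost over {lifespan_years} years: ${total_cost:,}")    # $11,720
print(f"Ongoing costs as a share of the total: {ongoing_share:.0%}")  # 83%
```

On these assumptions, ongoing costs dwarf the purchase price, which is the point of the "zero factor" claim: without stable funding for support, the hardware investment itself is at risk.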
Depth
The goal of developing depth in building teacher technology skills is for teachers
to learn how to use the available technology (hardware and software) well. This
means teachers moving beyond being novices to becoming experts in the various
applications and understanding their potential uses to accomplish tasks. Three
factors are related to achieving depth in teacher technology skills.
Quality of Training
The extent to which technology training meets the needs of adult learners, pro-
vides adequate guidance, fosters higher-order thinking and application, and is
sensitive to teachers' fears and concerns is important in setting the tone for
technology use in the school and in establishing teachers' level of base knowledge.
In particular, the extent to which the training helps teachers see connections
between the technology and their professional and instructional obligations can
be a good predictor of teachers' perspectives on the utility of the technology and
thus the potential depth of their skill development. Familiarity with and
application of the large and growing body of research that identifies and explains
characteristics of effective professional development design and implementation
is one indicator that schools are building the capacity to implement high quality
technology training.
Follow-Up from Training
Incentives to Apply
School or district incentives to begin applying newly learned technology skills can
provide the motivation for teachers to experiment with and ultimately apply the
technology in desired ways. Evaluators should look for recognition or other
rewards for teachers who expand their technology skills, which can provide
impetus for teachers to continue deepening their knowledge and skills.
Breadth
The goal of developing breadth in building teacher technology skills is for the
entire teaching staff to build and maintain their skills in technology (hardware
and software). Rather than just a few "techies" having the knowledge and skills
for expert-level use of applications, the entire teaching staff becomes capable of
employing technology at an advanced level. At least two factors are related to
developing this breadth.
When schools institute a plan to build human capital in technology, staff members
bring a broad range of prior experience, comfort, and skill with technology to the
task. The extent to which the training and materials are sensitive to these differ-
ences can affect the breadth of teacher technology skills achieved. Furthermore,
the staff at individual schools within a district may have a wide range of tech-
nology training needs. Attention to the appropriateness of training strategies and
materials for the specific context of the school and staff is critical to institu-
tionalizing technology among the entire staff.
Sustainability
Technology is changing rapidly and schools must develop a plan for keeping theirs
up-to-date or risk the erosion of teachers' ability to use updated hardware and
software. Continued professional development is necessary to sustain high quality
technology preparation for students.
A school culture that values innovation and risk-taking, and accepts the corres-
ponding failures and mistakes, is important in setting the stage for teachers to
continue to challenge themselves, broaden and deepen their technology skills,
and share their experiences with colleagues. It is crucial that the energy and
enthusiasm around technology use be such that resistant teachers are both
comfortable and motivated to "play" with the technology. Evaluators should
look for a school environment that uses "mistakes" as learning opportunities and
provides patient guidance to teachers learning new skills.
Depth
In many schools, an instructional period is less than an hour. Teachers know that
this is often inadequate for engaging in tasks of any intellectual depth. Add in the
logistical challenges and time demands (booting up, logging on, saving, printing,
etc.) of a technology-based lesson and the length of the instructional period
becomes a significant factor conditioning the potential for depth of curriculum
integration.
Teacher-Student Ratio
Curriculum-Specific Training
Mentoring
Only when teachers feel comfortable with their own ability to use hardware or
software will they begin to see opportunities to use the technology as a tool to
reach instructional objectives, and be willing to use it with students without fear
of a lesson-derailing glitch. Evaluators should consider this as a zero factor in
assessing the potential for depth of curriculum integration; teachers who have
not achieved a basic comfort level with the technology are unlikely to use it with
students at all, much less use it in transformative ways.
In classrooms where students are unfamiliar with the keyboard and/or with basic
on-screen navigation, most instructional time will be consumed with technical
issues rather than engagement with content. Thus, for depth of integration to be
possible, evaluators should look for technology implementations with a specific
strategy for preparing students to participate in and take advantage of trans-
formative uses of the technology.
Teachers who are just developing their own technology skills and sense for how
to use them instructionally benefit from concrete models of how technology that
is available in their classrooms can be applied to the curriculum they are expected
to teach. Once they have gotten their feet wet with ready-made materials they
will be more likely to innovate and deepen their use of technology. Although
hundreds of thousands of technology-based lesson plans exist on the web, few
teachers have the time or skills to sift through them and evaluate their quality
and relevance to their particular curriculum. Evaluators should look for strategies
and structures that facilitate the sharing of high-quality, relevant lesson ideas
both from central education offices and among schools and classrooms.
In many countries teachers are under increasing pressure to help their students
reach defined curriculum standards. Such standards are often accompanied by
testing and incentive systems that carry high stakes for schools, teachers, and
students. At the same time, schools are making large investments in technology,
accompanied by either formal mandates or less formal pressures for its use.
Rarely, however, are the pushes for standards and technology use explicitly
linked. Indeed, teachers may see the two as competing rather than complemen-
tary priorities. Evaluators should look for technology implementations that have
an explicit link to standards, that is, mechanisms such as teacher technology
training that is grounded in curriculum and "lesson banks" that show teachers
that technology can be a tool for achieving standards, not a competing demand
on instructional time.
Breadth
Most schools will have teachers who will be early adopters, enthusiastically
figuring out ways to use whatever technology is available to enhance the student
learning experience. Such teachers have the skills and initiative to take content-
free productivity software (word processors, spreadsheets, presentation packages,
etc.) and use it to achieve instructional objectives. If classroom use is to be
institutionalized beyond that core, however, more hesitant teachers need to see
technology with built-in content that is clearly linked to their curriculum. The
level of investment in software that is explicitly tied to the content of a school's
curriculum, thus reducing the need for novice teachers to make the leap between
technology and curriculum unaided, can help evaluators predict the spread of
instructional use.
Student and Parent Demand
In many parts of the world, students are coming to school with an increasingly
high baseline of technology skills. They are accustomed to accomplishing tasks
with the computer and thus may raise the issue even when teachers are hesitant.
Similarly, as parents become more and more computer literate and convinced of
the importance of technology for their children's future, they may push technology-
shy teachers into greater utilization of the school's computers. Assessing the
demand of students and parents for instructional technology can help evaluators
predict the spread of classroom computer use beyond those teachers who are
early and enthusiastic adopters.
Instructional Champions
In many schools there will be one or more teachers who are so convinced of the
importance of instructional technology for student learning that they serve as
informal coaches to their peers, providing concrete ideas about specific products,
websites, or lesson plans. Librarians can also be vital in playing this role in their
schools. In the absence of someone at the school-level playing a formal role of
instructional technology coach or coordinator, the existence of such "champions"
who do it on their own time out of sheer enthusiasm can be an important predic-
tor of the spread of curriculum integration.
Sustainability
Standards and goals for students should make it clear that technology use is a
vital element of the learning process. Ideally, these goals should be woven into
content area standards rather than standing on their own, signaling to teachers
and students alike the role of technology as a tool, not an end in itself.
One of the most powerful ways that educational administration can signal a
commitment to technology integration is to build goals for substantive, curriculum-
embedded technology use into teacher evaluation systems. Such incentive
systems would be strong evidence of a commitment to capacity building.
Instructional Support
Like all professionals using technology as a tool, teachers need support that
helps them keep the computers up and running so they can do their jobs. In
addition to technical support, however, teachers need support that is tied to the
substance of the task at hand: using the technology to enhance student learning.
This kind of individualized, classroom-based assistance can be characterized as
"instructional support." The presence of informal and formal mechanisms for
providing instructional support for teachers integrating technology is important
evidence in an evaluation that a school is building the necessary human capital
to institutionalize technology as part of the teaching and learning process.
Valuable elements of instructional support that evaluators should look for include
school-based capacities for: (i) working with teachers to think through their
grade-level and subject area curriculum to identify opportunities for technology-
infused lessons to reach curricular objectives; (ii) assisting teachers with Internet
searching and the evaluation of curriculum-specific software to identify resources
that are both relevant and of high quality; (iii) prompting teachers with creative
ideas about how functions of commonly available productivity software (word
processors, spreadsheets, etc.) can be used instructionally; (iv) assisting teachers
in planning for the unique timing, classroom management, assessment and other
instructional design issues that come up in running a lab-based lesson; (v) helping
teachers set up the lesson (create templates, bookmark relevant websites, etc.);
(vi) providing hands-on support ("a safety net") when the teacher is actually
teaching the lesson.
Administrative Priorities
Building human capital for the sustainable integration of technology into the
curriculum requires a commitment of time and resources. The use of technology
in a school is unlikely to become systematic, widespread, or sustainable without
a tangible commitment from administration at the school and higher levels.
Evaluators should seek evidence of administrative priorities through the alloca-
tions given to technology in schedules and budgets, leadership through modeling
technology use, and the creation of incentive systems that reward instructional
technology use.
When teachers need fresh inspiration for how to use technology to teach a given
lesson, contacts with peers in other schools or experts in education departments
or universities can make the difference between retreating to the familiar and
continuing to climb the learning curve of curriculum integration.
Stable Funding
All of the above factors imply to some extent the commitment of time and other
resources. For instructional technology use to become institutionalized in a school,
these commitments need to become part of the base budget, not dependent on
year-to-year or special program funding.
CONCLUSION
ENDNOTES
1 This work was supported by the Heinz Endowments as part of the Evaluation Coordination Project
based at the University of Pittsburgh's Learning Research and Development Center. The views
expressed herein are those of the authors; no official endorsement of the sponsoring organization
should be inferred.
2 The concept of "technology" can encompass a wide range of elements. For the purpose of this
discussion, we will use the term broadly, while recognizing that much of the debate and reporting
is about computer and networked communication accessibility and applications.
3 These authors recognize that the case for technology is not without its critics. It is beyond the
purview of this paper to address the pros and cons of technology in education. For some interesting
discussions of this issue see Cuban (2000) and Schofield (1995).
4 In practice, evaluators will find that many of the factors cited here impact more than one learning
curve and more than one dimension of growth. In general, we have located each factor in the cell
we found it to influence most strongly. In a few cases, particularly powerful factors appear in more
than one cell. In using this framework, evaluators should consider the importance of factors on
multiple aspects of technology institutionalization.
REFERENCES
Asayesh, G. (1993). Staff development for improving student outcomes. Journal of Staff Development,
14(3), 24-27.
Bickel, W.E., Nelson, C.A., & Post, J.E. (2000a). Does training achieve traction? Evaluating
mechanisms for sustainability from teacher technology training. Paper presented at annual meeting
of the American Evaluation Association, Honolulu, HI.
Bickel, W.E., Nelson, C.A., & Post, J.E. (2000b). Briefing report: PPS technology professional
development program. Pittsburgh, PA: Learning Research and Development Center, University of
Pittsburgh.
Brand, G.A. (1997). What research says: Training teachers for using technology. Journal of Staff
Development, 19(1).
Bray, B. (1999). Ten steps to effective technology staff development. Retrieved from the George Lucas
Educational Foundation Web site: http://www.glef.org/.
Evaluating the Institutionalization of Technology in Schools and Classrooms 869
CEO Forum. (1999). Professional development: A link to better learning. Washington, DC: Author.
Coley, R.J., Cradler, J., & Engle, P.K. (1997). Computers and classrooms: The status of technology in
US classrooms. Princeton, NJ: ETS Policy Information Center.
Cuban, L. (1999). The technology puzzle. Education Week, 18(43), 68, 47.
Cuban, L. (2000). Is spending money on technology worth it? Education Week, 19(24), 42.
Editorial Projects in Education (1999). Technology counts '99: Building the digital curriculum.
Education Week, 19(4).
Editorial Projects in Education (2001). Technology counts 2001: The new divide. Education Week,
20(5).
Ertmer, P.A. (1999). Addressing first and second order barriers to change: Strategies for technology
integration. Educational Technology Research and Development, 47(4), 47-61.
Fitzgerald, S. (1999). Technology's real costs: Protect your investment with total cost of ownership.
The School Technology Authority, September. Retrieved from http://www.electronic-school.com/
199909/0999sbot.html.
Glennan, T.K., & Melmed, A. (1996). Fostering the use of educational technology: Elements of a
national strategy. Santa Monica, CA: RAND.
Guskey, T.R. (1994). Results-oriented professional development: In search of an optimal mix of
effective practices. Journal of Staff Development, 15(4), 42-50.
Guskey, T.R. (2000). Evaluating professional development. Thousand Oaks, CA: Corwin Press.
Guskey, T.R., & Huberman, M. (Eds.) (1995). Professional development in education: New paradigms
and practices. New York: Teachers College Press.
Hasselbring, T., Smith, L., Williams Glaser, C., Barron, L., Risko, V., Snyder, C., Rakestraw, J., &
Campbell, M. (2000). Literature review: Technology to support teacher development. Report
prepared for NPEAT Project 2.1.4.
ISTE (International Society for Technology in Education). (1998). National Educational Technology
Standards for Students. Eugene, OR: Author.
Jerald, C.D., & Orlofsky, G.P. (1999). Raising the bar on school technology. Education Week, 19(4),
58-62.
Joyce, B., & Showers, B. (1980). Improving inservice training: The messages of research. Educational
Leadership, 37(5), 379-385.
Joyce, B., Showers, B., & Rolheiser-Bennett, C. (1987). Staff development and student learning: A
synthesis of research on models of teaching. Educational Leadership, 45(2), 11-23.
Kennedy, M.M. (1999). Form and substance in mathematics and science professional development.
NISE Brief, 3(2), 1-7.
Loucks-Horsley, S., Hewson, P.W., Love, N., & Stiles, K.E. (1998). Designing professional development
for teachers of science and mathematics. Thousand Oaks, CA: Corwin Press.
Means, B., & Olson, K. (1995). Technology's role in educational reform: Findings from a national study
of innovating schools. Washington, DC: Office of Educational Research and Improvement, U.S.
Department of Education.
Meyer, T.R. (1999). A general framework for good technology integration in schools and criteria for the
development of an effective evaluation model. Paper presented at annual meeting of the American
Evaluation Association, Orlando, FL.
Nelson, C.A., & Bickel, W.E. (1999). Evaluating instructional technology implementation: A framework
for assessing learning payoffs. Paper presented at annual meeting of the American Evaluation
Association, Orlando, FL.
Norton, P., & Gonzales, C. (1998). Regional educational technology assistance initiative - phase II:
Evaluating a model for statewide professional development. Journal of Research on Computing in
Education, 31, 25-48.
OTA (Office of Technology Assessment). (1995). Teachers and technology: Making the connections.
(S/N 052-003-61409-2). Washington, DC: U.S. Government Printing Office.
OECD (Organisation for Economic Co-operation and Development). (2001). Education at a
glance: OECD indicators. Paris: Author.
President's Committee of Advisors on Science and Technology. (1997). Report to the President on
the use of technology to strengthen K-12 education in the United States. Washington, DC: Author.
Schofield, J. (1995). Computers in the classroom. Cambridge, MA: Cambridge University Press.
Scriven, M. (1967). The methodology of evaluation. In R.E. Stake (Ed.), Curriculum evaluation
(pp. 39-83). Chicago: Rand McNally.
Smerdon, B., Cronen, S., Lanahan, L., Anderson, J., Iannotti, N., & Angles, J. (2000). Teachers' tools
for the 21st Century: A report on teachers' use of technology. Washington, DC: National Center for
Education Statistics, U.S. Department of Education.
U.S. Department of Education. (1999). Teacher quality: A report on the preparation and qualifications
of public school teachers. Washington, DC: National Center for Education Statistics.
Weiner, R. (2000, November 22). More technology training for teachers. New York Times.
Wenglinsky, H. (1998). Does it compute? The relationship between technology and student achievement
in mathematics. Princeton, NJ: ETS Policy Information Center.
Section 10
Evaluation in education has for the most part focused on assessment of the achieve-
ments of individual students. Students are assessed informally, day-in, day-out,
in classrooms throughout the world as an integral part of the work of teachers,
who use a variety of procedures to determine learning progress, to identify diffi-
culties students may be experiencing, and to facilitate learning. Evaluation
becomes more formal when students take tests or examinations, which may be
administered by teachers or by external agencies, to provide evidence of achieve-
ment that may be used for certification or selection for further education.
In recent years, interest has grown in obtaining information about the achieve-
ments of a system of education as a whole (or a clearly defined part of it), in
addition to information about the achievements of individual students. The
information may be obtained for a nation or state in an exercise called a national
or local statewide assessment (also sometimes called a system assessment, learning
assessment, or assessment of learning outcomes), or it may involve obtaining
comparative data for a number of states or nations (in an exercise called an inter-
national comparative study of achievement). In both cases, individual students
complete assessment tasks, but usually only aggregated data are considered in
analyses and reports. However, in some statewide and national assessments, data
may also be used to make observations or decisions about individual schools,
teachers, or even students.
In this introductory note to the section, some of the factors that have contri-
buted to a growth in interest in providing data on the achievements of education
systems are considered. This is followed by a description of national assessments
(purposes, procedures, and uses) and of recent expansion in national assessment
activity. International assessments are then briefly described before the chapters
in the section are introduced.
While it has not been common practice, there have been occasions in the past
when the performances of individual students have been aggregated to provide
NATIONAL ASSESSMENTS
(Benveniste, 2000; Meyer & Rowan, 1978) or, in developing countries, to satisfy
the requirements of donors supporting reforms. In these cases, the act of assess-
ment would seem to be of greater consequence than its outcomes (Airasian, 1988).
Assessments may also be mechanisms of disciplinary power and control,
procedures that are used to modify the actions of others (Foucault, 1977). Two
procedures can be identified. In the first, which may be termed diagnostic
monitoring, an attempt is made to identify problems and remediate learning
deficiencies (Richards, 1988). Following identification, various resources (new
programs, new educational materials, and in-service training for teachers) may be
provided.
An alternative procedure, based on principles of microeconomics, may be
termed performance monitoring; in this case, the focus is on organizational
outcomes and the objective is to improve student achievement primarily through
competition. Its basic rationale - that assessment procedures to which high stakes
are attached have the ability to modify the behavior of administrators, teachers,
and students - has a long pedigree, and its proponents would be comfortable in
the presence of the ghosts of John Stuart Mill and Adam Smith. No specific action
may be required beyond the publication of information about performance (e.g.,
in league tables). However, in some cases inducements for improved performance
may be provided; for example, schools and/or teachers may receive money if
students achieve a specified target (e.g., if 85% reach a satisfactory level of
proficiency).
Whether a national assessment can be used for diagnostic or performance moni-
toring will depend on its design. If it is based on a sample of schools/students
(e.g., the American NAEP), it can be used only for diagnostic purposes, and then
only for diagnosis at the system level. Census-based national assessments, on the
other hand, have the option of operating either a system of diagnostic moni-
toring or of performance monitoring. An example of the former is to be found in
the national assessment in Uruguay (Benveniste, 2000), and of the latter in the
national assessment in England and in many mandated state assessment
programs in the United States.
INTERNATIONAL ASSESSMENTS
In the first two papers in this section, descriptions are provided of two major
national assessment systems: national assessment in the United States (Lyle
Jones) and assessment of the national curriculum in England (Harry Torrance).
Both papers describe the history of the assessments, and how they have changed
since they were instituted, in the United States in 1969, and in England in 1988.
The histories differ considerably. This is not surprising, given that the systems
differed radically in their conception. In the United States, the focus was mainly
on diagnostic monitoring and on tracking changes in educational achievement,
while in England assessment was designed for performance monitoring. Apart
from the fact that the systems were introduced to address different political prior-
ities, differences in the educational contexts in which they were implemented
also contributed to differences in the ways in which the systems evolved.
However, there is also some evidence of convergence between practice in the
two countries. Thus, the English system has moved towards the use of more
standardized and less complex assessment procedures. On the other hand, while
the American system continues to be based on samples of schools and provides
data only for identifiable groups of students, not individual schools or students,
recent legislation provides for the testing of all students in grades 3 through 8.
Individual states have also developed their own standards and assessment pro-
grams, and results of the programs are being used to determine grade promotion
and high school graduation for individual students.
In their chapter on the complex area of state and school district evaluation
activity in the United States, William Webster, Ted Almaguer, and Tim Orsak
consider some of the implications for public school evaluation units of the
increase in state activity in assessment and accountability activities. They observe
that public school evaluation units have not prospered during the 1990s, and that
the increase in state involvement may have led to a sacrifice of concepts such as
validity and equity in evaluation in the interest of political expediency.
The final two papers in the section deal with international studies of educa-
tional achievement. Tjeerd Plomp, Sarah Howie, and Barry McGaw describe
the purposes and functions of such studies and the organizations that carry them
out. Central issues relating to the design of studies and methodological consider-
ations occupy a considerable portion of the chapter. In the final chapter, William
Schmidt and Richard Houang address the measurement of curriculum and its
role in international studies, which is considered three-fold: as the object of an
evaluation, as the criterion for it, and as the context in which an evaluation is
interpreted.
REFERENCES
Airasian, P.W. (1988). Symbolic validation: The case of state-mandated, high-stakes testing.
Educational Evaluation and Policy Analysis, 10, 301-313.
Beaton, A.E., Postlethwaite, T.N., Ross, K.N., Spearritt, D., & Wolf, R.M. (1999). The benefits and
limitations of international educational achievement studies. Paris: International Institute for
Educational Planning/International Academy of Education.
Benveniste, L. (2000). Student assessment as a political construction: The case of Uruguay.
Education Policy Analysis Archives, 8(32). Retrieved from http://epaa.asu.edu/epaa/.
Blum, A., Goldstein, H., & Guérin-Pace, F. (2001). International Adult Literacy Survey (IALS): An
analysis of international comparisons of adult literacy. Assessment in Education, 8, 225-246.
Coleman, J.S., Campbell, E.Q., Hobson, C.J., McPartland, J., Mood, A.M., Weinfeld, F.D., & York,
R.L. (1966). Equality of educational opportunity. Washington, DC: Office of Education, U.S.
Department of Health, Education, and Welfare.
Coolahan, J. (1977). Three eras of English reading in Irish national schools. In V. Greaney (Ed.),
Studies in reading (pp. 12-26). Dublin: Educational Company.
Foucault, M. (1977). Discipline and punish. The birth of the prison (A. Sheridan, Trans.). New York:
Viking.
Goldstein, H. (1996). International comparisons of student achievement. In A. Little, & A. Wolf
(Eds), Assessment in transition. Learning, monitoring and selection in international perspective
(pp. 58-87). Oxford: Pergamon.
Greaney, V., & Kellaghan, T. (1996). Monitoring the learning outcomes of education systems. Washington,
DC: World Bank.
Great Britain. (1997). Excellence in schools. White paper on education. London: Her Majesty's
Stationery Office.
Himmel, E. (1996). National assessment in Chile. In P. Murphy, V. Greaney, M.E. Lockheed, &
C. Rojas (Eds), National assessment. Testing the system (pp. 111-128). Washington, DC: World Bank.
Husen, T. (1973). Foreword. In L.C. Comber & J.P. Keeves (Eds), Science achievement in nineteen
countries (pp. 13-24). New York: Wiley.
Husen, T., & Postlethwaite, T.N. (1996). A brief history of the International Association for the
Evaluation of Educational Achievement (IEA). Assessment in Education, 3, 129-141.
Keenan, P. (1881). Address on education to National Association for the Promotion of Social Science.
Dublin: Queen's Printing Office.
Kellaghan, T. (1996). IEA studies and educational policy. Assessment in Education, 3, 143-160.
Kellaghan, T. (1997). Seguimiento de los resultados educativos nacionales. (Monitoring national
educational performance.) In H.B. Alvarez, & M. Ruiz-Casares (Eds), Evaluacion y reforma
educativa. Opciones de politica (Evaluation and educational reform. Policy options) (pp. 23-65).
Washington, DC: U.S. Agency for International Development.
Kellaghan, T., & Greaney, V. (2001a). The globalisation of assessment in the 20th century. Assessment
in Education, 8, 87-102.
Kellaghan, T., & Greaney, V. (2001b). Using assessment to improve the quality of education. Paris:
International Institute for Educational Planning.
Kellaghan, T., & Grisay, A. (1995). International comparisons of student achievement: Problems and
prospects. In OECD, Measuring what students learn. Mesurer les résultats scolaires (pp. 41-61).
Paris: Organisation for Economic Co-Operation and Development.
Kellaghan, T., & Madaus, G.F. (1982). Trends in educational standards in Great Britain and Ireland.
In G.R. Austin & H. Garber (Eds), The rise and fall of national test scores (pp. 195-214). New York:
Academic Press.
Kellaghan, T., & Madaus, G.F. (2000). Outcome evaluation. In D.L. Stufflebeam, G.F. Madaus, &
T. Kellaghan (Eds), Evaluation models. Viewpoints on educational and human services evaluation
(2nd ed., pp. 97-112). Boston: Kluwer Academic.
Martin, M.O., Hickey, B., & Murchan, D.P. (1992). The Second International Assessment of
Educational Progress: Mathematics and science findings in Ireland. Irish Journal of Education, 26,
5-146.
Martin, M.O., & Morgan, M. (1994). Reading literacy in Irish schools: A comparative analysis. Irish
Journal of Education, 28, 5-101.
Martin, M.O., Mullis, I.V.S., Beaton, A.E., Gonzalez, E.J., Smith, T.A., & Kelly, D.L. (1997). Science
achievement in the primary school years. lEA's Third International Mathematics and Science Study
(TIMSS). Chestnut Hill, MA: TIMSS International Study Center, Boston College.
Meyer, J., & Rowan, B. (1978). The structure of educational organizations. In M. Meyer (Ed.),
Environments and organizations (pp. 78-109). San Francisco: Jossey-Bass.
Michel, A. (1995). France. In OECD, Performance standards in education. In search of quality
(pp. 103-114). Paris: Organisation for Economic Co-Operation and Development.
National Commission on Excellence in Education. (1983). A nation at risk. The imperative for
educational reform. Washington, DC: U.S. Government Printing Office.
OECD (Organisation for Economic Co-Operation and Development)/Statistics Canada. (2000).
Literacy in the information age. Final report of the International Adult Literacy Survey. Paris:
OECD/Ottawa: Statistics Canada.
O'Leary, M., Kellaghan, T., Madaus, G.F., & Beaton, A.E. (2000). Consistency of findings across
international surveys of mathematics and science achievement: A comparison of lAEP2 and
TIMSS. Education Policy Analysis Archives, 8(43). Retrieved from http://epaa.asu.edu/epaa/.
Richards, C.E. (1988). A typology of educational monitoring systems. Educational Evaluation and
Policy Analysis, 10, 106-116.
Shiel, G., Cosgrove, J., Sofroniou, N., & Kelly, A. (2001). Ready for life? The literacy achievements of
Irish 15-year-olds in an international context. Dublin: Educational Research Centre.
Shore, A., Pedulla, J., & Clarke, M. (2001). The building blocks of state testing programs. Chestnut Hill,
MA: National Board on Educational Testing and Public Policy, Boston College.
UNESCO. (1990). World Declaration on Education for All. Meeting basic needs. New York: Author.
UNESCO. (2000). The Dakar Framework for Action. Education for All: Meeting our collective commit-
ments. Paris: Author.
39
National Assessment in the United States:
The Evolution of a Nation's Report Card
LYLE V. JONES1
University of North Carolina at Chapel Hill, Department of Psychology, NC, USA
As soon as it was constituted, TAC became the working arm for ECAPE, with
active assistance from Tyler and Jack Merwin from the University of Minnesota,
who had replaced Steven Whithey as Staff Director. An early decision was to
design assessments for ages 9, 13, 17, and young adults in ten subject areas: art,
citizenship, career and occupational development, literature, mathematics, music,
reading, science, social studies, and writing. It was agreed that lay panels would
establish learning objectives for each area based upon what citizens should know,
rather than only upon what schools were teaching. It also was specified that the tasks
presented were to cover a wide range of difficulty to properly evaluate the per-
formance of both deficient and advanced performers. The first-round national
assessment would include citizenship, reading, and science.
Tyler earlier had met with potential contractors and, in March of 1965, TAC
reviewed preliminary proposals from the American Institutes for Research (AIR),
the Educational Testing Service (ETS), Measurement Research Corporation
(MRC), the Psychological Corporation (Psych Corp), and Science Research
Associates (SRA). Prime contractors were AIR for Citizenship, SRA for Reading,
ETS for Science, the Research Triangle Institute (RTI) for sample design and
exercise administration, and MRC for the printing of exercise materials and the
scoring of responses. From 1965 to 1969, contractor personnel met frequently
with TAC, which monitored their work.
Following the first grant from the Carnegie Corporation to ECAPE, the Ford
Foundation and its Fund for the Advancement of Education granted $1.7 million
to ECAPE between 1965 and 1969, and the Carnegie Corporation provided an
additional $2.3 million. The first federal funding came in 1966 in the form of a
$50,000 award from the Office of Education (OE) for a series of conferences on
assessment.
The funding from Carnegie and Ford supported the formulation of assessment
objectives, the development of exercises related to those objectives, and a variety
of pilot studies and feasibility studies that had been called for by TAC. Based in
part upon findings from these studies, TAC established several innovative guide-
lines for the assessment. Sampling frames were designed both for in-school and
out-of-school assessments. Exercises were to be administered in schools in
50-minute sessions to small groups of children, and out of schools in longer
sessions to individual 17- and 18-year-olds and young adults at ages 26 to 35. For
in-school administrations, a booklet of common exercises would be presented to
the group of respondents with a tape recording of a male voice reading aloud
each question to be answered, and all respondents thus would proceed at the same
pace through the exercise booklet. Different randomly selected groups typically
would receive different exercise booklets, in a "matrix-sampling" design. Thus,
for a given age level, there might be ten distinct exercise booklets, yielding perfor-
mance results for ten times the number of exercises that were administered to
any individual student. As an additional innovation, to discourage guessing and
to reduce non-responses, alternative answers to multiple-choice exercises always
included "I don't know" as one possible answer.
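A minimal simulation (below, in Python) may clarify the arithmetic of the matrix-sampling design; the student and exercise counts are invented for illustration and do not reflect the assessment's actual design parameters.

```python
# Minimal simulation of matrix sampling: each student answers only one
# booklet, yet the assessment yields results for every exercise across
# all booklets. All counts here are invented for illustration.
import random

random.seed(0)
N_BOOKLETS = 10
EXERCISES_PER_BOOKLET = 30
N_STUDENTS = 2_000

# Each student is randomly assigned a single booklet of exercises.
assignments = [random.randrange(N_BOOKLETS) for _ in range(N_STUDENTS)]

print(f"Each student answers {EXERCISES_PER_BOOKLET} exercises,")
print(f"but results accrue for {N_BOOKLETS * EXERCISES_PER_BOOKLET} "
      f"distinct exercises in total.")
for booklet in range(N_BOOKLETS):
    print(f"Booklet {booklet}: {assignments.count(booklet)} students")
```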
Between 1965 and 1967, Tyler and ECAPE held many discussions at various
sites with representatives of the educational establishment. Of special impor-
tance was the effort to encourage cooperation from the American Association of
School Administrators (AASA). To conduct assessment tryouts, it would be neces-
sary to gain access to students in selected schools, for which school-administrator
approval was needed. In an attempt to satisfy some expressed concerns about
possibly invidious comparisons, Tyler and ECAPE assured AASA that
assessment results would not be reported for individual students, or for schools,
In the spring of 1969 CAPE moved forward with a successful in-school assess-
ment of 17-year-olds in citizenship, reading, and science. In June of 1969 ECS
agreed to assume control of NAEP. Later in the year, the U.S. Office of Education
granted $2.4 million to ECS to complete the first-year operation of NAEP. In
1969-70, under the auspices of ECS, assessments in citizenship, reading, and
science were conducted in schools of students aged 9 and 13, as well as out of
school of 17- and 18-year-olds and young adults.
ECS immediately converted TAC to the Analysis Advisory Committee (ANAC).
At the suggestion of TAC, a new Operations Advisory Committee (OPAC) was
formed. A National Assessment Policy Committee was appointed, with a mem-
bership that included the chairs of CAPE, ANAC, OPAC, Ralph Tyler, and one
ECS representative; initially, those members were George Brain, John Tukey,
John Letson, Tyler, and Leroy Greene, who was the Chairman of the Education
Committee of the California State Assembly.
ANAC turned its attention to the analysis of results and to a framework for
reporting assessment findings. To provide a model for reporting, it produced the
report of national results for science (Abelson, Cronbach, Jones, Tukey, & Tyler,
1970), which it followed with a report with subgroup comparisons (Abelson,
Coffman, Jones, Mosteller, & Tukey, 1971). The subgroups included four regions
of the U.S. (Northeast, Southeast, Central, and West), four community sizes (big
cities, urban fringe, medium-size cities, and smaller places), and whether the
student was male or female. (Later reports of first-year results also reported
findings by race and by level of parental education.) Results were presented in
the form of percentages of correct answers (together with standard errors) on
individual assessment exercises. Only some of the exercises were released each
year. The identity of others was kept confidential to allow them to be used again
in later years for the purpose of assessing change over time.
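For readers unfamiliar with this reporting format, the sketch below computes a percentage correct and its standard error for a single exercise. The simple binomial formula assumes simple random sampling; NAEP's clustered national samples required more elaborate, design-based variance estimates, so this is only an approximation of the idea.

```python
# Percentage of correct answers on one exercise, with a standard error.
# The binomial formula assumes simple random sampling, a simplification
# relative to NAEP's clustered sample design. Counts are hypothetical.
import math

def percent_correct(n_correct, n_answered):
    """Return (percentage correct, standard error in percentage points)."""
    p = n_correct / n_answered
    se = math.sqrt(p * (1 - p) / n_answered)
    return 100 * p, 100 * se

pct, se = percent_correct(1_340, 2_000)
print(f"{pct:.1f}% correct (SE = {se:.1f})")   # 67.0% correct (SE = 1.1)
```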
Between 1969 and 1971, Lee Cronbach and Robert Abelson resigned from
ANAC; new members were William Coffman, Frederick Mosteller, and John
Gilbert. In addition to Jones and Tukey, who remained members of ANAC
through 1982, others who served on ANAC were R. Darrell Bock, Lloyd Bond,
David Brillinger, James Davis, Janet Dixon Elashoff, Gene Glass, and Lincoln
Moses.
Within the OE, oversight for NAEP became the responsibility of the National
Center for Education Statistics (NCES) in 1971 and in 1973 funding changed
from a grant to a contract. During the mid-1970s, budgetary uncertainties and
delays in federal payments to ECS became major difficulties for NAEP. Funding
had increased in 1973 from $2 million to $3 million annually. In 1974, President
Nixon's budget request for NAEP was $7 million, but only $3 million was
appropriated by Congress. Cost considerations led to the elimination of out-of-
school sampling, which also meant eliminating the assessment of young adults,
and to reducing the frequency of assessment for some learning areas from the
schedule that had been envisioned in earlier plans.
Although NAEP had been authorized by Congress, few Senators or House
members were familiar with it, and little media attention was given to its reports.
ANAC had envisioned possible media interest in the form of daily newspaper
coverage, presenting and interpreting student performance on individual exercises,
but that proved to have been a naïve expectation. To reduce the information
burden on the audience, reporting began to focus on mean or median perfor-
mance for groups of related exercises, but that too failed to stimulate much
attention. It was hoped that more interest would accrue when it became possible
to report changes in performance from one assessment to the next. That seemed
not to be the case, however, perhaps because changes were small in size, because of
the inevitable delay in producing reports, and because results were based on a
national sample of schools and thus failed to reflect how students performed in
a local school or school district or even in a state. The early hopes of some that
NAEP findings would impact educational policy were not being realized.
Disappointed that NAEP was having little influence on U.S. education, the
Trustees of the Carnegie Corporation in 1972 commissioned an evaluation of the
national assessment. The report was completed in 1973 and later was published
in book form (Greenbaum, 1977). The Greenbaum critique judged that NAEP had
succeeded in meeting some of its objectives: (i) its operational goals of forming
educational objectives and launching and maintaining a viable assessment effort,
(1) virtually no capacity to answer research questions and even very little capac-
ity to generate significant research hypotheses that could not have been
generated more precisely and less expensively by smaller studies;
(2) virtually no capacity to provide the federal government, the lay public, or most
educational policymakers with results that are directly useful for decision
making; and
(3) most surprising of all, virtually no really significant and supportable new
findings with regard to the strengths and weaknesses of American education.
(Greenbaum, 1977, p. 169)
The Greenbaum critique elicited a defense in 1975 from ECS/NAEP staff, pub-
lished as Part 2 in the Greenbaum book. The NAEP staff stated that recent
improvements had rendered obsolete some of Greenbaum's criticisms. There
nevertheless emerged no substantial changes in the operation of NAEP.
In 1973, contract officers at NCES sought increasing management respon-
sibility for NAEP. That led to heightened levels of conflict between OE and the
NAEP Policy Committee at ECS. (The ECS interest in NAEP may have been
enhanced by the substantial annual income to ECS in the form of "overhead
receipts.")
With encouragement from ECS, Congressional legislation in 1978 served to
transfer federal authority for NAEP from NCES to a newly created National
Institute of Education (NIE) within OE, and also specified that funding would
be provided by a grant or cooperative agreement rather than by contract. This
served to restore policy control to ECS rather than to a federal agency. The NIE
call for proposals for the five-year continuation of NAEP resulted in only one
submittal, from ECS. Following some minor revision, the ECS proposal was
funded, and NAEP proceeded much as usual, continuing to disappoint those
critics who envisioned that the project would have a greater degree of policy
relevance. Among those experiencing continuing disappointment were the staff
of the Carnegie Corporation. The initial expectations of Carnegie were not being
met. Thus, in 1980, the Corporation, together with the Ford and Spencer
Foundations, funded yet another evaluation of NAEP, this time by the consulting
firm of Wirtz and Lapointe (1982).
Wirtz and Lapointe appointed a "Council of Seven" to guide the work: Gregory
Anrig, Stephen Bailey, Charles Bowen, Clare Burstall, Elton Jolly, Lauren
Resnick, and Dorothy Shields. The Council sponsored several special reports
and solicited views on how to improve NAEP from more than 150 individuals
representing parents, teachers, school administrators, state legislators, organized
labor, industry, and measurement professionals. The Wirtz-Lapointe report
recommended the continuation of NAEP, but only with numerous important
changes. The report emphasized the need for increased funding. It also recom-
mended an increase in the testing time from one to two hours; reporting by
grades 4, 8, and 12, rather than by ages 9, 13, and 17; providing results state-by-
state and locally within states; reporting by aggregated scores for curriculum
domains and sub-domains, rather than reporting by percentages of correct
answers to exercises; and several additional changes.
Unlike the response in 1978 to the request for proposal, NIE received five
competing proposals to conduct NAEP for the period 1983-1988, including one
from ECS. The winning proposal, from ETS, contained essentially all of the
changes that were recommended by Wirtz and Lapointe. (By 1983, Gregory
Anrig, a member of the Council of Seven, had left his position of Commissioner
of Education in Massachusetts to become ETS President. With the award of the
grant to ETS, Archie Lapointe moved to ETS as Executive Director of NAEP.)
The award to ETS ($3.9 million a year for five years) was in the form of a grant
rather than a contract. Westat replaced RTI as the organization responsible for
sampling and field administration. A new Assessment Policy Committee was
formed, as well as a new Technical Advisory Committee (Robert Glaser, Bert
Green, Sylvia Johnson, Melvin Novick, Richard Snow, and Robert Linn,
Chairman). Other members of this committee through 2001 (later called the
Design and Analysis Committee) have been Albert Beaton, Jeri Benson, Johnny
Blair, John B. Carroll, Anne Cleary, Clifford Clogg, William Cooley, Jeremy Finn,
Paul Holland, Huynh Huynh, Edward Kifer, Gaea Leinhardt, David Lohman,
Anthony Nitko, Serge Madhere, Bengt Muthen, Ingram Olkin, Tej Pandey, Juliet
Shaffer, Hariharan Swaminathan, John Tukey, and Rebecca Zwick.
Beginning with assessments in 1984, students were sampled by grade as well as
by age and, except for reports on achievement change over time, grade became
the primary basis for the reporting of findings. Except for the assessment of long-
term trend, the administration of items by audiotape was discontinued, students
in a single group administration were issued different booklets of exercises
(based upon a "balanced incomplete block" design), and "I don't know" was
removed as an alternative response in multiple-choice items. Results were no
longer reported by percentage of correct responses, exercise-by-exercise;
average scaled scores were reported for domains of items, based upon item
response theory. NAEP reports were designed to be less confusing to a non-
technical audience, and were systematically released at press conferences, with
greater attention to obtaining media coverage.
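As a rough illustration of the item response theory approach mentioned here, the sketch below evaluates a standard three-parameter logistic item response function; the parameter values are invented, and NAEP's operational scaling (with its incomplete-block booklets and group-level estimation) is considerably more involved than this single-item example.

```python
# A standard three-parameter logistic (3PL) item response function:
# the probability of a correct response given student proficiency
# (theta) and item parameters. Parameter values below are invented.
import math

def p_correct(theta, a, b, c):
    """3PL model: a = discrimination, b = difficulty, c = guessing floor."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

for theta in (-1.0, 0.0, 1.0):
    print(f"theta = {theta:+.1f}: P(correct) = "
          f"{p_correct(theta, a=1.2, b=0.0, c=0.2):.2f}")
# Proficiency estimates from such models are then placed on a common
# reporting scale and averaged over groups of students.
```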
In 1983 A Nation at Risk, a publication of the National Commission on
Excellence in Education (1983), presented a ringing and influential indictment
of the quality of pre-college education in the U.S. At the annual meeting of the
Southern Regional Education Board in 1984, several state governors, led by
Lamar Alexander of Tennessee and Bill Clinton of Arkansas, called for improved
measures of educational achievement, to include comparisons of state with
national results. Also that year, the Council of Chief State School Officers
(CCSSO) approved plans for cross-state comparisons. Two years later, the
National Governors' Association issued a report under the leadership of Lamar
Alexander entitled Time for Results. The report called for a better national
"educational report card," and for report cards to be prepared for individual
states. In that same year, Secretary of Education William Bennett appointed a
22-member panel under the chairmanship of Lamar Alexander to make recom-
mendations about the future of NAEP. The panel report was published in March,
1987, and recommended the development by NAEP of a trial state-by-state
assessment and the creation of an assessment policy board independent both of
the Department of Education and the NAEP contractors (Alexander & James,
1987). A year later, with support from Secretary William Bennett, Congress
authorized trial state assessments, sharply increased the funding for NAEP, and
established a National Assessment Governing Board (NAGB). Secretary Bennett
appointed the members of NAGB in December of 1988.
The Congressional legislation also called for an ongoing evaluation of the
state assessments by an impartial panel that was to periodically report its findings
to Congress. The 14-member panel was appointed in 1989 under the auspices of
the National Academy of Education (NAE), with Robert Glaser and Robert
Linn serving as co-chairs.
NAGB faced major new challenges as it came into being. Entirely new working
relations needed to be established among the participant partners on a short-
time schedule, especially those involving NCES and the NAEP contractors, ETS
and Westat. NAGB was to maintain one historical NAEP project to monitor
national achievement trends based on comparable assessment conditions that
had been established many years earlier. It was to establish policy for a "new"
NAEP, so that assessment items would be matched to newly developing
standards for school grades and subject areas. Most notably, it was responsible
for developing a trial state assessment, for encouraging participation of the
states, and for having the program in operation in 1990, NAGB's very first year
of existence. Planning had been underway, of course, by ETS and Westat, with
oversight by NCES, and notably by CCSSO under a grant from NSF and the
Department of Education. The planning activity was essential to the start of state
assessments in 1990.
This statement highlights the ever-present tension between those such as Tyler
who saw NAEP as a means for describing what students know and can do and
others who see NAEP as a means to drive the curriculum to teach what "should
be" the objectives of education. The statement also anticipates another recurring
problem, namely how to encourage states and school districts to participate in
NAEP assessments.
Among NAGB's major innovations was its decision in May of 1990 to report
NAEP findings by "achievement level" (basic, proficient, and advanced). NAGB's
decision to report results by percentages of children at each of these levels was
based on an interpretation of the Congressional mandate for the Board to
"identify appropriate achievement goals for each age and grade in each subject
area to be tested." In the late 1960s, Lee Cronbach had recognized the desirability of such a form of reporting and had urged its consideration by the initial TAC. And in Part 2 of the Alexander-James report, an NAE panel had explicitly supported this kind of reporting.
On July 13, 1993, Education Week covered a critical report by the U.S. General Accounting Office with the headline, "G.A.O. Blasts Method for Reporting NAEP Results."
As noted earlier, a 14-member panel had been appointed in 1989 by NAE to
monitor the trial state assessment. In 1992, Commissioner of Education
Emerson Elliot asked the panel to evaluate NAGB's standard-setting activities
pertaining to the assessments of 1990 and 1992. Already in its first report, the
NAE panel had given consideration to the NAGB process for setting standards,
and had recommended that:
Given the problems that occurred with the 1990 assessment and the great
importance of the achievement-levels-setting process, the procedures for
setting achievement levels should be evaluated by a distinguished team of
external evaluators as well as by an internal team. (National Academy of
Education, 1992, p. 41)
In its report dedicated to the setting of standards, the panel concluded that the
procedure used for establishing achievement levels was "fundamentally flawed
because it depends on cognitive judgments that are virtually impossible to
make." The report went on to:
Once again, Education Week reported on the issue; on September 22, 1993, its
story was published under the headline, "Yet Another Report Assails NAEP
Assessment Methods."
In 1996, responsibility for the Congressionally mandated evaluation of NAEP
was transferred from NAE to the National Academy of Sciences (NAS). The first
evaluative NAS report recommended that the process for setting achievement levels be replaced.
Of the report's several recommendations, this one received the greatest atten-
tion. It followed a conclusion by the NAS Committee that "the current process
for setting NAEP achievement levels is fundamentally flawed," thereby echoing
the conclusion of earlier panels of evaluators. The Congressional reautho-
rization of NAEP in 1998 includes a proviso that reports by achievement level be
accompanied by a statement "that the achievement levels are to be considered
developmental and should be interpreted and used with caution." Four other
recommendations in the NAS report also deserve attention and are discussed in
a later section of this chapter.
Even prior to the first national assessment of 1969-70, planners anticipated that
little interest would be generated by results of a single assessment, because there
were no firm standards against which to judge the adequacy of measures of
student performance. Consequently, it was thought that the primary value of
NAEP would be the monitoring of changes in achievement over time. It is a tribute
to those responsible for NAEP over its lifetime that maintaining conditions for
assessing long-term trends consistently was given high priority. This required not
only special sampling by age (9, 13, and 17) but also following procedures that
remained unchanged from the earliest assessments, to ensure comparability over
time of the results obtained in the special samples.
The most extensive report of achievement trends to date is that of Campbell,
Hombo, and Mazzeo (2000), which presents results for reading, mathematics, and
science at ages 9, 13, and 17 for each of the national assessments from the early
1970s until 1999. Results for all assessment years have been converted to average
scale scores, as established by ETS in 1985. Results from earlier assessments
were converted from percentages of correct answers to scale scores, to allow for
comparisons.
A rather remarkable feature of national trends for reading, and to a lesser
degree for mathematics and science, is the relative stability of average achieve-
ment for each subject at each age over three decades. Between 1970 and 1999,
the demographics of the nation's school children changed considerably, the
country's economy shifted both upwards and downwards, the family status of
children's homes was far from stable, and widespread efforts were undertaken to
achieve "school reform." Net effects on national achievement scores were minimal,
although both mathematics and science achievement show some upward trend
from 1986 to 1999.
Reading
Mathematics
Mathematics achievement scores at all three ages were essentially the same for the assessments of 1973, 1978, 1982, and 1986. They improved from that level in 1999 by 10 scale points for 9-year-olds, by 8 points for 13-year-olds, and by 6 points for 17-year-olds. These are modest changes, but the consistency of the gains at each age from 1986 to 1999 does support the conclusion that average mathematics performance increased from the late 1980s to 1999.
Science
Science achievement scores declined at all three ages between 1970 and 1982,
increased somewhat between 1982 and 1992, and changed little from 1992 to
1999. Average 1999 scores were higher than average 1982 scores by 8 points at
age 9, by 6 points at age 13, and by 12 points at age 17 (roughly the same number
of points that they had decreased from 1970 to 1982).
Campbell et al. (2000) present trends over three decades not only for the nation
as a whole, but for subgroups defined by race/ethnicity, gender, level of parents'
education, and type of school (public or private).
Trends by Race/Ethnicity
As first reported by Jones (1984), NAEP results for black students in reading,
mathematics, and science showed increased achievement relative to whites between
the early 1970s and the early 1980s. From the late 1980s to 1999, however, average scores for blacks have shown no improvement, remaining stable even for mathematics while average mathematics scores increased for whites. For Hispanic
students at each age and subject area, average scores are slightly higher than
those for black students (typically by 6 or 7 scale points), but the trends for
Hispanics and blacks are quite similar. Public interest in the persistent gap
between average scores for black and white children heightened after the
appearance of an influential book edited by Christopher Jencks and Meredith
Phillips (1998). Increasing media attention has been directed to the persistence
of the average white-black gap in NAEP scale scores that in 1999 remained
about 30 scale points for reading and mathematics (approximately one within-
grade standard deviation) and about 40 points for science.
Trends by Gender
Trends for reading, mathematics, and science over three decades of assessments
have remained similar for females and males at all three assessment ages. At each
age, females have shown an average advantage in reading of roughly 10 scale
points, while males display a somewhat smaller average advantage in mathe-
matics (about 3 points) and in science (about 3 points at age 9, 6 points at age
13, and 9 points at age 17).
Trends by Parental Education
In all assessments, respondents are asked to provide the highest level of educa-
tion of their parents, and assessment results are reported for the highest level for
either parent by four categories, "less than high school graduation," "high school
graduation," "some education beyond high school," and "I don't know." For
9-year-olds, about one-third of students report "I don't know." At ages 13 and 17,
with much smaller proportions of the "I don't know" response, the distributions
of reported levels of parental education have changed dramatically from the
early 1970s to 1999: for "less than high school graduation," from about 20 percent
to 7 percent; for "high school graduation," from 33 percent to 25 percent; and
for "some education beyond high school," from 45 percent to 65 percent. While
the distributions of the students' reports of their parents' education shifted in
this way, the trends over time in achievement scores for reading, mathematics,
and science are quite similar for students in each category, with a single excep-
tion. At age 17, average reading scores declined nearly 10 scale points between
1990 and 1999 for students who reported that "high school graduation" was the
highest level of a parent's education. This might be related to decreasing rates of
high school dropouts in the decade of the 1990s, so that more poor readers
(among students whose parents did not attend college) were assessed in 1999
because they remained in high school rather than dropping out.
Trends by Type of School
Quite consistently from the late 1970s until 1999, NAEP results for the three age
groups and three subject areas show that average scores for students enrolled in
nonpublic schools exceed those for students in public schools, typically by between
10 and 20 scale points. Over that period, the percentage of students enrolled in
nonpublic schools increased from about 6 percent to 10 percent of the total at
age 17, from about 9 percent to 12 percent at age 13, and remained steady at
about 12 percent at age 9. Achievement trends over time are very similar for
nonpublic and public school students at each age and for each subject area.
From 1990 forward, NAEP has reported mean scale scores for public-school students by state for the states that chose to participate in any assessment year, typically about 40 of the 50 states. From 1992 forward, state reports also show the
percentages of students whose estimated scores place them as below basic, basic,
proficient, or advanced. State-level assessments were conducted for mathematics
in 1990, 1992, 1996, and 2000; for reading in 1992, 1994, and 1998; for science in
1996 and 2000; and for writing in 1998 (although in some cases prior to 2000,
assessments were conducted at grade 4 or 8, but not both). State-by-state results
are provided to state education agencies prior to their public release at a
national press conference, which then typically is followed immediately by press
conferences in the states that participated in an assessment. State results are
presented for all public-school students as well as for subgroups (male/female, ethnic origin, parental education level, and type of community).
More and more states during the 1990s established their own state testing
programs in reading and mathematics, and sometimes in science and writing, as
well. Increasingly, then, media attention within states has been directed not only
to results of annual state testing, often reported district by district or even school
by school, but also to the NAEP reports of state-by-state results and especially to
a state's changes over time on NAEP assessments, as well as to each state's
ranking among all participant states.
The most recent such report is that of August 2001, a state-by-state report of mathematics achievement for the year-2000 assessment, which also shows results
for prior years (National Center for Education Statistics, 2001). At grade 4, the
mean national scale score increased between 1992 and 2000 by 7 scale points.
However, some states display essentially no change, while others show higher
average scores by as much as 19 points. At grade 8, the average national change
from 1990 to 2000 is 12 points. All states that participated in both years also show
improved average scores at grade 8, ranging from 2 scale points to 30 points (an
increase approximately equal to the value of the typical within-state standard
deviation for 8th-grade scale scores). In some states, NAEP reports have become
"front-page news" and are widely cited in radio and television coverage. Clearly,
the NAEP reports are being closely attended to by state education officials.
As outlined above, NAEP has succeeded in meeting one of its original objectives
by presenting on a regular schedule trustworthy information about changes over
time in achievement of U.S. students at ages 9, 13, and 17, especially in the areas
of mathematics, reading, and science. For the 1990s, similar trend data have been
reported state by state for states (the majority) that chose to participate in the
state assessments of NAEP. For both national and state trends, results have been
presented also by demographic subgroups. This is a remarkable record for a
project that failed to consistently attract either substantial funding or high levels
of political support.
Since the 1980s, most states and school districts within states have established
educational standards for the public schools within their jurisdiction. To some
extent, the NAEP reports have served to stimulate those efforts; especially since 1990, when state-by-state results began to be reported, media coverage of NAEP reports has increased sharply. Most states have now devised their own
within-state testing programs, and many have used NAEP as a model, typically
after having personnel trained and supported by NAEP staff members. NAEP
procedures also greatly influenced the design of international assessments such
as TIMSS, the Third International Mathematics and Science Study, and national assessments in other countries.
Some important differences, however, distinguish the purposes and the proce-
dures of NAEP from those of the states. The NAEP design focuses on levels of
achievement for identifiable groups of students and on achievement trends for
those groups over time; it does not provide results for individual students or for
schools or classes within schools. Many states have developed high-stakes testing
programs that are designed explicitly to decide for individual students whether
they are to be promoted to the next grade in school or to be graduated from high
school. In some states, student test-score results are also used as a criterion for
teacher salary bonuses. In a discussion of differences between NAEP and high-
stakes state testing programs, Jones (2001) noted that the latter are beginning to
resemble the failed programs of the late 19th century in England and Wales
when a system of "payment by results" was employed in a misguided effort to
improve public education.
The U.S. Congress mandated an NAS committee study of high-stakes testing
practices. Among the recommendations from that committee are several that are
critical of basing important educational decisions about individual students on
test scores.
The extent to which these and related recommendations are being seriously
considered by state and local school authorities is not yet known. What is
apparent from media reports is that many parents and teachers are echoing these
concerns, in some cases effectively, and that some states and localities are
reconsidering earlier actions which had mandated that decisions about student
progress be based primarily upon test scores.
As noted earlier, the Congressionally mandated evaluation of NAEP
conducted by NAS recommended a replacement of NAGB's process for setting
achievement levels and included several additional recommendations.
ENDNOTE
1 A far more extensive treatment of the evolution of NAEP soon will be available in an edited book
(Jones & Olkin, in press). I am grateful to all of the authors of chapters in that book for providing
information that helped me frame this Handbook chapter, and especially to Mary Lyn Bourque,
Archie Lapointe, Ina Mullis, and Ramsay Selden. I depended on other sources as well, notably the
doctoral dissertations of Fitzharris (1993) and Hazlett (1973) and the history of NAGB by Vinovskis (1998). I thank Nada Ballator for her constructive editorial suggestions. My most profound acknowledgements are to Ralph Tyler and John Tukey, who got me into the business of educational assessment in 1965, from which I've never totally extricated myself.
REFERENCES
Abelson, R.P., Cronbach, L.J., Jones, L.V., Tukey, J.W., & Tyler, R.W. (1970). National Assessment of
Educational Progress, Report 1, 1969-70 Science: National results and illustrations of group
comparisons. Denver: Education Commission of the States.
Abelson, R.P., Coffman, W.E., Jones, L.V., Mosteller, F., & Tukey, J.W. (1971). National Assessment
of Educational Progress, Report 4, 1969-70 Science: Group results for sex, region, and size of
community. Denver: Education Commission of the States.
National Assessment in the United States 903
Alexander, L., & James, T. (1987). The nation's report card. Stanford, CA: National Academy of
Education.
Campbell, J.R., Hombo, C.M., & Mazzeo, J. (2000). NAEP 1999 trends in academic progress:
Three decades of student performance. NCES 2000-469. Washington, DC: U.S. Department of
Education.
Cohen, J., & Jiang, T. (2001). Direct estimation of latent distributions for large-scale assessments with
application to the National Assessment of Educational Progress (NAEP). Washington, DC:
American Institutes for Research.
Feuer, M.J., Holland, P.W., Green, B.F., Bertenthal, M.W., & Hemphill, F.C. (Eds.) (1999).
Uncommon measures: Equivalence and linkage among educational tests. Washington, DC: National
Academy Press.
Fitzharris, L.H. (1993). An historical review of the National Assessment of Educational Progress from
1963 to 1991. Unpublished doctoral dissertation, University of South Carolina.
Glaser, R. (1987). Commentary by the National Academy of Education. In L. Alexander & T. James, The nation's report card: Improving the assessment of student achievement (pp. 43-61).
Stanford, CA: National Academy of Education.
Greenbaum, W. (1977). Measuring educational progress. New York: McGraw-Hill.
Hazlett, J.A. (1973). A history of the National Assessment of Educational Progress, 1963-1973.
Unpublished doctoral dissertation, University of Kansas.
Heubert, J.P., & Hauser, R.M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and
graduation. Washington, DC: National Academy Press.
Jencks, c., & Phillips, M. (1998). The black-white test score gap. Washington, DC: Brookings
Institution Press.
Jones, L.V. (1984). White-black achievement differences: The narrowing gap. American Psychologist, 39, 1207-1213.
Jones, L.V. (2001). Assessing achievement versus high-stakes testing: A crucial contrast. Educational
Assessment, 7, 21-28.
Jones, L.Y., & Oikin, I. (Eds.). The nation's report card: Evolution and perspectives. Bloomington,
IN: Phi Delta Kappa International.
Koretz, D.M., Bertenthal, M.W., & Green, B.F. (Eds.) (1999). Embedding questions: The pursuit of a
common measure in an uncommon test. Washington, DC: National Academy Press.
National Academy of Education. (1992). Assessing student achievement in the states. Stanford, CA: Author.
National Academy of Education. (1993). Setting performance standards for student achievement.
Stanford, CA: Author.
National Assessment of Educational Progress. (2002). NAEP data. Retrieved on April 11, 2002 from
http://nces.ed.gov/nationsreportcard/naepdata/.
National Assessment Governing Board. (1989). Briefing book, January 27-29. Washington, DC: Author.
National Assessment Governing Board. (2001). Testimony of Mark D. Musick, Chairman, National
Assessment Governing Board. Retrieved August 28, 2001, from http://www.nagb.org/naep/musick_testimony.html.
National Center for Education Statistics. (2001). The nation's report card: Mathematics, 2000.
Washington, DC: Author.
National Commission on Excellence in Education. (1983). A nation at risk: The imperative for
educational reform. Washington, DC: U.S. Government Printing Office.
Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.) (2001). Knowing what students know: The science
and design of educational assessment. Washington, DC: National Academy Press.
Pellegrino, J.W., Jones, L.R., & Mitchell, K.J. (Eds.). (1999). Grading the nation's report card:
Evaluating NAEP and transforming the assessment of educational progress. Washington, DC:
National Academy Press.
Stufflebeam, D.L., Jaeger, R.M., & Scriven, M. (1991). Summative evaluation of the National
Assessment Governing Board's inaugural 1990-91 effort to set achievement levels on the National
Assessment of Educational Progress. Washington, DC: National Assessment Governing Board.
Tyler, R.W. (1942). Appraising and recording student progress. In G.F. Madaus & D.L. Stufflebeam (Eds.) (1989), Educational evaluation: Classic works of Ralph W. Tyler (pp. 97-196). Boston: Kluwer Academic.
U.S. General Accounting Office. (1993). Educational achievement standards: NAGB's approach yields misleading interpretations. Report No. GAO/PEMD-93-12. Washington, DC: Author.
Vinovskis, M.A. (1998). Overseeing the nation's report card: The creation and evolution of the National Assessment Governing Board (NAGB). Washington, DC: National Assessment Governing Board.
Wirtz, w., & Lapointe, A. (1982). Measuring the quality of education: A report on assessing educational
progress. Washington, DC: Wirtz & Lapointe.
Wise, L.L., Noeth, R.J., & Koenig, J.A. (Eds.) (1999). Evaluation of the voluntary national tests: Year 2. Final report. Washington, DC: National Academy Press.
40
Assessment of the National Curriculum in England
HARRY TORRANCE
University of Sussex, Institute of Education, Brighton, UK
Outlining the key features of the national assessment system in England might
seem a fairly straightforward task, but the story is a complicated one, and the
parameters of the system are constantly changing. The system has its origins in a
highly devolved and entirely voluntaristic system of school examinations; has
been driven by differing political priorities over a period of 15 to 20 years; and is
still evolving as differing priorities interact with implementation difficulties.
Moreover, although national assessment was conceived of in the context of a
United Kingdom-wide government for implementation across the UK, and at
various points in this chapter I shall refer to the UK as if it were a single adminis-
trative unit, implementation has differed across England, Scotland, Wales, and
Northern Ireland. Scotland and Northern Ireland have always had separate
systems of education from England and Wales and, recently, constitutional devo-
lution has placed even more decision making power with the regional parliament
in Scotland and regional assemblies in Wales and Northern Ireland. Thus,
although the broad thrust of policy is similar across the UK, implementation has
differed across the four countries, and hence the detail of the chapter refers only
to England. These variations derive at least in part from variation in the strength
of the ideological debate underlying national assessment and, although the
development of policy has been broadly similar across the UK, it has been imple-
mented in the regions with rather less visceral hatred of the teaching profession
than has been manifested by central government in London. Indeed, one of the
main tasks in a review such as this is to try to identify what principles of national
assessment might be appropriately discussed as policy options elsewhere, shorn
of the particular ideological battles which pervaded invention and implemen-
tation in England.
The chapter will sketch in a little of the background to the development of
national assessment, before identifying its key features and reviewing its costs
and benefits. It will outline the debate about evaluation and accountability which
led to successive efforts in the 1970s and 1980s to implement some sort of
national monitoring and national examination system; identify the key features
of the national curriculum and national assessment introduced in 1988; and discuss the implementation problems and successive modifications which followed.
Although only recently acted upon at central policy level, concern about evalua-
tion and accountability in the UK education system has a long history dating
back at least 30 years. While the issues of monitoring national standards and
rendering schools more accountable go back to the 1970s, political decisions only
coalesced around the creation of a national system relatively recently. Thus the
issue of timescale of decision-making and implementation partly revolves around
logistics - around the scale and scope of any system - but also must be a function
of local political circumstances. While concerns about accountability surfaced in
the early 1970s, it was not considered either feasible or desirable to address them
through central government intervention in curriculum and testing until the late
1980s. As recently as the mid-1980s, the UK had one of the most decentralized
and voluntaristic examination systems, with several examination boards in England
and Wales, along with national boards for Scotland and Northern Ireland, setting
individual examination papers in individual subjects, to be taken by individual
students choosing to sit as many or as few papers as they wished, generally, but
by no means exclusively, at ages 16 (the minimum school-leaving age) and 18
(the usual maximum school-leaving age prior to university entrance). Now, the
UK, and particularly England, has one of the most centrally controlled and
extensively tested education systems in the world. This dramatic turnaround has
to be understood in the context of the peculiarly ad hoc development of the
education system during the twentieth century, so ad hoc, in fact, that it might be
argued that the UK system, until very recently, was not a system at all, except in
the sense of an administrative entity. It was this lack of any overall plan or direc-
tion for education, with its concomitant lack of data on overall performance
when politicians faced criticisms about standards, or levers by which government
might make a difference, which fuelled the continuing search for mechanisms of
accountability and control.
This voluntaristic situation had come about largely because of the enduring
influence of Britain's social class system. Up until the 1970s, a high-status, academically oriented education had been largely confined to children who attended private fee-paying schools or who passed a selective examination (the 11+) to attend an academic "grammar school" (about 20 percent of the cohort). In turn,
a minority of this minority proceeded to university entrance via the secondary
school examination system. The examinations themselves had their direct
antecedents in the university entrance requirements of the nineteenth century,
with the Universities of London (1838), Oxford (1857), and Cambridge (1858)
setting up local examination boards to conduct matriculation (entrance)
Assessment of the National Curriculum in England 907
examinations to be taken by external (usually school-based) candidates (see
Kingdon and Stobart [1988] for a fuller account).1
In effect, a series of ad hoc extensions of educational provision (especially
after the second world war) created a very piecemeal educational system in the
UK. An academic education was thought appropriate for a small elite likely to
progress into social and economic leadership roles, and a non-academic
vocationally oriented education appropriate for the rest. Criticisms of the lack of
opportunity provided for the majority resulted in the gradual abolition of the
11+ and the creation of comprehensive secondary schools throughout the late
1960s and 1970s (prompted by the election of a left-of-centre Labour govern-
ment in 1964), but without an accompanying systematic rethink of curricular
provision, accreditation and qualifications, or overall evaluation. These activities
were pursued to some degree by individual schools, local education authorities
(LEAs), examination boards, and indeed specially constituted government bodies
(e.g., the Schools Council, which funded curriculum development projects). But
there was no overarching monitoring or evaluation of the system as a whole,
except for periodic individual inspections of schools by Her Majesty's Inspectors
of Schools (HMIs). Furthermore, despite increasingly comprehensive adminis-
trative provision, the descendants of the university matriculation examinations,
Ordinary (O) level General Certificate of Education (GCE), continued to be taken
by around 20 percent of the secondary school population deemed able enough
to aspire to higher education, while a parallel system of single-subject Certificate
of Secondary Education (CSE) was developed for the next 40 percent of the
ability range, with only the highest grade in CSE (grade 1) being considered
equivalent to the lowest pass grade in GCE O-level (grade C). Thus, very little
coherent policy discussion took place before the late 1970s with respect to what
sort of overall curricular provision might be appropriate for a truly comprehen-
sive system of education, far less how this might be assessed at the level of the
individual student or evaluated at national level.
In parallel with these developments, the "terms of trade" were turning deci-
sively against Britain's old primary and secondary manufacturing industries (coal
mining, steel, shipbuilding, etc.) especially after the so-called "oil crisis" of 1974
when prices were raised significantly. Unskilled jobs began to disappear rapidly
and suddenly unemployment, especially youth unemployment, became a major
social and political concern. The rise in youth unemployment coincided with the
growth of comprehensive education; correlation was interpreted as causation by
politicians, and low educational standards were identified as a major part of the
problem. Young people needed to leave the compulsory education system with
higher (possibly different) levels of achievement if they were to gain more skilled
employment. Politicians, however, had no hard evidence about overall educa-
tional standards - high, low, or indifferent. Significant comparability studies had
been carried out over several years by the Schools Council and the National
Foundation for Educational Research (NFER), investigating in particular the
comparability of GCE and CSE grades, but these had largely demonstrated what
a difficult business conducting such studies was, and it became increasingly
908 Torrance
apparent that different students took different combinations of examinations, set
by different boards in different localities on different syllabuses, while many took
no examinations at all, and, of course, even this level of disparate data did not
exist for the outcomes of primary schooling.
The first national attempt to address the issue of national standards came with
the setting up of the Assessment of Performance Unit (APU) in 1974, again by a Labour
government, acting under political pressure to demonstrate that increasing the
number of comprehensive schools did not compromise academic standards.
Contracts for the development and administration of tests and analysis of results
were made with university departments and other agencies such as the NFER.
Tests were to be conducted in what were considered to be the "core" subjects of
mathematics, English, and science, along with others, subsequently, such as
modern languages. They were to be administered to a 2 percent sample of the
school population at ages 11 and 15, to find out, essentially, what children of
these ages could do in these subjects, with the results being monitored over time
to identify a rise or fall in standards.
The APU was met with considerable scepticism and even hostility from many in education at the time, with all the issues of curriculum control and educational values with which we are now so familiar being raised (see MacDonald, 1979).
Who was to construct the tests? How would appropriate content be selected? If
the tests became widely disseminated, might they not effectively begin to define
the curriculum? Are not many of the important outcomes of education simply not susceptible to testing? Since test results are in any case highly correlated with social
class, how would this be taken into account? In turn, interest was stimulated in
exploring other avenues in school accountability, especially through school self-
evaluation and the provision of fuller and more rounded reports to parents and
other interested parties (see Becher, Eraut, & Knight, 1981; Elliott, Bridges,
Ebbutt, Gibson, & Nias, 1981). The APU, however, turned out to be relatively
benign in its influence, perhaps partly because the government of the day recog-
nized the validity of many of these issues and partly because it was still not
considered appropriate to take a more interventionist stance. Indeed a govern-
ment consultative document published in 1977, while arguing that "a coherent
and soundly-based means of assessment for the educational system" was neces-
sary, explicitly rejected the production of "league tables" since "test results in
isolation can be seriously misleading" (DES, 1977, p. 17). Furthermore:
It has been suggested that individual pupils should at certain ages take
external "tests of basic literacy and numeracy," the implication being that
those tests should be of national character and universally applied. The
Secretaries of State [i.e., the government] reject this view ... the
Assessment of the National Curriculum in England 909
temptation for schools to coach for such tests would risk distorting the
curriculum and possibly lowering rather than raising average standards.
(DES, 1977, p. 18)
In essence, the Labour governments of the 1970s tried working with the teaching
profession to develop evaluation and accountability procedures to meet political
pressures over educational standards. The Conservative governments of the
1980s started out trying to do the same, particularly manifested in the creation
of a single system of secondary school examinations - the General Certificate of
Secondary Education (GCSE) - in 1986. This amalgamation of GCE and CSE
exams at 16+ (school leaving age) had first been recommended by a government
committee under Labour in 1978 and can be seen as an obvious development
towards a comprehensive system of educational provision. However, the fact that
it was finally pushed through by the Conservatives also alerts us to the fact that
it allowed central government to take a far more overt role in determining the
structure and content of a national system, and provide an ostensibly similar and
comparable measure of output, across all schools and candidates, through which
an educational market place could be developed. Interestingly, this tension,
between central government intervention to restructure and control the system
in order to raise standards ("nationalizing" it, as some right wing critics claimed),
and the pursuit of measures to facilitate parental choice of schools in order to
raise standards by market competition between schools, has remained within the
UK and is likely to pervade any debate about the efficacy of a national system.
We will return to these issues later. For the moment it is important to note that
many educational ideas and developments also pervaded the debate. The
Conservatives first acknowledged the potential of the market place and passed
legislation in 1981 making it mandatory for secondary schools to publish their
examination results. However, the driving force behind change in the early to
mid-1980s was concern about clarity in educational content and outcomes.
Educationists were increasingly arguing that clarity of curriculum goals and the
creation of a more flexible examining system would benefit students, especially
low achievers, by providing clearer curriculum pathways and allowing the accu-
mulation of coursework marks towards final grades (Hargreaves, 1984; Murphy
& Torrance, 1988). Considerable developments of this sort were taking place
within CSE and it was argued they should be disseminated across the system as
a whole and removed from association with a lower status examination (and hence
association with lower standards). Industrialists and politicians were similarly
interested in the outcomes of education being more clearly described, recorded
and communicated. What did a grade 1 CSE or a grade C GCE O-level actually
mean? What did secondary school leavers know and what could they actually do?
The amalgamation of CSE and GCE meant both agendas could be pursued, and
a key defining aspect of the new examination system, proposed with great enthu-
siasm by the then Secretary of State for Education Sir Keith Joseph, was that it
should be based on the principle of reporting positively what each candidate
knew, understood, and was able to do, and that syllabuses should be developed in
relation to "national criteria" to be developed for each of (initially 20) subjects.
Essentially what was proposed was a move from a fundamentally norm-referenced
rank-ordered examination system towards one which had at least an aspiration
towards criterion-referencing. The technical difficulties of designing and opera-
tionalizing such a system were legion, however (Murphy, 1986; Orr & Nuttall,
1983). A lesson of the English experience seems to be that it just is not possible
to design a criterion-referenced system at any level of detail. The more clarity
one seeks, the more unmanageable the system becomes, as objectives "multiply
like vermin" (Brown, 1988).
It is important to note that the development of GCSE provided a key opportu-
nity for government to take a direct role in determining the curriculum of
secondary schools and provided a model of potential intervention for the future
- a model which was almost instantly seized upon. Sir Keith Joseph gave an
influential speech summarizing the essential elements of the new examination in
1984; examination courses were started in 1986; and first examinations were
conducted in 1988, by which time legislation creating a national curriculum and
assessment system for all schools, primary and secondary, was already on the
statute book (Butterfield, 1995; Daugherty, 1995).
THE NATIONAL CURRICULUM AND NATIONAL ASSESSMENT
The key elements of the new system were that: the National Curriculum would
apply to the ages of compulsory schooling (5-16 years); all "maintained" (i.e.,
government-funded) schools (but not private fee-charging schools) would have
to follow its prescriptions; the curriculum would be organized in four "key stages":
KS1, ages 5-7 years; KS2, 8-11; KS3, 12-14; and KS4, 15-16; the curriculum
would comprise nine "foundation" subjects (mathematics, English, science, technology, history, geography, art, music, and a modern foreign language from KS3 (age 11+)), plus Welsh in Wales; and be set out, subject by subject, in terms of "attainment targets," defined in the 1988 Education Act as the "knowledge, skills and understanding which pupils ... are expected to have by the end of each key stage"; attainment would be assessed (including by "nationally prescribed tests") and publicly reported at the end of each key stage, i.e., at ages 7, 11, 14, and 16.
Thus the bare bones of a criterion-referenced system were laid out, combining
the defining of objectives, subject-by-subject, with assessment to measure their
attainment at individual student, school, and system level. Translating this into
practice was another matter entirely, however, and, in many respects, implemen-
tation and concomitant modification have been continuous processes ever since.
A further key "staging post" in the articulation of the new system, though one
which turned out to create as many, if not more, problems than it solved, was the
report of the Task Group on Assessment and Testing (TGAT), set up as
legislation was still being debated in Parliament, and which reported in
December 1987. The group accepted as its task that of designing a workable
national system of assessment and attempted to combine the government's
policy intentions with recent thinking on the role, purpose, and conduct of
assessment.
Furthermore, the group argued that subject content should be organized progres-
sively, through 10 "levels" of attainment, broadly corresponding to the eleven years of compulsory schooling from 5-16, with the top levels of 7-10 corresponding to GCSE grades A-F (G being the lowest grade awarded in the new GCSE). Thus 7-year-olds would be expected to achieve in the level range 1-3, 11-year-olds in the range 3-5, and 14-year-olds in the range 4-7.
The group summarized the "Purposes and Principles" of the system as being
formative, diagnostic, summative, and evaluative (TGAT, 1987, para. 23). In so
doing, they accepted that one system, within its constituent parts, could contri-
bute both formatively to student learning and summatively to system evaluation
and accountability, and UK teachers and administrators have been struggling to
ope rationalize this combination ever since. However, the political imperative has
focused almost exclusively on issues of evaluation and accountability and this has
had an inevitable impact on resource allocation and teacher priorities and
strategies for improving scores.
The essential framework of the new system was that curriculum objectives would
be defined in 9 (or in Wales, 10) separate subjects, organized around broad attain-
ment targets (ATs) each of which, in turn, would comprise individual statements
of attainment (SoAs). These ATs and SoAs would also be organized sequentially
into levels, forming a progressive curriculum "ladder" in each subject, and their
attainment assessed and monitored for each individual pupil, with results being
reported publicly at the end of each key stage, at ages 7, 11, 14, and 16. For example,
"attainment target 3" in English was "writing." At level 2, the statements of attain-
mentwere:
As can be seen, the curriculum documents were very detailed, but still largely
baffling.
Producing such "programmes of study" across the curriculum was an enor-
mously ambitious undertaking, which both TGAT (para. 13) and subsequent
commentators (e.g., Daugherty, 1995) noted had not been attempted anywhere
else in the world. Needless to say it has not been implemented in anything like
its original form, with an early modification being concentration on implement-
ing and testing what became the de facto and later the de jure "core curriculum"
of English, mathematics, and science. However the process of "slimming down,"
as it came to be known in the ensuing debate and implementation process, has
been a bruising and debilitating business, for politicians and teachers alike, since
all complaints about complexity and overload which emerged in the first few
years after 1988 were initially dismissed by government as self-interested whining
from a profession finally being forced to put the consumers, rather than the
producers, first.
Two major problems emerged from 1988 to 1993, when a formal review was
instigated. One focused on curriculum content, the other on complexity of the
assessment arrangements. These problems interacted since a major element of
the complexity of the assessment arrangements was the sheer number of attain-
ment targets and individual statements of attainment that teachers and test
producers had to contend with. However, they have tended to be dealt with
separately, though in parallel, as first the scope of the assessment arrangements,
and second the content of the curriculum, was reduced.
These problems were exacerbated by a lack of coherent planning at the center,
which saw the government replace the Secondary Examinations Council (now
outdated since the legislation applied to all maintained schools, not just
secondary schools) with two new "arm's length" statutory authorities to oversee
arrangements for developing the curriculum (the National Curriculum Council
[NCC]) and assessment (School Examinations and Assessment Council [SEAC]).
In turn, subject groups were set up to write the programs of study without
coordination or overlap of membership. A major consequence was that each
subject group, working under intense time pressure, included far too much in
their program of study, as interest groups within the subject communities argued
for inclusion of their particular pet themes and topics. The total curriculum thus
became completely unmanageable - a "mile wide and an inch deep" as one
common cliché soon had it.
At the same time, consortia of test developers were invited to bid for short-term
contracts to develop national tests in subjects at key stages. Different key stages
were focused on for the development of the testing arrangements at different
times. Key Stage 1 was focused on first, so that a new cohort of pupils could
progress through the whole new system, key stage by key stage; KS3 was focused
on next, and KS2 after that, since it would take new 5-year-old entrants to "Year
1" six years to reach the end of KS2 (at age 11). Thus subject groups were
working to the broad brief outlined by the TGAT report (attainment targets
organized into 10 levels of attainment), but had no knowledge of the detailed
testing arrangements which might be put in place. And test consortia might win
a contract for English but not for maths, and/or for a specific subject at, say, KS1,
but not KS3, or vice versa. Also there was no guarantee that a contract would be
renewed (though those prepared to do the government's bidding, which essentially
meant designing simpler and simpler tests, year on year, tended to prosper).
How much of this fragmentation came about because harassed civil servants
simply dealt with whatever pressing decision was on their desk at a particular
time, and how much was a deliberate "divide and rule" strategy by the govern-
ment, is a moot point. But it certainly did not provide for extensive institutional
learning and accumulation of experience across the system. Thus, another lesson
from the UK is that taking the political decision to introduce a national curricu-
lum and assessment system is one thing, planning to do it effectively is quite
another.
Teachers were the main group who paid the price for lack of coordination; and
first in the firing line were teachers of 5- to 7-year-olds (KS1). They had to begin
teaching the new programs of study from September 1988, as they came hot off
the press; and they had to take part in sample pilot KS1 testing in 1990, with the
first full national "run" in 1991. The problems of the sheer speed and scale of the
operation were compounded, however, by the attempt of the TGAT report to
build on and develop existing good practice in assessment - to attempt to "square
the circle" of combining the formative with the summative and evaluative. TGAT
argued that "assessment lies at the heart of ... promoting children's learning"
(para. 3). First and foremost therefore, assessment should be conducted routinely,
diagnostically, and formatively by teachers in classrooms to identify student
strengths, weaknesses, and progress in order to assist learning. The results of this
process, deriving from informal observations and more formal assessment tasks
and tests, could then be summarized and reported as part of the overall national
system. TGAT called this "teacher assessment" (TA). It could also be combined
with formal test results at the end of each key stage to produce an overall sum-
mary grade (level 1, level 2, etc.) for each student in each subject. In addition to
having the merit of contributing to student learning, TA would also address the
issue of the validity and reliability of results by extending the sample of assess-
ment that could be included in the overall national system. Comparability of TA
across teachers, schools, and localities would be addressed by extensive moderation
procedures which would also contribute to teacher professional development as
a whole. Furthermore, however, TGAT argued that the formal external test
element of the system should in turn build on best practice.
All teachers from September 1988 were expected to start keeping some sort of
formal record of their "teacher assessments," though no guidance was issued as
to how this should be done. Furthermore, KS1 teachers and students were to be
experimental subjects for the development of "standard assessment tasks"
(SATs, as they came to be known, rather confusingly for anyone familiar with the
abbreviation more usually standing for American Scholastic Aptitude Tests).
Without guidance and feeling under intense accountability pressure, teachers
started keeping ridiculously detailed TA records, which at their extreme included
keeping every piece of work produced by every pupil, annotated to demonstrate
which Statements of Attainment had been covered by the work and what level
within an overall subject Attainment Target this indicated. Newspapers, prompt-
ed by the teacher unions, soon started to report that at KS1, with only the first
few SoAs in each subject being taught, and only the "core" subjects of maths,
science, and English to be reported on initially, teachers would be dealing with
227 SoAs times c.30 children, totalling 6,810 SoAs per teacher per year (see
Daugherty, 1995, p. 117).
Just as problematic, but even more headline-grabbing, were the pilot and first
full run of KS1 SATs in 1990 and 1991. They were commissioned from test
consortia in maths, English, and science, and were designed to be as "authentic"
as possible for an early years classroom. They involved extended tasks, including
cross-curricular elements which were intended to assess ATs from across core
subjects, usually to be administered to small groups of pupils in order to match
the ordinary routines of early years teaching. They could be administered by
teachers at any time during the first half of the summer term and, in practice,
took about 30 hours of classroom work to complete over a 3- to 6-week period.
Class teachers also had to mark the work and report the overall grades awarded.
In many respects these SATs represented the zenith of developments which had
been started by the APU - a serious attempt to develop authentic standard tasks
to test as wide a range of curricular outcomes as possible, administered under
ordinary classroom conditions which did not create unnecessary anxiety for
pupils. But they certainly created an immense amount of work and anxiety for
teachers, because of the overt accountability setting in which they were being
implemented. On their own, such tasks might well have become an important
resource for teacher development and for further research on the integration of
assessment, teaching, and learning, especially if they had been accompanied by
the moderation which TGAT originally envisaged. Instead, as essentially evalu-
ative instruments, they were perceived as totally unmanageable and produced
results of highly questionable reliability since moderation was never given the
policy attention or resources required. The idea that similar instruments might
be designed for use at other key stages, and in the other six national curriculum
subjects, was soon regarded as wholly unrealistic, by politicians and the teaching
profession alike (Torrance, 1993, 1995).
Various agendas were coming into direct conflict to produce enormous system
overload. Subject specialists feared that if their particular subject, or approach to
a subject, was not included in the statutory orders which defined each National
Curriculum subject, then it would quickly disappear from the school timetable.
Arts and performing arts teachers were particularly exercised by such a possibility
as the government's attention was largely focussed on increasing the proportion
of maths, science, and technology in the National Curriculum, along with English
(especially spelling and grammar). But the possibility also haunted those who
argued that "personal and social education," "citizenship," "health education,"
and the like, should have a major place in the curriculum. In the absence of such
named subjects, English, history, geography, and biology started to expand
exponentially to accommodate these topics. Meanwhile, although the political
imperative included reorienting the curriculum, especially the primary school
curriculum (KS1 and KS2), towards more science and technology, it remained
essentially that of producing straightforward results of comparable national
assessments at specific points in time so that they could be published and com-
pared across schools and over time as they accumulated. However, the agencies
initially charged with developing the national tests saw an opportunity to put the
theory of "authentic assessment" and "measurement-driven instruction" into
practice: include authentic, integrated, demanding tasks in the national assessment
system and you will underpin the development of quality teaching and learning.
The result of this clash of competing agendas was that teachers complained
about curriculum and assessment overload, while many in the Conservative party
saw a conspiracy by the "educational establishment" to undermine the govern-
ment's basic intention of producing simple tests. "Slimming down" the SATs
received the government's initial attention; slimming down curriculum content
came next. Following the KS1 piloting, government ministers stated in corres-
pondence with SEAC that they wanted "terminal written examinations" for KS3
(Daugherty, 1995, p. 52) and insisted that similar simplification took place at
KS1, albeit tempered by the need to allow early readers and writers to respond
appropriately to test stimuli. Thus began the process of moving from Standard
Assessment Tasks (SATs) to Standard Tests (STs) to National Tests (NTs), as they
are now known. En route, the government insisted that all GCSE examinations
should include no more than 20 percent coursework assessment and that a propor-
tion of marks should be allocated in all GCSE subjects for spelling, punctuation,
and grammar.
Curriculum content was also recognized as too unwieldy, and a review was set up
under Ron Dearing (a former Chairman of the Post Office), who recommended
918 Torrance
little more than the obvious, but did so after extensive consultations with teacher
groups, local authorities, and examination boards, and so appeared to be taking
proper account of relevant criticisms and complaints. His report recommended
that each National Curriculum subject should be organized in terms of a "core +
options," with the total number of ATs and SoAs being significantly reduced.
Furthermore, at least 20 percent of the overall timetable should be "free ... for
use at the discretion of the school" (Dearing, 1993, p. 7). National testing should
be limited to English, maths, and science, which would thus form a core curricu-
lum. GCSE should be retained as is, rather than, as still might have been the
case, being completely overtaken and absorbed by the 10-level national system.
Effectively therefore, the National Curriculum would now run from ages 5 to 14
years, being assessed and reported through levels 1-8, with the public exami-
nation system at KS4 being end-on and, though linked in terms of curriculum
progression, no longer fully integrated into a single national system. In particular,
subject grading at GCSE would remain reported as grades A-G. Additionally,
however, as acknowledgement of arguments about GCSE not stretching the
"brightest" and possibly lowering overall standards, Dearing recommended the
award of a higher A* (A-star) grade to recognize very high achievement.
While these recommendations were widely welcomed by teachers as helping
to reduce their workload, all the essential features of government control of the
curriculum and publication of assessment results were retained. The political
imperative to produce as simple a set of evaluation and accountability measures
as possible had been accomplished. At each successive stage of struggle over
curriculum content and scope of the assessment arrangements, "slimming down"
has effectively meant simplification of content and procedure. The national
curriculum now comprises nine subjects for 5-14-year-olds,3 organized more as
core syllabi plus recommendations for "breadth of study" rather than as detailed
lists of objectives (DfEE, 1999). Meanwhile, national testing has become a very
narrow exercise indeed: the testing, by externally produced and externally marked
paper and pencil tests, of a narrow range of achievements, measurable by such
an approach, in three core subjects, English, maths, and science. Subsequently
national testing has been further restricted at KS1 to English and maths.
The process of slimming down and focussing almost exclusively on the account-
ability purposes of national assessment has been continued by the most recent
Labour governments elected in 1997 and 2001. The emphasis of the new govern-
ment, however, has been more towards managerial use of the accountability
data, rather than simple publication and reliance on market forces to raise
standards. Before reviewing this in more detail and bringing the story up to date,
so to speak, it is important to recognize the way in which the interaction of policy
has reinforced the impact of national testing. Testing and publication of results
have never been considered to be an adequate way of raising standards on their
own. Giving parents more rights to choose schools, rather than leaving allocation
to LEAs, and ensuring that funding followed enrolment, was an integral part of
the package. However, popular oversubscribed schools soon began selecting
parents/pupils, rather than the other way round, so this was not wholly effective
in itself, though it certainly ensured constant pressure on schools to improve
results and to "sell" themselves (Gewirtz, Ball, & Bowe, 1995).
Testing was accompanied by the introduction of a National Curriculum, with
government approved content (written by co-opted subject experts) in a small
range of subjects, which placed particular emphasis on extending the time spent
in schools on maths, science, English and, to a lesser extent, technology. Other
subjects can still be studied at KS4 and beyond (for GCSE and A-level), but
these attract a relatively small number of students4 and, in any case, still have to
meet the general GCSE national criteria for approval by the Secretary of State.
A new school inspection regime was instituted to oversee the implementation of
the national curriculum and testing arrangements. Hitherto, visits by HMIs to
schools, though rigorous, tended to be seen as experienced professionals offering
their informed, comparative judgment and advice to fellow professionals. The
new regime was based on teams of inspectors, including "lay members" (non-
educationists), who were charged with making sure that the national curriculum
was being taught, and used uniform, centrally produced observation instruments
to measure and report on quality of teaching. Test results would be scrutinized to
see how they compared with local and national averages. Reports were (and are)
published, and annual summaries by the Chief Inspector of Schools led to head-
lines such as "15,000 bad teachers should be sacked" (The Times, November 3,
1995). This added hugely to the accountability pressures which teachers felt in
the early to mid-1990s, and still does.5
The "New Labour" government has continued this barrage of additional policy
implementation to reinforce the accountability "basics" of testing and publishing
results. Even more specific content, along with recommended teaching methods,
has been introduced into the core subjects of English and maths in conjunction
with first a National Literacy Strategy (NLS) and then a National Numeracy
Strategy (NNS) at KS1 and KS2. Subsequently, the NLS was extended to KS3.
This has been done in tandem with introducing "national targets" for attainment
in literacy and numeracy at KS2, which in effect are political objectives which the
government has set itself, but then pressurized schools to meet. The targets are
that, by 2002, 80 percent of 11-year-olds will reach level 4 or above in English, and
75 percent of 11-year-olds will reach level 4 or above in mathematics (originally
deemed by TGAT to be the level which should be attainable by the "average"
11-year-old [TGAT, 1987, paras. 104 & 108]). In 2000, 75 percent achieved level 4
in English and 72 percent level 4 in maths (DfEE, 2001).
Even more importantly, however, these targets are scrutinized, and target-
setting is used as a management tool, at local authority and school level, so that,
for example, all schools, but especially those well below the national average (for
whatever combination of social and economic reasons), are exhorted to improve
year-on-year. This target-setting is now an integral element of government micro-
management of the system. Government (through the Office for Standards in
Education [OfSTED] which also manages inspections) negotiates/imposes targets
on LEAs and they, in turn, negotiate/impose them on individual schools. Target-
setting is based on individual Performance and Assessment Reports (PANDAs)
produced by OfSTED for each school, in which individual school performance is
compared with national and local statistics. Meeting targets is also now attended
to in inspections. Furthermore, the government has introduced "performance-
related pay" for teachers by which they can cross a threshold to a higher grade.
A significant element of the case they must make is the achievement of targets
and the raising of "standards" in terms of Key Stage test and GCSE results. Thus
evaluation, understood as producing and using test results to further raise test
scores, is now an integral part of system management. Accountability is inter-
preted almost exclusively in terms of published test scores, but "driving up
standards" as the government now puts it, is seen as something which can be
accomplished by "performance management" rather than the simple operation
of market forces.
When GCSE was first announced, the intention of policy was to bring "the level
of attainment of at least 80 to 90 percent of all students up to at least the level
currently associated with the average, as reflected in CSE grade 4" (DES, 1985,
quoted in Kingdon & Stobart, 1988). Now, of course, "CSE grade 4" means very
little in terms of any absolute standard. Similarly, the nature and content of the
examination system has changed significantly over time. But insofar as we can
make any sort of judgments from the official statistics, in terms of broad
equivalences, we can note that successive policy reports have suggested linking
the old GCE O-level "pass" (grade C) to CSE grade 1 and GCSE grade C. Critics,
committed to the notion of a normal distribution curve of ability, simply could
not believe that 40-50 percent (and rising) of the student population could
achieve what traditionally had been the preserve of the social elite. In turn,
governments can always argue that no matter
how high standards are, they need to be higher still, in order to maintain interna-
tional economic competitiveness in a race which demands eternal vigilance and
effort. And certainly, if we assume a reasonable indication of a sound secondary
education is 5 GCSE A*-C grades, including English and maths (considerably
more demanding than the DES policy announcement of the 1980s), then around
50 percent of the school population is still not achieving this. But these argu-
ments are rarely put positively - standards are always said to be too low - and
rarely address the actual evidence of significant progress which successive attempts
at developing accountability measures have produced.
Evidence from around the world suggests that too narrow a concentration on
raising test scores can lead to a narrowing of the curriculum, impoverishment of
student learning experience and the possibility of actually lowering educational
"standards" as more broadly conceived (Linn, 2000; Shepard, 1990), as indeed
was predicted in the government consultative document of 1977 quoted above.
Experience confirms this in the UK. Research on the introduction of the National
Curriculum and National Assessment has suggested that there have been some
benefits, especially in terms of an increase in collegial planning and management
of the primary school curriculum (KS1 and KS2), but also many costs. Many
primary school teachers reported feeling threatened by the changing emphasis
from an arts-based early years curriculum to more of a science-oriented curricu-
lum, but also felt they were becoming more confident over time and thought
there were benefits in more collaborative whole-school planning, rather than
simply being left to "get on with things" as they had always done in their own
individual classrooms (Croll, 1996; Pollard, Broadfoot, Croll, Osborn, & Abbott,
1994). Attention to use of evidence rather than "intuition," when it comes to
making classroom assessments and appraising student progress, has also been
noted as an outcome of national assessment (Gipps, Brown, McCallum, &
McAlister, 1995). Accountability pressures have been severely felt however, and
as the national tests have become more and more restricted in scope and allied
to target setting, so has the curriculum become narrowed: "Whole class teaching
and individual pupil work increased at the expense of group work ... [there was]
a noticeable increase in the time spent on the core subjects ... [and] teachers ...
put time aside for revision and mock tests" (McNess, Triggs, Broadfoot, Osborn,
& Pollard, 2001, pp. 12-13).
Similarly, "in the last two years of primary schooling children encountered a
sharp increase in the use of routine tests and other forms of categoric assess-
ment" and pupils became less keen on teachers looking at and commenting on
their work (McNess et al., 2001; see also Osborn, McNess, Broadfoot, Pollard, &
Triggs, 2000). This latter point, if sustained, will be very counter-productive since
it is precisely by looking at and commenting on pupil work that teachers can
make a positive difference to pupil learning (Black & Wiliam, 1998) rather than
just coaching them to improve test scores.
Coaching also raises issues of instrument corruption, and whether or not
increasing test scores mean very much even in their own terms. The APU approach
was probably a better way of getting a purchase on "standards," but was far too
complex to roll out across the system as a whole, as early experience with the KS1
SATs demonstrated. Interestingly enough, other work on reliability (Wiliam,
2001) suggests that even with test reliability estimated at around 80 percent, up to 19 percent of
pupils would be wrongly graded at KS1, 32 percent at KS2, and 43 percent at KS3
- an extraordinary level of error about which the government seems
unconcerned so long as the overall national figures are increasing.
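To see why an apparently respectable reliability figure translates into so much misgrading, consider a minimal simulation. This is an illustrative sketch, not Wiliam's actual calculation: the four-level scale, the cut-points, and the normality assumptions are all invented for the example.

```python
import numpy as np

# Illustrative only: how much level misclassification does a test
# reliability of about 0.80 imply? The cut-points and the four-level
# scale are hypothetical round numbers, not real Key Stage thresholds.
rng = np.random.default_rng(0)
n = 100_000
reliability = 0.80

# Classical test theory: observed = true + error, with
# reliability = var(true) / var(observed); total variance scaled to 1.
true = rng.normal(0.0, np.sqrt(reliability), n)
observed = true + rng.normal(0.0, np.sqrt(1.0 - reliability), n)

# Three cut-points divide the score scale into four levels.
cuts = np.array([-1.0, 0.0, 1.0])
true_level = np.digitize(true, cuts)
observed_level = np.digitize(observed, cuts)

print(f"wrongly graded: {np.mean(true_level != observed_level):.1%}")
```

Under these assumptions the proportion wrongly graded comes out at roughly 30 percent, the same order of magnitude as the KS2 figure cited above.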
Another puzzling element of government policy in the context of teaching,
learning, and standards, is that if quality of curriculum experience and depth of
learning is severely compromised over time, this will hardly address the key
economic issue of international competitiveness which ostensibly is underpin-
ning government policy on education. As McNess et al. (2001) note: "An
increasingly constrained education system focused on 'the basics' and directed by
the need to prioritise 'performance targets' is in tension with the discourse of the
flexible lifelong learner eager to take risks" (p. 16). One suspicion, of course, is
that "discourse" (rhetoric) is all it is. Current government policy certainly attends
to the issue of globalization and the need to continue to develop a knowledge-
based economy. It is also continuing to focus on the consumers of the service,
emphasizing transparency of processes and products, and the role that educa-
tional achievement can play in individual social mobility and social inclusion.
However, as Reich (1991) has noted, "symbolic analysts" are only one of the
employable categories that globalization is calling forth, the others being rather
less educated "production workers" and "service workers." Were these broad
heuristic categories to be found within an economy, as well as across economies,
then the fact that none of the current legislation or practice in England applies
to independent fee-paying schools, which are free to become exam factories or
not, depending on mission and clientele, suggests that most of the "symbolic
analysts" in England are already assumed to be continuing to come from this
sector. Thirty years on, and the British social class system could be said to be still
very firmly in control.
A further consequence of all the changes that have occurred in England since
the mid-1980s, but which has grown in severity since the mid-1990s, is a crisis in
teacher recruitment and retention. Very large numbers of serving teachers have
taken early retirement to escape the overload of the national curriculum and
national assessment, and training places are hard to fill. A recent survey indicated
that 12 percent of trainee teachers drop out of training before completion, 30 percent
of newly trained teachers never teach, and a further 18 percent of new recruits
leave the profession within three years (Times Educational Supplement,
November 2, 2001, p. 1). Most explanations focus on overwork, linked to the
pressure to meet targets, along with relatively low pay for an all-graduate profes-
sion. But lack of control over curriculum and teaching methods is also likely to
be important. There is simply no scope in the system anymore for innovative
educators who might wish to experiment at local level. And, coupled with the
immediate crisis of teacher numbers, one wonders where the creative leaders of
the profession will come from in five or ten years' time.
What can be learned from the English experience? The most obvious point to
note is the continuing tension between the formative and evaluative purposes of
a national assessment system and the trade-off between scale and complexity.
The larger the system of national assessment envisaged and implemented, the
simpler the testing regime must be. And this, in turn, will carry consequences for
curriculum focus and quality of educational experience. If policymakers wish to
garner some of the more positive elements of "measurement driven instruction"
(designing good quality assessments to underpin good quality teaching) and
report results nationally, then the system would probably have to operate at no
more than two ages/stages (7 & 11? 9 & 14?) to allow sufficient time and
resources to be committed to test/task development, teacher development, and
the ongoing moderation of teacher judgments which would have to be at the
heart of any such system (Khattri, Reeve, & Kane, 1998).
There are also many features of the English system which might be configured
in different ways with different emphases. National assessment does not stand
alone as a policy option in England. Its peculiarly coercive and corrosive power
derives from being linked to a national curriculum; framed by a very mechanistic
government view of evaluation and accountability, including the publication of
results and target-setting; and further policed by a very punitive inspection regime.
All of these elements could be combined in different ways and operationalized
with a different emphasis. Thus, one could imagine a national curriculum being
implemented without national assessment at all; or without publishing results; or
with an element of occasional APU-style sample testing; or with a national
assessment system which was more limited in scale but more ambitious in scope.
Alternatively, with or without a national curriculum, a national bank of standard
assessment tasks could be developed for expected, but nevertheless still volun-
tary use and modification by teachers, to underpin their own judgments of
student progress and school self-evaluation. Similarly, results could be published
without being linked to school targets and teacher pay, and set in a more
descriptive and explanatory account of what is happening in a particular school
and why. Such a description and explanation could be written by the school itself,
or be an outcome of a more curious and supportive system of inspection - also
indicating what might be done to assist improvement. Elements of these variations
have in fact been implemented in the other three countries of the UK, including
development of standard assessment task "item banking," and England is now the
only country which publishes "performance tables" in overt comparative form.
Nevertheless, the English public and English schools and teachers have become
used to seeing annual publication of national assessment and GCSE results.
Indeed, unless the press can identify a particular angle (usually involving some
blip in the statistics which can be interpreted as yet more evidence of falling
standards), their publication can pass almost unremarked. Yet stopping publi-
cation in England would be a major and almost unthinkable decision. National
testing and the publication of results is here for the foreseeable future. The debate
about testing continues, however, with considerable unease being expressed
about the "over-assessment" of individual students and the system as a whole
(e.g., Times Educational Supplement, November 23, 2001, p. 19; November 30,
2001, p. 3). Although the current government in England seems immune to such
arguments, they ought to be persuasive elsewhere, if the key issue is to develop
creative and flexible citizens of the future rather than dutiful drones for today.
ENDNOTES
1 It should also be noted in this discussion of the fragmented nature of the UK system that the 11+
itself was locally controlled and administered by local education authorities (LEAs, i.e. school
districts) and test papers and pass rates varied from one authority to another, with variation usually
reflecting the historic provision of grammar school places in each LEA. Indeed one of the factors
that led to the demise of the 11+ was precisely this variation, when it became apparent that pass
rates varied from as low as 9 or 10 percent of a cohort in some LEAs to over 40 percent in others
(see Torrance, 1981).
2 In fact research on this topic seems to indicate that if anything many teachers' judgments can be
harsher than results produced by final papers. Kingdon and Stobart (1988), reporting on work
carried out at the University of London School Examinations Board, note that "we found no
support for the claim that coursework would be ... a soft option, since in almost half the subjects
the mean coursework mark was actually lower than the mean final mark" (p. 121).
3 Plus physical education (PE) and information and communication technology (ICT) from August
2000. Inclusion of PE responded to concerns about other important subjects being squeezed out of
the curriculum and effectively confirmed the place of some physical activity in the curriculum. ICT
was similarly thought to be in danger of exclusion if not explicitly recognised, though it is recom-
mended that it is taught across the curriculum within each subject, rather than as a separate subject.
4 In 2000, for example, there were 554,551 entries in GCSE maths and 533,227 in GCSE English; by
contrast there were 81,493 entries in computer studies, 84,670 in drama and 14,835 in social studies
(DfES, 2001).
5 See House of Commons (1999) and Cullingford (1999) for accounts of the impact of inspection.
6 Personal correspondence with DfES indicates that the direct cost of KS1, 2, and 3 testing for 2001
was £25.6m. (including test development, printing, distribution and marking), but it is not clear
what proportion of this figure, if any, overlaps with QCA budget. Further correspondence indicates
that £78,000 was spent on "disseminating" the results of the tests through the PANDAs produced
for schools. Clearly this figure is a ridiculous underestimate, taking into account little more than
the cost of installing the website. The cost of actually producing the statistics has been ignored.
Thus even after correspondence it is difficult to know, without asking more and more precise
questions, how much is really spent on national testing. None of the figures includes teacher time
or LEA adviser time spent administering the tests and/or acting on the results.
Assessment of the National Curriculum in England 927
REFERENCES
Becher, T., Eraut, M., & Knight, J. (1981). Policies for educational accountability. London:
Heinemann.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5, 7-74.
Brown, M. (1988). Issues in formulating and organising attainment targets in relation to their
assessment. In H. Torrance (Ed.), National assessment and testing: A research response (pp. 15-25).
London: British Educational Research Association.
Butterfield, S. (1995). Educational objectives and national assessment. Buckingham: Open University
Press.
Croll, P. (Ed.). (1996). Teachers, pupils and primary schooling: Continuity and change. London: Cassell.
Cullingford, C. (Ed.). (1999). An inspector calls. London: Kogan Page.
Daugherty, R. (1995). National curriculum assessment: A review of policy 1987-1994. London: Falmer.
Dearing, R. (1993). The national curriculum and its assessment: Final report. London: School
Curriculum and Assessment Authority.
DES (Department of Education and Science). (1977). Education in schools. A consultative document.
London: Her Majesty's Stationery Office. (Cmnd. 6869)
DES (Department of Education and Science). (1987). The national curriculum 5-16. A consultation
document. London: Author.
DES (Department of Education and Science). (1990). English in the national curriculum (No. 2).
London: Author.
DfE (Department for Education). (1992). Testing 7-year olds in 1992: Results of the national curricu-
lum assessments in England. London: Author.
DfEE (Department for Education and Employment). (1996). Statistics of education: Public examina-
tions: GCSE and GCE in England 1995. London: Author.
DfEE (Department for Education and Employment). (1999). The national curriculum for England.
London: Author.
DfEE (Department for Education and Employment). (2001). Statistics of education: National curricu-
lum assessments of 7, 11 and 14-year olds in England, 2000. London: Author.
DfES (Department for Education and Skills). (2001). Statistics of education: Public examinations
GCSE/GNVQ and GCE/AGNVQ in England 2000. London: Author.
Elliott, J., Bridges, D., Ebbutt, D., Gibson, R., & Nias, J. (1981). School accountability. London:
Grant McIntyre.
Gewirtz, S., Ball, S. J., & Bowe, R. (1995). Markets, choice and equity in education. Buckingham: Open
University Press.
Gipps, C., Brown, M., McCallum, B., & McAlister, S. (1995). Intuition or evidence? Teachers and
national assessment of seven-year-olds. Buckingham: Open University Press.
Hargreaves, D. (1984). Improving secondary schools. London: Inner London Education Authority.
Hillgate Group. (1987). The reform of British education. London: Claridge.
House of Commons. (1999). Education and Employment Select Committee Fourth Report. London:
Her Majesty's Stationery Office. Retrieved from
www.publications.parliament.uk/pa/cm199899/cmselect/cmeduemp/62/6202.htm.
Khattri, N., Reeve, A., & Kane, M. (1998). Principles and practices of performance assessment. Mahwah,
NJ: Lawrence Erlbaum.
Kingdon, M., & Stobart, G. (1988). GCSE examined. London: Falmer Press.
Linn, R.L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.
MacDonald, B. (1979). Hard times - Accountability in England. Educational Analysis, 1(1), 23-44.
Marks, J. (1991). Standards in schools: Assessment, accountability and the purpose of education.
London: Social Market Foundation.
McNess, E., Triggs, P., Broadfoot, P., Osborn, M., & Pollard, A. (2001). The changing nature of
assessment in English primary classrooms. Education 3-13, 29(3), 9-16.
Murphy, R. (1986). The emperor has no clothes: Grade criteria and the GCSE. In C. Gipps (Ed.),
The GCSE: An uncommon examination. London: Institute of Education Publications.
Murphy, R., & Torrance, H. (1988). The changing face of educational assessment. Buckingham: Open
University Press.
Orr, L., & Nuttall, D. (1983). Determining standards in the proposed single system of examining at 16+
(Comparability in Examinations: Occasional Paper 2). London: Schools Council.
Osborn, M., McNess, E., Broadfoot, P., Pollard, A., & Triggs, P. (2000). What teachers do: Changing
policy and practice in primary education. London: Continuum.
Pollard, A., Broadfoot, P., Croll, P., Osborn, M., & Abbott, D. (1994). Changing English primary schools.
London: Cassell.
QCA (Qualifications and Curriculum Authority). (2001). Annual report. Retrieved December 2001
from: www.qca.org.uk/annual_report/resourcing.asp.
Reich, R. (1991). The work of nations. London: Simon & Schuster.
Shepard, L. (1990). Inflated test score gains: Is the problem old norms or teaching to the test?
Educational Measurement: Issues and Practice, 9(3), 15-22.
TGAT (Task Group on Assessment and Testing). (1987). National curriculum: Task Group on
Assessment and Testing: A report. London: DES.
Torrance, H. (1981). The origins and development of mental testing in England and the United
States. British Journal of Sociology of Education, 2, 45-59.
Torrance, H. (1993). Combining measurement driven instruction with authentic assessment: The case
of national assessment in England and Wales. Educational Evaluation and Policy Analysis, 15, 81-90.
Torrance, H. (Ed.) (1995). Evaluating authentic assessment. Buckingham: Open University Press.
Wiliam, D. (2001). Reliability, validity and all that jazz. Education 3-13, 29(3), 17-21.
41
State and School District Evaluation in the
United States
WILLIAM J. WEBSTER
Dallas Independent School District, Dallas, Texas, USA
TED O. ALMAGUER
Dallas Independent School District, Dallas, Texas, USA
TIM ORSAK
Dallas Independent School District, Dallas, Texas, USA
WHAT IS EVALUATION?
Before attempting to describe the state of evaluation, it is important to reach a
suitable definition. For about 30 years, educators and other professionals have
debated how evaluation should be defined and practiced.
The information described in this section was collected through surveys in 1978
and 1999 of more than 50 of the largest urban school districts in the United States.
The survey conducted in 1978 was mailed to persons having the title of directors
of evaluation/research in each of those districts. The 1999 survey was either
mailed or faxed to persons responsible for evaluation/research at each of the 56
members of the Council of Great City Schools. The 1978 survey had 35 responses
of the 56 surveyed. The 1999 survey had 20 usable responses of 56 surveyed. Nine
districts responded to both surveys. Because of the low response rate in 1999 and
the fact that only nine districts responded to both surveys, we rely to some extent
on previously collected information, some of which is dated.
Pittsburgh and Birmingham reported no formal evaluation or research depart-
ments in 1978, while Des Moines and Birmingham responded that they had no
formal evaluation or research departments in 1999. Des Moines highlighted the
elimination of the district's evaluation unit in 1999 as a result of reorganization.
The representatives of some districts were unable to accurately complete the
simple questionnaire (three in 1978 and three in 1999), thus their responses were
not included in the summarization.
Districts were stratified on the basis of evaluation and research expenditures
and questions were asked relative to each of three categories. The three categories
included large (more than $1,000,000 in research, evaluation, and testing expendi-
tures), medium ($300,000 to $1,000,000 in research, evaluation, and testing
expenditures), and small (less than $300,000 in research, evaluation, and testing
expenditures). The research expenditure measure was used in an attempt to
group departments homogeneously for discussion. Grouping by size of district
would have produced extremely heterogeneous groups due to the wide range of
evaluation budgets. Thus generalizations are relative to size of evaluation
departments.
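For concreteness, this stratification amounts to a simple threshold rule. The sketch below restates it using the dollar figures given above; the function name and the placement of the exact boundary values are our own.

```python
def department_size(annual_expenditure):
    """Classify a district's research, evaluation, and testing budget
    into the survey's three strata (thresholds in U.S. dollars)."""
    if annual_expenditure > 1_000_000:
        return "large"
    if annual_expenditure >= 300_000:
        return "medium"
    return "small"

# Example classifications.
assert department_size(1_500_000) == "large"
assert department_size(500_000) == "medium"
assert department_size(100_000) == "small"
```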
Functions

Among the functions on which districts reported expenditures were the following:

15. Personnel Evaluation: The provision of information for use in personnel evaluation or the
management of the personnel evaluation system. Personnel evaluation for this purpose is
limited to teacher and principal evaluation.

16. Value-Added Evaluation: The use of measures of adjusted gain in teacher, principal, or
school evaluation.

Expenditures
In the 1978 survey, the largest proportion of resources was designated for product evaluation for large departments and for testing for
medium and small departments. Product evaluation was second highest for the
medium and small departments. The 1999 survey indicated that testing was the
primary expense of the large and medium departments, with approximately 20 and
32 percent of their resources, respectively, allocated to this area. The smaller
departments indicated a great reduction in testing (down to 2.5 percent), whereas
management was their primary expense, taking 20 percent of their resources.
Overall, the differences between the two surveys indicate shifts away from
context evaluation, process evaluation, product evaluation, data processing, and
research consultation for the large and medium departments with increases
primarily coming in the area of testing. Ad hoc information production also
showed increases for the large departments. The small departments indicated a
decrease of 16.5 percent in testing and of 7.9 percent in product evaluation with
the greatest increases for management (9.3 percent) and research consultation
(5.7 percent).
The increases in testing for the large and medium departments can probably
be attributed to the additional testing required by state accountability systems.
State accountability systems have shifted additional burdens of required testing
to local school districts. The most obvious of a school district's departments to
assist in implementing the additional testing required by state systems has
traditionally been the research and evaluation department. This absorption of
additional testing forced resource reallocation from traditional research/
evaluation areas to more management and ad hoc services. In the case of small
research departments, the burden of additional testing could not be accommo-
dated by the departments and thus was apparently transferred to another
department.
This shift is significant because it signals a move away from legitimate evalua-
tion activities toward assessment. Much of the assessment costs can be attributed to
test administration and monitoring activities required by, but not funded by,
state agencies. Control of the testing function is essential to operating an efficient
evaluation office. To control testing is to control the quality of the majority of
research and evaluation product data. Without this quality control mechanism,
evaluation is likely to be left depending on unreliable and invalid
data. However, evaluation departments are assuming more and more of the
administrative burdens of state testing programs. Rather than increase evaluation
budgets to handle these demands, resources are being transferred. Thus, districts
are often left spending their sparse evaluation dollars on massive, unwieldy,
often invalid assessment programs rather than on legitimate evaluation activities.
Management costs for research and evaluation activities are somewhat high
because of the large amount of quality control required of the data and the inten-
sive degree of interaction with decision makers necessary to make evaluation
effective. The smaller the department, the greater the amount of time that the
department head generally must spend selling evaluation activities to decision
makers, including, in many instances, convincing them that they need informa-
tion to make better decisions.
Input evaluation is an extremely cost-effective form of evaluation. By ensuring
that only programs with a high probability of success are adopted, one vastly
reduces the probability of costly failure. This process could save millions of dollars
in program materials and staff resources. Yet, most evaluation units, whatever
their size, did not have input evaluation functions. As a result, it has been the
authors' experience that many variables other than program effectiveness
contribute to educators' decisions to implement or discontinue programs.
Process evaluation is essential in monitoring programs to ensure that evaluators
are not evaluating fictitious events. Furthermore, many changes in instructional
delivery can be made during the course of program implementation through the
use of process evaluation results. Yet, there is strong evidence that reliance on
product evaluation data alone is prevalent among most evaluation units. It is
significant to note that one of the major weaknesses of national evaluation studies
generally is a lack of adequate documentation of instructional delivery that could
be obtained through process evaluation. Over-reliance on product evaluation
data alone often leads to extremely misleading results.
Testing, product evaluation, and data processing are extremely important
evaluation functions. However, testing without some type of evaluation design is
of limited use to decision making. It is, in other words, a necessary but not suffi-
cient tool for evaluation. Product evaluation is the form of evaluation that most
Boards of Education support most readily. It is visible, related directly to
decision making, and provides information of interest to most laypeople. Questions
such as "Did a program meet its objectives?" "Did it meet system objectives
better than the competition?" and/or "Was it cost effective?" are of extreme
interest to Board members and the public. These are also the kinds of questions
asked by federal funding agencies. Yet, even product evaluation activities have
declined in the face of demands for increased large-scale assessment programs.
That small evaluation departments spend a comparatively large amount of
resources on ad hoc information suggests that they, too, are focused on providing
evaluative information for decision making. With a small evaluation budget, the
information routinely available is inadequate, so decision makers commandeer a
higher proportion of resources to answer the pressing questions of the moment.
Personnel evaluation is a primary means to ensure the quality of education
received by students. Indeed, it can be argued that program evaluation in the
absence of systematic value-added teacher evaluation is of extremely limited
utility. Teacher-proof programs cannot be designed. Competent teachers can
make almost anything work, while incompetent ones can ruin even the most
brilliant instructional design. Teacher evaluation systems must be outcomes-
based, value-added, and must be coordinated with ongoing program and school
evaluation. Furthermore, evaluations of principals, support personnel, and admin-
istrators must also be outcomes-based and value-added, in most cases including
the same outcomes as the teacher system. If funds are limited, systematic outcomes-
based, value-added personnel evaluation is much more likely to produce desired
results than is program evaluation. Studies of effective and ineffective teachers
in Dallas found as much as a 50-percentile-point difference in students' scores on
a norm-referenced test after only three years (Jordan, Bembry, & Mendro, 1998;
Jordan, Mendro, & Weerasinghe, 1997). (For a discussion of integrating person-
nel and program evaluation see Webster, 2001b.)
Only 18 percent of the responding districts had any involvement at all in
teacher and/or principal evaluation.
Evaluation Models
Stufflebeam's CIPP (Context, Input, Process, Product) was by far the most pop-
ular evaluation model employed; six districts reported using it or some form of
CIPP. The model was developed based on evaluation experiences in the Columbus,
Ohio, public schools, and has been used as the principal evaluation model in
Dallas, Cincinnati, Saginaw, Lansing, and Philadelphia. The most operational-
ized version of the model is in Dallas, where the system is organized to conduct
the four kinds of evaluation required by the model. The first is context
evaluation, which assesses needs, problems, opportunities, and objectives at
different levels of the school district. The second is input evaluation, which
searches for alternative plans and proposals and assesses whether their adoption
would be likely to promote the meeting of needs at a reasonable cost. The third
is process evaluation, which monitors a project or program, both to help guide
implementation and to provide a record for determining the extent to which the
program or project was implemented as designed. The fourth part of the model
calls for product evaluation, which is an attempt to examine the value-added
outcomes of a program and the extent to which they meet the needs of those
being served. In Dallas, the CIPP model is used both to help implement priority
change programs and to provide an accountability record.
Several districts noted various model combinations as their primary evaluation
model. Fresno reported using a combination of CIPP, the Discrepancy Model,
and General Accounting Office-Yellow Book models. Houston listed Decision
Making, Behavioral Objectives, and Systems Analysis models. Detroit, Norfolk,
and a number of other districts identified no one specific model.
In response to public pressure to improve student learning, most states have moved
toward enhanced accountability systems, most of which establish standards for
instruction, use either norm-referenced or criterion-referenced assessments to
measure these standards, rate districts and schools on established standards,
provide for some sort of sanctions, and require public reports of the results of the
accountability system.
If state accountability systems are to be complete and valid, both content and
performance standards must be clear. Moreover, the assessments must be valid
measures of the attainment of the standards and especially of student progress.
Multiple indicators must be clearly linked to the standards. Rewards must be
used only to reinforce valid increases in achievement, and punishments must be
used only to negatively reinforce valid decreases in achievement. Moreover,
report cards must report valid assessments of school and student progress.
State Departments of Education have taken a leadership role in attempting to
develop accountability models, particularly at the school and district level. As of
1995, 46 of 50 states had accountability systems that featured some type of assess-
ment. Twenty-seven of these systems featured reports at the school, district, and
state level; three featured school level reports only; six featured reports at both
the school and district level; seven featured reports at the district and state level;
two featured reports at the state level only; and one was currently under develop-
ment (Council of Chief State School Officers, 1995). The authors conducted a
January 2002 Internet survey to update these statistics, which are reported later
in the chapter.
When one reviews state accountability systems, there is little evidence of valid
value-added systems being employed: apparently only two states, South
Carolina (May, 1990) and Tennessee (Sanders & Horn, 1995), have used value-
added statistical methodology. North Carolina and Colorado are currently
working on value-added components (W. L. Sanders, personal communication,
January 2002). Most of the rest tend to evaluate students, not schools or districts,
and generally cause more harm than good with systematic misinformation about
the contributions of schools and districts to student academic accomplishments.
By comparing schools on the basis of unadjusted student outcomes, state reports
are often systematically biased against schools with population demographics
that differ from the norm. In attempting to eliminate this bias, a number of states
(e.g., Texas, California) have gone to non-statistical grouping techniques, an
approach that has serious limitations when there is consistent one-directional
variance on the grouping characteristics within groups.
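The statistical point here can be made concrete with a small simulation. The sketch below is illustrative only: it uses a single prior-attainment covariate and ordinary least squares, whereas operational systems such as Tennessee's rest on far more elaborate longitudinal mixed models, and all names and numbers are invented.

```python
import numpy as np

# Compare ranking schools on raw (unadjusted) outcomes with ranking them
# on a simple value-added (adjusted-gain) measure, when schools differ
# systematically in pupil intake.
rng = np.random.default_rng(1)
n_schools, pupils = 20, 100

true_effect = rng.normal(0, 1, n_schools)   # real instructional impact
intake_mean = rng.normal(0, 2, n_schools)   # systematic intake differences

schools = []
for s in range(n_schools):
    prior = rng.normal(intake_mean[s], 1, pupils)
    post = prior + true_effect[s] + rng.normal(0, 1, pupils)
    schools.append((prior, post))

raw_means = np.array([post.mean() for _, post in schools])

# Value-added: regress post-test on prior attainment across all pupils,
# then average each school's residuals.
prior_all = np.concatenate([p for p, _ in schools])
post_all = np.concatenate([q for _, q in schools])
slope, intercept = np.polyfit(prior_all, post_all, 1)
value_added = np.array([
    (post - (intercept + slope * prior)).mean() for prior, post in schools
])

print("corr(true effect, raw school mean): "
      f"{np.corrcoef(true_effect, raw_means)[0, 1]:.2f}")
print("corr(true effect, value-added):     "
      f"{np.corrcoef(true_effect, value_added)[0, 1]:.2f}")
```

In runs of this kind the raw school means correlate only weakly with the true instructional effects, because they largely reflect intake, while the adjusted-gain measure recovers them closely; that is the sense in which unadjusted reports are systematically biased.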
In completing the following review, state accountability systems in all 50 states
were considered, with a focus on eight states chosen because they had well-
developed systems and represented the entire spectrum of accountability at the
state level. The eight states were Alabama, California, Georgia, Kentucky,
Massachusetts, New York, Tennessee, and Texas. (For more comprehensive exami-
nations of the status of state accountability systems, see the Council of Chief
State School Officers, 1995, 2000; Education Commission of the States, 1998;
and Consortium for Policy Research in Education, 2000.)
Generally speaking, considerable variability exists in the completeness and
validity of the various state accountability systems. The vast majority have an
objectives orientation as opposed to a decision-making orientation to evaluation.
Furthermore, at the present time, with the exception of Tennessee's accountability
system, state accountability systems do not include legitimate value-added
evaluation components and generally rely simply on cross-sectional analyses to
measure school, district, and teacher impact on student learning.
Standards and Assessments
Many states have developed content and performance standards for each major
discipline of instruction offered in their schools. Courses of study define the
minimum required subject-area content in terms of what students should know
and be able to do at the end of each grade or course. Courses included range
from arts education, career/technical education, computer applications, driver
education, language arts, foreign languages, health education, physical educa-
tion, science, and social studies to the basic three Rs. As of 2002, of the 50 states,
45 have developed criterion-referenced tests to measure the attainment of their
content standards while 27 use both criterion- and norm-referenced tests to
assess school performance. Of the eight states studied in depth, all use criterion-
referenced tests to measure at least a subset of developed standards, though six
rely primarily on norm-referenced tests.
The State of Alabama relies primarily on a norm-referenced test in its account-
ability system. In keeping with legislation requiring the use of a norm-referenced
test for assessing student achievement, Alabama uses the Stanford Achievement
Test (SAT) to assess student achievement in the core subjects, because the test
objectives match well its content and performance standards. The test is
administered to students in grades 3-11 to assess concepts and skills in reading,
mathematics, language, science, and social studies. State-developed tests based
on writing standards in grades 5 and 7 are also administered. Moreover, it uses
the Alabama High School Graduation Exam, which includes reading comprehen-
sion, language, mathematics, and science, to strengthen graduation requirements.
Kentucky uses its Commonwealth Accountability Testing System (CATS) to
assess reading and science in grades 4, 7, 11, and 12; writing in grades 4, 7, and
12; and social studies, mathematics, arts/humanities, and practical living/vocational
studies in grades 5, 8, 11, and 12. It assesses proficiency in each of these content
areas through various measures including open-response questions, multiple-
choice questions, and writing portfolios.
Tennessee is implementing the Tennessee Comprehensive Assessment Program
(TCAP), which uses both norm-referenced and criterion-referenced measures.
Currently, the TCAP includes the following tests: the Achievement Test (an
elementary achievement test), the Writing Test, the Competency Test, and the
High School Subject Matter tests. End-of-course tests that will be aligned with
specific curriculum objectives are also being developed.
The California Test Bureau developed the TCAP Achievement Test for grades 3-8,
which measures basic skills in reading, vocabulary, language, language mechanics,
mathematics, mathematics computation, science, social studies, and spelling.
The Tennessee Comprehensive Assessment Program Competency Test for ninth-
grade students measures skills in mathematics and language arts required for gradu-
ation. High school students take the High School Subject Matter Tests at the end
of their Algebra, Geometry, and Mathematics for Technology courses. The TCAP
Achievement Test is both norm-referenced and criterion-referenced. The TCAP
Competency and High School Subject Matter Tests are criterion-referenced.
Classification Systems
Of the 50 states, 27 have a system for classifying schools on the basis of their
performance levels. Of the eight states studied in depth, seven use a myriad of
systems to classify schools and school systems, while one state has no classifi-
cation system at this time.
Some examples follow. Schools or school systems in Alabama are classified as
Academic Clear, Academic Caution, or Academic Alert. This classification is
based on student scores on the Stanford Achievement Test. California uses three
cut-points to create school, district, county, and state scores: the 25th, 50th, and
75th national percentiles on the Stanford Achievement Test.
With the exception of Title I, Georgia does not have any set of acceptable levels
of performance for schools, districts, or the state on any measures at this time.
Massachusetts reports criterion-referenced results on the Massachusetts
Comprehensive Assessment System for individual students, schools, and districts
according to four performance levels. The four levels are advanced, proficient,
needs improvement, and failing. At the advanced level, students demonstrate a
comprehensive understanding of rigorous subject matter and can solve complex
problems. At the proficient level, students have full understanding of challenging
subject matter and can solve various problems. At the needs improvement level,
students have some understanding of the subject matter and can solve simple
problems. At the failing level, students demonstrate minimal understanding of
subject matter and cannot solve simple problems. School and district success is
measured by the percentage of students achieving at each level of performance
and by the number of students who move toward a higher level.
In rating school performance, New York identifies only schools that are farthest
from state standards. In grades 4 and 8, the criterion for both English language
arts and mathematics is that 90 percent of students score above the statewide
performance benchmark. In grade 11, the criterion is that 90 percent of the students
in the school have met the graduation requirements in reading, writing, and
mathematics. Moreover, a criterion of a dropout rate below 5 percent is set.
Of the 50 states, 17 provide rewards for student performance. Of the eight states
studied, five provide rewards for student performance while three do not.
In California, the Governor's Performance Award Program provides monetary
and non-monetary rewards to schools that meet or exceed pre-established growth
rates and that demonstrate comparable improvement in academic achievement by
numerically significant ethnic and socioeconomically disadvantaged subgroups.
Schools that fail to meet annual growth targets must hold a public hearing to
announce their lack of progress. The local board of education may reassign staff,
may negotiate site-specific amendments, and may opt for other appropriate
measures. The Superintendent of Public Instruction (SPI) in California assumes
all governing powers for any school that fails to meet performance goals and
show significant growth in 24 months. The school's principal must be reassigned.
Moreover, the SPI must take one of the following actions: allow students to
attend other schools or allow the development of a charter school; assign the
management of the school to another educational institution; reassign certified
employees; renegotiate a new collective bargaining agreement; reorganize the
school; or close the school.
Georgia has designed a pay for performance program to promote exemplary
performance and collaboration at the school level. To attain an award, a school
must develop a plan that identifies a comprehensive set of performance objectives
in academic achievement, client involvement, educational programming, and
resource development. For approval, the performance objectives must be exem-
plary and must have an impact on a significant portion of the school population.
Georgia's State Board of Education designates schools and school systems as
standard, exemplary, and non-standard. It arrives at these designations based on
the results of an evaluation of each school and school system conducted at least
once every five years by educators from other school systems, universities, and
citizens residing within the local system. Each school system that is designated as
nonstandard or that operates one or more schools designated as nonstandard
must submit a corrective action plan to the State Board of Education. The State
Board of Education evaluates each school system designated as nonstandard two
years after the Board has approved its plan.
If the Board determines that the system is making unsatisfactory progress, it
may increase the school system's local fair share by an amount necessary to
finance actions needed to correct identified deficiencies. Or it may require the
school system to raise revenue from local sources to finance all actions needed
to correct identified deficiencies. In a rather extreme provision, the Board may
take civil action to determine whether any local school board member or school
superintendent has by action or inaction prevented or delayed implementation
of the corrective plan. If the court finds that any of these officials have prevented
or delayed plan implementation, it may issue an order requiring the officials to
implement the plan. If it finds that its order is not being carried out, the court
may remove the official and appoint a replacement until the position can be
filled by the normal process. The court also has the authority to appoint a trustee
to ensure that its order is carried out.
Texas, in a more conventional approach than Georgia, has implemented the
Texas Successful Schools Award System to provide for monetary awards to
schools rated exemplary or recognized. A portion also is reserved for schools
rated acceptable that have achieved significant gains in student performance, as
measured by comparable improvement in the preceding year. Comparable
improvement is determined by comparing a school's annual growth for matched
students in reading and mathematics to the growth of 40 other schools with
similar demographic characteristics.
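A stylized version of this matching logic is sketched below. The variable names, the Euclidean similarity metric, and the standardization step are assumptions for illustration; the actual Texas procedure is defined in far more detail.

```python
import numpy as np

def comparable_improvement(growth, demographics, target, k=40):
    """Compare a school's growth for matched students with the mean
    growth of its k most demographically similar peer schools."""
    # demographics: (n_schools, n_features), standardized beforehand.
    dist = np.linalg.norm(demographics - demographics[target], axis=1)
    dist[target] = np.inf               # exclude the school itself
    peers = np.argsort(dist)[:k]        # the 40 most similar schools
    return growth[target] - growth[peers].mean()

# Invented example data: 500 schools, 3 demographic features.
rng = np.random.default_rng(2)
demographics = rng.normal(size=(500, 3))
growth = rng.normal(size=500)           # matched-student annual gains
print(f"growth relative to peers: {comparable_improvement(growth, demographics, 0):+.2f}")
```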
The Commissioner of Education may take several actions if a district does not
meet accreditation criteria. For example, he or she may appoint a management
team to direct the operations of the district in areas of unacceptable performance
or require the district to obtain such services under contract. The Commissioner
may also appoint a board of managers to exercise the powers and duties of the
board of trustees if the district has been rated academically unacceptable for at
least one year. He or she may also annex the district to one or more adjoining
districts, or in the case of home rule, request the State Board of Education to
revoke the district's home-rule charter if the district has been rated academically
unacceptable for a period of at least two years.
Indicators
Most (43) states use more than one indicator in their accountability system.
Nonetheless, various assessments of student performance provide the major
accountability indicator in 80 percent of states. Thirty-one of 50 states (62 percent)
also use some form of dropout rate in their accountability systems while 27 (54
percent) use attendance rates, 17 (34 percent) consider graduation rates, 16 (32
percent) look at grade transition rates, and 12 (24 percent) look at American
College Testing and Scholastic Achievement Test scores. In all cases, these
variables are supplementary to various norm-referenced or criterion-referenced
assessments. Thirteen states disaggregate data by various school and student
demographics.
Report Cards
Of the 50 states, 45 require district level reports. All eight states that were studied
in greater detail require districts to report the results of the accountability system
to the public. All report cards are at the district or school level. To date, no state
requires the public reporting of teacher performance data. Since school and
district report cards best represent the products of state accountability systems,
eight reporting mechanisms are briefly reviewed.
The Alabama State Superintendent of Education releases annual report cards
on academic achievement for each school and school system. The purpose of
these is to assess how school systems and individual schools are performing
compared to other schools and systems in Alabama and to the national average.
The report cards rate schools and school districts on indicators that include
achievement, funding, teacher qualifications, and attendance.
California reports student, school, and district scores for subjects tested. It
reports the national percentile ranking of the average student, the mean scaled
score, and the breakdown of student scores by quartiles. Student, school, and dis-
trict reports are distributed to districts. Districts must provide individual student
reports to parents. Schools must notify parents of the purpose of the report cards
and ensure that all parents receive a copy of the report card.
Georgia releases an annual state report card that provides information at the
state, local system, and school levels. Test results are disaggregated by grade,
school, and school district and are disseminated at the school and district levels.
The results of the norm-referenced tests are disaggregated by free or reduced
lunch status. The report cards are available on the Internet.
Kentucky requires local boards of education to publish in local newspapers an
annual performance report on the district's accomplishments regarding its perfor-
mance goals, including student performance and retention rates. The State
Department of Education also publishes detailed information at the school,
district, region, and state levels at the end of each biennium.
Massachusetts posts school and district profile results on its web site. It also
provides these profiles to districts that, in turn, provide them to schools, which,
in turn, provide them to parents. The profile results are disaggregated by regular
students, students with disabilities, and students with limited English proficiency.
The profiles include non-cognitive indicators such as pupil/teacher ratios, dropout
rates, student exclusions, and the racial compositions of student bodies, special
education student populations, limited English proficient student populations,
and students eligible for free and reduced price lunch.
The New York State school report card shows how well students in each school
and district are performing on state assessments, and provides detailed informa-
tion on school attendance, suspension, dropout rates, and post-secondary education
plans. The card disaggregates test scores for general education students and
students with disabilities, and reports performance of LEP students on alterna-
tive measures of reading. It includes attendance rates, suspension rates, dropout
rates, and total public expenditures per pupil.
Tennessee provides a home report that includes information for parents on
their children's performance including percentile ranks, stanines, and mastery of
objectives. The scores are broken down by school, grade, and subject, and
include overall score and rate of improvement. The district report cards include
information on accreditation, personnel data, waivers, class sizes, disaggregated
enrollment, attendance rates, promotion rates, dropout rates, and expulsions/
suspensions. They also include competency test passing rates; Comprehensive
Assessment results by grade, subject, percentile, and state, showing two-year
trends; and Comprehensive Assessment Program value-added three-year
averages by grade span, subject, and state, showing six-year trends in state gains
versus national gains.
Texas has designed its Academic Excellence Indicator System (AEIS) to provide
reports at the district and school levels on all performance measures. The base indicators
used for accountability ratings are the Texas Assessment of Academic Skills
(TAAS) percent passing rates in grades 3-8 and 10 for reading, writing, and
mathematics, and dropout rates. A unique AEIS school report card is provided
for each school. The school must then provide each student's family with
information on TAAS performance, attendance rates, dropout rates, performance
on college admissions examinations, end-of-course examination participation,
student/teacher ratios, and administrative and instructional costs per student.
Value-Added Components
We believe a number of disturbing trends are evident from the various surveys
and literature reviews. First, it appears as if public school evaluation units have
not prospered during the decade of the 1990s. In fact, evidence from the 1999
survey and from the quality and quantity of professional papers presented at
major national research meetings by school district staff suggests that the status
of evaluation in the public schools is far worse than it was in the 1980s. Part of
this decline is probably due to states having taken over many of the assessment
and accountability functions formerly assigned to district evaluation units. This
is the second disturbing trend because most state accountability models appear
to be extremely Tylerian, and only one features legitimate value-added compo-
nents. Thus, we would have to conclude that the state of school system and state
evaluation in the United States is not unlike the state of an endangered species.
Evaluation activities to feed decisions have given way to an almost fanatical
emphasis on accountability with little concern for fairness, validity, or profes-
sional standards.
At the local school district level, evaluation units have apparently fallen on
hard times. Only three districts in the sample have maintained a growth level
between 1979 and 1999 that is equivalent to that of the national economy. None
of the $1,000,000 spenders in 1979 has done so. This decline in local evaluation
efforts has severely hampered efforts to improve education since local evaluation
units were the major sources of the vast majority of input and process evaluation
that existed. Without adequate input and process evaluation, improvement is
unlikely. The only expenditures that consistently increased across evaluation
departments that had enough critical mass to accomplish anything were for testing.
Thus, despite the fact that six of 20 responding school districts in 1999 stated that
their principal evaluation model was CIPP, the evidence suggests that the
resources to implement the input and process evaluation components of CIPP
are not available. Lack of input and process evaluation would make it difficult to
provide a base for understanding and subsequent suggestions for improvement.
This, coupled with the almost complete lack of legitimate value-added method-
ology and the fact that only 18 percent of responding districts had any involvement
in personnel evaluation, paints a dismal picture.
But what of state evaluation efforts? While local evaluation units have appar-
ently foundered, state accountability systems have prospered. First, evaluation is
a misnomer when applied to most existing state accountability systems. They are
at best accountability systems, and in many cases just assessment systems. If one
incorporates the requirement that accountability systems must be fair to all
stakeholders, then those systems must include valid value-added components.
Incorporating this requirement reduces all state systems but Tennessee to the
status of assessment systems.
Most state accountability and/or assessment systems have well-developed standards
and reporting systems and use multiple outcomes. While 80 percent
of 50 state systems use assessment scores as a major part of their accountability
systems, most (45) use additional indicators. Absolute standards are a necessary
part of any accountability system. However, without a legitimate value-added
component, these systems are biased against schools that serve high-poverty,
ethnic-minority, and/or language-minority students. Therefore, the lack of valid
value-added components is particularly injurious to schools that serve those
student populations when sanctions are applied as part of the system.
Most state systems rely on a definition of "like schools" in their comparable
improvement efforts. Our research into value-added systems, extending over 12
years, has convinced us that no two schools are exactly alike. In any matched
group there are outliers, or schools that are closer to schools in one group than
in any other group but do not quite match. Matching, if it is to be done, must be
done at the student level (Webster et al., 1997).
It would not be very complicated to fix most state systems so that they would
meet fairness standards. Value-added systems can be constructed with a variety
of similar methodologies, all of which provide preferable alternatives to systems
based on unadjusted outcomes or on so-called comparable schools or classrooms.
(For discussions of value-added systems and related methodology, see
Bryk & Raudenbush, 1992; Darlington, 1990; Dempster, Rubin, & Tsutakawa,
1981; Goldstein, 1987; Henderson, 1984; Laird & Ware, 1982; Mendro, Jordan,
Gomez, Anderson, & Bembry, 1998; Mendro & Webster, 1993; Sanders & Horn,
1995; Sanders & Rivers, 1996; Webster, 2000, 2001a, 2001b; Webster, Mendro, &
Almaguer, 1992, 1993, 1994; Webster, Mendro, Bembry, & Orsak, 1995; Webster,
Mendro, Orsak, & Weerasinghe, 1996; Webster & Olson, 1988; Webster et al.,
1997; Weerasinghe, Orsak, & Mendro, 1997.) The essence of all of these systems
is to control for factors that are known to affect school outcomes but are not
under the school's control, and to focus on improvement. When these systems
are designed properly, they offer the best chance of adjusting for effects of
variables at the student level that are not explicitly included in the known
factors. In other words, they help control, to some extent, all factors that
are not under the control of schools.
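To make the shared logic of these methodologies concrete, the following sketch (in Python, using the statsmodels library) fits a two-level mixed model in which current scores are adjusted for prior scores and student-level factors outside the school's control, and each school's random intercept is read as its value-added index. This is a minimal illustration of the general approach, not the authors' or any state's actual system; the file and column names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("student_records.csv")  # hypothetical student-level file

    # Adjust current scores for prior scores and student-level factors
    # that are outside the school's control.
    model = smf.mixedlm(
        "score ~ prior_score + limited_english + low_income",
        data=df,
        groups=df["school"],  # random intercept for each school
    )
    result = model.fit()

    # Each school's estimated random intercept is its value-added index:
    # the mean residual gain after student-level adjustment.
    value_added = {school: effects["Group"]
                   for school, effects in result.random_effects.items()}

Because the adjustment happens at the student level, the comparison does not depend on finding "like schools"; each school is compared with the outcomes its own students would be expected to attain.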
During the decades of the 1970s and 1980s there was a great deal of discussion
about a code of professional practice for evaluators. We believe that current prac-
tice suggests that a code is badly needed, although we suspect that it is already
too late. Once states became deeply involved in evaluation and accountability,
politics began to play a much heavier role. Concepts such as validity, fairness,
and equity have taken a back seat to political expediency. However, these concepts
need to be raised to a paramount level so that teachers, students, administrators,
and schools are not evaluated unfairly. At this point, this does not appear to be
happening.
REFERENCES
Bryk, A.S., & Raudenbush, S.W. (1992). Hierarchical linear models: Applications and data analysis
methods. Newbury Park, CA: Sage.
Clopton, P., Bishop, W., & Klein, D. (2000). Statewide mathematics assessment in Texas. Retrieved
from http://mathematicallycorrect.com/lonestar.htm.
Consortium for Policy Research in Education. (2000). Assessment and accountability in the fifty states:
1999-2000. Washington, DC: Author.
Council of Chief State School Officers. (1995). State education accountability reports and indicator
reports: Status of reports across the states. Washington, DC: Author.
Council of Chief State School Officers. (2000). State education accountability systems. Washington,
DC: Author.
Cronbach, L.J. (1963). Course improvement through evaluation. Teachers College Record, 64,
672-683.
Darlington, R.B. (1990). Regression and linear models. New York: McGraw-Hill.
Dempster, A.P., Rubin, D.B., & Tsutakawa, R.Y. (1981). Estimation in covariance components
models. Journal of the American Statistical Association, 76, 341-353.
Education Commission of the States. (1998). Accountability indicators. Denver, CO: Author.
Goldstein, H. (1987). Multilevel models in educational and social research. New York: Oxford
University Press.
Henderson, C.R. (1984). Applications of linear models in animal breeding. Guelph, Canada: University
of Guelph.
Jaeger, R.M. (1992). Weak measurement serving presumptive policy. Phi Delta Kappan, 74, 118-128.
Joint Committee on Standards for Educational Evaluation. (1981). Standards for evaluations of
educational programs, projects, and materials. New York: McGraw-Hill.
Joint Committee on Standards for Educational Evaluation. (1988). The personnel evaluation standards.
Newbury Park, CA: Sage.
Joint Committee on Standards for Educational Evaluation. (1994). The program evaluation standards:
How to assess evaluations of educational programs. Thousand Oaks, CA: Sage.
Jordan, H.R., Bembry, K.L., & Mendro, R.L. (1998, April). Longitudinal teacher effects on student
achievement and their relation to school and project evaluation. Paper presented at the annual
meeting of the American Educational Research Association, San Diego, CA.
Jordan, H.R., Mendro, R.L., & Weerasinghe, D. (1997, July). Teacher effects on longitudinal student
achievement: A preliminary report on research on teacher effectiveness. Paper presented at the
CREATE National Evaluation Institute, Indianapolis, IN.
Laird, N.M., & Ware, J.H. (1982). Random-effects models for longitudinal data. Biometrics, 38, 963-974.
May, J. (1990). Real world considerations in the development of an effective school incentive
program. (ERIC Document Reproduction Service No. ED 320 271.)
Mendro, R.L., Jordan, H.R., Gomez, E., Anderson, M.C., & Bembry, K.L. (1998, April).
Longitudinal teacher effects on student achievement and their relation to school and project
evaluation. Paper presented at the annual meeting of the American Educational Research
Association, San Diego, CA.
Mendro, R.L., & Webster, W.J. (1993, October). Using school effectiveness indices to identify and
reward effective schools. Paper presented at the Rocky Mountain Research Association, Las
Cruces, NM.
Sanders, W.L. (2000, July). Value-added assessment from student achievement data: Opportunities and
hurdles. Paper presented at the CREATE National Evaluation Institute, San Jose, CA.
Sanders, W.L., & Horn, S.P. (1995). The Tennessee Value-Added Assessment System (TVAAS): Mixed
model methodology in educational assessment. In A.J. Shinkfield, & D.L. Stufflebeam (Eds.), Teacher
evaluation: A guide to effective practice (pp. 337-350). Boston: Kluwer Academic Publishers.
Sanders, W.L., & Rivers, J.C. (1996). Cumulative and residual effects of teachers on future student
academic achievement. Knoxville, TN: University of Tennessee.
Scriven, M.S. (1967). The methodology of evaluation. In R. Tyler, R. Gagne, & M. Scriven (Eds.),
Perspectives of curriculum evaluation (pp. 39-83). AERA Monograph Series on Curriculum
Evaluation, No. 1. Chicago, IL: Rand McNally.
Smith, E.R., & Tyler, R.W. (1942). Appraising and recording student progress. New York: Harper.
Stake, R.E. (1967). The countenance of educational evaluation. Teachers College Record, 68, 523-540.
Stufflebeam, D.L. (1966). A depth study of the evaluation requirement. Theory Into Practice, 5,
121-134.
Stufflebeam, D.L., Foley, W.J., Gephart, W.J., Guba, E.G., Hammond, R.L., Merriman, H.O., &
Provus, M.N. (1971). Educational evaluation and decision-making. Itasca, IL: Peacock.
Stufflebeam, D.L., & Webster, W.J. (1980). An analysis of alternative approaches to evaluation.
Educational Evaluation and Policy Analysis, 2, 5-20.
Webster, W.J. (1977, September). Sensible school district evaluation. Invited address presented at the
evaluation conference of the American Educational Research Association, San Francisco, CA.
Webster, W.J. (2000, July). A value-added application of hierarchical linear modeling to the estimation
of school and teacher effect. Invited paper presented at the First Annual Assessment Conference
of the Council of Great City Schools, Portland, OR.
Webster, W.J. (2001a, April). An application of two-stage multiple linear regression and hierarchical
linear modeling to the estimation of school and teacher effect. Paper presented at the annual meeting
of the American Educational Research Association, Seattle.
Webster, W.J. (2001b, April). A comprehensive school and personnel evaluation system. Paper presented
at the annual meeting of the American Educational Research Association, Seattle.
Webster, W.J., Mendro, R.L., & Almaguer, T. (1992, April). Measuring the effects of schooling:
Expanded school effectiveness indices. Paper presented at the annual meeting of the American
Educational Research Association, San Francisco.
Webster, W.J., Mendro, R.L., & Almaguer, T.D. (1993). Effectiveness indices: The major component
of an equitable accountability system. ERIC TM 019 193.
Webster, W.J., Mendro, R.L., & Almaguer, T. (1994). Effectiveness indices: A "value added"
approach to measuring school effect. Studies in Educational Evaluation, 20, 113-145.
Webster, W.J., Mendro, R.L., Bembry, K.L., & Orsak, T.R. (1995, April). Alternative methodologies
for identifying effective schools. Paper presented at the annual meeting of the American Educational
Research Association, San Francisco.
Webster, W.J., Mendro, R.L., Orsak, T.R., & Weerasinghe, D. (1996, April). The applicability of
selected regression and hierarchical linear models to the estimation of school and teacher effects. Paper
presented at the annual meeting of the American Educational Research Association, New York.
Webster, W.J., Mendro, R.L., Orsak, T.R., & Weerasinghe, D. (1997). A comparison of the results
produced by selected regression and hierarchical linear models in the estimation of school and
teacher effect. Multiple Linear Regression Viewpoints, 1, 28-65.
Webster, W.J., & Olson, G.R. (1988). A quantitative procedure for the identification of effective
schools. Journal of Experimental Education, 56, 213-219.
Webster, W.J., & Stufflebeam, D.L. (1978, March). The state of the art in educational evaluation. Invited
address presented at the annual meeting of the American Educational Research Association,
Toronto.
Weerasinghe, D., Orsak, T., & Mendro, R. (1997, January). Value added productivity indicators: A
statistical comparison of the pre-test/post-test model and gain model. Paper presented at the 1997 annual
meeting of the Southwest Educational Research Association, Austin, TX.
42
International Studies of Educational Achievement
TJEERD PLOMP
University of Twente, Faculty of Educational Science and Technology, Enschede, The Netherlands
SARAH HOWIE
University of Pretoria, Department of Teaching and Training, Pretoria, South Africa
BARRY McGAW
Centre for Educational Research and Innovation, Paris, France
They believed that a focus on students' capacity to use the knowledge and skills
that they have acquired in childhood in situations they could encounter in adult
life would provide the kind of outcome measures "needed by countries which
want to monitor the adequacy of their education systems in a global context"
(Schleicher, 2000, p. 77). Samples of the questions of the kind used in PISA 2000
are provided at http://www.pisa.oecd.org.
Like the IEA studies, PISA is a substantial activity. There were 31 countries
involved, three of which are not members of the OECD. More than 10 additional
non-OECD countries are planning to gather data in 2002 with the PISA 2000
instruments. In 2000, more than 180,000 students in over 7,000 schools participated,
using 26 languages. The assessment included more than 200 items, of which
about one-third were open-ended. Questionnaire responses were sought from all
participating students as well as from participating school principals, on matters
relating to family background, familiarity with technology, cross-curricular
competencies, and the learning environment at school (OECD, 2001).
For its broad indicator work, the OECD has used an organizing framework for
the set of indicators produced for publication in Education at a Glance (OECD,
1992, 1993, 1995, 1996, 1997, 1998, 2000). The indicators have been organised in
three categories, as displayed in Figure 1 (summarized from Bottani & Tuijnman,
1994, Fig. 3.1, p. 53). These categories represent three principal divisions: context
information ("contexts of education"), input and process indicators ("costs,
resources and school processes"), and outcome indicators ("results of education").
In 1999, in documents adopted by the OECD Education Committee, the
framework was expressed in terms of five broad areas: basic contextual variables
that can be considered important moderators of educational outcomes; the
financial and human resources invested in education; access to education and
learning, participation, progression and completion; the learning environment
and organization of schools; and the individual, social, and labor market
outcomes of education and transition from school to work.
Neither of the frameworks provides a conceptual model that shows or hypoth-
esizes how the components of the educational system (inputs, processes, outputs)
are linked or that aims to provide a model for relational analyses between
variables (or indicators). What is offered is an array of data agreed by national
representatives and readily available for analysis by third parties. In each edition,
Figure 1: Organizing Framework for education indicators proposed by the OECD (Bottani &
Tuijnman, 1994, p. 53)
about one-third of the indicators are stable over time in content and presenta-
tion, about one-third are stable in content but variable in data source, and about
a third are new. About half focus on outcomes, including education and labor
market outcomes for both young people and adults. About half, particularly
among those dealing with outcomes, provide a perspective on in-country variations.
PISA opens a quite different window on educational systems, in much the way
that the IEA studies do, with detailed data on student achievement at the
individual level, and on background characteristics of home and school, with
which complex relationships can be explored. The nature of these investigations
was planned at the time the survey was designed, to ensure that relevant
information was obtained through the various school and student questionnaires.
The resulting analytical reports will explore issues such as the relationship between
social background and achievement, with particular attention to ways in which the
relationship might vary across countries; gender differences in achievement,
attitudes, and motivation; school factors related to quality; and equity and the
organisation of learning.
Over the past 40 years, IEA has applied large-scale research methods through
(with a few exceptions) a curriculum-based, explanatory design that has varied
in form and content across studies. The generic IEA conceptual framework is
summarized in Figure 2. A difference in research goals becomes apparent when
this framework is compared with the indicator-type framework of Figure 1. In
IEA, curricula are examined at the system level (intended curricula), school/
classroom level (implemented curricula), and at the individual student level
(attained curricula). Curricular antecedents (such as background characteristics
and school and home resources) can be investigated in relation to curricular
contexts to predict curriculum content outcomes. Such a design also enables the
development and refinement of predictive models, which integrate data collected at
teacher and student level into one conceptual model that includes background
influences on schooling, school processes, and student outcomes. Although each
individual IEA study develops its own conceptual framework, these typically do
not include specifications of indicators, but remain at a more general level
directed by the IEA's curriculum-driven underpinning.
To conduct relational analyses, it is necessary to specify explicitly what factors
are thought to influence educational achievement. An example of a model that
shows how the elements of an educational system are linked, and that provides
lists of variables as the basis for operationalizing such factors, is provided by
Shavelson, McDonnell & Oakes (1989, p. 14) (see Figure 3). Although the
model was not the basis for an international study per se, it was used to inform
the conceptualization of relational analyses in studies such as those of IEA.
A further concern regarding the conceptualization and design of indicator
frameworks is reflected by Bottani and Tuijnman (1994), when they state that
Figure 2: Model of an IEA Research Study (IEA, 1998, p. 53; Travers & Westbury, 1989)
Figure 3: Linking elements of the educational system (Shavelson et al., 1989, p. 14)
"insufficient attention has so far been paid to the analytical perspective in the
design of indicator systems" and that "questions such as how indicators are inter-
related and what predictive power they have are rarely addressed systematically"
(p. 48). That is largely true of the general OECD indicator work published in
Education at a Glance, but this is by design. There has been deliberate resistance
to the adoption of an overall theoretical framework on the grounds that no
adequate framework has been defined, and that to adopt a less than adequate one
would unnecessarily and unhelpfully constrain the coverage of the indicators.
For large-scale international surveys of student achievement, it is clearly less
true, though it can be noted that the general model that has guided IEA studies
for more than 30 years has not been much altered by the substantial bodies of
evidence generated by successive studies. It would appear that, while great
progress has been made in technical and methodological developments, few
developments have appeared in the conceptual frameworks that underpin the
research being carried out.
Could this stagnation be a result of an increasing pressure for policy-driven
research, where relatively quick answers are sought to practical questions in
preference to reflection on what has already been done and what needs to be done
in the future to clarify more conceptual and theoretical questions? Howson (1999)
claims that most large-scale studies have used an inductive framework, with
hypotheses often generated post hoc for exploration with data already collected,
rather than being specified in advance following a "hypothetico-deductive"
model. That is not altogether fair with respect to either the IEA studies or PISA,
in both of which the design of background questionnaires was guided largely by
the research questions to be addressed with the data, in the fashion commended
by Shorrocks-Taylor and Jenkins (2000). In the case of PISA, detailed elaborations
of planned analytical reports were prepared in advance of the finalization
of the background questionnaires to ensure that the required data would be
obtained.
Given their aims and scope, international studies of educational achievement are
complex. Kellaghan (1996) specifies a number of conditions for such studies.
In addition to a number of conditions to increase the relevance of such studies for
policymakers (see below under research questions), he also mentions a few referring
to the design of studies, such as accurately representing what students achieve,
permitting valid comparisons, having clear purposes, and justifying the resources
needed to conduct them. To meet these quality demands, careful design,
planning, and monitoring are needed. This section will first discuss the phases
that can be recognized in the design of this type of study, after which a few other
conditions for a good study will be addressed.
IEA (Martin et al., 1999; IEA, 2002) and others (e.g., Westat, 1991) have developed
standards for conducting international studies of achievement. These are
applied during the study, and regular checks are required to ensure that the study
proceeds according to the standards. Nowadays, many studies have well-documented
and detailed procedures for quality monitoring. For PISA, the OECD
has produced detailed technical standards, revised for PISA 2003 in the light of
experience with PISA 2000 and covering sampling, translation, test administration,
quality monitoring, and data coding and entry (see PISA's web site:
http://www.pisa.oecd.org). Other reference works, such as the International
Encyclopedia of Education (Husen & Postlethwaite, 1994), the IEA technical
handbook (Keeves, 1992), and the International Encyclopedia of Educational
Evaluation (Keeves, 1996), provide many excellent sources for the design and
conduct of such studies.
International comparative studies of educational achievement usually comprise
a number of phases (see, e.g., Beaton, Postlethwaite, Ross, Spearritt, & Wolf,
2000; Kellaghan & Grisay, 1995; Loxley, 1992; Postlethwaite, 1999; Shorrocks-
Taylor & Jenkins, 2000) that constitute a framework for planning and design. In
this section, each of the main phases is briefly discussed.
The design of any study will be determined by the questions that it is required to
answer. Research questions in international comparative studies often stem from
the feeling within some countries that all is not well in education, that quality is
a relative notion and that, although education systems may differ in many
aspects, much can be learned from other countries ("the world as a laboratory").
The research questions, therefore, should address important policy and theory-
oriented issues. For example, in SACMEQ I the research questions were aimed
at the interest of policy makers and educational planners, whilst OECD/PISA
aims to produce policy-oriented and internationally comparable indicators of
student achievement on a regular and timely basis. Within the realm of these
policy-oriented aims, such studies need to inform policy makers and educational
planners about the "human capital" of a nation as an important determinant of
a nation's economic performance (Kellaghan, 1996).
In addition to such policy-related aims, the IEA studies and PISA also include
research questions aimed at understanding reasons for observed differences.
The general research questions underlying a study are often descriptive in nature.
For example, the four major research questions in the IEA TIMSS study are:
What are students expected to learn? Who provides the instruction? How is
instruction organized? What have students learned? In PISA, one finds descrip-
tive questions about differences in patterns of achievement within countries and
differences among schools in the strength of relationships between students'
achievement levels and the economic, social, and cultural capital of their
families. There are also questions about the relationship between achievement
levels and the context of instruction and a variety of other school characteristics.
The first step from the research questions to the design of a study is the
development of a conceptual framework. We have already noted that the charac-
ter of the research questions largely determines this framework, but there are
two distinct dimensions to be considered. One is the nature of the outcome
measures to be used (e.g., reading literacy, mathematical literacy, and scientific
literacy, as defined in PISA). The other is the factors to be studied in relation to the
chosen outcome measures (e.g., as specified in the IEA conceptual framework).
Research Design
The research design focuses on the issues that arise from the research questions,
and needs to provide a balance between the complexity required to respond to
specific questions and the need for simplicity, timeliness, and cost-effectiveness
(Beaton et aI., 2000). The choice between selecting a design that is cross-
sectional (where students are assessed at one point in time) or longitudinal (where
the same students are evaluated at two or more points in time) depends on the
goals of the study. Most studies have a cross-sectional design in which data are
collected from students at certain age/grade levels. However, TIMSS provides an
example of a compromise situation. Ultimately, TIMSS's multi-grade design was
a compromise between a cross-sectional and longitudinal survey design, as it
included adjacent grades for the populations of primary and junior secondary
education. This allowed more information about achievement to be obtained
(across five grades) and an analysis of variation in cumulative schooling effects.
Both PISA and TIMSS gather data in successive waves with cross-sectional
samples at the same age/grade levels several years apart to examine trends, but
these are not longitudinal studies. TIMSS offered a further variant by organizing
a repeat (TIMSS-R) in which data were gathered for the lower secondary popula-
tion in the year in which the primary population in the original TIMSS had
reached that year of secondary schooling. Again the study was not genuinely
longitudinal in the sense of obtaining subsequent measures on the same students.
Because of their complexity, international comparative longitudinal studies are
rare. It is not only difficult to retain contact with students over time, but expected
high attrition rates have to be reflected in the design, and this adds to the cost
of the studies. One example, however, is the sub-study of the IEA Second
International Mathematics Study (SIMS), in which eight countries collected
achievement data at the beginning and the end of the school year in addition to
the main study, as well as data on classroom processes (Burstein, 1992).
A different kind of genuinely longitudinal study is being planned for PISA.
Some countries are considering making their PISA 2003 sample the beginning
sample for a study in which data on subsequent educational, labor market, and
other experiences will be obtained to permit analyses of their relationship with
earlier educational achievement levels.
Given the type of research questions that were posed, most IEA studies included
curriculum analysis as part of the research design. In this, they differ from
SACMEQ and OECD studies.
The research design needs to allow for good estimates of the variables in ques-
tion, both for international comparisons and for national statistics. An important
decision in each study relates to the level at which analysis will be focused. If the
focus of interest is the school, then it is appropriate to select only one class per
school at the appropriate grade level, or a random sample of students. However,
if the research questions also relate to class and teacher effects, then more than
one class will have to be sampled within each school.
Target Population
In the past decade, considerable attention has been paid to the issue of sampling,
which critiques of earlier international studies had highlighted as a problem.
International comparative studies need good-quality, representative samples.
These cannot be obtained without careful planning and execution, following
established procedures such as those developed by IEA (IEA, 2002; Martin
et al., 1999) or in the PISA technical standards (OECD, 2002). These procedures
consist of rules concerning the selection of the samples and the evaluation of
their technical precision. For example, IEA standards require (among other things)
that the target population(s) have comparable definitions across countries, that
they are defined operationally, and that any discrepancy between nationally
desired and defined populations is clearly described (Martin et al., 1999). In each
participating country, basic requirements have to be met in a number of areas,
such as specification of domains and strata, sampling error requirements, size of
sample, preparation of the sampling frame, mechanical selection procedures, sampling
weights and sampling errors, and response or participation rates (see also
Postlethwaite, 1999).
All samples in large-scale studies involve two or more stages. In a two-stage
sample, the first stage consists of drawing a sample of schools with probability
proportional to their size. At the second stage, students within the school are
drawn (either by selecting one or more classes at random or by selecting students
at random across all classes). In a three-stage sampling design, the first stage
would be the selection of regions, followed by the selection of schools, and then
of students within schools.
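A minimal sketch of the two-stage design just described, using systematic probability-proportional-to-size (PPS) selection at the first stage. The school frame is invented for illustration, and real studies layer stratification and weighting on top of this.

    import random

    def pps_sample_schools(frame, n_schools, rng):
        """frame: list of (school_id, enrolment) pairs.
        Systematic PPS; very large schools can be drawn more than once."""
        total = sum(size for _, size in frame)
        interval = total / n_schools
        tick = rng.uniform(0, interval)
        chosen, cum = [], 0.0
        for school_id, size in frame:
            cum += size
            while tick < cum:
                chosen.append(school_id)
                tick += interval
        return chosen

    def sample_students(roster, n_students, rng):
        """Second stage: simple random sample of students within a school."""
        return rng.sample(roster, min(n_students, len(roster)))

    rng = random.Random(42)
    frame = [(f"school_{i}", rng.randint(200, 2000)) for i in range(400)]
    schools = pps_sample_schools(frame, n_schools=150, rng=rng)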
The size of sample required is a function of a number of factors, such as the size
of sampling error that is acceptable, the degree of homogeneity or heterogeneity
of students attending schools, and the levels of analysis to be carried out.
Depending on the research questions, researchers may choose one or two classes
per school or a number of students drawn at random from all classes at a partic-
ular grade level. IEA usually recommends 150 classrooms. The simple equivalent
sample size (i.e., if students had been sampled randomly from the total popu-
lation rather than from schools) should not be less than 400 students (Beaton
et aI., 2000; Postlethwaite, 1999). PISA requires a minimum national sample of
150 schools and a random sample of 35 15-year-olds from within each school, or
all where there are fewer than 35. Replacement schools are also drawn in a
sample to enable schools to be substituted in the event that a school in the
primary sample declines to participate. In TIMSS, two replacement schools were
drawn for each school in the primary sample.
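The "simple equivalent sample size" mentioned above can be checked with the standard design-effect formula, deff = 1 + (m - 1) * rho, for cluster size m and intraclass correlation rho. A back-of-the-envelope sketch follows; the value of rho is illustrative, not taken from any particular study.

    def effective_sample_size(n_schools, students_per_school, rho):
        """Clustered n divided by the design effect 1 + (m - 1) * rho."""
        n = n_schools * students_per_school
        deff = 1 + (students_per_school - 1) * rho
        return n / deff

    # The PISA minimum of 150 schools x 35 students, with an assumed
    # intraclass correlation of 0.20:
    print(effective_sample_size(150, 35, rho=0.20))  # about 673, above 400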
The response rate of the sample is an important quality consideration since
non-response can introduce bias. Postlethwaite (1999) says that the minimum
acceptable response rate should be 90 percent of students in 90 percent of
schools in the sample (including replacement schools where they have been
chosen). IEA uses 85 percent after use of replacement schools. PISA requires 85
percent if there is no use of replacement schools, but a progressively higher figure
as the proportion of schools drawn from the reserve sample rises. If the response
rate from the primary sample drops to 65 percent, the absolute minimum
acceptable in PISA, then the required minimum response rate after replacement
is 95 percent.
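One way to encode the PISA-style sliding scale described above is to interpolate between the two anchor points the text supplies (85 percent before replacement requires nothing further; 65 percent before replacement requires 95 percent after replacement). The linear interpolation is an assumption of this sketch; the actual PISA standard is more detailed.

    def required_rate_after_replacement(rate_before):
        """rate_before: school response rate (%) from the primary sample.
        Returns the minimum acceptable rate after replacement, or None
        if the country falls below the absolute minimum."""
        if rate_before >= 85:
            return rate_before  # acceptable without replacements
        if rate_before < 65:
            return None         # below the absolute PISA minimum
        # linear interpolation between (85, 85) and (65, 95), an assumption
        return 85 + (85 - rate_before) * (95 - 85) / (85 - 65)

    print(required_rate_after_replacement(75))  # 90.0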
It has become common practice in international comparative studies to "flag"
countries that do not meet sampling criteria or even to drop them from the study.
In the IEA studies, the results for countries that failed to meet minimum criteria
are reported but the countries are identified. In PISA, countries that fail to satisfy
sampling criteria are excluded from the report unless they fall only marginally
short and can provide unequivocal quantitative evidence from other sources that
non-responding schools do not differ systematically from responding schools,
thus allaying fears of response bias in their results.
Instrument Development
Analyses of data can follow simple or complex procedures. The first interna-
tional and national reports of a study are usually descriptive, and provide
policymakers and educational practitioners with the most relevant outcomes of
the study. Such reports have to be clearly written and address the research ques-
tions stated at the beginning of the study. It is necessary to reach decisions at an
early stage about issues such as units of analysis and the type of analyses to be
employed.
After the first report, technical reports covering topics such as the psycho-
metric properties of instruments and the characteristics of the achieved samples
are usually prepared. Additional analytic reports explore relationships in the
data sets in more detail than can be achieved in the time available for the
preparation of the first report. Analyses often use more advanced techniques,
such as multilevel analysis (to examine school-level, classroom-level, and student-
level variations) and path analysis to test models explaining some of the complex
relationships between achievement and its determinants (Lietz, 1996; Vari,
1997). Additional reports can also include more detailed analyses of the errors
that students made in incorrect responses to provide teachers and other educa-
tional practitioners with insights into the nature of students' misconceptions.
Secondary analyses aimed at understanding differences between and within
educational systems are also often carried out.
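As a minimal illustration of the multilevel approach, the sketch below (Python, statsmodels; the data file and columns are hypothetical) fits an "empty" two-level model and reports how much of the score variance lies between schools rather than between students, the usual first step before school- and classroom-level predictors are added.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("achievement.csv")  # hypothetical file

    # Intercept-only model with a random effect for school.
    null_model = smf.mixedlm("score ~ 1", data=df, groups=df["school"]).fit()

    between = float(null_model.cov_re.iloc[0, 0])  # between-school variance
    within = null_model.scale                      # student-level variance
    icc = between / (between + within)
    print(f"share of variance between schools: {icc:.2%}")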
Managing and Implementing Studies
The high stakes and the complexity and costs of international comparative achieve-
ment studies make management a fundamental issue. IEA even developed
a number of standards for managing and implementing studies (Martin et al.,
1999), which are also used in the SACMEQ studies. As well as the design aspects
discussed in the previous section, other important components are the selection
of an international coordinating center, detailed planning, the development of a
quality assurance programme, and technical reports and documentation. Similar
issues have to be addressed at national level.
Leadership
Detailed Planning
Quality Assurance
So that studies can be evaluated and replicated, the study design, instrument
development, data collection, and analysis and reporting procedures should be
described in technical reports and documentation (see Martin & Kelly, 1996;
Martin et al., 1999).
CONCLUSION
Utilization and Impact
It is important that designers, researchers, funders, and policymakers who initiate
and conduct international comparative studies should reflect on and assess the
extent to which data that are generated and analyses that are undertaken are
utilized. They should also monitor the impact of studies on educational systems.
Clear examples of significant impact are available. The results of studies, as
well as the more general OECD indicators, have been used in many publications
and discussions about the functioning and possible improvement of educational
systems. In some cases (e.g., in Australia, Hungary, Ireland, Japan, New Zealand,
U.S.A.), specific curriculum changes have been attributed to lEA findings
(Keeves, 1995; Kellaghan, 1996). The findings of SACMEQ have also been used
in educational policy making and reform (Ross et aI., 2000). Jenkins (2000,
p. 140) cites (amongst others) Hussein (1992, p. 468) who in addressing the
question "what does Kuwait want to learn from TIMSS?" responded that he
wanted to know whether students in Kuwait are "taught the same curricula as the
students in other countries," and whether they "learn mathematics to a level that
could be considered of a reasonably high international standard." Similarly,
Howie (2000) concludes that the TIMSS results for South Africa not only focused
the country's attention on the poor overall performance of South Africa's
students, but also identified several other key issues, among them that the Grade
7 and Grade 8 science and mathematics curricula were very different from those
of other countries.
Since policy decisions are not normally documented or published, the direct
impact of international comparative achievement studies may not be clearly
visible (Kellaghan, 1996). This will especially be so if findings serve an "enlight-
enment" function in discussions about the quality of education. If this is so,
international studies will not "necessarily mean supplying direct answers to
questions, but rather in enabling planning and decision making to be better
informed" (Howson, 1999, p. 166).
International comparative achievement studies can be utilized to place a
country's educational practices in an international comparative perspective. One
important possibility, often overlooked until now, is to link national and international
assessments. Linking would not only increase the benefits a country can
obtain from investments in assessment studies, but also the cost-effectiveness of
the operation. The state of Minnesota in the United States provides an interesting
example of the use of international data to make comparisons between the
performance of students in the state and that of students in the country as a
whole, and that of students in other countries (based on their performance on
TIMSS) (SciMathMN, 1998).
Final Word
APPENDIX:
REFERENCES
Beaton, A.E. (2000). The importance of item response theory (IRT) for large-scale assessments. In
S. Carey (Ed.), Measuring adult literacy: The International Adult Literacy Survey (IALS) in the
European context (pp. 26-33). London: Office for National Statistics.
Beaton, A.E., Martin, M.O., Mullis, I.V.S., Gonzalez, E.J., Smith, T.A., & Kelly, D.L. (1996). Science
achievement in the middle school years. Chestnut Hill, MA: TIMSS International Study Center,
Lynch School of Education, Boston College.
Beaton, A.E., Mullis, I.V.S., Martin, M.O., Gonzalez, E.J., Kelly, D.L., & Smith, T.A. (1996).
Mathematics achievement in the middle school years. Chestnut Hill, MA: TIMSS International
Study Center, Lynch School of Education, Boston College.
Beaton, A.E., Postlethwaite, T.N., Ross, K.N., Spearritt, D., & Wolf, R.M. (2000). The benefits and
limitations of international educational achievement studies. Paris: International Institute for
Educational Planning/International Academy of Education.
Bottani, N., & Tuijnman, A.C. (1994). The design of indicator systems. In A.C. Tuijnman, & T.N.
Postlethwaite (Eds.), Monitoring the standards of education (pp. 47-78). Oxford: Pergamon.
Bracey, G.W. (1997). On comparing the incomparable: A response to Baker and Stedman.
Educational Researcher, 26(3), 19-25.
Burstein, L. (Ed.). (1992). The IEA study of mathematics III: Student growth and classroom processes.
Oxford: Pergamon.
Goldstein, H. (2000). IALS - A commentary on the scaling and data analysis. In S. Carey (Ed.),
Measuring adult literacy: The International Adult Literacy Survey (IALS) in the European context
(pp. 34-42). London: Office for National Statistics.
Harmon, M., Smith, T.A., Martin, M.O., Kelly, D.L., Beaton, A.E., Mullis, I.V.S., Gonzalez, E.J., &
Orpwood, G. (1997). Performance assessment in IEA's Third International Mathematics and Science
Study. Chestnut Hill, MA: International Study Center, Boston College.
Howie, S.J. (2000). TIMSS in South Africa: The value of international comparative studies for a
developing country. In D. Shorrocks-Taylor, & E.W. Jenkins (Eds.), Learning from others
(pp. 279-301). Boston: Kluwer Academic.
Howson, G. (1999). The value of comparative studies. In G. Kaiser, E. Luna, & I. Huntley (Eds.),
International comparisons in mathematics education (pp. 165-181). London: Falmer.
Husen, T. (Ed.). (1967). International study of achievement in mathematics: A comparison of twelve
countries (2 vols). Stockholm: Almqvist & Wiksell/New York: Wiley.
Husen, T., & Postlethwaite, T.N. (Eds.). (1994). The international encyclopedia of education. Oxford:
Pergamon.
Husen, T., & Tuijnman, A.C. (1994). Monitoring standards in education: Why and how it came about.
In A.C. Tuijnman, & T.N. Postlethwaite (Eds.), Monitoring the standards of education (pp. 1-22).
Oxford: Pergamon.
Hussein, M.G. (1992). What does Kuwait want to learn from the Third International Mathematics
and Science Study (TIMSS)? Prospects, 22, 463-468.
IEA (International Association for the Evaluation of Educational Achievement). (1998). IEA
guidebook 1998: Activities, institutions and people. Amsterdam: Author.
International Study Center. (2002). Retrieved from http://timss.bc.edu/. Chestnut Hill, MA:
International Study Center, Lynch School of Education, Boston College.
Jenkins, E.W. (2000). Making use of international comparisons of student achievement in science and
mathematics. In D. Shorrocks-Taylor, & E.W. Jenkins (Eds.), Learning from others (pp. 137-158).
Boston: Kluwer Academic.
Keeves, J.P. (1992). The IEA technical handbook. Amsterdam: IEA Secretariat.
Keeves, J.P. (1995). The world of school learning: Selected key findings from 35 years of IEA research.
Amsterdam: IEA Secretariat.
Keeves, J.P. (Ed.). (1996). The international encyclopedia of evaluation. Oxford: Pergamon.
Keitel, C., & Kilpatrick, J. (1999). The rationality and irrationality of international comparative
studies. In G. Kaiser, E. Luna, & I. Huntley (Eds.), International comparisons in mathematics
education (pp. 241-256). London: Falmer.
Kellaghan, T. (1996). IEA studies and educational policy. Assessment in Education, 3, 143-160.
Kellaghan, T., & Grisay, A. (1995). International comparisons of student achievement: Problems and
prospects. In Measuring what students learn (pp. 41-61). Paris: Organisation for Economic Co-
operation and Development.
Lietz, P. (1996). Changes in reading comprehension across cultures and over time. New York: Waxmann.
Loxley, W. (1992). Managing survey research. Prospects, 22, 289-296.
Martin, M.O., & Kelly, D. (1996). TIMSS technical report. Volume I: Design and development.
Chestnut Hill, MA: International Study Center, Lynch School of Education, Boston College.
Martin, M.O., Mullis, I.V.S., Beaton, A.E., Gonzalez, E.J., Smith, T.A., & Kelly, D.L. (1997). Science
in the primary school years. Chestnut Hill, MA: International Study Center, Lynch School of
Education, Boston College.
Martin, M.O., Mullis, I.V.S., Gonzalez, E.J., Gregory, K.D., Smith, T.A., Chrostowski, S.J., Garden,
R.A., & O'Connor, K.M. (2000). TIMSS 1999 international science report. Chestnut Hill, MA:
International Study Center, Lynch School of Education, Boston College.
Martin, M.O., Rust, K., & Adams, R. (Eds.). (1999). Technical standards for IEA studies. Amsterdam:
IEA Secretariat.
Mullis, I.V.S., Martin, M.O., Beaton, A.E., Gonzalez, E.J., Kelly, D.L., & Smith, T.A. (1997).
Mathematics in the primary school years. Chestnut Hill, MA: International Study Center, Lynch
School of Education, Boston College.
Mullis, I.V.S., Martin, M.O., Beaton, A.E., Gonzalez, E.J., Kelly, D.L., & Smith, T.A. (1998).
Mathematics and science achievement in the final year of secondary school. Chestnut Hill, MA:
International Study Center, Lynch School of Education, Boston College.
Mullis, I.V.S., Martin, M.O., Gonzalez, E.J., Gregory, K.D., Garden, R.A., O'Connor, K.M.,
Chrostowski, S.J., & Smith, T.A. (2000). TIMSS 1999 international mathematics report. Chestnut
Hill, MA: International Study Center, Lynch School of Education, Boston College.
OECD (Organisation for Economic Co-operation and Development). (1992, 1993, 1995, 1996, 1997,
1998, 2000). Education at a glance. Paris: Author.
OECD. (2001). Knowledge and skills for life: First results from PISA 2000. Paris: Author.
OECD. (2002). The OECD Programme for International Student Assessment. Retrieved from
http://www.pisa.oecd.org/.
Plomp, T. (1998). The potential of international comparative studies to monitor quality of education.
Prospects, 27, 45-60.
Postlethwaite, T.N. (1994). Monitoring and evaluation in different education systems. In A.C. Tuijnman,
& T.N. Postlethwaite (Eds.), Monitoring the standards of education (pp. 23-46). Oxford: Pergamon.
Postlethwaite, T.N. (1999). International studies of educational achievement: Methodological issues.
Hong Kong: Comparative Education Research Centre, University of Hong Kong.
Postlethwaite, T.N., & Ross, K.N. (1992). Effective schools in reading: Implications for planners.
Amsterdam: IEA Secretariat.
Robitaille, D.F., Beaton, A.E., & Plomp, T. (2000). The impact of TIMSS on the teaching & learning of
mathematics & science. Vancouver: Pacific Educational Press.
Robitaille, D.F., & Garden, R.A. (Eds.). (1996). Research questions and study design. Vancouver: Pacific
Educational Press.
Ross, K.N., et al. (2000). Translating educational assessment findings into educational policy and reform
measures: Lessons from the SACMEQ initiative in Africa. Paper presented at the World Education
Forum, Dakar, Senegal.
Schleicher, A. (2000). Monitoring student knowledge and skills: The OECD Programme for
International Student Assessment. In D. Shorrocks-Taylor, & E.W. Jenkins (Eds.), Learning from
others (pp. 63-78). Boston: Kluwer Academic.
SciMathMN. (1998). Minnesota 4th grade TIMSS results. Retrieved from http://www.scimathmn.org/
timss~r4Jpt.htm.
Shavelson, R.J., McDonnell, L.M., & Oakes, J. (1989). Indicators for monitoring mathematics and
science education: A sourcebook. Santa Monica, CA: RAND Corporation.
Shorrocks-Taylor, D., & Jenkins, E.W. (Eds.). (2000). Learning from others. Boston: Kluwer Academic.
Travers, K.J., & Westbury, I. (1989). The IEA study of mathematics I: International analysis of mathematics
curricula. Oxford: Pergamon.
U.S. National Commission on Excellence in Education. (1983). A nation at risk. Washington, DC: U.S.
Government Printing Office.
Vari, P. (Ed.). (1997). Are we similar in math and science? A study of grade 8 in nine central and eastern
European countries. Amsterdam: IEA Secretariat.
Weiss, C.H. (1981). Measuring the use of evaluation. In J.A. Ciarlo (Ed.), Utilizing evaluation:
Concepts and measurement techniques (pp. 17-33). Newbury Park, CA: Sage.
Westat. (1991). SEDCAR (Standards for Education Data Collection and Reporting). Washington, DC:
U.S. Department of Education.
43
Cross-National Curriculum Evaluation
WILLIAM H. SCHMIDT
Michigan State University, MI, USA
RICHARD T. HOUANG
Michigan State University, MI, USA
This paper is concerned with the measurement of curriculum and its role in
cross-national curriculum evaluation studies. The methods developed and the
data used to illustrate the procedures are taken from the Third International
Mathematics and Science Study (TIMSS).
Before turning to the measurement of curriculum, we examine three types of
cross-national curriculum evaluation studies differentiated by the role that
curriculum plays in each: the object of the evaluation, the criterion measure for
the evaluation, and the context in which to interpret an educational evaluation.
These three have their counterparts in within-country educational evaluation
studies in the United States. The first is the most common. We bring the three
together in a conceptual framework for international research and argue the
relevance of curriculum measurement to all three.
The most traditional form of curriculum evaluation is a study in which the cur-
riculum is itself the object of evaluation. In the language of statistics, curriculum
is the design variable. Traditionally, such studies are aimed at determining if a
curriculum has an effect or if one curriculum is "better" than another. To answer
the questions usually requires the specification of criteria against which the
curricula are to be evaluated. Criteria are most often defined in terms of learning
outcomes, but can also include other kinds of outcomes such as attitudes or
behavioral changes. Based on the criteria, instruments such as achievement tests,
questionnaires, or attitude inventories are developed and used as the dependent
variables in the evaluation study.
In international studies, the comparison is often between the curriculum of
one country and that of another. For example, the question might be posed: is
one country's curriculum more effective than another's?
CURRICULUM AS OUTCOME
The second role that curriculum can play in a cross-national evaluation study is
as the outcome or criterion (the dependent variable of the study). In this case,
the object of the evaluation is more general than the curriculum, most typically
the educational system as a whole. The central focus is to determine how educational
systems differ with respect to the curriculum that they provide. The
curriculum can be thought of as an indication of opportunity to learn (OTL). The
use of the concept of OTL as a determinant of learning can be traced back to the
First International Mathematics Study (FIMS) (Husen, 1967). Peaker (1975) was
the first to incorporate OTL in modeling student achievement in an IEA cross-
national study, and since then it has become an integral part of most IEA studies.
Curriculum analysis was one of the major purposes of TIMSS, which was
designed to explore differences between countries in terms of their definition of
school mathematics and science. Two reports provide detailed results pertaining
to how the curricula of some 50 countries differ (Schmidt, McKnight, Valverde,
Houang, & Wiley, 1997; Schmidt, Raizen, Britton, Bianchi, & Wolfe, 1997).
In this type of evaluation study, the assumption is that the curriculum oppor-
tunities (OTL) that are provided are important products of the educational
system, and that such opportunities are related to the amount of learning that
takes place. The relationship has been examined in numerous studies (see
Floden, 2000). Since the choice of how to spend limited resources, such as the
total number of hours over a school year, becomes an important policy option for
the educational system, consequences of the choices made with respect to such
policies are reasonable objects of evaluation.
CURRICULUM AS CONTEXT
The third type of curriculum evaluation in a sense falls between the first two types.
In this case, curriculum is neither the object of the evaluation nor the outcome,
but represents an intermediate variable, which helps to contextualize the evalua-
tion. The objects of the evaluation are, as in the second case, educational systems,
and the criteria against which the systems are to be evaluated are desired outcomes,
such as learning. The curriculum is considered an intermediate or moderating
variable, through which opportunities to learn are provided to students. In the
language of statistics, curriculum becomes a covariate. Its use in this sense would
be especially appropriate in a study in which the focus of the evaluation is on
some aspect of the educational system other than the curriculum. In comparing
an educational system to other systems, this approach recognizes that curriculum
opportunities affect the amount of learning that takes place, and so should be
controlled for, if one is interested in the role that other systemic features play.
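Schematically, treating curriculum as a covariate amounts to comparing the coefficient of the systemic feature of interest before and after OTL is entered into the model. The sketch below uses invented country-level variable names and ordinary least squares purely for illustration; it is not the analysis conducted in TIMSS.

    import pandas as pd
    import statsmodels.formula.api as smf

    countries = pd.read_csv("country_indicators.csv")  # hypothetical file

    # Naive comparison: the systemic feature alone.
    naive = smf.ols("achievement ~ instruction_hours", data=countries).fit()

    # Contextualized comparison: the same feature with OTL held constant.
    adjusted = smf.ols(
        "achievement ~ instruction_hours + otl_coverage", data=countries
    ).fit()

    print(naive.params["instruction_hours"],
          adjusted.params["instruction_hours"])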
From a cross-national point of view, this is an especially important type of
curriculum evaluation. The results of international comparisons permit the
rankings of countries, which may be viewed and reported as an indication of the
effectiveness of countries' educational systems. In order to make such statements
we argue that the OTL provided by each educational system must be taken into
account as a context in which to interpret the cross-national differences. Simply
stated, if the achievement test focuses only on a limited aspect of mathematics,
which is not taught in one country but is taught in another, then to interpret the
higher-achieving country as having a better educational system is certainly naive,
if not dangerous, without taking into account the fact that the achievement test
may simply happen to measure the aspect of mathematics which one country
covers and the other does not. Thus, the presence of curriculum as a contextu-
alizing variable is absolutely critical for interpreting any country differences that
do not themselves center on the curriculum.
The third type of study includes an examination of the claim that the test is not
a fair basis upon which to compare educational systems or countries. This was an
issue in TIMSS where some countries claimed that the test did not reflect their
curriculum, and hence was unfair when used as a basis for cross-country
comparisons. The design principle employed in developing the TIMSS test was
to make the test "equally unfair" to all countries. The kind of data that would be
collected as a part of this type of curriculum evaluation study could be used to
examine this issue.
BACKGROUND ON TIMSS
Given the cross-national focus of this paper, we illustrate the different types of
curriculum evaluation studies using curriculum data gathered as a part of
TIMSS. In this section we briefly describe TIMSS. It is by far the most extensive
and far-reaching cross-national comparative study of education ever attempted,
and involves a comparison of the official content standards, textbooks, teacher
practices, and student achievements of some 40 to 50 countries. (The actual
number of countries varies depending on the number participating in the partic-
ular comparison of interest.) More than 1,500 official content standards and
textbooks were analyzed. Many thousands of teachers, principals, and experts
responded to survey questionnaires, and over half a million students in some 40
countries were tested in mathematics and science.
Tests were administered to 9-year-olds, 13-year-olds, and students in the last
year of secondary school. The data focused on in this chapter are drawn from the
13-year-old population (Population 2 in TIMSS). For most countries, including
the United States, this population represented seventh and eighth grades.
Numerous reports have been released on the international curriculum analysis
(Schmidt, McKnight, & Raizen, 1997; Schmidt, Raizen et al., 1997; Schmidt,
McKnight, Valverde et al., 1997); on achievement comparisons (Beaton, Martin,
Gonzalez, Kelly, & Smith, 1996; Beaton, Martin, Mullis et al., 1996; Martin et
al., 1997; Mullis et al., 1997, 1998); and on the relationship of curriculum to
achievement (Schmidt, McKnight, Cogan, Jakwerth, & Houang, 1999; Schmidt
et al., 2001). The focus in the study was on the relationship between opportunity
to learn (the curriculum) and student learning (see Schmidt & McKnight, 1995).
What is Curriculum?
In this section, we describe how to measure each of the three aspects of cur-
riculum described above, taking into account the common definition of the
subject matter area as reflected in the frameworks. The TIMSS study developed
a document-based methodology to allow for the analysis of the content standards
and textbooks according to the mathematics framework.
This methodology has been described in great detail elsewhere (Schmidt,
McKnight, & Raizen, 1997; Schmidt, Raizen et aI., 1997). The basic approach
worked fairly well and exhibited the desirable characteristics of validity and
reliability (see Appendix E: Document Analysis Methods in Schmidt, McKnight,
Valverde et al., 1997). The amount of training and quality control involved,
however, was extensive, and whether reliable methods that are less labor-intensive can be developed needs further study.
The basic methodological approach involved subdividing the content standards or textbooks into smaller units, termed blocks, which can be described as relatively homogeneous segments of the curriculum document.
Each segment was then given a signature, which is represented by a series of
codes on each of the two aspects of subject matter: topic (content) and perfor-
mance expectation. (In TIMSS, perspective was also coded, but is not considered
here.)
As already suggested, these signatures, although often represented by a single
topic code and a single performance expectation code, could be more complex,
and reflect several topics and/or several performance expectations. Coding was
completed for all blocks within a curriculum document. The quantification of the
qualitative specifications was accomplished by aggregating similarly coded blocks
over the document to represent in relatively simple terms the proportion of a
book or a content document that focused on a particular topic. The subtleties
involved in this process, including the rules for segmenting or blocking the docu-
ment, and the methodological issues are discussed in the above-cited references
and need not concern us here. The major point to be made is that some methodology, such as the one used in TIMSS, seems promising as a way of turning qualitative information about curriculum into quantities that can be used analytically as part of a curriculum evaluation study.
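To make this quantification step concrete, the following sketch aggregates hypothetical block signatures into the proportion of a document devoted to each topic. The block records, topic names, and the even-split convention for multi-topic blocks are illustrative assumptions, not the actual TIMSS segmentation rules (which are described in the cited appendices).

```python
from collections import defaultdict

def topic_proportions(blocks):
    """Aggregate coded blocks into the proportion of a document devoted
    to each topic. Each block is (size, topic_codes); a block whose
    signature carries several topics splits its space evenly among them
    (an illustrative convention, not the TIMSS rule)."""
    space = defaultdict(float)
    total = 0.0
    for size, topic_codes in blocks:
        total += size
        for code in topic_codes:
            space[code] += size / len(topic_codes)
    return {code: amount / total for code, amount in space.items()}

# A ten-page "textbook": 3 pages on fractions, 2 pages spanning fractions
# and equations, 5 pages on geometry.
blocks = [(3, ["fractions"]), (2, ["fractions", "equations"]), (5, ["geometry"])]
print(topic_proportions(blocks))  # {'fractions': 0.4, 'equations': 0.1, 'geometry': 0.5}
```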
In mathematics and science education, subject-matter specialists emphasized the performance expectation attached to the coverage of various topics, considering it as important as coverage itself when characterizing a curriculum. For example, it is not enough to know whether a topic in mathematics was covered; one also needs to know whether it was covered with the expectation that students would go beyond merely knowing and being proficient in the use of
algorithms to being able to solve mathematical problems and engage in mathe-
matical reasoning. Because the frameworks allow for such coding of textbooks and content standards, they permit characterization of the percentage of a document that was focused on more complex cognitive demands, as well as of the content that was covered.
The measurement of the teacher implementation variable is much more straightforward. If the topics listed in the questionnaire on which teachers report their coverage over the school year are those specified in the framework, or some aggregation of those topics, teachers simply indicate the number of periods they spent covering a given topic. That count is directly quantifiable and can be used to characterize the teacher-implemented curriculum.
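A minimal sketch of that step follows; the topics and period counts are invented for illustration.

```python
def time_proportions(periods_by_topic):
    """Convert a teacher's reported periods per framework topic into the
    proportion of instructional time allocated to each topic."""
    total = sum(periods_by_topic.values())
    return {topic: n / total for topic, n in periods_by_topic.items()}

# Invented report: 40 periods in all.
print(time_proportions({"fractions": 20, "equations": 15, "congruence": 5}))
# {'fractions': 0.5, 'equations': 0.375, 'congruence': 0.125}
```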
A thorough curriculum evaluation study of any of the three types of curricul-
um evaluation discussed above would demand attention to each of the following
issues. First, it is necessary to determine which aspect(s) of curriculum is critical
to the study, and then to define the means by which it will be measured. Second,
the measures should be based on a common definition or framework. Third, the methodology employed should recognize the qualitative nature of curriculum, but in a way that allows for quantification for statistical analysis.
The IEA concept of the attained curriculum is what most would simply refer to as achievement or the outcome measure. There is a long-standing tradition of measuring achievement, which need not be considered here. What will be considered are several issues that are critical to the first and third types of curriculum evaluation study, in which the outcome is achievement.
One issue that arises is whether achievement is to be defined in terms of the
status of the individual at a given point in time or in terms of what has been
learned over a period of time such as during eighth grade. Status is a measure of
what students know at a given point in time and represents an accumulation of
all that has been learned up to that point. Given that the typical focus of a cross-
national curriculum evaluation study is on the effect on learning of a curriculum
associated with a particular grade, status is not the appropriate measure in most
cases. If the curriculum evaluation is designed to determine the effectiveness of
different educational systems up to some grade level, such as grades one through
eight, then a status measure collected at the end of grade eight might be appropriate.
However, in most cases the focus is on particular curricular experiences gained
as part of a specified school year. Under these circumstances, status is not
the appropriate measure as it reflects experiences gained outside the specified
year's curriculum. Therefore, achievement gain, which takes into account the achievement level before and after the curricular experience, would be more appropriate.
In TIMSS it was not possible to obtain achievement data on students at both
the beginning and end of the school year. However, the design incorporated the
two adjacent grades containing 13-year-olds (in most countries, seventh and
eighth grades). Making certain assumptions, the seventh grade, at least in terms
of country-level differences, could be used as a surrogate to reflect students'
achievements going into the eighth grade. Thus, it was possible to report results
pertaining to the first type of curriculum evaluation study dealing with
achievement gains and not status (Schmidt et al., 1999; Schmidt et al., 2001).
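The surrogate-pretest logic can be sketched as follows; the country names and scale scores are invented, and the simple subtraction stands in for the more careful country-level modeling in the cited analyses.

```python
def country_gains(grade7_means, grade8_means):
    """Estimate country-level gain as the eighth-grade mean minus the
    adjacent seventh-grade mean (the surrogate pretest)."""
    return {c: grade8_means[c] - grade7_means[c] for c in grade8_means}

grade7 = {"Country A": 480.0, "Country B": 512.0}
grade8 = {"Country A": 517.0, "Country B": 530.0}
print(country_gains(grade7, grade8))  # {'Country A': 37.0, 'Country B': 18.0}
```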
The second issue is whether the achievement measures have been designed to
be "curriculum sensitive," that is, if the nature of the test, as well as the scaling
employed, is sensitive to the curriculum experiences of students. Measures based
on a test that covers a broad spectrum of the topics within a discipline, such as
mathematics and the resulting scaling that combines all those items (either as a
simple total number correct or a scaled score based on some Item Response
Theory model), will reflect what is common across disparate items. When the
items are based on many different topics, what is common will essentially reflect
a general propensity or ability to do mathematics problems. Such a measure will
not be particularly sensitive to curriculum differences, and it is likely that it will
reflect other factors such as general ability, reading ability, social class, and general
motivational level. When this issue was examined using TIMSS data, country
rankings based on scaled scores were found to change very little even when it was
clear from other data that curriculum differences among countries were
pronounced (Schmidt, Jakwerth, & McKnight, 1998). When the measures were
more aligned with curricula and based on items that focused on a specific part of
mathematics rather than on mathematics as a whole, differences in country rank-
ings were quite evident. This led to the development of 20 subtest areas for
mathematics. This indicates that when scales used in curriculum evaluation studies
are designed to be specific to topic areas rather than generic (generalizing over
multiple topic areas within the discipline), they will be more sensitive to curricula.
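The contrast between a pooled total and topic-specific subtest scores can be sketched as follows; the item-to-topic assignments and the responses are invented for illustration.

```python
from collections import defaultdict

def subtest_scores(item_topics, responses):
    """Percent correct per topic area. `item_topics` maps item id to its
    framework topic; `responses` maps item id to 0/1 correctness."""
    correct, counts = defaultdict(int), defaultdict(int)
    for item, topic in item_topics.items():
        counts[topic] += 1
        correct[topic] += responses[item]
    return {t: 100.0 * correct[t] / counts[t] for t in counts}

item_topics = {1: "fractions", 2: "fractions", 3: "congruence", 4: "congruence"}
responses = {1: 1, 2: 1, 3: 0, 4: 1}
total = 100.0 * sum(responses.values()) / len(responses)  # pooled total: 75.0
print(total, subtest_scores(item_topics, responses))
# 75.0 {'fractions': 100.0, 'congruence': 50.0}
```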
When curriculum is the object of an evaluation, the major question is: is curricu-
lum related to achievement gain? The question of why one would need to measure
curriculum by the methods described in the previous section for this type of study
is important to address. Two answers arise, but for very different reasons.
In international studies, unlike many within-country curriculum evaluation
studies, no two countries have exactly the same curriculum. In such a situation,
even slight variations among countries would need to be taken into account in
examining the relationship between curriculum and achievement gain. Hence, in
any curriculum evaluation study of this type, the need to measure the curriculum,
either in terms of emphasis or in terms of degree of implementation, is indicated.
When such measures are used, an analysis of variance model with curriculum
as the design variable may not be the most appropriate. When quantitative
measures of the degree of implementation of the curriculum or of the degree of
emphasis allocated to different topics are available for each country, the resulting
data could be used in a regression model with achievement gain as the depend-
ent variable and curriculum as the quantitative independent variable(s). This, in
fact, was done with TIMSS data, for which a series of regression analyses were
carried out for 20 sub-areas of mathematics in which all three aspects (intended,
potentially implemented, and implemented) of curriculum were used as inde-
pendent variables (Schmidt et al., 2001).
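A minimal sketch of one such regression for a single topic area follows; the country-level proportions standing in for the three curriculum aspects, and the gain scores, are invented rather than drawn from TIMSS.

```python
import numpy as np

# Rows = countries; columns = intended (content-standard emphasis),
# potentially implemented (textbook space), implemented (teacher time),
# each expressed as a proportion for one topic area.
X = np.array([
    [0.10, 0.12, 0.15],
    [0.05, 0.06, 0.04],
    [0.20, 0.18, 0.22],
    [0.08, 0.10, 0.09],
    [0.15, 0.14, 0.17],
])
gain = np.array([25.0, 10.0, 42.0, 18.0, 33.0])   # achievement gain per country

X1 = np.column_stack([np.ones(len(gain)), X])      # add an intercept
coef, *_ = np.linalg.lstsq(X1, gain, rcond=None)   # OLS fit
print(dict(zip(["intercept", "intended", "textbook", "teacher"], coef)))
```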
This approach may be illustrated using three variables. Table 1 presents the
results for four subtest areas: measurement units, common fractions, congruence
and similarity, and equations and formulas. Do the results indicate that curricu-
lum impact varies by topic? In one case, measurement units, the answer appears
to be no: curriculum differences between countries do not seem to be related to
learning differences between countries. For the other three areas, the answer is yes, but the aspect of curriculum that is related to learning varies. In the case of
common fractions and equations and formulas, it is not the amount of space in
textbooks, nor the amount of time spent by teachers on the subject, that makes
a difference. Rather, it is the percentage of textbook space allocated to the more complex cognitive demands associated with a topic.
In the case of congruence and similarity, measures of the amount of textbook
space and the amount of time teachers allocate to the topic are related to the
amount of learning. These results suggest that the answer to the question of whether curriculum is related to learning is complex; it depends on the topic area and the aspect of the curriculum. They also suggest that, even within countries, when various
curriculum approaches are being evaluated, the answers may be much more
complex than has traditionally been thought to be the case. Cross-national results
clearly support the need for achievement measures that reflect small areas of the
curriculum rather than a general overall total score. Analyses using a total score
often result in the conclusion that curriculum really does not matter. However,
it is clear that when topics are defined at a more specific level, curriculum does
have an impact on student learning.
[Several pages of table data, including the Table 1 results discussed above, could not be recovered from the source. The one legible caption reads: Table 3. Masked Median Polish Unique Effect of Textbook Space Devoted to TIMSS Mathematics Framework Topics (Magnitude of 10% or Greater); its rows were the participating countries.]
The results of analyses of other aspects of the curriculum are provided in Schmidt et al. (2001).
The results of international studies are often used to conclude that some educa-
tional systems are better than others. Policy decisions may follow. For example,
on the basis of the findings of the Second International Mathematics Study,
teachers in the United States were exhorted in the 1980s and 1990s to model
their teaching on Japanese practice. Some now suggest that, since Singapore performed so well in TIMSS, it is the country to emulate.
The use of the results of an international assessment in this way brings us to
the third type of curriculum evaluation study. Is it a reasonable conclusion that
the educational system in Singapore at the eighth grade is better than those of
other countries that participated in TIMSS? The answer can only begin to be
addressed (putting aside the folly of even asking it, or of reaching such conclusions on the basis of this kind of cross-national study) by taking into account
the curriculum of that country compared to the curricula of other countries. To
accomplish this, not only would achievement have to be measured in the way
described previously but curriculum measures would also have to be obtained so
that they could be entered into analyses as control or conditioning variables.
Since we do not believe this to be an appropriate use of international data, we
will not illustrate the kind of analysis that would use opportunity to learn as a
covariate in examining the relative merits of educational systems.
However, it would be reasonable to ask whether the TIMSS test was equally
fair or unfair to countries. To explore this issue, each country could be repre-
sented on each aspect of the curriculum as a point in a 44-dimensional space,
each dimension representing one of the 44 topics of the mathematics framework.
The TIMSS test could similarly be represented. The distance between those
points within the 44-dimensional space would provide an indication of fairness.
This makes the assumption that the space is Euclidean, which was accepted in
the absence of evidence to the contrary. Using the standard definition of distance
within a Euclidean space, the distances between the TIMSS test and each of the countries on each of the various aspects were tabulated (Table 4). It is clear from
the data that the TIMSS test was not equally unfair to all countries and, in fact,
was much closer to the curriculum of some countries than to the curricula of
others.
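That check can be sketched as follows; for illustration, two topics stand in for the 44 framework topics, and the coverage profiles (e.g., percentage of test items versus percentage of textbook space) are invented.

```python
import math

def euclidean_distance(profile_a, profile_b, topics):
    """Standard Euclidean distance between two topic-coverage profiles."""
    return math.sqrt(sum((profile_a[t] - profile_b[t]) ** 2 for t in topics))

topics = ["fractions", "congruence"]
test_profile = {"fractions": 12.0, "congruence": 5.0}
country_profile = {"fractions": 9.0, "congruence": 9.0}
print(euclidean_distance(test_profile, country_profile, topics))  # 5.0
```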
Table 4. Alignment (Measured by Distance) between Coverage in Textbook and in TIMSS Eighth Grade Mathematics Test

Country                  Distance²    Distance
Australia                591          24
Austria                  1907         44
Belgium-Flemish          780          28
Belgium-French           1112         33
Bulgaria                 1806         42
Canada                   517          23
Colombia                 1136         34
Cyprus                   2060         45
Czech Republic           1885         43
Slovak Republic          1885         43
Denmark                  1014         32
France                   966          31
Germany                  1280         36
Greece                   882          30
Hong Kong SAR            837          29
Hungary                  1224         35
Iceland                  994          32
Iran, Islamic Rep. of    883          30
Ireland                  805          28
Israel                   2147         46
Japan                    1609         40
Korea                    1176         34
Netherlands              1247         35
New Zealand              683          26
Norway                   443          21
Portugal                 1293         36
Romania                  5548         74
Russian Federation       1657         41
Singapore                1178         34
South Africa             1238         35
Spain                    2123         46
Sweden                   853          29
Switzerland              608          25
Scotland                 1903         44
USA                      500          22
Slovenia                 3895         62
Average                  1407         36
Standard Deviation       975          11
Minimum                  443          21
25%ile                   849          29
Median                   1177         34
75%ile                   1826         43
Maximum                  5548         74

CONCLUSION

A common thread through each of the three types of curriculum evaluation study considered in this paper is that the curriculum needs to be measured in
quantifiable ways that allow for appropriate kinds of statistical analysis. This is
so whether curriculum is the object of an evaluation, the criterion for it, or the
context in which an evaluation is interpreted. Furthermore, on the basis of the
experience of TIMSS, it is reasonable to conclude that curriculum can be
measured both validly and reliably. This is not to say that the methods and the
procedures developed for TIMSS are perfect. However, they do serve to illustrate
that curriculum can be measured and portrayed as an interesting outcome in
itself, and in relation to achievement gain. A final point is that, given the diversity of curricula, some way of measuring them is an essential component of an
international study. The methodologies and issues discussed in this chapter
would also appear relevant when curricula are being evaluated in countries
which do not have a national curriculum.
REFERENCES

[The chapter's reference list could not be recovered from the source.]

LIST OF CONTRIBUTORS
Lisa M. Abrams, Boston College, Center for the Study of Testing, Evaluation
and Educational Policy, 323 Campion Hall, Chestnut Hill, MA 02467, USA.
E: abramsli@bc.edu.
Peter W. Airasian, Boston College, School of Education, Campion 336D,
Chestnut Hill, MA 02167, USA. E: airasian@bc.edu.
Marvin C. Alkin, University of California at Los Angeles, Department of
Education - Moore 230, 405 Hilgard Avenue, Los Angeles, CA 90024-1521,
USA. E: alkin@gseis.ucla.edu.
Ted O. Almaguer, Dallas Independent School District, 6741 Alexander Drive,
Dallas, TX 75214, USA.
H. S. Bhola, Indiana University (retired), 3537 East Nugget Canyon Place,
Tucson, AZ 85718, USA.
William Bickel, University of Pittsburgh, Learning Research and Development
Center, 740 LRDC Building, 3939 O'Hara Street, Pittsburgh, PA 15260, USA.
E: bickel@pitt.edu.
Robert F. Boruch, University of Pennsylvania, 6417 Wissahickon Avenue,
Philadelphia, PA 19119, USA. E: robertb@gse.upenn.edu.
Carl Candoli, University of Texas (retired), 8103 East Court, Austin, TX 78759-
8726, USA. E: candoli@mail.utexas.edu.
Naomi Chudowsky, National Research Council, 2101 Constitution Ave., NW
Washington, DC 20418, USA. E: nchudows@nas.edu.
Marguerite Clarke, Boston College, Lynch School of Education, CSTEEP 332G,
Chestnut Hill, MA 02467, USA. E: clarkemd@bc.edu.
J. Bradley Cousins, University of Ottawa, Faculty of Education, 145 Jean
Jacques Lussier, Ottawa, ON K1N 6N5, Canada. E: bcousins@uottawa.ca.
Lois-ellin Datta, Datta Analysis, P.O. Box 383768, Waikoloa, HI 96738, USA. E: datta
@kona.net.
E. Jane Davidson, Western Michigan University, The Evaluation Center,
Kalamazoo, MI 49008, USA. E: jane.davidson@wmich.edu.
Alice Dignard, Ministry of Natural Resources, Quebec, Direction de la
Planification Strategique, 5700, 4e Avenue Ouest, Bureau A-313, Charlesbourg, Quebec G1H 6S1, Canada. E: alice.dignard@mrn.gouv.qc.ca.
Elliot Eisner, Stanford University, School of Education, 485 Lasuen Mall,
Stanford, CA 94305-3096, USA. E: eisner@leland.stanford.edu.
Volume 1
International Handbook of Educational Leadership and Administration
Edited by Kenneth Leithwood, Judith Chapman, David Corson,
Philip Hallinger and Ann Hart
ISBN 0-7923-3530-9
Volume 2
International Handbook of Science Education
Edited by Barry J. Fraser and Kenneth G. Tobin
ISBN 0-7923-3531-7
Volume 3
International Handbook of Teachers and Teaching
Edited by Bruce J. Biddle, Thomas L. Good and Ivor L. Goodson
ISBN 0-7923-3532-5
Volume 4
International Handbook of Mathematics Education
Edited by Alan J. Bishop, Ken Clements, Christine Keitel, Jeremy Kilpatrick and
Colette Laborde
ISBN 0-7923-3533-3
Volume 5
International Handbook of Educational Change
Edited by Andy Hargreaves, Ann Lieberman, Michael Fullan and David Hopkins
ISBN 0-7923-3534-1
Volume 6
International Handbook of Lifelong Learning
Edited by David Aspin, Judith Chapman, Michael Hatton and Yukiko Sawano
ISBN 0-7923-6815-0
Volume 7
International Handbook of Research in Medical Education
Edited by Geoff R. Norman, Cees P.M. van der Vleuten and David I. Newble
ISBN 1-4020-0466-4
Volume 8
Second International Handbook of Educational Leadership and Administration
Edited by Kenneth Leithwood and Philip Hallinger
ISBN 1-4020-0690-X
Volume 9
International Handbook of Educational Evaluation
Edited by Thomas Kellaghan and Daniel L. Stufflebeam
ISBN 1-4020-0849-X