
Studies in Educational Evaluation, Vol. 20, pp. 147-166, 1994

Pergamon © 1994 Elsevier Science Ltd

Printed in Great Britain. All rights reserved
0191-491X/94 $24.00

0191-491X(94)E0009-5
EVALUATION AS A DISCIPLINE

Michael Scriven
Western Michigan University, Kalamazoo, MI, U.S.A. 1

Introduction

This essay provides some background, explanation, and justification of a theoretical
framework for all types of evaluation. This framework is useful for connecting and
underpinning the approaches of the other contributors in this special issue. Some of the
discussion provided to support the framework argues for benefits from using the
perspective generated by the framework for improving best practice in existing applied
fields within evaluation--for example, by looking for similarities between personnel
evaluation and institutional evaluation that may point the way towards solving a problem or
avoiding a mistake. But there is more to the framework than a perspective: there is also the
logic of evaluation itself--the 'core discipline' of evaluation--the fine structure of the
theory that ties the assortment of applied evaluation fields into a discipline. The core
subject pays off for practice in the applied fields not only by substantiating the proposed
overall perspective, but also by solving hitherto unsolved problems in the fields, and by
breaking new ground which uncovers fertile new areas for further work.
The core subject is a kind of central management system that knits the fields
together with an overview and a map of connections (some newly discovered and some
long recognized), develops concepts and language to deal with shared problems, addresses
threats to validity, converts solutions from one field for use by others, and moves the
frontiers of foundations research forward. The emergence of that core subject is essential
to the creation of a discipline because without it there is no entity, no entirety of which the
parts are part, no self-concept, no identity. The emergence of candidates for a core
discipline, and hence of the whole of evaluation as a discipline, is a very recent affair. The
treatment here begins with a review of the background leading to this event, and then goes
on to review the stages of early development that followed.
There are two ways to approach a historical review like the brief one here. One
might try to describe it in pluralistic, multi-perspectival terms, so that the reader can see
how the issues presented themselves to--and were resolved or abandoned by--various
parties to the process. That makes for better, fairer, history, but it also requires a longer
perspective than we can have at this point--and it calls for an author less committed to one
of the contrasting views of the nature of evaluation. The treatment here is partisan, and its
main intent is to convey the way that the historical and ideological issues now appear from
the point of view of an advocate--someone convinced that there is a true discipline of
evaluation, and that it has a certain shape, although it is still in its infancy. From that point
of view, the most interesting issue about the history of evaluation is why it took so long to
emerge.
To understand that delay, we must understand that the birth of a discipline of
evaluation was disadvantaged from well before the time of conception. It was an overdue
child of allegedly incompatible parents whose families opposed the marriage. And its dif-
ficult birth began a confused childhood exhibiting multiple personality disorder.
The discipline of evaluation was fathered by evaluation practice and born of
science. The father was of humble birth but ancient lineage. The mother's family was nou-
veau riche but regarded itself as aristocratic: the mother herself had little interest in meeting
the father and denied she could bear such a child. The father had for much longer acted as
if such a union was of little interest to him. But in the end, with considerable help from
marriage brokers and midwives, and notwithstanding the refusal of most of the mother's
family to accept the union, the match was made and the child was born. Its difficult
childhood, which we will shortly review, resulted from the mother's relapse to the values
of her prenuptial caste and the father's disinterest in the welfare of the child, whose
problem then became that of growing up with a coherent self-concept, and developing
enough autonomy to justify self-respect.
What this meant in practice was that most of the early evaluation theories turned out
to be attempts to replace evaluation by a more acceptable substitute, some persona already
made respectable by science. And most of these theories, including some not subject to the
first type of subversion, had severe constraints imposed on them from the beginning,
constraints as to the turf on which they were to be allowed to play. These constraints
served to keep them as far away from the matrilineal turf as possible.

The Nature of Evaluation

What we are talking about here is the general discipline of evaluation. It has many
specific and autonomously developed applied areas, of which it's convenient to refer to
half a dozen important ones as 'the Big Six'. These are program evaluation, personnel
evaluation, performance evaluation, product evaluation, proposal evaluation, and policy
evaluation. There are two other applied areas of the greatest importance. One is discussed
only implicitly in this paper: it is meta-evaluation, the evaluation of evaluations. 2 The other
is discipline-specific evaluation, the kind of evaluation that goes on inside a discipline,
sometimes with and often without any assistance from trained evaluators, but always
requiring substantial knowledge of the discipline. It obviously includes the evaluation of
hypotheses, instruments, experimental designs, methods, etc. within a discipline, but here
are some more examples, listed in roughly increasing order of the amount of outside help
they should employ--although they frequently employ less: the evaluation of (i) a new
theory in surface physics (topic-specific); (ii) a review of recent progress and promising
directions in chaos theory (discipline-specific meta-analysis); (iii) a new program in
emergency health care or instruction (application-specific, i.e., it's specific to program
evaluation); (iv) proposals for research support in short-term psychotherapy; (v) several
candidates for a job or for promotion within the department of mathematics; and
(vi) literary criticism, a discipline which is by definition a branch of applied evaluation, but
one with severely limited objectivity and, as usual, in serious need of external review.
There are many other fields of evaluation besides these eight, including: curriculum
evaluation; technology assessment; medical ethics; industrial quality control; appellate court
jurisprudence (the legal assessment of legal opinions); and some from our avocational
interests such as wine tasting, art criticism, movie and restaurant reviewing.
Since the philosophy of science deals, amongst other things, with the question of
the nature and logic of scientific propositions, one of the claims that it has to evaluate is the
issue of whether evaluative claims have (or have not) a legitimate place in science. The
most powerful view about this issue, throughout the twentieth century, has been the
doctrine of value-free science, the denial that evaluative claims have any legitimate place in
science. This position of course entailed that there could not be a science of evaluation.
Similar arguments were raised in the humanities, leading to the general conclusion that
there could be no legitimate discipline of evaluation, whether considered as a science or
under some other heading. We will focus here on the issue with respect to science, since
that is the hardest nut to crack.
Many scientists have an interest in the philosophy of science as well as their own
science--as indeed they should--and they often make claims about the value-free doc-
trine. They commonly make the mistake of thinking that their familiarity with scientific
claims means they are in an expert's position with respect to claims concerning the nature
of scientific propositions. In fact, as is suggested by the radical disagreement between
scientists about such claims, they are in possession of at most half of the requisite
expertise, the other half being an understanding of the concepts involved in
epistemological and logical classification schemes. Their relatively amateur status in this
area, 3 combined with their anxiety about the contamination of science by the intrusion of
matters which many of them saw as essentially undecidable--i.e., value judgments--and
hence essentially improper for inclusion in science, led most of them to embrace and
continue to support the doctrine of value-free science. Once that was in place, and widely
supported by the power elite in science, the stage was set for the suppression of any
nascent efforts at developing a general discipline of evaluation. No-one wanted to be
associated with a politically incorrect movement.
Philosophers of science, who should have known better, were for too long influ-
enced by the distinguished group of neo-positivists in their own discipline, descendants of
the group of logical positivists--scientists and philosophers--who first established the
value-free doctrine. Eventually some of them came to abandon the value-free doctrine, but
just as they became willing to consider this possibility, they were hypnotized by the
constructionist/constructivist revolution. So they jumped ship, but into equally bad
company.4
Constructivism is a currently popular derivative from philosophical scepticism,
relativism, or phenomenology (depending on which version of it one considers). It offered
another kind of reason from the ones considered here for thinking that science was not
value-free. Its reasons were centrally flawed--in particular, they were self-refutingS--but
its extensive acceptance has led to the present unusual situation in which there is
widespread agreement that the value-free doctrine is false, based on completely invalid
reasons for supposing it to be false. Since the constructionist's reasons lead to the
abandonment of the notion of factuality or objectivity even for descriptive science, their
rejection of the value-free doctrine comes at the price of a simultaneous abandonment of
most of what science stands for. It was in a sense incidental, although important for our
topic here, that constructionism renders impossible the construction of any discipline of
evaluation worthy of the name. Ironically, then, the most widely-accepted revolt against
the doctrine of value-free science in fact generated another argument which made a
discipline of evaluation impossible.
The stance here is that a discipline of evaluation is entirely possible and strictly
analogous to the disciplines of statistics, measurement, and logic. That is, evaluation is a
tool discipline, one whose main aim is to develop tools for other disciplines to use, rather
than one whose primary task is the investigation of certain areas or aspects or constituents
of the world. Such disciplines are here called 'transdisciplines' for two reasons. The first
is that they serve many other disciplines--and not just academic ones. Much of the work
that falls under the purview of a transdiscipline is discipline-specific (but not topic-
specific), e.g., biostatistics, statistical mechanics. The second reason for calling them
transdisciplines refers to the "discipline" part of the term. Each of them has a core
component--an academic core--which is concerned with the more general issues of their
organizing theories or classifications, their methodology, nature, concepts, boundaries,
relationships, and logic. In conventional terms, this is often referred to as the pure subject
by contrast with the applied subject. Thus there are pure subjects of logic, of
measurement, and of statistics. The field of evaluation, alone amongst the transdisciplines,
has always had the applied areas--because practical problems demanded it--but never a
core area. Without that, a field cannot be a discipline, for it cannot have a self-concept, a
definition, integrating concepts, plausible accounts of its limits and basic assumptions, etc.
So the birth of the discipline of evaluation was delayed by these squabbles amongst the
families of the potential parents. Meanwhile, the applied disciplines suffered severely, both
from unnecessary limitations and from the use of invalid procedures.

The Uses and Abuses of the Concept of Evaluation

In saying that a general discipline of evaluation has only very recently emerged, it
should not be supposed that there have been no publications which appear to deal with
such a discipline. There are, for example, many books with the unqualified term
"evaluation" in the title. These would, one might suppose, refer to the general discipline.
Pathetically, however, for six decades they were simply books about student assessment.
That is, they referred to only one part of one applied area in evaluation in one academic
field (performance evaluation in education). More recently, the occurrence of the
unqualified term in the title turns out to be simply referring to program evaluation. In other
cases, a title that referred to "educational evaluation" might lead one to think that the
additional term would entail some inclusion--or at least some mention--of the evaluation
of teaching, administrators, teachers, curriculum, equipment, schools, etc. But while it
used to simply mean 'tests and measurement', more recently, it just means 'program
evaluation in education'. What explains this phenomenon of exaggeration of coverage?
It can be seen as a case of academic nature abhorring a vacuum. In the absence of
any truly general discipline of evaluation, each applied field can think of itself as covering
the general subject. And in a microcosmic way, they do; that is, books on program
evaluation often provide a model of 'evaluation', or at least some remarks about proper
evaluation methodology, which is far more than a mere listing of techniques. But it's far
less than a general model for evaluation, both in breadth and depth. Most of the vacuum
was still there, and its existence was officially endorsed by the value-free doctrine. Low-
level generalizations from the applied fields were no great threat to its legitimacy, although
if you added them all up, the situation was somewhat bizarre--six healthy bastards all said
to have no parents. What was forbidden, as a logical or scientific impropriety--arguments
were given for both claims--was a general account. Nevertheless, it is a bizarre situation
when the whole of science and the teaching of science involves--and cannot continue
without--evaluation, yet the high priests of science still maintain its impropriety.
This was a typical example of the way in which a paradigm can paralyze perception.
One of the classic cases comes from particle physics. Given that electrons have a negative
charge, experimenters supposed that a track of a lightweight particle which curves the
wrong way in a cloud chamber photograph "must have been" due to someone getting
careless with reversing the photographic plate. As we now know, many such photographs
were disregarded--instead of checked, which was easy enough--before someone chal-
lenged the fundamental precept by suggesting that the positron was a real possibility. In
the present case, scientists believed--indeed, most of them wanted to believe--the power
elite's quasi-religious dogma of 'value-free (social) science'. It followed that there could
not be a science of evaluation, indeed any discipline of evaluation. Of course, everyone
knew there was a practice of evaluation, since every one of them as a student had received
grades on their school work and virtually every one of them had given grades to
students--presumably well-justified, factually based grades. People working in testing or
program evaluation realized there were plenty of tricks of the trade, enough to justify a text
on the subject--but it never occurred to them to see that subject as part of a general
discipline, or to use less than the general term to describe their own work, although their
common sense was perfectly well aware that there were half a dozen other applied fields of
evaluation.
Despite the prima facie absurdity of thinking that many fields could be engaged in
easily justified practices which obviously shared many common concepts--ranking,
grading, bias, evaluative validity, etc.--if in fact value judgments were completely
unscientific, the paradigm persisted. It prevented scientists from trying to generalize their
evaluative results to other parts of their own domain, let alone considering the possibility
of a common logic, methodology, and theory that transcended domains. In fact the
paradigm prevented them from trying to study the other fields to see if there were some
practices there from which they could learn. As a result the wheel was reinvented many
times, or, worse, not reinvented.
Instead workers in each field made a point of decrying any suggestions of
similarity. People in personnel evaluation often rejected the idea that they could learn from
the quite sophisticated and much older field of product evaluation, often with some great
insight like: "We can hardly learn from product evaluation, since people aren't products."
One might as well say that cognitive psychology can't learn from computer science since
people aren't computers. The difference in subject matter is undeniable but irrelevant to the
existence of useful analogies and some common logic and methodology.
Had the thought of a general discipline occurred to these writers, they would of
course have made some mention of it in the introduction to their books, or used a less
misleading title. But such a thought was not acceptable and such mentions never occurred.
That doesn't mean they thought it but didn't say it; it means they didn't think it. Their
perceptions and thinking were controlled by the paradigm.
We've talked about what it takes to constitute a discipline. Now, what is this
subject of evaluation that we are talking of making into a discipline? The term "evaluation"
is not used here in any technical sense: we follow common sense and the dictionary.
Evaluation is simply the process of determining the merit or worth of entities, and
evaluations are the product of that process. Evaluation is an essential ingredient in every
practical activity--where it is used to distinguish between the best or better things to make,
get, or do, and less good alternatives--and in every discipline, where it distinguishes
between good practice and bad, good investigatory designs and less good ones, good
interpretations or theories and weaker ones, and so on. It can be done arbitrarily, as by
most wine 'experts' and art critics, or it can be done conscientiously, objectively, and
accurately, as (sometimes) by trained graders of English essays in the state-wide testing
programs. If done arbitrarily in fields where it can be done seriously, then the field
suffers, and the work of all those in the field suffers. For if we cannot distinguish between
good and bad practice, we can never improve practice. We would never have moved out of
the stone age, or even within the stone age from Paleolithic to Neolithic.

The Emergence of a Discipline From Considered Practice

Evaluation is an ancient practice, indeed as ancient as any practice, since it is an in-
tegral part of every practice. The flint chippers and the bone carvers left mute testimony of
their increasing evaluative sophistication by consigning to the middens many points and
fish hooks that their own ancestors would have accepted, and by steadily increasing the
functionality of their designs. Craft workers developed increasing skill and became
increasingly sophisticated in their communicable knowledge about procedures that led to
improvement, and about indicators that predicted poor performance.
Much of this went on before there was anything correctly described as a language. 6
There is no reason why evaluation cannot proceed without language since gestures or
actions can clearly indicate dissatisfaction, which--in certain contexts--represents an eval-
uative judgment. However, language is the great lever of progress. Once we have a
language for talking about what we make or do, we can then more quickly focus on the
aspects which we approve or disapprove and avoid misguided efforts. Now there are some
contexts where the evaluator's disapproval is merely aesthetic, a matter of personal
preference; important to others only as a reflection of power or tastes. In others, coming
from the shaman or priest, it's intended as an indication of divine disapproval. But in yet
others, it is the judgment of an expert whose expertise is demonstrable, and that kind of
evaluation--and acceptance of it--is a survival characteristic. Better fish hooks catch more
fish. Using the master hook maker as an evaluator leads to making better fish hooks. Or so
it appears, with good reason--but with some traps as well.
That situation is no different today. We still call on disciplinary professionals to re-
view programs and proposals in the same field, we still get valuable commentary from
them--and there are still some traps. Identifying the traps, and ways to avoid them or get
out of them, is something we have not got very far with, because doing that requires a
discipline of evaluation. One important trap is that using experts in a field
to evaluate beginners can and often does involve several major sources of error.
But we must recognize that doing some discipline-specific evaluations--the topic-
specific ones--is simply part of competence in doing the usual thoughtful disciplinary
work of research or teaching. As we move on through the sequence of intradisciplinary
evaluation examples given earlier, to disciplinary overviews, to proposal evaluation, and to
the evaluation of programs which teach or apply the discipline, we move further away
from the skills of the outstanding researchers within a discipline. The qualities required for
the latter evaluation tasks are, interestingly enough, quite often lacking in the best
researchers. Here are just a few: open-mindedness with respect to several competing
viewpoints; an ability to conceptualize the functions of the program under evaluation
(which is often education or service rather than research); extensive comparative
experience with similar programs; a strong historical perspective; the ability to do or at
least critique needs assessments; good empathy or role-playing skills (to see the program
or proposal from the point of view of its author or manager as well as its customers/
clients); skill in applying codes of ethical conduct; an understanding of the main traps that
untrained program or proposal evaluators fall into (such as superficial or zero cost-
analysis--especially opportunity cost analysis); broad experience in other disciplines of
activities at the same level of generality.
With this list we are beginning the task of listing evaluation competencies, some-
thing which is only possible if we have a concept of good evaluation by task. While
professionals in general have a good sense of how to do topic-specific intradisciplinary
evaluation, and some of them have or acquire good talents for application-specific
evaluation tasks (e.g., performance evaluation), one needs more than an intuited set of
standards if one is to improve the latter process further. It's all too easy to give many
examples of it being done impressionistically at the moment. No doubt there are places
within the literature of a number of professions where an attempt has been made to list the
qualities that should be sought in identifying evaluators for programs or proposals or
personnel within the profession. But this approach is reinventing the wheel, and not the
way to get the best result after 50 years of effort. A discipline of evaluation is the central
agency where such attempts can be assembled, compared, strained for dross, squeezed for
common elements, and conceptualized. The disciplines are not well served by using
evaluators of their own programs and proposals who are picked for their prestige amongst
the few with that quality who can take time off.

The Need for a Discipline of Evaluation

Related cases of great importance concern the evaluation of research achievements
in the course of personnel selection or promotion at a university, or when refereeing for a
journal, or when making a selection amongst proposals for support by a fund or
foundation. Since these selections are what shape the whole future of the discipline, their
importance is clear enough; and since the criteria used vary between journals and
departments to an extent which lacks any justification in terms of the differences between
the journals and departments, we are clearly dealing with idiosyncrasy, most of it quite
damaging to the validity of the process. Much, although not all, of that idiosyncrasy is
removable by the use of available procedures. But the situation is far worse; the underlying
logic of the process is usually flawed, and guarantees that even with perfectly uniform
standards, the results will be incorrect. We address this problem briefly below.
Overlapping with these examples of performance evaluation is the neighboring area
of personnel evaluation. Much of it involves the integration of a number of performance
evaluations, but it is not reducible to that--e.g., because there is a need in personnel
evaluation, not present in performance evaluation, to predict future performance. In one
task faced by the personnel evaluator, the selection task, careless thinking about evaluation
has led to some major disasters. Few people outside industrial/organizational psychology
realize the extent of the research that has been done on the interview process. Simply
reading a good collection of that research--beginning with the research demonstrating the
near-zero validity of the usual approaches--changes the whole way one looks at the
interview, changes the way one does it, the way one manages it, and the value of the
results from it. 7 Here we have an example of an applied field of evaluation moving from
considered practice to major pay-off territory.
But there are deeper flaws in the selection part of personnel evaluation, still hardly
broached. One of these is the common use of indicators that are only empirically correlated
with later job performance, rather than simulations of it. This is the foundation of much of the
use of professionally constructed testing of applicants. 8 Now we know that if the indicator
was skin color, then, even if it had been empirically established as correlated with later
good (or bad) performance, we couldn't use it. But that ban is generally regarded as a
political or perhaps an ethical override on the most-effective procedures. In fact, the ban is
based on sound scientific principles. The underlying facts are that the indicators are very
weak predictors, that we can normally do better by using or adding past performance on
related tasks (which can always be obtained, absent emergencies), and that once we use
that data, the indicator is invalidated (because it only applies to random samples from the
population with that skin color, and someone with known relevant past performance is not
a random sample). If it is provably impossible to obtain 'related performance' data, or to
use a simulation (including a trial period), the indicators can be justified, e.g., the use of
the Army Alpha test when there was no other way to process the number of recruits.
Otherwise, the use of indicators is scientifically improper--as well as ethically improper
since they involve the 'guilt by association' and 'self-fulfilling prophecy' errors. Given the
attention that has been paid to test validity in recent years, it is interesting that this key
fallacy has not received more attention. 9 Once there is a core discipline of evaluation in
place, the kind of foundational analysis we are doing here can be called on to reexamine
many of these general procedures.
These examples are just a few of a dozen that could be given to illustrate the weak-
ness of assuming that one does not need a discipline of evaluation, which includes its ap-
plied fields and makes some of the needed improvements in them. Evaluation training has
a zero or near-zero contribution to make to the physicist evaluating topic-specific
hypotheses, 10 but it can transform the evaluation of personnel, proposals, and training
programs in physics from laughable to highly valuable. Of course, as we move towards a
hybrid area like the teaching of science, so another expertise--in this case, research on
teaching, including computer-assisted instruction--may also earn its place beside the
subject matter specialist and the evaluator.
A core discipline of evaluation starts by developing a language in terms of which
we can describe types of evaluation--and parts or aspects of them--and begin to study,
classify, analyze, and generalize about them. That much enables us to focus our evaluative
commentary, and is the first step towards the discipline. The next step consists in
developing a theory about the nature and limits of these different aspects of evaluation.
This includes, for example, a theory about the relation of grading to ranking, scoring and
apportioning. Another topic it must deal with is the differences between, and the extent of
proper use of, criteria vs. indicators (an example mentioned above), standards vs.
dimensions, effectiveness vs. efficiency, objectivity vs. bias, formative vs. summative,
etc. Here we begin to see something that transcends topic- and area-specific evaluation,
and indeed transcends fields within evaluation. At this step we clearly divorce determining
merit from approving: we can, for example, determine that a hand-gun is accurate and
well-made without approving its use, manufacture, or possession. That divorce is notable,
since for the first half of this century the most widely accepted theory about evaluative
claims was that they simply expressed approval or disapproval, and had no propositional
content.
Most applied fields in evaluation, and many other applied fields that are not part of
evaluation (e.g., survey research), have made some steps towards clarifying evaluation
predicates, but very few seem to have done it well. Pick up an opinion questionnaire (or an
example of one from a text), or pick up a personnel rating scale, and you are likely to find
that the anchors are a clumsy mix of norm-referenced (ranking) descriptors and criterion-
referenced (grading or rating) descriptors, 11 a botch-up that violates the simplest principles
of the logic of evaluation. The treatment of Likert scales by experts is equally flawed: it is a
definitional requirement on such a scale that there be no right answer to any question, yet
clearly the correct answer by someone from the ghetto today to "Getting the job you want
is mostly a matter of luck" is Disagree Very Much (Spector, 1992). It's fight up there
with Disagree Very Much with "2+2=5". Thus we can see that the fundamental premises
of most thinking about attitudes and valuing are shaky.12
Essentially nobody teaches and few write about these matters, although the
elements of scaling with descriptive (measurement) predicates are probably known to
everyone who graduates with a major in the social sciences. We need better treatment of
the basics of evaluation, and setting them out is one of the tasks that falls to a foundational
or core discipline of evaluation.13
Sorting out these considerations--about discipline-specific evaluation and scaling--
is among the early steps in the development of a discipline, but the pay-off from getting them in
place is considerable because of their widespread use. To take an example from one of the
applied areas that is most in need of development, some of the most distinguished
scientists in the country think that the procedure used by the National Science Foundation
to allocate research funds is seriously flawed in that it undervalues originality. They say
that it excessively rewards applicants who are working within a paradigm by comparison
with those striving for a new one. Is this concern justified? It's quite easy to find out, and
it would be the scientifically appropriate response to undertake the study. But that would
be to treat evaluation as if it is (i) legitimate, and/or (ii) something over and above topic-
specific evaluation, which would undermine the axiom that scientists from discipline X are
the only ones with the expertise to do discipline-specific evaluation in X. This is the worst
kind of turf-protection at the expense of the taxpayer and of science, and it occurs because
the simple distinction made above between topic-specific and application-specific evaluation has
not passed into the first-year graduate courses and texts.
We can conclude with one last example that comes from a slightly more difficult
level in the development of a general discipline of evaluation. It is not subject-matter
specific, and is widely used in everyday life as well as in virtually all kinds of evaluation
within the disciplines and the professions. It concerns the synthesis step that must be made
in complex evaluations, in order to bring together the sub-evaluations that have been done,
which yield ratings on each of the dimensions or criteria of merit. The usual way to do
this, other than impressionistically (as is too often the case in selection evaluation in the
personnel field), is to give a numerical weight to the importance of each criterion (e.g., on
the scale 1-5), convert the performances on each of these into a standardized numerical
score (e.g., on a scale of 1-10 or 1-100), multiply the two together, and add up each
candidate's total score in order to find the winner. Many of us have used something like
this for selecting a home, a job, a graduate school, etc., as well as in evaluating a program.
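A minimal sketch may make the arithmetic of this Numerical Weight and Sum procedure concrete; the criteria, weights, and candidate scores below are hypothetical, invented purely for illustration.

```python
# Sketch of the Numerical Weight and Sum approach described above (hypothetical data).
# Each criterion gets an importance weight (1-5); each candidate gets a standardized
# score (1-10) on each criterion; weighted scores are summed to pick the "winner".

weights = {"effectiveness": 5, "cost": 3, "ease_of_use": 2}          # importance, 1-5

candidates = {
    "Program A": {"effectiveness": 7, "cost": 9, "ease_of_use": 4},  # scores, 1-10
    "Program B": {"effectiveness": 9, "cost": 5, "ease_of_use": 8},
}

def weight_and_sum(scores, weights):
    """Multiply each criterion score by its weight and add up the products."""
    return sum(weights[c] * scores[c] for c in weights)

totals = {name: weight_and_sum(scores, weights) for name, scores in candidates.items()}

for name, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(name, total)
print("Apparent winner:", max(totals, key=totals.get))
```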
This approach is fundamentally invalid, and will give completely incorrect results in
many cases. Nor can it be adjusted; nor can one say in advance when it will work or not
work, i.e., on what type of case it will work reasonably well. There are some such cases.
For example, the GPA (grade point average) works reasonably well for some purposes;
for others, it fails. In fact, no algorithm will work, because the fundamental flaw is an
assumption about comparability of utility distributions across three logically independent
scales, an assumption which is essentially always false. There is a valid process (a set of
heuristics) for synthesizing sub-scores, referred to as the Qualitative Weight and Sum
approach, by contrast with the invalid algorithm proposed by the Numerical Weight and
Sum approach (Evaluation Thesaurus 4, 1991). Now, it is remarkable that we should not
have discovered this flaw in such a widespread practice until this late date. The explanation
of course is that it was nobody's business to look at it, since it is obviously an evaluation
process and there was no legitimate discipline of evaluation. As an ironic note, the
invalidity applies to all the standard procedures for evaluating proposals, the system by
which essentially all research funds are allocated in this country; and yet none of the
researchers on the peer review panels, which do most of these evaluations, ever raised
their eyes from the task long enough to notice the crude errors in the process. 14 So much
for the paradigm of the scientific method when it runs up against the (virtually self-
refuting) paradigm of value-free science.
It can perhaps be seen from this brief summary that evaluation as
it is currently done--even in the autonomously originated practical fields, but especially
within the disciplines, and even more with respect to transdisciplinary practice--involves
serious and costly mistakes. We need something better than the vacuum where there
should be a core theory of evaluation, something corresponding to the core theories in
statistics and measurement. In the search for that elusive entity, we now turn to a brief
review of what came out of the woodwork in the late 1960s and thereafter to provide us
with various theories which were referred to as theories about the nature of evaluation.
Perhaps one of them, if necessary somewhat modified, can provide us with the missing
core theory.

Early Models of Evaluation 15

It was only because these views were filling a perceived vacuum that they were
generally put forward as theories of evaluation. In fact, they were only theories of program
evaluation. Indeed, they had an even narrower purview. For "program evaluation" has
become a label for only part of what is actually required to do program evaluation, just as
"needs assessment" has in some quarters become a name for a formalized approach that
covers only part of what is required in order to determine needs. In the real world, pro-
gram evaluation always involves some personnel evaluation, should nearly always involve
some evaluation of management systems and some ethical evaluation, and should usually
involve some product evaluation. It will also often benefit from some consideration of pro-
posal evaluation and the evaluation of evaluations. But we'll leave out all these refinements
in this brief overview, and focus on what is conventionally called program evaluation.
The following simplified classification 16 begins by identifying six views or ap-
proaches that are alternatives to and predecessors of the one advocated here, the
transdisciplinary view. They are listed below in the order of their emergence into a position
of power in the field of program evaluation since the mid-sixties when the explosive phase
in that field began. In addition to those discussed here there is a range of exotica--
fascinating and illuminating models ranging from the jurisprudential model to the connois-
seurship model--which we pass over for reasons of space.
A. The 'strong decision support' view was an explication of the use of program
evaluation as part of the process of rational program management. This process, implicit in
management practice for millennia, has two versions. The strong version described in this
paragraph conceived of evaluators as doing investigations aimed to arrive at evaluative
conclusions designed to assist the decision-maker. Supporters of this approach pay
considerable attention to whether programs reach their goals, but go beyond that into
questions about whether the goals match the needs they are supposedly addressing,
thereby differentiating themselves from the much narrower relativistic approach listed here
as approach C. Position A was exemplified in, but not made explicit by, the work of Ralph
Tyler, 17 and extensively elaborated in the CIPP model of evaluation (Context, Input,
Process, and Product) (Stufflebeam, et al., 1971). The CIPP model goes beyond the
rhetoric of decision support into spelling out a useful systematic approach covering most
of what is involved in program evaluation, and uses this to infer evaluative conclusions.
Dan Stufflebeam, who co-authored the CIPP model, has continued to play a leading role in
evaluation, still representing--and further developing--this perspective. By contrast,
Egon Guba, one of his co-authors in the early CIPP work, has now gone in a quite
different direction--see F below. This approach, although this particular conclusion was
more implicit than explicit, clearly rejected the ban on evaluation as a systematic and
scientific process. It was not long, however, before recidivism set in, as we see in the next
four accounts.
B. The 'weak decision support' view. The preceding approach has often been
described as the 'decision support' approach but there is another approach which also
claims that title. It holds that decision support provides decision-relevant data but stops
short of drawing evaluative conclusions or critiquing program goals. This point of view is
represented by evaluation theorists such as Marv Alkin who define evaluation as factual
data gathering in the service of a decision-maker who is to draw all evaluative
conclusions. 18 This position is obviously popular amongst those who think that true
science cannot or should not make value judgments, and it is just the first of several that
found a way to do what they called program evaluation while managing to avoid
actually drawing evaluative conclusions. The next position is somewhat more like
evaluation as we normally think of it, although it still manages to avoid drawing evaluative
conclusions. This is:
C. The 'relativistic' view. This was the view that evaluation should be done
by using the client's values as a framework, without any judgment by the evaluator about
those values or any reference to other values. The most widely used text in evaluation is
written by two social scientists and essentially represents this approach (Rossi &
Freeman, 1989). B and C were the vehicles that allowed social scientists to join the
program evaluation bandwagon. 19 The simplest form of this approach was developed into
the 'discrepancy model' of program evaluation by Malcolm Provus (the discrepancies
being divergences from the projected task sequence and timeline for the project). Program
monitoring as it is often done comes very close to the discrepancy model. This is a long
way from true program evaluation for reasons summarized below. It is best thought of as a
kind of simulation of an evaluation: as in a simulation of a political crisis, the person
staging the simulation is not, in that role, drawing any evaluative conclusions. Of course,
it's a little more quaint for someone who is not drawing any evaluative conclusions to refer
to themselves as an evaluator.
D. The 'rich description' approach. This is the view that evaluation can be done
as a kind of ethnographic or journalistic enterprise, in which the evaluators report what
they see without trying to make evaluative statements or infer to evaluative conclusions--
not even in terms of the client's values (as the relativist can). This view has been very
widely supported--by Bob Stake, the North Dakota School, many of the UK theorists,
and others. It's a kind of naturalistic version of B; it usually has a flavor of relativism
about it, reminiscent of C--in that it eschews any evaluative position; and it sometimes
looks like a precursor of the constructivist approach described under F below, in that it
focuses on the observable rather than the inferrable. More recently, it has been referred to
as the 'thick description' approach--perhaps because "rich" sounds evaluative?
E. The 'social process' school. This was crystallized about 12 years ago,
approximately half way to the present moment in the history of the emerging discipline, by
a group of Stanford academics led by Lee Cronbach, referred to here as C&C (for
Cronbach and Colleagues; Cronbach et al., 1980). It is notable for its denial of the impor-
tance of summative evaluation, i.e., evaluation (i) as providing support for external
decisions about programs, or (ii) to ensure accountability. The substitute they proposed for
evaluating programs in anything like the ordinary sense was understanding social pro-
grams, 20 flavored with a dash of helping them to improve. Their position was encapsulated
in a set of 95 theses. This paper may perhaps represent an implementation of the 87th in
their list, which states: "There is need for exchanges [about evaluation] more energetic than
the typical academic discussion and more responsible than debate among partisans"--if
indeed there is any such middle ground.
Ernie House, a highly independent thinker about evaluation as well as an
experienced evaluator, also stressed the importance of the social ambience but was quite
distinctive in his stress on the ethical and argumentation dimensions of evaluation. In fact
his stress on the ethical dimension was partly intended as a counterpoint to the absence of
this concern in C&C (House, 1989).
F. The 'constructivist' or 'fourth generation' approach, representing the most re-
cent riders on the front of the wave, notably Egon Guba and Yvonna Lincoln (1989), but
with many other supporters including a strong following in the USA and amongst UK
evaluators. This point of view rejects evaluation as a search for quality, merit, worth, etc.,
in favor of the idea that it--and all truth, such as it is in their terms--is the result of
construction by individuals and negotiation by groups. This means that scientific
knowledge of all kinds is suspect, entirely challengeable, in no way objective. So, too, is
all analytic work such as philosophical analysis, including their own position. Out goes the
baby with the bathwater. Guba has always been aware of the potential for self-
contradiction in this position; in fact, there is no way around its suicidal bent.
Comments

Now, the commonsensical view of program evaluation is probably the view that it
consists in "working out whether the program is any good". It's the counterpart, people
might say, of the sort of thing doctors, road-testers, engineers, and personnel interviewers
do, but with the subject matter being programs instead of patients, cars, structures, or
applicants. The results of this kind of investigation are of course direct evaluative
conclusions--"The patient/program has improved slightly under the new therapeutic/-
managerial regime", etc. Of the views listed above, the slxong decision support view, of
which CIPP is the best known elaboration, comes closest to this.
The CIPP model was originally a little overgeneralized in that it claimed all
(program) evaluation was oriented to decision support. It seems implausible to insist that a
historian's evaluation of the "bread and circuses" programs of Roman emperors, or even
of the WPA, is or should be designed to serve some contemporary decision maker rather
than the professional interest of historians and others concerned with the truth about the
past. One must surely recognize the 'research role' of evaluation, the search for truth about
merit and worth, whose only payoffs are insights. Much of the decision support kind of
evaluation, and all of the research type exemplify what is sometimes called summative
evaluation--evaluation of a whole program of the kind that is often essential for someone
outside the program. One might also argue, contra the original version of CIPP, that
formative evaluation--evaluation aimed at improving a program or performance, reported
back to the program staff--deserves recognition as having a significantly different role
than decision support and its importance slightly weakens the claim that evaluation is for
decision support. (Of course, it supports decisions about how to improve the program, but
that's not the kind of decision that decision support is normally supposed to support.)
Over the years, however, CIPP has developed so that it accepts these points and is a fully-
fledged account of program evaluation; and its senior author has gone on to lead research
in the field of personnel evaluation.
While CIPP remains an approach to program evaluation, it comes to conclusions
about program evaluation that are very like those entailed by the transdisciplinary model.
The differences are like those between two experienced navigators, each of them with their
own distinctive way of doing things, but each finishing up--or else how would they live
to be experienced?--with very similar conclusions. Of matters beyond program
evaluation, and in particular, of the logic and core theory of evaluation, CIPP does not
speak, and those are the matters on which the transdisciplinary view focuses above all
others.
The other entries in the list above--that is, almost all schools of thought in evalua-
tion--can be seen as a series of attempts to avoid direct statements about the merit or worth
of things. Position B avoids all evaluative conclusions; C avoids direct evaluative claims in
favor of relativistic ones; 21 D avoids them in favor of non-evaluative description; E avoids
them in favor of insights about or understanding of social phenomena; and F rejects their
legitimacy along with that of all other claims. 22
This resistance to the commonsense view of program evaluation--even amongst
those working in the field--has its philosophical roots in the value-free conception of the
social sciences, discussed above, but it also gathered support from another argument,
which appears at first sight to be well-based in common sense. This was the argument that
the decision whether a program is desirable or not should be made by policy-makers, not
by evaluators. On this view it would be presumptuous for program evaluators to act as if it
were their job to decide whether the program they were called in to evaluate should exist.
That argument confuses evaluations--which evaluators should produce--with
recommendations, which they are less often in a good position to produce (although they
often do produce them), and which are frequently best left to executives close to the
political realities of the decision ambience. That such a confusion exists is further evidence
of the lack of clarity about fundamental concepts in the general evaluation vocabulary.
Evaluators have all too often overstepped the boundaries of their expertise and walked on
to the turf of the decision maker, who rightly objects. But it is not necessary to react to the
extent of the weak decision support position and others that draw the line too early, cutting
the evaluator off even from drawing evaluative conclusions.
The issue must now be addressed of how the view supported in this paper, referred
to as the 'transdisciplinary' view, compares with the above. The transdisciplinary view
extends the commonsense view but is significantly different from A, and radically different
from all the rest.

The Transdisciplinary Model

On this view, the discipline of evaluation has two components: the set of applied
evaluation fields, and the core discipline, just as statistics and measurement have these two
components. The applied fields are like other applied fields in their goals, namely to solve
practical problems. This means finding out something about what they study, and what
they study is the merit and worth of various entities--personnel, products, etc. The core
discipline is aimed to find out something about the concepts, methodologies, models,
tools, etc. used in the applied fields of evaluation, and in other fields which use evaluation.
This, as we have suggested, includes all other disciplines--craft and physical as well as
academic. Hence the transdiscipline of evaluation is concerned with the analysis and
improvement of a process that extends across the disciplines, giving rise to the term.
Consider statistics more closely. There is a core discipline, studied in the
department of mathematics or in its own academic department. This is connected to the
applied fields of, for example, biostatistics, statistical mechanics, and demographics. The
applied fields' main tasks are the study and description of certain quantitative aspects of the
phenomena in those fields, and the study and development of field-specific quantitative
tools for describing that data and solving problems on which it can be brought to bear. The
more general results coming from the core discipline apply across all the disciplines that
are using--or should be using--statistics, hence the term "transdiscipline"; but it also
helps develop field-specific techniques, attending in particular to the soundness of their
fundamental assumptions and hence the limits of their proper use.
Both evaluation and statistics are of course widely used outside their recognized
applied fields, i.e., the ones with "evaluation" or "statistics" in their title. That wider use is
part of the subject matter of the core discipline in both cases. Statistics must consider the
use of statistics wherever it is used, not just in areas that have that word in their title.
Looking at other transdisciplines, logic has its own applied fields--the logic of the social
sciences, etc.--and is of course widely used outside those named fields. So it is an
extremely general transdiscipline. But evaluation is probably the most general--unlike

logic, it precedes language--and both are much more general than measurement or
statistics.
The transdisciplinary view of evaluation has four characteristics that distinguish it
from B-F on the previous list; one epistemological, one political, one concerning
disciplinary scope, and one methodological.
(I) It is an objectivist view of evaluation, like A. It argues for the idea that the evaluator
is determining the merit or worth of, for example, programs, personnel or products; that
these are real although logically complex properties of everyday things embedded in a
complex relevant context; and that an acceptable degree of objectivity and comprehen-
siveness in the quest to determine these properties is possible, frequently attained, and a
goal which can be more frequently attained if we study the transdiscipline. This contrasts
with B-F for obvious reasons. (There is some contrast with the early form of A, in the
shift of the primary role from decision-serving to truth-seeking.)
Since an objectivist position implies that it is part of the evaluator's job to draw di-
rect evaluative conclusions about whatever is being evaluated (e.g., programs), the
position requires a head-on attack on the two grounds for avoiding such conclusions. So
the transdisciplinary position:
(i) Explicitly states and defends a logic of inferring evaluative conclusions from
factual and definitional premises; and
(ii) Spells out the fallacies in the arguments for the value-free doctrine. 23

(II) Second, the approach here is a consumer-oriented view rather than a management-
oriented (or mediator-oriented, or therapist-oriented) approach to program evaluation--and
correspondingly to personnel and product evaluation, etc. This does not mean it is a
consumer-advocacy approach in the sense that 'consumerism' sometimes represents--that
is, an approach which only fights for one side in an ancient struggle. It simply regards the
consumer's welfare as the primary justification for having a program, and accords that
welfare the same primacy in the evaluation. That means it rejects 'decision support'--
which is support of management decisions--as the main function of evaluation (by
contrast with B), although it aims to provide (management-)decision support as a
byproduct. Instead, it regards the main function of an applied evaluation field to be the
determination of the merit and worth of programs (etc.) in terms of how effectively and
efficiently they are serving those they impact, particularly those receiving--or who should
be receiving--the services the programs provide, and those who pay for the program--
typically, taxpayers or their representatives. While it is perfectly appropriate for the welfare
of program staff to also receive some weighting, schools--for example--do not exist
primarily as employment centers for teachers, so staff welfare (within the constraints of
justice) cannot be treated as of comparable importance to the educational welfare of the
students.
To the extent that managers take service to the consumer to be their primary goal--
as they normally should if managing programs in the public or philanthropic sector--
information about the merit or worth of programs will be valuable information for
management decision making (the interest of the two views that stress decision support);
and to the extent that the goals of a program reflect the actual needs of consumers, this
information will approximate feedback about how well the program is meeting its goals
(the relativist's concern). But neither of these conditions is treated as a presupposition of
an evaluation; they must be investigated and are often violated.
The consumer orientation of this approach moves us one step beyond establishing the
legitimacy of drawing evaluative conclusions--Point I above--in that it argues for the
necessity of doing so--in most cases. That is, it categorizes any approach as incomplete
(fragmentary, unconsummated) if it stops short of drawing evaluative conclusions. The
practical demonstration of the feasibility and utility of going the extra step lies in every
issue of Consumer Reports: The things being evaluated are ranked and graded in a
systematic way, so one can see which are the best of the bunch (ranking) and whether the
best are safe, a good buy, etc. (grading), the two crucial requirements for decision-
making.
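
The distinction can be made concrete with a small, purely illustrative sketch in Python (the product names, merit scores, and grade cut-points below are invented for this discussion, not taken from Consumer Reports). Ranking orders the evaluands against one another; grading places each of them against absolute standards.

```python
# Hypothetical illustration of the two evaluation functions named above.
# Product names, merit scores, and grade boundaries are invented.

products = {"Model A": 87, "Model B": 92, "Model C": 61}

# Ranking: a best-to-worst ordering within the group being compared.
ranking = sorted(products, key=products.get, reverse=True)

# Grading: each product is judged against fixed standards, independently
# of how the other products performed.
def grade(score):
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "good buy"
    if score >= 60:
        return "acceptable"
    return "not acceptable"

grades = {name: grade(score) for name, score in products.items()}

print(ranking)  # ['Model B', 'Model A', 'Model C']
print(grades)   # {'Model A': 'good buy', 'Model B': 'excellent', 'Model C': 'acceptable'}
```

The two functions can come apart: in a weak field the top-ranked product may still earn a poor grade, which is exactly why decision-making needs both.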
(III) Third, the approach here is a generalized view. It is not just a general view; it
involves generalizing the concepts of evaluation across the whole range of human
knowledge and practice. So, unlike any of the views A-F, it treats program evaluation as
merely one of many applied areas within an overarching discipline of evaluation. (These
applied areas may also be part of the subject matter of a primary discipline: personnel
evaluation, for example, is part of (industrial/organizational) psychology, biostatistics is,
in a sense, part of biology.) This perspective leads to substantial changes in the range of
considerations to which, e.g., program evaluation must pay attention (for instance, it must
look at other applied evaluation areas for parallels, and to a core discipline for theoretical
analyses), but helps with the added labors by greatly enhancing the methodological
repertoire of program evaluation.
Spelling out the directions of generalization in a little more detail, the
transdisciplinary view stresses:
(a) the large range of distinctive applied evaluation fields. The leading entries
are the Big Six plus meta-evaluation (the evaluation of evaluations). There
are at least a dozen more major entries, ranging from technology
assessment to ethical analysis.
(b) the large range of evaluative processes in fields other than applied evaluation
fields, including all the disciplines (the intradisciplinary evaluation process--
the evaluation of methodologies, data, instruments, research, theories, etc.)
and the practical and performing arts (the evaluation of craft skills,
compositions, competitors, regimens, instructions, etc.)
(c) the large range of types of evaluative investigation, from practical levels of
evaluation (e.g., judging the utility of products or the quality of high dives in
the Olympic aquatic competition) through program evaluation in the field to
conceptual analysis (e.g., the evaluation of conceptual and theoretical
solutions to problems in the core discipline of evaluation).
(d) the overlap between the applied fields, something that is rarely recognized.
For example, methods from one field often solve problems in other fields,
yet 'program evaluation' as usually conceived does not include any reference
to personnel evaluation, proposal evaluation, or ethical evaluation, each of
which must be taken into account in a good proportion of program
evaluations.

(IV) The transdisciplinary view is a technical view. This has to be stated rather carefully,
because we need to distinguish between the fact that many evaluations, for example large
program evaluations, require considerable technical knowledge of methodologies from
other disciplines, and the fact being stressed here: that evaluation itself, over and
above these 'auxiliary' methodologies, has its own technical methodology. That
methodology needs to be understood by anyone doing non-trivial evaluation in any field at
all. It involves matters such as the logic of synthesis and the differences between the
evaluation functions like grading, scoring, ranking, and apportioning. Not all evaluators
need to know anything about social science methodologies such as survey techniques; all
must understand the core logic or risk serious errors. It has been common for those
working in and teaching others program evaluation to stress the need for skills in
instrument design, cost analysis, etc. But they have commonly supposed that such matters
exhausted the range of technical skills required of the evaluator. On the contrary, they are
the less important of two groups of those skills.
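
As a further, equally hypothetical illustration of this core logic (the program names, merit scores, and budget figure are invented), the sketch below shows apportioning, the least familiar of the functions just listed: dividing a fixed resource among evaluands in proportion to assessed merit, something neither ranking nor grading by itself provides.

```python
# Hypothetical illustration of apportioning: dividing a fixed budget among
# programs in proportion to their assessed merit. Ranking alone would only
# say X > Y > Z; grading alone would only say whether each clears a standard.

merit_scores = {"Program X": 8, "Program Y": 5, "Program Z": 2}
budget = 300_000  # total resource to be apportioned (invented figure)

total_merit = sum(merit_scores.values())
allocation = {name: budget * score / total_merit
              for name, score in merit_scores.items()}

print(allocation)
# {'Program X': 160000.0, 'Program Y': 100000.0, 'Program Z': 40000.0}
```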
Stressing this does not minimize the fact that across the whole range of evaluation
fields, an immense number of 'auxiliary' methodologies are needed, far more than with
any of the other transdisciplines. There are more than a dozen auxiliary methodologies
involved in even the one applied field of program evaluation, more than half of them not
covered in any normal doctoral training program in any single discipline such as sociology,
psychology, law, or accounting.

Conclusion

Program evaluation treated in isolation can be seen in the ways all six positions
advocate. But program evaluation treated as just one more application of the logical core
which leads us to solid evaluative results in product evaluation, performance assessment,
and half a dozen other applied fields of evaluation, can hardly be seen as consistent with
the flight from direct evaluative conclusions that five of those positions embody.
While there are special features of program evaluation which often make it less
straightforward than the simpler kinds of product evaluation, the reverse is often the case as well.
The view that it is different from all product evaluation is only popular amongst those who
know little about product evaluation. For example, the idea that program evaluation is of
its nature much more political than product evaluation is common but wrong; the histories of
the interstate highway system and the superconducting supercollider are counter-examples,
and it was after all 'only' a product evaluation--one commissioned by Congress and done
flawlessly--that led to the dismissal of the Director of the National Bureau of Standards.24
One must conclude that the non-evaluative models of program evaluation discussed
here (B-F) are completely implausible as models for all kinds of evaluation. And it is
extremely implausible to suppose that program evaluation is essentially different from
every other kind of evaluation. The transdisciplinary view, on the other hand, applies
equally to all kinds of evaluation, and that consistency must surely be seen as one of its
appeals. For the various evaluative purposes addressed by the authors of the other papers in
this issue, it may also be of some value to see what they are doing as part of a single,
larger, enterprise, and hence as parallel to what workers in other applied fields of
evaluation are doing. In that perception there is a prospect of many valuable results which
should serve to revitalize several areas and sub-areas of evaluation. And the second edge
of the transdisciplinary sword cannot be ignored: the demonstration of fundamental errors
in applied evaluation fields such as personnel evaluation and program evaluation due to the
neglect of the core discipline.
Notes

1. The author welcomes comments and criticisms of all kinds; they should be addressed to him at P.O.
Box 69, Point Reyes, CA 94956 (scriven@aol.com on the Internet), or faxed to (415) 663-1511.
The reflections reported here were produced while working part-time on the CREATE staff, although
mostly on my own time since my work for CREATE is primarily concerned with the specifics of
teacher evaluation. However, even when working on the specific topic, there is a need to examine
foundations in order to deal with questions of validity, and some remarks about the connection are
included here.
2. CREATE works mainly on personnel policy and program evaluation, and on institutional evaluation
which combines program and personnel evaluation. It also does considerable meta-evaluation.
3. Closely analogous to the amateur status of applied mathematicians about matters in the foundations of
mathematics, and not unlike the status of a bookmaker with respect to probability theory. A high
degree of skill in an applied field does not automatically generate any skill in the theory of the field, let
alone meta-fields such as the sociology or history of the subject, or the logical analysis of
propositions in the field.
4. The discussion here is only intended to provide a brisk overview of this technical area. Further details
and references will be found in the relevant articles in the Evaluation Thesaurus (1991).
5. Since, if constructionism were true, its arguments would prove that the claim that it is true has no
validity for those who do not construct reality in the same way as those who think it true, i.e., those
who disagree with it. That is, it is no more true than its denial, which means it is not true in the only
sense of that term that ever made it to the dictionaries or into logical or scientific usage.
6. By contrast with a vocabulary of standard signs, lacking grammar and hence recombination
capability.
7. The definitive reference is The employment interview: Theory, research, and practice, Eder and Ferris
(1989).
8. The term "indicator" is here used to refer to a factor that is not a criterion (i.e., not one of the defining
components of the job). Hence good simulations--e.g., the classical typing test for selecting
typists--are exempt from these remarks, which apply only to 'empirically validated' indicators such
as performance on proprietary tests or demographic variables.
9. It is discussed at greater length in the writer's contribution to Research-Based Teacher Evaluation
(1990). It is still denied by leading specialists in personnel evaluation, many of whose standard
procedures are threatened by it (the use of 'empirically validated' tests).
10. There are some cases where the contribution has been and may be significantly different from zero.
For example, when the theory violates a paradigm, a specialist in the evaluation of paradigms--
someone with a background in both history and philosophy of science--may be able to contribute a
useful perspective or analogy.
11. E.g., they will include Excellent, Above Average, Average, Below Average, Unacceptable. Since the
average performance may be excellent, or unacceptable, this fails to meet the minimum requirement
for a scale (exclusive categories). The converse error is a scale like this: Outstanding, Good,
Acceptable, Weak, Weakest. 'Grading on the curve' is another good example of total category
confusion, for well-known reasons.
12. Many other examples are given in ET4.
13. By early next year, the present author hopes to do this in a monograph for the Sage Social Science
Methodology Series called General Evaluation Methodology.
14. The most obvious is that the standard procedure, which allocates say 100 points across half a dozen
dimensions of merit, ignores the existence of absolute minima on some of the scales. This means--to
give an extreme example--that a proposal which happens to be on the wrong topic, but which is
staffed by great staff from a great institution at a modest price, could in principle win a competitive
grant by picking up enough points on several of the dimensions to overcome its zero on relevance.
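
A small numerical sketch (the weights, scores, and the 0.4 floor below are invented) makes the flaw concrete: under a straightforward 100-point weight-and-sum with no absolute minimum on any dimension, an off-topic proposal can outscore a clearly relevant one.

```python
# Invented illustration of the flaw described above: a 100-point
# weight-and-sum scheme with no absolute minimum on any dimension.

weights = {"relevance": 30, "staff": 25, "institution": 25, "cost": 20}

# Each proposal's performance on a dimension, as a fraction of the points available.
off_topic = {"relevance": 0.0, "staff": 1.0, "institution": 1.0, "cost": 1.0}
relevant  = {"relevance": 1.0, "staff": 0.5, "institution": 0.5, "cost": 0.5}

def weighted_total(scores):
    return sum(weights[d] * scores[d] for d in weights)

print(weighted_total(off_topic))  # 70.0 -- wins despite a zero on relevance
print(weighted_total(relevant))   # 65.0

# Imposing an absolute minimum on the critical dimension removes the anomaly:
# any proposal below the floor is screened out before points are summed.
def acceptable(scores, floor=0.4):
    return scores["relevance"] >= floor

print(acceptable(off_topic))  # False
```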
15. Some of this section is an improved version of parts of a much longer article, "Hard-Won Lessons in
Program Evaluation", in the June 1993 issue of New Directions in Program Evaluation (Jossey-
Bass).
16. This is an improved version of a classification that appeared in Scriven (1993).
17. Although he is often wrongly thought of as never questioning program goals.
18. Alkin recently reviewed his original definition of evaluation after 21 years, and still could not bring
himself to include any reference to merit, worth, or value. He defines it as the collection and
presentation of data summaries for decision-makers, which is of course the definition of MIS
(management information systems). See pp. 93-96 in Alkin (1991).
19. By contrast, Position A was put forward by educational researchers, who were less committed to the
paradigm of value-free social science, possibly because their discipline includes history and
philosophy of education, comparative education, and educational administration, which have quite
different paradigms.
20. This attempt to replace evaluation with explanation is reminiscent of the last stand of psychotherapists
faced with the put-up-or-shut-up attitude of those doing outcome studies in the 1950s and 1960s. The
therapists, notably the psychoanalysts, tried to replace remediation with explanation, arguing that the
payoff from psychotherapy was improved understanding by the patients of their conditions, rather
than symptom reduction. This was not a popular view amongst patients who were in pain and paying
heavily to reduce it--they thought.
21. A relativistic evaluative statement is something like: "If you value so-and-so, then this will be a good
program for you", or "The program was very successful in meeting its goals", or "If technology
education should be accessible to girls as easily as boys, then this program will help bring that about".
These claims--of course, these are simple examples--express an evaluative conclusion only relative
to the client's or the consumer's values. A direct evaluative claim, by contrast, while it can be
'relativistic' in another sense--that is, comparative or conditional--will contain an evaluative claim by
its author about the program under evaluation. For example: "This program is not cost-effective
compared to the use of traditional methods", or "This is the best of the options", or "These side-effects
are extremely unfortunate".
22. The connoisseurship model also weakens the evaluative component in evaluation, reducing it to the
largely subjective model of a connoisseur's judgments. The connoisseur is highly knowledgeable, but
the knowledge is in a domain where it only changes but does not validate its owner's evaluations.
23. These points are covered in some detail in the Evaluation Thesaurus entries on the logic of evaluation
and not repeated here, since the arguments are of rather specialized interest, although the issue is of
crucial importance.
24. In the famous Astin case, the Director was asked to do a study of the effect of the battery
additive AD-X2 prior to governmental purchase of it for the vehicle fleet. The additive had no effect,
as was apparent from a simple control-group study of government vehicles, and reporting that result
cost Astin his job (although media pressure eventually got him reinstated). A look at the process of
evaluation for textbooks, and its political ambience, provides what may be an even clearer example of
product evaluation as involving the same political dimensions as program evaluation.
References

Alkin, M. (1991). Evaluation theory development: II. In M. McLaughlin & D. Phillips (Eds.),
Evaluation and education at quarter century (pp. 91-112). Chicago: NSSE/University of Chicago.

Cronbach, L.J., Robinson Ambron, S., Dornbusch, S.M., Hess, R.D., Hornik, R.C., Phillips, D.C.,
Walker, D.F., & Weiner, S.S. (1980). Toward reform of program evaluation: Aims, methods, and
institutional arrangements. San Francisco: Jossey-Bass.

Eder, R.W., & Ferris, G.R. (Eds.). (1989). The employment interview: Theory, research, and practice.
Newbury Park, California: Sage.

Evaluation Thesaurus (1991). 4th edition. Newbury Park, California: Sage.

Guba, E., & Lincoln, Y. (1989). Fourth generation evaluation. Newbury Park, California: Sage.

House, E. (1989). Evaluating with validity. Newbury Park, California: Sage.

Rossi, P.H., & Freeman, H.E. (1989). Evaluation: A systematic approach. Newbury Park, California: Sage.

Scriven, M. (1990). Can research-based teacher evaluation be saved? In R.L. Schwab (Ed.),
Research-based teacher evaluation (pp. 12-32). Boston: Kluwer.

Scriven, M. (1993). Hard-won lessons in program evaluation. New Directions in Program Evaluation
(June). San Francisco: Jossey-Bass.

Spector, P. (1992). Summated rating scale construction. Newbury Park, California: Sage.

Stufflebeam, D.L., Foley, W.J., Gephart, W.J., Guba, E.G., Hammond, R.L., Merriman, H.O., &
Provus, M.M. (1971). Educational evaluation and decision making. Itasca, IL: Peacock.

The Author

MICHAEL SCRIVEN has degrees in mathematics and philosophy from Melbourne and
Oxford, and has taught and published in those areas and in psychology, education,
computer science, jurisprudence, and technology studies. He was on the faculty of
UC/Berkeley for twelve years. He was the first president of what is now the American
Evaluation Association, founding editor of its journal (now called Evaluation Practice),
and recipient of its Lazarsfeld Medal for contributions to evaluation theory.
