You are on page 1of 81

EXPERIMENTALAND

QUASI-EXPERIMENTAL
DESIGNSFORGENERALIZED
ii:.

CAUSALINFERENCE

William R. Shadish
Trru UNIvERSITYop MEvPrrts

** Thomas D. Cook
.jr-*"-
iLli"
NonrrrwpsrERN UNrvPnslrY
'"+.'-, ,

fr
Donald T. Campbell

HOUGHTONMIFFLINCOMPANY Boston New York


and
Experiments
Causal
Generalized
lnference
Ex.per'i'ment (ik-spEr'e-mant):[Middle English from Old French from Latin
experimentum, from experiri, to try; seeper- in Indo-European Roots.]
n. Abbr. exp., expt, 1. a. A test under controlled conditions that is
made to demonstratea known truth, examine the validity of a hypothe-
sis, or determine the efficacyof something previously untried' b. The
processof conducting such a test; experimentation. 2' An innovative
"Democracy is only an experiment in gouernment"
act or procedure:
(.V{illiam Ralph lnge).

Cause (k6z): [Middle English from Old French from Latin causa' teason,
purpose.] n. 1. a. The producer of an effect, result, or consequence.
b. The one, such as a person, an event' or a condition, that is responsi-
ble for an action or a result. v. 1. To be the causeof or reason for; re-
sult in. 2. To bring about or compel by authority or force.

o MANv historians and philosophers,the increasedemphasison experimenta-


tion in the 15th and L7th centuriesmarked the emergenceof modern science
from its roots in natural philosophy (Hacking, 1983). Drake (1981) cites
'1.6'!.2 'Water,
Galileo's treatrseBodies Tbat Stay Atop or Moue in It as usheringin
modern experimental science,but earlier claims can be made favoring \Tilliam
Gilbert's1,600study Onthe Loadstoneand MagneticBodies,Leonardoda Vinci's
(1,452-1.51.9)many investigations, and perhapseventhe Sth-centuryB.C.philoso-
pher Empedocles,who used various empirical demonstrationsto argue against
'1.969a,
Parmenides(Jones, 1'969b).In the everyday senseof the term, humans
have beenexperimentingwith different ways of doing things from the earliestmo-
ments of their history. Suchexperimentingis as natural a part of our life as trying
a new recipe or a different way of starting campfires.
z | 1. EXeERTMENTs
ANDGENERALTzED
cAUsALINFERENcE
I

However, the scientific revolution of the 1.7thcentury departed in three ways


from the common use of observation in natural philosophy atthat time. First, it in-
creasingly used observation to correct errors in theory. Throughout historg natu-
ral philosophers often used observation in their theories, usually to win philo-
sophical arguments by finding observations that supported their theories.
However, they still subordinated the use of observation to the practice of deriving
theories from "first principles," starting points that humans know to be true by our
nature or by divine revelation (e.g., the assumedproperties of the four basic ele-
ments of fire, water, earth, and air in Aristotelian natural philosophy). According
to some accounts,this subordination of evidenceto theory degeneratedin the 17th
"The
century: Aristotelian principle of appealing to experiencehad degenerated
among philosophers into dependenceon reasoning supported by casual examples
and the refutation of opponents by pointing to apparent exceptions not carefully
'1,98"1.,
examined" (Drake, p. xxi).'Sfhen some 17th-century scholarsthen beganto
use observation to correct apparent errors in theoretical and religious first princi-
ples, they came into conflict with religious or philosophical authorities, as in the
case of the Inquisition's demands that Galileo recant his account of the earth re-
volving around the sun. Given such hazards,the fact that the new experimental sci-
ence tipped the balance toward observation and ^way from dogma is remarkable.
By the time Galileo died, the role of systematicobservation was firmly entrenched
as a central feature of science,and it has remained so ever since (Harr6,1981).
Second,before the 17th century, appeals to experiencewere usually basedon
passive observation of ongoing systemsrather than on observation of what hap-
pens after a system is deliberately changed. After the scientific revolution in the
L7th centurS the word experiment (terms in boldface in this book are defined in
the Glossary) came to connote taking a deliberate action followed by systematic
observationof what occurred afterward. As Hacking (1983) noted of FrancisBa-
con: "He taught that not only must we observenature in the raw, but that we must
'twist
also the lion's tale', that is, manipulate our world in order to learn its se-
crets" (p. U9). Although passiveobservation revealsmuch about the world, ac-
tive manipulation is required to discover some of the world's regularities and pos-
sibilities (Greenwood,, 1989). As a mundane example, stainless steel does not
occur naturally; humans must manipulate it into existence.Experimental science
came to be concerned with observing the effects of such manipulations.
Third, early experimenters realized the desirability of controlling extraneous
influences that might limit or bias observation. So telescopeswere carried to
higher points at which the air was clearer, the glass for microscopeswas ground
ever more accuratelg and scientistsconstructed laboratories in which it was pos-
sible to use walls to keep out potentially biasing ether waves and to use (eventu-
ally sterilized) test tubes to keep out dust or bacteria. At first, thesecontrols were
developed for astronomg chemistrg and physics, the natural sciencesin which in-
terest in sciencefirst bloomed. But when scientists started to use experiments in
areas such as public health or education, in which extraneous influences are
harder to control (e.g., Lind , 1,753lr,they found that the controls used in natural
AND CAUSATTONI I
EXPERTMENTS

sciencein the laboratoryworked poorly in thesenew applications.So they devel-


oped new methodsof dealingwith extraneousinfluence,such as random assign-
ment (Fisher,1,925)or addinga nonrandomizedcontrol group (Coover& Angell,
1.907).As theoreticaland observationalexperienceaccumulatedacrosstheseset-
tings and topics,more sourcesof bias were identifiedand more methodswere de-
velopedto copewith them (Dehue,2000).
TodaSthe key featurecommonto all experimentsis still to deliberatelyvary
somethingso asto discoverwhat happensto somethingelselater-to discoverthe
effectsof presumedcauses.As laypersonswe do this, for example,to assess what
happensto our blood pressureif we exercisemore, to our weight if we diet less,
or ro our behaviorif we read a self-helpbook. However,scientificexperimenta-
tion has developedincreasinglyspecializedsubstance,language,and tools, in-
that is the pri-
cluding the practiceof field experimentationin the socialsciences
mary focus of this book. This chapter begins to explore these matters by
(1) discussingthe natureof causationthat experimentstest,(2) explainingthe spe-
cializedterminology(e.g.,randomizedexperiments,quasi-experiments) that de-
(3)
scribessocial experiments, introducing the problem of how to generalize
causalconnectionsfrom individual experiments,and (4) briefly situatingthe ex-
perimentwithin a largerliteratureon the nature of science.

AND CAUSATION
EXPERIMENTS
A sensiblediscussionof experimentsrequiresboth a vocabularyfor talking about
causationand an understandingof key conceptsthat underliethat vocabulary.

DefiningCause,Effect,and CausalRelationships
Most peopleintuitively recognizecausalrelationshipsin their daily lives.For in-
stance,you may say that another automobile'shitting yours was a causeof the
damageto your car; that the number of hours you spentstudyingwas a causeof
your testgrades;or that the amountof food a friend eatswas a causeof his weight.
You may evenpoint to more complicatedcausalrelationships,noting that a low
test gradewas demoralizing,which reducedsubsequentstudying,which caused
evenlower grades.Here the samevariable(low grade)can be both a causeand an
effect,and there can be a reciprocal relationship betweentwo variables (low
gradesand not studying)that causeeachother.
Despitethis intuitive familiarity with causalrelationsbips,a precisedefinition
of causeand effecthaseludedphilosophersfor centuries.lIndeed,the definitions

1. Our analysisrefldctsthe useof the word causationin ordinary language,not the more detaileddiscussionsof
causeby philosophers.Readersinterestedin suchdetail may consult a host of works that we referencein this
chapter,includingCook and Campbell(1979).
4 | 1. EXPERTMENTS
AND GENERALTZED
CAUSAL
INFERENCE

of terms suchas cause and,effectdependpartly on eachother and on the causal


relationshipin which both are embedded.So the 17th-centuryphilosopherJohn
Locke said: "That which producesany simpleor complexidea,we denoteby the
generalnamecaLtse, and that which is produced, effect" (1,97s, p. 32fl and also:
" A cAtrseis that
which makesany other thing, either simpleidea, substance, or
mode,beginto be; and an effectis that, which had its beginningfrom someother
thing" (p. 325).Sincethen,otherphilosophers and scientistshavegivenus useful
definitionsof the threekey ideas--cause,effect,and causalrelationship-that are
more specificand that betterilluminatehow experimentswork. We would not de-
fend any of theseas the true or correctdefinition,giventhat the latter haseluded
philosophersfor millennia;but we do claignthat theseideashelp to clarify the sci-
entific practiceof probing causes.

Cause
'We
Considerthe causeof a forest fire. know that fires start in differentways-a
match tossedfrom a ca\ a lightning strike, or a smolderingcampfire,for exam-
ple. None of thesecausesis necessarybecausea forest fire can start evenwhen,
say'a match is not present.Also, none of them is sufficientto start the fire. After
all, a match must stay "hot" long enoughto start combustion;it must contact
combustiblematerial suchas dry leaves;theremust be oxygenfor combustionto
occur; and the weather must be dry enoughso that the leavesare dry and the
match is not dousedby rain. So the match is part of a constellationof conditions
without which a fire will not result,althoughsomeof theseconditionscan be usu-
ally takenfor granted,suchasthe availabilityof oxygen.A lightedmatchis, rhere-
fore, what Mackie (1,974)called an inus condition-"an insufficient but non-
redundantpart of an unnecessary but sufficient condition" (p. 62; italicsin orig-
inal). It is insufficientbecausea match cannot start a fire without the other con-
ditions. It is nonredundant only if it adds something fire-promoting that is
uniquelydifferent from what the other factors in the constellation(e.g.,oxygen,
dry leaves)contributeto startinga fire; after all,it would beharderro saywhether
the match causedthe fire if someoneelsesimultaneouslytried startingit with a
cigarettelighter.It is part of a sufficientcondition to start a fire in combination
with the full constellationof factors.But that condition is not necessary because
thereare other setsof conditionsthat can also start fires.
A researchexampleof an inus condition concernsa new potentialtreatment
for cancer.In the late 1990s,a teamof researchers in Bostonheadedby Dr. Judah
Folkman reportedthat a new drug calledEndostatinshrank tumors by limiting
their blood supply (Folkman, 1996).Other respectedresearchers could not repli-
catethe effectevenwhen usingdrugsshippedto them from Folkman'slab. Scien-
tists eventuallyreplicatedthe resultsafter they had traveledto Folkman'slab to
learnhow to properlymanufacture,transport,store,and handlethe drug and how
to inject it in the right location at the right depth and angle.One observerlabeled
thesecontingenciesthe "in-our-hands" phenomenon,meaning "even we don't
AND CAUSATIONI S
EXPERIMENTS

know which details are important, so it might take you some time to work it out"
(Rowe, L999, p.732). Endostatin was an inus condition. It was insufficientcause
by itself, and its effectivenessrequired it to be embedded in a larger set of condi-
tions that were not even fully understood by the original investigators.
Most causesare more accurately called inus conditions. Many factors are usu-
ally required for an effectto occur, but we rarely know all of them and how they
relate to each other. This is one reason that the causal relationships we discussin
this book are not deterministic but only increasethe probability that an effect will
occur (Eells,1,991,;Holland, 1,994).It also explains why a given causalrelation-
ship will occur under some conditions but not universally across time, space,hu-
-"r pop,rlations, or other kinds of treatments and outcomes that are more or less
related io those studied. To different {egrees, all causal relationships are context
dependent,so the generalizationof experimental effects is always at issue.That is
*hy *. return to such generahzationsthroughout this book.

Effect
'We that
can better understand what an effect is through a counterfactual model'l'973'
goes back at least to the 18th-century philosopher David Hume (Lewis,
p. SSel. A counterfactual is something that is contrary to fact. In an experiment,
ie obseruewhat did happez when people received a treatment. The counterfac-
tual is knowledge of what would haue happened to those same people if they si-
multaneously had not receivedtreatment. An effect is the difference betweenwhat
did happen and what would have happened.
'We
cannot actually observe a counterfactual. Consider phenylketonuria
(PKU), a genetically-based metabolic diseasethat causesmental retardation unless
treated during the first few weeks of life. PKU is the absenceof an enzyme that
would otherwise prevent a buildup of phenylalanine, a substance toxic to the
nervous system. Vhen a restricted phenylalanine diet is begun early and main-
tained, reiardation is prevented. In this example, the causecould be thought of as
the underlying genetic defect, as the enzymatic disorder, or as the diet. Each im-
plies a difierenicounterfactual. For example, if we say that a restricted phenyl-
alanine diet causeda decreasein PKU-basedmental retardation in infants who are
phenylketonuric
'h"d at birth, the counterfactual is whatever would have happened
t'h.r. sameinfants not receiveda restricted phenylalanine diet. The samelogic
applies to the genetic or enzymatic version of the cause. But it is impossible for
theseu.ry ,"-i infants simultaneously to both have and not have the diet, the ge-
netic disorder, or the enzyme deficiency.
So a central task for all cause-probing research is to create reasonable ap-
proximations to this physically impossible counterfactual. For instance, if it were
ethical to do so, we might contrast phenylketonuric infants who were given the
diet with other phenylketonuric infants who wer€ not given the diet but who were
similar in many ways to those who were (e.g., similar face) gender,age, socioeco-
nomic status, health status). Or we might (if it were ethical) contrast infants who
I

6 I 1. EXPERIMENTS
ANDGENERALIZED
CAUSAL
INFERENCE

were not on the diet for the first 3 months of their lives with those same infants
after they were put on the diet starting in the 4th month. Neither of these ap-
proximations is a true counterfactual. In the first case,the individual infants in the
treatment condition are different from those in the comparison condition; in the
second case, the identities are the same, but time has passedand many changes
other than the treatment have occurred to the infants (including permanent dam-
age done by phenylalanine during the first 3 months of life). So two central tasks
in experimental design are creating a high-quality but necessarilyimperfect source
of counterfactual inference and understanding how this source differs from the
treatment condition.
This counterfactual reasoning is fundarnentally qualitative becausecausal in-
ference, even in experiments, is fundamentally qualitative (Campbell, 1975;
Shadish, 1995a; Shadish 6c Cook, 1,999). However, some of these points have
been formalized by statisticiansinto a specialcasethat is sometimescalled Rubin's
"1.974,'1.977,1978,79861.
CausalModel (Holland, 1,986;Rubin, This book is not
about statistics, so we do not describethat model in detail ('West,Biesanz,& Pitts
[2000] do so and relate it to the Campbell tradition). A primary emphasisof Ru-
bin's model is the analysis of causein experiments, and its basic premisesare con-
sistent with those of this book.2 Rubin's model has also been widely used to ana-
lyze causal inference in case-control studies in public health and medicine
(Holland 6c Rubin, 1988), in path analysisin sociology (Holland,1986), and in
a paradox that Lord (1967) introduced into psychology (Holland 6c Rubin,
1983); and it has generatedmany statistical innovations that we cover later in this
book. It is new enough that critiques of it are just now beginning to appear (e.g.,
Dawid, 2000; Pearl, 2000). tUfhat is clear, however, is that Rubin's is a very gen-
eral model with obvious and subtle implications. Both it and the critiques of it are
required material for advanced students and scholars of cause-probingmethods.

CausalRelationship
How do we know if cause and effect are related? In a classic analysis formalized
by the 19th-century philosopher John Stuart Mill, a causal relationship exists if
(1) the causeprecededthe effect, (2) the causewas related to the effect,and (3) we
can find no plausible alternative explanation for the effect other than the cause.
These three characteristics mirror what happens in experiments in which (1) we
manipulate the presumed cause and observe an outcome afterward; (2) we see
whether variation in the cause is related to variation in the effect; and (3) we use
various methods during the experiment to reduce the plausibility of other expla-
nations for the effect, along with ancillary methods to explore the plausibility of
those we cannot rule out (most of this book is about methods for doing this).

2. However, Rubin's model is not intended to say much about the matters of causal generalization that we address
in this book.
EXPERTMENTS | 7
AND CAUSATTON
I

Henceexperimentsare well-suitedto studyingcausalrelationships.No other sci-


entificmethodregularlymatchesthe characteristics of causalrelationshipssowell.
Mill's analysisalsopointsto the weaknessof other methods. In many correlational
studies,for example,it is impossibleto know which of two variablescamefirst,
so defendinga causalrelationshipbetweenthem is precarious.Understandingthis
logic of causalrelationshipsand how its key terms,suchas causeand effect,are
definedhelpsresearchers to critique cause-probingstudies.

and Confounds
Correlation,
Causation,
A well-known maxim in research is: Correlation does not proue causation. This is
so becausewe may not know which variable came first nor whether alternative ex-
planations for the presumed effectexist. For example, supposeincome and educa-
tion are correlated.Do you have to have a high income before you can aff.ordto pay
for education,or do you first have to get a good education before you can get a bet-
ter paying job? Each possibility may be true, and so both need investigation.But un-
til those investigationsare completed and evaluatedby the scholarly communiry a
simple correlation doesnot indicate which variable came first. Correlations also do
little to rule out alternative explanations for a relationship between two variables
such as education and income. That relationship may not be causal at all but rather
due to a third variable (often called a confound), such as intelligence or family so-
cioeconomicstatus,that causesboth high education and high income. For example,
if high intelligencecausessuccessin education and on the job, then intelligent peo-
ple would have correlatededucation and incomes,not becauseeducation causesin-
come (or vice versa) but becauseboth would be causedby intelligence.Thus a cen-
tral task in the study of experiments is identifying the different kinds of confounds
that can operate in a particular researcharea and understanding the strengthsand
weaknessesassociatedwith various ways of dealing with them

Causes
and Nonmanipulable
Manipulable
In the intuitive understandingof experimentationthat most peoplehave,it makes
senseto say,"Let's seewhat happensif we requirewelfarerecipientsto work"; but
it makesno senseto say,"Let's seewhat happensif I changethis adult maleinto a
three-year-old girl." And so it is alsoin scientificexperiments.Experimentsexplore
the effectsof things that can be manipulated, such as the dose of a medicine,the
amount of a welfarecheck,the kind or amount of psychotherapyor the number
of childrenin a classroom.Nonmanipulableevents(e.g.,the explosionof a super-
nova) or attributes(e.g.,people'sages,their raw geneticmaterial,or their biologi-
cal sex)cannotbe causesin experimentsbecause we cannotdeliberatelyvary them
to seewhat then happens.Consequently, most scientistsand philosophersagree
that it is much harderto discoverthe effectsof nonmanipulablecauses.
I

8 1. EXeERTMENTS cAUsALTNFERENcE
ANDGENERALTzED
|

To be clear,we are not arguing that all causesmust be manipulable-only that


experimental causesmust be so. Many variables that we correctly think of as causes
are not directly manipulable. Thus it is well establishedthat a geneticdefect causes
PKU even though that defect is not directly manipulable.'We can investigatesuch
causesindirectly in nonexperimental studiesor even in experimentsby manipulat-
ing biological processesthat prevent the gene from exerting its influence, as
through the use of diet to inhibit the gene'sbiological consequences.Both the non-
manipulable gene and the manipulable diet can be viewed as causes-both covary
with PKU-basedretardation, both precedethe retardation, and it is possibleto ex-
plore other explanations for the gene'sand the diet's effectson cognitive function-
ing. However, investigating the manipulablc diet as a causehas two important ad-
vantages over considering the nonmanipulable genetic problem as a cause.First,
only the diet provides a direct action to solve the problem; and second,we will see
that studying manipulable agents allows a higher quality source of counterfactual
inferencethrough such methods as random assignment.\fhen individuals with the
nonmanipulable genetic problem are compared with personswithout it, the latter
are likely to be different from the former in many ways other than the genetic de-
fect. So the counterfactual inference about what would have happened to those
with the PKU genetic defect is much more difficult to make.
Nonetheless,nonmanipulable causesshould be studied using whatever means
are availableand seemuseful. This is true becausesuch causeseventuallyhelp us
to find manipulable agents that can then be used to ameliorate the problem at
hand. The PKU example illustrates this. Medical researchersdid not discover how
to treat PKU effectively by first trying different diets with retarded children. They
first discovered the nonmanipulable biological features of retarded children af-
fected with PKU, finding abnormally high levels of phenylalanine and its associ-
ated metabolic and genetic problems in those children. Those findings pointed in
certain ameliorative directions and away from others, leading scientiststo exper-
iment with treatments they thought might be effective and practical. Thus the new
diet resulted from a sequenceof studies with different immediate purposes, with
different forms, and with varying degreesof uncertainty reduction. Somewere ex-
perimental, but others were not.
Further, analogue experiments can sometimes be done on nonmanipulable
causes,that is, experiments that manipulate an agent that is similar to the cause
of interest. Thus we cannot change a person's race, but we can chemically induce
skin pigmentation changes in volunteer individuals-though such analogues do
not match the reality of being Black every day and everywhere for an entire life.
Similarly past events,which are normally nonmanipulable, sometimesconstitute
a natural experiment that may even have been randomized, as when the 1'970
Vietnam-era draft lottery was used to investigate a variety of outcomes (e.g., An-
grist, Imbens, & Rubin, 1.996a;Notz, Staw, & Cook, l97l).
Although experimenting on manipulable causesmakes the job of discovering
their effectseasier,experiments are far from perfect means of investigating causes.
I
EXPERIMENTS
AND CAUSATIONI 9

Sometimesexperiments modify the conditions in which testing occurs in a way


that reducesthe fit between those conditions and the situation to which the results
are to be generalized.Also, knowledge of the effects of manipulable causestells
nothing about how and why those effectsoccur. Nor do experiments answer many
other questions relevant to the real world-for example, which questions are
worth asking, how strong the need for treatment is, how a cause is distributed
through societg whether the treatment is implemented with theoretical fidelitS
and what value should be attached to the experimental results.
In additioq, in experiments,we first manipulate a treatment and only then ob-
serveits effects;but in some other studieswe first observean effect, such as AIDS,
and then search for its cause, whether manipulable or not. Experiments cannot
help us with that search. Scriven (1976) likens such searchesto detective work in
which a crime has been committed (..d., " robbery), the detectivesobservea par-
ticular pattern of evidencesurrounding the crime (e.g.,the robber wore a baseball
cap and a distinct jacket and used a certain kind of Bun), and then the detectives
searchfor criminals whose known method of operating (their modus operandi or
m.o.) includes this pattern. A criminal whose m.o. fits that pattern of evidence
then becomesa suspect to be investigated further. Epidemiologists use a similar
method, the case-control design (Ahlbom 6c Norell, 1,990),in which they observe
a particular health outcome (e.g., an increasein brain tumors) that is not seen in
another group and then attempt to identify associatedcauses(e.g., increasedcell
phone use). Experiments do not aspire to answer all the kinds of questions, not
even all the types of causal questions, that social scientistsask.

and CausalExplanation
CausalDescription
The uniquestrengthof experimentationis in describingthe consequences attrib-
utableto deliberatelyvaryinga treatment.'Wecall this causaldescription.In con-
trast, experimentsdo lesswell in clarifying the mechanismsthrough which and
the conditionsunder which that causalrelationshipholds-what we call causal
explanation.For example,most childrenvery quickly learnthe descriptivecausal
relationshipbetweenflicking a light switch and obtainingillumination in a room.
However,few children (or evenadults)can fully explain why that light goeson.
To do so, they would haveto decomposethe treatment(the act of flicking a light
switch)into its causallyefficaciousfeatures(e.g.,closingan insulatedcircuit) and
its nonessentialfeatures(e.g.,whetherthe switch is thrown by hand or a motion
detector).They would haveto do the samefor the effect (eitherincandescentor
fluorescentlight can be produced,but light will still be produced whether the
light fixture is recessedor not). For full explanation,they would then have to
show how the causallyefficaciousparts of the treatmentinfluencethe causally
affectedparts of the outcomethrough identified mediating processes(e.g.,the
I

ANDGENERALIZED
1O I T. CXPTRIMENTS INFERENCE
CAUSAL

passageof electricity through the circuit, the excitation of photons).3 ClearlS the
causeof the light going on is a complex cluster of many factors. For those philoso-
phers who equate cause with identifying that constellation of variables that nec-
essarily inevitably and infallibly results in the effect (Beauchamp,1.974),talk of
cause is not warranted until everything of relevanceis known. For them, there is
no causal description without causal explanation. Whatever the philosophic mer-
its of their position, though, it is not practical to expect much current social sci-
ence to achieve such complete explanation.
The practical importance of causal explanation is brought home when the
switch fails to make the light go on and when replacing the light bulb (another
easily learned manipulation) fails to solva the problem. Explanatory knowledge
then offers clues about how to fix the problem-for example, by detecting and re-
pairing a short circuit. Or if we wanted to create illumination in a place without
lights and we had explanatory knowledge, we would know exactly which features
of the cause-and-effectrelationship are essentialto create light and which are ir-
relevant. Our explanation might tell us that there must be a source of electricity
but that that source could take several different molar forms, such as abattery, a
generator, a windmill, or a solar array. There must also be a switch mechanism to
close a circuit, but this could also take many forms, including the touching of two
bare wires or even a motion detector that trips the switch when someone enters
the room. So causal explanation is an important route to the generalization of
causal descriptions becauseit tells us which features of the causal relationship are
essentialto transfer to other situations.
This benefit of causal explanation helps elucidate its priority and prestige in
all sciencesand helps explain why, once a novel and important causal relationship
is discovered, the bulk of basic scientific effort turns toward explaining why and
how it happens. Usuallg this involves decomposing the causeinto its causally ef-
fective parts, decomposing the effects into its causally affected parts, and identi-
fying the processesthrough which the effective causal parts influence the causally
affected outcome parts.
These examplesalso show the close parallel between descriptive and explana-
tory causation and molar and molecular causation.aDescriptive causation usually
concerns simple bivariate relationships between molar treatments and molar out-
comes, molar here referring to a package that consistsof many different parts. For
instance, we may find that psychotherapy decreasesdepression,a simple descrip-
tive causal relationship benveen a molar treatment package and a molar outcome.
However, psychotherapy consists of such parts as verbal interactions, placebo-

3. However, the full explanationa physicistwould offer might be quite different from this electrician's
explanation,perhapsinvoking the behaviorof subparticles.This differenceindicatesiust how complicatedis the
notion of explanationand how it can quickly becomequite complex once one shifts levelsof analysis.
4. By molar, we mean somethingtaken as a whole rather than in parts. An analogyis to physics,in which molar
might refer to the propertiesor motions of masses,as distinguishedfrom those of moleculesor atomsthat make up
thosemasses.
AND CAUSATIONI 11
EXPERIMENTS
I

generating procedures, setting characteristics,time constraints, and payment for


services.Similarly, many depression measuresconsist of items pertaining to the
physiological,cognitive, and affectiveaspectsof depression.Explan atory causation
breaks thesemolar causesand effectsinto their molecular parts so as to learn, say,
that the verbal interactions and the placebo featuresof therapy both causechanges
in the cognitive symptoms of depression,but that payment for servicesdoes not do
so even though it is part of the molar treatment package.
If experiments are less able to provide this highly-prized explanatory causal
knowledge, why.are experimentsso central to science,especiallyto basic social sci-
ence,in which theory and explanation are often the coin of the realm? The answer is
that the dichotomy ber'*reendescriptive and explanatory causation is lessclear in sci-
entific practice than in abstract discussionsabout causation.First, many causal ex-
planatironsconsist of chains of descriptivi causal links in which one event causesthe
next. Experiments help to test the links in each chain. Second,experiments help dis-
tinguish betweenthe validity of competing explanatory theories, for example, by test-
ing competing mediating links proposed by those theories. Third, some experiments
test whether a descriptive causal relationship varies in strength or direction under
Condition A versus Condition B (then the condition is a moderator variable that ex-
plains the conditions under which the effect holds). Fourth, some experimentsadd
quantitative or qualitative observations of the links in the explanatory chain (medi-
ator variables) to generateand study explanations for the descriptive causal effect.
Experiments are also prized in applied areas of social science,in which the
identification of practical solutions to social problems has as great or even greater
priority than explanations of those solutions. After all, explanation is not always
required for identifying practical solutions. Lewontin (1997) makes this point
about the Human Genome Project, a coordinated multibillion-dollar research
program ro map the human genome that it is hoped eventually will clarify the ge-
netic causesof diseases.Lewontin is skeptical about aspectsof this search:
'!ilhat
is involvedhereis the differencebetweenexplanationand intervention.Many
disorderscan be explainedby the failureof the organismto makea normal protein,a
failurethat is the consequence of a genemutation.But interuentionrequiresthat the
normalproteinbe providedat the right placein the right cells,at the right time and in
the right amount,or elsethat an alternativeway be found to providenormal cellular
function.'Whatis worse,it might evenbenecessary to keepthe abnormalproteinaway
from the cellsat criticalmoments.None of theseobjectivesis servedby knowing the
"1,997,
DNA sequence of the defectivegene.(Lewontin, p.29)

Practical applications are not immediately revealedby theoretical advance.In-


stead, to reveal them may take decadesof follow-up work, including tests of sim-
ple descriptive causal relationships. The same point is illustrated by the cancer
drug Endostatin, discussedearlier. Scientistsknew the action of the drug occurred
through cutting off tumor blood supplies; but to successfullyuse the drug to treat
cancersin mice required administering it at the right place, angle, and depth, and
those details were not part of the usual scientific explanation of the drug's effects.
12 I 1. EXPERTMENTS
AND GENERALTZED TNFERENCE
CAUSAL
I

In the end,then,causaldescriptionsand causalexplanationsarein delicatebal-


ancein experiments.'$7hat experimentsdo bestis to improvecausaldescriptions;
they do lesswell at explainingcausalrelationships.But most experimentscan be
designedto providebetterexplanationsthan is typicallythe casetoday.Further,in
focusingon causaldescriptions,experimentsoften investigatemolar eventsthat
may be less strongly related to outcomesthan are more molecularmediating
processes, especiallythoseprocesses
that are closerto the outcomein the explana-
tory chain. However,many causaldescriptionsare still dependableand strong
enoughto be useful,to be worth making the building blocks around which im-
portant policiesand theoriesare created.Just considerthe dependabilityof such
causalstatements asthat schooldesegregation causeswhite flight, or that outgroup
threat causesingroup cohesion,or that psychotherapyimprovesmentalhealth,or
that diet reducesthe retardationdueto PKU. Suchdependable causalrelationships
are usefulto policymakers,practitioners,and scientistsalike.

MODERNDESCRIPTIONS
OF EXPERIMENTS
Some of the terms used in describing modern experimentation (seeTable L.L) are
unique, clearly defined, and consistently used; others are blurred and inconsis-
tently used. The common attribute in all experiments is control of treatment
(though control can take many different forms). So Mosteller (1990, p. 225)
writes, "fn an experiment the investigator controls the application of the treat-
ment"l and Yaremko, Harari, Harrison, and Lynn (1,986,p.72) write, "one or
more independent variables are manipulated to observe their effects on one or
more dependentvariables." However, over time many different experimental sub-
types have developed in responseto the needs and histories of different sciences
'Winston
('Winston, 1990; 6c Blais, 1.996\.

TABLE1.1TheVocabularyof Experiments

Experiment:
A studyin whichan intervention
is deliberately to observe
introduced itseffects.
RandomizedExperiment:
An experiment in whichunitsareassignedto receivethe treatmentor
conditionby a randomprocess
an alternative suchasthe tossof a coinor a tableof
randomnumbers.
An experiment
Quasi-Experiment: to conditions
in whichunitsarenot assigned randomly.
NaturalExperiment:
Not reallyan experiment
because the causeusuallycannotbe
manipulated;
a studythat contrasts
a naturally
occurringeventsuchasan earthquake
with
a comoarisoncondition.
Correlational
Study:Usuallysynonymouswith nonexperimentalor observational
study;a study
thatsimplyobserves
the sizeanddirection amongvariables.
of a relationship
I

I tr
OF EXPERIMENTS
MODERNDESCRIPTIONS

Experiment
Randomized
The most clearlydescribedvariant is the randomizedexperiment,widely credited
to Sir RonaldFisher(1,925,1926).Itwas first usedin agriculturebut laterspread
to other topic areasbecauseit promisedcontrol over extraneoussourcesof vari-
ation without requiringthe physicalisolationof the laboratory.Its distinguishing
featureis clear and important-that the varioustreatmentsbeingcontrasted(in-
cludingno treatmentat all) are assignedto experimentalunits' by chance,for ex-
ample,by cointossor useof a table of random numbers.If implementedcorrectlS
,"rdo- assignmentcreatestwo or more groupsof units that are probabilistically
similarto .".h other on the average.6Hence,any outcomedifferencesthat are ob-
servedbetweenthosegroupsat the end,ofa study arelikely to be dueto treatment'
not to differencesbetweenthe groupsthat alreadyexistedat the start of the study.
Further,when certainassumptionsare met, the randomizedexperimentyieldsan
estimateof the sizeof a treatmenteffectthat has desirablestatisticalproperties'
along with estimatesof the probability that the true effectfalls within a defined
confidenceinterval.Thesefeaturesof experimentsare so highly prized that in a
researchareasuchas medicinethe randomizedexperimentis often referredto as
the gold standardfor treatmentoutcomeresearch.'
Closelyrelatedto the randomizedexperimentis a more ambiguousand in-
consistentlyusedterm, true experiment.Someauthorsuseit synonymouslywith
randomizedexperiment(Rosenthal& Rosnow,1991').Others useit more gener-
ally to refer to any studyin which an independentvariableis deliberately manip-
'We
ulated (Yaremkoet al., 1,9861anda dependentvariableis assessed. shall not
usethe term at all givenits ambiguity and given that the modifier true seemsto
imply restrictedclaimsto a singlecorrectexperimentalmethod.

Quasi-Experiment
Much of this book focuseson a class of designsthat Campbell and Stanley
(1,963)popularizedasquasi-experiments.s sharewith all other
Quasi-experiments

5. Units can be people,animals,time periods,institutions,or almost anything else.Typically in field


experimentationthey are peopleor someaggregateof people,such as classroomsor work sites.In addition, a little
thought showsthat random assignmentof units to treatmentsis the sameas assignmentof treatmentsto units, so
thesephrasesare frequendyusedinterchangeably'
6. The word probabilisticallyis crucial, as is explainedin more detail in Chapter 8.
7. Although the rerm randomized experiment is used this way consistently acrossmany fields and in this book,
statisticianssometimesuse the closely related term random experiment in a different way to indicate experiments
for which the outcomecannor be predictedwith certainry(e.g.,Hogg & Tanis, 1988).
8. Campbell (1957) first calledthesecompromisedesignsbut changedterminologyvery quickly; Rosenbaum
(1995a\ and Cochran (1965\ referto theseas observationalstudies,a term we avoid becausemany peopleuseit to
refer to correlationalor nonexperimentalstudies,as well. Greenbergand Shroder(1997) usequdsi-etcperiment to
refer to studiesthat randomly assigngroups (e.g.,communities)to conditions,but we would considerthesegroup-
randomizedexperiments(Murray' 1998).
I

14 I 1. EXPERIMENTS
AND GENERALIZED
CAUSAL
INFERENCE
I

experiments a similar purpose-to test descriptivecausal hypothesesabout manip-


ulable causes-as well as many structural details, such as the frequent presenceof
control groups and pretest measures,to support a counterfactual inference about
what would have happened in the absenceof treatment. But, by definition, quasi-
experiments lack random assignment. Assignment to conditions is by means of self-
selection,by which units choosetreatment for themselves,or by meansof adminis-
trator selection,by which teachers,bureaucrats,legislators,therapists,physicians, t

or others decide which persons should get which treatment. Howeveq researchers
who use quasi-experimentsmay still have considerablecontrol over selectingand rl
schedulingmeasures,over how nonrandom assignmentis executed,over the kinds
of comparison groups with which treatment,groups are compared, and over some
aspectsof how treatment is scheduled.As Campbell and Stanleynote:

There are many natural socialsettingsin which the researchpersoncan introduce


somethinglike experimentaldesigninto his schedulingof data collectionprocedures
(e.g.,the uhen and to whom of measurement), eventhough he lacksthe full control
over the schedulingof experimentalstimuli (the when and to wltom of exposureand
the ability to randomizeexposures)which makesa true experimentpossible.Collec-
tively,such situationscan be regardedas quasi-experimental designs.(Campbell&
StanleS1,963, p. 34)

In quasi-experiments,the causeis manipulable and occurs before the effect is


measured. However, quasi-experimental design features usually create less com-
pelling support for counterfactual inferences. For example, quasi-experimental
control groups may differ from the treatment condition in many systematic(non-
random) ways other than the presenceof the treatment Many of theseways could
be alternative explanations for the observed effect, and so researchershave to
worry about ruling them out in order to get a more valid estimate of the treatment
effect. By contrast, with random assignmentthe researcherdoes not have to think
as much about all these alternative explanations. If correctly done, random as-
signment makes most of the alternatives less likely as causes of the observed
treatment effect at the start of the study.
In quasi-experiments,the researcherhas to enumeratealternative explanations
one by one, decide which are plausible, and then use logic, design, and measure-
ment to assesswhether each one is operating in a way that might explain any ob-
servedeffect. The difficulties are that thesealternative explanations are never com-
pletely enumerable in advance, that some of them are particular to the context
being studied, and that the methods neededto eliminate them from contention will
vary from alternative to alternative and from study to study. For example, suppose
two nonrandomly formed groups of children are studied, a volunteer treatment
group that gets a new reading program and a control group of nonvolunteerswho
do not get it. If the treatment group does better, is it becauseof treatment or be-
causethe cognitive development of the volunteerswas increasingmore rapidly even
before treatment began? (In a randomized experiment, maturation rates would
OF EXPERIMENTS 1s
MODERNDESCRIPTIONS |

havebeenprobabilisticallyequalin both groups.)To assess this alternative,the re-


searchermight add multiple preteststo revealmaturationaltrend beforethe treat-
ment, and then comparethat trend with the trend after treatment.
Another alternativeexplanationmight bethat the nonrandomcontrol group in-
cludedmoredisadvantaged childrenwho had lessaccess to booksin their homesor
who had parentswho read to them lessoften. (In a randomizedexperiment'both
groupswould havehad similar proportionsof suchchildren.)To assess this alter-
nativi, the experimentermay measurethe number of books at home,parentaltime
spentreadingtochildren,and perhapstrips to libraries.Then the researcher would
seeif thesevariablesdiffered acrosstreatment and control groups in the hypothe-
sizeddirection that could explain the observedtreatment effect. Obviously,as the
number of plausiblealternativeexplapationsincreases,the designof the quasi-
. experimentbecomesmore intellectually demandingand complex---especiallybe-
causewe are nevercertainwe haveidentifiedall the alternativeexplanations.The
efforts of the quasi-experimenterstart to look like affemptsto bandagea wound
that would havebeenlesssevereif random assignmenthad beenusedinitially.
The ruling out of alternativehypothesesis closelyrelatedto a falsificationist
logic popularizedby Popper(1959).Poppernoted how hard it is to be sure that a
g*.r"t conclusion(e.g.,,ll r*"ttr are white) is correct basedon a limited set of
observations(e.g.,all the swansI've seenwere white). After all, future observa-
tions may change(e.g.,somedayI may seea black swan).So confirmation is log-
ically difficult. By contrast,observinga disconfirminginstance(e.g.,a black swan)
is sufficient,in Popper'sview, to falsify the generalconclusionthat all swansare
white. Accordingly,nopper urged scientiststo try deliberatelyto falsify the con-
clusionsthey wiih to draw rather than only to seekinformation corroborating
them. Conciusionsthat withstand falsificationare retainedin scientificbooks or
journals and treated as plausible until better evidencecomes along. Quasi-
experimentationis falsificationistin that it requiresexperimentersto identify a
causalclaim and then to generateand examineplausiblealternativeexplanations
that might falsify the claim.
However,suchfalsificationcan neverbe as definitiveas Popperhoped.Kuhn
(7962) pointed out that falsificationdependson two assumptionsthat can never
be fully tested.The first is that the causalclaim is perfectlyspecified.But that is
neverih. ."r.. So many featuresof both the claim and the test of the claim are
debatable-for example,which outcome is of interest,how it is measured,the
conditionsof treatment,who needstreatment,and all the many other decisions
that researchers must make in testingcausalrelationships.As a result, disconfir-
mation often leadstheoriststo respecifypart of their causaltheories.For exam-
ple, they might now specifynovel conditionsthat must hold for their theory to be
irue and that were derivedfrom the apparentlydisconfirmingobservations.Sec-
ond, falsificationrequiresmeasuresthat are perfectlyvalid reflectionsof the the-
ory being tested.However,most philosophersmaintain that all observationis
theorv-laden.It is laden both with intellectualnuancesspecificto the partially
16 I 1. EXPERIMENTS
AND GENERALIZED INFERENCE
CAUSAL
I

uniquescientificunderstandings of the theory held by the individual or group de-


vising the test and also with the experimenters'extrascientificwishes,hopes,
aspirations,and broadly shared cultural assumptionsand understandings.If
measuresare not independentof theories,how can they provideindependentthe-
ory tests,includingtestsof causaltheories?If the possibilityof theory-neutralob-
servationsis denied,with them disappearsthe possibilityof definitiveknowledge
both of what seemsto confirm a causalclaim and of what seemsto disconfirmit.
Nonetheless, a fallibilist versionof falsificationis possible.It arguesthat stud-
iesof causalhypothesescan still usefullyimproveunderstandingof generaltrends
despiteignoranceof all the contingencies that might pertainto thosetrends.It ar-
guesthat causalstudiesare usefulevenif w0 haveto respecifythe initial hypoth-
esisrepeatedlyto accommodatenew contingenciesand new understandings. Af-
ter all, those respecificationsare usually minor in scope;they rarely involve
wholesaleoverthrowingof generaltrendsin favor of completelyoppositetrends.
Fallibilist falsificationalso assumesthat theory-neutralobservationis impossible
but that observationscan approacha more factlikestatuswhenthey havebeenre-
peatedlymadeacrossdifferenttheoreticalconceptionsof a construct,acrossmul-
tiple kinds of measurements, and at multiple times.It alsoassumes that observa-
tions are imbued with multiple theories, not iust one, and that different
operationalproceduresdo not sharethe samemultiple theories.As a result,ob-
servationsthat repeatedlyoccur despitedifferent theoriesbeing built into them
havea specialfactlike statusevenif they can neverbe fully justifiedascompletely
theory-neutralfacts.In summary,then, fallible falsificationis more than just see-
ing whether observationsdisconfirm a prediction. It involvesdiscoveringand
judging the worth of ancillary assumptionsabout the restrictedspecificityof the
causalhypothesisunder test and also about the heterogeneityof theories,view-
points, settings,and times built into the measuresof the causeand effectand of
any contingencies modifying their relationship.
It is neitherfeasiblenor desirableto rule out all possiblealternativeinterpre-
tarionsof a causalrelationship.Instead,only plausiblealternativesconstitutethe
major focus.This servespartly to keep matterstractablebecausethe number of
possiblealternativesis endless.It also recognizesthat many alternativeshaveno
seriousempiricalor experientialsupport and so do not warrant specialattention.
However,the lack of supportcan sometimesbe deceiving.For example,the cause
of stomachulcerswas long thought to be a combinationof lifestyle(e.g.,stress)
and excessacid production. Few scientistsseriouslythought that ulcers were
causedby a pathogen(e.g.,virus,germ,bacteria)because it was assumed that an
acid-filled stomachwould destroy all living organisms. However,in L982 Aus-
'Warren
tralian researchersBarry Marshall and Robin discoveredspiral-shaped
bacteria,later named Helicobacterpylori (H. pylori), in ulcerpatients'stomachs.
rilfith this discovery,the previouslypossiblebut implausiblebecameplausible.By
"1994,
a U.S. National Institutesof Health ConsensusDevelopmentConference
concluded that H. pylori was the major causeof most peptic ulcers. So labeling ri-
I

II tt
OFEXPERIMENTS
MODERNDESCRTPTONS

val hypothesesas plausible dependsnot just on what is logically possible but on


social consensus,shared experienceand, empirical data.
Becausesuch factors are often context specific, different substantive areasde-
velop their own lore about which alternatives are important enough to need to be
controlled, even developing their own methods for doing so. In early psychologg
for example, a control group with pretest observations was invented to control for
the plausible alternative explanation that, by giving practice in answering test con-
tent, pretestswould produce gains in performance even in the absenceof a treat-
ment effect (Coover 6c Angell, 1907). Thus the focus on plausibility is a two-edged
sword: it reducesthe range of alternatives to be considered in quasi-experimental
work, yet it also leavesthe resulting causal inference vulnerable to the discovery
that an implausible-seemingalternative may later emerge as a likely causal agent.

NaturalExperiment
The term natural experiment describesa naturally-occurring contrast between a
treatment and a comparisoncondition (Fagan, 1990; Meyer, 1995;Zeisel,1,973l.
Often the treatments are not even potentially manipulable, as when researchers
retrospectivelyexamined whether earthquakesin California causeddrops in prop-
erty values (Brunette, 1.995; Murdoch, Singh, 6c Thayer, 1993). Yet plausible
causal inferences about the effects of earthquakes are easy to construct and de-
fend. After all, the earthquakesoccurred before the observations on property val-
ues,and it is easyto seewhether earthquakesare related to properfy values. A use-
ful source of counterfactual inference can be constructed by examining property
values in the same locale before the earthquake or by studying similar localesthat
did not experience an earthquake during the bame time. If property values
dropped right after the earthquake in the earthquake condition but not in the com-
parison condition, it is difficult to find an alternative explanation for that drop.
Natural experiments have recently gained a high profile in economics. Before
the 1990s economists had great faith in their ability to produce valid causal in-
ferencesthrough statistical adjustments for initial nonequivalence between treat-
ment and control groups. But two studies on the effects of job training programs
showed that those adjustments produced estimates that were not close to those
generated from a randomized experiment and were unstable across tests of the
model's sensitivity (Fraker 6c Maynard, 1,987; Lalonde, 1986). Hence, in their
searchfor alternative methods, many economistscame to do natural experiments,
such as the economic study of the effects that occurred in the Miami job market
when many prisoners were releasedfrom Cuban jails and allowed to come to the
United States(Card, 1990). They assumethat the releaseof prisoners (or the tim-
ing of an earthquake) is independent of the ongoing processesthat usually affect
unemployment rates (or housing values). Later we explore the validity of this
assumption-of its desirability there can be little question.
18 I 1. EXPERIMENTS
AND GENERALIZED INFERENCE
CAUSAL

Nonexperimental
Designs
The termscorrelationaldesign,passiveobservationaldesign,and nonexperimental
designrefer to situationsin which a presumedcauseand effect are identified and
measuredbut in which other structural featuresof experimentsare missing.Ran-
dom assignmentis not part of the design,nor are suchdesignelementsas pretests
and control groupsfrom which researchers might constructa usefulcounterfactual
inference.Instead,relianceis placedon measuringalternativeexplanationsindi-
vidually and then statisticallycontrolling for them. In cross-sectionalstudiesin
which all the data aregatheredon the respondentsat one time, the researchermay
not even know if the causeprecedesthe dffect. When thesestudiesare used for
causalpurposes,the missingdesignfeaturescan be problematicunlessmuch is al-
ready known about which alternativeinterpretationsare plausible,unlessthose
that are plausiblecan be validly measured,and unlessthe substantivemodel used
for statisticaladjustmentis well-specified.
Theseare difficult conditionsto meetin
the real world of researchpractice,and thereforemany commentatorsdoubt the
potentialof suchdesignsto supportstrongcausalinferencesin most cases.

EXPERIMENTS
ANDTHEGENERALIZATION
OF
CAUSALCONNECTIONS
The strength of experimentation is its ability to illuminate causal inference. The
weaknessof experimentation is doubt about the extent to which that causal rela-
'We
tionship generalizes. hope that an innovative feature of this book is its focus
on generalization. Here we introduce the general issuesthat are expanded in later
chapters.

Most Experiments
Are HighlyLocalBut Have
GeneralAspirations

Most experimentsare highly localizedand particularistic.They are almostalways


conductedin a restrictedrange of settings,often just one, with a particular ver-
sion of one type of treatmentrather than, say,a sampleof all possibleversions.
Usually they have severalmeasures-eachwith theoreticalassumptionsthat are
differentfrom thosepresentin other measures-but far from a completesetof all
possiblemeasures.Each experimentnearly always usesa convenientsampleof
people rather than one that reflectsa well-describedpopulation; and it will in-
evitably be conductedat a particular point in time that rapidly becomeshistory.
Yet readersof experimentalresultsare rarelyconcernedwith what happened
in that particular,past,local study.Rather,they usuallyaim to learn eitherabout
theoreticalconstructsof interestor about alarger policy.Theoristsoften want to
OFCAUSAL
AND THEGENERALIZATION
EXeERTMENTS I t'
CONNECTIONS

connect experimental results to theories with broad conceptual applicability,


which ,.q,rir., generalization at the linguistic level of constructs rather than at the
level of the operations used to represent these constructs in a given experiment.
They nearly always want to generallzeto more people and settings than are rep-
resentedin a single experiment. Indeed, the value assignedto a substantive theory
usually dependson how broad a rangeof phenomena the theory covers. SimilarlS
policymakers may be interested in whether a causal relationship would hold
implemented as a
iprobabilistically) across the many sites at which it would be
policS an inferencethat requires generalization beyond the original experimental
stody contexr. Indeed, all human beings probably value the perceptual and cogni-
tive stability that is fostered by generalizations. Otherwise, the world might ap-
pear as a btulzzingcacophony of isolqted instances requiring constant cognitive
processingthat would overwhelm our limited capacities.
In defining generalizationas a problem, we do not assumethat more broadly ap-
plicable resulti are always more desirable(Greenwood, 1989). For example, physi-
cists -ho use particle accelerators to discover new elements may not expect that it
would be desiiable to introduce such elementsinto the world. Similarly, social scien-
tists sometimes aim to demonstrate that an effect is possible and to understand its
mechanismswithout expecting that the effect can be produced more generally. For
"sleeper effect" occurs in an attitude change study involving per-
instance, when a
suasivecommunications, the implication is that change is manifest after a time delay
but not immediately so. The circumstancesunder which this effect occurs turn out to
be quite limited and unlikely to be of any general interest other than to show that the
theory predicting it (and many other ancillary theories) may not be wrong (Cook,
Gruder, Hennigan & Flay l979\.Experiments that demonstrate limited generaliza-
tion may be just as valuable as those that demonstratebroad generalization.
Nonetheless,a conflict seemsto exist berweenthe localized nature of the causal
knowledge that individual experiments provide and the more generalizedcausal
goals that researchaspiresto attain. Cronbach and his colleagues(Cronbach et al.,
f gSO;Cronbach, 19821havemade this argument most forcefully and their works
have contributed much to our thinking about causal generalization. Cronbach
noted that each experiment consistsof units that receivethe experiencesbeing con-
trasted, of the treaiments themselves, of obseruations made on the units, and of the
settings in which the study is conducted. Taking the first letter from each of these
"instances on which data
four iords, he defined the acronym utos to refer to the
"1.982,p.
are collected" (Cronb ach, 78)-to the actual people,treatments' measures'
and settingsthat were sampledin the experiment. He then defined two problems of
"domain about which
generalizition: (1) generaliiing to the [the] question is asked"
"units, treatments,variables,
(p.7g),which he called UTOS; and (2) generalizingto
oUTOS.e
"nd r.r,ings not directly observed" (p. 831,*hi.h he called

S,
9. We oversimplify Cronbach'spresentationhere for pedagogicalreasons.For example,Cronbach only usedcapital
not small s, so that his system,eferred only to ,tos, not utos. He offered diverseand not always consistentdefinitions
do here.
of UTOS and *UTOS, in particular. And he doesnot usethe word generalizationin the samebroad way we
I
20 I 1. EXPERIMENTS
AND GENERALIZED INFERENCE
CAUSAL

Our theoryof causalgeneralization, outlinedbelowand presentedin more de-


tail in ChaptersLL through 13, melds Cronbach'sthinking with our own ideas
about generalizationfrom previousworks (Cook, 1990, t99t; Cook 6c Camp-
bell,1979), creatinga theory that is differentin modestways from both of these
predecessors. Our theory is influencedby Cronbach'swork in two ways.First, we
follow him by describingexperimentsconsistentlythroughout this book as con-
sistingof the elementsof units, treatments,observations,and settingsrlothough
we frequentlysubstitutepersonsfor units giventhat most field experimentationis
conductedwith humansas participants.:Wealsooften substituteoutcomef.orob-
seruationsgiven the centrality of observationsabout outcomewhen examining
causalrelationships.Second,we acknowledgethat researchers areofteninterested
in two kinds of.generalizationabout eachof thesefive elements,and that these
two typesareinspiredbg but not identicalto, the two kinds of generalizationthat
'We
Cronbach defined. call these construct validity generalizations(inferences
about the constructsthat researchoperationsrepresent)and externalvalidity gen-
eralizations(inferencesabout whetherthe causalrelationshipholdsovervariation
in persons,settings,treatment,and measurement variables).

ConstructValidity:CausalGeneralization
as Representation
The first causal generalization problem concerns how to go from the particular
units, treatments, observations, and settings on which data are collected to the
higher order constructs these instancesrepresent.These constructs are almost al-
ways couched in terms that are more abstract than the particular instancessam-
pled in an experiment. The labels may pertain to the individual elementsof the ex-
periment (e.g., is the outcome measured by a given test best described as
intelligence or as achievement?).Or the labels may pertain to the nature of rela-
tionships among elements, including causal relationships, as when cancer treat-
ments are classified as cytotoxic or cytostatic depending on whether they kill tu-
mor cells directly or delay tumor growth by modulating their environment.
Consider a randomized experiment by Fortin and Kirouac (1.9761.The treatment
was a brief educational course administered by severalnurses,who gave a tour of
their hospital and covered some basic facts about surgery with individuals who
were to have elective abdominal or thoracic surgery 1-5to 20 days later in a sin-
gle Montreal hospital. Ten specific outcome measureswere used after the surgery,
such as an activities of daily living scaleand a count of the analgesicsused to con-
trol pain. Now compare this study with its likely t^rget constructs-whether

10. \Weoccasionallyrefer to time as a separatefeatureof experiments,following Campbell (79571and Cook and


Campbell (19791,becausetime can cut acrossthe other factorsindependently.Cronbachdid not includetime in
his notational system,insteadincorporating time into treatment(e.g.,the schedulingof treatment),observations
(e.g.,when measuresare administered),or setting (e.g.,the historicalcontext of the experiment).
oF cAUsALcoNNEcrtoNS| ,,
ANDTHEGENERALIZATIoN
EXnERTMENTs I

patient education (the target cause)promotes physical recovery (the targ€t effect)
"*ong surgical patients (the target population of units) in hospitals (the target
univeise ofiettings). Another example occurs in basic research,in which the ques-
tion frequently aiises as to whether the actual manipulations and measuresused
in an experiment really tap into the specific cause and effect constructs specified
by the theory. One way to dismiss an empirical challenge to a theory is simply to
make the casethat the data do not really represent the concepts as they are spec-
ified in the theory.
Empirical resnlts often force researchersto change their initial understanding
of whaithe domain under study is. Sometimesthe reconceptuahzation leads to a
more restricted inference about what has been studied. Thus the planned causal
agent in the Fortin and Kirouac (I976),study-patie,nt education-might need to
b! respecified as informational patient education if the information component of
the treatment proved to be causally related to recovery from surgery but the tour
of the hospital did not. Conversely data can sometimes lead researchersto think
in terms o?,"rg., constructs and categoriesthat are more general than those with
which they began a researchprogram. Thus the creative analyst of patient educa-
tion studies mlght surmise that the treatment is a subclass of interventions that
"perceived control" or that recovery from surgery can be
function by increasing
;'p.tronal coping." Subsequentreaders of the study can
treated as a subclas of
even add their own interpietations, perhaps claiming that perceived control is re-
ally just a special caseof the even more general self-efficacy construct. There is a
sobtie interplay over time among the original categories the researcherintended
to represeni, the study as it was actually conducted, the study results, and subse-
qrr..ri interpretations. This interplay can change the researcher'sthinking about
what the siudy particulars actually achieved at a more conceptual level, as can
feedback fromreaders. But whatever reconceptualizationsoccur' the first problem
of causal generaltzationis always the same: How can we generalizefrom a sam-
ple of instancesand the data patterns associatedwith them to the particular tar-
get constructs they represent?

as Extrapolation
ExternalValidity:CausalGeneralization
The secondproblem of generalizationis to infer whether a causalrelationship
holdsovervariationsin p.rrorrt, settings,treatments,and outcomes.For example,
someonereadingthe resultsof an experimenton the effectsof a kindergarten
Head Startprogiam on the subsequent grammarschoolreadingtestscoresof poor
African Americanchildrenin Memphis during the 1980smay want to know if a
programwith partially overlappingcognitiveand socialdevelopmentgoals_would
be aseffectivein improvingthi mathematicstest scoresof poor Hispanicchildren
in Dallas if this programwere to be implementedtomorrow.
This exampl. again reminds us that generahzationis not a synonym for
broader applicatiorr.H.r., generahzationis from one city to another city and
1. EXPERIMENTS
AND GENERALIZED INFERENCE
CAUSAL

from one kind of clienteleto anotherkind, but thereis no presumptionthat Dal-


las is somehow broader than Memphis or that Hispanic children constitute a
broader population than African American children. Of course,some general-
izations are from narrow to broad. For example,a researcherwho randomly
samplesexperimentalparticipants from a national population may generalize
(probabilistically)from the sampleto all the other unstudiedmembersof that
samepopulation. Indeed,that is the rationale for choosingrandom selectionin
the first place.Similarly when policymakersconsiderwhetherHead Start should
be continuedon a national basis,they are not so interestedin what happenedin
Memphis.They are more interestedin what would happenon the averageacross
the United States,as its many local programsstill differ from eachother despite
efforts in the 1990sto standardizemuch of what happensto Head Startchildren
and parents.But generalizationcan also go from the broad to the narrow. Cron-
bach(1982)givesthe exampleof an experimentthat studieddifferences between
the performancesof groups of studentsattendingprivate and public schools.In
this case,the concernof individual parentsis to know which type of schoolis bet-
ter for their particular child, not for the whole group. \Thether from narrow to
broad, broad to narroq or acrossunits at about the samelevelof aggregation,
all theseexamplesof externalvalidity questionssharethe sameneed-to infer the
extent to which the effect holds over variationsin persons,settings,treatments,
or outcomes.

Approaches
to MakingCausalGeneralizations
\Thichever way the causal generalization issue is framed, experiments do not
seem at first glance to be very useful. Almost invariablS a given experiment uses
a limited set of operations to represent units, treatments, outcomes, and settings.
This high degree of localization is not unique to the experiment; it also charac-
terizes case studies, performance monitoring systems, and opportunistically-
administered marketing questionnaires given to, say, a haphazard sample of re-
spondents at local shopping centers (Shadish, 1995b). Even when questionnaires
are administered to nationally representative samples, they are ideal for repre-
senting that particular population of persons but have little relevanceto citizens
outside of that nation. Moreover, responsesmay also vary by the setting in which
the interview took place (a doorstep, a living room, or a work site), by the time
of day at which it was administered, by how each question was framed, or by the
particular race, age,and gender combination of interviewers. But the fact that the
experiment is not alone in its vulnerability to generalization issuesdoes not make
it any less a problem. So what is it that justifies any belief that an experiment can
achieve a better fit between the sampling particulars of a study and more general
inferences to constructs or over variations in persons, settings, treatments, and
outcomes?
oF cAUsALcoNNEcrtoNs I tt
ANDTHEGENERALtzATtoN
EXeERTMENTs

Samplingand CausalGeneralization
The methodmost often recommendedfor achievingthis closefit is the useof for-
mal probabiliry samplingof instancesof units, treatments,observations,or set-
tings (Rossi,Vlright, & Anderson,L983). This presupposes that we have clearly
deiineatedpopulationsof eachand that we can samplewith known probability
from within eachof thesepopulations.In effect,this entailsthe random selection
of instances,to be carefullydistinguishedfrom random assignmentdiscussed ear-
lier in this chapter.Randomselectioninvolvesselectingcases by chance to repre-
sentthat popuiation,whereasrandom assignmentinvolvesassigningcasesto mul-
tiple conditions.
In cause-probingresearchthat is not experimental,random samplesof indi-
viduals"r. oft.n nr.d. Large-scale longitudinalsurveyssuchasthe PanelStudyof
IncomeDynamicsor the National Longitudinal Surveyare usedto representthe
populationof the United States-or certainagebracketswithin it-and measures
Lf pot.ntial causesand effectsare then relatedto each other using time lags in
,nr^"r.rr.-ent and statisticalcontrolsfor group nonequivalence. All this is donein
hopesof approximatingwhat a randomizedexperimentachieves.However,cases
of random ielection from a broad population followed by random assignment
from within this population are much rarer (seeChapter 12 for examples).Also
rare arestudiesoi t".rdotn selectionfollowed by a quality quasi-experiment. Such
experimentsrequirea high levelof resourcesand a degreeof logisticalcontrol that
is iarely feasible,so many researchers prefer to rely on an implicit set of nonsta-
tistical heuristicsfor generalizationthat we hope to make more explicit and sys-
tematicin this book.
Random selectionoccurseven more rarely with treatments'outcomes,and
settingsthan with people.Considerthe outcomesobservedin an experiment.How
ofterrlre they raniomly sampled?'Wegrant that the domain samplingmodel of
classicaltestiheory (Nunnally 6c Bernstein,1994)assumesthat the itemsusedto
measurea constructhavebeenrandomly sampledfrom a domain of all possible
items. However,in actual experimentalpracticefew researchersever randomly
sampleitemswhen constructingmeasures.Nor do they do so when choosingma-
nipulationsor settings.For instance,many settingswill not agreeto be sampled,
"rid ,o1n. of the settingsthat agreeto be randomly sampledwill almostcertainly
not agreeto be randomlyassignedto conditions.For treatments,no definitivelist
of poisible treatmentsusuallyexists,as is most obvious in areasin which treat-
-*,, are being discoveredand developedrapidly, such as in AIDS research.In
general,then, random samplingis alwaysdesirable,but it is only rarely and con-
tingently feasible.
"However,
formal samplingmethodsare not the only option. Two informal, pur-
posive samplingmethodrare sometimesuseful-purposive sampling of heteroge-
neousinstancesand purposivesamplingof typical instances.In the former case'the
aim is to includeinrLni.r chosendeliberatelyto reflect diversity on presumptively
important dimensions,eventhough the sampleis not formally random. In the latter
24 I .l. TxpEnIMENTS INFERENCE
CAUSAL
ANDGENERALIZED

case,the aim is to explicate the kinds of units, treatments, observations, and settings
to which one most wants to generalize andthen to selectat least one instance of each
class that is impressionistically similar to the class mode. Although these purposive
sampling methods are more practical than formal probability sampling, they are not
backed by a statistical logic that justifies formal generalizations.Nonetheless, they
are probabty the most commonly used of all sampling methods for facilitating gen-
eralizations. A task we set ourselvesin this book is to explicate such methods and to
describe how they can be used more often than is the casetoday.
However, sampling methods of any kind are insufficient to solve either prob-
lem of generalization. Formal probability sampling requires specifying a target
population from which sampling then takes place, but defining such populations
is difficult for some targets of generalization such as treatments. Purposive sam-
pling of heterogeneousinstancesis differentially feasible for different elementsin
a study; it is often more feasible to make measuresdiverse than it is to obtain di-
verse settings, for example. Purposive sampling of typical instancesis often feasi-
ble when target modes, medians, or means are known, but it leaves questions
about generalizationsto a wider range than is typical. Besides,as Cronbach points
out, most challenges to the causal generalization of an experiment typically
emerge after a study is done. In such cases,sampling is relevant only if the in-
stancesin the original study were sampled diversely enough to promote responsi-
ble reanalysesof the data to seeif a treatment effect holds acrossmost or all of the
targets about which generahzation has been challenged. But packing so many
sourcesof variation into a single experimental study is rarely practical and will al-
most certainly conflict with other goals of the experiment. Formal sampling meth-
ods usually offer only a limited solution to causal generalizationproblems. A the-
ory of generalizedcausal inference needsadditional tools.

A GroundedTheoryof CausalGeneralization
Practicingscientistsroutinely make causal generalizations in their research,and
they almostneveruseformal probability samplingwhen they do. In this book, we
presenta theory of causalgeneralization that is groundedin the actualpracticeof
science(Matt, Cook, 6c Shadish,2000). Although this theory was originally de-
velopedfrom ideasthat were groundedin the constructand externalvalidiry lit-
eratures(Cook, 1990,1991.),we havesincefound that theseideasarecommonin
a diverseliteratureabout scientificgeneralizations (e.g.,Abelson,1995;Campbell
& Fiske, 1.959;Cronbach& Meehl, 1955; Davis, 1994; Locke, 1'986;Medin,
1989;Messick,1ggg,1'995;Rubins,1.994;'Willner, Hayward,Tu-
1,991';'$7ilson,
\7e providemore details
nis, Bass,& Guyatt, 1,995];t. about this grounded theory
in Chapters1L through L3, but in brief it suggests that scientistsmakecausalgen-
eralizationsin their work by usingfive closelyrelatedprinciples:
"L. the apparentsimilaritiesbetweenstudy opera-
SurfaceSimilarity.They assess
tions and the prototypicalcharacteristicsof the target of generalization.
I

AND THEGENERALIZATION
EXPERIMENTS II ZS
OFCAUSALCONNECTIONS

2. Ruling Out lrreleuancies.They identify those things that are irrelevant because
they do not change a generalization.
3. Making Discriminations. They clarify k.y discriminations that limit
generalization.
4. Interpolation and Extrapolation. They make interpolations to unsampled val-
ues within the range of the sampled instances and, much more difficult, they
explore extrapolations beyond the sampled range.
5 . Causal Explanation. They develop and test explanatory theories about the pat-
tern of effects,causes,and mediational processesthat are essentialto the trans-
fer of a causalrelationship.
In this book, we want to show how scientistscan and do use thesefive princi-
ples to draw generalizedconclusions dbout a causal connection. Sometimes the
conclusion is about the higher order constructs to use in describing an obtained
connection at the samplelevel. In this sense,thesefive principles have analoguesor
parallels both in the construct validity literature (e.g.,with construct content, with
loru.rg.nt and discriminant validity, and with the need for theoretical rationales
for consrructs) and in the cognitive scienceand philosophy literatures that study
how people decidewhether instancesfall into a category(e.g.,concerning the roles
that protorypical characteristicsand surface versus deep similarity play in deter-
mining category membership). But at other times, the conclusion about general-
ization refers to whether a connection holds broadly or narrowly over variations
in persons, settings,treatments, or outcomes. Here, too, the principles have ana-
logues or parallels that we can recognizefrom scientific theory and practice, as in
the study of dose-responserelationships (a form of interpolation-extrapolation) or
the appeal to explanatory mechanismsin generalizing from animals to humans (a
form of causal explanation).
Scientistsuse rhese five principles almost constantly during all phases of re-
search.For example, when they read a published study and wonder if some varia-
tion on the study's particulars would work in their lab, they think about similari-
'$7hen
ties of the published study to what they propose to do. they conceptualize
the new study, they anticipate how the instances they plan to study will match the
prototypical featuresof the constructs about which they are curious. They may de-
iign their study on the assumptionthat certain variations will be irrelevant to it but
that others will point to key discriminations over which the causal relationship
does not hold or the very character of the constructs changes.They may include
measuresof key theoretical mechanisms to clarify how the intervention works.
During data analysis, they test all these hypotheses and adjust their construct de-
scriptions to match better what the data suggest happened in the study. The intro-
duction section of their articles tries to convince the reader that the study bears on
specific constructs, and the discussion sometimes speculatesabout how results
-igttt extrapolate to different units, treatments, outcomes, and settings.
Further, practicing scientistsdo all this not just with single studies that they
read or conduct but also with multiple studies. They nearly always think about
26 1. EXPERTMENTS
ANDGENERALIZED INFERENCE
CAUSAL
|

how their own studiesfit into a larger literature about both the constructsbeing
measuredand the variablesthat may or may not bound or explain a causalconnec-
tion, often documentingthis fit in the introduction to their study.And they apply all
five principleswhen they conduct reviewsof the literature,in which they make in-
ferencesabout the kinds of generalizations that a body of researchcan suppoft.
Throughoutthis book, and especiallyin Chapters11 to L3, we providemore
detailsabout this groundedtheory of causal generalizationand about the scientific
Adopting this groundedtheoryof generalization
practicesthat it suggests. doesnot
imply a rejectionof formal probabilitysampling.Indeed, we recommendsuchsam-
pling unambiguouslywhenit is feasible,alongwith purposive samplingschemes to
aid generalization when formal randomselectionmethodscannotbe implemented.
But we alsoshow that samplingis just one methodthat practicingscientistsuseto
make causalgeneralizations, along with practicallogic, applicationof diversesta-
tistical methods,and useof featuresof designother than sampling.

AND METASCIENCE
EXPERIMENTS
Extensivephilosophicaldebatesometimessurroundsexperimentation.Here we
briefly summarizesomekey featuresof thesedebates,and then we discusssome
implications of thesedebatesfor experimentation.However,there is a sensein
which all this philosophicaldebateis incidentalto the practiceof experimentation.
Experimentationis as old as humanity itself, so it precededhumanity'sphilo-
sophicaleffortsto understandcausationand genenlizationby thousandsof years.
Even over just the past 400 yearsof scientificexperimentation,we can seesome
constancyof experimentalconcept and method, whereasdiversephilosophical
"Ex-
conceptions of the experimenthavecomeand gone.As Hacking(1983)said,
perimentationhas a life of its own" (p. 150). It has beenone of science's most
powerful methodsfor discoveringdescriptivecausalrelationships,and it hasdone
so well in so many ways that its placein scienceis probably assuredforever.To
justify its practicetodag a scientistneednot resortto sophisticatedphilosophical
reasoningabout experimentation.
Nonetheless,it doeshelp scientiststo understandthesephilosophicaldebates.
For example,previousdistinctionsin this chapterbetweenmolar and molecular
causation,descriptiveand explanatorycause,or probabilisticand deterministic
causalinferencesall help both philosophersand scientiststo understandbetter
both the purposeand the resultsof experiments(e.g.,Bunge,1959; Eells, 1991';
Hart & Honore, 1985;Humphreys,"t989;Mackie, 1'974;Salmon,7984,1989;
Sobel,1993;P.A. \X/hite,1990).Here we focus on a differentand broadersetof
critiquesof scienceitself,not only from philosophybut alsofrom the history,so-
ciologS and psychologyof science(seeusefulgeneralreviewsby Bechtel,1988;
H. I. Brown, 1977; Oldroyd, 19861.Someof theseworks have beenexplicitly
about the nature of experimentation,seekingto createa justified role for it (e.g.,
I

AND METASCIENCE
EXPERIMENTS I 27

'1.990;
Bhaskar,L975;Campbell,1982,,1988;Danziger, S. Drake, l98l; Gergen,
1,973;Gholson,Shadish, Neimeyer,6d Houts, L989; Gooding,Pinch,6cSchaffer,
'Woolgar,
1,989b;Greenwood, L989; Hacking, L983; Latour, 1'987;Latour 6c
1.979;Morawski, 1988;Orne,1.962;R.RosenthaL,1.966;Shadish & Fuller,L994;
Shapin,1,9941. Thesecritiqueshelp scientiststo seesomelimits of experimenta-
tion in both scienceand society.

TheKuhnianCritique
Kuhn (1962\ describedscientificrevolutionsas differentand partly incommensu-
rableparadigmsthat abruptly succeedgd eachother in time and in which the grad-
ual accumulation of scientificknowledgewas a chimera.Hanson(1958),Polanyi
(1958),Popper('J.959), Toulmin (1'961),Feyerabend(L975),and Quine (1'95t'
1,969)contributedto the critical momentum,in part by exposingthe grossmis-
takesin logicalpositivism'sattemptto build a philosophyof sciencebasedon re-
constructinga successfulsciencesuch as physics.All thesecritiquesdeniedany
firm foundationsfor scientificknowledge(so, by extension,experimentsdo not
provide firm causalknowledge).The logicalpositivistshopedto achievefounda-
tions on which to build knowledgeby tying all theory tightly to theory-freeob-
servationthrough predicatelogic. But this left out important scientificconcepts
that could not be tied tightly to observation;and it failed to recognizethat all ob-
servationsare impregnatedwith substantiveand methodologicaltheory,making
it impossibleto conducttheory-freetests.lt
The impossibility of theory-neutral observation (often referred to as the
Quine-Duhemthesis)impliesthat the resultsof any singletest (and so any single
experiment)are inevitably ambiguous.They could be disputed,for example,on
groundsthat the theoreticalassumptionsbuilt into the outcome measurewere
wrong or that the study made a fatity assumptionabout how high a treatment
dosewas requiredto be effective.Someof theseassumptionsare small,easilyde-
tected,and correctable,suchaswhen a voltmetergivesthe wrong readingbecause
the impedanceof the voltagesourcewas much higherthan that of the meter ('$fil-
son, L952).But other assumptionsare more paradigmlike,impregnatinga theory
so completelythat other parts of the theory makeno sensewithout them (e.g.,the
assumptionthat the earthis the centerof the universein pre-Galileanastronomy).
Becausethe number of assumptionsinvolved in any scientifictest is very large,
researcherscan easily find some assumptionsto fault or can even posit new

"Even the father


11. However, Holton (1986) reminds us nor to overstatethe relianceof positivistson empirical data:
of positivism,AugusteComte, had written . . . that without a theory of somesort by which to link phenomena to some
'it
principles would not only be impossibleto combine the isolated observations and draw any useful conclusions, we
would not evenbe able to remember them, and, for the most part, the fact would not be noticed by our eyes"' (p. 32).
Similarly, Uebel (1992) providesa more detailedhistorical analysisof the protocol sentencedebatein logical
positivism, showing somesurprisinglynonstereorypicalpositions held by key playerssuch as Carnap.
28 ANDGENERALIZED
r. rxeenlMENTs INFERENCE
CAUSAL
|

assumptions(Mitroff & Fitzgerald,1.977).In this way, substantivetheoriesare


lesstestablethan their authors originally conceived.How cana theory be tested
if it is madeof clayrather than granite?
For reasonswe clarify later,this critique is more true of singlestudiesand less
true of programsof research.But evenin the latter case,undetectedconstantbiases
."tt t.r,tlt in flawed inferencesabout causeand its genenlization.As a result,no ex- I
perimentis everfully certain,and extrascientificbeliefsand preferences alwayshave
ioo- to influencethe many discretionaryjudgmentsinvolved in all scientificbelief.

Critiques
ModernSocialPsychological
Sociologists working within traditionsvariouslycalledsocialconstructivism,epis-
temologicalrelativism,and the strongprogram(e.g.,Barnes,1974;Bloor,1976;
Collins,l98l;Knorr-Cetina,L981-;Latour 6c'Woolgar,1.979;Mulkay, 1'979)have
shown thoseextrascientificprocesses at work in science.Their empiricalstudies
show that scientistsoften fail to adhereto norms commonlyproposedas part of
good science(e.g.,objectivity neutrality,sharingof information).They havealso
rho*n how that which comesto be reportedas scientificknowledgeis partly de-
terminedby socialand psychologicalforcesand partly by issuesof economicand
political power both within scienceand in the largersociety-issuesthat arerarely
mention;d in publishedresearchreports.The most extremeamongthesesociolo-
gistsattributesall scientificknowledgeto suchextrascientificprocesses, claiming
ihat "the natural world has a small or nonexistentrole in the constructionof sci-
"l'98I,
entificknowledge"(Collins, p. 3).
Collins doesnot denyontologicalrea.lism,that real entitiesexistin the world.
Rather,he deniesepistemological(scientific)realism, that whateverexternal real-
ity may existcanconstrainour scientifictheories.For example,if atomsreally ex-
ist, do they affectour scientifictheoriesat all? If our theory postulatesan atom, is
it describing a realentitythat existsroughly aswe describeit? Epistetnologi,cal rel-
atiuistssuch as Collins respondnegativelyto both questions,believingthat the
most important influencesin scienceare social,psychological,economic,and po-
litical, "ttd th"t thesemight evenbe the only influenceson scientifictheories-This
view is not widely endorsedoutsidea small group of sociologists,but it is a use-
ful counterweightto naiveassumptionsthat scientificstudiessomehowdirectlyre-
veal natur. to r.r,(an assumptiorwe callnaiuerealism).The resultsof all studies,
including experiments,are profoundly subjectto theseextrascientificinfluences,
from their conceptionto reportsof their results.

and Trust
Science
A standard image of the scientist is as a skeptic, a person who only trusts results that
have been personally verified. Indeed, the scientific revolution of the'l'7th century
I
I 29
AND METASCIENCE
EXPERIMENTS
I

claimed that trust, particularly trust in authority and dogma, was antithetical to
good science.Every authoritative assertion,every dogma, was to be open to ques-
tion, and the job of sciencewas to do that questioning.
That image is partly wrong. Any single scientific study is an exercisein trust
(Pinch, 1986; Shapin, 1,994).Studies trust the vast majority of already developed
methods, findings, and concepts that they use when they test a new hypothesis.
For example, statistical theories and methods are usually taken on faith rather
than personally verified, as are measurement instruments. The ratio of trust to
skepticism in any given study is more llke 99% trust to 1% skepticism than the
opposite. Even in lifelong programs of research, the single scientist trusts much
-or. than he or she ever doubts. Indeed, thoroughgoing skepticism is probably
impossible for the individual scientist, po iudge from what we know of the psy-
chology of science(Gholson et al., L989; Shadish 6c Fuller, 1'9941.Finall5 skepti-
cism is not even an accuratecharacterrzation of past scientific revolutions; Shapin
"gentlemanly trust" in L7th-century England was
(1,994) shows that the role of
central to the establishment of experimental science.Trust pervades science,de-
spite its rhetoric of skepticism.

for Experiments
lmplications
The net result of thesecriticismsis a greaterappreciationfor the equivocalityof
all scientificknowledge.The experimentis not a clearwindow that revealsnature
directly to us.To the contrary,experimentsyield hypotheticaland fallible knowl-
edgethat is often dependenton context and imbuedwith many unstatedtheoret-
ical assumprions.Consequentlyexperimentalresultsare partly relativeto those
assumptionsand contextsand might well changewith new assumptionsor con-
constructivistsand relativists.
texts.In this sense,all scientistsare epistemological
The differenceis whether they are strong or weak relativists.Strong relativists
share Collins'sposition that only extrascientificfactors influenceour theories.
'Weak
relativistsbelievethat both the ontologicalworld and the worlds of ideol-
og5 interests,values,hopes,and wishesplay a role in the constructionof scien-
tiiic knowledge.Most practicingscientists,including ourselves,would probably
describethemselves ", Lrrtologicalrealistsbut weak epistemologicalrelativists.l2
To the extent that experimentsrevealnature to us, it is through a very clouded
windowpane(Campbell,1988).
Suchcounterweights to naiveviewsof experimentswere badly needed.As re-
centlyas 30 yearsago,the centralrole of the experimentin sciencewas probably

1.2. If spacepermitred,we could exrendthis discussionto a host of other philosophicalissuesthat have beenraised
about the experiment, such as its role in discovery versusconfirmation, incorrect assertionsthat the experiment is
tied to somespecificphilosophysuch as logical positivismor pragmatism,and the various mistakesthat are
frequentlymadei., suchdiscussions(e.g.,Campbell, 1982,1988; Cook, 1991; Cook 6< Campbell, 1985; Shadish,
1.995a\.
I
30 | 1. EXPERTMENTS
AND GENERALTZED INFERENCE
CAUSAL

taken more for granted than is the case today. For example, Campbell and Stan-
ley (1.9631
describedthemselvesas:
committed to the experiment: as the only means for settling disputes regarding educa-
tional practice, as the only way of verifying educational improvements, and as the only
way of establishing a cumulative tradition in which improvements can be introduced
without the danger of a faddish discard of old wisdom in favor of inferior novelties. (p. 2)
"'experimental method' usedto be
Indeed,Hacking (1983) points out that iust an-
other name for scientific method" (p.149); and experimentation was then a more
fertile ground for examples illustrating basic philosophical issuesthan it was a
source of contention itself. ,
'We
Not so today. now understand better that the experiment is a profoundly
human endeavor,affected by all the same human foibles as any other human en-
deavor, though with well-developed procedures for partial control of some of the
limitations that have been identified to date. Some of these limitations are com-
mon to all science,of course. For example, scientiststend to notice evidencethat
confirms their preferred hypothesesand to overlook contradictory evidence.They
make routine cognitive errors of judgment and have limited capacity to process
large amounts of information. They react to peer pressuresto agreewith accepted
dogma and to social role pressuresin their relationships to students,participants,
and other scientists.They are partly motivated by sociological and economic re-
wards for their work (sadl5 sometimesto the point of fraud), and they display all-
too-human psychological needs and irrationalities about their work. Other limi-
tations have unique relevance to experimentation. For example, if causal results
are ambiguous, as in many weaker quasi-experiments,experimentersmay attrib-
ute causation or causal generalization based on study features that have little to
do with orthodox logic or method. They may fail to pursue all the alternative
causal explanations becauseof a lack of energS a need to achieveclosure, or a bias
toward accepting evidence that confirms their preferred hypothesis.Each experi-
ment is also a social situation, full of social roles (e.g., participant, experimenter,
assistant) and social expectations (e.g., that people should provide true informa-
tion) but with a uniqueness (e.g., that the experimenter does not always tell the
truth) that can lead to problems when social cues are misread or deliberately
thwarted by either party. Fortunately these limits are not insurmountable, as for-
mal training can help overcome some of them (Lehman, Lempert, & Nisbett,
1988). Still, the relationship between scientific results and the world that science
studies is neither simple nor fully trustworthy.
These social and psychological analyseshave taken some of the luster from
the experiment as a centerpieceof science.The experiment may have a life of its
own, but it is no longer life on a pedestal. Among scientists,belief in the experi-
ment as the only meansto settle disputes about causation is gone, though it is still
the preferred method in many circumstances. Gone, too, is the belief that the
power experimental methods often displayed in the laboratory would transfer eas-
ily to applications in field settings. As a result of highly publicized science-related
I

OR CAUSES?I gT
A WORLDWITHOUTEXPERIMENTS
I

eventssuchasthe tragicresultsof the Chernobylnucleardisaster,the disputesover


certaintylevelsof DNA testingin the O.J. Simpsontrials, and the failure to find
a cure for most cancersafter decadesof highly publicizedand funded effort, the
generalpublic now betterunderstandsthe limits of science.
Yet we should not take these critiques too far. Those who argue against
theory-freetestsoften seemto suggestthat everyexperimentwill comeout just as
the experimenterwishes.This expectationis totally contrary to the experienceof
researchers, who find insteadthat experimentationis often frustratingand disap-
pointing for the theoriesthey loved so much. Laboratory resultsmay not speak
for themselves, but they certainlydo not speakonly for one'shopesand wishes.
"stubborn facts" with
We find much to valuein the laboratoryscientist'sbeliefin
a life spanthat is greaterthan the fluctqatingtheorieswith which one tries to ex-
plain them.Thus many basicresultsabout gravityare the same,whetherthey are
containedwithin a framework developedby Newton or by Einstein;and no suc-
cessortheory to Einstein'swould be plausibleunlessit could accountfor most of
the stubbornfactlike findingsabout falling bodies.There may not be pure facts,
but someobservationsare clearlyworth treating as if they were facts.
Some theorists of science-Hanson, Polanyi, Kuhn, and Feyerabend
included-have so exaggeratedthe role of theory in scienceas to make experi-
mental evidenceseemalmost irrelevant.But exploratory experimentsthat were
unguidedby formal theory and unexpectedexperimentaldiscoveries tangentialto
the initial researchmotivationshaverepeatedlybeenthe sourceof greatscientific
advances.Experimentshaveprovidedmany stubborn,dependable,replicablere-
sultsthat then becomethe subjectof theory.Experimentalphysicistsfeelthat their
laboratorydata help keeptheir more speculativetheoreticalcounterpartshonest,
giving experimentsan indispensablerole in science.Of course,thesestubborn
facts often involve both commonsensepresumptionsand trust in many well-
establishedtheoriesthat make up the sharedcore of belief of the sciencein ques-
tion. And of course,thesestubbornfactssometimesproveto beundependable, are
reinterpretedas experimentalartifacts,or are so ladenwith a dominantfocal the-
ory that they disappearoncethat theory is replaced.But this is not the casewith
the greatbulk of the factualbase,which remainsreasonablydependableover rel-
ativelylong periodsof time.

ORCAUSES?
A WORLDWITHOUTEXPERIMENTS
To borrow a thought experimentfrom Maclntyre (1981),imaginethat the slates
of scienceand philosophywerewiped cleanand that we had to constructour un-
derstandingof the world anew.As part of that reconstruction,would we reinvent
the notion of a manipulablecause?\7e think so, largely becauseof the practical
utility that dependablemanipulandahave for our ability to surviveand prosper.
IUTouldwe reinvent the experimentas a method for investigatingsuch causes?
I
AND GENERALTZED
32 | 1. EXPERTMENTS CAUSAL
TNFERENCE

Again yes,becausehumanswill always be trying to betterknow how well these


manipulablecauseswork. Over time, they will refinehow they conductthoseex-
perimentsand so will againbe drawn to problemsof counterfactualinference,of
causeprecedingeffect,of alternativeexplanations,and of all of the other features
of causationthat we havediscussedin this chapter.In the end, we would proba-
bly end up with the experimentor somethingvery much like it. This book is one
more stepin that ongoingprocessof refining experiments.It is about improving
the yield from experimentsthat take placein complexfield settings,both the qual-
ity of causalinferencesthey yield and our ability to generalizetheseinferencesto
constructsand over variationsin persons,settings,treatments,and outcomes.
A CriticalAssessment
of
Our Assumptions
As.sump.tion(e-simp'shen):[Middle Englishassumpcion,from Latin as-
sumpti, assumptin-adoption, from assumptus,past participle of ass-
mere,te adopt; seeassume.]n. 1. The act of taking to or upon oneself:
assumptionof an obligation. 2.The act of taking overiassumptionof
command. 3. The act of taking for granted:assumptionof a false the-
ory. 4. Somethingtaken for granted or acceptedas true without proof;
a supposition:a ualid assumption. 5. Presumption;arrogance.5.
Logic.A minor premise.

fltHIS BooK covers five central topics across its 13 chapters. The first topic
| (Chapter 1) deals with our general understanding of descriptive causation and
I experimentation. The second (Chapters 2 and 3) deals with the types of valid-
ity and the specific validity threats associatedwith this understanding. The third
(Chapters 4 through 7) deals with quasi-experimentsand illustrates how combin-
ing design features can facilitate better causal inference. The fourth (Chapters 8
through L0) concerns randomized experiments and stressesthe factors that im-
pede and promote their implementation. The fifth (Chapters 11 through L3) deals
with causal generalization, both theoretically and as concerns the conduct of in-
dividual studies and programs of research.The purpose of this last chapter is to
critically assesssome of the assumptions that have gone into these five topics, es-
pecially the assumptions that critics have found obiectionable or that we antici-
'We
pate they will find objectionable. organize the discussionaround each of the
five topics and then briefly justify why we did not deal more extensivelywith non-
experimental methods for assessingcausation.
I7e do not delude ourselvesthat we can be the best explicators of our own as-
sumptions. Our critics can do that task better. But we want to be as comprehen-
sive
srve and
an(l as explicit
explclt as we can.
can. This
I nrs is rn part
ls in part because
becausewe
we are
are convinced
convrnced of
ot the
the ad-
acl-
vantages of falsification as a major component of any epistemology for the social
sciences,and forcing out one's assumptions and confronting them is one part of
falsification. But it is also becausewe would like to stimulate critical debateabout
theseassumptionsso that we can learn from those who would challengeour think-

456
AND EXPERIMENTATION
CAUSATION rct
|

ing. If therewereto be a future book that carriedevenfurther forward the tradi-


tion emanatingfrom Campbelland Stanleyvia Cook and Campbellto this book,
then that futuie book *o,rld probably be all the better for building upon all the
justifiedcriticismscomingfrom thosewho do not agreewith us, eitheron partic-
cau-
,rlu6 o, on the whole approachwe havetaken to the analysisof descriptive
sationand its generayzition.'We would like this chapternot only to model the at-
but
i.-p, to be cr"iti.alabout the assumprionsall scholarsmust inevitablymake
and how they might be
alsoto encourageothersto think about theseassumptions
addressed in fuiure empiricalor theoreticalwork'

ENTATION
AND EXPERIM
CAUSATION

CausalArrows and Pretzels


descriptive
Experiments test the influence of one or at most a small subset of
very few
causes.If statistical interactions are involved, they tend to be among
of moderator variables'
treatments or between a single treatment and a limited set
typical
Many researchersbelieve that the causal knowledge that results from this
.*p.ii-..rtal structure fails to map the many causal forces that simultaneously af-
(e.g., Cronbach et al',
fe.t "ny given outcome in compiex and nonlinear ways
prioritize on ar-
19g0; Magnusson,2000). These critics assertthat experiments
an explanatory
,o*, .onrr-.cting A to B when they should instead seekto describe
most causal
pretzel or set of intersectingpretzels,as it were. They also believethat
whether
ielationships vary across ,rttitt, settings, and times, and so they doubt
(e.g., Cronbach 6c Snow,
there ".. "ny constant bivariate causal relationships
reflect sta-
1977).Those that do appearto be dependablein the data may simply
to reveal the
tistically underpow.r.i irr,, of modeiators or mediators that failed
sizesmight
true underlying complex causal relationships. True-variation in effect
or the
also be obrc.rr"d b.c"rrs. the relevant substantive theory is underspecified,
is attenuated, or
outcome measuresare partially invalid, or the treatment contrast
(McClelland
causally implicated variables afe truncated in how they are sampled
6c Judd, 1993).
As valid as theseobiectionsare, they do not invalidatethe casefor experi-
ments.The purposeof experimentsis not to completelyexplain-some.phenome-
makes
non; it is to ldentify whethera particularvariableor small setof variables
affect-
a margirraldifferencein someoutcomeover and above all the other forces
not
ing that outcome.Moreover,ontologicaldoubts such as the precedinghave
as though many
stJppedbelieversin more complex iausal theoriesfrom acting
or as
.r,rol relationshipscan be usefullycharacterizedas dependablemain effects
In this
very simpl. nonlin."rities that are also dependableenoughto be_u_seful.
where
connection,considersomeexamplesfrom educationin the United States,
I
4s8 14.A CRTTTCAL
ASSESSMENT
OFOURASSUMPTTONS
|

objections to experimentation are probably the most prevalent and virulent. Few
educational researchersseemto object to the following substantiveconclusions of
the form that A dependably causesB: small schools are better than large ones;
time-on-task raises achievement; summer school raises test scores;school deseg-
regation hardly affects achievement but does increaseWhite flight; and assigning
and grading homework raises achievement.The critics also do not seemto object
to other conclusions involving very simple causal contingencies: reducing class
"sizable"
size increasesachievement,but only if the amount of change is and to a
level under 20; or Catholic schools are superior to public ones, but only in the in-
ner city and not in the suburbs and then most noticeably in graduation rates rather
than in achievementtest scores. ,
The primary iustification for such oversimplifications-and for the use of the
experiments that test them-is that some moderators of effects are of minor rele-
vance to policy and theory even if they marginally improve explanation. The most
important contingencies are usually those that modify the sign of a causal rela-
tionship rather than its magnitude. Sign changesimply that a treatment is benefi-
cial in some circumstancesbut might be harmful in others. This is quite different
from identifying circumstancesthat influence just how positive an effect might be.
Policy-makers are often willing to advocate an overall change,even if they suspect
it has different-sizedpositive effects for different groups, as long as the effects are
rarely negative. But if some groups will be positively affected and others nega-
tively political actors are loath to prescribe different treatments for different
groups becauserivalries and jealousies often ensue. Theoreticians also probably
pay more attention to causal relationships that differ in causal sign becausethis
result implies that one can identify the boundary conditions that impel such a dis-
parate data pattern.
Of course, we do not advocate ignoring all causal contingencies.For exam-
ple, physicians routinely prescribe one of severalpossibleinterventions for a given
diagnosis.The exact choice may depend on the diagnosis,test results,patient pref-
erences, insurance resources, and the availability of treatments in the patient's
area. However, the costs of such a contingent system are high. In part to limit the
number of relevant contingencies,physicians specialize,andwithin their own spe-
cialty they undergo extensivetraining to enable them to make thesecontingent de-
cisions. Even then, substantial judgment is still required to cover the many situa-
tions in which causal contingencies are ambiguous or in dispute. In many other
policy domains it would also be costly to implement the financial, management,
and cultural changesthat a truly contingent system would require even if the req-
uisite knowledge were available. Taking such a contingent approach to its logical
extremes would entail in education, for example, that individual tutoring become
dav.Studentsand instructorswould haveto be carefullymatched
the order of the day. matched
for overlap in teachingand learning skills and in the curriculum supportsthey
would need.
tilTithinlimits, some moderators can be studied experimentallSeither by
measuringthe moderator so it can be testedduring analysisor by deliberately
I Ot'
AND EXPERIMENTATION
CAU5ATION

experi-
varying it in the next study in a program of research'In conductingsuch
tak-
ments,onemovesawayfrom thethik-bo" experimentsof yesteryeartoward
more seriouslyand toward routinely study!1g them by,
ing causalcontingencies
foi .""-ple, disaggregating the treatmentto examineits causallyeffectivecom-
ponents,iir"ggt.glting the effect,toexamineits causallyimpactedcomponents,
variables,and
.ondrr.ting ,n"ty*r ofi.-ographic and psychologicalmoderator
affects
exploringlhe causalpathwa-ysihtooghwhjch (parts.of) the treatment
in a singleexperimentis not possi-
lparts of) the outcomJ.To do all of this well
tl.. brrtto do someof it well is possibleand desirable.

of E4periments
Criticisms
Epistemological
we have
In highlightingstatisticalconclusionvalidity and in-selectingexamples,
testing'
often linked causaldescriptionto quantitativemethodsand hypothesis
positivism'
Many criticswill (wrongly)r.. this asimplying a discreditedtheory of
positivismre-
As a philosophyof scieniefirst outlined in the early L9th century'
about unobservables,and equated
1.ct.d' metaphysicalspeculations,especially school of
lrro*t.ag. *lih descriptionsof e*periencedphenomena-A narrower
realism
logical pisitivism .*.rg.d in the eatly 20th century that also rejected
logic form
*til. "lro .-phasizing Ih. ,rr. of data-theoryconnectionsin predicate
Both thesere-
""J " fr.f.r.r.. for p"redictingphenomenaover explainingthem'
lated epistemologies *.r. lonf ago discredited,especiallyas explanationsof how
this basis'How-
scienceop.r"trr.*so few criticsseriouslycritici'e experimentson
to attack
ever,many critics use the term positiuismwith lesshistorical fidelity
1985)'
quantitativesocialsciencemethodsin genera-l(e'g', Lincoln & Guba,
use of quantification
liuilding on the rejectionof logicalpositivism,they reiectthe
and forLal logic in observatiron, measurement,and hypothesistesting.Because
of posi-
theselast featuresare part of experiments,to reiectthis looseconception
are nu-
tivism entailsrejectingexperiments.However,the errorsin suchcriticisms
(like the idea that
merous.For example,to ,eject a specificfeatureof positivism
only permissiblelinks betweendata and
f,r"rrtifi.rtion and p redicatelogicare the
tiheory;doesnot nJcessarily imlly reiectingall relatedand more generalproposi-
testing
tions jsuch asthe notion that somekinds of quantificationand hypothesis
outlined more such er-
may be usefulfor knowledgegrowth).Ife and othershave
rors elsewhere (Phillips,1990;Shadish,I995al'
other epistemological criticismsof experimentationcitethe work of historians
and'wool-
of sciencesuchasKuh"n(1,g62),ofsociologistsof sciencesuchasLatour
tend
gar ltiZll "rrd of fhiloroph.ir of scienceiuchas Harr6'(1931).Thesecritics
of theories,the notion that
to focuson threethings.orre.i, the incommensurability
theoriesare neverper"fectly specifiedand so can alwaysbe reinterpreted.As a re-
be reiected'its
sult, when disconfirmingdata seemto imply that a theory should
poriolut., can insteadbI reworkedin order to make the theory and observations
to the
consistentwith eachother.This is usuallydoneby addingnew contingencies
460 | 14.A CRIT|CAL
ASSESSMENT
OF OURASSUMPTTONS
I

theory that limit the conditions under which it is thought to hold. A second cri-
tique is of the assumption that experimental observations can be used as truth
'We
tests. would like observations to be objective assessmentsthat can adjudicate
between different theoretical explanations of a phenomenon. But in practice, ob-
servationsare not theory neutral; they are open to multiple interpretations that in-
clude such irrelevanciesas the researcher'shopes, dreams, and predilections. The
consequenceis that observations rarely result in definitive hypothesistests.The fi-
nal criticism follows from the many behavioral and cognitive inconsistenciesbe-
tween what scientists do in practice and what scientific norms prescribe they
should do. Descriptions of scientists' behavior in laboratories reveal them as
choosing to do particular experiments becausethey have an intuition about a re-
lationship, or they are simply curious to seewhat happens, or they want to play
with a new piece of equipment they happen to find lying around. Their impetus,
therefore, is not a hypothesis carefully deduced from a theory that they then test
by means of careful observation.
Although these critiques have some credibilitg they are overgeneralized.Few
experimentersbelievethat their work yields definitive results even after it has been
subjected to professional review. Further, though these philosophical, historical,
and social critiques complicate what a "fact" means for any scientific method,
nonethelessmany relationships have stubbornly recurred despite changesassoci-
ated with the substantive theories, methods, and researcherbiasesthat first gen-
erated them. Observations may never achieve the status of "facts," but many of
them are so stubbornly replicable that they may be consideredas though they were
facts. For experimenters, the trick is to make sure that observations are not im-
pregnated with just one theory, and this is done by building multiple theories into
observationsand by valuing independent replications, especiallythose of sub-
stantive critics-what we have elsewherecalled critical multiplism (Cook, 1985;
Shadish,'1.989, 1994).
Although causal claims can never be definitively tested and proven, individ-
ual experiments still manage to probe such claims. For example, if a study pro-
duces negative results, it is often the casethat program developersand other ad-
vocates then bring up methodological and substantive contingenciesthat might
have changedthe result. For instance, they might contend that a different outcome
measure or population would have led to a different conclusion. Subsequentstud-
ies then probe these alternatives and, if they again prove negative, lead to yet an-
other round of probes of whatever new explanatory possibilities have emerged.
After a time, this process runs out of steam, so particularistic are the contingen-
cies that remain to be examined. It is as though a consensusemerges:"The causal
relationship was not obtained under many conditions. The conditions that remain
to be examined are so circumscribed that the intervention will not be worth much
'W'e
even if it is effectiveunder these conditions. " agreethat this processis as much
or more social than logical. But the reality of elastic theory does not mean that de-
cisions about causal hypotheses are only social and devoid of all empirical and
I
logical content.
I
I
t
I

J
| +er
AND EXPERIMENTATION
CAUSATION

The criticismsnoted are especiallyusefulin highlightingthe limited value of


individual studiesrelativeto reviewsof researchprograms.Suchreviewsare bet-
ter becausethe greaterdiversityof study featuresmakesit lesslikely that the same
theoreticalbiasesthat inevitablyimpregnateany one studywill reappearacrossall
the studiesunderreview.Still, a dialecticprocessof point, response,and counter-
point is neededevenwith reviews,againimplying that no singlereview is defini-
iirr.. Fo, example,in responseto Smith and Glass's(1'977)meta-analyticclaim
that psychotheiapy *", .ff..tive, Eysen ck (L977)and Presby(1'977)pointedout
methojological and substantivecontingenciesthat challengedthe original re-
viewers'reJults.They suggested that a differentanswerwould havebeenachieved
if Smith and Glassitrd ""t combinedrandomizedand nonrandomizedexperi-
mentsor if they had usednarrower calegoriesin which to classifytypesof ther-
apy. Subsequentstudiesprobed thesechallengesto Smith and Glassor brought
foith nouef or,., 1e.g.,\il'eiszet al., 1,992).This processof challengingcausal
claimswith specificalternativeshas now slowedin reviewsof psychotherapyas
many major contingencies have beenexplored.The
that might limit effectiveness
currenrconsensus fiom reviewsof many experimentsin many kinds of settingsis
that psychotherapyis effective;it is not iust the product of a regressionprocess
in needseekprofes-
lrporrt"nrors remission)wherebythosewho are temporarily
,ii""t help and get better,as they would haveevenwithout the therapy'

NeglectedAncillarYQuestions
Our focus on causalquestionswithin an experimentalframework neglectsmany
other questionsthat arerelevantto causation.Theseincludequestionsabout how
to decideon the importanceor leverageof any singlecausalquestion.This could
entail exploringwhethera causalquestionis evenwarranted,as it often is not at
the early sa"g.-ofdevelopmentof an issue.Or it could entail exploringwhat type
of c".rsalquestionis moie important-one that fills an identifiedhole in somelit-
erature,o, orr. that setsout to identify specificboundary conditionslimiting a
causalconnection,or one that probesthe validity of a centralassumptionheld by
all the theoristsand researchers within a field, or one that reducesuncertainty
about an important decisionwhen formerly uncertaintywas high. Our approach
alsoneglectsthe realitythat how oneformulatesa descriptivecausalquestionusu-
aily enLils meetingsomestakeholders'interestsin the socialresearchmore than
those of others.TLus to ask about the effectsof a national program meetsthe
needsof Congressional staffs,the media,and policy wonks to learnaboutwhether
the program"*orks. But it can fail to meet the needsof local practitionerswho
,rro"lly"*"nt to know about the effectiveness of microelementswithin the pro-
gram ,o thut they can usethis knowledgeto improve their daily practice.-Inmore
Ih.or.ti."l work, to ask how some interventionaffectspersonalself-efficacyis
likely to promote individuals'autonomyneeds,whereasto ask about the effects
of a'persoasivecommunicationdesignedto changeattitudescould well cater to
t
462 14.A CR|T|CAL
ASSESSMENT
OFOURASSUMPT|ONS
|

the needs of those who would limit or manipulate such autonomy. Our narrow
technical approach to causation also neglectedissuesrelated to how such causal
knowledge might be used and misused. It gave short shrift to a systematic analy-
sis of the kinds of causal questions that can and cannot be answered through ex-
periments. \7hat about the effects of abortion, divorce, stable cohabitation, birth
out of wedlock, and other possibly harmful events that we cannot ethically ma-
nipulate? What about the effects of class,race, and gender that are not amenable
'What
to experimentation? about the effects of historical occurrencesthat can be
studied only by using time-seriesmethods on whatever variables might or might
not be in the archives?Of what use, one might ask, is a method that cannot get at
some of the most important phenomena that shape our social world, often over
generations,as in the caseof race, class,and gender?
Many statisticians now consider questions about things that cannot be ma-
nipulated as being beyond causal analysis,so closely do they link manipulation to
causation. To them, the cause must be at least potentially manipulable, even if it
is not actually manipulated in a given observational study. Thus they would not
consider race ^ cause, though they would speak of the causal analysis of race in
studies in which Black and White couples are, say, randomly assignedto visiting
rental units in order to seeif the refusal rates vary, or that entail chemically chang-
ing skin color to seehow individuals are responded to differently as a function of
pigmentation, or that systematicallyvaried the racial mix of studentsin schools or
classrooms in order to study teacher responsesand student performance. Many
critics do not like so tight a coupling of manipulation and causation. For exam-
ple, those who do status attainment researchconsider it obvious that race causally
influences how teachers treat individual minority students and thus affects how
well these children do in school and therefore what jobs they get and what
prospects their own children will subsequentlyhave. So this coupling of causeto
manipulation is a real limit of an experimental approach to causation. Although
we like the coupling of causation and manipulation for purposes of defining ex-
periments, we do not seeit as necessaryto all useful forms of cause.

VALIDITY
Objectionsto InternalValidity
There are severalcriticismsof Campbell's(1957) validity typology and its exten-
sions(Gadenne,1976;Kruglanski& Kroy, 1.976;Hultsch 6cHickey,1978;Cron-
'We
bach, 1982; Cronbachet al., 1980). start first with two criticismsof internal
validity raisedby Cronbach(1982)and to a lbsserextentby Kruglanskiand Kroy
(1'976):(1) an atheoreticallydefinedinternal validity (A causesB) is trivial with-
out referenceto constructs;and (2) causationin singleinstancesis impossible,in-
cludingin singleexperiments.
vALtDtrY nol
I

lnternal Validity ls Trivial


Cronbach(L982)writes:
I consider it pointless to speak of causeswhen all that can be validly meant by refer-
ma-
enceto a causein a particular instanceis that, on one trial of a partially specified
other conditions not named, phe-
nipulation under.orrditior6 A, B, and c, along with
nomenon p was observed.To introduce the word cause seemspointless. Campbell's
writings make internal validity a property of trivial, past-tense'and local statements'
(p .t3 7 )

Hence,.,causallanguageis superfluous"(p. 140).Cronbachdoesnot retaina spe-


cific role fo, .",rr"Iinferenceln his validity typology at all. Kruglanski and Kroy
(1976)criticizeinternalvalidity similanlSsaying:
are
The concrete events which constitute the treatment within a specific research
category' ' ' ' Thus, it is simply
meaningful only as members of a general conceptual
are
impossibleto draw strictly specificconclusionsfrom an experiment: our concepts
g.rr.r"l and each pr.r,rppor"s an implicit general theory about resemblanceberween
different concretecases.(p. 1'57)

All theseauthors suggestcollapsinginternal with constructvalidity in different


ways.
Of course,we agreethat researchers conceptualize and discusstreatmentsand
outcomesin concepfualterms.As we saidin Chapter3, constructsare so basicto
l"rrgo"g. and thought that it is impossibleto conceptualizescientificwork with-
out"thJm. Indeed,ir, *"ny important respects,the constructswe use constrain
what we experience,a point agreedto by theoristsranging from Quine (L951'
L96g)to th; postmodernists(Conner,1989;Testeq1993). So when we say that
internalvalidity concernsan atheoreticallocal molar causalinference,we do not
mean that the researchershould conceptualizeexperimentsor report a causal
claim as "somethingmadea differencer"to useCronbach's(1982,p' 130) exag-
geratedcharacterization.
Still, it is both sensibleand usefulto differentiateinternal from constructva-
lidity. The task of sortingout constructsis demandingenoughto warrant separate
attention from the task of sorting out causes.After all, operationsare concept
laden,and it is very rare for researchers to know fully what thoseconceptsare.In
fu.t, th, ,erearchrialmostcertainlycannotknow them fully becauseparadigmatic
.orr..p,, areso implicitly and universallyimbuedthat thoseconceptsand their as-
sumptions "r. ,oi,'.,imes entirely unrecognized_ by researchcommunitiesfor
y."ri. Indeed,the history of scienceis repletewith examplesof famousseriesof
."p.rim.nts in which a causalrelationshipwas demonstratedearlS but it took
y."r, for the cause(or effect)to be consensuallyand stablynamed.For instance,
in psychologyand linguisticsmany causalrelationshipsoriginally emanatedfrom
a behavioriit paradigl but were later relabeledin cognitive terms; in the early
Hawthorne st;dy, illumination effectswere later relabeledas effectsof obtrusive
observers;and some cognitive dissonanceeffects have been reinterpretedas
464 I 14.A CRITICAL
ASSESSMENT
OF OURASSUMPTIONS

attribution effects.In the history of a discipline,relationshipsthat are correctly


identified as causalcan be important evenwhen the causeand effectconstructs
are incorrectlylabeled.Suchexamplesexist becausethe reasoningusedto draw
causalinferences(e.g.,requiring evidencethat treatmentprecededoutcome)dif-
fers from the reasoningusedto generalize(e.g.,matchingoperationsto prototyp-
ical characteristicsof constructs).\Tithout understandingwhat is meant by de-
scriptive causation, we have no means of telling whether a claim to have
establishedsuchcausationis justified.
Cronbach's(1982) prosemakesclear that he understandsthe importanceof
causallogic; but in the end, his sporadicallyexpressedcraft knowledgedoesnot
add up to a coherenttheory of judgingthe validity of descriptivecausalinferences.
His equation of internal validity as part of reproducibility (under replication)
missesthe point that one can replicateincorrectcausalconclusions.His solution
to suchquestionsis simplythat "the forceof eachquestioncan bereducedby suit-
able controls" (1982,p. 233).This is inadequate, for a completeanalysisof the
problem of descriptivecausalinferencerequiresconceptswe can useto recognize
suitablecontrols.If a suitablecontrol is one that reducesthe plausibilityof, say
historyor maturation,asCronbach(1982,p.233)suggests, thisis little morethan
internalvalidity aswe haveformulatedit. If one needsthe conceptsenoughto use
them, then they should be part of a validity typology for cause-probing methods.
For completeness, we might add that a similar boundaryquestionarisesbe-
tween constructvalidity and externalvalidity and betweenconstructvalidity and
statisticalconclusionvalidity. In the former case,no scientistever framesan ex-
ternal validity questionwithout couchingthe questionin the languageof con-
structs.In the latter case,researchers neverconceptualizeor discusstheir results
solelyin terms of statistics.Constructsare ubiquitousin the processof doing re-
searchbecausethey are essentialfor conceptualizingand reporting operations.
But again,the answerto this objectionis the same.The strategiesfor making in-
ferencesabout a constructare not the sameas strategiesfor making inferences
about whether a causal relationship holds over variation in persons,settings,
treatments,and outcomesin externalvalidity or for drawing valid statisticalcon-
clusionsin the caseof statisticalconclusionvalidity.Constructvalidity requiresa
theoreticalargumentand an assessment of the correspondence betweensamples
and constructs.Externalvalidity requiresanalyzingwhethercausalrelationships
hold over variations in persons,settings,treatments,and outcomes.Statistical
conclusionvalidity requirescloseexaminationof the statisticalproceduresand as-
sumptionsused.And again,one can be wrong about constructlabelswhile being
right about externalor statisticalconclusionvalidity.

Objections
to Causation
in SingleExperiments
A second criticism of internal validity deniesthe possibility of inferring causation
in a single experiment. Cronbach (1982) says that the important feature of cau-
sation is the "progressivelocalizationof a cause" (Mackie, 1974, p.73) over mul-

J
vALrDrry otu
|

tiple experimentsin a program of researchin which the uncertainties about the es-
sential i."t.rr.r of the cause are reduced to the point at which one can character-
ize exacflywhat the causeis and is not. Indeed, much philosophy of causation as-
serts that we only recognize causes through observing multiple instances of a
putative causal relationship, although philosophers differ as to whether the mech-
anism for recognition involves logical laws or empirical regularities (Beauchamp,
1974;P. White, 1990).
However, some philosophers do defend the position that causescan be in-
ferred in singleinstances(e.g.,Davidson, 1,967;Ducasse'1,95L1' Madden & Hum-
ber, L97'1,).A good example is causation in the law (e.g.,Hart & Honore, 1985)'
by which we judge whether or not one person, say, caused the death of another
despitethe fact that the defendant may 4ever before have been on trial for a crime.
The verdict requires a plausible casethat (among other things) the defendantb ac-
tions precededlhe death of the victim, that those actions were related to the death,
that other potential causesof the death are implausible, and that the death would
not have occurred had the defendant not taken those actions-the very logic of
causal relationships and counterfactualsthat we outlined in Chapter 1. In fact, the
defendant'scriminal history will often be specifically excluded from consideration
in iudging guilt during the trial. The lessonis clear. Although we may learn more
"bo,rt ."nsation from multiple than from single experiments, we can rnf.ercause
in single experiments.Indeed, experimenterswill do so whether we tell them to or
not. Providing them with conceptual help in doing so is a virtue, not a vice; fail-
ing to do so is a major flaw in a theory of cause-probing methods.
Of course, individual experiments virtually always use prior concepts from
other experiments.However, such prior conceptualizations are entirely consistent
with the claim that internal validity is about causal claims in single experiments.
If it were not (at least partly) about single experiments, there would be no point
to doing the experiment, for the prior conceptualization would successfullypre-
dict what will be observed.The possibility that the data will not support the prior
conceptualization makes internal validity essential.Further, prior conceptualiza-
tions are not logically necessary;we can experiment to discover effects that we
"The physicist George Darwin used
have no prior conceptual structure to expect:
to say tliat once in a while one should do a completely crazy experiment, like
blowing the trumper to the tulips every morning for a month. Probably nothing
wiil hafpen, but if something did happen, that would be a stupendousdiscovery"
(Hacking, L983, p. 15a). But we would still need internal validity to guide us in
judging if the trumpets had an effect.

Objections to Descriptive Causation


A few authorsobjectto the very notion of descriptivecausation.Typicall5 how-
ever,suchobjectionsaremadeabout a caricatureof descriptivecausationthat has
not teen usedin philosophyor in sciencefor many years-for example,a billiard
ball modelthat requiresa commitmentto deterministiccausationor that excludes
466 ra.n cRrrcALAssEssMENT
oFouRAssuMproNs
|

reciprocalcausation.In contrast,mostwho write aboutexperimentationtoday es-


pousetheoriesof probabilisticcausationin which the many difficultiesassociated
with identifyingdependablecausalrelationshipsare humbly acknowledged. Even
more important, thesecriticsinevitablyusecausal-sounding languagethemselves,
for example,replacing "cause" with "mutual simultaneousshaping" (Lincoln 6c
p.
Guba, 1985, 151).Thesereplacements seem to us to avoidthe word but keep
the concept,and for good reason.As we saidat the end of ChapterL, if we wiped
the slatecleanand constructedour knowledgeof the world aneq we believewe
would end up reinventingthe notion of descriptivecausationall over again, so
greatlydoesknowledgeof causeshelp us to survivein the world.

ObjectionsConcerning Between
the Discrimination
ConstructValidityand ExternalValidity
Although we traced the history of the present validity system briefly in Chapter 2,
readers may want additional historical perspectiveon why we made the changes
we made in the present book regarding construct and external validity. Both
Campbell (1957) and Campbell and Stanley(1963) only usedthe phraseexternal
validitS which they defined as inferring to what populations, settings,treatment
variables, and measurement variables an effect can be generalized.They did not
rcfer at all to construct validity. However, from his subsequentwritings (Camp-
bell, 1986), it is clear Campbell thought of construct validity as being part of ex-
ternal validity. In Campbell and Stanley therefore, external validity subsumed
generalizing from researchoperations about persons, settings,causes,and effects
for the purposes of labeling theseparticulars in more abstract terms, and also gen-
eralizing by identifying sourcesof variation in causal relationships that are attrib-
utable to person, setting, cause, and effect factors. All subsequentconceptualiza-
tions also share the same generic strategy based on sampling instancesof persons,
settings, causes,and effects and then evaluating them for their presumed corre-
spondenceto targets of inference.
In Campbell and Stanley'sformulation, person, setting, cause,and effect cat-
egories share two basic similarities despite their surface differences-to wit, all of
them have both ostensive qualities and construct representations.Populations of
persons or settings are composed of units that are obviously individually osten-
sive. This capacity to point to individual persons and settings, especially when
they are known to belong in a referent category permits them to be readily enu-
merated and selectedfor study in the formal ways that sampling statisticianspre-
fer. By contrast, although individual measures (e.g., the Beck Depression Inven-
tory) and treatments (e.g., a syringe full of a vaccine) are also ostensive,efforts to
enumerate all existing ways of measuring or manipulating such measuresand
treatments are much more rare (e.g.,Bloom, L956; Ciarlo et al., 1986; Steiner&
Gingrich, 2000). The reason is that researchersprefer to use substantivetheory to
determine which attributes a treatment or outcome measureshould contain in any

.J
vALrDtrYI oe,

given studS recognizing that scholars often disagreeabout the relevant attributes
of th. higher order entity and of the supposed best operations to representthem.
None of ihis negatesthe reality that populations of persons or settingsare also de-
fined in part by the theoretical constructs used to refer to them, just like treatments
and outiomes; they also have multiple attributes that can be legitimately con-
'!(hat,
tested. for instance, is the American population? \7hile a legal definition
surely exists,it is not inviolate. The German conception of nationality allows that
the gieat grandchildren of a German are Germans even if their parents and grand-
p"r*t, have not claimed German nationality. This is not possible for Americans.
And why privilege alegaldefinition? A cultural conception might admit as Amer-
ican all thor. illegal immigrants who have been in the United Statesfor decades
and it might e*cl.rde those American adults with passports who have never lived
in the United States. Given that person's,settings, treatments, and outcomes all
have both construct and ostensive qualities, it is no surprise that Campbell and
Stanley did not distinguish between construct and external validity.
Cook and Camptell, however, did distinguish between the two. Their un-
stated rationale for the distinction was mostly pragmatic-to facilitate memory
for the very long list of threats that, with the additions they made' would have
had to fit under bampbell and Stanley'sumbrella conception of external validity.
In their theoretical diicussion, Cook and Campbell associatedconstruct validity
with generalizingto causesand effects, and external validity with generalizing to
and across persons, settings, and times. Their choice of terms explicitly refer-
encedCronbach and Meehl (1955) who used construct and construct validity in
"about higher-order constructs from re-
measurementtheory to justify inferences
search operations'; lcook & Campbel| 1,979, p. 3S). Likewise, Cook and
Campbeli associatedthe terms population and external ualidity with sampling
theory and the formal and purposive ways in which researchersselect instances
of persons and settings. But to complicate matters, Cook and Campbell also
"all aspectsof the researchrequire naming samples in
brlefly acknowledged that
gener-alizable termi, including samplesof peoples and settings as well as samples
of -r"r,rres or manipulations" (p. 59). And in listing their external validity
threats as statistical inieractions between a treatment and population, they linked
external validity more to generalizing across populations than to generalizing to
them. Also, their construct validity threats were listed in ways that emphasized
generalizing to cause and effect constructs. Generalizing across different causes
ind effect, *", listed as external validity becausethis task does not involve at-
tributing meaning to a particular measure or manipulation. To read the threats
in Cook and Campbell, external validity is about generalizing acrosspopulations
of persons and settings and across different cause and effect constructs, while
construct validity is about generalizing to causesand effects.Where, then, is gen-
era\zing from samples of persons or settings to their referent populations? The
text disiussesthis as a matter of external validitg but this classification is not ap-
parent in the list of validity threats. A system is neededthat can improve on Cook
and Campbell's partial confounding between objects of generalization (causes
468 ASSESSMENT
14.A CRITICAL OF OURASSUMPTIONS

and effects versus persons and settings) and functions of generalization (general-
izing to higher-order constructs from researchoperations versus inferring the de-
greeof replicationacrossdifferent constructsand populations).
This book usessucha functional approachto differentiateconstructvalidity
from externalvalidity. It equatesconstructvalidity with labelingresearchopera-
tions, and externalvalidity with sourcesof variation in causalrelationships.This
new formulation subsumesall of the old. Thus, Cook and Campbellt under- ,i-{11
standingof constructvalidity asgeneralizingfrom manipulationsand measures to f i..

causeand effectconstructsis retained.So is externalvalidity understoodas gen-


eralizingacrosssamplesof persons,settings,and times.And generalizingacross
different causeor effectconstructsis now,evenmore clearlyclassifiedas part of
exrernalvalidity.Also highlightedis the needto label samplesof personsand set-
tings in abstractterms, iust as measuresand manipulationsneedto be labeled.
Suchlabelingwould seemto be a matterof constructvalidity giventhat construct
validity is functionallydefinedin termsof labeling.However,labelinghumansam-
ples might have been read as being a matter of external validity in Cook and
Campbell,given that their referentswere human populationsand their validity
typeswereorganizedmore around referentsthan functions.So,althoughthe new
formulation in this book is definitelymore systematicthan its predecessors,
we are
unsurewhetherthat systematizationwillultimately result in greaterterminologi-
cal clarity or confusion.To keepthe latter to a minimum, the following discussion
reflectsissuespertinentto the demarcationof constructand externalvalidity that
have emergedeither in deliberationsbetweenthe first two authorsor in classes
that we havetaughtusingpre-publication versionsof this book.

Is Construct Vatidity a Prerequisite for External Vatidity?


In this book, we equateexternalvalidity with variation in causalrelationshipsand
constructvalidity with labeling.research operations.Somereadersmight seethis
assuggesting of a causalrelationshiprequiresthe ac-
that successfulgeneralization
curate labelingof eachpopulation of personsand eachtype of settingto which
generalization is sought,eventhough we can neverbe certainthat anythingis la-
beledwith perfectaccuracy.The relevanttask is to achievethe most accurateas-
sessmentavailableunder the circumstances. Technically,we can.test genenliza-
tion acrossentitiesthat are akeadyknown to be confoundedand thus not labeled
well-e.g., when causaldata arebrokenout by genderbut the femalesin the sam-
ple are, on average,more intelligentthan the malesand thereforescorehigheron
everythingelsecorrelatedwith intelligence.This exampleillustrateshow danger-
ous it is to rely on measuredsurfacesimilarity alone (i.e.,genderdifferences)for
determininghow a sampleshouldbe labeledin populationterms.\7e might more
accuratelylabel genderdifferencesif we had a random sampleof each gender
taken from the samepopulation.But this is not often found in experimentalwork,
and eventhis is not perfectbecausegenderis known to be confoundedwith other
attributes(e.g.,income,work status)evenin the population,and thoseother at-

t
.,.J
I oo,
vALrDrrY

tributes may be pertinent labels for some of the inferencesbeing made. Hence, we
usually have to rely on the assumption that, becausegender samplescome from
the same physical setting, they are comparable on all background characteristics
that might be correlated with the outcome. Becausethis assumption cannot be
fully testedand is ^nyw^y often false-as in the hypothetical example above-this
means rhat we could and should measure all the potential confounds within the
limits of our theoretical knowledge to suggestthem, and that we should also use
these measuresin the analysis to reduce confounding.
Even with acknowledged confounding, sample-specific differences in effect
sizesmay still allow us to conclude that a causal relationship varies by something
associatedwith gender.This is a useful conclusion for preventing premature over-
generalization.Iilith more breakdownq, confounded or not, one can even get a
senseof the percentageof contrastsacrosswhich a causal relationship does and
does not hold. But without further work, the populations across which the rela-
tionship varies are incompletely identified. The value of identifying them better is
particularly salient when some effect sizescannot be distinguished from zero. Al-
though this clearly identifies a nonuniversal causal relationship, it does not ad-
vance theory or practice by specifying the labeled boundary conditions over which
a causal relationship fails to hold. Knowledge gains are also modest from gener-
alization strategiesthat do not explicitly contrast effect sizes.Thus, when differ-
ent populations are lumped together in a single hypothesis test, researcherscan
learn how large a causal relationship is despite the many unexamined sources of
variation built into the analysis. But they cannot accurately identify which con-
structs do and do not co-determine the relationship's size. Construct validity adds
useful specificity to external validity concerns, but it is not a necessarycondition
for external validity.'We can generalize across entities known to be confounded'
albeit lessusefully than acrossaccurately labeled entities.
This last point is similar to the one raised earlier to counter the assertion of
Gadenne (L9761and Kruglanski and Kroy (1976) that internal validity requires
the high consrruct validity of both causeand effect. They assertthat all scienceis
"something causedsome-
about constructs, and so it has no value to conclude that
thing sfss"-1hs result that would follow if we did a technically exemplary ran-
domized experiment with correspondingly high internal validity but the causeand
effect were not labeled. Nonetheless, a causal relationship is demonstrably en-
"something reliably causedsomething else" might lead
tailed, and the finding that
to further researchto refine whatever clues are available about the cause and ef-
fect constructs. A similar argument holds for the relationship of construct to ex-
ternal validity. Labels with high construct validity are not necessaryfor internal
or for external validity, but they are useful for both.
Researchersnecessarilyuse the language of constructs (including human and
setting population ones) to frame their research questions and selecttheir repre-
sentationsof constructsin the samplesand measureschosen.If they have designed
their work well and have had some luck, the constructs they begin and end with
will be the same,though critics can challengeany claims they make. However, the
470 14.A CRITICAL OFOURASSUMPTIONS
ASSESSMENT

samplesand constructs might not match we[], and then the task is to examine the
samples and ascertain what they might alternatively stand for. As critics like
Gadenne,Kruglanski,and Kroy havepointedout, suchrelianceon the operational
levelseemsto legitimizeoperationsashavinga life independentof constructs.This
is not the case,though,for operationsare intimatelydependenton interpretations
at all stagesof research.Still, every operation fits some interpretations, however
tentative that referent may be due to poor researchplanning or to nature turning
out to be more complex than the researcher'sinitial theory.

How Does Variation AcrossDifferent Operational Representations


of the SameIntendedCauseor EffectRelateto Constructand
ExternalValidity?
In Chapter 3 we emphasizedhow the valid labeling of a cause or effect benefits
from multiple operational instances,and also that thesevarious instancescan be
fruitfully analyzedto examine how a causal relationship varies with the definition
used. If each operational instance is indeed of the sameunderlying construct, then
the samecausalrelationshipshouldresult regardlessof how the causeor effectis
operationally defined. Yet data analysis sometimes revealsthat a causal relation-
ship varies by operational instance.This means that the operations are not in fact
equivalent,so that theypresumablytap both into differentconstructsand into dif-
ferent causalrelationships.Either the samecausalconstructis differentlyrelated
to what now must be seenas two distinct outcomes,or the sameeffectconstruct
is differently related to two or more unique causal agents.So the intention to pro-
mote the construct validity of causesand effects by using multiple operations has
now facilitated conclusions about the external validiry of causesor effects;that is,
when the external validity of the causeand effect are in play, the data analysishas
revealed that more than one causal relationship needsto be invoked.
FortunatelS when we find that a causal relationship varies over different causes
or different effects, the research and its context often provide clues as to how the
causalelementsin eachrelationshipmight be (re)labeled.For example,the researcher
will generally examine closely how the operations differ in their particulars, and will
also study which unique meaningshave been attached to variants like thesein the ex-
isting literature.While the meaningsthat are achievedmight be lesssuccessfulbe-
cause they have been devised post hoc to fit novel findings, they may in some cr-
cumstances still attain an acceptable level of accuracy and will certainly prompt
continued discussion to account for the findings. Thus, we come full circle. I7e be-
gan with multiple operational representations of the same causeor effect when test-
ing a single causal relationship; then the data forced us to invoke more than one re-
lationship; and finally the pattern of the outcomes and their relationship to the
existing literature can help improve the labeling of the new relationships achieved.A
construct validity exercise begets an externat validity conclusion that prompts the
need for relabeling constructs. Demonstrating effect size variation acrossoperations
presumed to represent the same cause or effect can enhance external validity by
vALlDlrY I ort

are involved than was origi-


showingthat more constructsand causalrelationships validity by pre-
and in that case,it can eventuallyincreaseconstruct
nally envisaged;
in the original choiceof meas-
ventingany mislabelingof the causeor effectinherent
causalrelationshipsabout how the
ures and by providffilues from detailsof the
seehereanalytictasksthat flow
elementsin each..f"io"ritp shouldbe labeled.'We
concerns'involving each'
smoothlybetween.onr,r.r.i and externalvalidity

of Personsor settings
should Generalizingfrom a single sample
Be Classifiedas External or Construct Validity?
If a study hasa singlesampleof pers.ons or settings,this samplemust representa
is an issue'Given that construct
population.How ,"nlrrr-pre should be labeled
an issueof constructvalidity?Af-
validity is about rJ.iirrg, i, Itbeling the lample
sincewith a singlesampleit is not
ter all, externalvalidity hardly seemsrelevant
in causalrelationshipswould
immediatelyobvious*n", comparisonof variation
of personsor settingsis treatedas a
be involved.So if g.".t"iit-g fio* a sample
from treatment and out-
matter of constructvalidity analogousto generalizing
highlightsa potential conflict in
come operations,i*o probl.-, "r-ir.. Firstl this
someparts of which saythat gen-
usagein the generalsocialsciencecommunity' va-
eralizationsfrom;;;i; of peopleto its pofulation are a matter of external
lidity, evenwhen ;rh.;;;", ,"y ih", labefingpeopleis a matter of constructva-
in Cook and Campbellthat
lidity. Second,trrir-J".r not fit'with the discussion
personsand settingsas an external
treatsgeneralizingrr.t" irrdiuidrr"lsamplesof
threatsdoesnot explicitly deal
validity matter,thoughtheir list of .*t.*"1 validity
with this and only mentionsinteracti,ons betweenthe treatmentand attributesof
the settingand Person.
selectedfrom.the pop-
The issueis most acutewhen the samplewas randomly
so keento promoterandom sam-
ulation. considerwhy samplingstatisticiansare
Suchsamplingensuresthat the
pling for represe";i"; " *.il-dJrignated universe.
on all measuredand unmeasured
sampleand populatiJndistributionsare identical
Notice that this includesthe popula-
variableswithin the limits of samplingerror. also
randomsamplingguarantees
tion label(whethermoreor less"ccorit.;, which a well
appliesto the ,";;[. K.y tg tle or.i rl*r, of random samplingis having
in samplingtheory and
boundedpop.rl"tiJ., from which to sample,a-requirement
many well boundedpopulations
somethingoften obviousin practice.Given that
guarantees that a valid populationla-
are alsowell tabeied,r""a.- samplingthen
For instance'the population of
bel can equallyvalidly be applied,o itt. saripl..
known and is obviouslycorrectly
telephoneprefixesor.d i' tlie city of Chicagols
digit dialing frol that list of
labeled.Hence,i *""fa be difficuli. ,rrJt"ndom telephone
sampleas representing
Chicagopr.fi*., "nJ itt." mislabelthe resulting a clearly
sJction-of Chicago- Given
ownersin Detroii o, orty in the Edgewater
the samplelabel is the populationla-
boundedpopulationand random saripling,
that no methodis superiorto ran-
bel, which is why samplingstatisticiansbelieve
populationlabelis known'
dom selectio'f- iun.ii"g"tumpleswhen the
OFOURASSUMPTIONS
ASSESSMENT
472 I T+.N CRITICAL

With purposive sample selection,this elegant rationale cannot be used,


whetheror not the population label is known. Thus, if respondents were selected
haphazardlyfrom shoppingmalls all over Chicago,many of the peoplestudied
would belongin the likely populationof interest-residentsof Chicago.But many
would not becausesomeChicagoresidentsdo not go to malls at the hours inter-
viewing takes place, and becausemany personsin these malls are not from
Chicago.Lacking random sampling,we could not evenconfidentlycall this sam-
ple "peoplewalking in Chicagomalls," for other constructssuchas volunteering
to be interviewedmay be systematicallyconfounded with samplemembership.So,
meremembershipin the sampleis not sufficientfor accuratelyrepresenting a pop-
ulation, and by the rationalein the previousparagraph,it is alsonot sufficientfor
accuratelylabelingthe sample.All this leadsto two conclusionsworth elaborat-
ing: (1) that random sampling can sometimespromote constructvalidity, and
(2) thatexternalvalidity is in play when inferring that a singlecausalrelationship
from a samplewould hold in a population,whetherfrom a randomsampleor not.
On the first point, the conditions under which random samplingcan some-
timespromote the constructvalidity of singlesamplesare straightforward.Given
a well boundeduniverse,samplingstatisticianshavejustifiedrandom samplingas
away of clearlyrepresentingin the sampleall populationattributes.This must in-
cludethe populationlabel,and so random samplingresultsin labelingthe sample
in the sameterms that apply to the population. Random samplingdoesnot, of
course,tell us whetherthe population label is itself reasonablyaccurate;random
samplingwill also replicatein the sampleany mistakesthat are madein labeling
the population. However,given that many populationsare alreadyreasonably
well-labeledbasedon past researchand theory and that suchsituationsare often
intuitively obviousfor researchers experiencedin an area,random samplingcan,
underthesecircumstances, be countedon to promoteconstructvalidity.However,
when random selectionhas not occurredor when the populationlabel is itself in
doubt, this book hasexplicatedother principlesand methodsthat can be usedfor
labelingstudy operations,including labelingthe samplesof personsand settings
in a study.
On the secondpoint, when the questionconcernsthe validity of generalizing
from a causalrelationshipin a singlesampleto its population,the readermay also
wonder how externalvalidity can be in play at all. After all, we haveframedex-
ternal validity as beingabout whetherthe causalrelationshipholds overuariation
in persons,settings,treatmentvariables,and measurement variables.If thereis only
one random samplefrom a population,where is the variation over which to ex-
aminethat causalrelationship?The answeris simple:the variationis betweensam-
pled and unsampledpersonsin that population.As we saidin Chapter2 (and as
was true in our predecessorbooks), external validity questionscan be about
whether a causalrelationshipholds (a) over variationsin persons,settings,treat-
ments,and outcomesthat were in the experiment,and (b) for persons,settings,
treatments,and outcomesthat werenot in the experiment.Thosepersonsin a pop-
vALlDlw | 473

ulation who were not randomly sampledfall into the latter category.Nothing
about externalvalidity,eitherin the presentbook or in its predecessors, requires
that all possibleuariuiion, of externalvalidity interestactuallybe observedin the
study-indeed, it would beimpossibleto do so,and we providedseveralarguments
in Cirapter2 aboutwhy it would not be wise to limit external validity questions
only to variationsactuallyobservedin a study.Of course,in most casesexternal
ualidiry generalizations to things that were not studied are difficult, having to rely
on the .L.r..pt, and methodswe outlined in our grounded theory of generalized
causalinferencein Chapters11 through 13. But it is the great beautyof random
samplingthat it guaran;es that this generalizationwill hold over both sampledand
,rnr"-pl".d p.rr6nr. So it is indeedan externalvalidity questionwhah-e1acausal
relationshipthat hasbeenobservedin a singlerandomsamplewould hold for those
units that were in the populationbut not'in the random sample.
Inthe end,this book treatsthe labelingof a singlesampleof personsor set-
tings asa matterof constructvalidiry whetheror not random samplingis used.It
alsi treatsthe generalizationof causalrelationshipsfrom a singlesampleto un-
observedinstancesasa matterof externalvalidity-againrwhether or not random
samplingwas used.The fact that random sampling(which is associatedwith ex-
,.rrr"l uiiairy in this book) sometimeshappensto facilitatethe constructlabeling
of a sampleis incidentalto the fact that the population label is alreadyknown.
Though many populationlabelsare indeedwell-known, many more are still mat-
,.r, of debate,as reflectedin the exampleswe gavein Chapter3 of whetherper-
sonsshouldbe labeledschizophrenicor settingslabeledas hostilework environ-
ments.In theselatter cases,random samplingmakesno contribution to resolving
debatesabout the applicabilityof thoselabels.Instead,the principlesand meth-
ods we outlinedin Ci"pt.rs 11 through 13 will haveto be brought to bear.And
when random samplinghasnot beenused,thoseprinciplesand methodswill also
haveto be broughito b.". on the externalvalidity problemof generalizingcausal
relationshipsfrom singlesamplesto unobservedinstances.

of the Typology
ObjectionsAbout the Completeness
The first objectionof this kind is that our lists of particularthreatsto validity are
incomplete.Bracht and Glass(1,968),for example,ad-dednew externalvalidity
threatsthat they thought were overlookedby Campbelland Stanley(1,96311' and
more recentlyAiken ind West (1991) pointed to new reactivity threats._ These
challenges "r. i*portant becausethe key to the most confidentcausalconclusions
in our ,f,.ory of validity is the ability to construct a persuasiveargumentthat every
plausibleand identifiedthreat to validity has beenidentifiedand ruled out. How-
iver, thereis no guaranteethat all relevantthreatsto validity havebeenidentified.
Our lists are not divinely ordained,as can be observedfrom the changesin the
threats from Campbel IUST) to Campbell and Stanley (1'963)to Cook and
14.A CRITICAL OF OURASSUMPTIONS
ASSESSMENT

Campbell(1979) to this book. Threatsare better identifiedfrom insiderknowl-


edgethan from abstractand nonlocal lists of threats.
A secondobjectionis that we may haveleft out particularvalidity fypesor or-
ganizedthem suboptimally.Perhapsthe bestillustration that this is true is Sack-
ett's(1979) treatmentof bias in case-controlstudies.Case-controlstudiesdo not
commonly fall under the rubric of experimentalor quasi-experimental designs;
but they are cause-probingdesigns,and in that sensea generalinterestin general-
ized causalinferenceis at leastpartly shared.Yet Sackettcreateda different ty-
pology.He organizedhis list around sevenstagesof researchat which biascan oc-
cur: (1) in readingaboutthe field, (2) in samplespecification (3) in
and selection,
defining the experimentalexposure,(4) 'in measuringexposureand outcome,
(5) in dataanalysis,(5) in interpretationof analyses, and (71inpublishingresults.
Each of thesecould generatea validiry type, someof which would overlapcon-
siderablywith our validity types.For example,his conceptof biases"in executing
the experimentalmanoeuvre" (p. 62) is quite similar to our internal validiry
whereashis withdrawal biasmirrors our attrition. However,his list alsosuggests
new validity types,such as biasesin readingthe literature,and biaseshe lists at
each stageare partly orthogonal to our lists. For example,biasesin readingin-
clude biasesof rhetoric in which "any of severaltechniquesare usedto convince
the readerwithout appealingto reason"(p. 60).
In the end,then, our claim is only that the presenttypologyis reasonablywell
informed by knowledgeof the nature of generalizedcausalinferenceand of some
of the problemsthat are frequentlysalientabout thoseinferencesin field experi-
mentation.It can and hopefullywill continueto be improvedboth by addition of
threatsto existing validity types and by thoughtful exploration of new validity
typesthat might pertainto the problem of generalizedcausalinferencethat is our
main concern.t

1. We are acutelyaware of, and modestlydismayedat, the many differentusagesof thesevalidity labelsthat have
developedover the years and of the risk that posesfor terminological confusion---eventhough we are responsible
for rnany of thesevariations ourselves.After all, the understandingsof validiry in this book differ from those in
Campbelland Stanley(1963),whoseonly distinctionwas betweeninternal and externalvalidity. They alsodiffer
from Cook and Campbell (7979), in which externalvalidity was concernedwith generalizingto and across
populations of personsand settings,whereasall issuesof generalizingfrom the causeand effect operations
constitutedthe domain of constructvalidity. Further,Campbell(1985) himselfrelabeledinternalvalidiry and
external validiry as local molar causalvalidity and the principle of proximal similarity, respectively.Steppingoutside
Campbell'stradition, Cronbach(1982) usedtheselabelswith yet other meanings.He said internalvalidity is the
problem of generalizingfrom samplesto the domain about which the questionis asked,which soundsmuch like our
construct validity except that he specifically denied any distinction betweenconstruct validiry and external validiry,
using the latter term to refer to generalizingresults to unstudied populations, an issueof extrapolation beyond the
data at hand. Our understandingof external validity includessuch extrapolations as one case,but it is not limited to
that becauseit also has to do with empirically identifying sourcesof variation in an effect sizewhen existing data
allow doing so. Finally, many other authors have casually used all theselabels in completelydifferent ways (Goetz
& LeCompte,1984; Kleinbaum,Kupper, & Morgenstern,1982;Menard, 1991).So in view of all thesevariations,
we urge that theselabels be used only with descriptionsthat make their intended understandingsclear.

::j
!t
t
VALIDTTY 47s
|

the Natureof Validity


ObjectionsConcerning
'We it dif-
defined validity as the approximate truth of an inference. Others define
ferently. Here are some alternatives and our reasonsfor not using them'

Validity in the New TestTheory Tradition


Testtheorists discussed well be-
validity(e.g.,cronbach,1946;Guilford,1,946)
fore Campbell(L957) inventedhis typology.Sfecan only begin to touch on the
many iss.re,pertinentto validity that aboundin that tradition. Here we outline a
f.* i.y poinis that help differentiateour approachfrom that of test theory.The
early emphasisin test theory was mostly on inferencesabout what a test meas-
or.j, with a pinnaclebeingieachedin the notion of constructvalidity. Cronbach
"proper breadth to the notion of
ltltll creditsCook and -a-pbell for giving
consffucts',(p. 152) in constructvalidity through their claim that constructva-
lidity is not j"tt li-it.d to inferencesabout outcomesbut also about causesand
about orherfeaturesof experiments.In addition, early test theory tied validity to
"The literatureon validationhasconcentratedon the
the truth of suchinferences:
truthfulnessof testinterpretation"(Cronbach,1988,p' 5)'
However,the yearshave bro.tght changeto this early understanding'In one
particularlyinfluentialdefinitionof validity in test theory Messick(1989)said'
;V"lidiry ii an integratedevaluativejudgmentof the degreeto which empiricalev-
idenceand theoreti"cal rationalessupportthe adequacyand appropriateness of in-
and actionsbasedon testscoresor other modesof assessment" (p. L3);
ferences
"Validiry is broadly definedas nothing lessthan an evalua-
and later he saysthat
tive summary'of both the ruid.tr.. for and the actual-as well as potential-
consequen..,of scoreinterpretationand use" (1995, p.74L)._Whereas. our un-
d.rrtu.rdirrgof validity is that inferencesare the subjectof validation,this defini-
tion suggeJt,th"t actionsare also subjectto validation and that validation is ac-
tually evaluation.Theseextentionsare far from our view.
A little historywill help here.Testsare designedfor practicaluse.Commer-
cial test developers hope to profit from salesto thosewho usetests;employers
hope to ,rr. t.rt, to seiectbetterpersonnel;and test takershope that testswill
tell them somethingusefulabout themsqlves. Thesepracticalapplicationsgen-
eratedconcerni., tf,e AmericanPsychologicalAssociation(APA) to identify the
characteristicsof better and worse tests.APA appointeda committeechairedby
Cronbachto addressthe problem.The committeeproducedthe first in a contin-
uing seriesof teststandaris(APA,1,954);andthis wolk alsoled to Cronbachand
Melhl', (1955)classicarticle on constructvalidity. The test standardshave been
freq.rerrtiyrevised,most recentlycosponsoredby other professionalassociations
(AmericanEducaiionalResearchAssociation,American PsychologicalAssocia-
Re-
tion, and National Council on Measurementin Education,1985, 1999)'
qoirl-.nts to adhereto rhe standardsbecamepart of professionalethical codes.
Th" ,tandardswere also influential in legaland regulatoryproceedingsand have
14.A CRITICAL
ASSESSMENT
OF OURASSUMPTIONS

beencited,for example,in U.S.SupremeCourt casesaboutallegedmisusesof test-


ing practices (e.g., Albermarle Paper Co. v. MoodS 1975; Washington v. Davis,
L976) and have influencedthe "Uniform Guidelines"for personnelselectionby
the Equal EmploymentOpportunity Commission(EEOC)et al. (1978).Various
validity standardswere particularly salientin theseuses.
Becauseof this legal,professional,and regulatoryconcernwith the useof test-
ing, the researchcommunity concerned with measurementvalidity began to use the i'
,;;::
word ualiditymoreexpansivelyforexample, "asonewaytojustifytheuseofatest"
(Cronbach,1989,p. M9).It is only a short distancefrom validatinguseto validat-
ing action, becausemost of the relevantuseswere actionssuchas hiring or firing
someoneor labelingsomeoneretarded.Actions,in turn, haveconsequences-some
positive,suchas efficiencyin hiring and accuratediagnosisthat allows bettertailor-
ing of treatment,and somenegative,suchas lossof incomeand stigmatization.So
Messick(1989, 1995l proposedthat validationalsoevaluatethoseconsequences, es-
peciallythe socialjusticeof consequences. Thus evaluatingthe consequences of test
usebecamea key featureof validity in test theory.The net resultwas a blurring of
the line betweenvalidity-as-truth and validity-as-evaluation,to the point where
Cronbach(1988)said"Validationof a testor testuseis evaluation"(p.4).
'We
strongly endorse the legitimacy of questions about the use of both tests and
experiments. Although scientistshave frequently avoided value questions in the mis-
taken belief that they cannot be studied scientifically or that scienceis value free, we
cannot avoid values even if we try. The conduct of experiments involves values at
every step, from question selection through the interpretation and reporting of re-
sults. Concerns about the usesto which experiments and their results are put and the
value of the consequencesof those usesare all important (e.g.,Shadishet al., 1991),
as we illustrated in Chapter 9 in discussingethical concerns with experiments.
However, if validity is to retain its primary association with the truth of
knowledge claims, then it is fundamentally impossible to validate an action be-
causeactions are not knowledge claims. Actions are more properly evaluated, not
validated. Supposean employer administers a test, intending to use it in hiring de-
cisions. Suppose the action is that a person is hired. The action is not itself a
knowledge claim and therefore cannot be either true or false. Supposethat person
then physically assaultsa subordinate. That consequenceis also not a knowledge
claim and so also cannot be true or false. The action and the consequencesmerely
exist; they are ontological entities, not epistemological ones. Perhaps Messick
(1989) really meant to ask whether inferencesabout actions and consequencesare
true or false. If so, the inclusion of action in his (1,989)definition of validity is en-
tirely superfluous, for validity-as-truth is already about evidencein support of in-
ferences,including those about action or consequ.rr..s.'

2. Perhapspartly in recognitionof this, the most recentversionof the test standards(AmericanEducational


ResearchAssociation,American PsychologicalAssociation,and National Council on Measurementin Education,
1999) helpsresolvesomeof the problemsoudined hereinby removingreferenceto validatingaction from the
definition of validity: "Validity refersto the degreeto which evidenceand theory support the interpretationsof test
scoresentailedby proposedusesof tests" (p. 9).

i
,l

,I
t
VALIDITY I 477

Alternatively perhaps Messick ('1.989,L995) meant his definition to instruct


"Valid-
test validators to eualuatethe action or its consequences,as intimated in:
ity is broadly defined as nothing less than an evaluative summary of both the ev-
idence for and the actual-as well as potential--consequences of score interpre-
tation and use" (1,995, p. 742). Validity-as-truth certainly plays a role in
evaluating testsand experiments.But we must be clear about what that role is and
is not. Philosophers(e.g., Scriven, 1980; Rescher,1969) tell us that a judgment
about the value of something requires that we (1) selectcriteria of merit on which
the thing being evaluated would have to perform well, (2) set standards of per-
formanci for how well the thing must do on each criterion to be judged positivel5
(3) gather pertinent data about the thing's performance on the criteria, and then
Validity-as-truth
i+j i"6gr4te the results into one or more evaluative conclusions.
is one (but only one) criterion of merit in dvaluation; that is, it is good if inferences
about a test are true, just as it is good for the causal inference made from an ex-
periment to be true. However, validation is not isomorphic with evaluation. First,
criteria of merit for tests (or experiments) are not limited to validity-as-truth- For
example, a good test meetsother criteria, such as having a test manual that reports
,ror*^r, being affordable for the contexts of application, and protecting confiden-
tialiry ", "ppropriate. Second,the theory of validity Jvlessickproposed gives no
help in accomplishing some of the other steps in the four-step evaluation process
outlined previously. To evaluate a test, we need to know something about how
much ualidity the inference should have to be judged good; and we need to know
how to integrate results from all the other criteria of merit along with validity into
an overall waluation. It is not a flaw in validity theory that these other steps are
not addressed,for they are the domain of evaluation theory. The latter tells us
something about how to executethesesteps (e.g.,Scriven, 1980, 1'991)and also
about other matters to be taken into account in the evaluation. Validation is not
evaluation; truth is not value.
Of course, the definition of terms is partly arbitrary. So one might respond
that one should be able to conflate validity-as-truth and validity-as-evaluation if
one so chooses.However:

The very fact that termsmusrbesuppliedwith arbitrarymeaningsrequiresthat words


be usedwith a greatsenseof responsibility.This responsibilityis twofold: first, to es-
tablished,6"9"; second,to the limitationsthat the definitionsselectedimposeon the
"l'982,
user.(Goldschmidt, P. 642)
'We
need the distinction between truth and value becausetrue inferencescan be
about bad things (the fact that smoking causescancer does not make smoking or
cancer good); "nd f"lr. inferencescan lead to good things (the astrologer'sadvice
'lavoid alienating your coworkers today" may have nothing to do with
to Piscei to
heavenly bodies, but may still be good advice). Conflating truth and value can be
actively harmful. Messick (1995) makes clear that the social consequencesof test-
"bias, fairness, and distributive justice" (P. 745).
ing are to be judged in terms of
'Wi
agreewith this statement,but this is test evaluation, not test validity. Messick
I
478 ra. n cRrTrcAL
ASSESSMENT
OFOURASSUMPTTONS
|

notes that his intention is not to open the door to the social policing of truth (i.e.,
a test is valid if its social consequencesare good), but ambiguity on this issuehas
nonethelessopened this very door. For example, Kirkhart (1,995)cites Messick as
justification for judging the validity of evaluations by their social consequences:
"Consequential
validity refers here to the soundnessof changeexerted on systems
by evaluationand the extent to which thosechangesare just" (p.a).This notion is
risky becausethe most powerful arbiter of the soundnessand iustice of social con-
sequencesis the sociopolitical systemin which we live. Depending on the forces in
power in that system at any given time, we may find that what counts as valid is
effectively determined by the political preferencesof those with power.

Validity in the Qualitative Traditions


One of the most important developmentsin recent social researchis the expanded
use of qualitative methods such as ethnography ethnology, participant observa-
tion, unstructured interviewing, and case study methodology (e.g., Denzin 6c
Lincoln, 2000). These methods have unrivaled strengths for the elucidation of
meanings, the in-depth description of cases,the discovery of new hypotheses,and
the description of how treatment interventions are implemented or of possible
causal explanations. Even for those purposes for which other methods are usually
preferable,such as for making the kinds of descriptivecausalinferencesthat are the
topic of this book, qualitative methods can often contribute helpful knowledge and
'S7henever
on rare occasionscan be sufficient (Campbell, 1975; Scriven, 1976ll. re-
sources allow, field experiments will benefit from including qualitative methods
both for the primary benefits they are capable of generatingand also for the assis-
tance they provide to the descriptive causal task itself. For example, they can un-
cover important site-specificthreats to validiry and also contribute to explaining
experimental results in general and perplexing outcome patterns in particular.
However, the flowering of qualitative methods has often been accompanied
by theoretical and philosophical controversy, often referred to as the qualitative-
quantitative debates. These debates concern not just methods but roles and re-
wards within science,ethics and morality and epistemologiesand ontologies. As
part of the latter, the concept of validity has receivedconsiderableattention (e.g.,
Eisenhart & Howe, 1992; Goetz & LeCompte,1984; Kirk & Miller, 1'986;Kvale,
1.989;J. Maxwell, 1.992;J. Maxwell 6c Lincoln, 1.990;Mishler, 1,990;Phillips,
'Wolcott,
1,987; 1990). Notions of validity that are different from ours have occa-
sionally resulted from qualitative work, and sometimesvalidity is rejectedentirely.
However, before we review those differences we prefer to emphasize the com-
monalities that we think dominate on all sides of the debates.

Comtnonalities. As we read it, the predominant view among qualitative theorists


is that validity is a concept that is and should be applicable to their work..We start
with examples of discussionsof validity by qualitative theorists that illustrate these
similarities becausethey are surprisingly more common than someportrayals in the

:
I

:l{
VALIDITYI O''

qualitative-quantitative debatessuggestand because they demonstratean underly-


ing unity of interestin producingvalid knowledgethat we believeis widely shared
(1990) says, "qualitative re-
by"*ori social scientiits.For example,Maxwell
searchers are just as concernedas quantitativeonesabout'getting it wrong,' and
validity broadlydefinedsimplyrefersto the possibleways one'saccountmight be
'validity threats' can be addressed"(p. 505). Even those
*rorrg, and how these "go
quafi[tive theoristswho saythey rejectthe word ualidity will admit that they
to considerable painsnot to getit all wrong" (Wolcott,1990,p. L27).Kvale(1989)
"conceptsof validity are rootedin more com-
tiesvalidity directlyto truth, saying
prehensive epistemological assumptionsof the nature of true knowledge"(p. 1-1);
l'refersto the truth and correctness of a statement"(p.731.
and later that validity 'valid' is as a properly
"the technicaluseof the term
Kirk and Miller (1986) say
'true' " (p. L9). Maxwell (L9921says"Validiry in a
hedgedweak synonymfor
broad sense,pertainsto this relationshipbetweenan accountand somethingout-
sidethat account" (p. 283). All theseseemquite compatiblewith our understand-
ing of validity.
Maxvreli's(7992\ accountpoints to other similarities.He claimsthat validity
"the kinds of understandingsthat accountscan embody"
is always relative to
(p. 28il and that different communitiesof inquirers are interestedin different
kindsof understandings. He notesthat qualitativeresearchers areinterestedin five
kinds of understandingsabout: (1) the descriptionsof what was seenand heard,
(2) the meaningof what was seenand heard, (3) theoreticalconstructionsthat
characteriz.*h"t was seenand heardat higher levelsof abstraction,(4) general-
izationof accountsto other persons,times,or settingsthan originallystudied,and
(5) evaluationsof the objectsof study (Maxwell, 1'992;he saysthat the last two
understandings are of interestrelativelyrarely in qualitativework). He then pro-
posesa five-p-artvalidity typology for qualitativeresearchers, one for eachof the
'We
?ineorrd..standings. agreethat validity is relativeto understanding,thoughwe
usuallyrefer to in-ference iather than understanding.And we agreethat different
communitiesof inquirerstend to be interestedin different kinds of understand-
ings,though common interestsare illustratedby the apparentlysharedconcerns
thlt both ixperimentersand qualitativeresearchers have in how bestto charac-
terizewhatwas seenand heardin a study (Maxwell'stheoreticalvalidity and our
constructvalidity). Our extendeddiscussionof internal validity reflectsthe inter-
est of the community of experimentersin understandingdescriptivecauses'pro-
portionatelymore so than is relevantto qualitativeresearchers, evenwhen their
reportsare necessarily repletewith the languageof causation.This observationis
,rot " criticismof qualitativeresearchers, nor is it a criticism of experimentersas
being lessinterestedthan qualitativeresearchers in thick descriptionof an indi-
vidualcase.
On the other hand, we should not let differencesin prototypical tendencies
acrossresearchcommunitiesblind us to the fact that when a particular under-
standingls of interest,the pertinentvalidity concernsarethe sameno matterwhat
the metlodology usedto developthe knowledgeclaim. It would be wrong for a
14.A CRITICAL
ASSESSMENT
OF OURASSUMPTIONS

qualitative researcherto claim that internal validity is irrelevantto qualitative


methods.Validity is not a properry of methodsbut of inferencesand knowledge
claims. On those infrequent occasionsin which a qualitative researcherhas a
stronginterestin a local molar causalinference,the concernswe haveoutlinedun-
der internal validity pertain.This argumentcuts both ways,of course.An exper-
imenterwho wonderswhat the experimentmeansto participantscould learna lot
from the concernsthat Maxwell outlinesunder interpretivevalidity.
Maxwell (1992) also points out that his validity typology suggeststhreats
to validity about which qualitativeresearchers seek"evidencethat would allow
them to be ruled-out. . . usinga logic similar to that of quasi-experimental re-
searcherssuch as Cook and Campbell" (p. 296). He does not outline such
threatshimself,but his descriptionallows one to guesswhat somemight look
like. To judge from Maxwell's prose,threats to descriptivevalidity include er-
rors of commission(describingsomethingthat did not occur),errorsof omis-
sion (failingto describesomethingthat did occur),errorsof frequency(misstat-
ing how often something occurred), and interrater disagreementabout
description.Threatsto the validity of knowledgeclaimshavealsobeeninvoked
by qualitative theorists other than Maxwell-for example,by Becker(1979),
Denzin(1989'),and Goetzand LeCompte(1984).Our only significantdisagree-
ment with Maxwell's discussionof threats is his claim that qualitative re-
searchers are lessable to use "designfeatures"(p. 296) to deal with threatsto
validity. For instance,his preferreduseof multiple observersis a qualitativede-
signfeaturethat helpsto reduceerrorsof omission,commission,and frequency.
The repertoireof designfeaturesthat qualitativeresearchers usewill usuallybe
quite different from those used by researchersin other traditions, but they are
designfeatures(methods)all the same.

Dffirences. Theseagreementsnotwithstanding,many qualitativetheoristsap-


proach validity in ways that differ from our treatment.A few of thesedifferences
are basedon argumentsthat are simplyerroneous(Heap,7995;Shadish,1995a).
But many are thoughtful and deservemore attention than our spaceconstraints
allow. Following is a sample.
Somequalitativetheoristseither mix togetherevaluativeand socialtheories
of truth (Eisner,\979,1983) or proposeto substitutethe socialfor theevaluative.
SoJensen(1989)saysthat validiry refersto whethera knowledgeclaim is "mean-
ingful and relevant" (p. 107) to a particular languagecommunity; andGuba and
Lincoln (1,982)saythat truth can be reducedto whetheran accountis credibleto
thosewho read it. Although we agreethat socialand evaluativetheoriescomple-
ment eachother and are both helpful, replacingthe evaluativewith the socialis
misguided. These social alternatives allow for devastatingcounterexamples
(Phillips, 1987): the swindler'sstory is coherentbut fraudulent;cults convince
membersof beliefsthat havelittle or no apparentbasisotherwise;and an account
of an interactionbetweenteacherand studentmight be true evenif neitherfound
it to be credible.Bunge(1992) showshow one cannotdefinethe basicideaof er-
I
:J

I
:iil
14.A CRITICAL
ASSESSMENT
OF OURASSUMPTIONS

qualitative researcher to claim that internal validity is irrelevant to qualitative


methods. Validity is not a properfy of methods but of inferencesand knowledge
claims. On those infrequent occasions in which a qualitative researcher has a
strong interest in a local molar causal inference,the concernswe have outlined un-
der internal validity pertain. This argument cuts both ways, of course. An exper-
imenter who wonders what the experiment meansto participants could learn a lot
from the concerns that Maxwell outlines under interpretive validity.
Maxwell (1,992) also points out that his validity typology suggeststhreats
to validity about which qualitative researchersseek "evidencethat would allow
them to be ruled-out . . . using a logic similar to that of quasi-experimentalre-
searcherssuch as Cook and Campbell" (p. 296). He does not outline such
threats himself, but his description allows one to guess what some might look
like. To judge from Maxwell's prose, threats to descriptive validity include er-
rors of commission (describing something that did not occur), errors of omis-
sion (failing to describesomething that did occur), errors of frequency (misstat-
itg how often something occurred), and interrater disagreement about
description. Threats to the validity of knowledge claims have also been invoked
by qualitative theorists other than Maxwell-for example, by Becker (1,979),
Denzin (1989), and Goetz and LeCompte (1984). Our only significant disagree-
ment with Maxwell's discussion of threats is his claim that qualitative re-
searchersare less able to use "design features" (p. 2961to deal with threats to
validity. For instance, his preferred use of multiple observers ls a qualitative de-
sign feature that helps to reduce errors of omission, commission, and frequency.
The repertoire of design featuresthat qualitative researchersuse will usually be
quite different from those used by researchersin other traditions, but they are
design features (methods) all the same.

Differences. These agreementsnotwithstanding, many qualitative theorists ap-


proach validity in ways that differ from our treatment. A few of thesedifferences
are basedon argumentsthat are simply erroneous(Heap, 1.995;Shadish,1995a).
But many are thoughtful and deservemore attention than our spaceconstraints
allow. Following is a sample.
Some qualitative theorists either mix together evaluative and social theories
"1.979,1983)
of truth (Eisner, or propose to substitutethe socialfor the evaluative.
So Jensen(1989) saysthat validiry refers to whether a knowledge claim is "mean-
ingful and relevant" (p. L07l to a particular language community; and Guba and
Lincoln (t9821say that truth can be reduced to whether an account is credible to
those who read it. Although we agree that social and evaluative theories comple-
ment each other and are both helpful, replacing the evaluative with the social is
misguided. These social alternatives allow for devastating counterexamples
(Phillips, L987): the swindler's story is coherent but fraudulent; cults convince
members of beliefs that have little or no apparent basis otherwise; and an account
of an interaction between teacher and student might be true even if neither found
it to be credible. Bunge (1992) shows how one cannot define the basic idea of er- .il

j
t
I

I
'iil
VALIDITYI +ET

ror usingsocialtheoriesof truth. Kirk and Miller (1986) capturethe needfor an


evaluativetheory of truth in qualitativemethods:

In responseto the propensity of so many nonqualitative researchtraditions to use such


hidden positivist assumptions, some social scientists have tended to overreact by
stressinj the possibility ;f alternative interpretations of everything to th€ exclusion of
of ob-
urry .ffor, to chooseamong them. This extreme relativism ignores the other side
at all. It ignores the distinction between
leciivity-that there is an external world
lrro*l"dg. and opinion, and results in everyonehaving a separateinsight that cannot
be reconciledwith anyone else's.(p. 15)

A seconddifferencerefersto equatingthe validity of knowledgeclaimswith their


evaluation,aswe discussed earlierwith tqsttheory (e.g.,Eisenhart6CHowe, L992)'
This is mostexplicitin Salner(L989),whosuggested that much of validityin quali-
"that are useful for evaluatingcompeting
tative methodoiogyconcernsthe criteria
claims',(p. 51);"id rh. urgesresearchers to exposethe moral andvalueimplications
of ,.r."rch, *.rch asMessick'We (1.989)saidin reference to testtheory.Our response is
the sameas for test theory. endorsethe need to evaluateknowledge claims
broadly includingtheir moial implications;but this is not the sameassayingthat the
claim is-t.ue.Truih is just onecriterionof merit for a good knowledgeclaim.
A third differencemakes validity a result of the processby which truth
emerges.For instance,emphasizingthe dialecticprocessthat givesrise to truth'
,,ValidLnowledgeclaimsemerge. . . from the conflict and dif-
Salnei(l9g9l says:
ferencesbetweenthe contextsthemselvesas thesedifferencesare communicated
and negotiated amongpeoplewho sharedecisions and actions"(p. 61).Miles and
Huberman(1984)rpr"t of th. problemof validity in qualitativemethodsbeing
'Lnalysis proceduresfor qualitative data" (p. 230). Guba and
an insufficiencyof
Lincoln (1989) argue that tiustworthinessemergesfrom communicationwith
other colleaguesarid stakeholders. The problemwith all thesepositionsis the er-
ror of thinklng that validity is a property of methods.Any procedurefor generat-
ing knowledg! can g.n.r"i. invalid-knowledge,so in the end it is the knowledge
(1992) says, "The validity of an ac-
claim itself that muJt be judged.As Maxwell
count is inherent,not in the proceduresusedto produceand validateit, but in its
relationshipto thosethings it is intendedto be an accountof" (p' 281)'
A fourth differencesuggeststhat traditional approaches to validity must be
"historically arosein the
reformulatedfor qualitativemethodsbecausevalidiry
context of experimentalresearch"(Eisenhart6CHowe, 1992, p' 64\' Othersre-
ject validity for similar reasonsexceptthat they saythat validity arosein test the-
o.y 1..g.,*lol.orr, 19gO).Both are incorrect,for validiry concernsprobably first
"ror. Jrt.*"ti.ally in philosophyprecedingtesttheory and experimentalscience
by hundredsor thour"ndr of years.Validity is pertinentto any discussionof the
warrant for believingknowledgeand is not specificto particular.methods.
A fifth differenie .on..rrri the claim that there is no ontological reality at
all, so thereis no truth to correspondto it. The problemswith this perspective
"r. .rror1nous(Schmitt,1,995). First, evenif it were true' it would apply only to
-T

8z I r+.n cRtlcAL AssEssMENT


oF ouR AssuMploNs

correspondence theories of truth; coherence and pragmatist theories would be


unaffected. Second, the claim contradicts our experience. As Kirk and Miller
( 1 9 8 6 1 p u ti t :

Thereis a world of empiricalreality out there.The way we perceiveand understand


that world is largelyup to us, but the world doesnot tolerateall understandingsof it
equally(sothat the individualwho believes he or shecanhalt a speedingtrain with his
or her bare handsmay be punishedby the world for actingon that understanding).
( p .1 1 )

Third, the claim ignores evidenceabout the problems with people'sconstructions.


Maxwell notes that "one of the fundamental insights of the social sciencesis that
people's constructions are often systematic distortions of their actual situation"
(p. 506). FinallS the claim is self-contradictory becauseit implies that the claim
itself cannot be rrue.
A sixth difference is the claim that it makes no senseto speak of truth because
there are many different realities, with multiple truths to match each (Filstead,
1.979;Guba 6c Lincoln, L982; Lincoln 6c Guba, 1985). Lincoln (L990), for ex-
ample, says that "a realist philosophical stance requires, indeed demands, a sin-
gular reality and thereforea singulartruth" (p. 502), which shejuxtaposesagainst
her own assumption of multiple realities with multiple truths. Whatever the mer-
its of the underlying ontological arguments, this is not an argument against valid-
ity. Ontological realism (a commitment that "something" does exist) does not re-
quire a singular reality but merely a commitment that there be at least one reality.
To take just one example, physicists have speculated that there may be circum-
stancesunder which multiple physical realities could exist in parallel, as in the case
of Schrodinger'scat (Davies,1984; Davies & Brown, 1986). Such circumstances
would in no way constitute an objection to pursuing valid characterizationsof
those multiple realities. Nor for that matter would the existenceof multiple real-
ities require multiple truths; physicists use the same principles to account for the
multiple realities that might be experiencedby Schrodinger'scat. Epistemological
realism (a commitment that our knowledge reflects ontological reality) does not
require only one true account of that world(s), but only that there not be two con-
tradictory accounts that are both true of the same ontological referent.3 How
many realities there might be, and how many truths it takes to account for them,
should not be decided by fiat.
A seventh difference objects to the belief in a monolithic or absolute Truth
(with capital T). rUfolcott (1990) says, "'What I seek is something else, a quality
that points more to identifying critical elements and wringing plausible interpre-
tations from them, something one can pursue without becoming obsessedwith

3. The fact that different people might have different beliefs about the same referent is sometimes cited as violating
this maxim, but it need not do so. For example, if the knowledge claim being validated is "John views the program
as effective but Mary views it as ineffective," the claim can be true even though the views of John and Mary are
contradictory.

ii

j
VALIDITY I 483

finding the right or ultimate answer'the correctversion,the Truth" (p' 146)' He


describes"the critical point of departurebetween quantities-orientedand quali-
'know'with
ties-orientedresearch[as beingthat] we cannot the former'ssatisfy-
ing levelsof certainty" (p. 1,47).Mishler (t990) objectsthat traditional ap-
"as universal,abstractguarantorsof truth"
prl".h., to validationare portrayed
"the realistpositiondemandsabsolutetruth"
ip. +ZOl.Lincoln(1990)thinksthat
or absolutetruth
tp. SOZI.However,it is misguidedto attributebeliefsin certainty
tf appioachesto validity srrchas that in this book.'We hope we havemadeclear
by now that thereare no guarantorsof valid inferences.Indeed,the more experi-
encethat mostexperimenters gain,the morethey appreciatethe ambiguityof their
"An experimentis somethingeverybodybelieves
results.Albert Einsteinoncesaid,
exceptthe personwho madeit" (Holton, 1986, p. 13).Like \(olcott, most ex-
periri-renter, ,..k only to wring plausibleinterpretationsfrom their work, believ-
irrg thut "prudencesat poisedbetweenskepticismand credulity" (Shapin,1994,
p."xxix). rilfletteednor, shouldnot, and frequentlycannot decidethat one account
i, ,broirrt.ly true ani the other completelyfalse.To the contrary' tolerancefor
multiple knowledgeconstructionsis a virtual necessity(Lakatos, 1'978)because
evidenceis frequeirtlyinadequateto distinguishbetweentwo well-supportedac-
counrs(islight " p"tti.l. or wave?),and sometimes accountsthat appearto be un-
supported
"An by euiJence for manyyearsturn out to betrue (do germscauseulcers?)-
eighih differenceclaims that traditional understandingsof validity have
"forcesis-
moral shoitcomings.The argumentsherearemany,for example,that it
sues of politics, ial,res (social and scientific), and ethics to be submerged"
"social science'experts' .
(Lincoln, 1,990,p. 503) and implicitly empowers
whoseclasspreoccupations (primarily'$7hite, male,and middle-class) ensuresta-
tus for somevoiceswhile marginalizittg . . . thoseof women, personsof color, or
"1.990,p. may
minoritygroupmembers"(Lincoln, 502).Althoughthesearguments
b. ou"..tlted, they contain important cautions.Recallthe examplein Chapter3
that ,,Eventhe rats werewhite males"
'White
in healthresearch.No doubt this biaswas
partly due to the dominanceof
'..r.ur.h. malesin the designand executionof health
None of the methodsdiscussedin this book are intendedto redressthis
problem or are capableof it. The purposeof experimentaldesignis to elucidate
ca.rsalinferences-or. than morallnferences.'Whatis lessclearis that this prob-
lem requiresabandoningnotions of validity or truth. The claim that traditional
,pprou.h.s to truth forcibly submergepolitical and ethicalissuesis simplywrong.
Tb-the extent that morality is reflectedin the questionsasked,the assumptions
made,and the outcomesexamined,experimentefscan go a long way by ensuring
a broad representation of stakeholdervoicesin study design.Further,moral social
sciencereiuires commitment to truth. Moral righteousnesswithout truthful
analysisis ihe stuff of totalitarianism.Moral diversityhelpspreventtotalitarian-
ism, but without the discipline provided by truth-seeking,diversity offers no
-."16 to identify thoseoptionsthat are good for the human condition,which is,
after all,the essenceof morality.In order to havea moral socialscience,we must
haveboih the capacityto elucidatepersonalconstructionsand the capacityto see
484 14.A CR|T|CAL
ASSESSMENT
OF OURASSUMPTTONS
|

how thoseconstructionsreflectand distort reality (Maxwell, 19921.'Weembrace


the moral aspirationsof scholarssuchas Lincoln, but giving voiceto thoseaspi-
rations simply doesnot requireus to abandonsuchnotions as validity and truth.

Q UASI.EXPERIM ENTATION
Criteriafor RulingOut Threats:
The Centralityof FuzzyPlausibility
In a randomized experiment in which all groups are treated in the sameway excepr
for treatment assignment,very few assumptionsneed to be made about ro,rr.", of
bias. And those that are made are clear and can be easily tested,particularly as con-
cerns the fidelity of the original assignment process and its subsequentmainte-
nance. Not surprisinglS statisticiansprefer methods in which the assumptionsare
few, transparent, and testable. Quasi-experiments, however, rely heavily on re-
searcheriudgments about assumptions, especiallyon the fuzzy but indispensable
concept of plausibility. Judgments about plausibility are neededfor deciding which
of the many threats to validity are relevant in a given study for deciding whether
a particular designelement is capable of ruling out a given threat, for estimating by
how much the bias might have been reduced, and for assessingwhether multiple
threats that might have been only partially adjusted for might add up to a total bias
greater than the effect size the researcher is inclined to claim. Vith quasi-
experiments, the relevant assumptions are numerous, their plausibility is less evi-
dent, and their single and joint effectsare lesseasily modeled. We acknowledgethe
fuzzy way in which particular internal validity threats are often ruled out, and it is
becauseof this that we too prefer randomized experiments (and regressiondiscon-
tinuity designs)over most of their quasi-experimentalalternatives.
But quasi-experiments vary among themselveswith respect to the number,
transparencg and testability of assumptions. Indeed, we deliberately ordered the
chapters on quasi-experiments to reflect the increase in inferential power that
comes from moving from designs without a pretest or without a comparison
group to those with both, to those based on an interrupted time series,and from
there to regression discontinuity and random assignment.Within most of these
chapters we also illustrated how inferencescan be improved by adding design el-
ements-more pretest observation points, better stable matching, replication and
systematic removal of the treatment, multiple control groups, and nonequivalent
dependentvariables. In a sense,the plan of the four chapters on quasi-experiments
reflects two purposes. One is to show how the number, transparency and testa-
bility of assumptions varies by type of quasi-experimental design so that, in the
best of quasi-experiments,internal validity is not much worse than with the ran-
domized experiment. The other is to get students of quasi-experimentsto be more
sparing with the use of this overly general label, for it threatens to tar all quasi-

t
QUASI-EXPERIMENTATION+SS
|

to the
experimentswith the samenegativebrush. As scholarswho have contributed
institution alization of the t i^ quoti-experiment, we feel a lot of ambivalence
the random-
about our role. Scholarsneed to itrint critically about alternatives to
la-
ized experiment, and from this need arisesthe need for the quasi-experimental
under the
bel. But all instancesof quasi-experimentaldesignshould not be brought
the best studies do
sameunduly broad quasi-experimentalumbrella if attributes of
not closely match the weaker attributes of the field writ large.
use of
Statisticians seek to make their assumptions transparent through the
have resisted this strat-
formal models laid out as formulae. For the most part, we
very con-
egy becauseit backfires with so many readers,alienating them from the
words in-
.!pt.r"t issuesthe formul ae aredesignedto make evident.'We have used
cognoscenti'
stead.There is a cost to this, and not jupt in the distaste of statistical
The
particularly those whose own research has emphasized statistical models-
to formally
main cost is that our narrative approach makes it more difficult
the alternative
demonstrate how much fewer and more evident and more testable
quasi-
interpretations became as we moved from the weaker to the stronger
acrossthe
.*p.ri-.rrts, both within the relevant quasi-experimental chapters and
'We
set of them. regret this, but do not apologize for the accessibility we tried to
Fortu-
create by minimirirrg the use of Greek symbols and Roman subscripts.
to develop
nately, this deficit is not absolute, as both we and others have worked
in partic-
meth;ds that can be used to measurethe size of particular threats' both
1998; Shadish, 2000) and
ular studies(e.g.,Gastwirth et al., L994;Shadishet al.,
Posavac,6c
in sets of studiis (e.g.,Kazdin 6c Bass, 1989; Miller, Turner, Tindale,
& Putnam,t982\.
& Rubin,1,978;Willson
Dugoni,1,991;Ror."nitt.t our
Further,
statistical
narrative approach has a significant advantage over a more narrowly
threats
emphasisii allows us to addressa broad er array of qualitatively different
that there-
to validitS threats for which no statistical measure is yet available and
quantification.
fore mighi otherwise be overlooked with too strict an emphasison
at all
Better to h"u. imprecise attention to plausibility than to have no attention
paid to many imptrtant threats just becausethey cannot be well measured'

PatternMatchingas a ProblematicCriterion
This book is more explicitthan its predecessors about the desirabilityof imbuing
a causalhypothesiswith multiple tistable implicationsin the data, providedthat
we
they servett reducethe viability of alternativecausalexplanations.In a sense'
havesoughtto substitutea pattern-matchingme{rod-ology'We for the u-sualassessment
of wheth-era few means,oft.n only fwo, reliably differ. do this not because
num-
.o-pl.*ity itself is a desideratumin science.To the contrary,simpliciry in the
be, of questionsaskedand methodsusedis highly prizedin science. The simplicity
well.
of ,arrjomized experimentsfor descriptivecausalinferenceillustratesthis
However,the samesimple circumstancedoes not hold with quasi-experiments.
With them. we haveassirtedthat causalinferenceis improvedthe more specific,
488 | ro.o cRtlcALAssEssMENT
oF ouRAssuMploNs

generatingtheselists.The main concernwas to havea consensus of educationre-


searchersendorsingeachpractice;and he guessedthat the number of thesebest
practicesthat dependedon randomizedexperimentswould be zero. Severalna-
tionally known educationalresearcherswere present,agreedthat such assign-
ment probably playedno role in generatingthe list, and felt no distressat this. So
long as the belief is widespreadthat quasi-experiments constitutethe summit of
what is neededto support causalconclusions,the support for experimentation
that is currently found in health, agriculture,or health in schoolsis unlikely to
occur.Yet randomizationis possiblein.manyeducationalcontextswithin schools
if the will existsto carry it out (Cook et al., 1999;Cook et al., in press).An un-
fortunate and inadvertentside effect of seriousdiscussionof quasi-experiments
may sometimesbe the practicalneglectof randomizedexperiments.That is a pity.

RANDOMIZED
EXPERIMENTS
This sectionlistsobjectionsthat havebeenraisedto doingrandomizedexperiments,
and our analysisof the more and lesslegitimateissuesthat theseobiectionsraise.

Experiments
CannotBe Successfully
lmplemented
Even a little exposure to large-scalesocial experimentation shows that treatments
are often improperly or incompletely implemented and that differential attrition
often occurs. Organizational obstaclesto experiments are many. They include the
reality that different actors vary in the priority they attribute to random assign-
ment, that some interventions seem disruptive at all levels of the organization,
and that those at the point of service delivery often find the treatment require-
ments a nuisance addition to their aheady overburdened daily routine. Then
there are sometimes treatment crossovers,as units in the control condition adopt
or adapt components from the treatment or as those in a treatment group are ex-
posed to some but not all of these same components. These criticisms suggestthat
the correct comparison is not between the randomized experiment and better
quasi-experiments when each is implemented perfectly but rather between the
randomized experiment as it is often imperfectly implemented and better quasi-
experiments. Indeed, implementation can sometimes be better in the quasi-
experiment if the decision not to randomize is based on fears of treatment degra-
dation. This argument cannot be addressedwell becauseit dependson specifying
the nature and degree of degradation and the kind of quasi-experimental alter-
native. But taken to its extreme it suggeststhat randomized experiments have no
special warrant in field settings becausethere is no evidencethat they are stronger
than other designs in practice (only in theory).
But the situation is probably not so bleak. Methods for preventing and cop-
ing with treatment degradation are improving rapidly (seeChapter 10, this vol-
I AAS
EXPERIMENTS
RANDOMIZED

random assign-
umel Boru ch,1997;Gueron,1,999;Orr, L999).More important,
with the
-.n, may still createa superiorcounterfactualto its alternativeseven
(1'9961foundthat,
flaws mentionedherein.FLr e*ample,Shadishand Ragsdale
without attrition, randomized experi-
.o-p"..d with randomized."p..i-.tts
nonrandom-
mentswith attrition still yieldedbetter effectsizeestimatesthan did
ran-
ized experiments.Sometimes,of course,an alternativeto severelydegraded
a control'
domizaiion will be best,such as a strong interruptedtime serieswith
poor rule to fol-
But routine rejectionof degradedrandomizedexperimentsis a
to
l,o*; it takescarefulstudy and judgmentto decide.Further,many alternatives
flaws that
experimentationare themselu.i ,ob;..t to treatmentimplementation
thieatenthe validity'weof inferencesfrom them. Attrition and treatmentcrossovers
also occur in them. also suspectthat implementationflaws are salientin ex-
hav6beenaround so long and experimenters
f.ri-errt"tion becauseexperiments the quality
"r. .o critical of eachothlr's work. By contrast,criteria for assessing
(e'g',Datta,
of implementationand resultsfrom othermethodsarefar more recent
lesssubjected
D97j,and they may thereforebe lesswell developedconceptuallS
to peercriticism,and lessimprovedby the lessonsof experience.

ExperimentationNeedsStrongTheoryand Standardized
TreatmentlmPlementation
rs
Many critics claim that experimentationis more fruitful when an intervention
detailsis
basedon strongsubstantivetheory when implementationof treatment
when im-
faithful to that theor5 when the rlsearchsettingis well managed,and
these
plementationdoes,roi uury much betweenunits' In many field experiments'
organiza'
conditions are not met. For example,schools arclarge, complex, social
iio"r *ith multiple programs,disputatiouspolitics, and conflicting stakeholder
well as
goals.Many progr"*, a"reimplementedvariablyacrossschooldistricts,as
of standard
f.ror, ..hoth, .Lrrroo-r, arri ,t.rdents.Therecan be no presumPli9n
1'977)'
implementationor fidelity to programtheory (Berman& Mclaughlin,
well-
But thesecriticismsur., i' fa-ct,misplaced.Experimentsdo not require
implementa-
specifiedprogram theories,good program management,standard
a contri-
tion, or treatmentsthat are tJtally ?aithful to theory' Experimentsmake.
makesa
bution when they simplyprobewhetheran intervention-as-implemented
preceding
marginal improvem.tttt.yord other backgroundvariability. Still, the
suggests
fa.tJ* can ieducestatisticalpower and so cloud causalinference.This
experiments should:
that in settingsin which *or. of these conditions hold,
(L) uselargesamplesto detecteffects;(2) take painsto reducethe influenceof ex-
ma-
traneousvariation either by designor through measurementand statistical
worth study-
nipulation; and (3) studyimplementationquality both as a variable
implement
i"g * its own right in oid.r to ascertainwhich settingsand providers
treat-
thl interventionbetterand asa mediatorto seehow implementationcarries
ment effectsto outcome.
490 | r+.a cRtTtcAL
ASSESSMENT
OFOURA5SUMPTIONS

Indeed,for many purposesthe lack of standardizationmayaid in understanding


how effectivean interventionwill be undernormal conditionsof implementation.In
the social world, few treatmentsare introduced in a standardand theory-faithful
way. Local adaptationsand partial implementationare the norm. If this is the case,
then someexperimentsshould reflect this variation and ask whetherthe treatment
cancontinueto be effectivedespiteall the variation within groupsthat we would ex-
pectto find if the treatmentwerepolicy.Programdeveloperiand socialtheoristsmay
want standardizationat high levelsof implementation,but policy analysrsshouldnot
welcomethis if it makesthe researchconditionsdifferenifro- the practicecondi-
tions to which they would like to generalize.Of course,it is most desiiableto be able
to answerboth setsof questions-about policy-relevanteffectsof treatmentsthat are
variably implementedand alsoabout the more theory-relevanteffectsof optimal ex-
posureto the intervention.In this regard,one might recall recenteffortsio analyze
the effectsof the original intent to treat through traditional meansbut alsoof the ef-
fectsof the actual treatmentthrough using random assignmentas an instrumental
variable(Angristet al., 1996a\.

ExperimentsEntailTradeoffsNot Worth Making


The choiceto experimentinvolvesa number of tradeoffsthat someresearchers be-
lieveare not worth making (Cronbach,7982).Experimenrationprioritizeson un-
biasedanswersto descriptivecausalquestions.But, givenfinite r.rour..r, somere-
searchers preferto investwhat they havenot into marginalimprovementsin internal
validity but into promoting higher constructand externalvalidity. They might be
content with a greaterdegreeof uncertainryabout the quality of a causalconnec-
tion in orderto purposivelysamplea greaterrangeof populationsof peopleor set-
tings or, when a particular population is central to the research,in ordeito gener-
ate a formally representativesample.They might evenusethe resourcesto improve
treatmentfidelity or to includemultiplemeasures of averyimportantoutcomecon-
struct. If a consequence of this preferencefor constructand ixternal validity is to
conducta quasi-experimentor evena nonexperimentrather than a randomizedex-
periment, then so be it. Similar preferencesmake other critics look askancewhen
advocatesof experimentationcounselrestrictinga study to volunteersin order to
increasethe chancesof beingable to implementand maintainrandomassignment
or when thesesameadvocatesadviseclosemonitoring of the treatmentto ensureits
fideliry therebycreatinga situation of greaterobtruiivenessrhan would pertain if
the sametreatmentwerepart of someongoingsocialpolicy (e.g.,Heckman,1992).
In the languageof Campbelland Stanley(1,963;., theclaim was that ."p.ri*.rrt"-
tion traded off externalvalidity in favor of internal validiry. In the parlanceof this
book and of Cook and Campbell(1979),it is that experimentatiortrades off both
externaland constructvalidity for internal validiry to its detriment.
Critics also claim that experimentsoveremphasize conservativestandardsof
scientificrigor. Theseinclude (1) usinga conservativecriterion to protect against
| *tt
EXPERIMENTS
RANDOMIZED

to de-
wrongly concludinga treatmentis effective (p <.05) at the risk of failing
that include
tect true treatment;ffects;(2) recommendingintent-to-treatanalyses
treatment; (3) deni-
as part of the treatmentthoseunits that have neverreceived
gr"ting inferencesthat result from exploring unplanned treatment interactions
with characteristics of units, observations,settings,or times;and (4) rigidly pur-
emerge
suing a priori experimentalquestionswhen other interestingquestions
about
duriig " ,t,rdy. Mort laypersonsuse a more liberal risk calculusto decide
poten-
.u,rrul inferencesin their own lives,as when they considertaking up some
ii"ity lifesavingtherapy.Should not sciencedo the same' be lessconservative?
Snoula it notlt least-sometimes make different tradeoffs betweenprotection
againstincorrectinferencesand the failure to detecttrue effects?
critics further obiectthat experimeptsprioritize descriptiveover explanatory
whether
causation.The criticsin qrrestionwould toleratemore uncertaintyabout
processes
the interventionworks in order to learn more about any explanatory
that havethe potentialto generalize acrossunits, settings'observations,and times'
qualita-
Further,,o-. critics pr.f!, to pursuethis explanatory knowledgeusing
than
tive meihodssimilar io thor. of th. historian,journalist, and ethnographer
more opaque
by meansof, sa5 structuralequation modeling that seemsmuch
than the narrativereportsof theseother fields'
critics alsodislikethe priority that experimentsgiveto providing policymak-
real-time
ers with ofren belated"rrri.r, about what works insteadof providing
are rarely interested in
help to serviceprovidersin local settings.Theseproviders
" torrg-a.tayed r,rrnmaryofwhat, ptogt"- has.achieved. They often preferre-
elements
ceiving.o.riin,ro.rsfeedbackabouttheir work and especiallyabout those
A recent letter to
oiprJ.ri.. that they can changewithout undue complication'
theNew York Timescapturedthis preference:
to approach issues
Alan Krueger . . claims to eschew value iudgments and wants
his insistenceon postponing changesin ed-
(about educationalreform) empirically. Yet
is itself a value judgment
ucation policy until studiesby iesearchersapproach certainry
in parts of public edu-
in favor of the status quo. In view of the tragic state of affairs
(Petersen, 1999)
cation, his judgment is a most questionableone.
ques-
we agreewith many of thesecriticisms.Among all possible_research
questionsconstituteonly a subset.And of all possiblecausal meth-
tions,cau-sal
ods,experimentation is not relevantio all typesof questionsand all typesof cir-
in
cumstance.One needonly read the list of options and contingenciesoutlined
Ch"p,.r, 9 and L0 to appreciatehow foolhardy it is to advocate experimenta-
"gold standard"that will invariablyresultin
tion on a routine basisas a causal
trade-
clearly interpretableeffect sizes.However,many of the criticisms about
even over-
offs are basedon artificial dichotomies,correctableproblems,-and
im-
simplifications.Experimentscan and should examinereasonsfor variable
They
pl.-.nt"tion, and they should searchto uncover mediating processes'
for the '05
neednot use stringentalpha rates;only statisticaltradition argues
that
level.Nor needonJ restrict dataanalysesonly to the intent-to-treat'though
'aloJ
lueururo.rdeJoru qf,ntu
e sdeld drrprlerrlpuJetur ql1qlv\ ur surerSord rpuar
;o lurluerelm puorq dpursrrd
-Jns eql ol pue qf,ntu oor drlPllu^
Ieurelur aztseqduraapreqr qf,Jeesar;o sururSord
ur pa8raureaABr{stsaSSnsdrolsrq lpql sassau>lea^\
IertuaraJuragr ol uouuane Bur
-llEf, orp arrrtaqrey .(rq8rlrodsaql uI erurl slr a^eq lsntu ad& drlPler d-rela)
/lpll
-EA Jo pnJlsuof, JaAo drrprlerr
lEuJalxa IeuJalur 1o drerurrd eurlnoJ due -ro; 8ur1er
lou eJEeM'T lardu{J uI Jealr oPELua.&\se 'esJnoJIO 'parseSSnsarreqsrrlr.rr tsed
req.a'\sPeaJxadlrear8 sluaut-radxaeldrllnur Ja o senssrdrrpllel leuJalxa pug lrnrls
-uol qloq sserppeol dlneder aql.slsdleue-Eleruur dlrrap
lsoru aeso^4,se 1ng ,sans
-sI asJl{t qroq Sulssarpp" ur r.lf,EarperFrll e^Er{ slueurradxa .paluerg
'larrr dlfsapou sanssr,{rrprlerr lpnprlrpur
Ipuralxo puB lJnrlsuoJ r{foq sserppE ot r{f,rBrsar
letuaurr.ladxeIo srueJSord;o dlpeder agr qrr^\ pessarduneJEaA\ ,1ser1uocdg
'8ur1uru r{uo^\ tanau aJu
lpql stJoapqJfarrnbar sluaurrradxeter{l tsa88nsol luet
-.rodur ool sr s{Jo.&\rpqra 1no Surpurg .sanqod
IErJospaseq-sseua^rpa}Ja alouord
ol lue1Y\ol{1v\sJJelsrlaql Pue srolelsr8al asoql ro; d1-rrlnrrued 'cnerualqord eJoru
uala dlqeqo-rdaru sJel\supJeell tnoqtr^ saurl aturl Buol-opelep qrns .re8uep
IEar
P sI uollEluaurtradxa arntreruardq8noqlly 'sploq uorlenlrs atues aql lsorule pue
'uerSor4
ruaurdolartaq IooqJS aqr uuSaq rauoJ sauef arurs sread 0t sl lI .sra/\,s
-uB ou 'sloogrs peleJelalf,eue8aq
PUPsluolutradxa ou aleq all\ Pue urle-J drua11
eruts sread SI sl fI 'sfteJJe JIeI{l rnoqe sJa^.r{sup Jeolf,ou e^Erl llrls o.&\puu .pasod
-ord arain sJarlf,no^ .splp .uoryo oor
Iooqtrs erurs srcad 0t A ou sr lI IIE ,.raddeq
SFII 'elgBpuedapun sI lEql uollf,auuoJ lusnpr E tnogp suorsnlf,uoo
leraua8 puorg
Sutmerp >lsIJol sI uolluelJetul uE Jo srlaJJear{l uo sarpnts
leluauuadxa Suorls o4
e^Pq ol 'spuno;8 IEIIuePI^ero lerrSoyuo elqrsneldurrdl-realf,arg drrprlerr
leuralur
ol slBarql ssaFn 'saf,uareJur da4;o dtr.rSalureql Sursrulordtuot lnoqlrd\ passoJl eq
louupf, spunoq aruos 'lurod srql ot rrlaqledtuds d11erauafi are am qfinoqrly .ftget
'qrequo.r3 :og5t ''1" ''3'a)
Ir r{luquorJ spoqlau lutuaurradxo ra8uorls aqr Bur
-zrsuqdruasrue.rSo-rd tuory ueql serpnls leluaunradxeuou pue Iuluaur-radxa_isenb
;o dlerrlua uela ro dlrsoru lslsuof, rer{l qf,reasarJo sure-r8ordtuory peuJuel aq IIri\,\
uoupluroJul aroru
InJasn r'rrrpourr
Et'*:u"';rrx;;H:;r:::ilil?
sluerue^ordur leur8rulu JeuIJ-JaAa 'lsa88ns stxel eruos su
ilHt", ",
to 1uo3aqr pur plSrr se
eq tou paau sluaur.radxg '(salqerrul Surlelpau Jo sarnseau Burppe,.8.a)tuaqr;o
^{et salulleluoslnq 'saJJnosalartnber sarnparo-rdasaql
ilV'{ooq srql ur peurllno
spoqrau eqt Sursn pelpreua8aq plnoqs uortuzrlereua8
lesneolnoqp alqrssodse
uolletuJotul qlntu sB puv 'sasseoordSurlelpau pue sauof,lno pepuelurun Burre
-^oJsrp tE parurp uorllellor etvp a^nelrlenb aq plnor{s pue upJ aleql .saruof,lno
pue 'stuerulearl 's8urpas (suos.rad;o sluerussasse dfrpryel lf,nJlsuof, ar{r puu
;o
salduresJo sseualrleluasardar er.lrJo sasdyeue
lulueurr.radxeuou eg osle plnoqs
'paqsqqnd aq uE3 sluaurrradxs
Pue uBr erarll tuoJJ sllnsoJ urrelul .dlsnorl
-nBf, suolsnlf,uol rraqf 3urqf,nof,
PuE seleJ JoJJa ale8rgo.ld lsure8e Surpren8
hlo11u remod lptrrtsrlels pue droagl elrtuetsqns leql luatxe eqr ol suorlrrJal
-uI 'srsdleur auo oq dlorruryap
IEf,Itsllels aroldxa osle uet sJatuJurrradxg pFor{s

sNoll_dwnssv tv)tl|u) v .tt I zov


uno lo l_Nty\sslssv
I
| 493
EXPERIMENTS
RANDOMIZED
I

Assumean InvalidModel
Experiments
Utilization
of Research
model of decision
To somecritics, experimentsrecreatea naive rational choice
among (the treat-
making. That is, one first lays out the alternativesto choose
one collectsin-
*.rr,rt] then one decideson criteria of merit (the outcomes);then
and finally
formation on eachcriterion for eachtreatment(the data collection),
empirical
one makes a decisionabout the superior alternative.UnfortunatelS
so simpleas the ra-
work on the useof socialsciencedaia showsthat useis not
tional choicemodelsuggests (c. \ufeiss6c Bucuvalas,1980; c''weiss, 1988)'
contexts'ex-
First, evenwhen."-rir. and effectquestionsare askedin decision
ex-
p.ri-.nt"l resultsare still usedalong with other forms of information-from
consensusof a
isting theories,personaltestimony,extrapolationsfrom surveys'
haverecentlybe-
fieldlchims from expertswith intereststo defend,and ideasthat
politics' person-
.o*. trendy.Decisionsare shapedpartly by ideology,interests,
as much made by a
ality, windows of-opportunity, and ualues;and they are
individualor com-
policy-shapirrg.o-*nrrity (cronbachet al., 1980) as by an
overtime asear-
*i,,... Fuither,manydecisionsarenot so much madeasaccreted
maker with few op-
lier decision,.orrrir"in later ones,leavingthe final decision
are available,new
tions ('Weiss,1980). Indeed,by the time ixperimental results
decisionmakersand issuesmay havereplacedold ones.
verdicts
Second,.*p.rirn.nts often yield contestedrather than unanimous
Disputes arise about
that therefore have uncertain implications for decisions.
resultsare valid'
whether the causalquestionswere correctly framed, whether
whetherrelevantoutcomeswere assessed, and whetherthe resultsentail a specific
voucher
decision.For example,reexaminationsof the Milwaukee educational (H'
occurred
;"rdy offereddifferentconclusionsabout whetherand whereeffects
6cDu, 1.999;'Sritte, 1'998,"1'999,2000)' SimilarlS
Fuller,2000;Greene,Peterson,
differenteffect,ir., *.r. generatedfrom the Tennessee classsizeexperiment(Finn
EcAchilles,1.990;Hanusi'ek,1999;Mosteller, Light, 6c Sachs,1996)'Sometimes,
scholarlydisagreements are at issue,but at other timesthe disputesreflectdeeply
conflictedstakeholderinterests.
likely when
Third, short-terminstrumentaluseof experimentaldata is more
it is easierto
the interventionis a minor variant on existingpractice.For example,
criteriafor
changetextbooksin a classroomor pills givenlo patientsor eligibility
or to open
;;;;"* entry than it is to relocatehospitalsto.underservedlocations
state' Becausethe
day-carecentersfor welfare recipientsthroughout an entire
to dramatically
more feasible.tt""g.t are so ,ood.r, in scope,they are lesslikely
on shor-t-termin-
affecttheproble- ih.y address.So critics note that prioritizing
is unlikelyto solve
strumentalchangetendsto preservemost of the statusquo and
that truly twist
tr.rr.hunt social"probl.-s. bf course'thereare someexperiments
from denselypoor
the lion,stail andinvolvebold initiatives.Thus moving families
deviations
inner-citylocationsto the suburbsinvolveda changeof three standard
494 14.A CRIT|CAL
ASSESSMENT
OFOURASSUMPTTONS
|

in the poverty level of the sending and receiving communities, much greater than
what happens when poor families spontaneously move.'S7hethersuch a dramatic
change could ever be used as a model for cleaning out the inner cities of those who
want to move is a moot issue. Many would judge such a policy to be unlikely.
Truly bold experiments have many important rationales; but creating new policies
that look like the treatment soon after the experiment is not one of them.
Fourth, the most frequent use of research may be conceptual rather than in-
strumental, changing how users think about basic assumptions,how they under-
stand contexts, and how they organize'or label ideas. Some conceptual uses are
intentional, as when a person deliberately reads a book on a current problem; for
example, Murray's (1984) book on social policy had such a conceptual impact in
the 1980s, creating a new social policy agenda. But other conceptual usesoccur
in passing, as when a person reads a newspaper story referring to social research.
Such usescan have great long-run impact as new ways of thinking move through
the system, but they rarely change particular short-term decisions.
These arguments against a naive rational decision-making model of experi-
mental usefulnessare compelling. That model is rightly rejected. However, mosr
of the objections are true not just of experiments but of all social sciencemethods.
Consider controversies over the accuracy of the U.S. Census,the entirely descrip-
tive results of which enter into a decision-making process about the apportion-
ment of resourcesthat is complex and highly politically charged. No method of-
fers a direct road to short-term instrumental use. Moreover, the obiections are
exaggerated.In settings such as the U.S. Congress,decision making is sometimes
influenced instrumentally by social scienceinformation (Chelimsky, 1998), and
experiments frequently contribute to that use as part of a researchreview on ef-
fectivenessquestions. Similarlg policy initiatives get recycled, as happened with
school vouchers, so that social science data that were not used in past years are
used later when they become instrumentally relevant to a current issue (Polsby,
1'984; Quirk, 1986).In addition, data about effectivenessinfluence many stake-
holders' thinking even when they do not use the information quickly or instru-
mentally. Indeed, researchsuggeststhat high-quality experiments can confer exrra
'Weiss
credibility among policymakers and decision makers (C. & Bucuvalas,
1980)' as happened with the Tennesseeclasssize study. We should also not forget
that the conceptual use of experiments occurs when the texts used to train pro-
fessionalsin a given field contain results of past studies about successfulpractice
(Leviton 6c Cook, 1983). And using social sciencedata to produce incremental
change is not always trivial. Small changescan yield benefits of hundreds of mil-
lions of dollars (Fienberg,Singer,& Tanur, 1985). SociologistCarol'Weiss,an ad-
vocate of doing research for enlightenment's sake, says that 3 decadesof experi-
ence and her studies of the use of social sciencedata leave her "impressed with the
utility of evaluation findings in stimulating incremental increasesin knowledge
and in program effectiveness.Over time, cumulative incrementsare not such small
potatoes after all" ('Weiss,1998, p. 31,9).Finallg the usefulnessof experimentscan
be increased by the actions outlined earlier in this chapter that involve comple-
ott
EXPERIMENTS
RANDOMIZED I

mentingbasicexperimentaldesignwith adjunctssuchas measuresof implemen-


pro-
tation a-ndmediationo, qualitativemethods-anything that will help clarify
gram processand implementationproblems.In summarSinvalid modelsof the
ir.foln.rs of experimintalresultsseemto us to be no more nor lesscommonthan
'we
have learned
invalid modelslf th. use of any other social sciencemethods.
their
much in the last severaldecadesabout use, and experimenterswho want
of thoselessons(Shadishet al., 1'99I).
work to be usefulcan take advantages

TheConditionsof ExperimentationDifferfrom the


Conditionsof Policylmplementation
if were
Experimentsare often doneon a smalleiscalethan would pertain services
rele-
i-il.-r.rted state-or nationwide,and so they cannot mimic all the details
interven-
u"rr, ,o full policy implementation.Hencepolicy implementationof an
For ex-
,i"" -ry yi.ta aiff.rint o,rt.omesthan the experiment(Elmore, 1996)'
class size,
ample, t"r.d partly on researchabout the benefits of reducing
Tennessee and Caliiornia implementedstatewidepoliciesto have more classes
class-
with fewer studentsin each.This required many new teachersand new
of those new teach-
rooms.However,becauseof a nationalteachershortage,some
of
ers may havebeenlessqualifiedthan thosein the experiment;and a shortage
that may have
classroomsled to more .rs. of trailers and dilapidatedbuildings
harmedeffectiveness further.
enthu-
Sometimesan experimentaltreatmentis an innovation that generates
experi-
siasticeffortsto implementit well. This is particularly frequentwhen the
that
ment is done by a charismaticinnovator whosetacit knowledgemay exceed
pr^ctrce
of thosewho would be expectedto implementthe program in ordinary
These factors may
and whosecharismamay inducehigh-qualityimplementation.
is im-
generatemore srr...srfoi outcomesthan will be seenwhen the intervention
plementedasroutine PolicY.
Policy implementationmay also yield different-resultswhen experimental
prac-
treatmentsare implementedin a fashionthat differs from or conflictswith
ticesin real-*orld application.For example,experimentsstudying psychotherapy
and
outcomeoften standardizetreatmentwiih a manual and sometimesobserve
but
correct the therapistfor deviatingfrom the manual (shadishet al., 2000);
effec-
thesepracticesare rare in clinicallractice. If manualizedtreatmentis more
might
tive (bhambless& Hollon, 1998; Kendall, 1998), experimentalresults
transferpoorly to practicesettings.
policy
Raniom assigrrm.ntmay also changethe program from the intended
implementation(i{eckman,l992l. For ixample, thosewilling to be randomized
may
-"y diff.r from those for whom the treatment is intended; randomizatLon
with
changepeople'spsychologicalor social responseto treatment compared
those"wlroself-selecttreatment;and randomizationmay disrupt administration
clients'
and implemenrationby forcingthe programto copewith a differentmix of
I

496 14.A CR|T|CAL


ASSESSMENT
OF OURASSUMPTIONS
|

Heckman claims this kind of problem with the Job taining PartnershipAct
"calls into question
0TPA) evaluation the validity of the experimentalestimates
as a statementabout theJTPAsystemas a whole" (Heckman,1.992, p. ZZ1,).
In many respects,we agreewith thesecriticisms,thoughit is worth noting sev-
eral responsesto them. First, theyassumealack of generalizabllityfrom experi-
ment to policy but that is an empirical question.Somedata suggesrthar general-
ization may be high despite differencesbetweenlab and field (C. Anderson,
LindsaS & Bushman, 1999) or betweenresearchand practice (Shadishet al.,
2000). Second,it can help to implement.treatment underconditionsthat aremore
characteristicof practiceif it doesnot unduly compromiseother researchpriori-
ties. A little forethoughtcan improve the surfacesimilarity of units, trearments,
observations,settings,or timesto their intendedtargets.Third, someof thesecrit-
icismsare true of any researchmethodologyconductedin a limited context,such
as locally conductedcasestudiesor quasi-experiments, becauselocal implemen-
tation issuesalwaysdiffer from large-scaleissues.Fourth, the potentiallydisrup-
tive natureof experimentallymanipulatedinterventionsis sharedby many locally
'rrr"or"h
invented novel programs, euen uhen they are not studied by any
methodologyat all.Innovation inherentlydisrupts,and substantiveliteraturesare
rife with examplesof innovationsthat encounteredpolicy implementationim-
pediments(Shadish,1984).
However,the essentialproblem remainsthat large-scalepolicy implementa-
tion is a singularevent,the effectsof which cannot be fully known exceptby do-
ing the full implementation.A singleexperiment,or evena smallseriesof ri-ilrt
ones,cannotprovidecompleteanswersabout what will happenif the intervention
is adoptedas policy. However,Heckman'scriticism needsreframing.He fails to
distinguishamongvalidity types(statisticalconclusion,internal,.onrtro.., exter-
nal). Doing so makesit clearthat his claim that suchcriticism"calls into question
the validity of the experimentalestimatesasa sratementabout the JTPA,yrt.rr, ",
a whole" (Heckman,1.992, p.221,)is reallyabout externalvalidityand construcr
validity,not statisticalconclusionor internalvalidity.Exceptin thenarrow econo-
metricstraditionthat he understandably cites(Haavelmo,7944;Marschak ,7953;
Tinbergen,1956),few socialexperimentersever claimedthat experimentscould
describethe "systemas a whole"-even Fisher(1935)acknowledged this trade-
off. Further,the econometricsolutionsthat Heckman suggestscannot avoid the
sametradeoffsbetweeninternal and externalvalidity. For example,surveysand
certain quasi-experiments can avoid someproblemsby observingexistinginter-
ventionsthat have aheadybeenwidely implemented,but the validity of tleir es-
timatesof program effectsare suspectand may themselves changeif the program
were imposedevenmore widely as policy.
Addressingthesecriticismsrequiresmultiple lines of evidence-randomized
experimentsof efficacyand effectiveness, nonrandomizedexperimentsthat ob-
serveexistinginterventions,nonexperimentalsurveysto yield estimatesof repre-
sentativeness, statisticalanalysesthat bracketeffectsunder diverseassumpd;ns,

J
II Ot
EXPERIMENTS
RANDOMIZED

qualitative observation to discover potential incompatibilities between the inter-


ventiol and its context of likely implementation, historical study of the fates of
similar interventions when they were implemented as policg policy analysesby
those with expertisein the type of intervention at issue,and the methods for causal
generalizationin this book. The conditions of policy implementation will be dif-
i.r.rr, from the conditions characteristic of any rese^rchstudy of it, so predicting
generalizationto policy will always be one of the toughest problems.

lmposingTreatments Flawed
ls Fundamentally
Compared with Encouragingthe Growthof Local
Solutionsto Problems
Experimentsimposetreatmentson recipients.Yet som,elate 20th-centurythought
,.rjg.rt, that imposedsolutionsmay be inferior to solutionsthat are locally gen-
.rJr".a by thoseiho h"n. the problem. Partly,this view is premisedon research
findings of few effectsfor the Great Societysocialprogramsof the 1960sin the
UniteJ States(Murrag 1.984;Rossi, L987),with the presumptionthat a portion
of the failurewas due to the federallyimposednatureof the programs.Partly,the
view reflectsthe successof late 2Oth-centuryfree market economicsand conser-
vative political ideologiescompared with centrally controlled economiesand
more fi|eral political beliefs.Experimentallyimposedtreatmentsare seenin some
quartersas beinginconsistentwith suchthinking'
IronicallS the first objectionis basedon resultsof experiments-if it is true
that impos.i progr"*s do not work, experimentsprovided the evidence.More-
over,thesetro-.ff..t findingsmay havebeenpartly due to methodologicalfailures
of experimentsas they were implementedat that time. Much progressin solving
practicalexperimentalproblemsoccurredafter,and partly in responseto, those
experiments. If so,it is prematureto assumetheseexperimentsdefinitivelydemon-
stiated no effect,especlalygiven our increasedability to detectsmall effectsto-
day ' (D. Greenberg 6c shroder,1,997;LipseSL992;Lipsey6c'Wilson,!993).
iistinguish betweenpolitical-economiccurrencyand the effects
We must also'We
of interventions. know of no comparisonsof, say,the effectsof locally gener-
atedversusimposedsolutions.Indeed,the methodologicalproblemsin doing such
comparisonsare daunting, especiallyaccuratelycategotizinginterventionsinto
the two categoriesand unlonfounding the categorieswith correlatedmethoddif-
ferences.Bariing an unexpectedsolutionto the seeminglyintractableproblemsof
causalinferencein nonrandomizeddesigns,answeringquestionsabout the effects
of locally generatedsolutionsmay requireexactlythe kind of high-qualityexper-
imentatioi being criticized.Though it is likely that locally generatedsolutions
may indeedhavesignificantadvantages,it also is likely that someof thosesolu-
tions will haveto be experimentallyevaluated.
I

498 | 14.A CRIT|CAL


ASSESSMENT
OF OURASSUMPTTONS

CAUSALGENERALIZATION:
AN OVERLY
COMPLICATED
THEORY?
Internal validity is best promoted via random assignment,an omnibus mechanism
that ensuresthat we do not have many assumptions to worry about when causal in-
ferenceis our goal. By contrast, quasi-experimentsrequire us to make explicit many
assumptions-the threats to internal validity-that we then have to rule out by fiat,
by design,or by measurement.The latter is a more complex and assumption-riddled
processthat is clearly inferior to random assignment.Something similar holds for
causal generalization,in which random selectionis the most parsimonious and the-
oretically justified method, requiring the fewest assumptionswhen causalgeneral-
ization is our goal. But becauserandom selectionis so rarely feasible,one instead
has to construct an acceptabletheory of generaliz tion out of purposive sampling,
'We
a much more difficult process. have tried to do this with our five principles of
generalizedcausal inference.These, we contend, are the keys to generalizedinfer-
ence that lie behind random sampling and that have to be identified, explicated,
ano assessed
and assessedifrt we are to make
make better general inferences, if they are not per-
rnterences,even rt
fect ones. But these principles are much more complex to implement than is ran-
dom sampling.
Let us briefly illustrate this with the category called American adult women.
We could represent this category by random selection from a critically appraised
register of all women who live in the United Statesand who arc at least 21 years
of age.I7ithin the limits of sampling error, we could formally generalizeany char-
acteristics we measured on this sample to the population on that register. Of
course, we cannot selectthis way becauseno such register exists.Instead,one does
onet experiment with an opportunistic sample of women. On inspection they all
'1,9
turn out to be between and 30 years of age, to be higher than average in
achievementand abilit5 and to be attending school-that is, we have useda group
of college women. Surface similarity suggeststhat each is an instance of the cate-
gory woman. But it is obvious that the modal American woman is clearly not a
college student. Such students constitute an overly homogeneoussample with re-
spect to educational abilities and achievement,socioeconomicstatus, occupation,
and all observable and unobservable correlates thereof, including health status,
current employment, and educational and occupational aspirations and expecta-
tions. To remedy this bias, we could use a more complex purposive sampling de-
sign that selectswomen heterogeneouslyon all these characteristics.But purpo-
sive sampling for heterogeneousinstances can never do this as well as random
selection can, and it is certainly more complex to conceive and execute.I7e could
go on and illustrate how the other principles faclhtate generalization. The point is
that any theory of generalization from purposive samples is bound to be more
complicated than the simplicity of random selection.
But becauserandom selection is rarely possible when testing causal relation-
ships within an experimental framework, we need these purposive alternatives.
I 499
NONEXPERIMENTALALTERNATIVES

Yet most experimental work probably still relies on the weakest of these alterna-
tives, surfaci similarity.'We seek to improve on such uncritical practice. Unfortu-
nately though, there is often restricted freedom for the more careful selection of
instancesof units, treatments, outcomes, and settings, even when the selection is
done purposively.It requires resourcesto sample irrelevanciesso that they are het-
erogeneouson many attributes, to measure several related constructs that can be
discriminated from each other conceptually and to measure a variety of possible
explanatory processes.This is partly why we expect more progress on causal gen-
eralization from a review context rather than from single studies. Thus, if one re-
searcher can work with college women, another can work with female school-
teachers, and another with female retirees, this creates an opportunity to see if
thesesourcesof irrelevant homogeneity make a difference to a causal relationship
or whether it holds over all these differ6nt types of women.
UltimatelS causal generalizationwill always be more complicated than assess-
ing the likelihood that a relationship is causal.The theory is more diffuse, more re-
cent, and lesswell testedin the crucible of researchexperience.And in some quar-
ters there is disdain for the issue,given the belief and practice that relationshipsthat
replicate once should be consideredas generaluntil proven otherwise' not to speak
oithe belief that little progressand prestigecan be achieved by designingthe next
experiment to be some minor variant on past studies. There is no point in pre-
t.nding that causal generalization is as institutionalized procedurally as other
methods in the social sciences.'Wehave tried to set the theoretical agendain a sys-
tematic way. But we do not expect to have the last word. There is still no explica-
tion of causal generalizationequivalent to the empirically produced list of threats
to internal validiry and the quasi-experimental designsthat have evolved over 40
years to rule out thesethreats. The agendais set but not complete.

RIM ENTALALTERNATIVES
NONEXPE
Though this book is about experimentalmethodsfor answeringquestionsabout
.".rr"l hypotheses,it is a mistaketo believethat only experimentalapproachesare
used for thir p,r.pose.In the following; we briefly consider severalother ap-
proaches,indiiating the major reasonswhy we havenot dwelt on them in detail.
basicallSthe reasonis that we believethat, whatevertheir merits for somere-
searchpurposes,they generatelessclearcausalconclusionsthan randomizedex-
perimentsor eventhe bestquasi-experiments or
suchas regression-discontinuity
interruptedtime series
The nonexperimentalalternativeswe examineare the major onesto emerge
in variousacademicdisciplines.In educationand parts of anthropologyand soci-
ologg one alternativeis intensivequalitativecasestudies.In thesesamefields,and
also-in developmentalpsychologythere is an emerginginterestin theory-based
500 14.A CR|T|CAL
ASSESSMENT
OFOURASSUMPTTONS
|

causal studies basedon causal modeling practices.Across the social sciencesother


than economics and statistics, the word quasi-experiment is routinely used to justify
causal inferences,even though designsso referred to are so primitive in structure that
'We
causal conclusions are often problematic. have to challenge such advoc acy of
low-grade quasi-experiments as a valid alternative to the quality of studies we have
been calling for in this book. And finally in parts of statistics and epidemiology, and
overwhelmingly in econometrics and those parts of sociology and political science
that draw from econometrics,the emphasisis more on control through statistical ma-
nipulation than on experimental design.I7hen descriptive causal inferencesare the
primary concern, all of these alternatives will usually be inferior to experiments.

IntensiveQualitativeCaseStudies
The call to generate causal conclusions from intensive case studies comes from
several sources. One is from quantitative researchersin education who became
disenchanted with the tools of their trade and subsequently came to prefer the
qualitative methods of the historian and journalist and especiallyof the ethnog-
rapher (e.g.,Guba,198l, 1,990;and more tentatively Cronbach, 1986).Another
is from those researchersoriginally trained in primary disciplines such as qualita-
tive anthropology (e.g.,Fetterman, 19841or sociology (Patton, 1980).
The enthusiasm for case study methods arises for several different reasons.
One is that qualitative methods often reduce enough uncertainty about causation
to meet stakeholderneeds.Most advocatespoint out that journalists,historians,
ethnographers, and lay persons regularly make valid causal inferences using a
qualitative processthat combines reasoning, observation, and falsificationist pro-
cedures in order to rule out threats to internal validity-even if that kind of lan-
guage is not explicitly used (e.g.,Becker,1958; Cronbach,1982). A small minor-
ity of qualitative theorists go even further to claim that casestudiescan routinely
replace experiments for nearly any causal-sounding question they can conceive
(e.g.,Lincoln & Guba, 1985). A secondreasonis the belief that suchmethodscan
also engagea broad view of causation that permits getting at the many forces in
the world and human minds that together influence behavior in much more com-
plex ways than any experiment will uncover.And the third reasonis the belief that
case studies are broader than experiments in the types of information they yield.
For example, they can inform readers about such useful and diverse matters as
how pertinent problems were formulated by stakeholders, what the substantive
theories of the intervention are, how well implemented the intervention compo-
nents were, what distal, as well as proximal, effects have come about in respon-
dents' lives, what unanticipated side effects there have been, and what processes
explain the pattern of obtained results.The claim is that intensivecasestudy meth-
ods allow probes of an A to B connection, of a broad range of factors condition-
ing this relationship, and of a range of intervention-relevant questions that is
broader than the experiment allows.

.J
| 501
NONEXPERIMENTALALTERNATIVES
I

Although we agree that qualitative evidence can reduce some uncertainfy


about cause-sometimes substantially the conditions under which this occurs
are usually rare (Campbell, 1975).In particular, qualitative methods usually pro-
duce unclear knowledge about the counterfactual of greatest importance, how
those who receivedtreatment would have changedwithout treatment. Adding de-
sign featuresto casestudies,such as comparison groups and pretreatmentobser-
vations, clearly improves causal inference. But it does so by melding case-study
data collection methods with experimental design.Although we consider this as a
valuable addition ro ways of thinking about casestudies, many advocatesof the
method would no longer recognize it as still being a case study. To our way of
thinking, casestudies are very relevant when causation is at most a minor issue;
but in most other caseswhen substantial uncertainry reduction about causation is
required, we value qualitative methods within experiments rather than as alter-
natives to them, in ways similar to those we outlined in Chapter 12.

Evaluations
Theory-Based
This approach has beenformulated relatively recently and is describedin various
books or specialjournal issues(Chen & Rossi, 1,992;Connell, Kubisch, Schorr,&
'Weiss,
1.995;Rogers,Hacsi, Petrosino,& Huebner, 2000). Its origins are in path
analysis and causal modeling traditions that are much older. Although advocates
have some differenceswith each other, basically they all contend that it is useful:
(1) to explicate the theory of a treatment by detailing the expected relationships
among inputs, mediating pfocesses,and short- and long-term outcomes; (2) to
measure all the constructs specified in the theory; and (3) to analyzethe data to
assessthe extent to which the postulated relationships actually occurred. For
shorter time periods, the available data may addressonly the first part of a pos-
tulated causal chain; but over longer periods the complete model could be in-
volved. Thus, the priority is on highly specific substantive theorS high-quality
measurement,and valid analysisof multivariate explanatory processesas they un-
fold in time (Chen & Rossi, 1'987,1,992).
Such theoretical exploration is important. It can clarify general issueswith treat-
ments of a particular type, suggestspecific researchquestions,describehow the inter-
vention functions, spell out mediating processes,locate opportunities to remedy im-
plementation failures, and provide lively anecdotesfor reporting results ('Weiss,1'998).
All th.r. serveto increasethe knowledge yield, evenwhen such theoretical analysisis
done within an experimental framework. There is nothing about the approach that
makes it an alternative to experiments. It can clearly be a very important adjunct to
such studies,and in this role we heartily endorsethe approach (Cook,2000).
However, some authors (e.g., Chen 6c Rossi, 1,987, 1992; Connell et al.,
1,995l have advocated theory-based evaluation as an attractive alternative to ex-
periments when it comes to testing causal hypotheses.It is attractive for several
i.urorrr. First, it requires only a treatment group' not a comparison group whose
502 | 14.A CRTT|CAL
ASSESSMENT
OFOURASSUMPTTONS

agreement to be in the study might be problematic and whose participation in-


creasesresearchcosts. Second, demonstrating a match between theory and data
suggeststhe validity of the causal theory without having to go through a labori-
ous processof explicitly considering alternative explanations. Third, it is often im-
practical to measure distant end points in a presumed causal chain. So confirma-
tion of attaining proximal end points through theory-specified processescan be
used in the interim to inform program staff about effectivenessto date, to argue
for more program resourcesif the program seemsto be on theoretical track, to
justify claims that the program might be effective in the future on the as-yet-not-
assesseddistant criteria, and to defend against premature summative evaluations
that claim that an intervention is ineffective before it has been demonstrated that
the processesnecessaryfor the effect have actually occurred.
However, maior problems exist with this approach for high-quality descrip-
tive causalinference(Cook, 2000). First, our experiencein writing about the the-
ory of a program with its developer (Anson et al., 1,991)has shown that the the-
ory is not always clear and could be clarified in diverse ways. Second, many
theories are linear in their flow, omitting reciprocal feedback or external contin-
genciesthat might moderate the entire flow. Third, few theories specify how long
it takes for a given processto affect an indicator, making it unclear if null results
disconfirm a link or suggestthat the next step did not yet occur. Fourth, failure to
corroborate a model could stem from partially invalid measuresas opposedto in-
validity of the theory. Fifth, many different models can fit a data set (Glymour et
a1.,1987;Stelzl, 1986), so our confidencein any given model may be small. Such
problems are often fatal to an approach that relies on theory to make strong causal
claims. Though some of theseproblems are present in experiments (e.g.,failure to
incorporate reciprocal causation, poor measures),they are of far less import be-
cause experiments do not require a well-specified theory in constructing causal
knowledge. Experimental causal knowledge is less ambitious than theory-based
knowledge, but the more limited ambition is attainable.

Weaker Quasi-Experi
ments
For some researchers,random assignment is undesirable for practical or ethical
reasons, so they prefer quasi-experiments. Clearly, we support thoughtful use of
quasi-experimentation to study descriptive causal questions. Both interrupted
time series and regression discontinuity often yield excellent effect estimates.
Slightly weaker quasi-experiments can also yield defensible estimates,especially
when they involve control groups with careful matching on stable pretest attrib-
utes combined with other design features that have been thoughtfully chosen to
addresscontextually plausible threats to validity. However, when a researchercan
choose, randomized designsare usually superior to nonrandomized designs.
This is especially true of nonrandomized designs in which little thought is
given to such matters as the quality of the match when creating control groups,

j
I tOl
NONEXPERIMENTALALTERNATIVES

includingmultiple hypothesistestsrather than a singleone' generatingdata from


severalpr.tr."t*.nt time points rather than one, or having severalcomparison
groupsto createcontrolsthat bracketperformancein the treatmentgroups.In-
I..d, when resultsfrom typical quasi-experiments are comparedwith thosefrom
randomizedexperimentson the same topic, several findings emerge.Quasi-
experimentsfrequentlymisestimateeffects(Heinsman& Shadish,1'996;Shadish
& Ragsdale,t9961.Tiresebiasesare often large and plausiblydue to selectionbi-
asessrrchas the self-selection of more distressedclientsinto psychotherapytreat-
ment conditions(Shadishet al., 2000) or of patientswith a poorer prognosisinto
controlsin medicalexperiments(Kunz & Oxman,1'9981.Thesebiasesare espe-
cially prevalentin quasi-experiments that usepoor quality control groupsand have
higheiattrition(Heinsmar$cShadish,'1,996;Shadish 6cRagsdale,l996l.So,if the
an"swers obtainedfrom randomizedexperimentsare more crediblethan thosefrom
quasi-experiments on theoreticalgroundsand are more accurateempirically,then
,'h. ".g,.r-entsfor randomizedexperimentsare evenstrongerwhenevera high de-
gr.. oI uncertaintyreductionis requiredabout a descriptivecausalclaim.
Becauseall quasi-experiments are not equal in their ability to reduceuncer-
tainty about."ur., *. -"ttt to draw attention againto a common but unfortu-
natepracticein manysocialsciences-tosaythat a quasi-experiment is beingdone
in order to provide justificationthat the resultinginferencewill be valid. Then a
quasi-experimental designis describedthat is so deficientin the desirablestruc-
tural featuresnoted previously,which promote better inference,that it is proba-
bly not worth doing. Indeed,over the yearswe have__repeatedly noted the term
quasi-experiment biing usedto justify designsthat fell into the classthat Camp-
bell and'stanley(196i) labeledas uninterpretable and that Cook and Campbell
(1,9791labeled'as generallyuninterpretable. Theseare the simplestforms of the
designsdiscussedin Chapters4 and 5. Quasi-experiments cannot be an alterna-
tive to randomizedexperimentswhen the latter are feasible,and poor quasi-ex-
perimentscan neverbi a substitutefor strongerquasi-experiments when_thelat-
i., "r. also feasible.Just as Gueron (L999) has remindedus about randomized
experiments,good quasi-experiments haveto be fought for, too. They are rarely
handedout as though on a silverplate.

StatisticalControls
In this book,we haveadvocatedthat statisticaladjustmentsfor groupnonequivalence
are best urrd oBt designcontrolshavealreadybeenusedto the maximum in order
to reducenonequivalence to a minimum. So we are not opponentsof statisticalad-
justmenttechniquessuchasthoseadvocatedby the statisticiansand econometricians
describedin the appendixto Chapter5. Ratheqwe want to usethem as the last re-
sort.The positionwe do not like is the assumptionthat statisticalcontrolsare sowell
developeithat they can be usedto obtain confidentresultsin nonexperimentaland
weak iuasi-e*perimentalcontexts.As we saw in Chapter 5, researchin the past 2
504 | ta. a cRtTtcAL
AsSEssMENT
OFOURASSUMPT|ONS
I

decadeshas not much supported the notion that a control group can be constructed
through matchingfrom somenational or state registrywhen the treatmentgroup
comesfrom a morecircumscribedand localsetting.Nor hasresearchmuchsupported
the useof statisticaladjustmentsin longitudinalnationalsurveysin which individuals
with differentexperiences are explicitly contrastedin order to estimatethe effectsof
this experiencedifference.Undermatchingis a chronic problem here,as are conse-
quencesof unreliabilityin the selectionvariables,not to speakof specificationerrors
dueto incompleteknowledgeof the selectionprocess.In particular,endogeneity prob-
'We
lemsarea realconcern. areheartenedthat more recentwork on statisticaladjust-
mentsseemsto be moving toward the position we represent,with greateremphasis
beingplacedon internal controls,on stablematchingwithin suchinternalcontrols,
on the desirabilityof seekingcohort controlsthroughthe useof siblings,on the useof
pretests
PrstssLs collected
sorrccf,e(Jon
on the
rne same
same measures the posttest,
measures aS tne posttest, on
On the utiliw Ot
tne Uulrty of SUCh
suchpretest
measures collected at several different times, and on the desirability of studying inter-
'We
ventionsthat areclearlyexogenousshocksto someongoingsystem. arealsoheart-
enedby the progressbeingmadein the statisticaldomainbecause it includesprogress
on designconsiderations, aswell ason analysisper se(e.g.,Rosenbaum,1999a).Ve
areagnosticat this time asto the virtuesof the propensityscoreandinstrumentalvari-
able approachesthat predominatein discussionsof statisticaladiustmenr.Time will
tell how well
tell well they
they pan out relative to the results from randomizedexperiments.'We
have surely not heard the last word on this topic.

CONCLUSION
'We
cannot point to one new development that has revolutionized field experimen-
tation in the past few decades,yet we have seena very large number of incremen-
tal improvements. As a whole, these improvements allow us to create far better
field experiments than we could do 40 years ago when Campbell and Stanley
(1963) first wrote. In this sense,we are very optimistic about the future. Ve believe
that we will continue to see steadg incremental growth in our knowledge about
how to do better field experiments. The cost of this growth, howeveq is that field
experimentation has become a more specializedtopic, both in terms of knowledge
developmentand of the opportunity to put that knowledge into practice in the con-
duct of field experiments. As a result, nonspecialistswho wish to do a field exper-
iment may greatly benefit by consulting with those with the expertise,especiallyfor
large experiments, for experiments in which implementation problems may be
high, or for casesin which methodological vulnerabilities will greatly reducecred-
ibility. The same is true, of course, for many other methods. Case-studymethods,
for example, have become highly enough developed that most researcherswould
do an amateurishjob of using them without specializedtraining or supervisedprac-
tice. Such Balkanization of. methodolog)r is, perhaps, inevitable, though none the
lessregrettable.\U7ecan easethe regret somewhat by recognizingthatwith special-
ization may come faster progress in solving the problems of field experimentation.

You might also like