
Annual Reviews Inc. All rights reserved

Annu. Rev. Psychol. 1994. 45:545-580. Downloaded by UNIVERSITY OF IDAHO LIBRARY on 10/04/05. For personal use only.

Thomas D. Cook
Departments of Sociology and Psychology, Northwestern University, Evanston, Illinois

William R. Shadish
Department of Psychology, Memphis State University, Memphis, Tennessee 38152

KEY WORDS: causation, randomization, quasi-experiments, causal generalization, selection

CONTENTS
.................................................................................................... 546
Theories of Causation ............................................................................ 547
Theories of Categorization ..................................................................... 548
Relabeling Internal Validity .................................................................... 550
Relabeling External Validity ................................................................... 552
Debates about Priority among Validity Types ........................................... 553
Raising the Salience of Treatment Implementation Issues .......................... 554
The Epistemology of Ruling Out Threats ................................................. 555
RANDOMIZED EXPERIMENTS IN FIELD SETTINGS ............................. 556
Mounting a Randomized Experiment ...................................................... 557
Maintaining a Randomized Experiment ................................................... 560
.................................................................................................... 561
Designs Emphasizing Point-Specific Causal Hypotheses ........................... 562
Designs Emphasizing Multivariate-Complex Causal Predictions ................. 565
Statistical Analysis of Data from the Basic Nonequivalent Control Group Design ... 566
CAUSAL RELATIONSHIPS ................................................................. 571
Sampling ............................................................................................. 571
Proximal Similarity ............................................................................... 572
Probes for Robust Replication over Multiple Experiments ......................... 572
Explanation ......................................................................................... 573
.................................................................................................... 575

0066-4308/94/0201-0545$05.00

The Annual Review of Psychology has never published a chapter on causal inference or experimentation. This is surprising, given psychology's traditional concern with establishing causal relationships, preferably through experiments. Our own expertise is in social experiments, as practiced not only in psychology but also in education, economics, health, sociology, political science, law, and social welfare. Philosophers and statisticians are also involved with such experiments, but more to analyze them than to do them. This review concentrates on important developments from this multidisciplinary spectrum during the 1980s and early 1990s.

Experiments can be characterized both structurally and functionally (Cook 1991a). The more prototypical structural attributes of experiments include a sudden intervention; knowledge of when the intervention occurred; one or more post-intervention outcome measures; and some form of a causal counterfactual--that is, a baseline against which to compare treated subjects. Functionally, experiments test propositions about whether the intervention or interventions under test are causally responsible for a particular outcome change in the restricted sense that the change would not have occurred without the intervention.

Social experiments take place outside of laboratories and therefore tend to have less physical isolation of materials, less procedural standardization, and longer-lasting treatments when compared to experiments in laboratory settings. Social experiments are usually designed to test an intervention or treatment that is usually better characterized as a global package of many components than as a presumptively unidimensional theory-derived causal construct. This is because the usual aim of field researchers is to learn how to modify behavior that has proven to be recalcitrant in the past (e.g. poor school performance, drug abuse, unemployment, or unhealthful lifestyles) as opposed to testing a theoretical proposition about unidimensional causes and effects.
There are two types of social experiments, each of which has the sudden intervention, knowledge of intervention onset, posttest, and causal counterfactual components that characterize all experiments. Randomized experiments have units that are assigned to treatments or conditions using procedures that mimic a lottery, whereas quasi-experiments involve treatments that are assigned nonrandomly, mostly because the units under study--usually individuals, work groups, schools, or neighborhoods--self-select themselves into treatments or are so assigned by administrators based on analysis of who merits or needs the opportunity being tested.
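The lottery analogy can be made concrete with a short sketch. The code below is ours, not the chapter's; the `randomize` helper and the school names are hypothetical. The point is that in a randomized experiment nothing about a unit's need, merit, or motivation influences its condition, so treated and control groups differ only by chance at baseline.

```python
import random

def randomize(units, conditions=("treatment", "control"), seed=None):
    """Assign each unit to a condition by lottery-like chance alone.

    Unlike self-selection or administrator assignment, no attribute of
    a unit affects which condition it receives.
    """
    rng = random.Random(seed)
    shuffled = units[:]
    rng.shuffle(shuffled)
    # Deal units out like cards so group sizes stay balanced.
    return {u: conditions[i % len(conditions)] for i, u in enumerate(shuffled)}

schools = [f"school_{n}" for n in range(6)]
assignment = randomize(schools, seed=42)
```

A quasi-experiment, by contrast, would replace the shuffle with self-selection or an administrator's judgment, so the resulting groups could differ systematically before treatment ever begins.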
We cover four experimentation topics that are currently of interest: 1. modifications to the dominant theory of social experimentation; 2. shifts in thinking about the desirability and feasibility of conducting randomized field experiments; 3. modifications to thinking about quasi-experiments; and 4. recent concerns about justifying generalized causal inferences.



Theories of Causation
Experimentation is predicated on the manipulability or activity theory of causation (Collingwood 1940, Mackie 1974, Gasking 1955, Whitbeck 1977), which tries to identify agents that are under human control and can be manipulated to bring about desired changes. This utilitarian conception of causation corresponds closely with common-sense and evolutionary biological understandings of causation, and it assumes that cause-effect connections of this kind are dependable enough to be useful. Experiments have this same purpose. They probe whether a force that is suddenly introduced into an ongoing system influences particular outcomes in a way that would not have occurred without the intervention. The aim of experiments is to describe these causal consequences, not to explain how or why they occurred.
Experiments can be made more explanatory, though. This is achieved primarily by selecting independent and dependent variables that explicitly explore a particular theoretical issue or by collecting measures of mediating processes that occurred after a cause was manipulated and without which the effect would presumably not have come about. But few social experiments of the last decades were designed to have primarily an explanatory yield, and nothing about the logic of experimentation requires a substantive theory or well-articulated links between the intervention and outcome, though social experiments are superior if either of these conditions is met (Lipsey 1993).
Understanding experiments as tests of the causal consequences of manipulated events highlights their similarity with the manipulability theory of causation. But it also makes them seem less relevant to the essentialist theories of causation (Cook & Campbell 1979) to which most scholars subscribe. These theories seek either a full explanation of why a descriptive causal connection comes about or the total prediction of an effect. Thus, they prioritize causal explanation rather than causal description, isolating why a causal connection comes about rather than inferring that a cause and effect are related. Such explanatory theories are likely to be reductionistic or to involve specifying multiple variables that co-condition when a cause and effect are related. Each of these is a far cry from the "if X, then Y" of the manipulability theory of causation. Mackie (1974)--perhaps the foremost theorist of causation--sees the manipulability theory as too simplistic because it assumes a real world characterized by many main effects that experiments try to identify. For Mackie and the essentialists, the real world is more complex than this. However, Mackie doubts whether essentialist theories are epistemologically adequate, however ontologically appropriate they may be. He postulates that all causal knowledge is inevitably "gappy," dependent on many factors so complexly ordered that full determination of the causal system influencing an outcome is impossible in theory and practice. Thus, he contends that essentialists seek a complete deterministic knowledge they are fated never to attain.
Conceiving of all causal relationships as embedded within a complex context that co-determines when an effect is manifest implies the need for a methodology sensitive to that complexity and embeddedness. Yet individual experiments are designed to test the effects of one or a few manipulable independent variables over a restricted range of treatment variants, outcome realizations, populations of persons, types of settings, and historical time periods. To be sure, in their individual studies, researchers can measure attributes of treatments, outcomes, respondents, and settings and then use these measures to probe whether a particular outcome depends on the treatment's statistical interaction with such attributes. This strategy has been widely advocated, especially in education (Cronbach & Snow 1976). But the variability available for analysis in individual studies is inherently limited, and tests of higher-order interactions can easily lead to data analyses with low statistical power (Aiken & West 1991). As a result, interest has shifted away from exploring interactions within individual experiments and toward reviewing the results of many related experiments that are heterogeneous in the times, populations, settings, and treatment and outcome variants examined. Such reviews promise to identify more of the causal contingencies implied by Mackie's fallibilist, probabilistic theory of causation; and they speak to the more general concern with causal generalization that emerged in the 1980s and 1990s (e.g. Cronbach 1982, Cook 1993, Dunn 1982).
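The power problem with interaction tests can be illustrated by a small Monte Carlo sketch. This is our own illustration, not an analysis from the literature cited above; the function name, cell sizes, and effect sizes are arbitrary, and for simplicity the error variance is treated as known. In a 2 x 2 design, a difference-of-differences (interaction) contrast has twice the standard error of a main-effect contrast on the same data, so an interaction of the same magnitude is detected far less often.

```python
import random
from statistics import fmean

def detection_rates(n=25, effect=0.5, reps=2000, seed=1):
    """Estimate how often a main effect vs. an equal-sized interaction
    is detected in a 2 x 2 design with n units per cell (sd = 1)."""
    rng = random.Random(seed)
    z = 1.96                          # two-sided 5% criterion
    se_main = (1.0 / n) ** 0.5        # SE of the main-effect contrast
    se_inter = (4.0 / n) ** 0.5       # SE of the interaction contrast
    hits_main = hits_inter = 0
    for _ in range(reps):
        cell = lambda mu: fmean(rng.gauss(mu, 1.0) for _ in range(n))
        # Cell means chosen so the true main-effect contrast and the
        # true interaction contrast are both exactly `effect`.
        m00, m01 = cell(0.0), cell(0.0)
        m10, m11 = cell(effect / 2), cell(1.5 * effect)
        main = (m10 + m11) / 2 - (m00 + m01) / 2
        inter = (m11 - m10) - (m01 - m00)
        hits_main += abs(main) / se_main > z
        hits_inter += abs(inter) / se_inter > z
    return hits_main / reps, hits_inter / reps
```

With these (assumed) numbers the main effect is detected in roughly seven runs out of ten while the equal-sized interaction is detected in roughly one out of four, which is the sense in which interaction tests "easily lead to data analyses with low statistical power."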
With its emphasis on identifying main effects of manipulated treatments, the activity theory of causation is too simple to reflect current philosophical thinking about the highly conditional ways in which causes and effects are structurally related. The experimentalist's pragmatic assumption has to be either (a) that some main effects emerge often enough to be dependable, even if they are embedded within a larger explanatory system that is not fully known; or (b) that the results of individual experiments will eventually contribute to a literature review identifying some of the causal contingencies within this system.

Theories of Categorization
The descriptive causal questions that experimenters seek to answer usually specify particular cause and effect constructs or categories to which generalization is sought from the manipulations or measures actually used in a study.
Philosophical work on categorization is currently dormant, while in psychology there is considerable productive chaos. Classic theories of categorization postulate that instances belong in a class if they have all the fixed attributes of that class. This approach has been widely discredited (Rosch 1978, Lakoff 1987, Medin 1989), and there is now recognition that all categories have fuzzy rather than clearly demarcated boundaries (Zimmerman et al 1984), that some elements are more central or prototypical than others (Smith & Medin 1981), and that knowledge of common theoretical origins is sometimes more useful for classifying instances than the degree of initially observed similarity (Medin 1989). In biology, plants or animals used to be assigned to the same class if they shared certain attributes that facilitate procreation. But classification now depends more on similarity in DNA structure, so that plants or animals with seemingly incompatible reproductive features but with similar DNA are placed in the same category. This historical change suggests that seeking attributes that definitively determine category membership is a chimera. The attributes used for classification evolve with advances elsewhere in a scholarly field, as the move from a Linnean to a more microbiological classification system illustrates in biology.
Categorization is not just a logical process; it is also a complex psychological process (or set of processes) that we do not yet fully understand. Ideally, social experiments require researchers to explicate a cause or an effect construct, to identify its more prototypical attributes, and then to construct an intervention or outcome that seems to be an exemplar of that construct. Explicit here is a pattern-matching methodology (Campbell 1966) based on proximal similarity (Campbell 1986). That is, the operation looks like what it is meant to represent. But the meanings attached to similar events are often context-dependent. Thus, if a study's independent variable is a family income guarantee of $20,000 per year, this would probably mean something quite different if the guarantee were for three years versus twenty. In the first case individuals would have to think much more about leaving a disliked job because they do not have the same financial cushion as someone promised the guarantee for 20 years. Also, classifying two things together because they share many similar attributes does not imply they are theoretically similar. Dolphins and sharks have much in common. They live exclusively in water, swim, have fins, are streamlined, etc. But dolphins breathe air out of water and give direct birth to their young--both features considered prototypical of mammals. Hence, dolphins are currently classified as mammals rather than fish. But why should breathing out of water and giving birth to one's young be considered prototypical attributes of mammals? Why should it not be the fit between form and function that unites dolphins and fish? Another example of this complexity comes from social psychology. Receiving very little money to do something one doesn't want to do seems different on the surface from being
asked to choose between two objects of similar value. Yet each is supposed to elicit the same theoretical process of cognitive dissonance. Should the attributes determining category membership depend on physical observables linked to particular classes by a theory of proximal exemplification or prototypicality? Or should the attributes instead reflect more latent and hence distal theoretical processes, such as those presumed to distinguish mammals from fish or those that make financial underpayment and selecting among objects of similar value instances of the same class called cognitive dissonance?
Experimenters have to generalize about causes, effects, times, settings, and people. If the best theories from philosophy and psychology are not definitely helpful, researchers can turn instead to statistics, where formal sampling theory provides a defensible rationale for generalization. However, in experimental practice it is almost impossible to sample treatment variants, outcome measures, or historical times in the random fashion that sampling theory requires; and it is extremely difficult to sample persons and settings in this way. So, in individual experiments, generalization depends on the purposive rather than random selection of instances and samples to represent particular categories of treatment, outcome, people, or settings. Unfortunately, in current sampling theory, purposive selection cannot justify generalization to a category--only random selection can.

Relabeling Internal Validity

Among social scientists, Campbell (1957) was the first to systematically work through the special issues that arise when probing causal hypotheses in complex field settings. As he recounts it (Campbell 1986), his analysis originated from concerns with the low quality of causal inferences promoted by the field research of the time and with the questionable generalizability of the cause-testing laboratory research of the time. So he invented a language for promoting more confident causal inferences in contexts that better resembled those to which generalization is typically sought. He coined the term internal validity to refer to inferences about whether the relationship between two operationalized variables was causal in the particular contexts where the relationship had been tested to date; and he invented external validity to express the extent to which a causal relationship can be generalized across different types of settings, persons, times, and ways of operationalizing a cause and an effect. These two terms have now entered into the lexicon of most social scientists.
But many scholars did not understand internal validity in the way Campbell intended. It was widely taken to refer, not just to the quality of evidence about whether an obtained relationship between two variables was causal, but also to whether the relationship was from a particular theoretically specified causal agent to a particular theoretically specified effect (Kruglanski & Kroy 1975, Gadenne 1976). Thus, labeling the cause and effect was smuggled in as a component of internal validity. Cronbach (1982) went even further, adding that internal validity concerns whether a causal relationship occurs in particular identifiable classes of settings and with particular identifiable populations of people--each an element of Campbell's external validity. To discuss validity matters exactly, Cronbach invented a notational system. He used utos to refer to the instances or samples of units u, treatments t, observations o, and settings s achieved in a research project probing a causal hypothesis. He used UTOS¹ to refer to the categories, populations, or universes that these instances or samples represent and about which research conclusions are eventually drawn. Translating Campbell's internal validity into these terms, we see that it refers only to the link between t and o at the operational level. What t stands for (i.e. how the cause or T should be labeled) is irrelevant, just as it is irrelevant what o stands for (i.e. how the effect or O should be labeled). For Campbell, these are matters of external validity, as are concerns about the types of units and settings in which the t-o relationship can be found (Campbell 1957, Campbell & Stanley 1963).
After critics eventually understood how restricted Campbell's description of internal validity was, they decried it as positivistic (Gadenne 1976), irrelevant to substantive theory (Kruglanski & Kroy 1975), and insensitive to the contingency-laden character of all causal conclusions (Cronbach 1982). Cronbach persisted in describing his internal validity in terms of generalization from all parts of utos to all parts of UTOS, thus subsuming under his version of internal validity all the elements Campbell had included as elements of his external validity! To reduce this ambiguity, Campbell (1986) sought to invent more accurate labels for internal validity. He invented local molar causal validity to replace internal validity, hoping to highlight with local that internal validity is confined to the particular contexts sampled in a given study. By molar he hoped to emphasize that his internal validity treated interventions and outcomes as complex operational packages composed of multiple microelements or components. (The contrast here is with more reductionist approaches to causation that place most weight on identifying the microelements responsible for any relationship between a more molar treatment and a more molar outcome.) Finally, with the causal part of his new label, Campbell sought to highlight the centrality of assertions about whether the o in Cronbach's utos formulation would change without variation in t, thereby downplaying the aspiration to learn about other facets of this link, especially its generalizability. Campbell's new name for internal validity has not been widely adopted, despite its greater precision. Internal validity is still widely used and misused, though clarification is now at hand for those willing to seek it.

¹ We ignore here Cronbach's discussion of the difficulty of drawing inferences about populations or settings.


Relabeling External Validity

Campbell's concept of external validity also came to be modified. His early work described external validity in terms of identifying the contexts in which a given causal relationship holds, including ways that the cause or effect were operationalized. Cook & Campbell (1979) distinguished between generalizing to theoretical cause and effect constructs and generalizing to populations of persons and settings, using construct validity to refer to the former and external validity to refer to the latter. More importantly, Cronbach (1982) differentiated between two types of generalization that were obscured in the work of Campbell and his colleagues. The first involves generalizing from utos to UTOS--using attributes of the sampling particulars to infer the constructs and populations they represent. This is what Cronbach calls internal validity. The second involves generalizing from UTOS to *UTOS (star-UTOS), with *UTOS representing populations and constructs that have manifestly different attributes from those observed in the sampling details of a study. At issue with *UTOS is extrapolation beyond the sampled particulars to draw conclusions about novel constructs and populations. Cronbach calls this external validity.

To clarify his own different understanding of external validity, Campbell relabeled it as proximal similarity (Campbell 1986). He chose this label to emphasize a theory of generalization based on using observable attributes to infer whether a particular sample of units, treatments, observations, or settings belonged in a particular class or category as that class or category is usually understood within a local research community or in ordinary English-language usage. Similarity between the specific instances sampled and the more general classes they are supposed to represent is the order of the day, and Campbell is adamant in classifying instances on the basis of observable characteristics linked to a theory specifying which characteristics are prototypical. Proximal is part of his new label for external validity to emphasize that other characteristics are more distal for him, particularly those where the similarity is not immediately evident but is instead derived from substantive theory--e.g. classifying dolphins as mammals rather than fish. Like local molar causal validity, proximal similarity has failed to catch on among social scientists. Campbell's old external validity lives on, seemingly as vigorous as ever. However, the need he felt in 1986 to create new labels to clarify the constructs first introduced in 1957 suggests growing dissatisfaction about the clarity or content of his validity types. However worthy the underlying ideas, the terms themselves cannot now be considered as unproblematic and as sacrosanct as they once were.


Debates about Priority among Validity Types

One debate about priorities among validity types concerns the warrant for Campbell's preference for internal over external validity. Campbell & Stanley (1963) argued that it is only useful to explore how generalized a causal connection is if very little uncertainty remains about that connection. How would theory or practice be advanced, they reasoned, if we identified limits to the generalization of a relationship that might later turn out not to be causal? Cronbach (1982) asserted that this priority reflects the conservative standards scientists have traditionally adopted for inferring causation, and he questioned whether these standards are as relevant for other groups as they are for individuals with a stake in the results of a social experiment. His own experience has convinced him that many members of the policy-shaping community are more prepared than scholars to tolerate uncertainty about whether a relationship is causal, and they are willing to put greater emphasis than scholars do on identifying the range of application of relationships that are manifestly still provisional. Cronbach believes that Campbell's preference for internal over external validity is parochial. Chelimsky (1987) has offered an empirical refutation of Cronbach's characterization of knowledge preferences at the federal level. But research on preferences at the state or local levels is not yet available, and Cronbach's characterization may be more appropriate there. Time will tell.
Dunn (1982) has criticized the validity typology of Campbell and his colleagues because neither internal nor external validity refers to the importance of the research questions being addressed in an experiment. He claims that substantive importance can be explained just as systematically as the more technical topics Campbell has addressed, and he has proposed some specific threats that limit the importance of research questions. Dunn is correct in assuming that there is no limit to the number of possible validity types, as Cook & Campbell partially illustrated when they divided Campbell's earlier internal validity into statistical conclusion validity (are the cause and effect related?) and internal validity (is the relationship causal?), and when they divided his external validity into construct validity (generalizing from the research operations to cause and effect constructs) and external validity (generalizing to populations of persons and settings). Cook & Campbell also proposed new threats to validity under each of these headings, and their validity typology is widely used (e.g. Wortman 1993). Thus, Campbell and his colleagues agree with Dunn, only criticizing him on some particulars about threats to the importance of research questions (see especially Campbell ...).

Though his higher-order validity types came under attack, Campbell's specific list of threats to internal and external validity did not. Internal validity threats like selection, history, and maturation continue to be gainfully and regularly used to justify causal inferences, under whichever validity label they might be fit. This is probably because researchers, in their daily work, have to rule out specific threats to particular inferences. Debates about validity types are more abstract and remote. Campbell's identification of specific validity threats constitutes one of his most enduring contributions to social science.

Raising the Salience of Treatment Implementation Issues

Cook & Campbell (1979) distinguished between construct and external validity because (a) the means for promoting construct validity are rarely, if ever, based on the random sampling procedures so often advocated for generalizing to populations of persons and settings; and (b) experience has indicated how problematic treatment implementation can be in nonlaboratory contexts where researchers are more often guests than hosts, where researchers are not always the providers of services, where program theory is often poorly explicated, and where unexpected difficulties often arise to compromise how well an intervention is mounted (Boruch & Gomez 1977; Sechrest et al 1979; Bickman 1987, 1990; Chen & Rossi 1983; Gottfredson 1978). Hence, Cook & Campbell (1979) outlined a new list of threats describing some of the ways the complex intervention packages implemented in practice differ from the often more theoretical intervention procedures outlined in a research protocol (Lipsey 1993). Their concern was to identify which parts of the research protocol were actually implemented; which were not implemented; which were implemented but without the frequency or intensity detailed in the guiding theory; and which unplanned irrelevancies impinged on the research and might have influenced research outcomes.
Experiments test treatment contrasts rather than single treatments (Holland 1986). Cook & Campbell (1979) drew attention to four novel threats that affect the treatment contrast without necessarily influencing the major treatment purportedly under test. All these threats touch on diffusion of one treatment to other treatments. Resentful demoralization occurs when members of control groups or groups receiving less desirable treatments learn that other groups are receiving more. If they become resentful because of this focused comparison, their performance might decrease and lead to a treatment-control difference. But this is because the controls are getting worse rather than the experimentals getting better. They also postulated that compensatory rivalry can arise if the focused inequity that so many experimental contrasts require leads members of the groups receiving less to respond by trying harder to show that they do not deserve their implicitly lower status. Such a motivation obscures true effects occurring in treatment groups because performance in the control groups should improve. Compensatory equalization occurs when administrators are not willing to tolerate focused inequities and they use whatever discretionary resources they control to equalize what each group receives. This also threatens to obscure true effects of a treatment. Finally, treatment diffusion occurs when treatment providers or recipients learn what other treatment groups are doing and, impressed by the new practices, copy them. Once again, this obscures planned treatment contrasts. These four threats have all been observed in social experiments, and each threatens to bias treatment effect estimates when compared to situations where the experimental units cannot communicate about the different treatments. Statisticians now subsume them under the general rubric of SUTVA--the stable-unit-treatment-value assumption (Holland & Rubin 1988)--in order to highlight how much the interpretation of experimental results depends on the unique components of one treatment group not diffusing to other groups and hence contributing to the misidentification of the causal agent operative within a treatment contrast. Is it the planned treatment that influences the outcome, or is it serving in a control group?
This is only one of many arguments in favor of documenting the quality of
treatment implementation in all treatment conditions, including no-treatment
controls. "Black box" experiments without such documentation are potentially
counterproductive (Lipsey 1993). If they result in no-difference findings, it
is not clear whether the intervention theory was wrong, the particular
treatment implementation was weak, or the statistical power to detect a given
effect was inadequate. In the absence of documentation of treatment integrity,
even positive results do not make it clear what role the intended processes
played in bringing about outcome changes. Realizing this, experimenters have
devoted more resources over the last decade to describing implementation
quality and relating this to variability in outcomes.
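The power point above is easy to make concrete with a small simulation. The
sketch below is illustrative only (the effect size, sample sizes, and the
approximate two-tailed 5% criterion are hypothetical choices, not from the
text): a real but modest effect routinely produces a no-difference finding
when samples are small.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_two_group(n_per_group, effect_size, n_sims=5000):
    """Estimate the power of a two-group comparison by simulation.

    effect_size is the true mean difference in SD units (Cohen's d)."""
    hits = 0
    for _ in range(n_sims):
        treat = rng.normal(effect_size, 1.0, n_per_group)
        control = rng.normal(0.0, 1.0, n_per_group)
        se = np.sqrt(treat.var(ddof=1) / n_per_group +
                     control.var(ddof=1) / n_per_group)
        t = (treat.mean() - control.mean()) / se
        if abs(t) > 1.96:  # approximate 5% two-tailed criterion
            hits += 1
    return hits / n_sims

# The same true effect (d = 0.3) is usually missed with 20 units per group
# but reliably detected with 250 per group.
print(power_two_group(20, 0.3))   # low power (around .15)
print(power_two_group(250, 0.3))  # high power (around .9)
```

A null result from the small design says almost nothing about whether the
intervention theory was wrong, which is exactly the interpretive ambiguity the
text describes.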

The Epistemology of Ruling Out Threats

All lists of validity threats are the product of historical analyses aimed at
identifying potential sources of bias. They are designed to help researchers
construct the argument that none of these threats could have accounted for a
particular finding they wish to claim as causal. The epistemological premise
here is heavily influenced by Popper's falsificationism (1959). But Kuhn's
(1962) critique of Popper on the grounds that theories are incommensurable
and observations are theory-laden makes it difficult to believe that ruling
out such threats is easy. So, too, does Mark's (1986) observation that ruling
out validity threats often depends on accepting the null hypothesis--a
logically tenuous exercise. Campbell has always emphasized that only plausible
threats need ruling out and that plausibility is a slippery concept, adding to
the impression that falsification is logically underjustified.
However, in the real world of research there have to be some mechanisms
for certifying which threats are plausible in local contexts and for
establishing which procedures will probably rule out a particular threat. The
scientific community is not likely to prefer methods for ruling out threats
that depend only on professional consensus. Logically justifiable procedures
are the preference. This is why causal inferences from quasi-experiments are
less convincing than from randomized experiments. They depend on a larger set
of accompanying arguments, including many that are weaker because they
postulate that particular threats are implausible in the context under
analysis or have been ruled out on the basis of statistical adjustments of the
data that themselves are assumption-laden. Is it not better to control threats
via random assignment, with its clear rationale in logic? Realization of the
epistemological difficulties inherent in ruling out alternative
interpretations has strengthened calls both to sample persons, settings,
treatments, and outcomes at random and also to assign units to treatments at
random (Kish 1987). There is no debate about the desirability of this in
principle; however, there are legitimate debates about the feasibility of
random assignment and random selection when testing causal hypotheses, and
especially about the degree of confidence merited by different methods that do
not use some form of randomization for inferring cause and estimating its
generalizability.

RANDOMIZED EXPERIMENTS IN FIELD SETTINGS


The strength of the randomized experiment seems to be that the expected mean
difference between randomly created groups is zero. This is merely a special
case of the more general principle that unbiased causal inference results
whenever all the differences between groups receiving treatments are fully
known (Rubin 1977, Trochim 1984). Statisticians like to remind us that
unbiased causal inference results if individuals are assigned to conditions by
the first letter of their surname, by whether they are left or right handed,
or by the order in which they applied to be in a particular social program.
The key is that the assignment process is fully known and perfectly
implemented; random assignment is only one instance of this more general
principle.
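The zero-expected-difference property can be checked numerically. The sketch
below (illustrative pool size and score distribution, not from the text)
repeatedly splits a heterogeneous pool at random: any single split differs by
chance, but the differences average out to zero.

```python
import numpy as np

rng = np.random.default_rng(42)

# A heterogeneous pool of units with widely varying baseline scores.
pool = rng.normal(50, 15, size=200)

def random_assignment_difference(pool):
    """Randomly split the pool in half; return the treatment-minus-control
    difference in baseline means for that split."""
    idx = rng.permutation(pool.size)
    treat, control = pool[idx[:100]], pool[idx[100:]]
    return treat.mean() - control.mean()

diffs = np.array([random_assignment_difference(pool) for _ in range(10000)])
print(diffs.std())   # individual splits differ by chance
print(diffs.mean())  # but the expected difference is essentially zero
```

The same check applied to assignment by surname initial or application order
would also balance in expectation only if those variables are unrelated to the
outcome; the fully known cutoff-based assignment discussed later is the more
interesting non-random case.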
Though random assignment involves considerable ethical, political, and
legal problems (detailed in Coyle et al 1991), assertions that such assignment
can rarely be used outside the laboratory have been disproved by the long
lists of experiments Boruch et al (1978) have compiled. Cronbach (1982)
reclassified some of the studies on these lists as quasi-experiments or failed
randomized experiments; nonetheless, the lists leave the impression that
random assignment is already common in social science and could be even more
widely used if researchers were better informed about how to circumvent
objections to it. Researchers wanting to use other means of assignment now
find it more difficult to justify their plans than even a decade ago. There
are still cases where a randomly created control group is impossible (e.g.
when studying the effects of a natural disaster), but the last decade has
witnessed a powerful shift in scientific opinion toward randomized field
experiments and away from quasi-experiments or non-experiments. Indeed,
Campbell & Boruch (1975) regret the influence Campbell's earlier work had in
justifying quasi-experiments where randomized experiments might have been
possible. In labor economics, many critics now advocate randomized experiments
over quasi-experiments (Ashenfelter & Card 1985, Betsey et al 1985, LaLonde
1986). In community-based health promotion research, randomized experiments
have now replaced the quasi-experiments that were formerly used to reduce the
high cost of mounting interventions with the larger sample size of whole towns
or cities that random assignment requires. Indeed, in biostatistics much
emotional strength is attached to random assignment and alternatives are
regularly denigrated (e.g. Freedman 1987).
Much of the recent discussion about randomized experiments deals with
implementing them better and more often. Little of this discussion has
appeared in scholarly outlets. So the following exposition of the factors
making randomized experiments easier to implement depends heavily on informal
knowledge among practicing social experimenters as well as on such published
sources as Boruch & Wothke (1985).

Mounting a Randomized Experiment

The best situation for random assignment is when the demand for the
treatment under evaluation exceeds the supply. The treatment then has to be
rationed and can be allocated at random. If demand does not exceed supply
naturally, experimenters can often induce the extra demand by publicizing the
treatment. In a pilot study of multiracial residential housing in a
university, CM Steele (unpublished research proposal) found in the pilot year
that too few whites volunteered for the program to make random assignment
possible. So, the next year he publicized the residential units through the
campus housing office. Such publicity creates a tradeoff. It makes random
assignment easier; but because of it, those who eventually enter the control
or placebo conditions have learned of the multiracial housing alternative for
which they applied but were not selected. This knowledge might lead to
treatment diffusion, compensatory rivalry, compensatory equalization, or
resentful demoralization. Although such threats can operate in any type of
study, the question for field experimenters is whether they operate more
strongly because of the publicity required to stimulate extra demand (Manski &
Garfinkel 1992) or because random assignment provides a less acceptable
rationale for experimental inequities when compared to allocation by need, by
merit, or on a "first come, first served" basis (Cook & Campbell 1979).
When using random assignment in field settings, a key concern is when
respondents are assigned to treatments. Riecken & Boruch (1974) identified
three options: before respondents learn of the measurement burdens and
treatment alternatives; after they learn of the measurement demands, but
before they learn of the content of the treatments being contrasted; and after
they know of the measurement demands and treatments and agree to be in
whatever treatment condition the coin toss (or equivalent thereof) determines
for them. Opinion has long favored the last alternative because it reduces the
likelihood that refusals to serve in the experiment will be related to the
treatments, thus creating the nonequivalence that random assignment is
designed to avoid. However, the cost to such delayed randomization is that
some persons may drop out of the study when they realize they are not
guaranteed the treatment they most desire. Thus, the potential gain in
internal validity (in Campbell's original sense) entails a potential loss in
external validity.
Some scholars now question the wisdom of placing such a high premium
on internal validity that treatment assignment is delayed. For policy
contexts, Heckman & Hotz (1988) question the usefulness of strong causal
conclusions if they only generalize to those who are indifferent about the
treatments they receive, because personal belief in a treatment might be an
important contributor to its effectiveness. This critique highlights the
advantages that follow from linking a randomized study of persons willing to
undergo any treatment with a quasi-experimental study of persons who
self-select themselves into treatments, controlling for self-selection in the
quasi-experiment as well as the statistical state of the art allows (Boruch
1975). Otherwise, advocates of delayed random assignment have to argue that it
is better to have limited information about the generalization of a confident
causal connection than it is to have considerable information about the
generalizability of a less certain connection.
Using random assignment in complex field settings is all the more difficult
because researchers often do not do the physical assignment process
themselves. They may design the process and write the protocols determining
assignment, whether as manuals or computer programs implementing a complex
stratification design. But the final step of this assignment process is often
carried out by a social worker, nurse, physician, or school district official.
It is up to these professionals to explain to potential volunteers the
rationale for random assignment, to detail the alternatives available, and
then to make the actual treatment assignment. Sometimes, implementers at the
point of service delivery misunderstand what they are supposed to do--which is
mostly a matter of improved training. At other times, however, professional
judgment predominates over the planned selection criteria and assignment
occurs by presumptions about a client's need or merit instead of by lottery.
Professionals are trained to judge who needs what, when. It is not always
easy for them to let a manual or computer protocol decide. Indeed, we have
heard anecdotes about workers in early childhood development centers who came
in surreptitiously at night to modify a computer so that it made the
assignments they believed were professionally correct rather than the random
assignments that were supposed to be made! To prevent professional judgment
from overriding scientific assignment requires giving better explanations of
the necessity for random assignment, having local research staff carry out the
assignment rather than service professionals, instituting earlier and more
rigorous monitoring of the selection process, and providing professionals with
the chance to discuss special cases they think should be excluded from the
assignment process.
Random assignment can be complicated, especially when social programs
are being tested. In such cases, the organization providing benefits could
lose income if all the places in the program are not filled. This means a
waiting list must be kept; however, it is unrealistic to expect that all those
on the waiting list will suspend self-help activities in the hope they will
eventually enter the program. Some will join other programs with similar
objectives or will have their needs met more informally. Thus, even a randomly
formed waiting list changes over time, often in different ways from the
changes occurring in a treatment group as it experiences attrition. Though no
bias occurs if persons from the waiting list are randomly assigned to
treatment and control status when program vacancies occur, it can be difficult
to describe the population that results when people from the ever-changing
waiting list are added to the original treatment group. Bias only occurs with
more complex stratification designs where program needs call for making the
next treatment available to a certain category of respondent (e.g. black males
under 25). If there are fewer of these in the pool than there are treatment
conditions in the research design, then program needs dictate that this person
enters the program even if no similar person can be assigned to the control
status. If researchers belatedly recruit someone with these characteristics
into a comparison group, this vitiates random assignment and introduces a
possible selection bias. But if they exclude these persons from the data
analysis because there are no comparable controls, this reduces external
validity.
Outside the laboratory, random assignment often involves aggregates larger
than individuals (e.g. schools, neighborhoods, cities, or cohorts of
trainees). This used to present problems of data analysis since analysis at
the individual level fails to account for the effects of grouping, while
analysis at the higher level entails smaller sample sizes and reduced
statistical power. The advent of hierarchical linear modeling (Bryk &
Raudenbush 1992) should help reduce this problem, particularly when more
user-friendly computer programs are available. But such modeling cannot deal
with the nonequivalence that results when, despite random assignment, the
number of aggregated units is low. After all, it can be costly to assign whole
cities to treatments and obtain stable estimates for each city at each
measurement wave. To counter this, researchers now almost routinely match
larger units on pretest means or powerful correlates of the outcome before
assigning them to the various treatments. Since the highest single research
cost is likely to be mounting the intervention, researchers sometimes select
more control than intervention sites (Kish 1987). This reduces the statistical
power problem associated with small sample sizes, but it does not guarantee
initial group equivalence if the treatment is still assigned to a small sample
of randomly selected and aggregated units. So the preferred solution is to
move to a study with more schools or cities. The National Heart, Lung and
Blood Institute now seems to have adopted this point of view, preferring
multimillion dollar, multisite studies of smoking cessation and women's health
over smaller studies. Yet these studies will inevitably be homogeneous in some
components that reduce generalization, and critics will inevitably suggest
some study factors that need to be varied in subsequent research. Moreover,
the expense of such multisite experiments entails opportunity
costs--meritorious studies that deserve to be funded cannot be.
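The grouping problem is conventionally summarized by the design effect,
1 + (m - 1)*rho, where m is cluster size and rho the intraclass correlation.
A simulation sketch (the variance components and sizes below are illustrative
assumptions, not from the text) shows how much more a cluster-randomized mean
varies than an individual-level analysis of the same total n assumes.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_of_clustered_sample(n_clusters, m, sd_between, sd_within):
    """Draw n_clusters groups of m individuals who share a common cluster
    effect; return the grand mean across all n_clusters * m individuals."""
    cluster_effects = rng.normal(0, sd_between, n_clusters)
    scores = cluster_effects[:, None] + rng.normal(0, sd_within, (n_clusters, m))
    return scores.mean()

# 10 clusters of 50 people vs 500 independent people: same total n,
# but clustering sharply inflates the sampling variability of the mean.
clustered = np.array([mean_of_clustered_sample(10, 50, 1.0, 2.0)
                      for _ in range(4000)])
independent = np.array([rng.normal(0, np.hypot(1.0, 2.0), 500).mean()
                        for _ in range(4000)])

print(clustered.std())    # far larger than the individual-level analysis implies
print(independent.std())  # about sqrt(5/500), i.e. roughly 0.10
```

Here rho = 1/(1+4) = 0.2 and m = 50, so the design effect is 1 + 49(0.2) =
10.8: the clustered mean behaves as if only about 46 independent observations
were collected, which is the power problem that hierarchical models and more
sites are meant to address.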

Maintaining a Randomized Experiment

Unlike laboratory experimenters, field experimenters are usually guests in
somebody else's organization. The persons they study are usually free to come
and go and pay attention or not as they please, lowering the degree of
treatment standardization. Some individuals will leave the study long before
they were expected to, having had little exposure to the treatment. This
increases the within-group variability in treatment exposure and also
decreases the exposure mean for the group overall. All these processes reduce
statistical power. Standard practice now calls for describing the variability
in treatment exposure, determining its causes, and then relating the
variability to changes in the dependent variable (Cook et al 1993). Describing
the quality of treatment implementation has now become central; even
explaining this variability seems to be growing in importance.
Treatment-correlated respondent attrition is particularly problematic
because it leads to selection bias--a problem exacerbated in social
experiments because the treatments being contrasted often differ in intrinsic
desirability and attrition is generally higher the less desirable the
intervention. Field experimenters have now accumulated a set of practices that
reduce the severity of treatment-correlated attrition, though none is perfect.
Taken together, they are often effective. They include making generous
payments to controls for providing outcome data; monitoring attrition rates
early to describe them, to elicit their possible causes, and to take remedial
action; following up drop-outs in whatever archival databases relevant
information is located; and designing the experiment to contrast different
treatments that might achieve the desired outcome as opposed to contrasting a
single treatment group with a no-treatment control group. The presumption is
that rival contenders as effective interventions are likely to be more similar
in intrinsic desirability than are a treatment and a no-treatment control
group.
Social experimenters cannot take for granted that a perfect random
assignment plan will be perfectly implemented. Social experiments have to be
closely monitored, even if this means adding to the already complex set of
elements that might explain why a treatment impacts on an outcome. Note the
two purposes of monitoring: to check on the quality of implementation of the
research design, including random assignment; and to check on the quality of
implementation of the intervention program itself. Variability in program
implementation quality is serious, but probably not as serious as the
confounding that arises when components of one treatment are shared by
another. Occurring in complex field contexts, social experiments rarely permit
experimental isolation and so treatments can diffuse in response to many quite
different types of forces. For example, cultural mores change with the result
that in one disease prevention study, members of the control group began to
exercise more, eat better, and avoid drugs in response to national trends (RV
Luepker et al, submitted for publication). Alternatively, no-treatment
controls can and do seek alternative sources of help; educational
professionals get together and share experiences so that principals from a
control school may learn what is happening in an intervention school; and
participants in a job training program might tell program personnel about
details of the job training their friends are experiencing in other programs.
Whatever the impetus, treatment diffusion is not rare, so experimenters have
to look for opportunities to study groups that cannot communicate with each
other. But even this does not avoid the problem that social experiments are
probably most likely to be funded when a social issue is already on national
policy agendas so that other solutions are simultaneously being generated in
the public or private sectors. Monitoring what happens in both treatment and
control groups is a sine qua non of modern social experiments.

In quasi-experimentation, the slogan is that it is better to rule out
validity threats by design than by statistical adjustment. The basic design
features for doing this--pretests, pretest time series, nonequivalent control
groups, matching, etc--were outlined by Campbell & Stanley (1963) and added to
by Cook & Campbell (1979). Thinking about quasi-experiments has since evolved
along three lines: 1. toward better understanding of designs that make
point-specific predictions; 2. toward predictions about the multiple
implications of a given causal hypothesis; and 3. toward improved analysis of
data from quasi-experiments. These more recent developments are general,
whereas the alternative interpretations that must be ruled out in any
cause-probing study are particular to the given study (Cook & Campbell 1986).
In quasi-experimental work, the local setting must be carefully examined to
determine which validity threats are plausible and to identify any such
threats that might be operating. With this caution in mind, we now discuss the
three more recent developments mentioned above.

Designs Emphasizing Point-Specific Causal Hypotheses


In interrupted time series, the same outcome variable is examined over many
time points. If the cause-effect link is quick acting or has a known causal
delay, then an effective treatment should lead to change in the level, slope,
or variance of the time series at the point where treatment occurred. The
test, then, is whether the obtained data show the change in the series at the
pre-specified point. Statistical conclusion validity is a potential problem
because errors in a time series are likely to be autocorrelated, biasing
ordinary least-squares estimates of the standard error and hence statistical
tests. But internal validity is the major problem, especially because of
history (e.g. some other outcome-causing event occurring at the same time as
the treatment) and instrumentation (e.g. a change in record keeping occurring
with the treatment). Fortunately, the point specificity of prediction limits
the viability of most history and instrumentation threats to those occurring
only when the treatment began. Such threats are often easily checked, are
rarer than threats occurring elsewhere during the series, and can sometimes be
ruled out on a priori grounds.
Plausible threats are best ruled out by using additional time series.
Especially important are (a) control group series not expected to show the
hypothesized discontinuity in level, slope, or variability of an outcome; and
(b) additional treatment series to which the same treatment is applied at
different times so we expect the obtained data to recreate the known
differences in when the treatment was made available. During his career,
Campbell has provided many good examples of the use of control time series and
multiple time series with different treatment onset times (e.g. Campbell 1976,
1984, and other papers reproduced in Overman 1988). The design features also
help when a time series is abbreviated, lacking the roughly 50 pretest and 50
posttest data points that characterize formal time series analysis. Extensions
using from 5 to 20 pretest and posttest assessments are common in
single-subject designs (Barlow & Hersen 1984, Kratochwill & Levin 1992).
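The interrupted time series logic can be sketched in a few lines of
simulation (all parameters below are illustrative assumptions): fit an
intercept, a linear trend, and a step at the pre-specified onset, and read the
effect off the step coefficient. Plain OLS is used here, so, per the
autocorrelation caveat above, a real analysis would also model the error
structure (e.g. with ARIMA) before trusting the standard errors.

```python
import numpy as np

def fit_level_shift(seed, n=100, t0=50, shift=3.0, phi=0.5):
    """Simulate an interrupted time series with AR(1) noise and a true level
    shift at the known onset t0, then estimate the shift by segmented OLS."""
    rng = np.random.default_rng(seed)
    noise = np.zeros(n)
    for t in range(1, n):
        noise[t] = phi * noise[t - 1] + rng.normal(0, 1.0)  # autocorrelated errors
    y = 10.0 + 0.05 * np.arange(n) + shift * (np.arange(n) >= t0) + noise

    # Regressors: intercept, linear trend, and a step at the pre-specified point.
    X = np.column_stack([np.ones(n), np.arange(n),
                         (np.arange(n) >= t0).astype(float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[2]  # the estimated discontinuity in level at t0

print(fit_level_shift(seed=7))  # single-series estimate of the 3.0 shift
```

Because the onset is specified in advance, a history threat must mimic a
discontinuity at exactly t0 to be confused with the treatment, which is the
point-specificity advantage described above.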
Despite their strengths for causal hypothesis-testing, interrupted time
series designs are infrequently used. Most use data already in archives
because it is often costly to gather original data over long time periods. But
relevant dependent variables are not always available in archives. Moreover,
the relevant statistical analyses are not familiar to many social scientists,
with ARIMA modeling (Box & Jenkins 1970) and spectral analysis (Granger &
Newbold 1977) being particularly foreign to psychology's ANOVA tradition. But
where the necessary archives exist, researchers have quickly learned these
methods; and analytic strategies continue to develop (Harrop & Velicer 1985,
1990).

In time series research, causal inference depends on effects occurring
shortly after treatment implementation or with a known causal lag. When a
possible effect is delayed, causal inference is less clear--as when job
training affects income years later. Causal inference is also less clear when
an intervention slowly diffuses through a population (e.g. when an AIDS
prevention program is first provided to a small proportion of local
needle-sharing drug users and then slowly extends to others as they are found
and program resources grow). With unpredicted causal lags, time series have an
ambiguous relationship to the point of treatment implementation--a problem
that is compounded whenever treatment implementation is poor or highly
variable. Nonetheless, interrupted time series rightly enjoy a special status
among quasi-experimental designs wherever they are feasible.
Like interrupted time series studies, regression discontinuity studies also
rely on the hypothesis that observations will depart from an established
pattern at a specified point on a continuum. In this case the continuum is not
time, but rather the regression line characterizing the relationship between
outcome and an assignment variable. The regression discontinuity design
depends on units on one side of an eligibility cutoff point being assigned one
treatment, while units on the other side are not so assigned (e.g. income
defines eligibility for Medicaid and grades determine who makes the Dean's
List). More complex assignment strategies are possible using multiple cutoffs
or multiple assignment variables (Trochim 1984). But in all cases, the
assignment variables must be observed (not latent), and adherence to the
cutoff must be strict.

The regression discontinuity design is so named because a regression line
is plotted to relate the assignment and outcome variables. If the treatment is
effective, a discontinuity in the regression line should occur at the cutoff
point. Individuals whose income is just above and just below the eligibility
point for Medicaid should differ in health status if the program is effective.
The point specificity of such a prediction makes regression discontinuity
resemble time series, but it is also like a randomized experiment in that the
assignment of subjects to conditions is completely known. As a result,
Mosteller (1990, p. 225) defines regression discontinuity as a true experiment
and Goldberger (1972a,b) and Rubin (1977) have provided formal statistical
proofs that regression discontinuity provides an unbiased estimate of
treatment effects, just like the randomized experiment.
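A minimal sketch of this logic (simulated data; the variable names, linear
functional form, and effect size are illustrative assumptions, not taken from
any study in the text): units below an income cutoff receive the treatment,
and the effect is estimated as the gap between the fitted lines at the cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assignment variable (standardized "income"), cutoff at 0: units below
# the cutoff receive the treatment, as with Medicaid eligibility.
n, cutoff, effect = 2000, 0.0, 2.0
income = rng.normal(0, 1, n)
treated = (income < cutoff).astype(float)
health = 5.0 + 1.5 * income + effect * treated + rng.normal(0, 1, n)

# Regress outcome on the assignment variable plus a treatment indicator;
# the indicator's coefficient is the discontinuity at the cutoff.
X = np.column_stack([np.ones(n), income, treated])
beta, *_ = np.linalg.lstsq(X, health, rcond=None)
print(beta[2])  # recovers the treatment effect of about 2.0
```

Note that the groups differ sharply in mean income and in mean outcome, yet
the discontinuity estimate is unbiased because the assignment rule is
completely known; this is the seeming paradox discussed below.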
The widespread endorsement of the regression discontinuity design and its
frequent reinvention in different disciplines suggest its usefulness whenever
merit or need determine treatment assignment. But regression discontinuity
studies are unfortunately even rarer than interrupted time series studies.
This is partly because assignment is not always done according to strict
public criteria. Professional judgment and cronyism play some role, as does
the use of multiple criteria, some of which are judgmental. The low use may
also be because the analysis requires accurate modeling of the functional form
of the assignment-outcome relationship. Researchers typically plot linear
relationships between the two variables, but if the underlying relationship is
curvilinear, such modeling will yield inaccurate estimates of the
discontinuity at the cutoff. Since the randomized experiment is not subject to
this problem, Trochim & Cappelleri (1992) argue for combining regression
discontinuity with randomized assignment, using the latter to verify the
functional form at the points most open to doubt. If such a combination is not
feasible, then ad hoc analyses may be needed to model functional form (Trochim
1984) or statistical tests can be used that do not make strong assumptions
about such form (Robbins & Zhang 1988, 1989). However, these last tests are
still in their infancy and should be treated with caution. Finally, regression
discontinuity is less used because it requires 2.5 times as many subjects to
achieve as much power as randomized experiments that use the pretest as a
covariate, although in the absence of the pretest covariate, the power of the
two designs is similar (Goldberger 1972a). In practice, power is further
eroded in regression-discontinuity designs as (a) the cutoff point departs
from the assignment variable mean; (b) cases are assigned in violation of the
cutoff, as with the fuzzy cutoff design discussed by Trochim (1984); or (c)
the units assigned to a treatment fail to receive it. While these problems are
also found with randomized experiments, they are likely to have more serious
practical consequences for regression-discontinuity designs because
researchers using such designs usually have less control over the treatment
assignment implementation processes. Experimenters are therefore correct to
prefer random assignment over regression-discontinuity designs.
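The efficiency loss noted above comes from the collinearity between the
cutoff-based treatment indicator and the assignment variable. A simulation
sketch (illustrative linear model with a normal assignment variable, under
which the variance inflation works out to roughly the factor the text cites)
compares the standard error of the treatment coefficient under the two
assignment rules:

```python
import numpy as np

rng = np.random.default_rng(11)

def se_of_treatment_coef(n, randomized):
    """Standard error of the treatment coefficient in y ~ 1 + x + treat,
    with treat either randomized or set by a cutoff on the covariate x."""
    x = rng.normal(0, 1, n)
    treat = (rng.random(n) < 0.5) if randomized else (x > 0)
    y = 1.0 + x + 2.0 * treat + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x, treat.astype(float)])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = resid @ resid / (n - 3)
    return np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])

se_rct = se_of_treatment_coef(4000, randomized=True)
se_rd = se_of_treatment_coef(4000, randomized=False)
print(se_rd / se_rct)  # well above 1: the cutoff design needs more subjects
```

Both estimates are unbiased; the cutoff design simply pays a precision price
because the indicator and the covariate are highly correlated, so matching the
randomized design's power requires a proportionally larger sample.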
The promise of the regression-discontinuity design is that it can be used
whenever policy dictates that special need or merit should be a prerequisite
for access to the particular services whose effectiveness is to be evaluated.
In this circumstance, the treatment and control group means will almost
certainly differ at the pretest, seeming to create a fatal selection confound.
But this is not the case with regression-discontinuity designs because the
effect estimate is not the difference between raw posttest means. It is the
size of the projected discontinuity at the cutoff, which is unaffected by the
correlation between the assignment and outcome variables. The design is also
flexible in that many variables other than quantified need or merit can be
used for treatment assignment, including the order with which individuals
apply to be in a social program or the year they were born (Cain 1975).
Despite the design's flexibility and impeccable logic, many practicing
researchers are still skeptical. It seems implausible to them that a
cutoff-based assignment process that necessarily creates a selection
difference can rule out selection! Yet this is the case. Realizing that the
major impediments to greater use of this design are issues of its
implementability and persuasiveness in ruling out selection, Trochim &
Cappelleri (1992) recently have described several variations of the design
that are particularly useful for practitioners and that create the groundwork
for increased use of this design in the medical context from which their
examples come.

Designs Emphasizing Multivariate-Complex Causal Predictions

Successful prediction of a complex pattern of multivariate results often
leaves few plausible alternative explanations. The design elements to be
combined for such prediction include (a) nonequivalent dependent variables,
only a subset of which is theoretically responsive to a treatment, though all
the dependent variables are responsive to the other plausible alternative
explanations of an outcome change; (b) designs where a treatment is
introduced, removed, and reintroduced to the same group; (c) nonequivalent
group designs that have two or more pretest measurement waves providing a
pre-intervention estimate of differences in rates of change between
nonequivalent groups; (d) nonequivalent group designs with multiple comparison
groups, some of which initially outperform the treatment group and some of
which underperform it (Holland 1986); (e) cohort designs that use naturally
occurring cycles in families or educational institutions to construct control
groups of siblings or next year's freshmen that are initially less different
than controls constructed in almost any other way; (f) other designs that
match units on demonstrably stable attributes to reduce initial group
nonequivalence without causing statistical regression (Holland 1986); and (g)
designs that partition respondents or settings--even after the fact--to create
subgroups that differ in treatment exposure levels and so in expected effect
size.
These design features are often discussed separately. But the better strategy
is to combine many of them in a single research project so as to increase the
number and specificity of the testable implications of a causal hypothesis.
Cook & Campbell (1979) give two examples. One example is an interrupted
time series (a) to which a no-treatment control series was added; (b) where the
intervention was later given to controls; and (c) where the original treatment
series was partitioned into two nonequivalent series, only one of which the
treatment should theoretically have influenced. A second example is a regression-discontinuity
study of Medicaid's effects on physician visits. Medicaid
eligibility depends on household income and family size. The regression-dis-
continuity design related income (the assignment variable) to the number of
physician visits after Medicaid was passed, and it supplemented this with data
on income and physician visits from the prior year. A discontinuity in the
number of physician visits occurred at the cutoff point in the year after Medicaid
was introduced but not in the year before.
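The arithmetic of such a regression-discontinuity estimate can be sketched with simulated data (all numbers, variable names, and the cutoff below are invented for illustration, not figures from the Medicaid study): regressing the outcome on the centered assignment variable, an eligibility indicator, and their interaction, the coefficient on eligibility estimates the jump at the cutoff.

```python
# Regression-discontinuity sketch on simulated (hypothetical) data.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
income = rng.uniform(0, 40, n)           # assignment variable ($ thousands)
cutoff = 15.0                             # hypothetical eligibility cutoff
eligible = (income < cutoff).astype(float)
# visits decline with income; eligibility adds a jump of 1.5 visits
visits = 4 - 0.05 * income + 1.5 * eligible + rng.normal(0, 0.5, n)

# Center income at the cutoff; the eligibility coefficient is then the
# size of the discontinuity at the cutoff point.
x = income - cutoff
X = np.column_stack([np.ones(n), x, eligible, x * eligible])
beta, *_ = np.linalg.lstsq(X, visits, rcond=None)
print(round(beta[2], 2))   # close to the simulated jump of 1.5
```

The interaction term lets the regression slope differ on the two sides of the cutoff, which is what makes the jump estimate local to the cutoff point.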
Meehl (1978) has lamented the social sciences' dependence on null hypothesis
testing, arguing that with large enough samples any null hypothesis can be
rejected under some conditions. He counsels testing exact numerical predictions.
Among social scientists, his advice has largely fallen on deaf ears,
presumably because, outside of some areas within economics, so few theories
are specific enough to make point predictions. The quasi-experimental emphasis
on point-specific and multivariate-complex causal predictions approaches
the spirit of Meehl's counsel, but substitutes specificity of time-point predictions
and pattern matching for numerical point predictions. Both point-specific
and multivariate-complex predictions preserve the link to falsification because
the pattern of obtained relationships facilitates causal inference only to the
extent that no other theory makes the same prediction about the pattern of the
outcome data. Point-specific and multivariate-complex predictions lower the
chance that plausible alternatives will make exactly the same prediction as the
causal hypothesis under test; but they cannot guarantee this.

Statistical Analysis of Data from the Basic Nonequivalent Control Group Design
Despite the growing advocacy of designs making point-specific or multivariate-complex
predictions (Cook 1991a), the most frequently employed quasi-experiment
still involves only two (nonequivalent) groups and two measurement
waves, one a pretest and the other a posttest measured on the same
instrument. This design is superficially like the simplest randomized experiment
and is easy to implement, perhaps explaining its great popularity. However,
causal inferences are not easy to justify from this design.

Interpretation depends on identifying and ruling out all the validity threats
that are plausible in the specific local context where a treatment has been
implemented. But many researchers fail to consider context-specific threats,
relying instead on Campbell & Stanley's (1963) checklist of validity threats
when this design is used, though such threats are general and may fail to reflect
unique local threats. Unfortunately, causal inferences from the design have
proven far more vulnerable to model misspecification than was originally
hoped. In principle, a well-specified model requires complete knowledge and
perfect measurement either (a) of the selection process by which respondents
end up in different treatment groups; or (b) of all causes of the outcome that
are correlated with participation in treatment. The hope was that such knowledge
might become available or that approximations would allow us to get
quite close to the correct answer. But few social researchers who study the
matter closely now believe they can identify the completely specified model,
and the 1980s and early 1990s were characterized by growing disillusionment
about developing even adequate approximations. Nearly any observed effect
might be either nullified or even reversed in the population, depending, for
example, on the variables omitted from the model (whose importance one
could never know for certain) or on the pattern of unreliability of measurement
across the variables in the analysis. As a result, it proved difficult to anticipate
with certainty the direction of bias resulting from the simpler types of nonequivalent
control group studies. But there has been much work on data
analysis for this design (see Moffitt 1991, Campbell 1993).


SUGGESTIONS FROM WITHIN THE CAMPBELL TRADITION Most of the scholars
writing about the analysis of quasi-experimental data from within the Campbell
tradition are reluctant apologists for their work, preferring randomized
experiments but realizing that they are not always possible. Moreover, even if
such experiments are implemented initially, treatment-correlated attrition often
occurs, leaving the data to be analyzed as though from a quasi-experiment.
Initially, the obvious solution seemed to be to measure pretest differences
between the groups under contrast and then to use simple analysis of covariance
to adjust away these differences (Reichardt 1979). But two problems arise here.
First, unreliability in the covariates leads to biased estimates of treatment effect
(Lord 1960), so that a preference emerged for using latent variable models in
which multiple observed measures of each construct are analyzed, thereby
reducing measurement error and permitting an analysis only of the shared
variance (Magidson 1977, Magidson & Sorbom 1982; but for a dissenting
voice, see Cohen et al 1990). However, this does not deal with the second and
more serious problem: that of validly specifying either the selection process or
the outcome model. For this, the recommendation has been that investigators
must rely on their best common sense and the available research literature in
the hope of including in the analysis all likely predictors of group membership
(the selection model) or outcome (the outcome model), or both (although either
by itself is sufficient). Since the latent variable tradition offers no guarantee
that all group differences have been statistically adjusted for, members of the
theory group around Campbell turned instead to strengthening causal inference
by design rather than statistical analysis (e.g. Cook 1991a), and by conducting
multiple data analyses under different plausible assumptions about the direction
and nature of selection bias instead of relying on a single analysis (Reichardt
& Gollub 1986, Rindskopf 1986, Shadish et al 1986). The one perfect analysis
remained an impossible dream, given only two nonequivalent treatment groups
and two measurement waves.
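Lord's (1960) point about fallible covariates can be illustrated with a small simulation of our own (all numbers hypothetical, not data from any study cited here): when selection operates on the true pretest but the analysis adjusts for an error-laden measurement of it, covariance adjustment under-corrects, and a null treatment effect looks real.

```python
# Simulation: covariance adjustment with an unreliable pretest covariate.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
true_pre = rng.normal(0, 1, n)                               # true pretest
group = (true_pre + rng.normal(0, 1, n) > 0).astype(float)   # selection on pretest
post = true_pre + rng.normal(0, 0.5, n)                      # NO treatment effect
obs_pre = true_pre + rng.normal(0, 1, n)                     # reliability ~ .5

def ancova_effect(covariate):
    # group coefficient from the covariance-adjusted comparison
    X = np.column_stack([np.ones(n), covariate, group])
    return np.linalg.lstsq(X, post, rcond=None)[0][2]

effect_true = ancova_effect(true_pre)   # near 0: perfect covariate removes bias
effect_obs = ancova_effect(obs_pre)     # clearly positive: spurious "effect"
print(round(effect_true, 2), round(effect_obs, 2))
```

With the perfectly measured pretest the group coefficient is essentially zero; with the fallible pretest a sizable spurious effect remains, which is exactly the bias that motivated the turn to latent variable models.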
SUGGESTIONS FROM LABOR ECONOMISTS At the same time, some labor economists
developed an alternative tradition that sought to create statistical models
that do not require full model specification and that would apply to all nonexperimental
research where respondents are divided into levels on some independent
variable. The best developed of this class of analyses is the instrumental
variable approach associated with Heckman (Heckman 1980, Heckman et al
1987, Heckman & Hotz 1989). The probability of group membership is first
modeled in a selection equation, with the predictors being a subset of variables
that are presumably related to selection into the different treatment conditions.
When this equation correctly predicts group membership, it is assumed that a
correct selection model has been identified. The results of this model are then
used as an instrumental variable in the main analysis to estimate treatment
effects by adjusting for the effects of selection. Conceptually and analytically,
this approach closely resembles the regression-discontinuity design because
that design is based on full knowledge and perfect measurement of the selection
model. However, in approaches like Heckman's, the selection model is not fully
known. It is estimated on the basis of fallible observed measures that serve to
locate more general constructs. This limitation may explain why empirical
probes of Heckman's theory have not produced confirmatory results. The most
compelling critiques come from studies where treatment-effect estimates derived
from Heckman's selection modeling procedures are directly compared to those
from randomized experiments, presumed to be the gold standard against which
the success of statistical adjustment procedures should be assessed. In a
reanalysis of annual earnings data from a randomized experiment on job
training, both LaLonde (1986; LaLonde & Maynard 1987) and Fraker &
Maynard (1987) have shown that (a) econometric adjustments from Heckman's
work provide many different estimates of the training program's effects; and
(b) none of these estimates closely coincides with the estimate from randomized
controls. These reanalyses have led some labor economists (e.g. Ashenfelter &
Card 1985, Betsey et al 1985) to suggest that unbiased estimates can only be
justified from randomized experiments, thereby undermining Heckman's
decade-long work on adjustments for selection bias. Other labor economists do
not go quite so far and prefer what they call natural experiments (what Campbell
calls quasi-experiments) over the non-experimental data labor economists
previously used so readily (e.g. Card 1990, Katz & Krueger 1992, Meyer &
Katz 1990; and see especially Meyer 1990 for an example that reinvents
regression-discontinuity).
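The two-step logic can be made concrete with a minimal sketch on simulated data (our own toy numbers, not the job-training data at issue in these critiques): fit a probit selection equation, form the inverse Mills ratio from its predictions, and include that ratio as a control function in the outcome regression.

```python
# Heckman-style two-step selection adjustment, sketched on simulated data.
import math
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(0, 1, n)                 # predictor of selection
v = rng.normal(0, 1, n)                 # selection-equation error
u = 0.8 * v + rng.normal(0, 0.6, n)     # outcome error, correlated with v
d = (0.5 + z + v > 0).astype(float)     # self-selection into treatment
y = 1.0 + 2.0 * d + u                   # true treatment effect = 2.0

phi = lambda s: np.exp(-s**2 / 2) / math.sqrt(2 * math.pi)
Phi = lambda s: 0.5 * (1 + np.vectorize(math.erf)(s / math.sqrt(2)))

# Step 1: probit of d on z by Fisher-scoring Newton iterations.
Z = np.column_stack([np.ones(n), z])
g = np.zeros(2)
for _ in range(25):
    s = Z @ g
    p = np.clip(Phi(s), 1e-9, 1 - 1e-9)
    grad = Z.T @ ((d - p) * phi(s) / (p * (1 - p)))
    info = (Z * (phi(s)**2 / (p * (1 - p)))[:, None]).T @ Z
    g = g + np.linalg.solve(info, grad)

# Step 2: inverse Mills ratio as a control function in the outcome model.
s = Z @ g
lam = np.where(d == 1, phi(s) / Phi(s), -phi(s) / (1 - Phi(s)))
X = np.column_stack([np.ones(n), d, lam])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

naive, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), d]), y, rcond=None)
print(round(naive[1], 2), round(beta[1], 2))
```

In this simulation the selection model is correctly specified by construction, so the control-function estimate recovers the true effect where the naive comparison is badly biased; the critiques reviewed above concern precisely the cases where that specification cannot be guaranteed.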
In Heckman & Hotz's (1989; Heckman et al 1987) rejoinder to their critics,
they used the same job training data as LaLonde to argue that a particular
selection model, based on two separate pretreatment measures of annual
income, met certain specification tests and generated an average causal effect
no different from the estimate provided by the randomized experiment. Unfortunately,
their demonstration is not very convincing (Cook 1991a, Coyle et al
1991). First, there is a problem of model generality, since the two-wave selection
process that fit for youths who were just entering the labor market did not
fit for mothers enrolled in the Aid to Families with Dependent Children
(AFDC) program who were returning to the job market. Heckman invoked a
simpler cross-sectional model to describe the selection process for AFDC
mothers, but this model could not be subjected to restriction tests and was
assumed to be correct by fiat. Second, close inspection of the data Heckman
presented reveals that the econometric analyses yielded such large standard
errors that finding no difference between the two methods reflects a lack of
statistical power in the econometric approach. Finally, the procedure that
Heckman & Hotz found to be better for their one population had been widely
advocated for at least a decade as a superior quasi-experimental design because
the two pretest waves allow estimation of the preintervention selection-maturation
differences between groups (Cook & Campbell 1979). Moffitt's
(1991) rediscovery of the double-pretest design should be seen in the same
historical light; it exemplifies the quasi-experimental slogan: better to control
through design than through measurement and statistical adjustment. To conclude,
empirical results indicate that selection models usually produce different effect
estimates than do randomized experiments, and if researchers using such models
were close to the mark in a particular project, they would not know this
unless there were also a randomized experiment on the same topic or some
better quasi-experiment. But then the need for the less interpretable selection
modeling would not exist!
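The double-pretest logic is simple enough to show with hypothetical group means (all numbers invented for illustration): the difference in pre-period growth rates estimates the selection-maturation bias, which can then be removed from the ordinary difference-in-gains estimate.

```python
# Double-pretest sketch with hypothetical means at three measurement waves.
pre1 = {"treat": 50.0, "ctrl": 55.0}   # first pretest
pre2 = {"treat": 52.0, "ctrl": 58.0}   # second pretest
post = {"treat": 60.0, "ctrl": 61.0}   # posttest

# Per-wave maturation difference, estimated before any treatment is given
# (here controls gain 3 points per wave, treated cases only 2).
drift = (pre2["ctrl"] - pre1["ctrl"]) - (pre2["treat"] - pre1["treat"])

# The simple difference in gains from pretest 2 to posttest...
diff_in_gains = (post["treat"] - pre2["treat"]) - (post["ctrl"] - pre2["ctrl"])
# ...still contains one more wave of differential maturation; restore it.
adjusted_effect = diff_in_gains + drift
print(diff_in_gains, adjusted_effect)   # 5.0 uncorrected, 6.0 corrected
```

A single pretest would leave the per-wave drift completely unidentified, which is why the second pretest wave strengthens the design rather than the statistical adjustment.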


SUGGESTIONS FROM STATISTICIANS Historically, statisticians have preferred
not to deal with the problems arising from quasi-experiments (which they call
observational studies). They almost monolithically advocate randomized
designs unless prior information is so reliable and comprehensive that Bayes'
theorem is obviously relevant. This situation changed a little in the 1980s, and
a flurry of publications appeared guided by the realization that quasi-experiments
were rampant and might be improved upon even if never perfected
(Holland 1986; Rosenbaum 1984; Rosenbaum & Rubin 1983, 1984).

The statisticians provided four major suggestions for improving the analysis
of quasi-experimental data. The first emphasized the need for selecting
study units matched on stable attributes, not as a substitute for random assignment,
but as one of a series of palliatives that reduce bias (Rubin 1986,
Rosenbaum 1984). The second was a call to compute a propensity score, an
empirical estimate of the selection process based, to the extent possible, on
actual observation of which units enter the various treatment groups being
contrasted (Rosenbaum & Rubin 1983, 1984). The intent is to generate the
most accurate and complete description of the selection process possible rather
than to find a proxy for the entire selection process. The third improvement the
statisticians counseled was the use of multiple comparison groups to increase
statistical power; to rule out construct validity threats through the use of
placebo controls, for example; and to vary the direction of pretest group
differences and hence the assumptions made about growth-rate differences
between groups. The statisticians want to avoid having a single comparison
group that outperforms or underperforms a treatment group before the treatment
and might be changing over time at a different rate from the treatment
group. The final suggestion was that multiple data analyses should be conducted
under different explicit and plausible assumptions about factors that
might influence posttest performance in one group more than another. Clearly,
the statisticians see quasi-experimental analysis as involving a more complicated,
thoughtful, theory-dependent process of argument construction than is
the case when analyzing data from randomized experiments.
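The propensity-score suggestion can be sketched on simulated data (a hypothetical example of ours, in the spirit of Rosenbaum & Rubin's subclassification on estimated propensity scores): estimate each unit's probability of entering treatment, stratify on that probability, and average the within-stratum treated-control contrasts.

```python
# Propensity-score subclassification sketch on simulated data.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.normal(0, 1, n)                                  # selection variable
d = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-1.5 * x))).astype(float)
y = 0.5 + 1.0 * d + 2.0 * x + rng.normal(0, 1, n)        # true effect = 1.0

# Estimate the propensity score by logistic regression (Newton-Raphson).
X = np.column_stack([np.ones(n), x])
b = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ b))
    b = b + np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (d - p))
ps = 1 / (1 + np.exp(-X @ b))

# Stratify on propensity-score quintiles; average within-stratum contrasts.
stratum = np.digitize(ps, np.quantile(ps, [0.2, 0.4, 0.6, 0.8]))
effects, weights = [], []
for k in range(5):
    m = stratum == k
    if d[m].sum() and (1 - d[m]).sum():
        effects.append(y[m][d[m] == 1].mean() - y[m][d[m] == 0].mean())
        weights.append(m.sum())
adjusted = np.average(effects, weights=weights)
naive = y[d == 1].mean() - y[d == 0].mean()
print(round(naive, 2), round(adjusted, 2))
```

The unadjusted contrast is far from the simulated effect of 1.0, while the subclassified estimate comes much closer; residual within-stratum imbalance is why the statisticians treat this as one palliative among several rather than a cure.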
It is interesting to note some convergences across all the different traditions
of dealing with group nonequivalence. One is the advocacy of directly observing
the selection process so as to have a more fully specified and more accurate
model. Confidence in instrumental variables or single pretest measures as
adequate proxies is waning. A second convergence is the advocacy of multiple
control groups, especially to make heterogeneous the direction of pretest bias.
A third is the advisability of using several pretreatment measurement waves to
observe how the different groups might be changing spontaneously over time.
A fourth is the value of using matching to minimize initial group noncomparability,
provided that the matching will not lead to statistical regression.
The final point of convergence is that multiple data analyses should take place
under very different (and explicit) assumptions about the direction of bias. The
aim is to test the robustness of results across different plausible assumption
sets. We see here the call for an honest critical multiplism (Cook 1985,
Shadish 1989); a lack of confidence in the basic two-group, two-wave quasi-experimental
design that is so often used in social experiments; and the need
for more specific causal hypotheses, supplementing the basic design with more
measurement waves and more control groups.
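The last point of convergence can be made concrete with a deliberately simple sketch (all numbers hypothetical): recompute the effect estimate under an explicit range of assumed selection biases and check whether the substantive conclusion survives across the whole range.

```python
# Sensitivity bracketing: one estimate, several explicit bias assumptions.
observed_gain_difference = 4.0   # hypothetical treatment-minus-control gain

# Positive bias values mean selection spuriously favored the treated group.
estimates = {bias: observed_gain_difference - bias
             for bias in (-2.0, -1.0, 0.0, 1.0, 2.0)}
robust = all(effect > 0 for effect in estimates.values())
print(estimates, robust)
```

If the adjusted estimate stays positive under every bias value anyone finds plausible, the causal claim is robust to that range of assumptions; if not, the analysis has at least made explicit which assumption the conclusion depends on.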
During the years we are examining, the use of empirical research on experiments
has been increasing. From the beginning, Campbell (1957) argued that
his lists of validity threats could be shortened or lengthened as experience
accumulated about the viability of particular validity threats or the necessity to
postulate new ones. Today we see even more focused research on design
issues, as with the labor economists who contrast the results from randomized
experiments and quasi-experiments. The use of empirical research to improve
design is also evident in meta-analysis, particularly as it explores how methodological
decisions influence effect size estimates. For example, do published
studies yield different estimates from unpublished studies (Shadish et al
1989)? Does the presence of pretest measurement change effect size estimates
(Willson & Putnam 1982)? Do randomized experiments yield different estimates
than quasi-experiments in general, or than certain types of quasi-experiments
in particular (Heinsman 1993)? We expect explicit meta-analytic studies
of design questions to become even more prevalent in the future.


Cronbach's (1982) UTOS formulation suggests that experimenters often hope
that the causal inferences they claim (a) will generalize to specific populations
of units and settings and to target cause and effect constructs; and (b) can be
extrapolated to times, persons, settings, causes, and effects that are manifestly
different from those sampled in the existing research. Campbell's (1957) work
on external validity describes a similar but less articulated aspiration. But such
generalization is not easy. For compelling logistical reasons, nearly all experiments
are conducted with local samples of convenience in places of convenience
and involve only one or two operationalizations of a treatment, though
there may be more operationalizations of an outcome. How are Cronbach's
two types of generalization to be promoted if most social experiments are so
contextually specific?

Random Sampling

The classic answer to this question requires randomly sampling units from the
population of persons, settings, times, treatments, or outcomes to which generalization
is desired. The resulting samples provide unbiased estimates of the
population parameters within known sampling limits, thereby solving
Cronbach's first causal generalization problem. Unfortunately, this solution is
rarely applicable. In single experiments it is sometimes possible to draw a formal
probability sample of units and even settings. However, it is almost impossible
to draw probability samples of treatments or outcomes because no enumeration
of the population can be achieved, given that the treatment and outcome classes
can seldom be fully defined. Occasionally, an enumeration of settings or units
is possible, as with studies intending to generalize to elementary schoolchildren
in a particular city, or to all community mental health centers in the nation. But
for dispersed populations, there are enormous practical difficulties with drawing
the appropriate sample in such a way that the treatment can be implemented with
high quality. Even if these difficulties could be surmounted, the experiment
would usually still be restricted to those individuals who agreed to be randomly
assigned, surely only a subpopulation of all those to whom generalization is
sought. Moreover, attrition is likely after an experiment has begun, making the
achieved population even less similar to the target population. For all these
reasons, formal probability sampling methods seem unlikely to provide a
general solution to Cronbach's first framing of the causal generalization problem,
though they are desirable whenever feasible.

Proximal Similarity

Campbell's solution to Cronbach's first generalization problem is implicit in his
tentative relabeling of external validity as proximal similarity, by which he
means that "as scientists we generalize with most confidence to applications
most similar to the setting of the original research" (Campbell 1986, p. 75). He
believes that the more similar an experiment's observables are to the setting to
which generalization is desired, the less likely it is that background variables in
the setting of desired application will modify the cause-effect relationship
obtained in the original experiment. Cronbach appears to agree with this. But
applying this rationale requires knowledge of which attributes must be similar,
and this presumably depends on good theory, validated clinical experience, or
prior relevant experimental results. Alas, good substantive theory is not always
available, intuitive judgments can be fickle, and prior findings are often lacking.
What can be done in such cases?
St. Pierre & Cook (1984) suggest two related purposive sampling options
that are useful in planning generalizable cause-probing experiments. In modal
instance sampling, the researcher uses background information or pilot work
to select experimental particulars that proximally resemble the kinds of unit,
setting, treatment, and outcome variants that are most commonly found in the
settings to which generalization is sought. The strategy here is to capture the
most representative instances (Brunswik 1955). In sampling to maximize
heterogeneity, the aim is to sample as widely as possible along dimensions that
theory, practice, or other forms of speculation suggest might codetermine the
efficacy of a treatment. If a treatment is robustly effective at widely differing
points on such dimensions, then generalization over them is facilitated. If the
treatment is not general in its effects, then the researcher can conduct subgroup
analyses to identify some of the specific boundary conditions under which the
treatment is effective. Since power is potentially a problem when examining
statistical interactions, St. Pierre & Cook (1984) also suggest probing whether
a cause-effect relationship can be demonstrated across whatever heterogeneity
is deliberately allowed into the sampling plan. Purposive sampling either to
increase heterogeneity or to capture modal instances does not justify generalization
as well as random sampling does, but it should increase insights into
how broadly a causal relationship can be generalized.

Probes for Robust Replication over Multiple Experiments

A collection of experiments on the same or related topics can allow even better
probes of the causal contingencies influencing treatment effectiveness. This is
because each experiment is likely to involve a different population of persons
and settings, a different time, and unique ways of conceptualizing and measuring
the cause and effect. A search can then take place to identify which causal
relationships are robust across the obtained variation in all these dimensions and
also to identify those contexts that limit generalization because no cause-effect
relationship can be found in them. Meta-analysis is the best-known exemplar of
this approach to causal generalization (Cook 1991b, Cook et al 1992).
Consider meta-analytic work on the effects of psychotherapy. This suggests
that effect sizes do not differ across such seemingly crucial dimensions as
therapist training and experience, treatment orientation, and duration of treatment
(Smith et al 1980, Berman & Norton 1985). However, effect sizes do
differ by outcome characteristics, with larger effects emerging for outcomes
that receive the most attention in treatment (Shadish et al 1993), suggesting
one restriction to the generality of psychotherapy effects. Empirical efforts to
probe the robustness of a cause-effect relationship require the pool of available
studies to vary on factors likely to moderate the relationship. But sometimes
there is little or no variability. For example, in the psychotherapy literature there
are very few experimental studies of psychoanalytic orientations, so we still do
not know if the general finding about therapist orientation making little difference
to client outcomes also applies to psychoanalysis as a particular form of
therapy. Still, with the greater range and heterogeneity of sampling that literature
reviews promote, it is often possible to probe whether a causal relationship is
reproduced in contexts that are proximally similar to particular, defined contexts
of intended application. This speaks to Cronbach's first framing of causal
generalization. Important for his second framing is that when a causal relationship
demonstrably holds over a wide array of different contexts (e.g. Devine
1992), it seems reasonable to presume that it will continue to hold in future
contexts that are different from past ones. The warrant for such extrapolation is
logically flawed; nonetheless, it is pragmatically superior to the warrant for
extrapolation when a relationship has only been tested within a narrow range
of persons, settings, times, causal manipulations, and effect measures.
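The kind of moderator probe described above can be sketched with a handful of invented study summaries (every number below is fabricated for illustration): compute a standardized mean difference per study and compare subgroup averages on a moderator such as treatment orientation.

```python
# Meta-analytic moderator sketch with fabricated study summaries.
import numpy as np

# (mean_t, mean_c, pooled_sd, n_t, n_c, orientation) -- invented numbers
studies = [
    (12.0, 10.0, 4.0, 40, 40, "behavioral"),
    (8.5,  7.0,  3.0, 30, 30, "behavioral"),
    (15.0, 12.5, 5.0, 50, 50, "behavioral"),
    (9.0,  7.8,  2.5, 25, 25, "dynamic"),
    (11.0, 9.0,  4.5, 35, 35, "dynamic"),
    (7.5,  6.2,  3.0, 45, 45, "dynamic"),
]

def smd(mt, mc, sd, nt, nc, _):
    d = (mt - mc) / sd                        # standardized mean difference
    return d * (1 - 3 / (4 * (nt + nc) - 9))  # Hedges small-sample correction

means = {}
for label in ("behavioral", "dynamic"):
    gs = [smd(*s) for s in studies if s[-1] == label]
    means[label] = float(np.mean(gs))
    print(label, round(means[label], 2))
```

Similar subgroup means echo the "orientation makes little difference" pattern; a clear gap between them would instead flag a boundary condition on generalization. (A real synthesis would also weight studies by precision and test the subgroup contrast formally.)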

Causal Explanation

We cannot expect all relevant causal contingencies to be represented in the body
of studies available for synthesis. Hence, we need other methods for extrapolating
a causal connection to unstudied populations and classes. Cronbach believes
that causal explanation is the key to such generalization, and that causal
explanation requires identifying the processes that follow because a treatment
has varied and without which the effect would not have been produced. Identifying
such micro-mediating processes usually involves decomposing the molar
cause and effect constructs into the subsets of components thought to be most
critically involved in the molar causal link and then examining all micro-mediating
processes that might link the causally efficacious treatment components
to the causally impacted outcome components.
The presumption is that once such knowledge has been gained, the crucial
causal processes can be transferred to novel contexts of application. Cook &
Campbell (1979) use the example of a light switch to illustrate this. Knowledge
that flicking the switch results in light is the type of descriptive knowledge
about manipulanda that experiments promote; more explanatory knowledge
requires knowing about switch mechanisms, wiring and circuitry, the
nature of electricity, and how all these elements can be combined to produce
light. Knowing so much increases the chances of creating light in circumstances
where there are no light switches, providing one can reproduce the
causal explanatory processes that make light in whatever ways local resources
allow. One need not be restricted to what can be bought in a hardware store or
to what was available when one first learned about electricity. Another example
concerns the murder of Kitty Genovese, who was stabbed repeatedly for
over half an hour, with nearly forty neighbors watching and not offering to
help. Latane & Darley (1970) hypothesized that this apathy resulted from a
process they called the diffusion of responsibility: everyone thought someone
else would surely help. Latane & Darley experimentally demonstrated that this
process was common to a host of other situations as diverse as when a woman
falls and sprains her ankle, smoke comes into a room from a fire, someone has
an epileptic seizure, or a cash register is robbed (Brown 1986). Once micro-mediating
processes are identified, they are often broadly transferable, certainly
more so than if the research had shown that helping is reduced only
following a stabbing and then only in New Jersey and only with apartment
dwellers.
Identifying causal explanatory processes is not easy. Randomized experiments
were designed to provide information about descriptive causal connections
and not about processes accounting for these connections. Hence, the
methods for exploring causal mediation must be added to experimental frameworks.
Within quantitative research traditions, potential mediators can be assessed
and used in some form of causal modeling. Alternatively, respondents
might be assigned to treatments that vary in the availability of a hypothesized
explanatory mechanism. However, experience with complex randomized experiments
has not been salutary in field research. The Negative Income Tax
experiments are among the most complex ever designed with respect to the
number of factors examined, the number of levels on each factor, and the
unbalanced distribution of respondents across factors and levels (Kershaw &
Fair 1976). Preserving all these distinctions in field settings has proven quite
difficult, though some fairly complex planned variation studies have been
carried out successfully (Connell et al 1985). Cronbach (1982) has proposed that
qualitative methods of the ethnographer, historian, or journalist should also be
used to generate and defend hypotheses about micro-mediating processes.
Although we personally have a lot of sympathy for this qualitative approach, it
is likely to fall on deaf ears in the social science community at large.
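The "assess potential mediators and use them in some form of causal modeling" option can be sketched with simulated data (our own toy example, not any study cited here): regress the mediator on treatment, then the outcome on both, and read the product of paths as the indirect effect.

```python
# Simple mediation sketch: treatment -> mediator -> outcome, simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 3000
t = rng.integers(0, 2, n).astype(float)      # randomized treatment
m = 0.8 * t + rng.normal(0, 1, n)            # hypothesized mediating process
y = 0.5 * m + 0.2 * t + rng.normal(0, 1, n)  # outcome: mostly via the mediator

def ols(cols, dep):
    X = np.column_stack([np.ones(len(dep))] + cols)
    return np.linalg.lstsq(X, dep, rcond=None)[0]

a = ols([t], m)[1]          # treatment -> mediator path
coefs = ols([m, t], y)      # outcome on mediator and treatment
b, direct = coefs[1], coefs[2]
print(round(a * b, 2), round(direct, 2))
```

Because treatment is randomized here, the a path is unconfounded; the b path, however, rests on the untestable assumption of no mediator-outcome confounding, which is one reason such analyses supplement rather than replace the experimental logic.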


Field experiments are now much more commonplace than they were twenty
years ago. The pendulum has swung to favor randomized experiments over
quasi-experiments, and strongly so in all areas except perhaps intervention
research in education. Considerable informal information is now available
about the many factors that facilitate the implementation of random assignment
and about factors that promote better randomized studies. As a result,
fields like labor economics and community health promotion that fifteen years
ago were characterized by the use of statistical adjustments to facilitate causal
inference now use more (and larger) experiments and fewer quasi-experiments.
Though quasi-experiments have lost some of their warrant, some progress
has been made toward improving them. More attention has been directed to
two designs that promote point-specific causal inferences: the regression-discontinuity
and interrupted time series designs (Marcantonio & Cook 1994).
Also more often advocated are somewhat elaborate quasi-experimental designs
that predict a complex pattern of results emanating from the causal agent
under analysis. At issue here are more pretest measurement waves, more
control groups, better matched controls (including cohorts), and removing and
reinstating the treatment at known time points (Cook et al 1990).
Much work has also gone into the statistical analysis of data from quasi-experiments
in order to model selection biases. There are several different traditions
that, unfortunately, rarely communicate with each other (see Campbell
1993). This is sad because they are growing ever closer together. None of
these traditions is sanguine about finding the elusive, demonstrably valid counterfactual
baseline unless the selection model is completely known, as with
randomized experiments or the regression-discontinuity design. All traditions
agree, though, that causal inference is better warranted the more closely the
comparison groups are matched, the more information there is to describe the
selection process, the more comparison groups there are (particularly if they
bracket the treatment group mean at the pretest), and the more robust the
results generated from multiple statistical analyses made under different plausible
assumptions about the nature of selection bias.

But selection is not the only inferential problem researchers need to worry
about. Campbell's (1957) internal validity threats of testing, instrumentation,
and history are often relevant, as are Cook & Campbell's (1979) threats of
treatment diffusion, resentful demoralization, compensatory rivalry, and compensatory
equalization. Statistical research on selection modeling has been
dominant in the past, and rightly so. But it should not crowd out research into
these other relevant threats to valid causal inference.
Work on the generalization of causal relationships has historically been less
prevalent than work on establishing causal relationships. There is ongoing
work on the generalization of causal inferences in two modes. One is meta-analytic
and emphasizes the empirical robustness of a particular causal connection
across a wide range of persons, settings, times, and cause and effect
constructs. The second is more causal-explanatory, and it emphasizes identifying
the micro-mediating processes that causally connect a treatment to an
outcome, usually through a process of theoretical specification, measurement,
and data analysis rather than through the sampling strategy that characterizes
meta-analysis. Causal generalization is now more of an issue in social experiments
than it was fifteen years ago (Cook 1991a, 1993; Friedlander & Guerin

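The meta-analytic mode of generalization can be sketched with a minimal pooling calculation, assuming hypothetical effect sizes (the five studies, their standardized mean differences, and their variances below are invented for illustration). A fixed-effect inverse-variance weighted mean is the simplest way to ask whether one causal connection holds up across studies.

```python
import math

# Minimal fixed-effect meta-analysis sketch (hypothetical data): pool one
# causal connection's effect sizes across studies, weighting each study
# by the inverse of its sampling variance.
def pool_fixed_effect(effects, variances):
    """Inverse-variance weighted mean effect and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# Hypothetical standardized mean differences from five studies.
d = [0.42, 0.55, 0.30, 0.61, 0.48]
v = [0.020, 0.015, 0.030, 0.025, 0.010]

pooled, se = pool_fixed_effect(d, v)
print(f"pooled d = {pooled:.2f} (SE = {se:.2f})")
```

Robustness of the pooled effect across subgroups of persons, settings, times, and measures, rather than the pooled number itself, is what carries the generalization argument.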
Literature Cited

Aiken LS, West SG. 1991. Multiple Regression: Testing and Interpreting Interactions. Newbury Park, CA: Sage
Ashenfelter O, Card D. 1985. Using the longitudinal structure of earnings to estimate the effect of training programs. Rev. Econ. Stat. 67:648-60
Barlow DH, Hersen M, eds. 1984. Single Case Experimental Designs: Strategies for Studying Behavior Change. New York: Pergamon. 2nd ed.
Berman JS, Norton NC. 1985. Does professional training make a therapist more effective? Psychol. Bull. 98:401-7
Betsey CL, Hollister RE, Papageorgiou MR, eds. 1985. Youth Employment and Training Programs: The YEDPA Years. Washington, DC: Natl. Acad. Press
Bickman L. 1987. Functions of program theory. In Using Program Theory in Evaluation: New Directions for Program Evaluation, ed. L Bickman, 33:5-18. San Francisco: Jossey-Bass
Bickman L, ed. 1990. Advances in Program Theory: New Directions for Program Evaluation, Vol. 47. San Francisco: Jossey-Bass
Boruch RF. 1975. On common contentions about randomized experiments. In Experimental Tests of Public Policy, ed. RF Boruch, HW Riecken, pp. 107-12. Boulder, CO: Westview
Boruch RF, Gomez H. 1977. Sensitivity, bias and theory in impact evaluations. Prof. Psychol. Res. Pract. 8:411-34
Boruch RF, McSweeney AJ, Soderstrom EJ. 1978. Randomized field experiments for program planning, development and evaluation. Eval. Q. 2:655-95
Boruch RF, Wothke W, eds. 1985. Randomization and Field Experimentation: New Directions for Program Evaluation, Vol. 28. San Francisco: Jossey-Bass
Box GEP, Jenkins GM. 1970. Time-Series Analysis: Forecasting and Control. San Francisco: Holden-Day
Brown R. 1986. Social Psychology: The Second Edition. New York: Free Press
Brunswik E. 1955. Perception and the Representative Design of Psychological Experiments. Berkeley: Univ. Calif. Press. 2nd ed.
Bryk AS, Raudenbush SW. 1992. Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage
Cain GG. 1975. Regression and selection models to improve nonexperimental comparisons. In Evaluation and Experiment: Some Critical Issues in Assessing Social Programs, ed. CA Bennett, AA Lumsdaine, pp. 297-317. New York: Academic
Campbell DT. 1957. Factors relevant to the validity of experiments in social settings. Psychol. Bull. 54:297-312
Campbell DT. 1966. Pattern matching as an essential in distal knowing. In The Psychology of Egon Brunswik, ed. KR Hammond, pp. 81-106. New York: Holt, Rinehart, Winston
Campbell DT. 1976. Focal local indicators for social program evaluation. Soc. Indic. Res. 3:237-56
Campbell DT. 1982. Experiments as arguments. Knowl.: Creation Diffus. Util. 3:237-56
Campbell DT. 1984. Hospital and landsting as continuously monitoring social programs: advocacy and warning. In Evaluation of Mental Health Service Programs, ed. B Cronholm, L von Knorring, pp. 13-39. Stockholm: Forskningsraadet Medicinska
Campbell DT. 1986. Relabeling internal and external validity for applied social scientists. See Trochim 1986, pp. 67-77
Campbell DT. 1993. Quasi-experimental research designs in compensatory education. In Evaluating Intervention Strategies for Children and Youth at Risk, ed. EM Scott. Washington, DC: US Govt. Print. Office. In press
Campbell DT, Boruch RF. 1975. Making the case for randomized assignment to treatments by considering the alternatives: six ways in which quasi-experimental evaluations tend to underestimate effects. In Evaluation and Experiment: Some Critical Issues in Assessing Social Programs, ed. CA Bennett, AA Lumsdaine, pp. 195-296. New York: Academic
Campbell DT, Stanley JC. 1963. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand-McNally
Card D. 1990. The impact of the Mariel boatlift on the Miami labor market. Ind. Labor Relat. 43:245-57
Chelimsky E. 1987. The politics of program evaluation. Soc. Sci. Mod. Soc. 25(1):24-32
Chen HT, Rossi PH. 1983. Evaluating with sense: the theory-driven approach. Eval. Rev. 7:283-302
Cohen P, Cohen J, Teresi J, Marchi M, Velez NC. 1990. Problems in the measurement of latent variables in structural equations causal models. Appl. Psychol. Meas. 14(2):183-96
Collingwood RG. 1940. An Essay on Metaphysics. Oxford: Clarendon
Connell DB, Turner RT, Mason EF. 1985. Summary of findings of the school health education evaluation: health promotion effectiveness, implementation, and costs. J. School Health 85:316-17
Cook TD. 1985. Post-positivist critical multiplism. In Social Science and Social Policy, ed. RL Shotland, MM Mark, pp. 21-52. Beverly Hills, CA: Sage
Cook TD. 1991a. Clarifying the warrant for generalized causal inferences in quasi-experimentation. In Evaluation and Education at Quarter Century, ed. MW McLaughlin, D Phillips, pp. 115-44. Chicago: NSSE
Cook TD. 1991b. Meta-analysis: its potential for causal description and causal explanation within program evaluation. In Social Prevention and the Social Sciences: Theoretical Controversies, Research Problems and Evaluation Strategies, ed. G Albrecht, H-U Otto, S Karstedt-Henke, K Bollert, pp. 245-85. Berlin-New York: de Gruyter
Cook TD. 1993. A quasi-sampling theory of the generalization of causal relationships. In Understanding Causes and Generalizing About Them: New Directions for Program Evaluation, ed. L Sechrest, AG Scott, 57:39-82. San Francisco: Jossey-Bass
Cook TD, Anson A, Walchli S. 1993. From causal description to causal explanation: improving three already good evaluations of adolescent health programs. In Promoting the Health of Adolescents: New Directions for the Twenty-First Century, ed. SG Millstein, AC Petersen, EO Nightingale, pp. 339-74. New York: Oxford Univ. Press
Cook TD, Campbell DT. 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand-McNally
Cook TD, Campbell DT. 1986. The causal assumptions of quasi-experimental practice. Synthese 68:141-80
Cook TD, Campbell DT, Perrachio L. 1990. Quasi-experimentation. In Handbook of Industrial and Organizational Psychology, ed. MD Dunnette, LM Hough, pp. 491-576. Palo Alto, CA: Consult. Psychol. Press. 2nd ed.
Cook TD, Cooper H, Cordray D, Hartmann H, Hedges L, Light R, Louis T, Mosteller F, eds. 1992. Meta-Analysis for Explanation: A Casebook. New York: Russell Sage Found.
Coyle SL, Boruch RF, Turner CF, eds. 1991. Evaluating AIDS Prevention Programs: Expanded Edition. Washington, DC: Natl. Acad. Press
Cronbach LJ. 1982. Designing Evaluations of Educational and Social Programs. San Francisco: Jossey-Bass
Cronbach LJ, Snow RE. 1976. Aptitudes and Instructional Methods. New York: Irvington
Devine E. 1992. Effects of psychoeducational care with adult surgical patients: a theory-probing meta-analysis of intervention studies. See Cook et al 1992, pp. 35-84
Dunn WN. 1982. Reforms as arguments. Knowl. Creation Diffus. Util. 3(3):293-326
Fraker T, Maynard R. 1987. Evaluating the adequacy of comparison group designs for evaluation of employment-related programs. J. Hum. Res. 22:194-227
Freedman DA. 1987. A rejoinder on models, metaphors, and fables. J. Educ. Stat. 12:206-23
Friedlander D, Gueron JM. 1992. Are high-cost services more effective than low-cost services? In Evaluating Welfare and Training Programs, ed. CF Manski, I Garfinkel, pp. 143-98. Cambridge: Harvard Univ. Press
Gadenne V. 1976. Die Gültigkeit psychologischer Untersuchungen. Stuttgart, Germany: Kohlhammer
Gasking D. 1955. Causation and recipes. Mind 64:479-87
Goldberger AS. 1972a. Selection bias in evaluating treatment effects: some formal illustrations. In Discussion Papers No. 123-172. Madison: Inst. Res. Poverty, Univ. Wisc.
Goldberger AS. 1972b. Selection bias in evaluating treatment effects: the case of interaction. In Discussion Papers No. 123-172. Madison: Inst. Res. Poverty, Univ. Wisc.
Gottfredson SD. 1978. Evaluating psychological research reports: dimensions, reliability and correlates of quality judgments. Am. Psychol. 33:920-34
Granger CW, Newbold P. 1977. Forecasting Economic Time Series. New York: Academic
Harrop JW, Velicer WF. 1985. A comparison of alternative approaches to the analysis of interrupted time series. Multivariate Behav. Res. 20:27-44
Harrop JW, Velicer WF. 1990. Computer programs for interrupted time series analysis: 1. A qualitative evaluation. Multivariate Behav. Res. 25:219-31
Heckman JJ. 1980. Sample selection bias as a specification error. In Evaluation Studies Review Annual, ed. EW Stromsdorfer, G Farkas, 5:13-31. Newbury Park, CA: Sage
Heckman JJ, Hotz VJ. 1988. Are classical experiments necessary for evaluating the impact of manpower training programs? A critical assessment. Ind. Relat. Res. Assoc. 40th Annu. Proc., pp. 291-302
Heckman JJ, Hotz VJ. 1989. Choosing among alternative nonexperimental methods for estimating the impact of social programs: the case of manpower training. J. Am. Stat. Assoc. 84:862-74
Heckman JJ, Hotz VJ, Dabos M. 1987. Do we need experimental data to evaluate the impact of manpower training on earnings? Eval. Rev. 11:395-427
Heinsman D. 1993. Effect sizes in meta-analysis: does random assignment make a difference? PhD thesis. Memphis State Univ., Tenn.
Holland PW. 1986. Statistics and causal inference (with discussion). J. Am. Stat. Assoc. 81:945-70
Holland PW, Rubin DB. 1988. Causal inference in retrospective studies. Eval. Rev. 12:203-31
Katz LF, Krueger AB. 1992. The effect of minimum wage on the fast-food industry. Ind. Labor Relat. 46:6-21
Kershaw D, Fair J. 1976. The New Jersey Income-Maintenance Experiment. Vol. 1: Operations, Surveys and Administration. New York: Academic
Kish L. 1987. Statistical Design for Research. New York: Wiley
Kratochwill TR, Levin JR, eds. 1992. Single-Case Research Design and Analysis. Hillsdale, NJ: Erlbaum
Kruglanski AW, Kroy M. 1975. Outcome validity in experimental research: a re-conceptualization. J. Represent. Res. Soc. Psychol. 7:168-78
Kuhn TS. 1962. The Structure of Scientific Revolutions. Chicago: Univ. Chicago Press
Lakoff G. 1987. Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. Chicago: Univ. Chicago Press
LaLonde RJ. 1986. Evaluating the econometric evaluations of training programs with experimental data. Am. Econ. Rev. 76:604-20
LaLonde RJ, Maynard R. 1987. How precise are evaluations of employment and training experiments: evidence from a field experiment. Eval. Rev. 11:428-51
Latane B, Darley JM. 1970. The Unresponsive Bystander: Why Doesn't He Help? New York: Appleton-Century-Crofts
Lipsey M. 1993. Theory as method: small theories of treatments. In Understanding Causes and Generalizing About Them: New Directions for Program Evaluation, ed. LB Sechrest, AG Scott, 57:5-38. San Francisco: Jossey-Bass
Lord FM. 1960. Large-scale covariance analysis when the control variable is fallible. J. Am. Stat. Assoc. 55:307-21
Mackie JL. 1974. The Cement of the Universe. Oxford: Oxford Univ. Press
Magidson J. 1977. Toward a causal model approach for adjusting for pre-existing differences in the non-equivalent control group situation. Eval. Q. 1(3):399-420
Magidson J, Sorbom D. 1982. Adjusting for confounding factors in quasi-experiments. Educ. Eval. Policy Anal. 4:321-29
Manski CF, Garfinkel I, eds. 1992. Evaluating Welfare and Training Programs. Cambridge: Harvard Univ. Press
Marcantonio RJ, Cook TD. 1994. Convincing quasi-experiments: the interrupted time series and regression-discontinuity designs. In Handbook of Practical Program Evaluation, ed. JS Wholey, HP Hatry, KE Newcomer. San Francisco: Jossey-Bass. In press
Mark MM. 1986. Validity typologies and the logic and practice of quasi-experimentation. See Trochim 1986, pp. 47-66
Medin DL. 1989. Concepts and conceptual structure. Am. Psychol. 44:1469-81
Meehl PE. 1978. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. J. Consult. Clin. Psychol. 46:806-34
Meyer BD. 1990. Unemployment insurance and unemployment spells. Econometrica 58:757-82
Meyer BD, Katz LF. 1990. The impact of the potential duration of unemployment benefits on the duration of unemployment. J. Public Econ. 41:45-72
Moffitt R. 1991. The use of selection modeling to evaluate AIDS interventions with observational data. Eval. Rev. 15(3):291-314
Mosteller F. 1990. Improving research methodology: an overview. In Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data, ed. L Sechrest, E Perrin, J Bunker, pp. 221-30. Rockville, MD: AHCPR, PHS
Overman ES. 1988. Methodology and Epistemology for the Social Sciences: Selected Papers of Donald T. Campbell. Chicago: Univ. Chicago Press
Popper KR. 1959. The Logic of Scientific Discovery. New York: Basic Books
Reichardt CS. 1979. The statistical analysis of data from nonequivalent group designs. See Cook & Campbell 1979, pp. 147-205
Reichardt CS, Gollob HF. 1986. Satisfying the constraints of causal modelling. See Trochim 1986, pp. 91-107
Riecken HW, Boruch RF, eds. 1974. Social Experimentation. New York: Academic
Rindskopf D. 1986. New developments in selection modeling for quasi-experimentation. See Trochim 1986
Robbins H, Zhang C-H. 1988. Estimating a treatment effect under biased sampling. Proc. Natl. Acad. Sci. USA 85:3670-72
Robbins H, Zhang C-H. 1989. Estimating the superiority of a drug to a placebo when all and only those patients at risk are treated with the drug. Proc. Natl. Acad. Sci. USA 86:3003-5
Rosch E. 1978. Principles of categorization. In Cognition and Categorization, ed. E Rosch, BB Lloyd. Hillsdale, NJ: Erlbaum
Rosenbaum PR. 1984. From association to causation in observational studies: the role of tests of strongly ignorable treatment assignment. J. Am. Stat. Assoc. 79:41-48
Rosenbaum PR, Rubin DB. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70:41-55
Rosenbaum PR, Rubin DB. 1984. Reducing bias in observational studies for causal effects. J. Educ. Psychol. 66:688-701
Rubin DB. 1977. Assignment to treatment group on the basis of a covariate. J. Educ. Stat. 2:1-26
Rubin DB. 1986. Which ifs have causal answers? J. Am. Stat. Assoc. 81:961-62
Sechrest L, West SG, Phillips MA, Redner R, Yeaton W. 1979. Some neglected problems in evaluation research: strength and integrity of treatments. In Evaluation Studies Review Annual, ed. L Sechrest, SG West, MA Phillips, R Redner, W Yeaton, 4:15-35. Beverly Hills, CA: Sage
Shadish WR. 1989. Critical multiplism: a research strategy and its attendant tactics. In Health Services Research Methodology: A Focus on AIDS. DHHS Pub. No. PHS89-3439, ed. L Sechrest, H Freeman, A Mulley. Rockville, MD: NCHS, USDHHS
Shadish WR, Cook TD, Houts AC. 1986. Quasi-experimentation in a critical multiplist mode. See Trochim 1986, pp. 29-46
Shadish WR, Doherty M, Montgomery LM. 1989. How many studies are in the file drawer? An estimate from the family/marital psychotherapy literature. Clin. Psychol. Rev. 9:589-603
Shadish WR, Montgomery LM, Wilson P, Wilson MR, Bright I, Okwumabua TM. 1993. The effects of family and marital psychotherapies: a meta-analysis. J. Consult. Clin. Psychol. In press
Smith EE, Medin DL. 1981. Categories and Concepts. Cambridge: Harvard Univ. Press
Smith ML, Glass GV, Miller TI. 1980. The Benefits of Psychotherapy. Baltimore: Johns Hopkins Univ. Press
St. Pierre RG, Cook TD. 1984. Sampling strategy in the design of program evaluations. In Evaluation Studies Review Annual, ed. RF Conner, DG Altman, C Jackson, 9:459-84. Beverly Hills, CA: Sage
Trochim WMK. 1984. Research design for program evaluation: the regression-discontinuity approach. Newbury Park, CA: Sage
Trochim WMK, ed. 1986. Advances in Quasi-Experimental Design and Analysis: New Directions for Program Evaluation, Vol. 31. San Francisco: Jossey-Bass
Trochim WMK, Cappelleri JC. 1992. Cutoff assignment strategies for enhancing randomized clinical trials. Control. Clin. Trials 13:190-212
Whitbeck C. 1977. Causation in medicine: the disease entity model. Philos. Sci. 44:619-37
Willson VL, Putnam RR. 1982. A meta-analysis of pretest sensitization effects in experimental design. Am. Educ. Res. J. 19:249-58
Wortman PM. 1993. Judging research quality. In Handbook of Research Synthesis, ed. H Cooper, LV Hedges. New York: Russell Sage Found. In press
Zimmermann HJ, Zadeh LA, Gaines BR, eds. 1984. Fuzzy Sets and Decision Analysis. New York: Elsevier