EXPERIMENTAL AND QUASI-EXPERIMENTAL DESIGNS FOR GENERALIZED CAUSAL INFERENCE

William R. Shadish
The University of Memphis

Thomas D. Cook
Northwestern University

Donald T. Campbell

HOUGHTON MIFFLIN COMPANY

Boston New York

Experiments and Generalized Causal Inference


Ex·per·i·ment (ĭk-spĕr′ə-mənt): [Middle English from Old French from Latin experimentum, from experiri, to try; see per- in Indo-European Roots.] n. Abbr. exp., expt. 1. a. A test under controlled conditions that is made to demonstrate a known truth, examine the validity of a hypothesis, or determine the efficacy of something previously untried. b. The process of conducting such a test; experimentation. 2. An innovative act or procedure: "Democracy is only an experiment in government" (William Ralph Inge).

Cause (kôz): [Middle English from Old French from Latin causa, reason, purpose.] n. 1. a. The producer of an effect, result, or consequence. b. The one, such as a person, an event, or a condition, that is responsible for an action or a result. v. 1. To be the cause of or reason for; result in. 2. To bring about or compel by authority or force.

To many historians and philosophers, the increased emphasis on experimentation in the 16th and 17th centuries marked the emergence of modern science from its roots in natural philosophy (Hacking, 1983). Drake (1981) cites Galileo's 1612 treatise Bodies That Stay Atop Water, or Move in It as ushering in modern experimental science, but earlier claims can be made favoring William Gilbert's 1600 study On the Loadstone and Magnetic Bodies, Leonardo da Vinci's (1452-1519) many investigations, and perhaps even the 5th-century B.C. philosopher Empedocles, who used various empirical demonstrations to argue against Parmenides (Jones, 1969a, 1969b). In the everyday sense of the term, humans have been experimenting with different ways of doing things from the earliest moments of their history. Such experimenting is as natural a part of our life as trying a new recipe or a different way of starting campfires.

However, the scientific revolution of the 17th century departed in three ways from the common use of observation in natural philosophy at that time. First, it increasingly used observation to correct errors in theory. Throughout history, natural philosophers often used observation in their theories, usually to win philosophical arguments by finding observations that supported their theories. However, they still subordinated the use of observation to the practice of deriving theories from "first principles," starting points that humans know to be true by our nature or by divine revelation (e.g., the assumed properties of the four basic elements of fire, water, earth, and air in Aristotelian natural philosophy). According to some accounts, this subordination of evidence to theory degenerated in the 17th century: "The Aristotelian principle of appealing to experience had degenerated among philosophers into dependence on reasoning supported by casual examples and the refutation of opponents by pointing to apparent exceptions not carefully examined" (Drake, 1981, p. xxi). When some 17th-century scholars then began to use observation to correct apparent errors in theoretical and religious first principles, they came into conflict with religious or philosophical authorities, as in the case of the Inquisition's demands that Galileo recant his account of the earth revolving around the sun. Given such hazards, the fact that the new experimental science tipped the balance toward observation and away from dogma is remarkable. By the time Galileo died, the role of systematic observation was firmly entrenched as a central feature of science, and it has remained so ever since (Harré, 1981). Second, before the 17th century, appeals to experience were usually based on passive observation of ongoing systems rather than on observation of what happens after a system is deliberately changed.
After the scientific revolution in the 17th century, the word experiment (terms in boldface in this book are defined in the Glossary) came to connote taking a deliberate action followed by systematic observation of what occurred afterward. As Hacking (1983) noted of Francis Bacon: "He taught that not only must we observe nature in the raw, but that we must also 'twist the lion's tale', that is, manipulate our world in order to learn its secrets" (p. 149). Although passive observation reveals much about the world, active manipulation is required to discover some of the world's regularities and possibilities (Greenwood, 1989). As a mundane example, stainless steel does not occur naturally; humans must manipulate it into existence. Experimental science came to be concerned with observing the effects of such manipulations. Third, early experimenters realized the desirability of controlling extraneous influences that might limit or bias observation. So telescopes were carried to higher points at which the air was clearer, the glass for microscopes was ground ever more accurately, and scientists constructed laboratories in which it was possible to use walls to keep out potentially biasing ether waves and to use (eventually sterilized) test tubes to keep out dust or bacteria. At first, these controls were developed for astronomy, chemistry, and physics, the natural sciences in which interest in science first bloomed. But when scientists started to use experiments in areas such as public health or education, in which extraneous influences are harder to control (e.g., Lind, 1753), they found that the controls used in natural science in the laboratory worked poorly in these new applications. So they developed new methods of dealing with extraneous influence, such as random assignment (Fisher, 1925) or adding a nonrandomized control group (Coover & Angell, 1907). As theoretical and observational experience accumulated across these settings and topics, more sources of bias were identified and more methods were developed to cope with them (Dehue, 2000).

Today, the key feature common to all experiments is still to deliberately vary something so as to discover what happens to something else later, that is, to discover the effects of presumed causes. As laypersons we do this, for example, to assess what happens to our blood pressure if we exercise more, to our weight if we diet less, or to our behavior if we read a self-help book. However, scientific experimentation has developed increasingly specialized substance, language, and tools, including the practice of field experimentation in the social sciences that is the primary focus of this book. This chapter begins to explore these matters by (1) discussing the nature of causation that experiments test, (2) explaining the specialized terminology (e.g., randomized experiments, quasi-experiments) that describes social experiments, (3) introducing the problem of how to generalize causal connections from individual experiments, and (4) briefly situating the experiment within a larger literature on the nature of science.

EXPERIMENTS AND CAUSATION


A sensible discussion of experiments requires both a vocabulary for talking about causation and an understanding of key concepts that underlie that vocabulary.

Defining Cause, Effect, and Causal Relationships


Most people intuitively recognize causal relationships in their daily lives. For instance, you may say that another automobile's hitting yours was a cause of the damage to your car; that the number of hours you spent studying was a cause of your test grades; or that the amount of food a friend eats was a cause of his weight. You may even point to more complicated causal relationships, noting that a low test grade was demoralizing, which reduced subsequent studying, which caused even lower grades. Here the same variable (low grade) can be both a cause and an effect, and there can be a reciprocal relationship between two variables (low grades and not studying) that cause each other.

Despite this intuitive familiarity with causal relationships, a precise definition of cause and effect has eluded philosophers for centuries.1 Indeed, the definitions of terms such as cause and effect depend partly on each other and on the causal relationship in which both are embedded. So the 17th-century philosopher John Locke said: "That which produces any simple or complex idea, we denote by the general name cause, and that which is produced, effect" (1975, p. 324) and also: "A cause is that which makes any other thing, either simple idea, substance, or mode, begin to be; and an effect is that, which had its beginning from some other thing" (p. 325). Since then, other philosophers and scientists have given us useful definitions of the three key ideas (cause, effect, and causal relationship) that are more specific and that better illuminate how experiments work. We would not defend any of these as the true or correct definition, given that the latter has eluded philosophers for millennia; but we do claim that these ideas help to clarify the scientific practice of probing causes.

1. Our analysis reflects the use of the word causation in ordinary language, not the more detailed discussions of cause by philosophers. Readers interested in such detail may consult a host of works that we reference in this chapter, including Cook and Campbell (1979).

Cause
Consider the cause of a forest fire. We know that fires start in different ways: a match tossed from a car, a lightning strike, or a smoldering campfire, for example. None of these causes is necessary because a forest fire can start even when, say, a match is not present. Also, none of them is sufficient to start the fire. After all, a match must stay "hot" long enough to start combustion; it must contact combustible material such as dry leaves; there must be oxygen for combustion to occur; and the weather must be dry enough so that the leaves are dry and the match is not doused by rain. So the match is part of a constellation of conditions without which a fire will not result, although some of these conditions can be usually taken for granted, such as the availability of oxygen. A lighted match is, therefore, what Mackie (1974) called an inus condition: "an insufficient but nonredundant part of an unnecessary but sufficient condition" (p. 62; italics in original). It is insufficient because a match cannot start a fire without the other conditions.
It is nonredundant only if it adds something fire-promoting that is uniquely different from what the other factors in the constellation (e.g., oxygen, dry leaves) contribute to starting a fire; after all, it would be harder to say whether the match caused the fire if someone else simultaneously tried starting it with a cigarette lighter. It is part of a sufficient condition to start a fire in combination with the full constellation of factors. But that condition is not necessary because there are other sets of conditions that can also start fires.

A research example of an inus condition concerns a new potential treatment for cancer. In the late 1990s, a team of researchers in Boston headed by Dr. Judah Folkman reported that a new drug called Endostatin shrank tumors by limiting their blood supply (Folkman, 1996). Other respected researchers could not replicate the effect even when using drugs shipped to them from Folkman's lab. Scientists eventually replicated the results after they had traveled to Folkman's lab to learn how to properly manufacture, transport, store, and handle the drug and how to inject it in the right location at the right depth and angle. One observer labeled these contingencies the "in-our-hands" phenomenon, meaning "even we don't know which details are important, so it might take you some time to work it out" (Rowe, 1999, p. 732). Endostatin was an inus condition. It was insufficient cause by itself, and its effectiveness required it to be embedded in a larger set of conditions that were not even fully understood by the original investigators.

Most causes are more accurately called inus conditions. Many factors are usually required for an effect to occur, but we rarely know all of them and how they relate to each other. This is one reason that the causal relationships we discuss in this book are not deterministic but only increase the probability that an effect will occur (Eells, 1991; Holland, 1994). It also explains why a given causal relationship will occur under some conditions but not universally across time, space, human populations, or other kinds of treatments and outcomes that are more or less related to those studied. To different degrees, all causal relationships are context dependent, so the generalization of experimental effects is always at issue. That is why we return to such generalizations throughout this book.
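
Mackie's definition of an inus condition can be made concrete with a toy truth-table check. The following is a minimal Python sketch and not part of the original analysis: the rule inside `fire`, the four factor names, and every helper function are hypothetical simplifications chosen only for illustration, and real fires depend on many more conditions.

```python
from itertools import product

# Hypothetical factor names; a real fire depends on many more conditions.
FACTORS = ("match", "oxygen", "dry_leaves", "lightning")
MATCH_CONSTELLATION = frozenset({"match", "oxygen", "dry_leaves"})


def fire(match, oxygen, dry_leaves, lightning):
    """Assumed toy rule: a fire starts through a match path or a lightning path."""
    return (match and oxygen and dry_leaves) or (lightning and oxygen and dry_leaves)


def all_worlds():
    """Yield every combination of present/absent factors as a dict."""
    for values in product([False, True], repeat=len(FACTORS)):
        yield dict(zip(FACTORS, values))


def holds(world, conditions):
    """True when every condition in the set is present in this world."""
    return all(world[name] for name in conditions)


def is_sufficient(conditions):
    """The conditions guarantee fire no matter what else is true."""
    return all(fire(**w) for w in all_worlds() if holds(w, conditions))


def is_necessary(conditions):
    """Fire never occurs unless all of the conditions are present."""
    return not any(fire(**w) for w in all_worlds() if not holds(w, conditions))


def is_inus(factor, constellation):
    """Insufficient but Nonredundant part of an Unnecessary but Sufficient condition."""
    rest = set(constellation) - {factor}
    return (
        not is_sufficient({factor})          # insufficient by itself
        and not is_sufficient(rest)          # nonredundant: the rest fails without it
        and not is_necessary(constellation)  # the constellation is unnecessary
        and is_sufficient(constellation)     # but the constellation is sufficient
    )
```

Under these assumed rules, `is_inus("match", MATCH_CONSTELLATION)` holds even though the match alone is neither sufficient nor necessary, because the lightning path offers another route to the same effect.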

Effect
We can better understand what an effect is through a counterfactual model that goes back at least to the 18th-century philosopher David Hume (Lewis, 1973, p. 556). A counterfactual is something that is contrary to fact. In an experiment, we observe what did happen when people received a treatment. The counterfactual is knowledge of what would have happened to those same people if they simultaneously had not received treatment. An effect is the difference between what did happen and what would have happened.

We cannot actually observe a counterfactual. Consider phenylketonuria (PKU), a genetically-based metabolic disease that causes mental retardation unless treated during the first few weeks of life. PKU is the absence of an enzyme that would otherwise prevent a buildup of phenylalanine, a substance toxic to the nervous system. When a restricted phenylalanine diet is begun early and maintained, retardation is prevented. In this example, the cause could be thought of as the underlying genetic defect, as the enzymatic disorder, or as the diet. Each implies a different counterfactual. For example, if we say that a restricted phenylalanine diet caused a decrease in PKU-based mental retardation in infants who are phenylketonuric at birth, the counterfactual is whatever would have happened had these same infants not received a restricted phenylalanine diet. The same logic applies to the genetic or enzymatic version of the cause. But it is impossible for these very same infants simultaneously to both have and not have the diet, the genetic disorder, or the enzyme deficiency. So a central task for all cause-probing research is to create reasonable approximations to this physically impossible counterfactual.
For instance, if it were ethical to do so, we might contrast phenylketonuric infants who were given the diet with other phenylketonuric infants who were not given the diet but who were similar in many ways to those who were (e.g., similar race, gender, age, socioeconomic status, health status). Or we might (if it were ethical) contrast infants who were not on the diet for the first 3 months of their lives with those same infants after they were put on the diet starting in the 4th month. Neither of these approximations is a true counterfactual. In the first case, the individual infants in the treatment condition are different from those in the comparison condition; in the second case, the identities are the same, but time has passed and many changes other than the treatment have occurred to the infants (including permanent damage done by phenylalanine during the first 3 months of life). So two central tasks in experimental design are creating a high-quality but necessarily imperfect source of counterfactual inference and understanding how this source differs from the treatment condition.

This counterfactual reasoning is fundamentally qualitative because causal inference, even in experiments, is fundamentally qualitative (Campbell, 1975; Shadish, 1995a; Shadish & Cook, 1999). However, some of these points have been formalized by statisticians into a special case that is sometimes called Rubin's Causal Model (Holland, 1986; Rubin, 1974, 1977, 1978, 1986). This book is not about statistics, so we do not describe that model in detail (West, Biesanz, & Pitts [2000] do so and relate it to the Campbell tradition). A primary emphasis of Rubin's model is the analysis of cause in experiments, and its basic premises are consistent with those of this book.2 Rubin's model has also been widely used to analyze causal inference in case-control studies in public health and medicine (Holland & Rubin, 1988), in path analysis in sociology (Holland, 1986), and in a paradox that Lord (1967) introduced into psychology (Holland & Rubin, 1983); and it has generated many statistical innovations that we cover later in this book. It is new enough that critiques of it are just now beginning to appear (e.g., Dawid, 2000; Pearl, 2000). What is clear, however, is that Rubin's is a very general model with obvious and subtle implications.
Both it and the critiques of it are required material for advanced students and scholars of cause-probing methods.
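
The counterfactual definition of an effect, and the way different comparison groups approximate it, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the formal model referred to above: the outcome scale, the constant effect of 2.0, the sample size, and all function and variable names are hypothetical. The simulation stores both potential outcomes for every unit, which is exactly what no real study can do.

```python
import random
from statistics import mean


def simulate(n=20000, true_effect=2.0, seed=0):
    """Compare two imperfect sources of counterfactual inference (illustrative only)."""
    rng = random.Random(seed)
    # Potential outcomes: y0 without treatment, y1 with treatment.
    # In real research only one of the two is ever observed per unit.
    y0 = [rng.gauss(10.0, 2.0) for _ in range(n)]
    y1 = [y + true_effect for y in y0]

    def estimate(treated_flags):
        # Observed outcomes: y1 for treated units, y0 for untreated units.
        treated = [y1[i] for i in range(n) if treated_flags[i]]
        control = [y0[i] for i in range(n) if not treated_flags[i]]
        return mean(treated) - mean(control)

    # Random assignment: treatment is unrelated to the untreated outcome.
    randomized = [rng.random() < 0.5 for _ in range(n)]
    # Nonrandom comparison (assumed): units that would do well anyway get treated.
    self_selected = [y > 10.0 for y in y0]

    return {
        "true": mean(a - b for a, b in zip(y1, y0)),
        "randomized": estimate(randomized),
        "self_selected": estimate(self_selected),
    }
```

In this sketch the randomized contrast lands close to the true effect, whereas the self-selected contrast mixes the effect with preexisting differences between the groups, the same problem that arises when the infants in the treatment condition differ from those in the comparison condition.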

Causal Relationship
How do we know if cause and effect are related? In a classic analysis formalized by the 19th-century philosopher John Stuart Mill, a causal relationship exists if (1) the cause preceded the effect, (2) the cause was related to the effect, and (3) we can find no plausible alternative explanation for the effect other than the cause. These three characteristics mirror what happens in experiments in which (1) we manipulate the presumed cause and observe an outcome afterward; (2) we see whether variation in the cause is related to variation in the effect; and (3) we use various methods during the experiment to reduce the plausibility of other explanations for the effect, along with ancillary methods to explore the plausibility of those we cannot rule out (most of this book is about methods for doing this).

2. However, Rubin's model is not intended to say much about the matters of causal generalization that we address in this book.

Hence experiments are well-suited to studying causal relationships. No other scientific method regularly matches the characteristics of causal relationships so well. Mill's analysis also points to the weakness of other methods. In many correlational studies, for example, it is impossible to know which of two variables came first, so defending a causal relationship between them is precarious. Understanding this logic of causal relationships and how its key terms, such as cause and effect, are defined helps researchers to critique cause-probing studies.

Causation, Correlation, and Confounds


A well-known maxim in research is: Correlation does not prove causation. This is so because we may not know which variable came first nor whether alternative explanations for the presumed effect exist. For example, suppose income and education are correlated. Do you have to have a high income before you can afford to pay for education, or do you first have to get a good education before you can get a better paying job? Each possibility may be true, and so both need investigation. But until those investigations are completed and evaluated by the scholarly community, a simple correlation does not indicate which variable came first. Correlations also do little to rule out alternative explanations for a relationship between two variables such as education and income. That relationship may not be causal at all but rather due to a third variable (often called a confound), such as intelligence or family socioeconomic status, that causes both high education and high income. For example, if high intelligence causes success in education and on the job, then intelligent people would have correlated education and incomes, not because education causes income (or vice versa) but because both would be caused by intelligence. Thus a central task in the study of experiments is identifying the different kinds of confounds that can operate in a particular research area and understanding the strengths and weaknesses associated with various ways of dealing with them.
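
The way a confound can manufacture a correlation between two causally unrelated variables can be shown with simulated data. This is a minimal sketch with invented numbers: the variable scales, the sample size, and the helper names `pearson`, `partial_corr`, and `simulate_confound` are assumptions for illustration, and education is deliberately given no effect on income. The partial correlation used here is one standard way to adjust for a measured third variable, not a method the text itself prescribes.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear influence of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate_confound(n=20000, seed=1):
    """Assumed model: intelligence drives both variables; neither drives the other."""
    rng = random.Random(seed)
    intelligence = [rng.gauss(0, 1) for _ in range(n)]
    education = [z + rng.gauss(0, 1) for z in intelligence]
    income = [z + rng.gauss(0, 1) for z in intelligence]
    return {
        "raw": pearson(education, income),
        "adjusted": partial_corr(education, income, intelligence),
    }
```

In this toy setup the raw correlation between education and income is substantial, yet it falls to roughly zero once the confound is held constant. In real studies the relevant confounds are often unknown or unmeasured, which is why such an adjustment cannot substitute for the design features discussed in this book.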

Manipulable and Nonmanipulable Causes


In the intuitive understanding of experimentation that most people have, it makes sense to say, "Let's see what happens if we require welfare recipients to work"; but it makes no sense to say, "Let's see what happens if I change this adult male into a three-year-old girl." And so it is also in scientific experiments. Experiments explore the effects of things that can be manipulated, such as the dose of a medicine, the amount of a welfare check, the kind or amount of psychotherapy, or the number of children in a classroom. Nonmanipulable events (e.g., the explosion of a supernova) or attributes (e.g., people's ages, their raw genetic material, or their biological sex) cannot be causes in experiments because we cannot deliberately vary them to see what then happens. Consequently, most scientists and philosophers agree that it is much harder to discover the effects of nonmanipulable causes.

To be clear, we are not arguing that all causes must be manipulable, only that experimental causes must be so. Many variables that we correctly think of as causes are not directly manipulable. Thus it is well established that a genetic defect causes PKU even though that defect is not directly manipulable. We can investigate such causes indirectly in nonexperimental studies or even in experiments by manipulating biological processes that prevent the gene from exerting its influence, as through the use of diet to inhibit the gene's biological consequences. Both the nonmanipulable gene and the manipulable diet can be viewed as causes: both covary with PKU-based retardation, both precede the retardation, and it is possible to explore other explanations for the gene's and the diet's effects on cognitive functioning. However, investigating the manipulable diet as a cause has two important advantages over considering the nonmanipulable genetic problem as a cause. First, only the diet provides a direct action to solve the problem; and second, we will see that studying manipulable agents allows a higher quality source of counterfactual inference through such methods as random assignment. When individuals with the nonmanipulable genetic problem are compared with persons without it, the latter are likely to be different from the former in many ways other than the genetic defect. So the counterfactual inference about what would have happened to those with the PKU genetic defect is much more difficult to make.

Nonetheless, nonmanipulable causes should be studied using whatever means are available and seem useful. This is true because such causes eventually help us to find manipulable agents that can then be used to ameliorate the problem at hand. The PKU example illustrates this. Medical researchers did not discover how to treat PKU effectively by first trying different diets with retarded children.
They first discovered the nonmanipulable biological features of retarded children affected with PKU, finding abnormally high levels of phenylalanine and its associated metabolic and genetic problems in those children. Those findings pointed in certain ameliorative directions and away from others, leading scientists to experiment with treatments they thought might be effective and practical. Thus the new diet resulted from a sequence of studies with different immediate purposes, with different forms, and with varying degrees of uncertainty reduction. Some were experimental, but others were not.

Further, analogue experiments can sometimes be done on nonmanipulable causes, that is, experiments that manipulate an agent that is similar to the cause of interest. Thus we cannot change a person's race, but we can chemically induce skin pigmentation changes in volunteer individuals, though such analogues do not match the reality of being Black every day and everywhere for an entire life. Similarly, past events, which are normally nonmanipulable, sometimes constitute a natural experiment that may even have been randomized, as when the 1970 Vietnam-era draft lottery was used to investigate a variety of outcomes (e.g., Angrist, Imbens, & Rubin, 1996a; Notz, Staw, & Cook, 1971).

Although experimenting on manipulable causes makes the job of discovering their effects easier, experiments are far from perfect means of investigating causes.

Sometimes experiments modify the conditions in which testing occurs in a way that reduces the fit between those conditions and the situation to which the results are to be generalized. Also, knowledge of the effects of manipulable causes tells nothing about how and why those effects occur. Nor do experiments answer many other questions relevant to the real world: for example, which questions are worth asking, how strong the need for treatment is, how a cause is distributed through society, whether the treatment is implemented with theoretical fidelity, and what value should be attached to the experimental results.

In addition, in experiments, we first manipulate a treatment and only then observe its effects; but in some other studies we first observe an effect, such as AIDS, and then search for its cause, whether manipulable or not. Experiments cannot help us with that search. Scriven (1976) likens such searches to detective work in which a crime has been committed (e.g., a robbery), the detectives observe a particular pattern of evidence surrounding the crime (e.g., the robber wore a baseball cap and a distinct jacket and used a certain kind of gun), and then the detectives search for criminals whose known method of operating (their modus operandi or m.o.) includes this pattern. A criminal whose m.o. fits that pattern of evidence then becomes a suspect to be investigated further. Epidemiologists use a similar method, the case-control design (Ahlbom & Norell, 1990), in which they observe a particular health outcome (e.g., an increase in brain tumors) that is not seen in another group and then attempt to identify associated causes (e.g., increased cell phone use). Experiments do not aspire to answer all the kinds of questions, not even all the types of causal questions, that social scientists ask.

Causal Description and Causal Explanation


The unique strength of experimentation is in describing the consequences attributable to deliberately varying a treatment. We call this causal description. In contrast, experiments do less well in clarifying the mechanisms through which and the conditions under which that causal relationship holds, which is what we call causal explanation. For example, most children very quickly learn the descriptive causal relationship between flicking a light switch and obtaining illumination in a room. However, few children (or even adults) can fully explain why that light goes on. To do so, they would have to decompose the treatment (the act of flicking a light switch) into its causally efficacious features (e.g., closing an insulated circuit) and its nonessential features (e.g., whether the switch is thrown by hand or a motion detector). They would have to do the same for the effect (either incandescent or fluorescent light can be produced, but light will still be produced whether the light fixture is recessed or not). For full explanation, they would then have to show how the causally efficacious parts of the treatment influence the causally affected parts of the outcome through identified mediating processes (e.g., the passage of electricity through the circuit, the excitation of photons).3 Clearly, the cause of the light going on is a complex cluster of many factors. For those philosophers who equate cause with identifying that constellation of variables that necessarily, inevitably, and infallibly results in the effect (Beauchamp, 1974), talk of cause is not warranted until everything of relevance is known. For them, there is no causal description without causal explanation. Whatever the philosophic merits of their position, though, it is not practical to expect much current social science to achieve such complete explanation.

The practical importance of causal explanation is brought home when the switch fails to make the light go on and when replacing the light bulb (another easily learned manipulation) fails to solve the problem. Explanatory knowledge then offers clues about how to fix the problem, for example, by detecting and repairing a short circuit. Or if we wanted to create illumination in a place without lights and we had explanatory knowledge, we would know exactly which features of the cause-and-effect relationship are essential to create light and which are irrelevant. Our explanation might tell us that there must be a source of electricity but that that source could take several different molar forms, such as a battery, a generator, a windmill, or a solar array. There must also be a switch mechanism to close a circuit, but this could also take many forms, including the touching of two bare wires or even a motion detector that trips the switch when someone enters the room. So causal explanation is an important route to the generalization of causal descriptions because it tells us which features of the causal relationship are essential to transfer to other situations.
This benefit of causal explanation helps elucidate its priority and prestige in all sciences and helps explain why, once a novel and important causal relationship is discovered, the bulk of basic scientific effort turns toward explaining why and how it happens. Usually, this involves decomposing the cause into its causally effective parts, decomposing the effects into its causally affected parts, and identifying the processes through which the effective causal parts influence the causally affected outcome parts.

These examples also show the close parallel between descriptive and explanatory causation and molar and molecular causation.4 Descriptive causation usually concerns simple bivariate relationships between molar treatments and molar outcomes, molar here referring to a package that consists of many different parts. For instance, we may find that psychotherapy decreases depression, a simple descriptive causal relationship between a molar treatment package and a molar outcome. However, psychotherapy consists of such parts as verbal interactions, placebo-generating procedures, setting characteristics, time constraints, and payment for services. Similarly, many depression measures consist of items pertaining to the physiological, cognitive, and affective aspects of depression. Explanatory causation breaks these molar causes and effects into their molecular parts so as to learn, say, that the verbal interactions and the placebo features of therapy both cause changes in the cognitive symptoms of depression, but that payment for services does not do so even though it is part of the molar treatment package.

3. However, the full explanation a physicist would offer might be quite different from this electrician's explanation, perhaps invoking the behavior of subparticles. This difference indicates just how complicated is the notion of explanation and how it can quickly become quite complex once one shifts levels of analysis.

4. By molar, we mean something taken as a whole rather than in parts. An analogy is to physics, in which molar might refer to the properties or motions of masses, as distinguished from those of molecules or atoms that make up those masses.

If experiments are less able to provide this highly-prized explanatory causal knowledge, why are experiments so central to science, especially to basic social science, in which theory and explanation are often the coin of the realm? The answer is that the dichotomy between descriptive and explanatory causation is less clear in scientific practice than in abstract discussions about causation. First, many causal explanations consist of chains of descriptive causal links in which one event causes the next. Experiments help to test the links in each chain. Second, experiments help distinguish between the validity of competing explanatory theories, for example, by testing competing mediating links proposed by those theories. Third, some experiments test whether a descriptive causal relationship varies in strength or direction under Condition A versus Condition B (then the condition is a moderator variable that explains the conditions under which the effect holds). Fourth, some experiments add quantitative or qualitative observations of the links in the explanatory chain (mediator variables) to generate and study explanations for the descriptive causal effect.

Experiments are also prized in applied areas of social science, in which the identification of practical solutions to social problems has as great or even greater priority than explanations of those solutions.
After all, explanation is not always required for identifying practical solutions. Lewontin (1997) makes this point about the Human Genome Project, a coordinated multibillion-dollar research program to map the human genome that it is hoped eventually will clarify the genetic causes of diseases. Lewontin is skeptical about aspects of this search:

What is involved here is the difference between explanation and intervention. Many disorders can be explained by the failure of the organism to make a normal protein, a failure that is the consequence of a gene mutation. But intervention requires that the normal protein be provided at the right place in the right cells, at the right time and in the right amount, or else that an alternative way be found to provide normal cellular function. What is worse, it might even be necessary to keep the abnormal protein away from the cells at critical moments. None of these objectives is served by knowing the DNA sequence of the defective gene. (Lewontin, 1997, p. 29)

Practical applications are not immediately revealed by theoretical advance. Instead, to reveal them may take decades of follow-up work, including tests of simple descriptive causal relationships. The same point is illustrated by the cancer drug Endostatin, discussed earlier. Scientists knew the action of the drug occurred through cutting off tumor blood supplies; but to successfully use the drug to treat cancers in mice required administering it at the right place, angle, and depth, and those details were not part of the usual scientific explanation of the drug's effects.


In the end, then, causal descriptions and causal explanations are in delicate balance in experiments. What experiments do best is to improve causal descriptions; they do less well at explaining causal relationships. But most experiments can be designed to provide better explanations than is typically the case today. Further, in focusing on causal descriptions, experiments often investigate molar events that may be less strongly related to outcomes than are more molecular mediating processes, especially those processes that are closer to the outcome in the explanatory chain. However, many causal descriptions are still dependable and strong enough to be useful, to be worth making the building blocks around which important policies and theories are created. Just consider the dependability of such causal statements as that school desegregation causes white flight, or that outgroup threat causes ingroup cohesion, or that psychotherapy improves mental health, or that diet reduces the retardation due to PKU. Such dependable causal relationships are useful to policymakers, practitioners, and scientists alike.

MODERN DESCRIPTIONS OF EXPERIMENTS


Some of the terms used in describing modern experimentation (see Table 1.1) are unique, clearly defined, and consistently used; others are blurred and inconsistently used. The common attribute in all experiments is control of treatment (though control can take many different forms). So Mosteller (1990, p. 225) writes, "In an experiment the investigator controls the application of the treatment"; and Yaremko, Harari, Harrison, and Lynn (1986, p. 72) write, "one or more independent variables are manipulated to observe their effects on one or more dependent variables." However, over time many different experimental subtypes have developed in response to the needs and histories of different sciences (Winston, 1990; Winston & Blais, 1996).

TABLE 1.1 The Vocabulary of Experiments

Experiment: A study in which an intervention is deliberately introduced to observe its effects.
Randomized Experiment: An experiment in which units are assigned to receive the treatment or an alternative condition by a random process such as the toss of a coin or a table of random numbers.
Quasi-Experiment: An experiment in which units are not assigned to conditions randomly.
Natural Experiment: Not really an experiment because the cause usually cannot be manipulated; a study that contrasts a naturally occurring event such as an earthquake with a comparison condition.
Correlational Study: Usually synonymous with nonexperimental or observational study; a study that simply observes the size and direction of a relationship among variables.


Randomized Experiment
The most clearly described variant is the randomized experiment, widely credited to Sir Ronald Fisher (1925, 1926). It was first used in agriculture but later spread to other topic areas because it promised control over extraneous sources of variation without requiring the physical isolation of the laboratory. Its distinguishing feature is clear and important: the various treatments being contrasted (including no treatment at all) are assigned to experimental units⁵ by chance, for example, by coin toss or use of a table of random numbers. If implemented correctly, random assignment creates two or more groups of units that are probabilistically similar to each other on the average.⁶ Hence, any outcome differences that are observed between those groups at the end of a study are likely to be due to treatment, not to differences between the groups that already existed at the start of the study. Further, when certain assumptions are met, the randomized experiment yields an estimate of the size of a treatment effect that has desirable statistical properties, along with estimates of the probability that the true effect falls within a defined confidence interval. These features of experiments are so highly prized that in a research area such as medicine the randomized experiment is often referred to as the gold standard for treatment outcome research.⁷

Closely related to the randomized experiment is a more ambiguous and inconsistently used term, true experiment. Some authors use it synonymously with randomized experiment (Rosenthal & Rosnow, 1991). Others use it more generally to refer to any study in which an independent variable is deliberately manipulated (Yaremko et al., 1986) and a dependent variable is assessed. We shall not use the term at all given its ambiguity and given that the modifier true seems to imply restricted claims to a single correct experimental method.
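
The logic of random assignment can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the pretest scores are hypothetical numbers drawn from a normal distribution, and the function names (`randomly_assign`, `mean_gap`) are invented for this example rather than taken from any procedure in the text. Each unit is assigned by a simulated coin toss, and the two resulting groups are then compared on a characteristic that existed before assignment.

```python
import random
import statistics


def randomly_assign(units, rng):
    """Assign each unit to a condition by a simulated fair coin toss."""
    groups = {"treatment": [], "control": []}
    for unit in units:
        condition = "treatment" if rng.random() < 0.5 else "control"
        groups[condition].append(unit)
    return groups


def mean_gap(groups):
    """Treatment-group mean minus control-group mean."""
    return statistics.mean(groups["treatment"]) - statistics.mean(groups["control"])


# Hypothetical pretest scores for 2,000 units (assumed mean 100, SD 15).
rng = random.Random(0)  # fixed seed so the sketch is reproducible
pretest_scores = [rng.gauss(100, 15) for _ in range(2000)]

groups = randomly_assign(pretest_scores, rng)
pretest_gap = mean_gap(groups)

print(len(groups["treatment"]), len(groups["control"]))
print(round(pretest_gap, 2))  # expected to be small relative to the SD of 15
```

Any single assignment can still yield groups that differ somewhat by chance; the similarity holds only on the average, which is why the qualifier probabilistically matters.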

Quasi-Experiment
Much of this book focuses on a class of designs that Campbell and Stanley (1963) popularized as quasi-experiments.⁸ Quasi-experiments share with all other experiments a similar purpose (to test descriptive causal hypotheses about manipulable causes) as well as many structural details, such as the frequent presence of control groups and pretest measures, to support a counterfactual inference about what would have happened in the absence of treatment. But, by definition, quasi-experiments lack random assignment. Assignment to conditions is by means of self-selection, by which units choose treatment for themselves, or by means of administrator selection, by which teachers, bureaucrats, legislators, therapists, physicians, or others decide which persons should get which treatment. However, researchers who use quasi-experiments may still have considerable control over selecting and scheduling measures, over how nonrandom assignment is executed, over the kinds of comparison groups with which treatment groups are compared, and over some aspects of how treatment is scheduled. As Campbell and Stanley note:

There are many natural social settings in which the research person can introduce something like experimental design into his scheduling of data collection procedures (e.g., the when and to whom of measurement), even though he lacks the full control over the scheduling of experimental stimuli (the when and to whom of exposure and the ability to randomize exposures) which makes a true experiment possible. Collectively, such situations can be regarded as quasi-experimental designs. (Campbell & Stanley, 1963, p. 34)

In quasi-experiments, the cause is manipulable and occurs before the effect is measured. However, quasi-experimental design features usually create less compelling support for counterfactual inferences. For example, quasi-experimental control groups may differ from the treatment condition in many systematic (nonrandom) ways other than the presence of the treatment. Many of these ways could be alternative explanations for the observed effect, and so researchers have to worry about ruling them out in order to get a more valid estimate of the treatment effect.

5. Units can be people, animals, time periods, institutions, or almost anything else. Typically in field experimentation they are people or some aggregate of people, such as classrooms or work sites. In addition, a little thought shows that random assignment of units to treatments is the same as assignment of treatments to units, so these phrases are frequently used interchangeably.
6. The word probabilistically is crucial, as is explained in more detail in Chapter 8.
7. Although the term randomized experiment is used this way consistently across many fields and in this book, statisticians sometimes use the closely related term random experiment in a different way to indicate experiments for which the outcome cannot be predicted with certainty (e.g., Hogg & Tanis, 1988).
8. Campbell (1957) first called these compromise designs but changed terminology very quickly; Rosenbaum (1995a) and Cochran (1965) refer to these as observational studies, a term we avoid because many people use it to refer to correlational or nonexperimental studies, as well. Greenberg and Shroder (1997) use quasi-experiment to refer to studies that randomly assign groups (e.g., communities) to conditions, but we would consider these group-randomized experiments (Murray, 1998).
By contrast, with random assignment the researcher does not have to think as much about all these alternative explanations. If correctly done, random assignment makes most of the alternatives less likely as causes of the observed treatment effect at the start of the study. In quasi-experiments, the researcher has to enumerate alternative explanations one by one, decide which are plausible, and then use logic, design, and measurement to assess whether each one is operating in a way that might explain any observed effect. The difficulties are that these alternative explanations are never completely enumerable in advance, that some of them are particular to the context being studied, and that the methods needed to eliminate them from contention will vary from alternative to alternative and from study to study. For example, suppose two nonrandomly formed groups of children are studied, a volunteer treatment group that gets a new reading program and a control group of nonvolunteers who do not get it. If the treatment group does better, is it because of treatment or because the cognitive development of the volunteers was increasing more rapidly even before treatment began? (In a randomized experiment, maturation rates would have been probabilistically equal in both groups.) To assess this alternative, the researcher might add multiple pretests to reveal maturational trend before the treatment, and then compare that trend with the trend after treatment.

Another alternative explanation might be that the nonrandom control group included more disadvantaged children who had less access to books in their homes or who had parents who read to them less often. (In a randomized experiment, both groups would have had similar proportions of such children.) To assess this alternative, the experimenter may measure the number of books at home, parental time spent reading to children, and perhaps trips to libraries. Then the researcher would see if these variables differed across treatment and control groups in the hypothesized direction that could explain the observed treatment effect. Obviously, as the number of plausible alternative explanations increases, the design of the quasi-experiment becomes more intellectually demanding and complex, especially because we are never certain we have identified all the alternative explanations. The efforts of the quasi-experimenter start to look like attempts to bandage a wound that would have been less severe if random assignment had been used initially.

The ruling out of alternative hypotheses is closely related to a falsificationist logic popularized by Popper (1959). Popper noted how hard it is to be sure that a general conclusion (e.g., all swans are white) is correct based on a limited set of observations (e.g., all the swans I've seen were white). After all, future observations may change (e.g., some day I may see a black swan). So confirmation is logically difficult. By contrast, observing a disconfirming instance (e.g., a black swan) is sufficient, in Popper's view, to falsify the general conclusion that all swans are white. Accordingly, Popper urged scientists to try deliberately to falsify the conclusions they wish to draw rather than only to seek information corroborating them. Conclusions that withstand falsification are retained in scientific books or journals and treated as plausible until better evidence comes along. Quasi-experimentation is falsificationist in that it requires experimenters to identify a causal claim and then to generate and examine plausible alternative explanations that might falsify the claim.

However, such falsification can never be as definitive as Popper hoped. Kuhn (1962) pointed out that falsification depends on two assumptions that can never be fully tested. The first is that the causal claim is perfectly specified. But that is never the case. So many features of both the claim and the test of the claim are debatable: for example, which outcome is of interest, how it is measured, the conditions of treatment, who needs treatment, and all the many other decisions that researchers must make in testing causal relationships. As a result, disconfirmation often leads theorists to respecify part of their causal theories. For example, they might now specify novel conditions that must hold for their theory to be true and that were derived from the apparently disconfirming observations. Second, falsification requires measures that are perfectly valid reflections of the theory being tested. However, most philosophers maintain that all observation is theory-laden. It is laden both with intellectual nuances specific to the partially unique scientific understandings of the theory held by the individual or group devising the test and also with the experimenters' extrascientific wishes, hopes, aspirations, and broadly shared cultural assumptions and understandings. If measures are not independent of theories, how can they provide independent theory tests, including tests of causal theories? If the possibility of theory-neutral observations is denied, with them disappears the possibility of definitive knowledge both of what seems to confirm a causal claim and of what seems to disconfirm it.

Nonetheless, a fallibilist version of falsification is possible. It argues that studies of causal hypotheses can still usefully improve understanding of general trends despite ignorance of all the contingencies that might pertain to those trends. It argues that causal studies are useful even if we have to respecify the initial hypothesis repeatedly to accommodate new contingencies and new understandings. After all, those respecifications are usually minor in scope; they rarely involve wholesale overthrowing of general trends in favor of completely opposite trends.

Fallibilist falsification also assumes that theory-neutral observation is impossible but that observations can approach a more factlike status when they have been repeatedly made across different theoretical conceptions of a construct, across multiple kinds of measurements, and at multiple times. It also assumes that observations are imbued with multiple theories, not just one, and that different operational procedures do not share the same multiple theories. As a result, observations that repeatedly occur despite different theories being built into them have a special factlike status even if they can never be fully justified as completely theory-neutral facts. In summary, then, fallible falsification is more than just seeing whether observations disconfirm a prediction.
It involves discovering and judging the worth of ancillary assumptions about the restricted specificity of the causal hypothesis under test and also about the heterogeneity of theories, viewpoints, settings, and times built into the measures of the cause and effect and of any contingencies modifying their relationship.

It is neither feasible nor desirable to rule out all possible alternative interpretations of a causal relationship. Instead, only plausible alternatives constitute the major focus. This serves partly to keep matters tractable because the number of possible alternatives is endless. It also recognizes that many alternatives have no serious empirical or experiential support and so do not warrant special attention. However, the lack of support can sometimes be deceiving. For example, the cause of stomach ulcers was long thought to be a combination of lifestyle (e.g., stress) and excess acid production. Few scientists seriously thought that ulcers were caused by a pathogen (e.g., virus, germ, bacteria) because it was assumed that an acid-filled stomach would destroy all living organisms. However, in 1982 Australian researchers Barry Marshall and Robin Warren discovered spiral-shaped bacteria, later named Helicobacter pylori (H. pylori), in ulcer patients' stomachs. With this discovery, the previously possible but implausible became plausible. By 1994, a U.S. National Institutes of Health Consensus Development Conference concluded that H. pylori was the major cause of most peptic ulcers. So labeling rival hypotheses as plausible depends not just on what is logically possible but on social consensus, shared experience, and empirical data.

Because such factors are often context specific, different substantive areas develop their own lore about which alternatives are important enough to need to be controlled, even developing their own methods for doing so. In early psychology, for example, a control group with pretest observations was invented to control for the plausible alternative explanation that, by giving practice in answering test content, pretests would produce gains in performance even in the absence of a treatment effect (Coover & Angell, 1907). Thus the focus on plausibility is a two-edged sword: it reduces the range of alternatives to be considered in quasi-experimental work, yet it also leaves the resulting causal inference vulnerable to the discovery that an implausible-seeming alternative may later emerge as a likely causal agent.
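
The volunteer reading-program example from earlier in this section can be sketched as a small simulation. Everything in it is an assumption made for illustration: the maturation rates are hypothetical draws from a normal distribution, volunteering is assumed to depend entirely on maturation, and the true treatment effect is set to zero. Under those assumptions, self-selection alone produces an apparent treatment effect, whereas coin-toss assignment of the same children does not.

```python
import random
import statistics

TRUE_EFFECT = 0.0  # assumption: the reading program has no effect at all


def observed_gain(child, treated):
    """Gain in reading score: maturation plus noise plus any treatment effect."""
    return child["maturation"] + child["noise"] + (TRUE_EFFECT if treated else 0.0)


def group_gap(children, assignment_key):
    """Mean gain of the treated group minus mean gain of the comparison group."""
    treated = [observed_gain(c, True) for c in children if c[assignment_key]]
    control = [observed_gain(c, False) for c in children if not c[assignment_key]]
    return statistics.mean(treated) - statistics.mean(control)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
children = []
for _ in range(2000):
    maturation = rng.gauss(5, 2)  # hypothetical pre-existing growth rate
    children.append({
        "maturation": maturation,
        "noise": rng.gauss(0, 1),
        "volunteers": maturation > 5,  # assumed: faster-maturing children volunteer
        "coin": rng.random() < 0.5,    # random assignment by simulated coin toss
    })

self_selected_gap = group_gap(children, "volunteers")
randomized_gap = group_gap(children, "coin")

print(round(self_selected_gap, 2))  # clearly positive despite a zero true effect
print(round(randomized_gap, 2))     # close to zero
```

The sketch covers only one alternative explanation (differential maturation). In an actual quasi-experiment the researcher would not know in advance which such differences exist, which is the difficulty the text describes.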

Natural Experiment
The term natural experiment describes a naturally occurring contrast between a treatment and a comparison condition (Fagan, 1990; Meyer, 1995; Zeisel, 1973). Often the treatments are not even potentially manipulable, as when researchers retrospectively examined whether earthquakes in California caused drops in property values (Brunette, 1995; Murdoch, Singh, & Thayer, 1993). Yet plausible causal inferences about the effects of earthquakes are easy to construct and defend. After all, the earthquakes occurred before the observations on property values, and it is easy to see whether earthquakes are related to property values. A useful source of counterfactual inference can be constructed by examining property values in the same locale before the earthquake or by studying similar locales that did not experience an earthquake during the same time. If property values dropped right after the earthquake in the earthquake condition but not in the comparison condition, it is difficult to find an alternative explanation for that drop.

Natural experiments have recently gained a high profile in economics. Before the 1990s economists had great faith in their ability to produce valid causal inferences through statistical adjustments for initial nonequivalence between treatment and control groups. But two studies on the effects of job training programs showed that those adjustments produced estimates that were not close to those generated from a randomized experiment and were unstable across tests of the model's sensitivity (Fraker & Maynard, 1987; Lalonde, 1986). Hence, in their search for alternative methods, many economists came to do natural experiments, such as the economic study of the effects that occurred in the Miami job market when many prisoners were released from Cuban jails and allowed to come to the United States (Card, 1990). They assume that the release of prisoners (or the timing of an earthquake) is independent of the ongoing processes that usually affect unemployment rates (or housing values). Later we explore the validity of this assumption; of its desirability there can be little question.
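
The earthquake comparison described above can be written as simple arithmetic. The sketch below combines the two sources of counterfactual inference named in the text (the same locale before the event, and a similar locale without the event) into a single contrast, an approach commonly called a difference-in-differences estimate; that label and the function name are not from the text. All property values are hypothetical, in thousands of dollars.

```python
from statistics import mean


def difference_in_differences(event_before, event_after,
                              comparison_before, comparison_after):
    """Change in the event locale minus change in the comparison locale."""
    event_change = mean(event_after) - mean(event_before)
    comparison_change = mean(comparison_after) - mean(comparison_before)
    return event_change - comparison_change


# Hypothetical property values (thousands of dollars), before and after.
quake_before = [300, 320, 310]
quake_after = [270, 290, 280]
comparison_before = [305, 315, 310]
comparison_after = [307, 317, 312]

estimate = difference_in_differences(
    quake_before, quake_after, comparison_before, comparison_after
)
print(estimate)  # -32: a drop of 30 in the quake locale versus a rise of 2 elsewhere
```

The estimate is only as credible as the assumption stated in the text: that the timing of the event is independent of the ongoing processes that would otherwise have affected the outcome in the two locales.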


Nonexperimental Designs
The terms correlational design, passive observational design, and nonexperimental design refer to situations in which a presumed cause and effect are identified and measured but in which other structural features of experiments are missing. Random assignment is not part of the design, nor are such design elements as pretests and control groups from which researchers might construct a useful counterfactual inference. Instead, reliance is placed on measuring alternative explanations individually and then statistically controlling for them. In cross-sectional studies in which all the data are gathered on the respondents at one time, the researcher may not even know if the cause precedes the effect. When these studies are used for causal purposes, the missing design features can be problematic unless much is already known about which alternative interpretations are plausible, unless those that are plausible can be validly measured, and unless the substantive model used for statistical adjustment is well-specified. These are difficult conditions to meet in the real world of research practice, and therefore many commentators doubt the potential of such designs to support strong causal inferences in most cases.

EXPERIMENTS AND THE GENERALIZATION OF CAUSAL CONNECTIONS


The strength of experimentation is its ability to illuminate causal inference. The weakness of experimentation is doubt about the extent to which that causal relationship generalizes. We hope that an innovative feature of this book is its focus on generalization. Here we introduce the general issues that are expanded in later chapters.

Most Experiments Are Highly Local But Have General Aspirations
Most experiments are highly localized and particularistic. They are almost always conducted in a restricted range of settings, often just one, with a particular version of one type of treatment rather than, say, a sample of all possible versions. Usually they have several measures, each with theoretical assumptions that are different from those present in other measures, but far from a complete set of all possible measures. Each experiment nearly always uses a convenient sample of people rather than one that reflects a well-described population; and it will inevitably be conducted at a particular point in time that rapidly becomes history.

Yet readers of experimental results are rarely concerned with what happened in that particular, past, local study. Rather, they usually aim to learn either about theoretical constructs of interest or about a larger policy. Theorists often want to connect experimental results to theories with broad conceptual applicability, which requires generalization at the linguistic level of constructs rather than at the level of the operations used to represent these constructs in a given experiment. They nearly always want to generalize to more people and settings than are represented in a single experiment. Indeed, the value assigned to a substantive theory usually depends on how broad a range of phenomena the theory covers. Similarly, policymakers may be interested in whether a causal relationship would hold (probabilistically) across the many sites at which it would be implemented as a policy, an inference that requires generalization beyond the original experimental study context. Indeed, all human beings probably value the perceptual and cognitive stability that is fostered by generalizations. Otherwise, the world might appear as a buzzing cacophony of isolated instances requiring constant cognitive processing that would overwhelm our limited capacities.

In defining generalization as a problem, we do not assume that more broadly applicable results are always more desirable (Greenwood, 1989). For example, physicists who use particle accelerators to discover new elements may not expect that it would be desirable to introduce such elements into the world. Similarly, social scientists sometimes aim to demonstrate that an effect is possible and to understand its mechanisms without expecting that the effect can be produced more generally. For instance, when a "sleeper effect" occurs in an attitude change study involving persuasive communications, the implication is that change is manifest after a time delay but not immediately so.
The circumstances under which this effect occurs turn out to be quite limited and unlikely to be of any general interest other than to show that the theory predicting it (and many other ancillary theories) may not be wrong (Cook, Gruder, Hennigan, & Flay, 1979). Experiments that demonstrate limited generalization may be just as valuable as those that demonstrate broad generalization.

Nonetheless, a conflict seems to exist between the localized nature of the causal knowledge that individual experiments provide and the more generalized causal goals that research aspires to attain. Cronbach and his colleagues (Cronbach et al., 1980; Cronbach, 1982) have made this argument most forcefully, and their works have contributed much to our thinking about causal generalization. Cronbach noted that each experiment consists of units that receive the experiences being contrasted, of the treatments themselves, of observations made on the units, and of the settings in which the study is conducted. Taking the first letter from each of these four words, he defined the acronym utos to refer to the "instances on which data are collected" (Cronbach, 1982, p. 78), that is, to the actual people, treatments, measures, and settings that were sampled in the experiment. He then defined two problems of generalization: (1) generalizing to the "domain about which [the] question is asked" (p. 79), which he called UTOS; and (2) generalizing to "units, treatments, variables, and settings not directly observed" (p. 83), which he called *UTOS.⁹

9. We oversimplify Cronbach's presentation here for pedagogical reasons. For example, Cronbach only used capital S, not small s, so that his system referred only to utoS, not utos. He offered diverse and not always consistent definitions of UTOS and *UTOS, in particular. And he does not use the word generalization in the same broad way we do here.


Our theory of causal generalization, outlined below and presented in more detail in Chapters 11 through 13, melds Cronbach's thinking with our own ideas about generalization from previous works (Cook, 1990, 1991; Cook & Campbell, 1979), creating a theory that is different in modest ways from both of these predecessors. Our theory is influenced by Cronbach's work in two ways. First, we follow him by describing experiments consistently throughout this book as consisting of the elements of units, treatments, observations, and settings,¹⁰ though we frequently substitute persons for units given that most field experimentation is conducted with humans as participants. We also often substitute outcome for observations given the centrality of observations about outcome when examining causal relationships. Second, we acknowledge that researchers are often interested in two kinds of generalization about each of these five elements, and that these two types are inspired by, but not identical to, the two kinds of generalization that Cronbach defined. We call these construct validity generalizations (inferences about the constructs that research operations represent) and external validity generalizations (inferences about whether the causal relationship holds over variation in persons, settings, treatment, and measurement variables).

Construct Validity: Causal Generalization as Representation


The first causal generalization problem concerns how to go from the particular units, treatments, observations, and settings on which data are collected to the higher order constructs these instances represent. These constructs are almost always couched in terms that are more abstract than the particular instances sampled in an experiment. The labels may pertain to the individual elements of the experiment (e.g., is the outcome measured by a given test best described as intelligence or as achievement?). Or the labels may pertain to the nature of relationships among elements, including causal relationships, as when cancer treatments are classified as cytotoxic or cytostatic depending on whether they kill tumor cells directly or delay tumor growth by modulating their environment.

10. We occasionally refer to time as a separate feature of experiments, following Campbell (1957) and Cook and Campbell (1979), because time can cut across the other factors independently. Cronbach did not include time in his notational system, instead incorporating time into treatment (e.g., the scheduling of treatment), observations (e.g., when measures are administered), or setting (e.g., the historical context of the experiment).

Consider a randomized experiment by Fortin and Kirouac (1976). The treatment was a brief educational course administered by several nurses, who gave a tour of their hospital and covered some basic facts about surgery with individuals who were to have elective abdominal or thoracic surgery 15 to 20 days later in a single Montreal hospital. Ten specific outcome measures were used after the surgery, such as an activities of daily living scale and a count of the analgesics used to control pain. Now compare this study with its likely target constructs: whether patient education (the target cause) promotes physical recovery (the target effect) among surgical patients (the target population of units) in hospitals (the target universe of settings). Another example occurs in basic research, in which the question frequently arises as to whether the actual manipulations and measures used in an experiment really tap into the specific cause and effect constructs specified by the theory. One way to dismiss an empirical challenge to a theory is simply to make the case that the data do not really represent the concepts as they are specified in the theory.

Empirical results often force researchers to change their initial understanding of what the domain under study is. Sometimes the reconceptualization leads to a more restricted inference about what has been studied. Thus the planned causal agent in the Fortin and Kirouac (1976) study, patient education, might need to be respecified as informational patient education if the information component of the treatment proved to be causally related to recovery from surgery but the tour of the hospital did not. Conversely, data can sometimes lead researchers to think in terms of target constructs and categories that are more general than those with which they began a research program. Thus the creative analyst of patient education studies might surmise that the treatment is a subclass of interventions that function by increasing "perceived control" or that recovery from surgery can be treated as a subclass of "personal coping." Subsequent readers of the study can even add their own interpretations, perhaps claiming that perceived control is really just a special case of the even more general self-efficacy construct. There is a subtle interplay over time among the original categories the researcher intended to represent, the study as it was actually conducted, the study results, and subsequent interpretations.
This interplay can change the researcher'sthinking about what the siudy particulars actually achieved at a more conceptual level, as can feedback fromreaders. But whatever reconceptualizationsoccur' the first problem of causal generaltzationis always the same: How can we generalizefrom a sample of instancesand the data patterns associatedwith them to the particular target constructs they represent?

External Validity: Causal Generalization as Extrapolation


The second problem of generalization is to infer whether a causal relationship holds over variations in persons, settings, treatments, and outcomes. For example, someone reading the results of an experiment on the effects of a kindergarten Head Start program on the subsequent grammar school reading test scores of poor African American children in Memphis during the 1980s may want to know if a program with partially overlapping cognitive and social development goals would be as effective in improving the mathematics test scores of poor Hispanic children in Dallas if this program were to be implemented tomorrow.

This example again reminds us that generalization is not a synonym for broader application. Here, generalization is from one city to another city and


from one kind of clientele to another kind, but there is no presumption that Dallas is somehow broader than Memphis or that Hispanic children constitute a

broader population than African American children. Of course, some generalizations are from narrow to broad. For example, a researcher who randomly samples experimental participants from a national population may generalize (probabilistically) from the sample to all the other unstudied members of that same population. Indeed, that is the rationale for choosing random selection in the first place. Similarly, when policymakers consider whether Head Start should be continued on a national basis, they are not so interested in what happened in Memphis. They are more interested in what would happen on the average across the United States, as its many local programs still differ from each other despite efforts in the 1990s to standardize much of what happens to Head Start children and parents. But generalization can also go from the broad to the narrow. Cronbach (1982) gives the example of an experiment that studied differences between the performances of groups of students attending private and public schools. In this case, the concern of individual parents is to know which type of school is better for their particular child, not for the whole group. Whether from narrow to broad, broad to narrow, or across units at about the same level of aggregation, all these examples of external validity questions share the same need: to infer the extent to which the effect holds over variations in persons, settings, treatments, or outcomes.

Approaches to Making Causal Generalizations


Whichever way the causal generalization issue is framed, experiments do not seem at first glance to be very useful. Almost invariably, a given experiment uses a limited set of operations to represent units, treatments, outcomes, and settings. This high degree of localization is not unique to the experiment; it also characterizes case studies, performance monitoring systems, and opportunistically administered marketing questionnaires given to, say, a haphazard sample of respondents at local shopping centers (Shadish, 1995b). Even when questionnaires are administered to nationally representative samples, they are ideal for representing that particular population of persons but have little relevance to citizens outside of that nation. Moreover, responses may also vary by the setting in which the interview took place (a doorstep, a living room, or a work site), by the time of day at which it was administered, by how each question was framed, or by the particular race, age, and gender combination of interviewers. But the fact that the experiment is not alone in its vulnerability to generalization issues does not make it any less a problem. So what is it that justifies any belief that an experiment can achieve a better fit between the sampling particulars of a study and more general inferences to constructs or over variations in persons, settings, treatments, and outcomes?


Sampling and Causal Generalization


The method most often recommended for achieving this close fit is the use of formal probability sampling of instances of units, treatments, observations, or settings (Rossi, Wright, & Anderson, 1983). This presupposes that we have clearly delineated populations of each and that we can sample with known probability from within each of these populations. In effect, this entails the random selection of instances, to be carefully distinguished from random assignment discussed earlier in this chapter. Random selection involves selecting cases by chance to represent that population, whereas random assignment involves assigning cases to multiple conditions.

In cause-probing research that is not experimental, random samples of individuals are often used. Large-scale longitudinal surveys such as the Panel Study of Income Dynamics or the National Longitudinal Survey are used to represent the population of the United States (or certain age brackets within it), and measures of potential causes and effects are then related to each other using time lags in measurement and statistical controls for group nonequivalence. All this is done in hopes of approximating what a randomized experiment achieves. However, cases of random selection from a broad population followed by random assignment from within this population are much rarer (see Chapter 12 for examples). Also rare are studies of random selection followed by a quality quasi-experiment. Such experiments require a high level of resources and a degree of logistical control that is rarely feasible, so many researchers prefer to rely on an implicit set of nonstatistical heuristics for generalization that we hope to make more explicit and systematic in this book.

Random selection occurs even more rarely with treatments, outcomes, and settings than with people. Consider the outcomes observed in an experiment. How often are they randomly sampled? We grant that the domain sampling model of classical test theory (Nunnally & Bernstein, 1994) assumes that the items used to measure a construct have been randomly sampled from a domain of all possible items. However, in actual experimental practice few researchers ever randomly sample items when constructing measures. Nor do they do so when choosing manipulations or settings. For instance, many settings will not agree to be sampled, and some of the settings that agree to be randomly sampled will almost certainly not agree to be randomly assigned to conditions. For treatments, no definitive list of possible treatments usually exists, as is most obvious in areas in which treatments are being discovered and developed rapidly, such as in AIDS research. In general, then, random sampling is always desirable, but it is only rarely and contingently feasible.

However, formal sampling methods are not the only option. Two informal, purposive sampling methods are sometimes useful: purposive sampling of heterogeneous instances and purposive sampling of typical instances. In the former case, the aim is to include instances chosen deliberately to reflect diversity on presumptively important dimensions, even though the sample is not formally random. In the latter


case, the aim is to explicate the kinds of units, treatments, observations, and settings to which one most wants to generalize and then to select at least one instance of each class that is impressionistically similar to the class mode. Although these purposive sampling methods are more practical than formal probability sampling, they are not backed by a statistical logic that justifies formal generalizations. Nonetheless, they are probably the most commonly used of all sampling methods for facilitating generalizations. A task we set ourselves in this book is to explicate such methods and to describe how they can be used more often than is the case today.

However, sampling methods of any kind are insufficient to solve either problem of generalization. Formal probability sampling requires specifying a target population from which sampling then takes place, but defining such populations is difficult for some targets of generalization such as treatments. Purposive sampling of heterogeneous instances is differentially feasible for different elements in a study; it is often more feasible to make measures diverse than it is to obtain diverse settings, for example. Purposive sampling of typical instances is often feasible when target modes, medians, or means are known, but it leaves questions about generalizations to a wider range than is typical. Besides, as Cronbach points out, most challenges to the causal generalization of an experiment typically emerge after a study is done. In such cases, sampling is relevant only if the instances in the original study were sampled diversely enough to promote responsible reanalyses of the data to see if a treatment effect holds across most or all of the targets about which generalization has been challenged. But packing so many sources of variation into a single experimental study is rarely practical and will almost certainly conflict with other goals of the experiment. Formal sampling methods usually offer only a limited solution to causal generalization problems. A theory of generalized causal inference needs additional tools.
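The distinction drawn in this section between random selection, random assignment, and purposive sampling of heterogeneous instances can be made concrete with a short sketch. The code below is only an illustration under stated assumptions: the population of 300 units, the `setting` attribute, and all function names are hypothetical and are not drawn from any study discussed here.

```python
import random


def random_selection(population, n, rng):
    """Select n cases by chance so that the sample represents the population."""
    return rng.sample(population, n)


def random_assignment(cases, conditions, rng):
    """Assign already-selected cases to conditions by chance (equal group sizes
    when the number of cases is a multiple of the number of conditions)."""
    shuffled = list(cases)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, case in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(case)
    return groups


def purposive_heterogeneous(population, key, n_per_level):
    """Deliberately take cases from every level of a presumptively important
    dimension. Nothing here is random, so no formal statistical
    generalization is justified by this procedure."""
    picked = {}
    for case in population:
        bucket = picked.setdefault(key(case), [])
        if len(bucket) < n_per_level:
            bucket.append(case)
    return [case for bucket in picked.values() for case in bucket]


# Hypothetical population: 300 units spread over three kinds of settings.
rng = random.Random(0)
settings = ("urban", "suburban", "rural")
population = [{"id": i, "setting": settings[i % 3]} for i in range(300)]

sample = random_selection(population, 30, rng)                    # supports generalizing to the population
groups = random_assignment(sample, ["treatment", "control"], rng)  # supports the causal comparison
diverse = purposive_heterogeneous(population, lambda c: c["setting"], 2)
```

Random selection and random assignment are separate steps that serve different inferences, so a study can use either one without the other. The purposive function guarantees diversity on the chosen dimension but, as the text notes, carries no statistical warrant.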

A Grounded Theory of Causal Generalization


Practicing scientists routinely make causal generalizations in their research, and they almost never use formal probability sampling when they do. In this book, we present a theory of causal generalization that is grounded in the actual practice of science (Matt, Cook, & Shadish, 2000). Although this theory was originally developed from ideas that were grounded in the construct and external validity literatures (Cook, 1990, 1991), we have since found that these ideas are common in a diverse literature about scientific generalizations (e.g., Abelson, 1995; Campbell & Fiske, 1959; Cronbach & Meehl, 1955; Davis, 1994; Locke, 1986; Medin, 1989; Messick, 1989, 1995; Rubins, 1994; Willner, 1991; Wilson, Hayward, Tunis, Bass, & Guyatt, 1995). We provide more details about this grounded theory in Chapters 11 through 13, but in brief it suggests that scientists make causal generalizations in their work by using five closely related principles:

1. Surface Similarity. They assess the apparent similarities between study operations and the prototypical characteristics of the target of generalization.


2. Ruling Out Irrelevancies. They identify those things that are irrelevant because they do not change a generalization.
3. Making Discriminations. They clarify key discriminations that limit generalization.
4. Interpolation and Extrapolation. They make interpolations to unsampled values within the range of the sampled instances and, much more difficult, they explore extrapolations beyond the sampled range.
5. Causal Explanation. They develop and test explanatory theories about the pattern of effects, causes, and mediational processes that are essential to the transfer of a causal relationship.

In this book, we want to show how scientists can and do use these five principles to draw generalized conclusions about a causal connection. Sometimes the conclusion is about the higher order constructs to use in describing an obtained connection at the sample level. In this sense, these five principles have analogues or parallels both in the construct validity literature (e.g., with construct content, with convergent and discriminant validity, and with the need for theoretical rationales for constructs) and in the cognitive science and philosophy literatures that study how people decide whether instances fall into a category (e.g., concerning the roles that prototypical characteristics and surface versus deep similarity play in determining category membership). But at other times, the conclusion about generalization refers to whether a connection holds broadly or narrowly over variations in persons, settings, treatments, or outcomes. Here, too, the principles have analogues or parallels that we can recognize from scientific theory and practice, as in the study of dose-response relationships (a form of interpolation-extrapolation) or the appeal to explanatory mechanisms in generalizing from animals to humans (a form of causal explanation).
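The fourth principle can be illustrated with a dose-response example. The sketch below is a minimal illustration under assumptions of our own: the doses and responses are invented, the function names are hypothetical, and piecewise-linear interpolation is just one simple way to estimate a response at an unsampled dose. The point is only that an estimate inside the sampled range rests on observed neighbors, whereas a target outside that range has no such support.

```python
def classify_target(sampled_doses, target_dose):
    """Label a target dose by whether it lies inside the sampled range."""
    low, high = min(sampled_doses), max(sampled_doses)
    return "interpolation" if low <= target_dose <= high else "extrapolation"


def interpolate_response(doses, responses, target_dose):
    """Piecewise-linear estimate of the response at an unsampled dose.

    Refuses to extrapolate: outside the sampled range the data alone give
    no basis for an estimate, so a ValueError is raised instead.
    """
    if classify_target(doses, target_dose) == "extrapolation":
        raise ValueError("target dose lies outside the sampled range")
    points = sorted(zip(doses, responses))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= target_dose <= x1:
            if x1 == x0:
                return y0
            return y0 + (target_dose - x0) / (x1 - x0) * (y1 - y0)
    return points[-1][1]  # reached only when a single dose was sampled


# Hypothetical study that sampled three doses of a treatment.
doses = [10, 20, 40]
responses = [1.0, 2.0, 3.0]

inside = classify_target(doses, 30)                # "interpolation"
estimate = interpolate_response(doses, responses, 30)
outside = classify_target(doses, 80)               # "extrapolation"
```

Raising an error for the out-of-range dose is a deliberate design choice in this sketch. It mirrors the text's warning that extrapolation is much more difficult and needs support from other principles, such as causal explanation.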
Scientists use these five principles almost constantly during all phases of research. For example, when they read a published study and wonder if some variation on the study's particulars would work in their lab, they think about similarities of the published study to what they propose to do. When they conceptualize the new study, they anticipate how the instances they plan to study will match the prototypical features of the constructs about which they are curious. They may design their study on the assumption that certain variations will be irrelevant to it but that others will point to key discriminations over which the causal relationship does not hold or the very character of the constructs changes. They may include measures of key theoretical mechanisms to clarify how the intervention works. During data analysis, they test all these hypotheses and adjust their construct descriptions to match better what the data suggest happened in the study. The introduction section of their articles tries to convince the reader that the study bears on specific constructs, and the discussion sometimes speculates about how results might extrapolate to different units, treatments, outcomes, and settings.

Further, practicing scientists do all this not just with single studies that they read or conduct but also with multiple studies. They nearly always think about


how their own studies fit into a larger literature about both the constructs being measured and the variables that may or may not bound or explain a causal connection, often documenting this fit in the introduction to their study. And they apply all five principles when they conduct reviews of the literature, in which they make inferences about the kinds of generalizations that a body of research can support.

Throughout this book, and especially in Chapters 11 to 13, we provide more details about this grounded theory of causal generalization and about the scientific practices that it suggests. Adopting this grounded theory of generalization does not imply a rejection of formal probability sampling. Indeed, we recommend such sampling unambiguously when it is feasible, along with purposive sampling schemes to aid generalization when formal random selection methods cannot be implemented. But we also show that sampling is just one method that practicing scientists use to make causal generalizations, along with practical logic, application of diverse statistical methods, and use of features of design other than sampling.

AND METASCIENCE EXPERIMENTS


Extensive philosophical debate sometimes surrounds experimentation. Here we briefly summarize some key features of these debates, and then we discuss some implications of these debates for experimentation. However, there is a sense in which all this philosophical debate is incidental to the practice of experimentation. Experimentation is as old as humanity itself, so it preceded humanity's philosophical efforts to understand causation and generalization by thousands of years. Even over just the past 400 years of scientific experimentation, we can see some constancy of experimental concept and method, whereas diverse philosophical conceptions of the experiment have come and gone. As Hacking (1983) said, "Experimentation has a life of its own" (p. 150). It has been one of science's most powerful methods for discovering descriptive causal relationships, and it has done so well in so many ways that its place in science is probably assured forever. To justify its practice today, a scientist need not resort to sophisticated philosophical reasoning about experimentation.

Nonetheless, it does help scientists to understand these philosophical debates. For example, previous distinctions in this chapter between molar and molecular causation, descriptive and explanatory cause, or probabilistic and deterministic causal inferences all help both philosophers and scientists to understand better both the purpose and the results of experiments (e.g., Bunge, 1959; Eells, 1991; Hart & Honore, 1985; Humphreys, 1989; Mackie, 1974; Salmon, 1984, 1989; Sobel, 1993; P. A. White, 1990). Here we focus on a different and broader set of critiques of science itself, not only from philosophy but also from the history, sociology, and psychology of science (see useful general reviews by Bechtel, 1988; H. I. Brown, 1977; Oldroyd, 1986). Some of these works have been explicitly about the nature of experimentation, seeking to create a justified role for it (e.g.,


Bhaskar, 1975; Campbell, 1982, 1988; Danziger, 1990; S. Drake, 1981; Gergen, 1973; Gholson, Shadish, Neimeyer, & Houts, 1989; Gooding, Pinch, & Schaffer, 1989b; Greenwood, 1989; Hacking, 1983; Latour, 1987; Latour & Woolgar, 1979; Morawski, 1988; Orne, 1962; R. Rosenthal, 1966; Shadish & Fuller, 1994; Shapin, 1994). These critiques help scientists to see some limits of experimentation in both science and society.

The Kuhnian Critique

Kuhn (1962) described scientific revolutions as different and partly incommensurable paradigms that abruptly succeeded each other in time and in which the gradual accumulation of scientific knowledge was a chimera. Hanson (1958), Polanyi (1958), Popper (1959), Toulmin (1961), Feyerabend (1975), and Quine (1951, 1969) contributed to the critical momentum, in part by exposing the gross mistakes in logical positivism's attempt to build a philosophy of science based on reconstructing a successful science such as physics. All these critiques denied any firm foundations for scientific knowledge (so, by extension, experiments do not provide firm causal knowledge). The logical positivists hoped to achieve foundations on which to build knowledge by tying all theory tightly to theory-free observation through predicate logic. But this left out important scientific concepts that could not be tied tightly to observation; and it failed to recognize that all observations are impregnated with substantive and methodological theory, making it impossible to conduct theory-free tests.11

The impossibility of theory-neutral observation (often referred to as the Quine-Duhem thesis) implies that the results of any single test (and so any single experiment) are inevitably ambiguous. They could be disputed, for example, on grounds that the theoretical assumptions built into the outcome measure were wrong or that the study made a faulty assumption about how high a treatment dose was required to be effective. Some of these assumptions are small, easily detected, and correctable, such as when a voltmeter gives the wrong reading because the impedance of the voltage source was much higher than that of the meter (Wilson, 1952). But other assumptions are more paradigmlike, impregnating a theory so completely that other parts of the theory make no sense without them (e.g., the assumption that the earth is the center of the universe in pre-Galilean astronomy). Because the number of assumptions involved in any scientific test is very large, researchers can easily find some assumptions to fault or can even posit new

11. However, Holton (1986) reminds us not to overstate the reliance of positivists on empirical data: "Even the father of positivism, Auguste Comte, had written . . . that without a theory of some sort by which to link phenomena to some principles 'it would not only be impossible to combine the isolated observations and draw any useful conclusions, we would not even be able to remember them, and, for the most part, the fact would not be noticed by our eyes'" (p. 32). Similarly, Uebel (1992) provides a more detailed historical analysis of the protocol sentence debate in logical positivism, showing some surprisingly nonstereotypical positions held by key players such as Carnap.


assumptions (Mitroff & Fitzgerald, 1977). In this way, substantive theories are less testable than their authors originally conceived. How can a theory be tested if it is made of clay rather than granite?

For reasons we clarify later, this critique is more true of single studies and less true of programs of research. But even in the latter case, undetected constant biases can result in flawed inferences about cause and its generalization. As a result, no experiment is ever fully certain, and extrascientific beliefs and preferences always have room to influence the many discretionary judgments involved in all scientific belief.

Modern Social Psychological Critiques


Sociologists working within traditions variously called social constructivism, epistemological relativism, and the strong program (e.g., Barnes, 1974; Bloor, 1976; Collins, 1981; Knorr-Cetina, 1981; Latour & Woolgar, 1979; Mulkay, 1979) have shown those extrascientific processes at work in science. Their empirical studies show that scientists often fail to adhere to norms commonly proposed as part of good science (e.g., objectivity, neutrality, sharing of information). They have also shown how that which comes to be reported as scientific knowledge is partly determined by social and psychological forces and partly by issues of economic and political power both within science and in the larger society, issues that are rarely mentioned in published research reports. The most extreme among these sociologists attributes all scientific knowledge to such extrascientific processes, claiming that "the natural world has a small or nonexistent role in the construction of scientific knowledge" (Collins, 1981, p. 3).

Collins does not deny ontological realism, that real entities exist in the world. Rather, he denies epistemological (scientific) realism, that whatever external reality may exist can constrain our scientific theories. For example, if atoms really exist, do they affect our scientific theories at all? If our theory postulates an atom, is it describing a real entity that exists roughly as we describe it? Epistemological relativists such as Collins respond negatively to both questions, believing that the most important influences in science are social, psychological, economic, and political, and that these might even be the only influences on scientific theories. This view is not widely endorsed outside a small group of sociologists, but it is a useful counterweight to naive assumptions that scientific studies somehow directly reveal nature to us (an assumption we call naive realism). The results of all studies, including experiments, are profoundly subject to these extrascientific influences, from their conception to reports of their results.

Science and Trust


A standard image of the scientist is as a skeptic, a person who only trusts results that have been personally verified. Indeed, the scientific revolution of the 17th century


claimed that trust, particularly trust in authority and dogma, was antithetical to good science. Every authoritative assertion, every dogma, was to be open to question, and the job of science was to do that questioning.

That image is partly wrong. Any single scientific study is an exercise in trust (Pinch, 1986; Shapin, 1994). Studies trust the vast majority of already developed methods, findings, and concepts that they use when they test a new hypothesis. For example, statistical theories and methods are usually taken on faith rather than personally verified, as are measurement instruments. The ratio of trust to skepticism in any given study is more like 99% trust to 1% skepticism than the opposite. Even in lifelong programs of research, the single scientist trusts much more than he or she ever doubts. Indeed, thoroughgoing skepticism is probably impossible for the individual scientist, to judge from what we know of the psychology of science (Gholson et al., 1989; Shadish & Fuller, 1994). Finally, skepticism is not even an accurate characterization of past scientific revolutions; Shapin (1994) shows that the role of "gentlemanly trust" in 17th-century England was central to the establishment of experimental science. Trust pervades science, despite its rhetoric of skepticism.

Implications for Experiments


The net result of these criticisms is a greater appreciation for the equivocality of all scientific knowledge. The experiment is not a clear window that reveals nature directly to us. To the contrary, experiments yield hypothetical and fallible knowledge that is often dependent on context and imbued with many unstated theoretical assumptions. Consequently, experimental results are partly relative to those assumptions and contexts and might well change with new assumptions or contexts. In this sense, all scientists are epistemological constructivists and relativists. The difference is whether they are strong or weak relativists. Strong relativists share Collins's position that only extrascientific factors influence our theories. Weak relativists believe that both the ontological world and the worlds of ideology, interests, values, hopes, and wishes play a role in the construction of scientific knowledge. Most practicing scientists, including ourselves, would probably describe themselves as ontological realists but weak epistemological relativists.12 To the extent that experiments reveal nature to us, it is through a very clouded windowpane (Campbell, 1988).

Such counterweights to naive views of experiments were badly needed. As recently as 30 years ago, the central role of the experiment in science was probably

12. If space permitted, we could extend this discussion to a host of other philosophical issues that have been raised about the experiment, such as its role in discovery versus confirmation, incorrect assertions that the experiment is tied to some specific philosophy such as logical positivism or pragmatism, and the various mistakes that are frequently made in such discussions (e.g., Campbell, 1982, 1988; Cook, 1991; Cook & Campbell, 1985; Shadish, 1995a).


taken more for granted than is the case today. For example, Campbell and Stanley (1963) described themselves as:


committed to the experiment: as the only means for settling disputes regarding educational practice, as the only way of verifying educational improvements, and as the only way of establishing a cumulative tradition in which improvements can be introduced without the danger of a faddish discard of old wisdom in favor of inferior novelties. (p. 2)

Indeed, Hacking (1983) points out that "'experimental method' used to be just another name for scientific method" (p. 149); and experimentation was then a more fertile ground for examples illustrating basic philosophical issues than it was a source of contention itself.

Not so today. We now understand better that the experiment is a profoundly human endeavor, affected by all the same human foibles as any other human endeavor, though with well-developed procedures for partial control of some of the limitations that have been identified to date. Some of these limitations are common to all science, of course. For example, scientists tend to notice evidence that confirms their preferred hypotheses and to overlook contradictory evidence. They make routine cognitive errors of judgment and have limited capacity to process large amounts of information. They react to peer pressures to agree with accepted dogma and to social role pressures in their relationships to students, participants, and other scientists. They are partly motivated by sociological and economic rewards for their work (sadly, sometimes to the point of fraud), and they display all-too-human psychological needs and irrationalities about their work. Other limitations have unique relevance to experimentation. For example, if causal results are ambiguous, as in many weaker quasi-experiments, experimenters may attribute causation or causal generalization based on study features that have little to do with orthodox logic or method. They may fail to pursue all the alternative causal explanations because of a lack of energy, a need to achieve closure, or a bias toward accepting evidence that confirms their preferred hypothesis. Each experiment is also a social situation, full of social roles (e.g., participant, experimenter, assistant) and social expectations (e.g., that people should provide true information) but with a uniqueness (e.g., that the experimenter does not always tell the truth) that can lead to problems when social cues are misread or deliberately thwarted by either party. Fortunately, these limits are not insurmountable, as formal training can help overcome some of them (Lehman, Lempert, & Nisbett, 1988). Still, the relationship between scientific results and the world that science studies is neither simple nor fully trustworthy.

These social and psychological analyses have taken some of the luster from the experiment as a centerpiece of science. The experiment may have a life of its own, but it is no longer life on a pedestal. Among scientists, belief in the experiment as the only means to settle disputes about causation is gone, though it is still the preferred method in many circumstances. Gone, too, is the belief that the power experimental methods often displayed in the laboratory would transfer easily to applications in field settings. As a result of highly publicized science-related


events such as the tragic results of the Chernobyl nuclear disaster, the disputes over certainty levels of DNA testing in the O.J. Simpson trials, and the failure to find a cure for most cancers after decades of highly publicized and funded effort, the general public now better understands the limits of science.

Yet we should not take these critiques too far. Those who argue against theory-free tests often seem to suggest that every experiment will come out just as the experimenter wishes. This expectation is totally contrary to the experience of researchers, who find instead that experimentation is often frustrating and disappointing for the theories they loved so much. Laboratory results may not speak for themselves, but they certainly do not speak only for one's hopes and wishes. We find much to value in the laboratory scientist's belief in "stubborn facts" with a life span that is greater than the fluctuating theories with which one tries to explain them. Thus many basic results about gravity are the same, whether they are contained within a framework developed by Newton or by Einstein; and no successor theory to Einstein's would be plausible unless it could account for most of the stubborn factlike findings about falling bodies. There may not be pure facts, but some observations are clearly worth treating as if they were facts.

Some theorists of science (Hanson, Polanyi, Kuhn, and Feyerabend included) have so exaggerated the role of theory in science as to make experimental evidence seem almost irrelevant. But exploratory experiments that were unguided by formal theory and unexpected experimental discoveries tangential to the initial research motivations have repeatedly been the source of great scientific advances. Experiments have provided many stubborn, dependable, replicable results that then become the subject of theory. Experimental physicists feel that their laboratory data help keep their more speculative theoretical counterparts honest, giving experiments an indispensable role in science. Of course, these stubborn facts often involve both commonsense presumptions and trust in many well-established theories that make up the shared core of belief of the science in question. And of course, these stubborn facts sometimes prove to be undependable, are reinterpreted as experimental artifacts, or are so laden with a dominant focal theory that they disappear once that theory is replaced. But this is not the case with the great bulk of the factual base, which remains reasonably dependable over relatively long periods of time.

A WORLD WITHOUT EXPERIMENTS OR CAUSES?


To borrow a thought experiment from MacIntyre (1981), imagine that the slates of science and philosophy were wiped clean and that we had to construct our understanding of the world anew. As part of that reconstruction, would we reinvent the notion of a manipulable cause? We think so, largely because of the practical utility that dependable manipulanda have for our ability to survive and prosper. Would we reinvent the experiment as a method for investigating such causes?

Again yes, because humans will always be trying to better know how well these manipulable causes work. Over time, they will refine how they conduct those experiments and so will again be drawn to problems of counterfactual inference, of cause preceding effect, of alternative explanations, and of all of the other features of causation that we have discussed in this chapter. In the end, we would probably end up with the experiment or something very much like it. This book is one more step in that ongoing process of refining experiments. It is about improving the yield from experiments that take place in complex field settings, both the quality of causal inferences they yield and our ability to generalize these inferences to constructs and over variations in persons, settings, treatments, and outcomes.

A Critical Assessment of Our Assumptions


As·sump·tion (ə-sŭmp′shən): [Middle English assumpcion, from Latin assūmptiō, assūmptiōn-, adoption, from assūmptus, past participle of assūmere, to adopt; see assume.] n. 1. The act of taking to or upon oneself: assumption of an obligation. 2. The act of taking over: assumption of command. 3. The act of taking for granted: assumption of a false theory. 4. Something taken for granted or accepted as true without proof; a supposition: a valid assumption. 5. Presumption; arrogance. 6. Logic. A minor premise.

This book covers five central topics across its 13 chapters. The first topic (Chapter 1) deals with our general understanding of descriptive causation and experimentation. The second (Chapters 2 and 3) deals with the types of validity and the specific validity threats associated with this understanding. The third (Chapters 4 through 7) deals with quasi-experiments and illustrates how combining design features can facilitate better causal inference. The fourth (Chapters 8 through 10) concerns randomized experiments and stresses the factors that impede and promote their implementation. The fifth (Chapters 11 through 13) deals with causal generalization, both theoretically and as concerns the conduct of individual studies and programs of research. The purpose of this last chapter is to critically assess some of the assumptions that have gone into these five topics, especially the assumptions that critics have found objectionable or that we anticipate they will find objectionable. We organize the discussion around each of the five topics and then briefly justify why we did not deal more extensively with nonexperimental methods for assessing causation.

We do not delude ourselves that we can be the best explicators of our own assumptions. Our critics can do that task better. But we want to be as comprehensive and as explicit as we can. This is in part because we are convinced of the advantages of falsification as a major component of any epistemology for the social sciences, and forcing out one's assumptions and confronting them is one part of falsification. But it is also because we would like to stimulate critical debate about these assumptions so that we can learn from those who would challenge our thinking. If there were to be a future book that carried even further forward the tradition emanating from Campbell and Stanley via Cook and Campbell to this book, then that future book would probably be all the better for building upon all the justified criticisms coming from those who do not agree with us, either on particulars or on the whole approach we have taken to the analysis of descriptive causation and its generalization. We would like this chapter not only to model the attempt to be critical about the assumptions all scholars must inevitably make but also to encourage others to think about these assumptions and how they might be addressed in future empirical or theoretical work.

CAUSATION AND EXPERIMENTATION
Causal Arrows and Pretzels


Experiments test the influence of one or at most a small subset of descriptive causes. If statistical interactions are involved, they tend to be among very few treatments or between a single treatment and a limited set of moderator variables. Many researchers believe that the causal knowledge that results from this typical experimental structure fails to map the many causal forces that simultaneously affect any given outcome in complex and nonlinear ways (e.g., Cronbach et al., 1980; Magnusson, 2000). These critics assert that experiments prioritize on arrows connecting A to B when they should instead seek to describe an explanatory pretzel or set of intersecting pretzels, as it were. They also believe that most causal relationships vary across units, settings, and times, and so they doubt whether there are any constant bivariate causal relationships (e.g., Cronbach & Snow, 1977). Those that do appear to be dependable in the data may simply reflect statistically underpowered tests of moderators or mediators that failed to reveal the true underlying complex causal relationships. True variation in effect sizes might also be obscured because the relevant substantive theory is underspecified, or the outcome measures are partially invalid, or the treatment contrast is attenuated, or causally implicated variables are truncated in how they are sampled (McClelland & Judd, 1993).

As valid as these objections are, they do not invalidate the case for experiments. The purpose of experiments is not to completely explain some phenomenon; it is to identify whether a particular variable or small set of variables makes a marginal difference in some outcome over and above all the other forces affecting that outcome. Moreover, ontological doubts such as the preceding have not stopped believers in more complex causal theories from acting as though many causal relationships can be usefully characterized as dependable main effects or as very simple nonlinearities that are also dependable enough to be useful. In this connection, consider some examples from education in the United States, where objections to experimentation are probably the most prevalent and virulent. Few educational researchers seem to object to the following substantive conclusions of the form that A dependably causes B: small schools are better than large ones; time-on-task raises achievement; summer school raises test scores; school desegregation hardly affects achievement but does increase White flight; and assigning and grading homework raises achievement. The critics also do not seem to object to other conclusions involving very simple causal contingencies: reducing class size increases achievement, but only if the amount of change is "sizable" and to a level under 20; or Catholic schools are superior to public ones, but only in the inner city and not in the suburbs and then most noticeably in graduation rates rather than in achievement test scores.

The primary justification for such oversimplifications, and for the use of the experiments that test them, is that some moderators of effects are of minor relevance to policy and theory even if they marginally improve explanation. The most important contingencies are usually those that modify the sign of a causal relationship rather than its magnitude. Sign changes imply that a treatment is beneficial in some circumstances but might be harmful in others. This is quite different from identifying circumstances that influence just how positive an effect might be. Policy-makers are often willing to advocate an overall change, even if they suspect it has different-sized positive effects for different groups, as long as the effects are rarely negative. But if some groups will be positively affected and others negatively, political actors are loath to prescribe different treatments for different groups because rivalries and jealousies often ensue. Theoreticians also probably pay more attention to causal relationships that differ in causal sign because this result implies that one can identify the boundary conditions that impel such a disparate data pattern.
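The distinction between moderators that change the sign of an effect and those that change only its magnitude can be made concrete with a small sketch. The Python below is illustrative only: the `SubgroupEffect` record, the `classify_moderation` function, the `tolerance` parameter, and the numbers in the usage example are all hypothetical, and the sketch ignores sampling error, which any real analysis of subgroup effects would have to address.

```python
from dataclasses import dataclass


@dataclass
class SubgroupEffect:
    """Hypothetical record: one subgroup and its estimated effect."""
    group: str
    effect: float  # treatment-minus-control difference for this subgroup


def classify_moderation(effects, tolerance=0.0):
    """Label subgroup effects as 'sign', 'magnitude', or 'none'.

    'sign'      -> some subgroups benefit and others are harmed
    'magnitude' -> all effects share a direction but differ in size
    'none'      -> effects do not differ by more than `tolerance`
    """
    values = [e.effect for e in effects]
    has_positive = any(v > tolerance for v in values)
    has_negative = any(v < -tolerance for v in values)
    if has_positive and has_negative:
        return "sign"
    if max(values) - min(values) > tolerance:
        return "magnitude"
    return "none"


if __name__ == "__main__":
    # Made-up effect estimates, used only to exercise the function.
    same_direction = [SubgroupEffect("inner city", 0.4),
                      SubgroupEffect("suburb", 0.1)]
    opposite_direction = [SubgroupEffect("group A", 0.3),
                          SubgroupEffect("group B", -0.2)]
    print(classify_moderation(same_direction))      # magnitude
    print(classify_moderation(opposite_direction))  # sign
```

On the argument in the text, only the second pattern would usually force a policy-maker to consider prescribing different treatments for different groups.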
Of course, we do not advocate ignoring all causal contingencies. For example, physicians routinely prescribe one of several possible interventions for a given diagnosis. The exact choice may depend on the diagnosis, test results, patient preferences, insurance resources, and the availability of treatments in the patient's area. However, the costs of such a contingent system are high. In part to limit the number of relevant contingencies, physicians specialize, and within their own specialty they undergo extensive training to enable them to make these contingent decisions. Even then, substantial judgment is still required to cover the many situations in which causal contingencies are ambiguous or in dispute. In many other policy domains it would also be costly to implement the financial, management, and cultural changes that a truly contingent system would require even if the requisite knowledge were available. Taking such a contingent approach to its logical extremes would entail in education, for example, that individual tutoring become the order of the day. Students and instructors would have to be carefully matched for overlap in teaching and learning skills and in the curriculum supports they would need.

Within limits, some moderators can be studied experimentally, either by measuring the moderator so it can be tested during analysis or by deliberately varying it in the next study in a program of research. In conducting such experiments, one moves away from the black-box experiments of yesteryear toward taking causal contingencies more seriously and toward routinely studying them by, for example, disaggregating the treatment to examine its causally effective components, disaggregating the effect to examine its causally impacted components, conducting analyses of demographic and psychological moderator variables, and exploring the causal pathways through which (parts of) the treatment affects (parts of) the outcome. To do all of this well in a single experiment is not possible, but to do some of it well is possible and desirable.

Epistemological Criticisms of Experiments


In highlighting statistical conclusion validity and in selecting examples, we have often linked causal description to quantitative methods and hypothesis testing. Many critics will (wrongly) see this as implying a discredited theory of positivism. As a philosophy of science first outlined in the early 19th century, positivism rejected metaphysical speculations, especially about unobservables, and equated knowledge with descriptions of experienced phenomena. A narrower school of logical positivism emerged in the early 20th century that also rejected realism while also emphasizing the use of data-theory connections in predicate logic form and a preference for predicting phenomena over explaining them. Both these related epistemologies were long ago discredited, especially as explanations of how science operates. So few critics seriously criticize experiments on this basis. However, many critics use the term positivism with less historical fidelity to attack quantitative social science methods in general (e.g., Lincoln & Guba, 1985). Building on the rejection of logical positivism, they reject the use of quantification and formal logic in observation, measurement, and hypothesis testing. Because these last features are part of experiments, to reject this loose conception of positivism entails rejecting experiments. However, the errors in such criticisms are numerous. For example, to reject a specific feature of positivism (like the idea that quantification and predicate logic are the only permissible links between data and theory) does not necessarily imply rejecting all related and more general propositions (such as the notion that some kinds of quantification and hypothesis testing may be useful for knowledge growth). We and others have outlined more such errors elsewhere (Phillips, 1990; Shadish, 1995a).

Other epistemological criticisms of experimentation cite the work of historians of science such as Kuhn (1962), of sociologists of science such as Latour and Woolgar (1979), and of philosophers of science such as Harré (1981). These critics tend to focus on three things. One is the incommensurability of theories, the notion that theories are never perfectly specified and so can always be reinterpreted. As a result, when disconfirming data seem to imply that a theory should be rejected, its postulates can instead be reworked in order to make the theory and observations consistent with each other. This is usually done by adding new contingencies to the theory that limit the conditions under which it is thought to hold. A second critique is of the assumption that experimental observations can be used as truth tests. We would like observations to be objective assessments that can adjudicate between different theoretical explanations of a phenomenon. But in practice, observations are not theory neutral; they are open to multiple interpretations that include such irrelevancies as the researcher's hopes, dreams, and predilections. The consequence is that observations rarely result in definitive hypothesis tests. The final criticism follows from the many behavioral and cognitive inconsistencies between what scientists do in practice and what scientific norms prescribe they should do. Descriptions of scientists' behavior in laboratories reveal them as choosing to do particular experiments because they have an intuition about a relationship, or they are simply curious to see what happens, or they want to play with a new piece of equipment they happen to find lying around. Their impetus, therefore, is not a hypothesis carefully deduced from a theory that they then test by means of careful observation.

Although these critiques have some credibility, they are overgeneralized. Few experimenters believe that their work yields definitive results even after it has been subjected to professional review. Further, though these philosophical, historical, and social critiques complicate what a "fact" means for any scientific method, nonetheless many relationships have stubbornly recurred despite changes associated with the substantive theories, methods, and researcher biases that first generated them. Observations may never achieve the status of "facts," but many of them are so stubbornly replicable that they may be considered as though they were facts.
For experimenters, the trick is to make sure that observations are not impregnated with just one theory, and this is done by building multiple theories into observations and by valuing independent replications, especially those of substantive critics, an approach we have elsewhere called critical multiplism (Cook, 1985; Shadish, 1989, 1994). Although causal claims can never be definitively tested and proven, individual experiments still manage to probe such claims. For example, if a study produces negative results, it is often the case that program developers and other advocates then bring up methodological and substantive contingencies that might have changed the result. For instance, they might contend that a different outcome measure or population would have led to a different conclusion. Subsequent studies then probe these alternatives and, if they again prove negative, lead to yet another round of probes of whatever new explanatory possibilities have emerged. After a time, this process runs out of steam, so particularistic are the contingencies that remain to be examined. It is as though a consensus emerges: "The causal relationship was not obtained under many conditions. The conditions that remain to be examined are so circumscribed that the intervention will not be worth much even if it is effective under these conditions." We agree that this process is as much or more social than logical. But the reality of elastic theory does not mean that decisions about causal hypotheses are only social and devoid of all empirical and logical content.


The criticisms noted are especially useful in highlighting the limited value of individual studies relative to reviews of research programs. Such reviews are better because the greater diversity of study features makes it less likely that the same theoretical biases that inevitably impregnate any one study will reappear across all the studies under review. Still, a dialectic process of point, response, and counterpoint is needed even with reviews, again implying that no single review is definitive. For example, in response to Smith and Glass's (1977) meta-analytic claim that psychotherapy was effective, Eysenck (1977) and Presby (1977) pointed out methodological and substantive contingencies that challenged the original reviewers' results. They suggested that a different answer would have been achieved if Smith and Glass had not combined randomized and nonrandomized experiments or if they had used narrower categories in which to classify types of therapy. Subsequent studies probed these challenges to Smith and Glass or brought forth novel ones (e.g., Weisz et al., 1992). This process of challenging causal claims with specific alternatives has now slowed in reviews of psychotherapy as many major contingencies that might limit effectiveness have been explored. The current consensus from reviews of many experiments in many kinds of settings is that psychotherapy is effective; it is not just the product of a regression process (spontaneous remission) whereby those who are temporarily in need seek professional help and get better, as they would have even without the therapy.
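The regression process mentioned here can be illustrated with a short simulation. The Python sketch below is a hypothetical illustration, not an analysis from the psychotherapy literature: the function name `simulate_help_seekers`, its parameters, and the score model (a stable level plus independent normally distributed fluctuations) are all assumptions made for the example, and no treatment effect is simulated at all.

```python
import random
import statistics


def simulate_help_seekers(n_people=10_000, seek_fraction=0.10, seed=0):
    """Return mean gains for 'therapy' and 'control' help-seekers.

    Assumed model: each observed score is a stable level plus a
    temporary fluctuation. People with the lowest initial scores seek
    help and are randomly split into two groups. Neither group
    receives any simulated benefit.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        stable = rng.gauss(0.0, 1.0)            # enduring level of well-being
        before = stable + rng.gauss(0.0, 1.0)   # score when help is sought
        after = stable + rng.gauss(0.0, 1.0)    # later score, no treatment effect
        people.append((before, after))

    # Those temporarily (or stably) worst off are the ones who seek help.
    people.sort(key=lambda person: person[0])
    seekers = people[: int(n_people * seek_fraction)]

    # Random assignment of help-seekers to the two conditions.
    rng.shuffle(seekers)
    half = len(seekers) // 2
    therapy, control = seekers[:half], seekers[half:]

    def mean_gain(group):
        return statistics.mean(after - before for before, after in group)

    return mean_gain(therapy), mean_gain(control)


if __name__ == "__main__":
    therapy_gain, control_gain = simulate_help_seekers()
    print(f"therapy gain: {therapy_gain:.2f}, control gain: {control_gain:.2f}")
```

Under these assumptions both groups improve by a similar amount even though the simulated therapy does nothing, which is why a pre-post gain among help-seekers alone cannot establish effectiveness, and why a consensus that rests on comparisons with control groups across many experiments speaks to the regression alternative.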

Neglected Ancillary Questions


Our focus on causal questions within an experimental framework neglects many other questions that are relevant to causation. These include questions about how to decide on the importance or leverage of any single causal question. This could entail exploring whether a causal question is even warranted, as it often is not at the early stage of development of an issue. Or it could entail exploring what type of causal question is more important: one that fills an identified hole in some literature, or one that sets out to identify specific boundary conditions limiting a causal connection, or one that probes the validity of a central assumption held by all the theorists and researchers within a field, or one that reduces uncertainty about an important decision when formerly uncertainty was high. Our approach also neglects the reality that how one formulates a descriptive causal question usually entails meeting some stakeholders' interests in the social research more than those of others. Thus to ask about the effects of a national program meets the needs of Congressional staffs, the media, and policy wonks to learn about whether the program works. But it can fail to meet the needs of local practitioners who usually want to know about the effectiveness of microelements within the program so that they can use this knowledge to improve their daily practice. In more theoretical work, to ask how some intervention affects personal self-efficacy is likely to promote individuals' autonomy needs, whereas to ask about the effects of a persuasive communication designed to change attitudes could well cater to the needs of those who would limit or manipulate such autonomy. Our narrow technical approach to causation also neglected issues related to how such causal knowledge might be used and misused. It gave short shrift to a systematic analysis of the kinds of causal questions that can and cannot be answered through experiments. What about the effects of abortion, divorce, stable cohabitation, birth out of wedlock, and other possibly harmful events that we cannot ethically manipulate? What about the effects of class, race, and gender that are not amenable to experimentation? What about the effects of historical occurrences that can be studied only by using time-series methods on whatever variables might or might not be in the archives? Of what use, one might ask, is a method that cannot get at some of the most important phenomena that shape our social world, often over generations, as in the case of race, class, and gender?

Many statisticians now consider questions about things that cannot be manipulated as being beyond causal analysis, so closely do they link manipulation to causation. To them, the cause must be at least potentially manipulable, even if it is not actually manipulated in a given observational study. Thus they would not consider race a cause, though they would speak of the causal analysis of race in studies in which Black and White couples are, say, randomly assigned to visiting rental units in order to see if the refusal rates vary, or that entail chemically changing skin color to see how individuals are responded to differently as a function of pigmentation, or that systematically vary the racial mix of students in schools or classrooms in order to study teacher responses and student performance. Many critics do not like so tight a coupling of manipulation and causation.
For example, those who do status attainment research consider it obvious that race causally influences how teachers treat individual minority students and thus affects how well these children do in school and therefore what jobs they get and what prospects their own children will subsequently have. So this coupling of cause to manipulation is a real limit of an experimental approach to causation. Although we like the coupling of causation and manipulation for purposes of defining experiments, we do not see it as necessary to all useful forms of cause.

VALIDITY
Objections to Internal Validity
There are several criticisms of Campbell's (1957) validity typology and its extensions (Gadenne, 1976; Kruglanski & Kroy, 1976; Hultsch & Hickey, 1978; Cronbach, 1982; Cronbach et al., 1980). We start first with two criticisms of internal validity raised by Cronbach (1982) and to a lesser extent by Kruglanski and Kroy (1976): (1) an atheoretically defined internal validity (A causes B) is trivial without reference to constructs; and (2) causation in single instances is impossible, including in single experiments.

Internal Validity Is Trivial

Cronbach (1982) writes:

I consider it pointless to speak of causes when all that can be validly meant by reference to a cause in a particular instance is that, on one trial of a partially specified manipulation under conditions A, B, and C, along with other conditions not named, phenomenon P was observed. To introduce the word cause seems pointless. Campbell's writings make internal validity a property of trivial, past-tense, and local statements. (p. 137)

Hence, "causal language is superfluous" (p. 140). Cronbach does not retain a specific role for causal inference in his validity typology at all. Kruglanski and Kroy (1976) criticize internal validity similarly, saying:

The concrete events which constitute the treatment within a specific research are meaningful only as members of a general conceptual category. . . . Thus, it is simply impossible to draw strictly specific conclusions from an experiment: our concepts are general and each presupposes an implicit general theory about resemblance between different concrete cases. (p. 157)

All these authors suggest collapsing internal with construct validity in different ways.

Of course, we agree that researchers conceptualize and discuss treatments and outcomes in conceptual terms. As we said in Chapter 3, constructs are so basic to language and thought that it is impossible to conceptualize scientific work without them. Indeed, in many important respects, the constructs we use constrain what we experience, a point agreed to by theorists ranging from Quine (1951, 1969) to the postmodernists (Conner, 1989; Tester, 1993). So when we say that internal validity concerns an atheoretical local molar causal inference, we do not mean that the researcher should conceptualize experiments or report a causal claim as "something made a difference," to use Cronbach's (1982, p. 130) exaggerated characterization.

Still, it is both sensible and useful to differentiate internal from construct validity. The task of sorting out constructs is demanding enough to warrant separate attention from the task of sorting out causes. After all, operations are concept laden, and it is very rare for researchers to know fully what those concepts are. In fact, the researcher almost certainly cannot know them fully because paradigmatic concepts are so implicitly and universally imbued that those concepts and their assumptions are sometimes entirely unrecognized by research communities for years. Indeed, the history of science is replete with examples of famous series of experiments in which a causal relationship was demonstrated early, but it took years for the cause (or effect) to be consensually and stably named. For instance, in psychology and linguistics many causal relationships originally emanated from a behaviorist paradigm but were later relabeled in cognitive terms; in the early Hawthorne study, illumination effects were later relabeled as effects of obtrusive observers; and some cognitive dissonance effects have been reinterpreted as attribution effects. In the history of a discipline, relationships that are correctly identified as causal can be important even when the cause and effect constructs are incorrectly labeled. Such examples exist because the reasoning used to draw causal inferences (e.g., requiring evidence that treatment preceded outcome) differs from the reasoning used to generalize (e.g., matching operations to prototypical characteristics of constructs). Without understanding what is meant by descriptive causation, we have no means of telling whether a claim to have established such causation is justified.

Cronbach's (1982) prose makes clear that he understands the importance of causal logic; but in the end, his sporadically expressed craft knowledge does not add up to a coherent theory of judging the validity of descriptive causal inferences. His equation of internal validity as part of reproducibility (under replication) misses the point that one can replicate incorrect causal conclusions. His solution to such questions is simply that "the force of each question can be reduced by suitable controls" (1982, p. 233). This is inadequate, for a complete analysis of the problem of descriptive causal inference requires concepts we can use to recognize suitable controls. If a suitable control is one that reduces the plausibility of, say, history or maturation, as Cronbach (1982, p. 233) suggests, this is little more than internal validity as we have formulated it. If one needs the concepts enough to use them, then they should be part of a validity typology for cause-probing methods.

For completeness, we might add that a similar boundary question arises between construct validity and external validity and between construct validity and statistical conclusion validity. In the former case, no scientist ever frames an external validity question without couching the question in the language of constructs. In the latter case, researchers never conceptualize or discuss their results solely in terms of statistics. Constructs are ubiquitous in the process of doing research because they are essential for conceptualizing and reporting operations. But again, the answer to this objection is the same. The strategies for making inferences about a construct are not the same as strategies for making inferences about whether a causal relationship holds over variation in persons, settings, treatments, and outcomes in external validity or for drawing valid statistical conclusions in the case of statistical conclusion validity. Construct validity requires a theoretical argument and an assessment of the correspondence between samples and constructs. External validity requires analyzing whether causal relationships hold over variations in persons, settings, treatments, and outcomes. Statistical conclusion validity requires close examination of the statistical procedures and assumptions used. And again, one can be wrong about construct labels while being right about external or statistical conclusion validity.

Objections to Causation in Single Experiments


A second criticism of internal validity denies the possibility of inferring causation in a single experiment. Cronbach (1982) says that the important feature of causation is the "progressive localization of a cause" (Mackie, 1974, p. 73) over multiple experiments in a program of research in which the uncertainties about the essential features of the cause are reduced to the point at which one can characterize exactly what the cause is and is not. Indeed, much philosophy of causation asserts that we only recognize causes through observing multiple instances of a putative causal relationship, although philosophers differ as to whether the mechanism for recognition involves logical laws or empirical regularities (Beauchamp, 1974; P. White, 1990).

However, some philosophers do defend the position that causes can be inferred in single instances (e.g., Davidson, 1967; Ducasse, 1951; Madden & Humber, 1971). A good example is causation in the law (e.g., Hart & Honore, 1985), by which we judge whether or not one person, say, caused the death of another despite the fact that the defendant may never before have been on trial for a crime. The verdict requires a plausible case that (among other things) the defendant's actions preceded the death of the victim, that those actions were related to the death, that other potential causes of the death are implausible, and that the death would not have occurred had the defendant not taken those actions: the very logic of causal relationships and counterfactuals that we outlined in Chapter 1. In fact, the defendant's criminal history will often be specifically excluded from consideration in judging guilt during the trial. The lesson is clear. Although we may learn more about causation from multiple than from single experiments, we can infer cause in single experiments. Indeed, experimenters will do so whether we tell them to or not. Providing them with conceptual help in doing so is a virtue, not a vice; failing to do so is a major flaw in a theory of cause-probing methods.
Of course, individual experiments virtually always use prior concepts from other experiments. However, such prior conceptualizations are entirely consistent with the claim that internal validity is about causal claims in single experiments. If it were not (at least partly) about single experiments, there would be no point to doing the experiment, for the prior conceptualization would successfully predict what will be observed. The possibility that the data will not support the prior conceptualization makes internal validity essential. Further, prior conceptualizations are not logically necessary; we can experiment to discover effects that we have no prior conceptual structure to expect: "The physicist George Darwin used to say that once in a while one should do a completely crazy experiment, like blowing the trumpet to the tulips every morning for a month. Probably nothing will happen, but if something did happen, that would be a stupendous discovery" (Hacking, 1983, p. 154). But we would still need internal validity to guide us in judging if the trumpets had an effect.

Objections to Descriptive Causation

A few authors object to the very notion of descriptive causation. Typically, however, such objections are made about a caricature of descriptive causation that has not been used in philosophy or in science for many years: for example, a billiard ball model that requires a commitment to deterministic causation or that excludes reciprocal causation. In contrast, most who write about experimentation today espouse theories of probabilistic causation in which the many difficulties associated with identifying dependable causal relationships are humbly acknowledged. Even more important, these critics inevitably use causal-sounding language themselves, for example, replacing "cause" with "mutual simultaneous shaping" (Lincoln & Guba, 1985, p. 151). These replacements seem to us to avoid the word but keep the concept, and for good reason. As we said at the end of Chapter 1, if we wiped the slate clean and constructed our knowledge of the world anew, we believe we would end up reinventing the notion of descriptive causation all over again, so greatly does knowledge of causes help us to survive in the world.

Objections Concerning the Discrimination Between Construct Validity and External Validity


Although we traced the history of the present validity system briefly in Chapter 2, readers may want additional historical perspective on why we made the changes we made in the present book regarding construct and external validity. Both Campbell (1957) and Campbell and Stanley (1963) only used the phrase external validity, which they defined as inferring to what populations, settings, treatment variables, and measurement variables an effect can be generalized. They did not refer at all to construct validity. However, from his subsequent writings (Campbell, 1986), it is clear Campbell thought of construct validity as being part of external validity. In Campbell and Stanley, therefore, external validity subsumed generalizing from research operations about persons, settings, causes, and effects for the purposes of labeling these particulars in more abstract terms, and also generalizing by identifying sources of variation in causal relationships that are attributable to person, setting, cause, and effect factors. All subsequent conceptualizations also share the same generic strategy based on sampling instances of persons, settings, causes, and effects and then evaluating them for their presumed correspondence to targets of inference. In Campbell and Stanley's formulation, person, setting, cause, and effect categories share two basic similarities despite their surface differences; to wit, all of them have both ostensive qualities and construct representations. Populations of persons or settings are composed of units that are obviously individually ostensive. This capacity to point to individual persons and settings, especially when they are known to belong in a referent category, permits them to be readily enumerated and selected for study in the formal ways that sampling statisticians prefer.
By contrast, although individual measures (e.g., the Beck Depression Inventory) and treatments (e.g., a syringe full of a vaccine) are also ostensive, efforts to enumerate all existing ways of measuring or manipulating such measures and treatments are much more rare (e.g., Bloom, 1956; Ciarlo et al., 1986; Steiner & Gingrich, 2000). The reason is that researchers prefer to use substantive theory to determine which attributes a treatment or outcome measure should contain in any given study, recognizing that scholars often disagree about the relevant attributes of the higher order entity and of the supposed best operations to represent them. None of this negates the reality that populations of persons or settings are also defined in part by the theoretical constructs used to refer to them, just like treatments and outcomes; they also have multiple attributes that can be legitimately contested. What, for instance, is the American population? While a legal definition surely exists, it is not inviolate. The German conception of nationality allows that the great grandchildren of a German are Germans even if their parents and grandparents have not claimed German nationality. This is not possible for Americans. And why privilege a legal definition? A cultural conception might admit as American all those illegal immigrants who have been in the United States for decades, and it might exclude those American adults with passports who have never lived in the United States. Given that persons, settings, treatments, and outcomes all have both construct and ostensive qualities, it is no surprise that Campbell and Stanley did not distinguish between construct and external validity.

Cook and Campbell, however, did distinguish between the two. Their unstated rationale for the distinction was mostly pragmatic: to facilitate memory for the very long list of threats that, with the additions they made, would have had to fit under Campbell and Stanley's umbrella conception of external validity. In their theoretical discussion, Cook and Campbell associated construct validity with generalizing to causes and effects, and external validity with generalizing to and across persons, settings, and times. Their choice of terms explicitly referenced Cronbach and Meehl (1955), who used construct and construct validity in measurement theory to justify inferences "about higher-order constructs from research operations" (Cook & Campbell, 1979, p. 38).
Likewise, Cook and Campbell associated the terms population and external validity with sampling theory and the formal and purposive ways in which researchers select instances of persons and settings. But to complicate matters, Cook and Campbell also briefly acknowledged that "all aspects of the research require naming samples in generalizable terms, including samples of peoples and settings as well as samples of measures or manipulations" (p. 59). And in listing their external validity threats as statistical interactions between a treatment and population, they linked external validity more to generalizing across populations than to generalizing to them. Also, their construct validity threats were listed in ways that emphasized generalizing to cause and effect constructs. Generalizing across different causes and effects was listed as external validity because this task does not involve attributing meaning to a particular measure or manipulation. To read the threats in Cook and Campbell, external validity is about generalizing across populations of persons and settings and across different cause and effect constructs, while construct validity is about generalizing to causes and effects. Where, then, is generalizing from samples of persons or settings to their referent populations? The text discusses this as a matter of external validity, but this classification is not apparent in the list of validity threats. A system is needed that can improve on Cook and Campbell's partial confounding between objects of generalization (causes and effects versus persons and settings) and functions of generalization (generalizing to higher-order constructs from research operations versus inferring the degree of replication across different constructs and populations).

This book uses such a functional approach to differentiate construct validity from external validity. It equates construct validity with labeling research operations, and external validity with sources of variation in causal relationships. This new formulation subsumes all of the old. Thus, Cook and Campbell's understanding of construct validity as generalizing from manipulations and measures to cause and effect constructs is retained. So is external validity understood as generalizing across samples of persons, settings, and times. And generalizing across different cause or effect constructs is now even more clearly classified as part of external validity. Also highlighted is the need to label samples of persons and settings in abstract terms, just as measures and manipulations need to be labeled. Such labeling would seem to be a matter of construct validity given that construct validity is functionally defined in terms of labeling. However, labeling human samples might have been read as being a matter of external validity in Cook and Campbell, given that their referents were human populations and their validity types were organized more around referents than functions. So, although the new formulation in this book is definitely more systematic than its predecessors, we are unsure whether that systematization will ultimately result in greater terminological clarity or confusion. To keep the latter to a minimum, the following discussion reflects issues pertinent to the demarcation of construct and external validity that have emerged either in deliberations between the first two authors or in classes that we have taught using pre-publication versions of this book.

Is Construct Validity a Prerequisite for External Validity?
In this book, we equate external validity with variation in causal relationships and construct validity with labeling research operations. Some readers might see this as suggesting that successful generalization of a causal relationship requires the accurate labeling of each population of persons and each type of setting to which generalization is sought, even though we can never be certain that anything is labeled with perfect accuracy. The relevant task is to achieve the most accurate assessment available under the circumstances. Technically, we can test generalization across entities that are already known to be confounded and thus not labeled well, for example, when causal data are broken out by gender but the females in the sample are, on average, more intelligent than the males and therefore score higher on everything else correlated with intelligence. This example illustrates how dangerous it is to rely on measured surface similarity alone (i.e., gender differences) for determining how a sample should be labeled in population terms. We might more accurately label gender differences if we had a random sample of each gender taken from the same population. But this is not often found in experimental work, and even this is not perfect because gender is known to be confounded with other attributes (e.g., income, work status) even in the population, and those other attributes may be pertinent labels for some of the inferences being made. Hence, we usually have to rely on the assumption that, because gender samples come from the same physical setting, they are comparable on all background characteristics that might be correlated with the outcome. Because this assumption cannot be fully tested and is anyway often false (as in the hypothetical example above), this means that we could and should measure all the potential confounds within the limits of our theoretical knowledge to suggest them, and that we should also use these measures in the analysis to reduce confounding.

Even with acknowledged confounding, sample-specific differences in effect sizes may still allow us to conclude that a causal relationship varies by something associated with gender. This is a useful conclusion for preventing premature overgeneralization. With more breakdowns, confounded or not, one can even get a sense of the percentage of contrasts across which a causal relationship does and does not hold. But without further work, the populations across which the relationship varies are incompletely identified. The value of identifying them better is particularly salient when some effect sizes cannot be distinguished from zero. Although this clearly identifies a nonuniversal causal relationship, it does not advance theory or practice by specifying the labeled boundary conditions over which a causal relationship fails to hold. Knowledge gains are also modest from generalization strategies that do not explicitly contrast effect sizes. Thus, when different populations are lumped together in a single hypothesis test, researchers can learn how large a causal relationship is despite the many unexamined sources of variation built into the analysis. But they cannot accurately identify which constructs do and do not co-determine the relationship's size.
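The contrast between lumping populations together and breaking effect sizes out by subgroup can be sketched numerically. The following is only a minimal illustration under stated assumptions: the outcome scores are invented, the function name `cohens_d` is our own, and the standardized mean difference with a pooled standard deviation is one common effect size choice, not a method prescribed in the text.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treated, control):
    """Standardized mean difference using the pooled standard deviation."""
    n_t, n_c = len(treated), len(control)
    pooled_var = (
        (n_t - 1) * stdev(treated) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / sqrt(pooled_var)


# Hypothetical outcome scores, invented purely for illustration.
data = {
    "female": {"treated": [14, 15, 16, 17, 18], "control": [10, 11, 12, 13, 14]},
    "male": {"treated": [10, 11, 12, 13, 14], "control": [10, 11, 12, 13, 14]},
}

# Breakdown: one effect size per subgroup.
by_group = {g: cohens_d(d["treated"], d["control"]) for g, d in data.items()}

# Lumped analysis: subgroups pooled into a single contrast.
pooled_treated = [x for d in data.values() for x in d["treated"]]
pooled_control = [x for d in data.values() for x in d["control"]]
lumped = cohens_d(pooled_treated, pooled_control)

for group, d in by_group.items():
    print(f"{group}: d = {d:.2f}")
print(f"lumped: d = {lumped:.2f}")
```

In this invented case the lumped analysis yields a single intermediate effect size, whereas the breakdown shows an effect in one subgroup and none in the other. As the text cautions, the breakdown alone does not say which construct associated with the subgroup label is responsible for the difference.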
Construct validity adds useful specificity to external validity concerns, but it is not a necessary condition for external validity. We can generalize across entities known to be confounded, albeit less usefully than across accurately labeled entities.

This last point is similar to the one raised earlier to counter the assertion of Gadenne (1976) and Kruglanski and Kroy (1976) that internal validity requires the high construct validity of both cause and effect. They assert that all science is about constructs, and so it has no value to conclude that "something caused something else," the result that would follow if we did a technically exemplary randomized experiment with correspondingly high internal validity but the cause and effect were not labeled. Nonetheless, a causal relationship is demonstrably entailed, and the finding that "something reliably caused something else" might lead to further research to refine whatever clues are available about the cause and effect constructs. A similar argument holds for the relationship of construct to external validity. Labels with high construct validity are not necessary for internal or for external validity, but they are useful for both.

Researchers necessarily use the language of constructs (including human and setting population ones) to frame their research questions and select their representations of constructs in the samples and measures chosen. If they have designed their work well and have had some luck, the constructs they begin and end with will be the same, though critics can challenge any claims they make. However, the samples and constructs might not match well, and then the task is to examine the samples and ascertain what they might alternatively stand for. As critics like Gadenne, Kruglanski, and Kroy have pointed out, such reliance on the operational level seems to legitimize operations as having a life independent of constructs. This is not the case, though, for operations are intimately dependent on interpretations at all stages of research. Still, every operation fits some interpretations, however tentative that referent may be due to poor research planning or to nature turning out to be more complex than the researcher's initial theory.

How Does Variation Across Different Operational Representations of the Same Intended Cause or Effect Relate to Construct and External Validity?
In Chapter 3 we emphasized how the valid labeling of a cause or effect benefits from multiple operational instances, and also that these various instances can be fruitfully analyzed to examine how a causal relationship varies with the definition used. If each operational instance is indeed of the same underlying construct, then the same causal relationship should result regardless of how the cause or effect is operationally defined. Yet data analysis sometimes reveals that a causal relationship varies by operational instance. This means that the operations are not in fact equivalent, so that they presumably tap both into different constructs and into different causal relationships. Either the same causal construct is differently related to what now must be seen as two distinct outcomes, or the same effect construct is differently related to two or more unique causal agents. So the intention to promote the construct validity of causes and effects by using multiple operations has now facilitated conclusions about the external validity of causes or effects; that is, when the external validity of the cause and effect are in play, the data analysis has revealed that more than one causal relationship needs to be invoked.

Fortunately, when we find that a causal relationship varies over different causes or different effects, the research and its context often provide clues as to how the causal elements in each relationship might be (re)labeled. For example, the researcher will generally examine closely how the operations differ in their particulars, and will also study which unique meanings have been attached to variants like these in the existing literature. While the meanings that are achieved might be less successful because they have been devised post hoc to fit novel findings, they may in some circumstances still attain an acceptable level of accuracy and will certainly prompt continued discussion to account for the findings. Thus, we come full circle. We began with multiple operational representations of the same cause or effect when testing a single causal relationship; then the data forced us to invoke more than one relationship; and finally the pattern of the outcomes and their relationship to the existing literature can help improve the labeling of the new relationships achieved. A construct validity exercise begets an external validity conclusion that prompts the need for relabeling constructs. Demonstrating effect size variation across operations presumed to represent the same cause or effect can enhance external validity by showing that more constructs and causal relationships are involved than was originally envisaged; and in that case, it can eventually increase construct validity by preventing any mislabeling of the cause or effect inherent in the original choice of measures and by providing clues from details of the causal relationships about how the elements in each relationship should be labeled. We see here analytic tasks that flow smoothly between construct and external validity concerns, involving each.

Should Generalizing from a Single Sample of Persons or Settings Be Classified as External or Construct Validity?

If a study has a single sample of persons or settings, this sample must represent a population. How this sample should be labeled is an issue. Given that construct validity is about labeling, is labeling the sample an issue of construct validity? After all, external validity hardly seems relevant since with a single sample it is not immediately obvious what comparison of variation in causal relationships would be involved. So if generalizing from a sample of persons or settings is treated as a matter of construct validity analogous to generalizing from treatment and outcome operations, two problems arise. First, this highlights a potential conflict in usage in the general social science community, some parts of which say that generalizations from a sample of people to its population are a matter of external validity, even when other parts say that labeling people is a matter of construct validity. Second, this does not fit with the discussion in Cook and Campbell that treats generalizing from individual samples of persons and settings as an external validity matter, though their list of external validity threats does not explicitly deal with this and only mentions interactions between the treatment and attributes of the setting and person.

The issue is most acute when the sample was randomly selected from the population. Consider why sampling statisticians are so keen to promote random sampling for representing a well-designated universe. Such sampling ensures that the sample and population distributions are identical on all measured and unmeasured variables within the limits of sampling error. Notice that this includes the population label (whether more or less accurate), which random sampling also guarantees applies to the sample. Key to the usefulness of random sampling is having a well bounded population from which to sample, a requirement in sampling theory and something often obvious in practice. Given that many well bounded populations are also well labeled, random sampling then guarantees that a valid population label can equally validly be applied to the sample. For instance, the population of telephone prefixes used in the city of Chicago is known and is obviously correctly labeled. Hence, it would be difficult to use random digit dialing from that list of Chicago prefixes and then mislabel the resulting sample as representing telephone owners in Detroit or only in the Edgewater section of Chicago. Given a clearly bounded population and random sampling, the sample label is the population label, which is why sampling statisticians believe that no method is superior to random selection for labeling samples when the population label is known.
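The sampling rationale just described can be illustrated with a small simulation. This is a sketch under loudly stated assumptions: the population, its "Chicago" label, and the numeric attribute are all hypothetical and invented for illustration; the only point is that a simple random sample drawn from a well bounded list inherits the population label and approximates the population distribution within sampling error.

```python
import random
from statistics import mean

# Hypothetical, well bounded population: every unit carries the label
# "Chicago" plus one numeric attribute invented for illustration.
population = [{"label": "Chicago", "attribute": i % 100} for i in range(100_000)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(population, 2_000)  # simple random sample without replacement

population_mean = mean(unit["attribute"] for unit in population)
sample_mean = mean(unit["attribute"] for unit in sample)
sample_labels = {unit["label"] for unit in sample}

print("labels found in the sample:", sample_labels)
print("population mean:", population_mean)
print("sample mean:", round(sample_mean, 2))
```

Because every sampled unit was drawn from the bounded list, the only label that can appear in the sample is the population label, and the sample mean should differ from the population mean only by sampling error. Nothing in the simulation checks whether the label itself is accurate; random sampling reproduces the label, mistakes included.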


With purposive sample selection, this elegant rationale cannot be used, whether or not the population label is known. Thus, if respondents were selected haphazardly from shopping malls all over Chicago, many of the people studied would belong in the likely population of interest, residents of Chicago. But many would not because some Chicago residents do not go to malls at the hours interviewing takes place, and because many persons in these malls are not from Chicago. Lacking random sampling, we could not even confidently call this sample "people walking in Chicago malls," for other constructs such as volunteering to be interviewed may be systematically confounded with sample membership. So, mere membership in the sample is not sufficient for accurately representing a population, and by the rationale in the previous paragraph, it also is not sufficient for accurately labeling the sample. All this leads to two conclusions worth elaborating: (1) that random sampling can sometimes promote construct validity, and (2) that external validity is in play when inferring that a single causal relationship from a sample would hold in a population, whether from a random sample or not.

On the first point, the conditions under which random sampling can sometimes promote the construct validity of single samples are straightforward. Given a well bounded universe, sampling statisticians have justified random sampling as a way of clearly representing in the sample all population attributes. This must include the population label, and so random sampling results in labeling the sample in the same terms that apply to the population. Random sampling does not, of course, tell us whether the population label is itself reasonably accurate; random sampling will also replicate in the sample any mistakes that are made in labeling the population. However, given that many populations are already reasonably well-labeled based on past research and theory and that such situations are often intuitively obvious for researchers experienced in an area, random sampling can, under these circumstances, be counted on to promote construct validity. However, when random selection has not occurred or when the population label is itself in doubt, this book has explicated other principles and methods that can be used for labeling study operations, including labeling the samples of persons and settings in a study.

On the second point, when the question concerns the validity of generalizing from a causal relationship in a single sample to its population, the reader may also wonder how external validity can be in play at all. After all, we have framed external validity as being about whether the causal relationship holds over variation in persons, settings, treatment variables, and measurement variables. If there is only one random sample from a population, where is the variation over which to examine that causal relationship? The answer is simple: the variation is between sampled and unsampled persons in that population. As we said in Chapter 2 (and as was true in our predecessor books), external validity questions can be about whether a causal relationship holds (a) over variations in persons, settings, treatments, and outcomes that were in the experiment, and (b) for persons, settings, treatments, and outcomes that were not in the experiment. Those persons in a population who were not randomly sampled fall into the latter category. Nothing about external validity, either in the present book or in its predecessors, requires that all possible variations of external validity interest actually be observed in the study; indeed, it would be impossible to do so, and we provided several arguments in Chapter 2 about why it would not be wise to limit external validity questions only to variations actually observed in a study. Of course, in most cases external validity generalizations to things that were not studied are difficult, having to rely on the concepts and methods we outlined in our grounded theory of generalized causal inference in Chapters 11 through 13. But it is the great beauty of random sampling that it guarantees that this generalization will hold over both sampled and unsampled persons. So it is indeed an external validity question whether a causal relationship that has been observed in a single random sample would hold for those units that were in the population but not in the random sample.

In the end, this book treats the labeling of a single sample of persons or settings as a matter of construct validity, whether or not random sampling is used. It also treats the generalization of causal relationships from a single sample to unobserved instances as a matter of external validity, again whether or not random sampling was used. The fact that random sampling (which is associated with external validity in this book) sometimes happens to facilitate the construct labeling of a sample is incidental to the fact that the population label is already known. Though many population labels are indeed well-known, many more are still matters of debate, as reflected in the examples we gave in Chapter 3 of whether persons should be labeled schizophrenic or settings labeled as hostile work environments. In these latter cases, random sampling makes no contribution to resolving debates about the applicability of those labels. Instead, the principles and methods we outlined in Chapters 11 through 13 will have to be brought to bear. And when random sampling has not been used, those principles and methods will also have to be brought to bear on the external validity problem of generalizing causal relationships from single samples to unobserved instances.

Objections About the Completeness of the Typology


The first objection of this kind is that our lists of particular threats to validity are incomplete. Bracht and Glass (1968), for example, added new external validity threats that they thought were overlooked by Campbell and Stanley (1963); and more recently Aiken and West (1991) pointed to new reactivity threats. These challenges are important because the key to the most confident causal conclusions in our theory of validity is the ability to construct a persuasive argument that every plausible and identified threat to validity has been identified and ruled out. However, there is no guarantee that all relevant threats to validity have been identified. Our lists are not divinely ordained, as can be observed from the changes in the threats from Campbell (1957) to Campbell and Stanley (1963) to Cook and Campbell (1979) to this book. Threats are better identified from insider knowledge than from abstract and nonlocal lists of threats.

A second objection is that we may have left out particular validity types or organized them suboptimally. Perhaps the best illustration that this is true is Sackett's (1979) treatment of bias in case-control studies. Case-control studies do not commonly fall under the rubric of experimental or quasi-experimental designs; but they are cause-probing designs, and in that sense a general interest in generalized causal inference is at least partly shared. Yet Sackett created a different typology. He organized his list around seven stages of research at which bias can occur: (1) in reading about the field, (2) in sample specification and selection, (3) in defining the experimental exposure, (4) in measuring exposure and outcome, (5) in data analysis, (6) in interpretation of analyses, and (7) in publishing results. Each of these could generate a validity type, some of which would overlap considerably with our validity types. For example, his concept of biases "in executing the experimental manoeuvre" (p. 62) is quite similar to our internal validity, whereas his withdrawal bias mirrors our attrition. However, his list also suggests new validity types, such as biases in reading the literature, and biases he lists at each stage are partly orthogonal to our lists. For example, biases in reading include biases of rhetoric in which "any of several techniques are used to convince the reader without appealing to reason" (p. 60).
In the end, then, our claim is only that the present typology is reasonably well informed by knowledge of the nature of generalized causal inference and of some of the problems that are frequently salient about those inferences in field experimentation. It can and hopefully will continue to be improved both by addition of threats to existing validity types and by thoughtful exploration of new validity types that might pertain to the problem of generalized causal inference that is our main concern.1

of thesevalidity labelsthat have 1. We are acutelyaware of, and modestlydismayedat, the many differentusages though we are responsible developedover the years and of the risk that posesfor terminological confusion---even for rnany of thesevariations ourselves.After all, the understandingsof validiry in this book differ from those in Campbelland Stanley(1963),whoseonly distinctionwas betweeninternal and externalvalidity. They alsodiffer with generalizing to and across from Cook and Campbell (7979), in which externalvalidity was concerned populations of personsand settings,whereasall issuesof generalizingfrom the causeand effect operations internalvalidiry and constitutedthe domain of constructvalidity. Further,Campbell(1985) himselfrelabeled Steppingoutside external validiry as local molar causalvalidity and the principle of proximal similarity, respectively. He said internalvalidity is the Campbell'stradition, Cronbach(1982) usedtheselabelswith yet other meanings. problem of generalizing from samples to the domain about which the questionis asked,which soundsmuch like our construct validity except that he specifically denied any distinction betweenconstruct validiry and external validiry, using the latter term to refer to generalizingresults to unstudied populations, an issueof extrapolation beyond the data at hand. Our understandingof external validity includessuch extrapolations as one case,but it is not limited to that because it also has to do with empirically identifying sourcesof variation in an effect sizewhen existing data allow doing so. Finally, many other authors have casually used all theselabels in completelydifferent ways (Goetz & LeCompte,1984; Kleinbaum,Kupper, & Morgenstern,1982;Menard, 1991).So in view of all thesevariations, clear. we urge that theselabels be used only with descriptionsthat make their intended understandings


Objections Concerning the Nature of Validity


We defined validity as the approximate truth of an inference. Others define it differently. Here are some alternatives and our reasons for not using them.

Validity in the New Test Theory Tradition

Test theorists discussed validity (e.g., Cronbach, 1946; Guilford, 1946) well before Campbell (1957) invented his typology. We can only begin to touch on the many issues pertinent to validity that abound in that tradition. Here we outline a few key points that help differentiate our approach from that of test theory. The early emphasis in test theory was mostly on inferences about what a test measured, with a pinnacle being reached in the notion of construct validity. Cronbach (1989) credits Cook and Campbell for giving "proper breadth to the notion of constructs" (p. 152) in construct validity through their claim that construct validity is not just limited to inferences about outcomes but also about causes and about other features of experiments. In addition, early test theory tied validity to the truth of such inferences: "The literature on validation has concentrated on the truthfulness of test interpretation" (Cronbach, 1988, p. 5).

However, the years have brought change to this early understanding. In one particularly influential definition of validity in test theory, Messick (1989) said, "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 13); and later he says that "Validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual--as well as potential--consequences of score interpretation and use" (1995, p. 741). Whereas our understanding of validity is that inferences are the subject of validation, this definition suggests that actions are also subject to validation and that validation is actually evaluation. These extensions are far from our view.

A little history will help here. Tests are designed for practical use. Commercial test developers hope to profit from sales to those who use tests; employers hope to use tests to select better personnel; and test takers hope that tests will tell them something useful about themselves. These practical applications generated concern in the American Psychological Association (APA) to identify the characteristics of better and worse tests. APA appointed a committee chaired by Cronbach to address the problem. The committee produced the first in a continuing series of test standards (APA, 1954); and this work also led to Cronbach and Meehl's (1955) classic article on construct validity. The test standards have been frequently revised, most recently cosponsored by other professional associations (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1985, 1999). Requirements to adhere to the standards became part of professional ethical codes. The standards were also influential in legal and regulatory proceedings and have


been cited, for example, in U.S. Supreme Court cases about alleged misuses of testing practices (e.g., Albemarle Paper Co. v. Moody, 1975; Washington v. Davis, 1976) and have influenced the "Uniform Guidelines" for personnel selection by the Equal Employment Opportunity Commission (EEOC) et al. (1978). Various validity standards were particularly salient in these uses.

Because of this legal, professional, and regulatory concern with the use of testing, the research community concerned with measurement validity began to use the word validity more expansively, for example, "as one way to justify the use of a test" (Cronbach, 1989, p. 149). It is only a short distance from validating use to validating action, because most of the relevant uses were actions such as hiring or firing someone or labeling someone retarded. Actions, in turn, have consequences: some positive, such as efficiency in hiring and accurate diagnosis that allows better tailoring of treatment, and some negative, such as loss of income and stigmatization. So Messick (1989, 1995) proposed that validation also evaluate those consequences, especially the social justice of consequences. Thus evaluating the consequences of test use became a key feature of validity in test theory. The net result was a blurring of the line between validity-as-truth and validity-as-evaluation, to the point where Cronbach (1988) said "Validation of a test or test use is evaluation" (p. 4).

We strongly endorse the legitimacy of questions about the use of both tests and experiments. Although scientists have frequently avoided value questions in the mistaken belief that they cannot be studied scientifically or that science is value free, we cannot avoid values even if we try. The conduct of experiments involves values at every step, from question selection through the interpretation and reporting of results. Concerns about the uses to which experiments and their results are put and the value of the consequences of those uses are all important (e.g., Shadish et al., 1991), as we illustrated in Chapter 9 in discussing ethical concerns with experiments.

However, if validity is to retain its primary association with the truth of knowledge claims, then it is fundamentally impossible to validate an action because actions are not knowledge claims. Actions are more properly evaluated, not validated. Suppose an employer administers a test, intending to use it in hiring decisions. Suppose the action is that a person is hired. The action is not itself a knowledge claim and therefore cannot be either true or false. Suppose that person then physically assaults a subordinate. That consequence is also not a knowledge claim and so also cannot be true or false. The action and the consequences merely exist; they are ontological entities, not epistemological ones. Perhaps Messick (1989) really meant to ask whether inferences about actions and consequences are true or false. If so, the inclusion of action in his (1989) definition of validity is entirely superfluous, for validity-as-truth is already about evidence in support of inferences, including those about action or consequences.2

2. Perhaps partly in recognition of this, the most recent version of the test standards (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) helps resolve some of the problems outlined herein by removing reference to validating action from the definition of validity: "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests" (p. 9).

Alternatively, perhaps Messick (1989, 1995) meant his definition to instruct test validators to evaluate the action or its consequences, as intimated in: "Validity is broadly defined as nothing less than an evaluative summary of both the evidence for and the actual--as well as potential--consequences of score interpretation and use" (1995, p. 742). Validity-as-truth certainly plays a role in evaluating tests and experiments. But we must be clear about what that role is and is not. Philosophers (e.g., Scriven, 1980; Rescher, 1969) tell us that a judgment about the value of something requires that we (1) select criteria of merit on which the thing being evaluated would have to perform well, (2) set standards of performance for how well the thing must do on each criterion to be judged positively, (3) gather pertinent data about the thing's performance on the criteria, and then (4) integrate the results into one or more evaluative conclusions. Validity-as-truth is one (but only one) criterion of merit in evaluation; that is, it is good if inferences about a test are true, just as it is good for the causal inference made from an experiment to be true. However, validation is not isomorphic with evaluation. First, criteria of merit for tests (or experiments) are not limited to validity-as-truth. For example, a good test meets other criteria, such as having a test manual that reports norms, being affordable for the contexts of application, and protecting confidentiality as appropriate. Second, the theory of validity Messick proposed gives no help in accomplishing some of the other steps in the four-step evaluation process outlined previously. To evaluate a test, we need to know something about how much validity the inference should have to be judged good; and we need to know how to integrate results from all the other criteria of merit along with validity into an overall evaluation.
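
The four-step logic of evaluation just described can be made concrete with a small sketch. It is only an illustration under stated assumptions: the criterion names, the 0-to-1 scoring scale, the standards, the weights, and the weighted-average integration rule are all hypothetical choices made for this example, not anything prescribed by Scriven, Rescher, or Messick. The point it illustrates is that validity-as-truth enters as one criterion among several, so a test can score well on validity and still fail an overall evaluation.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """Step 1: a criterion of merit (all names below are hypothetical)."""
    name: str
    standard: float  # Step 2: minimum acceptable performance on a 0-1 scale
    weight: float    # relative importance, used only when integrating (step 4)


def evaluate(criteria, performance):
    """Step 4: integrate per-criterion results into evaluative conclusions.

    Returns (meets_all_standards, weighted_score). The weighted average is
    one assumed integration rule among many possible ones.
    """
    meets_all = all(performance[c.name] >= c.standard for c in criteria)
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * performance[c.name] for c in criteria) / total_weight
    return meets_all, weighted


# Steps 1 and 2: criteria of merit and their standards (illustrative values).
criteria = [
    Criterion("validity_as_truth", standard=0.7, weight=2.0),
    Criterion("manual_reports_norms", standard=0.5, weight=1.0),
    Criterion("affordability", standard=0.5, weight=1.0),
    Criterion("confidentiality", standard=0.8, weight=1.0),
]

# Step 3: performance data gathered for one hypothetical test.
performance = {
    "validity_as_truth": 0.9,
    "manual_reports_norms": 1.0,
    "affordability": 0.4,
    "confidentiality": 0.9,
}

meets_all, score = evaluate(criteria, performance)
print(meets_all, round(score, 2))
```

In this invented case the test does well on validity-as-truth but falls short of the affordability standard, so it does not meet every standard even though its weighted score is fairly high. Validity theory alone supplies neither the standards nor the integration rule.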
It is not a flaw in validity theory that these other steps are not addressed, for they are the domain of evaluation theory. The latter tells us something about how to execute these steps (e.g., Scriven, 1980, 1991) and also about other matters to be taken into account in the evaluation. Validation is not evaluation; truth is not value.

Of course, the definition of terms is partly arbitrary. So one might respond that one should be able to conflate validity-as-truth and validity-as-evaluation if one so chooses. However:

The very fact that terms must be supplied with arbitrary meanings requires that words be used with a great sense of responsibility. This responsibility is twofold: first, to established usage; second, to the limitations that the definitions selected impose on the user. (Goldschmidt, 1982, p. 642)

We need the distinction between truth and value because true inferences can be about bad things (the fact that smoking causes cancer does not make smoking or cancer good); and false inferences can lead to good things (the astrologer's advice to Pisces to "avoid alienating your coworkers today" may have nothing to do with heavenly bodies, but may still be good advice). Conflating truth and value can be actively harmful. Messick (1995) makes clear that the social consequences of testing are to be judged in terms of "bias, fairness, and distributive justice" (p. 745). We agree with this statement, but this is test evaluation, not test validity. Messick


notes that his intention is not to open the door to the social policing of truth (i.e., a test is valid if its social consequences are good), but ambiguity on this issue has nonetheless opened this very door. For example, Kirkhart (1995) cites Messick as justification for judging the validity of evaluations by their social consequences: "Consequential validity refers here to the soundness of change exerted on systems by evaluation and the extent to which those changes are just" (p. 4). This notion is risky because the most powerful arbiter of the soundness and justice of social consequences is the sociopolitical system in which we live. Depending on the forces in power in that system at any given time, we may find that what counts as valid is effectively determined by the political preferences of those with power.

Validity in the Qualitative Traditions

One of the most important developments in recent social research is the expanded use of qualitative methods such as ethnography, ethnology, participant observation, unstructured interviewing, and case study methodology (e.g., Denzin & Lincoln, 2000). These methods have unrivaled strengths for the elucidation of meanings, the in-depth description of cases, the discovery of new hypotheses, and the description of how treatment interventions are implemented or of possible causal explanations. Even for those purposes for which other methods are usually preferable, such as for making the kinds of descriptive causal inferences that are the topic of this book, qualitative methods can often contribute helpful knowledge and on rare occasions can be sufficient (Campbell, 1975; Scriven, 1976). Whenever resources allow, field experiments will benefit from including qualitative methods both for the primary benefits they are capable of generating and also for the assistance they provide to the descriptive causal task itself. For example, they can uncover important site-specific threats to validity and also contribute to explaining experimental results in general and perplexing outcome patterns in particular.

However, the flowering of qualitative methods has often been accompanied by theoretical and philosophical controversy, often referred to as the qualitative-quantitative debates. These debates concern not just methods but roles and rewards within science, ethics and morality, and epistemologies and ontologies. As part of the latter, the concept of validity has received considerable attention (e.g., Eisenhart & Howe, 1992; Goetz & LeCompte, 1984; Kirk & Miller, 1986; Kvale, 1989; J. Maxwell, 1992; J. Maxwell & Lincoln, 1990; Mishler, 1990; Phillips, 1987; Wolcott, 1990). Notions of validity that are different from ours have occasionally resulted from qualitative work, and sometimes validity is rejected entirely.
However, before we review those differences we prefer to emphasize the commonalities that we think dominate on all sides of the debates.

Commonalities. As we read it, the predominant view among qualitative theorists is that validity is a concept that is and should be applicable to their work. We start with examples of discussions of validity by qualitative theorists that illustrate these similarities because they are surprisingly more common than some portrayals in the

qualitative-quantitative debates suggest and because they demonstrate an underlying unity of interest in producing valid knowledge that we believe is widely shared by most social scientists. For example, Maxwell (1990) says, "qualitative researchers are just as concerned as quantitative ones about 'getting it wrong,' and validity broadly defined simply refers to the possible ways one's account might be wrong, and how these 'validity threats' can be addressed" (p. 505). Even those qualitative theorists who say they reject the word validity will admit that they "go to considerable pains not to get it all wrong" (Wolcott, 1990, p. 127). Kvale (1989) ties validity directly to truth, saying "concepts of validity are rooted in more comprehensive epistemological assumptions of the nature of true knowledge" (p. 11); and later that validity "refers to the truth and correctness of a statement" (p. 73). Kirk and Miller (1986) say "the technical use of the term 'valid' is as a properly hedged weak synonym for 'true'" (p. 19). Maxwell (1992) says "Validity, in a broad sense, pertains to this relationship between an account and something outside that account" (p. 283). All these seem quite compatible with our understanding of validity.

Maxwell's (1992) account points to other similarities. He claims that validity is always relative to "the kinds of understandings that accounts can embody" (p. 284) and that different communities of inquirers are interested in different kinds of understandings. He notes that qualitative researchers are interested in five kinds of understandings about: (1) the descriptions of what was seen and heard, (2) the meaning of what was seen and heard, (3) theoretical constructions that characterize what was seen and heard at higher levels of abstraction, (4) generalization of accounts to other persons, times, or settings than originally studied, and (5) evaluations of the objects of study (Maxwell, 1992; he says that the last two are of interest relatively rarely in qualitative work). He then proposes a five-part validity typology for qualitative researchers, one for each of the five understandings. We agree that validity is relative to understanding, though we usually refer to inference rather than understanding. And we agree that different communities of inquirers tend to be interested in different kinds of understandings, though common interests are illustrated by the apparently shared concerns that both experimenters and qualitative researchers have in how best to characterize what was seen and heard in a study (Maxwell's theoretical validity and our construct validity). Our extended discussion of internal validity reflects the interest of the community of experimenters in understanding descriptive causes, proportionately more so than is relevant to qualitative researchers, even when their reports are necessarily replete with the language of causation. This observation is not a criticism of qualitative researchers, nor is it a criticism of experimenters as being less interested than qualitative researchers in thick description of an individual case.

On the other hand, we should not let differences in prototypical tendencies across research communities blind us to the fact that when a particular understanding is of interest, the pertinent validity concerns are the same no matter what the methodology used to develop the knowledge claim. It would be wrong for a


qualitative researcher to claim that internal validity is irrelevant to qualitative methods. Validity is not a property of methods but of inferences and knowledge claims. On those infrequent occasions in which a qualitative researcher has a strong interest in a local molar causal inference, the concerns we have outlined under internal validity pertain. This argument cuts both ways, of course. An experimenter who wonders what the experiment means to participants could learn a lot from the concerns that Maxwell outlines under interpretive validity.

Maxwell (1992) also points out that his validity typology suggests threats to validity about which qualitative researchers seek "evidence that would allow them to be ruled-out . . . using a logic similar to that of quasi-experimental researchers such as Cook and Campbell" (p. 296). He does not outline such threats himself, but his description allows one to guess what some might look like. To judge from Maxwell's prose, threats to descriptive validity include errors of commission (describing something that did not occur), errors of omission (failing to describe something that did occur), errors of frequency (misstating how often something occurred), and interrater disagreement about description. Threats to the validity of knowledge claims have also been invoked by qualitative theorists other than Maxwell, for example, by Becker (1979), Denzin (1989), and Goetz and LeCompte (1984). Our only significant disagreement with Maxwell's discussion of threats is his claim that qualitative researchers are less able to use "design features" (p. 296) to deal with threats to validity. For instance, his preferred use of multiple observers is a qualitative design feature that helps to reduce errors of omission, commission, and frequency. The repertoire of design features that qualitative researchers use will usually be quite different from those used by researchers in other traditions, but they are design features (methods) all the same.

Differences. These agreements notwithstanding, many qualitative theorists approach validity in ways that differ from our treatment. A few of these differences are based on arguments that are simply erroneous (Heap, 1995; Shadish, 1995a). But many are thoughtful and deserve more attention than our space constraints allow. Following is a sample.

Some qualitative theorists either mix together evaluative and social theories of truth (Eisner, 1979, 1983) or propose to substitute the social for the evaluative. So Jensen (1989) says that validity refers to whether a knowledge claim is "meaningful and relevant" (p. 107) to a particular language community; and Guba and Lincoln (1982) say that truth can be reduced to whether an account is credible to those who read it. Although we agree that social and evaluative theories complement each other and are both helpful, replacing the evaluative with the social is misguided. These social alternatives allow for devastating counterexamples (Phillips, 1987): the swindler's story is coherent but fraudulent; cults convince members of beliefs that have little or no apparent basis otherwise; and an account of an interaction between teacher and student might be true even if neither found it to be credible. Bunge (1992) shows how one cannot define the basic idea of error using social theories of truth. Kirk and Miller (1986) capture the need for an evaluative theory of truth in qualitative methods:
In response to the propensity of so many nonqualitative research traditions to use such hidden positivist assumptions, some social scientists have tended to overreact by stressing the possibility of alternative interpretations of everything to the exclusion of any effort to choose among them. This extreme relativism ignores the other side of objectivity--that there is an external world at all. It ignores the distinction between knowledge and opinion, and results in everyone having a separate insight that cannot be reconciled with anyone else's. (p. 15)

A second difference refers to equating the validity of knowledge claims with their evaluation, as we discussed earlier with test theory (e.g., Eisenhart & Howe, 1992). This is most explicit in Salner (1989), who suggested that much of validity in qualitative methodology concerns the criteria "that are useful for evaluating competing claims" (p. 51); and she urges researchers to expose the moral and value implications of research, much as Messick (1989) said in reference to test theory. Our response is the same as for test theory. We endorse the need to evaluate knowledge claims broadly, including their moral implications; but this is not the same as saying that the claim is true. Truth is just one criterion of merit for a good knowledge claim.

A third difference makes validity a result of the process by which truth emerges. For instance, emphasizing the dialectic process that gives rise to truth, Salner (1989) says: "Valid knowledge claims emerge . . . from the conflict and differences between the contexts themselves as these differences are communicated and negotiated among people who share decisions and actions" (p. 61). Miles and Huberman (1984) speak of the problem of validity in qualitative methods being an insufficiency of "analysis procedures for qualitative data" (p. 230). Guba and Lincoln (1989) argue that trustworthiness emerges from communication with other colleagues and stakeholders. The problem with all these positions is the error of thinking that validity is a property of methods. Any procedure for generating knowledge can generate invalid knowledge, so in the end it is the knowledge claim itself that must be judged. As Maxwell (1992) says, "The validity of an account is inherent, not in the procedures used to produce and validate it, but in its relationship to those things it is intended to be an account of" (p. 281).

A fourth difference suggests that traditional approaches to validity must be reformulated for qualitative methods because validity "historically arose in the context of experimental research" (Eisenhart & Howe, 1992, p. 644). Others reject validity for similar reasons, except that they say that validity arose in test theory (e.g., Wolcott, 1990). Both are incorrect, for validity concerns probably first arose systematically in philosophy, preceding test theory and experimental science by hundreds or thousands of years. Validity is pertinent to any discussion of the warrant for believing knowledge and is not specific to particular methods.

A fifth difference concerns the claim that there is no ontological reality at all, so there is no truth to correspond to it. The problems with this perspective are enormous (Schmitt, 1995). First, even if it were true, it would apply only to


correspondence theories of truth; coherence and pragmatist theories would be unaffected. Second, the claim contradicts our experience. As Kirk and Miller (1986) put it:

There is a world of empirical reality out there. The way we perceive and understand that world is largely up to us, but the world does not tolerate all understandings of it equally (so that the individual who believes he or she can halt a speeding train with his or her bare hands may be punished by the world for acting on that understanding). (p. 11)

Third, the claim ignores evidence about the problems with people's constructions. Maxwell notes that "one of the fundamental insights of the social sciences is that people's constructions are often systematic distortions of their actual situation" (p. 506). Finally, the claim is self-contradictory because it implies that the claim itself cannot be true.

A sixth difference is the claim that it makes no sense to speak of truth because there are many different realities, with multiple truths to match each (Filstead, 1979; Guba & Lincoln, 1982; Lincoln & Guba, 1985). Lincoln (1990), for example, says that "a realist philosophical stance requires, indeed demands, a singular reality and therefore a singular truth" (p. 502), which she juxtaposes against her own assumption of multiple realities with multiple truths. Whatever the merits of the underlying ontological arguments, this is not an argument against validity. Ontological realism (a commitment that "something" does exist) does not require a singular reality but merely a commitment that there be at least one reality. To take just one example, physicists have speculated that there may be circumstances under which multiple physical realities could exist in parallel, as in the case of Schrödinger's cat (Davies, 1984; Davies & Brown, 1986). Such circumstances would in no way constitute an objection to pursuing valid characterizations of those multiple realities. Nor for that matter would the existence of multiple realities require multiple truths; physicists use the same principles to account for the multiple realities that might be experienced by Schrödinger's cat. Epistemological realism (a commitment that our knowledge reflects ontological reality) does not require only one true account of that world(s), but only that there not be two contradictory accounts that are both true of the same ontological referent.3 How many realities there might be, and how many truths it takes to account for them, should not be decided by fiat.

A seventh difference objects to the belief in a monolithic or absolute Truth (with capital T). Wolcott (1990) says, "What I seek is something else, a quality that points more to identifying critical elements and wringing plausible interpretations from them, something one can pursue without becoming obsessed with

3. The fact that different people might have different beliefs about the same referent is sometimes cited as violating this maxim, but it need not do so. For example, if the knowledge claim being validated is "John views the program as effective but Mary views it as ineffective," the claim can be true even though the views of John and Mary are contradictory.


finding the right or ultimate answer, the correct version, the Truth" (p. 146). He describes "the critical point of departure between quantities-oriented and qualities-oriented research [as being that] we cannot 'know' with the former's satisfying levels of certainty" (p. 147). Mishler (1990) objects that traditional approaches to validation are portrayed "as universal, abstract guarantors of truth" (p. 420). Lincoln (1990) thinks that "the realist position demands absolute truth" (p. 502). However, it is misguided to attribute beliefs in certainty or absolute truth to approaches to validity such as that in this book. We hope to have made clear by now that there are no guarantors of valid inferences. Indeed, the more experience that most experimenters gain, the more they appreciate the ambiguity of their results. Albert Einstein once said, "An experiment is something everybody believes except the person who made it" (Holton, 1986, p. 13). Like Wolcott, most experimenters seek only to wring plausible interpretations from their work, believing that "prudence sat poised between skepticism and credulity" (Shapin, 1994, p. xxix). We need not, should not, and frequently cannot decide that one account is absolutely true and the other completely false. To the contrary, tolerance for multiple knowledge constructions is a virtual necessity (Lakatos, 1978) because evidence is frequently inadequate to distinguish between two well-supported accounts (is light a particle or wave?), and sometimes accounts that appear to be unsupported by evidence for many years turn out to be true (do germs cause ulcers?).

An eighth difference claims that traditional understandings of validity have moral shortcomings. The arguments here are many, for example, that it "forces issues of politics, values (social and scientific), and ethics to be submerged" (Lincoln, 1990, p. 503) and implicitly empowers "social science 'experts' . . . whose class preoccupations (primarily White, male, and middle-class) ensure status for some voices while marginalizing . . . those of women, persons of color, or minority group members" (Lincoln, 1990, p. 502). Although these arguments may be overstated, they contain important cautions. Recall the example in Chapter 3 that "Even the rats were white males" in health research. No doubt this bias was partly due to the dominance of White males in the design and execution of health research. None of the methods discussed in this book are intended to redress this problem or are capable of it. The purpose of experimental design is to elucidate causal inferences more than moral inferences. What is less clear is that this problem requires abandoning notions of validity or truth. The claim that traditional approaches to truth forcibly submerge political and ethical issues is simply wrong. To the extent that morality is reflected in the questions asked, the assumptions made, and the outcomes examined, experimenters can go a long way by ensuring a broad representation of stakeholder voices in study design. Further, moral social science requires commitment to truth. Moral righteousness without truthful analysis is the stuff of totalitarianism. Moral diversity helps prevent totalitarianism, but without the discipline provided by truth-seeking, diversity offers no means to identify those options that are good for the human condition, which is, after all, the essence of morality. In order to have a moral social science, we must have both the capacity to elucidate personal constructions and the capacity to see


how those constructions reflect and distort reality (Maxwell, 1992). We embrace the moral aspirations of scholars such as Lincoln, but giving voice to those aspirations simply does not require us to abandon such notions as validity and truth.

QUASI-EXPERIMENTATION

Criteria for Ruling Out Threats: The Centrality of Fuzzy Plausibility
In a randomized experiment in which all groups are treated in the sameway excepr for treatment assignment,very few assumptionsneed to be made about ro,rr.", of bias. And those that are made are clear and can be easily tested,particularly as concerns the fidelity of the original assignment process and its subsequentmaintenance. Not surprisinglS statisticiansprefer methods in which the assumptionsare few, transparent, and testable. Quasi-experiments, however, rely heavily on researcheriudgments about assumptions, especiallyon the fuzzy but indispensable concept of plausibility. Judgments about plausibility are neededfor deciding which of the many threats to validity are relevant in a given study for deciding whether a particular designelement is capable of ruling out a given threat, for estimating by how much the bias might have been reduced, and for assessing whether multiple threats that might have been only partially adjusted for might add up to a total bias greater than the effect size the researcher is inclined to claim. Vith quasiexperiments, the relevant assumptions are numerous, their plausibility is less evident, and their single and joint effectsare lesseasily modeled. We acknowledgethe fuzzy way in which particular internal validity threats are often ruled out, and it is becauseof this that we too prefer randomized experiments (and regressiondiscontinuity designs)over most of their quasi-experimentalalternatives. But quasi-experiments vary among themselveswith respect to the number, transparencg and testability of assumptions. 
Indeed, we deliberately ordered the chapters on quasi-experiments to reflect the increase in inferential power that comes from moving from designs without a pretest or without a comparison group to those with both, to those based on an interrupted time series, and from there to regression discontinuity and random assignment. Within most of these chapters we also illustrated how inferences can be improved by adding design elements: more pretest observation points, better stable matching, replication and systematic removal of the treatment, multiple control groups, and nonequivalent dependent variables. In a sense, the plan of the four chapters on quasi-experiments reflects two purposes. One is to show how the number, transparency, and testability of assumptions vary by type of quasi-experimental design so that, in the best of quasi-experiments, internal validity is not much worse than with the randomized experiment. The other is to get students of quasi-experiments to be more sparing with the use of this overly general label, for it threatens to tar all quasi-experiments with the same negative brush. As scholars who have contributed to the institutionalization of the term quasi-experiment, we feel a lot of ambivalence about our role. Scholars need to think critically about alternatives to the randomized experiment, and from this need arises the need for the quasi-experimental label. But all instances of quasi-experimental design should not be brought under the same unduly broad quasi-experimental umbrella if attributes of the best studies do not closely match the weaker attributes of the field writ large.

Statisticians seek to make their assumptions transparent through the use of formal models laid out as formulae. For the most part, we have resisted this strategy because it backfires with so many readers, alienating them from the very conceptual issues the formulae are designed to make evident. We have used words instead. There is a cost to this, and not just in the distaste of statistical cognoscenti, particularly those whose own research has emphasized statistical models. The main cost is that our narrative approach makes it more difficult to formally demonstrate how much fewer and more evident and more testable the alternative interpretations became as we moved from the weaker to the stronger quasi-experiments, both within the relevant quasi-experimental chapters and across the set of them. We regret this, but do not apologize for the accessibility we tried to create by minimizing the use of Greek symbols and Roman subscripts. Fortunately, this deficit is not absolute, as both we and others have worked to develop methods that can be used to measure the size of particular threats, both in particular studies (e.g., Gastwirth et al., 1994; Shadish et al., 1998; Shadish, 2000) and in sets of studies (e.g., Kazdin & Bass, 1989; Miller, Turner, Tindale, Posavac, & Dugoni, 1991; Rosenthal & Rubin, 1978; Willson & Putnam, 1982). Further, our narrative approach has a significant advantage over a more narrowly statistical emphasis: it allows us to address a broader array of qualitatively different threats to validity, threats for which no statistical measure is yet available and that therefore might otherwise be overlooked with too strict an emphasis on quantification. Better to have imprecise attention to plausibility than to have no attention at all paid to many important threats just because they cannot be well measured.

Pattern Matching as a Problematic Criterion


This book is more explicit than its predecessors about the desirability of imbuing a causal hypothesis with multiple testable implications in the data, provided that they serve to reduce the viability of alternative causal explanations. In a sense, we have sought to substitute a pattern-matching methodology for the usual assessment of whether a few means, often only two, reliably differ. We do this not because complexity itself is a desideratum in science. To the contrary, simplicity in the number of questions asked and methods used is highly prized in science. The simplicity of randomized experiments for descriptive causal inference illustrates this well. However, the same simple circumstance does not hold with quasi-experiments. With them, we have asserted that causal inference is improved the more specific,


generating these lists. The main concern was to have a consensus of education researchers endorsing each practice; and he guessed that the number of these best practices that depended on randomized experiments would be zero. Several nationally known educational researchers were present, agreed that such assignment probably played no role in generating the list, and felt no distress at this. So long as the belief is widespread that quasi-experiments constitute the summit of what is needed to support causal conclusions, the support for experimentation that is currently found in health, agriculture, or health in schools is unlikely to occur. Yet randomization is possible in many educational contexts within schools if the will exists to carry it out (Cook et al., 1999; Cook et al., in press). An unfortunate and inadvertent side effect of serious discussion of quasi-experiments may sometimes be the practical neglect of randomized experiments. That is a pity.

RANDOMIZED EXPERIMENTS
This section lists objections that have been raised to doing randomized experiments, and our analysis of the more and less legitimate issues that these objections raise.

Experiments Cannot Be Successfully Implemented


Even a little exposure to large-scale social experimentation shows that treatments are often improperly or incompletely implemented and that differential attrition often occurs. Organizational obstacles to experiments are many. They include the reality that different actors vary in the priority they attribute to random assignment, that some interventions seem disruptive at all levels of the organization, and that those at the point of service delivery often find the treatment requirements a nuisance addition to their already overburdened daily routine. Then there are sometimes treatment crossovers, as units in the control condition adopt or adapt components from the treatment or as those in a treatment group are exposed to some but not all of these same components. These criticisms suggest that the correct comparison is not between the randomized experiment and better quasi-experiments when each is implemented perfectly but rather between the randomized experiment as it is often imperfectly implemented and better quasi-experiments. Indeed, implementation can sometimes be better in the quasi-experiment if the decision not to randomize is based on fears of treatment degradation. This argument cannot be addressed well because it depends on specifying the nature and degree of degradation and the kind of quasi-experimental alternative. But taken to its extreme it suggests that randomized experiments have no special warrant in field settings because there is no evidence that they are stronger than other designs in practice (only in theory).

But the situation is probably not so bleak. Methods for preventing and coping with treatment degradation are improving rapidly (see Chapter 10, this volume; Boruch, 1997; Gueron, 1999; Orr, 1999). More important, random assignment may still create a superior counterfactual to its alternatives even with the flaws mentioned herein. For example, Shadish and Ragsdale (1996) found that, compared with randomized experiments without attrition, randomized experiments with attrition still yielded better effect size estimates than did nonrandomized experiments. Sometimes, of course, an alternative to severely degraded randomization will be best, such as a strong interrupted time series with a control. But routine rejection of degraded randomized experiments is a poor rule to follow; it takes careful study and judgment to decide. Further, many alternatives to experimentation are themselves subject to treatment implementation flaws that threaten the validity of inferences from them. Attrition and treatment crossovers also occur in them. We also suspect that implementation flaws are salient in experimentation because experiments have been around so long and experimenters are so critical of each other's work. By contrast, criteria for assessing the quality of implementation and results from other methods are far more recent (e.g., Datta, 1997), and they may therefore be less well developed conceptually, less subjected to peer criticism, and less improved by the lessons of experience.

Experimentation Needs Strong Theory and Standardized Treatment Implementation


Many critics claim that experimentation is more fruitful when an intervention is based on strong substantive theory, when implementation of treatment details is faithful to that theory, when the research setting is well managed, and when implementation does not vary much between units. In many field experiments, these conditions are not met. For example, schools are large, complex social organizations with multiple programs, disputatious politics, and conflicting stakeholder goals. Many programs are implemented variably across school districts, as well as across schools, classrooms, and students. There can be no presumption of standard implementation or fidelity to program theory (Berman & McLaughlin, 1977).

But these criticisms are, in fact, misplaced. Experiments do not require well-specified program theories, good program management, standard implementation, or treatments that are totally faithful to theory. Experiments make a contribution when they simply probe whether an intervention-as-implemented makes a marginal improvement beyond other background variability. Still, the preceding factors can reduce statistical power and so cloud causal inference. This suggests that in settings in which more of these conditions hold, experiments should: (1) use large samples to detect effects; (2) take pains to reduce the influence of extraneous variation either by design or through measurement and statistical manipulation; and (3) study implementation quality both as a variable worth studying in its own right in order to ascertain which settings and providers implement the intervention better and as a mediator to see how implementation carries treatment effects to outcome.
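The link between background variability, statistical power, and sample size can be made concrete with a rough calculation. The sketch below is illustrative only and is not taken from the text: it assumes a two-group comparison of means, a two-sided test, and the usual normal approximation, and the function names and the example numbers (a 5-point raw benefit, outcome standard deviations of 10 and 20) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def standardized_effect(raw_difference, outcome_sd):
    """Raw mean difference expressed in standard deviation units."""
    return raw_difference / outcome_sd


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate units needed per group (normal approximation, two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Hypothetical illustration: the same raw benefit of 5 points, but variable
# implementation doubles the outcome standard deviation.
d_controlled = standardized_effect(5.0, 10.0)
d_variable = standardized_effect(5.0, 20.0)

print("n per group, well-controlled setting:", n_per_group(d_controlled))
print("n per group, variable setting:", n_per_group(d_variable))
```

Under these assumptions, doubling the outcome standard deviation halves the standardized effect and roughly quadruples the sample needed for the same power, which is why large samples and reduction of extraneous variation both appear in the recommendations above.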


Indeed, for many purposes the lack of standardization may aid in understanding how effective an intervention will be under normal conditions of implementation. In the social world, few treatments are introduced in a standard and theory-faithful way. Local adaptations and partial implementation are the norm. If this is the case, then some experiments should reflect this variation and ask whether the treatment can continue to be effective despite all the variation within groups that we would expect to find if the treatment were policy. Program developers and social theorists may want standardization at high levels of implementation, but policy analysts should not welcome this if it makes the research conditions different from the practice conditions to which they would like to generalize. Of course, it is most desirable to be able to answer both sets of questions: about policy-relevant effects of treatments that are variably implemented and also about the more theory-relevant effects of optimal exposure to the intervention. In this regard, one might recall recent efforts to analyze the effects of the original intent to treat through traditional means but also of the effects of the actual treatment through using random assignment as an instrumental variable (Angrist et al., 1996a).
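The contrast between the intent-to-treat effect and the effect of treatment actually received can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it uses the simple ratio (Wald) form of the instrumental variable estimator, which is interpretable as an effect of received treatment only under the usual instrumental variable assumptions (for example, that random assignment affects outcomes only through treatment received). The function name and the eight-unit data set are invented for illustration.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def itt_and_iv(assigned, received, outcome):
    """Return (intent-to-treat estimate, ratio-form instrumental variable estimate).

    assigned: 1 if randomly assigned to treatment, else 0
    received: 1 if treatment was actually received, else 0
    outcome:  observed outcome for each unit
    """
    y_treat = [y for z, y in zip(assigned, outcome) if z == 1]
    y_ctrl = [y for z, y in zip(assigned, outcome) if z == 0]
    d_treat = [d for z, d in zip(assigned, received) if z == 1]
    d_ctrl = [d for z, d in zip(assigned, received) if z == 0]

    itt = mean(y_treat) - mean(y_ctrl)     # effect of assignment on outcome
    uptake = mean(d_treat) - mean(d_ctrl)  # effect of assignment on receipt
    if uptake == 0:
        raise ValueError("assignment did not change treatment receipt")
    return itt, itt / uptake


# Hypothetical data: only half of those assigned to treatment receive it.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
received = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [12, 12, 10, 10, 10, 10, 10, 10]

itt, iv = itt_and_iv(assigned, received, outcome)
print("intent-to-treat estimate:", itt)
print("instrumental variable estimate:", iv)
```

In this invented example the intent-to-treat estimate answers the policy-relevant question about a variably implemented treatment, while the instrumental variable estimate rescales it by the difference in uptake to approximate the effect of actual exposure.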

Experiments Entail Tradeoffs Not Worth Making


The choice to experiment involves a number of tradeoffs that some researchers believe are not worth making (Cronbach, 1982). Experimentation prioritizes on unbiased answers to descriptive causal questions. But, given finite resources, some researchers prefer to invest what they have not into marginal improvements in internal validity but into promoting higher construct and external validity. They might be content with a greater degree of uncertainty about the quality of a causal connection in order to purposively sample a greater range of populations of people or settings or, when a particular population is central to the research, in order to generate a formally representative sample. They might even use the resources to improve treatment fidelity or to include multiple measures of a very important outcome construct. If a consequence of this preference for construct and external validity is to conduct a quasi-experiment or even a nonexperiment rather than a randomized experiment, then so be it. Similar preferences make other critics look askance when advocates of experimentation counsel restricting a study to volunteers in order to increase the chances of being able to implement and maintain random assignment or when these same advocates advise close monitoring of the treatment to ensure its fidelity, thereby creating a situation of greater obtrusiveness than would pertain if the same treatment were part of some ongoing social policy (e.g., Heckman, 1992). In the language of Campbell and Stanley (1963), the claim was that experimentation traded off external validity in favor of internal validity. In the parlance of this book and of Cook and Campbell (1979), it is that experimentation trades off both external and construct validity for internal validity, to its detriment.

Critics also claim that experiments overemphasize conservative standards of scientific rigor. These include (1) using a conservative criterion to protect against


wrongly concluding a treatment is effective (p < .05) at the risk of failing to detect true treatment effects; (2) recommending intent-to-treat analyses that include as part of the treatment those units that have never received treatment; (3) denigrating inferences that result from exploring unplanned treatment interactions with characteristics of units, observations, settings, or times; and (4) rigidly pursuing a priori experimental questions when other interesting questions emerge during a study. Most laypersons use a more liberal risk calculus to decide about causal inferences in their own lives, as when they consider taking up some potentially lifesaving therapy. Should not science do the same, be less conservative? Should it not at least sometimes make different tradeoffs between protection against incorrect inferences and the failure to detect true effects?

Critics further object that experiments prioritize descriptive over explanatory causation. The critics in question would tolerate more uncertainty about whether the intervention works in order to learn more about any explanatory processes that have the potential to generalize across units, settings, observations, and times. Further, some critics prefer to pursue this explanatory knowledge using qualitative methods similar to those of the historian, journalist, and ethnographer than by means of, say, structural equation modeling that seems much more opaque than the narrative reports of these other fields.

Critics also dislike the priority that experiments give to providing policymakers with often belated answers about what works instead of providing real-time help to service providers in local settings. These providers are rarely interested in a long-delayed summary of what a program has achieved. They often prefer receiving continuous feedback about their work and especially about those elements of practice that they can change without undue complication. A recent letter to the New York Times captured this preference:
Alan Krueger . . . claims to eschew value judgments and wants to approach issues (about educational reform) empirically. Yet his insistence on postponing changes in education policy until studies by researchers approach certainty is itself a value judgment in favor of the status quo. In view of the tragic state of affairs in parts of public education, his judgment is a most questionable one. (Petersen, 1999)

We agree with many of these criticisms. Among all possible research questions, causal questions constitute only a subset. And of all possible causal methods, experimentation is not relevant to all types of questions and all types of circumstance. One need only read the list of options and contingencies outlined in Chapters 9 and 10 to appreciate how foolhardy it is to advocate experimentation on a routine basis as a causal "gold standard" that will invariably result in clearly interpretable effect sizes. However, many of the criticisms about tradeoffs are based on artificial dichotomies, correctable problems, and even oversimplifications. Experiments can and should examine reasons for variable implementation, and they should search to uncover mediating processes. They need not use stringent alpha rates; only statistical tradition argues for the .05 level. Nor need one restrict data analyses only to the intent-to-treat, though that

should definitely be one analysis. Experimenters can also explore statistical interactions to the extent that substantive theory and statistical power allow, guarding against profligate error rates and couching their conclusions cautiously. Interim results from experiments can be published. There can and should also be nonexperimental analyses of the representativeness of samples and the construct validity of assessments of persons, settings, treatments, and outcomes. There can and should be qualitative data collection aimed at discovering unintended outcomes and mediating processes. And as much information as possible about causal generalization should be generated using the methods outlined in this book. All these procedures require resources, but sometimes few of them (e.g., adding measures of mediating variables). Experiments need not be as rigid as some texts suggest, and the goal of ever-finer marginal improvements [...] more useful information will be learned from programs of research that consist mostly or even entirely of quasi-experimental and nonexperimental studies than from programs emphasizing the stronger experimental methods (e.g., Cronbach et al., 1980; Cronbach, 1982). Although we are generally sympathetic to this point, some bounds cannot be crossed without compromising the integrity of key inferences. Unless threats to internal validity are clearly implausible on logical or evidential grounds, to have no strong experimental studies on the effects of an intervention is to risk drawing broad general conclusions about a causal connection that is undependable. This happens all too often, alas. It is now 30 years since school vouchers were proposed, and we still have no clear answers about their effects. It is 15 years since Henry Levin began accelerated schools, and we have no experiments and no answers. It is 30 years since James Comer began the School Development Program, and almost the same situation holds. Although premature experimentation is a real danger, such decade-long time lines without clear answers are probably even more problematic, particularly for those legislators and their staffs who want to promote effectiveness-based social policies. Finding out what works is too important to suggest that experiments require tradeoffs that are never worth making.

By contrast, we are impressed with the capacity of programs of experimental research to address both construct and external validity issues modestly well. Granted, individual experiments have limited reach in addressing both these issues, but as we see most clearly in meta-analysis, the capacity to address both construct and external validity issues over multiple experiments greatly exceeds what past critics have suggested. Of course, as we made clear in Chapter 1, we are not calling for any routine primacy of internal validity over construct or external validity (every validity type must have its time in the spotlight). Rather we are calling attention to the inferential weaknesses that history suggests have emerged in programs of research that deemphasize internal validity too much and to the surprisingly broad inferential reach of programs in which internal validity plays a much more prominent role.
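The remark that only statistical tradition argues for the .05 level can be illustrated by showing what a more liberal criterion buys. The sketch below is an assumption-laden illustration rather than anything from the text: it uses a normal approximation for a two-group comparison with a two-sided test, ignores the negligible chance of rejecting in the wrong direction, and the effect size of 0.3 and the 100 units per group are hypothetical.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha):
    """Approximate power to detect a standardized mean difference.

    Normal approximation for two equal groups and a two-sided test; the
    small probability of rejecting in the wrong direction is ignored.
    """
    z = NormalDist()
    expected_z = effect_size * (n_per_group / 2) ** 0.5  # expected test statistic
    critical = z.inv_cdf(1 - alpha / 2)                  # two-sided critical value
    return z.cdf(expected_z - critical)


# Hypothetical study: a modest effect (0.3 SD) with 100 units per group.
for alpha in (0.05, 0.10):
    power = approx_power(0.3, 100, alpha)
    print(f"alpha = {alpha:.2f}: power is roughly {power:.2f}, "
          f"chance of missing a true effect roughly {1 - power:.2f}")
```

Relaxing alpha raises the risk of wrongly concluding that a treatment is effective but lowers the risk of failing to detect a true effect; which balance is appropriate depends on the costs of each error in the decision at hand.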



Experiments Assume an Invalid Model of Research Utilization


To some critics, experiments recreate a naive rational choice model of decision making. That is, one first lays out the alternatives to choose among (the treatments); then one decides on criteria of merit (the outcomes); then one collects information on each criterion for each treatment (the data collection), and finally one makes a decision about the superior alternative. Unfortunately, empirical work on the use of social science data shows that use is not so simple as the rational choice model suggests (C. Weiss & Bucuvalas, 1980; C. Weiss, 1988).

First, even when cause and effect questions are asked in decision contexts, experimental results are still used along with other forms of information: from existing theories, personal testimony, extrapolations from surveys, consensus of a field, claims from experts with interests to defend, and ideas that have recently become trendy. Decisions are shaped partly by ideology, interests, politics, personality, windows of opportunity, and values; and they are as much made by a policy-shaping community (Cronbach et al., 1980) as by an individual or committee. Further, many decisions are not so much made as accreted over time as earlier decisions constrain later ones, leaving the final decision maker with few options (Weiss, 1980). Indeed, by the time experimental results are available, new decision makers and issues may have replaced old ones.

Second, experiments often yield contested rather than unanimous verdicts that therefore have uncertain implications for decisions. Disputes arise about whether the causal questions were correctly framed, whether results are valid, whether relevant outcomes were assessed, and whether the results entail a specific decision. For example, reexaminations of the Milwaukee educational voucher study offered different conclusions about whether and where effects occurred (H. Fuller, 2000; Greene, Peterson, & Du, 1999; Witte, 1998, 1999, 2000). Similarly, different effect sizes were generated from the Tennessee class size experiment (Finn & Achilles, 1990; Hanushek, 1999; Mosteller, Light, & Sachs, 1996). Sometimes, scholarly disagreements are at issue, but at other times the disputes reflect deeply conflicted stakeholder interests.

Third, short-term instrumental use of experimental data is more likely when the intervention is a minor variant on existing practice. For example, it is easier to change textbooks in a classroom or pills given to patients or eligibility criteria for program entry than it is to relocate hospitals to underserved locations or to open day-care centers for welfare recipients throughout an entire state. Because the more feasible changes are so modest in scope, they are less likely to dramatically affect the problem they address. So critics note that prioritizing on short-term instrumental change tends to preserve most of the status quo and is unlikely to solve trenchant social problems. Of course, there are some experiments that truly twist the lion's tail and involve bold initiatives. Thus moving families from densely poor inner-city locations to the suburbs involved a change of three standard deviations


in the poverty level of the sending and receiving communities, much greater than what happens when poor families spontaneously move. Whether such a dramatic change could ever be used as a model for cleaning out the inner cities of those who want to move is a moot issue. Many would judge such a policy to be unlikely. Truly bold experiments have many important rationales; but creating new policies that look like the treatment soon after the experiment is not one of them.

Fourth, the most frequent use of research may be conceptual rather than instrumental, changing how users think about basic assumptions, how they understand contexts, and how they organize or label ideas. Some conceptual uses are intentional, as when a person deliberately reads a book on a current problem; for example, Murray's (1984) book on social policy had such a conceptual impact in the 1980s, creating a new social policy agenda. But other conceptual uses occur in passing, as when a person reads a newspaper story referring to social research. Such uses can have great long-run impact as new ways of thinking move through the system, but they rarely change particular short-term decisions.

These arguments against a naive rational decision-making model of experimental usefulness are compelling. That model is rightly rejected. However, most of the objections are true not just of experiments but of all social science methods. Consider controversies over the accuracy of the U.S. Census, the entirely descriptive results of which enter into a decision-making process about the apportionment of resources that is complex and highly politically charged. No method offers a direct road to short-term instrumental use. Moreover, the objections are exaggerated. In settings such as the U.S. Congress, decision making is sometimes influenced instrumentally by social science information (Chelimsky, 1998), and experiments frequently contribute to that use as part of a research review on effectiveness questions.
Similarly, policy initiatives get recycled, as happened with school vouchers, so that social science data that were not used in past years are used later when they become instrumentally relevant to a current issue (Polsby, 1984; Quirk, 1986). In addition, data about effectiveness influence many stakeholders' thinking even when they do not use the information quickly or instrumentally. Indeed, research suggests that high-quality experiments can confer extra credibility among policymakers and decision makers (C. Weiss & Bucuvalas, 1980), as happened with the Tennessee class size study. We should also not forget that the conceptual use of experiments occurs when the texts used to train professionals in a given field contain results of past studies about successful practice (Leviton & Cook, 1983). And using social science data to produce incremental change is not always trivial. Small changes can yield benefits of hundreds of millions of dollars (Fienberg, Singer, & Tanur, 1985). Sociologist Carol Weiss, an advocate of doing research for enlightenment's sake, says that 3 decades of experience and her studies of the use of social science data leave her "impressed with the utility of evaluation findings in stimulating incremental increases in knowledge and in program effectiveness. Over time, cumulative increments are not such small potatoes after all" (Weiss, 1998, p. 319). Finally, the usefulness of experiments can be increased by the actions outlined earlier in this chapter that involve complementing basic experimental design with adjuncts such as measures of implementation and mediation or qualitative methods, anything that will help clarify program process and implementation problems. In summary, invalid models of the usefulness of experimental results seem to us to be no more nor less common than invalid models of the use of any other social science methods. We have learned much in the last several decades about use, and experimenters who want their work to be useful can take advantage of those lessons (Shadish et al., 1991).

The Conditions of Experimentation Differ from the Conditions of Policy Implementation


Experiments are often done on a smaller scale than would pertain if services were implemented state- or nationwide, and so they cannot mimic all the details relevant to full policy implementation. Hence policy implementation of an intervention may yield different outcomes than the experiment (Elmore, 1996). For example, based partly on research about the benefits of reducing class size, Tennessee and California implemented statewide policies to have more classes with fewer students in each. This required many new teachers and new classrooms. However, because of a national teacher shortage, some of those new teachers may have been less qualified than those in the experiment; and a shortage of classrooms led to more use of trailers and dilapidated buildings that may have harmed effectiveness further.

Sometimes an experimental treatment is an innovation that generates enthusiastic efforts to implement it well. This is particularly frequent when the experiment is done by a charismatic innovator whose tacit knowledge may exceed that of those who would be expected to implement the program in ordinary practice and whose charisma may induce high-quality implementation. These factors may generate more successful outcomes than will be seen when the intervention is implemented as routine policy.

Policy implementation may also yield different results when experimental treatments are implemented in a fashion that differs from or conflicts with practices in real-world application. For example, experiments studying psychotherapy outcome often standardize treatment with a manual and sometimes observe and correct the therapist for deviating from the manual (Shadish et al., 2000); but these practices are rare in clinical practice. If manualized treatment is more effective (Chambless & Hollon, 1998; Kendall, 1998), experimental results might transfer poorly to practice settings.

Random assignment may also change the program from the intended policy implementation (Heckman, 1992). For example, those willing to be randomized may differ from those for whom the treatment is intended; randomization may change people's psychological or social response to treatment compared with those who self-select treatment; and randomization may disrupt administration and implementation by forcing the program to cope with a different mix of clients.


Heckman claims this kind of problem with the Job Training Partnership Act (JTPA) evaluation "calls into question the validity of the experimental estimates as a statement about the JTPA system as a whole" (Heckman, 1992, p. 221).

In many respects, we agree with these criticisms, though it is worth noting several responses to them. First, they assume a lack of generalizability from experiment to policy, but that is an empirical question. Some data suggest that generalization may be high despite differences between lab and field (C. Anderson, Lindsay, & Bushman, 1999) or between research and practice (Shadish et al., 2000). Second, it can help to implement treatment under conditions that are more characteristic of practice if it does not unduly compromise other research priorities. A little forethought can improve the surface similarity of units, treatments, observations, settings, or times to their intended targets. Third, some of these criticisms are true of any research methodology conducted in a limited context, such as locally conducted case studies or quasi-experiments, because local implementation issues always differ from large-scale issues. Fourth, the potentially disruptive nature of experimentally manipulated interventions is shared by many locally invented novel programs, even when they are not studied by any research methodology at all. Innovation inherently disrupts, and substantive literatures are rife with examples of innovations that encountered policy implementation impediments (Shadish, 1984).

However, the essential problem remains that large-scale policy implementation is a singular event, the effects of which cannot be fully known except by doing the full implementation. A single experiment, or even a small series of similar ones, cannot provide complete answers about what will happen if the intervention is adopted as policy. However, Heckman's criticism needs reframing. He fails to distinguish among validity types (statistical conclusion, internal, construct, external).
Doing so makes it clear that his claim that such criticism "calls into question the validity of the experimental estimates as a statement about the JTPA system as a whole" (Heckman, 1992, p. 221) is really about external validity and construct validity, not statistical conclusion or internal validity. Except in the narrow econometrics tradition that he understandably cites (Haavelmo, 1944; Marschak, 1953; Tinbergen, 1956), few social experimenters ever claimed that experiments could describe the "system as a whole"; even Fisher (1935) acknowledged this tradeoff. Further, the econometric solutions that Heckman suggests cannot avoid the same tradeoffs between internal and external validity. For example, surveys and certain quasi-experiments can avoid some problems by observing existing interventions that have already been widely implemented, but the validity of their estimates of program effects is suspect, and the estimates may themselves change if the program were imposed even more widely as policy.

Addressing these criticisms requires multiple lines of evidence: randomized experiments of efficacy and effectiveness, nonrandomized experiments that observe existing interventions, nonexperimental surveys to yield estimates of representativeness, statistical analyses that bracket effects under diverse assumptions,

Ot EXPERIMENTS RANDOMIZED I I

qualitative observation to discover potential incompatibilities between the interventiol and its context of likely implementation, historical study of the fates of similar interventions when they were implemented as policg policy analysesby those with expertisein the type of intervention at issue,and the methods for causal generalizationin this book. The conditions of policy implementation will be difi.r.rr, from the conditions characteristic of any rese^rchstudy of it, so predicting generalizationto policy will always be one of the toughest problems.

Imposing Treatments Is Fundamentally Flawed Compared with Encouraging the Growth of Local Solutions to Problems
Experiments impose treatments on recipients. Yet some late 20th-century thought suggests that imposed solutions may be inferior to solutions that are locally generated by those who have the problem. Partly, this view is premised on research findings of few effects for the Great Society social programs of the 1960s in the United States (Murray, 1984; Rossi, 1987), with the presumption that a portion of the failure was due to the federally imposed nature of the programs. Partly, the view reflects the success of late 20th-century free market economics and conservative political ideologies compared with centrally controlled economies and more liberal political beliefs. Experimentally imposed treatments are seen in some quarters as being inconsistent with such thinking.

Ironically, the first objection is based on results of experiments: if it is true that imposed programs do not work, experiments provided the evidence. Moreover, these no-effect findings may have been partly due to methodological failures of experiments as they were implemented at that time. Much progress in solving practical experimental problems occurred after, and partly in response to, those experiments. If so, it is premature to assume these experiments definitively demonstrated no effect, especially given our increased ability to detect small effects today (D. Greenberg & Shroder, 1997; Lipsey, 1992; Lipsey & Wilson, 1993).

We must also distinguish between political-economic currency and the effects of interventions. We know of no comparisons of, say, the effects of locally generated versus imposed solutions. Indeed, the methodological problems in doing such comparisons are daunting, especially accurately categorizing interventions into the two categories and unconfounding the categories with correlated method differences. Barring an unexpected solution to the seemingly intractable problems of causal inference in nonrandomized designs, answering questions about the effects of locally generated solutions may require exactly the kind of high-quality experimentation being criticized. Though it is likely that locally generated solutions may indeed have significant advantages, it also is likely that some of those solutions will have to be experimentally evaluated.


CAUSAL GENERALIZATION: AN OVERLY COMPLICATED THEORY?


Internal validity is best promoted via random assignment, an omnibus mechanism that ensures that we do not have many assumptions to worry about when causal inference is our goal. By contrast, quasi-experiments require us to make explicit many assumptions (the threats to internal validity) that we then have to rule out by fiat, by design, or by measurement. The latter is a more complex and assumption-riddled process that is clearly inferior to random assignment. Something similar holds for causal generalization, in which random selection is the most parsimonious and theoretically justified method, requiring the fewest assumptions when causal generalization is our goal. But because random selection is so rarely feasible, one instead has to construct an acceptable theory of generalization out of purposive sampling, a much more difficult process. We have tried to do this with our five principles of generalized causal inference. These, we contend, are the keys to generalized inference that lie behind random sampling and that have to be identified, explicated, and assessed if we are to make better general inferences, even if they are not perfect ones. But these principles are much more complex to implement than is random sampling. Let us briefly illustrate this with the category called American adult women. We could represent this category by random selection from a critically appraised register of all women who live in the United States and who are at least 21 years of age. Within the limits of sampling error, we could formally generalize any characteristics we measured on this sample to the population on that register. Of course, we cannot select this way because no such register exists. Instead, one does one's experiment with an opportunistic sample of women. On inspection they all turn out to be between 19 and 30 years of age, to be higher than average in achievement and ability, and to be attending school; that is, we have used a group of college women.
Surface similarity suggests that each is an instance of the category woman. But it is obvious that the modal American woman is not a college student. Such students constitute an overly homogeneous sample with respect to educational abilities and achievement, socioeconomic status, occupation, and all observable and unobservable correlates thereof, including health status, current employment, and educational and occupational aspirations and expectations. To remedy this bias, we could use a more complex purposive sampling design that selects women heterogeneously on all these characteristics. But purposive sampling for heterogeneous instances can never do this as well as random selection can, and it is certainly more complex to conceive and execute. We could go on and illustrate how the other principles facilitate generalization. The point is that any theory of generalization from purposive samples is bound to be more complicated than the simplicity of random selection. But because random selection is rarely possible when testing causal relationships within an experimental framework, we need these purposive alternatives.
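The contrast between random selection and an opportunistic sample can be made concrete with a small simulation. The sketch below is not part of the original argument: the register, the age range, the achievement scale, the cutoffs defining the "college" sample, and all function names are assumptions chosen only for illustration.

```python
import random
import statistics


def make_register(n, rng):
    """Build a hypothetical register of adult women as (age, achievement) pairs."""
    # Assumed for illustration: ages uniform on 21..90, achievement ~ N(100, 15).
    return [(rng.randint(21, 90), rng.gauss(100, 15)) for _ in range(n)]


def random_selection(register, k, rng):
    """Simple random sample of k women from the full register."""
    return rng.sample(register, k)


def college_sample(register, k, rng):
    """Opportunistic sample: only women aged 30 or under with above-average achievement."""
    eligible = [w for w in register if w[0] <= 30 and w[1] > 105]
    return rng.sample(eligible, k)


def mean_achievement(sample):
    """Average achievement score in a sample."""
    return statistics.fmean([score for _, score in sample])


def age_spread(sample):
    """Standard deviation of age, a rough index of sample heterogeneity."""
    return statistics.pstdev([age for age, _ in sample])


rng = random.Random(0)
register = make_register(50_000, rng)
for label, sample in [
    ("register", register),
    ("random selection", random_selection(register, 500, rng)),
    ("college sample", college_sample(register, 500, rng)),
]:
    print(f"{label:>16}: mean achievement {mean_achievement(sample):6.1f}, "
          f"age SD {age_spread(sample):5.1f}")
```

Under these assumptions the random sample should track the register within sampling error, whereas the opportunistic sample is both shifted upward on achievement and far more homogeneous in age, which is the sense in which surface similarity alone is a weak basis for generalization.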


Yet most experimental work probably still relies on the weakest of these alternatives, surface similarity. We seek to improve on such uncritical practice. Unfortunately, though, there is often restricted freedom for the more careful selection of instances of units, treatments, outcomes, and settings, even when the selection is done purposively. It requires resources to sample irrelevancies so that they are heterogeneous on many attributes, to measure several related constructs that can be discriminated from each other conceptually, and to measure a variety of possible explanatory processes. This is partly why we expect more progress on causal generalization from a review context rather than from single studies. Thus, if one researcher can work with college women, another can work with female schoolteachers, and another with female retirees, this creates an opportunity to see if these sources of irrelevant homogeneity make a difference to a causal relationship or whether it holds over all these different types of women. Ultimately, causal generalization will always be more complicated than assessing the likelihood that a relationship is causal. The theory is more diffuse, more recent, and less well tested in the crucible of research experience. And in some quarters there is disdain for the issue, given the belief and practice that relationships that replicate once should be considered as general until proven otherwise, not to speak of the belief that little progress and prestige can be achieved by designing the next experiment to be some minor variant on past studies. There is no point in pretending that causal generalization is as institutionalized procedurally as other methods in the social sciences. We have tried to set the theoretical agenda in a systematic way. But we do not expect to have the last word.
There is still no explication of causal generalization equivalent to the empirically produced list of threats to internal validity and the quasi-experimental designs that have evolved over 40 years to rule out these threats. The agenda is set but not complete.

NONEXPERIMENTAL ALTERNATIVES


Though this book is about experimental methods for answering questions about causal hypotheses, it is a mistake to believe that only experimental approaches are used for this purpose. In the following, we briefly consider several other approaches, indicating the major reasons why we have not dwelt on them in detail. Basically, the reason is that we believe that, whatever their merits for some research purposes, they generate less clear causal conclusions than randomized experiments or even the best quasi-experiments such as regression-discontinuity or interrupted time series.

The nonexperimental alternatives we examine are the major ones to emerge in various academic disciplines. In education and parts of anthropology and sociology, one alternative is intensive qualitative case studies. In these same fields, and also in developmental psychology, there is an emerging interest in theory-based causal studies based on causal modeling practices. Across the social sciences other than economics and statistics, the word quasi-experiment is routinely used to justify causal inferences, even though designs so referred to are so primitive in structure that causal conclusions are often problematic. We have to challenge such advocacy of low-grade quasi-experiments as a valid alternative to the quality of studies we have been calling for in this book. And finally, in parts of statistics and epidemiology, and overwhelmingly in econometrics and those parts of sociology and political science that draw from econometrics, the emphasis is more on control through statistical manipulation than on experimental design. When descriptive causal inferences are the primary concern, all of these alternatives will usually be inferior to experiments.

Intensive Qualitative Case Studies


The call to generate causal conclusions from intensive case studies comes from several sources. One is from quantitative researchers in education who became disenchanted with the tools of their trade and subsequently came to prefer the qualitative methods of the historian and journalist and especially of the ethnographer (e.g., Guba, 1981, 1990; and more tentatively Cronbach, 1986). Another is from those researchers originally trained in primary disciplines such as qualitative anthropology (e.g., Fetterman, 1984) or sociology (Patton, 1980). The enthusiasm for case study methods arises for several different reasons. One is that qualitative methods often reduce enough uncertainty about causation to meet stakeholder needs. Most advocates point out that journalists, historians, ethnographers, and lay persons regularly make valid causal inferences using a qualitative process that combines reasoning, observation, and falsificationist procedures in order to rule out threats to internal validity, even if that kind of language is not explicitly used (e.g., Becker, 1958; Cronbach, 1982). A small minority of qualitative theorists go even further to claim that case studies can routinely replace experiments for nearly any causal-sounding question they can conceive (e.g., Lincoln & Guba, 1985). A second reason is the belief that such methods can also engage a broad view of causation that permits getting at the many forces in the world and human minds that together influence behavior in much more complex ways than any experiment will uncover. And the third reason is the belief that case studies are broader than experiments in the types of information they yield.
For example, they can inform readers about such useful and diverse matters as how pertinent problems were formulated by stakeholders, what the substantive theories of the intervention are, how well implemented the intervention components were, what distal, as well as proximal, effects have come about in respondents' lives, what unanticipated side effects there have been, and what processes explain the pattern of obtained results. The claim is that intensive case study methods allow probes of an A to B connection, of a broad range of factors conditioning this relationship, and of a range of intervention-relevant questions that is broader than the experiment allows.


Although we agree that qualitative evidence can reduce some uncertainty about cause, sometimes substantially, the conditions under which this occurs are usually rare (Campbell, 1975). In particular, qualitative methods usually produce unclear knowledge about the counterfactual of greatest importance, how those who received treatment would have changed without treatment. Adding design features to case studies, such as comparison groups and pretreatment observations, clearly improves causal inference. But it does so by melding case-study data collection methods with experimental design. Although we consider this a valuable addition to ways of thinking about case studies, many advocates of the method would no longer recognize it as still being a case study. To our way of thinking, case studies are very relevant when causation is at most a minor issue; but in most other cases when substantial uncertainty reduction about causation is required, we value qualitative methods within experiments rather than as alternatives to them, in ways similar to those we outlined in Chapter 12.

Theory-Based Evaluations


This approach has been formulated relatively recently and is described in various books or special journal issues (Chen & Rossi, 1992; Connell, Kubisch, Schorr, & Weiss, 1995; Rogers, Hacsi, Petrosino, & Huebner, 2000). Its origins are in path analysis and causal modeling traditions that are much older. Although advocates have some differences with each other, basically they all contend that it is useful: (1) to explicate the theory of a treatment by detailing the expected relationships among inputs, mediating processes, and short- and long-term outcomes; (2) to measure all the constructs specified in the theory; and (3) to analyze the data to assess the extent to which the postulated relationships actually occurred. For shorter time periods, the available data may address only the first part of a postulated causal chain; but over longer periods the complete model could be involved. Thus, the priority is on highly specific substantive theory, high-quality measurement, and valid analysis of multivariate explanatory processes as they unfold in time (Chen & Rossi, 1987, 1992). Such theoretical exploration is important. It can clarify general issues with treatments of a particular type, suggest specific research questions, describe how the intervention functions, spell out mediating processes, locate opportunities to remedy implementation failures, and provide lively anecdotes for reporting results (Weiss, 1998). All these serve to increase the knowledge yield, even when such theoretical analysis is done within an experimental framework. There is nothing about the approach that makes it an alternative to experiments. It can clearly be a very important adjunct to such studies, and in this role we heartily endorse the approach (Cook, 2000). However, some authors (e.g., Chen & Rossi, 1987, 1992; Connell et al., 1995) have advocated theory-based evaluation as an attractive alternative to experiments when it comes to testing causal hypotheses. It is attractive for several reasons.
First, it requires only a treatment group, not a comparison group whose agreement to be in the study might be problematic and whose participation increases research costs. Second, demonstrating a match between theory and data suggests the validity of the causal theory without having to go through a laborious process of explicitly considering alternative explanations. Third, it is often impractical to measure distant end points in a presumed causal chain. So confirmation of attaining proximal end points through theory-specified processes can be used in the interim to inform program staff about effectiveness to date, to argue for more program resources if the program seems to be on theoretical track, to justify claims that the program might be effective in the future on the as-yet-not-assessed distant criteria, and to defend against premature summative evaluations that claim that an intervention is ineffective before it has been demonstrated that the processes necessary for the effect have actually occurred. However, major problems exist with this approach for high-quality descriptive causal inference (Cook, 2000). First, our experience in writing about the theory of a program with its developer (Anson et al., 1991) has shown that the theory is not always clear and could be clarified in diverse ways. Second, many theories are linear in their flow, omitting reciprocal feedback or external contingencies that might moderate the entire flow. Third, few theories specify how long it takes for a given process to affect an indicator, making it unclear if null results disconfirm a link or suggest that the next step did not yet occur. Fourth, failure to corroborate a model could stem from partially invalid measures as opposed to invalidity of the theory. Fifth, many different models can fit a data set (Glymour et al., 1987; Stelzl, 1986), so our confidence in any given model may be small. Such problems are often fatal to an approach that relies on theory to make strong causal claims.
Though some of these problems are present in experiments (e.g., failure to incorporate reciprocal causation, poor measures), they are of far less import because experiments do not require a well-specified theory in constructing causal knowledge. Experimental causal knowledge is less ambitious than theory-based knowledge, but the more limited ambition is attainable.

Weaker Quasi-Experiments


For some researchers, random assignment is undesirable for practical or ethical reasons, so they prefer quasi-experiments. Clearly, we support thoughtful use of quasi-experimentation to study descriptive causal questions. Both interrupted time series and regression discontinuity often yield excellent effect estimates. Slightly weaker quasi-experiments can also yield defensible estimates, especially when they involve control groups with careful matching on stable pretest attributes combined with other design features that have been thoughtfully chosen to address contextually plausible threats to validity. However, when a researcher can choose, randomized designs are usually superior to nonrandomized designs. This is especially true of nonrandomized designs in which little thought is given to such matters as the quality of the match when creating control groups, including multiple hypothesis tests rather than a single one, generating data from several pretreatment time points rather than one, or having several comparison groups to create controls that bracket performance in the treatment groups. Indeed, when results from typical quasi-experiments are compared with those from randomized experiments on the same topic, several findings emerge. Quasi-experiments frequently misestimate effects (Heinsman & Shadish, 1996; Shadish & Ragsdale, 1996). These biases are often large and plausibly due to selection biases such as the self-selection of more distressed clients into psychotherapy treatment conditions (Shadish et al., 2000) or of patients with a poorer prognosis into controls in medical experiments (Kunz & Oxman, 1998). These biases are especially prevalent in quasi-experiments that use poor quality control groups and have higher attrition (Heinsman & Shadish, 1996; Shadish & Ragsdale, 1996). So, if the answers obtained from randomized experiments are more credible than those from quasi-experiments on theoretical grounds and are more accurate empirically, then the arguments for randomized experiments are even stronger whenever a high degree of uncertainty reduction is required about a descriptive causal claim.

Because all quasi-experiments are not equal in their ability to reduce uncertainty about cause, we want to draw attention again to a common but unfortunate practice in many social sciences: to say that a quasi-experiment is being done in order to provide justification that the resulting inference will be valid. Then a quasi-experimental design is described that is so deficient in the desirable structural features noted previously, which promote better inference, that it is probably not worth doing. Indeed, over the years we have repeatedly noted the term quasi-experiment being used to justify designs that fell into the class that Campbell and Stanley (1963) labeled as uninterpretable and that Cook and Campbell (1979) labeled as generally uninterpretable. These are the simplest forms of the designs discussed in Chapters 4 and 5. Quasi-experiments cannot be an alternative to randomized experiments when the latter are feasible, and poor quasi-experiments can never be a substitute for stronger quasi-experiments when the latter are also feasible. Just as Gueron (1999) has reminded us about randomized experiments, good quasi-experiments have to be fought for, too. They are rarely handed out as though on a silver plate.
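The self-selection mechanism mentioned above can be illustrated with a toy simulation. This is a minimal sketch under assumed values, not a reanalysis of any cited study: the distress scale, the true effect of -5 points, the noise levels, and the two assignment rules are all hypothetical.

```python
import random
import statistics

TRUE_EFFECT = -5.0  # assumed: treatment lowers distress by 5 points


def random_assignment(baseline, rng):
    """Coin-flip assignment that ignores the client's baseline distress."""
    return rng.random() < 0.5


def self_selection(baseline, rng):
    """More distressed clients are more likely to end up in treatment."""
    return baseline + rng.gauss(0, 5) > 50


def estimate_effect(n, assign, rng):
    """Naive estimate: mean treated outcome minus mean control outcome."""
    treated, control = [], []
    for _ in range(n):
        baseline = rng.gauss(50, 10)  # pretreatment distress
        in_treatment = assign(baseline, rng)
        outcome = baseline + (TRUE_EFFECT if in_treatment else 0.0) + rng.gauss(0, 5)
        (treated if in_treatment else control).append(outcome)
    return statistics.fmean(treated) - statistics.fmean(control)


rng = random.Random(0)
print("true effect:            ", TRUE_EFFECT)
print("randomized estimate:    ", round(estimate_effect(20_000, random_assignment, rng), 2))
print("self-selected estimate: ", round(estimate_effect(20_000, self_selection, rng), 2))
```

In this setup the randomized comparison should land near the assumed true effect, whereas the self-selected comparison can even reverse its sign, because the treated group started out more distressed. A well-designed quasi-experiment would add pretests or matched controls to reduce that gap; the naive comparison here stands in for the weak designs criticized in the text.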

Statistical Controls
In this book, we have advocated that statistical adjustments for group nonequivalence are best used after design controls have already been used to the maximum in order to reduce nonequivalence to a minimum. So we are not opponents of statistical adjustment techniques such as those advocated by the statisticians and econometricians described in the appendix to Chapter 5. Rather, we want to use them as the last resort. The position we do not like is the assumption that statistical controls are so well developed that they can be used to obtain confident results in nonexperimental and weak quasi-experimental contexts. As we saw in Chapter 5, research in the past 2 decades has not much supported the notion that a control group can be constructed through matching from some national or state registry when the treatment group comes from a more circumscribed and local setting. Nor has research much supported the use of statistical adjustments in longitudinal national surveys in which individuals with different experiences are explicitly contrasted in order to estimate the effects of this experience difference. Undermatching is a chronic problem here, as are consequences of unreliability in the selection variables, not to speak of specification errors due to incomplete knowledge of the selection process. In particular, endogeneity problems are a real concern. We are heartened that more recent work on statistical adjustments seems to be moving toward the position we represent, with greater emphasis being placed on internal controls, on stable matching within such internal controls, on the desirability of seeking cohort controls through the use of siblings, on the use of pretests collected on the same measures as the posttest, on the utility of such pretest measures collected at several different times, and on the desirability of studying interventions that are clearly exogenous shocks to some ongoing system. We are also heartened by the progress being made in the statistical domain because it includes progress on design considerations, as well as on analysis per se (e.g., Rosenbaum, 1999a). We are agnostic at this time as to the virtues of the propensity score and instrumental variable approaches that predominate in discussions of statistical adjustment. Time will tell how well they pan out relative to the results from randomized experiments. We have surely not heard the last word on this topic.
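One of the problems named above, unreliability in the selection variables, can be sketched numerically. The example below is an illustration under assumed values rather than a method from this book: the data-generating model, the true effect of 1.0, the reliability of roughly 0.5 for the noisy covariate, and all function names are hypothetical.

```python
import random
import statistics


def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return statistics.fmean([(x - ma) * (y - mb) for x, y in zip(a, b)])


def unadjusted_effect(t, y):
    """Difference in mean outcome between treated (t=1) and control (t=0)."""
    return cov(t, y) / cov(t, t)


def adjusted_effect(t, x, y):
    """OLS coefficient on t when y is regressed on t and one covariate x."""
    stt, sxx, stx = cov(t, t), cov(x, x), cov(t, x)
    sty, sxy = cov(t, y), cov(x, y)
    return (sty * sxx - sxy * stx) / (stt * sxx - stx ** 2)


def simulate(n, rng, true_effect=1.0):
    """Selection depends on a latent true score that also drives the outcome."""
    true_score, noisy_score, treated, outcome = [], [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                          # latent selection variable
        t = 1.0 if u + rng.gauss(0, 1) > 0 else 0.0  # nonrandom selection
        true_score.append(u)
        noisy_score.append(u + rng.gauss(0, 1))      # measured with error
        treated.append(t)
        outcome.append(2.0 * u + true_effect * t + rng.gauss(0, 1))
    return true_score, noisy_score, treated, outcome


u, x, t, y = simulate(20_000, random.Random(0))
print("unadjusted:               ", round(unadjusted_effect(t, y), 2))
print("adjusted, noisy covariate:", round(adjusted_effect(t, x, y), 2))
print("adjusted, true score:     ", round(adjusted_effect(t, u, y), 2))
```

Under these assumptions, adjusting for the error-free selection variable recovers roughly the assumed effect, while adjusting for its unreliable measurement removes only part of the selection bias. In practice the error-free variable is never observed, which is one reason to rely on design controls before statistical ones.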

CONCLUSION
We cannot point to one new development that has revolutionized field experimentation in the past few decades, yet we have seen a very large number of incremental improvements. As a whole, these improvements allow us to create far better field experiments than we could do 40 years ago when Campbell and Stanley (1963) first wrote. In this sense, we are very optimistic about the future. We believe that we will continue to see steady, incremental growth in our knowledge about how to do better field experiments. The cost of this growth, however, is that field experimentation has become a more specialized topic, both in terms of knowledge development and of the opportunity to put that knowledge into practice in the conduct of field experiments. As a result, nonspecialists who wish to do a field experiment may greatly benefit by consulting with those with the expertise, especially for large experiments, for experiments in which implementation problems may be high, or for cases in which methodological vulnerabilities will greatly reduce credibility. The same is true, of course, for many other methods. Case-study methods, for example, have become highly enough developed that most researchers would do an amateurish job of using them without specialized training or supervised practice. Such Balkanization of methodology is, perhaps, inevitable, though none the less regrettable. We can ease the regret somewhat by recognizing that with specialization may come faster progress in solving the problems of field experimentation.
