
1

Causal Inference in
Randomized and
Non-Randomized Studies:
The Definition, Identification,
and Estimation of Causal
Parameters
Michael E. Sobel

INTRODUCTION

The distinction between causation and association has figured prominently in science and philosophy for several hundred years at least, and, more recently, in statistical science as well, indeed, since Galton, Pearson, and Yule developed the theory of correlation.

Statisticians have pioneered two approaches to causal inference that have proven influential in the natural and behavioral sciences. The oldest dates back to Yule (1896), who wrote extensively about ‘illusory’ correlations, by which he meant correlations that should not be endowed with a causal interpretation. To distinguish between the illusory and non-illusory correlations, Yule invented partial correlation to ‘control’ for the influence of a common factor, arguing in context that because the relationship between pauperism and out-relief did not vanish when ‘controlling’ for poverty, this relationship could be deemed causal. A half century later, philosophers, psychologists and social scientists (e.g., Reichenbach, 1956; Simon 1954; Suppes 1970) rediscovered Yule’s approach to distinguishing between causal and non-causal relationships, and econometricians (e.g., Granger 1969) extended this idea to the time-series setting. Graphical models, path analysis and, more generally, structural
4 DESIGN AND INFERENCE

equation models, when these methods are used to make causal inferences, also rely on this type of reasoning. The theory of experimental design, which emerged in the 1920s and thereafter, and is associated especially with Neyman (1923) and Fisher (1925), forms the basis for a second approach to inferring causal relationships. Here, the use of good design, especially randomization, is emphasized, apparently obviating the need to worry about spurious relationships.

Despite these important contributions, during the majority of the twentieth century, most statisticians espoused the view that statistics had little to do with causation. But the situation has reversed dramatically in the last 30 years, since Rubin (1974, 1977, 1978, 1980) rediscovered Neyman’s potential outcomes notation and extended the theory of experimental design to observational studies. Currently, there is a large and growing literature in statistics on the topic of causal inference, and this second approach to inferring causal relationships is coming to dominate the first approach, even in disciplines such as economics, which rely on observational studies and where the first approach has traditionally dominated.

This chapter provides an introduction, tailored to the concerns of behavioral scientists, to this second approach to causal inference. Because causal inference is the act of making inferences about the causal relation and notions of the causal relation differ, it is important to understand what notion of causation is under consideration when such an inference is made. Thus, in the next section, I briefly review several notions of causation and also briefly examine the approach to causal inference that derives from Yule. In the section ‘Unit and average causal effects’ the second approach, which is built on the idea that a causal relation should sustain a counterfactual conditional statement, is introduced, and a number of estimands of interest are defined. The section ‘Identification of causal parameters under ignorable treatment assignment’ discusses the identification of causal effects and the next section, ‘Estimation of causal parameters in randomized studies’, discusses estimation. The section ‘Mediation analyses’ takes up the topic of mediation, which is of special interest to psychologists and prevention scientists. I show that the usual approach to mediation, which uses structural equation modeling, does not yield estimates of causal parameters, even in randomized studies. Several other approaches to mediation, including principal stratification and instrumental variables, are also considered.

CAUSATION AND PROBABILISTIC CAUSATION

Regularity theories of causation are concerned with the full (or philosophical) cause of an effect, by which is meant a set of conditions that is sufficient (or necessary or necessary and sufficient) for the effect to occur. This type of theory descends from Hume, who claimed that causation (as it exists in the real world) consists only of the following: (1) temporal priority, i.e., the cause must precede the effect in time; (2) spatiotemporal contiguity, i.e., the cause and effect are ‘near’ in time and space; and (3) constant conjunction, i.e., if the same circumstances are repeated, the same outcome will occur. Many subsequent writers argued that Hume’s analysis cannot distinguish between regularities that are not causal, such as a relation between two events brought about by a common factor, and genuine causation. At the minimum, this suggests that Hume’s account is incomplete. A number of philosophers [e.g., Bunge (1979) and Harré and Madden (1975)] argue that the causal relation is generative. While this idea is appealing, especially to modern scientists who speak of mechanisms, attempts to elaborate this idea have not been entirely successful. Another approach (examined later) is to require causal relationships to sustain counterfactual conditional statements.

Hume’s analysis is also deterministic, and the literature on probabilistic causation that descends from Yule can be viewed as

an attempt to both relax this feature and distinguish between causal and non-causal regularities. The basic idea is as follows. First, there is a putative cause Z prior in some sense to an outcome Y. Further, Z and Y are associated (correlated). However, if the Z-Y association vanishes when a (set of) variable(s) X prior to Z is conditioned on (or in some accounts, if such a set exists), this is taken to mean that Z ‘does not cause’ Y, that is, the relationship is ‘spurious’. To complete the picture, various examples where this criterion would seem to work well have been constructed.

Granger causation and structural equation models use this type of reasoning to distinguish between empirical relationships that are regarded as causal and not causal. For example, consider a structural equation model, with X a vector of variables, Z an outcome occurring after X, and Y an outcome after Z. If the ‘direct effect’ of Z on Y is 0 (not 0), Z is not viewed (viewed) as a cause of Y. A sufficient condition for this direct effect to be 0 is that Y and Z are conditionally independent, given X. Of course, this same kind of reasoning can be extended to the consideration of other types of direct and indirect effects. For example, consider the case of a variable X associated with Z, with X and Y conditionally independent, given Z, implying the ‘direct effect’ of X on Y is 0, and Z and Y not conditionally independent given X, implying the ‘direct effect’ of Z on Y is non-zero. Here there is an effect of X on Y, but the effect is indirect, through Z.

There are a number of problems with this approach. First and foremost, it confounds causation with the act of inferring causation, as evidenced by the fact that the criteria above for inferring causation are typically put forth independently of any explicit notion of the causal relation. As notions of the causal relation vary, this method of inferring causation may be appropriate for some notions of causation, for example, the case where causation is regarded as a predictive relationship among variables, but not for others. Nor (because the nature of the causal relation is not explicitly considered), is it clear whether causation is viewed as probabilistic in some inherent sense or if probability arises in some other way. The deficiencies of this approach are evident in the psychological literature on causal modeling, where a variety of extra-mathematical considerations (such as model specification) are used to suggest that model coefficients can be endowed with a causal interpretation (see, for example, Sobel 1995 on this point).

By way of contrast to regularity theories, manipulability theories view causes as variables that can be manipulated, with the outcome depending on the state of the manipulated variable. Here, as opposed to specifying all the variables and the functional relationship between these and the outcome (which would constitute a successful causal account of a phenomenon under a regularity theory), the goal is more modest, to examine the ‘effect’ of a particular variable. See Sobel (1995) for a reconciliation of these two approaches.

Manipulability theories require the causal relation to sustain a counterfactual conditional statement (e.g., eating the poison caused John to die means that John ate the poison and died, but had John not eaten the poison, he would not have died). This is closer to the way an experimentalist thinks of causation. However, many philosophers regard manipulability theories as anthropomorphic. Further, many questions that scientists ask are not amenable to experimentation, e.g., the effect of education on longevity or the effect of marriage on happiness. This would appear to seriously limit the value of this approach for addressing real scientific questions.

However, even without manipulating a person’s level of education, one might imagine that had this person’s level of education taken on a different value than that actually realized, this person might also have a different outcome. This suggests adopting the broader view that it is not the manipulation per se, but the idea that the causal variable could take on a different value than it actually takes on, which is key. This is the idea underlying counterfactual theories of

causation (e.g., Lewis 1973), where the closest possible world to the one we live in serves as the basis for the counterfactual.

Counterfactual theories also have their difficulties, both theoretical and practical. One criticism goes under the rubric of ‘preemption’. Person A shoots person C in the head, and C dies. It seems natural to claim that person A caused person C to die. Yet, suppose that if A had not shot C, B would have done so and C would also have died. In that case, C dies whether or not A shoots him, so one cannot (under a simple counterfactual theory) say A caused C to die. This seems wrong.

In practice, the outcome may also depend on the way in which the cause is brought about. When an experiment is performed, this issue is not, per se, problematic, and the effect corresponding to that manipulation is well defined. Otherwise, as there may be different outcomes, ‘the effect’ is ill-defined, unless the closest world is specified; in some instances, this will be a very difficult task. This suggests that some questions, e.g., the effect of marriage on happiness, may be better left unasked (or at the minimum, one must specify the hypothetical intervention by which persons are exposed/not exposed to marriage and to marriage partners).

I now turn to the recent statistical literature on causal inference, which is also based on the idea that causal relations sustain counterfactual conditionals. This approach to causal inference (as suggested by the preceding material) is not concerned with elucidating the various causes of the outcome (effect) and the way in which these causes produce the effect, but with the more limited goal of inferring the effect (in a sense to be described) of a particular causal variable. Scientists who are interested in a fuller accounting of the causes of an effect and the pathways through which the effect is produced may find this approach less than entirely satisfying. However, as discussed subsequently, this approach can also be used to evaluate methods (such as structural equation models) that researchers sometimes use to provide a fuller account, and it is not hard to show that these methods rest on a number of implausible assumptions.

UNIT AND AVERAGE CAUSAL EFFECTS

The notion of causation congruous to recent statistical work on causal inference has two important properties. First, the causal relation is singular, i.e., it is meaningful to speak of effects at the individual level and these effects may vary over individuals (heterogeneity). Second, causal statements sustain counterfactual conditionals. Thus, we might state that attending health class caused Bill to drink less, by which we mean that Bill went to health class and later drank amount $y$, whereas had he not attended health class, later he would have drunk $y^* > y$. For Mary, perhaps the outcome is the same whether or not she attends, in which case we would say that attending health class did not cause Mary to drink less (or more). Note that only attending health class is considered as the cause. Other possible causes, e.g., sex, are regarded as part of the causal background (pretreatment covariates in statistical language).

The single most important contribution in this literature is the potential outcomes notation developed by Neyman (1923) and Rubin (1974) to express the ideas above. Using this notation allows causal effects to be defined independently of the association parameters that are actually estimated in studies; one can then ask whether and under what conditions these associations equal the causal effects. Consider the case of an experiment where unit $i$ in a population P is assigned (or not) to receive a treatment. The data for this unit is typically written as: $(Z_i, Y_i, X_i)$, where $Z_i = 1$ if $i$ is assigned to receive the treatment, 0 otherwise, $Y_i$ is the value of the outcome and $X_i$ is a vector of covariates. Although this representation is adequate for descriptive modeling [e.g., the regression function $E(Y \mid Z, X)$], it does not adequately express the idea that Z might take on different values and that $i$’s outcome might vary with this. One way to formalize this idea is to consider two outcomes for $i$,

$Y_{zi}(0)$, the outcome $i$ would have if he is not assigned to receive treatment and $Y_{zi}(1)$, the outcome $i$ would have if he is assigned to receive treatment. With this notation, it is then straightforward to define singular causal effects (unit effects) $h(Y_{zi}(0), Y_{zi}(1))$ [where there is no effect if $Y_{zi}(0) = Y_{zi}(1)$]; the unit effects then serve as the building blocks for various types of average effects.

Were it possible to take a sample from $(Y_z(0), Y_z(1))$, it would be a simple matter to obtain the unit effects $h(Y_{zi}(0), Y_{zi}(1))$ for the sampled units and to estimate various parameters that are functions of these. The literature on causal inference arises from the fact that it is only possible to observe one of the potential outcomes.

A limitation of the notation above is that only a scalar outcome and a binary treatment are considered. Because the generalization to a random object and to arbitrary types of treatments is trivial and does not generate substantially new issues, I continue to treat the case of a binary treatment and scalar outcome. A more serious limitation (though it is again not difficult to generalize this notation) that does generate new issues is that the notation above does not allow for interference (Cox, 1958), that is, $i$’s potential outcomes are not allowed to depend on the treatment received by other units. Rubin (1980) calls this the stable unit treatment value assumption (SUTVA). Although this assumption is often reasonable and almost universally made, there are many instances in the social and behavioral sciences where it is untenable. For example, in schools, children in the same (or even different) classrooms may interfere with one another. This case has been studied by Gitelman (2005); for the more general case, see Halloran and Struchiner (1995) and Sobel (2006a, 2006b). Hereafter, I shall assume SUTVA holds.

Although the unit effects above cannot be determined (since only one of the potential outcomes is ever observed), it turns out remarkably that under suitable conditions (discussed later) various types of averages of these effects are nevertheless identifiable and can be consistently estimated.

As a simple example, consider the case above, with $h(Y_{zi}(0), Y_{zi}(1)) = Y_{zi}(1) - Y_{zi}(0)$. The ‘intent to treat’ estimand (hereafter ITT), which is commonly featured in connection with randomized clinical trials, is defined as:

$$E(Y_z(1) - Y_z(0)), \tag{1}$$

the average of the unobserved unit effects. Because the expected value is a linear operator, the ITT can also be expressed as $E(Y_z(1)) - E(Y_z(0))$. Thus, if it is possible to take a random sample of size $n$ from P and then take random sub-samples from $Y_z(1)$ and $Y_z(0)$, the difference between the sample averages:

$$\frac{\sum_{i=1}^{n} Z_i Y_i}{\sum_{i=1}^{n} Z_i} - \frac{\sum_{i=1}^{n} (1 - Z_i) Y_i}{\sum_{i=1}^{n} (1 - Z_i)} \tag{2}$$

is an unbiased and consistent estimator of the ITT.

The ITT is one of a number of possible parameters of interest and may not always be of greatest scientific or policy relevance. It measures the effect of treatment assignment, and as subjects may not always take up the treatments to which they are assigned, the ITT does not measure the effect of treatment itself. A policy maker might nevertheless argue that the ITT is of primary interest because it measures the effect that would actually be observed in the real world. As an example, consider the effect of a universally free school-breakfast program vs. the current Federal program (Crepinsek et al., 2006) on total food and nutrient intake. Some students will take up the free breakfast, others will not. From the policy maker’s perspective, if the program is highly effective amongst those who take it up, but the takers are a small percentage of those who might benefit, the program may be judged a failure.

In observational studies where treatments are not assigned, one observes only whether a subject takes up a treatment ($D = 1$) or not ($D = 0$); defining the potential outcomes as $Y_{di}(0)$ and $Y_{di}(1)$, interest often centers on

the average treatment effect (hereafter ATE):

$$E(Y_d(1) - Y_d(0)) \tag{3}$$

or the effect of treatment on the treated (ATT):

$$E(Y_d(1) - Y_d(0) \mid D = 1). \tag{4}$$

Neyman (1923) first considered the ATE. The ATT was first considered by Belsen (1956) and discussed in detail by Rubin (1978). The ATE measures the average effect if all persons in the population are given the treatment, whereas the ATT measures the average effect of the treatment in the subpopulation that takes up the treatment. The ATE is a natural parameter of interest if the treatment corresponds to a policy that is under consideration for universal and mandatory adoption. In cases where adoption is voluntary, some economists have argued that only the ATT is relevant, because it reflects what would actually occur if the policy were to be implemented. However, one might also want to know if those who do not adopt the policy would benefit, because an affirmative answer might suggest to a policy maker that efforts focus on increasing the take-up rate. Additionally, if the ATT is positive, and persons who have not taken up the policy have access to this type of information, they might be more motivated to do so. This suggests that in general, for policy purposes, one might wish to know, in addition to the ATT, the average effect of treatment on the untreated (ATU) and the ATE, which is a weighted average of the ATT and ATU. The ATE will also be a more natural parameter of interest than the ATT in many contexts where the focus is on the basic science, where the causal variable may not be one that can be manipulated for policy purposes.

Various other parameters may also be of interest. First, for both scientific and policy reasons, one often wants to know whether the effects above vary in different sub-populations defined by characteristics of the units. Let X denote a vector of variables that are not affected by the treatment (or assignment variable); for example, take X to be a vector of pretreatment covariates. This leads to consideration of the parameters ITT(X), ATE(X), and ATT(X), where, for example, ATE(X) is defined as $E(Y_d(1) - Y_d(0) \mid X)$, and the other parameters are defined analogously.

Although attention herein focuses on the parameters above, a number of other interesting and/or useful parameters have been defined and considered. Björklund and Moffitt (1987) defined (and discussed the economic relevance of) the marginal treatment effect for subjects indifferent between participating or not in a program of interest. Quantile treatment effects, the difference between the marginal quantiles of $Y(1)$ and $Y(0)$, were defined by Doksum (1974) and Lehmann (1974); these effects have received some attention recently (e.g., Abadie, Angrist and Imbens, 2002). A parameter (discussed subsequently) that has received much attention lately is the local average treatment effect (LATE) considered by Angrist, Imbens and Rubin (1996).

Many other parameters might also be considered. A decision maker might wish to consider the utilities of the potential outcome values and ask whether a treatment increases average utility or some other measure of social welfare; this is a matter of considering $U(Y)$ as opposed to $Y$.

The average effects above take as building blocks the unit differences $Y_{di}(1) - Y_{di}(0)$ (or $Y_{zi}(1) - Y_{zi}(0)$). As will be evident later, because averages of these depend only upon the marginal distributions of $Y_d(0)$ and $Y_d(1)$, the effects in question are identified if the marginal distributions are identified. Parameters that depend on the joint distribution of the potential outcomes may also be defined (e.g., the proportion who would benefit from treatment), but the data, even from a randomized experiment, typically contain little or no information about this joint distribution, so these parameters will not be identifiable without introducing additional assumptions. While this may appear to be a serious limitation, several comments are in order. First, sometimes a transformation may produce an estimand of the desired form.

For example, for positive variables, with $h(Y_d(0), Y_d(1)) = Y_d(1)/Y_d(0)$, redefining the potential outcomes as $\log Y_d(0)$ and $\log Y_d(1)$ gives transformed effects in the desired form. Second, in decision making, the additional information contained in the joint distribution may be irrelevant (Imbens and Rubin, 1997) to the policy maker. Third, at least occasionally, plausible substantive assumptions could lead to identification of the joint distribution. For example, let the outcome be survival (1 if alive, 0 if dead) and suppose one wants to know the proportion benefiting under treatment; it is easy to see that if treatment is at least not harmful, the joint distribution is identified from the marginal distributions of the potential outcomes. However, in general, this is not the case, and as there is likely to be very little scientific knowledge about quantities like the joint distribution of potential outcomes in a study in which the marginal distribution is not even assumed to be known, the identification of parameters involving the joint distribution will typically require making assumptions that are substantively heroic (although perhaps mathematically convenient) and possibly quite sensitive to violations.

I now consider the identification of causal parameters.

IDENTIFICATION OF CAUSAL PARAMETERS UNDER IGNORABLE TREATMENT ASSIGNMENT

Random assignment is an assumption about the way units are assigned to treatments. At the heart of randomized experiments, this assumption enables identification of causal parameters. In the simplest case where each subject is assigned with probability $0 < \pi < 1$ to the control condition and probability $1 - \pi$ to the treatment condition, random assignment implies treatment assignment is “ignorable”, i.e., Z is independent of background covariates and potential outcomes:

$$Z \perp\!\!\!\perp X, Y_z(0), Y_z(1) \tag{5}$$

To see how (5) is used for identification, note that whether or not randomization is used to assign subjects to treatments, what can actually be observed is a sample from the joint distribution of $(Y, Z, X)$. From this distribution, the conditional distributions $Y \mid Z = z$ for $z = 0, 1$ are identified, and as $Y = ZY_z(1) + (1 - Z)Y_z(0)$, the distribution $Y \mid Z = z$ is the distribution $Y_z(z) \mid Z = z$. Thus, the population means $E(Y \mid Z = 1) = E(Y_z(1) \mid Z = 1)$ and $E(Y \mid Z = 0) = E(Y_z(0) \mid Z = 0)$ are identifiable, so the difference:

$$E(Y \mid Z = 1) - E(Y \mid Z = 0) = E(Y_z(1) \mid Z = 1) - E(Y_z(0) \mid Z = 0) \tag{6}$$

is also identified. In general, (6) does not equal (1) because the identified conditional distributions $Y_z(z) \mid Z = z$, $z = 0, 1$, are not equal to the corresponding marginal distributions $Y_z(z)$, $z = 0, 1$. But under the random assignment assumption, (5) holds, implying equality of the two sets of distributions; hence (6) = (1).

Often an investigator will also want to know if the value of the ITT depends on covariates of interest. The parameter of interest is then $\mathrm{ITT}(X) = E(Y_z(1) - Y_z(0) \mid X)$. As (for the case above):

$$0 < \pi_z(X) \equiv \Pr(Z = 1 \mid X) < 1, \tag{7}$$

and since assumption (5) implies:

$$Z \perp\!\!\!\perp Y_z(0), Y_z(1) \mid X, \tag{8}$$

that is, assignment is random within levels of X, ITT(X) is identifiable and equal to:

$$E(Y \mid Z = 1, X) - E(Y \mid Z = 0, X). \tag{9}$$

Just as the assumption of random assignment is the key to identifying causal parameters from the randomized experiment, the assumption of random assignment within blocks (sub-populations) is the key to identifying causal parameters from the randomized block experiment. It is also the key to making causal inferences from observational studies.
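The identification argument above can be checked numerically. The following is a minimal sketch in Python (standard library only), assuming a hypothetical toy population with one binary covariate X, unit effects equal to 1 when X = 0 and 2 when X = 1, and assignment probabilities that may depend on X; none of these choices come from the chapter. When the assignment probability is the same in both strata, (5) holds and the difference in means (6) should be close to the ITT (1). When the probability differs across strata, only (8) holds: the difference in means is then off, while the within-stratum differences (9), averaged over the sample distribution of X, should recover the ITT.

```python
import random


def simulate(n, p_treat, seed=0):
    """Draw n units with a binary covariate X, both potential outcomes, and an
    assignment Z that is random within levels of X (hypothetical toy model)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y0 = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)   # Y_z(0)
        y1 = y0 + 1.0 + x                          # Y_z(1): unit effect is 1 if x = 0, 2 if x = 1
        z = 1 if rng.random() < p_treat[x] else 0  # Pr(Z = 1 | X = x) = p_treat[x]
        y = z * y1 + (1 - z) * y0                  # only one potential outcome is observed
        rows.append((x, z, y, y0, y1))
    return rows


def mean(values):
    return sum(values) / len(values)


def difference_in_means(rows):
    """Sample analogue of E(Y | Z = 1) - E(Y | Z = 0), i.e. of (6)."""
    treated = [y for _, z, y, _, _ in rows if z == 1]
    control = [y for _, z, y, _, _ in rows if z == 0]
    return mean(treated) - mean(control)


def itt_given_x(rows, x_value):
    """Sample analogue of (9) within the stratum X = x_value."""
    return difference_in_means([r for r in rows if r[0] == x_value])


def stratified_itt(rows):
    """Average the within-stratum differences over the sample distribution of X."""
    n = len(rows)
    total = 0.0
    for x_value in (0, 1):
        share = sum(1 for r in rows if r[0] == x_value) / n
        total += share * itt_given_x(rows, x_value)
    return total


def true_itt(rows):
    """Average unit effect; computable only because the simulation keeps both outcomes."""
    return mean([y1 - y0 for _, _, _, y0, y1 in rows])


if __name__ == "__main__":
    schemes = [("(5) holds", {0: 0.5, 1: 0.5}),
               ("only (8) holds", {0: 0.2, 1: 0.8})]
    for label, p_treat in schemes:
        rows = simulate(100_000, p_treat, seed=1)
        print(label,
              "| true ITT:", round(true_itt(rows), 3),
              "| difference in means:", round(difference_in_means(rows), 3),
              "| stratified:", round(stratified_itt(rows), 3))
```

Running the script prints the three quantities under both assignment schemes. The "true ITT" column is available only because the simulation retains both potential outcomes for every unit, which real data never do.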

When assumptions (7) and (8) hold, treatment assignment is said to be strongly ignorable, given X (Rosenbaum and Rubin, 1983). In observational studies in the social and behavioral sciences, where subjects choose the treatment received (D), the assumption:

$$D \perp\!\!\!\perp X, Y_d(0), Y_d(1) \tag{10}$$

[akin to (5)] is likely to be unreasonable, as typically evidenced by differences in the distribution of covariates in the treatment and control groups. However, if the investigator knows (as in a randomized block experiment) the covariates that account for the differential assignment of subjects into the treatment and control groups:

$$D \perp\!\!\!\perp Y_d(0), Y_d(1) \mid X, \tag{11}$$

and if:

$$0 < \pi_d(X) \equiv \Pr(D = 1 \mid X) < 1, \tag{12}$$

for all X, i.e., treatment received D is strongly ignorable given X, the parameter ATE(X):

$$E(Y_d(1) - Y_d(0) \mid X) \tag{13}$$

is identified and equals $E(Y \mid X, D = 1) - E(Y \mid X, D = 0)$. It also follows that ATE(X) = ATT(X):

$$E(Y_d(1) - Y_d(0) \mid X, D = 1). \tag{14}$$

Typically, the investigator will be interested not only in ATE(X) and/or ATT(X), but also in $\mathrm{ATE} = E_P(\mathrm{ATE}(X))$ and $\mathrm{ATT} = E_{P^*}(\mathrm{ATT}(X))$, where $P^*$ is the sub-population of units that receive treatment. Note that, in general, $\mathrm{ATT} \neq \mathrm{ATE}$ because these parameters are obtained by averaging over different units: the ATE is a weighted average of the ATT and the ATU.

More generally, as at the beginning of this section, under the types of ignorability assumptions above, it is possible to identify the marginal and conditional (given X) distributions of the potential outcomes in P and to therefore consider any causal estimand that can be defined in terms of these distributions. I do not consider this matter further here, save to note that estimating these distributions will (when the outcome Y is metrical) typically be more difficult than estimating the average causal effects above.

Second, the average effects above can be identified under weaker ignorability assumptions than those given here. For example, ATE(X) and ATT(X) are identified under the marginal ignorability assumption:

$$D \perp\!\!\!\perp Y_d(d) \mid X \tag{15}$$

for $d = 0, 1$, and occasionally using this weaker assumption is advantageous. It is also obvious that for estimating means, ignorability assumptions can be replaced by the weaker condition of so-called ‘mean independence’, e.g., $E(Y_d \mid X, D = d) = E(Y_d \mid X)$. However, it is difficult to think of situations where mean independence holds and ignorability does not. Additionally, mean independence does not hold for functions of $Y$, such as $U(Y)$, the utility of $Y$. Thus, I do not consider this further.

In observational studies, it will often be the case that an investigator is not sure if he/she has measured all the covariates X predictive of both the treatment and the outcome. Not surprisingly (as it is not possible to observe both potential outcomes), ignorability assumptions are not, per se, testable; attempts to assess such assumptions invariably rely on various types of auxiliary assumptions (Rosenbaum, 1987; Rosenbaum, 2002).

When an investigator believes there are variables he/she has not measured that predict both the treatment and the potential outcomes, it is nevertheless sometimes possible (using other types of assumptions) to estimate the parameters above (or parameters similar to these). The section ‘Mediation analyses’ examines the use of this approach in the context of mediation. Another approach that has been used involves the use of fixed effects models (and differences in differences) to remove the effects of unmeasured variables. When the investigator knows the treatment assignment rule, but the assignment probabilities are 0 and 1, as is the case in risk

based allocation (Thistlethwaite and Campbell, 1960), causal inferences necessarily rely on extrapolation. Nevertheless, in some cases, reasonable inferences can be made (Finkelstein, Levin and Robbins, 1996).

Other approaches in the absence of ignorability include bounding causal effects (Manski, 1995; Robins, 1989). Bounds that make few assumptions are often quite wide and not especially useful. Nevertheless, when assumptions leading to tighter bounds are credible, this approach may be quite helpful. Additionally, sensitivity analyses can also be very useful; if ignorability is violated due to an unmeasured covariate, but the results are robust to this violation, credible inferences can nevertheless be made (see Rosenbaum, 2002, for further material on this topic).

ESTIMATION OF CAUSAL PARAMETERS IN RANDOMIZED STUDIES

I consider ITT(X) and ITT in this section, using these cases to introduce the primary ideas underlying the estimation of causal effects in the simplest setting. The discussion is organized around two broad approaches: (1) using potential outcomes imputed by regression or some other method (e.g., matching) and using the observed and imputed outcomes to estimate ITT; and (2) reweighting the data in the treatment and control groups to reflect the composition of the population P (or an appropriate subpopulation thereof). The estimators considered under the first approach have been used in the experimental design literature for many years and will be familiar to most readers.

Estimation of ITT(X) and ITT in randomized studies

The simplest case, previously considered, estimates the ITT using (2) under the identification condition (5). It is also useful to note that (2) is also the coefficient $\hat{\tau}$ in the ordinary least squares regression of Y on Z:

$$Y_i = \alpha + \tau Z_i + \varepsilon_i, \tag{16}$$

$i = 1, \ldots, n$, where the parameters are identified by the assumption $E(\varepsilon) = 0$.

The estimator (2) also arises by predicting the missing outcomes $Y_{zi}(0)$ (if $Z_i = 1$) or $Y_{zi}(1)$ (if $Z_i = 0$) using the estimated minimum mean square error predictor $\hat{E}(Y \mid Z)$. Let $\hat{Y}_{zi}(0) = Y_{zi}(0)$ if $Z_i = 0$, and $\hat{\alpha} = \hat{E}(Y \mid Z = 0)$ otherwise, $\hat{Y}_{zi}(1) = Y_{zi}(1)$ if $Z_i = 1$, $\hat{\alpha} + \hat{\tau} = \hat{E}(Y \mid Z = 1)$ otherwise; thus, (2) can also be written as:

$$n^{-1} \sum_{i=1}^{n} (\hat{Y}_{zi}(1) - \hat{Y}_{zi}(0)). \tag{17}$$

In the case of a randomized block experiment, where the probability of assignment to the treatment group depends on known covariates, assumption (5) will be violated, but if the covariates are unrelated to the potential outcomes, then:

$$Z \perp\!\!\!\perp Y_z(0), Y_z(1), \tag{18}$$

in which case (2) is still unbiased and consistent for ITT.

When the covariates are related to both treatment assignment and the potential outcomes, (8) provides the basis for extending the approach above. Let the covariates X take on L distinct values, corresponding to blocks $b = 1, \ldots, L$ and let $g(X) \equiv B$ be the one-to-one onto function mapping X onto the blocking variable B. Within each block $b$, $n_{1b} = n_b \Pr(Z = 1 \mid B = b)$ of the $n_b$ units are assigned to the treatment group, where $0 < \Pr(Z = 1 \mid B = b) < 1$ for all $b$. The matched pairs design is the special case where the sample size is $2n$, $L = n$, $n_{1b} = 1$, $n_{0b} = n_b - n_{1b} = 1$.

The regression corresponding to (16) is:

$$Y_i = \sum_{b=1}^{L} 1_{\{B_i\}}(b)(\alpha_b + \tau_b Z_i + \varepsilon_i), \tag{19}$$

where $1_{\{B_i\}}(b) = 1$ if $B_i = b$, 0 otherwise, and $E(\varepsilon \mid X, Z) = 0$. Thus, $\tau_b$ is the
value of ITT(X) in block b, with estimator $\hat{\tau}_b = \bar{Y}_{\{Z=1,B=b\}} - \bar{Y}_{\{Z=0,B=b\}}$, the difference between the treatment group and control group means in this block. The ITT can then be estimated using the estimated marginal distribution (or the marginal distribution if it is known) of the blocking variable:

\[ \widehat{\mathrm{ITT}} = \sum_{b=1}^{L} (\bar{Y}_{\{Z=1,B=b\}} - \bar{Y}_{\{Z=0,B=b\}}) \widehat{\Pr}(B = b). \tag{20} \]

Under random sampling from P, $\widehat{\Pr}(B = b) = n_b/n$, and (20) = (17); for the matched pair design, in addition, (17) = (2).

As above, it is also easy to see that:

\[ \bar{Y}_{\{Z=1,B=b\}} - \bar{Y}_{\{Z=0,B=b\}} = (n_b)^{-1} \sum_{i=1}^{n} 1_{\{B_i\}}(b)(\hat{Y}_{zi}(1) - \hat{Y}_{zi}(0)), \tag{21} \]

where the missing outcomes are imputed using the estimated ‘best’ predictor; thus the estimator (20) can also be obtained by imputing missing potential outcomes.

Another approach to estimating the ITT under (8) is to reweight the treatment group (control group) observations in such a way that the reweighted data would be a random sample from the distribution of $(Y_z(1), X)$ ($(Y_z(0), X)$) and then apply the simple estimator (2) to the weighted data. This is the essence of ‘inverse probability weighting’ (Horvitz and Thompson, 1952).

To see how this works, suppose that a proportion $\pi_z(x)$ of the observations at level x of X are in the treatment group. Under random sampling from the distributions $(Y_z(1), X)$ and $(Y_z(0), X)$, the treatment and control groups should have the same distribution on X. If the treated observations at level x are weighted by $\pi_z^{-1}(x)$ and the control group observations at x by $(1 - \pi_z(x))^{-1}$, the treated and controls will have the same distribution on X in the weighted data set and (2) can be applied to the weighted data, yielding the IPW estimator:

\[ \frac{\sum_{i=1}^{n} \pi_z^{-1}(X_i) Z_i Y_i}{\sum_{i=1}^{n} \pi_z^{-1}(X_i) Z_i} - \frac{\sum_{i=1}^{n} (1 - \pi_z(X_i))^{-1} (1 - Z_i) Y_i}{\sum_{i=1}^{n} (1 - \pi_z(X_i))^{-1} (1 - Z_i)}. \tag{22} \]

Using $YZ = Y_z(1)Z$, elementary properties of conditional expectation, and assumption (8) leads to a more formal justification:

\[ E(\pi_z^{-1}(X)ZY) = E(E(\pi_z^{-1}(X)ZY \mid X)) = E(\pi_z^{-1}(X)E(ZY_z(1) \mid X)) = E(\pi_z^{-1}(X)E(Z \mid X)E(Y_z(1) \mid X)) = E(Y_z(1)). \]

Finally, note that in the randomized block experiment $\pi_z(x) = \pi_z(g(x)) = \pi_z(b) = n_{1b}/n_b$, and the estimate (22) is identical to (20) under random sampling from P.

Estimation of treatment effects in observational studies

In observational studies in the social and behavioral sciences, the assumption that treatment D is unrelated to the potential outcomes $Y_d(0)$ and $Y_d(1)$ is unlikely to hold. Estimation of the treatment effects ATE(X), ATT(X), ATE and ATT is therefore considered under the assumption (given by (11) and (7)) that treatment assignment is strongly ignorable, given the covariates X (Rosenbaum and Rubin, 1983). For a more extensive treatment of estimation under strongly ignorable treatment assignment, see the reviews by Imbens (2004) and Schafer and Kang (2007).

In principle, this case has already been considered. Nevertheless, new issues arise in attempting to use the estimators previously considered. There are several reasons for this. First, in a randomized block experiment, the treatment and control group probabilities depend on the covariates in a known way, that is, $\pi_z(X)$ is known. Thus, for example, if inverse probability weighting is used to estimate the ITT, the weights are known. This is not the case in an observational study. Second, in a randomized block experiment (8) is a byproduct of the study design, whereas
in observational studies the analogue (10) is an assumption; this issue was briefly discussed earlier. In practice, making a compelling argument that a particular set of covariates renders (8) true is the most difficult challenge facing empirical workers who want to use the methods below to make inferences about various types of treatment effects. Third, in a randomized block experiment, the covariates used in blocking take on (in principle) relatively few values, and thus the ITT can be estimated non-parametrically using the first approach (as in (20)). In an observational study, where X is most likely high dimensional, this is no longer the case. And finally, it is necessary (though not difficult) to modify estimators of the ATE(X) and ATE to apply when it is desired to estimate ATT(X) and ATT.

Regression estimators use estimates ($\hat{E}$) of the regressions $E(Y_d(1) \mid X)$ and $E(Y_d(0) \mid X)$ to impute missing potential outcomes: if $D_i = 1$, $\hat{Y}_{di}(1) = Y_{di}(1)$, $\hat{Y}_{di}(0) = \hat{E}(Y_d(0) \mid X = x_i)$, and if $D_i = 0$, $\hat{Y}_{di}(0) = Y_{di}(0)$, $\hat{Y}_{di}(1) = \hat{E}(Y_d(1) \mid X = x_i)$. These are then used to impute the unit effects and the ATE is then estimated by averaging over these:

\[ n^{-1} \sum_{i=1}^{n} (\hat{Y}_{di}(1) - \hat{Y}_{di}(0)). \tag{23} \]

As ATE(X) = ATT(X), the ATT can be obtained as above, averaging only over the $n_1$ treated observations.

As a starting point, consider the simplest (and still most widely used) regression estimator, where the covariates enter the response function linearly:

\[ Y_i = \alpha + \tau D_i + \beta' X_i + \varepsilon_i, \tag{24} \]

where the parameters are identified by the condition $E(\varepsilon \mid X, D) = 0$. This leads to the well-known regression adjusted estimator of ATE(X):

\[ \hat{\tau} = (\bar{Y}_{\{D=1\}} - \bar{Y}_{\{D=0\}}) - \hat{\beta}'(\bar{X}_{\{D=1\}} - \bar{X}_{\{D=0\}}), \tag{25} \]

where $\bar{Y}_{\{D=1\}}$ is the treatment group mean, $\bar{Y}_{\{D=0\}}$ is the control group mean, and $\bar{X}_{\{D=1\}}$ ($\bar{X}_{\{D=0\}}$) is the sample mean vector for the covariates in the treatment (control) group. It is easy to see that (25) = (23).

There are essentially two problems with the estimator $\hat{\tau}$. First, the investigator typically does not know the form of the response function and the linear form is chosen out of convenience. This form has very strong implications: $\tau$ = ATE(X) = ATE = ATT, that is, not only are the effects the same at all levels of X, but the ATE is also the ATT. When the regression functions are misspecified, using $\hat{\tau}$ can yield misleading inferences. Second, when there are ‘regions’ with little overlap between covariate values in the treatment and control groups, imputed values are then based on extrapolations outside the range of the data. For example, if the treatment group members have ‘large’ values of a covariate $X_1$ and the control group members have ‘small’ values, the imputations $\hat{Y}_d(0)$ ($\hat{Y}_d(1)$) for treatment (control) group members will involve extrapolating the control group (treatment group) regression to large (small) $X_1$ values. This may produce very misleading results.

To deal with the first of these difficulties, a natural alternative is to use nonlinear regression, or in the typical case where the form of the regression functions is not known, to estimate these non-parametrically (as in (19)). Imbens (2004) reviews this approach. Non-parametric regression can work well if X is not high dimensional, but when there are many covariates to control for, as is typical in observational studies, the precision of the estimated regression may be quite low. This problem then spills over to the imputations. [But see also Hill and McCulloch (2007), who propose using Bayesian Additive Regression Trees to fit the regression functions, finding that estimates based on this approach are superior to those obtained using many other typically employed methods.]

Sub-classification (also called blocking) is an older method used to estimate causal effects that is also non-parametric in spirit. Here, units with ‘similar’ values of X are grouped into blocks and the ATE is estimated as in the case of a randomized block
experiment considered above. To estimate the ATT, the distribution of the blocking variable B in the group receiving treatment (rather than the overall population) is used; equivalently, the imputed unit effects are averaged over the treatment group only. In a widely cited paper, Cochran (1968) shows in a concrete example with one covariate that subclassification with five blocks removes 90% of the bias.

Matching is another long-standing method that has been used which avoids parametrically modeling the regression functions. Although matching can be used to estimate the ATE, it has most commonly been used to estimate the ATT in the situation where the control group is substantially larger than the treatment group. In this case, each unit $i = 1, \ldots, n_1$ in the treatment group is matched to one or more units ‘closest’ in the control group and the outcome values from the matched control(s) are used to impute $\hat{Y}_{di}(0)$. In the case of ‘one to one’ matching, unit i is matched to one control with value $Y^{*} \equiv \hat{Y}_{di}(0)$; if i is matched to more than one control, the average of the control group outcomes can be used.

There are many possible matching schemes. A unit can be matched with one or more others using various metrics to measure the distance between covariates X, and various criteria for when two units have covariate values close enough to constitute a ‘match’ can be used. In some schemes, matches are not reused, but in others are used again. In some schemes, not all units are necessarily matched. See Gu and Rosenbaum (1993) for a nice discussion of the issues involved in matching. Despite the intuitive appeal of matching, estimators that match on X typically have poor large sample properties (see Abadie and Imbens, 2006, for details).

The procedures above do not contend with the frequently encountered problem of insufficient overlap in the treatment and control groups. One alternative is to only consider regions where there is sufficient overlap, for example, to match only those treatment units with covariate values that are ‘sufficiently’ close to the values observed in the control group. When this is done, however, the quantity estimated is no longer the ATE or ATT because the average is only taken over the region of common support.

In an important paper, Rosenbaum and Rubin (1983) addressed the issue of overlap, proving that when (11) and (12) hold:

\[ Y_d(0), Y_d(1) \perp\!\!\!\perp D \mid \pi_d(X), \tag{26} \]

\[ 0 < \Pr(D = 1 \mid \pi_d(X)) < 1, \tag{27} \]

implying that any of the methods just discussed may be applied using the ‘propensity score’ $\pi_d(X)$ (which is a many to one function of X), rather than X; this cannot exacerbate the potential overlap problem and may help to lessen this problem. [For generalizations of the propensity score, applicable to the case where the treatment is categorical, ordinal or continuous, see Imai and van Dyk (2004), Imbens (2000) and Joffe and Rosenbaum (1999).]

Rosenbaum and Rubin (1983) also discuss (their corollary 4.3) using the propensity score to estimate the regression functions $E(Y_d(1) \mid \pi_d(X))$ and $E(Y_d(0) \mid \pi_d(X))$ when these are linear. Typically, the true form of the regression functions relating potential outcomes to the propensity score will be unknown. But these functions can be estimated non-parametrically more precisely using $\pi_d(X)$ than X. However, the advantage of this approach is somewhat illusory, as in observational studies, $\pi_d(X)$ will be unknown and must be estimated. Logistic regression is often used, but this form is typically chosen for convenience. To the best of my knowledge, the impact of using a misspecified propensity score in this case has not been studied. If, however, a non-parametric estimator is used (for example, a sieve estimator as described in Imbens, 2004), the so-called ‘curse of dimensionality’ is simply transferred from estimation of the regression function to estimation of the propensity score.

A straightforward way to use subclassification on the propensity score is to divide the unit interval into L equal length intervals and group the observations by their
estimated propensity scores. Within interval $I_\ell$, $\ell = 1, \ldots, L$, the ATE is estimated as:

\[ \frac{\sum_{i \in I_\ell} D_i Y_i}{\sum_{i \in I_\ell} D_i} - \frac{\sum_{i \in I_\ell} (1 - D_i) Y_i}{\sum_{i \in I_\ell} (1 - D_i)}. \tag{28} \]

The ATE is then estimated by averaging the estimates (28), using weights $\sum_{i=1}^{n} 1_{A_i}(\ell)/n$, where $1_{A_i}(\ell) = 1$ if $i \in I_\ell$, 0 otherwise. To estimate the ATT, the weights should be modified to reflect the distribution of the observations receiving treatment: $\sum_{i=1}^{n} 1_{A_i}(\ell) D_i / \sum_{i=1}^{n} D_i$. Lunceford and Davidian (2004) study subclassification on the propensity score and compare this with IPW estimators (discussed below). Using simulations, Drake (1993) compares the bias in the case where the propensity score is known to the case where it is estimated, finding no additional bias is introduced in the latter case. She also finds that when the model for the propensity score is misspecified, the bias incurred is smaller than that incurred by misspecifying the regression function. This may be suggestive, but without knowing how to put the misspecification in the two different models onto a common ground, it is difficult to attribute too much meaning to this finding.

Matching on propensity scores is widely used in empirical work and has also been shown to perform well in some situations (Dehejia and Wahba, 1999). Corollary 4.1 in Rosenbaum and Rubin (1983) shows the ATE can be estimated by drawing a random sample $\pi_d(X_1), \ldots, \pi_d(X_n)$ from the distribution of $\pi_d(X)$, then randomly choosing a unit from the treatment group and the control group with this value $\pi_d(x)$, and taking the difference $Y(1) - Y(0)$, then averaging the n differences. In practice, of course, the propensity score is usually unknown and must be estimated. [Rosenbaum (1987) explains the seemingly paradoxical finding that using the estimated propensity score tends to produce better balance than using the true propensity score.] Typically, the ATT is estimated by matching (using an estimate of the propensity score) each treated unit $i = 1, \ldots, n_1$ to one or more control group units. The control group outcomes are then used to impute $\hat{Y}_{di}(0)$ and the ATT is estimated as in the case of matching on X.

Because the propensity score is a balancing score, after matching, the distribution of the covariates should be similar in the treatment and control groups. In practice, a researcher should check this balance and, if there is a problem, the model for the propensity score can be refitted (perhaps including interactions among covariates and/or other higher order terms) and the balance rechecked. In this sense, proper specification of the propensity score model is not really at issue here: the question is whether the matched sample is balanced.

Finally, in practice, it is often found that when the estimated propensity scores are near 0 or 1, the problem of insufficient overlap in the treatment and control groups may be lessened, but it is still present. In this case, the same kinds of issues previously discussed reappear.

As seen above, the propensity score also features prominently when methods that use inverse probability weighting are used to estimate treatment effects. There the covariates took on L distinct values, each with positive probability, and the IPW estimator is identical to the non-parametric regression estimator. This will no longer be the case. Paralleling the material above, the ATE may be estimated as:

\[ \frac{\sum_{i=1}^{n} \hat{\pi}_d^{-1}(X_i) D_i Y_i}{\sum_{i=1}^{n} \hat{\pi}_d^{-1}(X_i) D_i} - \frac{\sum_{i=1}^{n} (1 - \hat{\pi}_d(X_i))^{-1} (1 - D_i) Y_i}{\sum_{i=1}^{n} (1 - \hat{\pi}_d(X_i))^{-1} (1 - D_i)}. \tag{29} \]

To estimate the ATT, it is necessary to weight the expression above by $\pi_d(X)$, giving

\[ \bar{Y}_{\{D=1\}} - \frac{\sum_{i=1}^{n} \hat{\pi}_d(X_i)(1 - \hat{\pi}_d(X_i))^{-1} (1 - D_i) Y_i}{\sum_{i=1}^{n} \hat{\pi}_d(X_i)(1 - \hat{\pi}_d(X_i))^{-1} (1 - D_i)}. \tag{30} \]

A problem with using these estimators is that probabilities near 0 and 1 assign large
weights to relatively few cases (Rosenbaum, 1987). IPW estimators do not require estimating the regression functions, but the weights must be estimated consistently in order that the estimator be consistent. When the model for the propensity score is misspecified, the weights will be estimated incorrectly and the IPW estimator will not be consistent; if the estimated probabilities near 0 and 1 are not close to the true probabilities, the bias can be substantial. To contend with this, Hirano, Imbens and Ridder (2003) propose the use of a sieve estimator for the propensity score, while Schafer and Kang (2007) propose using a ‘robit’ model (a more robust model based on the cumulative distribution function of the t distribution, as opposed to the normal distribution (probit model) or logistic distribution (logistic regression)).

Strategies for estimating treatment effects that combine two or more of the methods have also been proposed. For example, subclassification may still leave an imbalance between the covariates in the treatment and control groups. To reduce bias, linear regression of the outcome on D and X in each block may be used to adjust for the imbalance, e.g., as in (25) (Rosenbaum and Rubin, 1983). Matching estimators may be similarly modified.

Recently, a number of estimators that combine inverse probability weighting with regression have been proposed. These estimators have the property that so long as either the model for the propensity score is correct or the model for the regression function is correct, the estimator is consistent. Kang and Schafer (2007) do a nice job of explaining this idea, which originates in the sampling literature (Cassel, Sarndal and Wretman, 1976, 1977), and of summarizing the literature on this topic.

To give some intuition, consider estimation of the ATE. Suppose the population regression function is assumed to have the form:

\[ Y_{di}(1) = g(X_i) + \delta_i, \tag{31} \]

with $E(\delta_i \mid X_i) = 0$, giving $E(Y_d(1) \mid X) = g(X)$. As before, because (11) holds, the model can be estimated using the treated observations, and if the population model is correct and $\hat{g}$ is consistent, the estimator $n^{-1} \sum_{i=1}^{n} \hat{g}(X_i)$ is consistent for $E(Y_d(1))$. If the regression function is misspecified, the errors may not have 0 mean over P. But if a good estimate of the $\delta_i$ can be obtained, $E(Y_d(1))$ can be estimated as $n^{-1} \sum_{i=1}^{n} \hat{g}(X_i) + n^{-1} \sum_{i=1}^{n} \hat{\delta}_i$. To estimate the $\delta_i$ in P, the propensity score can be used. If the model for the propensity score is correct, $E(D_i \pi_d^{-1}(X_i) \delta_i) = E(\delta_i)$, and thus the estimator

\[ n^{-1} \sum_{i=1}^{n} \hat{g}(X_i) + n^{-1} \sum_{i=1}^{n} D_i \hat{\pi}_d^{-1}(X_i) \hat{\delta}_i \tag{32} \]

will be consistent for $E(Y_d(1))$.

On the other hand, if the model for the regression function is correct, then whether or not the model for the propensity score is correct, $E(D_i \pi_d^{-1}(X_i) \delta_i) = E(E(D_i \pi_d^{-1}(X_i) \delta_i \mid X)) = E(\pi_d^{-1}(X_i) E(D_i \mid X) E(\delta_i \mid X)) = 0$, as $E(\delta_i \mid X) = 0$.

A consistent estimate for $E(Y_d(0))$ can be constructed in a similar manner. The weighted least-squares estimator, with appropriately chosen weights, is another example; more generally, the regression function can be estimated semiparametrically (Robins and Rotnitzky, 1995). Kang and Schafer (2007) discuss a number of other estimators that are consistent so long as either the regression function or the propensity score is specified correctly. Such estimators are often called ‘doubly robust’; however, the reader should note that this terminology is a bit misleading. Statistical methods that operate well when the assumptions underlying their usage are violated are typically called robust. Here, the estimator is robust to misspecification of either the propensity score or the regression function, but not both. In that vein, Kang and Schafer’s (2007) simulations suggest that when neither the propensity score nor the regression function is correctly specified, doubly robust estimators are often more biased than estimators without this attractive theoretical property.
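To make the estimators in this section concrete, the following is a minimal sketch on simulated data, not an analysis from the chapter. It compares the unadjusted difference in means with a regression imputation estimator in the form of (23), a normalized IPW estimator in the form of (29), and an estimator that adds inverse-probability-weighted residuals to the regression predictions, in the spirit of (32). Everything specific here is an assumption made for illustration: the data-generating process (constant effect of 2, logistic propensity depending on the first covariate), the linear outcome models, the logistic propensity model, and the function names `simulate`, `fit_propensity`, `fit_outcome` and `estimate_ate`.

```python
import numpy as np

def simulate(n, seed=0):
    """Hypothetical observational study: constant treatment effect of 2."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    pi = 1.0 / (1.0 + np.exp(-X[:, 0]))            # true propensity score
    d = (rng.uniform(size=n) < pi).astype(float)   # treatment received
    y = 1.0 + 2.0 * d + 2.0 * X[:, 0] + X[:, 1] + rng.normal(size=n)
    return X, d, y

def add_const(X):
    """Prepend a column of ones for the intercept."""
    return np.column_stack([np.ones(len(X)), X])

def fit_propensity(X, d, iters=25):
    """Logistic regression of D on X by Newton-Raphson; returns fitted scores."""
    A = add_const(X)
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        hess = (A * (p * (1.0 - p))[:, None]).T @ A
        beta = beta + np.linalg.solve(hess, A.T @ (d - p))
    return 1.0 / (1.0 + np.exp(-A @ beta))

def fit_outcome(X, y, mask):
    """Least-squares fit of Y on X within one arm; predictions for all units."""
    coef, *_ = np.linalg.lstsq(add_const(X[mask]), y[mask], rcond=None)
    return add_const(X) @ coef

def estimate_ate(X, d, y):
    pi_hat = fit_propensity(X, d)
    g1 = fit_outcome(X, y, d == 1)   # estimate of E(Y(1) | X)
    g0 = fit_outcome(X, y, d == 0)   # estimate of E(Y(0) | X)
    w1, w0 = d / pi_hat, (1.0 - d) / (1.0 - pi_hat)
    # Unadjusted comparison of treated and control means.
    naive = y[d == 1].mean() - y[d == 0].mean()
    # Regression imputation: observed outcome where available, prediction otherwise.
    regression = np.mean(np.where(d == 1, y, g1) - np.where(d == 0, y, g0))
    # Normalized inverse probability weighting.
    ipw = np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
    # Regression predictions plus inverse-probability-weighted residuals.
    mu1 = np.mean(g1) + np.mean(w1 * (y - g1))
    mu0 = np.mean(g0) + np.mean(w0 * (y - g0))
    return {"naive": naive, "regression": regression, "ipw": ipw, "dr": mu1 - mu0}

X, d, y = simulate(20000, seed=0)
for name, value in estimate_ate(X, d, y).items():
    print(f"{name:>10s}: {value:.3f}")
```

In this simulation both working models are correctly specified, so the three adjusted estimates should all be close to 2 while the unadjusted difference is not. Replacing either model with a misspecified one (for example, dropping the first covariate from the outcome regression only) is one way to explore the double robustness property described above.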
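Subclassification on the propensity score can be sketched in the same way. The example below is illustrative only: it divides the unit interval into equal length intervals, computes the within-interval difference in means as in (28), and averages these differences using the share of all units in each interval (for the ATE) or the share of treated units in each interval (for the ATT). The simulated data, the use of the true propensity score in place of an estimate, the choice to drop intervals that lack treated or control units, and the function name `subclassify` are all assumptions made for this sketch.

```python
import numpy as np

def subclassify(pscore, d, y, n_blocks=5):
    """Subclassification on a propensity score with equal length intervals of [0, 1]."""
    # Interval index for each unit; a score of exactly 1 goes in the last interval.
    block = np.minimum((pscore * n_blocks).astype(int), n_blocks - 1)
    effects, w_ate, w_att = [], [], []
    for ell in range(n_blocks):
        in_block = block == ell
        treated = in_block & (d == 1)
        control = in_block & (d == 0)
        if treated.sum() == 0 or control.sum() == 0:
            continue  # no within-interval comparison is possible; interval dropped
        effects.append(y[treated].mean() - y[control].mean())  # as in (28)
        w_ate.append(in_block.sum())   # number of units in the interval
        w_att.append(treated.sum())    # number of treated units in the interval
    effects = np.array(effects)
    # np.average normalizes the weights over the retained intervals.
    ate = np.average(effects, weights=w_ate)
    att = np.average(effects, weights=w_att)
    return ate, att

# Hypothetical data: one confounder, constant treatment effect of 2.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
pscore = 1.0 / (1.0 + np.exp(-x))   # treated as known for this sketch
d = (rng.uniform(size=n) < pscore).astype(int)
y = 1.0 + 2.0 * d + 2.0 * x + rng.normal(size=n)

naive = y[d == 1].mean() - y[d == 0].mean()
ate, att = subclassify(pscore, d, y, n_blocks=5)
print(f"naive: {naive:.3f}  ATE (5 blocks): {ate:.3f}  ATT (5 blocks): {att:.3f}")
```

With five intervals most, but not all, of the bias in the unadjusted comparison is removed, because treated and control units within an interval still differ somewhat on the covariate. Dropping intervals without both groups changes the estimand to an average over the region of common support, which is the caveat raised in the text.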

MEDIATION ANALYSES

Mediation is a difficult topic and a thorough discussion would require an essay length treatment. The topic arises in several ways. First, even in randomized experiments, subjects do not always ‘comply’ with their treatment assignments. Thus, the treatment received D is an intermediate outcome intervening between Z and Y, and an investigator might want to know, in addition to the ITT (which measures the effect of Z on Y), the effect of D on Y. This might be of interest scientifically, and may also point, if the effect is substantial, but subjects don’t take up the treatment, to the need to improve the delivery of the treatment package. Traditional methods of analysis that compare subjects by the treatment actually received or which compare only those subjects in the treatment and control groups that follow the experimental protocol are flawed because treatment received D is not ignorable with respect to Y. To handle this, Bloom (1984) first proposed using Z as an instrument for D. Subsequently, Angrist, Imbens and Rubin (1996) clarified the meaning of the IV estimand. Second, and more generally, researchers often have theories about the pathways (intervening variables) through which a particular cause (or set of causes) affects the response variable and the effects of both the particular cause (causes) and the intervening variables are of interest. To quantify these effects, psychologists and others often use structural equation models, following Baron and Kenny (1986), for example. However, the ‘direct effects’ of D on Y and Z on Y in structural equation models should not generally be interpreted as effects; conditions (which are unlikely to be met) under which these parameters can be given a causal interpretation are also given below, as are conditions under which the IV estimand admits a causal interpretation.

Throughout, only the case of a randomized experiment with no covariates is considered; the results extend immediately to the case of an observational study where treatment assignment is ignorable only after conditioning on the covariates.

The local average treatment effect

As before, let $Z_i = 1$ if unit i is assigned to the treatment group, 0 otherwise. Let $D_{zi}(0)$ ($D_{zi}(1)$) denote the treatment i takes up when assigned to the control (treatment) group. Similarly, for z = 0, 1 and d = 0, 1, let $Y_{zdi}(0, 0)$ denote the response when i is assigned to treatment z = 0 and receives treatment d = 0; $Y_{zdi}(0, 1)$, $Y_{zdi}(1, 0)$, $Y_{zdi}(1, 1)$ are defined analogously. Let $Y_{zdi}(Z_i, D_{zi}(Z_i))$ denote i’s observed response.

In a randomized experiment, the potential outcomes are assumed to be independent of treatment assignment:

\[ D_z(0), D_z(1), Y_z(0, D_z(0)), Y_z(1, D_z(1)) \perp\!\!\!\perp Z. \tag{33} \]

The two ITTs (hereafter $\mathrm{ITT}_D$ and $\mathrm{ITT}_Y$) are identified as before by virtue of assumption (5); while these parameters are clearly of interest (and some would say these are the only parameters that should be of interest), neither parameter measures the effect of D on Y. That is because D is (in econometric parlance) ‘endogenous’. To deal with such problems, economists have long used instrumental variables (including two stage least squares), in which ‘exogenous’ variables that are believed to affect Y only through D are used as an instrument for D. The IV estimand (in the simple case herein) is:

\[ \frac{\mathrm{cov}(Z, Y)}{\mathrm{cov}(Z, D)} = \frac{E(Y \mid Z = 1) - E(Y \mid Z = 0)}{E(D \mid Z = 1) - E(D \mid Z = 0)} = \mathrm{ITT}_Y / \mathrm{ITT}_D. \tag{34} \]

Recently, Imbens and Angrist (1994) and Angrist, Imbens and Rubin (1996) clarified the meaning of the IV estimand (34) and the sense in which this estimand is a causal parameter. $\mathrm{ITT}_Y$ is a weighted average over four compliance types: (1) compliers, with $D_{zi}(0) = 0$, $D_{zi}(1) = 1$; (2) never takers, with $D_{zi}(0) = 0$, $D_{zi}(1) = 0$; (3) always takers, with $D_{zi}(0) = 1$, $D_{zi}(1) = 1$; and (4) defiers, with $D_{zi}(0) = 1$, $D_{zi}(1) = 0$, who take up treatment if not assigned to treatment and who do not take up treatment if assigned
to treatment. Often it will be substantively reasonable to assume there are no defiers; this is the ‘weak monotonicity assumption’ $D_{zi}(1) \geq D_{zi}(0)$ for all i. Because the never takers and always takers receive the same treatment irrespective of their assignment, any effect of treatment assignment on Y for these types cannot be due to treatment D. If it is reasonable to assume the effect of treatment assignment operates only via the treatment, i.e., there is no ‘direct effect’ of Z on Y, then the unit effect of Z on Y for never takers and always takers is 0; this is called the exclusion restriction. Under weak monotonicity and exclusion, $\mathrm{ITT}_Y$ therefore reduces to:

\[ E(Y_z(1, D_z(1)) - Y_z(0, D_z(0))) = E(Y_{zd}(1, 1) - Y_{zd}(0, 0) \mid D_z(0) = 0, D_z(1) = 1) \times \Pr(D_z(0) = 0, D_z(1) = 1). \tag{35} \]

As $\Pr(D_z(0) = 0, D_z(1) = 1) = E(D_z(1) - D_z(0))$ in the absence of defiers, provided this is greater than 0 (weak monotonicity together with this assumption is sometimes called ‘strong monotonicity’), (34) is the average effect of Z on Y for the compliers. If the direct effect of Z on Y for the compliers is also 0, (34) is also the effect of D on Y in this subpopulation; this is sometimes called the complier average causal effect (CACE) or the local average treatment effect (LATE). [For some further statistical work on compliance, see Imbens and Rubin (1997), Little and Yau (1998), Jo (2002), Hirano, Imbens, Rubin and Zhou (2000).]

Because compliance is such an important issue, empirical researchers have been quick to apply the results above. But researchers who want to know the ATT or ATE might find the average effect of Z on Y for compliers or LATE to be of limited interest when the proportion of compliers is small (e.g., about 15% in the example presented by Angrist et al., 1996). Researchers who estimate the IV estimand (or who use instrumental variables or two stage least squares) should be careful not to forget that compliers may differ systematically from the never takers and always takers. However, LATE = ATT in the not uncommon case where the only way to obtain the treatment is by being in the treatment group. In addition, although the ATT is defined under the assumption that the exclusion restriction holds, the average effect of Z on Y for compliers equals the average effect of Z on Y for the treated in this case. Finally, in the (unlikely) case where the treatment effects are constant, LATE = ATT = ATE.

Empirical workers should also remember the exclusion restriction is very strong (even if applied only to the never takers and always takers), and in a ‘natural’ experiment or a randomized experiment that is not double blinded, this restriction may not hold. Researchers who are in the position of being able to design a double-blinded, randomized experiment should do so, and researchers who are relying on a natural experiment should think very seriously about whether or not this restriction is plausible. Finally, it is also important to remember that the compliers constitute an unobserved sub-population of P, so that even if a policy maker were able to offer the treatment only to subjects in this subpopulation, he/she cannot identify these subjects with certainty.

The approach above also serves as the basis for the idea of principal stratification (Frangakis and Rubin, 2002). The essential idea is that for any intermediate outcome D (not necessarily binary), causal effects of D are defined within principal strata (subpopulations with identical values of $D_{zi}(0)$ and $D_{zi}(1)$).

Mediation and structural equation modeling

To facilitate comparison with the psychological literature, in which structural equation models are typically used to study mediation (Baron and Kenny, 1986; MacKinnon and Dwyer, 1993), I discuss the special case where D and Y are continuous, Z and D have additive effects on Y and the average effect of D on Y is linear (as described below); for a more general discussion, see Sobel (2008). As above, I make assumption (5) and examine the IV estimand;
the extension to the case where ignorability holds, conditional on covariates, is immediate.

Using potential outcomes, a linear causal model analogous to a linear structural equation model may be constructed:

\[ D_{zi}(z) = \alpha_1^c + \gamma_1^c z + \varepsilon_{1zi}^c(z) \tag{36} \]

\[ Y_{zdi}(z, d) = \alpha_2^c + \gamma_2^c z + \beta_2^c d + \varepsilon_{2zdi}^c(z, d), \tag{37} \]

where $E(\varepsilon_{1z}^c(z)) = E(\varepsilon_{2zd}^c(z, d)) = 0$; thus, $\gamma_1^c = \mathrm{ITT}_D$, $\gamma_2^c = E(Y_{zd}(1, d) - Y_{zd}(0, d))$ for any d is the average unmediated effect of Z on Y, and $\beta_2^c = E(Y_{zd}(z, d + 1) - Y_{zd}(z, d))$ for z = 0, 1 is the average effect of a one unit increase in D on Y.

A linear structural equation model for the relationship between Z, D and Y is given by:

\[ D_i = \alpha_1^s + \gamma_1^s Z_i + \varepsilon_{1i}^s \tag{38} \]

\[ Y_i = \alpha_2^s + \gamma_2^s Z_i + \beta_2^s D_i + \varepsilon_{2i}^s, \tag{39} \]

where the parameters are identified by the assumptions $E(\varepsilon_1^s \mid Z) = 0$ and $E(\varepsilon_2^s \mid Z, D) = 0$. Thus, the ‘direct effects’ of Z on D and Y, respectively, are: $\gamma_1^s = E(D \mid Z = 1) - E(D \mid Z = 0)$, $\gamma_2^s = E(Y \mid Z = 1, D = d) - E(Y \mid Z = 0, D = d)$. The ‘direct effect’ of D on Y is given by $\beta_2^s = E(Y \mid Z = z, D = d + 1) - E(Y \mid Z = z, D = d)$. The ‘total effect’ is $\tau^s \equiv \gamma_2^s + \gamma_1^s \beta_2^s$.

By virtue of (33), $\gamma_1^s = \mathrm{ITT}_D$ and $\tau^s = \mathrm{ITT}_Y$ (Holland, 1988). However, neither $\gamma_2^s$ nor $\beta_2^s$ should generally be given a causal interpretation. To illustrate, consider $E(Y \mid Z = z, D = d) = E(Y_{zd}(z, D_z(z)) \mid Z = z, D_z(z) = d) = E(Y_{zd}(z, D_z(z)) \mid D_z(z) = d)$, where the last equality follows from (33). This gives $\gamma_2^s = E(Y_{zd}(1, D_z(1)) \mid D_z(1) = d) - E(Y_{zd}(0, D_z(0)) \mid D_z(0) = d)$. Because subjects with $D_z(0) = d$ are not the same subjects as those with $D_z(1) = d$, unless the unit effects of Z on D are 0, $\gamma_2^s$ is a descriptive parameter comparing subjects across different subpopulations. Similar remarks apply to $\beta_2^s$.

It is also easy to see from the above that $\gamma_2^s = \gamma_2^c$ if $E(Y_{zd}(z, D_z(z)) \mid D_z(z) = d) = E(Y_{zd}(z, D_z(z)))$; a sufficient condition for this to hold is:

\[ Y_z(z, D_z(z)) \perp\!\!\!\perp D_z(z). \tag{40} \]

Similarly, $\beta_2^c = \beta_2^s$ under this condition. Results along these lines are reported in Eggleston, Scharfstein, Munoz and West (2006), Sobel (2008) and Ten Have, Joffe, Lynch, Brown and Maisto (2005). Unfortunately, this condition is unlikely to be met in applications, as it requires the intermediate outcome D to be ignorable with respect to Y, as if D had been randomized.

Holland (1988) showed that if (33) holds, the exclusion restriction $Y_{zdi}(1, d) - Y_{zdi}(0, d) = 0$ for all i holds, $\mathrm{ITT}_D \neq 0$, and the other unit effects $D_{zi}(1) - D_{zi}(0)$ and $Y_{zdi}(z, d) - Y_{zdi}(z, d')$ are constant for all i, $\beta_2^c$ is equal to the IV estimand (34). Unfortunately, the assumption that the effects are constant is even more implausible in the kinds of studies typically carried out in the behavioral and medical sciences than the assumptions needed to justify using structural equation models.

Sobel (2008) relaxes the assumption of constant effects, assuming instead:

\[ E(\varepsilon_{2zd}^c(1, D_z(1)) - \varepsilon_{2zd}^c(0, D_z(0))) = 0. \tag{41} \]

Under (41), (33), the exclusion restriction $\gamma_2^c = 0$, and the assumption $\gamma_1^c \neq 0$, $\beta_2^c$ equals the IV estimand (34). Further, assumption (41) is also weaker than the assumption (40) needed to justify using structural equation models.

The results above can be extended to the case where there are multiple instruments and multiple mediators. The results can also be extended to the case where compliance is an intermediate outcome prior to the mediating variable (Sobel, 2008) to obtain a complier average effect of the continuous mediator D on Y; this is the effect of the continuous mediator D on Y within the principal stratum (of the binary outcome denoting whether or not treatment is taken) composed of the compliers. Principal stratification itself can also be used to approach the problem of

estimating the effect of $D$ on $Y$ (Jo, 2008); here the idea would be to consider the effects of $D$ on $Y$ within strata defined by the pair of values of $(D_z(0), D_z(1))$.

DISCUSSION

In the last three decades, statisticians have generated a literature on causal inference that formally expresses the idea that causal relationships sustain counterfactual conditional statements. The potential outcomes notation allows causal estimands to be defined independently of the expected values of estimators. Thus, one can assess and give conditions (e.g., ignorability) under which estimators commonly employed actually estimate causal parameters. Prior to this, researchers estimated descriptive parameters and verbally argued these were causal based on other considerations, such as model specification, a practice that led workers in many disciplines, e.g., sociology and psychology, to interpret just about any parameter from a regression or structural equation model as a causal effect.

While this literature has led most researchers to a better understanding that a good study design (especially a randomized study) leads to more credible estimation of causal parameters than approaches using observational studies in conjunction with many unverifiable substantive assumptions, it is also important to remember the old lesson (Campbell and Stanley, 1963) that randomized studies do not always estimate parameters that are generalizable to the desired population. This is especially true for natural experiments, where the investigator has no control over the experiment, although the randomization assumption is plausible.

This literature has also led to clarification of existing procedures. In the process, new challenges have been generated. For example, while this literature reveals that the framework psychologists have been using for 25 years to study mediation is seriously flawed, as of yet, this literature cannot give adequate expression to and/or indicate how to assess the substantive theories that investigators have about the manner in which a treatment package may work through multiple mediators and the causal relationships among these mediators. These and many other issues are in need of much further work.

REFERENCES

Abadie, A. and Imbens, G. (2006) ‘Large sample properties of matching estimators for average treatment effects’, Econometrica, 74: 235–267.

Abadie, A., Angrist, J. and Imbens, G. (2002) ‘Instrumental variables estimation of quantile treatment effects’, Econometrica, 70: 91–117.

Angrist, J.D., Imbens, G.W. and Rubin, D.B. (1996) ‘Identification of causal effects using instrumental variables’, (with discussion) Journal of the American Statistical Association, 91: 444–472.

Baron, R.M. and Kenny, D.A. (1986) ‘The moderator–mediator variable distinction in social psychological research: Conceptual, strategic and statistical considerations’, Journal of Personality and Social Psychology, 51: 1173–1182.

Belson, W.A. (1956) ‘A technique for studying the effects of a television broadcast’, Applied Statistics, 5: 195–202.

Björklund, A. and Moffitt, R. (1987) ‘The estimation of wage gains and welfare gains in self-selection models’, The Review of Economics and Statistics, 69: 42–49.

Bloom, H.S. (1984) ‘Accounting for no-shows in experimental evaluation designs’, Evaluation Review, 8: 225–246.

Bunge, M.A. (1979) Causality and Modern Science (3rd edn.). New York: Dover.

Campbell, D.T. and Stanley, J.C. (1963) Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally.

Cassel, C.M., Särndal, C.E. and Wretman, J.H. (1976) ‘Some results on generalized difference estimation and generalized regression estimation for finite populations’, Biometrika, 63: 615–620.

Cassel, C.M., Särndal, C.E. and Wretman, J.H. (1977) Foundations of Inference in Survey Sampling. New York: Wiley.

Cochran, W.G. (1968) ‘The effectiveness of adjustment by subclassification in removing bias in observational studies’, Biometrics, 24: 205–213.

Cox, D.R. (1958) The Planning of Experiments. New York: John Wiley.

Crepinsek, M.K., Singh, A., Bernstein, L.S. and McLaughlin, J.E. (2006) ‘Dietary effects of universal-free school breakfast: findings from the evaluation of the school breakfast program pilot project’, Journal of the American Dietetic Association, 106: 1796–1803.

Dehejia, R.H. and Wahba, S. (1999) ‘Causal effects in nonexperimental studies: reevaluating the evaluation of training programs’, Journal of the American Statistical Association, 94: 1053–1062.

Doksum, K. (1974) ‘Empirical probability plots and statistical inference for nonlinear models in the two-sample case’, Annals of Statistics, 2: 267–277.

Drake, C. (1993) ‘Effects of misspecification of the propensity score on estimators of treatment effect’, Biometrics, 49: 1231–1236.

Eggleston, B., Scharfstein, D., Munoz, B. and West, S. (2006) ‘Investigation mediation when counterfactuals are well-defined: does sunlight exposure mediate the effect of eye-glasses on cataracts?’. Unpublished manuscript, Johns Hopkins University.

Finkelstein, M.O., Levin, B. and Robbins, H. (1996) ‘Clinical and prophylactic trials with assured new treatment for those at greater risk: II. examples’, American Journal of Public Health, 86: 696–702.

Fisher, R.A. (1925) Statistical Methods for Research Workers. London: Oliver and Boyd.

Frangakis, C.E. and Rubin, D.B. (2002) ‘Principal stratification in causal inference’, Biometrics, 58: 21–29.

Gitelman, A.I. (2005) ‘Estimating causal effects from multilevel group-allocation data’, Journal of Educational and Behavioral Statistics, 30: 397–412.

Granger, C.W. (1969) ‘Investigating causal relationships by econometric models and cross-spectral methods’, Econometrica, 37: 424–438.

Gu, X.S. and Rosenbaum, P.R. (1993) ‘Comparison of multivariate matching methods: structures, distances and algorithms’, Journal of Computational and Graphical Statistics, 2: 405–420.

Halloran, M.E. and Struchiner, C.J. (1995) ‘Causal inference in infectious diseases’, Epidemiology, 6: 142–151.

Harre, R. and Madden, E.H. (1975) Causal Powers: A Theory of Natural Necessity. Oxford: Basil Blackwell.

Hill, J.L. and McCulloch, R.E. (2007) ‘Bayesian nonparametric modeling for causal inference’. Unpublished manuscript, Columbia University.

Hirano, K., Imbens, G.W., Rubin, D.B. and Zhou, X. (2000) ‘Assessing the effect of an influenza vaccine in an encouragement design with covariates’, Biostatistics, 1: 69–88.

Hirano, K., Imbens, G.W. and Ridder, G. (2003) ‘Efficient estimation of average treatment effects using the estimated propensity score’, Econometrica, 71: 1161–1189.

Holland, P.W. (1988) ‘Causal inference, path analysis, and recursive structural equation models’, (with discussion) in Clogg, C.C. (ed.), Sociological Methodology. Washington, D.C.: American Sociological Association. pp. 449–493.

Horvitz, D.G. and Thompson, D.J. (1952) ‘A generalization of sampling without replacement from a finite universe’, Journal of the American Statistical Association, 47: 663–685.

Imai, K. and van Dyk, D.A. (2004) ‘Causal inference with general treatment regimes: generalizing the propensity score’, Journal of the American Statistical Association, 99: 854–866.

Imbens, G.W. (2000) ‘The role of the propensity score in estimating dose-response functions’, Biometrika, 87: 706–710.

Imbens, G.W. (2004) ‘Nonparametric estimation of average treatment effects under exogeneity: a review’, Review of Economics and Statistics, 86: 4–29.

Imbens, G.W. and Angrist, J.D. (1994) ‘Identification and estimation of local average treatment effects’, Econometrica, 62: 467–475.

Imbens, G.W. and Rubin, D.B. (1997) ‘Estimating outcome distributions for compliers in instrumental variables models’, Review of Economic Studies, 64: 555–574.

Jo, B. (2002) ‘Estimation of intervention effects with noncompliance: Alternative model specifications’, (with discussion) Journal of Educational and Behavioral Statistics, 27: 385–415.

Jo, B. (2008) ‘Causal inference in randomized experiments with mediational processes’, Psychological Methods, 13: 314–336.

Joffe, M.M. and Rosenbaum, P.R. (1999) ‘Propensity scores’, American Journal of Epidemiology, 150: 327–333.

Kang, J.D.Y. and Schafer, J.L. (2007) ‘Demystifying double robustness: a comparison of alternative strategies for estimating population means from incomplete data’, Statistical Science, 22: 523–580.

Lehmann, E.L. (1974) Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day.

Lewis, D. (1973) ‘Causation’, Journal of Philosophy, 70: 556–567.

Little, R.J. and Yau, L.H.Y. (1998) ‘Statistical techniques for analyzing data from prevention trials: treatment of no-shows using Rubin’s causal model’, Psychological Methods, 3: 147–159.

Lunceford, J.K. and Davidian, M. (2004) ‘Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study’, Statistics in Medicine, 23: 2937–2960.

MacKinnon, D.P. and Dwyer, J.H. (1993) ‘Estimating mediating effects in prevention studies’, Evaluation Review, 17: 144–158.

Manski, C.F. (1995) Identification Problems in the Social Sciences. Cambridge, MA: Harvard University Press.

Neyman, J. ([1923] 1990) ‘On the application of probability theory to agricultural experiments. Essay on principles. Section 9’, (with discussion) Statistical Science, 4: 465–480.

Reichenbach, H. (1956) The Direction of Time. Berkeley: University of California Press.

Robins, J.M. (1989) ‘The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies’, in Sechrest, L., Freedman, H. and Mulley, A. (eds.), Health Services Research Methodology: A Focus on AIDS. Rockville, MD: US Department of Health and Human Services. pp. 113–159.

Robins, J.M. and Rotnitzky, A. (1995) ‘Semiparametric efficiency in multivariate regression models with missing data’, Journal of the American Statistical Association, 90: 122–129.

Rosenbaum, P.R. (1987) ‘The role of a second control group in an observational study’, Statistical Science, 2: 292–316.

Rosenbaum, P.R. (2002) Observational Studies. New York: Springer-Verlag.

Rosenbaum, P.R. and Rubin, D.B. (1983) ‘The central role of the propensity score in observational studies for causal effects’, Biometrika, 70: 41–55.

Rubin, D.B. (1974) ‘Estimating causal effects of treatments in randomized and nonrandomized studies’, Journal of Educational Psychology, 66: 688–701.

Rubin, D.B. (1977) ‘Assignment to treatment groups on the basis of a covariate’, Journal of Educational Statistics, 2: 1–26.

Rubin, D.B. (1978) ‘Bayesian inference for causal effects: the role of randomization’, The Annals of Statistics, 6: 34–58.

Rubin, D.B. (1980) ‘Comment on “randomization analysis of experimental data: the Fisher randomization test” by D. Basu’, Journal of the American Statistical Association, 75: 591–593.

Schafer, J.L. and Kang, J.D.Y. (2007) ‘Average causal effects from observational studies: a practical guide and simulated example’. Unpublished manuscript, Pennsylvania State University.

Simon, H.A. (1954) ‘Spurious correlation: a causal interpretation’, Journal of the American Statistical Association, 49: 467–492.

Sobel, M.E. (1995) ‘Causal inference in the social and behavioral sciences’, in Arminger, G., Clogg, C.C. and Sobel, M.E. (eds.), Handbook of Statistical Modeling for the Social and Behavioral Sciences. New York: Plenum. pp. 1–38.

Sobel, M.E. (2006a) ‘Spatial concentration and social stratification: does the clustering of disadvantage “beget” bad outcomes?’, in Bowles, S., Durlauf, S.N. and Hoff, K. (eds.), Poverty Traps. New York: Russell Sage Foundation. pp. 204–229.

Sobel, M.E. (2006b) ‘What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference’, Journal of the American Statistical Association, 101: 1398–1407.

Sobel, M.E. (2008) ‘Identification of causal parameters in randomized studies with mediating variables’, Journal of Educational and Behavioral Statistics, 33: 230–251.

Suppes, P. (1970) A Probabilistic Theory of Causality. Amsterdam: North Holland.

TenHave, T.R., Joffe, M.M., Lynch, K.G., Brown, G.K., Maisto, S.A. and Beck, A.T. (2007) ‘Causal mediation analyses with rank preserving models’, Biometrics, 63: 926–934.

TenHave, T.R., Marshall, J., Kevin, L., Brown, G. and Maisto, S. (2005) Causal Mediation Analysis with Structural Mean Models. University of Pennsylvania Biostatistics, Working Paper.

Thistlethwaite, D.L. and Campbell, D.T. (1960) ‘Regression-discontinuity analysis: an alternative to the ex post facto experiment’, Journal of Educational Psychology, 51: 309–317.

Yule, G.U. (1896) ‘On the correlation of total pauperism with proportion of out-relief ii: males over 65’, Economic Journal, 6: 613–623.
