
LECTURE NOTE ON

Treatment Effects Analysis


YU-WEI HSIEH
New York University
First Draft: Sep 1, 2009
© Yu-Wei Hsieh (all rights reserved).
E-mail: yuwei.hsieh@nyu.edu
Contents
1 Introduction to Average Treatment Effects
1.1 Rubin's Statistical Causality Model
1.2 Selection Bias
1.3 Identification and Estimation under Exogeneity
1.3.1 Regression
1.3.2 Matching
1.3.3 OLS vs. Matching
1.4 Propensity Score
1.4.1 Specification Testing of the Propensity Score
1.4.2 Regression as a Dimension Reduction Method
1.4.3 Propensity Score Weighting Estimator
2 Quantile Treatment Effects
2.1 Quantile Regression
2.2 Weighting Estimator
3 Instrumental Variables I: Local Treatment Effects
3.1 Instrumental Variable: A Review
3.2 Restrictions on the Selection Process: Local Treatment Effects
3.3 Case Studies
3.3.1 Vietnam Draft Lottery
3.3.2 Randomized Eligibility Design
3.4 Other Identified Features
3.5 Non-binary Treatments and Instruments
3.6 Nonparametric Estimation of LTE with Covariate
3.7 Parametric and Semiparametric Estimation of LTE with Covariate
4 Difference-in-Difference Designs
4.1 Linear Difference-in-Difference Model
4.2 Nonparametric Difference-in-Difference Models
4.3 Nonlinear Difference-in-Difference
4.4 The Change-in-Change Model
4.5 Quantile Difference-in-Difference
5 Nonparametric Bounding Approaches
5.1 No-assumption Bound
5.2
5.3 Restrictions on the Moment Conditions: Monotone Instrumental Variables
5.4 Monotone Treatment Selection
5.5 Shape Restrictions on the Response Functions: Monotone Treatment Response
5.6 Restrictions on the Selection Mechanism
5.7 Some Remarks
References
1 Introduction to Average Treatment Effects
Suppose the national health insurance company launches a new reimbursement scheme for physicians, called "quality payment", that links their salary to patients' health outcomes. If patients' health improves, the insurance company pays physicians an extra bonus. By giving physicians proper financial incentives, the scheme can encourage them to treat patients more carefully, leading to a higher cure rate. In this case, the new payment program is called a treatment, and the cure rate is called the response. The group of subjects receiving the treatment is called the treatment group, while the group receiving the control treatment (typically, no treatment) is called the control or comparison group. We want to know whether a treatment has an impact on the response variable and, if so, whether the effect is positive or negative. In this section we introduce the statistical causality model proposed by Rubin (1974) to quantify the effects of a treatment. More discussion of this model can be found in Holland (1986).
1.1 Rubin's Statistical Causality Model
For subject $i$, the Bernoulli random variable $Y_{1i}$ indicates whether the patient is cured if he goes to a hospital that participates in the quality payment program (treatment group), and $Y_{0i}$ indicates whether he is cured if he goes to a hospital that does not participate in the program. We call $Y_{1i}$ and $Y_{0i}$ the potential responses. Intuitively, the individual treatment effect on subject $i$ can be defined as $Y_{1i} - Y_{0i}$. If $Y_{1i} - Y_{0i} > 0$, the treatment has a positive effect on the subject's health status: the subject recovers from the illness only under treatment. Let $D_i = 1$ if subject $i$ is in the treatment group, and $D_i = 0$ if subject $i$ is in the control group. The treatment indicator $D_i$ is called the observed treatment, indicating whether unit $i$ receives treatment or not. Define the observed response $Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i}$. We observe $Y_{1i}$ if unit $i$ is in the treatment group ($D_i = 1$), and we observe $Y_{0i}$ if unit $i$ is in the control group ($D_i = 0$). However, unit $i$ cannot be assigned to the treatment and control groups at the same time; it can only be treated or untreated at a specific time. Because we can observe only one of $Y_{1i}$ and $Y_{0i}$, we face a missing data problem in which half of the potential responses are never observed. Therefore, it is impossible to identify the individual effect. The unobserved outcome is termed the counterfactual outcome. For example, if we observe $Y_{1i}$, then $Y_{0i}$ is the counterfactual.
Sometimes it is cumbersome for policy makers to learn the individual effect for each subject. Instead, we are interested in summary statistics, such as the average treatment effect:
$$\mathrm{ATE} = E[Y_1 - Y_0].$$
ATE not only describes the average treatment effect on subjects, but also turns the impossible-to-identify individual effect into a statistical problem that can be estimated. This is because we can use information contained in the sampling process to learn the average effect without knowing the individual effect for each unit. However, the missing data problem raises an identification problem for ATE, because we want to learn features of $(Y_1, Y_0, D)$ but only $(Y, D)$ is observed. In this lecture note several identification strategies are discussed under different assumptions on selection mechanisms, exclusion restrictions, sources of exogenous variation, functional form restrictions, and heterogeneity. The mechanism by which units are selected into the treatment group lies at the heart of treatment effect analysis. We now introduce an identification condition for ATE. Suppose the treatment assignment $D$ satisfies:
Assumption 1.1 $(Y_1, Y_0) \perp D$, where $\perp$ denotes independence.
Assumption 1.1 is an exogeneity condition. For example, a randomized experiment automatically satisfies it. Another example is a new government program in which some people are required to participate. Since these people cannot choose whether to join the program, $D$ is independent of $(Y_1, Y_0)$. This situation is termed a quasi-experiment or natural experiment in the literature. Under this assumption, ATE is identified by the group mean difference $E[Y \mid D = 1] - E[Y \mid D = 0]$, where $E[Y \mid D = 1]$ is the mean of the treatment group and $E[Y \mid D = 0]$ is the mean of the control group. Since $Y = DY_1 + (1 - D)Y_0$, we have:
$$E[Y \mid D = 1] - E[Y \mid D = 0] = E[Y_1 \mid D = 1] - E[Y_0 \mid D = 0],$$
and by Assumption 1.1,
$$E[Y_1 \mid D = 1] - E[Y_0 \mid D = 0] = E[Y_1] - E[Y_0] = E[Y_1 - Y_0] = \mathrm{ATE}. \tag{1}$$
Here is an important implication of Assumption 1.1. Since $(Y_1, Y_0) \perp D$, we have $E[Y_0] = E[Y_0 \mid D = 0] = E[Y_0 \mid D = 1]$. $E[Y_0 \mid D = 1]$ is the average response of units in the treatment group had they not been treated. However, one can only observe $Y_1$ in the treatment group; therefore $E[Y_0 \mid D = 1]$ is counterfactual. Under Assumption 1.1, we can use the observable $E[Y_0 \mid D = 0]$ to impute the counterfactual $E[Y_0 \mid D = 1]$. This is because Assumption 1.1 guarantees that the treatment group and the control group are similar, so that they can be compared with each other. We can use the information in the control group to impute the counterfactual $Y_0$ of the treatment group, and the information in the treatment group to impute the counterfactual $Y_1$ of the control group. In fact, a weaker condition is sufficient to identify ATE:
Assumption 1.2 (mean independence)
$$E[Y_0 \mid D] = E[Y_0], \quad \text{and} \quad E[Y_1 \mid D] = E[Y_1].$$
When the parameter of interest is a quantile treatment effect, or when we want to estimate the asymptotic variance of an ATE estimator, a weaker condition like Assumption 1.2 is not enough. Therefore, throughout this lecture note we impose the stronger conditions, even though weaker conditions suffice for identification. Note that in general $E[Y_0 \mid D = 0]$ need not equal $E[Y_0 \mid D = 1]$; that is, the control group need not be a good proxy for the treatment group. The identification condition thus restricts the mechanism by which units are selected into the treatment group, and it determines whether the experiment is statistically sound. The guiding principle of treatment effect analysis is to find conditions under which the treatment group and the control group are similar, so that we compare the comparable. More specifically, comparing the comparable means using an imputation procedure for the counterfactual that also removes the selection bias. Almost all estimators discussed in this note implement this principle.
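As a quick numerical illustration of identification under Assumption 1.1, the sketch below simulates a randomized experiment and compares the group mean difference with the average of $Y_1 - Y_0$ in the simulated sample. The cure probabilities (0.5 without the program, 0.7 with it), the sample size, and all function names are hypothetical choices made for this illustration only.

```python
import random
from statistics import mean

def simulate_randomized(n=20000, seed=0):
    """Draw potential outcomes and assign treatment by a fair coin flip."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y0 = 1 if rng.random() < 0.5 else 0   # cured without the program (assumed rate)
        y1 = 1 if rng.random() < 0.7 else 0   # cured with the program (assumed rate)
        d = 1 if rng.random() < 0.5 else 0    # D is drawn independently of (Y1, Y0)
        y = d * y1 + (1 - d) * y0             # observed response Y = D*Y1 + (1-D)*Y0
        rows.append((y1, y0, d, y))
    return rows

rows = simulate_randomized()
# Average of Y1 - Y0: only computable because the simulation keeps both potential outcomes.
true_ate = mean(y1 - y0 for y1, y0, d, y in rows)
# Group mean difference: uses only the observable pair (Y, D).
diff = (mean(y for _, _, d, y in rows if d == 1)
        - mean(y for _, _, d, y in rows if d == 0))
print(round(true_ate, 3), round(diff, 3))
```

In real data only `diff` can be computed; the simulation keeps both potential outcomes so that the two quantities can be compared directly.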
1.2 Selection Bias
However, in most social science studies we can hardly obtain data generated by randomized experiments. Even in the natural sciences, a random sample is sometimes unavailable. For example, suppose we want to study whether a certain virus is lethal. It is unethical and illegal to conduct a randomized experiment that infects some people and then studies the mortality rate. Instead of collecting data from a randomized experiment controlled by an experimenter, in many cases we conduct a nonrandomized observational study. The challenge of an observational study is that the treatment assignment $D_i$ may depend on other factors that also affect the response variable. Moreover, $D_i$ may be endogenous as well. Both situations invalidate the independence assumption. In our example, hospitals choose whether to participate in the program. Therefore $D_i$ is not randomly assigned, and Assumption 1.1 may be violated. To see this, suppose $D_i$ depends on the scale of the hospital, $X$: big hospitals, such as teaching hospitals, are more likely to participate in the program. One reason is that the insurance company asks these hospitals to participate. Another is cost: providing better health care services raises costs, so participation may not be profitable for every hospital, whereas big hospitals may provide health care services in a more cost-effective manner. Therefore, big hospitals are more likely to join the program. In other words, $D_i$ is a function of $X$. However, the scale of the hospital may also affect patients' health outcomes, so $Y_i$ is also a function of $X$. Since $X$ is a common factor of $D_i$ and $Y_i$, the condition $(Y_1, Y_0) \perp D$ is violated. $X$ is termed a confounder, covariate, pretreatment variable, or exogenous variable. If the confounding effect is not controlled for, we may find a positive ATE when we use the group mean difference alone to estimate it. The idea is as follows. Big hospitals have higher cure rates and a higher probability of participating in the program. One possible scenario is that the program has virtually no effect on patients' health outcomes, yet patients treated in treatment group hospitals have better health outcomes simply because there are more big hospitals in the treatment group. The treatment and control groups are not comparable if the confounders are not controlled for.
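The hospital story can be mimicked with a small simulation. In the sketch below the program has no effect at all ($Y_1 = Y_0$ for everyone), yet the naive group mean difference is clearly positive because big hospitals both join more often and cure more often. All probabilities (0.8 and 0.2 for joining, 0.7 and 0.4 for being cured) are hypothetical numbers chosen for illustration.

```python
import random
from statistics import mean

def simulate_confounded(n=40000, seed=1):
    """Hospital size X drives both participation D and the cure rate; the program does nothing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0                    # 1 = big hospital
        d = 1 if rng.random() < (0.8 if x else 0.2) else 0    # big hospitals join more often
        y0 = 1 if rng.random() < (0.7 if x else 0.4) else 0   # big hospitals cure more often
        y1 = y0                                               # zero treatment effect by construction
        rows.append((x, d, d * y1 + (1 - d) * y0))
    return rows

def mean_diff(rows):
    """Treated-minus-control difference in mean observed response."""
    return (mean(y for _, d, y in rows if d == 1)
            - mean(y for _, d, y in rows if d == 0))

rows = simulate_confounded()
naive = mean_diff(rows)                                        # contaminated by overt bias
within = {x: mean_diff([r for r in rows if r[0] == x]) for x in (0, 1)}
print(round(naive, 3), {x: round(v, 3) for x, v in within.items()})
```

Comparing hospitals of the same size (`within`) gives differences close to zero, which is the true effect in this simulation, whereas `naive` is close to 0.18 under the assumed probabilities.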
The bias created by not controlling for observable confounders is classified as overt bias. Overt bias is analogous to the omitted variable problem in regression analysis. Overt bias concerns covariates that are observable to econometricians, while hidden bias concerns unobservable covariates. Hidden bias is also known as the endogeneity problem. For example, the genetic characteristics of patients or the talent of workers can be viewed as unobservable covariates. Another source of hidden bias is self-selection into treatment, known as self-selection bias. For example, suppose the government launches a job training program for workers. The decision whether to join the program may be related to the benefit $Y_1 - Y_0$. This case also renders Assumption 1.1 implausible. When these biases are present, the group mean difference no longer identifies ATE. However, as long as we can find a way to control for the selection biases, ATE can still be identified. Typically, this involves understanding the data generating process of $D_i$. We discuss identification and estimation under exogeneity in the following section. The issues of hidden bias and self-selection are deferred to the sections on repeated observations, instrumental variables, and control functions.
1.3 Identification and Estimation under Exogeneity
In this section we introduce the identification and estimation of ATE when there is only overt bias. The framework of this section mainly follows Imbens (2004). Moreover, both Wooldridge (2001) and M.-J. Lee (2005) are excellent references on this topic. We will show how to remove overt bias and then identify ATE under mild exogeneity conditions. First we impose two key assumptions on the treatment assignment:
Assumption 1.3 (unconfoundedness)
$$(Y_1, Y_0) \perp D \mid X,$$
and
Assumption 1.4 (overlap or support condition)
$$0 < \mathrm{prob}(D = 1 \mid X) < 1.$$
Assumption 1.3 thus modifies Assumption 1.1, the independence assumption, into a conditional independence assumption. It means that all overt biases can be removed by conditioning on the vector of all observable confounders. Intuitively, after controlling for $X$, $D$ behaves like a random assignment. Therefore, subgroups of the treatment group and the control group defined by the same covariate values are comparable. Unconfoundedness is also known as ignorable treatment assignment (Rosenbaum and Rubin, 1983), conditional independence (Lechner, 1999), and selection on observables (Barnow, Cain, and Goldberger, 1980). Assumption 1.4 guarantees that the treatment and control groups can be compared at every value of $X$. For example, suppose $X = 1$ stands for male subjects. Then $\mathrm{prob}(D = 1 \mid X = 1) = 1$ means that all male subjects receive treatment; there are no male subjects in the control group. If gender influences the potential responses, one way to control for the gender effect is to compare male subjects who receive treatment with male subjects who do not, and female subjects who receive treatment with female subjects who do not. But when all male subjects are in the treatment group, we cannot control for the gender effect, and the two groups are not comparable. Next, we define some notation:
Definition 1.1 (conditional mean and conditional variance)
$$\mu(x, d) = E[Y \mid X = x, D = d], \qquad \mu_d(x) = E[Y_d \mid X = x],$$
$$\sigma^2(x, d) = V(Y \mid X = x, D = d), \qquad \sigma^2_d(x) = V(Y_d \mid X = x).$$
Under Assumption 1.3, $\mu(x, d) = \mu_d(x)$ and $\sigma^2(x, d) = \sigma^2_d(x)$. We now introduce the OLS approach to estimating ATE, and then move to recent advances in nonparametric and semiparametric methods.
1.3.1 Regression
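First we postulate the constant effect model: $Y_{1i} - Y_{0i} = \tau$ for all $i$, and assume $Y_{0i} = \alpha + \beta' X_i + \varepsilon_i$. Then we have:
$$Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i} = Y_{0i} + D_i (Y_{1i} - Y_{0i}) = \alpha + \tau D_i + \beta' X_i + \varepsilon_i,$$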
which is nothing but a dummy variable regression. It is easy to see that the OLS estimator $\hat{\tau}$ is an estimator for ATE. Since $X_i$ is included in the regression function, we have controlled for the confounding effect. This setting also highlights the relationship between the unconfoundedness assumption and the exogeneity assumption in regression analysis: in this model, Assumption 1.3 is equivalent to $D \perp \varepsilon \mid X$, which characterizes the exogeneity of $D_i$. Next, we use a more general setting to study regression-based ATE estimators. Under Assumption 1.3, we have:
$$\begin{aligned} E[Y \mid D = 1, X] - E[Y \mid D = 0, X] &= E[Y_1 \mid D = 1, X] - E[Y_0 \mid D = 0, X] \\ &= E[Y_1 \mid X] - E[Y_0 \mid X] = E[Y_1 - Y_0 \mid X] \equiv \tau(X). \end{aligned} \tag{2}$$
By using the conditional group mean difference $E[Y \mid D = 1, X] - E[Y \mid D = 0, X]$, we identify the conditional ATE, $\tau(X)$. Taking the expectation with respect to the distribution of $X$, we can identify ATE:
$$E[Y_1 - Y_0] = E\big\{ E[Y_1 - Y_0 \mid X] \big\} = E[\tau(X)]. \tag{3}$$
The corresponding sample counterpart is given by:
$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right]. \tag{4}$$
Here $\hat{\mu}_1(X_i)$ and $\hat{\mu}_0(X_i)$ are the estimators of $E[Y \mid D = 1, X]$ and $E[Y \mid D = 0, X]$ evaluated at $X_i$, respectively. In (4), the difference of the two estimated conditional mean functions estimates $\tau(X)$, while $E[\tau(X)]$ is estimated by averaging $\hat{\tau}(X_i)$ over the empirical distribution of $X$. From this expression, the estimation problem of ATE can be viewed as the estimation problem of the conditional mean function $E[Y \mid D, X]$. Suppose $\mu_d(x)$ is linear in the treatment assignment and covariates:
$$\mu_d(x) = \alpha + \tau d + \beta' x,$$
then the corresponding dummy variable regression is given by:
$$Y_i = \alpha + \beta' X_i + \tau D_i + \varepsilon_i.$$
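The OLS estimator $\hat{\tau}$ estimates ATE:
$$\frac{1}{N} \sum_{i=1}^{N} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \right] = \frac{1}{N} \sum_{i=1}^{N} \left[ (\hat{\alpha} + \hat{\tau} + \hat{\beta}' X_i) - (\hat{\alpha} + \hat{\beta}' X_i) \right] = \hat{\tau}.$$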
Instead of specifying a dummy variable regression, we can also estimate two separate regression functions:
$$\mu_1(x) = \alpha_1 + \beta_1' x, \quad \text{if } D_i = 1,$$
$$\mu_0(x) = \alpha_0 + \beta_0' x, \quad \text{if } D_i = 0.$$
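One property of the OLS estimator (when the regression includes an intercept) is that the average of the predicted responses equals the average of the observed responses:
$$\sum_i D_i \hat{\mu}_1(X_i) = \sum_i D_i Y_i, \quad \text{and} \quad \sum_i (1 - D_i) \hat{\mu}_0(X_i) = \sum_i (1 - D_i) Y_i.$$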
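Plugging this algebraic property into (4), $\hat{\tau}$ can be decomposed into:
$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \Big\{ D_i \left[ Y_i - \hat{\mu}_0(X_i) \right] + (1 - D_i) \left[ \hat{\mu}_1(X_i) - Y_i \right] \Big\}. \tag{5}$$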
Many ATE estimators have the above representation, which has a nice interpretation. For instance, if unit $i$ receives treatment ($D_i = 1$), we calculate $Y_i - \hat{\mu}_0(X_i)$, where $Y_i = Y_{1i}$. Since $Y_{1i}$ is observed, the remaining task is to impute the counterfactual $Y_{0i}$, and we impute it by $\hat{\mu}_0(X_i)$. Here $\hat{\mu}_0(X_i)$ estimates the average response that units with covariate value $X_i$ would have had if they had not been treated.
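The equivalence between the regression form (4) and the imputation form (5) can be checked numerically. The sketch below fits two separate simple regressions (one per group, each with an intercept) on simulated data with a single covariate; the data generating process, its coefficients, and the helper `fit_line` are hypothetical and serve only as an illustration.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Simple OLS of y on x with an intercept; returns (intercept, slope)."""
    xbar, ybar = mean(xs), mean(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope

rng = random.Random(2)
data = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    d = 1 if rng.random() < (0.7 if x > 0 else 0.3) else 0   # selection depends on x only
    y = 1.0 + 2.0 * d + 1.5 * x + rng.gauss(0, 1)            # assumed true effect: 2
    data.append((x, d, y))

# Separate regressions for the treated and the control sample.
a1, b1 = fit_line([x for x, d, y in data if d == 1], [y for x, d, y in data if d == 1])
a0, b0 = fit_line([x for x, d, y in data if d == 0], [y for x, d, y in data if d == 0])

def mu1(x):
    return a1 + b1 * x

def mu0(x):
    return a0 + b0 * x

# Form (4): average difference of the two fitted regression functions.
tau_eq4 = mean(mu1(x) - mu0(x) for x, d, y in data)
# Form (5): keep the observed response, impute only the counterfactual.
tau_eq5 = mean((y - mu0(x)) if d == 1 else (mu1(x) - y) for x, d, y in data)
print(tau_eq4, tau_eq5)
```

The two numbers agree up to floating-point error because the OLS residuals sum to zero within each group, which is exactly the algebraic property used in the text.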
Besides OLS, there is a vast literature on estimating conditional mean functions. Since OLS may suffer from model misspecification, current research pays much more attention to nonparametric and semiparametric models for ATE. Hahn (1998) derives the efficiency bound for ATE and proposes an efficient estimator for ATE based on series estimation (see Pagan and Ullah (2005) for an introduction). Heckman, Ichimura, and Todd (1997) focus on the kernel regression approach. We outline the estimator proposed by Heckman et al. Suppose $X$ is one-dimensional (the method generalizes to the multi-dimensional case); the kernel estimator for $\mu_d(x)$ is given by:
$$\hat{\mu}_d(x) = \frac{\sum_{i: D_i = d} Y_i K((X_i - x)/h)}{\sum_{i: D_i = d} K((X_i - x)/h)},$$
where $K(\cdot)$ is the kernel function and $h$ is the bandwidth, which jointly determine the weighting scheme. To ensure consistency, $h$ should shrink to zero as the sample size $N$ increases, but slowly enough that $Nh \to \infty$. Let $T$ denote the treatment group and $C$ the control group. The estimator can also be written in the form of (5):
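$$\hat{\tau} = \frac{1}{N} \sum_{i \in T} \left[ Y_i - \frac{\sum_{j \in C} K((X_j - X_i)/h) Y_j}{\sum_{j \in C} K((X_j - X_i)/h)} \right] + \frac{1}{N} \sum_{j \in C} \left[ \frac{\sum_{i \in T} K((X_j - X_i)/h) Y_i}{\sum_{i \in T} K((X_j - X_i)/h)} - Y_j \right]. \tag{6}$$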
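A minimal sketch of the kernel-based estimator in (6) follows. It uses a Gaussian kernel (an assumption made here so that the denominators are never zero; it is not the kernel used by Heckman et al.), a fixed bandwidth `h = 0.3`, and a hypothetical data generating process with a nonlinear conditional mean and a true effect of 1.

```python
import math
import random
from statistics import mean

def gaussian_kernel(u):
    """Unnormalized Gaussian kernel; the normalizing constant cancels in the ratio."""
    return math.exp(-0.5 * u * u)

def kernel_mean(x, sample, h, kernel=gaussian_kernel):
    """Kernel-weighted average of Y_j over (X_j, Y_j) pairs in `sample`, evaluated at x."""
    weights = [kernel((xj - x) / h) for xj, _ in sample]
    return sum(w * yj for w, (_, yj) in zip(weights, sample)) / sum(weights)

def kernel_ate(data, h):
    """Estimator (6): impute each unit's counterfactual from the opposite group."""
    treated = [(x, y) for x, d, y in data if d == 1]
    control = [(x, y) for x, d, y in data if d == 0]
    terms = []
    for x, d, y in data:
        if d == 1:
            terms.append(y - kernel_mean(x, control, h))   # Y_i minus imputed Y_0i
        else:
            terms.append(kernel_mean(x, treated, h) - y)   # imputed Y_1j minus Y_j
    return mean(terms)

rng = random.Random(4)
data = []
for _ in range(800):
    x = rng.uniform(-2, 2)
    d = 1 if rng.random() < 0.3 + 0.1 * (x + 2) else 0     # selection depends on x only
    y = math.sin(x) + 1.0 * d + rng.gauss(0, 0.5)          # assumed true effect: 1
    data.append((x, d, y))

tau_kernel = kernel_ate(data, h=0.3)
print(round(tau_kernel, 3))
```

No linear functional form is imposed on the conditional mean; the price is the choice of the bandwidth, which is fixed by hand here rather than selected by a data-driven rule.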
1.3.2 Matching
Matching estimates ATE by comparing subjects with similar covariates. Unlike regression, which directly estimates the two conditional mean functions, matching directly implements the principle of comparing the comparable, though it also estimates $E[Y \mid D, X]$ implicitly. Before matching subjects with similar covariates, we must define a criterion for measuring similarity. We first introduce the nearest neighbor matching estimator.
Nearest Neighbor Matching
For unit $i$ in the treatment group, we pick the $M$ units in the control group whose covariates are closest to $X_i$, and then average the responses of these $M$ units to impute the counterfactual of unit $i$. The same procedure applies to units in the control group. Following Abadie and Imbens (2006), we can use the Euclidean norm $\|x\| = (x'x)^{1/2}$ to measure closeness. We can also use $\|x\|_V = (x'Vx)^{1/2}$, where $V$ is a positive definite symmetric matrix. Let $j_m(i)$ be an index $j$ satisfying
$$D_j = 1 - D_i$$
and
$$\sum_{l: D_l = 1 - D_i} 1\big\{ \|X_l - X_i\| \le \|X_j - X_i\| \big\} = m.$$
It indicates the unit in the opposite treatment group that is the $m$-th closest to unit $i$ with respect to the Euclidean norm. Then define $J_M(i)$ as the set of indices of the first $M$ matches for unit $i$:
$$J_M(i) = \{ j_1(i), \ldots, j_M(i) \}.$$
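The imputation procedure is given by:
$$\hat{Y}_{0i} = \begin{cases} Y_i, & \text{if } D_i = 0, \\ \frac{1}{M} \sum_{j \in J_M(i)} Y_j, & \text{if } D_i = 1, \end{cases}$$
and
$$\hat{Y}_{1i} = \begin{cases} \frac{1}{M} \sum_{j \in J_M(i)} Y_j, & \text{if } D_i = 0, \\ Y_i, & \text{if } D_i = 1. \end{cases}$$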
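The nearest neighbor matching estimator for ATE is:
$$\hat{\tau}_M = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{Y}_{1i} - \hat{Y}_{0i} \right) = \frac{1}{N} \sum_{i=1}^{N} \left\{ D_i \left[ Y_i - \frac{1}{M} \sum_{j \in J_M(i)} Y_j \right] + (1 - D_i) \left[ \frac{1}{M} \sum_{j \in J_M(i)} Y_j - Y_i \right] \right\}. \tag{7}$$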
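The estimator in (7) is short enough to be written out directly. The sketch below handles a scalar covariate with the absolute distance; the function name and the toy data are hypothetical, and ties in distance are broken by the order of the observations, which is an implementation choice rather than part of the definition.

```python
from statistics import mean

def nn_matching_ate(data, M=1):
    """Nearest neighbor matching estimate (7); data is a list of (x, d, y) with scalar x."""
    terms = []
    for x_i, d_i, y_i in data:
        # Candidate matches come from the opposite treatment group.
        pool = [(abs(x_j - x_i), y_j) for x_j, d_j, y_j in data if d_j == 1 - d_i]
        pool.sort(key=lambda pair: pair[0])            # closest covariates first
        imputed = mean(y_j for _, y_j in pool[:M])     # average response of the M matches
        terms.append(y_i - imputed if d_i == 1 else imputed - y_i)
    return mean(terms)

# Toy data: (covariate, treatment indicator, observed response).
toy = [(0.0, 1, 5.0), (0.1, 0, 3.0), (1.0, 0, 10.0), (0.9, 1, 11.0)]
tau_m1 = nn_matching_ate(toy, M=1)
print(tau_m1)
```

With `M=1`, the two units near $x = 0$ are matched to each other (difference 2), as are the two units near $x = 1$ (difference 1), so the estimate is the average of 2, 2, 1, 1.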
Kernel Matching
It is interesting that the estimator of Heckman, Ichimura, and Todd (1997) is not only a nonparametric regression-type estimator but also a matching estimator: it uses the kernel function to measure closeness. To see this, let $K(\cdot)$ in (6) be the Bartlett kernel:[2]
$$K(x) = \begin{cases} 1 - |x|, & |x| \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
It imputes the counterfactual of treated unit $i$ by:
$$\hat{Y}_{0i} = \frac{\sum_{j \in C} K((X_j - X_i)/h) Y_j}{\sum_{j \in C} K((X_j - X_i)/h)},$$
where $X_j - X_i$ measures the difference between the covariate of treated unit $i$ and the covariate of untreated unit $j$. If $|X_j - X_i|$ is large, namely, $j$ is a distant observation relative to $i$ in terms of the kernel metric, it receives a smaller weight. When $|X_j - X_i| \ge h$, it receives zero weight; that is, unit $j$ is not included in the imputation of $Y_{0i}$.

[2] We use the Bartlett kernel only for expository purposes. In Heckman et al., $K$ should satisfy $\int z^r K(z)\,dz = 0$ for $r \le \dim(X)$. Obviously the Bartlett kernel violates this condition.
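The weighting scheme of kernel matching with the Bartlett kernel can be traced by hand on a few points. In the sketch below, the treated unit has covariate 0, the bandwidth is 2, and the three control observations are made-up numbers; the function names are hypothetical.

```python
def bartlett(u):
    """Bartlett (triangular) kernel: 1 - |u| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(u))

def impute_y0(x_i, controls, h):
    """Kernel-weighted average of control responses around x_i."""
    weights = [bartlett((x_j - x_i) / h) for x_j, _ in controls]
    total = sum(weights)
    if total == 0:
        raise ValueError("no control unit within bandwidth h of x_i")
    return sum(w * y_j for w, (_, y_j) in zip(weights, controls)) / total

# (covariate, response) pairs for three control units.
controls = [(0.5, 2.0), (1.0, 4.0), (3.0, 100.0)]
y0_hat = impute_y0(0.0, controls, h=2.0)
print(y0_hat)
```

The weights are 0.75, 0.5, and 0, so the distant control at $x = 3$ with its extreme response is ignored, and the imputed value is $(0.75 \cdot 2 + 0.5 \cdot 4)/1.25 = 2.8$.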
1.3.3 OLS vs. Matching
In this section we discuss the fundamental differences between the OLS and matching estimators for ATE. First, OLS uses a linear model to estimate ATE as well as the effects of the covariates on the response variable, and it may suffer from model misspecification. Matching avoids this problem but raises the question of how to choose the user-chosen parameter $M$. In matching, the role of the covariates is only to determine which unit is a good match; hence we identify ATE without learning the effects of the covariates on the response.
Second, they use different methods to remove selection biases. Matching compares units with similar covariates, directly using the unconfoundedness condition. Let $\hat{\tau}$ be the OLS estimator of $\tau$ in the linear model $Y_i = \alpha + \tau D_i + \beta' X_i + \varepsilon_i$. By the Frisch-Waugh-Lovell theorem, $\hat{\tau}$ is the estimate obtained after the linear influence of $X_i$ on $Y_i$ and on $D_i$ has been removed.
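The Frisch-Waugh-Lovell statement can be verified numerically: the coefficient on $D_i$ in the full regression equals the slope from regressing the residual of $Y_i$ on the residual of $D_i$, where both residuals come from regressions on a constant and $X_i$. The sketch below uses only the standard library; the small linear solver, the helper names, and the simulated data are assumptions made for this illustration.

```python
import random

def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """OLS coefficients of y on the given regressor columns (pass a column of ones for the intercept)."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)

def residuals(columns, y):
    """Residuals from the OLS regression of y on the given columns."""
    coef = ols(columns, y)
    return [yi - sum(c * col[i] for c, col in zip(coef, columns))
            for i, yi in enumerate(y)]

rng = random.Random(3)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
d = [1.0 if rng.random() < (0.7 if xi > 0 else 0.3) else 0.0 for xi in x]
y = [1.0 + 2.0 * di + 1.5 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]
ones = [1.0] * n

tau_full = ols([ones, d, x], y)[1]          # coefficient on D in the full regression
y_res = residuals([ones, x], y)             # Y with the linear influence of X removed
d_res = residuals([ones, x], d)             # D with the linear influence of X removed
tau_fwl = sum(u * v for u, v in zip(d_res, y_res)) / sum(u * u for u in d_res)
print(tau_full, tau_fwl)
```

The two estimates coincide up to rounding, which is the sense in which OLS removes only the linear influence of the covariates.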
Finally, the two methods differ in which observations are used in the imputation. Let $X_1$ and $X_0$ denote the covariates of the treatment and the control group, respectively, and let $s(X_1)$ and $s(X_0)$ stand for the corresponding supports. The support is defined as $s(X) = \{x : f(x) > 0\}$, where $f(x)$ is the pdf. Matching only uses the sample around the overlap of the supports, $s(X_1) \cap s(X_0)$. If there is insufficient overlap between $s(X_1)$ and $s(X_0)$, only a limited sample can be used to impute the counterfactual. This situation is termed the support problem. The extreme case is that there is no overlap at all, i.e., $s(X_1) \cap s(X_0) = \emptyset$. For example, suppose the government launches a subsidy program in which all households with annual income less than 10,000 are required to participate. In this case we cannot match on the income variable, because all treated units are low-income households and all untreated units are not. We will discuss this issue in the section on regression discontinuity design. By contrast, OLS uses the whole sample to estimate ATE regardless of whether there is sufficient common support. This is not because OLS is free of the support problem; rather, it relies on the assumed linear model to extrapolate across it.
Because OLS has virtually no mechanism for dealing with the support problem, and because it also estimates the effects of the covariates on the response, it can be very sensitive to the entire distributions of $X_1$ and $X_0$. By contrast, only the common support affects the precision of the matching estimator.
1.4 Propensity Score
Without conditioning on the covariates that potentially affect $D$ and $(Y_0, Y_1)$, comparisons between the two groups will be biased. This is because the selection process induces imbalanced covariate distributions between the treatment and the control group. Two groups are comparable in a statistical sense if their covariate distributions are the same. In the previous section we showed that under unconfoundedness, overt biases can be removed by various conditioning strategies such as regression or matching. Another pair of identification strategies, the balancing score and the propensity score, solve the selection bias problem by creating balanced covariate distributions between the two groups. These concepts were first introduced by Rosenbaum and Rubin (1983).
Definition 1.2 (balancing score)
A balancing score, $b(X)$, is a function of the observed covariates $X$ such that the conditional distribution of $X$ given $b(X)$ is the same for treated and control units; that is, in Dawid's (1979) notation, $X \perp D \mid b(X)$.
Conditional on the balancing score, the covariate distributions are balanced between the treatment and the control group; hence, the two groups become comparable. Obviously, $X$ itself is a balancing score. A balancing score is most useful when it has lower dimension than $X$.
Definition 1.3 (propensity score)
The propensity score $e(x)$ is the conditional probability of receiving the treatment:
$$e(x) = P(D = 1 \mid X = x).$$
Rosenbaum and Rubin (1983) show that the propensity score is a balancing score.
Theorem 1.1 (Balancing Property)
If $D$ is binary, then $X \perp D \mid e(X)$.
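proof:
$$\begin{aligned} P(X \le x, D = 1 \mid e(X)) &= E[D \cdot I_{\{X \le x\}} \mid e(X)] \\ &= E\big\{ E[D \cdot I_{\{X \le x\}} \mid X] \,\big|\, e(X) \big\} && (e(X) \text{ is measurable w.r.t. } X) \\ &= E\big\{ I_{\{X \le x\}} E[D \mid X] \,\big|\, e(X) \big\} \\ &= e(X) \cdot P(X \le x \mid e(X)). \end{aligned}$$
Moreover,
$$P(D = 1 \mid e(X)) = E[D \mid e(X)] = E\big\{ E[D \mid X] \,\big|\, e(X) \big\} = E[e(X) \mid e(X)] = e(X).$$
Therefore,
$$P(X \le x, D = 1 \mid e(X)) = e(X) \cdot P(X \le x \mid e(X)) = P(D = 1 \mid e(X)) \cdot P(X \le x \mid e(X)).$$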
Alternatively, we can prove this theorem by the following argument. Because $P(D = 1 \mid X, e(X)) = P(D = 1, X \mid e(X)) / P(X \mid e(X))$, it suffices to show that $P(D = 1 \mid X, e(X)) = P(D = 1 \mid e(X))$. Indeed, $P(D = 1 \mid X, e(X)) = E[D \mid X, e(X)] = E[D \mid X] = e(X) = P(D = 1 \mid e(X))$.
Note that this theorem follows from the definition of the propensity score alone. No distributional assumptions or unconfoundedness conditions are needed to prove this property. Rosenbaum and Rubin (1983) further show that the propensity score is the most condensed balancing score. Namely, the $\sigma$-algebra induced by the propensity score is the coarsest within the class of balancing scores.
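The balancing property can be checked exactly on a small discrete example. The population below is hypothetical: $X$ takes four values, two of which share the propensity score 1/5 and two of which share 3/5. Using exact fractions, the sketch compares the distribution of $X$ among treated and control units, first within each propensity score stratum and then without conditioning.

```python
from fractions import Fraction as F

# Hypothetical population: P(X = x) and e(x) = P(D = 1 | X = x).
p_x = {"a": F(1, 10), "b": F(3, 10), "c": F(2, 10), "d": F(4, 10)}
e_x = {"a": F(1, 5), "b": F(1, 5), "c": F(3, 5), "d": F(3, 5)}

def joint(x, d):
    """P(X = x, D = d) implied by p_x and e_x."""
    return p_x[x] * (e_x[x] if d == 1 else 1 - e_x[x])

def dist_x_given(d, e=None):
    """P(X = x | D = d), optionally also conditioning on e(X) = e."""
    cells = {x: joint(x, d) for x in p_x if e is None or e_x[x] == e}
    total = sum(cells.values())
    return {x: v / total for x, v in cells.items()}

# Within each propensity score stratum the covariate distributions coincide.
balanced = all(dist_x_given(1, e) == dist_x_given(0, e) for e in set(e_x.values()))
# Without conditioning they differ: treated units over-represent high-e(x) values.
unconditional_balanced = dist_x_given(1) == dist_x_given(0)
print(balanced, unconditional_balanced)
```

Within the stratum $e(X) = 1/5$, both groups put probabilities 1/4 and 3/4 on the values `a` and `b`, even though unconditionally the two groups have different covariate distributions.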
Theorem 1.2 (Most Condensed Information)
$b(X)$ is a balancing score, i.e., $X \perp D \mid b(X)$, if and only if $b(X)$ is finer than $e(X)$ in the sense that $e(X) = f(b(X))$ for some function $f$.[3]
proof:
$\Leftarrow$: Suppose $b(X)$ is finer than $e(X)$ (has a finer $\sigma$-algebra); then
$$P(D = 1 \mid b(X)) = E[D \mid b(X)] = E\big[ E[D \mid X] \mid b(X) \big] = E[e(X) \mid b(X)] = e(X).$$
Also, $P(D = 1 \mid X, b(X)) = E[D \mid X, b(X)] = E[D \mid X] = e(X)$. Therefore $b(X)$ is a balancing score, by the same argument as in the proof of Theorem 1.1.
$\Rightarrow$: Suppose $b(X)$ is a balancing score but $b(X)$ is not finer than $e(X)$. Then there exist $x_1$ and $x_2$ with $b(x_1) = b(x_2)$ but $e(x_1) \ne e(x_2)$. However, this implies $P(D = 1 \mid X = x_1) \ne P(D = 1 \mid X = x_2)$, which means that $D$ and $X$ are not conditionally independent given $b = b(x_1) = b(x_2)$, a contradiction.
Conditioning on the balancing score equalizes the covariate distributions of the treated and control units. Intuitively, the selection problem is resolved because the treatment group and the control group become comparable after conditioning on $b(X)$. Indeed, we have a formal statement of this intuition:
Theorem 1.3 (Conditional Unconfoundedness)
Suppose Assumptions 1.3 and 1.4 hold. Then $(Y_0, Y_1) \perp D \mid b(X)$. Namely, instead of conditioning on the entire covariate vector $X$, conditioning solely on $b(X)$ suffices to remove the selection biases.
[3] $f(X)$ can only reduce the information in $X$. This is because the same value of $x$ always gives the same value of $f(x)$, whereas different values of $x$ may share the same $f(x)$. Therefore $f(X)$ can only induce a coarser $\sigma$-algebra. For example, if $f(\cdot)$ is a constant function, then $\sigma(f(X))$ is the trivial $\sigma$-algebra.
proof: By Bayes' rule, $P(D = 1, Y_0, Y_1 \mid b(X)) = P(D = 1 \mid Y_0, Y_1, b(X)) \, P(Y_0, Y_1 \mid b(X))$. If we can show $P(D = 1 \mid Y_0, Y_1, b(X)) = P(D = 1 \mid b(X))$, then we are done.
$$\begin{aligned} P(D = 1 \mid Y_0, Y_1, b(X)) &= E[D \mid Y_0, Y_1, b(X)] \\ &= E\big\{ E[D \mid Y_0, Y_1, X, b(X)] \,\big|\, Y_0, Y_1, b(X) \big\} && \text{(law of iterated expectations)} \\ &= E\big\{ E[D \mid Y_0, Y_1, X] \,\big|\, Y_0, Y_1, b(X) \big\} && (b(X) \text{ is } X\text{-measurable}) \\ &= E\big\{ E[D \mid X] \,\big|\, Y_0, Y_1, b(X) \big\} && \text{(by unconfoundedness)} \\ &= e(X) && (b(X) \text{ is finer than } e(X)) \end{aligned}$$
Recall that in the proof of Theorem 1.2 we showed that $P(D = 1 \mid b(X)) = e(X)$. Therefore $P(D = 1 \mid Y_0, Y_1, b(X)) = P(D = 1 \mid b(X))$.
The reason $e(X)$ can remove selection biases is that conditioning on it corrects the imbalance between the covariates of the treatment and control groups, making them comparable. In a randomized experiment, randomization automatically balances the covariate distributions of the two groups. Matching has an implicit mechanism for balancing the covariates because it only matches units with similar covariates. However, the extent of the covariate imbalance does determine how many observations can be used.
The propensity score was first introduced in Rosenbaum and Rubin (1983) as a dimension reduction technique. Nonparametric estimation of ATE, though appealing because no constant effect assumption is imposed, is difficult to implement because of the curse of dimensionality. Typically, as the number of continuous covariates increases, the nonparametric estimator converges at a slower rate. Abadie and Imbens (2006) proved that the dimension of the continuous covariates affects the convergence rate of the nearest neighbor matching estimator: the higher the dimension, the slower the convergence rate. Intuitively, if the dimension of $X$ is high, it is more difficult to find a good match, because each covariate acts like a restriction to be satisfied. If there are many restrictions, only a few qualified observations are available to impute the counterfactual. Since the propensity score is one-dimensional and, by Theorem 1.3, removes overt biases just as conditioning on $X$ does, it has become the most popular approach in empirical research. To implement it, simply regress or match on the propensity score instead of the original covariates. For example, Heckman, Ichimura, and Todd (1997) use the propensity score to implement their estimator and avoid high-dimensional nonparametric regression. Consult Imbens (2004), Dehejia and Wahba (1999, 2002), Dehejia (2005), and Smith and Todd (2001) for more details.[4]
Although propensity score matching has become quite popular since the influential paper of Dehejia and Wahba (1999), this dimension reduction approach has some potential problems. First, $e(X)$ is typically unknown and must be estimated. When we match or regress on the estimated $e(X)$, the procedure becomes a two-stage estimator. To the author's knowledge, how to calculate the asymptotic variance of such estimators is still an area of ongoing research. Most papers derive the asymptotic variance assuming the propensity score is known. Second, there is a methodological paradox in using the propensity score. It was originally introduced as a dimension reduction technique because estimating $E[Y \mid D, X]$ suffers from the curse of dimensionality.[5] Still, estimating $e(X) = E[D \mid X]$ suffers from the curse of dimensionality too. It is not so clear (at least to me) that the propensity score indeed achieves the goal of dimension reduction.
1.4.1 Specification Testing of the Propensity Score
Researchers tend to specify a parametric model for the propensity score, e.g., a logit regression. Shaikh, Simonsen, Vytlacil, and Yildiz (2005) provide a specification test for the propensity score. Let $f(e)$, $f_1(e)$, and $f_0(e)$ be the pdfs of $e(X)$, $e(X) \mid D = 1$, and $e(X) \mid D = 0$, respectively. We have the following testable restriction:
Lemma 1.1 Assume 0 < P(D = 0) < 1. Then for all 0 < e < 1 and e support(e(X)),
we have
f
1
(e)
f
0
(e)
=
P(D=0)
P(D=1)
e
1e
.
proof: First note that
\[ P(D = 1, e(X) = e) = P(D = 1|e(X) = e)\, P(e(X) = e) = e f(e) \]
(recall that $P(D = 1|e(X)) = e(X)$). Also note that
\[ P(D = 1, e(X) = e) = P(e(X) = e|D = 1)\, P(D = 1) = f_1(e) P(D = 1). \]
Combining these two expressions, we have $f_1(e)P(D = 1) = e f(e)$. Analogously, we also
have $f_0(e)P(D = 0) = (1 - e)f(e)$. Dividing the first equality by the second gives the
result.
Shaikh et al. develop a testing procedure by exploiting this restriction.
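The restriction in Lemma 1.1 can be checked numerically. The sketch below is only an illustration under assumptions of my own: the propensity score takes three hypothetical values (so $f$, $f_1$, $f_0$ are probability mass functions rather than densities), and the function name `density_ratio_check` is made up for this note. It is not the testing procedure of Shaikh et al.

```python
import random

def density_ratio_check(n=60000, seed=0):
    """Compare f1(e)/f0(e) with P(D=0)/P(D=1) * e/(1-e) on simulated data."""
    rng = random.Random(seed)
    scores = [0.2, 0.5, 0.8]  # hypothetical support of e(X)
    counts = {(e, d): 0 for e in scores for d in (0, 1)}
    for _ in range(n):
        e = rng.choice(scores)              # draw the propensity score e(X)
        d = 1 if rng.random() < e else 0    # D | e(X) ~ Bernoulli(e(X))
        counts[(e, d)] += 1
    n1 = sum(counts[(e, 1)] for e in scores)  # number of treated units
    n0 = n - n1                               # number of control units
    result = {}
    for e in scores:
        lhs = (counts[(e, 1)] / n1) / (counts[(e, 0)] / n0)  # f1(e) / f0(e)
        rhs = (n0 / n1) * e / (1 - e)                        # right-hand side of Lemma 1.1
        result[e] = (lhs, rhs)
    return result
```

Calling `density_ratio_check()` returns, for each score value, the empirical ratio and the value implied by the lemma; the two should be close when the propensity score is correctly specified.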
[4] Moreover, Hahn (1998) found that the propensity score is also closely related to the
efficiency bound of ATE; see also Chen, Hong, and Tarozzi (2004).
[5] Also recall that the curse of dimensionality mainly arises when there are many
continuous covariates. If the X are all discrete, perhaps there is no need for dimension
reduction.
1.4.2 Regression as a Dimension Reduction Method
Here I would like to make a digression to discuss the relationship between regression and
dimension reduction, because it provides another interpretation of the balancing property
of the propensity score. Suppose we want to study the relationship between the response
variable Y and the predetermined variables X. The dimension of X is high, so we should
employ some dimension reduction technique before running the regression. A naive way is
to do principal component analysis on X. Principal component analysis will find vectors
$\beta_1, \beta_2, \ldots, \beta_k$ such that the new variables
$X'\beta_1, X'\beta_2, \ldots, X'\beta_k$ best summarize the information contained in the
original data X. Then we build the regression model linking Y and
$X'\beta_1, X'\beta_2, \ldots, X'\beta_k$, instead of the original X. In particular, if a
single direction $\beta$ is already informative in describing X, and the relationship
between Y and $X'\beta$ is linear, we will have the familiar regression model
$Y_i = X_i'\beta + \varepsilon_i$. Why does parametric regression not suffer from the
dimensionality problem? In light of the above argument, this is because parametric
regression itself can be viewed as a dimension reduction technique.
Assume that $\varepsilon_i$ is i.i.d. with mean zero and independent of X; then we have
$Y \perp X \mid X'\beta$. Conditional on the systematic part $X'\beta$, the stochastic
property of Y is driven by the noise term $\varepsilon$. Hence Y is independent of X
given $X'\beta$. In the dimension reduction literature, the subspace spanned by $\beta$
is called the dimension reduction space; see Cook (1998) for more details. Now return to
the propensity score case. Think about the propensity score $E[D|X] = G(X'\gamma)$. If
$G(\cdot)$ is the CDF of the standard normal distribution this would be the familiar
probit regression. Analogously, after conditioning on the systematic component
$G(X'\gamma)$, the stochastic property of D is driven by the noise term and hence is
independent of X. Namely, $D \perp X \mid e(X)$. The balancing score is, in fact, the
sufficient reduction in the statistics literature on dimension reduction.
A final remark: doing principal component analysis on X and then running the regression
may not be good practice. Because PCA on X only finds linear combinations of X that best
describe the variation of X, it does not guarantee that such linear combinations best
describe the relationship between Y and X. The sliced inverse regression developed by
Li (1991) is a method that incorporates the information of Y when doing the PCA on X.
Similarly, the propensity score does not incorporate the information of Y, so it may not
be an optimal dimension reduction method for studying treatment effects.
1.4.3 Propensity Score Weighting Estimator
Besides conditioning on e(X) with the various conditioning strategies, the propensity
score can be used to construct the propensity score weighting (or inverse probability
weighting) estimator for ATE. It is based on the following identification result:
Lemma 1.2 (Propensity Score Weighting)
Under Assumptions 1.3 and 1.4,
\[ E\left[\frac{DY}{e(X)}\right] = E[Y_1], \quad\text{and}\quad
   E\left[\frac{(1-D)Y}{1-e(X)}\right] = E[Y_0]. \]
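proof: Assumption 1.4 is required to guarantee that the above objects are well defined,
because e(X) is in the denominator.
\begin{align*}
E\left[\frac{DY}{e(X)}\right]
  &= E\left[E\left[\frac{DY}{e(X)}\,\Big|\,X\right]\right]
   = E\left[\frac{1}{e(X)}\,E[DY|X]\right] \\
  &= E\left[\frac{1}{e(X)}\,E[1 \cdot Y|X, D = 1]\,P(D = 1|X)\right] \\
  &= E\big[E[Y_1|X, D = 1]\big] = E\big[E[Y_1|X]\big] = E[Y_1].
\end{align*}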
Hirano, Imbens, and Ridder (2003) estimate e(X) by a series estimator. We can obtain an
efficient estimator for ATE by modifying the above lemma; see Hirano et al. for details.
More theoretical properties of propensity score weighting estimators can be found in
Chen, Hong, and Tarozzi (2004), and Khan and Tamer (2008). To better understand this
identification result, let's consider a simplified version of Lemma 1.2. Suppose the
covariates have no impact on the potential outcomes and treatment assignment
(Assumption 1.1 holds); then $e(X) = P(D = 1)$ for all X. Also note that
$E[DY] = E[1 \cdot Y|D = 1]P(D = 1) = E[Y_1]\,P(D = 1)$. Clearly
$E[Y_1] = E\left[\frac{DY}{P(D=1)}\right]$ and the sample counterpart is
\[ \frac{\frac{1}{N}\sum_{i=1}^N D_i Y_i}{\frac{1}{N}\sum_{i=1}^N D_i}
   = \frac{\sum_{i: D_i = 1} Y_i}{N_1}. \]
This is nothing but the sample average of $Y_i$ for the treatment group. Under
Assumption 1.1 this estimator is consistent for $E[Y_1]$. Because D is an indicator
random variable, multiplying Y by D signals that we want to calculate $E[Y_1]$. However,
due to the missing data problem, $E[D \cdot Y]$ equals $E[Y_1]$ times a $P(D = 1)$ term.
This is because the expectation operator here is taken over the whole population
($\sum_{i=1}^N D_i Y_i$ is divided by N). However, only the treated units contribute when
estimating $E[Y_1]$. Therefore, the sum should be divided by the sample size of the
treatment group $N_1$, not N. Propensity score weighting is just a way to recover the
correct sample size.
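To make the weighting idea concrete, here is a minimal simulation sketch of the sample analog of Lemma 1.2. Everything in it is an assumption made for illustration: a single binary covariate, a known propensity score equal to 0.3 or 0.7, a constant treatment effect of 2, and the function name `simulate_ipw`. In practice e(X) would have to be estimated.

```python
import random

def simulate_ipw(n=40000, seed=1):
    """Return (naive difference in means, IPW estimate of ATE) on simulated data."""
    rng = random.Random(seed)
    sum1 = sum0 = 0.0          # running sums of D*Y/e(X) and (1-D)*Y/(1-e(X))
    treated, control = [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        e = 0.7 if x == 1 else 0.3          # propensity score e(X), assumed known
        d = 1 if rng.random() < e else 0
        y0 = x + rng.gauss(0.0, 1.0)        # potential outcome without treatment
        y1 = y0 + 2.0                       # constant effect, so ATE = 2
        y = y1 if d == 1 else y0            # observed outcome
        if d == 1:
            sum1 += y / e
            treated.append(y)
        else:
            sum0 += y / (1.0 - e)
            control.append(y)
    naive = sum(treated) / len(treated) - sum(control) / len(control)
    ipw = sum1 / n - sum0 / n               # sample analog of Lemma 1.2
    return naive, ipw
```

In this design X raises both the outcome and the probability of treatment, so the naive comparison of group means overstates the effect, while the weighted estimator should be close to 2.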
M.-J. Lee (2005) provides a more formal interpretation of weighting estimators. Suppose
we want to estimate $E[Y] = \int y f(y)\,dy$. Due to the selection problem, the data
$(Y_i)_{i=1}^N$ are now sampled from the density $g(y)$ instead of $f(y)$. Calculating the
sample mean directly will not yield a consistent estimate. However, even though the data
are sampled from the wrong density $g(y)$, it is still possible to calculate $E[Y]$ by
importance sampling:
\[ \int y f(y)\,dy = \int y\,\frac{f(y)}{g(y)}\,g(y)\,dy = \int \frac{y}{r(y)}\,g(y)\,dy, \]
where $r(y) = g(y)/f(y)$. $N^{-1}\sum_i y_i/r(y_i)$ is consistent for $E[Y]$. The role of
$r(y_i)$ here is similar to that of the propensity score in weighting. The intuition is
that if we know the selection process, we may be able to recover the original density
$f(y)$, and the propensity score is just a way to describe the selection process in a
statistical sense.
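The importance sampling argument can be illustrated with a small sketch. The two densities are my own choice for illustration: the target $f$ is N(0, 1) and the data are drawn from $g$ = N(1, 1), for which $r(y) = g(y)/f(y) = \exp(y - 1/2)$. The function name is hypothetical.

```python
import math
import random

def importance_sampling_mean(n=50000, seed=2):
    """Return (raw sample mean, reweighted mean) when data come from g instead of f."""
    rng = random.Random(seed)
    total_raw = total_weighted = 0.0
    for _ in range(n):
        y = rng.gauss(1.0, 1.0)       # data are sampled from g = N(1, 1), not f = N(0, 1)
        r = math.exp(y - 0.5)         # r(y) = g(y) / f(y) for these two normal densities
        total_raw += y
        total_weighted += y / r       # y_i / r(y_i), as in the text
    return total_raw / n, total_weighted / n
```

The raw mean estimates the mean under $g$ (here 1), while the reweighted mean should be close to the mean under $f$ (here 0).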
2 Quantile Treatment Effects
Definition 2.1 Quantile Function
\[ Q_Y(\tau) = F^{-1}(\tau) = \inf\{y : F(y) \ge \tau\} \]
Theorem 2.1 Under assumption ?, the marginal distributions of $Y_1$ and $Y_0$ are
identified.
proof:
WLOG, suppose $E[|g(Y_1)|] < \infty$.
\begin{align*}
E\big[E[g(Y)|D = 1, X]\big] &= E\big[E[g(Y_1)|D = 1, X]\big] && \text{(by def. of } Y\text{)} \\
  &= E\big[E[g(Y_1)|X]\big] && \text{(by conditional independence)} \\
  &= E[g(Y_1)] && \text{(X is observable, so we can integrate it out).}
\end{align*}
Let $g(Y) = I_{\{Y \le y\}}$; then $E[g(Y_1)] = F_1(y)$. Choosing $y \in (-\infty, \infty)$,
we can trace out the entire distribution function $F_1$ of $Y_1$. The same argument
applied to the $D = 0$ group identifies $F_0$. Therefore, the quantile functions are also
identified.
Definition 2.2 Quantile Treatment Effect
The quantile treatment effect at the $\tau$-th quantile is defined as
$Q_{Y_1}(\tau) - Q_{Y_0}(\tau)$.
The difference of two quantile functions equals the horizontal distance between the two
distribution functions. Let $\Delta x$ be the horizontal distance between $F_0$ and $F_1$
at the $\tau$-th quantile:
\begin{align*}
\tau &= F_0(x) = F_1(x + \Delta x), \\
x + \Delta x &= F_1^{-1}\big(F_0(x)\big) = F_1^{-1}(\tau), \\
\Delta x &= F_1^{-1}(\tau) - x = F_1^{-1}(\tau) - F_0^{-1}(\tau)
          = Q_{Y_1}(\tau) - Q_{Y_0}(\tau).
\end{align*}
As in the previous section, there are two classes of estimation strategies for QTE under
conditional unconfoundedness: quantile regression, which is based on conditioning, and
propensity score weighting.
2.1 Quantile Regression
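Definition 2.3 Quantile Regression
If $Q_Y(\tau|X) = X'\beta(\tau)$, then
\[ \beta(\tau) = \arg\min_{\beta} E[\rho_\tau(Y - X'\beta)], \quad\text{where}\quad
   \rho_\tau(Y - X'\beta) = \big(\tau - I_{\{Y - X'\beta < 0\}}\big)(Y - X'\beta). \]
$\rho_\tau(\cdot)$ is also known as the check function.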
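As a small illustration of why the check function delivers quantiles, consider the intercept-only case (X = 1), where the minimizer of the average check loss is a sample quantile. The sketch below uses toy data and helper names of my own; since the objective is piecewise linear and convex in q, it is enough to search over the observed values.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - 1{u < 0}) * u."""
    return (tau - (1.0 if u < 0 else 0.0)) * u

def sample_quantile_by_check(ys, tau):
    """Minimize the average check loss over candidate values taken from the data."""
    def avg_loss(q):
        return sum(check_loss(y - q, tau) for y in ys) / len(ys)
    return min(sorted(ys), key=avg_loss)

def sample_quantile_by_cdf(ys, tau):
    """inf{y : F_n(y) >= tau}, with F_n the empirical distribution function."""
    ordered = sorted(ys)
    for k, y in enumerate(ordered, start=1):
        if k / len(ordered) >= tau:
            return y

# Hypothetical toy data.
toy = [7, 1, 10, 4, 2, 9, 3, 8, 5, 6]
```

For `toy` and `tau = 0.25`, both functions return the same value, which matches Definition 2.1 applied to the empirical distribution.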


2.2 Weighting Estimator
In this section we introduce the weighting estimator for QTE developed by Firpo (2007),
which is able to estimate the unconditional QTE directly. Recall that
$E\left[\frac{DY}{e(X)}\right] = E[Y_1]$. This holds not only for Y but also for any
measurable function $g(Y)$.
Lemma 2.1 Under Assumptions 1.3 and 1.4,
\[ E\left[\frac{D\,g(Y)}{e(X)}\right] = E[g(Y_1)], \quad\text{and}\quad
   E\left[\frac{(1-D)\,g(Y)}{1-e(X)}\right] = E[g(Y_0)]. \]
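proof:
\begin{align*}
E\left[\frac{D\,g(Y)}{e(X)}\right]
  &= E\left[E\left[\frac{D\,g(Y)}{e(X)}\,\Big|\,X\right]\right]
   = E\left[\frac{1}{e(X)}\,E[D\,g(Y)|X]\right] \\
  &= E\left[\frac{1}{e(X)}\,E[1 \cdot g(Y)|X, D = 1]\,P(D = 1|X)\right] \\
  &= E\big[E[g(Y_1)|X, D = 1]\big] = E\big[E[g(Y_1)|X]\big] = E[g(Y_1)].
\end{align*}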
By properly choosing the function $g(\cdot)$, we can obtain the moment conditions for the
quantile functions:
Corollary 2.1 Let $g(Y) = I_{\{Y \le Q_{Y_1}(\tau)\}}$; then
$E\left[\frac{D\,g(Y)}{e(X)}\right] = E[I_{\{Y_1 \le Q_{Y_1}(\tau)\}}] = \tau$.
Let $g(Y) = I_{\{Y \le Q_{Y_0}(\tau)\}}$; then
$E\left[\frac{(1-D)\,g(Y)}{1-e(X)}\right] = E[g(Y_0)] = E[I_{\{Y_0 \le Q_{Y_0}(\tau)\}}] = \tau$.
The quantile function can be estimated by solving a weighted quantile regression problem:
\[ \hat{Q}_{Y_j}(\tau) = \arg\min_q \frac{1}{N}\sum_{i=1}^N \omega_{j,i}\,\rho_\tau(Y_i - q),
   \quad\text{where}\quad
   \omega_{1,i} = \frac{D_i}{e(X_i)}, \quad \omega_{0,i} = \frac{1 - D_i}{1 - e(X_i)}. \]
For example, the FOC for $\hat{Q}_{Y_1}$ is:
\[ \frac{1}{N}\sum_{i=1}^N \frac{D_i}{e(X_i)}
   \big(I_{\{Y_i \le \hat{Q}_{Y_1}(\tau)\}} - \tau\big) = 0, \]
which is the sample analog of the moment condition for $Q_{Y_1}(\tau)$. Firpo (2007)
suggests the following procedure: first estimate the propensity score e(X)
nonparametrically; then plug in the estimated e(X) and solve the weighted quantile
regression problem. $\hat{Q}_{Y_1}(\tau) - \hat{Q}_{Y_0}(\tau)$ is the estimated
unconditional $\tau$-th QTE.
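The weighted quantile problem above has a simple solution: sort the outcomes and find the smallest value at which the normalized cumulative weight reaches $\tau$. The sketch below illustrates this on simulated data. It is not Firpo's full procedure: the propensity score is treated as known rather than estimated nonparametrically, and the data generating process and function names are assumptions of mine.

```python
import random

def weighted_quantile(values, weights, tau):
    """Smallest q whose normalized cumulative weight reaches tau.

    This q solves min_q sum_i w_i * rho_tau(y_i - q) for nonnegative weights.
    """
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    total = sum(w for _, w in pairs)
    running = 0.0
    for v, w in pairs:
        running += w
        if running / total >= tau:
            return v
    return pairs[-1][0]

def simulate_qte(tau=0.5, n=40000, seed=3):
    """Weighted-quantile estimate of the unconditional QTE at tau."""
    rng = random.Random(seed)
    ys, w1, w0 = [], [], []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        e = 0.7 if x == 1 else 0.3          # propensity score, treated as known here
        d = 1 if rng.random() < e else 0
        y0 = x + rng.gauss(0.0, 1.0)
        y1 = y0 + 2.0                       # pure location shift: QTE(tau) = 2 for every tau
        ys.append(y1 if d == 1 else y0)
        w1.append(d / e)                    # weight omega_{1,i}
        w0.append((1 - d) / (1.0 - e))      # weight omega_{0,i}
    return weighted_quantile(ys, w1, tau) - weighted_quantile(ys, w0, tau)
```

Because the simulated treatment shifts the whole outcome distribution by 2, the estimate returned by `simulate_qte()` should be close to 2 at any quantile.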
3 Instrumental Variables I: Local Treatment Effects
Suppose we want to study the return to schooling. A naive way to conduct such a study is
to regress individuals' wage Y on their education level D. However, education level may
be confounded by individuals' unobservable ability. High-ability individuals may have
higher earning potential as well. Therefore, the estimated return to schooling may be
inflated. Technically, D is correlated with the error term $\varepsilon$, and OLS does
not yield a consistent estimate in such a situation. This is a well-known problem called
hidden bias or endogeneity. Since a variable representing ability is in general
unavailable or subject to measurement error, we cannot employ the conditioning strategies
discussed in the previous sections. There are several ways to tackle the endogeneity
problem; one is the instrumental variable approach.
3.1 Instrumental Variable : A Review
A variable Z is an instrumental variable if it affects treatment assignment D (inclusion
restriction), but not the response Y directly (exclusion restriction). Z affects Y only
through D. The causal diagram of the IV setup is:
$Z \rightarrow D \rightarrow Y$
Under this causal relationship, the IV effectively induces exogenous variations in D.
In our return to schooling case, now suppose Z stands for an exogenously determined loan
policy. In Taiwan, student loan policy is determined by the government. Z will affect
education level D because students have to consider the cost of education when making
decisions. Moreover, Z will not directly affect the earning potential. Therefore,
variations in Z can cause exogenous variations in D, and we can utilize such exogenous
variation to identify the ATE of D on Y. Since Z affects Y only through D, we can
decompose the effect of Z on Y into the effect of Z on D times the effect of D on Y.
According to our causal diagram, we are able to estimate the effect of Z on D as we did
in section 2. Therefore, dividing the effect of Z on Y by the effect of Z on D, we obtain
the effect of D on Y, the parameter of interest. For expositional purposes, from now on
assume both D and Z are binary. Consider the following estimator of ATE that implements
the above intuition:
\[ \frac{E[Y|Z = 1] - E[Y|Z = 0]}{E[D|Z = 1] - E[D|Z = 0]} \]
That is, the ATE of Z on Y divided by the ATE of Z on D. Several identification
assumptions can give this estimator (henceforth IVE) a causal interpretation. First
consider the constant effect model, which is standard in the IV literature:
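\begin{align*}
Y_i &= \beta_0 + \beta_1 D_i + \varepsilon_i, \\
D_i &= \gamma_0 + \gamma_1 Z_i + v_i, \\
(\varepsilon_i, v_i) &\perp Z_i, \quad \gamma_1 \neq 0, \quad\text{and}\quad
  E[v_i] = E[\varepsilon_i] = 0.
\end{align*}
$\gamma_1 \neq 0$ and $v_i \perp Z_i$ capture the idea that Z affects D and is exogenous.
Z is not in the Y equation, and $(\varepsilon_i, v_i) \perp Z_i$ ensures that Z affects Y
only through D. We can plug the D equation into the Y equation:
\[ Y_i = \beta_0 + \beta_1(\gamma_0 + \gamma_1 Z_i + v_i) + \varepsilon_i
       = (\beta_0 + \beta_1\gamma_0) + \beta_1\gamma_1 Z_i + (\beta_1 v_i + \varepsilon_i). \]
As we emphasized in section 2, the coefficient of the dummy regressor is the ATE under
the constant effect assumption. Therefore,
$E[Y|Z = 1] - E[Y|Z = 0] = \beta_1\gamma_1$ and $E[D|Z = 1] - E[D|Z = 0] = \gamma_1$, so
the IVE identifies $\beta_1$, the ATE of D on Y. Under our assumptions, $\beta_1\gamma_1$
can be consistently estimated by regressing Y on Z, and $\gamma_1$ can be consistently
estimated by regressing D on Z:
\[ \widehat{IVE} = \frac{\widehat{\beta_1\gamma_1}}{\hat{\gamma}_1}
   = \frac{\sum (z_i - \bar{z})(y_i - \bar{y})}{\sum (z_i - \bar{z})^2} \cdot
     \frac{\sum (z_i - \bar{z})^2}{\sum (z_i - \bar{z})(d_i - \bar{d})}, \]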
which is nothing but the familiar IV formula in textbooks. Usually it is derived by
exploiting the following moment condition:
\begin{align*}
E[Z'\varepsilon] &= 0, \\
E[Z'(Y - X\beta)] &= E[Z'Y] - E[Z'X]\beta = 0, \\
\beta &= E[Z'X]^{-1}E[Z'Y],
\end{align*}
where $Z = [1, z_i]$, $X = [1, d_i]$, $Y = [y_i]$ are the data matrices obtained by
stacking data from i = 1 to N. We can implement $\widehat{IVE}$ by first regressing D on
Z and obtaining the fitted value $\hat{D}$, then regressing Y on $\hat{D}$. This
procedure is the celebrated two-stage least squares (2SLS). Note that $\widehat{IVE}$ is
consistent for $\frac{Cov(Z,Y)}{Cov(Z,D)}$. Moreover, if Z and D are both binary, we have
the following lemma.
Lemma 3.1 If Z and D are both binary, then
\[ \frac{Cov(Z, Y)}{Cov(Z, D)} = \frac{E[Y|Z = 1] - E[Y|Z = 0]}{E[D|Z = 1] - E[D|Z = 0]}. \]
proof:
$Cov(Z, D) = E[ZD] - E[Z]E[D]$, where $E[ZD] = E[D \cdot 1|Z = 1]P(Z = 1)$, and
\begin{align*}
E[Z]E[D] &= E[D(Z + (1 - Z))]\,P(Z = 1) \\
  &= \big(E[DZ] + E[D(1 - Z)]\big)P(Z = 1) \\
  &= \big(E[D|Z = 1]P(Z = 1) + E[D|Z = 0]P(Z = 0)\big)P(Z = 1).
\end{align*}
Therefore
\begin{align*}
E[ZD] - E[Z]E[D]
  &= E[D|Z = 1]P(Z = 1) - E[D|Z = 1]P(Z = 1)^2 - E[D|Z = 0]P(Z = 0)P(Z = 1) \\
  &= E[D|Z = 1]P(Z = 1)(1 - P(Z = 1)) - E[D|Z = 0]P(Z = 0)P(Z = 1) \\
  &= \big(E[D|Z = 1] - E[D|Z = 0]\big)P(Z = 0)P(Z = 1).
\end{align*}
Similarly, $Cov(Z, Y) = \big(E[Y|Z = 1] - E[Y|Z = 0]\big)P(Z = 0)P(Z = 1)$. Taking the
ratio gives the result.
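Lemma 3.1 also holds exactly in any sample with a binary instrument, which makes it easy to verify. The following sketch compares the ratio of group-mean differences with the covariance ratio on a small made-up data set; the data and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance (dividing by N; the normalization cancels in the ratio)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def wald_estimate(z, d, y):
    """Ratio of the group-mean differences in Y and in D between Z = 1 and Z = 0."""
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    d1 = mean([di for zi, di in zip(z, d) if zi == 1])
    d0 = mean([di for zi, di in zip(z, d) if zi == 0])
    return (y1 - y0) / (d1 - d0)

def iv_estimate(z, d, y):
    """Textbook IV formula Cov(Z, Y) / Cov(Z, D)."""
    return cov(z, y) / cov(z, d)

# Hypothetical toy data: instrument, treatment, outcome.
z = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
d = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y = [1.0, 2.0, 4.0, 1.5, 3.5, 5.0, 2.0, 4.5, 3.0, 6.0]
```

On these data `wald_estimate(z, d, y)` and `iv_estimate(z, d, y)` agree up to floating point error, as the lemma predicts.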
3.2 Restrictions on the Selection Process : Local Treatment Effects
The constant effect model we just discussed is quite restrictive. If we want to allow for
arbitrary forms of heterogeneous effects across individuals, does the IVE identify any
ATE parameter? This question is addressed by the concept of local average treatment
effect (LATE), developed by Imbens and Angrist (1994), Angrist and Imbens (1995),
Angrist, Imbens, and Rubin (1996), Angrist (2001), and Angrist (2004).
Again suppose both Z and D are binary. Recall that we defined the potential outcomes
$(Y_0, Y_1)$ and the observed outcome $Y = DY_1 + (1 - D)Y_0$. The potential outcome
framework enables us to say something about the effect of D on Y. In IV estimation, we
also have to know the effect of Z on D. To make this concept manageable, define the
potential treatments $(D_0, D_1)$ and the observed treatment $D = ZD_1 + (1 - Z)D_0$.
When Z = 1, $D_1$ is observed, and when Z = 0, $D_0$ is observed. $D_{1i} - D_{0i}$ is the
individual effect of the instrument on the treatment assignment. Clearly it is a
counterfactual setting as before, because we can only observe $D_1$ or $D_0$ but not
both. We observe $(Z_i, D_i, Y_i)_{i=1,\ldots,N}$ and want to identify features of
$(Z_i, D_{0i}, D_{1i}, Y_{0i}, Y_{1i})_{i=1,\ldots,N}$.
Consider the encouragement design of Rosenbaum (1996) to further understand the meaning
of potential treatment. We want to study the effect of exercise D on Y, the forced
expiratory volume (FEV). Simple comparisons do not identify the effect of D on Y because
it is confounded by the subjects' unobservable health status. Healthy people tend to
exercise and to have higher FEV as well. Suppose the subjects are randomly selected and
encouraged to exercise. Let $Z_i = 1$ when subject i is selected. Encouragement may
induce some people to start exercising; hence Z shifts D. Moreover, randomized
encouragement won't affect FEV directly. We can classify the subjects into four groups
according to the values of their potential treatments.
$D_0 = 0, D_1 = 0$ : never-taker
$D_0 = 1, D_1 = 1$ : always-taker
$D_0 = 0, D_1 = 1$ : complier
$D_0 = 1, D_1 = 0$ : defier
Never-takers never exercise whether there is encouragement or not. By contrast,
always-takers always exercise regardless of encouragement. Compliers exercise only if
they are encouraged. They are named compliers because their action follows the
instrument; i.e., D = Z. By contrast, defiers disobey the instrument. Note that the
complier and the defier groups are the source of exogenous variation induced by the
instrument, because they change their behavior in response to it.
Suppose
Assumption 3.1
LATE-1: $(Y_{0i}, Y_{1i}, D_{0i}, D_{1i}) \perp Z_i$, (exclusion restriction)
LATE-2: $E[D|Z = z]$ is a nontrivial function of z, (inclusion restriction) and
LATE-3: $D_{1i} \ge D_{0i}\ \forall i$. (Monotonicity)
LATE-1 captures the idea that Z is exogenous, and it is similar to assumption 1. Under
this assumption, group mean differences can identify the ATE of Z on D and of Z on Y. In
principle, we should define the potential outcomes with two subscripts; i.e., $Y_{zd}$,
$z \in \{0, 1\}$ and $d \in \{0, 1\}$. In our notation the potential outcomes are only
indexed by d, meaning that Z only affects Y through D. LATE-2 states that Z shifts D, as
in the standard IV setup. LATE-3, monotonicity, is an extra condition compared with the
traditional IV setting. We will see that this assumption enables us to identify several
features of $(Z_i, D_{0i}, D_{1i}, Y_{0i}, Y_{1i})_{i=1,\ldots,N}$. LATE-3 effectively
imposes restrictions on the selection process $D_i$ and rules out the defier group.
Imbens and Angrist (1994) pointed out that latent index models satisfy assumption 5. For
example, let
\begin{align*}
Y_{0i} &= \beta_0 + \varepsilon_i, \\
Y_{1i} &= \beta_0 + \beta_1 + \varepsilon_i, \quad\text{and} \\
D_{zi} &= I_{\{\gamma_0 + \gamma_1 z + v_i > 0\}}.
\end{align*}
If $(\varepsilon_i, v_i) \perp Z_i$, then
$(D_{0i}, D_{1i}, Y_{0i}, Y_{1i}) = (I_{\{\gamma_0 + v_i > 0\}}, I_{\{\gamma_0 + \gamma_1 + v_i > 0\}},
\beta_0 + \varepsilon_i, \beta_0 + \beta_1 + \varepsilon_i) \perp Z_i$.
$\gamma_1 > 0$ guarantees LATE-2 and LATE-3. Empirical examples of IVs satisfying LATE-3
will be discussed in the following subsection. Under Assumption 3.1, we have
Theorem 3.1 (ATE on the Compliers)
Given Assumption 3.1,
\[ IVE = \frac{E[Y|Z = 1] - E[Y|Z = 0]}{E[D|Z = 1] - E[D|Z = 0]}
       = E[Y_1 - Y_0|D_0 = 0, D_1 = 1] = E[Y_1 - Y_0|\text{complier}] \equiv LATE. \]
proof:
First consider the numerator,
\begin{align*}
&E[Y|Z = 1] - E[Y|Z = 0] \\
&= E[D_1Y_1 + (1 - D_1)Y_0|Z = 1] - E[D_0Y_1 + (1 - D_0)Y_0|Z = 0]
   && \text{(by def. of Y and D)} \\
&= E[D_1Y_1 + (1 - D_1)Y_0] - E[D_0Y_1 + (1 - D_0)Y_0] && \text{(by LATE-1)} \\
&= E[D_1(Y_1 - Y_0) + Y_0 - D_0(Y_1 - Y_0) - Y_0] \\
&= E[(D_1 - D_0)(Y_1 - Y_0)],
\end{align*}
where $(D_{1i} - D_{0i})(Y_{1i} - Y_{0i})$ is the individual effect of Z on Y.
\begin{align*}
&E[(D_1 - D_0)(Y_1 - Y_0)] \\
&= E[0 \cdot (Y_1 - Y_0)|D_0 = 0, D_1 = 0]\,P(D_0 = 0, D_1 = 0) \\
&\quad + E[0 \cdot (Y_1 - Y_0)|D_0 = 1, D_1 = 1]\,P(D_0 = 1, D_1 = 1) \\
&\quad + E[1 \cdot (Y_1 - Y_0)|D_0 = 0, D_1 = 1]\,P(D_0 = 0, D_1 = 1) \\
&\quad + E[-1 \cdot (Y_1 - Y_0)|D_0 = 1, D_1 = 0]\,P(D_0 = 1, D_1 = 0) \\
&= E[(Y_1 - Y_0)|D_0 = 0, D_1 = 1]\,P(D_0 = 0, D_1 = 1). && \text{(by LATE-3)}
\end{align*}
It is clear that the complier group is the only source of variation induced by an
instrumental variable satisfying Assumption 3.1. Secondly,
\begin{align*}
E[D|Z = 1] - E[D|Z = 0] &= P(D = 1|Z = 1) - P(D = 1|Z = 0) && \text{(D is binary)} \\
  &= P(D_1 = 1|Z = 1) - P(D_0 = 1|Z = 0) && \text{(def. of D)} \\
  &= P(D_1 = 1) - P(D_0 = 1) && \text{(LATE-1)} \\
  &= \big[P(D_0 = 0, D_1 = 1) + P(D_0 = 1, D_1 = 1)\big] \\
  &\quad - \big[P(D_0 = 1, D_1 = 0) + P(D_0 = 1, D_1 = 1)\big] \\
  &= P(D_0 = 0, D_1 = 1). && \text{(LATE-3)}
\end{align*}
Therefore,
\[ IVE = \frac{E[Y|Z = 1] - E[Y|Z = 0]}{E[D|Z = 1] - E[D|Z = 0]}
       = E[Y_1 - Y_0|D_0 = 0, D_1 = 1]. \]
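A small population example may help to see what Theorem 3.1 says. In the sketch below the three types, their shares, and their mean potential outcomes are numbers I made up; there are no defiers (LATE-3) and Z is independent of type (LATE-1). The code computes the population IVE and the population ATE directly from these ingredients.

```python
# Hypothetical population: (share, mean of Y0, mean of Y1, D0, D1) for each type.
TYPES = {
    "never-taker":  (0.3, 1.0, 1.5, 0, 0),
    "always-taker": (0.2, 3.0, 6.0, 1, 1),
    "complier":     (0.5, 2.0, 3.0, 0, 1),
}

def mean_given_z(z, what):
    """E[Y | Z = z] (what == "Y") or E[D | Z = z] (what == "D")."""
    total = 0.0
    for share, y0, y1, d0, d1 in TYPES.values():
        d = d1 if z == 1 else d0     # observed treatment D = Z*D1 + (1-Z)*D0
        y = y1 if d == 1 else y0     # observed outcome Y = D*Y1 + (1-D)*Y0
        total += share * (y if what == "Y" else d)
    return total

def ive():
    """Population version of the IVE (ratio of the two group-mean differences)."""
    num = mean_given_z(1, "Y") - mean_given_z(0, "Y")
    den = mean_given_z(1, "D") - mean_given_z(0, "D")
    return num / den

def ate():
    """Population average treatment effect over all three types."""
    return sum(share * (y1 - y0) for share, y0, y1, _, _ in TYPES.values())
```

Here `ive()` equals the compliers' effect 3.0 - 2.0 = 1.0, while `ate()` is 1.25, because never-takers and always-takers have different effects that the instrument cannot reveal.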
There are several noteworthy points about this result.
LATE is ATE on compliers:
Recall that when Z and D are both binary, IVE equals Cov(Z, Y)/Cov(Z, D), the
probability limit of the traditional IV or 2SLS estimator. Thus under treatment effect
heterogeneity of unknown form, 2SLS does not identify ATE but rather the ATE on
compliers, or LATE. This is because compliers, a subgroup of the whole population, are
the only source of variation used to identify the effect. Under heterogeneous effects it
cannot be extrapolated to the whole population. The traditional IV estimator identifies
ATE only because of the constant effect assumption.
Different IVs identify different LATEs:
Different IVs define different groups of compliers, never-takers, and always-takers, so
LATE is relative to the instrument being used. This is in sharp contrast with
traditional IV estimation, in which the identified parameter does not depend on the
instruments. Typically, if several IVs are available, researchers tend to put them
together and use 2SLS to handle the overidentification problem and utilize all the
information contained in the IVs. Overidentification stems from the constant effect
assumption, under which there is only one parameter to be identified. However, in a
heterogeneous treatment effect model, everyone can have a different effect. There is no
issue of overidentification because the object of interest is nonparametric in nature;
i.e., an infinite dimensional object.
Who is a complier?
Although the proportion of the complier group is identified according to Theorem 3.1,
who is a complier is unknown, since only one of $(D_0, D_1)$ is observed. To know exactly
who is a complier, one would have to observe both $(D_0, D_1)$ by definition. LATE is
often criticized for being conditional on an unobservable subgroup; see, e.g., Heckman
(1996).
The fact that IVE can only identify the ATE on compliers makes clear that researchers
should be really careful about the source of variation used to identify the parameter
when the treatment effect is heterogeneous. As Imbens and Angrist (1994) mention, this is
analogous to a panel data model with individual fixed effects. Consider the following
example:[6] We want to study the effect of gender difference D on wage Y. Panel data
$Y_{it} = \beta D_i + \alpha_i + \varepsilon_{it}$ are available, so we can control for
individual fixed effects. However, $\beta$ is identified only if $D_i$ is time-varying.
Therefore, the source of identification comes from those who changed their gender
status. Do you think this measures the effect of gender difference, or the effect of
changing gender status? Another problem of IV estimation is that if Z is only weakly
correlated with D, then the complier group would be just a tiny fraction of the whole
population. The IV estimate is therefore not representative. It also suffers from the
problem of weak instruments in statistical inference; see, e.g., Hall (200?), and
Angrist and Krueger (2001).
3.3 Case Studies
3.3.1 Vietnam Draft Lottery
In IV estimation, the identification power comes from the exclusion restriction.
Researchers often hold different opinions about whether a variable satisfies the
exclusion restriction or not. For instance, Heckman (1996) argued that when people
receive a high draft lottery number, they may change their schooling plan, so that their
earning potential is affected as well. The Vietnam draft lottery number may therefore
not be a valid instrument.
[6] Donghoon Lee provided this interesting example.
3.3.2 Randomized Eligibility Design
3.4 Other Identified Features
Besides LATE, under Assumption 3.1 several features of $(Y_0, Y_1, D_0, D_1, Z)$ are
identified from the observed (Y, D, Z). Again the identification power mainly comes from
the monotonicity assumption. Here I summarize several identification results in Imbens
and Rubin (1997). Although individual compliers are not identified from the observed
data, some members of the always-takers and never-takers are identified under the
monotonicity assumption. For example, $(Z_i = 0, D_i = 1)$ implies
$(Z_i = 0; D_{0i} = 1, D_{1i} = 1)$, so i is an always-taker. Similarly, if
$(Z_i = 1, D_i = 0)$ then i is a never-taker. The $(Z_i = 0, D_i = 0)$ group contains
compliers and never-takers, while the $(Z_i = 1, D_i = 1)$ group contains compliers and
always-takers. Thus the monotonicity assumption induces the following structure. Almost
all identification strategies in local treatment effect stem from this table.
        D = 0                       D = 1
Z = 0   compliers + never-takers    always-takers
Z = 1   never-takers                compliers + always-takers
Lemma 3.2 Denote by $\pi_c$, $\pi_a$, and $\pi_n$ the proportions of compliers,
always-takers, and never-takers, respectively. These population proportions are
identified. Moreover, $\pi_c = E[D|Z = 1] - E[D|Z = 0]$, $\pi_a = E[D|Z = 0]$, and
$\pi_n = 1 - E[D|Z = 1]$.
proof: We already showed that $\pi_c = E[D|Z = 1] - E[D|Z = 0]$. By monotonicity,
\[ E[D|Z = 0] = P(D = 1|Z = 0) = P(D_0 = 1|Z = 0) = P(D_0 = 1, D_1 = 1|Z = 0). \]
By independence, $P(D_0 = 1, D_1 = 1|Z = 0) = P(D_0 = 1, D_1 = 1) \equiv \pi_a$. Thus
$\pi_n = 1 - \pi_c - \pi_a = 1 - E[D|Z = 1]$.
In fact the above table provides some intuition for the proof of Lemma 3.2. We can
calculate how many people have D = 1 in the group Z = 0, which is what $E[D|Z = 0]$ does.
It tells us the proportion of always-takers in the group Z = 0. By the independence
assumption, Z is independent of $(D_0, D_1)$, and hence it is also independent of the
individual's type. Therefore, the proportion of always-takers in the Z = 0 group and in
the Z = 1 group should be the same due to randomization. Knowing that every unit with
$(Z_i = 0, D_i = 1)$ is an always-taker allows us to use $P(D = 1|Z = 0)$ to identify
$\pi_a$. It can be estimated by $\frac{\sum_i D_i(1 - Z_i)}{\sum_i (1 - Z_i)}$, which has
probability limit $E\left[\frac{D(1-Z)}{P(Z=0)}\right] = P(D = 1|Z = 0)$. We will see this
again later on.
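The sample analogs of the three proportions can be written in a few lines. The sketch below uses toy data of my own and writes $\pi_c$, $\pi_a$, $\pi_n$ as `pi_c`, `pi_a`, `pi_n`; the function name is hypothetical.

```python
def type_shares(z, d):
    """Sample analogs of pi_c, pi_a, pi_n under monotonicity and independence."""
    n0 = sum(1 - zi for zi in z)                              # units with Z = 0
    n1 = sum(z)                                               # units with Z = 1
    pi_a = sum(di * (1 - zi) for zi, di in zip(z, d)) / n0    # P(D = 1 | Z = 0)
    pi_n = sum((1 - di) * zi for zi, di in zip(z, d)) / n1    # P(D = 0 | Z = 1)
    pi_c = 1.0 - pi_a - pi_n                                  # remaining share: compliers
    return pi_c, pi_a, pi_n

# Hypothetical toy data: 10 units with Z = 0 followed by 10 units with Z = 1.
z = [0] * 10 + [1] * 10
d = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

With these data, 2 of the 10 units with Z = 0 take the treatment and 3 of the 10 units with Z = 1 do not, so the estimated shares of always-takers, never-takers, and compliers are 0.2, 0.3, and 0.5.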
In section 2 we demonstrated that the conditional unconfoundedness condition permits
identification of the two marginal distributions of potential outcomes. Since the IV
model considered here is a generalized version of the unconfoundedness and overlap
conditions in section 1, similar identification results also follow. Denote by
$f_{zd}(y)$, $z, d \in \{0, 1\}$, the density functions of the observed Y in the
subsample defined by Z = z and D = d. For instance, $f_{01}(y)$ can be estimated using
the subsample with (Z = 0, D = 1). Let $(g^c_0(y), g^c_1(y))$ denote the two marginal
densities of potential outcomes for compliers. Also, let $(g^a_0(y), g^a_1(y))$ and
$(g^n_0(y), g^n_1(y))$ denote the two marginal densities of potential outcomes for
always-takers and never-takers, respectively.
Lemma 3.3 $g^a_1(y) = f_{01}(y)$ and $g^n_0(y) = f_{10}(y)$, while $g^a_0(y)$ and
$g^n_1(y)$ are unidentified. $(g^c_0(y), g^c_1(y))$ are both identified.
proof: Because we never observe always-takers without the treatment, or never-takers
with the treatment, there is no way we can learn $g^a_0(y)$ and $g^n_1(y)$. Because Z is
independent of the individual's type, and the units in the subsample with
(Z = 0, D = 1) are all always-takers, we have $g^a_1(y) = f_{01}(y)$. Analogously,
$g^n_0(y) = f_{10}(y)$. Again because Z is independent of the individual's type,
$f_{00}(y)$ is a mixture of the densities of $Y_0$ for compliers and never-takers (see
the upper-left block of the table). Similarly, $f_{11}(y)$ is a mixture of the densities
of $Y_1$ for compliers and always-takers. Therefore, we have
\begin{align*}
f_{00}(y) &= \frac{\pi_c}{\pi_c + \pi_n}\,g^c_0(y) + \frac{\pi_n}{\pi_c + \pi_n}\,g^n_0(y),
  \quad\text{and} \\
f_{11}(y) &= \frac{\pi_c}{\pi_c + \pi_a}\,g^c_1(y) + \frac{\pi_a}{\pi_c + \pi_a}\,g^a_1(y).
\end{align*}
By inverting these equations, $(g^c_0(y), g^c_1(y))$ can be expressed in terms of
directly estimable distributions:
\begin{align*}
g^c_0(y) &= \frac{\pi_c + \pi_n}{\pi_c}\,f_{00}(y) - \frac{\pi_n}{\pi_c}\,f_{10}(y),
  \quad\text{and} \\
g^c_1(y) &= \frac{\pi_c + \pi_a}{\pi_c}\,f_{11}(y) - \frac{\pi_a}{\pi_c}\,f_{01}(y).
\end{align*}
Alternatively, Abadie (2002) proposed another identification strategy to identify the
cumulative distribution functions of potential outcomes for compliers.
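Lemma 3.4 Suppose $E|g(Y)| < \infty$. Under Assumption 3.1, we have
\[ E[g(Y_1)|D_0 = 0, D_1 = 1]
   = \frac{E[g(Y)D|Z = 1] - E[g(Y)D|Z = 0]}{E[D|Z = 1] - E[D|Z = 0]}, \]
and
\[ E[g(Y_0)|D_0 = 0, D_1 = 1]
   = \frac{E[g(Y)(1 - D)|Z = 1] - E[g(Y)(1 - D)|Z = 0]}{E[(1 - D)|Z = 1] - E[(1 - D)|Z = 0]}. \]
Note that the denominator of the second expression equals
$-(E[D|Z = 1] - E[D|Z = 0])$, which matches the sign of its numerator.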
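proof: First note that
\begin{align*}
E[g(Y)D|Z = 1] &= E[g(D_1Y_1 + (1 - D_1)Y_0)D_1|Z = 1] \\
  &= E[g(D_1Y_1 + (1 - D_1)Y_0)D_1] = E[g(Y_1)|D_1 = 1]\,P(D_1 = 1) \\
  &= \big(E[g(Y_1)|D_1 = 1, D_0 = 1]\,P(D_0 = 1|D_1 = 1) \\
  &\qquad + E[g(Y_1)|D_1 = 1, D_0 = 0]\,P(D_0 = 0|D_1 = 1)\big)P(D_1 = 1) \\
  &= E[g(Y_1)|D_1 = 1, D_0 = 1]\,P(D_1 = 1, D_0 = 1) \\
  &\qquad + E[g(Y_1)|D_1 = 1, D_0 = 0]\,P(D_1 = 1, D_0 = 0).
\end{align*}
Similarly,
\begin{align*}
E[g(Y)D|Z = 0] &= E[g(D_0Y_1 + (1 - D_0)Y_0)D_0|Z = 0] \\
  &= E[g(D_0Y_1 + (1 - D_0)Y_0)D_0] = E[g(Y_1)|D_0 = 1]\,P(D_0 = 1) \\
  &= \big(E[g(Y_1)|D_1 = 1, D_0 = 1]\,P(D_1 = 1|D_0 = 1) \\
  &\qquad + E[g(Y_1)|D_1 = 0, D_0 = 1]\,P(D_1 = 0|D_0 = 1)\big)P(D_0 = 1) \\
  &= E[g(Y_1)|D_1 = 1, D_0 = 1]\,P(D_1 = 1, D_0 = 1),
\end{align*}
where the last equality uses monotonicity, $P(D_1 = 0|D_0 = 1) = 0$. Subtracting the
second expression from the first and dividing by
$P(D_1 = 1, D_0 = 0) = E[D|Z = 1] - E[D|Z = 0]$ gives the first claim. The claim for
$g(Y_0)$ follows analogously.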
Alternatively, we can prove it by exploiting Lemma 3.3:
proof:
According to $g^c_1(y) = \frac{\pi_c + \pi_a}{\pi_c}f_{11}(y) - \frac{\pi_a}{\pi_c}f_{01}(y)$,
we know
\[ E[g(Y_1)|\text{complier}] = \frac{\pi_c + \pi_a}{\pi_c}\,E[g(Y)|Z = 1, D = 1]
   - \frac{\pi_a}{\pi_c}\,E[g(Y)|Z = 0, D = 1]. \]
\begin{align*}
&(\pi_c + \pi_a)E[g(Y)|Z = 1, D = 1] - \pi_a E[g(Y)|Z = 0, D = 1] \\
&= (1 - \pi_n)E[g(Y)|Z = 1, D = 1] - \pi_a E[g(Y)|Z = 0, D = 1] \\
&= P(D = 1|Z = 1)E[g(Y)|Z = 1, D = 1] - P(D = 1|Z = 0)E[g(Y)|Z = 0, D = 1] \\
&= E[g(Y)D|Z = 1] - E[g(Y)D|Z = 0].
\end{align*}
Dividing by $\pi_c = E[D|Z = 1] - E[D|Z = 0]$, we get $E[g(Y_1)|\text{complier}]$.
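The sample analogs of the two expressions in Lemma 3.4 use only the observed (Y, D, Z). The sketch below illustrates them on simulated data; the type shares, the outcome means, and the function name are assumptions made for this illustration. For the $Y_0$ formula the denominator is the first stage for $1 - D$, that is, minus the first stage for D.

```python
import random

def complier_means(g=lambda y: y, n=60000, seed=4):
    """Sample analogs of E[g(Y1)|complier] and E[g(Y0)|complier] from (Y, D, Z) only."""
    rng = random.Random(seed)
    s_gd = [0.0, 0.0]     # sums of g(Y)*D by instrument value z
    s_g1d = [0.0, 0.0]    # sums of g(Y)*(1-D) by z
    s_d = [0.0, 0.0]      # sums of D by z
    cnt = [0, 0]          # number of units by z
    for _ in range(n):
        u = rng.random()
        if u < 0.3:
            d0, d1, base, effect = 0, 0, 1.0, 0.5     # never-taker
        elif u < 0.5:
            d0, d1, base, effect = 1, 1, 3.0, 3.0     # always-taker
        else:
            d0, d1, base, effect = 0, 1, 2.0, 1.0     # complier: E[Y0] = 2, E[Y1] = 3
        z = 1 if rng.random() < 0.5 else 0            # randomized instrument
        d = d1 if z == 1 else d0
        y0 = base + rng.gauss(0.0, 1.0)
        y = y0 + effect if d == 1 else y0
        s_gd[z] += g(y) * d
        s_g1d[z] += g(y) * (1 - d)
        s_d[z] += d
        cnt[z] += 1
    first_stage = s_d[1] / cnt[1] - s_d[0] / cnt[0]   # E[D|Z=1] - E[D|Z=0]
    m1 = (s_gd[1] / cnt[1] - s_gd[0] / cnt[0]) / first_stage
    m0 = (s_g1d[1] / cnt[1] - s_g1d[0] / cnt[0]) / (-first_stage)
    return m1, m0
```

With the default `g` the function returns estimates of the compliers' mean potential outcomes (3 and 2 in this design). Passing an indicator such as `lambda y: 1.0 if y <= 2.0 else 0.0` gives points on the compliers' two distribution functions instead.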
If $g(Y) = I_{\{Y \le y\}}$, the above lemma can be used to estimate the CDFs of potential
outcomes for compliers, as well as the two quantile functions. The identification
strategy for the CDF employed here is parallel to that of section 2. Analogously, a
Firpo (2007)-type QTE estimator can be constructed for LQTE, the QTE on compliers.
3.5 Non-binary Treatments and Instruments
See the references in section 3.2. Basically, IVE identifies some complex linear
combination of LATEs.
3.6 Nonparametric Estimation of LTE with Covariate
Sometimes it is difficult to obtain IVs that are generated by a randomized or natural
experiment. Even randomization per se does not necessarily guarantee Assumption 3.1. For
example, one may change one's schooling plan due to a high draft lottery number. A higher
education level will increase the earning potential, hence violating the exclusion
restriction. If Z is not randomly assigned, it may be confounded with D and Y.
Therefore, covariates should be included in the analysis. For example, Card (1995) uses
living close to a college as an IV to estimate the return to schooling. The residential
decision may depend on parental income, which may affect children's earning potential as
well. That is to say, Z is a valid instrument only after conditioning on some
covariates. We can extend Assumption 3.1 to allow for covariates using a modelling
technique similar to Assumptions 1.3 and 1.4. Following Abadie, Angrist, and Imbens
(2002, hereafter AAI), the following conditions are assumed in the subsequent analysis:
Assumption 3.2
1. independence: $(Y_{0i}, Y_{1i}, D_{0i}, D_{1i}) \perp Z_i \mid X$,
2. nontrivial assignment: $0 < P(Z = 1|X) < 1$,
3. first-stage: $E[D_1|X] \neq E[D_0|X]$, and
4. monotonicity: $P(D_{1i} \ge D_{0i}|X) = 1$.
We have the following theorem:
Theorem 3.2 (Conditional ATE on the Compliers)
Given Assumption 3.2,
\[ CIVE = \frac{E[Y|Z = 1, X] - E[Y|Z = 0, X]}{E[D|Z = 1, X] - E[D|Z = 0, X]}
        = E[Y_1 - Y_0|D_0 = 0, D_1 = 1, X] = E[Y_1 - Y_0|\text{complier}, X] \equiv CLATE. \]
Assumption 3.2-1 is analogous to the conditional unconfoundedness Assumption 1.3, and
Assumption 3.2-2 is analogous to the overlap Assumption 1.4. After conditioning on X, the
overt biases are removed and we can measure the impact of Z on D and of Z on Y. Note that
$E[D|Z = z, X] = E[D_z|Z = z, X] = E[D_z|X]$, z = 0, 1. Thus Assumption 3.2-3 guarantees
that the denominator of CIVE is nonzero; it states that Z shifts D, conditional on X.
Monotonicity rules out defiers and gives CIVE an easy-to-explain causal interpretation.
CLATE can be estimated using the subsample with covariate X = x. Since X is observed, we
can integrate CLATE over X to obtain LATE. However, if X is continuous, doing this will
be cumbersome. To circumvent this problem, we can employ nonparametric regression
techniques (Frolich, 2006) or specify parametric models (Abadie, 2003). Frolich (2006)
shows that:
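Lemma 3.5
\begin{align*}
\int E[Y_1 - Y_0|x, \text{complier}]\,f(x|\text{complier})\,dx
  &= E[Y_1 - Y_0|\text{complier}] \\
  &= \frac{E\big[E[Y|Z = 1, X]\big] - E\big[E[Y|Z = 0, X]\big]}
          {E\big[E[D|Z = 1, X]\big] - E\big[E[D|Z = 0, X]\big]} \\
  &= \frac{\int (E[Y|Z = 1, x] - E[Y|Z = 0, x])\,f(x)\,dx}
          {\int (E[D|Z = 1, x] - E[D|Z = 0, x])\,f(x)\,dx}.
\end{align*}
Here the outer expectations are taken over the marginal distribution of X. When Z is not
independent of X, $E[E[Y|Z = z, X]]$ generally differs from $E[Y|Z = z]$, so this ratio
is not the unadjusted ratio of group-mean differences.
proof:
\begin{align*}
&\int E[Y_1 - Y_0|x, \text{complier}]\,f(x|\text{complier})\,dx \\
&= \int \frac{E[Y|Z = 1, X = x] - E[Y|Z = 0, X = x]}{E[D|Z = 1, X = x] - E[D|Z = 0, X = x]}
        \,f(x|\text{complier})\,dx \\
&= \int \frac{E[Y|Z = 1, x] - E[Y|Z = 0, x]}{P(\text{complier}|x)}\,f(x|\text{complier})\,dx \\
&= \int \frac{E[Y|Z = 1, x] - E[Y|Z = 0, x]}{P(\text{complier}|x)} \cdot
        \frac{f(x, \text{complier})}{P(\text{complier})}\,dx \\
&= \int \frac{E[Y|Z = 1, x] - E[Y|Z = 0, x]}{P(\text{complier}|x)} \cdot
        \frac{P(\text{complier}|x)}{P(\text{complier})}\,f(x)\,dx \\
&= \frac{1}{P(\text{complier})}\int (E[Y|Z = 1, x] - E[Y|Z = 0, x])\,f(x)\,dx \\
&= \frac{\int (E[Y|Z = 1, x] - E[Y|Z = 0, x])\,f(x)\,dx}
        {\int (E[D|Z = 1, x] - E[D|Z = 0, x])\,f(x)\,dx} \\
&= \frac{E\big[E[Y|Z = 1, X]\big] - E\big[E[Y|Z = 0, X]\big]}
        {E\big[E[D|Z = 1, X]\big] - E\big[E[D|Z = 0, X]\big]} = LATE,
\end{align*}
where the third-to-last equality uses
$P(\text{complier}) = \int P(\text{complier}|x)\,f(x)\,dx
= \int (E[D|Z = 1, x] - E[D|Z = 0, x])\,f(x)\,dx$.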
LATE is essentially defined as the ratio of two ATEs, if we think of $Z$ as a treatment as well.
Therefore, if there are overt biases due to the covariate $X$, we can use propensity score
weighting, matching, or parametric or nonparametric regression to estimate the ATE of $Z$
on $D$ and of $Z$ on $Y$. According to lemma 4.2, LATE can be estimated by taking the ratio
of the two estimated ATEs. Asymptotic properties of such procedures are analyzed in
Frölich (2006).
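As an illustration of the ratio idea, the following is a minimal sketch that assumes a discrete covariate and estimates both effects of $Z$ by stratifying on $X$. The function name `late_by_ratio` and the toy data are hypothetical and are not taken from Frölich (2006).

```python
import numpy as np

def late_by_ratio(y, d, z, x):
    """Ratio of two covariate-adjusted effects of Z (on Y and on D).

    Assumes x takes a small number of discrete values and that every
    stratum contains units with z == 0 and with z == 1 (overlap).
    """
    y, d, z, x = map(np.asarray, (y, d, z, x))
    num = 0.0  # sum_x (E[Y|Z=1,x] - E[Y|Z=0,x]) * P(X=x)
    den = 0.0  # sum_x (E[D|Z=1,x] - E[D|Z=0,x]) * P(X=x)
    for value in np.unique(x):
        s = x == value
        weight = s.mean()
        num += weight * (y[s & (z == 1)].mean() - y[s & (z == 0)].mean())
        den += weight * (d[s & (z == 1)].mean() - d[s & (z == 0)].mean())
    return num / den

# Hypothetical toy data: two strata, two units per instrument arm in each.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
z = np.array([0, 0, 1, 1, 0, 0, 1, 1])
d = np.array([0, 0, 1, 0, 0, 1, 1, 1])
y = np.array([1.0, 1.0, 3.0, 1.0, 2.0, 5.0, 4.0, 5.0])
late_hat = late_by_ratio(y, d, z, x)
```

With a continuous $X$, the stratum means would have to be replaced by one of the smoothing or weighting estimators mentioned above.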
3.7 Parametric and Semiparametric Estimation of LTE with Covariate
In this section we introduce the parametric and semiparametric estimation of LATE with
covariates, developed by AAI and Abadie (2003). First consider the identification result
for the conditional mean function on compliers, $E[Y \mid X, D, \text{complier}]$:
Lemma 3.6
$$
\begin{aligned}
E[Y \mid X, D = 0, \text{complier}] &= E[Y_0 \mid X, \text{complier}], \\
E[Y \mid X, D = 1, \text{complier}] &= E[Y_1 \mid X, \text{complier}], \text{ and clearly} \\
\text{CLATE} = E[Y_1 - Y_0 \mid X, \text{complier}] &= E[Y \mid X, D = 1, \text{complier}] - E[Y \mid X, D = 0, \text{complier}].
\end{aligned}
$$
proof:
$$
\begin{aligned}
E[Y \mid X, D = 0, \text{complier}] &= E[Y_0 \mid X, D = 0, \text{complier}] \\
&= E[Y_0 \mid X, Z = 0, \text{complier}] && (D = Z \text{ for compliers}) \\
&= E[Y_0 \mid X, \text{complier}]. && (\text{by assumption 4.2-1})
\end{aligned}
$$
The case $D = 1$ is analogous.
Therefore, we can estimate CLATE and LATE by estimating $E[Y \mid X, D, \text{complier}]$. The
issue here is that the complier group is unobservable, and hence the conditional moment
function cannot be estimated directly from a subsample of compliers. Analogous to
Imbens and Rubin (1997), it is possible to transform an object that conditions on the
unobservable compliers into an object that does not condition on compliers:
Lemma 3.7 (AAI weighting)
Define
$$
\kappa = 1 - \frac{D(1 - Z)}{P(Z = 0 \mid X)} - \frac{(1 - D)Z}{P(Z = 1 \mid X)}.
$$
Suppose $E|g(Y, D, X)| < \infty$, then
$$
E[g(Y, D, X) \mid X, D_1 = 1, D_0 = 0] = \frac{1}{P(D_1 = 1, D_0 = 0 \mid X)}\, E[\kappa\, g(Y, D, X) \mid X], \text{ and}
$$
$$
E[g(Y, D, X) \mid D_1 = 1, D_0 = 0] = \frac{1}{P(D_1 = 1, D_0 = 0)}\, E[\kappa\, g(Y, D, X)].
$$
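proof:
$$
\begin{aligned}
E[g(Y, D, X) \mid X]
&= E[g(Y, D, X) \mid X, D_1 = 1, D_0 = 0]\, P(D_1 = 1, D_0 = 0 \mid X) \\
&\quad + E[g(Y, D, X) \mid X, D_1 = 1, D_0 = 1]\, P(D_1 = 1, D_0 = 1 \mid X) \\
&\quad + E[g(Y, D, X) \mid X, D_1 = 0, D_0 = 0]\, P(D_1 = 0, D_0 = 0 \mid X).
\end{aligned}
$$
Rearranging, we have
$$
\begin{aligned}
E[g(Y, D, X) \mid X, D_1 = 1, D_0 = 0] = \frac{1}{P(D_1 = 1, D_0 = 0 \mid X)} \Big\{ & E[g(Y, D, X) \mid X] \\
& - \underbrace{E[g(Y, D, X) \mid X, D_1 = 1, D_0 = 1]\, P(D_1 = 1, D_0 = 1 \mid X)}_{\mathrm{I}} \\
& - \underbrace{E[g(Y, D, X) \mid X, D_1 = 0, D_0 = 0]\, P(D_1 = 0, D_0 = 0 \mid X)}_{\mathrm{II}} \Big\}.
\end{aligned}
$$
Note that
$$
\begin{aligned}
\mathrm{I} &= E[g(Y, D, X) \mid X, D_1 = 1, D_0 = 1, Z = 0]\, P(D_1 = 1, D_0 = 1 \mid X, Z = 0) && (\text{by assumption 4.2-1}) \\
&= E[g(Y, D, X) \mid X, D = 1, Z = 0]\, P(D = 1 \mid X, Z = 0). && (\text{by assumption 4.2-4})
\end{aligned}
$$
It follows that $\mathrm{II} = E[g(Y, D, X) \mid X, D = 0, Z = 1]\, P(D = 0 \mid X, Z = 1)$. Also note that
$$
\begin{aligned}
E[D(1 - Z)\, g(Y, D, X) \mid X] &= E[1 \cdot g(Y, D, X) \mid X, D = 1, Z = 0]\, P(D = 1, Z = 0 \mid X) \\
&= E[g(Y, D, X) \mid X, D = 1, Z = 0]\, P(D = 1 \mid X, Z = 0)\, P(Z = 0 \mid X).
\end{aligned}
$$
Therefore,
$$
\mathrm{I} = E\!\left[\frac{D(1 - Z)}{P(Z = 0 \mid X)}\, g(Y, D, X) \,\Big|\, X\right], \text{ and} \quad
\mathrm{II} = E\!\left[\frac{(1 - D)Z}{P(Z = 1 \mid X)}\, g(Y, D, X) \,\Big|\, X\right].
$$
Substituting I and II into the original equation, we have
$$
E[g(Y, D, X) \mid X, D_1 = 1, D_0 = 0] = \frac{1}{P(D_1 = 1, D_0 = 0 \mid X)}\,
E\!\left[g(Y, D, X) \left(1 - \frac{D(1 - Z)}{P(Z = 0 \mid X)} - \frac{(1 - D)Z}{P(Z = 1 \mid X)}\right) \Big|\, X\right].
$$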
Recall that $P(D_1 = 1, D_0 = 0 \mid X) = E[D \mid X, Z = 1] - E[D \mid X, Z = 0]$ (see the proof
of theorem 4.1). Finally, the non-estimable left-hand side, which conditions on compliers, is
expressed as the estimable right-hand side, which uses the full sample. Since $D(1 - Z) = 1$
indicates always-takers and $(1 - D)Z = 1$ indicates never-takers, the intuition of the
weighting function $\kappa$ is that after deleting the contribution of always-takers and
never-takers from $E[Y]$, we get the contribution of compliers. Now ignore the covariate $X$ for
a moment to simplify the explanation. Taking the mean of the units in the
upper-right block of the table in section 4.4 gives the mean response of always-takers.
According to that table, the mean response of always-takers can be estimated
by
$$
\frac{\sum_{i:\, Z_i = 0,\, D_i = 1} Y_i}{\sum_i D_i (1 - Z_i)} = \frac{\sum_i D_i (1 - Z_i) Y_i}{\sum_i D_i (1 - Z_i)}.
$$
In section 4.4 we show that the proportion of always-takers is identified and can be
estimated by $\sum_i D_i (1 - Z_i) \big/ \sum_i (1 - Z_i)$. Hence the always-takers' contribution to $E[Y]$ can be estimated
by
$$
\frac{\sum_i D_i (1 - Z_i) Y_i}{\sum_i D_i (1 - Z_i)} \cdot \frac{\sum_i D_i (1 - Z_i)}{\sum_i (1 - Z_i)},
$$
which is consistent for $E\!\left[\frac{D(1 - Z)Y}{P(Z = 0)}\right]$. The identification strategy of AAI weighting is a nice
combination of Imbens and Rubin (1997) and propensity score weighting.
This lemma enables us to get rid of the conditional-on-compliers problem as long as
the statistics can be expressed in terms of moments of the observables $(Y, D, X)$. By choosing
a suitable function $g(\cdot)$, we are able to estimate LATE and LQTE.
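The weighting lemma can be checked numerically on a small artificial population. The sketch below is only an illustration: the helper names `aai_kappa` and `complier_mean` and the data are hypothetical, there is no covariate, and $P(Z = 1) = 0.5$ is treated as known.

```python
import numpy as np

def aai_kappa(d, z, p_z1):
    """kappa = 1 - D(1-Z)/P(Z=0|X) - (1-D)Z/P(Z=1|X)."""
    d, z, p_z1 = map(np.asarray, (d, z, p_z1))
    return 1.0 - d * (1 - z) / (1.0 - p_z1) - (1 - d) * z / p_z1

def complier_mean(g, kappa):
    """Sample analog of E[kappa * g] / E[kappa] = E[g | complier]."""
    return np.sum(kappa * g) / np.sum(kappa)

# Hypothetical population: two compliers, one always-taker, one never-taker,
# each observed once with Z = 0 and once with Z = 1, so Z is independent of type.
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
d = np.array([0, 0, 1, 0, 1, 1, 1, 0])
y = np.array([1.0, 2.0, 10.0, -4.0, 3.0, 6.0, 10.0, -4.0])
kappa = aai_kappa(d, z, p_z1=np.full(8, 0.5))

# Mean of Y among compliers, and the complier treated-untreated contrast (LATE).
mean_y_compliers = complier_mean(y, kappa)
late_hat = (np.sum(kappa * d * y) / np.sum(kappa * d)
            - np.sum(kappa * (1 - d) * y) / np.sum(kappa * (1 - d)))
```

In this population the compliers have $(Y_0, Y_1) = (1, 3)$ and $(2, 6)$, so both the complier mean of $Y$ and the complier treatment effect can be verified by hand. Note that $\kappa$ takes negative values for always-takers observed with $Z = 0$ and never-takers observed with $Z = 1$.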
Estimation of LATE:
Abadie (2003) postulates a parametric model for the conditional mean function on
compliers, $E[Y \mid X, D, D_1 = 1, D_0 = 0] = h(D, X; \theta_o)$. Finding the conditional mean function
corresponds to the minimization of the quadratic loss function:
$$
\theta_o = \arg\min_{\theta} E\!\left[\{Y - h(D, X; \theta)\}^2 \mid D_1 = 1, D_0 = 0\right].
$$
By the lemma of AAI weighting, the above minimization problem is equivalent to
$$
\theta_o = \arg\min_{\theta} E\!\left[\kappa\, \{Y - h(D, X; \theta)\}^2\right].
$$
If we let $h(D, X; \theta) = \alpha D + X'\beta$, then
$$
(\hat\alpha, \hat\beta) = \arg\min_{\alpha, \beta} \frac{1}{N} \sum_{i=1}^{N} \hat\kappa_i\, (Y_i - \alpha D_i - X_i'\beta)^2,
\qquad
\hat\kappa_i = 1 - \frac{D_i (1 - Z_i)}{1 - \hat P(Z = 1 \mid X_i)} - \frac{(1 - D_i) Z_i}{\hat P(Z = 1 \mid X_i)}.
$$
Clearly, the estimated LATE can be obtained by running a weighted least squares regression.
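A minimal sketch of this weighted least squares step follows. It is an illustration under simplifying assumptions, not Abadie's implementation: the function name `kappa_weighted_ls` and the data are hypothetical, the only regressors are a constant and $D$, and $P(Z = 1)$ is taken as known. Because $\hat\kappa_i$ can be negative, the sketch solves the weighted normal equations directly rather than rescaling the data by square roots of the weights.

```python
import numpy as np

def kappa_weighted_ls(y, regressors, kappa):
    """Solve the first-order conditions of sum_i kappa_i (y_i - r_i'b)^2."""
    r = np.asarray(regressors, dtype=float)
    w = np.asarray(kappa, dtype=float)
    y = np.asarray(y, dtype=float)
    lhs = r.T @ (w[:, None] * r)   # sum_i kappa_i r_i r_i'
    rhs = r.T @ (w * y)            # sum_i kappa_i r_i y_i
    return np.linalg.solve(lhs, rhs)

# Hypothetical population: two compliers, one always-taker, one never-taker,
# each observed under Z = 0 and Z = 1; P(Z = 1) = 0.5 is assumed known.
z = np.array([0, 0, 0, 0, 1, 1, 1, 1])
d = np.array([0, 0, 1, 0, 1, 1, 1, 0])
y = np.array([1.0, 2.0, 10.0, -4.0, 3.0, 6.0, 10.0, -4.0])
kappa = 1.0 - d * (1 - z) / 0.5 - (1 - d) * z / 0.5

design = np.column_stack([np.ones(len(y)), d])
intercept_hat, alpha_hat = kappa_weighted_ls(y, design, kappa)
```

Here `alpha_hat` is the coefficient on $D$, which plays the role of the estimated LATE, and `intercept_hat` is the estimated mean of $Y_0$ among compliers.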
Estimation of LQTE:
LQTE can be estimated in the same manner. AAI specify the conditional quantile
function on compliers as $Q_Y(\tau \mid X, D, D_1 = 1, D_0 = 0) = \alpha(\tau) D + X'\beta(\tau)$. Finding the
conditional quantile function corresponds to the minimization of the check function:
$$
\begin{aligned}
\big(\alpha(\tau), \beta(\tau)\big) &= \arg\min_{\alpha, \beta} E\!\left[\rho_\tau(Y - \alpha D - X'\beta) \mid D_1 = 1, D_0 = 0\right] \\
&= \arg\min_{\alpha, \beta} E\!\left[\kappa\, \rho_\tau(Y - \alpha D - X'\beta)\right] \quad (\text{by AAI weighting}).
\end{aligned}
$$
$\alpha(\tau)$ is the $\tau$-th LQTE and it can be estimated by a weighted quantile regression (see footnote 7). Note
that the parametric models are specified for ease of implementation; the identification
result is nonparametric. Summing up, let us give some intuition for why the estimation of
LATE or LQTE boils down to a weighted regression problem. AAI show that
Lemma 3.8 By assumption 4.2-1, we have $(Y_1, Y_0) \perp D \mid X, D_1 = 1, D_0 = 0$.
proof:
By assumption 4.2-1 we know $(Y_1, Y_0, D_1, D_0) \perp Z \mid X$, implying $(Y_1, Y_0) \perp Z \mid X, D_1 = 1, D_0 = 0$.
When we condition on compliers, $Z = D$ and then $(Y_1, Y_0) \perp D \mid X, D_1 = 1, D_0 = 0$.
After conditioning on compliers, those who change their behavior in response to $Z$, we have $(Y_1, Y_0) \perp D \mid X$.
Under this condition, various methods can be used to estimate treatment effect
parameters. In particular, LATE can be obtained by OLS and LQTE by quantile regression. The AAI
weighting function helps transform the conditional-on-compliers problem into an
unconditional problem. It turns out that the treatment effect parameters can be estimated by
weighted regression methods.
Footnote 7: AAI discuss the computational issues that arise when applying AAI weighting to estimate QTE. Also see Frölich
and Melly (2008).
4 Difference-in-Difference Designs
In the previous chapters, several methods to control for observed confounders were
introduced. The premise is that all the relevant covariates are available to the researcher. This
is a strong assumption, however. For example, in the case of the return to schooling,
unobserved ability may affect both earnings and the schooling choice. In this
chapter, we introduce methods that allow researchers to control for such unobserved
confounders when panel or repeated cross-sectional data are available. Panel data refers
to the case in which patient $i$'s medical outcomes and characteristics are observed at both $t = 0$ and
$t = 1$. However, it is less likely that patient $i$ is infected with the same disease at both $t = 0$
and $t = 1$. Instead, data are often available at a more aggregate level: patient-level data are
available at both $t = 0$ and $t = 1$, but the same individuals may not be measured twice.
The latter case is referred to as repeated cross-sectional data.
Perhaps the best control unit for John is John himself. When both pre-treatment and
post-treatment data are available, such a comparison is feasible. If the study period is
short, this before-after comparison is less problematic. If the study period is long, the
outcome variable at $t = 1$ is likely to be contaminated by the time trend, or by other
factors occurring between $t = 0$ and $t = 1$ that are unrelated to the policy intervention.
All the effects caused by factors unrelated to the treatment are summarized
in the time effect. Consequently, the pre-treatment versus post-treatment comparison of
the treatment group yields an estimate that is the effect of the policy
intervention plus the undesired time effect. Furthermore, the before-after comparison
of the untreated control group can be used to identify the time effect, since the difference
between pre-treatment and post-treatment outcomes is attributable to the time effect in
the absence of policy intervention. The difference-in-difference (DID) approach is
constructed based on this intuition.
For example, suppose the government launches a job training program in region 1 at time 1.
The unemployment rate is reduced by 2% compared with the unemployment rate in region 1
at time 0. However, it is possible that economic conditions improved in region 1, which
by itself leads to a lower unemployment rate.
Suppose we can find another region, region 0, that is similar to region 1 in terms of economic
conditions. The unemployment rate is reduced by 1% in region 0 during the same period.
The DID estimate then attributes the remaining 1% reduction to the program. To
be specific, the DID estimator is given by:
$$
\tau^{DID} = E[Y \mid G = 1, T = 1] - E[Y \mid G = 1, T = 0] - \big(E[Y \mid G = 0, T = 1] - E[Y \mid G = 0, T = 0]\big)
$$
where $G$ is the group indicator and $T$ is the time indicator. The DID estimator can be
interpreted as a matching estimator as well. The within-group difference is a matching
that controls for the unobserved group (individual) specific fixed effect in repeated
cross-sectional (panel) data. The between-group difference is a matching that controls for
the time effect. DID approaches have received considerable attention in applied
research. Meyer (1995), Lee (2005), and Angrist and Pischke (2009) discuss DID methods.
Applications include job training programs (Ashenfelter and Card, 1985), minimum wages
(Card and Krueger, 1994), saving behavior (Poterba, Venti, and Wise, 1995), disability
acts (Acemoglu and Angrist, 2001), and the consequences of parental loss on adolescents
(Corak, 2001), among others. See also Angrist and Krueger (1999) and Rosenzweig and
Wolpin (2001) for more examples.
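The DID formula amounts to four cell means. A minimal sketch, using the two-region unemployment example, is given below; the function name `did_estimate` is hypothetical, and the unemployment levels are made up so that only the changes (2 and 1 percentage points) match the text.

```python
import numpy as np

def did_estimate(y, g, t):
    """(treated after - treated before) - (control after - control before)."""
    y, g, t = map(np.asarray, (y, g, t))

    def cell_mean(group, period):
        return y[(g == group) & (t == period)].mean()

    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Unemployment rates in percent: region 1 falls from 10 to 8, region 0 from 10 to 9.
y = np.array([10.0, 8.0, 10.0, 9.0])
g = np.array([1, 1, 0, 0])
t = np.array([0, 1, 0, 1])
tau_did = did_estimate(y, g, t)
```

The estimate is negative because the outcome is an unemployment rate: the program is credited with the part of the decline in region 1 that is not matched in region 0.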
4.1 Linear Difference-in-Difference model
Suppose there are two groups, $G = 0$ and $1$. Group 0 (1) will be referred to as the control
(treatment) group. Only group 1 is exposed to the policy intervention at time 1. Let
$Y^N$ ($Y^I$) stand for the untreated (treated) potential outcome. The observed outcome
for individual $i$ is given by $Y_i = Y_i^N (1 - I_i) + Y_i^I I_i$, where $I_i = G_i T_i$ is the treatment
indicator. Therefore, $Y = Y^N$ when $(G, T) = (0, 0), (0, 1)$, and $(1, 0)$, while $Y = Y^I$ when
$(G, T) = (1, 1)$. Assume that we have panel data in which there is no movement in or out
of each group.
$$
y_{it}^N = \alpha + \beta t + \gamma G_i + \eta_i + \varepsilon_{it}, \text{ and}
$$
$$
y_{it}^I - y_{it}^N = \tau, \quad \forall i, t \quad (\text{constant effect}).
$$
These two assumptions together imply that the observed outcomes $y_{it}$ can be written as
$$
y_{it} = \alpha + \beta t + \gamma G_i + \eta_i + \tau G_i t + \varepsilon_{it}.
$$
In this model, $\beta t$ summarizes the common time effect across individuals. $\eta_i$ is the
unobserved fixed effect that is potentially correlated with $G_i$ (selection-on-unobservables).
Namely, $G_i$ is allowed to be endogenous. $\tau$ is the ATE of interest. The $\varepsilon_{it}$ are assumed to
be i.i.d. across time and individuals. Within-group differencing removes the unobserved
confounder $\eta_i$:
$$
\Delta y_i = \beta + \tau G_i + \Delta \varepsilon_i.
$$
The ATE, $\tau$, is identified through the moment condition:
$$
\begin{aligned}
\tau^{DID} &= E[y_{i1} - y_{i0} \mid G_i = 1] - E[y_{i1} - y_{i0} \mid G_i = 0] \\
&= E[\Delta y_i \mid G_i = 1] - E[\Delta y_i \mid G_i = 0] \\
&= \beta + \tau - \beta \\
&= \tau.
\end{aligned}
$$
Equivalently, $\tau$ can be estimated by OLS with time, group, and group $\times$ time dummies.
The previous constant-effect linear DID model can be modified to fit into the framework
of repeated cross-sectional data:
$$
Y_i = \alpha + \beta T_i + \gamma G_i + \eta_i + \tau G_i T_i + \varepsilon_i.
$$
Note that we only observe agent $i$ at one time. $\eta_i$ is correlated with $G_i$, and the
distribution of $\eta$ is time-invariant. It thus captures the unobserved group-specific fixed effect.
The ATE can be identified through
$$
\begin{aligned}
\tau^{DID} &= E[Y \mid G = 1, T = 1] - E[Y \mid G = 1, T = 0] - \big(E[Y \mid G = 0, T = 1] - E[Y \mid G = 0, T = 0]\big) \\
&= \big(\beta + E[\eta_i \mid G = 1] - E[\eta_i \mid G = 1] + \tau\big) - \big(\beta + E[\eta_i \mid G = 0] - E[\eta_i \mid G = 0]\big) \\
&= \tau.
\end{aligned}
$$
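The equivalence between the OLS coefficient on the interaction dummy and the difference of cell means can be checked on a small made-up data set. This is only a sketch; the data are hypothetical, and the regression is saturated (constant, $T$, $G$, $G \times T$).

```python
import numpy as np

# Hypothetical repeated cross-section: two observations per (G, T) cell.
g = np.array([0, 0, 0, 0, 1, 1, 1, 1])
t = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y = np.array([1.0, 3.0, 4.0, 6.0, 5.0, 7.0, 12.0, 14.0])

# OLS of Y on a constant, the time dummy, the group dummy and their product.
design = np.column_stack([np.ones_like(y), t, g, g * t])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
tau_ols = coef[3]

# The same quantity from the four cell means.
def cell_mean(group, period):
    return y[(g == group) & (t == period)].mean()

tau_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
```

Because the regression is saturated in $(G, T)$, the two numbers coincide up to floating-point error; adding covariates to the regression would break this exact equality.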
4.2 Nonparametric Difference-in-Difference Models
In this section we consider cases in which differences in observed covariates create
different time effects between the treatment and control groups.
Suppose there are two types of workers, blue collar and white collar. The percentage
of white collar workers in the treatment group is higher than that in the control group.
During the study period, information technology may improve, leading to non-parallel
wage growth between blue collar and white collar workers, since white collar workers are
more likely to benefit from technology. Therefore, controlling for differences
in covariates between the treatment and control groups is also important in DID models.
In the linear DID model, this can be done by adding covariates to the regression. In this
section we introduce more general nonparametric methods to control for covariates. The
key identification condition for the ATE is the conditional same time effect assumption in
the absence of treatment:
Assumption 4.1
$$
E[Y_{11}^N - Y_{10}^N \mid X] = E[Y_{01}^N - Y_{00}^N \mid X]
$$
where $Y_{gt}$ is shorthand for $Y \mid G = g, T = t$.
Theorem 4.1 Under the same time effect assumption, the DID estimator $\tau^{DID}$
identifies the ATE on the treated at time 1, $E[Y_{11}^I - Y_{11}^N]$.
Pf 4.1
$$
\begin{aligned}
\tau^{DID}(X) &= E[Y_{11} - Y_{10} \mid X] - E[Y_{01} - Y_{00} \mid X] && (\text{in terms of observed outcomes}) \\
&= E[Y_{11}^I - Y_{10}^N \mid X] - E[Y_{01}^N - Y_{00}^N \mid X] && (\text{in terms of potential outcomes}) \\
&= E[Y_{11}^I - Y_{10}^N \mid X] - E[Y_{11}^N - Y_{10}^N \mid X] \\
&\quad + E[Y_{11}^N - Y_{10}^N \mid X] - E[Y_{01}^N - Y_{00}^N \mid X] \\
&= E[Y_{11}^I - Y_{10}^N - Y_{11}^N + Y_{10}^N \mid X] && (\text{by the same time effect in the absence of treatment}) \\
&= E[Y_{11}^I - Y_{11}^N \mid X].
\end{aligned}
$$
Averaging $\tau^{DID}(X)$ over the distribution of $X$ in the treated group yields
$$
E[Y_{11}^I - Y_{11}^N] \equiv E[Y_{t=1}^I - Y_{t=1}^N \mid G = 1].
$$
$\tau^{DID}(X)$ can be written as
$$
E[Y_{t=1} - Y_{t=0} \mid G = 1, X] - E[Y_{t=1} - Y_{t=0} \mid G = 0, X].
$$
If panel data are available, we first take the first difference of the outcome for each individual
to generate a new dependent variable $\Delta Y_i = Y_{i,t=1} - Y_{i,t=0}$. Then $\tau^{DID}(X)$ can be rewritten
as $E[\Delta Y \mid G = 1, X] - E[\Delta Y \mid G = 0, X]$, which has exactly the same form as the conditional
treatment effect $\tau(X)$ in section 1. $\tau^{DID}(X)$ can be estimated by differencing the two estimated conditional mean
functions. In particular, matching estimators can also be employed. The intuition is that
the within-group difference removes the unobservable confounders, and then matching is
employed to control for the non-parallel outcome dynamics caused by different covariates.
If we only have repeated cross-sectional data, generating $\Delta Y_i$ is infeasible. Instead,
we can estimate four conditional mean functions to estimate $\tau^{DID}(X)$:
$$
\begin{aligned}
\tau^{DID}(X) &= E[Y \mid G = 1, T = 1, X] - E[Y \mid G = 1, T = 0, X] \\
&\quad - E[Y \mid G = 0, T = 1, X] + E[Y \mid G = 0, T = 0, X].
\end{aligned}
$$
Note that the same time effect assumption does not exclude the possibility of
selection-on-unobservables. There may exist systematic differences between the treated and control
units such that $E[Y_{10}^N] \neq E[Y_{00}^N]$. Such an endogeneity problem is resolved using the time
dimension, and the role of the control group is to remove the time effect.
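For a discrete covariate, the four conditional means can be estimated by cell averages. The sketch below illustrates this under that assumption; the function names `conditional_did` and `att_from_strata` and the data are hypothetical. The stratum-level estimates are averaged with weights equal to the share of treated units in each stratum.

```python
import numpy as np

def conditional_did(y, g, t, x):
    """Return {x value: four-cell DID estimate within that stratum}."""
    y, g, t, x = map(np.asarray, (y, g, t, x))
    tau_x = {}
    for value in np.unique(x):
        def m(group, period):
            return y[(x == value) & (g == group) & (t == period)].mean()
        tau_x[value] = m(1, 1) - m(1, 0) - m(0, 1) + m(0, 0)
    return tau_x

def att_from_strata(tau_x, g, x):
    """Average tau(x) over the distribution of X among the treated (G = 1)."""
    g, x = np.asarray(g), np.asarray(x)
    treated_x = x[g == 1]
    return sum(tau * np.mean(treated_x == value) for value, tau in tau_x.items())

# Hypothetical repeated cross-section with a binary covariate.
y = np.array([1.0, 2.0, 3.0, 6.0, 1.0, 4.0, 2.0, 2.0, 9.0, 9.0])
g = np.array([0, 0, 1, 1, 0, 0, 1, 1, 1, 1])
t = np.array([0, 1, 0, 1, 0, 1, 0, 0, 1, 1])
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

tau_x = conditional_did(y, g, t, x)
att_hat = att_from_strata(tau_x, g, x)
```

In this example the time effect differs across strata (1 for $x = 0$ and 3 for $x = 1$), which is exactly the situation in which the unconditional DID comparison would mix covariate composition with the treatment effect.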
Besides matching, Abadie (2005) shows that the propensity score weighting approach
can also be extended to the DID setting. If we view the change in the untreated response,
$\Delta Y^N = Y_{t=1}^N - Y_{t=0}^N$, as $Y_0$, the same time effect condition, $E[Y_{t=1}^N - Y_{t=0}^N \mid G = 1, X] =
E[Y_{t=1}^N - Y_{t=0}^N \mid G = 0, X]$, is analogous to the mean independence condition for $Y_0$, $E[Y_0 \mid D =
1, X] = E[Y_0 \mid D = 0, X]$, which is used to identify the ATE on the treated with
cross-sectional data in chapter 1. Also recall that $\tau^{DID}$ identifies the ATE on the treated,
$E[Y_{t=1}^I - Y_{t=1}^N \mid G = 1]$. Define $\Delta Y = Y_{t=1} - Y_{t=0}$, $\Delta Y_1 = Y_{t=1}^I - Y_{t=0}^N$, and $\Delta Y_0 =
Y_{t=1}^N - Y_{t=0}^N$. Then $E[Y_{t=1}^I - Y_{t=1}^N \mid G = 1]$ can be expressed as $E[\Delta Y_1 - \Delta Y_0 \mid G = 1]$.
Combining the above three facts, the parameter of interest and the associated identification
condition are exactly the same as in the cross-sectional case. Therefore, the same weighting
estimator can be applied here. As shown in chapter 1, under $E[Y_0 \mid D = 1, X] = E[Y_0 \mid D = 0, X]$, the propensity
score weighting estimator is given by
$$
E[Y_1 - Y_0 \mid D = 1] = \frac{1}{P(D = 1)}\, E\!\left[\frac{D - e(X)}{1 - e(X)}\, Y\right].
$$
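Pf 4.2
$$
\begin{aligned}
E[Y_1 - Y_0 \mid D = 1] &= E[Y_1 \mid D = 1] - E[Y_0 \mid D = 1] && (1) \\
E[Y_1 \mid D = 1] &= E[Y \mid D = 1] && (2) \\
\text{Recall that } E[Y_0] &= E\!\left[\frac{(1 - D)Y}{1 - e(X)}\right] && (3) \\
E[Y_0] &= E[Y_0 \mid D = 1] P(D = 1) + E[Y_0 \mid D = 0] P(D = 0) && (4)
\end{aligned}
$$
(3) and (4) give an expression for the counterfactual $E[Y_0 \mid D = 1]$ in terms of observables:
$$
\begin{aligned}
E[Y_0 \mid D = 1] &= \frac{1}{P(D = 1)} \big[E[Y_0] - E[Y_0 \mid D = 0] P(D = 0)\big] \\
&= \frac{1}{P(D = 1)} \left\{E\!\left[\frac{(1 - D)Y}{1 - e(X)}\right] - E[(1 - D)Y]\right\} && (5)
\end{aligned}
$$
(1), (2), and (5) imply that
$$
\begin{aligned}
E[Y_1 - Y_0 \mid D = 1] &= E[Y \mid D = 1] - \frac{1}{P(D = 1)} \left\{E\!\left[\frac{(1 - D)Y}{1 - e(X)}\right] - E[(1 - D)Y]\right\} \\
&= \frac{E[DY]}{P(D = 1)} - \frac{1}{P(D = 1)} \left\{E\!\left[\frac{(1 - D)Y}{1 - e(X)}\right] - E[(1 - D)Y]\right\} \\
&= \frac{1}{P(D = 1)}\, E\!\left[DY - \frac{(1 - D)Y}{1 - e(X)} + (1 - D)Y\right] \\
&= \frac{1}{P(D = 1)}\, E\!\left[\frac{D - e(X)}{1 - e(X)}\, Y\right].
\end{aligned}
$$
Analogously, defining $e(X) = P(G = 1 \mid X)$, we have
$$
E[\Delta Y_1 - \Delta Y_0 \mid G = 1] = \frac{1}{P(G = 1)}\, E\!\left[\frac{G - e(X)}{1 - e(X)}\, \Delta Y\right].
$$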
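A minimal sketch of the sample analog of this weighting formula with panel data follows. The function name `att_ps_weighting` and the data are hypothetical, and the propensity score is simply estimated by the treated share within each stratum of a binary covariate; Abadie (2005) works with a more general first-stage estimator.

```python
import numpy as np

def att_ps_weighting(delta_y, g, e):
    """Sample analog of E[(G - e(X)) / (1 - e(X)) * dY] / P(G = 1)."""
    delta_y, g, e = map(np.asarray, (delta_y, g, e))
    return np.mean((g - e) / (1.0 - e) * delta_y) / np.mean(g)

# Hypothetical panel: first-differenced outcomes for six individuals.
x = np.array([0, 0, 0, 1, 1, 1])
g = np.array([1, 0, 0, 1, 1, 0])
delta_y = np.array([3.0, 1.0, 1.0, 7.0, 7.0, 3.0])

# Propensity score: share of treated units within each stratum of x.
e = np.array([g[x == v].mean() for v in x])
att_hat = att_ps_weighting(delta_y, g, e)
```

With a cell-frequency propensity score, the weighted estimate coincides with the stratified comparison of $\Delta Y$ between treated and control units, averaged over the treated distribution of $X$ (here $\tfrac{1}{3} \cdot 2 + \tfrac{2}{3} \cdot 4$).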
The above estimator implicitly assumes that panel data are available, because the estimation
is based on $\Delta Y_i = Y_{i,t=1} - Y_{i,t=0}$. Abadie (2005) demonstrates that it can be modified to
fit repeated cross-sectional data, as the following theorem shows.
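Theorem 4.2
$$
\begin{aligned}
&E\!\left[\frac{T - P(T = 1)}{P(T = 1)(1 - P(T = 1))} \cdot \frac{G - e(X)}{e(X)(1 - e(X))}\, Y \,\Big|\, X\right] \\
&= E[Y \mid G = 1, T = 1, X] - E[Y \mid G = 1, T = 0, X] - E[Y \mid G = 0, T = 1, X] + E[Y \mid G = 0, T = 0, X].
\end{aligned}
$$
Pf 4.3 The proposed estimator equals
$$
\begin{aligned}
&E\!\left[\frac{1 - P(T = 1)}{P(T = 1)(1 - P(T = 1))} \cdot \frac{G - e(X)}{e(X)(1 - e(X))}\, Y \,\Big|\, X, T = 1\right] P(T = 1) \\
&\quad + E\!\left[\frac{0 - P(T = 1)}{P(T = 1)(1 - P(T = 1))} \cdot \frac{G - e(X)}{e(X)(1 - e(X))}\, Y \,\Big|\, X, T = 0\right] P(T = 0) \\
&= E\!\left[\frac{G - e(X)}{e(X)(1 - e(X))}\, Y_{t=1} \,\Big|\, X, T = 1\right] - E\!\left[\frac{G - e(X)}{e(X)(1 - e(X))}\, Y_{t=0} \,\Big|\, X, T = 0\right]. \qquad (1)
\end{aligned}
$$
$$
\begin{aligned}
E\!\left[\frac{G - e(X)}{e(X)(1 - e(X))}\, Y_{t=1} \,\Big|\, X, T = 1\right]
&= E\!\left[\frac{1 - e(X)}{e(X)(1 - e(X))}\, Y_{t=1} \,\Big|\, X, G = 1\right] P(G = 1 \mid X) \\
&\quad + E\!\left[\frac{0 - e(X)}{e(X)(1 - e(X))}\, Y_{t=1} \,\Big|\, X, G = 0\right] P(G = 0 \mid X) \\
&= E[Y_{t=1} \mid X, G = 1] - E[Y_{t=1} \mid X, G = 0]. \qquad (2)
\end{aligned}
$$
Similarly,
$$
E\!\left[\frac{G - e(X)}{e(X)(1 - e(X))}\, Y_{t=0} \,\Big|\, X, T = 0\right] = E[Y_{t=0} \mid X, G = 1] - E[Y_{t=0} \mid X, G = 0]. \qquad (3)
$$
By (1), (2), and (3), we have shown that the proposed estimator equals
$$
\begin{aligned}
&E[Y \mid X, G = 1, T = 1] - E[Y \mid X, G = 1, T = 0] - E[Y \mid X, G = 0, T = 1] + E[Y \mid X, G = 0, T = 0] \\
&= E[Y_{t=1}^I - Y_{t=1}^N \mid G = 1, X].
\end{aligned}
$$
The ATE on the treated is given by
$$
\begin{aligned}
\int E[Y_{t=1}^I - Y_{t=1}^N \mid G = 1, X]\, dP(X \mid G = 1)
&= \int E[Y_{t=1}^I - Y_{t=1}^N \mid G = 1, X]\, \frac{P(G = 1 \mid X)}{P(G = 1)}\, dP(X) \\
&= E\!\left[\frac{T - P(T = 1)}{P(T = 1)(1 - P(T = 1))} \cdot \frac{G - e(X)}{1 - e(X)} \cdot \frac{1}{P(G = 1)}\, Y\right].
\end{aligned}
$$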
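The last expression has a direct sample analog. The sketch below is an illustration only: the function name `att_abadie_rcs` and the data are hypothetical, the covariate is binary, the propensity score is a cell frequency, and both periods are constructed to have the same $(G, X)$ composition so that the result can be verified by hand.

```python
import numpy as np

def att_abadie_rcs(y, g, t, e):
    """Sample analog of
    E[(T - lam) / (lam (1 - lam)) * (G - e(X)) / (1 - e(X)) * Y] / P(G = 1),
    with lam = P(T = 1)."""
    y, g, t, e = map(np.asarray, (y, g, t, e))
    lam = t.mean()
    time_weight = (t - lam) / (lam * (1.0 - lam))
    group_weight = (g - e) / (1.0 - e) / g.mean()
    return np.mean(time_weight * group_weight * y)

# Hypothetical repeated cross-section: six different units in each period.
x = np.tile([0, 0, 0, 1, 1, 1], 2)
g = np.tile([1, 0, 0, 1, 1, 0], 2)
t = np.repeat([0, 1], 6)
y = np.array([3.0, 1.0, 1.0, 2.0, 2.0, 1.0,    # period 0
              6.0, 2.0, 2.0, 9.0, 9.0, 4.0])   # period 1

# Propensity score: share of treated units within each stratum of x.
e = np.array([g[x == v].mean() for v in x])
att_hat = att_abadie_rcs(y, g, t, e)
```

In this example the stratum-level DID estimates are 2 for $x = 0$ and 4 for $x = 1$, and two thirds of the treated units have $x = 1$, so the weighted estimate reproduces $\tfrac{1}{3} \cdot 2 + \tfrac{2}{3} \cdot 4$.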
4.3 Nonlinear Difference-in-Difference
The validity of the DID estimator hinges on the additive separability property. When the
outcome variable is discrete or rate data, the linear specification is problematic and
researchers usually resort to logit, probit, or semiparametric single-index models. However,
the differencing procedure fails to identify the ATE under a nonlinear model. Consider
$$
Y = 1_{[\alpha + \beta T + \gamma G + \tau TG + \varepsilon \geq 0]}, \quad \varepsilon \sim N(0, 1)
$$
$$
\begin{aligned}
\tau^{DID} &= E[Y \mid T = 1, G = 1] - E[Y \mid T = 0, G = 1] - E[Y \mid T = 1, G = 0] + E[Y \mid T = 0, G = 0] \\
&= \Phi(\alpha + \beta + \gamma + \tau) - \Phi(\alpha + \gamma) - \Phi(\alpha + \beta) + \Phi(\alpha) \\
&\neq \tau.
\end{aligned}
$$
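A quick numeric check of the probit expression above: even when the treatment has no effect ($\tau = 0$), the difference-in-difference of the cell probabilities is not zero. The parameter values below are arbitrary illustrations, and `probit_did` is a hypothetical helper name.

```python
from math import erf, sqrt

def std_normal_cdf(v):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def probit_did(alpha, beta, gamma, tau):
    """DID of the four cell probabilities P(Y = 1 | G, T) in the probit model."""
    def p(t, g):
        return std_normal_cdf(alpha + beta * t + gamma * g + tau * t * g)
    return p(1, 1) - p(0, 1) - p(1, 0) + p(0, 0)

# No treatment effect in the index, yet the DID of probabilities is negative.
did_no_effect = probit_did(alpha=0.0, beta=1.0, gamma=1.0, tau=0.0)
```

The nonzero value arises purely from the curvature of $\Phi$: the same index shift $\beta$ moves the probability by less in the group that starts from a higher index.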
Moreover, the standard linear DID model implicitly imposes the constant time effect assumption:
everyone experiences the same time effect in the absence of the treatment. Athey and
Imbens demonstrate that when this assumption is violated, DID also fails to identify the ATE.
Suppose $Y$ is the wage and $\eta$ is the worker's ability. Consider a model allowing for a time trend
in both the level of wages and the returns to ability:
$$
Y = \alpha + \beta T + \gamma G + \tau TG + \eta (1 + rT)
$$
$$
\begin{aligned}
\tau^{DID} &= \big[(\alpha + \beta + \gamma + \tau + (1 + r) E[\eta \mid G = 1]) - (\alpha + \gamma + E[\eta \mid G = 1])\big] \\
&\quad - \big[(\alpha + \beta + (1 + r) E[\eta \mid G = 0]) - (\alpha + E[\eta \mid G = 0])\big] \\
&= \tau + r \big(E[\eta \mid G = 1] - E[\eta \mid G = 0]\big).
\end{aligned}
$$
In this case, $\tau^{DID} = \tau$ only when $E[\eta \mid G = 1] = E[\eta \mid G = 0]$. However, this condition
excludes the case of selection-on-unobservables. It is unappealing because part of the
motivation for considering panel data is that it allows for selection-on-unobservables.
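The bias term $r(E[\eta \mid G = 1] - E[\eta \mid G = 0])$ can be reproduced from the four cell means of the wage model above. The function name and all parameter values in this sketch are hypothetical.

```python
def did_with_ability_trend(tau, r, mean_eta_treated, mean_eta_control,
                           alpha=0.0, beta=0.0, gamma=0.0):
    """DID of cell means when Y = alpha + beta*T + gamma*G + tau*T*G + eta*(1 + r*T)."""
    def cell_mean(g, t, mean_eta):
        return alpha + beta * t + gamma * g + tau * t * g + mean_eta * (1 + r * t)

    treated_change = (cell_mean(1, 1, mean_eta_treated)
                      - cell_mean(1, 0, mean_eta_treated))
    control_change = (cell_mean(0, 1, mean_eta_control)
                      - cell_mean(0, 0, mean_eta_control))
    return treated_change - control_change

# True effect tau = 1; ability is higher in the treated group and its return grows.
biased = did_with_ability_trend(tau=1.0, r=0.5,
                                mean_eta_treated=2.0, mean_eta_control=0.0)
# With equal mean ability in both groups the bias term vanishes.
unbiased = did_with_ability_trend(tau=1.0, r=0.5,
                                  mean_eta_treated=2.0, mean_eta_control=2.0)
```

In the first call DID returns $1 + 0.5 \cdot (2 - 0)$ rather than the true effect of 1; in the second call it returns the true effect.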
Finally, linear DID models only admit a location shift. When the primary concern is
inequality, a model that allows for more general forms of distributional
effects is more suitable for answering interesting policy questions. For example, a tax reform
is expected to have a positive effect on the lower quantiles of the earnings distribution and a
negative effect on the upper quantiles of the earnings distribution.
Estimators of QTE in the context of panel or repeated cross-sectional data thus
deliver more informative estimates than linear DID estimators.
Athey and Imbens (2006) propose two models, change-in-change and quantile
difference-in-difference, to impute the entire counterfactual distribution; therefore ATE and QTE
can be derived from these methods.
Both change-in-change (CIC) and quantile difference-in-difference (QDID) relax the
functional form assumptions in standard DID. Furthermore, agents with different
unobserved characteristics are allowed to have different time effects under both models.
4.4 The Change-in-Change Model
Define $Y^I$ and $Y^N$ as the potential outcomes with and without the treatment, respectively.
The observed outcome is $Y = Y^N (1 - I) + I\, Y^I$, where $I = GT$ is the treatment indicator.
We use the following shorthand for conditioning:
$$
Y_{gt}^N \equiv Y^N \mid G = g, T = t, \quad Y_{gt}^I \equiv Y^I \mid G = g, T = t, \quad
Y_{gt} \equiv Y \mid G = g, T = t, \quad U_g \equiv U \mid G = g.
$$
$U$ represents the unobserved individual characteristics. The corresponding CDFs are
denoted by $F_{Y^N, gt}$, $F_{Y^I, gt}$, $F_{Y, gt}$, and $F_{U, g}$. Since the treatment is in effect only for $(G, T) = (1, 1)$,
$F_{Y, gt} = F_{Y^N, gt}$ for $(G, T) = (1, 0), (0, 0), (0, 1)$, and $F_{Y, 11} = F_{Y^I, 11}$. We want to identify
the counterfactual distribution $F_{Y^N, 11}$ from the observed $F_{Y^N, 00}$, $F_{Y^N, 01}$, and $F_{Y^N, 10}$. Athey
and Imbens consider the CIC model:
Assumption 4.2 CIC-1 (Structural Model)
$Y^N = h(U, T)$
Assumption 4.3 CIC-2 (Strict Monotonicity)
$h(u, t)$ is strictly increasing in $u$ for $t = 0$ and $1$
Assumption 4.4 CIC-3 (Time invariance)
$U \perp T \mid G$
Assumption 4.5 CIC-4 (Support)
$\operatorname{supp} U_1 \subseteq \operatorname{supp} U_0$
Assumptions CIC-1 and CIC-2 relax the constant time effect assumption in standard DID
models. An individual with realization $U = u$ will experience the time effect $h(u, 1) -
h(u, 0)$ in the absence of the treatment. Since we do not impose any restriction on the
functional form of $h$, agents can have different time effects whenever their individual
characteristics differ. The model thus nests the constant effect model as a special case.
Assumption CIC-3 asserts that within each group, the distribution of $U$ is stable
across time. In particular, the individual fixed effect model satisfies this assumption.
However, within the same time period, the distribution of $U$ can vary across groups
($U_1 \overset{d}{\neq} U_0$), and the CIC model thus allows for selection-on-unobservables. For example,
the treatment group may contain more high-ability workers than the control group, but
the distribution of ability is time-invariant in each group.
Notice that CIC-3 is essential for identifying the time effect in this model. If the distribution of $U$ were
also time-varying, it would be difficult to isolate the time effect from the effect of the
change in the distribution of $U$ by looking at the outcome variable. Simply
put, the control group at $t = 0$ is comparable with the control group at $t = 1$ when the distribution of $U_0$
is time-invariant. The treatment effect of time is identified only if they are comparable.
Many economic models map directly into the CIC model. For example, $Y$ is the wage, $U$
is the worker's ability, and the wage is an increasing function of ability. Suppose $Y$ is working
hours and $U$ is the preference for working hours. The chosen working hours are high if the
preference for leisure is low, conditional on the wage and nonlabor income. Suppose $Y$ is
saving and $U$ is the degree of risk aversion. The level of precautionary saving is
higher when the degree of risk aversion is high.
Following Matzkin (2003), it is impossible to separately identify the structural shock
$U$ and the structural function $h$. In particular, under CIC-2 we can treat $U$ as uniform on $[0, 1]$,
and $h$ as the quantile function of $Y^N$, since
$$
Y^N = h(U, T) = h(F^{-1}(F(U)), T) \equiv \tilde h(\tilde U, T)
$$
where $F$ is the distribution function of $U$ and $\tilde U = F(U)$.
In fact, the identification results rely heavily on the use of the quantile function. To
ease the exposition of the main idea, we only consider the case in which $Y$ is
continuous. For the case of discrete $Y$, see Athey and Imbens.
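Theorem (Identification of the CIC model)
Under CIC1-4, $F_{Y^N, 11}(y) = F_{Y, 10}\big(F_{Y, 00}^{-1}(F_{Y, 01}(y))\big)$.
Pf 4.4
$$
\begin{aligned}
F_{Y^N, gt}(y) &= P(h(U, T) \leq y \mid G = g, T = t) \\
&= P(h(U, t) \leq y \mid G = g) && (\text{by CIC-3}) \\
&= P(U \leq h^{-1}(y; t) \mid G = g) && (\text{by CIC-2}) \\
&= F_{U, g}(h^{-1}(y; t)) && (1)
\end{aligned}
$$
Using this identity with $(g, t) = (0, 0)$ and substituting in $y = h(u, 0)$, we have
$$
F_{Y, 00}(h(u, 0)) = F_{U, 0}(h^{-1}(h(u, 0); 0)) = F_{U, 0}(u).
$$
Applying $F_{Y, 00}^{-1}$ to both sides, we have
$$
h(u, 0) = F_{Y, 00}^{-1}(F_{U, 0}(u)), \quad \forall u \in \operatorname{supp} U_0 \qquad (2)
$$
Applying (1) to $(g, t) = (0, 1)$, we have
$$
\begin{aligned}
F_{Y, 01}(y) &= F_{U, 0}(h^{-1}(y; 1)), \quad h^{-1}(y; 1) \in \operatorname{supp} U_0 \\
\Rightarrow\ h^{-1}(y; 1) &= F_{U, 0}^{-1}(F_{Y, 01}(y)), \quad \forall y \in \operatorname{supp} Y_{01} \qquad (3)
\end{aligned}
$$
Combining (2) and (3) gives
$$
h(h^{-1}(y; 1), 0) = F_{Y, 00}^{-1}(F_{Y, 01}(y)) \qquad (4)
$$
Applying (1) with $(g, t) = (1, 0)$ and $y = h(u, 0)$, we have
$$
F_{U, 1}(u) = F_{Y, 10}(h(u, 0)) \qquad (5)
$$
Then
$$
\begin{aligned}
F_{Y^N, 11}(y) &= F_{U, 1}(h^{-1}(y; 1)) \\
&= F_{Y, 10}(h(h^{-1}(y; 1), 0)) && (\text{by (5)}) \\
&= F_{Y, 10} \circ F_{Y, 00}^{-1} \circ F_{Y, 01}(y) && (\text{by (4)})
\end{aligned}
$$
By CIC-4, $\operatorname{supp} U_1 \subseteq \operatorname{supp} U_0$, which implies that $\operatorname{supp} Y_{11}^N \subseteq \operatorname{supp} Y_{01}$ and enables us
to impute the entire counterfactual distribution $F_{Y^N, 11}$ from $F_{Y, 10}$, $F_{Y, 00}$, and $F_{Y, 01}$ for all
$y \in \operatorname{supp} Y_{11}^N$.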
Corollary 4.1 The quantile function of $Y_{11}^N$ is given by
$$
F_{Y^N, 11}^{-1}(q) = F_{Y, 01}^{-1} \circ F_{Y, 00} \circ F_{Y, 10}^{-1}(q)
$$
and the QTE is given by
$$
\tau_q^{CIC} = F_{Y^I, 11}^{-1}(q) - F_{Y^N, 11}^{-1}(q).
$$
The CIC identification can be interpreted as defining a transformation $K^{CIC}(y) = F_{Y, 01}^{-1} \circ
F_{Y, 00}(y)$. Such a transformation suggests the following ATE estimator:
$$
\tau^{CIC} = E[Y_{11}^I] - E[Y_{11}^N] = E[Y_{11}^I] - E[K^{CIC}(Y_{10})] = E[Y_{11}] - E[F_{Y, 01}^{-1} \circ F_{Y, 00}(Y_{10})].
$$
Under the CIC model, within the same time period, the same realization of $Y$
corresponds to a specific realization of $u$ regardless of the group. Once we know $u$, the time
effect for $u$, $h(u, 1) - h(u, 0)$, can be backed out by comparing the quantile functions of
$Y_{01}$ and $Y_{00}$.
Given $Y_{10}$, there exists $u$ such that $h(u, 0) = Y_{10}$. The quantile of $u$ in $U_1$ is denoted
$q_1$. Since $U_1 \overset{d}{\neq} U_0$, the quantile of $u$ in $U_0$, $q_0$, is different from $q_1$ in general.
The first transformation, $F_{Y, 00}(Y_{10})$, thus gives us $q_0$. The time effect, $h(u, 1) - h(u, 0)$,
is equal to the quantile effect of time at $q_0$, since $h(u, 1) - h(u, 0) = \tilde h(q_0, 1) - \tilde h(q_0, 0)$,
because there is a one-to-one correspondence between $q_0$ and $u$. Given
$u$, the counterfactual $Y_{11}^N$ equals $h(u, 1) = \tilde h(q_0, 1) = F_{Y, 01}^{-1}(q_0) = F_{Y, 01}^{-1}(F_{Y, 00}(Y_{10})) =
F_{Y, 01}^{-1} \circ F_{Y, 00} \circ F_{Y, 10}^{-1}(q_1)$. Equivalently, adding the time effect to $Y_{10}$ gives
$$
\begin{aligned}
Y_{11}^N &= Y_{10} + \tilde h(q_0, 1) - \tilde h(q_0, 0) \\
&= Y_{10} + F_{Y, 01}^{-1}(q_0) - F_{Y, 00}^{-1}(q_0) \\
&= Y_{10} + F_{Y, 01}^{-1}(F_{Y, 00}(Y_{10})) - F_{Y, 00}^{-1}(F_{Y, 00}(Y_{10})) \\
&= F_{Y, 01}^{-1} \circ F_{Y, 00}(Y_{10}).
\end{aligned}
$$
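The transformation $F_{Y,01}^{-1} \circ F_{Y,00}$ can be sketched with empirical distribution functions. This is only an illustration of the plug-in idea, not the estimator as implemented by Athey and Imbens: the function name `cic_counterfactual` and the data are hypothetical, the empirical quantile is taken as the smallest sample value whose empirical CDF reaches the target level, and values of $Y_{10}$ outside the range of $Y_{00}$ are simply clipped.

```python
import numpy as np

def cic_counterfactual(y10, y00, y01):
    """Map each Y_10 through K(y) = F01^{-1}(F00(y)) using empirical CDFs."""
    y00, y01 = np.sort(y00), np.sort(y01)
    n00, n01 = len(y00), len(y01)
    # Number of control period-0 outcomes <= y, i.e. n00 * F00(y).
    counts = np.searchsorted(y00, y10, side="right")
    # Smallest order statistic of Y_01 whose empirical CDF is >= counts / n00.
    idx = -(-counts * n01 // n00) - 1   # ceil(counts * n01 / n00) - 1
    idx = np.clip(idx, 0, n01 - 1)      # crude handling of support violations
    return y01[idx]

# Hypothetical data in which time doubles the untreated outcome: h(u, 1) = 2 h(u, 0).
y00 = np.array([1.0, 2.0, 3.0, 4.0])
y01 = np.array([2.0, 4.0, 6.0, 8.0])
y10 = np.array([2.0, 3.0, 4.0])
y11 = np.array([5.0, 7.0, 9.0])

y11_counterfactual = cic_counterfactual(y10, y00, y01)
tau_cic = y11.mean() - y11_counterfactual.mean()
```

In this example the standard DID estimate would be $(7 - 3) - (5 - 2.5) = 1.5$, whereas the CIC counterfactual accounts for the fact that the treated group sits higher in the outcome distribution and therefore experiences a larger time effect.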
4.5 Quantile Difference-in-Difference
QDID is a generalization of DID that applies the DID method to each quantile instead of
the mean; it dates back to Meyer, Viscusi, and Durbin (1995) and Poterba, Venti,
and Wise (1995). It thus permits richer forms of QTE than the DID method. In DID, the
key identification assumption is that the average effect of time is the same for the treatment
and control groups in the absence of policy interventions. Analogously, applying DID
to each quantile yields:
$$
F_{Y^N, 11}^{-1}(q) - F_{Y, 10}^{-1}(q) = F_{Y, 01}^{-1}(q) - F_{Y, 00}^{-1}(q).
$$
That is, the QTE of time is the same in both groups for all $q \in [0, 1]$. The identification condition of
QDID implies that the counterfactual $F_{Y^N, 11}^{-1}(q)$ can be imputed from
$$
F_{Y, 10}^{-1}(q) + \big[F_{Y, 01}^{-1}(q) - F_{Y, 00}^{-1}(q)\big].
$$
Athey and Imbens (2002) supply a set of primitive assumptions to justify the QDID
identification condition.
Assumption 4.6 QDID-1 (Structural Model)
$Y^N = h(U, G, T)$, where $h(u, g, t)$ is additively separable in $g$ and $t$; i.e., $h(U, G, T) =
h^G(U, G) + h^T(U, T)$
Assumption 4.7 QDID-2 (Monotonicity)
Given $g$ and $t$, $h(\cdot, g, t)$ is strictly increasing in $u$
Assumption 4.8 QDID-3
$U \perp (G, T)$
Athey and Imbens (2002) show that
Theorem 4.3 (Identification of the QDID Model)
Suppose $Y$ is a continuous random variable and QDID1-3 hold, .... The counterfactual
distribution of $Y_{11}^N$ is identified and
$$
F_{Y^N, 11}^{-1}(q) = F_{Y, 10}^{-1}(q) + F_{Y, 01}^{-1}(q) - F_{Y, 00}^{-1}(q), \quad \forall q \in (0, 1).
$$
Pf 4.5 WLOG, $U$ is assumed to be uniform on $[0, 1]$, and then by QDID-2, $h(\cdot, g, t)$ is
interpreted as the quantile function conditional on $(g, t)$ (Skorohod representation):
$$
\begin{aligned}
F_{Y, gt}(y) &= P(h(U, G, T) \leq y \mid G = g, T = t) \\
&= P(h(U, g, t) \leq y) && (\text{by QDID-3}) \\
&= P(U \leq h^{-1}(y; g, t)) && (\text{by QDID-2}) \\
&= h^{-1}(y; g, t) \\
\Rightarrow\ h(u, g, t) &= F_{Y, gt}^{-1}(u) && (1)
\end{aligned}
$$
By QDID-1,
$$
\begin{aligned}
h(u, 1, 1) - h(u, 1, 0) &= h^G(u, 1) + h^T(u, 1) - h^G(u, 1) - h^T(u, 0) \\
&= h^T(u, 1) - h^T(u, 0) \\
&= h(u, 0, 1) - h(u, 0, 0) \\
\Rightarrow\ h(u, 1, 1) &= h(u, 1, 0) + h(u, 0, 1) - h(u, 0, 0) && (2)
\end{aligned}
$$
(1) and (2) yield the desired result.
The modeling philosophy of QDID and CIC is quite different. In QDID, the outcome in the absence
of treatment is generated according to
$$
Y_{00}^N = h_0(U) \ ; \quad Y_{10}^N = h_1(U).
$$
The unobserved characteristic $U$ is equally distributed in the treatment and control groups
($U \perp (G, T)$), but different groups use different production technologies. Therefore, the
observed difference between $F_{Y, 00}$ and $F_{Y, 10}$ is attributed to the difference between $h_0$ and
$h_1$. Individuals with the same realization of $u$ are mapped into different outcomes
when their groups differ, but the monotonicity of $h_0$ and $h_1$ still preserves their rank.
Consequently, comparing individuals at the same quantile of the outcome is equivalent
to comparing individuals with the same $u$ under the QDID model.
By contrast, under the CIC model the outcome in the absence of treatment is
generated according to
$$
Y_{00}^N = h(U_0) \ ; \quad Y_{10}^N = h(U_1).
$$
The treatment and control groups both use the same production technology, but their
distributions of characteristics can differ in an arbitrary way. The observed difference
between $F_{Y, 00}$ and $F_{Y, 10}$ is attributed to the difference between $U_0$ and $U_1$. Individuals
with the same realization of $u$ are mapped into the same outcome. Therefore,
comparing individuals with the same outcome is equivalent to comparing individuals with
the same $u$ under the CIC model, though the rank of $u$ in $U_0$ and $U_1$ is different. The
separability condition QDID-1 is crucial for identifying the time effect. If there is an
interaction effect between $G$ and $T$, the treated and control units are allowed to have
different time paths, violating the same time effect structure given $u$. The QDID model
suggests the following estimator for the QTE:
$$
\begin{aligned}
\tau_q^{QDID} &= F_{Y^I, 11}^{-1}(q) - F_{Y^N, 11}^{-1}(q) \\
&= F_{Y^I, 11}^{-1}(q) - F_{Y, 10}^{-1}(q) - \big[F_{Y, 01}^{-1}(q) - F_{Y, 00}^{-1}(q)\big].
\end{aligned}
$$
Following Koenker (2006), it can be estimated from the quantile regression with group,
time, and treatment dummies
$$
F_{Y \mid G, T}^{-1}(q) = \alpha(q) + \beta(q) T + \gamma(q) G + \tau_q^{QDID}\, GT.
$$
Integrating the QTE, $\tau_q^{QDID}$, yields the ATE $\tau^{QDID}$:
$$
\begin{aligned}
\int \tau_q^{QDID}\, dq &= \int \big(F_{Y^I, 11}^{-1}(q) - F_{Y, 10}^{-1}(q) - F_{Y, 01}^{-1}(q) + F_{Y, 00}^{-1}(q)\big)\, dq \\
&= E[Y_{11}^I] - E[Y_{10}] - E[Y_{01}] + E[Y_{00}],
\end{aligned}
$$
which is the same as $\tau^{DID}$.
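The QDID estimator and its relation to DID can be sketched with empirical quantiles. The helper names and the data below are hypothetical; the four cells have equal sample sizes and the quantile grid is chosen at the midpoints $(i - 0.5)/n$ so that averaging the quantile effects reproduces the difference of cell means exactly.

```python
import numpy as np

def empirical_quantile(sample, q):
    """Smallest sample value whose empirical CDF is at least q."""
    s = np.sort(sample)
    idx = np.ceil(np.asarray(q) * len(s)).astype(int) - 1
    return s[np.clip(idx, 0, len(s) - 1)]

def qdid_qte(y11, y10, y01, y00, q):
    """F11^{-1}(q) - F10^{-1}(q) - [F01^{-1}(q) - F00^{-1}(q)]."""
    return (empirical_quantile(y11, q) - empirical_quantile(y10, q)
            - empirical_quantile(y01, q) + empirical_quantile(y00, q))

# Hypothetical samples, three observations per (G, T) cell.
y00 = np.array([1.0, 2.0, 3.0])
y01 = np.array([2.0, 4.0, 6.0])
y10 = np.array([2.0, 3.0, 4.0])
y11 = np.array([4.0, 7.0, 10.0])

q_grid = (np.arange(3) + 0.5) / 3          # midpoints 1/6, 1/2, 5/6
qte = qdid_qte(y11, y10, y01, y00, q_grid)
ate_qdid = qte.mean()                       # numerical integral over q
ate_did = y11.mean() - y10.mean() - y01.mean() + y00.mean()
```

The quantile effects differ across the grid, so QDID reveals heterogeneity that the single DID number averages out.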
Although the CIC model allows $U_1$ and $U_0$ to differ in an arbitrary way, the support
condition $\operatorname{supp} U_1 \subseteq \operatorname{supp} U_0$ may rule out many interesting cases in labor economics.
The support condition implies that $\operatorname{supp} Y_{10} \subseteq \operatorname{supp} Y_{00}$ and $\operatorname{supp} Y_{11}^N \subseteq \operatorname{supp} Y_{01}$.
However, in practice $\max Y_{10}$ tends to be greater than $\max Y_{00}$ in the case of job training
programs (Crump, Hotz, Imbens and Mitnik). Therefore, the ability to impute the
entire counterfactual distribution will be hampered by the support problem.
Only the part of the distribution over $\operatorname{supp} Y_{10} \cap \operatorname{supp} Y_{00}$ can be imputed under the CIC model. By contrast, the
QDID model does not suffer from the support problem, because it is always feasible to
compute the $q$-th QTE of $Y_{01}$ versus $Y_{00}$ and add it back to the $Y_{10}$ that corresponds to
the $q$-th quantile in $F_{Y, 10}$. Comparing individuals at the same quantile dates
back to fractile regression (Mahalanobis, 1960; Sen and Chaudhuri, 2006). Even in the
extreme case $\operatorname{supp} Y_{10} \cap \operatorname{supp} Y_{00} = \emptyset$, they are still comparable because the integral
transformation maps $\operatorname{supp} Y$ into $[0, 1]$, regardless of the original support of $Y$.
5 Nonparametric Bounding Approaches
In the preceding sections we review several identication methods for the treatment eect
parameters under various assumptions. Although those assumptions are quite dierent
at rst glance, the ultimate implication for identication is essentially the same. Those
assumptions are strong enough in the sense that the counterfactuals can be imputed from
the observable data, and the treatment eect parameters are point identied. The main
dierence is the way we impute the counterfactuals might be dierent under dierent
assumptions. For example, under conditional unconfoundness, we impute counterfactual
E[Y
1
|D = 0, X] by E[Y
1
|D = 1, X]. In the following sample selection model:
Y
1
= X

1
+u
1
,
Y
0
= X

0
+u
0
,
D = I
{X

D
+u
D
>0}
.
According to this system, the counterfactual $E[Y_1|D = 0, X]$ equals
$E[X'\beta_1 + u_1 \,|\, X'\beta_D + u_D < 0, X] = X'\beta_1 + E[u_1 \,|\, X'\beta_D + u_D < 0, X]$.
If we make a further distributional
assumption, for example that $(u_1, u_0, u_D)$ is trivariate normally distributed, we obtain a
closed-form expression for $E[u_1 \,|\, X'\beta_D + u_D < 0, X]$. This is the celebrated Heckman
two-step estimator. The above system enables us to impute the counterfactual in a
specific manner. However, the validity of such imputation comes from the functional
form restriction, the additively separable error term, and the distributional assumption
(Manski, 1989). A specific assumption leads to a specific imputation method. Researchers
usually have diverse prior beliefs about which assumption is more plausible. Such diversity
of beliefs stems from two facts. First, the assumptions in sections 1 to 6 typically cannot
be tested statistically; there is no systematic way to decide which assumption is more
plausible. Therefore, researchers use economic theory to guide the choice of assumptions.
Second, the theory usually remains silent about the distributional assumptions and
functional form restrictions. Unfortunately, our ability to impute the counterfactuals
hinges on those assumptions.
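To make the closed form concrete, it can be written out under one common normalization. The following is a sketch under assumptions that the notes do not state explicitly: $(u_1, u_D)$ is independent of $X$ and jointly normal with mean zero, $\mathrm{Var}(u_D) = 1$, and $\mathrm{Cov}(u_1, u_D) = \sigma_{1D}$; $\phi$ and $\Phi$ denote the standard normal density and distribution function.

\[
\begin{aligned}
E[u_1 \,|\, X'\beta_D + u_D < 0, X]
  &= \sigma_{1D}\, E[u_D \,|\, u_D < -X'\beta_D]
   = -\sigma_{1D}\, \frac{\phi(X'\beta_D)}{1 - \Phi(X'\beta_D)},\\
E[Y_1 \,|\, D = 0, X]
  &= X'\beta_1 - \sigma_{1D}\, \frac{\phi(X'\beta_D)}{1 - \Phi(X'\beta_D)}.
\end{aligned}
\]

The first equality uses the linear projection of $u_1$ on $u_D$ under joint normality, and the second uses the mean of a standard normal variable truncated from above. The whole correction term disappears only when $\sigma_{1D} = 0$, which shows how directly the imputation depends on the normality and additive separability assumptions.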
A more challenging question is whether we can still say something about the value of the
treatment effect parameters if no assumption is made. How far can we go if we only
make weaker (and hence more credible) assumptions? Charles Manski provides a fresh
insight into this problem by deriving nonparametric bounds for the counterfactuals and
the treatment effect parameters under different sets of assumptions. The assumptions
we are going to introduce are weaker than those of sections 1 to 6, and we do not assume
more than what economic theory predicts. As a result, the information contained in the
data is too limited to impute the counterfactuals. Instead of point identification
of the TE parameters, we can only obtain an identification region for the parameters.
This identification concept is called partial identification or set identification in the
literature.
Recall that we only observe $(Y, D, X)$ but want to identify characteristics of $(Y_1, Y_0, D, X)$.
For example, we want to learn about $E[Y_1]$ and $E[Y_0]$, which can be expressed as:
\[
\begin{aligned}
E[Y_1] &= E[Y_1|D = 1]\,P(D = 1) + E[Y_1|D = 0]\,P(D = 0),\\
E[Y_0] &= E[Y_0|D = 1]\,P(D = 1) + E[Y_0|D = 0]\,P(D = 0).
\end{aligned}
\]
From now on I suppress the conditioning on X to make the exposition more
transparent. The sampling process identifies $P(D = 0)$, $P(D = 1)$,
$E[Y_1|D = 1] = E[Y|D = 1]$, and $E[Y_0|D = 0] = E[Y|D = 0]$. Without further assumptions, the
counterfactuals $E[Y_1|D = 0]$ and $E[Y_0|D = 1]$ are not identified. Suppose $Y_0$ and $Y_1$
are bounded random variables with common support $[K_0, K_1]$. Boundedness is essential
for deriving the nonparametric bound for ATE. In particular, let us assume $Y_0$ and $Y_1$ are
binary random variables. This assumption admits a simpler version of the
nonparametric bounds for ATE. Binary random variables also imply $K_1 = 1$ and $K_0 = 0$. Under
this assumption, we know $E[Y_1 - Y_0] \in [-1, 1]$. This bound is, of course, trivial:
without any data, we can only conclude that ATE belongs to $[-1, 1]$. The length of this
bound equals 2.
5.1 No-assumption Bound
Since $Y_1$ is binary, $E[Y_1|D = 0] = P(Y_1 = 1|D = 0) \in [0, 1]$. Therefore we obtain the
upper bound and lower bound for $E[Y_1]$:
\[
\begin{aligned}
U_1 &= E[Y_1|D = 1]\,P(D = 1) + P(D = 0),\\
L_1 &= E[Y_1|D = 1]\,P(D = 1).
\end{aligned}
\]
Similarly, the bound for $E[Y_0]$ is
\[
\begin{aligned}
U_0 &= E[Y_0|D = 0]\,P(D = 0) + P(D = 1),\\
L_0 &= E[Y_0|D = 0]\,P(D = 0).
\end{aligned}
\]
Notice that the bounds here are functions of identified objects and can be nonparametrically
estimated. The trick of the bound analysis is to plug in the upper and the lower
bounds for the non-identified counterfactuals, leaving the point-identified objects
unchanged. These bounds imply the bound for ATE, with $U = U_1 - L_0$ and $L = L_1 - U_0$:
\[
\begin{aligned}
U &= E[Y_1|D = 1]\,P(D = 1) - E[Y_0|D = 0]\,P(D = 0) + P(D = 0),\\
L &= E[Y_1|D = 1]\,P(D = 1) - E[Y_0|D = 0]\,P(D = 0) - P(D = 1).
\end{aligned}
\]
This bound has length $P(D = 0) + P(D = 1) = 1$. Compared with the trivial
bound for ATE, the uncertainty is reduced by 50% by using the information contained in
the data. However, the no-assumption bound necessarily covers 0, and hence the sign of
ATE is not identified. Because it is too wide to be useful in practice, we need to impose
other assumptions to tighten this bound.
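The no-assumption bounds are simple plug-in quantities, so a sample analogue is easy to sketch. The Python below is a minimal illustration, assuming binary outcomes and treatments stored in plain lists; the function names `cell_stats` and `no_assumption_bounds` and the toy data are hypothetical and not part of the original notes.

```python
def cell_stats(y, d):
    """Return sample analogues of (E[Y|D=1], E[Y|D=0], P(D=1), P(D=0))."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    if not treated or not control:
        raise ValueError("need both treated and untreated observations")
    n = len(y)
    return (sum(treated) / len(treated), sum(control) / len(control),
            len(treated) / n, len(control) / n)


def no_assumption_bounds(y, d):
    """No-assumption bounds for E[Y1], E[Y0] and ATE with binary outcomes."""
    m1, m0, p1, p0 = cell_stats(y, d)
    l1, u1 = m1 * p1, m1 * p1 + p0   # counterfactual E[Y1|D=0] set to 0 or 1
    l0, u0 = m0 * p0, m0 * p0 + p1   # counterfactual E[Y0|D=1] set to 0 or 1
    return {"EY1": (l1, u1), "EY0": (l0, u0), "ATE": (l1 - u0, u1 - l0)}


# Illustrative data: four treated and four untreated units.
y = [1, 1, 0, 1, 0, 0, 1, 0]
d = [1, 1, 1, 1, 0, 0, 0, 0]
bounds = no_assumption_bounds(y, d)
```

In this toy sample the ATE interval has width exactly one and contains zero, as the text says it must.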
5.2 Level-set Restrictions: Instrumental Variables [8]
Manski (1990) and Manski and Pepper (2000) assume:
Definition 5.1 (IV)
Z is an instrumental variable in the sense of mean-independence if, for j = 0 and 1, and
each pair $(z, z')$, we have $E[Y_j|Z = z] = E[Y_j|Z = z']$.
This is the standard exclusion restriction that Z does not shift the response variables.
The no-assumption bound can be applied to $E[Y_1|Z = z]$ for each $z \in \mathrm{support}(Z)$:
\[
\begin{aligned}
U_1(z) &= E[Y_1|D = 1, Z = z]\,P(D = 1|Z = z) + P(D = 0|Z = z),\\
L_1(z) &= E[Y_1|D = 1, Z = z]\,P(D = 1|Z = z).
\end{aligned}
\]
Since $E[Y_1|Z = z] = E[Y_1]$ by the IV assumption, $L_1(z) \le E[Y_1] \le U_1(z)$ for all z.
Therefore, we obtain the intersection bound for $E[Y_1]$:
\[
\sup_z L_1(z) \le E[Y_1] \le \inf_z U_1(z).
\]
Chernozhukov, Lee, and Rosen (2008) analyze the asymptotic properties of the
intersection bounds.
[8] Another example of level-set restrictions is the constant effect assumption. See Manski (1990).
We can also discuss the identification power of an instrument Z. Consider the extreme
case in which Z is independent of D. This implies $P(D = 1|Z) = P(D = 1)$. Since $E[Y_j|Z = z]$
is a constant function of z, constant $P(D = 1|Z)$ implies that $E[Y_j|D = 1, Z = z]$ and
$E[Y_j|D = 0, Z = z]$ are necessarily constant functions of z as well. We conclude that
$E[Y_j|D = 1, Z = z] = E[Y_j|D = 1]$ and $E[Y_j|D = 0, Z = z] = E[Y_j|D = 0]$. Therefore
$U_1(z) = U_1$ and $L_1(z) = L_1$, and such an IV cannot improve on the no-assumption bound.
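The intersection bound for $E[Y_1]$ can be sketched as a sample analogue when Z is discrete. This is an illustration only: the function name `iv_bounds_ey1` and the toy data are hypothetical, outcomes are assumed binary, and sampling error in the estimated maximum and minimum is ignored (the inference problem studied in the literature cited above).

```python
def iv_bounds_ey1(y, d, z):
    """Intersection bound for E[Y1] under a mean-independent, discrete instrument."""
    cells = {}
    for yi, di, zi in zip(y, d, z):
        count, sum_yd, n_untreated = cells.get(zi, (0, 0, 0))
        cells[zi] = (count + 1, sum_yd + yi * di, n_untreated + (1 - di))
    lowers, uppers = [], []
    for count, sum_yd, n_untreated in cells.values():
        # L1(z) = E[Y D | Z=z] = E[Y1|D=1,Z=z] P(D=1|Z=z)
        lower = sum_yd / count
        lowers.append(lower)
        # U1(z) = L1(z) + P(D=0|Z=z)
        uppers.append(lower + n_untreated / count)
    return max(lowers), min(uppers)


# Illustrative data: the instrument shifts the treatment probability from 1/4 to 3/4.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [1, 0, 0, 0, 1, 1, 1, 0]
y = [1, 0, 0, 0, 1, 0, 1, 0]
lower, upper = iv_bounds_ey1(y, d, z)
```

In this toy sample the cell with the higher treatment probability supplies both the largest lower bound and the smallest upper bound, so the interval shrinks from the pooled no-assumption width of 0.5 to 0.25.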
5.3 Restrictions on the Moment Conditions: Monotone Instrumental
Variables
Manski and Pepper (2000) propose a new concept of instrumental variables, termed
monotone instrumental variables:
Definition 5.2 (MIV) Covariate Z is a monotone instrumental variable if
\[
\begin{aligned}
E[Y_1|Z = z] &\le E[Y_1|Z = z'], \text{ and}\\
E[Y_0|Z = z] &\le E[Y_0|Z = z'], \quad \forall\, z \le z'.
\end{aligned}
\]
Compared with the IV assumption, the equality is replaced by an inequality, yielding
a set of moment inequalities. Such an assumption is quite nonstandard because IV
assumptions usually generate a set of moment equalities. However, moment inequalities can
still be used to bound the parameters of interest. According to the MIV assumption,
$E[Y_1|Z = z] \le E[Y_1|Z = z']$ for all $z \le z'$. Therefore,
$E[Y_1|Z = z] \le \inf_{z' \ge z} E[Y_1|Z = z']$, and by the same argument
$E[Y_1|Z = z] \ge \sup_{z' \le z} E[Y_1|Z = z']$.
The no-assumption bound can be applied to each $E[Y_1|Z = z']$. Therefore,
the bound for $E[Y_1|Z = z]$ is given by
\[
\sup_{z' \le z} L_1(z') \le E[Y_1|Z = z] \le \inf_{z' \ge z} U_1(z'),
\]
which is also an intersection bound. Integrating the upper and the lower bounds with
respect to the distribution of Z yields the bound for $E[Y_1]$.
An example of an MIV is the IQ test score. Let $Y_j$, j = 0, 1, be the wage functions.
The MIV assumption asserts that persons with a higher IQ test score have weakly higher mean
wage functions, regardless of whether they participate in the job training program.
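The MIV bound for $E[Y_1]$ combines the cell-level no-assumption bounds across ordered values of Z and then averages over the distribution of Z. The Python below sketches the sample analogue for a discrete, ordered Z; the function name `miv_bounds_ey1` and the toy data are hypothetical, and binary outcomes are assumed.

```python
def miv_bounds_ey1(y, d, z):
    """MIV bound for E[Y1] with a discrete, ordered instrument and binary outcomes."""
    cells = {}
    for yi, di, zi in zip(y, d, z):
        count, sum_yd, n_untreated = cells.get(zi, (0, 0, 0))
        cells[zi] = (count + 1, sum_yd + yi * di, n_untreated + (1 - di))
    n = len(y)
    z_values = sorted(cells)
    # Cell-level no-assumption bounds L1(z) and U1(z).
    lower_z = {v: cells[v][1] / cells[v][0] for v in z_values}
    upper_z = {v: lower_z[v] + cells[v][2] / cells[v][0] for v in z_values}
    lower, upper = 0.0, 0.0
    for v in z_values:
        weight = cells[v][0] / n  # sample analogue of P(Z = v)
        lower += weight * max(lower_z[w] for w in z_values if w <= v)
        upper += weight * min(upper_z[w] for w in z_values if w >= v)
    return lower, upper


# Illustrative data with two instrument values.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [1, 0, 0, 0, 1, 1, 1, 0]
y = [1, 0, 0, 0, 1, 0, 1, 0]
lower, upper = miv_bounds_ey1(y, d, z)
```

Because each cell only borrows lower bounds from smaller values of Z and upper bounds from larger values, the result is weakly wider than the IV intersection bound computed from the same data, but still narrower than the no-assumption bound.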
5.4 Monotone Treatment Selection
Manski and Pepper (2000) show that the observed treatment D itself can be viewed as
an MIV if the following conditions are satisfied:
Definition 5.3 (MTS)
\[
\begin{aligned}
E[Y_1|D = 1] &\ge E[Y_1|D = 0], \text{ and}\\
E[Y_0|D = 1] &\ge E[Y_0|D = 0].
\end{aligned}
\]
Clearly, D is a monotone IV according to this definition. This special case of MIV is called
monotone treatment selection (MTS). These two moment inequalities are consistent with
selection on unobservables. For example, theory suggests that those with higher ability
choose a higher schooling level and have a higher mean wage than those with lower
ability. In our quality payment program example, patients with higher self-consciousness
choose the hospitals participating in the program and have a higher treatment completion
rate. Therefore, MTS captures the main implication of some sample selection problems;
put differently, some sample selection problems directly lead to MTS.
MTS yields an upper bound for the counterfactual $E[Y_1|D = 0]$, which is $E[Y_1|D = 1]$,
and a lower bound for $E[Y_0|D = 1]$, which is $E[Y_0|D = 0]$. The bound associated with
$E[Y_1]$ becomes:
\[
\begin{aligned}
U_1 &= E[Y_1|D = 1],\\
L_1 &= E[Y_1|D = 1]\,P(D = 1).
\end{aligned}
\]
The lower bound is the same as the no-assumption bound because MTS does not provide
information on the lower bound of $E[Y_1|D = 0]$. Analogously, the bound associated with
$E[Y_0]$ is:
\[
\begin{aligned}
U_0 &= E[Y_0|D = 0]\,P(D = 0) + P(D = 1),\\
L_0 &= E[Y_0|D = 0].
\end{aligned}
\]
The bound for ATE is:
\[
\begin{aligned}
U &= E[Y_1|D = 1] - E[Y_0|D = 0],\\
L &= E[Y_1|D = 1]\,P(D = 1) - E[Y_0|D = 0]\,P(D = 0) - P(D = 1).
\end{aligned}
\]
Notice that L here is the same as the no-assumption bound, and U here is just the group
mean difference, which identifies ATE when conditional unconfoundedness holds.
5.5 Shape Restrictions on the Response Functions: Monotone Treatment Response
Manski (1997) considers restrictions on the response functions. For example, economic
theory asserts that the supply function is a monotone increasing function of price.
Instead of imposing parametric functional forms on the response functions, Manski (1997)
considers weaker assumptions, such as monotonicity or concavity of the response
functions. Typically, such assumptions are no stronger than the implications of many economic
theories. Suppose the job training program is indeed beneficial for everyone; this would
imply the assumption of monotone treatment response:
Definition 5.4 (MTR) $Y_{1i} \ge Y_{0i}$ for all $i$.
According to MTR, we obtain bounds for the counterfactuals:
\[
\begin{aligned}
E[Y_1|D = 0] &\ge E[Y_0|D = 0], \text{ and}\\
E[Y_1|D = 1] &\ge E[Y_0|D = 1].
\end{aligned}
\]
The bound for $E[Y_1]$ under MTR is
\[
\begin{aligned}
U_1 &= E[Y_1|D = 1]\,P(D = 1) + P(D = 0),\\
L_1 &= E[Y_1|D = 1]\,P(D = 1) + E[Y_0|D = 0]\,P(D = 0).
\end{aligned}
\]
Similarly, the bound for $E[Y_0]$ is
\[
\begin{aligned}
U_0 &= E[Y_0|D = 0]\,P(D = 0) + E[Y_1|D = 1]\,P(D = 1),\\
L_0 &= E[Y_0|D = 0]\,P(D = 0).
\end{aligned}
\]
These bounds imply the bound for ATE:
\[
\begin{aligned}
U &= E[Y_1|D = 1]\,P(D = 1) - E[Y_0|D = 0]\,P(D = 0) + P(D = 0),\\
L &= 0.
\end{aligned}
\]
Clearly MTR necessarily implies that ATE is weakly greater than 0. The upper bound
here is the same as the no-assumption bound. Although MTS and MTR look very similar,
they have distinct meanings. I borrow the interpretation from Manski and Pepper (2000):
Consider the variation of wages with schooling. It is common to hear the
verbal assertion that wages increase with schooling. The MTS and MTR
assumptions interpret this statement in different ways. The MTS interpretation
is that persons who select higher levels of schooling have weakly higher
mean wage functions than do those who select lower levels of schooling. The
MTR interpretation is that each person's wage function is weakly increasing
in conjectured years of schooling.
The MTS assumption is consistent with economic models of schooling
choice and wage determination that predict that persons with higher ability
have higher mean wage functions and choose higher levels of schooling than do
persons with lower ability. The MTR assumption is consistent with economic
models of the production of human capital through schooling.
In general, several assumptions can be imposed together to yield a sharper identification
region, as long as they do not contradict each other. For example, if we impose
MTR as well as MTS, the bound for ATE shrinks to:
\[
\begin{aligned}
U &= E[Y_1|D = 1] - E[Y_0|D = 0],\\
L &= 0.
\end{aligned}
\]
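The way MTS and MTR each tighten one side of the no-assumption bound can be sketched as follows. The function name `ate_bounds`, its `mts` and `mtr` flags, and the toy data are hypothetical; binary outcomes are assumed, and the code simply evaluates the formulas above at sample means without checking whether the data are compatible with the imposed assumptions.

```python
def ate_bounds(y, d, mts=False, mtr=False):
    """ATE bounds for binary outcomes under optional MTS and MTR assumptions."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    if not treated or not control:
        raise ValueError("need both treated and untreated observations")
    n = len(y)
    m1, m0 = sum(treated) / len(treated), sum(control) / len(control)
    p1, p0 = len(treated) / n, len(control) / n
    # No-assumption bound as the starting point.
    lower = m1 * p1 - m0 * p0 - p1
    upper = m1 * p1 - m0 * p0 + p0
    if mts:
        upper = m1 - m0   # MTS replaces U by the group mean difference
    if mtr:
        lower = 0.0       # MTR rules out negative average effects
    return lower, upper


# Illustrative data: four treated and four untreated units.
y = [1, 1, 0, 1, 0, 0, 1, 0]
d = [1, 1, 1, 1, 0, 0, 0, 0]
none = ate_bounds(y, d)
mts_only = ate_bounds(y, d, mts=True)
mtr_only = ate_bounds(y, d, mtr=True)
both = ate_bounds(y, d, mts=True, mtr=True)
```

If the group mean difference were negative, imposing both assumptions would return an empty interval (upper below lower), which signals that MTS and MTR cannot both hold in such data.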
5.6 Restrictions on the Selection Mechanism
Manski (1990, 1994) shows that selection rules can be used to bound the
counterfactuals. Consider the selection rule of the Roy model, $D = I_{\{Y_1 \ge Y_0\}}$. This selection rule
is also consistent with the agents' optimizing behavior. If we specify the functional
forms for $Y_j$ and the distributional assumptions, the Roy model can be analyzed using
the methods developed in section 6. However, even without those assumptions, such a selection rule
is already informative enough to bound the counterfactuals. Notice that:
\[
\begin{aligned}
E[Y_1|D = 0] &= E[Y_1|Y_1 < Y_0] \le E[Y_1|Y_1 \ge Y_0] = E[Y_1|D = 1], \text{ and}\\
E[Y_1|D = 0] &= E[Y_1|Y_1 < Y_0] \le E[Y_0|Y_1 < Y_0] = E[Y_0|D = 0].
\end{aligned}
\]
Holding $Y_0$ fixed, $Y_1$ in the region $\{Y_1 < Y_0\}$ is weakly smaller than $Y_1$ in the region
$\{Y_1 \ge Y_0\}$. This holds true for any value of $Y_0$. By the monotonicity of the expectation,
we obtain the first inequality. The second inequality is simple: since we condition on
$Y_1 < Y_0$, of course $Y_0$ is greater than $Y_1$ in this region. Analogously, there are two upper
bounds for the counterfactual $E[Y_0|D = 1]$:
\[
\begin{aligned}
E[Y_0|D = 1] &= E[Y_0|Y_1 \ge Y_0] \le E[Y_0|Y_1 < Y_0] = E[Y_0|D = 0], \text{ and}\\
E[Y_0|D = 1] &= E[Y_0|Y_1 \ge Y_0] \le E[Y_1|Y_1 \ge Y_0] = E[Y_1|D = 1].
\end{aligned}
\]
Hence the upper bound for both $E[Y_0|D = 1]$ and $E[Y_1|D = 0]$ is
$\lambda = \min(E[Y_0|D = 0], E[Y_1|D = 1])$. The above analysis gives the bound for $E[Y_1]$:
\[
\begin{aligned}
U_1 &= E[Y_1|D = 1]\,P(D = 1) + \lambda\,P(D = 0),\\
L_1 &= E[Y_1|D = 1]\,P(D = 1).
\end{aligned}
\]
Similarly, the bound for $E[Y_0]$ is
\[
\begin{aligned}
U_0 &= E[Y_0|D = 0]\,P(D = 0) + \lambda\,P(D = 1),\\
L_0 &= E[Y_0|D = 0]\,P(D = 0).
\end{aligned}
\]
These bounds imply the bound for ATE:
\[
\begin{aligned}
U &= E[Y_1|D = 1]\,P(D = 1) - E[Y_0|D = 0]\,P(D = 0) + \lambda\,P(D = 0),\\
L &= E[Y_1|D = 1]\,P(D = 1) - E[Y_0|D = 0]\,P(D = 0) - \lambda\,P(D = 1).
\end{aligned}
\]
It has length $\lambda \le 1$, and hence this bound for ATE is contained in the no-assumption
bound.
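A sample analogue of the Roy-model bound is sketched below. It assumes binary outcomes and caps both counterfactual means at the smaller of the two observed group means, min(E[Y_0|D = 0], E[Y_1|D = 1]), as derived above; the function name `roy_ate_bounds`, the variable `cap`, and the toy data are hypothetical.

```python
def roy_ate_bounds(y, d):
    """ATE bound under the Roy selection rule D = 1{Y1 >= Y0}, binary outcomes."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    if not treated or not control:
        raise ValueError("need both treated and untreated observations")
    n = len(y)
    m1, m0 = sum(treated) / len(treated), sum(control) / len(control)
    p1, p0 = len(treated) / n, len(control) / n
    # Common upper bound on E[Y1|D=0] and E[Y0|D=1]; the lower bound stays at 0.
    cap = min(m0, m1)
    lower = m1 * p1 - m0 * p0 - cap * p1
    upper = m1 * p1 - m0 * p0 + cap * p0
    return lower, upper


# Illustrative data: four treated and four untreated units.
y = [1, 1, 0, 1, 0, 0, 1, 0]
d = [1, 1, 1, 1, 0, 0, 0, 0]
lower, upper = roy_ate_bounds(y, d)
```

The width of the resulting interval equals the cap itself, so the bound is never wider than the no-assumption bound and can be much narrower when either observed group mean is small.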
5.7 Some Remarks
The above arguments also hold after conditioning on covariates X. We can derive
bounds for the conditional ATE under different assumptions:
\[
L(X) \le E[Y_1 - Y_0|X] \le U(X).
\]
Integrating U(x) and L(x) with respect to the distribution of X gives the bound for ATE:
\[
\int L(x)\,dF(x) \le E[Y_1 - Y_0] = \int E[Y_1 - Y_0|X = x]\,dF(x) \le \int U(x)\,dF(x).
\]
To bound the ATE, boundedness of the response variables is essential. Even if $Y_1$ has
unbounded support, we can still bound $E[g(Y_1)]$, provided that $g(\cdot)$ is a bounded function.
A useful case is $g(Y_1) = I_{\{Y_1 \le y\}}$, because $E[g(Y_1)] = P(Y_1 \le y) = F_1(y)$, the CDF of $Y_1$.
Thus it is always possible to bound the distribution function using the approaches
in this section. Since various statistical quantities are nothing but functionals of
the distribution function, in principle one can derive bounds for the mean, the variance (Stoye,
2007), and quantiles (Manski, 1994). Manski (2003, 2007) reviews the literature on partial
identification. Horowitz and Manski (2000), Imbens and Manski (2004), Chernozhukov,
Hong, and Tamer (2008), Beresteanu and Molinari (2008), Romano and Shaikh (2006),
and Stoye (2008) construct confidence intervals for bounds. Empirical studies
using this methodology include Manski, Sandefur, McLanahan and Powers (1992),
Blundell, Gosling, Ichimura, and Meghir (2004), Manski and Nagin (1998), Pepper (2000,
2003), and Lechner (1999). Manski and Tamer (2002), Hong and Tamer (2003), Haile and
Tamer (2003), Tamer (2003), Honore and Lleras-Muney (2006), and Honore and Tamer (2006)
apply this method to interval regression, censored regression, English auctions, incomplete
models [9], competing risk models, and panel dynamic discrete choice models, respectively.
[9] For example, models with multiple equilibria.
References
Abadie, A., D. Drukker, J. L. Herr, and G. W. Imbens (2001). Implementing Matching
Estimators for Average Treatment Effects in Stata, The Stata Journal, 1, 1-18.
Abadie, A. (2002). Bootstrap Tests for Distributional Treatment Effects in Instrumental
Variable Models, Journal of the American Statistical Association, 97, 284-292.
Abadie, A., J. Angrist, and G. W. Imbens (2002). Instrumental Variables Estimates of the
Effect of Subsidized Training on the Quantiles of Trainee Earnings, Econometrica,
70, 91-117.
Abadie, A. (2003). Semiparametric Instrumental Variable Estimation of Treatment
Response Models, Journal of Econometrics, 113, 231-263.
Abadie, A. and G. W. Imbens (2006). Large Sample Properties of Matching Estimators
for Average Treatment Effects, Econometrica, 74, 235-267.
Angrist, J. and G. W. Imbens (1995). Two-Stage Least Squares Estimation of Average
Causal Effects in Models with Variable Treatment Intensity, Journal of the American
Statistical Association, 90, 431-442.
Angrist, J., G. W. Imbens, and D. B. Rubin (1996). Identification of Causal Effects
Using Instrumental Variables (with discussion), Journal of the American Statistical
Association, 91, 444-455.
Angrist, J. (2001). Estimation of Limited Dependent Variable Models with Dummy
Endogenous Regressors: Simple Strategies for Empirical Practice (with discussion),
Journal of Business and Economic Statistics, 19, 2-16.
Angrist, J. and A. B. Krueger (2001). Instrumental Variables and the Search for
Identification: From Supply and Demand to Natural Experiments, The Journal of Economic
Perspectives, 15, 69-85.
Angrist, J. (2004). Treatment Effect Heterogeneity in Theory and Practice, The
Economic Journal, 114, C52-C83.
Angrist, J. (2006). Instrumental Variables Methods in Experimental Criminological
Research: What, Why and How, Journal of Experimental Criminology, 2, 23-44.
Card, D. (1995). Using geographic variation in college proximity to estimate the return
to schooling. In: Christofides, L., Grant, E., Swidinsky, R. (Eds.), Aspects of Labor
Market Behaviour: Essays in Honour of John Vanderkamp. University of Toronto
Press, Toronto, 201-222.
Chaudhuri, P. (1991). Nonparametric estimation of regression quantiles and their local
Bahadur representation, Annals of Statistics, 19, 760-777.
Chen, X., H. Hong, and A. Tarozzi (2004). Semiparametric Efficiency in GMM Models
of Nonclassical Measurement Error, Missing Data and Treatment Effect, Annals of
Statistics, forthcoming.
Chernozhukov, V., G. W. Imbens, and W. K. Newey (2004). Instrumental Variable
Identification and Estimation of Nonseparable Models via Quantile Conditions, Working
Paper.
Chernozhukov, V. and C. Hansen (2005). An IV Model of Quantile Treatment Effects,
Econometrica, 73, 245-261.
Chernozhukov, V. and C. Hansen (2006). Instrumental Quantile Regression Inference for
Structural and Treatment Effect Models, Journal of Econometrics, 132, 491-525.
Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through
Graphics, New York: Wiley.
Dehejia, R. (2005). Practical Propensity Score Matching: A Reply to Smith and Todd,
Journal of Econometrics, 125, 355-364.
Dehejia, R. H. and S. Wahba (1999). Causal Effects in Nonexperimental Studies:
Reevaluating the Evaluation of Training Programs, Journal of the American Statistical
Association, 94, 1053-1062.
Dehejia, R. H. and S. Wahba (2002). Propensity Score Matching Methods for
Nonexperimental Causal Studies, Review of Economics and Statistics, 84, 151-161.
Firpo, S. (2007). Efficient Semiparametric Estimation of Quantile Treatment Effects,
Econometrica, 75, 259-276.
Frolich, M. (2006). Nonparametric IV Estimation of Local Average Treatment Effects
with Covariate, Journal of Econometrics, forthcoming.
Frolich, M. and B. Melly (2008). Estimation of Quantile Treatment Effects with STATA,
Working Paper.
Hahn, J. (1998). On the Role of the Propensity Score in Efficient Semiparametric
Estimation of Average Treatment Effect, Econometrica, 66, 315-331.
Heckman, James J. (1976). Common structure of statistical models of truncation, sample
selection, and limited dependent variables and a simple estimator for such models,
Annals of Economic and Social Measurement, 5, 475-492.
Heckman, J. J., H. Ichimura, and P. E. Todd (1997). Matching as an Econometric Evaluation
Estimator: Evidence from Evaluating a Job Training Programme, Review of Economic
Studies, 64, 605-654.
Heckman, J. J., H. Ichimura, and P. E. Todd (1998). Matching as an Econometric
Evaluation Estimator, Review of Economic Studies, 65, 261-294.
Heckman, James J. (1996). Comment, Journal of the American Statistical Association,
91, 459-462.
Heckman, James J. (1996). Randomization as an Instrumental Variable, The Review of
Economics and Statistics, 78, 336-341.
Heckman, J., J. L. Tobias, and E. Vytlacil (2003). Simple Estimators for Treatment
Parameters in a Latent-Variable Framework, Review of Economics and Statistics, 85,
748-755.
Hirano, K., G. W. Imbens, and G. Ridder (2003). Efficient Estimation of Average
Treatment Effects Using the Estimated Propensity Score, Econometrica, 71, 1161-1189.
Holland, P. W. (1986). Statistics and Causal Inference, Journal of the American Statistical
Association, 81, 945-960.
Imbens, G. W. and J. Angrist (1994). Identification and Estimation of Local Average
Treatment Effects, Econometrica, 62, 467-475.
Imbens, G. W. and D. B. Rubin (1997). Estimating Outcome Distribution for Compliers
in Instrumental Variables Models, The Review of Economic Studies, 64, 555-574.
Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects under
Exogeneity: A Review, The Review of Economics and Statistics, 86, 4-29.
Imbens, G. W. (2006). Nonadditive Models with Endogenous Regressors, Working Paper.
Imbens, G. W. and W. K. Newey (2008). Identification and Estimation of Triangular
Simultaneous Equations Models Without Additivity, Working Paper.
Khan, S. and E. Tamer (2008). Irregular Identification, Support Conditions and Inverse
Weight Estimation, Working Paper.
Lee, M.-J. (2005). Micro-Econometrics for Policy, Program, and Treatment Effects.
Lechner, M. (1999). Nonparametric Bounds on Employment and Income Effects of
Continuous Vocational Training in East Germany, Econometrics Journal, 2, 1-28.
Li, K.-C. (1991). Sliced Inverse Regression for Dimension Reduction (with discussion),
Journal of the American Statistical Association, 86, 316-327.
Manski, C. F. (1989). Anatomy of the Selection Problem, Journal of Human Resources,
24, 343-360.
Manski, C. F. (1990). Nonparametric Bounds on Treatment Effects, AER Papers and
Proceedings, 80, 319-323.
Manski, C. F., G. Sandefur, S. McLanahan, and D. Powers (1992). Alternative Estimates
of the Effect of Family Structure During Adolescence on High School Graduation,
Journal of the American Statistical Association, 87, 25-37.
Manski, C. F. (1994). The Selection Problem, in Advances in Econometrics, Sixth World
Congress, Cambridge University Press.
Manski, C. F. (1997). Monotone Treatment Response, Econometrica, 65, 1311-1334.
Manski, C. F. and D. Nagin (1998). Bounding Disagreements about Treatment Effects:
A Case Study of Sentencing and Recidivism, Sociological Methodology, 28, 99-137.
Manski, C. F. and J. V. Pepper (2000). Monotone Instrumental Variables: With an
Application to the Returns to Schooling, Econometrica, 68, 997-1010.
Manski, C. F. and E. Tamer (2002). Inference on Regressions with Interval Data on a
Regressor or Outcome, Econometrica, 70, 519-546.
Manski, C. F. (2003). Partial Identification of Probability Distributions, Springer-Verlag.
Manski, C. F. (2007). Partial Identification in Econometrics, forthcoming in the New
Palgrave Dictionary of Economics, second edition.
Newey, W. K. and J. Powell (2003). Instrumental Variable Estimation of Nonparametric
Models, Econometrica, 71, 1565-1578.
Pagan, A. and A. Ullah (2005). Nonparametric Econometrics.
Pearl, J. (2001). Causality.
Pepper, J. V. (2000). The Intergenerational Transmission of Welfare Receipt: A
Nonparametric Bounds Analysis, Review of Economics and Statistics, 82, 472-488.
Pepper, J. V. (2003). Using Experiments to Evaluate Performance Standards: What Do
Welfare-to-Work Demonstrations Reveal to Welfare Reforms? Journal of Human
Resources, 38, 860-880.
Rosenbaum, P. R. (1996). Comment, Journal of the American Statistical Association,
91, 465-468.
Rosenbaum, P. R. (2002). Observational Studies, New York: Springer-Verlag.
Rosenbaum, P. R. and D. B. Rubin (1983). The Central Role of the Propensity Score
in Observational Studies for Causal Effects, Biometrika, 70, 41-55.
Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and
Nonrandomized Studies, Journal of Educational Psychology, 66, 688-701.
Shaikh, A. M., M. Simonsen, E. J. Vytlacil, and N. Yildiz (2005). On the Identification
of Misspecified Propensity Score, Working Paper.
Shaikh, A. M. and E. J. Vytlacil (2005). Threshold Crossing Models and Bounds on
Treatment Effects: A Nonparametric Analysis, Working Paper.
Smith, J. A. and P. E. Todd (2001). Reconciling Conflicting Evidence on the
Performance of Propensity Score Matching Methods, The American Economic Review, 91,
112-118.
Vytlacil, E. (2003). Dummy Endogenous Variables in Nonseparable Models, Working
Paper.
Vytlacil, E. and N. Yildiz (2007). Dummy Endogenous Variables in Weakly Separable
Models, Econometrica, 75, 757-779.