Treatment effects


YU-WEI HSIEH

New York University

First Draft: Sep 1, 2009

© Yu-Wei Hsieh (all rights reserved).

E-mail: yuwei.hsieh@nyu.edu

Contents

1 Introduction to Average Treatment Effects
1.1 Rubin's Statistical Causality Model
1.2 Selection Bias
1.3 Identification and Estimation under Exogeneity
1.3.1 Regression
1.3.2 Matching
1.3.3 OLS vs. Matching
1.4 Propensity Score
1.4.1 Specification Testing of the Propensity Score
1.4.2 Regression as a Dimension Reduction Method
1.4.3 Propensity Score Weighting Estimator

2 Quantile Treatment Effects
2.1 Quantile Regression
2.2 Weighting Estimator

3 Instrumental Variables I: Local Treatment Effects
3.1 Instrumental Variable: A Review
3.2 Restrictions on the Selection Process: Local Treatment Effects
3.3 Case Studies
3.3.1 Vietnam Draft Lottery
3.3.2 Randomized Eligibility Design
3.4 Other Identified Features
3.5 Non-binary Treatments and Instruments
3.6 Nonparametric Estimation of LTE with Covariates
3.7 Parametric and Semiparametric Estimation of LTE with Covariates

4 Difference-in-Difference Designs
4.1 Linear Difference-in-Difference Model
4.2 Nonparametric Difference-in-Difference Models
4.3 Nonlinear Difference-in-Difference
4.4 The Change-in-Change Model
4.5 Quantile Difference-in-Difference

5 Nonparametric Bounding Approaches
5.1 No-assumption Bound
5.3 Restrictions on the Moment Conditions: Monotone Instrumental Variables
5.4 Monotone Treatment Selection
5.5 Shape Restrictions on the Response Functions: Monotone Treatment Response
5.6 Restrictions on the Selection Mechanism
5.7 Some Remarks

References

1 Introduction to Average Treatment Effects

Suppose a national health insurance company launches a new reimbursement scheme, "quality payment," for physicians, linking their salary to their patients' health outcomes. If patients' health status improves, the insurance company pays the physicians an extra bonus. By providing proper financial incentives, the scheme encourages physicians to treat patients more carefully, leading to a higher cure rate. In this case, the new payment program is called a treatment, while the cure rate is called the response. The group of subjects receiving the treatment is called the treatment group, while the group receiving the control treatment (typically, no treatment) is called the control or comparison group. We want to know whether a treatment has an impact on the response variable. If yes, does the treatment cause a positive or negative effect? In this section we introduce the statistical causality model, proposed by Rubin (1974), to quantify the effects of a certain treatment. More discussion of this model can be found in Holland (1986).

1.1 Rubin's Statistical Causality Model

For subject $i$, the Bernoulli random variable $Y_{1i}$ represents whether the patient is cured or not if he goes to a hospital that participates in the quality payment program (treatment group); $Y_{0i}$ represents the same outcome if he goes to a hospital that does not participate in that program. We call $Y_{1i}$ and $Y_{0i}$ the potential responses. Intuitively, the individual treatment effect on subject $i$ can be defined as $Y_{1i} - Y_{0i}$. If $Y_{1i} - Y_{0i} > 0$ then the treatment has a positive effect on the subject's health status: it makes the subject fully recover from the illness. Let $D_i = 1$ if subject $i$ is in the treatment group, and $D_i = 0$ if subject $i$ is in the control group. The treatment indicator $D_i$ is called the observed treatment, indicating whether unit $i$ receives the treatment or not. Define the observed response $Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i}$. We observe $Y_{1i}$ if unit $i$ is in the treatment group ($D_i = 1$), and we observe $Y_{0i}$ if unit $i$ is in the control group ($D_i = 0$). However, unit $i$ cannot be assigned to the treatment and the control group at the same time: at any specific time, unit $i$ is either treated or untreated. Because we can only observe either $Y_{1i}$ or $Y_{0i}$, we face a missing data problem in which up to 50% of the data are missing. It is therefore impossible to identify the individual effect. The unobservable missing outcome is termed the counterfactual outcome. For example, if we observe $Y_{1i}$, then $Y_{0i}$ is the counterfactual.

Sometimes it is cumbersome for the policy maker to learn the individual effect for each subject. Instead, we are interested in summary statistics, such as the average treatment effect:

$$\mathrm{ATE} = E[Y_1 - Y_0].$$

The ATE not only describes the average effect of the treatment on subjects, but also turns the impossible-to-identify individual effect into an estimable statistical quantity: we can use information contained in the sampling process to learn the average effect without knowing the individual effect for each unit. However, the missing data problem raises an identification problem for the ATE, because we want to learn features of $(Y_1, Y_0, D)$ while only $(Y, D)$ is observed. In this lecture note, several identification strategies will be discussed under different assumptions about selection mechanisms, exclusion restrictions, sources of exogenous variation, functional form restrictions, and heterogeneity. To identify the ATE, the mechanism by which units are selected into the treatment group lies at the heart of treatment effect analysis. We now introduce an identification condition for the ATE. Suppose the treatment assignment $D$ satisfies:

Assumption 1.1 $(Y_1, Y_0) \perp D$, where $\perp$ denotes independence.

Assumption 1.1 is an exogeneity condition. For example, a randomized experiment automatically satisfies this condition. Another example satisfying this assumption is a new government program in which some people are required to participate: since these people cannot choose whether to join the program, $D$ is independent of $(Y_1, Y_0)$. This situation is termed a quasi-experiment or natural experiment in the literature. Under this assumption, the ATE can be identified by the group mean difference $E[Y|D=1] - E[Y|D=0]$, where $E[Y|D=1]$ is the mean of the treatment group and $E[Y|D=0]$ is the mean of the control group. Since $Y = D Y_1 + (1 - D) Y_0$, we have:

$$E[Y|D=1] - E[Y|D=0] = E[Y_1|D=1] - E[Y_0|D=0],$$

and by Assumption 1.1,

$$E[Y_1|D=1] - E[Y_0|D=0] = E[Y_1] - E[Y_0] = E[Y_1 - Y_0] = \mathrm{ATE}. \quad (1)$$
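The identification result in (1) is easy to verify numerically. Below is a minimal sketch (all numbers are illustrative assumptions, not taken from the text): under random assignment, the group mean difference recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y0 = rng.normal(0.0, 1.0, n)      # potential response without treatment
y1 = y0 + 2.0                     # constant individual effect: true ATE = 2
d = rng.integers(0, 2, n)         # randomized assignment => (Y1, Y0) independent of D
y = d * y1 + (1 - d) * y0         # observed response Y = D*Y1 + (1-D)*Y0

# Group mean difference identifies the ATE under Assumption 1.1
ate_hat = y[d == 1].mean() - y[d == 0].mean()
```

With this many draws, `ate_hat` lands close to the true value of 2.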

Here is an important implication of Assumption 1.1. Since $(Y_1, Y_0) \perp D$, we have $E[Y_0] = E[Y_0|D=0] = E[Y_0|D=1]$. $E[Y_0|D=1]$ is the average response of the units in the treatment group, had they not been treated. However, one can only observe $Y_1$ in the treatment group; therefore $E[Y_0|D=1]$ is counterfactual. Under Assumption 1.1, we can use the observable $E[Y_0|D=0]$ to impute the counterfactual $E[Y_0|D=1]$. This is because Assumption 1.1 guarantees that the treatment group and the control group are so similar that we can compare one with the other: we can use information from the control group to impute the counterfactual $Y_0$ of the treatment group, and likewise use information from the treatment group to impute the counterfactual $Y_1$ of the control group. In fact, a weaker condition is sufficient to identify the ATE:

Assumption 1.2 (mean independence)
$$E[Y_0 \mid D] = E[Y_0], \quad \text{and} \quad E[Y_1 \mid D] = E[Y_1].$$

When the parameter of interest is a quantile treatment effect, or when we want to estimate the asymptotic variance of the ATE, weaker conditions like Assumption 1.2 are not enough. Therefore, throughout this lecture we will impose the stronger conditions, even where the weaker ones suffice for identification. One should note that $E[Y_0|D=0]$ does not necessarily equal $E[Y_0|D=1]$; that is, the control group need not be a good proxy for the treatment group. The identification condition thus governs the mechanism by which units are selected into the treatment group, and it determines whether the experiment is statistically sound. The guiding question of treatment effect analysis is under what conditions the treatment group and the control group are similar enough that we can "compare the comparable." To be specific, comparing the comparable means an imputation procedure for the counterfactual that also removes the selection bias. Almost all estimators discussed in this note intrinsically implement this principle.

1.2 Selection Bias

However, in most social science studies we can hardly obtain data generated by randomized experiments. Even in the natural sciences, a random sample is sometimes unavailable. For example, suppose we want to study whether a certain virus is lethal. It is against ethics and the law to conduct a randomized experiment that infects some people and then studies the mortality rate. Instead of collecting data from a randomized experiment controlled by an experimenter, in many cases we conduct a nonrandomized observational study. The challenge of an observational study is that the treatment assignment $D_i$ may depend on other factors that also affect the response variable. Moreover, $D_i$ may be endogenous as well. Both situations invalidate the independence assumption. In our example, hospitals choose whether to participate in the program. Therefore $D_i$ is not randomly assigned, and Assumption 1.1 may be violated. To see this, note that $D_i$ may depend on the scale of the hospital, $X$: big hospitals, such as teaching hospitals, are more likely to participate in the program. One reason is that the insurance company asks these hospitals to participate. By providing better health care services, hospitals incur more cost, so participation may not be profitable; big hospitals, however, may provide health care services in a more cost-effective manner, and are therefore more likely to join the program. In other words, $D_i$ is a function of $X$. But the scale of the hospital may also affect patients' health outcomes, so $Y_i$ is a function of $X$ as well. Since $X$ is a common factor of $D_i$ and $Y_i$, the condition $(Y_1, Y_0) \perp D$ is clearly violated. $X$ is termed a confounder, covariate, pretreatment variable, or exogenous variable. If the confounder effect is not controlled for, we will find a positive ATE by using the group mean difference alone to estimate it. The idea is as follows. Big hospitals have a higher cure rate and a higher probability of participating in the program. One plausible scenario is that the program has virtually no effect on patients' health outcomes; it is simply because there are more big hospitals in the treatment group that patients treated in treatment-group hospitals show better health outcomes. The treatment and control groups are not comparable if the confounders are not controlled for.
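The hospital-scale story can be mimicked in a small simulation (a hypothetical sketch; the participation probabilities and outcome equation are assumptions chosen for illustration): the true treatment effect is zero, yet the raw group mean difference is positive because big hospitals both join more often and have better outcomes. Conditioning on the confounder removes the overt bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.integers(0, 2, n)                # confounder: 1 = big hospital
# big hospitals participate with probability 0.8, small ones with 0.2 ...
d = (rng.random(n) < np.where(x == 1, 0.8, 0.2)).astype(int)
# ... and big hospitals have better outcomes; the true treatment effect is 0
y = 1.0 * x + rng.normal(0.0, 1.0, n)

naive = y[d == 1].mean() - y[d == 0].mean()     # biased: picks up the x effect
# comparing within the same value of x removes the overt bias
within = np.mean([y[(d == 1) & (x == v)].mean() -
                  y[(d == 0) & (x == v)].mean() for v in (0, 1)])
```

Here `naive` is clearly positive while `within` is close to the true effect of zero.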

The bias created by not controlling for observable confounders is classified as overt bias. Overt bias is analogous to the omitted variable problem in regression analysis. Overt bias concerns covariates that are observable to econometricians, while hidden bias concerns unobservable covariates. Hidden bias is also known as the endogeneity problem. For example, the genetic characteristics of patients, or the talent of workers, can be viewed as unobservable covariates. Another source of hidden bias is self-selection into treatment, known as self-selection bias. For example, suppose the government launches a job training program for workers. The decision whether to join the program may be related to the benefit $Y_1 - Y_0$. This case also renders Assumption 1.1 implausible. When these biases are present, the group mean difference no longer identifies the ATE. However, as long as we can come up with some method to control for the selection biases, the ATE can still be identified. Typically, this involves understanding the data generating process of $D_i$. We discuss identification and estimation under exogeneity in the following section. The issues of hidden bias and self-selection are deferred to the sections on repeated observations, instrumental variables, and control functions.

1.3 Identification and Estimation under Exogeneity

In this section we introduce the identification and estimation problems of the ATE when there is only overt bias. The framework of this section mainly follows Imbens (2004); both Wooldridge (2001) and M.-J. Lee (2005) are also excellent references on this topic. We will show how to remove overt bias and then identify the ATE under mild exogeneity conditions. First we impose two key assumptions on the treatment assignment:

Assumption 1.3 (unconfoundedness)
$$(Y_1, Y_0) \perp D \mid X,$$

and

Assumption 1.4 (overlap or support condition)
$$0 < \mathrm{prob}(D = 1 \mid X) < 1.$$

Assumption 1.3 thus modifies Assumption 1.1, the independence assumption, into a conditional independence assumption. It means that all overt biases can be removed by conditioning on the vector of all observable confounders. Intuitively, after controlling for $X$, $D$ behaves like a random assignment. Therefore each subgroup of the treatment group and the control group, defined by the same value of the covariates, is comparable. Unconfoundedness is interchangeable with ignorable treatment (Rosenbaum and Rubin, 1983), conditional independence (Lechner, 1999), and selection on observables (Barnow, Cain, and Goldberger, 1980). Assumption 1.4 guarantees that we can compare the treatment and control groups at all. For example, suppose $X = 1$ stands for male subjects. Then $\mathrm{prob}(D = 1|X = 1) = 1$ signifies that all male subjects receive the treatment, so there are no male subjects in the control group. If gender influences the potential responses, one technique to control for the gender effect is to compare male subjects who receive the treatment with male subjects who do not, and to compare female subjects who receive the treatment with female subjects who do not. But when all male subjects are in the treatment group, we cannot control for the gender effect, and the two groups are not comparable. Next, we define some notation:

Definition 1.1 (conditional mean and conditional variance)
$$\mu(x, d) = E[Y \mid X = x, D = d], \qquad \mu_d(x) = E[Y_d \mid X = x],$$
$$\sigma^2(x, d) = V(Y \mid X = x, D = d), \qquad \sigma^2_d(x) = V(Y_d \mid X = x).$$

Under Assumption 1.3, $\mu(x, d) = \mu_d(x)$ and $\sigma^2(x, d) = \sigma^2_d(x)$. We now introduce the OLS approach to estimating the ATE, and then move to recent advances in nonparametric and semiparametric methods.


1.3.1 Regression

First we postulate the constant effect model $Y_{1i} - Y_{0i} = \tau$ for all $i$, and assume $Y_{0i} = \alpha + \beta' X_i + \epsilon_i$. Then we have:

$$Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i} = Y_{0i} + D_i (Y_{1i} - Y_{0i}) = \alpha + \tau D_i + \beta' X_i + \epsilon_i,$$

which is nothing but a dummy variable regression. It is easy to see that the OLS estimator $\hat{\tau}$ is an estimator of the ATE. Since $X_i$ is included in the regression function, we have controlled for the confounder effect. This setting also highlights the relationship between the unconfoundedness assumption and the exogeneity assumption in regression analysis: Assumption 1.3 is equivalent to $D \perp \epsilon \mid X$, characterizing the exogeneity of $D_i$. Next, we use a more general setting to study regression-based ATE estimators. Under Assumption 1.3, we have:

$$E[Y|D=1, X] - E[Y|D=0, X] = E[Y_1|D=1, X] - E[Y_0|D=0, X]$$
$$= E[Y_1|X] - E[Y_0|X] = E[Y_1 - Y_0|X] \equiv \tau(X). \quad (2)$$

By using the conditional group mean difference $E[Y|D=1, X] - E[Y|D=0, X]$, we identify the conditional ATE, $\tau(X)$. Taking the expectation with respect to the distribution of $X$, we identify the ATE:

$$E[Y_1 - Y_0] = E\big[\, E[Y_1 - Y_0 \mid X] \,\big] = E[\tau(X)]. \quad (3)$$
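Equations (2) and (3) suggest a two-step recipe: estimate the conditional contrast $\tau(X)$ cell by cell, then average over the distribution of $X$. A hedged sketch with a binary covariate and a deliberately heterogeneous effect (all values are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.integers(0, 2, n)                 # binary covariate, P(X=0)=P(X=1)=1/2
d = rng.integers(0, 2, n)                 # treatment randomized within each cell
tau_x = np.where(x == 1, 3.0, 1.0)        # heterogeneous effect: tau(0)=1, tau(1)=3
y = x + tau_x * d + rng.normal(size=n)

# Step 1: conditional group mean difference identifies tau(x) -- equation (2)
tau_hat = {v: y[(d == 1) & (x == v)].mean() - y[(d == 0) & (x == v)].mean()
           for v in (0, 1)}
# Step 2: average tau(X) over the distribution of X -- equation (3); ATE = 2
ate_hat = 0.5 * tau_hat[0] + 0.5 * tau_hat[1]
```

The cell-level contrasts recover 1 and 3, and their average recovers the ATE of 2.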

The sample counterpart of (3) is given by:

$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \big[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \big]. \quad (4)$$

$\hat{\mu}_1(X_i)$ and $\hat{\mu}_0(X_i)$ are the estimators of $E[Y|D=1, X]$ and $E[Y|D=0, X]$, respectively. In (4), the difference of the two estimated conditional mean functions estimates $\tau(X)$, while $E[\tau(X)]$ is estimated by averaging $\hat{\tau}(X_i)$ over the empirical distribution of $X$. From this expression, the estimation problem of the ATE can be viewed as the estimation problem of the conditional mean function $E[Y|D, X]$. Suppose $\mu_d(x)$ is linear in the treatment assignment and the covariates,

$$\mu_d(x) = \alpha + \tau d + \beta' x,$$

then the corresponding dummy variable regression is given by:

$$Y_i = \alpha + \beta' X_i + \tau D_i + \epsilon_i.$$


The OLS estimator then estimates the ATE:

$$\frac{1}{N} \sum_{i=1}^{N} \big[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) \big] = \frac{1}{N} \sum_{i=1}^{N} \big[ (\hat{\alpha} + \hat{\tau} + \hat{\beta}' x_i) - (\hat{\alpha} + \hat{\beta}' x_i) \big] = \hat{\tau}.$$
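In code, the dummy variable regression is a single least-squares fit, and the coefficient on $D_i$ is the ATE estimate. A minimal sketch (simulated data; all parameter values are illustrative assumptions) in which selection depends on the observed confounder:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.normal(size=n)                                   # observed confounder
d = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)   # selection depends on x
y = 0.5 + 1.5 * d + 2.0 * x + rng.normal(size=n)         # true tau = 1.5

# Regress Y on a constant, D and X; the D coefficient estimates the ATE
A = np.column_stack([np.ones(n), d, x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
tau_hat = coef[1]

# The raw group mean difference is biased upward: treated units have larger X
naive = y[d == 1].mean() - y[d == 0].mean()
```

`tau_hat` is close to 1.5, while `naive` is badly inflated by the confounder.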

Instead of specifying a dummy variable regression, we can also estimate two separate regression functions:

$$\hat{\mu}_1(x) = \hat{\alpha}_1 + \hat{\beta}_1' x \ \text{ if } D_i = 1, \qquad \hat{\mu}_0(x) = \hat{\alpha}_0 + \hat{\beta}_0' x \ \text{ if } D_i = 0.$$

One property of the OLS estimator is that the average of the predicted responses equals the average of the observed responses:

$$\sum_i D_i \hat{\mu}_1(X_i) = \sum_i D_i Y_i, \quad \text{and} \quad \sum_i (1 - D_i) \hat{\mu}_0(X_i) = \sum_i (1 - D_i) Y_i.$$

Plugging this algebraic property into (4), $\hat{\tau}$ can be decomposed into:

$$\hat{\tau} = \frac{1}{N} \sum_{i=1}^{N} \Big\{ D_i \big[ Y_i - \hat{\mu}_0(X_i) \big] + (1 - D_i) \big[ \hat{\mu}_1(X_i) - Y_i \big] \Big\}. \quad (5)$$

Many ATE estimators have the above representation, and it has a nice interpretation. For instance, if unit $i$ receives the treatment ($D_i = 1$), we in fact calculate $Y_i - \hat{\mu}_0(X_i)$ with $Y_i = Y_{1i}$. Since $Y_{1i}$ is observed, the remaining task is to impute the counterfactual $Y_{0i}$, and we impute it by $\hat{\mu}_0(X_i)$, which describes the average response of the units with covariate value $X_i$, had they not been treated.

Besides the OLS method, there is a vast literature on estimating conditional mean functions. Since OLS may raise model misspecification problems, current research pays much more attention to nonparametric and semiparametric models for the ATE. Hahn (1998) derives the efficiency bound for the ATE and proposes an efficient estimator based on series estimation.[1] Heckman, Ichimura, and Todd (1997) focus on the kernel regression approach; we outline their estimator here. Suppose $X$ is one-dimensional (the formula generalizes to higher dimensions); the kernel estimator for $\mu_d(x)$ is given by:

$$\hat{\mu}_d(x) = \frac{\sum_{i: D_i = d} Y_i \, K\big((X_i - x)/h\big)}{\sum_{i: D_i = d} K\big((X_i - x)/h\big)},$$

[1] See Pagan and Ullah (2005) for an introduction.


where $K(\cdot)$ is the kernel function and $h$ is the bandwidth; together they determine the weighting scheme. To ensure consistency, $h$ should shrink as the sample size increases, but at a rate slow enough that the effective number of observations in each local window still grows. Let $T$ denote the treatment group and $C$ the control group; the estimator can also be decomposed into the form of (5):

$$\hat{\tau} = \frac{1}{N} \sum_{i \in T} \Bigg[ Y_i - \frac{\sum_{j \in C} K\big((X_j - X_i)/h\big) Y_j}{\sum_{j \in C} K\big((X_j - X_i)/h\big)} \Bigg] + \frac{1}{N} \sum_{j \in C} \Bigg[ \frac{\sum_{i \in T} K\big((X_j - X_i)/h\big) Y_i}{\sum_{i \in T} K\big((X_j - X_i)/h\big)} - Y_j \Bigg]. \quad (6)$$

1.3.2 Matching

Matching estimates the ATE by comparing subjects with similar covariates. Unlike regression, which directly estimates the two conditional mean functions, matching directly implements the principle of comparing the comparable, though it also estimates $E[Y|D, X]$ implicitly. Before matching subjects with similar covariates, we must first define a criterion to measure similarity. We begin with the nearest neighbor matching estimator.

Nearest Neighbor Matching

For unit $i$ in the treatment group, we pick the $M$ units in the control group whose covariates are closest to $X_i$, then average the responses of these $M$ units to impute the counterfactual of unit $i$. The same procedure applies to the units in the control group. Following Abadie and Imbens (2006), we can use the Euclidean norm $\|x\| = (x'x)^{1/2}$ to measure closeness. We can also use $\|x\|_V = (x'Vx)^{1/2}$, where $V$ is a positive definite symmetric matrix. Let $j_m(i)$ be an index satisfying $D_j = 1 - D_i$ and

$$\sum_{l: D_l = 1 - D_i} 1\big\{ \|X_l - X_i\| \le \|X_j - X_i\| \big\} = m.$$

It indicates the unit in the opposite treatment group that is $m$-th closest to unit $i$ with respect to the Euclidean norm. Then define $J_M(i)$ as the set of indices of the first $M$ matches for unit $i$:

$$J_M(i) = \big\{ j_1(i), \ldots, j_M(i) \big\}.$$

The imputation procedure is given by:

$$\hat{Y}_{0i} = \begin{cases} Y_i, & \text{if } D_i = 0, \\[4pt] \dfrac{1}{M} \displaystyle\sum_{j \in J_M(i)} Y_j, & \text{if } D_i = 1, \end{cases}
\qquad
\hat{Y}_{1i} = \begin{cases} \dfrac{1}{M} \displaystyle\sum_{j \in J_M(i)} Y_j, & \text{if } D_i = 0, \\[4pt] Y_i, & \text{if } D_i = 1. \end{cases}$$

The nearest neighbor matching estimator of the ATE is:

$$\hat{\tau}_M = \frac{1}{N} \sum_{i=1}^{N} \big( \hat{Y}_{1i} - \hat{Y}_{0i} \big) = \frac{1}{N} \sum_{i=1}^{N} \Bigg\{ D_i \bigg[ Y_i - \frac{1}{M} \sum_{j \in J_M(i)} Y_j \bigg] + (1 - D_i) \bigg[ \frac{1}{M} \sum_{j \in J_M(i)} Y_j - Y_i \bigg] \Bigg\}. \quad (7)$$
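A direct transcription of (7) for a scalar covariate (simulated data; an illustrative sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4_000
x = rng.normal(size=n)
d = (rng.random(n) < 0.5).astype(int)
y = 2.0 * d + x + 0.1 * rng.normal(size=n)     # true ATE = 2

M = 3
xt, yt = x[d == 1], y[d == 1]                  # treatment group
xc, yc = x[d == 0], y[d == 0]                  # control group

def impute(x0, x_pool, y_pool):
    """Average the responses of the M nearest units in the opposite group."""
    nearest = np.argsort(np.abs(x_pool - x0))[:M]
    return y_pool[nearest].mean()

effects = [yi - impute(xi, xc, yc) for xi, yi in zip(xt, yt)]   # treated: impute Y0
effects += [impute(xi, xt, yt) - yi for xi, yi in zip(xc, yc)]  # control: impute Y1
tau_M = np.mean(effects)
```

Because the two groups share the same covariate distribution here, the matches are close and `tau_M` lands near the true ATE of 2.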

Kernel Matching

It is interesting that the estimator of Heckman, Ichimura, and Todd (1997) is not only a nonparametric regression-type estimator, but also a matching estimator: it uses the kernel function to measure closeness. To see this, let $K(\cdot)$ in (6) be the Bartlett kernel:[2]

$$K(x) = \begin{cases} 1 - |x|, & |x| \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

It imputes the counterfactual of a treated unit by:

$$\hat{Y}_{0i} = \frac{\sum_{j \in C} K\big((X_j - X_i)/h\big) Y_j}{\sum_{j \in C} K\big((X_j - X_i)/h\big)},$$

where $(X_j - X_i)$ measures the difference between the covariate of treated unit $i$ and the covariate of untreated unit $j$. If $|X_j - X_i|$ is large, i.e., $j$ is a distant observation relative to $i$ in the kernel metric, it receives a smaller weight. When $|X_j - X_i| \ge h$, it receives zero weight: $X_j$ is not included in the imputation of $\hat{Y}_{0i}$.

[2] We use the Bartlett kernel for expositional purposes only. In Heckman et al., $K$ should satisfy $\int z^r K(z)\,dz = 0$ for $r \le \dim(X)$; the Bartlett kernel obviously violates this condition.
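The kernel-matching imputation can be sketched as follows (illustrative simulated controls; `h` and all numeric values are assumptions): controls farther than $h$ from the treated unit's covariate receive exactly zero weight under the Bartlett kernel.

```python
import numpy as np

def bartlett(u):
    """Bartlett kernel: 1 - |u| on [-1, 1], zero outside."""
    return np.clip(1.0 - np.abs(u), 0.0, None)

def impute_y0(x_i, x_controls, y_controls, h):
    """Kernel-matching imputation of a treated unit's counterfactual Y0."""
    w = bartlett((x_controls - x_i) / h)   # |X_j - X_i| >= h  =>  zero weight
    return np.sum(w * y_controls) / np.sum(w)

rng = np.random.default_rng(6)
xc = rng.uniform(-1.0, 1.0, 5_000)                  # control-group covariates
yc = 2.0 * xc + rng.normal(0.0, 0.1, 5_000)         # control response surface
y0_hat = impute_y0(0.5, xc, yc, h=0.1)              # local average near x = 0.5
```

The imputed value is a local weighted average of the control responses around $x = 0.5$, close to $2 \times 0.5 = 1$.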


1.3.3 OLS vs. Matching

In this section we discuss the fundamental differences between the OLS and matching estimators of the ATE. First, OLS uses a linear model to estimate the ATE, as well as the effects of the covariates on the response variable. It may suffer from model misspecification. Matching avoids this problem, but raises the question of the user-chosen parameter $M$. In matching, the role of the covariates is to determine which unit is a good match; hence we can only identify the ATE, without learning the effects of the covariates on the response.

Second, they use different methods to remove selection biases. Matching compares units with similar covariates, directly using the unconfoundedness condition. Let $\hat{\tau}$ be the OLS estimator in the linear model $Y_i = \alpha + \tau D_i + \beta' X_i + \epsilon_i$. By the Frisch-Waugh-Lovell theorem, $\hat{\tau}$ is the estimate obtained after removing the linear influence of $X_i$ on $Y_i$ and on $D_i$.

Finally, they differ in which sample is included in the imputation. Let $X_1$ and $X_0$ denote the covariates of the treatment and the control group, respectively, and let $s(X_1)$ and $s(X_0)$ stand for the corresponding supports, where the support is defined as $s(X) = \{x : f(x) > 0\}$ and $f(x)$ is the pdf. Matching only uses the sample around the overlap of the supports, $s(X_1) \cap s(X_0)$. If there is insufficient overlap between $s(X_1)$ and $s(X_0)$, we can use only a limited sample to impute the counterfactual. This situation is termed the support problem. The extreme case is no overlap at all, i.e., $s(X_1) \cap s(X_0) = \emptyset$. For example, suppose the government launches a subsidy program in which all households with annual income less than 10,000 are required to participate. In this case we cannot match on the income variable, because all treated units are low-income families. We will discuss this issue in the section on regression discontinuity designs. By contrast, OLS uses the entire sample to estimate the ATE regardless of whether there is sufficient common support. This is not because OLS does not suffer from the support problem; rather, it assumes a linear model in order to extrapolate across it.

Because OLS has virtually no mechanism to deal with the support problem, and it also estimates the effects of the covariates on the response, it will be very sensitive to the entire distributions of $X_1$ and $X_0$. By contrast, only the common support affects the precision of the matching estimator.

1.4 Propensity Score

Without conditioning on the covariates that potentially affect $D$ and $(Y_0, Y_1)$, comparisons between the two groups will be biased. This is because the selection process induces imbalanced covariate distributions between the treatment and the control group; the two groups are comparable in the statistical sense only if their covariate distributions are the same. In the previous section we demonstrated that, under conditional unconfoundedness, overt biases can be removed by various conditioning strategies such as regression or matching. Another identification strategy, based on the balancing score and the propensity score, solves the selection bias problem by creating balanced covariate distributions between the two groups. These concepts were first introduced by Rosenbaum and Rubin (1983).

Definition 1.2 (balancing score)
A balancing score, $b(X)$, is a function of the observed covariate $X$ such that the conditional distribution of $X$ given $b(X)$ is the same for treated and control units; that is, in Dawid's (1979) notation, $X \perp D \mid b(X)$.

Conditional on a balancing score, the covariate distributions are balanced between the treatment and the control group; hence, the groups become comparable. Obviously, $X$ itself is a balancing score; it is useful if there exist balancing scores of lower dimension.

Definition 1.3 (propensity score)
The propensity score $e(x)$ is the conditional probability of receiving the treatment:
$$e(x) = P(D = 1 \mid X = x).$$

Rosenbaum and Rubin (1983) show that the propensity score is a balancing score.

Theorem 1.1 (Balancing Property)
If $D$ is binary, then $X \perp D \mid e(X)$.

Proof:
$$P(X \le x, D = 1 \mid e(X)) = E[D \cdot I_{\{X \le x\}} \mid e(X)]$$
$$= E\big[\, E[D \cdot I_{\{X \le x\}} \mid X] \,\big|\, e(X) \big] \qquad (e(X) \text{ is measurable w.r.t. } X)$$
$$= E\big[\, I_{\{X \le x\}} \, E[D \mid X] \,\big|\, e(X) \big] = e(X) \, P(X \le x \mid e(X)).$$

Moreover,
$$P(D = 1 \mid e(X)) = E[D \mid e(X)] = E\big[\, E[D \mid X] \,\big|\, e(X) \big] = E[e(X) \mid e(X)] = e(X).$$

Therefore,
$$P(X \le x, D = 1 \mid e(X)) = e(X) \, P(X \le x \mid e(X)) = P(D = 1 \mid e(X)) \, P(X \le x \mid e(X)).$$

Alternatively, we can prove this theorem by the following argument. Because $P(D = 1 \mid X, e(X)) = P(D = 1, X \mid e(X)) / P(X \mid e(X))$, it suffices to show $P(D = 1 \mid X, e(X)) = P(D = 1 \mid e(X))$. Obviously, $P(D = 1 \mid X, e(X)) = E[D \mid X, e(X)] = E[D \mid X] = e(X) = P(D = 1 \mid e(X))$.
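The balancing property can also be seen in a simulation (a hedged sketch; the logistic form and the slope are assumptions chosen for illustration): unconditionally the treated units have larger $X$, but within a narrow stratum of $e(X)$ the covariate gap all but vanishes.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-1.5 * x))            # propensity score e(X)
d = (rng.random(n) < e).astype(int)

# Unconditional imbalance: treated units have larger X on average
raw_gap = x[d == 1].mean() - x[d == 0].mean()

# Within a thin propensity-score stratum, X and D are (approximately) independent
stratum = (e > 0.45) & (e < 0.55)
stratum_gap = (x[stratum & (d == 1)].mean() -
               x[stratum & (d == 0)].mean())
```

`raw_gap` is large while `stratum_gap` is close to zero, mirroring $X \perp D \mid e(X)$.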

Note that this theorem is implied by the definition of the propensity score: no distributional assumptions or unconfoundedness conditions are needed to prove it. Rosenbaum and Rubin (1983) further show that the propensity score is the most condensed balancing score; namely, the $\sigma$-algebra induced by the propensity score is the coarsest within the class of balancing scores.

Theorem 1.2 (Most Condensed Information)
$b(X)$ is a balancing score, i.e., $X \perp D \mid b(X)$, if and only if $b(X)$ is finer than $e(X)$ in the sense that $e(X) = f(b(X))$ for some function $f$.[3]

Proof:
($\Leftarrow$): Suppose $b(X)$ is finer than $e(X)$ (it induces a finer $\sigma$-algebra). Then
$$P(D = 1 \mid b(X)) = E[D \mid b(X)] = E\big[\, E[D \mid X] \,\big|\, b(X) \big] = E[e(X) \mid b(X)] = e(X).$$
Also, $P(D = 1 \mid X, b(X)) = E[D \mid X, b(X)] = E[D \mid X] = e(X)$. Therefore $b(X)$ is a balancing score, following the same argument as in the proof of Theorem 1.1.
($\Rightarrow$): Suppose $b(X)$ is a balancing score but $b(X)$ is not finer than $e(X)$. Then there exist $x_1, x_2$ with $b(x_1) = b(x_2)$ but $e(x_1) \neq e(x_2)$. This implies $P(D = 1 \mid X = x_1) \neq P(D = 1 \mid X = x_2)$, which means that $D$ and $X$ are not conditionally independent given $b = b(x_1) = b(x_2)$, a contradiction.

Conditioning on a balancing score equalizes the covariate distributions of the treated and control units. Intuitively, the selection problem is resolved because the treatment group and the control group become comparable after conditioning on $b(X)$. Indeed, we have a formal statement of this intuition:

Theorem 1.3 (Conditional Unconfoundedness)
Suppose Assumptions 1.3 and 1.4 hold. Then $(Y_0, Y_1) \perp D \mid b(X)$. Namely, instead of conditioning on the entire covariate vector $X$, conditioning solely on $b(X)$ suffices to remove the selection biases.

[3] $f(X)$ can only reduce the information in $X$: equal values of $x$ always give equal values of $f(x)$, but different values of $x$ may map to the same $f(x)$. Therefore $f(X)$ can only induce a coarser $\sigma$-algebra. For example, if $f(\cdot)$ is a constant function, then $\sigma(f(X))$ is the trivial $\sigma$-algebra.

Proof: By Bayes' rule, $P(D = 1, Y_0, Y_1 \mid b(X)) = P(D = 1 \mid Y_0, Y_1, b(X)) \, P(Y_0, Y_1 \mid b(X))$, so it suffices to show $P(D = 1 \mid Y_0, Y_1, b(X)) = P(D = 1 \mid b(X))$.

$$\begin{aligned}
P(D = 1 \mid Y_0, Y_1, b(X)) &= E[D \mid Y_0, Y_1, b(X)] \\
&= E\big[\, E[D \mid Y_0, Y_1, X, b(X)] \,\big|\, Y_0, Y_1, b(X) \big] && \text{(law of iterated expectations)} \\
&= E\big[\, E[D \mid Y_0, Y_1, X] \,\big|\, Y_0, Y_1, b(X) \big] && \text{($b(X)$ is $X$-measurable)} \\
&= E\big[\, E[D \mid X] \,\big|\, Y_0, Y_1, b(X) \big] && \text{(by conditional unconfoundedness)} \\
&= e(X) && \text{($b(X)$ is finer than $e(X)$)}
\end{aligned}$$

Recall that in the proof of Theorem 1.2 we showed $P(D = 1 \mid b(X)) = e(X)$. Therefore $P(D = 1 \mid Y_0, Y_1, b(X)) = P(D = 1 \mid b(X))$.

Why e(X) can remove selection biases is because it adjusts the imbalance of co-

variates of the treatment and control group, making them comparable. In the random

experiment, randomization automatically balance the covariate distributions of the two

groups. Matching has an implicit mechanism to balance the covariates because it only

matches units with similar covariates. However, the extent of imbalance of covariates

does determine how many samples can be used.

Propensity score was first introduced in Rosenbaum and Rubin (1983) as a dimension reduction technique. Nonparametric estimation of ATE, though appealing because no constant effect assumption is imposed, is difficult to implement due to the curse of dimensionality. Typically, as the number of continuous covariates increases, the nonparametric estimator converges at a slower rate. Abadie and Imbens (2006) proved that the dimension of the continuous covariates affects the convergence rate of the nearest neighbor matching estimator: the higher the dimension, the slower the convergence rate. Intuitively, if the dimension of X is high, it is more difficult to find a good match, because the number of covariates is somewhat like the number of restrictions to be satisfied. If there are a lot of restrictions, only a few qualified samples are available to impute the counterfactual. Since the propensity score is only one-dimensional and, by theorem 1.3, it can remove overt biases just like conditioning on X, it has become the most popular approach in empirical research. To implement it, just regress or match on the propensity score instead of the original covariates. For example, Heckman, Ichimura, and Todd (1997) use the propensity score to implement their estimator to avoid high-dimensional nonparametric regression. Consult Imbens (2004), Dehejia and Wahba (1999, 2002), Dehejia (2005), and Smith and Todd (2001) for more details.⁴

Although propensity score matching has become a quite popular tool since the influential paper of Dehejia and Wahba (1999), there are some potential problems with this dimension reduction approach. Typically e(X) is unknown and must be estimated. When we match or regress on the estimated e(X), it becomes a two-stage estimator. To the author's knowledge, how to calculate the asymptotic variance for such estimators is still an area of ongoing research; most papers derive their asymptotic variance assuming the propensity score is known. Secondly, there is a methodological paradox in using the propensity score. The propensity score was originally introduced as a dimension reduction technique because estimating E[Y|D, X] suffers from the curse of dimensionality.⁵ Still, estimating e(X) = E[D|X] suffers from the curse of dimensionality too. It is not so clear (at least to me) that the propensity score indeed achieves the goal of dimension reduction.

1.4.1 Specification Testing of the Propensity Score

Researchers tend to specify a parametric model for the propensity score; e.g., a logit regression. Shaikh, Simonsen, Vytlacil, and Yildiz (2005) provide a specification test for the propensity score. Let f(e), f_1(e), f_0(e) be the pdfs of e(X), e(X)|D = 1, and e(X)|D = 0, respectively. We have the following testable restriction:

Lemma 1.1 Assume 0 < P(D = 0) < 1. Then for all 0 < e < 1 with e ∈ support(e(X)), we have

f_1(e) / f_0(e) = [ P(D = 0) / P(D = 1) ] · e / (1 − e).

proof: First note that

P(D = 1, e(X) = e) = P(D = 1 | e(X) = e) P(e(X) = e) = e f(e)

(recall that P(D = 1 | e(X)) = e(X)). Also note that

P(D = 1, e(X) = e) = P(e(X) = e | D = 1) P(D = 1) = f_1(e) P(D = 1).

Combining these two expressions, we have f_1(e) P(D = 1) = e f(e). Analogously, we also have f_0(e) P(D = 0) = (1 − e) f(e). Dividing the first equation by the second yields the result.

Shaikh et al. develop a testing procedure by exploiting this restriction.
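As a quick sanity check on Lemma 1.1 (my own illustration with made-up numbers, not part of Shaikh et al.'s procedure), one can simulate a known logit propensity score and compare the estimated density ratio, computed from binned counts, with the theoretical restriction bin by bin:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))   # true (logit) propensity score
d = rng.binomial(1, e)

# Estimate f_1 and f_0 by histogram counts on a common grid of e-values.
bins = np.linspace(0.2, 0.8, 13)
c1, _ = np.histogram(e[d == 1], bins=bins)
c0, _ = np.histogram(e[d == 0], bins=bins)
mid = 0.5 * (bins[:-1] + bins[1:])

# f_1(e)/f_0(e) versus [P(D=0)/P(D=1)] * e/(1-e) from Lemma 1.1.
ratio_hat = (c1 / d.sum()) / (c0 / (n - d.sum()))
ratio_theory = (n - d.sum()) / d.sum() * mid / (1 - mid)
print(np.round(ratio_hat / ratio_theory, 2))   # entries should all be close to 1
```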

4. Moreover, Hahn (1998) found that the propensity score is also closely related to the efficiency bound of ATE; see also Chen, Hong, and Tarozzi (2004).

5. Also recall that the curse of dimensionality mainly refers to settings with many continuous covariates. If X are all discrete, perhaps there is no need to do dimension reduction.


1.4.2 Regression as a Dimension Reduction Method

Here I would like to make a digression to discuss the relationship between regression and dimension reduction, because it provides another interpretation of the balancing property of the propensity score. Suppose we want to study the relationship between the response variable Y and the predetermined variables X. The dimension of X is high, so we should employ some dimension reduction technique before running the regression. A naive way is to do principal component analysis on X. Principal component analysis will find vectors γ_1, γ_2, ..., γ_k such that the new variables Xγ_1, Xγ_2, ..., Xγ_k can best summarize the information contained in the original data X. Then we build the regression model linking Y and Xγ_1, Xγ_2, ..., Xγ_k, instead of the original X. In particular, if a single direction is already informative in describing X and its relationship with Y, we arrive at the familiar linear model Y_i = X_i′β + ε_i. Why does parametric regression not suffer from the dimensionality problem? In light of the above argument, this is because parametric regression itself can be viewed as a dimension reduction technique.

Assume that ε_i is i.i.d. with mean zero; then Y ⊥ X | X′β. Conditional on the systematic part X′β, the remaining variation in Y is driven only by the noise term. Hence Y is independent of X given X′β, and the subspace spanned by β is called the dimension reduction space; see Cook (1998) for more details. Now return to the propensity score case. Think of the propensity score E[D|X] = G(X′β), as in a logit or probit regression. Analogously, after conditioning on the systematic component G(X′β), the stochastic property of D is driven by the noise term and hence is independent of X. Namely, D ⊥ X | e(X). The balancing score is, in fact, the sufficient reduction in the statistics literature on dimension reduction.

A final remark: doing principal component analysis on X and then running the regression may not be good practice. Because PCA on X only finds linear combinations of X that best describe the variation of X, it does not guarantee that such linear combinations best describe the relationship between Y and X. The sliced inverse regression developed by Li (1991) is a method that incorporates the information of Y when doing the PCA on X. Similarly, the propensity score does not incorporate the information of Y, so it may not be an optimal dimension reduction method for studying treatment effects.
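A tiny simulation (my own toy design, not from the references above) illustrates the remark: when most of the variance of X lies in a direction irrelevant to Y, the first principal component explains Y poorly, while the low-variance but Y-relevant direction explains it well.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
# X has most of its variance along the first coordinate...
x1 = rng.normal(scale=3.0, size=n)
x2 = rng.normal(scale=1.0, size=n)
X = np.column_stack([x1, x2])
# ...but Y depends only on the low-variance second coordinate.
y = 2.0 * x2 + rng.normal(scale=0.5, size=n)

# First principal component of X (direction of maximal variance).
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ vt[0]

def r2(z, y):
    # R^2 from a simple regression of y on z (with intercept).
    zc, yc = z - z.mean(), y - y.mean()
    b = (zc @ yc) / (zc @ zc)
    resid = yc - b * zc
    return 1 - resid @ resid / (yc @ yc)

print(round(r2(pc1, y), 3))   # PC1 explains almost none of Y
print(round(r2(x2, y), 3))    # the Y-relevant direction explains most of Y
```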


1.4.3 Propensity Score Weighting Estimator

Besides conditioning on e(X) and using the various conditioning strategies, the propensity score can be used to construct the propensity score weighting (or inverse probability weighting) estimator for ATE. It is based on the following identification result:

Lemma 1.2 (Propensity Score Weighting)
Under Assumptions 1.3 and 1.4,

E[ DY / e(X) ] = E[Y_1], and E[ (1 − D)Y / (1 − e(X)) ] = E[Y_0].

proof: Assumption 1.4 is required to guarantee that the above objects are well-defined, because e(X) is in the denominator.

E[ DY / e(X) ] = E[ E[ DY / e(X) | X ] ]
= E[ (1/e(X)) E[DY | X] ]
= E[ (1/e(X)) E[1 · Y | X, D = 1] P(D = 1|X) ]
= E[ E[Y_1 | X, D = 1] ]
= E[ E[Y_1 | X] ]
= E[Y_1].

Hirano, Imbens, and Ridder (2003) estimate e(X) by a series estimator. We can obtain an efficient estimator for ATE by modifying the above lemma; see Hirano et al. for details. More theoretical properties of the propensity score weighting estimators can be found in Chen, Hong, and Tarozzi (2004), and Khan and Tamer (2008). To better understand this identification result, let's consider a simplified version of lemma 1.2. Suppose the covariates have no impact on the potential outcomes and the treatment assignment (assumption 1.1 holds); then e(X) = P(D = 1) for all X. Also note that E[DY] = E[1 · Y | D = 1] P(D = 1) = E[Y_1] P(D = 1). Clearly E[Y_1] = E[ DY / P(D = 1) ], and the sample counterpart is

[ (1/N) Σ_{i=1}^N D_i Y_i ] / [ (1/N) Σ_{i=1}^N D_i ] = Σ_{i: D_i = 1} Y_i / N_1.

This is nothing but the sample average of Y_i for the treatment group. Under assumption 1.1 this estimator is consistent for E[Y_1]. Because D is an indicator random variable, multiplying Y by D means that we want to calculate E[Y_1]. However, due to the missing data problem, E[DY] equals E[Y_1] times a P(D = 1) term. This is because the expectation operator here is taken over the whole population (Σ_i D_i Y_i is divided by N), whereas only the treated units contribute when estimating E[Y_1]. Therefore, it should be divided by the sample size of the treatment group, N_1, not N. Propensity score weighting is just a way to recover the correct sample size.
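A small Monte Carlo sketch of Lemma 1.2 (my own toy design, with the true e(X) treated as known): under confounding, the naive group-mean difference is biased, while the weighted means recover E[Y_1] − E[Y_0].

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
x = rng.normal(size=n)

# Confounded design: X shifts both the treatment probability and the outcomes.
e = 1.0 / (1.0 + np.exp(-1.5 * x))           # true propensity score
d = rng.binomial(1, e)
y1 = 2.0 + x + rng.normal(size=n)            # E[Y1] = 2
y0 = 0.0 + x + rng.normal(size=n)            # E[Y0] = 0, so ATE = 2
y = d * y1 + (1 - d) * y0

naive = y[d == 1].mean() - y[d == 0].mean()  # biased: compares unlike groups
ipw = np.mean(d * y / e) - np.mean((1 - d) * y / (1 - e))
print(round(naive, 2), round(ipw, 2))        # ipw should be near 2; naive is larger
```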

M.-J. Lee (2005) provides a more formal interpretation of weighting estimators. Suppose we want to estimate E[Y] = ∫ y f(y) dy, but the data (Y_i)_{i=1}^N are now sampled from a density g(y) instead of f(y). Calculating the mean directly will not yield a consistent estimate. However, even though the data are sampled from the wrong density g(y), it is still possible to calculate E[Y] by importance sampling:

∫ y f(y) dy = ∫ y [ f(y)/g(y) ] g(y) dy = ∫ [ y / r(y) ] g(y) dy,

where r(y) = g(y)/f(y). N^{-1} Σ_i y_i / r(y_i) is consistent for E[Y]. The role of r(y_i) here is similar to that of the propensity score weights. The intuition is that if we know the selection process, we may be able to recover the original density f(y); the propensity score is just a way to describe the selection process in a statistical sense.
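A minimal numerical sketch of this importance-sampling idea (my own toy example): draws come from the wrong density g = N(0, 1), yet reweighting each draw by f(y)/g(y) = 1/r(y) recovers the mean of the target density f = N(1, 1).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Target density f = N(1,1); we only get draws from the "wrong" density g = N(0,1).
y = rng.normal(0.0, 1.0, size=n)

def phi(z, mu):
    # Normal(mu, 1) density.
    return np.exp(-0.5 * (z - mu) ** 2) / np.sqrt(2 * np.pi)

w = phi(y, 1.0) / phi(y, 0.0)    # f(y)/g(y) = 1/r(y)

wrong = y.mean()                 # estimates E_g[Y] = 0, not E_f[Y]
iw = np.mean(y * w)              # importance-weighted estimate of E_f[Y] = 1
print(round(wrong, 3), round(iw, 3))
```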

2 Quantile Treatment Effects

Definition 2.1 Quantile Function
Q_Y(τ) = F^{-1}(τ) = inf{ y : F(y) ≥ τ }

Theorem 2.1 Under assumption ?, the marginal distributions of Y_1 and Y_0 are identified.

proof: WLOG, suppose E[|g(Y_1)|] < ∞.

E[ E[g(Y) | D = 1, X] ] = E[ E[g(Y_1) | D = 1, X] ]   (by def. of Y)
= E[ E[g(Y_1) | X] ]   (by conditional independence)
= E[ g(Y_1) ]   (X is observable so we can integrate it out).

Let g(Y) = I{Y ≤ y}; then E[g(Y_1)] = F_1(y). Choosing y ∈ (−∞, ∞), we can trace out the entire distribution function F_1 of Y_1. Therefore, the quantile function is also identified.

Definition 2.2 Quantile Treatment Effect
The quantile treatment effect at the τ-th quantile is defined as: Q_{Y_1}(τ) − Q_{Y_0}(τ).


The difference of the two quantile functions equals the horizontal distance between the two distribution functions. Let Δx be the horizontal distance between F_0 and F_1 at the τ-th quantile:

τ = F_0(x) = F_1(x + Δx),
x + Δx = F_1^{-1}( F_0(x) ) = F_1^{-1}(τ),
Δx = F_1^{-1}(τ) − x = F_1^{-1}(τ) − F_0^{-1}(τ) = Q_{Y_1}(τ) − Q_{Y_0}(τ).

As in the previous section, there are two classes of estimation strategies for QTE under conditional unconfoundedness: quantile regression, which is based on conditioning, and propensity score weighting.

2.1 Quantile Regression

Definition 2.3 Quantile Regression
If Q_Y(τ|X) = X′β(τ), then β(τ) = argmin_β E[ ρ_τ(Y − X′β) ], where ρ_τ(Y − X′β) = ( τ − I{Y − X′β < 0} )(Y − X′β).

2.2 Weighting Estimator

In this section we introduce the weighting estimator for QTE developed by Firpo (2007), which is able to directly estimate the unconditional QTE. Recall that E[ DY / e(X) ] = E[Y_1]. Not only for Y: this is also true for any measurable function g(Y).

Lemma 2.1 Under Assumptions 1.3 and 1.4,

E[ Dg(Y) / e(X) ] = E[g(Y_1)], and E[ (1 − D)g(Y) / (1 − e(X)) ] = E[g(Y_0)].

proof:

E[ Dg(Y) / e(X) ] = E[ E[ Dg(Y) / e(X) | X ] ]
= E[ (1/e(X)) E[Dg(Y) | X] ]
= E[ (1/e(X)) E[1 · g(Y) | X, D = 1] P(D = 1|X) ]
= E[ E[g(Y_1) | X, D = 1] ]
= E[ E[g(Y_1) | X] ]
= E[ g(Y_1) ].


By properly choosing the function g(·), we can obtain the moment conditions for the quantile functions:

Corollary 2.1 Let g(Y) = I{Y ≤ Q_{Y_1}(τ)}; then E[ Dg(Y) / e(X) ] = E[ I{Y_1 ≤ Q_{Y_1}(τ)} ] = τ.
Let g(Y) = I{Y ≤ Q_{Y_0}(τ)}; then E[ (1 − D)g(Y) / (1 − e(X)) ] = E[g(Y_0)] = E[ I{Y_0 ≤ Q_{Y_0}(τ)} ] = τ.

The quantile function can be estimated by solving a weighted quantile regression problem:

Q̂_{Y_j}(τ) = argmin_q (1/N) Σ_{i=1}^N ω_{j,i} ρ_τ(Y_i − q), where ω_{1,i} = D_i / e(X_i); ω_{0,i} = (1 − D_i) / (1 − e(X_i)).

For example, the FOC for Q̂_{Y_1} is:

(1/N) Σ_{i=1}^N [ D_i / e(X_i) ] ( τ − I{Y_i ≤ Q̂_{Y_1}(τ)} ) = 0,

which is the sample analog of the moment condition for Q_{Y_1}(τ). Firpo (2007) suggests the following procedure: first estimate the propensity score e(X) nonparametrically; then plug in the estimated e(X) and solve the weighted quantile regression problem. Q̂_{Y_1}(τ) − Q̂_{Y_0}(τ) is the estimated unconditional τ-th QTE.
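A stylized implementation of this two-step idea (my own sketch; for simplicity the true e(X) is plugged in rather than estimated nonparametrically, and the weighted quantile is read off the weighted empirical CDF rather than obtained by numerically minimizing the check function, which solves the same moment condition):

```python
import numpy as np

def weighted_quantile(y, w, tau):
    """tau-th quantile of the distribution putting (normalized) weight w_i on y_i."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)
    return y[np.searchsorted(cdf, tau)]   # inf{y : F_hat(y) >= tau}

rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))              # true propensity score (known here)
d = rng.binomial(1, e)
y1 = 1.0 + x + rng.normal(size=n)         # Y1 ~ N(1, 2), median 1
y0 = x + rng.normal(size=n)               # Y0 ~ N(0, 2), median 0
y = d * y1 + (1 - d) * y0

tau = 0.5
q1 = weighted_quantile(y[d == 1], 1.0 / e[d == 1], tau)
q0 = weighted_quantile(y[d == 0], 1.0 / (1.0 - e[d == 0]), tau)
print(round(q1 - q0, 2))   # unconditional median treatment effect, should be near 1
```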

3 Instrumental Variables I: Local Treatment Effects

Suppose we want to study the return to schooling. A naive way to conduct such a study is to regress individuals' wage Y on their education level D. However, the education level may be confounded by individuals' unobservable ability: a high-ability person may have higher earning potential as well. Therefore, the estimated return to schooling may be inflated. Technically, D is correlated with the error term ε, and OLS does not yield a consistent estimate in such a situation. This is a well-known problem called hidden bias or endogeneity. Since a variable representing ability is in general unavailable or subject to measurement error, we cannot employ the conditioning strategies discussed in the previous sections. There are several ways to tackle the endogeneity problem; one is the instrumental variable approach.

3.1 Instrumental Variable: A Review

A variable Z is an instrumental variable if it affects the treatment assignment D (inclusion restriction) but does not affect the response Y directly (exclusion restriction): Z affects Y only through D. The causal diagram of the IV setup is:


Z → D → Y

Under this causal relationship, the IV effectively induces exogenous variation in D. In our return-to-schooling case, now suppose Z stands for an exogenously determined loan policy. In Taiwan, student loan policy is determined by the government. Z will affect the education level D because students have to consider the cost of education when making their decision. Moreover, Z will not directly affect earning potential. Therefore, variation in Z causes exogenous variation in D, and we can utilize such exogenous variation to identify the ATE of D on Y. Since Z affects Y only through D, we can decompose the "effect of Z on Y" into the "effect of Z on D" times the "effect of D on Y". According to our causal diagram, we are able to estimate the effect of Z on D as we did in section 2. Therefore, dividing the "effect of Z on Y" by the "effect of Z on D", we obtain the "effect of D on Y", the parameter of interest. For expositional purposes, from now on assume both D and Z are binary. Consider the following estimator of ATE that implements the above intuition:

( E[Y|Z = 1] − E[Y|Z = 0] ) / ( E[D|Z = 1] − E[D|Z = 0] ).

That is, the ATE of Z on Y divided by the ATE of Z on D. Several identification assumptions can give this estimator (henceforth IVE) a causal interpretation. First consider the constant effect model, which is standard in the IV literature:

Y_i = β_0 + β_1 D_i + ε_i,
D_i = α_0 + α_1 Z_i + v_i,
(ε_i, v_i) ⊥ Z_i, α_1 ≠ 0, and E[v_i] = E[ε_i] = 0.

α_1 ≠ 0 and v_i ⊥ Z_i capture the idea that Z affects D and is exogenous. Z is not in the Y equation, and (ε_i, v_i) ⊥ Z_i ensures that Z affects Y only through D. We can plug the D equation into the Y equation:

Y_i = β_0 + β_1(α_0 + α_1 Z_i + v_i) + ε_i = (β_0 + β_1 α_0) + β_1 α_1 Z_i + (β_1 v_i + ε_i).

As we emphasized in section 2, the coefficient of the dummy regressor is the ATE under the constant effect assumption. Therefore, E[Y|Z = 1] − E[Y|Z = 0] = β_1 α_1 and E[D|Z = 1] − E[D|Z = 0] = α_1, so the IVE identifies β_1, the ATE of D on Y. Under our assumptions, β_1 α_1 can be consistently estimated by regressing Y on Z, and α_1 can be consistently estimated by regressing D on Z. The resulting estimator is

IVE = [ Σ_i (z_i − z̄)(y_i − ȳ) / Σ_i (z_i − z̄)² ] · [ Σ_i (z_i − z̄)² / Σ_i (z_i − z̄)(d_i − d̄) ] = Σ_i (z_i − z̄)(y_i − ȳ) / Σ_i (z_i − z̄)(d_i − d̄),


which is nothing but the familiar IV formula in textbooks. Usually it is derived by exploiting the following moment condition:

E[Z′ε] = 0,
E[Z′(Y − Xβ)] = E[Z′Y] − E[Z′X]β = 0,
β = E[Z′X]^{-1} E[Z′Y],

where Z = [1, z_i], X = [1, d_i], Y = [y_i] are the data matrices formed by stacking the data from i = 1 to N.

We can implement the IVE by first regressing D on Z and obtaining the fitted value D̂, and then regressing Y on D̂. This procedure is the celebrated two-stage least squares (2SLS). Note that the IVE is consistent for Cov(Z, Y)/Cov(Z, D). Moreover, if Z and D are both binary, we have the following lemma.

Lemma 3.1 If Z and D are both binary, then

Cov(Z, Y) / Cov(Z, D) = ( E[Y|Z = 1] − E[Y|Z = 0] ) / ( E[D|Z = 1] − E[D|Z = 0] ).

proof:

Cov(Z, D) = E[ZD] − E[Z]E[D].
E[ZD] = E[D · 1 | Z = 1] P(Z = 1), and
E[Z]E[D] = E[D(Z + (1 − Z))] P(Z = 1)
= ( E[DZ] + E[D(1 − Z)] ) P(Z = 1)
= ( E[D|Z = 1]P(Z = 1) + E[D|Z = 0]P(Z = 0) ) P(Z = 1).

Therefore,

E[ZD] − E[Z]E[D]
= E[D|Z = 1]P(Z = 1) − E[D|Z = 1]P(Z = 1)² − E[D|Z = 0]P(Z = 0)P(Z = 1)
= E[D|Z = 1]P(Z = 1)(1 − P(Z = 1)) − E[D|Z = 0]P(Z = 0)P(Z = 1)
= ( E[D|Z = 1] − E[D|Z = 0] ) P(Z = 0)P(Z = 1).

Similarly, Cov(Z, Y) = ( E[Y|Z = 1] − E[Y|Z = 0] ) P(Z = 0)P(Z = 1). Taking the ratio of the two covariances completes the proof.
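A quick numerical check of Lemma 3.1 (my own toy design): with binary Z and D, the sample covariance ratio and the group-mean-difference (Wald) ratio coincide, and in a constant-effect design both recover β_1 = 2.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
z = rng.binomial(1, 0.4, size=n)
v = rng.normal(size=n)
d = (0.2 + 0.6 * z + 0.5 * v > 0.5).astype(int)   # Z shifts D
y = 1.0 + 2.0 * d + v + rng.normal(size=n)        # D is endogenous through v

cov_ratio = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(round(cov_ratio, 3), round(wald, 3))        # identical, both near 2
```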

3.2 Restrictions on the Selection Process: Local Treatment Effects

The constant effect model we just discussed is quite restrictive. If we want to allow for arbitrary forms of heterogeneous effects across individuals, does the IVE identify any ATE parameter? This question is addressed by the concept of the local average treatment effect (LATE), developed by Imbens and Angrist (1994), Angrist and Imbens (1995), Angrist, Imbens, and Rubin (1996), Angrist (2001), and Angrist (2004).


Again suppose both Z and D are binary. Recall that we defined the potential outcomes (Y_0, Y_1) and the observed outcome Y = DY_1 + (1 − D)Y_0. The potential outcome framework enables us to say something about the effect of D on Y. In IV estimation, we also have to know the effect of Z on D. To make this concept manageable, define the potential treatments (D_0, D_1) and the observed treatment D = ZD_1 + (1 − Z)D_0. When Z = 1, D_1 is observed, and when Z = 0, D_0 is observed. D_{1i} − D_{0i} is the individual effect of the instrument on the treatment assignment. Clearly this is a counterfactual setting as before, because we can only observe D_1 or D_0 but not both. We observe (Z_i, D_i, Y_i)_{i=1,...,N} and want to identify features of (Z_i, D_{0i}, D_{1i}, Y_{0i}, Y_{1i})_{i=1,...,N}.

Consider the encouragement design of Rosenbaum (1996) to further understand the meaning of the potential treatments. We want to study the effect of exercise D on Y, the forced expiratory volume (FEV). Simple comparisons do not identify the effect of D on Y because it is confounded by the subjects' unobservable health status: healthy people tend to exercise and have higher FEV as well. Suppose the subjects are randomly selected and encouraged to exercise; let Z_i = 1 when subject i is selected. Encouragement may induce some people to start exercising, hence Z shifts D. Moreover, randomized encouragement won't affect FEV directly. We can classify the subjects into four groups according to the values of the potential treatments:

D_0 = 0, D_1 = 0 : never-taker
D_0 = 1, D_1 = 1 : always-taker
D_0 = 0, D_1 = 1 : complier
D_0 = 1, D_1 = 0 : defier

Never-takers never exercise, whether there is encouragement or not. By contrast, always-takers always exercise regardless of encouragement. Compliers exercise only if they were encouraged; they are named compliers because their action follows the instrument, i.e., D = Z. By contrast, defiers disobey the instrument. Note that the complier and defier groups are the source of the exogenous variation induced by the instrument, because they change their behavior accordingly.

Suppose:

Assumption 3.1
LATE-1: (Y_{0i}, Y_{1i}, D_{0i}, D_{1i}) ⊥ Z_i (exclusion restriction),
LATE-2: E[D|Z = z] is a nontrivial function of z (inclusion restriction), and
LATE-3: D_{1i} ≥ D_{0i} for all i (monotonicity).

LATE-1 captures the idea that Z is exogenous, and it is similar to assumption 1. Under this assumption, group mean differences can identify the ATE of Z on D and of Z on Y. In principle, we should define the potential outcomes with two subscripts, i.e., Y_{zd}, z ∈ {0, 1} and d ∈ {0, 1}. In our notation the potential outcomes are only indexed by d, meaning that Z only affects Y through D. LATE-2 states that Z shifts D, as in the standard IV setup. LATE-3, monotonicity, is an extra condition compared with the traditional IV setting. We will see that this assumption enables us to identify several features of (Z_i, D_{0i}, D_{1i}, Y_{0i}, Y_{1i})_{i=1,...,N}. LATE-3 effectively imposes restrictions on the selection process D_i and rules out the defier group. Imbens and Angrist (1994) pointed out that latent index models satisfy assumption 3.1. For example, let

Y_{0i} = β_0 + ε_i,
Y_{1i} = β_0 + β_1 + ε_i, and
D_{zi} = I{ α_0 + α_1 z + v_i > 0 }.

If (ε_i, v_i) ⊥ Z_i, then

(D_{0i}, D_{1i}, Y_{0i}, Y_{1i}) = ( I{α_0 + v_i > 0}, I{α_0 + α_1 + v_i > 0}, β_0 + ε_i, β_0 + β_1 + ε_i ) ⊥ Z_i.

α_1 > 0 guarantees LATE-2 and LATE-3. Empirical examples of IVs satisfying LATE-3 will be discussed in the following subsection. Under assumption 3.1, we have

Theorem 3.1 (ATE on the Compliers)
Given assumption 3.1,

IVE = ( E[Y|Z = 1] − E[Y|Z = 0] ) / ( E[D|Z = 1] − E[D|Z = 0] ) = E[Y_1 − Y_0 | D_0 = 0, D_1 = 1] = E[Y_1 − Y_0 | complier] ≡ LATE.

proof:
First consider the numerator:

E[Y|Z = 1] − E[Y|Z = 0]
= E[D_1 Y_1 + (1 − D_1)Y_0 | Z = 1] − E[D_0 Y_1 + (1 − D_0)Y_0 | Z = 0]   (by def. of Y and D)
= E[D_1 Y_1 + (1 − D_1)Y_0] − E[D_0 Y_1 + (1 − D_0)Y_0]   (by LATE-1)
= E[ D_1(Y_1 − Y_0) + Y_0 − D_0(Y_1 − Y_0) − Y_0 ]
= E[ (D_1 − D_0)(Y_1 − Y_0) ],

where (D_{1i} − D_{0i})(Y_{1i} − Y_{0i}) is the individual effect of Z on Y.

E[ (D_1 − D_0)(Y_1 − Y_0) ]
= E[ 0 · (Y_1 − Y_0) | D_0 = 0, D_1 = 0 ] P(D_0 = 0, D_1 = 0)
+ E[ 0 · (Y_1 − Y_0) | D_0 = 1, D_1 = 1 ] P(D_0 = 1, D_1 = 1)
+ E[ 1 · (Y_1 − Y_0) | D_0 = 0, D_1 = 1 ] P(D_0 = 0, D_1 = 1)
+ E[ (−1) · (Y_1 − Y_0) | D_0 = 1, D_1 = 0 ] P(D_0 = 1, D_1 = 0)
= E[ (Y_1 − Y_0) | D_0 = 0, D_1 = 1 ] P(D_0 = 0, D_1 = 1).   (by LATE-3)

It is clear that the complier group is the only source of variation induced by an instrumental variable satisfying assumption 3.1. Secondly,

E[D|Z = 1] − E[D|Z = 0] = P(D = 1|Z = 1) − P(D = 1|Z = 0)   (D is binary)
= P(D_1 = 1|Z = 1) − P(D_0 = 1|Z = 0)   (def. of D)
= P(D_1 = 1) − P(D_0 = 1)   (LATE-1)
= [ P(D_0 = 0, D_1 = 1) + P(D_0 = 1, D_1 = 1) ] − [ P(D_0 = 1, D_1 = 0) + P(D_0 = 1, D_1 = 1) ]
= P(D_0 = 0, D_1 = 1).   (LATE-3)

Therefore,

IVE = ( E[Y|Z = 1] − E[Y|Z = 0] ) / ( E[D|Z = 1] − E[D|Z = 0] ) = E[Y_1 − Y_0 | D_0 = 0, D_1 = 1].
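The theorem can be visualized with a small simulation (my own toy design, not from the cited papers): with heterogeneous effects tied to the latent type, the Wald ratio recovers the complier ATE, not the population ATE.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
z = rng.binomial(1, 0.5, size=n)

# Latent-index selection: v determines the type, so monotonicity holds.
v = rng.uniform(size=n)
d1 = (v < 0.8).astype(int)          # treated when encouraged
d0 = (v < 0.3).astype(int)          # treated even without encouragement
d = z * d1 + (1 - z) * d0           # always-takers: v<0.3, compliers: 0.3<=v<0.8

# Heterogeneous treatment effect correlated with the latent type.
effect = 1.0 + 2.0 * v              # E[Y1-Y0] differs across v
y0 = rng.normal(size=n)
y1 = y0 + effect
y = d * y1 + (1 - d) * y0

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
complier = (d1 == 1) & (d0 == 0)
print(round(wald, 3))                     # ≈ E[effect | complier]
print(round(effect[complier].mean(), 3))  # = 1 + 2*E[v | 0.3<=v<0.8] ≈ 2.1
print(round(effect.mean(), 3))            # population ATE ≈ 2.0, not recovered
```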

There are several noticeable points about this result.

LATE is the ATE on compliers:
Recall that when Z and D are both binary, the IVE equals Cov(Z, Y)/Cov(Z, D), the probability limit of the traditional IV or 2SLS estimator. Thus, under treatment effect heterogeneity of unknown form, 2SLS does not identify the ATE but the ATE on compliers, or LATE. This is because the compliers, a subgroup of the whole population, are the only source of variation used for identification, and under heterogeneous effects this cannot be extrapolated to the whole population. The reason the traditional IV estimator can identify the ATE is the constant effect assumption.

Different IVs identify different LATEs:
Different IVs define different groups of compliers, never-takers, and always-takers, so LATE is relative to the instrument being used. This is in sharp contrast with traditional IV estimation, in which the identified parameter does not depend on the instruments. Typically, if several IVs are available, researchers tend to put them together and use 2SLS to deal with the overidentification problem and utilize all the information contained in the IVs. Overidentification stems from the constant effect assumption, under which there is only one parameter to be identified. However, in the heterogeneous treatment effect model, everyone can have a different effect. There is no issue of overidentification because the object of interest is nonparametric in nature, i.e., an infinite-dimensional object.

Who is a complier?
Although the proportion of the complier group is identified according to theorem 3.1, who the compliers are is unknown, since only one of (D_0, D_1) is observed. To know exactly who is a complier, one would have to observe both (D_0, D_1), by definition. LATE is often criticized for being conditional on an unobservable subgroup; see e.g., Heckman (1996).

The fact that the IVE can only identify the ATE on compliers shows that researchers should be really careful about the source of variation used to identify the parameter when the treatment effect is heterogeneous. As Imbens and Angrist (1994) mention, this is analogous to the panel data model with individual fixed effects. Consider the following example:⁶ We want to study the effect of the gender difference D on the wage Y. Panel data are available, Y_it = βD_it + α_i + ε_it, so we can control for the individual fixed effect. However, β is identified only if D_it is time-varying. Therefore, the source of identification comes from those who changed their gender status. Do you think β measures the effect of the gender difference, or the effect of changing gender status? Another problem of IV estimation is that if Z is only weakly correlated with D, then the complier group is just a tiny fraction of the whole population. The IV estimation is therefore not representative. It also suffers from the problem of weak instruments in statistical inference; see e.g., Hall (200?), and Angrist and Krueger (2001).

3.3 Case Studies

3.3.1 Vietnam Draft Lottery

In IV estimation, the identification power comes from the exclusion restriction. Researchers usually have different opinions about whether a variable satisfies the exclusion restriction or not. For instance, Heckman (1996) argued that when people receive a high draft lottery number, they may change their schooling plan, so that their earning potential is affected as well. The Vietnam draft lottery number may therefore not be a valid instrument.

6. Donghoon Lee provided this interesting example.


3.3.2 Randomized Eligibility Design

3.4 Other Identified Features

Besides LATE, under assumption 3.1 several features of (Y_0, Y_1, D_0, D_1, Z) are identified from the observed (Y, D, Z). Again, the identification power mainly comes from the monotonicity assumption. Here I summarize several identification results in Imbens and Rubin (1997). Although compliers are not identified from the observed data, some members of the always-takers and never-takers are identified under the monotonicity assumption. For example, (Z_i = 0; D_i = 1) implies (Z_i = 0; D_{0i} = 1, D_{1i} = 1), so i is an always-taker. Similarly, if (Z_i = 1; D_i = 0), then i is a never-taker. The (Z_i = 0; D_i = 0) group contains compliers and never-takers, while the (Z_i = 1; D_i = 1) group contains compliers and always-takers. Thus the monotonicity assumption induces the following structure. Almost all identification strategies for local treatment effects stem from this table.

          D = 0                        D = 1
Z = 0     compliers + never-takers     always-takers
Z = 1     never-takers                 compliers + always-takers

Lemma 3.2 Denote by π_c, π_a, and π_n the proportions of compliers, always-takers, and never-takers, respectively. These population proportions are identified. Moreover,

π_c = E[D|Z = 1] − E[D|Z = 0], π_a = E[D|Z = 0], and π_n = 1 − E[D|Z = 1].

proof: We already showed that π_c = E[D|Z = 1] − E[D|Z = 0]. By monotonicity,

E[D|Z = 0] = P(D = 1|Z = 0) = P(D_0 = 1|Z = 0) = P(D_0 = 1, D_1 = 1|Z = 0).

By independence, P(D_0 = 1, D_1 = 1|Z = 0) = P(D_0 = 1, D_1 = 1) ≡ π_a. Thus π_n = 1 − π_c − π_a = 1 − E[D|Z = 1].

In fact, the above table provides some intuition for the proof of lemma 3.2. We can calculate how many people have D = 1 in the group Z = 0, which is what E[D|Z = 0] does. It tells us the proportion of always-takers in the group Z = 0. By the independence assumption, Z is independent of (D_0, D_1) and hence also of the individual's type. Therefore, the proportions of always-takers in the Z = 0 group and the Z = 1 group should be the same due to randomization. Knowing that (Z_i = 0; D_i = 1) identifies an always-taker allows us to use P(D = 1|Z = 0) to identify π_a. It can be estimated by Σ_i D_i(1 − Z_i) / Σ_i (1 − Z_i), which has probability limit E[ D(1 − Z) / P(Z = 0) ] = P(D = 1|Z = 0). We will see this again later on.
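Lemma 3.2 in action (my own toy numbers: 30% always-takers, 50% compliers, 20% never-takers), recovering the type proportions from (Z, D) alone:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300_000
z = rng.binomial(1, 0.5, size=n)
# Types with known proportions: 30% always-takers, 50% compliers, 20% never-takers.
u = rng.uniform(size=n)
d0 = (u < 0.3).astype(int)
d1 = (u < 0.8).astype(int)
d = z * d1 + (1 - z) * d0

pi_a = d[z == 0].mean()                      # E[D|Z=0]
pi_c = d[z == 1].mean() - d[z == 0].mean()   # E[D|Z=1] - E[D|Z=0]
pi_n = 1 - d[z == 1].mean()                  # 1 - E[D|Z=1]
print(round(pi_a, 2), round(pi_c, 2), round(pi_n, 2))
```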

In section 2 we demonstrated that the conditional unconfoundedness condition permits identification of the two marginal distributions of the potential outcomes. Since the IV model considered here is a generalized version of the unconfoundedness and overlap conditions in section 1, similar identification results follow. Denote by f_{zd}(y), z, d ∈ {0, 1}, the density functions of the observed Y in the subsample defined by Z = z and D = d. For instance, f_{01}(y) can be estimated using the subsample with (Z = 0, D = 1). Let (g_0^c(y), g_1^c(y)) denote the two marginal densities of the potential outcomes for compliers. Also, let (g_0^a(y), g_1^a(y)) and (g_0^n(y), g_1^n(y)) denote the two marginal densities of the potential outcomes for always-takers and never-takers, respectively.

Lemma 3.3 g_1^a(y) = f_{01}(y) and g_0^n(y) = f_{10}(y), while g_0^a(y) and g_1^n(y) are unidentified. (g_0^c(y), g_1^c(y)) are both identified.

proof: Because we never observe always-takers without the treatment, or never-takers with the treatment, there is no way we can learn g_0^a(y) and g_1^n(y). Because Z is independent of the individual's type, and the subsample with (Z = 0, D = 1) consists entirely of always-takers, we have g_1^a(y) = f_{01}(y). Analogously, g_0^n(y) = f_{10}(y). Again because Z is independent of the individual's type, f_{00}(y) is a mixture of the densities of Y_0 for compliers and never-takers (see the upper-left block of the table). Similarly, f_{11}(y) is a mixture of the densities of Y_1 for compliers and always-takers. Therefore, we have

f_{00}(y) = [ π_c / (π_c + π_n) ] g_0^c(y) + [ π_n / (π_c + π_n) ] g_0^n(y), and
f_{11}(y) = [ π_c / (π_c + π_a) ] g_1^c(y) + [ π_a / (π_c + π_a) ] g_1^a(y).

By inverting these equations, (g_0^c(y), g_1^c(y)) can be expressed in terms of directly estimable distributions:

g_0^c(y) = [ (π_c + π_n) / π_c ] f_{00}(y) − [ π_n / π_c ] f_{10}(y), and
g_1^c(y) = [ (π_c + π_a) / π_c ] f_{11}(y) − [ π_a / π_c ] f_{01}(y).

Alternatively, Abadie (2002) proposed another identification strategy to identify the cumulative distribution functions of the potential outcomes for compliers.


Lemma 3.4 Suppose E|g(Y)| < ∞. Under assumption 3.1, we have

E[g(Y_1) | D_0 = 0, D_1 = 1] = ( E[g(Y)D|Z = 1] − E[g(Y)D|Z = 0] ) / ( E[D|Z = 1] − E[D|Z = 0] ), and

E[g(Y_0) | D_0 = 0, D_1 = 1] = ( E[g(Y)(1 − D)|Z = 0] − E[g(Y)(1 − D)|Z = 1] ) / ( E[D|Z = 1] − E[D|Z = 0] ).

proof: First note that

E[g(Y)D|Z = 1] = E[ g(D_1 Y_1 + (1 − D_1)Y_0) D_1 | Z = 1 ]
= E[ g(D_1 Y_1 + (1 − D_1)Y_0) D_1 ] = E[g(Y_1)|D_1 = 1] P(D_1 = 1)
= { E[g(Y_1)|D_1 = 1, D_0 = 1] P(D_0 = 1|D_1 = 1) + E[g(Y_1)|D_1 = 1, D_0 = 0] P(D_0 = 0|D_1 = 1) } P(D_1 = 1)
= E[g(Y_1)|D_1 = 1, D_0 = 1] P(D_1 = 1, D_0 = 1) + E[g(Y_1)|D_1 = 1, D_0 = 0] P(D_1 = 1, D_0 = 0).

Similarly,

E[g(Y)D|Z = 0] = E[ g(D_0 Y_1 + (1 − D_0)Y_0) D_0 | Z = 0 ]
= E[ g(D_0 Y_1 + (1 − D_0)Y_0) D_0 ] = E[g(Y_1)|D_0 = 1] P(D_0 = 1)
= { E[g(Y_1)|D_1 = 1, D_0 = 1] P(D_1 = 1|D_0 = 1) + E[g(Y_1)|D_1 = 0, D_0 = 1] P(D_1 = 0|D_0 = 1) } P(D_0 = 1)
= E[g(Y_1)|D_1 = 1, D_0 = 1] P(D_1 = 1, D_0 = 1),

where the last equality uses monotonicity (P(D_1 = 0, D_0 = 1) = 0). Subtracting the two expressions leaves E[g(Y_1)|D_1 = 1, D_0 = 0] P(D_1 = 1, D_0 = 0); dividing by E[D|Z = 1] − E[D|Z = 0] = P(D_0 = 0, D_1 = 1) yields the first result.

Alternatively, we can prove it by exploiting lemma 3.3:

proof:
From g_1^c(y) = [ (π_c + π_a) / π_c ] f_{11}(y) − [ π_a / π_c ] f_{01}(y), we know that E[g(Y_1)|complier] = [ (π_c + π_a) / π_c ] E[g(Y)|Z = 1, D = 1] − [ π_a / π_c ] E[g(Y)|Z = 0, D = 1]. Now,

(π_c + π_a) E[g(Y)|Z = 1, D = 1] − π_a E[g(Y)|Z = 0, D = 1]
= (1 − π_n) E[g(Y)|Z = 1, D = 1] − π_a E[g(Y)|Z = 0, D = 1]
= P(D = 1|Z = 1) E[g(Y)|Z = 1, D = 1] − P(D = 1|Z = 0) E[g(Y)|Z = 0, D = 1]
= E[g(Y)D|Z = 1] − E[g(Y)D|Z = 0].

Dividing by π_c = E[D|Z = 1] − E[D|Z = 0], we obtain E[g(Y_1)|complier].


If g(Y) = I{Y ≤ y}, the above lemma can be used to estimate the CDFs of the potential outcomes for compliers, as well as the two quantile functions. The identification strategy for the CDF employed here is parallel to that of section 2. Analogously, a Firpo (2007)-type QTE estimator can be constructed for the LQTE, the QTE on compliers.
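A last sketch (my own toy design, with g the identity for simplicity) checking the first formula of Lemma 3.4 against the simulated truth for the compliers:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 400_000
z = rng.binomial(1, 0.5, size=n)
u = rng.uniform(size=n)
d0 = (u < 0.2).astype(int)            # always-takers: u < 0.2
d1 = (u < 0.7).astype(int)            # compliers: 0.2 <= u < 0.7
d = z * d1 + (1 - z) * d0
y1 = 1.0 + u + rng.normal(size=n)     # E[Y1 | complier] = 1 + 0.45 = 1.45
y0 = rng.normal(size=n)
y = d * y1 + (1 - d) * y0

def g(v):                             # any measurable function works; identity here
    return v

num = np.mean(g(y) * d * z) / z.mean() - np.mean(g(y) * d * (1 - z)) / (1 - z).mean()
den = d[z == 1].mean() - d[z == 0].mean()
print(round(num / den, 3))                            # formula from Lemma 3.4
print(round(y1[(d1 == 1) & (d0 == 0)].mean(), 3))     # simulated truth, near 1.45
```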

3.5 Non-binary Treatments and Instruments

See the references in section 3.2. Basically, the IVE identifies some complex linear combination of LATEs.

3.6 Nonparametric Estimation of LTE with Covariates

Sometimes it is dicult to obtain IVs which are generated by randomized or natural ex-

periment. Even randomization per se does not necessarily guarantee assumption 4.1. For

example, one may change the schooling plan due to high draft lottery number. Higher

education level will increase the earning potential, hence violating the exclustion restric-

tion. If Z is not randomly assigned, it may be confounded with D and Y . Therefore,

covariates should be included in the analysis. For example, Card (1995) uses living close

to a college as an IV to estimate the return to schooling. Residential decision may depend

on the parental income, which may aect childrens earning potential as well. That is to

say, Z is a valid instrument only after conditioning on some covariates. We can extend

assumption 4.1 to allow for covariates using similar modelling technique as assumption

1.3 and 1.4. Following Abadie, Angrist, and Imbens (2002, hereafter AAI), the following

conditions are assumed in the subsequent analysis:

Assumption 3.2
1. independence: (Y_0i, Y_1i, D_0i, D_1i) ⊥ Z_i | X,
2. nontrivial assignment: 0 < P(Z = 1|X) < 1,
3. first-stage: E[D_1|X] ≠ E[D_0|X], and
4. monotonicity: P(D_1i ≥ D_0i|X) = 1.

We have the following theorem:

Theorem 3.2 (Conditional ATE on the Compliers)
Given Assumption 3.2,

CIVE ≡ (E[Y|Z = 1, X] − E[Y|Z = 0, X]) / (E[D|Z = 1, X] − E[D|Z = 0, X])

= E[Y_1 − Y_0|D_0 = 0, D_1 = 1, X] = E[Y_1 − Y_0|complier, X] ≡ CLATE.

Assumption 3.2-1 is analogous to the conditional unconfoundedness assumption 1.3, and Assumption 3.2-2 is analogous to the overlap assumption 1.4. After conditioning on X, the overt biases are removed and we can measure the impact of Z on D and of Z on Y. Note that E[D|Z = i, X] = E[D_i|Z = i, X] = E[D_i|X], i = 0, 1. Thus Assumption 3.2-3 guarantees that the denominator of CIVE is nonzero; it states that Z shifts D, conditional on X. Monotonicity rules out defiers and gives CIVE an easy-to-explain causal interpretation. CLATE can be estimated using subsamples with covariate X = x. Since X is observed, we can integrate CLATE over X to obtain LATE. However, if X is continuous, doing so is cumbersome. To circumvent this problem, we can employ nonparametric regression techniques (Frölich, 2006) or specify parametric models (Abadie, 2003). Frölich (2006) shows that:

Lemma 3.5

∫ E[Y_1 − Y_0|x, complier] f(x|complier) dx = E[Y_1 − Y_0|complier]

= (E[Y|Z = 1] − E[Y|Z = 0]) / (E[D|Z = 1] − E[D|Z = 0])

= (E[E[Y|Z = 1, X]] − E[E[Y|Z = 0, X]]) / (E[E[D|Z = 1, X]] − E[E[D|Z = 0, X]])

= LATE.

proof:

∫ E[Y_1 − Y_0|x, complier] f(x|complier) dx

= ∫ [(E[Y|Z = 1, X = x] − E[Y|Z = 0, X = x]) / (E[D|Z = 1, X = x] − E[D|Z = 0, X = x])] f(x|complier) dx

= ∫ [(E[Y|Z = 1, x] − E[Y|Z = 0, x]) / P(complier|x)] f(x|complier) dx

= ∫ [(E[Y|Z = 1, x] − E[Y|Z = 0, x]) / P(complier|x)] · [f(x, complier)/P(complier)] dx

= ∫ [(E[Y|Z = 1, x] − E[Y|Z = 0, x]) / P(complier|x)] · [P(complier|x)/P(complier)] f(x) dx

= (1/P(complier)) ∫ (E[Y|Z = 1, x] − E[Y|Z = 0, x]) f(x) dx

= (1/(E[D|Z = 1] − E[D|Z = 0])) · (E[E[Y|Z = 1, X]] − E[E[Y|Z = 0, X]])

= (E[E[Y|Z = 1, X]] − E[E[Y|Z = 0, X]]) / (E[E[D|Z = 1, X]] − E[E[D|Z = 0, X]])

= (E[Y|Z = 1] − E[Y|Z = 0]) / (E[D|Z = 1] − E[D|Z = 0]) = LATE.

LATE is essentially defined as the ratio of two ATEs if we think of Z itself as a treatment. Therefore, if there are overt biases due to the covariate X, we can use propensity score weighting, matching, or parametric or nonparametric regression to estimate the ATE of Z on D and of Z on Y. According to Lemma 3.5, LATE can then be estimated by taking the ratio of the two estimated ATEs. Asymptotic properties of such procedures are analyzed in Frölich (2006).
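To illustrate the ratio-of-ATEs view, here is a minimal sketch on simulated data (hypothetical DGP), assuming for clarity that the true instrument propensity P(Z = 1|X) is known; it estimates the ATE of Z on Y and the ATE of Z on D by propensity score weighting and takes their ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# X confounds the instrument: Z is as good as random only given X.
X = rng.integers(0, 2, n).astype(float)
pz = np.where(X == 1, 0.7, 0.3)                  # true P(Z = 1|X), assumed known
Z = (rng.random(n) < pz).astype(float)
complier = rng.random(n) < np.where(X == 1, 0.8, 0.4)
D = np.where(complier, Z, 0.0)                   # one-sided compliance, no defiers
Y = X + 2.0 * D + rng.normal(size=n)             # constant effect of 2, so LATE = 2

def ipw_ate(v):
    """Propensity-score-weighting ATE of Z on v, removing the overt bias from X."""
    return (Z * v / pz).mean() - ((1 - Z) * v / (1 - pz)).mean()

late_hat = ipw_ate(Y) / ipw_ate(D)   # ratio of the two weighted ATEs
print(late_hat)
```

Replacing the weighting step with matching or regression adjustment in both the numerator and the denominator gives the other variants mentioned above.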

3.7 Parametric and Semiparametric Estimation of LTE with Covariates

In this section we introduce the parametric and semiparametric estimation of LATE with covariates, developed by AAI and Abadie (2003). First consider the identification result for the conditional mean function on compliers, E[Y|X, D, complier]:

Lemma 3.6
E[Y|X, D = 0, complier] = E[Y_0|X, complier],
E[Y|X, D = 1, complier] = E[Y_1|X, complier], and clearly

CLATE = E[Y_1 − Y_0|X, complier] = E[Y|X, D = 1, complier] − E[Y|X, D = 0, complier].

proof:

E[Y|X, D = 0, complier] = E[Y_0|X, D = 0, complier]

= E[Y_0|X, Z = 0, complier] (D = Z for compliers)

= E[Y_0|X, complier]. (by Assumption 3.2-1)

Therefore, we can estimate CLATE and LATE by estimating E[Y|X, D, complier]. The issue here is that the complier group is unobservable, and hence the conditional moment function is not directly estimable from a subsample of compliers. Analogous to Imbens and Rubin (1997), it is possible to transform an object that conditions on the unobservable complier status into an object that does not condition on compliers:

Lemma 3.7 (AAI weighting)
Define κ = 1 − D(1 − Z)/P(Z = 0|X) − (1 − D)Z/P(Z = 1|X). Suppose E|g(Y, D, X)| < ∞. Then

E[g(Y, D, X)|X, D_1 = 1, D_0 = 0] = (1/P(D_1 = 1, D_0 = 0|X)) · E[κ g(Y, D, X)|X], and

E[g(Y, D, X)|D_1 = 1, D_0 = 0] = (1/P(D_1 = 1, D_0 = 0)) · E[κ g(Y, D, X)].

proof:

E[g(Y, D, X)|X]

= E[g(Y, D, X)|X, D_1 = 1, D_0 = 0] · P(D_1 = 1, D_0 = 0|X)
+ E[g(Y, D, X)|X, D_1 = 1, D_0 = 1] · P(D_1 = 1, D_0 = 1|X)
+ E[g(Y, D, X)|X, D_1 = 0, D_0 = 0] · P(D_1 = 0, D_0 = 0|X).

Rearranging, we have

E[g(Y, D, X)|X, D_1 = 1, D_0 = 0] = (1/P(D_1 = 1, D_0 = 0|X)) · { E[g(Y, D, X)|X] − I − II },

where

I ≡ E[g(Y, D, X)|X, D_1 = 1, D_0 = 1] · P(D_1 = 1, D_0 = 1|X),
II ≡ E[g(Y, D, X)|X, D_1 = 0, D_0 = 0] · P(D_1 = 0, D_0 = 0|X).

Note that

I = E[g(Y, D, X)|X, D_1 = 1, D_0 = 1, Z = 0] · P(D_1 = 1, D_0 = 1|X, Z = 0) (by Assumption 3.2-1)
= E[g(Y, D, X)|X, D = 1, Z = 0] · P(D = 1|X, Z = 0). (by Assumption 3.2-4)

It follows that II = E[g(Y, D, X)|X, D = 0, Z = 1] · P(D = 0|X, Z = 1). Also note that

E[D(1 − Z)g(Y, D, X)|X] = E[1 · g(Y, D, X)|X, D = 1, Z = 0] · P(D = 1, Z = 0|X)
= E[g(Y, D, X)|X, D = 1, Z = 0] · P(D = 1|X, Z = 0) P(Z = 0|X).

Therefore,

I = E[ (D(1 − Z)/P(Z = 0|X)) · g(Y, D, X) | X], and
II = E[ ((1 − D)Z/P(Z = 1|X)) · g(Y, D, X) | X].

Substituting I and II into the original equation, we have

E[g(Y, D, X)|X, D_1 = 1, D_0 = 0]
= (1/P(D_1 = 1, D_0 = 0|X)) · E[ g(Y, D, X) · (1 − D(1 − Z)/P(Z = 0|X) − (1 − D)Z/P(Z = 1|X)) | X].

Recall that P(D_1 = 1, D_0 = 0|X) = E[D|X, Z = 1] − E[D|X, Z = 0] (see the proof of theorem 4.1). Finally, the non-estimable left-hand side, which conditions on compliers, is expressed as the estimable right-hand side, which conditions only on the full sample.

Since D(1 − Z) = 1 indicates always-takers and (1 − D)Z = 1 indicates never-takers, the intuition of the weighting function κ is that after deleting the contributions of the always-takers and never-takers from E[Y], we get the contribution of the compliers. Now ignore the covariate X for a moment to simplify the explanation. We can calculate the mean of the units in the upper-right block of the table in section 4.4; then the mean response of the always-takers is obtained. According to that table, the mean response of always-takers can be estimated by

Σ_{i: Z_i = 0, D_i = 1} Y_i / Σ_i D_i(1 − Z_i) = Σ_i D_i(1 − Z_i)Y_i / Σ_i D_i(1 − Z_i).

In section 4.4 we show that the proportion of always-takers is identified and can be estimated by Σ_i D_i(1 − Z_i) / Σ_i (1 − Z_i). Hence the always-takers' contribution to E[Y] can be estimated by

[ Σ_i D_i(1 − Z_i)Y_i / Σ_i D_i(1 − Z_i) ] · [ Σ_i D_i(1 − Z_i) / Σ_i (1 − Z_i) ],

which is consistent for E[D(1 − Z)Y]/P(Z = 0). The identification strategy of AAI weighting is a nice combination of Imbens and Rubin (1997) and propensity score weighting.

This lemma enables us to get rid of the conditioning-on-compliers problem as long as the statistic can be expressed in terms of moments of the observables (Y, D, X). By choosing a suitable function g(·), we are able to estimate LATE and LQTE.

Estimation of LATE:
Abadie (2003) postulates a parametric model for the conditional mean function on compliers, E[Y|X, D, D_1 = 1, D_0 = 0] = h(D, X; θ_0). Finding the conditional mean function corresponds to the minimization of a quadratic loss function:

θ_0 = argmin_θ E[{Y − h(D, X; θ)}² | D_1 = 1, D_0 = 0].

By the AAI weighting lemma, the above minimization problem is equivalent to

θ_0 = argmin_θ E[κ {Y − h(D, X; θ)}²].

If we let h(D, X; θ) = αD + X′β, then

(α̂, β̂) = argmin_{α,β} (1/N) Σ_{i=1}^N κ_i (Y_i − αD_i − X_i′β)²,

κ_i = 1 − D_i(1 − Z_i)/(1 − P(Z = 1|X_i)) − (1 − D_i)Z_i/P(Z = 1|X_i).

Clearly, the estimated LATE α̂ can be obtained by running a weighted least squares regression.
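A minimal sketch of this weighted least squares step on simulated data (hypothetical DGP), assuming for clarity that the instrument propensity score P(Z = 1|X) is known; in practice it would be estimated, e.g., by a logit. Since κ can be negative, the sketch solves the weighted normal equations directly rather than calling a routine that requires nonnegative weights.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

x = rng.normal(size=n)
pz = 1.0 / (1.0 + np.exp(-x))                 # true P(Z = 1|X); estimate it in practice
Z = (rng.random(n) < pz).astype(float)

u = rng.random(n)
always, never = u < 0.2, u > 0.8              # always-takers and never-takers
D = np.where(always, 1.0, np.where(never, 0.0, Z))

# For compliers, E[Y|X, D] = 2 D + 1.5 x; the type-specific shifts make a
# plain OLS regression of Y on (D, x) biased for the complier effect.
Y = 2.0 * D + 1.5 * x + 1.0 * always - 1.0 * never + rng.normal(size=n)

kappa = 1.0 - D * (1.0 - Z) / (1.0 - pz) - (1.0 - D) * Z / pz

# Weighted least squares with possibly negative weights: solve the normal
# equations (W' diag(kappa) W) theta = W' diag(kappa) Y directly.
W = np.column_stack([D, x, np.ones(n)])
theta = np.linalg.solve(W.T @ (kappa[:, None] * W), W.T @ (kappa * Y))
alpha_hat, beta_hat = theta[0], theta[1]
print(alpha_hat, beta_hat)   # close to (2, 1.5)
```

With an estimated propensity score, the standard errors must account for the first-stage estimation; see AAI for the asymptotic theory.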

Estimation of LQTE:
LQTE can be estimated in the same manner. AAI specify the conditional quantile function on compliers as Q_Y(τ|X, D, D_1 = 1, D_0 = 0) = α(τ)D + X′β(τ). Finding the conditional quantile function corresponds to the minimization of the check function:

(α(τ), β(τ)) = argmin_{α,β} E[ρ_τ(Y − αD − X′β) | D_1 = 1, D_0 = 0]

= argmin_{α,β} E[κ ρ_τ(Y − αD − X′β)].

α(τ) is the τ-th LQTE and it can be estimated by a weighted quantile regression.⁷ Note that the parametric models are specified for ease of implementation; the identification result is nonparametric. Summing up, let us give some intuition for why the estimation of LATE or LQTE boils down to a weighted regression problem. AAI show that:

Lemma 3.8 By Assumption 3.2-1, we have (Y_1, Y_0) ⊥ D | X, D_1 = 1, D_0 = 0.

proof:
By Assumption 3.2-1 we know (Y_1, Y_0, D_1, D_0) ⊥ Z | X, implying (Y_1, Y_0) ⊥ Z | X, D_1 = 1, D_0 = 0. When we condition on compliers, Z = D, and then (Y_1, Y_0) ⊥ D | X, D_1 = 1, D_0 = 0.

After conditioning on compliers (those whose behavior is changed by Z), we have (Y_1, Y_0) ⊥ D | X. Under this condition, various methods can be used to estimate treatment effect parameters. In particular, LATE can be obtained by OLS and LQTE by quantile regression. The AAI weighting function helps transform the conditional-on-compliers problem into an unconditional problem. It turns out that the treatment effect parameters can be estimated by weighted regression methods.

4 Difference-in-Difference Designs

In the previous chapters, several methods to control for observed confounders were introduced. The premise is that all the relevant covariates are available to the researcher. This is a strong assumption, however. For example, in the case of the return to schooling, unobserved ability may affect both earnings and the schooling choice. In this chapter, we introduce methods that allow researchers to control for such unobserved confounders when panel or repeated cross-sectional data are available. Panel data refers to the case in which, for instance, patient i's medical outcomes and characteristics are observed at both t = 0 and t = 1. However, it is less likely that patient i is infected with the same disease at both t = 0 and t = 1. Instead, data are often available at a more aggregate level: patient-level data are available at both t = 0 and t = 1, but the same individuals may not be measured twice. The latter case is referred to as repeated cross-sectional data.

⁷ AAI discuss the computational issue when applying AAI weighting to estimate QTE. Also see Frölich and Melly (2008).

Perhaps the best control unit for John is John himself. When both the pre- and post-treatment data are available, such a comparison is feasible. If the study period is short, this before-after comparison is less problematic. If the study period is long, the outcome variable at t = 1 is likely to be contaminated by a time trend, or by other factors occurring between t = 0 and t = 1 that are unrelated to the policy intervention. All the effects caused by factors unrelated to the treatment are summarized in the time effect. Consequently, the pre-treatment versus post-treatment comparison within the treatment group yields an estimated treatment effect that equals the effect of the policy intervention plus the undesired time effect. Furthermore, the before-after comparison of the untreated control group can be used to identify the time effect, since in the absence of the policy intervention the difference between pre-treatment and post-treatment outcomes is attributable to the time effect. The difference-in-difference (DID) approach is constructed based on this intuition.

For example, suppose the government launches a job training program in region 1 at time 1, and the unemployment rate falls by 2% compared with the unemployment rate in region 1 at time 0. However, it is possible that economic conditions improved in region 1, leading to a lower unemployment rate. Suppose we can find another region, region 0, that is similar to region 1 in terms of economic conditions, and that the unemployment rate fell by 1% in region 0 during the same period. The 1% drop estimates the time effect, so the effect attributed to the program is −2% − (−1%) = −1%. To be specific, the DID estimator is given by:

τ_DID = E[Y|G = 1, T = 1] − E[Y|G = 1, T = 0] − (E[Y|G = 0, T = 1] − E[Y|G = 0, T = 0]),

where G is the group indicator and T is the time indicator. The DID estimator can be interpreted as a matching estimator as well. The within-group difference is a matching that controls for the unobserved group (individual) specific fixed effect in repeated cross-sectional (panel) data. The between-group difference is a matching that controls for the time effect. DID approaches have received considerable attention in applied research. Meyer (1995), Lee (2005), and Angrist and Pischke (2009) survey DID methods. Applications include job training programs (Ashenfelter and Card, 1985), the minimum wage (Card and Krueger, 1994), saving behavior (Poterba, Venti, and Wise, 1995), the disability act (Acemoglu and Angrist, 2001), and the consequences of parental loss on adolescents (Corak, 2001), among others. See also Angrist and Krueger (1999) and Rosenzweig and Wolpin (2001) for more examples.

4.1 Linear Difference-in-Difference Model

Suppose there are two groups, G = 0 and 1. Group 0 (1) will be referred to as the control (treatment) group. Only group 1 is exposed to the policy intervention at time 1. Let Y^N (Y^I) stand for the untreated (treated) potential outcome. The observed outcome for individual i is given by Y_i = Y^N_i(1 − I_i) + Y^I_i I_i, where I_i = G_i T_i is the treatment indicator. Therefore, Y = Y^N when (G, T) = (0, 0), (0, 1), and (1, 0), while Y = Y^I when (G, T) = (1, 1). Assume that we have panel data in which there is no moving in and out for each group, and that

y^N_it = α + βt + γG_i + η_i + ε_it, and

y^I_it − y^N_it = τ for all i, t (constant effect).

These two assumptions together imply that the observed outcomes y_it can be written as

y_it = α + βt + γG_i + η_i + τG_i t + ε_it.

In this model, βt summarizes the common time effect across individuals. η_i is the unobserved fixed effect that is potentially correlated with G_i (selection-on-unobservables). Namely, G_i is allowed to be endogenous. τ is the ATE of interest. The ε_it are assumed to be i.i.d. across time and individuals. Within-group differencing removes the unobserved confounder η_i:

Δy_i = β + τG_i + Δε_i.

The ATE, τ, is identified through the moment condition:

τ_DID = E[y_i1 − y_i0|G_i = 1] − E[y_i1 − y_i0|G_i = 0]

= E[Δy_i|G_i = 1] − E[Δy_i|G_i = 0]

= (β + τ) − β

= τ.

Equivalently, τ can be estimated by OLS with time, group, and group × time dummies.
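The equivalence between the differencing moment and the dummy-variable regression can be checked numerically. The sketch below simulates a panel with a group indicator that is endogenous through the fixed effect (all parameter values are hypothetical) and compares the two estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000   # individuals observed at t = 0 and t = 1

G = rng.integers(0, 2, n).astype(float)
eta = 0.5 * G + rng.normal(size=n)        # fixed effect correlated with G (endogenous group)
alpha, beta, gamma, tau = 1.0, 0.4, 0.3, 2.0

y0 = alpha + gamma * G + eta + rng.normal(size=n)                     # t = 0
y1 = alpha + beta + gamma * G + eta + tau * G + rng.normal(size=n)    # t = 1

# DID moment: E[dy|G = 1] - E[dy|G = 0] = tau
dy = y1 - y0
tau_did = dy[G == 1].mean() - dy[G == 0].mean()

# Pooled OLS with time, group, and group x time dummies; tau is the interaction coefficient.
Y = np.concatenate([y0, y1])
T = np.concatenate([np.zeros(n), np.ones(n)])
GG = np.concatenate([G, G])
W = np.column_stack([np.ones(2 * n), T, GG, GG * T])
coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
print(tau_did, coef[3])   # both close to tau = 2
```

Because the regression is saturated in the (G, T) cells, the interaction coefficient coincides with the difference of the cell means, so the two numbers agree up to floating-point error.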

The previous constant-effect linear DID model can be modified to fit the framework of repeated cross-sectional data:

Y_i = α + βT_i + γG_i + η_i + τG_i T_i + ε_i.

Note that we only observe agent i at one time. η_i is correlated with G_i, and the distribution of η is time-invariant; it thus captures the unobserved group-specific fixed effect. The ATE can be identified through

τ_DID = E[Y|G = 1, T = 1] − E[Y|G = 1, T = 0] − (E[Y|G = 0, T = 1] − E[Y|G = 0, T = 0])

= (β + E[η_i|G = 1] − E[η_i|G = 1] + τ) − (β + E[η_i|G = 0] − E[η_i|G = 0])

= τ.

4.2 Nonparametric Difference-in-Difference Models

In this section we consider cases in which differences in observed covariates create different time effects for the treatment and control groups. Suppose there are two types of workers, blue collar and white collar, and the percentage of white-collar workers in the treatment group is higher than that in the control group. During the study period, information technology may improve, leading to non-parallel wage growth between blue-collar and white-collar workers, since white-collar workers are more likely to enjoy the benefits of the technology. Therefore, controlling for differences in covariates between the treatment and control groups is also important in DID models. In the linear DID model, this can be done by adding covariates to the regression. In this section we introduce more general nonparametric methods to control for covariates. The key identification condition for the ATE is the conditional same time effect assumption in the absence of treatment:

Assumption 4.1
E[Y^N_11 − Y^N_10|X] = E[Y^N_01 − Y^N_00|X],

where Y_gt is shorthand for Y|G = g, T = t.

Theorem 4.1 Under the same time effect assumption, the DID estimand τ_DID(X) identifies the ATE on the treated at time 1, E[Y^I_11 − Y^N_11|X].

Pf 4.1

τ_DID(X) = E[Y_11 − Y_10|X] − E[Y_01 − Y_00|X] (in terms of observed outcomes)

= E[Y^I_11 − Y^N_10|X] − E[Y^N_01 − Y^N_00|X] (in terms of potential outcomes)

= E[Y^I_11 − Y^N_10|X] − E[Y^N_11 − Y^N_10|X] + E[Y^N_11 − Y^N_10|X] − E[Y^N_01 − Y^N_00|X]

= E[Y^I_11 − Y^N_10 − Y^N_11 + Y^N_10|X] (by the same time effect in the absence of treatment)

= E[Y^I_11 − Y^N_11|X].

Averaging τ_DID(X) according to the distribution of X given G = 1 yields E[Y^I_11 − Y^N_11] ≡ E[Y^I_{t=1} − Y^N_{t=1}|G = 1]. The τ_DID(X) can be written as

E[Y_{t=1} − Y_{t=0}|G = 1, X] − E[Y_{t=1} − Y_{t=0}|G = 0, X].

If panel data are available, we first take the first difference of the outcome for each individual to generate the new dependent variable ΔY_i = Y_{i,t=1} − Y_{i,t=0}. Then τ_DID(X) can be rewritten as E[ΔY|G = 1, X] − E[ΔY|G = 0, X], which has exactly the same form as τ(X) in section 1. τ_DID(X) can be estimated by differencing the two estimated conditional mean functions. In particular, matching estimators can also be employed. The intuition is that the within-group difference removes the unobservable confounders, and then matching is employed to control for the non-parallel outcome dynamics caused by different covariates.

If we only have repeated cross-sectional data, generating ΔY_i is infeasible. Instead, we can estimate four conditional mean functions to estimate τ_DID(X):

τ_DID(X) = E[Y|G = 1, T = 1, X] − E[Y|G = 1, T = 0, X] − E[Y|G = 0, T = 1, X] + E[Y|G = 0, T = 0, X].

Note that the same time effect assumption does not exclude the possibility of selection-on-unobservables. There may exist systematic differences between the treated and control units such that E[Y^N_10] ≠ E[Y^N_00]. Such an endogeneity problem is resolved using the time dimension, and the role of the control group is to remove the time effect.

Besides matching, Abadie (2005) shows that the propensity score weighting approach can also be extended to the DID setting. If we view the change in the untreated response, ΔY^N = Y^N_{t=1} − Y^N_{t=0}, as Y_0, the same time effect condition, E[Y^N_{t=1} − Y^N_{t=0}|G = 1, X] = E[Y^N_{t=1} − Y^N_{t=0}|G = 0, X], is analogous to the mean independence condition for Y_0, E[Y_0|D = 1, X] = E[Y_0|D = 0, X], which is used to identify the ATE on the treated with cross-sectional data in chapter 1. Also recall that τ_DID identifies the ATE on the treated, E[Y^I_{t=1} − Y^N_{t=1}|G = 1]. Define ΔY = Y_{t=1} − Y_{t=0}, Y_1 = Y^I_{t=1} − Y^N_{t=0}, and Y_0 = Y^N_{t=1} − Y^N_{t=0}. Then E[Y^I_{t=1} − Y^N_{t=1}|G = 1] can be expressed as E[Y_1 − Y_0|G = 1]. Combining the above three facts, the parameter of interest and the associated identification condition are exactly the same as in the cross-sectional case. Therefore, the same weighting estimator can be applied here. As shown in chapter 1, under E[Y_0|D = 1, X] = E[Y_0|D = 0, X], the propensity score weighting estimator is given by

E[Y_1 − Y_0|D = 1] = (1/P(D = 1)) · E[ ((D − e(x))/(1 − e(x))) Y ].

Pf 4.2

E[Y_1 − Y_0|D = 1] = E[Y_1|D = 1] − E[Y_0|D = 1] (1)

E[Y_1|D = 1] = E[Y|D = 1] (2)

Recall that E[Y_0] = E[ (1 − D)Y/(1 − e(x)) ] (3)

E[Y_0] = E[Y_0|D = 1]P(D = 1) + E[Y_0|D = 0]P(D = 0) (4)

(3) and (4) give an expression of the counterfactual E[Y_0|D = 1] in terms of observables:

E[Y_0|D = 1] = (1/P(D = 1)) · [ E[Y_0] − E[Y_0|D = 0]P(D = 0) ]

= (1/P(D = 1)) · { E[ (1 − D)Y/(1 − e(x)) ] − E[(1 − D)Y] } (5)

(1), (2), and (5) imply that

E[Y_1 − Y_0|D = 1] = E[Y|D = 1] − (1/P(D = 1)) · { E[ (1 − D)Y/(1 − e(x)) ] − E[(1 − D)Y] }

= (1/P(D = 1)) · { E[DY] − E[ (1 − D)Y/(1 − e(x)) ] + E[(1 − D)Y] }

= (1/P(D = 1)) · E[ DY − (1 − D)Y/(1 − e(x)) + (1 − D)Y ]

= (1/P(D = 1)) · E[ ((D − e(x))/(1 − e(x))) Y ].

Analogously, defining e(X) = P(G = 1|X), we have

E[Y_1 − Y_0|G = 1] = (1/P(G = 1)) · E[ ((G − e(X))/(1 − e(X))) ΔY ].

The above estimator implicitly assumes that panel data are available, because the estimation is based on ΔY_i = Y_{i,t=1} − Y_{i,t=0}. Abadie (2005) demonstrates that it can be modified to fit repeated cross-sectional data.
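A minimal sketch of the panel version of this weighting estimator on simulated data (hypothetical DGP), assuming e(X) = P(G = 1|X) is known for clarity; it recovers the ATE on the treated from the first differences ΔY even though the unconditional trends are not parallel.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

X = rng.integers(0, 2, n).astype(float)
e = np.where(X == 1, 0.7, 0.3)            # e(X) = P(G = 1|X), assumed known here
G = (rng.random(n) < e).astype(float)

# Untreated change depends on X, so unconditional trends differ across groups,
# but the conditional same time effect assumption holds given X.
dY0 = 1.0 + 0.5 * X + rng.normal(size=n)   # Y^N_{t=1} - Y^N_{t=0}
tau = 2.0                                  # effect on the treated at t = 1
dY = dY0 + tau * G                         # observed first difference in the panel

att = ((G - e) / (1 - e) * dY).mean() / G.mean()
print(att)   # close to tau = 2
```

With an estimated propensity score, one would plug in a fitted e(X) and adjust the inference accordingly, as in Abadie (2005).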

Theorem 4.2

E[ ((T − P(T = 1))/(P(T = 1)(1 − P(T = 1)))) · ((G − e(X))/(e(X)(1 − e(X)))) · Y | X ]

= E[Y|G = 1, T = 1, X] − E[Y|G = 1, T = 0, X] − E[Y|G = 0, T = 1, X] + E[Y|G = 0, T = 0, X].

Pf 4.3 The proposed estimand equals

E[ ((1 − P(T = 1))/(P(T = 1)(1 − P(T = 1)))) · ((G − e(X))/(e(X)(1 − e(X)))) · Y | X, T = 1 ] · P(T = 1)
+ E[ ((0 − P(T = 1))/(P(T = 1)(1 − P(T = 1)))) · ((G − e(X))/(e(X)(1 − e(X)))) · Y | X, T = 0 ] · P(T = 0)

= E[ ((G − e(X))/(e(X)(1 − e(X)))) · Y_{t=1} | X, T = 1 ] − E[ ((G − e(X))/(e(X)(1 − e(X)))) · Y_{t=0} | X, T = 0 ]. (1)

Next,

E[ ((G − e(X))/(e(X)(1 − e(X)))) · Y_{t=1} | X, T = 1 ]

= E[ ((1 − e(X))/(e(X)(1 − e(X)))) · Y_{t=1} | X, G = 1 ] · P(G = 1|X)
+ E[ ((0 − e(X))/(e(X)(1 − e(X)))) · Y_{t=1} | X, G = 0 ] · P(G = 0|X)

= E[Y_{t=1}|X, G = 1] − E[Y_{t=1}|X, G = 0]. (2)

Similarly,

E[ ((G − e(X))/(e(X)(1 − e(X)))) · Y_{t=0} | X, T = 0 ] = E[Y_{t=0}|X, G = 1] − E[Y_{t=0}|X, G = 0]. (3)

By (1), (2), and (3), we have shown that the proposed estimand equals

E[Y|X, G = 1, T = 1] − E[Y|X, G = 1, T = 0] − E[Y|X, G = 0, T = 1] + E[Y|X, G = 0, T = 0]

= E[Y^I_{t=1} − Y^N_{t=1}|G = 1, X].

The ATE on the treated is given by

∫ E[Y^I_{t=1} − Y^N_{t=1}|G = 1, X] dP(X|G = 1) = ∫ E[Y^I_{t=1} − Y^N_{t=1}|G = 1, X] · (P(G = 1|X)/P(G = 1)) dP(X)

= E[ ((T − P(T = 1))/(P(T = 1)(1 − P(T = 1)))) · ((G − e(X))/(1 − e(X))) · (1/P(G = 1)) · Y ].

4.3 Nonlinear Difference-in-Difference

The validity of the DID estimator hinges on the additively separable property. When the outcome variable is discrete or rate data, the linear specification is problematic and researchers usually resort to logit, probit, or semiparametric single-index models. However, the differencing procedure fails to identify the ATE under a nonlinear model. Consider

Y = 1[α + βT + γG + τTG + ε ≥ 0], ε ∼ N(0, 1).

τ_DID = E[Y|T = 1, G = 1] − E[Y|T = 0, G = 1] − E[Y|T = 1, G = 0] + E[Y|T = 0, G = 0]

= Φ(α + β + γ + τ) − Φ(α + γ) − Φ(α + β) + Φ(α)

≠ τ.
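The failure is easy to verify numerically. Writing Phi for the standard normal CDF and plugging in some hypothetical parameter values, the DID of the four cell probabilities is far from τ:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

a, b, c, tau = 0.1, 0.2, 0.3, 0.5   # hypothetical probit parameters

did = Phi(a + b + c + tau) - Phi(a + c) - (Phi(a + b) - Phi(a))
print(did, tau)   # did is about 0.13 here, far from tau = 0.5
```

Of course, in the probit model τ is an index coefficient rather than an effect on the probability scale, but the point stands: differencing the cell probabilities does not undo the nonlinearity.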

However, the standard linear DID model implicitly imposes the constant time effect assumption: everyone experiences the same time effect in the absence of the treatment. Athey and Imbens demonstrate that when this assumption is violated, DID also fails to identify the ATE. Suppose y is the wage and η is the worker's ability. Consider a model allowing for time trends in both the level of wages and the returns to ability:

Y = α + βT + γG + τTG + η(1 + rT).

τ_DID = [(α + β + γ + τ + (1 + r)E[η|G = 1]) − (α + γ + E[η|G = 1])]
− [(α + β + (1 + r)E[η|G = 0]) − (α + E[η|G = 0])]

= τ + r(E[η|G = 1] − E[η|G = 0]).

In this case, τ_DID = τ only when E[η|G = 1] = E[η|G = 0]. However, this condition excludes the case of selection-on-unobservables. It is unappealing because part of the motivation for considering panel data is that it allows for selection-on-unobservables.

Finally, linear DID models only admit a location shift. When the primary concern is the issue of inequality, a model that allows for more general forms of distributional effects is more suitable for answering interesting policy questions. For example, a tax reform may be expected to have a positive effect on the lower quantiles of the earnings distribution and a negative effect on the upper quantiles. Estimators of QTE in the context of panel or repeated cross-sectional data thus deliver more informative estimates than linear DID estimators.

Athey and Imbens (2006) propose two models, changes-in-changes and quantile difference-in-differences, to impute the entire counterfactual distribution; therefore ATE and QTE can be derived from these methods. Both changes-in-changes (CIC) and quantile difference-in-differences (QDID) relax the functional form assumptions of standard DID. Furthermore, agents with different unobserved characteristics are allowed to have different time effects under both models.

4.4 The Changes-in-Changes Model

Define Y^I and Y^N as the potential outcomes with and without the treatment, respectively. The observed outcome is Y = Y^N(1 − I) + Y^I I, where I = GT is the treatment indicator. We use the following shorthand for conditioning:

Y^N_gt ≡ Y^N|G = g, T = t,  Y^I_gt ≡ Y^I|G = g, T = t,
Y_gt ≡ Y|G = g, T = t,  U_g ≡ U|G = g.

U represents the unobserved individual characteristics. The corresponding CDFs are denoted by F_{Y^N,gt}, F_{Y^I,gt}, F_{Y,gt}, and F_{U,g}. Since the treatment is effective only for (G, T) = (1, 1), F_{Y,gt} = F_{Y^N,gt} for (G, T) = (1, 0), (0, 0), (0, 1), and F_{Y,11} = F_{Y^I,11}. We want to identify the counterfactual distribution F_{Y^N,11} from the observed F_{Y^N,00}, F_{Y^N,01}, and F_{Y^N,10}. Athey and Imbens consider the CIC model:

Assumption 4.2 CIC-1 (Structural Model)
Y^N = h(U, T)

Assumption 4.3 CIC-2 (Strict Monotonicity)
h(u, t) is strictly increasing in u for t = 0 and 1

Assumption 4.4 CIC-3 (Time Invariance)
U ⊥ T | G

Assumption 4.5 CIC-4 (Support)
supp U_1 ⊆ supp U_0

Assumptions CIC-1 and CIC-2 relax the constant time effect assumption of standard DID models. An individual with realization U = u experiences the time effect h(u, 1) − h(u, 0) in the absence of the treatment. Since we do not impose any restriction on the functional form of h, agents can have different time effects whenever their individual characteristics differ. The model thus nests the constant effect model as a special case.

Assumption CIC-3 asserts that within each group, the distribution of U is stable across time. In particular, the individual fixed effect model satisfies this assumption. However, within the same time period, the distribution of U can vary across groups (U_1 ≠ U_0), and the CIC model thus allows for selection-on-unobservables. For example, the treatment group may contain more high-ability workers than the control group, but the distribution of ability is time-invariant within each group.

Notice that CIC-3 is essential for identifying the time effect in this model. If U were also time-varying, it would be difficult to isolate the time effect from the effect of the change in the distribution of U by looking at the outcome variable. Simply put, the control group at t = 0 is comparable with the control group at t = 1 when U_0 is time-invariant, and the treatment effect of time is identified only if they are comparable. Many economic models map directly into the CIC model. For example, Y is the wage, U is the worker's ability, and the wage is an increasing function of ability. Or suppose Y is working hours and U is the preference for work; the chosen working hours are high if the preference for leisure is low, conditional on the wage and nonlabor income. Or suppose Y is saving and U is the degree of risk aversion; the level of precautionary saving is higher when the degree of risk aversion is high.

Following Matzkin (2003), it is impossible to separately identify the structural shock U and the structural function h. In particular, under CIC-2 we can treat U as uniform on [0, 1] and h as the quantile function of Y^N, since

Y^N = h(U, T) = h(F^{-1}(F(U)), T) = h̃(Ũ, T),

where F is the distribution function of U, Ũ = F(U) is uniform on [0, 1], and h̃(u, t) ≡ h(F^{-1}(u), t). In fact, the identification results rely heavily on the use of the quantile function. To ease the exposition of the main idea, we consider only the case in which Y is continuous; for the case of discrete Y, the reader is referred to Athey and Imbens.

Theorem (Identification of the CIC model)
Under CIC-1 to CIC-4, F_{Y^N,11}(y) = F_{Y,10}(F^{-1}_{Y,00}(F_{Y,01}(y))).

Pf 4.4

F_{Y^N,gt}(y) = P(h(U, T) ≤ y|G = g, T = t)

= P(h(U, t) ≤ y|G = g) (by CIC-3)

= P(U ≤ h^{-1}(y ; t)|G = g) (by CIC-2)

= F_{U,g}(h^{-1}(y ; t)). (1)

Using this identity with (g, t) = (0, 0) and substituting in y = h(u, 0), we have

F_{Y,00}(h(u, 0)) = F_{U,0}(h^{-1}(h(u, 0) ; 0)) = F_{U,0}(u).

Applying F^{-1}_{Y,00} to each quantile, we have

h(u, 0) = F^{-1}_{Y,00}(F_{U,0}(u)), for all u ∈ supp U_0. (2)

Applying (1) to (g, t) = (0, 1), we have

F_{Y,01}(y) = F_{U,0}(h^{-1}(y ; 1)), with h^{-1}(y ; 1) ∈ supp U_0, so

h^{-1}(y ; 1) = F^{-1}_{U,0}(F_{Y,01}(y)), for all y ∈ supp Y_01. (3)

Combining (2) and (3) gives

h(h^{-1}(y ; 1), 0) = F^{-1}_{Y,00}(F_{Y,01}(y)). (4)

Applying (1) with (g, t) = (1, 0), we have

F_{U,1}(u) = F_{Y,10}(h(u, 0)). (5)

Then

F_{Y^N,11}(y) = F_{U,1}(h^{-1}(y ; 1))

= F_{Y,10}(h(h^{-1}(y ; 1), 0)) (by (5))

= F_{Y,10} ∘ F^{-1}_{Y,00} ∘ F_{Y,01}(y). (by (4))

By CIC-4, supp U_1 ⊆ supp U_0, which implies supp Y^N_11 ⊆ supp Y_01 and enables us to impute the entire counterfactual distribution F_{Y^N,11} from F_{Y,10}, F_{Y,00}, and F_{Y,01} for all y ∈ supp Y^N_11.

Corollary 4.1 The quantile function of Y^N_11 is given by

F^{-1}_{Y^N,11}(q) = F^{-1}_{Y,01} ∘ F_{Y,00} ∘ F^{-1}_{Y,10}(q),

and the QTE is given by

Δ^CIC_q = F^{-1}_{Y^I,11}(q) − F^{-1}_{Y^N,11}(q).

The CIC identification can be interpreted as defining a transformation K_CIC(y) = F^{-1}_{Y,01} ∘ F_{Y,00}(y). This transformation suggests the following ATE estimand:

τ_CIC = E[Y^I_11] − E[Y^N_11] = E[Y^I_11] − E[K_CIC(Y_10)] = E[Y^I_11] − E[F^{-1}_{Y,01}(F_{Y,00}(Y_10))].

Under the CIC model, within the same time period, the same realization of Y corresponds to a specific realization of u regardless of the group. Once we know u, the time effect for u, h(u, 1) − h(u, 0), can be backed out by comparing the quantile functions of Y_01 and Y_00.

Given Y_10 there exists u such that h(u, 0) = Y_10. The quantile of u in U_1 is denoted by q_1. Since U_1 ≠ U_0, the quantile of u in U_0, q_0, differs from q_1 in general. The first transformation, F_{Y,00}(Y_10), thus gives us q_0. The time effect, h(u, 1) − h(u, 0), is equal to the quantile treatment effect of time at q_0, since h(u, 1) − h(u, 0) = h̃(q_0, 1) − h̃(q_0, 0), because there is a one-to-one correspondence between q_0 and u. Given u, the counterfactual Y^N_11 equals

h(u, 1) = h̃(q_0, 1) = F^{-1}_{Y,01}(q_0) = F^{-1}_{Y,01}(F_{Y,00}(Y_10)) = F^{-1}_{Y,01} ∘ F_{Y,00} ∘ F^{-1}_{Y,10}(q_1).

Equivalently,

Y^N_11 = Y_10 + h̃(q_0, 1) − h̃(q_0, 0)

= Y_10 + F^{-1}_{Y,01}(q_0) − F^{-1}_{Y,00}(q_0)

= Y_10 + F^{-1}_{Y,01}(F_{Y,00}(Y_10)) − F^{-1}_{Y,00}(F_{Y,00}(Y_10))

= F^{-1}_{Y,01} ∘ F_{Y,00}(Y_10).
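A minimal sketch of the CIC imputation with empirical CDFs and quantiles, under a hypothetical structural function h satisfying CIC-1 to CIC-4; it checks the imputed counterfactual F^{-1}_{Y,01}(F_{Y,00}(Y_10)) against the true h(U_1, 1).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

def h(u, t):
    # Hypothetical structural function, strictly increasing in u for each t.
    return u if t == 0 else u ** 2 + 1.0

# U is time-invariant within each group (CIC-3) but differs across groups;
# supp U_1 = [0.2, 1] lies inside supp U_0 = [0, 1] (CIC-4).
y00 = h(rng.uniform(0.0, 1.0, n), 0)
y01 = h(rng.uniform(0.0, 1.0, n), 1)
u1 = rng.uniform(0.2, 1.0, n)
y10 = h(u1, 0)                   # observed treatment-group outcomes at t = 0
y11_cf = h(u1, 1)                # true (unobserved) counterfactual at t = 1

# CIC imputation: Y^N_11 = F^{-1}_{Y,01}(F_{Y,00}(Y_10)), with empirical CDF and quantiles.
ranks = np.searchsorted(np.sort(y00), y10) / n     # F_hat_{Y,00}(Y_10)
imputed = np.quantile(y01, np.clip(ranks, 0.0, 1.0))

print(imputed.mean(), y11_cf.mean())   # the two means should nearly coincide
```

Subtracting the mean of the imputed counterfactuals from the mean of the observed treated outcomes at t = 1 then gives the CIC estimate of τ_CIC.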

4.5 Quantile Difference-in-Difference

QDID is a generalization of DID that applies the DID method to each quantile instead of the mean; it can be dated back to Meyer, Viscusi, and Durbin (1995) and Poterba, Venti, and Wise (1995). It thus permits richer forms of QTE than the DID method. In DID, the key identification assumption is that the average effect of time is the same for the treatment and the control group in the absence of policy interventions. Analogously, applying DID to each quantile yields:

F^{-1}_{Y^N,11}(q) − F^{-1}_{Y,10}(q) = F^{-1}_{Y,01}(q) − F^{-1}_{Y,00}(q).

That is, the QTE of time is the same for all q ∈ [0, 1]. The identification condition of QDID implies that the counterfactual F^{-1}_{Y^N,11}(q) can be imputed from

F^{-1}_{Y,10}(q) + [F^{-1}_{Y,01}(q) − F^{-1}_{Y,00}(q)].

Athey and Imbens (2002) supply a set of primitive assumptions to justify the QDID identification condition.

Assumption 4.6 QDID-1 (Structural Model)
Y^N = h(U, G, T), where h(u, g, t) is additively separable in g and t; i.e., h(U, G, T) = h_G(U, G) + h_T(U, T)

Assumption 4.7 QDID-2 (Monotonicity)
Given g and t, h(·, g, t) is strictly increasing in u

Assumption 4.8 QDID-3
U ⊥ (G, T)

Athey and Imbens (2002) show that:

Theorem 4.3 (Identification of the QDID Model)
Suppose Y is a continuous random variable and QDID-1 to QDID-3 hold. The counterfactual distribution of Y^N_11 is identified, and

F^{-1}_{Y^N,11}(q) = F^{-1}_{Y,10}(q) + F^{-1}_{Y,01}(q) − F^{-1}_{Y,00}(q), for all q ∈ (0, 1).

Pf 4.5 WLOG, U is assumed to be uniform on [0, 1], and then by QDID-2, h(·, g, t) is interpreted as the quantile function conditional on (g, t) (Skorohod representation):

F_{Y,gt}(y) = P(h(U, G, T) ≤ y|G = g, T = t)

= P(h(U, g, t) ≤ y) (by QDID-3)

= P(U ≤ h^{-1}(y ; g, t)) (by QDID-2)

= h^{-1}(y ; g, t),

so that

h(u, g, t) = F^{-1}_{Y,gt}(u). (1)

By QDID-1,

h(u, 1, 1) − h(u, 1, 0) = h_G(u, 1) + h_T(u, 1) − h_G(u, 1) − h_T(u, 0)

= h_T(u, 1) − h_T(u, 0)

= h(u, 0, 1) − h(u, 0, 0),

hence

h(u, 1, 1) = h(u, 1, 0) + h(u, 0, 1) − h(u, 0, 0). (2)

(1) and (2) yield the desired result.

The modeling philosophies of the QDID and the CIC models are quite different. In QDID, the outcome in the absence of treatment is generated according to

Y^N_{00} = h_0(U) ;  Y^N_{10} = h_1(U)

The unobserved characteristic U is equally distributed in the treatment and control groups (U ⊥ (G, T)), but different groups use different production technologies. Therefore, the observed difference between F_{Y,00} and F_{Y,10} is attributed to the difference between h_0 and h_1. Individuals with the same realization of u will be mapped into different outcomes when their groups differ, but the monotonicity of h_0 and h_1 still preserves their rank. Consequently, comparing individuals at the same quantile of the outcome is equivalent to comparing individuals with the same u under the QDID model.


By contrast, under the CIC model the outcome in the absence of treatment is generated according to

Y^N_{00} = h(U_0) ;  Y^N_{10} = h(U_1)

The treatment and control groups both use the same production technology, but their distributions of characteristics can differ in an arbitrary way. The observed difference between F_{Y,00} and F_{Y,10} is attributed to the difference between U_0 and U_1. Individuals with the same realization of u will be mapped into the same outcome. Therefore, comparing individuals with the same outcome is equivalent to comparing individuals with the same u under the CIC model, though the rank of u in U_0 and U_1 is different. The separability condition QDID-1 is crucial for identifying the time effect. If there is an interaction effect between G and T, the treated and control units are allowed to have different time paths, violating the same-time-effect structure given u. The QDID model suggests the estimator for the QTE to be

τ^{QDID}_q = F^{-1}_{Y^I,11}(q) − F^{-1}_{Y^N,11}(q)
           = F^{-1}_{Y^I,11}(q) − F^{-1}_{Y,10}(q) − [F^{-1}_{Y,01}(q) − F^{-1}_{Y,00}(q)]

Following Koenker (2006), it can be estimated from a quantile regression on group, time, and treatment dummies:

F^{-1}_{Y|G,T}(q) = α(q) + β(q)T + γ(q)G + τ^{QDID}_q GT

Integrating the QTE, τ^{QDID}_q, yields the ATE^{QDID}:

∫ τ^{QDID}_q dq = ∫ (F^{-1}_{Y^I,11}(q) − F^{-1}_{Y,10}(q) − F^{-1}_{Y,01}(q) + F^{-1}_{Y,00}(q)) dq
               = E[Y^I_{11}] − E[Y_{10}] − E[Y_{01}] + E[Y_{00}]

which is the same as ATE^{DID}.

... though the CIC model allows U_1 and U_0 to differ in an arbitrary way, the support condition supp U_1 ⊆ supp U_0 may rule out many interesting cases in labor economics. The support condition implies that supp Y_{10} ⊆ supp Y_{00} and supp Y^N_{11} ⊆ supp Y_{01}. However, in practice max Y_{10} tends to be greater than max Y_{00} in the case of job training programmes (Crump, Hotz, Imbens and Mitnik). Therefore, the ability to impute the entire counterfactual distribution will be cramped by the support problem. Only supp Y_{10} ∩ supp Y_{00} can be imputed under the CIC model. By contrast, the QDID model doesn't suffer from the support problem, because it is always feasible to compute the q-th QTE of Y_{01} versus Y_{00} and add it back to the Y_{10} that corresponds to the q-th quantile in F_{Y,10}. Comparing individuals at the same quantile can be dated back to fractile regression (Mahalanobis, 1960; Sen and Chaudhuri, 2006). Even in the extreme case supp Y_{10} ∩ supp Y_{00} = ∅, they are still comparable, because the probability integral transformation maps supp Y into [0, 1], regardless of the original support of Y.
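The quantile-by-quantile imputation above needs only three empirical quantile functions and no support condition. A small sketch (my own illustration, with assumed distributions) of F^{-1}_{Y^N,11}(q) = F^{-1}_{Y,10}(q) + [F^{-1}_{Y,01}(q) − F^{-1}_{Y,00}(q)]:

```python
import numpy as np

def qdid_counterfactual_quantile(y10, y01, y00, q):
    """QDID imputation: F^{-1}_{Y^N,11}(q) = F^{-1}_{Y,10}(q) + [F^{-1}_{Y,01}(q) - F^{-1}_{Y,00}(q)]."""
    return np.quantile(y10, q) + (np.quantile(y01, q) - np.quantile(y00, q))

rng = np.random.default_rng(1)
y00 = rng.normal(0, 1, 4000)  # control group, period 0
y01 = rng.normal(2, 1, 4000)  # control group, period 1: time effect is +2 at every quantile
y10 = rng.normal(1, 1, 4000)  # treatment group, period 0
q = np.array([0.25, 0.5, 0.75])
cf = qdid_counterfactual_quantile(y10, y01, y00, q)
# Each counterfactual quantile is about the corresponding quantile of y10 plus the time effect
print(cf - np.quantile(y10, q))
```

Note that no treated-period-0 value ever has to be located inside the control distribution's support, which is exactly why the support problem does not arise.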

5 Nonparametric Bounding Approaches

In the preceding sections we reviewed several identification methods for the treatment effect parameters under various assumptions. Although those assumptions are quite different at first glance, the ultimate implication for identification is essentially the same. Those assumptions are strong enough in the sense that the counterfactuals can be imputed from the observable data, and the treatment effect parameters are point identified. The main difference is that the way we impute the counterfactuals might differ under different assumptions. For example, under conditional unconfoundedness, we impute the counterfactual E[Y_1 | D = 0, X] by E[Y_1 | D = 1, X]. In the following sample selection model:

Y_1 = Xβ_1 + u_1,
Y_0 = Xβ_0 + u_0,
D = I{Xβ_D + u_D > 0}.

According to this system, the counterfactual E[Y_1 | D = 0, X] equals E[Xβ_1 + u_1 | Xβ_D + u_D < 0, X] = Xβ_1 + E[u_1 | Xβ_D + u_D < 0, X]. If we make a further distributional assumption, for example that (u_1, u_0, u_D) is trivariate normally distributed, we will obtain a closed-form expression for E[u_1 | Xβ_D + u_D < 0, X]. This is the celebrated Heckman two-step estimator. The above system enables us to impute the counterfactual in a specific manner. However, the validity of such imputation comes from the functional form restriction, the additively separable error term, and the distributional assumption (Manski, 1989). A specific assumption leads to a specific imputation method. Researchers usually have diverse prior beliefs about which assumption is more plausible. Such diverse beliefs come from two facts: First, typically the assumptions in sections 1 to 6 cannot be tested statistically; there is no systematic way to address which assumption is more plausible. Therefore, researchers use economic theory to guide the choice of assumptions. However, the theory usually remains silent about the distributional assumptions and functional form restrictions. Unfortunately, our ability to impute the counterfactuals hinges on those assumptions.


A more challenging question is: can we still say something about the value of the treatment effect parameters if no assumption is made? How far can we go if we only make weaker (and hence more credible) assumptions? Charles Manski provides a fresh insight into this problem by deriving nonparametric bounds for the counterfactuals and the treatment effect parameters under different sets of assumptions. The assumptions we are going to introduce are weaker than those of sections 1 to 6, and we don't assume more than what economic theory predicts. The result is that the information contained in the data is quite limited for imputing the counterfactuals. Instead of point identification of the TE parameters, we can only obtain an identification region for the parameters. This new identification concept is called partial identification or set identification in the literature.

Recall that we only observe (Y, D, X) but want to identify characteristics of (Y_1, Y_0, D, X). For example, we want to learn about E[Y_1] and E[Y_0], and they can be expressed as:

E[Y_1] = E[Y_1 | D = 1] P(D = 1) + E[Y_1 | D = 0] P(D = 0),
E[Y_0] = E[Y_0 | D = 1] P(D = 1) + E[Y_0 | D = 0] P(D = 0).

From now on I suppress the conditioning on X to make the exposition more transparent. The sampling process identifies P(D = 0), P(D = 1), E[Y_1 | D = 1] = E[Y | D = 1], and E[Y_0 | D = 0] = E[Y | D = 0]. Without making further assumptions, the counterfactuals E[Y_1 | D = 0] and E[Y_0 | D = 1] are not identified. Suppose Y_0 and Y_1 are bounded random variables with common support [K_0, K_1]. Boundedness is essential for deriving the nonparametric bound for the ATE. In particular, let's assume Y_0 and Y_1 are binary random variables. This assumption admits a more simplified version of the nonparametric bounds for the ATE. Binary random variables also imply K_1 = 1 and K_0 = 0. Under this assumption, we know E[Y_1 − Y_0] ∈ [−1, 1]. This bound is, of course, trivial. However, without any data, we can only conclude that the ATE belongs to [−1, 1]. The length of this bound equals 2.

5.1 No-assumption Bound

Since Y_1 is binary, E[Y_1 | D = 0] = P(Y_1 = 1 | D = 0) ∈ [0, 1]. Therefore we obtain the upper and lower bounds for E[Y_1]:

U_1 = E[Y_1 | D = 1] P(D = 1) + P(D = 0),
L_1 = E[Y_1 | D = 1] P(D = 1).


Similarly, the bound for E[Y_0] is

U_0 = E[Y_0 | D = 0] P(D = 0) + P(D = 1),
L_0 = E[Y_0 | D = 0] P(D = 0).

Notice that the bounds here are functions of identified objects and can be nonparametrically estimated. The trick of the bound analysis is to plug in the upper and lower bounds for the non-identified counterfactuals, leaving the point-identified objects unchanged. These bounds imply the bound for the ATE:

U = E[Y_1 | D = 1] P(D = 1) − E[Y_0 | D = 0] P(D = 0) + P(D = 0),
L = E[Y_1 | D = 1] P(D = 1) − E[Y_0 | D = 0] P(D = 0) − P(D = 1).

This bound has length P(D = 0) + P(D = 1) = 1. Compared with the trivial bound for the ATE, the uncertainty is reduced by 50% using the information contained in the data. However, the no-assumption bound necessarily covers 0, and hence the sign of the ATE is not identified. Because it is too wide to be useful in practice, we need to impose other assumptions to improve on the accuracy of this bound.
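The no-assumption bounds are simple sample analogues of the expressions above. The following sketch (my own illustration, with arbitrary simulated binary data) computes them and exhibits the two properties just derived, length 1 and coverage of zero:

```python
import numpy as np

def no_assumption_bounds(y, d):
    """Manski no-assumption bounds on the ATE for a binary outcome y and treatment d."""
    p1 = d.mean()
    p0 = 1.0 - p1
    ey1_d1 = y[d == 1].mean()  # identified: E[Y|D=1] = E[Y1|D=1]
    ey0_d0 = y[d == 0].mean()  # identified: E[Y|D=0] = E[Y0|D=0]
    lower = ey1_d1 * p1 - ey0_d0 * p0 - p1
    upper = ey1_d1 * p1 - ey0_d0 * p0 + p0
    return lower, upper

rng = np.random.default_rng(2)
d = rng.integers(0, 2, 10000)
y = rng.integers(0, 2, 10000)
lo, hi = no_assumption_bounds(y, d)
print(hi - lo)          # length is P(D=0) + P(D=1) = 1 regardless of the data
print(lo <= 0.0 <= hi)  # prints True: the bound always covers zero
```

The coverage of zero is mechanical: the lower bound is at most E[Y_1|D=1]P(D=1) − P(D=1) ≤ 0 and the upper bound is at least P(D=0) − E[Y_0|D=0]P(D=0) ≥ 0.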

5.2 Level-set Restrictions: Instrumental Variables

Manski (1990) and Manski and Pepper (2000) assume:

Definition 5.1 (IV)
Z is an instrumental variable in the sense of mean-independence if, for j = 0 and 1, and each (z, z′), we have E[Y_j | Z = z] = E[Y_j | Z = z′].

This is the standard exclusion restriction that Z does not shift the response variables. The no-assumption bound can be applied to E[Y_1 | Z = z], ∀z ∈ support(Z):

U_1(z) = E[Y_1 | D = 1, Z = z] P(D = 1 | Z = z) + P(D = 0 | Z = z),
L_1(z) = E[Y_1 | D = 1, Z = z] P(D = 1 | Z = z).

Since E[Y_1 | Z = z] = E[Y_1] by the IV assumption, L_1(z) ≤ E[Y_1] ≤ U_1(z) for all z. Therefore, we obtain the intersection bound for E[Y_1]:

sup_z L_1(z) ≤ E[Y_1] ≤ inf_z U_1(z)

Chernozhukov, Lee, and Rosen (2008) analyze the asymptotic properties of the intersection bounds.

(Another example of a level-set restriction is the constant-effect assumption; see Manski (1990).)

We can also discuss the identification power of an instrument Z. Consider the extreme case in which Z is independent of D. This implies P(D = 1 | Z) = P(D = 1). Since E[Y_j | Z = z] is a constant function of z, constant P(D = 1 | Z) implies that E[Y_j | D = 1, Z = z] and E[Y_j | D = 0, Z = z] are necessarily constant functions of z as well. We conclude that E[Y_j | D = 1, Z = z] = E[Y_j | D = 1] and E[Y_j | D = 0, Z = z] = E[Y_j | D = 0]. Therefore U_1(z) = U_1 and L_1(z) = L_1, and such an IV cannot improve on the no-assumption bound.
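As an illustration (my own sketch, with an assumed binary instrument and simulated data), the intersection bound takes the best lower and best upper bound across the values of z; when the instrument shifts P(D = 1 | Z), the intersection is tighter than the no-assumption bound:

```python
import numpy as np

def intersection_bounds_ey1(y, d, z):
    """Intersection bounds for E[Y1] with a discrete instrument z (binary outcome y)."""
    lowers, uppers = [], []
    for zv in np.unique(z):
        m = z == zv
        p1 = d[m].mean()
        ey1 = y[m & (d == 1)].mean() if (m & (d == 1)).any() else 0.0
        lowers.append(ey1 * p1)               # L1(z)
        uppers.append(ey1 * p1 + (1.0 - p1))  # U1(z)
    return max(lowers), min(uppers)           # sup_z L1(z), inf_z U1(z)

rng = np.random.default_rng(3)
n = 20000
z = rng.integers(0, 2, n)                     # instrument shifts treatment probability
d = (rng.random(n) < 0.3 + 0.4 * z).astype(int)
y1 = rng.integers(0, 2, n)                    # potential outcome, mean 0.5, independent of z
y = np.where(d == 1, y1, 0)                   # y1 is observed only for the treated
lo, hi = intersection_bounds_ey1(y, d, z)
print(lo <= 0.5 <= hi)                        # covers the true E[Y1] = 0.5
print(hi - lo)                                # strictly shorter than 1
```

Here the z = 1 cell (with the higher treatment probability) supplies both the tightest lower and the tightest upper bound, shrinking the length well below the no-assumption length of 1.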

5.3 Restrictions on the Moment Conditions: Monotone Instrumental Variables

Manski and Pepper (2000) propose a new concept of instrumental variables, termed monotone instrumental variables:

Definition 5.2 (MIV) Covariate Z is a monotone instrumental variable if

E[Y_1 | Z = z] ≥ E[Y_1 | Z = z′], and
E[Y_0 | Z = z] ≥ E[Y_0 | Z = z′], ∀z ≥ z′.

Compared with the IV assumption, the equality in IV is replaced by an inequality, yielding a set of moment inequalities. This assumption is quite nonstandard because IV assumptions usually generate a set of moment equalities. However, moment inequalities can still be used to bound the parameters of interest. According to the MIV assumption, E[Y_1 | Z = z] ≥ E[Y_1 | Z = z′], ∀z ≥ z′. Therefore, E[Y_1 | Z = z] ≤ inf_{z″ ≥ z} E[Y_1 | Z = z″]. Analogously, the no-assumption bound can be applied to each E[Y_1 | Z = z″]. Therefore, the bound for E[Y_1 | Z = z] is given by

sup_{z′ ≤ z} L_1(z′) ≤ E[Y_1 | Z = z] ≤ inf_{z″ ≥ z} U_1(z″),

which is also an intersection bound. Integrating the upper and lower bounds with respect to the distribution of Z yields the bound for E[Y_1].

An example of an MIV is an IQ test score. Let Y_j, j = 0, 1, be the wage functions. The MIV assumption asserts that persons with a higher IQ test score will have weakly higher mean wage functions, regardless of whether they participate in the job training program.
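A sketch of the MIV calculation (my own illustration, with an assumed ordered instrument and simulated data): take the running maximum of L_1(z) from below and the running minimum of U_1(z) from above, then integrate over the distribution of Z:

```python
import numpy as np

def miv_bounds_ey1(y, d, z):
    """MIV bounds on E[Y1] for binary y and an ordered discrete instrument z."""
    zs = np.unique(z)  # sorted support of z
    L, U, w = np.empty(len(zs)), np.empty(len(zs)), np.empty(len(zs))
    for i, zv in enumerate(zs):
        m = z == zv
        p1 = d[m].mean()
        ey1 = y[m & (d == 1)].mean() if (m & (d == 1)).any() else 0.0
        L[i] = ey1 * p1                   # L1(z)
        U[i] = ey1 * p1 + (1.0 - p1)      # U1(z)
        w[i] = m.mean()                   # P(Z = z)
    L_miv = np.maximum.accumulate(L)               # sup_{z' <= z} L1(z')
    U_miv = np.minimum.accumulate(U[::-1])[::-1]   # inf_{z'' >= z} U1(z'')
    return float(w @ L_miv), float(w @ U_miv)      # integrate over the distribution of Z

rng = np.random.default_rng(4)
n = 30000
z = rng.integers(0, 3, n)                          # e.g. a coarsened test score
d = rng.integers(0, 2, n)
y1 = (rng.random(n) < 0.3 + 0.2 * z).astype(int)   # E[Y1|Z=z] increasing in z: MIV holds
y = np.where(d == 1, y1, 0)                        # y1 observed only for the treated
lo, hi = miv_bounds_ey1(y, d, z)
print(lo <= 0.5 <= hi)                             # the true E[Y1] = 0.5 is covered
```

The accumulate calls implement the sup and inf over the ordered support; with a continuous instrument one would do the same over a grid of its values.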

5.4 Monotone Treatment Selection

Manski and Pepper (2000) show that the observed treatment D itself can be viewed as an MIV if the following conditions are satisfied:

Definition 5.3 (MTS)

E[Y_1 | D = 1] ≥ E[Y_1 | D = 0], and
E[Y_0 | D = 1] ≥ E[Y_0 | D = 0].

Clearly, D is a monotone IV according to this definition. This special case of MIV is called monotone treatment selection. These two moment inequalities are consistent with selection-on-unobservables. For example, theory suggests that those with higher ability choose higher schooling levels and have higher mean wages than those with lower ability. In our quality payment program example, patients with higher self-consciousness choose the hospitals participating in the program and have a higher treatment completion rate. Therefore, MTS captures the main implication of some sample selection problems, or some sample selection problems directly lead to MTS.

MTS yields an upper bound for the counterfactual E[Y_1 | D = 0], which is E[Y_1 | D = 1], and a lower bound for E[Y_0 | D = 1], which is E[Y_0 | D = 0]. The bound associated with E[Y_1] becomes:

U_1 = E[Y_1 | D = 1],
L_1 = E[Y_1 | D = 1] P(D = 1).

The lower bound is the same as the no-assumption bound because MTS does not provide information for the lower bound of E[Y_1 | D = 0]. Analogously, the bound associated with E[Y_0] is:

U_0 = E[Y_0 | D = 0] P(D = 0) + P(D = 1),
L_0 = E[Y_0 | D = 0].

The bound for the ATE is:

U = E[Y_1 | D = 1] − E[Y_0 | D = 0],
L = E[Y_1 | D = 1] P(D = 1) − E[Y_0 | D = 0] P(D = 0) − P(D = 1).

Notice that L here is the same as the no-assumption bound, and U here is just the group mean difference, which identifies the ATE when conditional unconfoundedness holds.
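The MTS bounds are again sample analogues. The sketch below (my own illustration, with simulated selection-on-ability data) shows the bound covering the true ATE of zero even though the naive group-mean difference is positive:

```python
import numpy as np

def mts_ate_bounds(y, d):
    """ATE bounds under monotone treatment selection (binary outcome y)."""
    p1 = d.mean()
    p0 = 1.0 - p1
    ey1_d1 = y[d == 1].mean()
    ey0_d0 = y[d == 0].mean()
    upper = ey1_d1 - ey0_d0                 # U: the plain group-mean difference
    lower = ey1_d1 * p1 - ey0_d0 * p0 - p1  # L: same as the no-assumption lower bound
    return lower, upper

rng = np.random.default_rng(5)
n = 50000
ability = rng.random(n)
d = (ability > 0.5).astype(int)                        # high-ability units select into treatment
y = (rng.random(n) < 0.2 + 0.6 * ability).astype(int)  # outcome rises with ability; true ATE = 0
lo, hi = mts_ate_bounds(y, d)
print(lo <= 0.0 <= hi)  # prints True: the MTS bound covers the true ATE of zero
print(hi > 0)           # prints True: the naive group-mean difference is positive
```

MTS holds in this design because both potential outcomes increase with the same ability that drives selection, which is exactly the selection-on-unobservables story in the text.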

5.5 Shape Restrictions on the Response Functions: Monotone Treatment Response

Manski (1997) considers restrictions on the response variables. For example, economic theory asserts that the supply function is monotonically increasing in price. Instead of imposing parametric functional forms on the response functions, Manski (1997) considers weaker assumptions, such as monotonicity or concavity of the response functions. Typically, such assumptions are no stronger than the implications of many economic theories. Suppose the job training program is indeed beneficial for everyone; this would imply the assumption of monotone treatment response:

Definition 5.4 (MTR) Y_{1i} ≥ Y_{0i}, ∀i.

According to MTR, we obtain bounds for the counterfactuals:

E[Y_1 | D = 0] ≥ E[Y_0 | D = 0], and
E[Y_1 | D = 1] ≥ E[Y_0 | D = 1].

The bound for E[Y_1] under MTR is

U_1 = E[Y_1 | D = 1] P(D = 1) + P(D = 0),
L_1 = E[Y_1 | D = 1] P(D = 1) + E[Y_0 | D = 0] P(D = 0).

Similarly, the bound for E[Y_0] is

U_0 = E[Y_0 | D = 0] P(D = 0) + E[Y_1 | D = 1] P(D = 1),
L_0 = E[Y_0 | D = 0] P(D = 0).

These bounds imply the bound for the ATE:

U = E[Y_1 | D = 1] P(D = 1) − E[Y_0 | D = 0] P(D = 0) + P(D = 0),
L = 0.

Clearly MTR necessarily implies that the ATE is weakly greater than 0. The upper bound here is the same as the no-assumption bound. Although MTS and MTR look very similar, they have distinct meanings. I borrow the interpretation from Manski and Pepper (2000):

Consider the variation of wages with schooling. It is common to hear the verbal assertion that wages increase with schooling. The MTS and MTR assumptions interpret this statement in different ways. The MTS interpretation is that persons who select higher levels of schooling have weakly higher mean wage functions than do those who select lower levels of schooling. The MTR interpretation is that each person's wage function is weakly increasing in conjectured years of schooling.

The MTS assumption is consistent with economic models of schooling choice and wage determination that predict that persons with higher ability have higher mean wage functions and choose higher levels of schooling than do persons with lower ability. The MTR assumption is consistent with economic models of the production of human capital through schooling.


In general, several assumptions can be imposed together to yield a sharper identification region, as long as they do not contradict each other. For example, if we impose MTR as well as MTS, the bound for the ATE shrinks to:

U = E[Y_1 | D = 1] − E[Y_0 | D = 0],
L = 0.
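Combining the two restrictions gives the simple interval [0, E[Y|D=1] − E[Y|D=0]]. A sketch (my own illustration, with simulated data engineered so that both MTR and MTS hold):

```python
import numpy as np

def mtr_mts_ate_bounds(y, d):
    """ATE bounds when MTR (Y1 >= Y0) and MTS are imposed jointly (binary outcome y)."""
    lower = 0.0                                   # MTR alone forces ATE >= 0
    upper = y[d == 1].mean() - y[d == 0].mean()   # MTS caps the ATE by the group-mean difference
    return lower, upper

rng = np.random.default_rng(6)
n = 50000
ability = rng.random(n)
d = (ability > 0.5).astype(int)                          # positive selection: MTS holds
y0 = (rng.random(n) < 0.2 + 0.4 * ability).astype(int)
y1 = np.maximum(y0, (rng.random(n) < 0.1).astype(int))   # MTR holds by construction: y1 >= y0
y = np.where(d == 1, y1, y0)
lo, hi = mtr_mts_ate_bounds(y, d)
true_ate = float(np.mean(y1 - y0))
print(lo <= true_ate <= hi)  # prints True: the combined bound covers the true ATE
```

Note how little estimation is actually required: both endpoints are functions of the two identified group means, so all of the work is done by the assumptions.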

5.6 Restrictions on the Selection Mechanism

Manski (1990, 1994) shows that selection rules can be used to bound the counterfactuals. Consider the selection rule of the Roy model, D = I{Y_1 ≥ Y_0}. This selection rule is also consistent with agents' optimization behavior. If we specify the functional forms for Y_j and the distributional assumptions, the Roy model can be analyzed using the methods developed in section 6. However, without those assumptions, such a selection rule is already informative enough to bound the counterfactuals. Notice that:

E[Y_1 | D = 0] = E[Y_1 | Y_1 < Y_0] ≤ E[Y_1 | Y_1 ≥ Y_0] = E[Y_1 | D = 1], and
E[Y_1 | D = 0] = E[Y_1 | Y_1 < Y_0] ≤ E[Y_0 | Y_1 < Y_0] = E[Y_0 | D = 0].

|D = 0].

Holding Y_0 fixed, Y_1 in the region {Y_1 < Y_0} is weakly smaller than Y_1 in the region {Y_1 ≥ Y_0}. This holds true for any value of Y_0. By the monotonicity of the expectation, we obtain the first inequality. The second inequality is simple: since we condition on Y_1 < Y_0, of course Y_0 is greater than Y_1 in this region. Analogously, there are two upper bounds for the counterfactual E[Y_0 | D = 1]:

E[Y_0 | D = 1] = E[Y_0 | Y_1 ≥ Y_0] ≤ E[Y_0 | Y_1 < Y_0] = E[Y_0 | D = 0], and
E[Y_0 | D = 1] = E[Y_0 | Y_1 ≥ Y_0] ≤ E[Y_1 | Y_1 ≥ Y_0] = E[Y_1 | D = 1].

Hence an upper bound for both E[Y_0 | D = 1] and E[Y_1 | D = 0] is Δ ≡ min(E[Y_0 | D = 0], E[Y_1 | D = 1]). The above analysis gives the bound for E[Y_1]:

U_1 = E[Y_1 | D = 1] P(D = 1) + Δ P(D = 0),
L_1 = E[Y_1 | D = 1] P(D = 1).

Similarly, the bound for E[Y_0] is

U_0 = E[Y_0 | D = 0] P(D = 0) + Δ P(D = 1),
L_0 = E[Y_0 | D = 0] P(D = 0).

These bounds imply the bound for the ATE:

U = E[Y_1 | D = 1] P(D = 1) − E[Y_0 | D = 0] P(D = 0) + Δ P(D = 0),
L = E[Y_1 | D = 1] P(D = 1) − E[Y_0 | D = 0] P(D = 0) − Δ P(D = 1).

This bound has length Δ(P(D = 0) + P(D = 1)) = Δ ≤ 1, and hence this bound for the ATE is contained in the no-assumption bound.

5.7 Some Remarks

The above arguments also hold true after conditioning on covariates X. We can derive bounds for the conditional ATE under different assumptions:

L(X) ≤ E[Y_1 − Y_0 | X] ≤ U(X)

Integrating U(x) and L(x) with respect to the distribution of X gives the bound for the ATE:

∫ L(x) dF(x) ≤ E[Y_1 − Y_0] = ∫ E[Y_1 − Y_0 | X = x] dF(x) ≤ ∫ U(x) dF(x).

To bound the ATE, boundedness of the response variables is essential. Even if Y_1 has unbounded support, we can still bound E[g(Y_1)], provided that g(·) is a bounded function. A useful case is g(Y_1) = I{Y_1 ≤ y}, because E[g(Y_1)] = P(Y_1 ≤ y) = F_1(y), the CDF of Y_1.
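For instance, a pointwise sketch (my own illustration, with simulated data) of no-assumption bounds on F_{Y_1}(t): the indicator I{Y_1 ≤ t} is binary, so the binary-outcome bounds apply at every t even though Y_1 itself is unbounded:

```python
import numpy as np

def cdf_bounds(y, d, grid):
    """No-assumption bounds on F_{Y1}(t) = P(Y1 <= t) for each t in grid."""
    p1 = d.mean()
    p0 = 1.0 - p1
    F1_treated = np.array([(y[d == 1] <= t).mean() for t in grid])  # identified F_{Y1|D=1}
    lower = F1_treated * p1        # the untreated arm contributes 0 at best ...
    upper = F1_treated * p1 + p0   # ... or 1 at worst
    return lower, upper

rng = np.random.default_rng(7)
n = 20000
y1 = rng.normal(0, 1, n)           # unbounded outcome: its CDF is still bounded pointwise
d = rng.integers(0, 2, n)
y = np.where(d == 1, y1, np.nan)   # y1 is unobserved for the untreated
grid = np.array([-1.0, 0.0, 1.0])
lo, hi = cdf_bounds(y, d, grid)
true_F = np.array([(y1 <= t).mean() for t in grid])  # knowable here because data are simulated
print(np.all((lo <= true_F) & (true_F <= hi)))       # prints True: pointwise coverage
```

Bounding the CDF pointwise in this way is what makes bounds on quantiles and other functionals of the distribution possible.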

Thus it is always possible to bound the distribution function based on the approaches in this section. Since various statistical quantities are nothing but functionals of the distribution functions, in principle one can derive bounds for the mean, variance (Stoye, 2007), and quantiles (Manski, 1994). Manski (2003, 2007) reviews the literature on partial identification. Horowitz and Manski (2000), Imbens and Manski (2004), Chernozhukov, Hong, and Tamer (2008), Beresteanu and Molinari (2008), Romano and Shaikh (2006), and Stoye (2008) construct confidence intervals for bounds. Empirical research using this methodology includes Manski, Sandefur, McLanahan and Powers (1992), Blundell, Gosling, Ichimura, and Meghir (2004), Manski and Nagin (1998), Pepper (2000, 2003), and Lechner (1999). Manski and Tamer (2002), Hong and Tamer (2003), Haile and Tamer (2003), Tamer (2003), Honore and Lleras-Muney (2006), and Honore and Tamer (2006) apply this method to interval regression, censored regression, English auctions, incomplete models (e.g., models with multiple equilibria), competing risk models, and panel dynamic discrete choice models, respectively.


References

Abadie, A., D. Drukker, J. L. Herr, and G. W. Imbens (2001). Implementing Matching Estimators for Average Treatment Effects in Stata, The Stata Journal, 1, 1–18.

Abadie, A. (2002). Bootstrap Tests for Distributional Treatment Effects in Instrumental Variable Models, Journal of the American Statistical Association, 97, 284–292.

Abadie, A., J. Angrist, and G. W. Imbens (2002). Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings, Econometrica, 70, 91–117.

Abadie, A. (2003). Semiparametric Instrumental Variable Estimation of Treatment Response Models, Journal of Econometrics, 113, 231–263.

Abadie, A. and G. W. Imbens (2006). Large Sample Properties of Matching Estimators for Average Treatment Effects, Econometrica, 74, 235–267.

Angrist, J. and G. W. Imbens (1995). Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity, Journal of the American Statistical Association, 90, 431–442.

Angrist, J., G. W. Imbens, and D. B. Rubin (1996). Identification of Causal Effects Using Instrumental Variables (with discussion), Journal of the American Statistical Association, 91, 444–455.

Angrist, J. (2001). Estimation of Limited Dependent Variable Models with Dummy Endogenous Regressors: Simple Strategies for Empirical Practice (with discussion), Journal of Business and Economic Statistics, 19, 2–16.

Angrist, J. and A. B. Krueger (2001). Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments, The Journal of Economic Perspectives, 15, 69–85.

Angrist, J. (2004). Treatment Effect Heterogeneity in Theory and Practice, The Economic Journal, 114, C52–C83.

Angrist, J. (2006). Instrumental Variables Methods in Experimental Criminological Research: What, Why and How, Journal of Experimental Criminology, 2, 23–44.

Card, D. (1995). Using Geographic Variation in College Proximity to Estimate the Return to Schooling. In: Christofides, L., Grant, E., Swidinsky, R. (Eds.), Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp. University of Toronto Press, Toronto, 201–222.

Chaudhuri, P. (1991). Nonparametric Estimation of Regression Quantiles and Their Local Bahadur Representation, Annals of Statistics, 19, 760–777.

Chen, X., H. Hong, and A. Tarozzi (2004). Semiparametric Efficiency in GMM Models of Nonclassical Measurement Error, Missing Data and Treatment Effect, Annals of Statistics, forthcoming.

Chernozhukov, V., G. W. Imbens, and W. K. Newey (2004). Instrumental Variable Identification and Estimation of Nonseparable Models via Quantile Conditions, Working Paper.

Chernozhukov, V. and C. Hansen (2005). An IV Model of Quantile Treatment Effects, Econometrica, 73, 245–261.

Chernozhukov, V. and C. Hansen (2006). Instrumental Quantile Regression Inference for Structural and Treatment Effect Models, Journal of Econometrics, 132, 491–525.

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics, New York: Wiley.

Dehejia, R. (2005). Practical Propensity Score Matching: A Reply to Smith and Todd, Journal of Econometrics, 125, 355–364.

Dehejia, R. H. and S. Wahba (1999). Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs, Journal of the American Statistical Association, 94, 1053–1062.

Dehejia, R. H. and S. Wahba (2002). Propensity Score Matching Methods for Nonexperimental Causal Studies, Review of Economics and Statistics, 84, 151–161.

Firpo, S. (2007). Efficient Semiparametric Estimation of Quantile Treatment Effects, Econometrica, 75, 259–276.

Frolich, M. (2006). Nonparametric IV Estimation of Local Average Treatment Effects with Covariates, Journal of Econometrics, forthcoming.

Frolich, M. and B. Melly (2008). Estimation of Quantile Treatment Effects with STATA, Working Paper.

Hahn, J. (1998). On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects, Econometrica, 66, 315–331.

Heckman, J. J. (1976). The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models, Annals of Economic and Social Measurement, 5, 475–492.

Heckman, J. J., H. Ichimura, and P. E. Todd (1997). Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme, Review of Economic Studies, 64, 605–654.

Heckman, J. J., H. Ichimura, and P. E. Todd (1998). Matching as an Econometric Evaluation Estimator, Review of Economic Studies, 65, 261–294.

Heckman, J. J. (1996). Comment, Journal of the American Statistical Association, 91, 459–462.

Heckman, J. J. (1996). Randomization as an Instrumental Variable, The Review of Economics and Statistics, 78, 336–341.

Heckman, J., J. L. Tobias, and E. Vytlacil (2003). Simple Estimators for Treatment Parameters in a Latent-Variable Framework, Review of Economics and Statistics, 85, 748–755.

Hirano, K., G. W. Imbens, and G. Ridder (2003). Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score, Econometrica, 71, 1161–1189.

Holland, P. W. (1986). Statistics and Causal Inference, Journal of the American Statistical Association, 81, 945–960.

Imbens, G. W. and J. Angrist (1994). Identification and Estimation of Local Average Treatment Effects, Econometrica, 62, 467–475.

Imbens, G. W. and D. B. Rubin (1997). Estimating Outcome Distributions for Compliers in Instrumental Variables Models, The Review of Economic Studies, 64, 555–574.

Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review, The Review of Economics and Statistics, 86, 4–29.

Imbens, G. W. (2006). Nonadditive Models with Endogenous Regressors, Working Paper.

Imbens, G. W. and W. K. Newey (2008). Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity, Working Paper.

Khan, S. and E. Tamer (2008). Irregular Identification, Support Conditions and Inverse Weight Estimation, Working Paper.

Lee, M.-J. (2005). Micro-Econometrics for Policy, Program, and Treatment Effects.

Lechner, M. (1999). Nonparametric Bounds on Employment and Income Effects of Continuous Vocational Training in East Germany, Econometrics Journal, 2, 1–28.

Li, K.-C. (1991). Sliced Inverse Regression for Dimension Reduction (with discussion), Journal of the American Statistical Association, 86, 316–327.

Manski, C. F. (1989). Anatomy of the Selection Problem, Journal of Human Resources, 24, 343–360.

Manski, C. F. (1990). Nonparametric Bounds on Treatment Effects, AER Papers and Proceedings, 80, 319–323.

Manski, C. F., G. Sandefur, S. McLanahan, and D. Powers (1992). Alternative Estimates of the Effect of Family Structure During Adolescence on High School Graduation, Journal of the American Statistical Association, 87, 25–37.

Manski, C. F. (1994). The Selection Problem, in Advances in Econometrics, Sixth World Congress, Cambridge University Press.

Manski, C. F. (1997). Monotone Treatment Response, Econometrica, 65, 1311–1334.

Manski, C. F. and D. Nagin (1998). Bounding Disagreements about Treatment Effects: A Case Study of Sentencing and Recidivism, Sociological Methodology, 28, 99–137.

Manski, C. F. and J. V. Pepper (2000). Monotone Instrumental Variables: With an Application to the Returns to Schooling, Econometrica, 68, 997–1010.

Manski, C. F. and E. Tamer (2002). Inference on Regressions with Interval Data on a Regressor or Outcome, Econometrica, 70, 519–546.

Manski, C. F. (2003). Partial Identification of Probability Distributions, Springer-Verlag.

Manski, C. F. (2007). Partial Identification in Econometrics, forthcoming in The New Palgrave Dictionary of Economics, second edition.

Newey, W. K. and J. Powell (2003). Instrumental Variable Estimation of Nonparametric Models, Econometrica, 71, 1565–1578.

Pagan, A. and A. Ullah (2005). Nonparametric Econometrics.

Pearl, J. (2001). Causality.

Pepper, J. V. (2000). The Intergenerational Transmission of Welfare Receipt: A Nonparametric Bounds Analysis, Review of Economics and Statistics, 82, 472–488.

Pepper, J. V. (2003). Using Experiments to Evaluate Performance Standards: What Do Welfare-to-Work Demonstrations Reveal to Welfare Reformers? Journal of Human Resources, 38, 860–880.

Rosenbaum, P. R. (1996). Comment, Journal of the American Statistical Association, 91, 465–468.

Rosenbaum, P. R. (2002). Observational Studies, New York: Springer-Verlag.

Rosenbaum, P. R. and D. B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70, 41–55.

Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies, Journal of Educational Psychology, 66, 688–701.

Shaikh, A. M., M. Simonsen, E. J. Vytlacil, and N. Yildiz (2005). On the Identification of Misspecified Propensity Scores, Working Paper.

Shaikh, A. M. and E. J. Vytlacil (2005). Threshold Crossing Models and Bounds on Treatment Effects: A Nonparametric Analysis, Working Paper.

Smith, J. A. and P. E. Todd (2001). Reconciling Conflicting Evidence on the Performance of Propensity Score Matching Methods, The American Economic Review, 91, 112–118.

Vytlacil, E. (2003). Dummy Endogenous Variables in Nonseparable Models, Working Paper.

Vytlacil, E. and N. Yildiz (2007). Dummy Endogenous Variables in Weakly Separable Models, Econometrica, 75, 757–779.

- UNSW -Econ2206 Solutions Semester 1 2011 - Introductory Eco No MetricsUploaded byleobe89
- 14633comparingmanymeans-160909024307Uploaded byJasMisionMXPachuca
- REGRESSION ANALYSISUploaded bysdhanju2007
- Factorial Anova Examples PDFUploaded byEugene
- Syllabus_G651Uploaded byifcrindia
- Minitab CommandsUploaded byHenry Tom
- chin 2003Uploaded byBet9112003
- MTE3105 Pengujian Hipotesis Khi Kuasa Dua 2Uploaded bySitherrai Paramananthan
- 02_tests.pdfUploaded bydaria_ioana
- Assignment ProblemsUploaded byAhamed Wasil