
Impact evaluation, treatment effects and causal analysis

Markus Frölich

Universität Mannheim

Lecture notes, Version: 17 April 2008

Do not circulate or quote without permission

Professor Dr. Markus Frölich

Lehrstuhl für Ökonometrie / Chair of Econometrics

Universität Mannheim

L7, 3-5, 68131 Mannheim, Deutschland

froelich@uni-mannheim.de

Contents

1 Causality, nonparametric identification, experimental data
  1.0.1 Examples
  1.1 Experiments, randomized trials
  1.2 Nonexperimental data and parameters of interest

2 Selection on observables, causal graphs, matching estimators
  2.1 Causal graphs
    2.1.1 Front door identification
    2.1.2 Direct and indirect effects, partial effects
  2.2 Matching estimators
    2.2.1 Brief introduction to nonparametric regression
    2.2.2 Matching estimators
    2.2.3 Propensity score matching
    2.2.4 Propensity score weighting
    2.2.5 Combination of weighting and regression
    2.2.6 Asymptotic properties of treatment effect estimators
    2.2.7 Testing the validity of the conditional independence assumption
    2.2.8 Matching with non-binary D
    2.2.9 OLS and linear regression models
  2.3 Multiple treatment evaluation

3 Dynamic treatment evaluation
  3.1 Shortcomings of the static potential outcomes model
  3.2 Dynamic potential outcomes model
    3.2.1 Equivalence to static model
    3.2.2 Sequential conditional independence assumptions
    3.2.3 Sequential matching or weighting estimation
    3.2.4 Relationship to causality in time series econometrics
  3.3 Timing of events and duration models

4 Selection on unobservables - nonparametric IV - triangular model
  4.1 Examples of instrumental variables
  4.2 Local average treatment effect
    4.2.1 Identifying the potential distributions for compliers
  4.3 LATE with covariates
  4.4 Combination of IV with matching
  4.5 Regression discontinuity design
    4.5.1 Regression discontinuity design with covariates
  4.6 Marginal treatment effect
  4.7 Non-binary models with monotonicity in choice equation
    4.7.1 Continuous D with triangularity
    4.7.2 Ordered discrete D with triangularity
    4.7.3 Unordered discrete D with triangularity

5 Identification without triangularity
  5.1 Monotonicity in the outcome equation
  5.2 Non-continuous Y and the coherency condition

6 Linear single equation IV models

7 Repetition of GMM estimation
  7.1 GMM
  7.2 GMM with orthogonality restrictions
  7.3 GMM with conditional moment restrictions

8 Linear system estimation by IV

9 Nonparametric diff-in-diff and panel data
  9.1 Linear difference-in-difference
  9.2 Changes-in-changes model
    9.2.1 Changes-in-changes with continuous outcome Y
    9.2.2 Changes-in-changes with discrete outcome Y and interval identification
    9.2.3 Changes-in-changes with discrete outcome Y and point identification

10 Sources of identifying variation

11 Linear panel data models
  11.1 Strict exogeneity
  11.2 Random effects methods
  11.3 Fixed effects methods
  11.4 Predeterminedness and dynamic panel data
    11.4.1 Arellano-Bond difference-GMM estimator
    11.4.2 Blundell-Bond system-GMM estimator
  11.5 External instruments

12 Non-linear panel data models
  12.1 Binary choice models
    12.1.1 Fixed effects model under strict exogeneity
    12.1.2 Predetermined regressors (and dynamic models) with fixed effects
    12.1.3 Random effects model under strict exogeneity
    12.1.4 Correlated effects model under strict exogeneity
    12.1.5 Predetermined regressors (and dynamic models) with random effects
  12.2 Panel data models with nonseparability

13 Weak instruments
  13.1 Weak IV in single linear equation models
    13.1.1 Finite-sample properties of 2SLS with normal errors
    13.1.2 Tests for weak IV
    13.1.3 Alternative estimators partially robust to weak IV
    13.1.4 Robust inference with weak IV for 2SLS
    13.1.5 Robust inference with weak IV for alternative estimators
  13.2 Weak IV in systems of linear equations
  13.3 Weak IV in nonlinear models
  13.4 Weak IV in nonparametric models

14 Nonparametric estimation
  14.1 Curse of dimensionality
  14.2 Kernel based methods for nonparametric regression
    14.2.1 Nonparametric moment estimation in one dimension
    14.2.2 Multivariate local polynomial regression
    14.2.3 Local parametric regression
    14.2.4 Combination of discrete and continuous regressors
    14.2.5 Nonparametric estimation of monotone functions
  14.3 Nonparametric density estimation
  14.4 Global nonparametric methods
    14.4.1 Smoothing splines
    14.4.2 Series estimation
  14.5 Comparison of splines, series, kernels

15 Semiparametric estimation
  15.1 Semiparametric structure
    15.1.1 Partial linear models
    15.1.2 Index models
    15.1.3 Additive models
  15.2 Semiparametric estimators and partial means
    15.2.1 Examples for calculation of the adjustment factor
    15.2.2 Semiparametric efficiency bounds
    15.2.3 Choice of smoothing parameters
    15.2.4 Average derivative estimation (skip this chapter)
    15.2.5 Alternative dimension reduction approaches: partial means

16 Quantile regression
  16.0.6 Properties of quantiles
  16.1 Parametric quantile regression
    16.1.1 Some examples
    16.1.2 Optimization instead of ordering
    16.1.3 Linear programming algorithms
    16.1.4 Inference and asymptotic theory
  16.2 Nonparametric quantile regression
  16.3 Quantile treatment effects
  16.4 QTE under selection on observables
  16.5 Unconditional QTE under endogeneity
  16.6 Conditional QTE with endogeneity
    16.6.1 Abadie, Angrist, Imbens estimator

17 Simulation and numerical methods
  17.1 Numerical optimization
  17.2 Simulation based methods (MSL, SGMM)
    17.2.1 Maximum simulated likelihood estimation
    17.2.2 Method of simulated moments
    17.2.3 Indirect inference
    17.2.4 Simulators
  17.3 Bootstrapping
    17.3.1 Uses of the bootstrap
    17.3.2 Consistency
    17.3.3 Asymptotic refinement
    17.3.4 Bootstrap sampling methods and number of replications

18 Duration models

19 Further topics

A Appendix: Asymptotic Theory



Introduction

This course in advanced microeconometrics is intended to familiarize students with recent developments in econometric theory and their applications. It is strongly oriented towards recent developments and emphasizes the reading of original articles by leading scholars. These lecture notes do not substitute for the reading of the original articles; rather, they seek to summarize a few of the central aspects of each paper, to harmonize notation and to provide a coherent roadmap. The course is largely oriented towards topics of current interest, in particular impact evaluation, treatment effects and causal analysis. These include topics such as nonparametric identification with endogenous variables and recent research developments on weak instruments. The textbooks of Cameron and Trivedi (2005) and of Wooldridge (2002) are used to fill in more traditional material.

1 Causality, nonparametric identification, experimental data

In econometrics, we often want to learn the causal or structural effect of a variable D on some variable Y.[1] One might be interested in the total effect of D on Y, or in the effect of D on Y in a particular environment, i.e. where other variables are held fixed (partial effect). D and Y could be price and demand for a good in a market, where we are interested in the impact of price on demand or vice versa. In another example, D could be the amount of education an individual has received, e.g. years of schooling, and Y could be an outcome later in life, e.g. employment status, earnings or wealth. This setup also encompasses the recent literature on treatment evaluation, where D ∈ {0,1} is binary and indicates whether an individual received a particular treatment or did not. A treatment could represent, for example, receiving a vaccine or a medical treatment, participating in an adult (computer) literacy training programme, participating in a public-works scheme, attending private versus public secondary school, attending a university etc. A treatment could also be a voucher (or receiving the entitlement to a voucher) to attend private school, or a conditional cash transfer.[2] D could also be nominal (i.e. discrete without natural ordering), e.g. representing different subjects of degree at university.

[1] Thus we aim to estimate the structural effect of D on Y. Our principal aim is not to find the best fitting model for predicting Y or to analyze the covariance of Y (ANCOVA).
[2] E.g. the large conditional cash transfer programmes in Mexico, Colombia and Latin America in general.

In the first chapters we will be mainly interested in how we could identify such relationships from an infinitely large sample of independently sampled observations. Let (Ω, F, Pr) denote a probability space. All random variables will be defined on this common probability space. To avoid measure-theoretic notation, which often does not help much in understanding the economic content of econometrics, we let i denote an element of Ω and might often think of it as an individual, a firm, a household or a classroom. We may think of Ω as containing infinitely many individuals from which individuals are sampled randomly. For the purpose of identification, we assume that we have an infinitely large sample such that we know the joint distributions of all observed variables, e.g. F_{YDXZ}, where X and Z are other vectors of covariates to be introduced later. We want to examine what could be estimated under minimal identifying assumptions.

Let D and Y be scalar random variables; the extension to vector-valued variables will be considered later. We thus want to analyze the relationship[3]

Y_i = φ(D_i, U_i),

where φ is an unknown measurable function and U_i is a vector of observed and unobserved characteristics.[4] The dimension of U_i is not yet restricted: it might be scalar or of higher dimension. We may often think of U as abilities or skills in different dimensions.[5] We are interested in learning the function φ, or some features of it, to be able to forecast what would happen if D were changed exogenously. In other words, we want to forecast the outcomes we would observe if D were changed for person i but U_i remained unchanged. Intuitively, it is helpful to define the potential outcomes:

Y_i^d = φ(d, U_i),

which is the outcome that individual i would realize if U_i were held fixed and D_i were forced externally to take the value d. The function φ(·,·) is not under the control of the individual, nor chosen or manipulated by it. It is considered to be a structural feature or relationship, e.g. an (educational) production function.

[3] We will later often add a second equation that determines D_i as a function of V_i and augment it with additional explanatory variables. The dependence between U and V then generates the endogeneity of D.
[4] Upper case letters represent random variables or random vectors, whereas lower case letters represent numbers or vectors.
[5] For the moment this does not matter, since U_i could simply be the passport number of each person. When we later impose conditions on φ or the density of U, this will become relevant.

In particular, it is assumed that φ(·,·) describes the relationship between D and Y not only for the observed values D_i and U_i, but that the same model describes what the outcome Y would have been had the value of D been set externally to any other value.

Notice also that this relationship is assumed to hold on the individual level given an unchanged environment, i.e. only variation in D for individual i is considered, not variation in D for other individuals, which may impact on Y_i or might even generate feedback cycles. Hence, our perspective is entirely microeconometric and we are examining what would have happened if D were changed for one individual. In contrast, a policy that changes D for every individual (or for a large number of individuals), e.g. a large campaign to increase education or computer literacy, might change the entire function φ, which might perhaps depend on the number of literate and illiterate persons in the economy. Such macro effects, displacement effects or general equilibrium effects are not considered here and receive more attention in the treatment evaluation literature.[6]

The difference

Y_i^{d''} − Y_i^{d'}

is sometimes called the individual treatment effect. To consider a simple example, let D ∈ {0,1}[7] indicate whether a person graduated from university or did not, and let Y denote wealth at the age of 50. Then, Y_i^1 − Y_i^0 is the effect of university graduation on wealth for person i, i.e. it is the wealth obtained if this person had attended university minus the wealth this same individual would have obtained without attending university.
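
To make the notation concrete, the following small Python simulation (with an invented structural function φ and made-up parameters, purely for illustration) generates potential outcomes and individual treatment effects. It also illustrates the fundamental evaluation problem: for each individual only one of the two potential outcomes is ever observed, so the individual treatment effect itself is not observable.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5

    u = rng.normal(size=n)                          # unobserved characteristics U_i
    phi = lambda d, u: 2.0 * d + u + 0.5 * d * u    # invented structural function

    y0 = phi(0, u)                  # potential outcome without treatment
    y1 = phi(1, u)                  # potential outcome with treatment
    ite = y1 - y0                   # individual effects: 2 + 0.5*u, heterogeneous

    d = rng.integers(0, 2, size=n)  # realized treatment status
    y = np.where(d == 1, y1, y0)    # only one potential outcome is ever observed
    print(np.c_[d, y0, y1, ite, y])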

Sometimes we want to explicitly consider situations where we are interested in the effects of two (or more) treatment variables. We could simply consider D to be vector-valued. Yet, we will sometimes find it useful to use two different symbols for the two treatment variables, e.g. D and X, for two reasons. The first reason is that we may sometimes have some treatment variables D that are endogenous (i.e. related to U), whereas other treatment variables X are considered as exogenous (i.e. not related to U). Therefore, we distinguish D from X since dealing with the endogeneity of D will require more attention.

[6] E.g. a large policy providing employment or wage subsidies for unemployed workers may lower the labour market chances of individuals not eligible for such subsidies (substitution or displacement effects) and may change the entire labour market: the cost of labour decreases for the firms, the disutility from unemployment decreases for the workers, and this impacts on efficiency wages, search behaviour and the bargaining power of unions.
[7] For binary D, individuals with D_i = 1 will often be called participants or treated, while individuals with D_i = 0 are referred to as nonparticipants or controls.


A second, and unrelated, reason is that we sometimes like to make it explicit that we are mainly interested in the impacts of changes in D while keeping X fixed by external intervention. For example, let D_i indicate whether individual i attended private or public secondary school, whereas X_i indicates whether the individual afterwards continued to university or did not. We might here be interested in that part of the effect of private versus public school on wealth that is not channelled via university attendance. Clearly, attending private or public school (D) is likely to have an effect on the likelihood of continuing to university (X), which in turn affects wealth. But there might also be a direct effect of D on wealth, even if university attendance is externally fixed at zero (or one), that we are interested in. We define the relationship

Y_i = φ(D_i, X_i, U_i)
Y_{i,x}^d = φ(d, x, U_i),

which is now a function of the variables D, X and U. (This function φ is essentially the same as before, with the only difference that D and X are written as separate arguments, whereas they were both subsumed in the vector D in the previous notation.) The partial effect (also called direct effect, i.e. the effect not channelled via university attendance) of public versus private school is

Y_{i,x}^{d''} − Y_{i,x}^{d'}.

In particular, Y_{i,0}^1 − Y_{i,0}^0 is the partial effect of private/public school when university attendance is fixed to zero, whereas Y_{i,1}^1 − Y_{i,1}^0 is the effect when university attendance is fixed to one, by external intervention. In contrast, the total effect is

Y_i^{d''} − Y_i^{d'}.

Hence, the reason for using two different symbols for D and X is to emphasize that we are interested in the effects of changes in D while keeping X fixed.[8] Sometimes such partial effects can be obtained simply via conditioning on X, but sometimes more sophisticated approaches are necessary, as will be discussed.

[8] Of course, the previous notation could still be used by considering D as a vector, i.e. D containing also the X covariates. But the notation with separate X and D is standard in the literature. It also emphasizes the asymmetry between D and X: we are often interested in changing D while keeping X fixed.

Another example of the distinction between partial and total effects are the Mincer earnings functions in labour economics, which are often used to estimate the returns to education. To determine the returns to education, many empirical studies regress log adult wages on labour market experience, years of schooling and a measure of ability (e.g. measured in early childhood), if available. The reasoning is that on-the-job experience, formal education and general ability are important determinants of wages.[9] We are not so much interested in the effects of ability on wages and merely include ability in the regression to deal with the selection problem discussed further below. The ceteris paribus analysis now examines (hypothetically) how wages would change if years of schooling (D) were changed while experience (X) remained fixed. Since labour market experience usually accumulates after the completion of education, schooling (D) may have two effects. First, it is likely to affect experience (X), which then affects wages (Y). One plausible channel is that schooling affects the probability and duration of unemployment or repeated unemployment, which reduces the accumulation of job experience. Schooling outcomes may also affect the time spent out of the labour force, which also does not add to job experience. (In some countries, e.g. the USA, it may increase the time spent in prison.) Hence, D affects Y indirectly via X. On the other hand, years of schooling is also likely to have a direct (positive) effect on wages. Thus, by including X in the regression, we "block" the indirect effect and measure only the direct effect of schooling. (As we discuss later, including X in the regression is not always a good strategy; sometimes it may introduce bias. In other words, sometimes we can identify only the total effect, but not the direct effect.)
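
The following sketch illustrates this mediation logic in a deliberately simple linear design (all coefficients invented): schooling D affects wages directly and also indirectly through experience X. A regression of wages on D alone recovers the total effect, while adding X blocks the indirect channel and recovers the direct effect. In this benign setup conditioning on X is harmless, which, as noted above, need not be the case in general.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    # Invented coefficients: a year of schooling raises log wages by 0.08 directly
    # and lowers accumulated experience, which raises log wages by 0.03 per year.
    d = rng.normal(12, 2, size=n)              # years of schooling (exogenous here)
    x = 20 - 0.7 * d + rng.normal(size=n)      # experience: indirect channel D -> X -> Y
    y = 0.08 * d + 0.03 * x + rng.normal(scale=0.1, size=n)   # log wage

    b_total = np.polyfit(d, y, 1)[0]           # total effect: 0.08 + 0.03*(-0.7) = 0.059
    coef = np.linalg.lstsq(np.c_[np.ones(n), d, x], y, rcond=None)[0]
    print(b_total, coef[1])                    # coef[1] is the direct effect, approx. 0.08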

In the next chapters we are mainly interested in nonparametric identification of φ or some features of it. Most econometric textbooks start by assuming a linear model, e.g. of the type

Y_i = α + β D_i + γ X_i + U_i,

and discuss identification and estimation of such models under certain restrictions, e.g.

E[U_i | D_i, X_i] = 0.   (1)

[9] In many datasets, unfortunately, true experience is not available and potential experience (defined as age minus years of schooling minus seven) is used instead.

However, since the assumption of linearity is almost always made for convenience and not based on sound economic theory, it is more insightful to discuss what can be identified under which restrictions without imposing such a linearity assumption. We do not want to restrict the shape of the function φ. It can be linear, it can be quadratic, it can be of any form; it need not even be continuous, differentiable or monotonic. Hence, whereas only the parameters α, β and γ, i.e. a finite number of coefficients, are needed in the linear model to describe all potential outcomes, in the more general approach the function φ could be of such complex form that it cannot be described by a finite set of parameters (hence the term nonparametric). In other words, an infinitely large number of coefficients would be needed to describe φ. We might be interested in estimating the entire function φ, or only at some locations, or we might want to learn about more "aggregated" properties of φ. For identification, we will then often have to impose certain restrictions, which are usually weaker than (1). Such restrictions often come in the form of differentiability and continuity restrictions on φ (often called smoothness restrictions), monotonicity assumptions and restrictions on the statistical relationships between the observed and unobserved variables. There is now a large and growing literature which attempts to find the weakest assumptions under which certain objects can be identified. The function φ is nonparametrically identified if we could determine it exactly from an infinitely large sample. Suppose that we have infinitely many observations, so that we know the joint distribution of Y, D and some other covariates X. The function φ, or some feature of it, is nonparametrically identified if no other function could have generated the same distribution. In other words, it is not identified if two different functions φ and φ̃ could generate the same joint cdf F_{YDXZ} of the observed variables. Often the function φ is identified only in some regions but not in others, e.g. outside of the support of the data.

This is most easily explained with respect to the average treatment effect (ATE) in the university graduation example mentioned above. We want to estimate the wealth effect of attending university for a person randomly drawn from the population,

E[Y^1 − Y^0],

where the expectation operator averages over all individuals i. This is the average causal effect of attending university. This effect can be interpreted in two ways: it is the effect on an individual randomly drawn from the population. Alternatively, it is the expected change in the average outcome if D were changed from 0 to 1 for every individual, provided that no general equilibrium effects occur.

However, if we were to compare the average wealth of those who attended university (D_i = 1) with those who did not (D_i = 0), we would estimate

E[Y^1 | D = 1] − E[Y^0 | D = 0].

This is usually different from the average treatment effect, in that the persons attending university and those who did not differ in observed and unobserved characteristics. This is most apparent when examining the effect on those who actually did attend university. This average treatment effect on the treated (ATET) is

E[Y^1 − Y^0 | D = 1],

which is often of particular interest in a policy evaluation context, where it may be more informative to know how the programme affected those who participated in it than how it might have affected those who could have participated but decided not to. (Of course, if the aim is to roll out the policy to the entire population, the average treatment effect on the non-treated (ATEN) would be more interesting.)

The focus on a binary treatment D ∈ {0,1} helps a lot to focus on the main issues of identification in what follows and permits a straightforward analysis of quantile treatment effects. As another example for this framework, consider formal and informal labour markets. In many developing countries (but also in developed countries) individuals work in the so-called informal sector, which consists of activities at firms without formal registration or without an employment contract. Roughly, one can distinguish four different activities: self-employed in the formal sector (i.e. owner of a registered firm), self-employed in the informal sector (owner of a business without formal registration, including many family firms and street vendors), worker in the formal sector, and worker in the informal sector (e.g. worker in a family firm). Firms in the formal sector pay taxes and have access to courts and other public services, but also have to adhere to certain legislation, e.g. worker protection laws and the provision of medical and retirement benefits. Informal firms do not have access to public services such as police and courts and have to purchase private protection or rely on networks. Similarly, employees in the formal sector have a legal work contract and are (at least in principle) covered by worker protection laws and usually benefit from medical benefits (e.g. accident insurance), retirement benefits, job dismissal rules etc.[10]

[10] Many (household) datasets do not contain very detailed questions on this, such that measuring informality may not be trivial. Henley, Arabsheibani, and Carneiro (2007) discuss different definitions and examine Brazilian data that permit a definition by (a) work contract status, (b) social security status and (c) formal activity. In many other datasets, however, only the size of the establishment is available, and all workers who work in firms with less than, say, 5 workers, excluding professionals, are classified as informal. This is, of course, a very crude classification and may entail a lot of measurement error.

The early literature on this duality sometimes associated the formal sector with the modern industrial sector and the informal sector with the technologically backward (rural) areas. The formal sector was considered to be superior, but its size limited. Those individuals migrating from rural to urban areas in search of formal sector jobs who did not find formal employment accept work in the urban informal sector until they find formal employment. Jobs in the formal sector are thus rationed, and employment in the informal sector is a second-best choice. An alternative explanation refers to efficiency wage theory. If workers' effort in the formal sector cannot be monitored perfectly, or only at considerable cost, some incentives are required to promote workers' effort. An implication of efficiency wage theory is that firms pay higher wages to promote effort, which leads to unemployment. The risk of becoming unemployed in case of shirking provides the incentive for workers to provide effort. Because most developing countries do not provide generous unemployment insurance schemes, and since the value of money is larger than the utility from leisure, these unemployed enter lower-productivity informal activities (where they are either self-employed or monitoring is less costly, e.g. family workers). The formal and informal sectors thus coexist, with higher wages and better working conditions in the formal sector. Everyone would thus prefer working in the formal sector.

An alternative view has emphasized that firms and workers may voluntarily prefer informality, particularly when taxes and social security contributions are high, licences for registration are expensive or difficult to obtain (bribes), public services are of poor quality and returns to firm size (i.e. economies of scale) are low. (It would be difficult to run a large firm unofficially.) Similarly, the medical and retirement benefits to formal employees (and worker protection) may often be of limited value, and in some countries access to these benefits already exists if a family member, e.g. a spouse, is in formal employment. In addition, official labour market restrictions (e.g. on (maximum) working hours, paid holidays, notice periods, severance pay, maternity leave) may not provide the flexibility that firms and workers wish. Under certain conditions, workers and firms may then voluntarily choose informal employment. Firms may also prefer informality as this may guard them against the development of strong unions or worker representation, e.g. regarding re-organizations, dismissals, or social plans for unemployed or precarious workers. Hence, costs (taxes, social security) and state regulations provide incentives for remaining informal.

Consider an individual i who seeks employment either in the formal or the informal sector. Let Y_i^1 be his wage in the formal sector and Y_i^0 his wage in the informal sector. (This wage outcome may also include non-wage benefits.) If individuals self-selected their sector in that they chose

D_i = 1(Y_i^1 > Y_i^0),

often referred to as the Roy (1951) model, it should be the case that

Y_i^1 − Y_i^0 > 0 if D_i = 1
Y_i^1 − Y_i^0 < 0 if D_i = 0.

Under the hypothesis that informality is only an involuntary choice because of the limited size of the formal sector, it should be that Y_i^1 − Y_i^0 > 0 for everyone. In this case, some individuals would like to join the formal sector but were not successful. Taking the size of the formal sector as given, an efficient allocation would require that

E[Y^1 − Y^0 | D = 1] > E[Y^1 − Y^0 | D = 0],

i.e. those who obtained a formal sector job should have a larger gain vis-à-vis informal employment than those who did not obtain a formal sector job (D = 0).
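
A quick Monte Carlo check of these implications may be useful (the wage distributions below are invented): under the Roy selection rule the conditional gains have the stated signs by construction.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000

    # Hypothetical potential log wages in the formal (y1) and informal (y0) sector
    y0 = rng.normal(1.0, 0.5, size=n)
    y1 = rng.normal(1.2, 0.8, size=n)

    d = (y1 > y0).astype(int)      # Roy selection: choose the sector with the higher wage

    gain = y1 - y0
    print(gain[d == 1].mean())     # E[Y1 - Y0 | D = 1] > 0 by construction
    print(gain[d == 0].mean())     # E[Y1 - Y0 | D = 0] < 0 by construction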

The naive estimator of the ATET is

E[Y^1 | D = 1] − E[Y^0 | D = 0],

which can be written as

(E[Y^1 | D = 1] − E[Y^0 | D = 1]) + (E[Y^0 | D = 1] − E[Y^0 | D = 0]).

The first term is the ATET, whereas the second term is the selection bias, i.e. the difference in non-graduation wealth between individuals who actually attended university and those who did not.
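
The decomposition can be verified numerically. In the following sketch (invented parameters), ability U shifts both selection and the non-treatment outcome, so the naive contrast overstates the ATET; the identity "naive = ATET + selection bias" holds exactly in any sample.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000

    u = rng.normal(size=n)                        # ability: drives selection and outcomes
    d = (u + rng.normal(size=n) > 0).astype(int)  # high-ability individuals select in more
    y0 = 1.0 + u                                  # non-treatment outcome depends on ability
    y1 = y0 + 0.5                                 # constant treatment effect of 0.5
    y = np.where(d == 1, y1, y0)                  # observed outcome

    naive = y[d == 1].mean() - y[d == 0].mean()
    atet = (y1 - y0)[d == 1].mean()                   # = 0.5 by construction
    selbias = y0[d == 1].mean() - y0[d == 0].mean()   # > 0 here
    print(naive, atet + selbias)   # identical: the decomposition is an exact identity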

[Figure: causal graph with nodes D, X, Y and the unobservables U, V]

Of central interest is not the association between earnings and schooling, but rather the change in earnings if schooling were changed exogenously. The fact that university graduates earn, on average, higher wages than non-graduates could simply reflect differences in ability. Hence, graduates and non-graduates have different observed and unobserved characteristics even before they enter or do not enter university. To identify the individual return to schooling, one would like to compare individuals with the same observed and unobserved characteristics but with different levels of schooling.[11]

1.0.1 Examples

Consider a few more examples.


Beegle, Dehejia, and Gatti (2006) analyze the effects of transitory income shocks on the extent of child labour, using household panel data from rural western Tanzania collected from 1991 to 1994. Their hypothesis is that transitory income shocks due to crops lost to pests and other calamities may induce families to use (at least temporarily) more child labour. This effect is expected to be mitigated by family wealth.

[11] Even if one identifies the individual return to schooling, the economic interpretation still depends on the causal channels one has in mind. This can be illustrated here by contrasting the human capital theory with the signalling theory of schooling. The human capital theory posits that schooling increases human capital, which increases wages. The signalling theory presumes that attainment of higher education (e.g. a degree) simply signals high unobserved ability to potential employers, even if the content of education is completely useless. In this latter case, from an individual perspective schooling may well have a high return. On the other hand, if years of schooling were increased for everyone, the overall return would be zero, since the ranking between individuals would not change. This is an example of a violation of SUTVA, where one's own potential outcomes depend on the treatment choices of other individuals. (This is also often referred to as peer effects or externalities.) Here, individual-level regressions would identify only the private marginal return, but not the social return.

Other examples are the effects of the tax system on labour supply, the public-private sector wage differential, the returns to schooling or the effects of class size on students' outcomes. Distinguishing the true causal effect from differences in unobservables is the main obstacle to nonparametric identification of the function φ or of features of it such as treatment effects. In the next chapters, various assumptions that permit identification are discussed in detail.[12] For several research questions, random experiments offer the most convincing solution.

1.1 Experiments, randomized trials

References:
- Kremer Econometrica (2005)
- Heckman Smith (J Economic Perspectives)

To control for differences in observed and unobserved characteristics, controlled experiments can be very helpful. Randomized assignment of D ensures that D is not correlated with observed and unobserved characteristics. Experiments are still rather rare but are now becoming widely used, particularly in developing countries. Recent examples of deliberate experiments are PROGRESA in Mexico and Familias en Acción in Colombia and similar conditional cash transfer experiments in other Latin American countries. Other examples are the STAR class-size experiment in Tennessee (USA), the Job Training Partnership Act (JTPA) in the USA, the de-worming experiment (Miguel, Kremer) and the random provision of school inputs in Kenya (Glewwe, Kremer et al.).

The "Student teacher achievement ratio" (STAR) experiment was designed to obtain credible evidence on the hotly debated issue of whether smaller classes support student learning and lead to better student outcomes (and whether any gains would justify the costs of reducing class sizes). The results from observational studies were hotly disputed. Class size is endogenous in that classes may be smaller in richer areas or where parents are very interested in securing a good education for their children. On the other hand, more disruptive children and children with learning difficulties are often placed in smaller classes. In the STAR experiment (from 1985 to 1989), each participating school assigned children to one of three types of classrooms: small classes had a targeted enrollment of 13-17; regular classes had a targeted enrollment of 22-25; a third class type also targeted a regular enrollment of 22-25 but added a full-time teacher's aide in the room.

[12] While this has always been of concern in econometrics, in recent years much more emphasis has been laid on trying to verify these assumptions and on finding weaker assumptions for identification. Earlier approaches often relied on incredible assumptions, e.g. the bivariate normality of the Heckman sample selection model.

Experiments ensure that treated and controls have the same distribution of observed and unobserved characteristics, such that

E[Y^1 | D = 1] = E[Y^1 | D = 0] = E[Y^1].

Random programme assignment ensures that any differences between the treatment groups are due to pure chance and not systematic. This ensures, first, that the unobservables are uncorrelated with D, i.e. identically distributed in both groups, which eliminates "selection on unobservables". Second, it also guarantees that the distribution of the observed characteristics is identical in both groups, and in particular that they have the same support. This means that for any values of concomitant characteristics X that are observed in one group, we could also find observations with the same characteristics in the other group, given an infinitely large number of observations. This is the common support condition, which will be discussed in the next chapter. This implies that ATE = ATET and that the naive estimator

E[Y | D = 1] − E[Y | D = 0]

is consistent for these. Random experiments, if properly conducted, provide the most convincing identification strategy, as all the other identification strategies discussed later rest on untestable assumptions that are hardly ever unambiguously accepted.
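
The effect of randomization is easy to see by modifying the selection mechanism of the earlier sketch: if D is assigned by a coin flip, independently of U, the naive contrast estimates the ATE (and ATE = ATET). A minimal sketch with invented parameters:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1_000_000

    u = rng.normal(size=n)
    y0 = 1.0 + u
    y1 = y0 + 0.5 + 0.3 * u            # heterogeneous effects with ATE = 0.5

    d = rng.integers(0, 2, size=n)     # randomized assignment, independent of u
    y = np.where(d == 1, y1, y0)

    naive = y[d == 1].mean() - y[d == 0].mean()
    print(naive, (y1 - y0).mean())     # both roughly 0.5: under randomization ATE = ATET
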
Consider a few other recent examples. The JTPA experiment in the USA provided intensive training to unemployed or poor youths and adults, consisting of on-the-job training, job search assistance and classroom training. Voluntary applicants to this programme were randomly accepted (= randomized in) or rejected (= randomized out).[13]

Experiments have recently been conducted in several developing countries for evaluating the impacts of health and education programmes. PROGRESA in Mexico is a large public programme designed to increase school participation. Cash grants are distributed to women, conditional on children's school attendance and preventative health measures (nutrition supplementation, health care visits, and participation in health education programs). When the programme was launched in 1998, due to budgetary limits it was introduced only in several pilot regions, which were randomly selected (randomized phasing-in). The unit of randomization was the community, and data were collected not only for these randomly selected communities but also in several randomly selected non-participating communities.[14] Participation in the programme was designed as a two-step procedure. In the first step, a number of localities with a high degree of marginality were selected, of which about half were randomized into the programme. In the second step, only poor households living in pilot localities were considered as eligible for the programme, on the basis of a region-specific poverty index at the household level. Data were collected at baseline (i.e. before the introduction of the programme) and in subsequent waves afterwards. The positive evaluation results convinced policy makers to roll out the programme nationwide.[15]

[13] This programme has been intensively evaluated, e.g. by Heckman, Ichimura, and Todd (1997) and Heckman, Ichimura, Smith, and Todd (1998).
[14] In fact, half of the communities participated in the programme and the other half did not.
[15] Part of the credibility of the evaluation results also likely rested on the fact that the evaluation was done by an independent foreign research institute and that the data were made accessible to numerous researchers, thereby reducing the risk of consciously biased evaluation reports.

Several different experimental interventions intended to increase school participation were conducted in a specific district in Kenya, thereby allowing a comparison of the cost-effectiveness of different interventions in the same environment. (If some types of programmes are implemented in Kenya and others in Nigeria, their relative performance would be difficult to compare due to differences in the environments.) The impact of providing free school meals in Kenyan pre-schools on attendance and test scores was examined by Vermeersch (2002) in twenty-five participating and twenty-five comparison schools. Kremer and coauthors conducted several randomized interventions in Kenya, where school uniforms, textbooks and classroom construction were subsidized in several poorly performing schools. Miguel and Kremer (2005) evaluate a program of twice-yearly school-based mass treatment with inexpensive de-worming drugs in Kenya, where the prevalence of intestinal worms among children is very high. Seventy-five schools were phased into the program in random order. Health and school participation improved not only at program schools, but also at nearby schools, due to reduced disease transmission. Since these interventions were conducted in a similar environment in western Kenya, the costs and effects of these different interventions can be compared.


Another problem with many of these experiments is their small sample size. E.g. in the Kenyan experiments, only a small number of schools were in the treatment and control groups. Apart from the statistical power of tests, this also raises concerns about external validity if the schools are not nationally representative.[16]

Organizing and conducting an experimental trial can be expensive and may meet a lot of resistance. A variety of problems are discussed, for example, in Heckman and Smith (1995) with respect to random assignment to the JTPA training programme in the USA. Since participation in this programme is voluntary, randomization can only be implemented with respect to the individuals who applied for the programme, who are then randomized in or randomized out. However, these might be different from the population of interest. Particularly if randomization covers only parts of the population, the experimental results may not be generalizable to the broader population. In other words, although internal validity is often plausible, external validity may be limited if the selected units are not representative of the population at large.

Even if a policy is mandatory and all individuals can be randomly assigned to the treatments, full compliance is often difficult to achieve if participants must exercise some effort during participation and may refuse their cooperation. Heckman and Smith (1995) discuss different sources that may invalidate experimental evaluation results. Randomization bias occurs if the prospect of randomized allocation alters the pool of potential participants, because individuals may be reluctant to apply at all or may reduce preparatory activities such as complementary training for fear of being randomized out (threat of service denial). Substitution bias occurs if members of the control group (the randomized-out non-participants) obtain some treatment or participate in similar programmes, e.g. identical or similar training obtained from private providers.[17] Drop-out bias occurs if individuals assigned to a particular programme do not (or only partly) participate in it. These two biases are the result of two-sided noncompliance, as discussed later. Heckman and Smith (1995) also mention that randomized experiments can be expensive, often face political obstacles and may distort the operation of an on-going policy.

[16] Deworming was found to be extraordinarily cost-effective at only $3.50 per additional year of schooling (Miguel and Kremer, 2005).
[17] In this case, the experimental evaluation measures only the incremental value of the policy relative to the programmes available otherwise.

The pilot-study character of an experiment may also change the behaviour of the participants, who may put in additional effort to show that the pilot study works or does not work (so-called Hawthorne effects). If randomization proceeds not at the individual but at a higher level, endogenous sample selection problems may occur. For example, if programme schools receive additional resources, this might attract more parents to send their children to these schools,[18] such that the actual allocation is no longer random. Nevertheless, randomization could still be used to estimate "intention to treat" (ITT) effects, or could be used as an instrumental variable for actual assignment.

[18] E.g. parents may withdraw their children from the control schools and send them to the programme schools.
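
A small sketch of the ITT logic under imperfect compliance (all numbers invented): eligibility Z is randomized, take-up D depends on Z but also on unobservables, and comparing Y by eligibility status recovers the ITT effect, which in this constant-effect design equals the effect of D times the difference in take-up rates.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 1_000_000

    u = rng.normal(size=n)
    z = rng.integers(0, 2, size=n)                 # randomized eligibility
    # Take-up is voluntary: eligibility raises the take-up probability from 0.1
    # to 0.7, and higher-u individuals comply slightly more (endogenous take-up).
    p = np.where(z == 1, 0.7, 0.1) + 0.05 * np.tanh(u)
    d = (rng.random(n) < p).astype(int)

    y = 1.0 + 0.5 * d + u + rng.normal(size=n)     # true effect of D is 0.5

    itt = y[z == 1].mean() - y[z == 0].mean()
    print(itt)   # approx. 0.5 * (E[D|Z=1] - E[D|Z=0]) = 0.5 * 0.6 = 0.3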

Although full randomization is often not possible, in particular when attempting to force individuals into a programme, there might sometimes be random elements in the organization of publicly provided or mandated programmes which may partly overcome the preceding problems. Randomized phasing-in, as discussed above, only temporarily denies participation in the programme. In some situations it might also be possible to let all units participate but treat only different subsamples within each unit. Consider, for example, the provision of additional school books. In some schools, additional books could be provided to the third grade only, and in other schools to the fifth grade only. Hence, all schools participate to the same degree in the programme (which avoids feelings of being deprived of resources relative to others), but the fifth graders from the first half of schools can be used as a control group for the second half of schools, and vice versa for the third graders.

Marginal randomization is sometimes used when the number of available places in a programme or a school is limited, such that those admitted are randomly drawn from the applicants. Consider applications to a particular public school or university, which might (be forced to) choose randomly from the applicants if oversubscribed. In such a situation, those randomized out and randomized in do not differ from each other in their distributions of observable and unobservable characteristics. This marginal group, however, may represent only a very tiny fraction of the entire population of interest, and the estimated effects may not generalize to the population at large.

There might also be randomization with respect to entitlement or non-entitlement to a particular programme, which can often deliver a credible instrument for an instrumental variables strategy discussed below. For example, during the Vietnam war, young American men were drafted to the army on the basis of their month and day of birth, where a certain number of birth dates had been randomly determined to be draft eligible; see Angrist (1998).

Hence, the indicator of whether one was born on a draft-eligible day satisfies the above requirements and would deliver the ITT effect. But the main research interest is in the effect of participating in the army on later outcomes. Participation in the army is not randomized, but as we will see later, the lottery of birth dates can function as an instrument.

Hence, randomized assignment can be very helpful for credible evaluation. But not all questions can be answered by experiments (e.g. the effects of constitutions or institutions), and experimental data are often not available. Experimental data alone will also usually not allow one to determine the entire function φ(d, x, u), for which additional assumptions will be required.[19]

In addition, in practice randomized experiments hardly ever turn out to be perfect. E.g. in the STAR experiment, children who skipped a grade or who repeated a class left the experiment. Some pupils entered the school during the trial. Some re-assignment happened during the trial, etc. This implies that one needs to know all these details when evaluating the trial and estimating treatment effects. One should know not only the experimental protocol but also the many (smaller and larger) problems that occurred during the experimental phase. Other problems may appear when collecting follow-up data. E.g. an educational intervention may have taken place in kindergarten and we would like to estimate its effects several years later. Attrition and non-response in follow-up surveys may lead to selected samples. E.g. it may be harder to trace and survey individuals who have moved. (In many health interventions, mortality may be the main reason behind attrition.) Non-experimental methods are needed to deal with this.

As a related literature, experimental economics examines the impact of often hypothetical interventions to study the behaviour of individuals in certain well-defined situations. This typically proceeds by inviting a number of students to play public good games, with compensation depending on how much they earn during these games. Most of these experiments take place in a computer laboratory under highly stylized conditions. Field experiments similarly examine behaviour in real settings, i.e. outside the laboratory. A recent example is Falk (2007), who examines gift exchange with respect to voluntary donations, where people offer and repay gifts without external obligations. Falk (2007) examined gift exchange in a natural setting. A charitable organization sent about 10000 solicitation letters to potential donors. Some of these letters contained a small gift, some letters contained a larger gift, and the remaining letters contained no gift at all. The assignment to these three treatments was random. It was found that the relative frequency of donations increased when a small gift was included, and increased even further when a large gift was included.

[19] Even if a proper experiment is conducted, it might still occur by chance that the treatment groups differ substantially in their characteristics, particularly if the sample sizes are small. Although the differences in sample means provide unbiased estimates of average treatment effects, adjusting for differences in the covariates, as discussed below, can reduce the variance of the estimates (Rubin 1974).

1.2 Nonexperimental data and parameters of interest

As we have seen in the previous subsection, experiments can be very helpful for credible identification. However, very often we have only non-experimental data. In addition, we might be interested in identifying objects other than the average treatment effect, and not all of these objects can be obtained in practice from experimental data.[20] Some of these different objects of interest are discussed below, and strategies to estimate them from non-experimental data are examined in the following chapters.

Let us reconsider the non-separable model

Y_i = φ(D_i, U_i)   with   Y_i^d = φ(d, U_i)

or the model with the partial effect:

Y_i = φ(D_i, X_i, U_i)   with   Y_{i,x}^d = φ(d, x, U_i).

This model is non-separable in the error term, which means that the effect of D on Y can vary among individuals. The return to one additional year of schooling for an individual with characteristics x, u_i and d_i = 8 is

φ(9, x, u) − φ(8, x, u),

which may vary with x and u. If D is continuous (i.e. its cdf is continuous), the marginal effect is

∇_d φ(d, x, u),

where ∇_d refers to the partial derivative with respect to the first argument.

[20] Nevertheless, experimental data can also help in identifying more complex structural models, since the orthogonality of the random assignment to all unobservables and observables is very useful.

Notice the difference to a model with additively separable errors,

Y_i = φ(D_i, X_i) + U_i,

which implies that the return to one additional year of schooling is

φ(9, x) − φ(8, x)

and thus does not vary with u, which is often an unrealistic assumption. Non-separable models thus permit heterogeneity in the responses among observably identical persons, and the responses to changes in D will therefore have probability distributions. This non-separability makes these models more realistic and delineates more clearly the nature of the identifying assumptions to be used. On the other hand, it also makes identification more difficult.[21] Additively separable models will not be discussed further due to time constraints.[22]
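
The contrast between the two model classes can be made concrete with a small simulation (invented functional forms): under a non-separable φ the return to the ninth year of schooling has a whole distribution across individuals, whereas under additive separability it is a single number for all individuals with the same x.

    import numpy as np

    rng = np.random.default_rng(6)
    u = rng.normal(size=1_000_000)
    x = 10.0

    # Non-separable model: the return to the ninth year of schooling varies with u
    phi = lambda d, x, u: 0.05 * d + 0.002 * d * x + 0.03 * d * u + u
    returns = phi(9, x, u) - phi(8, x, u)      # = 0.05 + 0.002*x + 0.03*u
    print(returns.mean(), returns.std())       # a whole distribution of returns

    # Additively separable model: phi_sep(d, x) + u, so the return is the same for all
    phi_sep = lambda d, x: 0.05 * d + 0.002 * d * x
    print(phi_sep(9, x) - phi_sep(8, x))       # a single number, no heterogeneity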

Heterogeneity in the responses might itself be of policy interest and it might therefore often
be interesting to try to identify the entire function '(d; x; u). In other applications, one would
like to obtain a parameter that does not depend on Ui , since Ui is unobserved and usually not
identi…ed. The average treatment e¤ect is such a parameter where the unobserved variables
have been averaged out: The ATE is
Z
E[Y 1 Y 0] = ('(1; U ) '(0; U )) dFU ,

and the ATET is


Z
1 0
E[Y Y jD = 1] = ('(1; U ) '(0; U )) dFU jD=1 .

In the university graduation example, the difference between ATET and ATE is often referred to as the sorting gain. The decision whether to attend university is likely to depend on some kind of individual expectation about the wage gains from attending university. This leads to a sorting of the population: Those who gain most from university are more likely to attend it, whereas those who have little to gain from it will most likely abstain. This could lead to an ATET that is much higher than the ATE. Hence, the average wage gain for students is
²¹ The nonparametric identification of non-separable models is a rather recent field of research and is still very active today.
²² In the familiar linear model $Y_i = \alpha + \beta D_i + X_i \gamma + U_i$ a common treatment effect $\beta$ is assumed. This prohibits not only effect heterogeneity conditional on $X$ but also effect heterogeneity in general.

higher in the sorted subpopulation than in a world without sorting. This difference between ATET and ATE could be due to differences in observed as well as unobserved characteristics. Hence, the observed difference in outcomes between students and non-students can be decomposed as
$$E[Y|D=1] - E[Y|D=0] = \underbrace{ATE}_{\text{average return to schooling}} + \underbrace{ATET - ATE}_{\text{sorting gain}} + \underbrace{E[Y^0|D=1] - E[Y^0|D=0]}_{\text{selection bias}}.$$
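To make this decomposition concrete, the following small simulation (an illustrative Python sketch with arbitrarily chosen parameters, not taken from the text) generates a population with heterogeneous gains, lets individuals sort on their gains plus a confounded baseline, and verifies that the observed mean difference equals ATE plus sorting gain plus selection bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Potential outcomes with heterogeneous gains (invented parameters)
y0 = rng.normal(0, 1, n)                 # no-university outcome
gain = 1 + rng.normal(0, 1, n)           # individual return Y1 - Y0
y1 = y0 + gain

# Selection: enrolment depends on the individual gain and on y0 (confounding)
d = (0.8 * gain + 0.4 * y0 + rng.normal(0, 1, n) > 1).astype(int)
y = np.where(d == 1, y1, y0)

ate = (y1 - y0).mean()
atet = (y1 - y0)[d == 1].mean()
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()
observed_diff = y[d == 1].mean() - y[d == 0].mean()

print(ate, atet - ate, selection_bias)                     # the three components
print(observed_diff, ate + (atet - ate) + selection_bias)  # identical up to rounding
```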

The expected potential outcome is
$$E[Y^d] = \int \varphi(d, U) \, dF_U = \int \varphi(d, X, U) \, dF_{XU},$$
which integrates over the marginal distribution of $U$, i.e. the distribution of $U$ in the population. The expected potential outcome can be interpreted in two ways: It represents the outcome that would be observed if all individuals had $d$ years of schooling. (Again, assuming no general equilibrium effects.) Or, it represents the outcome for a randomly chosen individual for whom years of schooling is set to $d$. The latter interpretation is usually still valid even with general equilibrium effects. The expected potential outcome differs from the expected value $E[Y|D=d] = \int \varphi(d, x, u) \, f_{XU|D}(x, u|d) \, du \, dx$, because $U$ may be related to $D$.
This expected potential outcome is also referred to as the average structural function (ASF, Blundell and Powell (2003)), in particular if we are interested in partial effects. If we are interested in partial effects, we often want to fix some other treatment variable $X$ at some value $x$. The expected potential outcome then is
$$ASF(d, x) = E[Y_x^d] = \int \varphi(d, x, U) \, dF_U.$$
Notice again the fundamental difference between fixing $D$ and $X$ externally and conditioning on (i.e. observing) $D$ and $X$. In contrast to $E[Y_x^d]$, the expected outcome conditional on $X$ is
$$E[Y^d | X = x] = \int \varphi(d, U) \, dF_{U|X=x} = \int \varphi(d, x, U) \, dF_{U|X=x}.$$
If $U$ and $X$ are uncorrelated, which is often assumed, $E[Y_x^d]$ and $E[Y^d|X=x]$ are identical, but otherwise not.

Having defined the ASF, to obtain total policy outcomes for a policy which assigns $d$ and $x$ according to a weighting function $f(d, x)$, one could integrate out:
$$\int ASF(d, x) \, f(d, x) \, dd \, dx.$$

In addition to the average hypothetical outcome, we might also be interested in the distribution of these hypothetical outcomes. The following equations are defined with respect to the two treatment variables $D$ and $X$ (i.e. partial effects), but we could consider $X$ to be the empty set to obtain total effects. The distributional structural function is the distribution function of $\varphi(d, x, U)$:
$$DSF(d, x, a) \equiv \Pr\left[\varphi(d, x, U) \le a\right] = \int 1\left[\varphi(d, x, u) \le a\right] f_U(u) \, du.$$
The quantile structural function (QSF) is the inverse of the DSF. It is the $\tau$-th quantile of the outcome for fixed $d$ and $x$:
$$QSF(d, x, \tau) = Q_\tau\left(\varphi(d, x, U)\right),$$
where the quantile is over the marginal distribution of $U$. The symbol $Q_\tau(A)$ represents the $\tau$-th quantile of $A$.²³ Again it can be interpreted in two ways, although the second interpretation is less straightforward here. It may be most intuitive to think of it as the $\tau$-th quantile of $Y$ if $D$ and $X$ are fixed externally for every individual.
Notice again that this is different from the observable quantile
$$Q_\tau[Y | D = d, X = x] = Q_\tau\left[\varphi(D, X, U) | D = d, X = x\right],$$
because the latter is the quantile with respect to the conditional distribution $f_{U|DX}$ instead of the marginal $f_U$. To gain some intuition, suppose that $U$ is a scalar and $\varphi$ is strictly increasing in its third argument $u$. Then
$$Q_\tau\left(\varphi(d, x, U)\right) = \varphi(d, x, Q_\tau^U),$$
where $Q_\tau^U = Q_\tau(U)$ represents the quantile in the 'fortune' distribution in the population. Hence, $QSF(d, x, 0.9)$ is the outcome for different values of $d$ and $x$ for an individual at the 90th percentile of the fortune distribution. On the other hand, the observed quantile is
$$Q_\tau[Y | D = d, X = x] = \varphi(d, x, Q_\tau^{U|D=d,X=x}),$$
where $Q_\tau^{U|D=d,X=x} = Q_\tau[U | D = d, X = x]$ is the quantile in the 'fortune' distribution among those who chose $d$ years of schooling and have characteristics $x$.
²³ $Q_\tau^A \equiv Q_\tau(A) \equiv \inf\{q : F_A(q) \ge \tau\}$.

Since the QSF describes the whole distribution, the ASF can be recovered from the QSF as
$$ASF(d, x) = E[Y_x^d] = \int_0^1 QSF(d, x, \tau) \, d\tau.$$
Hence, if the QSF is identified at all quantiles $\tau$, so is the ASF, but not vice versa.
The DSF and the QSF contain the same information, and if the DSF is continuous, $QSF(d, x, \tau) = DSF^{-1}(d, x, \tau)$, where the inverse is taken in the last argument. Analytically, it is often more convenient to work with the DSF, whereas the QSF is more suited to economic interpretation.
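The difference between the QSF and the observed conditional quantile can be checked numerically. The sketch below (illustrative Python code with an invented $\varphi$ and selection rule, taking $X$ to be the empty set for simplicity) uses a scalar $U$ and a $\varphi$ strictly increasing in $u$, with $D$ selected on $U$, so that $Q_\tau(\varphi(1, U))$ and $Q_\tau[Y|D=1]$ differ:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
tau = 0.9

u = rng.normal(0, 1, n)                        # scalar 'fortune' term
phi = lambda d, u: d + np.exp(u)               # strictly increasing in u

d = (u + rng.normal(0, 1, n) > 0).astype(int)  # selection on U
y = phi(d, u)

qsf = np.quantile(phi(1, u), tau)              # QSF(1, tau): quantile over marginal U
cond = np.quantile(y[d == 1], tau)             # observed quantile among D = 1
check = phi(1, np.quantile(u, tau))            # phi(1, Q_tau of U): equals the QSF

print(qsf, check)  # essentially equal
print(cond)        # larger, since U | D = 1 is shifted upwards
```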

The difference between the ASF at two different values of $D$ represents the impact on earnings for a random person if years of schooling were set to, e.g., 10 versus 8:
$$ASF(10, x) - ASF(8, x) = \int \left\{\varphi(10, x, u) - \varphi(8, x, u)\right\} f_U(u) \, du,$$
which is a (partial) average treatment effect.

These different objects compare outcomes when fixing $D$ at a specific value $d$. Alternatively, one could examine the impact of increasing $D$ by one from the observed values. This thus takes account of differences in the observed choices $D_i$. The impact of increasing years of schooling by one for everyone, given the observed levels, is
$$\int \left\{\varphi(d+1, x, u) - \varphi(d, x, u)\right\} f_{DXU}(d, x, u) \, dd \, dx \, du,$$
or, for continuous $D$, the average derivative:
$$\int \frac{\partial \varphi(d, x, u)}{\partial d} \, f_{DXU}(d, x, u) \, dd \, dx \, du = E\left[\frac{\partial \varphi(D, X, U)}{\partial D}\right].$$
The average derivative has two potential advantages: First, it may often be closer to answering a policy question, e.g. by how much the outcome would change if $D$ were increased for each individual relative to the currently observed level. Second, identification may be easier since the objects to be identified are closer to their sample distribution. It may be hard to infer the outcome for a person with low ability in schooling if he were sent to university; but it can be much easier to infer the effect of increasing his years of schooling by a little. This problem will show up later in the discussion of common support.

So far we have discussed which types of objects we would like to estimate. The next step is to examine under which conditions they can be identified. That is, suppose we knew the distribution function $F_{Y,D,X,Z}$ (e.g. through an infinite amount of data); would this be sufficient to identify the above parameters?
Without further assumptions it is not, since the unobserved variables can generate any statistical association between $Y$ and $D$, even if the true impact of $D$ on $Y$ is zero everywhere. Hence, data alone are not sufficient to identify average treatment effects. Conceptual causal models are required, which entail identifying assumptions about the process through which the individuals were assigned to the treatments, or about stability of the outcomes over time, see Pearl (2000). The corresponding minimal identifying assumptions cannot be tested with observational data, and their plausibility must be assessed through prior knowledge of institutional details, the allocation process and behavioural theory.

2 Selection on observables, causal graphs, matching estimators

In this section, identification with non-experimental data under "selection on observables" is discussed. Essentially, this is a nonparametric extension of the familiar OLS identification and estimation approach. We might be interested in estimating average potential outcomes
$$E[Y^d]$$
or differences in outcomes
$$E[Y^1 - Y^0],$$
where the outcome could be wages or wealth, for example. Often, the endogeneity of $D$, e.g. due to self-selection, implies that
$$E[Y^d | D = d] \ne E[Y^d],$$
such that simple estimation of $E[Y^d | D = d]$ will not identify the mean potential outcome. However, if we were to observe all covariates that affected $D$ and the potential outcome, then conditional on these covariates $X$, the variables $D$ and $Y^d$ are independent:
$$Y^d \perp\!\!\!\perp D \mid X \qquad \forall d \in \mathrm{Supp}(D) \qquad (2)$$
which implies mean independence
$$E[Y^d | D, X] = E[Y^d | X] \qquad (3)$$


as a slightly weaker assumption. This assumption is easiest understood in the treatment evaluation context. Let $D \in \{0, 1\}$ indicate whether an individual continues to university after high-school graduation or does not. Suppose that the decision to enroll in university depends on only two factors: the examination results when finishing high school and the weather on that day. Without controlling for the high-school examination results, the conditional independence assumption (2) is unlikely to be satisfied: Individuals with better grades are more likely to enroll in university and probably also have higher outcomes $Y^0$ and $Y^1$. But conditional on the grades, the CIA (2) is satisfied if the weather itself has no effect on wages later in life. Hence, for individuals with the same grades, the decision to enroll no longer depends on factors that are also related to the potential outcomes. Conditional on grades, there is no selection bias and we can simply compare the outcomes of those deciding to enroll in university and of those who do not. This is the often so-called "selection on observables" condition for identification.²⁴ Thus we could compare university graduates and non-graduates with the same values of $X$ and then take the average with respect to $X$ to obtain:
$$E[Y^1 - Y^0] = \int E[Y^1 - Y^0 \mid X] \, dF_X \qquad (4)$$
$$= \int E[Y^1 \mid X, D = 1] \, dF_X - \int E[Y^0 \mid X, D = 0] \, dF_X.$$
(The basic concept of adjusting for the different distributions of covariates has long been well known, see e.g. Fechner (1860) or even earlier references. The advances in computing power over the last decades, however, have made many new nonparametric estimators feasible to estimate (4) without having to resort to parametric estimators such as OLS.)
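For discrete $X$, formula (4) can be computed by simple stratification. The following minimal Python sketch (the data-generating process is invented purely for illustration) estimates $E[Y^1 - Y^0]$ by averaging within-stratum contrasts over the marginal distribution of $X$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Invented DGP: X confounds D and Y; the true ATE is 1 for every x
x = rng.integers(0, 3, n)                        # discrete confounder
d = (rng.random(n) < 0.2 + 0.3 * x).astype(int)  # P(D=1|X) increases in X
y = 1.0 * d + 2.0 * x + rng.normal(0, 1, n)

naive = y[d == 1].mean() - y[d == 0].mean()      # confounded comparison

ate = 0.0
for v in np.unique(x):
    mask = x == v
    contrast = y[mask & (d == 1)].mean() - y[mask & (d == 0)].mean()
    ate += contrast * mask.mean()                # weight by Pr(X = v)

print(naive)  # biased upwards
print(ate)    # close to 1
```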
We already see that a second condition will be required for identification: There must exist both university attendees and non-attendees for every value of $X$. If for some particular values of $X$ all individuals enroll in university, or none do, we simply cannot compare attendees and non-attendees, as there are none in one of the two groups. This is the "common support condition" further discussed below. Hence, eventually we will require two conditions: 1) that all factors causing $D$ and $Y$ are observed; 2) that conditional on these factors there is still sufficient randomness in the choice of $D$.²⁵ The first condition is essentially non-testable. Although
²⁴ Which essentially means that there is no selection on unobservables that also affect the outcome.
²⁵ In other words, this second assumption requires the existence of an instrumental variable, but we do not need to observe it.

it can obviously be tested whether some variables do or do not affect $D$ or $Y$, it can never be ascertained by statistical means that there is no omitted (unobserved) variable which, consciously or unconsciously, affected the process determining $D$ or the choice of $D$.²⁶
The second condition of common support, however, can be tested, and the object to be estimated can also be re-defined in the case of failure of common support, as described below. (For non-binary $D$ these two conditions are needed analogously.)
(For non-binary D these two conditions are needed analogously.)
In the familiar linear model
$$Y_i = \alpha + \beta D_i + X_i \gamma + U_i$$
the conditional independence assumption (2) translates into assuming
$$E[U_i | D_i, X_i] = E[U_i | X_i].$$
Notice the difference to the assumption usually invoked for OLS estimation:
$$E[U_i | D_i, X_i] = 0.$$
For estimating the linear model by OLS we need that $U$ is uncorrelated with $D$ and with $X$ (or at least with those $X$ that $D$ is correlated with). For nonparametric identification and estimation, this assumption is not needed: $U$ can be allowed to be correlated with $X$, i.e. $X$ could be endogenous.

2.1 Causal graphs

References: Pearl (2000)


²⁶ Generally speaking, identification by the conditional independence assumption (2) is easier to achieve the more bureaucratic, rule-based and deterministic the programme selection process is, provided the common support condition below still applies. For example, in his analysis of the effects of voluntary participation in the military on civilian earnings, Angrist (1998) takes advantage of the fact that the military is known to screen applicants to the armed forces on the basis of particular characteristics, primarily age, schooling and test scores. Hence these characteristics are the principal factors guiding the acceptance decision, and it appears reasonable to assume that among applicants with the same observed characteristics, those who finally enter the military and those who do not are not systematically different with respect to some outcome variable $Y$ measured later in life. A similar reasoning applies to the effects of schooling, if it is known that applicants to a school or university are screened on the basis of certain characteristics, but that conditional on these characteristics selection is on a first-come/first-served basis.

Assumptions of the type (2), involving independence statements with respect to potential outcomes, may be somewhat unfamiliar. As similar statements will appear again later with respect to instrumental variables, it may be worthwhile to gain a better intuition for them. This is particularly relevant since these identifying statements usually represent the link between economic theory and the empirical analysis, and thus distinguish econometrics from statistics. Economic theory often delivers only statements about which variables may affect each other or not. The other ingredients of the empirical analysis, e.g. the choice of the parametric specification of the model, the choice of the estimator etc., are driven by convenience and the peculiarities of the data set at hand, e.g. the sample size and the variation in the observed variables.

The assumption (2) states that all variables that affected $D$ and the potential outcomes are observed. Whether this assumption holds in a given application depends largely on information about the assignment process and the data that are or could be observed. If no control variables are observed at all, this condition is almost certainly invalid unless $D$ has been randomly assigned. On the other hand, if the entire information set on which the assignment mechanism for $D$ is based were observed, then assumption (2) would hold.

For gaining a better intuition about conditional independence assumptions, graphical models encoding the causal assumptions can be helpful. Graphical models have only recently become popular in statistics, but they have the advantage that the causal structure can easily be displayed and that the distinction between causation and correlation becomes more evident, as it is sometimes blurred, e.g. in the conventional assumption of zero correlation between the error term and the covariates in linear models. Essentially, the structural-equations, potential-outcomes and causal-graphs approaches are equivalent representations of the same underlying conceptual model.²⁷ Presenting and using all three approaches may thus enhance intuition about the basic assumptions needed for identification and may also help for further reading of original articles, which often favour different approaches. The following discussion contains a few excerpts from the book of Pearl (2000, Chapter 3).²⁸

²⁷ Or can be made equivalent, as discussed in Pearl (2000).
²⁸ The book of Pearl (2000) attempts to provide a general approach to the identification of causal effects. It provides some interesting discussion on aspects of causality but is largely limited to acyclic graphs, which is a serious limitation as we may often want to permit e.g. $D$ and $X$ to affect each other in both directions. It also contains very little about IV methods.

Consider the relationship between some variables $Y$, $D$ and $X$, and suppose, for convenience, that all variables are discrete with a finite number of mass points (as in Pearl (2000)). The relationship can be described by a probability distribution $\Pr_{YDX}(y, d, x)$. To abstract here from any common support problems, we assume that $\Pr_{YDX}(y, d, x) > 0$ for all combinations of $y \in \mathrm{Supp}(Y)$, $d \in \mathrm{Supp}(D)$ and $x \in \mathrm{Supp}(X)$. Hence, we suppose that all combinations of $y, d$ and $x$ can be observed with positive probability. The relationship between these variables can be presented in a graph:

[Figure: causal graph over $Y$, $D$ and $X$ with unobservables $V_1$, $V_2$ and $U$]

where $U$, $V_1$ and $V_2$ are unobserved variables, which are determined outside of the model. The graph consists of a set of variables (vertices) and a set of (directed or bi-directed) arcs. The set of variables may include observed as well as unobserved variables. The directed arcs represent causal relationships. The dashed bidirected arcs represent relationships due to unobserved common causes. In this graph, a priori restrictions can be easily encoded, and simple rules can then be applied to determine whether the effect of one variable on another can be determined. For example, let $X$ be high-school examination results, $D$ an indicator of enrolling in university and $Y$ wealth at age 50. Then the following graph

[Figure: causal graph with arrows $X \to D$, $X \to Y$, $D \to Y$ and unobservables $V_1 \to D$, $V_2 \to X$, $U \to Y$]

contains the restrictions that $Y$ does not affect $D$, $Y$ does not affect $X$ and $D$ does not affect $X$. It further contains the restrictions that $U$, $V_1$ and $V_2$ are jointly independent. I.e. it is the missing arcs which encode our a priori information. These causal assumptions are to be distinguished from statistical association. Whereas causal statements are asymmetric, statistical association is always symmetric: If $X$ is statistically dependent on $Y$, then $Y$ is also statistically dependent on $X$.²⁹ A causal structure is therefore richer, because $X$ can causally affect $Y$ without $X$ being causally affected by $Y$. The graph above also incorporates a triangular structure or a causal chain. Such a triangular structure is not always appropriate, e.g. in a model of a market:

[Figure: cyclic graph of a market with $Q$ and $P$ affecting each other and unobservables $V \to Q$, $U \to P$]
where $Q$ is demand, $P$ is the price, and both equations depend on each other.
We are interested in estimating the causal effect of $D$ on $Y$, i.e. which outcomes would be observed if we were to set $D$ externally. In the observed world, $D$ is determined by its antecedents in the causal model, e.g. by $V_1$ and $X$, and thus indirectly by the exogenous variables $V_1$ and $V_2$. We now consider an external intervention that sets $D$ to a specific value $d$ and want to identify the distribution of $Y^d$.³⁰ This essentially implies that the graph is stripped of all arrows pointing to $D$.
A graph where all edges are directed (i.e. which does not contain bidirected dashed arcs) and which contains no cycles is called a directed acyclic graph (DAG). Although the requirement of acyclicity rules out many interesting cases, several results for DAGs are useful to form intuition. Bidirected dashed arcs can be eliminated by introducing additional unobserved variables in the graph. In other words, the left graph in the figure below can be expressed equivalently by the right graph.
²⁹ It is worthwhile noting that the statistical literature has not yet developed a precise, uniformly accepted language for causality, in contrast to concepts such as correlation. Even Heckman, Ichimura, Smith, and Todd (1998, p. 1021 bottom) made this mistake. They wanted to assume that the $X$ are exogenous in the sense that $D$ does not cause $X$ and therefore assumed that $F(X|Y^0, Y^1, D) = F(X|Y^0, Y^1)$. However, $X \perp\!\!\!\perp D \mid Y^0, Y^1$ is the same as $D \perp\!\!\!\perp X \mid Y^0, Y^1$ and therefore does not entail any structural assumption on whether $X$ causes $D$ or $D$ causes $X$. Clearly, they did not want to assume $D \perp\!\!\!\perp X \mid Y^0, Y^1$ since they consider $X$ as potential confounders of $D$.
³⁰ In the notation of Pearl this is stated as $Y$ given $D$ is set to $d$.

[Figure: left graph with a bidirected dashed arc between $D$ and $X$; right graph expressing it equivalently through additional unobservables $U_1$, $U_2$, $U_3$ pointing into $D$ and $X$]

The graph can now be used to better understand conditional independence assumptions. For this, a few definitions are helpful. A path between two variables is a sequence of consecutive edges (of any directionality).

Definition 1.2.3 of Pearl (2000): A path is d-separated (or blocked) by a set of nodes $Z$ iff
i) the path contains a chain $i \to m \to j$ or a fork $i \leftarrow m \to j$ such that the middle node is in $Z$, or
ii) the path contains an inverted fork $i \to m \leftarrow j$ such that the middle node is not in $Z$, and such that no descendant of $m$ is in $Z$.

Definition (d-separation): A set $Z$ is said to d-separate $X$ from $Y$ if $Z$ blocks every path from a node in $X$ to a node in $Y$.

Theorem 1.2.4 of Pearl (2000): If sets $X$ and $Y$ are d-separated by $Z$ in a DAG, then $X$ is independent of $Y$ conditional on $Z$ (in every distribution compatible with the graph).

The intuition behind the first condition of Definition 1.2.3 is simple: $i$ and $j$ are marginally dependent, but once we compare $i$ and $j$ only when $m$ is observed to take a particular value, $i$ and $j$ will be independent. The second condition is less obvious at first sight. Here, the variables $i$ and $j$ are marginally independent and become dependent only after conditioning on $m$: conditioning "unblocks" the path. Consider throwing two coins and let the variable $m$ denote whether both coins show the same side. The outcome of each coin is independent of the other, but once we condition on $m$ they become dependent. As another example, consider admission to a certain graduate school (or university) to be based on either good grades or high talent in sport. Then we will find a negative correlation between these two characteristics in the school even if these two characteristics are independent in the population. Conditioning on $m$ could also happen inadvertently through the data collection process. In the last example, if we obtain our dataset from the school register, then we have implicitly conditioned on the event that all observations in the dataset have been admitted to the school.
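This collider effect is easy to reproduce numerically. The following short Python sketch (with an invented admission rule, purely for illustration) draws independent grades and sport talent and shows that they become negatively correlated once the sample is restricted to admitted students:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

grades = rng.normal(0, 1, n)   # independent in the population
sport = rng.normal(0, 1, n)

# Admission if either grades or sport talent is high (invented rule)
admitted = (grades > 1) | (sport > 1)

print(np.corrcoef(grades, sport)[0, 1])                      # ~ 0
print(np.corrcoef(grades[admitted], sport[admitted])[0, 1])  # clearly negative
```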

Consider the following examples:

[Figure: example graphs (a) and (b); (a) a DAG over $X$, $Z_1$, $Z_2$, $Z_3$, $Y$ with colliders at $Z_1$ and $Z_3$, $Z_1$ being a descendant of $Z_3$; (b) a cyclic graph over $X$, $Z_1$, $Z_2$, $Y$]

In example (a), the two paths between $X$ and $Y$ are blocked when conditioning on the empty set. Conditioning on $Z_1$ would unblock both paths. In fact, it unblocks the "colliders" at both $Z_1$ and $Z_3$: it unblocks the collider at $Z_1$ since this is the collision node, and it unblocks the collider at $Z_3$ because $Z_1$ is a descendant of $Z_3$, the collision node. (Conditioning on $Z_2$ or $Z_3$ would not unblock the paths between $X$ and $Y$.)
In example (b), which is not a DAG as it contains a cycle, $X$ and $Y$ cannot be d-separated by any set of nodes, including the empty set. Without conditioning, the path $X - Z_1 - Z_2 - Y$ is not blocked. This path can be blocked by conditioning on $Z_1$ or $Z_2$, but in both cases the path $X \to Z_2 \leftarrow Y$ becomes unblocked: conditioning on $Z_2$ unblocks this path immediately, and conditioning on $Z_1$ unblocks this path because it is a descendant of $Z_2$.

These examples help us to understand the meaning of
$$Y \perp\!\!\!\perp D \mid X.$$
Note that this is very different from $Y^d \perp\!\!\!\perp D \mid X$. To understand the meaning of the assumption
$$Y^d \perp\!\!\!\perp D \mid X,$$
notice that this assumption is satisfied, by Theorem 3.4.1 of Pearl (2000), in a DAG if
$$(Y \perp\!\!\!\perp D \mid X)_{G_{\underline{D}}}, \qquad (5)$$
where $G_{\underline{D}}$ is the subgraph obtained by deleting all arrows emerging from $D$.³¹ ³² Hence, after deleting all arrows emanating from $D$, the variables $Y$ and $D$ should be independent conditional on $X$.
³¹ Suppose $D$ is binary; the above assumption implies $\Pr(Y^d = y | D = d, X) = \Pr(Y^d = y | X)$. This assumption actually suffices for most results in the literature, even if $D$ is non-binary, since $\Pr(Y^d = y | D = d, X)$ is equivalent to $\Pr(Y = y | D = d, X)$ by the consistency rule. Hence, the assumption is $\Pr(Y = y | D = d, X) = \Pr(Y = y | D$ is set to $d, X)$. The second rule of Theorem 3.4.1 for DAGs of Pearl (2000) says that $\Pr(Y = y | D$ is set to $d, X) = \Pr(Y = y | D = d, X)$ if $(Y \perp\!\!\!\perp D \mid X)_{G_{\underline{D}}}$.
³² Analogously, $G_{\overline{D}}$ is the graph obtained by deleting all arrows pointing to $D$.

Hence, we can express independence relationships regarding potential outcomes by using subgraphs. From this one should also note that, for a specific data generating process, neither of the two statements $Y \perp\!\!\!\perp D \mid X$ and $Y^d \perp\!\!\!\perp D \mid X$ strictly implies the other. If the latter is true, the former could still be wrong, e.g. due to selection bias. If the former is true, the latter is most likely to be true; but for certain data generating processes it could happen that, despite $Y^d$ not being independent of $D$ (given $X$), we still observe $Y \perp\!\!\!\perp D \mid X$. This would be the case if a non-zero treatment effect and non-zero selection bias exactly cancel each other. Hence, generally,³³
$$Y \perp\!\!\!\perp D \mid X \quad \not\Leftrightarrow \quad Y^d \perp\!\!\!\perp D \mid X.$$

With this basic intuition developed, we can already guess which variables need to be adjusted for to identify a causal effect of $D$ on $Y$. The easiest way to think about this is to suppose that the true effect is zero and to ascertain whether the impacts of the unobserved variables could generate a dependence between $D$ and $Y$. Consider in the following graphs whether conditioning on $X$ is necessary or not.

[Figure: eight example graphs (a)-(h) over $D$, $Y$ and $X$, discussed in turn below]

In all these 8 graphs, one unobservable pointing to each of $Y$, $D$ and $X$ is omitted. In example (a), conditioning on $X$ is necessary, since a spurious relationship between $D$ and $Y$ could be generated otherwise. In example (b), one should not condition on $X$, as this would unblock the path via $X$. In example (c), one should not condition on $X$, as this would "block" a part of the total effect of $D$ on $Y$. The graph in example (d) is cyclic, and the tools developed so far do not apply. Generally, we cannot tell the effect of $Y$ on $D$ and the effect of $D$ on $Y$ apart without further assumptions, such as instrumental variables.
In the next four graphs, there is no confounding and conditioning on $X$ is not necessary, and can even hurt, in particular in the case of a lack of common support, as to be discussed below.
³³ One should also remember that joint independence $(A, B) \perp\!\!\!\perp C$ implies the marginal independences $A \perp\!\!\!\perp C$ and $B \perp\!\!\!\perp C$, but not vice versa.

In graph (e), $X$ is like an instrumental variable in that it affects $Y$ only indirectly via $D$. In principle, controlling for $X$ does not invalidate identification, but it can be harmful in two respects. First, it can reduce the region of common support in that for some values of $x$ the probability $\Pr(D = 1 | X = x)$ might be zero or one. Second, it can worsen finite sample properties (for nonparametric and also for parametric estimators) in that it takes some of the exogenous variation away. For identification, we need to find individuals identical in all their characteristics who differ only in that some have $D = 1$ and others have $D = 0$. This requires some variation introduced by factors that do not have a direct impact on $Y$. If we reduce this variation by conditioning on $X$, the finite sample properties will usually worsen. In a more complex scenario, it may also reduce the amount of 'good' variation and lead to an increase in bias. Consider a simple example where we introduce some unobservable $U$ in graph (e) which affects $D$ and $Y$ simultaneously. Since we cannot condition on this unobservable, the estimates of the causal effects will be biased. The variation in $D$ that is introduced through $U$ is 'bad' variation in that it biases the estimates. The amount of bias depends on the variance of $U$ and the correlation between its impact on $D$ and on $Y$. The variation that is introduced via $X$ is good in the sense that it does not lead to a bias of the estimated causal effect. If the 'good' variation is much larger than the 'bad' variation, the bias will be small. On the other hand, if we condition on $X$, i.e. compare only individuals with the same value of $X$, the good variation is taken away and only the 'bad' variation remains, which will lead to a larger bias. The difference between these two approaches will be particularly pronounced if $X$ is a strong instrument in that it introduces a lot of variation in $D$.

In graph (f), the variable $X$ is already causally affected by $D$, and one would therefore not want to control for it. In graph (g), the variable $X$ has no effect on the selection process determining $D$, hence there is no need to control for it. (Researchers often use a probit regression of $D$ on $X$ variables to purge variables that have no effect on $D$.) On the other hand, including $X$ may improve efficiency in that it reduces the semiparametric efficiency bound. (Whether this effect is large in practice seems not to have been much explored so far.)

In graph (h), one should not condition on $X$, again reflecting the general rule that one should not condition on variables already influenced by the variable $D$. In this example (h), if $Y$ and $X$ are correlated, the estimated effect would usually tend to be biased towards zero. Consider a simple example: $D$ is Bernoulli with $p = 0.5$. The outcome is $Y = D + U$ and further $X = Y + V$. Suppose further that $U$ and $V$ are normal, mutually independent with variances $\sigma_U^2$ and $\sigma_V^2$, and independent of $D$, which implies that the support of $Y$ and $X$ is the entire real line. (This is different from the DAGs in Pearl, where he considers only discrete variables with a finite number of mass points, whereas $Y$ and $X$ are continuous here.) We thus obtain $E[Y|D=1] - E[Y|D=0] = 1$. However, if we condition on $X = x$, we obtain after some calculations that
$$E[Y|X=x, D=1] - E[Y|X=x, D=0] = \frac{\sigma_V^2}{\sigma_U^2 + \sigma_V^2} < 1,$$
e.g. 0.5 for standard normal $U$ and $V$, approaching 0 as the variance of $V$ shrinks. This holds for every value of $x$, showing that the estimates (in absolute value) are downward biased.
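A quick simulation confirms the attenuation. The Python sketch below (parameters chosen arbitrarily) exploits the fact that with normal errors $E[Y|D,X]$ is linear in $D$ and $X$, so the coefficient on $D$ in an OLS regression of $Y$ on $D$ and $X$ equals the conditional contrast above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

d = (rng.random(n) < 0.5).astype(float)
u = rng.normal(0, 1, n)
v = rng.normal(0, 1, n)
y = d + u            # true effect of D on Y is 1
x = y + v            # X is a descendant of Y

print(y[d == 1].mean() - y[d == 0].mean())   # ~ 1, unconditional contrast

# OLS of Y on (1, D, X): the D coefficient is the X-conditional contrast
Z = np.column_stack([np.ones(n), d, x])
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
print(beta[1])       # ~ 0.5 = var(V) / (var(U) + var(V)), biased towards zero
```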

Hence, conditioning on third variables is not always appropriate, even if they are highly correlated with $D$. This becomes particularly evident in the following graph, where the variable $Z$ is neither causally affected by nor affecting $D$ or $Y$; yet it can still be highly correlated with both variables. The effect of $D$ on $Y$ is well identified if one does not condition on $Z$. Yet, conditioning on $Z$ would unblock the path via $V$ and $U$ and would thus confound the effect of $D$ on $Y$.

[Figure: graph with $D \to Y$ and unobservables $V$ and $U$, where $V$ points to $D$ and $Z$, and $U$ points to $Z$ and $Y$, making $Z$ a collider]

According to this discussion and the relationship (5), the effect of $D$ on $Y$ is identified by adjusting for a set of variables $X$ such that $X$ does not contain any descendant of $D$ and $X$ blocks every path between $D$ and $Y$ that contains an arrow into $D$. This corresponds to Theorem 3.3.2 of Pearl (2000), which he denotes as "back-door" adjustment. In the following example, the set $\{X_3, X_4\}$ meets this back-door criterion, as does the set $\{X_4, X_5\}$. The set $\{X_4\}$, however, does not, because conditioning on it unblocks the path $(D, X_3, X_1, X_4, X_2, X_5, Y)$. Neither does $\{X_1, X_2\}$.

[Figure: back-door example graph with arrows $X_1 \to X_3$, $X_1 \to X_4$, $X_2 \to X_4$, $X_2 \to X_5$, $X_3 \to D$, $X_4 \to D$, $X_4 \to Y$, $X_5 \to Y$ and $D \to X_6 \to Y$]

The formula for computing the causal effect is equivalent to (4) given above: First the (expected) outcome conditional on $D = d$ and $X$ is taken, and this is then averaged over the distribution of $X$ (i.e. not the distribution of $X|D=d$).

In terms of the structural function notation of Chapter 1, we can also show identification in an analogous way. Consider the following graph and note that it implies $U \perp\!\!\!\perp D \mid X$.

[Figure: causal graph with $X \to D$, $X \to Y$, $D \to Y$ and unobservables $V_1 \to D$, $V_2 \to X$, $U \to Y$]

We thus obtain
$$E[Y^d] = \int \varphi(d, U) \, dF_U = \int\!\!\int \varphi(d, U) \, dF_{U|X} \, dF_X = \int\!\!\int \varphi(d, U) \, dF_{U|X,D=d} \, dF_X = \int E[Y | X, D = d] \, dF_X.$$

2.1.1 Front door identification

If the variables $D$ and $Y$ are connected by a dashed arc, as in the next graph, the effect of $D$ on $Y$ can usually not be identified. If there is a mediating variable, as in example (b), the effect is identifiable, essentially by first identifying the effect of $D$ on $X$ and subsequently the effect of $X$ on $Y$. Here, adjusting for an endogenous variable, i.e. one affected by $D$, is permitted, but a more complex calculation is required. This is the "front-door" adjustment of Pearl (2000, Theorem 3.3.4). This also works if there is an effect of the unobservables on the mediating variable, which can be blocked, e.g. by conditioning on $Q$ in example (c). Hence, the usual rule that one should not control for a variable on the causal pathway has some exceptions. One should note, however, that a different formula for identification has to be used, see below.

[Figure: (a) $D$ and $Y$ connected only by a dashed arc; (b) $D \to X \to Y$ with a dashed arc between $D$ and $Y$; (c) as (b), with the unobserved influence on the mediator $X$ blocked by conditioning on $Q$]

Pearl (2000, Section 3.3.3) gives an example concerning the effect of smoking on lung cancer. The observed positive correlation between smoking and lung cancer has been attributed to genetic differences by advocates of the tobacco industry. According to this theory, some individuals are more likely to enjoy smoking or become addicted to nicotine, and these same individuals might also be more susceptible to developing cancer, but not because of smoking. If we were to find a mediating variable $X$ not caused by these genetic differences, the previously described strategy could be used. The amount of tar deposited in a person's lungs would be such a variable, if we could assume that 1) smoking has no effect on the production of lung cancer except as mediated through tar deposits (i.e. the effect of smoking on cancer is channeled entirely via the mediating variable), 2) the unobserved genotype has no direct effect on the accumulation of tar, and 3) there is no other factor that affects the accumulation of tar deposits and has an influence on smoking or cancer. This identification approach shows that it can sometimes be appropriate to adjust for a variable that is causally affected by $D$.
How can we calculate the treatment effect or the potential outcomes with such a mediating variable? Consider the following graph without further covariates.

[Figure: front-door graph $D \to Z \to Y$ with a dashed arc between $D$ and $Y$]

The graph implies that
$$Z^d \perp\!\!\!\perp U \quad \text{and} \quad Z^d \perp\!\!\!\perp D.$$

The first statement can also be written as $F_{Z^d, U} = F_{Z^d} F_U$. The second statement implies that $F_{Z^d} = F_{Z^d | D=d} = F_{Z|D=d}$. We make use of these implications when expressing the potential outcomes in terms of observable quantities. The potential outcome depends on $Z$ and $U$ in that
$$Y_i^d = \varphi(Z_i^d, U_i),$$
where $Z_i^d$ is the potential outcome of $Z$. We now obtain
$$E[Y^d] = \int\!\!\int \varphi(Z^d, U) \, dF_{Z^d, U} = \int\!\!\int \varphi(z, U) \, dF_{Z|D=d}(z) \, dF_U$$
$$= \int \left( \int \varphi(z, u) f_U(u) \, du \right) f_{Z|D=d}(z) \, dz,$$

and it follows that
$$E[Y^d] = \int f_{Z|D=d}(z) \, E\big[E[Y | D, Z = z]\big] \, dz \qquad (6)$$
where we made use of
$$E\big[E[Y | D, Z = z]\big] = \int E[Y | D, Z = z] \, dF_D = \int\!\!\int \varphi(Z, U) \, dF_{U|D, Z=z} \, dF_D$$
$$= \int\!\!\int \varphi(z, U) \, dF_{U|D, Z=z} \, dF_D = \int\!\!\int \varphi(z, U) \, dF_{U|D} \, dF_D = \int \varphi(z, U) \, dF_U,$$
because $U \perp\!\!\!\perp Z \mid D$.
Hence, formula (6) shows that we can express the expected potential outcome in terms of observable random variables. If $D$ and $Z$ are discrete, (6) can be written as
$$E[Y^d] = \sum_z \Pr(Z = z | D = d) \left( \sum_{d'} E[Y | D = d', Z = z] \Pr(D = d') \right). \qquad (7)$$

To gain some intuition for this formula, note that we can separately identify the effect of $D$ on $Z$ and the effect of $Z$ on $Y$. First, consider the effect of $Z$ on $Y$ and note that the above graph implies $Y^z \perp\!\!\!\perp Z \mid D$. Therefore
$$E[Y^z] = \int E[Y | D, Z = z] \, dF_D = E\big[E[Y | D, Z = z]\big]. \qquad (8)$$
To obtain the effect of $D$ on $Z$, we note that there is no confounding, i.e. $Z^d \perp\!\!\!\perp D$. In other words, the treatment effect of $D$ on the distribution of $Z$ is directly reflected in the conditional distribution function $F_{Z|D}$. Combining $F_{Z|D}$ with (8) gives formula (6).
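Formula (7) is easy to evaluate on discrete data. The following Python sketch (with an invented data-generating process in which an unobserved $U$ confounds $D$ and $Y$ while $Z$ fully mediates the effect) computes the front-door estimate of $E[Y^1] - E[Y^0]$ and compares it with the naive contrast:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

# Invented DGP: U confounds D and Y; D affects Y only through Z
u = rng.random(n) < 0.5
d = (rng.random(n) < 0.2 + 0.6 * u).astype(int)
z = (rng.random(n) < 0.1 + 0.5 * d).astype(int)   # mediator, unconfounded given D
y = 1.0 * z + 2.0 * u + rng.normal(0, 1, n)

def front_door(dval):
    # E[Y^d] = sum_z Pr(Z=z|D=d) * sum_d' E[Y|D=d',Z=z] Pr(D=d'),  formula (7)
    total = 0.0
    for zval in (0, 1):
        pz = np.mean(z[d == dval] == zval)
        inner = sum(y[(d == dp) & (z == zval)].mean() * np.mean(d == dp)
                    for dp in (0, 1))
        total += pz * inner
    return total

print(y[d == 1].mean() - y[d == 0].mean())   # naive, confounded by U
print(front_door(1) - front_door(0))         # ~ 0.5, the true effect via Z
```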

These two identification approaches (controlling for confounding variables, adjusting via a mediating variable) can be combined in several ways to analyze identification in more complex graphs, which we will not examine further here. (See e.g. Pearl (2000, Section 3.4).)

In later chapters, we will also often consider instrumental variable identification, where causal graphs of the following kind apply.

[Figure: instrumental variable graph with $Z \to D \to Y$ and unobservables $V \to D$ and $U \to Y$, with $V$ and $U$ connected by a dashed arc]

2.1.2 Direct and indirect effects, partial effects

The analysis becomes more difficult in the presence of cycles or feedback. Consider e.g. example (a) below, where $D$ affects $X$ and $X$ affects $D$. This could be because of a direct feedback or a simultaneous determination of both variables. Or it could be that for some (random) subpopulation $D$ affects $X$ and for the other $X$ affects $D$. Finally, it could also be that the causal influence is in fact unidirectional, but that we simply do not know which direction it is and thus do not want to restrict this relationship. In this situation, not conditioning on $X$ would lead to confounding. On the other hand, conditioning on $X$ would block the back-door path but would also block the part of the effect of $D$ on $Y$ which is mediated through $X$. By conditioning on $X$ we might thus at most be able to estimate the direct effect of $D$ on $Y$, i.e. the total effect without the part which is channeled through $X$. Thus, conditioning on $X$ could permit us to estimate partial effects or direct effects, which might sometimes be of interest in their own right, as discussed below. However, conditioning on $X$ does not always guarantee that we can identify the direct effect, as example (b) demonstrates: In this situation conditioning on $X$ unblocks the path between $D$ and $Y$ via the dashed arc. Hence, even if the true direct effect of $D$ on $Y$ is zero, we could find a nonzero association between $D$ and $Y$ after conditioning on $X$. This discussion thus demonstrates that attempting to identify direct effects via conditioning can fail.

[Figure: graphs (a), (b), (c) over $D$, $Y$, $X$: (a) $D$ and $X$ affecting each other, both pointing to $Y$; (b) as (a) with an additional dashed arc; (c) $D \to X \to Y$ and $D \to Y$, with a dashed arc between $X$ and $Y$]

It might sometimes not be possible to identify a direct effect while the total effect is well identified, as e.g. in example (c). The total effect of $D$ on $Y$ is identified, but the direct effect of $D$ on $Y$ is not, because conditioning on $X$ would unblock the path via the dashed arc.³⁴ ³⁵ An example from labour economics: Frequently one is interested in the effects of some school inputs ($D$), e.g. computer training in school, on productivity (= wages) in adult life ($Y$). In the typical Mincer-type regressions, one regresses wages ($Y$) on a constant, experience ($X$) and school inputs ($D$). Here, experience is included to obtain only the direct effect of $D$ on $Y$, by blocking the indirect effect $D$ may have on experience ($X$). This is another example where including an additional variable in the regression may cause problems. Suppose the computer training programme was introduced in some randomly selected pilot schools. Clearly, due to the randomization, the total effect of $D$ on $Y$ is identified. However, introducing experience ($X$) in the regression is likely to lead to a situation as in graph (c) above. The amount of labour market experience depends on the time in unemployment or out of the labour force, which is almost certainly correlated with some unobserved productivity characteristics that also affect $Y$. Hence, the direct effect is not identified. Here, introducing $X$ destroys the advantages that could be reaped from the experiment. (This example also illustrates the difference between the
³⁴ Another heuristic way to see this is that we could identify the effect of $D$ on $X$ but not the effect of $X$ on $Y$; hence we could never know how much of the total effect is channeled via $X$.
³⁵ Consider a classic example where a birth-control pill is suspected to increase the risk of thrombosis, but at the same time reduces the rate of pregnancies, which are known to encourage thrombosis. We are not interested in the total effect of the pill on thrombosis but rather in its direct impact. Suppose the pill is introduced in a random drug-placebo trial, and suppose further that there is an unobserved variable affecting the likelihood of pregnancy as well as thrombosis. This corresponds to the graph in example (c). The total effect of the pill is immediately identified since it is a random trial. On the other hand, measuring the effect separately among pregnant women and non-pregnant women could lead to spurious associations due to the unobserved confounding factor. Therefore, to measure the direct effect, alternative approaches are required, e.g. starting the randomized trial only after women became pregnant, or among women who prevented pregnancy by means other than the drug. (Pearl gives another illustrative example on gender discrimination in university admission.)

production-function philosophy and the treatment-effect philosophy in current econometrics. The production function approach attempts to include all relevant determinants of the output, such that after having included all these factors the error term $U$ should be purely random noise. The treatment effect philosophy is interested in the effects of only one (or perhaps two) variables $D$ and chooses the other regressors mostly according to knowledge about the process that determined $D$.)

Identifying direct or partial effects requires the identification of the distribution of $Y_x^d$, where $d$ and $x$ are both set externally. This can be different from $Y^d$, where $d$ is set externally and $X$ is observed to be $x$, as seen in the examples above.
For calculations in a DAG, the rules of Theorem 3.4.1 of Pearl are helpful. (Note that these rules apply only to acyclic graphs, i.e. without any cycles or feedback, unlike examples (a) and (b) above.) I restate the rules of Pearl using his notation. Let $X$, $Z$, $W$ be three sets of nodes, where each of these sets may be empty. Let $\Pr(Y | \hat X = x)$ denote the distribution of $Y$ if $X$ is externally set to the value $x$. Similarly, $\Pr(Y | \hat X = x, \hat Z = z)$ represents the distribution of $Y$ if $X$ and $Z$ are both externally set. In contrast, $\Pr(Y | \hat X = x, Z = z)$ is the distribution if $X$ is externally set and subsequently $Z = z$ is observed. (In our notation, this refers to $Z^x$, i.e. the potential outcome of $Z$ if $X$ is fixed.)
Theorem 3.4.1
Rule 1 (Insertion/deletion of observations):
$$\Pr(Y | \hat X, Z, W) = \Pr(Y | \hat X, W) \quad \text{if} \quad (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}}}$$
Rule 2 (Action/observation exchange):
$$\Pr(Y | \hat X, \hat Z, W) = \Pr(Y | \hat X, Z, W) \quad \text{if} \quad (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}\underline{Z}}}$$
Rule 3 (Insertion/deletion of actions):
$$\Pr(Y | \hat X, \hat Z, W) = \Pr(Y | \hat X, W) \quad \text{if} \quad (Y \perp\!\!\!\perp Z \mid X, W)_{G_{\overline{X}, \overline{Z(W)}}}$$
where $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$. (Chapter 3.4 of Pearl discusses various examples of how to apply these rules.)

Returning to our discussion of the identification of partial effects and using our notation of potential outcomes, we can apply Rule 2 twice. First we obtain
$$\Pr(Y_x^d) = \Pr(Y^d | X^d) \quad \text{if} \quad (Y \perp\!\!\!\perp X \mid D)_{G_{\overline{D}\underline{X}}},$$

where the condition would be satisfied, e.g., in example (c) if there were no dashed arc. Hence, in this situation the partial effect which is not channeled via the variables $X$ can be obtained separately for each value of $X$, provided that $Y^d | X^d$ is identified.
The latter can be identified under various conditions, e.g. by applying Rule 2 as
$$\Pr(Y^d | X^d) = \Pr(Y | D, X) \quad \text{if} \quad (Y \perp\!\!\!\perp D \mid X)_{G_{\underline{D}}}.$$
Hence, in this situation, conditioning identifies potential outcomes, and in conventional regression jargon $X$ can be added as an additional regressor in a (linear) regression to obtain that part of the effect of $D$ which is not channeled via $X$. Note again that Theorem 3.4.1 above applies only to DAGs. Sometimes the rules may also hold in cyclical graphs, but not always.

This was a brief introduction to nonparametric identification via controlling for covariates, with some general rules on which variables one wants to control for and which not. It also exemplified that the set of control variables used e.g. in linear regressions may often contain variables for two purposes: controlling for confounding variables to eliminate selection bias, and controlling for certain variables to obtain partial effects instead of total effects. It is not always appropriate to include all variables that $D$ is correlated with as control variables. Nor can partial effects always be obtained by simply conditioning on the respective variables.

2.2 Matching estimators

References:
- Heckman, Ichimura, Todd (1997, 1998, RES)
- Imbens (2004, RESTAT)
- Special issue on Matching estimators (RESTAT, 2004)
After having discussed some issues about variable selection, in this section the estimation of average potential outcomes and average treatment effects by adjusting for confounders is discussed. We will assume throughout the conditional mean independence assumption
$$E[Y^d | D, X] = E[Y^d | X],$$
i.e. we assume that all confounding variables are observed. Traditionally, OLS has been the most favoured estimator in such "selection on observables" situations. In recent years, an alternative estimation strategy has become very popular: matching estimators. Although one might still often prefer OLS in applications due to convenience and long-standing experience, it is at least worthwhile to understand the theory behind these matching estimators. One advantage often brought forward is that matching estimators are entirely nonparametric and thus do not rely on the assumption of linearity. In particular, this permits treatment effect heterogeneity of any form. This advantage, however, is of real relevance only when the sample size is sufficiently large. In small samples, and often with a large number of control variates, the specification of the local approximation plane is still of high importance. Furthermore, higher-dimensional nonparametric regression can be computationally demanding (but computing speed is increasing every year). Another advantage of the theory of matching estimators is that it highlights the importance of the support (of the distribution) of the observed variables (and thus the importance of extrapolation). Matching estimators do not solve a support problem, but they clearly highlight it. In addition, matching estimators permit endogenous control variables: they require only that $E[U | D, X] = E[U | X]$, while permitting this conditional mean to differ from zero. OLS regression requires in addition that $E[U | X] = 0$, i.e. all control variables must also be exogenous.

By the mean independence assumption, we can identify average potential outcomes as
$$E[Y^d] = \int E[Y | X, D = d] \, dF_X$$
and the average treatment effect on the treated as
$$E[Y^1 - Y^0 | D = 1] = E[Y | D = 1] - \int E[Y | X, D = 0] \, dF_{X|D=1}.$$
For the ATET we only need the assumption
$$E[Y^0 | X, D = 0] = E[Y^0 | X, D = 1],$$
whereas it is not required that
$$E[Y^1 | X, D = 0] = E[Y^1 | X, D = 1].$$
Hence, it suffices to assume that the no-treatment outcome is unconfounded given $X$, while the choice of $D$ may depend on the individual gain $Y_i^1 - Y_i^0$. This is not permitted for the ATE. This difference can be a relevant relaxation in some applications.

A matching estimator of $E[Y^d]$ is
$$\frac{1}{n} \sum_{i=1}^n \hat m_d(X_i),$$
where $\hat m_d(x)$ is a nonparametric regression estimator of $m_d(x) = E[Y | X = x, D = d]$. Before proceeding further with this estimator, we need to briefly introduce the concept of nonparametric regression, leaving a more detailed discussion for later.
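As a concrete illustration, the sketch below (illustrative Python code, using a simple one-nearest-neighbour regression estimate for $\hat m_d$ rather than the kernel estimators introduced next) implements this plug-in estimator of $E[Y^1]$ and $E[Y^0]$ and hence of the ATE:

```python
import numpy as np

def matching_ate(y, d, x):
    """Plug-in matching estimator of E[Y^1] - E[Y^0].

    m_d(X_i) is estimated by the outcome of the nearest neighbour
    (Euclidean distance) among observations with treatment d; for
    units with D_i = d, the own outcome is used directly.
    """
    x = x.reshape(len(y), -1)             # ensure shape (n, q)
    ey = {}
    for dval in (0, 1):
        m = np.empty(len(y))
        pool_x, pool_y = x[d == dval], y[d == dval]
        for i in range(len(y)):
            if d[i] == dval:
                m[i] = y[i]               # observed potential outcome
            else:
                j = np.argmin(((pool_x - x[i]) ** 2).sum(axis=1))
                m[i] = pool_y[j]          # nearest neighbour's outcome
        ey[dval] = m.mean()               # estimate of E[Y^d]
    return ey[1] - ey[0]

# Small invented example: X confounds D and Y, true ATE = 1
rng = np.random.default_rng(6)
n = 2000
x = rng.normal(0, 1, n)
d = (x + rng.normal(0, 1, n) > 0).astype(int)
y = d + x + rng.normal(0, 0.5, n)
print(matching_ate(y, d, x))              # close to 1
```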

2.2.1 Brief introduction to Nonparametric regression

References:
- Pagan Ullah (1999) Nonparametric Econometrics
- Ichimura Todd (2008) Chapter in Handbook of Econometrics
Nonparametric estimation of a conditional expectation function $E[Y|X]$ differs from parametric regression in that the regression function is not specified as linear or quadratic but is permitted to have any form, subject to some restrictions on integrability, continuity and differentiability. Here only some intuitive ideas are given, leaving a more detailed discussion for later. There are a large number of different approaches to nonparametric regression, but the approach closest to the underlying spirit of nonparametric regression is probably kernel regression. In the following discussion we usually consider the case where $X$ is continuously distributed. If $X$ were discrete, the asymptotic theory would be trivial: the conditional mean $E[Y|X=x]$ is simply estimated by taking the average of $Y_i$ among all observations with $X_i$ exactly equal to $x$. The number of observations with $X_i = x$ grows (in expectation) proportionally with the sample size, and it can easily be shown that the average of these observations is $\sqrt{n}$-consistent for the conditional mean. From the perspective of econometric theory, the situation is only interesting when $X$ is continuously distributed, because in this case the number of observations with $X_i$ exactly equal to $x$ is zero. Estimation of $E[Y|X=x]$ is then only possible when also using observations with $X_i \ne x$. Hence, we will discuss only the case with continuously distributed $X$ in the following. One should note here, however, that in finite samples some smoothing is usually useful even if all $X$ variables are discrete. Although this introduces some bias, it usually reduces the variance even more. Smoothing over discrete and continuous $X$ is discussed in Chapter 13.

The idea of kernel estimation of the conditional mean $E[Y|X=x]$ at location $x$ is to take a weighted average of the observed $Y$ in a small neighbourhood about $x$. Suppose that $X$ is scalar and that an iid sample $\{(Y_j, X_j)\}_{j=1}^n$ is available. Let $h$ denote a so-called bandwidth value, which defines the size of the neighbourhood. Taking the average of the observed $Y$ in a neighbourhood of range $2h$ gives:
$$\hat m(x; h) = \frac{\sum Y_j \, 1(|X_j - x| \le h)}{\sum 1(|X_j - x| \le h)}.$$
If $h$ were infinitely large, this would simply be the sample mean of $Y$. For $h$ becoming small, this is the average of $Y$ close to the point where $E[Y|X=x]$ is to be estimated. For a bandwidth value $h$ converging to zero as the sample size $n$ grows, this estimate converges to the true value $E[Y|X=x]$.
A more precise estimate can often be obtained by using a weighted average of $Y$:
$$\hat m(x; h) = \frac{\sum Y_j \, K\left(\frac{X_j - x}{h}\right)}{\sum K\left(\frac{X_j - x}{h}\right)},$$
where $K(u)$ is a weighting function (usually called a kernel function), which gives higher weights to observations that are closer to $x$. This is the Nadaraya-Watson kernel regression estimator. (Usually $K$ is positive, has a maximum at 0 and integrates to one.) Often used kernels are the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2) \, 1_{[-1,1]}(u)$ or the Gaussian kernel $K(u) = \phi(u)$. Whereas the former is compactly supported, the latter has unbounded support. An alternative kernel is, e.g., the quartic (or biweight) kernel $K(u) = \frac{15}{16}(1 - u^2)^2 \, 1(|u| < 1)$, which is compact and differentiable. The choice of the kernel is generally perceived to be of lesser importance. The choice of the bandwidth value, however, is very important for the properties of the estimator. Various approaches to choosing the bandwidth value in finite samples have been proposed, which however are not optimal for the matching estimators discussed below. For these estimators, the issue of optimal bandwidth choice is not yet fully resolved. Nevertheless, the conventional bandwidth estimators may not perform too badly, see e.g. Frölich (2004) or Frölich (2005) for propensity score matching (PSM).
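The Nadaraya-Watson estimator takes only a few lines of code. The following Python sketch (illustrative; the Epanechnikov kernel, the data and the hand-picked bandwidth are arbitrary choices) evaluates $\hat m(x; h)$ on a grid of points:

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def nw(x0, X, Y, h):
    """Nadaraya-Watson estimate of E[Y|X=x0] with bandwidth h."""
    w = epanechnikov((X - x0) / h)
    return np.sum(w * Y) / np.sum(w)    # weighted average of Y near x0

# Invented data: Y = sin(X) + noise
rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, 1000)
Y = np.sin(X) + rng.normal(0, 0.3, 1000)

grid = np.linspace(-2.5, 2.5, 11)
print([round(nw(x0, X, Y, h=0.4), 2) for x0 in grid])  # tracks sin(x0)
```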
Instead of taking a simple weighted average, one could also fit a local model in the neighbourhood about $x$ and use this model to estimate $E[Y|X=x]$. Local linear regression fits a line locally about $x$:
$$(\hat\alpha, \hat\beta) = \arg\min_{\alpha, \beta} \sum_j \left(Y_j - \alpha - (X_j - x)\beta\right)^2 K\left(\frac{X_j - x}{h}\right),$$
and estimates $E[Y|X=x]$ by $\hat\alpha$. Fitting locally a constant (local constant regression) gives the NW kernel regression estimator given above. Local linear regression has several theoretical advantages over local constant regression³⁶ and also delivers an estimate of the derivative $\partial E[Y|X=x]/\partial x$ at the location $x$, which is given by $\hat\beta$. It may lead to rather variable estimates in small samples, though. Local polynomial regression fits a polynomial locally.³⁷ According to the polynomial order, the local polynomial estimator is also called Nadaraya-Watson kernel ($p = 0$), local linear ($p = 1$), local quadratic ($p = 2$) or local cubic ($p = 3$) regression. Polynomials of order higher than three are rarely used in practice, except for estimating local bias in data-driven bandwidth selectors. Nadaraya-Watson kernel and local linear regression are most common in econometrics. Local polynomial regression of order two or three is better suited than kernel or local linear regression for modelling peaks and oscillating regression curves in larger samples, but it often proves unstable in small samples since more data points in each smoothing interval are required (Loader 1999). Alternative local parametric models could be used if the support of $Y$ is not the real line, e.g. local logit if $Y$ is binary. When interest is in estimating derivatives, higher-order polynomials are often advocated. If interest is in the $v$-th derivative, choosing $p$ such that $p - v$ is odd ensures that the bias at the boundary is of the same order as in the interior. If $p - v$ is even, this bias will be of higher order at the boundary and will also depend on the density of $X$, see e.g. Fan and Gijbels (1996).

For $X$ of higher dimension ($\dim(X) = q$), the local model corresponds to a local hyperplane, and the kernel function $K$ has to define a multidimensional smoothing window with a $q \times q$ dimensional bandwidth matrix. Often a multiplicative kernel, called a product kernel, is used: $K_H(u) = \prod_{l=1}^{q} K\left(\frac{u_l}{h_l}\right)$.
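A local linear fit amounts to a kernel-weighted least squares problem at each evaluation point. The sketch below (illustrative Python for scalar $X$, reusing the Epanechnikov kernel from the previous snippet) returns both $\hat\alpha$, the estimate of $E[Y|X=x]$, and $\hat\beta$, the derivative estimate:

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def local_linear(x0, X, Y, h):
    """Local linear regression at x0: returns (alpha_hat, beta_hat)."""
    w = epanechnikov((X - x0) / h)
    Z = np.column_stack([np.ones_like(X), X - x0])   # local design: 1, (X - x0)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)
    return coef[0], coef[1]   # level and slope of E[Y|X=.] at x0

rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, 1000)
Y = np.sin(X) + rng.normal(0, 0.3, 1000)

a, b = local_linear(0.5, X, Y, h=0.5)
print(a, b)   # roughly sin(0.5) and cos(0.5)
```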
Nonparametric regression thus avoids relying on functional form assumptions such as linearity,
³⁶ Fan (1992, 1993) showed that the bias of the local linear estimator is of the same order at boundary points as in the interior, whereas the bias of the Nadaraya-Watson estimator is of order $O(h)$ when estimated at values $x$ at the boundaries of its support. Fan (1993) demonstrated that the local linear estimator attains full asymptotic efficiency in a minimax sense among all linear smoothers and has high efficiency among all smoothers, see also Fan, Gasser, Gijbels, Brockmann, and Engel (1997).
³⁷ The solution $\hat m(x)$ for local polynomial regression of degree $p$ can be written explicitly as
$$\hat m(x) = e_1' \begin{pmatrix} Q_0(x) & Q_1(x) & \cdots & Q_p(x) \\ Q_1(x) & Q_2(x) & \cdots & Q_{p+1}(x) \\ \vdots & \vdots & \ddots & \vdots \\ Q_p(x) & Q_{p+1}(x) & \cdots & Q_{2p}(x) \end{pmatrix}^{-1} \begin{pmatrix} T_0(x) \\ T_1(x) \\ \vdots \\ T_p(x) \end{pmatrix},$$
where $e_1' = (1, 0, 0, \ldots)$, $Q_l(x) = \sum K\left(\frac{X_j - x}{h}\right)(X_j - x)^l$ and $T_l(x) = \sum K\left(\frac{X_j - x}{h}\right)(X_j - x)^l \, Y_j$ (Fan, Gasser, Gijbels, Brockmann, and Engel 1997).

and we would expect it therefore to be less precise than a (correctly specified) parametric estimator. In particular, the uncertainty in nonparametric regression consists of a bias and a variance term. (In parametric regression we usually have to deal only with a variance term, since the bias is often zero, e.g. in a correctly specified linear model estimated by OLS.) Keeping $h$ fixed and letting $n$ increase to infinity, the local bias and the local variance of the local constant (Nadaraya-Watson) and the local linear estimator are of the same order if $x$ is an interior point. The first-order bias and variance approximations for scalar $x$, i.e. $\dim(X) = 1$, are
$$\begin{array}{lcc} & \text{local constant (NW)} & \text{local linear} \\ \text{Bias} & h^2 \mu_2 \left( \dfrac{m'(x) f'(x)}{f(x)} + \dfrac{m''(x)}{2} \right) & h^2 \mu_2 \, \dfrac{m''(x)}{2} \\ \text{Variance} & \dfrac{1}{nh} \, \kappa_0 \, \dfrac{\sigma^2(x)}{f(x)} & \dfrac{1}{nh} \, \kappa_0 \, \dfrac{\sigma^2(x)}{f(x)}, \end{array}$$
where $\mu_l = \int u^l K(u) \, du$ and $\kappa_l = \int u^l K^2(u) \, du$.
In the multivariate case with $\dim(X) = q > 1$, the bias remains of order $h^2$ but the variance is of order $\frac{1}{nh^q}$. Hence, the bias increases with a larger bandwidth value and the variance decreases. Consistency of the estimator requires that $h$ converges to zero with growing sample size, but not too fast, such that the local variance also converges to zero. For one-dimensional $X$, the optimal bandwidth with respect to the mean squared error $E[(\hat m(x) - m(x))^2]$ satisfies
$$h = O\left(\frac{1}{\sqrt[5]{n}}\right) = O(n^{-1/5}).$$
This gives only the rate, i.e. the relationship between the optimal $h$ and the sample size. It will usually not help to choose $h$ for a given data set, which will be discussed below.
The previous results hold only for interior points. The following table gives the bias for interior as well as for boundary points, again only for one-dimensional $X$. The local bias is of the same order in the interior and at the boundary for odd-order polynomials, whereas it is of lower order in the interior for even-order polynomials:
$$\begin{array}{lcccc} & p=0 & p=1 & p=2 & p=3 \\ \text{Bias in interior} & O(h^2) & O(h^2) & O(h^4) & O(h^4) \\ \text{Bias at boundary} & O(h^1) & O(h^2) & O(h^3) & O(h^4) \\ \text{Variance} & O\left(\tfrac{1}{nh}\right) & O\left(\tfrac{1}{nh}\right) & O\left(\tfrac{1}{nh}\right) & O\left(\tfrac{1}{nh}\right). \end{array}$$

To achieve the fastest rate of convergence with respect to mean squared error, the bandwidth h should be chosen to balance squared bias and variance, which leads to the convergence rates:

$$\begin{array}{lcccc}
\text{Convergence rate} & p=0 & p=1 & p=2 & p=3 \\
\text{in the interior} & n^{-\frac{2}{5}} & n^{-\frac{2}{5}} & n^{-\frac{4}{9}} & n^{-\frac{4}{9}} \\
\text{at the boundary} & n^{-\frac{1}{3}} & n^{-\frac{2}{5}} & n^{-\frac{3}{7}} & n^{-\frac{4}{9}}
\end{array}$$

As noted above, nonparametric regression avoids relying on functional form assumptions such as linearity, and we would therefore expect it to be less precise than a (correctly specified) parametric estimator. This loss in precision becomes more pronounced for higher-dimensional X. Let q be the number of continuous variables in X. Stone (1980, 1982) showed that the optimal rate of convergence for nonparametric estimation of a p times continuously differentiable function m(x), $x \in \mathbb{R}^q$, is in $L_2$-norm
$$n^{-\frac{p}{2p+q}}$$
and in sup-norm (i.e. uniform convergence)
$$\left(\frac{n}{\ln n}\right)^{-\frac{p}{2p+q}}.$$

Hence, the rate of convergence is always slower than the parametric rate and decreases with the dimension of X, more precisely with the number of continuous variables in X. This is known as the "curse of dimensionality" and implies that nonparametric regression becomes more difficult for higher-dimensional X. On the other hand, the optimal rate of convergence increases with the number of continuous derivatives of m(x), and for p very large it is close to the parametric rate. Stone (1980, 1982) showed further that a local polynomial regression estimator of order $p - 1$ attains the optimal rate of convergence, provided the bandwidth is chosen optimally.

This decrease in the optimal rate of convergence with the dimension of X led to a general suspicion against nonparametric estimation methods when X contains more than 2 or 3 continuous regressors. However, such pessimism is not necessarily justified when estimating objects such as average potential outcomes $E[Y^d]$ or average treatment effects, which are low-dimensional objects (i.e. scalars). Even though the nonparametric regression plane $m_d(x)$ may be difficult to estimate, the subsequent averaging over these estimates can lead to a faster convergence rate, and, as mentioned below, even $\sqrt{n}$-convergence can be achieved under sufficient regularity assumptions.

The behaviour of all nonparametric regression estimators depends on the proper choice of a bandwidth value. Asymptotic rates of convergence are of little guidance for choosing the bandwidth for a particular dataset. A versatile approach to bandwidth selection is cross-validation (Stone 1974). Cross-validation is based on the principle of maximizing out-of-sample predictive performance. If a quadratic loss function is used to assess the estimation of m(x) at a particular point x, a bandwidth value h should be selected to minimize $E[(\hat m(x; h) - m(x))^2]$. If a single bandwidth value is used to estimate the function m(x) at all points x, the (global) bandwidth should be chosen to minimize the mean integrated squared error $MISE(h) = E\left[\int (\hat m(x; h) - m(x))^2 dx\right]$. Since m(x) is unknown, a computable approximation to minimizing the mean integrated squared error is minimizing the average squared error
$$\arg\min_h\; \frac{1}{n}\sum_j \left(Y_j - \hat m(X_j; h)\right)^2. \qquad (9)$$

However, minimizing the average squared error leads to the selection of too small bandwidth values. For example, if a kernel with compact support is used and h is very small, the local neighbourhood of $X_j$ would contain only the observation $(Y_j, X_j)$. As the estimate $\hat m(X_j)$ is a weighted average of the Y observations in the neighbourhood, the estimate of $m(X_j)$ would be $Y_j$. Hence, (9) would be minimized by a bandwidth value h close to zero. To avoid underestimating the optimal bandwidth, the observation $(Y_j, X_j)$ should be excluded from the sample when estimating $m(X_j)$.38 The corresponding estimate $\hat m_{-j}(X_j)$ is called the leave-one-out estimate and represents the out-of-sample prediction from the sample $\{(Y_l, X_l)\}_{l \neq j}$ at $X_j$. The resulting cross-validation function is defined as
$$CV(h; n) = \sum_j \left(Y_j - \hat m_{-j}(X_j; h)\right)^2, \qquad (10)$$
and h is chosen to minimize (10).
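A minimal sketch of this bandwidth search, reusing the nadaraya_watson function from the sketch above (the candidate grid is an arbitrary choice, and data arrays X, Y are assumed to exist):

def cv_criterion(h, X, Y):
    """Leave-one-out cross-validation criterion CV(h) of equation (10),
    here for the Nadaraya-Watson estimator defined above."""
    n = len(Y)
    idx = np.arange(n)
    cv = 0.0
    for j in range(n):
        keep = idx != j          # leave observation j out of the fit
        cv += (Y[j] - nadaraya_watson(X[j], X[keep], Y[keep], h))**2
    return cv

# pick h on an (arbitrary) grid of candidate bandwidths
grid = np.linspace(0.05, 2.0, 40)
h_cv = min(grid, key=lambda h: cv_criterion(h, X, Y))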

Usually the bandwidth h is considered fixed for a given sample, i.e. for estimating m(x) at different locations x the same bandwidth value is used. However, permitting a variable bandwidth h(x) could allow a more precise estimation if the smoothing window adapts to the density of the available data. One such approach is nearest neighbours regression. With $\kappa$-NN (and a kernel with bounded support), the bandwidth h(x) is chosen such that exactly $\kappa$ observations fall in the smoothing window, i.e. only the $\kappa$ nearest neighbours to x are used for estimating m(x). (The kernel weights could be uniform or decreasing with distance.)39 For determining nearness, a distance metric is required when X is multidimensional. One common choice is the Mahalanobis distance, which is a quadratic form in $(X - x)$ weighted by the inverse of the covariance matrix of X.

38 An alternative is generalized cross-validation, which penalizes the cross-validation criterion when the contribution of the data point $(Y_j, X_j)$ to the estimate $\hat m(X_j)$ is large.

Generally, when $\dim(X) > 1$ we have to smooth in various dimensions. This would require the choice of a $q \times q$ dimensional bandwidth matrix, which also defines the spatial properties of the kernel, e.g. an ellipsoidal support of the kernel. The simplest solution to deal with this situation is to scale the X data beforehand such that each regressor has mean zero, variance one and covariance zero. (This can easily be done in practice by multiplying the data matrix by the inverse of the Cholesky factor of the covariance matrix of the regressors.) Then, using a single bandwidth value and a product kernel is convenient.
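A short sketch of this rescaling step (Python/numpy; the function name is ours). Euclidean distances computed on the transformed data then coincide with Mahalanobis distances on the original data:

import numpy as np

def sphere(X):
    """Rescale an (n x q) regressor matrix to mean zero and identity
    covariance.  With Cov(X) = L L' (Cholesky factor L), the transformed
    data Z = (X - mean) (L')^{-1} satisfy Cov(Z) = I."""
    Xc = X - X.mean(axis=0)
    L = np.linalg.cholesky(np.cov(Xc, rowvar=False))
    return np.linalg.solve(L, Xc.T).T     # equivalent to Xc @ inv(L)'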

Before concluding this very brief introduction to nonparametric regression, series regression should also be mentioned. In contrast to kernel regression, which is based on smoothing in a local neighbourhood, series regression is a global smoothing method like conventional parametric regression, but with an increasing number of regressors. Consider X scalar. A series regression based on a power series estimates the regression plane by conventional least squares with the regressors $1, x, x^2, x^3, x^4, \ldots$. The number of regressors included grows to infinity with the sample size, and the number of terms included for a given data set can be obtained by cross-validation. In practice, the power series is not the best choice due to collinearity problems, and alternative series are more appropriate. A large number of more attractive basis functions exist, which one should choose according to some characteristics of m(x). E.g. if m(x) is periodic, a flexible Fourier series would be adequate. If X has support [0,1], Chebyshev or Legendre polynomials are appealing.
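As an illustration, the following sketch fits a Chebyshev series for scalar X with support [0,1] and selects the number of terms by leave-one-out cross-validation, using the closed-form leave-one-out residuals of least squares (again Python/numpy; names are ours):

import numpy as np

def chebyshev_series_fit(X, Y, max_deg=12):
    """Series regression with a Chebyshev basis for scalar X on [0,1];
    the number of terms is chosen by leave-one-out cross-validation.
    Returns the selected degree and the OLS coefficients."""
    Z = 2.0 * X - 1.0                                   # map [0,1] to [-1,1]
    best_deg, best_score = None, np.inf
    for deg in range(1, max_deg + 1):
        B = np.polynomial.chebyshev.chebvander(Z, deg)  # n x (deg+1) basis
        H = B @ np.linalg.pinv(B)                       # hat matrix
        loo = (Y - H @ Y) / (1.0 - np.diag(H))          # leave-one-out residuals
        score = np.mean(loo**2)
        if score < best_score:
            best_deg, best_score = deg, score
    coef = np.linalg.lstsq(
        np.polynomial.chebyshev.chebvander(Z, best_deg), Y, rcond=None)[0]
    return best_deg, coef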

39 The basic difference between $\kappa$-NN and kernel-based techniques is that the latter estimate m(x) by smoothing the data in a fixed (i.e. deterministic) neighbourhood of x, whereas the former smooths the data in a neighbourhood of stochastic size. For example, 10-NN regression estimates m(x) by the average outcome Y of all observations in a neighbourhood of x, where the size of the neighbourhood is determined such that it contains exactly 10 observations of the sample.

2.2.2 Matching estimators

The literature on matching estimators largely evolved around the identification and estimation of average treatment effects with a binary variable D, see Heckman, Ichimura, and Todd (1997, 1998). We follow this discussion as it elucidates many of the main issues that distinguish matching estimators from OLS. Consider estimation of the average treatment effect on the treated
$$E[Y^1 - Y^0 \,|\, D=1].$$
A naive estimator is
$$E[Y|D=1] - E[Y|D=0],$$
which simply compares the observed outcomes among the treated (D = 1) and the non-treated (D = 0). We have seen in the previous chapter that this comparison of means is appropriate and consistent if we have experimental data, i.e. when D has been randomly assigned. With non-experimental data (i.e. when D is not randomly assigned), this estimator is usually biased due to differences in observables and unobservables among those who chose D = 1 and those who chose D = 0. As discussed above, the bias of this simple experimental estimator when used with observational (= non-experimental) data is:
$$E[Y^0|D=1] - E[Y^0|D=0].$$

Assuming for convenience that X is continuous, this bias can also be written as
$$\int_{S_1} E[Y^0|X=x, D=1]\, f_{X|D=1}(x)\,dx \;-\; \int_{S_0} E[Y^0|X=x, D=0]\, f_{X|D=0}(x)\,dx,$$
where $S_d = \text{Supp}(X|D=d)$. Let $S_{10} = S_1 \cap S_0$ be the common support, i.e. the region where $S_1$ and $S_0$ overlap. The bias can now be decomposed into three parts:
$$\int_{S_1 \setminus S_{10}} E[Y^0|X=x, D=1]\, f_{X|D=1}(x)\,dx \;-\; \int_{S_0 \setminus S_{10}} E[Y^0|X=x, D=0]\, f_{X|D=0}(x)\,dx$$
$$+ \int_{S_{10}} E[Y^0|X=x, D=0]\,\big(f_{X|D=1}(x) - f_{X|D=0}(x)\big)\,dx$$
$$+ \int_{S_{10}} \big(E[Y^0|X=x, D=1] - E[Y^0|X=x, D=0]\big)\, f_{X|D=1}(x)\,dx.$$

The third part is the bias due to differences in the expected outcomes between the participants (D = 1) and the nonparticipants (D = 0) conditional on X. This component would be nonzero if there were systematic unobserved differences even after controlling for the covariates X. If X already includes all confounding variables, then
$$E[Y^0|X, D=0] = E[Y^0|X, D=1]$$
and the third part of the bias decomposition is zero. This component is what is traditionally understood by selection bias.
The first and the second part of this bias decomposition show that there are also other issues to be considered. The second part of the bias is due to differences in the distributions of the X characteristics among participants and non-participants, which are neglected in the simple means comparison. These differences need to be adjusted for in the matching estimator. Finally, the first component is due to differences in the support of X in the participant and nonparticipant subpopulations. In the experimental estimator $E[Y|D=1] - E[Y|D=0]$ we partly compare individuals to each other for whom no counterfactual could ever be nonparametrically identified, since the density $f_{X|D=d}(x)$ is zero in the other subpopulation. If $S_1 \setminus S_{10}$ is nonempty, there are participants with characteristics x for whom no counterpart in the nonparticipant (D = 0) subpopulation could ever be observed, since the density $f_{X|D=0}(x)$ is zero. Analogously, if $S_0 \setminus S_{10}$ is nonempty, there will be nonparticipants with characteristics for whom no participant with identical characteristics could be found, because $f_{X|D=1}(x)$ is zero. If it happened that individuals with characteristics in $S_1 \setminus S_{10}$ have on average large outcomes $Y^0$ and those with characteristics in $S_0 \setminus S_{10}$ have on average small outcomes $Y^0$, then the first bias component of the experimental estimator would be positive. The reason is that the term $E[Y|D=1]$ contains these high-outcome (e.g. earnings) individuals (i.e. $S_1 \setminus S_{10}$), who are missing in the D = 0 population and thus do not enter $E[Y|D=0]$. Analogously, the term $E[Y|D=0]$ contains those low-earning individuals (i.e. $S_0 \setminus S_{10}$) whose characteristics have zero density in the D = 1 population and who are thus missing in the term $E[Y|D=1]$. Therefore, in this example, the term $E[Y|D=1]$ would be too large as it contains the high-earnings individuals, and the term $E[Y|D=0]$ would be too small as it contains those low-earnings individuals.
In the case of randomized experiments $S_0 = S_1$ and, hence, common support is guaranteed. With observational studies, however, this is often not the case. For example, if applying for university requires a minimum grade (in the secondary school examination results), then we could find individuals with very low grades in the population not attending university, but we could never find university graduates for particular values of X. Hence, we will not be able to find a proper comparison group for such levels of X.40 If we know the rules regulating entry to university, we would know exactly which x values cannot be observed in the D = 1 population. In addition to such formal rules, there are often many other factors unknown to us that make the choice of D = 1 or D = 0 extremely unlikely. For example, parental income may matter a lot for attending university, and for very low incomes it might be impossible to attend university. However, we do not know the threshold a priori and would thus not know the common support, but would rather have to estimate it.
To eliminate selection bias in the estimation of the ATET, we must address all three possible sources. The third component is eliminated if we can assume that all confounding variables are included in X. Otherwise, we have to find other approaches to deal with this selection on unobservables, e.g. the instrumental variables discussed in the next sections. To deal with the second component, we have to weight the nonparametric estimates of $E[Y^0|X, D=0]$ with the appropriate density of $X|D=d$. The first part is difficult to deal with, as we would have to estimate $E[Y^0|X, D=0]$ in regions where there are no data, i.e. where $f_{X|D=0}$ is zero. We either have to give up the fully nonparametric approach to identification and use extrapolation methods to estimate conditional expectation functions in regions where there are no data (for this it would be useful to have a good local model; the local constant model would often not be the most appropriate), or we have to concede that there are regions without data and restrict ourselves to identifying the parameter of interest only in those regions where data can exist. This is usually done by deleting observations which appear to be outside the common support.
Consider now estimating the average potential outcome among the treated
$$E[Y^0|D=1] = \int E[Y|X, D=0]\, dF_{X|D=1} = \int m_0(x)\, f_{X|D=1}(x)\,dx.$$
Here the second bias component discussed above is eliminated by weighting the conditional expectation $m_0(x) = E[Y|X=x, D=0]$ among the nonparticipants by the density of X among the participants, $f_{X|D=1}$.

40 For example, in active labour market programmes being unemployed is usually a central condition for eligibility. Thus employed persons cannot be participants as they are not eligible and, hence, no counterfactual outcome is identified for them.

To deal with the first bias component, we redefine the parameter of interest by restricting it to the region $S_0$, because $m_0(x)$ is nonparametrically identified only where $f_{X|D=0}(x) > 0$.41 Restricting the average potential outcome to the region $S_0$ is in fact equivalent to restricting it to the set of common support $S_{10}$, since $f_{X|D=1}(x) = 0$ in the region $S_0 \setminus S_{10}$. This is clearly visible from the next equation, where it would not make any difference if we exchanged $S_{10}$ for $S_0$, because the difference between these two sets has probability mass zero in the D = 1 population.42 This gives the average potential outcome in the common support region, defined as
$$E_{S_{10}}[Y^0|D=1] = \frac{\int_{S_{10}} m_0(x)\, f_{X|D=1}(x)\,dx}{\int_{S_{10}} f_{X|D=1}(x)\,dx}.$$
For the ATET this gives analogously
$$E_{S_{10}}[Y^1 - Y^0 \,|\, D=1].$$

41 For values of x where $f_{X|D=0}(x) = 0$ there could never be any data at x in the D = 0 population, and since consistency of nonparametric regression estimators requires that the bandwidth value h converges to zero, $m_0(x)$ is nonparametrically identified only where $f_{X|D=0}(x) > 0$.

42 I.e. $\int_{S_0 \setminus S_{10}} f_{X|D=1}(x)\,dx = 0$.

Suppose for the moment that $S_1 \subseteq S_0$, i.e. that we have full common support. A matching estimator of $E[Y^0|D=1]$ is obtained by replacing $F_{X|D=1}$ by the empirical distribution function $\hat F_{X|D=1}$ and $m_0$ by a nonparametric estimator $\hat m_0$:
$$\widehat{E[Y^0|D=1]} = \frac{1}{n_1} \sum_{i:D_i=1} \hat m_0(X_i),$$
and similarly for the ATET
$$\widehat{E[Y^1 - Y^0|D=1]} = \frac{1}{n_1} \sum_{i:D_i=1} \left( Y_i^1 - \hat m_0(X_i) \right), \qquad (11)$$
where $\hat m_0(x)$ is a nonparametric regression estimator obtained from the nonparticipant sample. This estimator estimates $m_0(x)$ at all locations x of the participant sample and then takes the average of these estimates in the participant sample. Taking the average in the participant sample is equivalent to weighting with $d\hat F_{X|D=1}$, i.e. it is self-weighting.

The estimator $\hat m_0$ could be, for example, a kernel regression estimator or local polynomial regression. A very popular alternative estimates $m_0(x)$ by first-nearest-neighbour regression,

which corresponds to a kernel regression estimator with a bounded kernel and a variable bandwidth such that, at each location x where $m_0(x)$ is to be estimated, the bandwidth is chosen so that only one observation falls in the bandwidth window. In other words, for estimating $m_0(x)$ the observation with characteristics $X_j$ that are closest to the characteristics x is selected (i.e. the closest observation to x), and its value $Y_j$ is taken as the estimate: $\hat m_0(x) = Y_j$.43 The use of the nearest-neighbour regression estimator provides the origin of the name matching estimator. In the formula given above, $m_0(X_i)$ is estimated for each participant $X_i$ (with $D_i = 1$). The estimate $m_0(x)$ is obtained as the outcome of that non-participant observation (with $D_i = 0$) that is closest to x. Hence, for each participant $X_i$ the closest non-participant is found, i.e. "pairs" or "matches" of similar participants and non-participants are formed. In the estimation of the ATET (11) the outcome difference within each such pair is taken and averaged over all pairs.
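A minimal sketch of this pair-matching estimator of the ATET (Python/numpy; names are ours; in practice X would first be rescaled as in the sphering sketch above, so that Euclidean distance corresponds to Mahalanobis distance):

import numpy as np

def atet_pair_matching(Y, D, X):
    """Pair-matching estimator (11): impute m0(X_i) for each treated unit
    by the outcome of its nearest control in Euclidean distance."""
    X1, Y1 = X[D == 1], Y[D == 1]
    X0, Y0 = X[D == 0], Y[D == 0]
    m0_hat = np.array(
        [Y0[np.argmin(np.sum((X0 - x)**2, axis=1))] for x in X1])
    return np.mean(Y1 - m0_hat)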

Obviously, using a single nearest neighbour for estimating $m_0(x)$ leads to the lowest bias but a rather high variance, and a wider bandwidth window might therefore be appropriate. A nonparametric regression estimator that permits a wider bandwidth is $\kappa$-nearest-neighbour regression, which takes a (weighted) average of the $\kappa$ nearest neighbours. Again this could be understood as a kernel regression estimator with a compact kernel and variable bandwidths. Hence, matching with $\kappa$-nearest-neighbour regression and matching with kernel regression and bandwidth h are likely to perform similarly if $\kappa$ and h are chosen optimally and converge to infinity and zero, respectively, with growing sample size.44 However, matching based on local polynomial regression or on so-called higher-order kernels, discussed later, can reduce the bias of the matching estimator, which is not possible with conventional nearest-neighbour regression.

One may wonder why the simple one-to-one matching estimators have been so popular, particularly in biometrics. The main reason seems to be that they can help to reduce the cost of data collection. Suppose we have a dataset from medical records on 50 individuals who were exposed to a certain drug or treatment and 5000 individuals who were not exposed. For the 5000 controls some basic X variables are available, but not the Y variable of interest. This outcome would usually require the collection of follow-up data, e.g. whether the person is still alive a few years later. Collecting these Y data is often costly, and may e.g. require a blood test with prior consent of the physician and the individual. Thus, instead of following up all 5000 individuals, it makes sense to use the available X data to choose the 50 control observations most similar to the 50 treated observations in terms of X, and to collect additional data on these individuals afterwards. This explains why one-to-one matching is helpful before data collection is done, but afterwards it does not preclude the use of estimators that use a larger smoothing area.

43 I.e. the smoothing window contains only one observation.

44 In practice, $\kappa$-nearest-neighbour matching may perform somewhat better, since the smoothing region automatically adapts to the density and thus ensures that never fewer than $\kappa$ observations are in the smoothing region.

Heckman, Ichimura, and Todd (1998) and Abadie and Imbens (2006), among others, analyzed the asymptotic properties with local polynomial regression and with $\kappa$-NN regression, respectively. Heckman, Ichimura, and Todd (1998) showed that under certain conditions on the bandwidth and the kernel function, the matching estimator can achieve $\sqrt{n}$-consistency. Hence, in spite of the curse of dimensionality, $\sqrt{n}$-convergence could be achieved. In other words, the curse of dimensionality does not really apply to estimators of "aggregated regression hyperplanes" such as the ATE or ATET. Nevertheless, nonparametric regression becomes more difficult with higher-dimensional X, and the regularity conditions on the function $m_0$ and the densities $f_0$ become stronger for larger $\dim(X)$.45

However, this is not possible with $\kappa$-NN matching unless $\dim(X)$ is very small. Abadie and Imbens are particularly interested in the properties when $\kappa$ does not increase to infinity with the sample size, e.g. if $\kappa = 1$, and find that $\sqrt{n}$-convergence can only be achieved if $\dim(X) \leq 4$.46

45 A cautionary note: We are not going to examine regularity conditions in much detail here. Often they take the form of assuming that the regression functions and densities are twice or thrice continuously differentiable, which we would often regard as acceptable and thus not very strong conditions, since we believe that most economic relationships are (at least after taking expectations) relatively smooth. There might sometimes be discontinuities due to regulations and laws. E.g. if a driving license can be obtained only at the age of, say, 18 years (and if this is strictly enforced), then earnings might jump at the age of 18 if having a driving license is important for jobs or commuting to jobs. Sometimes, regularity conditions should be examined in more detail and should not always be taken for granted. In a well-known paper of Hirano, Imbens, and Ridder (2003) on weighting estimators, they assume that the propensity score p(x) is at least $7q$ times continuously differentiable. Hence, if X contains 5 continuous regressors we would require at least 35 continuous derivatives, which is certainly a rather strong assumption.

46 Abadie and Imbens (2006) analyze the asymptotic efficiency of $\kappa$-nearest-neighbour matching estimators of average treatment effects when $\kappa$ is fixed, i.e. when the number of neighbours is fixed and does not grow with increasing sample size. This includes the standard pair-matching estimator ($\kappa = 1$). They consider matching with respect to the X variables and show that 1) these estimators do not attain the variance bound and, hence, are inefficient; 2) the bias term of the estimator is of order $O(n^{-2/c})$ where c is the number of continuous covariates, and consequently, if the number of continuous covariates is 4, the estimator is asymptotically biased, while if the number of continuous covariates is even larger, the estimator no longer converges at rate $\sqrt{n}$; 3) the bias term can be removed through re-centering, but since re-centering leaves the variance term unchanged, the modified estimator is still inefficient.

Hence, if the dimension is large, we will not achieve $\sqrt{n}$-convergence. But with propensity score matching, discussed below, the dimension would in principle be reduced to one.

Now, if $S_1 \not\subseteq S_0$, there will be values of x observed among the participants (D = 1) for whom $m_0(x)$ is not identified, since the density $f_{X|D=0}$ is zero. Eliminating such observations gives
$$\widehat{E_{S_{10}}[Y^0|D=1]} = \frac{1}{n_1^{S_{10}}} \sum_{i:D_i=1} \hat m_0(X_i)\, \mathbf{1}(X_i \in S_{10}),$$

where $n_1^{S_{10}}$ is the number of participants in the common support. Usually, the common support is unknown and needs to be estimated. Often we know parts of the regions outside the common support: we might know that for certain values of X, choosing D = 0 is impossible, e.g. due to legal rules. For example, if every unemployed person either has to attend training (D = 0) or an employment programme (D = 1), it may be that training is only permitted for individuals younger than 35. In this case, we would delete all observations above 35 because it would be impossible to observe older individuals with D = 0. (One might still observe older individuals with D = 0 in the available data, but one would then be concerned whether those individuals are very special cases, e.g. in terms of their unobservables, or whether there might be measurement error in the age variable.) But even after imposing such known restrictions, there might still be individuals with $D_i = 1$ for whom $f_{X|D=0}(X_i) = 0$. To estimate the common support, one could nonparametrically estimate the density $f_{X|D=0}$ and eliminate observations where $f_{X|D=0}$ is zero or very small. An alternative is based on defining the common support region with respect to the propensity score
$$p(x) = \Pr(D = 1 \,|\, X = x).$$
For any x where p(x) = 1, all individuals with such characteristics would be observed to have D = 1, such that $f_{X|D=0}(x)$ must be zero.47 Hence, we could also define the matching estimator with respect to all observations where $p(X_i) < 1$:
$$\widehat{E_{S_{10}}[Y^0|D=1]} = \frac{1}{n_1^{S_{10}}} \sum_{i:D_i=1} \hat m_0(X_i)\, \mathbf{1}(p(X_i) < 1).$$

47 Obviously, the reverse is not true, since for some x it could be that $f_{X|D=0}(x)$ and $f_{X|D=1}(x)$ are both zero, such that p(x) is undefined. However, such values of x would then also not be observed among the participants and would thus not appear in the matching estimator. I.e. they have measure zero in both the treated and the non-treated populations.

2.2.3 Propensity score matching

Instead of using a matching estimator with respect to the covariates X, matching with respect to the propensity score is very popular. This propensity score matching estimator is motivated by the observation that the conditional independence assumption (2) also implies that
$$Y^d \perp\!\!\!\perp D \,|\, p(X),$$
where p(x) is the propensity score.48 Hence, to eliminate selection bias due to unobservables it is not necessary to compare individuals that are identical in all characteristics X; it suffices if they are identical in the propensity score. Since the propensity score is a (one-dimensional) number between 0 and 1, this reduces the dimensionality problem considerably. It also allows us to rank individuals, since we can project them onto the one-dimensional interval between 0 and 1.
Hence, instead of matching on the multidimensional vector x, it suffices to match on the one-dimensional propensity score p(x), because
$$E_{S_{10}}[Y^0|D=1] = E_{S_{10}}\big[\, E_{S_{10}}[Y^0 \,|\, p(X), D=1] \,\big|\, D=1 \big]$$
$$= E_{S_{10}}\big[\, E_{S_{10}}[Y^0 \,|\, p(X), D=0] \,\big|\, D=1 \big]$$
$$= E_{S_{10}}\big[\, E_{S_{10}}[Y \,|\, p(X), D=0] \,\big|\, D=1 \big],$$
and the propensity score matching estimator is
$$\widehat{E_{S_{10}}[Y^0|D=1]} = \frac{1}{n_1^{S_{10}}} \sum_{i:D_i=1} \hat m_0(p_i)\, \mathbf{1}(p_i < 1),$$
where $p_i = p(X_i)$ and $\hat m_0(p)$ is a nonparametric estimate of $E[Y \,|\, D=0, p(X)=p]$.

48 The proof for binary D is very simple: To show that $D \perp\!\!\!\perp Y^d \,|\, p(X)$, i.e. that the distribution of D does not depend on $Y^d$ given $p(X)$, it needs to be shown that $P(D=1|Y^d, p(X)) = P(D=1|p(X))$, and analogously for D = 0. Because $P(D=1|\cdot)$ and $P(D=0|\cdot)$ have to add to one for binary D, it suffices to show this relationship for the former. Now, $P(D=1|Y^d, p(X)) = E[D|Y^d, p(X)] = E\big[E[D|X, Y^d, p(X)] \,\big|\, Y^d, p(X)\big]$ by iterated expectations. This equals $E\big[E[D|X, Y^d] \,\big|\, Y^d, p(X)\big] = E\big[E[D|X] \,\big|\, Y^d, p(X)\big]$ by (2). Now, this equals $E[p(X) \,|\, Y^d, p(X)] = p(X)$. Analogously for the right hand side: $P(D=1|p(X)) = E[D|p(X)] = E\big[E[D|X, p(X)] \,\big|\, p(X)\big] = p(X)$.
In fact, the justification of propensity score matching does not depend on any properties of the potential outcomes. Propensity score matching and matching on covariates X will always converge to the same limit, since it is a mechanical property of iterated integration, as shown in Frölich (2007b).
In practice the propensity score is almost always unknown and has to be estimated first. Estimating the propensity score nonparametrically is usually as difficult as estimating the conditional expectation function $m_0(x)$, since they have the same dimensionality. However, it is accepted practice in the applied econometrics profession to estimate the propensity score by a probit or logit model. This parametric first step essentially turns the matching estimator into a semiparametric estimator. So you might want to ask: why should propensity score matching be better than OLS if a parametric estimator is used in the first stage? There are a number of econometricians who believe that a logit estimate of the propensity score works well and does not lead to very different results than when a nonparametric propensity score is used. You might want to believe this or not. The strongest points for using propensity score matching probably are: (1) We permit heterogeneity in treatment effects of relatively arbitrary form. (2) It highlights the issue of common support, i.e. differences in the observable characteristics, and helps to examine these differences in a simple visual way: it is simple to plot the distribution of $p_i$ in the D = 0 and D = 1 populations, since it is only one-dimensional, whereas examining the densities $f_{X|D}$ visually is almost always impossible due to the high dimension. Hence, via the propensity score it can easily be seen for which observations nonparametric estimation is possible and for which observations the results would have to be based on extrapolation. (3) Endogenous control variables X, i.e. variables correlated with U, are permitted. (4) One can specify the parametric model of the selection process without involving the outcome variable. Hence, one can re-specify a probit model several times, e.g. via omitted-variables tests, balancing tests or the inclusion of several interaction terms, until a good fit is obtained, without this procedure being driven by Y or the treatment effect themselves. If one were instead to estimate a regression of Y on D and X, all diagnostics would be influenced by the true treatment effect, and a re-specification of the model, e.g. leaving out insignificant variables, would thus already depend on the treatment effect, invalidating the standard inference based on the t-statistic in the second regression. (5) Once a good fit of the propensity score has been obtained, it can be used to estimate the effect on several different outcome variables Y, e.g. employment states at different times in the future, various measures of earnings, health indicators etc. (6) Lastly, it can always serve as a useful robustness check of conventional OLS estimates. If OLS leads to very different results than propensity score matching, then one should explore the reasons for this, whether it is due to differences in the support or mainly the result of the linearity assumption inherent in OLS. Often the reason lies in a lack of common support, and when the common support definition is used to select the sample before estimating OLS, both procedures may lead to similar results.
Hence, with an estimated propensity score, the propensity score matching estimator is
$$\widehat{E_{S_{10}}[Y^0|D=1]} = \frac{1}{n_1^{S_{10}}} \sum_{i:D_i=1} \hat m_0(\hat p_i)\, \mathbf{1}(\hat p_i < 1).$$

Heckman, Ichimura, and Todd (1998) analyze propensity score matching with an estimated propensity score and show consistency under certain regularity assumptions, which are usually satisfied if a parametric estimator is used. This first-step estimation usually has an impact on the asymptotic variance matrix of the estimator.
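A minimal sketch of the full procedure (Python; names are ours). Note that sklearn's LogisticRegression applies a mild penalty by default, whereas the text envisages a standard unpenalized probit/logit first step (e.g. as available in statsmodels); the sketch also imposes the simple min-max common-support rule discussed below:

import numpy as np
from sklearn.linear_model import LogisticRegression

def atet_ps_matching(Y, D, X):
    """Propensity score matching for the ATET: logit-type first step,
    min-max common-support rule (drop treated with scores above the
    largest control score), then pair matching on the score."""
    p_hat = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]
    p1, Y1 = p_hat[D == 1], Y[D == 1]
    p0, Y0 = p_hat[D == 0], Y[D == 0]
    keep = p1 <= p0.max()                  # common support for the treated
    m0_hat = np.array([Y0[np.argmin(np.abs(p0 - p))] for p in p1[keep]])
    return np.mean(Y1[keep] - m0_hat)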

The CIA (2) implies not only independence conditional on p but also that
$$Y^d \perp\!\!\!\perp D \,\big|\, p(X), \text{subvector}(X), \qquad (12)$$
where subvector(X) contains only (a few) components of X, e.g. gender, age etc. Hence, if we were interested in estimating the average potential outcome separately for men and for women, we could use (assuming for the moment full common support)
$$E[Y^0|D=1, \text{men}] = E\big[\, E[Y^0 \,|\, p(X), D=1, \text{men}] \,\big|\, D=1, \text{men}\big]$$
$$= E\big[\, E[Y^0 \,|\, p(X), D=0, \text{men}] \,\big|\, D=1, \text{men}\big]$$
$$= E\big[\, E[Y \,|\, p(X), D=0, \text{men}] \,\big|\, D=1, \text{men}\big],$$
and the same for women.

Thus, we can use the same (estimated) propensity score for estimating the average potential outcome in the entire population, the male population as well as the female population. Hence, estimating the propensity score only once would suffice.49 Obviously, the analysis of common support has to be done separately in each subpopulation.

49 I.e. if we want to estimate the propensity score nonparametrically, we would have to do this only once, and could use this estimate to estimate the ATE in different subpopulations.

Practical issues in propensity score estimation Hence, the propensity score plays two central roles. First, it is a very helpful tool to highlight and impose the common support requirement. Second, it can additionally be used to simplify the estimation process. Even if one does not pursue this second aspect and instead uses regression on X or parametric regression, the propensity score nevertheless remains very helpful to visualize common support.

In practice one often first estimates the propensity score p(x), either parametrically by probit/logit or nonparametrically, and plots the histogram or density of the estimated $\hat P_i$ separately for the D = 0 and the D = 1 populations. An example is given in the following graph, which shows how the distributions of the propensity scores could look. This graph helps us to visualize the differences in observed characteristics between the treated and controls in a simple way. First, we see roughly how much the distributions differ. If D had been randomly allocated with probability 0.5, the distributions should be very similar for treated and controls, ideally with a single spike at P = 0.5. Otherwise, the distributions would be more dissimilar, with most of the mass to the left for the controls and most of the mass to the right for the treated. We often find that there is very little density mass above 0.7 or 0.8 in the D = 0 population, whereas naturally much of the density mass lies above 0.7 in the D = 1 population. If we are interested in the ATET, we would then drop all treated observations above the largest value of P in the control sample. Hence, if the largest value of P among the controls is 0.7, all treated above this threshold are deleted, since we could not find close comparison observations for them. Usually, we would often be stricter in the implementation of the common support in that we drop all treated above, say, the tenth-largest value of P in the control population. If we are interested in the ATE, we would impose common support on both sides in that we delete treated with high values of P and controls with low values of P. Obviously, the deletion of these observations changes the population for which we estimate an effect, and we should therefore always compare the descriptive statistics of the deleted observations with those remaining. If we lose only about 2 to 10% of the observations due to the imposition of common support and if the sample means are similar among those deleted and those remaining, we would be relatively confident that the estimates obtained can be interpreted more broadly. However, if a large fraction of the sample is dropped and their characteristics are rather different from those remaining, we should be cautious to interpret our estimates only with respect to the remaining subpopulation and should acknowledge that external validity might be limited. (This comparison between the in-support and out-of-support populations is more difficult with the weighting estimator discussed later.)
[Figure: densities of the propensity score among controls and treated, $f_{P|D=0}$ and $f_{P|D=1}$]

Apart from restricting the matching estimator to the common support for reasons of identification, further trimming of observations in regions where the density is very small can help to improve the small-sample properties of the estimator. Heckman, Ichimura, and Todd (1997, 1998) suggest estimating the densities of the propensity score $f_{P|D=0}(\cdot)$ and $f_{P|D=1}(\cdot)$ nonparametrically and eliminating those regions where the densities are small. Since P is one-dimensional, this is not difficult to implement.

A simpler but popular alternative, used when interest is in the ATET, is to delete all treated with a propensity score larger than the largest (or second- or 10th-largest) propensity score among the D = 0 observations. When estimating the ATE, one needs to trim observations on both sides analogously, i.e. to delete also all D = 0 observations with a propensity score smaller than the smallest (or second- or 10th-smallest) propensity score among the D = 1 observations.

As an additional analysis, one might even consider smaller subsets as the common support region. E.g. using only the subset 0.1 < P < 0.9 is a frequent choice to obtain more precise estimates. Similarly, Black and Smith (2004) define the "thick support" region as 0.33 < P < 0.67 and examine an additional analysis for this region. A first motivation is that one would expect more precise estimates for this region where most of the data lie. Black and Smith (2004) mention two additional reasons. First, if measurement error in the D variable is possible, then observing a very high value of $P_i$ and at the same time $D_i = 0$ could be an indication of measurement error in that the true value of D may be one. There is less reason to suspect measurement error when $P_i$ takes intermediate values. A second reason is that, under certain assumptions, the bias due to any remaining selection on unobservables is largest in the tails of the distribution of P. Black and Smith (2004, p. 111-113) give an example.

In principle, we could estimate the propensity score nonparametrically. Many researchers, though, prefer, at least in the beginning of the analysis, a simple parametric probit or logit model, since a number of tests for omitted variables, insignificant variables and balancing tests are widely available. (A number of nonparametric tests for insignificant variables are discussed in Li and Racine (2007).) Although one should ideally know from economic theory which variables are to be included in X, in practice one would prefer not to have too many variables in X and would eliminate insignificant variables. Although including variables that are not significant predictors of D should not hurt in principle, and could even increase efficiency, it would nevertheless introduce a lot of noise into the estimated propensity scores. Because all the coefficients, significant or not, are used to calculate the propensity score, they can blur the imposition of common support and also the true signal contained in X. E.g. if in the true model there is only one strong predictor of D, estimating the propensity score with only this variable would ensure that we compare only observations with the same characteristic. If, on the other hand, we include very many additional insignificant variables in X, the estimated propensity scores would contain a lot of noise and it would be more or less random which control observation is matched to which treated observation. To partially avoid this problem, it is often helpful to make use of (12) and to match not only on the propensity score but also on other characteristics that we deem to be particularly important with respect to the outcome variable. One could then either use multivariate nonparametric regression or use the Mahalanobis distance to project the distance in these various characteristics, including the propensity score, onto a one-dimensional space.
Frölich (2007b) shows that propensity score matching can also be used for estimating counterfactual distribution functions, and that its applicability is not confined to treatment evaluation. It can be used more generally to adjust for differences in the distribution of covariates between two populations.50 As an example, discrimination against women in the labour market is considered. It is frequently observed that women earn on average lower wages than men. This gender wage gap may partly be due to discrimination: the fact that women are paid substantially lower wages than men may be the result of wage discrimination in the labour market. On the other hand, part of this wage gap may be due to differences in education, experience and other skills, whose distribution differs between men and women. Much of the literature on discrimination has attempted to estimate how much of the gender wage gap would remain if men and women had the same distributions of observable characteristics.51 Not unexpectedly, the conclusions drawn from these studies depend on which and how many characteristics are observed. For individuals studying at university or polytechnic, rather different subjects or fields are available to them, ranging from mathematics, engineering and economics to philosophy. One observes that men and women choose rather different subjects, with mathematical and technical subjects more often chosen by men. At the same time, 'subject of degree' (= field of major) is not available in most data sets. In Frölich (2007b) the additional explanatory power of 'subject of degree' for the gender wage gap is examined. Propensity score matching is applied to analyze the gender wage gap of college graduates in the UK. It is examined to which extent the gender wage gap can be explained by observed characteristics, and by simulating the entire wage distribution the gender wage gap can be examined at different quantiles. It turned out that subject of degree contributes substantially to reducing the unexplained wage gap, particularly in the upper tail of the wage distribution. The huge wage differential between high-earning men and high-earning women is thus to a large extent the result of men and women choosing different subjects at university.

50 Propensity score weighting, discussed below, can be used as well.

Propensity score matching with choice-based samples Frölich (2007b) examines propensity score matching under non-random sampling, including choice-based sampling. The available data may often not be representative of the true population proportions, with certain groups such as treatment participants, foreigners, low-income individuals or residents of particular regions being oversampled.52 In this case the sampling distribution $F_{Y,X,D}$ is different from the population distribution, i.e. the sample is not drawn at random from the population. In the context of treatment evaluation, it is helpful to distinguish between non-random sampling with respect to $\Pr(D)$ and with respect to $F_{Y,X|D}$. Non-random sampling with respect to D is particularly frequent in the treatment evaluation context, where treatment participants are often oversampled. This is referred to as choice-based sampling in Heckman, Ichimura, and Todd (1997, Footnote 18), Heckman, Ichimura, Smith, and Todd (1998) and Smith and Todd (2005).53 Choice-based sampling leads to inconsistent estimates of the propensity score (Manski and Lerman 1977). Nevertheless, the estimation of the ATET using these inconsistent estimates of the propensity score is still consistent!

On the other hand, non-random sampling with respect to $F_{Y,X|D}$ requires a modification of the matching estimator. Non-random sampling with respect to X or Y given D can occur e.g. due to oversampling of foreigners or low-income individuals. Such non-random sampling could be present in only one of the two populations, for example if different sampling designs were used for the two populations. A survey might e.g. oversample pregnant women or women with small children among those who take a new drug (D = 1), while using random sampling for the control group. In this case two modifications of the propensity score matching estimator are required. First, when taking the average of $\hat m_0(p_i)$ among the treated observations, sampling weights have to be included. Second, in the estimation of $\hat m_0$ from the control observations, sampling weights have to be included as well. However, in the case that the sampling was the same in both subpopulations, weighting can be ignored completely. This condition is usually satisfied when the same sampling design was used for the treated and the controls, e.g. both are part of the same survey.54 (See Frölich (2007b) for details.)

51 If one attempts to phrase this in treatment evaluation jargon, one would like to measure the direct effect of gender on wages, while holding skills and experience fixed.

52 See e.g. Imbens and Lancaster (1996) or Wooldridge (1999) for a discussion of various sampling schemes.

53 It may also often occur when the source and target samples stem from separate surveys or data sources.

54 In the example mentioned above, the relevant condition is $\Pr(\text{in sample} \,|\, X, D=1) \propto \Pr(\text{in sample} \,|\, X, D=0)$. Notice that this proportionality condition refers to the marginal sampling probability with respect to X only. Even if the sampling weights with respect to Y and X are the same in the source and target population, they may differ with respect to X only, for example if Y is wages, X is education, women are paid less than men and low-wage earners are oversampled.

2.2.4 Propensity score weighting

An alternative estimation strategy to adjust for the differences in the covariate composition between the source and target population relies on weighting the observed outcomes by the inverse of the propensity score. Since values of x where p(x) is large are relatively over-represented among the treated and values of x where p(x) is small are over-represented among the non-treated, we could rectify this by weighting with the propensity score. Again examining the average potential outcome (and abstracting from the common support problem), we can write:
$$E[Y^0|D=1] = \int m_0(x)\, f_{X|D=1}(x)\,dx = \int m_0(x)\, \frac{p(x)}{1-p(x)}\, \frac{P(D=0)}{P(D=1)}\, f_{X|D=0}(x)\,dx,$$

because
$$p(x) = \frac{f_{X|D=1}(x)\, P(D=1)}{f_X(x)}$$
by Bayes' law, and thus
$$\frac{p(x)}{1-p(x)} = \frac{f_{X|D=1}(x)\, P(D=1)}{f_{X|D=0}(x)\, P(D=0)}.$$
This gives
$$E[Y^0|D=1] = \frac{P(D=0)}{P(D=1)} \int m_0(x)\, \frac{p(x)}{1-p(x)}\, f_{X|D=0}(x)\,dx = \frac{P(D=0)}{P(D=1)}\, E\!\left[ m_0(X)\, \frac{p(X)}{1-p(X)} \,\bigg|\, D=0 \right] = \frac{P(D=0)}{P(D=1)}\, E\!\left[ Y\, \frac{p(X)}{1-p(X)} \,\bigg|\, D=0 \right],$$
and a natural estimator would thus be
$$\widehat{E[Y^0|D=1]} = \frac{P(D=0)}{P(D=1)}\, \frac{1}{n_0} \sum_{j:D_j=0} Y_j\, \frac{p_j}{1-p_j}$$
or
$$\widehat{E[Y^0|D=1]} = \frac{1}{n_1} \sum_{j:D_j=0} Y_j\, \frac{p_j}{1-p_j},$$
where we used $\frac{n_0}{n_1}$ as an estimate of $\frac{P(D=0)}{P(D=1)}$. It is usually more efficient to normalize by the sum of the weights, as in
$$\widehat{E[Y^0|D=1]} = \frac{\sum_j Y_j\, (1-D_j)\, \frac{p_j}{1-p_j}}{\sum_j (1-D_j)\, \frac{p_j}{1-p_j}}. \qquad (13)$$

This estimator uses only the observations $Y_j$ in the nonparticipant population ($D_j = 0$) and weights these outcomes by $\frac{p_j}{1-p_j}$. Notice that the second (unnormalized) sum above is divided by $n_1$ and not $n_0$, because the ratio $\frac{P(D=0)}{P(D=1)}$ is estimated by $\frac{n_0}{n_1}$. What are the advantages and disadvantages of this estimator compared to the matching estimators? It has the advantage that it only requires a first-step estimation of p(x) and does not require $m_0(x)$ or $m_1(x)$. Hence, we avoid the time-consuming part of nonparametric estimation of $m_0(x)$ and $m_1(x)$. In small samples, however, there is some Monte Carlo evidence, e.g. in Frölich (2004), that the estimates can have a rather high variance if some propensity scores $p_j$ are close to one. In this case the term $\frac{p_j}{1-p_j}$ can get arbitrarily large and lead to highly variable estimates. In practice, it appears necessary to impose a cap on $\frac{p_j}{1-p_j}$. One could either trim (i.e. delete) those observations or censor them by replacing $\frac{p_j}{1-p_j}$ with $\min\!\left(\frac{p_j}{1-p_j}, \text{upper bound}\right)$. So far no reliable procedure for implementing this in small samples seems to be available. The typical solution to the problem is to remove (or rescale) observations with very large weights and check the sensitivity of the final results with respect to the trimming rules used.
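A minimal sketch of the normalized estimator (13) with an optional cap on the weights (Python/numpy; names and the capping rule are ours and purely illustrative):

import numpy as np

def atet_ipw(Y, D, p_hat, cap=None):
    """Normalized weighting estimator (13) of E[Y^0|D=1] and the implied
    ATET; `cap` optionally censors the control weights p/(1-p)."""
    w = p_hat[D == 0] / (1.0 - p_hat[D == 0])
    if cap is not None:
        w = np.minimum(w, cap)        # censor extreme weights
    ey0_treated = np.sum(w * Y[D == 0]) / np.sum(w)
    return np.mean(Y[D == 1]) - ey0_treated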

The reason for the high variance when $p_j$ is close to one is of course the common support problem. The remedies here are somewhat different from the matching estimator, though. In the matching estimation setup discussed before, if interest is in the ATET we would delete the D = 1 observations with high values of the propensity score. Then we would usually compare the descriptive statistics of the deleted D = 1 observations with the remaining observations to understand the implications of this deletion and to assess the external validity of our findings. If e.g. the deleted observations are low-income individuals compared to the remaining D = 1 observations, we know that our results would mainly hold for the high-income individuals. Imposing some kind of trimming or capping with the weighting estimator also changes the population for which the effect is estimated, but the implications are less obvious. The reason is that only the D = 0 observations are used in (13), whereas the ATET is estimated for the D = 1 population. Hence, if any of the D = 0 observations with large values of $\frac{p_j}{1-p_j}$ are trimmed or censored, we do not observe how this affects the D = 1 population. A simple solution could be to trim (i.e. delete) the D = 0 observations with large values of $\frac{p_j}{1-p_j}$ in (13) and to use the same trimming rule also in the D = 1 sample. We can then compare those D = 1 observations that have been deleted with those D = 1 observations that have not been deleted, to get an understanding of the implications of this trimming rule. (For the ATET, we would then compute $E[Y^1|D=1]$ also only for the trimmed sample.)

Analogously, the ATE is identified as
$$E[Y^1 - Y^0] = E\left[ \frac{YD}{p(X)} - \frac{Y(1-D)}{1-p(X)} \right]$$
and can be estimated as
$$\frac{\sum_{i=1}^n \frac{Y_i D_i}{\hat p(X_i)}}{\sum_{i=1}^n \frac{D_i}{\hat p(X_i)}} \;-\; \frac{\sum_{i=1}^n \frac{Y_i (1-D_i)}{1-\hat p(X_i)}}{\sum_{i=1}^n \frac{1-D_i}{1-\hat p(X_i)}}.$$
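The corresponding normalized sample analogue in code (continuing the sketch above; names are ours):

def ate_ipw(Y, D, p_hat):
    """Normalized inverse-probability-weighting estimator of the ATE,
    mirroring the ratio expression above."""
    w1 = D / p_hat
    w0 = (1 - D) / (1 - p_hat)
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)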

2.2.5 Combination of weighting and regression

We can also write the weighting estimator of the ATET as a weighted average:
$$\widehat{E[Y^1 - Y^0 \,|\, D=1]} = \operatorname{plim}\; \frac{1}{N} \sum_i \frac{D_i - (1-D_i)\, \frac{p(X_i)}{1-p(X_i)}}{\Pr(D=1)}\, Y_i,$$
which gives
$$= \frac{1}{\Pr(D=1)}\, E\left[ E\left[ Y_i \left( D_i - \frac{1-D_i}{1-p(X_i)}\, p(X_i) \right) \bigg|\, X \right] \right] = \frac{1}{\Pr(D=1)}\, E\Big[ p(X)\, E\big[ Y^1 - Y^0 \,\big|\, X \big] \Big] = E\big[ Y^1 - Y^0 \,\big|\, D=1 \big].$$
(The second equality uses $E[YD|X] = p(X)\,E[Y^1|X]$ and $E[Y(1-D)|X] = (1-p(X))\,E[Y^0|X]$, which hold under the conditional independence assumption.) Note that the weights are negative for the D = 0 observations. In addition, the weights have mean zero: $E\left[ D_i - \frac{1-D_i}{1-p(X_i)}\, p(X_i) \right] = 0$.

Alternatively, one can show that the weighting estimator can be written as a simple linear regression: one regresses Y on a constant and D by WLS with weights
$$\omega_i = \sqrt{ D_i + (1-D_i)\, \frac{p(X_i)}{1-p(X_i)} }$$
to obtain the ATET. The weights for the ATE are:55
$$\omega_i = \sqrt{ \frac{D_i}{p(X_i)} + \frac{1-D_i}{1-p(X_i)} }.$$

One can extend this idea to include further covariates in the linear regression model, e.g. to assume a model
$$Y_i = \alpha + X_i'\beta + \delta D_i + U_i \qquad (14)$$
and estimate it by weighted least squares (WLS) using the weights $\omega_i$. This approach has the so-called "double robustness property" in the terminology of Robins-Ritov: it is consistent if either the parametric specification of the propensity score is correct or the specification of the outcome equation (14) is correct. Hence, if one of these two is correctly specified, it does not matter whether the other specification is correct. First, if (14) is correctly specified, $\delta$ will be estimated consistently by WLS whatever (positive) weights, i.e. whatever specification of p(x), are used. Second, if the propensity score specification is correct, the weights balance the distribution of X between the treated and the re-weighted control observations, so that the WLS coefficient on D remains consistent for the treatment effect even if the outcome equation (14) is misspecified.

55 As an exercise, show that
$$e_2' \begin{pmatrix} \sum \omega_i^2 & \sum \omega_i^2 D_i \\ \sum \omega_i^2 D_i & \sum \omega_i^2 D_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum \omega_i^2 Y_i \\ \sum \omega_i^2 D_i Y_i \end{pmatrix} = \frac{\sum_{i=1}^n \frac{Y_i D_i}{\hat p(X_i)}}{\sum_{i=1}^n \frac{D_i}{\hat p(X_i)}} - \frac{\sum_{i=1}^n \frac{Y_i (1-D_i)}{1-\hat p(X_i)}}{\sum_{i=1}^n \frac{1-D_i}{1-\hat p(X_i)}},$$
where $e_2' = (0, 1)$.
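A minimal sketch of this weighted regression with the square-root ATET weights from above (Python/numpy; names are ours; p_hat is a first-step propensity score estimate). The returned coefficient on D is the ATET estimate in the doubly robust sense just described:

import numpy as np

def atet_weighted_ols(Y, D, X, p_hat):
    """WLS of Y on (1, X, D) with the ATET weights
    w_i = sqrt(D + (1-D) p/(1-p)); returns the coefficient on D."""
    w = np.sqrt(D + (1 - D) * p_hat / (1 - p_hat))
    Z = np.column_stack([np.ones_like(Y), X, D])
    beta = np.linalg.lstsq(Z * w[:, None], Y * w, rcond=None)[0]
    return beta[-1]                  # coefficient on D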

2.2.6 Asymptotic properties of treatment effect estimators

The asymptotic properties of the generalized matching estimator and the re-weighting estimator have been studied in the binary treatment framework by Hahn (1998), Heckman, Ichimura, and Todd (1998), Hirano, Imbens, and Ridder (2003) and Abadie and Imbens (2006), among others, under the conditional independence assumption (2). Hahn (1998) derived the semiparametric variance bounds for $\sqrt{n}$-consistent nonparametric estimation of the average treatment effect and the average treatment effect on the treated.

For the estimation of average potential outcomes or average treatment effects it can be shown that propensity score matching is inefficient compared to matching on X.

Heckman, Ichimura, and Todd (1998) and Imbens (2004) contain results on estimating asymptotic variances for matching estimators. Heckman, Ichimura, and Todd (1998) analyze propensity score matching with an estimated propensity score and show consistency under certain regularity assumptions, which are usually satisfied if a parametric estimator is used. This first-step estimation usually has an impact on the asymptotic variance matrix of the estimator. Imbens (2004) also analyzes estimation of the variance.

It is now common to estimate variances via bootstrapping, due to convenience and sometimes improved small-sample results. It seems that at the moment it is not yet known whether the bootstrap is consistent, though. There are some recent results by Abadie and Imbens (2005) showing that bootstrapping does not work for the simple pair-matching estimator (i.e. where first-nearest-neighbour regression is used). It is suspected that bootstrapping works when using a kernel regression estimator (with bandwidth converging to zero), but I have not seen a formal proof.

2.2.7 Testing the validity of the conditional independence assumption

As mentioned above several times, the conditional mean independence assumption is a minimal identifying assumption and cannot be validated from the data. Its assertion must be based on economic theory, institutional knowledge and beliefs. It can only be tested if one imposes additional assumptions.56 An indirect test of the CIA is to examine whether we would estimate a zero treatment effect when comparing subpopulations that we know are not affected by the treatment. For example, access to many social programmes depends on certain eligibility criteria. This leads to three groups: ineligibles, eligible nonparticipants and (eligible) participants. The first two groups are nonparticipants, and their $Y^0$ outcome is thus observed. Usually both groups have different distributions of X characteristics. If one strengthens the conditional independence assumption to
$$Y^0 \perp\!\!\!\perp T \,|\, X,$$
where $T \in \{\text{ineligibles, eligible nonparticipants, participants}\}$, a testable implication is that
$$Y \perp\!\!\!\perp T \,|\, X, \qquad T \in \{\text{ineligibles, eligible nonparticipants}\}.$$
The (average) outcome Y among the ineligibles and among the eligible nonparticipants should thus be identical, after adjusting for differences in the distribution of X.
A more interesting application of this pseudo-treatment approach is often used when information on the outcome variable before the treatment happened is available, usually in the form of panel data. One can then examine differences between participants and nonparticipants before the treatment actually happened (and before the participants knew about their participation status, as this might have generated anticipation effects). Since the treatment has not yet happened, there should be no (statistically significant) differences in the outcomes between the subpopulation that later takes treatment and the subpopulation that later does not take treatment, after controlling for some variables X. This is known as the pre-programme test or "pseudo-treatment" test. In other words, we test whether the effect of a pseudo-treatment is indeed estimated to be zero by our estimator.
Suppose that longitudinal data on participants and non-participants is available, e.g. in an adult literacy programme, which starts at time 0 and where we measure the outcome at time 1. Before time 0, all individuals are in the non-treatment state. We assume that there are no anticipation effects, such that Y^0_{t=0} = Y_{t=0}. Assume that conditional independence holds at time t = 1:
Y^0_{t=1} ⊥⊥ D_{t=0} | X, Y^0_{t=0}, Y^0_{t=-1}, Y^0_{t=-2}, ...,   (15)

where X here contains time-invariant characteristics and where we also condition on lagged outcome values, as these are often important determinants of the programme participation decision. Now it might often be reasonable to assume that the potential outcomes are correlated over time, such that any unobserved differences might also be captured by the control variables in earlier time periods. We could therefore assume conditional independence to hold also in previous periods, e.g.57
Y^0_{t=0} ⊥⊥ D_{t=0} | X, Y^0_{t=-1}, Y^0_{t=-2}, Y^0_{t=-3}, ....   (16)

In contrast to expression (15), this assumption is testable, because at time -1 we observe the non-treatment outcome Y^0 both for those with D_{t=0} = 0 and for those with D_{t=0} = 1, i.e. those who will later participate in treatment. This is because treatment has not yet started (at time -1). Assumption (15) is untestable because at time 1 the outcome Y^0_{t=1} for those with D_{t=0} = 1 is counterfactual, i.e. could never be observed because these individuals received treatment and therefore only Y^1_{t=1} can be observed. Hence, if we are willing to assume that the CIA also holds in previous periods, we can estimate the treatment effect in those previous periods and test whether it is zero. If it is statistically significantly different from zero, this might be some indication that the two groups, participants and non-participants, were already different in their unobservable characteristics even before the treatment and might therefore still be after the treatment. To be able to use this test, we need (at least one) additional lagged time period that is not needed as a control variable in (15). I.e. if we assume that conditional independence holds if we condition on X and two lags of Y, then we need to be able to observe Y_{t=-2} to test (15).
To implement this test it is useful to think of it as if some "pseudo-treatment" had happened at time zero or earlier. Hence, we retain the observed indicator D_{t=0} as defining the participants and non-participants and pretend that the treatment had started already at time -1. Since we know that actually no treatment had happened at that time, we would expect the treatment effect to be zero, and any (statistically significant) non-zero estimate would be an indication of relevant differences in unobservables.
If we were to find significant pseudo-programme impacts, we might consider this as an estimate of the bias due to unobservables and might perhaps also be willing to assume that this bias due to unobservables is constant over time, such that we could subtract it from the final estimate. This is the basic idea of difference-in-differences estimators to be discussed later and is also essentially the intuition behind fixed effects estimators for panel data models.
57 Imbens (2004) refers to this as assuming stationarity and exchangeability.
The basic assumption therefore is that the average bias is identical over time, or in other words, that both groups would follow the "same trends" if treatment had not happened. If multiple previous time periods are available, a more complex modelling of this trend is possible.
Nevertheless, it is worthwhile noting again that this pre-programme test only works if we have a sufficient number of lags of outcomes available to test (15). We need more lags to use (16) than we need for (15). When we reject the test, it could still be that (15) would have been valid had we included more lags there.

2.2.8 Matching with non-binary D

In principle, we could extend the above ideas to the situation where D is non-binary. If D is discrete with only a few mass points, we could estimate m_d(x) separately in each sub-population, i.e. for each value of d in Supp(D). If D takes on many different values, e.g. being continuous or multivariate, the estimator of m_d(x) also has to smooth with respect to D. Hence, we could still use

Ê[Y^0] = (1/n) Σ_{i=1}^{n} m̂_0(X_i),

where m̂_d(x) is now a nonparametric regression estimator of E[Y | X = x, D = d] which smooths over X and D. When D was binary, we estimated m_0(x) by using only the nonparticipant (D = 0) observations. However, when D is continuously distributed, the probability that any observation with D = d is observed is zero, and we therefore have to rely also on observations with D ≠ d to estimate m_d(x). Therefore, we might use all observations but assign a larger weight to those observations where D_j is close to d.
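As an illustration, a minimal sketch of such an estimator (Python; a scalar X and ad hoc bandwidth values are assumptions, and a Nadaraya-Watson estimator is used as one possible smoother over X and D):

import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2)

def m_hat(x, d, X, D, Y, hx=1.0, hd=1.0):
    # Kernel weights: observations with D_j close to d and X_j close to x
    # receive larger weight; all observations contribute.
    w = gauss((X - x) / hx) * gauss((D - d) / hd)
    return np.sum(w * Y) / np.sum(w)

def mean_potential_outcome(d, X, D, Y):
    # Average m_hat_d(X_i) over the empirical distribution of X.
    return np.mean([m_hat(xi, d, X, D, Y) for xi in X])

In practice the bandwidths hx and hd would be chosen by cross-validation.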
For D continuous, however, a propensity score matching approach would be more difficult to implement (since Pr(D = d | X = x) = 0) and we usually have to rely on higher-dimensional nonparametric regression. So far there seem to be only very few applications of these methods, and it is not yet well known how they fare in practice and how reliable their approaches to inference are.

2.2.9 OLS and linear regression models

How are the above results on matching estimators related to conventional OLS estimation of linear models? One could think of obtaining the OLS estimator by taking the local linear regression estimator and letting its bandwidth increase to infinity. With infinite bandwidth values, the kernel weights are constant and the local linear hyperplane becomes global. Hence, m̂_0(x) = α̂_0 + x β̂_0, where α̂_0, β̂_0 are the coefficients obtained from the nonparticipant sample. The average potential outcome is thus

Ê[Y^0] = α̂_0 + X̄ β̂_0,

where X̄ are the average characteristics in the population. The ATE is correspondingly

Ê[Y^1 − Y^0] = α̂_1 − α̂_0 + X̄ (β̂_1 − β̂_0).
0

If, as discussed in the previous subsection, smoothing is performed with respect to X and D, the local linear regression estimator with infinite bandwidth values would deliver m̂_d(x) = α̂ + x β̂ + d γ̂, where α̂, β̂, γ̂ are now obtained from the entire sample. The estimate of the average potential outcome would then be

Ê[Y^d] = α̂ + X̄ β̂ + d γ̂,

which corresponds to the results when using OLS in a linear model.


Thus, whereas pair-matching is the most localized estimator, relying on only one observation in the neighbourhood, least squares matching corresponds to local linear matching with a bandwidth value of infinity. With a bandwidth not converging to zero, the OLS estimator would generally be inconsistent unless the conditional expectation functions are truly linear. Wooldridge (2002, Ch. 18) discusses a number of examples where OLS can still be a consistent estimator of average treatment effects.
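A minimal sketch of this infinite-bandwidth (regression-adjustment) estimator of the ATE (Python; X is an n × k covariate matrix, D a binary indicator, Y the outcome):

import numpy as np

def ols_fit(X, Y):
    # Intercept and slopes from a global least-squares fit.
    Z = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef[0], coef[1:]

def ate_ols(X, D, Y):
    # Separate linear fits per treatment arm, evaluated at the mean of X.
    a1, b1 = ols_fit(X[D == 1], Y[D == 1])
    a0, b0 = ols_fit(X[D == 0], Y[D == 0])
    xbar = X.mean(axis=0)
    return (a1 - a0) + xbar @ (b1 - b0)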

To conclude, the conventional approach starts by assuming a linear model

Y_i = α + X_i β + D_i γ + U_i

and assumes that

E[U_i | D_i, X_i] = 0.

The OLS estimates thus correspond to the results with local linear matching and infinite bandwidth values. This relies on the correctness of the functional form assumption. In addition, the assumption E[U_i | D_i, X_i] = 0 is stronger than in the matching approach, where the conditional independence assumption (2) can be interpreted as assuming

E[U_i | D_i, X_i] = E[U_i | X_i],

which is permitted to be different from zero. For estimating the linear model by OLS we need that U is uncorrelated with D and with X (or at least with those X that D is correlated with). For nonparametric identification and estimation, this assumption is not needed: U can be allowed to be correlated with X, i.e. X could be endogenous, see e.g. Frölich (2006b).

2.3 Multiple treatment evaluation

Lechner (2001) and Imbens (2000) extended the previous approach of propensity score matching to the case of multiple treatments, where D can take values in {0, 1, ..., M}. These M+1 different treatments are defined as mutually exclusive, i.e. each individual will receive exactly one of these treatments. The average treatment effect between two different treatments m and l would thus be

E[Y^m − Y^l]

and the ATET correspondingly

E[Y^m − Y^l | D = m].

The conditional independence assumption is now

Y^d ⊥⊥ D | X   ∀ d ∈ {0, 1, ..., M},

and the common support assumption is that

Pr(D = d | X) > 0   a.s. ∀ d ∈ {0, 1, ..., M}.

A quite useful result is that a dimension-reducing balancing property is available also for this case. Define the probabilities

p_l(x) = Pr(D = l | X = x)

and

p_{l|ml}(x) = Pr(D = l | X = x, D ∈ {m, l}) = p_l(x) / (p_l(x) + p_m(x)).

Then it is easy to show that

E[Y^m] = ∫ E[Y | D = m, p_m] dF_{p_m}

and that

E[Y^m | D = l] = ∫ E[Y | D = m, p_{m|ml}] dF_{p_{m|ml} | D = l}.

The latter result is obtained by showing that the conditional independence also implies

Y^d ⊥⊥ D | p_{m|ml}, D ∈ {m, l}.

Instead of conditioning on p_{m|ml} it is also possible to jointly condition on p_m and p_l, because p_{m|ml} is a function of these. Hence, we also have

Y^d ⊥⊥ D | p_m, p_l, D ∈ {m, l}.

These various results now suggest different estimation strategies via propensity score matching. If one is interested in all pairwise treatment effects, one could estimate a parametric discrete choice model such as a multinomial probit (MNP) (or a multinomial logit if the different treatment categories are very distinct),58 which delivers consistent estimates of the marginal probabilities p_l(x) for all treatment categories. This requires the specification of only one parametric model for the discrete choice.
On the other hand, if computation time for the MNP is too demanding, an alternative is to estimate all the M(M+1)/2 propensity scores p_{m|ml} by using binary probits for all pairwise comparisons separately. From a modelling perspective, the MNP model might be preferred because if the model is correct, all marginal and conditional probabilities are consistently estimated. The estimation of pairwise probits, on the other hand, does not seem to be consistent with any well-known discrete choice model.59
58 The multinomial logit model is based on much stronger assumptions than the multinomial probit model. A well-known implication is the Independence of Irrelevant Alternatives (IIA) assumption, which is usually not plausible if some of the choice options are more similar than others. A nested logit approach might be an alternative, e.g. if the first decision is whether to attend training or not and the exact type of training is determined only as a second decision. This, however, requires a previous grouping of the categories. MNP is therefore a more flexible approach if computational power permits its use.
59 I.e. the usual discrete choice model would assume that all choices made and the corresponding characteristics X have to be taken into account for estimation. A pairwise probit of m versus l, and one for l versus k etc. would not be consistent with this model.

On the other hand, specification tests and verification of balancing are often easier to perform with binary probits, in order to obtain a well-fitting specification. Using separate binary probits also has the advantage that misspecification of one of the binary probit models does not imply that all propensity scores are misspecified, as would be the case with an MNP model. Gerfin and Lechner (2002), Lechner (2002a) and Gerfin, Lechner, and Steiger (2005) compare these various methods and find relatively little difference in their performance. Overall, estimating separate binary probit models seems to be a flexible and convenient approach.
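A minimal sketch of both strategies (Python; the data are hypothetical, and a multinomial logit from statsmodels serves as a computationally convenient stand-in for the MNP):

import numpy as np
import statsmodels.api as sm

def marginal_scores_mnl(X, D):
    # One joint model for all marginal probabilities p_l(x):
    # returns an (n, M+1) matrix of predicted probabilities.
    res = sm.MNLogit(D, sm.add_constant(X)).fit(disp=0)
    return res.predict(sm.add_constant(X))

def pairwise_score_probit(X, D, m, l):
    # Conditional probability Pr(D = m | X, D in {m, l}) from a binary
    # probit estimated on the subsample with D in {m, l}.
    mask = np.isin(D, [m, l])
    res = sm.Probit((D[mask] == m).astype(int),
                    sm.add_constant(X[mask])).fit(disp=0)
    return res.predict(sm.add_constant(X[mask]))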
Whichever way one estimates the propensity scores, one should define the common support with respect to all the propensity scores. It would suffice for the estimation of E[Y^m − Y^l | D = m] to examine only p_{m|ml} for the support region. But the interpretation of various effects such as E[Y^m − Y^l | D = m] and E[Y^m − Y^k | D = m] would be more difficult if they were defined for different subpopulations due to the common support restriction. A comparison of the estimates could then not disentangle differences coming from different supports and differences coming from different effects.
One (relatively strict) way to implement a joint common support is to delete all observations for which at least one of the estimated probabilities is larger than the smallest maximum or smaller than the largest minimum of that probability across all subsamples defined by D. For an individual who satisfies this restriction, we can thus be sure to find at least one comparison observation in each subgroup defined by D.
For some applications and discussion of practical matters see e.g. Gerfin and Lechner (2002), Lechner (2002a) or Gerfin, Lechner, and Steiger (2005).

3 Dynamic treatment evaluation

3.1 Shortcomings of the static potential outcomes model

In the preceding section, time was largely ignored. We examined the impact of a treatment D on an outcome variable Y. The treatment starts at some point in time τ_0 and the outcome variable can be measured some time τ_0 + t later. We usually would control for variables X_{i,τ_0} measured at or until time τ_0, e.g. employment and earnings histories.

In the evaluation of labour market programmes, we could think of Y^d_{τ_0+t} as the employment status at some point in time. Alternatively, we could pursue a duration or hazard model perspective and examine the outflow out of unemployment, i.e. the hazard rate e.g. into employment. Here the expected value of Y^d_{τ_0+t} would then be the survivor probability. In the latter case, we would measure only the impact on outflows, but would not consider the impact on repeated unemployment.

While in several settings the framework of the previous section may well apply, in others a more careful treatment of time and dynamic treatment allocation needs to be taken into account. To mention just a few issues that come up in the evaluation of ALMP:
1) The time τ_0 when a programme starts might itself be related to unobserved characteristics of the unemployed person. Therefore, τ_0 might often itself be an important control variable. Here τ_0 might reflect calendar time (e.g. seasonal effects) as well as process time (e.g. current unemployment duration).
2) The time τ_1 when participation in a programme ends might often already be an outcome of the programme itself. A person who finds employment while being in a training programme would naturally have a shorter programme duration τ_1 − τ_0 than a person who did not find a job during that period. Therefore, it is often advised to measure the impact from the beginning of the programme τ_0 and not from the end of the programme τ_1. Nevertheless, one might often be interested in the effects after the end of the programme or particularly in the effects of the length τ_1 − τ_0 of the programme. A possible shortcut is to use the length of intended programme duration as a measure that may be more likely to be exogenous, conditional on X_{τ_0}. Otherwise, a more explicit structural modelling may often be necessary, as discussed further below.
3) As discussed in (1) above, it may often be important to include the time τ_0 as a variable to control for. Often there is a treatment "non-participation" for which there is no natural starting date. As an imperfect solution, one may simulate potential start dates for these non-participants as in Lechner (1999), Lechner (2002a), Lechner (2002b). An alternative is to shorten the length of the treatment definition window to make participants and non-participants more alike.

The treatment definition window is the time period that is used to define the treatment status in a static model. At the beginning of this window, eligibility for the treatment is defined, and thus the risk set of observations who could start treatment with probability between zero and one is determined. At the end of the window it is defined who started treatment during this period.

E.g. consider unemployed persons registered on a certain day. A treatment definition window of length one day would define an individual as treated if a programme started on the first day of unemployment. Everyone else would be defined as non-treated. Similarly, a treatment definition window of length one day applied to day 29 would define as treated everyone who starts treatment on day 29 and as untreated everyone who does not, using only the individuals who are still registered unemployed on day 29. Only those would usually be eligible and thus in the risk set for potentially being assigned to a programme. For such a short treatment definition window, there would be only very few treated observations and estimation might be very imprecise. In addition, the treatment effects are likely to be very small and may not be of main interest: they would measure the effect of starting a programme today versus "not today but perhaps tomorrow". Many of the non-treated might actually receive treatment a few days later. (This is similar to substitution bias in an experimental setting.)
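A minimal sketch of how treatment status under such a window could be coded (Python; spell_end and prog_start are hypothetical arrays of days, with np.inf where the event never occurs):

import numpy as np

def treatment_status(spell_end, prog_start, s, w):
    # Risk set: still unemployed and not yet treated at day s.
    at_risk = (spell_end >= s) & (prog_start >= s)
    # Treated: programme starts within the window [s, s + w).
    treated = at_risk & (prog_start < s + w)
    # Untreated: no programme start within the window; note that some of
    # these may start a programme shortly after the window (cross-over).
    untreated = at_risk & (prog_start >= s + w)
    return treated, untreated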

A treatment definition window of one year would define as treated who started a programme in the first year and as untreated who did not enter a programme during the entire first year. This might be the treatment effect of most interest. The problem for identification, however, is that the definition is 'conditioning on the future', using the language of Fredriksson and Johansson (2003). From the onset of the treatment definition window, one could imagine two competing processes: being sent to a programme and finding a job. Even for two persons exactly identical in all their characteristics, it may happen by chance that the first person finds a job after eight weeks whereas the other person would have found a job after ten weeks but was sent to a programme already after nine weeks. In this case, the first person would be defined as nontreated, whereas the second would be defined as treated. In the extreme case of a very long treatment definition window, all the nontreated would be those who found a job before the programme started, whereas all the treated are those who would have found a job at some time but the programme started before. Clearly, such a situation is likely to lead to biased estimates, in favour of the so-defined nontreated. This bias is likely to be most severe if every unemployed person eventually has to participate in some programme, if not finding a job before. If not, there will also be other (unobserved) reasons why individuals did not get treated (even if not finding a job) and the direction of the bias will be less clear.60

60 @@@Markus: ask Per for the current version and read it.

To overcome this problem, one may simply shorten the length of the treatment definition window. But this is likely to introduce again the problem that many of those defined as nontreated may actually have been treated shortly thereafter (as discussed above in the extreme case of a one-day treatment definition window). One solution is to analyse the sensitivity of the final estimates to alternative definitions of this window. If the length of the window is shortened, bias due to conditioning on the future should decrease (and variance increase). At the same time, however, cross-over, in the sense that many nontreated may shortly thereafter have become treated, will increase. If the available data permit, one can measure how many people are affected. For the interpretation of the estimated effects, one should therefore always examine which share of the nontreated actually received treatment in the period thereafter.61
61 See e.g. Lechner, Miquel, and Wunsch (2004).
An alternative is to return to treatment definition windows of a very short length (e.g. one day, week or month) and attempt to aggregate the effects over time, as e.g. in Sianesi (2004). This approach seems particularly natural if the outcome variable Y is the survivor probability, e.g. in single-spell unemployment data, as in Fredriksson and Johansson (2003). The estimated effects however refer only to "treatment not today but perhaps tomorrow", which may not be of most policy interest.62
62 In certain situations, this may however indeed be the effect of most interest, as in Frölich (2007d).

The discrete time causal model outlined in the next section does permit a more detailed definition of effects of interest. As it relies on discrete time periods, it will always contain some 'conditioning on the future' element, which nevertheless can be ameliorated if the data permit. Compared to continuous time models, no restrictions on treatment effect heterogeneity need to be imposed.

4) Sequences of programmes: In many applications we might be interested in the effects of sequences of programmes, e.g. first programme 1 and then programme 2 versus the reverse order. Since the fact that a second treatment was applied may often already be an outcome of the (unsuccessful) first programme, disentangling such effects is very difficult in a static model, and the general rule was to examine only the effects of the first programme (measured from the beginning of the first programme) and subsume the second programme as an outcome of the first. To examine the effects of sequences, we need a more elaborate model setup. From this discussion we already notice that intermediate outcome variables, i.e. Y_t for some values of t, might be important variables that affect the sequence of treatment choices. As discussed above, a general rule from the static model is that one should usually never control for variables already affected by the treatment. We will see below that some type of controlling for these variables is important here, though.

3.2 Dynamic potential outcomes model

Introducing a time dimension into the evaluation framework can be done in two ways: by considering treatments as sequences of treatments over a number of discrete time periods, i.e. periods of finite length (e.g. 2 months), or by considering time as continuous. We will first examine a modelling framework for discrete time periods, which permits a wide range of possible treatment sequences, different starting dates, different treatment durations etc. This model may often be directly applicable if treatments can start only at certain fixed points in time, e.g. quarterly, or when data are observed only for discrete time periods. When treatments can start, more or less, in continuous time, this model may nevertheless have several advantages over an explicit incorporation of continuous time in that it does not impose any restrictions on treatment effect heterogeneity. Time is partitioned into discrete periods where different sequences of treatments can be chosen. Because the treatment effects are not restricted across treatment sequences, the model cannot be directly extended to continuous time, as there would be an infinite number of treatment sequences. For continuous time, more restrictions will be required, as will be discussed later. (In applications, time will always be discrete, but the important points are how many observations are observed entering treatment in a particular time period and how many different treatment sequences can be examined. In the unemployment example, one day would certainly be considered as continuous time, whereas one month or one quarter would often fit a discrete time model.)
A flexible multiperiod extension of the potential outcomes model has been developed by several authors, including Robins and co-authors in several articles. Here we focus on the exposition and extensions of Lechner and Miquel (2001), its revised version in Lechner and Miquel (2005), and Lechner (2008a), which seem to be closest in spirit to empirical applications in economics.63 Identification is based on sequential conditional independence assumptions or sequential 'selection on observables'. As we will see, for identification it is often important to be able to observe intermediate outcome variables. Such information may be available e.g. in administrative data of unemployment registers, but in many other applications it might not be, and it would often be important to collect such data as part of the evaluation strategy.
63 See e.g. Murphy (2003) on targeting of treatment sequences.

To introduce the basic ideas, suppose that there are τ time periods and that in each time period either treatment 0 or 1 can be chosen. From this setup, the extensions to many time periods and multiple treatments will be obvious. The outcome variable is measured at some time t later. In addition, there is an initial period where information on covariates is available before any treatment starts, i.e. a time period zero where none of the treatments of interest has already started and where we can measure potentially confounding variables before treatment. E.g. in the application to labour market programmes, at the beginning of the spell every person is unemployed, and we have some information measured at that time about the person and previous employment histories.
Let D_τ ∈ {0, 1} be the treatment chosen in period τ, and let D̄_τ = (D_1, ..., D_τ) be the sequence of treatments until time τ, with d̄_τ being a particular realization of this random variable. The possible realizations of D̄_1 are {0} and {1}. The possible realizations of D̄_2 are {0,0}, {1,0}, {0,1} and {1,1}. The possible realizations of D̄_3 are thus 000, 001, 010, 011, 100, 101, 110, 111. We thus define potential outcomes as

Y_T^{d̄_τ},

which is the outcome that would be observed at some time T if the particular sequence d̄_τ had been chosen. (We use the letters t and τ in the following to refer to treatment sequences and T to the time when the outcome is measured.) Hence, with two treatment periods we distinguish between

Y_T^{d̄_1} and Y_T^{d̄_2}.

The observed outcome Y_T is the one that corresponds to the sequence actually chosen. To be specific about the time when we measure these variables, we will assume that treatment starts at the beginning of a period whereas the outcome Y (and also other covariates X introduced later) are measured at the end of a period. We thus obtain the observation rule:

Y_1 = D_1 Y_1^1 + (1 − D_1) Y_1^0
Y_2 = D_1 Y_2^1 + (1 − D_1) Y_2^0
Y_2 = D_1 D_2 Y_2^{11} + (1 − D_1) D_2 Y_2^{01} + D_1 (1 − D_2) Y_2^{10} + (1 − D_1)(1 − D_2) Y_2^{00}.
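A minimal sketch of this observation rule for the two-period case (Python; the potential-outcome arrays are of course only partially observable in any real dataset, so the mapping is purely illustrative):

import numpy as np

def observed_y2(d1, d2, y11, y01, y10, y00):
    # Mirrors the last line of the observation rule above.
    return (d1 * d2 * y11 + (1 - d1) * d2 * y01
            + d1 * (1 - d2) * y10 + (1 - d1) * (1 - d2) * y00)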



To be clear about the difference between Y_2^{11} and Y_2^{1}: the potential outcome Y_2^{11} is the outcome that a particular individual i would have realized at the end of the second period if by external intervention this person was sent to the sequence {1,1}. The potential outcome Y_2^{1} is the outcome that this individual i would have realized at the end of the second period if by external intervention this person was sent first to programme 1 and thereafter chose as second treatment whatever this person was about to choose. I.e. the first period is set by external intervention whereas the treatment in the second period follows the selection process of the individual or the caseworker etc. This means

Y^1 = D_2 Y^{11} + (1 − D_2) Y^{10}.   (17)

The observed outcome Y_2 corresponds to the outcome if the person herself selected the first and the second programme.

As another example, we might be interested in the effects of school inputs on cognitive development. One could consider education as a sequence of school years. However, there is usually rather limited variation from one year to the next, such that many of the sequences may be hard to identify. A more interesting approach would be to consider education as a sequence of kindergarten, primary education, lower secondary education, upper secondary education and tertiary education. Along this sequence quite a number of different input sequences could be considered, e.g. private versus public school, small versus large classes, traditional schooling versus strong emphasis on education in foreign languages, low versus high teacher salary, etc. Several interesting research questions then arise. Are investments into schooling complementary or substitutive? Do early investments into pre-school increase the returns to further schooling or reduce the returns (i.e. diminishing marginal returns)?
If a fixed budget is available, at which stage should it be invested most? One could compare a sequence with high expenditures at the beginning (e.g. small classes) and lower expenditures later on versus the opposite sequencing.
The observed schooling sequence clearly evolves endogenously, and the decisions about the next step almost certainly depend on the success in the previous steps. The strategy outlined below is based on sequential conditional independence assumptions, which require data on the test scores or grades in the previous periods, which we denote as covariates X. For example, the type of secondary school a child attends clearly depends on the schooling outcomes (grades, test scores) at the end of primary school, and without observing these grades or scores, identification would be very difficult.

Before turning to such issues, we first return to the example of labour market programme sequences and examine some exemplary sequences:

1. Starting times of treatment: For some individuals a programme starts later, for others earlier in an unemployment spell. For those who never participate in a training programme, the "starting date" would be undefined. Suppose we define time in quarters. We could then compare sequences such as 001 to 0001, i.e. compare a programme start in the third quarter of unemployment versus the fourth quarter of unemployment. Or the sequence 1 versus 001, to examine the effect of starting the programme in the first days of unemployment versus only after half a year. We could also compare the sequences 1 to 000, where the latter group refers to not receiving any treatment during the first nine months. Another option is comparing a sequence 00 to 01, which is the effect of receiving treatment in the second quarter versus not receiving it in the second quarter but perhaps immediately after.64 When examining the effects of waiting, we might also require some minimum programme duration, e.g. compare 11 to 0011 or 111 to 00111. Note that for several of these comparisons the basic message of Fredriksson and Johansson (2003), that 'conditioning on the future' may introduce a bias, will still apply in principle. To mitigate the extent of such biases (whose direction can often be conjectured in empirical applications), the length of the time periods should be kept short. If the amount of data available permits, one may want to use months rather than quarters when defining the sequences. If the amount of data is limited, one could examine various alternative definitions of the length of the time window (months, quarters) and compare the estimation results. For the shorter time periods, the results should be less biased, but may be more variable. In particular, the longer the treatment sequences specified, the fewer observations will be in the data that have exactly followed this sequence.
64 This last example is used in Sianesi (2004), Fredriksson and Johansson (2003) and applied to targeting in Frölich (2007d).

2. Treatment durations: To examine the effects of different durations of treatment, we could compare the sequences 010 to 011, for example. We already referred to the potential endogeneity of the treatment duration in evaluation studies if subjects can drop out during the treatment. If the treatment is unpleasant or sends a certain signal, individuals will seek to leave the programme while it is under way. This attrition is, however, already an effect of the programme (for some programmes even the main intended effect). One way to circumvent this problem in the static model is to consider the effects of planned durations only, e.g. in Lechner, Miquel, and Wunsch (2004). In some situations this may be the effect of most interest. In other situations the effect of realized durations may be more interesting, though. We might thus be interested in comparing 1 versus 11, or alternatively also in 10 versus 110. Whereas the former comparison refers to treatments with a minimum duration of at least one or two periods, the latter comparison refers to treatments with a duration of exactly one or two periods.

3. Sequences of treatments: We might be interested in various sequences of treatments, e.g. 010001 versus 0101. Particularly when we extend the previous setup to allow for several treatment options in each period, e.g. {0, 1, 2, 3} for, say, no assistance, job search assistance, training and employment programmes, it may be interesting to compare a sequence 123 to 132 or 101 or 1001. Should one start with training or with an employment programme? If one programme has been completed, should one start with the next one or leave some time in between for individuals to focus on their own job search activities? The application of the static model, as covered in the previous section, breaks down when selection into the second and any subsequent programmes is influenced by the outcome of the previous programmes. Then these intermediate outcomes have to be included to control for selection.

Hence, a large number of sequences could be interesting.65 However, when specifying such sequences, one should keep in mind that the longer the treatment sequences specified, the fewer observations will be in the data that have exactly followed this sequence. Hence, one could very quickly run into small sample problems, even with a dataset of several thousand observations. An additional complication will arise when comparing two rather different sequences, e.g. 1110 to 00001110. It is quite likely that those individuals who followed a very specific sequence such as 00001110 may be relatively homogeneous in their X characteristics. If also the participants in 1110 are relatively homogeneous, then the common support between these two participant groups might be relatively small, and after deleting observations out of common support, the treatment effect between 00001110 and 1110 might thus depend on only a very specific subpopulation, which reduces external validity. Another concern with very long sequences is that we have to include the vector of covariates X_0, X_1, ..., up to X_{τ−1} for identifying the effect of a sequence d̄_τ, which may contain quite a high number of variables when τ is large. (Even when we use parametric propensity score matching as discussed later.) When the number of covariates becomes too large, one may perhaps only include, say, four lags X_{t−1}, X_{t−2}, X_{t−3}, X_{t−4}, as they may pick up most of the information contained in the past X̄.
65 For more examples see also Lechner (2008a).

As a further component of the model, we will often also include additional covariates X_t, which we permit to be time-varying, and we denote by X̄_t the collection of X_t variables up to time period t. The X̄_t may also include the outcome variables up to Y_t. Hence, we permit that the variables X_t are already causally influenced by the treatments, and we could also define potential values X_t^{d̄} for these. Remember that we observe X_t at the end of a period. Hence, at the beginning of a period τ, the values of X_t up to τ − 1 are observed.
With this setup, we can define a large number of different average treatment effects. Let d̄'_{τ'}, d̄''_{τ''} and d̄'''_{τ'''} be three sequences and define the treatment effect

θ_t^{d̄'_{τ'}, d̄''_{τ''}}(d̄'''_{τ'''}) = E[Y_T^{d̄'_{τ'}} − Y_T^{d̄''_{τ''}} | D̄_{τ'''} = d̄'''_{τ'''}]   for τ''' ≤ τ', τ'',

which is the treatment effect between the sequences d̄'_{τ'} and d̄''_{τ''} for the subpopulation that is observed to have taken sequence d̄'''_{τ'''}. Note that the three sequences d̄'_{τ'}, d̄''_{τ''} and d̄'''_{τ'''} can differ in their length and in the exact choice of the treatments. Hence, we could be comparing two sequences of the same length, e.g. {0,1} versus {1,0}, as well as sequences of different lengths, e.g. {0,1} versus {1}. The latter example corresponds to the effect of a delayed treatment start, i.e. the treatment starting in period 2 versus period 1. The sequence d̄'''_{τ'''} defines the subgroup for which the effect is defined. If τ''' = 0, this gives the dynamic average treatment effect (DATE)

θ_t^{d̄'_{τ'}, d̄''_{τ''}} = E[Y_T^{d̄'_{τ'}} − Y_T^{d̄''_{τ''}}],

whereas the dynamic average treatment effect on the treated (DATET) would be obtained when d̄'''_{τ'''} = d̄'_{τ'}:

θ_t^{d̄'_{τ'}, d̄''_{τ''}}(d̄'_{τ'}) = E[Y_T^{d̄'_{τ'}} − Y_T^{d̄''_{τ''}} | D̄_{τ'} = d̄'_{τ'}],

and the dynamic average treatment effect on the nontreated (DATEN) would be obtained when d̄'''_{τ'''} = d̄''_{τ''}:

θ_t^{d̄'_{τ'}, d̄''_{τ''}}(d̄''_{τ''}) = E[Y_T^{d̄'_{τ'}} − Y_T^{d̄''_{τ''}} | D̄_{τ''} = d̄''_{τ''}].

Without any restrictions on effect heterogeneity, these effects could be very different.
We also suppose that τ''' ≤ τ', τ'', since there seems to be little interest in the effect for a subpopulation which is more finely defined than the two sequences for which the causal effect is to be determined. (The identification conditions would also be stronger. Therefore this case is not considered here.)
Further, we only consider the case where T ≥ max(τ', τ''), which means that we only consider as final outcome variables the periods after the completion of the sequences. It would not make sense to consider explicitly the case T < max(τ', τ''), because we assume that treatments can have an effect only on future periods but not on earlier periods. If we were expecting anticipation effects, we would have to re-define the treatment start to the point where the anticipation started. I.e. if we observe in the data that an unemployed person started a training programme in June, but we also know that this person was already informed in early May about this programme, we would consider May as the date where treatment already started.66
66 If the date of referral and the date of programme start are very close together, and the date of referral is not observed, the possible anticipation effects can hopefully be ignored.

Under certain conditions discussed below, these treatment effects can be identified by sequentially controlling for confounders. Note that these treatment effects can then also be identified within strata defined by any strictly exogenous covariates X, i.e. by covariates that do not depend on the treatment, such as age and gender. Strata defined by covariates that are causally affected by the treatment would require more careful consideration and usually stronger identification conditions.

From the above definitions we obtain two useful results that help to relate various treatment effects to each other. First, we examine the relation between the expected outcomes for different lengths of the conditioning sets D̄_{τ'''} = d̄'''_{τ'''}. Define (d̄'''_{τ'''}, v_1, v_2, ..., v_κ) as a sequence of length τ''' + κ which starts with the subsequence d̄'''_{τ'''}, followed by the treatments v_1, v_2, ..., v_κ. By iterated expectations we obtain, with a slight abuse of notation, that

E[Y_T^{d̄'_{τ'}} | D̄_{τ'''} = d̄'''_{τ'''}] = Σ_{v_1=0}^{1} ... Σ_{v_κ=0}^{1} E[Y_T^{d̄'_{τ'}} | D̄_{τ'''+κ} = (d̄'''_{τ'''}, v_1, v_2, ..., v_κ)] · Pr(D_{τ'''+1} = v_1, ..., D_{τ'''+κ} = v_κ | D̄_{τ'''} = d̄'''_{τ'''}).   (18)

This implies that if a treatment effect is identified for a finer population, i.e. one defined by a longer sequence, then it will also be identified for the coarser population by a simple weighted average. Hence, identification for finer subpopulations will in general be more difficult.
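A minimal sketch of this aggregation (Python; fine_means is assumed to map each longer observed sequence, stored as a tuple, to its estimated conditional mean, and counts to the number of observations following that sequence):

def coarse_mean(prefix, fine_means, counts):
    # Weighted average over all sequences extending `prefix`, with weights
    # given by the observed continuation frequencies, as in (18).
    ext = [s for s in fine_means if s[:len(prefix)] == prefix]
    total = sum(counts[s] for s in ext)
    return sum(fine_means[s] * counts[s] for s in ext) / total

# Example: coarse_mean((0,), {(0,0): 1.0, (0,1): 2.0}, {(0,0): 3, (0,1): 1})
# returns (1.0*3 + 2.0*1)/4 = 1.25.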

Another result connects the expected outcomes for different lengths of the treatment sequence, by using the definition of potential outcomes for different sequences, (17). Supposing τ''' ≤ τ', we obtain

E[Y_T^{d̄'_{τ'}} | D̄_{τ'''} = d̄'''_{τ'''}] = Σ_{v_1=0}^{1} ... Σ_{v_κ=0}^{1} E[Y_T^{(d̄'_{τ'}, v_1, ..., v_κ)} | D̄_{τ'+κ} = (d̄'''_{τ'''}, •, ..., •, v_1, v_2, ..., v_κ)] · Pr(D_{τ'+1} = v_1, ..., D_{τ'+κ} = v_κ | D̄_{τ'''} = d̄'''_{τ'''}).

The bullet points in the conditioning set of the first term indicate that there is some unspecified part when τ''' < τ'. If τ''' = τ', the sequence is fully determined. Hence, identification of longer sequences implies identification for shorter sequences. Thus, identification of longer sequences will in general be more difficult.

3.2.1 Equivalence to static model

To gain some intuition, we first consider very strong assumptions that would permit us to employ the tools of the static model. In the next subsection we will then relax these assumptions. Let Ω_τ denote the set of all possible treatment sequences up to period τ. When only a binary treatment is available in each period, the cardinality of Ω_τ is thus 2^τ.
Now assume that all potential outcomes for all possible treatment sequences d̄_τ are independent of the actual sequence D̄_τ taken, conditional on X_0:

Y_T^{d̄_τ} ⊥⊥ D̄_τ | X_0   ∀ d̄_τ ∈ Ω_τ,   (19)

and also assume common support:67

0 < Pr(D̄_τ = d̄_τ | X_0) < 1   a.s. ∀ d̄_τ ∈ Ω_τ.

With this assumption we assume that the information X_0 that we observe at time zero is sufficient to explain the entire treatment sequence taken later. In other words, we assume that the researcher has enough information at the beginning of the initial period so that assignment to the treatment in every period can be treated as random conditional on X_0. This represents a scheme where the assignment of all treatments is made in the initial period and is not changed subsequently when new information arrives. Hence, the choices do not depend on time-varying X and also not on the outcomes of the treatments in the previous periods, because the complete treatment sequence is chosen initially based on the information contained in X_0. New information revealed in later periods does not play any role for the selection into the treatment.
This assumption is rather strong and will be relaxed in the next subsection. But it is helpful here to consider its implications. As shown in Lechner and Miquel (2001), with the above assumptions all treatment effects up to period τ are identified, including DATET and DATEN as well as effects for coarser subpopulations. This also includes identification of effects of the type

E[Y_T^{(111)} − Y_T^{(000)} | D̄_3 = (101)],

where the population for which the effect is identified has no common sequence with the potential outcomes that are compared. As we will see later, such effects are harder to identify when the conditional independence assumption is relaxed.
The above setup essentially boils down to the multiple programme approach of the static model. There are 2^τ different types of treatments (in this case sequences), one for each d̄_τ ∈ Ω_τ, and controlling for X_0 eliminates all endogeneity problems. Hence, the conventional matching or re-weighting approach of the multiple treatment evaluation approach can be applied here.
For comparison with the later subsection, we also note that assumption (19) can be written equivalently in sequential form as

Y_T^{d̄_τ} ⊥⊥ D_t | X_0, D̄_{t−1}   ∀ t and d̄_τ ∈ Ω_τ.   (20)

Hence, conditional on the treatment sequence until t − 1, the choice of treatment D_t only depends on X_0 (and some random noise that is not related to the potential outcomes). Again, this assumption implies an essentially static treatment regime, because any new information related to the outcomes that might be revealed after period 0 does not play any role in the selection process.
67 The abbreviation a.s. in the following expression means almost surely, i.e. the statement is true for all values of x_0 except for a set that has measure zero, i.e. a set for which the probability that X_0 takes any value in this set is zero. More formally, the expression means that Pr(Pr(D̄_τ = d̄_τ | X_0) ∈ (0, 1)) = 1.

3.2.2 Sequential conditional independence assumptions

The previous discussion examined various conditions which are likely to be restrictive in most applications, as they did not permit sequential treatment selection to depend on intermediate outcome variables. In the following we therefore want to relax these assumptions. We first examine a sequential conditional independence assumption which permits controlling for endogenous variables, including intermediate outcome variables.

Definition 1 (Weak dynamic conditional independence assumption, W-CIA):

Y_T^{d̄_τ} ⊥⊥ D_t | X̄_{t−1}, D̄_{t−1}   ∀ t and d̄_τ ∈ Ω_τ,   (21)

0 < Pr(D_t = d_t | X̄_{t−1}, D̄_{t−1}) < 1   a.s. ∀ t and d_t,

where X_t may include Y_t.

This assumption is weaker than the previous ones, as it does permit selection to depend on endogenous variables, i.e. observable variables that are functions of the outcomes of previous treatments. To see whether such an assumption is plausible in a given application, we have to know which variables influence changes in treatment status as well as outcomes, and whether they are observable.

For the two-period example discussed in Lechner and Miquel (2005) this means

a) Y_T^{d̄_2} ⊥⊥ D_1 | X_0   ∀ d̄_2 ∈ Ω_2,
b) Y_T^{d̄_2} ⊥⊥ D_2 | X̄_1, D_1   ∀ d̄_2 ∈ Ω_2,
c) 0 < Pr(D_1 = 1 | X_0) < 1 and 0 < Pr(D_2 = 1 | X̄_1, D_1) < 1   a.s.,

where X̄_1 may include Y_1.

Under this assumption, the first selection is not related to the potential outcomes, conditional on X_0. Similarly, it is assumed that the second treatment selection is not related to the potential outcomes, conditional on all the variables observed up to that point. These control variables now include X_0 and X_1 and the treatment choices made so far, i.e. D_1, and usually also the intermediate outcome variable Y_1. In a certain sense this is similar to the static model, with the additional aspect that we often want to include intermediate outcome variables in certain steps of the estimation. (This of course requires that information on these intermediate outcome variables is available in the data, which may very often not be the case!) As we will see later, the nonparametric estimators are more complex, though.
The common support assumption W-CIA (c) requires that each treatment path is observed in the data. E.g. the second part states that for all values of (X̄_1 = x̄, D_1 = d) with nonzero density, both choices D_2 = 0 and D_2 = 1 should have positive probability. Note that this common support condition needs to hold only sequentially and is thus weaker than

Pr(D_1 = d_1, D_2 = d_2 | X̄_1) > 0   a.s. for all d_1, d_2 ∈ {0, 1}.   (22)

As an example, certain values of X̄_1 may have zero density when D_1 = 1 but positive density when D_1 = 0. For these values of X̄_1 together with D_1 = 1, the second part of W-CIA (c) would still be satisfied even if Pr(D_2 = 1 | X̄_1, D_1 = 1) = 0, since the occurrence of these values has density zero. On the other hand, (22) would not be satisfied.
(E.g. think of a training programme that prohibits repeated participation. Then eligibility status, as one of the variables in X̄_1, will never be one if D_1 = 1, whereas it has positive probability of being one if D_1 = 0. Hence, Pr(D_2 = 1 | X̄_1 = eligible, D_1 = 1) is zero, but the event (X̄_1 = eligible, D_1 = 1) has probability zero. On the other hand, (22) would not be satisfied, because Pr(D_1 = D_2 = 1 | X̄_1 = eligible) = 0 and X̄_1 = eligible happens with positive probability.)

In contrast to the previous stronger assumptions, not all treatment effects are identified anymore. Observing the information set that influences the allocation to the next treatment in a sequence as well as the outcome of interest is sufficient to identify average treatment effects (DATE), even if this information is based on endogenous variables. However, this assumption is not sufficient to identify the treatment effect on the treated (DATET). The reason is that the subpopulation of interest (the participants who complete the sequence) has evolved (been selected) based on the realised intermediate outcomes of the sequence.

As shown in Lechner and Miquel (2001, Theorem 3a), under the above assumption the population average potential outcomes

E[Y_T^{d̄_τ}]

are identified for any sequence d̄_τ of length τ ≤ T. Also the outcomes for any sequence d̄_τ in the subpopulation of individuals who participated in treatment 0 or 1 in the first period,

E[Y_T^{d̄_τ} | D_1 = d_1],

are identified.
The situation becomes more difficult, however, if we are interested in the average effect for a subpopulation that is defined by a longer sequence, e.g. with D_1 = D_2 = 0.
The relevant distinction between the populations defined by treatment states in the first and subsequent periods is that in the first period, treatment choice is random conditional on exogenous variables, which is the result of the initial condition stating that D_0 = 0 holds for everybody. However, in the second and later periods, randomization into these treatments is conditional on endogenous variables, i.e. variables already influenced by the first part of the treatment. W-CIA has an appeal for applied work as a natural extension of the static framework. However, W-CIA does not identify the classical treatment effects on the treated if the sequences of interest differ in the first period.

Nevertheless, some effects can also be identified for finer subpopulations. The first result refers to comparisons of sequences that differ only with respect to the treatment in the last period, i.e. they have the same subsequence until τ − 1 and differ only in period τ. This is basically the same result as before, but with time period τ − 1 playing the role of time period 0 (i.e. the period for which the treatment sequence still coincides). In this case the endogeneity problem is not really harmful, because the endogenous variables X̄_{τ−1}, Ȳ_{τ−1}, which are the crucial ones to condition on for identification, have been influenced by the same past treatment sequence at time τ − 1 when comparing the two sequences that differ only with respect to the treatment in period τ. Lechner and Miquel (2001, Theorem 3b) show that the potential outcome is identified if the sequences (d̄_{τ−1}, d'_τ) and (d̄_{τ−1}, d''_τ) are identical except for the last period:

E[Y_T^{(d̄_{τ−1}, d'_τ)} | D̄_τ = (d̄_{τ−1}, d''_τ)].   (23)

By the result (18) for coarser subpopulations, this also implies that

E[Y_T^{(d̄_{τ−1}, d'_τ)} | D̄_{τ−1} = d̄_{τ−1}]

is also identified. To give some examples, E[Y_2^{11}], E[Y_2^{11} | D_1 = 0], E[Y_2^{11} | D_1 = 1], E[Y_2^{11} | D̄_2 = {1,0}] and E[Y_2^{11} | D̄_2 = {1,1}] are identified, but neither E[Y_2^{11} | D̄_2 = {0,0}] nor E[Y_2^{11} | D̄_2 = {0,1}] is. Hence, the average treatment effect on the treated between the sequences 10 and 01 is not identified.

The previous result referred to two sequences that are identical except for the last period of the sequence. Part c of Lechner and Miquel (2001, Theorem 3) gives a result for two sequences that are identical in the first few periods but then may differ for several periods, i.e. more than one as in the previous case. Consider a sequence d̄_{τ−w}, where 1 ≤ w < τ, and a longer sequence that starts with the same subsequence, (d̄_{τ−w}, d_{τ−w+1}, ..., d_τ). Then

E[Y_T^{(d̄_{τ−w}, d_{τ−w+1}, ..., d_τ)} | D̄_{τ−w} = d̄_{τ−w}]

is identified. This states that all sequences that have a common part in the beginning are identified for subpopulations defined as participating in that common part of the sequence. Of course, the relevant subpopulations for which identification is obtained could be coarser, but not finer. Compared to (23), the conditioning set for the expected value is 'one period shorter'. The reason is that the identification of sequences that differ for more than one period is more difficult: the endogenous conditioning variables X̄_{τ−1}, Ȳ_{τ−1} needed to make participants comparable to non-participants in the specific sequence are influenced by all events during the sequence. However, since the sequences differ, these events differ, leading to some additional loss of identification.

To give one example of what is identified by W-CIA (21), consider E[Y_2^{11} | D_1 = 0] in the simple two-period model. Using iterated expectations and W-CIA with respect to the first period, we can write

E[Y_2^{11} | D_1 = 0] = E[ E[Y_2^{11} | X_0, D_1 = 0] | D_1 = 0 ]
= E[ E[Y_2^{11} | X_0, D_1 = 1] | D_1 = 0 ]
= E[ E[ E[Y_2^{11} | X_0, X_1, D_1 = 1] | X_0, D_1 = 1 ] | D_1 = 0 ]
= E[ E[ E[Y_2^{11} | X_0, X_1, D̄_2 = 11] | X_0, D_1 = 1 ] | D_1 = 0 ]
= E[ E[ E[Y_2 | X_0, X_1, D̄_2 = 11] | X_0, D_1 = 1 ] | D_1 = 0 ]
= ∫∫ E[Y_2 | X̄_1, D̄_2 = 11] dF_{X_1 | X_0, D_1 = 1} dF_{X_0 | D_1 = 0}.
This result shows on the one hand that this potential outcome is identified, and it also suggests a way of estimating it. We first need to estimate E[Y_2 | X̄_1, D̄_2 = 11] nonparametrically and then adjust it sequentially for the distributions dF_{X_1 | X_0, D_1 = 1} and dF_{X_0 | D_1 = 0}. As discussed later, this adjustment can be done via matching or via weighting. The estimator is more complex than in the static model, as we have to adjust for differences in the X distribution twice. If we were to consider treatment sequences of length τ, we would have to adjust τ times. (This is discussed in some more detail later.)
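A minimal sketch of this nested adjustment (Python; linear regressions stand in for the nonparametric regressions, and x0, x1, d1, d2, y2 are hypothetical one-dimensional arrays):

import numpy as np

def linfit_predict(Xtrain, ytrain, Xeval):
    # Least-squares fit on the training sample, evaluated at Xeval.
    Z = np.column_stack([np.ones(len(Xtrain)), Xtrain])
    coef, *_ = np.linalg.lstsq(Z, ytrain, rcond=None)
    return np.column_stack([np.ones(len(Xeval)), Xeval]) @ coef

def mean_y11_given_d1_0(x0, x1, d1, d2, y2):
    s11 = (d1 == 1) & (d2 == 1)   # followed sequence (1,1)
    s1 = d1 == 1                  # first-period participants
    s0 = d1 == 0                  # target population
    # Step 1: E[Y_2 | X_0, X_1, Dbar_2 = 11], evaluated for all D_1 = 1.
    mu1 = linfit_predict(np.column_stack([x0[s11], x1[s11]]), y2[s11],
                         np.column_stack([x0[s1], x1[s1]]))
    # Step 2: adjust for F(X_1 | X_0, D_1 = 1) by regressing mu1 on X_0
    # within D_1 = 1 and evaluating at the X_0 of the D_1 = 0 population.
    mu0 = linfit_predict(x0[s1].reshape(-1, 1), mu1,
                         x0[s0].reshape(-1, 1))
    # Step 3: average over F(X_0 | D_1 = 0).
    return mu0.mean()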

To understand why the treatment effect on the treated is not identified, it is useful to examine identification of E[Y_2^{11} | D̄_2 = {0,0}] as an example. By iterated expectations we obtain

E[Y_2^{11} | D̄_2 = 00] = E[ E[Y_2^{11} | X_1, X_0, D̄_2 = 00] | D̄_2 = 00 ]
= E[ E[Y_2^{11} | X_1, X_0, D_1 = 0] | D̄_2 = 00 ],

where the last equality follows from (21). The next steps would be to re-arrange this expression until it could be written in terms of E[Y_2^{11} | ·, D̄_2 = 11], which would correspond to the observable outcome E[Y_2 | ·, D̄_2 = 11]. At this point, however, assumption (21) is not sufficiently strong to obtain identification. The reason is that although Y_2^{11} ⊥⊥ D_1 | X_0, it is not the case that Y_2^{11} is independent of D_1 conditional on X_0 and X_1. (See the following graph as an explanation.) Neither is Y_2^{11} independent of X_1, conditional on D_1 and X_0.
For the identification of E[Y_2^{11} | D_1] this was no problem, but for E[Y_2^{11} | D̄_2] this becomes important, because X_1 determines the population of interest in the second period. Hence, on the one hand, we would have to condition on this variable to control for the selection in the second period. On the other hand, we are not permitted to condition on this variable, as this could invalidate independence for the selection in the first period. (See the following graph.)

We would have to restrict the endogeneity of X_1 to be able to identify the DATET. Lechner and Miquel (2001, Theorem 1) consider a stronger sequential independence assumption, under which still all effects up to period τ are identified.

Definition 2 (Strong dynamic conditional independence assumption, S-CIA):

a) (Y_T^{d̄_τ}, X_t) ⊥⊥ D_t | X̄_{t−1}, D̄_{t−1}   ∀ t ≤ τ − 1 and d̄_τ ∈ Ω_τ,
b) Y_T^{d̄_τ} ⊥⊥ D_t | X̄_{t−1}, D̄_{t−1}   t = τ and d̄_τ ∈ Ω_τ,
c) 0 < Pr(D_t = d_t | X̄_{t−1}, D̄_{t−1}) < 1   a.s. ∀ t and d̄_τ ∈ Ω_τ.

Note that assumption (a) also implies that Y_T^{d̄_2} ⊥⊥ D_1 | X_0, X_1, as can be shown by simple calculations. Together with assumption (b) we thus have that Y_T^{d̄_2} ⊥⊥ D_1, D_2 | X̄_1.
With this assumption we can derive for E[Y_2^{11} | D̄_2 = {0,0}] that

E[Y_2^{11} | D̄_2 = 00] = E[ E[Y_2^{11} | X_1, X_0, D̄_2 = 00] | D̄_2 = 00 ]
= E[ E[Y_2^{11} | X_1, X_0, D_1 = 0] | D̄_2 = 00 ]
= E[ E[Y_2^{11} | X_1, X_0, D_1 = 1] | D̄_2 = 00 ]
= E[ E[Y_2^{11} | X_1, X_0, D̄_2 = 11] | D̄_2 = 00 ]
= E[ E[Y_2 | X_1, X_0, D̄_2 = 11] | D̄_2 = 00 ].

This result has two implications: First, the DATET is identified. Second, we simply have to adjust for the distribution of X_1 and X_0 simultaneously, and can thus use the simpler methods of the static model with multiple treatments. In other words, we do not have to use the more complex sequential matching or weighting methods that are applicable under (21) and are discussed in more detail later.

Note that part a) of the S-CIA also implies that X_1 ⊥⊥ D_1 | X_0, i.e. the variable X_1, which is observed at the end of the first period, is not influenced by D_1, which starts at the beginning of the first period. Hence, the X_t still have to be exogenous in the sense that D_t has no effect on X_t (which thus also prohibits including intermediate outcomes in X_t).68 This is a statement in terms of observed variables, and its implication can be related to causality concepts in time series econometrics. It says that X_1 is not Granger-caused by previous treatments. This condition is a testable implication of S-CIA, which on the one hand is an advantage, but on the other hand suggests that S-CIA may be stronger than strictly necessary. We will discuss below alternative representations in terms of the potential values X_1^{d_1}, i.e. the values of X_1 that would be observed if a particular treatment had been applied. Intuitively, part a) of S-CIA says that X_1^{d_1} = X_1^{d'_1}, i.e. the potential values of X_1 are the same whatever treatment is applied. However, as we discussed in the previous chapter, these two statements are not exactly equal. In principle, it could be that X_1^{d_1} ≠ X_1^{d'_1} but that due to selection effects it is still the case that X_1 ⊥⊥ D_1 | X_0.
68 In other words, treatment assignment is decided each period based on initial information, treatment history and new information that is revealed up to that period. But it is not permitted that the information revealed has been caused by past treatment. The participation decision may be based on the values of time-varying confounders observable at the beginning of the period, but these are not influenced by the treatments of this period. Hence, X_t is still exogenous, which thus does not permit Y_t to be included in X_t.

To examine alternative representations of the CIA assumptions in terms of potential values, we first turn back to the W-CIA. When using the W-CIA, no explicit exogeneity condition is required for the control variables. This may appear surprising, because it is a well-known fact that if we include, for example, the outcome in the list of control variables, we will always estimate a zero effect (see Rosenbaum (1984), Rubin (2004) and Rubin (2005) on this so-called endogeneity bias). As observed by Lechner (2008b), a CIA based on observable control variables which are potentially influenced by the treatment is not the best representation of the identifying conditions, because it confounds selection effects with endogeneity issues. E.g. the W-CIA implies for the second period

$E[Y_T^{\bar d} \mid \bar X_1, D_1 = 1] = E[Y_T^{\bar d} \mid \bar X_1, \bar D_2 = 11]$

or equivalently in terms of potential confounders

$E[Y_T^{\bar d} \mid \bar X_1^{d_1 = 1}, D_1 = 1] = E[Y_T^{\bar d} \mid \bar X_1^{\bar d_2 = 11}, \bar D_2 = 11]$.

This shows that the W-CIA is in fact a set of joint assumptions about selection and endogeneity bias. Lechner and Miquel (2005) give an alternative definition of the W-CIA in terms of potential confounders, which are by definition exogenous, i.e. not causally affected by the treatment. Then it becomes apparent what type of exogeneity assumptions are required to obtain identification without endogeneity bias.

We consider in the following only the version for the simple two-period model to focus on the key issues. First, a set of assumptions similar to the W-CIA is presented (called W-CIA-P); thereafter, assumptions similar to the S-CIA are discussed (called S-CIA-P). Lechner and Miquel (2005, Corollaries 1 and 2) show that these assumptions are strongly related. Nevertheless, neither does W-CIA directly imply W-CIA-P nor vice versa. The same applies to S-CIA and S-CIA-P. Hence, the following assumptions W-CIA-P and S-CIA-P are not exactly equivalent to our previous discussion, but almost. Hence, they provide an intuition for how we might interpret the previous W-CIA and S-CIA, which were expressed in terms of observed confounders.69

69 Note that the S-CIA contained a testable restriction, whereas the S-CIA-P does not.

Weak dynamic conditional independence based on potential confounders (W-CIA-P):

$Y_T^{\bar d_2} \perp D_t \mid \bar X_{t-1}^{\bar d_2}, \bar D_{t-1}$ for all $t \le 2$ and $\bar d_2 \in \mathcal{D}^2$

$F(X_0^{\bar d_2} \mid D_1 = d_1) = F(X_0^{d_1} \mid D_1 = d_1)$ for all $\bar d_2 \in \mathcal{D}^2$ (24)

$F(X_1^{d_1, d_2'} \mid X_0^{d_1, d_2'}, D_1 = d_1) = F(X_1^{d_1} \mid X_0^{d_1}, D_1 = d_1)$ for all $\bar d_2 \in \mathcal{D}^2$,

where $X_t$ may include $Y_t$. The common support requirement remains the same as before.

One can show that the same treatment effects are identified under W-CIA and W-CIA-P. The conditional independence condition is the same as before, now formulated in terms of potential confounders. What is new are the exogeneity conditions given afterwards. Intuitively, (24) states that $D_1$ and $D_2$ should have no effect on confounders in period 0, and $D_2$ should have no effect on confounders in period 1. A somewhat stronger assumption which implies this is that the treatment has no effect on the confounders before it starts, i.e. $X_0^{\bar d_2} = X_0^{\bar d_2'}$ for any $\bar d_2$ and $\bar d_2'$, and also $X_1^{d_1, d_2} = X_1^{d_1, d_2'}$ for any $d_2$ and $d_2'$. This rules out anticipation effects on the confounders. In the jargon of panel data econometrics, the values of $X_t$ are predetermined. They may depend on past values of the treatment sequence, but not on the current or future values of $D_t$. Hence, we not only rule out anticipation effects on the outcome variable, as this would not permit identification anyhow, but also anticipation effects on the confounders. Although W-CIA and W-CIA-P are not equivalent, they are strongly related, such that we can intuitively think of the W-CIA as requiring that no confounders are affected by anticipation effects.

The requirements for the strong dynamic CIA are stronger, and a nearly equivalent representation in terms of potential confounders is given by

Strong dynamic conditional independence based on potential confounders (S-CIA-P):

$(Y_T^{\bar d_2}, X_1^{\bar d_2}) \perp D_1 \mid X_0^{\bar d_2}$ for all $\bar d_2 \in \mathcal{D}^2$
$Y_T^{\bar d_2} \perp D_2 \mid \bar X_1^{\bar d_2}, D_1$ for all $\bar d_2 \in \mathcal{D}^2$

$F(X_0^{\bar d_2'} \mid \bar D_2 = \bar d_2) = F(X_0^{\bar d_2} \mid \bar D_2 = \bar d_2)$ for all $\bar d_2, \bar d_2' \in \mathcal{D}^2$ (25)
$F(\bar X_1^{\bar d_2'} \mid X_0^{\bar d_2'}, \bar D_2 = \bar d_2) = F(\bar X_1^{\bar d_2} \mid X_0^{\bar d_2}, \bar D_2 = \bar d_2)$ for all $\bar d_2, \bar d_2' \in \mathcal{D}^2$.

In contrast to (24), the above exogeneity conditions require that $X_1^{d_1, d_2} = X_1^{d_1', d_2'}$ for any values of $d_1, d_1', d_2, d_2'$. This means not only that the causal effect of $D_2$ on $X_1$ is zero as before (no anticipation) but also that the causal effect of $D_1$ on $X_1$ is zero. Hence, $X_t$ is assumed to be affected neither by past nor by future values of $D_t$. This assumption goes much beyond the no-anticipation condition required for W-CIA-P by ruling out the use of intermediate outcomes as conditioning variables. (Hence, as already remarked when discussing the S-CIA before, identification essentially boils down to the static model with multiple treatments, which, if deemed reasonable, makes estimation much simpler.) In many applications the S-CIA is likely to be too strong. However, in cases where the new information $X_t$ does influence outcomes as well as the choice of treatment in the next period, and this new information comes as a surprise (or at least is not influenced by the evolution of the treatment history so far), then the S-CIA may be plausible.70,71
70 Part b) of the assumption is formulated in terms of observed confounders. An equivalent representation of part b) can be expressed in terms of potential confounders for any two values $d_t', d_t''$ as

b') $F(X_t^{(\bar d_{t-1}, d_t')} \mid \bar D_{t-1} = \bar d_{t-1}, D_t = d_t', \bar X_{t-1}) = F(X_t^{(\bar d_{t-1}, d_t'')} \mid \bar D_{t-1} = \bar d_{t-1}, D_t = d_t', \bar X_{t-1})$ for all $t$ and $\bar d \in \mathcal{D}^T$.

71 Lechner and Miquel (2001, Theorem 1) showed that under the following conditions, all treatment effects up to period $T$ are identified. Suppose

$(\bar Y_{1..T}^{\bar d}, \bar X_{1..T}^{\bar d}) \perp D_t \mid \bar X_{t-1}, \bar Y_{t-1}, \bar D_{t-1}$ for all $t$ and $\bar d \in \mathcal{D}^T$,
$0 < \Pr(D_t = d_t \mid \bar X_{t-1}, \bar Y_{t-1}, \bar D_{t-1}) < 1$ a.s. for all $t$ and $d_t$,

where $\bar Y_{1..T}^{\bar d} = \{Y_1^{\bar d}, \dots, Y_T^{\bar d}\}$, and analogously for the $X$ variables. This condition thus allows selection into treatment to be based on intermediate outcomes $Y_{t-1}$ (predetermined endogenous variables). On the other hand, it requires that all the control variables $X$ also satisfy this independence condition. Hence, for any variable that we include in $X$ to control for possible confounding of $D$ and $Y$, we also have to incorporate all common factors that influence $D_t$ and $X_t$ simultaneously. In other words, once a variable is discovered that influences both the participation decision $D_t$ and the potential outcomes, one can only use that variable in the conditioning set if the other $X$ variables include all systematic factors that affect $X_t$ and $D_t$.

3.2.3 Sequential matching or weighting estimation

The parameters identified above can all be considered as weighted averages of the observed outcomes in the subgroup experiencing the treatment sequence of interest. As one example we showed above that

$E[Y_2^{11} \mid D_1 = 0] = \int\!\int E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1 \mid X_0, D_1 = 1} \, dF_{X_0 \mid D_1 = 0}$

or

$E[Y_2^{11}] = \int\!\int E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1 \mid X_0, D_1 = 1} \, dF_{X_0}$. (26)
In principle we can apply nonparametric matching estimators as in the static case. There will be at least three complications, though, which can pose problems in finite samples. First, if we consider very long sequences, e.g. 10 versus 0000010, the number of observations who actually experienced these sequences can be very small. Second, the observations in very long sequences are likely to be more homogeneous, such that the common support for the comparison of two sequences may be rather small. Another issue is that we will also have to control for continuous variables in our sequential matching estimation. While we can estimate $dF_{X_0}$ in (26) simply by the empirical distribution function of $X_0$, this would not be possible for $dF_{X_1 \mid X_0, D_1 = 1}$ if $X_0$ contains a continuous variable. If one were to impose parametric forms for $dF_{X_1 \mid X_0, D_1 = 1}$ and $dF_{X_0}$, this would become much simpler.
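
Alternatively, (26) can be estimated without estimating conditional distribution functions at all, via nested regressions. The following is a hedged sketch of this idea (all names are ours; the linear fits are placeholders for any parametric or nonparametric regression):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def g_formula_y2_11(y2, x0, x1, d1, d2):
    """Sketch of (26): E[Y2^{11}] by two nested regressions.
    x0 and x1 are 2D arrays holding X0 and X1."""
    x01 = np.column_stack([x0, x1])
    s11 = (d1 == 1) & (d2 == 1)
    # Step 1: fit m11(x0, x1) = E[Y2 | X0, X1, Dbar2 = 11]
    m11 = LinearRegression().fit(x01[s11], y2[s11])
    # Step 2: integrate X1 out over F(X1 | X0, D1 = 1) by regressing
    # the fitted values on X0 within the D1 = 1 subsample
    s1 = d1 == 1
    inner = LinearRegression().fit(x0[s1], m11.predict(x01[s1]))
    # Step 3: average over the marginal distribution of X0
    return inner.predict(x0).mean()
```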
This last problem is actually not present if one were to assume the S-CIA, as in this case one could identify

$E[Y_2^{11} \mid \bar D_2 = 00] = \int E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1, X_0 \mid \bar D_2 = 00}$
$E[Y_2^{11} \mid D_1 = 0] = \int E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1, X_0 \mid D_1 = 0}$
$E[Y_2^{11}] = \int E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1, X_0}$,

where $D_1 = 0$ and $\bar D_2 = 00$ have positive probability mass. Hence, with the S-CIA we obtain a simpler estimation approach. Of course, since the S-CIA implies the W-CIA, the methods below for the W-CIA can also be used here. (This could in fact be used as a specification check for those parameters that are identified under the S-CIA as well as under the W-CIA.)

For nonparametric identification under the W-CIA, two promising approaches are propensity score matching and weighting. Both classes of estimators have the advantage that the relation of the outcome variable to the confounders X does not need to be specified; they only require the specification of the relation between X and the selection process. In many labour market evaluation studies, this is often considered an important advantage, because applied researchers may have better knowledge about the selection process than about the outcome process. It is so far common practice in the static evaluation literature to estimate such selection probabilities with parametric binary or multiple response models.

If one uses inverse probability weighting (IPW), this permits not only a convenient implementation of the estimator but also allows us to use the results for sequential GMM estimation, e.g. Newey (1984), to obtain consistent standard errors. By this we take account of the fact that the propensity scores are estimated, which may often lead to a reduction in standard errors compared to a situation where this is ignored. Although the derivation of the variance by the sequential GMM estimation approach is analytically very simple, it may nevertheless take you some time to programme it. (One could also extend the calculations of Hirano, Imbens, and Ridder (2003) to obtain the asymptotic distribution with nonparametrically estimated choice probabilities.) When using matching estimation, the sequential GMM approach to calculating the standard errors is not available, and one has to rely either on the more complex asymptotic theory, as could be developed from Heckman, Ichimura, and Todd (1998), or on the bootstrap, if the estimator satisfies certain smoothness properties.

The IPW estimator is straightforward to derive. Define $p^{d_1}(x_0) = \Pr(D_1 = d_1 \mid X_0 = x_0)$ and $p^{d_2|d_1}(\bar x_1) = \Pr(D_2 = d_2 \mid \bar X_1 = \bar x_1, D_1 = d_1)$. Then

$E\left[ \frac{Y_2}{p^{1|1}(\bar X_1) \, p^1(X_0)} \,\middle|\, \bar D_2 = 11 \right] \Pr(\bar D_2 = 11)$

$= \int \frac{\Pr(\bar D_2 = 11)}{p^{1|1}(\bar X_1) \, p^1(X_0)} \, E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1, X_0 \mid \bar D_2 = 11}$

$= \int \frac{\Pr(\bar D_2 = 11)}{p^{1|1}(\bar X_1) \, p^1(X_0)} \, E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, \frac{\Pr(D_2 = 1 \mid \bar X_1, D_1 = 1)}{\Pr(D_2 = 1 \mid D_1 = 1)} \, dF_{X_1, X_0 \mid D_1 = 1}$

$= \int \frac{\Pr(D_1 = 1)}{p^1(X_0)} \, E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1 \mid X_0, D_1 = 1} \, dF_{X_0 \mid D_1 = 1}$

$= \int \frac{\Pr(D_1 = 1)}{p^1(X_0)} \, E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1 \mid X_0, D_1 = 1} \, \frac{\Pr(D_1 = 1 \mid X_0) \, dF_{X_0}}{\Pr(D_1 = 1)}$

$= \int E[Y_2 \mid \bar X_1, \bar D_2 = 11] \, dF_{X_1 \mid X_0, D_1 = 1} \, dF_{X_0} = E[Y_2^{11}]$
which is identical to (26). Hence, a natural estimator is

$\frac{\sum_{i: \bar D_{2,i} = 11} \hat w_i Y_{2,i}}{\sum_{i: \bar D_{2,i} = 11} \hat w_i}$

where $\hat w_i = \frac{1}{\hat p^{1|1}(\bar X_{1,i}) \, \hat p^1(X_{0,i})}$. The conditional probabilities can be estimated nonparametrically, but when the sequences become very long, parametric estimation might be more reliable, because the number of observations decreases and the list of control variables $\bar X$ gets longer.
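
A minimal sketch of this sequential IPW estimator, assuming parametric (logit) propensity scores; all names are ours, and common-support trimming is omitted for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def seq_ipw_y2_11(y2, x0, x1, d1, d2):
    """Sketch of the sequential IPW estimator of E[Y2^{11}] with
    normalized weights 1 / (p^{1|1}(X0, X1) * p^1(X0)) over the
    subsample following the treatment sequence Dbar2 = 11."""
    # First-period score p^1(X0) = Pr(D1 = 1 | X0)
    p1 = LogisticRegression().fit(x0, d1).predict_proba(x0)[:, 1]
    # Second-period score p^{1|1} = Pr(D2 = 1 | X0, X1, D1 = 1),
    # fitted on the D1 = 1 subsample, evaluated for all observations
    x01 = np.column_stack([x0, x1])
    fit2 = LogisticRegression().fit(x01[d1 == 1], d2[d1 == 1])
    p21 = fit2.predict_proba(x01)[:, 1]
    s11 = (d1 == 1) & (d2 == 1)
    w = 1.0 / (p21[s11] * p1[s11])
    return np.sum(w * y2[s11]) / np.sum(w)
```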
Similarly,

$E[Y_2^{11} \mid D_1 = 0] = E\left[ \frac{Y_2 \, p^0(X_0)}{p^{1|1}(\bar X_1) \, p^1(X_0)} \,\middle|\, \bar D_2 = 11 \right] \frac{\Pr(\bar D_2 = 11)}{\Pr(D_1 = 0)}$.

Lechner and Miquel (2005) have shown that these propensity scores also satisfy a balancing property, which can make sequential matching estimation much simpler. For the two-period example discussed in Lechner and Miquel (2005), under the W-CIA it follows that

a) $Y_T^{\bar d_2} \perp D_1 \mid p^1(X_0)$ for all $\bar d_2 \in \mathcal{D}^2$
b) $Y_T^{\bar d_2} \perp D_2 \mid p^{1|D_1}(\bar X_1), D_1$ for all $\bar d_2 \in \mathcal{D}^2$.

In fact, instead of $p^{d_1}(x_0)$ and $p^{d_2|d_1}(\bar x_1)$ one can also use any balancing scores $b_1(x_0)$ and $b_2(\bar x_1, d_1)$ with the properties that

$E\big[ p^{d_1}(X_0) \mid b_1(X_0) \big] = p^{d_1}(X_0)$ and $E\big[ p^{d_2|d_1}(\bar X_1) \mid b_2(\bar X_1, d_1) \big] = p^{d_2|d_1}(\bar X_1)$.

Hence, we can always augment the propensity score with additional control variables that we deem particularly important for the outcome variable, with the aim of improving small sample properties. In addition, this means that we can use the same propensity score when estimating the effects separately by gender or age groups, for example.

We thus obtain

$E[Y_2^{11}] = \int\!\int E\big[ Y_2 \mid p^{1|1}, p^1, \bar D_2 = 11 \big] \, dF_{p^{1|1} \mid p^1, D_1 = 1} \, dF_{p^1}$.

A potential estimator would thus be

$\frac{1}{n} \sum_i \frac{\sum_{j: d_{1,j} = 1} \hat m_{11}(p_j^{1|1}, p_j^1) \, K\!\left( \frac{p_j^1 - p_i^1}{h} \right)}{\sum_{j: d_{1,j} = 1} K\!\left( \frac{p_j^1 - p_i^1}{h} \right)}$

where $m_{11}(p^{1|1}, p^1) = E\big[ Y_2 \mid p^{1|1}, p^1, \bar D_2 = 11 \big]$. If more than 2 time periods are examined, more propensity scores are needed. The minimum number of propensity scores needed corresponds to the length of the treatment sequence. Various matching estimators based on nearest-neighbour regression are examined in Lechner (2008a).
Note that the dimension of the matching estimator increases with the length of the treatment sequence, even if we use a parametrically estimated propensity score. When we are interested in $Y^{11}$ we have to control for $p^{1|1}$ and $p^1$ in the matching estimator. When $Y^{111}$ is of interest, we will need $p^{1|11}$, $p^{1|1}$ and $p^1$. Hence, the number of propensity scores needed for a treatment sequence $\bar d$ equals the length of that sequence.
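
A sketch of the kernel matching estimator displayed above for the two-period case, assuming the two propensity scores have already been fitted; the Gaussian kernel and the bandwidth are ad hoc choices, and all names are ours:

```python
import numpy as np

def kernel_match_y2_11(y2, p1, p21, d1, d2, h=0.05):
    """Sketch of the propensity-score kernel matching estimator of
    E[Y2^{11}]. p1 and p21 hold fitted values of p^1(X0) and
    p^{1|1}(X0, X1) for all observations."""
    k = lambda u: np.exp(-0.5 * u**2)  # Gaussian kernel
    s11 = (d1 == 1) & (d2 == 1)

    def m11(q21, q1):
        # Local-constant fit of E[Y2 | p^{1|1}, p^1, Dbar2 = 11]
        w = k((p21[s11] - q21) / h) * k((p1[s11] - q1) / h)
        return np.sum(w * y2[s11]) / np.sum(w)

    s1 = d1 == 1
    m_vals = np.array([m11(a, b) for a, b in zip(p21[s1], p1[s1])])
    # Average m11 over F(p^{1|1} | p^1, D1 = 1), then over F(p^1)
    est = 0.0
    for q1 in p1:
        w = k((p1[s1] - q1) / h)
        est += np.sum(w * m_vals) / np.sum(w)
    return est / len(p1)
```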

Hence, estimation is possible by sequential matching or by sequential inverse probability weighting (IPW). Lechner (2008a) examines the finite sample properties of sequential matching, whereas IPW estimators are considered in Lechner (2008c). Those two papers also discuss in detail the finite sample issues that arise in the implementation of the common support restriction. Which of these two approaches tends to be more reliable in finite samples is still an unresolved question, but the discussion paper Lechner (2004) was more in favour of matching.

Note that a crucial assumption in the above model was that $0 < \Pr(D_2 = 1 \mid \bar X_1, D_1) < 1$ almost surely. In other words, for every value of $X_0$ and $X_1$ there should be a positive probability that either $D_2 = 1$ or $D_2 = 0$ is chosen. In some applications, the set of possible treatments might, however, depend on the value of $X_1$. For example, if we examine a particular training programme for the unemployed, the treatment option $D_2 = 1$ might not exist anymore for someone for whom $X_1$ indicates that he or she is no longer unemployed. Here, the set of available treatment options in a given time period $t$ varies with $\bar X_{t-1}$, and the model discussed so far would have to be adjusted to this setting. Here there seems to be scope for further research.

3.2.4 Relationship to causality in time series econometrics

Lechner (2006) relates the above concept of causality based on potential outcomes to concepts of causality frequently found in time series econometrics. In the concept of causality advocated by Granger and Sims, a variable $D_t$ is causing $Y_{t+1}$ if the information in $\bar D_t$ helps to obtain better predictions of $Y_{t+1}$ given all other available information.

Definition 3 Granger-Sims non-causality: $D$ does not GS-cause $Y_{t+1}$ iff $Y_{t+1} \perp \bar D_t \mid \bar Y_t, D_0, Y_0$.

Definition 4 Potential outcome non-causality: $D$ does not PO-cause $Y_{t+1}$ iff $F_{Y_{t+1}^{\bar d_t'}}(u) = F_{Y_{t+1}^{\bar d_t''}}(u)$ for all $u$ and all $\bar d_t', \bar d_t'' \in \mathcal{D}^t$.

Lechner (2006) shows that neither of these two definitions of non-causality implies the other. However, if the W-CIA holds (including the common support assumption), then each of these two definitions of non-causality implies the other. Hence, if we can assume the W-CIA, both definitions can be used to test for non-causality and can be interpreted in whichever of the two perspectives seems more intuitive.
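
As a rough illustration of testing GS non-causality, one could compare linear predictive regressions with and without lagged treatments. The following sketch is only a crude finite-order, linear approximation of the conditional independence statement above, and all names are ours:

```python
import numpy as np

def gs_noncausality_fstat(y, d, lags=2):
    """Rough sketch: F statistic for adding lagged D to a linear
    predictive regression of Y_{t+1} on lagged Y (plus a constant).
    A large value speaks against GS non-causality of D for Y."""
    y, d = np.asarray(y, float), np.asarray(d, float)
    rows = range(lags, len(y) - 1)
    target = np.array([y[t + 1] for t in rows])
    ylags = np.array([y[t - lags + 1: t + 1] for t in rows])
    dlags = np.array([d[t - lags + 1: t + 1] for t in rows])
    ones = np.ones((len(target), 1))
    xr = np.hstack([ones, ylags])         # restricted: lagged Y only
    xu = np.hstack([ones, ylags, dlags])  # unrestricted: add lagged D
    ssr = lambda x: np.sum(
        (target - x @ np.linalg.lstsq(x, target, rcond=None)[0]) ** 2)
    n, k = xu.shape
    return ((ssr(xr) - ssr(xu)) / lags) / (ssr(xu) / (n - k))
```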

3.3 Timing of events and duration models

to be written

4 Selection on unobserved - nonparametric IV - triangular model

In many applications, we may not be able to observe all confounding variables, as data collection might have been too expensive or because some variables might be very difficult to measure. (This may be less of a concern with detailed administrative data, but more often when only a limited set of covariates is available, perhaps measured with substantial error if elicited by e.g. telephone surveys.) In this situation the endogeneity of D cannot be controlled for by conditioning on observed covariates. Instrumental variables is the technique most frequently used in econometrics to deal with this situation. An instrumental variable Z is a variable that affects only the endogenous variable D but not the outcome variable Y.72 It affects the observed outcome only indirectly through D. Hence, any observed impact of Z on Y must have been mediated via D. A variation in Z then permits us to observe changes in D without any change in the unobservables. This allows us to estimate the effect of D. We begin our analysis with Y, D and Z being scalar. See the graph for a simple illustration without control variables.73

72 In the selection-on-observables approach as described in the previous chapters we also require the existence of instrumental variables, but we do not need to observe them. There we require some random noise that affects D but not Y, such that the common support requirement can be fulfilled.

73 Causal inference through instrumental variables has been analyzed by Imbens and Angrist (1994), Angrist, Imbens, and Rubin (1996), Heckman and Vytlacil (1999) and Imbens (2001), among many others.

[Figure: causal graph with instrument Z, treatment D and outcome Y, and unobservables V (affecting D) and U (affecting Y).]

The classical instrumental variables strategy uses a linear model

$Y = \alpha + D\beta + U$

and assumes that

$\mathrm{cov}(U, Z) = 0$

and that

$\mathrm{cov}(D, Z) \neq 0$.

I.e. it assumes that Z is not correlated with the unobservables but is correlated with D. This leads to the linear IV estimator, which will be discussed in more detail in the next chapter. The assumption of linearity is for convenience and usually does not emanate from an economic model.
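
A minimal numerical sketch (the function name is ours) of the resulting just-identified linear IV estimator of $\beta$, which equals the ratio of sample covariances:

```python
import numpy as np

def linear_iv(y, d, z):
    """Sketch: just-identified linear IV slope, cov(Y, Z) / cov(D, Z)."""
    return np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]
```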

Therefore it is most helpful to better understand the merits and limits of instrumental variables by analyzing first nonparametric identification. We begin with the simplest situation where D and Z are both binary and no other covariates are included. This is the analysis of the local average treatment effect, which clearly shows the limits of instrumental variables estimation.

For nonparametric identification we are going to see that we often need two assumptions: first, that the instrument Z has no direct effect on Y and, second, that the instrument itself is not confounded. The meaning of these assumptions can be seen by comparing the following with the previous graph. The assumption that Z has no direct effect on Y requires that the direct arc from Z to Y does not exist. The assumption of no confounding requires that there are no confounding (dashed) arcs between Z and V and between Z and U.

Later on, we will relax these assumptions by permitting conditioning on additional covariates X and will assume that these assumptions hold conditional on X. Hence, X will then allow us to "block" the direct effect and the confounders.

[Figure: causal graph with nodes Z, D, Y and unobservables V and U, illustrating the arcs (direct arc from Z to Y; dashed confounding arcs between Z and V and between Z and U) that the assumptions rule out.]

Later the analysis will be extended by explicitly adding a second equation to the outcome equation:

$Y_i = \varphi(D_i, X_i, U_i)$
$D_i = \zeta(Z_i, X_i, V_i)$,

where the endogeneity of D arises from statistical dependence between U and V, and where U and V are vectors of unobserved characteristics.74 U could be unobserved cognitive and non-cognitive skills, talents, ability, or fortune in the labour market. V could be proneness to schooling, dedication to academic study and factors affecting the costs of schooling. One might think of U as fortune in the labour market and V as ability in schooling, as e.g. in Chesher (2003).75

74 In the binary treatment evaluation literature, U is often considered to be a vector $(U^0, U^1)$ of potential unobserved variates.

75 The control function approaches discussed later rest on finding some method to keep the value of V fixed while permitting variation in D. For a fixed value of V, the channel through which endogeneity arises is fixed, but variation in Z still introduces variation in D.

4.1 Examples of instrumental variables

A discussion of recent "famous" instrumental variables is given in class. These include the following:

Distance to college/university in Card (1995).

Rainfall variation to estimate the impact of economic growth on the likelihood of civil war in Africa in Miguel, Satyanath, and Sergenti (2004). The determinants of civil wars are an important research topic. A number of contributions, e.g. Collier and Hoeffler (2002), have stressed that civil wars, particularly in Africa, are more often driven by business opportunities than by political grievances. The costs of recruiting fighters may therefore be an important factor for triggering civil wars. In this respect, poor young men are more willing to be recruited as fighters when their income opportunities in agriculture or in the formal labour market are worse (relative to their incomes as fighters). The causal chain assumption in Miguel, Satyanath, and Sergenti (2004) is that, in countries with a large agricultural sector that is mainly rainfed, negative weather shocks reduce GDP, which, as a proxy for the economic situation, increases the risk of civil war. The price for recruiting fighters may be one of the channels. Another could be reduced state military strength and road coverage if public resources shrink, e.g. due to a negative weather shock. Whereas weather shocks are arguably exogenous, the absence of a direct effect of weather on civil war incidence is an important assumption. (As an interesting "test" of the exogeneity assumption, they also consider future rainfall growth as a pseudo instrument. If weather fluctuations are random (exogenous), future rainfall should not be correlated with current GDP growth, such that it should be insignificant in the first stage regression.)

Vietnam draft lottery in Hearst, Newman, and Hulley (1986) and Angrist (1990). They use the Vietnam-era conscription lottery as an instrument to identify the effects of mandatory military conscription on subsequent civilian mortality and earnings. A suitable instrumental variable was provided by the U.S. conscription policy during the years of the Vietnam war, which conscripted individuals on the basis of randomly drawn birth dates.

Imbens, Rubin, and Sacerdote (2001) use 'winning a prize in the lottery' as an instrument to identify the effects of unearned income on subsequent labour supply, earnings and consumption behaviour. In both examples the instrument is randomly assigned (by a lottery).

Quarter of birth as an instrument for estimating the returns to schooling in Angrist and Krueger (1991). They estimated the returns to schooling using the quarter of birth as an instrumental variable for educational attainment. According to U.S. compulsory school attendance laws, compulsory education ends when the pupil reaches a certain age, and thus the month in which the end of compulsory education is reached depends on the birth date. Since the school year starts for all pupils in summer/autumn, the minimum education varies with the birth date, which can be exploited to estimate the impact of an additional year of schooling on earnings.

Twin births and same-sex births of the first two children to estimate the impact of children on female labour supply in Angrist and Evans (1998).

Minimum quota (Frölich and Lechner (2006)): Frölich and Lechner (2006) analyze the impact of participation in an active labour market training programme on subsequent unemployment chances. They use the so-called minimum quota as an instrument for being assigned to a labour market programme. When active labour market programmes were introduced on a large scale in Switzerland, the central government wanted to ensure that all regions (called cantons) introduce these new programmes at the same time. The fear was that (at least some of) the cantons, which enjoy a very high degree of autonomy in the implementation of the policy, might have been reluctant to introduce these new programmes and might have preferred a wait-and-see strategy. To avoid such behaviour, the central government imposed on all cantons a minimum quota of programme slots that had to be filled. Since the calculation of this quota was based partly on the population share and partly on the unemployment share, it introduced a differential in the likelihood of being assigned to treatment between neighbouring cantons. They argue that people living close to a cantonal border but on different sides of it live essentially in the same labour market environment, but their chances of being assigned to treatment in case of becoming unemployed depend on the side of the border where they live.

Swan-Ganz catheterization: an example from the medical field. As discussed e.g. in Bhattacharya et al. (2005), individuals with heart attack and other acute diseases (acute respiratory failure etc.) often receive a Swan-Ganz catheter, which measures pressure in the right side of the heart and the pulmonary artery. The catheter is often left in place for several days and provides information to the doctors on the current health state, which is used to make decisions about further treatment. The catheter might therefore lead to better treatment choices, but at the same time, the introduction of the catheter may increase mortality. Since the decision to introduce a catheter itself depends on the acuteness of the health problem, a simple comparison of mortality outcomes with and without catheter can be misleading. A candidate instrument could be whether the patient was admitted to hospital on a weekday or during the weekend. It seems that the introduction of the catheter depends on staff load and therefore differs between weekdays and weekends. At the same time, whether a patient with an acute heart attack was admitted during a weekday or the weekend might itself not be related (very much) to expected mortality. (This, of course, might be very different for less acute health conditions, where individuals might wait until Monday if the health problem is not so acute.)

Becker and Wößmann (2007) examined potential reasons for why Protestant regions attained a higher economic prosperity than Catholic regions. Max Weber suggested the Protestant work ethic as the underlying cause. Becker and Wößmann (2007) suggested that "Protestant economies prospered because instruction in reading the Bible generated the human capital crucial to economic prosperity", using county-level data from late 19th-century Prussia. They use distance to Wittenberg as an instrument for Protestantism.

Acemoglu, Johnson, and Robinson (2001) examined the impact of institutions (e.g. rule of law, protection against expropriation) on economic development across countries. Certainly, institutions and economic development affect each other, such that an instrumental variables strategy might be helpful. They argue that institutions are very persistent over the centuries, but that European colonization implied a shock to institutional development. Depending on the "settler mortality", the European powers introduced very different institutions. Where settler mortality was low, they intended to settle and enacted institutions similar to those of their home country, favouring private property and protection against expropriation by the state. Where settler mortality was high, they instead set up "extractive states" by imposing institutions intended to transfer (natural) resources to their home country (i.e. weak private property rights and little protection against government expropriation). They hypothesize that settler mortality affected the development of settlements, which then shaped the early institutions. It is further hypothesized that these early institutions are highly correlated with the institutions after independence and that the latter affect current economic performance. While the link from settler mortality to early institutions (first stage) can be examined empirically, the "exclusion restriction" that settler mortality itself has no effect on current economic performance is not testable and has to be justified on a priori grounds. To support their claim they cite various historical sources finding that the main causes of European deaths in the colonies were malaria and yellow fever, to which the indigenous adults had developed some kind of immunity, such that these diseases are unlikely to be the reason behind the large differences in GDP. (Most at risk are young children.) The mortality rates of local troops of the British army in India were of a similar magnitude as the mortality rates of the British troops in Britain. In addition, several places in Africa and India also had relatively high population densities before the Europeans arrived, suggesting that the disease burden for the indigenous population was not extremely high. (This article shows that the justification of the exclusion restriction for instrumental variables lies at the heart of many empirical papers and that additional sources are needed to provide indications for or against the exclusion restriction. Certainly, one will never be able to show its validity, at most its plausibility and the absence of evidence against it. Acemoglu, Johnson, and Robinson (2001) also use some kind of overidentification test to test their claim. Such tests, however, are strongly dependent on the linearity assumption.)

Consider a few more examples from these and other papers.76

76 See also Imbens (2001).


(1) An individual may choose to attend college or not, and the outcome is earnings later in
the life cycle. The individual’s decision depends on the expected payo¤, i.e. better employment
chances or higher wages and also on the costs of attending college, which includes travel costs,
tuition, commuting time but also foregone earnings. Distance to college could be an instrument.
1 is larger than Y 0 . He may however not be able
Suppose the individual chooses college if Yi;Zi i;z

to forecast the potential outcomes perfectly but receives only a noisy signal of ability V and
information about a cost shifter Z. The participation decision thus is

1 0
Di = 1 E Yi;Zi
jVi ; Zi c(1; Zi ) > E Yi;Zi
jVi ; Zi c(0; Zi ) .

Here is a di¤erence between the objective function of the individual (outcomes minus costs) and
1
the "production function" that we are interested in (i.e. Yi;Z 0 ). The instrument should
Yi;Z
i i

shift the objective function of the individual without shifting the production function.
Consider the situation where D is non-binary. For simplicity X is dropped: Let Y be
individual lifetime earnings and '(d; u) the educational production function, with d representing
inputs chosen by the individual and u other inputs over which the individual has no control
(environment, IQ). The individual may not know u exactly, but receives only a noisy signal V
that is correlated with U . (The correlation could be one.) The individual chooses d to maximize
the di¤erence between the outcome and the costs c(d; Z), where Z is a cost shifter (which does
not enter in the educational production function):

D = arg maxE ['(d; U ) c(d; Z) jV; Z] (Z; V ).


d

Whereas the econometrician is interested in learning about the educational production function,
the individual chooses d with respect to a di¤ erent objective function: The di¤erence between
outcome and costs. If varies with Z, there will be individuals with the same values of u but
76
See also Imbens (2001).
Markus Frölich 4. Selection on unobserved - nonparametric IV - triangular model 107

di¤erent choices of D. (If '(d; u) were additively separable, the choice of D would not depend
on V , because D = arg max '(d) c(d; Z) + E [U jV = v; Z = z]. Hence, D = (Z) and would
not be confounded with U . Then we would not need an IV approach, but could use simple
regression analysis.)

(2) A firm can choose between adopting a new production technology or not. Our interest is in the effect of technology on output. The firm, on the other hand, bases its decision on maximizing profits, i.e. it chooses D to maximize

$D_i = \arg\max_{d \in \{0,1\}} \; p Y^d_{i,Z_i} - C_i(d)$,

where p is the price of a unit of output, which is common to all firms and not influenced by the firm's decision (i.e. the firm is a price-taker without market power), and $C_i(d)$ is the cost of adopting the new technology. An instrument, e.g. a subsidy, might change the cost and thus profits without affecting the output directly.

(3) Individuals in a clinical trial are randomly assigned to a new treatment against malaria or to a control treatment. Individuals assigned to the treatment group may refuse the new treatment, but individuals assigned to the control group cannot receive the new treatment. Hence, individuals in the treatment group may not comply, but individuals in the control group cannot get access to the treatment. This is called one-sided non-compliance. The decision of individuals to decline the new treatment may be related to their health status at that time: individuals in particularly bad health at the time of being administered the new drug may refuse to take a drug with potentially severe side-effects. Hence, the decision to take the drug may be related to health status at that time, which is likely to be related to health status later. Hence, D is endogenous, but random assignment can be used as an instrumental variable.

4.2 Local average treatment effect

References:
- Imbens and Angrist (1994, Econometrica)
- Imbens (2001, in the book "Econometric Evaluation of Labour Market Policies")
- Angrist, Imbens and Rubin (1996, JASA): easy reading with stronger assumptions

In their seminal paper, Imbens and Angrist (1994) analyzed what the linear IV estimator estimates in the situation with D and Z binary. The assumptions and derivations are more clearly laid out in Imbens (2001). The following discussion follows Frölich (2007a), who relaxes the assumptions somewhat for the binary situation (and extends to covariates, as discussed in the next subsection). Consider a triangular model with non-separable errors, where Y and D are scalar:

$Y_i = \varphi(D_i, Z_i, U_i)$ with $Y^d_{i,z} = \varphi(d, z, U_i)$ and $Y^{D_i}_{i,Z_i} = Y_i$
$D_i = \zeta(Z_i, V_i)$ with $D_{i,z} = \zeta(z, V_i)$ and $D_{i,Z_i} = D_i$.

For example, D could be attending/not attending college and Z could be an indicator of living close to or far from a college. College proximity has been used by Card (1995) and many others as an instrument to identify the returns to schooling, noting that living close to a college during childhood may induce some children to go to college but is unlikely to directly affect the wages earned in their adulthood.

According to the reaction of D to an external intervention on Z, the units i can be distinguished into different types: for some units, D would remain unchanged if Z were changed from 0 to 1, whereas for others D would change. With D and Z binary, four different latent types $T \in \{n, c, d, a\}$ are possible:

$T_i = n$ if $D_{i,0} = 0$ and $D_{i,1} = 0$ (never-taker)
$T_i = c$ if $D_{i,0} = 0$ and $D_{i,1} = 1$ (complier/compliant)
$T_i = d$ if $D_{i,0} = 1$ and $D_{i,1} = 0$ (defier)
$T_i = a$ if $D_{i,0} = 1$ and $D_{i,1} = 1$ (always-taker).

Since units of the always-taker and never-taker types cannot be induced to change D through a variation in the instrumental variable, the impact of D on Y can at most be ascertained for the subpopulations of compliers and defiers. Assume the following:

[Assumption 1: Monotonicity] The subpopulation of defiers has probability measure zero:

$\Pr(D_{i,0} > D_{i,1}) = 0$.

[Assumption 2: Existence of compliers] The subpopulation of compliers has positive probability:

$\Pr(D_{i,0} < D_{i,1}) > 0$.

Assumptions 1 and 2 rule out the existence of subpopulations that are affected by the instrument in opposite directions. Since changes in the instrument Z would trigger changes in D for the compliers as well as for the defiers, but with opposite sign, any causal effect on the compliers could be offset by opposite flows of defiers. Monotonicity ensures that the effect of Z on D has the same direction for all units. The monotonicity and the existence assumptions together ensure that $D_{i,1} \ge D_{i,0}$ for all i and that the instrument has an effect on D, such that $D_{i,1} > D_{i,0}$ for at least some units. When college proximity is used as an instrument to identify the returns to attending college, monotonicity requires that any child who would not have attended college if living close to a college would also not have done so if living far from a college. The existence assumption requires that the college attendance decision depends, at least for some children, on the proximity to the nearest college.

[Assumption 3: Unconfounded instrument] The relative size of the subpopulations of always-takers, never-takers and compliers is independent of the instrument:

$\Pr(T_i = t \mid Z_i = 0) = \Pr(T_i = t \mid Z_i = 1)$ for $t \in \{a, n, c\}$.

Assumption 3 allows us to identify the effect of Z on D and to estimate the fraction of compliers. It requires that the fraction of always-takers, never-takers and compliers is independent of the instrument.
[Assumption 4: Mean exclusion restriction] The potential outcomes are mean independent of the instrumental variable Z in each subpopulation:

$E[Y^0_{i,Z_i} \mid Z_i = 0, T_i = t] = E[Y^0_{i,Z_i} \mid Z_i = 1, T_i = t]$ for $t \in \{n, c\}$
$E[Y^1_{i,Z_i} \mid Z_i = 0, T_i = t] = E[Y^1_{i,Z_i} \mid Z_i = 1, T_i = t]$ for $t \in \{a, c\}$.

Assumption 4 rules out a direct effect of Z on Y. Any effect of Z should be channeled through D, such that the potential outcomes are not correlated with the instrument.

To gain a better intuition, one could think of Assumption 4 as containing two assumptions: an unconfounded instrument and an exclusion restriction. Take the first condition

$E[Y^0_{i,0} \mid Z_i = 0, T_i = t] = E[Y^0_{i,1} \mid Z_i = 1, T_i = t]$ for $t \in \{n, c\}$

and consider splitting it up into two parts (Assumptions 4a and 4b):77

$E[Y^0_{i,0} \mid Z_i = 0, T_i = t] = E[Y^0_{i,1} \mid Z_i = 0, T_i = t] = E[Y^0_{i,1} \mid Z_i = 1, T_i = t]$ for $t \in \{n, c\}$.

77 Obviously, the following assumption is stronger than the previous one, such that it is not strictly necessary. It helps, though, to gain intuition into what these assumptions mean and how they can be justified in applications.

The …rst part is like an exclusion restriction on the individual level and would be satis…ed
0 = Y 0 . It is assumed that the potential outcome Y 1
e.g. if Yi;0 i;1 i;Zi for unit i is una¤ected by
1 = Y 1 , college proximity itself has no direct
an exogenous change in Zi . For example, if Yi;0 i;1

e¤ect on the child’s wages in its later career. This thus rules out any systematic impact of Z
on the potential outcomes on a unit level. The second part represents an unconfoundedness
0 is identically
assumption on the population level. It assumes that the potential outcome Yi;1
distributed in the subpopulation of units for whom the instrument Zi is observed to have the
value zero and in the subpopulation of units where Zi is observed to be one. This assumption
rules out selection e¤ects that are related to the potential outcomes: The families who decided
to reside close to a college should be identical in all characteristics (that a¤ect their children’s
subsequent wages) to the families who decided to live far from a college. Thus, whereas the
second part refers to the composition of units for whom Z = 1 or Z = 0 is observed, the …rst
part of the assumption refers to how the instrument a¤ects the outcome Y of a particular unit.

The second part of the assumption is trivially satis…ed if the instrument Z is randomly
assigned. Nevertheless randomization of Z does not guarantee that the exclusion assumption
holds on the unit level (Assumption 4a). On the other hand, if Z is chosen by the unit itself,
selection e¤ects may often invalidate Assumption 4b: The families who decide to reside nearer
to or farther from a college might be rather di¤erent, for example, if the districts where colleges
are located also o¤er other job opportunities. In this case it is necessary to include in X
0
all variables that a¤ect the choice of residence Z as well as the potential outcomes Yi;Z and
i
Markus Frölich 4. Selection on unobserved - nonparametric IV - triangular model 111

1 .787980
Yi;Zi

78 Consider the three examples: (1) Here distance to the training site may act as an instrument. One could be worried about differences in the population mix by distance to the training centre, which could violate the unconfoundedness assumptions (3 and 4). The location might also entail access to other services, which could violate the exclusion restriction. The monotonicity assumption appears plausible. LATE would be the effect on those who take part in training only because they live nearby. These may not be the individuals who benefit most from training. They may demonstrate less of a commitment to take up training than those who would take part in training irrespective of the distance. On the other hand, this may often be the subpopulation that can most easily be induced or encouraged to take up training, i.e. that is most reactive to external incentives or subsidies.

79 (2) The cost of adopting the new technology might depend on a subsidy or a regulatory feature of the institutional environment the firm operates in. Let the cost of adopting the technology be a function $C(d, z)$ such that the firm's decision problem is

$D_i = \arg\max_{d \in \{0,1\}} \; p Y^d_{i,Z_i} - C(d, z)$.

Notice that the cost enters the choice problem of the firm but that the object we are interested in, i.e. the production function, is not affected by it. This is important for identification. We may be able to use the subsidy as an instrument to identify the effect of technology on output. However, we cannot use it to identify the effect of technology on profits or stock prices, since the subsidy itself changes the profits. For identification we need variation in the level of Z. The unconfoundedness assumption requires that the mechanism that generated this variation in Z be related neither to the production function of the firms nor to their decision rule. A violation of these assumptions could arise, e.g., if particular firms are granted a more generous subsidy after lobbying for favourable environments. If firms that are more likely to adopt the new technology only if subsidized are able to lobby for a higher subsidy, then the fraction of compliers would be higher among firms that obtained a higher subsidy than among those that did not. The monotonicity assumption is satisfied if $C(d, z)$ is decreasing in z. The LATE is the effect of technology on those firms which only adopt the new technology because of the subsidy. It could be plausible that the effect for the always-takers is larger than this LATE, and that the effect on never-takers would be smaller.

80 (3) In this clinical trial, the unconfoundedness of the instrument is guaranteed by formal randomization. If all individuals complied with their assignment, the treatment effect could be estimated by simple comparisons of means. With non-compliance, the ITT effect of Z on Y can still be estimated, but this does not correspond to the treatment effect of D on Y anymore. With one-sided non-compliance ($D_{i,0} = 0$ for all individuals), monotonicity is automatically satisfied. The exclusion restriction further requires that the assignment status itself has no direct effect on health, which could well arise e.g. through psychological effects on the side of the patient or the physician because of the awareness of assignment status. (This is the reason for double-blind placebo trials.) Notice that with one-sided non-compliance, the group of always-takers does not exist and that therefore LATE and ATET are identical; see below.

One implication of the mean exclusion restriction is that it implies unconfoundedness of D in the complier subpopulation, because $D_i = Z_i$ for a complier:

$E[Y^0_{i,Z_i} \mid D_i = 0, T_i = c] = E[Y^0_{i,Z_i} \mid D_i = 1, T_i = c]$
$E[Y^1_{i,Z_i} \mid D_i = 0, T_i = c] = E[Y^1_{i,Z_i} \mid D_i = 1, T_i = c]$.

Hence, in the complier subpopulation D is no longer confounded with the potential outcomes. If one were able to observe the type, one could retain only the complier subpopulation and could use a simple comparison of means (as with experimental data, as discussed in Chapter 1) to estimate the treatment effect. In other words, the type is the missing covariate that one would need to control for in order to use a selection-on-observables approach, i.e. that needs to be adjusted for in order to remove biases in comparisons of treated and control units. (This type is what we try to identify in Section 3.5.)

We do not observe the type, but the average treatment effect on the compliers can still be obtained, based on the observation that the intention-to-treat effect as well as the size of the complier subpopulation can be identified. Consider the following expression:

$E[Y_i \mid Z_i = z] = E[Y^{D_i}_{i,Z_i} \mid Z_i = z, T_i = n] \Pr(T_i = n \mid Z_i = z)$
$\quad + E[Y^{D_i}_{i,Z_i} \mid Z_i = z, T_i = c] \Pr(T_i = c \mid Z_i = z)$
$\quad + E[Y^{D_i}_{i,Z_i} \mid Z_i = z, T_i = d] \Pr(T_i = d \mid Z_i = z)$
$\quad + E[Y^{D_i}_{i,Z_i} \mid Z_i = z, T_i = a] \Pr(T_i = a \mid Z_i = z)$
$= E[Y^0_{i,Z_i} \mid Z_i = z, T_i = n] \Pr(T_i = n)$
$\quad + E[Y^{D_i}_{i,Z_i} \mid Z_i = z, T_i = c] \Pr(T_i = c)$
$\quad + E[Y^{D_i}_{i,Z_i} \mid Z_i = z, T_i = d] \Pr(T_i = d)$
$\quad + E[Y^1_{i,Z_i} \mid Z_i = z, T_i = a] \Pr(T_i = a)$

by Assumption 3 and the definition of the types.

By the mean exclusion restriction (Assumption 4), the potential outcomes are independent of Z in the always-taker and never-taker subpopulations. Hence, when taking the difference $E[Y \mid Z = 1] - E[Y \mid Z = 0]$, the respective terms for the always- and never-takers cancel, such that

$E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]$
$= \big( E[Y^{D_i}_{i,Z_i} \mid Z_i = 1, T_i = c] - E[Y^{D_i}_{i,Z_i} \mid Z_i = 0, T_i = c] \big) \Pr(T_i = c)$
$\quad + \big( E[Y^{D_i}_{i,Z_i} \mid Z_i = 1, T_i = d] - E[Y^{D_i}_{i,Z_i} \mid Z_i = 0, T_i = d] \big) \Pr(T_i = d)$
$= \big( E[Y^1_{i,Z_i} \mid Z_i = 1, T_i = c] - E[Y^0_{i,Z_i} \mid Z_i = 0, T_i = c] \big) \Pr(T_i = c)$
$\quad + \big( E[Y^0_{i,Z_i} \mid Z_i = 1, T_i = d] - E[Y^1_{i,Z_i} \mid Z_i = 0, T_i = d] \big) \Pr(T_i = d)$.

Exploiting the mean exclusion restriction for the compliers (and defiers) gives

$= E[Y^1_{i,Z_i} - Y^0_{i,Z_i} \mid T_i = c] \Pr(T_i = c) - E[Y^1_{i,Z_i} - Y^0_{i,Z_i} \mid T_i = d] \Pr(T_i = d)$. (27)

This is a weighted average of the treatment effect on the compliers and the defiers. The difference $E[Y \mid Z = 1] - E[Y \mid Z = 0]$ thus represents the difference between the average treatment effect on the compliers (who switch into treatment as a reaction to a change in the instrument from 0 to 1) and the average treatment effect on the defiers (who switch out of treatment). An estimate of (27) is not very informative since, for example, an estimate of zero could be the result of a treatment without effect. But it could also be the result of a treatment with a large impact but offsetting flows of compliers and defiers. Hence, the exclusion restriction is not sufficient to isolate a meaningful treatment effect. However, as (27) indicates, a treatment effect could be identified if either no compliers ($\Pr(T_i = c) = 0$) or no defiers ($\Pr(T_i = d) = 0$) existed. If an instrumental variable is found that "moves" all individuals in the same direction, e.g. that either induces individuals to switch into participation or leaves their participation status unchanged, but does not induce any individual to switch out of treatment, the average treatment effect on the responsive subpopulation is identified.
Now supposing that no defiers exist (Assumption 1) gives

$E[Y^1 - Y^0 \mid T = c] = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{\Pr(T = c)}$.

Noticing that $E[D \mid Z = 0] = \Pr(D = 1 \mid Z = 0) = \Pr(T = a) + \Pr(T = d)$ and $E[D \mid Z = 1] = \Pr(D = 1 \mid Z = 1) = \Pr(T = a) + \Pr(T = c)$, and using that $\Pr(T = d) = 0$ by Assumption 1, the relative size of the subpopulation of compliers is identified as

$\Pr(T = c) = E[D \mid Z = 1] - E[D \mid Z = 0]$,

and it follows that the average treatment effect in the subpopulation of compliers is

$LATE = E[Y^1 - Y^0 \mid T = c] = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]}$.

This is the effect of the treatment in the subpopulation of compliers, which is the subpopulation reacting to the instrument. This subpopulation is unknown; only its size is identified. Hence, only the effect of D in the complier subpopulation is identified. Any approach that identifies more needs stronger assumptions. The LATE can be estimated by

$\frac{\widehat{E[Y \mid Z = 1]} - \widehat{E[Y \mid Z = 0]}}{\widehat{E[D \mid Z = 1]} - \widehat{E[D \mid Z = 0]}}$,

which is also called the Wald estimator, since Wald (1940) suggested this particular estimator, although in a rather different context. One can easily show that this estimator is identical to the 2SLS estimator for this setup with D and Z binary.81

81 Perhaps you may want to show this.
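
A small numerical sketch (the function name and the data-generating process are made up for illustration) that computes the Wald estimator from the four sample means and, picking up the remark in the footnote, verifies numerically that it coincides with the covariance-ratio form of the linear IV estimator:

```python
import numpy as np

def wald_late(y, d, z):
    """Sketch of the Wald estimator of LATE with binary D and Z."""
    itt = y[z == 1].mean() - y[z == 0].mean()             # ITT effect of Z on Y
    complier_share = d[z == 1].mean() - d[z == 0].mean()  # Pr(T = c)
    return itt / complier_share

# Toy data with one-sided non-compliance; the true complier effect is 2,
# and v confounds D and Y so that a naive comparison would be biased.
rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, n)
v = rng.uniform(size=n)
d = ((z == 1) & (v < 0.6)).astype(int)
y = 2.0 * d + v + rng.normal(size=n)
print(wald_late(y, d, z))                       # close to 2
print(np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1])  # identical: the IV slope
```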

Three comments: First, if $Z_i = 0$ rules out obtaining the treatment, in the sense that $D_{i,0} = 0$ for every individual, then the group of always-takers does not exist. Since the only remaining subpopulations are the never-takers and the compliers, those individuals who are observed with $D_i = 1$ must be compliers, and therefore the set of individuals with $D_i = 1$ is identical to the set of individuals with $T_i = c$, which implies that

$E[Y^1 - Y^0 \mid D = 1] = E[Y^1 - Y^0 \mid T = c]$.

Hence, in this situation the average treatment effect on the treated (ATET) is identified. This is e.g. the case in the example above with one-sided non-compliance: individuals assigned to the control group cannot gain access to the new drug. In this situation, where always-participation can be ruled out by the assignment mechanism, the ATET is identified.

Second, if Assumption 1 is invalid and the existence of defiers cannot be ruled out, the Wald estimator does usually not estimate the LATE consistently, because by inserting the above we obtain:

$\frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[D \mid Z = 1] - E[D \mid Z = 0]} = \frac{E[Y^1 - Y^0 \mid T = c] \Pr(T = c) - E[Y^1 - Y^0 \mid T = d] \Pr(T = d)}{\Pr(T = c) - \Pr(T = d)}$

Hence, it estimates a weighted average of the complier and defier treatment effects, but with one of the weights being negative. This is because a change in Z moves some individuals into treatment but other individuals out of treatment. The effects on those moving out of treatment and those moving into treatment may cancel each other. In fact, if the complier and defier treatment effects are identical and both subpopulations are of the same size, the observable net effect is exactly zero even though each treatment effect could be large. Hence, the existence of defiers can bias the above Wald estimator, depending on the size of the defier population. (This is discussed in more detail in Angrist, Imbens, and Rubin (1996).) Nevertheless, if the true complier and defier treatment effects are identical and if the two subpopulations are not of equal size, then the Wald estimator would still be consistent, as can easily be seen from the above equation.

Third, the problem of weak instruments is already visible from the formula of the Wald estimator, because we are dividing the ITT effect $E[Y \mid Z = 1] - E[Y \mid Z = 0]$ by $E[D \mid Z = 1] - E[D \mid Z = 0]$. If the instrument has only a weak correlation with D, then $E[D \mid Z = 1]$ and $E[D \mid Z = 0]$ will not differ much and the denominator is close to zero. Dividing by something close to zero can lead to very imprecise estimates.82

82 This last point refers to the statistical properties of estimators, to be discussed later, and not to issues of identification.

4.2.1 Identifying the potential distributions for compliers

This instrumental variable setup actually permits us not only to estimate the average treatment effect for the compliers but also the distributions of the potential outcomes for the compliers,

$F_{Y^1 \mid T = c}$ and $F_{Y^0 \mid T = c}$.

For identifying distributions, we need to strengthen the above assumptions somewhat in that we replace Assumptions 3 and 4 by

$(Y^0, Y^1, D_0, D_1) \perp Z$ (28)

where $Y^1_i = Y^1_{i,Z_i}$. This requires that Z is confounded neither with $D_0, D_1$ nor with the potential outcomes, and also that Z has no direct effect on the potential outcomes.

Using similar derivations as before, one can show that the potential outcome distributions for the compliers are identified as

$F_{Y^1 \mid c}(u) = \frac{\int \big( E[1(Y \le u) D \mid X, Z = 1] - E[1(Y \le u) D \mid X, Z = 0] \big) \, dF_X}{\int \big( E[D \mid X, Z = 1] - E[D \mid X, Z = 0] \big) \, dF_X}$

$F_{Y^0 \mid c}(u) = \frac{\int \big( E[1(Y \le u)(D - 1) \mid X, Z = 1] - E[1(Y \le u)(D - 1) \mid X, Z = 0] \big) \, dF_X}{\int \big( E[D \mid X, Z = 1] - E[D \mid X, Z = 0] \big) \, dF_X}$.
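
A hedged sketch for the special case without covariates, where the integrals over $F_X$ drop out and the conditional expectations reduce to subgroup means (all names are illustrative):

```python
import numpy as np

def complier_cdfs(y, d, z, u_grid):
    """Sketch of the complier potential-outcome cdfs F_{Y1|c} and F_{Y0|c}
    in the no-covariate case. y is the outcome, d and z are 0/1 arrays."""
    d = d.astype(float)
    pc = d[z == 1].mean() - d[z == 0].mean()  # complier share Pr(T = c)
    f1, f0 = [], []
    for u in u_grid:
        ind = (y <= u).astype(float)
        f1.append(((ind * d)[z == 1].mean() - (ind * d)[z == 0].mean()) / pc)
        f0.append(((ind * (d - 1))[z == 1].mean()
                   - (ind * (d - 1))[z == 0].mean()) / pc)
    return np.array(f1), np.array(f0)
```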

Instead of estimating the cdfs of the potential outcomes in this way, Frölich and Melly (2007) use weighted quantile regression, which will be discussed in a later chapter.83

83 Abadie, Angrist, and Imbens (2002) examine conditional quantile treatment effects.

For interpreting the quantile treatment effects, it should be kept in mind that we have not imposed any rank invariance assumption on the function $\varphi$. Hence, we are not imposing that an individual who is at, say, the 90th percentile of the $Y^1$ distribution would also be at the 90th percentile of the $Y^0$ distribution (in the complier subpopulation). Even if the distribution of $Y^1$ first-order stochastically dominates that of $Y^0$ (e.g. a rightward shift in wages), it is not certain that the wage of every individual would increase. Someone at the 90th percentile of the $Y^0$ distribution might be at the 20th percentile of the $Y^1$ distribution. Hence, differences in the distributions do not provide a distribution of the individual treatment effects.84 For being able to interpret quantile treatment effects as individual treatment effects, we would need some monotonicity assumption on $\varphi(\cdot, \cdot, U_i)$, which we discuss in Chapter 5.

84 They could help to bound the distribution of the individual treatment effects (Fréchet 1951), but these bounds are often wide and uninformative.

4.3 LATE with covariates

- Frölich (2007, Journal of Econometrics)

As alluded to several times above, often the instrumental variables assumptions are not valid in the population, but may become valid after conditioning on some covariates X. In the distance-to-college example, it appears unreasonable that those living close to a college and those living far from a college are identical in terms of their characteristics. Deliberate residential choice by their parents is likely to lead to confounding between Z and other characteristics of the individuals. Their choice might be related to characteristics that affect their children's subsequent wages directly. In addition, cities with a university may also have other facilities that might improve earnings capacity (e.g. city size might matter). However, if we are able to condition on parental characteristics, we might be able to intercept (or "block") all confounding paths between Z and U and/or V, and might also be able to intercept all directed paths of Z to Y.85,86

85 Parental education is another example of an instrumental variable that is often used to identify the returns to schooling. It may appear reasonable to assume that parental schooling itself has no direct impact on the children's wages. Nevertheless, it is likely to be correlated with the parents' profession, family income and wealth, which may directly affect the wage prospects of their offspring.

86 This section is based on Frölich (2007a).

The following graphs shall help to illustrate the crucial conditions needed for IV identification and relate them to our discussion of selection on observables. A more thorough discussion follows later. The first graph shows the situation where neither matching estimation nor IV identification is possible. IV identification is not possible since Z has a direct impact on Y and also because Z is related to the unobservables U and/or V.

Y
D
Z

V U
X

A crucial assumption for identification will be

(Y^d, T) ⊥⊥ Z | X  a.s. for d = 0, 1,   (29)

although some kind of mean independence would often suffice as well. Consider first a situation where Z has no direct impact on Y. In both graphs (a) and (b) below the independence assumption (29) is satisfied conditional on X. The difference between the two graphs is that in (a) X is exogenous whereas in (b) X may be correlated with V and/or U. We will see later that nonparametric identification can be obtained in both situations. On the other hand, estimation based on a parametric model, e.g. 2SLS, usually requires exogenous X. In situation (b) with endogenous X, 2SLS will be inconsistent, whereas nonparametric approaches remain consistent.

[85] Parental education is another example of an instrumental variable that is often used to identify the returns to schooling. It may appear reasonable to assume that parental schooling itself has no direct impact on the children's wages. Nevertheless, it is likely to be correlated with the parents' profession, family income and wealth, which may directly affect the wage prospects of their offspring.
[86] This section is based on Frölich (2007a).

[Figure: graphs (a) and (b). In both, Z affects Y only via D, with unobservables V (affecting D) and U (affecting Y) and covariates X affecting D and Y. In (a) X is exogenous; in (b) X is correlated with V and/or U.]

Now we add the possibility that Z might have a direct effect on Y via the variables X2. In the left graph (a) we can achieve (29) if we condition on X1 and X2. Hence, we can control for variables that confound the instrument and also for those which lie on a mediating causal path other than via D. There is one further distinction between X1 and X2, though: whereas X1 is permitted to be endogenous, X2 is not. This can be seen in graph (b) below. If we condition on X2 there, we would unblock the path Z → X2 ← U and the path Z → X2 ← W2, thereby introducing another confounding link between Z and the outcome variable. On the other hand, if we do not condition on X2, the instrument Z has an effect on Y other than via D (namely via X2) and would thus not satisfy (29). Hence, whereas the X1 are permitted to be endogenous, the X2 must be exogenous.

[Figure: graphs (a) and (b) with covariates X1 and X2, where Z affects Y also via X2, and with unobservables V, U, W and W2; in (b) the unobservables U and W2 point into X2 as well, so that conditioning on X2 would open confounding paths.]

Hence, introducing covariates may serve four different purposes here:

1. To control for potential confounders of the instrument Z.

2. To intercept or block all mediating causal paths between Z and Y, i.e. all paths other than via D.

3. To increase efficiency, as will be discussed later.

4. To separate total effects from partial effects. (This will not be discussed further here.)

Now we introduce some additional notation. Consider again the case where the endogenous regressor D ∈ {0, 1} and the instrument Z ∈ {0, 1} are both binary. (Extensions to non-binary D and non-binary Z are discussed later.) The notation is extended to incorporate a vector of covariates X as:

Y_i = φ(D_i, Z_i, X_i, U_i)  with  Y^d_{i,z} = φ(d, z, X_i, U_i)  and  Y^{D_i}_{i,Z_i} = Y_i
D_i = ζ(Z_i, X_i, V_i)  with  D_{i,z} = ζ(z, X_i, V_i)  and  D_{i,Z_i} = D_i

The previous instrumental variable conditions are assumed to hold conditional on X.[87] The extension to incorporate covariates does not affect the definition of the compliance types, which is as before.

[Assumption 1: Monotonicity] The subpopulation of defiers has probability measure zero:

Pr(D_{i,0} > D_{i,1}) = 0.

[Assumption 2: Existence of compliers] The subpopulation of compliers has positive probability:

Pr(D_{i,0} < D_{i,1}) > 0.

Assumptions 1 and 2 rule out the existence of subpopulations that are affected by the instrument in opposite directions. Monotonicity ensures that the effect of Z on D has the same direction for all units.[88]
[87] If D is also a cause of X, in the sense that changing D would also imply a change in X, only the direct effect of D on Y would be recovered with the following identification strategy, but not the total effect (i.e. including the effect of D on Y that is channelled through X). This also requires that conditioning on X does not introduce any dependencies and new confounding paths.
[88] In fact, monotonicity is only required conditional on X, and the direction of the monotonicity could in principle differ for different values of X, i.e. for some values only defiers exist, for others only compliers.

[Assumption 3: Unconfounded instrument] The relative sizes of the subpopulations of always-takers, never-takers and compliers are independent of the instrument: For all x ∈ Supp(X)

Pr(T_i = t | X_i = x, Z_i = 0) = Pr(T_i = t | X_i = x, Z_i = 1)  for t ∈ {a, n, c}.

Validity of Assumption 3 requires that the vector X contains all variables that affect the choice of residence Z as well as the type T. Without conditioning on covariates X this assumption may often be invalidated by selection effects, unless the instrument Z is randomly assigned. For example, parents who would like their children to attend college, but who could not prevent their children from deciding not to go if they lived too far away, might decide to reside closer to a college. In this case, the subpopulation living close to a college would contain a higher fraction of compliers than the subpopulation living far away.
[Assumption 4: Mean exclusion restriction] The potential outcomes are mean independent of the instrumental variable Z in each subpopulation: For all x ∈ Supp(X)

E[Y^0_{i,Z_i} | X_i = x, Z_i = 0, T_i = t] = E[Y^0_{i,Z_i} | X_i = x, Z_i = 1, T_i = t]  for t ∈ {n, c}
E[Y^1_{i,Z_i} | X_i = x, Z_i = 0, T_i = t] = E[Y^1_{i,Z_i} | X_i = x, Z_i = 1, T_i = t]  for t ∈ {a, c}.

Assumption 4 rules out a direct effect of Z on Y. Conditional on X, any effect of Z should be channelled through D such that the potential outcomes are not correlated with the instrument. Without conditioning on X, this assumption may often be invalid, for example, if college proximity itself has a direct effect on the child's wages in his or her later career, or if the families who decided to reside close to a college are different from those who decided to live far from one. One may therefore wish to include characteristics of the location that may be correlated with the presence of a college, e.g. information on other educational facilities or their quality.

Since below we are going to be interested in estimating some kind of average complier effect, we impose an additional assumption:
[Assumption 5: Common support] The support of X is identical in both subpopulations:

Supp(X | Z = 1) = Supp(X | Z = 0).

Assumption 5 requires that for any value of X (in its support) both values of the instrument Z can be observed. An equivalent representation of the common support condition is that 0 < π(x) < 1 for all x with f_X(x) > 0, where π(x) = Pr(Z = 1 | X = x).

With these assumptions, the local average treatment effect Δ(x) is identified for all x with Pr(T = c | X = x) > 0, where the proof is analogous to the proof given before without covariates X:

Δ(x) = E[Y^1 − Y^0 | X = x, T = c] = (E[Y | X = x, Z = 1] − E[Y | X = x, Z = 0]) / (E[D | X = x, Z = 1] − E[D | X = x, Z = 0]).
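To see where this expression comes from, here is a one-line sketch under Assumptions 1 to 4, paralleling the earlier derivation without covariates: always-takers and never-takers contribute the same conditional mean outcome under Z = 1 and Z = 0, and defiers have measure zero, so that

E[Y | X = x, Z = 1] − E[Y | X = x, Z = 0] = E[Y^1 − Y^0 | X = x, T = c] · Pr(T = c | X = x),

and dividing by Pr(T = c | X = x) = E[D | X = x, Z = 1] − E[D | X = x, Z = 0] yields Δ(x).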

Although we have identified the LATE for every value of X, in policy applications we might sometimes be interested in obtaining an average effect for the whole population. Particularly if X contains many variables, there would be very many different treatment effects Δ(x) to be estimated and interpreted. In addition, if X contains continuous regressors, the estimates might be rather imprecise and we would not be able to attain √n convergence. Therefore we might be interested in some kind of average effect, similar to the previous chapter.

One possibility would be to weight Δ(x) by the population distribution of X, which would give as an average treatment effect

∫ Δ(x) dF_X = ∫ [(E[Y | X = x, Z = 1] − E[Y | X = x, Z = 0]) / (E[D | X = x, Z = 1] − E[D | X = x, Z = 0])] f_X(x) dx.   (30)

This approach may be problematic in two respects. First, the estimates of the ratio (E[Y | X, Z = 1] − E[Y | X, Z = 0]) / (E[D | X, Z = 1] − E[D | X, Z = 0]) could sometimes be very imprecise, particularly if X contains several continuous variables. The nonparametrically estimated denominator Ê[D | X, Z = 1] − Ê[D | X, Z = 0] might often be close to zero, thus leading to very large estimates of Δ(x). In addition, the above weighting scheme represents a mixture of the effects on compliers and always-/never-takers that might be hard to interpret: Δ(x) is the effect for compliers with characteristics x, whereas dF_X is the distribution of x in the entire population (consisting of compliers, always-takers and never-takers).

As an alternative it might be interesting to examine the effect in the subpopulation of all compliers, which is in fact the largest subpopulation for which a treatment effect is identified without further assumptions. This treatment effect on all compliers is

E[Y^1 − Y^0 | T = c] = ∫ E[Y^1 − Y^0 | X = x, T = c] dF_{X|T=c} = ∫ Δ(x) dF_{X|T=c},

where F_{X|T=c} denotes the distribution function of X in the subpopulation of all compliers.

This distribution is not directly identified, since the subpopulation of compliers is not identified. However, by Bayes' theorem dF_{X|T=c} = [Pr(T = c | X) / Pr(T = c)] dF_X, which gives

E[Y^1 − Y^0 | T = c] = ∫ Δ(x) [Pr(T = c | X = x) / Pr(T = c)] f_X(x) dx.

Furthermore, notice that the size of the complier subpopulation with characteristics x is identified as

Pr(T = c | X = x) = E[D | X = x, Z = 1] − E[D | X = x, Z = 0].   (31)

Now inserting the formula for Δ(x) given above gives

E[Y^1 − Y^0 | T = c] = (1 / Pr(T = c)) ∫ (E[Y | X = x, Z = 1] − E[Y | X = x, Z = 0]) f_X(x) dx,

and using that Pr(T = c) = ∫ Pr(T = c | X = x) f_X(x) dx together with (31) gives

E[Y^1 − Y^0 | T = c] = ∫ (E[Y | X = x, Z = 1] − E[Y | X = x, Z = 0]) f_X(x) dx / ∫ (E[D | X = x, Z = 1] − E[D | X = x, Z = 0]) f_X(x) dx.   (32)

By Assumption 5 the conditional expectations are identified in the Z = 1 and Z = 0 subpopulations.

Besides being a well-defined treatment effect, the formula (32) has two nice properties. First, instead of being an integral of a ratio, it is a ratio of two integrals, which reduces the risk of very small denominators. Second, the expression (32) corresponds to a ratio of two matching estimators, which have been examined in detail in Chapter 2.

Defining the conditional mean functions m_z(x) = E[Y | X = x, Z = z] and μ_z(x) = E[D | X = x, Z = z], a nonparametric estimator of E[Y^1 − Y^0 | T = c] is

Σ_i (m̂_1(X_i) − m̂_0(X_i)) / Σ_i (μ̂_1(X_i) − μ̂_0(X_i)),

where m̂_z(x) and μ̂_z(x) are corresponding nonparametric regression estimators. Alternatively, we could use the observed values Y_i and D_i as estimates of E[Y_i | X_i, Z = z] and E[D_i | X_i, Z = z] whenever z = Z_i, which gives the estimator:

τ̂ = [Σ_{i:Z_i=1} (Y_i − m̂_0(X_i)) − Σ_{i:Z_i=0} (Y_i − m̂_1(X_i))] / [Σ_{i:Z_i=1} (D_i − μ̂_0(X_i)) − Σ_{i:Z_i=0} (D_i − μ̂_1(X_i))].   (33)

In Frölich (2007a) the asymptotic distribution of τ̂ is derived, and the semiparametric efficiency bound for the estimation of the local average treatment effect is calculated. It is shown that √n-consistency and efficiency can be attained with several different nonparametric estimators of m_z(x) and μ_z(x).
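To fix ideas, the following minimal sketch implements (33) with a simple Nadaraya-Watson smoother. The function names, the Gaussian product kernel and the fixed bandwidth are illustrative choices assumed here for concreteness, not the estimators analyzed in Frölich (2007a):

```python
# Sketch of estimator (33); x is an (n, k) covariate matrix, y, d, z are
# length-n arrays with z, d binary. Kernel and bandwidth are illustrative.
import numpy as np

def nw_fit(x_train, y_train, x_eval, h):
    """Nadaraya-Watson regression of y on x, evaluated at x_eval."""
    d2 = ((x_eval[:, None, :] - x_train[None, :, :]) / h) ** 2
    w = np.exp(-0.5 * d2.sum(axis=2))        # (n_eval, n_train) kernel weights
    return (w @ y_train) / w.sum(axis=1)

def late_matching(y, d, z, x, h=0.5):
    """Estimator (33): own-group observations enter directly; only the
    opposite-instrument conditional means are estimated nonparametrically."""
    z1, z0 = (z == 1), (z == 0)
    m0 = nw_fit(x[z0], y[z0], x[z1], h)      # m_0(X_i) at Z_i = 1 points
    m1 = nw_fit(x[z1], y[z1], x[z0], h)      # m_1(X_i) at Z_i = 0 points
    mu0 = nw_fit(x[z0], d[z0], x[z1], h)     # mu_0(X_i) at Z_i = 1 points
    mu1 = nw_fit(x[z1], d[z1], x[z0], h)     # mu_1(X_i) at Z_i = 0 points
    num = (y[z1] - m0).sum() - (y[z0] - m1).sum()
    den = (d[z1] - mu0).sum() - (d[z0] - mu1).sum()
    return num / den
```

In practice one would of course choose the bandwidth h by cross-validation rather than fixing it.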

LATE estimation with the propensity score

Having seen that the treatment effect on all compliers can be estimated by a ratio of two matching estimators, we already suspect that some kind of propensity score matching or propensity score weighting approach should also be available. Define the "propensity score" as

π(x) = Pr(Z = 1 | X = x).

We can show that E[Y^1 − Y^0 | T = c] can also be expressed as

E[Y^1 − Y^0 | T = c] = E[YZ/π(X) − Y(1 − Z)/(1 − π(X))] / E[DZ/π(X) − D(1 − Z)/(1 − π(X))]   (34)

and estimated by the propensity score weighting estimator

τ̂_w = Σ_i [Y_iZ_i/π(X_i) − Y_i(1 − Z_i)/(1 − π(X_i))] / Σ_i [D_iZ_i/π(X_i) − D_i(1 − Z_i)/(1 − π(X_i))].   (35)

For deriving the propensity score matching estimator, note that it can be shown that E[Y^1 − Y^0 | T = c] can be written as

E[Y^1 − Y^0 | T = c] = ∫ (m̄_1(ρ) − m̄_0(ρ)) f_π(ρ) dρ / ∫ (μ̄_1(ρ) − μ̄_0(ρ)) f_π(ρ) dρ,   (36)

where m̄_1(ρ) = E[Y | π(X) = ρ, Z = 1] and μ̄_1(ρ) = E[D | π(X) = ρ, Z = 1] (and analogously for Z = 0) are the conditional means given the propensity score, and f_π is the density function of π(X) in the population. The propensity score matching estimator is thus

τ̂_m = Σ_i (m̄̂_1(π_i) − m̄̂_0(π_i)) / Σ_i (μ̄̂_1(π_i) − μ̄̂_0(π_i)),   (37)

where π_i = π(X_i). An advantage of propensity score matching τ̂_m over τ̂ is that it requires only one-dimensional nonparametric regression.[89] For a recent application of this method see Henderson, Millimet, Parmeter, and Wang (2008).
[89] In many applications, the propensity score π(x) might be unknown and needs to be estimated. Even when the propensity score is known, using an estimated propensity score in the estimator might be worthwhile. Whether knowledge of the true propensity score affects efficient estimation has been analyzed in several articles for matching estimators. Analogous results can be derived here for the estimation of LATE with covariates. In Frölich (2007a) it is shown that the efficiency bound is not affected by knowledge of the propensity score. It is also shown that matching on the known propensity score is inefficient compared to matching on X. If the propensity score is unknown, its estimation would usually add even further to the variance of the propensity score matching estimator. (Although it cannot be precluded that it might be possible to construct a joint estimator of π, m̄_1, m̄_0, μ̄_1 and μ̄_0 with smaller variance. This would, however, require an estimator that estimates the propensity score and the conditional expectation functions simultaneously.) Generally, it can be expected that matching on the propensity score is inefficient.
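A hedged sketch of the weighting estimator (35) follows; the logit first stage for π(x) and the trimming bounds are illustrative assumptions, not part of the formula itself:

```python
# Sketch of the propensity score weighting estimator (35).
import numpy as np
import statsmodels.api as sm

def late_weighting(y, d, z, x):
    # estimate pi(X) = Pr(Z = 1 | X) by logit (an illustrative choice)
    pi = sm.Logit(z, sm.add_constant(x)).fit(disp=0).predict()
    pi = np.clip(pi, 0.01, 0.99)        # crude guard against tiny denominators
    wy = y * z / pi - y * (1 - z) / (1 - pi)
    wd = d * z / pi - d * (1 - z) / (1 - pi)
    return wy.sum() / wd.sum()          # sample analogue of (34)/(35)
```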

By using similar derivations (see Frölich (2007a)) one can also show that the potential outcome distributions are identified as

F_{Y^1|c}(u) = ∫(E[1(Y ≤ u)·D | X, Z = 1] − E[1(Y ≤ u)·D | X, Z = 0]) dF_X / ∫(E[D | X, Z = 1] − E[D | X, Z = 0]) dF_X

F_{Y^0|c}(u) = ∫(E[1(Y ≤ u)·(D − 1) | X, Z = 1] − E[1(Y ≤ u)·(D − 1) | X, Z = 0]) dF_X / ∫(E[D | X, Z = 1] − E[D | X, Z = 0]) dF_X.
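As a hedged sketch of how these cdfs can be computed, the integrals above may be replaced by the same weighting identity that turns (32) into (34), i.e. ∫(E[g | X, Z = 1] − E[g | X, Z = 0]) dF_X = E[gZ/π(X) − g(1 − Z)/(1 − π(X))], which holds by iterated expectations; the function name and the evaluation grid are illustrative:

```python
# Complier outcome cdfs on a grid of u values; pi is an estimate of
# Pr(Z = 1 | X). Weighting form assumed as described in the text above.
import numpy as np

def complier_cdfs(y, d, z, pi, grid):
    w = z / pi - (1 - z) / (1 - pi)            # signed instrument weights
    den = np.sum(w * d)                        # n times estimated Pr(T = c)
    f1 = np.array([np.sum(w * d * (y <= u)) for u in grid]) / den
    f0 = np.array([np.sum(w * (d - 1) * (y <= u)) for u in grid]) / den
    return f1, f0                              # F_{Y^1|c}, F_{Y^0|c} on the grid
```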

ATET and ATE for treated compliers

As shown in Frölich and Lechner (2006), we can not only identify the treatment effect E[Y^1 − Y^0 | T = c] for the compliers but also the effect for the treated compliers. This effect is identified as

E[Y^1 − Y^0 | D = 1, T = c] = ∫ (E[Y | X = x, Z = 1] − E[Y | X = x, Z = 0]) π(x) f_X(x) dx / ∫ (E[D | X = x, Z = 1] − E[D | X = x, Z = 0]) π(x) f_X(x) dx   (38)

or, equivalently in terms of the propensity score,

E[Y^1 − Y^0 | D = 1, T = c] = ∫ (m̄_1(ρ) − m̄_0(ρ)) ρ f_π(ρ) dρ / ∫ (μ̄_1(ρ) − μ̄_0(ρ)) ρ f_π(ρ) dρ.

A weighting-type estimator can probably be derived as well.[90] Why is this interesting? In the situation of one-sided noncompliance only, i.e. where the subpopulation of always-treated does not exist, the treated compliers are the only individuals that are treated. In other words, the average treatment effect on the treated (ATET) is then identified as

E[Y^1 − Y^0 | D = 1] = E[Y^1 − Y^0 | D = 1, T = c].

Note that the formula (38) is different from (32). Hence, with one-sided noncompliance the ATET is not the same as the LATE; instead it is the effect for the treated compliers given by
[90] Still to be worked out in detail.

(38). This is different from the situation without any X: if X is just a constant, the formulas (38) and (32) would be identical. Note that at different locations in the old version of these lecture notes it is sometimes claimed that ATET = LATE if there is one-sided noncompliance. This is NOT correct; it is only true if X is a constant.

IV with non-binary instruments

The discussion so far assumed that the instrument Z is binary. We can easily extend this to a non-binary instrument or to having several instruments, i.e. Z being a vector. Before we discuss this in more detail below, it is interesting to point out the relationship to the overidentification test in linear IV regression. Suppose we have two (binary) instruments Z1 and Z2. In linear IV regression, if we have more instruments than endogenous variables we can use an overidentification test. We would use the two moment conditions implied by Z1 and Z2 and construct the overidentification test. Basically, the test compares the 2SLS estimate obtained from Z1 with the one obtained from Z2 and rejects the assumption that both moment conditions are valid if the two estimates are very different. Now, in the nonparametric setting, we can estimate a LATE using the binary Z1 and also one using the binary Z2. These two estimated LATEs, however, refer to different subpopulations, since the compliers with respect to Z1 are different from the compliers with respect to Z2. Hence, if treatment effects are permitted to be heterogeneous, we simply estimate two effects for two different populations and there is no reason to expect these two estimates to be similar. Based on this insight, we can see the overidentification test of 2SLS in a different light. If the overidentification test rejects, this might simply mean that the treatment effect is different for different subpopulations. In other words, an alternative interpretation of rejections in overidentification tests is that the effects of interest vary, rather than that some of the instruments are invalid. Without assuming homogeneous effects there are in general no tests for the validity of the instruments. (Only if we had two instruments so powerful that everyone would be a complier for some movement in either of the instruments would we obtain two different estimates of the ATE, and if these were very different, we would be very concerned about instrument validity.)

Permitting the instrumental variable Z to be non-binary (e.g. discrete or continuous), we can derive a formula similar to (32) which compares only observations with values of Z_i lying at the endpoints of the support of the instrument. D is still supposed to be binary and we have a single non-binary instrument Z with bounded support Supp(Z) = [z_min, z_max]. Obviously, a local average treatment effect could be defined with respect to any two distinct values of Z. However, this would yield a multitude of pairwise treatment effects, each of them referring to a different population. Instead of estimating many pairwise effects, one would often prefer to estimate the average treatment effect in the largest subpopulation for which an effect can be identified, which is the subpopulation of all individuals who react to the instrument.

Define the subpopulation of compliers as all individuals with D_{i,z_min} = 0 and D_{i,z_max} = 1. The compliers comprise all individuals who switch from D = 0 to D = 1 at some point when the instrument Z is increased from z_min to z_max. The value of z which triggers the switch can be different for different individuals. If monotonicity holds with respect to any two values z and z', each individual switches D at most once. The following assumptions are extensions of Assumptions 1 to 5 to a non-binary instrument.
[Assumption 1’: Monotonicity] The e¤ect of Z on D is monotonous

Pr Di;z > Di;z 0 = 0 for any values z; z 0 with zmin z < z0 zmax .

[Assumption 2’: Existence of compliers] The subpopulation of compliers has positive probability

Pr (T = c) > 0 where Ti = c if Di;zmin < Di;zmax .

[Assumption 3’: Unconfounded instrument] For any two values z; z 0 2 Supp (Z), any d; d0 2
f0; 1g and for all x 2 Supp (X)

Pr Di;z = d; Di;z 0 = d0 jXi = x; Zi = z = Pr Di;z = d; Di;z 0 = d0 jXi = x .

[Assumption 4’: Mean exclusion restriction] For any two values z; z 0 2 Supp (Z), any d; d0 2
f0; 1g and for all x 2 Supp (X)
h i h i
d 0 d 0
E Yi;Zi
jX i = x; Di;z = d; Di;z 0 = d ; Zi = z = E Y
i;Zi
jX i = x; Di;z = d; Di;z 0 = d .

[Assumption 5’: Common support] The support of X is identical for zmin and zmax

Supp (XjZ = zmin ) = Supp (XjZ = zmax ) = Supp (X) .

Given these assumptions it can be shown that the local average treatment e¤ect for the
subpopulation of compliers is nonparametrically identi…ed as
R
1 0 (E [Y jX = x; Z = zmax ] E [Y jX = x; Z = zmin ]) fx (x)dx
E[Y Y jT = c] = R . (39)
(E [DjX = x; Z = zmax ] E [DjX = x; Z = zmin ]) fx (x)dx

This formula is analogous to (32) with Z = 0 and Z = 1 replaced by the endpoints of the support of Z. If Z is discrete, the previous results apply and √n-consistency can be attained. For a continuous instrument, however, √n-consistency usually cannot be achieved, unless Z is mixed continuous-discrete with mass points at z_min and z_max. The intuitive reason is that with continuous Z the probability of observing individuals with Z_i = z_max is zero. Therefore we also have to use observations with Z_i a little smaller than z_max, and for nonparametric regression to be consistent we will need the bandwidth to converge to zero. (A similar situation will appear in the following section on regression discontinuity design.)[91]

Now consider the situation with multiple instrumental variables, i.e. Z being vector valued. Since the different instrumental variables act through their effect on D, the different components of Z can be summarized conveniently by using p(z, x) = Pr(D = 1 | X = x, Z = z) as the instrument. If D follows an index structure in the sense that D_i depends on Z_i only via p(Z_i, X_i),[92] and Assumptions 1' to 5' are satisfied with respect to p(z, x), the local average treatment effect is identified as

E[Y^1 − Y^0 | T = c] = ∫ (E[Y | X = x, p(Z,X) = p̄_x] − E[Y | X = x, p(Z,X) = p̲_x]) f_X(x) dx / ∫ (E[D | X = x, p(Z,X) = p̄_x] − E[D | X = x, p(Z,X) = p̲_x]) f_X(x) dx,   (40)

where p̄_x = max_z p(z, x) and p̲_x = min_z p(z, x). This is equivalent to

E[Y^1 − Y^0 | T = c] = ∫ (E[Y | X = x, p(Z,X) = p̄_x] − E[Y | X = x, p(Z,X) = p̲_x]) f_X(x) dx / ∫ (p̄_x − p̲_x) f_X(x) dx.   (41)

Again, this formula is analogous to (32). The two groups of observations on which estimation is based are those with p(z, x) = p̄_x and those with p(z, x) = p̲_x. In the first representation (40),
[91] From (39) a bias-variance trade-off in the estimation of the local average treatment effect with non-binary Z becomes visible. Although (39) incorporates the proper weighting of the different complier subgroups and leads to an unbiased estimator of the LATE, only observations with Z_i equal (or close) to z_min or z_max are used for estimation. Observations with Z_i between the endpoints z_min and z_max are neglected, which might lead to a large variance. Variance could be reduced, at the expense of a larger bias, by weighting the complier subgroups differently or by choosing larger bandwidth values for the estimators m̂_z(x) and μ̂_z(x). A detailed analysis is left for future research.
[92] I.e. D_{i,z} = D_{i,z'} if p(z, X_i) = p(z', X_i). In other words, D_i does not change if Z_i is varied within a set where p(·, X_i) remains constant. A more explicit discussion will be given in the section on Marginal Treatment Effects below.

exact knowledge of p(z; x) is in fact not needed for estimation; it is su¢ cient to identify the set
of observations for which p(Z; X) is highest and lowest, respectively, and compare their values
of Y and D. In other words, only the ranking with respect to p matters but not the values of
p themselves.93 For example, if Z contains two binary instrumental variables (Z1 ; Z2 ) which,
for any value of X, both are known to have a positive e¤ect on D, then the observations with
Z1 = Z2 = 0 and those with Z1 = Z2 = 1 represent the endpoints of the support of p(Z; X)
given X and are used for the estimation.

4.4 Combination of IV with matching

Frölich and Lechner (2006) show how the previous results on IV identification and selection on observables can be combined. Under the conditions of the previous subsection, the treatment effect for the compliers is identified, and more precisely the two potential outcomes

E[Y^0 | T = c] and E[Y^1 | T = c]

are identified. Additionally, we can also identify the outcomes for the treated compliers, E[Y^0 | D = 1, T = c] and E[Y^1 | D = 1, T = c], and for the nontreated compliers, E[Y^0 | D = 0, T = c] and E[Y^1 | D = 0, T = c]. In fact, we do not only identify the means but also the entire distributions, so any quantile treatment effects are identified as well. One of the crucial assumptions for identification was

Y^d ⊥⊥ Z | X, T = c  a.s. for d = 0, 1.

Since for a complier we have Z = D, we can also write the previous assumption as

Y^d ⊥⊥ D | X, T = c  a.s. for d = 0, 1.   (42)

This is very similar to the selection-on-observables assumption (2) of Chapter 2: Conditional on X, the compliers are randomly selected into D = 0 or D = 1. In other words, if we knew who the compliers are, we could use a matching estimator as discussed in Chapter 2. Suppose for the moment that we could identify who is a complier and apply a selection-on-observables matching estimator to this subpopulation. Frölich and Lechner (2006) show that the resulting expression is exactly identical to (32). In other words, the IV estimator for the compliers is identical to a matching

[93] In equation (41) consistent estimation of p now matters, though.

estimator for the compliers (if we knew who the compliers are). Hence, one of the crucial assumptions for IV is that selection on observables holds within the subpopulation of compliers.

What would we obtain if we assumed that selection on observables holds not only for the compliers but also for the always- and never-participants? I.e., if we have sufficiently informative data, we could strengthen (42) to:

Y^d ⊥⊥ D | X  a.s. for d = 0, 1,   (43)

which is the same as (2). As shown in Frölich and Lechner (2006), under the combination of the IV assumptions and the selection-on-observables assumption for all individuals we can separately identify the following six terms:

E[Y^0 | T = c], E[Y^1 | T = c]
E[Y^0 | T = a], E[Y^1 | T = a]
E[Y^0 | T = n], E[Y^1 | T = n].

This result is helpful in two respects: First, we obtain the average treatment effects E[Y^1 − Y^0] separately for the compliers, the always-participants and the never-participants. This gives some indication of treatment effect heterogeneity. Second, the comparison between E[Y^0 | T = c], E[Y^0 | T = a] and E[Y^0 | T = n] may be helpful to obtain some understanding of what kind of people these groups actually represent. If Y is employment status and we find that E[Y^0 | T = a] < E[Y^0 | T = c] < E[Y^0 | T = n], we would interpret this finding as saying that the never-participants have the best labour market chances (even without treatment) and that the always-participants have worse labour market chances than the compliers. This would help us understand which kind of people belong to the groups a, c and n. (In addition, we can also identify the distributions of X among the always- and never-participants and the compliers, which provides us with additional insights.)

4.5 Regression discontinuity design

Regression discontinuity design (RDD) is an approach frequently used to evaluate interventions where certain, e.g. bureaucratic, rules increase the likelihood of D changing from 0 to 1 discontinuously.[94] Access or incentives for participation in a programme are sometimes based on transparent rules with criteria based on clear cutoff values. We will see that such rules sometimes generate a local instrumental variable, i.e. an instrumental variable that is valid only at a particular threshold value. (We will assume throughout that this threshold value z0 is known and not estimated. One could extend this to an estimated breakpoint z0, but most of the credibility of the design would usually be lost if we did not know z0.) Knowledge of these particular rules can often provide a convincing evaluation design. One should always bear in mind, though, that identification is obtained only for people at the threshold value z0, which may often not be the primary population of interest. (Although sometimes it may be, e.g. when the policy of interest is a marginal change of the threshold z0.) Hence, the RDD may often provide internal validity, but not much external validity.

[94] Much of this discussion is based on articles in the special issue on RDD of the Journal of Econometrics (2008).

Consider an illustrative example: A new education programme is designed to give extra funding to schools with a large share of immigrants. The fraction of immigrant pupils Z is measured per school on a particular day, and all schools with Z larger than some threshold z0, e.g. 70%, receive additional funding, whereas all schools below this threshold do not receive anything. We are interested in the effect of this extra funding D on, e.g., student outcomes Y. The idea of the RDD is to compare the outcomes of schools with Z just below z0 to those of schools with Z just above z0.

Certainly, we cannot use Z as a standard instrumental variable, as we would suspect that Z has a direct effect on school average outcomes Y: if the fraction of immigrant children Z is increased, we would expect Y to decrease. However, when we compare schools very close to the threshold, this direct effect of Z should not matter. The following two graphs illustrate this idea. Whereas the two functions E[Y^0|Z] and E[Y^1|Z] are continuous, the function E[D|Z] jumps at a particular value. Before the threshold E[D|Z] is very low (in the school funding example it should be zero), after the threshold it is much higher. This discontinuity generates a jump in E[Y|Z] in the second graph.

Hence, although Z is not "globally" a valid instrumental variable (since it has a direct impact on Y^0 and Y^1, as visible from the graphs), it can "locally" be a valid instrument if we compare only observations with Z slightly below and slightly above z0.

[Figure: upper panel shows E[Y0|Z] and E[Y1|Z], both continuous in Z; lower panel shows E[D|Z], which jumps at z0, and the resulting jump in E[Y|Z].]

Regression discontinuity can thus be used when a continuous variable Z influences an outcome variable Y and also another variable D, which itself affects the outcome variable Y.[95] Hence, Z has a direct impact on Y as well as an indirect impact on Y via D. This latter impact represents the causal effect of D on Y, which can be identified if the direct and indirect impacts of Z on Y can be told apart. If the direct impact of Z on Y is known to be smooth but the relationship between Z and D is discontinuous, any discontinuities (jumps) in the observed relationship between Z and Y at locations where the relation between Z and D is discontinuous can be attributed to the indirect impact of Z on Y via D.

[95] Lee and Card (2008) examine the situation where Z is measured only as a discrete variable, e.g. age measured in years only. In this case, nonparametric identification is not possible and a parametric specification is required. Various procedures are developed to take specification error into account when drawing inferences.

In the following we first discuss the RDD without any covariates X. In the literature, two different designs are examined: the sharp design, where D_i changes for everyone at the threshold, and the fuzzy design, where D_i changes only for some individuals. In the sharp design (Trochim 1984), participation status

D_i = 1(Z_i ≥ z0)

is a deterministic function of Z_i, i.e. all individuals change programme participation status exactly at z0. This requires a strictly rule-based programme selection process (such as age limits or other eligibility criteria). For example, Hahn, Todd, and van der Klaauw (1999) analyze the effect of antidiscrimination laws on the employment of minority workers by exploiting the fact that only firms with more than 15 employees are subject to these antidiscrimination laws. As another example, Thistlethwaite and Campbell (1960) estimated the effect of receiving a National Merit Award on subsequent career aspirations. Since the Award is only granted if a test score Z exceeds a certain threshold z0, the treatment status D (Award granted: D = 1, not granted: D = 0) depends in a discontinuous way on the test score Z. Since the test score is likely to be affected by unobserved ability, Z cannot be a valid instrument. In a small neighbourhood around the discontinuity at z0, however, the direct impact of Z on the potential outcomes is likely to vary only a little with Z. If the instrumental variable assumptions are satisfied locally, we can identify the causal effect

E[Y^1 − Y^0 | Z = z0].

(The formula is given further below.) This is the treatment effect for the population with test score equal to z0, and this effect may not generalize to the population at large. As before, the conceptual framework is such that we imagine that we could take an individual and hypothetically change D from zero to one. There are two ways one could think about this happening here: either we imagine moving Z a little by external intervention, or we imagine moving the threshold z0 a little by external intervention.

Often, however, the participation decision is not completely determined by Z, even in a rule-based selection process. Case workers may have some discretion about whom they offer a programme, or they may base their decision also on criteria that are unobserved by the econometrician. Additionally, individuals offered a programme may decline participation. In this fuzzy design not all individuals would change programme participation status from D = 0 to D = 1 if Z were increased from z0 − ε to z0 + ε. Rather, the relation between Z and D may be discontinuous at z0 only on average. In the fuzzy design the expected value of D given Z (which is the probability of treatment receipt) is supposed to be discontinuous at z0:[96]

lim_{ε→0} E[D | Z = z0 + ε] − lim_{ε→0} E[D | Z = z0 − ε] ≠ 0.   (44)

Note that the fuzzy design includes the sharp design as a special case, namely when the left hand side of the previous equation is equal to one. Therefore the following discussion focusses on the more general fuzzy design. The fuzzy design may apply when the treatment decision contains some element of discretion, as just described, and also in situations where individuals are offered a treatment, a grant or financial support and decline participation.[97]
treatment or a grant or …nancial support and decline their participation.97
A third case may be called a mixed sharp-fuzzy design, or a design with only one-sided noncompliance. This occurs if the design is strict on one side and fuzzy on the other. A frequent case is when eligibility depends strictly on an observed characteristic but participation in the treatment is voluntary. For example, eligibility for certain treatments may be means tested (e.g. food stamps programmes) with a strict eligibility threshold z0, but take-up of the treatment may be less than 100 percent. Then

lim_{ε→0} E[D | Z = z0 − ε] = 0 and lim_{ε→0} E[D | Z = z0 + ε] ∈ (0, 1).

[96] For example, van der Klaauw (2002) analyses the effect of financial aid offers to college applicants on their probability of subsequent enrollment. College applicants are ranked according to their test score achievements into a small number of categories. The amount of financial aid offered depends largely on this classification. Yet, the financial aid officer also takes other characteristics into account, which are not observed by the econometrician. Hence the treatment assignment is not a deterministic function of the test score Z, but the conditional expectation function E[D|Z] displays jumps because of the test-score rule.
[97] See the example of van der Klaauw (2002) described in footnote 96.

As another example, eligibility for certain labour market programmes may depend on the duration of unemployment or on the age of individuals. E.g., the "New Deal for Young People" in the UK offers job-search assistance (and other programmes) to all individuals aged between eighteen and twenty-four who have been claiming unemployment insurance for six months. Accordingly, the population consists of three subgroups (near the threshold): ineligibles, eligible non-participants and participants. (It is assumed that data on all three groups are available.) This setup rules out the existence of defiers, such that the monotonicity condition is automatically fulfilled close to z0. In this mixed design all of the following discussion still applies (with somewhat simpler formulae). Another relevant difference, though, is that there cannot be any always-participants under this design, such that the local compliers are the only subgroup that is treated.

Hahn, Todd, and van der Klaauw (2001) analyze nonparametric identification in the case of a fuzzy regression discontinuity design, where D ∈ {0, 1} is a random function of Z but E[D|Z] is discontinuous at z0. Since Z may also influence the potential outcomes directly, the treatment effect is not identified without further assumptions. Supposing that the direct influence of Z on the potential outcomes is continuous, the potential outcomes change little when Z is varied within a small neighbourhood.

Identification essentially relies on comparing the outcomes of those individuals to the left of the threshold with those to the right of the threshold. Consider first the sharp design. A necessary assumption is

E[Y^d | Z = z] is continuous in z at z0 for d ∈ {0, 1}.   (45)

A somewhat stronger condition which implies the previous one is

Y_i^d ⊥⊥ Z_i near z0.

In a sharp design this assumption is sufficient for identification of the ATE, ATET and ATEN, which we obtain as

E[Y^1 − Y^0 | Z = z0] = lim_{ε→0} E[Y | Z = z0 + ε] − lim_{ε→0} E[Y | Z = z0 − ε],

where we would usually estimate the terms on the right hand side by local linear regression because we are, by definition, estimating at a boundary point.

Lee (2008) gives an intuitive discussion of assumption (45). More precisely, he describes a selection mechanism under which (45) is true. Let U_i be unobservable characteristics of individual i and suppose that treatment allocation depends on some score Z_i such that D_i = 1(Z_i ≥ z0). Let F_{Z|U} be the conditional distribution function of Z given U, and let f_Z be the marginal density of Z.

Lee (2008, Condition 2b): For every u ∈ Supp(U), suppose that 0 < F_{Z|U}(z0|u) < 1 and that the derivative f_{Z|U}(z0|u) exists. In addition, f_Z(z0) > 0.

The intuition is that every individual may attempt to modify or adjust the value of Z_i in his own interest, but that even after such modifications there is still some randomness left, in that F_{Z|U}(z0|u) is neither zero nor one. In addition, f_Z(z0) > 0 implies that for some individuals it was indeed a random event whether Z happened to be larger or smaller than z0.

Under this assumption it follows that

E[Y^1 − Y^0 | Z = z0] = ∫ (Y^1(u) − Y^0(u)) · [f_{Z|U}(z0|u) / f_Z(z0)] dF_U(u),

which says that the treatment effect at z0 is a weighted average of the treatment effects of all individuals (represented by their value of U), where the weights f_{Z|U}(z0|u)/f_Z(z0) reflect the density at the threshold z0, where the "tie-breaking experiment" takes place. Those individuals who are more likely to have a value near z0, i.e. with large f_{Z|U}(z0|u), receive more weight, whereas individuals whose score is extremely unlikely to fall close to the threshold, i.e. with f_{Z|U}(z0|u) = 0, receive zero weight. Hence, this representation gives us a nice interpretation of what the effect E[Y^1 − Y^0 | Z = z0] represents.
Another implication of Lee (2008, Condition 2b) is that the distribution of all pre-treatment variables is continuous at z0. Hence, if we observe pre-treatment variables in our data, we can test whether they are indeed continuously distributed at z0. If they are discontinuous at z0, the plausibility of the RDD is reduced. One should note, though, that this last implication is a particular feature of Lee (2008, Condition 2b) and not of the RDD per se.
The selection mechanism of Lee (2008) permits individuals to partly self-select or even manipulate their desired value of Z, as long as the final value of Z_i still depends on some additional random noise. It permits some kind of endogenous sorting of individuals as long as they are not able to sort precisely around z0. Consider an example where students have to attend summer school if they fail a certain mathematics test. Some students may want to avoid summer school (and therefore aim to perform well on the test), whereas others would like to attend summer school (and therefore perform poorly on the test). The important point is that students are nevertheless unlikely to sort exactly around the threshold. The reason is that even when they purposefully answer some of the test items correctly or incorrectly, they may not know with certainty what their final score will be and/or may not know the threshold value z0. Hence, although the score Z_i may not truly reflect ability, and true ability may not even be monotone in Z, among those with final score Z close to z0 it is random who ends up above and who below.

On the other hand, those individuals who mark the exam have perfect control over the outcome, and they may manipulate the test scores. Often, they may not know the threshold z0 that will eventually be applied. But if they knew it, they might manipulate scores around the threshold: e.g., if they attempt to reduce the class size of the summer school programme, they may increase the scores of a few individuals who scored slightly below z0 such that these end up above z0. If this manipulation were done randomly, the RDD would still be valid. In other situations, the manipulation might not be random and may therefore lead to inconsistent estimates. An interesting observation is that such manipulations often go in one direction only, e.g. if the treatment is something valuable for everyone, which would imply a discontinuity of f_Z at z0. If we detect a discontinuity of f_Z at z0 in the data, this may thus be a sign of possible manipulation.
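A hedged, very simplified diagnostic in this spirit is sketched below: under continuity of f_Z at z0, observations falling into a small window around z0 should be (approximately) equally likely to lie on either side. The window width and the binomial normal approximation are illustrative simplifications; the actual test of McCrary (2008) uses local linear density estimation instead:

```python
# Crude density-discontinuity check at z0 (simplified, not McCrary's test).
import numpy as np

def density_jump_check(z, z0, bw):
    left = np.sum((z >= z0 - bw) & (z < z0))     # count just below the threshold
    right = np.sum((z >= z0) & (z < z0 + bw))    # count just above the threshold
    n = left + right
    # under continuity of f_Z at z0 and small bw, right ~ Binomial(n, 1/2)
    t = (right - n / 2) / np.sqrt(n / 4)         # approx N(0,1) under continuity
    return left, right, t                        # large |t|: possible manipulation
```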

Another example discussed in Lee (2008) refers to political elections. Each party may attempt to manipulate the counting of the votes to its own advantage, but there still remains a random element in the final vote count. If the manipulated vote count falls close to 50%, it is nevertheless random whether party A won or lost. If we are interested in the effects of political power on certain outcomes Y, it does not matter whether the vote count was manipulated or not. The important aspect is that vote counts very close to 50% lead to a tie-breaking experiment which randomly assigns political power.

Nevertheless, manipulation of Z is a potential concern in every RDD application. As discussed above, manipulation requires two things: First, agents need to have perfect control over their value of Z and reasons to manipulate it. Second, they need to know the threshold z0. If such manipulation occurs and treatment effects are heterogeneous, the assumptions in (45) might be violated. In such situations we would often expect a discontinuity of f_Z at the threshold, and also discontinuities in other pre-treatment variables. Note that a continuous f_Z is neither a sufficient nor a necessary condition for the validity of the RDD. If teachers simply want to reduce the number of students in a summer school programme, they may manipulate test scores for some randomly chosen students. On the other hand, McCrary (2008, p. 701) discusses a hypothetical example where some students may benefit from a summer school programme whereas others would be harmed. If teachers knew this, they might manipulate test scores such that those students who benefit are moved into treatment and those who would be harmed are moved out of treatment. In this case, manipulation is not monotonic, as it goes in both directions, and the density f_Z may be continuous at z0 despite the fact that the assumptions in (45) are violated.

In a fuzzy design, assumption (45) is not sufficient and additional assumptions are required. Hahn, Todd, and van der Klaauw (2001) consider two alternative identifying assumptions (in addition to (45)):

HTK1: Y_i^1 − Y_i^0 ⊥⊥ D_i | Z_i for Z_i near z0   (46)

or

HTK2: (Y_i^1 − Y_i^0, D_i(z)) ⊥⊥ Z_i near z0, and there exists ε > 0 such that D_i(z0 + e) ≥ D_i(z0 − e) for all 0 < e < ε.   (47)

The former assumption (46) is some kind of selection-on-observables assumption and identifies E[Y^1 − Y^0 | Z = z0]. The second assumption (47) is some kind of instrumental variables assumption and identifies the treatment effect only for a group of local compliers,

lim_{ε→0} E[Y^1 − Y^0 | D(z0 + ε) > D(z0 − ε), Z = z0],

which corresponds to some kind of local LATE.


Whichever of these two assumptions is invoked, the estimator is the same. Under (47),
which is a localized version of the unconfounded-participation-type assumption, the exclusion
restriction and the monotonicity assumption (discussed in the previous section on instrumental
variables identi…cation), they show that the average treatment e¤ect on the local compliers is
identi…ed as

lim E Y 1 Y 0 jD (Z = z0 + ") = 1; D (Z = z0 ") = 0


"!0
lim E [Y jZ = z0 + "] lim E [Y jZ = z0 "]
"!0 "!0
= . (48)
lim E [DjZ = z0 + "] lim E [DjZ = z0 "]
"!0 "!0

The local compliers are the group of individuals whose Z value lies in a small neighbourhood of z0 and whose treatment status D would change from 0 to 1 if Z were changed exogenously from z0 − ε to z0 + ε. The limit expressions of E[Y | Z = z0 ± ε] and E[D | Z = z0 ± ε] can be estimated e.g. by local linear regression (as this has better boundary properties than Nadaraya-Watson regression), where one should make sure to use only the data points to the left, or only those to the right, of z0. Optimal bandwidth choice does not seem to have been discussed very much (at least to my knowledge). Since the fastest rate of convergence under conventional smoothness assumptions is n^(−2/5), conventional cross-validation would at least converge at the right rate. So cross-validation might be a feasible choice, where one should use only the data points within some (not too narrow) neighbourhood of z0 to calculate the cross-validation criterion. Otherwise, observations very distant from z0 would affect the bandwidth value too much. (See also Imbens and Lemieux (2008).)

One immediately notices the similarity of (48) to the Wald estimator for a binary treatment and binary instrument. The Wald estimator has been shown to be equivalent to a 2SLS regression of Y on a constant and D, using Z as an instrument. The same applies here, although only in the limit case, i.e. when using only observations infinitesimally close to z0.

Many applied papers use a conventional linear regression in a neighbourhood around z0.[98] Due to the essentially binary nature of the discontinuity at the threshold, these approaches are numerically equivalent to the formulae given above.

Consider first the sharp design. In the sharp design, everyone is a complier at the deterministic threshold z0, such that lim_{ε→0} E[D | Z = z0 + ε] = 1 and lim_{ε→0} E[D | Z = z0 − ε] = 0. The ATE is identified as

E[Y^1 − Y^0 | Z = z0] = lim_{ε→0} E[Y | Z = z0 + ε] − lim_{ε→0} E[Y | Z = z0 − ε],   (49)

where we would usually estimate both terms on the right by local linear regression. We estimate lim_{ε→0} E[Y | Z = z0 + ε] by α̂_+ in

(α̂_+, β̂_+) = argmin_{α,β} Σ_j (Y_j − α − (Z_j − z0)·β)² K((Z_j − z0)/h) 1(Z_j ≥ z0)   (50)

and lim_{ε→0} E[Y | Z = z0 − ε] by α̂_− in

(α̂_−, β̂_−) = argmin_{α,β} Σ_j (Y_j − α − (Z_j − z0)·β)² K((Z_j − z0)/h) 1(Z_j < z0),   (51)

to obtain

Ê[Y^1 − Y^0 | Z = z0] = τ̂ = α̂_+ − α̂_−.

[98] The main reason is that they can then use the simple calculation of the 2SLS standard errors, ignoring the essentially nonparametric nature of the approach.

We can also estimate τ in one step. To obtain the following formula, we add the objective functions of the previous two local linear regressions, which are defined on separate subsamples and with separate arguments. Let 1_j^+ = 1(Z_j ≥ z0) and 1_j^− = 1(Z_j < z0). We obtain as the joint objective function, which is minimized at (α̂_+, β̂_+) and (α̂_−, β̂_−),

Σ_j (Y_j − α_+ − β_+(Z_j − z0))² K((Z_j − z0)/h) 1_j^+ + Σ_j (Y_j − α_− − β_−(Z_j − z0))² K((Z_j − z0)/h) 1_j^−

= Σ_j (Y_j − α_+·1_j^+ − α_−·1_j^− − β_+(Z_j − z0)·1_j^+ − β_−(Z_j − z0)·1_j^−)² K((Z_j − z0)/h)

by using that 1_j^+ + 1_j^− = 1. Also noting that in the sharp design 1_j^+ implies D_j = 1 and 1_j^− implies D_j = 0, we obtain

= Σ_j (Y_j − α_− − (α_+ − α_−)·1_j^+ − β_+(Z_j − z0)·D_j − β_−(Z_j − z0)·(1 − D_j))² K((Z_j − z0)/h)

= Σ_j (Y_j − α_− − (α_+ − α_−)·D_j − β_+(Z_j − z0)·D_j − β_−(Z_j − z0)·(1 − D_j))² K((Z_j − z0)/h).   (52)

Since this function is minimized at (α̂_+, β̂_+) and (α̂_−, β̂_−), the coefficient on D_j would be estimated as α̂_+ − α̂_−. We can thus obtain τ̂ directly by a local linear regression of Y_j on: a constant, D_j, (Z_j − z0)·D_j and (Z_j − z0)·(1 − D_j). This is identical to the separate regressions given above.
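A minimal sketch of this one-step local linear regression (52) follows; the triangular kernel and the fixed bandwidth h are illustrative choices:

```python
# Sharp-design RDD via the one-step weighted regression (52).
import numpy as np

def sharp_rdd(y, z, z0, h):
    u = z - z0
    w = np.maximum(0.0, 1.0 - np.abs(u) / h)      # triangular kernel weights
    d = (z >= z0).astype(float)                   # sharp design: D = 1(Z >= z0)
    X = np.column_stack([np.ones_like(u), d, u * d, u * (1 - d)])
    keep = w > 0
    sw = np.sqrt(w[keep])[:, None]                # weighted least squares
    beta = np.linalg.lstsq(X[keep] * sw, y[keep] * sw.ravel(), rcond=None)[0]
    return beta[1]                                # coefficient on D: alpha+ - alpha-
```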
In some applications the restriction is imposed that the derivative of E[Y|Z] is identical on the two sides of the threshold, i.e. that

lim_{ε→0} ∂E[Y | Z = z0 + ε]/∂z = lim_{ε→0} ∂E[Y | Z = z0 − ε]/∂z.

This assumption appears particularly natural if one aims to test the hypothesis of a zero treatment effect, i.e. the null hypothesis that E[Y^1 − Y^0 | Z = z0] = 0: if the treatment has no effect on the level, it appears plausible that it also has no effect on the slope. This can easily be implemented in (52) by imposing β_− = β_+. In the implementation we would thus obtain τ̂ by a local linear regression on: a constant, D_j and (Z_j − z0) (i.e. without interacting the last term with D_j). If one is not testing for a null effect, this restriction is less appealing. Imbens and Lemieux (2008) also point out that (52) without the restriction β_− = β_+ ensures that only data points to the left of z0 are used for estimating the potential outcome E[Y^0 | Z = z0] and only points to the right of z0 for estimating E[Y^1 | Z = z0]. (Because the solution to (52) is identical to the solutions of (50) and (51).) With the restriction β_− = β_+, data points from both sides of z0 are used for estimating Y^0 and also for Y^1. In other words, some Y^0 outcomes are used to estimate E[Y^1 | Z = z0] and, analogously, some Y^1 outcomes are used to estimate E[Y^0 | Z = z0], which does not appear so appealing.

In the fuzzy design, we implement the Wald-type estimator (48) by estimating (50) and (51) and the analogous expressions with Y replaced by D. Since E[D | Z = z] is usually much smoother (as a function of z) than E[Y | Z = z], one would tend to choose a larger bandwidth for estimating the terms appearing in the denominator of (48) than for those appearing in the numerator. (For achieving n^(−2/5) consistency, all bandwidths should be chosen at the appropriate rate, but they can be multiplied by different constants.)

We can express the Wald estimator as a 2SLS estimator if we use the same bandwidth values in all expressions of (48) and a uniform kernel throughout. The uniform kernel implies that all observations with |Z_j − z0| ≤ h receive a weight of one and all other observations a weight of zero. Consider the 2SLS regression, using only observations with |Z_j − z0| ≤ h, of

Y_j on a constant, D_j, (Z_j − z0)·1_j^+ and (Z_j − z0)·1_j^−   (53)

with instruments: a constant, 1_j^+, (Z_j − z0)·1_j^+ and (Z_j − z0)·1_j^−. (In other words, 1_j^+ is the excluded instrument.) The coefficient on D_j is numerically identical to the Wald estimator (48) based on (50), (51) and the corresponding expressions for D. (Show this by using partitioned inverses.)
The 2SLS regression equation (53) can be extended by adding further polynomial terms, e.g. (Z_j − z0)²·1_j^+ and (Z_j − z0)²·1_j^− (which would then correspond to estimating the terms in (48) by local quadratic regression). Similarly, higher-order polynomials can be included, which is done in many applied articles. The fact that the expression (48) can also be estimated via (53) may appear to be only of theoretical value, but it shows again the link between the RDD and local IV identification. If one ignores the fact that the bandwidth value h is estimated, as many applied papers do, the variance of the 2SLS regression (53), where one should use robust standard errors, is a convenient way to conduct inference (using only observations with |Z_j − z0| ≤ h). Another convenient advantage of (53) is that it easily permits the inclusion of additional covariates X, school fixed effects or time fixed effects in a linear way.
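A hedged sketch of the fuzzy-design Wald estimator via the 2SLS regression (53) with a uniform kernel follows; the bandwidth h is an illustrative input, and the just-identified 2SLS formula below is the standard one:

```python
# Fuzzy-design RDD via 2SLS regression (53), uniform kernel window.
import numpy as np

def fuzzy_rdd_2sls(y, d, z, z0, h):
    keep = np.abs(z - z0) <= h                    # uniform kernel: weight one inside
    y, d, u = y[keep], d[keep], (z - z0)[keep]
    above = (u >= 0).astype(float)                # the indicator 1+
    X = np.column_stack([np.ones_like(u), d, u * above, u * (1 - above)])
    W = np.column_stack([np.ones_like(u), above, u * above, u * (1 - above)])
    beta = np.linalg.solve(W.T @ X, W.T @ y)      # just identified: (W'X)^{-1} W'y
    return beta[1]                                # coefficient on D = Wald estimator (48)
```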

If pre-treatment data on Y are available, we can also consider a Diff-in-Diff RDD approach, as exemplified further below. In other words, suppose that Y_{t=1} is measured after the treatment started and Y_{t=0} is measured before, i.e. Y_{t=0} is unaffected by the treatment, and we could therefore estimate the "treatment effect" both before and after. In the sharp design, we showed in (52) that the (kernel weighted) regression of Y_{t=1} on a constant, D_j, (Z_j − z0)·D_j and (Z_j − z0)·(1 − D_j) nonparametrically estimates the effect in period t = 1. With two time periods, we could regress

Y on t, Dt, (Z − z0)·Dt, (Z − z0)·(1 − D)t, (1 − t), D(1 − t), (Z − z0)·D(1 − t), and (Z − z0)·(1 − D)(1 − t),

where all the regressors of (52) are interacted with t. Because all regressors are interacted with the two possible values of t, the result is numerically identical to estimating (52) separately in each time period. We then immediately see that by re-arranging the regressors we obtain the equivalent regression

Y on a constant, t, D, Dt, (Z − z0), (Z − z0)·D, (Z − z0)·t and (Z − z0)·Dt.

The coefficient on Dt corresponds to the DiD-RDD treatment effect estimate. Note that although we estimate a linear equation, no linearity assumption is required at all (as we have not needed one in the derivation). Hence, although the regressor Dt is the interaction term that we know from linear DiD models, we have derived it purely from (49). (The linearity is due to the essentially binary nature of the RDD.)
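A minimal sketch of this sharp-design DiD-RDD regression follows; kernel and bandwidth mirror the sharp_rdd() sketch above and are again illustrative choices:

```python
# DiD-RDD in the sharp design: kernel-weighted regression with t interactions.
import numpy as np

def did_rdd(y, z, t, z0, h):
    u = z - z0
    w = np.maximum(0.0, 1.0 - np.abs(u) / h)      # triangular kernel weights
    d = (z >= z0).astype(float)
    X = np.column_stack([np.ones_like(u), t, d, d * t,
                         u, u * d, u * t, u * d * t])
    keep = w > 0
    sw = np.sqrt(w[keep])[:, None]                # weighted least squares
    beta = np.linalg.lstsq(X[keep] * sw, y[keep] * sw.ravel(), rcond=None)[0]
    return beta[3]                                # coefficient on D*t: DiD-RDD effect
```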

An analogous result can be obtained for the DiD-RDD with a fuzzy design. Consider again the simple case with a uniform kernel, i.e. only observations with |Z_j − z0| ≤ h are used and receive equal weights. Extending the 2SLS equation (53) by full interaction with t (and the same for the list of instruments) gives results that are identical to two separate 2SLS regressions, i.e. separately in time periods 0 and 1. Hence, by re-arranging the regressors we can estimate (48) in one step, using only observations with |Z_j − z0| ≤ h, by 2SLS of

Y on a constant, t, D, Dt, (Z − z0)·1^+, (Z − z0)·1^−, (Z − z0)·1^+·t, (Z − z0)·1^−·t   (54)

with instruments: a constant, t, 1^+, (Z − z0)·1^+, (Z − z0)·1^−, 1^+·t, (Z − z0)·1^+·t, and (Z − z0)·1^−·t. (Here, 1^+ and 1^+·t are the excluded instruments.) (With multiple pre-treatment time periods, the regressor t may be replaced by a set of time dummies.)

The regression discontinuity approach permits identification of a treatment effect under weak conditions. In particular, a type of instrumental variable assumption needs to hold only locally. On the other hand, the average treatment effect is identified only for the local compliers. And due to the local nature of the identification, no √n-consistent estimator can exist, because we require a continuous variable Z for this approach and thus have to rely on smoothing around z0 with a bandwidth converging to zero for consistency; see Hahn, Todd, and van der Klaauw (2001).

The methods discussed so far can be used to estimate the treatment effect at a discontinuity
point z0. In many applications there may be several discontinuity points. E.g. Lalive,
Wüllrich, and Zweimüller (2008) consider a policy in Austria where firms are obliged to hire
one severely disabled worker per 25 non-disabled workers (or pay a fee instead). This rule
implies thresholds at 25, 50, 75 etc. Another example in van der Klaauw (2002), for a fuzzy
design, considers the offer of financial aid as a function of an ability test score. The test score is
more or less continuous, but for administrative purposes it is grouped into four categories, e.g.
Z < z0, z0 < Z < z1, z1 < Z < z2 and z2 < Z. At each of these thresholds z0, z1 and z2, the
probability of treatment rises discontinuously and may or may not remain constant between
these thresholds.

Clearly, we could use the above methods to estimate the treatment effect separately at every
threshold, and then perhaps take a (weighted) average of all these estimates. An alternative
approach, outlined below, is to assume a constant treatment effect. This approach may be

helpful for two reasons: First, if the treatment effect is indeed constant, we would expect to
obtain more precise estimates. Second, it helps us to link the methods derived above to more
conventional parametric modelling, which may be helpful if we would like to incorporate further
specific features in a particular application.

We will consider first the sharp design. With a constant treatment effect, we can think of
the outcome being determined as

    Y_i = α + β·D_i + U_i    (55)

where endogeneity arises because of dependence between D and U. In the sharp design D_i =
1(Z_i ≥ z0), such that we obtain

    E[Y|Z, D] = α + β·D + E[U|Z, D]    (56)

where E[U|Z, D] = E[U|Z] because D is a deterministic function of Z. We can rewrite the
previous equation by adding and subtracting Y to obtain

    Y_i = α + β·D_i + E[U|Z_i] + W_i,   where W_i ≡ Y_i − E[Y|Z_i, D_i].

Note that the "error term" W_i has the properties E[W] = 0, cov(W, D) = 0 and
cov(W, E[U|Z]) = 0, as can be shown by straightforward calculations using iterated
expectations. Suppose further that E[U|Z] belongs to a parametric family of functions, e.g.
polynomial functions, which we denote by ζ(z; γ). (E.g. a second order polynomial would be
γ_1·Z + γ_2·Z². We ignore the constant here because we already have α in the above equation
as the constant. Hence, we cannot identify the constant, but we are only interested in β
anyway.) So we assume that there is a true vector γ such that E[U|Z] = ζ(Z; γ) almost
surely.^99 If E[U|Z] is sufficiently smooth, it can always be approximated to arbitrary precision
by a polynomial of sufficiently large order. The important point is thus to have the number
of terms in ζ(z; γ) sufficiently large. By using E[U|Z] = ζ(Z; γ) we can rewrite the previous
expression as

    Y_i = α + β·D_i + ζ(Z_i; γ) + W_i,   with W_i ≡ Y_i − E[Y|Z_i, D_i],    (57)

^{99} I.e. there is a vector γ such that for all values z ∈ ℝ\A, where Pr(Z ∈ A) = 0, it holds that E[U|Z = z] = ζ(z; γ).

where we now consider the terms in ζ(Z; γ) as additional regressors, which are all uncorrelated
with W as shown above. Hence, the treatment effect β can be consistently estimated by OLS. In
a sense this looks similar to (53). In (53) we accounted for E[U|Z] via local linear regression in
a neighbourhood of z0, whereas (57) contains a global estimate of E[U|Z]. (We would therefore
usually like to include a higher order polynomial, or use generalized cross-validation to find an
appropriate polynomial order.) We also note that the regression in (57) does not make any use of z0
itself. The identification nevertheless comes from the discontinuity at z0 (and the smoothness
assumption on E[U|Z]). To see this, consider what would happen if we used only data to the left
of z0 (or only data to the right of z0). In this case, D_i would be the same for all data points,
such that β would not be identified. From the derivations leading to (57) it is also clear that
the regression (57) would be the same if there were multiple thresholds.
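A minimal sketch of this global polynomial regression (names are hypothetical; the polynomial order p is a tuning choice, which could be picked by generalized cross-validation as mentioned above):

    # Sharp RDD with constant treatment effect, equation (57): OLS of Y on a
    # constant, D = 1(Z >= z0) and a global polynomial in Z; z0 enters only via D.
    import numpy as np

    def sharp_rdd_polynomial(y, z, z0, p=3):
        d = (z >= z0).astype(float)
        poly = np.column_stack([z**k for k in range(1, p + 1)])  # zeta(Z; gamma)
        X = np.column_stack([np.ones_like(z), d, poly])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta[1]          # coefficient on D = estimate of beta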
The case of multiple thresholds becomes more interesting in the fuzzy design. Note that the
assumption of a constant treatment effect implies by definition that assumption HTK1 (46) is
satisfied. In other words, we do not permit that individuals select into treatment according
to their gain from it, as we would be permitting if we were to assume only HTK2 (47). We start
again from equation (55) and aim to rewrite it such that we can estimate it by OLS. Because
D is no longer a deterministic function of Z, we consider only expected values conditional on
Z, and not on Z and D as we did in (56). Therefore

    Y_i = α + β·D_i + U_i
    E[Y|Z] = α + β·E[D|Z] + E[U|Z]

and

    Y_i = α + β·E[D|Z_i] + E[U|Z_i] + W_i,   where W_i ≡ Y_i − E[Y|Z_i],

where W_i is not correlated with any of the other terms on the right hand side of the above
equation. As before, we suppose that E[U|Z] = ζ(Z; γ) belongs to a parametric family of
functions, such that

    Y_i = α + β·E[D|Z_i] + ζ(Z_i; γ) + W_i,   with W_i ≡ Y_i − E[Y|Z_i].    (58)

If we knew the function E[D|Z] = Pr(D = 1|Z), we could estimate the previous equation by
OLS to obtain β. Since we do not know E[D|Z], we could pursue a two step approach in that
we first estimate E[D|Z] and plug the estimated E[D|Z] into the previous equation (58) to obtain

β via OLS. To estimate E[D|Z] we could assume the following specification

    E[D|Z_i] = δ + η(Z_i; θ) + π·1(Z_i ≥ z0)    (59)

where η(z; θ) is a parametric family of functions indexed by θ, e.g. a polynomial. In (59) we
use our a priori knowledge of a discontinuity at z0.

Note what would happen if we chose the same polynomial order for ζ and η, e.g. a third
order polynomial. Because with exact identification IV and 2SLS are identical, the solution to
(58) is identical to an IV regression of Y_i on a constant, D_i, Z_i, Z_i², Z_i³ with instruments: a constant,
Z_i, Z_i², Z_i³ and 1(Z_i ≥ z0). Hence, 1(Z_i ≥ z0) is the excluded instrument.

If we have multiple thresholds, e.g. z0, z1, z2, where the treatment probability is discontinuous, we
can simply replace (59) with

    E[D|Z_i] = δ + η(Z_i; θ) + π_0·1(Z_i ≥ z0) + π_1·1(Z_i ≥ z1) + π_2·1(Z_i ≥ z2)    (60)

and use the estimated E[D|Z_i] in (58). In this case we have three excluded instruments: 1(Z_i ≥ z0), 1(Z_i ≥ z1), 1(Z_i ≥ z2).
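A sketch of this IV estimator with several thresholds, choosing the same polynomial order for ζ and η so that it reduces to the 2SLS described above (names are hypothetical):

    # Fuzzy RDD with constant treatment effect, equations (58)-(60): 2SLS of Y on
    # a constant, D and a polynomial in Z, with the threshold dummies 1(Z >= z_k)
    # as excluded instruments.
    import numpy as np

    def fuzzy_rdd_thresholds(y, z, d, thresholds, p=3):
        const = np.ones_like(z)
        poly = np.column_stack([z**k for k in range(1, p + 1)])
        X = np.column_stack([const, d, poly])
        dummies = np.column_stack([(z >= zk).astype(float) for zk in thresholds])
        Z = np.column_stack([const, poly, dummies])        # included + excluded
        Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]    # first stage, cf. (59)/(60)
        beta = np.linalg.lstsq(Xhat, y, rcond=None)[0]
        return beta[1]                                     # coefficient on D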
We also note that in these derivations we did not make use of the fact that D is binary.
So these approaches would also apply if D were a non-binary treatment variable. E.g. van der
Klaauw (2002) considers the amount of financial aid offered, which is more or less continuous. Of
course, in this case the assumption (55) becomes more restrictive.

Consider a few examples: Knowledge of institutional details may often unveil such discontinuities
in programme admission rules. If university admission, for example, requires a test
score of 670 in a central examination, we could compare those individuals with 670 points to
those with 669 points, i.e. compare those who made it to those who "almost made it". In
practice we would probably compare those with 670-680 points to those with 660-669 points to
have a larger sample size, and for robustness examine this for different intervals. The further
we move away from 670 points, the more biased our results are going to be.^100 On the other
hand, our sample size will increase and thus variance decrease.

^{100} Because the test score is likely to be influenced by unobserved ability, which also affects the outcomes later in life.

Similarly, we can imagine a remedial teaching programme for students with poor language
abilities. A test score might be administered to all students, and those below a threshold level
would be admitted to this additional teaching support, whereas those above would not gain
access to this teaching support (or would have to pay for it).

Geographical borders can also lead to a regression discontinuity. Consider an administrative
border between two provinces (of the same country), and two villages close to the border
but on different sides of it. If commuting times are short between these two villages,
they will share many common features, the same labour market etc. But administrative regulations
can in some specific details differ a lot between these two villages because they are in
different provinces. See Frölich and Lechner (2006) for an example on the effects of labour market
programmes for the unemployed. (They used an IV approach and not a strict RDD approach,
though.) Such kinds of geographic or administrative borders provide opportunities for evaluation
in various applications. E.g. individuals living close to but on different sides of an administrative
border may be living in the same labour market, but in case of becoming unemployed they
have to attend different employment offices with potentially rather different types of support or
training programmes. These individuals living on the different sides of the border may, however,
also differ in other observed characteristics that one would like to control for.
Lalive (2008) gives a nice and convincing application to study the effects of the maximum
duration of unemployment benefits (in Austria), combining RDD with difference-in-differences
(DiD). In certain clearly defined regions of Austria the maximum duration of receiving
unemployment benefits was substantially extended for jobseekers aged 50 or older at entry
into unemployment. Basically, two control group comparisons can be examined: those slightly
younger than 50 to those 50 and above, and those living in the treatment region but close to
a border to a non-treatment region to those on the other side of the border. The age based
strategy would compare job seekers who are aged 50 years to those slightly below. Both groups
of jobseekers would be basically of the same age, but only the older group enjoys access to
the extended UI entitlement. This is an example of the sharp design. A clear concern is that
employers and employees might collude to manipulate age at entry into unemployment. Firms
could offer to wait with laying off their employees until they reach the age of 50, provided the
employees are also willing to share some of their gains, e.g. through higher effort in their
final years. In this case, the group becoming unemployed at the age of 49 might be rather
different from those becoming unemployed at age 50. To mitigate this concern somewhat, one
can examine the histogram of age at entry into unemployment. If the above concern is real, a strong

increase in the number of observations at age 50 would be expected. Another way to mitigate
the above concerns is to examine the exact process of how the policy change was enacted. If the
change in the legislation was passed rather unexpectedly, i.e. rapidly and without much public
discussion, it may have come as a surprise to the public. If, in addition, the new rules apply
retrospectively, e.g. for all cases who had become unemployed six months ago, these early cases
might not have been aware of the change in the law at the time they became unemployed.

The second, and perhaps more convincing, identification strategy uses as threshold the
border between treated and control regions, i.e. individuals living close to either side of the
border between treated and control regions. In the Austrian case, region of residence would be
harder to manipulate, as the law provides access to extended benefits only if the person has lived
in that region for at least 6 months prior to the claim. Selective migration is still possible,
but workers would have to move from control to treated regions well in advance of the eventual
layoff.

An important part of the analysis is knowledge about the exact details of the implementation of
the reform, which is strongly related to the history of the Austrian steel sector. After the Second
World War, Austria nationalized its iron, steel and oil industries and segments of the heavy
engineering and electrical industries into a large holding company, the Oesterreichische Industrie
AG (OeIAG). Due to low productivity and various other failures, in 1986 a large restructuring
plan was envisioned with huge layoffs due to plant closures and downsizing, particularly in
the steel industry. With such large public mass-layoffs planned, a social plan with extended
unemployment benefit durations was enacted, but only in those regions that were severely hit by
the restructuring and only for older workers. The extended entitlement was 209 weeks, subject
to the following criteria: (i) age 50 or older; (ii) a continuous work history (780 employment weeks
during the last 25 years prior to the current unemployment spell); (iii) location of residence in
one of the 28 selected labor market districts since at least 6 months prior to the claim; and (iv)
start of a new unemployment spell after June 1988 or a spell in progress in June 1988. Lalive
examines only individuals who enter unemployment from a non-steel job. The reason for the focus
on non-steel jobs is that they should only be affected by the change in the unemployment benefit system,
whereas individuals entering unemployment from a job in the steel industry were additionally
also affected by the restructuring of the steel sector.

A nice part of the analysis is the presence of extensive administrative databases also for

the period well before the reform took place. This permits a pseudo-treatment analysis to
"test" the assumptions of the RDD. The RDD compares either individuals on both sides of
the age 50 threshold or geographically across the border between affected and non-affected
regions. Using the same definitions of treatment and outcome with respect to a population that
became unemployed well before the reform (including the period during which the outcome is
measured), one would expect a pseudo-treatment effect of zero. If the estimate is different from
zero, it may indicate that differences in unobserved characteristics are present across the border.
On the one hand, this would reduce the appeal of the RDD assumptions. On the other hand,
one would like to account for such differences in a DiD-RDD approach, i.e. by subtracting the
pseudo-treatment effect from the treatment effect.

Lalive (2008) therefore uses information on individuals entering unemployment from a job
in the non-steel sector in the period January 1986 until December 1987 (i.e. before the policy
change) and in the period August 1989 to July 1991 (when the policy was active).

As another example of an RDD with DiD, Leuven, Lindahl, Oosterbeek, and Webbink (2007)
consider a programme in the Netherlands, where schools with at least 70% disadvantaged minority
pupils received extra funding. The 70% threshold was maintained nearly perfectly, which
would imply a sharp design. The existence of a few exceptions makes the design nevertheless
fuzzy, where the threshold indicator can be used as an instrument for treatment. Given the
availability of pre-programme data on the same schools, differences-in-differences around the
threshold can be used. The programme was announced in February 2000 and eligibility was
based on the percentage of minority pupils in the school in October 1998, in other words well
before the programme started. This reduces the usual concern that schools might have manipulated
their shares of disadvantaged pupils to become eligible. In this situation, schools would
have to have anticipated the subsidy about one to one-and-a-half years prior to the official
announcement. As a check for such potential manipulation, one can compare the density of the
minority share across schools around the 70% cutoff. With manipulation, one would expect a
drop in the number of schools slightly below 70% and a larger number above the
cutoff. Leuven et al. use data from schools with a minority share between 60 and 80%. Data
on individual test scores is available for the preintervention years 1999 and 2000 and for the
postintervention years 2002 and 2003, permitting a DiD-RDD. As a pseudo-treatment test they further
examine the estimated effects when assuming that the relevant threshold was 10%, 30%, 50%

or 90%. In all these cases the estimated effects should be zero, since no additional subsidy was
granted at those thresholds. (It is also important to verify that no other (public) programme
set in at or about the same threshold. Otherwise, one would be measuring the effects of these
programmes together.)

In another example, Angrist and Lavy (1999) observed that in Israel class size is usually
determined by a rule that splits classes when class size would otherwise be larger than 40.
This policy generates discontinuities in class size when the enrollment in a grade grows from
40 to 41 (as class size changes from one class of 40 to two classes of size 20 and 21), from 80
to 81, etc. Enrollment (Z) thus has a discontinuous effect on class size (D) at these points.
Since enrollment Z may directly influence student achievement (as it represents the size of the
school), it is not a valid instrumental variable, but it may be so if we compare only classes with
enrollment of size 40 to those with 41. Angrist and Lavy (1999) impose more structure in the form
of a linear model to estimate the impact of class size on student achievement. Nevertheless, the
justification for their approach essentially relies on the consideration above. But apart from
class size there may also be other differences in observed characteristics X between the children
in a grade with 40 versus 41 children. We might be concerned about confounding due to parents
pulling their children out of school (and sending them to private schools) if they realize that their
child would be in a class of 40 students, whereas they might not want to do so if class size is
only 20 or 21 students (i.e. if enrollment Z is 41).
Consider a few other examples. Black (1999) used this idea to study the impact of school
quality on house prices. In many countries, admission to primary school is usually
based on the residency principle: someone living in a particular school district is automatically
assigned to a particular school. If the quality of schools varies from school to school, parents
have to relocate to the school district where they want their child to attend school. Houses in
areas with better schools would thus be in higher demand and thus more expensive. Therefore,
to avoid sending children to bad schools, parents may want to move to a school district with a
good school, which increases housing prices in those districts. If the school district border runs,
for example, through the middle of a street, houses on the left hand side of the street might be
more or less expensive than those on the right hand side of the street because they belong to
different school districts.
Black (1999) examined the impact of school quality on housing prices by comparing houses

adjacent to school-attendance district boundaries. School quality varies across the border,
which should be reflected in house prices. For using this approach we have to verify
whether the instrumental variables assumptions examined above hold locally, i.e. there
should be monotonicity, no confounding and no direct impact of the instrument. In the driving
license example above, we would probably not believe that the age 18 threshold has no direct
effect, because of many other changes in regulations, rights and duties at the 18th birthday, such
that Y^0 would also change discontinuously. On the other hand, we can be assured that age
is exogenous, i.e. is not caused by any unobserved variables.^101 In the house pricing example
of Black (1999) we might perhaps be concerned that there are also many other changes
in regulations when moving from the left hand side to the right hand side of the street. It
seems, though, that school district boundaries often do not coincide with other administrative
boundaries, such that these concerns can be dissipated. Confounding might still be an issue here.
Although, in contrast to individual location decisions, houses cannot move, construction
companies might have decided to build different types of houses on the left hand side and the
right hand side of the road. If school quality was indeed valued by parents, developers would
build different housing structures on the two sides of the boundary: flats with many bedrooms
for families with children on the side of the boundary where the good school is located, and
apartments for singles and couples without children on the other side of the border. As in
the previous chapter, the instrumental variables assumptions may become more credible after
conditioning on several characteristics X, e.g. the type of the house (number of bedrooms, size,
garden etc.) in the house pricing example. Black (1999) therefore controls for the number of
bedrooms (and other characteristics of the apartments) in a linear model. A fully nonparametric
way to control for X is discussed in the next subsection.

Numerous evaluation studies have also examined the PROGRESA experimental data. In the pilot
phase of PROGRESA a random selection of villages/localities was chosen to participate in
the programme. Within each locality only households below a region-specific poverty index
were entitled to participate, whereas all other households were not admitted. If the poverty
threshold was strictly enforced, this provides a sharp regression discontinuity design.

As another example, consider a law specifying that companies with more than 50 employees
have to adhere to certain anti-discrimination legislation whereas smaller firms are exempted.

^{101} Assuming that unobservable characteristics do not change discontinuously with age.

This situation can be considered as a kind of local experiment: some units, firms or individuals
happen to lie on the side of the threshold at which a treatment is administered, whereas others
lie on the other side of the threshold. Units close to the threshold but on different sides can be
compared to estimate the average treatment effect.

More often than not, however, the units to the left of the threshold differ in their observed
characteristics from those to the right of the threshold. Accounting for these differences is
important to identify the treatment effect. In the example referred to above, a comparison of
firms with 49 employees to those with 51 employees could help to estimate the effects of anti-discrimination
legislation on various outcomes. However, firms near the threshold might take
the legal effects into account when choosing their employment level. Firms with 49
employees might thus be quite different in observed characteristics from firms with 51 employees,
e.g. with respect to assets, sales, union membership, industry etc. One would therefore like to
account for the observed differences between these firms.^102

^{102} Other recent examples include Battistin and Rettore (2002), Lalive (2008) and Puhani and Weber (2007).

As another example, Lee (2008) considers the effect of certain election outcomes, in particular
the electoral advantage of incumbency. In presidency-like democratic systems, the party
with the absolute majority of votes wins the ballot, introducing a clear discontinuity. Consider
a certain country and suppose there are a number of districts or regions where elections take
place, and suppose there are only two political parties. Now, define the outcome of the last election
by Z_state = Vote share party A_state − Vote share party B_state. If Z is close to zero, one
could probably argue that both parties have roughly the same number of supporters. Then it
is more or less random whether Z_state happened to be slightly above or below zero. However,
if Z_state is above zero, party A won and might have enacted very different policies than in another
region where Z_state was slightly below zero and party B thus won. The requirement of an absolute
majority provides here a strong discontinuity in power allocations. This is an example of the
sharp design where D jumps from 0 to 1 at z0 = 0. (Lee (2008) looks at the incumbency effect,
in particular at the probability of winning the election again given that one won the last election.
More precisely, he is interested in the probability of Democrats winning the next election,
by comparing districts where the Democrats won the previous election with just over 50% of
the vote with districts where the Democrats lost the previous election with just under 50% of
the vote. In fact, instead of a 50% threshold, winning only requires having more votes than
the second largest party. Hence, the difference in votes between the two largest parties is the relevant
variable, with a threshold z0 = 0.) Since the elections are by secret ballot, complete manipulation
(about the threshold z0 = 0) is extremely unlikely. This is confirmed by McCrary (2008,
Figure 4), which shows that the density is continuous at z0. McCrary (2008) also
developed a formal test.

McCrary (2008) looks at another example regarding voting outcomes: "Percent voting in
favour of proposed bill in the U.S. House of Representatives". Here, the voting is not secret and
there are strong incentives to coordinate voting behaviour such that proposed bills are passed.
In McCrary (2008, Figure 5), the circles show the "percent voting in
favour" for all the different bills proposed, with a superimposed density estimate
(which is estimated separately for the half-line to the right and the half-line to the left of 50%).
As that figure shows, if the vote in favour is very close to 50%, the
probability that the bill is accepted is larger than the probability that it is rejected.

Based on similar ideas, various studies have looked into the impact of unionization on firm outcomes.
In some countries, firms become unionized if the majority of the workers vote for it.
DiNardo and Lee (2004) looked at the USA, where unions are organized at the establishment
level. New unionization usually develops as the result of a secret ballot election. If a union
attempts to unionize a new establishment, it can arrange for a union vote. If the majority of
workers votes in favour of the union in this secret ballot election, the management of the firm is then
required by US law to bargain "in good faith" with the recognized union. (The NLRA sets certain rules for
the collective bargaining process: if the union vote was won, the firm is required to bargain
"in good faith" and is restricted in its use of replacement workers. The firm cannot fire workers
for being associated with a union. Without this legal protection, the firm could simply dismiss
those workers and/or replace workers on strike.) This creates a discontinuity which
we can exploit by comparing establishments where the unions barely won the election with
those where they barely lost the election. Continuing with our discussion on manipulation of
the variable Z, unions and the firm may well attempt to manipulate the outcome of the voting.
However, if the election is by secret ballot and if more than, say, 50 workers participate in it,
none of them has complete control over the outcome.

To justify an RDD, various plausibility checks are important. First, one should always
verify that no other programmes set in at or about the threshold z0. For example, if we examine
the effect of a certain law that applies only to firms with more than 10 employees, there might
be many other provisions of the law that also change at this threshold. Even more difficult is the
situation where other law changes happen, e.g., for firms with more than 8 employees, because to
obtain a sufficient sample size we would often have to include firms with 7, 8, 9 and 10 employees in our
control group.
Second, although the RDD relies on various non-testable assumptions, there are nevertheless
various tools to assess their plausibility. Here, graphical tools can be helpful.

1) One could plot the functions E[Y|Z = z0 + ε] and E[Y|Z = z0 − ε] for ε ∈ (0, ∞). There
should be only one discontinuity, at z0. If there happen to be other discontinuities at different
values of z, they should be much smaller than the jump at z0.

2) If one has access to data on other covariates X, one can plot the functions
E[X|Z = z0 + ε] and E[X|Z = z0 − ε] for ε ∈ (0, ∞). Ideally, X should not have any
discontinuity at z0. If a discontinuity at z0 is observed, one might be concerned about potential
confounding and apply the RDD with covariates estimator discussed in the next subsection.
If the variance of X given Z is not too large and the total number of data points not too
large, a simple scatter plot of X versus Z often gives a very helpful visual impression. If the
scatter plot is too blurred with data points, nonparametric estimation of E[X|Z = z0 + ε] and
E[X|Z = z0 − ε] can be helpful. Note: if one uses kernel or local linear regression to estimate
and plot E[X|Z = z0 + ε] and E[X|Z = z0 − ε], one should make sure to use only data
points with Z_i > z0 for estimating E[X|Z = z0 + ε] and only data points with Z_i < z0 for
estimating E[X|Z = z0 − ε]. (I.e. imagine deleting the data on the other side of the threshold
during estimation.) If one uses the entire dataset, one would automatically also smooth over z0
and any true discontinuity would usually be smoothed away (or at least greatly diminished).

[Figure: sketch of E[X|Z] plotted against Z, estimated separately on each side of z0, revealing a possible discontinuity at z0.]
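A sketch of such a one-sided estimate of E[X|Z] (Nadaraya-Watson with a Gaussian kernel, purely for illustration; local linear regression would behave better at the boundary, and all names are hypothetical):

    # Estimate E[X|Z = g] on a grid, using only data strictly on one side of z0,
    # so that a true discontinuity at z0 is not smoothed away.
    import numpy as np

    def one_sided_nw(x_cov, z, grid, z0, h, side):
        mask = z > z0 if side == "right" else z < z0
        zs, xs = z[mask], x_cov[mask]
        est = []
        for g in grid:
            w = np.exp(-0.5 * ((zs - g) / h) ** 2)   # Gaussian kernel weights
            est.append(np.sum(w * xs) / np.sum(w))
        return np.array(est)

    # plot one_sided_nw(x, z, grid_left, z0, h, "left") and
    # one_sided_nw(x, z, grid_right, z0, h, "right") to inspect a jump at z0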

3) Also helpful is a plot of the density f_Z(z). Although the theory only requires that
f_Z is positive in a neighbourhood around z0, one might be concerned about possible data
manipulation if the density f_Z is discontinuous at z0, i.e. is much smaller on one side of z0
than on the other side. It might be that some individuals happened to be on the wrong side of
z0 (e.g. in the test score example) and then somehow managed to manipulate the test score to
be above z0. (Or the programme administrator might have manipulated this.) This could then
be a sign of non-random selection around the threshold. In particular, we would then expect
a larger fraction of the compliers to be above than below z0. If individuals managed to
manipulate Z, we would probably see a discontinuity of f_Z. McCrary (2008) develops a test
for continuity of the density at z0. Since estimation of the density takes place at a boundary
point, conventional kernel density estimation, to be discussed in later chapters, is not advisable.
The approach suggested by McCrary (2008) proceeds in two steps: First, an (undersmoothed)
histogram is estimated. (It is important that none of the histogram bins contains points both
to the left and the right of z0.) Second, the estimated histogram is smoothed by local linear
regression, again separately for the half-line to the left and the half-line to the right of z0.
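A rough sketch of this two-step procedure (a simplified reading of McCrary's approach, not his exact implementation; the names and kernel choices are ours):

    # Density check in the spirit of McCrary (2008): (1) undersmoothed histogram
    # with z0 as a bin edge, so no bin straddles the threshold; (2) local linear
    # smoothing of the bin heights separately on each side; compare the limits at z0.
    import numpy as np

    def density_limits(z, z0, binwidth, h):
        lo = z0 - binwidth * np.ceil((z0 - z.min()) / binwidth)   # align edges on z0
        edges = np.arange(lo, z.max() + binwidth, binwidth)
        counts, edges = np.histogram(z, bins=edges)
        mids = (edges[:-1] + edges[1:]) / 2
        dens = counts / (len(z) * binwidth)                       # histogram density

        def loclin(m, f, point):
            u = m - point
            w = np.maximum(0, 1 - np.abs(u / h))                  # triangular kernel
            X = np.column_stack([np.ones_like(u), u])
            W = np.diag(w)
            return np.linalg.solve(X.T @ W @ X, X.T @ W @ f)[0]   # intercept at z0

        left, right = mids < z0, mids > z0
        return loclin(mids[left], dens[left], z0), loclin(mids[right], dens[right], z0)

    # h should cover several histogram bins; a large gap between the two returned
    # values points to possible manipulation of Z around z0.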

As an example, Lalive (2008) examines the histogram of age at entry into unemployment.
As mentioned above, the policy reform in Austria provided a longer unemployment benefit
duration in certain regions of Austria, but only for individuals who became unemployed at age
50 or older. Since the policy improvements were quite substantial for them, firms and workers
may agree to delay a layoff until the age of 50. If this were the case, the histogram should
show substantially more entries into unemployment at age 50 than just below. Non-continuity of
the density at the threshold may indicate that employers and employees actively change their
behaviour because of the policy change. This could induce a bias in the RDD if the additional
layoffs are selective, i.e. if they have different counterfactual unemployment durations. Lalive
(2008) has the advantage of having control regions available that were not affected by the policy
change. This permits comparing the histogram of age of those becoming unemployed in the
treated regions to that in the non-treated regions. Lalive (2008) finds an abnormal
reaction at the age threshold for women. Using the data on the control regions, however, there
seems to be no dip just before age 50, only a substantial increase of the density at age 50. To
analyze this further, he examines the previous labour market history of these women to assess
whether the individuals around the threshold seem to differ in important characteristics,

which could lead to selection bias.


Another example is Leuven, Lindahl, Oosterbeek, and Webbink (2007), who examined the
effect of an education subsidy for schools with at least 70% disadvantaged students. Had schools
known of this subsidy in advance and tried to manipulate their numbers to obtain the additional
subsidy, we would expect a drop in the number of schools just below the 70% threshold and an
increase at and above 70%. The corresponding figure in Leuven, Lindahl, Oosterbeek, and
Webbink (2007) shows that this does not seem to be the case.

Another example is a university entrance admission test (or GRE test) which can be taken
repeatedly. If individuals know the threshold test score z0, those scoring slightly below z0
might retake the test, hoping for a better result. Unless the outcomes of repeated tests
are perfectly correlated, this will lead to a much lower density f_Z at locations slightly below z0
and a much higher density above z0. We might then be comparing people who took the test only
once with those who took it repeatedly, who might also differ in other
unobserved characteristics. (Probably those individuals whose attendance really depends on the test
result are likely to repeat the test most persistently.)

4) If data on unaffected periods is available, the pseudo-treatment test can be used as well.
E.g. Lalive (2008) examines the effects of an extension of unemployment benefits for those aged 50
and above in selected regions of Austria. He considers the age and region thresholds separately.
The region threshold is used by measuring the distance to the regional border to an adjacent region
that is not subject to the policy. One figure in Lalive (2008) shows the regions, and another
the estimated treatment effect according to the region threshold after the introduction of the
programme.

Lalive (2008) also has access to the same administrative data for the time period before the
introduction of the policy change. If the identification strategy is valid, we should observe
a difference neither at the age nor at the region threshold, which is examined in the corresponding figure.

As another example, Lee (2008) examines the effect of incumbency on winning the next
election. His figures show that if the vote share margin of victory for the Democratic
party was positive at time t, it has a large effect on winning the election in t + 1.
On the other hand, if it was more or less random whether the vote share happened to be
positive or negative in t, conditional on being close to zero, it should not be related to election
outcomes in t − 1. In other words, the sign of the vote share margin in t should have had no
effect on earlier periods, which Lee (2008) examines as well.

An extension of this idea is to use the previous periods in a differences-in-differences RDD
setup.

Finally, examine also the mixed sharp-fuzzy design. In a mixed sharp-fuzzy RDD (without
X) the LATE equals the ATET, and we can identify the distributions of Y^0 and Y^1 for those treated at
the threshold. To obtain the average treatment effect on the treated

    E[Y^1 − Y^0 | D = 1, Z = z0]

we need to identify the counterfactual outcome E[Y^0 | D = 1, Z = z0]. The only assumption
needed is that the mean of Y^0 is continuous at z0. (No further assumption is required.) Then
lim_{ε→0} E[Y^0 | Z = z0 + ε] = lim_{ε→0} E[Y^0 | Z = z0 − ε]. We can also write

    E[Y^0 | Z = z0 + ε] = E[Y^0 | D = 1, Z = z0 + ε] · Pr(D = 1 | Z = z0 + ε)
                         + E[Y^0 | D = 0, Z = z0 + ε] · Pr(D = 0 | Z = z0 + ε).

Taking limits and using the continuity assumption at z0 we obtain

    lim_{ε→0} E[Y^0 | D = 1, Z = z0 + ε]
       = ( lim_{ε→0} E[Y | Z = z0 − ε] − (1 − π) · lim_{ε→0} E[Y | D = 0, Z = z0 + ε] ) / π,    (61)

where π = lim_{ε→0} Pr(D = 1 | Z = z0 + ε). Note that below the threshold nobody is treated in
the mixed sharp-fuzzy design, so that E[Y | Z = z0 − ε] = E[Y^0 | Z = z0 − ε].
(Remember that this formula is valid only in the mixed sharp-fuzzy design without X. In a
strictly fuzzy design, more assumptions are required, as discussed before.)
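A sample analogue of (61) can be computed from simple local means (a sketch with a uniform kernel; local linear regression would have better boundary behaviour, and all names are hypothetical):

    # Counterfactual mean E[Y^0 | D=1, Z=z0+] in the mixed sharp-fuzzy design,
    # following (61), using observations within distance h of the threshold.
    import numpy as np

    def y0_treated_at_threshold(y, d, z, z0, h):
        right = (z >= z0) & (z <= z0 + h)
        left = (z < z0) & (z >= z0 - h)
        pi = d[right].mean()                      # ~ lim Pr(D = 1 | Z = z0+)
        ey_left = y[left].mean()                  # ~ lim E[Y | Z = z0-] = E[Y^0 | z0-]
        ey_right_d0 = y[right][d[right] == 0].mean()
        return (ey_left - (1 - pi) * ey_right_d0) / pi

    # The ATET at the threshold is then mean(Y | D=1, just right of z0) minus this value.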

Battistin and Rettore (2008) suggest using this result as a kind of test for the use of
a selection-on-observables strategy. The RDD estimates have the disadvantage of being valid
only for the subpopulation at the threshold z0. Suppose that we assume additionally

    Y^0 ⊥⊥ D | X, Z   for Z ≥ z0.    (62)

If this assumption were true, we could identify the treatment effect on the treated not only
locally for those at z0, but in the entire population. The selection-on-observables assumption
would also imply

    lim_{ε→0} E[Y | D = 0, X, Z = z0 + ε] − lim_{ε→0} E[Y^0 | D = 1, X, Z = z0 + ε] = 0.    (63)

Since lim_{ε→0} E[Y^0 | D = 1, X, Z = z0 + ε] is identified analogously to (61), provided that
E[Y^0 | X, Z] is continuous at z0, (63) implies that

    lim_{ε→0} E[Y | X, D = 0, Z = z0 + ε] − lim_{ε→0} E[Y | X, Z = z0 − ε] = 0.

Hence, one can test (63) and thereby the joint validity of the RDD and the selection-on-observables
assumption at z0. Of course, non-rejection of (62) at z0 does not ensure that
selection-on-observables is valid at other values of z. We would nevertheless feel more confident
in using (62) to estimate the ATET for the entire population.

These derivations can immediately be extended to the case where Z is a proper instrumental
variable, i.e. not only at a limit point. In other words, if Pr(D = 0 | Z = z̃) = 1 for some value
z̃, the ATET can be identified and the CIA assumption can be tested.
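As an illustration, the testable implication can be checked cell by cell for a discrete X (a sketch with local means only; a proper test would add standard errors, and all names are hypothetical):

    # Check of (63): within each cell of X, the mean outcome of the non-treated
    # just right of z0 should equal the mean outcome just left of z0 (where, in
    # the mixed sharp-fuzzy design, nobody is treated).
    import numpy as np

    def cia_check(y, d, z, x_cell, z0, h):
        diffs = {}
        for c in np.unique(x_cell):
            right0 = (x_cell == c) & (z >= z0) & (z <= z0 + h) & (d == 0)
            left = (x_cell == c) & (z < z0) & (z >= z0 - h)
            diffs[c] = y[right0].mean() - y[left].mean()
        return diffs    # all entries should be close to zero under (62)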

4.5.1 Regression discontinuity design with covariates

This chapter is in revision and should be skipped !!!

References: Frölich (2007c, RDD with covariates)

Both assumptions above are in many applications too strong. The conditional independence
assumption (46) does not permit any kind of deliberate treatment selection which incorporates
the individual gains Y_i^1 − Y_i^0. But even the local IV assumption (47) can be too strong without
conditioning on any covariates. It requires that the individuals to the left and right of the
threshold have the same unobserved gains and also that there is no deliberate selection into
Z_i < z0 versus Z_i ≥ z0. When the individuals left and right of the threshold differ in their
observed characteristics, one would be doubtful of the assumptions (46) or (47). In the following,
a weaker version of the local IV condition (47) in the fuzzy design is discussed. (A weaker version
of (46) is also discussed in Frölich (2007c).) This subsection is based on Frölich (2007c) and
gives only a brief overview; for the technical details the reader is referred to the original paper.

As mentioned several times above, the RDD might often be plausible only conditional
on some covariates X. Stating the same from a different perspective: we suggested examining
graphically whether there are discontinuities at z0 in the distribution of X. If we observe
such discontinuities, our previous assumptions become less plausible and we could decide to
abandon the attempted evaluation study and try to collect better data. Instead of taking
this bold decision, we might instead pursue an RDD that incorporates the differences
in the observed X in the estimation process. Frölich (2007c) develops a fully nonparametric
approach to this, permitting arbitrary effect heterogeneity. He shows that the rate for univariate
nonparametric regression, i.e. n^{2/5}, can be achieved irrespective of the number of variables in
X. Hence, the curse of dimensionality does not apply. This is achieved by smoothing over all
the covariates X. Including covariates is often necessary for identification. But even when the
estimator would be consistent without controlling for X, efficiency gains can be achieved by
accounting for covariates.

We start with an informal discussion to provide intuition for what follows. As discussed
by way of examples, the IV assumption may become more credible^103 if we control for a number of
observed covariates X that may be related to Y, D and/or Z:

    (Y_i^1 − Y_i^0, D_i(z)) ⊥⊥ Z_i | X_i   for Z_i near z0.    (64)

We also maintain the monotonicity assumption:

    D_i(z0 + e) ≥ D_i(z0 − e)   for all 0 < e < ε and some ε > 0.    (65)

By an analogous reasoning as in HTK, and some more assumptions made precise below, the
treatment effect on the local compliers conditional on X is:

    lim_{ε→0} E[Y^1 − Y^0 | X, D(z0 + ε) > D(z0 − ε), Z = z0]
       = (m^+(X, z0) − m^−(X, z0)) / (d^+(X, z0) − d^−(X, z0)),    (66)

where m^+(X, z) = lim_{ε→0} E[Y | X, Z = z + ε] and m^−(X, z) = lim_{ε→0} E[Y | X, Z = z − ε], and
d^+(X, z) and d^−(X, z) are defined analogously with D replacing Y.

^{103} In the following, it is assumed that the local conditional IV assumption is valid; but even if it were not exactly true, it is nevertheless rather likely that accounting for observed differences between units to the left and to the right of the threshold would help to reduce bias, even if not eliminating it completely.

Estimating the conditional treatment effect for every value of X by (66), although sometimes
informative, has two disadvantages, particularly if the number of covariates in X is large:
First, the precision of the estimate decreases with the dimensionality of X, which is known as the
curse of dimensionality. Second, policy makers and other users of evaluation studies often prefer
to see one number and not a multidimensional estimate. We may therefore be interested in the
unconditional treatment effect, in particular in estimating the average treatment effect in the
largest subpopulation for which it is identified. More precisely, we may be interested in the
treatment effect on all compliers:

    lim_{ε→0} E[Y_i^1 − Y_i^0 | D_i(z0 + ε) > D_i(z0 − ε), Z = z0],

i.e. without conditioning on X. Under the assumptions (64) and (65), this is the largest
subpopulation, since only the treatment status of the local compliers is affected by variation in
Z. In a one-sided non-compliance design, this is the ATET, see Section 4. I have to verify
whether this is true. If there are X included, we have to weight by the probability of being
left or right. But probably this is 0.5 for every value of X !?!

From inspecting the right-hand side of (66) one might imagine estimating the unconditional
effect by integrating out the distribution of X and plugging in nonparametric estimators in the
resulting expression:

    ∫ [ (m^+(X, z0) − m^−(X, z0)) / (d^+(X, z0) − d^−(X, z0)) ] dF_X.    (67)

This approach, however, has two disadvantages. First, when X is high dimensional, the denominator
in (67) may often be very close to zero, leading to a very high variance of (67) in small
samples. Second, it does not correspond to a well-defined treatment effect for a specific population.
As shown in Frölich (2007c), a nicer expression can be obtained for the treatment effect on
the local compliers, which is in the form of a ratio of two integrals. Under certain weak assumptions,
which are extensions of the binary IV case to the limit expression at z0, the
local average treatment effect for the subpopulation of local compliers is nonparametrically
identified as:

    γ = lim_{ε→0} E[Y^1 − Y^0 | Z ∈ N_ε, τ_ε = c]
      = ∫ (m^+(x, z0) − m^−(x, z0)) (f^+(x|z0) + f^−(x|z0)) dx / ∫ (d^+(x, z0) − d^−(x, z0)) (f^+(x|z0) + f^−(x|z0)) dx,    (68)

where N_ε denotes an ε-neighbourhood of z0, τ_ε = c indicates the local compliers, and f^+(x|z0)
and f^−(x|z0) are the right and left limits of the conditional density of X given Z at z0.
The necessary assumptions are basically: monotonicity, independent IV, IV exclusion and common
support. The new assumptions are that F_Z must be differentiable at z0 with positive density
in a neighbourhood, and that lim_{ε→0} F_{X|Z∈N_ε^+}(x) and lim_{ε→0} F_{X|Z∈N_ε^−}(x) exist and are
differentiable in x. Hence, the joint distribution F_{XZ} can be discontinuous at z0 (if it were not, the
distributions of the observed covariates would be the same on both sides of z0), but it has to be
differentiable with respect to x.^104

^{104} It is assumed throughout that the covariates X are continuously distributed with a Lebesgue density. This is an assumption made for convenience to ease the exposition, particularly in the derivation of the asymptotic distributions. Discrete covariates can easily be included in X and identification does not require any continuous X variables. The derivation of the asymptotic distribution only depends on the number of continuous regressors in X. Discrete random variables do not affect the asymptotic properties and could easily be included at the expense of a more cumbersome notation. Only Z has to be continuous near z0, but could have masspoints elsewhere.

A straightforward estimator of (68) is

    Σ_{i=1}^n (m̂^+(X_i, z0) − m̂^−(X_i, z0)) · K_h((Z_i − z0)/h)
    ──────────────────────────────────────────────────────────── ,    (69)
    Σ_{i=1}^n (d̂^+(X_i, z0) − d̂^−(X_i, z0)) · K_h((Z_i − z0)/h)
where m̂ and d̂ are nonparametric estimators and K_h(u) = κ(u)/h is a positive, symmetric
kernel function with h converging to zero with growing sample size. In addition to its well
defined causal meaning, the estimator (69) is likely to behave more stably in finite samples
than an estimator of (67), because the averaging over the distribution of X is conducted first,
before the ratio is taken. However, the estimator (69) can be shown to suffer from a slower
than optimal convergence rate. The reason for this is that all the elements in (69) are limit
points at z0 and are thus estimated at the boundary. We use local linear regression to deal with the
boundary aspect when estimating m̂^+, m̂^−, d̂^+ and d̂^−. However, the boundary problem also
comes into play via the smoothing by K_h((Z_i − z0)/h) at z0. Therefore, Frölich (2007c) suggests a
boundary RDD estimator, which can achieve n^{2/5} convergence, i.e. the rate of convergence of
one-dimensional nonparametric regression. The boundary RDD estimator uses a kernel function
which implicitly adapts to the boundary. By using local linear regression for m̂^+, m̂^−, d̂^+ and
d̂^− we have a double boundary correction. The boundary RDD estimator is defined as

    γ̂_RDD = Σ_{i=1}^n (m̂^+(X_i, z0) − m̂^−(X_i, z0)) · K̄_h((Z_i − z0)/h)
            / Σ_{i=1}^n (d̂^+(X_i, z0) − d̂^−(X_i, z0)) · K̄_h((Z_i − z0)/h),    (70)

where the kernel function is

    K̄_h(u) = (λ_2 − λ_1·u) · K_h(u),    (71)

with the kernel constants λ_1, λ_2 defined below. By using this kernel function, the estimator γ̂_RDD
achieves the convergence rate of a one-dimensional nonparametric regression estimator, irrespective
of the dimension of X. Loosely speaking, it thus achieves the fastest convergence rate possible
and is not affected by a curse of dimensionality. This is achieved by smoothing over all
other regressors and by an implicit boundary adaptation. (In addition, the bias and variance terms
due to estimating m^+, m^−, d^+, d^− and due to estimating the density function
(f^−(x|z0) + f^+(x|z0))/2 by the empirical distribution functions converge at the same rate.)

Both estimators proceed in two steps and require nonparametric first step estimates of m^+,
m^−, d^+ and d^−. These can be estimated nonparametrically by considering only observations
to the right or to the left of z0, respectively. Since this corresponds to estimation at a boundary
point, local linear regression is suggested, which is known to display better boundary behaviour
than conventional Nadaraya-Watson kernel regression. m^+(x, z0) is estimated by local linear
regression as the value of a that solves

    arg min_{a,b,c} Σ_{j=1}^n (Y_j − a − b·(Z_j − z0) − c'(X_j − x))² K_j I_j^+    (72)

where I_j^+ = 1(Z_j > z0) and a product kernel is used, with L = dim(X):

    K_j = K_j(x, z0) = κ((Z_j − z0)/h_z) · Π_{l=1}^L ϖ((X_{jl} − x_l)/h_x),    (73)

where κ and ϖ are univariate kernel functions; κ is a second-order kernel and ϖ is a kernel
of order ρ ≥ 2. The kernel κ is assumed to be symmetric and to integrate to one. The following
kernel constants will be used later: λ_l = ∫_0^1 u^l κ(u) du and μ_l = ∫_{−1}^1 u^l κ(u) du and
λ̃ = λ_0 λ_2 − λ_1². (With a symmetric kernel, λ_0 = 1/2.) Furthermore define κ̈_l = ∫_0^1 u^l κ²(u) du.
The kernel function ϖ is a univariate kernel of order ρ, with the kernel constants of this kernel
denoted as μ̄_l = ∫ u^l ϖ(u) du and μ̇_l = ∫ u^l ϖ²(u) du. The kernel function ϖ being of order ρ
means that μ̄_0 = 1 and μ̄_l = 0 for 0 < l < ρ and μ̄_ρ ≠ 0.^105
A result derived later will require higher-order kernels if the number of continuous regressors
is larger than 3. For applications with at most 3 continuous regressors, a second-order kernel
will suffice, such that ϖ = κ can be chosen.

Notice that three different bandwidths h_z, h_x, h are used. h is the bandwidth in the matching
estimator that compares observations to the left and right of the threshold, whereas h_z and h_x
determine the local smoothing area for the local linear regression, which uses observations only
to the right or only to the left of the threshold. Some smoothness assumptions on the functions
and the density are needed (mostly differentiability up to order ρ) as well as existence of the
second moments of the residuals. The bandwidth conditions needed are h, h_z, h_x → 0 and
nh → ∞ and nh_z → ∞ and n·h_z·h_x^{dim(X)} → ∞. Furthermore, to achieve an optimal rate of
convergence some further conditions are needed, which implicitly also require that ρ > dim(X)/2.
In other words, a higher order kernel (ρ > 2) is needed when X contains more than 3 continuous
covariates. Under these conditions it can be shown that

    n^{2/5}·(γ̂_RDD − γ) → N(B_RDD, V_RDD),

where the exact expressions of B_RDD and V_RDD are given in Frölich (2007c).
^{105} For the Epanechnikov kernel with support [−1, 1], i.e. κ(u) = (3/4)(1 − u²)·1(|u| < 1), the kernel constants are μ_0 = 1, μ_1 = μ_3 = μ_5 = 0, μ_2 = 0.2, μ_4 = 6/70, λ_0 = 0.5, λ_1 = 3/16, λ_2 = 0.1, λ_3 = 1/16, λ_4 = 3/70.
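To fix ideas, here is a compact sketch of the boundary-RDD estimator (70) with a single covariate, taking h_z = h_x = h and the Epanechnikov kernel, whose constants λ_1 = 3/16 and λ_2 = 0.1 are given in the footnote above. This is an illustration under these simplifying choices, not the full estimator of Frölich (2007c); all names are hypothetical:

    # Boundary-RDD estimator (70): local linear first steps (72) with product
    # kernel (73), combined with the boundary kernel (71) in the outer smoothing.
    import numpy as np

    def epan(u):
        return 0.75 * np.maximum(0.0, 1.0 - u**2)

    def loclin(dep, z, x, z0, x0, h, side):
        mask = z > z0 if side == "right" else z < z0      # one side only, cf. I_j^+
        zz, xx, yy = z[mask], x[mask], dep[mask]
        w = epan((zz - z0) / h) * epan((xx - x0) / h)     # product kernel (73)
        X = np.column_stack([np.ones_like(zz), zz - z0, xx - x0])
        W = np.diag(w)
        b = np.linalg.solve(X.T @ W @ X + 1e-10 * np.eye(3), X.T @ W @ yy)
        return b[0]                                       # intercept = estimate at (z0, x0)

    def boundary_rdd(y, d, z, x, z0, h, lam1=3/16, lam2=0.1):
        u = (z - z0) / h
        kbar = (lam2 - lam1 * u) * epan(u)                # boundary kernel (71)
        num = den = 0.0
        for xi, ki in zip(x, kbar):
            if ki == 0.0:
                continue
            num += (loclin(y, z, x, z0, xi, h, "right") - loclin(y, z, x, z0, xi, h, "left")) * ki
            den += (loclin(d, z, x, z0, xi, h, "right") - loclin(d, z, x, z0, xi, h, "left")) * ki
        return num / den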



If X contains more than 3 continuous covariates, higher order kernels are needed to control
the bias due to smoothing in the X dimension. Instead of using higher order kernels, one could
alternatively use local higher order polynomial regression instead of the local linear regression (72).
However, when the number of regressors in X is large, this could be inconvenient to implement
in practice, since a large number of interaction and higher order terms would be required,
which could give rise to problems of local multicollinearity in small samples and/or for small
bandwidth values. On the other hand, higher order kernels are very convenient to implement
when a product kernel (73) is used. Higher order kernels are only necessary for smoothing in
the X dimension but not for smoothing along Z.

The inclusion of covariates X may not only reduce bias but can also help to increase efficiency.
If some variable X causally affects Z and also D or Y, we generally have to control
for X to avoid bias. If some variable X does not affect Z but does affect D and/or Y,
the asymptotic variance can be reduced by controlling for this variable, see Frölich (2007c). In
other words, even if X has the same distribution on the left and right of z0, it could nevertheless
still make sense to include these covariates in the estimator. (One should note, though,
that this is a theoretical result based on examining the asymptotic variance formula. How large
such precision gains are in finite samples is still unknown.) This result can easily be extended
to show that the RDD estimator with a larger regressor set X̃, i.e. where X̃ ⊇ X, has a smaller
asymptotic variance than the RDD estimator with X. Hence, one can combine including some
X for eliminating bias with adding further covariates to reduce variance. The more variables
are included in X, the smaller the variance will be.

Instead of estimating only the average treatment effect, one can also estimate the entire
potential outcome distributions, as examined in Frölich (2007c).

4.6 Marginal treatment effect

References:
- Heckman and Vytlacil (1999, Proceedings of the National Academy of Sciences)
- Heckman and Vytlacil (2005, Econometrica)

In this section the ideas of the local average treatment effect are extended to continuous
instruments, and the discussion of the marginal treatment effect will also help the interpretation
of the LATE. Since the LATE is the effect on the complier subpopulation, and since the complier
subpopulation is defined through the instrumental variable, any local average treatment effect
is directly tied to its instrumental variable and cannot be interpreted on its own. For example,
if the instrumental variable Z represents the size of a programme (e.g. the number of available
slots), the local average treatment effect would represent the impact of the programme, if it
were extended from size z0 to size z1, on the subpopulation which would participate only in the
enlarged programme.

The local average treatment effect identifies the treatment effect for the subpopulation
induced to change D by a change in Z from 0 to 1. For an instrumental variable that takes
on more than 2 different values, we could take the endpoints of the support of Z (presuming
that the assumptions remain valid). Instead of examining the effects of very large changes in
Z, we could also be interested in what would happen if we changed Z only a little. For a
continuous Z one could think of infinitesimally small changes in Z to define a treatment effect
for the individuals just at the margin of changing D. These different effects can be summarized
nicely in the marginal treatment effect, which also shows how to deal with multiple instruments.
We still remain in the setup with a single binary endogenous regressor D ∈ {0, 1}. The model
is

    Y_i^1 = φ_1(X_i, U_i^1)
    Y_i^0 = φ_0(X_i, U_i^0)
    D_i = 1(ν(Z_i, X_i) − V_i ≥ 0)

with the assumptions (see Heckman and Vytlacil 2005):

1) ν(Z, X) is a nondegenerate random variable conditional on X
2) (U^1, V) ⊥⊥ Z | X and (U^0, V) ⊥⊥ Z | X
3) The distribution of V is absolutely continuous (with respect to Lebesgue measure)
4) Y^0 and Y^1 have finite first moments
5) 0 < Pr(D = 1|X) < 1 a.s.
Although this latent index threshold-crossing model looks rather different from the LATE
model discussed before, Vytlacil (2002) has shown that both are different representations of
the same model. I.e. this latent index model satisfies the LATE assumptions conditional on
X (as can easily be verified), and any model satisfying the LATE assumptions can essentially
be written in a latent index representation.^106 This is useful for sharpening one's intuition
about the exact economic implications of these assumptions. ν(Z_i, X_i) − V_i is here the function
determining the choice and could be considered as a latent index representing the net gain or
utility from choosing D = 1. If this net utility is larger than zero, D = 1 is chosen, otherwise
D = 0.

Assumptions 1 and 2 are instrumental variables assumptions: an inclusion restriction (that Z
affects the function ν, i.e. that the instrument has an impact on D) and an exclusion restriction
(that Z does not enter Y_i^1 and Y_i^0, i.e. has no direct effect). The other assumptions are rather
innocuous. Assumption 3 supposes the unobserved ability V to be continuously distributed,
which is needed for some integral calculations. Assumption 4 requires that Y^0 and Y^1 are not
infinitely large. Assumption 5 requires that for any value of X, participants (D = 1) and
non-participants (D = 0) can be observed.
Since V is assumed to be continuous, we can normalize the distribution of V. With V
continuous, its distribution function F_V(·) is strictly increasing. Hence, the following equivalences
hold:

    ν(Z_i, X_i) ≥ V_i  ⟺  F_V(ν(Z_i, X_i)) ≥ F_V(V_i)  ⟺  p(X_i, Z_i) ≥ F_V(V_i),

where p(x, z) = Pr(D = 1|X = x, Z = z) denotes the propensity score (= participation probability)
given X and Z. The last equivalence holds because

    p(z, x) = Pr(D = 1|Z = z, X = x) = Pr(ν(z, x) − V ≥ 0) = Pr(V ≤ ν(z, x)) = F_V(ν(z, x)).

Noting that F_V(V) is uniform[0,1] distributed,^107 the model can therefore, without loss of generality,
be written as

    Y^1 = φ_1(X, U^1)    (74)
    Y^0 = φ_0(X, U^0)
    D = 1(p(Z, X) − V ≥ 0)   with V ~ Uniform(0, 1).

^{106} The model Vytlacil (2002) examines is slightly different from the one presented here.
^{107} To see this, consider (for a strictly increasing distribution function) Pr(F_V(V) ≤ a) = Pr(V ≤ F_V^{−1}(a)) = F_V(F_V^{−1}(a)) = a. Hence, the distribution is uniform.

Hence, the distribution of the error term can be normalized to be uniform. Intuitively, individuals
can be thought of as being ordered on the real line from 0 to 1 in terms of their inclination
170 Advanced Microeconometrics Markus Frölich

to participate. Individuals with a low value of V are very likely to participate, while those with
a high value are unlikely to participate. By varying p(Z; X) through variation in Z, each indi-
vidual can be made more or less inclined to participate. This representation also shows nicely
how multiple instruments can be handled: they all enter in the one-dimensional participation
probability.[108] Therefore in the following P_i = p(X_i, Z_i) is considered as the instrumental variable.
Examine everything in the following conditional on X. If Z were to take only two different values, i.e. Z ∈ {z', z''}, then P would also take only two different values conditional on X, i.e. P ∈ {ρ', ρ''}; suppose that ρ' < ρ''. Individuals with V_i < ρ' would participate irrespective of the value of P, whereas individuals with V_i > ρ'' would never participate. Those with ρ' ≤ V_i ≤ ρ'' are those who would be induced to switch if the instrument were changed. Let Δ_i = Y_i^1 − Y_i^0 be the effect for individual i. Hence,

Δ^LATE(x; ρ', ρ'') = E[Δ|X = x, ρ' ≤ V ≤ ρ''] = (E[Y|X = x, P = ρ''] − E[Y|X = x, P = ρ']) / (ρ'' − ρ')

for ρ' < ρ''.[109] To prove the above equation, notice that

E[Y|X = x, P = ρ] = E[Y|X = x, P = ρ, D = 1] · Pr(D = 1|X = x, P = ρ) + E[Y|X = x, P = ρ, D = 0] · Pr(D = 0|X = x, P = ρ)

= ρ ∫_0^ρ E[Y^1|X = x, V = v] dv/ρ + (1 − ρ) ∫_ρ^1 E[Y^0|X = x, V = v] dv/(1 − ρ),

where V is uniformly distributed in the D = 0 and D = 1 subpopulations, with densities re-scaled such that they integrate to one. This gives

E[Y|X = x, P = ρ''] − E[Y|X = x, P = ρ']

= ∫_{ρ'}^{ρ''} E[Y^1|X = x, V = v] dv − ∫_{ρ'}^{ρ''} E[Y^0|X = x, V = v] dv

= ∫_{ρ'}^{ρ''} E[Y^1 − Y^0|X = x, V = v] dv = (ρ'' − ρ') · E[Δ|X = x, ρ' ≤ V ≤ ρ''].

[108] In practice, p(z, x) needs to be estimated, which makes the incorporation of multiple instruments not that trivial.
[109] It has been used that E[D|X = x, P = ρ''] = E[E[D|X, Z, p(Z, X) = ρ'']|X = x, P = ρ''] = E[E[D|X, Z]|X = x, P = ρ''] = ρ''.
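The algebra above can be checked numerically. The following is a minimal simulation sketch in Python (the data generating process and parameter values are illustrative assumptions, not from the text): it draws from the normalized model (74) with a heterogeneous effect that declines in V and compares the Wald-type ratio with the direct complier average.

import numpy as np

# Check of the LATE formula, everything conditional on a fixed X (suppressed).
# Illustrative DGP: E[Y1 - Y0 | V = v] = 2*(1 - v), so the LATE between
# rho' = 0.3 and rho'' = 0.7 should be 2*(1 - 0.5) = 1.
rng = np.random.default_rng(0)
n = 1_000_000
rho1, rho2 = 0.3, 0.7                       # the two values rho', rho'' of P
V = rng.uniform(size=n)                     # V ~ Uniform(0,1)
P = rng.choice([rho1, rho2], size=n)        # binary instrument shifts P
D = (P >= V).astype(float)                  # D = 1(p(Z,X) - V >= 0)
Y1 = 1.0 + 2.0 * (1.0 - V) + rng.normal(size=n)
Y0 = 1.0 + rng.normal(size=n)
Y = D * Y1 + (1.0 - D) * Y0

wald = (Y[P == rho2].mean() - Y[P == rho1].mean()) / (rho2 - rho1)
direct = (Y1 - Y0)[(V >= rho1) & (V <= rho2)].mean()
print(round(wald, 2), round(direct, 2))     # both approximately 1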

[Figure: everything conditional on X = x. Individuals are ordered by V ∼ Uniform(0,1) on the unit interval, with the indifference line P = v; those with V below the propensity score have D = 1, those above have D = 0. The compliers lie between ρ' and ρ''; MTE(x,v) is the effect at the margin v, and the LATE averages it over the complier interval.]

If Z takes on many different values, a different LATE could be defined for any two values of Z. If Z is continuous, the limit version of the above definition would yield the marginal treatment effect (MTE)

Δ^MTE(x, v) = E[Δ|X = x, V = v] = ∂E[Y|X = x, P = v] / ∂v,

provided E[Y|X = x, P = v] is differentiable in the second argument at the location v. This is the average treatment effect among those with characteristics X = x and unobserved characteristic V = v. These are the individuals just indifferent between participating and non-participating if P = v. It can be estimated simply by estimating the derivative of E[Y|X, P] with respect to P. A possible nonparametric estimator is local linear regression with respect to X and P, where the respective coefficient gives the estimate of the derivative.
This marginal treatment effect is useful for detecting treatment effect heterogeneity. If the MTE is constant conditional on X, i.e. identical for every individual (with the same value of X), then the derivative ∂E[Y|X = x, P = ρ]/∂ρ should be constant, and thus E[Y|X = x, P] as a function of P (with X kept constant) should be a straight line. A curse of dimensionality issue remains here, and one could think of integrating out X here as well, which appears not to have been done so far.
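A minimal sketch of such a local linear derivative estimator in Python (covariates X are suppressed and the kernel and bandwidth are illustrative choices): the local slope of E[Y|P] at the point v estimates the MTE at v.

import numpy as np

def mte_local_linear(Y, P, v, h=0.05):
    # Slope of a local linear regression of Y on P at the point v; this
    # estimates dE[Y|P=v]/dv, i.e. the MTE at v (covariates X suppressed).
    w = np.maximum(1.0 - ((P - v) / h) ** 2, 0.0)   # Epanechnikov-type weights
    keep = w > 0
    R = np.column_stack([np.ones(keep.sum()), P[keep] - v])
    sw = np.sqrt(w[keep])
    coef, *_ = np.linalg.lstsq(R * sw[:, None], Y[keep] * sw, rcond=None)
    return coef[1]

Evaluating this on a grid of v values traces out the MTE curve; a flat curve (conditional on X) indicates effect homogeneity, as just discussed.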

This marginal treatment effect is also useful in thinking about the relationship between ATE, ATET and LATE. It is related to the LATE via

Δ^LATE(x; ρ', ρ'') = E[Δ|X = x, ρ' ≤ V ≤ ρ''] = ∫_{ρ'}^{ρ''} Δ^MTE(x, v) dv / (ρ'' − ρ').

Similarly,

Δ^ATE(x) = E[Δ|X = x] = ∫_0^1 Δ^MTE(x, v) dv,

and for the ATET

Δ^ATET(x) = E[Δ|X = x, D = 1] = ∫_0^1 E[Δ|X = x, P = ρ, D = 1] dF_{P|X=x,D=1}(ρ),

where

E[Δ|X = x, P = ρ, D = 1] = ∫_0^ρ Δ^MTE(x, v) dv / ρ.

With some derivations (Heckman and Vytlacil 1999) it follows for the ATET that

Δ^ATET(x) = ∫_0^1 Δ^MTE(x, v) g_x(v) dv,

where the weights g_x(v) are

g_x(v) = (1 − F_{P|X=x}(v)) / Pr(D = 1|X = x).
Hence, all these treatment effects can be written as a weighted average of the MTE. The support of p(Z, X) determines which effects we can identify. If Z is continuous and has a substantial impact on D, the support of p(Z, X) given X will be large and many different effects can be identified. On the other hand, if Z induces only few individuals to change treatment status, only little is identified.[110] This shows that a strong impact of Z on D is important for identification. (Hint: this is very different from the weak instrument discussion to follow later, which is essentially not about identification but about estimation and particularly inference.)

[110] If the outcome variables are bounded, we could still bound the effect.
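As an illustration of these weighting formulas, the following Python sketch recovers the ATE and the ATET from an MTE curve by numerical integration. All ingredients are hypothetical: it assumes an MTE curve Δ^MTE(x, v) = 2(1 − v) and P|X ∼ Uniform(0,1), so that F_{P|X}(v) = v and Pr(D = 1|X) = E[P] = 1/2.

import numpy as np

# Numerical integration of an assumed MTE curve with ATE and ATET weights.
v = np.linspace(0.0005, 0.9995, 1000)
dv = v[1] - v[0]
mte = 2.0 * (1.0 - v)           # assumed curve; in practice: estimates from above
F_P = v                         # assumed F_{P|X}(v) for P|X ~ Uniform(0,1)
pD1 = 0.5                       # Pr(D=1|X) = E[P]

ate = np.sum(mte) * dv                      # ATE: unweighted integral of the MTE
g = (1.0 - F_P) / pD1                       # ATET weights g_x(v), integrate to one
atet = np.sum(mte * g) * dv
print(round(ate, 2), round(atet, 2))        # approximately 1.0 and 1.33

The ATET puts more weight on small v (individuals very inclined to participate), which is why it exceeds the ATE in this example, where the effect declines in v.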
The MTE could also be used to evaluate different policies which operate on decisions to participate but not on the potential outcomes. E.g. a policy could increase the incentives for taking up or extending schooling through financial support without directly affecting the remuneration of education in the labour market 20 years later. If the policy only operates through changing Z without affecting any of the structural relationships, the impact of the policy can be identified by averaging the MTE appropriately. Consider two potential policies denoted as a and ã, which differ in that they affect the participation inclination, but where the model remains valid under both policies, in particular the independence of the instrument. Denote by P_{i,a} and P_{i,ã} the participation probabilities under policy a and ã, respectively. If the distributions of the potential outcomes and of V conditional on X are the same under policy a and ã, the MTE remains the same under both policies and is thus invariant to the policy. Using a Benthamite social welfare function with utility function U(·), the MTE in utility terms is

Δ_U^MTE(x, v) = E[U(Y^1) − U(Y^0)|X = x, V = v]

and the policy impact for individuals with a given level of X is

E[U(Y_ã)|X = x] − E[U(Y_a)|X = x] = ∫_0^1 Δ_U^MTE(x, v) · (F_{P_a|X}(v|x) − F_{P_ã|X}(v|x)) dv,

where F_{P_a|X} and F_{P_ã|X} are the respective distributions of the participation probability. If the distribution of P can be forecasted for the different policies, this gives the appropriate weighting of the MTE for the impact of the policy.

Before proceeding to non-binary models, it is worth emphasizing again that identification of the MTE hinges crucially on the additive separability in the choice equation index: D_{i,z} = 1(p(z, X_i) − V_i ≥ 0). This representation entails monotonicity from two different perspectives. First, conditional on X, a change in z shifts the participation index in the same direction for every individual. I.e. if an increase in z makes individual i more inclined to participate, then it also makes individual j more inclined to participate. This is the type of monotonicity discussed in the LATE framework, which rules out defiers, i.e. it rules out that changing the instrument changes treatment inclination in different directions for different people. The second perspective, again conditional on X, is like a rank invariance assumption between individuals: if V_i is smaller than V_j, individual i will always be more inclined to participate than individual j, whatever the value of the instrument. In other words, the individuals can be ordered according to their inclination to participate: individuals with small V are always more inclined to participate than individuals with large V. This second perspective states monotonicity without referring to an instrument. It is a statement between individuals, independent of the instrument or the precise values the instrument takes.

In these binary models both definitions of monotonicity are essentially equivalent, and one can use the definition which is more intuitive or easier to verify by economic reasoning. Whereas monotonicity with respect to the impact of the instrument has been discussed frequently in the binary world, the monotonicity assumption in terms of ranking or ordering individuals by their participation inclination V will be dominant in the following subsections on non-binary models. Monotonicity is a crucial condition. Allowing for a more general choice model of the form

D_{i,z} = ζ(z, X_i, V_i)

would often still permit the definition of the MTE but not its identification, which is intuitively obvious from the analysis of the LATE framework, as opposite flows of compliers and defiers could occur.

4.7 Non-binary models with monotonicity in the choice equation

References:
- Imbens and Newey (2003)

The previous sections examined identification for a scalar binary endogenous regressor D. This is the simplest situation and is in fact often quite relevant for the evaluation of certain policies. In many other situations, D is continuous, and nonparametric identification for this case is considered here. The models examined here are based on restrictions in the choice equation.
We still impose triangularity

Y = φ(D, X, U, V, ...)

D = ψ(Z, X, U, V, ...),

where it is assumed that Y does not affect D. In other words, we impose a causal chain in that D may affect Y but not vice versa. Such a model may be appropriate because of temporal ordering, e.g. if D represents schooling and Y represents some outcome 20 years later. In other situations, e.g. a market equilibrium where Y represents supply and D demand, such a triangular model may not be appropriate; models without triangularity are briefly examined in the next section.

4.7.1 Continuous D with triangularity

Imbens and Newey (2003) consider a model of the triangular type with continuous D and no restrictions on the support of Y. Here we examine only issues of identification and relegate issues of estimation to later sections. Nonparametric estimation with endogenous continuous D is relatively difficult, and from the discussion in previous chapters it is obvious that √n-consistency cannot be possible, since nonparametric regression even with exogenous continuous D cannot attain the √n rate.

Imbens and Newey (2003) employ a generalized control function approach where conditioning on V removes the endogeneity of D. The variables Y, D and V are considered to be scalar.[111] U can be of any dimension.

Y = φ(D, X, U)

D = ψ(Z, X, V).

The basic intuition for the identification results is that the function ψ is assumed to be strictly monotone in its third argument such that it can be inverted. Since the inverse function depends only on observed covariates, it is identified. Thus conditioning on this inverse function controls for all sources of endogeneity, as it essentially makes the unobservable V identifiable. This is a control function approach, which attempts to control for V.

Assumption IN.1: (U, V) ⊥⊥ (X, Z)

This assumption can be decomposed into (U, V) ⊥⊥ Z|X and (U, V) ⊥⊥ X. The first part is similar to the assumptions in the previous sections. The second part requires that the variables X are exogenous, i.e. unrelated to U and V. As an example, this could be reasonable when Z and X are not choice variables on the individual level, but chosen at a more aggregate level, e.g. regulations at a regional level or natural variation in economic environment conditions, in combination with random location of individuals.

Assumption IN.2: V is a scalar and ψ is strictly monotone in its third argument, with probability 1. (ψ is normalized to be increasing in v.[112] We could also normalize V to be uniform.)

[111] Generalizations to vectors would be possible but are not discussed in their paper.
[112] The condition that ψ is increasing in v is a convenient and innocuous normalization, given that the function ψ is not restricted further.

The assumption that ψ is (weakly) monotone corresponds to a rank invariance assumption in the endogenous regressor D (e.g. years of schooling). An individual i with a value v_i larger than the value v_j of an individual j (with identical characteristics X) will always receive more schooling, regardless of the value of the instrument Z. This assumption may often appear more plausible when X contains a number of covariates. The assumption of strict monotonicity essentially requires D to be continuous.

This assumption implies that the inverse function of ψ with respect to its third argument exists: v = ψ^{-1}(z, x, d), such that ψ(z, x, ψ^{-1}(z, x, d)) = d. Hence, if ψ were known, the unobserved V would be identified by z, x, d. Now, ψ is unknown, but together with assumption IN.1 it follows that

F_{D|ZX}(d|z, x) = Pr(D ≤ d|X = x, Z = z)
= Pr(ψ(z, x, V) ≤ d|X = x, Z = z)
= Pr(V ≤ ψ^{-1}(z, x, d)|X = x, Z = z)
= Pr(V ≤ ψ^{-1}(z, x, d))
= F_V(ψ^{-1}(z, x, d))
= F_V(v).

If V is continuously distributed, F_V(v) is a one-to-one function of v. Hence, controlling for F_V(v) is identical to controlling for V.[113] Hence, two individuals with the same value of F_{D|ZX}(D_i|Z_i, X_i) have the same V. Since F_{D|ZX}(d|z, x) depends only on observed covariates, it is identified. It could be estimated by nonparametric regression, noting that F_{D|ZX}(d|z, x) = E[1(D ≤ d)|Z = z, X = x]. After conditioning on V, observed variation in D is statistically independent of variation in U, such that the effect of D on the outcome variable can be separated from the effect of U. This requires that there is variation in D after conditioning on V and X, which is generated by the instrumental variable(s) Z.

Therefore, by conditioning on F_{D|ZX}, the unobservable V is 'observable' and the endogeneity of D can be controlled for in a similar way as in the selection-on-observables approach. This is a nonparametric generalization of the control function approach.
[113] If V is not continuously distributed, F_V(v) contains steps and the set {v : F_V(v) = a} of values v with the same F_V(v) is not a singleton. Nevertheless, only one element of this set, the smallest, has a positive probability, and therefore conditioning on F_V(v) is equivalent to conditioning on this element with positive probability.

To simplify notation, define the random variable

Ṽ ≡ F_V(V) = F_{D|ZX}(D|Z, X)

and let ṽ be a realization of it. Ṽ can be thought of as a rank-preserving transformation of V to the unit interval. If V were uniformly [0,1] distributed, then Ṽ = V.

To identify the average structural function, notice that conditional on Ṽ the endogeneity is controlled for:

f_{U|D,X,Ṽ} = f_{U|X,Ṽ} = f_{U|Ṽ}.

Assuming that all the conditional moments in the following expressions are finite, the average structural function is identified as

ASF(d, x) = ∫ E[Y|D = d, X = x, Ṽ = ṽ] f_Ṽ(ṽ) dṽ,

because

E[Y|D = d, X = x, Ṽ = ṽ] = ∫ φ(d, x, u) f_{U|DXṼ}(u|d, x, ṽ) du = ∫ φ(d, x, u) f_{U|Ṽ}(u|ṽ) du,

and

∫ E[Y|D = d, X = x, Ṽ = ṽ] f_Ṽ(ṽ) dṽ
= ∫ (∫ φ(d, x, u) f_{U|Ṽ}(u|ṽ) du) f_Ṽ(ṽ) dṽ
= ∫ φ(d, x, u) (∫ f_{U,Ṽ}(u, ṽ) dṽ) du
= ∫ φ(d, x, u) f_U(u) du = ASF(d, x).

Hence, the average structural function is identified as

ASF(d, x) = ∫ E[Y|D = d, X = x, Ṽ = ṽ] f_Ṽ(ṽ) dṽ,

provided the term E[Y|D = d, X = x, Ṽ = ṽ] is identified at all ṽ where f_Ṽ(ṽ) is non-zero. This requires that the support of Ṽ|D, X is the same as the support of Ṽ. For example, if we want to identify the ASF for d = 5 years of schooling and suppose the distribution of 'ability in schooling' Ṽ ranges from 0 to 1, it would be necessary to observe individuals of all ability levels with d = 5 years of schooling. If, for example, the upper part of the ability distribution would always (i.e. with probability one) choose to have more than 5 years of schooling, then E[Y|D = 5, X, Ṽ] would not be identified for large ability values Ṽ. In the sub-population observed with 5 years of schooling, the high-ability individuals would be missing. If this were the case, then we could never infer from data what these high-ability individuals would have earned if they had received only 5 years of schooling. The precise condition is
Assumption IN.3: (Rank and full range condition) For all (d, x) where the ASF shall be identified,

Supp(Ṽ|D = d, X = x) = Supp(Ṽ).

Since the support of Ṽ given d and x depends only on the instrument(s) Z, this requires that the instrument(s) are sufficiently powerful to move any individual to 5 years of schooling. By varying Z, the individuals with the highest ability for schooling and the lowest ability could be induced to choose 5 years of schooling. This is a strong condition and requires a large amount of variation in the instrument, conditional on X.
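A minimal simulation sketch of the resulting two-step procedure in Python (the data generating process is purely illustrative). In this simulation, Ṽ = F_{D|Z}(D|Z) equals V exactly by construction; in practice, step 1 would estimate F_{D|ZX}(d|z, x) = E[1(D ≤ d)|Z = z, X = x] by nonparametric regression.

import numpy as np

# Two-step control function sketch (covariates X suppressed).
rng = np.random.default_rng(1)
n = 500_000
Z = rng.uniform(size=n)
V = rng.uniform(size=n)                   # unobserved in practice
D = 0.5 + Z + V                           # strictly increasing in V given Z
U = rng.normal(size=n) + 2.0 * (V - 0.5)  # endogeneity: U correlated with V
Y = D ** 2 + U                            # true ASF(d) = d^2
Vt = V                                    # step 1: Vtilde = F_{D|Z}(D|Z), exact here

def asf(d, h=0.05):
    # Step 2: average the local mean E[Y | D = d, Vtilde = v] over a grid of v.
    vgrid = np.linspace(0.05, 0.95, 19)
    means = [Y[(np.abs(D - d) < h) & (np.abs(Vt - v) < h)].mean() for v in vgrid]
    return float(np.mean(means))

print(round(asf(1.5), 2))                 # close to 1.5**2 = 2.25

The full range condition is visible here: asf(d) is only computable at values of d that are observed together with every v on the grid; with a weaker instrument, some (d, v) cells would be empty.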
An analogous derivation shows the identification of the DSF (and thus the QSF):

∫ E[1(Y ≤ a)|D = d, X = x, Ṽ = ṽ] f_Ṽ(ṽ) dṽ
= ∫ (∫ 1(φ(d, x, u) ≤ a) f_{U|Ṽ}(u|ṽ) du) f_Ṽ(ṽ) dṽ
= ∫ 1(φ(d, x, u) ≤ a) f_U(u) du,

hence the distribution structural function is identified as

DSF(d, x, a) = ∫ F_{Y|DXṼ}(a|D = d, X = x, Ṽ = ṽ) f_Ṽ(ṽ) dṽ.

Consider some extensions to the paper of Imbens and Newey (2003). If we are only interested in the expected potential outcomes E[Y^d], i.e. the average structural function as a function of d only and not of x, we can relax the previous assumptions somewhat. The expected potential outcome is identified as

E[Y^d] = ∫∫ E[Y|D = d, X = x, Ṽ = ṽ] f_{XṼ}(x, ṽ) dx dṽ.

Proof:

∫∫ E[Y|D = d, X = x, Ṽ = ṽ] f_{XṼ}(x, ṽ) dx dṽ
= ∫∫ (∫ φ(d, x, u) f_{U|DXṼ}(u|d, x, ṽ) du) f_{XṼ}(x, ṽ) dx dṽ
= ∫∫ (∫ φ(d, x, u) f_{U|XṼ}(u|x, ṽ) du) f_{XṼ}(x, ṽ) dx dṽ
= ∫∫ φ(d, x, u) (∫ f_{UXṼ}(u, x, ṽ) dṽ) du dx
= ∫∫ φ(d, x, u) f_{UX}(u, x) du dx = E[Y^d].

For this result, we could relax assumption IN.1 to (U, V) ⊥⊥ Z|X and would no longer require (U, V) ⊥⊥ X. (We would have to change notation somewhat in that we should permit the distribution function F_V of the variable V to depend on X.) Furthermore, the common support assumption changes to: for all d where E[Y^d] shall be identified,

Supp(Ṽ, X|D = d) = Supp(Ṽ, X).

To compare this assumption to the previous assumption IN.3, we can re-write it as

Supp(Ṽ|X, D = d) = Supp(Ṽ|X)  and
Supp(X|D = d) = Supp(X).

The first part of the assumption is in some sense weaker than assumption IN.3 in that Supp(Ṽ|X = x, D = d) has to contain only those ability values Ṽ that are also observed in the X = x population, instead of all values observed in the population at large. Hence, a less powerful instrument can be admitted.[114] However, this assumption is not necessarily strictly weaker than assumption IN.3, since it is required to hold for all values of X. The second part of the above assumption is new and was not needed before.

Assumption IN.3 can be quite strong and may not always be satisfied. It is not needed, however, for identifying average derivatives. This is because it is not required to observe high-ability individuals with 5 years of schooling, but only that the years of schooling vary a little around the observed value.

[114] Think of X as family income. Ability V is likely to be positively correlated with family income. Consider X = low-income families. The previous assumption IN.3 would require that all ability values of the entire population are also observed in the low-income population with D = d. The first part of the above assumption requires only that all ability values observed in low-income families are also observed in the D = d subpopulation.

Suppose φ is continuously differentiable in its first argument with probability one. Using again
E[Y|D = d, X = x, Ṽ = ṽ] = ∫ φ(d, x, u) f_{U|Ṽ}(u|ṽ) du,

compare

E[∂E[Y|D, X, Ṽ]/∂d] = E[∂/∂d ∫ φ(D, X, u) f_{U|Ṽ}(u|Ṽ) du]

and

ADerivative = E[∂φ(D, X, U)/∂d] = E[E[∂φ(D, X, U)/∂d | D, X, Ṽ]]
= E[∫ (∂φ(D, X, u)/∂d) f_{U|D,X,Ṽ}(u|D, X, Ṽ) du]
= E[∫ (∂φ(D, X, u)/∂d) f_{U|Ṽ}(u|Ṽ) du].

If differentiation and integration are interchangeable, these two expressions are identical and the average derivative is identified as

ADerivative = E[∂E[Y|D, X, Ṽ]/∂d].

No large support condition is needed since the derivative of E[Y|D, X, Ṽ] is evaluated where it is observed.[115]

4.7.2 Ordered discrete D with triangularity

The following discussion is from Frölich (2007). (A few further models are still missing here.)

Non-binary endogenous regressor, binary instrument

Consider the situation where the endogenous regressor D is discrete but non-binary. Suppose that D ∈ {0, ..., K} is discrete and that the instrument Z is binary. With D taking many different values, the compliance intensity can differ among units. Some units might be induced to change from D_i = d to D_i = d + 1 as a reaction to changing Z_i from 0 to 1. Other units might change, for example, from D_i = d' to D_i = d' + 2.[116] Hence, a change in Z induces a variety of different reactions in D, which cannot be disentangled. Since D is not continuous, the previously discussed approach cannot be used to identify the value of V. If many different instruments are available, they might help to disentangle the effects of different changes in treatment status. This is examined in ...[117]

[115] Theorem 4 of Imbens and Newey (2003) gives precise conditions for identification, which consist of the assumptions IN.1 and IN.2, the existence of a continuous derivative and a boundedness condition that permits the interchange of integration and differentiation.

Here we consider the situation when only a single binary instrument is available, e.g. random assignment to drug versus placebo. Only a weighted average of the effects can then be identified. According to their reaction to a change in Z from 0 to 1, the population can be partitioned into the types c_{0,0}, c_{0,1}, ..., c_{K,K}, where

T_i = c_{k,l}  if  D_{i,0} = k and D_{i,1} = l.   (75)

Assuming monotonicity, the defier-types c_{k,l} for k > l do not exist. The types c_{k,k} represent those units that do not react to a change in Z (these are the always-takers and the never-takers in the setup where D is binary). The types c_{k,l} for k < l are the compliers, which comply by increasing D_i from k to l. These compliers comply at different base levels k and with different intensities l − k. In the returns-to-schooling example, E[Y^{k+1} − Y^k|X, T = c_{k,k+1}] measures the return to one additional year of schooling for the c_{k,k+1} subpopulations. E[Y^{k+2} − Y^k|X, T = c_{k,k+2}] measures the return to two additional years of schooling, which can be interpreted as twice the average return of one additional year. Similarly, E[Y^{k+3} − Y^k|X, T = c_{k,k+3}] is three times the average return to one additional year. Hence, the effective weight contribution of the c_{k,l} subpopulation to the measurement of the return to one additional year of schooling is (l − k) · Pr(T = c_{k,l}). Accordingly, the weighted average treatment effect Δ_w(x) for all compliers with characteristics x can be defined as[118]

Δ_w(X) = [Σ_{k=0}^{K} Σ_{l>k}^{K} E[Y^l − Y^k|X, T = c_{k,l}] · Pr(T = c_{k,l}|X)] / [Σ_{k=0}^{K} Σ_{l>k}^{K} (l − k) · Pr(T = c_{k,l}|X)].   (76)
[116] Suppose D is years of schooling and Z an instrument that influences the schooling decision. If Z were changed exogenously, some individuals might respond by increasing school attendance by an additional year. Other individuals might increase school attendance even by two or three years. Furthermore, even if Z were set to zero for all individuals, they would attend different years of schooling.
[117] (To do: incorporate some material by Carneiro here.)
[118] The presentation in Angrist and Imbens (1995) looks different from the definition of Δ_w used here, as they present the effect in terms of overlapping subpopulations. Nevertheless, both definitions are equivalent.

Δ_w(x) is the effect of the induced treatment change, averaged over the different complier groups and normalized by the intensity of compliance.

To obtain the weighted average effect for the subpopulation of all compliers (i.e. all subpopulations c_{k,l} with k < l), one would need to weight Δ_w(x) by the distribution of X in the complier subpopulation:

∫ Δ_w(x) dF_{x|complier},

where F_{x|complier} is the distribution of X in the all-compliers subpopulation.


Unfortunately, the distribution of X in the all-compliers subpopulation is not identi…ed if
D takes more than 2 di¤erent values. In particular, the size of the all-compliers subpopulation
is no longer identi…ed by the distribution of D and Z.119 Nevertheless, if one de…nes the
all-compliers subpopulation in terms of compliance intensity units, the distribution of X is
identi…ed. In the intensity-weighted complier subpopulation, each complier is weighted by its
compliance intensity. In the case where D 2 f0; 1; 2g, the subpopulation c0;2 receives twice
the weight of the subpopulation c0;1 . In the years-of-schooling example, the subpopulation c0;2
complies with 2 additional years of schooling. If the returns to a year of schooling are the same
for each year of schooling, an individual who complies with 2 additional years can be thought
of as an observation that measures twice the e¤ect of one additional year of schooling. Or, in
other words, as two (correlated) measurements of the return to a year of schooling. Unless these
two measurements are perfectly correlated, the individual who complies with 2 additional years
contributes more to the estimation of the return to schooling than an individual who complies
with only one additional year. Consequently, the individuals who comply with more than one
year should receive a higher weight when averaging the return to schooling over the distribution
of X. If each individual is weighted by its number of additional years, the weighted distribution
function of X in the all-compliers subpopulation, in the case where D 2 f0; 1; 2g, is

w
fxj =c0;1 Pr ( = c0;1 ) + fxj =c1;2 Pr ( = c1;2 ) + 2fxj =c0;2 Pr ( = c0;2 )
fxjcomplier =
Pr ( = c0;1 ) + Pr ( = c1;2 ) + 2 Pr ( = c0;2 )
[119] Consider the following example: for D taking values in {0,1,2}, the population can be partitioned into the subpopulations {c_{0,0}, c_{0,1}, c_{0,2}, c_{1,1}, c_{1,2}, c_{2,2}}, with the all-compliers subpopulation consisting of {c_{0,1}, c_{0,2}, c_{1,2}}. The two partitions with probabilities {0.1, 0.1, 0.3, 0.3, 0.1, 0.1} and {0.1, 0.2, 0.2, 0.2, 0.2, 0.1} generate the same distribution of D given Z: P(D = 0|Z = 0) = 0.5, P(D = 1|Z = 0) = 0.4, P(D = 2|Z = 0) = 0.1, P(D = 0|Z = 1) = 0.1, P(D = 1|Z = 1) = 0.4, P(D = 2|Z = 1) = 0.5. However, the size of the all-compliers subpopulation is different for the two partitions (0.5 and 0.6, respectively). Hence the size of the all-compliers subpopulation is not identified from the observable variables.

or, in the general case,

f^w_{x|complier} = [Σ_{k=0}^{K} Σ_{l>k}^{K} (l − k) f_{x|T=c_{k,l}} Pr(T = c_{k,l})] / [Σ_{k=0}^{K} Σ_{l>k}^{K} (l − k) Pr(T = c_{k,l})].   (77)

With respect to this weighted distribution function, the weighted local average treatment effect is identified, again by a formula analogous to (32).

Suppose that D is discrete with bounded support, the instrument Z is binary and assumptions 1, 2 and 5 are satisfied, as well as assumptions 3 and 4 with respect to all types t ∈ {c_{k,l} : k ≤ l} defined in (75). The weighted local average treatment effect for the subpopulation of compliers is then nonparametrically identified as

Δ_w = ∫ Δ_w(x) f^w_{x|complier}(x) dx   (78)

= [∫ (E[Y|X = x, Z = 1] − E[Y|X = x, Z = 0]) f_x(x) dx] / [∫ (E[D|X = x, Z = 1] − E[D|X = x, Z = 0]) f_x(x) dx].   (79)
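A sample-analogue sketch of (79) in Python for the case of a single discrete covariate (hypothetical inputs; with continuous X the cell means below would be replaced by nonparametric regression fits):

import numpy as np

def weighted_late(Y, D, Z, X):
    # Sample analogue of (79): average the Z-contrasts of E[Y|X,Z] and E[D|X,Z]
    # over the marginal distribution of a discrete X, then take the ratio.
    num = den = 0.0
    for x in np.unique(X):
        cell = X == x
        w = cell.mean()                   # estimate of f_X(x)
        num += w * (Y[cell & (Z == 1)].mean() - Y[cell & (Z == 0)].mean())
        den += w * (D[cell & (Z == 1)].mean() - D[cell & (Z == 0)].mean())
    return num / den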

4.7.3 Unordered discrete D with triangularity

This section is missing.

5 Identification without triangularity

All the nonparametric IV models discussed so far relied on some kind of monotonicity assumption in the relationship determining D. In this section we are going to exploit a monotonicity assumption in the relationship determining Y. This requires Y to be continuous.

In some situations, this monotonicity assumption might be more plausible, whereas in other situations we might want to rely on monotonicity in the second equation. Identification with monotonicity in both equations is discussed e.g. in various articles by Chesher.

By using monotonicity in the first equation, we can remain (almost) completely agnostic about the second equation and thereby do not need the causal chain assumption. Hence, (at least in some of the models) we can permit simultaneity in Y and D, i.e. a model of the type:

Y = φ(D, X, U, V, ...)

D = ψ(Y, X, U, V, ...),

where D and Y are simultaneously determined. (Of course, the identification approaches discussed here could also be used for triangular models.)

Consider one example (see Wooldridge p. 183f) of labour supply and demand equations. Let

h(w) = φ(w, z_{1i}, u_i)

be the labour supply function for an individual i with observed characteristics Z_{1i}, such as number of children, nonlabour income, education, experience, age, marital status, and unobserved characteristics U_i. This relationship describes how many hours individual i would like to work at different wages w. The graph on the left below gives an example for three different individuals with the same value of Z_1 but different values of U. (Suppose that U_1 < U_2 < U_3 and that the left curve refers to individual 1, the middle curve to individual 2 and the right curve to individual 3. Then this graph satisfies monotonicity.)
[Figure: two panels with Wage on the vertical axis and Hours on the horizontal axis. Left: labour supply curves of three individuals with U_1 < U_2 < U_3. Right: wage offer curves for three values of v.]

The graph on the right depicts the wage offer function

w(h) = ψ(h, z_2, v).

This function gives the hourly wage that the (competitive) market offers a person with observed productivity characteristics Z_2 (education, experience, training) and unobserved productivity characteristics V. In addition, the wage offer may depend on the number of hours worked. (Usually the different job positions available in a firm require different numbers of hours worked.) The right graph depicts this wage offer function for three different values of v (with z_2 remaining constant).

In the data we would observe only the equilibrium values (W_i, H_i) for every individual i:

H_i = φ(W_i, Z_{1i}, U_i)

W_i = ψ(H_i, Z_{2i}, V_i),

which poses identification problems if U_i and V_i are correlated.

[Figure: Wage against Hours; the intersections of the labour supply and wage offer curves give the observed equilibrium values (W_i, H_i).]

We might be interested in estimating the entire system, but here we focus on estimating only one of the two equations, e.g. the wage offer function. Any variables that are in Z_1 but not in Z_2 may be potential instruments, in that they shift the labour supply curve without shifting the wage offer curve.

Another classical example is supply and demand in a market. Let P be the price of butter and Q^s and Q^d its supply and demand, such that

Q^s = ψ_s(P, U)

Q^d = ψ_d(P, V),

where the price is determined such that Q^s = Q^d. This implies that price and quantity are simultaneously determined. In the simple linear model we have

Q^s = α_s + β_s P + U

Q^d = α_d + β_d P + V,

which implies that

P = (α_s − α_d)/(β_d − β_s) + (U − V)/(β_d − β_s)

Q = (α_s β_d − α_d β_s)/(β_d − β_s) + (U β_d − V β_s)/(β_d − β_s).

Hence, both P and Q are determined by U and by V, and a regression of Q on P (and a constant) would not uncover either of the above functions (because the regression coefficient depends on the variances and covariances of U and V).
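A short simulation sketch of this simultaneity problem in Python (parameter values are illustrative): the OLS slope of Q on P is a mixture of variance and covariance terms and equals neither β_s nor β_d.

import numpy as np

# Supply and demand in equilibrium: OLS of Q on P recovers neither slope.
rng = np.random.default_rng(2)
n = 100_000
a_s, b_s = 1.0, 0.8      # supply:  Q = a_s + b_s*P + U
a_d, b_d = 5.0, -1.2     # demand:  Q = a_d + b_d*P + V
U = rng.normal(size=n)
V = rng.normal(size=n)
P = (a_s - a_d + U - V) / (b_d - b_s)    # reduced form from Q_s = Q_d
Q = a_s + b_s * P + U                    # equals a_d + b_d*P + V as well

C = np.cov(P, Q)
print(round(C[0, 1] / C[0, 0], 2), b_s, b_d)   # approx. -0.2: neither 0.8 nor -1.2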

The usual approach is to extend the above equations by adding instrumental variables. Candidate variables Z_s for the supply equation could be the prices of input factors such as the agricultural wage rate and the price of hay. Candidate variables Z_d for the demand equation are the prices of substitutes (margarine) and complements (bread):

Q^s = α_s + β_s P + γ_s Z_s + U

Q^d = α_d + β_d P + γ_d Z_d + V.

Variation in Z_s permits identifying the demand curve, and variation in Z_d identifies the supply curve. Variation in these variables as well as in P may be obtained by variation over time or by examining different geographic regions (i.e. local markets).

Identification (of both equations) requires two things. First, Z_s should not enter the demand function Q^d, and Z_d should not enter the supply function Q^s. The plausibility of these assumptions depends on the production technology and on producer and consumer behaviour, and it is relatively straightforward to judge their plausibility. Second, and often more difficult to justify, are the conditions that

E[Z_d U] = 0 and E[Z_s V] = 0,

i.e. that the instrumental variables are uncorrelated with the "error terms". To judge these assumptions we need to have a good understanding of what these error terms are and how they are generated. In other words, if we do not know why demand was "above average" in some regions, it would be hard to argue why E[Z_s V] should be zero.

As another example of potential reverse causality, Frölich and Vazquez-Alvarez (2007) were interested in the effects of HIV/AIDS knowledge on HIV status. The motivation was to find out which level of HIV prevalence we should expect even if everyone had perfect knowledge about the causes of HIV/AIDS and the channels to avoid infection. If this rate were low, public policy should focus on information campaigns to make everyone aware of prudent behaviour. If, on the other hand, the HIV prevalence rate were expected to remain high even with perfect knowledge, policy interventions other than information campaigns might be more important. Since only cross-sectional data was available, one could compare people with high and low HIV knowledge and compare their HIV statuses. The direction of causality would be unclear, though. It could be that HIV knowledge affected behaviour and thereby affected HIV status. On the other hand, HIV-positive people might know their own HIV status and therefore be likely to pick up more information about HIV in general (e.g. at the health centre where they were tested). (Another aspect might be that HIV-positive individuals may after a while experience higher morbidity, which makes them visit a doctor or a health facility more often (if they can afford it), where they are also more exposed to health information.) Hence, the direction of a causal link is unknown with cross-sectional data. In this case, there is no strict simultaneous determination of Y and D in a semantic sense. Rather, it might be that for some individuals the causal link runs from Y to D and for others vice versa. Frölich and Vazquez-Alvarez (2007) partly solve this problem by using only the subsample of individuals who have never been tested for HIV and who do not know their status.

5.1 Monotonicity in the outcome equation

References:
- Chesher (2005, Paper prepared for invited session at ESWC 2005)
- Chernozhukov and Hansen (2005, Econometrica)
- Chernozhukov, Imbens and Newey (forthcoming, JoE)

In a linear simultaneous equations model we write

Y = α_1 + β_1 D + γ_1 X + U

D = α_2 + β_2 Y + γ_2 X + V.

Usually this model only makes sense if Y and D are both continuous variables. If e.g. Y were binary, the support of the variable U could only take certain values such that the data generating process for Y only produces values of 0 or 1. This means further that the support of U has to depend on the variables D and X. Such a model does not seem to be very sensible for economic applications.
If interest is only in the first equation, we can reduce the above model to

Y = α_1 + β_1 D + γ_1 X + U,

without assuming anything about the data generating process for D at all. A nonparametric version of this model is examined in the following.

In this sub-section we briefly examine a few recent results on nonparametric identification in a model where we specify only assumptions on the first equation, and thereby do not impose that D does not depend on Y:

Y = φ(D, X, U).

There will also be variables Z that do not enter φ, i.e. some kind of instrumental variables. The variable D will be allowed to be binary, discrete or continuous, and the approaches used for identification do not differ as much as in the previously discussed triangular case. To some extent this model is more general, but identification will also be harder in practice. The following exposition follows Chernozhukov and Hansen (2005) and Chesher (2005, Section 2, preliminary), who provides a lucid discussion.

A central assumption for identification is strict monotonicity of φ in its third argument, which will render this model useful only for continuous outcome variables.

Assumption 1: The function φ is strictly monotone in its third argument. (φ is normalized to be increasing in u.)

What does this assumption imply? The assumption that φ is weakly monotone corresponds to a rank invariance assumption in the outcome variable (e.g. earnings). An individual i with a value u_i larger than the value u_j of an individual j (with identical characteristics x) will never earn less, regardless of the value of d. The assumption of strict monotonicity essentially requires Y to be continuous. If Y were discrete and φ strictly monotone, then U would also be discrete. If furthermore φ depends on d and/or x, the location of the points of support of U would change with d and/or x in a particular way; an unrealistic scenario. In most applications it is more plausible to assume U to be continuously distributed. The condition that φ is increasing in u is a convenient and innocuous normalization, given that the function φ is not further restricted.

This assumption implies that all individuals can be ordered according to their value of U, which represents their rank in the population. (We might think of U as being normalized to a scalar variable uniformly distributed between 0 and 1.) This assumption is not innocuous. Let D be the choice of educational track after secondary schooling: vocational track or academic track. It may well be the case that the relative position of two individuals (with the same x) in the academic track would change if both were placed in the vocational track. Many models of occupational choice explicitly suppose that individuals differ in which trade/craft/occupation they perform best in and that each individual attempts to choose the most preferred occupation. Some individuals may be poor in some crafts but better in others, whereas it is the reverse for other individuals. This may lead to a reversal in the ranks. In such a situation, the unobservables may represent different skills instead of a single ability measure, i.e. U is a vector, and the function φ(d, x, u) needs to be allowed to depend on d and u in a non-monotonic way. To make assumption 1 plausible will often require that X contains a lot of individual characteristics that might generate such multidirectional heterogeneity. Conditional on a set of characteristics X, it might appear sensible that all the remaining heterogeneity affects the outcome variable monotonically, i.e. such that the individuals can be ranked independently of D.
Chernozhukov and Hansen (2005) relax the assumption of rank invariance to rank similarity, which permits some random changes in the ranks as long as they are not related to the unobservables in the choice equation. The model is extended to

Y = φ(D, X, U^D),

where U_i^{D_i} may be different for different values of D_i. In the example above, where D ∈ {0,1} is vocational track or academic track, U_i^0 are the unobserved skills relevant when choosing the vocational track, whereas U_i^1 are the unobserved skills relevant when choosing the academic track. Hence, the characteristics of individual i consist of

(X_i, U_i^0, U_i^1).

Assumption 2: (a) Either

U^d does not depend on d,

or (b) for some unknown function ψ,

D = ψ(Z, X, V)

and the distribution function

F_{U^d|V,X,Z} does not (functionally) depend on d.

Assumption 2(b) thus permits that U_i^0 and U_i^1 may be different, but among all individuals with the same value of V, X, Z (and thus also the same D), the distributions of U_i^0 and U_i^1 are identical. Hence, there can be random changes in the ranks on the individual level, but it cannot happen that an entire group has different values of the unobserved characteristics for d = 1 and d = 0. Hence, assumption 2(b) permits some unsystematic 'slippages' in the ranks, but does not permit systematic differences between U^0 and U^1.
For ease of notation, we use Assumption 2(a) henceforth.

Assumption 3: We further assume that U is independent of the instruments Z given the characteristics X:

{U^d} ⊥⊥ Z|X.

In the following we will often examine the τ-th quantile of U conditional on Z and X: Q^τ_{U|ZX}(z, x). Since U is independent of Z given X, it follows that Q^τ_{U|ZX}(z, x) = Q^τ_{U|X}(x), which we abbreviate as q_{τ|x}.

From these assumptions we can derive

τ = Pr(U ≤ Q^τ_{U|ZX}(z, x)|Z = z, X = x)
= Pr(U ≤ q_{τ|x}|Z = z, X = x)
= Pr(φ(D, X, U) ≤ φ(D, X, q_{τ|x})|Z = z, X = x)
= Pr(Y ≤ φ(D, X, q_{τ|x})|Z = z, X = x)
= E[1(Y ≤ φ(D, X, q_{τ|x}))|Z = z, X = x].

This implies the moment condition

E[τ − 1(Y ≤ φ(D, X, q_{τ|x}))|Z = z, X = x] = 0,

from which we may be able to identify the function φ.


To ease notation in the following discussion, we impose a normalization on U d , in that we
normalize U d jX to be uniformly distributed. This normalization changes the interpretation of
U d , which is now indicating the rank of the unobservable conditional on each value of X. Hence,
U d by itself is no longer indicating the rank but only within the group of individuals with the
same X. Any correlation of X with U d is thus attributed to the e¤ect of X. This change in
the interpretation should be kept in mind.
Normalization:

U d jX unif orm[0;1] .

This normalization implies QU jX (x) = q jx = and thus gives the moment condition120

E[ 1 (Y '(D; X; )) jZ = z; X = x] = 0.

We know that for the true function φ this moment condition is satisfied at all values x and z. The function φ would thus be nonparametrically identified if no other function satisfied the above moment condition for all x ∈ Supp(X) and z ∈ Supp(Z). Chernozhukov, Imbens and Newey (2004) and Chernozhukov and Hansen (2005) give some local and global identification results.

To ease notation, the conditioning on x is suppressed in the following derivations. In other words, we are now implicitly examining identification of φ separately for one specific value of x, and thus drop the x argument in φ. In addition, we consider identification for a fixed quantile τ and therefore also drop the argument τ in φ. Hence, the function φ(D) carries the arguments τ and x implicitly.
Consider first the situation with discrete D and Z. Let D ∈ {d_1, d_2, ..., d_M} be discrete with M points of support and Z ∈ {z_1, z_2, ..., z_J} be discrete with J points of support. The moment equation

E[τ − 1(Y ≤ φ(D))|Z = z] = 0

can be written as

Σ_{m=1}^{M} Pr(Y ≤ φ(d_m)|D = d_m, Z = z) · Pr(D = d_m|Z = z) = τ,

which gives a system of nonlinear equations with J equations and M unknown values φ(d_1), ..., φ(d_M). Notice that the probabilities Pr(D = d_m|Z = z) are observed. A necessary condition for identification is that J ≥ M (i.e. the support of the instruments must be as rich as the support of the endogenous variable) and that the equations are distinct, which requires that Z must have an influence on D.

[120] Alternatively, if we had strengthened the assumption U^d ⊥⊥ Z|X to U^d ⊥⊥ (Z, X), then we could use that q_{τ|x} = q_τ.
Consider the example with binary D ∈ {0,1} and J = 2, i.e. binary D and a binary instrument. The values we want to identify are

y_0 ≡ φ(0) = φ(0, x, τ)

y_1 ≡ φ(1) = φ(1, x, τ),

e.g. the wages when attending or not attending university for an individual with characteristics x and unobserved rank τ. It remains to find values for y_0 and y_1 such that

Pr(Y ≤ D y_1 + (1 − D) y_0 | Z = z_1) = τ

Pr(Y ≤ D y_1 + (1 − D) y_0 | Z = z_2) = τ.

The issue is not whether a solution exists, which must be the case since the true function satisfies the above moment conditions. The question rather is whether there is a unique solution for (y_0, y_1). Sufficient conditions are hard to establish. Chernozhukov and Hansen (2005) give a rank condition sufficient for local identification. For the case J = M = 2, the condition is that for any values ỹ_0, ỹ_1 that are close to satisfying the above moment conditions and are in the support of the data, the matrix

[ f_{Y,D}(ỹ_0, 0|Z = z_1)   f_{Y,D}(ỹ_1, 1|Z = z_1) ]
[ f_{Y,D}(ỹ_0, 0|Z = z_2)   f_{Y,D}(ỹ_1, 1|Z = z_2) ]

should be of full rank, which requires at least that Z is not independent of D.
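A brute-force sketch in Python of solving the two empirical moment conditions for (y_0, y_1) (grid search is used only for transparency; it is not how one would implement this in practice):

import numpy as np

def iv_quantile_binary(Y, D, Z, tau, grid):
    # Find (y0, y1) minimizing the deviation of the two moment conditions
    # Pr(Y <= D*y1 + (1-D)*y0 | Z = z) = tau for z in {z1, z2}.
    z1, z2 = np.unique(Z)
    best, best_dev = (None, None), np.inf
    for y0 in grid:
        for y1 in grid:
            cut = D * y1 + (1 - D) * y0
            m1 = np.mean((Y <= cut)[Z == z1]) - tau
            m2 = np.mean((Y <= cut)[Z == z2]) - tau
            dev = m1 ** 2 + m2 ** 2
            if dev < best_dev:
                best, best_dev = (y0, y1), dev
    return best

Non-uniqueness would show up here as several well-separated grid points attaining (near-)zero deviation.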

Now, what would have been different in an additive model? In the additive model

Y = φ(D, X) + U

with the restriction E[U|Z, X] = 0 a.s., the moment condition is

E[Y − φ(D, X)|Z, X] = 0 a.s.

or

E[Y|Z, X] − E[φ(D, X)|Z, X] = 0 a.s.

In contrast to the results for the nonseparable model given above, this condition represents a system of linear equations when D is discrete. To ease notation, conditioning on X is going to be kept implicit in the following presentation. Consider binary D ∈ {0,1} and a binary instrument, and define y_0 ≡ φ(0) and y_1 ≡ φ(1). This gives the system of equations:

y_1 P_{z_1} + y_0 (1 − P_{z_1}) = E[Y|Z = z_1]

y_1 P_{z_2} + y_0 (1 − P_{z_2}) = E[Y|Z = z_2],

where P_{z_1} = Pr(D = 1|Z = z_1), or equivalently

[ P_{z_1}  1 − P_{z_1} ] [ y_1 ]   =   [ E[Y|Z = z_1] ]
[ P_{z_2}  1 − P_{z_2} ] [ y_0 ]       [ E[Y|Z = z_2] ].

This is a system of linear equations, for which by definition of the model a solution must exist if the model is true. The question is whether the solution is unique, which requires at least J ≥ M linearly independent equations. Conditions for identification of linear systems are much weaker, which shows that additive separability (if it were true) helps a lot for identification. In the above example, P_{z_1} ≠ P_{z_2} suffices for identification.
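In contrast to the nonlinear case above, the additive case reduces to a 2 × 2 linear system that can be solved directly from estimated cell means (a Python sketch; P_{z_1} ≠ P_{z_2} guarantees an invertible matrix):

import numpy as np

def additive_iv_binary(Y, D, Z):
    # Solve y1*Pz + y0*(1 - Pz) = E[Y|Z=z] for z in {z1, z2}.
    z1, z2 = np.unique(Z)
    A = np.array([[D[Z == z1].mean(), 1.0 - D[Z == z1].mean()],
                  [D[Z == z2].mean(), 1.0 - D[Z == z2].mean()]])
    b = np.array([Y[Z == z1].mean(), Y[Z == z2].mean()])
    y1, y0 = np.linalg.solve(A, b)
    return y0, y1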

Identification critically depends on the support of the instruments, which must be at least as rich as the support of the endogenous regressor. For a binary D, a binary instrument may suffice. For a continuous D, a continuous instrument is necessary. With continuous D, identification and estimation become much harder.
Consider first the case of additive separability. Again X is kept implicit throughout. Now the function φ is a solution to the linear integral equation

∫ φ(d) dF_{D|Z=z}(d) = E[Y|Z = z]  for all z ∈ Supp(Z),

which represents an ill-posed linear inverse problem. In this equation φ is very difficult to identify since the inverse operator is not continuous. Essentially, large but local changes in φ may transform into very small changes on the right-hand side. Hence, it would be very difficult to estimate φ precisely and convergence is likely to be very slow.[121] We are not pursuing this issue further here, but the main message is that nonparametric regression with (continuous) endogenous variables is much harder than conventional nonparametric regression with exogenous variables.[122]

Now, for a nonseparable model with continuous D we would obtain the nonlinear integral equation, again keeping the conditioning on X and τ implicit:

∫ Pr(Y ≤ φ(d)|D = d, Z = z) dF_{D|Z=z}(d) = τ,

which represents an ill-posed nonlinear inverse problem.

Chernozhukov, Imbens and Newey (2004) also suggest a consistent estimator based on two-step series regression: in a first step, φ is approximated by a series, and in the second step the conditional expectation function Ê[·|Z, X] is approximated by a series. Chernozhukov, Imbens and Newey (2004) have no result on the convergence rate and suspect that the estimator would suffer from an ill-posed inverse problem, leading to much lower convergence rates. Chernozhukov and Hansen (forthcoming, JoE) propose (linear) quantile regression, which will be examined in the section on quantile regression.

5.2 Non-continuous Y and the coherency condition

(This chapter is still to be written further.)

When the outcome variable Y is not continuous, nonparametric identification is more difficult. This is even the case for linear models. Suppose Y and D are both binary and let Y* and D* be latent variables such that

Y = 1(Y* > 0)

D = 1(D* > 0)

[121] Consider the similar relationship between the cdf and its pdf: large but local modifications of a pdf can result in only very small or negligible changes in its cdf.
[122] Nonparametric testing for endogeneity might be less difficult, though, as recently demonstrated in Blundell and Horowitz (presented at the ESWC 2005).

and

Y* = α_1 + β_1 D + η_1 D* + γ_1 X + U

D* = α_2 + β_2 Y + η_2 Y* + γ_2 X + V.

If β_1 = β_2 = η_1 = η_2 = 0 and we assume a joint distribution for U and V, we can apply standard ML estimators or parametric two-step estimators. Otherwise, certain assumptions are needed to guarantee the coherency of the model. Consider a simple case:

Y* = α_1 + β_1 D + U

D* = α_2 + β_2 Y + V

and suppose that β_1 = 2, β_2 = −1, α_1 = −1 and α_2 = 1/2. Now consider an individual with U_i = V_i = 0. Suppose that D_i = 1. The above model then implies that Y_i = 1 (since Y_i* = −1 + 2 = 1 > 0), which implies that D_i = 0 (since D_i* = 1/2 − 1 ≤ 0), which implies that Y_i = 0, which in turn implies D_i = 1, and so on. For this individual the two equations are not coherent.

In a recent paper, Chesher (2007a) examines interval identification for a nonparametric model with monotonicity in the outcome equation and Y being non-continuous.

6 Linear single equation IV models

References:
- Wooldridge (2002, Ch. 5)
Now we are focussing on linear single equation IV models of the type

Y_i = X_i β + U_i

X_i = (constant, D_i, X_i)   (1 × K)

Z_i = (constant, Z_i, X_i)   (1 × L)

where X contains the regressors D (which can be a vector) we are interested in and a set of
control variables X (which could be empty). Z contains the instrumental variables Z (which
can be a vector) that have no direct impact on Y and it also contains the control variables X.
In traditional econometrics, D usually represents the vector of endogenous variables and X the vector of exogenous variables. We will adopt this distinction for reasons that become obvious in the discussion below.

For identification of the model, i.e. for being able to obtain a unique value for β from an infinitely large sample, we need the following two conditions:

Assumption 2SLS.1: Exogenous instruments

E[Z'U] = 0.

Assumption 2SLS.2: Rank condition

rank E[Z'Z] = L  and  rank E[Z'X] = K.

Let us briefly consider this model for a single endogenous regressor D:

Y_i = α + β D_i + γ X_i + U_i

D_i = ψ(Z_i, X_i, V_i).

Assumption 2SLS.1 requires that

E[Z'U] = 0   (80)

and also that E[X'U] = 0. We can see immediately that condition (80) usually requires all variables X to be exogenous, i.e. not related to the error terms! This difference can be seen from the following two graphs:

[Figure: two causal graphs, each with Z → D → Y, V → D and U → Y, and with X affecting both D and Y. Left: X is unrelated to U and V. Right: X is related to U and V, i.e. X is endogenous.]

In the left graph, the above assumption (80) is satisfied. In the right graph it is usually not, and 2SLS would usually be inconsistent. In the nonparametric models discussed before, we required only independence conditional on X, and this assumption would be satisfied also in the right graph. In other words, an assumption of the type

E[U|Z, X] = E[U|X] ≠ 0



would suffice. For the 2SLS approach, we "almost"[123] need

E[U|Z, X] = 0.

Hence, whereas we could permit "endogenous" control variables in the nonparametric approach, all control variables must be exogenous for 2SLS. The reason for this is that in the nonparametric approach we compare only individuals with the same values of X to each other,[124] whereas 2SLS compares all individuals to each other by fitting a global linear plane. Therefore, if we include endogenous control variables, we must find additional instruments to control for their endogeneity.

Hence, we implicitly need the following, loosely presented, assumptions for 2SLS:

1. Y is continuous.

2. All variables X are exogenous.

3. The treatment effect of D is constant for every individual (or, if we include interaction terms, is constant for all individuals with the same value of X).

4. The relationship between Y and X and D is linear.

5. Y_i is monotone in U_i, i.e. all individuals can be ranked, conditional on X, and the ranking does not depend on D.

6. Z has no direct effect on Y.

7. Z is not confounded with D or Y.

8. Z has an effect on D.

The last three assumptions are also required for nonparametric identification. In addition, either monotonicity in the outcome equation or in the choice equation is needed. Assumptions 2, 3 and 4 are new and specific to 2SLS.

[123] Conditional independence is not strictly required since uncorrelatedness suffices, but in most applications it is often reasonable to assume that two variables that are uncorrelated are also independent.
[124] By using nonparametric regression using only the data points in a small neighbourhood around X = x.

Now consider the derivation of the IV estimator using the exogenous instrument assumption 2SLS.1. Multiplying the outcome equation by Z_i' and taking expectations gives

Y_i = X_i β + U_i
⟹ Z_i' Y_i = Z_i' X_i β + Z_i' U_i
⟹ E[Z_i' Y_i] = E[Z_i' X_i] β + E[Z_i' U_i]
⟹ E[Z_i' Y_i] = E[Z_i' X_i] β,

which is a linear system with K unknowns and L equations. To be able to solve for β we need at least as many equations as unknowns, i.e. L ≥ K, which is often called the order condition. We can solve the system if the L × K matrix E[Z_i' X_i] is of full rank. Consider an example where the first variable in Z and X is a constant and where the remaining variables have mean zero, such that E[Z_{li} X_{ki}] = cov(Z_{li}, X_{ki}):

[ 1   0                     0                  ]
[ 0   cov(Z_{2i}, X_{2i})   cov(Z_{2i}, X_{3i}) ]
[ 0   cov(Z_{3i}, X_{2i})   cov(Z_{3i}, X_{3i}) ]
[ 0   cov(Z_{4i}, X_{2i})   cov(Z_{4i}, X_{3i}) ]

Full rank requires that the X are not collinear and that the different instruments are correlated with the X variables in different ways. The latter condition would fail, for example, if all instruments Z_2, Z_3, Z_4 were powerful in moving the variables X but had the same strength on both variables, in that cov(Z_{2i}, X_{2i}) = cov(Z_{2i}, X_{3i}), cov(Z_{3i}, X_{2i}) = cov(Z_{3i}, X_{3i}) and also cov(Z_{4i}, X_{2i}) = cov(Z_{4i}, X_{3i}).

If L < K, then β is under-identified, as the system does not permit solving for β uniquely. If L = K and the rank condition holds, β is just identified. If L > K and the rank condition holds, β is over-identified, and we can test whether the assumptions are mutually consistent.

If β is just identified, we obtain the linear IV estimator as

β = (E[Z'X])^{-1} E[Z'Y]

and

β̂ = (Ê[Z'X])^{-1} Ê[Z'Y] = (Σ_i Z_i' X_i)^{-1} Σ_i Z_i' Y_i.

In several applications it may happen that we have more instruments than endogenous regressors: L > K. One may wonder whether this is a relevant situation, after we have seen that it is very difficult to find even one credible instrumental variable. In the analysis of panel data models below, we will often assume that lagged values of X_{it} are uncorrelated with the error term. If we observe many lagged values, we can have a large number of instruments.

For this case of over-identification, there will be more equations than unknowns. If all instruments are valid, β is uniquely identified from E[Z'Y] = E[Z'X]β. For estimating β from a sample, we have to rely on estimates of E[Z'Y] and E[Z'X]. For a given sample, however, it is usually not possible to find a solution β̂ that solves all equations exactly. Then one could either attempt to find a β̂ that solves all equations almost exactly (this approach will be followed with GMM estimation) or solve a "linear combination of the L equations". Instead of using the 1 × L vector of instruments Z_i, a linear combination Z_iΓ of dimension 1 × K is used, where Γ is some L × K matrix. We can repeat the above calculations to obtain

Y_i = X_i β + U_i
⟹ Γ' Z_i' Y_i = Γ' Z_i' X_i β + Γ' Z_i' U_i
⟹ E[Γ' Z' Y] = E[Γ' Z' X] β + E[Γ' Z' U]
⟹ E[Γ' Z' Y] = E[Γ' Z' X] β
⟹ β = (E[Γ' Z' X])^{-1} E[Γ' Z' Y],

provided the K × K matrix E[Γ' Z' X] has full rank. To obtain a precise estimator it is useful to choose a matrix Γ such that the effective instruments Z_iΓ are most highly correlated with the X variables. This is what 2SLS does. To obtain the 2SLS estimator, it is useful to recall some properties of linear projections.

——————————————————————————————
Let {X_i, Y_i}_{i=1}^N be a sample of N iid observations and let X_N be the N × K matrix of all observations stacked above each other. Analogously, Y_N is the N × 1 vector of all observations. Provided the columns of X_N are linearly independent, the following decomposition is always true:[125]

Y_N = P_X Y_N + M_X Y_N  (projection plus residual),  with (P_X Y_N)'(M_X Y_N) = 0 and X_N'(M_X Y_N) = 0,

where the symmetric and idempotent matrices[126] P_X and M_X are defined as

P_X = X_N (X_N' X_N)^{-1} X_N'  and  M_X = I − P_X.

[125] I.e. the decomposition does not depend on any assumptions about observables or unobservables.
[126] A matrix A is symmetric if A = A'. A matrix is idempotent if AA = A.

The …rst part PX YN is the linear projection of YN on the space de…ned by XN . These are
the …tted values from an OLS regression of YN on XN . The second part MX YN are the OLS
residuals. The residuals are always orthogonal in the sample to the predicted values PX YN and
to the regressors XN , as can easily be shown.

We can de…ne a similar linear projection in terms of the population, i.e. not dependent on
the existence of a particular sample and a particular sample size. The population analogue of
the linear projection is

1 1
Y =X E[X0 X] E[X0 Y ] + Y X E[X0 X] E[X0 Y ]
| {z } | {z }
L(Y jX) r

with
E [L(Y jX) r] = 0 and E X0 r = 0.

Hence, the residual r is always orthogonal to the linear projection L(Y jX) and to the
regressors X. The analogy between the sample and the population linear projection suggests
the estimators

\
L(Y jX) = PX YN

r = MX YN .

——————————————————————————————
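As a numerical illustration of these projection properties, the following sketch (with an arbitrary simulated design) verifies the decomposition and the orthogonality conditions:

    import numpy as np

    rng = np.random.default_rng(1)
    N, K = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
    Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=N)

    PX = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix P_X
    MX = np.eye(N) - PX                      # residual maker M_X

    fitted, resid = PX @ Y, MX @ Y
    print(np.allclose(fitted + resid, Y))    # True: Y_N = P_X Y_N + M_X Y_N
    print(abs(fitted @ resid))               # ~0: projection orthogonal to residual
    print(abs(X.T @ resid).max())            # ~0: residuals orthogonal to regressors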
The 2SLS estimator is now defined by choosing the L × K matrix Γ = (E[Z'Z])^{-1} E[Z'X], such that the effective instruments Z_iΓ are the linear projection of X on Z. One could obtain the columns of Γ by regressing separately the variables in X onto all Z. In practice we are going to use some statistical software which does this for us. With this choice of Γ we obtain

β = (E[Γ'Z'X])^{-1} E[Γ'Z'Y]
  = ( E[X'Z] (E[Z'Z])^{-1} E[Z'X] )^{-1} E[X'Z] (E[Z'Z])^{-1} E[Z'Y],

where the first factor is K × K and the second K × 1,

and where the first K × K matrix is invertible if the rank condition 2SLS.2 holds. Hence, we have an explicit expression for β, and by the analogy principle there will be a unique solution β̂ for every sample.

By the analogy principle, the 2SLS estimator for β is

β̂ = ( X_N'Z_N (Z_N'Z_N)^{-1} Z_N'X_N )^{-1} X_N'Z_N (Z_N'Z_N)^{-1} Z_N'Y_N

or equivalently¹²⁷

β̂ = ( X_N'P_Z X_N )^{-1} X_N'P_Z Y_N
   = ( X̂_N'X̂_N )^{-1} X̂_N'Y_N,

where X̂_N = P_Z X_N are the fitted values from the first stage regression.

In other words, we first estimate the regression of X_N on Z_N (separately for each variable in X_N). Then we generate the fitted values X̂_N and run a second stage regression of Y_N on X̂_N. The first stage basically summarizes the information from the many instruments into a lower-dimensional space.¹²⁸
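A hedged sketch of these computations in Python: the code below obtains the 2SLS estimate both from the closed-form expression and as a literal two-step procedure (the over-identified design with instruments z1, z2, z3 is purely hypothetical):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5_000
    z1, z2, z3 = rng.normal(size=(3, n))
    v = rng.normal(size=n)
    u = rng.normal(size=n) + 0.7 * v
    x = 0.4 * z1 + 0.4 * z2 + 0.2 * z3 + v          # one endogenous regressor
    y = 1.0 + 2.0 * x + u                           # true beta = (1, 2)'

    X = np.column_stack([np.ones(n), x])            # N x K
    Z = np.column_stack([np.ones(n), z1, z2, z3])   # N x L with L > K

    # First stage: Gamma_hat = (Z'Z)^{-1} Z'X, fitted values X_hat = P_Z X
    Gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
    X_hat = Z @ Gamma_hat

    # Closed form (X'P_Z X)^{-1} X'P_Z Y, avoiding the N x N matrix P_Z
    beta_closed = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
    # Literal second stage: OLS of y on the first-stage fitted values
    beta_2step = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    print(beta_closed, beta_2step)                  # identical up to rounding

The two routes coincide because X̂'X = X'P_Z X = X'P_Z P_Z X = X̂'X̂, using that P_Z is symmetric and idempotent. Note, however, that the standard errors reported by a naive second-stage OLS would be wrong, since they are based on the residuals Y_i − X̂_i β̂ rather than Û_i = Y_i − X_i β̂ (cf. footnote 129 below).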

To show consistency of 2SLS, insert the definition of the model to obtain

β̂ = β + ( X_N'Z_N (Z_N'Z_N)^{-1} Z_N'X_N )^{-1} X_N'Z_N (Z_N'Z_N)^{-1} Z_N'U_N,

where we can easily verify, by applying the weak law of large numbers to each term (after dividing by N) together with Slutzky's theorem, that

plim_{N→∞} ( X_N'Z_N (Z_N'Z_N)^{-1} Z_N'X_N )^{-1} X_N'Z_N (Z_N'Z_N)^{-1} Z_N'U_N = 0.

Similarly, we can show asymptotic normality of √N(β̂ − β) by applying a central limit theorem to √N (1/N) Z_N'U_N and assuming finite second moments. The asymptotic variance is

( E[X'Z] (E[Z'Z])^{-1} E[Z'X] )^{-1} E[X'Z] (E[Z'Z])^{-1} E[Z'UU'Z] (E[Z'Z])^{-1} E[Z'X] ( E[X'Z] (E[Z'Z])^{-1} E[Z'X] )^{-1},

which simplifies when assuming homoskedasticity:

¹²⁷ For a just-identified β (i.e. where K = L), the 2SLS estimator is identical to the linear IV estimator because ( X_N'Z_N (Z_N'Z_N)^{-1} Z_N'X_N )^{-1} = (Z_N'X_N)^{-1} (Z_N'Z_N) (X_N'Z_N)^{-1}. This gives β̂ = (Z_N'X_N)^{-1} Z_N'Y_N.
¹²⁸ If X contained only a single binary variable (D), then the first stage regression represents an attempt to estimate the propensity score by using a linear model.

Assumption 2SLS.3: Homoskedasticity

E[U² Z'Z] = σ² E[Z'Z].

With homoskedasticity the asymptotic variance simplifies to

σ² ( E[X'Z] (E[Z'Z])^{-1} E[Z'X] )^{-1}.

The variance matrix can be estimated by using the residuals Û_i = Y_i − X_i β̂.¹²⁹

¹²⁹ Notice, these are not the residuals from the second stage, which would be Y_i − X̂_i β̂.

The popularity of 2SLS is also based on an efficiency result. Assuming homoskedasticity 2SLS.3 (as well as 2SLS.1 and 2SLS.2), the 2SLS estimator is efficient in the class of all estimators using instruments linear in Z, i.e. Z_iΓ with Γ = (E[Z'Z])^{-1} E[Z'X] is the best combination of instruments.¹³⁰ (On the other hand, 2SLS will generally not be efficient in the presence of heteroskedasticity. Also, there might be nonlinear combinations of the instruments that are more efficient, but estimation becomes more complicated.)

This efficiency result also implies that asymptotically it is better to use all available instruments than only a subset of them, since using a subset would correspond to rows of zeros in Γ.

¹³⁰ For a proof see Wooldridge (2002), Theorem 5.3, page 96.

In finite samples, 2SLS can perform poorly when the sample size is modest to small and when the impact of the instruments Z on the endogenous regressors D is small. The first stage regression(s) should always be examined in detail, and at least one element of Z should be significant. Similarly, one should always examine the partial R², i.e. the contribution of Z to explaining the variance in D that is not already explained by X.

In contrast to OLS, 2SLS is consistent but not unbiased. Biases can be quite large even in relatively large samples, as was highlighted by Bound, Jaeger and Baker (1995) with respect to the quarter-of-birth instrument in the study of Angrist and Krueger (1991) with 300,000 to 500,000 observations.

In fact, the mean of the 2SLS estimator may not even exist. If all endogenous variables are normally distributed, homoskedastic and with expectations linear in the exogenous variables, the number of finite moments of 2SLS is L − K − 1. Hence, without sufficient overidentification, the 2SLS estimator has no finite mean.

A small deviation from the assumption E[Z'U] = 0 can also have a large impact on the inconsistency of the estimator if the instrument is weak. Consider L = K (such that linear IV = 2SLS). The plim of the IV estimator is

plim β̂ = β + (E[Z'X])^{-1} E[Z'U].

If the correlation between Z and X is small, a small deviation of E[Z'U] from zero can be substantially inflated by (E[Z'X])^{-1}.

In addition, when instruments are weak, the conventional approach of defining confidence regions as Estimate ± 1.96 · Stddev may have coverage rates very different from the nominal size. A large literature has recently attempted to develop alternative approaches to inference.
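A small Monte Carlo sketch can make this visible; the design below (correlation 0.9 between the structural error and the first-stage disturbance, first-stage strengths pi) is an arbitrary illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(3)

    def iv_once(n, pi):
        z = rng.normal(size=n)
        v = rng.normal(size=n)
        u = 0.9 * v + np.sqrt(1 - 0.81) * rng.normal(size=n)  # corr(U, V) = 0.9
        x = pi * z + v                 # pi controls instrument strength
        y = 1.0 * x + u                # true coefficient = 1, all variables mean zero
        return (z @ y) / (z @ x)       # just-identified IV slope

    for pi in (1.0, 0.1, 0.02):
        draws = np.array([iv_once(500, pi) for _ in range(2_000)])
        # report the median: with L = K the IV estimator has no finite mean
        print(pi, np.median(draws))

With pi = 1 the median of the draws is close to the true value 1; as pi shrinks, the sampling distribution spreads out dramatically and its median drifts towards the (inconsistent) OLS probability limit.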

Multiple proxy measurements of the unobservables

In the discussion so far we have been concerned with endogeneity due to some omitted factors that introduce a correlation between X and U in the model

Y_i = X_i β + U_i,

and we have attempted to find at least as many instruments as endogenous variables in X.

Sometimes we may be relatively confident about what U represents, e.g. some kind of ability A_i, and we might believe that after including A_i in the regression the remaining unobserved influences are not correlated with X:

Y_i = X_i β + γ A_i + U_i,   E[U|X, A] = 0.

After controlling for A we could thus simply use the selection-on-observables strategies discussed in Section 2 (e.g. OLS, matching estimation).

Often we may not be able to observe A, but might have some proxy variable A_1 for it, e.g. a test of intelligence. One could think of A_1 as measuring A with some measurement error. This measurement error A_1 − A might be uncorrelated with the true ability and with X:

E[A (A_1 − A)] = 0
E[X' (A_1 − A)] = 0.

We can rewrite the above model as

Y_i = X_i β + γ A_{i1} + ( γ(A_i − A_{i1}) + U_i ),

where the term in parentheses is the combined error term,

and notice that X is uncorrelated with the new error term. On the other hand, A_1 will be correlated with the error term, leading to attenuation bias in γ̂. Hence, the endogeneity problem with respect to X is solved, yet we need an instrument for A_1.

Now suppose we have a second measurement of ability, A_2, which is such that its measurement error A_2 − A is not correlated with the other measurement error, i.e.

E[(A_1 − A)(A_2 − A)] = 0.

This assumption might be reasonable if ability is assessed by two very different tests or independently by two different persons.¹³² Supposing that the measurement error A_2 − A is also uncorrelated with U, we obtain¹³³

E[A_2 ( γ(A_i − A_{i1}) + U_i )] = 0

and thus can use X and A_2 as instruments to estimate β and γ.¹³⁴

An advantage of this multiple-indicator IV approach is that we need only one instrument A_2, which is in fact often highly correlated with A_1. In the conventional 2SLS approach discussed before, we would have needed as many instruments as there are endogenous variables in X. Hence, we might have needed several instruments, which might also have been only weakly correlated with X. On the other hand, this approach can only work if we have good knowledge about what the important unobservables actually are, because we assume that by including A we have eliminated all selection-on-unobservables problems:

E[U|X, A] = 0.

Hence, if in reality the unobservables consisted of very different factors, including cognitive ability but also, e.g., the quality of the individual's social network, then adding ability as a regressor, even if perfectly measured, would not solve the problem.

¹³² It might often not be reasonable if these two measurements were taken on the same day, as their measurement errors might well be correlated.
¹³³ Hint: show this for E[(A_2 − A)( γ(A_i − A_{i1}) + U_i )] + E[A ( γ(A_i − A_{i1}) + U_i )].
¹³⁴ Wooldridge (2002, p. 106) discusses an extension where A_1 and A are related by a function A_1 = δ_0 + δ_1 A + error, which permits the measurement error (A_1 − A) to have a nonzero mean.
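A minimal simulation sketch of this multiple-indicator IV strategy (all coefficients and error scales below are hypothetical choices for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 20_000
    a = rng.normal(size=n)                 # latent ability A
    x = 0.6 * a + rng.normal(size=n)       # regressor correlated with A
    u = rng.normal(size=n)
    y = 1.0 + 1.5 * x + 1.0 * a + u        # beta = 1.5, gamma = 1.0

    a1 = a + rng.normal(size=n)            # first proxy, with measurement error
    a2 = a + rng.normal(size=n)            # second, independent measurement

    W = np.column_stack([np.ones(n), x, a1])   # regressors, A1 standing in for A
    Z = np.column_stack([np.ones(n), x, a2])   # instruments: X for itself, A2 for A1
    beta_iv = np.linalg.solve(Z.T @ W, Z.T @ y)

    beta_ols = np.linalg.lstsq(W, y, rcond=None)[0]
    print("IV :", beta_iv)    # approx (1, 1.5, 1.0)
    print("OLS:", beta_ols)   # coefficient on a1 attenuated towards zero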

7 Repetition of GMM estimation

References: Wooldridge, Chapter 14



In the following section, estimation of systems of equations by IV will be examined in more detail. It is therefore useful to repeat several basic properties of generalized method of moments (GMM) estimation. We focus here on GMM estimation of nonlinear systems, with results for linear systems following immediately as a special case. We consider three situations:

1) General GMM: E[g(W, θ_0)] = 0
2) GMM with orthogonality restrictions: E[Z' r(W, θ_0)] = 0
3) GMM with conditional moment restrictions: E[r(W, θ_0)|Z] = 0

7.1 GMM

Wooldridge Ch. 14.1

Let W_i ∈ R^M be a random vector whose distribution depends on a P × 1 parameter vector θ. A sample of N iid observations is drawn. Let g(W, θ) be a known vector-valued L × 1 function. We assume:

Assumption: There is a θ_0 ∈ R^P such that

E[g(W, θ_0)] = 0.

Examples of such a function in the linear regression model are g(W, θ) = Y − Xθ and g(W, θ) = Z'(Y − Xθ). The first function is based on the assumption that the disturbance has mean zero, whereas the second is based on the assumption that the disturbance is uncorrelated with Z. A necessary requirement for identification is L ≥ P, i.e. there must be at least as many moment functions as unknown parameters.

For L = P, the analogy principle suggests estimating θ_0 by setting the sample average to zero:

(1/N) Σ_{i=1}^N g(W_i, θ̂) = 0.

This may not always be possible, e.g. due to collinearity in linear models. In nonlinear models, even in the absence of collinearity, setting the sample average of g to zero may not always be possible.

If L > P, there are more equations than unknown parameters and it is usually not possible to set Σ g(W_i, θ) = 0 exactly. The GMM approach attempts to set the sample average close to zero in quadratic form:

θ̂ = arg min_θ ( (1/N) Σ g(W_i, θ) )' Ω̂ ( (1/N) Σ g(W_i, θ) ),

where Ω̂ is an L × L symmetric positive semidefinite (psd) weighting matrix, assumed to converge in probability to an L × L positive definite (pd) matrix Ω_0.

From here on, the exposition follows the lecture slides. For identification we additionally assume uniqueness of the solution: E[g(W, θ)] ≠ 0 if θ ≠ θ_0.

Identification of θ
Using a uniform law of large numbers (ULLN), the objective function converges uniformly in probability to

E[g(W_i, θ)]' Ω_0 E[g(W_i, θ)],

which is uniquely minimized at θ_0 since Ω_0 is pd.


Consistency (Theorem 14.1): θ̂ →p θ_0
if Θ is compact, g(·, θ) is measurable,
g(w, ·) is continuous on Θ for each w,
|g_l(w, θ)| ≤ b(w) with E[b(W)] < ∞ for all θ ∈ Θ and l = 1, ..., L,
Ω̂ →p Ω_0, where Ω_0 is L × L pd, and θ_0 is the unique solution.

Asymptotic normality
Assuming g(w, ·) is continuously differentiable on int(Θ) and θ_0 ∈ int(Θ), the FOC is

( (1/N) Σ ∇_θ g(W_i, θ̂) )' Ω̂ ( (1/N) Σ g(W_i, θ̂) ) = 0.

Define the expected gradient

G_0 = E[∇_θ g(W, θ_0)]   (L × P, assumed to have full rank P)

and write g_i(θ) = g(W_i, θ). A Taylor approximation gives

g_i(θ̂) = g_i(θ_0) + ∇_θ g_i(θ_0) (θ̂ − θ_0) + o_p(‖θ̂ − θ_0‖).

Using the WLLN and CLT,

√N (θ̂ − θ_0) = −(G_0'Ω_0G_0)^{-1} G_0'Ω_0 N^{-1/2} Σ g_i(θ_0) + o_p(1)   (influence function representation)

→d N( 0, (G_0'Ω_0G_0)^{-1} G_0'Ω_0 E[g(W, θ_0) g(W, θ_0)'] Ω_0 G_0 (G_0'Ω_0G_0)^{-1} ).

Asymptotic normality holds under the following conditions:

Theorem 14.2: Assume θ_0 is in the interior of Θ, g(w, ·) is continuously differentiable on int(Θ) for all w, g(w, θ_0) has finite second moments, ∇_θ g_l(W, θ) is bounded in absolute value by b(w) with E[b(W)] < ∞, and G_0 has rank P.

Consistent estimation of the asymptotic variance, given θ̂, follows by the analogy principle.

Optimal weighting matrix

Define Λ_0 = E[g(W, θ_0) g(W, θ_0)'].

The optimal weighting matrix, if Λ̂ is a consistent estimator of Λ_0, is Λ̂^{-1}.

Intuition: those moments where g has a large variance should get lower weight.

Hence, an estimator with plim Ω̂ = Λ_0^{-1} is efficient, because

(G_0'Ω_0G_0)^{-1} G_0'Ω_0 Λ_0 Ω_0 G_0 (G_0'Ω_0G_0)^{-1} − (G_0'Λ_0^{-1}G_0)^{-1} = E[ss'],   which is psd,

where

s = { (G_0'Ω_0G_0)^{-1} G_0'Ω_0 − (G_0'Λ_0^{-1}G_0)^{-1} G_0'Λ_0^{-1} } g(W, θ_0).

Two-step estimation to obtain efficient estimates

For asymptotically efficient estimation, start with an arbitrary weighting matrix
→ θ̂ → compute Λ̂^{-1} and estimate again
→ simpler asymptotic variance matrix: (G_0'Λ_0^{-1}G_0)^{-1}.

Overidentification test
For a consistent estimate θ̂ and estimated matrix Λ̂, the statistic

N ( (1/N) Σ g_i(θ̂) )' Λ̂^{-1} ( (1/N) Σ g_i(θ̂) ) →d χ²(L − P)

→ test whether some (or all) moment restrictions are incorrect.

Hint: the matrix Λ̂^{-1} has to be used here: this is the inverse of the covariance matrix of g.
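As a concrete sketch, the following Python code implements two-step GMM for the linear IV moment g(W, θ) = Z'(Y − Xθ) together with the overidentification statistic; the heteroskedastic data-generating process is an illustrative assumption:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 4_000
    Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # L = 4 instruments
    v = rng.normal(size=n)
    u = (0.8 * v + rng.normal(size=n)) * (1 + 0.5 * np.abs(Z[:, 1]))  # heteroskedastic
    x = Z[:, 1:] @ np.array([0.5, 0.4, 0.3]) + v
    X = np.column_stack([np.ones(n), x])                         # P = 2 parameters
    y = X @ np.array([1.0, 2.0]) + u

    def gmm_linear_iv(y, X, Z):
        # Step 1: weighting (Z'Z)^{-1}, i.e. 2SLS, as preliminary estimator
        W1 = np.linalg.inv(Z.T @ Z)
        A = X.T @ Z
        theta1 = np.linalg.solve(A @ W1 @ A.T, A @ W1 @ Z.T @ y)
        # Step 2: optimal weighting Lambda_hat^{-1}, Lambda_hat = (1/N) sum g_i g_i'
        g1 = Z * (y - X @ theta1)[:, None]
        W2 = np.linalg.inv(g1.T @ g1 / n)
        theta2 = np.linalg.solve(A @ W2 @ A.T, A @ W2 @ Z.T @ y)
        gbar = (Z * (y - X @ theta2)[:, None]).mean(axis=0)
        J = n * gbar @ W2 @ gbar         # overidentification statistic
        return theta2, J

    theta, J = gmm_linear_iv(y, X, Z)
    df = Z.shape[1] - X.shape[1]           # L - P degrees of freedom
    print(theta, J, stats.chi2.sf(J, df))  # estimate, J statistic, p-value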

Test of nested models

The GMM criterion function statistic can also be used to test nested models:
H_0: c(θ_0) = 0, where c(θ) is a Q × 1 vector with Q ≤ P.

Let θ̂ be the estimate of the unrestricted model and θ̃ the estimate of the restricted model, both obtained with the same Λ̂^{-1} as weighting matrix. The difference in criterion functions satisfies

N ( (1/N) Σ g_i(θ̃) )' Λ̂^{-1} ( (1/N) Σ g_i(θ̃) ) − N ( (1/N) Σ g_i(θ̂) )' Λ̂^{-1} ( (1/N) Σ g_i(θ̂) ) →d χ²_Q.

The test is invariant to reparameterizations of H_0 (in contrast to the Wald statistic).

Two-step estimators (or plug-in estimators)

How to obtain the asymptotic variance? Many estimation problems depend on estimated nuisance parameters, e.g. GLS.

Moment conditions:

E[h(W, θ_0, γ_0)] = 0,

where γ is a finite-dimensional nuisance parameter and γ̂ is estimated in a previous step.

Now, estimate θ̂ by MM or GMM, plugging in γ̂ instead of γ_0:

Σ h(W_i, θ̂, γ̂) = 0.

How does the variance of θ̂ depend on the first-step estimates γ̂? Using γ̂ instead of γ_0 usually increases the variance of θ̂ (but sometimes may not, or might even decrease it).

If γ̂ is an M-estimator solving

Σ s(W_i, γ̂) = 0,

we can define the stacked GMM estimator via

g(W, θ, γ) = [ h(W, θ, γ) ; s(W, γ) ].

The first order conditions of this GMM estimator reproduce the FOC for θ̂ and γ̂. The asymptotic variance formula for GMM above gives the variance of θ̂ and also shows when the variance of θ̂ does not depend on the first-step estimator
→ we do not need to re-estimate θ and γ jointly by GMM; we simply plug the estimates into the formula.

Still, we can also estimate θ and γ jointly by GMM
→ this can give more efficient estimates.

Also, combined GMM estimation could be used for efficient estimation of θ̂ and γ̂.

Note: θ̂ and γ̂ can be M-estimators or GMM-estimators.

7.2 GMM with orthogonality restrictions

Wooldridge Ch. 14.2

Notation in this chapter: the dimensions of the variables Z, X differ from before.

In regression models we often have moment conditions of the type

E[Z' r(W, θ_0)] = 0

or

E[r(W, θ_0)|Z] = 0.

The first case is discussed here, the second in the next subsection. The function r is often a residual and Z is a matrix of instruments. If Z contains a constant, then E[r(W, θ_0)] = 0 is implied, and E[Z' r(W, θ_0)] = 0 amounts to zero covariance between Z and r. We can use the GMM framework to derive specific results for this situation.

r(w, θ) is a G × 1 known function and Z is a G × L instrument matrix, so that Z'r is L × 1.

Often we can think of G as the number of equations.

Example: We examined single-equation linear IV models before:

Y = Xβ + U,

where

E[Z'U] = E[Z'(Y − Xβ)] = 0   (L × 1).

Now we might have several equations with potentially different instruments Z:
- demand and supply side
- different time periods in panel data models
  (we will see models where only past values can serve as instruments; hence the number of instruments varies by equation and grows with t)

Hence, in contrast to Chapter 4, we allow G > 1 and permit nonlinear models!

Identification requires that θ_0 is the only θ ∈ Θ satisfying the moment restriction.

Asymptotic normality requires that

rank E[Z_i' ∇_θ r_i(θ_0)] = P,

where ∇_θ r_i is of dimension G × P, which requires L ≥ P. Hence, Z must be orthogonal to r but sufficiently correlated with ∇_θ r_i.

In the linear one-equation setup (G = 1 and r(W_i, θ) = Y_i − X_iθ) this requires

rank E[Z_i'X_i] = P   (see Assumption SIV.2).

Two-step estimation process

Start with

Ω̂ = ( (1/N) Σ Z_i'Z_i )^{-1}

and solve

θ̃ = arg min_{θ∈Θ} ( Σ Z_i'r_i(θ) )' Ω̂ ( Σ Z_i'r_i(θ) ).

This is the nonlinear system 2SLS estimator.



This is the efficient estimator under "homoskedasticity", i.e. if

E[Z_i' r(W_i, θ_0) r(W_i, θ_0)' Z_i] = σ_0² E[Z_i'Z_i].

We obtain the efficient GMM estimator, irrespective of assumptions on rr', with weighting matrix Λ̂^{-1}, where

Λ̂ = (1/N) Σ Z_i' r_i(θ̃) r_i(θ̃)' Z_i.   (81)

GMM 3SLS estimator

If we knew more about the correlation structure between the different equations, we could estimate this correlation matrix and obtain the GMM 3SLS estimator. System homoskedasticity: suppose we know that

E[Z_i' r(W_i, θ_0) r(W_i, θ_0)' Z_i] = E[Z_i' Σ_0 Z_i],

where

Σ_0 = E[r(W_i, θ_0) r(W_i, θ_0)']   (G × G matrix).

Given a preliminary estimator, the efficient GMM estimator then uses the weighting matrix

( (1/N) Σ Z_i' Σ̂ Z_i )^{-1},

where Σ̂ = (1/N) Σ r_i(θ̃) r_i(θ̃)'.

This is called the GMM 3SLS estimator (although it is done in two steps). This estimator is no more efficient than the GMM estimator with Λ̂^{-1} above, but it is less efficient when system homoskedasticity is wrong.

Why is GMM 3SLS ever used?
- historic reasons
- it may sometimes have better finite-sample properties

7.3 GMM with conditional moment restrictions

E[r(W, θ_0)|Z] = 0.

Wooldridge Ch. 14.5.3

Often we assume not only that Z and r are uncorrelated but also that r is (mean-)independent of Z; in nonlinear models, uncorrelatedness does not always ensure identification.

Independence implies that r is uncorrelated with any function of Z:

E[φ(Z) r(W, θ_0)] = 0 for any function φ.

One possible function would be

φ(Z) = Z',

which gives the results of the previous subsection.

The optimal GMM estimator is defined as the solution θ̂ to

Σ_{i=1}^N E[ ∂r(W, θ_0)/∂θ' | Z_i ]' { Var[r(W, θ_0)|Z_i] }^{-1} r(W_i, θ̂) = 0.

This is the just-identified combination of the possible instruments; no weighting matrix is needed. (For a proof see also Newey and McFadden 1994.)

The optimal GMM estimator requires knowledge of

E[ ∂r(W, θ_0)/∂θ' | Z_i ]' { Var[r(W, θ_0)|Z_i] }^{-1}.

These are usually unknown and require (complicated nonparametric) estimators. Estimation of these instruments has no effect on the asymptotic variance of θ̂.

Proof (only valid for parametric estimators of the instruments):

The GMM estimator solves

Σ_{i=1}^N A(Z_i)' r(W_i, θ̂) = 0,

where the A(Z_i) are the optimal instruments defined above. Let A(Z_i) depend on some unknown parameters γ, such that A(Z_i) = A(Z_i, γ_0). Since the A(Z_i, γ_0) are unknown, estimates A(Z_i, γ̂) are used to estimate θ̂:

Σ_{i=1}^N A(Z_i, γ̂)' r(W_i, θ̂) = 0.

Suppose γ̂ is consistent at the parametric rate, such that √N(γ̂ − γ_0) = O_p(1). Writing A_i(γ) for A(Z_i, γ) and r_i(θ) for r(W_i, θ), an expansion of A_i(γ̂)' r_i(θ̂) gives

N^{-1/2} Σ A_i(γ̂)' r_i(θ̂) = N^{-1/2} Σ A_i(γ_0)' r_i(θ_0) + ( N^{-1} Σ ∇_γ [A_i(γ_0)' r_i(θ_0)] ) √N(γ̂ − γ_0)
+ ( N^{-1} Σ A_i(γ_0)' ∇_θ r_i(θ_0) ) √N(θ̂ − θ_0) + o_p(1),

which is equivalent to

N^{-1/2} Σ A_i(γ̂)' r_i(θ̂) = N^{-1/2} Σ A_i(γ_0)' r_i(θ_0) + E[ ∇_γ (A_i(γ_0)' r_i(θ_0)) ] √N(γ̂ − γ_0)
+ E[ A_i(γ_0)' ∇_θ r_i(θ_0) ] √N(θ̂ − θ_0) + o_p(1),

where all terms are O_p(1).

Now the second term on the right hand side is zero because

E[ ∇_γ A(Z, γ_0)' r(W, θ_0) ] = E[ ∇_γ A(Z, γ_0)' E[r(W, θ_0)|Z] ] = 0,

since E[r(W, θ_0)|Z] = 0 was assumed.

If we now expand A_i(γ_0)' r_i(θ̂) instead of A_i(γ̂)' r_i(θ̂), we get the same result:

N^{-1/2} Σ A_i(γ_0)' r_i(θ̂) = N^{-1/2} Σ A_i(γ_0)' r_i(θ_0) + E[ A_i(γ_0)' ∇_θ r_i(θ_0) ] √N(θ̂ − θ_0) + o_p(1).

Hence, the estimators with known and with estimated instruments have the same (first-order) representation.

General advice with respect to two-step GMM: the second step should yield more precise but similar estimates, since the first step is already significant. Otherwise, the first-step estimates are more reliable.

8 Linear system estimation by IV

Wooldridge Ch. 8

Having repeated the GMM approach, we can examine linear IV systems in detail. Before, we considered a linear equation of the type

Y = Xβ + U,

where X contains endogenous and exogenous variables and Z contains only exogenous variables. Now consider a system of equations

Y_1 = X_1 β_1 + U_1
⋮
Y_G = X_G β_G + U_G,

where X_g is a 1 × K_g vector and Z_g a 1 × L_g vector of exogenous variables with

E[Z_g' U_g] = 0.

Examples of systems:

- Supply and demand equations: supply of and demand for potatoes in different local markets i and time periods t:

demand_it = φ(price_it, Z_{1it}, U_it)
supply_it = ψ(price_it, Z_{2it}, V_it).

We observe only equilibrium prices and quantities. Z_1, Z_2 are curve shifters, e.g. Z_2 weather conditions and Z_1 the price of a substitute good, which help to trace out the demand or supply curve.

[Figure: price-quantity diagram in which demand curves for Market 1 and Market 2 shift along a common supply curve S.]

- Labour supply and wage offer function:

h = φ(wage, Z_1, U)    hours of labour supply (desired hours worked)
w^o = ψ(hours, Z_2, V)    wage offer: how much the market pays.

Z_1 are labour supply shifters: education, experience, age, marital status, number of children, non-labour income. Z_2 are observed productivity attributes (education, experience, training) and V are unobserved productivity attributes. Again we observe only the equilibrium values.

- Panel data models without strict exogeneity (in this case G = T): basically the same equation, but for different time periods with different Z_g; asymptotics for N → ∞, T fixed.

Define the system

Y = (Y_1, ..., Y_G)'   (G × 1),   U = (U_1, ..., U_G)'   (G × 1),

X = [ X_1   0   ...   0
       0   X_2  ...   0
       ...            ...
       0    0   ...  X_G ]   (G × K block-diagonal),

where K = K_1 + ... + K_G and L = L_1 + ... + L_G, and

Z = [ Z_1   0   ...   0
       0   Z_2  ...   0
       ...            ...
       0    0   ...  Z_G ]   (G × L block-diagonal).

Now the system can be written as

Y = Xβ + U,

where β = (β_1', β_2', ..., β_G')'.

Assumption SIV.1: E[Z'U] = 0
Assumption SIV.2: rank E[Z'X] = K

E[Z'X] = [ E[Z_1'X_1]      0        ...      0
               0       E[Z_2'X_2]   ...      0
              ...                            ...
               0           0        ...  E[Z_G'X_G] ]   (L × K block-diagonal).

For this block-diagonal matrix to have full column rank, each block must have full column rank, which requires

rank E[Z_g'X_g] = K_g,   g = 1, ..., G.

This is the rank condition needed for estimating each equation by 2SLS.
→ Identification of the system is equivalent to identification equation by equation.

But this holds only when the β_g are unrestricted across equations. Later, in panel data models, we often have β_1 = β_2 = ... = β_G.

Estimation by GMM

Under SIV.1,

E[Z'(Y − Xβ)] = 0,

with β uniquely identified by SIV.2.

Estimation according to the analogy principle: estimate β̂ as the solution to

(1/N) Σ_{i=1}^N Z_i'(Y_i − X_i β̂) = 0,

which is a system of L linear equations in K unknown coefficients.

Just-identified system: L = K and the rank condition holds.

If Σ_{i=1}^N Z_i'X_i is nonsingular:

β̂ = ( (1/N) Σ Z_i'X_i )^{-1} ( (1/N) Σ Z_i'Y_i ).

The condition L = K and the rank condition together imply that every equation is just identified (otherwise one equation would not be identified). In this case, system estimation is identical to IV estimation equation by equation: there is no need to estimate a system!

Over-identified system: L > K and the rank condition holds.

It is then not possible to set the sample correlation with the residual exactly to zero:

(1/N) Σ_{i=1}^N Z_i'(Y_i − X_i β̂) ≈ 0.

Choose β̂ to set the sample correlation close to zero by GMM:

β̂ = arg min_β ( (1/N) Σ Z_i'(Y_i − X_iβ) )' Ŵ ( (1/N) Σ Z_i'(Y_i − X_iβ) ),

where Ŵ is an L × L symmetric psd weighting matrix, which might be estimated.

For a given Ŵ, there is a closed-form solution for β̂:

FOC: ( (1/N) Σ X_i'Z_i ) Ŵ ( (1/N) Σ Z_i'(Y_i − X_iβ̂) ) = 0,

which gives

β̂ = { (Σ X_i'Z_i) Ŵ (Σ Z_i'X_i) }^{-1} (Σ X_i'Z_i) Ŵ (Σ Z_i'Y_i).
- Consistency

Assumption SIV.3: plim Ŵ = W, where W is a nonrandom symmetric L × L pd matrix.

With SIV.2 and SIV.3, ( (1/N) Σ X_i'Z_i ) Ŵ ( (1/N) Σ Z_i'X_i ) converges to a K × K nonsingular matrix, and

β̂ = β + ( (1/N) Σ X_i'Z_i · Ŵ · (1/N) Σ Z_i'X_i )^{-1} ( (1/N) Σ X_i'Z_i ) Ŵ ( (1/N) Σ Z_i'U_i ),

plim β̂ = β + ( E[X'Z] W E[Z'X] )^{-1} E[X'Z] W · plim (1/N) Σ Z_i'U_i = β.

If K = L, the choice of Ŵ does not affect the estimator, because Σ X_i'Z_i is K × K nonsingular.

- Asymptotic normality

√N (β̂ − β) →d N( 0, (C'WC)^{-1} C'W E[Z'UU'Z] W C (C'WC)^{-1} ),

where C = E[Z'X].

The proof makes use of √N (1/N) Σ Z_i'U_i →d N(0, E[Z'UU'Z]).

Given a consistent estimator of E[Z'UU'Z], the asymptotic variance of β̂ can be consistently estimated.

How to choose the weighting matrix Ŵ?

System 2SLS estimator:

Ŵ = ( (1/N) Σ_{i=1}^N Z_i'Z_i )^{-1},

which converges to (E[Z'Z])^{-1}. This gives 2SLS equation by equation.

Advantage: SIV.3 is easily satisfied; Ŵ does not depend on first-step estimates. This is a good first-step estimate. But there might be a more efficient weighting matrix.

For the linear case with the conditional moment restriction

E[(Y − Xβ)|Z] = 0,

optimal weighting is given by the GMM formula with conditional moment restrictions. The solution simplifies to

Σ_{i=1}^N E[X|Z_i]' { Var[U|Z_i] }^{-1} (Y_i − X_i β̂) = 0.

These would be the best instruments under conditional moment restrictions.

Under the weaker assumption of uncorrelatedness, E[Z'(Y − Xβ)] = 0, essentially only linear combinations of the instruments are available. The optimal weighting matrix is then

( E[Z'UU'Z] )^{-1},

which can be estimated given a first-step estimate.

In the case of system homoskedasticity,

E[Z'UU'Z] = E[Z' Σ_0 Z],

where

Σ_0 = E[UU']   (G × G matrix).

For the system of linear equations, this assumption is equivalent to

E[U_g U_h Z_g'Z_h] = E[U_g U_h] E[Z_g'Z_h],   g, h ∈ {1, 2, ..., G}.

If furthermore

Σ_0 = σ² I,

we see that the optimal linear weighting gives the system 2SLS estimator. This is also the case when Σ_0 is a diagonal matrix, since the GMM 3SLS estimator imposing a diagonal Σ̂ can be shown to be algebraically identical to 2SLS equation by equation. This corresponds to the result that 2SLS is efficient under homoskedasticity and with linear weighting of the moment conditions.

GMM 3SLS estimator

If we imposed system homoskedasticity, then instead of weighting by

( E[Z'UU'Z] )^{-1}

we could use weighting by

( E[Z' Σ_0 Z] )^{-1},

where Σ_0 can be estimated given the first-step GMM estimates. This is called the GMM 3SLS estimator (although it is done in two steps). This estimator is no more efficient than the GMM estimator with ( E[Z'UU'Z] )^{-1} above, but it is less efficient when system homoskedasticity is wrong.

Why is GMM 3SLS ever used?
- historic reasons
- it may sometimes have better finite-sample properties

Is system IV always preferable to 2SLS? It is less robust, because it requires

E[Z_g'U_g] = 0 for all g = 1, ..., G.

If we were interested only in equation 1, then

E[Z_1'U_1] = 0

would suffice for 2SLS. Hence, the estimates β̂_1 can become inconsistent due to endogeneity in any of the other equations. There is a trade-off between robustness and efficiency. The overidentification test for system GMM (or 3SLS) can be used to test the system and compared to the overidentification tests of 2SLS equation by equation.

General advice with respect to two-step GMM: the second step should yield more precise but similar estimates, since the first step is already significant. Otherwise, the first-step estimates are more reliable.

9 Nonparametric diff-in-diff and Panel Data

(References: Athey and Imbens, Econometrica 2006)

Simple example: the arrival of a large number of immigrants/refugees in one city. What is their impact on the local market, on crime or on diseases? Suppose the immigrants arrived at some time between t − 1 and t. We could compare Y_t to Y_{t−1}, but there might also have been other changes over time. Hence, compare Y_t − Y_{t−1} in city A to Y_t − Y_{t−1} in city B (where no immigrants arrived). The difference in differences is

ΔY_{t,A} − ΔY_{t,B} = (Y_{t,A} − Y_{t−1,A}) − (Y_{t,B} − Y_{t−1,B})
                    = (Y_{t,A} − Y_{t,B}) − (Y_{t−1,A} − Y_{t−1,B}),

i.e. the difference over time within each city, or equivalently the difference between the cities in each period.

Define D = 1(time = t, city = A) as the treatment.



The last term in the above representation is the bias before treatment.

Regression representation of DiD:

Y = α + λ · 1_{time=t} + γ · 1_{city=A} + β · D + U,

where D is the interaction term and γ · 1_{city=A} is the time-constant city effect. Taking first differences over time eliminates the time-constant city effect:

ΔY = λ + β · ΔD + ΔU.

This type of estimator we are going to see again in linear panel data models.
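A minimal numerical sketch (with made-up coefficients) confirming that the OLS coefficient on the interaction term equals the double difference of the four group means:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 4_000
    city_A = rng.integers(0, 2, n)     # 1 = city A (receives the immigrants)
    post = rng.integers(0, 2, n)       # 1 = period t (after the arrival)
    D = city_A * post                  # treatment indicator
    y = 1.0 + 0.5 * post + 0.3 * city_A + 2.0 * D + rng.normal(size=n)

    X = np.column_stack([np.ones(n), post, city_A, D])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    print("OLS interaction coefficient:", coef[3])

    m = lambda a, p: y[(city_A == a) & (post == p)].mean()
    print("DiD of group means:", (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0)))

Because the regression is saturated in the four group-by-period cells, the two numbers agree exactly, not just asymptotically.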

Condition on additional characteristics X:

[Causal graph with outcome Y, treatment D, observed covariates X, and unobservables V (related to D) and U (related to Y).]

In many evaluation settings it may not be feasible to observe all confounding variables. In these cases the evaluation strategy has to cope with selection on unobserved variables. Nevertheless, average treatment effects may still be identified either through an instrumental variable (see next section) or if the average selection bias can be estimated from pre-treatment outcomes. This latter approach is based on a similar motivation as the pre-programme test: if systematic differences in the pre-programme outcomes between different treatment groups occur, these differences may not only indicate that not all confounding variables have been included, but may further be useful to predict the magnitude of selection bias in the post-programme outcomes.

If X does not contain all confounding variables, adjusting for the differences in the X distributions by using propensity score matching will not yield a consistent estimate of the


average treatment effect on the treated, because

E[Y_t^1|D_t = 1] − ∫ E[Y_t^0|X_t, D_t = 0] dF_{X_t|D_t=1}
≠ E[Y_t^1|D_t = 1] − ∫ E[Y_t^0|X_t, D_t = 1] dF_{X_t|D_t=1} = E[Y_t^1 − Y_t^0|D_t = 1],

since E[Y_t^0|X_t, D_t = 1] ≠ E[Y_t^0|X_t, D_t = 0]. The difference

∫ ( E[Y_t^0|X_t, D_t = 1] − E[Y_t^0|X_t, D_t = 0] ) dF_{X_t|D_t=1}

is the systematic bias in the potential outcome Y_t^0 in period t that still remains even after adjusting for the different distributions of X.
Pre-programme outcomes might help to estimate this systematic bias with respect to the 'non-participation' outcome Y_t^0. Therefore the following discussion centers on the identification of average treatment effects on the treated relative to non-participation: E[Y_t^1 − Y_t^0|D_t = 1]. Define the average selection bias in period t with respect to the participants in period t as

B_{t,t} = ∫ ( E[Y_t^0|X_t, D_t = 1] − E[Y_t^0|X_t, D_t = 0] ) dF_{X_t|D_t=1},

i.e. as the systematic outcome difference between the group of non-participants (D_t = 0) and the group of participants (D_t = 1) if both groups would participate in treatment 0. If, for example in the evaluation of active labour market programmes, the individuals who decided to participate were on average more able, it is likely that their labour market outcomes would also have been better even without participation in the programme. In this case, the average selection bias B_{t,t} would be positive. If the potential outcome in the case of non-participation Y_t^0 is related over time, it is likely that these differences between the treatment groups would also persist in other time periods, including periods before the start of the programme. In other words, the more able persons would also have enjoyed better labour market outcomes in periods before treatment. If the pre-programme outcome in period τ is not causally affected by the programme, so that (??) holds, the 'non-participation' outcomes Y_τ^0 = Y_τ are observed for the different treatment groups, and the corresponding average selection bias in period τ,

B_{τ,t} = ∫ ( E[Y_τ|X_τ, D_t = 1] − E[Y_τ|X_τ, D_t = 0] ) dF_{X_τ|D_t=1},

is identified from the observed pre-programme data.

Assuming that the average selection bias is stable over time (Eichler and Lechner 2002),

B_{t,t} = B_{τ,t},   (82)

the average treatment effect on the treated is identified as

E[Y_t^1 − Y_t^0|D_t = 1] = E[Y_t^1|D_t = 1] − ( ∫ E[Y_t^0|X_t, D_t = 0] dF_{X_t|D_t=1} + B_{t,t} )
= E[Y_t|D_t = 1] − ( ∫ E[Y_t|X_t, D_t = 0] dF_{X_t|D_t=1} + B_{τ,t} ).   (83)

This resembles a difference-in-differences type estimator adjusted for the distribution of the X covariates, which is further discussed in Section 2.1.5. For an application of nonparametric difference-in-differences estimation to the evaluation of active labour market programmes in East Germany, see Eichler and Lechner (2002) or Bergemann, Fitzenberger, Schultz, and Speckesser (2000) and Bergemann, Fitzenberger, and Speckesser (2001).

The bias-stability assumption (82) is not strictly necessary. Instead, it suffices if B_{t,t} can be consistently estimated from the average selection biases in pre-programme periods (predictable-bias assumption). If (causally unaffected) pre-programme outcomes are observed for many periods, the average selection bias can be estimated in each period, and any regular trends observed in B̂_{τ,t}, B̂_{τ−1,t}, B̂_{τ−2,t}, ... may lead to better predictions of the bias B_{t,t} than simply estimating B_{t,t} by the selection bias in period τ, as the bias-stability assumption (82) would suggest.

Loosely speaking, the predictable-bias assumption (with the bias-stability assumption as a special case) is weaker than the conditional independence assumption (2), since it allows B_{t,t} ≠ 0, whereas the conditional independence assumption requires B_{t,t} = 0. However, the two assumptions are not nested, because B_{t,t} may be zero while B̂_τ, B̂_{τ−1}, B̂_{τ−2}, ... may be unable to predict B_{t,t} = 0.¹³⁶ A further difference occurs if the pre-programme outcomes Y_τ are themselves confounders, i.e. influencing the treatment selection decision and the post-programme outcomes. If, in addition, all other confounding variables are observed, the independence assumption (2) would be valid conditional on the pre-programme outcomes and the other confounders. This would imply zero selection bias (B_{t,t} = 0) and the applicability of the control-for-confounding-variables approach. The difference-in-differences approach, on the other hand, would introduce selection bias (B_{t,t} ≠ 0) by not conditioning on the pre-programme outcome Y_τ (i.e. not including Y_τ in X).

¹³⁶ For instance, if B_{t,t} = 0 and B_{τ,t} ≠ 0 and, erroneously, bias stability (82) is assumed.

A weakness of the difference-in-differences approach is that it does not entail any theoretical guidelines for deciding which variables (if any at all) should be included in the conditioning set

X.¹³⁷ Heckman, Ichimura, and Todd (1997), Heckman, Ichimura, Smith, and Todd (1998) and Smith and Todd (2005) consider a stronger version of the bias-stability assumption (82), which requires that the bias is stable not only on average but for any possible value of X:

E[Y_t^0 − Y_τ | X, D_t = 1] = E[Y_t^0 − Y_τ | X, D_t = 0].   (84)

This stronger assumption demands that all variables that affect the increase (growth) of the non-participation outcome over time (Y_t^0 − Y_τ) and the selection into treatment 0 or r are included in X. Although this stronger assumption does not help to identify the average treatment effect on the treated, it may be useful in the search for the relevant conditioning variables X, because if (84) is true then (82) also holds.

¹³⁷ Although the guideline for the control-for-confounding-variables approach is rather vague, it still gives some indication which variables are relevant and which are not.

9.1 Linear difference-in-differences

Whereas the previous subsection examined a nonparametric version of differences-in-differences, we examine here briefly the conventional linear model with constant treatment effects, mainly because it has been very popular in applied policy evaluation.

Consider again the case with a single control group and two time periods t = 0, 1. Suppose a policy change at t = 1 in the unemployment insurance law affected individuals becoming unemployed only if they were older than 50 at the time of unemployment registration, and let Y be some outcome measure. For a nice application see Lalive (2008). We could run the regression

Y = β_0 + β_1 · 1_{age 50+} + β_2 · 1_{time=1} + δ · 1_{age 50+} · 1_{time=1} + U,   (85)

where 1_{age 50+} is one if age > 50 and zero otherwise. δ measures the treatment effect of the policy change, β_1 captures (time-constant) differences between the two age groups, and β_2 captures time trends (in the absence of the policy change) that are assumed to be identical for both age groups.

One can show that the OLS estimate of δ can also be written as

δ̂ = (ȳ_{50+,t=1} − ȳ_{50+,t=0}) − (ȳ_{50−,t=1} − ȳ_{50−,t=0})   (86)

or equivalently as

δ̂ = (ȳ_{50+,t=1} − ȳ_{50−,t=1}) − (ȳ_{50+,t=0} − ȳ_{50−,t=0}),   (87)

where ȳ is the group-average outcome.

In representation (87) the DiD estimate compares the outcomes in time period 1 and subtracts the bias from permanent (time-constant) differences between the two groups. In representation (86) the average outcome gain for age group 50+ is estimated and a possible bias from a general trend is removed (under the assumption that the trend is the same in the 50− group).

It is worth noting that for (85) individual panel data is not needed. In fact, not even individual data is needed, since only group averages are required. For estimation the four averages ȳ_{50+,t=1}, ȳ_{50−,t=1}, ȳ_{50+,t=0} and ȳ_{50−,t=0} would be sufficient. For inference, however, we have to be careful about this. One could estimate (85) by OLS and use the conventional formula to obtain t-values. We would obtain different t-statistics when using individual observations or when plugging in only the four averages, though. Clearly, since we used the same underlying data, both approaches must lead to the same conclusions, such that some adjustments are needed.

Before we consider inference, we first examine the difference-in-difference-in-differences (DiDiD) estimator. In the above example, we might be concerned that the assumption of a common time trend between the 50+ and the 50− group is too strong. Particularly if the periods t = 0 and t = 1 are in fact some time apart (e.g. 10 years), different trends might have affected these groups, or in other words, the composition of the unobserved characteristics might have changed over time. We might be able to remove the bias due to such non-identical trends if we have another control group that was not affected at all by the policy change. In Lalive (2008) only the 50+ group living in certain regions, called A, was affected by the policy, whereas individuals in neighbouring regions, called B, were not affected at all. For those living in region B we could estimate

(ȳ_{B,50+,t=1} − ȳ_{B,50−,t=1}) − (ȳ_{B,50+,t=0} − ȳ_{B,50−,t=0}).

Because no policy change happened in region B, this expression should be zero if time trends were identical, i.e. if the unobserved differences between 50+ and 50− remained identical over time. If not, we could subtract this estimate to remove the bias due to changing time trends, as in the DiDiD

(ȳ_{A,50+,t=1} − ȳ_{A,50−,t=1}) − (ȳ_{A,50+,t=0} − ȳ_{A,50−,t=0})
− { (ȳ_{B,50+,t=1} − ȳ_{B,50−,t=1}) − (ȳ_{B,50+,t=0} − ȳ_{B,50−,t=0}) }

or equivalently

Δȳ_{A,50+} − Δȳ_{A,50−} − Δȳ_{B,50+} + Δȳ_{B,50−},   (88)

where Δ refers to the difference over time.


This DiDiD is numerically equivalent to the coe¢ cient on the triple interaction term
1age_50+ 1time=1 1A :

Y = 0 + 1 1age_50+ + 2 1time=1 + 3 1A

+ 4 1age_50+ 1time=1 + 5 1age_50+ 1A

+ 6 1time=1 1A + 1age_50+ 1time=1 1A + U .

A simple way to prove that the population equivalent, i.e. expected value, of (88) is identical
to is to use the above regression equation to express the expected value of yA;50+;t=1 as

0 + 1 + 2 + 3 + 4 + 5 + 6 + . With analogous calculations for the other groups and


plugging these expressions into (88) we obtain that only remains.

A similar idea can be used when 3 time periods t = −1, 0, 1 are available, of which 2 are measured before the policy change. Again, if the assumption of identical time trends for both groups were valid, the following expression should have mean zero:

(ȳ_{50+,t=0} − ȳ_{50−,t=0}) − (ȳ_{50+,t=−1} − ȳ_{50−,t=−1}).

If not, we could use this expression to measure the change in the time trend before the treatment. Assuming that the change in the trend, i.e. the acceleration, is the same in both groups, we could predict the counterfactual average outcome for ȳ_{50+,t=1}, i.e. in the absence of a policy change. The DiDiD estimate is then

(ȳ_{50+,t=1} − ȳ_{50−,t=1}) − (ȳ_{50+,t=0} − ȳ_{50−,t=0})
− { (ȳ_{50+,t=0} − ȳ_{50−,t=0}) − (ȳ_{50+,t=−1} − ȳ_{50−,t=−1}) }

or equivalently

Δ²ȳ_{50+} − Δ²ȳ_{50−},

where Δ² denotes the second difference over time.

The basic approach in all these situations is that we have only one treated group (in one time period) and several non-treated groups (in earlier time periods). We thus use all the non-treated observations to predict the counterfactual outcome for the time period in which the treated group was affected by the policy change.

More generally, let g index different groups (e.g. 50+ and 50− in regions A and B) and t index time periods. The model for the mean outcome ȳ_gt can be written as

ȳ_gt = μ_gt + β · D_gt + v_gt,

where μ_gt is a set of group-by-time-period constants and D_gt is one if treated and zero otherwise. As we just consider group averages, the model so far is completely general. Without further restrictions it is not identified, as is obvious e.g. for the two-groups two-time-periods case. An identifying restriction, as discussed above, is in this case to assume

ȳ_gt = μ_g + λ_t + β · D_gt + v_gt

as well as uncorrelatedness of D_gt and v_gt. With G groups and T time periods and everything measured at the group level, one can use conventional panel data analysis with the appropriate asymptotic inference, depending on whether G → ∞ and/or T → ∞. (Strict exogeneity of D_gt might be a concern.) Since ȳ_gt is a group-level average, v_gt is likely to be heteroskedastic and its variance is likely to depend on the group size N_gt.

Whereas the previous discussion only required observation of the group-level averages ȳ_gt, covariates are often included, either for efficiency reasons or for making the constant-trend assumption more plausible. As mentioned before, we require the time trends to be the same for the two groups, or equivalently that the differences in (counterfactual) outcomes between group 1 and group 0 are identical in periods t = 1 and t = 0. Clearly, if the observed characteristics X changed over time, this assumption is less plausible. We would thus like to take changes in X into account and assume that the differences due to unobservables are constant over time. In a linear model one could simply include the group-by-time averages X̄_gt in the model:

ȳ_gt = μ_g + λ_t + β · D_gt + X̄_gt'γ + v_gt.

To gain efficiency, in a linear model, one could alternatively include individual characteristics Z_i as well:

Y_igt = μ_g + λ_t + β · D_gt + X̄_gt'γ + V_gt + Z_igt'δ_gt + U_igt.

This is an example of a multilevel model, where the regressors and error terms are measured at different aggregation levels. Simply calculating standard errors by the conventional formula for iid errors, thereby ignoring the group structure in the error term V_gt + U_igt, usually leads to too large t-values. For calculating the standard errors one would like to permit serial correlation in V_gt, while assuming that the V_gt are independent across groups.
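The following sketch illustrates this inference point with simulated group-by-time shocks, using the cluster-robust covariance option of statsmodels (assuming that library is available); note that clustering on a small number of groups is itself delicate, so the example is illustrative only:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(8)
    G, T, Ngt = 20, 2, 100                  # groups, periods, individuals per cell
    rows = []
    for g in range(G):
        for t in range(T):
            d = int(g < G // 2 and t == 1)  # first half of the groups treated in t = 1
            v_gt = rng.normal(scale=0.5)    # group-by-time shock V_gt
            for _ in range(Ngt):
                rows.append((g, t, d, 1.0 + 0.3 * t + 1.0 * d + v_gt + rng.normal()))
    df = pd.DataFrame(rows, columns=["g", "t", "d", "y"])

    model = smf.ols("y ~ d + C(g) + C(t)", data=df)
    print(model.fit().bse["d"])                              # iid formula: misleadingly small
    print(model.fit(cov_type="cluster",
                    cov_kwds={"groups": df["g"]}).bse["d"])  # cluster-robust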

9.2 Changes-in-changes model

9.2.1 Changes-in-changes with continuous outcome Y

Athey and Imbens (2006) consider a nonparametric generalization of the DiD approach, permitting treatment effect heterogeneity and identifying distributional effects as well. They call it the "changes-in-changes" (CiC) approach. The effects of time and of treatment are permitted to differ systematically across individuals. I discuss here the situation with 2 groups g ∈ {0, 1} and two time periods t ∈ {0, 1}. Group 1 is subject to the policy change in the second time period. The focus is on estimation of the ATET. To estimate the counterfactual outcome Y^0 in case of non-treatment, we can use the information from the other three group-by-time combinations.

We consider first the case without covariates. Each individual i is characterized by the variables U_i and G_i, where U_i is some unobserved characteristic and G_i is the group that individual i belongs to (i.e. treated or not). Both are random variables. U and G are permitted to be dependent, i.e. the policy change could have happened in regions where U was particularly low or high.

Assumption 1: Y_i^0 = φ(U_i, T_i) and φ is strictly increasing in its first argument.
Assumption 2: U ⊥⊥ T | G
Assumption 3: Supp(U|G = 1) ⊆ Supp(U|G = 0)

The first assumption requires that the non-treatment outcome Y_i^0 is only a function of U and time, but not of the group G. Hence, while G and U are permitted to be correlated, G does not directly affect Y_i^0. This assumption requires the function φ not to depend on G.

The assumption of strict monotonicity in U permits us to invert the function φ, to map from

the observed outcomes Y_i^0 to the unobserved U_i. Since U is usually continuously distributed, the outcomes Y must be continuously distributed as well. For discrete Y, only set identification is obtained, or stronger assumptions are needed.

The second assumption requires that, within each group, the distribution of U is the same over time. Hence, while specific individuals are permitted to have different values of U in time periods 0 and 1, the distribution of U in the entire group remains unchanged.

The third assumption is a common support assumption on U. For every value of U in the G = 1 population, we need to infer the counterfactual outcome in the t = 1 period, which can only be achieved from the G = 0 population.

Note that groups and time periods are treated asymmetrically. The important assumptions are thus: first, within each time period, the production function (for the non-treatment outcome) φ(U, t) is the same in both groups. Second, the defining feature of a group is that the distribution of U does not change over time (although for each individual it is permitted to change). Note that we can reverse the roles of G and T, which leads to a different model with different assumptions and different estimates. In the reversed changes-in-changes model, the assumption is that the production function φ does not change over time, but is permitted to be different between groups. In addition, the distribution of U has to be the same in both groups, but is permitted to change over time. As an example, consider as groups the cohorts of 60-year-old males and females. We may be willing to assume that the distribution of U is the same for males and females. As these cohorts age, the distribution of U changes over time. We may consider a medical intervention and further assume that the health production function (without treatment) may depend on U and also on group membership (i.e. gender) but does not change over time. Hence, the model applies when either T or G does not enter the production function φ(U, T, G) and the distribution of U (its quantiles) remains the same in the other dimension (i.e. the one which enters φ). Whichever of these two potential model applications is more appropriate depends on the particular empirical application. The estimates can be different. However, since the model does not contain any overidentifying restrictions, neither of these two models can be tested.

To obtain a further understanding of the above assumptions, suppose we were to assume additionally:
Additivity: U_i = α + η · G_i + ε_i with ε_i ⊥⊥ (G_i, T_i)
Single index model: φ(u, t) = h(u + δ · t)
Identity transformation: h is the identity function.

Combined with Assumptions 1 to 3 above, we would obtain

Y_i^0 = α + δ · T_i + η · G_i + ε_i   with ε_i ⊥⊥ (G_i, T_i),

which corresponds to the standard linear DiD model for the non-treatment outcome. Hence, the CiC model nests the linear DiD as a special case.

The identification result for the counterfactual distribution function is given below. We first sketch an "intuitive" outline of the identification. The basic idea is that in time period t = 0 the production function φ is the same in both groups G, so that different outcome distributions of Y in the G = 0 and G = 1 groups can be attributed to different distributions of U in the two groups. From time period 0 to 1 the production function changes, but the distribution of U remains the same, i.e. someone at quantile q of U will remain at quantile q in time period 1. The inverse distribution function (i.e. quantile function) will frequently be used and is defined for a random variable Y as¹³⁸

F_Y^{-1}(q) = inf { y : F_Y(y) ≥ q, y ∈ Supp(Y) }.

Consider an individual i in the G = 1 group, and suppose we knew the value of U_i.¹³⁹ We would like to know φ(U_i, 1), for which only the group G = 0, t = 1 is informative, because the G = 1 = t group is observed only in the treatment state and because the G = 0 = t group is only informative about φ(U_i, 0). We do not observe U_i in the G = 0 group, but by monotonicity we can relate quantiles of Y to quantiles of U.

We start from an individual of the group G = 1, t = 0, map it first into the G = 0 = t group, and relate it then to the G = 0, t = 1 group. First, suppose the value U_i corresponds to the quantile q in the G = 1, t = 0 group:

F_{U|10}(U_i) = q.

¹³⁸ This implies that F_Y(F_Y^{-1}(q)) ≥ q. This relation holds with equality if Y is continuous or, when Y is discrete, at discontinuity points of F_Y. Similarly, F_Y^{-1}(F_Y(y)) ≤ y. This relation holds with equality at all y ∈ Supp(Y) for continuous or discrete Y (but not necessarily if Y is mixed).
¹³⁹ I use the word individual here only for convenience. In fact, only the quantile in the U distribution is important, so whenever I refer to an individual, I refer to any individual at a particular quantile of U. (This could be a different individual.)

We observe the outcomes in the non-treatment state for both groups in the t = 0 period. In the G = 0 = t group, the value U_i is associated with a different quantile q′,

F_{U|00}(U_i) = q′,

or in other words, the individual with U_i is at rank q′ in the G = 0 = t group:

q′ = F_{U|00}( F_{U|10}^{-1}(q) ).

(More precisely, the observation at rank q in the G = 1, t = 0 group has the same value of U as the observation at rank q′ in the G = 0 = t group.)

Because the function φ(·, t) is strictly increasing, the rank transformation is the same with respect to U as with respect to Y, i.e.

q′ = F_{Y|00}( F_{Y|10}^{-1}(q) ).   (89)

Now we use the assumption U ⊥⊥ T | G = 0, which implies that the quantile q′ in the G = 0 group corresponds to the same value of U in t = 0 as in t = 1. Hence, the outcome at rank q′ of the U distribution in t = 1 is

F_{Y|01}^{-1}(q′).

Because the function φ depends only on U and T but not on G, this is the counterfactual outcome for an individual with U_i of group 1 in time period t = 1. In addition, by the assumption U ⊥⊥ T | G = 1, this individual would also be at rank q in time period 1. In other words, the counterfactual quantile F_{Y^0|11}^{-1}(q) for an individual with U_i corresponding to rank q in the G = 1, t = 0 population is

F_{Y^0|11}^{-1}(q) = F_{Y|01}^{-1}(q′) = F_{Y|01}^{-1}( F_{Y|00}( F_{Y|10}^{-1}(q) ) ).

The following diagram illustrates the logic of this derivation. We consider an individual in the G = 1, t = 0 group at rank q. The q-th quantile of Y in the G = t = 1 population is the observed outcome after treatment. The counterfactual outcome is obtained by first mapping the rank q into the rank q′ in the G = t = 0 population, which then gives the q′ quantile in the G = 0, t = 1 population.

(G = 1, t = 1)        (G = 0, t = 1)
      ↑                     ↑
(G = 1, t = 0)   →   (G = 0, t = 0)
    rank q               rank q′

Hence, the quantile treatment effect on the treated at quantile q is

F_{Y|11}^{-1}(q) − F_{Y|01}^{-1}( F_{Y|00}( F_{Y|10}^{-1}(q) ) ).

Inverting the quantile function, we obtain the counterfactual distribution function

F_{Y^0|11}(y) = F_{Y|10}( F_{Y|00}^{-1}( F_{Y|01}(y) ) ).   (90)

From the above derivations it is obvious that for every value of U 2 Supp(U jG = 1) we need
to have also observations with U in the G = 0 group, which is made precise in Assumption 3.

Now examine identification of the ATET. Consider an individual i from the G = 1 population with outcome Y_{i,t=0} in the first period and Y_{i,t=1} after the treatment. As derived in (89), the rank of this individual in the G = 0 population is

q′ = F_{Y|00}(Y_{i,t=0})

and the corresponding outcome in period t = 1 is thus

F_{Y|01}^{-1}(F_{Y|00}(Y_{i,t=0})),

which is the counterfactual outcome for this individual. By drawing randomly individuals from the G = 1, t = 0 population we obtain the ATET as

E[Y | G = t = 1] − E[F_{Y|01}^{-1}(F_{Y|00}(Y)) | G = 1, t = 0].
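As an illustration, the following minimal Python sketch (not part of the original notes; all names are illustrative) computes this plug-in ATET from four samples drawn from the (G, t) cells, assuming continuous Y:

    import numpy as np

    def ecdf(sample, y):
        # empirical distribution function evaluated at y
        return np.mean(sample <= y)

    def cic_atet(y00, y01, y10, y11):
        # plug-in estimate of E[Y|G=t=1] - E[F^{-1}_{Y|01}(F_{Y|00}(Y)) | G=1, t=0]
        ranks = np.array([ecdf(y00, y) for y in y10])  # ranks in the (G=0,t=0) cell
        counterfactual = np.quantile(y01, ranks)       # map ranks into (G=0,t=1) outcomes
        return np.mean(y11) - np.mean(counterfactual)

    # simulated example with phi(u,0) = exp(u), phi(u,1) = exp(u) + 2 and true ATET = 1
    rng = np.random.default_rng(0)
    y00 = np.exp(rng.normal(0.0, 1, 5000))
    y01 = np.exp(rng.normal(0.0, 1, 5000)) + 2
    y10 = np.exp(rng.normal(0.5, 1, 5000))
    y11 = np.exp(rng.normal(0.5, 1, 5000)) + 3
    print(cic_atet(y00, y01, y10, y11))  # close to 1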

A more formal derivation as in Athey and Imbens (2006) can be obtained as follows. (The proof in Athey and Imbens (2006) is more careful about the support conditions on U.) One can first show that

F_{Y^0|gt}(y) = Pr(φ(U, t) ≤ y | G = g, T = t)
            = Pr(U ≤ φ^{-1}(y, t) | G = g, T = t)
            = Pr(U ≤ φ^{-1}(y, t) | G = g)
            = F_{U|g}(φ^{-1}(y, t)).

This implies

F_{Y|00}(y) = F_{U|0}(φ^{-1}(y, 0))

and substituting y = φ(u, 0) we obtain

F_{Y|00}(φ(u, 0)) = F_{U|0}(u)
⟹ φ(u, 0) = F_{Y|00}^{-1}(F_{U|0}(u))    (91)

provided u ∈ Supp(U|G = 0).


With similar derivations for G = 0 and t = 1 we obtain

F_{Y|01}(y) = F_{U|0}(φ^{-1}(y, 1))
⟹ F_{U|0}^{-1}(F_{Y|01}(y)) = φ^{-1}(y, 1).    (92)

Now starting from (91), substituting u = φ^{-1}(y, 1) and entering (92) gives

φ(φ^{-1}(y, 1), 0) = F_{Y|00}^{-1}(F_{Y|01}(y)).    (93)

Further,

F_{Y|10}(y) = F_{U|1}(φ^{-1}(y, 0))
⟹ F_{Y|10}(φ(φ^{-1}(y, 1), 0)) = F_{U|1}(φ^{-1}(y, 1)),    (94)

where we substituted y with φ(φ^{-1}(y, 1), 0).


Now finally,

F_{Y^0|11}(y) = F_{U|1}(φ^{-1}(y, 1)) = F_{Y|10}(φ(φ^{-1}(y, 1), 0))

by entering (94), and plugging in (93) gives

= F_{Y|10}(F_{Y|00}^{-1}(F_{Y|01}(y))),

which is identical to (90).



9.2.2 Changes-in-changes with discrete outcome Y and interval identification

Now consider the situation when the outcome variable Y is discrete with a finite number of support points Supp(Y) = {λ_0, ..., λ_L}. The previous model thus needs to be modified to be a realistic model of the observed data. Since the assumption of discrete U is not very attractive, we maintain the assumption that U is continuously distributed in the G = 0 and G = 1 populations, but now permit the function φ to be only weakly monotonically increasing in U. Without loss of generality we assume that U|G = 0, t = 0 is uniformly distributed on [0, 1].

Without further assumptions, the counterfactual distribution is no longer point identified. We first discuss this case and show later how to restore point identification under additional assumptions. The reason why point identification is lost is that we can no longer invert F_{Y|00} to obtain the value of U. Consider the following graph and remember that we normalized U|G = 0 to be uniformly distributed. When we observe Y = 3, we only know that U lies in (F_{Y|00}(2), F_{Y|00}(3)]. If Y were continuously distributed, the value of U would be exactly identified.

[Insert Figure 345 here.]

With discrete Y we only know that, for φ a non-decreasing function,

u = Pr(U ≤ u | G = 0) = Pr(U ≤ u | G = 0, t = 0) ≤ Pr(φ(U, 0) ≤ φ(u, 0) | G = 0, t = 0).    (95)

The inequality follows because U ≤ u implies φ(U, 0) ≤ φ(u, 0) but not vice versa. Let Q denote the set of all values of q ∈ [0, 1] such that ∃ y ∈ Y_{00} with F_{Y|00}(y) = q. If u ∈ Q, the statements U ≤ u and φ(U, 0) ≤ φ(u, 0) imply each other. We thus obtain for u ∈ Q

u = Pr(U ≤ u | G = 0) = Pr(U ≤ u | G = 0, t = 0) = Pr(φ(U, 0) ≤ φ(u, 0) | G = 0, t = 0)
  = Pr(Y ≤ φ(u, 0) | G = 0, t = 0) = F_{Y|00}(φ(u, 0)).

Hence, for u ∈ Q we have

φ(u, 0) = F_{Y|00}^{-1}(u).    (96)

All values of U in (F_{Y|00}(y⁻), F_{Y|00}(y)], where y⁻ denotes the support point just below y, will be mapped into Y = y. It is helpful to define, next to the usual inverse, a second inverse function:

F_{Y|00}^{-1}(q) = inf{y : F_{Y|00}(y) ≥ q, y ∈ Supp(Y_{00})},
F_{Y|00}^{(-1)}(q) = sup{y : F_{Y|00}(y) ≤ q, y ∈ Supp(Y_{00}) ∪ {−∞}},

where Y_{00} = Supp(Y | G = 0, t = 0). These two inverse functions also permit describing the interval of values of U that are mapped into the same value of Y. Consider a value q such that F_{Y|00}^{-1}(q) = y. Then all values U_i = u with

F_{Y|00}(F_{Y|00}^{(-1)}(q)) < u ≤ F_{Y|00}(F_{Y|00}^{-1}(q))

will be mapped into Y_i = y.


Regarding the two inverse functions, we note that for values of q such that ∃ y ∈ Y_{00} with F_{Y|00}(y) = q, it follows that F_{Y|00}^{(-1)}(q) = F_{Y|00}^{-1}(q). Let Q denote the set of all values of q ∈ [0, 1] that satisfy this relationship. These are the jump points in the previous figure. For all other values of q ∉ Q we have that F_{Y|00}^{(-1)}(q) < F_{Y|00}^{-1}(q). For all values of q it thus follows that

F_{Y|00}(F_{Y|00}^{(-1)}(q)) ≤ q ≤ F_{Y|00}(F_{Y|00}^{-1}(q))    (97)

and

F_{Y|00}(F_{Y|00}^{(-1)}(q)) = q = F_{Y|00}(F_{Y|00}^{-1}(q))   for q ∈ Q.

We now show that F_{U|G=1}(u) is identified only at values u ∈ Q. We derived above that F_{Y|00}(φ(u, 0)) = u, and for u ∈ Q it follows that φ(u, 0) = F_{Y|00}^{-1}(u) by (96). Now consider F_{U|G=1}(u) for some value u ∈ Q:

F_{U|G=1}(u) = Pr(U ≤ u | G = 1) = Pr(U ≤ u | G = 1, t = 0)
            = Pr(φ(U, 0) ≤ φ(u, 0) | G = 1, t = 0) = Pr(Y ≤ φ(u, 0) | G = 1, t = 0)
            = F_{Y|10}(φ(u, 0)) = F_{Y|10}(F_{Y|00}^{-1}(u)).

Hence, F_{U|G=1}(u) is point identified only at u ∈ Q. For all other values, F_{U|G=1}(u) can only be bounded, similarly to (95).
The following example illustrates the identification region of F_{U|G=1}(u):

y       F_{Y|00}   F_{Y|10}   F_{Y|01}
y = 1     0.1        0.3        0.2
y = 2     0.4        0.5        0.6
y = 3     0.7        0.9        0.8
y = 4     1          1          1

The following figure shows the distribution function F_{U|G=1}(u) as a function of u. The graph on the left indicates the values of F_{U|G=1}(u) where it is identified from F_{Y|00} and F_{Y|10}. Since distribution functions are right-continuous and non-decreasing, the shaded areas in the graph on the right show the lower and upper bounds on F_{U|G=1}(u). In other words, the function F_{U|G=1} must lie in the shaded areas.

[Insert figure from page -18 here, left and right panels.]

Having (partly) identified the function F_{U|G=1}, we now proceed with identifying the distribution of the counterfactual outcome F_{Y^0|11}. Note first that

F_{Y|0t}(y) = Pr(φ(U, t) ≤ y | G = 0) = sup{u : φ(u, t) = y}

and

F_{Y^0|1t}(y) = Pr(φ(U, t) ≤ y | G = 1) = Pr(U ≤ sup{u : φ(u, t) = y} | G = 1)
            = Pr(U ≤ F_{Y|0t}(y) | G = 1) = F_{U|G=1}(F_{Y|0t}(y)).

This implies

F_{Y^0|11}(y) = F_{U|G=1}(F_{Y|01}(y)).    (98)

Hence, we can derive F_{Y^0|11}(y) from the distribution F_{U|G=1}. For the example given above we obtain

F_{Y^0|11}(1) = F_{U|G=1}(F_{Y|01}(1)) = F_{U|G=1}(0.2) ∈ [0.3, 0.5]
F_{Y^0|11}(2) = F_{U|G=1}(F_{Y|01}(2)) = F_{U|G=1}(0.6) ∈ [0.5, 0.9]
F_{Y^0|11}(3) = F_{U|G=1}(F_{Y|01}(3)) = F_{U|G=1}(0.8) ∈ [0.9, 1]
F_{Y^0|11}(4) = F_{U|G=1}(F_{Y|01}(4)) = F_{U|G=1}(1) = 1.

This is also illustrated in the following graph.

[Insert figure from page -17 here.]

The formal derivation of these bounds is as follows. As shown above, F_{Y^0|1t}(y) = Pr(U ≤ F_{Y|0t}(y) | G = 1). Hence, replacing y with F_{Y|00}^{(-1)}(F_{Y|01}(y)) we obtain

F_{Y|10}(F_{Y|00}^{(-1)}(F_{Y|01}(y))) = Pr(U ≤ F_{Y|00}(F_{Y|00}^{(-1)}(F_{Y|01}(y))) | G = 1)
                                   ≤ Pr(U ≤ F_{Y|01}(y) | G = 1) = F_{Y^0|11}(y)


because of (97) and (98). Similarly,

F_{Y|10}(F_{Y|00}^{-1}(F_{Y|01}(y))) = Pr(U ≤ F_{Y|00}(F_{Y|00}^{-1}(F_{Y|01}(y))) | G = 1)
                                  ≥ Pr(U ≤ F_{Y|01}(y) | G = 1) = F_{Y^0|11}(y).

We thus obtain Theorem 4.1 of Athey and Imbens (2006): for y ∈ Y_{01},

F_{Y|10}(F_{Y|00}^{(-1)}(F_{Y|01}(y))) ≤ F_{Y^0|11}(y) ≤ F_{Y|10}(F_{Y|00}^{-1}(F_{Y|01}(y))).

This bounds the distribution of the counterfactual outcome F_{Y^0|11}(y). (Athey and Imbens (2006) also show that these bounds are tight, i.e. that no narrower bounds can exist. This is done by finding at least one example consistent with the model assumptions that implies a function F_{Y^0|11}(y) exactly equal to the lower bound, and similarly another example where the implied function F_{Y^0|11}(y) is exactly equal to the upper bound. This then implies that no tighter bounds exist that are generally valid.)
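To make the computation of these bounds concrete, here is a small Python sketch (illustrative only) that reproduces the interval computations of the numerical example above, implementing the two inverse functions on the support {1, 2, 3, 4}:

    import numpy as np

    support = [1, 2, 3, 4]
    F00 = {1: 0.1, 2: 0.4, 3: 0.7, 4: 1.0}
    F10 = {1: 0.3, 2: 0.5, 3: 0.9, 4: 1.0}
    F01 = {1: 0.2, 2: 0.6, 3: 0.8, 4: 1.0}

    def F_inv(F, q):
        # F^{-1}(q) = inf{y in supp(Y00): F(y) >= q}
        return min(y for y in support if F[y] >= q - 1e-12)

    def F_inv_bracket(F, q):
        # F^{(-1)}(q) = sup{y in supp(Y00) union {-inf}: F(y) <= q}
        cands = [y for y in support if F[y] <= q + 1e-12]
        return max(cands) if cands else -np.inf

    for y in support:
        q = F01[y]
        yb = F_inv_bracket(F00, q)
        lo = 0.0 if yb == -np.inf else F10[yb]   # lower bound F_{Y|10}(F^{(-1)}_{Y|00}(q))
        hi = F10[F_inv(F00, q)]                  # upper bound F_{Y|10}(F^{-1}_{Y|00}(q))
        print(f"F_Y0|11({y}) in [{lo}, {hi}]")   # [0.3,0.5], [0.5,0.9], [0.9,1.0], [1.0,1.0]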

9.2.3 Changes-in-changes with discrete outcome Y and point identification

As just shown, for discrete Y we obtain only interval identification. Athey and Imbens (2006) discuss two possible ways to restore point identification: either using an additional independence assumption or imposing some exclusion restrictions. We consider here only the first approach. In addition to the previous assumptions for the discrete Y case, and the normalization of U to be uniform in the G = 0, t = 0 population, we further assume that

U ⊥⊥ G | T, Y.    (99)

Hence, the distribution of U may still differ between the G = 1 and the G = 0 populations. For those observations with the same value of Y, for whom we know only the interval in which U lies, it is assumed that U is distributed in the same way in the G = 1 as in the G = 0 population. (Note that assumption (99) is automatically satisfied when Y is continuous and φ is assumed to be strictly increasing. In that case we have U = φ^{-1}(Y, T), such that U is degenerate conditional on Y and T and therefore trivially independent of G.)
The intuition for why we obtain point identification can be gained from the following figure. In the example discussed in the previous subsection we showed that F_{U|G=1}(u) was point identified only for some values of u. With the additional assumption (99) we can point identify F_{U|G=1}(u) for all values of u. Since U is uniformly distributed in the G = 0, t = 0 population, it is also uniformly distributed (on the corresponding interval of U values) in the G = 0, t = 0, Y = y population. Assumption (99) now implies that U is also uniformly distributed in the G = 1, t = 0, Y = y population. Hence, the distribution function F_{U|G=1}(u) has to be a diagonal between the bounds on F_{U|G=1}(u) derived above. These bounds are replicated in the left graph below, while the graph on the right shows F_{U|G=1}(u) with the additional assumption (99).

[Insert figure from page -16 here.]

We will discuss here only the proof for Y being binary. (The general case is found in Athey
and Imbens (2006).)


...

10 Sources of identifying variation

IZA DP No. 3410


Mirko Draca, Stephen Machin, Robert Witt:
Panic on the Streets of London: Police, Crime and the July 2005 Terror Attacks
Abstract:
In this paper we study the causal impact of police on crime by looking at what happened to crime before and after the terror attacks that hit central London in July 2005. The attacks resulted in a large redeployment of police officers to central London boroughs as compared to outer London – in fact, police deployment in central London increased by over 30 percent in the six weeks following the July 7 bombings. During this time crime fell significantly in central relative to outer London. Study of the timing of the crime reductions and their magnitude, the types of crime which were more likely to be affected and a series of robustness tests looking at possible biases all make us confident that our research approach identifies a causal impact of police on crime. Implementing an instrumental variable approach shows an elasticity of crime with respect to police of approximately -0.3, so that a 10 percent increase in police activity reduces crime by around 3 percent.

IZA DP No. 3411


Claudio Ferraz, Frederico Finan:
Motivating Politicians: The Impacts of Monetary Incentives on Quality and Performance
Abstract:
Recent studies have emphasized the importance of the quality of politicians for good government and consequently economic performance. But if the quality of leadership matters, then understanding what motivates individuals to become politicians and perform competently in office becomes a central question. In this paper, we examine whether higher wages attract better quality politicians and improve political performance using exogenous variation in the salaries of local legislators across Brazil's municipal governments. The analysis exploits discontinuities in wages across municipalities induced by a constitutional amendment defining caps on the salary of local legislatures according to municipal population. Our main findings show that increases in salaries not only attract more candidates, but more educated ones. Elected officials are in turn more educated and stay in office longer. Higher salaries also increase legislative productivity as measured by the number of bills submitted and approved, and the provision of public goods.
http://ftp.iza.org/dp3411.pdf

11 Linear Panel Data models

Wooldridge Ch. 10, 11

In Chapter 9 we discussed nonparametric diff-in-diff, where unobserved heterogeneity is differenced away. Individual panel data was one special case where this could be used.

Here we discuss parametric panel data models, which are very popular in (e.g. cross-country) applications.

Microeconometric perspective: N large, T fixed (N → ∞, T fixed or growing slowly).

Distinguish between panel data and panel data models.


Two approaches, random effects and fixed effects, are usually discussed together.
This can be misleading, since they are based on two very different perspectives:
Fixed effects is about handling selection on unobservables.
Random effects is about efficiency with non-independent observations.

Variables are observed for each individual i for various time periods (t = 1...T).

Balanced panels: every individual is observed in all time periods.
Unbalanced panels: more cumbersome notation, available in Stata etc.
→ item non-response and attrition can be a serious concern.

We assume that the relationship is identical in every period, i.e.

β_1 = β_2 = β_3 = ... = β_T.

Single equation with observations indexed by i and t:

Y_{i,t} = X_{i,t}β + U_{i,t},

where X_{i,t} may include variables D for which we want to ascertain a causal effect and additional variables X (e.g. for partial effects or for controlling).

Examples:
i is individual, t is time period
i is country, t is time period (N >> T)
i is family, t is sibling
i is family, t is first or second twin
i is school, t is number of child in classroom

We will sometimes split the unobserved disturbance term into an individual time-constant component (unobserved heterogeneity) and an idiosyncratic error term:

Y_{i,t} = X_{i,t}β + C_i + U_{i,t},   t = 1...T.

We will usually assume that U_{i,t} is uncorrelated with X_{i,t}.

A key distinction will be whether C_i is permitted to be correlated with X_{i,t} or not:
If not, pooled OLS would be consistent and efficiency is the only concern.
If yes, OLS would be inconsistent and we are concerned with identification.



[Causal graph omitted: nodes D, Y, X, V and U.]

(In linear parametric models no confounding arc between X and U is permitted.)

Simple solution: just include a dummy variable for each individual i to pick up C_i.
Problem: the number of coefficients grows too fast with the sample size.

Alternatively, we can difference C_i away:

ΔY_{i,t} = ΔX_{i,t}β + ΔU_{i,t},   t = 2...T.

The assumption of a constant impact of unobservables on Y may be strong if T is large
→ perhaps more credible after including more control variables X
→ often we want to control also for lagged Y_{i,t} (dynamic panel data)
→ this can cause additional problems, though.

In addition, we might want to permit individual-specific time trends T_i:

Y_{i,t} = X_{i,t}β + T_i·t + C_i + U_{i,t},   t = 1...T.

These time trends might also be correlated with X_{i,t}.

Now, second differencing gives

ΔΔY_{i,t} = ΔΔX_{i,t}β + ΔΔU_{i,t},   t = 3...T.

By differencing, a lot of variation in X_{i,t} is lost and only measurement error may remain.

After differencing, we can apply OLS to

ΔY_{i,t} = ΔX_{i,t}β + ΔU_{i,t},   t = 2...T,

provided

rank E[ΔX′ΔX] = K

→ thus X is not permitted to contain time-invariant regressors (gender, education?)

and

E[ΔX′ΔU] = 0,

which can also be written as

E[X′_{i,t}U_{i,t}] + E[X′_{i,t−1}U_{i,t−1}] − E[X′_{i,t}U_{i,t−1}] − E[X′_{i,t−1}U_{i,t}] = 0,

which will require strict exogeneity (i.e. no feedback)


and cannot be true if X includes the …rst lag of Y

! in this case, we have to …nd other approaches (system IV below)

But …rst examine estimators under strict exogeneity


Assume strict exogeneity (and thereby rule out lagged Y in X, i.e. no dynamics)

11.1 Strict exogeneity

Assumption: Strict exogeneity

E[Y_{i,t} | X_{i,1}, ..., X_{i,T}, C_i] = E[Y_{i,t} | X_{i,t}, C_i] = X_{i,t}β + C_i

This assumes that Y_{i,t} does not depend on past and future values of X after conditioning on X_{i,t} and C_i.

This assumption is weaker than assuming

E[Y_{i,t} | X_{i,1}, ..., X_{i,T}] = E[Y_{i,t} | X_{i,t}] = X_{i,t}β



because (under the assumption of strict exogeneity)

E[Y_{i,t} | X_{i,1}, ..., X_{i,T}] = X_{i,t}β + E[C_i | X_{i,1}, ..., X_{i,T}] ≠ X_{i,t}β

if C_i and X_{i,1}, ..., X_{i,T} are correlated.

Example: Y_{it} is the agricultural output of farm i during year t;
X_{it} are inputs: capital, labour, fertilizer, rainfall etc.;
C_i is the quality of the land and the managerial ability of the farmer, assumed to have a constant effect.

Assumption: after controlling for the inputs X_{it} and C_i, the inputs in other years should have no effect on the harvest.
However, without controlling for C_i, inputs in other years would help to predict this year's harvest (as they contain information about C_i).

An equivalent representation of strict exogeneity is:

E[U_{i,t} | X_{i,1}, ..., X_{i,T}, C_i] = 0.

Again, this is a statistical assumption and therefore symmetric:
We assume that inputs X in other years have no effect on this year's harvest.
But: we also assume that this year's harvest has no effect on past/future inputs.

If the farmer was credit constrained, a good harvest may permit buying more fertilizer next year. There might be feedback from U to X:

U_{i,t} → X_{i,s} for s < t
U_{i,t} → X_{i,s} for s > t.

The latter may be more likely.

(Strict exogeneity cannot hold if X includes past values of Y, i.e. a dynamic model.)

11.2 Random Effects Methods

These assume that the individual-specific effects are not correlated with X_{i,t}:

E[C_i | X_{i,1}, ..., X_{i,T}] = 0.

This cannot be true if X contains lagged Y, i.e. no dynamic panel data permitted.

Hence, OLS is consistent (for N → ∞) and only efficiency is of concern here.


OLS does not require strict exogeneity (only contemporaneous exogeneity)
! more robust than the following RE methods

Pooled OLS is consistent, but we have serial correlation in the errors


! use robust standard errors

RE methods assume that C is not correlated with X and strict exogeneity


and a particular covariance matrix

Define the composite error

Ũ_{i,t} = C_i + U_{i,t},   t ∈ {1..T},

and

Ũ_i = (Ũ_{i,1}, Ũ_{i,2}, ..., Ũ_{i,T})′

and

Ω = E[Ũ_i Ũ_i′],   a T × T matrix.

We can use unrestricted GLS, where we leave the matrix Ω unrestricted.

GLS is consistent if
Assumption RE.2:

rank E[X′ plim(Ω̂^{-1}) X] = K.

GLS is consistent and √N asymptotically normal.

The covariance matrix of all errors: stacking

Ũ_N = (Ũ_{1,1}, ..., Ũ_{1,T}, Ũ_{2,1}, ..., Ũ_{2,T}, ..., Ũ_{N,1}, ..., Ũ_{N,T})′,

we obtain the block-diagonal covariance matrix

E[Ũ_N Ũ_N′] = diag(Ω, Ω, ..., Ω).

Unrestricted GLS:
1) Estimate pooled OLS.
2) Use the OLS residuals to estimate Ω.
3) Use GLS with Ω̂.

Under homoskedasticity this estimator is asymptotically efficient.
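A minimal sketch of this three-step procedure for a balanced panel (illustrative code, not from the source, with the data stacked individual by individual):

    import numpy as np

    def unrestricted_fgls(Y, X, N, T):
        # 1) pooled OLS
        beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
        # 2) estimate the unrestricted T x T covariance Omega from the OLS residuals
        resid = (Y - X @ beta_ols).reshape(N, T)
        Omega_inv = np.linalg.inv(resid.T @ resid / N)
        # 3) GLS with the estimated Omega
        K = X.shape[1]
        A, b = np.zeros((K, K)), np.zeros(K)
        for i in range(N):
            Xi, Yi = X[i*T:(i+1)*T], Y[i*T:(i+1)*T]
            A += Xi.T @ Omega_inv @ Xi
            b += Xi.T @ Omega_inv @ Yi
        return np.linalg.solve(A, b)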

Impose additional structure on Ω

We might want to impose additional restrictions on Ω → restricted GLS.

Remember:
GLS is consistent irrespective of whether Ω is correctly specified or not.
Unrestricted GLS is efficient (under homoskedasticity).
Why do we want to impose restrictions on Ω?
It may lead to better finite sample properties.

Two examples for E[Ũ_N Ũ_N′]:
The first example is block-diagonal with blocks (shown here for T = 3)

Ω = [ σ_c² + σ_u²   σ_c²          σ_c²        ]
    [ σ_c²          σ_c² + σ_u²   σ_c²        ]
    [ σ_c²          σ_c²          σ_c² + σ_u² ]

on the diagonal and zeros elsewhere; the second example is block-diagonal with every element of each block equal to σ_c² + σ_u².

The RE estimator (random effects estimator) assumes that

Ω = E[Ũ_i Ũ_i′] = [ σ_c² + σ_u²   σ_c²          σ_c²        ]
                  [ σ_c²          σ_c² + σ_u²   σ_c²        ]
                  [ σ_c²          σ_c²          σ_c² + σ_u² ]

I.e. we assume that

E[U_{i,t}²] = σ_u² does not depend on t,
E[U_{i,t}U_{i,s}] = 0 for s ≠ t,

which, via

E[Ũ_{i,t}Ũ_{i,s}] = E[(C_i + U_{i,t})(C_i + U_{i,s})],

gives the above matrix for Ω.


One implicit assumption is that the correlation between Ũ_{i,t} and Ũ_{i,t−1} is the same as between Ũ_{i,t} and Ũ_{i,t−10}, which may not always be plausible.

One example where RE is very useful is when individuals are clustered, e.g. children in one class, or households in the same city.

We have to estimate only 2 elements instead of T(T + 1)/2 as in unrestricted GLS.

The ratio

σ_c² / (σ_c² + σ_u²) ∈ [0, 1]

indicates the importance of the time-constant effect C.

For efficiency of the RE estimator we need homoskedasticity.

Assumption RE.3:
(a) E[U_i U_i′ | X_{i,1}, X_{i,2}, ..., X_{i,T}, C_i] = σ_u² I_T
(b) E[C_i² | X_{i,1}, X_{i,2}, ..., X_{i,T}] = σ_c².

Again, the RE estimator is no more efficient than unrestricted GLS, but it can be more reliable in small samples.

Procedure:
1) estimate pooled OLS (which is consistent without strict exogeneity)
2) estimate σ_c² and σ_u²
3) estimate GLS with the Ω̂^{-1} matrix (consistent only under strict exogeneity)

Test for the presence of an unobserved fixed effect
→ test for serial correlation in Ũ_{i,t}.
If the U_{i,t} are white noise, then serial correlation indicates σ_c² > 0.

Instead of RE, alternative specifications of Ω are also possible, e.g. U_{i,t} ~ AR(1).

11.3 Fixed Effects Methods

Assume strict exogeneity but permit C_i to be correlated with X_{i,t}.

FD (first difference) transformation:

ΔY_{i,t} = Y_{i,t} − Y_{i,t−1}

FE (fixed effect) transformation:

Ÿ_{i,t} = Y_{i,t} − Ȳ_i   where   Ȳ_i = (1/T) Σ_{t=1}^T Y_{i,t}

No time-invariant characteristics are permitted (except if we multiply them with time dummies).

FE estimator: pooled OLS regression

Ÿ_{i,t} = Ẍ_{i,t}β + Ü_{i,t}

Ẍ_{i,t} does not contain a constant (unless X_{i,t} contained a time trend).
We use this equation to estimate β, but β is interpreted in the original model.
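Both transformations are simple to implement. A minimal sketch (illustrative, assuming a balanced panel with Y of shape (N, T) and X of shape (N, T, K)):

    import numpy as np

    def fe_estimator(Y, X):
        # within (FE) transformation: demean by individual, then pooled OLS
        Yd = (Y - Y.mean(axis=1, keepdims=True)).reshape(-1)
        Xd = (X - X.mean(axis=1, keepdims=True)).reshape(Yd.size, -1)
        return np.linalg.lstsq(Xd, Yd, rcond=None)[0]

    def fd_estimator(Y, X):
        # first-difference (FD) transformation, then pooled OLS
        dY = np.diff(Y, axis=1).reshape(-1)
        dX = np.diff(X, axis=1).reshape(dY.size, -1)
        return np.linalg.lstsq(dX, dY, rcond=None)[0]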

For consistency we need

E[Ẍ′_{i,t} U_{i,t}] = 0

and
Assumption FE.2:

rank Σ_{t=1}^T E[Ẍ′_{i,t} Ẍ_{i,t}] = K

FE is consistent and it is unbiased.

The FE estimator is efficient under homoskedasticity and white-noise errors.


Assumption FE.3:

E[(U_{i,1}, ..., U_{i,T})′(U_{i,1}, ..., U_{i,T}) | X_{i,1}, ..., X_{i,T}, C_i] = σ_u² I_T

If FE.3 is not true, a robust covariance matrix estimator should be used, e.g. if the U_{i,t} are suspected to be potentially serially correlated.
Alternatively, FE GLS can be used.

Compare FE with the dummy variable estimator: including one dummy for every i.
The estimated coefficients β̂ are identical to the FE estimates and are consistent.
(This is usually not the case in nonlinear models.)
The estimated Ĉ_i are unbiased but are not consistent.

First Differencing

Use FD instead of FE:
requires also strict exogeneity and a full rank assumption;
is consistent and unbiased.

FD is not efficient under assumption FE.3, but it is efficient if
Assumption FD.3:

E[(ΔU_{i,2}, ..., ΔU_{i,T})′(ΔU_{i,2}, ..., ΔU_{i,T}) | X_{i,1}, ..., X_{i,T}, C_i] = σ_e² I_{T−1}

FE is efficient when the U_{i,t} are serially uncorrelated (example: school class).
FD is efficient when U_{i,t} is a random walk, i.e. very highly correlated.
Otherwise there is not much difference.

Wooldridge suggests estimating both, and if the estimates are very different, this may indicate a failure of the strict exogeneity assumption.

Hausman test comparing the RE and FE estimators:

FE is consistent (under the above assumptions) even if C_i is correlated with X.
RE is efficient if C_i is not correlated with X.

Test the difference between the RE and FE estimates.
But remember: both require strict exogeneity.

H = (β̂_FE − β̂_RE)′ [Avar(β̂_FE) − Avar(β̂_RE)]^{-1} (β̂_FE − β̂_RE) →d χ²_M

where M is the number of time-varying regressors.

We can use this test accordingly to test also only a single element of β̂.
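A sketch of the test statistic (illustrative; note that the difference of the estimated covariance matrices may fail to be positive definite in finite samples, in which case a generalized inverse is often used):

    import numpy as np
    from scipy import stats

    def hausman(beta_fe, beta_re, avar_fe, avar_re):
        d = beta_fe - beta_re
        H = float(d @ np.linalg.solve(avar_fe - avar_re, d))
        M = d.size                         # number of time-varying regressors
        return H, stats.chi2.sf(H, df=M)   # statistic and p-value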

11.4 Predeterminedness and dynamic panel data

We want to relax the assumption of strict exogeneity.
We want to permit lagged values of Y among the regressors.
(FE would be inconsistent, but sometimes we know the direction of the bias and this can be useful; see Bond (2002).)

→ we will use lagged values of X and Y as instruments.

Identification requires limited serial correlation in U_{i,t}; it would not work if U_{i,t} were AR(1).

Linear model:

Y_{i,t} = X_{i,t}β + C_i + U_{i,t}

Strict exogeneity:

E[U_{i,t} | X_{i,1}, ..., X_{i,T}, C_i] = 0.

Predeterminedness (= sequentially exogenous X):

E[U_{i,t} | X_{i,1}, ..., X_{i,t}, C_i] = 0.

Hence, we allow feedback from U_{i,t} to future X.

Endogenous regressors:

E[U_{i,t} | X_{i,1}, ..., X_{i,t−1}, C_i] = 0.

[Timing diagram omitted: Y_t with regressors X_{t−2}, X_{t−1}, X_t, X_{t+1} and errors U_{t−2}, U_{t−1}, U_t, U_{t+1}.]

1) Eliminate C_i.
2) Past values as IV: identification through limited serial correlation in U_{i,t}.

→ The identifying power comes from the assumption that β is time invariant:

β_1 = β_2 = ... = β_T

Predetermined regressors:
No past values of X affect the expected value of Y_{i,t}, but future values might do:

E[Y_{i,t} | X_{i,1}, ..., X_{i,t}, C_i] = E[Y_{i,t} | X_{i,t}, C_i] = X_{i,t}β + C_i

We may sometimes want to permit that U_{i,t} and X_{i,t} are correlated. Then we use

E[U_{i,t} | X_{i,1}, ..., X_{i,t−1}, C_i] = 0.

The FE estimator is inconsistent: the bias is of order T^{-1}.

11.4.1 Arellano-Bond difference-GMM estimator

Arellano and Bond (1991) approach: first difference to eliminate C_i:

ΔY_{i,t} = ΔX_{i,t}β + ΔU_{i,t},   t = 2...T.

Potential instruments for time period t for predetermined X:

X_{i,1}, X_{i,2}, ..., X_{i,t−1}.

For endogenous X, use instruments

X_{i,1}, X_{i,2}, ..., X_{i,t−2}.

If X_{i,t} contains the lag Y_{i,t−1} (dynamic panel data), use instruments

X_{i,1}, X_{i,2}, ..., X_{i,t−1}, which means Y_{i,1}, ..., Y_{i,t−2}.

If U_{i,t} is an MA(1) process, use instruments

X_{i,1}, X_{i,2}, ..., X_{i,t−2}.

Instrumental variable estimation:

Suppose all X_{i,t} are predetermined and U_{it} has no serial correlation.
First idea: use ΔX_{i,t−1} as instrument and estimate the next equation by pooled 2SLS:

ΔY_{i,t} = ΔX_{i,t}β + ΔU_{i,t},   t = 3...T,

which requires

E[ΔX′_{i,t−1} ΔU_{i,t}] = 0
rank E[ΔX′_{i,t−1} ΔX_{i,t}] = K.

This is just-identified 2SLS.

Second idea: we could also use X_{i,t−1} and X_{i,t−2} as instruments.
This gives K overidentifying restrictions → weighting of the instruments is required.

One possibility for weighting is to give weights 1 and −1, such that the weighted combination is

X_{i,t−1} − X_{i,t−2},

which is identical to ΔX_{i,t−1} and thus corresponds to the just-identified 2SLS.

But there might be better weighting schemes than 1 and −1; therefore use

X_{i,t−1} and X_{i,t−2}

in an overidentified GMM estimator and estimate the optimal weighting matrix.
(This also gives us the Hansen J-test or overidentification test.)

Hint: when T = 2 this may be poorly identified, since only X_{i,1} is available, which is often hardly correlated with ΔX_{i,2}, since one is a level and the other a difference.

Third idea: use not only X_{i,t−1} and X_{i,t−2} but also all other lags.
Example: T = 4.
Estimate the system by "difference GMM":

[ ΔY_{i,2} ]   [ ΔX_{i,2}  0         0        ]       [ ΔU_{i,2} ]
[ ΔY_{i,3} ] = [ 0         ΔX_{i,3}  0        ] θ  +  [ ΔU_{i,3} ]
[ ΔY_{i,4} ]   [ 0         0         ΔX_{i,4} ]       [ ΔU_{i,4} ]

where θ = (β′, β′, β′)′ stacks the (restricted equal) coefficients, using the instrument matrix

Z = [ X_{i,1}   0        0        0        0        0       ]
    [ 0         X_{i,1}  X_{i,2}  0        0        0       ]
    [ 0         0        0        X_{i,1}  X_{i,2}  X_{i,3} ]

using the time periods t = 2, 3, 4.

If X contains the first lag of Y_{i,t}, we cannot use time period t = 2 (it would require Y_{i,t=0}).
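A small sketch constructing this block-diagonal instrument matrix for one individual (illustrative, scalar X for clarity):

    import numpy as np

    def diff_gmm_instruments(Xi):
        # Xi: (T,) regressor series of one individual; the row for the differenced
        # equation at t = 2,...,T uses the levels X_{i,1},...,X_{i,t-1}
        T = len(Xi)
        n_cols = T * (T - 1) // 2
        Z, col = np.zeros((T - 1, n_cols)), 0
        for t in range(2, T + 1):
            Z[t - 2, col:col + t - 1] = Xi[:t - 1]
            col += t - 1
        return Z

    print(diff_gmm_instruments(np.array([1., 2., 3., 4.])))  # the 3 x 6 matrix above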

The optimal weighting matrix is, as discussed in the chapter on linear IV systems:

E[Z′ ΔU ΔU′ Z]^{-1}.

After a first-step GMM estimate:

( (1/N) Σ_i Z′_i ΔÛ_i ΔÛ′_i Z_i )^{-1}.

When T is large, this gives many overidentifying moments.
GMM with many overidentifying moments often has poor finite sample properties.

An alternative is to assume homoskedasticity of the U_{i,t}. Also assuming that the U_{i,t} are serially uncorrelated, the ΔU_{i,t} are MA(1) and the covariance matrix of ΔU is

H = σ_u² [  2  −1   0 ]
         [ −1   2  −1 ]
         [  0  −1   2 ],

such that the optimal weighting matrix under homoskedasticity is

E[Z′ H Z]^{-1},

which can be estimated as

( (1/N) Σ_i Z′_i H Z_i )^{-1}.

It does not depend on any parameters and is thus a "first-step GMM estimator".

With T large, it is better to leave the very distant lags away, i.e. use only X_{i,t−1}, X_{i,t−2}, perhaps X_{i,t−3} and perhaps X_{i,t−4}, since the partial correlation of ΔX_{i,t−4} with ΔX_{i,t} is often very small after having controlled for ΔX_{i,t−1}, ΔX_{i,t−2} and ΔX_{i,t−3}.

Robustness analysis: all GMM estimators (with a positive definite weighting matrix) are consistent.

Predetermined and strictly exogenous regressors:

Let X consist of Z and W, where Z is strictly exogenous and W is predetermined:

E[U_{i,t} | Z_{i,1}, ..., Z_{i,T}, W_{i,1}, ..., W_{i,t}, C_i] = 0.

I.e. Z are variables not affected by feedback.

ΔY_{i,t} = ΔZ_{i,t}γ + ΔW_{i,t}δ + ΔU_{i,t},   t = 2...T.

Then ΔZ_{i,t} is not correlated with ΔU_{i,t}, so we can use all Z_{i,1}, ..., Z_{i,T} as IV.
To not inflate the number of overidentifying restrictions, use ΔZ_{i,t} "as IV for itself".

So for T = 4 use the instrument matrix

[ ΔZ_{i,2}  W_{i,1}  0         0        0        0         0        0        0       ]
[ 0         0        ΔZ_{i,3}  W_{i,1}  W_{i,2}  0         0        0        0       ]
[ 0         0        0         0        0        ΔZ_{i,4}  W_{i,1}  W_{i,2}  W_{i,3} ]

All these approaches depend on limited serial correlation in U_{i,t}.



11.4.2 Blundell-Bond system-GMM estimator

Problem: what to do if there is little time variation in some X?
Example: is there persistence in earnings after controlling for unobserved C_i?
Autoregressive process:

Y_{i,t} = α + ρY_{i,t−1} + C_i + U_{i,t}

Take differences to eliminate C_i:

ΔY_{i,t} = ρΔY_{i,t−1} + ΔU_{i,t},

where we could use as potential instruments

Y_{i,1}, ..., Y_{i,t−2}.

Now suppose that in reality ρ is one, which means Y_{i,t} is a random walk.
Then ΔY_{i,t−1} is pure white noise and not correlated with any of Y_{i,1}, ..., Y_{i,t−2}

→ the potential instruments have no correlation with the endogenous regressor
→ not identified → hence we cannot test for ρ = 1 using this approach.

If ρ < 1 but still large, the correlation would be very low and the estimates imprecise.

Blundell and Bond (1998, JoE) suggest imposing moment restrictions in the level equation, i.e. use the level and the difference equation and impose moment restrictions:

Y_{i,t} = X_{i,t}β + C_i + U_{i,t}
ΔY_{i,t} = ΔX_{i,t}β + ΔU_{i,t}.

The idea is that we might have some variables W_{i,t} that are not correlated with C_i.
If W_{i,t} is predetermined, then use

E[W_{i,s}(C_i + U_{i,t})] = 0,   s ≤ t,



etc. as additional moments.

We might not find a W_{i,t} uncorrelated with C_i, but we might find a W_{i,t} that has a constant correlation with C_i such that

E[W_{i,t} C_i] = E[W_{i,t−1} C_i],

which implies

E[ΔW_{i,t} C_i] = 0

and thus

E[ΔW_{i,s}(C_i + U_{i,t})] = 0,   s ≤ t.

Such assumptions follow from conditional stationarity in means; see Blundell and Bond (1998).

Bond (2002) is a nice reference!

11.5 External instrument

[Causal graph omitted: nodes X, C, Z, D, Y, V, U, with Z as the external instrument for D.]

12 Non-Linear Panel Data models

References:
- Lechner, Lollivier, Magnac (2003, forthcoming Handbook of Panel Econometrics)
- Arellano, Honoré (2001, Handbook of Econometrics)
- Arellano, Hahn (2005, presented at the ESWC)
- Honoré, Lewbel (2002, Econometrica)
- Altonji, Matzkin (2005, Econometrica)

Linear panel data models permitted us to eliminate time-constant unobserved heterogeneity.
In non-linear models this will be more difficult.
Only in special cases is consistent estimation with fixed effects possible.
→ Several attempts to find almost consistent estimators or estimators with lower-order bias.

12.1 Binary choice models

12.1.1 Fixed effects model under strict exogeneity

Now we do not want to impose any restrictions on the dependence between X_{it} and C_i:

Y_{it} = 1(X_{it}β + C_i + U_{it} > 0),

but we retain the assumption of strict exogeneity:

F_{U_t}(· | C, X_1, ..., X_T) = F_{U_t}(· | C, X_t) = F_{U_t}(·).

We will also often assume serial independence of the U_{it}:

F_{U_1,...,U_T}(·, ..., · | C, X_1, ..., X_T) = Π_{t=1}^T F_{U_t}(·).

Nevertheless, the observations Y_{it} are still dependent over time due to C.

The individual effects are unrestricted → nuisance parameters.
Either estimate them consistently or eliminate them.

1) Dummy variables estimator: consistent estimation of C is only possible for T → ∞.



For fixed T the estimates of C are inconsistent, and this will generally also lead to inconsistent estimates of β: the "incidental parameters problem".

For some models, the inconsistency of C does not lead to an inconsistent β (e.g. in the linear model).

But generally, estimates of β are inconsistent with a bias of order T^{-1}. Hence, for T small, the bias can be very large. For moderate T it can be helpful to construct alternative estimators with a bias of smaller order (e.g. T^{-2}).

2) Differencing: alternatively, we may eliminate C by differencing. This is less trivial than in the linear model and not possible for all models. But when it is possible, we can estimate β consistently.

Notice that consistent estimation of β is not sufficient to identify treatment effects! Suppose U_{it} is standard normally distributed; hence

P(Y_{it} = 1 | X_{it}, C_i) = E(Y_{it} | X_{it}, C_i) = 1 − Φ(−X_{it}β − C_i) = Φ(X_{it}β + C_i).

Consider the first derivative with respect to one regressor X_{k,t}:

∂Φ(X_tβ + C) / ∂X_{k,t} = β_k φ(X_tβ + C),

which is a function of C. E.g. if C is very large or very small, the derivative is close to zero, whereas it will be large when C ≈ −X_tβ.

Hence, for estimating the ATE we would need to know the distribution of C. (We could try to estimate C inconsistently, but for individuals with Y_{it} time constant, C_i is not point identified anyhow.)

Otherwise, we can estimate β but not the ATE! (Do we want coefficients or the ATE?)
The coefficients are informative about the relative effects, e.g. of regressor X_{1,t} relative to X_{2,t}, but not about the absolute effects.
(This is different from the linear model.)

Another difference to the linear model is that some observations may not be informative at all for the estimation of β. Consider individuals for whom Y_{i1} = ... = Y_{iT} = 1, i.e. their observed value of Y never changes. These individuals provide no information about β, since C could be (almost) infinitely large such that any value of β could have generated this constant outcome. Only individuals for whom Y_{it} changes at least once can help to identify β, since their value of C must be bounded.

→ Hence, all individuals with no variation in Y_{it} over time are eliminated as they contain no information. (If T is 2, there might sometimes not be many observations left!)

(Another issue is that time-invariant regressors cannot be included in X_{it}. Hence, we need not only variation over time in X_{it} but also in Y_{it}.)

Conditional likelihood estimation

Conditional ML is a method to construct consistent estimates in the presence of fixed effects C, based on the idea of finding a sufficient statistic for the nuisance parameters.

Consider a random variable Y_{it} with distribution

f_{Y_{it}}(·; β, C_i).

A sufficient statistic S_i for C_i is a function of the data such that the distribution of the data given S_i does not depend on C_i:

f_{Y_{it}|S_i}(·; β, C_i) = f_{Y_{it}|S_i}(·; β).

If the distribution conditional on S_i still depends on β, one can estimate β by ML conditional on S_i (because the nuisance parameter C_i no longer matters).

Conditional ML is not efficient.
A more serious problem is that it is often not possible to find a sufficient statistic such that f_{Y_{it}|S_i} depends on β but not on C_i.

Consider the conditional logit model.

Assume that U_{it} is logistically distributed and serially independent, which gives

Pr(Y_{it} = 1 | X_{i1}, ..., X_{iT}, C_i) = Pr(Y_{it} = 1 | X_{it}, C_i) = Λ(X_{it}β + C_i),

where

Λ(u) = 1 / (1 + e^{−u})   with   Λ(u) / (1 − Λ(u)) = e^u.

Define the statistic Ȳ_i = Σ_{t=1}^T Y_{it}. This sum is a sufficient statistic for the individual C_i.
As discussed above, we eliminate all observations for whom Y_{it} never changes (i.e. Ȳ_i = 0 or Ȳ_i = T), as these observations are not informative about β.

Consider the joint density of Y_{i1}, ..., Y_{iT} conditional on the statistic Ȳ_i.
Consider the situation with T = 3:

Pr(Y_1 = 0, Y_2 = 0, Y_3 = 1 | X_1, X_2, X_3, C, Ȳ = 1)
  = Pr(Y_1 = 0, Y_2 = 0, Y_3 = 1 | X_1, X_2, X_3, C) / Pr(Ȳ = 1 | X_1, X_2, X_3, C)
  = (1 − Λ(X_1β + C))(1 − Λ(X_2β + C)) Λ(X_3β + C) / Pr(Ȳ = 1 | X_1, X_2, X_3, C)
  = 1 / (1 + e^{(X_1 − X_3)β} + e^{(X_2 − X_3)β}) = e^{X_3β} / (e^{X_1β} + e^{X_2β} + e^{X_3β}),

which does not depend on C but does depend on β.
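A sketch of the resulting conditional log-likelihood for general T (illustrative; it sums over all sequences with the same Ȳ_i and is therefore only practical for small T):

    import numpy as np
    from itertools import combinations

    def cond_logit_loglik(beta, Y, X):
        # Y: (N, T) binary outcomes, X: (N, T, K) regressors
        ll = 0.0
        N, T = Y.shape
        for i in range(N):
            s = int(Y[i].sum())
            if s in (0, T):
                continue                 # uninformative: Y never changes
            idx = X[i] @ beta            # X_{it} beta for t = 1,...,T
            num = idx @ Y[i]
            # denominator: all binary sequences with the same sum s
            denom = [idx[list(c)].sum() for c in combinations(range(T), s)]
            ll += num - np.log(np.sum(np.exp(denom)))
        return ll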


Had we conditioned on Y = 2, then the joint probability would depend on C.

For T = 2 we obtain a likelihood function that is identical to a simple logit estimator using
Xit as regressors (but without intercept).

The conditional logit ML estimator is consistent for N ! 1, regardless of T

The conditional ML approach, however, is possible only for some specific distributions, e.g. also for the Poisson regression model. For the binary choice model with serially independent errors, it has been shown that it can be used only when the errors are logistic.

Hence, this approach is highly sensitive to a correct distributional assumption: e.g. for normal errors, consistent conditional ML is not possible.

Fixed Effect Maximum Score

Manski (1987) suggested an adaptation of the M(aximum)-score estimator to nonlinear panel data which achieves an elimination of the individual-specific effect through some type of differencing.

Consider T = 2.
Suppose that the errors U_1 and U_2 are identically distributed.
Consider only individuals that changed Y_{it} over time (Ȳ = 1).
Manski showed that, whatever the value of the unobserved C,

Pr(Y_2 > Y_1 | X_1, X_2, Ȳ = 1) > 0.5   if   (X_2 − X_1)β > 0.

Hence, the signs of (Y_{i2} − Y_{i1}) and (X_{i2} − X_{i1})β should be positively correlated. This suggests the conditional M-score estimator

β̂ = arg max Σ_{i=1}^N sgn(Y_{i2} − Y_{i1}) · sgn((X_{i2} − X_{i1})β).

If T > 2 one could consider any pair of time periods, leading to

β̂ = arg max Σ_{i=1}^N Σ_{s<t} sgn(Y_{is} − Y_{it}) · sgn((X_{is} − X_{it})β).

This estimator has been shown to be consistent under mild regularity conditions. It is much more general than the logit estimator, in that we did not specify any distribution and could permit serial correlation as well as heteroskedasticity (across individuals but not across time). The estimator does not converge at the √n rate and is not asymptotically normal.

Kyriazidou (1995) and Charlier, Melenberg and van Soest (1995) developed a smoothed panel M-score estimator which avoids the discontinuity in the objective function. This smoothed estimator is asymptotically normal but also does not attain the √n rate. In addition, it requires the choice of smoothing parameters.

The panel M-score estimator seems to be rarely applied (similarly to the cross-section M-score estimator). In the unsmoothed version, the solution to the optimization problem is not unique and optimization is complicated because of the step function. The smoothed version requires the choice of smoothing parameters.

Hence, on the one hand, the panel M-score estimator is not √n-consistent and is difficult to handle.
On the other hand, conditional logit is √n-consistent but relies on a strong distributional assumption.
Are there alternative √n-consistent estimators? Not many.

Magnac (2004), extending the results of an unpublished paper of Chamberlain (1992), showed for T = 2 that √n-consistent estimation is only possible if Ȳ is a sufficient statistic. If, in addition, the shocks U_{it} are independent, then √n-consistent estimation is only possible if the U_{it} are logistic. (If the U_{it} are dependent, the sufficiency result may also hold for other distributions, and Magnac (2004) suggests a semiparametric estimator.)

Bias reduction for fixed effects estimators with moderate or large T

Hence, √n-consistent estimation is not so straightforward with fixed effects. The dummy variables estimator is asymptotically biased, as discussed above regarding the incidental parameters problem. Yet, if T is moderately large, we may nevertheless consider the dummy variables estimator, as its bias is of order T^{-1} and may be relatively small for moderate or large values of T. Here, the various propositions to reduce the bias of the estimator, usually to order T^{-2}, can be very helpful. Arellano and Hahn (2005, presented at the ESWC) review recent developments in this field. (They consider nonlinear models in general and not only binary models.)

Consider the density function of Y_{i1}, ..., Y_{iT} conditional on the strictly exogenous explanatory variables X_{i1}, ..., X_{iT}:

Π_{t=1}^T f(Y_{it} | θ_0, X_{it}, C_i),

where we have assumed that the U_{it} are independent over time. The log-likelihood function for independent individuals and parameter vector θ is

Σ_{i=1}^N Σ_{t=1}^T ln f(Y_{it} | θ, X_{it}, C_i),

which we could maximize with respect to θ and all the fixed effects C_i. Concentrating out the C_i gives

θ̂_T = arg max_θ Σ_{i=1}^N Σ_{t=1}^T ln f(Y_{it} | θ, X_{it}, Ĉ_i(θ)),   where   Ĉ_i(θ) = arg max_c Σ_{t=1}^T ln f(Y_{it} | θ, X_{it}, c).

The source of the incidental parameters problem is the estimation error in Ĉ_i(θ), which usually vanishes at order T^{-1}. Define

θ_T = arg max_θ lim_{n→∞} Σ_{i=1}^N E[Σ_{t=1}^T ln f(Y_{it} | θ, X_{it}, Ĉ_i(θ))],

and for T fixed and n → ∞,

θ̂_T = θ_T + o_p(1).

For smooth likelihood functions one can show that

θ_T = θ_0 + B/T + D/T² + O(T^{-3}).

Hence, not only is the dummy-variables fixed effects estimator biased for T < ∞; it is also usually asymptotically biased when T grows as fast as N. Under general conditions, for N and T → ∞,

√(NT) (θ̂_T − θ_T) →d N(0, Σ)

for some Σ. Suppose that N and T grow at the same rate in that N/T → a; then

√(NT) (θ̂_T − θ_0) = √(NT) (θ̂_T − θ_T) + √(N/T) B + O(√(N/T³)) →d N(√a · B, Σ).

Hence, the asymptotic bias only vanishes if N/T → 0. This asymptotic bias affects test statistics as well as confidence intervals.

Arellano and Hahn (2005) discuss various ways to obtain an analytic expression for the bias and discuss methods that automatically correct for the first-order bias. These expressions could be used to estimate the bias.

Panel Jackknife

Some (computer-intensive) methods automatically perform such a bias correction, such as the panel jackknife. Let θ̂_T be the (dummy variables) fixed effects estimator and θ̂_{−t} the fixed effects estimator when excluding all observations of period t. The jackknife estimator is defined as

θ̃ = T θ̂_T − ((T − 1)/T) Σ_{t=1}^T θ̂_{−t}.
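A generic sketch (illustrative; `estimator` stands for any fixed effects estimator mapping (N, T) panels to a parameter vector):

    import numpy as np

    def panel_jackknife(estimator, Y, X):
        T = Y.shape[1]
        theta_hat = estimator(Y, X)
        loo = [estimator(np.delete(Y, t, axis=1), np.delete(X, t, axis=1))
               for t in range(T)]        # leave-one-period-out estimates
        return T * theta_hat - (T - 1) / T * np.sum(loo, axis=0)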
To derive the bias of the jackknife estimator, remember the previous bias expansion:

θ_T = θ_0 + B/T + D/T² + O(T^{-3}).

The jackknife estimator converges for N → ∞ to

plim_{N→∞} θ̃ = T θ_T − (T − 1) θ_{T−1} = ... = θ_0 − D/(T(T − 1)) = θ_0 + O(T^{-2}),

hence its bias is of order T^{-2}, and it will be asymptotically unbiased when T increases at the same rate as N. Hahn and Newey (2004, Econometrica) show that √(NT)(θ̃ − θ_0) has the same asymptotic variance as √(NT)(θ̂_T − θ_0) for N/T → a. Hence, the jackknife estimator has a bias of lower order without a loss in (first-order) efficiency! (This may be different for N → ∞ with T fixed, though.)

Similar methods based on bootstrapping could be developed. Bootstrapping methods might have the advantage that they could perhaps also accommodate some correlation over time in the U_{it}.

12.1.2 Predetermined regressors (and dynamic models) with fixed effects

Now, relax the assumption of strict exogeneity.

Consider the dynamic model

Y_{it} = 1(X_{it}β + γY_{it−1} + C_i + U_{it} > 0),

where X_{it} is strictly exogenous. Hence, Y_{it−1} is the only not strictly exogenous regressor.

Persistence in Y_{it} could be due to:
- serial correlation in U_{it}
- heterogeneity (i.e. C_i)
- true state dependence (γ ≠ 0)

We would like to distinguish between these three sources.


Example: Chiappori (1998) and Chiappori and Salanié (2000):
The French automobile insurance market is such that the incentives for not having an accident are stronger if the driver has had fewer accidents in the past.
Hence, γ should be non-zero if individual behaviour depends on these incentives, whereas persistence is also due to heterogeneity in driving style C_i.

How can we distinguish between state dependence and heterogeneity?

Idea: consider the model without X_{it} and assume that the U_{it} are iid.
If we had heterogeneity but no state dependence, the sequences (0, 1, 0, 1) and (0, 0, 1, 1) would have the same probability.
But if γ < 0, then (0, 1, 0, 1) would be more likely than (0, 0, 1, 1).
But if γ > 0, then (0, 1, 0, 1) would be less likely than (0, 0, 1, 1).

Hence, whereas C clearly affects Ȳ_i, it should have no effect on the likelihood of changes in Y_i. If γ < 0, changes in Y_{it} are very frequent, whereas for γ > 0, changes should be rare.

Hence, could Ȳ_i be a sufficient statistic? No.
But (Y_{i1}, Y_{iT}, Ȳ_i) together is a sufficient statistic when the U_{it} are iid and logistically distributed and T ≥ 4:

Pr(Y_{i1} = y_1, ..., Y_{iT} = y_T | Y_{i1}, Ȳ_i, Y_{iT}, C) = exp(γ Σ_{t=2}^T y_t y_{t−1}) / Σ_{(d_1...d_T) ∈ B} exp(γ Σ_{t=2}^T d_t d_{t−1}),

where B contains all possible binary sequences with d_1 = y_1, d_T = y_T and d_1 + ... + d_T = Ȳ_i.

This expression does not depend on C anymore and we can use conditional ML.

However, (Y_{i1}, Y_{iT}, Ȳ_i) is no longer sufficient when the model also contains X_{it}.

However, if we consider only a subset of observations for whom X_{it} is partly time-constant, then (Y_{i1}, Y_{iT}, Ȳ_i) is again a sufficient statistic.
For example, if T = 4, only observations for whom X_{i3} = X_{i4} are used.
Honoré and Kyriazidou (2000) have shown that

Pr(Y_{i1} = y_1, ..., Y_{i4} = y_4 | Y_{i1}, Ȳ_i, Y_{i4}, X_{i1}, ..., X_{i4}, X_{i3} = X_{i4}, C)

does not depend on C. But this uses only observations with X_{i3} = X_{i4}.


If X_{it} contains time dummies, there would be no such observations.
Also, if X_{it} contains continuous variables, we should use observations with X_{i3} close to X_{i4}, e.g. use a kernel function for restricting attention to similar observations.

In addition, this requires strictly exogenous X.

Nevertheless, all these estimators depend highly on the very specific logistic function.

Honoré and Kyriazidou (2000) also extend the M-score approach to this situation.

Honoré and Lewbel (2002) consider identification with fixed effects and predetermined regressors if one of the regressors is known to be independent of C_i and U_{it}. In this case they can derive some function that is linear in X_{it}β and has an additive component that is identified.

12.1.3 Random effects model under strict exogeneity

First: ignore the panel structure.

One could use a linear probability model, but it is not very appealing.
Model:

Y_{it} = 1(Y*_{it} > 0) = 1(X_{it}β + U_{it} > 0)

Contemporaneous exogeneity:

U_{it} ⊥⊥ X_{it},

or equivalently

F_{U_t}(· | X_t) = F_{U_t}(·).

This implies

P(Y_{it} = 1 | X_{it}) = 1 − F_{U_t}(−X_{it}β).

We could maximize the partial ln-likelihood

Σ_{i=1}^N Σ_{t=1}^T {Y_{it} ln(1 − F_{U_t}(−X_{it}β)) + (1 − Y_{it}) ln(F_{U_t}(−X_{it}β))}.
i=1 t=1

It is a partial likelihood because we have not specified the joint distribution of (Y_{i1}, ..., Y_{iT}) given (X_{i1}, ..., X_{iT}).

This is the pooled panel probit/logit estimator, and thus more robust than the random effects estimators we examine below.

A robust variance matrix is required to account for serial correlation in the scores.

As in the linear models:
- we can attempt to obtain more precise estimates by exploiting the correlation structure,
- which now requires assumptions on the joint distribution of (Y_{i1}, ..., Y_{iT}).

Strict exogeneity:

U_{it} ⊥⊥ (X_{i1}, ..., X_{iT}),

or equivalently

F_{U_t}(· | X_1, ..., X_T) = F_{U_t}(· | X_t) = F_{U_t}(·).

For errors correlated over time we sometimes impose the stronger assumption

(U_{i1}, ..., U_{iT}) ⊥⊥ (X_{i1}, ..., X_{iT}).

This implies

P(Y_{it} = 1 | X_{i1}, ..., X_{iT}) = 1 − F_{U_t}(−X_{it}β).

Likelihood function for independent observations:

L(θ) = Π_{i=1}^N L_i(Y_{i1}, ..., Y_{iT} | X_{i1}, ..., X_{iT}; θ),

where θ includes all coefficients and unknown parameters of the distribution function of the errors.

Errors independent over time

When the errors are independent over time, then Y_{i1}, ..., Y_{iT} | X_{i1}, ..., X_{iT} are also independent over time and the likelihood for one observation is

L_i(Y_{i1}, ..., Y_{iT} | X_{i1}, ..., X_{iT}; θ) = Π_{t=1}^T (1 − F_{U_t}(−X_{it}β))^{Y_{it}} (F_{U_t}(−X_{it}β))^{1−Y_{it}}.

This gives for ln L

ln L(θ) = Σ_{i=1}^N Σ_{t=1}^T {Y_{it} ln(1 − F_{U_t}(−X_{it}β)) + (1 − Y_{it}) ln(F_{U_t}(−X_{it}β))},

and for normal errors

ln L(β, σ_1, ..., σ_T) = Σ_{i=1}^N Σ_{t=1}^T {Y_{it} ln Φ(X_{it}β/σ_t) + (1 − Y_{it}) ln(1 − Φ(X_{it}β/σ_t))},

where one of the standard deviations has to be normalized, e.g. σ_1 = 1. (If all the regressors X_{it} are interacted with time dummies, more standard deviations have to be normalized.)

For numerical stability (and efficiency) we may often impose that σ_1 = ... = σ_T. This is the pooled panel probit estimator. (This also gives us more freedom to model time interactions in X_{it}.)

Error structure with time-constant and time-varying effect

Instead of assuming that the errors U_{it} are independent over time, a structure such as

U_{it} = C_i + Ũ_{it}

may be more natural, where Ũ_{it} is serially uncorrelated and uncorrelated with C_i. This gives the joint distribution

F_{C,Ũ_1,...,Ũ_T | X_1,...,X_T}(·, ..., ·) = F_C(·) Π_{t=1}^T F_{Ũ_t}(·).

Remember that this structure implies that the correlations between U_{it} and U_{i,t−τ} do not depend on τ. This may be plausible in some situations, but not in many others.

For identification, one restriction on the variances of C or Ũ_t is required.

This factor structure also permits a relatively simple derivation of the likelihood contribution. First, consider the likelihood given C (i.e. treating it like an observed regressor):

L_i(Y_{i1}, ..., Y_{iT} | C_i, X_{i1}, ..., X_{iT}; θ) = Π_{t=1}^T (1 − F_{Ũ_t}(−X_{it}β − C_i))^{Y_{it}} (F_{Ũ_t}(−X_{it}β − C_i))^{1−Y_{it}}.

Now, we do not observe C_i, hence we cannot calculate this likelihood contribution. But C_i is independent of X_{it}, hence we can integrate it out:

L_i(Y_{i1}, ..., Y_{iT} | X_{i1}, ..., X_{iT}; θ) = ∫ L_i(Y_{i1}, ..., Y_{iT} | C, X_{i1}, ..., X_{iT}; θ) dF_C
  = ∫ Π_{t=1}^T (1 − F_{Ũ_t}(−X_{it}β − C))^{Y_{it}} (F_{Ũ_t}(−X_{it}β − C))^{1−Y_{it}} dF_C.

The distributions f_C and f_Ũ are often specified to ease numerical integration. Particularly convenient is to choose C as discrete with a small number of support points.

The random effects probit estimator assumes that all errors are normal; the integration can be approximated by Gaussian quadrature (Hermite integration formula).

Notice that the pooled panel probit is the pseudo-ML estimator in which it has incorrectly been assumed that the errors are independent over time. Although inefficient, it is still consistent. (Pooled panel probit did not rely on strict exogeneity.)

General error structure

The common effect structure may often not be so plausible due to its non-decaying covariance pattern. Suppose U_{i1}, ..., U_{iT} is jointly normal N(0, Σ), but we do not want to impose any restrictions on Σ. Then the likelihood contribution is

L_i(Y_{i1}, ..., Y_{iT} | X_{i1}, ..., X_{iT}; θ) = Pr(Y_1 = Y_{i1}, ..., Y_T = Y_{iT} | X_{i1}, ..., X_{iT}; θ)
  = Pr(1(X_{i1}β + U_{i1} > 0) = Y_{i1}, ..., 1(X_{iT}β + U_{iT} > 0) = Y_{iT} | X_{i1}, ..., X_{iT}; θ).

Consider for example the likelihood contribution of an observation with Y_{i1} = ... = Y_{iT} = 0:

Pr(U_{i1} < −X_{i1}β, ..., U_{iT} < −X_{iT}β | X_{i1}, ..., X_{iT}; θ) = Φ_T(−X_{i1}β, ..., −X_{iT}β; Σ),

where Φ_T is the cdf of the T-variate normal distribution. Since there is no closed-form expression for Φ_T, it has to be computed as the integral of the density φ_T:

Φ_T = ∫_{−∞}^{−X_{i1}β} ... ∫_{−∞}^{−X_{iT}β} φ_T(ϑ_1, ..., ϑ_T; Σ) dϑ_1 ... dϑ_T.
If T is 2 or 3, computing (i.e. approximating) this integral is possible but cumbersome. However, Gaussian quadrature approximation methods no longer work well for T larger than 4. (Other algorithms for numerical integration are available but can be very slow.)

Hence, for moderate T we have to find an alternative approach. Either use an estimator robust to misspecification of the serial correlation structure (e.g. pseudo-likelihood estimators or GMM), which will be inefficient; or use simulated maximum likelihood (MSL), where the integral is simulated by repeated drawings of random numbers. (This is also used for multinomial probit models and is discussed in a later chapter.)

Σ can be left unrestricted (except for restrictions needed for identification). However, very often more restrictions are imposed on Σ to improve the stability of convergence and reduce the incidence of local extrema. E.g. one could impose a correlation structure with a time-invariant component and an AR(1) autoregressive time-varying component.

MSL would be the efficient method, but consistency requires that the number of replications in the simulator increases to infinity. Hence, in finite samples the number of replications should not be too small.
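The simplest (though non-smooth) simulator is the crude frequency simulator, sketched below for the all-zeros sequence; in practice the smooth GHK simulator is used instead (illustrative code, not from the source):

    import numpy as np

    def simulate_orthant_prob(Xbeta, Sigma, R=10000, seed=0):
        # Monte Carlo estimate of Pr(U_1 < -X_1 b, ..., U_T < -X_T b), U ~ N(0, Sigma)
        rng = np.random.default_rng(seed)
        U = rng.multivariate_normal(np.zeros(len(Xbeta)), Sigma, size=R)
        return np.mean(np.all(U < -np.asarray(Xbeta), axis=1))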

Alternatively, we could use the method of simulated moments (MSM), which is consistent even for a constant number of replications in the simulator if the moment conditions are linear. We could derive simulated moment conditions for every observed sequence of Y_{i1}, ..., Y_{iT}. Let Y_ω define a particular sequence. The moment condition for this sequence is:

E[1(Y_i = Y_ω) − Pr(Y_ω | X_{i1}, ..., X_{iT}; θ_0) | X_{i1}, ..., X_{iT}] = 0.

The sum over all sequences is

E[Σ_ω A_ω(X_{i1}, ..., X_{iT}) {1(Y_i = Y_ω) − Pr(Y_ω | X_{i1}, ..., X_{iT}; θ_0)}] = 0,

which is linear in the multivariate probability that is to be simulated.


The optimal choice of A_ω would be

A_ω(X_{i1}, ..., X_{iT}) = ∂ ln Pr(Y_ω | X_{i1}, ..., X_{iT}; θ)/∂θ |_{θ=θ_0},

which however would re-introduce the problem of moment conditions nonlinear in the simulated object.[140]

If T is moderately large, some of the probabilities Pr(Y_ω | X_{i1}, ..., X_{iT}; θ_0) can be very small and the simulation of these probabilities would be imprecise. Conditional probabilities could be used instead, which however also leads to moment conditions nonlinear in the simulated object.

A third alternative is the method of simulated scores (MSS, Hajivassiliou and McFadden 1998), which directly simulates the score in

E[∂ ln L/∂θ |_{θ=θ_0}] = E[S_i(θ_0)] = 0,

i.e. instead of simulating ln L, the derivative ∂ ln L/∂θ is simulated. The score for the probit and other models is given in Hajivassiliou and McFadden (1998), and an unbiased simulator for the score could be constructed. However, this simulator is not smooth and the simulation of the score can thus be relatively insensitive to the value of θ.
[Footnote 140] These are the optimal instruments because they lead to the condition ∂E ln L/∂θ |_{θ=θ_0} = 0, i.e. the first-order condition of the efficient ML estimator. The likelihood function is ln L(θ) = Σ_{i=1}^N Σ_ω d_{iω} ln P_{iω}, where d_{iω} = 1(Y_i = Y_ω) and P_{iω} = Pr(Y_ω | X_{i1}, ..., X_{iT}; θ). Now consider the above moment condition E[Σ_ω A_{iω}(d_{iω} − P_{iω})] = 0. Plugging in A_{iω} = ∂ ln P_{iω}/∂θ gives E[Σ_ω d_{iω} ∂ ln P_{iω}/∂θ] − E[Σ_ω ∂P_{iω}/∂θ] = 0. Notice that the second term can be written as E[∂(Σ_ω P_{iω})/∂θ], which is zero since Σ_ω P_{iω} = 1. The first term, on the other hand, corresponds to ∂E ln L/∂θ if differentiation and integration can be interchanged.

MSL has one clear practical advantage over MSM and MSS: we can distinguish local (from global) optima by the criterion function.

Limited information methods

Instead of full-information maximum likelihood, more robust (but less efficient) estimators can be obtained from the marginal period-by-period distribution functions, by noting that

E[(Y_1 − Φ(X_1β), ..., Y_T − Φ(X_Tβ))′ | X_1, ..., X_T] |_{β=β_0} = 0.

These moment conditions can be computed fast and do not depend on the covariances of the error terms.

At least these moments yield a useful first estimate that can also be used as starting values for more complex ML estimation based on the joint distribution. For obtaining the optimal instrument matrix for GMM estimation, only the bivariate joint normal distribution is required (such that no simulators are needed). Nevertheless, nonparametric regression is required for estimating the optimal instruments, as they depend in a (perhaps nonlinear) way on the regressors X_{it} and X_{is}.

12.1.4 Correlated effects model under strict exogeneity

In the correlated effects model, the individual effects C_i are permitted to be correlated with X_{it} in a specified way. Chamberlain (1984) proposes

C_i = (X_{i1}, ..., X_{iT})λ + C̃_i,

where the remaining individual-specific effect C̃_i is assumed to be independent of all X_{it}:

C̃_i ⊥⊥ (X_{i1}, ..., X_{iT}).

This gives the model

Y_{it} = 1(X_{it}β + (X_{i1}, ..., X_{iT})λ + C̃_i + U_{it} > 0).

The parameters can be recovered by noting that X_{it} has a different impact on Y in period t than in all other periods. Regressing in a first step Y_{it} on X_{i1}, ..., X_{iT} separately for each period gives the coefficients π_t. These reduced-form coefficients are related to β and λ by

π_t = λ + e_t β,

where e_t = 1_t ⊗ I_K and 1_t is a column vector of zeros of length T with a one in the t-th row.

Alternatively, constrained ML could be used with the above restrictions incorporated.

One can even permit the individual-specific effect to be related to $X_{it}$ through an unknown function $g(\cdot)$,
$$C_i = g(X_{i1},\dots,X_{iT}) + \tilde C_i,$$
and difference $\tilde C_i$ away by using the inverse function. Here, $C$ and $X_{it}$ can be related in a fairly general way, but the variance and other moments of $C$ are not permitted to depend on $X_{it}$. Hence, $C_i$ will shift the distribution but does not change its shape. (E.g. if $X_{it}$ contains a time trend, the variance of the distribution of the error terms is not permitted to change over time.)
Suppose $\tilde C_i$ and $U_{it}$ are independent normal with the sum of their variances normalized to one.^{141} Hence:
$$\Pr(Y_{it} = 1\mid X_{i1},\dots,X_{iT}) = \Phi\!\left(X_{it}\beta + g(X_{i1},\dots,X_{iT})\right),$$
or equivalently
$$\Phi^{-1}\!\left(E[Y_{it}\mid X_{i1},\dots,X_{iT}]\right) = X_{it}\beta + g(X_{i1},\dots,X_{iT}).$$
Now we can eliminate the nuisance function $g(X_{i1},\dots,X_{iT})$ by any differencing operator, e.g.
$$\Phi^{-1}\!\left(E[Y_{it}\mid X_{i1},\dots,X_{iT}]\right) - \Phi^{-1}\!\left(E[Y_{it-1}\mid X_{i1},\dots,X_{iT}]\right) = (X_{it} - X_{it-1})\,\beta.$$
We can obtain $\beta$ by estimating the conditional expectation functions nonparametrically.
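The differencing idea is straightforward to implement with nonparametric plug-ins. The following is a minimal simulation sketch; T = 2, scalar regressors, the particular choice of g(·), the Gaussian-kernel smoother, its bandwidth, and the trimming rule are all illustrative assumptions, not part of the notes.

```python
# Minimal sketch: Phi^{-1}-differencing estimator for the correlated-effects probit.
# T = 2, scalar X_t, g(X) = X_1 + X_2, bandwidth and trimming are ad-hoc choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, beta = 2000, 1.0
X1, X2 = rng.normal(size=n), rng.normal(size=n)
g = X1 + X2                                    # unknown function of (X_1,...,X_T)
C = rng.normal(scale=0.6, size=n)              # individual effect C~_i
Y1 = (X1 * beta + g + C + rng.normal(scale=0.8, size=n) > 0).astype(float)
Y2 = (X2 * beta + g + C + rng.normal(scale=0.8, size=n) > 0).astype(float)
# Var(C~) + Var(U) = 0.36 + 0.64 = 1, the normalization required for identification.

def nw(y, X, h=0.5):
    """Nadaraya-Watson estimate of E[y|X] at all sample points."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * h ** 2))
    return K @ y / K.sum(axis=1)

XX = np.column_stack([X1, X2])
m1, m2 = nw(Y1, XX), nw(Y2, XX)
ok = (m1 > 0.01) & (m1 < 0.99) & (m2 > 0.01) & (m2 < 0.99)  # keep Phi^{-1} stable
dPhi = norm.ppf(m2[ok]) - norm.ppf(m1[ok])     # differencing eliminates g(X_1, X_2)
dX = X2[ok] - X1[ok]
print("beta-hat:", (dX @ dPhi) / (dX @ dX))    # OLS on the differenced equation
```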

Finally, one could even permit a nonparametric version where $\Phi$ is replaced by an unknown distribution function $F$, under some identification restrictions (Chen 1998).
Footnote 141: Normalization is necessary for identification.

12.1.5 Predetermined regressors (and dynamic models) with random effects

When estimating the dynamic model
$$Y_{it} = 1\left(X_{it}\beta + \gamma Y_{it-1} + C_i + U_{it} > 0\right),$$
there is an initial conditions problem. We can characterize the likelihood contribution of one individual, assuming that the $U_{it}$ are serially uncorrelated, by
$$\Pr(Y_{i1},\dots,Y_{iT}\mid X_1,\dots,X_T;\theta) = \int \Pr(Y_1,\dots,Y_T\mid C,Y_0,X_1,\dots,X_T;\theta)\,dF_{Y_0C\mid X_1,\dots,X_T}$$
$$= \int \prod_{t=2}^T\Pr(Y_t\mid Y_1,\dots,Y_{t-1},X_1,\dots,X_T,C,Y_0)\cdot\Pr(Y_1\mid Y_0,X_1,\dots,X_T,C)\,dF_{Y_0C\mid X_1,\dots,X_T},$$
which simplifies for the above model to
$$= \int \prod_{t=2}^T\Pr(Y_t\mid Y_{t-1},X_t,C)\cdot\Pr(Y_1\mid Y_0,X_1,C)\,dF_{Y_0C\mid X_1,\dots,X_T}.$$
Since $Y_{i0}$ is not observed, additional information is needed for specifying the two last terms in the previous expression; e.g. a specification such as
$$Y_{i0}^* = \delta C_i + \varepsilon_{i0}$$
is imposed, where $Y_{i0}^*$ is the latent index in $Y_{it} = 1(Y_{it}^* > 0)$.

An alternative approach, suggested by Arellano and Carrasco (2003), permits correlated effects and is based on the inversion of the cdf, similarly as discussed for the static model. Denote $W_{it} = (X_{it}, Y_{it-1})$ and suppose that
$$(U_{it} + C \mid W_1,\dots,W_t) \sim N\!\left(E[C\mid W_1,\dots,W_t],\ \sigma_t^2\right).$$
Hence, $U_{it}$ does not depend on past and contemporaneous values of $W_t$, but may have feedback on future values of $W_t$. The fixed effect $C$ may depend on the values $W_t$. But the shape of the distribution function (e.g. the variance) does not depend on $W_1,\dots,W_t$.
Consider again the dynamic model
$$Y_{it} = 1\left(X_{it}\beta + \gamma Y_{it-1} + C_i + U_{it} > 0\right),$$
which gives the following relationship:
$$E(Y_t\mid W_1,\dots,W_t) = \Phi\!\left(\frac{X_t\beta + \gamma Y_{t-1} + E[C\mid W_1,\dots,W_t]}{\sigma_t}\right),$$
which gives
$$\sigma_t\,\Phi^{-1}\!\left(E(Y_t\mid W_1,\dots,W_t)\right) = X_t\beta + \gamma Y_{t-1} + E[C\mid W_1,\dots,W_t]$$
$$\sigma_t\,\Phi^{-1}\!\left(E(Y_t\mid W_1,\dots,W_t)\right) - X_t\beta - \gamma Y_{t-1} = E[C\mid W_1,\dots,W_t].$$
And analogously for the previous period,
$$\sigma_{t-1}\,\Phi^{-1}\!\left(E(Y_{t-1}\mid W_1,\dots,W_{t-1})\right) - X_{t-1}\beta - \gamma Y_{t-2} = E[C\mid W_1,\dots,W_{t-1}]. \qquad (100)$$
Now consider the above expression for period $t$ and take the expected value conditional on $W_1,\dots,W_{t-1}$:
$$E\!\left[\sigma_t\,\Phi^{-1}\!\left(E(Y_t\mid W_1,\dots,W_t)\right) - X_t\beta - \gamma Y_{t-1}\;\middle|\;W_1,\dots,W_{t-1}\right] = E[C\mid W_1,\dots,W_{t-1}], \qquad (101)$$
where it has been used that
$$E\big[\,E[C\mid W_1,\dots,W_t]\mid W_1,\dots,W_{t-1}\big] = E[C\mid W_1,\dots,W_{t-1}].$$
Now we have two equations for $E[C\mid W_1,\dots,W_{t-1}]$, which implies
$$E\!\left[\sigma_t\,\Phi^{-1}\!\left(E(Y_t\mid W_1,\dots,W_t)\right) - X_t\beta - \gamma Y_{t-1}\;\middle|\;W_1,\dots,W_{t-1}\right] = \sigma_{t-1}\,\Phi^{-1}\!\left(E(Y_{t-1}\mid W_1,\dots,W_{t-1})\right) - X_{t-1}\beta - \gamma Y_{t-2},$$
from which the parameters can be obtained by plugging in nonparametric estimators for the expected values.

12.2 Panel data models with nonseparability

Altonji and Matzkin (2005)

13 Weak instruments

References:
- Cameron and Trivedi (2005, Chapter 4.9)
- Andrews and Stock (2005, presented at the ESWC London)
- Stock, Wright and Yogo (2002, JBES)
- Dufour (2003, Canadian Journal)
- Hahn and Hausman (2003, AER)

13.1 Weak IV in single linear equation models

We discussed linear systems of equations with instrument matrix Z:
$$Y = X\beta + U.$$
This includes
- single equation models (2SLS)
- simultaneous equations models (SEM)
- panel data models with predetermined variables.

We often found that, in the situation of overidentification, the second-step GMM estimators, i.e. with estimated instrument matrix, can perform very poorly in finite samples. The weak instrument problem can even result in poor performance of the first-step GMM estimator (i.e. system 2SLS).

Identification requires validity and relevance of the instruments:
$$E[Z'U] = 0$$
$$\text{rank}\ E[Z'X] = K.$$
However, if the correlation between Z and X is small, the IV estimator can perform very poorly in finite samples, and the asymptotic approximation can lead to very poor inference.
Hence, although $\beta$ is identified and estimated consistently and asymptotically normally, this may be of little help for finite-sample inference.

Consider the linear single equation IV model of the type
$$Y_i = X_i\theta + U_i$$
with $X_i = (\text{constant}, D_i, X_i)$ of dimension $1\times K$ and $Z_i = (\text{constant}, Z_i, X_i)$ of dimension $1\times L$, where the regressor vector contains the endogenous regressors D and the exogenous regressors X. The degree of overidentification is $L - K$.

In the following we often write this as (where $X_i$ includes the constant)
$$Y_i = D_i\beta + X_i\delta + U_i$$
$$D_i = Z_i\pi + X_i\phi + V_i.$$
We are interested in estimating $\beta$. But the properties of $\hat\beta$, tests and confidence intervals depend on the values of the nuisance parameters. (We sometimes impose that $Z_N'X_N = 0$, i.e. that the instruments Z are orthogonal to X.)
If the variance of $Z_i\pi$ is small compared to the variance of V, then the instruments are weak. Instruments can be weak because $\pi$ is close to zero or because $Var(Z)$ is small.

The properties of conventional IV estimators depend in particular on two parameters: the correlation between U and V,
$$\rho = Corr(U,V),$$
and the concentration parameter
$$\lambda_{conc} = \frac{\pi'\left(\sum_{i=1}^n Z_i'Z_i\right)\pi}{\sigma_v^2}.$$

If U and V are uncorrelated (i.e. no endogeneity), conventional 2SLS works well. If they are highly correlated, then 2SLS can be poor if $\lambda_{conc}$ is small. $\lambda_{conc}$ increases with the sample size, but can be relatively small even in large samples; an example is Angrist and Krueger (quarter of birth).

IV estimators perform poorly if instruments are weak:
- 2SLS is heavily biased (in the same direction as OLS);
- the distribution of 2SLS is highly non-normal (e.g. bimodal);
- Bound, Jaeger and Baker (1995) replaced the quarter-of-birth instruments of Angrist and Krueger (1991) with random instruments and obtained very similar point estimates and standard errors.

13.1.1 Finite-sample properties of 2SLS with normal errors

Consider the model with iid normal errors and non-random (fixed) exogenous variables, no further exogenous regressors, and D scalar:
$$Y_i = D_i\beta + U_i$$
$$D_i = Z_i\pi + V_i.$$

The concentration parameter is related to the first-stage F-statistic for the null $\pi = 0$. The F-statistic depends on an estimator of $\sigma_v^2$. Let $\tilde F$ be the infeasible F-statistic with known $\sigma_v^2$. Then
$$\dim_Z\,\tilde F \sim \chi^2_{\text{noncentral}}(\dim_Z,\ \lambda_{conc}),$$
$$E[\tilde F] = 1 + \frac{\lambda_{conc}}{\dim_Z}.$$
Hence, the first-stage F-statistic can be considered as an estimator of $1 + \frac{\lambda_{conc}}{\dim_Z}$.

The 2SLS estimator is
$$\hat\beta_{2SLS} = \left(D_N'P_ZD_N\right)^{-1}D_N'P_ZY_N,$$
from which we can obtain
$$\hat\beta_{2SLS} - \beta_0 = \frac{\sigma_u}{\sigma_v}\cdot\frac{z_u + s_{uv}/\sqrt{\lambda_{conc}}}{1 + 2z_v/\sqrt{\lambda_{conc}} + s_{vv}/\lambda_{conc}}$$
where
$$z_u = \pi'Z_N'U_N\Big/\left(\sigma_u\sqrt{\pi'Z_N'Z_N\pi}\right)$$
$$z_v = \pi'Z_N'V_N\Big/\left(\sigma_v\sqrt{\pi'Z_N'Z_N\pi}\right)$$
$$s_{uv} = V_N'P_ZU_N/(\sigma_u\sigma_v)$$
$$s_{vv} = V_N'P_ZV_N/\sigma_v^2.$$
Under the assumptions of normal errors and non-random instruments, $z_u$ and $z_v$ are standard normal random variables, and $s_{uv}$ and $s_{vv}$ are quadratic forms.
If the concentration parameter is very large, a reasonable approximation is
$$\hat\beta_{2SLS} - \beta_0 \simeq \frac{\sigma_u}{\sigma_v}\cdot\frac{z_u}{\sqrt{\lambda_{conc}}} \sim N\!\left(0,\ \frac{\sigma_u^2}{\sigma_v^2\,\lambda_{conc}}\right).$$
However, if the concentration parameter is small, the finite-sample distribution is non-normal.
In fact, the finite-sample distribution may be so non-normal that the mean and other moments may not even exist.

2SLS (with normal errors and fixed instruments) has finite-sample moments only up to the degree of overidentification:
- If just-identified: no moments exist.
- If one overidentifying restriction: the mean exists, but not the variance.
- If two overidentifying restrictions: the mean and variance exist, but not the skewness.

In addition, the 2SLS estimator is biased towards the OLS estimator if instruments are weak.
In particular, if $\pi = 0$ and if we have overidentification (such that the mean of $\hat\beta_{2SLS}$ exists),
$$E\left[\hat\beta_{2SLS} - \beta_0\right] = E\left[\hat\beta_{OLS} - \beta_0\right] = \frac{\sigma_{uv}}{\sigma_v^2}.$$
Hence, 2SLS is biased towards OLS.

The relative bias can be approximated for moderately large $\lambda_{conc}$ by
$$\frac{E\left[\hat\beta_{2SLS}\right] - \beta_0}{\text{plim}\,\hat\beta_{OLS} - \beta_0} \approx \frac{\dim_Z - 2}{\lambda_{conc}},$$
even when the errors are non-normal (Buse 1992). Both the bias of 2SLS and the bias of OLS are proportional to $\sigma_{uv}$, which thus cancels in the previous expression. Hence, the larger the degree of "endogeneity", the more seriously both estimators are biased.

Consider the following examples, where $\beta_0 = 0$ and $corr(U,V) = 0.99$, so that $\text{plim}\,\hat\beta_{OLS} = 0.99$. In the original graphs, $\mu^2_{conc}$ refers to the concentration parameter and K to the number of instruments.
[Figures: exact finite-sample distributions of the 2SLS estimator for nonrandom instruments and normal errors, for various values of the concentration parameter and the number of instruments.]
(In the following we use different asymptotic approximations to deal with this problem.)
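Since the original graphs cannot be reproduced here, the phenomenon is easy to simulate. Below is a minimal Monte Carlo sketch; the sample size, the number of replications and the first-stage coefficients are illustrative choices (corr(U,V) = 0.99 matches the example above).

```python
# Monte Carlo sketch: with a small concentration parameter, 2SLS is biased
# towards OLS and its distribution is non-normal. n, R and pi are ad-hoc choices.
import numpy as np

rng = np.random.default_rng(1)
n, R, beta0, rho, dimZ, pi = 200, 5000, 0.0, 0.99, 4, 0.05

est_2sls, est_ols = [], []
for _ in range(R):
    Z = rng.normal(size=(n, dimZ))
    U, V = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    D = Z @ np.full(dimZ, pi) + V
    Y = D * beta0 + U
    Dhat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]   # first-stage fitted values
    est_2sls.append((Dhat @ Y) / (Dhat @ D))          # 2SLS
    est_ols.append((D @ Y) / (D @ D))                 # OLS

print("lambda_conc approx:", n * dimZ * pi ** 2)      # pi'E[Z'Z]pi / sigma_v^2
print("median 2SLS:", np.median(est_2sls))            # close to plim OLS (about .99)
print("median OLS :", np.median(est_ols))
```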

Two problems with weak IV:
1) Poor finite-sample properties of 2SLS and GMM estimators.
2) Poor coverage rates of conventional CIs: estimate ± 1.96 · stderror.

Three solutions:
- Develop better estimators.
- Inference with weak IV: CIs that are robust to the strength of the IV.
- Test for weak IV → if the IV is strong, then proceed as usual; otherwise use a different approach. Problem: this is a pre-test.

Weak instrument asymptotics Different types of asymptotics have been developed to analyze weak IV:^{142}
- Strong IV asymptotics: $N\to\infty$ and $\pi$ fixed. This implies $\lambda_{conc}\to\infty$.
- Weak IV asymptotics: $N\to\infty$ and $\pi\sqrt{N} = C$ remains constant. Also assuming that $\frac{1}{N}\sum Z_i'Z_i$ converges to a nonsingular matrix $D_Z$, this gives that the concentration parameter remains constant, i.e.
$$\pi'\sum_{i=1}^n Z_i'Z_i\,\pi \to C'D_ZC$$
converges to a constant matrix.
- Many IV asymptotics: $N\to\infty$ and $\dim_Z\to\infty$ and $\pi$ fixed.
- Many weak IV asymptotics: $N\to\infty$ and $\dim_Z\to\infty$ and $\pi\to 0$ such that $\pi\sqrt{N} = C$ remains constant.
Footnote 142: Alternative asymptotic theory (many-instrument asymptotics in Bekker, 1994, Hahn, 2002; weak-instrument asymptotics in Staiger and Stock, 1997, Wang and Zivot, 1998, Zivot, Startz and Nelson, 1998, Stock and Wright, 2000; and higher-order asymptotics as in Nagar, 1959, Anderson and Sawa, 1979, Morimune, 1983, Rothenberg, 1983, Hahn, Hausman and Kuersteiner, 2002, among others) has been developed, which leads to better coverage probabilities in the presence of weak instruments.

13.1.2 Test for weak IV

A large literature has developed tests for weak instruments, e.g. Hahn and Hausman (2002) and Stock and Yogo (2002).^{143}

In the model
$$Y_i = D_i\beta + X_i\delta + U_i$$
$$D_i = Z_i\pi + X_i\phi + V_i,$$
the strength of the instruments depends on the variance of $Z_i\pi$ compared to the variance of V.

A way to measure the strength of the instruments is the partial $R^2$ in the second equation (where the influence of X is partialled out), i.e. in the regression
$$M_XD_i = M_XZ_i\pi + M_XV_i.$$
Alternatively, use the first-stage F-statistic for the null that $\pi = 0$.

But how do we operationalize "small"?

Null hypothesis: instruments are weak. (D is scalar.) Instruments are defined as weak according to their impact on the 2SLS estimates.

Footnote 143: Donald and Newey (2001) are concerned with choosing the appropriate subset out of a set of valid instruments to minimise mean squared error. If many instruments are available, including all instruments in the IV estimator may worsen finite-sample properties when some of the instruments are weak. The data-driven procedure of Donald and Newey to select a subset of the available instruments can improve upon finite-sample properties.

Alternative numbers come from Stock and Yogo (2005), based on the F-statistic for $\pi = 0$:
- H0: the bias of 2SLS is greater than 10% of the bias of OLS; critical value for the F-statistic: 10.3.
- H0: the null rejection rate of a nominal 5% 2SLS t-test for $\beta$ is greater than 10%; critical value for the F-statistic: 24.6.

Hence, if $\dim(Z) = 1$, the F-statistic is the square of the t-statistic, such that we need a first-stage t-statistic of at least about 3 (or 5, respectively) to reject weak IV.
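A minimal sketch of how the first-stage partial F-statistic can be computed (homoskedastic formula on simulated data; in applied work one would typically use a heteroskedasticity-robust version):

```python
# First-stage F-statistic for H0: pi = 0 in D = Z pi + X phi + V,
# based on the partialled-out regression M_X D = M_X Z pi + M_X V.
import numpy as np

def first_stage_F(D, Z, X):
    """Homoskedastic partial F-statistic; X should include the constant."""
    n = len(D)
    MX = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)    # residual maker M_X
    Dt, Zt = MX @ D, MX @ Z
    pi_hat = np.linalg.solve(Zt.T @ Zt, Zt.T @ Dt)
    resid = Dt - Zt @ pi_hat
    s2 = resid @ resid / (n - Z.shape[1] - X.shape[1])
    return (pi_hat @ (Zt.T @ Zt) @ pi_hat) / (Z.shape[1] * s2)

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, 3))
D = Z @ np.full(3, 0.1) + X @ np.array([0.5, 0.3]) + rng.normal(size=n)
print(first_stage_F(D, Z, X))   # compare with the Stock-Yogo thresholds above
```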

Null hypothesis: instruments are strong. Hahn and Hausman (2002) proposed a test for strong instruments by noting that under strong instruments the regression of Y on D and X and the inverse regression of D on Y and X are asymptotically equivalent. However, this is not the case under weak instruments.
Their test, however, does not seem to be very powerful.
One finding of Bound, Jaeger and Baker (1995) (and of a large number of papers on GMM): adding (many) weak instruments to a set of strong instruments can make the whole set of instruments weak!
Hence, adding weak instruments can be harmful.
It is better to use only a smaller set of strong instruments.
→ Some literature exists on how to select an optimal set from many instruments.
Practical conclusion: begin the analysis with a small set of instruments.

13.1.3 Alternative estimators partially robust to weak IV

A second strand of the literature aims at developing alternative estimators with better finite-sample properties in the case of weak instruments:
- less finite-sample bias (bias correction of 2SLS via the jackknife)
- estimators that are median unbiased (LIML)
- estimators with more finite-sample moments.

These estimators are "more robust" to weak instruments.
Inference is done under conventional strong-instrument asymptotics.

Many of these are k-class estimators, introduced by Theil (1958) and Nagar (1959), and are of the type
$$\left(X_N'\left[I_N(1-\kappa) + \kappa P_Z\right]X_N\right)^{-1}X_N'\left[I_N(1-\kappa) + \kappa P_Z\right]Y_N$$
where $X_N = (D_N : X_N)$ and $P_Z = P_{[Z:X]}$.
If $\kappa = 1$, this gives the 2SLS estimator.
If $\kappa = 0$, this gives the OLS estimator.
Values of $\kappa$ between 0 and 1 trade off asymptotic bias against "finite-sample stability".

Limited information maximum likelihood (LIML) LIML is a k-class estimator with $\kappa$ chosen as the smallest root of the determinantal equation
$$\left|\Psi'\Psi - \kappa\,\Psi'(I_N - P_Z)\Psi\right| = 0$$
where $\Psi = [Y_N : D_N]$.
For normal errors (or errors with even fatter tails), LIML does not possess any finite moments. But its median is often closer to $\beta$ than the mean or median of 2SLS.
When instruments are fixed and errors symmetric, LIML is the best median-unbiased k-class estimator.
In contrast to 2SLS, LIML is consistent under many-instrument asymptotics (Bekker 1994).
With exact identification, LIML and 2SLS are identical.^{144}

Fuller k-class estimator Fuller (1977) proposed to use the k-class estimator with
$$\kappa = \kappa_{LIML} - \frac{\text{constant}}{N - \dim_Z}$$
where the constant is often chosen as 1 or 4.
The Fuller estimator has first and second moments in finite samples (for normal errors), even without over-identification.
Choosing the constant equal to 1 gives nearly unbiased estimates, whereas 4 yields asymptotically the smallest MSE; see also Rothenberg (1984).
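The whole k-class family is simple to code. The sketch below computes OLS, 2SLS, LIML and Fuller on simulated data; the determinantal equation is solved with the exogenous regressors partialled out (here X is just a constant), and the data-generating process and the Fuller constant 1 are illustrative choices.

```python
# Sketch of k-class estimators: beta(kappa) = (X'[I(1-kappa)+kappa P]X)^{-1} X'[...]Y.
import numpy as np

def k_class(Y, D, X, Z, kappa):
    W = np.column_stack([D, X])
    ZX = np.column_stack([Z, X])
    P = ZX @ np.linalg.solve(ZX.T @ ZX, ZX.T)             # projection on [Z : X]
    A = (1 - kappa) * np.eye(len(Y)) + kappa * P
    return np.linalg.solve(W.T @ A @ W, W.T @ A @ Y)[0]   # coefficient on D

def kappa_liml(Y, D, X, Z):
    """Smallest root of |Psi'M_X Psi - k Psi'M_[Z:X] Psi| = 0 with Psi = [Y : D]."""
    Psi = np.column_stack([Y, D])
    M = lambda A: np.eye(len(Y)) - A @ np.linalg.solve(A.T @ A, A.T)
    ZX = np.column_stack([Z, X])
    eigs = np.linalg.eigvals(np.linalg.solve(Psi.T @ M(ZX) @ Psi, Psi.T @ M(X) @ Psi))
    return eigs.real.min()

rng = np.random.default_rng(3)
n, dimZ = 500, 4
X = np.ones((n, 1))
Z = rng.normal(size=(n, dimZ))
U, V = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=n).T
D = Z @ np.full(dimZ, 0.2) + V
Y = 0.5 * D + U

kL = kappa_liml(Y, D, X, Z)
for name, kap in [("OLS", 0.0), ("2SLS", 1.0), ("LIML", kL),
                  ("Fuller(1)", kL - 1.0 / (n - dimZ))]:
    print(name, k_class(Y, D, X, Z, kap))
```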

Bias-adjusted 2SLS and Jackknife Donald and Newey (2001) examined a bias-adjusted 2SLS estimator with
$$\kappa = \frac{N}{N - \dim_Z + 2}.$$
Angrist, Imbens and Krueger (1999) and Blomquist and Dahlberg (1999) proposed a jackknife IV estimator. Remember the model
$$Y_i = D_i\beta + X_i\delta + U_i$$
$$D_i = Z_i\pi + X_i\phi + V_i.$$
Footnote 144: See Basmann (1961, 1963), Kabe (1964), Richardson (1968), Sawa (1969), Mariano and Sawa (1972), Nelson and Startz (1990), Buse (1992), Maddala and Jeong (1992) and others.

2SLS proceeds by projecting D on Z and X in the first stage, thus obtaining the fitted values
$$\tilde D_N = P_ZD_N = Z_N\left(Z_N'Z_N\right)^{-1}Z_N'D_N = Z_N\hat\pi$$
where
$$\hat\pi = \left(Z_N'Z_N\right)^{-1}Z_N'D_N.$$
These fitted values are then used in the regression of $Y_N$ on $\tilde D_N$ and $X_N$.
Remember the intuition behind 2SLS: the original problem is that $D_i$ depends on $V_i$.
The fitted value $\tilde D_i$ does not depend on $V_i$ (since $\tilde D_i$ is only a function of $Z_i$).
But in finite samples $\tilde D_i$ and $V_i$ could still be dependent, since $V_i$ affects the estimate $\hat\pi$.
To avoid this dependence between $\tilde D_i$ and $V_i$, a jackknife estimator derives the fitted value as
$$\tilde D_i = Z_i\hat\pi_{-i}$$
$$\hat\pi_{-i} = \left(Z_{N,-i}'Z_{N,-i}\right)^{-1}Z_{N,-i}'D_{N,-i}$$
where $Z_{N,-i}$ is the matrix $Z_N$ without the i-th row.
Here, $\tilde D_i$ is independent of $V_i$, since the estimate $\hat\pi_{-i}$ is based on different observations.
JIVE can improve on 2SLS when there are many instruments.
See also Hahn, Hausman and Kuersteiner (2002) for a different jackknife estimator.
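JIVE is easy to implement without N separate first-stage regressions, using the leave-one-out identity $Z_i\hat\pi_{-i} = (Z_i\hat\pi - h_{ii}D_i)/(1 - h_{ii})$, where $h_{ii}$ is the i-th diagonal element of $P_Z$. A minimal sketch on simulated data (no additional exogenous regressors; all numbers illustrative):

```python
# Sketch of the jackknife IV estimator (JIVE) using the hat-matrix shortcut.
import numpy as np

rng = np.random.default_rng(4)
n = 400
Z = rng.normal(size=(n, 10))                       # many (weak-ish) instruments
U, V = rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]], size=n).T
D = Z @ np.full(10, 0.1) + V
Y = 0.5 * D + U

P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
h = np.diag(P)                                     # leverages h_ii
D_fit = P @ D                                      # 2SLS fitted values Z pi-hat
D_loo = (D_fit - h * D) / (1 - h)                  # leave-one-out fitted values
beta_2sls = (D_fit @ Y) / (D_fit @ D)
beta_jive = (D_loo @ Y) / (D_loo @ D)
print("2SLS:", beta_2sls, " JIVE:", beta_jive)     # JIVE is typically less biased here
```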

It is too early to draw firm conclusions from this literature.
Nevertheless, it seems that an estimator with finite moments is advisable (see e.g. the Monte Carlo results of Hahn et al.):
→ LIML and Nagar have no moments, and caution should be exercised.
→ Without overidentification, Fuller is recommended.
→ With overidentification, several of the alternative estimators may be used (e.g. Fuller, 2SLS, the jackknife 2SLS of Hahn et al.).

[Figures: finite-sample comparisons of the estimators, for scalar $\beta$.]

13.1.4 Robust Inference with weak IV for 2SLS

Another strand of the literature intends to improve inference for IV estimators in the presence of weak instruments. Robust inference is obtained by inverting a test that is robust to the strength of the IV.

The conventional approach to constructing confidence intervals is
$$\text{estimate} \pm 1.96\cdot\text{stderror},$$
which is based on the result that for an asymptotically normal scalar $\hat\beta$
$$\Pr\!\left(\hat\beta - 1.96\,\hat\sigma \le \beta_0 \le \hat\beta + 1.96\,\hat\sigma\right) \overset{a}{=} 0.95,$$
or equivalently
$$\Pr\!\left(\left|\frac{\hat\beta - \beta_0}{\hat\sigma}\right| \le 1.96\right) \overset{a}{=} 0.95.$$
(Show this graphically.)

The fact that this confidence interval is so simple is because the t-statistic is asymptotically pivotal, i.e. its distribution (e.g. its quantiles) does not depend on any unknown parameters.

It does not even depend on the other coefficients if multiple regressors are included in the regression, which is a particular feature of the normal distribution.
A more general definition of a confidence interval is based on the inversion of a test.
→ Hence, testing and the construction of confidence intervals are essentially equivalent.

Intuitive derivation of confidence intervals (more precise notation in Dufour 2003)

Suppose the data $W_1,\dots,W_N$ are iid draws from some distribution function
$$F(W\mid\beta_0,\eta_0),$$
where $\beta\in B$ and $\eta\in P$ are some (possibly infinite-dimensional) parameter vectors, with $\beta_0$ and $\eta_0$ the "true" values. $\beta$ is the parameter of interest whereas $\eta$ is a nuisance parameter. We may be interested in testing a simple hypothesis of the form
$$H_0: \beta_0 = \beta,$$
i.e. we want to test whether $\beta_0$ equals a specific value without imposing any restrictions on the nuisance parameters.
We may define, for every value of $\beta$, a test statistic and a critical value
$$S(\beta; W_1,\dots,W_N) \quad\text{and}\quad c(\beta).$$
This test statistic and the critical value may be different for every value $\beta$ that we want to test.
Example: the t-statistic $(\hat\beta - \beta)/\widehat{se}$ with critical value 1.96.
The testing rule is:
$$\text{reject } H_0 \text{ if } S(\beta; W_1,\dots,W_N) > c(\beta).$$
The test has level $\alpha$ iff
$$\sup_{\eta_0\in P}\Pr\left(\text{reject } H_0\right) \le \alpha \quad\text{if } H_0 \text{ is true}.$$
It has size $\alpha$ if equality holds (for every value of $\beta_0$).

We can also use this test statistic to define a confidence set, given the data $W_1,\dots,W_N$, as
$$C_{W_1,\dots,W_N} = \left\{\beta : S(\beta; W_1,\dots,W_N) \le c(\beta)\right\},$$
which is the set of all $\beta$ for which the test would not reject. If the test satisfies the level constraint for all values of the parameter space (i.e. for whatever the values $\beta_0$ and $\eta_0$ actually are), then the confidence set has a coverage rate of at least $1-\alpha$:
$$\inf_{(\beta_0,\eta_0)\in B\times P}\Pr\left(\beta_0\in C_{W_1,\dots,W_N}\right) \ge 1-\alpha.$$
Hence, $C_{W_1,\dots,W_N}$ covers the true value $\beta_0$ with probability at least $1-\alpha$.

Intuitive proof: Suppose any fixed values for $\beta_0, \eta_0$ and examine a test where we set $\beta = \beta_0$ (i.e. our test is $H_0: \beta_0 = \beta_0$). Now, using the frequentist approach, consider repeated drawings of $W_1,\dots,W_N$ from $F(W\mid\beta_0,\eta_0)$. In at most $\alpha$ of these hypothetical replications we reject the null hypothesis, because we have assumed that the level constraint is valid for any value of $\beta_0$ and $\eta_0$. This means that in at least $1-\alpha$ of these hypothetical replications we do not reject $H_0$. Now, in these replications, $\beta_0$ must be in the confidence set, because the confidence set is defined as all values of $\beta$ for which the test does not reject. Therefore, in at least $1-\alpha$ of these replications $C_{W_1,\dots,W_N}$ contains $\beta_0$. Since this argument is valid for any value of $\beta_0$ and $\eta_0$, it follows that
$$\inf_{\beta_0,\eta_0}\Pr\left(\beta_0\in C_{W_1,\dots,W_N}\right) \ge 1-\alpha.$$
Hence, this inversion of the test statistic delivers a confidence interval for $\beta_0$.

The definition of confidence regions becomes simpler if we can find pivotal functions $S(\beta_0; W_1,\dots,W_N)$ whose distribution does not depend on unknown parameters such as $\beta$ or $\eta$. This simplifies the level constraint of the test from
$$\Pr\left(S(\beta_0; W_1,\dots,W_N) > c(\beta_0)\right) \le \alpha \quad\text{for all } \beta_0$$
to
$$\Pr\left(S(\beta_0; W_1,\dots,W_N) > c\right) \le \alpha \quad\text{for all } \beta_0,$$
because the distribution of $S(\beta_0; W_1,\dots,W_N)$, and thus also its quantiles, no longer depends on the parameters. Hence, finding one critical value $c$ for all $\beta_0$ suffices.
An example of a pivotal statistic is the t-statistic when the observations $W_i$ are iid normally distributed $N(\mu,\sigma^2)$. The t-statistic
$$\sqrt{N}\;\frac{\bar W - \mu}{\sqrt{\frac{1}{N-1}\sum\left(W_i - \bar W\right)^2}},$$
where $\bar W = \frac{1}{N}\sum W_i$, follows a Student $t_{N-1}$ distribution, which does not depend on $\mu$ or $\sigma^2$.
On the other hand, $\sqrt{N}(\bar W - \mu)$ is not a pivot, since its distribution depends on $\sigma^2$ and therefore the critical values would depend on $\sigma^2$.

An asymptotically pivotal statistic is a function whose asymptotic distribution does not depend on unknown parameters, e.g. the "t-statistic" of OLS regression coefficients. Note, for example, that in an OLS regression with multiple regressors the t-statistic for regressor one does not depend on regressor two.
(Such pivotal statistics will also be important when we discuss bootstrapping.)

We are interested only in $\beta$, but the properties of $\hat\beta$ depend on $\pi$.
If $\pi$ were zero, $\beta$ would be non-identified.
If $\pi$ is close to zero, $\beta$ is only weakly identified.

If the parameter space for $\beta_0$ and $\pi_0$ is unbounded and we cannot exclude the possibility that $\pi_0$ might be zero (i.e. if our model admits $\pi_0 = 0$), then any procedure for constructing a $1-\alpha$ CI for $\beta$ must have infinite length with positive probability,
$$\Pr\left(C \text{ is unbounded};\ \beta_0,\pi_0\right) > 0$$
and
$$\Pr\left(C \text{ is unbounded};\ \beta_0,\pi_0 = 0\right) \ge 1-\alpha,$$
or otherwise its finite-sample size is zero (Gleser and Hwang 1987).
Intuition: If $\pi_0 = 0$, a $1-\alpha$ CI for $\beta$ must include $(-\infty,\infty)$ with probability at least $1-\alpha$, because every value of $\beta\in\mathbb{R}$ is a true value. Since the definition of the test and the critical value only depends on $\beta$, this implies that also for $\pi_0\neq 0$ the critical values will be so large that $C$ contains $(-\infty,\infty)$ with positive probability. Hence, any procedure to construct CIs that have finite length with probability one must have coverage smaller than $1-\alpha$. Since the above argument is valid for any $\alpha$, setting $\alpha$ close to one gives that the coverage probability of the CI is zero.

This result implies that CIs based on
$$\text{estimate} \pm 1.96\cdot\text{stderror}$$
must have size zero, since they are finite with probability one.
This result also holds if the parameter space of $\beta_0,\pi_0$ excludes the point $\pi_0 = 0$ but still includes points arbitrarily close to zero. The result no longer holds if the parameter space for $\pi_0$ is bounded away from zero. But in this case the level of the CI lies between 0 and $1-\alpha$.

Hence, a more robust approach to constructing CIs would be based on a test robust to weak instruments and would sometimes lead to a CI of infinite length in finite samples.
Such a CI has asymptotically correct coverage probability under standard asymptotics and under weak-IV asymptotics.
Since CIs can be obtained by inverting fully-robust tests, the following focuses on robust tests, in particular on tests on $\beta_0$:
$$H_0: \beta_0 = \beta \qquad H_1: \beta_0 \neq \beta.$$
Tests should be robust to weak IV under strong and under weak IV asymptotics, and should ideally also be robust to some other features (e.g. non-normal errors, nonlinearity of the second equation, heteroskedasticity etc.).

Anderson-Rubin statistic The idea of the AR statistic is to test whether the instruments Z have any direct impact in the regression for Y. Consider the model
$$Y_i = D_i\beta_0 + X_i\delta_0 + U_i$$
$$D_i = Z_i\pi_0 + X_i\phi_0 + V_i.$$
If we subtract $D_i\beta_0$, we obtain
$$Y_i - D_i\beta_0 = X_i\delta_0 + U_i$$
$$D_i = Z_i\pi_0 + X_i\phi_0 + V_i.$$
Hence, if we now run a regression of $Y_i - D_i\beta_0$ on $Z_i$ and $X_i$, the regressors Z should be irrelevant.



However, if we subtracted some $\beta\neq\beta_0$, we would obtain
$$Y_i - D_i\beta = D_i(\beta_0 - \beta) + X_i\delta_0 + U_i$$
$$D_i = Z_i\pi_0 + X_i\phi_0 + V_i.$$
Now, in the regression of $Y_i - D_i\beta$ on $Z_i$ and $X_i$, the Z should be significant, since the regressor $D_i(\beta_0-\beta)$ has been omitted.
Hence, the AR statistic is the F-statistic for Z being irrelevant in the regression of $Y_i - D_i\beta$ on $Z_i$ and $X_i$.

The AR statistic is
$$AR(\beta) = \frac{\left(Y_N - D_N\beta\right)'P_Z\left(Y_N - D_N\beta\right)/\dim_Z}{\hat\sigma_\varepsilon^2}$$
where
$$\hat\sigma_\varepsilon^2 = \frac{\left(Y_N - D_N\beta\right)'M_{[Z:X]}\left(Y_N - D_N\beta\right)}{N - \dim_Z - \dim_X}.$$
Under the null this is
$$AR(\beta)\Big|_{H_0} = \frac{U_N'P_ZU_N/\dim_Z}{U_N'M_{[Z:X]}U_N/\left(N - \dim_Z - \dim_X\right)} \overset{d}{\to} \frac{\chi^2_{\dim_Z}}{\dim_Z},$$
assuming iid homoskedastic errors.
This is the limit distribution under strong and under weak IV asymptotics.
The null distribution of the AR statistic does not depend on $\pi$ (regardless of the distribution of U).
Hence, the AR test is robust to weak IV.
The AR test does not rely on any assumption on the second equation (which could be nonparametric).
But the AR test is not robust to heteroskedasticity or autocorrelation.

Under the alternative hypothesis,
$$AR(\beta) = \frac{\left(U_N + Z_N\pi_0(\beta_0-\beta)\right)'P_Z\left(U_N + Z_N\pi_0(\beta_0-\beta)\right)/\dim_Z}{U_N'M_{[Z:X]}U_N/\left(N-\dim_Z-\dim_X\right)}.$$
For $\beta$ scalar, the power depends on $(\beta_0-\beta)^2\,\pi_0'Z_N'Z_N\pi_0$.

The power of the AR test is very good if $\dim_Z = 1$.
For $\dim_Z > 1$, however, the AR test is not so powerful. For $\beta\neq\beta_0$ we have
$$Y_i - D_i\beta = Z_i\pi_0(\beta_0-\beta) + \dots$$
The AR test tests whether Z is relevant in this equation, but it ignores the restriction that the coefficient vector, $\pi_0(\beta_0-\beta)$, is proportional to $\pi_0$. Hence, too many data generating processes are permitted under $H_0$, leading to higher critical values and thus a loss of power.
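A minimal sketch of the AR statistic and of a weak-IV-robust confidence set obtained by inverting it over a grid (no exogenous regressors here, a $\chi^2_{\dim_Z}/\dim_Z$ critical value as a large-N approximation, and an ad-hoc grid; all illustrative):

```python
# Sketch: Anderson-Rubin statistic and confidence set by test inversion.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n, dimZ = 300, 3
Z = rng.normal(size=(n, dimZ))
U, V = rng.multivariate_normal([0, 0], [[1, .9], [.9, 1]], size=n).T
D = Z @ np.full(dimZ, 0.15) + V              # fairly weak instruments
Y = 0.5 * D + U
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)        # projection on Z

def AR(b):
    e = Y - D * b
    ePe = e @ P @ e
    return (ePe / dimZ) / ((e @ e - ePe) / (n - dimZ))

crit = chi2.ppf(0.95, dimZ) / dimZ           # asymptotic 5% critical value
grid = np.linspace(-3, 4, 701)
cs = np.array([b for b in grid if AR(b) <= crit])
if cs.size:
    print("AR 95%% confidence set within grid: [%.2f, %.2f]" % (cs.min(), cs.max()))
else:
    print("empty set on this grid")
# With very weak instruments, the set can be wide or even unbounded.
```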

Kleibergen (2002) proposed an alternative statistic that, instead of regressing $Y_i - D_i\beta$ on $Z_i$ and $X_i$ as for the AR statistic, regresses $Y_i - D_i\beta$ on $Z_i\hat\pi$ and $X_i$, where $\hat\pi = \hat\pi(\beta)$ is the ML estimator under the null hypothesis. For $\dim_Z = 1$, these statistics are identical. But for $\dim_Z > 1$, the Kleibergen statistic has better power properties.

Kleibergen, Moreira and LR statistics Alternative statistics have been proposed. Define:
$$S = \left(Z_N'Z_N\right)^{-1/2}Z_N'\left[Y_N : D_N\right]b_0\left(b_0'\Omega b_0\right)^{-1/2}$$
$$T = \left(Z_N'Z_N\right)^{-1/2}Z_N'\left[Y_N : D_N\right]\Omega^{-1}a_0\left(a_0'\Omega^{-1}a_0\right)^{-1/2}$$
where $b_0 = \begin{pmatrix}1\\-\beta\end{pmatrix}$ and $a_0 = \begin{pmatrix}\beta\\1\end{pmatrix}$ and $\Omega$ is the covariance matrix of the reduced-form errors.
Moreira (2001) showed that for this setup (normal iid errors, fixed instruments) these two statistics S and T are sufficient for $\beta$ and $\pi$. This means that the distribution of any function of the data does not depend on $\beta$ and $\pi$ after conditioning on S and T. In other words, any statistic conditional on S and T is pivotal in the sense that it does not depend on the unknown parameters anymore. Thus, for testing $\beta_0 = \beta$, it suffices to consider test statistics that are functions of only S and T.

Define the matrix
$$Q = \begin{pmatrix}Q_S & Q_{ST}\\ Q_{ST} & Q_T\end{pmatrix} = \begin{pmatrix}S'S & S'T\\ T'S & T'T\end{pmatrix}.$$
The AR statistic can be written as
$$AR(\beta) = \frac{Q_S}{\dim_Z}.$$
The Kleibergen statistic is
$$K(\beta) = \frac{Q_{ST}^2}{Q_T}.$$
The LR statistic is
$$LR(\beta) = \frac{1}{2}\left(Q_S - Q_T + \sqrt{\left(Q_S - Q_T\right)^2 + 4Q_{ST}^2}\right).$$
Andrews and Stock (2005) analyze several of these tests and find that a conditional LR test performs best.

13.1.5 Robust Inference with weak IV for alternative estimators

......

13.2 Weak IV in system of linear equations models

GMM estimation

If D is a vector, the concentration parameter is a matrix,
$$\lambda_{conc} = \Sigma_v^{-1/2\,\prime}\,\Pi'Z_N'Z_N\Pi\,\Sigma_v^{-1/2},$$
where $\Sigma_v^{1/2\,\prime}\Sigma_v^{1/2} = \Sigma_v$ is the covariance matrix of V.
The strength of the instruments requires this matrix to be large, e.g. its smallest eigenvalue should be large.
It does not suffice if the F-statistic is large in every single first-stage regression.

13.3 Weak IV in nonlinear models

13.4 Weak IV in nonparametric models

14 Nonparametric estimation

References:
- Ichimura and Todd (2008, Handbook of Econometrics)
- Li and Racine (2007), Nonparametric Econometrics: Theory and Applications
- Pagan and Ullah (1999)
- Cameron and Trivedi, Chapter 9
- Racine and Li (Journal of Econometrics)
- Gozalo and Linton (Journal of Econometrics)

Nonparametric estimation of a density or distribution function $f_X$ or $F_X$, or of conditional functions such as $f_{Y|X}$, $F_{Y|X}$, $E[Y|X]$ or $Q_{Y|X}$, often serves two different purposes. First, the estimates are used for visual explorative purposes, i.e. visually examining the location of the density mass and the support region, or visually examining the dependence of Y on another variable X. We might be interested in estimating and plotting $f_X$ or $E[Y|X]$, ideally at all values of X in its support. These explorative purposes are often restricted to X being one- or two-dimensional, for the obvious reasons of presentation.
The second purpose of nonparametric estimation is to provide plug-in estimators for more complex (often semiparametric) estimators. Here X can be of high dimension, and objects such as $f_X$ or $E[Y|X]$ are often estimated at a restricted number of locations, e.g. at all values $\{X_i\}$ of some subset of the dataset. Sometimes univariate nonparametric regression still serves this purpose if the multivariate regressors X have been reduced to a one-dimensional index, such as the propensity score, which is conventionally estimated by parametric regression.

14.1 Curse of dimensionality

The (asymptotic) properties of nonparametric estimators^{145} depend on smoothness assumptions on the true regression curve. Therefore the true regression curve is assumed to belong to a particular class of functions. If this class contains only linear functions in X, we know that under general regularity conditions OLS converges at rate $\sqrt{n}$. This class of functions can be described by a vector of coefficients $\beta$ with the same dimension as X, and $\sqrt{n}(\hat\beta - \beta)$ converges in distribution. Hence, $\hat\beta - \beta$ is stochastically bounded even after multiplying with the increasing sequence $\sqrt{n}$. In addition, the asymptotic distribution is centered at zero, i.e. the asymptotic bias is zero. This result does not depend on the dimension of X.^{146}
Footnote 145: And of semiparametric estimators that use nonparametric plug-in estimators.
If we embrace a much wider class of functions, the situation changes: the convergence will be slower than $\sqrt{n}$, and the rate will usually decrease with the number of continuous regressors in X. In other words, the variance of $\hat m(x)$ decreases to zero at a slower rate than $n^{-1}$. In addition, the optimal rate is usually achieved by balancing bias and variance, such that the estimator is asymptotically biased. Hence, in contrast to parametric regression, we also have to consider asymptotic bias for inference. However, we still have to restrict the class of functions somewhat, as we otherwise might not obtain any convergence at all. Think of functions that are highly discontinuous throughout, e.g. nowhere differentiable. These restrictions are often imposed in the form of restrictions on the smoothness of the function, often in the form of differentiability or Hölder/Lipschitz conditions, sometimes boundedness conditions.
Briefly repeating some basic concepts from real analysis is helpful for this purpose. In this chapter we will use the letter d to denote the dimension of a multivariate variable x. Let $m: \mathbb{R}^d\to\mathbb{R}$ be a real-valued function. The function m is Lipschitz continuous over a set $S\subset\mathbb{R}^d$ if there is a nonnegative constant c such that for any two values $x_1, x_2\in S$
$$|m(x_1) - m(x_2)| \le c\,\|x_1 - x_2\|$$
where $\|\cdot\|$ is the Euclidean norm. Loosely speaking, the smallest value of c for which this condition is satisfied represents the 'steepest slope' of the function over the set S. If there is a c such that the Lipschitz condition is satisfied over its entire domain, the function is uniformly Lipschitz continuous. For example, the function $m(x) = |x|$ is uniformly Lipschitz continuous over its entire domain. On the other hand, it is not differentiable at zero.^{147} As a second example, the function $m(x) = x^2$ is differentiable, but not Lipschitz over $\mathbb{R}$. Hence, neither does differentiability imply Lipschitz continuity nor the other way around. A differentiable function is Lipschitz continuous if its first derivative is bounded.
Footnote 146: As long as the dimension does not increase with sample size n.
Footnote 147: According to the theorem of Rademacher, a Lipschitz continuous function is differentiable almost everywhere.
A generalization of Lipschitz continuity is Hölder continuity, which is satisfied over a set S if there is a nonnegative constant c such that for any two values $x_1, x_2\in S$
$$|m(x_1) - m(x_2)| \le c\,\|x_1 - x_2\|^\gamma$$
for some $0 < \gamma \le 1$.^{148}


The class of real-valued functions that are k times differentiable and for which all k-th derivatives are Hölder continuous with exponent $\gamma$ is often denoted as $C^{k,\gamma}$. For this class of functions the remainder term of a k-th order Taylor series expansion of $m(x+u)$ is of order $\|u\|^{k+\gamma}$. Therefore, we will often refer to the 'smoothness' of this class as $k+\gamma$. As we will frequently make use of Taylor series expansions, some more notation is useful. Let $\alpha = (\alpha_1,\dots,\alpha_d)$ be a d-tuple of nonnegative integers, and let $|\alpha| = \alpha_1 + \dots + \alpha_d$. Define $\alpha! = \alpha_1!\cdots\alpha_d!$ and $x^\alpha = x_1^{\alpha_1}\cdots x_d^{\alpha_d}$, and define partial derivatives as
$$D^\alpha m(x) = \frac{\partial^{|\alpha|}}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\,m(x).$$
Then a Taylor expansion of a function $m(x)\in C^{k,\gamma}$ up to order k gives
$$m(x+u) = \sum_{0\le|\alpha|\le k}\frac{1}{\alpha!}D^\alpha m(x)\,u^\alpha + R(x,u) \quad\text{with } |R(x,u)| \le c\,\|u\|^{k+\gamma} \qquad (102)$$
for some nonnegative c, where $\|\cdot\|$ is the Euclidean norm. Note that the summation is over all permutations of the d-tuple $\alpha$.
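To make the multi-index notation concrete, here is (102) written out for the special case d = 2 and k = 2 (a worked illustration, not part of the original text):

```latex
% Multi-indices with |alpha| <= 2 for d = 2:
% (0,0), (1,0), (0,1), (2,0), (1,1), (0,2).
\begin{align*}
m(x+u) ={}& m(x)
  + \frac{\partial m}{\partial x_1}(x)\,u_1
  + \frac{\partial m}{\partial x_2}(x)\,u_2
  + \frac{1}{2}\frac{\partial^2 m}{\partial x_1^2}(x)\,u_1^2 \\
 &+ \frac{\partial^2 m}{\partial x_1\partial x_2}(x)\,u_1 u_2
  + \frac{1}{2}\frac{\partial^2 m}{\partial x_2^2}(x)\,u_2^2
  + R(x,u), \qquad |R(x,u)| \le c\,\|u\|^{2+\gamma}.
\end{align*}
% The mixed term has coefficient 1 because alpha = (1,1) gives alpha! = 1! 1! = 1.
```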

Stone (1980, 1982) derived general results on the optimal convergence of nonparametric estimators when the only information we have about the true regression function m(x) over some set $\mathcal{X}\subset\mathbb{R}^d$ is that it belongs to the class $C^{k,\gamma}$.^{149} In most applications $\mathcal{X}$ is defined as the support of the random variable X, i.e. the set where $f_X(x) > 0$.
To examine the properties of nonparametric estimators further, we first need to define convergence of a function $\hat m(x)$ to m(x) over some set $\mathcal{X}\subset\mathbb{R}^d$. Different ways to measure the distance between two functions can be used, which also affects the asymptotic properties.
Footnote 148: For example, the function $x^2$ is Lipschitz continuous e.g. on the domain [1,5], but it is not Lipschitz on the domain $(-\infty,\infty)$, as the function becomes arbitrarily steep for very large values of x. Similarly, the function $\sqrt{x}$ is not Lipschitz, as it becomes infinitely steep for $x\to 0$. It is, however, Hölder continuous for $\gamma\le\frac{1}{2}$.
Footnote 149: Alternative classes of functions can be considered, see e.g. Ichimura and Todd (2008), which may lead to different optimality properties.

It is helpful to introduce …rst a few examples of norms to measure the distance between two
functions. The Lq norm k kq for 1 q < 1 is
2 31
Z q

km(x)
^ m(x)kq = 4 jm(x)
^ m(x)j d (x)5 .
q

The sup-norm k k1 is
km(x)
^ m(x)k1 = sup jm(x)
^ m(x)j .
X

These two norms measure the distance between the two functions. The Sobolev norms also
account for distances in the derivatives.
The Sobolev norm k ka;q is
2 31
X Z q
q
km(x)
^ m(x)ka;q = 4 k
D (m(x)
^ m(x)) d (x)5 .
0 jkj a X

The sup Sobolev norm k ka;1 is

km(x)
^ m(x)ka;1 = max sup Dk (m(x)
^ m(x)) .
0 jkj a X

The Sobolev norms include the Lq and the sup-norm for a = 0. Since these norms express the
distance between two functions by a real-valued number, the standard concepts of convergence
(plim, mean square, a.s., in distribution) can be applied.

Stone showed the optimal convergence rate of any nonparametric estimator when the only information we have about the true regression function m(x) over some set $\mathcal{X}\subset\mathbb{R}^d$ is that it belongs to the class $C^{k,\gamma}$, in addition to a few other regularity conditions. Suppose we are interested in estimating the v-th order derivative of the function; the optimal convergence rate is
$$n^{-\frac{(k+\gamma)-v}{2(k+\gamma)+\dim(X)}}$$
for the $L_q$ norm for any $0 < q < \infty$, and for the sup-norm it is
$$\left(\frac{n}{\ln n}\right)^{-\frac{(k+\gamma)-v}{2(k+\gamma)+\dim(X)}}.$$
Only the number of continuous elements of X is relevant; the discrete regressors do not affect the convergence rate. These results show that nonparametric estimators can never achieve the convergence rate of $n^{-\frac{1}{2}}$, unless the class of functions is further restricted. Convergence is faster the smoother the function. The convergence is slower when derivatives are estimated (v > 0). In addition, the convergence rate becomes slower with the dimension of X, which is often called the curse of dimensionality. The curse of dimensionality means in practice that nonparametric regression is useful only in two or three dimensions, since confidence bands become extremely wide otherwise. However, this pessimistic conclusion only holds if interest is in the estimation of the function itself. If the nonparametric estimator is only used as a plug-in estimator in a more complex estimation setting, this conclusion no longer strictly applies.
Consider the class $C^{1,1}$ and suppose X is one-dimensional and interest is in estimating the function itself. The optimal convergence rate is
$$n^{-\frac{2}{5}} \quad\text{or}\quad \left(\frac{n}{\ln n}\right)^{-\frac{2}{5}}.$$
If one is willing to assume that the function is much smoother, one can get, in principle, arbitrarily close to $\sqrt{n}$ consistency.
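The rate formula is easy to evaluate numerically; the following short sketch tabulates the exponent for a few dimensions (smoothness k + γ = 2, i.e. the class $C^{1,1}$, and v = 0 are the assumed inputs):

```python
# Stone's optimal L_q rate n^{-(k+gamma-v)/(2(k+gamma)+d)} for v = 0, k+gamma = 2.
s = 2.0                                  # smoothness k + gamma
for d in (1, 2, 3, 5, 10):
    print(f"d = {d:2d}: n^(-{s / (2 * s + d):.3f})")
# d = 1 gives the familiar n^(-0.4); d = 10 gives only n^(-1/7): the curse of
# dimensionality. The parametric benchmark is n^(-0.5).
```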

14.2 Kernel based methods for nonparametric regression

14.2.1 Nonparametric moment estimation in one dimension

Nonparametric estimation is mostly used for estimating conditional moments, in particular the conditional mean $m(x) = E[Y|X=x]$. The following methods could also be used to estimate conditional variances by noting that
$$Var(Y|X) = E[Y^2|X] - \left(E[Y|X]\right)^2.$$
Hence, we could estimate conditional variances by combining two estimates of conditional means. (In practice one would use methods particularly designed for estimating variances, since the above formula does not guarantee nonnegative estimates.)

In the following we focus on estimation of the conditional mean m(x). For most of the discussion of asymptotic properties, we suppose that X contains only continuous random variables, with dim(X) = d. (Discrete regressors do not affect the asymptotic properties, although they have an impact on finite-sample properties.) Nadaraya-Watson kernel regression and local linear regression have been discussed in the introduction; both belong to the class of local polynomial estimators. To gain some intuition, we start with the case dim(X) = 1 here.

Consider estimation of $m(x_0)$ at the location $x_0$. The local polynomial regression estimator of order p is obtained by solving
$$\arg\min_{\alpha,\beta_1,\dots,\beta_p}\ \sum_j\left(Y_j - \alpha - \sum_{l=1}^p\beta_l\,(X_j - x_0)^l\right)^2 K\!\left(\frac{X_j - x_0}{h}\right),$$
where K is a kernel function.^{150} Note that the estimated values $\hat\alpha,\hat\beta_1,\dots,\hat\beta_p$ depend on the location $x_0$ and should more precisely be written as $\hat\alpha(x_0),\hat\beta_1(x_0),\dots,\hat\beta_p(x_0)$. For ease of notation, we will mostly keep this dependence implicit. The local polynomial regression estimator fits a local polynomial centered at $x_0$. The local hyperplane is $\alpha + \sum_{l=1}^p\beta_l(X_j - x_0)^l$. It is easiest understood intuitively if a uniform weighting function $K(u) = 1(|u|\le 1)$ is used, which implies that only the observations with distance at most h from $x_0$ are used in the local fitting of the polynomial. One can show that the first derivative of m(x) is consistently estimated by the first derivative of the local hyperplane, that is, by $\hat\beta_1$. Analogously, one can show that $v!\,\hat\beta_v$ is a consistent estimator of the v-th derivative of $m(x_0)$. (In nonlinear models this will be less obvious.)
By defining $\beta = (\alpha,\beta_1,\dots,\beta_p)'$ and $X_j = \left(1, (X_j-x_0), \dots, (X_j-x_0)^p\right)$ and $K_j = K\!\left(\frac{X_j-x_0}{h}\right)$ and $X_N = (X_1', X_2', \dots, X_n')'$ and $K_N = \text{diag}(K_1, K_2, \dots, K_n)$ and $Y_N = (Y_1,\dots,Y_n)'$, we can write the regression as
$$\hat\beta = \arg\min_\beta\ (Y_N - X_N\beta)'K_N(Y_N - X_N\beta)$$
and we obtain
$$\hat\beta = \left(X_N'K_NX_N\right)^{-1}X_N'K_NY_N \qquad (103)$$
and thus
$$\hat m(x) = e_1'\left(X_N'K_NX_N\right)^{-1}X_N'K_NY_N$$
where $e_1' = (1,0,0,\dots)$, or equivalently
$$\hat m(x_0) = e_1'\begin{pmatrix}Q_0(x_0) & Q_1(x_0) & \cdots & Q_p(x_0)\\ Q_1(x_0) & Q_2(x_0) & \cdots & Q_{p+1}(x_0)\\ \vdots & \vdots & \ddots & \vdots\\ Q_p(x_0) & Q_{p+1}(x_0) & \cdots & Q_{2p}(x_0)\end{pmatrix}^{-1}\begin{pmatrix}T_0(x_0)\\ T_1(x_0)\\ \vdots\\ T_p(x_0)\end{pmatrix},$$
Footnote 150: Certain conditions on the kernel function will be assumed later. For deriving asymptotic theory, a compact kernel is usually assumed, i.e. K(u) being zero outside a compact set. Kernels with unbounded support may often be more stable in practice when dim(X) is large.

where $Q_l(x) = \sum_j K\!\left(\frac{X_j-x}{h}\right)(X_j-x)^l$ and $T_l(x) = \sum_j K\!\left(\frac{X_j-x}{h}\right)(X_j-x)^l\,Y_j$. From these derivations we also see that the local polynomial estimator is a linear smoother in that
$$\hat m(x_0) = \frac{1}{n}\sum_{j=1}^n w_jY_j \qquad (104)$$
with the weights
$$w_j = e_1'\left(\frac{1}{n}X_N'K_NX_N\right)^{-1}X_j'K_j. \qquad (105)$$
According to the polynomial order, the local polynomial estimator is also called Nadaraya (1965)-Watson (1964) kernel (p = 0), local linear (p = 1), local quadratic (p = 2) or local cubic (p = 3) regression. Polynomials of order higher than three are rarely used in practice, except for estimating local bias in data-driven bandwidth selectors. Nadaraya-Watson kernel and local linear regression are most common in econometrics. Local polynomial regression of order two or three is better suited than kernel or local linear regression for modelling peaks and oscillating regression curves in larger samples, but it often proves unstable in small samples, since more data points in each smoothing interval are required (Loader 1999). The expressions of $\hat m(x)$ up to polynomial order three are
$$\hat m_{p=0}(x) = \frac{T_0(x)}{Q_0(x)} = \frac{\sum Y_jK\!\left(\frac{X_j-x}{h}\right)}{\sum K\!\left(\frac{X_j-x}{h}\right)} \qquad (106)$$
$$\hat m_{p=1}(x) = \frac{Q_2T_0 - Q_1T_1}{Q_2Q_0 - Q_1^2}$$
$$\hat m_{p=2}(x) = \frac{(Q_2Q_4 - Q_3^2)T_0 + (Q_2Q_3 - Q_1Q_4)T_1 + (Q_1Q_3 - Q_2^2)T_2}{Q_0Q_2Q_4 + 2Q_1Q_2Q_3 - Q_2^3 - Q_0Q_3^2 - Q_1^2Q_4}$$
$$\hat m_{p=3}(x) = \frac{A_0T_0 + A_1T_1 + A_2T_2 + A_3T_3}{A_0Q_0 + A_1Q_1 + A_2Q_2 + A_3Q_3},$$
where $A_0 = Q_2Q_4Q_6 + 2Q_3Q_4Q_5 - Q_4^3 - Q_2Q_5^2 - Q_3^2Q_6$, $A_1 = Q_3Q_4^2 + Q_1Q_5^2 + Q_2Q_3Q_6 - Q_1Q_4Q_6 - Q_2Q_4Q_5 - Q_3^2Q_5$, $A_2 = Q_1Q_3Q_6 + Q_2Q_4^2 + Q_2Q_3Q_5 - Q_3^2Q_4 - Q_1Q_4Q_5 - Q_2^2Q_6$, $A_3 = Q_3^3 + Q_1Q_4^2 + Q_2^2Q_5 - Q_1Q_3Q_5 - 2Q_2Q_3Q_4$.

Using the above formulae, we can also write the local linear estimator equivalently as
$$\hat m_{p=1}(x) = \frac{\sum\bar K_jY_j}{\sum\bar K_j} \qquad (107)$$
where
$$\bar K_j = K\!\left(\frac{X_j-x}{h}\right)\left\{Q_2 - Q_1(X_j - x)\right\}.$$
Hence, we can express the local linear estimator as a Nadaraya-Watson kernel estimator with kernel function $\bar K_j$. Note that the kernel function $\bar K_j$ can be negative for some values of $X_j$. This may help our intuition about what kernel functions with negative values mean. Similarly, every local polynomial regression estimator can be written in the form (107) with different equivalence kernels $\bar K_j$, which all, except for the case p = 0, sometimes take negative values.
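A minimal sketch of local polynomial regression at a point, implementing (103) directly (Epanechnikov kernel, ad-hoc bandwidth, simulated data; all illustrative choices):

```python
# Sketch of local polynomial regression of order p at x0:
# beta-hat = (X'KX)^{-1} X'KY with X_j = (1, (X_j-x0), ..., (X_j-x0)^p).
import numpy as np

def local_poly(x0, X, Y, h, p=1):
    """Returns the local WLS coefficients; beta[0] estimates m(x0)."""
    u = (X - x0) / h
    K = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)   # Epanechnikov kernel
    XX = np.vander(X - x0, N=p + 1, increasing=True)          # columns (X-x0)^l
    W = K[:, None] * XX
    beta = np.linalg.solve(XX.T @ W, W.T @ Y)
    # beta[0] estimates m(x0); v! * beta[v] estimates the v-th derivative.
    return beta

rng = np.random.default_rng(6)
n = 1000
X = rng.uniform(-2, 2, n)
Y = np.sin(2 * X) + rng.normal(scale=0.3, size=n)
for x0 in (-1.0, 0.0, 1.0):
    b = local_poly(x0, X, Y, h=0.3, p=1)
    print(f"x0={x0:+.1f}: m-hat={b[0]:+.3f} (true {np.sin(2*x0):+.3f}), "
          f"slope={b[1]:+.3f} (true {2*np.cos(2*x0):+.3f})")
```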

Finite sample properties of local polynomial regression An important result is that the exact finite-sample bias of local polynomial regression is zero up to order p. To show this, some preliminary results are helpful. Observe that the weights (105) satisfy the orthogonality condition for an integer l:
$$\frac{1}{n}\sum_{j=1}^n(X_j - x_0)^l\,w_j = \begin{cases}1 & \text{for } l = 0\\ 0 & \text{for } 1\le l\le p\end{cases}. \qquad (108)$$
To see this, note that all these conditions (108) can be summarized as saying that
$$\frac{1}{n}\sum_{j=1}^n w_jX_j' = (1, 0, \dots, 0)'.$$
Inserting the definition of the weights (105), the proof is immediate. These orthogonality conditions (108) will imply an exactly zero finite-sample bias up to order p, as we will see below. This also implies that if the true function m(x) happened indeed to be a polynomial function of order p or less, the local polynomial estimator would be exactly unbiased, i.e. in finite samples and thereby also asymptotically. (In this case, one would actually like to choose the bandwidth value $h = \infty$ to minimize variance, because the bias is zero for every value of h.)
Now, consider the expression as a linear smoother (104) and consider the expected value of the estimator. Note that the expected value of the estimator could be undefined if the denominator in the weights is zero. In other words, there could be 'local collinearity' which prohibits the calculation of the estimator. Ruppert and Wand (1994) therefore proposed to examine the expected value conditional on the observations $X_1,\dots,X_n$:
$$E[\hat m(x_0)\mid X_1,\dots,X_n] = \frac{1}{n}\sum_{j=1}^nE[w_jY_j\mid X_1,\dots,X_n] = \frac{1}{n}\sum_{j=1}^n w_j\,m(X_j)$$

and using a Taylor series expansion, assuming that $m\in C^{p,\gamma}$,
$$= \frac{1}{n}\sum_{j=1}^n w_j\left(m(x_0) + \frac{\partial m(x_0)}{\partial x}(X_j - x_0) + \dots + \frac{1}{p!}\frac{\partial^pm(x_0)}{\partial x^p}(X_j - x_0)^p + R(X_j, x_0)\right)$$
$$= m(x_0) + \frac{1}{n}\sum_{j=1}^n w_j\,R(X_j, x_0),$$
where the terms up to order p vanish because of (108). We thus obtain that
$$E[\hat m(x_0) - m(x_0)\mid X_1,\dots,X_n] = \frac{1}{n}\sum_{j=1}^n w_j\,R(X_j, x_0),$$
where the remainder term $R(X_j, x_0)$ is of order $(X_j - x_0)^{p+\gamma}$ by (102). Now inserting (105) gives
$$E[\hat m(x_0) - m(x_0)\mid X_1,\dots,X_n] = e_1'\left(\frac{1}{n}X_N'K_NX_N\right)^{-1}\frac{1}{n}\sum_{j=1}^nX_j'K_j\,R(X_j, x_0).$$
To obtain an intuition, assume that the kernel is zero outside a compact set. (We sometimes refer to this as a kernel with compact support.) Hence, for every j with $|X_j - x_0| > h$, the kernel function $K_j$ will be zero. This implies that the remainder term is at most of order $O_p(h^{p+\gamma})$. Since h will always be assumed to converge to zero as $n\to\infty$, the higher the polynomial order p, the lower the order of the finite-sample bias. (We will show later that the expression $\frac{1}{n}X_N'K_NX_N$ is $O_p(1)$. Therefore, the entire expression is $O_p(h^{p+\gamma})$.)
Hence, if the function is sufficiently smooth and satisfies a Lipschitz condition, i.e. $\gamma = 1$, the finite-sample bias of Nadaraya-Watson regression is of order h, whereas it is of order $h^2$ for local linear regression, and so on. This result also applies for dim(X) > 1. In the following, we will consider asymptotic properties of the estimator, which often provide more useful expressions.

Asymptotic properties of kernel regression (We still consider the case where dim(X) = 1.) The bias properties of local polynomial regression depend on the order of the local polynomials and on the kernel function, in particular on the order of the kernel function. Generally speaking, the asymptotic properties can be improved by increasing the order of the polynomial or the order of the kernel function, provided that the smoothness of the true regression function permits this. Define the kernel constants
$$\mu_t = \int u^tK(u)\,du, \qquad \kappa_t = \int u^tK(u)^2\,du$$

and also define
$$K_h(u) = \frac{1}{h}K\!\left(\frac{u}{h}\right).$$
The kernel K is of order $\lambda$ if
$$\mu_0 = 1, \qquad \mu_t = 0 \ \text{for } 1\le t\le\lambda-1, \qquad \mu_t\neq 0 \ \text{for } t=\lambda.$$

The Epanechnikov, Gaussian and biweight kernels are all of second order. The Epanechnikov kernel function is optimal in the sense that it minimizes the MSE and MISE over all non-negative symmetric functions K. More precisely, the Epanechnikov kernel is optimal in the interior and is nearly optimal at the boundary points.^{151} In the calculations in Fan, Gasser, Gijbels, Brockmann, and Engel (1997), the biweight $K(u) = const\cdot(1-u^2)^2\,1(|u|<1)$ and the triweight $K(u) = const\cdot(1-u^2)^3\,1(|u|<1)$ kernel functions performed very close to the Epanechnikov in boundary regions, whereas the Gaussian performed substantially worse.
Kernels of higher order require that the "variance" $\int u^2K(u)\,du$ is zero. Hence, these cannot be density functions, and the kernel function K(u) has to be negative for some values in the support of K. Higher-order kernels are often used in theoretical derivations, particularly for reducing the bias of semiparametric estimators. They have been relatively rarely used in nonparametric applications, but may be particularly helpful for semiparametric estimators such as estimators of average treatment effects. Higher-order polynomials and higher-order kernels are two ways to reduce asymptotic bias. As we will see later, when dim(X) is large, local linear estimation with higher-order product kernels can be much more convenient to implement than higher-order local polynomial regression.
Footnote 151: The minimum variance kernel is the uniform kernel.

To obtain some intuition, consider first a heuristic argument for the Nadaraya-Watson regression estimator with dim(X) = d = 1:
$$\hat m(x_0;h) = \frac{\frac{1}{nh}\sum_j Y_jK\!\left(\frac{X_j-x_0}{h}\right)}{\frac{1}{nh}\sum_j K\!\left(\frac{X_j-x_0}{h}\right)}.$$
The expected value of the numerator can be considered as
$$E\left[\frac{1}{nh}\sum_j Y_jK\!\left(\frac{X_j-x_0}{h}\right)\right] = \int m(x)\,\frac{1}{h}K\!\left(\frac{x-x_0}{h}\right)f(x)\,dx = \int m(x_0+uh)\,f(x_0+uh)\,K(u)\,du$$
$$= m(x_0)f(x_0)\int K(u)\,du + h\left(m'(x_0)f(x_0) + m(x_0)f'(x_0)\right)\int uK(u)\,du$$
$$\quad + h^2\left(\frac{m''(x_0)}{2}f(x_0) + m(x_0)\frac{f''(x_0)}{2} + m'(x_0)f'(x_0)\right)\int u^2K(u)\,du + O(h^3)$$
$$= m(x_0)f(x_0) + h^2\left(\frac{m''(x_0)}{2}f(x_0) + m(x_0)\frac{f''(x_0)}{2} + m'(x_0)f'(x_0)\right)\int u^2K(u)\,du + O(h^3),$$
since $\int K(u)\,du = 1$ and $\int uK(u)\,du = 0$. Analogously, the expected value of the denominator is^{152}
$$E\left[\frac{1}{nh}\sum_j K\!\left(\frac{X_j-x_0}{h}\right)\right] = f(x_0) + h^2\,\frac{f''(x_0)}{2}\int u^2K(u)\,du + O(h^3).$$
Under regularity conditions, a WLLN gives the plim for fixed h and $n\to\infty$. This can be shown by showing that the variance converges to zero and applying Chebyshev's inequality:
$$\text{plim}\;\hat m(x_0;h) - m(x_0) = \frac{h^2\left(\frac{m''(x_0)}{2}f(x_0) + m'(x_0)f'(x_0)\right)\int u^2K(u)\,du + O(h^3)}{f(x_0) + h^2\frac{f''(x_0)}{2}\int u^2K(u)\,du + O(h^3)}.$$
This gives
$$\text{plim}\;\hat m(x_0;h) - m(x_0) = h^2\left(\frac{m''(x_0)}{2} + \frac{m'(x_0)f'(x_0)}{f(x_0)}\right)\int u^2K(u)\,du + O(h^3).$$
Hence, the 'bias' is proportional to $h^2$.

- What would happen if a third-order kernel were used instead of a second-order kernel? In this case $\int u^2K(u)\,du$ is zero and the $h^2$ terms cancel in the above expressions. The bias is then proportional to $h^3$. Analogously, a $\lambda$-th order kernel would reduce the bias to be proportional to $h^\lambda$, provided that the $\lambda$-th derivatives of m and f at $x_0$ exist.

- Boundary bias: The previous derivations assumed that the kernel is compact and that f(x) is positive for all values of x where $K\!\left(\frac{x-x_0}{h}\right) > 0$. Then $\int uK(u)\,du$ is zero and the $h^1$ terms dropped from the above derivations. For points $x_0$ close to the boundary of the support of X, the $h^1$ terms will not vanish and the bias expressions will be different. For Nadaraya-Watson regression, the bias will be of order h. This boundary behaviour is particularly important for semiparametric estimators, e.g. when estimating the ATE with rather different X distributions for participants and non-participants. To see this, suppose that the support of X is [0,1] and that the kernel K is compact with support [−1,1]. Consider the numerator of the Nadaraya-Watson estimator:
$$E\left[\frac{1}{nh}\sum_j Y_jK\!\left(\frac{X_j-x_0}{h}\right)\right] = \int_0^1 m(x)\,\frac{1}{h}K\!\left(\frac{x-x_0}{h}\right)f(x)\,dx = \int_{-\frac{x_0}{h}}^{\frac{1-x_0}{h}} m(uh+x_0)\,f(uh+x_0)\,K(u)\,du$$
$$= m(x_0)f(x_0)\int_{-\frac{x_0}{h}}^{\frac{1-x_0}{h}}K(u)\,du + h\left(m'(x_0)f(x_0) + m(x_0)f'(x_0)\right)\int_{-\frac{x_0}{h}}^{\frac{1-x_0}{h}}uK(u)\,du + O(h^2).$$
If h is large and $x_0$ is close to zero or one, the term $\int_{-x_0/h}^{(1-x_0)/h}uK(u)\,du$ will not be zero and the bias will be of order h. Consider e.g. the boundary point $x_0 = 0$. If the support of X is compact, a point $x_0$ is an interior point if $x_0 + h \le \sup\text{Supp}(X)$ and $x_0 - h \ge \inf\text{Supp}(X)$. If one assumes that h converges to zero sufficiently fast, almost all evaluation points are interior. If Supp(X) is open and convex, all points will be interior for sufficiently large sample size and h decreasing with n. However, this may not provide useful approximations for finite samples when $m(x_0)$ is evaluated close to the endpoints of the empirical support of $X_i$. To take account of this boundary behaviour, particular boundary kernels have been developed for the Nadaraya-Watson estimator. For local linear regression, the bias nevertheless remains of order $h^2$, because the orthogonality conditions (108) still apply. Hence, the use of local linear regression may be more appropriate for such applications.^{153,154}
Footnote 152: Note that the expected value of the denominator may be zero if a kernel with compact support is used. Therefore, the expected value of the Nadaraya-Watson estimator may not exist. The asymptotic analysis therefore is usually done by conditioning on the design points $\{X_j\}_{j=1}^n$, as in Ruppert and Wand (1994), or by adding a small amount to the denominator that tends to zero as $n\to\infty$, as is done e.g. in Fan (1993). When pursuing the former approach it is found that the results conditional on the $\{X_j\}_{j=1}^n$ sample do not depend on the $\{X_j\}_{j=1}^n$ points. Therefore, the unconditional results are the same as the conditional ones, provided that the estimator is defined.
Furthermore, many semiparametric estimators with nonparametric plug-ins often require uniform convergence over the support of X at a certain rate. Since the Nadaraya-Watson estimator converges more slowly at the boundary, this reduces the uniform convergence rate. Local linear regression, on the other hand, has the same rate in the interior as at the boundary.

The previous derivations made implicit use of the Dominated (Bounded) Convergence Theorem, see e.g. Pagan and Ullah (1999, p. 362). Consider a Borel measurable function g(x) on $\mathbb{R}^d$ with
(i) $\int|g(x)|\,dx < \infty$,
(ii) $\|x\|^d\,|g(x)|\to 0$ as $\|x\|\to\infty$,
(iii) $\sup|g(x)| < \infty$,
and another function f(x) with $\int|f(x)|\,dx < \infty$. At every point $x_0$ of continuity of f,
$$\frac{1}{h^d}\int g\!\left(\frac{x}{h}\right)f(x_0 - x)\,dx \to f(x_0)\int g(x)\,dx \qquad (109)$$
as $n\to\infty$, where $h = h_n$ is a sequence of positive constants such that $h\to 0$ as $n\to\infty$. Furthermore, if f is uniformly continuous, then the convergence is uniform.
To illustrate the above formula, substitute $x = uh$ and denote g by K; then expression (109) becomes
$$\int K(u)\,f(x_0 - uh)\,du \to f(x_0)\int K(u)\,du.$$
For g being a kernel function, this theorem gives the result that $E\left[\frac{1}{nh^d}\sum K\!\left(\frac{X_j-x_0}{h}\right)\right]\to f(x_0)\int K(x)\,dx$. Notice that the lemma does not require differentiability of f.

Footnote 153: A similar result relates local cubic to local quadratic regression, or more generally: for some odd value of p, the local polynomial estimators of order p and p−1 have the same order of bias in the interior but not at the boundary. Therefore Fan et al. proposed to prefer odd-order polynomials, since increasing p by one from an even to an odd number does not change the variance. (Generally, the order of the variance does not depend on p, but the multiplicative term does. However, for p even, it is the same for the local polynomial estimator of order p and of order p+1.) When the first derivative shall be estimated, an even-order polynomial is preferred. More generally, p−v shall be odd, where v refers to the order of the derivative of interest.
Footnote 154: Ruppert and Wand (1994) developed many results for local constant and local linear regression.
Pagan and Ullah (1999, Section 3.3) derive the approximate finite-sample properties of the NW estimator with dim(X) = 1. Assume that $Y = m(X) + U$ and that (A2) the $U_j$ are iid, (A3) m and f are twice continuously differentiable in a neighbourhood of $x_0$, (A4) the kernel is symmetric, of second order and integrates to one, (A5) $h\to 0$ and $nh\to\infty$, (A6) the $X_j$ are iid and independent of $U_j$, and (A7) $f_X''$ is continuous and bounded in a neighbourhood of $x_0$ and $x_0$ is in the interior of the support of $X_j$. Their Theorem 3.2 gives:
$$Bias(\hat m) = \frac{h^2\mu_2}{2f(x_0)}\left(m''(x_0)f(x_0) + 2f'(x_0)m'(x_0)\right) + O\!\left(\frac{1}{nh}\right) + o(h^2)$$
$$Var(\hat m) = \frac{\sigma^2\kappa_0}{nh\,f(x_0)} + o\!\left(\frac{1}{nh}\right).$$

These results also indicate how the bandwidth should be chosen optimally. Suppose we aim to minimize $MSE(\hat m(x_0))$.^{155} The first-order approximation to $MSE(\hat m(x_0))$ is
$$\left(\frac{h^2\mu_2}{2f(x_0)}\right)^{\!2}\left(m''(x_0)f(x_0) + 2f'(x_0)m'(x_0)\right)^2 + \frac{\sigma^2\kappa_0}{nh\,f(x_0)}.$$
Considering this as a function of h for fixed n, the optimal bandwidth is obtained by minimizing with respect to h. The first-order condition gives
$$\frac{h^3\mu_2^2}{f^2(x_0)}\left(m''(x_0)f(x_0) + 2f'(x_0)m'(x_0)\right)^2 - \frac{\sigma^2\kappa_0}{nh^2f(x_0)} = 0$$
$$\Longrightarrow\quad h_{opt} = n^{-\frac{1}{5}}\;\sqrt[5]{\frac{\sigma^2\kappa_0\,f(x_0)}{\mu_2^2\left(m''(x_0)f(x_0) + 2f'(x_0)m'(x_0)\right)^2}}.$$
Hence, the optimal bandwidth is proportional to $n^{-1/5}$:
$$h \propto n^{-\frac{1}{5}}. \qquad (110)$$

Pagan and Ullah (1999, Section 3.4) consider the asymptotic properties of the NW estimator. First they show consistency using only (A5), (A6) and a technical assumption (A8) on the kernel function.^{156} For analysing the asymptotic distribution, the estimator is decomposed into bias and variance:
$$\sqrt{nh}\,(\hat m(x_0) - m(x_0)) = \sqrt{nh}\,(\hat m(x_0) - E[\hat m(x_0)]) + \sqrt{nh}\,(E[\hat m(x_0)] - m(x_0)).$$
Footnote 155: @Discuss also MISE etc., which gives the same results.
Footnote 156: Let K be a real-valued function such that (i) $\int K(v)\,dv = 1$, (ii) $\int|K(v)|\,dv < \infty$, (iii) $|v||K(v)|\to 0$ as $v\to\infty$, (iv) $\sup|K(v)| < \infty$ and (v) $\kappa_0 = \int K^2(v)\,dv < \infty$.

Suppose (A3)-(A8) hold and in addition
$$(A9)\colon\ E|U_i|^{2+\delta} < \infty \ \text{ and } \ \int|K(v)|^{2+\delta}\,dv < \infty \ \text{ for some } \delta > 0,$$
which is needed for Liapunov's central limit theorem. Assume also that
$$\sqrt{nh}\,h^2 \to c < \infty; \qquad (111)$$
then
$$\sqrt{nh}\,(\hat m(x_0) - m(x_0)) \overset{d}{\to} N\!\left(c\,\mu_2\left(\frac{m'(x_0)f'(x_0)}{f(x_0)} + \frac{1}{2}m''(x_0)\right),\ \frac{\sigma^2\kappa_0}{f(x_0)}\right).$$
This result shows that if the bandwidth is chosen optimally by (110), the estimator will be asymptotically biased.
Pagan and Ullah (1999, Theorem 3.7) show uniform convergence of the estimator, which is often important for semiparametric estimators. See also Masry (1996) or Gozalo and Linton (2000).

Bias reduction approaches   As just shown, the asymptotic expression of the kernel regression estimator contains a bias and a variance term. Various approaches have been suggested to reduce the asymptotic bias of the nonparametric regression estimator. For nonparametric regression, these approaches have mainly theoretical appeal and seem not always to work well in finite samples. However, for many semiparametric regression estimators the bias problem will be more important, since variance is often reduced through further smoothing, e.g. averaging. Then reduction of the bias term is crucial for obtaining better properties of the estimator.

One possibility to reduce bias is undersmoothing. By (110) the bandwidth was chosen to balance variance and squared bias. If, on the other hand, the bandwidth converges to zero faster than suggested by (110), which implies that $c = 0$ in (111), the asymptotic bias is zero. However, this elimination of the asymptotic bias comes at the price of a lower convergence rate.

If one uses a higher order kernel, the bias would be of lower order and a bandwidth choice by (110) would thus eliminate the asymptotic bias. On the other hand, the use of a $\lambda$-th order kernel would suggest choosing the bandwidth such that
$$\sqrt{nh}\, h^{\lambda} \to c$$
to equilibrate squared bias and variance. Undersmoothing by
$$\sqrt{nh}\, h^{\lambda} \to 0$$
would again eliminate the asymptotic bias.

For making this operational, a data driven bandwidth selector will be required, as discussed later. Implementing higher order polynomials or higher order kernels can be inconvenient, unless a product kernel is used, as discussed later.

An alternative approach to bias reduction could be based on the idea of "jackknifing" to eliminate the first order bias term. The jackknife kernel estimator would be
$$\tilde m(x_0) = \frac{\hat m(x_0; h) - c^{-2}\, \hat m(x_0; ch)}{1 - c^{-2}}$$
where $c > 1$ is a constant157 and $\hat m(x_0; h)$ is the kernel estimator with bandwidth $h$ and $\hat m(x_0; ch)$ is the estimator using $ch$ as bandwidth. The intuition behind this estimator is the following. The first order approximation to the expected value of the kernel estimator is
$$E[\hat m(x_0; ch)] = m(x_0) + \frac{c^2 h^2 \mu_2}{2 f(x_0)}\left(m''(x_0)\, f(x_0) + 2 f'(x_0)\, m'(x_0)\right).$$
Inserting this in the above expression shows that the bias of $\tilde m(x_0)$ contains only terms of order $h^3$ and below. This estimator is easy to implement.
Bias can also be reduced through higher-order local polynomial regression or higher order kernels, as discussed before.
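A minimal sketch of the jackknife combination above (illustrative Python, assuming a Gaussian kernel; nw and nw_jackknife are hypothetical helper names):

    import numpy as np

    def nw(x0, X, Y, h):
        K = np.exp(-0.5 * ((X - x0) / h) ** 2)  # Gaussian kernel weights
        return np.sum(K * Y) / np.sum(K)

    def nw_jackknife(x0, X, Y, h, c=1.05):
        """Combine estimates at bandwidths h and ch to cancel the h^2 bias term;
        c = 1.05 lies in the range 1 < c < 1.1 suggested in Pagan and Ullah."""
        return (nw(x0, X, Y, h) - c ** (-2) * nw(x0, X, Y, c * h)) / (1 - c ** (-2))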

Properties of local polynomial regression   The properties of local polynomial regression have been considered in Pagan and Ullah (1999, Theorem 3.7), among others. A comprehensive overview is also given in Fan and Gijbels (1996). Local polynomial regression of order $p$ provides not only an estimate of $m(x)$ but also of its derivatives up to order $p$. As discussed before, the exact bias terms up to order $p$ are zero. Hence, a general result is that the bias at the boundary is of order $h^{p+1}$. For interior points the bias is of order $h^{p+1}$ for $p$ odd and $h^{p+2}$ for $p$ even. (The reason for this is that the term of order $p+1$ is asymptotically, but not in finite samples, zero if $p$ is even.) The bias expression for $p$ even will be somewhat more complicated, though, than for $p$ odd. Hence, choosing a higher value for $p$ reduces the order of the bias (at least at the boundary). On the other hand, the extra parameter in the local polynomial does not affect the asymptotic variance, which is always of order $\frac{1}{nh^d}$ whatever the value of $p$. This finding led Pagan and Ullah (1999, Theorem 3.7) and Fan and Gijbels (1996) to conclude that one should choose $p$ to be an odd number. A similar result is obtained when interest is in the $v$-th derivative of $m(x)$: here $p - v$ should be an odd number.
157
$1 < c < 1.1$ is suggested e.g. in Pagan and Ullah (1999).
Now consider the asymptotic properties of local polynomial regression. The local polynomial estimator was defined in (103). We still consider the case $\dim(X) = 1$. The estimate of the mean is given by the first element
$$\hat m(x_0) = e_1' \left(\mathsf{X}_N' \mathsf{K}_N \mathsf{X}_N\right)^{-1} \mathsf{X}_N' \mathsf{K}_N \mathsf{Y}_N.$$
We can write
$$Y_i = \mathsf{X}_i' \beta_0 + R_i + U_i$$
where $\mathsf{X}_i$ is the vector of all design regressors and $\beta_0 = \left(m(x_0),\, m'(x_0),\, \frac{1}{2!} m^{(2)}(x_0), \ldots, \frac{1}{p!} m^{(p)}(x_0)\right)'$ collects the derivatives of $m$ at $x_0$. In other words, $\mathsf{X}_i' \beta_0$ is the Taylor series approximation at the location $X_i$, and $R_i = m(X_i) - \mathsf{X}_i' \beta_0$ is the approximation error, i.e. the remainder term of the Taylor series approximation. Now we can write
$$\hat\beta_{x_0} = \beta_0 + \underbrace{\left(\mathsf{X}_N' \mathsf{K}_N \mathsf{X}_N\right)^{-1} \mathsf{X}_N' \mathsf{K}_N \mathsf{R}_N}_{\text{bias term}} + \underbrace{\left(\mathsf{X}_N' \mathsf{K}_N \mathsf{X}_N\right)^{-1} \mathsf{X}_N' \mathsf{K}_N \mathsf{U}_N}_{\text{variance term}}.$$
The latter two terms characterize the bias and the variance of the local polynomial estimator. Since the estimators of the derivatives converge at different rates, it is helpful to introduce the diagonal matrix $H = \mathrm{diag}(1, h^{-1}, h^{-2}, \ldots)$ and let $\mathsf{X}_{h,N} = \mathsf{X}_N H$ to obtain
$$H^{-1}\left(\hat\beta_{x_0} - \beta_0\right) = \underbrace{\left(\mathsf{X}_{h,N}' \mathsf{K}_N \mathsf{X}_{h,N}\right)^{-1} \mathsf{X}_{h,N}' \mathsf{K}_N \mathsf{R}_N}_{\text{bias term}} + \underbrace{\left(\mathsf{X}_{h,N}' \mathsf{K}_N \mathsf{X}_{h,N}\right)^{-1} \mathsf{X}_{h,N}' \mathsf{K}_N \mathsf{U}_N}_{\text{variance term}}. \qquad (112)$$
For concreteness, consider $p = 1$ and a kernel function of order 2. The scaled design regressors are $\mathsf{X}_{h,j} = \left(1, \frac{X_j - x_0}{h}\right)'$. Ignore the variance term for the moment and focus on the bias:
$$\begin{pmatrix} \hat a - m(x_0) \\ h\left(\hat b - m'(x_0)\right) \end{pmatrix} = \left(\frac{1}{nh}\sum_{j=1}^n \begin{bmatrix} K_j & \frac{X_j - x_0}{h} K_j \\ \frac{X_j - x_0}{h} K_j & \left(\frac{X_j - x_0}{h}\right)^2 K_j \end{bmatrix}\right)^{-1} \left\{\frac{1}{nh}\sum_{j=1}^n \begin{bmatrix} K_j \\ \frac{X_j - x_0}{h} K_j \end{bmatrix} \left(m(X_j) - \mathsf{X}_j' \beta_0\right)\right\}. \qquad (113)$$
One can show that the first term converges in probability to a positive definite matrix. As discussed above, for this particular case with $\dim(X) = 1$, $p = 1$ and a symmetric second order kernel with mean zero and integrating to one, it converges to
$$\left(f(x_0) \begin{bmatrix} 1 & h\, \frac{f'(x_0)}{f(x_0)}\, \mu_2 \\ h\, \frac{f'(x_0)}{f(x_0)}\, \mu_2 & \mu_2 \end{bmatrix} \left(1 + O_p(h)\right)\right)^{-1}$$
where $\mu_l = \int u^l K(u)\,du$ and $\nu_l = \int u^l K^2(u)\,du$, and $\mu_1 = 0$ for a symmetric kernel. This can be shown element-wise via mean square convergence. The determinant of the matrix is $\mu_0 \mu_2 - \left(h\, \frac{f'(x_0)}{f(x_0)}\, \mu_2\right)^2$, which simplifies to $\mu_2$ when retaining only the highest order term. Using this result we obtain
$$= \frac{1}{f(x_0)} \begin{bmatrix} 1 & -h\, \frac{f'(x_0)}{f(x_0)} \\ -h\, \frac{f'(x_0)}{f(x_0)} & \mu_2^{-1} \end{bmatrix} \left(1 + O_p(h)\right).$$

Now consider the second term in (113). By a series expansion we obtain
$$m(X_j) - \mathsf{X}_j' \beta_0 = \frac{m''(x_0)\, (X_j - x_0)^2}{2!} + \frac{m'''(\bar x_j)\, (X_j - x_0)^3}{3!}$$
where $\bar x_j$ is on the line connecting $X_j$ and $x_0$. The second term of (113) can be written as
$$\frac{1}{nh} \sum_{j=1}^n \begin{bmatrix} K_j \\ \frac{X_j - x_0}{h} K_j \end{bmatrix} \frac{m''(x_0)\, (X_j - x_0)^2}{2!} \left(1 + O_p(X_j - x_0)\right).$$
If a kernel function with bounded support is used, the term $(X_j - x_0)^2$ is of order $O_p(h^2)$. If in addition the second derivative is bounded, the bias term is of order $O_p(h^2)$. Inserting these terms we obtain
$$\begin{pmatrix} \hat a - m(x_0) \\ h\left(\hat b - m'(x_0)\right) \end{pmatrix} = \frac{1}{f(x_0)} \frac{1}{nh} \sum_{j=1}^n \begin{bmatrix} 1 - h\, \frac{f'(x_0)}{f(x_0)}\, \frac{X_j - x_0}{h} \\ \frac{1}{\mu_2}\, \frac{X_j - x_0}{h} - h\, \frac{f'(x_0)}{f(x_0)} \end{bmatrix} K_j\, \frac{m''(x_0)\, (X_j - x_0)^2}{2!} \left(1 + O_p(X_j - x_0)\right)\left(1 + O_p(h)\right)$$
$$= \frac{m''(x_0)}{2}\, h^2 \begin{bmatrix} \mu_2 \\ \frac{\mu_3}{\mu_2} \end{bmatrix} \left(1 + o_p(1)\right).$$
The last term in (112) characterizes the conditional variance, which is given by
$$\left(\mathsf{X}_{h,N}' \mathsf{K}_N \mathsf{X}_{h,N}\right)^{-1} \mathsf{X}_{h,N}' \mathsf{K}_N \Sigma_N \mathsf{K}_N \mathsf{X}_{h,N} \left(\mathsf{X}_{h,N}' \mathsf{K}_N \mathsf{X}_{h,N}\right)^{-1}, \qquad (114)$$
where $\Sigma_N$ is an $n \times n$ diagonal matrix with elements $\sigma^2(X_j) = E[U_j^2 | X_j]$. For the local linear case, the variance of the third term in (113) is of order $O\left(\frac{1}{nh^{\dim(X)}}\right)$. Multiplied by $nh^{\dim(X)}$, the middle term in (114) converges to
$$\begin{bmatrix} \nu_0 & \nu_1 \\ \nu_1 & \nu_2 \end{bmatrix} f(x_0)\, \sigma^2(x_0).$$
Hence, the finite sample expressions are such that the bias is of order $h^{p+1}$ and the variance of order $\frac{1}{nh^{\dim(X)}}$.

For convergence of the estimator the bandwidth therefore needs to be chosen such that $h \to 0$ and $nh^{\dim(X)} \to \infty$. If the bandwidth is chosen in this way, the estimator will converge under certain regularity conditions. Some asymptotic results are given in Ichimura and Todd (2008).
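The weighted least squares definition of the local polynomial estimator can be implemented directly. The following is an illustrative Python sketch (function names are mine, assuming an Epanechnikov kernel); the matrix $D'KD$ it inverts is exactly the denominator that may become singular in sparse regions, as discussed in the next paragraphs.

    import numpy as np

    def local_polynomial(x0, X, Y, h, p=1):
        """Local polynomial estimate of m(x0); beta[0] estimates m(x0),
        and l! * beta[l] estimates the l-th derivative."""
        u = (X - x0) / h
        K = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)  # Epanechnikov
        D = np.vander(X - x0, N=p + 1, increasing=True)  # 1, (X-x0), ..., (X-x0)^p
        A = D.T @ (K[:, None] * D)                       # D'KD (may be near-singular)
        b = D.T @ (K * Y)                                # D'KY
        beta = np.linalg.solve(A, b)
        return beta[0]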

Ridge Regression   Local linear regression is usually preferred to Nadaraya-Watson regression since its local bias does not depend on the density (i.e. does not include the term $f' m' / f$) and is of the same order at the boundary as in the interior, see Fan (1992) or Hastie and Loader (1992). However, despite its good performance in settings with abundant data, local linear regression is very susceptible to variance problems when the data are sparse or clustered, see Seifert and Gasser (1996). Despite its favourable asymptotic properties, local linear regression with a global bandwidth value often leads to a very rugged curve in regions of sparse data (Seifert and Gasser 1996). If the $X_i$ observations in the neighbourhood around $x$ are clustered, the denominator of the least squares local polynomial regression estimator can become arbitrarily close to zero, or even zero if a compact kernel is used, and the estimate of $E[Y|X = x]$ might explode towards $\pm\infty$. Therefore the local linear estimator has infinite unconditional variance and unbounded conditional variance (Seifert and Gasser 1996).159 They also showed that the probability of the occurrence of sparse regions is substantial if the $X_j$ observations are randomly spaced. The seriousness of this behaviour becomes apparent from their simulation results, which reveal that the mean integrated squared error (MISE) of the local linear estimator explodes at bandwidth values that are only slightly below the asymptotically optimal bandwidth. Due to this risk of single 'extremely bad' estimates the variance of local polynomial regression is unbounded. A simple but clearly inefficient solution consists in deliberate over-smoothing. Another strategy, based on the idea of ridge regression, is to add a small amount to the denominator to avoid near-zero or zero denominators. Fan (1992) proposed adding the term $n^{-2}$ to the denominator, mainly for technical reasons in the proofs of the asymptotic properties. Yet further adjustments are necessary for reliable small sample behaviour. Seifert and Gasser (1996, 2000) proposed a modification of the estimator. To improve numerical stability the regression line is centered at $\bar x$ instead of at $x$, where $\bar x$ refers to the middle of the data in the smoothing window:
$$\min_{a,b} \sum_j \left(Y_j - a - b\,(X_j - \bar x)\right)^2 K\left(\frac{X_j - x}{h}\right) \quad\text{with}\quad \bar x = \frac{\sum X_j\, K\left(\frac{X_j - x}{h}\right)}{\sum K\left(\frac{X_j - x}{h}\right)}.$$
159
At least 4 observations in each smoothing interval are required for finite unconditional variance, but even then the conditional variance is still unbounded.

$m(x)$ is estimated as $\hat m(x) = \hat a + \hat b\,(x - \bar x)$, which is equal to $\hat m(x) = \frac{\tilde T_0}{\tilde Q_0} + \frac{\tilde T_1}{\tilde Q_2}\,(x - \bar x)$, where $\tilde Q_l(x) = \sum K\left(\frac{X_j - x}{h}\right)(X_j - \bar x)^l$ and $\tilde T_l(x) = \sum K\left(\frac{X_j - x}{h}\right)(X_j - \bar x)^l\, Y_j$. This re-centering in itself does not change the estimator. Their main innovation is to add a ridge parameter $r$ to $\tilde Q_2$ to avoid zero and almost-zero denominators:
$$\hat m_{ridge}(x) = \frac{\tilde T_0}{\tilde Q_0} + \frac{\tilde T_1\,(x - \bar x)}{\tilde Q_2 + r h\, |x - \bar x|}. \qquad (115)$$
The constant $r$ is the ridge parameter that ensures non-zero denominators. According to the 'rule-of-thumb' of Seifert and Gasser (2000), $r$ is set to $\frac{5}{16}$ for the Epanechnikov kernel and to $\left(4\sqrt{2\pi}\int \phi^2(u)\,du\right)^{-1} \approx 0.35$ for the Gaussian kernel. Seifert and Gasser (2000) also analyze extensions to the multivariate case.160
This ridge regression estimator has been applied to a propensity score matching estimator in Frölich (2004), where it performed much better than Nadaraya-Watson regression and also better than local linear regression.
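A minimal Python sketch of the ridging estimator (115), assuming an Epanechnikov kernel and the rule-of-thumb value $r = 5/16$ (names are illustrative):

    import numpy as np

    def ridge_local_linear(x, X, Y, h, r=5 / 16):
        u = (X - x) / h
        K = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)
        if K.sum() == 0:
            return np.nan                        # empty smoothing window
        xbar = np.sum(K * X) / np.sum(K)         # middle of the data in the window
        T0 = np.sum(K * Y)
        T1 = np.sum(K * (X - xbar) * Y)
        Q0 = np.sum(K)
        Q2 = np.sum(K * (X - xbar) ** 2)
        # ridge term rh|x - xbar| keeps the denominator away from zero
        return T0 / Q0 + T1 * (x - xbar) / (Q2 + r * h * np.abs(x - xbar))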

14.2.2 Multivariate local polynomial regression

Nonparametric regression for a one-dimensional covariate is of limited interest for most econometric applications, where usually quite a number of covariates $X$ are included. There may be situations where the dimension of the covariates can be reduced before nonparametric regression itself is required. An example is propensity score matching with a parametrically estimated propensity score. For such situations, the ridge regression estimator discussed above performs very well, see Frölich (2004). Apart from such situations, and others discussed later in the chapter on semiparametric regression, one usually has to consider nonparametric regression for $\dim(X) = d > 1$. The extension of local polynomial regression to such multidimensional $X$ is straightforward to implement. The derivations of its properties are also analogous, although some care in the notation is required.
160
Alternative solutions for avoiding zero or near-zero denominators are local increases of the bandwidth value or kernels with unbounded support. Gaussian weights have also been considered in Seifert and Gasser (1996, 2000) and behaved more stably than the standard local linear estimator, but less promisingly than the ridging estimator with a compact kernel.

A multivariate kernel function $K(v_1, \ldots, v_d)$ is needed and a $d \times d$ bandwidth matrix determines the shape of the smoothing window. This permits smoothing in different directions and can also take into account that some regressors may be more highly correlated than others. However, selecting this $d \times d$ bandwidth matrix in practice by some data-driven bandwidth selector can be inconvenient and time consuming when $d$ is large. As a practical alternative, one could rescale all the regressors to have the same scale. With a sample $\{X_{j,1}, \ldots, X_{j,d}\}_{j=1}^n$ of size $n$ one would rescale the regressors such that their sample mean is zero, their sample variance is one and they are all orthogonal to each other (i.e. all covariances are zero).161 With all regressors on the same scale and uncorrelated, most of the appeal of different smoothing parameters in the different dimensions is gone, and a reasonable and convenient choice might be to use only a single scalar bandwidth value $h$ to control smoothing in all directions.
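As an illustration, a Python analogue of the Gauss one-liner in footnote 161 (the function name is mine): center the regressors and multiply by a Cholesky-based whitening matrix so that the sample covariance becomes the identity.

    import numpy as np

    def whiten(XN):
        """Return regressors with zero mean, unit variance and zero covariances."""
        Xc = XN - XN.mean(axis=0)
        # L satisfies L @ L.T = inv(Var(X)), so cov(Xc @ L) is the identity
        L = np.linalg.cholesky(np.linalg.inv(np.cov(Xc, rowvar=False)))
        return Xc @ L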

In such a setting, the bias of the local polynomial estimator at a boundary point is of order $h^{p+1}$. At an interior point, it is of order $h^{p+1}$ if $p$ is odd and of order $h^{p+2}$ if $p$ is even, see Ruppert and Wand (1994). Hence, these results are the same as in the univariate setting. However, the variance is now of order $\frac{1}{nh^d}$. Hence, we have the perhaps surprising result that the bias does not depend on $d$, whereas the variance does not depend on $p$. The variance decreases more slowly the larger the dimension of $X$, but increasing the order $p$ does not affect the convergence rate (although it may affect the multiplicative constant of the variance).
161
Let $X_N$ be the data matrix with arbitrary covariance matrix. Multiplication with the Cholesky decomposition of the inverse of its covariance matrix will lead to orthogonal regressors. In Gauss the code would be XN*chol(inv(vcx(XN)))', where vcx delivers the variance-covariance matrix of XN, inv calculates the inverse of this matrix and chol gives the Cholesky decomposition of a symmetric, positive definite matrix, i.e. an upper triangular matrix $A$ such that $A'A = \mathrm{Var}(X_N)^{-1}$.

Nonparametric smoothing in higher dimensions is susceptible to the curse of dimensionality. The rate of convergence of nonparametric regression estimators decreases with the number of continuous covariates. This has led to a general suspicion against nonparametric regression in higher dimensions162 and to the development of the many different semiparametric estimators, which however are not very often applied. Nevertheless, if the nonparametric regression estimator is only used as a plug-in estimator in some semiparametric regression context, $\sqrt{N}$ convergence can often still be achieved. (But regularity conditions will become stronger with the dimension of $X$.) Hence, for $\dim(X)$ larger than 100 or 1000 nonparametric regression may be infeasible, given the current speed of computers. But for $\dim(X)$ about 10 or 20, nonparametric regression is feasible but still computationally very demanding, particularly when using data-driven bandwidth selectors, perhaps local bandwidth values and perhaps bootstrapping of the entire procedure for inference.
The reason why multivariate nonparametric regression becomes difficult is the sparsity of data in higher dimensional spaces. Consider a relatively large sample of size $N$ and suppose $X$ is one-dimensional and uniformly distributed between 0 and 1. If we choose a smoothing window of size 0.01 (e.g. a bounded symmetric kernel with $h = \frac{0.01}{2}$), we expect about 1% of the observations to lie in this smoothing window. Now consider the situation where the dimension of $X$ is 10 and $X$ is multivariate uniformly distributed. Again, we want to find a smoothing area that contains 1% of the observations. This requires a 10-dimensional cube with side length $0.63$ (since $0.63^{10} \approx 0.01$). Hence, for each component $X_k$ the smoothing area covers almost two thirds of the support of $X_k$, whereas it was only $0.01$ in the one-dimensional case.

This means that the intuition about higher-dimensional smoothing is different from one-dimensional smoothing. In the one-dimensional case we think of some kind of averaging in a very small neighbourhood. In higher-dimensional regression the neighbourhoods are often very large, and the nonparametric character consists in the data being weighted differently at each evaluation point $x$.163

This discussion implies that in higher dimensions we need $h$ to go to zero much more slowly than in the univariate case to reduce the variance of the estimator. This in turn implies that the bias will be larger. Supposing sufficient smoothness of $m(x)$, one could use local polynomials of higher order $p$ to reduce the bias. However, when $\dim(X) = d$ is large, a high order of $p$ can be very inconvenient in practice since the number of (interaction) terms proliferates quickly. This could soon give rise to problems of local multicollinearity in small samples and/or for small bandwidth values. A convenient alternative is to use local linear regression with higher-order kernels for bias reduction.
162
See for instance Fan (2000), Fan and Gijbels (1996) or Härdle (1991).
163
Higher-dimensional nonparametric regression differs from lower-dimensional regression fundamentally in the aspect that data are extremely sparse in high-dimensional spaces and that extrapolation within much larger neighbourhoods (and to empty cells) is necessary.
Let $\alpha$ be a $d$-tuple of non-negative integers and define $|\alpha| = \alpha_1 + \ldots + \alpha_d$ and $v^\alpha = v_1^{\alpha_1} v_2^{\alpha_2} \cdots v_d^{\alpha_d}$. Define the kernel constants
$$\mu_\alpha = \int \cdots \int v^\alpha\, K(v_1, \ldots, v_d)\, dv_1 \cdots dv_d$$
and
$$\nu_\alpha = \int \cdots \int v^\alpha \left(K(v_1, \ldots, v_d)\right)^2 dv_1 \cdots dv_d.$$
The kernel $K$ is of order $\lambda$ if
$$\mu_0 = 1, \qquad \mu_\alpha = 0 \text{ for } 1 \le |\alpha| \le \lambda - 1, \qquad \mu_\alpha \ne 0 \text{ for some } |\alpha| = \lambda.$$
For convenience we also assume that the kernel integrates to one: $\mu_0 = \int K(u)\,du = 1$.

Of particular convenience for multivariate regression problems are product kernels, where the multivariate kernel function $K(v_1, \ldots, v_d)$ is defined as the product of univariate kernel functions
$$K(v_1, \ldots, v_d) = \prod_{l=1}^d \kappa(v_l)$$
and we also define
$$K_h(v_1, \ldots, v_d) = \frac{1}{h^d}\, K(v_1, \ldots, v_d).$$
For such product kernels, higher order kernels are very easy to implement, and it is easy to show that the order of the bias is $O(h^\lambda)$ when a kernel of order $\lambda$ is used.
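For illustration, a Python sketch of a product kernel built from a univariate fourth-order kernel, using the Gaussian-based kernel $\kappa_4(v) = \frac{1}{2}(3 - v^2)\phi(v)$, which integrates to one and has zero second moment (names are mine):

    import numpy as np

    def kappa4(v):
        """Fourth-order Gaussian kernel: (3 - v^2)/2 * phi(v)."""
        phi = np.exp(-0.5 * v ** 2) / np.sqrt(2 * np.pi)
        return 0.5 * (3.0 - v ** 2) * phi

    def product_kernel(V):
        """V: array of shape (n, d) of scaled differences (X_j - x0)/h;
        returns the n product-kernel weights."""
        return np.prod(kappa4(V), axis=1)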

Multivariate local linear regression:   Consider estimation of $m(x_0)$ at a location $x_0$. Define the regressor vectors $\mathsf{X}_j = \left(1, \left(\frac{X_j - x_0}{h}\right)'\right)'$ and the matrices $\mathsf{X} = (\mathsf{X}_1, \mathsf{X}_2, \ldots, \mathsf{X}_n)'$ and $\mathsf{K} = \mathrm{diag}(K_1, K_2, \ldots, K_n)$. Since $m(x_0)$ is estimated by a weighted least squares regression, we can write the solution as
$$\hat m(x_0) = e_1' \left(\mathsf{X}'\mathsf{K}\mathsf{X}\right)^{-1} \sum_{j=1}^n \mathsf{X}_j K_j Y_j = e_1' \left(\mathsf{X}'\mathsf{K}\mathsf{X}\right)^{-1} \sum_{j=1}^n \mathsf{X}_j K_j \left(Y_j - m_j + m_j\right)$$
where $e_1$ is a column vector of zeros with first element being one and $m_j = m(X_j)$. A series expansion gives
$$= e_1' \left(\mathsf{X}'\mathsf{K}\mathsf{X}\right)^{-1} \sum_{j=1}^n \mathsf{X}_j K_j \left(Y_j - m_j\right) + e_1' \left(\mathsf{X}'\mathsf{K}\mathsf{X}\right)^{-1} \sum_{j=1}^n \mathsf{X}_j K_j \left(m(x_0) + \frac{\partial m(x_0)}{\partial x}'(X_j - x_0) + \frac{1}{2}\,(X_j - x_0)' \frac{\partial^2 m(x_0)}{\partial x \partial x'}\,(X_j - x_0) + R_j\right)$$
where $\frac{\partial m(x_0)}{\partial x}$ is the $d \times 1$ vector of first derivatives, $\frac{\partial^2 m(x_0)}{\partial x \partial x'}$ the $d \times d$ matrix of second derivatives, and $R_j$ the remainder term of all third order derivatives multiplied with the respective third order interaction terms of $X_j - x_0$. Since $K_j$ has bounded support, the remainder term premultiplied with $K_j$ is of order $O(K_j h^3)$. We thus obtain after some derivations that
$$= e_1' \left(\mathsf{X}'\mathsf{K}\mathsf{X}\right)^{-1} \sum_{j=1}^n \mathsf{X}_j K_j \left(Y_j - m_j\right) + m(x_0) + e_1' \left(\mathsf{X}'\mathsf{K}\mathsf{X}\right)^{-1} \sum_{j=1}^n \mathsf{X}_j K_j\, \frac{1}{2}\,(X_j - x_0)' \frac{\partial^2 m(x_0)}{\partial x \partial x'}\,(X_j - x_0) + O\left(h^3\right) \qquad (116)$$
where we can now enter the expression (118).

Denominator of the local linear estimator   Under the assumption that $nh^d \to \infty$ and $h \to 0$, one can show that for a kernel of order $\lambda$:
$$\frac{1}{n}\left(\mathsf{X}'\mathsf{K}\mathsf{X}\right) = \frac{1}{n} \sum_{j=1}^n \mathsf{X}_j \mathsf{X}_j' \prod_{l=1}^d \frac{1}{h}\, \kappa\left(\frac{X_{jl} - x_l}{h}\right)$$
$$= \begin{bmatrix}
f(x_0) + O(h^\lambda) & h^{\lambda-1} \frac{\mu_\lambda}{(\lambda-1)!} \frac{\partial^{\lambda-1} f(x_0)}{\partial x_1^{\lambda-1}} + O(h^\lambda) & \cdots \\
h^{\lambda-1} \frac{\mu_\lambda}{(\lambda-1)!} \frac{\partial^{\lambda-1} f(x_0)}{\partial x_1^{\lambda-1}} + O(h^\lambda) & h^{\lambda-2} \frac{\mu_\lambda}{(\lambda-2)!} \frac{\partial^{\lambda-2} f(x_0)}{\partial x_1^{\lambda-2}} + h^{\lambda-1} \frac{\mu_{\lambda+1}}{(\lambda-1)!} \frac{\partial^{\lambda-1} f(x_0)}{\partial x_1^{\lambda-1}} + O(h^\lambda) & O(h^{2\lambda-2}) \\
\vdots & O(h^{2\lambda-2}) & \ddots
\end{bmatrix} \qquad (117)$$
with the entries for the remaining coordinates $x_2, \ldots, x_d$ analogous. This can be shown element-wise via mean square convergence. Only the derivations for the (2,2) element are shown here, the derivations for the other elements being analogous. Consider the (2,2) element of $\frac{1}{n}\left(\mathsf{X}'\mathsf{K}\mathsf{X}\right)$ and denote it by
$$\gamma = \frac{1}{nh^d} \sum_{j=1}^n \left(\frac{X_{j1} - x_1}{h}\right)^2 \prod_{l=1}^d \kappa\left(\frac{X_{jl} - x_l}{h}\right),$$
which has the expected value
$$E[\gamma] = \frac{1}{h^d} \int \cdots \int \left(\frac{X_{j1} - x_1}{h}\right)^2 \prod_{l=1}^d \kappa\left(\frac{X_{jl} - x_l}{h}\right) f(X_j)\, dX_j.$$
With the change of variables $u_l = \frac{X_{jl} - x_l}{h}$ and $u = (u_1, \ldots, u_d)'$, a Taylor series expansion, and noting that $\kappa$ is a kernel of order $\lambda$, we obtain
$$E[\gamma] = \int \cdots \int u_1^2 \prod_{l=1}^d \kappa(u_l)\, f(x_0 + uh)\, du$$
$$= \int \cdots \int u_1^2 \prod_{l=1}^d \kappa(u_l) \left(\cdots + \frac{u_1^{\lambda-2} h^{\lambda-2}}{(\lambda-2)!} \frac{\partial^{\lambda-2} f(x_0)}{\partial x_1^{\lambda-2}} + \frac{u_1^{\lambda-1} h^{\lambda-1}}{(\lambda-1)!} \frac{\partial^{\lambda-1} f(x_0)}{\partial x_1^{\lambda-1}} + O(h^\lambda)\right) du$$
$$= h^{\lambda-2}\, \frac{\mu_\lambda}{(\lambda-2)!} \frac{\partial^{\lambda-2} f(x_0)}{\partial x_1^{\lambda-2}} + h^{\lambda-1}\, \frac{\mu_{\lambda+1}}{(\lambda-1)!} \frac{\partial^{\lambda-1} f(x_0)}{\partial x_1^{\lambda-1}} + O(h^\lambda)$$
by bounded convergence.
To show convergence in mean square, it also needs to be shown that $Var(\gamma)$ converges to zero:
$$Var(\gamma) = \frac{1}{n^2 h^{2d}} \sum_{j=1}^n Var\left(\left(\frac{X_{j1} - x_1}{h}\right)^2 \prod_{l=1}^d \kappa\left(\frac{X_{jl} - x_l}{h}\right)\right)$$
$$= \frac{1}{nh^{2d}}\, E\left[\left(\frac{X_{j1} - x_1}{h}\right)^4 \prod_{l=1}^d \kappa^2\left(\frac{X_{jl} - x_l}{h}\right)\right] - \frac{1}{nh^{2d}} \left(E\left[\left(\frac{X_{j1} - x_1}{h}\right)^2 \prod_{l=1}^d \kappa\left(\frac{X_{jl} - x_l}{h}\right)\right]\right)^2$$
$$= \frac{h^d}{nh^{2d}} \int u_1^4 \prod_{l=1}^d \kappa^2(u_l)\, f(x_0 + uh)\, du - \frac{1}{nh^{2d}} \left(h^d \int u_1^2 \prod_{l=1}^d \kappa(u_l)\, f(x_0 + uh)\, du\right)^2$$
$$= O\left(\frac{1}{nh^d}\right) - O\left(\frac{h^{2\lambda-4}}{n}\right),$$
by bounded convergence and Taylor series expansion. As it has been assumed that $nh^d \to \infty$, the variance of $\gamma$ converges to zero. Hence, mean square convergence has been shown, which implies convergence in probability by Chebyshev's inequality.

From (117) one can derive after some tedious calculations that
$$e_1' \left(\frac{1}{n}\,\mathsf{X}'\mathsf{K}\mathsf{X}\right)^{-1} = \frac{1}{f(x_0) + O(h)} \begin{pmatrix}
1 + h^{\lambda}\, \frac{(\lambda-2)!\, \mu_\lambda}{\left((\lambda-1)!\right)^2} \sum\limits_{l=1}^d \left(\frac{\partial^{\lambda-1} f(x_0)}{\partial x_l^{\lambda-1}}\right)^2 \Big/ \frac{\partial^{\lambda-2} f(x_0)}{\partial x_l^{\lambda-2}} \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1} f(x_0)}{\partial x_1^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2} f(x_0)}{\partial x_1^{\lambda-2}} \\
\vdots \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1} f(x_0)}{\partial x_d^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2} f(x_0)}{\partial x_d^{\lambda-2}}
\end{pmatrix}' + O(h^2)$$
$$= \frac{1}{f(x_0)} \begin{pmatrix}
1 + O(h) \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1} f(x_0)}{\partial x_1^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2} f(x_0)}{\partial x_1^{\lambda-2}} + O(h^2) \\
\vdots \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1} f(x_0)}{\partial x_d^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2} f(x_0)}{\partial x_d^{\lambda-2}} + O(h^2)
\end{pmatrix}' \qquad (118)$$
which can now be entered in (116).


Bandwidth selection   The behaviour of all nonparametric regression estimators depends on the proper choice of a bandwidth value. Asymptotic rates of convergence are of little guidance for choosing the bandwidth for a particular dataset. Generally, one may distinguish between a global bandwidth, i.e. a value of $h$ that is used at every value of $x_0$, and a local bandwidth, where $h_{x_0}$ is chosen differently for different values of $x_0$. In principle, local bandwidth values may lead to a better adaptation of the estimator to the local availability of data. Nevertheless, the computational demand can increase considerably, depending on the method used. A simple implementation of a locally varying bandwidth follows the k-NN approach, where $h_{x_0}$ is chosen such that $k$ data observations are within the range of $x_0$, assuming a kernel with compact support is used.

A disadvantage of local bandwidths for density estimation is that the estimated density may no longer integrate to one, and would need to be re-scaled to ensure this basic property.

A versatile approach to bandwidth selection is cross-validation (Stone 1974). Cross-validation is based on the principle of maximizing the out-of-sample predictive performance. If a quadratic loss function is used to assess the estimation of $m(x)$ at a particular point $x$, a bandwidth value $h$ should be selected to minimize $E[(\hat m(x; h) - m(x))^2]$. If a single bandwidth value is used to estimate the function $m(x)$ at all points $x$, the (global) bandwidth should be chosen to minimize the mean integrated squared error $MISE(h) = E\left[\int \left(\hat m(x; h) - m(x)\right)^2 dx\right]$. Since $m(x)$ is unknown, a computable approximation to minimizing mean integrated squared error is minimizing average squared error
$$\arg\min_h \frac{1}{n} \sum_j \left(Y_j - \hat m(X_j; h)\right)^2. \qquad (119)$$
However, minimizing average squared error leads to the selection of too small bandwidth values. For example, if a compact kernel is used and $h$ is very small, the local neighbourhood of $X_j$ would contain only the observation $(Y_j, X_j)$. As the estimate $\hat m(X_j)$ is a weighted average of the $Y$ observations in the neighbourhood, the estimate of $m(X_j)$ would just be $Y_j$. Hence, (119) would be minimized by a bandwidth value $h$ close to zero. To avoid underestimating the optimal bandwidth, the observation $(Y_j, X_j)$ should be excluded from the sample when estimating $m(X_j)$. The corresponding estimate $\hat m_{-j}(X_j)$ is called the leave-one-out estimate and represents the out-of-sample prediction from the sample $\{(Y_l, X_l)\}_{l \ne j}$ at $X_j$. The resulting cross-validation function is defined as
$$CV(h) = \sum_j \left(Y_j - \hat m_{-j}(X_j; h)\right)^2, \qquad (120)$$
and $h$ is chosen to minimize (120). For properties of cross-validation bandwidth selection see Härdle and Marron (1987).
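A minimal Python sketch of the criterion (120) for the Nadaraya-Watson estimator (illustrative names, Gaussian kernel, grid search over $h$):

    import numpy as np

    def nw(x0, X, Y, h):
        K = np.exp(-0.5 * ((X - x0) / h) ** 2)
        return np.sum(K * Y) / np.sum(K)

    def cv(h, X, Y):
        """Leave-one-out cross-validation criterion CV(h) of (120)."""
        idx = np.arange(len(X))
        return sum((Y[j] - nw(X[j], X[idx != j], Y[idx != j], h)) ** 2
                   for j in range(len(X)))

    # pick the minimizer of CV(h) over a bandwidth grid:
    # h_cv = min(np.linspace(0.05, 0.5, 20), key=lambda h: cv(h, X, Y))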
Instead of out-of-sample prediction validation, the average squared error criterion (119) could be modified to correct the downward bias by 'penalizing' very small bandwidth values. Such criteria are similar in spirit to the 'in-sample' model selection criteria in parametric regression, which seek to account for the degrees of freedom by penalizing models with a large number of coefficients.

A variety of penalized cross-validation criteria for bandwidth selection have been proposed, including Akaike's (1970) information criterion, Shibata's (1981) model selector and Rice's (1984) bandwidth selector. Widely used is generalized cross-validation. A linear smoother for the data points $\mathbf{Y}_N = (Y_1, \ldots, Y_N)'$ can be written as $(\hat Y_1, \ldots, \hat Y_N)' = A \mathbf{Y}_N$, where $A$ is the $N \times N$ hat matrix. Letting $a_{jj}$ denote the $jj$-th element of $A$, the generalized cross-validation criterion is
$$GCV(h) = \frac{1}{n}\, \frac{\left\|(I - A)\,\mathbf{Y}_N\right\|^2}{\left(\frac{1}{n}\,\mathrm{tr}(I - A)\right)^2} = \frac{\frac{1}{n}\sum_j \left(Y_j - \hat m(X_j; h)\right)^2}{\left(\frac{1}{n}\sum_j \left(1 - a_{jj}\right)\right)^2},$$
which does not require computing the leave-one-out estimates.
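For a kernel smoother the hat matrix is available in closed form, so GCV is cheap to compute; an illustrative Python sketch (assuming Nadaraya-Watson weights):

    import numpy as np

    def gcv(h, X, Y):
        """Generalized cross-validation criterion for the NW smoother,
        whose hat matrix has rows a_ij = K((X_i - X_j)/h) / sum_l K((X_i - X_l)/h)."""
        K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)  # Gaussian kernel
        A = K / K.sum(axis=1, keepdims=True)                     # hat matrix
        resid = Y - A @ Y
        n = len(Y)
        return (resid @ resid / n) / (1 - np.trace(A) / n) ** 2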


As an alternative to cross validation, a variety of plug-in bandwidth selectors have been
developed e.g. Fan and Gijbels (1995), Fan, Hall, Martin, and Patil (1996), Ruppert, Sheather,
and Wand (1995), Ruppert (1997) and Schucany (1995). Plug-in bandwidth selectors proceed
by estimating bias and variance at one or a at variety of pilot bandwidth values and choose
a bandwidth to balance squared bias and variance. The idea can be sketched for the one-
dimensional case. One could think of estimating (for a given value of h) the 2 (x), f (x) and
m00 (x) for a number of values of x and compute MISE according to the formulae given above.
Repeating this for di¤erent values of h gives an indication about which bandwidth to choose.164
Plug-in selectors are often used in univariate regression, but are less convenient in multivariate
regression. For a lucid and nicely written discussion of plug-in and cross validation bandwidth-
selection see Loader (1999).
164
.... (@@ This section needs to be further extendend, diskutiere EBBS bandwidht selector.)
Markus Frölich 14. Nonparametric estimation 325

It should be noted that many of these methods usually will not work if (i) either smoothing
also takes place with respect to discrete regressors, as discussed further below, or (ii) if it might
be possible that the local parametric model indeed happens to be correct. In the latter case,
the optimal bandwidths would be in…nitely large, and many bandwidth selectors would then
not be consistent. Cross-validation could still be so, under certain conditions.

Influence function representation   Nonparametric regression estimators are often used as first step estimators in more complex estimators, e.g. of the average treatment effect. We can use the previous derivations, e.g. for the local linear estimator, to obtain an expression of a matching estimator in the form of a double sum to which a U-statistics projection theorem can be applied, as will be discussed in the chapter on semiparametric regression.

When we use matching estimation of the average treatment effect, a common support requirement is needed for identification. If this common support condition is not satisfied over the entire support of $X$, we may restrict the definition of the average treatment effect to the common support region, which we have to estimate. Heckman, Ichimura, and Todd (1998) derived the asymptotic properties of the matching estimator with estimated support region. They impose sufficiently strong conditions on the estimator of the support region, such that this does not affect the asymptotic distribution. A nice by-product of their analysis is an asymptotically linear form of the local polynomial regression estimator with trimming. In addition, they permitted an estimated propensity score. (If interest is in the local linear estimator without any trimming, the previous derivations can be used as they require weaker regularity conditions.) Let $\hat S$ be an estimate of the support of the variable $X$. Heckman, Ichimura, and Todd (1998) showed that the local polynomial regression estimator $\hat m(x)$ can be written in an asymptotically linear form with trimming
$$\left(\hat m(x) - m(x)\right) \cdot 1\left(x \in \hat S\right) = \frac{1}{n} \sum_i \psi(Y_i, X_i; x) + \hat b(x) + \hat R(x),$$
with the properties:
a) $E\left[\psi(Y_i, X_i; X) \,|\, X = x\right] = 0$
b) $\operatorname{plim}_{n \to \infty}\, n^{-\frac{1}{2}} \sum_i \hat b(X_i) < \infty$
c) $n^{-\frac{1}{2}} \sum_i \hat R(X_i) = o_p(1)$.
$\hat S$ is an estimator of the support of $X$ in the source population. $\psi$ is a mean-zero influence function which determines the local variance of the estimate. $\psi$ may depend on the sample size, for example through a bandwidth value that decreases with sample size. The term $\hat b(x)$ represents the local bias and $\hat R(x)$ is a remainder term of lower order. For $X$ one-dimensional and $m$ twice continuously differentiable, the local influence functions are identical at interior points for the local constant and the local linear estimator:
$$\psi(Y_i, X_i; x) = \left(Y_i - m(X_i)\right) \frac{K\left(\frac{X_i - x}{h}\right)}{h\, f(x)},$$
as derived in Fan (1992), Ruppert and Wand (1994) and Heckman, Ichimura, and Todd (1998). Such linearized representations can then be plugged into more complex estimators.

14.2.3 Local parametric regression

As mentioned above, because of the sparsity of data in higher dimensions, large bandwidth values are often required when $\dim(X)$ is large. The reason for this curse of dimensionality is that data are extremely sparse in a high-dimensional regressor space, leading to almost empty neighbourhoods 'almost everywhere', see for instance Silverman (1986, p. 94) or Härdle (1991, Ch. 10). Even if most of the regressors are discrete, e.g. binary, the number of cells will still proliferate quickly, leading to many empty cells. Estimating $m(x)$ will then require extrapolation from observations that are not as nearby as in the low-dimensional regression context. In finite samples, nonparametric regression is then not so much about averages in small local neighbourhoods but rather about different weighting of the data in large neighbourhoods. Consequently, the choice of the parametric hyperplane being used becomes more important, because regression in finite samples will be based substantially on local extrapolation. (I.e. at location $x$ most of the data points might be relatively far away, so that the local model is used for intra- and extrapolation.)165
165
The reason why Nadaraya-Watson (local constant) regression performs poorly might be its limited use of covariate information, which is incorporated only in the distance metric of the kernel function but not in the extrapolation plane. Consider a simple example where only two binary $X$ characteristics are observed, gender (male/female) and professional qualification (skilled/unskilled), and expected wages shall be estimated. Suppose that, for instance, the cell skilled-males contains no observations. The Nadaraya-Watson estimate of the expected wage for skilled male workers would be a weighted average of the observed wages for unskilled male, skilled female and unskilled female workers, and would thus probably be even lower than the expected wage for unskilled male and for skilled female workers, which is in contrast to economic reality. If the a priori beliefs sustain that skilled workers earn higher wages than unskilled workers and that male workers earn higher wages than female workers, then a monotonic 'additive' extrapolation would be more adequate than simply averaging the observations in the neighbourhood (even when down-weighting more distant observations). Under these circumstances a linear extrapolation, e.g. in the form of local linear regression, would be more appropriate, which would add up the gender wage difference and the wage increment due to the skill level to estimate the expected wage for skilled male workers. Although the linear specification is not true, it is still closer to the true shape than the flat extrapolation plane of Nadaraya-Watson regression. Here, a priori information from economic theory becomes useful for selecting a suitable parametric hyperplane that allows the incorporation of covariate information more thoroughly, to obtain more precise extrapolations.

Local parametric estimation proceeds by first specifying a parametric class of functions
$$g(x; \theta_x) \qquad (121)$$
where the function $g$ is known but the coefficients $\theta_x$ are unknown, and fitting this local model to the data in a neighbourhood about $x$. The estimate of $m(x)$ is then calculated as
$$\hat m(x) = g(x; \hat\theta_x).$$
The function $g$ should be chosen according to the properties of the outcome variable $Y$. If $Y$ is binary or takes only values between 0 and 1, a local logit specification would be appealing:
$$g(x; \theta_x) = \frac{1}{1 + e^{-\theta_{0,x} - x'\theta_{1,x}}},$$
where $\theta_{0,x}$ refers to the constant and $\theta_{1,x}$ to the other coefficients corresponding to the regressors in $x$. This local logit specification has the advantage vis-à-vis local linear regression that all the estimated values $\hat m(x)$ are by definition between 0 and 1. Furthermore, it may also help to reduce the very high variability of local linear regression in finite samples: whereas the variance of the local linear estimator is unbounded, the variance of the local logit estimator is bounded. The function $g$ could also be chosen to incorporate other properties that one might expect of the true function $m$, such as convexity or monotonicity. These properties, however, apply only locally when fitting the function $g$ at location $x$. It does not imply that the estimates $\hat m(x)$ are convex or monotone. The reason for this is that the coefficients $\theta_x$ are re-estimated at every location $x$: for two different values $x_1$ and $x_2$ the function estimates are $g(x_1; \hat\theta_{x_1})$ and $g(x_2; \hat\theta_{x_2})$, where not only $x$ changes but also $\hat\theta_x$.

Note that when one is interested e.g. in the first derivative, there are two different ways to estimate it: either as $\partial \hat m(x)/\partial x$ or from inside the model as $\partial g(x; \hat\theta_x)/\partial x$. These are different estimators and may have different properties. E.g. when a local logit model is used, the first derivative $\partial g(x; \hat\theta_x)/\partial x$ is bounded, since the derivative of the logistic function is always between 0 and 0.25, whereas $\partial \hat m(x)/\partial x$ is not restricted and can take any value between $-\infty$ and $\infty$.
One should also note that the local coefficients $\theta_x$ may not be uniquely identified, although $g(x; \hat\theta_x)$ may still be. E.g. if some of the regressors are collinear, $\theta_x$ is not unique, but all solutions lead to the same value of $g(x; \hat\theta_x)$. See the discussion in Gozalo and Linton (2000).

There are several different ways to estimate the local model. Local least squares regression estimates the local coefficients $\theta_x$ as
$$\hat\theta_x = \arg\min_{\theta_x} \sum_{i=1}^n \left(Y_i - g(X_i; \theta_x)\right)^2 K(X_i - x). \qquad (122)$$
Local least squares is embedded in the class of local likelihood estimation, which estimates $\hat\theta_x$ as
$$\hat\theta_x = \arg\max_{\theta_x} \sum_{i=1}^n \ln L\left(Y_i; g(X_i; \theta_x)\right) K(X_i - x), \qquad (123)$$
where $\ln L\left(Y_i; g(X_i; \theta_x)\right)$ is the log-likelihood contribution of observation $(Y_i, X_i)$. For the log-likelihood approach one has to specify a likelihood function in addition to the local model. The likelihood function entails the conjectured properties of the local error term. For example, if the likelihood function for a normally distributed error is used, the likelihood function in (123) is identical to least squares (122). If $Y$ is binary, the likelihood for Bernoulli random variables is more appropriate. The asymptotic nonparametric results are usually the same for both approaches, but the finite sample performance is better the closer the proposed model is to the true data generating process. The bandwidth $h$ determines the local neighbourhood of the kernel weighting. If $h$ converges to infinity, the local neighbourhood widens and the local estimator converges to the global parametric estimator. In this sense, each parametric model can be nested in a corresponding nonparametric one.
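A minimal Python sketch of local logit estimation via (123), maximizing the kernel-weighted Bernoulli log-likelihood with a generic numerical optimizer from scipy (all function names are illustrative; a one-dimensional $X$ and Gaussian kernel are assumed):

    import numpy as np
    from scipy.optimize import minimize

    def local_logit(x_eval, X, Y, h):
        """Estimate m(x_eval) = P(Y=1 | X = x_eval) by a locally fitted logit."""
        K = np.exp(-0.5 * ((X - x_eval) / h) ** 2)          # kernel weights
        D = np.column_stack([np.ones_like(X), X - x_eval])  # local design (1, X - x)

        def neg_loglik(theta):
            z = D @ theta
            # Bernoulli log-likelihood Y*z - log(1 + e^z), numerically stable
            return np.sum(K * (np.logaddexp(0.0, z) - Y * z))

        theta_hat = minimize(neg_loglik, np.zeros(2), method="BFGS").x
        return 1.0 / (1.0 + np.exp(-theta_hat[0]))          # Lambda(a_hat)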
Although similar in appearance, local and global parametric estimation rest on different motivations. Global parametric regression assumes that the shape of the conditional expectation function is known and correctly specified. Local parametric regression, on the other hand, imposes the $g(\cdot)$ function merely as a device for more stable extrapolation in finite samples. Indeed, not even all coefficients $\theta_x$ of $g(\cdot)$ need to be identified for the estimation of $m(x)$, see Gozalo and Linton (2000).
Local least squares (122) and local likelihood (123) can be computed by setting the first derivative to zero. Therefore, they can also be written as
$$\sum_{i=1}^n \psi\left(Y_i; g(X_i; \theta_x)\right) K(X_i - x) = 0, \qquad (124)$$
for some function $\psi$ that is defined by the first order condition. This can thus also be embedded in the framework of local estimating equations (Carroll, Ruppert, and Welsh 1998), which can be extended to local GMM estimation for more complex estimation setups.

Local least squares, local likelihood and local estimating equations are essentially equivalent approaches. However, local least squares and local likelihood have the practical advantage over local estimating equations that they can distinguish between multiple optima of the objective function through their objective function value, whereas local estimating equations would treat them all alike.
An interesting result of Gozalo and Linton (2000) and Carroll, Ruppert, and Welsh (1998) is that the asymptotic theory is quite similar to the results for local polynomial regression. The asymptotic variance of $\hat m(x)$ is independent (to first order) of the parametric model used and thus the same as for the Nadaraya-Watson and local linear estimator. The bias is
$$E\left[\hat m(x) - m(x)\right] = \frac{1}{2}\, h^2 \mu_2 \left(m''(x) - g''(x; \theta_x)\right), \qquad (125)$$
where $g''(x; \theta_x)$ is evaluated at that value of $\theta_x$ that satisfies (124) or maximizes (123) or minimizes (122), depending on which objective function has been optimized. Hence, the bias is of order $h^2$ as for the local linear estimator. In contrast to the linear case, this is an asymptotic result. (The result that the exact finite sample bias is of order $h^2$ is no longer true in nonlinear models.) In addition, the bias is no longer proportional to $m''$ but rather to $m'' - g''$. When the local model is linear, $g''$ is zero and the result is the one we obtained for local linear regression. If we use a different local model, the bias will be smaller than with local linear regression if
$$\left|m''(x) - g''(x; \theta_x)\right| < \left|m''(x)\right|.$$
Hence, even if we pursue a nonparametric approach, prior knowledge of the shape of the local regression is helpful. If our prior assumptions are correct, the bias will be smaller. If they are wrong, the estimator is still consistent but with a larger bias.

Note that this discussion was rather loose, since some further restrictions are required on the local parametric model. If e.g. the local model were a local constant, then we should obtain the result for Nadaraya-Watson regression, such that (125) cannot apply. Roughly speaking, (125) applies if the number of coefficients in $g$ is the same as the number of regressors in $X$ plus one (for the constant).
When comparing local linear to local polynomial regression, we observed that the bias decreased and depended on higher order derivatives of $m(x)$. A similar extension is possible here as well, where we could use the local logit hyperplane with squared and interaction terms:
$$g(x; \theta_{0,x}, \theta_{1,x}, \theta_{2,x}) = \frac{1}{1 + e^{-\theta_{0,x} - x'\theta_{1,x} - x'\theta_{2,x} x}}.$$
The asymptotic bias expression is then similar to that of local quadratic regression and also of order $h^3$, containing the term $m'''(x) - g'''(x; \theta_x)$.

An alternative, and perhaps more convenient, approach to bias reduction is again based on higher order kernels, which for local likelihood estimation and product kernels is shown in the following.

Local logit estimation

Suppose $\dim(X) = d$ and that the kernel function is a product kernel of order $\lambda$. Define the log-likelihood function for local logit regression at a location $x_0$ as
$$\ln L_n(x_0; a, b) = \frac{1}{n} \sum_{j=1}^n \left[Y_j \ln \Lambda\left(a + b'(X_j - x_0)\right) + (1 - Y_j) \ln\left(1 - \Lambda\left(a + b'(X_j - x_0)\right)\right)\right] K_j$$
where $\Lambda(x) = \frac{1}{1 + e^{-x}}$. We will denote the derivatives of $\Lambda(x)$ by $\Lambda'(x)$, $\Lambda''(x)$, $\Lambda^{(3)}(x)$ etc. and also note that $\Lambda'(x) = \Lambda(x)\left(1 - \Lambda(x)\right)$. Let $\hat a$ and $\hat b$ be the maximizers of $\ln L_n(x_0; a, b)$, and $a_0$ and $b_0$ the values that maximize the expected value of the likelihood function $E\left[\ln L_n(x_0; a, b)\right]$. Note that we are interested only in $\hat a$, and include $\hat b$ only to appeal to the well known properties that local likelihood or local estimating equations perform better if more than a constant term is included in the local approximation (see e.g. Fan and Gijbels (1996) and Carroll, Ruppert, and Welsh (1998)). We estimate $m(x_0)$ by $\hat m(x_0) = \Lambda(\hat a)$. For clarity we may also write $\hat m(x_0) = \Lambda(\hat a(x_0))$ because the value of $\hat a$ varies for different $x_0$. Similarly, $a_0$ is a function of $x_0$, that is $a_0 = a_0(x_0)$. The same applies to $\hat b(x_0)$ and $b_0(x_0)$. At other times we suppress the dependence to ease notation and to focus attention on the properties at a particular $x_0$.

In the following we will also show that (a0 (x0 )) is identical to m(x0 ) up to an O(h ) term. To
derive this, note that since the likelihood function is globally convex, the maximizers are obtained by
Markus Frölich 14. Nonparametric estimation 331

setting the …rst order conditions to zero. The values of a0 (x0 ) and b0 (x0 ) are thus implicitly de…ned by
the moment conditions
2 0 1 3
1
E 4(Yj (a0 + b00 (Xj x0 ))) @ A Kj 5 = 0
Xj x0

2 0 1 3
1
= E 4(mj (a0 + b00 (Xj x0 ))) @ A Kj 5 = 0. (126)
Xj x0

Now examine only the …rst moment condition to obtain


Z
0 = (m(Xj ) (a0 + b00 (Xj x0 ))) Kj f (Xj )dXj
Z L
Y
= (m(x0 + uh) (a0 + b00 uh)) (ul ) f (x0 + uh)du
l=1

Xj x0
where u = h . Now assuming that m is times di¤erentiable and noting that the kernel is of order
we obtain by Taylor expansion that

(m(x0 ) (a0 )) f (x0 ) + O(h ) = 0

hence

m(x0 ) = (a0 ) + O(h ).

Combining this with the previous results we thus have obtained an expression for m
^ (x0 ) m(x0 )

m(x
^ 0) m(x0 ) = (^
a(x0 )) (a0 (x0 )) + O(h )

and by Taylor expansion of (a^) which converges to (a0 )

0
m(x
^ 0) m(x0 ) = (^
a(x0 ) a0 (x0 )) (a0 (x0 )) (1 + op (1)) + O(h ).

(One could also explicitly consider the second order term, but for sake of brevity we omit this here.)

By entering (127) and (129) we obtain

1 Xn 0 0 o 1
= 0
(a0 (x0 )) e01 ( 0 Xj ) + 00 ( 00 Xj )Xj ( ^ 0 ) 0
+ Op (jj ^ 0 jj2
Xj X0j Kj
n
1X
(Yj mj + mj (a0 + b00 (Xj x0 ))) Kj Xj (1 + op (1)) + O(h )
n
where we defined $\theta = (a, h b')'$ and $\mathsf{X}_j = \left(1, \left(\frac{X_j - x_0}{h}\right)'\right)'$, to obtain
$$= \frac{1}{f(x_0)} \begin{pmatrix}
1 \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1}\left(\Lambda' f(x_0)\right)}{\partial x_1^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2}\left(\Lambda' f(x_0)\right)}{\partial x_1^{\lambda-2}} \\
\vdots \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1}\left(\Lambda' f(x_0)\right)}{\partial x_d^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2}\left(\Lambda' f(x_0)\right)}{\partial x_d^{\lambda-2}}
\end{pmatrix}' \frac{1}{n} \sum \left(Y_j - m_j + m_j - \Lambda\left(a_0 + b_0'(X_j - x_0)\right)\right) K_j \mathsf{X}_j \left(1 + o_p(1)\right) + O(h^\lambda),$$
where the shortcut notation $\partial^s\left(\Lambda' f(x_0)\right)/\partial x_l^s$ is defined in (128).

Properties of $\hat a$   Now we need to examine $\hat a$ in more detail. Define first $\theta = (a, h b')'$ and $\mathsf{X}_j = \left(1, \left(\frac{X_j - x_0}{h}\right)'\right)'$. The first order condition of the estimator is given by
$$0 = \frac{1}{n} \sum \left(Y_j - \Lambda(\hat\theta' \mathsf{X}_j)\right) K_j \mathsf{X}_j$$
$$= \frac{1}{n} \sum \left(Y_j - \Lambda(\theta_0' \mathsf{X}_j) - \Lambda'(\theta_0' \mathsf{X}_j)\left(\hat\theta - \theta_0\right)' \mathsf{X}_j - \Lambda''(\theta_0' \mathsf{X}_j)\left(\hat\theta - \theta_0\right)' \mathsf{X}_j \mathsf{X}_j'\left(\hat\theta - \theta_0\right) - O_p\left(\|\hat\theta - \theta_0\|^3\right)\right) K_j \mathsf{X}_j$$
by Taylor expansion. Further,
$$\hat\theta - \theta_0 = \left(\frac{1}{n} \sum \left\{\Lambda'(\theta_0' \mathsf{X}_j) + \Lambda''(\theta_0' \mathsf{X}_j)\, \mathsf{X}_j'\left(\hat\theta - \theta_0\right) + O_p\left(\|\hat\theta - \theta_0\|^2\right)\right\} \mathsf{X}_j \mathsf{X}_j' K_j\right)^{-1} \frac{1}{n} \sum \left(Y_j - \Lambda(\theta_0' \mathsf{X}_j)\right) K_j \mathsf{X}_j.$$
As we are only interested in $\hat a$ and not in $\hat b$, we write
$$\hat a - a_0 = e_1' \left(\frac{1}{n} \sum \left\{\Lambda'(\theta_0' \mathsf{X}_j) + \Lambda''(\theta_0' \mathsf{X}_j)\, \mathsf{X}_j'\left(\hat\theta - \theta_0\right) + O_p\left(\|\hat\theta - \theta_0\|^2\right)\right\} \mathsf{X}_j \mathsf{X}_j' K_j\right)^{-1} \frac{1}{n} \sum \left(Y_j - \Lambda\left(a_0 + b_0'(X_j - x_0)\right)\right) K_j \mathsf{X}_j. \qquad (127)$$

Denominator for local logit   We start with an approximation to the term
$$\frac{1}{n} \sum \left\{\Lambda'(\theta_0' \mathsf{X}_j) + \Lambda''(\theta_0' \mathsf{X}_j)\, \mathsf{X}_j'\left(\hat\theta - \theta_0\right) + O_p\left(\|\hat\theta - \theta_0\|^2\right)\right\} \mathsf{X}_j \mathsf{X}_j' K_j.$$
Under the assumption that $nh^d \to \infty$ and $h \to 0$, which implies consistency of $\hat a$ and $\hat b$, one can show that for a kernel of order $\lambda$
$$= \begin{bmatrix}
f(x_0)\, \Lambda'(a_0) & h^{\lambda-1} \frac{\mu_\lambda}{(\lambda-1)!} \frac{\partial^{\lambda-1}\left(\Lambda' f(x_0)\right)}{\partial x_1^{\lambda-1}} & \cdots \\
h^{\lambda-1} \frac{\mu_\lambda}{(\lambda-1)!} \frac{\partial^{\lambda-1}\left(\Lambda' f(x_0)\right)}{\partial x_1^{\lambda-1}} & h^{\lambda-2} \frac{\mu_\lambda}{(\lambda-2)!} \frac{\partial^{\lambda-2}\left(\Lambda' f(x_0)\right)}{\partial x_1^{\lambda-2}} & 0 \\
\vdots & 0 & \ddots
\end{bmatrix} \left(1 + o_p(1)\right)$$
where $\partial^s\left(\Lambda' f(x_0)\right)/\partial x_l^s$ is a shortcut notation for all the cross derivatives of $\Lambda'$ and $f(x_0)$:
$$\frac{\partial^s\left(\Lambda' f(x_0)\right)}{\partial x_l^s} \equiv \sum_{r=0}^{s} \Lambda^{(r+1)}\left(a_0(x_0)\right)\, \frac{\partial^{s-r} f(x_0)}{\partial x_l^{s-r}}. \qquad (128)$$
The derivations are similar to those for the local linear estimator and are omitted here. An additional complication compared to the derivations for the local linear estimator are the second order terms, which however are all of lower order when $(\hat a - a_0)$ and $(\hat b - b_0)$ are $o_p(1)$.

Similarly to the derivations for the local linear estimator we can now derive
$$e_1' \left(\frac{1}{n} \sum \left\{\Lambda'(\theta_0' \mathsf{X}_j) + \Lambda''(\theta_0' \mathsf{X}_j)\, \mathsf{X}_j'\left(\hat\theta - \theta_0\right) + O_p\left(\|\hat\theta - \theta_0\|^2\right)\right\} \mathsf{X}_j \mathsf{X}_j' K_j\right)^{-1}$$
$$= \frac{1}{f(x_0)\, \Lambda'\left(a_0(x_0)\right)} \begin{pmatrix}
1 \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1}\left(\Lambda' f(x_0)\right)}{\partial x_1^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2}\left(\Lambda' f(x_0)\right)}{\partial x_1^{\lambda-2}} \\
\vdots \\
- h\, \frac{(\lambda-2)!}{(\lambda-1)!}\, \frac{\partial^{\lambda-1}\left(\Lambda' f(x_0)\right)}{\partial x_d^{\lambda-1}} \Big/ \frac{\partial^{\lambda-2}\left(\Lambda' f(x_0)\right)}{\partial x_d^{\lambda-2}}
\end{pmatrix}' \left(1 + o_p(1)\right) \qquad (129)$$

14.2.4 Combination of discrete and continuous regressors

Many econometric applications contain both continuous and discrete explanatory variables. Whereas both types of regressors can easily be incorporated in the parametric specification $g(\cdot)$ in (121) through the choice of an appropriate function, they also need to be accommodated in the distance metric of the kernel function $K(X_i - x)$ defining the local neighbourhood. Building on the work of Aitchison and Aitken (1976), Racine and Li (2004) developed a hybrid product kernel that coalesces continuous and discrete regressors. They distinguish three types of regressors: continuous, discrete with natural ordering (number of children) and discrete without natural ordering (bus, train, car). Suppose that the variables in $X$ are arranged such that the first $d_1$ regressors are continuous, the regressors $d_1 + 1, \ldots, d_2$ are discrete with natural ordering and the remaining $d - d_2$ regressors are discrete without natural ordering. Then the kernel weights $K(X_i - x)$ are computed as
$$K_{h,\rho,\gamma}(X_i - x) = \prod_{q=1}^{d_1} \kappa\left(\frac{X_{q,i} - x_q}{h}\right) \prod_{q=d_1+1}^{d_2} \rho^{|X_{q,i} - x_q|} \prod_{q=d_2+1}^{d} \gamma^{1(X_{q,i} \ne x_q)}, \qquad (130)$$
where $X_{q,i}$ and $x_q$ denote the $q$-th element of $X_i$ and $x$, respectively, $1(\cdot)$ denotes the indicator function, $\kappa$ is a symmetric univariate weighting function and $h$, $\rho$ and $\gamma$ are positive bandwidth parameters with $0 \le \rho, \gamma \le 1$. This kernel function $K_{h,\rho,\gamma}(X_i - x)$ measures the distance between $X_i$ and $x$ through three components: The first term is the standard product kernel for continuous regressors, with $h$ defining the size of the local neighbourhood. The second term measures the distance between the ordered discrete regressors and assigns geometrically declining weights to more unlike observations. The third term measures the mismatch between the unordered discrete regressors. $\rho$ controls the amount of smoothing for the ordered and $\gamma$ for the unordered discrete regressors. For example, the multiplicative weight contribution of the last regressor is 1 if the last elements of $X_i$ and $x$ are identical and $\gamma$ if they are different. The larger $\rho$ and/or $\gamma$, the more smoothing takes place with respect to the discrete regressors. If $\rho$ and $\gamma$ are both 1, the discrete regressors do not affect the kernel weights and the nonparametric estimator 'smooths globally' over the discrete regressors. On the other hand, if $\rho$ and $\gamma$ are both zero, smoothing proceeds only within each of the cells defined by the discrete regressors but not between them. If, further, $X$ contained no continuous regressors, this would correspond to the frequency estimator, where $Y$ is estimated by the average of the observations within each cell. Any values between 0 and 1 for $\rho, \gamma$ thus correspond to some smoothing over the discrete regressors. By noting that
$$\prod_q \gamma^{1(X_{q,i} \ne x_q)} = \gamma^{\sum_q 1(X_{q,i} \ne x_q)},$$
the weight contribution of the unordered discrete regressors depends only on the number of regressors that are distinct between $X_i$ and $x$. Racine and Li (2004) analyzed Nadaraya-Watson regression based on this hybrid kernel and derived its asymptotic distribution for bandwidths selected by cross-validation.
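A minimal Python sketch of the hybrid kernel (130), assuming Gaussian weights for the continuous part ($X_i$ and $x$ are split into continuous, ordered discrete and unordered discrete components; names are illustrative):

    import numpy as np

    def hybrid_kernel(Xi, x, d1, d2, h, rho, gamma):
        """Kernel weight between observation Xi and evaluation point x."""
        cont = np.exp(-0.5 * ((Xi[:d1] - x[:d1]) / h) ** 2).prod()
        ordered = rho ** np.abs(Xi[d1:d2] - x[d1:d2]).sum()      # geometric decline
        unordered = gamma ** (Xi[d2:] != x[d2:]).sum()           # count of mismatches
        return cont * ordered * unordered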
In principle, instead of using only three bandwidth values $h, \rho, \gamma$ for all regressors, a different bandwidth could be employed for each regressor. But this would substantially increase the computational burden of bandwidth selection and might lead to additional noise due to estimating these bandwidth parameters. Nevertheless, groups of similar regressors could be formed, with each group assigned a separate bandwidth parameter, if the explanatory variables are deemed too distinct. Particularly if the ranges assumed by the ordered discrete variables vary considerably, those variables that take on many different values should be separated from those with only a few values.

A more practical solution is to apply the same kernel function $\kappa$ also to the ordered discrete regressors and to rotate the continuous and ordered discrete regressors together such that they are orthonormal, i.e. have mean zero, variance one and zero covariances. There is no convincing reason why geometrically declining kernel weights should provide a better weighting function than the kernel $\kappa$. Hence, in practice, instead of (130) the kernel function
$$K_{h,\gamma}(X_i - x) = \prod_{q=1}^{d_1} \kappa\left(\frac{X_{q,i} - x_q}{h}\right) \prod_{q=d_1+1}^{d} \gamma^{1(X_{q,i} \ne x_q)} \qquad (131)$$
can be used, where the regressors $1, \ldots, d_1$ now contain the continuous and the ordered discrete regressors.

An important aspect in practice is that unordered discrete regressors should enter the local model (121) differently than the kernel function (131) if the same value of $\gamma$ is used for all these unordered regressors. As an example, suppose we have two unordered discrete regressors: gender and region, where region takes values in {1=North, 2=South, 3=East, 4=West, 5=North-East, 6=North-West, 7=South-East, 8=South-West}. The dummy variable gender would enter as a regressor in (121) and also in (131). The situation with region is more difficult, though. First, using region as such as a regressor in (121) makes no sense because the values 1 to 8 have no cardinal meaning. Instead one would use 7 dummy variables for the different regions in (121). However, in the kernel function (131) one would use the single regressor region and not the 7 dummy variables. Because if one were to use the 7 dummy variables also in (131), then whenever two observations $j$ and $i$ live in different regions they would differ on two of the regional dummies (or on one of them, if one of the two regions is the base category), so that the effective kernel weight for region would be $\gamma^2$ rather than $\gamma$, while it is only $\gamma$ for gender. Hence, the implicit bandwidth would be smaller for these dummies than for gender. This would either require using separate bandwidths $\gamma_1, \gamma_2$ for region and gender or a rescaling of the bandwidth by the number of corresponding dummy variables. As a much simpler alternative, one could use the regressor region, taking values in 1..8, in the kernel weighting (131) and the 7 regional dummies in the local model (121).

Local multicollinearity   If the bandwidth values are small and the number of regressors rather large, multicollinearity or near-multicollinearity among the regressors may prevent the estimation of the coefficients $\theta_x$. However, in contrast to conventional parametric regression, in local parametric regression this is a local phenomenon, where the degree of (near-)multicollinearity varies with $x$. Whereas at some locations the conditional mean $m(x)$ might be estimable without any complications, the estimate might be undefined at others. In particular, with a bounded kernel for the continuous elements, observations that are rather distant from $x$ are assigned a kernel weight of zero, and thus the effective number of observations included in the estimation of $m(x)$ depends on $x$. At some $x$ the number of observations with positive weight might be smaller than the number of regressors, or the regressors might be (almost) linearly dependent, or one regressor might be without variation (collinear with the constant term) among the observations with positive weight, rendering the estimation of $\theta_x$ impossible. But even with an unbounded kernel, distant observations receive almost zero weight and the estimated coefficients may be spurious due to numerical inaccuracies in matrix inversion.166

Possible solutions are either to increase the bandwidths locally or to drop locally those regressors that cause collinearity problems, i.e. to restrict their corresponding coefficients in the $g$ function (121) to zero, since local parametric regression with arbitrary coefficient restrictions is still consistent as long as the support of $Y$ is contained in the domain of the restricted $g$ function.167 In the extreme, if all regressors but the constant are restricted to zero, the estimator corresponds to the Nadaraya-Watson local constant estimator. In this sense, dropping regressors 'shrinks' the estimator towards the Nadaraya-Watson estimator.168,169

Bandwidth selection Crucial to the performance of all nonparametric estimators is an ap-


propriate choice of the degree of smoothing. For kernel based regression, the bandwidth values
should decrease to zero with growing sample size. In practice, however, in local parametric re-
gression the optimal bandwidth choice depends not only on the number of observations and
their location, but also on how well the speci…ed function g( ) resembles the true conditional
mean function. If the parametric hyperplane encompasses the true conditional mean function,
the optimal bandwidth values would be in…nity for h and one for and , i.e. correspond-
166
Such invertibility and multicollinearity problems have often been addressed by ridge or shrinkage regression,
see Stein (1981), Judge, Hill, Gri¢ ths, Lütkepohl, and Lee (1982, p. 878 ¤) or Mittelhammer, Judge, and
Miller (2000, Chapter 18.7). Ridge regression is based on shrinking coe¢ cient estimates towards zero (or any
other anchor point) to reduce their variance and mean squared error, at the cost of biased estimated coe¢ cients.
However, this seems not particularly appealing here since the coe¢ cients itself are not of interest but only the
estimate of E[Y jX], which would be biased by ridge regression.
167
Or some combination of these two approaches.
168
As always, the continuous regressors should be rotated to be orthonormal to improve numerical accuracy.
169
For detecting (nearly) linear dependencies in the regressor matrix the pivotal orthogonal-triangular (QR)
decomposition is employed, see Judd (1998, p. 58 f) or Press, Flannery, Teukolsky, and Vetterling (1986, p.
357 ¤). This decomposition decompounds a regressor or moment matrix X into an orthogonal matrix Q and
an upper triangular matrix R, and diagonal elements of R close to zero indicate (nearly) linear dependencies
attributable to the corresponding columns. Here, all regressors associated with a diagonal element in R smaller
5
than 10 are dropped.
Markus Frölich 14. Nonparametric estimation 337

Otherwise the bandwidths should converge to zero with increasing sample size, so as to rely only on the nearby observations. A suitable data-driven bandwidth selector should thus select very large bandwidth values if the (nested) parametric model is correct, and much smaller and decreasing values otherwise.
A versatile approach to bandwidth selection is based on least squares cross-validation,170 where the bandwidth values are chosen to minimize the squared one-out-of-sample prediction error

$$(\hat h, \hat\lambda, \hat\delta) = \arg\min_{h,\lambda,\delta} \sum_{i=1}^n \left( Y_i - g\big(X_i; \hat\beta_{X_i|h,\lambda,\delta}\big) \right)^2, \qquad (132)$$

where $\hat\beta_{X_i|h,\lambda,\delta}$ is the leave-one-out coefficient estimate for the estimation of E[Y|X = X_i] that is obtained from the data sample without observation i. The sum of squared errors indicates how well the estimator is able to predict E[Y|X] for the sample distribution of X.171
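To illustrate, the following minimal sketch implements the leave-one-out criterion (132) for a local linear regression with a single continuous regressor and only the bandwidth h to choose (in Python; the function names and the simulated data are our own illustration, not part of the text):

```python
# Sketch: leave-one-out least squares cross-validation for the bandwidth of a
# local linear regression with a Gaussian kernel (one continuous regressor).
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear estimate of E[Y|X=x0]; the intercept estimates m(x0)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)          # kernel weights
    Z = np.column_stack([np.ones_like(X), X - x0])
    WZ = Z * w[:, None]                              # weighted design matrix
    beta = np.linalg.lstsq(WZ.T @ Z, WZ.T @ Y, rcond=None)[0]
    return beta[0]

def loo_cv(h, X, Y):
    """Sum of squared leave-one-out prediction errors, cf. (132)."""
    err = 0.0
    for i in range(len(Y)):
        mask = np.arange(len(Y)) != i                # drop observation i
        err += (Y[i] - local_linear(X[i], X[mask], Y[mask], h)) ** 2
    return err

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)
grid = np.linspace(0.02, 0.5, 25)
h_cv = grid[np.argmin([loo_cv(h, X, Y) for h in grid])]
print("cross-validated bandwidth:", h_cv)
```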

14.2.5 Nonparametric estimation of monotone functions

In the previous section we were concerned with the estimation of E[Y jX = x]. In many
applications we rather need an estimate of the conditional distribution function FY jX ( ; ) or
the quantile function QY jX ( ). Both approaches could be used to estimate the entire conditional
distribution of Y jX. Yet, often we might be interested only in the median ( = 0:5) or the cdf
at some particular value a.

For estimating the conditional distribution function we note that

$$F_{Y|X}(a; x) = E\left[\,1(Y \le a) \mid X = x\,\right].$$
170 For the properties of cross-validation in nonparametric regression see Härdle and Marron (1987) or Racine and Li (2004) for the hybrid kernel. A variety of alternative bandwidth selectors have been developed, which usually rely on estimating mean squared error by estimating bias and variance at different bandwidth values or for different specifications of g(·), e.g. different polynomial orders; see Fan and Gijbels (1995), Fan, Farmen, and Gijbels (1998) and Ruppert (1997) among others. However, it is not clear whether these methods would also work well in higher dimensions. See also Loader (1999) for a recent discussion on bandwidth selection.
171 In the context of local likelihood estimation, Staniswalis (1989) suggested a different cross-validation criterion based on maximizing the leave-one-out fitted likelihood function

$$(\hat h, \hat\lambda, \hat\delta) = \arg\max_{h,\lambda,\delta} \sum_{i=1}^n \ln L\left(Y_i; g\big(X_i; \hat\beta_{X_i|h,\lambda,\delta}\big)\right), \qquad (133)$$

where $\hat\beta_{X_i|h,\lambda,\delta}$ again is the leave-one-out estimate at X = X_i.



Hence, for any value of a we could estimate $F_{Y|X}(a;x)$ by nonparametric regression of the binary variable $1(Y \le a)$ on x. If the cdf shall be estimated at various different values of a, it would sometimes be useful to have estimates that ensure the monotonicity property, i.e. that guarantee that

$$\hat F_{Y|X}(a; x) \le \hat F_{Y|X}(a + \varepsilon; x) \qquad \text{for any } a \text{ and } \varepsilon \ge 0.$$

When using Nadaraya-Watson regression with the same bandwidth value for all locations a, this would automatically be guaranteed. However, Nadaraya-Watson has unfavourable bias properties. Since $1(Y \le a)$ is binary, a local logit model would be appropriate. For a further discussion see Neumeyer (2007), Hall, Wolff, and Yao (1999) or Hall and Müller (2003, JASA) among many others.

Alternatively, nonparametric quantile regression is examined in Yu and Jones (1998, JASA). Consider estimation of $Q^\tau_{Y|X}(x)$. A local linear quantile regression would be

$$\hat Q^\tau_{Y|X}(x) = \hat a, \qquad (\hat a, \hat b) = \arg\min_{a,b} \sum_{j=1}^N \rho_\tau\left(Y_j - a - b(X_j - x)\right) K\left(\frac{X_j - x}{h}\right),$$

where

$$\rho_\tau(u) = u \cdot (\tau - 1(u \le 0))$$

is the "check" function.

14.3 Nonparametric density estimation

We are interested in estimating the density $f_X(\cdot)$ at a particular location x from a sample of iid observations $\{X_j\}_{j=1}^N$. Consider first the situation where X is one-dimensional. We could estimate $f_X$ by a histogram.

The histogram has several disadvantages, though. It is not continuous and gives the same density estimate to every point in the same bin. Furthermore, the shape of the histogram depends on the origin of the bin grid, and the appearance of the histogram can change substantially if the origin is changed. Finally, the histogram estimate $\hat f$ is not an efficient estimator of f.

To eliminate some of the disadvantages, we could estimate f locally at point x by centering a histogram at x:

$$\hat f(x; h) = \frac{1}{N} \sum_{j=1}^N \frac{1(|X_j - x| \le h)}{2h}$$

or using a more flexible weighting via a kernel function K(u):

$$\hat f(x; h) = \frac{1}{Nh} \sum_{j=1}^N K\left(\frac{X_j - x}{h}\right).$$

This estimate $\hat f$ is differentiable if K is, and under certain restrictions on the bandwidth $\hat f(x;h)$ can be shown to converge to f(x). The bias can be shown to be

$$E\big[\hat f(x;h)\big] - f(x) = \frac{h^2}{2}\, f''(x)\, \mu_2 + o(h^2),$$

where $\mu_2 = \int u^2 K(u)\, du$. The variance is

$$Var\big(\hat f(x;h)\big) = \frac{1}{Nh}\, \mu_0\, f(x) + o\left(\frac{1}{Nh}\right),$$

where $\mu_0 = \int K^2(u)\, du$.
For density estimation in higher dimensions the choice of an appropriate model is quite important, and local likelihood estimation can be useful. (We discuss this only in the context of regression.)

Choice of smoothing parameters If the density estimate is used for explorative purposes, it is best to choose the bandwidth by visually examining the density plot and trying different values. For automatic bandwidth choice the rule-of-thumb of Silverman (1986) is frequently used. For a normal kernel the bandwidth is chosen as $h = 1.06\, N^{-0.2} \min(\hat\sigma, \text{interquartile range}/1.34)$.172
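A minimal sketch of the kernel density estimator together with Silverman's rule-of-thumb bandwidth (Python; implementation details and names are our own):

```python
# Sketch: Gaussian-kernel density estimator with Silverman's rule of thumb,
# h = 1.06 * N^(-1/5) * min(sigma_hat, IQR/1.34).
import numpy as np

def silverman_bandwidth(X):
    iqr = np.subtract(*np.percentile(X, [75, 25]))   # interquartile range
    return 1.06 * len(X) ** (-0.2) * min(X.std(ddof=1), iqr / 1.34)

def kde(x, X, h):
    """f_hat(x; h) = (1/(N h)) sum_j K((X_j - x)/h), with Gaussian K."""
    u = (X[None, :] - np.atleast_1d(x)[:, None]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(X) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
X = rng.standard_normal(1000)
h = silverman_bandwidth(X)
print("h =", h, " f_hat(0) =", kde(0.0, X, h))   # true f(0) is about 0.399
```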

14.4 Global nonparametric methods

Whereas the previous methods were all based on estimating a local model in a small neighbour-
hood de…ned by a kernel function, there are also global regression methods where the number
of coe¢ cients estimated tends to in…nity with growing sample size. A number of methods have
been proposed with splines and series estimation being the most widespread.
172 For the biweight kernel (or quartic kernel) $K(u) = \frac{15}{16}(1-u^2)^2\, 1(|u|<1)$ the bandwidth choice is $h = 2.7768\, n^{-0.2} \min(\hat\sigma, \text{interquartile range}/1.34)$.

14.4.1 Smoothing splines

Smoothing splines have been widely studied in the statistics literature and are extensively used in engineering, CAD etc., but less so in econometrics. Splines are piecewise polynomials that are joined at certain knots, e.g. at the observed values $X_i$.

The idea is to find a function g that minimizes the penalized objective function

$$\frac{1}{n}\sum \left(Y_i - g(X_i)\right)^2 + \lambda \int \left(g''(x)\right)^2 dx,$$

where λ is the tuning parameter, which controls the trade-off between fidelity to the data (first part) and roughness penalty (second part). For λ near zero, the minimizing function would be the interpolation of all data points. For λ very large, the function g would be a straight line173 that passes through the data as the least squares fit. Reinsch (1967) considered the Sobolev space of $C^2$ functions with square integrable second derivatives. The choice of the function g can be considered as a variational problem over this space, which has a first-order (Euler) condition that requires the fourth derivative of g to be zero almost everywhere. The solution is a piecewise cubic polynomial whose third derivative jumps at a set of points of measure zero. In practice, these knots are chosen to be the data points $\{X_i\}$. Hence, the solution itself, its first and its second derivative are continuous everywhere. The third derivative is continuous almost everywhere and jumps at the knots $X_i$. The fourth derivative is zero almost everywhere.

These conditions provide a finite dimensional set of equations, for which fast solutions are available. Smoothing splines lead to a linear smoother, i.e. the fitted values are linear in $Y_N$:

$$\hat Y_N = A(\lambda)\, Y_N,$$

where A(λ) is a matrix that depends on the $X_i$ and λ.
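As an illustration, the following sketch fits a cubic smoothing spline with SciPy's UnivariateSpline. Note that this routine controls smoothness via a residual target s rather than the penalty λ directly, but it exhibits the same fidelity/roughness trade-off (the parameter values are arbitrary choices of ours):

```python
# Sketch: cubic smoothing spline; s = 0 interpolates all data points,
# larger s yields a smoother fit (closer to the least squares line).
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 200))
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(200)

rough = UnivariateSpline(X, Y, s=0)               # interpolation (lambda -> 0)
smooth = UnivariateSpline(X, Y, s=len(X) * 0.09)  # heavier smoothing
print("fits at x=0.5:", rough(0.5), smooth(0.5))  # true value is sin(pi) = 0
```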

14.4.2 Series estimation

A truly global nonparametric approach is series estimation. The intuition behind series
estimation is very easily understood: Simply add more and more interaction and polynomial
terms to a parametric regression as the sample size increases. For example, consider estimation
of E[Y jX] for X one-dimensional and continuous. The function is obtained by using OLS to
regress Yi on constant, Xi ; Xi2 ; Xi3 ; Xi4 ; :::. The number of regressors included tends to in…nity
173 Since the second derivative is zero.

with growing sample size n. In this sense the estimator is nonparametric, as the number of coefficients goes to infinity. For any particular choice of K for a given data set, the estimation approach is simply OLS.

The motivation is given by the result174 that any square integrable real valued function can be uniquely expressed by a linear combination of linearly independent functions $\{\phi_j(x)\}_{j=1}^\infty$:

$$m(x) = \sum_{j=1}^\infty b_j\, \phi_j(x).$$

The series estimator is given by

$$\hat m_K(x) = \sum_{j=1}^K \hat b_j\, \phi_j(x),$$

where, for a particular choice of the tuning parameter K, the coefficients $b_j$ are obtained by OLS:

$$\hat b = \left(\Phi_N^{K\prime}\, \Phi_N^K\right)^{-1} \Phi_N^{K\prime}\, Y_N,$$

where $Y_N$ is the vector of all observations $\{Y_i\}_{i=1}^N$ and $\Phi_N^K$ is the matrix of $\phi_1(X_i), \ldots, \phi_K(X_i)$ for all N observations.


A widely used function basis is the power series:

$$1, x, x^2, x^3, \ldots$$

for one-dimensional x. For higher dimensional x also the interaction terms are included, e.g. for two-dimensional x:

$$1, x_1, x_2, x_1 x_2, x_1^2, x_2^2, x_1^2 x_2, x_1 x_2^2, x_1^3, x_2^3, \ldots$$

The Fourier series consists of sine and cosine functions with increasing frequency, often augmented by linear and quadratic terms. The linear spline basis for given knots $t_1, t_2, t_3, \ldots$ is

$$1, x, (x - t_l)\, 1(x \ge t_l), \ldots$$

Wavelet bases provide another set of basis functions which have received substantial research interest.

The asymptotic theory for series estimators is very different from that for kernels. Usually no closed-form bias expressions can be derived and only results on the rates of convergence are available.
174 @@ Does this result stem from Kolmogorov?

Certain basis functions that permit a convenient derivation of asymptotic properties may lead to problems of collinearity in estimation, e.g. power series. Therefore, power series are very often used to derive the theoretical properties, but orthogonalized series need to be used in practice, e.g. the Legendre polynomials. (This change in the basis functions does not change the estimator's asymptotic properties as long as the two bases span the same space.) Series estimators may often behave somewhat erratically in the boundary regions of the support of X. On the other hand, they have the advantage that the estimator can be computed very fast by OLS for a particular choice of K. Least-squares cross-validation can be used to choose a K for a given data set.
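A minimal sketch of a power-series estimator with K chosen by leave-one-out cross-validation; the QR orthogonalization mirrors the remark above on using orthogonalized series in practice (Python; all names are our own):

```python
# Sketch: series regression of Y on (1, x, ..., x^K); K chosen by LOO-CV.
import numpy as np

def series_fit(X, Y, K):
    P = np.vander(X, K + 1, increasing=True)      # basis 1, x, ..., x^K
    Q, R = np.linalg.qr(P)                        # orthogonalized regressors
    fitted = Q @ (Q.T @ Y)                        # OLS fit
    H = (Q ** 2).sum(axis=1)                      # leverages = diag(hat matrix)
    loo = np.mean(((Y - fitted) / (1 - H)) ** 2)  # LOO-CV shortcut for OLS
    return fitted, loo

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, 300)
Y = np.exp(X) + 0.2 * rng.standard_normal(300)
K_cv = min(range(1, 11), key=lambda K: series_fit(X, Y, K)[1])
print("cross-validated series order K:", K_cv)
```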

14.5 Comparison of splines, series, kernels

Theory: explicit formulae for the bias terms of kernel estimators are available, whereas for series estimators usually only rate results are available.

Global approaches can be convenient to ensure that $\hat m$ satisfies certain properties, such as monotonicity, convexity or additive separability, or that $\hat m$ goes through a certain point. This may be helpful e.g. when estimating conditional distribution functions which, by definition, are nondecreasing.

15 Semiparametric estimation

The curse of dimensionality has led to various approaches to handle estimation in higher dimensions. Nonparametric estimation with dim(X) larger than 2 or 3 is often considered unfeasible, or at least very imprecise, for the usual sample sizes between 1,000 and 100,000 observations. This has led to a number of developments that attempt to reduce the dimensionality. One can roughly distinguish two different approaches. The first strand reduces the class of permitted functions m(x) by imposing a certain structure, e.g. an index restriction that $E[Y|X=x] = m(x'\beta)$. A large theoretical literature has developed in statistics and econometrics proposing alternative restrictions and analysing their properties. Nevertheless, it is fair to say that there have been only very few applications in econometrics so far. The second strand stays closer to the nonparametric ideal and reduces the dimensionality by some kind of averaging. In essence, this research approach limits itself to a simpler question.

Most of the treatment evaluation literature can be situated here, and the estimation of the average treatment effect by matching estimators may illustrate this. As discussed in Section 2, under the assumption of selection on observables, the potential outcomes are identified as $E[Y^d|X] = E[Y|X, D=d] = m_d(X)$. Hence, $\hat m_1(x) - \hat m_0(x)$ gives a nonparametric estimate of the treatment effect conditional on x. Nevertheless most interest focusses on the average treatment effect175 $\int (m_1(x) - m_0(x))\, dF(x)$, which is just a single number. The estimation sets out from estimating $m_d(x)$ nonparametrically but then averages over the distribution of X. Obviously, with the ATE just being a scalar, estimation should be more precise than for the average treatment effect conditional on x. Hence, no restrictions are imposed; rather the dimension of the object of interest has been reduced. In these situations we are not actually interested in the nonparametric regression plane m(x) itself, but rather use it only as a plug-in estimator to estimate a restricted set of parameters. If the lower-dimensional object is of finite dimension, often $\sqrt N$ convergence can be achieved.

15.1 Semiparametric structure

A large number of nonparametric models have been proposed that overcome the curse of di-
mensionality by restricting the class of admitted functions. However, these models typically
achieve the dimension reduction by restricting the interactions between the regressors, which
makes them unattractive for most econometric applications, where interactions between the ex-
planatory variables are often considered important.

One example is the generalized additive model (GAM), which assumes that m(x) with dim(x) = d can be written as a sum of nonparametric functions of each component of x:

$$E[Y|X=x] = c + m_1(x_1) + m_2(x_2) + \ldots + m_d(x_d),$$

where the functions $m_k$ are unknown. This effectively reduces the dimension of the nonparametric regression to one. On the other hand, it clearly restricts the interaction between the components of X. (It is possible to define the interaction $x_1 x_2$ as another regressor to be included in the GAM, but one has to know which variables should be interacted and in which way.) Notice that this model does not contain any parametric element, i.e. all elements involved are infinite dimensional.
175 Or the average treatment effect on the treated.

Index models are among the most widely used semiparametric models. The single index model in particular assumes that

$$E[Y|X=x] = m(x'\beta),$$

where m is an unknown function and β is unknown but finite dimensional. Again, the effective dimension of the estimation is reduced to one. Here, interaction between the X on Y is permitted, but the structure imposes that all the heterogeneity in the characteristics of the individuals can be reduced to one dimension. In other words, the relative impact of $X_1$ on Y compared to the impact of $X_2$ on Y is the same for every individual and does not vary with any other characteristics $X_3$:

$$\frac{\partial E[Y|X=x]/\partial x_1}{\partial E[Y|X=x]/\partial x_2} = \frac{\beta_1}{\beta_2}.$$

The partial linear model separates the components of x into two sets $x_{S_1}$ and $x_{S_2}$ and assumes that $x_{S_1}$ enters linearly whereas $x_{S_2}$ can enter nonparametrically through an unknown function g:

$$E[Y|X=x] = x_{S_1}\beta + g(x_{S_2}).$$

The generalized partial linear model extends this model to include a known link function:

$$E[Y|X=x] = G\left(x_{S_1}\beta + g(x_{S_2})\right),$$

e.g. a logit link function.176

15.1.1 Partial linear models

Partially linear models are widely used in the analysis of consumer behaviour, particularly in the analysis of Engel (1857) curves. Let Y be the budget share of a good, X total income and Z other household covariates. A partially linear model specifies

$$Y = g(X) + Z'\beta + U,$$

where the relationship between the budget share and income is left completely unspecified. Robinson (1988) showed $\sqrt n$ consistent estimation of β under sufficient smoothness assumptions. Note that

$$Y - E[Y|X] = (Z - E[Z|X])'\beta + U.$$
176 @@ Could we actually have an unknown link function?

Hence, one could estimate β as

$$\hat\beta = \left(\sum \left(Z_i - \hat E[Z|X_i]\right)\left(Z_i - \hat E[Z|X_i]\right)' \hat I_i\right)^{-1} \sum \left(Z_i - \hat E[Z|X_i]\right)\left(Y_i - \hat E[Y|X_i]\right) \hat I_i,$$

where $\hat E$ represents nonparametric estimators. Hence, given first step nonparametric estimates, a closed form expression is given for $\hat\beta$. Robinson (1988) also suggests the use of a trimming function $\hat I_i$ to eliminate observations where $f(X_i)$ is small, since the conditional expectations would be estimated very imprecisely there. This trimming function requires a preliminary estimator of the density f(x), e.g. a kernel estimator, or a priori assumptions about the shape of the support. Under a number of regularity conditions

$$\sqrt n\,(\hat\beta - \beta) \xrightarrow{d} N\left(0,\ \sigma^2 \Sigma^{-1}\right),$$

where

$$\hat\Sigma = \frac1n \sum \left(Z_i - \hat E[Z|X_i]\right)\left(Z_i - \hat E[Z|X_i]\right)' \hat I_i \xrightarrow{p} \Sigma$$

and

$$\hat\sigma^2 = \frac1n \sum \left(Y_i - \hat E[Y|X_i] - \left(Z_i - \hat E[Z|X_i]\right)'\hat\beta\right)^2.$$
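A minimal sketch of this two-step procedure (Python; the Nadaraya-Watson first step and the fixed bandwidth are simplifying choices of ours, and trimming is omitted because f(X) is bounded away from zero on [0,1] in this simulated design):

```python
# Sketch: Robinson's (1988) estimator for Y = g(X) + Z'beta + U.
import numpy as np

def nw(X, V, h):
    """Nadaraya-Watson estimates of E[V|X=X_i] at all sample points."""
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)
    return (K @ V) / K.sum(axis=1, keepdims=(V.ndim == 2))

rng = np.random.default_rng(5)
n = 500
X = rng.uniform(0, 1, n)
Z = rng.standard_normal((n, 2)) + X[:, None]     # Z correlated with X
beta = np.array([1.0, -0.5])
Y = np.sin(2 * np.pi * X) + Z @ beta + 0.3 * rng.standard_normal(n)

h = 0.1
eY = Y - nw(X, Y, h)                             # residualize Y on X
eZ = Z - nw(X, Z, h)                             # residualize Z on X
beta_hat = np.linalg.lstsq(eZ, eY, rcond=None)[0]  # OLS on residuals
print("beta_hat:", beta_hat)                     # should be near (1, -0.5)
```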

15.1.2 Index models

Semiparametric single index models impose the assumption

$$E[Y|X=x] = m(v(x; \beta_0))$$

for a known function v and an unknown function m. The function v is often linear, in that

$$E[Y|X=x] = m(x'\beta_0).$$

Clearly $\beta_0$ can only be identified up to scale, i.e. multiplication of $\beta_0$ by a scalar and appropriate changes in the unknown m would lead to an observationally equivalent model. (Also, obviously $\beta_0$ cannot contain an intercept.)

The single index assumption supposes that there is one coefficient vector β that is identical for all individuals. This implies that all individuals with the same value for the linear combination $x'\beta$ behave in all respects identically, although they might have very different characteristics x.177
177 @@ Consider for instance the dependence of female labour force participation on family size. It is undisputed that generally the probability of being employed decreases for a woman with the number of children. However, for some women the effect of children might be zero or might even be positive (particularly if the children are older), e.g. brought about by increased financial needs due to a larger family (housing etc.). If women react differently to the number of children, then policy instruments such as subsidized child care, all-day schooling, tax incentives, social insurance and labour market regulations must be targeted more precisely; particularly if the subpopulation of women who increase their labour supply in response to an additional child can be identified and distinguished from those women who reduce their labour supply.

Consider a first-step nonparametric estimator of $E[Y|X'\beta]$, e.g. a kernel estimator, given X and β. Ichimura (1993) suggested to estimate β such that

$$\hat\beta = \arg\min_\beta \sum \left(Y_i - \hat E[Y|X_i'\beta]\right)^2. \qquad (134)$$

In contrast to the partial linear model, $\hat\beta$ is here only implicitly defined and obtained by iteration. This also requires repeated estimation of $\hat E[Y|X'\hat\beta]$. More precisely, a starting value for β is chosen, $\hat E[Y|X'\hat\beta]$ is estimated and (134) calculated. Then a different value of β is chosen, $\hat E[Y|X'\hat\beta]$ is estimated and (134) calculated. This is repeated until the minimum of (134) among all values $\beta \in B$ seems to have been found. (See the chapter on numerical optimization on this.)

Since the conditional mean $E[Y|X'\beta]$ cannot be estimated at locations where the density $f(X'\beta)$ is very low, a trimming function $\tau(X_i; \beta)$ is introduced to eliminate these observations:178

$$\hat\beta = \arg\min_{\beta:\, \|\beta\|=1} \sum \left(Y_i - \hat E[Y|X_i'\beta]\right)^2 \tau(X_i; \beta).$$

As mentioned above, $\beta_0$ is identified only up to scale. Hence, without restricting β in the above minimization the estimator would not converge. One might either restrict $\|\beta\|$ to one, i.e. that the length of the vector β is one. Alternatively, one may fix one of the components of $\beta_0$ at one, if one knows that the true coefficient is nonzero. (Otherwise, restricting $\|\beta\| = 1$ is more appropriate, unless $\beta_0$ is zero for all coefficients.) A weighting function might also be introduced in the above equation, e.g. for efficiency reasons in case of heteroskedasticity.

The covariance matrix of the estimated $\hat\beta$ is $V^{-1}\Sigma V^{-1}$ where, provided that the restriction $\|\beta\| = 1$ was used,

$$V = E\left[\left(m'(X'\beta_0)\right)^2 \left(X - E[X|X'\beta_0]\right)\left(X - E[X|X'\beta_0]\right)'\right]$$

178 Even if the density of X is bounded away from zero over its support, the density of $X'b$ might not be bounded away from zero over its support, see e.g. Ichimura and Todd (2008). Consider as an example two random variables uniformly distributed on [0,1]. The joint density is bounded away from zero. The sum of these two variables has support from 0 to 2, but the density is not bounded away from zero towards the endpoints of the support.

$$\Sigma = E\left[Var(Y|X)\, \left(m'(X'\beta_0)\right)^2 \left(X - E[X|X'\beta_0]\right)\left(X - E[X|X'\beta_0]\right)'\right],$$

where the expectations are taken over the set of X where the density of $X'\beta_0$ is assumed to be bounded away from zero. (When a different normalization than $\|\beta\| = 1$ is used, the formula is slightly different, see Ichimura (1993).) Note that V and Σ are not invertible since effectively only dim(X) − 1 coefficients are estimated due to the restriction $\|\beta\| = 1$. Therefore a generalized inverse needs to be computed.

Ichimura and Lee (2006) consider estimation of the single index model in case of misspecification, i.e. when the single index assumption is incorrect. The estimates can still be considered as a best approximation within the class of single index models, but the variance formula given above needs to be extended.

When Y is binary, Klein and Spady (1993) suggested to estimate β by maximizing

$$\hat\beta = \arg\max_\beta \sum_{i=1}^N Y_i \ln \hat E[Y|X_i'\beta] + (1 - Y_i)\ln\left(1 - \hat E[Y|X_i'\beta]\right).$$

Since $\hat E[Y|X_i'\beta]$ could happen to be negative, e.g. if a bias-reducing higher order kernel or local linear regression is used for estimation, they modify the objective function to:

$$\hat\beta = \arg\max_\beta \sum_{i=1}^N Y_i \ln \hat E[Y|X_i'\beta]^2 + (1 - Y_i)\ln\left(1 - \hat E[Y|X_i'\beta]\right)^2.$$

Here, the nonparametric estimation is repeated in each iteration of the optimization with respect to β, i.e. for every successive value of β during the iterations, $\hat E[Y|X_i'\beta]$ needs to be estimated again.
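A minimal sketch of Ichimura's semiparametric least squares for a two-dimensional X, using the normalization that the first coefficient equals one (Python; the leave-one-out kernel regression, bandwidth and starting value are our own choices):

```python
# Sketch: single index model E[Y|X] = m(X'beta), estimated by minimizing the
# leave-one-out kernel regression criterion (134) over beta.
import numpy as np
from scipy.optimize import minimize

def loo_nw(V, Y, h):
    """Leave-one-out Nadaraya-Watson estimates of E[Y|V=V_i]."""
    K = np.exp(-0.5 * ((V[:, None] - V[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)                 # leave out observation i
    return (K @ Y) / K.sum(axis=1)

def sls_objective(b_free, X, Y, h):
    b = np.concatenate([[1.0], b_free])      # normalization: first coeff = 1
    return np.sum((Y - loo_nw(X @ b, Y, h)) ** 2)

rng = np.random.default_rng(6)
n = 500
X = rng.standard_normal((n, 2))
Y = 1 / (1 + np.exp(-(X @ np.array([1.0, 2.0])))) + 0.1 * rng.standard_normal(n)

res = minimize(sls_objective, x0=[0.5], args=(X, Y, 0.3), method="Nelder-Mead")
print("estimated beta (normalized):", [1.0, float(res.x[0])])  # near (1, 2)
```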

15.1.3 Additive models

A somewhat distinct motivation underlies additively separable models, where it is assumed that the true regression function can be written as

$$E[Y|X=x] = c + m_1(x_1) + m_2(x_2) + \ldots + m_Q(x_Q), \qquad (135)$$

where the $m_q$ are unknown functions of subsets of the regressors. Usually these subsets contain only one regressor, such that $m_1(x_1)$ to $m_Q(x_Q)$ are one-dimensional unknown functions. The term semiparametric may not be semantically appropriate since the additive model does not contain any parametric part. All the functions are still nonparametric. The approach rather achieves a reduction in the dimensionality to one-dimensional nonparametric regression.

The additively separable model usually heavily restricts interaction between the regressors. In the above specification, the derivative of E[Y|X] with respect to the first regressor does not depend on the value of any of the other regressors. This then permits plotting the relationship between Y and each of the regressors, holding all the others fixed, by two-dimensional graphs. This is a very helpful diagnostic and contributes substantially to the popularity of GAM.

To obtain more flexibility it would be possible to also include interaction terms as separate regressors, e.g. to specify for two-dimensional X

$$E[Y|X=x] = c + m_1(x_1) + m_2(x_2) + m_3(x_1 x_2),$$

although this is rarely done, mainly because the visual interpretation of the graphs of $m_1(x_1)$ and $m_2(x_2)$ becomes difficult: one cannot use a ceteris paribus interpretation since a change in $x_1$ also implies a change in $x_1 x_2$. It further requires one to know the form of the interaction, i.e. should $x_1 x_2$ or $x_1 x_2^2$ or $x_1/x_2$ be used?

An advantage of this specification is that the nonparametric estimators of $m_q(x)$, and thus the combined estimate for E[Y|X=x], converge at the rate for one-dimensional nonparametric regression.
Two approaches are mainly used for the estimation of additive models: back-fitting and marginal integration.

Back-fitting: Back-fitting, as discussed in Hastie and Tibshirani (1990), estimates $m_q(x)$ iteratively, given previous estimates of the other $m_q$ functions. Noting from (135) that

$$m_Q(x_Q) = E[Y - c - m_1(x_1) - \ldots - m_{Q-1}(x_{Q-1}) \mid X = x],$$

we could estimate $m_Q$, given preliminary estimates $\tilde c$, $\tilde m_1$ to $\tilde m_{Q-1}$, by nonparametrically regressing

$$Y_i - \tilde c - \tilde m_1(X_{i,1}) - \ldots - \tilde m_{Q-1}(X_{i,Q-1}) \quad\text{on}\quad X_{i,Q}.$$

Given this new estimate $\hat m_Q$, one would start again with estimating $m_1$ given the other estimates, then $m_2$ and so on until convergence. Hence, sequentially the functions $m_1$ to $m_Q$ are fitted given pilot estimates for the other functions. For starting the iterative process, first estimates for c and the functions $m_1$ to $m_Q$ are needed, e.g. from an OLS regression.

The GAM command in S-Plus made additive models and generalized additive models with back-fitting rather popular. The asymptotic theory has been developed in @@ Mammen, Linton, Nielsen (reference still to be added).
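A minimal sketch of the back-fitting cycle for two regressors (Python; the kernel smoother, bandwidths and the location normalization $E[m_q] = 0$ are our own choices):

```python
# Sketch: back-fitting for E[Y|X] = c + m1(X1) + m2(X2), cycling univariate
# kernel smooths of partial residuals until (approximate) convergence.
import numpy as np

def smooth(x, v, h):
    """Nadaraya-Watson smooth of v on x, evaluated at the sample points."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (K @ v) / K.sum(axis=1)

rng = np.random.default_rng(7)
n = 500
X1, X2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
Y = 1.0 + np.sin(np.pi * X1) + X2 ** 2 + 0.2 * rng.standard_normal(n)

c = Y.mean()
m1, m2 = np.zeros(n), np.zeros(n)
for _ in range(20):                        # fixed number of sweeps suffices here
    m1 = smooth(X1, Y - c - m2, h=0.15)
    m1 -= m1.mean()                        # impose E[m1] = 0
    m2 = smooth(X2, Y - c - m1, h=0.15)
    m2 -= m2.mean()                        # impose E[m2] = 0
print("c =", c, " corr(m1_hat, sin):", np.corrcoef(m1, np.sin(np.pi * X1))[0, 1])
```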

Marginal integration: Marginal integration has been studied by Härdle and Linton in a number of papers and is motivated by integrating out the other regressors. Consider the example with two regressors:

$$E[Y|X=x] = c + m_1(x_1) + m_2(x_2).$$

A location restriction is needed on $m_1$ and $m_2$, since an intercept in $m_1$ or $m_2$ could not be identified, i.e. not be distinguished from c. Therefore we may impose that $E[m_1(X_1)] = E[m_2(X_2)] = 0$. This restriction then implies that

$$\int E[Y|X_1=x_1, X_2]\, dF_{X_2} = \int \left(c + m_1(x_1) + m_2(X_2)\right) dF_{X_2} = \int c\, dF_{X_2} + \int m_1(x_1)\, dF_{X_2} + E[m_2(X_2)] = c + m_1(x_1).$$

Hence, we can estimate $m_1(x_1)$ up to an additive constant by

$$\hat m_1(x_1) = \frac1n \sum \hat E[Y|X_1 = x_1, X_2 = X_{2,i}],$$

and similarly for $m_2(x_2)$. The constant c is then estimated afterwards as $\frac1n \sum Y_i$, and $E[m_q(X_q)] = 0$ is imposed on the estimated functions.
This estimator is somewhat more difficult to implement as it requires higher dimensional regression in a first stage. It has the advantage, though, that its asymptotic distribution has been derived, which however is based on higher order kernels for bias reduction.

If the true mean function is not additively separable, the two estimation approaches (back-fitting and marginal integration) converge to different limits. The marginal integration estimator bears similarities with the matching estimators for estimating a potential outcome in a selection on observables scenario. First the conditional mean given $X_1$ and $X_2$ is obtained, which is then integrated over the distribution of $X_2$.
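A minimal sketch of marginal integration for the two-regressor example (Python; the product kernel and bandwidth are our own choices):

```python
# Sketch: marginal integration estimate of c + m1(x1), averaging a bivariate
# kernel regression E[Y|X1=x1, X2=X2_i] over the empirical distribution of X2.
import numpy as np

def nw2(x1, x2, X1, X2, Y, h):
    """Bivariate Nadaraya-Watson estimate of E[Y|X1=x1, X2=x2]."""
    K = np.exp(-0.5 * (((X1 - x1) / h) ** 2 + ((X2 - x2) / h) ** 2))
    return (K @ Y) / K.sum()

def marginal_integration(x1, X1, X2, Y, h):
    return np.mean([nw2(x1, x2i, X1, X2, Y, h) for x2i in X2])

rng = np.random.default_rng(8)
n = 300
X1, X2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
Y = 1.0 + np.sin(np.pi * X1) + X2 ** 2 + 0.2 * rng.standard_normal(n)
# True value of c + m1(0.5) here is 1 + sin(0.5*pi) + E[X2^2] = 2 + 1/3
print("c + m1(0.5) estimate:", marginal_integration(0.5, X1, X2, Y, h=0.25))
```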

Generalized additive models embed a link function on the dependent variable:

$$g(Y) = c + m_1(X_1) + m_2(X_2) + \ldots + m_Q(X_Q) + U, \qquad (136)$$

where g is a known link function, e.g. the logit link for binary Y; see e.g. Hastie and Tibshirani (1990). For unknown link function g, the alternating conditional expectations (ACE) approach has been proposed by Breiman and Friedman (1985).

15.2 Semiparametric estimators and partial means

The previous section considered various approaches to restrict the class of functions m(x) permitted, so as to reduce the dimensionality of the estimation problem. In contrast to parametric estimation, where the class of permitted functions is reduced to a subfamily indexed by a finite-dimensional coefficient vector, the class of functions permitted still remained nonparametric, but certain structure was imposed.

In the following an alternative approach to analyze semiparametric estimation is examined. We start with an example for semiparametric m-estimators, following Newey (2004). Consider a general class F of distribution functions with F a particular member of it. F generates the observed data, e.g. Y, X, Z, D etc. The class of functions F is unrestricted except for certain regularity conditions and certain restrictions implied by the nature of the application. E.g. if D is a binary treatment variable, the class F is restricted such that it contains only distribution functions $F_{YDXZ}(y,d,x,z)$ which are zero for $d \notin \{0,1\}$. Otherwise no other restrictions are imposed on F. This is thus different from e.g. Maximum Likelihood, where one specifies a parametric class for F.
In the following we will denote the data for individual i by the shorthand notation $W_i$. ($W_i$ may contain $Y_i$, $D_i$, $X_i$ or any other observed variables of individual i.) In addition, we denote by θ the finite-dimensional parameter we are interested in. This could e.g. be the average treatment effect. In addition, there is a (possibly nonparametric) function h that may depend on θ and the data.179 For example, h could be $m(X_i)$, where m(x) = E[Y|X=x], or h could be the entire function m(x). Furthermore, h is also permitted to be a collection of functions, e.g. a regression function m(x) and a density function f(x). Let $\hat h$ be any nonparametric estimator and $h_0$ the true value.
Consider a function

$$M(W_i; \theta, h)$$
179 Note that the symbol h has no relation to the symbol h used as the bandwidth value in the previous section. We follow the notation of Newey (2004).

with

$$E[M(W; \theta_0, h_0)] = 0,$$

where $\theta_0$ is the true value. Note that the true values $\theta_0$ and $h_0$ are determined by the data generating process $F \in \mathcal F$. Hence, for a different F, different true values $\theta_0$ and $h_0$ would result. In other words, the above equation could be written more precisely as

$$E_F[M(W; \theta_0(F), h_0(F))] = 0,$$

where $\theta_0$ and $h_0$ are determined by F and the expectation operator is with respect to F. A semiparametric M-estimator $\hat\theta$ solves the moment equation

$$\frac1N \sum_{i=1}^N M(W_i; \hat\theta, \hat h) = 0. \qquad (137)$$

Under some regularity conditions, the estimator $\hat\theta$ converges to $\theta_0$: as $n \to \infty$ the estimator $\hat h$ converges to $h_0$, and by a law of large numbers the sample moment converges to the population moment. If $\theta_0$ is the unique solution of this population moment, $\hat\theta \xrightarrow{p} \theta_0$.

One example of such semiparametric estimators is the matching estimator for the expected potential outcome $E[Y^d]$:

$$\widehat{E[Y^d]} = \frac1n \sum_{i=1}^n \hat m_d(X_i).$$

This can be written as

$$\frac1N \sum_{i=1}^N \left(\hat m_d(X_i) - \hat\theta\right) = 0,$$

which corresponds to the population moment

$$E[m_d(X_i) - \theta_0] = 0,$$

where $\theta_0 = E[Y^d]$. Here the dimensionality problem is eliminated by averaging a high-dimensional function to a one-dimensional number. Another example is the average derivative estimator

$$\frac1N \sum_{i=1}^N \left(\frac{\partial \hat m(X_i)}{\partial x_1} - \hat\theta\right) = 0.$$

Our interest is in establishing the asymptotic distribution of $\hat\theta$, where we have to take the contribution of the nonparametric first step estimation $\hat h$ into account.180
180 If h were parametric and could also be written as an m-estimator, we could write the combined estimator as a two-step GMM estimator and apply standard results for GMM estimation for obtaining the asymptotic distribution.

It turns out that the calculation of the (first order term of the) asymptotic variance is often relatively straightforward for estimators of the kind (137). An interesting result is that the method used for estimating h, e.g. by kernel, spline or series methods, does not affect the (first order term of the) asymptotic variance. This also implies that the first order term of the asymptotic variance does not depend anymore on any bandwidth or smoothing parameter and is thus useless for bandwidth choice. The asymptotic variance for (137) generally is of the form $E[\psi(W)\psi(W)']$ where

$$\psi(W) = -\left(\frac{\partial E[M(W;\theta,h_0)]}{\partial\theta}\bigg|_{\theta_0}\right)^{-1} \left\{M(W;\theta_0,h_0) + \alpha(W)\right\}, \qquad (138)$$

where α(W) is an adjustment factor for the nonparametric estimation of $h_0$.

If $h_0$ contains several components, e.g. a regression function m(x) and a density f(x), the adjustment factor α(w) is simply the sum of the various adjustment factors relating to each component being estimated. Note that this gives the general form of how the asymptotic variance of $\hat\theta$ would usually look. One still has to specify precise regularity conditions under which the estimator actually achieves $\sqrt n$ consistency and this variance.

15.2.1 Examples for calculation of the adjustment factor

We consider two examples for calculation of the adjustment term α(W): first, when h is a conditional expectation; second, when h is a density function. In both cases we suppose that $h_0$ depends only on the observation $W_i$. More general derivations are given in Newey (2004). Let m(x) = E[Y|X=x] be a conditional expectation and $\partial^\lambda m(x) = \partial^{|\lambda|} m(x)/\partial x_1^{\lambda_1}\cdots\partial x_d^{\lambda_d}$. Now, suppose that $h = \partial^\lambda m(x)$. Let

$$D(w) = \frac{\partial M(w;\theta_0,h)}{\partial h}\bigg|_{h=\partial^\lambda m(x)} \qquad (139)$$

and

$$\bar D(x) = E[D(W)|X=x], \qquad (140)$$

then

$$\alpha(w) = (-1)^{|\lambda|}\, \frac{\partial^\lambda \left(\bar D(x)\, f(x)\right)}{f(x)}\, \{y - m(x)\}. \qquad (141)$$

Consider three examples:

(1) Let $M(W; \theta, h) = \frac{YD}{p(X)} - \theta$, where $p(x) = E[D|X=x]$ and D is binary. Here $\hat\theta$ is an estimator of the expected potential outcome via propensity score weighting. Clearly, the nonparametric element h is the estimate of the propensity score $\partial^0 p(x)$ for this particular observation. Here, $|\lambda| = 0$. By adapting (139) to this notation we obtain

$$D(W) = -\frac{YD}{p^2(X)}$$

and

$$\bar D(x) = E[D(W)|X=x] = -\frac{1}{p(x)}\, E[Y|D=1, X=x] = -\frac{m_1(x)}{p(x)},$$

where $m_1(x) = E[Y|D=1, X=x]$. Therefore, with $|\lambda| = 0$ we obtain

$$\alpha(w) = -\frac{m_1(x)}{p(x)}\,\{d - p(x)\},$$

and inserting this in (138) we obtain

$$\psi(W) = \frac{YD}{p(X)} - \theta_0 - \frac{m_1(X)}{p(X)}\{D - p(X)\} = \frac{(Y - m_1(X))\,D}{p(X)} + m_1(X) - \theta_0.$$

From this we obtain the asymptotic variance as

$$E[\psi(W)\psi(W)'] = E\left[\frac{Var(Y|X, D=1)}{p(X)} + (m_1(X) - \theta_0)^2\right].$$
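The following sketch illustrates example (1) numerically: it computes the propensity score weighting estimate of $E[Y^1]$ and a plug-in standard error from the influence function ψ(W) derived above (Python; the simulated design, the use of the true p(x) and a linear OLS fit for $m_1(x)$ are simplifying assumptions of ours):

```python
# Sketch: IPW estimate of E[Y^1] with an influence-function-based standard error.
import numpy as np

rng = np.random.default_rng(9)
n = 2000
X = rng.standard_normal(n)
p = 1 / (1 + np.exp(-X))                     # propensity score (known here)
D = (rng.uniform(size=n) < p).astype(float)
Y = 1.0 + X + rng.standard_normal(n)         # Y^1; observed when D = 1

coef = np.polyfit(X[D == 1], Y[D == 1], 1)   # OLS fit of m1(x) = E[Y|X, D=1]
m1 = np.polyval(coef, X)

theta = np.mean(Y * D / p)                   # IPW estimate of E[Y^1]
psi = (Y - m1) * D / p + m1 - theta          # influence function from (138)
se = np.sqrt(np.mean(psi ** 2) / n)
print(f"theta = {theta:.3f}, influence-function s.e. = {se:.4f}")
```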

(2) Let $M(W;\theta,h_1,h_2) = \frac{E[YD|X]}{E[D|X]} - \theta$, which again estimates $E[Y^1]$ under selection on observables because $\frac{E[YD|X]}{E[D|X]} = E[Y|X, D=1] = m_1(X)$. Now h contains two conditional expectation functions, and we can calculate the adjustment terms α(W) separately for both and add them up. Consider first the calculations (139) and (140) with respect to E[D|X]. We obtain

$$D(W) = -\frac{E[YD|X]}{p^2(X)}$$

and

$$\bar D(x) = -\frac{E[YD|X=x]}{p^2(x)} = -\frac{m_1(x)}{p(x)}$$

and

$$\alpha_1(w) = -\frac{m_1(x)}{p(x)}\,\{d - p(x)\}.$$

Now we derive the adjustment term $\alpha_2(w)$ with respect to E[YD|X]. We obtain

$$D(w) = \frac{1}{p(x)},$$

$$\bar D(x) = \frac{1}{p(x)}$$

and

$$\alpha_2(w) = \frac{1}{p(x)}\,\{yd - E[YD|X=x]\}.$$

Adding the two adjustment terms $\alpha_1(w)$ and $\alpha_2(w)$, we obtain the influence function (138):

$$\psi(W) = M(W;\theta_0,h_0) + \alpha_1(W) + \alpha_2(W)$$
$$= \frac{E[YD|X]}{E[D|X]} - \theta_0 - \frac{m_1(X)}{p(X)}\{D - p(X)\} + \frac{1}{p(X)}\{YD - E[YD|X]\}$$
$$= \frac{(Y - m_1(X))\,D}{p(X)} + m_1(X) - \theta_0,$$

which is identical to the influence function that we obtained for example (1) above. Hence, propensity score weighting and matching on X can achieve the same asymptotic variance.
(3) As a third example, let $M(W;\theta,h) = \frac{\partial E[Y|X]}{\partial x_1} - \theta$, which gives the average derivative estimator with respect to the first element of X. (E.g. the first element of X could be the continuous treatment variable D. For notational simplicity, we refer to this as $X_1$ and not as D here.)181 Let dim(X) = d and let $\lambda = (1, 0, \ldots, 0)$, then $\frac{\partial E[Y|X]}{\partial x_1} = \partial^\lambda E[Y|X] = \partial^\lambda m(X)$. Calculating (139) and (140) we obtain

$$D(w) = \bar D(x) = 1$$

and

$$\alpha(w) = -\frac{\partial^\lambda f(x)}{f(x)}\,\{y - m(x)\}.$$

The influence function (138) for the average derivative estimator is thus

$$\psi(W) = \frac{\partial E[Y|X]}{\partial x_1} - \theta_0 - \frac{\partial f(X)}{\partial x_1}\, \frac{Y - E[Y|X]}{f(X)}$$
181 If one were interested in the average derivative with respect to all elements of X separately, one could define M as $\frac{\partial E[Y|X]}{\partial x} - \theta$, where θ would now be a vector of dimension d. This, of course, requires that all variables in X be continuous.

and the asymptotic variance thus

$$E[\psi(W)\psi(W)'] = E\left[\left(\frac{\partial E[Y|X]}{\partial x_1} - \theta_0\right)^2\right] + E\left[\left(\frac{\partial \ln f(X)}{\partial x_1}\right)^2 (Y - E[Y|X])^2\right].$$

Now we examine the adjustment terms when the nonparametric part h is a density function f(x). We will find a similar kind of adjustment term α(w) as for the case when h was a conditional expectation. Again, if h contains several components, the terms $\alpha_1(w)$, $\alpha_2(w)$ etc. add up. E.g. if h contains a conditional expectation and a density, we would have two terms $\alpha_1(w)$ and $\alpha_2(w)$. If h contains a conditional density f(x|z), we can write this as f(x|z) = f(x,z)/f(z) and calculate the adjustment terms separately for f(x,z) and f(z).

Now, we examine the case where h is a density function or a derivative of it. Suppose that $h = \partial^\lambda f(x)$. Let

$$D(w) = \frac{\partial M(w;\theta_0,h)}{\partial h}\bigg|_{h=\partial^\lambda f(x)} \qquad (142)$$

and

$$\bar D(x) = E[D(W)|X=x] \qquad (143)$$

and

$$\delta(x) = (-1)^{|\lambda|}\, \partial^\lambda\left(\bar D(x)\, f(x)\right), \qquad (144)$$

then

$$\alpha(W) = \delta(X) - E[\delta(X)]. \qquad (145)$$

Consider a simple example: $M(W;\theta,h) = f(X) - \theta$. Then D(w) = 1, δ(x) = f(x) and α(W) = f(X) − E[f(X)]. We thus obtain

$$\psi(W) = f(X) - \theta_0 + f(X) - E[f(X)]$$

and therefore, since $\theta_0 = E[f(X)]$, the asymptotic variance

$$E[\psi(W)^2] = 4\, E\left[\{f(X) - E[f(X)]\}^2\right] = 4\, Var(f(X)).$$

The same result will be obtained further below.



15.2.2 Semiparametric e¢ ciency bounds

An interesting result is that an analogue to the Cramer-Rao bound from the parametric approach exists for semiparametric estimators. In parametric estimation, the analysis of efficiency is greatly simplified by the Cramer-Rao bounds and the Gauss-Markov theorem. Both these theorems establish, for a large class of models, lower bounds on the variance of any estimator within this class. Hence, no estimator in this class can have a variance lower than this bound, and any estimator that attains this bound is efficient. A similar type of variance bound often exists for many semiparametric problems if $\sqrt N$ consistent estimation of θ is possible. Despite the presence of the infinite-dimensional object h, which requires a nonparametric first step estimator, θ can often nevertheless be estimated at rate $\sqrt n$, and a variance bound can often be calculated which gives a lower bound on the variance of all possible semiparametric estimators. If such a semiparametric variance bound exists, no semiparametric estimator can have lower variance than this bound, and any estimator that attains this bound is semiparametrically efficient. In addition, if the variance bound is infinitely large, this implies that no $\sqrt n$ consistent estimator can exist.

Newey (1994a) shows how this variance bound can help to compute the variance of semiparametric estimators. An interesting finding is that the specific method used to estimate h, e.g. kernel or series estimation, does not affect the (first-order) variance of the estimator. Chen, Linton, and van Keilegom (2003) extend these results to non-smooth criterion functions, which are helpful e.g. for quantile estimators. Newey (2004) considers GMM estimators.

The derivation of such bounds can most easily be illustrated in a likelihood context. Newey (1994a) describes other approaches to obtaining the variance bound. Consider the log likelihood function

$$\ln L_N(\theta, h) = \frac1N \sum_{i=1}^N \ln L(W_i; \theta, h),$$

which is maximized at the values $\theta_0$ and $h_0$, where the derivative has expectation zero. Consider first the situation where the nuisance parameters $h_0$ are finite dimensional. Then the information matrix provides the Cramer-Rao lower bound, using partitioned inversion, as

$$V_\theta = \left(I_{\theta\theta} - I_{\theta h}\, I_{hh}^{-1}\, I_{h\theta}\right)^{-1},$$

where $I_{\theta\theta}$, $I_{\theta h}$, $I_{hh}$, $I_{h\theta}$ are the respective submatrices of the information matrix for (θ, h).

For ML estimation we obtain

$$\sqrt N(\hat\theta - \theta_0) \xrightarrow{d} N(0, V_\theta).$$

If $I_{\theta h}$ is nonzero, there is an efficiency loss when h is unknown.

Loosely speaking, the semiparametric variance bound V is the largest value of $V_\theta$ over all possible parametric models that nest $\ln L_N(\theta, h_0)$ for some value of h. There are various different methods to derive semiparametric variance bounds, and these have been used frequently in recent years to characterize the efficiency of various treatment evaluation estimators. An estimator that attains the semiparametric variance bound,

$$\sqrt N(\hat\theta - \theta_0) \xrightarrow{d} N(0, V),$$

is called semiparametrically efficient.

In some situations, the semiparametric estimator may even attain the variance $V_\theta$, which means that considering h as a nonparametric function does not lead to an efficiency loss compared to the parametric case. Such an estimator is called adaptive. One example is the heteroskedastic linear model, where we specify the mean $E[Y|X=x] = x'\beta$ as linear and the variance function $Var[Y|X=x] = \sigma^2(x)$ is nonparametric (i.e. not specified). Robinson (1987) proposed a variant of FGLS using a nonparametric k-NN estimator of $\sigma^2(x)$ and showed that this estimator attains the Gauss-Markov bound. (See Section 9.7.6 in Cameron and Trivedi.)

In some situations, semiparametric efficiency bounds have been derived, but no estimator is known that attains the bound. Hence, there exist $\sqrt N$ consistent estimators that all have variance larger than V and often have non-normal asymptotic distributions. Finally, there may be situations where $\sqrt N$ consistent estimation is not possible, and estimators converge at lower rates, e.g. the binary fixed effect panel data maximum score estimator of Manski.

Ichimura and Todd (2008, Section 3.1.3) provide another example, where the possibility of $\sqrt n$ convergence depends on the statistical assumption being made: a mean-zero restriction does not admit $\sqrt n$ consistent estimation, whereas a median-zero restriction does. Consider the linear regression model with only an intercept, where Y is censored at zero. Intuitively, one might think of a latent model

$$Y^* = \theta + U \qquad\text{and}\qquad Y = Y^*\, 1(Y^* > 0),$$

where estimation of θ is of interest. A parametric approach would specify the distribution of U to belong to a class of density functions indexed by a finite set of parameters, e.g. N(a, b).

The semiparametric approach remains agnostic about the density of U as far as possible. The density $f_U$ of U is supposed to belong to a class Ψ of density functions. This problem is semiparametric in the sense that a single parameter θ is of interest, but that part of the model, i.e. the density of U, is a function.

Obviously, without any restrictions on Ψ, the parameter θ is not identified. One could restrict the class Ψ to densities with mean zero. An alternative restriction would be to impose median zero. The class F of density functions of the observed Y is

$$\mathcal F = \left\{ f_Y : f_Y(y) = f_U(y-\theta)^{1(y>0)} \left[\int_{-\infty}^{-\theta} f_U(s)\, ds\right]^{1(y=0)},\ f_U \in \Psi \right\}.$$

Suppose $f_U$ were known. The most efficient estimator of θ would then be the ML estimator, with asymptotic variance obtained from the inverse of the Hessian. The semiparametric efficiency bound is obtained by considering the worst case choice of $f_U$, i.e. the sup of the variance over the class Ψ. Loosely speaking, we could consider the semiparametric efficiency bound as

$$\sup_{\mathcal F}\ \inf_{\text{all estimators}} AsyVar(\hat\theta).$$

If the bound is infinite, no $\sqrt n$ consistent semiparametric estimator can exist (unless further structure is imposed). On the other hand, a finite bound does not imply that feasible $\sqrt n$ consistent estimators indeed exist. In other words, the bound may not be achievable.

Semiparametric efficiency bounds were introduced by Stein (1956) and developed by Koshevnik and Levit (1976), Pfanzagl and Wefelmeyer (1982), Begun, Hall, Huang, and Wellner (1983) and Bickel, Klaassen, Ritov, and Wellner (1993). See also the survey of Newey (1990) or Newey (1994a). The following discussion is from Newey (1994a). Let γ denote the object of interest, which depends on the true distribution function F(z) of the data. Let f(z) be the density of the data. Let F be a general family of distributions and $\{F_\eta : F_\eta \in \mathcal F\}$ a one-dimensional subfamily of F. Let $F_{\eta=\eta_0}$ be the true distribution function and $F_{\eta\ne\eta_0}$ the other distribution functions from the class F. The pathwise derivative of $\gamma(F_\eta)$ is a vector d(z) with E[d(Z)] = 0 and $E[\|d(Z)\|^2] < \infty$ such that for every path

$$\frac{\partial \gamma(F_\eta)}{\partial\eta}\bigg|_{\eta=\eta_0} = E[d(Z)\, S(Z)]\Big|_{\eta=\eta_0}, \qquad (146)$$

where $S(z) = \partial \ln f(z|\eta)/\partial\eta$ is the score function. The semiparametric variance bound when estimating γ under general misspecification is Var(d(Z)). The score function is mean zero at the true distribution $F_{\eta_0}$:

$$E_{\eta_0}[S(Z)] = \int \frac{\partial \ln f_\eta(z)}{\partial\eta}\bigg|_{\eta=\eta_0} f_{\eta_0}(z)\, dz = \int \frac{\partial f_{\eta_0}(z)}{\partial\eta}\, dz = \frac{\partial}{\partial\eta}\int f_\eta(z)\, dz = 0,$$

provided conditions for interchanging integration and differentiation hold.


A simple example illustrates the use of this formula. Consider estimation of E [f (Z)],
P^ R
which could be estimated e.g. as ^ = n1 f (Zi ). Hence, 0 = f 20 (z)dz and @ (F 0 )=@ =
R
2@f 0 =@ f 0 . The function d(z) = 2 (f 0 (z) 0 ) is mean zero and satis…es (146). Hence,

the semiparametric variance bound is 4V ar (f 0 (Z)).

This formula is also related to the in‡uence function representation of linear estimators. ^
is asymptotically linear with in‡uence function (z) if

p 1 X
n( ^ 0 ) = p (Zi ) + op (1)
n
i

where E[ (Z)] = 0 and V ar[ (Z)] < 1. Under certain conditions, e.g. Theorem 2.1 of Newey
(1994a), (z) = d(z) in (146). Hence, if a d(z) can be found that satis…es (146), the remainder
p P
term n( ^ 0) p1
n
(Zi ) can usually be shown to be small.

One particular example for calculating it is the ATE in Hahn (1998). In a first step, one specifies the density function of the observed variables. For the ATE example, (Y, D, X) with D binary are the observed variables. The potential outcomes $Y^1$ and $Y^0$ are observed only conditional on X. The joint density of (Y, D, X) can be related to the potential outcomes by

$$f(y,d,x) = f(y|d,x)\, f(d|x)\, f(x) = \{f_1(y|x)\, p(x)\}^d\, \{f_0(y|x)\,(1-p(x))\}^{1-d}\, f(x),$$

where $f_1(y|x) \equiv f(y|d=1,x)$ and $p(x) = P(D=1|X=x)$.

Consider a regular parametric submodel indexed by η, with $\eta_0$ corresponding to the true model: $f(y,d,x|\eta_0) = f(y,d,x)$. The density $f(y,d,x|\eta)$ can be written as

$$f(y,d,x|\eta) = \{f_1(y|x,\eta)\, p(x,\eta)\}^d\, \{f_0(y|x,\eta)\,(1-p(x,\eta))\}^{1-d}\, f(x,\eta).$$

Suppose that $f_1$ and $f_0$ admit an interchange of the order of integration and differentiation:

$$\int \frac{\partial f_1(y|x,\eta)}{\partial\eta}\, dy = \frac{\partial}{\partial\eta}\int f_1(y|x,\eta)\, dy = 0$$

and analogously for $f_0$.182

The corresponding score of $f(y,d,x|\eta)$ is

$$S(y,d,x|\eta) = \frac{\partial \ln f(y,d,x|\eta)}{\partial\eta} = d\, \dot s_1(y|x,\eta) + (1-d)\, \dot s_0(y|x,\eta) + \frac{d - p(x,\eta)}{1 - p(x,\eta)}\, \dot s_p(x,\eta) + \dot s_x(x,\eta),$$

where $\dot s_1(y|x,\eta) = \partial \ln f_1(y|x,\eta)/\partial\eta$, and $\dot s_0$ analogously, $\dot s_p(x,\eta) = \partial\ln p(x,\eta)/\partial\eta$ and $\dot s_x(x,\eta) = \partial\ln f(x,\eta)/\partial\eta$. At the true value $\eta_0$ the expectation of the score is zero:

$$E[S(y,d,x|\eta)]\Big|_{\eta=\eta_0} = 0.$$

The tangent space of the model is the set of functions that are mean zero and satisfy the additive structure of the score:

$$\mathcal T = \left\{d\, s_1(y|x) + (1-d)\, s_0(y|x) + (d - p(x))\, s_p(x) + s_x(x)\right\} \qquad (147)$$

for any functions $s_1, s_0, s_p, s_x$ satisfying the mean-zero properties

$$\int s_1(y|x)\, f_1(y|x)\, dy = 0 \quad \forall x, \qquad \int s_0(y|x)\, f_0(y|x)\, dy = 0 \quad \forall x, \qquad \int s_x(x)\, f(x)\, dx = 0,$$

and $s_p(x)$ being a square-integrable measurable function of x.

The semiparametric variance bound of the parameter of interest γ, e.g. the average treatment effect, is the variance of the projection on $\mathcal T$ of a function ψ(Y,D,X) (with E[ψ] = 0 and $E[\|\psi\|^2] < \infty$) that satisfies for all regular parametric submodels

$$\frac{\partial\gamma(F_\eta)}{\partial\eta}\bigg|_{\eta=\eta_0} = E[\psi(Y,D,X)\, S(Y,D,X)]\Big|_{\eta=\eta_0}. \qquad (148)$$

Consider the example of the ATE for a parametric submodel η:

$$\gamma(F_\eta) = \int \left(E_\eta[Y|X=x, D=1] - E_\eta[Y|X=x, D=0]\right) f(x,\eta)\, dx = \int \left(\int y f_1(y|x,\eta)\, dy - \int y f_0(y|x,\eta)\, dy\right) f(x,\eta)\, dx.$$

182 Sufficient conditions for permitting interchanging differentiation and integration are, for example, given by Theorem 1.3.2 of Amemiya (1985). These are: $\partial f_1(y|x,\eta)/\partial\eta$ is continuous in $\eta \in H$ and y, where H is an open set; $\int f_1(y|x,\eta)\, dy$ exists; and $\int |\partial f_1(y|x,\eta)/\partial\eta|\, dy < \infty$ for all $\eta \in H$.

Computing the pathwise derivative and evaluating it at $\eta_0$ gives:

$$\frac{\partial\gamma(F_\eta)}{\partial\eta}\bigg|_{\eta=\eta_0} = \ldots = \int\!\!\int y\left\{\dot f_1 - \dot f_0\right\} f(x)\, dy\, dx + \int \left(m_1(x) - m_0(x)\right) \dot f(x)\, dx,$$

where $\dot f_1 = \frac{\partial}{\partial\eta} f_1(y|x,\eta)\big|_{\eta=\eta_0}$, $\dot f_0 = \frac{\partial}{\partial\eta} f_0(y|x,\eta)\big|_{\eta=\eta_0}$ and $\dot f = \frac{\partial}{\partial\eta} f(x,\eta)\big|_{\eta=\eta_0}$.

Now, we have to find a function ψ(Y,D,X) that satisfies (148). There is often no direct way to compute ψ, and one simply tries different possible functions. We might attempt to choose ψ(Y,D,X) as

$$\psi(y,d,x) = d\, \frac{y - m_1(x)}{p(x)} + (1-d)\, \frac{m_0(x) - y}{1 - p(x)} + m_1(x) - m_0(x) - \gamma. \qquad (149)$$

With some derivations we would see that ψ indeed satisfies (148).

The next step is to compute the projection of ψ(·) on the tangent space $\mathcal T$. For this example, we can verify after some calculations that ψ itself lies in the tangent space (147):

$$\psi \in \mathcal T.$$

Hence, the projection is ψ(·) itself.


Since ψ lies in the tangent space, the variance bound is the expected square of ψ:

$$E\left[\psi(Y,D,X)^2\right] = E\left[\frac{\sigma^2_{Y^1}(X)}{p(X)} + \frac{\sigma^2_{Y^0}(X)}{1 - p(X)} + \left(m_1(X) - m_0(X) - \gamma\right)^2\right],$$

where $\sigma^2_{Y^1}(x) = Var[Y|X=x, D=1]$ and $\sigma^2_{Y^0}(x) = Var[Y|X=x, D=0]$.

If we find an estimator that achieves this bound, we can consider it semiparametrically efficient. There may often be many different estimators that are efficient. Among them we might prefer those estimators that are consistent and efficient under the weakest regularity assumptions.183 Further considerations are Monte Carlo finite sample properties and ease of implementation.

15.2.3 Choice of smoothing parameters

The optimality properties of all semiparametric estimators depend on an appropriate choice of the smoothing parameters, e.g. the bandwidth value.
183 Remember the discussion about weighting versus matching estimators.

The rules developed for nonparametric estimation no longer apply, and the optimal bandwidth choice will depend on the particular semiparametric estimator being used. This problem has received much less attention than the nonparametric regression case. Bandwidth selection approaches are often based on deriving the formula for the mean squared error of the estimator for a fixed bandwidth h, estimating the terms appearing in this formula and minimizing this formula with respect to h. Oftentimes it may be necessary to also retain second order terms in the MSE derivations to obtain a more precise approximation for moderate values of h.

Compared to nonparametric regression, often some kind of asymptotic undersmoothing is required to eliminate the asymptotic bias term. To provide one illustrative example, consider again the matching estimator of the expected potential outcome. The MSE of this estimator clearly depends on the bandwidth value. Consider:

$$MSE\left(\widehat{E[Y^1]}\right) = \left(Bias\left(\frac1n\sum_{i=1}^n \hat m_1(X_i; h)\right)\right)^2 + Var\left(\frac1n\sum_{i=1}^n \hat m_1(X_i; h)\right) = \left(E\left[Bias(\hat m_1(X; h))\right]\right)^2 + Var\left(\frac1n\sum_{i=1}^n \hat m_1(X_i; h)\right),$$

where we also could use that $\hat m_1$ usually is a linear smoother: $\hat m_1(x_0) = \sum_{j: D_j=1} w_j(x_0)\, Y_j$. The bias of a local constant or local linear regression estimator is of order $h^2$. Since the variance is of order $\frac{1}{Nh^d}$, where d is the number of continuous regressors, for minimizing MISE we usually choose, as discussed above, the bandwidth $h = O(N^{-\frac{1}{4+d}})$. This yields a squared bias of order $O(N^{-\frac{4}{4+d}})$, which converges more slowly to zero than $O(N^{-1})$. If we used a kernel of order r, for an interior point the optimal bandwidth would be $h = O(N^{-\frac{1}{2r+d}})$ and the squared bias thus $O(N^{-\frac{2r}{2r+d}})$. A similar result can be obtained for local polynomial regression, where one has to distinguish between p being odd or even when estimation is at an interior point. In all cases, this is slower than $O(N^{-1})$. However, we know that $\sqrt N$ consistent estimation is possible, provided kernels of higher order or local polynomial regression of higher order are used. This requires the squared bias to be of order $O(N^{-1})$. To achieve this we need to choose

$$h = O(N^{-\frac14}) \qquad\text{instead of}\qquad O(N^{-\frac{1}{4+d}}).$$

Hence, h has to converge to zero faster than a bandwidth selector such as cross-validation or
the usual plug-in methods would deliver. However, there is no general data-driven method

available for choosing the bandwidth value as the optimal choice di¤ers among semiparametric
estimators.

This will often require the use of higher-order kernels that reduce the bias of the nonpara-
metric regression estimate. It seems, though, that such higher order kernels are rarely used in
practice.

15.2.4 Average derivative estimation (skip this chapter)

One example of a semiparametric estimator with nonparametric plug-in estimates was the estimation of the average treatment effect under selection on observables:

$$\int \left(m(x, d=1) - m(x, d=0)\right) f(x)\, dx,$$

where f(x) is the density of X in the population. Alternatively, any other weighting function could be used instead, e.g. the distribution of X to be expected after a policy change. A widely studied parameter is the average derivative

$$\int \frac{\partial m(x,d)}{\partial d}\, f(x)\, dx,$$

which could be considered as the marginal treatment effect if d is increased marginally for each individual from the level actually observed. Direct estimation of this average derivative is obvious.
Härdle and Stoker (1989) considered an average derivative estimator of the form

$$\delta = \int \frac{\partial m(x)}{\partial x}\, f(x)\, dx,$$

which could be estimated either directly as184

$$\hat\delta = \frac1n\sum \hat m'(X_i)$$

or indirectly by using an integration by parts derivation:

$$\int_{-\infty}^{\infty} \frac{\partial m(x)}{\partial x}\, f(x)\, dx = m(x)f(x)\Big|_{-\infty}^{\infty} - \int m(x)\, \frac{\partial f(x)}{\partial x}\, dx = -\int m(x)\, \frac{f'(x)}{f(x)}\, f(x)\, dx = -E\left[Y\, \frac{f'(X)}{f(X)}\right],$$
1
184 Stoker also pointed out that for semiparametric models of the single index type, $E[Y|X=x] = m(x'\beta_0)$, the derivative is $\partial E[Y|X=x]/\partial x = m'(x'\beta_0)\,\beta_0$ and is thus proportional to $\beta_0$. Hence $E[\partial E[Y|X=x]/\partial x]$ would estimate $\beta_0$ up to scale.

provided that $m(x)f(x)\big|_{-\infty}^{\infty} = 0$. The term $m(x)f(x)\big|_{-\infty}^{\infty}$ is shorthand notation for $\lim_{x\to\infty} m(x)f(x) - \lim_{x\to-\infty} m(x)f(x)$. Since f(x) converges to zero for $x \to \infty$ or $x \to -\infty$, the product m(x)f(x) also converges to zero, unless m(x) diverges to $\pm\infty$. In most economic applications it will be reasonable to assume that m(x) either converges to some upper bound or diverges at a slower rate than f(x) converges to zero, for $x \to \pm\infty$. In this case, m(x)f(x) converges to zero in the tails. Hence, δ can be estimated as the expected value of Y weighted by $-f'(X)/f(X)$, plugging in nonparametric estimators of f' and f. Despite its neat form, the weighting by f'(X)/f(X) requires that f(X) is bounded away from zero over the support of X. Otherwise the asymptotic properties may be poor, particularly for uniform convergence. This assumption of boundedness away from zero appears also with many other weighting estimators, e.g. weighting by the propensity score. Intuitively it requires that the density suddenly drops to zero at the boundaries of the support. For many economic variables, e.g. income, this assumption is not very reasonable. In these cases it is necessary to trim, i.e. to eliminate observations where $\hat f(X)$ is zero or very small, as discussed in Ichimura and Todd (2008). The average derivative estimator is then

$$\hat\delta = -\frac1n\sum Y_i\, \frac{\hat f'(X_i)}{\hat f(X_i)}\, \hat I_i,$$

where $\hat I_i = 1(\hat f(X_i) > c_n)$ for some sequence of numbers $c_n \to 0$ as $n \to \infty$. Alternatively, one could impose a floor on the estimated $\hat f(X)$, i.e. set all values below a certain threshold to this threshold.
Härdle and Stoker (1989) showed $\sqrt n$ convergence of the estimator with variance matrix

$$E\left[\frac{\partial m}{\partial x}\frac{\partial m}{\partial x'}\right] - E\left[\frac{\partial m}{\partial x}\right]E\left[\frac{\partial m}{\partial x}\right]' + E\left[U^2\, \frac{\partial \ln f}{\partial x}\frac{\partial \ln f}{\partial x'}\right]$$

and derived conditions for bandwidth choice.
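A minimal sketch of the trimmed density-weighted average derivative estimator (Python; kernel, bandwidth and trimming constant are our own choices):

```python
# Sketch: Haerdle-Stoker density-weighted average derivative -E[Y f'(X)/f(X)],
# with kernel estimates of f and f' and trimming where f_hat is small.
import numpy as np

def kde_and_deriv(x, X, h):
    """Gaussian-kernel estimates of f and f' at the points in x."""
    u = (x[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    f = K.mean(axis=1) / h
    fprime = (-u * K).mean(axis=1) / h ** 2   # derivative of the Gaussian kernel
    return f, fprime

rng = np.random.default_rng(10)
n = 1000
X = rng.standard_normal(n)
Y = X ** 2 + 0.3 * rng.standard_normal(n)     # m(x) = x^2, so E[m'(X)] = E[2X] = 0

f, fp = kde_and_deriv(X, X, h=0.3)
w = np.where(f > 0.01, fp / f, 0.0)           # trimming: I_i = 1(f_hat(X_i) > c_n)
delta_hat = -np.mean(Y * w)
print("density-weighted average derivative:", delta_hat)   # near 0
```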

15.2.5 Alternative dimension reduction approaches: Partial means

In the first chapters, estimation of the average treatment effect was discussed. Under selection on observables, the average potential outcome is identified as

$$E[Y^d] = \int m(X, d)\, dF_X.$$

If d is discrete, e.g. binary, $E[Y^d]$ can be estimated under certain regularity assumptions at $\sqrt n$ rate, and may therefore be considered a semiparametric estimator. If d is continuous, however, the $\sqrt n$ rate cannot be achieved, and $E[Y^d]$ cannot be considered as semiparametric as it is not finite dimensional. Nevertheless, it fits the spirit of this chapter in that the dimensionality of the final object is much lower than that of the plug-in nonparametric estimator. Under certain conditions, $E[Y^d]$ can be estimated at the rate for dim(d)-dimensional nonparametric regression despite the high dimensional nonparametric first stage. In other words, for one-dimensional d the rate $n^{-\frac25}$ may be attainable.
Newey (1994b) analyzed the properties of nonparametric estimators of partial means based
on kernel regression estimators. A recent paper by Imbens and Ridder (2006) extended this to
generalized partial means, where for a Q-dimensional random vector X the conditional mean
is averaged only over a subset of the regressors in X. Let X_1 be a subvector of X which is
"averaged out" whereas the remaining X_2 are fixed. The partial mean is then

   E[m(X_1, x_2)] = ∫ m(x_1, x_2) dF_{X_1} ,

which corresponds to the above situation where x_2 is the fixed treatment variable. The
generalized partial mean (GPM) permits a weight function ω(X_1). Further it allows for X_2 to be
evaluated at a known function X_2 = t(X_1), which includes the previous case where t(X_1) = x_2
was the constant function. In addition a known function φ is introduced which may combine
various components of m(·) for multivariate Y:

   GPM = E[ω(X_1)' φ(m(X_1, t(X_1)))] .

They show that under certain conditions, the convergence rate depends only on the number
of variables in X_2.
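To make the partial mean concrete, here is a minimal sketch for scalar X_1 and X_2 with a
Nadaraya-Watson first stage; the function name and the common bandwidth h are illustrative
assumptions, not a proposal from the papers cited above.

   import numpy as np

   def partial_mean(y, x1, x2, x2_fix, h):
       # E[m(X_1, x2_fix)]: kernel-regress Y on (X_1, X_2), then average the
       # fitted surface over the empirical distribution of X_1, holding X_2 fixed.
       def m_hat(a1, a2):  # Nadaraya-Watson estimate of E[Y | X1=a1, X2=a2]
           w = np.exp(-0.5 * (((x1 - a1) ** 2 + (x2 - a2) ** 2) / h**2))
           return np.sum(w * y) / np.sum(w)
       return np.mean([m_hat(a1, x2_fix) for a1 in x1])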

16 Quantile regression

References:
- Koenker (2005)
- Buchinsky (1998)
- Chernozhukov and Hansen (forthcoming, Journal of Econometrics)

So far we have been interested mainly in estimating the conditional mean function

   E[Y|X] ,

e.g. in the linear model

   Y = X'β + U

with E[U|X] = 0.

We might also be interested in the entire distribution of Y given X.


In principle, we can use a similar approach as before by defining

   F_{Y|X}(a, x) = E[1(Y ≤ a) | X = x]

and estimating the entire distribution function. Instead of the distribution function one might
very often be interested in the quantiles. In principle, one could simply invert the estimated
cdf. Nevertheless, a substantial literature has developed which aims to estimate the quantiles
directly. We will later see that, from a nonparametric viewpoint, these approaches are quite
related. For parametric models, the estimation procedures are rather different, though.

The τ-quantile of a variable Y is defined as

   Q_Y^τ = F_Y^{-1}(τ) = inf{ a : F_Y(a) ≥ τ } .

If Y is continuous with strictly monotonically increasing cdf, there will be one unique value a
that satisfies F_Y(a) = τ. This is the case if F_Y has a density and f_Y(Q_Y^τ) > 0. Otherwise, the
smallest such value is chosen.^185 Given a random iid sample {Y_i}_{i=1}^n, one could estimate
the quantile by

   Q̂_Y^τ = inf{ a : F̂_Y(a) ≥ τ } ,

plugging in the empirical distribution function of Y. Such an approach bears a close similarity
with sorting the observed values of Y_i in ascending order.
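As a quick illustration of this plug-in rule, the following sketch picks the smallest order statistic
a with F̂_Y(a) ≥ τ and compares it with a library quantile that uses the same left-continuous
convention (the data are of course only an example):

   import numpy as np

   y = np.random.default_rng(0).normal(size=500)
   tau = 0.75
   y_sorted = np.sort(y)
   # smallest a with Fhat(a) = k/n >= tau is the order statistic k = ceil(n*tau)
   q_hat = y_sorted[int(np.ceil(tau * len(y))) - 1]
   # method="inverted_cdf" matches the inf-definition (NumPy >= 1.22)
   print(q_hat, np.quantile(y, tau, method="inverted_cdf"))  # should coincide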

16.0.6 Properties of quantiles

The quantile function Q_Y^τ is nondecreasing in τ.

If Y has cdf F, then F^{-1}(τ) gives the quantile function. The quantile function of -Y is
given by

   Q_{-Y}^τ = -F^{-1}(1 - τ) .
^185 The cdf is right continuous, the quantile function is left continuous.


- Equivariance to monotone transformations: Let h(·) be a nondecreasing function on R. Then

   Q_{h(Y)}^τ = h(Q_Y^τ) .

The mean does not share this property: in general E[h(Y)] ≠ h(E[Y]).
On the other hand, no "iterated expectation" property exists for the quantiles: whereas
E[Y] = E[E[Y|X]], no such property exists for quantiles.

- Median regression is more robust to outliers than mean regression.

16.1 Parametric quantile regression

In applied work, we are usually interested in estimating some relationship between various
covariates and will examine conditional quantiles Q_{Y|X}^τ further below. Unless X is discrete
(with few mass points), estimation by sorting and exact conditioning on X will be futile. Therefore
some parametric assumption may be helpful. Assume a linear model

   Q_{Y|X}^τ = α_0^τ + X'β_0^τ ,

or equivalently

   Y = α_0^τ + X'β_0^τ + U    with Q_{U|X}^τ = 0.

In other words, at the true values α_0^τ and β_0^τ the τ-quantile of Y - α_0^τ - X'β_0^τ should be
zero. Consider the following examples:

As the second example shows, the quantiles would cross if the X variable could take negative
values. If we specify a linear model, this will always happen at some value of x, unless all quantile
functions are parallel. Consider e.g. the model

   Y = β_0 + X β_1 + (δ_0 + δ_1 X) U .

If δ_0 = 0 and δ_1 > 0, all conditional quantiles will pass through the point (0, β_0).

16.1.1 Some examples

Union wage premium: Chamberlain (1994) considered the union wage premium, regressing
the log hourly wage on a union dummy for men with 20 to 29 years of work experience and
other covariates. The estimated union coefficients are:

   OLS      τ = 0.1    τ = 0.25    τ = 0.5    τ = 0.75    τ = 0.9
   0.158    0.281      0.249       0.169      0.075       -0.003

For the moment we abstract from a causal interpretation. The results show that on average
the wage premium is 16% and is similar to the premium for the median earner. For the lower
quantiles it is very large, while for the upper quantiles it is zero or even negative. The following
graph shows a (hypothetical) distribution of log wages, conditional on X, in the union and
non-union sector, which illustrates the above estimates.

Demand for alcohol: Manning, Blumberg, and Moulton (1995) examine the demand for
alcohol, estimating

   log consumption_i = α + β_1 log price_i + β_2 log income_i + U

at different quantiles. Income_i is the annual income of individual i and consumption_i is the
annual alcohol consumption. Price_i is a price index for alcoholic beverages, which is computed
for the place of residence of individual i. Hence, it varies between individuals who live in
different locations. For about 40% of the observations consumption is zero, such that price
and income responses are zero for low quantiles. For larger quantiles the income elasticity is
relatively constant at about 0.25. The price elasticity shows more variation. Its value is largest,
in absolute terms, at τ = 0.7 and demand becomes very inelastic for low levels of consumption
(τ ≈ 0.4) and also for high levels of consumption (τ = 1). Hence, individuals with very low
demand and also those with very high demand are insensitive to price changes, whereas those
with average consumption show a stronger price response. A conventional mean regression
would not detect this kind of heterogeneity.

16.1.2 Optimization instead of ordering

We consider first the situation without X covariates. Define the asymmetric loss function

   ρ_τ(u) = u (τ - 1(u < 0))                                          (150)

and consider the optimization problem

   arg min_b E[ρ_τ(Y - b)] .                                          (151)

Before, we had usually examined the square loss function u², which led to the least squares
estimator. For the median, τ = 1/2, the loss function (150) is (half) the absolute loss function.
For other values of τ it gives an asymmetric absolute loss function.^186
Suppose that a density exists and is positive at the value Q_Y^τ, i.e. f_Y(Q_Y^τ) > 0; then the
minimizer in (151) is^187 Q_Y^τ.
^186 Hence, the following estimators are not only for quantile regression, but can also be used if an asymmetric
loss function is appropriate. For example, a financial institution might value the risk of large losses higher (or
lower) than the chances of large gains.
^187 Proof: Suppose that the quantile Q_Y^τ is unique. The interior solution to arg min_b E[ρ_τ(Y - b)] is given
by the first order condition:

   ∂/∂b ∫ (y - b)(τ - 1(y - b < 0)) dF_Y = ∂/∂b [ (τ - 1) ∫_{-∞}^{b} (y - b) dF_Y + τ ∫_{b}^{∞} (y - b) dF_Y ] .

Applying the Leibniz rule of differentiation gives

   = (τ - 1) ∫_{-∞}^{b} (-1) dF_Y + τ ∫_{b}^{∞} (-1) dF_Y + 0 - 0
   = -(τ - 1) F_Y(b) - τ (1 - F_Y(b))
   = F_Y(b) - τ ,

which is zero for F_Y(b) = τ.



Hence, minimizing E[ρ_τ(Y - b)] leads to an estimator of the quantile. An alternative
interpretation is that b is chosen such that the τ-quantile of Y - b is set to zero. In other words, it
follows that

   E[τ - 1(Y - Q_Y^τ < 0)] = 0 .

An estimator of the quantile is thus

   β̂ = arg min_{b∈R} Σ_{i=1}^{n} ρ_τ(Y_i - b) .

The following shows the objective function for τ = 1/2 and sample sizes 1, 2, 3 and 4.

As the figure shows, the objective function is not differentiable everywhere. It is
differentiable except at the points at which one or more residuals are zero. At such points, it has
directional derivatives in all directions, depending on the direction of evaluation. The above
figures also show that the objective function is flat at its minimum when nτ is an integer.
The solution is typically at a vertex. To verify optimality one needs only to verify that
the objective function is nondecreasing along all edges.
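As a quick numerical illustration of (151), the following sketch minimizes the sample check loss
and compares the result with the plug-in quantile; since the objective is piecewise linear with
vertices at the data points, searching over the observed values suffices.

   import numpy as np

   rng = np.random.default_rng(1)
   y, tau = rng.exponential(size=199), 0.25

   def check_loss(u, tau):
       return u * (tau - (u < 0))    # rho_tau(u) = u (tau - 1(u < 0))

   # the minimizer is attained at an order statistic, so search over the data
   losses = [np.sum(check_loss(y - b, tau)) for b in y]
   b_star = y[int(np.argmin(losses))]
   print(b_star, np.quantile(y, tau, method="inverted_cdf"))  # should coincide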

Similar to the above proof, one can show that, when including covariates X,

   arg min_b E[ρ_τ(Y - X'b)] = arg zero_b E[(τ - 1(Y < X'b)) X]       (152)

if there is a unique interior solution. This relationship will be useful for deriving a GMM
estimator later.

Now, assume a linear quantile regression model with a constant included in X:

   Y = X'β_0^τ + U    with Q_{U|X}^τ = 0.

In other words, at the true values β_0^τ the τ-quantile of Y - X'β_0^τ should be zero. This suggests
the linear quantile regression estimator

   β̂_τ = arg min_β Σ_{i=1}^{n} ρ_τ(Y_i - X_i'β) .                    (153)

Using the relationship (152) we could choose β to set the moment conditions

   Σ_{i=1}^{n} (τ - 1(Y_i < X_i'β)) X_i

to zero. In finite samples, it will not always be possible to set this exactly to zero^188, so we
set it as close to zero as possible. For n → ∞, the distance from zero will vanish. In other
words, for finite n the objective function (153) is not differentiable, whereas E[ρ_τ(Y - X'b)]
usually will be. The objective function is piecewise linear and continuous. It is differentiable
everywhere except at those values of β where Y_i - X_i'β = 0 for at least one sample observation.
At those points, the objective function has directional derivatives, which depend on the direction
of evaluation. If at a point β̂ all directional derivatives are nonnegative, then β̂ minimizes the
objective function (153). See Koenker (2005) for further discussion.
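In practice one rarely codes (153) by hand. As a hedged illustration, statsmodels provides a
linear quantile regression routine; the data-generating process below is only an example:

   import numpy as np
   import statsmodels.api as sm

   rng = np.random.default_rng(2)
   n = 1000
   x = rng.uniform(0, 4, n)
   y = 1 + 0.5 * x + (0.5 + 0.5 * x) * rng.normal(size=n)  # location-scale model
   X = sm.add_constant(x)
   for tau in (0.25, 0.5, 0.75):
       fit = sm.QuantReg(y, X).fit(q=tau)   # minimizes the check loss (153)
       print(tau, fit.params)               # slopes differ across quantiles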
In analytical derivations one may still consider

   (1/n) Σ_{i=1}^{n} (τ - 1(Y_i < X_i'b)) X_i

as an approximate derivative of the objective function (153) and apply similar approaches. For
a differentiable objective function Q_N(β) one often employs an expansion of the type

   Q_N(β̂) - Q_N(β_0) = (∂Q_N(β_0)/∂β') (β̂ - β_0) + O(‖β̂ - β_0‖²) ,

where the last term vanishes if the estimator is consistent. For a non-differentiable objective
function one could use an approximate derivative D:

   Q_N(β̂) - Q_N(β_0) = D(β_0)' (β̂ - β_0) + remainder

and then impose sufficient regularity conditions such that

   Q_N(β̂) - Q_N(β_0) - D(β_0)' (β̂ - β_0)

converges to zero sufficiently fast.

^188 Consider the situation without X and with τ = 0.25. Suppose we have 3 data points. It will be impossible
to find a b such that Σ (τ - 1(Y_i < b)) = 0: we can write this as Σ_{i=1}^{3} 1(Y_i < b) = 0.75, which is
impossible to satisfy.

Estimating all quantiles at the same time  One may often be interested in estimating β_0^τ for
various different values of τ, e.g. for all deciles or all percentiles. With a finite number of
observations, only a finite number of estimates will be numerically distinct. Think of a sample
with two observations and try to estimate the median and all deciles. In addition, the estimates
β̂_τ for different values of τ will be correlated, which is important e.g. if one wants to test for
equality of the slopes.

Quantile crossing:
By definition the quantile Q_{Y|X}^τ is nondecreasing in τ. However, this is not guaranteed for
the estimates Q̂_{Y|X}^τ if we estimate the quantiles by (153). This could happen either due to
sampling variability or due to misspecification of the model. E.g. if we assumed a linear regression
model, the quantile functions have to cross for some values of x, unless all quantile functions are
parallel. Hence, quantile crossing could be used to verify or test for misspecification. If crossing
occurs in regions where the density f_X is low, there would be little concern. E.g. if X is years of
education and is above 8 years for (almost) all observations, we would not be concerned if quantiles
cross at some value of x below 8 years. If crossing occurs in regions where the density f_X is high,
we might want to re-specify the model, e.g. include square or higher order terms.

As another example, consider the location-scale shift model

   Y_i = β_0 + X_i β_1 + (δ_0 + δ_1 X_i) U_i

and concentrate on the simple version where β_0 = β_1 = δ_0 = 0 and δ_1 = 1:

   Y_i = X_i U_i .

The conditional quantiles of this model will be linear in x for x > 0 and for x < 0. But at x = 0
the slope will change. At x = 1 the conditional quantile is

   Q_{Y|X}^τ(1) = F_U^{-1}(τ) .

At x = -1, on the other hand, the conditional quantile is

   Q_{Y|X}^τ(-1) = -F_U^{-1}(1 - τ) .

Hence, the quantiles are given by

   Q_{Y|X}^τ(x) = x F_U^{-1}(τ)        if x ≥ 0
   Q_{Y|X}^τ(x) = x F_U^{-1}(1 - τ)    if x < 0,

where the slope changes at x = 0. Hence, assuming a linear model would be incorrect. The
following graphs show the conditional τ = 0.75 quantiles, on the left for standard normal errors
U and on the right for uniform[0,1] errors U. Since the normal distribution is symmetric about
zero, the values of F_U^{-1}(τ) and -F_U^{-1}(1 - τ) are the same and therefore the absolute value
of the slope is the same to the left and right of zero in the left graph. In the right graph, the
sign of the slope does not change, but its magnitude does.^189 In these examples, a quadratic
regression model would fit reality better.

Hence, quantile crossing can occur for the estimated quantiles. Nevertheless, Theorem
2.5 of Koenker (2005) ensures that, even if we estimate all quantile functions separately by
(153), at least at the center of the design points, X̄ = (1/n) Σ X_i, the estimated quantile function
Q̂_{Y|X}^τ(X̄) = X̄'β̂_τ is nondecreasing in τ on [0,1].

On the other hand, if the assumed model were indeed correct, estimates with crossing
quantiles cannot be efficient since they do not incorporate the information that Q_{Y|X}^τ must be
nondecreasing in τ.

16.1.3 Linear programming algorithms

Instead of using GMM estimation, fast algorithms based on a linear programming representation
are available, which are particularly interesting if one wants to estimate β_0^τ for various τ. Since
in finite samples only finitely many β̂_τ are numerically distinct, they can be computed fast. (See
Koenker, 2005.)
Rewrite the estimator as

   β̂_τ = arg min_β Σ_{i=1}^{n} ρ_τ(Y_i - X_i'β)
       = arg min_β Σ_{i=1}^{n} [ τ |Y_i - X_i'β| 1(Y_i > X_i'β) + (1 - τ) |Y_i - X_i'β| 1(Y_i < X_i'β) ] .

^189 Note that the conditional median would still be linear.

Hence, the estimator minimizes a weighted sum of positive residuals. Consider the residuals
r_{1i} = |Y_i - X_i'β| 1(Y_i > X_i'β) and r_{2i} = |Y_i - X_i'β| 1(Y_i < X_i'β), such that

   β̂_τ = arg min_β Σ_{i=1}^{n} [ τ r_{1i} + (1 - τ) r_{2i} ]    with r_{1i} - r_{2i} = Y_i - X_i'β ,

where r_{1i}, r_{2i} ≥ 0 but only one of the two residuals r_{1i}, r_{2i} is nonzero for each
observation. It can be shown that the solution is identical to the solution of the LP problem

   arg min_{β, r_1, r_2} Σ_{i=1}^{n} [ τ r_{1i} + (1 - τ) r_{2i} ]   with r_{1i} - r_{2i} = Y_i - X_i'β and r_1, r_2 ≥ 0,   (154)

where minimization is over β, r_1 and r_2.

Define the following linear programming (LP) problem:

   min_z c'z   subject to Az = Y_N, z ∈ S,                            (155)

where A = [X : I : -I] is a matrix of dimension N × (dim(X) + 2N). The column vector c is
(0'_{dim(X)}, τ 1'_N, (1 - τ) 1'_N)'. The column vector z is of length dim(X) + 2N and the set S
is R^{dim(X)} × R^{2N}_{+}.

If z is set to (β', r_1', r_2')', this exactly reproduces the expression (154). In (155) the scalar
c'z is minimized over z, where z satisfies the linear constraints Az = Y_N and z is nonnegative
except for the first dim(X) components. The first dim(X) components of z refer to the
coefficients β whereas the other components represent the nonnegative residuals r_{1i} and r_{2i}.
Having expressed the minimization problem in canonical form in (155), conventional linear
programming algorithms can be used for computing z and thus β̂_τ. An introduction to these
algorithms is given in Koenker (2005, Chapter 6).
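A minimal sketch of (155) using a generic LP solver (here scipy's linprog; the function name
and interface are illustrative and this is not the specialized algorithm of Koenker (2005, Chapter 6)):

   import numpy as np
   from scipy.optimize import linprog

   def quantreg_lp(y, X, tau):
       # min_z c'z  s.t.  [X : I : -I] z = y,  beta free, r1, r2 >= 0
       n, k = X.shape
       c = np.concatenate([np.zeros(k), tau * np.ones(n), (1 - tau) * np.ones(n)])
       A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
       bounds = [(None, None)] * k + [(0, None)] * (2 * n)
       res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
       return res.x[:k]    # first dim(X) components of z are the coefficients

For the data of the statsmodels example above, quantreg_lp(y, X, tau) should reproduce
fit.params up to numerical tolerance.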

16.1.4 Inference and asymptotic theory

For deriving confidence intervals and for testing hypotheses, the distribution of the estimated
β̂_τ needs to be known. Koenker (2005, Chapter 3) derives the exact distribution for the linear
quantile regression model, which however may be expensive to compute for larger sample sizes.
Approximations based on asymptotic theory may be more helpful. We first consider a heuristic
proof for the estimation of a quantile of a random variable, i.e. without covariates, assuming
that f_Y is strictly positive in a neighbourhood of Q_Y^τ.

Asymptotics of estimated quantile

The objective function of the τ-quantile is:

   β̂ = inf{ β : (1/n) Σ ρ_τ(Y_i - β) = min! } .

For β ≠ Y_i, the gradient is

   g_n(β) = (1/n) Σ (1(Y_i < β) - τ) ,

which is monotonically increasing in β. Hence, for any value β,

   Pr(β < β̂) = Pr( g_n(β) < g_n(β̂) ) = Pr( g_n(β) < 0 ) .

With β_0 = Q_Y^τ (so that F_Y(β_0) = τ) we therefore obtain

   Pr( √n (β̂ - β_0) > δ ) = Pr( g_n(β_0 + δ/√n) < 0 )
                           = Pr( (1/n) Σ [ 1(Y_i < β_0 + δ/√n) - τ ] < 0 ) .

This is a sum of Bernoulli random variables such that a DeMoivre-Laplace central limit
theorem can be applied. Compute first the expected value and variance of g_n(β_0 + δ/√n):

   E[ g_n(β_0 + δ/√n) ] = F_Y(β_0 + δ/√n) - τ = F_Y(β_0) + (δ/√n) f_Y(β_0) + O(δ²/n) - τ
                        = (δ/√n) f_Y(β_0) + O(δ²/n) ,

   Var[ g_n(β_0 + δ/√n) ] = (1/n) Var[ 1(Y_i < β_0 + δ/√n) ]
                          = (1/n) F_Y(β_0 + δ/√n) (1 - F_Y(β_0 + δ/√n))
                          = (1/n) τ(1 - τ) + (1/n) O(δ/√n) .

Hence, by standardizing, retaining only first order terms and applying a central limit theorem,

   Pr( √n (β̂ - β_0) > δ ) → Φ( -δ f_Y(β_0) / √(τ(1-τ)) ) = 1 - Φ( δ f_Y(β_0) / √(τ(1-τ)) ) .

Thus

   √n (β̂ - β_0) →d N( 0, τ(1-τ) / f_Y²(β_0) ) .

The variance is large when τ(1-τ) is large, which has its maximum at τ = 0.5. Hence, this
part of the variance decreases in the tails, i.e. for τ small or large. On the other
hand, the variance is large when the density f_Y(β_0) is small, which thus usually increases the
variance in the tails. (If the density f_Y(β_0) is zero, this simple derivation can no longer be
used. Similarly, if the density f_Y(β_0) is very small, rates of convergence can be slower than
√n.)
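A small Monte Carlo sketch checking the asymptotic variance formula for an illustrative
Exp(1) design (all numbers below are only for this made-up example):

   import numpy as np

   rng = np.random.default_rng(3)
   tau, n, reps = 0.5, 2000, 2000
   q0 = -np.log(1 - tau)                 # true tau-quantile of Exp(1)
   f_q0 = np.exp(-q0)                    # density at the true quantile
   draws = rng.exponential(size=(reps, n))
   q_hat = np.quantile(draws, tau, axis=1)
   print(np.var(np.sqrt(n) * (q_hat - q0)))   # simulated variance
   print(tau * (1 - tau) / f_q0**2)           # asymptotic formula tau(1-tau)/f^2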

We can easily extend the previous derivations to obtain the joint distribution of several
quantiles β̂ = (β̂_{τ_1}, ..., β̂_{τ_m})':

   √n (β̂ - β_0) →d N(0, Σ) ,

where the m × m matrix Σ has typical elements

   Σ_{ij} = ( min(τ_i, τ_j) - τ_i τ_j ) / ( f(F^{-1}(τ_i)) f(F^{-1}(τ_j)) ) .

Asymptotics of linear quantile regression with IID errors

The following is based on Chapter 4 of Koenker (2005). Suppose a linear quantile regression
model

   Y = X'β_0^τ + U    with Q_{U|X}^τ = 0.

β̂_τ is consistent under the following conditions. For every ε > 0,

   √n ( (1/n) Σ F_i(X_i'β_0^τ - ε) - τ ) → -∞
   √n ( (1/n) Σ F_i(X_i'β_0^τ + ε) - τ ) → +∞ ,

where F_i is the cdf of Y_i (allowing for heteroskedasticity). This condition requires that the
density of U at U = 0 is bounded away from zero at an appropriate rate. If the density of
U were zero in an ε-neighbourhood, the two previous expressions would be exactly zero. The
conditions require positive density, which can become smaller for large sample sizes but not too
fast. Two other conditions are required on the data matrix X:
There exist d > 0 and D > 0 such that

   lim inf_{n→∞} inf_{‖b‖=1} (1/n) Σ 1( |X_i'b| < d ) = 0

and

   lim sup_{n→∞} sup_{‖b‖=1} (1/n) Σ (X_i'b)² ≤ D .

The former condition ensures that the X_i observations are not collinear, i.e. that there is no b
such that X_i'b = 0 for every observed X_i. The latter condition controls the rate of growth of
the X_i and is satisfied, e.g., when (1/n) Σ X_i X_i' tends to a pd matrix.
Alternative sets of conditions can also be used to prove consistency, e.g. by trading off some
conditions on the density of U against conditions on the X design.

Having established consistency, we examine the asymptotic distribution, for which, as usual,
stronger conditions are required. Suppose the Y_i observations are independent with conditional
distribution functions F_i = F_{Y_i|X_i} and let ξ_i = Q_{Y_i|X_i}^τ. The estimated coefficients
converge in distribution at rate √n:

   √n (β̂_τ - β_0^τ) →d N( 0, τ(1-τ) D_1^{-1} D_0 D_1^{-1} )

under the conditions:

Condition A.1: The df F_i are absolutely continuous, with continuous densities f_i uniformly
bounded away from zero and infinity at the points ξ_i, for all i.
Condition A.2: There exist pd matrices D_0 and D_1 such that

   lim (1/n) Σ X_i X_i' = D_0
   lim (1/n) Σ f_i(ξ_i) X_i X_i' = D_1
   lim_{n→∞} max_i ‖X_i‖ / √n = 0 .

The proof follows Koenker (2005), pages 121f. The outline of the proof consists of three
steps. First, it is shown that the function

   Z_n(δ) = Σ [ ρ_τ(U_i - X_i'δ/√n) - ρ_τ(U_i) ] ,

where U_i = Y_i - X_i'β_0^τ, is convex in δ and converges in distribution to a function Z_0(δ), as
shown below. Second, since Z_0(δ) is also convex, the minimizer is unique and arg min Z_n(δ)
converges in distribution to arg min Z_0(δ). Third, √n (β̂_τ - β_0^τ) is equivalent to the minimizer
of Z_n(δ), and therefore the distribution theory follows.

Consider first the function Z_n(δ). Following Knight (1998) it can be shown that

   Z_n(δ) →d Z_0(δ) = -δ'W + (1/2) δ'D_1 δ ,

where W ~ N(0, τ(1-τ) D_0). Since both the left and right hand side are convex in δ with
unique minimizers, it follows that

   arg min Z_n(δ) →d arg min( -δ'W + (1/2) δ'D_1 δ ) = D_1^{-1} W .

Since W is normally distributed, it follows that

   arg min Z_n(δ) →d N( 0, τ(1-τ) D_1^{-1} D_0 D_1^{-1} ) .

The final step of the proof is to note that the function Z_n(δ) is minimized at the value
√n (β̂_τ - β_0^τ). To see this, note that with a few simple calculations it follows that
Z_n(√n(β̂_τ - β_0^τ)) = Σ ρ_τ(Y_i - X_i'β̂_τ) - Σ ρ_τ(U_i). The first term achieves its minimum
here, as this is the definition of the linear quantile regression estimator. The second term does
not depend on β̂_τ anyhow. Hence, arg min Z_n(δ) = √n (β̂_τ - β_0^τ), which then gives the
asymptotic distribution of β̂_τ.

An alternative approach to examining the asymptotic properties can be based on the GMM
framework. Under certain regularity conditions, the GMM framework can be used to show
consistency and asymptotic normality,

   √n (β̂_τ - β_0^τ) →d N(0, Σ) ,

with asymptotic variance matrix

   Σ = τ(1-τ) E[f_{U|X}(0|X) XX']^{-1} E[XX'] E[f_{U|X}(0|X) XX']^{-1} .

If one is willing to strengthen the assumption Q_{U|X}^τ = 0 to full independence of U and X,
the variance matrix simplifies to

   Σ = ( τ(1-τ) / f_U²(0) ) E[XX']^{-1} .

As an exercise we can derive the asymptotic variance using the results for exactly identified
GMM estimators.

OPG:

   E[ (τ - 1(Y < X'β_0^τ)) XX' (τ - 1(Y < X'β_0^τ)) ]
   = E[ E[ (τ - 1(Y < X'β_0^τ))² | X ] XX' ]
   = E[ E[ τ² - 2τ 1(Y < X'β_0^τ) + 1(Y < X'β_0^τ) | X ] XX' ]
   = E[ ( τ² - 2τ Pr(U < 0|X) + Pr(U < 0|X) ) XX' ]
   = E[ ( τ² - 2τ² + τ ) XX' ] = τ(1-τ) E[XX']

Expected gradient of the moment function:

   ∂E[ (τ - 1(Y < X'β)) X ] / ∂β' |_{β_0^τ}

   = ∂/∂β' ∫∫ (τ - 1(y < X'β)) X dF_{Y|X} dF_X |_{β_0^τ}

   = ∫ ∂/∂β' ( (τ - 1) ∫_{-∞}^{X'β} X f_{Y|X}(y) dy + τ ∫_{X'β}^{∞} X f_{Y|X}(y) dy ) dF_X |_{β_0^τ}

   = ∫ ( (τ - 1) XX' f_{Y|X}(X'β) - τ XX' f_{Y|X}(X'β) ) dF_X |_{β_0^τ}

   = -E[ f_{Y|X}(X'β_0^τ) XX' ] = -E[ f_{U|X}(0|X) XX' ] ,

since F_{Y|X}(X'β_0^τ|X) = F_{U|X}(0|X).

Bahadur representation:
The previous derivations can also be used to derive an "influence function" representation
of the linear quantile regression estimator:

   √n (β̂_τ - β_0^τ) = D_1^{-1} (1/√n) Σ X_i (τ - 1(Y_i - ξ_i < 0)) + O( n^{-1/4} (ln ln n)^{3/4} ) ,

as derived in Bahadur (1966) and Kiefer (1967). This representation is useful if the quantile
regression estimator is used as a plug-in estimator in a more complex, e.g. semiparametric,
estimator. It can further be shown that this representation holds uniformly over an interval
τ ∈ [ε, 1-ε] for some 0 < ε < 1/2.

16.2 Nonparametric quantile regression

The previous section examined linear quantile regression, certainly the most widely used
approach. Extensions to nonlinear models have been developed, but it might be most illuminating
to proceed directly to nonparametric approaches. We start with local quantile regression and
later consider quantile treatment effects, which will link this chapter with the first chapters of
this book.
Chaudhuri (1991) introduced local polynomial quantile regression.
Start with the situation where X_i is one-dimensional and we aim to estimate the conditional
quantile function Q_{Y|X}^τ(x) at a location x. A local linear quantile regression estimator is
given by the solution â to

   min_{a,b} Σ ρ_τ( Y_i - a - b(X_i - x) ) K( (X_i - x)/h ) .

Extensions to local polynomial quantile regression are obvious, but seem to be rarely used in
practice.
Suppose that

   Y = g(X) + U

with Q_U^τ = 0 and g belonging to the class of Hölder continuous functions C^{k,γ}, i.e. g is k
times continuously differentiable with the k-th derivative Hölder continuous with exponent γ.
Chaudhuri (1991) showed that the estimate ĝ, when choosing a bandwidth
h ∝ n^{-1/(2(k+γ)+dim(X))}, converges to g almost surely:

   ‖ĝ - g‖ = O( n^{-(k+γ)/(2(k+γ)+dim(X))} √(ln n) )

and obtained analogous results for the estimated derivatives. He also derived a local Bahadur
representation. This result is thus similar to nonparametric estimation of conditional
expectations, since √(ln n) increases only very slowly.
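A hedged sketch of the local linear quantile estimator, reusing the LP representation from
above with kernel weights scaling the check loss (the Gaussian kernel and all names here are
illustrative choices, not Chaudhuri's implementation):

   import numpy as np
   from scipy.optimize import linprog

   def local_linear_quantile(y, x, x0, tau, h):
       # minimize sum_i w_i rho_tau(y_i - a - b (x_i - x0)) via the LP form,
       # where the kernel weights w_i simply scale the residual costs
       w = np.exp(-0.5 * ((x - x0) / h) ** 2)
       X = np.column_stack([np.ones_like(x), x - x0])
       n, k = X.shape
       c = np.concatenate([np.zeros(k), tau * w, (1 - tau) * w])
       A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
       bounds = [(None, None)] * k + [(0, None)] * (2 * n)
       res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
       return res.x[0]    # the local intercept a estimates Q_{Y|X=x0}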

16.3 Quantile treatment effects

So far we have considered estimation of Q_{Y|X}^τ, i.e. the relationship between observed
variables. As argued throughout this book, the prime interest of empirical research does not lie
in estimating regression relationships but causal effects. Let us start with the simplest setup
where D is binary and Y^1, Y^0 are the potential outcomes.

In many research areas it is of first order importance to assess the distributional effects of
policy variables. For instance, policy makers will evaluate differently two training programs
having the same average effect but whose effects are concentrated in the lower end of the
distribution for the first one and on the upper end for the second one. Instead of considering
only average effects, it is often of considerable interest to compare the distributional effects
of the treatment as well. Another example which has received considerable public interest
is educational equality, where many societies would prefer to provide every child with a fair
start into adult life. Here, Y is a measure of cognitive ability (e.g. obtained from math and
language tests) and D may be the introduction of computers in classroom teaching. In the
following, we will identify and estimate the entire distribution functions of Y^1 and Y^0. The
ability of quantile treatment effects (QTE) to characterize the heterogeneous impact of variables
on different points of an outcome distribution makes them appealing in many economic
applications. This has motivated the recent surge of interest in their identification and estimation
using different sets of assumptions, particularly in the applied policy evaluation literature.
We may be interested in the quantiles of the potential outcomes,

   Q_{Y^1}^τ and Q_{Y^0}^τ ,

or the quantiles of the effect,

   Q_{Y^1-Y^0}^τ ,

or the difference between the two quantiles,

   Q_{Y^1}^τ - Q_{Y^0}^τ .

Most of the literature has focussed on the last parameter. Estimation of Q_{Y^1}^τ and Q_{Y^0}^τ
only requires identifying assumptions on the marginal distributions of Y^1 and Y^0, whereas
estimation of Q_{Y^1-Y^0}^τ requires identifying assumptions on the joint distribution of Y^1 and
Y^0. Knowledge of the marginal distributions does not suffice for identifying the joint distribution.
Suppose that the distributions of Y^1 and Y^0 are exactly identical. Hence, the difference
Q_{Y^1}^τ - Q_{Y^0}^τ is zero for every quantile. This could be the case if the treatment effect is
zero for every individual. However, this could also be the result of off-setting individual treatment
effects, i.e. if some individuals had a negative treatment effect and others a positive treatment
effect. In other words, Q_{Y^1-Y^0}^τ could be zero or positive or negative.

The difference in the quantiles Q_{Y^1}^τ - Q_{Y^0}^τ simply measures the distance between the
two outcome distributions. If the treatment does not change the ranks of the individuals, i.e. if
for any two individuals Y_i^0 > Y_j^0 implies Y_i^1 > Y_j^1 and Y_i^0 = Y_j^0 implies Y_i^1 = Y_j^1,
then Q_{Y^1}^τ - Q_{Y^0}^τ gives the treatment effect for an individual at rank τ in the outcome
distribution, e.g. at rank τ = 0.9. This is usually still different from Q_{Y^1-Y^0}^τ, which refers to
the quantile of the effect, e.g. to the person at the 90% rank of treatment gains.

Another way to see this is to remember that the integral over all quantiles gives the expected
value, i.e.

   ∫_0^1 Q_{Y^1}^τ dτ = E[Y^1]

and

   ∫_0^1 Q_{Y^1-Y^0}^τ dτ = E[Y^1 - Y^0] .

We thus obtain the relationship

   ∫_0^1 ( Q_{Y^1}^τ - Q_{Y^0}^τ ) dτ = ∫_0^1 Q_{Y^1-Y^0}^τ dτ = E[Y^1 - Y^0] ,

which however does not provide us with information on Q_{Y^1-Y^0}^τ at a particular quantile τ.

Since quantile treatment effects (QTE) are an intuitive way to summarize the distributional
impact of a treatment, we especially focus our attention on them:

   Δ^τ = Q_{Y^1}^τ - Q_{Y^0}^τ ,                                      (156)

where Q_{Y^1}^τ = inf_q { q : Pr(Y^1 ≤ q) ≥ τ } is the τ-quantile of Y^1. It is worthwhile noting
that Q_{Y^1}^τ and Q_{Y^0}^τ are often separately identified. Hence, instead of the difference one
could also examine other parameters, e.g. the treatment effect on inequality measures such as
the interquantile spread. A typical measure is based on the inter-decile spread, whose treatment
effect can be defined as^190

   ( Q_{Y^1}^{0.9} - Q_{Y^0}^{0.9} ) / ( Q_{Y^1}^{0.1} - Q_{Y^0}^{0.1} )   or as   ( Q_{Y^1}^{0.9} - Q_{Y^1}^{0.1} ) / ( Q_{Y^0}^{0.9} - Q_{Y^0}^{0.1} ) .

^190 The Stata command ivqte provides estimates of the necessary ingredients to compute all these inequality
measures, but gives analytical standard errors only for the QTE Δ^τ.

For the following discussion, it is important to distinguish between unconditional and
conditional QTE. The unconditional QTE (156) gives the effect of D in the population at large.
The conditional QTE Δ_X^τ = Q_{Y^1|X}^τ - Q_{Y^0|X}^τ gives the effect in the subpopulation of
individuals with characteristics X, where X may contain a number of control variables discussed
below. Conditional and unconditional effects are interesting in their own right. In some
applications, the conditional effect Δ_X^τ may be of primary interest, e.g. when testing hypotheses
on treatment effect heterogeneity. From a nonparametric perspective, however, the conditional
effect Δ_X^τ is subject to the curse of dimensionality and will often be estimated with low
precision if many variables are included in X, unless a strict functional form restriction is imposed.

The unconditional QTE, on the other hand, can be estimated, under certain regularity
conditions, at √n rate without any parametric restriction. From a statistical perspective we
therefore expect more precise estimates for the unconditional QTE than for the conditional
QTE. For purposes of public policy evaluation the unconditional QTE might also often be of
more interest because it can be conveyed more easily to policy makers and the public than
conditional QTE with a fine X vector. Policy makers and the public need more aggregated
results for decision making. The unconditional QTE summarize the effects of a treatment for the
entire population and can easily be conveyed and summarized since the unconditional quantile
function is a one-dimensional function, whereas the conditional quantile functions are
multidimensional functions (of the quantile on one side and of each of the covariates on the other side).

We discuss first the identification and estimation of unconditional QTE. This covers also
the case where we estimate QTE separately for large demographic subgroups, e.g. men, women,
foreigners etc. For simplicity, we will refer to these also as unconditional QTE, whereas the
properties for conditional QTE would be more appropriate when X contains several (continuous)
covariates.

16.4 QTE under selection on observables

In this section, we will discuss the selection-on-observables estimators of the unconditional QTE
of Firpo (2007), Frölich (2007b) and Melly (2006), who consider estimation of treatment effects
when D is exogenous conditional on X. Identification is fully nonparametric, i.e. without any
functional form assumptions.

If the selection problem can be solved by conditioning on a set of covariates X, i.e.

   Y^d ⊥⊥ D | X ,                                                     (157)

and if common support Supp(X|D) = Supp(X) is satisfied, the distribution of the potential
outcome is identified as

   F_{Y^d}(a) = ∫ E[1(Y ≤ a) | X, D = d] dF_X ,

and the quantiles can be obtained by inverting the distribution function,

   Q_{Y^d}^τ = F_{Y^d}^{-1}(τ) ,

provided that F_{Y^d} is invertible, i.e. that Q_{Y^d}^τ is well defined. For an example see Frölich
(2007b).
Alternatively, weighting by the propensity score gives

   F_{Y^d}(a) = E[ 1(Y ≤ a) 1(D = d) / Pr(D = d|X) ] ,

which may be less precise in finite samples if the estimated probabilities Pr(D = d|X_i) are very
small for some X_i.
If one is interested particularly in the difference between two quantiles and its asymptotic
properties, direct estimation of the quantiles might be more convenient than estimating
the entire distribution function. Notice that the last equation implies that

   τ = F_{Y^d}(Q_{Y^d}^τ) = E[ 1(Y ≤ Q_{Y^d}^τ) 1(D = d) / Pr(D = d|X) ] .

Hence, we can identify Q_{Y^d}^τ as

   Q_{Y^d}^τ = arg zero_b E[ (1(Y < b) - τ) 1(D = d) / Pr(D = d|X) ] . (158)

This can be considered as the first order condition to

   Q_{Y^d}^τ = arg min_b E[ ρ_τ(Y - b) 1(D = d) / Pr(D = d|X) ] .     (159)

Hence, we could use a conventional univariate quantile regression estimation routine with
weights 1(D = d)/Pr(D = d|X).
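A minimal sketch of this weighting approach in the spirit of Firpo (2007); the logit propensity
score and the helper names are illustrative assumptions, not the original implementation:

   import numpy as np
   import statsmodels.api as sm

   def weighted_quantile(yy, w, tau):
       # smallest y whose normalized cumulative weight reaches tau
       order = np.argsort(yy)
       cw = np.cumsum(w[order]) / np.sum(w)
       return yy[order][np.searchsorted(cw, tau)]

   def qte_exogenous(y, d, X, tau):
       # weights 1(D=d)/Pr(D=d|X) within each treatment arm, cf. (159)
       p = sm.Logit(d, sm.add_constant(X)).fit(disp=0).predict()
       q1 = weighted_quantile(y[d == 1], 1.0 / p[d == 1], tau)
       q0 = weighted_quantile(y[d == 0], 1.0 / (1 - p[d == 0]), tau)
       return q1 - q0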

To see the relationship between the last two equations, rewrite

   E[ ρ_τ(Y - b) 1(D = d) / Pr(D = d|X) ]
   = τ E[ (Y - b) 1(D = d) / Pr(D = d|X) ] - E[ (Y - b) 1(Y < b) 1(D = d) / Pr(D = d|X) ]
   = τ E[ (Y - b) 1(D = d) / Pr(D = d|X) ] - ∫∫ ∫_{-∞}^{b} (y - b) 1(D = d) / Pr(D = d|X) dF_{YXD} .

Differentiation with respect to b, using the Leibniz rule and assuming that the order of integration
and differentiation can be interchanged, gives the first order condition:

   = -τ E[ 1(D = d) / Pr(D = d|X) ] + ∫∫ ∫_{-∞}^{b} 1(D = d) / Pr(D = d|X) dF_{YXD}
   = -τ + E[ 1(Y < b) 1(D = d) / Pr(D = d|X) ] ,

since E[1(D = d)/Pr(D = d|X)] = 1. Setting the first order condition to zero gives (158).^191

16.5 Unconditional QTE under endogeneity

If the number of control variables X observed is not sufficient to make the conditional
independence assumption Y^d ⊥⊥ D|X plausible, instrumental variables techniques may overcome
the endogeneity. A variety of nonparametric techniques have recently been proposed, which
bear similarity with the approaches in Chapter 3.
The following discussion is based on Frölich and Melly (2007), who examine nonparametric
identification and estimation of unconditional QTE by instrumental variables for the
subpopulation of compliers. It can be used for binary D^192 and permits binary and non-binary
instrumental variables.
Consider a triangular model related to the recent literature on nonparametric identification

^191 The asymptotic theory when using a logistic series regression estimator for Pr(D = d|X) is developed in
Firpo (2007).
^192 Imbens and Newey (2003) and Chesher (2003) analyzed identification for continuous D and Chesher (2005)
examined interval identification with discrete D.

of nonseparable models:

   Y_i = φ(D_i, X_i, U_i)                                             (160)
   D_i = ζ(Z_i, X_i, V_i) ,

where φ and ζ are unknown functions, Y is the outcome variable, D the treatment variable, Z
the instrumental variable(s), X additional control variables and U and V are possibly related
unobservables. (Note that the control variables X are permitted to be correlated with U and/or
V.) We assume Z to be excluded from the function φ, i.e. Z has no direct effect on Y. The
corresponding potential outcomes are

   Y_i^d = φ(d, X_i, U_i)                                             (161)
   D_i^z = ζ(z, X_i, V_i) .

The exclusion restriction in (160) is not sufficient to obtain identification and we impose the
monotonicity assumption that the function ζ is weakly monotonous in z; without loss of
generality we normalize it to be increasing, i.e. assume that an exogenous increase in Z_i can
never decrease the value of D_i. If Z is binary, this corresponds to the framework of Imbens and
Angrist (1994).^{193,194}

We will focus our attention on the subgroup of compliers, which we define as all individuals
who are responsive to a change in Z within the support of Z. This is the largest subpopulation

^193 Note that various other kinds of monotonicity assumptions have been used in the literature to obtain
(partial) identification. Manski (1997) examines monotonicity of φ with respect to d, whereas Chernozhukov and
Hansen (2005), Chernozhukov, Imbens, and Newey (2007) and Chesher (2007a) assume monotonicity of φ with
respect to the unobservable u. Imbens and Newey (2003) examines monotonicity of ζ with respect to the
unobservable v, whereas Heckman and Vytlacil (2005) and we assume monotonicity of ζ with respect to z. Chesher
(2003, 2005, 2007) combines several of these assumptions.
^194 In contrast to Chernozhukov and Hansen (2005), Chernozhukov, Imbens, and Newey (2007) and Chesher
(2007a), we impose triangularity, i.e. assume that Y does not enter in ζ, but do not need to assume any kind of
monotonicity or rank invariance for φ. (Chernozhukov and Hansen (2005), Chernozhukov, Imbens, and Newey
(2007) and Chesher (2007a) assume that φ is monotonous in its third argument.) We do impose, on the other
hand, that the function ζ is (weakly) monotonous in its first argument, i.e. assume that an exogenous increase in
Z_i can never decrease the value of D_i. This is the monotonicity assumption of Imbens and Angrist (1994). This
assumption may be more plausible than monotonicity in φ in some applications, whereas in other applications it
may be less appealing.

for which the effect is identified. If the instruments Z are sufficiently powerful to move everyone
from D_i = 0 to D_i = 1, this will lead to the average treatment effect (ATE) in the entire
population. In most applications, however, the instruments available are not so powerful, and
it is interesting in this case to consider effects in the largest subpopulation for which the effect
is identified. In addition, if Y is bounded, we can derive bounds on the overall treatment effects
because the size of the subpopulation of compliers is identified as well. Therefore, we focus on
the QTE for the compliers:

   Δ_c^τ = Q_{Y^1|c}^τ - Q_{Y^0|c}^τ ,

where Q_{Y^1|c}^τ = inf_q { q : Pr(Y^1 ≤ q | T = c) ≥ τ } and T_i = c means that individual i is a
complier, as defined below.

Suppose for the moment that Z is binary, i.e. taking values in {0,1}. Using the language
of Imbens and Angrist (1994), there may be some individuals with D_i^0 = D_i^1 = 1 (always-
treated), some individuals with D_i^0 = D_i^1 = 0 (never-treated), those individuals with D_i^0 =
0, D_i^1 = 1 (compliers), and finally individuals with D_i^0 = 1, D_i^1 = 0 (defiers). By the
monotonicity assumption, the defiers do not exist. The always-treated and the never-treated are
not affected by any exogenous change in Z, whereas the compliers react to it. Now, let Z be (a
vector of) non-binary instrumental variables with support Z. By the assumption that ζ in (160)
is weakly monotonously increasing, the population can again be partitioned into three groups:
individuals i for whom min_{z∈Z} D_i^z = 1 (always-treated), individuals for whom
max_{z∈Z} D_i^z = 0 (never-treated) and individuals i for whom min_{z∈Z} D_i^z < max_{z∈Z} D_i^z
(compliers). The latter group consists of all individuals i whose treatment status is not constant
over all values of z ∈ Z. We call this group compliers as well, as they somehow react to the
instrument. This is the largest subpopulation for which the treatment effect is identified, because
with the instruments available it would be impossible to move always-treated to non-treatment
and never-treated to treatment. (If the support Z is sufficiently large, the subpopulations of
always- and never-treated might have measure zero such that everyone is a complier.) Define

   z_min = arg min_{z∈Z} Pr(D^z = 1)   and   z_max = arg max_{z∈Z} Pr(D^z = 1) .

By virtue of the monotonicity assumption, D_i^{z_min} < D_i^{z_max} for a complier whereas
D_i^{z_min} = D_i^{z_max} for an always- or never-treated. (If Z is binary, clearly z_min = 0 and
z_max = 1.) As discussed in Frölich and Melly (2007), the largest subpopulation affected would
be obtained by moving

Z from the smallest point of its support to its largest point. In other words, identification of
the effect on all compliers is obtained from those observations with Z_i = z_min and Z_i = z_max,
irrespective of the number of instrumental variables or whether they are discrete or continuous.
The asymptotic theory in Frölich and Melly (2007), however, requires that there are positive
mass points, i.e. that Pr(Z_i = z_min) > 0 and that Pr(Z_i = z_max) > 0. This rules out continuous
instrumental variables, unless they are mixed discrete-continuous and have positive mass at
z_min and z_max.
Summarizing this discussion, identification and estimation is based only on those
observations with Z_i ∈ {z_min, z_max}.^195 In the following we will assume throughout that z_min
and z_max are known (and not estimated) and that Pr(Z = z_min) > 0 and Pr(Z = z_max) > 0.
This rules out continuous instruments, unless they have mass points at z_min and z_max. Note
that our identification results would also hold for continuous instruments, but √n consistent
estimation would not be possible anymore. We will develop estimators for those situations in
future work.
To simplify the notation we will use the values 0 and 1 subsequently instead of z_min and
z_max, respectively. Furthermore, we will in the following only refer to the effectively used
sample {i : Z_i ∈ {0,1}} or in other words assume that Pr(Z = z_min) + Pr(Z = z_max) = 1.
This is appropriate for our applications where the single instruments Z are binary. In other
applications, where Pr(Z = z_min) + Pr(Z = z_max) < 1, our results apply with reference to the
subsample {i : Z_i ∈ {0,1}}.

By considering only the endpoints of the support of Z, recoding Z as 0 and 1, and with D
being a binary treatment variable, we can partition the population into four groups defined as:
T_i = a if D_i^1 = D_i^0 = 1 (always treated), T_i = n if D_i^1 = D_i^0 = 0 (never treated), T_i = c
if D_i^1 > D_i^0 (compliers), and T_i = d if D_i^1 < D_i^0 (defiers). Thus T_i ∈ {a, n, c, d} defines
whether individual i is an always-treated, a never-treated, a complier or a defier.

^195 The STATA ivqte command requires the instrumental variable to be coded as binary! The user therefore
should define the instrument as

   1 if Z_i = z_max
   0 if Z_i = z_min
   undefined if Z_i ∉ {z_min, z_max} .

If the original instrument Z is binary, no recoding is required. One should note that in the current version of
ivqte all observations where Z_i ∉ {z_min, z_max} are not used in the estimation, which could imply a large loss
of observations if Z is multivariate, continuous or discrete with many mass points. In future versions we plan to
extend ivqte to incorporate the additional information via nonparametric smoothing around z_min and z_max.

Frölich and Melly (2007) show that the QTE for the compliers,

   Δ_c^τ = Q_{Y^1|T=c}^τ - Q_{Y^0|T=c}^τ ,                            (162)

is identified under the following assumption:

Assumption 1:
i) Existence of compliers: Pr(T = c) = P_c > 0
ii) Monotonicity: Pr(T = d) = 0
iii) Independent instrument: (Y^d, T) ⊥⊥ Z | X
iv) Common support: 0 < π(X) < 1 a.s.

where π(x) = Pr(Z = 1|X = x). We will often refer to π(x) as the "propensity score", where one
should note that it refers to the instrument Z and not, as usual, to the treatment D.

Assumption (1i) requires that the instruments have some power in that there are at least
some individuals who react to them. The strength of the instrument can be measured by P_c,
which is the probability mass of the compliers. The second assumption is often referred to as
monotonicity. It requires that D_i^z either weakly increases with z for all individuals (or decreases
for all individuals). The third assumption is the main instrumental variable assumption. It
implicitly requires an exclusion restriction (= triangularity) and an unconfounded instrument
restriction. In other words, Z_i should not affect the potential outcomes of individual i directly,
and those individuals for whom Z = z is observed should not differ in their relevant unobserved
characteristics from individuals with Z ≠ z. Unless the instrument has been randomly assigned,
this last restriction is often very unlikely to hold. However, conditional on a large set of
covariates X, this assumption is often more plausible. (This setup obviously also contains the
case where X is empty, i.e. it covers also the case where the instrument is valid unconditionally,
e.g. a randomized assignment.)
Note that we permit X to be endogenous. X can be related to U and V in (160) in any
way. This may be important in many applications where X often contains lagged (dependent)
variables that may well be related to unobserved ability U. The fourth assumption requires
that the support of X is identical in the Z = 0 and the Z = 1 subpopulations. This assumption
is needed since we first condition on X to make the instrumental variables assumption valid
but then integrate X out to obtain unconditional treatment effects. An alternative set of
assumptions, which leads to the same estimators later, replaces monotonicity (Assumption 1ii)

with the assumption that the average treatment effect is identical for compliers and defiers,
conditional on X.

We also need to assume that the quantiles are unique and well-defined:
Assumption 2:
The random variables Y^1|c and Y^0|c are continuous with positive density in a neighborhood
of Q_{Y^1|c}^τ and Q_{Y^0|c}^τ, respectively.

Under these two assumptions, Frölich and Melly (2007) derive four different estimators,
which do not require any functional form assumptions or restrictions on treatment effect
heterogeneity. A natural starting point is to identify the distribution functions of the potential
outcomes, which can then be inverted to obtain the quantile treatment effects. More convenient
to implement and most stable in some Monte Carlo simulations were the two weighted quantile
estimators, which are implemented in ivqte.
To obtain these estimators, Frölich and Melly (2007) first showed that, in addition to a
regression (or matching) representation, the potential outcome distributions are also identified
by weighting as

   F_{Y^1|c}(u) = E[ 1(Y < u) D W ] / E[ D W ]                        (163)
   F_{Y^0|c}(u) = E[ 1(Y < u) (1 - D) W ] / E[ D W ] ,

where

   W = (2D - 1) (Z - π(X)) / ( π(X) (1 - π(X)) ) .                    (164)

Hence, one could estimate the QTE by the difference

   q_1 - q_0

of the solutions of the two moment conditions

   E[ 1(Y < q_1) D W ] = τ E[ D W ]                                   (165)
   E[ 1(Y < q_0) (1 - D) W ] = τ E[ D W ]

or equivalently

   E[ (1(Y < q_1) - τ) W D ] = 0                                      (166)
   E[ (1(Y < q_0) - τ) W (1 - D) ] = 0 ,

because E[W] = 2P_c and E[DW] = P_c. We could thus estimate q_0 and q_1 by these weighted
univariate quantiles in the D = 0 and D = 1 populations.
One can show that this is equivalent to a weighted quantile regression representation. The
solution of the following optimization problem,

   (α^τ, Δ^τ) = arg min_{a,b} E[ ρ_τ(Y - a - bD) W ] ,                (167)

where ρ_τ(u) = u (τ - 1(u < 0)), is equivalent to the solutions of the moment conditions
(166) in that the solution for a corresponds to Q_{Y^0|c}^τ and the solution for b corresponds to
Δ_c^τ = Q_{Y^1|c}^τ - Q_{Y^0|c}^τ.

The weighted quantile estimator of Δ_c^τ is

   (Q̂_{Y^0|c}^τ, Δ̂_c^τ) = arg min_{a,b} (1/n) Σ_{i=1}^{n} ρ_τ(Y_i - a - bD_i) Ŵ_i   (168)

or, numerically equivalently, via:

   Q̂_{Y^1|c}^τ = arg min_{q_1} (1/n) Σ_{i=1}^{n} ρ_τ(Y_i - q_1) D_i Ŵ_i
   Q̂_{Y^0|c}^τ = arg min_{q_0} (1/n) Σ_{i=1}^{n} ρ_τ(Y_i - q_0) (1 - D_i) Ŵ_i .

Note that the sample objective (168) is typically non-convex since Ŵ_i is negative for Z_i ≠ D_i.
This complicates the optimization problem a little because local optima could exist. This
problem is not very serious here because we need to estimate only a scalar in the D = 1
population and another one in the D = 0 population. In other words, we can write (168)
equivalently as the two separate one-dimensional estimation problems above, in the D = 1
and D = 0 populations, such that we can easily use grid-search methods supported by visual
inspection of the objective function for local minima. The asymptotic distribution is derived
in Frölich and Melly (2007), showing √n-consistency and normality, with the asymptotic
variance also given there.
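A hedged sketch of the grid-search version of (168); the logit first stage for π(x) and all names
are illustrative assumptions (Frölich and Melly (2007) use local nonparametric first stages instead):

   import numpy as np
   import statsmodels.api as sm

   def qte_compliers(y, d, z, X, tau, grid_size=400):
       pi = sm.Logit(z, sm.add_constant(X)).fit(disp=0).predict()  # pi(x) = Pr(Z=1|X)
       w = (2 * d - 1) * (z - pi) / (pi * (1 - pi))                # weights (164)
       def rho(u):
           return u * (tau - (u < 0))                              # check function
       def weighted_q(sel):
           # grid search: the weighted objective is non-convex (negative weights)
           grid = np.linspace(y.min(), y.max(), grid_size)
           obj = [np.sum(w[sel] * rho(y[sel] - q)) for q in grid]
           return grid[int(np.argmin(obj))]
       return weighted_q(d == 1) - weighted_q(d == 0)              # QTE for compliers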

As an alternative, a positive weights quantile estimator is suggested. These weights are
obtained by applying an iterated expectations argument to (167):

   (α^τ, Δ^τ) = arg min_{a,b} E[ ρ_τ(Y - a - bD) W ]
              = arg min_{a,b} E[ ρ_τ(Y - a - bD) E[W | Y, D] ]
              = arg min_{a,b} E[ ρ_τ(Y - a - bD) W^+ ] ,

where

   W^+ = E[W | Y, D] = E[ (Z - π(X)) / ( π(X)(1 - π(X)) ) | Y, D ] (2D - 1) .   (169)

The weights W^+ have the advantage that they are always positive, whereas the weights W
can be positive and negative. Hence, they can be used to develop an estimator with a linear
programming representation. The sample objective function with W^+ instead of W is globally
convex since it is a sum of convex functions, and the global optimum can be obtained in a
finite number of iterations. However, we would need to estimate W^+ first. (Negative estimated
W^+ are discarded in the estimations.)

Both estimators are multiple step estimators in that they first require estimation of π(X) =
Pr(Z = 1|X). With estimates π̂(X_i), the weights (164) can be calculated and plugged into the
estimator (168). When using the positive weights quantile estimator, the weights W^+ also
have to be estimated first, by nonparametric regression on Y and D. In Frölich and Melly (2007)
two different nonparametric estimators are considered: local linear regression and local logit
regression. The first order asymptotic distribution of the weighted quantile estimator is the
same for both nonparametric estimators. The QTE estimator is √n consistent and attains
the semiparametric efficiency bound. The variance contributions stem from two parts: first,
the weighting by W if the weights were known, and second, the fact that the weights are in fact
estimated. (The estimate of the variance matrix is implemented in Stata.) To attain √n
consistency, higher order kernels are required if X contains 4 or more continuous regressors.
With 3 or fewer continuous regressors, conventional kernels can be used. More precisely, the
order of the kernel should be larger than dim(X)/2.
As is often the case for semiparametric estimators, the first order asymptotic theory does
not depend on the bandwidth value, which has the unpleasant implication that it is
not helpful for selecting bandwidth values. In principle, we could also extend the proof to derive
a second order approximation to the mean squared error, where the second order terms would
depend on the bandwidth values. From the derivations in the proof, however, it appears that
the second order terms would be very complex since they depend on higher order derivatives of
several types of functions. Hence, although feasible to derive, it appears that the second order
approximation would be of little practical value. Too many nuisance functions would have to
be estimated and plugged in, and the estimated bandwidth could be very noisy in finite samples.
It thus appears that alternative approaches to bandwidth selection should be considered, e.g.
the empirical bias bandwidth selector of Ruppert (1997), further developed in Flossmann
(2007). This is a topic for future research.

Similar to results discussed in the previous chapters, the covariates X are usually included
to make the instrumental variable assumptions (exclusion restriction and unconfoundedness
of the instrument) more plausible. In addition, including covariates X can also lead to more
efficient estimates. Generally, with a causal model in mind, we could think of four different
cases for the covariates. A covariate X can (1) causally influence Z and also D or Y, it can (2)
influence Z but neither D nor Y, it can (3) influence D or Y but not Z, and finally (4) it may
influence neither Z nor D nor Y. (There are also other possibilities, where X might itself lie on
the causal path between Z and D or Y. This case is not discussed here.) In case (1), the covariate
should be included in the set of regressors X because otherwise the estimates would generally
be inconsistent. In cases (2) and (4), the covariate should usually not be included in X as it
would decrease efficiency and might also lead to common support problems. In case (3), however,
inclusion of the covariate in X can reduce the asymptotic variance: Frölich and Melly (2007)
show that the semiparametric efficiency bound decreases in this situation. We can therefore
include some X variables to make the instrumental variables assumption more plausible and
additional X variables to reduce the asymptotic variance. Consider two regressor sets X_1
and X_2 with X_1 ⊆ X_2. X_1 may be the empty set, which is the case when the instrumental
variable is randomly assigned. Suppose that both regressor sets satisfy Assumption 1. In other
words, controlling for X_1 is sufficient to obtain consistent estimates, but X_2 may help to reduce
variance. Hence, regressors that turn out to be insignificant in a regression of Z on control
variables could nevertheless be retained as regressors for efficiency reasons. In the particular
case where Z is completely randomized, as e.g. in a controlled experiment where Z is under the
control of the experimenter, one could nevertheless gain efficiency by incorporating control
variables in the estimation process. So far, this is only an asymptotic result and it is unclear at
the moment whether this would lead to improvements in finite samples.^196

Relationship to other estimators of the unconditional QTE

The previous estimators can also be related to the case of selection on observables (157),
i.e. the case where treatment D is exogenous conditional on X. If D is exogenous, then we can
use D "as its own instrument" and set Z ≡ D, and the above representation simplifies to

   F_{Y^1}(u) = ∫ E[ 1(Y ≤ u) | X, D = 1 ] dF_X .

When the conditional distribution is estimated by local regression, we obtain the estimator
proposed in Frölich (2007b); when it is estimated by nonparametric quantile regression, this
is the estimator proposed by Melly (2006); when it is estimated by parametric methods we
obtain the estimators proposed by Machado and Mata (2005), Gosling, Machin, and Meghir
(2000) and, more generally, Chernozhukov, Fernández-Val, and Melly (2007). Furthermore, in
this exogenous case, our weights simplify to

   W = D / p(X) + (1 - D) / (1 - p(X)) ,

where p(x) = Pr(D = 1|X = x). These are exactly the weights proposed by Firpo (2007). The
weights of Firpo (2007) are always positive such that there is no need for estimating positive
weights by projection. Finally, note that the whole population complies when we assume
exogeneity, such that we obtain the QTE for the entire population.

Frölich and Melly (2007) also discuss the relationship to Abadie, Angrist, and Imbens (2002),
who are interested in estimating conditional QTE, as discussed in the next section. One could
attempt to adapt their approach to estimating unconditional QTE by using their weights (172)
with only a constant and D, but no X, in the parametric specification. However, this approach
would not lead to consistent estimation, as it would converge to the difference between the
quantiles of the treated compliers and the non-treated compliers, respectively:

   F_{Y^1|c,D=1}^{-1}(τ) - F_{Y^0|c,D=0}^{-1}(τ) .

^196 Frölich and Melly (2007) examine the JTPA experiment as an example and find efficiency gains of about 5%.

This difference is not very meaningful as one compares the Y^1 outcomes among the treated with
the Y^0 outcomes among the non-treated. Therefore, in the general case the weights (172) are
only useful to estimate conditional quantile effects. Hence, if one is interested in nonparametric
estimation of the unconditional QTE, one should use the weights in (164) but not those in
(172). (However, when X is the empty set, e.g. in the case where Z is randomly assigned, the
weights (164) and those in (172) are proportional such that both approaches converge to the
same limit.)

16.6 Conditional QTE with endogeneity

In this section we examine estimation of conditional QTE, i.e. the quantile treatment effect
conditional on X. When X contains continuous regressors, fully nonparametric estimation
will be slower than √n rate because of the curse of dimensionality.
The early contributions to the estimation of conditional QTE usually imposed functional
form assumptions. (They also often imposed restrictions on treatment effect heterogeneity, e.g.
that the QTE does not vary with X, which in fact often implies equality of conditional and
unconditional QTE.) Koenker and Bassett (1978) proposed and derived the statistical properties
of a parametric (linear) estimator for conditional quantile models. Due to its ability to capture
heterogeneous effects, its theoretical properties have been studied extensively and it has been
used in many empirical studies; see, for example, Powell (1986), Guntenbrunner and Jurečková
(1992), Buchinsky (1994), Koenker and Xiao (2002), Angrist, Chernozhukov, and Fernández-
Val (2006). Chaudhuri (1991) analyzed nonparametric estimation of conditional QTE. A recent
contribution is Hoderlein and Mammen (2007), who consider marginal effects in nonseparable
models.
All these estimators assume that the treatment selection is exogenous conditional on
X.^197 Alternatively, instrumental variable type estimators have been considered. Abadie,
Angrist, and Imbens (2002) and Chernozhukov and Hansen (2005, 2006, 2007) have proposed
linear instrumental variable quantile regression estimators. Chernozhukov, Imbens, and
Newey (2007) and Horowitz and Lee (2007) have considered nonparametric IV estimation of
conditional quantile functions. In a series of papers, Chesher (2003, 2005, 2007) also examines
nonparametric identification of conditional effects.

^197 Also called "selection on observables", "conditional independence" or "unconfoundedness".
396 Advanced Microeconometrics Markus Frölich

16.6.1 Abadie, Angrist, Imbens estimator

In this section, we first consider the estimator of Abadie, Angrist, and Imbens (2002), which incorporates a parametric assumption and can be considered as an instrumental variable estimator of the conditional QTE. They exploit monotonicity in the treatment choice equation but aim at estimating the conditional quantile treatment effects
$$Q^\tau_{Y^1|X} - Q^\tau_{Y^0|X}$$
for the subgroup of compliers, using a parametric approach. They assume that, conditional on X, the $\tau$-quantile of Y in the subpopulation of compliers is linear:
$$Q_\tau(Y|X, D, T=c) = \alpha D + X'\beta \qquad (170)$$

where D and Z are binary. T = c refers to the subpopulation of compliers, i.e. those individuals who would change from $D_i = 0$ to $D_i = 1$ if the instrument were changed by external intervention from Z = 0 to Z = 1. For this linear quantile regression model, $\alpha$ and $\beta$ could be estimated via
$$(\alpha,\beta) = \arg\min_{a,b}\; E\left[\rho_\tau\left(Y - aD - X'b\right)\,\middle|\, T=c\right], \qquad (171)$$
with $\rho_\tau$ the check function, if the subpopulation of compliers were known. Note that for any absolutely integrable function $\psi(Y,D,X)$,
$$E\left[\psi(Y,D,X)\,|\,T=c\right]\cdot P(c) = E\left[\kappa\,\psi(Y,D,X)\right],$$
where
$$\kappa = 1 - \frac{D(1-Z)}{1-\pi(X)} - \frac{(1-D)Z}{\pi(X)} \qquad (172)$$
and $\pi(X) = \Pr(Z=1|X)$. To see this, note that with D and Z binary and the monotonicity assumption, the population consists of the three types always-participants, never-participants and compliers, such that we obtain
$$E[\kappa\,\psi(Y,D,X)] = E[\kappa\,\psi(Y,D,X)|T=a]\,P(a) + E[\kappa\,\psi(Y,D,X)|T=n]\,P(n) + E[\kappa\,\psi(Y,D,X)|T=c]\,P(c).$$
Inserting the $\kappa$ function and exploiting the exclusion restriction, the first two terms vanish (e.g. for always-participants $D=1$, such that $E[\kappa|X,T=a] = 1 - E[1-Z|X]/(1-\pi(X)) = 0$), while $\kappa = 1$ for compliers, which gives
$$E[\kappa\,\psi(Y,D,X)] = E\left[\psi(Y,D,X)\,|\,T=c\right]\,P(c).$$



This suggests estimating the coefficients by a weighted quantile regression based on
$$(\alpha,\beta) = \arg\min_{a,b}\; E\left[\kappa\,\rho_\tau\left(Y - aD - X'b\right)\right], \qquad (173)$$
where the term P(c) has been ignored as it does not affect the values at which this function is minimized.
Since (171) is globally convex in $a$ and $b$, the population function (173) is also convex, as the objective function is identical apart from multiplication by P(c). But in contrast to the previous section, the weights $\kappa_i$ are negative when $D \neq Z$, such that the sample analogue
$$\arg\min_{a,b}\; \frac{1}{n}\sum_i \kappa_i\,\rho_\tau\left(Y_i - aD_i - X_i'b\right) \qquad (174)$$
will not be globally convex in $a$ and $b$. Algorithms for such piecewise linear but nonconvex objective functions may not find the global optimum, and (174) does not have a linear programming (LP) representation.
Abadie, Angrist, and Imbens (2002) therefore suggest using the weights
$$E\left[\kappa\,|\,Y,D,X\right]$$
instead of $\kappa$, which can be shown to be always nonnegative. Although this permits the use of conventional LP algorithms, estimation of the weights $E[\kappa|Y,D,X]$ now requires either additional parametric assumptions or high-dimensional nonparametric regression.
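As an illustration, a minimal sketch of how the weights in (172) might be computed, with the first-step estimate of $\pi(X)$ obtained by a logit (the data arrays, the logit specification and all names here are hypothetical choices of this sketch, not part of Abadie, Angrist, and Imbens 2002):

```python
import numpy as np
import statsmodels.api as sm

def kappa_weights(d, z, X):
    """Sketch: kappa = 1 - D(1-Z)/(1-pi(X)) - (1-D)Z/pi(X) from (172).
    d, z are 0/1 arrays; pi(X) is estimated by a logit (an assumption)."""
    pi = sm.Logit(z, sm.add_constant(X)).fit(disp=0).predict()
    return 1 - d * (1 - z) / (1 - pi) - (1 - d) * z / pi
```

Note that these weights are negative whenever $D \neq Z$, which is exactly why (174) has no LP representation.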
As shown in Frölich and Melly (2007), one could alternatively use the weights (164) instead of the weights (172). Both weights would lead to consistent estimation of $\alpha$ and $\beta$.^{198} Hence, both types of weights, i.e. those in (172) and those in (164), would identify the conditional QTE, but it is not clear which one will be more efficient. For the compliers, W varies with x, whereas the weights in Abadie, Angrist, and Imbens (2002) are identical to one. In any case, both types of weights would generally be inefficient, since they do not incorporate the conditional density of the error term at the quantile. Hence, if one were mainly interested in estimating conditional QTE with a parametric specification, more efficient estimators could be developed.

Footnote 198: Instead of W one could also use $E[W|Y,X,D]$, which is always nonnegative, but usually not the weights $W^+ = E[W|Y,D]$, as conditioning on X is necessary here.

Instead of exploiting monotonicity in the relationship determining D, alternative approaches assume monotonicity in the relationship determining the Y variable, e.g. Chernozhukov and Hansen (2005), Chernozhukov and Hansen (2006) and Chernozhukov, Imbens, and Newey (2007), among others; see also Chapter 3.

17 Simulation and numerical methods

References:
- Cameron and Trivedi (2005, Chapters 10-12)
- Gourieroux, Monfort and Renault (1993, Journal of Applied Econometrics)

17.1 Numerical optimization

In the previous chapter we examined stochastic properties in the presence of weak identification. But weak identification may also pose numerical problems for non-linear estimators.

Estimation of OLS is very simple since we have a closed-form expression. Several other estimators are based on minimizing or maximizing an objective function, such as NLS, GMM and ML estimators (probit, logit, multinomial probit):
$$\hat\theta = \arg\max_\theta\; \ln L(\theta; W_1, \ldots, W_N).$$
For most nonlinear estimators no closed-form expression is available. This requires numerical maximization.

For standard models (e.g. probit), fast algorithms are available in most statistical software. If you need to programme your own estimator, numerical optimization commands are available. But: optimization may be slow and may not converge. Some knowledge about numerical optimization is therefore very helpful, also for readily available estimators, e.g. the multinomial probit.

Often we have some kind of M-estimator
$$\hat\theta = \arg\max_{\theta\in\Theta} Q_N(\theta)$$

where the true value is defined by
$$\theta_0 = \arg\max_{\theta\in\Theta} Q_0(\theta).$$
If we have an interior solution,
$$\frac{\partial Q_N(\hat\theta)}{\partial\theta} = 0, \qquad \frac{\partial Q_0(\theta_0)}{\partial\theta} = 0.$$

If $\Theta$ contains only a small number of possible values, use grid search. But usually we need to use some iterative method. (Nevertheless, a restricted grid search is useful for obtaining starting values.)

Iterative methods based on the gradient: beginning with some starting values $\hat\theta_0$, the parameters $\hat\theta_{s+1}$ are computed from the parameters $\hat\theta_s$ as
$$\hat\theta_{s+1} = \hat\theta_s + \lambda_s A_s g_s,$$
where $g_s$ is the $q \times 1$ gradient
$$g_s = \frac{\partial Q_N(\theta)}{\partial\theta}\Big|_{\hat\theta_s},$$
$A_s$ is a $q \times q$ matrix and $\lambda$ is a steplength.
Ignore the steplength for the moment (i.e. set $\lambda = 1$). The gradient gives the direction of steepest ascent. The matrix $A_s$ determines how far to go in each direction.

The Newton-Raphson method uses
$$A_s = -H_s^{-1} = -\left(\frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'}\Big|_{\hat\theta_s}\right)^{-1},$$
i.e.
$$\hat\theta_{s+1} = \hat\theta_s - H_s^{-1} g_s.$$
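A minimal sketch of this loop in Python, assuming user-supplied gradient and Hessian routines (all names and the tolerance are my own arbitrary choices):

```python
import numpy as np

def newton_raphson(theta, grad, hess, tol=1e-8, max_iter=100):
    """Iterate theta_{s+1} = theta_s - H_s^{-1} g_s until the step is negligible."""
    for _ in range(max_iter):
        g, H = grad(theta), hess(theta)
        step = np.linalg.solve(H, g)   # solves H * step = g rather than inverting H
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```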

Why should we use this method?



Examine the mean-value expansion of the gradient around $\hat\theta$:
$$\frac{\partial Q_N(\hat\theta_s)}{\partial\theta} = \underbrace{\frac{\partial Q_N(\hat\theta)}{\partial\theta}}_{=0} + \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'}\Big|_{\bar\theta}\left(\hat\theta_s - \hat\theta\right),$$
where $\bar\theta$ is on the line between $\hat\theta_s$ and $\hat\theta$. This gives
$$\hat\theta = \hat\theta_s - \left(\frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'}\Big|_{\bar\theta}\right)^{-1} g_s.$$
Hence, if we knew $\bar\theta$ and thus could compute the Hessian at $\bar\theta$, we would jump in one step from the value $\hat\theta_s$ to the solution $\hat\theta$. This may work very well when $\hat\theta_s$ is close to $\hat\theta$. But since the starting value $\hat\theta_0$ is usually rather far from $\hat\theta$, we need many iterations.

Another justification can be given as follows. Consider a second-order series expansion of $Q_N(\theta)$ about $\hat\theta_s$:
$$Q_N(\theta) = Q_N(\hat\theta_s) + g_s'\left(\theta - \hat\theta_s\right) + \tfrac{1}{2}\left(\theta - \hat\theta_s\right)' H_s \left(\theta - \hat\theta_s\right) + \text{remainder}.$$
Ignoring the remainder term, what is the best value of $\theta$ given that we have $\hat\theta_s$? This question is answered by differentiating $Q_N(\theta)$ with respect to $\theta$ and setting this derivative to zero. We obtain, ignoring the remainder term, $\partial Q_N(\theta)/\partial\theta = g_s + H_s(\theta - \hat\theta_s) = 0$, which yields $\theta = \hat\theta_s - H_s^{-1} g_s$. Hence, $\hat\theta_{s+1}$ should be chosen as $\hat\theta_s - H_s^{-1} g_s$ to maximize $Q_N(\hat\theta_{s+1})$.^{199}

This method only works if the Hessian is invertible.

Another nice feature is that if the starting values $\hat\theta_0$ are $\sqrt{N}$-consistent, then $\hat\theta_s$, $\hat\theta_{s+1}$ and $\hat\theta$ have the same asymptotic distribution. Hence, from an asymptotic perspective, there is no need to iterate until convergence: one Newton-Raphson step suffices.

For example, if $\hat\theta_0$ are the first-step GMM estimates, then the first iteration of the second-step GMM estimator already yields the efficient estimator. (In practice one prefers to iterate until convergence, but if that requires too much computation time, one Newton-Raphson step suffices.)

Footnote 199: This heuristic argument may only work well if $\hat\theta_{s+1} - \hat\theta_s$ is not too large. Otherwise, the remainder term may be too large to be ignored.

Notice that this formula is the same for minimization and for maximization! I.e. pre-multiplying $Q_N(\theta)$ by minus one (which turns the maximization into a minimization) reverses the signs of both $H_s$ and $g_s$ and thus does not change the above formulae. Hence, the iteration itself does not know whether we are minimizing or maximizing!

By using a second-order expansion of $Q_N(\hat\theta_{s+1})$ around $\hat\theta_s$ and inserting the iteration formula, we obtain
$$Q_N(\hat\theta_{s+1}) - Q_N(\hat\theta_s) = -\tfrac{1}{2}\left(\hat\theta_{s+1} - \hat\theta_s\right)' H_s \left(\hat\theta_{s+1} - \hat\theta_s\right) + \text{remainder term},$$
which is positive if $H_s$ is negative definite (except for the remainder term!).^{200}

Footnote 200: Remember that a matrix A is negative definite if $x'Ax < 0$ for every vector $x \neq 0$.

The Hessian is negative definite when we are close to the solution, but may be positive definite or singular when we are distant from the solution. For estimators whose objective function is globally concave (probit, logit), Newton-Raphson performs very well.

In practice it is often convenient to use an estimate of the expected Hessian at the true value instead of the Hessian itself. E.g. consider a binary choice model
$$\max_\beta \sum_i Y_i \ln F_i + (1 - Y_i)\ln(1 - F_i),$$
where
$$F_i = F(X_i\beta)$$
and F is the probit or logit cdf, for example. The gradient is
$$\sum_i (Y_i - F_i)\,\frac{f_i}{F_i(1-F_i)}\, X_i'$$
and the expected value of the Hessian at the true value $\beta_0$ is
$$-E\left[\sum_i \frac{f_i^2}{F_i(1-F_i)}\, X_i' X_i\right].$$
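A sketch of one iteration based on this expected Hessian (the method of scoring) for the probit case; the data arrays and names are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def scoring_step(beta, y, X):
    """One step beta_{s+1} = beta_s - EH^{-1} g for a probit, using the
    expected Hessian derived above (clipping avoids division by zero)."""
    xb = X @ beta
    F = np.clip(norm.cdf(xb), 1e-10, 1 - 1e-10)
    f = norm.pdf(xb)
    g = X.T @ ((y - F) * f / (F * (1 - F)))              # gradient
    EH = -(X * (f**2 / (F * (1 - F)))[:, None]).T @ X    # estimated expected Hessian
    return beta - np.linalg.solve(EH, g)
```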

To ensure that we are actually maximizing as fast as possible, we would like to choose $\hat\theta_{s+1}$ such that the change $Q_N(\hat\theta_{s+1}) - Q_N(\hat\theta_s)$ is positive and large. We may therefore augment the process by a step-length $\hat\lambda_s$:
$$\hat\theta_{s+1} = \hat\theta_s + \hat\lambda_s A_s g_s.$$
Different algorithms are used for $\hat\lambda_s$. Often $Q_N(\hat\theta_{s+1})$ is computed for several different potential values of $\hat\lambda_s$ for given $\hat\theta_s, A_s, g_s$, and a polynomial is fitted to obtain the best $\hat\lambda_s$. If this does not lead to an increase over $Q_N(\hat\theta_s)$, different methods are tried (e.g. line search). This process is used since the computation of $A_s g_s$ can be very time-consuming, whereas trying different values of $\lambda$ does not require much computation time.
The iterations continue until the changes in $\hat\theta_s$, $Q_N(\hat\theta_s)$ and $H_s^{-1} g_s$ from one iteration to the next are very small. (Different stopping rules and convergence criteria exist.)

Computation of derivatives: to compute the gradient, the computer uses numerical differentiation, either forward differencing
$$\frac{\partial Q_N(\theta)}{\partial\theta_j} \approx \frac{Q_N(\theta + h e_j) - Q_N(\theta)}{h}$$
or central differencing
$$\frac{\partial Q_N(\theta)}{\partial\theta_j} \approx \frac{Q_N(\theta + h e_j) - Q_N(\theta - h e_j)}{2h},$$
where h is a small quantity and $e_j$ is a vector of zeros with j-th entry equal to one. The Hessian is computed by differentiating the gradient.
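A sketch of the central-difference formula (the step size h = 1e-6 is an ad hoc choice; the next paragraph explains why it cannot be made arbitrarily small):

```python
import numpy as np

def num_gradient(Q, theta, h=1e-6):
    """Central-difference approximation to the gradient of a scalar function Q."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h                                   # h times the unit vector e_j
        g[j] = (Q(theta + e) - Q(theta - e)) / (2 * h)
    return g
```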
Here, the limited precision of numbers in a computer becomes relevant. If h is too small, the gradient will be imprecise due to rounding error. The computation of the numerical derivative of the gradient results in a further loss of precision, and computing the inverse of the Hessian leads to yet another reduction in the precision of the estimates. When the objective function is relatively flat, e.g. when a parameter is only weakly identified, such rounding errors can substantially reduce the precision in the estimation of the Hessian.

Analytical derivatives or even analytical Hessians can substantially increase the precision in the computation of the gradients and the Hessian. If no analytical Hessian is provided, the numerical derivatives of the analytical first derivatives are taken to compute the Hessian, which will often be more precise than using numerical derivatives of numerical derivatives. Analytical derivatives can be derived by symbolic differentiation or automatic symbolic differentiation, where the latter approach also provides efficient computer code for the computation of derivatives (see Judd 1998, p. 37).^{201} Nevertheless, even with analytical derivatives, approximation errors due to limited machine precision will occur.

Footnote 201: When you programme your own analytical derivatives, you should compare them to the numerical derivatives to verify that the first 7 to 9 digits are identical. If not, this could be a sign that your analytical derivatives are incorrect.

Numerical machine precision: real numbers can be stored only with limited precision.
Example: in Stata 8, mod(0.3, 0.1) gives 0.1. Why is this? The values 0.3 and 0.1 are converted to binary base and would be stored as approximately
0.2999999999999999999990
0.1000000000000000000010
If 0.3 is divided by 0.1 this gives approximately
2.9999999999999999999600
such that floor(0.3/0.1) gives 2, and hence the remainder mod(0.3, 0.1) is (almost) 0.1.
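The same effect can be reproduced with IEEE 754 double precision in other languages, e.g. in Python:

```python
import math

print(0.3 / 0.1)              # 2.9999999999999996, not 3.0
print(math.floor(0.3 / 0.1))  # 2
print(0.3 % 0.1)              # about 0.1 (0.09999999999999998)
```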

Therefore, using double or quadruple precision may ameliorate such problems, but would not solve them in all cases.

Condition number of the Hessian (Judd 1998, p. 67): such rounding errors can become particularly relevant when "inverting the Hessian matrix" in
$$\hat\theta_{s+1} = \hat\theta_s - H_s^{-1} g_s.$$
In fact, the Hessian matrix is never directly inverted (as this would entail an even greater loss of precision); instead, the following equation is solved for $\hat\theta_s - \hat\theta_{s+1}$:
$$H_s\left(\hat\theta_s - \hat\theta_{s+1}\right) = g_s.$$


If there are rounding or approximation errors in $g_s$, they will be inflated via the Hessian matrix. The log condition number of a symmetric matrix is defined as
$$\log_{10} \frac{\sqrt{\text{eigenvalue}_{\max}}}{\sqrt{\text{eigenvalue}_{\min}}}.$$
A large log condition number is an indication of near singularity of a matrix. Suppose for the moment that there are no errors in $H_s$ but only in $g_s$. The number of digits lost in solving the equation
$$H_s\left(\hat\theta_s - \hat\theta_{s+1}\right) = g_s$$
is approximately equal to the log condition number of $H_s$. If $H_s$ were the identity matrix (with log condition number 0), the solution $\hat\theta_s - \hat\theta_{s+1}$ would be as precise as $g_s$. But if the condition number is larger, the solution is less precise because of the "inversion" of the Hessian. On many computers, floating point numbers have 16 decimal places of accuracy (double precision, depending on the software). Hence, if the log condition number is 16 or larger, the direction of the changes $\hat\theta_{s+1} - \hat\theta_s$ would be purely driven by machine precision errors. If the log condition number is larger than 8, the problem is considered to be ill-conditioned.
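A sketch of this diagnostic, following the definition above (absolute values cover indefinite Hessians; this is my own illustrative implementation):

```python
import numpy as np

def log10_condition(H):
    """Log-10 condition number of a symmetric matrix H, as defined above."""
    eig = np.abs(np.linalg.eigvalsh(H))
    return np.log10(np.sqrt(eig.max()) / np.sqrt(eig.min()))
    # note: np.log10(np.linalg.cond(H)) would give the eigenvalue ratio
    # without the square roots
```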


The imprecision in the computation of the Hessian adds to this factor. As a rule of thumb, on computers with 16 places of accuracy, about four places are lost in the numerical calculation of the first derivative and another four are lost in the numerical calculation of the second derivative. Hence, if a numerical Hessian is used, about 8 places of accuracy are already lost in the calculation of the second-order derivatives. If the log condition number of the numerical Hessian is 8, another 8 places of accuracy will be lost in the inversion of the Hessian.

Two factors are often the causes of large condition numbers: (near) linear dependencies and scaling. If some of the elements in the Hessian are very large and others are small, this leads to large and small eigenvalues and thus to large condition numbers.^{202} To avoid such problems, the regressors should be scaled to similar means and variances: instead of measuring yearly family income in Euros and the returns on investments by daily interest rates, all regressors should be scaled to have means (and ideally also variances) of similar order.

Footnote 202: The inversion of the Hessian, or the solving of the above equation, requires a series of operations including additions and subtractions. Adding two numbers of very different size can lead to a dramatic loss in precision, because their exponents have to be made the same. Consider the addition of the two numbers $1.234567890123456 \cdot 10^9$ and $1.234567890123456 \cdot 10^{-5}$. Since floating point numbers are stored in the form of abscissa and exponent, their exponents have to be made the same: the smaller number is converted to $0.000000000000012 \cdot 10^9$ and added to the first. Clearly, almost all of the accuracy of the smaller number is lost.

In nonlinear problems, scaling the regressors to similar means and variances may not always be the best way to scale the Hessian. If the Hessian remains ill-conditioned, one might try different ways of scaling the regressors in an attempt to obtain a Hessian with diagonal elements of similar magnitude.

The other factor is (near) linear dependencies in the Hessian. It might not always be obvious what the origins of linear dependencies in the Hessian are, but a first check would be to examine linear dependencies in the matrix of regressors. If the log condition number of $X_N'X_N$ is larger than 2, one might be concerned about linear dependencies and perhaps try estimating the model with a subset of the regressors only.

Steepest ascent, BHHH, DFP, BFGS methods: in practice, the computation of the Hessian may often be very time-consuming, and the Hessian may often not be invertible. Therefore, alternative matrices $A_s$ are often used instead of $-H_s^{-1}$.
Steepest ascent simply uses $A_s = I$ and relies on the search methods for the step length $\lambda$. (This may be a good method to start with.)
The BHHH method is available for estimators with an objective function of the type
$$Q_N(\theta) = \sum_i q_i(\theta),$$
e.g. ML, NLS, MM, but not GMM. The matrix $A_s$ is chosen as
$$A_s = \left(\sum_i \frac{\partial q_i(\theta)}{\partial\theta}\,\frac{\partial q_i(\theta)}{\partial\theta'}\Big|_{\hat\theta_s}\right)^{-1},$$
which only requires first derivatives and is always p.d.


This is motivated by the information matrix equality for ML estimators, where $Q_N(\theta) = \sum_i \ln f_i(\theta)$:
$$-E\left[\sum_i \frac{\partial^2 \ln f_i(\theta_0)}{\partial\theta\,\partial\theta'}\right] = E\left[\sum_i \frac{\partial \ln f_i(\theta_0)}{\partial\theta}\,\sum_j \frac{\partial \ln f_j(\theta_0)}{\partial\theta'}\right] = \sum_i E\left[\frac{\partial \ln f_i(\theta_0)}{\partial\theta}\,\frac{\partial \ln f_i(\theta_0)}{\partial\theta'}\right],$$
where the cross terms vanish since the observations are assumed to be independent.

Most frequently used is the BFGS algorithm, which is a refinement of DFP. DFP and BFGS do not compute the Hessian in each iteration but update an approximation to it in each iteration. DFP computes $A_{s+1}$ as
$$A_{s+1} = A_s + \frac{A_s g_s g_s' A_s}{g_s' A_s (g_{s+1} - g_s)} - \frac{A_s (g_{s+1} - g_s)(g_{s+1} - g_s)' A_s}{(g_{s+1} - g_s)' A_s (g_{s+1} - g_s)}.$$
$A_s$ converges to $-H_s^{-1}$. This method requires only first derivatives and $A_s$ is always p.d.
BFGS is a refinement of the DFP algorithm and often performs best, in particular when the starting values are poor.
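In practice one rarely codes DFP or BFGS by hand. As a sketch, a generic objective can be maximized with the BFGS implementation in scipy by minimizing its negative (the objective here is a toy placeholder of mine):

```python
import numpy as np
from scipy.optimize import minimize

def Q(theta):                       # hypothetical objective to be maximized
    return -np.sum((theta - np.array([1.0, 2.0])) ** 2)

res = minimize(lambda t: -Q(t), x0=np.zeros(2), method="BFGS")
theta_hat = res.x                   # BFGS builds up an approximation to the inverse Hessian
```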

Simulated annealing is a non-gradient method that also permits changes in $\hat\theta_s$ that decrease $Q_N(\theta)$, to avoid getting locked into a local optimum. Given $\hat\theta_s$, we randomly perturb one of the components by a random number, e.g. a Cauchy draw. If this new value of $\theta$ leads to an increase in $Q_N(\theta)$, it is kept as $\hat\theta_{s+1}$; it is also sometimes kept if $Q_N(\theta)$ decreases. Usually gradient methods with different starting values are more useful, but simulated annealing can be useful for obtaining starting values.

Practical considerations:
- think about whether the objective function is globally concave/convex or not
- scaling of the Hessian
- try different starting values (local optima), step-length procedures, "Hessians" (BFGS)
- switch methods after e.g. 20 iterations
- program analytical gradients (and Hessian)
- use a random subsample of the dataset: if the estimates change a lot, this could be a signal for a local optimum
- use a simulated data set (i.e. a known dgp): if the estimator still does not converge, perhaps the model is not identified (e.g. a multinomial probit model with too many free covariance parameters)

If we use first-order conditions instead of maximizing a function (simulated scores), we cannot distinguish between (local) minima and maxima.
→ Always inspect the objective function after convergence.
→ Always use a few different starting values.
(This is difficult within a Monte Carlo study: here simulated annealing can be useful!)

17.2 Simulation based methods (MSL, SGMM)

Consider ML estimation of $\theta$ for the model $f_{Y|X}(y|x;\theta)$, where the expression f may be difficult to compute. Often we have some model which contains an integral expression, e.g. because of rational expectations:
$$f_{Y|X}(y|x;\theta) = \int h(y,x,u,\theta)\, g(u)\, du,$$
where h and g are known functions. If there is no analytical solution for the integral, evaluation of the likelihood function will be very difficult.

Example: multinomial probit model. Consider $Y \in \{1,2,3\}$ and the model
$$Y_i = \arg\max_k Y^*_{ik},$$
$$Y^*_{i1} = X_{i1}\beta_1 + U_{i1}, \qquad Y^*_{i2} = X_{i2}\beta_2 + U_{i2}, \qquad Y^*_{i3} = X_{i3}\beta_3 + U_{i3},$$
where $U_{i1}, U_{i2}, U_{i3}$ are iid across individuals and independent of $X_1, X_2, X_3$.

The probability to buy a BMW is then
$$\Pr(Y=1|X_1,X_2,X_3) = \Pr\left(X_1\beta_1 - X_2\beta_2 \geq U_2 - U_1,\; X_1\beta_1 - X_3\beta_3 \geq U_3 - U_1\right)$$
$$= E\left[1\left(X_1\beta_1 - X_2\beta_2 \geq U_2 - U_1\right)\cdot 1\left(X_1\beta_1 - X_3\beta_3 \geq U_3 - U_1\right)\right]$$
$$= F_{U_2-U_1,\,U_3-U_1}\left(X_1\beta_1 - X_2\beta_2,\; X_1\beta_1 - X_3\beta_3\right)$$
$$= \int_{-\infty}^{X_1\beta_1 - X_3\beta_3}\int_{-\infty}^{X_1\beta_1 - X_2\beta_2} f_{U_2-U_1,\,U_3-U_1}(\varepsilon_1,\varepsilon_2)\, d\varepsilon_1\, d\varepsilon_2.$$

If the error differences are jointly logistically distributed (as when the $U$'s are iid type I extreme value), we obtain the multinomial logit model. This has the unattractive IIA property. If $U_1, U_2, U_3$ are jointly normal, no closed-form expression exists for the above integral.

Approximation methods can be used to approximate the cdf of the joint normal, but they become either imprecise or very slow if the dimension is larger than 3. The alternative is simulation of the integral.
If $U_1, U_2, U_3$ are jointly normal, then $U_2 - U_1$ and $U_3 - U_1$ are also jointly normal, $N(0,\Sigma)$, where $\Sigma$ may depend on several parameters to be estimated. Let $\Sigma^{1/2}\Sigma^{1/2\prime} = \Sigma$. We could approximate $\Pr(Y=1|X_1,X_2,X_3)$ by
- drawing repeatedly 2 standard normal variables and multiplying by $\Sigma^{1/2}$ to obtain $(\varepsilon_1, \varepsilon_2)$,
- and counting how often $1(X_1\beta_1 - X_2\beta_2 \geq \varepsilon_1)\cdot 1(X_1\beta_1 - X_3\beta_3 \geq \varepsilon_2)$ is true.
For a large number of replications, by the WLLN this approximates the expectation.

This simulator for $\Pr(Y=1|X_1,X_2,X_3)$ is unbiased.
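A sketch of this frequency simulator for one observation; the index values and $\Sigma$ below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def pr_bmw(v2, v3, Sigma, S=10000):
    """Frequency simulator for Pr(Y=1), with v2 = X1 b1 - X2 b2, v3 = X1 b1 - X3 b3
    and Sigma the 2x2 covariance matrix of (U2 - U1, U3 - U1)."""
    L = np.linalg.cholesky(Sigma)                 # a square root Sigma^{1/2}
    eps = rng.standard_normal((S, 2)) @ L.T       # draws with covariance Sigma
    return np.mean((eps[:, 0] <= v2) & (eps[:, 1] <= v3))

print(pr_bmw(0.5, 1.0, np.array([[2.0, 1.0], [1.0, 2.0]])))
```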

Another example is the panel probit model.

17.2.1 Maximum simulated likelihood estimation


$$f_{Y|X}(y|x;\theta) = \int h(y,x,u,\theta)\, g(u)\, du$$
Direct simulator:
$$\hat f_{Y|X}(y|x;\theta) = \frac{1}{S}\sum_{s=1}^S \tilde f(y,x,u_s,\theta) = \frac{1}{S}\sum_{s=1}^S h(y,x,u_s,\theta),$$
where $u_s$ is drawn from the distribution g(u). The simulator is unbiased if
$$E\left[\tilde f(y,x,u_s,\theta)\right] = f_{Y|X}(y|x;\theta).$$

To obtain numerical convergence and correct computation of the gradients, it is important that the same $u_s$ are used in every iteration. (Always save all $u_s$, or otherwise reset the seed when reaching the first observation of the sample, whether evaluating the objective function, the gradients or the Hessian.)

The MSL estimator is
$$\hat\theta_{MSL} = \arg\max_\theta \sum_{i=1}^N \ln \hat f(Y_i|X_i;\theta).$$
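A minimal MSL sketch for a model whose density has no closed form: a Poisson model with lognormal unobserved heterogeneity (this toy model and all names are my own choices). Note the common random numbers U, drawn once and reused in every evaluation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(0)
N, S = 500, 200
X = rng.normal(size=N)
Y = rng.poisson(np.exp(0.7 * X + 0.5 * rng.normal(size=N)))  # true (beta, sigma)

U = rng.normal(size=(N, S))     # simulation draws: fixed across all iterations

def neg_sim_loglik(theta):
    beta, sigma = theta
    lam = np.exp(X[:, None] * beta + sigma * U)          # N x S intensities
    fhat = poisson.pmf(Y[:, None], lam).mean(axis=1)     # direct simulator of f(Y|X)
    return -np.sum(np.log(np.maximum(fhat, 1e-300)))

theta_msl = minimize(neg_sim_loglik, x0=[0.1, 0.3], method="Nelder-Mead").x
```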

If $\hat f$ is differentiable in $\theta$, then standard gradient methods can be used for estimation. If $S \to \infty$ as $N \to \infty$,
$$\operatorname{plim}_{N\to\infty} \hat\theta_{MSL} = \operatorname{plim}_{N\to\infty} \hat\theta_{ML};$$
the MSL estimator is consistent, because $\hat f_i \to f_i$ implies $\ln \hat f_i \to \ln f_i$, which implies that $\frac{1}{N}\ln \hat L(\theta) \to \frac{1}{N}\ln L(\theta)$.

Regarding the variance, the simulation might introduce additional variance, such that MSL might be less precise than ML. Yet, if S increases sufficiently fast in that
$$\frac{\sqrt{N}}{S} \to 0,$$
which would be satisfied if $S = N^{1/2+\varepsilon}$ with $\varepsilon > 0$, the asymptotic variances of ML and MSL are identical (to first order).
Hence, we can use the inverse of the (negative) estimated Hessian matrix as an estimator of the asymptotic variance of $\hat\theta$:
$$\left(-\frac{1}{N}\sum_{i=1}^N \frac{\partial^2 \ln f(Y_i|X_i;\hat\theta_{MSL})}{\partial\theta\,\partial\theta'}\right)^{-1},$$
where we again could replace f by the simulator $\hat f$. It might often be simpler to use the OPG form (using the information matrix equality)
$$\left(\frac{1}{N}\sum_{i=1}^N \frac{\partial \ln f(Y_i|X_i;\hat\theta_{MSL})}{\partial\theta}\,\frac{\partial \ln f(Y_i|X_i;\hat\theta_{MSL})}{\partial\theta'}\right)^{-1},$$
which requires only the first derivative.

For $S < \infty$ the MSL is biased. This also implies that MSL would be inconsistent if S did not increase to $\infty$ as $N \to \infty$. The reason is that the simulator $\ln \hat f$ is biased for $\ln f$ even if $\hat f$ is unbiased for f, because of the logarithm. We could avoid this if we could construct an unbiased simulator for $\ln f$, which is often too difficult.
Alternatively, Gourieroux and Monfort (1991) suggested using a bias-adjusted log-likelihood function. A second-order expansion of $\ln \hat f$ is
$$\ln \hat f \approx \ln f + \frac{\hat f - f}{f} - \frac{(\hat f - f)^2}{2f^2}.$$

Now suppose that $\hat f$ is unbiased for f, such that $E_u[\hat f] = f$, where the expectation is with respect to the density of u. Taking this expectation of the expansion gives
$$E_u\left[\ln \hat f\right] = \ln f - \frac{E_u\left[(\hat f - f)^2\right]}{2f^2},$$
or
$$\ln f = E_u\left[\ln \hat f\right] + \frac{E_u\left[(\hat f - f)^2\right]}{2f^2}.$$
Hence, the bias of $\ln \hat f$ increases with the variance of $\hat f$.
The bias-corrected MSL estimator adds the rightmost term to the likelihood function:
$$\hat\theta_{BCMSL} = \arg\max_\theta \sum_{i=1}^N \left\{ \ln \hat f(Y_i|X_i;\theta) + \frac{\frac{1}{S}\sum_{s=1}^S \left(\tilde f(Y_i,X_i,u_s,\theta) - \hat f(Y_i|X_i;\theta)\right)^2}{2S\,\hat f(Y_i|X_i;\theta)^2} \right\},$$
where $\frac{1}{S}\sum_s (\tilde f - \hat f)^2$ is an approximation to the variance of the single-draw simulator $\tilde f$, so that dividing it by S approximates the variance of $\hat f$.
The usefulness of this approach varies from case to case, as the remaining bias may be large.

17.2.2 Method of simulated moments

Suppose we have conditional moment conditions
$$E\left[m(Y_i, X_i, \theta_0)\,|\,X_i\right] = 0,$$
which we could also express as
$$E\left[A(X_i)\, m(Y_i, X_i, \theta_0)\right] = 0,$$
where $A(X_i)$ is some "instrument matrix". Consider the just-identified case only, where $\dim(A) = \dim(\theta)$. If m is some complicated function depending on an integral, we could replace it by an unbiased simulator $\hat m$ with
$$E\left[\hat m(y,x,u_s,\theta)\right] = m(y,x,\theta).$$

The MSM estimator, which minimizes the squared deviations from the moment conditions, is consistent and asymptotically normal for $N \to \infty$ even with S fixed! Hence, we do not need S to be very large, but the variance will be larger if S is small:
$$\sqrt{N}\left(\hat\theta_{MSM} - \theta_0\right) \overset{d}{\to} N\left(0,\; A_0^{-1} B_0 A_0^{-1\prime}\right),$$
where
$$A_0 = \operatorname{plim} \frac{1}{N}\sum_i A(X_i)\,\frac{\partial m(Y_i, X_i, \theta_0)}{\partial\theta'}$$
and
$$B_0 = \operatorname{plim} \frac{1}{N}\sum_i A(X_i)\, V[\hat m_i]\big|_{\theta_0}\, A(X_i)',$$
where $V[\hat m_i]$ is the variance with respect to both the conditional distribution of Y given X and the draws of $u_s$. For the case where the simulator is a frequency simulator, it can be shown for the asymptotic variance that
$$V[\hat m_i(\theta_0)] = \left(1 + \frac{1}{S}\right) V[m_i(\theta_0)].$$
Hence, even a fairly small S = 100 leads to only a 1% increase in variance (asymptotically).

Why should one ever want to use MSL then? ML could be more efficient than MM (with linear moment restrictions) and is sometimes easier to implement, and MSM can have a larger small-sample bias than the simulation bias of MSL.

17.2.3 Indirect inference

The theory of indirect inference is concisely described in Gourieroux, Monfort and Renault (1993, Journal of Applied Econometrics).
Suppose the true dgp belongs to a model
$$f_W(w;\theta_0), \qquad \theta_0 \in \Theta \subset \mathbb{R}^q,$$
that is difficult to estimate but easy to simulate from.

Suppose we have an auxiliary model based on some criterion function that is easy to maximize:
$$\hat\beta_N = \arg\max_{\beta \in B \subset \mathbb{R}^r} Q_N(W_1, \ldots, W_N;\beta).$$
The auxiliary model could be an approximation to the likelihood, or the exact likelihood of an approximate model, or any other model.
Suppose that the criterion function tends to a nonstochastic limit
$$\lim_{N\to\infty} Q_N(W_1,\ldots,W_N;\beta) = Q_\infty(\beta;\theta_0),$$
which is a functional of the true dgp and thus a function of $\theta_0$. Suppose that the limit function is continuous in $\beta$ and has a unique maximum
$$\beta_0 = \beta(\theta_0) = \arg\max_{\beta\in B} Q_\infty(\beta;\theta_0).$$
By conventional asymptotic theory, the estimator $\hat\beta_N$ is consistent for $\beta(\theta_0)$:
$$\operatorname{plim}_{N\to\infty} \hat\beta_N = \beta(\theta_0).$$

If we knew the mapping $\beta(\theta_0)$, we could invert it to obtain $\theta$ as a function of $\beta$ and plug in the consistent estimator $\hat\beta_N$. We do not know this function, but we can simulate it for different values of $\theta$. The idea is to simulate, for a large number of values of $\theta$, a new dataset of size N and to estimate the maximizer of $Q_N$ in these new datasets until we obtain an estimate that is almost identical to $\hat\beta_N$.
Define $\tilde\beta_N(\theta)$ as the maximizer of $Q_N$ in a simulated sample of size N drawn from $f_W(w;\theta)$. To reduce simulation noise, we could repeat this H times and define
$$\tilde\beta(\theta) = \frac{1}{H}\sum_{h=1}^H \tilde\beta_N^h(\theta).$$

The indirect estimator $\hat\theta$ is defined as the minimum distance solution
$$\hat\theta = \arg\min_\theta \left(\hat\beta_N - \tilde\beta(\theta)\right)' \Omega \left(\hat\beta_N - \tilde\beta(\theta)\right),$$
where $\Omega$ is a given symmetric p.d. matrix. It is important to always use the same random numbers in the simulations for every value of $\theta$.
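A compact sketch of the procedure with a toy example of my own: the structural model is an MA(1), which is trivial to simulate, the auxiliary statistic is the OLS AR(1) coefficient, and the same draws E are reused for every candidate θ:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
N, H = 1000, 10

def ma1(theta, e):                      # simulate w_t = e_t + theta * e_{t-1}
    return e[1:] + theta * e[:-1]

def ar1_coef(w):                        # auxiliary estimator beta_N
    return np.sum(w[1:] * w[:-1]) / np.sum(w[:-1] ** 2)

w_obs = ma1(0.5, rng.standard_normal(N + 1))   # "observed" data, true theta = 0.5
beta_hat = ar1_coef(w_obs)

E = rng.standard_normal((H, N + 1))            # fixed simulation draws

def distance(theta):
    beta_tilde = np.mean([ar1_coef(ma1(theta, E[h])) for h in range(H)])
    return (beta_hat - beta_tilde) ** 2

theta_hat = minimize_scalar(distance, bounds=(-0.99, 0.99), method="bounded").x
```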

The crucial assumption for consistency of $\hat\theta$ is that the function $\beta(\theta)$ is one-to-one and smooth:
$$\beta(\theta)\ \text{is one-to-one and}\ \frac{\partial\beta(\theta)}{\partial\theta'}\Big|_{\theta=\theta_0}\ \text{has full column rank},$$
which requires $\dim(\beta) \geq \dim(\theta)$. See Assumption A4 in Gourieroux, Monfort and Renault (1993).
An alternative estimator is
$$\hat\theta = \arg\min_\theta\ \frac{\partial Q_N\left(\{W_i^s\};\hat\beta_N\right)}{\partial\beta}'\,\Omega\,\frac{\partial Q_N\left(\{W_i^s\};\hat\beta_N\right)}{\partial\beta},$$
which may be particularly interesting when the score has a closed form.
Consider the MNP model: Gourieroux, Monfort and Renault (1993) suggest approximating the choice probabilities by a simple first-order expansion and by replacing normal distributions with logit distributions that are easy to evaluate.

17.2.4 Simulators

How shall we construct a simulator for
$$\int h(y,x,u,\theta)\, g(u)\, du\;?$$
To lighten the notation, keep $y, x, \theta$ implicit and consider
$$\int h(u)\, g(u)\, du.$$
Consider as an example
$$\int 1(u \in A)\, g(u)\, du,$$
where the definition of the set A may depend on $y, x, \theta$, i.e. $A = A(y,x,\theta)$.


Direct Monte Carlo integration (i.e. the frequency simulator) gives
$$\frac{1}{S}\sum_{s=1}^S 1(u_s \in A).$$
Disadvantages:
- It is neither differentiable nor continuous in $\theta$. For finite S, small changes in $\theta$ may lead to the same number of simulations falling in the set A, such that the simulated likelihood may be less sensitive to $\theta$ than the true likelihood.
- The simulator can be very inefficient if the set A is small. E.g. if the true (conditional) probability to buy a BMW is 0.001, then even with S = 10000 only about 10 of the $u_s$ are expected to fall in A, and the simulated probability will be very noisy.

Importance sampling: the integral can be expressed as
$$\int \frac{h(u)\, g(u)}{p(u)}\, p(u)\, du = \int w(u)\, p(u)\, du,$$
where the density function $p(u)$ is chosen such that (i) it is easy to draw from it, (ii) it has the same support as the original domain of integration, and (iii) the expression $hg/p$ is simple to evaluate, bounded and has finite variance. The Monte Carlo integral is
$$\frac{1}{S}\sum_{s=1}^S w(u_s),$$
with the $u_s$ now drawn from p. The function $w(u)$ determines the weight or importance of different points in the sample space. The importance sampler has variance
$$\frac{1}{S}\, V_p[w(u)],$$
which is small if $w(u)$ is relatively constant over u.

What does this mean for our frequency simulator? Instead of drawing u from the distribution g(u) and giving it a weight of one if $u \in A$, it is better to draw more u in regions where u is likely to be in A and to give the occurrence of $u \in A$ then a weight of less than one. Hence, if it is roughly known where the region A is, the density from which u is drawn should be larger in those areas.
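A sketch of the gain for a small set A, estimating Pr(U > 3) for U ~ N(0,1): the frequency simulator rarely hits A, whereas drawing from a normal shifted into A (the shift by 3 is an ad hoc choice of mine) and reweighting by w = g/p is far less noisy:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
S = 10000

u = rng.standard_normal(S)
p_freq = np.mean(u > 3)                   # frequency simulator: very noisy

v = rng.standard_normal(S) + 3.0          # draws from p = N(3, 1)
w = norm.pdf(v) / norm.pdf(v, loc=3.0)    # importance weights g(u)/p(u)
p_is = np.mean((v > 3) * w)               # importance-sampling estimate

print(p_freq, p_is, 1 - norm.cdf(3))      # exact value is about 0.00135
```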
For the multinomial probit model, the GHK simulator is widely used; it recursively truncates the multivariate normal pdf to avoid sampling in regions outside A. In addition, the GHK simulator is differentiable with respect to $\theta$.

Antithetic acceleration: so far we have assumed iid drawings $u_s \sim g(u)$. E.g. we have 2S iid drawings to simulate the integral:
$$\frac{1}{2S}\sum_{s=1}^{2S} h(u_s).$$
Suppose g(u) is symmetric with mean zero. For antithetic sampling we draw only S iid random numbers and use both $u_s$ and $-u_s$ in the simulator:
$$\frac{1}{S}\sum_{s=1}^S \frac{h(u_s) + h(-u_s)}{2}.$$
Such dependent draws may help to reduce the variance.

If $h(u_s)$ and $h(-u_s)$ are uncorrelated, the variances of the simulated integral with iid sampling and with antithetic sampling are identical, but we need only half the number of random draws for antithetic sampling. If $h(u_s)$ and $h(-u_s)$ are negatively correlated, the variance with antithetic sampling will be even smaller.
The extension to multivariate sampling can be illustrated for the bivariate case: $u_{1s}$ and $u_{2s}$ are drawn, and the four pairs $(u_{1s}, u_{2s})$, $(u_{1s}, -u_{2s})$, $(-u_{1s}, u_{2s})$ and $(-u_{1s}, -u_{2s})$ are used.
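A sketch comparing 2S iid draws with S antithetic pairs for a monotone integrand (my own example), where h(u) and h(−u) are negatively correlated:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
h = lambda u: norm.cdf(1.0 + u)     # estimate E[h(U)], U ~ N(0,1); h is monotone

R, S = 2000, 500
iid_est, anti_est = [], []
for _ in range(R):
    u = rng.standard_normal(2 * S)
    iid_est.append(h(u).mean())                      # 2S iid draws
    v = rng.standard_normal(S)
    anti_est.append(0.5 * (h(v) + h(-v)).mean())     # S draws, each used twice
print(np.var(iid_est), np.var(anti_est))             # antithetic variance is smaller
```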

Quasi-random numbers (e.g. Halton sequences) are another method to reduce the variance of the simulated integral. Quasi-random numbers are sequences of $u_s$ that provide a better coverage of the domain of the sampling distribution.

17.3 Bootstrapping

The exact finite-sample distributions of most estimators and test statistics are often difficult to calculate, and even if exact solutions exist, they obviously depend on the correctness of the imposed distributional assumptions. This problem is usually dealt with by examining their asymptotic properties, i.e. when the sample size increases to infinity. Many estimators and test statistics are asymptotically normally or $\chi^2$ distributed, regardless of the true population distribution of the data. It is well known, however, that the asymptotic normal and $\chi^2$ approximations can be very inaccurate in small samples. Size, power and coverage probabilities of confidence intervals based on the asymptotic approximations can be very different from the truth. One approach to deal with these problems is to use higher-order asymptotic approximations such as Edgeworth or saddlepoint expansions.

The conventional first-order approximation for an estimated parameter $\hat\theta$ usually takes the form
$$\Pr\left(\sqrt{N}\,\frac{\hat\theta - \theta_0}{\sigma} \leq z\right) = \Phi(z) + O\left(N^{-\frac{1}{2}}\right).$$
The term $O(N^{-1/2})$ becomes smaller and smaller as N goes to infinity, but for moderate N it may even be larger than the first term $\Phi(z)$. Edgeworth expansions extend the proof of the conventional central limit theorem to characterize the $O(N^{-1/2})$ term more precisely. Usually it can be expressed as some computable expression that vanishes at order $N^{-1/2}$ plus another $O(N^{-1})$ term, where the $N^{-1/2}$ term dominates the $O(N^{-1})$ term in large and moderate samples. These higher-order approximations are substantially more complex and seem to be rarely used, particularly since the advent of the bootstrap and faster computers.

The bootstrap is a resampling procedure which often provides a very convenient way of conducting inference, particularly for complex or multiple-step estimators. It is also based on an asymptotic approximation, but may under certain conditions provide so-called asymptotic refinements for asymptotically pivotal statistics. The bootstrap amounts to treating the data as if they were the population: it proceeds by considering the empirical distribution of the data as if it were the true population. Since the empirical distribution function is known, the exact finite-sample distribution of any statistic can be estimated (or simulated) with arbitrary accuracy via Monte Carlo simulation by repeatedly drawing samples from the empirical distribution of the data. Since the empirical distribution function converges to the true distribution function for $N \to \infty$, the bootstrap will (in most cases) converge to the true asymptotic distribution of the statistic. The bootstrap thereby provides an approach to replace analytic calculations with computation. This is particularly useful when the asymptotic distribution is difficult to work with algebraically.

(One should note that there are situations in which the bootstrap does not estimate the asymptotic distribution consistently. Manski's maximum score estimator of the coefficients of a binary response model is an example; another example, already mentioned, is the one-to-one matching estimator. Most of these estimators for which the bootstrap does not work contain some non-smooth or non-differentiable elements. In cases where the bootstrap does not estimate the asymptotic distribution consistently, subsampling methods can be used. With subsampling, the distribution of a statistic is estimated via Monte Carlo simulation by repeatedly drawing subsamples (i.e. of sample size smaller than N) from the original data set. The subsamples are of smaller sample size and can be drawn repeatedly with or without replacement. Subsampling provides estimates of asymptotic distributions that are consistent under very weak assumptions, but it is usually less precise than the bootstrap, provided the latter is consistent.)

The previous two sections referred to practical aspects of estimation, i.e. how to obtain an estimate $\hat\theta$ given a theoretical mapping from data to coefficients. In this section, inference on $\hat\theta$ is considered: how to perform hypothesis tests and obtain confidence intervals.
Bootstrapping is a type of Monte Carlo simulation using the empirical data available. It has two advantages:
1) Bootstrapping can be simpler or more convenient when the asymptotic distribution is difficult to compute.
2) In some cases, bootstrapping can lead to asymptotic refinements that give better approximations in finite samples.

Bootstrap methods are based on asymptotic theory and are only exact in infinitely large samples. Bootstrapping may not always work, e.g. for pair-matching estimators, nonsmooth estimators (quantile regression, nonparametric estimators) and non-iid data.

Consider an iid sample $\{W_1, \ldots, W_N\}$ and a smooth estimator $\hat\theta$ that is $\sqrt{N}$-consistent and asymptotically normally distributed. Let $\hat\theta$ in the following denote one specific element of the estimated parameter vector. The usual statistics of interest are
$$\hat\theta, \qquad \frac{\hat\theta - \theta_0}{s_{\hat\theta}}, \qquad s_{\hat\theta},$$
and we are interested in the distribution of $\hat\theta$ in samples of size N.

In a Monte Carlo analysis (where we know the population, i.e. the dgp), the distribution of $\hat\theta$ can be approximated by repeatedly drawing independent samples of size N from the population and computing $\hat\theta$ in each of them.
For bootstrapping, we consider the observed sample $\{W_1, \ldots, W_N\}$ as an approximation to the population and draw (with replacement) samples of size N from $\{W_1, \ldots, W_N\}$. For each bootstrap sample, $\hat\theta$ is computed. (This is the nonparametric bootstrap.) This provides us with an approximation to the distribution of $\hat\theta$ in samples of size N.
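A sketch of the nonparametric bootstrap for a hypothetical sample and estimator (the median here), including the bias and variance estimates discussed in the next subsection:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.exponential(size=200)          # hypothetical observed iid sample
theta_hat = np.median(W)               # the estimator of interest

B = 999
theta_b = np.empty(B)
for b in range(B):
    Wb = rng.choice(W, size=W.size, replace=True)   # resample with replacement
    theta_b[b] = np.median(Wb)

bias_hat = theta_b.mean() - theta_hat        # bootstrap estimate of the bias
theta_bc = 2 * theta_hat - theta_b.mean()    # bias-corrected estimator
se_boot = theta_b.std(ddof=1)                # bootstrap standard error
```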

17.3.1 Uses of the bootstrap

Suppose we obtain B bootstrap replications of
$$\hat\theta^b \quad \text{and} \quad s^b_{\hat\theta}.$$
To form the t-statistic, we consider the original sample as the approximation of the population, such that we replace $\theta_0$ by $\hat\theta$:
$$\frac{\hat\theta^b - \hat\theta}{s^b_{\hat\theta}}.$$
Similarly, we can construct any other test statistic by substituting the estimates obtained from the original sample for any population parameters.
It is important to ensure that any properties valid in the population are also valid in the original sample. This often requires some adjustment of the statistic. Example: GMM with overidentification. Consider an $L \times 1$ vector-valued function $m(W_i, \theta)$ with $E[m(W_i, \theta_0)] = 0$. Consider the case of overidentification, i.e. $L > \dim(\theta)$, such that we estimate $\theta$ by GMM, for example. Generally, $\frac{1}{N}\sum m(W_i, \hat\theta_{GMM}) \neq 0$, since with overidentification the sample moments cannot be set exactly to zero. This implies that
$$E[m(W_i, \theta_0)] = 0 \qquad \text{but} \qquad \hat E[m(W_i, \hat\theta)] \neq 0,$$
where $\hat E$ refers to the empirical distribution function. Hence, bootstrapping the overidentification test would not be appropriate without any adjustment (see e.g. Hahn 20??).
Bootstrap estimate of the bias: define the mean of the bootstrap estimates as
$$\bar\theta^B = \frac{1}{B}\sum_{b=1}^B \hat\theta^b.$$
If the estimator $\hat\theta$ is unbiased, we would expect $\bar\theta^B \approx \hat\theta$. (See below.) If the estimator is biased,
$$E\left[\hat\theta\right] - \theta_0 = b,$$
we could estimate the bias by
$$\hat b = \hat E\left[\hat\theta^b\right] - \hat\theta = \bar\theta^B - \hat\theta$$
to obtain the bootstrap bias-corrected estimator
$$\hat\theta_{BC} = \hat\theta - \hat b = 2\hat\theta - \bar\theta^B.$$

For typical $\sqrt{N}$-consistent estimators, the bias of $\hat\theta_{BC}$ is of lower order. However, the MSE of $\hat\theta_{BC}$ can be larger than that of $\hat\theta$, and therefore bias correction is hardly used for $\sqrt{N}$-consistent estimators. For estimators converging at a lower rate, bias correction is more widespread. In any case, though, it is useful to examine the bias and hope that it is close to zero.

Bootstrap estimate of the variance of the estimates:
$$s_{\hat\theta, Boot} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \left(\hat\theta^b - \bar\theta^B\right)^2}.$$
This bootstrap estimate can be very useful when the conventional asymptotic formula is difficult to compute, e.g. if the estimator is a two-step estimator, or with clustered data with many small clusters.
We could use this estimate of the variance to conduct tests on $\theta$ and produce confidence intervals. However, since the resulting statistic is not asymptotically pivotal, there will be no asymptotic refinement.

Bootstrap tests: consider tests on the coefficient $\theta$, e.g. the two-sided test of $H_0: \theta = \theta_0$ against $H_a: \theta \neq \theta_0$. The t-statistic $T_N = \frac{\hat\theta - \theta_0}{s_{\hat\theta}}$ is asymptotically pivotal and may therefore lead to an asymptotic refinement.
The bootstrap t-statistic is
$$t^b = \frac{\hat\theta^b - \hat\theta}{s^b_{\hat\theta}}.$$
The empirical distribution of the $t^b$ gives an approximation to the distribution of $T_N$.
Suppose B = 999, and let $t^{(1)}, \ldots, t^{(B)}$ be the ordered values of the statistic. The critical values for a two-sided test at the 5% level are $t^{(25)}$ and $t^{(975)}$. These critical values can now be used to obtain the 95% CI
$$\left(\hat\theta + t^{(25)}\, s_{\hat\theta}\;,\; \hat\theta + t^{(975)}\, s_{\hat\theta}\right).$$
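A sketch of this percentile-t interval for a sample mean, where $s_{\hat\theta}$ has a simple closed form within each replication (the data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.gamma(2.0, size=150)            # hypothetical skewed sample
N, B = W.size, 999
theta_hat = W.mean()
se = W.std(ddof=1) / np.sqrt(N)

t = np.empty(B)
for b in range(B):
    Wb = rng.choice(W, size=N, replace=True)
    t[b] = (Wb.mean() - theta_hat) / (Wb.std(ddof=1) / np.sqrt(N))
t.sort()
ci = (theta_hat + t[24] * se, theta_hat + t[974] * se)  # t^(25), t^(975) when B = 999
```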

We can also derive a bootstrap p-value. Suppose $t^{(974)} < T_N < t^{(975)}$. The p-value for an upper one-tailed alternative is then $1 - \frac{974}{B+1} = 0.026$.
For a symmetrical two-sided test, we can use the absolute values of the $t^b$ to make it a one-sided test. This gives a better approximation if the statistic really is symmetrically distributed. (For a symmetrical two-sided t-test or an asymptotic $\chi^2$ test, the true size of the test is $\alpha + O(N^{-2})$ instead of $\alpha + O(N^{-1})$ with conventional asymptotics.)

Notice that the statistic
$$\frac{\hat\theta - \theta_0}{s_{\hat\theta, Boot}}$$
does not provide this asymptotic refinement.

Nested bootstrap: to obtain the asymptotic refinement of the t-test, the bootstrap t-statistic
$$t^b = \frac{\hat\theta^b - \hat\theta}{s^b_{\hat\theta}}$$
needs to be computed in every replication. This requires an estimator of $s^b_{\hat\theta}$, which might sometimes be difficult to compute. Alternatively, a bootstrap within each bootstrap replication could be used to estimate $s^b_{\hat\theta}$, which also permits asymptotic refinements.

17.3.2 Consistency

Using asymptotic theory, one can show that bootstrapping can sometimes lead to asymptotic refinements.
Consider a statistic $T_N$ computed from the data (e.g. the estimate, a t-value, a Wald, LM or LR statistic). The exact finite-sample distribution of $T_N$ is $G_N$:
$$G_N(z; F_0) = \Pr(T_N \leq z),$$
where $F_0 = F(\cdot;\theta_0)$ is the cdf from which the iid data $W_i$ are drawn. We need to find a good approximation for $G_N$.

Conventional asymptotic theory is based on the properties of $T_\infty$ with distribution function $G_\infty(z; F_0)$. Since we do not know $\theta_0$, this function is approximated using $\hat\theta$:
$$G_\infty\left(z; F(\cdot;\hat\theta)\right).$$
From $G_\infty$ the finite-sample distribution is approximated. Example: standard asymptotic theory leads to $\sqrt{N}(\hat\theta - \theta_0) \overset{d}{\to} N(0, \sigma^2)$, thus
$$\hat G_N\left(z; F(\cdot;\hat\theta)\right) = \Pr\left(\sqrt{N}\,\frac{\hat\theta - \theta_0}{\sigma} \leq z\right) = \Phi(z) + O\left(N^{-\frac{1}{2}}\right),$$
where $O(N^{-1/2})$ is a remainder term of lower order.

The empirical (i.e. nonparametric) bootstrap approach instead simulates the distribution function
$$G_N\left(z; \hat F_N\right),$$
where $\hat F_N$ is a consistent nonparametric estimator of $F_0$, e.g. the empirical distribution function. With B bootstrap replications, the simulation of $G_N$ is
$$\hat G_N\left(z; \hat F_N\right) = \frac{1}{B}\sum_{b=1}^B 1\left(T_N^b \leq z\right) \underset{B\to\infty}{\longrightarrow} G_N\left(z; \hat F_N\right),$$
where $T_N^b$ is the value of $T_N$ in the b-th bootstrap sample.

Consistency of $\hat G_N(z; \hat F_N)$ for $G_N(z; F_0)$ thus requires the convergence
$$G_N\left(z; \hat F_N\right) \overset{p}{\to} G_N(z; F_0)$$
uniformly over all values z and all $F_0$ in the space of permitted cdfs. This requires:
- $\hat F_N$ to be consistent for $F_0$;
- $F(a;\theta_0)$ to be smooth in a, such that $\hat F_N$ and $F_0$ are uniformly close;
- $G_N(z; F)$ to be smooth with respect to F, such that $G_N(z; \tilde F)$ and $G_N(z; F)$ are close when $\tilde F$ and F are close.

Horowitz (2001) contains a more formal discussion and proofs. For smooth objective functions, the bootstrap leads to consistent estimates of $G_N$ in a wide range of settings.

17.3.3 Asymptotic refinement


Suppose the statistic $T_N$ has a limit normal distribution under conventional $\sqrt{N}$ asymptotics; for example $T_N = \sqrt{N}\,\frac{\hat\theta - \theta_0}{\hat\sigma}$ with limit distribution N(0,1). The conventional asymptotic approximation is then
$$G_N(z; F_0) = G_\infty(z; F_0) + O\left(N^{-\frac{1}{2}}\right),$$
hence the approximation error is of order $N^{-1/2}$. For example,
$$\Pr\left(\sqrt{N}\,\frac{\hat\theta - \theta_0}{\hat\sigma} \leq z\right) = \Phi(z) + O\left(N^{-\frac{1}{2}}\right).$$

This result is derived using a conventional CLT. An Edgeworth expansion also derives the expressions for the lower-order terms to obtain
$$G_N(z; F_0) = G_\infty(z; F_0) + \frac{g_1(z; F_0)}{\sqrt{N}} + \frac{g_2(z; F_0)}{N} + O\left(N^{-\frac{3}{2}}\right).$$
E.g.
$$\Pr\left(\sqrt{N}\,\frac{\hat\theta - \theta_0}{\hat\sigma} \leq z\right) = \Phi(z) + \frac{g_1(z)}{\sqrt{N}} + \frac{g_2(z)}{N} + O\left(N^{-\frac{3}{2}}\right),$$
where $g_1(a) = -(a^2 - 1)\,\phi(a)\,\kappa_3/6$, with $\kappa_3$ the third cumulant of $T_N$, and $g_2(\cdot)$ is given by a complicated formula.

This Edgeworth expansion characterizes the distribution of $\hat\theta$ more precisely and can help to compare two different inference procedures that are first-order equivalent. In principle, we could use the Edgeworth expansion directly to approximate the distribution of $\hat\theta$, but the $g_1$ and $g_2$ functions can be complicated. (This approach is sometimes used when first-order asymptotic theory does not perform very well, e.g. for weak-IV estimators.)

Now consider the bootstrap estimator based on $\hat F_N$:
$$G_N\left(z; \hat F_N\right) = G_\infty\left(z; \hat F_N\right) + \frac{g_1(z; \hat F_N)}{\sqrt{N}} + \frac{g_2(z; \hat F_N)}{N} + O\left(N^{-\frac{3}{2}}\right).$$
Subtracting the expansion of the exact finite-sample distribution $G_N(z; F_0)$ from the bootstrap distribution $G_N(z; \hat F_N)$, we get
$$G_N\left(z; \hat F_N\right) - G_N(z; F_0) = G_\infty\left(z; \hat F_N\right) - G_\infty(z; F_0) + \frac{g_1(z; \hat F_N) - g_1(z; F_0)}{\sqrt{N}} + O\left(N^{-1}\right).$$
Suppose that $\hat F_N$ is $\sqrt{N}$-consistent for $F_0$, such that
$$\hat F_N - F_0 = O\left(N^{-\frac{1}{2}}\right),$$
hence, if $G_\infty$ is continuous,
$$G_\infty\left(z; \hat F_N\right) - G_\infty(z; F_0) = O\left(N^{-\frac{1}{2}}\right).$$
This implies that the bootstrap approximation error is of order $N^{-1/2}$, i.e.
$$G_N\left(z; \hat F_N\right) - G_N(z; F_0) = O\left(N^{-\frac{1}{2}}\right),$$
and is thus of the same order as the conventional asymptotic approximation approach (i.e. using the limit $\Phi(\cdot)$ distribution).

However, if the limit distribution does not depend on the function F, in the sense that $G_\infty(z; F(\cdot;\theta_0)) = G_\infty(z; F(\cdot;\tilde\theta))$ for any $\tilde\theta$, the bootstrap approximation error simplifies to
$$G_N\left(z; \hat F_N\right) - G_N(z; F_0) = \underbrace{G_\infty\left(z; \hat F_N\right) - G_\infty(z; F_0)}_{=0} + \frac{g_1(z; \hat F_N) - g_1(z; F_0)}{\sqrt{N}} + O\left(N^{-1}\right) = O\left(N^{-1}\right),$$
because
$$g_1(z; \hat F_N) - g_1(z; F_0) = O\left(N^{-\frac{1}{2}}\right),$$
provided that $g_1$ is continuous in F!

Hence, for an asymptotically pivotal statistic (under $H_0$), the finite-sample approximation error is reduced from $O(N^{-1/2})$, when using conventional asymptotics, to $O(N^{-1})$ by bootstrapping.
A similar result can be obtained for asymptotically $\chi^2$ statistics.
This derivation required continuity in G and F and needs to be modified if there is a discontinuity (e.g. boundary constraints on $\theta_0$) or non-iid errors.

17.3.4 Bootstrap sampling methods and number of replications

Empirical bootstrap = nonparametric bootstrap = paired bootstrap: resample from $\{W_1, \ldots, W_N\}$ with replacement. Write the sample as
$$\{(Y_i, X_i)\}_{i=1}^N.$$

Parametric bootstrap: suppose the cdf of the data is specified as
$$Y|X \sim F_{Y|X}(\cdot;\theta_0)$$
and $\hat\theta$ is a consistent estimate. We can form various bootstrap samples by
a) keeping $X_1, \ldots, X_N$ fixed and generating $Y_i^b$ via random draws from $F_{Y|X}(\cdot|X_i;\hat\theta)$;
b) first resampling the $X_i^b$ and then generating $Y_i^b$ via random draws from $F_{Y|X}(\cdot|X_i^b;\hat\theta)$.

Residual bootstrap: if the regression model contains an additive iid error, e.g.
$$Y_i = g(X_i;\theta) + U_i,$$
where the error term does not depend on unknown parameters, we can obtain the fitted residuals $\hat U_1, \ldots, \hat U_N$ and draw bootstrap samples $\hat U_1^b, \ldots, \hat U_N^b$. These are then used to generate
$$Y_i^b = g\left(X_i;\hat\theta\right) + \hat U_i^b.$$
The nonparametric bootstrap relies on the weakest assumptions. But the other sampling schemes generally provide better approximations and would be more appropriate if we believe in the parametric specification.

How many bootstrap replications should we use? Cameron and Trivedi (2005) discuss a number of recommendations from various articles. Generally, we need fewer replications if we are only interested in a bootstrap estimate of the variance, and more if we want to perform a t-test at size 5% or 1% (since we then need precise estimates in the tails). It seems that B = 999 is a good choice.
If numerical optimization is used in every bootstrap replication, the estimates $\hat\theta$ from the original sample often provide good starting values.
Much more on this topic can be found in Cameron and Trivedi (2005) and Horowitz (2001).

18 Duration models

....

19 Further topics

- identification and bounds
- differences-in-differences
- dynamic programme evaluation (van den Berg)
- general equilibrium effects

Estimation:
- simulation methods (MCMC, Markov chain Monte Carlo)
- empirical likelihood

Classical asymptotic theory:
- parametric estimators: repetition of the CLT
- nonparametric estimators
- semiparametric estimators: semiparametric efficiency bounds

Test theory: @@@ Dufour (Canadian Journal of Economics) contains a very nice discussion of the power of tests etc.

References
Abadie, A., J. Angrist, and G. Imbens (2002): "Instrumental Variables Estimates of the Effect of Subsidized Training on the Quantiles of Trainee Earnings," Econometrica, 70, 91-117.

Abadie, A., and G. Imbens (2006): "Large Sample Properties of Matching Estimators for Average Treatment Effects," Econometrica, 74, 235-267.

Acemoglu, D., S. Johnson, and J. Robinson (2001): "The colonial origins of comparative development: an empirical investigation," American Economic Review, 91, 1369-1401.

Aitchison, J., and C. Aitken (1976): "Multivariate binary discrimination by the kernel method," Biometrika, 63, 413-420.

Akaike, H. (1970): "Statistical Predictor Information," Annals of the Institute of Statistical Mathematics, 22, 203-217.

Amemiya, T. (1985): Advanced Econometrics. Harvard University Press, Cambridge, Mass.

Angrist, J. (1990): "Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence From Social Security Administrative Records," American Economic Review, 80, 313-336.

(1998): "Estimating Labour Market Impact of Voluntary Military Service using Social Security Data," Econometrica, 66, 249-288.

Angrist, J., V. Chernozhukov, and I. Fernández-Val (2006): "Quantile Regression under Misspecification, with an Application to the U.S. Wage Structure," Econometrica, 74, 539-563.

Angrist, J., and W. Evans (1998): "Children and their parents' labor supply: Evidence from exogenous variation in family size," American Economic Review, 88, 450-477.

Angrist, J., and G. Imbens (1995): "Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity," Journal of American Statistical Association, 90, 431-442.

Angrist, J., G. Imbens, and D. Rubin (1996): "Identification of Causal Effects using Instrumental Variables," Journal of American Statistical Association, 91, 444-472 (with discussion).

Angrist, J., and A. Krueger (1991): "Does Compulsory School Attendance Affect Schooling and Earnings?," Quarterly Journal of Economics, 106, 979-1014.

Angrist, J., and V. Lavy (1999): "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement," Quarterly Journal of Economics, 114, 533-575.

Athey, S., and G. Imbens (2006): "Identification and inference in nonlinear difference-in-differences models," Econometrica, 74, 431-497.

Bahadur, R. (1966): "A note on quantiles in large samples," Annals of Mathematical Statistics, 37, 577-580.

Battistin, E., and E. Rettore (2002): "Testing for programme effects in a regression discontinuity design with imperfect compliance," Journal of Royal Statistical Society Series A, 165, 39-57.

(2008): "Ineligibles and eligible non-participants as a double comparison group in regression-discontinuity designs," Journal of Econometrics, 142, 715-730.

Becker, S., and L. Wößmann (2007): "Was Weber Wrong? A Human Capital Theory of Protestant Economic History," Discussion paper 2007-07, Universität München.

Beegle, K., R. Dehejia, and R. Gatti (2006): "Child labor and agricultural shocks," Journal of Development Economics, 81, 80-96.

Begun, J., W. Hall, W. Huang, and J. Wellner (1983): "Information and Asymptotic Efficiency in Parametric-Nonparametric Models," Annals of Statistics, 11, 432-452.

Bergemann, A., B. Fitzenberger, B. Schultz, and S. Speckesser (2000): "Multiple Active Labor Market Policy Participation in East Germany: An Assessment of Outcomes," Konjunkturpolitik, 51, 195-244.

Bergemann, A., B. Fitzenberger, and S. Speckesser (2001): "Evaluating the Employment Effects of Public Sector Sponsored Training in East Germany: Conditional Difference-in-Differences and Ashenfelter's Dip," mimeo, University of Mannheim.

Bickel, P., C. Klaassen, Y. Ritov, and J. Wellner (1993): Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.

Black, D., and J. Smith (2004): "How Robust is the Evidence on the Effects of College Quality? Evidence from Matching," Journal of Econometrics, 121, 99-124.

Black, S. (1999): "Do 'Better' Schools Matter? Parental Valuation of Elementary Education," Quarterly Journal of Economics, 114, 577-599.

Blundell, R., and J. Powell (2003): "Endogeneity in Nonparametric and Semiparametric Regression Models," in Advances in Economics and Econometrics, ed. by M. Dewatripont, L. Hansen, and S. Turnovsky, pp. 312-357. Cambridge University Press, Cambridge.

Breiman, L., and J. Friedman (1985): "Estimating optimal transformations for multiple regression and correlation," Journal of American Statistical Association, 80, 580-619.

Buchinsky, M. (1994): "Changes in the U.S. Wage Structure 1963-1987: Application of Quantile Regression," Econometrica, 62, 405-458.

(1998): "Recent advances in quantile regression models," Journal of Human Resources, 33, 88-126.

Cameron, C., and P. Trivedi (2005): Microeconometrics: Methods and Applications. Cambridge University Press, Cambridge.

Card, D. (1995): "Using Geographic Variation in College Proximity to Estimate the Return to Schooling," in Aspects of Labor Market Behaviour: Essays in Honour of John Vanderkamp, ed. by L. Christofides, E. Grant, and R. Swidinsky, pp. 201-222. University of Toronto Press, Toronto.

Carroll, R., D. Ruppert, and A. Welsh (1998): "Local Estimating Equations," Journal of American Statistical Association, 93, 214-227.

Chamberlain, G. (1994): "Quantile regression, censoring and the structure of wages," in Advances in Econometrics, ed. by C. Sims. Elsevier, Amsterdam.

Chaudhuri, P. (1991): "Global nonparametric estimation of conditional quantile functions and their derivatives," Journal of Multivariate Analysis, 39, 246-269.

Chen, X., O. Linton, and I. van Keilegom (2003): "Estimation of semiparametric models when the criterion function is not smooth," Econometrica, 71, 1591-1608.

Chernozhukov, V., I. Fernández-Val, and B. Melly (2007): "Inference on Counterfactual Distributions," mimeo.

Chernozhukov, V., and C. Hansen (2005): "An IV Model of Quantile Treatment Effects," Econometrica, 73, 245-261.

(2006): "Instrumental quantile regression inference for structural and treatment effect models," Journal of Econometrics, 132, 491-525.

(2007): "Instrumental quantile regression: A robust inference approach," Journal of Econometrics, forthcoming.

Chernozhukov, V., G. Imbens, and W. Newey (2007): "Instrumental variable estimation of nonseparable models," Journal of Econometrics, 139, 4-14.

Chesher, A. (2003): "Identification in nonseparable models," Econometrica, 71, 1405-1441.

(2005): "Nonparametric identification under discrete variation," Econometrica, 73, 1525-1550.

(2007a): "Endogeneity and discrete outcomes," mimeo.

(2007b): "Instrumental values," Journal of Econometrics, 139, 15-34.

Collier, P., and A. Höffler (2002): "On the incidence of civil war in Africa," Journal of Conflict Resolution, 46, 13-28.

DiNardo, J., and D. Lee (2004): "Economic impacts of new unionization on private sector employers: 1984-2001," Quarterly Journal of Economics, 119, 1383-1441.

Eichler, M., and M. Lechner (2002): "An Evaluation of Public Employment Programmes in the East German State of Sachsen-Anhalt," Labour Economics, 9, 143-186.

Engel, E. (1857): "Die Produktions- und Konsumtionsverhältnisse des Königreichs Sachsen," Zeitschrift des statistischen Büros des Königlich Sächsischen Ministeriums des Inneren, 8, 1-54.

Falk, A. (2007): "Gift Exchange in the Field," Econometrica, 75, 1501-1511.

Fan, J. (1992): "Design-adaptive Nonparametric Regression," Journal of American Statistical Association, 87, 998-1004.

(1993): "Local Linear Regression Smoothers and their Minimax Efficiency," Annals of Statistics, 21, 196-216.

(2000): "Prospects of Nonparametric Modeling," Journal of American Statistical Association, 95, 1296-1300.

Fan, J., M. Farmen, and I. Gijbels (1998): "Local Maximum Likelihood Estimation and Inference," Journal of the Royal Statistical Society, Series B, 60, 591-608.

Fan, J., T. Gasser, I. Gijbels, M. Brockmann, and J. Engel (1997): "Local Polynomial Regression: Optimal Kernels and Asymptotic Minimax Efficiency," Annals of the Institute of Statistical Mathematics, 49, 79-99.

Fan, J., and I. Gijbels (1995): "Data-driven Bandwidth Selection in Local Polynomial Fitting: Variable Bandwidth and Spatial Adaptation," Journal of the Royal Statistical Society, Series B, 57, 371-394.

(1996): Local Polynomial Modeling and its Applications. Chapman and Hall, London.

Fan, J., P. Hall, M. Martin, and P. Patil (1996): "On Local Smoothing of Nonparametric Curve Estimators," Journal of American Statistical Association, 91, 258-266.

Fechner, G. (1860): Elemente der Psychophysik. Breitkopf und Härtel, Leipzig.

Firpo, S. (2007): "Efficient Semiparametric Estimation of Quantile Treatment Effects," Econometrica, 75, 259-276.

Flossmann, A. (2007): "Empirical Bias Bandwidth Choice for Local Polynomial Matching Estimators," Working paper, University of Konstanz.

Fredriksson, P., and P. Johansson (2003): "Program Evaluation and Random Program Starts," IFAU Discussion Paper 2003:1.

Frölich, M. (2004): "Finite Sample Properties of Propensity-Score Matching and Weighting Estimators," The Review of Economics and Statistics, 86, 77-90.

(2005): "Matching Estimators and Optimal Bandwidth Choice," Statistics and Computing, 15/3, 197-215.

(2006a): "Nonparametric regression for binary dependent variables," Econometrics Journal, 9, 511-540.

(2006b): "A Note on Parametric and Nonparametric Regression in the Presence of Endogenous Control Variables," IZA Discussion Paper, 2126.

(2007a): "Nonparametric IV Estimation of Local Average Treatment Effects with Covariates," Journal of Econometrics, 139, 35-75.

(2007b): "Propensity score matching without conditional independence assumption - with an application to the gender wage gap in the UK," Econometrics Journal.

(2007c): "Regression discontinuity design with covariates," IZA Discussion Paper, 3024.

(2007d): "Statistical treatment choice: an application to active labour market programmes," forthcoming in Journal of American Statistical Association.

Frölich, M., and M. Lechner (2006): "Exploiting regional treatment intensity for the evaluation of labour market policies," IZA Discussion Paper, 2144.

Frölich, M., and B. Melly (2007): "Nonparametric quantile treatment effects under endogeneity," mimeo.

Frölich, M., and R. Vazquez-Alvarez (2007): "HIV/AIDS Knowledge and behaviour: Have information campaigns reduced HIV infection? The case of Kenya," mimeo.

Gerfin, M., and M. Lechner (2002): "Microeconometric Evaluation of the Active Labour Market Policy in Switzerland," Economic Journal, 112, 854-893.

Gerfin, M., M. Lechner, and H. Steiger (2005): "Does subsidised temporary employment get the unemployed back to work? An econometric analysis of two different schemes," Labour Economics, 12, 807-835.

Gosling, A., S. Machin, and C. Meghir (2000): “The Changing Distribution of Male Wages in the U.K.,” Review of Economic Studies, 67, 635–666.

Gozalo, P., and O. Linton (2000): “Local Nonlinear Least Squares: Using parametric information in nonparametric regression,” Journal of Econometrics, 99, 63–106.

Gutenbrunner, C., and J. Jurečková (1992): “Regression Quantile and Regression Rank Score Process in the Linear Model and Derived Statistics,” Annals of Statistics, 20, 305–330.

Hahn, J. (1998): “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects,” Econometrica, 66, 315–331.

Hahn, J., P. Todd, and W. van der Klaauw (1999): “Evaluating the Effect of an Antidiscrimination Law Using a Regression-Discontinuity Design,” NBER Working Paper, 7131.

(2001): “Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design,” Econometrica, 69, 201–209.

Hall, P., R. C. L. Wolff, and Q. Yao (1999): “Methods for estimating a conditional distribution function,” Journal of American Statistical Association, 94(445), 154–163.

Härdle, W. (1991): Applied Nonparametric Regression. Cambridge University Press, Cambridge.

Härdle, W., and S. Marron (1987): “Optimal Bandwidth Selection in Nonparametric Regression Function Estimation,” Annals of Statistics, 13, 1465–1481.

Härdle, W., and T. Stoker (1989): “Investigating smooth multiple regression by the method of average derivatives,” Journal of American Statistical Association, 84, 986–995.

Hastie, T., and C. Loader (1992): “Local Regression: Automatic Kernel Carpentry,” Statistical Science, 8, 120–143.

Hastie, T., and R. Tibshirani (1990): Generalized Additive Models. Chapman and Hall, London.

Hearst, N., T. Newman, and S. Hulley (1986): “Delayed Effects of the Military Draft on Mortality: A Randomized Natural Experiment,” New England Journal of Medicine, 314, 620–624.

Heckman, J., H. Ichimura, J. Smith, and P. Todd (1998): “Characterizing Selection Bias Using Experimental Data,” Econometrica, 66, 1017–1098.

Heckman, J., H. Ichimura, and P. Todd (1997): “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme,” Review of Economic Studies, 64, 605–654.

(1998): “Matching as an Econometric Evaluation Estimator,” Review of Economic Studies, 65, 261–294.

Heckman, J., and J. Smith (1995): “Assessing the Case for Social Experiments,” Journal of Economic Perspectives, 9, 85–110.

Heckman, J., and E. Vytlacil (1999): “Local Instrumental Variables and Latent Variable Models for Identifying and Bounding Treatment Effects,” Proceedings of the National Academy of Sciences USA, Economic Sciences, 96, 4730–4734.

(2005): “Structural Equations, Treatment Effects, and Econometric Policy Evaluation,” Econometrica, 73, 669–738.

Henderson, D., D. Millimet, C. Parmeter, and L. Wang (2008): “Fertility and the health of children: a nonparametric investigation,” in Advances in Econometrics, Volume 21, Modelling and evaluating treatment effects in econometrics, ed. by D. Millimet, J. Smith, and E. Vytlacil, pp. 169–197. x, x.

Henley, A., G. Arabsheibani, and F. Carneiro (2007): “On defining and measuring the informal sector: evidence from Brazil,” mimeo, University of Wales.

Hirano, K., G. Imbens, and G. Ridder (2003): “Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score,” Econometrica, 71, 1161–1189.

Hoderlein, S., and E. Mammen (2007): “Identification of marginal effects in nonseparable models without monotonicity,” Econometrica, 75, 1513–1518.

Horowitz, J., and S. Lee (2007): “Nonparametric instrumental variables estimation of a quantile regression model,” Econometrica, 75, 1191–1208.

Ichimura, H. (1993): “Semiparametric least squares (SLS) and weighted SLS estimation of single-index models,” Journal of Econometrics, 58, 71–120.

Ichimura, H., and S. Lee (2006): “Characterization of the asymptotic distribution of semiparametric M-estimators,” mimeo, University of Tokyo.

Ichimura, H., and O. Linton (2002): “Asymptotic Expansions for Some Semiparametric Program Evaluation Estimators,” mimeo, University College London.

Ichimura, H., and P. Todd (2008): “Implementing Nonparametric and Semiparametric Estimators,” forthcoming in Handbook of Econometrics. North Holland, Amsterdam.

Imbens, G. (2000): “The Role of the Propensity Score in Estimating Dose-Response Functions,” Biometrika, 87, 706–710.

(2001): “Some remarks on instrumental variables,” in Econometric Evaluation of Labour Market Policies, ed. by M. Lechner and F. Pfeiffer, pp. 17–42. Physica/Springer, Heidelberg.

(2004): “Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review,” The Review of Economics and Statistics, 86, 4–29.

Imbens, G., and J. Angrist (1994): “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62, 467–475.

Imbens, G., and T. Lancaster (1996): “Efficient estimation and stratified sampling,” Journal of Econometrics, 74, 289–318.

Imbens, G., and W. Newey (2003): “Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity,” presented at the EC2 Conference, London, December 2003.

Imbens, G., and G. Ridder (2006): “Estimation and inference for generalized partial means,” unpublished.

Imbens, G., D. Rubin, and B. Sacerdote (2001): “Estimating the effect of unearned income on labor earnings, savings, and consumption: Evidence from a survey of lottery players,” American Economic Review, 91, 778–794.

Imbens, G. W., and T. Lemieux (2008): “Regression discontinuity designs: A guide to practice,” Journal of Econometrics, 142, 615–635.

Judd, K. (1998): Numerical Methods in Economics. MIT Press, Cambridge.

Judge, G., R. Hill, W. Griffiths, H. Lütkepohl, and T.-S. Lee (1982): Introduction to the Theory and Practice of Econometrics. Wiley, New York, 2nd edn.

Kiefer, J. (1967): “On Bahadur’s representation of sample quantiles,” Annals of Mathematical Statistics, 38, 1323–1342.

Klein, R., and R. Spady (1993): “An Efficient Semiparametric Estimator for Binary Response Models,” Econometrica, 61, 387–421.

Koenker, R. (2005): Quantile Regression. Cambridge University Press, Cambridge.

Koenker, R., and G. Bassett (1978): “Regression Quantiles,” Econometrica, 46, 33–50.

Koenker, R., and Z. Xiao (2002): “Inference on the Quantile Regression Process,” Econometrica, 70, 1583–1612.

Koshevnik, Y., and B. Levit (1976): “On a Non-parametric Analogue of the Information Matrix,” Theory of Probability and Applications, 21, 738–753.

Lalive, R. (2008): “How do extended benefits affect unemployment duration? A regression discontinuity approach,” Journal of Econometrics, 142, 785–806.

Lalive, R., J. Wüllrich, and J. Zweimüller (2008): “Do financial incentives for firms promote employment of disabled workers: a regression discontinuity approach,” mimeo, University of Lausanne, x, x.

Lechner, M. (1999): “Earnings and Employment Effects of Continuous Off-the-Job Training in East Germany after Unification,” Journal of Business and Economic Statistics, 17, 74–90.

(2001): “Identification and Estimation of Causal Effects of Multiple Treatments under the Conditional Independence Assumption,” in Econometric Evaluation of Labour Market Policies, ed. by M. Lechner and F. Pfeiffer, pp. 43–58. Physica/Springer, Heidelberg.

(2002a): “Program Heterogeneity and Propensity Score Matching: An Application to the Evaluation of Active Labor Market Policies,” The Review of Economics and Statistics, 84, 205–220.

(2002b): “Some Practical Issues in the Evaluation of Heterogeneous Labour Market Programmes by Matching Methods,” Journal of the Royal Statistical Society, Series A, 165, 59–82.

(2004): “Sequential Matching Estimation of Dynamic Causal Models,” Universität St. Gallen Discussion Paper, 2004-xx.

(2006): “The relation of different concepts of causality in econometrics,” Universität St. Gallen Discussion Paper, 2006-15.

(2008a): “Matching estimation of dynamic treatment models: Some practical issues,” forthcoming in Advances in Econometrics, Volume 21, Modelling and evaluating treatment effects in econometrics, ed. by D. Millimet, J. Smith, and E. Vytlacil, pp. x–x. x, x.

(2008b): “A note on endogenous control variables in evaluation studies,” forthcoming in Statistics and Probability Letters, 2008.

(2008c): “Sequential causal models for the evaluation of labor market programs,” Journal of Business and Economic Statistics, 2008.

Lechner, M., and R. Miquel (2001): “A potential outcome approach to dynamic programme evaluation: nonparametric identification,” Universität St. Gallen Discussion Paper, 2001-07.

(2005): “Identification of the effects of dynamic treatments by sequential conditional independence assumptions,” Universität St. Gallen Discussion Paper, 2005-17.

Lechner, M., R. Miquel, and C. Wunsch (2004): “Long-run effects of public sector sponsored training in West Germany,” Universität St. Gallen Discussion Paper, 2004-19.

Lee, D. (2008): “Randomized experiments from non-random selection in U.S. House elections,” Journal of Econometrics, 142, 675–697.

Lee, D., and D. Card (2008): “Regression discontinuity inference with specification error,” Journal of Econometrics, 142, 655–674.

Leuven, E., M. Lindahl, H. Oosterbeek, and D. Webbink (2007): “The effect of extra funding for disadvantaged pupils on achievement,” Review of Economics and Statistics, 89, 721–736.

Li, Q., and J. Racine (2007): Nonparametric Econometrics - Theory and Practice. Princeton University Press, Princeton.

Loader, C. (1999): “Bandwidth Selection: Classical or Plug-In?,” Annals of Statistics, 27, 415–438.

Machado, J., and J. Mata (2005): “Counterfactual decomposition of changes in wage distributions using quantile regression,” Journal of Applied Econometrics, 20, 445–465.

Manning, W., L. Blumberg, and L. Moulton (1995): “The demand for alcohol: the differential response to price,” Journal of Health Economics, 14, 123–148.

Manski, C. (1997): “Monotone Treatment Response,” Econometrica, 65, 1311–1334.

Manski, C., and S. Lerman (1977): “The Estimation of Choice Probabilities from Choice-Based Samples,” Econometrica, 45, 1977–1988.

Masry, E. (1996): “Multivariate local polynomial regression for time series: uniform strong consistency and rates,” Journal of Time Series Analysis, 17, 571–599.

McCrary, J. (2008): “Manipulation of the running variable in the regression discontinuity design: A density test,” Journal of Econometrics, 142, 698–714.

Melly, B. (2006): “Estimation of counterfactual distribution using quantile regression,” mimeo.

Miguel, E., S. Satyanath, and E. Sergenti (2004): “Economic Shocks and Civil Conflict: An Instrumental Variables Approach,” Journal of Political Economy, 112, 725–753.

Mittelhammer, R., G. Judge, and D. Miller (2000): Econometric Foundations. Cambridge University Press, Cambridge.

Murphy, S. (2003): “Optimal dynamic treatment regimes,” Journal of the Royal Statistical Society, Series B, 65, 331–366.

Nadaraya, E. (1965): “On Nonparametric Estimates of Density Functions and Regression Curves,” Theory of Applied Probability, 10, 186–190.

Neumeyer, N. (2007): “A note on uniform consistency of monotone function estimators,” Statistics and Probability Letters, 77, 693–703.

Newey, W. (1984): “A method of moments interpretation of sequential estimators,” Economics Letters, 14, 201–206.

(1990): “Semiparametric Efficiency Bounds,” Journal of Applied Econometrics, 5, 99–135.

(1994a): “The Asymptotic Variance of Semiparametric Estimators,” Econometrica, 62, 1349–1382.

(1994b): “Kernel estimation of partial means and a general variance estimator,” Econometric Theory, 10, 233–253.

(2004): “Efficient semiparametric estimation via moment restrictions,” Econometrica, 72, 1877–1897.

Pagan, A., and A. Ullah (1999): Nonparametric Econometrics. Cambridge University Press, Cambridge.

Pearl, J. (2000): Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge.

Pfanzagl, J., and W. Wefelmeyer (1982): Contributions to a General Asymptotic Statistical Theory. Springer Verlag, Heidelberg.

Powell, J. (1986): “Censored Regression Quantiles,” Journal of Econometrics, 32, 143–155.

Press, W., B. Flannery, S. Teukolsky, and W. Vetterling (1986): Numerical Recipes. Cambridge University Press, Cambridge.

Puhani, P., and A. Weber (2007): “Does the early bird catch the worm? Instrumental variable estimates of early educational effects of age of school entry in Germany,” Empirical Economics, 32, 359–386.

Racine, J., and Q. Li (2004): “Nonparametric Estimation of Regression Functions with Both Categorical and Continuous Data,” Journal of Econometrics, 119, 99–130.

Reinsch, C. (1967): “Smoothing by spline functions,” Numerische Mathematik, 16, 177–183.

Rice, J. (1984): “Bandwidth Choice for Nonparametric Regression,” Annals of Statistics, 12, 1215–1230.

Robinson, P. (1988): “Root-N consistent semiparametric regression,” Econometrica, 56, 931–954.

Rosenbaum, P. (1984): “The consequences of adjustment for a concomitant variable that has been affected by the treatment,” Journal of the Royal Statistical Society, Series A, 147, 656–666.

Roy, A. (1951): “Some Thoughts on the Distribution of Earnings,” Oxford Economic Papers, 3, 135–146.

Rubin, D. (1974): “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688–701.

(2004): “Direct and indirect causal effects via potential outcomes,” Scandinavian Journal of Statistics, 31, 161–170.

(2005): “Causal inference using potential outcomes: Design, Modeling, Decisions,” Journal of American Statistical Association, 100, 322–331.

Ruppert, D. (1997): “Empirical-Bias Bandwidths for Local Polynomial Nonparametric Regression and Density Estimation,” Journal of American Statistical Association, 92, 1049–1062.

Ruppert, D., S. Sheather, and M. Wand (1995): “An Effective Bandwidth Selector for Local Least Squares Regression,” Journal of American Statistical Association, 90, 1257–1270.

Ruppert, D., and M. Wand (1994): “Multivariate Locally Weighted Least Squares Regression,” Annals of Statistics, 22, 1346–1370.

Schucany, W. (1995): “Adaptive Bandwidth Choice for Kernel Regression,” Journal of American Statistical Association, 90, 535–540.

Seifert, B., and T. Gasser (1996): “Finite-Sample Variance of Local Polynomials: Analysis and Solutions,” Journal of American Statistical Association, 91, 267–275.

(2000): “Data Adaptive Ridging in Local Polynomial Regression,” Journal of Computational and Graphical Statistics, 9, 338–360.

Shibata, R. (1981): “An Optimal Selection of Regression Variables,” Biometrika, 68, 45–54.

Sianesi, B. (2004): “An Evaluation of the Swedish System of Active Labor Market Programs in the 1990s,” The Review of Economics and Statistics, 86, 133–155.

Silverman, B. (1986): Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.

Smith, J., and P. Todd (2005): “Does matching overcome LaLonde’s critique of nonexperimental estimators?,” Journal of Econometrics, 125, 305–353.

Staniswalis, J. (1989): “The kernel estimate of a regression function in likelihood-based models,” Journal of American Statistical Association, 84, 276–283.

Stein, C. (1956): “Efficient Nonparametric Testing and Estimation,” in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, Berkeley.

(1981): “Estimation of the Mean of a Multivariate Normal Distribution,” Annals of Statistics, 9, 1135–1151.

Stone, C. (1974): “Cross-validatory Choice and Assessment of Statistical Predictions,” Journal of the Royal Statistical Society, Series B, 36, 111–147 (with discussion).

(1980): “Optimal rates of convergence of nonparametric estimators,” Annals of Statistics, 8, 1348–1360.

(1982): “Optimal Global Rates of Convergence for Nonparametric Regression,” Annals of Statistics, 10, 1040–1053.

Thistlethwaite, D., and D. Campbell (1960): “Regression-discontinuity analysis: An alternative to the ex post facto experiment,” Journal of Educational Psychology, 51, 309–317.

Trochim, W. (1984): Research Design for Program Evaluation: The Regression-Discontinuity Approach. Sage Publications, Beverly Hills.

van der Klaauw, W. (2002): “Estimating the Effect of Financial Aid Offers on College Enrollment: A Regression-Discontinuity Approach,” International Economic Review, 43, 1249–1287.

Wald, A. (1940): “The Fitting of Straight Lines if Both Variables are Subject to Error,” Annals of Mathematical Statistics, 11, 284–300.

Watson, G. (1964): “Smooth regression analysis,” Sankhya, 26:15, 175–184.

Wooldridge, J. (1999): “Asymptotic Properties of Weighted M-Estimators for Variable Probability Samples,” Econometrica, 67, 1385–1406.

(2002): Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge.

Note: the following material is still to be incorporated into Chapter 2.2.4 on asymptotic properties.

A remarkable result of Hahn (1998) is that a projection on the propensity score (i.e. matching on the propensity score) does not change the variance bound and that knowledge of the true propensity score is not informative for estimating average treatment effects. The variance bound (??) is the same regardless of whether the propensity score is known. Hence, asymptotically the propensity score does not lead to any reduction in dimensionality. However, the variance bound (??) of the average treatment effect on the treated changes when the true propensity score is known. Hahn (1998) attributes this to the 'dimension reduction' property of the propensity score. In my opinion this interpretation is highly misleading. I rather argue that the only value of knowing the true propensity score is that the observed X values of individuals who participated in programmes other than r can be used to improve the estimation of the density $f_{X|D=r}(x)$ among the programme-r participants.

If the propensity score did indeed contribute to reducing the dimensionality of the estimation problem, it should also help to estimate potential outcomes $E[Y^0]$ and average treatment effects $E[Y^1 - Y^0]$ more precisely. On the other hand, the propensity score provides information about the ratio of the density in the source and the target population and thus allows source observations to identify the density of X in the target population and vice versa. Consider the binary treatment case with $r=1$ (treated participants) and $s=0$ (non-participants). The $(Y,X)$ observations of the treated sample are informative for estimating $E[Y^1|X]$, whereas the $(Y,X)$ observations of the non-participant sample are informative for estimating $E[Y^0|X]$. Since the joint distribution of $(Y^1, Y^0)$ is not identified, the observations of the treated sample cannot assist in estimating $E[Y^0|X]$, and vice versa. The X observations of both samples are useful for estimating the distribution function of X in the population. With this information the average treatment effect can be estimated by weighting the estimates of $E[Y^1|X]$ and $E[Y^0|X]$ by the distribution of X in the population. Knowledge of the propensity score is of no use here. Now consider the estimation of the average treatment effect on the treated $E[Y^1 - Y^0|D=1]$ or of the counterfactual outcome $E[Y^0|D=1]$. Again the $(Y,X)$ observations of both samples identify the conditional expectation functions separately. These conditional expectation functions are weighted by the distribution of X among the treated, which can be estimated by the empirical distribution function of X in the treated sample. The non-participant observations are not informative for estimating the distribution of X among the treated. However, if the relationship between the distribution of X among the treated and the distribution of X among the non-participants were known, the X observations of the non-participants would be useful for estimating the distribution of X among the treated. Since the propensity score ratio equals the density ratio times the size ratio of the subpopulations (??), and since the relative size of the treated subpopulation $P(D=1)$ can be estimated precisely, both the treated and the non-participant observations can be used to estimate $f_{X|D=1}$ if the propensity score is known. Consider a simple example: in the case of random assignment with $p^1(x) = 0.5$ for all x, the distribution of X is the same among the treated and the non-participants, and using only the treated observations to estimate $f_{X|D=1}$ would neglect half of the informative observations.
With knowledge of the propensity score, the counterfactual outcome $E[Y^0|D=1]$ is identified as

$$E[Y^0|D=1] = \int E[Y^0|X=x, D=0] \, f_{X|D=1}(x) \, dx \qquad (175)$$
$$\hphantom{E[Y^0|D=1]} = \int E[Y^0|X=x, D=0] \, \frac{p^1(x)}{P(D=1)} \, f_X(x) \, dx$$

and could be estimated by the empirical moment estimator

$$\frac{\sum_{i:\, D_i \in \{0,1\}} \hat{m}_0(X_i) \, p^1(X_i)}{\sum_{i:\, D_i \in \{0,1\}} p^1(X_i)},$$

which uses the X observations of both the treated and the non-participants. This estimator is
suggested by Hahn (1998, Proposition 7) and achieves the variance bound for known propensity
score.
The value of knowing the propensity score for estimating the density $f_{X|D=1}$ becomes even more obvious when rewriting (175) as

$$E[Y^0|D=1] = \int E[Y^0|X=x, D=0] \, f_{X|D=1}(x) \, dx$$
$$\hphantom{E[Y^0|D=1]} = \int E[Y^0|X=x, D=0] \, \frac{P(D=0)}{P(D=1)} \, \frac{p^1(x)}{1-p^1(x)} \, f_{X|D=0}(x) \, dx,$$

if $p^1(x) \neq 1$ for all $x$. This suggests the empirical moment estimator

$$\frac{\sum_{i:\, D_i = 0} \hat{m}_0(X_i) \, \frac{p^1(X_i)}{1-p^1(X_i)}}{\sum_{i:\, D_i = 0} \frac{p^1(X_i)}{1-p^1(X_i)}},$$

which uses only non-participant observations ($D_i = 0$) to estimate the counterfactual outcome for the treated. Hence, with knowledge of the propensity score the counterfactual outcome $E[Y^0|D=1]$ for the treated could be estimated nonparametrically even without a single treated observation!
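To make these two moment estimators concrete, the following is a minimal sketch (not the exact implementation of Hahn 1998, Proposition 7): it assumes the propensity score $p^1(\cdot)$ is known and that some estimate $\hat{m}_0(\cdot)$ of $E[Y^0|X=x, D=0]$ is already available; all function and variable names are illustrative.

```python
import numpy as np

def counterfactual_treated(m0_hat, p1, X, D):
    """Two moment estimators of E[Y^0 | D=1] under a KNOWN propensity score.

    m0_hat : callable, an estimate of E[Y^0 | X=x, D=0], e.g. from a
             nonparametric regression fitted on the D=0 subsample
    p1     : callable, the known propensity score P(D=1 | X=x)
    """
    p = p1(X)
    # (a) uses the X observations of BOTH treated and non-participants,
    #     weighting m0_hat by the propensity score:
    est_all = np.sum(m0_hat(X) * p) / np.sum(p)
    # (b) uses ONLY the non-participant observations, re-weighted by the
    #     odds p/(1-p); no treated observation is needed at all:
    X0 = X[D == 0]
    w = p1(X0) / (1.0 - p1(X0))
    est_controls = np.sum(m0_hat(X0) * w) / np.sum(w)
    return est_all, est_controls

# illustration with p^1(x) = 0.5 (random assignment):
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
D = rng.binomial(1, 0.5, size=1000)
m0 = lambda x: 1 + x                       # a stand-in for an estimated m0_hat
p = lambda x: np.full_like(x, 0.5)
print(counterfactual_treated(m0, p, X, D))
```

Both versions are empirical analogues of the two integral representations above; in practice the choice between them matters only for how the X information of the two subsamples is exploited.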
In the case of multiple treatment evaluation there is a variety of propensity scores. Knowledge of $p^{r|rs}$ would allow using the X observations of the s sample to improve the precision of estimating the distribution of X in the r subpopulation. Knowledge of $p^{r|rt}$ would allow using the X observations of a t sample for estimating $f_{X|D=r}(x)$. Knowledge of $p^r$ would allow using all X observations to improve upon the estimation of $f_{X|D=r}(x)$. Hence, in the multiple treatment setting, the variance bound for the average treatment effect on the treated depends on which and how many propensity scores are known.
Besides deriving the efficiency bounds, Hahn (1998) further gives general conditions under which a generalized matching estimator based on a particular nonparametric series regression estimator attains both variance bounds (??) and (??).
Abadie and Imbens (2006) analyze the asymptotic efficiency of $\kappa$-nearest-neighbours matching estimators of average treatment effects when $\kappa$ is fixed, i.e. when the number of neighbours is fixed and does not grow with increasing sample size.203 This includes the standard pair-matching estimator ($\kappa = 1$). They consider matching with respect to the X variables and show that: 1) these estimators do not attain the variance bound (??) and, hence, are inefficient; 2) the bias term of the estimator is of order $O(n^{-2/c})$, where $c$ is the number of continuous covariates, so that with 4 continuous covariates the estimator is asymptotically biased, and with even more continuous covariates it no longer converges at rate $\sqrt{n}$; 3) the bias term can be removed through re-centering, but since re-centering leaves the variance term unchanged, the modified estimator is still inefficient.
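The following is a minimal (and deliberately naive, $O(n^2)$) sketch of such a fixed-$\kappa$ nearest-neighbour matching estimator of the average treatment effect, matching on X with Euclidean distance; the names and the distance metric are illustrative choices, not those of Abadie and Imbens (2006).

```python
import numpy as np

def nn_match_ate(Y, D, X, k=1):
    """Fixed-k nearest-neighbour matching estimator of the ATE (a sketch).

    For each unit the missing potential outcome is imputed by the mean
    outcome of its k nearest neighbours (Euclidean distance on X) in the
    opposite treatment group. k=1 is standard pair matching; keeping k
    fixed as n grows is exactly the case analyzed by Abadie and Imbens.
    """
    X = np.atleast_2d(X.T).T          # ensure shape (n, c)
    n = len(Y)
    imputed = np.empty(n)
    for i in range(n):
        opp = np.flatnonzero(D != D[i])            # opposite treatment group
        dist = np.linalg.norm(X[opp] - X[i], axis=1)
        nbrs = opp[np.argsort(dist)[:k]]           # k closest opposite units
        imputed[i] = Y[nbrs].mean()
    y1 = np.where(D == 1, Y, imputed)              # Y^1: observed or imputed
    y0 = np.where(D == 0, Y, imputed)              # Y^0: observed or imputed
    return (y1 - y0).mean()
```

Consistent with the result above, this estimator's bias shrinks slowly when X contains several continuous components, since the nearest neighbours then lie increasingly far away.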
Heckman, Ichimura, and Todd (1998) analyze local polynomial matching for the estimation of average treatment effects on the treated. They prove $\sqrt{n}$-consistency and asymptotic normality when matching with respect to X, with respect to the known propensity score, or with respect to the estimated propensity score. The asymptotic distribution consists of a bias term and a variance term. The variance term equals (??) when matching with respect to X. When matching with respect to the known propensity score, the variance term corresponds to (??) with X replaced by the propensity score and the density functions f(x) replaced by density functions with respect to the propensity score. Heckman, Ichimura, and Todd (1998) show that this variance term can be either larger or smaller than the variance when matching on X and conclude that neither matching on X nor matching on the propensity score necessarily dominates the other. (However, their discussion ignores the different bias terms.) This ambiguity also holds when the propensity score is estimated, since the variance contribution due to estimating the propensity score may be small. This variance contribution of estimated-propensity-score matching is derived for a propensity score estimated parametrically or nonparametrically by local polynomial regression with a suitably chosen bandwidth value.
Hirano, Imbens, and Ridder (2003) analyzed the efficiency of the re-weighting estimator for estimating average treatment effects and average treatment effects on the treated. They show that re-weighting using a propensity score estimated by a particular series estimator attains the variance bounds (??) and (??).
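A sketch of the re-weighting idea, taking fitted propensity scores as given; Hirano, Imbens, and Ridder (2003) obtain efficiency for a particular series estimator of the propensity score, whereas the normalized-weights variant below is merely one common stabilizing choice, not their exact estimator.

```python
import numpy as np

def ipw_ate(Y, D, p_hat):
    """Re-weighting (inverse probability weighting) estimator of E[Y^1 - Y^0].

    Y     : outcomes
    D     : binary treatment indicator
    p_hat : estimated propensity scores P(D=1 | X=x_i), one per observation
    """
    w1 = D / p_hat                 # weights for the treated
    w0 = (1 - D) / (1 - p_hat)     # weights for the non-participants
    # normalizing by the weight sums stabilizes the estimator when some
    # p_hat values are close to 0 or 1
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
```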
Ichimura and Linton (2002) derived higher-order expansions for the re-weighting estimator. Including second-order terms in the analysis is relevant since the first-order approximations do not depend on the smoothing or bandwidth parameters used in the nonparametric first step, so that optimal bandwidth choice cannot be discussed with first-order asymptotics. (This is also found in the analysis of the generalized matching estimator in Section 3.) They consider estimation of the propensity score by local linear regression methods and show that the optimal bandwidth is of order $O(n^{-1/3})$, and $O(n^{-2/5})$ for a bias-corrected version.
The analysis of the asymptotic properties of the evaluation estimators implied no firm recommendations on which estimator to use in practice. Generalized matching estimators as well as re-weighting estimators with estimated propensity scores can be efficient. Yet their small-sample properties have not been investigated extensively, and the previous asymptotic considerations may be of limited use for choosing an adequate estimator. Although, from an asymptotic perspective, matching on the propensity score implies no reduction in dimensionality and there is no reason why matching should not proceed with respect to X, propensity score matching can often be quite useful since “in practice inference for average treatment effects is often less sensitive to misspecification of the propensity score than to specifications of the conditional expectation of the potential outcomes” (Imbens 2000).204 In Chapter 3 the finite sample properties of generalized matching estimators, including pair-matching, least squares matching and various local polynomial matching estimators, as well as of the re-weighting estimators, are examined. In addition, the bandwidth selection problem is investigated. In spite of the undecided conclusions resulting from the asymptotic considerations, it appears that some rather stable recommendations can be drawn for the choice of evaluation estimators in finite samples.

203 Consistent estimation of $E[Y^r|X, D=r]$ would require $\kappa \to \infty$ as $n \to \infty$.

Note: the following is also still to be incorporated.


Heckman, Ichimura, and Todd (1998) showed furthermore that nonparametric regression on the propensity score $p^{s|rs}(x)$ is asymptotically linear with trimming, even if the propensity score itself is estimated, provided it is estimated parametrically or nonparametrically by local polynomial regression. Sufficient conditions and the resulting influence functions are given in Corollaries ?? and ?? in Appendix B. The linear asymptotic representation with trimming is used in Chapter 3 to analyze the mean squared error of matching estimators and in Chapter 4 to derive the asymptotic properties of the semiparametric estimator of the conditional expected potential outcomes.

A Appendix: Asymptotic Theory


Some definitions and results regarding the convergence of sequences. First we consider deterministic sequences $\{a_n : n = 1, 2, \dots\}$.

Definition 5 Limit of a sequence: A sequence of nonrandom numbers $a_n$ converges to $a$ if for all $\varepsilon > 0$ there exists an $n_\varepsilon$ such that $|a_n - a| < \varepsilon$ for every $n > n_\varepsilon$. This is also denoted as $a_n \to a$. If the sequence converges to zero, $a_n \to 0$, we also write $a_n = o(1)$.
Definition 6 Boundedness of a sequence: A sequence of nonrandom numbers $a_n$ is bounded if and only if there is some $b < \infty$ such that $|a_n| \le b$ for all $n = 1, 2, \dots$. If the sequence $a_n$ is bounded, we write $a_n = O(1)$.
Definition 7 Order of a sequence, O: A sequence $a_n$ is at most of order $n^v$ if $n^{-v} a_n = O(1)$. We write $a_n = O(n^v)$.
Definition 8 Order of a sequence, o: A sequence $a_n$ is of smaller order than $n^v$ if $n^{-v} a_n \to 0$. We write $a_n = o(n^v)$.
If we consider vectors or matrices of random numbers, we say that the vector or matrix is of order $O(n^v)$ or $o(n^v)$ if each element satisfies the definition.

Now we consider random sequences $\{x_n : n = 1, 2, \dots\}$, i.e. sequences of random numbers. Examples are random draws from the same random variable, $X_1, X_2, \dots, X_n$, the sample mean $\frac{1}{n}\sum_{i=1}^n X_i$, or the OLS estimator $\hat{\beta}_n$.

204 However, in specific situations, for instance if the outcome variable is bounded, nonparametric regression on X might work better than is widely thought; see Frölich (2006a).

Definition 9 Convergence in probability: A sequence of random numbers $x_n$ converges in probability to the constant $x$ if for all $\varepsilon > 0$ the probability $\Pr(|x_n - x| > \varepsilon) \to 0$ as $n \to \infty$. We write $x_n \overset{p}{\to} x$ or $\operatorname{plim} x_n = x$. In the case where $\operatorname{plim} x_n = 0$ we also write $x_n = o_p(1)$.

Definition 10 Boundedness in probability: A sequence of random numbers $x_n$ is bounded in probability if and only if for every $\varepsilon > 0$ there exist a $b_\varepsilon < \infty$ and an integer $n_\varepsilon$ such that $\Pr(|x_n| \ge b_\varepsilon) < \varepsilon$ for all $n \ge n_\varepsilon$. We write $x_n = O_p(1)$.

If $x_n$ is a nonrandom sequence, then $x_n = O_p(1)$ iff $x_n = O(1)$, and $x_n = o_p(1)$ iff $x_n = o(1)$.

In the following definition, $x_n$ is a sequence of random numbers and $a_n$ is a sequence of nonrandom positive numbers.

Definition 11 Order in probability: A random sequence $x_n$ is $O_p(a_n)$ if $x_n / a_n = O_p(1)$. A random sequence $x_n$ is $o_p(a_n)$ if $x_n / a_n = o_p(1)$.
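A small numerical illustration of Definition 11 (a sketch; the distribution, sample sizes and replication counts are arbitrary choices): for i.i.d. draws with mean $\mu$ and finite variance, the sample mean satisfies $\bar{X}_n - \mu = O_p(n^{-1/2})$, so the rescaled sequence $\sqrt{n}\,|\bar{X}_n - \mu|$ should be bounded in probability, i.e. its quantiles stabilize rather than diverge as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
for n in [100, 10_000, 100_000]:
    # 300 replications of sqrt(n) * |sample mean - mu| with mu = 2
    dev = np.array([rng.normal(loc=2.0, size=n).mean() - 2.0
                    for _ in range(300)])
    print(n, round(np.quantile(np.sqrt(n) * np.abs(dev), 0.95), 3))
# the 0.95-quantiles stabilize (near 1.96 for this normal example)
# instead of diverging, consistent with Op(n^{-1/2})
```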

Useful rules for calculations:

$o_p(1) + o_p(1) = o_p(1)$
$O_p(1) + O_p(1) = O_p(1)$
$o_p(1) + O_p(1) = O_p(1)$
$o_p(1) \cdot o_p(1) = o_p(1)$
$O_p(1) \cdot O_p(1) = O_p(1)$
$o_p(1) \cdot O_p(1) = o_p(1)$.

(Try to prove these rules on your own.)
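As a template for this exercise, here is a proof sketch of the first rule; the remaining rules follow from similar triangle-inequality and union-bound arguments.

```latex
% Proof sketch of o_p(1) + o_p(1) = o_p(1).
% Let x_n = o_p(1) and y_n = o_p(1), and fix any eps > 0.
\begin{align*}
\Pr\big(|x_n + y_n| > \varepsilon\big)
  &\le \Pr\big(|x_n| + |y_n| > \varepsilon\big)
     && \text{(triangle inequality)} \\
  &\le \Pr\big(|x_n| > \tfrac{\varepsilon}{2}\big)
     + \Pr\big(|y_n| > \tfrac{\varepsilon}{2}\big)
     && \text{(union bound)} \\
  &\longrightarrow 0 + 0 = 0,
\end{align*}
% since each term vanishes by assumption; hence x_n + y_n = o_p(1).
```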
