There is a large literature in econometrics and statistics on semiparametric estimation of average treatment effects under the assumption of unconfounded treatment assignment. Recently this literature has focused on the setting with many covariates, where regularization of some kind is required. In this article we discuss some of the lessons from the earlier literature and their relevance for the many covariate setting, and propose some supplementary analyses to assess the credibility of the results.

II. The Set Up

We are interested in estimating an average treatment effect in a setting with a binary treatment. We use the potential outcome or Rubin Causal Model set up (Rubin [1974], Holland [1986], Imbens and Rubin [2015]). Each unit in a large population is characterized by a pair of potential outcomes $(Y_i(0), Y_i(1))$, with the estimand equal to the average causal effect:

$$\tau = \mathbb{E}[Y_i(1) - Y_i(0)],$$

or the average effect for the treated, $\tau_t = \mathbb{E}[Y_i(1) - Y_i(0) \mid W_i = 1]$. The treatment assignment for unit $i$ is $W_i \in \{0, 1\}$. For each unit in a random sample from the population we observe the treatment received, the observed outcome

$$Y_i^{obs} = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1, \end{cases}$$

and pretreatment variables or features $X_i$. To identify $\tau$ we assume unconfoundedness (Rosenbaum and Rubin [1983])

$$W_i \perp\!\!\!\perp \bigl(Y_i(0), Y_i(1)\bigr) \mid X_i,$$

and overlap of the covariate distributions,

$$e(x) \in (0, 1),$$

where the propensity score (Rosenbaum and Rubin [1983]) is $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$. Define the marginal treatment probability $p = \mathbb{E}[W_i]$, the conditional means of the potential outcomes, $\mu(w, x) = \mathbb{E}[Y_i(w) \mid X_i = x]$, the marginal means, $\mu_w = \mathbb{E}[Y_i(w)]$, and the conditional variances $\sigma^2(w, x) = \mathbb{V}(Y_i(w) \mid X_i = x)$. The efficient score for $\tau$, which plays a key role in the discussion, is

$$\varphi(y, w, x; \tau, \mu(\cdot,\cdot), e(\cdot)) = w\,\frac{y - \mu(1, x)}{e(x)} - (1 - w)\,\frac{y - \mu(0, x)}{1 - e(x)} + \mu(1, x) - \mu(0, x) - \tau,$$

(Hahn [1998]) and the implied semiparametric variance bound is

$$AV = \mathbb{E}\left[\varphi\bigl(Y_i^{obs}, W_i, X_i; \tau, \mu(\cdot,\cdot), e(\cdot)\bigr)^2\right].$$
tion between the treatment indicator and the covariates, through balancing, weighting, or otherwise, and adjust directly for the association between the potential outcomes and the covariates. There are multiple ways of obtaining such estimators. One can do so by subclassification on the propensity score in combination with regression within the subclasses, or weighting in combination with regression. For example, suppose we parametrize the conditional means as $\mu(w, x) = w\tau + x'\beta$, and the propensity score as $e(x) = 1/(1 + \exp(x'\gamma))$, and estimate the regression by weighted linear regression with weights equal to

$$\frac{W_i}{\sqrt{e(X_i; \hat\gamma)}} + \frac{1 - W_i}{\sqrt{1 - e(X_i; \hat\gamma)}},$$

then the estimator for $\tau$ is consistent if either the propensity score or the conditional expectations of the potential outcomes are correctly specified. Similarly, using the efficient score, if we estimate the average treatment effect by solving

$$\frac{1}{N} \sum_{i=1}^{N} \varphi\bigl(Y_i^{obs}, W_i, X_i; \tau, \hat\mu(\cdot,\cdot), \hat e(\cdot)\bigr) = 0,$$

as a function of $\tau$ given estimators $\hat\mu(\cdot,\cdot)$ and $\hat e(\cdot)$, then as long as the estimator for either $\mu(w, x)$ or $e(x)$ is consistent, the resulting estimator for $\tau$ is consistent. If we use general nonparametric estimators for $\mu(\cdot,\cdot)$ and $e(\cdot)$, this last estimator also has the property that the estimator for the finite dimensional component $\tau$ is asymptotically uncorrelated with the estimator for the nonparametric components $\mu(w, x)$ and $e(x)$. This orthogonality property (Chernozhukov et al. [2016]) follows from the representation of the estimator in terms of the efficient score. Note that the two properties are distinct: not all estimators that have the orthogonality property are doubly robust.

B. Modifying the Estimand

A second issue is the choice of estimand. Much of the literature has focused on the average treatment effect $\mathbb{E}[Y_i(1) - Y_i(0)]$, or the average effect for the treated. A practical concern is that these estimands may be difficult to estimate precisely if the propensity score is close to zero for a substantial fraction of the population. This is a particular concern in settings with many covariates because regularization based on prediction criteria may downplay biases that are present in estimation of $\mu(w, x)$ in parts of the $(w, x)$ space with few observations, even if those values are important for the estimation of the average treatment effect. In that case one may wish to focus on a weighted average effect of the treatment. One can do so by trimming or weighting. Crump et al. [2006, 2009] and Li et al. [2014] suggest estimating

$$\tau_{\omega(\cdot)} = \frac{\mathbb{E}\bigl[\omega(X_i) \cdot (Y_i(1) - Y_i(0))\bigr]}{\mathbb{E}[\omega(X_i)]},$$

for $\omega(x) = e(x)(1 - e(x))$ or $\omega(x) = \mathbf{1}_{\alpha < e(x) < 1 - \alpha}$. The semiparametric efficiency bound for $\tau_{\omega(\cdot)}$ is (Hirano et al. [2001])

$$AV = \frac{1}{\left(\mathbb{E}[\omega(X_i)]\right)^2}\, \mathbb{E}\left[\frac{\omega(X_i)^2 \sigma^2(1, X_i)}{e(X_i)} + \frac{\omega(X_i)^2 \sigma^2(0, X_i)}{1 - e(X_i)} + \omega(X_i)^2 \bigl(\mu(1, X_i) - \mu(0, X_i) - \tau_{\omega(\cdot)}\bigr)^2\right],$$

which can be an order of magnitude smaller than the asymptotic variance bound for $\tau$ itself.

In settings with limited or no heterogeneity in the treatment effects as a function of the covariates, these weights are particularly helpful and the weights $\omega(x) = e(x)(1 - e(x))$ lead to efficient estimators for $\tau$ in that case. The arguments in Crump et al. [2006, 2009] and Li et al. [2014] show that one may wish to impose a constant treatment effect in estimation even if substantively one does not find that assumption credible.

C. Weighting versus Balancing

Although weighting by the inverse of the treatment assignment probability balances pretreatment variables in expectation, it does not do so in finite samples. Recently there have been a number of estimators pro-
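The weighted estimand of Section III.B can be illustrated with a plug-in sketch. It assumes that the nuisance values µ(0, X_i), µ(1, X_i), and e(X_i) are already available for each unit, and it replaces the unobserved difference Y_i(1) − Y_i(0) by the uncentered efficient score. This is one natural sample analog, not necessarily the estimator used in Crump et al. [2006, 2009] or Li et al. [2014], and all function names are hypothetical.

```python
def uncentered_score(y, w, mu0, mu1, e):
    """Efficient score without the '- tau' term, for one unit."""
    return w * (y - mu1) / e - (1 - w) * (y - mu0) / (1 - e) + mu1 - mu0


def overlap_weight(e):
    """omega(x) = e(x) * (1 - e(x))."""
    return e * (1 - e)


def trimming_weight(e, alpha=0.1):
    """omega(x) = 1 if alpha < e(x) < 1 - alpha, else 0 (alpha is a user choice)."""
    return 1.0 if alpha < e < 1 - alpha else 0.0


def weighted_effect(rows, weight):
    """Plug-in analog of tau_omega.

    rows: iterable of (y, w, mu0, mu1, e) tuples, one per unit.
    weight: function mapping the propensity score e to omega.
    Raises ZeroDivisionError if every unit receives weight zero.
    """
    numerator = 0.0
    denominator = 0.0
    for y, w, mu0, mu1, e in rows:
        omega = weight(e)
        numerator += omega * uncentered_score(y, w, mu0, mu1, e)
        denominator += omega
    return numerator / denominator
```

With the constant weight `lambda e: 1.0` the function returns the value of τ that solves the sample efficient-score equation, so the unweighted average treatment effect is a special case. Passing `trimming_weight` or `overlap_weight` instead shows how units with extreme propensity scores are dropped or downweighted.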
4 PAPERS AND PROCEEDINGS MONTH YEAR
pect those to perform well. The first two estimators we discuss assume linearity of the conditional expectation of the potential outcomes in the (potentially many) covariates. How sensitive the results are in practice to this linearity assumption in settings with many covariates, where some of the covariates may be functions of underlying variables, remains to be seen.

A. The Double Selection Estimator (DSE)

Belloni et al. [2013] propose using LASSO (Tibshirani [1996]) as a covariate selection method. They do so first to select pretreatment variables that are important for explaining the outcome, and then to select pretreatment variables that are important for explaining the treatment assignment. They then combine the two sets of pretreatment variables and estimate a regression of the outcome on the treatment indicator and the union of the selected pretreatment variables.

B. The Approximate Residual Balancing Estimator (ARBE)

Athey et al. [2016] suggest using elastic net (Zou and Hastie [2005]) or LASSO (Tibshirani [1996]) to estimate the conditional outcome expectation, and then using an approximate balancing approach in the spirit of Zubizarreta [2015] as discussed in Section III.C to further remove bias arising from remaining imbalances in the pretreatment variables.

C. The Doubly Robust Estimator (DRE) and the Double Machine Learning Estimator (DMLE)

In the general discussion of semiparametric estimation van der Vaart [2000] suggests estimating the finite dimensional component as the average of the influence function, with the infinite dimensional components estimated nonparametrically, leading to a doubly robust estimator in the spirit of Robins and Rotnitzky [1995], Robins et al. [1995], and Scharfstein et al. [1999]. In the specific context of estimation of average treatment effects Van Der Laan and Rubin [2006] propose this estimator as a special case of the targeted maximum likelihood approach, suggesting various machine learning methods for estimation of the conditional outcome expectation and the propensity score. Chernozhukov et al. [2016], in the context of much more general estimation problems, propose a closely related estimator focusing on the orthogonality properties arising from the use of the efficient score. In the Chernozhukov et al. [2016] approach the sample is partitioned into K subsamples, with the nonparametric component estimated on one subsample, and the parameter of interest estimated as the average of the influence function over the remainder of the sample. This is repeated K times, and the estimators for the parameter of interest are averaged to obtain the final estimator, thereby further improving the properties in settings with many covariates. We report both the simple version of the DRE and the averaged version DMLE.

V. Outstanding Challenges and Practical Recommendations

Here we present some practical recommendations for researchers estimating treatment effects, and discuss some of the remaining challenges for the theoretical researchers.

A. Recommendations

The main recommendation is to report analyses beyond the point estimates and the associated standard errors. Supporting analyses should be presented to convey to the reader that the estimates are credible (Athey and Imbens [2016]). By credible we do not mean whether the unconfoundedness property holds, but whether the estimates effectively adjust for differences in the covariates. Here are four specific recommendations to do so.

1) (Robustness) Do not rely on a single estimation method. Many of the methods have attractive properties under slightly different sets of regularity conditions but rely on the same fundamental set of identifying assumptions.
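To make the sample-splitting procedure of Section IV.C concrete, the following Python sketch partitions the data into K subsamples, estimates the nuisance functions on one subsample, averages the uncentered efficient score over the remainder, and averages the K resulting estimates. It is an illustration under strong simplifying assumptions: the covariate is discrete, the nuisance learner `fit_cell_means` is a hypothetical stand-in for the machine learning methods discussed in the text, and the clipping of the estimated propensity score is an added safeguard that the text does not mention.

```python
from statistics import fmean


def fit_cell_means(train, clip=0.01):
    """Hypothetical nuisance learner for a discrete covariate x.

    train: list of (y, w, x) tuples. Returns functions mu(w, x) and e(x)
    based on cell averages, falling back to pooled averages for empty cells.
    """
    def avg(values, fallback):
        return fmean(values) if values else fallback

    pooled_e = fmean([w for _, w, _ in train])
    pooled_y = {t: avg([y for y, w, _ in train if w == t], 0.0) for t in (0, 1)}

    def mu(w, x):
        return avg([y for y, wi, xi in train if wi == w and xi == x], pooled_y[w])

    def e(x):
        raw = avg([wi for _, wi, xi in train if xi == x], pooled_e)
        return min(max(raw, clip), 1 - clip)  # keep the score away from 0 and 1

    return mu, e


def score_average(sample, mu, e):
    """Average of the uncentered efficient score over `sample`."""
    return fmean([
        w * (y - mu(1, x)) / e(x)
        - (1 - w) * (y - mu(0, x)) / (1 - e(x))
        + mu(1, x) - mu(0, x)
        for y, w, x in sample
    ])


def cross_fit_ate(data, k=2, fit=fit_cell_means):
    """Average of K split-sample estimates of the average treatment effect."""
    folds = [data[j::k] for j in range(k)]
    estimates = []
    for j, fold in enumerate(folds):
        rest = [row for i, f in enumerate(folds) if i != j for row in f]
        mu, e = fit(fold)                              # nuisance from one subsample
        estimates.append(score_average(rest, mu, e))   # score over the remainder
    return fmean(estimates)
```

The roles of the subsamples follow the description in the text. Many implementations instead fit the nuisance functions on K − 1 subsamples and evaluate the score on the held-out one; with K = 2 the two conventions coincide. Running the same data through a different `fit` function is one simple way to act on the robustness recommendation above.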
Table 1—An Illustration Based on the Connors et al. [1996] Heart Catheterization Data
Richard Crump, V Joseph Hotz, Guido Imbens, and Oscar Mitnik. Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand, 2006.

Richard K Crump, V Joseph Hotz, Guido W Imbens, and Oscar A Mitnik. Dealing with limited overlap in estimation of average treatment effects. Biometrika, pages 187–199, 2009.

Bradley Efron and Robert J Tibshirani. An Introduction to the Bootstrap. CRC Press, 1994.

Bryan Graham, Christine Pinto, and Daniel Egel. Inverse probability tilting for moment condition models with missing data. Review of Economic Studies, pages 1053–1079, 2012.

Bryan Graham, Christine Pinto, and Daniel Egel. Efficient estimation of data combination models by the method of auxiliary-to-study tilting (AST). Journal of Business and Economic Statistics, 34(2):288–301, 2016.

Jinyong Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, pages 315–331, 1998.

Jens Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.

Keisuke Hirano, Guido Imbens, Geert Ridder, and Donald Rubin. Combining panels with attrition and refreshment samples. Econometrica, pages 1645–1659, 2001.

Paul W Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–970, 1986.

Guido Imbens and Jeffrey Wooldridge. Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86, 2009.

Guido W Imbens and Donald B Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

Joseph DY Kang and Joseph L Schafer. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, pages 523–539, 2007.

Fan Li, Kari Lock Morgan, and Alan M Zaslavsky. Balancing covariates via propensity score weighting. arXiv preprint arXiv:1404.1785, 2014.

Daniel F McCaffrey, Greg Ridgeway, and Andrew R Morral. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4):403, 2004.

Whitney K Newey. The asymptotic variance of semiparametric estimators. Econometrica: Journal of the Econometric Society, pages 1349–1382, 1994.

James Robins and Andrea Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(1):122–129, 1995.

James Robins, Andrea Rotnitzky, and L.P. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(1):106–121, 1995.

Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.

Daniel O Scharfstein, Andrea Rotnitzky, and James M Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of
VOL. VOL NO. ISSUE AVERAGE TREATMENT EFFECTS 9