
Journal of the American Statistical Association

ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: https://www.tandfonline.com/loi/uasa20

Decomposing Treatment Effect Variation

Peng Ding, Avi Feller & Luke Miratrix

To cite this article: Peng Ding, Avi Feller & Luke Miratrix (2019) Decomposing Treatment
Effect Variation, Journal of the American Statistical Association, 114:525, 304-317, DOI:
10.1080/01621459.2017.1407322

To link to this article: https://doi.org/10.1080/01621459.2017.1407322


Published online: 09 Jul 2018.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2019, VOL. 114, NO. 525, 304–317, Theory and Methods
https://doi.org/./..

Decomposing Treatment Effect Variation


Peng Ding^a, Avi Feller^{a,b}, and Luke Miratrix^c
^a Department of Statistics, University of California, Berkeley, CA; ^b Goldman School of Public Policy, University of California, Berkeley, CA; ^c Harvard Graduate School of Education, Cambridge, MA

ABSTRACT

Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the “black box” of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of randomized experiments, this article proposes a framework for decomposing overall treatment effect variation into a systematic component explained by observed covariates and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are entirely justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully interacted linear regression and two-stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an $R^2$-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment. Supplementary materials for this article are available online.

ARTICLE HISTORY
Received May
Revised November

KEYWORDS
Heterogeneous treatment effect; Idiosyncratic treatment effect variation; Noncompliance; Randomization inference; Systematic treatment effect variation

CONTACT Luke Miratrix lmiratrix@stat.harvard.edu Harvard Graduate School of Education, Cambridge, MA.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/JASA.
Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA.
© American Statistical Association

1. Introduction

The analysis of randomized experiments has traditionally focused on the average treatment effect, often ignoring or assuming away treatment effect variation (e.g., Neyman 1923; Fisher 1935; Kempthorne 1952; Rosenbaum 2002). Today, understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the “black box” of the average treatment effect. This is clear from the increasing number of articles on the topic in statistics and machine learning (Hill 2011; Athey and Imbens 2016; Wager and Athey 2017), biostatistics (Huang, Gilbert, and Janes 2012; Matsouaka, Li, and Cai 2014), education (Raudenbush and Bloom 2015), economics (Heckman, Smith, and Clements 1997; Crump et al. 2008; Djebbari and Smith 2008), political science (Green and Kern 2012; Imai and Ratkovic 2013), and other areas.

This article proposes a framework for decomposing overall treatment effect variation in a randomized experiment into a systematic component that is explained by observed covariates, and an idiosyncratic component that is not explained (Heckman, Smith, and Clements 1997; Djebbari and Smith 2008). In doing so, we make several key contributions. First, we take a fully randomization-based perspective (see Rosenbaum 2002; Imbens and Rubin 2015), and propose estimators that are entirely justified by the randomization itself. This is in contrast to much of the literature on randomization-based methods, where treatment effect variation is typically a nuisance (e.g., Rosenbaum 1999, 2007). Similar to Lin (2013), we show that the resulting estimator is very similar in form to linear regression with interactions between the treatment indicator and covariates. Unlike with linear regression, however, the proposed estimator does not require any modeling assumptions on the marginal outcomes.

Second, we extend these methods from intention-to-treat (ITT) analysis to allow for noncompliance, proposing a randomization-based estimator for systematic treatment effect variation for the local average treatment effect (LATE) in the case of noncompliance (Angrist, Imbens, and Rubin 1996). We show that this estimator is nearly identical to the two-stage least squares estimator with interactions between the treatment and covariates. We believe that this is a particularly novel contribution to the recent literature seeking to reconcile the randomization-based tradition in statistics and the linear model-based perspective more common in econometrics (Abadie 2003; Imbens 2014; Imbens and Rubin 2015).

Armed with these estimators, we turn to two practical tools for decomposing treatment effect variation. The first is an omnibus test for the presence of systematic treatment effect variation. While versions of this test have been proposed previously, largely in the context of linear models (Cox 1984; Crump et al. 2008), our proposed test is fully randomization-based and can also account for noncompliance. The second is to develop and bound an $R^2$-like measure of the fraction of treatment effect variation explained by covariates. This builds on previous versions proposed in the econometrics literature (Heckman, Smith, and Clements 1997; Djebbari and Smith 2008), again
extending results to account for noncompliance. This approach is also closely related to the Oaxaca–Blinder decomposition in economics (Oaxaca 1973; Blinder 1973). See Angrist, Pathak, and Walters (2013) for a recent application that also addresses compliance. Finally, we apply these methods to the Head Start Impact Study, a large-scale randomized trial of Head Start, a federally funded preschool program (Puma et al. 2010). We relegate the technical details and some further extensions to the online supplementary material.

2. Framework for Treatment Effect Variation

2.1. Setup and Notation

Assume that we have $n$ units in an experiment. For unit $i$, let $X_i = (X_{1i}, \ldots, X_{Ki})^T \in \mathbb{R}^K$ denote the vector of pretreatment covariates, with the constant 1 as its first component. Let $T_i$ denote the treatment indicator with 1 for treatment and 0 for control. We use the potential outcomes framework (Neyman 1923; Rubin 1974) to define causal effects. Under the stable unit treatment value assumption (Rubin 1980) that there is only one version of the treatment and no interference among units, we define $Y_i(1)$ and $Y_i(0)$ as the potential outcomes of unit $i$ under treatment and control, respectively. The observed outcome, $Y_i^{obs} = T_i Y_i(1) + (1 - T_i) Y_i(0)$, is quite general and includes continuous, binary, and zero-inflated cases. On the difference scale, the individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$.

Importantly, this is finite population inference in that we condition on the $n$ units at hand: the potential outcomes are fixed and pretreatment. This differs from super population inference in which some variables or residuals are assumed to be independent and identically distributed (iid) draws from some distribution. See, for example, Rosenbaum (2002), Imbens and Rubin (2015), and Li and Ding (2017). Under the potential outcomes framework, $\{Y_i(1), Y_i(0)\}_{i=1}^n$ are all fixed numbers; the randomness of any estimator comes from the assignment mechanism, which is the distribution of possible treatment assignments $T = (T_1, \ldots, T_n)^T$. Note that $\mathrm{pr}\{(T_1, \ldots, T_n) = (t_1, \ldots, t_n)\} = \binom{n}{n_1}^{-1}$ if $\sum_{i=1}^n t_i = n_1$.

2.2. Randomization Inference for Vector Outcomes

To set up our overall framework, we first generalize Neyman’s (1923) classic results to vector outcomes. We consider a completely randomized experiment, with $n_1$ units assigned to treatment and $n_0$ units assigned to control; in total we have $\binom{n}{n_1}$ possible randomizations. We are interested in estimating the finite population average treatment effect on a vector outcome $V \in \mathbb{R}^K$:

$$\tau_V = \frac{1}{n} \sum_{i=1}^n \{V_i(1) - V_i(0)\},$$

where $V_i(1)$ and $V_i(0)$ are the potential outcomes of $V$ for unit $i$. For example, $V$ can be $Y$ or $XY$. The Neyman-type unbiased estimator for $\tau_V$ is the difference between the sample mean vectors of the observed outcomes under treatment and control:

$$\hat{\tau}_V = \bar{V}_1^{obs} - \bar{V}_0^{obs} = \frac{1}{n_1} \sum_{i=1}^n T_i V_i^{obs} - \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) V_i^{obs} = \frac{1}{n_1} \sum_{i=1}^n T_i V_i(1) - \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) V_i(0).$$

The behavior of our estimator, and of our estimators for heterogeneity discussed later, revolves around covariances of vector outcomes. For notation, let $A = \{A_1, \ldots, A_n\}$ be a collection of $n$ vectors, with $\bar{A} = n^{-1} \sum_{i=1}^n A_i$ the vector mean, and define the covariance operator on $A$ as

$$S(A) = \frac{1}{n - 1} \sum_{i=1}^n (A_i - \bar{A})(A_i - \bar{A})^T,$$

which gives the covariance matrix of the $n$ vectors in $A$. For example, $A_i$ can be $V_i(1)$, $V_i(0)$, or $V_i(1) - V_i(0)$.

The following theorem, generalizing the results for scalar outcomes from Neyman (1923), demonstrates that $\hat{\tau}_V$ is unbiased and gives its covariance matrix.

Theorem 1. Over all possible randomizations of a completely randomized experiment, $\hat{\tau}_V$ is unbiased for $\tau_V$, with $K \times K$ covariance matrix:

$$\mathrm{cov}(\hat{\tau}_V) = \frac{S\{V(1)\}}{n_1} + \frac{S\{V(0)\}}{n_0} - \frac{S\{V(1) - V(0)\}}{n}. \tag{1}$$

The diagonal elements of this matrix are the variances of the estimators of each component of $\tau_V$. The covariance matrix of $\hat{\tau}_V$ depends on the various covariances of the potential outcomes under treatment and control. In particular, the last term depends on the correlation between the potential outcomes $V(1)$ and $V(0)$, and therefore cannot be identified from the observed data. When the individual treatment effects are constant for all components of $V$, the last term in the above covariance matrix vanishes, because then $S\{V(1) - V(0)\} = 0_{K \times K}$. Under this assumption, we can unbiasedly estimate the sampling covariance matrix $\mathrm{cov}(\hat{\tau}_V)$ by replacing the covariances of the potential outcomes by the sample analogs:

$$\widehat{\mathrm{cov}}(\hat{\tau}_V) = \frac{S_1(V^{obs})}{n_1} + \frac{S_0(V^{obs})}{n_0},$$

where

$$S_t(V^{obs}) = \frac{1}{n_t - 1} \sum_{i=1}^n I_{(T_i = t)} (V_i^{obs} - \bar{V}_t^{obs})(V_i^{obs} - \bar{V}_t^{obs})^T \quad (t = 0, 1) \tag{2}$$

are the sample covariance matrices of $V^{obs}$ in the treatment and control groups. Without the constant treatment effect assumption, the covariance estimator $\widehat{\mathrm{cov}}(\hat{\tau}_V)$ is conservative in the sense that the difference between the expectation of the variance estimator and the true variance is a nonnegative definite matrix. In particular, the diagonal terms of the expected estimator will all be at least as large as the truth. Letting $K = 1$, the covariance matrices become simple variances, which recovers Neyman’s original result.

Using the mathematical framework introduced in the Appendix and in Li and Ding (2017), we can easily generalize Theorem 1 to more complicated experimental designs, for example, cluster-randomized trials (Middleton and Aronow 2015) and unbalanced $2^2$ split-plot designs (Zhao et al. 2017).

2.3. Decomposing Treatment Effect Variation

We now apply this general framework to treatment effect variation. We decompose the individual treatment effect, $\tau_i$, via

$$\tau_i = Y_i(1) - Y_i(0) = X_i^T \beta + \varepsilon_i, \quad (i = 1, \ldots, n) \tag{3}$$

with $\beta$ being the finite population linear regression coefficient of $\tau_i$ on $X_i$, defined by

$$\beta = \arg\min_{b \in \mathbb{R}^K} \sum_{i=1}^n \left(\tau_i - X_i^T b\right)^2. \tag{4}$$

Following Heckman, Smith, and Clements (1997) and Djebbari and Smith (2008), we call $\delta_i = X_i^T \beta$ the systematic treatment effect variation explained by the observed covariates, $X_i$, and call $\varepsilon_i \equiv \tau_i - \delta_i = \tau_i - X_i^T \beta$ the idiosyncratic treatment effect variation not explained by $X_i$.

More generally, we can view this decomposition in a regression-style framework. Define

$$S_{xx} = \frac{1}{n} \sum_{i=1}^n X_i X_i^T \in \mathbb{R}^{K \times K}, \quad S_{x\varepsilon} = \frac{1}{n} \sum_{i=1}^n \varepsilon_i X_i \in \mathbb{R}^K, \quad S_{x\tau} = \frac{1}{n} \sum_{i=1}^n \tau_i X_i \in \mathbb{R}^K,$$

where $S_{xx}$ is nondegenerate, analogous to the usual full-rank assumption in linear models. Also define

$$S_{xt} = \frac{1}{n} \sum_{i=1}^n X_i Y_i(t) \in \mathbb{R}^K, \quad (t = 0, 1).$$

These are all finite population quantities, in that they are fixed pre-randomization values. The definition of $\beta$ gives $S_{x\varepsilon} = 0$, that is, $\varepsilon_i$ and $X_i$ have finite population covariance zero. Therefore, in the spirit of the agnostic regression framework (e.g., Lin 2013), the systematic component, $\delta_i = X_i^T \beta$, is a projection of $\tau_i$ onto the linear space spanned by $X_i$, and the idiosyncratic treatment effect, $\varepsilon_i$, is the corresponding residual. The linear projection applies to general outcomes, including the binary case.

Because of our finite population focus, if we observed all the potential outcomes we could immediately calculate all individual treatment effects and apply standard linear regression theory to (3) and obtain $\beta$. In particular, the solution of (4), that is, the ordinary least squares (OLS) solution from regressing $\tau$ on $X$, is

$$\beta = S_{xx}^{-1} S_{x\tau} = S_{xx}^{-1} S_{x1} - S_{xx}^{-1} S_{x0} \equiv \gamma_1 - \gamma_0, \tag{5}$$

where $\gamma_1 = S_{xx}^{-1} S_{x1}$ and $\gamma_0 = S_{xx}^{-1} S_{x0}$ are the corresponding finite population regression coefficients of the potential outcomes on the covariates. Let $e_i(1) = Y_i(1) - X_i^T \gamma_1$ and $e_i(0) = Y_i(0) - X_i^T \gamma_0$ be the residual potential outcomes from the regression of $Y_i(t)$ onto $X$. Our idiosyncratic treatment variation is then the difference of residuals: $\varepsilon_i = e_i(1) - e_i(0)$. In practice, we do not fully observe these components, but we can obtain unbiased or consistent estimates for them as we discuss below.

3. Systematic Treatment Effect Variation for the ITT

3.1. Randomization-Based Estimator

We now turn to estimating $\beta$. As shown in (5), $\beta$ has three components. The first term, $S_{xx}$, is fully observed as all the covariates are observed. Our estimation then depends on the sample analogs of $S_{x1}$ and $S_{x0}$:

$$\hat{S}_{x1} = \frac{1}{n_1} \sum_{i=1}^n T_i Y_i^{obs} X_i \in \mathbb{R}^K, \quad \hat{S}_{x0} = \frac{1}{n_0} \sum_{i=1}^n (1 - T_i) Y_i^{obs} X_i \in \mathbb{R}^K.$$

The $\hat{S}_{xt}$’s capture how the observed potential outcomes correlate with the covariates. Plug these into (5) to obtain an overall estimate of $\beta$. The randomization of $T$ then justifies the following theorem.

Theorem 2. Under decomposition (3), $S_{xx}^{-1} \hat{S}_{x1}$ and $S_{xx}^{-1} \hat{S}_{x0}$ are unbiased estimates of $\gamma_1$ and $\gamma_0$, respectively. Therefore,

$$\hat{\beta}_{RI} = S_{xx}^{-1} \hat{S}_{x1} - S_{xx}^{-1} \hat{S}_{x0},$$

is an unbiased estimator for $\beta$ with covariance matrix

$$\mathrm{cov}(\hat{\beta}_{RI}) = S_{xx}^{-1} \left[ \frac{S\{Y(1)X\}}{n_1} + \frac{S\{Y(0)X\}}{n_0} - \frac{S(\tau X)}{n} \right] S_{xx}^{-1}. \tag{6}$$

Here, for example, $S\{Y(0)X\}$ denotes the covariance operator on new unit-level variables $Y_i(0) X_i \in \mathbb{R}^K$, made by scaling the $X_i$ vector of each unit by $Y_i(0)$, similarly for $S\{Y(1)X\}$ and $S(\tau X)$. This slight abuse of notation gives formulas less cluttered by subscripts and excessive annotation. As with the vector version of Neyman’s formula, the square root of the diagonal of $\mathrm{cov}(\hat{\beta}_{RI})$ gives the standard errors of $\hat{\beta}_{RI}$.

The covariance formula (6) generalizes the result of Neyman (1923) for the average treatment effect, reducing to Neyman’s formula if $X_i = 1$ for all units. We can obtain a “conservative” estimate of $\mathrm{cov}(\hat{\beta}_{RI})$ by

$$\widehat{\mathrm{cov}}(\hat{\beta}_{RI}) = S_{xx}^{-1} \left[ \frac{S_1(Y^{obs} X)}{n_1} + \frac{S_0(Y^{obs} X)}{n_0} \right] S_{xx}^{-1},$$

recalling the definitions of the sample covariance operators $S_1$ and $S_0$ introduced in (2). Similar to Neyman (1923), this implicitly assumes $S(\tau X) = 0$. Under the assumption that $\varepsilon_i = 0$ for all units (i.e., no idiosyncratic variation whatsoever), we can instead use $S(\hat{\tau} X)$ with $\hat{\tau}_i = X_i^T \hat{\beta}_{RI}$ as a plug-in estimate for $S(\tau X)$. This yields tighter standard errors based on the diagonal elements of the covariance matrix.

Finite Population Asymptotic Analysis. Theorem 2 holds for any finite sample. To obtain confidence intervals and to conduct hypothesis testing as we describe below, we need to prove further
that $\hat{\beta}_{RI}$ is asymptotically normal with mean $\beta$ and covariance $\mathrm{cov}(\hat{\beta}_{RI})$. Finite population asymptotic analysis, however, has a slightly different flavor from the usual super population approach. Formally, the finite population asymptotic scheme embeds the finite population $\{(X_i, Y_i(1), Y_i(0), T_i)\}_{i=1}^n$ with size $n$ into a hypothetical sequence of finite populations with sizes approaching infinity. This effectively assumes that all the finite population quantities, for example, $S_{xx}$ and $\beta$, depend on $n$, although they are fixed numbers for a given finite population. Moreover, the sample quantities such as $\hat{S}_{x1}$ and $\hat{\beta}_{RI}$ depend on $n$ as well, and are random quantities due to the randomization of $T$. For notational simplicity, we drop the index $n$ for all these quantities. Importantly, we must impose some regularity conditions on the hypothetical sequence of finite populations. Throughout the article, we invoke the following conditions for asymptotic analysis, which are required for a form of the finite population central limit theorem discussed in Li and Ding (2017, Theorem 5).

Condition 1. (i) Stable treatment proportions: $p_1 = n_1/n$ and $p_0 = n_0/n$ have positive limiting values; (ii) Stable means, variances, and covariances: the finite population means, variances, and covariances of the covariates and potential outcomes have finite and nonzero limiting values; (iii) both $S_{xx}$ and its limit have full-rank $K$; (iv) there are no individual extreme values in the limit: $\max_{1 \le i \le n} \|V_i - \bar{V}\|_2^2 / n \to 0$, for $V_i = X_{ki}, Y_i(z), Y_i(z) X_{ki}, X_{ki} X_{k'i}, Y_i(z) X_{ki} X_{k'i}$ with $1 \le k, k' \le K$, and $z = 0, 1$.

Condition parts (i) and (ii) are natural. Part (iii) is a basic requirement for asymptotic analysis of quantities depending on $S_{xx}^{-1}$. The condition on the limit is of particular interest. Having a nonsingular limiting covariance matrix essentially means that there cannot be too many units with extreme leverage on any of the regression coefficients (see Huber 1973). For example, for a binary covariate $X_{ki}$, the numbers of units with $X_{ki} = 1$ and $X_{ki} = 0$ must both go to infinity. If this did not hold, the limit of $S_{xx}$ would not have full-rank $K$ as the $k$th row and column would all be driven to 0. Part (iv) controls the tails; it holds if $V$ has more than two moments (Li and Ding 2017). In particular, (iv) holds automatically for bounded covariates and outcomes. For a more technical discussion of finite population causal inference, see Ding (2014), Aronow, Green, and Lee (2014), and Middleton and Aronow (2015); for regularity conditions of the finite population central limit theorems, see Hájek (1960) and Lehmann (1998). A recent review is Li and Ding (2017).

Under these conditions, we can extend Theorem 2 to a sequence of finite populations and obtain a limiting distribution as follows:

$$\sqrt{n}\left(\hat{\beta}_{RI} - \beta\right) \xrightarrow{d} N\left(0, \lim_{n \to \infty} S_{xx}^{-1} \left[ p_1^{-1} S\{Y(1)X\} + p_0^{-1} S\{Y(0)X\} - S(\tau X) \right] S_{xx}^{-1} \right). \tag{7}$$

As a result, we can state that $\hat{\beta}_{RI}$ is approximately normal with mean $\beta$ and covariance matrix (6), which allows us to construct confidence intervals and hypothesis tests. In our theory below, we use this informal statement instead of (7) to avoid notational complexity.

3.2. Regression with Treatment-Covariate Interactions

The results from randomization inference can shed light on the familiar case of linear regression with treatment-covariate interactions. This classical approach assumes the model

$$Y_i^{obs} = X_i^T \gamma + T_i X_i^T \beta + u_i, \quad (i = 1, \ldots, n), \tag{8}$$

where $\{u_i\}_{i=1}^n$ are errors implicitly assumed to induce the randomness, and where $\beta$ models systematic treatment effect variation, as in (3). Departing from much of the previous literature (e.g., Cox 1984; Berrington de González and Cox 2007; Crump et al. 2008), we study the properties of the least squares estimator under complete randomization, without assuming that model (8) is correctly specified. In particular, we do not assume any iid sampling; the assignment mechanism drives the distribution of the OLS estimator.

Theorem 3. The OLS estimator for $\beta$ from fitting model (8) can be rewritten as

$$\hat{\beta}_{OLS} = \hat{S}_{xx,1}^{-1} \hat{S}_{x1} - \hat{S}_{xx,0}^{-1} \hat{S}_{x0},$$

where

$$\hat{S}_{xx,t} = \frac{1}{n_t} \sum_{i=1}^n I_{(T_i = t)} X_i X_i^T, \quad (t = 0, 1).$$

Over all possible randomizations of $T$, $\hat{S}_{xx,1}^{-1} \hat{S}_{x1}$ and $\hat{S}_{xx,0}^{-1} \hat{S}_{x0}$ are consistent estimates of $\gamma_1$ and $\gamma_0$, respectively; $\hat{\beta}_{OLS}$ therefore follows an asymptotic normal distribution with mean $\beta$ and covariance matrix

$$\mathrm{cov}(\hat{\beta}_{OLS}) = S_{xx}^{-1} \left[ \frac{S\{e(1)X\}}{n_1} + \frac{S\{e(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right] S_{xx}^{-1} \tag{9}$$

with $e_i(1)$, $e_i(0)$, and $\varepsilon_i$ as defined after (5).

This estimate is simply the difference between $\hat{\gamma}_{1,OLS} = \hat{S}_{xx,1}^{-1} \hat{S}_{x1}$ and $\hat{\gamma}_{0,OLS} = \hat{S}_{xx,0}^{-1} \hat{S}_{x0}$, two OLS regressions run separately on each treatment arm. The (asymptotic) covariance formula (9) is different from (6), with $\{Y(1), Y(0)\}$ replaced by $\{e(1), e(0)\}$. For treated units, define residual $\hat{e}_i = Y_i^{obs} - X_i^T \hat{\gamma}_{1,OLS}$, and for control units, define residual $\hat{e}_i = Y_i^{obs} - X_i^T \hat{\gamma}_{0,OLS}$. We can drop the unidentifiable term $S(\varepsilon X)$, estimate $S\{e(1)X\}$ and $S\{e(0)X\}$ by their sample analogs, and conservatively estimate the asymptotic covariance matrix (9) by

$$\widehat{\mathrm{cov}}(\hat{\beta}_{OLS}) = \hat{S}_{xx,1}^{-1} \frac{S_1(\hat{e} X)}{n_1} \hat{S}_{xx,1}^{-1} + \hat{S}_{xx,0}^{-1} \frac{S_0(\hat{e} X)}{n_0} \hat{S}_{xx,0}^{-1}.$$

This form of the sandwich variance estimator has the same probability limit as the Huber–White covariance estimator for linear model (8) (Huber 1967; White 1980; Angrist and Pischke 2008; Lin 2013).

Importantly, $\hat{\beta}_{RI}$ and $\hat{\beta}_{OLS}$ are quite similar in form. In particular, $\hat{\beta}_{RI}$ uses the true $S_{xx}$ while $\hat{\beta}_{OLS}$ separately estimates the covariance matrix for each treatment arm, $\hat{S}_{xx,0}$ and $\hat{S}_{xx,1}$. The latter is effectively a ratio estimator. Although this introduces some small bias (on the order of $1/n$), using the estimated $\hat{S}_{xx,t}$ rather than true $S_{xx}$ can often lead to gains in precision, especially when covariates are strongly correlated with the potential outcomes. In particular, the OLS estimator, by separately
estimating the (known) $S_{xx}$ matrix for each treatment arm, can account for random imbalances in the covariates in both arms. For related discussion, see Cochran (1977) on ratio estimators in surveys.

The RI estimator, by comparison, has no adjustment whatsoever, and so cannot account for such random covariate imbalances. However, in Section 3.4 and in the supplementary materials, we introduce a different form of adjustment that uses covariates to make the estimates of the $S_{xt}$ more precise. Depending on the structure of covariates, this estimator could be better or worse than OLS adjustment; we leave a thorough investigation of these trade-offs for future work.

Regardless, we again emphasize that we do not rely on classical OLS assumptions to justify the OLS estimator here. Rather, randomization (with some mild regularity conditions for the finite sample asymptotics) justifies our results.

3.3. Omnibus Test for Systematic Variation

Finally, we can use these results to develop an omnibus test for the presence of any systematic treatment effect variation. The null hypothesis of no treatment effect variation explained by the observed covariates can be characterized by

$$H_0(X): \beta_1 = 0,$$

where $\beta_1$ contains all the components of $\beta$ except the first component corresponding to the intercept. Under $H_0(X)$, the individual treatment effects have no linear dependence on $X$.

We then construct a Wald-type test for $H_0(X)$ using an estimator $\hat{\beta}$ and its covariance estimator $\widehat{\mathrm{cov}}(\hat{\beta})$; it could be $\hat{\beta}_{RI}$ or $\hat{\beta}_{OLS}$. Let $\hat{\beta}_1$ and $\widehat{\mathrm{cov}}(\hat{\beta}_1)$ denote the subvector of $\hat{\beta}$ and submatrix of $\widehat{\mathrm{cov}}(\hat{\beta})$, corresponding to the nonintercept coordinates of $X$. We reject when

$$\hat{\beta}_1^T \, \widehat{\mathrm{cov}}^{-1}(\hat{\beta}_1) \, \hat{\beta}_1 > q_{K-1}(1 - \alpha), \tag{10}$$

where $q_{K-1}(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\chi^2$ random variable with degrees of freedom $K - 1$.

The test in (10) is nearly identical to the test proposed by Crump et al. (2008). They relax the parametric assumption by taking a “sieve estimator” approach, namely, by using a quadratic form of the regression function, which allows for more flexible marginal distributions. Our approach differs in that we avoid modeling the marginal distributions entirely. If desired, we can add polynomials of $X$ (or other basis functions) into the model for $\delta$ to allow for more flexible systematic treatment effect variation, which could enhance power or model more complex relationships between the $X$ and treatment impact.

3.4. Additional Considerations

In the supplementary material, we describe two additional points about systematic treatment effect variation that we briefly address here. First, as mentioned above, we can use model-assisted estimation to improve the randomization-based estimator. In particular, improving estimation of $\hat{S}_{xt}$ directly improves $\hat{\beta}_{RI}$, as the $\hat{S}_{xt}$ are the only random components. Thus, if we replace the standard sample estimator, $\hat{S}_{xt}$, by a more efficient, model-assisted estimator, as in survey sampling (Cochran 1977; Särndal, Swensson, and Wretman 2003), we can achieve meaningful precision gains in practice. More importantly, this setup allows researchers to assess systematic variation across one set of covariates while adjusting for another set.

Second, under the assumption of no idiosyncratic variation (i.e., $\varepsilon_i = 0$ for all $i$), we can obtain exact inference for $\beta$ by inverting a sequence of randomization-based tests. This complements previous work on randomization-based tests for the presence of idiosyncratic treatment effect variation (Ding, Feller, and Miratrix 2016).

4. Idiosyncratic Treatment Effect Variation for ITT

After characterizing the systematic component of treatment effect variation, we now turn to characterizing the idiosyncratic component. Since this quantity is inherently unidentifiable, we propose sharp bounds on this component and a framework for sensitivity analysis. We then leverage these results to bound an $R^2$-like measure of the treatment effect variation explained by covariates.

4.1. Bounds

We first define the main quantities of interest:

$$S_{\tau\tau} = \frac{1}{n} \sum_{i=1}^n (\tau_i - \tau)^2, \quad S_{\delta\delta} = \frac{1}{n} \sum_{i=1}^n (\delta_i - \tau)^2, \quad S_{\varepsilon\varepsilon} = \frac{1}{n} \sum_{i=1}^n \varepsilon_i^2,$$

with $\delta_i$ and $\varepsilon_i$ defined as in (3). Then $S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}$. We can immediately estimate $S_{\delta\delta}$ via the sample variance of $\{\hat{\delta}_i = X_i^T \hat{\beta}\}_{i=1}^n$, where $\hat{\beta}$ is a consistent estimator, for example, $\hat{\beta}_{RI}$ or $\hat{\beta}_{OLS}$. However, the idiosyncratic variance, $S_{\varepsilon\varepsilon}$, is inherently unidentifiable because it depends on the joint distribution of potential outcomes.

We can, however, derive sharp bounds for $S_{\varepsilon\varepsilon}$. Let $F_1(y)$ and $F_0(y)$ be the empirical cumulative distribution functions of $\{e_i(1)\}_{i=1}^n$ and $\{e_i(0)\}_{i=1}^n$. Let $F_1^{-1}(u)$ and $F_0^{-1}(u)$ be the corresponding empirical quantile functions, with $F^{-1}(u) = \inf\{x : F(x) \ge u\}$. Below we denote $e(t)$ as a random variable taking equal probabilities on $n$ values of $\{e_i(t)\}_{i=1}^n$. Based on the Fréchet–Hoeffding bounds (Hoeffding 1941; Fréchet 1951; Nelsen 2007), we can bound $S_{\varepsilon\varepsilon}$ as follows.

Theorem 4. $S_{\varepsilon\varepsilon}$ has sharp bounds $\underline{S}_{\varepsilon\varepsilon} \le S_{\varepsilon\varepsilon} \le \overline{S}_{\varepsilon\varepsilon}$, where

$$\underline{S}_{\varepsilon\varepsilon} = \int_0^1 \{F_1^{-1}(u) - F_0^{-1}(u)\}^2 \, du,$$

$$\overline{S}_{\varepsilon\varepsilon} = \int_0^1 \{F_1^{-1}(u) - F_0^{-1}(1 - u)\}^2 \, du.$$

The lower and upper bounds are attainable when $e(1)$ and $e(0)$ have the same ranks and opposite ranks, respectively.

The lower bound of $S_{\varepsilon\varepsilon}$ corresponds to a rank-preserving relationship between $e(1)$ and $e(0)$, and the upper bound of $S_{\varepsilon\varepsilon}$ corresponds to an anti-rank-preserving relationship between $e(1)$ and $e(0)$. Equivalently, they correspond to the cases where
the Spearman rank correlation coefficients between e(1) and Therefore, the dependence of the potential outcomes is deter-
e(0) are +1 and −1. mined by the dependence of the uniform random variables U1
In practice, we can often sharpen these bounds because we and U0 , which are the standardized ranks of the potential out-
are unlikely to have negatively associated potential outcomes comes. When U1 = U0 , Sεε attains the lower bound Sεε ; when
after adjusting for covariates. If we assume a nonnegative cor- U1 = 1 − U0 , Sεε attains the upper bound Sεε ; when U1 U0 , Sεε
relation between e(1) and e(0), we have the following corollary. attains the improved upper bound V1 + V0 .
Rather than simply examine extreme scenarios of Sεε , we can
Corollary 1. If the correlation between e(1) and e(0) is non- instead represent U1 as a mixture of U0 and another independent
negative, then the bounds for Sεε become Sεε ≤ Sεε ≤ V1 + V0 , uniform random variable V0 :
where Vt is the variance of e(t ) for t = 0, 1.
iid
U1 ∼ ρU0 + (1 − ρ)V0 , U0 , V0 ∼ Uniform(0, 1), (11)
We can consistently estimate each quantity: Sδδ by the sample
which the sensitivity parameter ρ captures the association
variance of X Ti 
β, Fe1 (y) and Fe0 (y) by F1 (y) and F
0 (y), the empir-
between U1 and U0 . An immediate interpretation of ρ is the
ical cumulative distribution functions of the residuals  ei under
proportion of rank preserved units, with the other 1 − ρ as the
treatment and control, and V1 and V0 by the variances of  e(1)
proportion of units with independent treatment and control
and e(0).
residual outcomes. When ρ = 0, U1 U0 , and the residual
Variance of the Overall ITT Estimator. We can use these results to obtain sharper bounds on the variance of Neyman's (1923) estimate of overall ITT, \(\hat\tau = n_1^{-1}\sum_{i=1}^n T_i Y_i^{obs} - n_0^{-1}\sum_{i=1}^n (1 - T_i) Y_i^{obs}\), extending previous work by Heckman, Smith, and Clements (1997) and Aronow, Green, and Lee (2014). See also Fogarty (2016). Applying the results in Section 2 for scalar outcomes, we have the following variance for the difference-in-means estimator,

\[
\mathrm{var}(\hat\tau) = \frac{S_{11}}{n_1} + \frac{S_{00}}{n_0} - \left( \frac{S_{\delta\delta}}{n} + \frac{S_{\varepsilon\varepsilon}}{n} \right),
\]

where \(S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}\). As we discuss above, Neyman (1923) proposed a lower bound for the overall \(\mathrm{var}(\hat\tau)\) under the assumption of a constant treatment effect, \(S_{\tau\tau} = 0\). More recently, Aronow, Green, and Lee (2014) instead proposed to bound \(S_{\tau\tau}\) via Fréchet–Hoeffding bounds. We can modestly improve these results by applying Fréchet–Hoeffding bounds for \(S_{\varepsilon\varepsilon}\) alone rather than for \(S_{\tau\tau} = S_{\delta\delta} + S_{\varepsilon\varepsilon}\). So long as \(S_{\delta\delta} > 0\), this yields strictly tighter bounds on \(\mathrm{var}(\hat\tau)\) than the corresponding bounds that do not incorporate covariate information. In turn, this gives a tighter estimate of the standard error for the same difference-in-means estimator, \(\hat\tau\).

A Variance Ratio Test. Finally, while the relationship between \(e(0)\) and \(e(1)\) is inherently unidentifiable, there is some information in the data about the relationship between \(\varepsilon_i\), the individual-level idiosyncratic treatment effect, and \(Y_i(0)\), the control potential outcome. In particular, Raudenbush and Bloom (2015) noted that if the variance of the treatment potential outcomes is smaller than the variance of the control potential outcomes, then the treatment effect must be negatively associated with the control potential outcomes. In the supplementary material, we extend this result to incorporate covariates and propose a formal test.

4.2. Sensitivity Analysis

Going beyond worst-case bounds, we can assess the sensitivity of our estimate of \(S_{\varepsilon\varepsilon}\) to different assumptions of the dependence between potential outcomes. Using the probability integral transformation, we represent the residual potential outcomes as

\[
e(1) = F_1^{-1}(U_1), \quad e(0) = F_0^{-1}(U_0), \quad U_1, U_0 \sim \mathrm{Uniform}(0, 1).
\]

potential outcomes are independent; when \(\rho = 1\), \(U_1 = U_0\), and the residual potential outcomes have the same ranks. The values between (0, 1) correspond to positive rank correlation but not full-rank preservation. Note that the representation of the joint distribution is not unique, because we can choose any copula as a joint distribution of \((U_1, U_0)\) (Nelsen 2007). We choose the above representation and notation \(\rho\) for the following theorem.

Theorem 5. If Equation (11) holds, then \(\rho\) is Spearman's rank correlation coefficient between \(e(1)\) and \(e(0)\). Furthermore, \(S_{\varepsilon\varepsilon}\) is a linear function of \(\rho\):

\[
S_{\varepsilon\varepsilon}(\rho) = \rho\, \underline{S}_{\varepsilon\varepsilon} + (1 - \rho)(V_1 + V_0).
\]

We cannot extract any information about \(\rho\) from the data. We therefore treat \(\rho\) as a sensitivity parameter, choose a plausible range of \(\rho\), and obtain corresponding values for \(S_{\varepsilon\varepsilon}\).

4.3. Fraction of Treatment Effect Variation Explained

A natural question is the relative magnitudes of \(S_{\delta\delta}\) and \(S_{\varepsilon\varepsilon}\) (Djebbari and Smith 2008). Continuing the regression analogy, this is an \(R^2\)-like measure for the proportion of total treatment effect variation explained by the systematic component:

\[
R_\tau^2 = \frac{S_{\delta\delta}}{S_{\tau\tau}} = \frac{S_{\delta\delta}}{S_{\delta\delta} + S_{\varepsilon\varepsilon}},
\]

which is the ratio between the finite population variances of \(\delta\) and \(\tau\). As above, we can directly estimate \(S_{\delta\delta}\) but must bound \(S_{\varepsilon\varepsilon}\). Applying Theorem 4, we obtain the following bounds on \(R_\tau^2\).

Corollary 2. The sharp bounds on \(R_\tau^2\) are

\[
\frac{S_{\delta\delta}}{S_{\delta\delta} + \overline{S}_{\varepsilon\varepsilon}} \le R_\tau^2 \le \frac{S_{\delta\delta}}{S_{\delta\delta} + \underline{S}_{\varepsilon\varepsilon}}.
\]

If we further assume that the correlation between \(e(1)\) and \(e(0)\) is nonnegative, the sharp bounds on \(R_\tau^2\) are

\[
\frac{S_{\delta\delta}}{S_{\delta\delta} + V_1 + V_0} \le R_\tau^2 \le \frac{S_{\delta\delta}}{S_{\delta\delta} + \underline{S}_{\varepsilon\varepsilon}}.
\]

We estimate these bounds via plug-in estimates. Note that Djebbari and Smith (2008) explored a similar quantity by using a permutation approach to approximate the
Fréchet–Hoeffding upper and lower bounds. Finally, we can use the sensitivity results for \(S_{\varepsilon\varepsilon}\), with values of \(\rho \in [0, 1]\):

\[
R_\tau^2(\rho) = \frac{S_{\delta\delta}}{S_{\delta\delta} + S_{\varepsilon\varepsilon}(\rho)}.
\]

5. Noncompliance

5.1. Setup

We now extend our results to allow for noncompliance. Let \(T\) be the indicator of treatment assigned, \(D\) be the indicator of treatment received, \(Y\) be the outcome of interest, and \(X\) be pretreatment covariates. Under the Stable Unit Treatment Value Assumption, we define \(D_i(t)\) and \(Y_i(t)\) as the potential outcomes for unit \(i\) under treatment assignment \(t\). Following Angrist, Imbens, and Rubin (1996) and Frangakis and Rubin (2002), we can classify units into four compliance types based on the joint values of \(D_i(1)\) and \(D_i(0)\):

\[
U_i = \begin{cases}
\text{Always Taker } (a) & \text{if } D_i(1) = 1,\ D_i(0) = 1,\\
\text{Never Taker } (n) & \text{if } D_i(1) = 0,\ D_i(0) = 0,\\
\text{Complier } (c) & \text{if } D_i(1) = 1,\ D_i(0) = 0,\\
\text{Defier } (d) & \text{if } D_i(1) = 0,\ D_i(0) = 1.
\end{cases}
\]

Denote by \(n_u\) and \(\pi_u\) the number and proportion of units in compliance stratum \(U = u\), for \(u = a, n, c, d\).

Throughout our discussion, we invoke the following assumptions, which are commonly used for analyzing randomized experiments with noncompliance.

Assumption 1. (i) Monotonicity: \(D_i(1) \ge D_i(0)\); (ii) Exclusion restrictions for Always Takers and Never Takers: \(Y_i(1) = Y_i(0)\) for all units with \(D_i(1) = D_i(0)\); (iii) Strong instrument: \(\pi_c > C_0 > 0\), where \(C_0\) is a positive constant independent of the sample size.

Monotonicity rules out the existence of Defiers, that is, \(\pi_d = 0\). Under monotonicity, we can estimate the proportion \(\pi_u\) using the observed counts of units classified by \(T\) and \(D\): let \(n_{td} = \#\{i : T_i = t, D_i = d\}\), and then \(\hat\pi_n = n_{10}/n_1\), \(\hat\pi_a = n_{01}/n_0\), and \(\hat\pi_c = n_{11}/n_1 - n_{01}/n_0\). The exclusion restrictions assume that treatment assignment has no effect on the outcome for Always Takers and Never Takers. As a result, treatment effect variation is trivially zero for Always Takers and Never Takers. Note that this is the unit-level exclusion restriction imposed in Angrist, Imbens, and Rubin (1996). This can be relaxed in other settings; for example, we could assume the impact of randomization for these groups is zero on average (see Imbens and Rubin 2015). Finally, to avoid technical complexity, we rule out the weak instrument case (Bound, Jaeger, and Baker 1995; Staiger and Stock 1997), that is, the case in which \(\pi_c\) is within a small neighborhood of 0 with radius shrinking to 0.

We are interested in treatment effect variation among Compliers, which motivates the following decomposition:

\[
\tau_i = Y_i(1) - Y_i(0) = \begin{cases}
0, & \text{if } U_i = a \text{ or } n,\\
X_i^T \beta_c + \varepsilon_i, & \text{if } U_i = c,
\end{cases}
\tag{12}
\]

where \(\beta_c\) is the regression coefficient of \(\tau_i\) on \(X_i\) among Compliers, analogous to (3).

5.2. Systematic Treatment Effect Variation Among Compliers

5.2.1. Randomization Inference

We now extend the results of Section 3 to estimate systematic treatment effect variation among Compliers. Define

\[
S_{xx,u} = \frac{1}{n_u} \sum_{i=1}^n I_{(U_i = u)} X_i X_i^T, \quad
S_{xt,u} = \frac{1}{n_u} \sum_{i=1}^n I_{(U_i = u)} Y_i(t) X_i, \quad (t = 0, 1;\ u = a, c, n).
\]

Then, analogous to (5),

\[
\beta_c = S_{xx,c}^{-1} (S_{x1,c} - S_{x0,c}) = S_{xx,c}^{-1} S_{x1,c} - S_{xx,c}^{-1} S_{x0,c} \equiv \gamma_{1c} - \gamma_{0c},
\tag{13}
\]

where

\[
\gamma_{1c} = S_{xx,c}^{-1} S_{x1,c}, \quad \gamma_{0c} = S_{xx,c}^{-1} S_{x0,c}
\]

are the linear regression coefficients of \(Y(1)\) and \(Y(0)\) on covariates among Compliers.

Unlike in the ITT case, we cannot estimate these quantities directly. Instead, following standard results from noncompliance (e.g., Angrist, Imbens, and Rubin 1996; Abadie 2003; Angrist and Pischke 2008), we use estimates from observed subgroups to estimate the desired quantities of interest. Define sample moments:

\[
\hat S_{xx,td} = \frac{1}{n_t} \sum_{i=1}^n I_{(T_i = t)} I_{(D_i = d)} X_i X_i^T, \quad
\hat S_{xt,td} = \frac{1}{n_t} \sum_{i=1}^n I_{(T_i = t)} I_{(D_i = d)} Y_i^{obs} X_i \quad (t, d = 0, 1).
\tag{14}
\]

The following theorem connects these quantities with the finite population quantities in (13).

Theorem 6. Over all possible randomizations of a completely randomized experiment, both \(\hat S_{xx}(1) = \hat S_{xx,11} - \hat S_{xx,01}\) and \(\hat S_{xx}(0) = \hat S_{xx,00} - \hat S_{xx,10}\) are unbiased for \(\pi_c S_{xx,c}\), and

\[
E(\hat S_{x1,11} - \hat S_{x0,01}) = \pi_c S_{x1,c}, \quad
E(\hat S_{x0,00} - \hat S_{x1,10}) = \pi_c S_{x0,c}.
\tag{15}
\]

This theorem shows that we can obtain unbiased estimates for all terms in (13). The following corollary shows that we can then obtain consistent estimates for \(\gamma_{1c}\), \(\gamma_{0c}\), and \(\beta_c\), recalling that in the asymptotic analysis, we need to embed \(\{(X_i, Y_i(1), Y_i(0), D_i(1), D_i(0), T_i)\}_{i=1}^n\) into a hypothetical sequence of finite populations under Condition 1 and the following Condition 2.

Condition 2. Both \(S_{xx,c}\) and its limit have full-rank \(K\).

Condition 2 holds if and only if any linear combination of \(X\), \(l^T X\) with \(l \ne 0\), has positive finite population variance among Compliers. Condition 2 is effectively the finite population version of ruling out weak instruments in the two-stage least squares estimate with treatment-covariate interactions (e.g., Angrist and Pischke 2008).
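As a rough illustration of the differencing idea behind (14) and Theorem 6, the following Python sketch estimates the compliance-type proportions from the observed (T, D) counts and then forms the plug-in ratio of differenced sample moments. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`compliance_proportions`, `sample_moments`, `solve`, `beta_c_ri`) and the plain-list data layout are hypothetical, and a small Gauss-Jordan routine stands in for a linear algebra library.

```python
from collections import Counter


def compliance_proportions(T, D):
    """Estimate (pi_a, pi_n, pi_c) from assignment T and receipt D under monotonicity."""
    n = Counter(zip(T, D))                 # n[(t, d)] = #{i : T_i = t, D_i = d}
    n1 = n[(1, 0)] + n[(1, 1)]
    n0 = n[(0, 0)] + n[(0, 1)]
    pi_n = n[(1, 0)] / n1                  # assigned to treatment, did not take it
    pi_a = n[(0, 1)] / n0                  # assigned to control, took it anyway
    pi_c = n[(1, 1)] / n1 - n[(0, 1)] / n0
    return pi_a, pi_n, pi_c


def sample_moments(X, Y, T, D, t, d):
    """Return (S_xx, S_xy) for the cell T = t, D = d, each scaled by 1 / n_t as in (14)."""
    n_t = sum(1 for ti in T if ti == t)
    K = len(X[0])
    Sxx = [[0.0] * K for _ in range(K)]
    Sxy = [0.0] * K
    for x, y, ti, di in zip(X, Y, T, D):
        if ti == t and di == d:
            for j in range(K):
                Sxy[j] += y * x[j] / n_t
                for k in range(K):
                    Sxx[j][k] += x[j] * x[k] / n_t
    return Sxx, Sxy


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]


def beta_c_ri(X, Y, T, D):
    """Plug-in estimate of beta_c = gamma_1c - gamma_0c from differenced moments."""
    def ratio(cell_plus, cell_minus):
        Axx, Axy = sample_moments(X, Y, T, D, *cell_plus)
        Bxx, Bxy = sample_moments(X, Y, T, D, *cell_minus)
        Sxx = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(Axx, Bxx)]
        Sxy = [a - b for a, b in zip(Axy, Bxy)]
        return solve(Sxx, Sxy)

    gamma_1c = ratio((1, 1), (0, 1))       # (T=1, D=1) minus (T=0, D=1) removes Always Takers
    gamma_0c = ratio((0, 0), (1, 0))       # (T=0, D=0) minus (T=1, D=0) removes Never Takers
    return [g1 - g0 for g1, g0 in zip(gamma_1c, gamma_0c)]
```

With full compliance (D equal to T for every unit) the differenced moments reduce to within-arm moments, so the sketch collapses to separate least squares fits in each arm; that special case is a convenient sanity check.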

Corollary 3. \(\hat\gamma_{1c,RI} = \hat S_{xx}^{-1}(1)(\hat S_{x1,11} - \hat S_{x0,01})\) and \(\hat\gamma_{0c,RI} = \hat S_{xx}^{-1}(0)(\hat S_{x0,00} - \hat S_{x1,10})\) are consistent for \(\gamma_{1c}\) and \(\gamma_{0c}\). Furthermore, \(\hat\beta_{c,RI} = \hat\gamma_{1c,RI} - \hat\gamma_{0c,RI}\) is consistent for \(\beta_c\) and follows an asymptotic normal distribution with covariance matrix

\[
\mathrm{cov}(\hat\beta_{c,RI}) = (\pi_c S_{xx,c})^{-1} \left\{ \frac{S\{e(1)X\}}{n_1} + \frac{S\{e(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right\} (\pi_c S_{xx,c})^{-1},
\tag{16}
\]

where we define the residual potential outcomes to be:

\[
e_i(1) = \begin{cases}
Y_i(1) - X_i^T \gamma_{1c}, & U_i = a,\\
Y_i(1) - X_i^T \gamma_{0c}, & U_i = n,\\
Y_i(1) - X_i^T \gamma_{1c}, & U_i = c,
\end{cases}
\qquad
e_i(0) = \begin{cases}
Y_i(0) - X_i^T \gamma_{1c}, & U_i = a,\\
Y_i(0) - X_i^T \gamma_{0c}, & U_i = n,\\
Y_i(0) - X_i^T \gamma_{0c}, & U_i = c.
\end{cases}
\tag{17}
\]

The idiosyncratic variation is \(\varepsilon_i = e_i(1) - e_i(0)\) for unit \(i\), with \(\varepsilon_i = 0\) for Never Takers and Always Takers, and with \(\varepsilon_i\) for Compliers as in (12). The two sets of residuals are not formed from a regression on all units, but instead from the population regression on Compliers alone. As in the ITT case, we can estimate \(S\{e(1)X\}\) and \(S\{e(0)X\}\) using their sample analogs; \(S(\varepsilon X)\), however, is unidentifiable. For units with \(D_i = 1\), we define the residual \(\hat e_i = Y_i^{obs} - X_i^T \hat\gamma_{1c,RI}\), and for units with \(D_i = 0\), we define the residual \(\hat e_i = Y_i^{obs} - X_i^T \hat\gamma_{0c,RI}\). Therefore, we can obtain a conservative estimate for the asymptotic covariance (16) by the following sandwich form:

\[
\widehat{\mathrm{cov}}(\hat\beta_{c,RI}) = \hat S_{xx}^{-1}(1) \frac{\hat S_1(\hat e X)}{n_1} \hat S_{xx}^{-1}(1) + \hat S_{xx}^{-1}(0) \frac{\hat S_0(\hat e X)}{n_0} \hat S_{xx}^{-1}(0).
\]

As with the ITT analog, so long as we have Assumption 1, randomization itself fully justifies the theorem and estimators without relying on a model of the observed outcomes.

5.2.2. Two-Stage Least Squares

We now turn to the standard two-stage least squares (TSLS) setting in econometrics (e.g., Angrist and Pischke 2008). First, we impose a linear regression model with treatment-covariate interactions:

\[
Y_i^{obs} = X_i^T \gamma + D_i X_i^T \beta + u_i \quad (i = 1, \ldots, n).
\]

Here, the randomness of the observed outcome comes from the randomness of \(D_i\) and \(u_i\). In the language of econometrics, the treatment received is "endogenous," that is, \(D_i\) and the error term \(u_i\) are assumed to be correlated; we therefore use \(T_i\) as an instrument for \(D_i\). The TSLS estimates \((\hat\gamma_{TSLS}, \hat\beta_{TSLS})\) are the solutions to the following estimating equations:

\[
n^{-1} \sum_{i=1}^n \begin{pmatrix} X_i \\ T_i X_i \end{pmatrix} (Y_i^{obs} - X_i^T \hat\gamma_{TSLS} - D_i X_i^T \hat\beta_{TSLS}) = 0.
\tag{18}
\]

This approach is based on M-estimation, though there are many other ways to formalize the TSLS estimator (e.g., Imbens 2014). The following theorem shows that the fully interacted TSLS estimator \(\hat\beta_{TSLS}\) is consistent for \(\beta_c\) across randomizations.

Theorem 7. Over all randomizations, the TSLS estimator \(\hat\beta_{TSLS}\) follows an asymptotic normal distribution with mean \(\beta_c\) and covariance matrix

\[
(\pi_c S_{xx,c})^{-1} \left\{ \frac{S\{e(1)X\}}{n_1} + \frac{S\{e(0)X\}}{n_0} - \frac{S(\varepsilon X)}{n} \right\} (\pi_c S_{xx,c})^{-1},
\]

where we define the residual potential outcomes to be

\[
e_i(1) = \begin{cases}
Y_i(1) - X_i^T (\gamma_\infty + \beta_c), & U_i = a,\\
Y_i(1) - X_i^T \gamma_\infty, & U_i = n,\\
Y_i(1) - X_i^T (\gamma_\infty + \beta_c), & U_i = c,
\end{cases}
\qquad
e_i(0) = \begin{cases}
Y_i(0) - X_i^T (\gamma_\infty + \beta_c), & U_i = a,\\
Y_i(0) - X_i^T \gamma_\infty, & U_i = n,\\
Y_i(0) - X_i^T \gamma_\infty, & U_i = c,
\end{cases}
\]

where \(\gamma_\infty\) is the probability limit of the TSLS regression coefficient, \(\hat\gamma_{TSLS}\), and the idiosyncratic treatment effect is \(\varepsilon_i \equiv e_i(1) - e_i(0)\).

For variance estimation, define the residual as \(\hat e_i = Y_i^{obs} - X_i^T (\hat\gamma_{TSLS} + \hat\beta_{TSLS})\) for units with \(D_i = 1\) and \(\hat e_i = Y_i^{obs} - X_i^T \hat\gamma_{TSLS}\) for units with \(D_i = 0\). We can then use the following sandwich variance estimator:

\[
\widehat{\mathrm{cov}}(\hat\beta_{TSLS}) = \hat S_{xx}^{-1}(1) \frac{\hat S_1(\hat e X)}{n_1} \hat S_{xx}^{-1}(1) + \hat S_{xx}^{-1}(0) \frac{\hat S_0(\hat e X)}{n_0} \hat S_{xx}^{-1}(0),
\]

which has the same probability limit as the Huber–White covariance estimator for \(\hat\beta_{TSLS}\). Therefore, the randomization itself effectively justifies the use of TSLS for estimating systematic treatment effect variation among Compliers, extending our ITT results.

Finally, while \(\hat\beta_{TSLS}\) is a consistent estimator for \(\beta_c\), \(\hat\gamma_{TSLS}\) is not, in general, a consistent estimator for \(\gamma_{0c}\), that is, \(\gamma_\infty \ne \gamma_{0c}\). Instead, \(\hat\gamma_{TSLS}\) converges to \(\gamma_\infty = S_{xx}^{-1} S_{x0} - \pi_a S_{xx}^{-1} S_{xx,a} \beta_c\). In the special case of one-sided noncompliance (i.e., \(\pi_a = 0\)), \(\gamma_\infty = \gamma_0 = S_{xx}^{-1} S_{x0}\), the population OLS regression coefficient, among all Compliers and Never Takers, of \(Y(0)\) on covariates.

5.2.3. Omnibus Test for Systematic Treatment Effect Variation Among Compliers

With point estimate \(\hat\beta\) and covariance estimate \(\widehat{\mathrm{cov}}(\hat\beta)\) for \(\beta_c\), we can use the same Wald-type \(\chi^2\) test as in (10) for the presence of systematic treatment effect variation among Compliers. Here, the estimator can be either the randomization-based \(\hat\beta_{c,RI}\) or the TSLS estimator \(\hat\beta_{TSLS}\); the degrees of freedom are the same, \(K - 1\). Unlike in the ITT case, we are not aware of existing tests for systematic treatment effect variation among Compliers.

5.3. Idiosyncratic Treatment Effect Variation with Noncompliance

5.3.1. Bounding Idiosyncratic Variation

We now turn to decomposing the overall treatment effect in the presence of noncompliance. In this setting, we have three sources of treatment effect variation: (i) systematic treatment
effect variation among Compliers, (ii) idiosyncratic treatment effect variation among Compliers, and (iii) treatment effect variation due to noncompliance.

First, recall that total treatment effect variation is \(S_{\tau\tau} = \sum_{i=1}^n (\tau_i - \tau)^2 / n\). We can define a similar quantity among Compliers:

\[
S_{\tau\tau,c} = \frac{1}{n_c} \sum_{i=1}^n I_{(U_i = c)} (\tau_i - \tau_c)^2.
\]

As in Section 4, we can decompose this variation into systematic and idiosyncratic treatment effect variation for Compliers, respectively:

\[
S_{\delta\delta,c} = \frac{1}{n_c} \sum_{i=1}^n I_{(U_i = c)} (\delta_i - \tau_c)^2, \quad
S_{\varepsilon\varepsilon,c} = \frac{1}{n_c} \sum_{i=1}^n I_{(U_i = c)} \varepsilon_i^2.
\]

Because treatment effects for Never Takers and Always Takers are zero, there is no treatment effect variation for these units. The component of treatment effect variation due to compliance status is

\[
S_{\tau\tau,U} = \sum_{u = c, a, n} \pi_u (\tau_u - \tau)^2.
\]

Using \(\tau_a = \tau_n = 0\) and \(\tau = \pi_c \tau_c\) due to the exclusion restrictions, we have the following theorem summarizing the relationships among the above components.

Theorem 8. \(S_{\tau\tau} = \pi_c S_{\tau\tau,c} + S_{\tau\tau,U}\), \(S_{\tau\tau,c} = S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}\), and \(S_{\tau\tau,U} = \pi_c (1 - \pi_c) \tau_c^2\).

In words, total treatment effect variation has three parts: (i) systematic treatment effect variation among Compliers, \(\pi_c S_{\delta\delta,c}\); (ii) idiosyncratic treatment effect variation among Compliers, \(\pi_c S_{\varepsilon\varepsilon,c}\); (iii) treatment effect variation due to noncompliance, \(S_{\tau\tau,U}\).

As in the ITT case, even though \(S_{\varepsilon\varepsilon,c}\) is not identifiable, we can derive bounds in terms of the marginal distributions of the residuals, \(\{e_i(1) = Y_i(1) - X_i^T \gamma_{1c} : U_i = c, i = 1, \ldots, n\}\) and \(\{e_i(0) = Y_i(0) - X_i^T \gamma_{0c} : U_i = c, i = 1, \ldots, n\}\), denoted by \(F_{1c}(y)\) and \(F_{0c}(y)\), and with marginal variances, \(V_{1c}\) and \(V_{0c}\). Once we estimate these quantities, we can plug them into Theorem 4 and Corollary 1 to get our bounds. As compliance status is only partially observed, we have to estimate these quantities by differencing observed distributions; we defer this and some other technical details to the supplementary material.

5.3.2. Treatment Effect Decomposition

Since there are two sources of variation (covariates and noncompliance), there are three possible \(R^2\)-type measures. First, we can measure the treatment effect variation explained by noncompliance alone (i.e., only \(U\)):

\[
R_{\tau,U}^2 = \frac{S_{\tau\tau,U}}{S_{\tau\tau}} = \frac{S_{\tau\tau,U}}{S_{\tau\tau,U} + \pi_c S_{\tau\tau,c}} = \frac{S_{\tau\tau,U}}{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c} + \pi_c S_{\varepsilon\varepsilon,c}}.
\]

Second, we can measure the proportion of treatment effect variation among Compliers explained by covariates (i.e., only \(X\)):

\[
R_{\tau,c}^2 = \frac{S_{\delta\delta,c}}{S_{\tau\tau,c}} = \frac{S_{\delta\delta,c}}{S_{\delta\delta,c} + S_{\varepsilon\varepsilon,c}}.
\]

Third, we can measure the treatment effect variation explained by covariates and noncompliance (i.e., both \(X\) and \(U\)):

\[
R_{\tau,UX}^2 = \frac{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c}}{S_{\tau\tau}} = \frac{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c}}{S_{\tau\tau,U} + \pi_c S_{\delta\delta,c} + \pi_c S_{\varepsilon\varepsilon,c}}.
\]

For each measure, we can use tailored versions of Corollary 1 to construct bounds, or conduct sensitivity analysis as in Section 4.2, with the sensitivity parameter expressed as the Spearman correlation between the treatment and control potential outcomes among Compliers.

6. Simulation Study

6.1. ITT Estimators

We simulate completely randomized experiments to evaluate the finite sample performance of the tests for systematic treatment effect variation based on \(\hat\beta_{OLS}\), \(\hat\beta_{RI}\), and \(\hat\beta_{RI}^{w}\), the model-assisted version discussed in the supplementary material. Our data generation process is inspired by the Head Start Impact Study (HSIS) analyzed in the next section. For a given sample size, we first generate four independent covariates (\(X_1\), a standard normal; \(X_2\), a binary covariate with probability 0.5 of being 1; \(X_3\), a binary covariate with probability 0.25 of being 1; and \(X_4\), a standard normal). The control potential outcomes are then generated from

\[
Y_i(0) = 0.3 + 0.2 X_{1i} + 0.3 X_{2i} - 0.4 X_{3i} + 0.8 X_{4i} + u_i, \quad u_i \sim N(0, \sigma^2).
\]

We select \(\sigma^2 = 0.26\) to make the marginal variance for the control potential outcomes 1; thus we can interpret impacts in "effect size" units. The \(R^2\) of regressing \(Y(0)\) onto the covariates is approximately 0.74, due to the "pretest"-like variable \(X_{4i}\). Without \(X_{4i}\), the \(R^2\) is about 0.09.

The treatment effects are \(\tau_i = \delta_i + \varepsilon_i\), with (i) either \(\delta_i = 0.3\) for all \(i\), or \(\delta_i = 0.2 + 0.1 X_{1i} + 0.4 X_{3i}\); and (ii) either \(\varepsilon_i = 0\) for all \(i\), or \(\varepsilon_i \sim N(0, 0.2^2)\). All combinations of these two options give the four cases of (a) no treatment effect variation, (b) only systematic variation, (c) idiosyncratic variation with no systematic variation, and (d) both systematic and idiosyncratic variation. For an \(\alpha\)-level test of systematic variation, scenarios (a) and (c) should only reject at rate \(\alpha\), while we would like to see high rejection rates for scenarios (b) and (d). For scenario (d), the \(R_\tau^2\) is about 0.5; systematic variation explains a good share of the overall variation.

To generate a synthetic dataset, we generated all potential outcomes, randomized units into treatment with probability 0.6, and then calculated the corresponding observed outcomes. We then conducted a test for systematic variation using each of our three estimators. For \(\hat\beta_{RI}\) and \(\hat\beta_{OLS}\), we use \(X_1, X_2, X_3\). For our covariate-adjusted estimator \(\hat\beta_{RI}^{w}\), we also include the fairly predictive \(X_4\) for adjustment.

Figure 1. Power of the tests based on \(\hat\beta_{RI}\), \(\hat\beta_{OLS}\), and \(\hat\beta_{RI}^{w}\).

Figure 1 shows the power of these tests, with significance level \(\alpha = 0.05\), for different sample sizes. First, all estimators appear asymptotically valid, consistent with the theoretical results. The OLS and adjusted estimators are slightly anti-conservative for small \(n\), however, with rejection rates of around 9%. Second, the OLS estimator appears to have the greatest power in this setting, which is unsurprising since the true data-generating process is a linear model. Finally, covariate adjustment slightly improves the power of the RI estimator. Overall, in the scenarios we consider, we only achieve decent levels of power in large samples, although there seems to be reasonable power for the sample size in the data application, \(n = 3586\).

6.2. LATE Estimators

We next simulate completely randomized experiments with noncompliance to evaluate the finite sample performance of the tests for systematic treatment effect variation among Compliers based on \(\hat\beta_{c,RI}\) and \(\hat\beta_{TSLS}\). We first generated a complete dataset as in the ITT case above, and then assigned strata membership to all units with probabilities proportional to their covariates. For Always Takers, we then set \(Y_i(0) = Y_i(1)\), and for Never Takers, \(Y_i(1) = Y_i(0)\). The overall ITT is now reduced to 0.21 (due to the zero effects of Never Takers and Always Takers), although the CACE is still approximately 0.3. The proportion of Compliers is approximately 68%.

The Compliers have the systematic and idiosyncratic effects described above. We tested for the presence of systematic variation for Compliers under the exclusion restrictions. Figure 2 shows the power of these tests for our RI and TSLS estimators. First, in this scenario, the TSLS and the RI estimators are virtually equivalent; the additional adjustment provided by TSLS does not add significantly to the precision. We see the tests are valid (they even appear conservative) for cases (a) and (c). Power is reduced compared to the ITT simulation; this is reasonable as power is effectively a function of the number of Compliers, with additional uncertainty due to partial information about the identity of Compliers.

7. Application to the Head Start Impact Study

Established in 1965, Head Start is the largest Federal preschool program in the United States, serving nearly 1 million low-income 3- and 4-year-old children each year at a cost of over $7 billion (Administration for Children and Families 2015). Researchers and policymakers have debated Head Start's effectiveness since its inception, with early randomized trials finding limited impacts (e.g., Westinghouse Learning Corporation 1969) and quasi-experimental studies showing much larger effects (e.g., Currie and Thomas 1995). Designed in part to settle this debate, the Head Start Impact Study (HSIS) is a large-scale, nationally representative randomized trial of Head Start first launched in 2002 (Puma et al. 2010). The Congressional mandate for HSIS included two broad questions: (1) the program's overall impact, and (2) how impacts vary across children and centers. The policy debate has largely focused on this first question; HSIS only found modest average effects on a range of children's cognitive and social-emotional outcomes. However, both the original study and several recent articles argue that these topline results mask important treatment effect variation (e.g., Bloom and Weiland 2014; Bitler, Hoynes, and Domina 2014; Walters 2015; Ding, Feller, and Miratrix 2016; Feller et al. 2016). Understanding such variation is critical both for assessing the program's benefits and costs and for improving the practice and science of early childhood education.

HSIS collected a rich set of covariates about children and their families, including pretest score, child's age, child's race, child's home language, mother's education level, and mother's marital status. At the same time, many potentially important covariates are unavailable. For instance, while families must be low-income to be eligible for Head Start, HSIS does not include information on families' actual income or other financial details that could be important predictors of program impact.

Figure . Power of the tests based on 


βc,RI and 
βTSLS .

In addition, Feller et al. (2016) and others argue that the setting in which a child would otherwise receive care is an important source of impact variation, although this is not directly observable.

We now use the methods outlined above to assess treatment effect variation in HSIS. The original study included \(n = 4400\) total children, with \(n_1 = 2644\) in the treatment group and \(n_0 = 1796\) in the control group. Following earlier analyses (Ding, Feller, and Miratrix 2016) and to simplify exposition, we restrict our attention to a complete-case subset of the HSIS, with \(n_1 = 2238\) in the treatment group and \(n_0 = 1348\) in the control group (so \(p_1 \approx 0.62\) and \(p_0 \approx 0.38\)). Our outcome of interest is the Peabody Picture Vocabulary Test (PPVT), a widely used measure of cognitive ability in early childhood. To assess treatment effect variation, we consider the full set of child- and family-level covariates used in the original HSIS analysis of Puma et al. (2010), including those mentioned above. After creating dummy variables for factors (e.g., recoding race), the covariate matrix has 17 columns. See Figure 3(b) for a complete list.

7.1. Decomposing Variation in the ITT Effect

We first explore treatment effect variation for the ITT estimate, beginning with estimating systematic treatment effect variation. We examine three estimators: the randomization-based and OLS estimators discussed in Section 3, \(\hat\beta_{RI}\) and \(\hat\beta_{OLS}\), and the corresponding model-assisted version of the RI estimator discussed in the supplementary material, \(\hat\beta_{RI}^{w}\). For this latter estimator, we use all available covariates to adjust the standard estimators, that is, \(W\) is the entire vector of covariates.

Omnibus Test for Systematic Treatment Effect Variation. We begin by using these estimators for an omnibus test of whether any treatment effect variation is explained by the full set of covariates. The p-values for the unadjusted \(\hat\beta_{RI}\) estimator and model-assisted \(\hat\beta_{RI}^{w}\) are 0.39 and 0.25, respectively, which do not show any evidence of treatment effect variation. The OLS estimator, however, shows much stronger evidence with \(p = 0.005\).

Importantly, all three estimators are based on the same underlying assumptions: the randomization itself justifies all three p-values. And while we expect the unadjusted \(\hat\beta_{RI}\) to have

Figure . Treatment effect R2τ , with sensitivity parameter, ρ ∈ [0, 1].



the lowest power, it is instructive that the p-value for \(\hat\beta_{OLS}\) is substantially smaller than the p-value for the covariate-adjusted \(\hat\beta_{RI}^{w}\). As we discuss in Section 3.2, \(\hat\beta_{OLS}\) can account for covariate imbalance across experimental arms by estimating the \(S_{xx}\) matrix separately for the treatment and control groups. By contrast, \(\hat\beta_{RI}\) does not address imbalance in \(X\) and instead attempts to residualize out the \(Y\) to get a more precise estimate of the relationship of the \(X\) to \(Y\) for each treatment arm. Based on the discrepancy in p-values, adjusting for baseline imbalance is clearly important in this example.

Treatment Effect \(R_\tau^2\). Next, we examine how much of the variation could be explained by our covariates. Figure 3(a) shows values of the treatment effect \(R_\tau^2\) using \(\hat\beta_{RI}^{w}\) to estimate the systematic variation. Results are nearly identical using the other estimators. In the worst case of perfect negative dependence between potential outcomes (not shown), the treatment effect \(R_\tau^2\) could be as low as 0.01. Assuming that this dependence is nonnegative, the treatment effect \(R_\tau^2\) ranges from 0.03 to 0.76. While the estimate is clearly sensitive to the unidentifiable sensitivity parameter, the covariates explain a substantial proportion of treatment effect variation for values of \(\rho\) near 1.

We can also use this framework to assess the relative importance of each covariate in terms of explaining overall treatment effect variation. To do this, we use the model-assisted RI estimator, \(\hat\beta_{RI}^{w}\), adjusting for all covariates (i.e., \(\dim(W) = 17\)) but restricting systematic treatment effect variation to one covariate at a time. Note that we consider factors (e.g., race) as a group. Figure 3(b) shows the resulting estimates for the upper bound of \(R_\tau^2\), with lower bound estimates all below 0.01. Having a mother who is a recent immigrant and dual language learner status (which are highly correlated in practice) could each explain a substantial proportion of treatment effect variation, consistent with previous results from Bloom and Weiland (2014) and Bitler, Hoynes, and Domina (2014). This is not true for other covariates, like mother's education level.

Negative Correlation Between Treatment Effect and Control Potential Outcomes. Finally, we test whether the individual-level idiosyncratic treatment effects, \(\{\varepsilon_i\}_{i=1}^n\), are negatively correlated with the control potential outcomes, \(\{Y_i(0)\}_{i=1}^n\), extending results from Raudenbush and Bloom (2015). As outlined in the supplementary material, we do so by testing whether the variance of \(\{Y_i^{obs} - X_i^T \hat\beta_{RI}^{w} : T_i = 1\}\) is smaller than the variance of \(\{Y_i^{obs} : T_i = 0\}\). This yields a p-value of 0.02, which suggests that the unexplained treatment effect is indeed larger for smaller values of the control potential outcomes. This result is consistent with findings from Bitler, Hoynes, and Domina (2014), who use a quantile treatment effect approach.

7.2. Incorporating Noncompliance

As with many social experiments, there is substantial noncompliance with random assignment in HSIS. In the analysis sample we consider here, the estimated proportion of compliance types is \(\hat\pi_c = 0.69\) for Compliers, \(\hat\pi_a = 0.13\) for Always Takers, and \(\hat\pi_n = 0.18\) for Never Takers. Given the exclusion restrictions for Always Takers and Never Takers, the treatment effect is therefore zero (by assumption) for over 30% of the sample, suggesting that noncompliance will be an important component of treatment effect variation.

In the setting with noncompliance, we focus on two estimators for systematic treatment effect variation among Compliers: the randomization-based estimator, \(\hat\beta_{c,RI}\), and the two-stage least squares estimator, \(\hat\beta_{TSLS}\). We first use these estimators to construct omnibus tests for systematic treatment effect variation among Compliers. Tests using both estimators show strong evidence for such variation, with p-value 0.02 using \(\hat\beta_{c,RI}\) and p-value 0.01 using \(\hat\beta_{TSLS}\).

Finally, we turn to decomposing the overall treatment effect. As in the ITT case, we assume that the potential outcomes have a nonnegative correlation. Figure 3(a) shows the treatment effect \(R^2\) among Compliers, which ranges from \(R_{\tau,c}^2 = 0.05\) to \(R_{\tau,c}^2 = 0.68\). Next, we can calculate treatment effect variation due to noncompliance, \(R_{\tau,U}^2\). In the case of HSIS, this is relatively small (between 0.01 and 0.16), in part because the overall treatment effect is fairly small. Therefore, the overall treatment effect decomposition due to both covariates and noncompliance, \(R_{\tau,UX}^2\), is quite close to \(R_{\tau,c}^2\), as shown in Figure 3(a). Taken together, these estimates suggest that there is indeed important treatment effect variation that is captured neither by pretreatment covariates nor by noncompliance, consistent with previous results in Ding, Feller, and Miratrix (2016).

8. Conclusion

In this article, we propose a broad, flexible framework for assessing and decomposing treatment effect variation in randomized experiments with and without noncompliance. In general, we believe this is a natural setup for researchers to formulate and investigate a broad range of questions about impact heterogeneity (e.g., Heckman, Smith, and Clements 1997). Applications include assessing underlying causal mechanisms and targeting treatments based on individual-level characteristics. Understanding such variation is also important for the design of experiments. Djebbari and Smith (2008), for example, argued that characterizing the size of the idiosyncratic treatment effect is useful for determining the value of additional data collection.

We briefly note several directions for future work. First, our primary purpose was to propose a framework for analysis rooted in and justified by the randomization itself. As a result, we focused on the core properties of several relatively simple versions of linear regression and TSLS. We did not, however, fully explore their practical and finite-sample properties. For example, in future work, we hope to determine the settings in which model assistance will most improve estimation and assess the increased power of the OLS approach versus the unbiased RI approach. We are also investigating how to connect model-assisted and OLS approaches to take advantage of both methods of precision gain. Similarly, there is still much potential improvement in determining ways of characterizing the degree of heterogeneity, such as with an effect size for the systematic variation.

Second, a natural extension is to use more complex methods to estimate systematic treatment effects, such as via hierarchical models (Feller and Gelman 2015) or via machine-learning methods (Wager and Athey 2017), extending the results for the omnibus test and treatment effect \(R_\tau^2\) accordingly. While the guarantees from randomization are clearly weaker in such
316 P. DING, A. FELLER, AND L. MIRATRIX

settings, researchers can assess these tradeoffs themselves. For example, hierarchical modeling would be especially useful in the Head Start Impact Study due to the multi-site design (Bloom and Weiland 2015).

Third, a question of increasing practical importance is the generalizability of experimental results to a given target population (Stuart et al. 2011). We believe that the treatment effect $R^2_\tau$ is a critical measure for assessing the credibility of these generalizations. In short, if there is substantial idiosyncratic treatment effect variation, that is, $R^2_\tau$ is small, then researchers should be wary of using observed covariates to extrapolate treatment effects.

Finally, an open question is how to extend this treatment effect variation framework to nonrandomized settings. While the results would necessarily rest on much stronger assumptions, many settings already use an as-if-randomized framework, such as in observational studies (Rosenbaum 2002; Imbens and Rubin 2015). Under this approach, extensions should be natural.

Acknowledgments

The authors thank Alberto Abadie, Donald Rubin, participants at the Applied Statistics Seminar at the Harvard Institute of Quantitative Social Science, and colleagues at University of California, Berkeley and Harvard University for helpful comments. The authors also thank their reviewers, who helped them sharpen their mathematical presentations, in particular the asymptotic arguments.

Funding

The authors gratefully acknowledge financial support from the Spencer Foundation through a grant entitled “Using Emerging Methods with Existing Data from Multi-site Trials to Learn About and From Variation in Educational Program Effects,” and from the Institute of Education Sciences (IES Grant #R305D150040). Peng Ding also gratefully acknowledges financial support from the National Science Foundation (DMS grant #1713152).

References

Abadie, A. (2003), “Semiparametric Instrumental Variable Estimation of Treatment Response Models,” Journal of Econometrics, 113, 231–263. [304,310]
Administration for Children and Families (2015), “Head Start Program Facts, Fiscal Year 2014,” available at https://eclkc.ohs.acf.hhs.gov/hslc/data/factsheets/docs/hs-program-fact-sheet-2014.pdf [313]
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996), “Identification of Causal Effects Using Instrumental Variables,” Journal of the American Statistical Association, 91, 444–455. [304,310]
Angrist, J. D., Pathak, P. A., and Walters, C. R. (2013), “Explaining Charter School Effectiveness,” American Economic Journal: Applied Economics, 5, 1–27. [305]
Angrist, J. D., and Pischke, J. (2008), Mostly Harmless Econometrics: An Empiricist’s Companion, Princeton: Princeton University Press. [307,310,311]
Aronow, P. M., Green, D. P., and Lee, D. K. (2014), “Sharp Bounds on the Variance in Randomized Experiments,” The Annals of Statistics, 42, 850–871. [307,309]
Athey, S., and Imbens, G. (2016), “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences, 113, 7353–7360. [304]
Berrington de González, A., and Cox, D. R. (2007), “Interpretation of Interaction: A Review,” The Annals of Applied Statistics, 1, 371–385. [307]
Bitler, M., Hoynes, H., and Domina, T. (2014), “Experimental Evidence on Distributional Effects of Head Start,” NBER Working Paper 20434. [313,315]
Blinder, A. S. (1973), “Wage Discrimination: Reduced Form and Structural Estimates,” Journal of Human Resources, 8, 436–455. [305]
Bloom, H. S., and Weiland, C. (2015), “Quantifying Variation in Head Start Effects on Young Children’s Cognitive and Socio-Emotional Skills Using Data from the National Head Start Impact Study,” SSRN Working Paper 2594430. [313,315]
Bound, J., Jaeger, D. A., and Baker, R. M. (1995), “Problems with Instrumental Variables Estimation when the Correlation Between the Instruments and the Endogenous Explanatory Variable is Weak,” Journal of the American Statistical Association, 90, 443–450. [310]
Cochran, W. G. (1977), Sampling Techniques (3rd ed.), New York: Wiley. [5]
Cox, D. R. (1984), “Interaction” (with discussion), International Statistical Review, 52, 1–24. [304,307]
Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2008), “Nonparametric Tests for Treatment Effect Heterogeneity,” Review of Economics and Statistics, 90, 389–405. [304,307,308]
Currie, J., and Thomas, D. (1995), “Does Head Start Make a Difference?” American Economic Review, 85, 341–364. [313]
Ding, P. (2017), “A Paradox from Randomization-Based Causal Inference” (with discussion), Statistical Science, 32, 331–335. [307]
Ding, P., Feller, A., and Miratrix, L. W. (2016), “Randomization Inference for Treatment Effect Variation,” Journal of the Royal Statistical Society, Series B, 78, 655–671. [308,313,314,315]
Djebbari, H., and Smith, J. (2008), “Heterogeneous Impacts in PROGRESA,” Journal of Econometrics, 145, 64–80. [304,306,309,315]
Feller, A., and Gelman, A. (2015), “Hierarchical Models for Causal Effects,” in Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, eds. R. Scott and S. Kosslyn, New York: Wiley. [315]
Feller, A., Grindal, T., Miratrix, L., and Page, L. C. (2016), “Compared to What? Variation in the Impacts of Early Childhood Education by Alternative Care Type,” The Annals of Applied Statistics, 10, 1245–1285. [313]
Fisher, R. A. (1935), The Design of Experiments (1st ed.), Edinburgh: Oliver & Boyd. [304]
Fogarty, C. B. (forthcoming), “Regression Assisted Inference for the Average Treatment Effect in Paired Experiments,” Biometrika. [309]
Frangakis, C. E., and Rubin, D. B. (2002), “Principal Stratification in Causal Inference,” Biometrics, 58, 21–29. [310]
Fréchet, M. (1951), “Sur les Tableaux de Corrélation dont les Marges sont Données,” Annales de l’Université de Lyon, Section A, Series 3, 14, 53–77. [308]
Green, D. P., and Kern, H. L. (2012), “Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees,” The Public Opinion Quarterly, 76, 491–511. [304]
Hájek, J. (1960), “Limiting Distributions in Simple Random Sampling from a Finite Population,” Publications of the Mathematics Institute of the Hungarian Academy of Science, 5, 361–374. [307]
Heckman, J. J., Smith, J., and Clements, N. (1997), “Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts,” The Review of Economic Studies, 64, 487–535. [304,306,309,315]
Hill, J. L. (2011), “Bayesian Nonparametric Modeling for Causal Inference,” Journal of Computational and Graphical Statistics, 20, 217–240. [304]
Hoeffding, W. (1941), “Masstabinvariante Korrelationsmasse für diskontinuierliche Verteilungen,” Archiv für mathematische Wirtschafts- und Sozialforschung, 7, 49–70. [308]
Huang, Y., Gilbert, P. B., and Janes, H. (2012), “Assessing Treatment-Selection Markers Using a Potential Outcomes Framework,” Biometrics, 68, 687–696. [304]
Huber, P. J. (1967), “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions,” in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), pp. 221–233. Berkeley: University of California Press. [307]
Huber, P. J. (1973), “Robust Regression: Asymptotics, Conjectures and Monte Carlo,” The Annals of Statistics, 1, 799–821. [307]
Imai, K., and Ratkovic, M. (2013), “Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation,” The Annals of Applied Statistics, 7, 443–470. [304]
Imbens, G. (2014), “Instrumental Variables: An Econometrician’s Perspective” (with discussion), Statistical Science, 29, 323–358. [304,311]
Imbens, G. W., and Rubin, D. B. (2015), Causal Inference in Statistics, and in the Social and Biomedical Sciences, New York: Cambridge University Press. [304,305,310,316]
Kempthorne, O. (1952), The Design and Analysis of Experiments, New York: Wiley. [304]
Lehmann, E. L. (1998), Elements of Large-Sample Theory, New York: Springer. [307]
Li, X., and Ding, P. (2017), “General Forms of Finite Population Central Limit Theorems with Applications to Causal Inference,” Journal of the American Statistical Association, 112, 1759–1769. [305,307]
Lin, W. (2013), “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique,” The Annals of Applied Statistics, 7, 295–318. [304,306,307]
Matsouaka, R. A., Li, J., and Cai, T. (2014), “Evaluating Marker-Guided Treatment Selection Strategies,” Biometrics, 70, 489–499. [304]
Middleton, J. A., and Aronow, P. M. (2015), “Unbiased Estimation of the Average Treatment Effect in Cluster-Randomized Experiments,” Statistics, Politics and Policy, 6, 39–75. [306,307]
Nelsen, R. B. (2007), An Introduction to Copulas (2nd ed.), New York: Springer. [308,309]
Neyman, J. (1923), “On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9,” Statistical Science, 5, 465–472. [304,305,306,309]
Oaxaca, R. (1973), “Male-Female Wage Differentials in Urban Labor Markets,” International Economic Review, 14, 693–709. [305]
Puma, M., Bell, S., Cook, R., Heid, C., Shapiro, G., Broene, P., Jenkins, F., Fletcher, P., Quinn, L., Friedman, J., Ciarico, J., Rohacek, M., Adams, M., and Spier, E. (2010), “Head Start Impact Study: Final Report,” Technical Report, Department of Health and Human Services, Administration for Children and Families, Washington DC. [305,313,314]
Raudenbush, S. W., and Bloom, H. S. (2015), “Learning About and from a Distribution of Program Impacts Using Multisite Trials,” American Journal of Evaluation, 36, 475–499. [304,309,315]
Rosenbaum, P. R. (1999), “Reduced Sensitivity to Hidden Bias at Upper Quantiles in Observational Studies with Dilated Treatment Effects,” Biometrics, 55, 560–564. [304]
Rosenbaum, P. R. (2002), Observational Studies (2nd ed.), New York: Springer. [304,305,316]
Rosenbaum, P. R. (2007), “Confidence Intervals for Uncommon but Dramatic Responses to Treatment,” Biometrics, 63, 1164–1171. [304]
Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688–701. [305]
Rubin, D. B. (1980), “Comment on ‘Randomization Analysis of Experimental Data: The Fisher Randomization Test’ by D. Basu,” Journal of the American Statistical Association, 75, 591–593. [305]
Särndal, C.-E., Swensson, B., and Wretman, J. (2003), Model-Assisted Survey Sampling, New York: Springer. [308]
Staiger, D. O., and Stock, J. H. (1997), “Instrumental Variables Regression with Weak Instruments,” Econometrica, 65, 557–586. [310]
Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J. (2011), “The Use of Propensity Scores to Assess the Generalizability of Results from Randomized Trials,” Journal of the Royal Statistical Society, Series A, 174, 369–386. [316]
Wager, S., and Athey, S. (2017), “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests,” Journal of the American Statistical Association, doi:10.1080/01621459.2017.1319839. [304,315]
Walters, C. R. (2015), “Inputs in the Production of Early Childhood Human Capital: Evidence from Head Start,” American Economic Journal: Applied Economics, 7, 76–102. [313]
Westinghouse Learning Corporation (1969), The Impact of Head Start: An Evaluation of the Effects of Head Start on Children’s Cognitive and Affective Development, Volume 1: Report to the Office of Economic Opportunity, Athens, OH: Westinghouse Learning Corporation and Ohio University. [313]
White, H. (1980), “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica, 48, 817–838. [307]
Zhao, A., Ding, P., Mukerjee, R., and Dasgupta, T. (in press), “Randomization-Based Causal Inference from Split-Plot Designs,” Annals of Statistics. [306]
