
American Economic Review: Papers & Proceedings 2017, 107(5): 546–550
https://doi.org/10.1257/aer.p20171000

Labor Markets and Crime‡

Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs†

By Jonathan M.V. Davis and Sara B. Heller*

Exploring treatment heterogeneity can provide valuable information about how to improve program targeting and what mechanisms drive results. But ad hoc searches for particularly responsive subgroups may mistake noise for a true treatment effect. Committing to a preregistered analysis plan can protect against claims of data mining or p-hacking but may also prevent researchers from discovering unanticipated results and developing new hypotheses. Modern corrections for multiple hypothesis testing are a useful alternative, though in practice they can be quite conservative when used across more than a few dimensions of heterogeneity.

Recent developments in the machine learning literature may help identify treatment heterogeneity in a principled way. Athey and Imbens (2016) and Wager and Athey (2015) extend regression tree and random forest algorithms to the problem of estimating average treatment effects for different subgroups, building on a large literature about estimating personalized treatment effects in medicine. If these methods adequately predict variation in treatment effects based on observable characteristics, they could provide more flexibility and better prediction of treatment heterogeneity by searching over high-dimensional functions of covariates rather than a small number of subgroups (typically defined by one or two interaction terms).

We apply Wager and Athey's (2015) causal forest algorithm to data from two randomized controlled trials (RCTs) of the same summer jobs program. Policymakers regularly make decisions about whom to serve in youth employment programs, so understanding how different subpopulations respond is substantively important in this context. Our setting is also a technically useful application because eligibility criteria were deliberately changed across the two RCTs to test for treatment heterogeneity; the studies have relatively large sample sizes for social experiments (1,634 and 5,216 observations, respectively); and we observe a large set of covariates.

Since this is an early application of the method, we focus our discussion on a step-by-step explanation of the process targeted at applied researchers. We explore how useful the predicted heterogeneity is in practice by testing whether youth with larger predicted treatment effects actually respond more in a hold-out sample. We conclude that although the method is likely to work best with datasets that are larger than our post-hold-out sample, it can identify treatment heterogeneity for some outcomes that typical interaction effects with adjustments for multiple testing would have missed.

‡ Discussants: Bruno Crépon, CREST; Michael Mueller-Smith, University of Michigan; Stephen Raphael, University of California-Berkeley; Abigail Wozniak, University of Notre Dame.

* Davis: University of Chicago, 1129 E 59th Street, Chicago, IL 60637 (e-mail: jonmvdavis@gmail.com); Heller: University of Pennsylvania, 3718 Locust Walk, McNeil Building 483, Philadelphia, PA 19104 (e-mail: hellersa@upenn.edu). Research generously supported by B139634411 from the US Department of Labor, 2012-MIJ-FX-0002 from the Office of Juvenile Justice and Delinquency Prevention, and 2014-IJ-CX-0011 from the National Institute of Justice. We are deeply indebted to Susan Athey for providing an early beta version of the Causal Forest package. We thank Chicago Public Schools, the Department of Family and Support Services, the Illinois Department of Employment Security, and the Illinois State Police via the Illinois Criminal Justice Information Authority for providing data. The views expressed here are solely ours and do not represent the views of any agency or data provider.

† Go to https://doi.org/10.1257/aer.p20171000 to visit the article page for additional materials and author disclosure statement(s).
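To see why such corrections bite quickly, consider the simplest case, a Bonferroni adjustment: with k candidate subgroup tests at level alpha, each individual test must clear a threshold of alpha/k. The test counts in this short sketch are hypothetical, chosen only to illustrate the point, not taken from our analysis.

```python
# Hypothetical illustration of how fast a Bonferroni-style correction
# tightens: the per-test threshold is alpha / k for k subgroup tests.
# With, say, 19 covariates and all pairwise interactions, k = 171 and
# each test needs p < 0.0003 to survive.
alpha = 0.05
for k in (1, 5, 19, 19 * 18 // 2):
    print(f"{k:3d} tests -> per-test threshold {alpha / k:.5f}")
```

Corrections such as Westfall-Young or Benjamini-Hochberg are less blunt, but the basic tension, more dimensions searched means less power per dimension, is the same.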

I. Background on Causal Forests

A standard regression tree algorithm predicts an individual's outcome, Y_i, using the mean Y of observations that share similar covariates, X. To define what counts as similar, the algorithm forms disjoint groups of observations, called "leaves," within which everyone shares values of certain Xs. A tree starts with the entire dataset in a single group. For every unique value of each covariate, X_j = x, the algorithm forms a candidate split of this group into two leaves by placing all observations with X_j ≤ x in a left leaf and all observations with X_j > x in a right leaf. It implements the one split that minimizes an in-sample goodness-of-fit criterion such as mean squared error (MSE) (Σ_{i=1}^{n} (ŷ_i − y_i)², where ŷ_i is the mean Y within an observation's leaf). The algorithm then repeats the process for each of the two new leaves, and so on until it reaches a stopping rule. Using the last set of "terminal" leaves, the tree provides out-of-sample predictions by figuring out in which terminal leaf l an observation belongs based on its Xs and assigning ŷ_i = ȳ_l.

With a single tree, over-fitting is typically avoided by using a penalty parameter for the number of leaves selected via cross validation. But using a single tree is not always desirable; it is a high-variance approach with no guarantee of optimality.¹ An alternative process called bootstrap aggregating selects hundreds or thousands of random subsamples of the data and grows a tree with no penalty in each subsample. Predictions of Y are the average of the ŷ_i's across all trees for individual i.²

In our case, however, we want to predict conditional average treatment effects (CATEs, or E[Y_1 − Y_0 | X = x] in a potential outcomes framework) to assess how causal effects vary by subgroup. Standard fit measures like MSE are not feasible; unlike Y, Y_1 − Y_0 is not observed for any individual. Athey and Imbens (2016) introduce causal trees to solve this problem. They show that minimizing the expected MSE of predicted treatment effects, rather than the infeasible MSE itself, is equivalent to maximizing the variance of treatment effects across leaves minus a penalty for within-leaf variance. Within a tree grown using this modified criterion, CATEs are estimated as τ̂_l = ȳ_Tl − ȳ_Cl, the treatment-control difference of mean outcomes within terminal leaf l. Here, τ̂_l is the predicted treatment effect for out-of-sample observations with the Xs belonging to leaf l. To ensure correct inference, Athey and Imbens (2016) recommend an "honest" approach: divide the data in two, then use one subsample to determine the splits in the tree and the other subsample to estimate τ̂_l. Wager and Athey (2015) extend this idea to many trees and develop theory for inference in a causal forest (CF), which averages predictions from a large number of causal trees generated using subsamples of the full dataset.

II. Our Application

Our application uses two large-scale RCTs of Chicago's One Summer Plus (OSP) program conducted in 2012 and 2013. OSP provides disadvantaged youth ages 14–22 with 25 hours a week of employment, an adult mentor, and some other programming. Participants are paid Chicago's minimum wage ($8.25 at the time). The 2012 study (N = 1,634) is described in Heller (2014), which shows a 43 percent reduction in violent-crime arrests in the 16 months after random assignment. In 2013, we block randomized 5,216 applicants to OSP, 2,634 of whom were assigned to treatment.³

We match all applicants to administrative arrest records from the Illinois State Police (available for everyone), administrative schooling records from Chicago Public Schools (available for those who had ever enrolled in CPS), and unemployment insurance records from the Illinois Department of Employment Security (available for anyone who had a social security number in the CPS records, which was required for matching). There are no significant differences between the treatment and control groups' match rates. For this exercise, we focus on two outcomes of interest: the number of violent-crime arrests within two years of random assignment (N = 6,850) and an indicator for ever being employed during the six quarters after the program, defined only for those with non-missing employment data (N = 4,894).

¹ Trees are typically built with "greedy" algorithms, which choose the split that minimizes the MSE at a particular step, even if a different split would result in better predictive accuracy overall.
² This reduces bias by narrowing the neighborhood represented by each leaf and reduces variance by averaging across many predictions.
³ Eligibility criteria changed to test program effects on a more criminally-involved population including only male youth. We individually randomized applicants within applicant pool, age, and geographic blocks.
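Before turning to the implementation, the split criterion and the honest-estimation idea from Section I can be made concrete with a minimal sketch. This is our own illustrative code, not the authors'; it considers a single split on one covariate rather than the full recursive Athey-Imbens objective, and the within-leaf variance penalty is omitted.

```python
import numpy as np

def tau_hat(Y, T):
    """Leaf-level CATE estimate: treatment-control difference in mean outcomes."""
    return Y[T == 1].mean() - Y[T == 0].mean()

def best_split(X, T, Y, j, min_per_arm=10):
    """Pick the threshold on covariate j whose two leaves have the most
    divergent estimated treatment effects. A simplified stand-in for the
    causal tree criterion, which also penalizes within-leaf variance and
    is applied recursively."""
    best = None
    for x in np.unique(X[:, j]):
        left = X[:, j] <= x
        ok = all((T[m] == 1).sum() >= min_per_arm and (T[m] == 0).sum() >= min_per_arm
                 for m in (left, ~left))
        if not ok:
            continue  # inadmissible: too few treated or control obs in a leaf
        gap = abs(tau_hat(Y[left], T[left]) - tau_hat(Y[~left], T[~left]))
        if best is None or gap > best[0]:
            best = (gap, x)
    return best  # (effect gap, threshold), or None if nothing admissible

def honest_cates(X, T, Y, j, threshold):
    """Honesty: take the split as given (chosen on a separate training
    half) and estimate the leaf CATEs on this estimation half only."""
    left = X[:, j] <= threshold
    return tau_hat(Y[left], T[left]), tau_hat(Y[~left], T[~left])
```

In the full procedure, `best_split` would see only the training half of each subsample and `honest_cates` only the estimation half, so the observations that choose the splits never also price the effects.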
III. Implementation Road Map

Using pooled data across both OSP RCTs, we implement a version of Wager and Athey's (2015) algorithm with a modification of their causalForest R package. The steps are as follows:

(i) Draw a subsample b without replacement containing n_b = 0.2N observations from the N observations in the dataset.

(ii) Randomly split the n_b observations in half to form a training sample (tr) and an estimation sample (e) such that n_tr = n_e = n_b/2. Using just the training sample, start with a single leaf containing all n_tr observations.

(iii) For each value of each covariate, X_j = x, form candidate splits of the observations into two groups based on whether X_j ≤ x. Consider only splits where there are at least ten treatment and ten control observations in both new leaves. Choose the single split that maximizes an objective function O capturing how much the treatment effect estimates vary across the two resulting subgroups, with a penalty for within-leaf variance (see the online Appendix for details and the definition of O). If this split increases O relative to no split, implement it and repeat this step in both new leaves. If no split increases O, this is a terminal leaf.

(iv) Once no more splits can be made in step (iii), the tree is defined for subsample b. Move to the estimation sample, and group the n_e observations into the same tree based on their Xs.

(v) Using just the estimation sample, calculate τ̂_l = ȳ_Tl − ȳ_Cl within each terminal leaf. This step makes the tree honest, since treatment effect estimates are made using different observations than the ones that determined the splits.

(vi) Return to the full sample of N observations. Assign τ̂_{l,b} = τ̂_l to each observation whose Xs would place it in leaf l, and save this prediction.

(vii) Repeat steps (i) to (vi) B = 25,000 times.

(viii) Define observation i's predicted CATE as τ̂_i^CF(x) = (1/B) Σ_{b=1}^{B} τ̂_{l,b}, the average prediction for that individual across trees.

The procedure requires the researcher to select three parameters: the number of trees, the minimum number of treatment and control observations in each leaf, and the subsample size. In the absence of formal criteria to guide our choices, we used a large number of trees (more trees reduce the Monte Carlo error introduced by subsampling; we found moving from 10,000 to 25,000 improved the stability of estimates across samples). Increasing the minimum number of observations in each leaf trades off bias and variance; bigger leaves make results more consistent across different samples but predict less heterogeneity. Smaller subsamples reduce dependence across trees but increase the variance of each estimate (larger subsamples made little difference in our application).

For the CF to produce consistent estimates of program impacts, treatment assignment must be orthogonal to potential outcomes within each leaf (the "unconfoundedness" assumption). For this to be true in our case, we must condition on randomization block, since treatment probabilities vary across blocks. We adjust for differences in treatment probabilities using inverse probability weights throughout the procedure, including calculations of treatment effects and variances (weight_i = T_i/p_block(i) + (1 − T_i)/(1 − p_block(i)), where T_i is an indicator for being randomly assigned to the treatment group and p_block(i) is the probability of being treated in observation i's block).

In defining covariates, we have to deal with missing data (e.g., prior-year wages are only available for those with valid SSNs who worked). To minimize missingness, we define mutually exclusive categories of covariates which are observed for all observations (e.g., a set of indicators for working in the year prior to the quarter of randomization, not working in that year, or having missing employment data). In total, we use 19 covariates as inputs in the CF.⁴
IV. A Test of the Predictions

In the spirit of a standard subgroup analysis, but with subgroups determined by the high-dimensional combination of covariates captured by τ̂_i^CF(x) rather than a few interactions, we ask: If we divide the sample into a group predicted to respond positively to OSP and one that is not, would we successfully identify youth with larger treatment effects? To test this, we first randomly split our 6,850 observations in half to create in- and out-of-sample groups, S_in and S_out. We run the entire CF procedure using only S_in, then use the trees grown in S_in to generate predictions for all observations in both samples.⁵ This allows us to assess the performance of the predictions in a hold-out sample (albeit with reduced statistical power) and to check whether heterogeneity is more distinct in S_in than S_out, which could be a sign of overfitting.

Within each sample, we group youth by whether they are predicted to have a positive or negative treatment effect (τ̂_i^CF(x) > 0 is desirable for employment and adverse for arrests). We estimate separate treatment effects for these two subgroups by regressing each outcome on the indicator 1[τ̂_i^CF(x) > 0], T_i × 1[τ̂_i^CF(x) > 0], T_i × (1 − 1[τ̂_i^CF(x) > 0]), the baseline covariates used in the CF, and block fixed effects.⁶ We then test the null hypothesis that the treatment effects are equal across the two subgroups. Rejecting the null would suggest that the CF predictions successfully sort the observations into two groups that respond differentially.⁷ We do not adjust our inference for the fact that our regressors are defined by estimates themselves; calculating uniformly valid standard errors for CF predictions is still an open question. However, generating an indicator variable based on the predictions should reduce estimation error relative to using the predictions directly, since the error is less likely to matter for observations far from the zero threshold.

Table 1 shows the results. In panel A, which uses S_in only, the CF appears to identify distinct treatment heterogeneity for both outcomes. The subset of youth with predicted positive impacts shows a significant positive impact on average, the remaining youth have significant negative treatment effects, and the difference across these two groups is statistically significant. Panel B shows analogous estimates for S_out, where the difference in impacts between the predicted positive and negative responders is largely attenuated; we can reject the null that the subgroup difference is equal across S_in and S_out (p < 0.01 for both outcomes).

The difference between in- and out-of-sample results could either be because of an unlucky sample split or because there is some overfitting in S_in. To distinguish the two explanations, we make a small modification to the CF algorithm. In step (viii), instead of averaging across all trees to predict an individual's CATE, we only average across trees in which that observation was not part of either the tree-building or estimation samples.⁸

⁴ This is smaller than many "big data" settings but fairly standard for large social experiments. Other covariates are demographic characteristics (age in years and indicator variables for being male, Black, or Hispanic), neighborhood data (census tract unemployment rate, median income, and the proportions of the tract with at least a high school diploma and who rent their home), education categories (indicator variables for having graduated from CPS prior to the program, being enrolled in CPS in the preprogram school year, not being enrolled in the preprogram year despite having a prior CPS record, and not being in the CPS data at all), criminal history (number of arrests at baseline for violent, property, drug, and other crimes), and the employment indicators described above. Gender is missing for 351 observations, which we impute using block means from the rest of the sample.
⁵ We stratify the sample split by block, treatment status, and having a valid SSN, resulting in S_in having 3,428 observations. So each CF iteration uses 684 observations, and n_tr = n_e = 342.
⁶ This estimates separate intent-to-treat effects for predicted positive and negative responders, capturing differences in both take-up rates and treatment responses across groups.
⁷ There are many other ways to test whether there is useful information in the predictions. Variants of our test might make different cut-off decisions (e.g., serve the highest quartile of predicted responders or those whose CATEs are significantly different from 0 using the standard errors in Wager and Athey 2015), or interact τ̂_i^CF(x) with treatment directly. Alternative tests could also address different questions entirely (e.g., how much of the variation in effects the predictions explain).
⁸ That is, for an individual in S_in, we only average predictions from around 80 percent of iterations that did not include that observation in the 20 percent subsample; S_out predictions are unchanged. This adjustment is a version of split-sample approaches; it completely separates estimation and prediction. If results still differ across S_in and S_out, the difference must be driven by something other than overfitting (e.g., the samples differ by chance). We note, though, that it is an ad hoc solution which may require adjustments to the CF's theoretical justification and inference.
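The subgroup comparison described in this section can be sketched as a single regression. This bare-bones illustration uses hypothetical data and ordinary least squares via NumPy; the paper's version also includes the baseline covariates, block fixed effects, and standard errors clustered on individual.

```python
import numpy as np

def subgroup_itt(Y, T, pred_cate):
    """Regress the outcome on a constant, 1[tau_hat > 0],
    T x 1[tau_hat > 0], and T x (1 - 1[tau_hat > 0]). The two interaction
    coefficients are the ITT effects for the predicted-positive and
    predicted-negative responder groups."""
    g = (pred_cate > 0).astype(float)
    Xmat = np.column_stack([np.ones_like(g), g, T * g, T * (1.0 - g)])
    beta, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
    return beta[2], beta[3]
```

Testing the paper's null hypothesis then amounts to testing whether the two returned coefficients are equal, for example with a Wald test on their difference.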
Table 1, panel C shows results using this adjusted approach, which are much more similar to the S_out results: we can no longer reject the null that the subgroup differences are the same in and out of sample (p = 0.45 and 0.99 for violence and employment, respectively). It seems, then, that including the observations used in tree-growing and estimation in the predictions generates some overfitting in our setting.⁹

Table 1—Treatment Effects by Predicted Response

                                No. of violent    Any formal
Subgroup                        crime arrests     employment
Panel A. In sample
τ̂_i^CF(x) > 0                        0.22            0.19
                                    (0.05)          (0.03)
τ̂_i^CF(x) < 0                       −0.05           −0.14
                                    (0.02)          (0.03)
H_0: subgroups equal, p =            0.00            0.00
Panel B. Out of sample
τ̂_i^CF(x) > 0                       −0.01            0.08
                                    (0.05)          (0.03)
τ̂_i^CF(x) < 0                       −0.02           −0.01
                                    (0.02)          (0.03)
H_0: subgroups equal, p =            0.77            0.02
Panel C. Adjusted in sample
τ̂_i^CF(x) > 0                       −0.06            0.05
                                    (0.04)          (0.03)
τ̂_i^CF(x) < 0                       −0.02           −0.04
                                    (0.02)          (0.03)
H_0: subgroups equal, p =            0.41            0.02

Notes: Dependent variables are a count of violent-crime arrests over 2 post-random-assignment years and an indicator for any formal employment over 6 post-summer quarters. Table shows intent-to-treat effects for youth whom the CF predicts will have a positive or negative treatment effect based on their covariates (standard errors clustered on individual). p-values are from a test that subgroup treatment effects are equal. Panels A and B show estimates for the sample used to estimate the CF (N = 3,428) and a hold-out sample which was not used to estimate the CF (N = 3,422), respectively. Panel C shows estimates for the same sample as panel A, but with an adjustment to avoid overfitting (see text for details).

With the adjustment, the subgroups split by τ̂_i^CF(x) > 0 no longer show significantly different treatment effects for violent-crime arrests. For employment, on the other hand, the predicted positive and negative responders have significantly different treatment effects in both the adjusted S_in and S_out samples. One caveat is that the success of the CF in identifying heterogeneity out-of-sample varies somewhat depending on how we split the sample (not shown). This sensitivity to the sample suggests that although our split sample is large relative to many other social experiments (over 3,400 observations in both S_in and S_out), the causal forest may be most useful in settings with more observations.

In our setting, the CF successfully identifies two subgroups with distinct employment effects. Standard interaction effects, on the other hand, fail to uncover heterogeneity that survives modern adjustments for multiple testing. The causal forest does not detect heterogeneity in violence impacts, which could happen for a few reasons. First, the treatment effects may not vary with observed covariates, either because unobservables drive treatment heterogeneity or because treatment effects are homogeneous. Second, the greedy algorithm may fail to identify the true functional form of the treatment effect, or our subgroup test may not isolate the true form of the heterogeneity. Finally, treatment heterogeneity could be obscured by sampling error; the CF may need bigger datasets.

⁹ One way to reduce overfitting is to increase the minimum number of treatment and control observations in each leaf in S_in. A leaf size of 25 treatment and control observations reduces but does not eliminate the differential performance across samples, though it also does a worse job identifying heterogeneity out-of-sample than our adjustment.

References

Athey, Susan, and Guido Imbens. 2016. "Recursive Partitioning for Heterogeneous Causal Effects." Proceedings of the National Academy of Sciences of the United States of America 113 (27): 7353–60.

Heller, Sara B. 2014. "Summer Jobs Reduce Violence among Disadvantaged Youth." Science 346 (6214): 1219–23.

Wager, Stefan, and Susan Athey. 2015. "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests." https://arxiv.org/abs/1510.04342v2 (accessed February 23, 2016).
