
Doubly Robust Estimation of Direct and Indirect

Quantile Treatment Effects with Machine Learning


Yu-Chin Hsu∗, Academia Sinica
Martin Huber†, University of Fribourg
Yu-Min Yen‡, National Chengchi University

July 4, 2023
arXiv:2307.01049v1 [econ.EM] 3 Jul 2023

Abstract

We suggest double/debiased machine learning estimators of direct and indirect quantile


treatment effects under a selection-on-observables assumption. This permits disentangling
the causal effect of a binary treatment at a specific outcome rank into an indirect component
that operates through an intermediate variable called mediator and an (unmediated) direct
impact. The proposed method is based on the efficient score functions of the cumulative
distribution functions of potential outcomes, which are robust to certain misspecifications of
the nuisance parameters, i.e., the outcome, treatment, and mediator models. We estimate
these nuisance parameters by machine learning and use cross-fitting to reduce overfitting
bias in the estimation of direct and indirect quantile treatment effects. We establish uniform
consistency and asymptotic normality of our effect estimators. We also propose a multiplier
bootstrap for statistical inference and show the validity of the multiplier bootstrap. Finally,
we investigate the finite sample performance of our method in a simulation study and apply
it to empirical data from the National Job Corps Study to assess the direct and indirect
earnings effects of training.
JEL classification: C01, C21
Keywords: Causal inference, efficient score, mediation analysis, quantile treatment effect,
semiparametric efficiency


∗ Institute of Economics, Academia Sinica, 128, Section 2, Academia Road, Nankang, Taipei 115, Taiwan.
E-mail: ychsu@econ.sinica.edu.tw.

† University of Fribourg, Department of Economics, Bd. de Pérolles 90, 1700 Fribourg, Switzerland. E-mail:
martin.huber@unifr.ch.

‡ Department of International Business, National Chengchi University, 64, Section 2, Zhi-nan Road, Wenshan,
Taipei 116, Taiwan. E-mail: yyu min@nccu.edu.tw.

1 Introduction
Causal mediation analysis aims at understanding the mechanisms through which a treatment
affects an outcome of interest. It disentangles the treatment effect into an indirect effect, which
operates through a mediator, and a direct effect, which captures any causal effect not operating
through the mediator. Such a decomposition of the total treatment effect permits learning the
drivers of the effect, which may be helpful for improving the design of a policy or intervention.
Causal mediation analysis typically focuses on the estimation of average indirect and direct
effects, which may mask interesting effect heterogeneity across individuals. For this reason,
several contributions focusing on total (rather than direct and indirect) effects consider quantile
treatment effects (QTE) instead of average treatment effects (ATE). The QTE corresponds to
the difference between the potential outcomes with and without treatment at a specific rank
of the potential outcome distributions, but has so far received little attention in the causal
mediation literature.
The main contribution of this paper is to propose doubly robust/debiased machine learning
(DML) estimators of the direct and indirect QTE under a selection-on-observables (or sequen-
tial ignorability) assumption, implying that the treatment and the mediator are as good as
random when controlling for observed covariates. The method computes the quantile of a potential
outcome by inverting a DML estimate of its cumulative distribution function (c.d.f.).
This approach makes use of the efficient score function of the c.d.f., into which models for
the outcome, treatment, and mediator enter as plug-in or nuisance parameters. Relying on
the efficient score function makes treatment effect estimation robust, i.e., first-order insensitive
to (local) misspecifications of the nuisance parameters, a property known as Neyman (1959)-orthogonality.
This permits estimating the nuisance parameters by machine learning (which
generally introduces regularization bias) while still obtaining root-n-consistent treatment effect
estimators, given that certain regularity conditions hold. In addition, cross-fitting is applied to
mitigate overfitting bias. Cross-fitting consists of estimating the nuisance parameter models
and treatment effects in different subsets of the data and swapping the roles of the data to
exploit the entire sample for treatment effect estimation, see Chernozhukov et al. (2018). We
then establish uniform consistency and asymptotic normality of the effect estimators.
For conducting statistical inference, we propose a multiplier bootstrap procedure and show
the validity of the multiplier bootstrap. We also provide a simulation study to investigate the
finite sample performance of our method. Finally, we apply our method to empirical data
from the Job Corps Study to analyse the direct and indirect QTE of participation in a training
program on earnings when considering general health as a mediator. The results point to positive
direct effects of training across a large range of the earnings quantiles, while the indirect effects
are generally close to zero and mostly statistically insignificant.
To more formally discuss the direct and indirect effects of interest, let Y denote the outcome
of interest, D the binary treatment, M the mediator, and X a vector of pre-treatment covariates.
Following Pearl (2000), we may represent causal relationships between (Y, D, M, X) by means
of a directed acyclic graph (DAG), as provided in Figure 1. The causal arrows in the DAG
imply that (D, M, X) may affect Y , (D, X) may affect M , and X may affect D. We can
therefore define the outcome as a function of the treatment and the mediator, Y = Y (D, M ),
and the mediator as a function of the treatment, M = M (D), while being agnostic about X.

Figure 1: DAG illustrating causal links between outcome Y , treatment D, mediator M and
covariates X.

Furthermore, we make use of the potential outcome notation advocated by Neyman (1923) and
Rubin (1974) to denote by Y (d, m) the potential outcome if D were set to a specific value
d ∈ {0, 1} and M were set to some value m in the support of the mediator, while M (d) denotes
the potential mediator for D = d. Accordingly, Y (d, M (d)) is the potential outcome if D were
set to d, implying that the mediator is not forced to take a specific value m, but corresponds
to its potential value under D = d. Depending on the actual treatment and mediator values of
an observation, each of Y (d, M (d)), Y (d, m) and M (d) is either observed or counterfactual. Furthermore,
the potential outcome Y (d, M (1 − d)) is inherently counterfactual, as no observation can be
observed in the opposite treatment states d and 1 − d at the same time.
Armed with this notation, we define the causal parameters of interest. The natural direct
effect (NDE), which is for instance considered in Robins and Greenland (1992), Pearl (2001)
and Tchetgen Tchetgen and Shpitser (2012), is based on a comparison of the potential outcomes
when varying the treatment, but keeping the potential mediator fixed at treatment value D = d:
Y (1, M (d)) − Y (0, M (d)). The natural indirect effect (NIE) is based on a comparison of the
potential outcomes when fixing the treatment to D = d, but varying the potential mediator
according to the values it takes under treatment and non-treatment: Y (d, M (1))−Y (d, M (0)).
It is worth noting that the NDE and the NIE defined upon opposite treatment states d and
1 − d sum up to the total effect (TE): Y (1, M (1)) − Y (0, M (0)).
Previous methodological research on causal mediation predominantly focused on the esti-
mation of averages of the aforementioned NDE, NIE and TE or of averages of related path-wise
causal effects (Imai et al., 2010; Tchetgen Tchetgen and Shpitser, 2012; Hsu et al., 2019; Farb-
macher et al., 2022; Zhou, 2022). We complement this literature by suggesting a method for
estimating natural direct and indirect QTEs, which permits assessing the effects across the entire
distribution of potential outcomes. The estimation of the total (rather than the direct or
indirect) QTE has already been studied in multiple contributions (Abadie et al., 2002; Cher-
nozhukov and Hansen, 2005; Firpo, 2007; Donald and Hsu, 2014; Belloni et al., 2017; Ai et al.,
2022; Hsu et al., 2022). Among the few studies considering QTEs in causal mediation is Bind
et al. (2017), suggesting a two-stage quantile regression estimation to estimate the controlled
direct QTE, i.e., Y (1, m) − Y (0, m) at a specific rank, as well as a particular indirect QTE.
The latter is based on first estimating the mediator at a specific rank and then including it
in a quantile regression of the outcome, which generally differs from the natural indirect QTE
considered in this paper. Furthermore, our approach is nonparametric and relies on results
on semiparametric efficiency, very much in contrast to the parametric approach of Bind et al.
(2017). Huber et al. (2022) adapted the Changes-in-Changes (CiC) approach of Athey and
Imbens (2006) to estimate direct and indirect QTEs in subgroups defined in terms of how the
mediator reacts to (or complies with) the treatment. The NDE and NIE investigated here dif-
fer from such subgroup-specific causal parameters and furthermore, our identification strategy
relies on a selection-on-observables (or sequential ignorability) assumption rather than CiC.
The remainder of this study is organized as follows. Section 2 introduces the natural direct
and indirect QTE, the identifying assumptions, the effect estimators based on double/debiased
machine learning, and the multiplier bootstrap procedure for inference. Section 3 gives the
theoretical results on the asymptotic behavior of our methods. Section 4 presents a simulation
study that investigates the finite sample properties of our method. Section 5 provides an
empirical application to data from the National Job Corps Study to assess the direct earnings
effects of training across the earnings distribution, as well as the indirect effects operating via
general health. Section 6 concludes.

2 Methodology
2.1 Causal effects and Identifying Assumptions
To define the direct and indirect QTEs of interest, let QZ (τ ) := inf {q ∈ R : P (Z ≤ q) ≥ τ }
denote the τ th-quantile of a random variable Z, where τ ∈ (0, 1). Furthermore, let QZ|V (τ ) :=
inf {q ∈ R : P (Z ≤ q|V ) ≥ τ } denote the τ th-quantile of Z conditional on another random vari-
able (or a random vector) V , where τ ∈ (0, 1). Let FZ (z) and fZ (z) denote cumulative distri-
bution function (c.d.f.) and probability density or probability mass function (p.d.f. or p.m.f.)
of Z at z, and FZ|V (z|v) and fZ|V (z|v) denote the c.d.f. and p.d.f. (or p.m.f.) of Z at z
conditional on V = v. We define the natural direct quantile treatment effect (NDQTE) at the
τ th-quantile as:
\[
\mathrm{NDQTE}(\tau) := Q_{Y(1,M(0))}(\tau) - Q_{Y(0,M(0))}(\tau), \tag{1}
\]

and the natural indirect quantile treatment effect (NIQTE) at the τ th-quantile as:

\[
\mathrm{NIQTE}(\tau) := Q_{Y(1,M(1))}(\tau) - Q_{Y(1,M(0))}(\tau). \tag{2}
\]

The NDQTE in equation (1) corresponds to the direct effect of the treatment when fixing the
mediator at its potential value under non-treatment, M (0). Alternatively, we may consider the
NDQTE when conditioning on the potential mediator under treatment, M (1):

\[
\mathrm{NDQTE}'(\tau) := Q_{Y(1,M(1))}(\tau) - Q_{Y(0,M(1))}(\tau). \tag{3}
\]

Likewise, the NIQTE in equation (2) is the indirect effect when varying the potential mediators
but keeping the treatment fixed at D = 1, but we may also consider the indirect effect
conditional on D = 0:

\[
\mathrm{NIQTE}'(\tau) := Q_{Y(0,M(1))}(\tau) - Q_{Y(0,M(0))}(\tau). \tag{4}
\]

If the effects in expressions (1) and (3) (or (2) and (4)) are different, then this implies effect
heterogeneity due to interaction effects between the treatment and the mediator. The sum of
NDQTE (NDQTE’) and NIQTE (NIQTE’) yields the total quantile treatment effect (TQTE)
at the τ th-quantile, which includes all causal mechanisms through which the treatment affects
the outcome:

\[
\begin{aligned}
\mathrm{TQTE}(\tau) &= \mathrm{NDQTE}(\tau) + \mathrm{NIQTE}(\tau) = \mathrm{NDQTE}'(\tau) + \mathrm{NIQTE}'(\tau) \\
&= Q_{Y(1,M(1))}(\tau) - Q_{Y(0,M(0))}(\tau).
\end{aligned} \tag{5}
\]
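
To make the quantile definition and the decomposition in equation (5) concrete, the following minimal sketch computes empirical quantiles from small samples and checks that the direct and indirect effects add up to the total effect. The three samples of potential outcomes are hypothetical numbers chosen for illustration; in real data, Y (1, M (0)) is never observed and must be identified as described below.

```python
import math

def quantile(sample, tau):
    """Empirical version of Q_Z(tau) = inf{q : P(Z <= q) >= tau}."""
    z = sorted(sample)
    n = len(z)
    # Smallest order statistic whose empirical c.d.f. value k/n reaches tau.
    k = max(1, math.ceil(tau * n - 1e-12))
    return z[k - 1]

# Hypothetical draws of three potential outcomes (illustrative numbers only).
y11 = [3.0, 5.0, 7.0, 9.0]   # Y(1, M(1))
y10 = [2.0, 4.0, 6.0, 8.0]   # Y(1, M(0))
y00 = [1.0, 2.0, 3.0, 4.0]   # Y(0, M(0))

tau = 0.5
ndqte = quantile(y10, tau) - quantile(y00, tau)   # equation (1)
niqte = quantile(y11, tau) - quantile(y10, tau)   # equation (2)
tqte = quantile(y11, tau) - quantile(y00, tau)    # equation (5)
```

The additivity holds by construction, because the quantile of Y (1, M (0)) cancels when equations (1) and (2) are summed.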

We aim at estimating the quantile treatment effects (1) to (5).1 To this end, we first need
to estimate the τ th-quantile of the relevant potential outcomes, by inverting estimates of the
corresponding c.d.f.’s at the τ th-quantile. Let FY (d,M (d′ )) (a) denote the c.d.f. of the potential
outcome Y (d, M (d′ )) at value a. To identify FY (d,M (d′ )) (a) in the data, we impose the following
assumptions.

Assumption 1
1. For any observation and d ∈ {0, 1} as well as m in the support of M , M = M (d) if D = d,
and Y = Y (d, m) if D = d and M = m.
2. (Y (d, m) , M (d′ )) ⊥ D | X = x for (d, d′ ) ∈ {0, 1}² and m, x in the support of (M, X).
3. Y (d, m) ⊥ M (d′ ) | D = d′ , X = x for (d, d′ ) ∈ {0, 1}² and m, x in the support of (M, X).
4. fD|M,X (d|m, x) > 0 for any d ∈ {0, 1} and m, x in the support of (M, X).

Assumption 1.1 implies the stable unit treatment value assumption (SUTVA), see Cox (1958)
and Rubin (1980), stating the potential mediators and potential outcomes are only a function of
an individual’s own treatment and mediator states, respectively, which are well defined (ruling
out multiple treatment or mediator versions). Assumptions 1.2 and 1.3 are sequential ignor-
ability or selection-on-observables conditions (Imai et al., 2010) for causal mediation analysis.
Assumption 1.2 states that conditional on X, the treatment variable D is independent of the
potential outcome Y (d, m) and the potential mediator M (d′ ). This assumption also implies
that Y (d, m) ⊥ D|M (d′ ) = m′ , X = x. Assumption 1.3 requires that Y (d, m) and M (d′ ) are
independent, too, conditional on X and D. Even if treatment D were random, this would not
suffice to identify direct and indirect effects and for this reason, we need to impose an identifying
assumption like Assumption 1.3 to tackle the endogeneity of the mediator. Assumption 1.4 is
a common support condition, which says that the treatment is not deterministic in covariates
X and mediator M such that for each covariate-mediator combination in the population, both
treated and non-treated subjects exist.

Footnote 1: We do not consider the controlled direct quantile treatment effect (CDQTE) at the τ th-quantile, CDQTE (τ ) :=
QY (1,m) (τ ) − QY (0,m) (τ ), which can be identified under less stringent assumptions than required for the
identification of natural effects.

Under Assumptions 1.1 to 1.4, we obtain the following identification result.

Proposition 1 Under Assumptions 1.1 to 1.4,

\[
F_{Y(d,M(d'))}(a) = \int g_{d,d',a}(x)\, f_X(x)\, dx, \tag{6}
\]

where (d, d′ ) ∈ {0, 1}² , a ∈ A, where A is a countable subset of R, and

\[
\begin{aligned}
g_{d,d',a}(x) &= \int F_{Y|D,M,X}(a|d,m,x)\, f_{M|D,X}(m|d',x)\, dm \\
&= E\left[ F_{Y|D,M,X}(a|d,M,X) \,\middle|\, D = d', X = x \right].
\end{aligned}
\]

The proof of Proposition 1 is provided in the appendix. Under Proposition 1, we may esti-
mate FY (d,M (d′ )) (a) based on plug-in estimation of the nuisance parameters FY |D,M,X (a|d, m, X)
and fM |D,X (m|d′ , X):
\[
\hat{\theta}^{YM}_{d,d',a} = \frac{1}{n} \sum_{i=1}^{n} \hat{g}_{d,d',a}(X_i), \tag{7}
\]

where

\[
\hat{g}_{d,d',a}(X_i) = \int \hat{F}_{Y|D,M,X}(a|d,m,X_i)\, \hat{f}_{M|D,X}(m|d',X_i)\, dm, \tag{8}
\]

and F̂Y |D,M,X (a|d, m, Xi ) and fˆM |D,X (m|d′ , Xi ) are estimates of FY |D,M,X (a|d, m, Xi ) and fM |D,X (m|d′ , Xi ).
If M is a continuous variable, we may avoid estimating the conditional density fM |D,X (m|d′ , X),
and use the following alternative estimator for estimating FY (d,M (d′ )) (a):

\[
\hat{\theta}^{RI}_{d,d',a} = \frac{1}{n} \sum_{i=1}^{n} \hat{E}\left[ F_{Y|D,M,X}(a|d,M_i,X_i) \,\middle|\, d', X_i \right], \tag{9}
\]

where Ê[FY |D,M,X (a|d, Mi , Xi ) |d′ , Xi ] is an estimate of E[FY |D,M,X (a|d, Mi , Xi ) |d′ , Xi ]. For
example, it might be based on a “regression-imputation” (Zhou, 2022), corresponding to the
fitted value of a linear regression of FY |D,M,X (a|d, Mi , Xi ) on Di and Xi at (d′ , Xi ).
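
As a rough illustration of the plug-in estimator in equations (7) and (8), the sketch below assumes a binary mediator and a discrete covariate, so that the integral in (8) becomes a sum over mediator values and all nuisance parameters can be estimated by cell frequencies. The helper names (`cell_mean`, `plug_in_cdf`) and the toy data are hypothetical and not taken from the paper.

```python
def cell_mean(values, mask):
    """Mean of `values` over rows where `mask` is True (0.0 for an empty cell)."""
    picked = [v for v, keep in zip(values, mask) if keep]
    return sum(picked) / len(picked) if picked else 0.0

def plug_in_cdf(y, d, m, x, a, d_out, d_med):
    """Plug-in estimate of F_{Y(d_out, M(d_med))}(a), binary M, discrete X."""
    ya = [1.0 if yi <= a else 0.0 for yi in y]
    g_by_x = {}
    for xv in set(x):
        g = 0.0
        for mv in (0, 1):  # the integral in (8) becomes a sum over m
            # Cell-frequency estimate of F_{Y|D,M,X}(a | d_out, mv, xv).
            F_hat = cell_mean(ya, [di == d_out and mi == mv and xi == xv
                                   for di, mi, xi in zip(d, m, x)])
            # Cell-frequency estimate of f_{M|D,X}(mv | d_med, xv).
            f_hat = cell_mean([1.0 if mi == mv else 0.0 for mi in m],
                              [di == d_med and xi == xv for di, xi in zip(d, x)])
            g += F_hat * f_hat
        g_by_x[xv] = g
    # Equation (7): average g-hat over the observed covariate values.
    return sum(g_by_x[xi] for xi in x) / len(x)

# Toy data (illustrative only): a single covariate cell, four units per arm.
y = [1, 2, 3, 4, 5, 6, 7, 8]
d = [0, 0, 0, 0, 1, 1, 1, 1]
m = [0, 0, 0, 1, 0, 1, 0, 1]
x = [0] * 8

theta_11 = plug_in_cdf(y, d, m, x, a=5.5, d_out=1, d_med=1)  # F_{Y(1,M(1))}(5.5)
theta_10 = plug_in_cdf(y, d, m, x, a=5.5, d_out=1, d_med=0)  # F_{Y(1,M(0))}(5.5)
```

With a single covariate cell and d = d′, the plug-in estimate coincides with the empirical c.d.f. among units with D = d, which gives a simple sanity check.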
The quality of estimators (7) and (9) crucially depends on the accuracy of nuisance parameter
estimation. If the number of pretreatment covariates X is small (low dimensional X) and the
functional forms of the nuisance parameters are known, parametric methods can provide high-
quality estimations on the nuisance parameters. In contrast, if X is high dimensional and/or
the nuisance parameters have complex forms, machine learning may be the preferred choice of
estimation. However, applying machine learning (ML) directly to estimate expressions (7) or (9) may result in
non-negligible bias induced by regularization and/or overfitting (Chernozhukov et al., 2018).
Causal machine learning algorithms aim at avoiding such biases by applying ML estimation
when making use of Neyman-orthogonal moment conditions, which imply that the estimation
of causal parameters is first order insensitive to (regularization) bias in the nuisance parameters,
and of cross-fitting, which avoids overfitting. One of these causal algorithms is double/debiased
machine learning (DML) (Chernozhukov et al., 2018), which has been previously adapted to
the estimation of average effects in causal mediation analysis (Farbmacher et al., 2022), while
this study extends it to the estimation of direct and indirect quantile treatment effects.
Let Ya = 1{Y ≤ a} be an indicator function which is one if outcome Y is smaller than or
equal to a (and zero otherwise) and Wa = (Ya , D, M, X) be a vector of the observed variables.
An estimator of the c.d.f. of the potential outcome that satisfies Neyman-orthogonality can be
derived from the efficient influence function (EIF) of FY (d,M (d′ )) (a):

\[
\psi^{\theta}_{d,d',a}(W_a; v_a) = \psi_{d,d',a}(W_a; v_a) - \theta, \tag{10}
\]

where

\[
\begin{aligned}
\psi_{d,d',a}(W_a; v_a) ={}& \frac{1\{D=d\}\, f_{D|M,X}(d'|M,X)}{f_{D|X}(d'|X)\, f_{D|M,X}(d|M,X)} \times \left[ Y_a - F_{Y|D,M,X}(a|d,M,X) \right] \\
&+ \frac{1\{D=d'\}}{f_{D|X}(d'|X)} \times \left[ F_{Y|D,M,X}(a|d,M,X) - g_{d,d',a}(X) \right] + g_{d,d',a}(X)
\end{aligned} \tag{11}
\]

for (d, d′ ) ∈ {0, 1}² , a ∈ A and va denoting the vector of nuisance parameters. Let θd,d′ ,a denote
the value of θ that satisfies $E[\psi^{\theta}_{d,d',a}(W_a; v_a)] = 0$:

\[
\theta_{d,d',a} = E\left[ \psi_{d,d',a}(W_a; v_a) \right]. \tag{12}
\]

We can show that θd,d′ ,a = FY (d,M (d′ )) (a) if Assumptions 1.1 to 1.4 hold; see the appendix for a
derivation of these results. Therefore, we may use the sample analogue of equation (12) to estimate
FY (d,M (d′ )) (a). A similar strategy was previously used to derive the triply robust approach
for estimating E [Y (d, M (d′ ))] in Tchetgen Tchetgen and Shpitser (2012) and Farbmacher et al.
(2022). If d = d′ , then the estimator of equation (12) reduces to

\[
\theta_{d,d,a} = E\left[ \psi_{d,d,a}(W_a; v_a) \right], \tag{13}
\]

where

\[
\begin{aligned}
\psi_{d,d,a}(W_a; v_a) &= \frac{1\{D=d\}}{f_{D|X}(d|X)} \times \left[ Y_a - g_{d,d,a}(X) \right] + g_{d,d,a}(X), \\
g_{d,d,a}(X) &= \int F_{Y|D,M,X}(a|d,m,X)\, f_{M|D,X}(m|d,X)\, dm = F_{Y|D,X}(a|d,X),
\end{aligned}
\]

for d ∈ {0, 1} and a ∈ A. We may use the sample analogue of equation (13) to estimate
FY (d,M (d)) (a). This is in analogy to the doubly robust approach for estimating E [Y (d, M (d))]
in Robins et al. (1994) and Hahn (1998).
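
The score in equation (11) and its reduction to the doubly robust form in equation (13) can be sketched for a single observation as follows. The function name `score`, the dictionary-based interface and the numerical nuisance values are hypothetical and serve only to illustrate the algebra.

```python
def score(ya, d_obs, d, d_prime, p_d_x, p_d_mx, F_a, g):
    """Evaluate psi_{d,d',a} in equation (11) for one observation.

    ya      : indicator 1{Y <= a}
    d_obs   : observed treatment D
    p_d_x   : dict with p_d_x[t]  = f_{D|X}(t | X)
    p_d_mx  : dict with p_d_mx[t] = f_{D|M,X}(t | M, X)
    F_a     : F_{Y|D,M,X}(a | d, M, X)
    g       : g_{d,d',a}(X)
    """
    w1 = (d_obs == d) * p_d_mx[d_prime] / (p_d_x[d_prime] * p_d_mx[d])
    w2 = (d_obs == d_prime) / p_d_x[d_prime]
    return w1 * (ya - F_a) + w2 * (F_a - g) + g

# Hypothetical nuisance values for one treated observation with Y <= a.
p_d_x = {0: 0.6, 1: 0.4}
p_d_mx = {0: 0.5, 1: 0.5}
F_a, g, ya, d_obs = 0.7, 0.65, 1.0, 1

# General score with d = d' = 1 ...
psi_general = score(ya, d_obs, 1, 1, p_d_x, p_d_mx, F_a, g)
# ... and the reduced form of equation (13).
psi_reduced = (d_obs == 1) / p_d_x[1] * (ya - g) + g
```

For d = d′ the conditional c.d.f. F_a cancels between the first two terms, so only the treatment propensity and g remain, as stated in equation (13).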
We note that by using Bayes' rule, we can rewrite equation (11) alternatively as:

\[
\begin{aligned}
\psi'_{d,d',a}(W_a; v'_a) ={}& \frac{1\{D=d\}\, f_{M|D,X}(M|d',X)}{f_{D|X}(d|X)\, f_{M|D,X}(M|d,X)} \times \left[ Y_a - F_{Y|D,M,X}(a|d,M,X) \right] \\
&+ \frac{1\{D=d'\}}{f_{D|X}(d'|X)} \times \left[ F_{Y|D,M,X}(a|d,M,X) - g_{d,d',a}(X) \right] + g_{d,d',a}(X).
\end{aligned} \tag{14}
\]

Therefore,

\[
\theta'_{d,d',a} = E\left[ \psi'_{d,d',a}(W_a; v'_a) \right] \tag{15}
\]

can also be used to construct an estimator of FY (d,M (d′ )) (a). There are several differences between
the estimators based on equations (12) and (15). Making use of equation (12) requires estimat-
ing four nuisance parameters: fD|X (d|x), fD|M,X (d|m, x), FY |D,M,X (a|d, m, x) and gd,d′ ,a (x).
Since D is binary, the first two nuisance parameters may for instance be estimated by a logit
or probit model. The conditional c.d.f. FY |D,M,X (a|d, m, x) can be estimated by distributional
regression (DR) (Chernozhukov et al., 2013). gd,d′ ,a (x) might be estimated by regression im-
putation as outlined in equation (9). However, the estimator based on equation (15) requires
only three nuisance parameter estimates of fD|X (d|x), fM |D,X (m|d, x) and FY |D,M,X (a|d, m, x).
We may estimate gd,d′ ,a (x) based on equation (8) after having estimated fM |D,X (m|d, x) and
FY |D,M,X (a|d, m, x). The estimator based on equation (15) appears particularly attractive if
the mediator M is discrete and takes a finite (and relatively small) number of values. However,
if M is continuous, estimation based on equation (12) may appear more attractive, because it
avoids estimating the conditional density fM |D,X (m|d, x) and the integral in equation (8) to
obtain an estimate of gd,d′ ,a (x).
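
As a brief check of the step from equation (11) to equation (14), the weight in the first term can be rewritten with Bayes' rule. The conditional density f_{M|X}(M|X) does not appear in the text above and is introduced here only as an intermediate quantity that cancels:

\[
\frac{f_{D|M,X}(d'|M,X)}{f_{D|X}(d'|X)\, f_{D|M,X}(d|M,X)}
= \frac{f_{M|D,X}(M|d',X)\, f_{D|X}(d'|X) / f_{M|X}(M|X)}{f_{D|X}(d'|X)\, f_{M|D,X}(M|d,X)\, f_{D|X}(d|X) / f_{M|X}(M|X)}
= \frac{f_{M|D,X}(M|d',X)}{f_{D|X}(d|X)\, f_{M|D,X}(M|d,X)}.
\]

This is why the first weight in equation (14) involves f_{D|X}(d|X) rather than f_{D|X}(d′|X), and why the treatment model conditional on the mediator is no longer needed.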
The estimators based on equations (12) and (15) also differ in terms of their robustness to
misspecification of the nuisance parameters. Let $\hat{\theta}_{d,d',a}$ and $\hat{\theta}'_{d,d',a}$ denote estimators based on
equations (12) and (15) and the respective estimators of the nuisance parameters. Applying
the theorem of semiparametric efficiency in Tchetgen Tchetgen and Shpitser (2012) and Zhou
(2022), we can show that under certain regularity conditions, the following results hold for
estimating FY (d,M (d′ )) (a) at outcome value a:

• If FY |D,M,X (a|d, m, x) and fM |D,X (m|d, x) are correctly specified, $\hat{\theta}'_{d,d',a} \xrightarrow{p} \theta'_{d,d',a}$.

• If FY |D,M,X (a|d, m, x) and fD|X (d′ |x) are correctly specified, $\hat{\theta}'_{d,d',a} \xrightarrow{p} \theta'_{d,d',a}$.

• If fD|X (d′ |x) and fM |D,X (m|d, x) are correctly specified, $\hat{\theta}'_{d,d',a} \xrightarrow{p} \theta'_{d,d',a}$.

This implies that if two of the nuisance parameters entering equation (14) are correctly
specified, while also certain regularity conditions and Assumptions 1.1 to 1.4 hold, then $\hat{\theta}'_{d,d',a}$ is
a consistent estimator of FY (d,M (d′ )) (a) at outcome value a. In contrast, $\hat{\theta}_{d,d',a} \xrightarrow{p} \theta_{d,d',a}$ only
holds if fD|X (d|x) is consistently estimated. If the latter holds and only one of the other three
nuisance parameters in equation (11) is misspecified, while certain regularity conditions and
Assumptions 1.1 to 1.4 are satisfied, then $\hat{\theta}_{d,d',a}$ remains a consistent estimator of FY (d,M (d′ )) (a)
at outcome value a. Finally, if all nuisance parameters are correctly specified and consistently
estimated, while certain regularity conditions and Assumptions 1.1 to 1.4 also hold, then both $\hat{\theta}_{d,d',a}$
and $\hat{\theta}'_{d,d',a}$ are semiparametrically efficient.

2.2 Improving Finite Sample Behavior


The estimate of the c.d.f. of Y (d, M (d′ )) can be inverted at a specific rank τ to obtain an
estimate of the τ th quantile QY (d,M (d′ )) (τ ), which we denote by Q̂Y (d,M (d′ )) (τ ). Suppose Y (d, M (d′ )) is
continuous and let the grid of points used for the estimation be a non-decreasing sequence $\{a_l\}_{l=1}^{L}$,
where $0 < \underline{a} < a_1 < a_2 < \dots < a_L < \bar{a} < \infty$. Let p̂l denote an estimate of FY (d,M (d′ )) (al ) (e.g.,
the K-fold cross-fitting estimate, see Section 2.3). Note that p̂l is not necessarily confined to
[0, 1] nor monotonically non-decreasing in al , as required for a valid c.d.f. For this reason,
we apply two additional constraints on the estimates p̂l , l = 1, . . . , L. The first one restricts
their values to be within the range [0, 1]. That is, we replace p̂l with p̃l = max {min {p̂l , 1} , 0}.
Then we follow Chernozhukov et al. (2010) and use the rearrangement operator to sort p̃l in
non-decreasing order. Let (p̃(1) , p̃(2) , . . . , p̃(L) ) be the sorted sequence of p̃l , l = 1, 2, . . . , L. The
sequence (p̃(1) , p̃(2) , . . . , p̃(L) ) is our final estimate of the c.d.f. of Y (d, M (d′ )) at (a1 , a2 , . . . , aL ).
We then fit a function for the points (p̃(l) , al ), l = 1, 2, . . . , L with linear interpolation, and use the
fitted function to calculate the value of a at rank τ to estimate QY (d,M (d′ )) (τ ),² which permits
estimating the quantile treatment effects (1) to (5). When Y (d, M (d′ )) is discrete, we need
not fit the function for the points (p̃(l) , al ); we may obtain Q̂Y (d,M (d′ )) (τ ) by directly using the
definition of the τ th-quantile.
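
The two constraints and the inversion step for a continuous outcome can be sketched as follows. The function name `cdf_to_quantile` and the example numbers are hypothetical; returning the nearest grid endpoint when τ lies outside the range of the rearranged estimates is an assumption of this sketch, not something the text prescribes.

```python
def cdf_to_quantile(grid, p_hat, tau):
    """Invert raw c.d.f. estimates p_hat on `grid` at rank tau."""
    # Constraint 1: restrict the estimates to [0, 1].
    p = [max(min(v, 1.0), 0.0) for v in p_hat]
    # Constraint 2: rearrangement, i.e. sort into non-decreasing order.
    p = sorted(p)
    # Assumption of this sketch: no extrapolation beyond the grid.
    if tau <= p[0]:
        return grid[0]
    if tau >= p[-1]:
        return grid[-1]
    # Linear interpolation through the points (p_(l), a_l).
    for l in range(1, len(p)):
        if p[l] >= tau:
            # Here p[l-1] < tau <= p[l], so the denominator is positive.
            share = (tau - p[l - 1]) / (p[l] - p[l - 1])
            return grid[l - 1] + share * (grid[l] - grid[l - 1])

# Hypothetical raw estimates that violate both the range and monotonicity.
grid = [1.0, 2.0, 3.0, 4.0, 5.0]
p_hat = [-0.05, 0.30, 0.25, 0.80, 1.10]
q_median = cdf_to_quantile(grid, p_hat, tau=0.5)
```

In this example the clipped and sorted sequence is (0, 0.25, 0.30, 0.80, 1), so the median falls between the grid points 3 and 4.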

2.3 K-Fold Cross-Fitting


Neyman-orthogonality may mitigate regularization bias coming from machine learning-based
estimation of the nuisance parameters in equations (12) or (15). To also safeguard against
overfitting bias, we follow Chernozhukov et al. (2018) and Farbmacher et al. (2022) and apply
K-fold cross-fitting to estimate the nuisance parameters and the potential outcome distributions,
FY (d,M (d′ )) (a), in different parts of the data. To describe the approach, let Ya,i = 1{Yi ≤ a}
and Wa,i = (Ya,i , Di , Mi , Xi ) denote the ith observation, i = 1, 2, . . . , n. In the following, we
use the estimator based on equation (12) to illustrate K-fold cross-fitting.

1. Randomly split the n observations into K (mutually exclusive) subsamples of equal sample
size nk = n/K, k = 1, 2, . . . , K. Let Ik , k = 1, 2, . . . , K denote the set of indices for
the K different subsamples. Let $I_k^c$, k = 1, 2, . . . , K denote the complement set of Ik :
$I_k^c = \{1, 2, \dots, n\} \setminus I_k$.

2. For each k, estimate the model parameters of the nuisance parameters FY |D,M,X (a|d, M, X),
fD|X (d′ |X), fD|M,X (d|M, X) and gd,d′ ,a (X) based on observations Wa,i , $i \in I_k^c$. For
observations Wa,i , i ∈ Ik , predict the nuisance parameters: $\hat{F}^{(k)}_{Y|D,M,X}(a|d,M_i,X_i)$,
$\hat{f}^{(k)}_{D|X}(d'|X_i)$, $\hat{f}^{(k)}_{D|M,X}(d|M_i,X_i)$, $\hat{f}^{(k)}_{D|M,X}(d'|M_i,X_i)$ and $\hat{g}^{(k)}_{d,d',a}(X_i)$, i ∈ Ik .

3. For each k, compute the estimate of FY (d,M (d′ )) (a) using the predicted nuisance parameters
of step 2 as

\[
\begin{aligned}
\hat{\theta}^{(k)}_{d,d',a} = \frac{1}{n_k} \sum_{i \in I_k} \Bigg\{ & \frac{1\{D_i=d\}\, \hat{f}^{(k)}_{D|M,X}(d'|M_i,X_i)}{\hat{f}^{(k)}_{D|X}(d'|X_i)\, \hat{f}^{(k)}_{D|M,X}(d|M_i,X_i)} \times \left[ 1\{Y_i \le a\} - \hat{F}^{(k)}_{Y|D,M,X}(a|d,M_i,X_i) \right] \\
& + \frac{1\{D_i=d'\}}{\hat{f}^{(k)}_{D|X}(d'|X_i)} \left[ \hat{F}^{(k)}_{Y|D,M,X}(a|d,M_i,X_i) - \hat{g}^{(k)}_{d,d',a}(X_i) \right] + \hat{g}^{(k)}_{d,d',a}(X_i) \Bigg\}.
\end{aligned} \tag{16}
\]

4. Average $\hat{\theta}^{(k)}_{d,d',a}$ over k = 1, 2, . . . , K to obtain the final estimate of FY (d,M (d′ )) (a):

\[
\hat{\theta}_{d,d',a} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}_{d,d',a}. \tag{17}
\]

Footnote 2: The monotonicity property is preserved under the linear interpolation.

We repeat steps 1 to 4 for a grid of points a, a non-decreasing sequence $\{a_l\}_{l=1}^{L}$, where
$0 < \underline{a} < a_1 < a_2 < \dots < a_L < \bar{a} < \infty$, to construct the estimate of the c.d.f. profile of Y (d, M (d′ )).
In Section 3, we will establish the asymptotic properties uniformly over a ∈ A for the K-fold
cross-fitting estimator of equation (17).
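
Steps 1 to 4 can be sketched in a stylized setting with a binary mediator and a binary covariate, where every nuisance parameter is estimated by cell frequencies on the complement folds instead of by machine learning. Everything specific to this sketch is an assumption made for illustration: the function names, the trimming of estimated probabilities to [0.01, 0.99], the fallback value for empty cells and the simulated data-generating process.

```python
import random

def cell_mean(values, mask):
    """Mean of `values` where `mask` is True; 0.5 for an empty cell (crude fallback)."""
    picked = [v for v, keep in zip(values, mask) if keep]
    return sum(picked) / len(picked) if picked else 0.5

def clip(p, lo=0.01, hi=0.99):
    """Keep estimated probabilities away from 0 and 1 (a trimming assumption)."""
    return min(max(p, lo), hi)

def cross_fit_cdf(data, a, d, dp, K=2, seed=0):
    """Cross-fitted estimate of F_{Y(d,M(dp))}(a); rows are (y, d_i, m_i, x_i)."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]                      # step 1
    fold_estimates = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]    # complement I_k^c
        ya = [1.0 if r[0] <= a else 0.0 for r in train]
        is_d = [1.0 if r[1] == d else 0.0 for r in train]
        is_dp = [1.0 if r[1] == dp else 0.0 for r in train]
        # Step 2: nuisance estimates from the complement, stored per cell.
        p_dp_x, p_dp_mx, p_d_mx, F, g = {}, {}, {}, {}, {}
        for x in (0, 1):
            p_dp_x[x] = clip(cell_mean(is_dp, [r[3] == x for r in train]))
            g[x] = 0.0
            for m in (0, 1):
                in_mx = [r[2] == m and r[3] == x for r in train]
                p_dp_mx[m, x] = clip(cell_mean(is_dp, in_mx))
                p_d_mx[m, x] = clip(cell_mean(is_d, in_mx))
                F[m, x] = cell_mean(ya, [r[1] == d and r[2] == m and r[3] == x
                                         for r in train])
                f_m = cell_mean([1.0 if r[2] == m else 0.0 for r in train],
                                [r[1] == dp and r[3] == x for r in train])
                g[x] += F[m, x] * f_m      # saturated regression imputation
        # Step 3: average the score of equation (16) over the held-out fold.
        total = 0.0
        for i in fold:
            y, di, mi, xi = data[i]
            w1 = (di == d) * p_dp_mx[mi, xi] / (p_dp_x[xi] * p_d_mx[mi, xi])
            w2 = (di == dp) / p_dp_x[xi]
            total += w1 * ((y <= a) - F[mi, xi]) + w2 * (F[mi, xi] - g[xi]) + g[xi]
        fold_estimates.append(total / len(fold))
    return sum(fold_estimates) / K                             # step 4

# Simulated data (hypothetical design): Y = D + M + X + U with U uniform on (0, 1).
rng = random.Random(1)
data = []
for _ in range(2000):
    x = int(rng.random() < 0.5)
    d_i = int(rng.random() < 0.3 + 0.4 * x)
    m_i = int(rng.random() < 0.2 + 0.5 * d_i)
    data.append((d_i + m_i + x + rng.random(), d_i, m_i, x))

theta_10 = cross_fit_cdf(data, a=2.5, d=1, dp=0)   # targets F_{Y(1,M(0))}(2.5)
```

Under this simulated design the target value is F_{Y(1,M(0))}(2.5) = 0.65, so the estimate should be close to that number in moderately large samples. In applications, the cell-frequency estimates would be replaced by the machine learners discussed in the text.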
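Alternatively, we can also construct the K-fold cross-fitting estimator based on equation
(15):

\[
\hat{\theta}'_{d,d',a} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}'^{(k)}_{d,d',a}, \tag{18}
\]

where

\[
\begin{aligned}
\hat{\theta}'^{(k)}_{d,d',a} = \frac{1}{n_k} \sum_{i \in I_k} \Bigg\{ & \frac{1\{D_i=d\}\, \hat{f}^{(k)}_{M|D,X}(M_i|d',X_i)}{\hat{f}^{(k)}_{D|X}(d|X_i)\, \hat{f}^{(k)}_{M|D,X}(M_i|d,X_i)} \times \left[ 1\{Y_i \le a\} - \hat{F}^{(k)}_{Y|D,M,X}(a|d,M_i,X_i) \right] \\
& + \frac{1\{D_i=d'\}}{\hat{f}^{(k)}_{D|X}(d'|X_i)} \left[ \hat{F}^{(k)}_{Y|D,M,X}(a|d,M_i,X_i) - \hat{g}^{(k)}_{d,d',a}(X_i) \right] + \hat{g}^{(k)}_{d,d',a}(X_i) \Bigg\}.
\end{aligned} \tag{19}
\]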

2.4 Nuisance Parameter Estimation


In the case when D and M are binary variables, fD|X (d|x) and fM |D,X (m|d, x) can be es-
timated straightforwardly, for instance by logit or probit models. If M is continuous, we
might prefer θ̂d,d′ ,a to avoid estimating fM |D,X (m|d, x), while the required nuisance param-
eter fD|M,X (d|m, x) can be straightforwardly estimated if D is binary. The estimation of any
nuisance parameter may be based on machine learning, for example, the lasso or neural networks.
For estimating FY |D,M,X (a|d, m, x), we use distributional regression (DR). Conditional
on (D = d, M = m, X = x), the conditional c.d.f. of Y may be written as

FY |D,M,X (a|d, m, x) = E [Ya |d, m, x] , (20)

where a is a constant and a ∈ A ⊂ R, with A being a countable subset of R, and Ya = 1 {Y ≤ a}
being an indicator function for the event {Y ≤ a}. Equation (20) is the building block for
constructing the DR estimator (Foresi and Peracchi, 1995; Chernozhukov et al., 2013), which is
based on using the binary dependent variable Ya to estimate FY |D,M,X (a|d, m, x) by a regression
approach that estimates E [Ya |d, m, x]. For example, one may assume that the c.d.f. is linear
in variables (D, M, X), their higher-order terms and interaction terms, and estimate a linear
probability model (LPM) by OLS. However, the LPM does not guarantee that the estimated
FY |D,M,X (a|d, m, x) will lie within the interval [0, 1]. To overcome this difficulty, we may assume
that
FY |D,M,X (a|d, m, x) = Ga (va (d, m, x)) ,

where va : (D, M, X) 7→ R and Ga (.) is a link function which is non-decreasing and satisfies
Ga : R 7→ [0, 1], Ga (y) → 0 if y → −∞ and Ga (y) → 1 if y → ∞. The choices of va (D, M, X)
and the link function Ga (.) are flexible. For example, va (D, M, X) might be a neural network
(see Section 3.1) or a transformation of (D, M, X), which can vary with a. Depending on whether
Y is continuous or discrete, the link function Ga (.) may be the logit, probit, linear, log-log,
Gosset, the Cox proportional hazard function or the incomplete Gamma function. As pointed
out by Chernozhukov et al. (2013), for any given link function Ga (.), we can approximate
FY |D,M,X (a|d, m, x) arbitrarily well if va (D, M, X) is sufficiently flexible.
There are various ways to implement DR, and a popular choice is maximum likelihood
estimation. Let v̂a (D, M, X) denote the maximum likelihood estimator of va (D, M, X), which
is obtained by
\[
\max_{v_a \in \mathcal{V}_a} \sum_{i=1}^{n} \left\{ Y_{a,i} \ln G_a\left( v_a(D_i,M_i,X_i) \right) + \left( 1 - Y_{a,i} \right) \ln \left[ 1 - G_a\left( v_a(D_i,M_i,X_i) \right) \right] \right\}, \tag{21}
\]

where $\mathcal{V}_a$ is the parameter space of va (D, M, X). Then, Ga (v̂a (d, m, x)) is an estimate of
FY |D,M,X (a|d, m, x). va may be estimated by machine learning methods when the dimension
of X is large and/or the functional form of va is complex.
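
A minimal sketch of distributional regression with a logit link is given below. It assumes, purely for illustration, that va is linear in a single scalar regressor z (standing in for the vector (D, M, X)) and maximizes the log-likelihood in equation (21) by Newton's method. The function name `fit_logit` and the toy data are hypothetical.

```python
import math

def fit_logit(z, ya, steps=25):
    """Maximise the log-likelihood (21) for v_a(z) = b0 + b1 * z with a logit link."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, ya):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * zi)))
            # Gradient of the log-likelihood.
            g0 += yi - p
            g1 += (yi - p) * zi
            # Negative Hessian (information matrix).
            w = p * (1.0 - p)
            h00 += w
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:      # stop if the information matrix is near singular
            break
        # Newton step: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data: indicators Y_a = 1{Y <= a} for one threshold a (illustrative only).
z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ya = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]

b0, b1 = fit_logit(z, ya)

def cdf_hat(zv):
    """Fitted conditional c.d.f. G_a(v_a-hat(z)) at the chosen threshold a."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * zv)))
```

To trace out the whole conditional distribution, the fit would be repeated for each grid point a, each time with a newly constructed indicator Ya. Note that the maximum likelihood estimate does not exist when the indicators are perfectly separated by z, which can happen for extreme values of a.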

2.5 Summarizing the Estimation Approach


Our estimation approach can be summarized as follows:

Step 1 Modeling the nuisance parameters
The nuisance parameters include fD|X (d|x), fD|M,X (d|m, x), FY |D,M,X (a|d, m, x) and gd,d′ ,a (x).
Depending on the properties of (Y, D, M ), choose appropriate functional forms for the
nuisance parameters.

Step 2 Estimating the nuisance parameters
Estimate the nuisance parameters by K-fold cross-fitting, as described in points 1 and 2
in Section 2.3.

Step 3 Computing the Neyman-orthogonal estimator
With estimates of the nuisance parameters from K-fold cross-fitting, compute the
Neyman-orthogonal estimator as described in points 3 and 4 in Section 2.3.

Step 4 Repeating steps 2 to 3 for a grid of points $\{a_l\}_{l=1}^{L}$, where $0 < \underline{a} < a_1 < a_2 < \dots < a_L < \bar{a} < \infty$
Notice that only FY |D,M,X (a|d, m, x), and gd,d′ ,a (x), which is constructed from it, need to be
re-estimated; the remaining nuisance parameters do not depend on a.

Step 5 Adopting the two constraints described in Section 2.2 to obtain p̃(l) , the final estimate of
FY (d,M (d′ )) (al )

Step 6 If Y (d, M (d′ )) is continuous, fitting a function for the points (p̃(l) , al ), l = 1, 2, . . . , L,
and using the fitted function to obtain Q̂Y (d,M (d′ )) (τ ), the estimate of QY (d,M (d′ )) (τ ); if
Y (d, M (d′ )) is discrete, using the definition of the τ th quantile to obtain Q̂Y (d,M (d′ )) (τ ).

Step 7 Estimating the quantile treatment effects of interest as defined in equations (1) to (5) based
on Q̂Y (d,M (d′ )) (τ ).

2.6 Multiplier Bootstrap


We propose a multiplier bootstrap for statistical inference. Let {ξi }ni=1 be a sequence of i.i.d.
(pseudo) random variables, independent of the sample path {(Ya,i , Mi , Di , Xi )}ni=1 , with E[ξi ] =
0, V ar(ξi ) = 1 and E [exp (|ξi |)] < ∞. Let v̂k,a denote a vector containing the K-fold cross-
fitting estimates of the nuisance parameters, whose model parameters are estimated based
on observations in the complement set, Wa,i with i ∈ Ikc . The proposed multiplier bootstrap
estimator for θ̂d,d′ ,a in equation (17) is given by:

n
∗ 1X  
θ̂d,d ′ ,a = θ̂d,d′ ,a + ξi ψd,d′ ,a (Wa,i ; v̂k,a ) − θ̂d,d′ ,a , (22)
n
i=1

The multiplier bootstrap estimator does not require re-estimating the nuisance parameters and re-calculating the causal parameters of interest in each bootstrap sample. This is particularly useful in our context, since our estimation approach is to be repeatedly applied across grid points a. After obtaining $\hat{\theta}^{*}_{d,d',a}$ for all grid points, we use the procedures introduced in Section 2.1 to construct the bootstrap estimate of the c.d.f. of $Y(d, M(d'))$ at these grid points and the bootstrap estimate of the $\tau$th quantile $Q^{*}_{Y(d,M(d'))}(\tau)$. Section 3 establishes the uniform validity of the proposed multiplier bootstrap procedure.
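The bootstrap in equation (22) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `multiplier_bootstrap` and its arguments are hypothetical, the multipliers are drawn as standard normal (one distribution that satisfies the stated moment conditions), and the same multipliers are reused across all grid points, which is an assumption of this sketch. Only the Python standard library is used to keep the example self-contained.

```python
import random


def multiplier_bootstrap(scores_by_grid, n_boot=200, seed=0):
    """Multiplier bootstrap draws of theta* at every grid point a.

    scores_by_grid[l][i] holds the cross-fitted score
    psi_{d,d',a_l}(W_{a_l,i}; v_hat_{k,a_l}) of observation i at grid point a_l.
    Returns (theta_hat, draws) with draws[b][l] = theta*_{a_l} in replication b.
    """
    rng = random.Random(seed)
    n = len(scores_by_grid[0])
    # With equal-sized folds, the average of the fold means is the overall mean.
    theta_hat = [sum(s) / n for s in scores_by_grid]
    centered = [[v - t for v in s] for s, t in zip(scores_by_grid, theta_hat)]
    draws = []
    for _ in range(n_boot):
        # Assumption of this sketch: standard normal multipliers (mean 0,
        # variance 1), one per observation, shared across all grid points.
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draws.append([
            t + sum(x * c for x, c in zip(xi, cs)) / n
            for t, cs in zip(theta_hat, centered)
        ])
    return theta_hat, draws
```

No nuisance model is refitted inside the loop; each replication only reweights the centered scores, which is what makes the procedure cheap when it has to be repeated over a grid of values a.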

3 Theoretical Results
3.1 Notation
In this section, we show that the proposed K-fold cross-fitting estimator in Section 2.2 is uniformly valid under certain conditions. We focus on establishing the theoretical properties of the estimator in equation (17) and note that the assumptions and procedures required for demonstrating uniform validity of estimation based on equation (18) are similar and omitted for this reason. To ease notation in our analysis, let $g^0_{1d}(X) := F^0_{D|X}(d|X)$, $g^0_{2d}(M,X) := F^0_{D|M,X}(d|M,X)$, $g^0_{3a}(D,M,X) := F^0_{Y|D,M,X}(a|D,M,X)$ and $g^0_{4ad}(D,X) := E\left[F^0_{Y|D,M,X}(a|d,M,X) \,|\, D,X\right]$ denote the nuisance parameters. Let $v^0_a$ denote the vector containing these true nuisance parameters and $\mathcal{G}_a$ be the set of all $v^0_a$. Let $F^0_{Y(d,M(d'))}(a)$ denote the true c.d.f. of the potential outcome $Y(d, M(d'))$. The EIF of $F^0_{Y(d,M(d'))}(a)$ is $\psi^{\theta}_{d,d',a}(W_a; v^0_a) = \psi_{d,d',a}(W_a; v^0_a) - \theta$, where

$$\psi_{d,d',a}(W_a; v^0_a) = \frac{1\{D = d\}\left(1 - g^0_{2d}(M,X)\right)}{\left(1 - g^0_{1d}(X)\right) g^0_{2d}(M,X)} \times \left[1\{Y \le a\} - g^0_{3a}(d,M,X)\right] + \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} \left[g^0_{3a}(d,M,X) - g^0_{4ad}(d',X)\right] + g^0_{4ad}(d',X).$$

Under Assumptions 1.1 to 1.4, it can be shown that

$$F^0_{Y(d,M(d'))}(a) = E\left[\psi_{d,d',a}(W_a, v^0_a)\right] \qquad (23)$$

for all $a \in \mathcal{A}$ and $(d, d') \in \{0, 1\}^2$ (see proof of Theorem 1).
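To make the structure of the score concrete, the sketch below evaluates $\psi_{d,d',a}$ for a single observation and averages scores over folds. It transcribes the displayed formula literally; the function names `score` and `cross_fit_estimate` and the argument names are hypothetical, and the nuisance values `g1`, `g2`, `g3`, `g4` are assumed to have been fitted elsewhere and already evaluated at the observation.

```python
def score(y, d_obs, a, d, d_prime, g1, g2, g3, g4):
    """Score psi_{d,d',a} for one observation.

    g1: g_{1d}(X)      = P(D = d | X)
    g2: g_{2d}(M, X)   = P(D = d | M, X)
    g3: g_{3a}(d,M,X)  = F_{Y|D,M,X}(a | d, M, X)
    g4: g_{4ad}(d', X) = E[F_{Y|D,M,X}(a | d, M, X) | D = d', X]
    """
    ind_y = 1.0 if y <= a else 0.0
    ind_d = 1.0 if d_obs == d else 0.0
    ind_dp = 1.0 if d_obs == d_prime else 0.0
    # First term: reweighted residual of the outcome indicator
    term1 = ind_d * (1.0 - g2) / ((1.0 - g1) * g2) * (ind_y - g3)
    # Second term: reweighted residual of the nested conditional mean
    term2 = ind_dp / (1.0 - g1) * (g3 - g4)
    return term1 + term2 + g4


def cross_fit_estimate(scores_by_fold):
    """Average of the fold-specific means of the scores."""
    fold_means = [sum(s) / len(s) for s in scores_by_fold]
    return sum(fold_means) / len(fold_means)
```

Each fold mean in `cross_fit_estimate` is meant to use nuisance values fitted on the complement of that fold, so that an observation never enters the fit of its own nuisance values.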
In the subsequent theoretical analysis, the expectation $E[\cdot]$ is taken under the probability $P \in \mathcal{P}_n$. Let $N = n/K$ be the size of a fold or subsample, where K is a fixed number. Let $\mathbb{E}_n := n^{-1}\sum_{i=1}^{n} \varsigma_{W_i}$ and $\mathbb{E}_{N,k} := N^{-1}\sum_{i \in I_k} \varsigma_{W_i}$, where $\varsigma_w$ is a probability distribution degenerating at w and $I_k$ is a set of indices of observations in the kth subsample. Let $Z \rightsquigarrow Z'$ denote a random variable $Z$ that weakly converges to a random variable $Z'$. Let $\|x\|$ denote the $l_1$ norm and $\|x\|_q$ denote the $l_q$ norm, $q \ge 2$, for a deterministic vector x. Let $\|X\|_{P,q}$ denote $(E[\|X\|^q])^{1/q}$ for a random vector X. The function $\psi_{d,d',a}$ for identifying the parameter of interest and constructing the estimator is such that $\psi_{d,d',a}(w, t) : \mathcal{W}_a \times \mathcal{V}_a \mapsto \mathbb{R}$, where $(d, d') \in \{0, 1\}^2$, $a \in \mathcal{A} \subset \mathbb{R}$, $\mathcal{W}_a \subset \mathbb{R}^{d_w}$ is a $d_w$ dimensional Borel set and $\mathcal{V}_a$ is a G dimensional set of Borel measurable maps. Let $\boldsymbol{\psi}_a = (\psi_{1,1,a}, \psi_{1,0,a}, \psi_{0,1,a}, \psi_{0,0,a})$ and $\boldsymbol{\psi}_a : \mathcal{W}_a \times \mathcal{V}_a \mapsto \mathbb{R}^4$. Let $v^0_{a,g} : \mathcal{U}_a \mapsto \mathbb{R}$ denote the gth true nuisance parameter, where $\mathcal{U}_a \subseteq \mathcal{W}_a$ is a Borel set, and let $v^0_a := (v^0_{a,1}, \ldots, v^0_{a,G}) \in \mathcal{V}_a$ denote the vector of these true nuisance parameters. Let $\hat{v}_{k,a,g} : \mathcal{U}_a \mapsto \mathbb{R}$ denote an estimate of $v^0_{a,g}$, which is obtained by using K-fold cross-fitting, such that the model parameters of the nuisance parameters are estimated based on observations in the complement set, $W_{a,j}$ with $j \in I_k^c$. Let $\hat{v}_{k,a} := (\hat{v}_{k,a,1}, \ldots, \hat{v}_{k,a,G})$ denote the vector of these estimates. $v^0_a$ and $\hat{v}_{k,a}$ are both functions of $U_a \in \mathcal{U}_a$, a subvector of $W_a \in \mathcal{W}_a$. But to ease notation, we will write $v^0_a$ and $\hat{v}_{k,a}$ instead of $v^0_a(U_a)$ and $\hat{v}_{k,a}(U_a)$.
$\boldsymbol{\psi}_a(W_a, v)$ denotes $\boldsymbol{\psi}_a$ with elements $\psi_{d,d',a}(W_a; v)$, $(d, d') \in \{0, 1\}^2$. The parameter of interest is $F^0_{Y(d,M(d'))}(a)$, which can be identified by equation (23), the expectation of $\psi_{d,d',a}(W_a; v)$ evaluated at the true nuisance parameters $v^0_a$. Let $\theta^0_{d,d',a} := E\left[\psi_{d,d',a}(W_a, v^0_a)\right]$. The proposed estimator of $\theta^0_{d,d',a}$ is the K-fold cross-fitting estimator $\hat{\theta}_{d,d',a}$ of (17), in which $\hat{\theta}^{(k)}_{d,d',a} = N^{-1}\sum_{i \in I_k} \psi_{d,d',a}(W_{a,i}; \hat{v}_{k,a})$. In our case, the vector $\hat{v}_{k,a}$ contains estimates of the four true nuisance parameters $g^0_{1d}(X)$, $g^0_{2d}(M,X)$, $g^0_{3a}(D,M,X)$ and $g^0_{4ad}(D,X)$ when estimating the model parameters based on observations in the complement set, $W_{a,i}$ with $i \in I_k^c$. Let $\boldsymbol{\theta}^0_a$, $\hat{\boldsymbol{\theta}}_a$, $\hat{\boldsymbol{\theta}}^{(k)}_a$ and $\mathbf{F}^0(a)$ denote vectors containing $\theta^0_{d,d',a}$, $\hat{\theta}_{d,d',a}$, $\hat{\theta}^{(k)}_{d,d',a}$ and $F^0_{Y(d,M(d'))}(a)$ for different $(d, d') \in \{0, 1\}^2$. We note that $\hat{\boldsymbol{\theta}}_a = K^{-1}\sum_{k=1}^{K} \hat{\boldsymbol{\theta}}^{(k)}_a$ and $\boldsymbol{\theta}^0_a = E\left[\boldsymbol{\psi}_a(W_a; v^0_a)\right]$, and if equation (23) holds for all $a \in \mathcal{A}$ and $(d, d') \in \{0, 1\}^2$, then $\mathbf{F}^0(a) = \boldsymbol{\theta}^0_a$.

3.2 Main Results


To establish the uniform validity of θ̂ a when estimating F0 (a), we impose the following condi-
tions.

Assumption 2
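1. For $\mathcal{P} := \bigcup_{n=n_0}^{\infty} \mathcal{P}_n$, $Y_a := 1\{Y \le a\}$ satisfies

$$\lim_{\delta \searrow 0} \sup_{P \in \mathcal{P}} \sup_{d_{\mathcal{A}}(a,\bar{a}) \le \delta} \|Y_a - Y_{\bar{a}}\|_{P,2} = 0,$$

$$\sup_{P \in \mathcal{P}} E \sup_{a \in \mathcal{A}} |Y_a|^{2+c} < \infty,$$

where $a, \bar{a} \in \mathcal{A}$ and $\mathcal{A}$ is a totally bounded metric space equipped with a semimetric $d_{\mathcal{A}}$. The uniform covering number of the set $\mathcal{G}_5 := \{Y_a : a \in \mathcal{A}\}$ satisfies

$$\sup_{Q} \log N\left(\epsilon \|\mathcal{G}_5\|_{Q,2}, \mathcal{G}_5, \|\cdot\|_{Q,2}\right) \le C \log \frac{e}{\epsilon},$$

for all $P \in \mathcal{P}$, where $B_5(W) = \sup_{a \in \mathcal{A}} |Y_a|$ is an envelope function with the supremum taken over all finitely discrete probability measures Q on $(\mathcal{W}, \mathcal{X}_{\mathcal{W}})$.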
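2. For $d \in \{0, 1\}$, $P\left(\varepsilon_1 < g^0_{1d}(X) < 1 - \varepsilon_1\right) = 1$ and $P\left(\varepsilon_2 < g^0_{2d}(M,X) < 1 - \varepsilon_2\right) = 1$, where $\varepsilon_1, \varepsilon_2 \in (0, 1/2)$.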
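3. The models for estimating the nuisance parameters $v^0_a$ have functional forms

$$h_1\left(f(x)^{\top}\boldsymbol{\beta}_1\right), \quad h_2\left(f(m,x)^{\top}\boldsymbol{\beta}_2\right), \quad h_3\left(f(d,m,x)^{\top}\boldsymbol{\beta}_3\right), \quad h_4\left(f(d',x)^{\top}\boldsymbol{\beta}_4\right),$$

respectively, and satisfy the following conditions.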

3a. Functional forms of hi (.) The functions hi , i = 1, 2, 3 take the forms of commonly used
link functions
L = {Id, Λ, 1 − Λ, Φ, 1 − Φ} ,

where Id is the identity function, Λ is the logistic link, and Φ is the probit link. Function
h4 has the form of the identity function Id.

3b. Dictionary controls f (.) The dimension of exogenous variables is $\dim(X) = p$ and $\log p = o\left(n^{-1/3}\right)$. The functions $h_i$ contain a linear combination of dictionary controls f (.), where $\dim(f(x)) = p \times 1$ (dimension of X), $\dim(f(m,x)) = (p + 1) \times 1$ (X plus one mediator), $\dim(f(d,m,x)) = (p + 2) \times 1$ (X plus one mediator and one treatment variable), and $\dim(f(d',x)) = (p + 1) \times 1$ (X plus one treatment variable).

3c. Approximate sparsity The vectors of coefficients $\boldsymbol{\beta}_i$, $i = 1, \ldots, 4$ satisfy $\|\boldsymbol{\beta}_i\|_0 \le s_i$, where $\|\cdot\|_0$ denotes the $l_0$ norm and $s_i$ denotes the sparsity index. Furthermore, $\sum_{i=1}^{4} s_i \le s \ll n$ and

$$s^2 \log^2(p \vee n) \log^2 n \le \delta_n n.$$

Let $\bar{\boldsymbol{\beta}}_i$ denote estimators of $\boldsymbol{\beta}_i$. These estimators are sparse such that $\sum_{i=1}^{4} \|\bar{\boldsymbol{\beta}}_i\|_0 \le C' s$, where $C' < 1$ is some constant.

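3d. Gram matrix The empirical and population norms induced by the Gram matrix formed by the dictionary f are equivalent on sparse subsets:

$$\sup_{\|\delta\|_0 \le s \log n} \left| \frac{\|f^{\top}\delta\|_{\mathbb{P}_n,2}}{\|f^{\top}\delta\|_{P,2}} - 1 \right| \to 0$$

as $n \to \infty$, and also $\| \|f\|_{\infty} \|_{P,\infty} \le L_n$.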

3e. ∥X∥P,q < ∞.

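4. Given a random subset I of $[n] = \{1, \ldots, n\}$ of size n/K, let $\hat{\boldsymbol{\beta}}_i$ denote an estimate of the coefficient vector $\boldsymbol{\beta}_i$ defined in Assumption 2.3. These estimated nuisance parameters

$$\hat{v}_a := \left( h_1\left(f(X)^{\top}\hat{\boldsymbol{\beta}}_1\right), h_2\left(f(M,X)^{\top}\hat{\boldsymbol{\beta}}_2\right), h_3\left(f(D,M,X)^{\top}\hat{\boldsymbol{\beta}}_3\right), h_4\left(f(D,X)^{\top}\hat{\boldsymbol{\beta}}_4\right) \right) = \left(\hat{g}_{1d}(X), \hat{g}_{2d}(M,X), \hat{g}_{3a}(D,M,X), \hat{g}_{4ad}(D,X)\right)$$

satisfy the following conditions concerning their estimation quality. For $d \in \{0, 1\}$

$$P\left(\varepsilon_1 < \hat{g}_{1d}(X) < 1 - \varepsilon_1\right) = 1,$$

$$P\left(\varepsilon_2 < \hat{g}_{2d}(M,X) < 1 - \varepsilon_2\right) = 1,$$

where $\varepsilon_1, \varepsilon_2 > 0$. Let $\delta_n$ be a sequence converging to zero from above at a speed at most polynomial in n, e.g., $\delta_n \ge n^{-c}$ for some $c > 0$. With probability P at least $1 - \Delta_n$, for $d \in \{0, 1\}$, all $a \in \mathcal{A}$ and $q \ge 4$,

$$\left\|\hat{v}_a - v^0_a\right\|_{P,q} \le C,$$

$$\left\|\hat{v}_a - v^0_a\right\|_{P,2} \le \delta_n n^{-1/4},$$

$$\left\|\hat{g}_{1d}(X) - 0.5\right\|_{P,\infty} \le 0.5 - \epsilon,$$

$$\left\|\hat{g}_{2d}(M,X) - 0.5\right\|_{P,\infty} \le 0.5 - \epsilon,$$

$$\left\|\hat{g}_{1d}(X) - g^0_{1d}(X)\right\|_{P,2} \left\|\hat{g}_{2d}(M,X) - g^0_{2d}(M,X)\right\|_{P,2} \le \delta_n n^{-1/2},$$

$$\left\|\hat{g}_{1d}(X) - g^0_{1d}(X)\right\|_{P,2} \left\|\hat{g}_{3a}(D,M,X) - g^0_{3a}(D,M,X)\right\|_{P,2} \le \delta_n n^{-1/2},$$

$$\left\|\hat{g}_{1d}(X) - g^0_{1d}(X)\right\|_{P,2} \left\|\hat{g}_{4ad}(D,X) - g^0_{4ad}(D,X)\right\|_{P,2} \le \delta_n n^{-1/2},$$

$$\left\|\hat{g}_{2d}(M,X) - g^0_{2d}(M,X)\right\|_{P,2} \left\|\hat{g}_{3a}(D,M,X) - g^0_{3a}(D,M,X)\right\|_{P,2} \le \delta_n n^{-1/2}.$$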

Note that in Assumption 2.4, $\hat{v}_a$ is by definition constructed based on observations in the complement set $(W_{a,i})_{i \in I^c}$: $\hat{v}_a = \hat{v}_a\left((W_{a,i})_{i \in I^c}\right)$. When the random subset $I = I_k$, then $\hat{v}_a = \hat{v}_{k,a}$. Let $\mathbb{G}_n$ denote an empirical process $\mathbb{G}_n f(W) = \sqrt{n}\left(\mathbb{E}_n f(W) - E[f(W)]\right)$, where f is any $P \in \mathcal{P}_n$ integrable function on the set $\mathcal{W}$. Let $\mathbb{G}_P f(W)$ denote the limiting process of $\mathbb{G}_n f(W)$, which is a Gaussian process with zero mean and a finite covariance matrix $E\left[(f(W) - E[f(W)])(f(W) - E[f(W)])^{\top}\right]$ under probability P (the P-Brownian bridge).
Based on our notation and assumptions, we obtain the following result concerning the asymptotic behaviour of our estimator.
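Theorem 1 If Assumptions 1 and 2 hold, the K-fold cross-fitting estimator $\hat{\boldsymbol{\theta}}_a = K^{-1}\sum_{k=1}^{K} \hat{\boldsymbol{\theta}}^{(k)}_a$ for estimating $\mathbf{F}^0(a)$ satisfies

$$\sqrt{n}\left(\hat{\boldsymbol{\theta}}_a - \mathbf{F}^0(a)\right)_{a \in \mathcal{A}} = Z_{n,P} + o_P(1),$$

in $l^{\infty}(\mathcal{A})^4$, uniformly in $P \in \mathcal{P}_n$, where $Z_{n,P} := \left(\mathbb{G}_n\left(\boldsymbol{\psi}_a(W_a; v^0_a) - \boldsymbol{\theta}^0_a\right)\right)_{a \in \mathcal{A}}$. Furthermore,

$$Z_{n,P} \rightsquigarrow Z_P$$

in $l^{\infty}(\mathcal{A})^4$, uniformly in $P \in \mathcal{P}_n$, where $Z_P := \left(\mathbb{G}_P\left(\boldsymbol{\psi}_a(W_a; v^0_a) - \boldsymbol{\theta}^0_a\right)\right)_{a \in \mathcal{A}}$ and paths of $a \mapsto \mathbb{G}_P\left(\boldsymbol{\psi}_a(W_a; v^0_a) - \boldsymbol{\theta}^0_a\right)$ have the properties that uniformly in $P \in \mathcal{P}_n$,

$$\sup_{P \in \mathcal{P}_n} E\left[\sup_{a \in \mathcal{A}} \left\|\mathbb{G}_P\left(\boldsymbol{\psi}_a(W_a; v^0_a) - \boldsymbol{\theta}^0_a\right)\right\|\right] < \infty,$$

$$\lim_{\delta \to 0} \sup_{P \in \mathcal{P}_n} E\left[\sup_{d_{\mathcal{A}}(a,\bar{a}) \le \delta} \left\|\mathbb{G}_P\left(\boldsymbol{\psi}_a(W_a; v^0_a) - \boldsymbol{\theta}^0_a\right) - \mathbb{G}_P\left(\boldsymbol{\psi}_{\bar{a}}(W_{\bar{a}}; v^0_{\bar{a}}) - \boldsymbol{\theta}^0_{\bar{a}}\right)\right\|\right] = 0.$$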

Next, we establish the uniform validity of the multiplier bootstrap under Assumptions 1 and 2. As in Assumption 2.4, let $\hat{v}_a$ denote the cross-fitting estimator of the nuisance parameters, whose model parameters are estimated based on observations in the complement set, $W_{a,i}$ with $i \in I_k^c$. Recall the multiplier bootstrap estimator in equation (22) and the definition of the random variable ξ. By independence of ξ and $W_a$, we have that

$$E\left[\xi\left(\psi_{d,d',a}(W_a; \hat{v}_a) - \hat{\theta}_{d,d',a}\right)\right] = E[\xi]\, E\left[\psi_{d,d',a}(W_a; \hat{v}_a) - \hat{\theta}_{d,d',a}\right] = 0,$$

and therefore,

$$\sqrt{n}\left(\hat{\theta}^{*}_{d,d',a} - \hat{\theta}_{d,d',a}\right) = \mathbb{G}_n\, \xi\left(\psi_{d,d',a}(W_a; \hat{v}_a) - \hat{\theta}_{d,d',a}\right).$$

Let $\hat{\boldsymbol{\theta}}^{*}_a$ denote a vector containing the multiplier bootstrap estimators $\hat{\theta}^{*}_{d,d',a}$ for different $(d, d') \in \{0, 1\}^2$. We write the previous result in vector form as

$$\sqrt{n}\left(\hat{\boldsymbol{\theta}}^{*}_a - \hat{\boldsymbol{\theta}}_a\right) = \mathbb{G}_n\, \xi\left(\boldsymbol{\psi}_a(W_a; \hat{v}_{k,a}) - \hat{\boldsymbol{\theta}}_a\right),$$

and let $Z^{*}_{n,P} := \left(\mathbb{G}_n\, \xi\left(\boldsymbol{\psi}_a(W_a; \hat{v}_{k,a}) - \hat{\boldsymbol{\theta}}_a\right)\right)_{a \in \mathcal{A}}$. Then we obtain the following result on the asymptotic behavior of the multiplier bootstrap.

Theorem 2 If Assumptions 1 and 2 hold, the large sample law $Z_P$ of $Z_{n,P}$ can be consistently approximated by the bootstrap law $Z^{*}_{n,P}$:

$$Z^{*}_{n,P} \rightsquigarrow_B Z_P$$

uniformly over $P \in \mathcal{P}_n$ in $l^{\infty}(\mathcal{A})^4$.

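Let $\phi_{\tau}(F_X) := \inf\{a \in \mathbb{R} : F_X(a) \ge \tau\}$ be the $\tau$th quantile function of a random variable X whose c.d.f. is $F_X$. The von Mises expansion of $\phi_{\tau}(F_X)$ (p. 292 in van der Vaart (1998)) is given by:

$$\phi_{\tau}(\mathbb{E}_n) - \phi_{\tau}(E) = \frac{1}{\sqrt{n}}\phi'_{\tau,E}(\mathbb{G}_n) + \ldots + \frac{1}{m!}\frac{1}{n^{m/2}}\phi^{(m)}_{\tau,E}(\mathbb{G}_n) + \ldots,$$

where $\phi'_{\tau,E}(\cdot)$ is a linear derivative map and $\mathbb{G}_n$ denotes an empirical process: $\mathbb{G}_n f(W) = \sqrt{n}\left(\mathbb{E}_n f(W) - E[f(W)]\right)$.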

Let $\phi'_{\theta} := \left(\phi'_{\tau,\theta}\right)_{\tau \in \mathcal{T}}$, where $\theta = (\theta_a)_{a \in \mathcal{A}}$. Let $Q^0_{Y(d,M(d'))}(\tau) := \inf\left\{a \in \mathbb{R} : F^0_{Y(d,M(d'))}(a) \ge \tau\right\}$, $\hat{Q}_{Y(d,M(d'))}(\tau) := \inf\left\{a \in \mathbb{R} : \hat{\theta}_{d,d',a} \ge \tau\right\}$ and $\hat{Q}^{*}_{Y(d,M(d'))}(\tau) := \inf\left\{a \in \mathbb{R} : \hat{\theta}^{*}_{d,d',a} \ge \tau\right\}$. Let $\mathbf{Q}^0_{\tau}$, $\hat{\mathbf{Q}}_{\tau}$ and $\hat{\mathbf{Q}}^{*}_{\tau}$ denote the corresponding vectors containing $Q^0_{Y(d,M(d'))}(\tau)$, $\hat{Q}_{Y(d,M(d'))}(\tau)$ and $\hat{Q}^{*}_{Y(d,M(d'))}(\tau)$ over different $(d, d') \in \{0, 1\}^2$, respectively. We then obtain the following result of uniform validity for the estimation of quantiles, which can be proven by invoking the functional delta theorems (Theorems B.3 and B.4) of Belloni et al. (2017).

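Theorem 3 If Assumptions 1 and 2 hold,

$$\sqrt{n}\left(\hat{\mathbf{Q}}_{\tau} - \mathbf{Q}^0_{\tau}\right)_{\tau \in \mathcal{T}} \rightsquigarrow T_P := \phi'_{\theta}(Z_P),$$

$$\sqrt{n}\left(\hat{\mathbf{Q}}^{*}_{\tau} - \hat{\mathbf{Q}}_{\tau}\right)_{\tau \in \mathcal{T}} \rightsquigarrow_B T_P := \phi'_{\theta}(Z_P),$$

uniformly over $P \in \mathcal{P}_n$ in $l^{\infty}(\mathcal{T})^4$, where $\mathcal{T} \subset (0, 1)$, $T_P$ is a zero mean tight Gaussian process for each $P \in \mathcal{P}_n$ and $Z_P := \left(\mathbb{G}_P \boldsymbol{\psi}_a(W_a; v^0_a)\right)_{a \in \mathcal{A}}$.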
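The quantile definitions above amount to a generalized inverse of the estimated c.d.f. over the grid of values a. A minimal sketch, assuming the grid is sorted in ascending order and using the hypothetical helper names `quantile_from_cdf` and `quantile_effect`:

```python
def quantile_from_cdf(grid, cdf_values, tau):
    """Return inf{a in grid : cdf(a) >= tau}, or None if tau is never reached.

    grid must be sorted in ascending order; cdf_values[l] is the estimate
    of the c.d.f. at grid[l].
    """
    for a, p in zip(grid, cdf_values):
        if p >= tau:
            return a
    return None


def quantile_effect(grid, cdf_one, cdf_two, tau):
    """Difference of the tau-th quantiles of two potential outcome c.d.f.s."""
    q_one = quantile_from_cdf(grid, cdf_one, tau)
    q_two = quantile_from_cdf(grid, cdf_two, tau)
    if q_one is None or q_two is None:
        return None
    return q_one - q_two
```

Applying the same function to the bootstrap draws of the c.d.f. yields bootstrap quantiles, from which pointwise confidence intervals for the quantile effects can be read off. The result is only as fine as the grid, so a coarse grid produces step-shaped quantile estimates.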


4 Simulation
4.1 Simulation Design
This section presents a simulation study to examine the finite sample performance of the pro-
posed DML estimators in equations (17) and (18). We consider the following data-generating
process for the observed covariates X = (X1 , X2 , X3 ), where

$$X_1 = 0.75V_1 + 0.1V_2 + 0.15V_3,$$
$$X_2 = 0.15V_1 + 0.7V_2 + 0.15V_3,$$
$$X_3 = 0.14V_1 + 0.08V_2 + 0.78V_3,$$

with $V_1$, $V_2$ and $V_3$ being i.i.d. random variables following a chi-squared distribution with 1 degree of freedom. The binary treatment variable D is generated based on the following model:

$$D = 1\left\{0.371 + X^{\top}(0.198, 0.125, -0.323) + \varepsilon_D > 0\right\}.$$

For the binary mediator M, the data-generating process is

$$M = 1\left\{-0.070 + 0.710D + X^{\top}(-0.054, -0.482, 0.299) + \varepsilon_M > 0\right\}.$$

The model for outcome Y is defined as follows:

$$h(D, M, X) = 0.766 + 0.458D + 0.836DM + 0.383M + X^{\top}(0.640, 0.260, 0.474),$$

$$Y = h(D, M, X)^{-1}\varepsilon_Y.$$

The error terms (εY , εD , εM ) are mutually independent standard normal random variables and
also independent of X. Analytically computing the unconditional c.d.f. and quantiles of the
potential outcome Y (d, M (d′ )) is difficult for the data generating process considered. For this
reason, we use a Monte Carlo simulation to approximate the true values of the c.d.f. and
the quantiles. We draw 40 million observations of (εY , εM , V1 , V2 , V3 ) from their respective
true distributions, and for each observation, we calculate the corresponding potential outcomes
Y (d, M (d′ )) for (d, d′ ) ∈ {0, 1}2 . In the next step, we evaluate profiles of the empirical c.d.f.’s
and quantiles of the 40 million sampled potential outcomes and use the evaluated profiles as
approximations to their true profiles. In Figure 4 in the appendix, the upper panel shows the
approximate true profiles of the c.d.f.’s FY (d,M (d′ )) (a), while the lower panel provides the ap-
proximate true profiles of quantiles QY (d,M (d′ )) (τ ) for (d, d′ ) ∈ {0, 1}2 . Figure 5 in the appendix
depicts the approximate true profiles of NDQTE (NDQTE’), NIQTE (NIQTE’) and TQTE
across quantiles.
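The design and the Monte Carlo approximation of the true c.d.f. can be sketched as follows. This is an illustrative reimplementation, not the authors' simulation code: the function names are hypothetical, the number of draws is far below the 40 million used in the paper, and only the Python standard library is used.

```python
import random


def draw_covariates(rng):
    """Draw X = (X1, X2, X3) as mixtures of three chi-squared(1) variables."""
    v = [rng.gauss(0.0, 1.0) ** 2 for _ in range(3)]  # chi-squared, 1 d.f.
    x1 = 0.75 * v[0] + 0.10 * v[1] + 0.15 * v[2]
    x2 = 0.15 * v[0] + 0.70 * v[1] + 0.15 * v[2]
    x3 = 0.14 * v[0] + 0.08 * v[1] + 0.78 * v[2]
    return (x1, x2, x3)


def h(d, m, x):
    """Scale function of the outcome model."""
    return (0.766 + 0.458 * d + 0.836 * d * m + 0.383 * m
            + 0.640 * x[0] + 0.260 * x[1] + 0.474 * x[2])


def mediator(d, x, eps_m):
    """Potential mediator M(d) from the threshold-crossing model."""
    index = (-0.070 + 0.710 * d
             - 0.054 * x[0] - 0.482 * x[1] + 0.299 * x[2] + eps_m)
    return 1 if index > 0 else 0


def potential_outcome(d, d_prime, x, eps_y, eps_m):
    """Potential outcome Y(d, M(d'))."""
    return eps_y / h(d, mediator(d_prime, x, eps_m), x)


def approx_true_cdf(d, d_prime, grid, n_draws=100_000, seed=0):
    """Empirical c.d.f. of Y(d, M(d')) on the grid from simulated draws."""
    rng = random.Random(seed)
    counts = [0] * len(grid)
    for _ in range(n_draws):
        x = draw_covariates(rng)
        y = potential_outcome(d, d_prime, x,
                              rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        for j, a in enumerate(grid):
            if y <= a:
                counts[j] += 1
    return [c / n_draws for c in counts]
```

Using the same seed for all four combinations of (d, d') reuses the same underlying draws, so differences between the approximated c.d.f.s reflect the treatment and mediator states rather than simulation noise.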
In our simulation design, the nuisance parameters have the following functional forms:

$$F_{Y|D,M,X}(a|D,M,X) = \Phi\left(\beta_{0,a} + \alpha_{0,a}D + \alpha_{1,a}DM + \beta_{1,a}M + X^{\top}\boldsymbol{\beta}_{2,a}\right),$$
$$f_{D|X}(D = 1|X) = \Phi\left(\lambda_0 + X^{\top}\boldsymbol{\lambda}_1\right),$$
$$f_{M|D,X}(M = 1|D,X) = \Phi\left(b_0 + a_0 D + X^{\top}\mathbf{b}_1\right),$$

where Φ (.) is the c.d.f. of a standard normal random variable. The vector of parameters satisfies $\left(\beta_{0,a}, \alpha_{0,a}, \alpha_{1,a}, \beta_{1,a}, \boldsymbol{\beta}^{\top}_{2,a}\right) = a \times \left(\beta_0, \alpha_0, \alpha_1, \beta_1, \boldsymbol{\beta}^{\top}_2\right)$. When running the simulations, we also include a set of auxiliary (exogenous) variables: $X^{aug} := \left(X^{aug}_1, X^{aug}_2, \ldots, X^{aug}_J\right)$, $X^{aug}_j = U_{1j}(V_1 + V_2 + V_3) + U_{2j}(Z_j)^2$, $j = 1, \ldots, J$, where $U_{1j} \sim$ i.i.d. $U(0, 0.2)$, $U_{2j} \sim$ i.i.d. $U(0.8, 1)$. $Z := (Z_1, Z_2, \ldots, Z_J)$ follows a multivariate normal distribution with a mean vector 0 and a covariance matrix with elements $0.5^{|j-l|}$, $j, l = 1, \ldots, J$. $V_1$, $V_2$, $V_3$, $U_{1j}$, $U_{2j}$, $Z$ and the error terms $(\varepsilon_Y, \varepsilon_D, \varepsilon_M)$ are mutually independent. Depending on the realized values of $U_{1j}$ and $U_{2j}$, the correlations $cor\left(X_1, X^{aug}_j\right)$, $cor\left(X_2, X^{aug}_j\right)$, $cor\left(X_3, X^{aug}_j\right)$ vary, and on average they amount to 0.139, 0.147 and 0.135, respectively.
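As a brief check of why the conditional c.d.f. of the outcome takes a probit form with coefficients proportional to a, note that $h(D, M, X) > 0$ in this design: the components of X are non-negative mixtures of chi-squared variables, D and M are binary, and all coefficients in h are positive. Multiplying by h therefore preserves the inequality, and since $\varepsilon_Y$ is standard normal and independent of $(D, M, X)$,

$$F_{Y|D,M,X}(a|D,M,X) = P\left(h(D,M,X)^{-1}\varepsilon_Y \le a \,|\, D,M,X\right) = P\left(\varepsilon_Y \le a\, h(D,M,X) \,|\, D,M,X\right) = \Phi\left(a\, h(D,M,X)\right).$$

Matching terms with the outcome model gives $\left(\beta_0, \alpha_0, \alpha_1, \beta_1, \boldsymbol{\beta}^{\top}_2\right) = (0.766, 0.458, 0.836, 0.383, 0.640, 0.260, 0.474)$. The same argument applied to the threshold-crossing models for D and M gives $\lambda_0 = 0.371$, $\boldsymbol{\lambda}_1 = (0.198, 0.125, -0.323)^{\top}$, $b_0 = -0.070$, $a_0 = 0.710$ and $\mathbf{b}_1 = (-0.054, -0.482, 0.299)^{\top}$.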

4.2 The Post Lasso Estimator


Let $W^{aug}_{a,i} = \left(Y_i, D_i, M_i, X_i, X^{aug}_i\right)$ denote the ith observation of the simulated data. When applying K-fold cross-fitting to $W^{aug}_{a,i}$, we estimate the models of the nuisance parameters based on post-lasso regression: we first estimate the models by lasso regression and then re-estimate the models without (lasso) penalization when including only those regressors with non-zero coefficients in the respective previous lasso steps. We denote the lasso estimator of the coefficients of $F_{Y|D,M,X}(a|D,M,X)$ in K-fold cross-fitting by

$$\hat{\boldsymbol{\gamma}}_{Y,a} \in \arg\min_{\boldsymbol{\gamma}_{Y,a} \in \mathbb{R}^p} \frac{1}{|I_k^c|}\sum_{i \in I_k^c} L\left(W^{aug}_{a,i}; \boldsymbol{\gamma}_{Y,a}\right) + \frac{\gamma}{|I_k^c|}\left\|\hat{\Psi}\boldsymbol{\gamma}_{Y,a}\right\|_1, \qquad (24)$$

where p is the number of covariates, |Ikc | is the number of observations in the complement set
Ikc , γ is the penalty parameter, ∥.∥1 denotes the l1 norm and Ψ̂ is a diagonal matrix of penalty
loadings. Here, the loss function L (.) corresponds to that in equation (21) and the link function
Ga (.) is Φ (.). When solving the lasso estimation problem of expression (24), we only impose
a penalty on (X, X aug ) and therefore, the first four diagonal elements (for the intercept term,
D, M and DM ) of Ψ̂ are zeros, while the remaining diagonal elements are ones. The value of
the penalty parameter is determined based on the procedure outlined in Belloni et al. (2017).
The other nuisance parameters fD|X (d|X) and fM |D,X (m|D, X) are estimated in an analogous
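way and we denote by $\hat{\boldsymbol{\gamma}}_D$ and $\hat{\boldsymbol{\gamma}}_M$ their corresponding lasso estimators. Let $\tilde{\Xi}$ denote the union of variables in $(X, X^{aug})$ with non-zero lasso coefficient estimates in one or several lasso regressions of the three nuisance parameters, with $\tilde{\Xi} \subseteq supp\left(\hat{\boldsymbol{\gamma}}_{Y,a}\right) \cup supp\left(\hat{\boldsymbol{\gamma}}_D\right) \cup supp\left(\hat{\boldsymbol{\gamma}}_M\right)$.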
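The post-lasso estimator of $F_{Y|D,M,X}(a|D,M,X)$ is defined as

$$\tilde{\boldsymbol{\gamma}}_{Y,a} \in \arg\min_{\boldsymbol{\gamma}_{Y,a} \in \mathbb{R}^p} \frac{1}{|I_k^c|}\sum_{i \in I_k^c} L\left(W^{aug}_{a,i}; \boldsymbol{\gamma}_{Y,a}\right) : supp\left(\boldsymbol{\gamma}_{Y,a}\right) \subseteq supp\left(\hat{\boldsymbol{\gamma}}_{Y,a}\right) \cup \tilde{\Xi}. \qquad (25)$$

The post-lasso estimators of $f_{D|X}(d|X)$ and $f_{M|D,X}(m|D,X)$ are obtained analogously. Based on the post-lasso estimates of the coefficients, we estimate the nuisance parameters among observations $W^{aug}_{a,i}$, $i \in I_k$, and use them to compute estimators (16) or (19).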
When using the estimator based on equation (17), we approximate the nuisance parameter $f_{D|M,X}(d|M,X)$ by a probit model $\Phi\left(\lambda_2 + \lambda_3 M + X^{\top}\boldsymbol{\lambda}_4\right)$. The post-lasso approach for estimating $f_{D|M,X}(d|M,X)$ is the same as before. To estimate $E\left[F_{Y|D,M,X}(a|D,M,X) \,|\, d', X\right]$, we approximate $E\left[F_{Y|D,M,X}(a|D,M,X) \,|\, D, X\right]$ by a linear model $\beta_3 + \beta_4 D + X^{\top}\boldsymbol{\beta}_5$. We calculate post-lasso estimates of $F_{Y|D,M,X}(a|D,M,X)$ among observations in the complement set, $W^{aug}_{a,i}$ with $i \in I_k^c$, and estimate $(\beta_3, \beta_4, \boldsymbol{\beta}_5)$ by linearly regressing these estimates on D and those covariates previously selected for computing the post-lasso estimate of $F_{Y|D,M,X}(a|D,M,X)$. We then use the linear regression coefficients coming from the complement set to make cross-fitted predictions among observations with indices $i \in I_k$ and $D_i = d'$, which serve as estimates of $E\left[F_{Y|D,M,X}(a|D,M,X) \,|\, d', X\right]$.

4.3 Simulation Results


To evaluate the performance of the proposed DML estimators of the c.d.f.’s of the potential outcomes across grid values a, we calculate integrated mean squared error (IMSE) and integrated Anderson–Darling weighted MSE (IWMSE) for each simulation:

$$IMSE = \int_{a \in \mathcal{A}} \left[\hat{F}_{Y(d,M(d'))}(a) - F_{Y(d,M(d'))}(a)\right]^2 dF_{Y(d,M(d'))}(a), \qquad (26)$$

$$IWMSE = \int_{a \in \mathcal{A}} \frac{\left[\hat{F}_{Y(d,M(d'))}(a) - F_{Y(d,M(d'))}(a)\right]^2}{F_{Y(d,M(d'))}(a)\left(1 - F_{Y(d,M(d'))}(a)\right)} dF_{Y(d,M(d'))}(a). \qquad (27)$$

Table 1: Integrated mean squared error (IMSE) and integrated Anderson–Darling weighted MSE (IWMSE) for estimated potential outcome distributions

Estimator $\hat{\theta}_{d,d',a}$
                        IMSE                       IWMSE
                2,500   5,000   10,000     2,500   5,000   10,000
F_Y(1,M(1))     0.114   0.056   0.027      0.640   0.315   0.149
F_Y(1,M(0))     0.159   0.079   0.038      0.891   0.443   0.213
F_Y(0,M(1))     0.314   0.130   0.062      1.555   0.661   0.315
F_Y(0,M(0))     0.211   0.100   0.047      1.060   0.506   0.242

Estimator $\hat{\theta}'_{d,d',a}$
                        IMSE                       IWMSE
                2,500   5,000   10,000     2,500   5,000   10,000
F_Y(1,M(1))     0.127   0.057   0.028      0.690   0.316   0.157
F_Y(1,M(0))     0.169   0.078   0.037      0.927   0.437   0.209
F_Y(0,M(1))     0.267   0.130   0.066      1.360   0.661   0.336
F_Y(0,M(0))     0.214   0.107   0.051      1.080   0.539   0.260

To assess the performance of the estimators of the quantiles of the potential outcomes, $Q_{Y(d,M(d'))}$, we compute the integrated absolute error (IAE) across ranks τ:

$$IAE = \frac{1}{|\mathcal{T}|}\sum_{\tau \in \mathcal{T}} \left|\hat{Q}_{Y(d,M(d'))}(\tau) - Q_{Y(d,M(d'))}(\tau)\right|, \qquad (28)$$

where T is the grid of ranks, which we set to (0.05, 0.06, . . . , 0.95). Furthermore, we calculate
the IAE for the estimators of the quantile treatment effects defined in equations (1) to (5).
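The three performance measures can be approximated on a grid as sketched below. The discretization is an assumption of this sketch rather than a description of the authors' computation: the integral with respect to the true c.d.f. is replaced by a sum weighted by the increments of the true c.d.f. between consecutive grid points, with the true c.d.f. taken to be zero below the first grid point. The function names are hypothetical.

```python
def integrated_mse(f_hat, f_true, weighted=False):
    """Grid approximation of IMSE (26), or of IWMSE (27) if weighted=True.

    f_hat[l] and f_true[l] are the estimated and true c.d.f. at grid point
    a_l (ascending). For the weighted version, f_true must lie strictly
    between 0 and 1 at every grid point.
    """
    total = 0.0
    previous = 0.0
    for fh, ft in zip(f_hat, f_true):
        mass = ft - previous  # increment of the true c.d.f.
        err = (fh - ft) ** 2
        if weighted:
            err /= ft * (1.0 - ft)  # Anderson-Darling weight
        total += err * mass
        previous = ft
    return total


def integrated_abs_error(q_hat, q_true):
    """IAE (28): mean absolute quantile error over the grid of ranks."""
    return sum(abs(a - b) for a, b in zip(q_hat, q_true)) / len(q_true)
```

The Anderson–Darling weight inflates errors where the true c.d.f. is close to 0 or 1, so the weighted measure is more sensitive to errors in the tails than the unweighted one.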
In the simulations, we set K = 3 for 3-fold cross-fitting. The number of auxiliary variables is
J = 250 and we consider sample sizes 2,500, 5,000 and 10,000 observations in the simulations.
The reported performance measures are averages for (d, d′ ) ∈ {0, 1}2 over 1,000 simulations.
Table 1 reports the results for the IMSE and IWMSE (scaled by 1,000), Table 2 those for
the IAE. All the performance measures behave rather favorably. As the sample size increases,
the performance measures (and thus, estimation errors) decline sharply. However, for different
combinations of (d, d′ ), the levels of the performance measures are different, especially when the
sample size is small. Estimation errors are significantly larger if (d, d′ ) = (0, 1) and (0, 0), rather
than (d, d′ ) = (1, 1) and (1, 0). This is also reflected by the performance measures of NDQTE
(NDQTE’) and TQTE, which point to higher errors than those of NIQTE (NIQTE’). When
comparing the performance measures of the two estimators based on equations (17) and (18),
we find some differences in their levels when the sample size is small. However, the differences
vanish as the sample size increases, which suggests that the two estimators perform equally well asymptotically in the simulation design considered.

Table 2: Integrated absolute error (IAE) for estimated quantiles of potential outcomes and quantile treatment effects

                $\hat{\theta}_{d,d',a}$              $\hat{\theta}'_{d,d',a}$
                2,500   5,000   10,000     2,500   5,000   10,000
Q_Y(1,M(1))     0.010   0.007   0.005      0.010   0.007   0.005
Q_Y(1,M(0))     0.014   0.010   0.007      0.014   0.010   0.007
Q_Y(0,M(1))     0.028   0.022   0.017      0.028   0.022   0.018
Q_Y(0,M(0))     0.031   0.026   0.022      0.031   0.026   0.022
NDQTE           0.034   0.027   0.023      0.034   0.028   0.023
NDQTE’          0.030   0.023   0.018      0.029   0.023   0.018
NIQTE           0.008   0.006   0.004      0.008   0.006   0.004
NIQTE’          0.016   0.012   0.010      0.016   0.012   0.010
TQTE            0.033   0.026   0.022      0.033   0.027   0.023

5 Empirical Application
5.1 The Job Corps Data
We apply the proposed estimators of natural direct and indirect quantile treatment effects to
data from the National Job Corps Study, in order to evaluate the impact of the Job Corps (JC)
training program on earnings of young individuals with disadvantaged backgrounds. JC is the
largest and most comprehensive job training program for disadvantaged youth in the US. It
provides participants with vocational training and/or classroom education, housing, and board
over an average duration of 8 months. Participants also receive health education as well as
health and dental care. Schochet et al. (2001) and Schochet et al. (2008) assess the average
effects of random assignment to JC on several labor market outcomes and find it to increase
education, employment, and earnings in the longer run. Other contributions evaluate more
specific aspects or components of JC, like the average effect of the time spent in training or
of particular training sequences on employment and earnings, see e.g. Flores et al. (2012) and
Bodory et al. (2022).
Furthermore, several studies conduct mediation analyses to assess the average direct and
indirect effects of program participation. Flores and Flores-Lagunes (2009) and Huber (2014)
consider work experience or employment as mediators, respectively, and find positive direct ef-
fects of JC on earnings and general health, respectively, when invoking a selection-on-observables
assumption. Flores and Flores-Lagunes (2010) avoid the latter assumption by adopting a partial
identification approach, with which they compute upper and lower bounds for the causal
mechanisms of JC when considering the achievement of a GED, high school degree, or voca-
tional degree as mediators. Under their strongest set of bounding assumptions, they find a
positive direct effect on labor market outcomes, net of the indirect mechanism via obtaining
a degree. Frölich and Huber (2017) base their mediation analysis on separate instrumental
variables for the treatment and the mediator and find a positive indirect effect of training on earnings through an increase in the number of hours worked. We contribute to the causal mediation literature on the effectiveness of the JC program by considering quantile treatment effects across different ranks of the potential outcome distributions, which provides more insights on effect heterogeneity than the evaluation of average effects.

Table 3: Estimates of Average Effects
            TE       NDE      NIE      NDE’     NIE’
Effect      16.591   16.995   -0.403   16.586   0.005
Std.err     3.740    3.747    0.190    3.770    0.553
p-value     0.000    0.000    0.034    0.000    0.992
For our empirical analysis, we consider the JC data provided in the causalweight package
by Bodory and Huber (2022) for the statistical software R, which is a processed data set with
9,240 observations that contains a subset of the variables available in the original National Job
Corps Study. Our outcome of interest is weekly earnings in the third year after the assign-
ment (the variable earny3 in the JC data frame), while the treatment is a binary indicator for
participation in any (classroom-based or vocational) training in the first year after program
assignment (trainy1). We aim at assessing whether training directly affects the earnings out-
come, and whether it also has an indirect effect by affecting health. For this reason, we consider
general health one year after program assignment (health12) as mediator, a categorical variable
ranging from 1 (excellent health) to 4 (poor health). The motivation is that participation in
training aimed at increasing human capital and labor market perspectives may have an impact
on mental health, which in turn may affect labor market success. Furthermore, JC might also
affect physical health through health education and health/dental care, which can influence
labor market success, too. For this reason, we aim at disentangling the direct earnings effect of
training and its indirect effect operating via health.
The data set also contains 28 pre-treatment covariates, which include socio-economic in-
formation such as a study participant’s gender, age, ethnicity, (own) education and parents’
education, mother tongue, marital status, household size, previous employment, earnings and
welfare receipt, health status, smoking behavior, alcohol consumption, and whether a study
participant has at least one child. We assume that sequential ignorability of the treatment
and the mediator holds conditional on these observed characteristics, implying that they permit
controlling for any factors jointly affecting training participation and the earnings outcome,
training participation and health 12 months after assignment, or health and earnings. To make
lasso-based estimation of the nuisance parameters in our DML approach more flexible, we create
interaction terms between all of the 28 covariates and squared terms for any non-binary covari-
ates. This entails a total of 412 control variables that include both the original covariates and
the higher order/interaction terms which we include in our DML approach. Table 4 provides
summary statistics for the outcome, the treatment, the mediator and the covariates.

5.2 Effect Estimates


Before considering quantile treatment effects, we first estimate the average direct and indirect effects by a K-fold cross-fitting estimator based on Theorem 2 in Farbmacher et al. (2022), as implemented in the causalweight package for R. Table 3 reports the estimated average total effect (TE) of training, the average natural direct effects (NDE and NDE’) and the average natural indirect effects (NIE and NIE’) operating via general health. The TE estimate (Effect) suggests that participation in JC increases average weekly earnings in the third year by roughly 16 to 17 USD. As the estimated mean potential outcome under non-treatment amounts to approximately 161 USD, the program increases weekly earnings by roughly 10% according to our estimate. The TE is highly statistically significant as the standard error (Std.err) of 3.740 is rather low relative to the effect estimate, such that the p-value is close to zero.
The total effect seems to be predominantly driven by the direct impact of training on earn-
ings, as both NDE and NDE’ are of similar magnitude as TE and highly statistically significant.
In contrast, the indirect effect under non-treatment (d = 0), NIE’, is close to zero and insignif-
icant, while that under treatment (d = 1), NIE, amounts to -0.403 USD and is statistically
significant at the 5% level. Bearing in mind that the health mediator is inversely coded (a
smaller value implies better health), this negative estimate suggests a positive average indirect
effect of training participation on earnings under treatment, which is, however, rather modest.
Furthermore, the effect heterogeneity across NIE and NIE’ points to moderate interaction effects
of the treatment and the mediator: the impact of health on earnings appears to be somewhat
more important under training than without training.

Figure 2: Estimates of the TQTE across ranks 0.2 to 0.9 (solid lines), based on inverting θ̂d,d′ ,a .
95% confidence intervals (dashed lines) are based on the multiplier bootstrap.

The average effects might mask interesting effect heterogeneity across ranks of the earnings
distribution. For this reason, we estimate the total quantile treatment effect (TQTE), natural
direct quantile treatment effects (NDQTE and NDQTE’) and natural indirect quantile treat-
ment effects (NIQTE and NIQTE’) across ranks (τ ) 0.2 to 0.9. To this end, we invert our
K-fold cross-fitting estimator θ̂d,d′ ,a of equation (17) and estimate the nuisance parameters by
post-lasso regression as outlined in Section 4.2. Figures 2 and 3 depict the estimates of the
causal effects (on the y-axis) across τ (on the x-axis), which correspond to the solid lines in the
respective graphs. The dashed lines provide the 95% confidence intervals based on the multiplier
bootstrap introduced in Section 2.6.
The quantile treatment effects are by and large in line with the average treatment effects.
TQTE, NDQTE and NDQTE’ are statistically significantly positive at the 5% level across almost all ranks τ considered and generally quite similar to each other. In contrast, all of the NIQTE estimates (the indirect effects under d = 1) are relatively close to zero and statistically insignificant.
The majority of the NIQTE’ estimates (the indirect effects under d = 0) are not statistically sig-
nificantly different from zero either. However, several of the negative effects measured at lower
ranks (roughly between the 0.2th and 0.4th quantiles) are marginally statistically significant and
point to an earnings-increasing indirect effect under non-treatment (due to inverse coding of the
health mediator). This potentially interesting pattern is averaged out when considering NIE’
(the average indirect effect under d = 0), which we found to be virtually zero and insignificant,
see Table 3. Finally, the non-monotonic shape of the point estimates of TQTE, NDQTE and
NDQTE’ across ranks τ suggests heterogeneous effects at different quantiles of the potential
earnings distributions. At the same time, the width of the confidence intervals suggests that the
null hypothesis of homogeneous effects cannot be rejected for most of the quantiles considered.

Figure 3: Estimates of the NDQTE, NIQTE, NDQTE’ and NIQTE’ across ranks 0.2 to 0.9
(solid lines), based on inverting θ̂d,d′ ,a . 95% confidence intervals (dashed lines) are based on the
multiplier bootstrap.

6 Conclusion
We proposed a DML approach for estimating natural direct and indirect quantile treatment
effects under a sequential ignorability assumption. The method relies on the efficient score
functions of the potential outcomes’ cumulative distribution functions, which are inverted to
compute the quantiles as well as the treatment effects (as the differences in potential outcomes at
those quantiles). The robustness property of the efficient score functions permits estimating the
nuisance parameters (outcome, treatment, and mediator models) by machine learning and cross-
fitting avoids overfitting bias. We demonstrated that our quantile treatment effect estimators are
root-n-consistent and asymptotically normal. Furthermore, we suggested a multiplier bootstrap
and demonstrated its consistency for uniform statistical inference. We also investigated the finite
sample performance of our estimators by means of a simulation study. Finally, we applied our
method to data from the National Job Corps Study to evaluate the direct earnings effects of
training across the earnings distribution, as well as the indirect effects operating via general
health. We found positive and statistically significant direct effects across a large range of the
earnings quantiles, while the indirect effects were generally close to zero and mostly statistically
insignificant.

A Appendix
A.1 Proof of Proposition 1
Proof. The proof relies on using Assumptions 1.1 through 1.4. Under these assumptions, it can be
shown that:
\begin{align*}
F_{Y(d,M(d'))}(a)
&= \int P\left(Y(d,M(d')) \le a \mid X=x\right) f_X(x)\,dx && \text{(by iterated expectation)} \\
&= \iint P\left(Y(d,m) \le a \mid M(d')=m, X=x\right) dP\left(M(d')=m \mid X=x\right) f_X(x)\,dx && \text{(by iterated expectation)} \\
&= \iint P\left(Y(d,m) \le a \mid D=d', M(d')=m, X=x\right) dP\left(M(d')=m \mid D=d', X=x\right) f_X(x)\,dx && \text{(by Assumption 2)} \\
&= \iint P\left(Y(d,m) \le a \mid D=d', M=m, X=x\right) dP\left(M=m \mid D=d', X=x\right) f_X(x)\,dx && \text{(by Assumption 1)} \\
&= \iint P\left(Y(d,m) \le a \mid D=d', X=x\right) dP\left(M=m \mid D=d', X=x\right) f_X(x)\,dx && \text{(by Assumption 3)} \\
&= \iint P\left(Y(d,m) \le a \mid D=d, X=x\right) dP\left(M=m \mid D=d', X=x\right) f_X(x)\,dx && \text{(by Assumption 2)} \\
&= \iint P\left(Y(d,m) \le a \mid D=d, M=m, X=x\right) dP\left(M=m \mid D=d', X=x\right) f_X(x)\,dx && \text{(by Assumption 3)} \\
&= \iint P\left(Y \le a \mid D=d, M=m, X=x\right) dP\left(M=m \mid D=d', X=x\right) f_X(x)\,dx && \text{(by Assumption 1)} \\
&= \iint F_{Y|D,M,X}(a|d,m,x)\, f_{M|D,X}(m|d',x)\, f_X(x)\,dm\,dx.
\end{align*}
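
The identification result can be illustrated with a small simulation. The sketch below is not taken from the paper: the data-generating process, the sample size, and the names `m_pot` and `y_pot` are hypothetical choices, made so that sequential ignorability holds by construction. It compares the distribution function of the potential outcome Y(d, M(d′)), which is observable only in a simulation, with the mediation formula evaluated on the observed variables (Y, D, M, X).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_prime, a = 200_000, 1, 0, 1.5

# Hypothetical data-generating process (an assumption, not the paper's design).
X = rng.integers(0, 2, n)
D = (rng.random(n) < 0.3 + 0.4 * X).astype(int)
U = rng.random(n)          # mediator noise, independent of D given X
eps = rng.normal(size=n)   # outcome noise, independent of D and M given X

def m_pot(dd):
    """Potential mediator M(dd)."""
    return (U < 0.2 + 0.5 * dd + 0.2 * X).astype(int)

def y_pot(dd, m):
    """Potential outcome Y(dd, m)."""
    return dd + m + X + eps

# Observed mediator and outcome.
M = np.where(D == 1, m_pot(1), m_pot(0))
Y = y_pot(D, M)

# Target F_{Y(d, M(d'))}(a), available only because potential outcomes are simulated.
truth = np.mean(y_pot(d, m_pot(d_prime)) <= a)

# Mediation formula: sum over (m, x) of F(a | d, m, x) f(m | d', x) f(x).
formula = 0.0
for x in (0, 1):
    f_x = np.mean(X == x)
    for m in (0, 1):
        F_a = np.mean(Y[(D == d) & (M == m) & (X == x)] <= a)
        f_m = np.mean(M[(D == d_prime) & (X == x)] == m)
        formula += F_a * f_m * f_x

print(truth, formula)
```

Under these assumptions the two numbers should agree up to sampling error; the agreement would be expected to fail if, for example, `eps` were made to depend on D.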

A.2 Derivations of the EIF


The derivation of the efficient influence function (EIF) of an estimand is based on calculating Gateaux
derivatives for the estimand. Let P denote the true data generating distribution and Ψ (P ) the estimand
of interest, which is a statistical functional of P . The Gateaux derivative of Ψ (.) measures how the
estimand Ψ (.) changes as P shifts in the direction of another distribution, say P̃ . Let Pt = tP̃ +(1 − t) P ,
where t ∈ [0, 1]. Formally, the Gateaux derivative of estimand Ψ (.) when changing P in the direction of
P̃ is defined as
\[
\lim_{t \downarrow 0} \frac{\Psi(P_t) - \Psi(P)}{t} = \frac{d}{dt}\Psi(P_t)\Big|_{t=0}, \tag{A.1}
\]

if the limit exists. It can be shown that under certain regularity conditions, the
EIF of Ψ (P ) under the distribution P̃ is equal to the Gateaux derivative (A.1) (Hines et al., 2022). This
fact provides a convenient way of deriving the EIF. Following Hines et al. (2022), we use the strategy
of “point mass contamination” to derive the EIF of FY (d,M (d′ )) (a). Specifically, we consider P̃ to be
a point mass of a single observation, say õ, and then the EIF of Ψ (P ) evaluated at õ is equal to the
Gateaux derivative (A.1). This derivation strategy appears attractive when the treatment variable D is

discrete.3 Let
\[
1_{\tilde{o}}(o) = \begin{cases} 1 & \text{if } o = \tilde{o}, \\ 0 & \text{otherwise} \end{cases}
\]

denote the Dirac delta function with respect to õ. If the density function for P is $f_O(o)$, the density
function for $P_t$ is $f_O^t(o) = t\,1_{\tilde{o}}(o) + (1-t) f_O(o)$ and
\[
\frac{d}{dt} f_O^t(o)\Big|_{t=0} = 1_{\tilde{o}}(o) - f_O(o),
\]
and $f_O^t(o) = f_O(o)$ when t = 0. Under Assumptions 1.1 to 1.4, it follows from Proposition 1 that
\begin{align*}
F_{Y(d,M(d'))}(a) &= \iint F_{Y|D,M,X}(a|d,m,x)\, f_{M|D,X}(m|d',x)\, f_X(x)\,dm\,dx \\
&= \iiint 1\{y \le a\}\, \frac{f_{Y,D,M,X}(y,d,m,x)\, f_{M,D,X}(m,d',x)}{f_{D,M,X}(d,m,x)\, f_{D,X}(d',x)}\, f_X(x)\,dy\,dm\,dx.
\end{align*}

Let
\[
\Psi(P_t) := \iiint 1\{y \le a\}\, \frac{f^t_{Y,D,M,X}(y,d,m,x)\, f^t_{M,D,X}(m,d',x)}{f^t_{D,M,X}(d,m,x)\, f^t_{D,X}(d',x)}\, f^t_X(x)\,dy\,dm\,dx.
\]
We would like to calculate the Gateaux derivative:
\[
\frac{d}{dt}\Psi(P_t)\Big|_{t=0} = \frac{d}{dt}\left( \iiint 1\{y \le a\}\, \frac{f^t_{Y,D,M,X}(y,d,m,x)\, f^t_{M,D,X}(m,d',x)}{f^t_{D,M,X}(d,m,x)\, f^t_{D,X}(d',x)}\, f^t_X(x)\,dy\,dm\,dx \right)\Big|_{t=0}.
\]

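It can be shown that
\begin{align}
\frac{d}{dt}\Psi(P_t)\Big|_{t=0}
&= \iiint 1\{y \le a\} \left( \frac{f^t_{Y,D,M,X}(y,d,m,x)\, f^t_{M,D,X}(m,d',x)}{f^t_{D,M,X}(d,m,x)\, f^t_{D,X}(d',x)}\, \frac{d}{dt} f^t_X(x) \right)\Big|_{t=0} dy\,dm\,dx \tag{A.2} \\
&\quad + \iiint 1\{y \le a\}\, \frac{d}{dt}\left( \frac{f^t_{Y,D,M,X}(y,d,m,x)\, f^t_{M,D,X}(m,d',x)}{f^t_{D,M,X}(d,m,x)\, f^t_{D,X}(d',x)} \right) f^t_X(x) \Big|_{t=0} dy\,dm\,dx. \tag{A.3}
\end{align}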

Considering the expression within the integral of (A.2),
\begin{align*}
\left( \frac{f^t_{Y,D,M,X}(y,d,m,x)\, f^t_{M,D,X}(m,d',x)}{f^t_{D,M,X}(d,m,x)\, f^t_{D,X}(d',x)}\, \frac{d}{dt} f^t_X(x) \right)\Big|_{t=0}
&= \frac{f^t_{Y,D,M,X}(y,d,m,x)\, f^t_{M,D,X}(m,d',x)}{f^t_{D,M,X}(d,m,x)\, f^t_{D,X}(d',x)}\Big|_{t=0} \times \left[ 1_{\tilde{x}}(x) - f_X(x) \right] \\
&= f_{Y|D,M,X}(y|d,m,x)\, f_{M|D,X}(m|d',x)\, 1_{\tilde{x}}(x) \\
&\quad - f_{Y|D,M,X}(y|d,m,x)\, f_{M|D,X}(m|d',x)\, f_X(x).
\end{align*}

Therefore, (A.2) can be further expressed as
\[
E\left[ F_{Y|D,M,X}(a|d,M,\tilde{x}) \mid d', \tilde{x} \right] - E\left[ g_{d,d',a}(X) \right]. \tag{A.4}
\]
3 Notice that if D is not discrete, this strategy cannot be used, and the derivation needs to rely on other
methods instead; see Fisher and Kennedy (2019), Levy (2019), and Ichimura and Newey (2022).

Considering the expression within the integral of (A.3),
\begin{align}
\frac{d}{dt}\left( \frac{f^t_{Y,D,M,X}(y,d,m,x)\, f^t_{M,D,X}(m,d',x)}{f^t_{D,M,X}(d,m,x)\, f^t_{D,X}(d',x)} \right) f^t_X(x) \Big|_{t=0}
&= \frac{f^t_{Y,D,M,X}(y,d,m,x)}{f^t_{D,M,X}(d,m,x)}\, \frac{d}{dt}\left( \frac{f^t_{M,D,X}(m,d',x)}{f^t_{D,X}(d',x)} \right) f^t_X(x) \Big|_{t=0} \tag{A.5} \\
&\quad + \frac{f^t_{M,D,X}(m,d',x)}{f^t_{D,X}(d',x)}\, \frac{d}{dt}\left( \frac{f^t_{Y,D,M,X}(y,d,m,x)}{f^t_{D,M,X}(d,m,x)} \right) f^t_X(x) \Big|_{t=0}. \tag{A.6}
\end{align}

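Considering (A.5),
\begin{align*}
\frac{d}{dt}\left( \frac{f^t_{M,D,X}(m,d',x)}{f^t_{D,X}(d',x)} \right)\Big|_{t=0}
&= \frac{1}{f^t_{D,X}(d',x)}\, \frac{d}{dt} f^t_{M,D,X}(m,d',x) \Big|_{t=0}
 - \frac{f^t_{M,D,X}(m,d',x)}{\left( f^t_{D,X}(d',x) \right)^2}\, \frac{d}{dt} f^t_{D,X}(d',x) \Big|_{t=0} \\
&= \frac{1\{D = d'\}}{f_{D,X}(d',x)} \left[ 1_{(\tilde{m},\tilde{x})}(m,x) - f_{M|D,X}(m|d',x)\, 1_{\tilde{x}}(x) \right].
\end{align*}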

Using some algebra, the part of (A.3) appearing in (A.5) can be expressed as
\[
\frac{1\{D = d'\}}{f_{D|X}(d'|\tilde{x})} \left[ F_{Y|D,M,X}(a|d,\tilde{m},\tilde{x}) - E\left[ F_{Y|D,M,X}(a|d,M,\tilde{x}) \mid d', \tilde{x} \right] \right]. \tag{A.7}
\]

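Concerning (A.6),
\begin{align*}
\frac{d}{dt}\left( \frac{f^t_{Y,D,M,X}(y,d,m,x)}{f^t_{D,M,X}(d,m,x)} \right)\Big|_{t=0}
&= \frac{1}{f^t_{D,M,X}(d,m,x)}\, \frac{d}{dt} f^t_{Y,D,M,X}(y,d,m,x) \Big|_{t=0}
 - \frac{f^t_{Y,D,M,X}(y,d,m,x)}{\left( f^t_{D,M,X}(d,m,x) \right)^2}\, \frac{d}{dt} f^t_{D,M,X}(d,m,x) \Big|_{t=0} \\
&= \frac{1}{f_{D,M,X}(d,m,x)} \left[ 1_{(\tilde{y},\tilde{m},\tilde{x})}(y,m,x)\, 1\{D = d\} - f_{Y,D,M,X}(y,d,m,x) \right] \\
&\quad - \frac{f_{Y|D,M,X}(y|d,m,x)}{f_{D,M,X}(d,m,x)} \left[ 1_{(\tilde{m},\tilde{x})}(m,x)\, 1\{D = d\} - f_{D,M,X}(d,m,x) \right] \\
&= \frac{1\{D = d\}}{f_{D,M,X}(d,m,x)} \left[ 1_{(\tilde{y},\tilde{m},\tilde{x})}(y,m,x) - f_{Y|D,M,X}(y|d,m,x)\, 1_{(\tilde{m},\tilde{x})}(m,x) \right].
\end{align*}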

Furthermore, the part of (A.3) appearing in (A.6) can be expressed as
\[
\frac{1\{D = d\}\, f_{M|D,X}(\tilde{m}|d',\tilde{x})}{f_{D|X}(d|\tilde{x})\, f_{M|D,X}(\tilde{m}|d,\tilde{x})} \left[ 1\{\tilde{y} \le a\} - F_{Y|D,M,X}(a|d,\tilde{m},\tilde{x}) \right]. \tag{A.8}
\]

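Combining (A.4), (A.7) and (A.8), we obtain
\begin{align*}
\frac{d}{dt}\Psi(P_t)\Big|_{t=0}
&= E\left[ F_{Y|D,M,X}(a|d,M,\tilde{x}) \mid d', \tilde{x} \right] - E\left[ g_{d,d',a}(X) \right] \\
&\quad + \frac{1\{D = d'\}}{f_{D|X}(d'|\tilde{x})} \left[ F_{Y|D,M,X}(a|d,\tilde{m},\tilde{x}) - E\left[ F_{Y|D,M,X}(a|d,M,\tilde{x}) \mid d', \tilde{x} \right] \right] \\
&\quad + \frac{1\{D = d\}\, f_{M|D,X}(\tilde{m}|d',\tilde{x})}{f_{D|X}(d|\tilde{x})\, f_{M|D,X}(\tilde{m}|d,\tilde{x})} \left[ 1\{\tilde{y} \le a\} - F_{Y|D,M,X}(a|d,\tilde{m},\tilde{x}) \right].
\end{align*}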

If we replace the notation (ỹ, m̃, x̃) with (Y, M, X) and notice that $g_{d,d',a}(X) := E\left[ F_{Y|D,M,X}(a|d,M,X) \mid d', X \right]$
and $E\left[ g_{d,d',a}(X) \right] = F_{Y(d,M(d'))}(a)$, then $E\left[ \frac{d}{dt}\Psi(P_t)\big|_{t=0} \right] = 0$ implies that
\[
F_{Y(d,M(d'))}(a) = E\left[ \psi'(W_a, v'_a) \right],
\]
where ψ′(Wa , va′ ) is defined in equation (14). We can apply Bayes' rule to rewrite the term (A.8) as
\[
\frac{1\{D = d\}\, f_{D|M,X}(d'|\tilde{m},\tilde{x})}{f_{D|X}(d'|\tilde{x})\, f_{D|M,X}(d|\tilde{m},\tilde{x})} \left[ 1\{\tilde{y} \le a\} - F_{Y|D,M,X}(a|d,\tilde{m},\tilde{x}) \right],
\]
and $E\left[ \frac{d}{dt}\Psi(P_t)\big|_{t=0} \right] = 0$ now implies that
\[
F_{Y(d,M(d'))}(a) = E\left[ \psi(W_a, v_a) \right],
\]
where ψ (Wa , va ) is defined in equation (11).
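
The point-mass contamination argument can be checked numerically on a toy model in which all variables are discrete. The sketch below is an illustration under assumptions that are not in the paper: a randomly drawn joint pmf `p[y, d, m, x]` with binary variables, the threshold `a_idx = 0`, and the helper names `pieces`, `psi`, `eif` and `gateaux`. It compares the finite-difference quotient in (A.1) with the sum of the terms (A.4), (A.7) and (A.8) at every support point, and also evaluates the mean of that sum under P.

```python
import numpy as np

rng = np.random.default_rng(1)
ny, nm, nx = 2, 2, 2
d, d_prime, a_idx = 1, 0, 0   # the event {Y <= a} is {y <= a_idx}
n_cells = ny * 2 * nm * nx
# Hypothetical joint pmf p[y, d, m, x], bounded away from zero.
p = 0.5 / n_cells + 0.5 * rng.dirichlet(np.ones(n_cells)).reshape(ny, 2, nm, nx)

def pieces(p):
    """Conditional quantities entering the mediation formula."""
    ind = (np.arange(ny) <= a_idx).astype(float)
    f_dmx = p.sum(axis=0)                    # f(d, m, x)
    f_dx = f_dmx.sum(axis=1)                 # f(d, x)
    f_x = f_dx.sum(axis=0)                   # f(x)
    F = np.einsum("y,ymx->mx", ind, p[:, d]) / f_dmx[d]   # F(a | d, m, x)
    f_m_d = f_dmx[d] / f_dx[d]               # f(m | d, x)
    f_m_dp = f_dmx[d_prime] / f_dx[d_prime]  # f(m | d', x)
    g = (F * f_m_dp).sum(axis=0)             # g_{d,d',a}(x)
    return F, f_m_d, f_m_dp, f_dx / f_x, g, f_x

def psi(p):
    """Mediation formula: sum over x of g(x) f(x)."""
    *_, g, f_x = pieces(p)
    return float(g @ f_x)

def eif(p, obs):
    """Sum of the terms (A.4), (A.7) and (A.8) evaluated at one observation."""
    y, dd, m, x = obs
    F, f_m_d, f_m_dp, f_d_x, g, _ = pieces(p)
    val = g[x] - psi(p)
    if dd == d_prime:
        val += (F[m, x] - g[x]) / f_d_x[d_prime, x]
    if dd == d:
        weight = f_m_dp[m, x] / (f_d_x[d, x] * f_m_d[m, x])
        val += weight * (float(y <= a_idx) - F[m, x])
    return val

def gateaux(p, obs, t=1e-7):
    """Finite-difference version of (A.1) with a point mass at obs."""
    point = np.zeros_like(p)
    point[obs] = 1.0
    return (psi(t * point + (1 - t) * p) - psi(p)) / t

cells = list(np.ndindex(p.shape))
max_gap = max(abs(eif(p, o) - gateaux(p, o)) for o in cells)
mean_eif = sum(p[o] * eif(p, o) for o in cells)
print(max_gap, mean_eif)
```

In this saturated discrete setting, `max_gap` should be small (of the order of the step `t`) and `mean_eif` should vanish up to rounding, which mirrors the mean-zero property used above.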

A.3 Proofs of Theorems in Section 3


Proof of Theorem 1. The proof relies on Theorem A.1 in Appendix A.4, which states that K-fold
cross-fitting is uniformly valid for estimating a parameter of interest under certain regularity conditions.
We show that the conditions in Assumption 2 are sufficient for the proposed K-fold cross-fitting estimator
to satisfy Assumptions A.1.1 to A.1.8, which are required for establishing Theorem A.1. Notice that the
condition $\|\hat{v}_a - v_a^0\|_{P,2} \le \delta_n n^{-1/4}$ in Assumption 2.4 already satisfies Assumption A.1.8. For this reason,
we will only verify Assumptions A.1.1 to A.1.7 in the subsequent discussion. We first derive several
preliminary results which are useful for the proof to follow.
Let Gan be the set of

v := (g1d (X) , g2d (M, X) , g3a (D, M, X) , g4ad (D, X)) .

g1d (X) , g2d (M, X) , g3a (D, M, X) and g4ad (D, X) are P -integrable functions such that for d ∈ {0, 1},

P (ε1 < g1d (X) < 1 − ε1 ) = 1, (A.9)


P (ε2 < g2d (M, X) < 1 − ε2 ) = 1, (A.10)

where ε1 , ε2 ∈ (0, 1/2), and with probability P at least 1 − ∆n , for d ∈ {0, 1}, all a ∈ A and q ≥ 4,
\begin{align*}
\|v - v_a^0\|_{P,q} &\le C, \\
\|v - v_a^0\|_{P,2} &\le \delta_n n^{-1/4}, \\
\|g_{1d}(X) - 0.5\|_{P,\infty} &\le 0.5 - \epsilon, \\
\|g_{2d}(M,X) - 0.5\|_{P,\infty} &\le 0.5 - \epsilon, \\
\|g_{1d}(X) - g^0_{1d}(X)\|_{P,2}\, \|g_{2d}(M,X) - g^0_{2d}(M,X)\|_{P,2} &\le \delta_n n^{-1/2}, \\
\|g_{1d}(X) - g^0_{1d}(X)\|_{P,2}\, \|g_{3a}(D,M,X) - g^0_{3a}(D,M,X)\|_{P,2} &\le \delta_n n^{-1/2}, \\
\|g_{1d}(X) - g^0_{1d}(X)\|_{P,2}\, \|g_{4ad}(D,X) - g^0_{4ad}(D,X)\|_{P,2} &\le \delta_n n^{-1/2}, \\
\|g_{2d}(M,X) - g^0_{2d}(M,X)\|_{P,2}\, \|g_{3a}(D,M,X) - g^0_{3a}(D,M,X)\|_{P,2} &\le \delta_n n^{-1/2}.
\end{align*}

Notice that v̂a ∈ Gan by Assumption 2.4. Proving the result for all functions in Gan implies that it also
holds for v̂a . By Assumption 2.2, it can be shown that for $(d, d') \in \{0,1\}^2$, the event
\[
0 < \frac{\varepsilon_2}{1-\varepsilon_2} < \frac{g^0_{2d}(M,X)}{g^0_{2d'}(M,X)} < \frac{1-\varepsilon_2}{\varepsilon_2} < \infty
\]

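holds with probability one. Furthermore,
\begin{align*}
\|g_{3a}(D,M,X) - g^0_{3a}(D,M,X)\|_{P,q}
&= \left( E\left[ E\left[ \left| g_{3a}(D,M,X) - g^0_{3a}(D,M,X) \right|^q \mid M, X \right] \right] \right)^{1/q} \\
&= \left( E\left[ \sum_{d \in \{0,1\}} \left| g_{3a}(d,M,X) - g^0_{3a}(d,M,X) \right|^q g^0_{2d}(M,X) \right] \right)^{1/q} \\
&\ge \varepsilon_2^{1/q} \left( E\left[ \sum_{d \in \{0,1\}} \left| g_{3a}(d,M,X) - g^0_{3a}(d,M,X) \right|^q \right] \right)^{1/q} \\
&\ge \varepsilon_2^{1/q} \max_{d \in \{0,1\}} \left( E\left[ \left| g_{3a}(d,M,X) - g^0_{3a}(d,M,X) \right|^q \right] \right)^{1/q},
\end{align*}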

which implies that for d ∈ {0, 1} and all a ∈ A, with P probability at least 1 − ∆n ,
\[
\|g_{3a}(d,M,X) - g^0_{3a}(d,M,X)\|_{P,q} \le C \varepsilon_2^{-1/q},
\]

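by the assumption $\|v - v_a^0\|_{P,q} \le C$ and ε2 > 0. A similar argument can be used to show that for
d ∈ {0, 1} and all a ∈ A, with probability P at least 1 − ∆n ,
\begin{align}
\|g_{3a}(d,M,X) - g^0_{3a}(d,M,X)\|_{P,2} &\le \delta_n n^{-1/4} \varepsilon_2^{-1/q} \lesssim \delta_n n^{-1/4}, \tag{A.11} \\
\|g_{4ad}(d',X) - g^0_{4ad}(d',X)\|_{P,2} &\le \delta_n n^{-1/4} \varepsilon_1^{-1/q} \lesssim \delta_n n^{-1/4}, \tag{A.12}
\end{align}
by the assumption $\|v - v_a^0\|_{P,2} \le \delta_n n^{-1/4}$ and ε2 > 0. Similarly,
\begin{align}
\|g_{1d}(X) - g^0_{1d}(X)\|_{P,2}\, \|g_{3a}(d,M,X) - g^0_{3a}(d,M,X)\|_{P,2} &\le \varepsilon_2^{-1/q} \delta_n n^{-1/2} \lesssim \delta_n n^{-1/2}, \tag{A.13} \\
\|g_{1d}(X) - g^0_{1d}(X)\|_{P,2}\, \|g_{4ad}(d',X) - g^0_{4ad}(d',X)\|_{P,2} &\le \varepsilon_1^{-1/q} \delta_n n^{-1/2} \lesssim \delta_n n^{-1/2}, \tag{A.14} \\
\|g_{2d}(M,X) - g^0_{2d}(M,X)\|_{P,2}\, \|g_{3a}(d,M,X) - g^0_{3a}(d,M,X)\|_{P,2} &\le \varepsilon_2^{-1/q} \delta_n n^{-1/2} \lesssim \delta_n n^{-1/2}. \tag{A.15}
\end{align}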

Verifying Assumption A.1.1: $F^0_{Y(d,M(d'))}(a) = \theta^0_{d,d',a}$

The case when d ̸= d′


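Recall that
\begin{align}
\psi_{d,d',a}\left(W_a; v_a^0\right)
&= \frac{1\{D = d\} \left( 1 - g^0_{2d}(M,X) \right)}{\left[ 1 - g^0_{1d}(X) \right] g^0_{2d}(M,X)} \times \left[ Y_a - g^0_{3a}(d,M,X) \right] \tag{A.16} \\
&\quad + \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} \left[ g^0_{3a}(d,M,X) - g^0_{4ad}(d',X) \right] + g^0_{4ad}(d',X). \tag{A.17}
\end{align}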
1 − g1d (X)

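Conditional on M and X, the expectation of the part of expression (A.16) that involves $Y_a$ is
\begin{align*}
E\left[ \frac{1\{D = d\} \left( 1 - g^0_{2d}(M,X) \right)}{\left[ 1 - g^0_{1d}(X) \right] g^0_{2d}(M,X)}\, Y_a \,\Big|\, M, X \right]
&= \frac{1 - g^0_{2d}(M,X)}{\left[ 1 - g^0_{1d}(X) \right] g^0_{2d}(M,X)} \times E\left[ 1\{D = d\}\, Y_a \mid M, X \right] \\
&= \frac{1 - g^0_{2d}(M,X)}{\left[ 1 - g^0_{1d}(X) \right] g^0_{2d}(M,X)} \times g^0_{2d}(M,X)\, g^0_{3a}(d,M,X) \\
&= \frac{1 - g^0_{2d}(M,X)}{1 - g^0_{1d}(X)}\, g^0_{3a}(d,M,X),
\end{align*}
and the expectation of the part of expression (A.16) that involves $g^0_{3a}(d,M,X)$ is
\begin{align*}
E\left[ \frac{1\{D = d\} \left( 1 - g^0_{2d}(M,X) \right)}{\left[ 1 - g^0_{1d}(X) \right] g^0_{2d}(M,X)}\, g^0_{3a}(d,M,X) \,\Big|\, M, X \right]
&= \frac{1 - g^0_{2d}(M,X)}{\left[ 1 - g^0_{1d}(X) \right] g^0_{2d}(M,X)} \times g^0_{2d}(M,X)\, g^0_{3a}(d,M,X) \\
&= \frac{1 - g^0_{2d}(M,X)}{1 - g^0_{1d}(X)}\, g^0_{3a}(d,M,X).
\end{align*}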

Therefore, conditional on M and X, the term in expression (A.16) has expectation zero, and its unconditional
expectation is also zero. Concerning the terms in expression (A.17), notice that for the first term,
\begin{align*}
E\left[ \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} \left[ g^0_{3a}(d,M,X) - g^0_{4ad}(d',X) \right] \Big|\, X \right]
&= \frac{1}{1 - g^0_{1d}(X)}\, E\left[ g^0_{3a}(d,M,X)\, 1\{D = d'\} \mid X \right] - g^0_{4ad}(d',X) \\
&= 0.
\end{align*}
Therefore, the expectation of the first term in expression (A.17) is zero. The expectation of the second
term in expression (A.17) is $E\left[ g^0_{4ad}(d',X) \right] = F^0_{Y(d,M(d'))}(a)$ by Proposition 1. Therefore, it follows
that $F^0_{Y(d,M(d'))}(a) = \theta^0_{d,d',a}$.

The case when d = d′


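Now, we have
\[
\psi_{d,d,a}\left(W_a; v_a^0\right) = \frac{1\{D = d\}}{g^0_{1d}(X)} \times \left[ Y_a - g^0_{4ad}(d,X) \right] + g^0_{4ad}(d,X), \tag{A.18}
\]
where $g^0_{4ad}(d,X) = F^0_{Y|D,X}(a|d,X)$. Concerning the first term right of the equals sign in equation
(A.18),
\begin{align*}
E\left[ \frac{1\{D = d\}}{g^0_{1d}(X)} \times \left[ Y_a - g^0_{4ad}(d,X) \right] \Big|\, X \right]
&= \frac{E\left[ 1\{D = d\}\, Y_a \mid X \right]}{g^0_{1d}(X)} - g^0_{4ad}(d,X) \\
&= 0.
\end{align*}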

Concerning the last term in equation (A.18), notice that
\[
F^0_{Y|D,X}(a|d,X) = F^0_{Y(d,M(d))|D,X}(a|d,X) = F^0_{Y(d,M(d))|X}(a|X)
\]
by Assumptions 1.1 and 1.2, and therefore,
\[
E\left[ g^0_{4ad}(d,X) \right] = E\left[ F^0_{Y(d,M(d))|X}(a|X) \right] = F^0_{Y(d,M(d))}(a).
\]
Combining the previous results, it follows that $F^0_{Y(d,M(d'))}(a) = \theta^0_{d,d',a}$ holds for all a ∈ A and
$(d, d') \in \{0,1\}^2$.

Verifying Assumption A.1.2


If we treat v as deterministic, the second order Gateaux derivative of the map $v \mapsto E\left[ \psi_{d,d',a}(W_a; v) \right]$
exists and is continuous on v ∈ Gan , and this property holds for each $(d, d') \in \{0,1\}^2$ and a ∈ A.
Therefore $v \mapsto E\left[ \psi_a(W_a; v) \right]$ is twice continuously Gateaux-differentiable for a ∈ A.

Verifying Assumption A.1.3 (Neyman near orthogonality)

The case when d ̸= d′


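Recall that
\begin{align*}
\psi_{d,d',a}\left(W_a, v_a^0\right)
&= \frac{1\{D = d\} \left( 1 - g^0_{2d}(M,X) \right)}{\left[ 1 - g^0_{1d}(X) \right] g^0_{2d}(M,X)} \left[ 1\{Y \le a\} - g^0_{3a}(d,M,X) \right] \\
&\quad + \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} \left[ g^0_{3a}(d,M,X) - g^0_{4ad}(d',X) \right] + g^0_{4ad}(d',X).
\end{align*}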

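Let
\[
\mu_{d,d',a}(t) = \frac{1\{D = d\} (1 - t_2)}{(1 - t_1)\, t_2} \left( 1\{Y \le a\} - t_3 \right) + \frac{1\{D = d'\}}{1 - t_1} (t_3 - t_4) + t_4,
\]
where $t = (t_1, \ldots, t_4)$. If we set $t = \left( g_{1d}(X), g_{2d}(M,X), g_{3a}(d,M,X), g_{4ad}(d',X) \right)$, then
$E\left[ \mu_{d,d',a}(t) \right] = E\left[ \psi_{d,d',a}(W_a, v) \right]$. If we set
\[
t = t_0 = \left( g^0_{1d}(X), g^0_{2d}(M,X), g^0_{3a}(d,M,X), g^0_{4ad}(d',X) \right),
\]
then $E\left[ \mu_{d,d',a}(t_0) \right] = E\left[ \psi_{d,d',a}(W_a, v_a^0) \right]$. Furthermore,
$\left\| \partial_v E\left[ \psi_{d,d',a}(W_a, v_a^0) \right] \left[ v - v_a^0 \right] \right\| = \left\| E\left[ \partial_t \mu_{d,d',a}(t_0) \left[ t - t_0 \right] \right] \right\|$.
The partial derivatives with respect to $(t_1, \ldots, t_4)$ are given by
\begin{align}
\partial_{t_1} \mu_{d,d',a}(t) &= \frac{1\{D = d\} (1 - t_2)}{(1 - t_1)^2\, t_2} \left( 1\{Y \le a\} - t_3 \right) + \frac{1\{D = d'\}}{(1 - t_1)^2} (t_3 - t_4), \tag{A.19} \\
\partial_{t_2} \mu_{d,d',a}(t) &= - \frac{1\{D = d\}}{(1 - t_1)\, t_2^2} \left( 1\{Y \le a\} - t_3 \right), \tag{A.20} \\
\partial_{t_3} \mu_{d,d',a}(t) &= \frac{1\{D = d'\}}{1 - t_1} - \frac{1\{D = d\} (1 - t_2)}{(1 - t_1)\, t_2}, \tag{A.21} \\
\partial_{t_4} \mu_{d,d',a}(t) &= 1 - \frac{1\{D = d'\}}{1 - t_1}. \tag{A.22}
\end{align}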

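Replacing t with t0 and taking expectations in equation (A.19), we have
\begin{align}
E\left[ \partial_{t_1} \mu_{d,d',a}(t_0) \left[ g_{1d}(X) - g^0_{1d}(X) \right] \right]
&= E\left[ \frac{1 - g^0_{2d}(M,X)}{\left( 1 - g^0_{1d}(X) \right)^2 g^0_{2d}(M,X)} \times \left[ E\left[ 1\{Y \le a\}\, 1\{D = d\} \mid M, X \right] - g^0_{2d}(M,X)\, g^0_{3a}(d,M,X) \right] \left[ g_{1d}(X) - g^0_{1d}(X) \right] \right] \tag{A.23} \\
&\quad + E\left[ \frac{1}{\left( 1 - g^0_{1d}(X) \right)^2} \left[ E\left[ 1\{D = d'\}\, g^0_{3a}(d,M,X) \mid X \right] - \left( 1 - g^0_{1d}(X) \right) g^0_{4ad}(d',X) \right] \left[ g_{1d}(X) - g^0_{1d}(X) \right] \right] \tag{A.24} \\
&= 0, \notag
\end{align}
because in expressions (A.23) and (A.24), we have that
\begin{align}
E\left[ 1\{D = d\}\, 1\{Y \le a\} \mid M, X \right] &= P^0_{Y,D|M,X}(Y \le a, D = d \mid M, X) \notag \\
&= g^0_{2d}(M,X)\, g^0_{3a}(d,M,X), \tag{A.25} \\
E\left[ 1\{D = d'\}\, g^0_{3a}(d,M,X) \mid X \right] &= \int g^0_{3a}(d,m,X)\, P^0_{D,M|X}(D = d', M = m \mid X)\, dm \notag \\
&= \left( 1 - g^0_{1d}(X) \right) g^0_{4ad}(d',X). \notag
\end{align}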

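When taking expectations in equation (A.20), we have
\begin{align}
E\left[ \partial_{t_2} \mu_{d,d',a}(t_0) \left[ g_{2d}(M,X) - g^0_{2d}(M,X) \right] \right]
&= E\left[ - \frac{1}{\left( 1 - g^0_{1d}(X) \right) \left( g^0_{2d}(M,X) \right)^2} \times \left[ E\left[ 1\{D = d\}\, 1\{Y \le a\} \mid M, X \right] - g^0_{2d}(M,X)\, g^0_{3a}(d,M,X) \right] \times \left[ g_{2d}(M,X) - g^0_{2d}(M,X) \right] \right] \tag{A.26} \\
&= 0, \notag
\end{align}
by making use of expression (A.25). When taking expectations in equation (A.21), we have
\begin{align*}
E\left[ \partial_{t_3} \mu_{d,d',a}(t_0) \left[ g_{3a}(d,M,X) - g^0_{3a}(d,M,X) \right] \right]
&= E\left[ E\left[ \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} - \frac{1\{D = d\} \left( 1 - g^0_{2d}(M,X) \right)}{\left( 1 - g^0_{1d}(X) \right) g^0_{2d}(M,X)} \,\Big|\, M, X \right] \times \left[ g_{3a}(d,M,X) - g^0_{3a}(d,M,X) \right] \right] \\
&= 0,
\end{align*}
because
\[
E\left[ \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} - \frac{1\{D = d\} \left( 1 - g^0_{2d}(M,X) \right)}{\left( 1 - g^0_{1d}(X) \right) g^0_{2d}(M,X)} \,\Big|\, M, X \right] = 0.
\]
When taking expectations in equation (A.22), we have
\begin{align*}
E\left[ \partial_{t_4} \mu_{d,d',a}(t_0) \left[ g_{4ad}(d',X) - g^0_{4ad}(d',X) \right] \right]
&= E\left[ E\left[ 1 - \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} \,\Big|\, X \right] \left[ g_{4ad}(d',X) - g^0_{4ad}(d',X) \right] \right] \\
&= 0,
\end{align*}
since
\[
E\left[ 1 - \frac{1\{D = d'\}}{1 - g^0_{1d}(X)} \,\Big|\, X \right] = 1 - \frac{1 - g^0_{1d}(X)}{1 - g^0_{1d}(X)} = 0.
\]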

The case when d = d′
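Recall that
\[
\psi_{d,d,a}\left(W_a, v_a^0\right) = \frac{1\{D = d\}}{g^0_{1d}(X)} \left[ 1\{Y \le a\} - g^0_{4ad}(d,X) \right] + g^0_{4ad}(d,X).
\]
Now $\mu_{d,d',a}(t)$ is given by
\[
\mu_{d,d',a}(t) = \frac{1\{D = d\}}{t_1} \left( 1\{Y \le a\} - t_4 \right) + t_4.
\]
The partial derivatives with respect to $(t_1, t_4)$ are given by
\begin{align}
\partial_{t_1} \mu_{d,d',a}(t) &= - \frac{1\{D = d\}}{t_1^2} \left( 1\{Y \le a\} - t_4 \right), \tag{A.27} \\
\partial_{t_4} \mu_{d,d',a}(t) &= 1 - \frac{1\{D = d\}}{t_1}. \tag{A.28}
\end{align}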

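Replacing t with t0 and taking expectations in equation (A.27), we have
\begin{align*}
E\left[ \partial_{t_1} \mu_{d,d',a}(t_0) \left[ g_{1d}(X) - g^0_{1d}(X) \right] \right]
&= E\left[ - \frac{1}{\left( g^0_{1d}(X) \right)^2} \times \left[ E\left[ 1\{Y \le a\}\, 1\{D = d\} \mid X \right] - g^0_{1d}(X)\, g^0_{4ad}(d,X) \right] \left[ g_{1d}(X) - g^0_{1d}(X) \right] \right] \\
&= 0,
\end{align*}
since the term
\[
g^0_{1d}(X)\, g^0_{4ad}(d,X) = P^0_{Y,D,M,X}(Y \le a, d \mid X) = E\left[ 1\{D = d\}\, Y_a \mid X \right].
\]
Taking expectations in equation (A.28), we have
\begin{align*}
E\left[ \partial_{t_4} \mu_{d,d',a}(t_0) \left[ g_{4ad}(d,X) - g^0_{4ad}(d,X) \right] \right]
&= E\left[ E\left[ 1 - \frac{1\{D = d\}}{g^0_{1d}(X)} \,\Big|\, X \right] \left[ g_{4ad}(d,X) - g^0_{4ad}(d,X) \right] \right] \\
&= 0,
\end{align*}
since
\[
E\left[ 1 - \frac{1\{D = d\}}{g^0_{1d}(X)} \,\Big|\, X \right] = 1 - \frac{g^0_{1d}(X)}{g^0_{1d}(X)} = 0.
\]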
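Combining all previous results, it follows that
\[
\partial_v E\left[ \psi_{d,d',a}\left(W_a, v_a^0\right) \right] \left[ v - v_a^0 \right] = 0
\]
holds for each $(d, d') \in \{0,1\}^2$ and all a ∈ A. Therefore,
\[
\partial_v E\left[ \psi_a\left(W_a, v_a^0\right) \right] \left[ v - v_a^0 \right] = 0
\]
holds for all a ∈ A.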
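
The orthogonality property for the case d ≠ d′ can be illustrated numerically in a toy discrete model where expectations are exact sums. The sketch below rests on assumptions that are not in the paper: a randomly drawn joint pmf `p[y, d, m, x]` with binary variables, arbitrary perturbation directions `h`, and the helper names `expected_score` and `along`. It evaluates E[µ(t)] at the true nuisances and takes a central finite difference along a perturbation of all four nuisances at once.

```python
import numpy as np

rng = np.random.default_rng(2)
ny, nm, nx = 2, 2, 2
d, d_prime, a_idx = 1, 0, 0   # the event {Y <= a} is {y <= a_idx}
n_cells = ny * 2 * nm * nx
# Hypothetical joint pmf p[y, d, m, x], bounded away from zero.
p = 0.5 / n_cells + 0.5 * rng.dirichlet(np.ones(n_cells)).reshape(ny, 2, nm, nx)

f_dmx = p.sum(axis=0)         # f(d, m, x)
f_mx = f_dmx.sum(axis=0)      # f(m, x)
f_dx = f_dmx.sum(axis=1)      # f(d, x)
f_x = f_dx.sum(axis=0)        # f(x)
ind = (np.arange(ny) <= a_idx).astype(float)

# True nuisance parameters for d != d'.
g1 = f_dx[d] / f_x                                         # P(D = d | X)
g2 = f_dmx[d] / f_mx                                       # P(D = d | M, X)
g3 = np.einsum("y,ymx->mx", ind, p[:, d]) / f_dmx[d]       # F(a | d, M, X)
g4 = (g3 * f_dmx[d_prime] / f_dx[d_prime]).sum(axis=0)     # E[g3 | d', X]
v0 = (g1, g2, g3, g4)

def expected_score(v):
    """Exact expectation of mu_{d,d',a}(t) under p."""
    t1, t2, t3, t4 = v
    total = 0.0
    for y, dd, m, x in np.ndindex(p.shape):
        ya = float(y <= a_idx)
        mu = t4[x]
        if dd == d:
            mu += (1 - t2[m, x]) / ((1 - t1[x]) * t2[m, x]) * (ya - t3[m, x])
        if dd == d_prime:
            mu += (t3[m, x] - t4[x]) / (1 - t1[x])
        total += p[y, dd, m, x] * mu
    return total

# Arbitrary perturbation directions, one per nuisance (an assumption).
h = (rng.uniform(-1, 1, nx), rng.uniform(-1, 1, (nm, nx)),
     rng.uniform(-1, 1, (nm, nx)), rng.uniform(-1, 1, nx))

def along(r):
    """E[mu] at the nuisances v0 + r * h."""
    return expected_score(tuple(g + r * dg for g, dg in zip(v0, h)))

r = 1e-6
directional_derivative = (along(r) - along(-r)) / (2 * r)
target = float(g4 @ f_x)      # mediation formula value
print(expected_score(v0) - target, directional_derivative)
```

Both printed numbers should be numerically zero: the first reflects the identification step verified for Assumption A.1.1, the second the vanishing first-order effect of nuisance perturbations.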

Verifying Assumption A.1.4a ($r_n \le \delta_n n^{-1/4}$)

The case when d ̸= d′


We have
(  )
0
1 {D = d} [1 − g2d (M, X)] 1 {D = d} 1 − g2d (M, X)
ψd,d′ ,a Wa , va0

ψd,d′ ,a (Wa , v) − = − 0 (X)] g 0 (M, X) 1 {Y ≤ a}
[1 − g1d (X)] g2d (M, X) [1 − g1d 2d

1 {D = d′ }
 
1 {D = d} [1 − g2d (M, X)]
+ − × g3a (d, M, X)
1 − g1d (X) [1 − g1d (X)] g2d (M, X)
(   )
0
1 {D = d′ } 1 {D = d} 1 − g2d (M, X) 0
− 0 (X) − 0 (X)] g 0 (M, X) × g3a (d, M, X)
1 − g1d [1 − g1d 2d

1 {D = d′ } 1 {D = d′ } 0
   
+ 1− g4ad (d′ , X) − 1 − ′
0 (X) g4ad (d , X) .
1 − g1d (X) 1 − g1d

To ease the notation, we express these nuisance parameters without their arguments in the following
proof. Using the Minkowski inequality yields

ψd,d′ ,a (Wa , v) − ψd,d′ ,a Wa , va0 ≤ Π1 + Π2 + Π3 ,



P,2

where
" #
1 {D = d} (1 − g ) 1 {D = d} 1 − g 0 
2d 2d
Π1 = − 1 {Y ≤ a} ,

(1 − g1d ) g2d (1 − g1d0 ) g0
2d
P,2
 " #
1 {D = d′ } 1 {D = d} (1 − g )  ′ 0
2d 1 {D = d } 1 {D = d} 1 − g2d
0
Π2 = − × g3a − − × g ,

0 0 ) g0 3a
1 − g1d (1 − g1d ) g2d 1 − g1d (1 − g1d 2d
P,2
1 {D = d′ } 1 {D = d′ }
   
0

Π3 = 1 − g 4ad − 1 − 0 g4ad .
1 − g1d 1 − g1d P,2

In the following, Assumption 2.4 and the boundedness conditions (A.9), (A.10), (A.11) and (A.12) are
applied to derive the relevant upper bounds. For the term Π1 , with probability P at least 1 − ∆n , we
have

0
g2d − g2d
1 − g0  1 − g 0
2d 2d
Π1 ≤ + −

(1 − g1d ) g2d (1 − g ) g (1 − g 0 ) g0
P,2
1d 2d 1d 2d P,2
1 g20 − g2 + 1 g2 − g20 + 1 g1 − g10

≤ P,2 P,2 2 P,2
ε1 ε2 ε1 ε2 ε1
 
2 1 1 1
≤ + 2 δ n n− 4 ≲ δ n n− 4
ε1 ε2 ε1

by making use of the fact that


0 0

1 1 g2d − g2d g1d − g1d
(1 − g1d ) g2d − (1 − g 0 ) g 0 = (1 − g1d ) g2d g 0 + (1 − g1d ) (1 − g 0 ) g 0

1d 2d 2d 1d 2d
1 0
1 0

≤ g2d − g2d + 2 g1d − g1d
ε1 ε22 ε1 ε2

1
and Assumption v − va0 P,2 ≲ δn n− 4 . For the term Π2 , it is known that with probability P at least
1 − ∆n , for a ∈ A,
0
 
(2g2d − 1) 0
 2g2d − 1 2g2d −1 0

Π2 ≤
(1 − g1d ) g2d × g3d − g3d +
(1 − g1d ) g2d − 0 0 × g3a
P,2 (1 − g1d ) g2d P,2
1 − 2ε2 0
1 − 2ε2 0
1 − 2ε2 g1d − g1d + 2 g2d − g2d
0
0

≤ g3a − g3a P,2 + 2 g2d − g2d P,2 + 2 P,2 P,2
ε1 ε2 ε1 ε2 ε1 ε2 ε1 ε2
 
1 − 2ε2 − q1 1 − 2ε2 1 − 2ε2 2 1 1
≤ ε + + 2 + δ n n− 4 ≲ δ n n− 4 ,
ε1 ε2 2 ε1 ε22 ε1 ε2 ε1 ε2

by using
2g2d − 1 1 − 2ε2
≤ ,
(1 − g1d ) g2d ε1 ε2

0
 0 0

0


2g2d − 1 2g2d − 1 g2d − g 2d g1d − g1d 2 g2d g2d
(1 − g1d ) g2d − (1 − g 0 ) g 0 = (1 − g1d ) g2d g 0 + (1 − g1d ) (1 − g 0 ) g 0 × (2g2d − 1) + (1 − g 0 ) g 0

1d 2d 2d 1d 2d 1d 2d
1 − 2ε2 0
1 − 2ε 2
0
2 0

≤ g2d − g2d + 2 g1d − g1d + g2d − g2d ,
ε1 ε22 ε1 ε2 ε1 ε2
1
and Assumption v − va0 P,2 ≲ δn n− 4 as well as condition (A.11). For the term Π3 , we can show that
with probability P at least 1 − ∆n and for a ∈ A,
0
0
g1d − g1d 0
g4ad g4ad 0

Π3 ≤

0 g4ad + − + g4ad − g4ad
P,2
(1 − g1d ) (1 − g1d ) P,2
1 − g1d 1 − g1d P,2

 
1 0 1 0

≤ 2 g1d − g1d P,2 + 1 + g4ad − g4ad P,2
ε1 ε1
  
1 − q1 1 1 1
≤ 2 + ε2 1+ δn n− 4 ≲ δn n− 4 ,
ε1 ε1
1
0
by using 0 ≤ g4ad ≤ 1, Assumption v − va0 P,2 ≲ δn n− 4 and condition (A.12).

The case when d = d′


We have
 
1 {D = d} 1 {D = d}
ψd,d,a (Wa , v) − ψd,d,a Wa , va0 =

− 0 1 {Y ≤ a}
g1d (X) g1d (X)
1 {D = d′ }
   
1 {D = d} 0
+ 1− g4ad − 1 − 0 g4ad ,
g1d g1d

where g4ad = g4ad (d, X) (not g4ad (d′ , X)). Using the triangle inequality yields

ψd,d,a (Wa , v) − ψd,d,a Wa , va0 ≤ Π4 + Π5 ,



P,2

where
 
1 {D = d} 1 {D = d}
Π4 =
− 0 1 {Y ≤ a} ,
g1d (X) g1d (X) P,2

   
1 {D = d } 1 {D = d} 0
Π5 = 1 − g4ad − 1 − g
4ad .
g1d g0 1d P,2

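Following previous arguments, it can be shown that $\Pi_4 \le \varepsilon_2^{-2} \delta_n n^{-1/4} \lesssim \delta_n n^{-1/4}$. Similarly to Π3 , we
have for Π5 ,
\[
\Pi_5 \le \frac{1}{\varepsilon_1^2} \left\| g_{1d} - g^0_{1d} \right\|_{P,2} + \left( 1 + \frac{1}{\varepsilon_1} \right) \left\| g_{4ad} - g^0_{4ad} \right\|_{P,2}
\le \left( \frac{1}{\varepsilon_1^2} + \varepsilon_2^{-1/q} \left( 1 + \frac{1}{\varepsilon_1} \right) \right) \delta_n n^{-1/4} \lesssim \delta_n n^{-1/4}.
\]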

Combining the previous results, we obtain that with probability P at least 1 − ∆n and for all a ∈ A,
\[
\left\| \psi_{d,d',a}(W_a, v) - \psi_{d,d',a}\left(W_a, v_a^0\right) \right\|_{P,2} \lesssim \delta_n n^{-1/4}
\]
holds for each $(d, d') \in \{0,1\}^2$. Therefore, with probability P at least 1 − ∆n and all a ∈ A and v ∈ Gan ,
\[
\left\| \psi_a(W_a, v) - \psi_a\left(W_a, v_a^0\right) \right\|_{P,2} \lesssim \delta_n n^{-1/4}.
\]

Verifying Assumption A.1.4b ($\lambda'_n \le \delta_n n^{-1/2}$)

The case when d ̸= d′


 
We may write ψd,d′ ,a Wa ; r v − va0 + va0 as
  0
 0

1 {D = d} 1 − r g2d − g2d + g2d
va0 va0
 
ψd,d′ ,a Wa ; r v − + = 0 ) + g 0 ]} {r (g 0 0
{1 − [r (g1d − g1d 1d 2d − g2d ) + g2d }
0 0
   
× Ya − r g3d − g3d + g3d
1 {D = d′ }
+ 0 ) + g0 ]
1 − [r (g1d − g1d 1d
0 0 0 0
   
× r (g3d − g4ad ) − g3a − g4ad + g3a − g4ad
0 0
  
+ r g4ad − g4ad + g4ad .

Let

0 0 0 0
     
A1 (r) = 1 − r g2d − g2d + g2d , A2 (r) = 1 − r g1d − g1d + g1d ,
0 0 0 0
   
A3 (r) = r g2d − g2d + g2d , A4 (r) = Ya − r g3a − g3a + g3a ,
0 0 0 0
  
A5 (r) = r (g3a − g4ad ) − g3a − g4ad + g3a − g4ad .

Notice that the functions Ai (r), i = 1, . . . , 5 are also functions of the random variables (Ya , M, X). For
 
this reason, we may rewrite ψd,d′ ,a Wa ; r v − va0 + va0 as

 1 {D = d} A1 (r)
ψd,d′ ,a Wa ; r v − va0 + va0 =

× A4 (r)
A2 (r) A3 (r)
1 {D = d′ }  0
 0

+ × A5 (r) + r g4ad − g4ad + g4ad .
A2 (r)

After some calculations, we obtain
" #
1 h 2 0
 0
i 1 {D = d} A 4 (r̄) 0
2
E ∂r ψd,d′ ,a Wa ; r v − va + va = −E 2 g1d − g1d
2
r=r̄ (A2 (r̄)) A3 (r̄)
" #
1 {D = d} A4 (r̄) 0
 0

+E 2 g2d − g2d g1d − g1d
A2 (r̄) (A3 (r̄))
 
1 {D = d} 0
 0

+E g2d − g2d g3a − g3a
A2 (r̄) A3 (r̄)
" #
1 {D = d} A1 (r̄) A4 (r̄) 0 2

+E 3 g1d − g1d
(A2 (r̄)) A3 (r̄)
" #
1 {D = d} A1 (r̄) A4 (r̄) 0
 0

−E 2 2 g1d − g1d g2d − g2d
(A2 (r̄)) (A3 (r̄))
(A.29)
" #
1 {D = d} A1 (r̄) 0
 0

+E 2 g1d − g1d g3a − g3a
(A2 (r̄)) A3 (r̄)
" #
1 {D = d} A1 (r̄) A4 (r̄) 0 2

+E 3 g2d − g2d
A2 (r̄) (A3 (r̄))
" #
1 {D = d} A1 (r̄) 0
 0

+E 2 g2d − g2d g3a − g3a
A2 (r̄) (A3 (r̄))
" #
1 {D = d′ } A5 (r̄) 0 2

+E 3 g1d − g1d
(A2 (r̄))
" #
1 {D = d′ } A5 (r̄) 0
 0 0

+E 2 g1d − g1d g3a − g4ad − g3a − g4ad .
(A2 (r̄))

To bound the expectation of the second order derivative above, we can use the properties of Ai (r). Using
Assumption 2.2 and acknowledging that r̄ ∈ (0, 1), we have that

0 0
  
ε2 < A1 (r̄) = 1 − r g2d − g2d + g2d < 1 − ε2 ,
0 0
  
ε1 < A2 (r̄) = 1 − r g1d − g1d + g1d < 1 − ε1 ,
0 0

ε2 < A3 (r̄) = r g2d − g2d + g2d < 1 − ε1

0
hold with probability one. Also |A4 (r̄)| and |A5 (r̄)| are bounded by constants, since |Ya |, g3a , g3a , g4ad
0
and g4ad are all bounded and r̄ ∈ (0, 1). Based on these results and Assumption 2.4, it can be shown
that the absolute values of those terms on the right hand side of equation (A.29) that involve interaction
terms are all bounded by δn n−1/2 . We now consider the terms on the right hand side of equation
(A.29) that involve quadratic terms (the first, fourth, seventh and ninth terms). By the assumption

v − va0
P,2
≤ δn n−1/4 , we have g1d − g1d
0
P,2
≲ δn n−1/4 and g2d − g2d
0
P,2
≲ δn n−1/4 . Concerning
the first term,
" # "  #
0
1 {D = d} A4 (r̄) 0
 2
1 {D = d} Ya − g3a 0
 2

−E g1d − g1d ≤ −E g1d − g1d (A.30)

2 2
(A2 (r̄)) A3 (r̄) (A2 (r̄)) A3 (r̄)
"  #
0
1 {D = d} r̄ g3a − g3a 
0 2
+ E g − g

2 1d 1d
(A2 (r̄)) A3 (r̄)
r̄ (1 − 2ε1 ) − q1 1 1
≤ ε2 δ n n− 2 ≲ δ n n− 2 ,
ε21 ε2

since
"
0
 # " 2 #
1 {D = d} Ya − g3a 0 2
 g1 − g10 
0 0 0

E 2 g1d − g1d =E 2 × PY,D|M,X (Y ≤ a, D = d|M, X) − g2d g3a
(A2 (r̄)) A3 (r̄) (A2 (r̄)) A3 (r̄)
= 0,

and
"  # " #
0
1 {D = d} r̄ g3a − g3a 0
 2
1 {D = d}
0
0 2

E g1d − g1d ≤E r̄ g3a − g3a g1d − g1d

2 (A2 (r̄))2 A3 (r̄)
(A2 (r̄)) A3 (r̄)
r̄ (1 − 2ε1 ) 0
0

≤ 2
g3a − g3a
P,2
g1d − g1d
P,2
ε1 ε2
r̄ (1 − 2ε1 ) − q1 1 1
≤ ε2 δn n− 2 ≲ δn n− 2 ,
ε21 ε2

by assuming that v − va0 P,2 ≤ δn n−1/2 and g1d − g1d
0
< 1 − 2ε1 with probability one. Applying a
similar argument to the fourth term,

0 2
" # "  #
1 {D = d} A1 (r̄) A4 (r̄) 
0 2 A1 (r̄) g1d − g1d  0
 
E g1d − g1d ≤ E × E 1 {D = d} Ya − g3a |M, X

3 3
(A2 (r̄)) A3 (r̄) (A2 (r̄)) A3 (r̄)
" #
A1 (r̄)
0
0 2

+E r̄ g3a − g3a g1d − g1d

3
(A2 (r̄)) A3 (r̄)
r̄ (1 − ε2 ) (1 − 2ε1 ) − q1 1 1
≤0+ ε2 δn n− 2 ≲ δn n− 2 .
ε31 ε2

For the ninth term,

0 2
" # "  #
1 {D = d′ } A5 (r̄) 0
 2
g1d − g1d  ′ 0 0
 
E g1d − g1d ≤ E 3 E 1 {D = d } g3a − g4ad |X

3
(A2 (r̄)) (A2 (r̄))
" #
1 {D = d′ } 2
0 0 0
 
+E r̄ g3a − g3a − g4ad − g4ad g1d − g1d

(A2 (r̄))3
 
r̄ (1 − 2ε1 ) − q1 − 21 − q1 − 12 1
≤ 3 ε2 δn n + ε1 δn n ≲ δ n n− 2 ,
ε1

since the term


Z
E 1 {D = d′ } g3a
0 0
FY |D,M,X (a|d, m, X) fM |D|X (m|d′ X) fD|X (d′ |X) dm
  
− g4ad |X =

− fD|X (d′ |X) E FY |D,M,X (a|d, M, X) |d′ , X


 

= 0.

For the seventh term,


" #
1 {D = d} A1 (r̄) A4 (r̄) 
0 2 r̄ (1 − ε2 ) (1 − 2ε2 ) 0
0

E g2d − g2d ≤0+ g3a − g3a g1d − g1d

3 ε1 ε23 P,2 P,2
A2 (r̄) (A3 (r̄))
r̄ (1 − ε2 ) (1 − 2ε2 ) − q1 1 1
≤ ε2 δ n n− 2 ≲ δ n n− 2 ,
ε1 ε32
0

by g2d − g2d < 1 − 2ε2 with probability one.

The case when d = d′
Let

0 0

A6 (r) = r g1d − g1d + g1d ,
0 0
  
A7 (r) = Ya − r g4ad − g4ad + g4ad ,

where g4ad = g4ad (d, X). It holds that ε1 < A6 (r̄) < 1 − ε1 and |A7 (r̄)| for r̄ ∈ [0, 1] are bounded, since
0
|Ya |, g4ad and g4ad are bounded. Then,

1 {D = d}
ψd,d,a Wa ; r v − va0 + va0 = 0 0 0 0
        
0 0 Ya − r g4ad − g4ad + g4ad + r g4ad − g4ad + g4ad
r (g1d − g1d ) + g1d
1 {D = d}  0
 0

= A7 (r̄) + r g4ad − g4ad + g4ad .
A6 (r̄)

After some calculations, we obtain


" #
1  2 1 {D = d} A 7 (r̄) 2
E ∂r ψd,d,a Wa ; r v − va0 + va0 0
  
=E 3 g1d − g1d
2 r=r̄ A6 (r̄)
" #
1 {D = d} 0
 0

+E 2 g1d − g1d g4ad − g4ad .
A6 (r̄)

Considering the first term on the right hand side, we have that
" # " #
1 {D = d} A7 (r̄) 
0 2 1 0 2
 0 0

E g1d − g1d ≤ E 3 g1d − g1d E [1 {D = d} Ya |X] − g1d g4ad

3
A6 (r̄) A6 (r̄)
" #
1 {D = d} r̄
0 0 0

+E g1d − g1d g1d − g1d g4ad − g4ad

A6 (r̄)3
r̄ (1 − 2ε1 ) − q1 1 1
≤ ε1 δn n− 2 ≲ δn n− 2 ,
ε31

0 0
since E [1 {D = d} Ya |X] − g1d g4ad = 0 and

0
0
0
0
−1
g1d − g1d
P,2
g4ad − g4ad
P,2
≤ g1d (X) − g1d (X) P,2 g4ad (D, X) − g4ad (D, X) P,2 ε1 q
−1 1 1
≤ ε1 q δn n− 2 ≲ δn n− 2 .

Concerning the second term on the right hand side,


" #
1 {D = d} 0
 0
 1 0
0

E g − g g − g ≤ 2 g1d − g1d g4ad − g4ad

2 1d 1d 4ad 4ad P,2 P,2
A6 (r̄) ε 1
1
≲ δ n n− 2 .
h   i
Finally, if v = va0 , it is trivial to see that E ∂r2 ψd,d′ ,a Wa ; r v − va0 + va0 = 0 for each
r=r̄
′ 2
(d, d ) ∈ {0, 1} , a ∈ A and r̄ ∈ (0, 1). Combining the previous results, it follows that with probability P
at least 1 − ∆n , for r̄ ∈ (0, 1), all a ∈ A and v ∈ Gan ∪ va0 , we have that
h  i 1
E ∂r2 ψd,d′ ,a Wa ; r v − va0 + va0 ≲ δ n n− 2 ,


r=r̄

2
and this result holds for each (d, d′ ) ∈ {0, 1} . Therefore, with probability P at least 1 − ∆n ,

E ∂r ψ a Wa ; r v − va0 + va0 ≲ δn n− 12
 2  
r=r̄

holds for r̄ ∈ (0, 1), all a ∈ A and v ∈ Gan ∪ va0 .

Verifying Assumption A.1.5 (Smoothness condition)

The case when d ̸= d′


 
We may write ψd,d′ ,a Wa , va0 − ψd,d′ ,ā Wā , vā0 as

0

1 {D = d} 1 − g2d
ψd,d′ ,a Wa , va0 ψd,d′ ,ā Wā , vā0
 
− = 0 ) g0 (Ya − Yā )
(1 − g1d 2d
" #
0
1 {D = d′ } 1 {D = d} 1 − g2d 0 0

+ 0 − 0 0 g3a − g3ā
1 − g1d (1 − g1d ) g2d

 
1 {D = d } 0 0

+ 1− g4ad − g4ād .
1 − g10 (d, X)

Using the Minkowski inequality yields

ψd,d′ ,a Wa , va0 − ψd,d′ ,ā Wā , vā0 ≤ Π1 (a) + Π2 (a) + Π3 (a) ,


 
P,2

where

1 {D = d} 1 − g 0 
2d
Π1 (a) = (Y − Y ) ,

0 ) g0 a ā
(1 − g1d 2d
P,2
" #
1 {D = d′ } 1 {D = d} 1 − g 0 
2d 0 0

Π2 (a) = − g − g ,

0 0 ) g0 3a 3ā
1 − g1d (1 − g1d 2d
P,2

 
1 {D = d } 0 0

1 − 1 − g 0 (d, X) g4ad − g4ād .
Π3 (a) =
1 P,2

For Π1 (a), we note that



1 {D = d} 1 − g 0  1 − g 0
1 − ε2
2d 2d
≤ ≤


0 0
(1 − g1d ) g2d

0 0
(1 − g1d ) g2d ε1 ε2

with probability one, which implies that

1 − ε2
Π1 (a) ≤ ∥Ya − Yā ∥P,2 .
ε1 ε2

For Π2 (a), we note that



1 {D = d′ } 1 {D = d} 1 − g 0  1 1 − ε2
2d
− ≤ +

0
1 − g1d 0 0
(1 − g1d ) g2d ε1 ε1 ε2

with probability one, which implies that


   
1 1 − ε2 0 0
1 1 − ε2
Π2 (a) ≤ + g3a − g3ā
P,2
≤ + ∥Ya − Yā ∥P,2 ,
ε1 ε1 ε2 ε1 ε1 ε2

because
 h i 12
0 0
g3a − g3ā
2
P,2
≤ E 1 {D = d} (Ya − Yā ) ≤ ∥Ya − Yā ∥P,2 .

For Π3 (a), we note that




1 − 1 {D = d } ≤ 1 + ε1 ,

0
1 − g1d ε1
which implies that

1 + ε1 0 0
1 + ε1
Π3 (a) ≤ g4ad − g4ād P,2
≤ ∥Ya − Yā ∥P,2 ,
ε1 ε1

because
 h i 21
0 0
g4ad − g4ād
′ 2
P,2
≤ E 1 {D = d } (E [Ya − Yā |d, M, X]) ≤ ∥Ya − Yā ∥P,2 .

Combining the previous results, we have


 
ψd,d′ ,a Wa , va0 − ψd,d′ ,ā Wā , vā0 ≤
  1 − ε2 1 1 − ε2 1 + ε1
P,2
+ + + ∥Ya − Yā ∥P,2
ε1 ε2 ε1 ε1 ε2 ε1
≲ ∥Ya − Yā ∥P,2 .

2
By Assumption 2.1, for each (d, d′ ) ∈ {0, 1} , we then obtain

sup ψd,d′ ,a Wa , va0 − ψd,d′ ,ā Wā , vā0 P,2 ≲ sup ∥Ya − Yā ∥P,2 = 0
 
P ∈P P ∈P

as dA (a, ā) → 0.

The case when d = d′

 1 {D = d}
ψd,d,a Wa , va0 − ψd,d,ā Wā , vā0 =

0 (1 {Y ≤ a} − 1 {Y ≤ ā})
g1d
 
1 {D = d} 0 0

+ 1− 0 g4ad − g4ād ,
g1d

0 0
where g4ad = g4ad (d, X). By the triangle inequality,

ψd,d,a Wa , va0 − ψd,d,ā Wā , vā0 ≤ Π4 (a) + Π5 (a) ≲ ∥Ya − Yā ∥ ,


 
P,2 P,2

where

1 {D = d}
≤ 1 ∥Ya − Yā ∥

Π4 (a) = 0 (1 {Y ≤ a} − 1 {Y ≤ ā}) P,2
g1d
P,2
ε1
 
1 {D = d} 0 0 ≤ 1 + ε1 ∥Ya − Yā ∥

1−
Π5 (a) =
g 0 g4ad − g4ād ε1 P,2
1d P,2

2
By Assumption 2.1, for each (d, d′ ) ∈ {0, 1} , we obtain

sup ψd,d,a Wa , va0 − ψd,d,ā Wā , vā0 P,2 ≲ sup ∥Ya − Yā ∥P,2 = 0
 
P ∈P P ∈P

as dA (a, ā) → 0.

Combining the previous results, it follows

sup ψ a Wa , va0 − ψ a Wā , vā0 P,2 = 0


 
P ∈P

as dA (a, ā) → 0.

Verifying Assumption A.1.6

The case when d ̸= d′


 0  0
Let G10 = g1d (X) : d ∈ {0, 1} G20 = g2d (M, X) : d ∈ {0, 1} ,

G30 = g3a
 0
(d, M, X) : a ∈ A, d ∈ {0, 1} ,
n o
2
G40 = g4ad
0
(d′ , X) : a ∈ A, {d, d′ } ∈ {0, 1} ,

G5 = {Ya : a ∈ A}, G6 = {1 {D = d} : d ∈ {0, 1}}. The union ∪4j=1 Gj0 forms the set Ga as defined above.
0 0
By our assumptions, g1d (X) and g2d (M, X) are bounded within the interval (0, 1) with probability one.
g3a (d, M, X) = FY |D,M,X (a|d, M, X) is a conditional c.d.f. and g40 (a, d, d′ X) = E g3a (d, M, X) |d′ , X
0
 0 

is a conditional expectation of a c.d.f., which are also bounded for a ∈ A with probability one. The
functions Ya = 1 {Y ≤ a} and 1 {D = d} are indicator functions and are bounded with probability one.
In conclusion, functions in the sets Gj0 , j = 1, . . . , 4 are uniformly bounded and their envelope functions are
all bounded by some constant. By Assumption 1 and Lemma L.2 of Belloni et al. (2017), it can be shown
that uniform covering numbers of functions in G30 and G40 are bounded by log (e/ϵ) ∨ 0 multiplied by some
constants. Uniform covering numbers of functions in G10 ,G20 , G5 and G6 are also bounded by log (e/ϵ) ∨ 0

multiplied by some constants. The function ψd,d′ ,a Wa , va0 is formed based on a union of functions in
2
the sets Gj0 , j = 1, . . . , 4, G5 and G6 . Let Ψ0d,d′ = ψd,d′ ,a Wa , va0 , a ∈ A , where (d, d′ ) ∈ {0, 1} .
 

Fixing (d, d′ ), we have



1 − g0  1 {D = d′ } 1 {D = d} 1 − g 0 
ψd,d′ ,a Wa , va0 ≤ 2d 2d 0

0 ) g 0 |Ya | + − g3a

(1 − g1d 1 − g 0 (1 − g 0 ) g0
2 1 1d 2
1 {D = d′ } 0


+ 1 − 0 g4ad
1 − g1d
   
1 − ε2 1 1 − ε2 0 1 + ε1 0
≤ + + g3a + g4ad .
ε1 ε2 ε1 ε1 ε2 ε1

The envelop function of f ∈ Ψ0d,d′ is defined as ψ 0d,d′ (W ) := supa∈A,v∈(∪4 G 0 ) |ψd,d′ ,a (Wa , v)| and we
j=1 j
have
 
1 − ε2 1 1 − ε2
ψ 0d,d′ (W ) ≤
0
+ + sup g3a (d, M, X)
ε1 ε2 ε1 ε1 ε2 a∈A,g3a
0 ∈G 0
3
 
1 + ε1
sup g4ad (d′ , X) .
0
+
ε1 a∈A,g40 ∈G40

0 0 −1 0
Using the facts that for q ≥ 4, g3a (d, M, X) P,q ≤ g3a (D, M, X) P,q ε2 q and g4ad (d′ , X) P,q ≤
1
g (D, X) ε− q , it follows that
0
4ad P,q 1

 
ψ d,d′ (W ) ≤ 1 − ε2 + 1 + 1 − ε2
1
g3a (D, M, X) ε− q
0 0
P,q
sup P,q 2
ε1 ε2 ε1 ε1 ε2 a∈A,g3a0 ∈G 0
3
 
1 + ε1 0 −1
+ sup g4ad (D, X) P,q ε1 q
ε1 a∈A,g40 ∈G40
  1   1
1 − ε2 1 1 − ε2 −q 1 + ε1 −
≤ + + ε2 + ε1 q < ∞,
ε1 ε2 ε1 ε1 ε2 ε1

and this property holds for P ∈ P. Therefore, supP ∈P ψ 0d,d′ (W ) < ∞ for q ≥ 4 and f ∈ Ψ0d,d′ is

P,q
uniformly bounded and has a uniform covering entropy bounded by log (e/ϵ) ∨ 0 up to multiplication by
a constant.

The case when d = d′



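The function $\psi_{d,d,a}(W_a, v_a^0)$ is formed by a union of functions in the sets $\mathcal{G}_1^0$, $\mathcal{G}_4^0$, $\mathcal{G}_5$ and $\mathcal{G}_6$. Let $\Psi^0_{d,d} = \{\psi_{d,d,a}(W_a, v_a^0),\ a \in \mathcal{A}\}$, where $d \in \{0, 1\}$. Fixing $d$, we have
\[
\begin{aligned}
\left|\psi_{d,d,a}(W_a, v_a^0)\right|
&\le \left|\frac{1\{D=d\}}{g_{1d}^0}\right| |Y_a| + \left|\frac{1\{D=d\}}{g_{1d}^0}\right| g_{4ad}^0 + g_{4ad}^0 \\
&\le \frac{1}{\varepsilon_1} + \left(\frac{1}{\varepsilon_1}+1\right) g_{4ad}^0 ,
\end{aligned}
\]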

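where $g_{4ad}^0 = g_{4ad}^0(d, X)$. The envelope function of $f \in \Psi^0_{d,d}$ is defined as $\bar{\psi}^0_{d,d}(W) := \sup_{a\in\mathcal{A},\, v\in(\mathcal{G}_1^0\cup\mathcal{G}_4^0)} |\psi_{d,d,a}(W_a, v)|$ and we have
\[
\bar{\psi}^0_{d,d}(W) \le \frac{1}{\varepsilon_1} + \left(\frac{1}{\varepsilon_1}+1\right) \sup_{a\in\mathcal{A},\, g_{4}^0\in\mathcal{G}_4^0} g_{4ad}^0(d, X) .
\]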

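For $q \ge 4$ and $\|g_{4ad}^0(d, X)\|_{P,q} \le \|g_{4ad}^0(D, X)\|_{P,q}\, \varepsilon_1^{-1/q}$, we have
\[
\begin{aligned}
\left\|\bar{\psi}^0_{d,d}(W)\right\|_{P,q}
&\le \frac{1}{\varepsilon_1} + \left(\frac{1}{\varepsilon_1}+1\right) \sup_{a\in\mathcal{A},\, g_{4}^0\in\mathcal{G}_4^0} \left\|g_{4ad}^0(d, X)\right\|_{P,q} \\
&\le \frac{1}{\varepsilon_1} + \left(\frac{1}{\varepsilon_1}+1\right) \sup_{a\in\mathcal{A},\, g_{4}^0\in\mathcal{G}_4^0} \left\|g_{4ad}^0(D, X)\right\|_{P,q} \varepsilon_1^{-1/q} \\
&\le \frac{1}{\varepsilon_1} + \left(\frac{1}{\varepsilon_1}+1\right) \varepsilon_1^{-1/q} < \infty,
\end{aligned}
\]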

and this property holds for $P \in \mathcal{P}$. Therefore, $\sup_{P\in\mathcal{P}} \|\bar{\psi}^0_{d,d}(W)\|_{P,q} < \infty$ for $q \ge 4$, and $f \in \Psi^0_{d,d}$ is uniformly bounded and has a uniform covering entropy bounded by $\log(e/\epsilon) \vee 0$ up to multiplication by a constant.
Combining the previous results, let $\Psi^0 = \Psi^0_{1,1} \cup \Psi^0_{1,0} \cup \Psi^0_{0,1} \cup \Psi^0_{0,0}$ be the union of $\Psi^0_{d,d'}$, $(d, d') \in \{0, 1\}^2$. Since $\Psi^0$ is a finite union of classes of functions which are uniformly bounded and have uniform entropies bounded by $\log(e/\epsilon) \vee 0$ up to multiplication by a constant, $\Psi^0$ has a uniform covering number satisfying $\sup_{P\in\mathcal{P}} \sup_Q \log N(\epsilon, \Psi^0, \|\cdot\|_{Q,2}) \lesssim \log(e/\epsilon) \vee 0$, and its envelope function
\[
\bar{\Psi}^0(W) = \sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A},\ v\in\cup_{j=1}^4 \mathcal{G}_j^0} \left|\psi_{d,d',a}(W_a, v)\right|
\]
is also bounded. Furthermore, the uniform covering integral of $\Psi^0$ satisfies:

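\[
\int_0^1 \sqrt{\sup_Q \log N\!\left(\epsilon, \Psi^0, \|\cdot\|_{Q,2}\right)}\, d\epsilon
\le \sqrt{C} \int_0^1 \sqrt{1-\log\epsilon}\, d\epsilon
\le \sqrt{C} \int_0^1 \frac{1}{\sqrt{\epsilon}}\, d\epsilon < \infty
\]
by the fact that $1 - \log\epsilon \le 1/\epsilon$ for all $\epsilon > 0$ and $\int_0^1 \epsilon^{-b}\, d\epsilon < \infty$ for $b < 1$.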
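The two facts used to bound the covering integral can be checked numerically. The following is only an illustrative sketch: the helper name `midpoint_integral` and the grid sizes are assumptions made for this example and are not part of the paper. It verifies $1 - \log\epsilon \le 1/\epsilon$ on a grid in $(0, 1)$ and approximates both integrals with a midpoint rule.

```python
import math

def midpoint_integral(f, m=200_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    return sum(f((i + 0.5) / m) for i in range(m)) / m

# Pointwise check of 1 - log(eps) <= 1/eps on a grid inside (0, 1).
grid = [(i + 0.5) / 1000 for i in range(1000)]
inequality_holds = all(1.0 - math.log(e) <= 1.0 / e for e in grid)

# Entropy-type integrand sqrt(1 - log eps) and its majorant eps^(-1/2).
entropy_integral = midpoint_integral(lambda e: math.sqrt(1.0 - math.log(e)))
majorant_integral = midpoint_integral(lambda e: e ** -0.5)

print(inequality_holds, entropy_integral, majorant_integral)
```

The majorant integrates to $2$ exactly (the case $b = 1/2$), so the approximated entropy integral should come out below that value.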
The second condition in (B.1) of Belloni et al. (2017) can also be proven (which corresponds to
Assumption A1.4). The goal of such a proof is the same as the aim of bounding mn in Farbmacher et al.
(2022) and Chernozhukov et al. (2018). But in their scenarios, they need to consider Y, E [Y |d, M, X] and
E [E [Y |d, M, X] |d′ , X], which requires additional moment conditions, say ∥Y ∥P,q , ∥E [Y |d, M, X]∥P,q
and ∥E [E [Y |d, M, X] |d′ , X]∥P,q . Our scenario is less challenging since the functions g3a (d, M, X) (a
c.d.f.) and g4ad (d′ , X) (an expectation of a c.d.f.) are bounded.

Verifying Assumption A.1.7

The case when d ̸= d′


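We define the following sets of functions:
\[
\begin{aligned}
\mathcal{G}_1(d) &:= \Big\{ x \mapsto h_1\big(f(x)^\top\beta_1\big) : \|\beta_1\|_0 \le s_1,\ \big\|h_1\big(f(X)^\top\beta_1\big) - g_{1d}^0(X)\big\|_{P,2} \lesssim \delta_n n^{-1/4},\ \big\|h_1\big(f(X)^\top\beta_1\big) - g_{1d}^0(X)\big\|_{P,\infty} \lesssim C \Big\}, \\
\mathcal{G}_2(d) &:= \Big\{ (m,x) \mapsto h_2\big(f(m,x)^\top\beta_2\big) : \|\beta_2\|_0 \le s_2,\ \big\|h_2\big(f(M,X)^\top\beta_2\big) - g_{2d}^0(M,X)\big\|_{P,2} \lesssim \delta_n n^{-1/4},\ \big\|h_2\big(f(M,X)^\top\beta_2\big) - g_{2d}^0(M,X)\big\|_{P,\infty} \lesssim C \Big\}, \\
\mathcal{G}_3(d) &:= \Big\{ (d,m,x) \mapsto h_3\big(f(d,m,x)^\top\beta_3\big) : \|\beta_3\|_0 \le s_3,\ \big\|h_3\big(f(d,M,X)^\top\beta_3\big) - g_{3a}^0(d,M,X)\big\|_{P,2} \lesssim \delta_n n^{-1/4},\ \big\|h_3\big(f(d,M,X)^\top\beta_3\big) - g_{3a}^0(d,M,X)\big\|_{P,\infty} \lesssim C \Big\}, \\
\mathcal{G}_4(d') &:= \Big\{ (d',x) \mapsto h_4\big(f(d',x)^\top\beta_4\big) : \|\beta_4\|_0 \le s_4,\ \big\|h_4\big(f(d',X)^\top\beta_4\big) - g_{4ad}^0(d',X)\big\|_{P,2} \lesssim \delta_n n^{-1/4},\ \big\|h_4\big(f(d',X)^\top\beta_4\big) - g_{4ad}^0(d',X)\big\|_{P,\infty} \lesssim C \Big\},
\end{aligned}
\]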

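where $\beta_i$, $i = 1, \ldots, 4$, are vectors of coefficients on different sets of conditioning variables and $\|\cdot\|_0$ denotes the $\ell_0$ norm. By Assumption 2.3b, $\dim(f(x)) = p \times 1$, $\dim(f(m, x)) = (p+1) \times 1$, $\dim(f(d, m, x)) = (p+2) \times 1$, and $\dim(f(d', x)) = (p+1) \times 1$. From Assumption 2.4 it follows that with $P$-probability no less than $1 - \Delta_n$, $1 - \hat{g}_{1d}(X) \in 1 - \mathcal{G}_1(d)$, $\hat{g}_{2d}(M, X) \in \mathcal{G}_2(d)$, $1 - \hat{g}_{2d}(M, X) \in 1 - \mathcal{G}_2(d)$, $\hat{g}_{3a}(d, M, X) \in \mathcal{G}_3(d)$ and $\hat{g}_{4ad}(d', X) \in \mathcal{G}_4(d')$. Notice that the union $(1 - \mathcal{G}_1(d)) \cup \mathcal{G}_2(d) \cup (1 - \mathcal{G}_2(d)) \cup \mathcal{G}_3(d) \cup \mathcal{G}_4(d')$ forms the set $\mathcal{G}_{an}$. We consider the following sets of functions:
\[
\begin{aligned}
\mathcal{H}_1 &= \big\{ x \mapsto h_1\big(f(x)^\top\beta_1\big) : \|\beta_1\|_0 \le s_1,\ h_1 \in \mathcal{H}_1^* \big\}, \\
\mathcal{H}_2 &= \big\{ (m,x) \mapsto h_2\big(f(m,x)^\top\beta_2\big) : \|\beta_2\|_0 \le s_2,\ h_2 \in \mathcal{H}_2^* \big\}, \\
\mathcal{H}_3 &= \big\{ (d,m,x) \mapsto h_3\big(f(d,m,x)^\top\beta_3\big) : \|\beta_3\|_0 \le s_3,\ h_3 \in \mathcal{H}_3^* \big\}, \\
\mathcal{H}_4 &= \big\{ (d',x) \mapsto h_4\big(f(d',x)^\top\beta_4\big) : \|\beta_4\|_0 \le s_4,\ h_4 \in \mathcal{H}_4^* \big\},
\end{aligned}
\]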
where $\mathcal{H}_i^*$, $i = 1, \ldots, 4$, are sets containing a finite number of monotonically increasing, continuously differentiable link functions, possibly bounded within a certain interval (say $(0, 1)$). For example, Belloni et al. (2017) chose some commonly used link functions: $\{\mathrm{Id}, \Lambda, 1 - \Lambda, \Phi, 1 - \Phi\}$, where $\mathrm{Id}$ is the identity function, $\Lambda$ is the logistic link, and $\Phi$ is the probit link. In our case, each $\mathcal{H}_i^*$ is also a subset of $\{\mathrm{Id}, \Lambda, 1 - \Lambda, \Phi, 1 - \Phi\}$. Obviously, $1 - \mathcal{G}_1(d) \subseteq 1 - \mathcal{H}_1$, $\mathcal{G}_2(d) \subseteq \mathcal{H}_2$, $1 - \mathcal{G}_2(d) \subseteq 1 - \mathcal{H}_2$, $\mathcal{G}_3(d) \subseteq \mathcal{H}_3$, and $\mathcal{G}_4(d') \subseteq \mathcal{H}_4$ by Assumption 2.3a. For functions in $1 - \mathcal{G}_1(d)$, $\mathcal{G}_2(d)$, $1 - \mathcal{G}_2(d)$ and $\mathcal{G}_3(d)$, the envelope functions are constant and bounded. As shown in Belloni et al. (2017), for the set $1 - \mathcal{H}_1$, $f(x)^\top\beta_1$ is a VC-subgraph function with VC dimension bounded by some constant ($s_1$), and $1 - \mathcal{H}_1$ is a union of at most $\binom{p}{s_1}$ such functions. Therefore,
\[
\log \sup_Q N\!\left(\epsilon, \mathcal{G}_1, \|\cdot\|_Q\right) \lesssim \left(s_1 \log p + s_1 \log (e/\epsilon)\right) \vee 0 .
\]

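Using a similar argument, we obtain
\[
\begin{aligned}
\log \sup_Q N\!\left(\epsilon, \mathcal{G}_2 \cup (1 - \mathcal{G}_2), \|\cdot\|_Q\right) &\lesssim \left(s_2 \log (p+1) + s_2 \log (e/\epsilon)\right) \vee 0, \\
\log \sup_Q N\!\left(\epsilon, \mathcal{G}_3, \|\cdot\|_Q\right) &\lesssim \left(s_3 \log (p+2) + s_3 \log (e/\epsilon)\right) \vee 0 .
\end{aligned}
\]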


Concerning the functions in $\mathcal{G}_4(d')$, they have an additive linear form $f(d', X)^\top\beta_4 = aD + X^\top b$, with $\beta_4 = (a, b^\top)^\top$. Their envelope function is bounded because $\beta_4$ is bounded and because Assumption 2.3e implies that $\|X\|_{P,q}$ is bounded. Concerning the set $\mathcal{H}_4$ and as shown in Belloni et al. (2017), $f(d', X)^\top\beta_4$ is a VC-subgraph function with VC dimension bounded by some constant ($s_4$), and $\mathcal{H}_4$ is a union of at most $\binom{p+1}{s_4}$ such functions. Therefore,
\[
\log \sup_Q N\!\left(\epsilon, \mathcal{G}_4, \|\cdot\|_Q\right) \lesssim \left(s_4 \log (p+1) + s_4 \log (e/\epsilon)\right) \vee 0 .
\]

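Combining the previous results, it follows that
\[
\log \sup_Q N\!\left(\epsilon, \mathcal{G}_{an}, \|\cdot\|_Q\right) \lesssim \left(s \log p + s \log (e/\epsilon)\right) \vee 0,
\]
where $s_1 + s_2 + s_3 + s_4 \le s$. The set of functions
\[
\mathcal{F}_{2,n} = \left\{ \psi_{d,d',a}(W_a, v) - \psi_{d,d',a}(W_a, v_a^0) : (d, d') \in \{0, 1\}^2,\ a \in \mathcal{A},\ v \in \mathcal{G}_{an} \right\}
\]
is a Lipschitz transformation of the function sets $\mathcal{G}_i^0$, $i = 1, \ldots, 4$, $\mathcal{G}_5$, $\mathcal{G}_6$ defined in the previous proof, and $\mathcal{G}_{an}$, with bounded Lipschitz coefficients and with a constant envelope. Therefore,
\[
\log \sup_Q N\!\left(\epsilon, \mathcal{F}_{2,n}, \|\cdot\|_Q\right) \lesssim \left(s \log p + s \log \frac{e}{\epsilon}\right) \vee 0 .
\]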

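With $P$-probability $1 - o(1)$, we have
\[
\sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A}} \left| G_{N,k}\left( \psi_{d,d',a}(W_a; \hat{v}_{k,a}) - \psi_{d,d',a}(W_a; v_a^0) \right) \right| \le \sup_{f\in\mathcal{F}_{2,n}} |G_{N,k} f| .
\]
Furthermore, we have proven that $r_n \lesssim \delta_n n^{-1/4}$ and therefore $\sup_{f\in\mathcal{F}_{2,n}} \|f\|_{P,2} \lesssim r_n \lesssim \delta_n n^{-1/4}$ holds. Let $L \ge e$ be a constant. We use the maximum inequality A.1 of Lemma 6.2 of Chernozhukov et al. (2018), setting the parameters $\sigma = C''\delta_n n^{-1/4}$, where $C'' > 1$ is some constant, $a = b = p$, where $\log p = o(n^{-1/3}) = o(K^{-1/3} N^{-1/3})$, and $v = s$ in this maximum inequality. We obtain that
\[
\begin{aligned}
\sup_{f\in\mathcal{F}_{2,n}} |G_{N,k} f|
&\lesssim \delta_n n^{-1/4} \sqrt{s \log\left(p \vee L \vee \sigma^{-1}\right)} + \frac{s}{\sqrt{N}} \log\left(p \vee L \vee \sigma^{-1}\right) \\
&\lesssim \delta_n n^{-1/4} \sqrt{s \log (p \vee n)} + \sqrt{K s^2 \log^2 (p \vee n)\, n^{-1}} \\
&\lesssim \delta_n \delta_n^{1/4} + \delta_n^{1/2} \log^{-1} n \lesssim \delta_n^{1/2},
\end{aligned}
\]
by applying similar arguments as in the proof of Theorem A.1 and by Assumption 2.3c.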

The case when d = d′


As this case is a special case of d ̸= d′ , the same results apply.

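Proof of Theorem 2. The proof makes use of Theorem A.2. As in Theorem A.2, let $U^*_{n,P} := \left( G_n \xi \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right)_{a\in\mathcal{A}}$. To show that $Z^*_{n,P} \rightsquigarrow_B Z_P$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$, we first show that $\|Z^*_{n,P} - U^*_{n,P}\| = o_P(1)$ and then show that $U^*_{n,P} \rightsquigarrow_B Z_P$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$. Let $Z^*_{n,P}(a) := G_n \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \hat{\theta}_a \right)$ and $U^*_{n,P}(a) := G_n \xi \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right)$. We notice that since $E[\xi] = 0$ and $\xi$ and $W_a$ are independent, $E\left[ \xi \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right] = 0$ and
\[
\begin{aligned}
Z^*_{n,P}(a) &= \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i \left( \psi_a(W_{a,i}; \hat{v}_a^0) - \hat{\theta}_a^0 \right), \\
U^*_{n,P}(a) &= \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i \left( \psi_a(W_{a,i}; v_a^0) - \theta_a^0 \right).
\end{aligned}
\]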

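It follows that
\[
\sup_{a\in\mathcal{A}} \left\| Z^*_{n,P}(a) - U^*_{n,P}(a) \right\| \le \Pi_1 + \Pi_2,
\]
where
\[
\begin{aligned}
\Pi_1 &= \sup_{a\in\mathcal{A}} \left\| G_n \xi \left( \psi_a(W_{a,i}; \hat{v}_a^0) - \psi_a(W_{a,i}; v_a^0) \right) \right\| \\
&= \sup_{a\in\mathcal{A}} \left\| \sqrt{n}\, \frac{1}{K} \sum_{k=1}^K \frac{1}{\sqrt{N}}\, G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}^0) - \psi_a(W_a; v_a^0) \right) \right\| \\
&\le \sqrt{n}\, \frac{1}{K} \sum_{k=1}^K \frac{1}{\sqrt{N}} \sup_{a\in\mathcal{A}} \left\| G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}^0) - \psi_a(W_a; v_a^0) \right) \right\| ,
\end{aligned}
\]
and
\[
\Pi_2 = \sup_{a\in\mathcal{A}} \left\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i \left( \hat{\theta}_a^0 - \theta_a^0 \right) \right\| \le \sup_{a\in\mathcal{A}} \left\| \hat{\theta}_a^0 - \theta_a^0 \right\| |G_n \xi| .
\]
The term $\Pi_2$ is $O_p(n^{-1/2})$, since $\sup_{a\in\mathcal{A}} \|\hat{\theta}_a^0 - \theta_a^0\| = O_p(n^{-1/2})$ by Theorem 1 and $|G_n \xi| = O_p(1)$.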

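Concerning the term $\Pi_1$, recall the class of functions used in the proof of Theorem 1:
\[
\mathcal{F}_{2,n} = \left\{ \psi_{d,d',a}(W_a; v) - \psi_{d,d',a}(W_a; v_a^0) : (d, d') \in \{0, 1\}^2,\ a \in \mathcal{A},\ v \in \mathcal{G}_{an} \right\},
\]
where the envelope function $F_{2,n}$ is $|\xi|$ times a constant. In the proof of Theorem 1, we have established that the covering entropy of $\mathcal{F}_{2,n}$ obeys
\[
\log \sup_Q N\!\left(\epsilon, \mathcal{F}_{2,n}, \|\cdot\|_Q\right) \lesssim \left(s \log p + s \log \frac{e}{\epsilon}\right) \vee 0 .
\]
Furthermore, using Lemma L.1 in the appendix of Belloni et al. (2017), multiplication of this class by $\xi$ does not change the entropy bound modulo an absolute constant, and therefore its covering entropy is bounded by the same order as $\log \sup_Q N(\epsilon, \mathcal{F}_{2,n}, \|\cdot\|_Q)$,
\[
\log \sup_Q N\!\left(\epsilon \|F_{2,n}\|_{Q,2}, \xi\mathcal{F}_{2,n}, \|\cdot\|_{Q,2}\right) \lesssim \left(s \log p + s \log \frac{e}{\epsilon}\right) \vee 0 .
\]
Next, we use the result $\left( E\left[ \max_{i\in I_k} \xi_i^2 \right] \right)^{1/2} \lesssim \log N$, which follows from $E[\exp(|\xi|)] < \infty$, and the maximum inequality A.1 of Lemma 6.2 of Chernozhukov et al. (2018), setting the envelope function $F_{2,n} = C'''|\xi|$, $\sigma = C''\delta_n n^{-1/4}$, where $C'' > 1$ is some constant, $a = b = p$, where $\log p = o(n^{-1/3}) = o(K^{-1/3} N^{-1/3})$, and $v = s$. With $P$-probability $1 - o(1)$, we have
\[
\begin{aligned}
\sup_{f\in\xi\mathcal{F}_{2,n}} |G_{N,k} f|
&\lesssim \delta_n n^{-1/4} \sqrt{s \log\left(p \vee L \vee \sigma^{-1}\right)} + \frac{s \log N}{\sqrt{N}} \log\left(p \vee L \vee \sigma^{-1}\right) \\
&\lesssim \delta_n n^{-1/4} \sqrt{s \log\left(p \vee L \vee \sigma^{-1}\right)} + K^{1/2}\, \frac{s\, (\log n - \log K)}{\sqrt{n}} \log\left(p \vee L \vee \sigma^{-1}\right) \\
&\lesssim \delta_n n^{-1/4} \sqrt{s \log (p \vee n)} + \sqrt{\frac{s^2 \log^2 (p \vee n) \log^2 n}{n}} \\
&\lesssim \delta_n \delta_n^{1/4} + \delta_n^{1/2} \lesssim \delta_n^{1/2} = o_p(1),
\end{aligned}
\]
by using $\sup_{f\in\xi\mathcal{F}_{2,n}} \|f\|_{P,2} = \sup_{f\in\mathcal{F}_{2,n}} \|f\|_{P,2} \le r_n \lesssim \delta_n n^{-1/4}$ and Assumption 2.3c. With $P$-probability $1 - o(1)$ and for $\hat{v}_{k,a} \in \mathcal{V}_{an}$, it can be shown that
\[
\sup_{a\in\mathcal{A}} \left\| G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0) \right) \right\| \lesssim \sup_{f\in\xi\mathcal{F}_{2,n}} |G_{N,k} f| .
\]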

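Therefore we conclude that with $P$-probability $1 - o(1)$,
\[
\frac{1}{\sqrt{N}} \sup_{a\in\mathcal{A}} \left\| G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0) \right) \right\| \lesssim K^{1/2} n^{-1/2} o_p(1) \lesssim o_p\!\left(n^{-1/2}\right),
\]
and since $K$ is fixed and finite,
\[
\sup_{a\in\mathcal{A}} \left\| G_n\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0) \right) \right\| \lesssim \sqrt{n}\, o_p\!\left(n^{-1/2}\right) = o_p(1),
\]
which implies that $\|Z^*_{n,P} - U^*_{n,P}\| = o_p(1)$. Next, we notice that $U^*_{n,P}$ is associated with the class of functions $\xi f$, where $f \in \mathcal{F}_0$. As shown in the proof of Theorem 1, the class $\mathcal{F}_0$ is Donsker uniformly in $P \in \mathcal{P}_n$ under the required assumptions. Therefore we can invoke Theorem B.2 of Belloni et al. (2017) and conclude that $U^*_{n,P} \rightsquigarrow_B Z_P$. Indeed, $U^*_{n,P}$ and $Z_P$ both are Gaussian processes, and share the same (zero) mean and the same covariance matrix. Finally, using a similar argument as in step 2 for proving Theorem 5.2 in the appendix of Belloni et al. (2017), it follows that $Z^*_{n,P} \rightsquigarrow_B Z_P$. Let $BL_1(l^\infty(\mathcal{A}))$ be the space of functions mapping the space of functions in $l^\infty(\mathcal{A})$ to $[0, 1]$ with a Lipschitz norm of at most 1. Let $E_{B_n}$ denote the expectation over the multiplier weights $(\xi_i)_{i=1}^n$ when holding the data $(W_i)_{i=1}^n$ fixed. Following step 2 for proving Theorem 5.2 in the appendix of Belloni et al. (2017), we obtain the following inequality:
\[
\begin{aligned}
\sup_{h\in BL_1(l^\infty(\mathcal{A}))} \left| E_{B_n}\!\left[ h\!\left(Z^*_{n,P}\right) \right] - E_P\left[ h(Z_P) \right] \right|
&\le \sup_{h\in BL_1(l^\infty(\mathcal{A}))} \left| E_{B_n}\!\left[ h\!\left(U^*_{n,P}\right) \right] - E_P\left[ h(Z_P) \right] \right| \\
&\quad + E_{B_n}\!\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right].
\end{aligned}
\]
The first term vanishes by Theorem B.2 of Belloni et al. (2017), since we have proven that $U^*_{n,P} \rightsquigarrow_B Z_P$. The second term is $o_P(1)$ since $E\left[ \|Z^*_{n,P} - U^*_{n,P}\| \wedge 2 \right] = E\left[ E_{B_n}\left[ \|Z^*_{n,P} - U^*_{n,P}\| \wedge 2 \right] \right] \to 0$ by using the Markov inequality,
\[
\begin{aligned}
P\left( E_{B_n}\!\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right] \ge \varepsilon \right)
&\le \frac{E\left[ E_{B_n}\!\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right] \right]}{\varepsilon} \\
&= \frac{E\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right]}{\varepsilon} .
\end{aligned}
\]
As shown above, $\|Z^*_{n,P} - U^*_{n,P}\| = o_p(1)$, which implies that $\|Z^*_{n,P} - U^*_{n,P}\| \wedge 2 = o_P(1)$. Therefore, $\sup_{h\in BL_1(l^\infty(\mathcal{A}))} \left| E_{B_n}\left[ h(Z^*_{n,P}) \right] - E_P\left[ h(Z_P) \right] \right|$ vanishes and we obtain that $Z^*_{n,P} \rightsquigarrow_B Z_P$.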
Concerning the proof of Theorem 3, we first introduce the definition of uniform Hadamard differ-
entiability in the appendix further below. The proof relies on Theorems B.3 and B.4 of Belloni et al.
(2017) (restated as Theorem A.4 and A.5 in their appendix), which show that when an estimator satisfies
uniform validity, this property also holds for a transformation of this estimator if uniform Hadamard
tangential differentiability of the transformation holds.
Proof of Theorem 3. The map $\phi_\theta$ is uniformly Hadamard tangentially differentiable and, as shown in Theorems 1 and 2, both $Z_{n,P} \rightsquigarrow Z_P$ and $Z^*_{n,P} \rightsquigarrow Z_P$ in $l^\infty(\mathcal{A})^4$ uniformly in $P \in \mathcal{P}_n$. Therefore, the proof can be completed by using Theorems A.4 and A.5 (which restate Theorems B.3 and B.4 of Belloni et al. (2017)).

A.4 General Theorems for Uniform Validity of a K-Fold Cross Fitting Estimator Based on the EIF
We subsequently derive some useful theorems for establishing the uniform validity of the proposed K-fold cross-fitting estimator based on the efficient influence function (EIF) under specific conditions (see below). We recall the notation used in Section 3 and assume that the parameter of interest, $F^0_{Y(d,M(d'))}(a)$, is identified as
\[
F^0_{Y(d,M(d'))}(a) = \theta^0_{d,d',a}, \qquad \text{(A.31)}
\]
for $(d, d') \in \{0, 1\}^2$, where $\theta^0_{d,d',a} := E\left[ \psi_{d,d',a}(W_a; v_a^0) \right]$ is an expectation of $\psi_{d,d',a}$ evaluated with the true nuisance parameters. The estimator of $\theta^0_{d,d',a}$ is the K-fold cross-fitting estimator
\[
\hat{\theta}_{d,d',a} = \frac{1}{K} \sum_{k=1}^K \hat{\theta}^{(k)}_{d,d',a},
\]
where $\hat{\theta}^{(k)}_{d,d',a} = N^{-1} \sum_{i\in I_k} \psi_{d,d',a}(W_{a,i}; \hat{v}_{k,a})$. Let $\psi_a(W_a, v)$ denote a vector containing the elements $\psi_{d,d',a}(W_a; v)$, $(d, d') \in \{0, 1\}^2$. Let $\theta_a^0$, $\hat{\theta}_a$, $\hat{\theta}_a^{(k)}$ and $F^0(a)$ denote vectors containing $\theta^0_{d,d',a}$, $\hat{\theta}_{d,d',a}$, $\hat{\theta}^{(k)}_{d,d',a}$ and $F^0_{Y(d,M(d'))}(a)$ over different $(d, d') \in \{0, 1\}^2$. It holds that $\theta_a^0 = E\left[ \psi_a(W_a; v_a^0) \right]$ and $\hat{\theta}_a = K^{-1} \sum_{k=1}^K \hat{\theta}_a^{(k)}$. If equation (A.31) holds for all $a \in \mathcal{A}$ and $(d, d') \in \{0, 1\}^2$, then $F^0(a) = \theta_a^0$.
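The K-fold cross-fitting average above can be sketched in a few lines of Python. This is only a toy illustration under several assumptions that are not taken from the paper: it covers the case $d = d'$ only, it assumes the score has the standard doubly robust form $\psi_{d,d,a} = 1\{D=d\}\,(1\{Y \le a\} - g_{4ad}(d,X))/g_{1d}(X) + g_{4ad}(d,X)$, it uses a single binary covariate with cell frequencies as nuisance estimators instead of machine learners, and the names `simulate` and `cross_fit_cdf` are hypothetical.

```python
import random

def simulate(n, seed=1):
    """Toy data (y, t, x): binary covariate x, binary treatment t, Y = X + D + U."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.6 if x == 1 else 0.4) else 0
        data.append((x + t + rng.random(), t, x))
    return data

def cross_fit_cdf(data, d, a, K=5, seed=0):
    """K-fold cross-fitted estimate of F_{Y(d)}(a) (case d = d')."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    fold_means = []
    for k in range(K):
        held_out = set(folds[k])
        train = [data[i] for i in idx if i not in held_out]
        g1, g4 = {}, {}
        for x in (0, 1):
            cell = [(y, t) for (y, t, xx) in train if xx == x]
            treated = [y for (y, t) in cell if t == d]
            # Propensity P(D = d | X = x), clipped away from 0 and 1.
            g1[x] = min(max(len(treated) / len(cell), 0.01), 0.99)
            # Conditional c.d.f. P(Y <= a | D = d, X = x).
            g4[x] = sum(y <= a for y in treated) / len(treated)
        scores = []
        for i in folds[k]:
            y, t, x = data[i]
            ind_d = 1.0 if t == d else 0.0
            ind_y = 1.0 if y <= a else 0.0
            scores.append(ind_d / g1[x] * (ind_y - g4[x]) + g4[x])
        fold_means.append(sum(scores) / len(scores))  # theta_hat^(k)
    return sum(fold_means) / K  # average over the K folds

theta_hat = cross_fit_cdf(simulate(8000), d=1, a=2.5)
print(theta_hat)
```

In this design $Y(1) = X + 1 + U$ with $U$ uniform on $[0, 1)$, so the target $F_{Y(1)}(2.5)$ equals $0.75$ and the estimate should be close to that value. The nuisance functions for fold $k$ are fitted only on the complement of fold $k$, which is the defining feature of cross-fitting.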

The main results are stated in Theorems A.1 to A.3. Establishing these theorems relies on imposing
the following high level assumptions on ψd,d′ ,a (Wa ; v).

Assumption 3 Consider a random element $W$, taking values in a measure space $(\mathcal{W}, \mathcal{X}_W)$, with the law determined by a probability measure $P \in \mathcal{P}_n$. The observed data $\left( (W_{a,i})_{a\in\mathcal{A}} \right)_{i=1}^n$ consist of $n$ i.i.d. copies of a random element $(W_a)_{a\in\mathcal{A}}$ which is generated as a suitably measurable transformation with respect to $W$ and $a$. Uniformly for all $3 \le n_0 \le n$ and $P \in \mathcal{P}_n$,
1. The true parameter $F^0_{Y(d,M(d'))}(a)$ satisfies equation (A.31), $\theta^0_{d,d',a}$ is interior relative to $\Theta_a \subset \Theta \subset \mathbb{R}$ for all $a \in \mathcal{A}$, $(d, d') \in \{0, 1\}^2$, and $\Theta$ is a compact set.

2. For a ∈ A, the map v 7−→ E [ψ a (Wa ; v)] is twice continuously Gateaux-differentiable on Va .
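3. The function $\psi_a(W_a; v)$ satisfies the following Neyman $\lambda_n$ near-orthogonality condition at $v = v_a^0$ with respect to $a \in \mathcal{A}$ and $v \in \mathcal{V}_{an} \cup \{v_a^0\}$:
\[
\lambda_n := \sup_{a\in\mathcal{A},\ v\in\mathcal{V}_{an}\cup\{v_a^0\}} \left\| \partial_v E\left[ \psi_a(W_a; v_a^0) \right] \left[ v - v_a^0 \right] \right\| \le \delta_n n^{-1/2},
\]
where $\delta_n$ is a sequence converging to zero from above at a speed at most polynomial in $n$, e.g., $\delta_n \ge n^{-c}$ for some $c > 0$.
4. The following moment conditions hold:
\[
\begin{aligned}
r_n &:= \sup_{a\in\mathcal{A},\ v\in\mathcal{V}_{an}} \left\| \psi_a(W_a; v) - \psi_a(W_a; v_a^0) \right\|_{P,2} \le \delta_n n^{-1/4}, \\
\lambda'_n &:= \sup_{a\in\mathcal{A},\ v\in\mathcal{V}_{an}\cup\{v_a^0\},\ \tilde{r}\in(0,1)} \left\| \partial_r^2 E\left[ \psi_a\!\left(W_a; r\left(v - v_a^0\right) + v_a^0\right) \right] \Big|_{r=\tilde{r}} \right\| \le \delta_n n^{-1/2} .
\end{aligned}
\]
5. The following smoothness condition holds for each $(d, d') \in \{0, 1\}^2$:
\[
\sup_{d_{\mathcal{A}}(a,\bar{a})\le\delta} E\left[ \left\{ \left( \psi_{d,d',a}(W_a; v_a^0) - \theta^0_{d,d',a} \right) - \left( \psi_{d,d',\bar{a}}(W_{\bar{a}}; v_{\bar{a}}^0) - \theta^0_{d,d',\bar{a}} \right) \right\}^2 \right] \le C\delta^{c_1},
\]
where $c_1$ is a constant.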
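6. The set of functions
\[
\mathcal{F}_0 = \left\{ \psi_{d,d',a}(W_a; v_a^0) - \theta^0_{d,d',a} : (d, d') \in \{0, 1\}^2,\ a \in \mathcal{A} \right\},
\]
expressed as a function of $W$, is suitably measurable, and has an envelope function
\[
F_0(W) = \sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A},\ v\in\mathcal{V}_a,\ \theta\in\Theta_a} \left| \psi_{d,d',a}(W_a; v) - \theta \right|,
\]
which is measurable with respect to $W$, and $\|F_0(W)\|_{P,q} \le C$, where $q \ge 4$ is a fixed constant. Its uniform covering entropy satisfies
\[
\log \sup_Q N\!\left(\epsilon \|F_0\|_{Q,2}, \mathcal{F}_0, \|\cdot\|_{Q,2}\right) \le C \log (e/\epsilon) \vee 0,
\]
where $C > 0$ is a constant, $e$ denotes $\exp(1)$ and $0 < \epsilon \le 1$.
7. The set of functions
\[
\mathcal{F}_1 = \left\{ \psi_{d,d',a}(W_a; v) - \theta : (d, d') \in \{0, 1\}^2,\ a \in \mathcal{A},\ v \in \mathcal{V}_{an},\ \theta \in \Theta_a \right\}
\]
is suitably measurable and has an envelope function
\[
F_1(W) = \sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A},\ v\in\mathcal{V}_{an},\ \theta\in\Theta_a} \left| \psi_{d,d',a}(W_a; v) - \theta \right|,
\]
which is measurable with respect to $W$, and $F_1(W) \le F_0(W)$. Its uniform covering entropy satisfies
\[
\log \sup_Q N\!\left(\epsilon \|F_1\|_{Q,2}, \mathcal{F}_1, \|\cdot\|_{Q,2}\right) \le v \log (b/\epsilon) \vee 0,
\]
where $v \ge 1$, $b \ge \max\{e, N\}$ and $0 < \epsilon \le 1$.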


8. Nuisance parameter estimation:
Let $K$ be a fixed integer and $\Delta_n$ and $\tau_n$ be sequences of positive constants converging to zero at a speed of at most polynomial in $n$. The following conditions hold for each $n \ge 3$ and all $P \in \mathcal{P}_n$. Given a random subset $I_k$, $k = 1, \ldots, K$, of size $n/K$, the estimated nuisance parameter $\{\hat{v}_{k,a,g}\}_{g=1}^G \in \mathcal{V}_{an}$ with probability at least $1 - \Delta_n$, where $\mathcal{V}_{an}$ is the set of measurable maps $\{v_g\}_{g=1}^G \in \mathcal{V}_a$ such that for each $g$, $\|v_g - v^0_{a,g}\|_{P,2} \le \tau_n$ and $\sqrt{n}\,\tau_n^2 \le \delta_n$. Therefore, when denoting by $\mathcal{E}_n$ the event that $\hat{v}_{k,a} \in \mathcal{V}_{an}$ for all $k = 1, \ldots, K$, the probability of $\mathcal{E}_n$ is not smaller than $1 - K\Delta_n = 1 - o(1)$.

Let $G_n$ denote an empirical process $G_n f(W) = \sqrt{n}\left( E_n f(W) - E[f(W)] \right)$, where $f$ is any $P \in \mathcal{P}_n$ integrable function on the set $\mathcal{W}$. Let $G_P f(W)$ denote the limiting process of $G_n f(W)$, which is a Gaussian process with zero mean and a finite covariance matrix $E\left[ \left( f(W) - E[f(W)] \right) \left( f(W) - E[f(W)] \right)^\top \right]$ under probability $P$ (the $P$-Brownian bridge). Using the previous notation and assumptions, we obtain the following result.

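Theorem 4 If Assumptions A.1.1 to A.1.8 hold, the K-fold cross-fitting estimator $\hat{\theta}_a$ for estimating $F^0(a)$ satisfies
\[
\sqrt{n}\left( \hat{\theta}_a - F^0(a) \right)_{a\in\mathcal{A}} = Z_{n,P} + o_P(1),
\]
in $l^\infty(\mathcal{A})^4$, uniformly in $P \in \mathcal{P}_n$, where $Z_{n,P} := \left( G_n \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right)_{a\in\mathcal{A}}$. Furthermore,
\[
Z_{n,P} \rightsquigarrow Z_P
\]
in $l^\infty(\mathcal{A})^4$, uniformly in $P \in \mathcal{P}_n$, where $Z_P := \left( G_P \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right)_{a\in\mathcal{A}}$ and the paths of $a \mapsto G_P \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right)$ are a.s. uniformly continuous on $(\mathcal{A}, d_{\mathcal{A}})$, and
\[
\begin{aligned}
&\sup_{P\in\mathcal{P}_n} E\left[ \sup_{a\in\mathcal{A}} \left\| G_P \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right\| \right] < \infty, \\
&\lim_{\delta\to 0} \sup_{P\in\mathcal{P}_n} E\left[ \sup_{d_{\mathcal{A}}(a,\bar{a})\le\delta} \left\| G_P \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) - G_P \left( \psi_{\bar{a}}(W_{\bar{a}}; v_{\bar{a}}^0) - \theta_{\bar{a}}^0 \right) \right\| \right] = 0 .
\end{aligned}
\]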

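Under Assumptions A.1.1 to A.1.8, we can also establish the uniform validity of the multiplier bootstrap. Recall the multiplier bootstrap estimator:
\[
\hat{\theta}^*_{d,d',a} = \hat{\theta}_{d,d',a} + \frac{1}{n} \sum_{i=1}^n \xi_i \left( \psi_{d,d',a}(W_{a,i}; \hat{v}_{k,a}) - \hat{\theta}_{d,d',a} \right),
\]
where $\xi$ is a random variable independent of $W_a$ that satisfies $E[\xi] = 0$, $Var(\xi) = 1$ and $E[\exp(|\xi|)] < \infty$. By the independence of $\xi$ and $W_a$, $E\left[ \xi \psi_{d,d',a}(W_a; \hat{v}_{k,a}) \right] = E[\xi]\, E\left[ \psi_{d,d',a}(W_a; \hat{v}_{k,a}) \right] = 0$, and also $E\left[ \xi \hat{\theta}_{d,d',a} \right] = 0$. Therefore,
\[
\begin{aligned}
\sqrt{n}\left( \hat{\theta}^*_{d,d',a} - \hat{\theta}_{d,d',a} \right)
&= \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i \left( \psi_{d,d',a}(W_{a,i}; \hat{v}_{k,a}) - \hat{\theta}_{d,d',a} \right) \\
&= G_n \xi \left( \psi_{d,d',a}(W_a; \hat{v}_{k,a}) - \hat{\theta}_{d,d',a} \right).
\end{aligned}
\]
Let $\hat{\theta}^*_a$ denote a vector containing the multiplier bootstrap estimators $\hat{\theta}^*_{d,d',a}$ over different $(d, d') \in \{0, 1\}^2$. We may rewrite the previous result in vector form as
\[
\sqrt{n}\left( \hat{\theta}^*_a - \hat{\theta}_a \right) = G_n \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \hat{\theta}_a \right).
\]
Furthermore, let $Z^*_{n,P} := \left( G_n \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \hat{\theta}_a \right) \right)_{a\in\mathcal{A}}$ to postulate the following theorem.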

Theorem 5 If Assumptions A.1.1 through A.1.8 hold, the large sample law $Z_P$ of $Z_{n,P}$ in Theorem A.1 can be consistently approximated by the bootstrap law $Z^*_{n,P}$:
\[
Z^*_{n,P} \rightsquigarrow_B Z_P
\]
uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$.
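The multiplier bootstrap draw $\hat{\theta}^*_{d,d',a} = \hat{\theta}_{d,d',a} + n^{-1}\sum_i \xi_i(\psi_i - \hat{\theta}_{d,d',a})$ can be sketched as follows for a single $(d, d', a)$. This is a minimal illustration, not the paper's implementation: the scores are hypothetical placeholder numbers standing in for $\psi_{d,d',a}(W_{a,i}; \hat{v}_{k,a})$, the function name `multiplier_bootstrap` is an assumption, and standard normal multipliers are one choice satisfying $E[\xi] = 0$, $Var(\xi) = 1$ and $E[\exp(|\xi|)] < \infty$.

```python
import math
import random

def multiplier_bootstrap(scores, n_draws=1000, seed=0):
    """Draws of theta* = theta_hat + mean(xi_i * (psi_i - theta_hat))."""
    rng = random.Random(seed)
    n = len(scores)
    theta_hat = sum(scores) / n
    centered = [s - theta_hat for s in scores]
    draws = []
    for _ in range(n_draws):
        # Fresh standard normal multipliers for every bootstrap draw.
        shift = sum(rng.gauss(0.0, 1.0) * c for c in centered) / n
        draws.append(theta_hat + shift)
    return theta_hat, draws

# Hypothetical scores; in practice these come from the cross-fitted estimator.
data_rng = random.Random(42)
scores = [data_rng.random() for _ in range(200)]
theta_hat, draws = multiplier_bootstrap(scores)

# Conditional on the data, sd(theta*) = sqrt(sum((psi_i - theta_hat)^2)) / n.
exact_sd = math.sqrt(sum((s - theta_hat) ** 2 for s in scores)) / len(scores)
mean_draw = sum(draws) / len(draws)
boot_sd = math.sqrt(sum((d - mean_draw) ** 2 for d in draws) / (len(draws) - 1))
print(theta_hat, exact_sd, boot_sd)
```

Because the data are held fixed and only the multipliers are redrawn, no nuisance function has to be re-estimated in any bootstrap replication.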

Let $\phi_\tau(F_X) := \inf\{a \in \mathbb{R} : F_X(a) \ge \tau\}$ be the $\tau$-th quantile function of a random variable $X$ associated with a c.d.f. $F_X$. The von Mises expansion of $\phi_\tau(F_X)$ (p. 292 in Vaart (1998)) is given by:
\[
\phi_\tau(E_n) - \phi_\tau(E) = \frac{1}{\sqrt{n}}\, \phi'_{\tau,E}(G_n) + \ldots + \frac{1}{m!}\, \frac{1}{n^{m/2}}\, \phi^{(m)}_{\tau,E}(G_n) + \ldots,
\]
where $\phi'_{\tau,E}(\cdot)$ is a linear derivative map and $G_n$ denotes an empirical process $G_n f(W) = \sqrt{n}\left( E_n f(W) - E[f(W)] \right)$. Let $\phi'_\theta := \left( \phi'_{\tau,\theta} \right)_{\tau\in\mathcal{T}}$, where $\theta = (\theta_a)_{a\in\mathcal{A}}$. Let $Q^0_{Y(d,M(d'))}(\tau) := \inf\{a \in \mathbb{R} : \theta^0_{d,d',a} \ge \tau\}$, $\hat{Q}_{Y(d,M(d'))}(\tau) := \inf\{a \in \mathbb{R} : \hat{\theta}_{d,d',a} \ge \tau\}$ and $\hat{Q}^*_{Y(d,M(d'))}(\tau) := \inf\{a \in \mathbb{R} : \hat{\theta}^*_{d,d',a} \ge \tau\}$. Let $Q^0_\tau$, $\hat{Q}_\tau$ and $\hat{Q}^*_\tau$ denote the corresponding vectors containing $Q^0_{Y(d,M(d'))}(\tau)$, $\hat{Q}_{Y(d,M(d'))}(\tau)$ and $\hat{Q}^*_{Y(d,M(d'))}(\tau)$ for different $(d, d') \in \{0, 1\}^2$, respectively. We then obtain the following result on the uniform validity of quantile estimation, which can be proven by invoking the functional delta theorems (Theorems B.3 and B.4) of Belloni et al. (2017).

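Theorem 6 Under Assumptions A.1.1 to A.1.8,
\[
\begin{aligned}
\sqrt{n}\left( \hat{Q}_\tau - Q^0_\tau \right)_{\tau\in\mathcal{T}} &\rightsquigarrow T_P := \phi'_\theta(Z_P), \\
\sqrt{n}\left( \hat{Q}^*_\tau - \hat{Q}_\tau \right)_{\tau\in\mathcal{T}} &\rightsquigarrow_B T_P := \phi'_\theta(Z_P),
\end{aligned}
\]
uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{T})^4$, where $\mathcal{T} \subset (0, 1)$, $T_P$ is a zero mean tight Gaussian process for each $P \in \mathcal{P}_n$ and $Z_P := \left( G_P \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right)_{a\in\mathcal{A}}$.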
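The quantile map $\inf\{a : \hat{\theta}_{d,d',a} \ge \tau\}$ can be illustrated on a finite grid of thresholds. The following is a sketch only: the grid, the c.d.f. values and the function name `quantile_from_cdf` are hypothetical, and the paper's handling of a non-monotone estimated c.d.f. in finite samples is not reproduced here (the function simply returns the first grid point at which the estimate reaches $\tau$).

```python
def quantile_from_cdf(grid, cdf_values, tau):
    """Smallest grid point a whose estimated c.d.f. value is at least tau."""
    for a, f in sorted(zip(grid, cdf_values)):
        if f >= tau:
            return a
    raise ValueError("tau exceeds every estimated c.d.f. value on the grid")

# Hypothetical grid of thresholds a and estimates of theta_{d,d',a}.
grid = [0.0, 1.0, 2.0, 3.0]
theta_hat = [0.10, 0.40, 0.75, 1.00]

median = quantile_from_cdf(grid, theta_hat, 0.5)
print(median)
```

Applying the same inversion to the bootstrap estimates $\hat{\theta}^*_{d,d',a}$ would give the bootstrap quantile $\hat{Q}^*_{Y(d,M(d'))}(\tau)$.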
  

A.5 Proofs of Theorems A.1 to A.3


In the proofs of Theorems A.1 to A.3, we will use the notation $\lesssim$ to denote "less than or equal to a constant times": $b \lesssim c$ denotes $b \le Bc$, where $B$ is a constant depending on Assumptions A.1.1 to A.1.8, but not on $n_0 \le n$ and $P \in \mathcal{P}_n$. We assume $n_0 \le n$ since the results are all asymptotic.
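Proof of Theorem A.1. It is sufficient to establish the result over any sequence of induced probability measures $P_n \in \mathcal{P}_n$, but we will write $P = P_n$ to simplify the notation. Furthermore, we fix any $k = 1, \ldots, K$. From the definition of $\hat{\theta}_a$ and under Assumption A.1.1, we obtain
\[
\begin{aligned}
\sqrt{n}\left( \hat{\theta}_a - F^0(a) \right)
&= \sqrt{n}\left[ \frac{1}{K} \sum_{k=1}^K \hat{\theta}_a^{(k)} - \theta_a^0 \right] \\
&= \sqrt{n}\left\{ \frac{1}{K} \sum_{k=1}^K E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - E\left[ \psi_a(W_a; v_a^0) \right] \right\} \\
&= \sqrt{n}\left\{ \frac{1}{K} \sum_{k=1}^K E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - \frac{1}{n} \sum_{i=1}^n \psi_a(W_{a,i}; v_a^0) \right. \\
&\qquad\quad \left. + \frac{1}{n} \sum_{i=1}^n \left( \psi_a(W_{a,i}; v_a^0) - \theta_a^0 \right) - E\left[ \psi_a(W_a; v_a^0) - \theta_a^0 \right] \right\} \\
&= \underbrace{\sqrt{n}\left\{ \frac{1}{K} \sum_{k=1}^K E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - \frac{1}{n} \sum_{i=1}^n \psi_a(W_{a,i}; v_a^0) \right\}}_{m_{1,n}(a)}
+ G_n\left( \psi_a(W_a; v_a^0) - \theta_a^0 \right).
\end{aligned}
\]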
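Therefore, proving $\sqrt{n}\left( \hat{\theta}_a - \theta_a^0 \right)_{a\in\mathcal{A}} = Z_{n,P} + o_P(1)$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$ is equivalent to showing that $(m_{1,n}(a))_{a\in\mathcal{A}} = o_P(1)$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$. Notice that
\[
\frac{1}{K} \sum_{k=1}^K E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - \frac{1}{n} \sum_{i=1}^n \psi_a(W_{a,i}; v_a^0)
= \frac{1}{K} \sum_{k=1}^K \left\{ E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - E_{N,k}\left[ \psi_a(W_a; v_a^0) \right] \right\}
\]
and
\[
\begin{aligned}
\sup_{a\in\mathcal{A}} \|m_{1,n}(a)\|
&= \sqrt{n} \sup_{a\in\mathcal{A}} \left\| \frac{1}{K} \sum_{k=1}^K \left\{ E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - E_{N,k}\left[ \psi_a(W_a; v_a^0) \right] \right\} \right\| \\
&\le \sqrt{n} \sup_{a\in\mathcal{A}} \frac{1}{K} \sum_{k=1}^K \left\| E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - E_{N,k}\left[ \psi_a(W_a; v_a^0) \right] \right\| \\
&\le \sqrt{n}\, \frac{1}{K} \sum_{k=1}^K \sup_{a\in\mathcal{A}} \left\| E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - E_{N,k}\left[ \psi_a(W_a; v_a^0) \right] \right\| .
\end{aligned}
\]
Therefore, it suffices to show that
\[
\sup_{a\in\mathcal{A}} \underbrace{\left\| E_{N,k}\left[ \psi_a(W_a; \hat{v}_{k,a}) \right] - E_{N,k}\left[ \psi_a(W_a; v_a^0) \right] \right\|}_{m_{2,N,k}(a)} = o_P\!\left(n^{-1/2}\right)
\]
holds, since $K$ is finite and fixed. Next,
\[
\begin{aligned}
m_{2,N,k}(a)
&= \left\| \frac{1}{N} \sum_{i\in I_k} \psi_a(W_{a,i}; \hat{v}_{k,a}) - \frac{1}{N} \sum_{i\in I_k} \psi_a(W_{a,i}; v_a^0) \right\| \\
&= \left\| \frac{1}{N} \sum_{i\in I_k} \psi_a(W_{a,i}; \hat{v}_{k,a}) - \frac{1}{N} \sum_{i\in I_k} E\left[ \psi_a(W_{a,i}; \hat{v}_{k,a}) \mid (W_{a,j})_{j\in I_k^c} \right] \right. \\
&\qquad - \left[ \frac{1}{N} \sum_{i\in I_k} \psi_a(W_{a,i}; v_a^0) - \frac{1}{N} \sum_{i\in I_k} E\left[ \psi_a(W_{a,i}; v_a^0) \right] \right] \\
&\qquad \left. + \frac{1}{N} \sum_{i\in I_k} E\left[ \psi_a(W_{a,i}; \hat{v}_{k,a}) \mid (W_{a,j})_{j\in I_k^c} \right] - \frac{1}{N} \sum_{i\in I_k} E\left[ \psi_a(W_{a,i}; v_a^0) \right] \right\| \\
&\le \frac{1}{\sqrt{N}} \underbrace{\left\| G_{N,k}\left( \psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0) \right) \right\|}_{m_{3,N,k}(a)}
+ \underbrace{\left\| \frac{1}{N} \sum_{i\in I_k} E\left[ \psi_a(W_{a,i}; \hat{v}_{k,a}) \mid (W_{a,j})_{j\in I_k^c} \right] - E\left[ \psi_a(W_a; v_a^0) \right] \right\|}_{m_{4,N,k}(a)},
\end{aligned}
\]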

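where $G_{N,k}$ is an empirical process defined as
\[
G_{N,k} f(W) = \sqrt{N}\left( \frac{1}{N} \sum_{i\in I_k} f(W_i) - \int f(w)\, dP \right),
\]
and $f$ is any $P$ integrable function on $\mathcal{W}$. We note that
\[
E\left[ \psi_a(W_{a,i}; \hat{v}_{k,a}) \mid (W_{a,j})_{j\in I_k^c} \right] = E\left[ \psi_a(W_{a,i}; \hat{v}_{k,a}) \right]
\]
for $i \in I_k$, since conditional on $(W_{a,j})_{j\in I_k^c}$, $\hat{v}_{k,a}$ is a constant and $(W_{a,i})_{i\in I_k}$ and $(W_{a,j})_{j\in I_k^c}$ are independent. Then,
\[
\sup_{a\in\mathcal{A}} m_{2,N,k}(a) \le \frac{\sup_{a\in\mathcal{A}} m_{3,N,k}(a)}{\sqrt{N}} + \sup_{a\in\mathcal{A}} m_{4,N,k}(a) .
\]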

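In order to bound $\sup_{a\in\mathcal{A}} m_{3,N,k}(a)$, we define the following class of functions:
\[
\mathcal{F}'_2 = \left\{ \psi_{d,d',a}(W_a; v) - \theta - \left( \psi_{d,d',a}(W_a; v_a^0) - \theta^0_{d,d',a} \right) : (d, d') \in \{0, 1\}^2,\ a \in \mathcal{A},\ v \in \mathcal{V}_{an},\ \theta \in \Theta_{an} \right\},
\]
where $\Theta_{an} := \left\{ \theta \in \Theta_a : |\theta - \theta^0_{d,d',a}| \le C\tau_n \right\}$. Notice that the envelope function of $\mathcal{F}'_2$, denoted by $F'_2(W)$, satisfies
\[
\begin{aligned}
F'_2(W) &= \sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A},\ v\in\mathcal{V}_{an},\ \theta\in\Theta_{an}} \left| \psi_{d,d',a}(W_a; v) - \theta - \left( \psi_{d,d',a}(W_a; v_a^0) - \theta^0_{d,d',a} \right) \right| \\
&\le \sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A},\ v\in\mathcal{V}_{an},\ \theta\in\Theta_{an}} \left| \psi_{d,d',a}(W_a; v) - \theta \right|
+ \sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A},\ v\in\mathcal{V}_a,\ \theta\in\Theta_{an}} \left| \psi_{d,d',a}(W_a; v) - \theta \right| \\
&\le F_1(W) + F_0(W) \\
&\le 2F_0(W) .
\end{aligned}
\]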

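The uniform covering entropy of $\mathcal{F}'_2$ satisfies
\[
\log \sup_Q N\!\left(\epsilon \|F'_2\|_{Q,2}, \mathcal{F}'_2, \|\cdot\|_{Q,2}\right) \lesssim 2v\left( \log (a/\epsilon) \right) \vee 0 .
\]
Next, consider another class of functions:
\[
\mathcal{F}_2 = \left\{ \psi_{d,d',a}(W_a; v) - \psi_{d,d',a}(W_a; v_a^0) : (d, d') \in \{0, 1\}^2,\ a \in \mathcal{A},\ v \in \mathcal{V}_{an},\ \theta \in \Theta_{an} \right\}.
\]
$\mathcal{F}_2$ is a subset of $\mathcal{F}'_2$ in which we choose $C = 0$ in $\Theta_{an}$, and for this reason, its envelope function $F_2(W)$ is bounded by $F'_2(W)$. Therefore, the uniform covering entropy of $\mathcal{F}_2$ also satisfies
\[
\log \sup_Q N\!\left(\epsilon \|F_2\|_{Q,2}, \mathcal{F}_2, \|\cdot\|_{Q,2}\right) \lesssim 2v\left( \log (a/\epsilon) \right) \vee 0 .
\]
With $P$-probability $1 - K\Delta_n = 1 - o(1)$, we have
\[
\sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A}} \left| G_{N,k}\left( \psi_{d,d',a}(W_a; \hat{v}_{k,a}) - \psi_{d,d',a}(W_a; v_a^0) \right) \right| \le \sup_{f\in\mathcal{F}_2} |G_{N,k} f|,
\]
where $\psi_{d,d',a}(W_a; \hat{v}_{k,a}) - \psi_{d,d',a}(W_a; v_a^0)$ is an element of $\psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0)$ in $m_{3,N,k}(a)$. Furthermore, it can be shown that $\sup_{f\in\mathcal{F}_2} \|f\|_{P,2} \le r_n$, where
\[
r_n := \sup_{a\in\mathcal{A},\ v\in\mathcal{V}_{an}} \left\| \psi_a(W_a; v) - \psi_a(W_a; v_a^0) \right\|_{P,2} .
\]
Then, using the maximum inequality A.1 of Lemma 6.2 of Chernozhukov et al. (2018) with the parameter $\sigma = C'\delta_n n^{-1/4}$, where $C' > 1$ is a constant, and $a = b = N$ in this maximum inequality, we have
\[
\begin{aligned}
\sup_{f\in\mathcal{F}_2} |G_{N,k} f|
&\lesssim \delta_n n^{-1/4} \sqrt{\log\left(N \vee \sigma^{-1}\right)} + N^{\frac{1}{q}-\frac{1}{2}} \log\left(N \vee \sigma^{-1}\right) \\
&\lesssim \delta_n + K^{\frac{1}{2}-\frac{1}{q}}\, \frac{\log n}{n^{\frac{1}{2}-\frac{1}{q}}} = o(1) .
\end{aligned}
\]
It follows that with $P$-probability $1 - o(1)$,
\[
\sup_{(d,d')\in\{0,1\}^2,\ a\in\mathcal{A}} \left| G_{N,k}\left( \psi_{d,d',a}(W_a; \hat{v}_{k,a}) - \psi_{d,d',a}(W_a; v_a^0) \right) \right| \le \sup_{f\in\mathcal{F}_2} |G_{N,k} f| \lesssim o(1),
\]
and $\sup_{a\in\mathcal{A}} m_{3,N,k}(a) \lesssim o(1)$. Then, $N^{-1/2} \sup_{a\in\mathcal{A}} m_{3,N,k}(a) = n^{-1/2} K^{1/2} \sup_{a\in\mathcal{A}} m_{3,N,k}(a) = o_P(n^{-1/2})$ since $K$ is fixed and finite.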


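Concerning the term $m_{4,N,k}(a)$, let
\[
\varphi_k(r) = E\left[ \psi_a\!\left(W_a; r\left(\hat{v}_{k,a} - v_a^0\right) + v_a^0\right) \mid (W_{a,j})_{j\in I_k^c} \right] - E\left[ \psi_a(W_a; v_a^0) \right],
\]
where $r \in (0, 1)$. Notice that $\varphi_k(0) = 0$, since
\[
E\left[ \psi_a(W_a; v_a^0) \mid (W_{a,j})_{j\in I_k^c} \right] = E\left[ \psi_a(W_a; v_a^0) \right],
\]
and $\varphi_k(1) = E\left[ \psi_a(W_a; \hat{v}_{k,a}) \mid (W_{a,j})_{j\in I_k^c} \right] - E\left[ \psi_a(W_a; v_a^0) \right]$. By applying a Taylor expansion to $\varphi_k(r)$ around 0,
\[
\varphi_k(r) = \varphi_k(0) + \varphi'_k(0)\, r + \frac{1}{2}\, \varphi''_k(\bar{r})\, r^2
\]
for some $\bar{r} \in (0, 1)$. Then, $m_{4,N,k}(a) \le \|\varphi_k(1)\| = \|\varphi'_k(0) + \tfrac{1}{2}\varphi''_k(\bar{r})\|$, where
\[
\begin{aligned}
\varphi'_k(0) &= \partial_v E\left[ \psi_a(W_a; v_a^0) \mid (W_{a,j})_{j\in I_k^c} \right] \left[ \hat{v}_{k,a} - v_a^0 \right] = \partial_v E\left[ \psi_a(W_a; v_a^0) \right] \left[ \hat{v}_{k,a} - v_a^0 \right], \\
\varphi''_k(\bar{r}) &= \partial_r^2 E\left[ \psi_a\!\left(W_a; \bar{r}\left(\hat{v}_{k,a} - v_a^0\right) + v_a^0\right) \mid (W_{a,j})_{j\in I_k^c} \right], \quad \bar{r} \in (0, 1) .
\end{aligned}
\]
Therefore, $\sup_{a\in\mathcal{A}} m_{4,N,k}(a) \le \sup_{a\in\mathcal{A}} \|\varphi'_k(0)\| + \tfrac{1}{2} \sup_{a\in\mathcal{A}} \|\varphi''_k(\bar{r})\|$. Furthermore, $\sup_{a\in\mathcal{A}} \|\varphi'_k(0)\|$ and $\sup_{a\in\mathcal{A}} \|\varphi''_k(\bar{r})\|$ are bounded by the following terms, respectively:
\[
\begin{aligned}
&\sup_{a\in\mathcal{A},\ v\in\mathcal{V}_{an}\cup\{v_a^0\}} \left\| \partial_v E\left[ \psi_a(W_a; v_a^0) \right] \left[ v - v_a^0 \right] \right\| = o\!\left(n^{-1/2}\right), \\
&\sup_{a\in\mathcal{A},\ v\in\mathcal{V}_{an}\cup\{v_a^0\},\ \tilde{r}\in(0,1)} \left\| \partial_r^2 E\left[ \psi_a\!\left(W_a; \bar{r}\left(v - v_a^0\right) + v_a^0\right) \right] \right\| = o\!\left(n^{-1/2}\right).
\end{aligned}
\]
It follows that $\sup_{a\in\mathcal{A}} m_{4,N,k}(a) = o_P(n^{-1/2})$. Combining the previous results, we obtain that $\sup_{a\in\mathcal{A}} m_{2,N,k}(a) = o_P(n^{-1/2})$, and $\sup_{a\in\mathcal{A}} \|m_{1,n}(a)\| \le n^{1/2}\, o_P(n^{-1/2}) = o_P(1)$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$.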
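The role of the first- and second-order terms in the Taylor expansion of $\varphi_k$ can be illustrated with a small closed-form calculation. The sketch below is an illustration under assumptions not taken from the paper: it uses the $d = d'$ case with a single covariate cell, assumes the score $\psi = 1\{D=d\}\,(1\{Y \le a\} - g_4)/g_1 + g_4$, and perturbs the nuisances along hypothetical directions `h1` and `h4`. In that setting the bias of the expected score equals $r^2 h_1 h_4 / (g_1 + r h_1)$, so it has no first-order term in $r$.

```python
def score_bias(r, g1=0.5, g4=0.3, h1=0.2, h4=0.4):
    """E[psi] - theta when nuisances are perturbed to g1 + r*h1 and g4 + r*h4.

    Assumes one covariate cell, so E[1{D=d}] = g1 and
    E[1{Y<=a} | D=d] = g4, and the target is theta = g4.
    """
    g1_r, g4_r = g1 + r * h1, g4 + r * h4
    # E[1{D=d} * (1{Y<=a} - g4_r)] / g1_r + g4_r
    expected_score = g1 / g1_r * (g4 - g4_r) + g4_r
    return expected_score - g4

bias_full = score_bias(0.10)
bias_half = score_bias(0.05)
ratio = bias_full / bias_half          # close to 4 for a second-order bias
slope_near_zero = score_bias(1e-4) / 1e-4
print(bias_full, bias_half, ratio, slope_near_zero)
```

Halving the perturbation size roughly quarters the bias, which is the behavior that lets nuisance errors of order $o(n^{-1/4})$ translate into a bias of order $o(n^{-1/2})$.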
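Finally, to show that $Z_{n,P} \rightsquigarrow Z_P$ in $l^\infty(\mathcal{A})^4$ uniformly in $P \in \mathcal{P}_n$, we may exploit the properties of functions in $\mathcal{F}_0$. Recall that $\mathcal{F}_0$ is suitably measurable and has an envelope function $F_0(W)$ that is measurable with respect to $W$ and satisfies $\|F_0\|_{P,q} = \left( E|F_0|^q \right)^{1/q} \le C$, where $q \ge 4$ is a fixed number. By Assumption A.1.5, functions in $\mathcal{F}_0$ satisfy
\[
\sup_{P\in\mathcal{P}_n} E\left[ \left\{ \left( \psi_{d,d',a}(W_a; v_a^0) - \theta^0_{d,d',a} \right) - \left( \psi_{d,d',\bar{a}}(W_{\bar{a}}; v_{\bar{a}}^0) - \theta^0_{d,d',\bar{a}} \right) \right\}^2 \right] \to 0,
\]
as $d_{\mathcal{A}}(a, \bar{a}) \to 0$. By Assumption A.1.6, the uniform covering entropy of $\mathcal{F}_0$ satisfies
\[
\sup_Q \log N\!\left(\epsilon \|F_0\|_{Q,2}, \mathcal{F}_0, \|\cdot\|_{Q,2}\right) \le C \log (e/\epsilon) \vee 0 .
\]
In fact, the uniform covering integral satisfies
\[
\int_0^1 \sqrt{\sup_Q \log N\!\left(\epsilon \|F_0\|_{Q,2}, \mathcal{F}_0, \|\cdot\|_{Q,2}\right)}\, d\epsilon
\le \sqrt{C} \int_0^1 \sqrt{1 - \log\epsilon}\, d\epsilon
\le \sqrt{C} \int_0^1 \frac{1}{\sqrt{\epsilon}}\, d\epsilon < \infty,
\]
which follows from the result that $1 - \log\epsilon \le 1/\epsilon$ for all $\epsilon > 0$ and $\int_0^1 \epsilon^{-b}\, d\epsilon < \infty$ for $b < 1$. Therefore, we may invoke Theorem B.1 of Belloni et al. (2017) to obtain the result. The class $\mathcal{F}_0$ is Donsker uniformly in $P \in \mathcal{P}_n$ because $\|F_0\|_{P,q}$ is bounded, the entropy condition holds, and Assumption A.1.5 implies that $\sup_{P\in\mathcal{P}_n} \left\| \left( \psi_a(W_a, v_a^0) - \theta_a^0 \right) - \left( \psi_{\bar{a}}(W_{\bar{a}}, v_{\bar{a}}^0) - \theta_{\bar{a}}^0 \right) \right\|_{P,2} \to 0$ as $d_{\mathcal{A}}(a, \bar{a}) \to 0$.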

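Proof of Theorem A.2. It is sufficient to establish the result over any sequence of probability measures $P_n \in \mathcal{P}_n$. Again, we will write $P = P_n$ to simplify the notation. Let $U^*_{n,P} := \left( G_n \xi \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right)_{a\in\mathcal{A}}$. To show that $Z^*_{n,P} \rightsquigarrow_B Z_P$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$, we first show that $\|Z^*_{n,P} - U^*_{n,P}\| = o_P(1)$ and then show that $U^*_{n,P} \rightsquigarrow_B Z_P$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$. To prove $\|Z^*_{n,P} - U^*_{n,P}\| = o_P(1)$ uniformly over $P \in \mathcal{P}_n$ in $l^\infty(\mathcal{A})^4$, we use a similar argument as for proving equation (E.7) in the appendix of Belloni et al. (2017). Let $Z^*_{n,P}(a) := G_n \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \hat{\theta}_a \right)$ and $U^*_{n,P}(a) := G_n \xi \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right)$. We first notice that since $E[\xi] = 0$ and $\xi$ and $W_a$ are independent, $E\left[ \xi \left( \psi_a(W_a; v_a^0) - \theta_a^0 \right) \right] = 0$ and
\[
\begin{aligned}
Z^*_{n,P}(a) &= \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i \left( \psi_a(W_{a,i}; \hat{v}_a^0) - \hat{\theta}_a^0 \right), \\
U^*_{n,P}(a) &= \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i \left( \psi_a(W_{a,i}; v_a^0) - \theta_a^0 \right).
\end{aligned}
\]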
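Therefore, we have
\[
\sup_{a\in\mathcal{A}} \left\| Z^*_{n,P}(a) - U^*_{n,P}(a) \right\| \le \Pi_1 + \Pi_2,
\]
where
\[
\begin{aligned}
\Pi_1 &= \sup_{a\in\mathcal{A}} \left\| G_n \xi \left( \psi_a(W_{a,i}; \hat{v}_a^0) - \psi_a(W_{a,i}; v_a^0) \right) \right\| \\
&= \sup_{a\in\mathcal{A}} \left\| \sqrt{n}\, \frac{1}{K} \sum_{k=1}^K \frac{1}{\sqrt{N}}\, G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}^0) - \psi_a(W_a; v_a^0) \right) \right\| \\
&\le \sqrt{n}\, \frac{1}{K} \sum_{k=1}^K \frac{1}{\sqrt{N}} \sup_{a\in\mathcal{A}} \left\| G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}^0) - \psi_a(W_a; v_a^0) \right) \right\| ,
\end{aligned}
\]
and
\[
\Pi_2 = \sup_{a\in\mathcal{A}} \left\| \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_i \left( \hat{\theta}_a^0 - \theta_a^0 \right) \right\| \le \sup_{a\in\mathcal{A}} \left\| \hat{\theta}_a^0 - \theta_a^0 \right\| |G_n \xi| .
\]
The term $\Pi_2$ is $O_p(n^{-1/2})$, since $\sup_{a\in\mathcal{A}} \|\hat{\theta}_a^0 - \theta_a^0\| = O_p(n^{-1/2})$ by Theorem A.1 and $|G_n \xi| = O_p(1)$.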

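Concerning the term $\Pi_1$, recall the class of functions used in the proof of Theorem A.1,
\[
\mathcal{F}_2 = \left\{ \psi_{d,d',a}(W_a; v) - \psi_{d,d',a}(W_a; v_a^0) : (d, d') \in \{0, 1\}^2,\ a \in \mathcal{A},\ v \in \mathcal{V}_{an} \right\},
\]
as well as its envelope function $F_2 \le 2F_0$ and the covering entropy:
\[
\log \sup_Q N\!\left(\epsilon \|F_2\|_{Q,2}, \mathcal{F}_2, \|\cdot\|_{Q,2}\right) \lesssim 2v\left( \log (a/\epsilon) \right) \vee 0 .
\]
Using Lemma K.1 in the Appendix of Belloni et al. (2017), multiplication of this class by $\xi$ does not change the entropy bound modulo an absolute constant, and therefore its covering entropy
\[
\log \sup_Q N\!\left(\epsilon \|\xi F_2\|_{Q,2}, \xi\mathcal{F}_2, \|\cdot\|_{Q,2}\right)
\]
is bounded by the same order as $\log \sup_Q N(\epsilon \|F_2\|_{Q,2}, \mathcal{F}_2, \|\cdot\|_{Q,2})$. Next, notice that $\left( E\left[ \max_{i\in I_k} \xi_i^2 \right] \right)^{1/2} \lesssim \log N$ by $E[\exp(|\xi|)] < \infty$. By the independence of $\xi_i$ and $W_i$, we have
\[
\left\| \max_{i\in I_k} \xi_i F_0(W_i) \right\|_{P,2} \le \left\| \max_{i\in I_k} \xi_i \right\|_{P,2} \left\| \max_{i\in I_k} F_0(W_i) \right\|_{P,2} \lesssim N^{\frac{1}{q}} \log N,
\]
which holds for $k = 1, \ldots, K$. Using the maximum inequality A.1 of Lemma 6.2 of Chernozhukov et al. (2018), with $P$-probability $1 - o(1)$, we obtain
\[
\begin{aligned}
\sup_{f\in\xi\mathcal{F}_2} |G_{N,k} f|
&\lesssim \delta_n n^{-1/4} \sqrt{\log\left(N \vee \sigma^{-1}\right)} + \frac{N^{\frac{1}{q}} \log N}{\sqrt{N}} \log\left(N \vee \sigma^{-1}\right) \\
&\lesssim \delta_n n^{-1/4} \sqrt{\log\left(N \vee \sigma^{-1}\right)} + K^{\frac{1}{2}-\frac{1}{q}}\, \frac{\log n - \log K}{n^{\frac{1}{2}-\frac{1}{q}}} \log\left(N \vee \sigma^{-1}\right) \\
&\lesssim \delta_n + n^{\frac{1}{q}-\frac{1}{2}} \log n\, \log\left(n \vee \sigma^{-1}\right) = o_p(1),
\end{aligned}
\]
by using the fact that $\sup_{f\in\xi\mathcal{F}_2} \|f\|_{P,2} = \sup_{f\in\mathcal{F}_2} \|f\|_{P,2} \le r_n$ and setting the parameters $\sigma = C'\delta_n n^{-1/4}$ and $a = b = N$ in this maximum inequality. With $P$-probability $1 - o(1)$ and for $\hat{v}_{k,a} \in \mathcal{V}_{an}$, it can be shown that
\[
\sup_{a\in\mathcal{A}} \left\| G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0) \right) \right\| \lesssim \sup_{f\in\xi\mathcal{F}_2} |G_{N,k} f| .
\]
Therefore, it follows with $P$-probability $1 - o(1)$ that
\[
\frac{1}{\sqrt{N}} \sup_{a\in\mathcal{A}} \left\| G_{N,k}\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0) \right) \right\| \lesssim K^{1/2} n^{-1/2} o_p(1) \lesssim o_p\!\left(n^{-1/2}\right),
\]
and since $K$ is fixed and finite,
\[
\Pi_1 = \sup_{a\in\mathcal{A}} \left\| G_n\, \xi \left( \psi_a(W_a; \hat{v}_{k,a}) - \psi_a(W_a; v_a^0) \right) \right\| \lesssim \sqrt{n}\, o_p\!\left(n^{-1/2}\right) = o_p(1) .
\]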
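Combining the previous results, we obtain that $\|Z^*_{n,P} - U^*_{n,P}\| = o_p(1)$. Next, notice that $U^*_{n,P}$ is associated with the class of functions $\xi f$, where $f \in \mathcal{F}_0$ is defined in Assumption A.1.6. As shown in the proof of Theorem A.1, the class $\mathcal{F}_0$ is Donsker, uniformly in $P \in \mathcal{P}_n$ under the imposed assumptions. Therefore, we may invoke Theorem B.2 of Belloni et al. (2017) and obtain $U^*_{n,P} \rightsquigarrow_B Z_P$. Indeed, both $U^*_{n,P}$ and $Z_P$ are Gaussian processes that share the same (zero) mean and the same covariance matrix. Finally, using a similar argument as in step 2 for proving Theorem 5.2 in the appendix of Belloni et al. (2017), we can obtain $Z^*_{n,P} \rightsquigarrow_B Z_P$. Let $BL_1(l^\infty(\mathcal{A}))$ be the space of functions mapping the space of functions in $l^\infty(\mathcal{A})$ to $[0, 1]$, with the Lipschitz norm being at most 1. Let $E_{B_n}$ denote the expectation over the multiplier weights $(\xi_i)_{i=1}^n$ when holding the data $(W_i)_{i=1}^n$ fixed. Following step 2 for proving Theorem 5.2 in the appendix of Belloni et al. (2017), we obtain the following inequality:
\[
\begin{aligned}
\sup_{h\in BL_1(l^\infty(\mathcal{A}))} \left| E_{B_n}\!\left[ h\!\left(Z^*_{n,P}\right) \right] - E_P\left[ h(Z_P) \right] \right|
&\le \sup_{h\in BL_1(l^\infty(\mathcal{A}))} \left| E_{B_n}\!\left[ h\!\left(U^*_{n,P}\right) \right] - E_P\left[ h(Z_P) \right] \right| \\
&\quad + E_{B_n}\!\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right].
\end{aligned}
\]
The first term vanishes by Theorem B.2 of Belloni et al. (2017), since we have proven that $U^*_{n,P} \rightsquigarrow_B Z_P$. The second term is $o_P(1)$, because $E\left[ \|Z^*_{n,P} - U^*_{n,P}\| \wedge 2 \right] = E\left[ E_{B_n}\left[ \|Z^*_{n,P} - U^*_{n,P}\| \wedge 2 \right] \right] \to 0$ by the Markov inequality
\[
\begin{aligned}
P\left( E_{B_n}\!\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right] \ge \varepsilon \right)
&\le \frac{E\left[ E_{B_n}\!\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right] \right]}{\varepsilon} \\
&= \frac{E\left[ \left\| Z^*_{n,P} - U^*_{n,P} \right\| \wedge 2 \right]}{\varepsilon} .
\end{aligned}
\]
As shown above, $\|Z^*_{n,P} - U^*_{n,P}\| = o_p(1)$, which implies that $\|Z^*_{n,P} - U^*_{n,P}\| \wedge 2 = o_P(1)$. Therefore, $\sup_{h\in BL_1(l^\infty(\mathcal{A}))} \left| E_{B_n}\left[ h(Z^*_{n,P}) \right] - E_P\left[ h(Z_P) \right] \right|$ vanishes and it follows that $Z^*_{n,P} \rightsquigarrow_B Z_P$.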

The proof of Theorem A.3 relies on the uniform Hadamard differentiability (Belloni et al., 2017) of
the quantile function. The definition of uniform Hadamard differentiability is as follows.

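Definition 1 Uniform Hadamard Tangential Differentiability, Belloni et al. (2017): Let $\mathbb{E}$
and $\mathbb{D}$ be normed spaces. Consider a map $\phi : \mathbb{D}_\phi \mapsto \mathbb{E}$, where $\mathbb{D}_\phi \subset \mathbb{D}$ and the range of $\phi$ is a subset
of $\mathbb{E}$. Let $\mathbb{D}_0 \subset \mathbb{D}$ be a normed space, and $\mathbb{D}_\rho \subset \mathbb{D}_\phi$ be a compact metric space. Let $h \mapsto \phi'_\rho(h)$ be the
linear derivative map associated with $\phi$, where $h \in \mathbb{D}_0$ and $\rho \in \mathbb{D}_\rho$. The linearity of $\phi'_\rho(h)$ holds for each
$\rho$. Then the map $\phi : \mathbb{D}_\phi \mapsto \mathbb{E}$ is called Hadamard differentiable uniformly in $\rho \in \mathbb{D}_\rho$ tangentially to $\mathbb{D}_0$
with derivative map $h \mapsto \phi'_\rho(h)$, if

$$\left\| \frac{\phi(\rho_n + t_n h_n) - \phi(\rho_n)}{t_n} - \phi'_\rho(h) \right\| \to 0, \qquad \left\| \phi'_{\rho_n}(h_n) - \phi'_\rho(h) \right\| \to 0 \quad \text{as } n \to \infty,$$

for all convergent sequences $\rho_n \to \rho$, $t_n \to 0$ in $\mathbb{R}$ and $h_n \to h$, such that $\rho_n + t_n h_n \in \mathbb{D}_\phi$ for every $n$.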
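As a numerical illustration of the first condition in the definition, the following Python sketch evaluates the difference quotient for the quantile map at a fixed level τ along sequences ρ_n → ρ and h_n → h. Everything here is an assumption made for illustration: ρ is taken to be the standard normal c.d.f., ρ_n the N(t_n, 1) c.d.f., h a hypothetical smooth direction, and the candidate derivative is the standard expression −h(Q(τ))/f(Q(τ)) for the inverse map, with f the density of ρ. It checks convergence along one particular sequence only and is not a proof of uniform differentiability.

```python
import math

def norm_cdf(x, mu=0.0):
    """C.d.f. of N(mu, 1)."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0)))

def norm_pdf(x):
    """Density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quantile(cdf, tau, lo=-10.0, hi=10.0, tol=1e-13):
    """phi(rho): invert a continuous c.d.f. at level tau by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def h(x):
    """Hypothetical smooth, bounded direction of perturbation."""
    return 0.3 * math.exp(-x * x)

def difference_quotient(t, tau):
    """(phi(rho_n + t_n h_n) - phi(rho_n)) / t_n with rho_n -> rho and h_n -> h."""
    rho_n = lambda x: norm_cdf(x, mu=t)       # N(t, 1) c.d.f., converging to N(0, 1)
    h_n = lambda x: (1.0 + t) * h(x)          # converges uniformly to h
    perturbed = lambda x: rho_n(x) + t * h_n(x)
    return (quantile(perturbed, tau) - quantile(rho_n, tau)) / t

tau = 0.75
q0 = quantile(norm_cdf, tau)                  # Q(tau) under rho
derivative = -h(q0) / norm_pdf(q0)            # candidate phi'_rho(h)
quotients = [difference_quotient(t, tau) for t in (1e-1, 1e-2, 1e-3, 1e-4)]
```

Comparing the entries of `quotients` with `derivative` shows the quotient approaching the candidate derivative as t_n shrinks, which is the behaviour the definition requires along every admissible sequence.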

As pointed out by Belloni et al. (2017), the quantile function is uniformly Hadamard-differentiable
if we set $\mathbb{D} = \ell^\infty(T)$, where $T = [\epsilon, 1-\epsilon]$ with $\epsilon > 0$, $\mathbb{D}_\phi$ is a set of càdlàg functions on $T$, $\mathbb{D}_0 = UC(T)$,
and $\mathbb{D}_\rho$ is a compact subset of $C^1(T)$ such that each $\rho$ satisfies $\partial\rho(u)/\partial u > 0$.⁴ Notice that this setting
rules out the case that $Y(d, M(d'))$ is a discrete random variable. Also, if $\mathbb{D}_\rho = \mathbb{D}_\phi$, the quantile function
is not Hadamard-differentiable uniformly in $\mathbb{D}_\rho$ in the sense of our definition. This is different from
the definition of uniform differentiability given in Vaart (1998), which requires $\mathbb{D}_\rho = \mathbb{D}_\phi$. Since our
estimation problem is infinite-dimensional, it is essential to restrict $\mathbb{D}_\rho$ to be much smaller than $\mathbb{D}_\phi$ and to endow
$\mathbb{D}_\rho$ with a much stronger metric than the metric induced by the norm of $\mathbb{D}$. However, here the
estimated $\hat\rho$ can satisfy $\hat\rho \in \mathbb{D}_\phi$ but $\hat\rho \notin \mathbb{D}_\rho$ (for example when $\hat\rho$ is an empirical c.d.f.), even though
$\rho \in \mathbb{D}_\rho$ and $\partial\rho(u)/\partial u > 0$ should hold for the population value. With the definition of uniform Hadamard
differentiability, we next restate Theorems B.3 and B.4 of Belloni et al. (2017) as follows.

Theorem A.4 Functional delta method uniformly in $P \in \mathcal{P}$, Belloni et al. (2017): Let $\phi : \mathbb{D}_\phi \subset
\mathbb{D} \mapsto \mathbb{E}$ be Hadamard differentiable uniformly in $\rho \in \mathbb{D}_\rho \subset \mathbb{D}_\phi$ tangentially to $\mathbb{D}_0$ with derivative map
$h \mapsto \phi'_\rho(h)$. Let $\hat\rho_{n,P}$ be a sequence of stochastic processes taking values in $\mathbb{D}_\phi$, where each $\hat\rho_{n,P}$ is an
estimator of the parameter $\rho = \rho_P \in \mathbb{D}_\rho$. Suppose there exists a sequence of constants $r_n \to \infty$ such that
$Z_{n,P} := r_n(\hat\rho_{n,P} - \rho_P) \rightsquigarrow Z_P$ in $\mathbb{D}$ uniformly in $P \in \mathcal{P}_n$. The limit process $Z_P$ is separable and takes
its values in $\mathbb{D}_0$ for all $P \in \mathcal{P} = \bigcup_{n \ge n_0} \mathcal{P}_n$, where $n_0$ is fixed. Moreover, the set of stochastic processes
$\{Z_P : P \in \mathcal{P}\}$ is relatively compact in the topology of weak convergence in $\mathbb{D}_0$, that is, every sequence
in this set can be split into weakly convergent subsequences. Then, $r_n(\phi(\hat\rho_{n,P}) - \phi(\rho)) \rightsquigarrow \phi'_{\rho_P}(Z_P)$ in
$\mathbb{E}$ uniformly in $P \in \mathcal{P}_n$. If $(\rho, h) \mapsto \phi'_\rho(h)$ is defined and continuous on the whole of $\mathbb{D}_\rho \times \mathbb{D}$, then
the sequence $r_n(\phi(\hat\rho_{n,P}) - \phi(\rho)) - \phi'_{\rho_P}(r_n(\hat\rho_{n,P} - \rho))$ converges to zero in outer probability uniformly
in $P \in \mathcal{P}_n$. Moreover, the set of stochastic processes $\{\phi'_{\rho_P}(Z_P) : P \in \mathcal{P}\}$ is relatively compact in the
topology of weak convergence in $\mathbb{E}$.
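As a worked instance of the functional delta method stated above, consider the quantile map at a fixed level $\tau$, leaving the uniformity in $P$ aside. The notation below is assumed for illustration and does not appear in the original text: $F$ is a c.d.f. with density $f$ that is positive at $Q(\tau) = F^{-1}(\tau)$, $\hat F_n$ is an estimator of $F$, and $r_n = \sqrt{n}$. The derivative is the standard expression for the inverse map.

```latex
% Assumed notation: F is a c.d.f. with density f, Q(\tau) = F^{-1}(\tau), r_n = \sqrt{n}.
\begin{align*}
\phi(F) &= F^{-1}(\tau), \qquad
\phi'_F(h) = -\frac{h\left(Q(\tau)\right)}{f\left(Q(\tau)\right)}, \\
\sqrt{n}\left(\hat F_n - F\right) \rightsquigarrow Z_P
\;&\Longrightarrow\;
\sqrt{n}\left(\hat F_n^{-1}(\tau) - Q(\tau)\right) \rightsquigarrow
-\frac{Z_P\left(Q(\tau)\right)}{f\left(Q(\tau)\right)} .
\end{align*}
```

The limit is a linear functional of $Z_P$, so a Gaussian limit for the estimated c.d.f. carries over to a Gaussian limit for the estimated quantile, scaled by the inverse of the density at the quantile.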

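Theorem A.5 Functional delta method uniformly in $P \in \mathcal{P}$ for the bootstrap and other simulation
methods, Belloni et al. (2017): Assume that the conditions in Theorem A.4 hold. Let $\hat\rho_{n,P}$ and
$\hat\rho^*_{n,P}$ be maps as previously indicated, taking values in $\mathbb{D}_\phi$ such that $Z_{n,P} := r_n(\hat\rho_{n,P} - \rho_P) \rightsquigarrow Z_P$ and
$Z^*_{n,P} := r_n(\hat\rho^*_{n,P} - \rho_P) \rightsquigarrow_B Z_P$ in $\mathbb{D}$ uniformly in $P \in \mathcal{P}_n$. Then, $r_n\left(\phi(\hat\rho^*_{n,P}) - \phi(\hat\rho_{n,P})\right) \rightsquigarrow_B \phi'_{\rho_P}(Z_P)$
uniformly in $P \in \mathcal{P}_n$.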
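The bootstrap statement above can be sketched in Python for the simplest case, a single quantile of an unconditional c.d.f. The sketch is illustrative only and rests on assumptions that are not part of the original text: the data are a toy standard normal sample, the multipliers are standard normal, the perturbed c.d.f. is inverted at its first crossing of τ without any monotone rearrangement, and the function name `bootstrap_quantile` is hypothetical. The draws are centred at the sample quantile, mirroring the centring at ϕ(ρ̂) in the theorem.

```python
import math
import random

def bootstrap_quantile(y_sorted, tau, rng):
    """Quantile of F*_n = F_n + n^{-1} sum_i xi_i (1{y_i <= .} - F_n), evaluated at the data points."""
    n = len(y_sorted)
    # Assumed N(0, 1) multipliers; they are i.i.d., so attaching them in sorted order is harmless.
    xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
    total = sum(xi)
    partial = 0.0
    for j, (value, weight) in enumerate(zip(y_sorted, xi), start=1):
        partial += weight
        f_star = j / n + (partial - j / n * total) / n
        if f_star >= tau:      # first crossing; no monotone rearrangement applied
            return value
    return y_sorted[-1]

rng = random.Random(1)
n, tau, n_boot = 1000, 0.5, 500
y_sorted = sorted(rng.gauss(0.0, 1.0) for _ in range(n))   # toy data
q_hat = y_sorted[math.ceil(tau * n) - 1]                   # sample quantile phi(rho_hat)

# r_n (phi(rho*_hat) - phi(rho_hat)) with r_n = sqrt(n), data held fixed across draws.
draws = [math.sqrt(n) * (bootstrap_quantile(y_sorted, tau, rng) - q_hat)
         for _ in range(n_boot)]
mean = sum(draws) / n_boot
boot_sd = math.sqrt(sum((d - mean) ** 2 for d in draws) / (n_boot - 1))

# Delta-method benchmark sqrt(tau (1 - tau)) / f(Q(tau)) for N(0, 1) at the median.
delta_sd = math.sqrt(tau * (1.0 - tau)) * math.sqrt(2.0 * math.pi)
```

In this toy example `boot_sd` can be compared with `delta_sd`; the two need not agree closely in finite samples, since the bootstrap spread of a quantile depends on the local spacing of the data around it.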

Proof of Theorem A.3. The function $\phi_\theta$ satisfies uniform Hadamard tangential differentiability and
both $Z_{n,P} \rightsquigarrow Z_P$ and $Z^*_{n,P} \rightsquigarrow_B Z_P$ in $\ell^\infty(\mathcal{A})$ uniformly in $P \in \mathcal{P}_n$, as shown in Theorems A.1 and A.2.
Therefore, the proof follows by applying Theorems A.4 and A.5.

⁴ $UC(T)$ denotes the set of uniformly continuous functions from $T$ to $\mathbb{R}$, and $C^1(T)$ denotes the set of continuously
differentiable functions from $T$ to $\mathbb{R}$.

A.6 Tables and Figures

Table 4: Descriptive statistics for the empirical application (Job Corps data)

| | All | D=1 | D=0 | Diff | p-value | M=0 | M=1 | M=2 | M=3 | M=4 |
|---|---|---|---|---|---|---|---|---|---|---|
| sample size | 9,240 | 6,574 | 2,666 | - | - | 311 | 3,495 | 4,004 | 1,298 | 141 |
| Outcome Y | | | | | | | | | | |
| weekly earnings in third year | 172.93 | 173.39 | 171.82 | 1.57 | 0.68 | 159.70 | 188.92 | 165.37 | 159.22 | 145.98 |
| Treatment D | | | | | | | | | | |
| training in first year | 0.71 | 1.00 | 0.00 | - | - | 0.42 | 0.74 | 0.72 | 0.70 | 0.70 |
| Mediator M | | | | | | | | | | |
| general health after first year | 1.72 | 1.74 | 1.70 | 0.04 | 0.03 | 0.00 | 1.00 | 2.00 | 3.00 | 4.00 |
| Covariates X | | | | | | | | | | |
| female | 0.44 | 0.45 | 0.42 | 0.03 | 0.01 | 0.41 | 0.39 | 0.46 | 0.50 | 0.51 |
| age | 18.44 | 18.21 | 18.99 | -0.77 | 0.00 | 18.55 | 18.43 | 18.42 | 18.42 | 18.86 |
| white | 0.26 | 0.25 | 0.30 | -0.05 | 0.00 | 0.33 | 0.24 | 0.28 | 0.28 | 0.24 |
| black | 0.49 | 0.50 | 0.47 | 0.03 | 0.01 | 0.41 | 0.52 | 0.48 | 0.49 | 0.44 |
| Hispanic | 0.17 | 0.17 | 0.16 | 0.01 | 0.28 | 0.21 | 0.17 | 0.17 | 0.16 | 0.21 |
| education | 9.96 | 9.91 | 10.07 | -0.15 | 0.00 | 9.93 | 10.00 | 9.94 | 9.91 | 9.76 |
| education missing | 0.02 | 0.01 | 0.03 | -0.01 | 0.00 | 0.00 | 0.02 | 0.02 | 0.01 | 0.03 |
| GED degree | 0.05 | 0.04 | 0.07 | -0.03 | 0.00 | 0.06 | 0.05 | 0.05 | 0.05 | 0.05 |
| high school degree | 0.20 | 0.17 | 0.25 | -0.08 | 0.00 | 0.13 | 0.21 | 0.20 | 0.17 | 0.16 |
| English mother tongue | 0.85 | 0.84 | 0.86 | -0.02 | 0.02 | 0.83 | 0.85 | 0.84 | 0.88 | 0.80 |
| cohabiting or married | 0.06 | 0.05 | 0.08 | -0.03 | 0.00 | 0.07 | 0.05 | 0.06 | 0.08 | 0.09 |
| has one or more children | 0.20 | 0.18 | 0.24 | -0.06 | 0.00 | 0.23 | 0.20 | 0.19 | 0.21 | 0.21 |
| ever worked before JC | 0.14 | 0.15 | 0.13 | 0.02 | 0.02 | 0.15 | 0.15 | 0.14 | 0.15 | 0.15 |
| average weekly gross earnings | 19.42 | 18.24 | 22.32 | -4.08 | 0.05 | 37.52 | 19.91 | 17.65 | 16.97 | 39.69 |
| household size | 3.43 | 3.47 | 3.34 | 0.13 | 0.01 | 3.30 | 3.40 | 3.47 | 3.45 | 3.37 |
| household size missing | 0.02 | 0.01 | 0.03 | -0.02 | 0.00 | 0.01 | 0.02 | 0.02 | 0.01 | 0.03 |
| mum’s education | 9.31 | 9.45 | 8.98 | 0.47 | 0.00 | 8.61 | 9.35 | 9.41 | 9.24 | 8.00 |
| mum’s education missing | 0.19 | 0.18 | 0.21 | -0.03 | 0.00 | 0.24 | 0.19 | 0.18 | 0.20 | 0.28 |
| dad’s education | 7.04 | 7.16 | 6.73 | 0.44 | 0.00 | 6.45 | 7.14 | 7.10 | 6.74 | 6.66 |
| dad’s education missing | 0.39 | 0.38 | 0.41 | -0.02 | 0.03 | 0.44 | 0.39 | 0.38 | 0.40 | 0.44 |
| received welfare as child | 1.92 | 1.93 | 1.89 | 0.04 | 0.23 | 1.86 | 1.89 | 1.92 | 1.98 | 1.97 |
| welfare info missing | 0.07 | 0.07 | 0.08 | -0.01 | 0.06 | 0.10 | 0.07 | 0.07 | 0.07 | 0.07 |
| general health at baseline | 1.65 | 1.65 | 1.65 | -0.01 | 0.65 | 1.69 | 1.40 | 1.73 | 2.00 | 2.12 |
| health at baseline missing | 0.02 | 0.01 | 0.03 | -0.01 | 0.00 | 0.00 | 0.02 | 0.02 | 0.01 | 0.03 |
| smoker | 0.81 | 0.81 | 0.80 | 0.01 | 0.63 | 0.86 | 0.75 | 0.82 | 0.88 | 0.86 |
| smoker info missing | 0.48 | 0.49 | 0.46 | 0.03 | 0.03 | 0.42 | 0.52 | 0.47 | 0.41 | 0.43 |
| alcohol consumption | 1.79 | 1.78 | 1.84 | -0.06 | 0.15 | 1.96 | 1.67 | 1.85 | 1.90 | 1.88 |
| alcohol info missing | 0.43 | 0.43 | 0.41 | 0.03 | 0.03 | 0.38 | 0.46 | 0.41 | 0.39 | 0.38 |

Figure 4: Approximate true c.d.f. profiles from 40 million Monte Carlo simulations under the data generating process described in Section 4.
Top: $F_{Y(1,M(1))}$, $F_{Y(1,M(0))}$, $F_{Y(0,M(1))}$ and $F_{Y(0,M(0))}$; Bottom: $Q_{Y(1,M(1))}$, $Q_{Y(1,M(0))}$, $Q_{Y(0,M(1))}$ and $Q_{Y(0,M(0))}$.

Figure 5: Approximate true effect profiles from 40 million Monte Carlo simulations under the
data generating process described in Section 4: NDQTE (black solid) and NDQTE’ (gray
dashed), NIQTE (black solid) and NIQTE’ (gray dashed) and TQTE.

References
Abadie, A., J. Angrist, and G. Imbens (2002): “Instrumental Variables Estimates of the Effect of
Subsidized Training on the Quantiles of Trainee Earnings,” Econometrica, 70, 91–117.

Ai, C., O. Linton, and Z. Zhang (2022): “Estimation and inference for the counterfactual distribution
and quantile functions in continuous treatment models,” Journal of Econometrics, 228, 39–61, Annals
Issue: In Honor of Ron Gallant.

Athey, S. and G. Imbens (2006): “Identification and inference in nonlinear difference-in-differences
models,” Econometrica, 74, 431–497.

Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen (2017): “Program Evaluation
and Causal Inference With High-Dimensional Data,” Econometrica, 85, 233–298.

Bind, M.-A. C., T. J. VanderWeele, J. D. Schwartz, and B. Coull (2017): “Quantile causal
mediation analysis allowing longitudinal data,” Statistics in Medicine, 36, 4182–4195.

Bodory, H. and M. Huber (2022): causalweight: Estimation Methods for Causal Inference Based on
Inverse Probability Weighting, R package version 1.0.3.

Bodory, H., M. Huber, and L. Lafférs (2022): “Evaluating (weighted) dynamic treatment effects
by double machine learning,” The Econometrics Journal, 25, 628–648.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and
J. Robins (2018): “Double/debiased machine learning for treatment and structural parameters,” The
Econometrics Journal, 21, 1–68.

Chernozhukov, V., I. Fernández-Val, and A. Galichon (2010): “Quantile and Probability Curves
Without Crossing,” Econometrica, 78, 1093–1125.

Chernozhukov, V., I. Fernández-Val, and B. Melly (2013): “Inference on Counterfactual
Distributions,” Econometrica, 81, 2205–2268.

Chernozhukov, V. and C. Hansen (2005): “An IV Model of Quantile Treatment Effects,”
Econometrica, 73, 245–261.

Cox, D. (1958): Planning of Experiments, New York: Wiley.

Donald, S. G. and Y.-C. Hsu (2014): “Estimation and inference for distribution functions and
quantile functions in treatment effect models,” Journal of Econometrics, 178, 383–397.

Farbmacher, H., M. Huber, L. Lafférs, H. Langen, and M. Spindler (2022): “Causal mediation
analysis with double machine learning,” The Econometrics Journal, 25, 277–300.

Firpo, S. (2007): “Efficient Semiparametric Estimation of Quantile Treatment Effects,” Econometrica,
75, 259–276.

Fisher, A. and E. H. Kennedy (2019): “Visually Communicating and Teaching Intuition for Influence
Functions,” .

Flores, C. A. and A. Flores-Lagunes (2009): “Identification and Estimation of Causal Mechanisms
and Net Effects of a Treatment under Unconfoundedness,” IZA Discussion Paper No. 4237.

——— (2010): “Nonparametric Partial Identification of Causal Net and Mechanism Average Treatment
Effects,” mimeo, University of Florida.

Flores, C. A., A. Flores-Lagunes, A. Gonzales, and T. Neuman (2012): “Estimating the effects
of Length of Exposure to Instruction in a Training Program: The Case of Job Corps,” The Review of
Economics and Statistics, 94, 153–171.

Foresi, S. and F. Peracchi (1995): “The Conditional Distribution of Excess Returns: An Empirical
Analysis,” Journal of the American Statistical Association, 90, 451–466.

Frölich, M. and M. Huber (2017): “Direct and Indirect Treatment Effects: Causal Chains and
Mediation Analysis with Instrumental Variables,” Journal of the Royal Statistical Society: Series B,
79, 1645–1666.

Hahn, J. (1998): “On the Role of the Propensity Score in Efficient Semiparametric Estimation of
Average Treatment Effects,” Econometrica, 66, 315–332.

Hines, O., O. Dukes, K. Diaz-Ordaz, and S. Vansteelandt (2022): “Demystifying Statistical
Learning Based on Efficient Influence Functions,” The American Statistician, 76, 292–304.

Hsu, Y.-C., M. Huber, and T.-C. Lai (2019): “Nonparametric Estimation of Natural Direct and
Indirect Effects Based on Inverse Probability Weighting,” Journal of Econometric Methods, 8.

Hsu, Y.-C., T.-C. Lai, and R. P. Lieli (2022): “Estimation and inference for distribution and quantile
functions in endogenous treatment effect models,” Econometric Reviews, 41, 22–50.

Huber, M. (2014): “Identifying causal mechanisms (primarily) based on inverse probability weighting,”
Journal of Applied Econometrics, 29, 920–943.

Huber, M., M. Schelker, and A. Strittmatter (2022): “Direct and Indirect Effects based on
Changes-in-Changes,” Journal of Business & Economic Statistics, 40, 432–443.

Ichimura, H. and W. K. Newey (2022): “The influence function of semiparametric estimators,”
Quantitative Economics, 13, 29–61.

Imai, K., L. Keele, and T. Yamamoto (2010): “Identification, Inference and Sensitivity Analysis for
Causal Mediation Effects,” Statistical Science, 25, 51–71.

Levy, J. (2019): “Tutorial: Deriving The Efficient Influence Curve for Large Models,” .

Neyman, J. (1923): “On the Application of Probability Theory to Agricultural Experiments. Essay on
Principles.” Statistical Science, Reprint, 5, 463–480.

——— (1959): Optimal asymptotic tests of composite statistical hypotheses, Wiley, 416–444.

Pearl, J. (2000): Causality: Models, Reasoning, and Inference, Cambridge: Cambridge University
Press.

——— (2001): “Direct and indirect effects,” in Proceedings of the Seventeenth Conference on Uncertainty
in Artificial Intelligence, San Francisco: Morgan Kaufmann, 411–420.

Robins, J. M. and S. Greenland (1992): “Identifiability and Exchangeability for Direct and Indirect
Effects,” Epidemiology, 3, 143–155.

Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994): “Estimation of Regression Coefficients When
Some Regressors are not Always Observed,” Journal of the American Statistical Association, 89,
846–866.

Rubin, D. (1980): “Comment on ’Randomization Analysis of Experimental Data: The Fisher
Randomization Test’ by D. Basu,” Journal of the American Statistical Association, 75, 591–593.

Rubin, D. B. (1974): “Estimating Causal Effects of Treatments in Randomized and Nonrandomized
Studies,” Journal of Educational Psychology, 66, 688–701.

Schochet, P., J. Burghardt, and S. Glazerman (2001): “National Job Corps Study: The
Impacts of Job Corps on Participants’ Employment and Related Outcomes,” Report, Washington, DC:
Mathematica Policy Research, Inc.

Schochet, P., J. Burghardt, and S. McConnell (2008): “Does Job Corps Work? Impact Findings
from the National Job Corps Study,” The American Economic Review, 98, 1864–1886.

Tchetgen Tchetgen, E. J. and I. Shpitser (2012): “Semiparametric theory for causal mediation
analysis: Efficiency bounds, multiple robustness and sensitivity analysis,” The Annals of Statistics,
40, 1816–1845.

Vaart, A. W. v. d. (1998): Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic
Mathematics, Cambridge University Press.

Zhou, X. (2022): “Semiparametric estimation for causal mediation analysis with multiple causally
ordered mediators,” Journal of the Royal Statistical Society: Series B (Statistical Methodology),
forthcoming.
