Professional Documents
Culture Documents
Abstract
One of the major concerns of targeting interventions on individuals in social welfare
programs is discrimination: individualized treatments may induce disparities on sensi-
tive attributes such as age, gender, or race. This paper addresses the question of the
design of fair and efficient treatment allocation rules. We adopt the non-maleficence
perspective of “first do no harm”: we propose to select the fairest allocation within the
Pareto frontier. We provide envy-freeness justifications to novel counterfactual notions
of fairness. We discuss easy-to-implement estimators of the policy function, by casting
the optimization into a mixed-integer linear program formulation. We derive regret
bounds on the unfairness of the estimated policy function, and small sample guaran-
tees on the Pareto frontier. Finally, we illustrate our method using an application from
education economics.
∗
We thank James Fowler, Yixiao Sun and Kaspar Wüthrich for comments and discussion. All mistakes
are our own.
†
Department of Economics University of California at San Diego, La Jolla, CA, 92093. Email: dvi-
viano@ucsd.edu.
‡
Department of Mathematics and Halicioğlu Data Science Institute, University of California at San
Diego, La Jolla, CA, 92093. Email: jbradic@ucsd.edu.
1
1 Introduction
Heterogeneity in treatment effects, widely documented in social sciences, motivates treat-
ment allocation rules that assign treatments to individuals differently, based on observable
characteristics (Manski, 2004; Murphy, 2003). However, targeting individuals may induce
“disparities” across sensitive attributes, such as age, gender, or race. Motivated by ev-
idence for preferences of policymakers towards non-discriminatory actions (Cowgill and
Tucker, 2019), this paper designs fair and efficient targeting rules for applications in social
welfare programs. We construct treatment allocation rules using data from experiments or
quasi-experiments, and we develop new notions of optimality and counterfactual fairness
for the construction of fair policies.
“Fair targeting” is a controversial task, due to the lack of consensus on the formulation
of the decision problem and the definition of fairness. Conventional approaches consist in
designing algorithmic decisions that maximize the expected utility across all individuals,
by imposing some form of axiomatic “fairness constraints” on the decision space of the
policymaker; this is usually imposed on the prediction of the algorithm.1 However, most
of these approaches ignore the welfare effects of such constraints (Kleinberg et al., 2018).
In particular, fairness constraints imposed on the decision space of the policymaker may
ultimately lead to sub-optimal welfare for both sensitive groups. This is a crucial limitation
when policymakers are concerned with the effects of their decisions on the utilities of each
individual: we may not want to impose unnecessary constraints on the policy if such
constraints are harmful for some or all individuals.
Motivated by these considerations, this paper advocates for Pareto optimal treatment
rules: decision-makers must prefer allocations for which we cannot find any other policy
that strictly improves welfare for one of the two sensitive groups, without further decreas-
ing welfare on the opposite group.2 We propose to select the fairest allocation within the
Pareto frontier. Our approach is motivated by the intuitive notion of “first do no harm”:
instead of imposing possibly harmful fairness constraints on the decision space, we instead
restrict the set of admissible solutions to the Pareto optimal set, and among such, we
choose the fairest one. We allow for general notions of fairness, whereas we also propose a
notion of fairness for treatment allocation rules based on envy-freeness (Foley, 1967; Var-
ian, 1976), making a clear connection between classical microeconomic literature and the
counterfactual interpretations of fairness. Our framework encompasses several applications
in economics, including development programs (De Ree et al., 2017), educational programs
(Muralidharan et al., 2019), health programs (Dupas, 2014), cash-transfer programs (Egger
et al., 2019), among others. We name our method Fair Targeting.
1
For a review, the reader may refer to Corbett-Davies and Goel (2018). Further discussion on the related
literature is contained in Section 1.2.
2
Pareto optimality has been often associated with the notion of fairness in economic theory (Varian,
1976), and it finds intuitive justifications under the rational preferences of the decision-maker (Noghin,
2006).
2
Figure 1: Data from Lyons and Zhang (2017). The points on the line denote Pareto optimal
allocations. Points on and below the line denote the Pareto dominated allocations. The
colors refer to two classes of policies, unrestricted (red), restricted by fairness constraints
(blue). Y-axis denotes the welfare of male students and the x-axis on female students.
From the figure, we can observe that imposing fairness constraints may result in harmful
allocations for both groups of individuals.
To gain further intuition on the advantages of the proposed approach, using student
award data of Lyons and Zhang (2017), we plot the Pareto frontier for two nested classes
of linear decision rules in Figure 1. Details on the definition of welfare and the data
can be found in Section 5. In the figure, the red line refers to the unconstrained linear
decision rule, the blue line constrains the welfare on female individuals to be larger than the
welfare on male students, and it does not allow to use the gender attribute in the decision
process. Points denote the Pareto optimal allocations. The figure showcases that imposing
constraints directly on the decision rules leads to allocations that are always dominated
by some other treatment allocation in an unconstrained environment. The blue allocation
finds an harmful allocation for both types of individuals when compared to three red dots
in the figure. The proposed method instead estimates the Pareto optimal allocation in the
least constrained environment (e.g., the red line), and among such, it chooses the “fairest”
one.
In summary, this paper contributes to the statistical treatment choice literature in two
directions: (i) we introduce the notion, estimation procedures, and we study properties of
Pareto optimal treatment allocation rules under general notions of fairness; we justify such
an approach within a decision-theoretical framework; (ii) we adapt notions of envy-freeness
to treatment rules and study statistical properties of the estimated policy. We now discuss
3
each of these contributions in detail.
The decision problem consists of lexicographic preferences of the policymaker of the
following form: (i) each Pareto optimal allocation strictly dominates allocations that are not
Pareto optimal; (ii) a Pareto optimal allocation is strictly preferred to any other allocation
if it satisfies additional fairness requirements. We identify the Pareto frontier as the set
of maximizers over any weighted average of the welfares of each group. Therefore, such
an approach embeds as a special case maximizing a weighted combination of welfares of
each sensitive group (Kitagawa and Tetenov, 2018; Rambachan et al., 2020). However,
the weights are not chosen a-priori, but instead, they are part of the decision problem,
and directly selected to maximize fairness. This has important practical implications: the
procedure is solely based on the notion of fairness adopted by the social planner, and it
does not require to specify importance weights assigned to each sensitive group. These
weights would be extremely hard to justify to the general public.3
Estimation of the set of Pareto optimal allocations represents a key challenge since
infinitely many policies may populate it. In fact, (i) the set consists of maximizers over
a continuum of weights between zero and one; (ii) each maximizer of the welfare (or a
weighted combination of welfares) is often not unique (Elliott and Lieli, 2013). To over-
come these issues, we showcase that the Pareto frontier can be approximated using simple
linear constraints on the policy space. Such a result drastically simplifies the optimiza-
tion algorithm: instead of estimating the entire set of Pareto allocations, we propose to
maximize fairness of the policy function under easy-to-implement linear constraints. Our
approach consists of a novel a-priori multi-objective formulation (Deb, 2014). Differently
from previous literature, we use a discretization argument, and we evaluate weighted com-
binations of the objective functions separately to construct a polyhedron that contains
Pareto allocations. We provide theoretical guarantees on our approach, and we show-
case that the distance between the Pareto frontier obtained via linear constraints and its
√
population counterpart converges uniformly to zero at rate 1/ n.
The choice of the allocation within the Pareto set depends on the notion of fairness
adopted by the social planner. Inspired by the economic concept of envy-freeness, we
propose a novel definition of fairness, which can be seen of independent interest. We
interpret the policy assignment under the lens of the classical notion of “bundles” defined
in microeconomics. For each policy function, we compare expected utilities that depend
on its distribution under the same and opposite sensitive attribute. The proposed notion
of fairness makes a connection to previous notions of counterfactual fairness (Chiappa,
2019; Coston et al., 2020; Kilbertus et al., 2017; Kusner et al., 2019; Nabi and Shpitser,
2018), whereas, differently from previous references, (i) we provide formal justification to
fairness using an envy-freeness argument; (ii) we construct the definition of fairness based
on distributional impact of the treatment allocation rule on the welfare.
In our theoretical analysis, we also study “regret unfairness”. This denotes the differ-
3
See for example also the discussion on relative importance weights contained in Kasy and Abebe (2020).
4
ence between the expected unfairness achieved by the estimated policy function, against
the minimal possible unfairness achieved by any Pareto optimal allocation. We showcase
that such a difference converges to zero uniformly in the sample size, with the rate depend-
ing on the convergence rate of the estimated conditional mean of potential outcomes. All
our results also extend to different notions of fairness that the planner may instead prefer
to adopt. Finally, we conclude with an application that showcases the advantages of the
proposed method compared to alternatives that ignore Pareto optimality.
The remainder of this paper is organized as follows: we provide a brief overview of the
related literature in the following paragraph; we introduce the setting and main notation in
Section 2, and discuss definitions and motivations for Pareto optimal allocations and fair-
ness; Section 3 contains new Fair Targeting algorithms; Section 4 contains the theoretical
analysis; Section 5 discusses an empirical application, and Section 6 concludes. Complete
formal derivations are relegated in the Appendix.
5
the authors propose semi-heuristic and computationally intensive procedures for estimat-
ing Pareto efficient classifiers. Finally, Xiao et al. (2017) discuss the different problem of
estimation of a Pareto allocation that trade-offs fairness and individual utilities for rec-
ommender systems, where the relative importance weights of the different objectives are
selected a-priori. The above references do not address the problem of policy targeting
discussed in the current paper.
Additional references in computer science on fairness include Chouldechova (2017),
Corbett-Davies et al. (2017), Dwork et al. (2012), Hardt et al. (2016), Feldman et al. (2015),
Kleinberg et al. (2016), among others. For a review and comparisons, the reader may refer
to Corbett-Davies and Goel (2018), and Mitchell et al. (2018). The notion of fairness in
previous references differs in the lack of a valid causal framework. For instance, Zemel
et al. (2013) defines fairness based on statistical independence of a classifier with respect
to the sensitive attribute, whereas Donini et al. (2018) defines a classifier fair if the error of
such a classifier is similar across the sensitive groups. Hossain et al. (2020) and references
therein propose envy-freeness notions of fairness for applications in classification. However,
such notions do not account for covariates across different sensitive groups being drawn
from potentially different distributions, which instead differently justifies counterfactual
notions of fairness. The fair classification has also been studied from different angles in
Liu et al. (2017) who discuss stochastic fair bandits, and Ustun et al. (2019) who propose
decoupled estimation of tree classifiers. A related strand of literature discusses estimation
of decision rules under fairness constraints within a causal framework (Chiappa, 2019;
Coston et al., 2020; Kilbertus et al., 2017; Kusner et al., 2019; Nabi et al., 2019; Nabi and
Shpitser, 2018). All such references discuss the problem of maximizing some given objective
function under fairness constraints. This may result in Pareto dominated allocations, and,
therefore, possibly harmful policies for both disadvantaged as well as advantaged groups
(see Figure 1).
6
assumed to be drawn conditional on the realization of the sensitive attribute, i.e.,
where Xi (0), Xi (1) denote the potential covariates, i.e., the covariates in the state where the
attribute of individual i is equal to s. Note that for each individual we only observe either
Xi (1) or Xi (0) and never both.4 The participants of the study have either been enrolled
in the social welfare program or not. We denote their treatment status with Di ∈ {0, 1},
where Di = 1 denotes that individual i has been enrolled in the social program. We define
their post-treatment outcomes with Yi ∈ Y ⊆ R realized only once the sensitive attribute,
covariates and the treatment assignment are realized.
Following the Neyman-Rubin potential outcome framework (Imbens and Rubin, 2015),
we define Yi (d, s), d ∈ {0, 1}, i.e.
the potential outcomes that would have been observed respectively under treatment zero
and one, for the protected attribute or the opposite one. Note that we only get to observe
one of the four potential outcomes. Throughout the paper, the observed Yi satisfies the
Single Unit Treatment Value Assumption (SUTVA) (Rubin, 1990), with
Yi = Di Si Yi (1, 1) + Di (1 − Si )Yi (1, 0) + (1 − Di )Si Yi (0, 1) + (1 − Di )(1 − Si )Yi (0, 0). (2)
We assume that the vector of potential outcomes, covariates, sensitive attributes and treat-
ment assignments, (Yi , Di , Si , Xi ) are identically and independently distributed (i.i.d.). We
denote with EX(s) , the expectation with respect to the distribution of the potential covari-
ates X(s) only.
Next, we discuss modeling assumptions. The first condition is the unconfoundeness
assumption.
Assumption 2.1. (Unconfoundeness) Let the following two conditions hold: ∀s ∈ {0, 1}, d ∈
{0, 1}, (A) Yi (d, s) ⊥ (Di , Si )|Xi (s); (B) Xi (s) ⊥ Si .
Condition (A) states that the treatments are randomized based on covariates, indepen-
dently on potential outcomes (Imbens, 2004). Similarly, Condition (B) assumes that the
sensible attribute is randomized independently of Xi (s). We remark that such a condition
allows for the dependence of the observed covariates and outcomes on the sensitive at-
tribute, this condition holds when, for instance, Si is either randomized before everything
else is observed, or Si denotes some immutable characteristic that cannot change given
intervention on the other variables.5 The randomization assumption on Si is common in
4
Throughout our discussion we implicitely assume that the support of Xi (1) and Xi (0) is the same,
namely X .
5
i.e., do-operation in the notation of Pearl (2009).
7
Xi Yi
Si Di
Figure 2: Example of a directed acyclical graph under which Assumption 2.1 holds.
mediation analysis under the name of conditional ignorability (Imai et al., 2010). Figure
2 displays a directed acyclical graph under which Assumption 2.1 holds. In the figure, the
variable Xi acts as a mediator of the effect of changing the sensitive attribute. Assumption
2.1 can be equivalently stated after also conditioning on additional baseline characteristics
that do not depend on the sensitive attribute s, such as, for instance, age. Namely, we may
consider an additional set of covariates Zi , and assume that Xi (s) ⊥ Si |Zi . We omit such
an extension for the sake of clarity only. Finally, we define
as the propensity score and the probability of individual i being assigned to the disad-
vantaged group, respectively. The following condition is the overlap assumption, which is
common in the causal inference literature.
Assumption 2.2 (Overlap). Let e(Xi , s), ps ∈ (δ, 1 − δ), almost surely, for δ ∈ (0, 1), for
all s ∈ {0, 1}.
that depends on the individual characteristics and protected attribute alone. That is, a
policy is a deterministic treatment assignment, which depends on covariates Xi , and on
8
status Si only. Here, Π denotes the function class of interest for the treatment assignment
π. Such a class may be unconstrained, or it may reflect economic or legal constraints that
restrict the decision space of the policy-maker.
For a given policy π and a sensitive attribute s, the social welfare for such a group of
individuals is defined as follows
h i
Ws (π) = E Yi (1, s)π(Xi (s), s) + Yi (0, s)(1 − π(Xi (s), s)) . (3)
The above expression denotes the welfare within the group of individuals with sensi-
tive attribute s. Under the standard properties of the propensity score (Imbens, 2000),
identification of the social welfare within each group can be easily expressed using inverse-
probability type estimators (Horvitz and Thompson, 1952). The reader may refer to the
Appendix for further discussion.
Under the standard utilitarian perspective (Manski, 2004), where welfare is aggregated
across all the individuals, the estimand of interest solves
n o
arg max p1 W1 (π) + (1 − p1 )W0 (π)
π∈Π
where p1 is the fraction of the population for which Si = 1. However, we observe that
whenever the sensitive group is a minority group, such an approach assigns a small weight
to the welfare of such individuals, disproportionally favoring the majority group.
An alternative approach is to maximize the welfare for each possible sensitive group
(Ustun et al., 2019). However, in such a case, the resulting policy function may violate
the constraint in Π, e.g., violating anti-discriminatory laws. A simple example is when the
policy π(x, s) is constrained to be constant in the sensitive attribute s.
Motivated by this consideration, we move away from a single-objective function to a
multi -objective decision problem: the social planner aims to maximize the welfare in each
sensitive attribute under the legal or economic constraints. We propose a notion of Pareto
optimality as a suitable notion of efficiency as it formalizes the trade-off between different,
possibly contradicting, objectives.6 The planner then chooses within the set of Pareto
allocations the least unfair policy.
9
while simultaneously
Ws (π1 ) < Ws (π2 ) for some s ∈ {0, 1}.
We denote with Πo the set of all π ∈ Π that are Pareto optimal.
The set of Pareto optimal choices contains all such allocations for which the welfare for
one of the two groups cannot be improved without reducing the welfare for the opposite
group. The following lemma characterizes the set of Pareto optimal allocations.
Lemma 2.1 (Pareto Frontier). The set Πo ⊆ Π is such that
n o
Πo = πα : πα ∈ arg sup αW1 (π) + (1 − α)W0 (π), α ∈ (0, 1) . (4)
π∈Π
The lemma is a trivial application of results of Negishi (1960), applied to the func-
tional of interest. For the sake of completeness, the proof is included in the Appendix.
Lemma 2.1 provides an intuitive interpretation: Pareto optimal allocations consist of poli-
cies that maximize a weighted combination of the protected group welfares. Throughout
our discussion we define
being the largest objective after weighting the welfares of each group by α and 1 − α.
The set of Pareto allocations generalizes notions of optimal treatment rules discussed
in previous literature. Simple examples may illustrate this point.
Example 2.1 (Utilitarian Welfare Maximization). Notice that the population equivalent
of EWM belongs to the Pareto frontier. Namely,
n o
arg max p1 W1 (π) + (1 − p1 )W0 (π) ⊆ Πo
π∈Π
Since the above expression maximizes a weighted average of welfares, with weight α =
p1 . An alternative approach consists in choosing appropriately the weights in the above
expression to reflect the relative importance of each group to the social planner. For
instance we may consider maximizing (Rambachan et al., 2020)
n o
arg max ωW1 (π) + (1 − ω)W0 (π) ⊆ Πo
π∈Π
for some specific weight ω. Clearly such an allocation belongs to the Pareto frontier.
Example 2.2 (Unrestricted Policy Function Class). Whenever the set Π is unrestricted
the Pareto optimal allocation equals the set of policies that lead to the largest possible
welfare within each class. Namely, such a set denotes all n the policies whose welfare
o leads
7
to the welfare under the first best policy πf b (x, s) = 1 m1,s (x) − m0,s (x) ≥ 0 .
n h i o
7
Formally, the set reads as follows: π : Ws (π) = E Yi (1, s) − Yi (0, s) πf b (Xi (s), s) , s ∈ {0, 1} .
10
Remark 1 (Anscillary Fairness Restrictions on Π). Complementary restrictions may also
be considered within the above framework. Following the discussion in Section 1, one such
example of constraint is predictive parity (Kasy and Abebe, 2020), which in such a setting
is interpreted as follows8
h i h i
π : E Yi (1, 1) − Yi (0, 1)π(Xi , Si ) = 1, Si = 1 = E Yi (1, 0) − Yi (0, 0)π(Xi , Si ) = 1, Si = 0 .
These additional restrictions can be directly incorporated in the function class Π. However,
whereas we see legal or economic restrictions often necessary in the construction of Π,
additional restrictions imposed directly by the researcher and not by the policy-maker may
induce sub-optimal allocations for both of the sensitive groups. In fact, observe that for
two function classes Π, Π0 , where Π0 ⊆ Π, the following is true
n o n o
sup αW0 (π) + (1 − α)W1 (π) ≤ sup αW0 (π) + (1 − α)W1 (π) . (5)
π∈Π0 ⊆Π π∈Π
Therefore, the Pareto frontier of a more restricted function class Π is always weakly dom-
inated by the Pareto frontier under weaker constraints.
Pareto optimal allocations are often non-unique, allowing for flexibility in the choice of
efficient policies. If there is a range of alternative policies that assign different weights to
different groups, the social planner must appeal to some principle of preferential ranking.
UnFairness : Π 7→ R
a functional that maps from a given policy function to the real line, and that quantifies
the level of the unfairness of a given policy. We leave unspecified such a functional and
provide examples in the following paragraphs. Given a set of available policy functions Π,
we define C(Π) the choice set of the decision-maker (Fudenberg and Tirole, 1991). The
function C denotes a choice function, where C({π1 , π2 }) = π1 is interpreted as π1 being
strictly preferred to π2 .
Definition 2.2 (Planner’s Preferences). Planner’s preferences satisfy the following three
conditions:
(i) C({π1 , π2 }) = π1 if W1 (π1 ) ≥ W1 (π2 ) and W0 (π1 ) ≥ W0 (π2 ) and either (or both) of
the two inequalities hold strictly;
8
Note that this is not the only definition that can be attributed to predictive parity, and also alternative
notions may be considered.
11
(ii) for each π1 , π2 ∈ Πo , C({π1 , π2 }) = π1 if UnFairness(π1 ) < UnFairness(π2 );
Definition 2.2 postulates that the set of optimal actions of the decision-maker is a
subset of the Pareto optimal set. The above definition also states that within the set of
Pareto optimal allocations, the planner strictly prefers a given policy based on fairness
considerations. Intuitively, the above definition models preferences lexicographically: (i)
each allocation is strictly preferred to the ones for which welfare on both groups is strictly
lower; (ii) allocations are then ranked based on unfairness considerations. Under Definition
2.2, the decision problem takes the following form.
Lemma 2.2 (Social Planner’s Decision Problem). Under rational preferences, with pref-
erences as discussed in Definition 2.2, the following holds: each π ? ∈ C(Π) is such that π ?
solves
π ? ∈ arg inf UnFairness(π)
π∈Π (6)
subject to αW1 (π)+(1 − α)W0 (π) ≥ W̄α , for some α ∈ (0, 1).
Lemma 2.2 provides a formal characterization of the social planner decision problem,
and the proof is contained in the Appendix. The problem consists in minimizing the
unfairness criterion of the policy function, under the condition that the policy is Pareto
optimal. The social planner does not maximize a weighted combination of welfares, with
some pre-specified and hard-to-justify weighting scheme. Instead, the relative importance
of each group (i.e., α) is implicitly chosen within the optimization problem to minimize the
unfairness criterion. Such an approach allows for a transparent choice of the policy based
on the definition of fairness adopted by the social planner.
Remark 2 (Allowing for Exploration). One key question is whether we may lose the
constraint on Pareto optimality, to substantially decrease the unfairness of the estimated
policy. For instance, the policy-maker may be willing to find a policy that is only “approx-
imately” Pareto optimal, i.e., whose resulting welfares are within the orange regions, as
long as it minimizes the unfairness criterion. Such an approach can be easily implemented
by imposing slacker constraints on social welfare than the ones in Equation (6). Simple
examples are constraints of the form αW1 (π) + (1 − α)W0 (π) ≥ W̄α − ε for some ε > 0.
12
in the context under consideration, “bundles” assigned to individuals refer to the treatment
assignment π(Xi , Si ), and utilities depend on the distribution of such assignments.
We start our discussion with the following definition:
h i
Vπ(x,s1 ) (x, s2 ) = E π(x, s1 )Yi (1, s2 ) + (1 − π(x, s1 ))Yi (0, s2 )Xi (s2 ) = x . (7)
The above expression denotes the conditional welfare, for the policy function being assigned
to the opposite attribute. Namely, it defines the generated effect of π(x, s1 ), for type s2 ,
conditional on the value of the covariates.
We say that the agent with attribute s2 envies the agent with attribute s1 , if
h i
EX(s1 ) Vπ(X(s1 ),s1 ) X(s1 ), s2 > Ws2 (π). (8)
Intuitively, the agent, with attribute s2 , envies the one with opposite attribute if, after
changing her distribution of covariates and treatment assignment to be of the opposite
attribute, the resulting welfare is strictly larger than her current welfare.
Motivated by the above definition, we measure the unfairness towards an individual
with attribute s2 as
h i
A(s1 , s2 ; π) = E Vπ(X(s1 ),s1 ) (X(s1 ), s2 ) −Ws2 (π). (9)
Here Ws2 (π) is as defined in Equation (3). Observe that the above measure denotes the
difference between the welfare of group s2 under respectively the opposite and same dis-
tribution of covariates and treatment assignments. The above notion of fairness showcases
the advantage of embedding the problem within a causal inference framework: we compare
utilities, by letting covariates and treatment assignments being drawn from the distribution
under the opposite sensitive attribute.
Since we aim not to discriminate in either direction (women with respect to men
and vice-versa), we define unfairness by taking the sum of the effects A(s1 , s2 ; π) and
A(s2 , s1 ; π). Such an approach builds on the notion of “social envy” discussed in Feldman
and Kirman (1974), where envy is defined as a weighted combination of envy of each group.
Namely, we propose the following notion of unfairness
Note that the above definition induces partial ordering in the set of Pareto optimal alloca-
tions.
Motivated by the above definition, we propose an estimand which finds the fairest
allocation within the set of Pareto optimal policies.
13
3 Fair Targeting
In this section, we discuss the estimation and optimization problem, namely, we aim to
construct an estimator of π ? in Equation (6). We design a plug-in estimator, where the
population quantities are replaced with particularly designed sample equivalents. By letting
An and Π̂o denote the sample approximations of the population quantities A and Πo , which
are discussed in the following paragraphs, our Fair Targeting method, solves
One of the key challenges induced by the estimation procedure is represented by the pres-
ence of a possibly infinitely dimensional set of Pareto optimal allocations. To overcome
this issue, we represent the Pareto frontier using a set of linear constraints on the policy
function, without directly constructing the entire set of Pareto optimal policies. Such an
approach permits us to find efficient solutions without the need to exhaustively search over
the entire set of Pareto allocations.
be the conditional mean of the group s under treatment status d. We denote the doubly
robust score (Robins et al., 1994) as
1{Si = s}1{Di = d}
Γd,s,i = Yi − md,s (Xi ) + md,s (Xi ) (12)
ps P (Di = d|Xi , Si )
The estimators of nuisances m̂d,s (.), ê(.), p̂s , are built upon causal inference and semi-
parametric literature (Bang and Robins, 2005; Hahn, 1998; Newey, 1990, 1994; Robins and
Rotnitzky, 1995). Each of these components is estimated via cross-fitting (Chernozhukov
et al., 2018; Chiang et al., 2019; Schick, 1986). One needs to carefully consider protected
14
Figure 3: Graphical representation of cross-fitting under two alternative model formulation.
The light gray area is the training set, used to construct an estimator of m̂d,s=1 , whereas the
darker gray area is an evaluation set, area in which a prediction of m̂d,s=1 is computed. The
figure on the left refers to the case where no parametric assumptions are imposed on the
depends on md,s with s. The panel on the right uses both groups’ information to estimate
the conditional mean function by further imposing additional parametric assumptions on
the depends on md,s with s.
Ik Ik
attribute in this procedure. Two alternative cross-fitting procedures are available to the
researcher. The first one, consists in dividing the sample into K folds and estimating the
(−k)
conditional mean m̂d,s (Xi ) using observations for which S = s only, after excluding the
fold k corresponding to unit i.9 Such an approach does not impose parametric restrictions
on the dependence of md,s on the attribute s, at the expense of shrinking the effective sam-
ple size used for estimation. The second approach consists in further imposing additional
parametric restrictions on the depends of md,s on s and using all observations in all folds
(−k)
except k for estimating m̂d,s (Xi ). A graphical representation is depicted in Figure 3.
Clearly, such a set cannot be directly estimated since it contains solutions to uncountably
many optimization problems. Therefore, as a first step, we discretize such a set, and we
construct a grid of equally spaced values αj ∈ (0, 1), j ∈ {1, ..., N },
{α1 , ..., αN }
9 (−k)
Formally, let i ∈ Ik ∩ S1 where Ik is the k-th fold of the data and S1 = {i : Si = s1 }. Let m̂d,s2 be an
estimator obtained using samples not in the fold k, Ikc ∩ S1c for which S1c = {i : Si = s2 }; for example by a
random forest or linear regression of Yj onto Xj for Sj = s2 , and j ∈
/ Ik .
15
√
where we choose N = n. We approximate Πo,n using the discretized set
n n o o
Π̂o = πα ∈ Π : πα ∈ arg sup αŴ0 (π) + (1 − α)Ŵ1 (π) , s.t. α ∈ {α1 , ..., αN } . (16)
π∈Π
The choice of the grid is arbitrary, as long as each value is far from the other by no more
than a small ε-value. A natural choice for the construction of the grid consists in starting
with αj = p̂1 and then choosing the remaining N − 1 values, with a distance from each
other by ε. Such an approach guarantees, for instance, that the solution of the EWM
method is a feasible solution under Fair Targeting.
Unfortunately, the set Π̂o may be hard, if not impossible, to directly estimate. In fact,
for each α we may have uncountably many solutions (Elliott and Lieli, 2013; Manski and
Thompson, 1989). Therefore, instead of directly estimating such a set, we characterize Π̂o
using simple linear constraints. The key idea consists in imposing that for some arbitrary
weight αj , the weighted combination of empirical welfare is “large enough” to guarantee
Pareto optimality. We formalize this approach in the following lines.
The first step consists in finding the maximum empirical welfare achieved on the “dis-
cretized” Pareto Frontier by finding
n o
W̄j,n = sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π) , (17)
π∈Π
for each j ∈ {1, ..., N }. This can be easily solved using using exact methods such as
mixed-integer programs (Kitagawa and Tetenov, 2018) or recursive search algorithms (Zhou
et al., 2018).10 To guarantee Pareto optimality it is sufficient to impose that a weighted
combination of the welfare is at least as large as W̄j,n , for some j ∈ {1, ..., N }.
Observe now that the set Π̂o , (16), can be characterized using the following represen-
tation:
∀π ∈ Π̂o
∃j ∈ {1, ..., N } such that αj Ŵ0,n (π) + (1 − αj )Ŵ1,n (π) ≥ W̄j,n . (18)
That is, Π̂o contains all policies π that achieve the maximum value of weighted welfares,
W̄j,n for some j. Therefore, whereas we may not be able to find all the policies in the set
Π̂o , we can still represent such a set using simple linear constraint on the estimated policy
function.
16
additional set of decision variables that guarantee that the constraints in Equation (18)
hold.
We represent the constraints on the policy space by introducing the variables
These variables have simple representation for general classes of policy functions, such as
linear rules (see for example Chen and Lee (2018); Kitagawa and Tetenov (2018)). In
addition, we define a vector
that encodes the locations on the grid of α for which the supremum in (18) is reached at;
uj = 1 whenever the constraint in Equation (18) holds for αj . The chosen policy must be
Pareto optimal, i.e., uj must be equal to one for at least one j. To ensure this, we introduce
a simple constraint N
P
j=1 uj ≥ 1.
To represent the Pareto optimality, following Equation (18), the following constraint
must be imposed
D E D E D E
uj αj Γ̂1,0 − Γ̂0,0 , z 0 + uj (1 − αj ) Γ̂1,1 − Γ̂0,1 , z 1 + Γ̂0,1 + Γ̂0,0 , 1 − uj nW̄j,n ≥ 0,
(21)
where Γ̂d,s = (Γ̂d,s,1 , ..., Γ̂d,s,n ), and h., .i denotes the vector product. The above constraint,
together with (20) and N
P
j=1 uj ≥ 1, ensures that the assignment z1,i , z0,i is Pareto optimal.
17
Theorem 3.1. The solution set to Equation (22) and Equation (11) are equal.
Theorem 3.1 showcases that the optimization algorithm can be written as a mixed-
integer program with constraints that depend on the function class of interest. Such con-
straints read as mixed-integer linear constraints for several function classes such as optimal
trees and maximum score methods (Bertsimas and Dunn, 2017; Florios and Skouras, 2008).
In addition, the objective function is clearly linear in the decision variables. Constraints
(A), (C), (E) are mixed-integer linear constraints, whereas the Constraint (B) is quadratic.
However, notice that we can further simplify (B) as a linear program at the expense of
introducing additional N n binary variables, and 2N n additional constraints (e.g., see Vi-
viano (2019); Wolsey and Nemhauser (1999)). We omit such an extension for the sake of
brevity only.
4 Theoretical Guarantees
Next, we formalize the finite sample theoretical properties of the solution in Equation (22).
The following assumption is imposed throughout our discussion.
Assumption 4.1. Suppose that the following conditions hold.
(A) Yi (d, s) ∈ [−M, M ], for some M < ∞, for all d ∈ {0, 1}, s ∈ {0, 1};
(B) Π has finite VC-dimension, denoted as v;11
(C) Π is pointwise measurable.12
Condition (A) states the outcome is uniformly bounded by a universal constant M <
∞. Importantly, the implementation of Fair Targeting does not require knowledge of
such a constant. Condition (B) imposes a restriction on the function class of interest of
the policy function. Condition on the finite VC dimension is common in many areas of
research (see Devroye et al. (2013) for further reference). General examples where such an
assumption holds for Empirical Welfare Maximization can be found in Mbakop and Tabord-
Meehan (2016), among others. Condition (C) ensures the measurability of the supremum
of the empirical process of interest, similarly, for instance, to Rai (2018). Simple examples
where the finite VC-dimension condition holds are linear decision rules (Manski, 1975), and
decision trees (Loh, 2011). In the former case, the VC-dimension is bounded by the number
of covariates, whereas in the latter case is bounded by the exponential of the number of
layers in the tree (Athey and Wager, 2017; Zhou et al., 2018).
11
The VC dimension denotes the cardinality of the largest set of points that the function π can shatter.
The VC dimension is commonly used to measure the complexity of a class, see for example Devroye et al.
(2013).
12
A pointwise measurable function satisfies the following condition: for each π ∈ Π, there exists a count-
able subset G, such that there exists a sequence πm ∈ G such that πm →m→∞ π uniformly (Chernozhukov
et al., 2014).
18
4.1 Guarantees on the Pareto Frontier
In the following lines, we impose consistency conditions of the estimated nuisance param-
eters.
Assumption 4.2. Assume that for some ξ1 ≥ 1/4, ξ2 ≥ 1/4, the following holds:
h 2 i h . . 2 i
E m̂d,s (Xi ) − md,s (Xi ) = O(n−2ξ1 ), E 1 p̂s ê(Xi , s) − 1 ps e(Xi , s) = O(n−2ξ2 ).
(23)
for all s ∈ {0, 1}, d ∈ {0, 1}, where Xi is out-of-sample. Assume in addition that for some
H, and n ≥ H, with probability one,
1 1
sup |m̂d,s (x) − md,s (x)| ≤ 2M, sup − ≤ 2/δ 2 . (24)
x∈X ,s∈S,d∈{0,1} x∈X ,s∈S p s e(x, s) p̂s ê(x, s)
Assumption 4.2 assumes that the product of the mean-squared error of the propensity
score and conditional mean function converges at the parametric rate n−1 . Such a condition
is standard in the doubly-robust literature (Chernozhukov et al., 2018; Farrell, 2015; Newey,
1994; Robins and Rotnitzky, 1995). Since we only consider rates (and not constants which
depend on the variance component) in the upper bound, we find that uniform consistency
is not necessary for the derivation of the regret bound. Instead, we state the stability
statement as a finite-sample condition, which allows the characterization of small sample
properties on the regret. The condition can be alternatively stated asymptotically (Athey
and Wager, 2017), whereas in such a case, all the remaining results should be interpreted
in the asymptotic sense only.
As the first step for our results, we derive uniform convergence of the estimated welfare
with respect to its population counterpart, weighted by α-weights.
Theorem 4.1. Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then, with probability at least
√
1 − 1/ n − γ, for some H > 0, and n ≥ H, γ > 0
sup sup αW0 (π) + (1 − α)W1 (π) − max αj Ŵ0 (π) − (1 − αj )Ŵ1 (π)
α∈(0,1) π∈Π αj ∈{α1 ,...,αN }
M p p
≤ C̄ 2 v/n + log(2/γ)/n + 1/N
δ
for a universal constant C̄ < ∞.
Theorem 4.1 guarantees uniform convergence rate of the empirical Pareto frontier con-
√
structed via discretization to its population counterpart at the optimal 1/ n rate. The
proof is contained in the Appendix.
Next, we characterize the distance between the estimated set Π̂o in Equation (18), ob-
tained using a discretization argument and taking into account estimation of the empirical
19
welfares, and the set of Pareto optimal policies Πo . To quantify such a difference, we
introduce the following measure:
n o n o
d(Πo , Π̂o ; α) = sup αW0 (π) + (1 − α)W1 (π) − sup αW0 (π) + (1 − α)W1 (π) . (25)
π∈Πo π∈Π̂o
Equation (25) quantifies the distance between the population Pareto frontier and the
empirical Pareto frontier, obtained via discretization. Under the above conditions, we can
state the following theorem.
Theorem 4.2. Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then, with probability at least
√
1 − 1/ n − γ, for some H > 0, and n ≥ H, γ > 0, for some universal constants c, c0 < ∞,
r
cM v log(c0 /γ) M
sup d(Πo , Π̂o ; α) ≤ 2 +c 2 . (26)
α∈(0,1) δ n δ N
20
The above condition imposes that for each element in the Pareto Frontier, the supre-
mum at the population level is reached by the same policies.13 We remark that such a
condition is not imposed on the maximum of the empirical counterpart, which instead may
not satisfy the above condition. The second condition is the consistency assumption of the
conditional mean function, given the opposite sensitive attribute.
Assumption 4.4. Assume that for some η > 0, the following holds:
h 2 i
E m̂d,s1 (Xi (s2 )) − md,s1 (Xi (s2 )) = O(n−2η ), ∀s1 , s2 ∈ {0, 1}, d ∈ {0, 1}.
Assumption 4.4 states that the estimator of the conditional mean function for each sen-
sitive attribute s ∈ {0, 1} and treatment status d ∈ {0, 1}, must consistently converge to the
true conditional mean function in MSE at some arbitrary rate 2η > 0. One important as-
pect to remark is that differently, for instance, from previous literature on semi-parametric
estimation, we require convergence in l2 for a given sensitive attribute conditional on the
opposite sensitive attribute. Such a condition is motivated by the counterfactual notion
of fairness: to estimate fairness, we are required to compute the empirical average of the
conditional mean function for a given group of individuals with respect to the distribution
under opposite group. Therefore, standard doubly-robust procedures cannot be employed
for estimation of the counterfactual envy-free fairness, but instead, its estimation must rely
on the estimation of the conditional mean function.
Examples include (i) linear regression models, of the form md,s (x) = xβs , with βs being po-
tentially high-dimensional, and Xi (1), Xi (0) ∈ [−M, M ]p , for M < ∞; (ii) local polynomial
estimators of the form
n x − X
1 X 1{Si = s} i
m̂d,s (x) = Yi K (29)
n ps h
i=1
where K(u) is a kernel and h is the bandwidth, whose uniform convergence rates are
discussed, for instance, in Fan and Gijbels (1996); Hansen (2008).
Under the above conditions we state our second theorem, where we implicitely assume
√
that N = n.
13
Conditions on the uniqueness of the optimal welfare maximization rule can also be found in Andrews
et al. (2019).
21
Theorem 4.4. (Optimal Fairness) Let Assumptions 2.1, 2.2, 4.1-4.4 hold. Then for some
ÑM,δ,v > 0, which depends on M, δ, v, and n > max{ÑM,δ,v , H}, with probability at least
√
1 − γ − 1/ n, for universal constants C̄, c < ∞
r
M v log(1/γ) c −η
UnFairness(π̂) − inf UnFairness(π) ≤ C̄ + n . (30)
π∈Πo δ n δ
22
We consider a function class of linear decision rules, given their large use in economic
applications (Manski, 1975). Namely, we consider the following policy function:
n n o o
Π = π(x, fem) = 1 β0 + β1 fem + x> φ ≥ 0 , (β0 , β1 , φ) ∈ B . (31)
where En [·] denote the empirical expectation, estimated using the doubly-robust method.
We refer to each policy space respectively as Type 1, Type 2, Type 3.
14
A simple example may illustrate the difference. Consider the policy π corresponding to “always treat”.
Suppose that X(1) ∼ N (µ1 , σ 2 ) and X(0) ∼ N (µ0 , σ 2 ). Suppose that Y (d, s)|X(s) ∼ N (βd + X(s), η 2 ).
Then inequity corresponding to the policy “always treat” reads as follows: |β1 + µ1 − β0 − µ0 |. However,
counterfactual envy reads as follows: β0 − β1 + β1 − β0 = 0.
23
5.1 Results
In Figure 4 we plot the Pareto frontier over each function class.15 The figure showcases that
restricting the function class of interest leads to Pareto dominated allocations. This outlines
the limitations of maximizing welfare under “fairness constraints”: such constraints can
result as harmful for both types of individuals. Instead, the proposed method considers the
red line (i.e., the least constrained environment), and select the policy based on fairness
considerations. In Figure 4, we also label allocations on the Pareto frontier corresponding
to the lowest envy and to the lowest inequity. We observe that different definitions of
fairness lead to different treatment allocations and different importance weights assigned
to each of the two groups. The proposed method finds such weights solely based on the
notion of fairness provided by the social planner, without requiring any prior specification
of relative weights assigned to each group. We see this as a crucial advantage, since relative
importance weights may be hard to justify to the general public.
In Figure 5, we report the value of counterfactual envy and inequity as a function of
the relative weight assigned to female students. We use a dotted line to refer to the weight
corresponding to the EWM allocation. The figure showcases that the EWM allocation does
not find allocations with the lowest unfairness in three of the four panels. Only for Case
1, the EWM allocation, and the Fair Targeting rule with the lowest inequity coincide.
Finally, in Table 1, we report the welfare of female students, male students, and their
weighted combination, for different targeting rules. Similarly to what is shown in Figure
4, we observe that different notions of fairness correspond to different weights of groups to
minimize the objective function. As we may observe from Table 1, the Fair Targeting rules
are not Pareto dominated by any other competitor. Instead, they correspond to the optimal
allocations under definitions of fairness based respectively on envy-freeness and equity. Here
is why: the envy-freeness Fair Targeting leads to a better allocation than unconstrained
EWM since it leads to lower envy while being both Pareto optimal (e.g., see Figure 5). It
leads to a better allocation than constrained EWM (EWM2 and EWM3) since, as shown
in Figure 4, for each constrained allocation, we can find an unconstrained allocation that
Pareto dominates it. Such an allocation is strictly preferred to the constrained EWM
method. However, such an allocation has higher envy than the optimal envy-freeness
allocation rule reported in the table. Similar reasoning also applies when fairness is defined
in terms of equity across groups.
15
The value functions over the Pareto frontier can be exactly recovered as follows: we solve 2 optimization
problems for each αj , j ∈ {1, ..., N }. For each of these problems, we impose constraints of the welfare of
one the two groups being larger than the other and vice-versa; we then select the subset of solutions that
are not Pareto dominated by the others, and we plot the corresponding welfares in the figure.
24
Figure 4: Pareto frontier under linear policy rule. Dots denote Pareto optimal allocations,
whereas the point on and below the line denote the set of dominated allocations. Panel
on the left collects results for the case where the covariates used in the decision rule are
years of graduation, years of entrepreneurship, region of the start-up, the major and the
school rank. The panel on the right collects results for covariates used in the decision rule
are the score assigned to the applicant and the school rank. Type 1 refers to the function
class Π1 , Type 2 to the function class Π2 and Type 3 to the function class Π3 . Notice
that the blue and green lines denote the Pareto frontier under fairness constraints, which
are strictly dominated by the Pareto frontier in an unconstrained environment (red line).
From the figure, we can observe that imposing fairness constraints may result in harmful
allocations for both groups of individuals.
25
Figure 5: Envy (left-panels) and in-equity (right-panels) under Case 1 (top panels) and
Case 2 (bottom panels). The x-axis corresponds to different importance weight assigned
to the welfare of female students, and the y-axis reports the level of the unfairness of
the procedure that maximizes the weighted combination of welfares of female and male
students. Dotted lines denote the importance weight assigned to female students by the
EWM method. Notice that the Pareto allocation corresponding to the lowest envy-freeness
and the one corresponding to the lowest inequity assign different importance weights to
female students. From the figure, we can observe that importance weights may have an a-
priori unknown dependence with different notions of fairness adopted by the social planner.
26
Table 1: Panel at the top refers to the case where the covariates used in the decision rule
are years of graduation, years of entrepreneurship, region of the start-up, the major, and
the school rank. The panel at the bottom collects results for covariates used in the decision
rule are the score assigned to the applicant and the school rank. Fair Envy refers to the Fair
Targeting rule that minimizes envy-freeness unfairness; Fair Equity is the equity-based Fair
Targeting method; EWM1, EWM2, EWM3 refers to the EWM method with function class
being respectively Π1 , Π2 , Π3 . The second column reports the welfare of female students
and the third column the welfare of male students; the fourth column report a weighted
combination of the two welfares. the last column reports the importance weight assigned
by the method to the welfare of female students.
27
6 Conclusion
In this paper, we have introduced a novel method for estimating fair and optimal treatment
allocation rules. We proposed a multi-objective decision problem, where the policymaker
aims to select the least unfair policy, in the set of Pareto optimal allocations. We provided
counterfactual definitions of fairness, which are justified from an envy-freeness perspec-
tive. We derive strong theoretical guarantees on the properties of the Pareto frontier and
unfairness regret bounds of the proposed policy function.
We employ the Neyman-Rubin potential outcome framework, indexing potential out-
comes, and covariates by the sensitive attribute. We assume that the sensitive attribute
is independent of such potentials. Such an assumption may hold in many applications
involving attributes such as age or gender, whereas it may fail in applications where the
sensitive attribute is confounded by unobserved characteristics, as in the case of the race
attribute. We leave for future research extensions of this work, which allow for weaker
notions of exogeneity of the sensitive attribute, and the use of possible instruments in such
settings.
From a theoretical perspective, the validity of the regret bound relies on the convergence
rate of the proposed estimator in MSE, evaluated on covariates drawn from a different
distribution for the training sample. Whereas examples are discussed, such notions open
new questions on the theoretical guarantees of classical estimators, when used to predict
on samples drawn from different distributions.
Finally, this paper sheds some light on the connection of counterfactual definitions of
fairness and the statistical treatment allocation problem. A comprehensive discussion of
such a connection in a decision-theoretical framework remains an open research question.
28
References
Andrews, I., T. Kitagawa, and A. McCloskey (2019). Inference on winners. Technical
report, National Bureau of Economic Research.
Athey, S. and S. Wager (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896 .
Balashankar, A., A. Lees, C. Welty, and L. Subramanian (2019). What is fair? exploring
pareto-efficiency for fairness constrained classifiers. arXiv preprint arXiv:1910.14120 .
Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal
inference models. Biometrics 61 (4), 962–973.
Bertsimas, D. and J. Dunn (2017). Optimal classification trees. Machine Learning 106 (7),
1039–1082.
Chen, L.-Y. and S. Lee (2018). Best subset binary prediction. Journal of Economet-
rics 206 (1), 39–56.
Chiang, H. D., K. Kato, Y. Ma, and Y. Sasaki (2019). Multiway cluster robust dou-
ble/debiased machine learning. arXiv preprint arXiv:1909.03489 .
Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidi-
vism prediction instruments. Big data 5 (2), 153–163.
29
Corbett-Davies, S. and S. Goel (2018). The measure and mismeasure of fairness: A critical
review of fair machine learning. arXiv preprint arXiv:1808.00023 .
Corbett-Davies, S., E. Pierson, A. Feller, S. Goel, and A. Huq (2017). Algorithmic decision
making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 797–806.
Cowgill, B. and C. E. Tucker (2019). Economics, fairness and algorithmic bias. preparation
for: Journal of Economic Perspectives.
De Ree, J., K. Muralidharan, M. Pradhan, and H. Rogers (2017). Double for nothing?
experimental evidence on an unconditional teacher salary increase in indonesia. The
World Bank.
Devroye, L., L. Györfi, and G. Lugosi (2013). A probabilistic theory of pattern recognition,
Volume 31. Springer Science & Business Media.
Dudik, M., J. Langford, and L. Li (2011). Doubly robust policy evaluation and learning.
arXiv preprint arXiv:1103.4601 .
Dupas, P. (2014). Short-run subsidies and long-run adoption of new health products:
Evidence from a field experiment. Econometrica 82 (1), 197–228.
Dwork, C., M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012). Fairness through
awareness. In Proceedings of the 3rd innovations in theoretical computer science confer-
ence, pp. 214–226.
Egger, D., J. Haushofer, E. Miguel, P. Niehaus, and M. W. Walker (2019). General equi-
librium effects of cash transfers: experimental evidence from kenya. Technical report,
National Bureau of Economic Research.
30
Elliott, G. and R. P. Lieli (2013). Predicting binary outcomes. Journal of Economet-
rics 174 (1), 15–26.
Fan, J. and I. Gijbels (1996). Local polynomial modelling and its applications: monographs
on statistics and applied probability 66, Volume 66. CRC Press.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more
covariates than observations. Journal of Econometrics 189 (1), 1–23.
Feldman, A. and A. Kirman (1974). Fairness and envy. The American Economic Review ,
995–1005.
Feldman, M., S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015).
Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD
international conference on knowledge discovery and data mining, pp. 259–268.
Florios, K. and S. Skouras (2008). Exact computation of max weighted score estimators.
Journal of Econometrics 146 (1), 86–91.
Foley, D. K. (1967). Resource allocation and the public sector.
Fudenberg, D. and J. Tirole (1991). Game theory.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation
of average treatment effects. Econometrica, 315–331.
Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent
data. Econometric Theory 24 (3), 726–748.
Hardt, M., E. Price, and N. Srebro (2016). Equality of opportunity in supervised learning.
In Advances in neural information processing systems, pp. 3315–3323.
Hirano, K. and J. R. Porter (2009). Asymptotics for statistical treatment rules. Econo-
metrica 77 (5), 1683–1701.
Horvitz, D. G. and D. J. Thompson (1952). A generalization of sampling without replace-
ment from a finite universe. Journal of the American statistical Association 47 (260),
663–685.
Hossain, S., A. Mladenovic, and N. Shah (2020). Designing fairly fair classifiers via eco-
nomic fairness notions. In Proceedings of The Web Conference 2020, pp. 1559–1569.
Imai, K., L. Keele, and D. Tingley (2010). A general approach to causal mediation analysis.
Psychological methods 15 (4), 309.
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response func-
tions. Biometrika 87 (3), 706–710.
31
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exo-
geneity: A review. Review of Economics and statistics 86 (1), 4–29.
Imbens, G. W. and D. B. Rubin (2015). Causal inference in statistics, social, and biomedical
sciences. Cambridge University Press.
Kasy, M. and R. Abebe (2020). Fairness, equality, and power in algorithmic decision
making. Working paper .
Kitagawa, T. and A. Tetenov (2018). Who should be treated? Empirical welfare maxi-
mization methods for treatment choice. Econometrica 86 (2), 591–616.
Kleinberg, J., S. Mullainathan, and M. Raghavan (2016). Inherent trade-offs in the fair
determination of risk scores. arXiv preprint arXiv:1609.05807 .
Kusner, M., C. Russell, J. Loftus, and R. Silva (2019). Making decisions that reduce
discriminatory impacts. In International Conference on Machine Learning, pp. 3591–
3600.
Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery 1 (1), 14–23.
32
Luedtke, A. R. and M. J. Van Der Laan (2016). Statistical inference for the mean outcome
under a possibly non-unique optimal treatment strategy. Annals of statistics 44 (2), 713.
Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice.
Journal of econometrics 3 (3), 205–228.
Martinez, N., M. Bertran, and G. Sapiro (2019). Fairness with minimal harm: A pareto-
optimal approach for healthcare. arXiv preprint arXiv:1911.06935 .
Mbakop, E. and M. Tabord-Meehan (2016). Model selection for treatment choice: Penalized
welfare maximization. arXiv preprint arXiv:1609.03167 .
Mitchell, S., E. Potash, S. Barocas, A. D’Amour, and K. Lum (2018). Prediction-based de-
cisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprint
arXiv:1811.07867 .
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 65 (2), 331–355.
Nabi, R., D. Malinsky, and I. Shpitser (2019). Learning optimal fair policies. Proceedings
of machine learning research 97, 4674.
33
Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Economet-
rica: Journal of the Econometric Society, 1349–1382.
Nie, X., E. Brunskill, and S. Wager (2019). Learning when-to-treat policies. arXiv preprint
arXiv:1905.09751 .
Rambachan, A. and J. Roth (2019). Bias in, bias out? evaluating the folk wisdom. arXiv
preprint arXiv:1909.08518 .
Rubin, D. B. (1990). Formal mode of statistical inference for causal effects. Journal of
statistical planning and inference 25 (3), 279–292.
Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validity
of experiments. Journal of Econometrics 166 (1), 138–156.
Ustun, B., Y. Liu, and D. Parkes (2019). Fairness without harm: Decoupled classifiers with
preference guarantees. In International Conference on Machine Learning, pp. 6373–6382.
34
Van Der Vaart, A. W. and J. A. Wellner (1996). Weak convergence. In Weak convergence
and empirical processes, pp. 16–28. Springer.
Varian, H. R. (1976). Two problems in the theory of fairness. Journal of Public Eco-
nomics 5 (3-4), 249–260.
Xiao, L., Z. Min, Z. Yongfeng, G. Zhaoquan, L. Yiqun, and M. Shaoping (2017). Fairness-
aware group recommendation with pareto-efficiency. In Proceedings of the Eleventh ACM
Conference on Recommender Systems, pp. 107–115.
Zemel, R., Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013). Learning fair representa-
tions. In International Conference on Machine Learning, pp. 325–333.
Zhang, B., A. A. Tsiatis, E. B. Laber, and M. Davidian (2012). A robust method for
estimating optimal treatment regimes. Biometrics 68 (4), 1010–1018.
Zhou, Z., S. Athey, and S. Wager (2018). Offline multi-action policy learning: Generaliza-
tion and optimization. arXiv preprint arXiv:1810.04778 .
35
A Proofs
A.1 Auxiliary Lemmas
Lemma A.1. Under Assumption 2.1, 2.2 for any sensitive attribute s ∈ {0, 1}
h 1{S = s} Y D
i i i 1{Si = s} Yi (1 − Di ) i
Ws (π) = E π(Xi , s) + 1 − π(Xi , s) . (33)
ps e(Xi , s) ps 1 − e(Xi , s)
Proof of Lemma A.1. Assumption 2.2 guarantees existence of the expectation. By the law
of iterated expectations and Equation (2), we obtain:
h 1{S = s} Y D
i i i Yi (1 − Di ) i
E − π(Xi , s)
ps e(Xi , s) 1 − e(Xi , s)
(34)
h 1{S = s} h Y (1, s)D
i i i Yi (0, s)(1 − Di ) i i
=E E − Si = s, Xi π(Xi , s) .
ps e(Xi , s) 1 − e(Xi , s)
By the first condition in Assumption 2.1, we have
h Y (1, s)D
i i Yi (0, s)(1 − Di ) i h i
− Si = s, Xi = E Yi (1, s) − Yi (0, s)Si = s, Xi . (35)
E
e(Xi , s) 1 − e(Xi , s)
Observe now that
h i h i
E Yi (1, s) − Yi (0, s)Si = s, Xi = E Yi (1, s) − Yi (0, s)Xi (s) . (36)
36
Lemma A.2. Let Ws,n = n1 ni=1 Γ1,s,i π(Xi , s) + Γ0,s,i (1 − π(Xi , s)), where Γd,s,i is defined
P
as in Equation (12). Let Assumptions 2.1, 2.2, 4.1 hold. Then with probability at least
1 − γ,
Mp C̄M p
sup sup αW0 (π)+(1−α)W1 (π)−αW0,n (π)+(1−α)W1,n (π) ≤ C̄ 2 v/n+ 2 log(2/γ)/n
α∈(0,1) π∈Π δ δ
(39)
for a universal constant C̄ < ∞. Similarly,
h i Mp
E sup sup αW0 (π) + (1 − α)W1 (π) − αW0,n (π) + (1 − α)W1,n (π) ≤ C̄ 2 v/n. (40)
α∈(0,1) π∈Π δ
Proof of Lemma A.2. Throughout the proof we refer to C̄ < ∞ as a universal constant.
Observe first that under Assumption 2.2 and Assumption 4.1, we have
sup sup αW0 (π) + (1 − α)W1 (π) − αW0,n (π) + (1 − α)W1,n (π), (41)
α∈(0,1) π∈Π
satisfies the bounded difference assumption (Boucheron et al., 2013) with constant δ2M
2 n . See
for instance Boucheron et al. (2005). By the bounded difference inequality, with probability
at least 1 − γ,
sup sup αW0 (π) + (1 − α)W1 (π) − αW0,n (π) + (1 − α)W1,n (π)
α∈(0,1) π∈Π
h i Mp
≤ E sup sup αW0 (π) + (1 − α)W1 (π) − αW0,n (π) + (1 − α)W1,n (π) + C̄ 2 log(2/γ)/n.
α∈(0,1) π∈Π δ
(42)
We now move to bound the expectation in the right-hand side of Equation (42). Under
Assumption 2.1, we obtain by Lemma A.1 and trivial rearrangments, that
h i
E αW0 (π) + (1 − α)W1 (π) − αW0,n (π) + (1 − α)W1,n (π) = 0. (43)
Using the symmetrization argument (Van Der Vaart and Wellner, 1996), we can now
bound the above supremum with the Radamacher complexity of the function class of
interest (e.g., Athey and Wager (2017), Viviano (2019)), which combined with the triangle
37
inequality reads as follows:
h i
E sup sup αW0 (π) + (1 − α)W1 (π) − αW0,n (π) + (1 − α)W1,n (π)
α∈(0,1) π∈Π
h i h i
≤ E sup sup αW0 (π) − αW0,n (π) + E sup sup (1 − α)W1 (π) − (1 − α)W1,n (π)
α∈(0,1) π∈Π α∈(0,1) π∈Π
h i h i
≤ E sup W0 (π) − W0,n (π) + E sup W1 (π) − W1,n (π)
π∈Π π∈Π
h 1 Xn i h 1 Xn i
≤ E sup σi π(Xi (1), 1)Γ1,1,i + E sup σi (1 − π(Xi (1), 1))Γ0,1,i
π∈Π n i=1 π∈Π n i=1
h 1 Xn i h 1 Xn i
+ E sup σi π(Xi (0), 0)Γ1,0,i + E sup σi (1 − π(Xi (0), 0))Γ0,0,i ,
π∈Π n i=1 π∈Π n i=1
(44)
where here σi are independent Radamacher random variables. We can study each com-
ponent of the above expression separately. By the Dudley’s entropy integral bound, since
the VC-dimension of the function class Π is bounded by Assumption 4.1, and since each
Γd,s, is bounded, we obtain (see for instance Wainwright (2019)), under Assumption 4.1
(A) and (B), with trivial rearrangement
h n
1 X i M C̄ p
E sup σi π(Xi (s), s)Γd,s,i ≤ 2 v/n. (45)
π∈Π n i=1
δ
for each d, s. Observe now that the VC-dimension of π equals the VC dimension of 1 − π
(Devroye et al., 2013). This completes the proof.
Lemma A.3. Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then with probability at least 1 − γ,
and n > H,
Mp C̄M p
sup sup αW0 (π)+(1−α)W1 (π)−αŴ0 (π)+(1−α)Ŵ1 (π) ≤ C̄ 2 v/n+ 2 log(2/γ)/n
α∈(0,1) π∈Π δ δ
(46)
for a universal constant C̄ < ∞.
38
Proof of Lemma A.3. First observe that we can bound the above expression as
sup sup αW0 (π) + (1 − α)W1 (π) − αŴ0 (π) + (1 − α)Ŵ1 (π) ≤
α∈(0,1) π∈Π
sup sup αW0 (π) + (1 − α)W1 (π) − αW0,n (π) + (1 − α)W1,n (π)
α∈(0,1) π∈Π
| {z } (47)
(I)
+ sup sup αW0,n (π) + (1 − α)W1,n (π) − αŴ0 (π) + (1 − α)Ŵ1 (π) .
α∈(0,1) π∈Π
| {z }
(II)
Here Ws,n is as defined in Lemma A.2. The term (I) is bounded as in Lemma A.2. There-
fore, we are only left to discuss (II).
Using the triangular inequality, we only need to bound
sup W0,n (π) − Ŵ0,n (π) + sup W1,n (π) − Ŵ1,n (π). (48)
π∈Π π∈Π
We bound the first term while the second term follows similarly. We write
sup Ws,n (π) − Ŵs (π)
π∈Π
n
1 X 1{Si = s} Di (Yi − m1,s (Xi ))
≤ π(Xi , s) + m1,s (Xi )π(Xi , s)
n ps e(Xi , s)
i=1
n
1 X 1{Si = s} Di (Yi − m̂1,s (Xi ))
− π(Xi , s) − m̂1,s (Xi )π(Xi , s)
n p̂s ê(Xi , s) (49)
i=1
n
1 X 1{Si = s} (1 − Di )(Yi − m0,s (Xi ))
+ (1 − π(Xi , s)) + m0,s (Xi )(1 − π(Xi , s))
n ps 1 − e(Xi , s)
i=1
n
1 X 1{Si = s} (1 − Di )(Yi − m̂s,0 (Xi ))
− (1 − π(Xi , s)) − m̂s,0 (Xi )(1 − π(Xi , s)).
n p̂s 1 − ê(Xi , s)
i=1
39
n
1 X 1{Si = s} Di (Yi − m1,s (Xi ))
π(Xi , s) + m1,s (Xi )π(Xi , s)
n ps e(Xi , s)
i=1
n
1 X 1{Si = s} Di (Yi − m̂1,s (Xi ))
− π(Xi , s) − m̂1,s (Xi )π(Xi , s)
n p̂s ê(Xi , s)
i=1
n
1 X 1 1
≤ sup 1{Si = s}Di (Yi − m1,s (Xi )) − π(Xi , s) (50)
π∈Π n i=1 ps e(Xi , s) p̂s ê(Xi , s)
| {z }
(i)
1 X n
1{Si = s}Di
+ sup − 1 (m1,s (Xi ) − m̂1,s (Xi ))π(Xi , s) .
π∈Π n p̂ s ê(Xi , s)
i=1
| {z }
(ii)
We study (i) and (ii) separately. We start from (i). Recall, that by cross fitting ê(Xi , s) =
ê−k(i) (Xi , s), where k(i) is the fold containing unit i. Therefore, observe that given the K
folds for cross-fitting, we have
1 X n 1 1
1{Si = s}Di (Yi − m1,s (Xi )) − π(Xi , s)
n ps e(Xi , s) p̂s ê(Xi , s)
i=1
X 1 X 1 1
≤ 1{Si = s}Di (Yi − m1,s (Xi )) − (−k) π(Xi , s).
n ps e(Xi , s) p̂s ê(−k) (Xi , s)
k∈{1,...,K} i∈Ik
(51)
In addition, we have that
hX 1 1 i
E 1{Si = s}Di (Yi − m1,s (Xi )) − (−k) π(Xi , s)
ps e(Xi , s) p̂s ê(−k) (Xi , s)
i∈Ik
h hX 1 1 ii
=E E 1{Si = s}Di (Yi − m1,s (Xi )) − (−k) π(Xi , s)p̂(−k) , ê(−k) = 0,
ps e(Xi , s) p̂s ê(−k) (Xi , s)
i∈Ik
(52)
by cross-fitting. By Assumption 4.2, we know that for n > H,
1 1
sup − (−k) ≤ 2/δ 2 (53)
x∈X ,s∈S p s e(x, s) p̂s ê(−k) (x, s)
and therefore each summand in Equation (51) is bounded by a finite constant 2/δ 2 . We
now obtain, that for n > H, using the symmetrization argument (Van Der Vaart and
40
Wellner, 1996), and the Dudley’s entropy integral (Wainwright, 2019)
h 1 X 1 1 i Mp
E sup | 1{Si = s}Di (Yi −m1,s (Xi )) − (−k) π(Xi , s)|p̂(−k) , ê(−k) . 2 v/n.
π∈Π n i∈Ik
ps e(Xi , s) p̂s ê(−k) (Xi , s) δ
(54)
In addition, by the bounded difference inequality (Boucheron et al., 2005), we obtain that
for n > H, with probability at least 1 − γ, for a universial constant c < ∞
1 X 1 1
sup 1{Si = s}Di (Yi − m1,s (Xi )) − (−k) π(Xi , s) ≤
π∈Π n i∈I ps e(Xi , s) p̂s ê(−k) (Xi , s)
k
h 1 X 1 1 i
E sup | 1{Si = s}Di (Yi − m1,s (Xi )) − (−k) π(Xi , s)|p̂(−k) , ê(−k)
π∈Π n i∈I ps e(Xi , s) p̂s ê(−k) (Xi , s)
k
r
M log(2/γ)
+c 2 .
δ n
(55)
We now consider the term (ii). Observe that we can write
1 X n
Di 1{Si = s} Di 1{Si = s}
(ii) ≤ sup − (m1,s (Xi ) − m̂1,s (Xi ))π(Xi , s)
π∈Π n p̂ s ê(Xi , s) p s e(X i , s)
i=1
| {z }
(j)
1 n (56)
X Di 1{Si = s}
+ sup − 1 (m1,s (Xi ) − m̂1,s (Xi ))π(Xi , s) .
π∈Π n i=1
ps e(Xi , s)
| {z }
(jj)
We consider each term seperately. Consider (jj) first. Using the cross-fitting argument we
obtain
1 Xn
Di 1{Si = s}
sup − 1 (m1,s (Xi ) − m̂1,s (Xi ))π(Xi , s)
π∈Π n i=1 ps e(Xi , s)
1 X D 1{S = s} (57)
i i (−k)
X
≤ sup − 1 (m1,s (Xi ) − m̂1,s (Xi ))π(Xi , s).
π∈Π n ps e(Xi , s)
i∈Ik
k∈{1,...,K}
41
We are now left to bound (j). We obtain that
v v
u n u n
u1 X 1 1 2 u1 X
(j) ≤ t − t (m1,s (Xi ) − m̂1,s (Xi ))2 . (60)
n p̂s ê(Xi , s) ps e(Xi , s) n
i=1 i=1
Such a bound does not depend on π. Observe now that we can write
v v
u n u n
u1 X 1 1 2 u1 X
t − t (m1,s (Xi ) − m̂1,s (Xi ))2
n p̂s ê(Xi , s) ps e(Xi , s) n
i=1 i=1
v v
1 X 1 1 2 u X 1 X
u X
(−k)
=t − (m1,s (Xi ) − m̂1,s (Xi ))2 .
u u
(−k)
t
n p̂s ê(−k) (Xi , s) ps e(Xi , s) n
k∈{1,...,K} i∈Ik k∈{1,...,K} i∈Ik
(61)
By the bounded difference inequality, and the union bound, we obtain that the following
holds:
v v
1 X 1 1 2 u X 1 X
u X
(−k)
− (m1,s (Xi ) − m̂1,s (Xi ))2
u u
(−k)
t t
n p̂s ê(−k) (Xi , s) ps e(Xi , s) n
k∈{1,...,K} i∈Ik k∈{1,...,K} i∈Ik
s
h 1 1 2 ir h i
≤K E − E (m1,s (Xi ) − m̂1,s (Xi ))2
p̂s ê(Xi , s) ps e(Xi , s)
s r h
p4
h 1 1 2 i p
4
i
+ 2 log(2K/γ)/n E − + 2 log(2K/γ)/n E (m1,s (Xi ) − m̂1,s (Xi ))2
p̂s ê(Xi , s) ps e(Xi , s)
p
+ 2 log(2K/γ)/n,
(62)
with probability at least 1 − γ. Under Assumption 4.2 and the union bound, the result
completes.
Lemma A.4. Under Assumption 2.1, 2.2, 4.1, 4.2, 4.4, the following holds: with proba-
bility at least 1 − γ , for n ≥ H,
cM r v log(1/γ) c
0 0
sup A(s, s ; π) − An (s, s ; π) ≤ 2 + n−η (63)
π∈Π δ n δ
42
Proof of Lemma A.4. We consider the case where s0 6= s, whereas s0 = s follows trivially.
Observe that we can write
sup A(s, s0 ; π) − An (s, s0 ; π) ≤
π∈Π
i 1X n
h 1{Si = s} 1{Si = s}
sup EX(s) Vπ(X(s),s) (X(s), s0 ) − m̂1,s0 (Xi )π(Xi , s) + m̂0,s0 (Xi )(1 − π(Xi , s))
π∈Π n p̂s p̂s
i=1
| {z }
(A)
We discuss (I) and (III), whereas (II) and (IV) follow similarly. Observe first that by
Assumption 4.1 and the bounded difference inequality, with probability 1 − γ,
1 X n
1{Si = s} h 1{S = s}
i
i
sup m1,s0 (Xi )π(Xi , s) − E m1,s0 (Xi )π(Xi , s)
π∈Π n p s p s
i=1
n
1 X 1{S = s} h 1{S = s}
h
i i
ii Mp
≤ E sup m1,s0 (Xi )π(Xi , s) − E m1,s0 (Xi )π(Xi , s) + C̄ log(2/γ)/n
π∈Π n i=1
ps ps δ
43
for a constant C̄ < ∞. Using the symmetrization argument (Van Der Vaart and Wellner,
1996), we have
1 X n
h 1{Si = s} h 1{S = s}
i
ii
E sup m1,s0 (Xi )π(Xi , s) − E m1,s0 (Xi )π(Xi , s) ≤
π∈Π n p s p s
i=1
1 X n
h 1{Si = s} i
2E sup σi m1,s0 (Xi )π(Xi , s)
π∈Π n p s
i=1
where σi are i.i.d. Radamacher random variables. Since m1,s is uniformly bounded and
similarly ps is bounded, and by Assumption 4.1, we obtain by the properties of the Dudley’s
entropy integral (Wainwright, 2019),
n
1 X
h 1{Si = s} i Mp
E sup σi m1,s0 (Xi )π(Xi , s) ≤ C̄ v/n
π∈Π n i=1
ps δ
for a universal constant C̄ < ∞. We now move to bound (III). Using the triangular
inequality and Holder’s inequality, we obtain
n
1 X 1{Si = s}
(III) ≤ m1,s0 (Xi ) − m̂1,s0 (Xi ) (65)
n ps
i=1
The above bound is deterministic and it does not depend on π. Observe now that by
Equation (1)
n
1 X 1{Si = s}
m (X ) − m̂ (X )
1,s 0 i 1,s 0 i
n ps
i=1
n n
1 X 1{Si = s} 1 X
= m (X (s)) − m̂ (X (s)) ≤ m (X (s)) − m̂ (X (s)).
1,s 0 i 1,s 0 i 1,s 0 i 1,s 0 i
n ps nδ
i=1 i=1
(66)
We now separate the contribution of each of the K folds using in the cross-fitting algorithm.
Namely, we define
n
1 X X 1 X (−k)
m1,s0 (Xi (s))− m̂1,s0 (Xi (s)) ≤ m1,s0 (Xi (s))− m̂1,s0 (Xi (s)) (67)
nδ nδ
i=1 k∈{1,...,K} i∈Ik
(−k)
where Ik denotes the set of indexes in fold k, and m̂1,s0 denotes the estimator obtained
from all folds except k. Next, we bound the following term using Liaponuv inequality:
1 X h i h i
E |m1,s0 (Xi (s)) − m̂1,s0 (Xi (s))|m̂(−k) ≤ E |m1,s0 (Xi (s)) − m̂1,s0 (Xi (s))|m̂(−k)
n
i∈Ik
r h i
≤ E |m1,s0 (Xi (s)) − m̂1,s0 (Xi (s))|2 m̂(−k) ≤ cn−η .
(68)
44
The last inequality follows by Assumption 4.4, for a universal constant c < ∞. Finally, we
discuss exponential concentration of the empirical counterpart. For n ≥ H,
sup md,s0 (x) − m̂−k
(x) ≤ 2M. (69)
d,s 0
x∈X
Define
G = {G(α), α ∈ (0, 1)}.
Under Assumption 4.1, for any ε > 0, there exist a set {α1 , ..., αN (ε) }, such that for all
α ∈ (0, 1),
|G(α) − max G(αj )| ≤ 4εM, a.s. (72)
j∈{1,...,N (ε)}
Proof of Lemma A.5. We denote {α1 , ..., αN (ε) } an ε-cover of the interval (0, 1) with respect
to the L1 norm. Namely, {α1 , ..., αN (ε) } are equally spaced numbers between (0, 1). Clearly,
we have that the covering number N (ε) ≤ 1 + 1/ε. We denote
n o
G(α) = sup αW0 (π) + (1 − α)W1 (π) − sup αW0 (π) + (1 − α)W1 (π) . (73)
π∈Π π∈Π̂o
we claim that for any α ∈ (0, 1), there exist an αj in the ε cover such that
45
Such a result follows by the argument outlined in the following lines: take αj closest to α
and consider
|G(α) − G(αj )|
n o n o
= sup αW0 (π) + (1 − α)W1 (π) − sup αW0 (π) + (1 − α)W1 (π)
π∈Π π∈Π̂o
n o n o
− sup αj W0 (π) + (1 − αj )W1 (π) + sup αj W0 (π) + (1 − αj )W1 (π)
π∈Π o π∈Π̂
n o n o
≤ sup αW0 (π) + (1 − α)W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π)
(75)
π∈Π π∈Π
| {z }
(i)
n o n o
+ sup αW0 (π) + (1 − α)W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π) .
π∈Π̂o π∈Π̂o
| {z }
(ii)
We study (i) and (ii) separately. Consider first (i). We observe the following fact:
whenever
sup αW0 (π) + (1 − α)W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π) > 0 (76)
π∈Π π∈Π
sup αW0 (π) + (1 − α)W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π) ≤ 0 (78)
π∈Π π∈Π
we can use the same argument by switching sign, which, with trivial rearrengment reads
as
(i) ≤ αW0 (π ∗∗ ) + (1 − α)W1 (π ∗∗ ) − αj W0 (π ∗∗ ) + (1 − αj )W1 (π ∗∗ ). (79)
where the last inequality follows by Assumption 4.1 and the triangle inequality. Similar
reasoning also applies to (ii). Therefore, we observe that the covering number of the
function class G for a 4M ε cover is at most 1/ε + 1.
46
A.2 Additional Lemmas
Proof of Lemma 2.1
The proof follows similarly to standard microeconomic textbook (Mas-Colell et al., 1995).
Let
Π̃ = {πα : πα ∈ arg sup α1 W0 (π) + α2 W1 (π), α ∈ R2+ , α1 + α2 > 0}. (81)
π∈Π
Then we want to show that Πo = Π̃. Trivially Π̃ ⊆ Πo , since otherwise the definition of
Pareto optimality would be violated. Consider now some π ∗ ∈ Πo . Then we show that
there exist a vector α ∈ R2+ , such that π ∗ maximizes the expression
Since (0, 0) ∈ F, such a set is non-empty. Notice now that Ws (π) is linear is π for
s ∈ {0, 1}, and hence concave in π. Therefore, we obtain that the set F is a convex set,
since it denotes the sub-graph of a concave functional. We denote W̄ = (W0 (π ∗ ), W1 (π ∗ ))
and G = R2++ + W̄ the set of welfares that strictly dominates π ∗ . Then G is non-empty
and convex. Since π ∗ ∈ Πo , we must have that F ∩ G = ∅. Therefore, by the separating
hyperplane theorem, there exist an α ∈ R2 , with α 6= 0, such that α> F ≤ α> (W̄ + d) for
any F ∈ F, d ∈ R2++ . Let d1 → ∞, it must be that α1 ∈ R+ , and similarly for α2 . So
α ∈ R2+ . By letting d → 0, we have that α> F ≤ α> W̄ . This implies that
for any π ∈ Π (since it is true for any F ∈ F). Hence π ∗ maximizes welfare over all
possible feasible allocations once reweighted by (α1 , α2 ). Since the maximizer is invariant
to multiplication of the objective function by constants, the result follows after dividing
the objective function by the sums of the coefficients, which is non-zero by the separating
hyperplane theorem. This completes the proof.
47
would contraddict the definition of C(Π). Observe now that by definition, C(Π) contains
the set of Pareto optimal allocations that achieves minimal UnFairness. By Lemma 2.1,
the set of Pareto optimal policies coincide with the policies satisfying the constraint in
Definition 2.2. We use a contraddiction argument for this statement. Suppose not, then it
means that there exist a πα satisfying such a constraint for some α, but such that πα 6∈ Πo .
However, this means that
which contraddicts Lemma 2.1. Therefore, the expression finds the allocation with lowest
unfairness within the Pareto optimal set, which completes the proof.
sup sup αW0 (π) + (1 − α)W1 (π) − max αj Ŵ0 (π) − (1 − αj )Ŵ1 (π)
α∈(0,1) π∈Π αj ∈{α1 ,...,αN }
≤ sup sup αW0 (π) + (1 − α)W1 (π) − αŴ0 (π) − (1 − α)Ŵ1 (π)
α∈(0,1) π∈Π
| {z } (85)
(I)
+ sup sup αŴ0 (π) + (1 − α)Ŵ1 (π) − max αj Ŵ0 (π) − (1 − αj )Ŵ1 (π) .
α∈(0,1) π∈Π αj ∈{α1 ,...,αN }
| {z }
(II)
Under Assumption 4.4 and Assumption 4.1, 2.2, for n ≥ H, the estimated conditional
mean and propensity score are uniformly bounded. Therefore we obtain that
M M
ε sup |Ŵ0 (π)| + sup ε|Ŵ1 (π)| ≤ C̄ε 2
≤ C̄
π∈Π π∈Π δ N δ2
48
A.4 Proof of Theorem 4.2
Proof. Throughout the proof we refer to C̄ < ∞ as a universal constant. Recall Lemma
A.5, where we chose a cover with N (ε) = N and ε ≤ 1/N .
Let G(α) be as defined in Lemma A.5. Clearly, we have by definition of Πo that
n o n o
sup αW0 (π) + (1 − α)W1 (π) − sup αW0 (π) + (1 − α)W1 (π) = G(α).
π∈Πo π∈Π̂o
max G(αj ).
j∈{1,...,N (ε)}
max G(αj ) = max sup αj W0 (π) + (1 − αj )W1 (π) − sup αj W0 (π) + (1 − α)W1 (π)
j∈{1,...,N (ε)} j∈{1,...,N (ε)} π∈Π π∈Π̂o
n o n o
= max sup αj W0 (π) + (1 − αj )W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π)
j∈{1,...,N (ε)} π∈Π π∈Πo,n
| {z }
(I)
n o n o
+ max sup αj W0 (π) + (1 − αj )W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π)
j∈{1,...,N (ε)} π∈Πo,n π∈Π̂o
| {z }
(II)
(88)
For the term (I), by definition of Πo,n , since it contains empirical Pareto optimal
policies, and by trivial rearrengement, is bounded as follows
n o
max sup αj W0 (π) + (1 − αj )W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π)
j∈{1,...,N (ε)} π∈Π π∈Πo,n
(89)
≤ 2 sup sup αW0 (π) + (1 − α)W1 (π) − αŴ0 (π) − (1 − α)Ŵ1 (π),
α∈(0,1) π∈Π
where the right hand side is measurable by Assumption 4.1 (C). The above term is bounded
49
by Lemma A.3. The term (II) is instead bounded as follows:
n o n o
(II) = max sup αj W0 (π) + (1 − αj )W1 (π) − sup αj W0 (π) + (1 − αj )W1 (π)
j∈{1,...,N (ε)} π∈Πo,n π∈Π̂o
n o n o
= max sup αj W0 (π) + (1 − αj )W1 (π) − sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π)
j∈{1,...,N (ε)} π∈Πo,n π∈Πo,n
n o n o
− sup αj W0 (π) + (1 − αj )W1 (π) + sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π)
π∈Π̂o π∈Π̂o
n o n o
+ sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π) − sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π)
π∈Πo,n o π∈Π̂
≤ 2 sup sup αW0 (π) + (1 − α)W1 (π) − αŴ0 (π) + (1 − α)Ŵ1 (π)
α∈(0,1) π∈Π
n o n o
+ max sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π) − sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π) .
j∈{1,...,N (ε)} π∈Πo,n π∈Π̂o
| {z }
(III)
(90)
Observe that by definition of Πo,n and Π̂o , since αj is in the cover, we have that
n o n o
sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π) − sup αj Ŵ0 (π) + (1 − αj )Ŵ1 (π) = 0.
π∈Πo,n π∈Π̂o
Therefore,
(II) ≤ 2 sup sup αW0 (π) + (1 − α)W1 (π) − αŴ0 (π) + (1 − α)Ŵ1 (π).
α∈(0,1) π∈Π
50
Proof of Lemma A.6. The result is trivial if Π is empty. Let therefore Π be non-empty.
√
Throughout the rest of the proof we consider a cover with N = n elements. The argument
proceeds as follows. We choose Ñ such that for each α ∈ (0, 1),
s
cM v log(c0 Ñ )
sup αW0 (π) + (1 − α)W1 (π) ≥ αW0 (π) + (1 − α)W1 (π) + 2 , (92)
π∈Π δ Ñ
We show this by contraddiction. Suppose that π̂ ∗ 6∈ {arg supπ∈Π αW0 (π) + (1 − α)W1 (π)}.
Then it must be that Equation (92) holds. On the other hand, this contradicts Equation
(93), since n ≥ Ñ . Since such a statement is true for any α ∈ (0, 1), the proof completes.
Lemma A.7. Under the conditions in Theorem 4.2, and Assumption 4.3, for n ≥ max{Ñ , H},
with Ñ as defined in Lemma A.6, the following holds:
n o n o √
P inf An (0, 1; π) + An (1, 0; π) − inf An (0, 1; π) + An (1, 0; π) ≤ 0 ≥ 1 − 1/ n.
π∈Π̂o π∈Πo
(95)
Proof of Lemma A.7. Observe first that the result is trivial if Πo ⊆ Π̂o . Suppose instead
√
that Πo 6⊆ Π̂o . Then by Lemma A.6, with probability 1 − 1/ n, we can find a set K ⊆ Πo ,
51
which is defined as follows: for each α ∈ (0, 1), there exist a unique πα ∈ K, such that πα
satisfies
πα ∈ arg sup αW0 (π) + (1 − α)W1 (π).
π∈Π
In addition, such a set satisfies the properties that K ⊆ Π̂o . Under Assumption 4.3, and
the definition of An , we have that
n o n o
inf An (0, 1; π) + An (1, 0; π) = inf An (0, 1; π) + An (1, 0; π) ,
π∈Πo π∈K
since An only depends on the image of π, and since under Assumption 4.3, for any π 6∈ K,
we can find a π 0 ∈ K, such that π and π 0 have the same image at each value in the domain.
Therefore, the left-hand side in Equation (95) simplifies to
n o n o
inf An (0, 1; π) + An (1, 0; π) − inf An (0, 1; π) + An (1, 0; π) .
π∈Π̂o π∈K
Observe (A) and (C) first. Then since Πo , Π̂o ⊆ Π, we have by the triangular inequality
(A) ≤ 2 sup A(s, s0 ; π) − An (s, s0 ; π). (98)
π∈Π,maxs,s0 ∈{0,1}
Since also Π̂o ⊆ Π, the same inequality also holds for (C). By Lemma A.4, with probability
at least 1 − γ, we have r
M log(1/γ)
(A), (C) ≤ c . (99)
δ n
52
for a constant c < ∞.
√
We now study (B). By Lemma A.7, for n ≥ Ñ , with probability at least 1 − 1/ n, we
have
(B) ≤ 0.
Consider now (D). With simple rearrengement, we can write
(D) ≤ 2 sup A(s, s0 ; π) − An (s, s0 ; π) (100)
π∈Π̂o ,s,s0 ∈{0,1}
The above term is bounded as in Lemma A.4. Collecting our results, by setting γ =
√
1/ n, and using the union bound, the result follows.
53