
Fair Policy Targeting


Davide Viviano† Jelena Bradic‡

May 27, 2020

Abstract
One of the major concerns of targeting interventions on individuals in social welfare
programs is discrimination: individualized treatments may induce disparities on sensi-
tive attributes such as age, gender, or race. This paper addresses the question of the
design of fair and efficient treatment allocation rules. We adopt the non-maleficence
perspective of “first do no harm”: we propose to select the fairest allocation within the
Pareto frontier. We provide envy-freeness justifications to novel counterfactual notions
of fairness. We discuss easy-to-implement estimators of the policy function, by casting
the optimization into a mixed-integer linear program formulation. We derive regret
bounds on the unfairness of the estimated policy function, and small sample guaran-
tees on the Pareto frontier. Finally, we illustrate our method using an application from
education economics.

Keywords: Causal Inference, Welfare Programs, Pareto Optimal, Treatment Rules.


We thank James Fowler, Yixiao Sun and Kaspar Wüthrich for comments and discussion. All mistakes
are our own.

Department of Economics, University of California at San Diego, La Jolla, CA, 92093. Email: dviviano@ucsd.edu.

Department of Mathematics and Halicioğlu Data Science Institute, University of California at San
Diego, La Jolla, CA, 92093. Email: jbradic@ucsd.edu.

1 Introduction
Heterogeneity in treatment effects, widely documented in social sciences, motivates treat-
ment allocation rules that assign treatments to individuals differently, based on observable
characteristics (Manski, 2004; Murphy, 2003). However, targeting individuals may induce
“disparities” across sensitive attributes, such as age, gender, or race. Motivated by ev-
idence for preferences of policymakers towards non-discriminatory actions (Cowgill and
Tucker, 2019), this paper designs fair and efficient targeting rules for applications in social
welfare programs. We construct treatment allocation rules using data from experiments or
quasi-experiments, and we develop new notions of optimality and counterfactual fairness
for the construction of fair policies.
“Fair targeting” is a controversial task, due to the lack of consensus on the formulation
of the decision problem and the definition of fairness. Conventional approaches consist in
designing algorithmic decisions that maximize the expected utility across all individuals,
by imposing some form of axiomatic “fairness constraints” on the decision space of the
policymaker; this is usually imposed on the prediction of the algorithm.1 However, most
of these approaches ignore the welfare effects of such constraints (Kleinberg et al., 2018).
In particular, fairness constraints imposed on the decision space of the policymaker may
ultimately lead to sub-optimal welfare for both sensitive groups. This is a crucial limitation
when policymakers are concerned with the effects of their decisions on the utilities of each
individual: we may not want to impose unnecessary constraints on the policy if such
constraints are harmful for some or all individuals.
Motivated by these considerations, this paper advocates for Pareto optimal treatment
rules: decision-makers must prefer allocations for which we cannot find any other policy
that strictly improves welfare for one of the two sensitive groups, without further decreas-
ing welfare on the opposite group.2 We propose to select the fairest allocation within the
Pareto frontier. Our approach is motivated by the intuitive notion of “first do no harm”:
instead of imposing possibly harmful fairness constraints on the decision space, we
restrict the set of admissible solutions to the Pareto optimal set, and among such, we
choose the fairest one. We allow for general notions of fairness, and we also propose a
notion of fairness for treatment allocation rules based on envy-freeness (Foley, 1967; Var-
ian, 1976), making a clear connection between classical microeconomic literature and the
counterfactual interpretations of fairness. Our framework encompasses several applications
in economics, including development programs (De Ree et al., 2017), educational programs
(Muralidharan et al., 2019), health programs (Dupas, 2014), cash-transfer programs (Egger
et al., 2019), among others. We name our method Fair Targeting.
1 For a review, the reader may refer to Corbett-Davies and Goel (2018). Further discussion on the related
literature is contained in Section 1.2.
2 Pareto optimality has been often associated with the notion of fairness in economic theory (Varian,
1976), and it finds intuitive justifications under the rational preferences of the decision-maker (Noghin,
2006).

Figure 1: Data from Lyons and Zhang (2017). The points on the line denote Pareto optimal
allocations. Points below the line denote Pareto dominated allocations. The
colors refer to two classes of policies: unrestricted (red) and restricted by fairness constraints
(blue). The y-axis denotes the welfare of male students and the x-axis the welfare of female students.
From the figure, we can observe that imposing fairness constraints may result in harmful
allocations for both groups of individuals.

To gain further intuition on the advantages of the proposed approach, using student
award data of Lyons and Zhang (2017), we plot the Pareto frontier for two nested classes
of linear decision rules in Figure 1. Details on the definition of welfare and the data
can be found in Section 5. In the figure, the red line refers to the unconstrained linear
decision rule, the blue line constrains the welfare of female students to be larger than the
welfare of male students, and it does not allow the use of the gender attribute in the decision
process. Points denote the Pareto optimal allocations. The figure showcases that imposing
constraints directly on the decision rules leads to allocations that are always dominated
by some other treatment allocation in an unconstrained environment. The blue line yields
a harmful allocation for both types of individuals when compared to the three red dots
in the figure. The proposed method instead estimates the Pareto optimal allocation in the
least constrained environment (e.g., the red line), and among such, it chooses the “fairest”
one.
In summary, this paper contributes to the statistical treatment choice literature in two
directions: (i) we introduce the notion of Pareto optimal treatment allocation rules under
general notions of fairness, propose estimation procedures, and study their properties; we
justify such an approach within a decision-theoretic framework; (ii) we adapt notions of envy-freeness
to treatment rules and study statistical properties of the estimated policy. We now discuss

each of these contributions in detail.
The decision problem consists of lexicographic preferences of the policymaker of the
following form: (i) each Pareto optimal allocation strictly dominates allocations that are not
Pareto optimal; (ii) a Pareto optimal allocation is strictly preferred to any other allocation
if it satisfies additional fairness requirements. We identify the Pareto frontier as the set
of maximizers over any weighted average of the welfares of each group. Therefore, such
an approach embeds as a special case maximizing a weighted combination of welfares of
each sensitive group (Kitagawa and Tetenov, 2018; Rambachan et al., 2020). However,
the weights are not chosen a-priori, but instead, they are part of the decision problem,
and directly selected to maximize fairness. This has important practical implications: the
procedure is solely based on the notion of fairness adopted by the social planner, and it
does not require specifying importance weights assigned to each sensitive group. These
weights would be extremely hard to justify to the general public.3
Estimation of the set of Pareto optimal allocations represents a key challenge since
infinitely many policies may populate it. In fact, (i) the set consists of maximizers over
a continuum of weights between zero and one; (ii) each maximizer of the welfare (or a
weighted combination of welfares) is often not unique (Elliott and Lieli, 2013). To over-
come these issues, we showcase that the Pareto frontier can be approximated using simple
linear constraints on the policy space. Such a result drastically simplifies the optimiza-
tion algorithm: instead of estimating the entire set of Pareto allocations, we propose to
maximize fairness of the policy function under easy-to-implement linear constraints. Our
approach consists of a novel a-priori multi-objective formulation (Deb, 2014). Differently
from previous literature, we use a discretization argument, and we evaluate weighted com-
binations of the objective functions separately to construct a polyhedron that contains
Pareto allocations. We provide theoretical guarantees on our approach, and we show-
case that the distance between the Pareto frontier obtained via linear constraints and its
population counterpart converges uniformly to zero at rate $1/\sqrt{n}$.
The choice of the allocation within the Pareto set depends on the notion of fairness
adopted by the social planner. Inspired by the economic concept of envy-freeness, we
propose a novel definition of fairness, which may be of independent interest. We
interpret the policy assignment under the lens of the classical notion of “bundles” defined
in microeconomics. For each policy function, we compare expected utilities that depend
on its distribution under the same and opposite sensitive attribute. The proposed notion
of fairness makes a connection to previous notions of counterfactual fairness (Chiappa,
2019; Coston et al., 2020; Kilbertus et al., 2017; Kusner et al., 2019; Nabi and Shpitser,
2018), whereas, differently from previous references, (i) we provide formal justification to
fairness using an envy-freeness argument; (ii) we construct the definition of fairness based
on distributional impact of the treatment allocation rule on the welfare.
3 See, for example, also the discussion on relative importance weights contained in Kasy and Abebe (2020).

In our theoretical analysis, we also study “regret unfairness”. This denotes the differ-
ence between the expected unfairness achieved by the estimated policy function, against
the minimal possible unfairness achieved by any Pareto optimal allocation. We showcase
that such a difference converges to zero uniformly as the sample size grows, with the rate depend-
ing on the convergence rate of the estimated conditional mean of potential outcomes. All
our results also extend to different notions of fairness that the planner may instead prefer
to adopt. Finally, we conclude with an application that showcases the advantages of the
proposed method compared to alternatives that ignore Pareto optimality.
The remainder of this paper is organized as follows: we provide a brief overview of the
related literature in the following paragraph; we introduce the setting and main notation in
Section 2, and discuss definitions and motivations for Pareto optimal allocations and fair-
ness; Section 3 contains new Fair Targeting algorithms; Section 4 contains the theoretical
analysis; Section 5 discusses an empirical application, and Section 6 concludes. Complete
formal derivations are relegated to the Appendix.

1.1 Related Literature


This paper relates to a growing literature on statistical treatment rules (Armstrong and
Shen, 2015; Athey and Wager, 2017; Bhattacharya and Dupas, 2012; Dehejia, 2005; Hirano
and Porter, 2009; Kitagawa and Tetenov, 2018, 2019; Mbakop and Tabord-Meehan, 2016;
Stoye, 2012; Tetenov, 2012; Viviano, 2019; Zhou et al., 2018). Further research on op-
timal treatment allocations also includes estimation of individualized optimal treatments
via residuals weighting (Zhou et al., 2017), penalized methods (Qian and Murphy, 2011),
inference on the welfare for optimal treatment strategies (Andrews et al., 2019; Labe et al.,
2014; Luedtke and Van Der Laan, 2016; Rai, 2018), doubly robust methods for treatment
allocations (Dudik et al., 2011; Zhang et al., 2012), reinforcement learning (Kallus, 2017;
Lu et al., 2018), and dynamic treatment regimes (Murphy, 2003; Nie et al., 2019). Further
connections are also related to the literature on classification, which includes Boucheron
et al. (2005), and Elliott and Lieli (2013) among others. However, none of these discuss
the design of fair decisions.
Fairness is a rising concern in economic applications, see, e.g., Cowgill and Tucker
(2019), Kleinberg et al. (2018), Rambachan et al. (2020), Rambachan and Roth (2019). The
authors provide fundamental economic insights on the characteristics of optimal decision
rules in the presence of discrimination bias. Here we answer the different question of the
design and estimation of the optimal targeting rule, and we propose a statistical framework
and derive properties of the method. In addition, the decision problem also differs since
here we advocate for a multi-objective, instead of a single-objective, utility function, as in
previous references. Finally, Kasy and Abebe (2020) provide comparative statics on the
impact of fairness on the welfare of individuals, and they discuss the analysis of algorithms
for equity considerations.
In contrast, in computer science, Pareto optimality has been considered in the con-
text of binary predictions by Balashankar et al. (2019) and Martinez et al. (2019), where
the authors propose semi-heuristic and computationally intensive procedures for estimat-
ing Pareto efficient classifiers. Finally, Xiao et al. (2017) discuss the different problem of
estimation of a Pareto allocation that trades off fairness and individual utilities for rec-
ommender systems, where the relative importance weights of the different objectives are
selected a-priori. The above references do not address the problem of policy targeting
discussed in the current paper.
Additional references in computer science on fairness include Chouldechova (2017),
Corbett-Davies et al. (2017), Dwork et al. (2012), Hardt et al. (2016), Feldman et al. (2015),
Kleinberg et al. (2016), among others. For a review and comparisons, the reader may refer
to Corbett-Davies and Goel (2018), and Mitchell et al. (2018). The notions of fairness in
these references differ from ours in the lack of a valid causal framework. For instance, Zemel
et al. (2013) defines fairness based on statistical independence of a classifier with respect
to the sensitive attribute, whereas Donini et al. (2018) defines a classifier as fair if the error of
such a classifier is similar across the sensitive groups. Hossain et al. (2020) and references
therein propose envy-freeness notions of fairness for applications in classification. However,
such notions do not account for covariates across different sensitive groups being drawn
from potentially different distributions, which instead justifies counterfactual
notions of fairness. Fair classification has also been studied from different angles in
Liu et al. (2017), who discuss stochastic fair bandits, and Ustun et al. (2019), who propose
decoupled estimation of tree classifiers. A related strand of literature discusses estimation
of decision rules under fairness constraints within a causal framework (Chiappa, 2019;
Coston et al., 2020; Kilbertus et al., 2017; Kusner et al., 2019; Nabi et al., 2019; Nabi and
Shpitser, 2018). All such references discuss the problem of maximizing some given objective
function under fairness constraints. This may result in Pareto dominated allocations, and,
therefore, possibly harmful policies for both disadvantaged as well as advantaged groups
(see Figure 1).

2 Social Planner and Fairness


In this section we discuss the main framework and notation, the decision problem under
consideration and counterfactual notions of fairness.

2.1 Causal Framework


For each unit, we denote by Si ∈ S a sensitive or protected attribute that a social planner
wishes to be cautious about. For expositional convenience only, we let S = {0, 1}, where
Si = 1 is considered to be a “disadvantaged” or minority group. Extensions to multiple
groups follow similarly to what is discussed in the current section. Each individual is also
characterized by a vector of its characteristics, Xi ∈ X ⊆ Rp . Such characteristics are

assumed to be drawn conditional on the realization of the sensitive attribute, i.e.,

Xi = Si Xi (1) + (1 − Si )Xi (0), (1)

where Xi (0), Xi (1) denote the potential covariates, i.e., the covariates in the state where the
attribute of individual i is equal to s. Note that for each individual we only observe either
Xi (1) or Xi (0) and never both.4 The participants of the study have either been enrolled
in the social welfare program or not. We denote their treatment status with Di ∈ {0, 1},
where Di = 1 denotes that individual i has been enrolled in the social program. We define
their post-treatment outcomes with Yi ∈ Y ⊆ R realized only once the sensitive attribute,
covariates and the treatment assignment are realized.
Following the Neyman-Rubin potential outcome framework (Imbens and Rubin, 2015),
we define Yi (d, s), d ∈ {0, 1}, i.e.

Yi (0, 0), Yi (0, 1), Yi (1, 0), Yi (1, 1),

the potential outcomes that would have been observed respectively under treatment zero
and one, for the protected attribute or the opposite one. Note that we only get to observe
one of the four potential outcomes. Throughout the paper, the observed Yi satisfies the
Stable Unit Treatment Value Assumption (SUTVA) (Rubin, 1990), with

Yi = Di Si Yi (1, 1) + Di (1 − Si )Yi (1, 0) + (1 − Di )Si Yi (0, 1) + (1 − Di )(1 − Si )Yi (0, 0). (2)

We assume that the vector of potential outcomes, covariates, sensitive attributes and treat-
ment assignments, (Yi , Di , Si , Xi ) are identically and independently distributed (i.i.d.). We
denote with EX(s) , the expectation with respect to the distribution of the potential covari-
ates X(s) only.
Next, we discuss modeling assumptions. The first condition is the unconfoundedness
assumption.

Assumption 2.1 (Unconfoundedness). Let the following two conditions hold: ∀s ∈ {0, 1}, d ∈
{0, 1}, (A) Yi (d, s) ⊥ (Di , Si )|Xi (s); (B) Xi (s) ⊥ Si .

Condition (A) states that the treatments are randomized based on covariates, indepen-
dently of the potential outcomes (Imbens, 2004). Similarly, Condition (B) assumes that the
sensitive attribute is randomized independently of Xi (s). We remark that such a condition
allows for the dependence of the observed covariates and outcomes on the sensitive at-
tribute; this condition holds when, for instance, Si is either randomized before everything
else is observed, or Si denotes some immutable characteristic that cannot change given
intervention on the other variables.5 The randomization assumption on Si is common in
4 Throughout our discussion we implicitly assume that the support of Xi (1) and Xi (0) is the same,
namely X .
5 I.e., a do-operation in the notation of Pearl (2009).

Figure 2: Example of a directed acyclic graph over the variables Si , Xi , Di , Yi , under which Assumption 2.1 holds.

mediation analysis under the name of conditional ignorability (Imai et al., 2010). Figure
2 displays a directed acyclic graph under which Assumption 2.1 holds. In the figure, the
variable Xi acts as a mediator of the effect of changing the sensitive attribute. Assumption
2.1 can be equivalently stated after also conditioning on additional baseline characteristics
that do not depend on the sensitive attribute s, such as, for instance, age. Namely, we may
consider an additional set of covariates Zi , and assume that Xi (s) ⊥ Si |Zi . We omit such
an extension for the sake of clarity only. Finally, we define

e(x, s) = P (Di = 1|Xi = x, Si = s), ps = P (Si = s)

as the propensity score and the probability of individual i being assigned to the disad-
vantaged group, respectively. The following condition is the overlap assumption, which is
common in the causal inference literature.
Assumption 2.2 (Overlap). Let e(Xi , s), ps ∈ (δ, 1 − δ), almost surely, for δ ∈ (0, 1), for
all s ∈ {0, 1}.

2.2 Social Welfare


We design a policy function to maximize social welfare. This paper takes a utilitarian
perspective of social welfare (Manski, 2004), whereas, with fairness in mind, our focus is
shifted from an aggregate perspective to a “group perspective”, with special focus on the
“protected” group.
We now formalize the problem. Given observables, (Yi , Xi , Di , Si ) we seek to design a
policy function or equivalently a treatment assignment

π : X × S → {0, 1}, π ∈ Π

that depends on the individual characteristics and protected attribute alone. That is, a
policy is a deterministic treatment assignment, which depends on covariates Xi , and on

status Si only. Here, Π denotes the function class of interest for the treatment assignment
π. Such a class may be unconstrained, or it may reflect economic or legal constraints that
restrict the decision space of the policy-maker.
For a given policy π and a sensitive attribute s, the social welfare for such a group of
individuals is defined as follows
\[ W_s(\pi) = \mathbb{E}\big[\, Y_i(1, s)\,\pi(X_i(s), s) + Y_i(0, s)\,(1 - \pi(X_i(s), s)) \,\big]. \tag{3} \]

The above expression denotes the welfare within the group of individuals with sensi-
tive attribute s. Under the standard properties of the propensity score (Imbens, 2000),
identification of the social welfare within each group can be easily expressed using inverse-
probability type estimators (Horvitz and Thompson, 1952). The reader may refer to the
Appendix for further discussion.
Under the standard utilitarian perspective (Manski, 2004), where welfare is aggregated
across all the individuals, the estimand of interest solves
\[ \arg\max_{\pi \in \Pi} \; \big\{ p_1 W_1(\pi) + (1 - p_1) W_0(\pi) \big\} \]

where p1 is the fraction of the population for which Si = 1. However, we observe that
whenever the sensitive group is a minority group, such an approach assigns a small weight
to the welfare of such individuals, disproportionally favoring the majority group.
An alternative approach is to maximize the welfare for each possible sensitive group
(Ustun et al., 2019). However, in such a case, the resulting policy function may violate
the constraint in Π, e.g., violating anti-discriminatory laws. A simple example is when the
policy π(x, s) is constrained to be constant in the sensitive attribute s.
Motivated by this consideration, we move away from a single-objective function to a
multi -objective decision problem: the social planner aims to maximize the welfare in each
sensitive attribute under the legal or economic constraints. We propose a notion of Pareto
optimality as a suitable notion of efficiency as it formalizes the trade-off between different,
possibly contradicting, objectives.6 The planner then chooses within the set of Pareto
allocations the least unfair policy.

2.3 Pareto Principle for Treatment Rules


Pareto optimality is driven by the idea that allocations are efficient when no one can be
made better off without making somebody else worse off. Therefore, the notion of Pareto
optimal policies is particularly suitable for fairness considerations (Varian, 1976).
Definition 2.1 (Pareto optimality). A policy function π1 ∈ Π is called Pareto optimal
with respect to the protected group s if there is no other policy π2 ∈ Π such that
\[ W_s(\pi_1) \le W_s(\pi_2) \ \text{ for all } s \in \{0, 1\}, \qquad \text{while simultaneously} \qquad W_s(\pi_1) < W_s(\pi_2) \ \text{ for some } s \in \{0, 1\}. \]
6 A simple example where contradictory objectives take place is when treatment effects are heterogeneous
in covariates Xi , with a positive effect over such covariates for one group, and a negative effect for the other.
We denote with Πo the set of all π ∈ Π that are Pareto optimal.
The set of Pareto optimal choices contains all such allocations for which the welfare for
one of the two groups cannot be improved without reducing the welfare for the opposite
group. The following lemma characterizes the set of Pareto optimal allocations.
Lemma 2.1 (Pareto Frontier). The set Πo ⊆ Π is such that
\[ \Pi_o = \Big\{ \pi_\alpha : \pi_\alpha \in \arg\sup_{\pi \in \Pi} \ \alpha W_1(\pi) + (1 - \alpha) W_0(\pi), \ \alpha \in (0, 1) \Big\}. \tag{4} \]

The lemma is a trivial application of results of Negishi (1960), applied to the func-
tional of interest. For the sake of completeness, the proof is included in the Appendix.
Lemma 2.1 provides an intuitive interpretation: Pareto optimal allocations consist of poli-
cies that maximize a weighted combination of the protected group welfares. Throughout
our discussion we define

\[ \bar{W}_\alpha = \sup_{\pi \in \Pi} \; \alpha W_1(\pi) + (1 - \alpha) W_0(\pi), \]
which is the largest objective after weighting the welfares of each group by α and 1 − α.
The set of Pareto allocations generalizes notions of optimal treatment rules discussed
in previous literature. Simple examples may illustrate this point.
Example 2.1 (Utilitarian Welfare Maximization). Notice that the population equivalent
of the Empirical Welfare Maximization (EWM) rule belongs to the Pareto frontier. Namely,
\[ \arg\max_{\pi \in \Pi} \; \big\{ p_1 W_1(\pi) + (1 - p_1) W_0(\pi) \big\} \subseteq \Pi_o, \]
since the above expression maximizes a weighted average of welfares, with weight α =
p1 . An alternative approach consists in choosing appropriately the weights in the above
expression to reflect the relative importance of each group to the social planner. For
instance, we may consider maximizing (Rambachan et al., 2020)
\[ \arg\max_{\pi \in \Pi} \; \big\{ \omega W_1(\pi) + (1 - \omega) W_0(\pi) \big\} \subseteq \Pi_o \]
for some specific weight ω. Clearly, such an allocation belongs to the Pareto frontier.
Example 2.2 (Unrestricted Policy Function Class). Whenever the set Π is unrestricted,
the Pareto optimal allocation equals the set of policies that lead to the largest possible
welfare within each group. Namely, such a set denotes all the policies whose welfare equals
the welfare under the first best policy \( \pi_{fb}(x, s) = 1\{ m_{1,s}(x) - m_{0,s}(x) \ge 0 \} \),
where \( m_{d,s}(x) = \mathbb{E}[Y_i(d, s) \mid X_i(s) = x] \) is the conditional mean defined in Section 3.1.7
7 Formally, the set reads as follows: \( \big\{ \pi : W_s(\pi) = \mathbb{E}\big[ (Y_i(1, s) - Y_i(0, s))\, \pi_{fb}(X_i(s), s) \big], \ s \in \{0, 1\} \big\} \).

Remark 1 (Ancillary Fairness Restrictions on Π). Complementary restrictions may also
be considered within the above framework. Following the discussion in Section 1, one such
example of constraint is predictive parity (Kasy and Abebe, 2020), which in such a setting
is interpreted as follows:8
\[ \pi : \mathbb{E}\big[ Y_i(1, 1) - Y_i(0, 1) \,\big|\, \pi(X_i, S_i) = 1, S_i = 1 \big] = \mathbb{E}\big[ Y_i(1, 0) - Y_i(0, 0) \,\big|\, \pi(X_i, S_i) = 1, S_i = 0 \big]. \]

These additional restrictions can be directly incorporated in the function class Π. However,
whereas we see legal or economic restrictions often necessary in the construction of Π,
additional restrictions imposed directly by the researcher and not by the policy-maker may
induce sub-optimal allocations for both of the sensitive groups. In fact, observe that for
two function classes Π and Π′, where Π′ ⊆ Π, the following is true:
\[ \sup_{\pi \in \Pi' \subseteq \Pi} \big\{ \alpha W_0(\pi) + (1 - \alpha) W_1(\pi) \big\} \le \sup_{\pi \in \Pi} \big\{ \alpha W_0(\pi) + (1 - \alpha) W_1(\pi) \big\}. \tag{5} \]

Therefore, the Pareto frontier of the more restricted function class Π′ is always weakly dom-
inated by the Pareto frontier under weaker constraints.

Pareto optimal allocations are often non-unique, allowing for flexibility in the choice of
efficient policies. If there is a range of alternative policies that assign different weights to
different groups, the social planner must appeal to some principle of preferential ranking.

2.4 Decision Problem


We discuss here the decision problem of the social planner. In full generality we define

UnFairness : Π → R

a functional that maps from a given policy function to the real line, and that quantifies
the level of the unfairness of a given policy. We leave unspecified such a functional and
provide examples in the following paragraphs. Given a set of available policy functions Π,
we define C(Π) the choice set of the decision-maker (Fudenberg and Tirole, 1991). The
function C denotes a choice function, where C({π1 , π2 }) = π1 is interpreted as π1 being
strictly preferred to π2 .

Definition 2.2 (Planner’s Preferences). Planner’s preferences satisfy the following three
conditions:

(i) C({π1 , π2 }) = π1 if W1 (π1 ) ≥ W1 (π2 ) and W0 (π1 ) ≥ W0 (π2 ) and either (or both) of
the two inequalities hold strictly;
8 Note that this is not the only definition that can be attributed to predictive parity, and also alternative
notions may be considered.
(ii) for each π1 , π2 ∈ Πo , C({π1 , π2 }) = π1 if UnFairness(π1 ) < UnFairness(π2 );

(iii) C({π1 , π2 }) = {π1 , π2 } if π1 , π2 ∈ Πo and UnFairness(π1 ) = UnFairness(π2 ).

Definition 2.2 postulates that the set of optimal actions of the decision-maker is a
subset of the Pareto optimal set. The above definition also states that within the set of
Pareto optimal allocations, the planner strictly prefers a given policy based on fairness
considerations. Intuitively, the above definition models preferences lexicographically: (i)
each allocation is strictly preferred to the ones for which welfare on both groups is strictly
lower; (ii) allocations are then ranked based on unfairness considerations. Under Definition
2.2, the decision problem takes the following form.

Lemma 2.2 (Social Planner’s Decision Problem). Under rational preferences, with pref-
erences as discussed in Definition 2.2, the following holds: each π⋆ ∈ C(Π) is such that π⋆ solves
\[ \pi^\star \in \arg\inf_{\pi \in \Pi} \ \mathrm{UnFairness}(\pi) \quad \text{subject to} \quad \alpha W_1(\pi) + (1 - \alpha) W_0(\pi) \ge \bar{W}_\alpha \ \text{ for some } \alpha \in (0, 1). \tag{6} \]
Lemma 2.2 provides a formal characterization of the social planner decision problem,
and the proof is contained in the Appendix. The problem consists in minimizing the
unfairness criterion of the policy function, under the condition that the policy is Pareto
optimal. The social planner does not maximize a weighted combination of welfares, with
some pre-specified and hard-to-justify weighting scheme. Instead, the relative importance
of each group (i.e., α) is implicitly chosen within the optimization problem to minimize the
unfairness criterion. Such an approach allows for a transparent choice of the policy based
on the definition of fairness adopted by the social planner.

Remark 2 (Allowing for Exploration). One key question is whether we may loosen the
constraint on Pareto optimality, to substantially decrease the unfairness of the estimated
policy. For instance, the policy-maker may be willing to find a policy that is only “approx-
imately” Pareto optimal, i.e., whose resulting welfares are within the orange regions, as
long as it minimizes the unfairness criterion. Such an approach can be easily implemented
by imposing slacker constraints on social welfare than the ones in Equation (6). Simple
examples are constraints of the form αW1 (π) + (1 − α)W0 (π) ≥ W̄α − ε for some ε > 0.

2.5 Counterfactual Envy-Freeness


Next, we propose a notion of fairness. However, we remark that the proposed method
accommodates alternative notions of fairness that the policy-maker may prefer to adopt.
To define fairness, we make a connection between counterfactual notions of fairness
and the economic notions of envy (Foley, 1967). Envy refers to the concept that “an
allocation is equitable if and only if no agent prefers another agent’s bundle to his own”
(Varian, 1976). We build on such a notion, to construct the measure of unfairness, where,

in the context under consideration, “bundles” assigned to individuals refer to the treatment
assignment π(Xi , Si ), and utilities depend on the distribution of such assignments.
We start our discussion with the following definition:
\[ V_{\pi(x, s_1)}(x, s_2) = \mathbb{E}\big[\, \pi(x, s_1) Y_i(1, s_2) + (1 - \pi(x, s_1)) Y_i(0, s_2) \,\big|\, X_i(s_2) = x \,\big]. \tag{7} \]

The above expression denotes the conditional welfare, for the policy function being assigned
to the opposite attribute. Namely, it defines the generated effect of π(x, s1 ), for type s2 ,
conditional on the value of the covariates.
We say that the agent with attribute s2 envies the agent with attribute s1 , if
\[ \mathbb{E}_{X(s_1)}\Big[ V_{\pi(X(s_1), s_1)}\big( X(s_1), s_2 \big) \Big] > W_{s_2}(\pi). \tag{8} \]

Intuitively, the agent, with attribute s2 , envies the one with opposite attribute if, after
changing her distribution of covariates and treatment assignment to be of the opposite
attribute, the resulting welfare is strictly larger than her current welfare.
Motivated by the above definition, we measure the unfairness towards an individual
with attribute s2 as
\[ A(s_1, s_2; \pi) = \mathbb{E}\Big[ V_{\pi(X(s_1), s_1)}\big(X(s_1), s_2\big) \Big] - W_{s_2}(\pi). \tag{9} \]

Here Ws2 (π) is as defined in Equation (3). Observe that the above measure denotes the
difference between the welfare of group s2 under respectively the opposite and same dis-
tribution of covariates and treatment assignments. The above notion of fairness showcases
the advantage of embedding the problem within a causal inference framework: we compare
utilities, by letting covariates and treatment assignments be drawn from the distribution
under the opposite sensitive attribute.
Since we aim not to discriminate in either direction (women with respect to men
and vice-versa), we define unfairness by taking the sum of the effects A(s1 , s2 ; π) and
A(s2 , s1 ; π). Such an approach builds on the notion of “social envy” discussed in Feldman
and Kirman (1974), where envy is defined as a weighted combination of envy of each group.
Namely, we propose the following notion of unfairness

UnFairness(π) = A(1, 0; π) + A(0, 1; π). (10)

Note that the above definition induces a partial ordering in the set of Pareto optimal alloca-
tions.
Motivated by the above definition, we propose an estimand which finds the fairest
allocation within the set of Pareto optimal policies.
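To fix ideas, the following is a small Monte Carlo sketch of the counterfactual envy measure and of the unfairness criterion in Equation (10). The data-generating process (Gaussian potential covariates, linear conditional means) and the example policy are illustrative assumptions of this sketch, not specifications taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data-generating process (an assumption, not from the paper):
# X(s) ~ N(mu_s, 1) and E[Y(d, s) | X(s) = x] = beta_d + gamma_s * x.
mu = {0: 0.0, 1: 1.0}
beta = {0: 0.0, 1: 0.5}       # treatment effect component
gamma = {0: 1.0, 1: 0.5}      # slope differs across sensitive groups

def m(d, s, x):
    # Conditional mean m_{d,s}(x) = E[Y_i(d, s) | X_i(s) = x].
    return beta[d] + gamma[s] * x

def pi(x, s):
    # Example policy: treat whenever the covariate is positive.
    return (x > 0).astype(float)

def welfare(s, n=200_000):
    # W_s(pi) of Equation (3), approximated by Monte Carlo.
    x = rng.normal(mu[s], 1.0, size=n)
    p = pi(x, s)
    return np.mean(m(1, s, x) * p + m(0, s, x) * (1 - p))

def envy(s1, s2, n=200_000):
    # A(s1, s2; pi) of Equation (9): welfare of group s2 when covariates and
    # assignments are drawn as if it had attribute s1, minus its own welfare.
    x = rng.normal(mu[s1], 1.0, size=n)   # X(s1)
    p = pi(x, s1)                          # pi(X(s1), s1)
    v = np.mean(m(1, s2, x) * p + m(0, s2, x) * (1 - p))
    return v - welfare(s2)

unfairness = envy(1, 0) + envy(0, 1)       # Equation (10)
print(round(unfairness, 3))
```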

3 Fair Targeting
In this section, we discuss the estimation and optimization problem, namely, we aim to
construct an estimator of π ? in Equation (6). We design a plug-in estimator, where the
population quantities are replaced with particularly designed sample equivalents. By letting
An and Π̂o denote the sample approximations of the population quantities A and Πo , which
are discussed in the following paragraphs, our Fair Targeting method solves
\[ \hat{\pi} \in \arg\inf_{\pi \in \hat{\Pi}_o} \big\{ A_n(1, 0; \pi) + A_n(0, 1; \pi) \big\}. \tag{11} \]

One of the key challenges induced by the estimation procedure is represented by the pres-
ence of a possibly infinite-dimensional set of Pareto optimal allocations. To overcome
this issue, we represent the Pareto frontier using a set of linear constraints on the policy
function, without directly constructing the entire set of Pareto optimal policies. Such an
approach permits us to find efficient solutions without the need to exhaustively search over
the entire set of Pareto allocations.

3.1 Estimating Social Welfare and UnFairness


We start by introducing some necessary notation. We let
\[ m_{d,s}(x) = \mathbb{E}\big[ Y_i(d, s) \,\big|\, X_i(s) = x \big] = \mathbb{E}\big[ Y_i(d, s) \,\big|\, X_i = x, S_i = s \big] \]

be the conditional mean of the group s under treatment status d. We denote the doubly
robust score (Robins et al., 1994) as

\[ \Gamma_{d,s,i} = \frac{ 1\{S_i = s\}\, 1\{D_i = d\} }{ p_s\, P(D_i = d \mid X_i, S_i) } \big( Y_i - m_{d,s}(X_i) \big) + m_{d,s}(X_i) \tag{12} \]

and Γ̂d,s,i its estimated counterpart. We estimate the welfare as


\[ \hat{W}_s(\pi) = \frac{1}{n} \sum_{i=1}^{n} \hat{\Gamma}_{1,s,i}\, \pi(X_i, s) + \frac{1}{n} \sum_{i=1}^{n} \hat{\Gamma}_{0,s,i}\, \big(1 - \pi(X_i, s)\big). \tag{13} \]

Finally, we estimate A(·) in Equation (9) as follows:
\[ A_n(s, s'; \pi) = \frac{1}{n \hat{p}_s} \sum_{i: S_i = s} \Big\{ \hat{m}_{1,s'}(X_i)\, \pi(X_i, s) + \hat{m}_{0,s'}(X_i)\, \big(1 - \pi(X_i, s)\big) \Big\} - \hat{W}_{s'}(\pi). \tag{14} \]
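For concreteness, the following is a minimal sketch of the estimators in Equations (12)–(14). The array and dictionary names (y, d, s, m_hat, e_hat, p_hat) are assumptions of this sketch, and the nuisance predictions are assumed to be cross-fitted out-of-fold values, as discussed next.

```python
import numpy as np

def dr_scores(y, d, s, m_hat, e_hat, p_hat, group, treat):
    # Doubly robust score of Equation (12) for treatment level `treat` and group `group`.
    # m_hat[(treat, group)] and e_hat hold cross-fitted predictions of m_{d,s}(X_i)
    # and of the propensity score e(X_i, S_i), respectively.
    ind = ((s == group) & (d == treat)).astype(float)
    prop = e_hat if treat == 1 else 1.0 - e_hat      # P(D_i = treat | X_i, S_i)
    m = m_hat[(treat, group)]
    return ind / (p_hat[group] * prop) * (y - m) + m

def welfare_hat(pi_vals, gamma1, gamma0):
    # Empirical group welfare of Equation (13); pi_vals = pi(X_i, s) for all i.
    return np.mean(gamma1 * pi_vals + gamma0 * (1 - pi_vals))

def envy_hat(s_from, s_to, s, pi_from, m_hat, w_hat_to, p_hat):
    # Estimator A_n(s_from, s_to; pi) of Equation (14); pi_from = pi(X_i, s_from).
    n = len(s)
    mask = (s == s_from)
    lvl = m_hat[(1, s_to)] * pi_from + m_hat[(0, s_to)] * (1 - pi_from)
    return lvl[mask].sum() / (n * p_hat[s_from]) - w_hat_to

# Example usage for a candidate policy with assignments pi0 = pi(X_i, 0), pi1 = pi(X_i, 1):
# g = {(d_, s_): dr_scores(y, d, s, m_hat, e_hat, p_hat, group=s_, treat=d_)
#      for d_ in (0, 1) for s_ in (0, 1)}
# w0 = welfare_hat(pi0, g[(1, 0)], g[(0, 0)]); w1 = welfare_hat(pi1, g[(1, 1)], g[(0, 1)])
# unfairness = envy_hat(1, 0, s, pi1, m_hat, w0, p_hat) + envy_hat(0, 1, s, pi0, m_hat, w1, p_hat)
```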

The estimators of nuisances m̂d,s (.), ê(.), p̂s , are built upon causal inference and semi-
parametric literature (Bang and Robins, 2005; Hahn, 1998; Newey, 1990, 1994; Robins and
Rotnitzky, 1995). Each of these components is estimated via cross-fitting (Chernozhukov
et al., 2018; Chiang et al., 2019; Schick, 1986). One needs to carefully consider the protected
attribute in this procedure. Two alternative cross-fitting procedures are available to the
researcher. The first one consists in dividing the sample into K folds and estimating the
conditional mean \( \hat m^{(-k)}_{d,s}(X_i) \) using observations for which Si = s only, after excluding the
fold k corresponding to unit i.9 Such an approach does not impose parametric restrictions
on the dependence of md,s on the attribute s, at the expense of shrinking the effective sam-
ple size used for estimation. The second approach consists in further imposing additional
parametric restrictions on the dependence of md,s on s and using all observations in all folds
except k for estimating \( \hat m^{(-k)}_{d,s}(X_i) \). A graphical representation is depicted in Figure 3.

Figure 3: Graphical representation of cross-fitting under two alternative model formulations.
The light gray area is the training set, used to construct an estimator of m̂d,s=1, whereas the
darker gray area is an evaluation set, in which a prediction of m̂d,s=1 is computed. The panel
on the left refers to the case where no parametric assumptions are imposed on the dependence
of md,s on s. The panel on the right uses both groups’ information to estimate the conditional
mean function by further imposing additional parametric assumptions on the dependence of
md,s on s. [Each panel shows the two groups S = 0 and S = 1, with the held-out fold Ik marked.]
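A sketch of the first (fully nonparametric) cross-fitting variant follows, assuming scikit-learn; the random forest is one possible choice for the regression step mentioned in footnote 9, not a prescription of the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def crossfit_mu(y, d, s, x, treat, group, n_folds=5, seed=0):
    # Cross-fitted predictions of m_{d,s}(x): the regression is fit only on
    # observations with D = treat and S = group that lie outside the evaluation
    # fold, and predictions are produced for every unit in that fold.
    n = len(y)
    out = np.full(n, np.nan)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, eval_idx in folds.split(x):
        fit_mask = np.zeros(n, dtype=bool)
        fit_mask[train_idx] = True
        fit_mask &= (d == treat) & (s == group)      # group s, treatment d only
        model = RandomForestRegressor(n_estimators=200, random_state=seed)
        model.fit(x[fit_mask], y[fit_mask])
        out[eval_idx] = model.predict(x[eval_idx])   # out-of-fold predictions
    return out

# m_hat[(d_, s_)] for the welfare and envy estimators above can then be built as:
# m_hat = {(d_, s_): crossfit_mu(y, d, s, x, d_, s_) for d_ in (0, 1) for s_ in (0, 1)}
```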

3.2 Pareto Frontier via Linear Inequalities


Next, we characterize the Pareto frontier using linear inequalities. We define the empirical
counterpart of Equation (4) as follows
\[ \Pi_{o,n} = \Big\{ \pi_\alpha \in \Pi : \pi_\alpha \in \arg\sup_{\pi \in \Pi} \big\{ \alpha \hat{W}_0(\pi) + (1 - \alpha) \hat{W}_1(\pi) \big\}, \ \text{s.t. } \alpha \in (0, 1) \Big\}. \tag{15} \]

Clearly, such a set cannot be directly estimated since it contains solutions to uncountably
many optimization problems. Therefore, as a first step, we discretize such a set, and we
construct a grid of equally spaced values αj ∈ (0, 1), j ∈ {1, ..., N },

{α1 , ..., αN }
9 Formally, let i ∈ Ik ∩ S1 , where Ik is the k-th fold of the data and S1 = {i : Si = s1 }. Let \( \hat m^{(-k)}_{d,s_2} \) be an
estimator obtained using samples not in the fold k, i.e., in \( I_k^c \cap S_1^c \), where \( S_1^c = \{i : S_i = s_2\} \); for example, by a
random forest or a linear regression of Yj onto Xj for Sj = s2 and j ∉ Ik .

where we choose N = √n. We approximate Πo,n using the discretized set
\[ \hat{\Pi}_o = \Big\{ \pi_\alpha \in \Pi : \pi_\alpha \in \arg\sup_{\pi \in \Pi} \big\{ \alpha \hat{W}_0(\pi) + (1 - \alpha) \hat{W}_1(\pi) \big\}, \ \text{s.t. } \alpha \in \{\alpha_1, ..., \alpha_N\} \Big\}. \tag{16} \]

The choice of the grid is arbitrary, as long as consecutive values are no more
than a small ε apart. A natural choice for the construction of the grid consists in starting
with αj = p̂1 and then choosing the remaining N − 1 values at a distance ε from each
other. Such an approach guarantees, for instance, that the solution of the EWM
method is a feasible solution under Fair Targeting.
Unfortunately, the set Π̂o may be hard, if not impossible, to directly estimate. In fact,
for each α we may have uncountably many solutions (Elliott and Lieli, 2013; Manski and
Thompson, 1989). Therefore, instead of directly estimating such a set, we characterize Π̂o
using simple linear constraints. The key idea consists in imposing that for some arbitrary
weight αj , the weighted combination of empirical welfare is “large enough” to guarantee
Pareto optimality. We formalize this approach in the following lines.
The first step consists in finding the maximum empirical welfare achieved on the “dis-
cretized” Pareto Frontier by finding
\[ \bar{W}_{j,n} = \sup_{\pi \in \Pi} \big\{ \alpha_j \hat{W}_0(\pi) + (1 - \alpha_j) \hat{W}_1(\pi) \big\}, \tag{17} \]

for each j ∈ {1, ..., N }. This can be easily solved using exact methods such as
mixed-integer programs (Kitagawa and Tetenov, 2018) or recursive search algorithms (Zhou
et al., 2018).10 To guarantee Pareto optimality it is sufficient to impose that a weighted
combination of the welfare is at least as large as W̄j,n , for some j ∈ {1, ..., N }.
Observe now that the set Π̂o , (16), can be characterized using the following represen-
tation:
\[ \forall \pi \in \hat{\Pi}_o \ \ \exists j \in \{1, ..., N\} \ \text{such that} \ \ \alpha_j \hat{W}_{0,n}(\pi) + (1 - \alpha_j) \hat{W}_{1,n}(\pi) \ge \bar{W}_{j,n}. \tag{18} \]
That is, Π̂o contains all policies π that achieve the maximum value of weighted welfares,
W̄j,n for some j. Therefore, whereas we may not be able to find all the policies in the set
Π̂o , we can still represent such a set using simple linear constraints on the estimated policy
function.
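As an illustration, the grid, the maximal weighted welfares of Equation (17), and the membership check of Equation (18) can be computed by brute force when the candidate policy class is finite; the finite candidate set is an assumption of this sketch only, whereas the paper solves (17) exactly via mixed-integer programming.

```python
import numpy as np

def pareto_constraints(w0_hat, w1_hat, policies, alphas):
    # `policies` is a finite candidate set (e.g., linear rules on a parameter grid);
    # w0_hat(pi) and w1_hat(pi) return the empirical group welfares of Equation (13).
    w0 = np.array([w0_hat(p) for p in policies])
    w1 = np.array([w1_hat(p) for p in policies])
    # Maximum weighted welfare for each grid point alpha_j, Equation (17).
    w_bar = np.array([np.max(a * w0 + (1 - a) * w1) for a in alphas])
    # A candidate is (approximately) Pareto optimal if it attains w_bar for some j,
    # which is the representation in Equation (18).
    keep = [
        any(a * w0[k] + (1 - a) * w1[k] >= wb - 1e-12 for a, wb in zip(alphas, w_bar))
        for k in range(len(policies))
    ]
    return [p for p, flag in zip(policies, keep) if flag], w_bar

# Example grid with N = ceil(sqrt(n)) equally spaced weights, including p1_hat:
# alphas = np.unique(np.append(np.linspace(0.05, 0.95, N), p1_hat))
```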

3.3 Mixed Integer Program Formulation


Given the set of linear inequalities, we aim to formulate the program as a mixed-integer
linear program, for a general class of policy functions admitting linear program formula-
tions (Bertsimas and Dunn, 2017; Florios and Skouras, 2008). To do so, we introduce an
10 Notice that previous references do not discuss estimation of the Pareto frontier, and we use their results
as a tool for computing W̄j,n only.

additional set of decision variables that guarantee that the constraints in Equation (18)
hold.
We represent the constraints on the policy space by introducing the variables

zs = (zs,1 , · · · , zs,n ), zs,i = π(Xi , s), π ∈ Π. (19)

These variables have simple representation for general classes of policy functions, such as
linear rules (see for example Chen and Lee (2018); Kitagawa and Tetenov (2018)). In
addition, we define a vector

u = (u1 , ..., uN ) ∈ {0, 1}N (20)

that encodes the locations on the grid of α for which the supremum in (18) is reached;
uj = 1 whenever the constraint in Equation (18) holds for αj . The chosen policy must be
Pareto optimal, i.e., uj must be equal to one for at least one j. To ensure this, we introduce
the simple constraint \( \sum_{j=1}^{N} u_j \ge 1 \).
To represent the Pareto optimality, following Equation (18), the following constraint
must be imposed
\[ u_j \alpha_j \big\langle \hat{\Gamma}_{1,0} - \hat{\Gamma}_{0,0},\, z_0 \big\rangle + u_j (1 - \alpha_j) \big\langle \hat{\Gamma}_{1,1} - \hat{\Gamma}_{0,1},\, z_1 \big\rangle + \big\langle \hat{\Gamma}_{0,1} + \hat{\Gamma}_{0,0},\, \mathbf{1} \big\rangle - u_j n \bar{W}_{j,n} \ge 0, \tag{21} \]
where \( \hat{\Gamma}_{d,s} = (\hat{\Gamma}_{d,s,1}, ..., \hat{\Gamma}_{d,s,n}) \) and \( \langle \cdot, \cdot \rangle \) denotes the vector (inner) product. The above constraint,
together with (20) and \( \sum_{j=1}^{N} u_j \ge 1 \), ensures that the assignment z1,i , z0,i is Pareto optimal.

Combining our results, we obtain the following quadratic program formulation:

\[
\begin{aligned}
\min_{z_0, z_1, u, \pi} \quad & A_n(1, 0; \pi) + A_n(0, 1; \pi) & & \qquad (22) \\
\text{subject to} \quad & z_{s,i} = \pi(X_i, s) & & \text{(A)} \\
& u_j \alpha_j \langle \hat{\Gamma}_{1,0} - \hat{\Gamma}_{0,0}, z_0 \rangle + u_j (1 - \alpha_j) \langle \hat{\Gamma}_{1,1} - \hat{\Gamma}_{0,1}, z_1 \rangle + \langle \hat{\Gamma}_{0,1} + \hat{\Gamma}_{0,0}, \mathbf{1} \rangle \ge u_j n \bar{W}_{j,n} & & \text{(B)} \\
& \langle \mathbf{1}, u \rangle \ge 1 & & \text{(C)} \\
& \pi \in \Pi & & \text{(D)} \\
& z_{0,i}, z_{1,i}, u_j \in \{0, 1\}, \quad 1 \le i \le n, \ 1 \le j \le N. & & \text{(E)}
\end{aligned}
\]

Remark 3 (Existence of Feasible Solutions). The optimization problem always admits a
feasible solution. In fact, observe that for each αj , there exists by definition a policy πj
such that αj Ŵ0 (πj ) + (1 − αj )Ŵ1 (πj ) = W̄j,n . Such a policy satisfies the constraints in the
optimization with uj = 1.
We formalize the validity of the above formulation in the following theorem.
We formalize the validity of the above formulation in the following theorem.

Theorem 3.1. The solution sets to Equation (22) and Equation (11) are equal.
Theorem 3.1 showcases that the optimization algorithm can be written as a mixed-
integer program with constraints that depend on the function class of interest. Such con-
straints read as mixed-integer linear constraints for several function classes such as optimal
trees and maximum score methods (Bertsimas and Dunn, 2017; Florios and Skouras, 2008).
In addition, the objective function is clearly linear in the decision variables. Constraints
(A), (C), (E) are mixed-integer linear constraints, whereas Constraint (B) is quadratic.
However, notice that we can further simplify (B) as a linear program at the expense of
introducing additional N n binary variables, and 2N n additional constraints (e.g., see Vi-
viano (2019); Wolsey and Nemhauser (1999)). We omit such an extension for the sake of
brevity only.
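To make the formulation concrete, the following is a minimal sketch of program (22) for a linear eligibility rule, assuming the PuLP modeling library with its bundled CBC solver. The inputs gamma_hat, m_hat, p_hat, alphas, and w_bar are assumed to be precomputed as in Sections 3.1–3.2, and Constraint (B) is enforced through the implication (18) with a big-M device, one standard linearization and not necessarily the authors' exact reformulation; exact mixed-integer programs of this kind scale poorly in n, so this is a sketch rather than a production implementation.

```python
import numpy as np
import pulp

def fair_targeting_mip(X, S, gamma_hat, m_hat, p_hat, alphas, w_bar,
                       big_m=100.0, eps=1e-6):
    # Sketch of program (22) for linear rules pi(x, s) = 1{b0 + b1*s + x'phi >= 0}
    # with coefficients normalized to [-1, 1]. gamma_hat[(d, s)] are the doubly
    # robust scores of (12), m_hat[(d, s)] the cross-fitted conditional means,
    # and w_bar[j] the maximal weighted welfare of (17) at alphas[j].
    n, p = X.shape
    prob = pulp.LpProblem("fair_targeting", pulp.LpMinimize)

    b0 = pulp.LpVariable("b0", -1, 1)
    b1 = pulp.LpVariable("b1", -1, 1)
    phi = [pulp.LpVariable(f"phi_{k}", -1, 1) for k in range(p)]
    z = {(s, i): pulp.LpVariable(f"z_{s}_{i}", cat="Binary")
         for s in (0, 1) for i in range(n)}
    u = [pulp.LpVariable(f"u_{j}", cat="Binary") for j in range(len(alphas))]

    # (A), (D), (E): z_{s,i} = 1{b0 + b1*s + X_i'phi >= 0}, big-M encoding with eps margin.
    for s in (0, 1):
        for i in range(n):
            g = b0 + b1 * s + pulp.lpSum(phi[k] * X[i, k] for k in range(p))
            c_i = 2.0 + float(np.abs(X[i]).sum())   # bound on |g| given |coef| <= 1
            prob += g <= c_i * z[(s, i)]
            prob += g >= c_i * z[(s, i)] - c_i + eps

    def w_hat(s):
        # Empirical group welfare, Equation (13): linear in z.
        g1, g0 = gamma_hat[(1, s)], gamma_hat[(0, s)]
        return pulp.lpSum((g1[i] - g0[i]) * z[(s, i)] for i in range(n)) * (1.0 / n) + g0.mean()

    def a_n(s_from, s_to):
        # Envy estimator, Equation (14): also linear in z.
        m1, m0 = m_hat[(1, s_to)], m_hat[(0, s_to)]
        idx = np.where(S == s_from)[0]
        lvl = pulp.lpSum((m1[i] - m0[i]) * z[(s_from, i)] + m0[i] for i in idx)
        return lvl * (1.0 / (n * p_hat[s_from])) - w_hat(s_to)

    # Objective of Equation (11): counterfactual unfairness.
    prob += a_n(1, 0) + a_n(0, 1)

    # (B)-(C): Pareto optimality via representation (18), active for at least one j.
    for j, a in enumerate(alphas):
        prob += w_hat(0) * a + w_hat(1) * (1 - a) - big_m * u[j] >= w_bar[j] - big_m
    prob += pulp.lpSum(u) >= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([pulp.value(b0), pulp.value(b1)] + [pulp.value(v) for v in phi])
```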

4 Theoretical Guarantees
Next, we formalize the finite sample theoretical properties of the solution in Equation (22).
The following assumption is imposed throughout our discussion.
Assumption 4.1. Suppose that the following conditions hold.
(A) Yi (d, s) ∈ [−M, M ], for some M < ∞, for all d ∈ {0, 1}, s ∈ {0, 1};
(B) Π has finite VC-dimension, denoted as v;11
(C) Π is pointwise measurable.12
Condition (A) states that the outcome is uniformly bounded by a universal constant M <
∞. Importantly, the implementation of Fair Targeting does not require knowledge of
such a constant. Condition (B) imposes a restriction on the policy function class of
interest. The finite VC-dimension condition is common in many areas of
research (see Devroye et al. (2013) for further reference). General examples where such an
assumption holds for Empirical Welfare Maximization can be found in Mbakop and Tabord-
Meehan (2016), among others. Condition (C) ensures the measurability of the supremum
of the empirical process of interest, similarly, for instance, to Rai (2018). Simple examples
where the finite VC-dimension condition holds are linear decision rules (Manski, 1975), and
decision trees (Loh, 2011). In the former case, the VC-dimension is bounded by the number
of covariates, whereas in the latter case is bounded by the exponential of the number of
layers in the tree (Athey and Wager, 2017; Zhou et al., 2018).
11 The VC dimension denotes the cardinality of the largest set of points that functions in Π can shatter.
The VC dimension is commonly used to measure the complexity of a class, see for example Devroye et al.
(2013).
12 A pointwise measurable class satisfies the following condition: for each π ∈ Π, there exists a count-
able subset G, such that there exists a sequence πm ∈ G with πm → π uniformly as m → ∞ (Chernozhukov
et al., 2014).

4.1 Guarantees on the Pareto Frontier
In the following lines, we impose consistency conditions on the estimated nuisance param-
eters.

Assumption 4.2. Assume that for some ξ1 ≥ 1/4, ξ2 ≥ 1/4, the following holds:
\[ \mathbb{E}\Big[ \big( \hat{m}_{d,s}(X_i) - m_{d,s}(X_i) \big)^2 \Big] = O(n^{-2\xi_1}), \qquad \mathbb{E}\Big[ \big( 1/\{\hat{p}_s \hat{e}(X_i, s)\} - 1/\{p_s e(X_i, s)\} \big)^2 \Big] = O(n^{-2\xi_2}), \tag{23} \]
for all s ∈ {0, 1}, d ∈ {0, 1}, where Xi is out-of-sample. Assume in addition that for some
H, and n ≥ H, with probability one,
\[ \sup_{x \in \mathcal{X}, s \in \mathcal{S}, d \in \{0,1\}} \big| \hat{m}_{d,s}(x) - m_{d,s}(x) \big| \le 2M, \qquad \sup_{x \in \mathcal{X}, s \in \mathcal{S}} \Big| \frac{1}{p_s e(x, s)} - \frac{1}{\hat{p}_s \hat{e}(x, s)} \Big| \le 2/\delta^2. \tag{24} \]

Assumption 4.2 assumes that the product of the mean-squared error of the propensity
score and conditional mean function converges at the parametric rate n−1 . Such a condition
is standard in the doubly-robust literature (Chernozhukov et al., 2018; Farrell, 2015; Newey,
1994; Robins and Rotnitzky, 1995). Since we only consider rates (and not constants which
depend on the variance component) in the upper bound, we find that uniform consistency
is not necessary for the derivation of the regret bound. Instead, we state the stability
statement as a finite-sample condition, which allows the characterization of small sample
properties on the regret. The condition can be alternatively stated asymptotically (Athey
and Wager, 2017), whereas in such a case, all the remaining results should be interpreted
in the asymptotic sense only.
As the first step for our results, we derive uniform convergence of the estimated welfare
with respect to its population counterpart, weighted by α-weights.

Theorem 4.1. Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then, with probability at least
1 − 1/√n − γ, for some H > 0, and n ≥ H, γ > 0,
\[ \sup_{\alpha \in (0,1)} \sup_{\pi \in \Pi} \Big| \alpha W_0(\pi) + (1 - \alpha) W_1(\pi) - \max_{\alpha_j \in \{\alpha_1, ..., \alpha_N\}} \big\{ \alpha_j \hat{W}_0(\pi) + (1 - \alpha_j) \hat{W}_1(\pi) \big\} \Big| \le \bar{C}\, \frac{M}{\delta^2} \Big( \sqrt{v/n} + \sqrt{\log(2/\gamma)/n} + 1/N \Big) \]
for a universal constant C̄ < ∞.

Theorem 4.1 guarantees a uniform convergence rate of the empirical Pareto frontier con-
structed via discretization to its population counterpart at the optimal 1/√n rate. The
structed via discretization to its population counterpart at the optimal 1/ n rate. The
proof is contained in the Appendix.
Next, we characterize the distance between the estimated set Π̂o in Equation (18), ob-
tained using a discretization argument and taking into account estimation of the empirical

welfares, and the set of Pareto optimal policies Πo . To quantify such a difference, we
introduce the following measure:
\[ d(\Pi_o, \hat{\Pi}_o; \alpha) = \sup_{\pi \in \Pi_o} \big\{ \alpha W_0(\pi) + (1 - \alpha) W_1(\pi) \big\} - \sup_{\pi \in \hat{\Pi}_o} \big\{ \alpha W_0(\pi) + (1 - \alpha) W_1(\pi) \big\}. \tag{25} \]

Equation (25) quantifies the distance between the population Pareto frontier and the
empirical Pareto frontier, obtained via discretization. Under the above conditions, we can
state the following theorem.
Theorem 4.2. Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then, with probability at least
1 − 1/√n − γ, for some H > 0, and n ≥ H, γ > 0, for some universal constants c, c′ < ∞,
\[ \sup_{\alpha \in (0,1)} d(\Pi_o, \hat{\Pi}_o; \alpha) \le \frac{cM}{\delta^2} \sqrt{\frac{v \log(c'/\gamma)}{n}} + c\, \frac{M}{\delta^2 N}. \tag{26} \]

The proof is in the Appendix.


Theorem 4.2 ensures that the empirical Pareto frontier Π̂o , estimated using the dis-
cretization step, is close, in terms of welfare effects, to the population one, up to a small
error factor which scales at the rate \( \sqrt{1/n} + 1/N \); here N denotes the number of elements
in the discretized Pareto frontier. Such a result guarantees that the set Π̂o , obtained via the
discretization argument and estimated using the empirical welfares, “well-approximates”
the Pareto optimal set Πo in Equation (4), for a sample size large enough. To the
best of our knowledge, this is the first result of this type for Pareto optimal policies. In the
following theorem, we characterize the rate for an explicit choice of N .

Theorem 4.3. Suppose that we consider a grid with N = √n many elements. Then, under
Assumptions 2.1, 2.2, 4.1, 4.2, the following holds for a universal constant C̄ < ∞:
\[ P\Big( \sup_{\alpha \in (0,1)} d(\Pi_o, \hat{\Pi}_o; \alpha) \le \frac{\bar{C}M}{\delta^2} \sqrt{\frac{v \log(c' n)}{n}} \Big) \ge 1 - 1/\sqrt{n}. \tag{27} \]

4.2 Fairness Bounds


Theorem 4.2 provides guarantees on the Pareto set, whereas, in this section, we discuss
guarantees on the unfairness of the estimated policy function, with respect to the least un-
fair policy within the set of truly Pareto optimal policy. The following additional conditions
are required for the derivation of the theorem.
Assumption 4.3. For each α ∈ (0, 1), for any two
\[ \pi_\alpha, \pi'_\alpha \in \arg\sup_{\pi \in \Pi} \; \alpha W_0(\pi) + (1 - \alpha) W_1(\pi), \]
it holds that πα (x, s) = πα′ (x, s) for all x ∈ X , s ∈ {0, 1}.

The above condition imposes that for each element in the Pareto Frontier, the supre-
mum at the population level is reached by the same policies.13 We remark that such a
condition is not imposed on the maximum of the empirical counterpart, which instead may
not satisfy the above condition. The second condition is the consistency assumption of the
conditional mean function, given the opposite sensitive attribute.

Assumption 4.4. Assume that for some η > 0, the following holds:
\[ \mathbb{E}\Big[ \big( \hat{m}_{d,s_1}(X_i(s_2)) - m_{d,s_1}(X_i(s_2)) \big)^2 \Big] = O(n^{-2\eta}), \qquad \forall s_1, s_2 \in \{0, 1\},\ d \in \{0, 1\}. \]

Assumption 4.4 states that the estimator of the conditional mean function for each sen-
sitive attribute s ∈ {0, 1} and treatment status d ∈ {0, 1}, must consistently converge to the
true conditional mean function in MSE at some arbitrary rate 2η > 0. One important as-
pect to remark is that differently, for instance, from previous literature on semi-parametric
estimation, we require convergence in l2 for a given sensitive attribute conditional on the
opposite sensitive attribute. Such a condition is motivated by the counterfactual notion
of fairness: to estimate fairness, we are required to compute the empirical average of the
conditional mean function for a given group of individuals with respect to the distribution
under opposite group. Therefore, standard doubly-robust procedures cannot be employed
for estimation of the counterfactual envy-free fairness, but instead, its estimation must rely
on the estimation of the conditional mean function.

Remark 4 (Sufficient Conditions for Assumption 4.4). A sufficient condition consists in
imposing that, almost surely,
\[ \sup_{x \in \mathcal{X}} \big| \hat{m}_{d,s}(x) - m_{d,s}(x) \big| = O(n^{-\eta}). \tag{28} \]

Examples include (i) linear regression models, of the form md,s (x) = xβs , with βs being po-
tentially high-dimensional, and Xi (1), Xi (0) ∈ [−M, M ]p , for M < ∞; (ii) local polynomial
estimators of the form
\[ \hat{m}_{d,s}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1\{S_i = s\}}{p_s}\, Y_i\, K\Big( \frac{x - X_i}{h} \Big), \tag{29} \]

where K(u) is a kernel and h is the bandwidth, whose uniform convergence rates are
discussed, for instance, in Fan and Gijbels (1996); Hansen (2008).
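As a small illustration of Equation (29), the following computes the kernel estimator as displayed, together with the usual Nadaraya–Watson normalization; the Gaussian kernel and the normalized variant are standard choices assumed here, not prescriptions of the paper.

```python
import numpy as np

def kernel_mu(x_eval, x, y, s, group, h, p_s):
    # Kernel estimator in the spirit of Equation (29), restricted to group `group`;
    # in practice the sample should also be restricted to the relevant treatment arm.
    k = np.exp(-0.5 * ((x_eval - x) / h) ** 2)           # Gaussian kernel K(u)
    w = (s == group).astype(float) / p_s
    unnormalized = np.mean(w * y * k)                    # Equation (29) as displayed
    nadaraya_watson = np.sum(w * y * k) / np.sum(w * k)  # normalized alternative
    return unnormalized, nadaraya_watson
```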

Under the above conditions, we state our second theorem, where we implicitly assume
that N = √n.
13 Conditions on the uniqueness of the optimal welfare maximization rule can also be found in Andrews
et al. (2019).

Theorem 4.4 (Optimal Fairness). Let Assumptions 2.1, 2.2, 4.1-4.4 hold. Then for some
ÑM,δ,v > 0, which depends on M, δ, v, and n > max{ÑM,δ,v , H}, with probability at least
1 − γ − 1/√n, for universal constants C̄, c < ∞,
\[ \mathrm{UnFairness}(\hat{\pi}) - \inf_{\pi \in \Pi_o} \mathrm{UnFairness}(\pi) \le \bar{C}\, \frac{M}{\delta} \sqrt{\frac{v \log(1/\gamma)}{n}} + \frac{c}{\delta}\, n^{-\eta}. \tag{30} \]

The proof is contained in the Appendix.


Theorem 4.4 showcases that the policy constructed using the empirical and discretized
Pareto frontier achieves the same level of fairness as the fairest policy within the population
Pareto efficient policies. To the best of our knowledge, this is the first result of this type
for fair policies.

5 Empirical Application: Targeting Awards to Students


Fairness with respect to gender is a major concern in education and in industry. Motivated
by this concern, in this section we design a policy that assigns students to entrepreneurial
programs, imposing fairness on gender. We use data that originated from Lyons and Zhang
(2017). The paper studies the effect of an entrepreneurship training and incubation pro-
gram for undergraduate students in North America on subsequent entrepreneurial activity.
We have in total 335 observations, of which 53% were treated and the remaining were
under control, and 26% of applicants are women.
The population of interest is the pool of final applicants. We construct a targeting rule
which, based on observable characteristics of the applicant, assigns the award to the final-
ist. We maximize subsequent entrepreneurial activity, which is captured using a dummy
variable, indicating whether the participant worked in the startup once the program ended.
We impose fairness with respect to the gender attribute. While potential outcomes and
potential covariates may depend on such an attribute, it is plausible that the gender
attribute is independent of such potentials, and randomized before covariates and outcomes
are realized.
We estimate the probability of treatment using a penalized logistic regression, where we
condition on the non-Caucasian attribute, gender, the average score, years to graduation,
whether the individual had previously had entrepreneurship activities, the startup region
(which is a dummy since only two regions are considered), the degree (either engineering or
business), and the school rank. We estimate the outcome using a logistic regression, after
conditioning on the above covariates, and any interaction term between gender, treat-
ment assignment, and a vector of covariates, which include years to graduation, prior
entrepreneurship, startup region, and the school rank. We estimate treatment effects using
a doubly robust estimator. We use cross-fitting with five folds in our estimation.
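A sketch of this nuisance-estimation step, assuming scikit-learn, follows. The design matrix W is assumed to already contain the covariates and interaction terms listed above, and penalized (L1) logistic regressions are used for both the propensity and the outcome models, which simplifies the exact specification in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold

def crossfit_nuisances(W, D, Y, n_folds=5, seed=0):
    # Out-of-fold propensity scores e(X, S) and outcome predictions m_d(X, S),
    # estimated with five-fold cross-fitting as described in the text.
    n = len(Y)
    e_hat = np.full(n, np.nan)
    m_hat = {0: np.full(n, np.nan), 1: np.full(n, np.nan)}
    for tr, ev in KFold(n_folds, shuffle=True, random_state=seed).split(W):
        ps = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5)
        ps.fit(W[tr], D[tr])
        e_hat[ev] = ps.predict_proba(W[ev])[:, 1]
        for d in (0, 1):
            arm = tr[D[tr] == d]                         # treated or control fit sample
            out = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5)
            out.fit(W[arm], Y[arm])
            m_hat[d][ev] = out.predict_proba(W[ev])[:, 1]
    return e_hat, m_hat
```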

We consider a function class of linear decision rules, given their wide use in economic
applications (Manski, 1975). Namely, we consider the following policy function:
\[ \Pi = \Big\{ \pi(x, \mathrm{fem}) = 1\big\{ \beta_0 + \beta_1\, \mathrm{fem} + x^\top \phi \ge 0 \big\}, \ (\beta_0, \beta_1, \phi) \in \mathcal{B} \Big\}. \tag{31} \]

We allow covariates x to be either (i) the years to graduation, years of entrepreneurship,
the region of the startup, the major (either computer science or business), and the school
rank, or (ii) the score assigned to the candidate by the interviewer and the school rank. We
refer to these two cases respectively as Case 1 and Case 2. We consider the problem where
refer to these two cases respectively as Case 1 and Case 2. We consider the problem where
approximately half of the sample, namely 150 individuals, are selected for the treatment.
We consider two notions of unfairness: (i) inequity across groups, which is captured
by |W0 (π) − W1 (π)|, and (ii) counterfactual envy, as discussed in Section 2.5. Observe that
these are two different definitions of fairness. The first, inequity, denotes the difference
in welfare across groups. The second, counterfactual envy, instead depends on the sum of
the difference in welfares of two individuals in the same group, where one individual has
covariates and the treatment assignment drawn under the distribution corresponding to
the opposite sensitive attribute.14
Using the exact mixed-integer program formulation, we compute the optimal policy
rule that minimizes either envy or inequity over the function class Π, with a grid of size
N = 36. Results are robust to different grid choices.
We compare the proposed methodology to the Empirical Welfare Maximization method
(Athey and Wager, 2017; Kitagawa and Tetenov, 2018), with nuisance functions being
estimated as above. We consider three nested function classes for the EWM method. The
first does not impose any restriction except for the functional form in Equation (31). We
then consider classes under additional “fairness constraints”. In particular, the second
imposes that β1 = 0, i.e., anti-classification parity. The third class imposes that β1 = 0
and that the average marginal effect of the policy on females is at least as large as the one
on males. The three function classes are formally described below.
\[ \Pi_1 = \Pi, \qquad \Pi_2 = \Big\{ \pi(x) = 1\{ \beta_0 + x^\top \phi \ge 0 \} \Big\}, \]
\[ \Pi_3 = \Big\{ \pi(x) = 1\{ \beta_0 + x^\top \phi \ge 0 \}, \ \ \mathbb{E}_n\big[ (Y_i(1, 1) - Y_i(0, 1))\,\pi(X_i(1)) \big] \ge \mathbb{E}_n\big[ (Y_i(1, 0) - Y_i(0, 0))\,\pi(X_i(0)) \big] \Big\}, \tag{32} \]

where E_n[·] denotes the empirical expectation, estimated using the doubly robust method.
We refer to these policy spaces as Type 1, Type 2, and Type 3, respectively.
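
Continuing the MILP sketch above (and reusing its `prob`, `z`, and `beta1` objects), the constrained classes could be encoded by adding constraints to the same program. The names `gain_fem`, `gain_mal`, `fem_idx`, and `mal_idx` are hypothetical placeholders for the group-specific doubly robust effect estimates and index sets.

```python
# Type 2 (anti-classification parity): remove the gender term from the rule.
prob += beta1 == 0

# Type 3: additionally require the average estimated effect of the policy on
# treated females to be at least as large as that on treated males, an empirical
# version of the constraint in Equation (32); written without divisions by
# cross-multiplying the group sizes.
prob += (pulp.lpSum(float(gain_fem[i]) * z[i] for i in fem_idx) * len(mal_idx)
         >= pulp.lpSum(float(gain_mal[i]) * z[i] for i in mal_idx) * len(fem_idx))
```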
14 A simple example may illustrate the difference. Consider the policy π corresponding to “always treat”.
Suppose that X(1) ∼ N(µ1, σ²) and X(0) ∼ N(µ0, σ²), and that Y(d, s)|X(s) ∼ N(βd + X(s), η²).
Then the inequity corresponding to the policy “always treat” reads |β1 + µ1 − β0 − µ0|, whereas
counterfactual envy reads β0 − β1 + β1 − β0 = 0.

5.1 Results
In Figure 4 we plot the Pareto frontier over each function class.15 The figure shows that
restricting the function class of interest leads to Pareto dominated allocations. This outlines
the limitations of maximizing welfare under “fairness constraints”: such constraints can
be harmful for both groups of individuals. The proposed method instead considers the
red line (i.e., the least constrained environment) and selects the policy based on fairness
considerations. In Figure 4, we also label the allocations on the Pareto frontier corresponding
to the lowest envy and to the lowest inequity. We observe that different definitions of
fairness lead to different treatment allocations and different importance weights assigned
to each of the two groups. The proposed method finds such weights solely based on the
notion of fairness provided by the social planner, without requiring any prior specification
of the relative weights assigned to each group. We see this as a crucial advantage, since relative
importance weights may be hard to justify to the general public.
In Figure 5, we report the value of counterfactual envy and of inequity as a function of
the relative weight assigned to female students. A dotted line marks the weight corresponding
to the EWM allocation. The figure shows that the EWM allocation does not attain the
lowest unfairness in three of the four panels; only in Case 1 do the EWM allocation and
the lowest-inequity Fair Targeting rule coincide.
Finally, in Table 1, we report the welfare of female students, the welfare of male students,
and their weighted combination for different targeting rules. Similarly to what is shown in
Figure 4, we observe that different notions of fairness correspond to different group weights
in the objective function being minimized. As Table 1 shows, the Fair Targeting rules
are not Pareto dominated by any competitor; instead, they correspond to the optimal
allocations under the definitions of fairness based respectively on envy-freeness and equity.
The reason is the following: the envy-freeness Fair Targeting rule leads to a better allocation
than unconstrained EWM since it attains lower envy while remaining Pareto optimal (see
Figure 5). It leads to a better allocation than constrained EWM (EWM 2 and EWM 3)
since, as shown in Figure 4, for each constrained allocation we can find an unconstrained
allocation that Pareto dominates it; such an allocation is strictly preferred to the constrained
EWM allocation, but it has higher envy than the optimal envy-freeness allocation reported
in the table. Similar reasoning applies when fairness is defined in terms of equity across groups.

15 The value functions over the Pareto frontier can be recovered exactly as follows: we solve two optimization
problems for each αj, j ∈ {1, ..., N}. For each of these problems, we impose the constraint that the welfare of
one of the two groups is at least as large as that of the other, and vice versa; we then select the subset of
solutions that are not Pareto dominated by the others and plot the corresponding welfares in the figure.
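
The selection of non-dominated solutions mentioned in the footnote can be implemented with a simple pairwise comparison; a minimal sketch, with `welfares` a list of (W0, W1) pairs:

```python
def pareto_filter(welfares):
    """Indices of allocations that are not Pareto dominated: no other allocation
    is weakly better for both groups and strictly better for at least one."""
    keep = []
    for i, (w0, w1) in enumerate(welfares):
        dominated = any(v0 >= w0 and v1 >= w1 and (v0 > w0 or v1 > w1)
                        for j, (v0, v1) in enumerate(welfares) if j != i)
        if not dominated:
            keep.append(i)
    return keep
```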

Figure 4: Pareto frontier under linear policy rules. Dots denote Pareto optimal allocations,
whereas points on and below the line denote the set of dominated allocations. The panel
on the left collects results for the case where the covariates used in the decision rule are
years to graduation, years of entrepreneurship, region of the start-up, the major, and the
school rank. The panel on the right collects results for the case where the covariates used
in the decision rule are the score assigned to the applicant and the school rank. Type 1
refers to the function class Π1, Type 2 to the function class Π2, and Type 3 to the function
class Π3. Notice that the blue and green lines denote the Pareto frontiers under fairness
constraints, which are strictly dominated by the Pareto frontier in the unconstrained
environment (red line). The figure shows that imposing fairness constraints may result in
harmful allocations for both groups of individuals.

Figure 5: Envy (left panels) and inequity (right panels) under Case 1 (top panels) and
Case 2 (bottom panels). The x-axis corresponds to different importance weights assigned
to the welfare of female students, and the y-axis reports the level of unfairness of the
procedure that maximizes the corresponding weighted combination of the welfares of female
and male students. Dotted lines denote the importance weight assigned to female students
by the EWM method. Notice that the Pareto allocation corresponding to the lowest envy
and the one corresponding to the lowest inequity assign different importance weights to
female students. The figure shows that importance weights may have an a priori unknown
dependence on the notion of fairness adopted by the social planner.

Table 1: The panel at the top refers to the case where the covariates used in the decision
rule are years to graduation, years of entrepreneurship, region of the start-up, the major,
and the school rank. The panel at the bottom collects results for the case where the
covariates used in the decision rule are the score assigned to the applicant and the school
rank. Fair Envy refers to the Fair Targeting rule that minimizes envy; Fair Equity is the
equity-based Fair Targeting method; EWM 1, EWM 2, EWM 3 refer to the EWM method
with function class Π1, Π2, Π3, respectively. The second column reports the welfare of
female students, the third column the welfare of male students, and the fourth column a
weighted combination of the two welfares. The last column reports the importance weight
assigned by the method to the welfare of female students.

Case 1        Welf Fem   Welf Mal   Weighted Welfare   Weight

Fair Envy     0.376      0.272      0.312              0.384
Fair Equity   0.297      0.300      0.299              0.256
EWM 1         0.297      0.300      0.299              0.266
EWM 2         0.288      0.285      0.286              0.266
EWM 3         0.331      0.265      0.282              0.266

Case 2        Welf Fem   Welf Mal   Weighted Welfare   Weight

Fair Envy     0.372      0.195      0.281              0.487
Fair Equity   0.203      0.253      0.250              0.076
EWM 1         0.351      0.235      0.266              0.266
EWM 2         0.307      0.238      0.257              0.266
EWM 3         0.307      0.238      0.257              0.266

6 Conclusion
In this paper, we have introduced a novel method for estimating fair and optimal treatment
allocation rules. We proposed a multi-objective decision problem, where the policymaker
aims to select the least unfair policy within the set of Pareto optimal allocations. We provided
counterfactual definitions of fairness, which are justified from an envy-freeness perspective.
We derived strong theoretical guarantees on the properties of the Pareto frontier and regret
bounds on the unfairness of the proposed policy function.
We employ the Neyman-Rubin potential outcome framework, indexing potential outcomes
and covariates by the sensitive attribute. We assume that the sensitive attribute is
independent of such potentials. Such an assumption may hold in many applications
involving attributes such as age or gender, whereas it may fail in applications where the
sensitive attribute is confounded by unobserved characteristics, as in the case of the race
attribute. We leave for future research extensions of this work that allow for weaker
notions of exogeneity of the sensitive attribute and the use of possible instruments in such
settings.
From a theoretical perspective, the validity of the regret bound relies on the convergence
rate in MSE of the proposed estimator, evaluated on covariates drawn from a distribution
different from that of the training sample. Whereas examples are discussed, such notions
open new questions on the theoretical guarantees of classical estimators when used to
predict on samples drawn from different distributions.
Finally, this paper sheds some light on the connection between counterfactual definitions
of fairness and the statistical treatment allocation problem. A comprehensive discussion of
such a connection in a decision-theoretic framework remains an open research question.

References
Andrews, I., T. Kitagawa, and A. McCloskey (2019). Inference on winners. Technical report, National Bureau of Economic Research.

Armstrong, T. and S. Shen (2015). Inference on optimal treatment assignments. Available at SSRN 2592479.

Athey, S. and S. Wager (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.

Balashankar, A., A. Lees, C. Welty, and L. Subramanian (2019). What is fair? Exploring Pareto-efficiency for fairness constrained classifiers. arXiv preprint arXiv:1910.14120.

Bang, H. and J. M. Robins (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61(4), 962–973.

Bertsimas, D. and J. Dunn (2017). Optimal classification trees. Machine Learning 106(7), 1039–1082.

Bhattacharya, D. and P. Dupas (2012). Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics 167(1), 168–196.

Boucheron, S., O. Bousquet, and G. Lugosi (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics 9, 323–375.

Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Chen, L.-Y. and S. Lee (2018). Best subset binary prediction. Journal of Econometrics 206(1), 39–56.

Chernozhukov, V., D. Chetverikov, K. Kato, et al. (2014). Gaussian approximation of suprema of empirical processes. The Annals of Statistics 42(4), 1564–1597.

Chernozhukov, V., W. K. Newey, and J. Robins (2018). Double/de-biased machine learning using regularized Riesz representers. Technical report, cemmap working paper.

Chiang, H. D., K. Kato, Y. Ma, and Y. Sasaki (2019). Multiway cluster robust double/debiased machine learning. arXiv preprint arXiv:1909.03489.

Chiappa, S. (2019). Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, Volume 33, pp. 7801–7808.

Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5(2), 153–163.

Corbett-Davies, S. and S. Goel (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.

Corbett-Davies, S., E. Pierson, A. Feller, S. Goel, and A. Huq (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806.

Coston, A., A. Mishler, E. H. Kennedy, and A. Chouldechova (2020). Counterfactual risk assessments, evaluation, and fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 582–593.

Cowgill, B. and C. E. Tucker (2019). Economics, fairness and algorithmic bias. In preparation for: Journal of Economic Perspectives.

De Ree, J., K. Muralidharan, M. Pradhan, and H. Rogers (2017). Double for nothing? Experimental evidence on an unconditional teacher salary increase in Indonesia. The World Bank.

Deb, K. (2014). Multi-objective optimization. In Search Methodologies, pp. 403–449. Springer.

Dehejia, R. H. (2005). Program evaluation as a decision problem. Journal of Econometrics 125(1-2), 141–173.

Devroye, L., L. Györfi, and G. Lugosi (2013). A Probabilistic Theory of Pattern Recognition, Volume 31. Springer Science & Business Media.

Donini, M., L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil (2018). Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, pp. 2791–2801.

Dudik, M., J. Langford, and L. Li (2011). Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.

Dupas, P. (2014). Short-run subsidies and long-run adoption of new health products: Evidence from a field experiment. Econometrica 82(1), 197–228.

Dwork, C., M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.

Egger, D., J. Haushofer, E. Miguel, P. Niehaus, and M. W. Walker (2019). General equilibrium effects of cash transfers: Experimental evidence from Kenya. Technical report, National Bureau of Economic Research.

Elliott, G. and R. P. Lieli (2013). Predicting binary outcomes. Journal of Econometrics 174(1), 15–26.

Fan, J. and I. Gijbels (1996). Local Polynomial Modelling and Its Applications: Monographs on Statistics and Applied Probability 66, Volume 66. CRC Press.

Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics 189(1), 1–23.

Feldman, A. and A. Kirman (1974). Fairness and envy. The American Economic Review, 995–1005.

Feldman, M., S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015). Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.

Florios, K. and S. Skouras (2008). Exact computation of max weighted score estimators. Journal of Econometrics 146(1), 86–91.

Foley, D. K. (1967). Resource allocation and the public sector.

Fudenberg, D. and J. Tirole (1991). Game Theory.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 315–331.

Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory 24(3), 726–748.

Hardt, M., E. Price, and N. Srebro (2016). Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323.

Hirano, K. and J. R. Porter (2009). Asymptotics for statistical treatment rules. Econometrica 77(5), 1683–1701.

Horvitz, D. G. and D. J. Thompson (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47(260), 663–685.

Hossain, S., A. Mladenovic, and N. Shah (2020). Designing fairly fair classifiers via economic fairness notions. In Proceedings of The Web Conference 2020, pp. 1559–1569.

Imai, K., L. Keele, and D. Tingley (2010). A general approach to causal mediation analysis. Psychological Methods 15(4), 309.

Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika 87(3), 706–710.

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics 86(1), 4–29.

Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Kallus, N. (2017). Recursive partitioning for personalization using observational data. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1789–1798. JMLR.org.

Kasy, M. and R. Abebe (2020). Fairness, equality, and power in algorithmic decision making. Working paper.

Kilbertus, N., M. R. Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf (2017). Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, pp. 656–666.

Kitagawa, T. and A. Tetenov (2018). Who should be treated? Empirical welfare maximization methods for treatment choice. Econometrica 86(2), 591–616.

Kitagawa, T. and A. Tetenov (2019). Equality-minded treatment choice. Journal of Business & Economic Statistics, 1–14.

Kleinberg, J., J. Ludwig, S. Mullainathan, and A. Rambachan (2018). Algorithmic fairness. In AEA Papers and Proceedings, Volume 108, pp. 22–27.

Kleinberg, J., S. Mullainathan, and M. Raghavan (2016). Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.

Kusner, M., C. Russell, J. Loftus, and R. Silva (2019). Making decisions that reduce discriminatory impacts. In International Conference on Machine Learning, pp. 3591–3600.

Laber, E. B., D. J. Lizotte, M. Qian, W. E. Pelham, and S. A. Murphy (2014). Dynamic treatment regimes: Technical challenges and applications. Electronic Journal of Statistics 8(1), 1225.

Liu, Y., G. Radanovic, C. Dimitrakakis, D. Mandal, and D. C. Parkes (2017). Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875.

Loh, W.-Y. (2011). Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1), 14–23.

Lu, C., B. Schölkopf, and J. M. Hernández-Lobato (2018). Deconfounding reinforcement learning in observational settings. arXiv preprint arXiv:1812.10576.

Luedtke, A. R. and M. J. Van Der Laan (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics 44(2), 713.

Lyons, E. and L. Zhang (2017). The impact of entrepreneurship programs on minorities. American Economic Review 107(5), 303–07.

Manski, C. F. (1975). Maximum score estimation of the stochastic utility model of choice. Journal of Econometrics 3(3), 205–228.

Manski, C. F. (2004). Statistical treatment rules for heterogeneous populations. Econometrica 72(4), 1221–1246.

Manski, C. F. and T. S. Thompson (1989). Estimation of best predictors of binary response. Journal of Econometrics 40(1), 97–123.

Martinez, N., M. Bertran, and G. Sapiro (2019). Fairness with minimal harm: A Pareto-optimal approach for healthcare. arXiv preprint arXiv:1911.06935.

Mas-Colell, A., M. D. Whinston, J. R. Green, et al. (1995). Microeconomic Theory, Volume 1. Oxford University Press, New York.

Mbakop, E. and M. Tabord-Meehan (2016). Model selection for treatment choice: Penalized welfare maximization. arXiv preprint arXiv:1609.03167.

Mitchell, S., E. Potash, S. Barocas, A. D’Amour, and K. Lum (2018). Prediction-based decisions and fairness: A catalogue of choices, assumptions, and definitions. arXiv preprint arXiv:1811.07867.

Muralidharan, K., A. Singh, and A. J. Ganimian (2019). Disrupting education? Experimental evidence on technology-aided instruction in India. American Economic Review 109(4), 1426–60.

Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.

Nabi, R., D. Malinsky, and I. Shpitser (2019). Learning optimal fair policies. Proceedings of Machine Learning Research 97, 4674.

Nabi, R. and I. Shpitser (2018). Fair inference on outcomes. In Thirty-Second AAAI Conference on Artificial Intelligence.

Negishi, T. (1960). Welfare economics and existence of an equilibrium for a competitive economy. Metroeconomica 12(2-3), 92–97.

Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics 5(2), 99–135.

Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica: Journal of the Econometric Society, 1349–1382.

Nie, X., E. Brunskill, and S. Wager (2019). Learning when-to-treat policies. arXiv preprint arXiv:1905.09751.

Noghin, V. D. (2006). An axiomatization of the generalized Edgeworth–Pareto principle in terms of choice functions. Mathematical Social Sciences 52(2), 210–216.

Pearl, J. (2009). Causality. Cambridge University Press.

Qian, M. and S. A. Murphy (2011). Performance guarantees for individualized treatment rules. Annals of Statistics 39(2), 1180.

Rai, Y. (2018). Statistical inference for treatment assignment policies. Unpublished manuscript.

Rambachan, A., J. Kleinberg, J. Ludwig, and S. Mullainathan (2020). An economic approach to regulating algorithms.

Rambachan, A. and J. Roth (2019). Bias in, bias out? Evaluating the folk wisdom. arXiv preprint arXiv:1909.08518.

Robins, J. M. and A. Rotnitzky (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association 90(429), 122–129.

Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.

Rubin, D. B. (1990). Formal mode of statistical inference for causal effects. Journal of Statistical Planning and Inference 25(3), 279–292.

Schick, A. (1986). On asymptotically efficient estimation in semiparametric models. The Annals of Statistics, 1139–1151.

Stoye, J. (2012). Minimax regret treatment choice with covariates or with limited validity of experiments. Journal of Econometrics 166(1), 138–156.

Tetenov, A. (2012). Statistical treatment choice based on asymmetric minimax regret criteria. Journal of Econometrics 166(1), 157–165.

Ustun, B., Y. Liu, and D. Parkes (2019). Fairness without harm: Decoupled classifiers with preference guarantees. In International Conference on Machine Learning, pp. 6373–6382.

Van Der Vaart, A. W. and J. A. Wellner (1996). Weak convergence. In Weak Convergence and Empirical Processes, pp. 16–28. Springer.

Varian, H. R. (1976). Two problems in the theory of fairness. Journal of Public Economics 5(3-4), 249–260.

Viviano, D. (2019). Policy targeting under network interference. arXiv preprint arXiv:1906.10258.

Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Volume 48. Cambridge University Press.

Wolsey, L. A. and G. L. Nemhauser (1999). Integer and Combinatorial Optimization, Volume 55. John Wiley & Sons.

Xiao, L., Z. Min, Z. Yongfeng, G. Zhaoquan, L. Yiqun, and M. Shaoping (2017). Fairness-aware group recommendation with Pareto-efficiency. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 107–115.

Zemel, R., Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013). Learning fair representations. In International Conference on Machine Learning, pp. 325–333.

Zhang, B., A. A. Tsiatis, E. B. Laber, and M. Davidian (2012). A robust method for estimating optimal treatment regimes. Biometrics 68(4), 1010–1018.

Zhou, X., N. Mayer-Hamblett, U. Khan, and M. R. Kosorok (2017). Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association 112(517), 169–187.

Zhou, Z., S. Athey, and S. Wager (2018). Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778.
A Proofs
A.1 Auxiliary Lemmas
Lemma A.1. Under Assumptions 2.1 and 2.2, for any sensitive attribute s ∈ {0, 1},

W_s(\pi) = E\Big[\frac{1\{S_i = s\}\, Y_i D_i}{p_s\, e(X_i, s)}\,\pi(X_i, s) + \frac{1\{S_i = s\}\, Y_i (1 - D_i)}{p_s\,(1 - e(X_i, s))}\,\big(1 - \pi(X_i, s)\big)\Big]. \qquad (33)

Proof of Lemma A.1. Assumption 2.2 guarantees existence of the expectation. By the law
of iterated expectations and Equation (2), we obtain

E\Big[\frac{1\{S_i=s\}}{p_s}\Big(\frac{Y_i D_i}{e(X_i,s)} - \frac{Y_i(1-D_i)}{1-e(X_i,s)}\Big)\pi(X_i,s)\Big] = E\Big[\frac{1\{S_i=s\}}{p_s}\, E\Big[\frac{Y_i(1,s)D_i}{e(X_i,s)} - \frac{Y_i(0,s)(1-D_i)}{1-e(X_i,s)}\,\Big|\, S_i=s, X_i\Big]\pi(X_i,s)\Big]. \qquad (34)

By the first condition in Assumption 2.1, we have

E\Big[\frac{Y_i(1,s)D_i}{e(X_i,s)} - \frac{Y_i(0,s)(1-D_i)}{1-e(X_i,s)}\,\Big|\, S_i=s, X_i\Big] = E\big[Y_i(1,s) - Y_i(0,s)\,\big|\, S_i=s, X_i\big]. \qquad (35)

Observe now that

E\big[Y_i(1,s) - Y_i(0,s)\,\big|\, S_i=s, X_i\big] = E\big[Y_i(1,s) - Y_i(0,s)\,\big|\, X_i(s)\big]. \qquad (36)

Therefore, by consistency of potential covariates, we can write

E\Big[\frac{1\{S_i=s\}}{p_s}\, E\big[Y_i(1,s) - Y_i(0,s)\,\big|\, S_i=s, X_i\big]\,\pi(X_i(s),s)\Big] = E\Big[\frac{1\{S_i=s\}}{p_s}\, E\big[Y_i(1,s) - Y_i(0,s)\,\big|\, X_i(s)\big]\,\pi(X_i(s),s)\Big]. \qquad (37)

Under Condition (B),

E\Big[\frac{1\{S_i=s\}}{p_s}\, E\big[Y_i(1,s) - Y_i(0,s)\,\big|\, X_i(s)\big]\,\pi(X_i(s),s)\Big] = E\Big[\frac{1\{S_i=s\}}{p_s}\Big] \times E\Big[E\big[Y_i(1,s) - Y_i(0,s)\,\big|\, X_i(s)\big]\,\pi(X_i(s),s)\Big] = E\Big[\big(Y_i(1,s) - Y_i(0,s)\big)\,\pi(X_i(s),s)\Big]. \qquad (38)

Finally, observe that using similar arguments

E\Big[\frac{1\{S_i=s\}}{p_s}\,\frac{Y_i(1-D_i)}{1-e(X_i,s)}\Big] = E[Y_i(0,s)],

which concludes the proof.

Lemma A.2. Let W_{s,n}(\pi) = \frac{1}{n}\sum_{i=1}^n \Gamma_{1,s,i}\,\pi(X_i,s) + \Gamma_{0,s,i}\,(1-\pi(X_i,s)), where \Gamma_{d,s,i} is defined
as in Equation (12). Let Assumptions 2.1, 2.2, 4.1 hold. Then with probability at least 1 − γ,

\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi)\big) \le \bar C\,\frac{M}{\delta^2}\sqrt{v/n} + \frac{\bar C M}{\delta^2}\sqrt{\log(2/\gamma)/n} \qquad (39)

for a universal constant \bar C < ∞. Similarly,

E\Big[\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi)\big)\Big] \le \bar C\,\frac{M}{\delta^2}\sqrt{v/n}. \qquad (40)

Proof of Lemma A.2. Throughout the proof we refer to \bar C < ∞ as a universal constant.
Observe first that under Assumption 2.2 and Assumption 4.1,

\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi)\big) \qquad (41)

satisfies the bounded difference condition (Boucheron et al., 2013) with constant \frac{2M}{\delta^2 n}; see
for instance Boucheron et al. (2005). By the bounded difference inequality, with probability
at least 1 − γ,

\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi)\big)
\le E\Big[\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi)\big)\Big] + \bar C\,\frac{M}{\delta^2}\sqrt{\log(2/\gamma)/n}. \qquad (42)

We now bound the expectation on the right-hand side of Equation (42). Under
Assumption 2.1, we obtain by Lemma A.1 and trivial rearrangements that

E\big[\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \alpha W_{0,n}(\pi) - (1-\alpha)W_{1,n}(\pi)\big] = 0. \qquad (43)

Using the symmetrization argument (Van Der Vaart and Wellner, 1996), we can now
bound the above supremum by the Rademacher complexity of the function class of
interest (e.g., Athey and Wager (2017), Viviano (2019)), which combined with the triangle
inequality reads as follows:

E\Big[\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi)\big)\Big]
\le E\Big[\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) - \alpha W_{0,n}(\pi)\Big] + E\Big[\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, (1-\alpha)W_1(\pi) - (1-\alpha)W_{1,n}(\pi)\Big]
\le E\Big[\sup_{\pi\in\Pi}\, W_0(\pi) - W_{0,n}(\pi)\Big] + E\Big[\sup_{\pi\in\Pi}\, W_1(\pi) - W_{1,n}(\pi)\Big]
\le E\Big[\sup_{\pi\in\Pi}\,\frac{1}{n}\sum_{i=1}^n \sigma_i\,\pi(X_i(1),1)\,\Gamma_{1,1,i}\Big] + E\Big[\sup_{\pi\in\Pi}\,\frac{1}{n}\sum_{i=1}^n \sigma_i\,(1-\pi(X_i(1),1))\,\Gamma_{0,1,i}\Big]
\quad + E\Big[\sup_{\pi\in\Pi}\,\frac{1}{n}\sum_{i=1}^n \sigma_i\,\pi(X_i(0),0)\,\Gamma_{1,0,i}\Big] + E\Big[\sup_{\pi\in\Pi}\,\frac{1}{n}\sum_{i=1}^n \sigma_i\,(1-\pi(X_i(0),0))\,\Gamma_{0,0,i}\Big], \qquad (44)

where the σ_i are independent Rademacher random variables. We can study each component
of the above expression separately. By Dudley's entropy integral bound, since
the VC dimension of the function class Π is bounded by Assumption 4.1, and since each
\Gamma_{d,s,i} is bounded, we obtain (see for instance Wainwright (2019)), under Assumption 4.1
(A) and (B), with trivial rearrangement,

E\Big[\sup_{\pi\in\Pi}\,\frac{1}{n}\sum_{i=1}^n \sigma_i\,\pi(X_i(s),s)\,\Gamma_{d,s,i}\Big] \le \frac{M\bar C}{\delta^2}\sqrt{v/n} \qquad (45)

for each d, s. Observe now that the VC dimension of π equals the VC dimension of 1 − π
(Devroye et al., 2013). This completes the proof.

Lemma A.3. Let Assumptions 2.1, 2.2, 4.1, 4.2 hold. Then with probability at least 1 − γ,
and for n > H,

\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha \hat W_0(\pi) + (1-\alpha)\hat W_1(\pi)\big) \le \bar C\,\frac{M}{\delta^2}\sqrt{v/n} + \frac{\bar C M}{\delta^2}\sqrt{\log(2/\gamma)/n} \qquad (46)

for a universal constant \bar C < ∞.

Proof of Lemma A.3. First observe that we can bound the above expression as

\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha \hat W_0(\pi) + (1-\alpha)\hat W_1(\pi)\big)
\le \underbrace{\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi)\big)}_{(I)}
\quad + \underbrace{\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\, \alpha W_{0,n}(\pi) + (1-\alpha)W_{1,n}(\pi) - \big(\alpha \hat W_0(\pi) + (1-\alpha)\hat W_1(\pi)\big)}_{(II)}. \qquad (47)

Here W_{s,n} is as defined in Lemma A.2. The term (I) is bounded as in Lemma A.2. Therefore,
we are only left to discuss (II). Using the triangle inequality, we only need to bound

\sup_{\pi\in\Pi}\big|W_{0,n}(\pi) - \hat W_0(\pi)\big| + \sup_{\pi\in\Pi}\big|W_{1,n}(\pi) - \hat W_1(\pi)\big|. \qquad (48)

We bound the first term; the second follows similarly. We write

\sup_{\pi\in\Pi}\big|W_{s,n}(\pi) - \hat W_s(\pi)\big|
\le \sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))}{p_s\, e(X_i,s)}\pi(X_i,s) + m_{1,s}(X_i)\pi(X_i,s) - \frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}\,D_i(Y_i - \hat m_{1,s}(X_i))}{\hat p_s\, \hat e(X_i,s)}\pi(X_i,s) - \hat m_{1,s}(X_i)\pi(X_i,s)\Big|
\quad + \sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}\,(1-D_i)(Y_i - m_{0,s}(X_i))}{p_s\,(1-e(X_i,s))}(1-\pi(X_i,s)) + m_{0,s}(X_i)(1-\pi(X_i,s)) - \frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}\,(1-D_i)(Y_i - \hat m_{0,s}(X_i))}{\hat p_s\,(1-\hat e(X_i,s))}(1-\pi(X_i,s)) - \hat m_{0,s}(X_i)(1-\pi(X_i,s))\Big|. \qquad (49)

We discuss the first component; the second follows similarly. With trivial rearrangement,
using the triangle inequality, we obtain that

\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))}{p_s\, e(X_i,s)}\pi(X_i,s) + m_{1,s}(X_i)\pi(X_i,s) - \frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}\,D_i(Y_i - \hat m_{1,s}(X_i))}{\hat p_s\, \hat e(X_i,s)}\pi(X_i,s) - \hat m_{1,s}(X_i)\pi(X_i,s)\Big|
\le \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s\, \hat e(X_i,s)}\Big)\pi(X_i,s)\Big|}_{(i)}
\quad + \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \Big(\frac{1\{S_i=s\}\,D_i}{\hat p_s\, \hat e(X_i,s)} - 1\Big)\big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)\pi(X_i,s)\Big|}_{(ii)}. \qquad (50)

We study (i) and (ii) separately. We start from (i). Recall that by cross-fitting, \hat e(X_i,s) =
\hat e^{(-k(i))}(X_i,s), where k(i) is the fold containing unit i. Therefore, given the K folds used
for cross-fitting, we have

\Big|\frac{1}{n}\sum_{i=1}^n 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s\, \hat e(X_i,s)}\Big)\pi(X_i,s)\Big|
\le \sum_{k\in\{1,\dots,K\}}\Big|\frac{1}{n}\sum_{i\in I_k} 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)}\Big)\pi(X_i,s)\Big|. \qquad (51)

In addition, we have that

E\Big[\sum_{i\in I_k} 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)}\Big)\pi(X_i,s)\Big]
= E\Big[E\Big[\sum_{i\in I_k} 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)}\Big)\pi(X_i,s)\,\Big|\,\hat p^{(-k)}, \hat e^{(-k)}\Big]\Big] = 0, \qquad (52)

by cross-fitting. By Assumption 4.2, we know that for n > H,

\sup_{x\in\mathcal{X},\, s\in\mathcal{S}}\Big|\frac{1}{p_s\, e(x,s)} - \frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(x,s)}\Big| \le 2/\delta^2, \qquad (53)

and therefore each summand in Equation (51) is bounded by a finite constant multiple of M/δ².
We now obtain that, for n > H, using the symmetrization argument (Van Der Vaart and
Wellner, 1996) and Dudley's entropy integral (Wainwright, 2019),

E\Big[\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i\in I_k} 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)}\Big)\pi(X_i,s)\Big|\,\Big|\,\hat p^{(-k)}, \hat e^{(-k)}\Big] \lesssim \frac{M}{\delta^2}\sqrt{v/n}. \qquad (54)

In addition, by the bounded difference inequality (Boucheron et al., 2005), we obtain that
for n > H, with probability at least 1 − γ, for a universal constant c < ∞,

\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i\in I_k} 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)}\Big)\pi(X_i,s)\Big|
\le E\Big[\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i\in I_k} 1\{S_i=s\}\,D_i(Y_i - m_{1,s}(X_i))\Big(\frac{1}{p_s\, e(X_i,s)} - \frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)}\Big)\pi(X_i,s)\Big|\,\Big|\,\hat p^{(-k)}, \hat e^{(-k)}\Big] + c\,\frac{M}{\delta^2}\sqrt{\log(2/\gamma)/n}. \qquad (55)

We now consider the term (ii). Observe that we can write

(ii) \le \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \Big(\frac{D_i 1\{S_i=s\}}{\hat p_s\,\hat e(X_i,s)} - \frac{D_i 1\{S_i=s\}}{p_s\, e(X_i,s)}\Big)\big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)\pi(X_i,s)\Big|}_{(j)}
\quad + \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \Big(\frac{D_i 1\{S_i=s\}}{p_s\, e(X_i,s)} - 1\Big)\big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)\pi(X_i,s)\Big|}_{(jj)}. \qquad (56)

We consider each term separately. Consider (jj) first. Using the cross-fitting argument, we
obtain

\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \Big(\frac{D_i 1\{S_i=s\}}{p_s\, e(X_i,s)} - 1\Big)\big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)\pi(X_i,s)\Big|
\le \sum_{k\in\{1,\dots,K\}}\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i\in I_k}\Big(\frac{D_i 1\{S_i=s\}}{p_s\, e(X_i,s)} - 1\Big)\big(m_{1,s}(X_i) - \hat m^{(-k)}_{1,s}(X_i)\big)\pi(X_i,s)\Big|. \qquad (57)

Observe now that

E\Big[\Big(\frac{D_i 1\{S_i=s\}}{p_s\, e(X_i,s)} - 1\Big)\big(m_{1,s}(X_i) - \hat m^{(-k)}_{1,s}(X_i)\big)\pi(X_i,s)\,\Big|\,\hat m^{(-k)}_{1,s}\Big] = 0. \qquad (58)

Therefore, following the same argument discussed before, we obtain that for n > H large
enough, with probability at least 1 − γ,

\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i\in I_k}\Big(\frac{D_i 1\{S_i=s\}}{p_s\, e(X_i,s)} - 1\Big)\big(m_{1,s}(X_i) - \hat m^{(-k)}_{1,s}(X_i)\big)\pi(X_i,s)\Big| \lesssim \frac{M}{\delta^2}\sqrt{\frac{v}{n}} + \frac{M}{\delta^2}\sqrt{\frac{\log(1/\gamma)}{n}}. \qquad (59)

We are now left to bound (j). We obtain that

(j) \le \sqrt{\frac{1}{n}\sum_{i=1}^n \Big(\frac{1}{\hat p_s\,\hat e(X_i,s)} - \frac{1}{p_s\, e(X_i,s)}\Big)^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^n \big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)^2}. \qquad (60)

Such a bound does not depend on π. Observe now that we can write

\sqrt{\frac{1}{n}\sum_{i=1}^n \Big(\frac{1}{\hat p_s\,\hat e(X_i,s)} - \frac{1}{p_s\, e(X_i,s)}\Big)^2}\;\sqrt{\frac{1}{n}\sum_{i=1}^n \big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)^2}
= \sqrt{\sum_{k\in\{1,\dots,K\}}\frac{1}{n}\sum_{i\in I_k}\Big(\frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)} - \frac{1}{p_s\, e(X_i,s)}\Big)^2}\;\sqrt{\sum_{k\in\{1,\dots,K\}}\frac{1}{n}\sum_{i\in I_k}\big(m_{1,s}(X_i) - \hat m^{(-k)}_{1,s}(X_i)\big)^2}. \qquad (61)

By the bounded difference inequality and the union bound, we obtain that the following
holds with probability at least 1 − γ:

\sqrt{\sum_{k}\frac{1}{n}\sum_{i\in I_k}\Big(\frac{1}{\hat p_s^{(-k)}\,\hat e^{(-k)}(X_i,s)} - \frac{1}{p_s\, e(X_i,s)}\Big)^2}\;\sqrt{\sum_{k}\frac{1}{n}\sum_{i\in I_k}\big(m_{1,s}(X_i) - \hat m^{(-k)}_{1,s}(X_i)\big)^2}
\le K\,\sqrt{E\Big[\Big(\frac{1}{\hat p_s\,\hat e(X_i,s)} - \frac{1}{p_s\, e(X_i,s)}\Big)^2\Big]}\,\sqrt{E\Big[\big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)^2\Big]}
\quad + \sqrt[4]{2\log(2K/\gamma)/n}\,\sqrt{E\Big[\Big(\frac{1}{\hat p_s\,\hat e(X_i,s)} - \frac{1}{p_s\, e(X_i,s)}\Big)^2\Big]} + \sqrt[4]{2\log(2K/\gamma)/n}\,\sqrt{E\Big[\big(m_{1,s}(X_i) - \hat m_{1,s}(X_i)\big)^2\Big]}
\quad + \sqrt{2\log(2K/\gamma)/n}. \qquad (62)

Under Assumption 4.2 and the union bound, the result completes.

Lemma A.4. Under Assumptions 2.1, 2.2, 4.1, 4.2, 4.4, the following holds: with probability
at least 1 − γ, for n ≥ H,

\sup_{\pi\in\Pi}\big|A(s,s';\pi) - A_n(s,s';\pi)\big| \le \frac{cM}{\delta^2}\sqrt{\frac{v\log(1/\gamma)}{n}} + \frac{c}{\delta}\,n^{-\eta} \qquad (63)

for a universal constant c < ∞.

Proof of Lemma A.4. We consider the case where s' ≠ s; the case s' = s follows trivially.
Observe that we can write

\sup_{\pi\in\Pi}\big|A(s,s';\pi) - A_n(s,s';\pi)\big|
\le \underbrace{\sup_{\pi\in\Pi}\Big|E_{X(s)}\big[V_{\pi(X(s),s)}(X(s),s')\big] - \frac{1}{n}\sum_{i=1}^n\Big(\frac{1\{S_i=s\}}{\hat p_s}\,\hat m_{1,s'}(X_i)\pi(X_i,s) + \frac{1\{S_i=s\}}{\hat p_s}\,\hat m_{0,s'}(X_i)(1-\pi(X_i,s))\Big)\Big|}_{(A)}
\quad + \underbrace{\sup_{\pi\in\Pi}\big|\hat W_{s'}(\pi) - W_{s'}(\pi)\big|}_{(B)}. \qquad (64)

The term (B) is bounded as discussed in Lemma A.3. Therefore, we are only left to discuss
bounds on (A). To derive such bounds, we first observe that we can write

\sup_{\pi\in\Pi}\Big|E_{X(s)}\big[V_{\pi(X(s),s)}(X(s),s')\big] - \frac{1}{n}\sum_{i=1}^n\Big(\frac{1\{S_i=s\}}{\hat p_s}\,\hat m_{1,s'}(X_i)\pi(X_i,s) + \frac{1\{S_i=s\}}{\hat p_s}\,\hat m_{0,s'}(X_i)(1-\pi(X_i,s))\Big)\Big|
\le \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s) - E\Big[\frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s)\Big]\Big|}_{(I)}
\quad + \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,m_{0,s'}(X_i)(1-\pi(X_i,s)) - E\Big[\frac{1\{S_i=s\}}{p_s}\,m_{0,s'}(X_i)(1-\pi(X_i,s))\Big]\Big|}_{(II)}
\quad + \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \Big(\frac{1\{S_i=s\}}{\hat p_s}\,\hat m_{1,s'}(X_i) - \frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\Big)\pi(X_i,s)\Big|}_{(III)}
\quad + \underbrace{\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \Big(\frac{1\{S_i=s\}}{\hat p_s}\,\hat m_{0,s'}(X_i) - \frac{1\{S_i=s\}}{p_s}\,m_{0,s'}(X_i)\Big)(1-\pi(X_i,s))\Big|}_{(IV)}.

We discuss (I) and (III); (II) and (IV) follow similarly. Observe first that by Assumption 4.1
and the bounded difference inequality, with probability at least 1 − γ,

\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s) - E\Big[\frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s)\Big]\Big|
\le E\Big[\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s) - E\Big[\frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s)\Big]\Big|\Big] + \bar C\,\frac{M}{\delta}\sqrt{\log(2/\gamma)/n}

for a constant \bar C < ∞. Using the symmetrization argument (Van Der Vaart and Wellner,
1996), we have

E\Big[\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s) - E\Big[\frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s)\Big]\Big|\Big] \le 2\,E\Big[\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \sigma_i\,\frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s)\Big|\Big],

where the σ_i are i.i.d. Rademacher random variables. Since m_{1,s'} is uniformly bounded and
p_s is bounded away from zero, and by Assumption 4.1, we obtain by the properties of Dudley's
entropy integral (Wainwright, 2019)

E\Big[\sup_{\pi\in\Pi}\Big|\frac{1}{n}\sum_{i=1}^n \sigma_i\,\frac{1\{S_i=s\}}{p_s}\,m_{1,s'}(X_i)\pi(X_i,s)\Big|\Big] \le \bar C\,\frac{M}{\delta}\sqrt{v/n}

for a universal constant \bar C < ∞. We now move to bound (III). Using the triangle
inequality and Hölder's inequality, we obtain

(III) \le \frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,\big|m_{1,s'}(X_i) - \hat m_{1,s'}(X_i)\big|. \qquad (65)

The above bound is deterministic and does not depend on π. Observe now that, by
Equation (1),

\frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,\big|m_{1,s'}(X_i) - \hat m_{1,s'}(X_i)\big| = \frac{1}{n}\sum_{i=1}^n \frac{1\{S_i=s\}}{p_s}\,\big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big| \le \frac{1}{n\delta}\sum_{i=1}^n \big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big|. \qquad (66)

We now separate the contribution of each of the K folds used in the cross-fitting algorithm.
Namely,

\frac{1}{n\delta}\sum_{i=1}^n \big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big| \le \sum_{k\in\{1,\dots,K\}}\frac{1}{n\delta}\sum_{i\in I_k}\big|m_{1,s'}(X_i(s)) - \hat m^{(-k)}_{1,s'}(X_i(s))\big|, \qquad (67)

where I_k denotes the set of indexes in fold k, and \hat m^{(-k)}_{1,s'} denotes the estimator obtained
from all folds except k. Next, we bound the following term using Lyapunov's inequality:

\frac{1}{n}\sum_{i\in I_k} E\Big[\big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big|\,\Big|\,\hat m^{(-k)}\Big] \le E\Big[\big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big|\,\Big|\,\hat m^{(-k)}\Big] \le \sqrt{E\Big[\big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big|^2\,\Big|\,\hat m^{(-k)}\Big]} \le c\,n^{-\eta}. \qquad (68)

The last inequality follows by Assumption 4.4, for a universal constant c < ∞. Finally, we
discuss exponential concentration of the empirical counterpart. For n ≥ H,

\sup_{x\in\mathcal{X}}\big|m_{d,s'}(x) - \hat m^{(-k)}_{d,s'}(x)\big| \le 2M. \qquad (69)

By the bounded difference inequality, with probability at least 1 − γ, for n > H,

\frac{1}{n}\sum_{i\in I_k}\big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big| \le E\Big[\big|m_{1,s'}(X_i(s)) - \hat m_{1,s'}(X_i(s))\big|\Big] + 4M\sqrt{\log(2/\gamma)/n}. \qquad (70)

Combining the above bounds, the proof completes.

Lemma A.5. Let

G(\alpha) = \sup_{\pi\in\Pi}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\}. \qquad (71)

Define G = {G(α), α ∈ (0, 1)}. Under Assumption 4.1, for any ε > 0, there exists a set
{α_1, ..., α_{N(ε)}} such that for all α ∈ (0, 1),

\big|G(\alpha) - \max_{j\in\{1,\dots,N(\varepsilon)\}} G(\alpha_j)\big| \le 4\varepsilon M, \quad a.s., \qquad (72)

and N(ε) ≤ 1 + 1/ε.

Proof of Lemma A.5. We denote by {α_1, ..., α_{N(ε)}} an ε-cover of the interval (0, 1) with respect
to the L_1 norm; namely, {α_1, ..., α_{N(ε)}} are equally spaced numbers in (0, 1). Clearly, the
covering number satisfies N(ε) ≤ 1 + 1/ε. We denote

G(\alpha) = \sup_{\pi\in\Pi}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\}. \qquad (73)

To characterize the corresponding cover of the function class G = {G(α), α ∈ (0, 1)}, we
claim that for any α ∈ (0, 1) there exists an α_j in the ε-cover such that

|G(\alpha) - G(\alpha_j)| \le 4\varepsilon M. \qquad (74)

Such a result follows by the argument outlined in the following lines: take the α_j closest to α
and consider

|G(\alpha) - G(\alpha_j)|
= \Big|\sup_{\pi\in\Pi}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\Pi}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} + \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big|
\le \underbrace{\Big|\sup_{\pi\in\Pi}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\Pi}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big|}_{(i)}
\quad + \underbrace{\Big|\sup_{\pi\in\hat\Pi^o}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big|}_{(ii)}. \qquad (75)

We study (i) and (ii) separately. Consider first (i). We observe the following fact: whenever

\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \sup_{\pi\in\Pi}\,\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi) > 0, \qquad (76)

then we can bound

(i) \le \alpha W_0(\pi^*) + (1-\alpha)W_1(\pi^*) - \big(\alpha_j W_0(\pi^*) + (1-\alpha_j)W_1(\pi^*)\big). \qquad (77)

Here π* ∈ arg sup_{π∈Π} αW_0(π) + (1−α)W_1(π). When instead

\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \sup_{\pi\in\Pi}\,\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi) \le 0, \qquad (78)

we can use the same argument by switching signs, which, with trivial rearrangement, reads as

(i) \le \alpha W_0(\pi^{**}) + (1-\alpha)W_1(\pi^{**}) - \big(\alpha_j W_0(\pi^{**}) + (1-\alpha_j)W_1(\pi^{**})\big). \qquad (79)

Here π** ∈ arg sup_{π∈Π} α_j W_0(π) + (1−α_j)W_1(π). Therefore we obtain

(i) \le \sup_{\pi\in\Pi}\big|\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \alpha_j W_0(\pi) - (1-\alpha_j)W_1(\pi)\big| \le 2|\alpha - \alpha_j|\,M, \qquad (80)

where the last inequality follows by Assumption 4.1 and the triangle inequality. Similar
reasoning also applies to (ii). Therefore, we observe that the covering number of the
function class G for a 4Mε-cover is at most 1/ε + 1.

A.2 Additional Lemmas
Proof of Lemma 2.1
The proof follows arguments similar to those in standard microeconomic textbooks (Mas-Colell et al., 1995).
Let

\tilde\Pi = \big\{\pi_\alpha : \pi_\alpha \in \arg\sup_{\pi\in\Pi}\,\alpha_1 W_0(\pi) + \alpha_2 W_1(\pi),\ \alpha\in\mathbb{R}^2_+,\ \alpha_1 + \alpha_2 > 0\big\}. \qquad (81)

We want to show that Π^o = \tilde\Pi. Trivially \tilde\Pi ⊆ Π^o, since otherwise the definition of
Pareto optimality would be violated. Consider now some π* ∈ Π^o. We show that there
exists a vector α ∈ R²_+ such that π* maximizes the expression

\sup_{\pi\in\Pi}\,\alpha_1 W_0(\pi) + \alpha_2 W_1(\pi). \qquad (82)

Denote the set

\mathcal{F} = \big\{(\tilde W_0, \tilde W_1)\in\mathbb{R}^2 : \exists\,\pi\in\Pi : \tilde W_0 \le W_0(\pi) \text{ and } \tilde W_1 \le W_1(\pi)\big\}. \qquad (83)

Since (0, 0) ∈ F, such a set is non-empty. Notice now that W_s(π) is linear in π for
s ∈ {0, 1}, and hence concave in π. Therefore, the set F is convex, since it is the sub-graph
of a concave functional. We denote W̄ = (W_0(π*), W_1(π*)) and G = R²_{++} + W̄ the set of
welfares that strictly dominate π*. Then G is non-empty and convex. Since π* ∈ Π^o, we
must have that F ∩ G = ∅. Therefore, by the separating hyperplane theorem, there exists
an α ∈ R², with α ≠ 0, such that α^⊤F ≤ α^⊤(W̄ + d) for any F ∈ F, d ∈ R²_{++}. Letting
d_1 → ∞, it must be that α_1 ∈ R_+, and similarly for α_2; so α ∈ R²_+. By letting d → 0, we
have that α^⊤F ≤ α^⊤W̄. This implies that

\alpha_1 W_0(\pi) + \alpha_2 W_1(\pi) \le \alpha_1 W_0(\pi^*) + \alpha_2 W_1(\pi^*) \qquad (84)

for any π ∈ Π (since it is true for any F ∈ F). Hence π* maximizes welfare over all
feasible allocations once reweighted by (α_1, α_2). Since the maximizer is invariant to
multiplication of the objective function by positive constants, the result follows after dividing
the objective function by the sum of the coefficients, which is non-zero by the separating
hyperplane theorem. This completes the proof.

Proof of Lemma 2.2


First, observe that by rationality, preferences are complete and transitive. Observe then
that the preference relation corresponds to lexicographic preferences, with π ≻ π' if π Pareto
dominates π'; if instead neither π nor π' Pareto dominates the other, then π ≻ π' if
UnFairness(π) < UnFairness(π'). Therefore, it must be that C(Π) ⊆ Π^o, since otherwise we
could find a policy π ∈ C(Π) whose welfares for both groups are dominated by some other
policy π' ∈ Π; under rational preferences, this would contradict the definition of C(Π).
Observe now that, by definition, C(Π) contains the set of Pareto optimal allocations that
achieve minimal UnFairness. By Lemma 2.1, the set of Pareto optimal policies coincides with
the policies satisfying the constraint in Definition 2.2. We use a contradiction argument for
this statement: suppose not; then there exists a π_α satisfying such a constraint for some α,
but such that π_α ∉ Π^o. However, this means that

\pi_\alpha \in \arg\sup_{\pi\in\Pi}\,\alpha W_1(\pi) + (1-\alpha)W_0(\pi),

which contradicts Lemma 2.1. Therefore, the expression finds the allocation with the lowest
unfairness within the Pareto optimal set, which completes the proof.

A.3 Proof of Theorem 4.1


Throughout the proof we refer to \bar C < ∞ as a universal constant. Recall Lemma A.5,
where we choose a cover with N(ε) = N and ε ≤ 1/N. First, observe that we can write

\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \max_{\alpha_j\in\{\alpha_1,\dots,\alpha_N\}}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\}
\le \underbrace{\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \alpha\hat W_0(\pi) - (1-\alpha)\hat W_1(\pi)}_{(I)}
\quad + \underbrace{\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\,\alpha\hat W_0(\pi) + (1-\alpha)\hat W_1(\pi) - \max_{\alpha_j\in\{\alpha_1,\dots,\alpha_N\}}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\}}_{(II)}. \qquad (85)

The term (I) is bounded as in Lemma A.3. The term (II) is bounded as follows:

(II) \le \varepsilon\,\sup_{\pi\in\Pi}|\hat W_0(\pi)| + \varepsilon\,\sup_{\pi\in\Pi}|\hat W_1(\pi)|. \qquad (86)

Under Assumption 4.4 and Assumptions 4.1 and 2.2, for n ≥ H the estimated conditional
mean and propensity score are uniformly bounded. Therefore we obtain that

\varepsilon\,\sup_{\pi\in\Pi}|\hat W_0(\pi)| + \varepsilon\,\sup_{\pi\in\Pi}|\hat W_1(\pi)| \le \bar C\,\varepsilon\,\frac{M}{\delta^2} \le \bar C\,\frac{M}{N\delta^2},

which concludes the proof.

A.4 Proof of Theorem 4.2
Proof. Throughout the proof we refer to \bar C < ∞ as a universal constant. Recall Lemma
A.5, where we choose a cover with N(ε) = N and ε ≤ 1/N.
Let G(α) be as defined in Lemma A.5. Clearly, we have by definition of Π^o that

\sup_{\pi\in\Pi^o}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} = G(\alpha).

Observe now that by Lemma A.5,

\sup_{\alpha\in(0,1)} G(\alpha) - \max_{j\in\{1,\dots,N(\varepsilon)\}} G(\alpha_j) \le 4M\varepsilon. \qquad (87)

Therefore, we only have to bound \max_{j\in\{1,\dots,N(\varepsilon)\}} G(\alpha_j). Observe now that the following holds:

\max_{j\in\{1,\dots,N(\varepsilon)\}} G(\alpha_j) = \max_{j\in\{1,\dots,N(\varepsilon)\}}\Big[\sup_{\pi\in\Pi}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big]
\le \underbrace{\max_{j\in\{1,\dots,N(\varepsilon)\}}\Big[\sup_{\pi\in\Pi}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} - \sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big]}_{(I)}
\quad + \underbrace{\max_{j\in\{1,\dots,N(\varepsilon)\}}\Big[\sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big]}_{(II)}. \qquad (88)

The term (I), by definition of Π^{o,n}, since it contains empirical Pareto optimal
policies, and by trivial rearrangement, is bounded as follows:

\max_{j\in\{1,\dots,N(\varepsilon)\}}\Big[\sup_{\pi\in\Pi}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} - \sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big]
\le 2\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\Big|\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \alpha\hat W_0(\pi) - (1-\alpha)\hat W_1(\pi)\Big|, \qquad (89)

where the right-hand side is measurable by Assumption 4.1 (C). The above term is bounded
by Lemma A.3. The term (II) is instead bounded as follows:

(II) = \max_{j\in\{1,\dots,N(\varepsilon)\}}\Big[\sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\}\Big]
= \max_{j\in\{1,\dots,N(\varepsilon)\}}\Big[\sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} - \sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\}
\quad\quad - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j W_0(\pi) + (1-\alpha_j)W_1(\pi)\big\} + \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\}
\quad\quad + \sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\}\Big]
\le 2\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\Big|\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha\hat W_0(\pi) + (1-\alpha)\hat W_1(\pi)\big)\Big|
\quad + \underbrace{\max_{j\in\{1,\dots,N(\varepsilon)\}}\Big[\sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\}\Big]}_{(III)}. \qquad (90)

Observe that by definition of Π^{o,n} and \hat\Pi^o, since α_j is in the cover, we have that

\sup_{\pi\in\Pi^{o,n}}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha_j \hat W_0(\pi) + (1-\alpha_j)\hat W_1(\pi)\big\} = 0.

Therefore,

(II) \le 2\sup_{\alpha\in(0,1)}\sup_{\pi\in\Pi}\Big|\alpha W_0(\pi) + (1-\alpha)W_1(\pi) - \big(\alpha\hat W_0(\pi) + (1-\alpha)\hat W_1(\pi)\big)\Big|.

Combining the results with Lemma A.3, the proof completes.

A.5 Proof of Theorem 4.4


Before discussing the formal proof, we introduce an additional auxiliary lemma, which
builds on the previous theorem.

A.5.1 Auxiliary Lemmas



Lemma A.6. Suppose that the conditions in Theorem 4.2 hold. Let N = n. Then, for
some Ñ > 0, which only depends on δ and M, and for n ≥ Ñ,

P\Big(\exists\,\mathcal{K}\subseteq\hat\Pi^o : \forall\,\alpha\in(0,1)\ \exists\,\pi_\alpha\in\mathcal{K}\cap\arg\sup_{\pi\in\Pi}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\}\Big) \ge 1 - 1/\sqrt{n}. \qquad (91)

Proof of Lemma A.6. The result is trivial if Π is empty; let therefore Π be non-empty.
Throughout the rest of the proof we consider a cover with N = n elements. The argument
proceeds as follows. We choose Ñ such that for each α ∈ (0, 1),

\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi) \ge \alpha W_0(\pi) + (1-\alpha)W_1(\pi) + \frac{cM}{\delta^2}\sqrt{\frac{v\log(c_0\tilde N)}{\tilde N}}, \qquad (92)
\quad \forall\,\pi\in\Pi\setminus\Big\{\arg\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\Big\},\ \forall\,\alpha\in(0,1),

where the constants are as in Theorem 4.2.
Notice now that by Theorem 4.2, with probability at least 1 − 1/√n, for all α ∈ (0, 1),

\sup_{\pi\in\Pi}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} - \sup_{\pi\in\hat\Pi^o}\big\{\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\big\} \le \frac{cM}{\delta^2}\sqrt{\frac{v\log(c_0 n)}{n}}. \qquad (93)

Therefore, for each α, there exists some π̂* ∈ \hat\Pi^o such that

\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi) \le \alpha W_0(\hat\pi^*) + (1-\alpha)W_1(\hat\pi^*) + \frac{cM}{\delta^2}\sqrt{\frac{v\log(c_0 n)}{n}}. \qquad (94)

Observe now that trivially π̂* ∈ Π. We now claim that for n ≥ Ñ,

\hat\pi^* \in \Big\{\arg\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi)\Big\}.

We show this by contradiction. Suppose that π̂* ∉ {arg sup_{π∈Π} αW_0(π) + (1−α)W_1(π)}.
Then Equation (92) must hold for π̂*, which contradicts Equation (93) since n ≥ Ñ. Since
such a statement is true for any α ∈ (0, 1), the proof completes.

Lemma A.7. Under the conditions in Theorem 4.2 and Assumption 4.3, for n ≥ max{Ñ, H},
with Ñ as defined in Lemma A.6, the following holds:

P\Big(\inf_{\pi\in\hat\Pi^o}\big\{A_n(0,1;\pi) + A_n(1,0;\pi)\big\} - \inf_{\pi\in\Pi^o}\big\{A_n(0,1;\pi) + A_n(1,0;\pi)\big\} \le 0\Big) \ge 1 - 1/\sqrt{n}. \qquad (95)

Proof of Lemma A.7. Observe first that the result is trivial if Π^o ⊆ \hat\Pi^o. Suppose instead
that Π^o ⊄ \hat\Pi^o. Then by Lemma A.6, with probability at least 1 − 1/√n, we can find a set
K ⊆ Π^o defined as follows: for each α ∈ (0, 1), there exists a unique π_α ∈ K such that π_α
satisfies

\pi_\alpha \in \arg\sup_{\pi\in\Pi}\,\alpha W_0(\pi) + (1-\alpha)W_1(\pi).

In addition, such a set satisfies K ⊆ \hat\Pi^o. Under Assumption 4.3 and the definition of A_n,
we have that

\inf_{\pi\in\Pi^o}\big\{A_n(0,1;\pi) + A_n(1,0;\pi)\big\} = \inf_{\pi\in\mathcal{K}}\big\{A_n(0,1;\pi) + A_n(1,0;\pi)\big\},

since A_n only depends on the image of π, and since, under Assumption 4.3, for any π ∉ K
we can find a π' ∈ K such that π and π' have the same image at each value in the domain.
Therefore, the left-hand side in Equation (95) simplifies to

\inf_{\pi\in\hat\Pi^o}\big\{A_n(0,1;\pi) + A_n(1,0;\pi)\big\} - \inf_{\pi\in\mathcal{K}}\big\{A_n(0,1;\pi) + A_n(1,0;\pi)\big\}.

Observe now that, since K ⊆ \hat\Pi^o, the proof completes.

A.5.2 Proof of the Theorem


For notational convenience, we define
B(Πo ) := inf A(0, 1; π) + A(1, 0; π)
π∈Πo
Bn (Πo ) := inf An (0, 1; π) + An (1, 0; π)
π∈Πo

B(Π̂o ) := inf A(0, 1; π) + A(1, 0; π) (96)


π∈Π̂o

Bn (Π̂o ) := inf An (0, 1; π) + An (1, 0; π)


π∈Π̃o,n

B̂ := A(0, 1; π̂) + A(1, 0; π̂).


First, we observe that the following equality holds:
B̂ − B(Πo ) = Bn (Πo ) − B(Πo ) + Bn (Π̂o ) − Bn (Πo ) + B(Π̂o ) − Bn (Π̂o ) + B̂ − B(Π̂o ) . (97)
| {z } | {z } | {z } | {z }
(A) (B) (C) (D)

Observe (A) and (C) first. Then since Πo , Π̂o ⊆ Π, we have by the triangular inequality

(A) ≤ 2 sup A(s, s0 ; π) − An (s, s0 ; π) . (98)

π∈Π,maxs,s0 ∈{0,1}

Since also Π̂o ⊆ Π, the same inequality also holds for (C). By Lemma A.4, with probability
at least 1 − γ, we have r
M log(1/γ)
(A), (C) ≤ c . (99)
δ n

52
for a constant c < ∞.

We now study (B). By Lemma A.7, for n ≥ Ñ , with probability at least 1 − 1/ n, we
have
(B) ≤ 0.
Consider now (D). With simple rearrengement, we can write

(D) ≤ 2 sup A(s, s0 ; π) − An (s, s0 ; π) (100)

π∈Π̂o ,s,s0 ∈{0,1}

Since Π̂o ⊆ Π, we have



2 sup A(s, s0 ; π) − An (s, s0 ; π) ≤ 2 sup A(s, s0 ; π) − An (s, s0 ; π) . (101)

π∈Π̂o ,s,s0 ∈{0,1} π∈Π,s,s0 ∈{0,1}

The above term is bounded as in Lemma A.4. Collecting our results, by setting γ =

1/ n, and using the union bound, the result follows.
