variables, such as race/ethnicity/sex/age) does not reflect positive-class membership (such as criminal/fraud), OD produces unjust outcomes. Surprisingly, fairness-aware OD has been almost untouched in prior work, as fair machine learning literature mainly focuses on supervised settings. Our work aims to bridge this gap. Specifically, we develop desiderata capturing well-motivated fairness criteria for OD, and systematically formalize the fair OD problem. Further, guided by our desiderata, we propose FairOD, a fairness-aware outlier detector that has the following desirable properties: FairOD (1) exhibits treatment parity at test time, (2) aims to flag equal proportions of samples from all groups (i.e. obtain group fairness, via statistical parity), and (3) strives to flag truly high-risk samples within each group. Extensive experiments on a diverse set of synthetic and real-world datasets show that FairOD produces outcomes that are fair with respect to protected variables, while performing comparable to (and in some cases, even better than) fairness-agnostic detectors in terms of detection performance.

CCS CONCEPTS
• Computing methodologies → Machine learning algorithms; Anomaly detection.

KEYWORDS
fair outlier detection; outlier detection; anomaly detection; algorithmic fairness; end-to-end detector; deep learning

ACM Reference Format:
Shubhranshu Shekhar, Neil Shah, and Leman Akoglu. 2021. FairOD: Fairness-aware Outlier Detection. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES '21), May 19–21, 2021, Virtual Event, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3461702.3462517

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
AIES '21, May 19–21, 2021, Virtual Event, USA
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8473-5/21/05 . . . $15.00
https://doi.org/10.1145/3461702.3462517

1 INTRODUCTION
Fairness in machine learning (ML) has received a surge of attention in recent years. The community has largely focused on designing different notions of fairness [4, 15, 51] mainly tailored towards supervised ML problems [21, 24, 52]. However, perhaps surprisingly, fairness-aware outlier detection (OD) has remained almost untouched, even though OD is routinely deployed in critical systems, e.g. in law enforcement and finance.

Outlier detection for "policing": In such critical systems, OD is often used to flag instances that reflect riskiness, which are then "policed" (or audited) by human experts. For example, law enforcement agencies might employ automated surveillance systems in public spaces to spot suspicious individuals based on visual characteristics, who are subsequently stopped and frisked. Alternatively, in the financial domain, analysts can police fraudulent-looking claims, and corporate trust and safety employees can police bad actors on social networks.

Group sample size disparity yields unfair OD: Importantly, outlier detectors are designed exactly to spot rare, statistical minority samples¹ with the hope that outlierness reflects riskiness, which prompts their bias against societal minorities (as defined by race/ethnicity/sex/age/etc.) as well, since minority group sample size is by definition small. However, when minority status (e.g. Hispanic) does not reflect positive-class membership (e.g. fraud), OD produces unjust outcomes, by overly flagging the instances from the minority groups as outliers. This conflation of statistical and societal minorities can become an ethical matter.

¹ In this work, the words sample, instance, and observation are used interchangeably throughout the text.

Unfair OD leads to disparate impact: What would happen downstream if we did not strive for fairness-aware OD given the existence of societal minorities? OD models' inability to distinguish societal minorities (as induced by so-called protected variables (𝑃𝑉s)) from statistical minorities contributes to the likelihood of minority group members being flagged as outliers (see Fig. 1). This is further exacerbated by proxy variables which partially-redundantly encode (i.e. correlate with) the 𝑃𝑉(s), by increasing the number of subspaces in which minorities stand out. The result is overpolicing due to over-representation of minorities in OD outcomes. Note that overpolicing the minority group also implies underpolicing the majority group given limited policing capacity and constraints. Overpolicing can also feed back into a system when the policed outliers are used as labels in downstream supervised tasks. Alarmingly, this initially skewed sample (due to unfair OD) may be amplified through a feedback loop via predictive policing, where more outliers are identified in more heavily policed groups. Given that OD's use in societal applications has direct bearing on social well-being, ensuring that OD-based outcomes are non-discriminatory is pivotal. This demands the design of fairness-aware OD models, which our work aims to address.

Prior research and challenges: Abundant work on algorithmic fairness has focused on supervised ML tasks [6, 24, 52].
[Figure 1: three panels. (left) Scatter of Feature 1 vs. Feature 2, with inliers and outliers marked for groups 𝑃𝑉 = 𝑎 and 𝑃𝑉 = 𝑏. (middle) Outlier score distributions per group. (right) Flag rate ratio of the groups. The middle and right panels vary the sample size ratio |X_{𝑃𝑉=𝑎}|/|X_{𝑃𝑉=𝑏}| over 1, 2, 4, 8.]

Figure 1: (left) Simulated 2-dim. data with equal sized groups, i.e. |X_{𝑃𝑉=𝑎}| = |X_{𝑃𝑉=𝑏}|. (middle) Group score distributions induced by 𝑃𝑉 = 𝑎 and 𝑃𝑉 = 𝑏 are plotted by varying the simulated |X_{𝑃𝑉=𝑎}|/|X_{𝑃𝑉=𝑏}| ratio. Notice that the minority group (𝑃𝑉 = 𝑏) receives larger outlier scores as the size ratio increases. (right) Flag rate ratio of the groups for the varying sample size ratio |X_{𝑃𝑉=𝑎}|/|X_{𝑃𝑉=𝑏}|. As we increase size disparity, the minority group is "policed" (i.e. flagged) comparatively more.
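The size-disparity effect illustrated in Figure 1 can be reproduced in a few lines. The following is a toy sketch (hypothetical numbers and group placements, not the paper's simulation): a fairness-agnostic detector that scores each point by its squared distance to the global data mean ends up flagging the statistical-minority group 𝑏 more and more often as the sample size ratio |X_{𝑃𝑉=𝑎}|/|X_{𝑃𝑉=𝑏}| grows.

```python
def flag_rates(ratio, n_b=21, top_frac=0.1):
    # Majority group a clusters around 0.0, minority group b around 2.0;
    # each group takes values {center - 0.5, center, center + 0.5}.
    xs_a = [0.5 * (i % 3 - 1) for i in range(ratio * n_b)]
    xs_b = [2.0 + 0.5 * (i % 3 - 1) for i in range(n_b)]
    data = [(x, "a") for x in xs_a] + [(x, "b") for x in xs_b]
    mean = sum(x for x, _ in data) / len(data)
    # Fairness-agnostic outlier score: squared distance to the global mean.
    ranked = sorted(data, key=lambda p: (p[0] - mean) ** 2, reverse=True)
    flagged = ranked[: max(1, int(top_frac * len(data)))]
    fr_a = sum(1 for _, g in flagged if g == "a") / len(xs_a)
    fr_b = sum(1 for _, g in flagged if g == "b") / len(xs_b)
    return fr_a, fr_b
```

With these toy numbers, group 𝑏's flag rate grows with the size ratio: the global mean drifts toward the ever-larger group 𝑎, so group 𝑏's points look increasingly "outlying", mirroring the trend in Figure 1 (right).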
Numerous notions of fairness [4, 51] have been explored in such contexts, each with their own challenges in achieving equitable decisions [15]. Far less attention, however, has been paid to unsupervised OD. Incorporating fairness into OD is challenging, in the face of (1) many possibly-incompatible notions of fairness and (2) the absence of ground-truth labels in the unsupervised setting.

The two works tackling² unfairness in the OD literature are [43], which aims to introduce fairness specifically to the LOF algorithm [10], and Zhang and Davidson [56] (concurrent to our work), which proposes an adversarial-training-based deep SVDD detector. Amongst other issues (see Sec. 5), the approach proposed in [43] invites disparate treatment, necessitating explicit use of 𝑃𝑉 at decision time, leading to taste-based discrimination [16] that is unlawful in several critical applications. On the other hand, the approach in [56] has several drawbacks (see Sec. 5), and in light of its unavailable implementation, we include a similar baseline called arl that we compare against our proposed method.

² [17] aims to quantify fairness of OD model outcomes post hoc, which thus has a different scope.

[Figure 2: scatter plots of Fairness (x-axis) vs. Group Fidelity (y-axis); "Ideal" marks the top-right corner; methods shown: base, rw, dir, lfr, arl, FairOD.]

Figure 2: Fairness (statistical parity) vs. GroupFidelity (group-level rank preservation) of baselines and our proposed FairOD (red cross), (left) averaged across 6 datasets, and (right) on individual datasets. FairOD outperforms existing solutions (tending towards ideal), achieving fairness while preserving group fidelity from the base detector. See Sec. 4 for more details.

Alternatively, one could re-purpose existing fair representation learning techniques [7, 19, 54] as well as data preprocessing strategies [20, 28] for subsequent fair OD. However, as we show in Sec. 4 and discuss in Sec. 5, isolating representation learning from the detection task is suboptimal, largely (needlessly) sacrificing detection performance for fairness.

Our contributions: Our work strives to design a fairness-aware OD model to achieve equitable policing across groups and avoid an unjust conflation of statistical and societal minorities. We summarize our main contributions as follows:

(1) Desiderata & Problem Definition for Fair Outlier Detection: We identify 5 properties characterizing detection quality and fairness in OD as desiderata for fairness-aware detectors. We discuss their justifiability and achievability, based on which we formally define the (unsupervised) fairness-aware OD problem (Sec. 2).

(2) Fairness Criteria & New, Fairness-Aware OD Model: We introduce well-motivated fairness criteria and give mathematical objectives that can be optimized to obey the desiderata. These criteria are universal, in that they can be embedded into the objective of any end-to-end OD model. We propose FairOD, a new detector which directly incorporates the prescribed criteria into its training. Notably, FairOD (1) aims to equalize flag rates across groups, achieving group fairness via statistical parity, while (2) striving to flag truly high-risk samples within each group, and (3) avoiding disparate treatment. (Sec. 3)

(3) Effectiveness on Real-world Data: We apply FairOD on several real-world and synthetic datasets with diverse applications such as credit risk assessment and hate speech detection. Experiments demonstrate FairOD's effectiveness in achieving both fairness goals (Fig. 2) as well as accurate detection (Fig. 6, Sec. 4), significantly outperforming alternative solutions.

Reproducibility: All of our source code and datasets are shared publicly at https://tinyurl.com/fairOD.

2 DESIDERATA FOR FAIR OUTLIER DETECTION
Notation. We are given 𝑁 samples (also, observations or instances) X = {𝑋𝑖}_{𝑖=1}^{𝑁} ⊆ ℝ^𝑑 as the input for OD, where 𝑋𝑖 ∈ ℝ^𝑑 denotes the feature representation for observation 𝑖.
Each observation is additionally associated with a binary³ protected (also, sensitive) variable, PV = {𝑃𝑉𝑖}_{𝑖=1}^{𝑁}, where 𝑃𝑉𝑖 ∈ {𝑎, 𝑏} identifies two groups: the majority (𝑃𝑉𝑖 = 𝑎) group and the minority (𝑃𝑉𝑖 = 𝑏) group. We use Y = {𝑌𝑖}_{𝑖=1}^{𝑁}, 𝑌𝑖 ∈ {0, 1}, to denote the unobserved ground-truth binary labels for the observations where, for exposition, 𝑌𝑖 = 1 denotes an outlier (positive outcome) and 𝑌𝑖 = 0 denotes an inlier (negative outcome). We use 𝑂 : 𝑋 ↦→ {0, 1} to denote the predicted outcome of an outlier detector, and 𝑠 : 𝑋 ↦→ ℝ to capture the corresponding numerical outlier score as the estimate of outlierness. Thus, 𝑂(𝑋𝑖) and 𝑠(𝑋𝑖) respectively indicate the predicted outlier label and outlier score for sample 𝑋𝑖. We use O = {𝑂(𝑋𝑖)}_{𝑖=1}^{𝑁} and S = {𝑠(𝑋𝑖)}_{𝑖=1}^{𝑁} to denote the set of all predicted labels and scores from a given model. Note that we can derive 𝑂(𝑋𝑖) from a simple thresholding of 𝑠(𝑋𝑖). We routinely drop 𝑖-subscripts to refer to properties of a single sample without loss of generality. We denote the group base rate (or prevalence) of outlierness as 𝑏𝑟𝑎 = 𝑃(𝑌 = 1|𝑃𝑉 = 𝑎) for the majority group. Finally, we let 𝑓𝑟𝑎 = 𝑃(𝑂 = 1|𝑃𝑉 = 𝑎) depict the flag rate of the detector for the majority group. Similar definitions extend to the minority group with 𝑃𝑉 = 𝑏. Table 1 gives a list of the notations frequently used throughout the paper.

³ For simplicity of presentation, we consider a single, binary protected variable (PV). We discuss extensions to multi-valued PV and multi-attribute PVs in Sec. 3.

Table 1: Frequently used symbols and definitions.

Symbol   Definition
𝑋        𝑑-dimensional feature representation of an observation
𝑌        true label of an observation, w/ values 0 (inlier), 1 (outlier)
𝑃𝑉       binary protected (or sensitive) variable, w/ groups 𝑎 (majority), 𝑏 (minority)
𝑂        detector-assigned label to an observation, w/ value 1 (predicted/flagged outlier)
𝑏𝑟𝑣      base rate of/fraction of ground-truth outliers in group 𝑣, i.e. 𝑏𝑟𝑣 = 𝑃(𝑌 = 1|𝑃𝑉 = 𝑣)
𝑓𝑟𝑣      flag rate of/fraction of flagged observations in group 𝑣, i.e. 𝑓𝑟𝑣 = 𝑃(𝑂 = 1|𝑃𝑉 = 𝑣)

Having presented the problem setup and notation, we state our fair OD problem (informally) as follows.

Informal Problem 1 (Fair Outlier Detection). Given samples X and protected variable values PV, estimate outlier scores S and assign outlier labels O, such that
(i) assigned labels and scores are "fair" w.r.t. the 𝑃𝑉, and
(ii) higher scores correspond to higher riskiness encoded by the underlying (unobserved) Y.

How can we design a fairness-aware OD model that is not biased against minority groups? What constitutes a "fair" outcome in OD, that is, what would characterize fairness-aware OD? What specific notions of fairness are most applicable to OD?

To approach the problem and address these motivating questions, we first propose a list of desired properties that an ideal fairness-aware detector should satisfy, discuss whether these properties can be enforced in practice, and then present our proposed solution, FairOD.

2.1 Proposed Desiderata
D1. Detection effectiveness: We require an OD model to be accurate at detection, such that the scores assigned to the instances by OD are well-correlated with the ground-truth outlier labels. Specifically, OD benefits the policing effort only when the detection rate (also, precision) is strictly larger than the base rate (also, prevalence), that is,

𝑃(𝑌 = 1 | 𝑂 = 1) > 𝑃(𝑌 = 1).  (1)

This condition ensures that any policing effort concerted through the employment of an OD model achieves a strictly larger precision (LHS) than random sampling, where policing via the latter would simply yield a precision equal to the prevalence of outliers in the population (RHS) in expectation. Note that our first condition in (1) is related to detection performance, and specifically, the usefulness of OD itself for policing applications.

How-to: We can indirectly control for detection effectiveness via careful feature engineering. Assuming domain experts assist in feature design, it would be reasonable to expect a better-than-random detector that satisfies Eq. (1).

Next, we present fairness-related conditions for OD.

D2. Treatment parity: OD should exhibit non-disparate treatment, explicitly avoiding the use of 𝑃𝑉 for producing a decision. In particular, OD decisions should obey

𝑃(𝑂 = 1 | 𝑋) = 𝑃(𝑂 = 1 | 𝑋, 𝑃𝑉 = 𝑣), ∀𝑣.  (2)

In words, the probability that the detector outputs an outlier label 𝑂 for a given feature vector 𝑋 remains unchanged even upon observing the value of the 𝑃𝑉. In many settings (e.g. employment), explicit 𝑃𝑉 use is unlawful at inference.

How-to: We can build an OD model using a disparate learning process [36] that uses 𝑃𝑉 only during the model training phase, but does not require access to 𝑃𝑉 for producing a decision, hence satisfying treatment parity.

Treatment parity ensures that OD decisions are effectively "blindfolded" to the 𝑃𝑉. However, this notion of fairness alone is not sufficient to ensure equitable policing across groups; namely, removing the 𝑃𝑉 from scope may still allow discriminatory OD results for the minority group (e.g., African American) due to the presence of several other features (e.g., zipcode) that (partially-)redundantly encode the 𝑃𝑉. Consequently, by default, OD will use the 𝑃𝑉 indirectly, through access to those correlated proxy features. Therefore, additional conditions follow.

D3. Statistical parity (SP): One would expect the OD outcomes to be independent of group membership, i.e. 𝑂 ⊥⊥ 𝑃𝑉. In the context of OD, this notion of fairness (also, demographic parity, group fairness, or independence) aims to enforce that the outlier flag rates are independent of 𝑃𝑉 and equal across the groups as induced by 𝑃𝑉.

Formally, an OD model satisfies statistical parity under a distribution over (𝑋, 𝑃𝑉) where 𝑃𝑉 ∈ {𝑎, 𝑏} if

𝑓𝑟𝑎 = 𝑓𝑟𝑏, or equivalently, 𝑃(𝑂 = 1|𝑃𝑉 = 𝑎) = 𝑃(𝑂 = 1|𝑃𝑉 = 𝑏).  (3)

SP implies that the fraction of minority (majority) members in the flagged set is the same as the fraction of minority (majority) members in the overall population. Equivalently, one can show

𝑓𝑟𝑎 = 𝑓𝑟𝑏 (SP) ⟺ 𝑃(𝑃𝑉 = 𝑎|𝑂 = 1) = 𝑃(𝑃𝑉 = 𝑎) and 𝑃(𝑃𝑉 = 𝑏|𝑂 = 1) = 𝑃(𝑃𝑉 = 𝑏).  (4)
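Given labels, flags, and group memberships, conditions (1) and (3) are directly checkable. A minimal sketch on toy vectors (hypothetical values; `flag_frac` is an assumed knob, not a quantity from the paper):

```python
def flags_from_scores(scores, flag_frac=0.2):
    # Derive O(X_i) by simple thresholding of s(X_i): flag the top fraction.
    k = max(1, int(flag_frac * len(scores)))
    thresh = sorted(scores, reverse=True)[k - 1]
    return [1 if s >= thresh else 0 for s in scores]

def precision_and_base_rate(flags, labels):
    # D1, Eq. (1): precision P(Y=1 | O=1) should exceed base rate P(Y=1).
    flagged = [y for o, y in zip(flags, labels) if o == 1]
    return sum(flagged) / len(flagged), sum(labels) / len(labels)

def flag_rates_by_group(flags, pv):
    # D3, Eq. (3): fr_v = P(O=1 | PV=v) should be equal across groups.
    rates = {}
    for g in set(pv):
        members = [o for o, p in zip(flags, pv) if p == g]
        rates[g] = sum(members) / len(members)
    return rates
```

Note that only the D3 check is computable in the unsupervised setting; the D1 check needs the (unavailable) ground-truth labels and is shown here purely to make Eq. (1) concrete.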
The motivation for SP derives from luck egalitarianism [31], a family of egalitarian theories of distributive justice that aim to counteract the distributive effects of "brute luck". By redistributing equality to those who suffer through no fault of their own choosing, mediated via race, gender, etc., it aims to counterbalance the manifestations of such "luck". Correspondingly, SP ensures equal flag rates across 𝑃𝑉 groups, eliminating such group-membership bias. Therefore, it merits incorporation in OD since OD results are used for policing or auditing by human experts in downstream applications.

How-to: We could enforce SP during OD model learning by comparing the distributions of the predicted outlier labels 𝑂 amongst groups, and updating the model to ensure that these output distributions match across groups.

SP, however, is not sufficient to ensure both equitable and accurate outcomes, as it permits so-called "laziness" [4]. Being an unsupervised quantity that is agnostic to the ground-truth labels Y, SP could be satisfied while producing decisions that are arbitrarily inaccurate for any or all of the groups. In fact, an extreme scenario would be random sampling, where we select a certain fraction of the given population uniformly at random and flag all the sampled instances as outliers. As evident via Eq. (4), this entirely random procedure would achieve SP (!). The outcomes could be worse, that is, not only inaccurate (put differently, as accurate as random) but also unfair for only some group(s), when OD flags mostly the true outliers from one group while flagging randomly selected instances from the other group(s), leading to discrimination despite SP. Therefore, additional criteria are required to explicitly penalize "laziness," aiming to flag not only equal fractions of instances across groups but also the true outlier instances from both groups.

D4. Group fidelity (also, Equality of Opportunity): It is desirable that the true outliers are equally likely to be assigned higher scores, and in turn flagged, regardless of their membership to any group as induced by 𝑃𝑉. We refer to this notion of fairness as group fidelity, which steers OD outcomes toward being faithful to the ground-truth outlier labels equally across groups, obeying the following condition:

𝑃(𝑂 = 1|𝑌 = 1, 𝑃𝑉 = 𝑎) = 𝑃(𝑂 = 1|𝑌 = 1, 𝑃𝑉 = 𝑏).  (5)

Mathematically, this condition is equivalent to the so-called Equality of Opportunity⁴ in the supervised fair ML literature, and is a special case of Separation [24, 51]. In either case, it requires that all 𝑃𝑉-induced groups experience the same true positive rate. Consequently, it penalizes "laziness" by ensuring that the true-outlier instances are ranked above (i.e., receive higher outlier scores than) the inliers within each group.

⁴ Opportunity, because positive-class assignment by a supervised model in many fair ML problems is often associated with a positive outcome, such as being hired or approved for a loan.

The key caveat here is that (5) is a supervised quantity that requires access to the ground-truth labels Y, which are explicitly unavailable for the unsupervised OD task. What is more, various impossibility results have shown that certain fairness criteria, including SP and Separation, are mutually exclusive or incompatible [4], implying that simultaneously satisfying both of these conditions (exactly) is not possible.

How-to: The unsupervised OD task does not have access to Y; therefore, group fidelity cannot be enforced directly. Instead, we propose to enforce group-level rank preservation that maintains fidelity to within-group ranking from the base model, where base is a fairness-agnostic OD model. Our intuition is that rank preservation acts as a proxy for group fidelity, or more broadly Separation, via our assumption that within-group ranking in the base model is accurate and that top-ranked instances within each group encode the highest-risk samples within each group.

Specifically, let 𝜋^base represent the ranking of instances based on base OD scores, and let 𝜋^base_{𝑃𝑉=𝑎} and 𝜋^base_{𝑃𝑉=𝑏} denote the group-level ranked lists for the majority and minority groups, respectively. Then, rank preservation is satisfied when 𝜋^base_{𝑃𝑉=𝑣} = 𝜋_{𝑃𝑉=𝑣}, ∀𝑣 ∈ {𝑎, 𝑏}, where 𝜋_{𝑃𝑉=𝑣} is the ranking of group-𝑣 instances based on outlier scores from our proposed OD model. Group rank preservation aims to address the "laziness" issue that can manifest while ensuring SP; we aim to not lose the within-group detection prowess of the original detector while maintaining fairness. Moreover, since we are using only a proxy for Separation, the mutual exclusiveness of SP and Separation may no longer hold, though we have not established this mathematically.

D5. Base rate preservation: The flagged outliers from OD results are often audited and then used as human-labeled data for supervised detection (as discussed in the previous section), which can introduce bias through a feedback loop. Therefore, it is desirable that group-level base rates within the flagged population are reflective of the group-level base rates in the overall population, so as to not introduce group bias of outlier incidence downstream. In particular, we expect OD outcomes to ideally obey

𝑃(𝑌 = 1|𝑂 = 1, 𝑃𝑉 = 𝑎) = 𝑏𝑟𝑎, and  (6)
𝑃(𝑌 = 1|𝑂 = 1, 𝑃𝑉 = 𝑏) = 𝑏𝑟𝑏.  (7)

Note that the group-level base rate within the flagged population (LHS) is mathematically equivalent to group-level precision in OD outcomes, and as such, is also a supervised quantity which suffers the same caveat as in D4 regarding the unavailability of Y.

How-to: As noted, Y is not available to an unsupervised OD task. Importantly, provided an OD model satisfies D1 and D3, we show that it cannot simultaneously also satisfy D5, i.e. equal per-group base rates in OD results (flagged observations) and in the overall population.

Claim 1. Detection effectiveness: 𝑃(𝑌 = 1|𝑂 = 1) > 𝑃(𝑌 = 1) and SP: 𝑃(𝑂 = 1|𝑃𝑉 = 𝑎) = 𝑃(𝑂 = 1|𝑃𝑉 = 𝑏) jointly imply that 𝑃(𝑌 = 1|𝑂 = 1, 𝑃𝑉 = 𝑣) > 𝑃(𝑌 = 1|𝑃𝑉 = 𝑣), ∃𝑣.

Proof. We prove the claim in Appendix⁵ A.1. □

⁵ https://tinyurl.com/fairOD

Claim 1 shows an incompatibility and states that, provided D1 and D3 are satisfied, the base rate in the flagged population cannot be equal to (but rather, is an overestimate of) that in the overall population for at least one of the groups. As such, base rates in OD outcomes cannot be reflective of their true values. Instead, one may hope for the preservation of the ratio of the base rates (i.e. it is not impossible).
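The group-level rank preservation proxy from D4's How-to (𝜋^base_{𝑃𝑉=𝑣} = 𝜋_{𝑃𝑉=𝑣}, ∀𝑣) can be checked mechanically. A minimal sketch on toy scores (helper names are made up; ties are ignored for simplicity):

```python
def group_ranking(scores, pv, group):
    # Indices of group members, ordered by decreasing outlier score.
    members = [i for i, g in enumerate(pv) if g == group]
    return sorted(members, key=lambda i: scores[i], reverse=True)

def preserves_group_ranks(base_scores, new_scores, pv):
    # True iff the within-group orderings agree for every PV group.
    return all(
        group_ranking(base_scores, pv, g) == group_ranking(new_scores, pv, g)
        for g in set(pv)
    )
```

Note that the new model's scores may differ arbitrarily from the base scores in magnitude, and may even reorder instances across groups; only the ordering within each group must be preserved.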
As such, a relaxed notion of D5 is to preserve proportional base rates across groups in the OD results, that is,

𝑃(𝑌 = 1|𝑂 = 1, 𝑃𝑉 = 𝑎) / 𝑃(𝑌 = 1|𝑂 = 1, 𝑃𝑉 = 𝑏) = 𝑃(𝑌 = 1|𝑃𝑉 = 𝑎) / 𝑃(𝑌 = 1|𝑃𝑉 = 𝑏).  (8)

Note that ratio preservation still cannot be explicitly enforced, as (8) is also label-dependent. Finally, we show in Claim 2 that, provided D1, D3 and Eq. (8) are all satisfied, the base rate in OD outcomes is an overestimation of the true group-level base rate for every group.

Claim 2. Detection effectiveness: 𝑃(𝑌 = 1|𝑂 = 1) > 𝑃(𝑌 = 1), SP: 𝑃(𝑂 = 1|𝑃𝑉 = 𝑎) = 𝑃(𝑂 = 1|𝑃𝑉 = 𝑏), and Eq. (8): 𝑃(𝑌 = 1|𝑂 = 1, 𝑃𝑉 = 𝑎)/𝑃(𝑌 = 1|𝑂 = 1, 𝑃𝑉 = 𝑏) = 𝑃(𝑌 = 1|𝑃𝑉 = 𝑎)/𝑃(𝑌 = 1|𝑃𝑉 = 𝑏) jointly imply 𝑃(𝑌 = 1|𝑃𝑉 = 𝑣, 𝑂 = 1) > 𝑃(𝑌 = 1|𝑃𝑉 = 𝑣), ∀𝑣.

Proof. We prove the claim in Appendix A.2. □

Claim 1 and Claim 2 indicate that if we have both (𝑖) better-than-random precision (D1) and (𝑖𝑖) SP (D3), interpreting the base rates in OD outcomes for downstream learning tasks would not be meaningful, as they would not be reflective of true population base rates. Due to both these incompatibility results, and also feasibility issues given the lack of Y, we leave base rate preservation, despite it being a desirable property, out of consideration.

2.2 Problem Definition
Based on the definitions and enforceable desiderata, our fairness-aware OD problem is formally defined as follows:

Problem 1 (Fairness-Aware Outlier Detection). Given samples X and protected variable values PV, estimate outlier scores S and assign outlier labels O, to achieve
(i) 𝑃(𝑌 = 1|𝑂 = 1) > 𝑃(𝑌 = 1), [Detection effectiveness]
(ii) 𝑃(𝑂 | 𝑋, 𝑃𝑉 = 𝑣) = 𝑃(𝑂 | 𝑋), ∀𝑣 ∈ {𝑎, 𝑏}, [Treatment parity]
(iii) 𝑃(𝑂 = 1|𝑃𝑉 = 𝑎) = 𝑃(𝑂 = 1|𝑃𝑉 = 𝑏), [Statistical parity]
(iv) 𝜋^base_{𝑃𝑉=𝑣} = 𝜋_{𝑃𝑉=𝑣}, ∀𝑣 ∈ {𝑎, 𝑏}, where base is a fairness-agnostic detector. [Group fidelity proxy]

Given a dataset along with 𝑃𝑉 values, the goal is to design an OD model that builds on an existing base OD model and satisfies the criteria (𝑖)–(𝑖𝑣), following the proposed desiderata D1 – D4.

2.3 Caveats of a Simple Approach

3 FAIRNESS-AWARE OUTLIER DETECTION
In this section, we describe our proposed FairOD, an unsupervised, fairness-aware, end-to-end OD model that embeds our proposed learnable (i.e. optimizable) fairness constraints into an existing base OD model. The key features of our model are that FairOD aims for equal flag rates across groups (statistical parity) and encourages correct top group ranking (group fidelity), while not requiring 𝑃𝑉 for decision-making on new samples (non-disparate treatment). As such, it aims to target the proposed desiderata D1 – D4 as described in Sec. 2.

3.1 Base Framework
Our proposed OD model instantiates a deep autoencoder (AE) framework for the base outlier detection task. However, we remark that the fairness regularization criteria introduced by FairOD can be plugged into any end-to-end optimizable anomaly detector, such as one-class support vector machines [48], deep anomaly detectors [12], variational AEs for OD [3], and deep one-class classifiers [47]. Our choice of AE as the base OD model stems from the fact that AE-inspired methods have been shown to be state-of-the-art outlier detectors [14, 39, 58] and that our fairness-aware loss criteria can be optimized in conjunction with the objectives of such models. The main goal of FairOD is to incorporate our proposed notions of fairness into an end-to-end OD model, irrespective of the choice of the base model family.

The AE consists of two main components: an encoder 𝐺𝐸 : 𝑋 ∈ ℝ^𝑑 ↦→ 𝑍 ∈ ℝ^𝑚 and a decoder 𝐺𝐷 : 𝑍 ∈ ℝ^𝑚 ↦→ 𝑋 ∈ ℝ^𝑑. 𝐺𝐸(𝑋) encodes the input 𝑋 to a hidden vector (also, code) 𝑍 that preserves the important aspects of the input. Then, 𝐺𝐷(𝑍) aims to generate 𝑋′, a reconstruction of the input from the hidden vector 𝑍. Overall, the AE can be written as 𝐺 = 𝐺𝐷 ∘ 𝐺𝐸, such that 𝐺(𝑋) = 𝐺𝐷(𝐺𝐸(𝑋)). For a given AE-based framework, the outlier score for 𝑋 is computed using the reconstruction error as

𝑠(𝑋) = ‖𝑋 − 𝐺(𝑋)‖²₂.  (9)

Outliers tend to exhibit large reconstruction errors because they do not conform to the patterns in the data as coded by an autoencoder, hence the use of reconstruction errors as outlier scores [2, 44, 49]. This scoring function is general in that it applies to many reconstruction-based OD models, which have different parameterizations of the reconstruction function 𝐺. We show in the following how FairOD regularizes the reconstruction loss from base through fairness constraints that are conjointly optimized during the training process. The base OD model optimizes the reconstruction objective given in Eq. (10); FairOD combines it with the fairness regularizers of Eq. (12) and Eq. (13) below into its overall training objective, Eq. (11).
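The reconstruction-error scoring of Eq. (9) can be illustrated without training a network. In this sketch the learned 𝐺 is replaced by a trivial reconstructor (the feature-wise mean of the data), purely as a stand-in for an actual autoencoder; the names are made up:

```python
def mean_reconstructor(data):
    # Stand-in for a trained G = G_D ∘ G_E: "reconstruct" every input as
    # the feature-wise mean of the training data (illustration only).
    d = len(data[0])
    mean = [sum(row[j] for row in data) / len(data) for j in range(d)]
    return lambda x: mean

def outlier_score(x, G):
    # Eq. (9): s(X) = ||X - G(X)||_2^2, the squared reconstruction error.
    return sum((xi - xh) ** 2 for xi, xh in zip(x, G(x)))
```

Even this degenerate 𝐺 assigns larger scores to points far from the bulk of the data; a trained AE plays the same role with a far more expressive notion of "conforming to the patterns in the data".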
where 𝛼 ∈ (0, 1) and 𝛾 > 0 are hyperparameters which govern ∑︁ © ∑︁ 2𝑠 base (𝑋𝑖 ) − 1 ª
L𝐺𝐹 = 1 − (13)
the balance between different fairness criteria and reconstruction dnm
®
𝑣 ∈ {𝑎,𝑏 } « 𝑋𝑖 ∈X𝑃𝑉 =𝑣
quality in the loss function. ¬
The first term in Eq. (11) is the objective for learning the recon- ∑︁
struction (based on base model family) as given in Eq. (10), which
dnm = log2 1 + sigm(𝑠 (𝑋𝑘 ) − 𝑠 (𝑋𝑖 )) · 𝐼 𝐷𝐶𝐺 𝑃𝑉 =𝑣 ,
quantifies the goodness of the encoding 𝑍 via the squared error 𝑋𝑘 ∈X𝑃𝑉 =𝑣
between the original input and its reconstruction generated from exp(−𝑐𝑥)
sigm(𝑥) = 1+exp(−𝑐𝑥) is the sigmoid function where 𝑐 > 0 is
𝑍 . The second component in Eq. (11) corresponds to regularization Í |X | base
introduced to enforce the fairness notion of independence, or sta- the scaling constant, and, 𝐼𝐷𝐶𝐺 𝑃𝑉 =𝑣 = 𝑗=1𝑃𝑉 =𝑣 ((2𝑠 (𝑋 𝑗 ) − 1)
tistical parity (SP) as given in Eq. (4). Specifically, the term seeks /log2 (1 + 𝑗)) is the ideal (hence 𝐼 ), i.e. largest DCG value attainable
to minimize the absolute correlation between the outlier scores for the respective group. Note that IDCG can be computed per
S (used for producing predicted labels O) and protected variable group apriori to model training via base outlier scores alone, and
values PV. L𝑆𝑃 is given as serves as a normalizing constant in Eq. (13).
Í
𝑁 Í𝑁 Note that having trained our model, scoring instances does not
𝑖=1 𝑠 (𝑋𝑖 ) − 𝜇𝑠 𝑖=1 𝑃𝑉𝑖 − 𝜇𝑃𝑉
L𝑆𝑃 = (12) require access to the value of their 𝑃𝑉 , as 𝑃𝑉 is only used in Eq.
𝜎𝑠 𝜎𝑃𝑉 (12) and (13) for training purposes. At test time, the anomaly score
of a given instance 𝑋 is computed simply via Eq. (9). Thus, FairOD
where 𝜇𝑠 = 𝑁1 𝑖=1 𝑠 (𝑋𝑖 ), 𝜎𝑠 = 𝑁1 𝑖=1 (𝑠 (𝑋𝑖 ) − 𝜇𝑠 ) 2 , 𝜇𝑃𝑉 =
Í𝑁 Í𝑁
also fulfills the desiderata on treatment parity.
1 Í𝑁 𝑃𝑉 , and 𝜎 1 Í𝑁 2
𝑁 𝑖=1 𝑖 𝑃𝑉 = 𝑁 𝑖=1 (𝑃𝑉𝑖 − 𝜇𝑃𝑉 ) .
We adapt this absolute correlation loss from [6], which proposed Optimization and Hyperparameter Tuning. We optimize the parame-
its use in a supervised setting with the goal of enforcing statistical ters of FairOD by minimizing the loss function given in Eq. (11) by
parity. As [6] mentions, while minimizing this loss does not guar- using the built-in Adam optimizer [30] implemented in PyTorch.
antee independence, it performs empirically quite well and offers FairOD comes with two tunable hyperparameters, 𝛼 and 𝛾. We
stable training. We observe the same in practice; it leads to minimal associations between OD outcomes and the protected variable (see details in Sec. 4).

Finally, the third component of Eq. (11) emphasizes that FairOD should maintain fidelity to within-group rankings from the base model (penalizing "laziness"). We set up a listwise learning-to-rank objective in order to enforce group fidelity. Our goal is to train FairOD such that it reflects the within-group rankings based on s^base(·) from base. To that end, we employ a listwise ranking loss criterion based on the well-known Discounted Cumulative Gain (DCG) [26] measure, often used to assess ranking quality in information retrieval tasks such as search. For a given ranked list, DCG is defined as

DCG = \sum_{r} \frac{2^{rel_r} - 1}{\log_2(1 + r)}

where rel_r denotes the relevance of the item ranked at the r-th position. In our setting, we use the outlier score s^base(X) of an instance X as its relevance, since we aim to mimic the group-level ranking by base. As such, DCG per group can be re-written as

DCG_{PV=v} = \sum_{X_i \in \mathcal{X}_{PV=v}} \frac{2^{s^{base}(X_i)} - 1}{\log_2\big(1 + \sum_{X_k \in \mathcal{X}_{PV=v}} \mathbb{1}[s(X_i) \le s(X_k)]\big)}

where \mathcal{X}_{PV=a} and \mathcal{X}_{PV=b} respectively denote the sets of observations from the majority and minority groups, and s(X) is the estimated outlier score from our FairOD model under training.

A key challenge with DCG is that it is not differentiable, as it involves ranking (sorting). Specifically, the sum term in the denominator uses the (non-smooth) indicator function 1(·) to obtain the position of instance i as ranked by the estimated outlier scores.

…define a grid for these and pick the configuration that achieves the best balance between SP and our proxy quantity for group fidelity (based on group-level ranking preservation). Note that both of these quantities are unsupervised (i.e., they do not require access to ground-truth labels); therefore, FairOD model selection can be done in a completely unsupervised fashion. We provide further details about hyperparameter selection in Sec. 4.

Generalizing to Multi-valued and Multiple Protected Attributes.
Multi-valued PV. FairOD generalizes beyond binary PV, and easily applies to settings with multi-valued, specifically categorical, PV such as race. Recall that L_SP and L_GF are the loss components that depend on PV. For a categorical PV, L_GF in Eq. (13) would simply remain the same, where the outer sum goes over all unique values of the PV. For L_SP, one could one-hot-encode (OHE) the PV into multiple variables and minimize the correlation of outlier scores with each variable additively. That is, an outer sum would be added to Eq. (12) that goes over the new OHE variables encoding the categorical PV.

Multiple PVs. FairOD can handle multiple different PVs simultaneously, such as race and gender, since the loss components in Eq. (12) and Eq. (13) can be used additively for each PV. However, the caveat to the additive loss is that it only enforces fairness with respect to each individual PV, and may not exhibit fairness for the joint distribution of protected variables [29]. Even though the additive extension may not be ideal, we avoid modeling multiple protected variables as a single PV that induces groups based on values from the cross-product of available values across all PVs. This is because partitioning the data based on the cross-product may yield many small groups, which could cause instability in learning and poor generalization.
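As a concrete illustration, the per-group DCG defined above (base scores as relevance, indicator-based rank positions) can be sketched as follows; the function and variable names are our own assumptions, not code from the paper:

```python
import numpy as np

def group_dcg(s_model, s_base, pv, v):
    """DCG of the model's ranking within group PV = v, using the base
    detector's outlier scores as relevance (illustrative sketch)."""
    idx = np.where(pv == v)[0]
    s_m, s_b = s_model[idx], s_base[idx]
    dcg = 0.0
    for i in range(len(idx)):
        # Rank of instance i = number of in-group scores >= its own score;
        # this indicator-based sum is the non-differentiable part.
        rank = int(np.sum(s_m[i] <= s_m))
        dcg += (2.0 ** s_b[i] - 1.0) / np.log2(1.0 + rank)
    return dcg
```

A model whose within-group ordering matches base attains the highest value, which is why a smoothed version of this quantity is what the group-fidelity term maximizes.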
Table 2: Summary statistics of real-world and synthetic datasets used for evaluation.
p and q is defined as 2/(1/p + 1/q). We use HM to report GroupFidelity since it is (more) sensitive to lower values (than e.g. the arithmetic mean), so that a high GroupFidelity indicates that the group ranking from the base detector is well preserved for both groups.
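For concreteness, the harmonic mean used to aggregate the per-group values can be written as the following trivial sketch:

```python
def harmonic_mean(p, q):
    """HM = 2 / (1/p + 1/q); dominated by the smaller of p and q."""
    return 2.0 / (1.0 / p + 1.0 / q)
```

For example, harmonic_mean(0.9, 0.1) = 0.18, well below the arithmetic mean of 0.5, so a single poorly preserved group drags the aggregate down.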
Top-k Rank Agreement. We also measure how well the final ranking of the method aligns with the purely performance-driven base detector, as base optimizes only for reconstruction error. We compute top-k rank agreement as the Jaccard set similarity between the top-k observations as ranked by two methods. Let \pi^{base}_{[1:k]} denote the top-k of the ranked list based on outlier scores s^base(X_i), and \pi^{detector}_{[1:k]} be the top-k of the ranked list for competing methods, where detector ∈ {rw, dir, lfr, arl, FairOD}. Then the measure is given as

Top-k Rank Agreement = \frac{|\pi^{base}_{[1:k]} \cap \pi^{detector}_{[1:k]}|}{|\pi^{base}_{[1:k]} \cup \pi^{detector}_{[1:k]}|}.

AUC-ratio and AP-ratio. Finally, we consider supervised parity measures based on ground-truth labels, defined as the ratio of ROC AUC and Average Precision (AP) performances across groups; AUC-ratio = AUC_{PV=a}/AUC_{PV=b} and AP-ratio = AP_{PV=a}/AP_{PV=b}.

[Q1] Fairness
In Fig. 2 (presented in the Introduction), FairOD is compared against base, as well as all the preprocessing baselines across datasets. The methods are evaluated using the best configuration of each method⁶ on each dataset. The best hyperparameters for FairOD are the ones for which GroupFidelity and Fairness⁷ are closest to the "ideal" point as indicated in Fig. 2.

…Rank Agreement (computed at the top-5% of ranked lists) of each method evaluated across datasets. The agreement measures the degree of alignment of the ranked results by a method with the fairness-agnostic base detector. In Fig. 4 (left), as averaged over datasets, FairOD achieves better rank agreement as compared to the competitors. In Fig. 4 (right), FairOD approaches ideal statistical parity across datasets while achieving better rank agreement with the base detector. Note that FairOD does not strive for a perfect Top-k Rank Agreement (=1) with base, since base is shown to fall short with respect to our desired fairness criteria. Our purpose in illustrating it is to show that the ranked list by FairOD is not drastically different from base, which simply aims for detection performance.

Next we evaluate the competing methods against supervised (label-aware) fairness metrics. Note that FairOD does not (by design) optimize for these supervised fairness measures. Fig. 5a evaluates the methods against Fairness and a label-aware parity criterion – specifically, group AP-ratio (the ideal AP-ratio is 1). FairOD approaches ideal Fairness as well as ideal AP-ratio across all datasets. FairOD outperforms the competitors on the averaged metrics over datasets (Fig. 5a (left)) and across individual datasets (Fig. 5a (right)). In contrast, the preprocessing baselines are up to ∼5× worse than FairOD on the AP-ratio measure across datasets. Fig. 5b reports the evaluation of methods against Fairness and another label-aware parity measure – specifically, group AUC-ratio (the ideal AUC-ratio is 1). As shown in Fig. 5b (left), FairOD outperforms all the baselines in expectation as averaged over all datasets. Further, in Fig. 5b (right), FairOD consistently approaches ideal AUC-ratio across datasets, while the preprocessing baselines are up to ∼1.9× worse comparatively.

⁶ In Appendix D, for all methods and all datasets, we report detailed values for the different metrics for each PV-induced group.
⁷ Note that we can do model selection in this manner without access to any labels, since both are unsupervised measures.
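The Top-k Rank Agreement amounts to a Jaccard similarity between top-k index sets; a minimal sketch (helper name is our own assumption):

```python
def topk_rank_agreement(scores_base, scores_detector, k):
    """Jaccard similarity between the top-k sets of two outlier-score lists."""
    order = lambda s: sorted(range(len(s)), key=lambda i: -s[i])
    top_base = set(order(scores_base)[:k])
    top_det = set(order(scores_detector)[:k])
    return len(top_base & top_det) / len(top_base | top_det)
```

With k set to 5% of the list length (as in the experiments), identical top-k sets give 1 and disjoint sets give 0.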
We note that, impressively, FairOD approaches parity across different supervised fairness measures despite not being able to optimize for label-aware criteria explicitly.

[Q2] Fairness-accuracy trade-off
In the presence of ground-truth outlier labels, the performance of a detector could be measured using a ranking accuracy metric such as the area under the ROC curve (ROC AUC). In Fig. 6, we compare the AUC performance of FairOD to that of the base detector for all datasets. Notice that each of the symbols (i.e. datasets) is slightly below the diagonal line, indicating that FairOD achieves equal or sometimes even better (!) detection performance as compared to base. The explanation is that since FairOD enforces SP and does not allow "laziness", it addresses the issue of falsely or unjustly flagged minority samples by base, thereby improving detection performance. From Fig. 6, we conclude that FairOD does not trade off detection performance much, and in some cases it even improves performance by eliminating false positives from the minority group, as compared to the performance-driven, fairness-agnostic base.

[Figure 5 plots omitted: (a) Fairness vs. AP-ratio and (b) Fairness vs. AUC-ratio, each averaged over the 6 datasets (left) and per dataset (right), for base, rw, dir, lfr, arl, and FairOD.]
Figure 5: FairOD outperforms all competitors on averaged label-aware parity metrics over datasets (left) and for individual datasets (right): we report Fairness against (a) Group AP-ratio and (b) Group AUC-ratio.

[Q3] Ablation study
Finally, we evaluate the effect of various components in the design of FairOD's objective. Specifically, we compare to the results of two relaxed variants of FairOD, namely FairOD-L and FairOD-C, described as follows.
• FairOD-L: We retain only the SP-based regularization term from the FairOD objective along with the reconstruction error. This relaxation of FairOD is partially based on the method proposed in [6], which minimizes the correlation between model prediction and group membership to the PV. In FairOD-L, the reconstruction error term substitutes the classification loss used in the optimization of [6]. This variant enforces only group fairness to attain SP, which may suffer from "laziness" (hence, FairOD-L) (see Sec. 2).
• FairOD-C: Instead of training with NDCG-based group fidelity regularization, FairOD-C utilizes a simpler regularization, aiming to minimize the correlation (hence, FairOD-C) of the outlier scores per group with the corresponding scores from the base detector. Thus, FairOD-C attempts to maintain group fidelity over the entire ranking within a group, in contrast to FairOD's NDCG-based regularization, which emphasizes the quality of the ranking at the top. Specifically, FairOD-C substitutes L_GF in Eq. (11) with the following:

\mathcal{L}_{GF} = -\sum_{v \in \{a,b\}} \frac{\sum_{X_i \in \mathcal{X}_{PV=v}} \big(s(X_i) - \mu_s\big)\big(s^{base}(X_i) - \mu_{s^{base}}\big)}{\sigma_s \, \sigma_{s^{base}}}
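This correlation-based substitute can be sketched as a sign-flipped per-group Pearson correlation; names are our own, not the authors' code:

```python
import numpy as np

def fairodc_group_fidelity(s_model, s_base, pv):
    """Negative sum, over groups, of the Pearson correlation between the
    model's and the base detector's outlier scores (sketch of the FairOD-C
    regularizer); minimizing it maximizes per-group score correlation."""
    loss = 0.0
    for v in np.unique(pv):
        mask = pv == v
        loss -= float(np.corrcoef(s_model[mask], s_base[mask])[0, 1])
    return loss
```

When the model's scores are perfectly correlated with base within both groups, the loss reaches its minimum of -2 for a binary PV.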
[Figure 6 scatter omitted: base AUC vs. FairOD AUC with the y=x diagonal; datasets: Synthetic-1, Synthetic-2, Adult, Credit-defaults, Abusive tweets, Internet ads.]
Figure 6: ROC AUC of FairOD vs. base: FairOD matches the performance of the base detector, while enforcing fairness criteria (maintaining good performance with fairness).

…Fairness averaged over datasets, and in Fig. 7 (right), the metrics are reported for each individual dataset. FairOD-L approaches SP and achieves comparable Fairness to FairOD except on one dataset, as shown in Fig. 7 (right). This results in lower Fairness compared to FairOD when averaged over datasets, as shown in Fig. 7 (left). However, FairOD-L suffers with respect to GroupFidelity as compared to FairOD. This is because FairOD-L may randomly flag instances to achieve SP, since it does not include any group ranking criterion in its objective. On the other hand, FairOD-C improves Fairness when compared to base, while under-performing on the majority of datasets compared to FairOD across metrics. Since FairOD-C tries to preserve group-level ranking, it trades off on Fairness as
[Figure 7 plots omitted: Group Fidelity vs. Fairness, averaged over the 6 datasets (left) and per dataset (right), for base, FairOD-L, FairOD-C, and FairOD.]
Figure 7: FairOD compared to its variants FairOD-L and FairOD-C across datasets, to evaluate the effect of different regularization components. FairOD-L achieves comparable Fairness to FairOD while compromising GroupFidelity. FairOD-C improves Fairness as compared to base, but is ill-suited to optimizing for GroupFidelity.

…label-aware loss terms with unsupervised reconstruction loss can plausibly extend such methods to OD (by using masked representations as inputs to a detector). However, a common shortcoming is that statistical parity (SP) is employed as the primary fairness criterion in these methods, e.g. in fair principal component analysis [42] and the fair variational autoencoder [37]. To summarize, fair representation learning techniques exhibit two key drawbacks for unsupervised OD: (i) they only employ SP, which may be prone to "laziness", and (ii) isolating embedding from detection makes the embedding oblivious to the task itself, and can therefore yield poor detection performance (as shown in the experiments in Sec. 4).

Strategies for Data De-Biasing. Some of the popular de-biasing methods [28, 33] draw from topics in learning with imbalanced data [25] that employ under- or over-sampling or point-wise weighting of the instances based on the class label proportions to obtain balanced data. These methods apply preprocessing to the data in a manner that is agnostic to the subsequent or downstream task and consider only the fairness notion of SP, which is prone to "laziness."
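As a generic illustration of such point-wise weighting (a sketch of the general idea, not code from any specific cited method), instances can be weighted inversely to the frequency of their class or group label before training:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-instance weights inversely proportional to the frequency of each
    label value, so that rare and common classes contribute equally in
    aggregate (generic reweighting sketch)."""
    values, counts = np.unique(labels, return_counts=True)
    freq = {v: c / len(labels) for v, c in zip(values.tolist(), counts.tolist())}
    return np.array([1.0 / freq[x] for x in labels])
```

Note that such weights depend only on label proportions, which is precisely why the resulting preprocessing is agnostic to the downstream detection task.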