
ECOLE NORMALE SUPERIEURE PARIS-SACLAY

On Dataset Bias and Recognition


Models

by
Rostant Noyiessie Ndebeka

A master's thesis submitted in partial fulfillment of the requirements for the
degree of Master of Mathematics

in the
Mathematics department

December 2019
Declaration of Authorship
I, NOYIESSIE NDEBEKA ROSTANT, declare that this master's thesis titled “On Dataset Bias and Recognition Models” and the work presented in it are my own.
I confirm that:

• this work was done wholly or mainly while in candidature for a research degree at ECOLE NORMALE SUPERIEURE PARIS-SACLAY;

• where any part of this master's thesis has previously been submitted for a degree or any other qualification at ECOLE NORMALE SUPERIEURE PARIS-SACLAY or any other institution, this has been clearly stated;

• where I have consulted the published work of others, this is always clearly attributed;

• where I have quoted from the work of others, the source is always given. With the exception of such quotations, this master's thesis is entirely my own work;

• I have acknowledged all main sources of help;

• where the master's thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:

“By following the true lord we can”

A Mathematical Adventure
ECOLE NORMALE SUPERIEURE PARIS-SACLAY

Abstract
Mathematics department

Master In Artificial Intelligence

by Rostant Noyiessie Ndebeka


While large datasets have proven to be an integral part of contemporary object recognition research, and in particular one of the key enablers of progress in deep learning, they can have biases that lead to erroneous conclusions. On the other hand, deep neural networks are formidable, but they present undesirable properties such as memorization, overfitting, and sensitivity beyond the training dataset. A natural question concerns the causal explanation of these undesirable behaviors of deep learning algorithms. Are they related to spurious correlations arising from selection bias or dataset bias present in the training and testing datasets? Is the learning process correct with respect to the task? That is, does the trained deep neural network actually solve the underlying task? In this work, we investigate these questions through experiments on a synthetic dataset containing colored MNIST fives and sixes. We start by making some observations and show that the testing and training distributions differ for some choices of testing and training parameters. We run the REPAIR [1] procedure on this dataset, compare the results with a baseline, and introduce another representation obtained by extracting bias features using the featurizer of a given model. Finally, we run IRM [2] on the same dataset and compare with the other results.
Acknowledgements
This work would not have been possible without the financial support of Facebook Artificial Intelligence Research, which seeks to understand and develop systems with human-level intelligence by advancing the longer-term academic problems surrounding AI. I am especially indebted to Dr. David Lopez-Paz, my mentor, a research scientist at Facebook AI Research, who has been supportive of my career goals and who worked actively to provide me with the protected academic time to pursue those goals. Thanks to Dr. Maxime Oquab, a research scientist at Facebook AI Research, who helped me with many great scientific discussions related to my subject. A big thanks to my MVA advisor Francois Malgouyres. I am grateful to all of those with whom I have had the pleasure to work during this and other related projects. I would like to thank Pr. Mama Foupouagnigni, the center president of the African Institute for Mathematical Sciences in Cameroon (AIMS). I would especially like to thank Dr. Ngakeu Ferdinand who, as my teacher, has taught me more than I could ever give him credit for here. He has shown me, by his example, what a good scientist (and person) should be. Nobody has been more important to me in the pursuit of this project than God and the members of my family. I would like to thank the Lord and my parents, whose love and guidance are with me in whatever I pursue. They are the ultimate role models. Most importantly, I wish to thank my loving and supportive wife, Stella Fondap Guimbop, and my wonderful children, Emmanuel, who provide unending inspiration.

Contents

Declaration of Authorship

Abstract

Acknowledgements

1 Introduction

2 Related works
2.1 From maximum likelihood to cross-entropy loss [3], [4], [5], [6]
2.2 Bias, biased algorithm, dataset bias, representation bias [1], [7], [8]
2.2.0.1 Statistical bias and estimators
2.2.0.2 Statistical variance and estimators
2.2.1 Bias as visual concept
2.2.1.1 Biased algorithm and biased dataset
2.2.1.2 Dataset bias or selection bias
2.3 REPAIR Procedure
2.4 RESOUND Procedure [8]
2.5 IRM Procedure [2]
2.5.1 Algorithms for IRM

3 Experiments
3.1 Core problem
3.2 Core experiment setting
3.3 Research question

4 Observations

5 Experiments with REPAIR procedure
5.1 Repair ambiguity
5.2 Bias features extraction representation
5.3 Repair re-sampling procedure with threshold
5.4 Repair re-sampling procedure with ranking
5.5 Repair re-sampling procedure with class ranking
5.6 Repair re-sampling procedure with sample
5.7 Uniformly repair re-sampling procedure
5.8 Repair results interpretations

6 Experiments with IRM principle
6.1 IRM results interpretations
6.2 Conclusion

Bibliography
Chapter 1

Introduction

Over recent years, deep neural networks (DNNs), particularly convolutional neural net-
works (CNNs) have achieved great advances in image understanding problems, such as
object recognition or semantic segmentation. A key enabling factor was the introduction
of large scale image datasets such as ImageNet [9], Microsoft COCO [10], and others.
Nevertheless, like any other machine learning system, the quality of DNNs is only as
good as that of the datasets on which they are trained. These datasets have two main
properties. First, they contain enough samples to constrain the millions of parameters
of modern DNNs. Second, they cover a large variety of visual concepts to enable the
learning of visual representations that generalize across many tasks. This latter property is the subject of this work; let us illustrate it with a thought experiment.
Imagine that we want to classify images of cows and camels [7]. To address this task, the authors of [2] label images of both types of animals. The dataset is constructed in such a way that most pictures of cows are taken in green pastures, while most pictures of camels happen to be in deserts. This way of selecting the data from different environments is called selection bias, or dataset bias. After training a convolutional neural network on this dataset, they observe that the model fails to classify easy examples of cows when these are taken on sandy beaches. Bewildered, they later realize that their neural network successfully minimized its training error using a simple cheat, which consists of classifying green landscapes as cows and beige landscapes as camels. In some sense, this means that their CNN is learning background features instead of learning the underlying task, which is learning the shape features of cows and camels. We therefore have, in this sense, what we call a biased deep learning algorithm, which is due in part to the kind of learning procedure [11, 12]. In another sense, taking most images of cows on sandy beaches for testing introduces a selection bias which makes the new cow examples fall outside the training distribution. In this case the training distribution differs from the testing distribution.
Here we focus on this problem, which consists of finding a learning procedure that will solve the underlying task, that is, learn the underlying ground-truth feature representation φ⋆. We can emphasize that having a good learning process, that is, a good learning algorithm for a task, can be important for generalization. In fact, if we can find a learning procedure that learns a specific feature of the data and ignores others, we could solve the problem of selection bias and therefore that of generalization beyond the training distribution. The problem of generalization beyond the training domain is classically known as out-of-distribution generalization; as we emphasized above, selection bias induces this problem: it makes the testing distribution different from the training one, which in turn degrades testing performance. In this work we focus on addressing this issue. Generally, to ensure good generalization, or to avoid selection bias, we shuffle the training and testing datasets in order to bring the training and testing distributions closer together. For example, whereas the original MNIST handwritten data was collected from different writers under different conditions [13], the popular MNIST training and testing sets [14] were carefully shuffled to represent similar mixes of writers. Our concern in this work is to explore different learning processes, that is, different algorithms, in order to learn the underlying ground-truth feature representation for a given task and to deal with generalization beyond the training domain.
Chapter 2

Related works

In this chapter we discuss three learning principles related to our work: IRM [2], RESOUND [8], and REPAIR [1]. We give some definitions such as representation bias, dataset bias, bias, and others, and we emphasize the motivation behind the definitions of representation bias and selection bias.

2.1 From maximum likelihood to cross-entropy loss [3], [4], [5], [6]

Definition 2.1. Cross-entropy

Let p and q be two distributions over the same probability space. The cross-entropy between p and q is defined as
\[
H(p,q) = \mathbb{E}_p\!\left[-\log q\right] = H(p) + D(p\,\|\,q),
\]
where H(p) is the entropy of p and D(p‖q) is the Kullback–Leibler divergence between p and q.

If p and q are discrete, then
\[
H(p,q) = -\sum_x p(x)\log q(x).
\]
If p and q are continuous, then
\[
H(p,q) = -\int_{\mathcal{X}} p(x)\log q(x)\,dx.
\]

Definition 2.2. Entropy

If X is a random variable with probability mass function P_X, we define the entropy H(X) of X as
\[
H(X) = -\sum_{x \in X(\Omega)} P_X(x)\log P_X(x) = \mathbb{E}_{X\sim P_X}\!\left[\log \frac{1}{P_X(X)}\right].
\]
It is easy to see that if there exists x_0 such that P_X(x_0) = P(X = x_0) = 1, then H(X) = 0. In fact,
\[
H(X) = -\sum_{x} P_X(x)\log P_X(x)
      = -\Big(P_X(x_0)\log P_X(x_0) + \sum_{x\neq x_0} P_X(x)\log P_X(x)\Big)
      = 0.
\]
Hence, if we are sure that the random variable X will take a specific value, then its entropy drops to zero; that is, the more certain we are about the values of the random variable, the lower the entropy. The entropy therefore measures the uncertainty of a random variable.
Now let us assume that \{X_i\}_{i=1}^{n} is a family of n independent and identically distributed random variables with unknown distribution P_0, and that x = \{x_i\}_{i=1}^{n} is a sample drawn from that family. Using the maximum likelihood principle, we approximate the parameter α of the ground-truth distribution P_0 as
\[
\begin{aligned}
\hat{\alpha} &= \operatorname*{argmax}_{\alpha} P(x,\alpha) && \text{where } P \text{ models } P_0\\
&= \operatorname*{argmax}_{\alpha} \prod_{i=1}^{n} P_{X_i}(x_i,\alpha) = \operatorname*{argmax}_{\alpha} \prod_{i=1}^{n} P_{X}(x_i,\alpha) && \text{(i.i.d.)}\\
&= \operatorname*{argmax}_{\alpha} \frac{1}{n}\sum_{i=1}^{n} \log P_{X}(x_i,\alpha) && \text{(avoids numerical underflow as } n \text{ grows)}\\
&\approx \operatorname*{argmax}_{\alpha} \mathbb{E}_{X\sim P_0}\!\left[\log P(X,\alpha)\right] = \operatorname*{argmin}_{\alpha} \,-\mathbb{E}_{X\sim P_0}\!\left[\log P(X,\alpha)\right],
\end{aligned}
\]
that is,
\[
\hat{\alpha} = \operatorname*{argmin}_{\alpha} \,-\mathbb{E}_{X\sim P_0}\!\left[\log P(X,\alpha)\right]. \tag{2.1}
\]

We therefore realize that obtaining the best parameter α̂ by maximum likelihood estimation is equivalent to minimizing the cross-entropy between the model P and the empirical example distribution P_0. Note that a CNN trained with the cross-entropy loss is a maximum likelihood estimator of the ground-truth parameters.
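As a minimal illustration of this equivalence (a sketch in PyTorch, assuming a generic two-class classifier; the variable names are ours, not from a specific codebase), the cross-entropy loss coincides with the negative log-likelihood of the soft-max outputs:

import torch
import torch.nn.functional as F

# Minimal sketch: cross-entropy equals the negative log-likelihood of the soft-max outputs.
logits = torch.randn(8, 2)            # 8 examples, 2 classes (e.g. five vs. six)
targets = torch.randint(0, 2, (8,))   # observed labels

log_probs = F.log_softmax(logits, dim=1)             # log P(y | x, alpha)
nll = -log_probs[torch.arange(8), targets].mean()    # empirical version of -E[log P(X, alpha)]
ce = F.cross_entropy(logits, targets)                # PyTorch cross-entropy loss

assert torch.allclose(nll, ce)   # both computations agree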

2.2 Bias, biased algorithm, dataset bias, representation bias [1], [7], [8]

2.2.0.1 Statistical bias and estimators

For the estimator α̂ of equation (2.1), we define its statistical bias as its expectation minus the ground-truth (optimal) parameter.

Definition 2.3. Statistical bias
\[
\mathrm{bias}(\hat{\alpha}) = \mathbb{E}(\hat{\alpha}) - \alpha. \tag{2.2}
\]
Hence, if the algorithm α̂ is unbiased, that is, bias(α̂) = 0, which implies that E(α̂) = α, then α̂ estimates the optimal parameter α correctly on average.

2.2.0.2 Statistical variance and estimators

We emphasize the notion of variance of the estimator of (2.1), that is, the average of the squared deviations from the average E(α̂).

Definition 2.4. Variance
\[
\mathrm{Var}(\hat{\alpha}) = \mathbb{E}\!\left[\big(\mathbb{E}(\hat{\alpha}) - \hat{\alpha}\big)^2\right]. \tag{2.3}
\]
Notice that if Var(α̂) is small, then the algorithm α̂ produces close-to-average results when applied to any dataset drawn from P_0. In fact, remark that α̂ has been designed using the dataset x initially drawn from the ground-truth distribution P_0; hence, if Var(α̂) is small, then when evaluated on a few datasets other than x, the algorithm α̂ is likely to beat other algorithms and become the state of the art.

2.2.1 Bias as visual concept

In computer vision, bias is perceptible in the details of object representation. In action recognition datasets, for instance, a wide range of diverse visual cues can be informative of the action class labels. We therefore need to give another definition of bias, this one related to the specific learned representation.

Definition 2.5. Bias [1, 8]

Let D = \{x_i\}_{i=1}^{n} ⊂ X be a set of data in a feature space X. Suppose that there exist a feature representation space Z and a feature representation map φ : X → Z such that the features φ(D) of the elements of the dataset can be learned by a certain algorithm, for a task which consists of learning features of the elements of D that are different from those of φ(D). In this case, for all x ∈ X, φ(x) is viewed as a bias on x. In our core experiment with cows and camels, we can exhibit two visual feature representations: shape and background. If the task is to learn shape features with high probability, then background features are a bias for shape features, and conversely if the task is to learn background features. Notice that there exist three types of static bias: scene, person, and object bias.

Definition 2.6. representation bias [1, 8]

Let φ : X → Z be a feature representation and D a dataset of elements of X.

The representation bias with respect to φ, also called the bias toward a feature representation φ, is the best achievable performance of the features φ(x) on D, normalized by the chance level (worst possible performance). Mathematically,
\[
B(D,\phi) = \log\!\left(\frac{\mathcal{M}(D,\phi)}{\mathcal{M}_{rnd}}\right)
\quad\text{with}\quad
\mathcal{M}(D,\phi) = \max_{\gamma_\phi} \mathcal{M}(D,\gamma_\phi),
\]
where M(D, γ_φ) is a measure of the performance of algorithm γ_φ on the dataset D, and M_rnd is the chance level.

2.2.1.1 Biased algorithm and biased dataset

The notion of a biased algorithm in machine learning, particularly in deep learning, can be defined as follows.

Definition 2.7. Biased algorithm

A learning algorithm or procedure is said to be biased if it does not solve the underlying task but instead achieves high performance on another task by exploiting spurious correlations or biases.

An example of a biased learning procedure was given in the introduction (the experiment on cow and camel images). In that example the learning procedure favors the approximation of background features in the image instead of capturing the shape features of cows and camels. Background and shape are viewed as two feature representations φ1 and φ2 of an image. From this issue derives the notion of representation bias, which follows from learning on datasets that favor certain representations over others (see [1]). Notice that, in the literature, two algorithms tend to implement different feature representations.

Definition 2.8. Biased dataset [1]

A dataset is said to be biased toward a given representation φ if φ achieves high performance on it while the underlying task is to learn a different feature representation φ⋆.

2.2.1.2 Dataset bias or selection bias

The problem of limited generalization beyond the training domain is usually referred to as dataset bias or selection bias. In fact, in supervised learning, the i.i.d. assumption, that is, that the training and test samples are drawn from the same probability distribution, plays an important role. Nevertheless, this essential assumption is often violated in the presence of selection bias. Under such conditions, standard supervised learning frameworks may suffer a significant bias.

Definition 2.9. Selection bias [15]

Selection bias, also termed dataset shift or domain adaptation in the literature [16, 17], occurs when the training distribution P_tr(x, y) and the test distribution P_te(x, y) are different. Remark that the common definition of dataset bias [18], i.e. that an algorithm performs well on dataset A but not on dataset B, simply means that the algorithm has large variance. Since variance decreases with dataset size n, it has always been known that, to avoid it, datasets should be large enough. The extensive data collection efforts of the recent past have produced some more objective rules of thumb, e.g. 1,000 examples per class, that appear to suffice to control the variance of current CNN models (see [8]). Dataset bias captures how algorithms trained on one dataset generalize to other datasets of the same task; in fact, it has long been known that an algorithm that performs well on a given dataset does not necessarily perform well on others. Selection bias can be analyzed with the classical statistical tools of bias and variance, and it occurs because:

• learning algorithms are statistical estimators

• estimates from too little data have high variance and generalize poorly.

It is important to remark that dataset bias or selection bias can be viewed as a property of the estimator, that is, of the learning algorithm, and is usually ameliorated by large datasets, whereas representation bias turns out to be a property of the dataset.

2.3 REPAIR Procedure

We start by noticing that, in classification, the outputs P̂(y_i | x), i = 1, ..., n, of a DNN can mimic the posterior probabilities P(y_i | x) of the target classes \{y_i\}_{i=1}^{n} for the input observation x when the non-linear activation function in the output layer is the soft-max function. The learning objective is to minimize the difference between the predicted distribution and the true data-generating distribution. Therefore cross-entropy is a reasonable loss function for DNN-based classification.

REPAIR [1] is a learning principle which consists of reducing the representation bias of a dataset. REPAIR uses in particular a generative cross-entropy loss function, that is, a loss function which does not directly discriminate the correct class from competing classes. This is in part because, as mentioned beforehand, cross-entropy is a reasonable loss function for classification. The REPAIR principle starts by measuring the classification performance of a given feature representation with the cross-entropy loss. Mathematically, REPAIR minimizes the risk
\[
R^{\star}(D,\phi) = \min_{\theta} \, \mathbb{E}_{X,Y}\!\left[-\log P(Y \mid Z;\theta)\right],
\]

where Z = φ(X) is the feature representation and θ is the bias estimator parameter. Since P(Y | Z; θ), computed by the soft-max layer, can mimic the true posterior probability P(Y | Z), the risk can be rewritten as
\[
R^{\star}(D,\phi) = \mathbb{E}_{X,Y}\!\left[-\log P(Y \mid Z)\right]
= \mathbb{E}_{X,Y}\!\left[-\log P(Y) - \log \frac{P(Z,Y)}{P(Y)P(Z)}\right]
= H(Y) - I(Z,Y) \le H(Y).
\]

The risk is upper bounded by the entropy of the ground-truth class label and decreases when the mutual information I(Z, Y) between the feature vector Z and the class label Y increases. Hence, a lower R⋆(D, φ) indicates that φ is more informative for solving D, that is, the representation bias increases. This allows one to define the representation bias as
\[
B(D,\phi) = \frac{I(Z,Y)}{H(Y)} = 1 - \frac{R^{\star}(D,\phi)}{H(Y)}.
\]
Representation bias can be negative: while a representation can achieve the best performance on a task, this may not be the case on the dataset of that task if the dataset is biased toward another representation. To solve this problem, REPAIR attempts to create a new dataset D' with reduced learning bias, in a sense that we will explain later, by non-uniformly sampling examples from the existing dataset D. The goal is then to find a set of weights w = \{w_i\}_{i=1}^{n} that minimizes the representation bias. This leads to a minimax problem
\[
(w^{\star}, \theta^{\star}) = \min_{w} \max_{\theta} \, \nu(w,\theta).
\]
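The following is a simplified sketch of this minimax re-weighting, under our own assumptions (precomputed bias features `bias_feats`, integer class `labels`, and a sigmoid parameterization of the weights); it follows our reading of [1] rather than the official implementation. The classifier step maximizes the representation bias by fitting the bias features on the weighted data, while the weight step minimizes it.

import torch
import torch.nn as nn
import torch.nn.functional as F

def repair_weights(bias_feats, labels, n_classes, steps=500, lr=0.1):
    # bias_feats: (n, d) precomputed bias features phi(x_i); labels: (n,) int64 class indices.
    # Returns per-example weights in (0, 1): small where the bias classifier succeeds,
    # large where it fails.
    n, d = bias_feats.shape
    clf = nn.Linear(d, n_classes)                   # bias classifier, the "theta" player
    w_logits = torch.zeros(n, requires_grad=True)   # per-example weight parameters, the "w" player
    opt_clf = torch.optim.SGD(clf.parameters(), lr=lr)
    opt_w = torch.optim.SGD([w_logits], lr=lr)

    def bias_value(w, detach_losses):
        # Representation bias B_w = 1 - R_w / H_w(Y) on the re-weighted dataset.
        losses = F.cross_entropy(clf(bias_feats), labels, reduction="none")
        if detach_losses:
            losses = losses.detach()
        risk = (w * losses).sum() / w.sum()                     # weighted risk R_w
        prior = torch.zeros(n_classes).index_add(0, labels, w) / w.sum()
        entropy = -(prior * torch.log(prior + 1e-12)).sum()     # weighted label entropy H_w(Y)
        return 1.0 - risk / entropy

    for _ in range(steps):
        # theta step: maximize the bias, i.e. fit the bias features on the weighted data.
        opt_clf.zero_grad()
        (-bias_value(torch.sigmoid(w_logits).detach(), detach_losses=False)).backward()
        opt_clf.step()

        # w step: minimize the bias, i.e. shift weight toward examples the classifier misses.
        opt_w.zero_grad()
        bias_value(torch.sigmoid(w_logits), detach_losses=True).backward()
        opt_w.step()

    return torch.sigmoid(w_logits).detach()

In practice, the returned weights are then used by one of the re-sampling strategies described in Chapter 5.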

2.4 RESOUND Procedure [8]

Yingwei Li et al. [8] introduce the notion of representation bias to deal with dataset biases that lead to erroneous conclusions. They introduce the RESOUND procedure, a generic procedure applicable to the assembly of datasets for many tasks.

As their motivating experiment, let us consider X = \{X_i\}_{i=1}^{n}, a family of n independent and identically distributed random variables following the Bernoulli law of parameter p, denoted B(p). Let D = \{x_1, ..., x_n\} be a dataset sampled from X. Using the maximum likelihood principle and the fact that X_i ∼ B(p) i.i.d., we have
\[
\hat{p} = \operatorname*{argmax}_{r} P(X_1 = x_1, ..., X_n = x_n, r) = \frac{1}{n}\sum_{i=1}^{n} x_i,
\]
and since \mathrm{bias}(\hat{p}) = \mathbb{E}(\hat{p}) - p = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}(X_i) - p = 0, \hat{p} is an unbiased estimator of p. Let us compute the variance of \hat{p}:
\[
\mathrm{Var}(\hat{p}) = \mathbb{E}\!\left[\big(\mathbb{E}(\hat{p}) - \hat{p}\big)^2\right]
= \mathbb{E}(\hat{p}^2) - \mathbb{E}(\hat{p})^2
= \frac{1}{n^2}\big(np + n(n-1)p^2\big) - p^2
= \frac{1}{n}\, p\,(1-p).
\]

The variance therefore decreases as the dataset size n grows. Suppose that each X_i is a coin-toss experiment; most coins in the world have heads probability p = 0.5. However, it is possible that a dataset researcher only has access to biased coins, say with p = 0.4. By using the estimator p̂ with a large enough n, the researcher would eventually conclude that p = 0.4. Furthermore, using the fact that the variance is low, he would conclude that there is no dataset bias and announce to the world that p = 0.4. Notice that there is nothing wrong with this practice, except for the final conclusion that there is something universal about p = 0.4. Because the scientist used a biased dataset, he obtained a biased response.
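A short simulation of this thought experiment (a sketch with NumPy; the numbers are illustrative) makes the point concrete: the variance of p̂ shrinks with n, yet the conclusion stays biased.

import numpy as np

rng = np.random.default_rng(0)
p_world, p_biased = 0.5, 0.4      # heads probability of most coins vs. the researcher's coins

for n in (100, 10_000, 1_000_000):
    tosses = rng.binomial(1, p_biased, size=n)   # dataset collected only from biased coins
    p_hat = tosses.mean()                        # maximum likelihood estimate
    var_hat = p_hat * (1 - p_hat) / n            # plug-in estimate of Var(p_hat) = p(1-p)/n
    print(f"n={n:>9,}  p_hat={p_hat:.4f}  var={var_hat:.2e}")

# As n grows, p_hat concentrates around 0.4 with tiny variance, far from the
# world's p = 0.5: a low variance gives no protection against a biased dataset.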

To deal with this problem, [8] introduces the notion of representation bias, where a representation is a mathematical description of a property of the visual world, for example optical flow for motion. They then define the performance M(D, φ) of a representation φ on a dataset D as
\[
\mathcal{M}(D,\phi) = \max_{\gamma_\phi} \mathcal{M}(D,\gamma_\phi),
\]
where M(D, γ_φ) is a measure of the performance of algorithm γ_φ on the dataset D. Remark that if M(D, φ) is high, then it captures the fact that the dataset D favors the representation φ over others. This allows them to define the notion of a well-calibrated dataset. A dataset D is said to be well calibrated if the ground-truth representation φ⋆, that is, the representation truly needed to solve the vision task, achieves strictly the highest performance. Mathematically,
\[
\phi^{\star} = \operatorname*{argmax}_{\phi} \mathcal{M}(D,\phi)
\quad\text{and}\quad
\mathcal{M}(D,\phi) < \mathcal{M}(D,\phi^{\star}) \ \ \forall \phi \neq \phi^{\star}.
\]
For a deeper understanding of RESOUND dataset collection and of how bias is measured, we refer the reader to [8].

2.5 IRM Procedure [2]

In substance, IRM addresses our core problem by finding a predictor which learns features that are invariant across environments and ignores spurious correlations. If E_tr and E_te are the sets of training and testing environments respectively, with E_tr ⊂ E_te, IRM seeks a predictor function f such that
\[
R^{OOD}(f) = \min_{g\in\mathcal{F}} R^{OOD}(g)
= \min_{g\in\mathcal{F}} \max_{e\in\mathcal{E}_{te}} R^{e}(g)
= \min_{g\in\mathcal{F}} \max_{e\in\mathcal{E}_{te}} \mathbb{E}_{X^{e},Y^{e}}\!\left[\ell\big(g(X^{e}), Y^{e}\big)\right].
\]
In this work we focus on the case of two environments, e_0 and e_1, for the experiments: e_0 generates blue digits and e_1 generates red digits, and we seek to learn the digit shape, which is invariant across the two environments. The blue and red colors represent spurious correlations, or biases. Notice that we call a correlation spurious when we do not expect it to hold in the future as it held in the past.

2.5.1 Algorithms for IRM

Definition 2.10. [2] We say that a data representation Φ : X → H elicits an invariant predictor w ◦ Φ across environments if there is a classifier w : H → Y simultaneously optimal for all environments, that is,
\[
w \in \operatorname*{argmin}_{\bar{w}:\mathcal{H}\to\mathcal{Y}} R^{e}(\bar{w}\circ\Phi) \quad \forall e \in \mathcal{E}.
\]
The main goal of IRM is to learn a data representation Φ which elicits an invariant predictor across environments. The IRM program reads
\[
\min_{\substack{\Phi:\mathcal{X}\to\mathcal{H}\\ w:\mathcal{H}\to\mathcal{Y}}} \; \sum_{e\in\mathcal{E}_{tr}} R^{e}(w\circ\Phi)
\quad\text{subject to}\quad
w \in \operatorname*{argmin}_{\bar{w}:\mathcal{H}\to\mathcal{Y}} R^{e}(\bar{w}\circ\Phi) \ \ \forall e \in \mathcal{E}_{tr}.
\]
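In practice, [2] relaxes this bi-level program into the IRMv1 objective, where the invariance constraint becomes a gradient penalty on a fixed "dummy" classifier. The sketch below assumes each training environment is given as an (x, y) batch; it is our reading of the method, not the authors' code.

import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    # IRMv1 penalty: squared gradient norm of the environment risk with respect to
    # a frozen scalar classifier w = 1.0 multiplying the logits.
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.cross_entropy(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2).sum()

def irm_objective(model, environments, lam=100.0):
    # Sum over training environments of risk + lambda * invariance penalty.
    total = torch.tensor(0.0)
    for x, y in environments:           # environments: list of (x, y) batches, one per e in E_tr
        logits = model(x)
        total = total + F.cross_entropy(logits, y) + lam * irm_penalty(logits, y)
    return total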


Chapter 3

Experiments

3.1 Core problem

Consider n examples (x_i, y_i) drawn from a training distribution P_tr(X, Y). Our goal is to build a predictor y_i ≈ f(x_i), i = 1, ..., n, with low prediction error on examples (x'_i, y'_i), i = 1, ..., m, drawn from a testing distribution P_te(X', Y'). We assume that P_tr and P_te are related somehow; in fact, the support of P_te(X', Y') has to be included in that of P_tr(X, Y). Mathematically, we want to minimize
\[
\begin{aligned}
R_{te}(f) &= \int \ell\big(f(x'), y'\big)\, p_{te}(x', y')\, dx'\, dy' \\
&= \int \ell\big(f(x'), y'\big)\, \frac{p_{te}(x', y')}{p_{tr}(x', y')}\, p_{tr}(x', y')\, dx'\, dy' \\
&:= \int \ell\big(f(x'), y'\big)\, w(x', y')\, p_{tr}(x', y')\, dx'\, dy' \\
&\approx \frac{1}{m}\sum_{i=1}^{m} w(x'_i, y'_i)\,\ell\big(f(x'_i), y'_i\big).
\end{aligned}
\]

Following conventions in deep learning, we can express the predictor f (x) = c(φ(x)),
where the featurizer φ extracts some features from the data x, and the classifier c turns
the features φ(x) into the final label.
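A minimal sketch of this setup, assuming the importance weights w(x'_i, y'_i) are given (estimating them is precisely the hard part), and using the featurizer/classifier decomposition f = c ∘ φ with illustrative layer sizes:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical decomposition f(x) = c(phi(x)) for 3x28x28 colored digits.
phi = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 64), nn.ReLU())   # featurizer
c = nn.Linear(64, 2)                                                        # classifier

def weighted_empirical_risk(x, y, w):
    # (1/m) * sum_i w(x_i, y_i) * loss(f(x_i), y_i)
    losses = F.cross_entropy(c(phi(x)), y, reduction="none")
    return (w * losses).mean()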

3.2 Core experiment setting

Consider a dataset formed of colored MNIST digits, fives and sixes, labeled with the binary labels 1 and 0 respectively. Unfortunately, our dataset is drawn from a distribution P_tr with a bias: most fives are red, while most sixes are blue. Therefore, a neural network will learn to distinguish fives from sixes based on color instead of shape. Then, if the color proportions are different under P_te, the neural network will fail to predict well.
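A sketch of how such a biased split can be built, assuming `fives` and `sixes` hold grayscale 28 x 28 MNIST digits; the helper names and layout are ours, not the exact code used for the experiments.

import torch

def colorize(gray, red):
    # Turn a (28, 28) grayscale digit into a 3x28x28 red or blue image.
    img = torch.zeros(3, 28, 28)
    img[0 if red else 2] = gray          # channel 0 = red, channel 2 = blue
    return img

def make_split(fives, sixes, p):
    # Color p% of the fives red (label 1) and p% of the sixes blue (label 0).
    data, labels = [], []
    for digits, label, majority_red in ((fives, 1, True), (sixes, 0, False)):
        n = len(digits)
        n_major = int(p * n / 100)       # int(p * n / 100) majority-colored examples
        for i, gray in enumerate(digits):
            red = majority_red if i < n_major else not majority_red
            data.append(colorize(gray, red))
            labels.append(label)
    return torch.stack(data), torch.tensor(labels)

# e.g. train_x, train_y = make_split(fives_train, sixes_train, p=99)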

3.3 Research question

How can we learn a neural network that will ignore the bias and use only the digit shape for prediction? Such a neural network would be robust to changes in the color proportions at testing time.
Chapter 4

Observations

In what follows,

• n is an integer used for:

– the number of training five’s digits, in this case, n = 5421,


– the number of training six’s digits, where n = 5918,
– the number of testing five’s digits, with n = 892,
– the number of testing six’s digits, n = 958,

• ptr and pte are respectively the training and the testing parameters,

• the precise number of red fives or blue sixes among the randomly colored MNIST digits is
\[
\mathrm{int}\!\left(\frac{p \times n}{100}\right),
\]
where int(x) denotes the integer part of the quantity x,

• the precise number of blue fives or red sixes among the randomly colored MNIST digits is
\[
n - \mathrm{int}\!\left(\frac{p \times n}{100}\right),
\]

where p is the training parameter at training time and the testing parameter otherwise,

• if trainacc and testacc denote the training and testing accuracy respectively, we say that testacc is a good testing accuracy with respect to trainacc if there exists ε > 0 small enough such that
\[
\big|\,\mathrm{trainacc} - \mathrm{testacc}\,\big|_{\mathbb{R}} \le \varepsilon,
\]
where |·|_R denotes the Euclidean norm (absolute value) on R.

In our core problem stated beforehand, we can suppose that we have two environments, or two sources of digits: one which gives us digits with a blue color bias, named e_0, and another which gives us digits with a red color bias, named e_1. The selection bias consists of picking:

• p% of five’s digits in the environment e1 ,

• p% of six’s digits in the environment e0 ,

• (1 − p)% of five’s digits in the environment e0 ,

• (1 − p)% of six’s digits in the environment e1 .

With this setting, we can construct mathematically four distributions of points as follows:
\[
\begin{cases}
\beta_{fr}(p) \approx \sum_{i,\; x_{\text{five red},\,i} \in D_{fr}(p)} \delta_{x_{\text{five red},\,i}}, & \text{where } D_{fr}(p) \text{ is the set of red-colored fives},\\[4pt]
\beta_{fb}(p) \approx \sum_{i,\; x_{\text{five blue},\,i} \in D_{fb}(p)} \delta_{x_{\text{five blue},\,i}}, & \text{where } D_{fb}(p) \text{ is the set of blue-colored fives},\\[4pt]
\beta_{sr}(p) \approx \sum_{i,\; x_{\text{six red},\,i} \in D_{sr}(p)} \delta_{x_{\text{six red},\,i}}, & \text{where } D_{sr}(p) \text{ is the set of red-colored sixes},\\[4pt]
\beta_{sb}(p) \approx \sum_{i,\; x_{\text{six blue},\,i} \in D_{sb}(p)} \delta_{x_{\text{six blue},\,i}}, & \text{where } D_{sb}(p) \text{ is the set of blue-colored sixes}.
\end{cases}
\]
The learning distributions, that is, the training distribution P_tr(p) and the testing distribution P_te(p'), are then given by mixtures of those four distributions. Mathematically,
\[
P_{tr}(p) \overset{\text{def}}{=} \beta_{fr}(p) + \beta_{fb}(p) + \beta_{sr}(p) + \beta_{sb}(p)
\quad\text{with}\quad
D_{tr}(p) \overset{\text{def}}{=} D_{fr}(p) \cup D_{fb}(p) \cup D_{sb}(p) \cup D_{sr}(p),
\]
and
\[
P_{te}(p') \overset{\text{def}}{=} \beta_{fr}(p') + \beta_{fb}(p') + \beta_{sr}(p') + \beta_{sb}(p')
\quad\text{with}\quad
D_{te}(p') \overset{\text{def}}{=} D_{fr}(p') \cup D_{fb}(p') \cup D_{sb}(p') \cup D_{sr}(p').
\]
Let us emphasize the main focus of this work. Remark that:

1. If p = p', then P_tr(p) ≈ P_te(p'). This holds because in this case we have the same color proportion p% in the testing and training datasets. We then fall back on the general machine learning assumption that the training and testing datasets have been drawn from the same distribution, P_tr(p) ≈ P_te(p'). We will run experiments in this case and show that we always obtain good generalization beyond the empirical training distribution.

2. If p ≠ p', then P_tr(p) ≠ P_te(p'). This claim follows from the selection bias, that is, from the fact that the data come from different environments with different color proportions. In contrast with the general setting of machine learning, we expect to see the testing accuracy drop. This situation is in fact the core, fundamental problem that we are trying to deal with. We will also run some experiments in this case, make some observations, and try to solve this problem.

We first fix the training parameter p_tr and the testing parameter p_te to be equal to 99, that is,
\[
p_{tr} = p_{te} = p = p' = 99.
\]
This means that we have 99 percent of red fives (and blue sixes) and one percent of blue fives (and red sixes) among the randomly colored MNIST digits. We vary the number of epochs in the set {32, 64, 128, 256} and the batch size in the set {50, 100, 200, 400}. We therefore launch sixteen jobs with a CNN of two convolutional layers and two linear layers. The forward pass is set up as follows: for an input example x, or a batch tensor of input examples X, we pass it through a convolutional layer, followed by a ReLU non-linearity and max pooling; then through the second convolutional layer, again followed by ReLU and max pooling; we reshape the result and pass it through the first linear layer, apply ReLU, and finish with the last linear layer, which gives us the output prediction. We organize the results in a table.
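The architecture just described can be sketched as follows; the channel and hidden sizes are our own illustrative choices, since the text does not state them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    # Two convolutional layers and two linear layers, matching the forward pass above.
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 7 * 7, 64)
        self.fc2 = nn.Linear(64, n_classes)

    def forward(self, x):                              # x: (batch, 3, 28, 28)
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)     # -> (batch, 16, 14, 14)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)     # -> (batch, 32, 7, 7)
        x = x.reshape(x.size(0), -1)                   # reshape before the linear layers
        return self.fc2(F.relu(self.fc1(x)))           # output prediction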

Table 4.1: Table for ptr = pte = 99

tr-loss te-loss tr-acc te-acc epochs batch-size tr-parameter te-parameter

1.28 · 10−2 1.16 · 10−2 99.58 99.73 64 200 99 99


9.8 · 10−3 1.01 · 10−2 99.71 99.68 128 400 99 99
1.88 · 10−2 1.65 · 10−2 99.41 99.46 32 100 99 99
5.13 · 10−7 1.1 · 10−2 100 99.84 256 50 99 99
4.01 · 10−2 3.65 · 10−2 98.99 98.97 64 400 99 99
2.25 · 10−3 6.27 · 10−3 99.94 99.78 256 400 99 99
4.72 · 10−2 4.41 · 10−2 98.99 98.97 32 200 99 99
4.38 · 10−5 7.55 · 10−3 100 99.84 256 100 99 99
1.36 · 10−3 6.38 · 10−3 99.96 99.84 128 100 99 99
2.89 · 10−3 7.1 · 10−3 99.91 99.73 64 50 99 99
5.94 · 10−3 7.87 · 10−3 99.85 99.73 64 100 99 99
2.79 · 10−4 6.79 · 10−3 100 99.84 128 50 99 99
5.72 · 10−2 5.07 · 10−2 98.99 98.97 32 400 99 99
9 · 10−3 9.43 · 10−3 99.71 99.78 32 50 99 99
4.89 · 10−4 6.11 · 10−3 100 99.84 256 200 99 99
3.72 · 10−3 6.73 · 10−3 99.91 99.78 128 200 99 99

In the table, we have good training and testing accuracies for all batch sizes and numbers of epochs, almost all of the order of one hundred percent. The number of times that we get good out-of-distribution generalization is therefore greater than the number of times that we obtain a bad one; mathematically,
\[
\frac{\#\{\text{bad testing accuracies}\}}{\#\{\text{good testing accuracies}\}} = \frac{0}{16} = 0.
\]
These results were expected in the sense that having p_te = p_tr = 99 means that we have the same color proportion in the training and testing datasets. To understand the table better, we plot the testing accuracy as a function of the training one. We picked forty-one points in the interval [min(tr_acc), max(tr_acc)] as new points in our linear interpolation setting.

The following figure shows that the testing accuracy increases with the training accuracy, and that the lowest and greatest testing accuracies are respectively 98.97 and 99.84.

Figure 4.1: Test accuracy in terms of the training one for ptr = pte = 99

 
With the same sets {32, 64, 128, 256} and {50, 100, 200, 400} of numbers of epochs and batch sizes, we train the same architecture with p_tr = p_te = 1. The results are collected in the following table.

Table 4.2: Table for ptr = pte = 1

tr-loss te-loss tr-acc te-acc epochs batch-size tr-parameter te-parameter

1.42 · 10−3 2.26 · 10−3 99.97 99.95 128 50 1 1


8.18 · 10−3 7.57 · 10−3 99.74 99.68 64 100 1 1
5.27 · 10−3 5.04 · 10−3 99.82 99.84 64 50 1 1
3.3 · 10−3 3.41 · 10−3 99.91 99.89 128 100 1 1
4.86 · 10−2 4.46 · 10−2 99 99.08 32 200 1 1
1.17 · 10−2 1.13 · 10−2 99.65 99.68 128 400 1 1
2.01 · 10−5 2.12 · 10−3 100 99.89 256 50 1 1
1.09 · 10−2 1.03 · 10−2 99.67 99.62 32 50 1 1
2.19 · 10−2 1.99 · 10−2 99.24 99.3 32 100 1 1
5.45 · 10−2 5.36 · 10−2 99 99.08 32 400 1 1
4.65 · 10−3 4.3 · 10−3 99.86 99.84 256 400 1 1
3.77 · 10−4 1.69 · 10−3 100 99.89 256 100 1 1
1.5 · 10−2 1.33 · 10−2 99.46 99.57 64 200 1 1
4.54 · 10−2 4.56 · 10−2 99 99.08 64 400 1 1
1.79 · 10−3 2.3 · 10−3 99.95 99.95 256 200 1 1
6.16 · 10−3 5.49 · 10−3 99.81 99.78 128 200 1 1

In this table, out-of-distribution generalization is quite good for all pairs of batch size and number of epochs. We then have the ratio
\[
\frac{\#\{\text{bad testing accuracies}\}}{\#\{\text{good testing accuracies}\}} = \frac{0}{16} = 0.
\]
These results were also expected in the sense that having p_te = p_tr = 1 means that we have the same color proportion in the training and testing datasets. To understand the table better, we plot the testing accuracy as a function of the training one. As beforehand, we picked forty-one points in the interval [min(tr_acc), max(tr_acc)] as new points in our linear interpolation setting.

Figure 4.2: Test accuracy in terms of the training one for ptr = pte = 1

This figure shows that the testing accuracy increases with the training one, and the
lowest and greatest testing accuracies are respectively 99.08 and 99.95.

 
Let us now try p_tr = p_te = 50, with the same sets {32, 64, 128, 256} and {50, 100, 200, 400} of numbers of epochs and batch sizes, and the same architecture described beforehand. The results are collected in the following table.

Table 4.3: Table for ptr = pte = 50

tr-loss te-loss tr-acc te-acc epochs batch-size tr-parameter te-parameter

2.26 · 10−2 2.33 · 10−2 99.29 99.14 128 400 50 50


1.55 · 10−2 2.07 · 10−2 99.51 99.35 64 100 50 50
2.63 · 10−2 3.32 · 10−2 99.17 98.97 64 200 50 50
1.48 · 10−5 2.77 · 10−2 100 99.62 256 50 50 50
6.68 · 10−3 1.33 · 10−2 99.85 99.57 128 100 50 50
1.14 · 10−2 1.47 · 10−2 99.63 99.46 128 200 50 50
4.93 · 10−4 1.78 · 10−2 100 99.57 256 100 50 50
4.68 · 10−2 4.54 · 10−2 98.61 98.59 64 400 50 50
9.41 · 10−3 1.49 · 10−2 99.72 99.51 64 50 50 50
9.4 · 10−2 8.73 · 10−2 97.03 96.76 32 400 50 50
2.5 · 10−3 1.48 · 10−2 99.96 99.46 128 50 50 50
2.13 · 10−2 2.29 · 10−2 99.29 99.24 32 50 50 50
7.75 · 10−3 1.72 · 10−2 99.75 99.57 256 400 50 50
5.4 · 10−2 5.1 · 10−2 98.36 98.16 32 200 50 50
3.25 · 10−2 3.26 · 10−2 98.94 98.86 32 100 50 50
2.87 · 10−3 1.36 · 10−2 99.95 99.51 256 200 50 50

In this table, out-of-distribution generalization is good for all pairs of batch size and number of epochs. We then have the ratio
\[
\frac{\#\{\text{bad testing accuracies}\}}{\#\{\text{good testing accuracies}\}} = \frac{0}{16} = 0.
\]
As for p_te = p_tr = 99 and p_te = p_tr = 1, these results were also expected in the sense that having p_te = p_tr = 50 means that we have the same color proportions in the training and testing datasets, which implies that the testing and training distributions are the same. To understand the table better, we plot the testing accuracy as a function of the training one. As beforehand, we picked forty-one points in the interval [min(tr_acc), max(tr_acc)] as new points in our linear interpolation setting.

The following figure shows that the testing accuracy increases with the training one,
and the lowest and greatest testing accuracies are respectively 96.76 and 99.62.

Figure 4.3: Test accuracy in terms of the training one for ptr = pte = 50

Let us now change the proportion, that is, set
\[
p_{te} \neq p_{tr}.
\]
In the following table, we let p_te be an approximation of p_tr to within one. We set
\[
p_{te} = 98 \quad\text{and}\quad p_{tr} = 99,
\]
with the same sets {32, 64, 128, 256} and {50, 100, 200, 400} of numbers of epochs and batch sizes, and the same architecture described beforehand; we obtain the following table.

Table 4.4: Table for pte = 98 and ptr = 99

tr-loss te-loss tr-acc te-acc epochs batch-size tr-parameter te-parameter

5.94 · 10−3 1.28 · 10−2 99.85 99.51 64 100 99 98


8.99 · 10−3 1.59 · 10−2 99.71 99.51 32 50 99 98
4.72 · 10−2 7.85 · 10−2 98.99 97.95 32 200 99 98
4 · 10−2 6.6 · 10−2 98.99 97.95 64 400 99 98
2.77 · 10−4 1.07 · 10−2 100 99.78 128 50 99 98
2.89 · 10−3 1.12 · 10−2 99.91 99.57 64 50 99 98
1.36 · 10−3 9.78 · 10−3 99.96 99.78 128 100 99 98
1.88 · 10−2 2.93 · 10−2 99.41 98.81 32 100 99 98
5.72 · 10−2 9.15 · 10−2 98.99 97.95 32 400 99 98
2.25 · 10−3 1.02 · 10−2 99.94 99.73 256 400 99 98
1.28 · 10−2 2 · 10−2 99.58 99.41 64 200 99 98
4.38 · 10−5 1.18 · 10−2 100 99.78 256 100 99 98
3.72 · 10−3 1.05 · 10−2 99.91 99.68 128 200 99 98
5.11 · 10−7 1.76 · 10−2 100 99.78 256 50 99 98
9.79 · 10−3 1.62 · 10−2 99.71 99.41 128 400 99 98
4.89 · 10−4 9.24 · 10−3 100 99.78 256 200 99 98

We realize that the ratio
\[
\frac{\#\{\text{bad testing accuracies}\}}{\#\{\text{good testing accuracies}\}} = \frac{0}{16} = 0.
\]
We still have good testing accuracies for all batch sizes and numbers of epochs; this indicates some stability in the neighborhood of 99. Picking forty-one points in the interval [min(tr_acc), max(tr_acc)] as new points, we plot the linear interpolation of the set of points {(tr_acc_i, te_acc_i)}.

Figure 4.4: Test accuracy in terms of the training accuracy for pte = 98 and ptr = 99.

We now set
\[
p_{te} = 1 \quad\text{and}\quad p_{tr} = 99,
\]
with the same sets {32, 64, 128, 256} and {50, 100, 200, 400} of numbers of epochs and batch sizes, and the same architecture described beforehand.

Table 4.5: Table for pte = 1 and ptr = 99

tr-loss te-loss tr-acc te-acc epochs batch-size tr-parameter te-parameter

3.72 · 10−3 0.42 99.91 85.51 128 200 99 1


1.88 · 10−2 1.15 99.41 46.27 32 100 99 1
1.28 · 10−2 0.79 99.58 67.51 64 200 99 1
4.72 · 10−2 3.42 98.99 0.92 32 200 99 1
2.89 · 10−3 0.39 99.91 86.7 64 50 99 1
4.89 · 10−4 0.34 100 90.38 256 200 99 1
4.38 · 10−5 0.4 100 91.19 256 100 99 1
5.72 · 10−2 4.14 98.99 0.92 32 400 99 1
2.25 · 10−3 0.37 99.94 87.68 256 400 99 1
5.95 · 10−3 0.51 99.85 81.78 64 100 99 1
5.25 · 10−7 0.61 100 91.3 256 50 99 1
4 · 10−2 2.85 98.99 1.41 64 400 99 1
9.8 · 10−3 0.64 99.71 74.49 128 400 99 1
1.36 · 10−3 0.33 99.96 89.51 128 100 99 1
8.99 · 10−3 0.66 99.71 74.32 32 50 99 1
2.76 · 10−4 0.34 100 90.38 128 50 99 1

We realize that the ratio
\[
\frac{\#\{\text{bad testing accuracies}\}}{\#\{\text{good testing accuracies}\}} = \frac{4}{12} = 0.33.
\]
We can see that the testing accuracies for certain batch sizes and numbers of epochs start to drop. This supports the claim that the testing distribution is different from the training distribution when p_tr ≠ p_te. In this case, plotting the testing accuracy as a function of the training one and linearly interpolating, we obtain the following curve. We can easily see that the slope of each linear branch of the curve decreases compared to those of the curves drawn beforehand.

Figure 4.5: Test accuracy in terms of the training accuracy for pte = 1 and ptr = 99

Let us fix the batch size to 32, the number of epochs to 256, and the training parameter p_tr = 99, and vary the testing parameter p_te over the set {1, 5, 10, 20, 30, 40, 50, 60, 70, 80} in such a way that p_te → p_tr = 99. The results of the experiment are organized in the following table.

Table 4.6: Table for ptr = 99 and pte ∈ {1, 5, 10, 20, 30, 40, 50, 60, 70, 80}

tr-loss te-loss tr-acc te-acc epochs batch-size tr-parameter te-parameter

1.88 · 10−8 0.66 100 93.08 256 32 99 20


1.74 · 10−8 0.78 100 91.68 256 32 99 1
1.88 · 10−8 0.3 100 96.54 256 32 99 60
1.88 · 10−8 0.12 100 98.38 256 32 99 80
1.74 · 10−8 0.41 100 95.35 256 32 99 50
1.75 · 10−8 0.58 100 93.84 256 32 99 30
1.85 · 10−8 0.77 100 91.89 256 32 99 5
1.75 · 10−8 0.73 100 92.32 256 32 99 10
1.74 · 10−8 0.53 100 94.32 256 32 99 40
1.71 · 10−8 0.22 100 97.51 256 32 99 70

In the table, we have a training accuracy of one hundred percent for all rows, and the testing accuracy grows with the testing parameter. Plotting a cubic interpolation of the testing accuracy in terms of the testing parameter, we can see that the function defined by
\[
f(p_{te}) = \text{te\_acc}
\]
satisfies the following property,
\[
p_{te} \ge p'_{te} \implies f(p_{te}) \ge f(p'_{te}),
\]
that is, f increases with the testing parameter. Furthermore, it is easy to see that
\[
\lim_{p_{te}\to p_{tr}=99} f(p_{te}) = \text{tr\_acc} = 100 \ge \max\{\text{te\_acc}\}.
\]

Figure 4.6: Test accuracy in terms of the testing parameter for ptr = 99 and pte ∈
{1, 5, 10, 20, 30, 40, 50, 60, 70, 80}.

This simply means that the closer the testing parameter gets to the training one, the better the testing accuracy associated with that testing parameter. In terms of distributions, it means that the closer the testing parameter gets to the training one, the closer the testing distribution P_te(p_te) associated with that testing parameter p_te is to the training distribution.
Chapter 5

Experiments with REPAIR procedure

As mentioned beforehand, REPAIR uses the notion of representation bias, which relies on the notion of bias toward a representation. REPAIR defines a representation function φ which extracts the bias of each example x ∈ X; that is, given an example x, φ(x) is the bias on x toward the representation φ. In order to reduce the learning of the bias φ(x) on each example x ∈ X, REPAIR learns the bias features φ(x) for all x ∈ X in such a way as to penalize an example x whose bias features yield a good training prediction, that is,
\[
c\big(f(\phi(x))\big) = y_x,
\]
where x is labeled by y_x, by assigning it a small weight w_x, and to de-penalize an example x' whose bias features yield a bad training prediction (a mispredicted example), that is,
\[
c\big(f(\phi(x'))\big) \neq y_{x'},
\]
where x' is labeled by y_{x'}, by assigning it a high weight w_{x'}. Remark that f is the featurizer and c the classifier. The REPAIR procedure is thus a re-sampling procedure: given a dataset D, REPAIR creates a new dataset D' by non-uniformly assigning weights to the examples in D. Let us emphasize that the REPAIR re-sampling procedure is performed on the train and test sets combined.


In substance:

repair formulates bias minimization as an optimization problem, seeking a weight distri-


bution that penalizes examples easy for a classifier built on a given feature representation.

5.1 Repair ambiguity

Given an example x, how do we define a representation φ such that φ(x) is the expected bias on x toward the representation φ? To be more explicit, let us imagine a person or background bias in the image of an animal, say a child or a flower landscape in an image of a cat. How do we extract that specific child or flower-landscape bias as a representation?

In the REPAIR paper, each colored MNIST digit x of size 3 × 28 × 28 is reshaped to x_resize of size 3 × 784, and the color bias function at x is defined by taking the maximum of each row of x_resize, that is,

φ(x) = torch.tensor([x_resize[i].max() for i in range(x_resize.size(0))])  # size(0) = 3

φ(x) is then a tensor containing three scalars, which is passed through a linear layer of a model, followed by a soft-max.
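Written out as a runnable function (a sketch of our reading of [1], not the official code), the color-bias feature and the small bias model look as follows:

import torch
import torch.nn as nn

def color_bias_features(x):
    # x: (3, 28, 28) colored digit -> 3 scalars, the maximum of each color channel,
    # i.e. the maximum of each row of the (3, 784) reshaped image.
    x_resize = x.reshape(3, 784)
    return torch.stack([x_resize[i].max() for i in range(x_resize.size(0))])

# The bias model is a single linear layer on those 3 scalars, followed by soft-max.
bias_model = nn.Sequential(nn.Linear(3, 2), nn.Softmax(dim=-1))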

5.2 Bias features extraction representation

We introduce the notion of a bias features extraction representation. This notion is heavily based on the REPAIR paper; we are in fact giving another kind of representation function which allows us to learn a prediction function that ignores biased data and learns the underlying task. The intuition behind this representation is that the bias on learning examples is not something we can always define explicitly and exactly. In the thought experiment stated in the introduction, we concluded that the CNN model was, with high probability, learning background features of the images. Thus, if the recognition model relies heavily on the bias features for learning, we can define the bias features of an example x as the featurizer output at x, that is,

φ(x) = f(x),

where f is the featurizer of a certain model. In contrast with REPAIR, the re-sampling procedure is applied only to the training set, so that testing on data outside the re-sampled training dataset allows us to know whether the recognition model really ignores the bias features of the data and learned features related to the underlying task. Indeed, if the re-sampling procedure were applied to the combined training and testing datasets, and the obtained dataset then split in two to form new training and testing sets, we would expect good out-of-distribution generalization on the underlying features simply because the testing and training datasets obtained from the REPAIR re-sampling procedure have already been penalized with respect to a feature representation. The question is then: what about predictions on data which are not inside the re-sampled testing dataset but are generated by the testing distribution P_te?
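A sketch of this representation, assuming the recognition model exposes its featurizer f as a module (as in the featurizer/classifier decomposition of Chapter 3); the re-weighting itself can reuse the repair_weights sketch of Section 2.3, applied to the training set only.

import torch

@torch.no_grad()
def extract_bias_features(featurizer, loader):
    # phi_2(x) = f(x): reuse the featurizer outputs of the recognition model as bias features.
    featurizer.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(featurizer(x))
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

# bias_feats, labels = extract_bias_features(featurizer, train_loader)   # training set only
# w = repair_weights(bias_feats, labels, n_classes=2)
# the test set is left untouched, so testing probes the non-penalized distribution P_te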

Let us run experiments on our core problem of colored MNIST fives and sixes, with varying red and blue color proportions, and compare the results for the two views. Let us first summarize the two views:

1. REPAIR representation:

– for each example x, the associated bias features are defined as

φ1(x) = torch.tensor([x_resize[i].max() for i in range(x_resize.size(0))])  # size(0) = 3

– the re-sampling procedure is done on the training and testing datasets combined.

2. Bias features extraction representation:

– for each example x, the associated bias features are defined as

φ2(x) = f(x), where f is the featurizer of our working model described beforehand;

– the re-sampling procedure is done only on the training dataset.

5.3 Repair re-sampling procedure with threshold

Thresholding (threshold): Keep all examples i such that wi ≥ t, where t = 0.5 is the
threshold.

φ tr acc te acc epochs batch-size tr parameter te parameter


b-line 100 % 95.35% 256 32 99 50
φ1 100 % 100 % 256 32 99 50
φ2 100 % 94.71 % 256 32 99 50
φ1 100 % 99.79 % 256 32 99 99
φ2 100 % 99.73 % 256 32 99 99
b-line 100 % 99.84% 256 32 99 99

5.4 Repair re-sampling procedure with ranking

Ranking (rank): Keep the p = 50% of examples with the largest weights w_i.

φ tr acc te acc epochs batch-size tr parameter te parameter


b-line 100 % 95.35% 256 32 99 50
φ1 100 % 97.29 % 256 32 99 50
φ2 100 % 94.65 % 256 32 99 50
φ1 100 % 99.52 % 256 32 99 99
φ2 100 % 99.78 % 256 32 99 99
b-line 100% 99.84% 256 32 99 99

5.5 Repair re-sampling procedure with class ranking

Per-class ranking (cls_rank): Keep the p = 50% of examples with the largest weights w_i from each class.

φ tr acc te acc epochs batch-size tr parameter te parameter


b-line 100 % 95.35% 256 32 99 50
φ1 100 % 97.05 % 256 32 99 50
φ2 100 % 95.73 % 256 32 99 50
φ1 100 % 99.66 % 256 32 99 99
φ2 100 % 99.84 % 256 32 99 99
b-line 100% 99.84% 256 32 99 99

5.6 Repair re-sampling procedure with sample

Sampling (sample): Keep each example i with probability w_i (discard it with probability 1 − w_i).

φ tr acc te acc epochs batch-size tr parameter te parameter


b-line 100 % 95.35% 256 32 99 50
φ1 100 % 95.67 % 256 32 99 50
φ2 100 % 95.19 % 256 32 99 50
φ1 100 % 99.68 % 256 32 99 99
φ2 100 % 99.95 % 256 32 99 99
b-line 100% 99.84% 256 32 99 99

5.7 Uniformly repair re-sampling procedure

Uniform (uniform): Keep p = 50% of examples uniformly at random.



φ tr acc te acc epochs batch-size tr parameter te parameter


b-line 100 % 95.35% 256 32 99 50
φ1 100 % 94.64 % 256 32 99 50
φ2 100 % 94.86 % 256 32 99 50
φ1 100 % 100 % 256 32 99 99
φ2 100 % 99.84 % 256 32 99 99
b-line 100% 99.84% 256 32 99 99
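The five re-sampling strategies above can be summarized by a single index-selection helper (a sketch over a weight vector w and a label vector labels; thresholds and proportions follow the values used above).

import torch

def resample_indices(w, labels, strategy, t=0.5, p=0.5):
    # Return the indices kept by each re-sampling strategy, given weights w in (0, 1).
    n = w.numel()
    k = int(p * n)
    if strategy == "threshold":            # keep examples with w_i >= t
        return (w >= t).nonzero(as_tuple=True)[0]
    if strategy == "rank":                 # keep the p% of examples with largest weights
        return torch.topk(w, k).indices
    if strategy == "cls_rank":             # keep the p% largest weights within each class
        kept = []
        for c in labels.unique():
            cls_idx = (labels == c).nonzero(as_tuple=True)[0]
            k_c = int(p * cls_idx.numel())
            kept.append(cls_idx[torch.topk(w[cls_idx], k_c).indices])
        return torch.cat(kept)
    if strategy == "sample":               # keep example i with probability w_i
        return (torch.rand(n) < w).nonzero(as_tuple=True)[0]
    if strategy == "uniform":              # keep p% of examples uniformly at random
        return torch.randperm(n)[:k]
    raise ValueError(f"unknown strategy: {strategy}")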

5.8 Repair results interpretations

Notice that, in the five tables above, for testing parameters between fifty and ninety-nine, the best result varies between the experiments with φ1, φ2 and the baseline. In fact,

• for
(p_te = 50) ≠ (p_tr = 99),

based on the previous observations, the training distribution is different from the testing one; mathematically,

P_te(p_te) ≠ P_tr(p_tr).

The experiment with φ1 beats the experiments with the baseline and with φ2 for all the re-sampling strategies, except in the case of uniform re-sampling, where the baseline experiment is the best;

• in the case of
p_te = 99 = p_tr,

that is, the testing distribution is very close to the training one; mathematically,

P_te(p_te) ≈ P_tr(p_tr),

– φ1 gives the best result for the thresholding and uniform re-sampling strategies,
– the baseline for the ranking strategy,
– φ2 and the baseline for the class-ranking strategy,
– φ2 for the sampling strategy.

The most important observation is that the baseline achieves high performance without solving the underlying task, by learning bias features of the digits. The experiments with the feature representation function φ1 help us design a predictive function which learns the shape features of the colored digits, but testing is performed on digits penalized with respect to the representation φ1. Moreover, finding a feature representation such as φ1 that fits the bias present in the data is a complex problem.

Finally, the experiments with φ2 illustrate a general method to find bias features in a dataset, under the condition that we are confident, with high probability, that a recognition model trained on that dataset learns bias features. The goal is to learn a model that recognizes shape features and ignores bias features. We end by remarking that in the experiments with the φ2 representation, testing is performed on the non-penalized dataset, that is, without re-sampling.
Chapter 6

Experiments with IRM principle

With our dataset of colored MNIST fives and sixes described beforehand, we run the invariant risk minimization (IRM) principle and arrange the results in the following table.

tr acc te acc epochs batch-size tr parameter te parameter


IRM 70.92 % 65.10% 256 32 99 50
IRM 70.86 % 64.10 % 256 32 99 1

6.1 IRM results interpretations

The IRM results show some stability at testing time when changing the testing parameter. We can see in the table that for p_te = 1 and p_te = 50 we obtain almost the same testing accuracy. This suggests that the shape features are learned and the color bias features are ignored.

6.2 Conclusion

“This is just a partial conclusion, the work is not really achieved”.



The REPAIR procedure successfully achieved good out-of-distribution generalization, but it requires defining the bias features exactly and explicitly, which is not always possible and can be ambiguous. IRM successfully learns an invariant predictor across environments, but not perfectly, in the sense that the training accuracy remains far from one hundred percent. The bias features extraction representation that we introduced learns well, that is, the training accuracy reaches the order of one hundred percent, and it generalizes with stability when the testing parameter changes.
Bibliography

[1] Yi Li and Nuno Vasconcelos. Repair: Removing representation bias by dataset re-
sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 9572–9581, 2019.

[2] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

[3] Thomas M Cover and Joy A Thomas. Elements of information theory second edition
solutions to problems. Internet Access, 2006.

[4] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press,
2016.

[5] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From
theory to algorithms. Cambridge university press, 2014.

[6] Donglai Zhu, Hengshuai Yao, Bei Jiang, and Peng Yu. Negative log likelihood ratio
loss for deep neural network classification. arXiv preprint arXiv:1804.10690, 2018.

[7] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita.
In Proceedings of the European Conference on Computer Vision (ECCV), pages
456–473, 2018.

[8] Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition
without representation bias. In Proceedings of the European Conference on Com-
puter Vision (ECCV), pages 513–528, 2018.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information pro-
cessing systems, pages 1097–1105, 2012.

[10] Liang-Chieh Chen, Jonathan T Barron, George Papandreou, Kevin Murphy, and
Alan L Yuille. Semantic image segmentation with task-specific edge detection using
cnns and a discriminatively trained domain transform. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 4545–4554, 2016.

[11] Samuel Ritter, David GT Barrett, Adam Santoro, and Matt M Botvinick. Cognitive
psychology for deep neural networks: A shape bias case study. In Proceedings of the
34th International Conference on Machine Learning-Volume 70, pages 2940–2949.
JMLR. org, 2017.

[12] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T
Kalai. Man is to computer programmer as woman is to homemaker? debiasing
word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and
R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages
4349–4357. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf.

[13] Patrick J Grother. Nist special database 19. Handprinted forms and characters
database, National Institute of Standards and Technology, 1995.

[14] Léon Bottou, Corinna Cortes, John S Denker, Harris Drucker, Isabelle Guyon,
Larry D Jackel, Yann LeCun, Urs A Müller, Eduard Säckinger, Patrice Y Simard,
et al. Comparison of classifier methods: a case study in handwritten digit recogni-
tion. In International conference on pattern recognition, pages 77–77. IEEE Com-
puter Society Press, 1994.

[15] Van-Tinh Tran. Selection Bias Correction in Supervised Learning with Importance
Weight. PhD thesis, 2017.

[16] Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.

[17] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.

[18] Antonio Torralba, Alexei A Efros, et al. Unbiased look at dataset bias. In CVPR,
volume 1, page 7. Citeseer, 2011.
