Bojana Gajić
WARNING. Access to the contents of this doctoral thesis is subject to the acceptance of the conditions of use set by the following Creative Commons license: https://creativecommons.org/licenses/?lang=en
Training strategies for efficient deep
image retrieval
I would like to start this dissertation by expressing my true gratitude to all of those who supported and assisted me over the last years.
First of all, I owe the biggest thanks to my supervisors, Dr. Carlo Gatta and Dr. Ramon Baldrich. If I could write down everything I thank you for, the list might be longer than the thesis. Carlo, thank you for your bright and always original ideas and discussions, for leading me through various stages of my development, for your patience when it comes to correcting my writing, and for all the years of great work, support and understanding. Ramon, thank you for making me feel welcome since my very first day at CVC and in Barcelona. Your feedback was always insightful and it brought my work to a higher level.
I would like to thank Dr. Ariel Amato, CTO of Vintra, for giving me the opportunity to work in such an innovative company. Ariel, thank you for your support, for the recognition of my work, and for showing me that every problem has a solution. Your forward-thinking attitude has always been inspiring. I would also like to acknowledge my colleagues from the machine learning team: Francesco, Thomas, Sergio and Esteve, it's been a great pleasure to work with you! And many thanks to Onur, Marc, Riqui and Eva for making my time in the office truly remarkable!
I would also like to thank Dr. Jon Almazan for accepting me and guiding me through my internship at Xerox Research Centre Europe and Naver Labs Europe. Jon, thanks for showing me what it is like to work in a world-class research team! I appreciate all the help and support from you, Diane and Naila.
The time I spent at CVC wouldn't have been nearly as memorable without Ivet, Carles, Felipe, Dena, German, Prassanna, Arash, Gemma and Onur. It was great to meet you, share ideas and spend time with you! And many thanks to all the other friends with whom I spent my free time.
Finally, my biggest thanks go to my family: my parents Zoran and Vesna and my brother Andrija. Thank you for all the encouragement and understanding that you have given me.
Abstract
First, in Chapter 4 we analyze the importance of some state-of-the-art strategies related to the training of a deep model, such as image augmentation, backbone architecture and hard triplet mining. We then combine the best strategies to design a simple deep architecture and a training methodology for effective and high-quality person re-identification. We extensively evaluate each design choice, leading to a list of good practices for person re-identification. By following these practices, our approach outperforms the state of the art, including more complex methods with auxiliary components, by large margins on four benchmark datasets. We also provide a qualitative analysis of our trained representation, which indicates that, while compact, it is able to capture information from localized and discriminative regions, in a manner akin to an implicit attention mechanism.
problems. This hypothesis is supported by the fact that "a curve dominates in ROC space if and only if it dominates in PR space" [17]. To test this hypothesis, we design an approximate, differentiable relaxation of the area under the ROC curve. Despite its simplicity, the AUC loss, combined with ResNet50 as a backbone architecture, achieves state-of-the-art results on two large-scale, publicly available retrieval datasets. Additionally, the AUC loss achieves performance comparable to the more complex, domain-specific, state-of-the-art methods for vehicle re-identification.
Resumen

BoN is an efficient method that selects a bag of constrained negative samples based on a new online sparse indexing (hashing) strategy. We show the superiority of BoN over state-of-the-art negative sample mining methods in terms of accuracy and training time on three large datasets.
Contents

List of figures

1 Introduction
    1.1.4 Applications
    1.1.5 Limitations
        Early methods
        Local representations
        Global representations
        Deep representations
    1.2.4 Datasets

2 Related Work

4.1 Introduction
        Datasets
        Evaluation
        Training details
        Image transformation
        Pooling
        Backbone architecture
        Re-identification examples
        Implicit attention
4.4 Conclusions

5.1 Introduction
5.2 Motivation
        Linear auto-encoder
        Datasets
        Training details
        Training time
5.7 Appendix
        Results

6.1 Introduction
        Heaviside to sigmoid
        AUC metaparameters
        Datasets
        Training details
        ∆s parameter

7 Closing remark
7.1 Conclusions
7.3 Publications
7.4 Patents

Bibliography
List of Figures

4.1 Summary of the training approach. Image triplets are sampled and fed to a three-stream Siamese architecture, trained with a ranking loss. Each stream encompasses an image transformation, convolutional layers, a pooling step, a fully connected layer, and an ℓ2-normalization. Weights of the model are shared across streams. In red we illustrate the curriculum learning strategies: (1) pretraining for classification (PFC), (2) hard triplet mining (HTM), (3) increasing image difficulty (IID).

4.2 For several queries from Market, we show the first 10 retrieved images together with the mAP and the number of relevant images (in brackets) of that query. Green (resp. red) outlines images that are relevant (resp. non-relevant) to the query.

4.3 Matching regions. For pairs of matching images, we show maps for the top 5 dimensions that contribute most to the similarity. All these images are part of the test set of Market-1501.

5.1 BoN strategy. Triplets with good-quality negatives are formed using the information from the hash table. The resulting embedding is used to learn both the deep model and a linear projection that, in turn, provides a low-dimensional embedding. Its quantization provides (possibly) new entry positions in the hash table for the input images. The hash table and the linear autoencoder are updated at each training step with minimal overhead.

5.5 The percentage of samples that were added to the hash table or moved from one bin to another. HD stands for the Hamming distance between the old and new hash entry.

5.7 Dynamic s.

6.1 The ROC curve (red line) and its approximation based on a set of thresholds s (blue line). The area under the approximated curve is calculated using the trapezoidal rule.
List of Tables

4.2 Impact of the input image size. We report mean average precision (mAP) on Market and Duke.

4.3 Top (a): influence of the pooling strategy. Middle (b): results for different backbone architectures. Bottom (c): influence of pretraining the network for classification before considering the triplet loss. We report mAP for Market and Duke.

5.2 Time required for training for 100k steps and until convergence.

5.4 Validation results at peak performance for every method and dataset. * stands for the best number found in the literature that uses additional attention ensembles. F means that the method uses bilinear pooling.

6.3 Comparison of the batch all and batch hard strategies on the Stanford Online Products [73] dataset.

6.4 Comparison of the AUC and the triplet batch hard loss functions on the Stanford Online Products [73] dataset.

6.5 Comparison of the AUC and the triplet batch hard loss functions on the CUB-200-2011 [114] dataset.

6.6 Comparison of the AUC and the triplet batch hard loss functions on the In-shop Clothes [63] dataset.

6.7 Comparison of the AUC and the triplet batch hard loss functions on the VERI-Wild [64] dataset.

6.10 Comparison with the state of the art on the CUB-200-2011 [114] cropped dataset. The embedding dimension is presented as a superscript and the backbone architecture as a subscript. R stands for ResNet, G for GoogLeNet.

6.11 Comparison with the state-of-the-art methods on the In-shop Clothes [63] dataset. The embedding dimension is presented as a superscript and the backbone architecture as a subscript. R stands for ResNet, G for GoogLeNet, V for VGG.
1 Introduction
¹ However, it took almost fifty years to develop an algorithm that could win against a world champion.
• Data preparation. Before starting the training of a deep neural network, the input data has to be loaded into RAM. Additionally, the input data is usually pre-processed by whitening and augmentation.
• Architecture design. In this step a deep neural network is designed for a specific task. The architecture can be adopted from publicly available resources, or it can be designed from scratch. In both cases, decisions about the capacity and speed of the network are taken to satisfy the task requirements, while respecting hardware limitations. More about architectures will be presented in section 2.1.
• Loss design. A loss is a measure of the difference between what the model predicts and what is expected (a.k.a. the ground truth), and it should be designed based on the final goal. The loss is calculated at every training step, and the weights and biases of the network are updated in order to optimize it.
• Optimization strategy. The way the loss is optimized is defined by the optimization strategy. Some of the most common optimization strategies are based on backpropagation [86], paired with an optimization function such as Stochastic Gradient Descent (SGD) [83], RMSprop [103] or Adam [51]. These
² Nonetheless, current neural networks employed in deep learning are a very limited simulation of the actual neurophysiology of complex brain structures.
1.1. A brief Introduction to deep learning
1.1.4 Applications
Deep learning has a very wide range of applications. It has found a purpose in every sector that deals with large amounts of digital data, such as text, images, numbers, diagnoses, etc.
One of the most important fields of deep learning application is health care. There are many ways to benefit from deep learning methods: regression models can be used to predict the future development of a disease if the model is given relevant data about the patient; deep learning methods provide analysis of various medical images such as X-ray, magnetic resonance and ultrasound; and robots can use reinforcement learning when trained to assist in surgery.
The military is yet another sector where deep learning is used. Common applications are target recognition, battlefield health care, combat simulation and threat monitoring.
Deep learning has found applications in forensics as well. Surveillance cameras can be used to find criminals and evidence about crimes. Very often, police inspectors
1.1.5 Limitations
Even though deep learning has grown fast and improved quickly in the last decade, there are certain limitations. One of the main problems of deep learning is that it only works well when models are trained with huge amounts of labeled data. Even though the amount of publicly available data is rising tremendously, the great majority of it is not labeled. Data labeling is a very slow and expensive process which requires human help. Models trained with little data tend to overfit, thus performing poorly on new, unseen data.
Hardware limitations. It is well known that deeper and more complex models provide more accurate results. Also, having a lot of data available at each training step provides better gradients and faster training. However, hardware constraints for both training and inference play a significant role when designing the architecture.
Common sense. Deep learning is capable of solving very complicated tasks. However, when a mistake occurs, it is not always clear at first glance why it happened. Deep learning methods make decisions based on different criteria than humans, without explicitly providing an answer to the question of why they predicted a specific output. For example, if a deep neural network labels a picture of a dog as a cat, it is usually not intuitive for humans why that happened. However, a profound investigation of the activations of different layers of the network can provide insight into the reasons why the network made a certain decision.
1.2. Introduction to visual search
over thirty years, and therefore there is a wide range of approaches trying to solve
it. All these approaches have one thing in common: all images, both query and
gallery, are embedded into their vector representations. Depending on the nature
of the approach, one image can be embedded into a single vector representation
(early methods, global descriptors and some deep representations), or into multiple
vector representations (local representations, some deep representations). The final
result of instance retrieval is a ranked gallery set, which is generated based on the
similarities between vector representations of query and gallery images.
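As a minimal sketch of this final ranking step (assuming single-vector embeddings compared with cosine similarity; the actual similarity measure depends on the method), ranking a gallery reduces to comparing embedding vectors:

```python
import numpy as np

def rank_gallery(query, gallery):
    """Rank gallery rows by cosine similarity to the query vector.

    query: (d,) embedding; gallery: (n, d) embeddings.
    Returns gallery indices, most similar first.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    similarities = g @ q               # cosine similarity per gallery image
    return np.argsort(-similarities)   # descending order

# Toy example: the second gallery vector points the same way as the query.
query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
ranking = rank_gallery(query, gallery)   # -> [1, 2, 0]
```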
Early methods
The first methods that were proposed for solving an instance retrieval task were published in the early 1990s [71, 99, 106]. These methods were straightforward solutions based on basic image characteristics such as color histograms, textures or shapes. Even though these methods were easy to implement and had a small inference time, they performed poorly even under the smallest changes in the images. For example, two images of the same object can have completely different color histograms depending on the illumination, scale, viewpoint, presence of occlusions, etc. (see images of one object in Figure 1.1 and their histograms in Figure 1.2).
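A small NumPy example illustrates this sensitivity (a toy grayscale histogram; the figures referenced above use real images): shifting the brightness of the same "scene" produces a very different histogram.

```python
import numpy as np

def intensity_histogram(image, bins=8):
    """Normalized intensity histogram over [0, 256)."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

rng = np.random.default_rng(0)
image = rng.integers(0, 128, size=(32, 32))   # a dark image
brighter = np.clip(image + 100, 0, 255)       # same content, more light

h1 = intensity_histogram(image)
h2 = intensity_histogram(brighter)
l1_distance = np.abs(h1 - h2).sum()   # large despite identical content
```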
Local representations
To compensate for the lack of geometric and photometric invariance in global representations, another group of methods, called local representations, appeared (see the survey [69] for more details about local representations). The authors of these methods propose choosing locations of interest in the images and extracting their local descriptors. The points of interest are chosen by interest point detectors such as Harris, Hessian, Hessian-affine or MSER [35, 66, 68] (Figure 1.3). The local descriptors are extracted by applying SIFT [65], SURF [4] or LBP [74] at each point of interest, which results in a set of local descriptors that belong to all query and gallery images (Figure 1.4).
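To make the idea concrete, here is a minimal NumPy sketch of the Harris corner response (a toy version with a 3×3 box window and no smoothing; real detectors use Gaussian weighting and non-maximum suppression):

```python
import numpy as np

def harris_response(image, k=0.05):
    """Harris corner response R = det(M) - k * trace(M)^2, 3x3 window."""
    iy, ix = np.gradient(image.astype(float))
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def window_sum(a):  # sum over a 3x3 neighborhood via shifted adds
        p = np.pad(a, 1)
        return sum(p[r:r + a.shape[0], c:c + a.shape[1]]
                   for r in range(3) for c in range(3))

    sxx, syy, sxy = window_sum(ixx), window_sum(iyy), window_sum(ixy)
    return (sxx * syy - sxy ** 2) - k * (sxx + syy) ** 2

# A white square on a black background: strong response at its corners,
# negative response along its edges, near zero in flat regions.
image = np.zeros((20, 20))
image[5:15, 5:15] = 1.0
R = harris_response(image)
```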
Global representations
A global representation of an image is a combination of all local representations of that image. These representations can be easily compared and, additionally, storing the descriptors requires less memory.

The process of obtaining global representations requires three steps: first, all local descriptors are extracted for an input image; second, each local descriptor is associated with a visual word using a bag-of-visual-words algorithm; and finally, a histogram of occurrences of visual words is created. A global representation that depends on a few visual words can be coarse. One way of improving a global representation is to introduce more entries in the codebook, but this solution comes at a significant computational cost. Another way of improving the global representation is to use higher-order statistics of the data belonging to each entry of the codebook. One way of using higher-order statistics is proposed in the Vector of Locally Aggregated Descriptors (VLAD) [44]. Instead of counting how many local descriptors fell into each entry of the codebook, this method aggregates all local descriptors assigned to the same visual word. The final representation is a concatenation of the vectors for the individual words.
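The aggregation step can be sketched in a few lines of NumPy (a simplified version with hard assignment and plain L2 normalization, omitting the power normalization and PCA often used in practice):

```python
import numpy as np

def vlad(descriptors, codebook):
    """Vector of Locally Aggregated Descriptors (minimal sketch).

    descriptors: (n, d) local descriptors of one image.
    codebook:    (k, d) visual words (e.g. k-means centroids).
    Returns a single (k * d,) global representation.
    """
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)

    k, d = codebook.shape
    v = np.zeros((k, d))
    for word in range(k):
        members = descriptors[assignment == word]
        if len(members):
            # Aggregate residuals to the centroid instead of just counting.
            v[word] = (members - codebook[word]).sum(axis=0)

    v = v.reshape(-1)                        # concatenate per-word vectors
    return v / (np.linalg.norm(v) + 1e-12)   # L2-normalize

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
g = vlad(descriptors, codebook)   # shape (4,)
```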
A more elaborate solution using higher-order statistics takes into account not only the word to which a local descriptor belongs, but also the mean and standard deviation of each visual word. This method is called the Fisher vector [76]. Descriptors are soft-assigned to the words based on their distances. Similarly to VLAD, the final representation is obtained by concatenating the vectors of each visual word.
Deep representations
Deep convolutional neural networks have been used to extract image descriptors
since the early appearance of deep learning. These descriptors are compact and
can be easily compared and used for ranking.
At the beginning of the deep learning era, the majority of methods were trained for classification. The first retrieval approaches used off-the-shelf deep convolutional neural networks, trained for classification on a large-scale general-purpose dataset such as ImageNet, for extracting features [27, 91, 122]. These methods were not appropriate for two main reasons: first, the training data was too different from the data used for the final task; and second, the loss was designed for classification, not for ranking.
The first problem has a straightforward solution: instead of using classification
data, we can train a model on the train partition of the retrieval data, which is called
fine-tuning [122]. However, this solution has two main drawbacks:
• The way that data is split into train and test partitions is different for retrieval and classification. Classification models work with a closed set of classes, meaning that all training data, as well as the queries and gallery images used for testing, belong to one of the predefined classes. Therefore, both train and test splits are non-overlapping subsets of images from all classes. On the contrary, retrieval datasets are split into train and test partitions based on the class labels: some classes are selected for training, while all images from the remaining classes are used for testing. Training a model for classification on the train set of retrieval data can lead to overfitting to the selected training classes. The model will try to associate the images that appear at test time with one of the training classes, which is not appropriate. Even though using data collected for the specific task can improve the results obtained by training on general-purpose datasets, there is still much room for improvement.
• The number of classes that can be used for training is limited by the available memory. Retrieval datasets typically contain images of a large number of classes or identities. A classification model usually has a fully-connected layer that projects the output of the last convolutional layer to a single vector. The number of parameters, as well as the memory, of this fully-connected layer is linearly proportional to the number of classes, and thus it is not appropriate for datasets with a large number of classes/categories.
Several ranking losses were proposed in order to train a model that optimizes distances between data points without predicting the classes to which they belong. These losses group the data of one class into a cluster that is separated from the clusters of the data from other classes. More about ranking losses will be presented in Section 2.
As retrieval problems can be domain-specific, many approaches propose domain-specific architectures, or take advantage of known physical characteristics of the objects. For example, in the case of person re-identification, we can expect to find body parts in the images, such as the head, arms, legs and torso. There are also attributes that can be used in addition to the general descriptors: for example, the gender of a person, the length of their hair, whether they wear glasses, the type of clothes, etc., can be used to improve the descriptor.
\[
\mathrm{precision} = \frac{TP}{TP + FP} \tag{1.1}
\]
\[
\mathrm{precision}@N = \frac{TP}{N}. \tag{1.2}
\]
Recall (true positive rate or sensitivity) measures the ratio between the number of relevant retrieved samples (true positives) and the total number of positive samples in the gallery (true positives and false negatives). Similarly to precision@N, we can define recall@N, which calculates the ratio between the number of correctly retrieved samples and the total number of positive samples (Equation 1.4).

\[
\mathrm{recall} = \frac{TP}{TP + FN} \tag{1.3}
\]
\[
\mathrm{recall}@N = \frac{TP}{N_p}. \tag{1.4}
\]
Average precision is a typical measure for instance retrieval that takes into account the order of all retrieved samples from the gallery. It computes precision and recall at every position in the ranked sequence of samples, and plots a precision-recall curve. Average precision is the average value of precision(recall) over the recall interval [0, 1] (Equation 1.5).

\[
AvgP = \sum_{k=1}^{n_g} \mathrm{precision}@k \times \Delta\,\mathrm{recall}@k. \tag{1.5}
\]
Mean average precision is a measure that calculates the area under the mean recall-precision curve over all queries. It is calculated as the mean of the average precision over all queries.
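The computations above can be sketched in a few lines of NumPy (a toy example assuming binary relevance labels for each position of a single ranked gallery):

```python
import numpy as np

def average_precision(relevance, n_positives):
    """AP for one query: sum of precision@k * delta-recall@k (Eq. 1.5).

    relevance: binary array, 1 if the k-th retrieved image is relevant.
    n_positives: total number of relevant images in the gallery.
    """
    relevance = np.asarray(relevance, dtype=float)
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_k = np.cumsum(relevance) / ranks
    delta_recall = relevance / n_positives   # recall increases only on hits
    return float((precision_at_k * delta_recall).sum())

def mean_average_precision(per_query_relevance, per_query_positives):
    """mAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(r, p)
                          for r, p in zip(per_query_relevance,
                                          per_query_positives)]))

# Two relevant images retrieved at ranks 1 and 3, out of 2 positives:
ap = average_precision([1, 0, 1], n_positives=2)   # (1/1 + 2/3) / 2 = 5/6
```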
1.2.4 Datasets
In this thesis we use several retrieval and re-identification datasets.
The Market-1501 dataset [130] (Market) is a standard person re-ID benchmark with images from 6 cameras of different resolutions. Deformable Part Model (DPM) detections were annotated as containing one of the 1,501 identities, among which 751 are used for training and 750 for testing. The training set contains 12,936 images, and there are 3,368 query images. The gallery set is composed of images from the 750 test identities and of distractor images, 19,732 images in total. There are two possible evaluation scenarios for this database, one using a single query image and one with multiple query images.
The MARS dataset [128] is an extension of Market that targets the retrieval of
gallery tracklets (i.e. sequences of images) rather than individual images. It contains
1,261 identities, divided into a training (631 IDs) and a test (630 IDs) set. The total
number of images is 1,067,516, among which 518,000 are used for training and the
remainder for testing.
The DukeMTMC-reID dataset [133] (Duke) was created by manually annotating pedestrian bounding boxes every 120 frames of the videos from 8 cameras of the original DukeMTMC dataset. It contains 16,522 images of 702 identities in the training set, and 702 identities, 2,228 query and 17,661 gallery images in the test set.
The Person Search dataset [117] (PS) differs from the previous three as it was
created from images collected by hand-held cameras and frames from movies and
TV dramas. It can therefore be used to evaluate person re-identification in a setting
that does not involve a known camera network. It contains 18,184 images of 8,432
identities, among which 5,532 identities and 11,206 images are used for training,
and 2,900 identities and 6,978 images are used for testing.
Person re-identification large dataset. We merged eleven publicly available datasets for person re-identification: CUHK01 [56], CUHK02 [55], 3DPeS [3], VIPeR [31], Airport [47], MSMT17 [112], Market-1501 [130], DukeMTMC [82]. The merged dataset has 10.5k IDs and 178k images. We used both the training and testing partitions of all the datasets except for Market-1501 and DukeMTMC-reID, and we did not use the images that are labeled as distractors or junk.
Stanford Online Products [73] is a retrieval dataset which contains 120k images of 22.6k products. The dataset is split into two partitions: the training one, which contains 59.5k images of 11.3k products, and the testing one, with 60.5k images of 11.3k classes.
DeepFashion - In-Shop Clothes Retrieval [63] is the part of the DeepFashion dataset that is designed for instance retrieval. The dataset is made of 54.6k images of 11.7k clothing items. All the images are taken under controlled conditions.
The Caltech-UCSD Birds 200 (CUB-200) [114] is a small dataset that is commonly used for image retrieval. It has 6,033 images of 200 categories of birds. Following the common practice for the retrieval task, we use the first 100 categories for training and the rest for testing. Additionally, we use the bounding boxes provided by the authors during both training and testing.
The VERI-Wild [64] is a re-identification dataset of vehicles in the wild. The images were captured by 174 surveillance cameras over one month, resulting in 277,797 images of 30,671 training identities, and three testing partitions: small (3,000 testing categories and 38,861 images), medium (5,000 identities and 64,389 images), and large (10,000 identities and 128,517 images).
2 Related Work
Image retrieval is the task of sorting a gallery set of images based on their relevance to a query image, where the more relevant images are shown before the less relevant ones. The more similar an image is to the query, the more relevant it is. However, the semantics of the word similar can be very broad. Hence, a group of machine learning algorithms, called metric learning, addresses a simplified problem by learning distances between data points, assuming that the distance between more similar data points (i.e. the same object under an implicit semantic) is smaller than the distance between less similar ones (i.e. different objects under the same implicit semantic).
Since the beginning of the deep learning era, the mainstream metric learning techniques have become deep metric learning systems. These systems use a deep neural network, often called a backbone, to embed data into a space in which the distances between data points can be measured. The backbone is trained to construct an embedding space in which the distance between more similar data points is smaller than the distance between less similar data.
In this chapter we present the most relevant works that are related and influential to this thesis. We start by introducing the backbone architectures that are most commonly used for metric learning. Next, we present the loss functions, as well as the hard negative mining strategies that are used for efficient training.
2.1. Backbone architectures

ResNet [37] won the ImageNet challenge in 2015. Instead of increasing the network capacity in width, as proposed in [101], the authors of ResNet show that training a very deep network can be beneficial. ResNet comes in 5 different configurations with 18, 34, 50, 101 or 152 layers. This architecture is made of consecutive residual blocks, presented in Figure 2.2. The authors propose using residual blocks in order to create a direct path from the shallow layers to the output of the network. The direct path should ease the training of the shallow layers, and thus allow training of deeper networks.
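The residual principle can be sketched with fully connected layers instead of convolutions (a NumPy toy illustrating the shortcut connection, not the actual ResNet block, which uses convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """A minimal residual block: output = relu(x + F(x)).

    F is a small two-layer transformation; the identity shortcut gives
    gradients a direct path back to the shallow layers.
    """
    fx = relu(x @ w1) @ w2      # the residual branch F(x)
    return relu(x + fx)         # shortcut connection: add the input back

rng = np.random.default_rng(0)
x = relu(rng.standard_normal(4))   # a non-negative activation vector
# With zero weights F(x) = 0, so the block reduces to the identity.
identity_out = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
```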
architectures with mechanisms for automatic scale selection [77] or scale fusion [14].
[54] combines a multi-scale architecture with unsupervised body part localization
using spatial transformer networks.
\[
\sigma(s_i) = \frac{\exp(s_i)}{\sum_{j=1}^{c} \exp(s_j)}. \tag{2.1}
\]
The cross-entropy loss for a sample that belongs to a class i is defined as:

\[
L_{CE} = -\log\big(\sigma(s_i)\big). \tag{2.2}
\]
The main disadvantages of using the cross-entropy loss for retrieval are: (1) poor
generalization and (2) poor scalability due to the size of the fully connected layer.
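The softmax of Equation 2.1 and the cross-entropy loss can be written in a few lines of NumPy (a minimal sketch for a single sample; the score vector here is arbitrary):

```python
import numpy as np

def softmax(scores):
    """Equation 2.1: normalize class scores into probabilities."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

def cross_entropy(scores, true_class):
    """Cross-entropy loss: negative log-probability of the true class."""
    return float(-np.log(softmax(scores)[true_class]))

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)                      # sums to 1
loss = cross_entropy(scores, true_class=0)   # small when class 0 dominates
```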
2.2. Loss functions
\[
L_{contrastive} = \frac{1}{2}(1 - Y)\,d^2 + \frac{1}{2}\,Y \max(0,\ m - d)^2. \tag{2.3}
\]
Triplet loss [88] requires a Siamese architecture with three streams, which is fed with three images: an anchor image $I_a$, an image from the same class $I_p$, and an image from any other class $I_n$. All three images are embedded into their descriptors $r_a$, $r_p$ and $r_n$. The triplet loss pushes the samples from the same class closer to each other while separating the samples from different classes if the difference between the anchor-negative distance ($d^- = \|r_a - r_n\|_2$) and the anchor-positive distance ($d^+ = \|r_a - r_p\|_2$) is smaller than a margin $m$:

\[
L_{triplet} = \frac{1}{2} \max(0,\ d^+ - d^- + m). \tag{2.4}
\]
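A direct translation of the triplet loss into NumPy (a sketch for a single triplet; the margin value and the toy embeddings are arbitrary choices here):

```python
import numpy as np

def triplet_loss(r_a, r_p, r_n, margin=0.2):
    """Equation 2.4: hinge on anchor-positive vs anchor-negative distances."""
    d_pos = np.linalg.norm(r_a - r_p)   # d+ : anchor to positive
    d_neg = np.linalg.norm(r_a - r_n)   # d- : anchor to negative
    return 0.5 * max(0.0, d_pos - d_neg + margin)

anchor   = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # close to the anchor
negative = np.array([1.0, 0.0])   # far from the anchor
loss_easy = triplet_loss(anchor, positive, negative)   # satisfied -> 0
loss_hard = triplet_loss(anchor, negative, positive)   # violated  -> 0.55
```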
In [13] the authors propose adding an additional, fourth stream to the Siamese architecture. In this case one stream is for the anchor, one for the positive and two for negative images. The final objective is to have the anchor-positive distance smaller than the anchor-negative distance, while making sure that the anchor-negative distance is greater than the negative-negative distance.
Many approaches, inspired by the triplet loss, proposed various ways to optimize
training time and quality of the results by exploiting more information from the data
that are available in a mini-batch. In [40] the authors propose creating mini-batches
of P classes and K images per class. In each training step they do a forward pass of
all P × K images and get their descriptors. They propose two ways of optimizing
the main objective: Batch hard and Batch all. The batch hard triplet loss treats all images from the mini-batch as anchors and, for each of them, selects the furthest sample from the same class as the positive and the closest sample from a different class as the negative:

L_{BH} = \sum_{i=1}^{P}\sum_{a=1}^{K}\Big[m + \max_{p=1..K}\|r_a^i - r_p^i\| - \min_{\substack{j=1..P\\ n=1..K\\ j \neq i}}\|r_a^i - r_n^j\|\Big]_+. \qquad (2.5)
The Batch all strategy calculates the loss based on all positive and negative pairs from the mini-batch (Equation 2.6).
L_{BA} = \sum_{i=1}^{P}\sum_{a=1}^{K}\sum_{\substack{p=1\\ p \neq a}}^{K}\sum_{\substack{j=1\\ j \neq i}}^{P}\sum_{n=1}^{K}\big[m + d_{j,a,n}^{i,a,p}\big]_+, \qquad (2.6)

d_{j,a,n}^{i,a,p} = \|r_a^i - r_p^i\|_2 - \|r_a^i - r_n^j\|_2.
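To make the batch hard selection of Equation (2.5) concrete, a minimal NumPy sketch over a mini-batch of P classes with K images each (function and variable names are ours):

```python
import numpy as np

def batch_hard_loss(descriptors, labels, m=0.2):
    """descriptors: (B, D) array, labels: (B,) class ids.
    Every image acts as an anchor; for each one we take the furthest
    same-class sample and the closest other-class sample (Equation 2.5)."""
    d = np.linalg.norm(descriptors[:, None, :] - descriptors[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    total = 0.0
    for a in range(len(labels)):
        hardest_pos = d[a][same[a]].max()    # furthest positive
        hardest_neg = d[a][~same[a]].min()   # closest negative
        total += max(0.0, m + hardest_pos - hardest_neg)
    return total

# P = 2 classes, K = 2 images per class, toy 2-D descriptors.
desc = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.0, 0.1]])
labels = np.array([0, 0, 1, 1])
loss = batch_hard_loss(desc, labels)  # zero: classes are well separated
```

When the two classes overlap in descriptor space, the hardest positives and negatives activate the hinge and the loss becomes positive.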
Similarly to the batch all triplet loss, the structured loss [73] (Equation 2.7) and
the n-pair loss [93] (Equation 2.8) take advantage of all positive and negative pairs
from the mini-batch. In [93] the authors propose creating a mini-batch of N pairs \{(x_1, x_1^+), (x_2, x_2^+), \ldots, (x_N, x_N^+)\} from N different classes. For each positive pair they sample N - 1 negative samples from all the other classes, and they use them for calculating the loss. In [73] the negative pairs are sampled inside of a mini-batch, so that the negative is one of the closest samples to either the anchor or the positive, for each anchor-positive pair in the mini-batch.
J_{i,j} = \log\Big(\sum_{(i,k)\in\mathcal{N}} \exp(m - d_{i,k}) + \sum_{(j,l)\in\mathcal{N}} \exp(m - d_{j,l})\Big) + d_{i,j}, \qquad (2.7)

L_{structured} = \frac{1}{2|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}} \max(0, J_{i,j})^2, \quad d_{i,j} = \|r_i - r_j\|_2.
L_{n\text{-}pair} = \frac{1}{N}\sum_{i=1}^{N} \log\Big(1 + \sum_{j \neq i} \exp(r_i^T r_j^+ - r_i^T r_i^+)\Big). \qquad (2.8)
2.3. Hard negative mining
Thanks to a differentiable approximation of the histogram function, a new group of losses that directly optimize mAP appeared [38, 39, 81].
In [105] the authors propose the Histogram loss. The method is based on the
histograms that approximate the distributions of positive and negative similarities
inside of a mini-batch. The loss is designed to separate the two distributions. This
objective does not directly optimize the ranking task, but indirectly, it forces all
positive pairs to have higher similarity than all negative pairs. It is the inspiration
of several listwise losses [38, 39, 81] that directly optimize the average precision.
The pioneer in this line is a differentiable approximation of average precision (AP)
for retrieval in Hamming space, that focuses especially on tie scenarios (where
both positive and negative samples belong to the same histogram bin) [38]. In
[39] the authors apply the same strategy on retrieval and patch matching tasks.
Mini-batch size has a great impact on the results, so the first approaches showed
the results on patch matching, because these images are smaller, and the backbone
architectures used have fewer parameters. In [81] the authors propose a way to
train a very deep CNN (such as ResNet101) with large images (800x800 pixels) while
optimizing the mAP loss on the whole train set. The method performs a full forward pass of all images in the dataset, calculates the similarity matrix and the loss, then recomputes the descriptors of all images, stores their intermediate tensors and accumulates the gradients. Once all gradients are accumulated, it backpropagates the errors through the network. This method cannot easily scale to larger datasets, due to its high computational cost per weight update.
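The quantity that this family of losses approximates is the average precision of a ranking. As a plain (non-differentiable) reference, AP for a single query can be computed in NumPy as follows; this is only an illustration of the target metric, not the smoothed approximations of [38, 39, 81]:

```python
import numpy as np

def average_precision(similarities, relevant):
    """AP for one query: rank the gallery by similarity and average
    the precision at each rank where a relevant item appears."""
    order = np.argsort(-similarities)        # descending similarity
    rel = np.asarray(relevant)[order]
    hits = np.cumsum(rel)
    precision_at_k = hits / (np.arange(len(rel)) + 1)
    return (precision_at_k * rel).sum() / rel.sum()

sims = np.array([0.9, 0.8, 0.7, 0.6])
ap = average_precision(sims, [1, 0, 1, 0])   # hits at ranks 1 and 3
# precision@1 = 1, precision@3 = 2/3, so AP = (1 + 2/3) / 2
```

The hard steps hidden in `argsort` and the indicator vector are exactly what the histogram-based approximations replace with soft, differentiable counterparts.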
images per each one of l random classes. The pioneering approach, called semi hard loss, was introduced in [88]: triplets are created from all anchor-positive pairs in a mini-batch, and the negative sample is chosen so that the loss is between 0 and the predefined margin α (see Equation 2.4).
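A minimal sketch of that selection rule for one anchor-positive pair, under our own naming (a semi-hard negative is further than the positive but still inside the margin, so the loss lies strictly between 0 and the margin):

```python
import numpy as np

def semi_hard_negative(d_ap, d_an, margin=0.2):
    """d_ap: anchor-positive distance; d_an: distances from the anchor to
    every candidate negative in the batch. Returns the index of a negative
    with d_ap < d_an < d_ap + margin, or None if the batch has none."""
    candidates = np.where((d_an > d_ap) & (d_an < d_ap + margin))[0]
    if len(candidates) == 0:
        return None
    return int(candidates[0])

d_an = np.array([1.5, 0.55, 0.3, 0.9])          # toy anchor-negative distances
idx = semi_hard_negative(d_ap=0.5, d_an=d_an)   # index 1: 0.5 < 0.55 < 0.7
```

For the chosen negative the triplet loss is 0.5 − 0.55 + 0.2 = 0.15, i.e. inside (0, α) as required.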
The Lifted Embedding loss is proposed in [73]: for each anchor-positive pair in the mini-batch, the negative image is the one closest to either the anchor or the positive.
In [40] the authors propose two strategies for sampling within a mini-batch, which are extensions of the Lifted Embedding loss. The Batch all loss is obtained from all possible combinations of triplets inside the mini-batch. The Batch hard loss takes all the images from the mini-batch as anchors of triplets. The positive is selected as the furthest sample from the same class as the anchor in the mini-batch, while the negative is the closest to the anchor among all the samples from different classes in the mini-batch.
Curriculum sampling is proposed in [107], where the beginning of training uses easy negative instances and the difficulty increases over time. For each anchor, all negatives from the mini-batch are sorted according to their distances to the anchor, and the representative negative is sampled with a Gaussian distribution N(µ, σ). µ and σ are changed over time, so that µ moves from the maximum distance towards the minimum distance, while σ shrinks towards 0.
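One possible reading of this schedule, sketched in NumPy with the Gaussian placed over positions in the distance-sorted list rather than over raw distances (an assumption of ours; names and the schedule shape are illustrative only):

```python
import numpy as np

def curriculum_negative(d_an, progress, sigma=0.5, rng=None):
    """d_an: anchor-negative distances in the mini-batch.
    progress in [0, 1]: 0 = start of training (easy, far negatives),
    1 = end of training (hard, close negatives); sigma shrinks over time."""
    rng = rng or np.random.default_rng()
    order = np.argsort(-d_an)                  # farthest negative first
    mu = progress * (len(order) - 1)           # mean slides from far to near
    scale = sigma * (1 - progress) + 1e-6      # sigma reduces towards 0
    pos = int(np.clip(round(rng.normal(mu, scale)), 0, len(order) - 1))
    return int(order[pos])

d_an = np.array([0.2, 1.4, 0.9, 0.6])
easy = curriculum_negative(d_an, progress=0.0, sigma=0.0)  # farthest: index 1
hard = curriculum_negative(d_an, progress=1.0, sigma=0.0)  # closest: index 0
```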
All of these approaches have the same drawback: they focus on the local distri-
bution of data inside of a mini-batch, while sampling the candidates for mini-batch
randomly. A mini-batch created randomly is a good representation of the global
distribution, but it does not represent the local embedding space. As relevant neg-
ative samples could be found in the local neighborhood of the anchor sample, the
probability of sampling useful triplets rises if the mini-batch is created from samples
that belong to the same local neighborhood.
Another research line comprises methods that use adversarial samples for met-
ric learning [12, 21]. In [21] the authors propose a way of training Siamese networks
by generating adversarial, potentially hard, negative samples for training with vari-
ations of the triplet loss. The descriptors of all three input images are used for
generating a synthetic, hard negative descriptor. This descriptor, together with the
anchor and positive, forms a triplet of descriptors that is used for calculating the
loss. Similarly, in [12] the authors propose a metric learning strategy that uses a set
of real and a set of synthetic pairs for training.
The fourth research line proposes online strategies for providing relevant nega-
tive samples prior to mini-batch formation [25, 36, 96, 108], and one of our contri-
butions belongs to this research line.
In [25] the authors propose a strategy that builds a tree of identities to facilitate the sampling of relevant negatives for a given anchor. The method clearly improves the quality of negative samples, but at the cost of updating the tree at every epoch.
3 Motivation and contributions
In this section we present the outline of this thesis as well as the contributions of each chapter. We first compare and evaluate various solutions for the person re-identification task proposed in the literature, and propose the best combination to maximize performance in Chapter 4. We then propose a new strategy for sampling hard negatives for efficient training in Chapter 5. Finally, we propose a new loss function for the explicit maximization of the area under the ROC curve in Chapter 6.
3.3. Loss for explicit maximization of the area under the ROC curve

A model dominates in ROC space if and only if it dominates in recall-precision space [17], which makes the AUC loss appropriate for both retrieval and recognition. We tested the AUC loss on four publicly available datasets, and showed that it achieves state-of-the-art performance for both retrieval (measured by mAP and rank@N) and recognition (measured by the area under the ROC curve).
4 Good practices for person re-identification
4.1 Introduction
Person re-identification (re-ID) is the task of correctly identifying the images in a
database that contain the same person as a query image. It is highly relevant to
applications where the same person may have been captured by different cameras,
for example in surveillance camera networks or for content-based image or video
retrieval.
Re-ID has been heavily studied for more than two decades (please refer to [5] for
a review). Most works that address this problem have sought to improve either the
image representation, often by carefully hand-crafting its architecture, or the image
similarity metric. Following the great success of deep networks on a wide variety
of computer vision tasks, including image classification [37], and object detection
[80], a dominant paradigm in person re-ID has emerged, where methods use or
fine-tune successful deep architectures for the re-ID task [13, 57, 59, 95].
Person re-ID is challenging for several reasons. First, one typically assumes that
the individuals to be re-identified at testing time were never seen during the model’s
training phase. Second, the problem is large-scale in the sense that at testing time
one may need to re-identify thousands of individuals. An additional challenge is
that images of the same person have often been captured under different conditions
(including lighting, resolution, scale and perspective) by different cameras. In
particular, the pose of the person may be vastly different between different views.
For example, one may need to match a frontal view of a person walking to a profile
view of the same person after they have mounted a bicycle (see example of such a
positive pair in the triplets illustrated in Figure 4.1). Lastly, most re-ID systems rely
on a pre-processing stage where a person detection algorithm is applied to images
in order to localize individuals. As such, they must be robust to detection errors
leading to truncated persons or poor alignment.
Recent works in the literature often introduce additional modules to their deep
networks to address the aforementioned challenges of scale and pose variations,
and detection errors. Some of these additional modules explicitly align body parts
Chapter 4. Good practices for person re-identification
between images [94, 126], for example by using pre-trained pose estimators or
human joint detectors. Others add attentional modules [79] or scale fusion [14].
Some use additional annotations such as attributes [95].
In this work, rather than focus on hand-crafting additional modules to address
the various challenges of re-ID, we adopt a different approach and focus instead
on designing an effective training procedure for deep image representations. In
particular, we draw inspiration from works on curriculum learning [6], which aim
to improve model convergence and performance by continually modulating the
difficulty of the task to be learned throughout the model’s training phase. Our
carefully designed learning approach only impacts training, which means that at
test time our approach is very efficient. Consequently, our approach results in a
compact but powerful architecture that produces global image representations
that, when compared using a dot-product, outperform state-of-the-art person re-
identification methods by large margins, including more sophisticated methods
that rely on extra annotations or explicit alignment.
4.2. Learning a global representation for re-ID
Figure 4.1 – Summary of the training approach. Image triplets are sampled and
fed to a three-stream Siamese architecture, trained with a ranking loss. Each stream
encompasses an image transformation, convolutional layers, a pooling step, a fully connected layer, and an ℓ2-normalization. Weights of the model are shared across
streams. In red we illustrate the curriculum learning strategies: (1) pretraining for
classification (PFC), (2) hard triplet mining (HTM), (3) increasing image difficulty
(IID).
the descriptors q, d^+ and d^-, respectively. We then define the ranking triplet loss as

L(I_q, I^+, I^-) = \max(0, m + \|q - d^+\|^2 - \|q - d^-\|^2),

where m is the margin. This loss ensures that the embedding of the positive image I^+ is closer to the query image embedding q than that of the negative image I^-, by at least a margin m.
We now discuss key practices for improved training of our model.
Pretraining for Classification (PFC). First, we follow standard practice and use networks pre-trained on ImageNet. Then, we perform the additional pre-training step of fine-tuning the model on the training set of each re-ID dataset using a classification loss. That is, we train our model for person identification, or ID classification. The weights obtained for the convolutional layers are then used to
initialize the weights of the Siamese architecture described in the previous section.
Hard Triplet Mining (HTM). Mining hard triplets is crucial for learning. As al-
ready argued in [116], when applied naively, training with a triplet loss can lead to
underwhelming results. Here we follow the hard triplet mining strategy introduced
in [29]. First, we extract the features for a set of N randomly selected examples
using the current model and compute the loss of all possible triplets. Then, to select
triplets, we randomly select an image as a query and randomly pick a triplet for that
query from among the 25 triplets with the largest loss. To accelerate the process,
we only extract a new set of random examples after the model has been updated k
times with the desired batch size b. This is a simple and effective strategy which
yields good model convergence and final accuracy, although other hard triplet
mining strategies [116] could also be considered.
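The selection step of this HTM strategy can be sketched in NumPy as follows (the per-triplet losses are given as a toy array; function names are ours):

```python
import numpy as np

def mine_hard_triplet(losses, queries, rng, top=25):
    """losses: per-triplet loss values; queries: query id of each triplet.
    Pick a random query, then a random triplet among that query's `top`
    largest-loss triplets, as described above."""
    q = rng.choice(np.unique(queries))
    idx = np.where(queries == q)[0]
    hardest = idx[np.argsort(-losses[idx])[:top]]
    return int(rng.choice(hardest))

rng = np.random.default_rng(0)
losses = rng.random(1000)                  # toy losses for 1000 triplets
queries = rng.integers(0, 20, size=1000)   # 20 candidate query images
t = mine_hard_triplet(losses, queries, rng)
```

Re-using the same feature extraction for k model updates, as described above, amortizes the cost of recomputing `losses` over several batches.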
Evaluation. We follow standard procedure for all datasets and report the mean
average precision (mAP) over all queries and the cumulative matching curve (CMC)
at rank-1 and rank-5 using the evaluation codes provided by the authors of the
datasets.
4.3. Empirical evidence
that the largest image dimension is either 256, 416, or 640 pixels, without distorting
the aspect ratio. We report results in Table 4.2 and observe that using a sufficiently
large resolution is key to achieving the best performance. Increasing the resolution
from 256 to 416 improves mAP by 3%, while increasing it further to 640 pixels shows
negligible improvement. We set the input size to 416 pixels for the rest of this paper.
Pooling. Table 4.3 (a) compares two pooling strategies (#4) over the feature map produced by the convolutional layers. Since max pooling performs better than average pooling on both datasets, we use it for the rest of this chapter.
Backbone architecture. Table 4.3 (b) compares different architectures for the convolutional backbone of our network (#3). Results show that using ResNet-101 significantly improves the results compared with ResNet-50 (about +5 mAP on both datasets). The more memory-hungry ResNet-152 only marginally improves the results.
Fine-tuning for classification. Table 4.3 (c) shows the importance of fine-tuning
the convolutional layers for the identity classification task before using the ranking
loss to adjust the weights of the whole network (#6). As discussed in Section 4.1.1,
training the model on tasks of increasing difficulty is highly beneficial.
Table 4.2 – Impact of the input image size. We report mean average precision
(mAP) on Market and Duke.
                                          Market   Duke
 a) pooling strategy        average        80.1    71.4
                            max            81.2    72.9
 b) backbone architecture   ResNet-50      76.3    67.6
                            ResNet-101     81.2    72.9
                            ResNet-152     81.4    74.0
 c) pretraining for class.  no             77.1    71.1
                            yes            81.2    72.9
Table 4.3 – Top (a): influence of the pooling strategy. Middle (b): results for
different backbone architectures. Bottom (c): influence of pretraining the
network for classification before considering the triplet loss. We report mAP for
Market and Duke.
very large performance drop (-11%), confirming that the generation of more and
more difficult examples is highly beneficial when paired with the HTM strategy that
feeds the hardest triplets to the network.
Table 4.4 – Impact of different design choices. We report mean average precision
(mAP) on Market and Duke, using ResNet-101 as backbone architecture.
1 We expand both the query and the dataset by averaging the representation of the first 5 and 10
closest neighbors, respectively.
Figure 4.2 – For several queries from Market, we show the first 10 retrieved images
together with the mAP and the number of relevant images (in brackets) of that
query. Green (resp. red) outlines images that are relevant (resp. non-relevant) to
the query.
Re-identification examples. In Figure 4.2, we show good results (left) and failure
cases (right) for several query images from the Market dataset. We see that our
method is able to correctly re-identify persons despite pose changes or strong scale
variations. We observe that failure cases are mostly due to confusions between two
people that are extremely difficult to differentiate even for a human annotator, or to
unusual settings (for instance the person holding a backpack in front of him, as in Figure 4.2 (d)).

Figure 4.3 – Matching regions. For pairs of matching images, we show maps for the top 5 dimensions that contribute most to the similarity. All these images are part of the test set of Market-1501.
For each pair of matching images, we select the 5 dimensions that contribute most to the similarity between their representations. Then, for each image, we propagate the gradients of these 5 dimensions individually, and visualize their activations in the last convolutional layer of our architecture. In Figure 4.3, we show several image pairs and their respective activations for the top 5 dimensions.
We first note that each of these output dimensions is activated by fairly localized image regions and that the dimensions often reinforce one another, in that image pairs are often activated by the same region. This suggests that the similarity
score is strongly influenced by localized image content. Interestingly, these local-
ized regions tend to contain body regions that can inform on the type of clothing
being worn. Examples in the figure include focus on the hem of a pair of shorts,
the collar of a shirt, and the edge of a sleeve. Therefore, rather than focusing on
aligning human body joints, the model appears to make decisions based on at-
tributes of clothing such as the length of a pair of pants or of a shirt’s sleeves. This
type of information has been leveraged explicitly for retrieval using the idea of
“fashion landmarks”, as described in [63]. Finally, we observe that some of the paired
responses go beyond appearance similarity and respond to each other at a more
abstract and semantic level. For instance, in the top right pair the strong response
of the first dimension to the bag in the first image seems to pair with the response
to the strap of the bag in the second image, the bag itself being occluded.
Implicit attention. We now qualitatively examine which parts of the images are
highly influential, independently of the images they are matched with. To do so,
given an image and its embedding, we select the first 50 dimensions with the
strongest activations. We then propagate and accumulate the gradients of these
dimensions, again using Grad-Cam [89], and visualize their activations in the last
convolutional layer in our architecture. As a result, we obtain a visualization that
highlights parts of the images that, a priori, will have the most impact on the final
results. This can be seen as a visualization of the implicit attention mechanism that
is at play in our learned embedding.
We show such implicit attention masks in Figure 4.4 across several images of the
same person, for three different persons. We first observe that the model attends
to regions known to drive attention in human vision, such as high-resolution text
(e.g. in rows 1 and 2). We also note that our model shows properties of contextual
attention, particularly when image regions become occluded. For example, when
the man in the second row faces the camera, text on his t-shirt and the hem of his
pants are attended to. However, when his back or side is to the camera, the model
focuses more intently on the straps of his backpack.
Figure 4.5 – mAP as a function of the number of distractors (in thousands) added to the gallery, comparing our ResNet-101 and ResNet-50 models with Verif-Identif [ResNet50] [45].
set of 500K distractors. To generate these distractors, the authors first collected
ground-truth bounding boxes for persons in the images. They then computed the
IoU between each predicted bounding box and ground-truth bounding box for a
given image. A detection was labeled a distractor if its IoU with all ground-truth
annotations was lower than 20%.
We evaluate our ResNet-50- and ResNet-101-based models, trained on Market,
on this expanded dataset, while increasing the number of distractors from 0 to 500K.
We selected distractors by randomly choosing them from the distractor set and
adding them to the gallery set. We compare our models with the previous state-
of-the-art results reported for this expanded dataset [131]. Both versions of our
model significantly outperform the state of the art, as presented in Figure 4.5. Note
that our ResNet-50 model, with 500K added distractors, still outperforms [131]’s
performance with 0 added distractors.
4.4 Conclusions
In this chapter we have proposed an approach to training person re-identification
models based on curriculum learning principles. We have shown that, by carefully
applying these principles to the training of a Siamese architecture with a triplet
loss, a compact architecture without additional hand-engineered modules can
outperform state-of-the-art methods with complex architectures, on 4 benchmark
datasets. Because our contribution only impacts the training phase, at test time our
approach remains simple and efficient, a key advantage for most applications.
Additionally, we found qualitative evidence that the different dimensions of our
representation specialize in a way that allows them to have strong, localized and
semantically discriminative responses in the presence of a positive image pair. This
suggests that our approach is able, to some extent, to implicitly capture what some
previous approaches have explicitly included in their visual representations.
5 Hard negative mining
5.1 Introduction
In this chapter, we propose an online strategy for mining samples that contributes to a more efficient training of Siamese architectures, while providing better validation scores on several datasets. We test our method on datasets large enough that the retrieval and re-identification problems cannot be easily solved using a classification loss. We use a large person re-ID dataset obtained by merging publicly available datasets (similarly to [46]), and we show results on the publicly available retrieval datasets Stanford Online Products [73] and DeepFashion [63].
5.2 Motivation
The triplet loss (see Equation (5.1)) is based on the construction of triplets i ∈ T formed by an anchor sample x_i^a, a positive sample x_i^p (belonging to the same class as the anchor) and a negative sample x_i^n. The samples are mapped into an embedding by a given function f(·), usually a deep convolutional network, whose parameters are learned by minimizing the loss L:

L = \frac{1}{n_t} \sum_{i \in T} \max(0, \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha) \qquad (5.1)
The goal of the triplet loss is to ensure that the anchor-negative distance is larger than the anchor-positive distance by at least a margin α. It is well known that the most challenging part of using the triplet loss to train a metric learning system is generating triplets that produce a non-zero loss [88]. This is hard, since the number of all possible triplets in the dataset is proportional to the cube of the total number of images N in the dataset, |T| ∼ N³, and the more the system trains, the less probable it is to find a negative that provides a non-zero loss for a given anchor-positive pair [88].
Let n be the average number of images per class, m the mini-batch size, k the number of images of each class in the mini-batch, and l the number of steps per epoch.
epoch. For the sake of clarity, we introduce the notation n̂, the number of negative
samples that produce a non-zero loss if used in conjunction with the triplet loss
and an anchor-positive pair. The more we train the Siamese network the smaller n̂
becomes.
We propose a systematic cost analysis given a sampling method in terms of
n e , the extra number of forward passes to be computed per epoch, and n d , the
extra number of distances to be computed in order to select a set of negatives per-
mini-batch over an entire epoch. Additionally, we report the number of triplets per
mini-batch n t , summarised in Table 5.1.
The “quality” of the retrieved negatives is also relevant, as pointed out in [116]:
negative samples have to be distributed such that the anchor-negative distance is al-
most uniformly distributed. More on this topic will be discussed in the experimental
section.
Sampling the negatives randomly from the whole dataset has complexity O(1), but it does not provide relevant negative samples except at the beginning of the training, since p_n̂ = n̂/(N − n) ≈ n̂/N. From now on we will omit n from the formula, since it is negligible w.r.t. N.
The semi hard loss [88] employs a negative sampling strategy that has an increased cost, due to the fact that the additional computed distances scale polynomially with the mini-batch size. The improvement in p_n̂ with respect to random sampling depends linearly on the number of triplets b. For this reason, the authors use huge mini-batches, in the order of 1800 samples. p_n̂ is thus increased to 2b·n̂/N, at the cost of large mini-batches and additional computation.
The batch hard loss [40] is an improved version of the semi hard loss where, thanks to a more controlled mini-batch creation and additional distance computations, the method exhibits p_n̂ = m·n̂/N. This strategy offers a 50% improvement in p_n̂ w.r.t. the semi hard approach, but still provides a probability that depends on the mini-batch size. The additional cost of the distance computations is mitigated by a factor of 3 in the number of computed triplets.
An offline exhaustive search over the dataset provides p_n̂ = min(3n̂/m, 1). This is, of course, not viable for large datasets. Nonetheless, for relatively small datasets, and with a proper sampling strategy over the m(N − m) distances, exhaustive search provides excellent negative samples [116].
Hierarchical Tree sampling [25], 100k IDs [108], Smart Mining [36] and Stochas-
tic class-based hard example mining [96] are methods for sampling candidates prior
to mini-batch creation. Those methods can be combined with online hard mining
strategies (such as semi-hard and batch hard) and further increase the probability
of sampling relevant negative samples.
In [25] the authors propose sampling identities based on inter-class distances. The main drawback of this method is the high computational cost of creating the inter-class distance matrix. This matrix should be updated once per epoch, which requires forward passes of the whole dataset (O(N)) and calculating all-vs-all sample distances (O(N²)).
The method proposed in [108] for batch generation is based on hashing. This approach is faster than the Hierarchical Tree, as it does not require any additional distance calculations nor extra embedding extraction. Its drawback is the complexity of generating the hash table, as it requires training a classifier on a subset of the dataset, extracting the features of all images from the train set and executing k-means clustering.
The Smart Mining method [36] uses samples from an approximate nearest neighborhood to create potentially relevant triplets. However, at the beginning of each epoch one full forward pass of the whole dataset is performed. In addition, at each training step (N/i)² distances are computed, where i is the number of neighborhoods.
Stochastic class-based hard example mining [96] uses class signatures when creating triplets. This approach requires k(K − 1) extra forward passes at each training step and kN/n distances, where K is the number of classes in the mini-batch.
This brief study shows that the efficiency of relevant negative mining is a crucial issue.
Figure 5.1 – BoN strategy. Triplets with good quality negatives are formed using the
information from the hash table. The resulting embedding is used to learn both the
deep model and a linear projection that, in turn, provides a low-dimensional
embedding. Its quantization provides (possibly) new entry positions in the hash
table for the input images. The hash table and the linear autoencoder are updated
at each training step with minimal overhead.
Also, increasing the probability of picking a relevant negative is key to the improved performance from the semi hard to the batch hard strategy. Scalability to very
large datasets with a large number of classes is a necessity within the training of
Siamese architectures.
In this chapter we propose a novel method for batch creation, inspired by
Spectral Hashing [113]. In contrast to Spectral Hashing, which requires additional
forward passes of all images from the dataset, our method updates the hash table
online, with negligible computational cost.
5.3. Bag of Negatives
Linear auto-encoder
Since online PCA estimation is in general computationally inefficient and potentially numerically unstable [10], we train a linear autoencoder (AE) with an L2 reconstruction loss, as in Equation (5.2), where h(x) is the projection onto a sub-space of dimensionality s. The reconstruction loss should not modify the embedding space; therefore, the gradients generated by the L_AE loss are back-propagated only through the fully connected layers of the autoencoder.
h(x) = W1 f (x) + b 1
fˆ(x) = W2 h(x) + b 2 (5.2)
L AE = || f (x) − fˆ(x)||22
This approach can approximate a PCA computation, but it also allows non-orthogonal representations. The AE continuously models the projection that provides the codeword to the hash table update procedure. The added cost of learning the AE is negligible w.r.t. the Siamese network training. The choice of s is related to two factors: (1) the smaller s, the more difficult it is to reconstruct (in the L2 sense) the original-sized embedding, and (2) the bigger s, the larger the number of bins obtained after binarization, more precisely B = 2^s. A detailed analysis of the behaviour of BoN as a function of s is presented in the experimental section.
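A minimal NumPy sketch of the linear AE of Equation (5.2), trained by SGD on the L2 reconstruction loss; the embedding f(x) is treated as a fixed input, mirroring the stop-gradient described above (sizes, learning rate and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, s = 32, 4                       # embedding size D and code size s << D
W1, b1 = 0.1 * rng.normal(size=(s, D)), np.zeros(s)
W2, b2 = 0.1 * rng.normal(size=(D, s)), np.zeros(D)

def ae_step(f_x, lr=0.01):
    """One SGD step on L_AE = ||f(x) - f_hat(x)||^2.
    Gradients update only the AE weights, never the embedding f(x)."""
    global W1, b1, W2, b2
    h = W1 @ f_x + b1              # code: h(x) = W1 f(x) + b1
    f_hat = W2 @ h + b2            # reconstruction: f_hat(x) = W2 h(x) + b2
    err = 2.0 * (f_hat - f_x)      # dL/df_hat
    gW2, gb2 = np.outer(err, h), err
    g_h = W2.T @ err               # backprop through the decoder
    gW1, gb1 = np.outer(g_h, f_x), g_h
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
    return float(np.sum((f_hat - f_x) ** 2))

x = rng.normal(size=D)
losses = [ae_step(x) for _ in range(200)]
# Repeated steps drive the reconstruction loss of this embedding down.
```

The binarized sign pattern of h(x) would then provide the s-bit codeword used as a hash table entry.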
from the bin to which it had been assigned (line 6 in Algorithm 2), which has a cost of O(N/2^s). In terms of memory cost, assuming that both the class I(v) and the sample v identifiers can be represented with 4-byte integers, we need only a total of 4(N + 2N) bytes to store both the hash table L and the hash entry C. As an example, even a very large dataset with 10M images requires only 115 Megabytes for the hash table. The update procedure is described in Algorithm 2.
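A simplified sketch of this bookkeeping (Algorithm 2 is paraphrased; we store only sample ids per bin, omitting the class ids, and the structure names follow the text):

```python
# L maps each of the 2**s bins to the samples currently assigned to it;
# C stores each sample's current bin, so a stale entry can be removed
# from exactly one bin when the sample's code changes.
s = 3
L = {b: set() for b in range(2 ** s)}    # hash table: bin id -> sample ids
C = {}                                   # hash entry: sample id -> bin id

def update(sample_id, new_bin):
    old = C.get(sample_id)
    if old == new_bin:
        return                           # code unchanged: nothing to do
    if old is not None:
        L[old].discard(sample_id)        # drop the stale entry (line 6)
    L[new_bin].add(sample_id)
    C[sample_id] = new_bin

update(7, 2)
update(7, 5)                             # sample 7 moves from bin 2 to bin 5
```

With plain integer identifiers this layout matches the linear memory footprint discussed above.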
5.4. Empirical evidence
set k = 2 for all the experiments, as we are focusing on showing the importance of good negative sampling, and we want to avoid the results being influenced by hard positive sampling. We randomly sample l classes that belong to the same bin as the first, random sample (lines 4-9 in Algorithm 3). If the bin has only one element, we sample the rest of the images needed for the batch randomly (line 10 in Algorithm 3). In case the number of classes in the bin is greater than one but lower than l, we append the missing classes from another, randomly chosen bin; the process is repeated until l classes have been sampled. Once we have a set of l classes, we randomly choose k images from each class (lines 16 and 17 in Algorithm 3).
Training details
We use Inception-V3 as a backbone for our model. In particular, we take the convo-
lutional layers and initialize them with weights from a standard network pre-trained
on ImageNet. The final descriptors are further globally max-pooled and `2 normal-
ized. The descriptors size is 2, 048. The model is trained using ADAM optimizer, with
the initial learning rate 10−4 , and with learning rate decay 0.9 each 50k iterations.
The images for person re-ID are resized to 192×384 pixels. At test time, we extract
representations and compare them using the dot product.
51
Chapter 5. Hard negative mining
1.75
1.25
0.75
0.5
Min distance
0.25
Average distance
0
0 0.25 0.5 0.75 1 1.25 1.5 1.75 2
distance in the full embedding
Figure 5.2 – Negative distances calculated in the whole dataset (x-axis) vs negative
distances calculated inside of bins (y-axis) for 100 anchors.
loss triplets? (3) How does BoN perform changing the subspace dimension s? (4)
How much overhead it adds to the training? and (5) How stable is the hash table
during training? We provide all the analysis on the person re-identification dataset,
as it is challenging, as well as appropriate for testing of all the algorithms mentioned
in the section 5.2.
52
5.4. Empirical evidence
mislabeled samples [116]. This is particularly true and the authors of [96] introduce
stochasticity in order to avoid this problem.
The analysis of the red dots is also of interest: as expected, on average, sampling
from the reference bin provides samples that are closer to the anchor sample w.r.t.
random sampling from the whole dataset.
It is worth mentioning that the distance between the minimal and average
distances in the full embedding space is 0.7. This means that the data distribution is
sparse and that the random sampling can lead to choosing negative samples which
are far from being hard. On the other hand, the distribution of the distances inside
of a bin has smaller standard deviation, as the distance between the minimal and
average distances is 0.2. This allows us to use random sampling inside of the bins,
without decreasing the quality of the chosen samples.
53
Chapter 5. Hard negative mining
Random
70
BoN
20
10
0
30 40 50 60 70 80 90 100
mAP on train set
behaviour not only speeds-up the training, but also provides better triplets, which
leads to significant improvement of the performance on validation sets.
We measure the limitations of BoN by comparing it to the “gold standard",
Spectral Hashing. The combination of Spectral Hashing and batch hard requires
the following steps: (1) feature extraction on the whole training set, (2) reduction
of the feature size by PCA to the size s (s = 18) and (3) hash table construction; we
repeat this procedure every 5k steps. Given this hash table, batches are created the
same way as explained in 5.3.1. BoN-batch hard shows very similar behavior to the
Spectral Hashing - batch hard (magenta line): they both train quickly, obtaining
almost the same mAP after 10k steps, with high percentage of non-zero loss triplets.
During the whole training Spectral Hashing - batch hard is providing more non-zero
loss triplets. This is expected, as the hash table is updated at the same moment for
all the samples. However, this configuration does not scale for datasets with large
number of images.
We analyze the behavior of batch creation proposed in [108], using 10 clusters
as suggested by the authors. We use these clusters for creating the hash table and
we do not update it during the training. In addition to longer training time, this
method lacks flexibility in updating the hash table. In other words, samples that
are considered relevant negatives to an identity are set at the beginning of the
training and are static w.r.t. the training process. Moreover, a possible sub-optimal
clustering is going to be seriously detrimental to the training. In the beginning
of the training, this method obtained lower mAP on the train set (gray line) while
having more non-zero loss triplets than batch hard. The number of relevant triplets
54
5.4. Empirical evidence
in the end of the training decreases, and both accuracy and the percentage of the
non-zero loss triplets are inferior to BoN-batch hard.
Even though Semantic-Preserving Loss (SPL) [18] has not been designed as a
hashing method for hard negative mining, we consider it relevant to our work, and
thus we adapted it to this purpose. We use SPL loss (Equation 5.3) as a replacement
of reconstruction loss in BoN. In this case, the encoder is a fully connected layer with
tanh activation function, that maps image descriptors d i into corresponding hash
entries h(x i ). Following the rationale proposed in [18], the similarity matrix S is a
non-linear function of the dot product between images’ descriptors within a mini-
batch (5.4): a pair of similar images (with dot product above a certain t hr eshol d ,
set to 0.6) are mapped to 1, otherwise they are mapped to −1. The minimization of
equation (5.3) should encourage the mapping of similar images to the same hash
entry, thus providing useful negatives samples as efficiently as BoN.
m X m µ1 ¶2
1 X
L SP = h(x i )h(x j ) − S ij (5.3)
m 2 i =1 j =1 s
(
1, d i · d j > t hr eshol d
Si j = (5.4)
−1, ot her wi se
BoN-BH shows superior results: it trains faster with more non-zero loss triplets
(see table 5.3). We believe that the advantage of BoN over SPL resides in the fact that
the BoN AE loss does not depend on the relationship between mini-batch samples,
thus provide a more stable hash table. Also, the quality of SPL mapping depends on
the quality of the mini-batch sampling, which in turn depends on the hash table
itself; such dependence can introduce a non-negligible instability in the training.
Finally, in this context, SPL can be improved by adding an extra dedicated network
that provides the mapping between the input image and the hash entry, instead of
using the descriptor as an approximation of the input image; such a strategy could
importantly increase the computational cost of the approach, and it is currently out
of the scope of the chapter.
55
Chapter 5. Hard negative mining
60
40
mAP[%]
20
Market-1501
DukeMTMC-reID
0
0 5 10 15 20
s
nonetheless, with s = 22, BoN reaches its breaking point and the average number of
samples per bins (for non empty bins) is very low, such that BoN-Random starts to
perform negative sampling in the whole dataset too frequently.
Training time
Table 5.2 presents the time needed for training a model for 100k steps, and total
time needed for convergence for batch hard and BoN - batch hard methods, as BoN
provides the best results when combined with batch hard. Both experiments are
conducted under the same conditions; we trained the models on a TITAN X GPU
with non-augmented images of size 384x192 pixels, using inception_v3 as backbone
architecture, initialized with the weights obtained from ImageNet pretraining. The
relative overhead that BoN introduces is 3%. However, the model trained with BoN
needs fewer steps to train, which means that total train time is reduced 3.4 times.
In other words, BoN saves 24.26 hours when trained with batch hard loss, while
significantly improving the performance of batch-hard, as it will be shown in section
5.5.
We additionally measured the time needed for one full forward pass of all the
images in the train partition of the person re-identification dataset, which is in-
dependent on the sampling strategy, or loss function. The time to extract all the
features is 11.5 minutes, which is equal to 1527 training steps of BoN-BH or 1572
steps of batch hard. All methods that require the computation of features in each
epoch ([25, 36, 96, 108] and Spectral Hashing) introduce an overhead of at least 42%
at train time. BoN has equal or better performance than [25, 36, 96, 108] (see table
5.4) while adding one order of magnitude less overhead.
56
5.4. Empirical evidence
Table 5.2 – Time required for training for 100k steps and until convergence.
57
Chapter 5. Hard negative mining
20
Figure 5.5 – The percentage of samples that were added to the hash table or moved
from one bin to another. HD stands for Hamming distance between the old and
new hash entry.
there are a couple of reasons for that. First, the decision boundary that separates
bins is updated during training, which means that the samples that are close to
the boundary can easily move from one bin to another. Second, the embedding
changes through time, as does its compressed approximation, so an image that was
assigned to one bin can move and be closer to some other samples in a different
step of training. As mentioned above, the fact that images move in neighboring bins
is not a problem; it is actually beneficial to avoid sampling negative samples that
are either noisy or overly-difficult.
58
5.5. Results and comparison
related to negative sampling and triplets construction. For such reasons, we use
the same mini-batch size for all the methods, the same pre-trained back-bone, the
same margin α and the same embedding size (see subsection 5.4.1 for the details).
[36] and [96] are not included in this comparison, since they require an extra loss
which can corrupt the analysis; a performance comparison with these approaches
is provided in table 5.4.
Table 5.3 – mAP validation results at peak performance for every method.
Table 5.3 shows the results of the comparison on the person re-identification da-
taset. As it can be noticed, BoN-random clearly outperforms pure random sampling
in fewer steps and provides validation mAPs comparable to semi hard and batch
hard. Even though BoN improves results of batch hard sampling when combined
with the contrastive loss, it performs significantly worse than the original batch
hard (combined with triplet loss), so we performed all the experiments using the
batch hard - triplet loss setting. Spectral Hashing - batch hard outperforms BoN-
batch hard, which is expected, considering that BoN is an online approximation of
Spectral Hashing. The numbers show that the margin between BoN and Spectral
Hashing is only 1.5% on average on the two evaluation datasets. However, Spectral
Hashing can be used only if the train set is reasonably small; thus its application on
bigger datasets would be unfeasible.
59
Chapter 5. Hard negative mining
One can argue that the performance of BoN can be easily reached by just in-
creasing the mini batch size of the batch hard method. The experiment batch hard
(2x batch) in table 5.3 shows a training in which the mini-batch size has been dou-
bled. As expected, in this case, the method trains faster and has better performance,
but still does not outperform BoN-batch hard. This experiment shows that BoN
is a key component to the accelerated training and improved validation results of
BoN-batch hard.
We implemented two methods for batch selection known in the literature, Hi-
erarchical Tree (HT) [25] and 100k IDs [108], and combined them with batch hard.
We followed the procedure described in [25] and computed the distance matrix
between all the IDs every 5k steps. We formed a batch by randomly selecting one ID,
and taking the remaining l − 1 as its closest neighbors. We trained a classifier on the
whole train set for 10k steps and used this model to create the hash table with 10
bins. Additionally, we adapted one state-of-the-art hashing method [18] on image
retrieval task for hard negative mining (see section 5.4.2 for details). The results
of all three methods confirm our hypothesis that batch sampling is important for
improving and speeding up the training. However, none of them outperforms BoN
neither in speed nor accuracy.
Even though BoN is specially designed to improve training of Siamese networks
on large datasets, we tested the influence of BoN on two small datasets, CUB-200
[114] and Market-1501 [130]. BoN improves mAP on Market-1501 from 58.4 to 60.0,
and from 36.1 to 37.9 on CUB-200. The improvement in these cases is smaller than
in the experiments conducted on bigger datasets for two reasons: 1) Batch hard is
usually enough, since the probability that hard samples exist in the mini-batch is
higher than in case of large datasets; 2) Choosing optimal s becomes challenging:
small s does not contain enough information for reconstruction, while bigger s
leads to degenerate solution.
Table 5.4 shows the comparison of BoN-batch hard with state-of-the-art ap-
proaches on Stanford Online Products and DeepFashion In Shop datasets. We
trained BoN-batch hard using the same training parameters as explained in sec-
tion 5.4.1, with a few changes: inception_v1 was used as backbone architecture
(as in [25, 50, 73, 96]) with an extra fully connected layer with frozen weights after
the max pooling that reduces the embedding size to 256. We used images of size
336 × 336 pixels (as in [96]) with data augmentation techniques such as random
horizontal flipping, blurring, zooming in and out and cutout. As the images in these
datasets are more heterogeneous, the state-of-the-art methods usually do not use
task specific architectures.
We show that BoN-batch hard provides better or comparable results than both
[50], which uses attention ensembles, and stochastic class-based [96], which in
addition to having higher complexity enhances its performance by using second
60
5.6. Conclusion and Future Works
Table 5.4 – validation results at peak performance for every method and dataset. *
stands for the best number found in literature that uses additional attention
ensembles. F means that the method uses bilinear pooling.
order pooling [24], which introduces even more computational cost with respect to
the baseline model. Additionally, BoN-batch hard performs better than DAML [21],
which uses synthetic negative samples for training.
Our method achieves state-of-the-art results on Stanford Online Products da-
taset, while being comparable to the previously published methods evaluated on
inShop dataset. The nature of Stanford Online Products dataset is more aligned with
the problem that we are trying to solve: it has more training images than inShop
(60k vs. 25.8k) as well as more classes (11.3k vs. 4k). We used the same s = 10 in
both cases, so the hash table of the Stanford Online Products was more densely
populated. Better performance would probably be obtained by training a model on
inShop dataset with the smaller embedding size and smaller s.
61
Chapter 5. Hard negative mining
5.7 Appendix
So far we explained how to use BoN with one meta parameter s that is set at the
beginning of the training. In this section we propose an automatic strategy to dy-
namically adjust the parameter s during training. Our idea is to create a system
based on the exploration/exploitation paradigm which selects the optimal s in every
training iteration. We start training the system by setting s = 1, and we explore its
neighborhood. When another value of s starts providing more relevant training
samples, we use it and explore its neighborhood, and we repeat it in every iteration.
This way of selecting s leads to having a curriculum learning strategy which tends
to maximize the difficulty of the sampled pairs or triplets during training. Our
method does not rely on hard-coded scheduling, and does not require any prede-
fined parameters that depend on the dataset nor backbone architecture. We show
preliminary results of the method based on the results on person re-identification
dataset.
• an estimated loss.
62
5.7. Appendix
The architecture of the autoencoder and the hash table are the same as de-
scribed in earlier in this chapter. We initialize the expected loss (el ) of all BoN
modules to 0.
When we train a model with static s, we use one single BoN module. However, if
the parameter s is not predefined, we aggregate several BoN modules, as shown in
Figure 5.6. The size of every BoN module is equivalent to its ordinal number (the
first BoN module is of size 1, the second 2, etc). The total number of BoN modules
S, depends on the number of images in the dataset and is defined as:
where N is the number of images in the dataset. We train all autoencoders simulta-
neously, and update all hash tables in every training iteration.
In the beginning of every training step we sample images for a mini-batch from
the i t h BoN module, and we update its expected loss in the end of the iteration
following the formula:
After updating the i t h expected loss, we recompute the ordinal number of the
63
Chapter 5. Hard negative mining
The BoN module that provides samples for the next training iteration is either
h − 1, h or h + 1, with probability 0.25, 0.5, 0.25 respectively. If h = 1 we sample
from BoN modules h or h + 1 with probability 0.75, 0.25, and if h = s we sample the
images from h − 1 or h with probability 0.25, 0.75.
Dynamic s estimation introduces minimal time per iteration and memory over-
head. However, due to slow switch between two active BoN modules, full training
can be slower than the training with a fixed s.
Results
16
15
14
13
12
11
estimated s
10
9
8
7
6
5
4
3 s
2
1 smooth s
0
0 2 4 6 8
training iteration ·104
64
5.7. Appendix
Table 5.5 – mAP validation results at peak performance for every method.
Table 5.5 shows that training a model with BoN with dynamic s estimation can
improve the results in terms of both accuracy and number of steps for conver-
gence. However, dynamic s introduces additional memory overhead, which is still
negligible with respect to the size of the backbone architecture.
65
6 Explicit maximization of area under the
ROC curve
6.1 Introduction
The main objective of metric learning systems is to embed high dimensional data
(such as images, videos, or audio signals) into a lower dimensional space, while
ensuring that the data that comes from the same class or identity is embedded
within a cluster, which is separated from the clusters of data that belong to other
classes. On images, these systems were traditionally designed to extract patch
descriptors (such as SIFT or SURF), combined with the bag-of-words approach in
order to get a small size embedding which is representative of the input data.
These traditional approaches were replaced by newer deep learning methods
that compute the embedding by processing input data by deep neural networks,
and use this embedding for comparison of images. The neural networks are trained
by minimizing a loss function that models the desired structure of the embedding
space. Even though the most widely used losses for metric learning, such as con-
strastive loss [15], triplet loss [88], quadruplet loss [13], classification loss [125], etc,
train models to provide locally optimal solutions, they do not guarantee that the
embedding will be a good representation of data distribution on the full test set.
One notable exception is the loss presented in [81], in which the authors propose
direct maximization of the mean Average Precision (mAP) for solving the retrieval
task, which perfectly mimics the final metric learning goal. However, this loss
requires obtaining the vector representations of all training images by a full forward
pass through a deep convolutional neural network several times, in order to provide
a single gradient. The authors demonstrate the performance of the loss on training
data up to 43k images, which is significantly less than the size of commonly used
datasets nowadays.
The area under the ROC curve is a well known way of evaluating recognition
systems [9, 32, 34]. As the amount of available data, as well as computational power,
in the past were limited, the ROC curve based on the available samples was not
smooth, and the area below such an empirical curve was not accurate. Therefore,
in [34] the authors proposed two ways to approximate the real area under the ROC
67
Chapter 6. Explicit maximization of area under the ROC curve
68
6.2. AUC loss
Z t max d F (t )
A= T (t ) dt. (6.1)
t =t mi n dt
where H (·) is the Heaviside function. For the set of all negative pairs N = {(a 1 , n 1 ),
69
Chapter 6. Explicit maximization of area under the ROC curve
1
T (s)
s = t mi n
T (s + ∆s)
s = t max
0
0 F (s + ∆s) F (s) 1
Figure 6.1 – The ROC curve (red line) and its approximation based on a set of
thresholds s (blue line). The area under the approximated curve is calculated using
the Trapezoidal rule.
PNP PN
N
Z t max H ( f (a i , p i ) − t ) d H ( f (a j , n j ) − t )
A= i j dt. (6.4)
t =t mi n NP dt NN
However, this formula cannot be used for gradient based optimization because:
(1) the integral cannot be directly computed and (2) the Heaviside function has
zero gradient almost everywhere. Therefore, we propose two relaxations to obtain a
differentiable function that approximates the area under the ROC curve.
70
6.2. AUC loss
t max
X−∆s T (s + ∆s) + T (s)
A∗ = (F (s) − F (s + ∆s)) . (6.5)
s=t mi n 2
where s spans the interval [t mi n , t max ] in S discrete steps of size ∆s = (t max −t mi n )/S.
This approximation corresponds to the area below the piece-wise linear blue curve
from Fig. 6.1. The number of steps is a relevant parameter since more steps provide
a better approximation of the integral. Taking into account that T (s) and F (s)
depend only on the parameter s, they can be calculated in parallel for a set of
values s ∈ {t mi n , t mi n + ∆s, ..., t max − ∆s}, allowing for an efficient implementation
on GPUs.
Heaviside to sigmoid
The second step involves using a derivable approximation of the Heaviside function;
we use the following sigmoidal-like function:
1
σ(x, t ) = . (6.6)
1 + e −r (x−t )
This choice has three main rationales: (1) for large values of r this function becomes
a good approximation of the Heaviside function discontinuity, (2) it provides very
small gradients far from the discontinuity and, (3) it is symmetric around t thus pro-
ducing an approximation error with zero mean. The family of sigmoidal functions
for r = 12.02 and ∆s = 0.2 is shown in Fig. 6.2.
These characteristics allow having relevant gradients in the area close to the
discontinuity, and at the same time, keeping the properties of the Heaviside function
and almost completely ignoring sample pairs that have similarity very different
from the considered threshold t . The tuning of the parameter r is strictly related to
the step size ∆s and will be addressed in section 6.2.3.
Differently from [9], we choose to approximate multiple Heaviside functions
(one for each threshold) with the same number of sigmoidal functions. In this way
our loss provides abundant gradients for all relevant positive and negative pairs.
Using the approximation from Equation 6.6, we can re-write equations 6.2 and 6.3
71
Chapter 6. Explicit maximization of area under the ROC curve
σ(x, s)
0.5
0
−1.5 −1 −0.5 0 0.5 1 1.5
x
as follows:
PNP
∗ i
σ( f (a i , p i ), t )
T (t ) = , (6.7)
NP
PN N
j
σ( f (a j , n j ), t )
∗
F (t ) = . (6.8)
NN
Finally, we can substitute T (·) and F (·) from Equation 6.5 with their respective
approximations T ∗ (·) and F ∗ (·) (Equations 6.7 and 6.8), and for sake of simplicity
use shorter notation f p i instead of f (a i , p i ), and f ni instead of f (a i , n i ), we obtain
the following differentiable AUC formula:
t max
X−∆s NP ¡
1 X
A ∗∗ = σ( f p i , s) + σ( f p i , s + ∆s)
¢
s=t mi n 2NP i =1
(6.9)
1 N N ¡
σ( f ni , s) − σ( f ni , s + ∆s) .
X ¢
N N i =1
72
6.2. AUC loss
t max NP ¡
X−∆s X
1
L AUC B A = 1 − σ( f p i , s) + σ( f p i , s + ∆s)
¢
2NP N N s=t mi n i =1
(6.10)
N N ¡
σ( f ni , s) − σ( f ni , s + ∆s) .
X ¢
i =1
The batch hard strategy calculates the loss based only on the similarities of
the hardest positive and negative samples for each sample from the mini batch. If
N = kl is the mini-batch size, we can write the batch hard AUC loss as following:
t max
X−∆s X
N ¡
1
L AUC B H = 1 − σ( f p i , s) + σ( f p i , s + ∆s)
¢
2N 2 s=t mi n i =1
(6.11)
N ¡
σ( f ni , s) − σ( f ni , s + ∆s) .
X ¢
i =1
Even though the batch all strategy takes into account all pairs from the mini-
batch, it leads to a much weaker underestimate of the AUC w.r.t the the batch hard
strategy. Additionally, the best scenario of training a model with AUC B A would
require batch creation where the number of all positive pairs would be the same
as the number of all negative pairs, which is impossible. AUC B H maximizes an
underestimation of the area under the full ROC curve on a mini-batch level. We
show the experimental comparison of the two strategies in section 6.3.2.
The AUC B H loss defined in Formula 6.11 can be seen as a pairwise loss, as it
is calculated based on similarities of image pairs. However, what makes AUC B H
different from the other pairwise losses is that it does not directly optimize the rela-
tions between positive and negative pairs, but rather maximizes the approximated
area under the ROC curve based on pair similarities.
73
Chapter 6. Explicit maximization of area under the ROC curve
AUC metaparameters
The AUC loss function, as defined in Equation 6.11 has two metaparameters: 1) step
size ∆s, and 2) slope of the sigmoid function r . The step size is a relevant parameter,
and the smaller the step, the more accurate approximation of the integral.
5
σ(x, s)
4
+∆s
s=tmax
P
3
tmin
dx
d
2
r=6
1 r = 12.02
r = 25
0
−1.5 −1 −0.5 0 0.5 1 1.5
x
The setting of r parameter in Equation 6.6 is of vital importance for the proposed
approach. The value of r should be large enough to ensure a good approximation
of the Heaviside function while providing useful and well-balanced gradients for
a gradient-based optimization strategy. Fig. 6.3 shows the first order derivative of
the sum of sigmoidal functions over x for different values of r on Fig. 6.3. Small r
leads to flat gradient magnitudes around the middle of the range, while significantly
decreasing the magnitude close to the edge of the range (blue line). On the other
hand, having a large r introduces oscillations of the magnitude of the gradients
on the whole input range (orange line). The approximation of the integral over t
with a discrete summation can generate larger gradients for thresholds t that are
close to the points of the grid if the slope of the sigmoidal-like function is too large.
For this reason we would like to find the parameter r for which the square of the
second order derivative of the summation of all sigmoidal-like functions over x is
74
6.2. AUC loss
∆s r
0.01 201.0
0.02 101.0
0.05 42.2
0.1 22.47
0.2 12.02
minimal1 :
Ptmi n +∆s !2
d2 σ(x, s)
Ã
Z t max
s=t max
r = arg min d x. (6.12)
s t mi n d x2
In such a way, we force the magnitudes of the gradients generated for all values
of x to be almost independent of the relative position of x to the grid point s. We
find a non-degenerate local minimum2 of Equation 6.12 numerically for a set of ∆s
parameters (see Table 6.1). This setting, for ∆s = 0.2, is presented by the red line.
75
Chapter 6. Explicit maximization of area under the ROC curve
N of hardest negative similarities (HNS) for each input feature (line 4 in algorithm
4).
We define a vector of thresholds as a step vector of size S + 1: [t mi n , t mi n +
∆s, ..., t max ] (line 5 in algorithm 4), and get the optimal slope for the given ∆s from
table 6.1 (line 6 in algorithm 4). For each threshold from the step vector and for
each hardest positive similarity, we get a value of sigmoid defined in formula 6.6,
and store it in a matrix σ+ of size N × (S + 1) (line 7 in algorithm 4). Similarly, we
obtain the σ− matrix based on hardest negative similarities (line 8 in algorithm 4).
We obtain vectors s 1 and s 2 of length S from σ+ and σ− matrices (lines 9-13 in
algorithm 4). Although we present the algorithm with a for loop over the samples,
this procedure is implemented with matrices and in a parallel way exploiting the
GPU parallelization capabilities. We get the estimated area under the ROC curve R
based on the samples from mini-batch, as shown in line 14 of algorithm 4. Finally,
the AUC loss is calculated as 1 − R (line 15 in algorithm 4).
76
6.3. Empirical evidence
Training details
In all the experiments we use ResNet50 as a backbone architecture, and we initialize
it with the weights obtained on ImageNet classification pre-training. We take the
output of the last convolutional layer and apply global max pooling to obtain a
feature vector for each input image. We reduce the size of the feature vector to 512
by an orthogonally initialized fully connected layer. Finally, we l 2 normalize the
vector. This normlization projects all vectors to a hypersphere which allows using
the dot product for calculating vector to vector similarities.
We train our models on large scale datasets (all except CUB-200) by using the
ADAM optimizer with initial learning rate 10−4 , with a decay of 0.9 every 10, 000
steps. When training a model on a small dataset, such as CUB-200, using the ADAM
optimizer is not appropriate, as it could lead to overfitting. Therefore, we use the
SGD optimizer with initial learning rate 10−3 which is decayed by 0.1 each 3, 000
steps.
We create each mini-batch out of 128 images, 2 images for each of 64 classes/i-
dentities, if not stated differently. All images in a mini-batch are resized to either
224 × 224 or 256 × 256. In all the experiments we augment one of the two images
per class in a mini-batch. We use horizontal flipping, cutout, zoom-in/out, color
shift and motion blur as augmentation techniques.
77
Chapter 6. Explicit maximization of area under the ROC curve
78
6.3. Empirical evidence
Table 6.3 – Comparison of batch all and batch hard strategies on Stanford Online
Products [73] dataset.
79
Chapter 6. Explicit maximization of area under the ROC curve
1 XN X N
LW i l coxon = 1 − 2
σ( f p i − f n j ). (6.13)
N i =1 j =1
We compare the Wilcoxon loss with AUC under the same experimental settings
on four metric learning datasets, and present results in Tables 6.4-6.7. AUC provides
significantly better results on all datasets (8 points on R@1 on Stanford Online
Products, 14 points on CUB-200, 17% on CUB-200 crops, 9% on In-shop, and 19%
on VERI-Wild small evaluation subset).
Table 6.4 – Comparison of the AUC and the triplet batch hard loss functions on the
Stanford Online Products [73] dataset.
Table 6.5 – Comparison of the AUC and the triplet batch hard loss functions on the
CUB-200-2011 [114] dataset.
We believe that the main advantage of AUC with respect to Wilcoxon statistics
is that AUC relies on a family of sigmoidal functions while Wilcoxon statistics ap-
proximates the area under the ROC curve based on the results of a single sigmoidal
function.
80
6.3. Empirical evidence
Table 6.6 – Comparison of the AUC and the triplet batch hard loss functions on the
In-shop Clothes [63] dataset.
Table 6.7 – Comparison of the AUC and the triplet batch hard loss functions on the
VERI-Wild [64] dataset.
81
Chapter 6. Explicit maximization of area under the ROC curve
the art when trained with original images. Our method outperforms all ensemble-based methods and shows results comparable with the newest state-of-the-art methods. The only method that performs significantly better is RankMI [48]. Even though RankMI outperforms AUC, it is computationally more expensive: the model is built out of two networks that are updated alternately, it has two extra hyperparameters, and the authors do not report the input image size. The R-Margin model achieves a 6.7% higher rank@1 on the CUB-200 dataset, while using a bigger mini-batch, distance-based tuple mining, and ρ regularization. Additionally, this model has an extra hyperparameter β, and its results vary significantly with different initialization values. We believe that the AUC loss leads to overfitting when trained on small datasets, due to its strong gradients. We improved the performance of AUC by using image crops instead of whole images (see Table 6.10).
Another dataset appropriate for image retrieval is DeepFashion In-Shop. We
trained models with images resized to 224x224 and 256x256, and they achieved
Table 6.10 – Comparison with the state of the art on the CUB-200-2011 [114]
cropped dataset. Embedding dimension is presented as a superscript and the
backbone architecture as a subscript. R stands for ResNet, G for GoogLeNet.
comparable results. We believe that the bigger image size does not provide relevant benefits on this dataset because there is not much room for improvement even when trained with small images. The accuracy of both our model and the state of the art on this dataset is higher than on the previously discussed Stanford Online Products and CUB-200 datasets, due to the lower complexity of the retrieval task. In Table 6.11
we show that we achieve state-of-the-art results on this dataset. The only method that achieves results comparable to AUC is FastAP, while using a mini-batch twice as big as ours.
Finally, we tested our method on the VERI-Wild dataset, which is used for vehicle re-identification. The majority of state-of-the-art models combine several loss functions to achieve better results (e.g., [1, 49, 64, 67, 97, 98, 124]). Additionally, several methods evaluated on this dataset use domain-specific information during training, such as the position of the mirrors and wheels, the color of the vehicle, or side and front views. These architectures are appropriate for vehicle re-identification, but they cannot be used for any other retrieval problem. Even though our method is simpler and not domain-specific, its performance is comparable with domain-specific state-of-the-art approaches, as shown in Tables 6.12-6.14. Taking into account that all images in this dataset have the same characteristics (they are all vehicles) and that the number of classes is much greater than the mini-batch size, we combine AUC with BoN, a state-of-the-art method for hard negative sampling [22, 23]. This combination significantly improves the performance of AUC alone. Finally, we train a model with AUC-BoN using bigger input images, which further improves the results.
The model that achieves state-of-the-art results is SAFR [98], and it uses significantly bigger input images than ours (350x350 pixels) and an embedding size of 2048, four times bigger than the embedding we use. Implementing such an approach exceeds our hardware limitations. Additionally, this model uses three loss functions (smoothed softmax, triplet, and center loss) as well as an unsupervised attention network. SAVER [49] is another method that performs slightly better than AUC-BoN. However, this model uses a more complicated network architecture that combines a variational autoencoder with a ResNet50 backbone, and a combination of cross-entropy and triplet losses.
The AUC loss outperforms state-of-the-art methods on large-scale retrieval datasets, and is comparable with more complex models used for vehicle re-identification.
6.4. Conclusion and Future Works
7 Closing remark
7.1 Conclusions
In this thesis we have addressed the problem of image retrieval through three stages. We started the research with an extensive analysis of the state-of-the-art methods available at that time, and we combined the best practices from the literature in order to train a model that outperforms the state of the art. We noticed that the most commonly used loss for retrieval was the triplet loss, and that its main disadvantage is that finding input samples that generate useful gradients is computationally expensive. Therefore, in our second contribution we addressed the problem of hard negative sampling by proposing an online sampling strategy called BoN. Finally, having in mind that a curve dominates in the ROC space if and only if it dominates in the precision-recall space, we designed a new loss that maximizes a strong underestimate of the area under the ROC curve, which is appropriate for both retrieval and recognition. Here, we take the opportunity to summarize the findings of this work.
In the first part of this thesis we proposed a set of good practices for training re-identification models. First, we showed that deeper backbone architectures provide better results. We compared the performance of models trained with three different input image sizes, and found that images smaller than 416x416 pixels deteriorate the final results. We analyzed two pooling strategies applied before the last fully connected layer, and found that max pooling outperforms average pooling. We discussed the idea of curriculum learning for re-identification through increasing the task difficulty as the training evolves. Our curriculum learning approach consisted of three strategies: 1) pre-training for classification, 2) increasing image difficulty through the amount of augmentation, and 3) hard negative mining. We showed that each curriculum learning strategy has a positive impact on the final result and that the best performance is achieved when all three strategies are combined. We tested our approach on four publicly available datasets, and compared the results with more complex, domain-specific approaches. Our method outperformed the state of the art by a large margin on all datasets.
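The two pooling strategies compared above differ only in how the final (C, H, W) activation map is collapsed into the C-dimensional descriptor fed to the last fully connected layer. A minimal sketch (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def global_pool(feature_map, mode="max"):
    """Collapse a (C, H, W) activation map to a C-dimensional descriptor.
    'max' keeps the strongest activation per channel; 'avg' averages
    over all spatial locations."""
    if mode == "max":
        return feature_map.max(axis=(1, 2))
    return feature_map.mean(axis=(1, 2))
```

Intuitively, max pooling keeps only the most discriminative response per channel, which is why it can outperform averaging when the object of interest covers a small part of the image.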
7.2. Future Work
the optimization of the AUC at the mini-batch level using all positive and negative pairs (batch all strategy) with the strategy where only the hardest positive and negative pairs are used (batch hard strategy), and we showed that the AUC batch hard loss provides significantly better results. Finally, we compared the results of AUC with the benchmark triplet loss and with the Wilcoxon loss, which also optimizes the area under the ROC curve. We showed that the AUC loss is superior on four publicly available datasets. The AUC loss, combined with ResNet50 as a backbone architecture, achieves state-of-the-art results on three publicly available datasets that are most commonly used for metric learning. Additionally, the AUC loss achieves comparable performance to the more complex, domain-specific, state-of-the-art methods for vehicle re-identification.
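The batch hard strategy mentioned above keeps, for every anchor in the mini-batch, only its hardest positive (largest same-class distance) and hardest negative (smallest other-class distance). A sketch over a precomputed distance matrix (assuming Euclidean distances and integer class labels; not the thesis implementation):

```python
import numpy as np

def batch_hard_pairs(dist, labels):
    """Per-anchor hardest positive and hardest negative selection.
    dist: (n, n) pairwise distance matrix; labels: (n,) class labels."""
    n = len(labels)
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    neg_mask = labels[:, None] != labels[None, :]
    hard_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)  # furthest same-class
    hard_neg = np.where(neg_mask, dist, np.inf).min(axis=1)   # closest other-class
    return hard_pos, hard_neg
```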
bigger, yet very close to being saturated (In-shop). What would be the next step in terms of data? Do we need more images per class or more classes? Or maybe both? Is there a possibility of using synthetic data to improve the current datasets?
Finally, we want to discuss the way neural networks are trained for retrieval. As stated before, artificial neural networks were inspired by neuroscience, and they are designed to mimic the information processing that happens in the brain. So far, we have been training these networks by showing them large amounts of images and their labels and expecting them to learn what is similar. This way of learning is analogous to learning by repetition in psychology, which is considered to be the least efficient way of learning. Would it be useful to train models with some version of reinforcement learning for retrieval in order to improve training efficiency?
These, among many others, are the current open problems in image retrieval. We hope that this summary and discussion will motivate researchers to address them in their future work.
7.3. Publications
7.3 Publications
• Bojana Gajic, Ariel Amato, Ramon Baldrich, Carlo Gatta. Maximization of the Area Under the ROC Curve for Metric Learning. Under review for the International Conference on Computer Vision 2021.
• Bojana Gajic, Ariel Amato, Carlo Gatta. Fast hard negative mining for deep
metric learning. In Pattern Recognition 112, 2020.
• Bojana Gajic, Ariel Amato, Ramon Baldrich, Carlo Gatta. Bag of Negatives for
Siamese Architectures. British Machine Vision Conference. Cardiff, 2019.
• Jon Almazan, Bojana Gajic, Naila Murray, Diane Larlus. Re-ID done right: towards good practices for person re-identification. arXiv preprint, 2018.
7.4 Patents
• Ariel Amato, Angel Domingo Sappa, Carlo Gatta, Bojana Gajic, Brent Boekestein. Object detection based on object relation. US Patent App. 16/584,400, 2021.
Bibliography
[1] Saghir Alfasly, Yongjian Hu, Haoliang Li, Tiancai Liang, Xiaofeng Jin, Beibei
Liu, and Qingli Zhao. Multi-label-based similarity learning for vehicle re-
identification. IEEE Access, 7:162605–162616, 2019.
[2] Anastasia Pentina, Viktoriia Sharmanska, and Christoph H. Lampert. Curriculum learning of multiple tasks. In Proc. CVPR, 2015.
[3] Davide Baltieri, Roberto Vezzani, and Rita Cucchiara. 3dpes: 3d people
dataset for surveillance and forensics. In Proceedings of the 2011 joint ACM
workshop on Human gesture and behavior understanding, pages 59–64, 2011.
[4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust
features. In European conference on computer vision, pages 404–417. Springer,
2006.
[5] Apurva Bedagkar-Gala and Shishir K Shah. A survey of approaches and trends
in person re-identification. Image and Vision Computing, 2014.
[6] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Cur-
riculum learning. In Proc. ICML, 2009.
[7] Keno K Bressem, Lisa C Adams, Christoph Erxleben, Bernd Hamm, Stefan M
Niehues, and Janis L Vahldiek. Comparing different deep learning architec-
tures for classification of chest radiographs. Scientific reports, 10(1):1–16,
2020.
[8] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep metric
learning to rank. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1861–1870, 2019.
[9] Toon Calders and Szymon Jaroszewicz. Efficient auc optimization for classifi-
cation. In European Conference on Principles of Data Mining and Knowledge
Discovery, pages 42–53. Springer, 2007.
[10] Hervé Cardot and David Degras. Online principal component analysis in
high dimension: Which algorithm to choose? International Statistical Review,
86(1):29–50, 2018.
[12] Shuo Chen, Chen Gong, Jian Yang, Xiang Li, Yang Wei, and Jun Li. Adversarial
metric learning. arXiv preprint arXiv:1802.03170, 2018.
[13] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond
triplet loss: a deep quadruplet network for person re-identification. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 403–412, 2017.
[14] Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Person re-identification by
deep learning multi-scale representations. In Proc. ICCV Workshop, 2017.
[15] Sumit Chopra, Raia Hadsell, Yann LeCun, et al. Learning a similarity metric
discriminatively, with application to face verification. In CVPR (1), pages
539–546, 2005.
[16] Ondřej Chum, Jiří Matas, and Josef Kittler. Locally optimized ransac. In Joint
Pattern Recognition Symposium, pages 236–243. Springer, 2003.
[17] Jesse Davis and Mark Goadrich. The relationship between precision-recall
and roc curves. In Proceedings of the 23rd international conference on Machine
learning, pages 233–240, 2006.
[18] Cheng Deng, Erkun Yang, Tongliang Liu, Jie Li, Wei Liu, and Dacheng Tao.
Unsupervised semantic-preserving adversarial hashing for image search.
IEEE Transactions on Image Processing, 28(8):4032–4044, 2019.
[19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet:
A Large-Scale Hierarchical Image Database. In Proc. CVPR, 2009.
[20] Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao. Deep fea-
ture learning with relative distance comparison for person re-identification.
PR, 2015.
[21] Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, and Jie Zhou. Deep ad-
versarial metric learning. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2780–2789, 2018.
[22] Bojana Gajic, Ariel Amato, Ramon Baldrich, and Carlo Gatta. Bag of negatives
for siamese architectures. In Proc. BMVC, 2019.
[23] Bojana Gajic, Ariel Amato, and Carlo Gatta. Fast hard negative mining for
deep metric learning. In PR, 2020.
[24] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear
pooling. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 317–326, 2016.
[25] Weifeng Ge. Deep metric learning with hierarchical triplet loss. In Proceedings
of the European Conference on Computer Vision, pages 269–285, 2018.
[26] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. It-
erative quantization: A procrustean approach to learning binary codes for
large-scale image retrieval. IEEE transactions on pattern analysis and machine
intelligence, 35(12):2916–2929, 2012.
[27] Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. Multi-scale
orderless pooling of deep convolutional activation features. In European
conference on computer vision, pages 392–407. Springer, 2014.
[28] Albert Gordo, Jon Almazán, Jérome Revaud, and Diane Larlus. Deep image
retrieval: Learning global representations for image search. In Proc. ECCV,
2016.
[29] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. End-to-end
learning of deep visual representations for image retrieval. IJCV, 2017.
[30] Albert Gordo and Diane Larlus. Beyond instance-level image retrieval: Lever-
aging captions to learn a global visual representation. In Proc. CVPR, 2017.
[31] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with
an ensemble of localized features. In Proceedings of the European Conference
on Computer Vision, pages 262–275. Springer, 2008.
[32] David Marvin Green, John A Swets, et al. Signal detection theory and psy-
chophysics, volume 1. Wiley New York, 1966.
[33] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by
learning an invariant mapping. In 2006 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages
1735–1742. IEEE, 2006.
[34] James A Hanley and Barbara J McNeil. The meaning and use of the area under
a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.
[35] Christopher G Harris, Mike Stephens, et al. A combined corner and edge
detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988.
[36] Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond,
et al. Smart mining for deep metric learning. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2821–2829, 2017.
[37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
[38] Kun He, Fatih Cakir, Sarah Adel Bargal, and Stan Sclaroff. Hashing as tie-aware
learning to rank. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4023–4032, 2018.
[39] Kun He, Yan Lu, and Stan Sclaroff. Local descriptors optimized for average
precision. In The IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), June 2018.
[40] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet
loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[41] Chen Huang, Chen Change Loy, and Xiaoou Tang. Local similarity-aware deep
feature embedding. In Advances in neural information processing systems,
pages 1262–1270, 2016.
[42] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger.
Densely connected convolutional networks. In Proc. CVPR, 2017.
[43] Esteve Jaulent. El ars generalis ultima de ramón llull: presupuestos metafísi-
cos y éticos. In Anales del Seminario de Historia de la Filosofía, volume 27,
pages 87–113. Universidad Complutense de Madrid, 2010.
[44] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating
local descriptors into a compact image representation. In 2010 IEEE computer
society conference on computer vision and pattern recognition, pages 3304–
3311. IEEE, 2010.
[45] Xin Jin, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Uncertainty-aware multi-
shot knowledge distillation for image-based object re-identification. arXiv
preprint arXiv:2001.05197, 2020.
[47] Srikrishna Karanam, Mengran Gou, Ziyan Wu, Angels Rates-Borras, Octavia
Camps, and Richard J. Radke. A systematic evaluation and benchmark for
person re-identification: Features, metrics, and datasets. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 41(3):523–536, 2019.
[48] Mete Kemertas, Leila Pishdad, Konstantinos G Derpanis, and Afsaneh Fazly.
Rankmi: A mutual information maximizing ranking loss. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
14362–14371, 2020.
[49] Pirazh Khorramshahi, Neehar Peri, Jun-cheng Chen, and Rama Chel-
lappa. The devil is in the details: Self-supervised attention for vehicle re-
identification. arXiv preprint arXiv:2004.06271, 2020.
[50] Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon.
Attention-based ensemble for deep metric learning. In Proceedings of the
European Conference on Computer Vision, pages 736–751, 2018.
[51] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.
[52] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifica-
tion with deep convolutional neural networks. Advances in neural informa-
tion processing systems, 25:1097–1105, 2012.
[53] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning
for latent variable models. In Advances in neural information processing
systems, volume 1, page 2, 2010.
[54] Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. Learning deep
context-aware features over body and latent parts for person re-identification.
In Proc. CVPR, 2017.
[55] Wei Li and Xiaogang Wang. Locally aligned feature transforms across views.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 3594–3601, 2013.
[56] Wei Li, Rui Zhao, and Xiaogang Wang. Human reidentification with trans-
ferred metric learning. In Proceedings of the Asian Conference on Computer
Vision, pages 31–44. Springer, 2012.
[57] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing
neural network for person re-identification. In Proc. CVPR, 2014.
[58] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for
person re-identification. In Proc. CVPR, 2018.
[59] Wentong Liao, Michael Ying Yang, Ni Zhan, and Bodo Rosenhahn. Triplet-
based deep similarity learning for person re-identification. In MSF Workshop,
2017.
[60] Weiyao Lin, Yang Shen, Junchi Yan, Mingliang Xu, Jianxin Wu, Jingdong Wang,
and Ke Lu. Learning correspondence structures for person re-identification.
IEEE Transactions on Image Processing, 26(5):2438–2453, 2017.
[61] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song.
Sphereface: Deep hypersphere embedding for face recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 212–220, 2017.
[62] Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan,
and Xiaogang Wang. Hydraplus-net: Attentive deep features for pedestrian
analysis. In Proceedings of the IEEE international conference on computer
vision, pages 350–359, 2017.
[63] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion:
Powering robust clothes recognition and retrieval with rich annotations. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 1096–1104, 2016.
[64] Yihang Lou, Yan Bai, Jun Liu, Shiqi Wang, and Lingyu Duan. Veri-wild: A large
dataset and a new method for vehicle re-identification in the wild. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3235–3243, 2019.
[66] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide-
baseline stereo from maximally stable extremal regions. Image and vision
computing, 22(10):761–767, 2004.
[67] Dechao Meng, Liang Li, Xuejing Liu, Yadong Li, Shijie Yang, Zheng-Jun
Zha, Xingyu Gao, Shuhui Wang, and Qingming Huang. Parsing-based view-
aware embedding network for vehicle re-identification. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
7103–7112, 2020.
[68] Krystian Mikolajczyk and Cordelia Schmid. An affine invariant interest point
detector. In European conference on computer vision, pages 128–142. Springer,
2002.
[71] Carlton Wayne Niblack, Ron Barber, Will Equitz, Myron D Flickner, Eduardo H
Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel
Taubin. Qbic project: querying images by content, using color, texture, and
shape. In Storage and retrieval for image and video databases, volume 1908,
pages 173–187. International Society for Optics and Photonics, 1993.
[72] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep
metric learning via facility location. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 5382–5390, 2017.
[73] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric
learning via lifted structured feature embedding. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4004–4012,
2016.
[74] Timo Ojala, Matti Pietikäinen, and David Harwood. A comparative study of
texture measures with classification based on featured distributions. Pattern
recognition, 29(1):51–59, 1996.
[75] Michael Opitz, Georg Waltner, Horst Possegger, and Horst Bischof. Deep
metric learning with bier: Boosting independent embeddings robustly. IEEE
transactions on pattern analysis and machine intelligence, 2018.
[76] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabu-
laries for image categorization. In 2007 IEEE conference on computer vision
and pattern recognition, pages 1–8. IEEE, 2007.
[77] Xuelin Qian, Yanwei Fu, Yu-Gang Jiang, Tao Xiang, and Xiangyang Xue. Multi-
scale deep learning architectures for person re-identification. In Proc. ICCV,
2017.
[78] Filip Radenovic, Giorgos Tolias, and Ondrej Chum. CNN image retrieval
learns from BoW: Unsupervised fine-tuning with hard examples. In Proc.
ECCV, 2016.
[79] Tanzila Rahman, Mrigank Rochan, and Yang Wang. Person re-identification
by localizing discriminative regions. In Proc. BMVC, 2017.
[80] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In Proc. NIPS, 2015.
[81] Jerome Revaud, Jon Almazan, Rafael S. Rezende, and Cesar Roberto de Souza.
Learning with average precision: Training image retrieval with a listwise loss.
In The IEEE International Conference on Computer Vision (ICCV), October
2019.
[82] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi.
Performance measures and a data set for multi-target, multi-camera tracking.
In Proceedings of the European Conference on Computer Vision, pages 17–35.
Springer, 2016.
[84] Michal Rolínek, Vít Musil, Anselm Paulus, Marin Vlastelica, Claudio Michaelis,
and Georg Martius. Optimizing rank-based metrics with blackbox differentia-
tion. arXiv preprint arXiv:1912.03500, 2019.
[85] Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Bjoern Ommer,
and Joseph Paul Cohen. Revisiting training strategies and generalization
performance in deep metric learning. arXiv preprint arXiv:2002.08473, 2020.
[87] M. Saquib Sarfraz, Arne Schumann, Andreas Eberle, and Rainer Stiefelhagen.
A pose-sensitive embedding for person re-identification with expanded cross
neighborhood re-ranking. In Proc. CVPR, 2018.
[88] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified
embedding for face recognition and clustering. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 815–823, 2015.
[91] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carls-
son. Cnn features off-the-shelf: an astounding baseline for recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition
workshops, pages 806–813, 2014.
[92] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[93] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss
objective. In Advances in neural information processing systems, pages 1857–
1865, 2016.
[94] Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Pose-
driven deep convolutional model for person re-identification. In Proc. ICCV,
2017.
[95] Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Deep attributes
driven multi-camera person re-identification. In Proc. ECCV, 2016.
[96] Yumin Suh, Bohyung Han, Wonsik Kim, and Kyoung Mu Lee. Stochastic
class-based hard example mining for deep metric learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages
7251–7259, 2019.
[97] Abhijit Suprem and Calton Pu. Looking glamorous: Vehicle re-id in hetero-
geneous cameras networks with global and local attention. arXiv preprint
arXiv:2002.02256, 2020.
[98] Abhijit Suprem, Calton Pu, and Joao Eduardo Ferreira. Small, accurate,
and fast vehicle re-id on the edge: the safr approach. arXiv preprint
arXiv:2001.08895, 2020.
[99] Michael J Swain and Dana H Ballard. Color indexing. International journal of
computer vision, 7(1):11–32, 1991.
[100] Jonathan Swift. Gulliver’s travels. In Gulliver’s Travels, pages 27–266. Springer,
1995.
[101] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabi-
novich. Going deeper with convolutions. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 1–9, 2015.
[102] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew
Wojna. Rethinking the inception architecture for computer vision. In Proc.
CVPR, 2016.
[103] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 6:26–31, 2012.
[104] Alan M Turing. Computing machinery and intelligence. In Parsing the turing
test, pages 23–65. Springer, 2009.
[105] Evgeniya Ustinova and Victor Lempitsky. Learning deep embeddings with
histogram loss. In Advances in Neural Information Processing Systems, pages
4170–4178, 2016.
[106] Remco C Veltkamp and Mirela Tanase. Content-based image retrieval sys-
tems: A survey. 2000.
[107] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang.
Mancs: A multi-task attentional network with curriculum sampling for person
re-identification. In Proceedings of the European Conference on Computer
Vision, pages 365–381, 2018.
[108] Chong Wang, Xue Zhang, and Xipeng Lan. How to train triplet networks
with 100k identities? In Proceedings of the IEEE International Conference on
Computer Vision Workshops, pages 1907–1915, 2017.
[109] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou,
Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face
recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5265–5274, 2018.
[110] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric
learning with angular loss. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2593–2601, 2017.
[111] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang,
James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity
with deep ranking. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1386–1393, 2014.
[112] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to
bridge domain gap for person re-identification. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.
[113] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances
in neural information processing systems, pages 1753–1760, 2009.
[117] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint
detection and identification feature learning for person search. In Proc. CVPR,
2017.
[118] Yafu Xiao, Jing Li, Bo Du, Jia Wu, Jun Chang, and Wenfan Zhang. Memu:
Metric correlation siamese network and multi-class negative sampling for
visual tracking. Pattern Recognition, 100:107170, 2020.
[119] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Ag-
gregated residual transformations for deep neural networks. In Proc. CVPR,
2017.
[120] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-
aware compositional network for person re-identification. In Proc. CVPR,
2018.
[121] Hong Xuan, Richard Souvenir, and Robert Pless. Deep randomized ensembles
for metric learning. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 723–734, 2018.
[122] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable
are features in deep neural networks? arXiv preprint arXiv:1411.1792, 2014.
[123] Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-aware deeply cascaded
embedding. In Proceedings of the IEEE international conference on computer
vision, pages 814–823, 2017.
[124] Xinyu Zhang, Rufeng Zhang, Jiewei Cao, Dong Gong, Mingyu You, and Chun-
hua Shen. Part-guided attention learning for vehicle re-identification. arXiv
preprint arXiv:1909.06023, 2019.
[125] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training
deep neural networks with noisy labels. In Advances in neural information
processing systems, pages 8778–8788, 2018.
[126] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi,
Xiaogang Wang, and Xiaoou Tang. Spindle net: Person re-identification with
human body region guided feature decomposition and fusion. In Proc. CVPR,
2017.
[127] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned
part-aligned representations for person re-identification. In Proc. ICCV, 2017.
[128] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and
Qi Tian. MARS: A video benchmark for large-scale person re-identification.
In Proc. ECCV, 2016.
[129] Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose invariant embed-
ding for deep person re-identification. arXiv, 2017.
[130] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and
Qi Tian. Scalable person re-identification: A benchmark. In Proceedings
of the IEEE International Conference on Computer Vision, pages 1116–1124,
2015.
[131] Zhedong Zheng, Liang Zheng, and Yi Yang. A discriminatively learned cnn
embedding for person re-identification. TOMM, 2017.
[132] Zhedong Zheng, Liang Zheng, and Yi Yang. Pedestrian alignment network for
large-scale person re-identification. arXiv, 2017.
[133] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated
by gan improve the person re-identification baseline in vitro. In Proceedings
of the IEEE International Conference on Computer Vision, pages 3754–3762,
2017.