Training strategies for efficient deep
image retrieval

A dissertation submitted by Bojana Gajić at Universitat Autònoma de Barcelona
to fulfil the degree of Doctor of Philosophy.

Bellaterra, June 18, 2021
Co-Director:
Dr. Carlo Gatta, Vintra, Inc.

Co-Director:
Dr. Ramon Baldrich, Centre de Visió per Computador

Thesis committee:
Dr. Sergio Velastin, Queen Mary University of London
Dr. Joost van de Weijer, Centre de Visió per Computador
Dr. Jerome Revaud, Naver Labs Europe

International evaluators:
Dr. Pau Rodríguez López, Element AI / ServiceNow Canada Inc.
Dr. German Ros Sanchez, Intelligent Systems Lab, Intel, US

This document was typeset by the author using LaTeX 2ε.


The research described in this book was carried out at the Centre de Visió per Computador,
Universitat Autònoma de Barcelona. Copyright © 2021 by Bojana Gajić. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopy, recording, or any information storage and
retrieval system, without permission in writing from the author.
ISBN: 978-84-945373-1-8
Printed by Ediciones Gráficas Rey, S.L.
To my family . . .
Acknowledgements

I would like to start this dissertation by expressing my sincere gratitude to all of
those who have supported and assisted me over the last years.
First of all, I owe the biggest thanks to my supervisors Dr. Carlo Gatta and Dr.
Ramon Baldrich. If I could write down everything I have to thank you for, the list
might be longer than the thesis itself. Carlo, thank you for your bright and always
original ideas and discussions, for leading me through the various stages of my
development, for your patience when it comes to correcting my writing, and for
all the years of great work, support and understanding. Ramon, thank you for
making me feel welcome from my very first day at CVC and in Barcelona. Your
feedback was always insightful, and it brought my work to a higher level.
I would like to thank Dr. Ariel Amato, CTO of Vintra, for giving me the opportunity
to work in such an innovative company. Ariel, thank you for your support, for
recognizing my work, and for showing me that every problem has a solution.
Your forward-thinking attitude has always been inspiring. I would also like to
acknowledge my colleagues from the machine learning team: Francesco, Thomas,
Sergio and Esteve, it has been a great pleasure to work with you! And many thanks
to Onur, Marc, Riqui and Eva for making my time in the office truly remarkable!
I would also like to thank Dr. Jon Almazan for accepting me and guiding me through
my internship at Xerox Research Center Europe and Naver Labs Europe. Jon, thanks
for showing me what it is like to work in a world-class research team! I appreciate
all the help and support from you, Diane and Naila.
The time I spent at CVC would not have been nearly as memorable without
Ivet, Carles, Felipe, Dena, German, Prassanna, Arash, Gemma and Onur. It was
great to meet you, share ideas and spend time with you! And many thanks to all
the other friends with whom I spent my free time.
Finally, my biggest thanks go to my family, my parents Zoran and Vesna and my
brother Andrija. Thank you for all the courage and understanding that you have
given me.

Abstract

In this thesis we focus on image retrieval and re-identification. Training a deep
architecture using a ranking loss has become standard for retrieval and
re-identification tasks. We analyze and propose answers to three main questions:
1) What are the most relevant strategies of state-of-the-art methods, and how can
they be combined to obtain better performance? 2) Can hard negative sampling
be performed efficiently (O(1)) while providing improved performance over naïve
random sampling? 3) Can recognition and retrieval objectives be achieved by using
a recognition-based loss?

First, in chapter 4 we analyze the importance of several state-of-the-art strategies
related to the training of a deep model, such as image augmentation, backbone
architecture and hard triplet mining. We then combine the best strategies to design
a simple deep architecture plus a training methodology for effective and
high-quality person re-identification. We extensively evaluate each design choice,
leading to a list of good practices for person re-identification. By following these
practices, our approach outperforms the state of the art, including more complex
methods with auxiliary components, by large margins on four benchmark datasets.
We also provide a qualitative analysis of our trained representation which indicates
that, while compact, it is able to capture information from localized and
discriminative regions, in a manner akin to an implicit attention mechanism.

Second, in chapter 5 we address the problem of hard negative sampling when
training a model with a triplet-like loss. In this chapter we present Bag of Negatives
(BoN), a fast hard negative mining method that provides a set, triplet or pair of
potentially relevant training samples. BoN is an efficient method that selects a bag
of hard negatives based on a novel online hashing strategy. We show the superiority
of BoN over state-of-the-art hard negative mining methods in terms of accuracy
and training time on three large datasets.

Finally, in chapter 6 we hypothesize that training a metric learning model by
maximizing the area under the ROC curve (a typical performance measure of
recognition systems) can induce an implicit ranking suitable for retrieval problems.
This hypothesis is supported by the fact that “a curve dominates in ROC space if
and only if it dominates in PR space” [17]. To test this hypothesis, we design an
approximate, differentiable relaxation of the area under the ROC curve. Despite its
simplicity, the AUC loss, combined with ResNet50 as a backbone architecture,
achieves state-of-the-art results on two large-scale, publicly available retrieval
datasets. Additionally, the AUC loss achieves performance comparable to more
complex, domain-specific, state-of-the-art methods for vehicle re-identification.

Keywords: computer vision, machine learning, applied mathematics, metric learning, instance retrieval, re-identification
Contents

Abstract iii

List of figures xv

List of tables xvii

1 Introduction 1

1.1 A brief Introduction to deep learning . . . . . . . . . . . . . . . . . . . 1

1.1.1 Beginnings of artificial intelligence . . . . . . . . . . . . . . . . 1

1.1.2 From Machine Learning to Deep Learning . . . . . . . . . . . . 2

1.1.3 Classification of deep learning methods . . . . . . . . . . . . . 3

1.1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.5 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Introduction to visual search . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2.1 Problem definition and applications . . . . . . . . . . . . . . . . 6

1.2.2 Instance retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Early methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Local representations . . . . . . . . . . . . . . . . . . . . . . . . 8

Global representations . . . . . . . . . . . . . . . . . . . . . . . . 9


Deep representations . . . . . . . . . . . . . . . . . . . . . . . . 10

1.2.3 Evaluation of visual search methods . . . . . . . . . . . . . . . . 12

1.2.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Related Work 15

2.1 Backbone architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.1 General purpose backbone architectures . . . . . . . . . . . . . 16

2.1.2 Task specific architectures . . . . . . . . . . . . . . . . . . . . . . 17

2.2 Loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.1 Classification losses . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2.2 Pairwise losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2.3 Listwise losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.3 Hard negative mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Motivation and contributions 25

3.1 Boundaries of state of the art for person re-identification . . . . . . . . 25

3.2 Hard negative mining combined with existing losses . . . . . . . . . . . 26

3.3 Loss for explicit maximization of the area under the ROC curve . . . . . 26

4 Good practices for person re-identification 29

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.1.1 Curriculum learning for re-ID . . . . . . . . . . . . . . . . . . . 30

4.2 Learning a global representation for re-ID . . . . . . . . . . . . . . . . 31


4.2.1 Architecture design . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2.2 Architecture training . . . . . . . . . . . . . . . . . . . . . . . . . 31

Three-stream Siamese architecture. . . . . . . . . . . . . 31

4.2.3 Applying curriculum learning principles . . . . . . . . . . . . . 32

Pretraining for Classification (PFC). . . . . . . . . . . . . 32

Hard Triplet Mining (HTM). . . . . . . . . . . . . . . . . . 33

Increasing image difficulty (IID). . . . . . . . . . . . . . . 33

4.3 Empirical evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3.1 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . 33

Datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Training details. . . . . . . . . . . . . . . . . . . . . . . . . 34

4.3.2 Ablative study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

Image transformation. . . . . . . . . . . . . . . . . . . . . 34

Pooling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Backbone architecture. . . . . . . . . . . . . . . . . . . . . 35

Fine-tuning for classification. . . . . . . . . . . . . . . . . 35

Curriculum learning strategies. . . . . . . . . . . . . . . . 35

4.3.3 Comparison with the state of the art . . . . . . . . . . . . . . . . 37

4.3.4 Qualitative analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Re-identification examples. . . . . . . . . . . . . . . . . . 38

Localized responses and clothing landmark detection. . 39

Implicit attention. . . . . . . . . . . . . . . . . . . . . . . . 40


4.3.5 Re-ID in the presence of noise . . . . . . . . . . . . . . . . . . . 41

4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5 Hard negative mining 43

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.3 Bag of Negatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Linear auto-encoder . . . . . . . . . . . . . . . . . . . . . . . . . 48

Dynamic quantization thresholds . . . . . . . . . . . . . . . . . 48

Hash table dynamic update . . . . . . . . . . . . . . . . . . . . . 48

5.3.1 Bag of Negatives and pairwise losses . . . . . . . . . . . . . . . . 50

Bag of Negatives and triplet loss . . . . . . . . . . . . . . . . . . 50

Bag of Negatives with batch hard loss . . . . . . . . . . . . . . . 50

5.4 Empirical evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.4.1 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . 51

Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.4.2 Analysis of Bag of Negatives . . . . . . . . . . . . . . . . . . . . . 51

BoN vs exhaustive search . . . . . . . . . . . . . . . . . . . . . . 52

Non-zero loss triplets analysis . . . . . . . . . . . . . . . . . . . 53

BoN-Random behavior varying s . . . . . . . . . . . . . . . . . . 55

Training time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Bin stability analysis . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.5 Results and comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


5.6 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . . . . 61

5.7 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

Automatic s parameter estimation . . . . . . . . . . . . . . . . . 62

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6 Explicit maximization of area under the ROC curve 67

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.1.1 Optimization of evaluation metrics . . . . . . . . . . . . . . . . 68

6.1.2 Mean Average Precision vs. area under ROC curve . . . . . . . 69

6.2 AUC loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.2.1 Area under the ROC curve . . . . . . . . . . . . . . . . . . . . . . 69

6.2.2 Differentiable relaxation of AUC . . . . . . . . . . . . . . . . . . 71

Integral to series (Riemann sum) . . . . . . . . . . . . . . . . . . 71

Heaviside to sigmoid . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.2.3 AUC loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

AUC metaparameters . . . . . . . . . . . . . . . . . . . . . . . . 74

AUC implementation details . . . . . . . . . . . . . . . . . . . . 75

6.3 Empirical evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.3.1 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . 77

Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

Training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.3.2 AUC loss analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

∆s parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Batch all vs batch hard strategies . . . . . . . . . . . . . . . . . . 78


AUC loss evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.3.3 Comparison with state of the art . . . . . . . . . . . . . . . . . . 81

6.4 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . . . . 87

7 Closing remark 89

7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

7.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

7.4 Patents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

Bibliography 107

List of Figures

1.1 Examples of appearance variations of images of Sagrada Familia. . . 7

1.2 Examples of color histograms of images from Figure 1.1. . . . . . . . . 8

1.3 Example of keypoints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4 Example of local descriptor matching based on SIFT. . . . . . . . . . . 9

1.5 Dataset splits for classification and retrieval. . . . . . . . . . . . . . . . 11

2.1 Inception unit. Building block of GoogLeNet. (Source [101]) . . . . . . 16

2.2 Building block of ResNet architecture. (Source: [37]) . . . . . . . . . . 17

4.1 Summary of the training approach. Image triplets are sampled and
fed to a three-stream Siamese architecture, trained with a ranking loss.
Each stream encompasses an image transformation, convolutional
layers, a pooling step, a fully connected layer, and an ℓ2-normalization.
Weights of the model are shared across streams. In red we illustrate
the curriculum learning strategies: (1) pretraining for classification
(PFC), (2) hard triplet mining (HTM), (3) increasing image difficulty
(IID). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.2 For several queries from Market, we show the first 10 retrieved images
together with the mAP and the number of relevant images (in brackets)
of that query. Green (resp. red) outlines images that are relevant (resp.
non-relevant) to the query. . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.3 Matching regions. For pairs of matching images, we show maps for
the top 5 dimensions that contribute most to the similarity. All these
images are part of the test set of Market-1501. . . . . . . . . . . . . . . 39


4.4 We highlight regions that correspond to the most highly-activated


dimensions of the final descriptor. They focus on unique attributes,
such as backpacks, bags, or shoes. . . . . . . . . . . . . . . . . . . . . . 41

4.5 Performance comparison (mAP) in the presence of a large number of


distractors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1 BoN strategy. Triplets with good quality negatives are formed using
the information from the hash table. The resulting embedding is
used to learn both the deep model and a linear projection that, in
turn, provides a low-dimensional embedding. Its quantization pro-
vides (possibly) new entry positions in the hash table for the input
images. The hash table and the linear autoencoder are updated at
each training step with minimal overhead. . . . . . . . . . . . . . . . . 46

5.2 Negative distances calculated in the whole dataset (x-axis) vs negative


distances calculated inside of bins (y-axis) for 100 anchors. . . . . . . 52

5.3 Percentage of non-zero loss triplets per mini-batch as a function of


mAP on the training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.4 Validation mAP as a function of s. . . . . . . . . . . . . . . . . . . . . . 56

5.5 The percentage of samples that were added to the hash table or moved
from one bin to another. HD stands for Hamming distance between
the old and new hash entry. . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.6 Dynamic s estimation. An example of a set of S BoN modules. The


third BoN module has the biggest expected loss, therefore the proba-
bility of sampling from it in the next training iteration is 0.5, while the
probability of sampling from BoN modules 2 and 4 is 0.25. . . . . . . 63

5.7 Dynamic s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6.1 The ROC curve (red line) and its approximation based on a set of
thresholds s (blue line). The area under the approximated curve is
calculated using the Trapezoidal rule. . . . . . . . . . . . . . . . . . . . 70

6.2 Family of sigmoids for ∆s = 0.2 and r = 12.02. . . . . . . . . . . . . . . 72

6.3 First order derivative of sum of sigmoids for ∆s = 0.2. . . . . . . . . . . 74

List of Tables

4.1 Impact of different data augmentation strategies. We report mean


average precision (mAP) on Market and Duke. . . . . . . . . . . . . . . 36

4.2 Impact of the input image size. We report mean average precision
(mAP) on Market and Duke. . . . . . . . . . . . . . . . . . . . . . . . . 36

4.3 Top (a): influence of the pooling strategy. Middle (b): results for dif-
ferent backbone architectures. Bottom (c): influence of pretraining
the network for classification before considering the triplet loss. We
report mAP for Market and Duke. . . . . . . . . . . . . . . . . . . . . . 36

4.4 Impact of different design choices. We report mean average pre-


cision (mAP) on Market and Duke, using ResNet-101 as backbone
architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.1 Comparison of sampling strategies. . . . . . . . . . . . . . . . . . . . . 44

5.2 Time required for training for 100k steps and until convergence. . . . 57

5.3 mAP validation results at peak performance for every method. . . . . 59

5.4 validation results at peak performance for every method and dataset.
* stands for the best number found in literature that uses additional
attention ensembles. F means that the method uses bilinear pooling. 61

5.5 mAP validation results at peak performance for every method. . . . . 65

6.1 Optimal r for a set of ∆s parameters. . . . . . . . . . . . . . . . . . . . . 75

6.2 Validation r@1 as a function of ∆s tested on SOP dataset. . . . . . . . 78


6.3 Comparison of batch all and batch hard strategies on Stanford Online
Products [73] dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.4 Comparison of the AUC and the triplet batch hard loss functions on
the Stanford Online Products [73] dataset. . . . . . . . . . . . . . . . . 80

6.5 Comparison of the AUC and the triplet batch hard loss functions on
the CUB-200-2011 [114] dataset. . . . . . . . . . . . . . . . . . . . . . . 80

6.6 Comparison of the AUC and the triplet batch hard loss functions on
the In-shop Clothes [63] dataset. . . . . . . . . . . . . . . . . . . . . . . 81

6.7 Comparison of the AUC and the triplet batch hard loss functions on
the VERI-Wild [64] dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.8 Comparison with the state-of-the-art on the Stanford Online Prod-


ucts [73] dataset. Embedding dimension is presented as a superscript
and the backbone architecture as a subscript. R stands for ResNet, G
for GoogLeNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.9 Comparison with the state-of-the-art on the CUB-200-2011 [114] data-


set. Embedding dimension is presented as a superscript and the back-
bone architecture as a subscript. R stands for ResNet, G for GoogLeNet. 83

6.10 Comparison with the state of the art on the CUB-200-2011 [114]
cropped dataset. Embedding dimension is presented as a superscript
and the backbone architecture as a subscript. R stands for ResNet, G
for GoogLeNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.11 Comparison with the state-of-the-art methods on the In-shop Clothes [63]
dataset. Embedding dimension is presented as a superscript and
the backbone architecture as a subscript. R stands for ResNet, G for
GoogLeNet, V for VGG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.12 Comparison with the state-of-the-art methods on the VERI-Wild


small [64] dataset. Embedding dimension is presented as a super-
script and the backbone architecture as a subscript. R stands for
ResNet, A for ad-hoc, M for MobileNet . . . . . . . . . . . . . . . . . . 84


6.13 Comparison with the state-of-the-art methods on the VERI-Wild


medium [64] dataset. Embedding dimension is presented as a su-
perscript and the backbone architecture as a subscript. R stands for
ResNet, A for ad-hoc, M for MobileNet . . . . . . . . . . . . . . . . . . 85

6.14 Comparison with the state-of-the-art methods on the VERI-Wild


large [64] dataset. Embedding dimension is presented as a super-
script and the backbone architecture as a subscript. R stands for
ResNet, A for ad-hoc, M for MobileNet . . . . . . . . . . . . . . . . . . 85

1 Introduction

1.1 A brief Introduction to deep learning


1.1.1 Beginnings of artificial intelligence
The idea of machines performing tasks that are traditionally done by humans has
been present since the early 14th century. The first ideas were proposed in the field
of philosophy, where Ramon Llull in Ars generalis ultima introduced a mechanical
method of creating new knowledge based on combinations of known concepts
[43]. Jonathan Swift introduced The Engine in Gulliver’s Travels in the 18th century
as a device that generates permutations of word sets [100].
The first concept of a machine that thinks was proposed by Alan Turing in 1950
[104]. The Turing machine is a “mathematical model of computation that defines
an abstract machine, which manipulates symbols on a strip of tape according to a
table of rules” [115]. The same year, Claude Shannon published an article about the
first fully functional algorithm for playing chess [90].¹
The first artificial intelligence (AI) programs were designed to make reasonable
decisions in a space that can be described by a closed set of rules. Even though
these scenarios can be complicated for humans, the machines showed exceptional
performance. However, challenges appear when the problem given to a machine
is more intuitive and cannot be easily described by formal rules. Humans typically
solve such problems based on their life-long experience and acquired knowledge.
For example, people have no difficulty recognizing whether there is a car in an
image, but it is hard to transfer that knowledge to a machine in a formal way.
Therefore, one of the key challenges in artificial intelligence is how to transfer this
intuitive knowledge to a machine, a challenge addressed by the field of machine learning.

¹ However, it took almost fifty years to develop an algorithm that could win against a world champion.


1.1.2 From Machine Learning to Deep Learning


One branch of artificial intelligence, called machine learning (ML), deals with
learning to make correct predictions based on the available data. Machine learning
algorithms use data to extract patterns useful for future predictions, without
human help. For example, instead of describing the geometry of cars in images
using formal rules, a machine learning algorithm can be given a set of thousands
of images of cars and find patterns in these images without human interaction.
The process of extracting patterns from input data (such as images, audio, text,
etc.) by mapping them into internal representations is called representation learning.
Nowadays, these representations are usually obtained by deep neural networks, and
this group of techniques is called deep learning. The idea of using neural networks
in artificial intelligence has its roots in the structure of the human brain; hence, the
nomenclature and architectures come from neuroscience.² Neural networks are
composed of a set of parameters, some of which are fixed and others that can be
learned. The learnable parameters are called weights and biases.

² Nonetheless, current neural networks employed in deep learning are a very limited simulation of
the actual neurophysiology of complex brain structures.
Before starting to train a deep neural network, the following steps have to be
done:

• Data preparation. Before starting the training of a deep neural network, the
input data has to be loaded into RAM. Additionally, the input data is usually
pre-processed by whitening and augmentation.

• Architecture design. In this step a deep neural network is designed for a spe-
cific task. The architecture can be adopted from publicly available resources,
or it can be designed from scratch. In both cases, decisions about the capacity
and speed of the network are taken to satisfy the task requirements, while
respecting hardware limitations. More about architectures will be presented
in section 2.1.

• Loss design. A loss is a measure of the difference between what the model
predicts and what is expected (a.k.a. the ground truth), and it should be designed
based on the final goal. The loss is calculated at every training step, and the
weights and biases of the network are updated in order to optimize it.

• Optimization strategy. The way the loss is optimized is defined by the
optimization strategy. Some of the most common optimization strategies are
based on backpropagation [86], paired with an optimization function such as
Stochastic Gradient Descent (SGD) [83], RMSprop [103] or Adam [51]. These
optimization functions calculate the gradients of the loss and provide them to
the backpropagation, which updates the trainable parameters of the network.

A deep neural network is given a chunk of input data, called a mini-batch, in


every iteration. After a forward pass of the mini-batch, the loss is calculated and
backpropagated through the network.
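To make these steps concrete, the following is a minimal sketch of such a training loop, assuming PyTorch; the toy model, the random data and the hyperparameters are illustrative placeholders, not the setup used in this thesis.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 256 RGB images of size 32x32, with 10 hypothetical classes.
images = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# Architecture design: a deliberately small convolutional network.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
criterion = nn.CrossEntropyLoss()                         # loss design
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimization strategy

for batch, targets in loader:                # one mini-batch per iteration
    optimizer.zero_grad()
    loss = criterion(model(batch), targets)  # forward pass and loss computation
    loss.backward()                          # backpropagate the gradients
    optimizer.step()                         # update weights and biases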

1.1.3 Classification of deep learning methods


Deep learning methods can be categorized into different groups based on several
criteria.
Based on the type of data available and the task for which a model is trained,
learning can be supervised, unsupervised, semi-supervised or reinforcement
based. In supervised learning, the algorithm learns from a labeled dataset, which
provides an answer that can be used to evaluate the accuracy on training and
testing data. For example, a classification model that is trained to discriminate
images of cats and dogs predicts to which category an input image belongs.
Unsupervised learning, in contrast, learns from unlabeled data by finding patterns
that describe the input data. The typical example of an unsupervised model is the
autoencoder, which has two parts: an encoder, which maps the input data into a
latent representation, and a decoder, which reconstructs the input data from the
latent representation. Semi-supervised learning is in between supervised and
unsupervised learning; it uses a small amount of labeled data and a larger set of
unlabeled data. Finally, reinforcement learning trains an algorithm with a reward
system, providing feedback about the correctness of a chain of actions performed
by an agent.
A deep learning model can be discriminative, if it models a decision boundary
between classes, or generative, if it models the distribution of the input data.
Based on the type of layers used for building the architecture, deep learning
models can be fully connected, convolutional or recurrent.³ Fully connected
networks are made of a series of fully connected layers that connect every neuron
of one layer to all neurons of the following layer by weighted connections.
Convolutional neural networks are mainly used when the input data are images.
They are made of three main types of layers: convolutional, max or average pooling,
and fully connected. Convolutional layers are sets of non-linear filters that
transform the input image or tensor. Pooling layers change the shape of the tensor,
mainly by replacing a certain area by its representative value (maximum, average,
etc.). Fully connected layers are usually the last layers of an architecture, and they
embed 3D feature maps into vector representations.

³ Many other architectures are now being developed and explored; here we mention only the most
common ones relevant to this thesis.


Based on the task, deep learning methods can address regression, classification,
recognition, retrieval, re-identification, detection, segmentation, etc. Regression
models predict a continuous output value based on a set of input variables. For
example, a regression model can predict the price of a flat based on its size, the
zone in which it is located, the year in which the building was constructed, etc.
Logistic regression is similar to regression, with the difference that the output is a
discrete value. An example of logistic regression is cancer diagnosis based on a set
of analyses, where the output is either 0, which means that the patient does not
have cancer, or 1, if the patient has cancer. Classification is the task of separating
input data into different categories. For example, images can be separated into
different classes based on their content (cat, dog, beach, sea, tree, etc.). Retrieval
models search for the data from the gallery set that is most relevant to the query.
For example, if a query is the phrase ’Eiffel Tower’ and the gallery set is made of
images, the expected output is a list of images that contain the Eiffel Tower.
Recognition is a task closely related to classification: it classifies whether two
input samples are from the same class, without specifying which class it is. For
example, if a system is presented with images of two persons, it should indicate
whether the person is the same in both pictures. Detection is the task of providing
the regions in which an object of a certain class is located in an image. For example,
a face detector draws bounding boxes around all faces in the input image.
Segmentation can be defined as pixel-wise classification. Unlike detection, which
finds the regions where objects of a certain class are, segmentation provides a
detailed map that includes all pixels belonging to the desired class.

1.1.4 Applications
Deep learning has a very wide range of applications. It has found a purpose in every
sector that deals with large amounts of digital data such as text, images, numbers,
diagnoses, etc.
One of the most important fields of deep learning applications is health care.
There are many ways to benefit from deep learning methods: regression models
can be used to predict the future development of a disease if a model is given
relevant data about the patient; medical imaging methods based on deep learning
provide analyses of various medical images such as X-ray, magnetic resonance and
ultrasound; and robots can use reinforcement learning when trained to assist in
surgery.
The military is yet another sector where deep learning is used. Common
applications are target recognition, battlefield health care, combat simulation and
threat monitoring.
Deep learning has found applications in forensics as well. Surveillance cameras
can be used to find criminals and evidence of crimes. Very often, police inspectors
know that something happened at a certain location that is covered by a camera,
but they do not have information about the specific time. Traditionally, a person
had to review videos coming from the surveillance cameras, which can last more
than 10 hours, in order to identify a crime. Nowadays, advanced deep learning
algorithms can process long videos quickly and detect subjects or actions of
interest.
Decisions about granting somebody a loan in a bank are made based on data such
as income, monthly spending, health conditions, etc. Deep learning methods in
finance can support such decisions.
Autonomous driving has become one of the most important applications of deep
learning. Autonomous vehicles are equipped with various sensors and cameras,
which constantly acquire data. Deep learning algorithms process the data in order
to segment the scene and detect other vehicles, persons, signs, obstacles, etc.
Deep learning is also used to process the data collected about a customer and offer
products that are most likely relevant to his/her taste. These recommendation
systems are used in online shopping, advertisements, movie recommendations on
online platforms, etc.
Speech recognition. One important branch of deep learning deals with processing
speech. It turns words spoken by anyone into written, digital information that
holds the same content. This can be used for automatic subtitles, or as a first stage
of translating from one language to another.
Deep learning algorithms, especially reinforcement learning, can be taught to
play a game based on previous experience. Some of the games where deep learning
surpasses humans are chess and Go.
In addition to the previously mentioned tasks, deep learning has found its place in
the domain of art as well. Most of these algorithms are focused on style transfer,
where they combine the style of one image with the content of another image.

1.1.5 Limitations
Even though deep learning has been growing fast and improving quickly in the last
decade, there are certain limitations. One of the main problems of deep learning is
that it works well only when models are trained with huge amounts of labeled data.
Even though the amount of publicly available data is rising tremendously, the great
majority of it is not labeled. Data labeling is a very slow and expensive process that
requires human help. Models trained with little data tend to overfit, thus
performing poorly on new, unseen data.
Hardware limitations. It is well known that deeper and more complex models
provide more accurate results. Also, having a lot of data available at each training
step provides better gradients and faster training. However, hardware constraints
on both training and inference are serious limitations that play a significant role
when designing an architecture.
Common sense. Deep learning is capable of solving very complicated tasks.
However, when a mistake occurs, it is not always clear at first glance why it
happened. Deep learning methods make decisions based on different criteria than
humans, without explicitly providing an answer to the question of why they
predicted a specific output. For example, if a deep neural network labels a picture
of a dog with the label cat, it is usually not intuitive to humans why that happened.
However, a thorough investigation of the activations of different layers of the
network can provide insight into the reasons why the network made a certain
decision.

1.2 Introduction to visual search


1.2.1 Problem definition and applications
Visual search is the task of looking for data in a gallery based on a query image. The
gallery can be either a set of images or a more general data collection. The output
of the search is a ranked subset or the full gallery set, where the data presented first
is more relevant to the query image than the data that comes later. In this thesis we
focus on the image retrieval problem, where both queries and gallery are images.
Visual search has a wide range of applications. One example is reverse image
search, where a user can draw an object of interest in order to retrieve a real image
of it from their own photo collection or from the internet. Another application is
geolocalization, where a user has a picture of a landmark and is searching for the
place where the photo was taken. Visual search can be used for online shopping if
a user queries with a picture of a desired product and searches for it in an online
market. Searching for more information about certain objects such as paintings,
movies or food packages can be done by querying with an image as well. Visual
search is commonly used when the gallery set contains huge numbers of images
that cannot be separated by using only attributes. An example of such a search is
person re-identification, where the gallery set contains images of people captured
by different cameras in a certain time frame, and a user is searching for all
appearances of one person.

1.2.2 Instance retrieval


Instance retrieval is a sub-task of visual search where a user is searching for images
of an object shown in the query image. This task has been present in the literature
for over thirty years, and therefore there is a wide range of approaches trying to
solve it. All these approaches have one thing in common: all images, both query
and gallery, are embedded into vector representations. Depending on the nature
of the approach, one image can be embedded into a single vector representation
(early methods, global descriptors and some deep representations) or into multiple
vector representations (local representations, some deep representations). The
final result of instance retrieval is a ranked gallery set, which is generated based on
the similarities between the vector representations of the query and gallery images.
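As an illustration of this common ranking step, here is a minimal sketch assuming NumPy; the 128-dimensional random descriptors are hypothetical placeholders for any of the representations discussed below.

import numpy as np

rng = np.random.default_rng(0)
query = rng.standard_normal(128)             # one query descriptor
gallery = rng.standard_normal((1000, 128))   # descriptors of 1,000 gallery images

# L2-normalize so that the dot product equals cosine similarity.
query /= np.linalg.norm(query)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

similarities = gallery @ query               # one similarity per gallery image
ranking = np.argsort(-similarities)          # most similar gallery images first
print(ranking[:10])                          # indices of the top-10 retrieved images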

Early methods
The first methods proposed for solving the instance retrieval task were published
in the early 1990s [71, 99, 106]. These methods were straightforward solutions
based on basic image characteristics such as color histograms, textures or shapes.
Even though these methods were easy to implement and had a small inference
time, they performed poorly even under the smallest changes in the images. For
example, two images of the same object can have completely different color
histograms depending on the illumination, scale, viewpoint, presence of occlusions,
etc. (see the images of one object in Figure 1.1 and their histograms in Figure 1.2).

Figure 1.1 – Examples of appearance variations of images of Sagrada Familia.


Figure 1.2 – Examples of color histograms of images from Figure 1.1.
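The following small sketch, assuming NumPy and toy grayscale data, illustrates this weakness: a global illumination change alone already lowers a histogram-intersection similarity, even though the depicted object is unchanged.

import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 200, size=(64, 64))        # toy grayscale image
darker = np.clip(image * 0.5, 0, 255).astype(int)  # same scene, half the light

def histogram(img, bins=16):
    h, _ = np.histogram(img, bins=bins, range=(0, 256))
    return h / h.sum()                             # normalized color histogram

# Histogram intersection: 1.0 for identical histograms, lower otherwise.
h1, h2 = histogram(image), histogram(darker)
print(np.minimum(h1, h2).sum())                    # clearly below 1.0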

Local representations
In order to cope with the geometric and photometric variations that global
representations cannot handle, another group of methods, called local
representations, appeared (see the survey [69] for more details about local
representations). These methods choose locations of interest in the images and
extract local descriptors around them. The points of interest are chosen by interest
point detectors such as Harris, Hessian, Hessian-affine or MSER [35, 66, 68]
(Figure 1.3). The local descriptors are extracted by applying SIFT [65], SURF [4] or
LBP [74] at each point of interest, which results in a set of local descriptors for all
query and gallery images (Figure 1.4).

Figure 1.3 – Example of keypoints.


Figure 1.4 – Example of local descriptor matching based on SIFT.

Comparing local descriptors for each pair of query-gallery images is
computationally costly. The pairs of images that have enough pairwise matches
are stored in a shortlist. After processing the shortlist with a geometric verification
method such as RANSAC [16], a shorter list of retrieved images is obtained.
Even though local representations provide significantly better solutions than the
early approaches, they have two limitations: storing all local descriptors requires a
huge amount of memory, proportional to the number of local descriptors per
image, and the pairwise comparison of local descriptors is computationally
expensive. In order to tackle these limitations, a new group of methods, called
global representations, was proposed.

Global representations
A global representation of an image is a combination of all local representations of
that image. Such representations can be easily compared; additionally, storing the
descriptors requires less memory.
Obtaining a global representation requires three steps: first, all local descriptors
are extracted from an input image; second, each local descriptor is associated with
a visual word using a bag-of-visual-words algorithm; and finally, a histogram of
occurrences of visual words is created. A global representation that depends on a
few visual words can be coarse. One way of improving a global representation is
introducing more entries in the codebook, but this solution incurs a significant
computational cost. Another way of improving the global representation is using
higher-order statistics of the data belonging to each entry of the codebook. One
way of using higher-order statistics is proposed in the Vector of Locally Aggregated
Descriptors (VLAD) [44]. Instead of counting how many local descriptors fall into
each entry of the codebook, this method aggregates all local descriptors assigned
to the same visual word. The final representation is a concatenation of the vectors
for individual words.
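The aggregation step of VLAD can be sketched in a few lines, assuming NumPy; the codebook and the local descriptors below are random placeholders standing in for k-means centroids and SIFT-like descriptors.

import numpy as np

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 128))       # 8 visual words (centroids)
descriptors = rng.standard_normal((300, 128))  # local descriptors of one image

# Assign each descriptor to its nearest visual word.
dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
words = dists.argmin(axis=1)

# Aggregate residuals (descriptor minus centroid) per word, then concatenate.
vlad = np.zeros_like(codebook)
for w in range(codebook.shape[0]):
    assigned = descriptors[words == w]
    if len(assigned):
        vlad[w] = (assigned - codebook[w]).sum(axis=0)
vlad = vlad.ravel()
vlad /= np.linalg.norm(vlad)  # final 8 x 128 = 1024-dimensional descriptor
print(vlad.shape)             # (1024,)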


A more elaborate solution using higher-order statistics takes into account not only
the word to which a local descriptor belongs, but also the mean and standard
deviation of each visual word. This method is called the Fisher Vector [76].
Descriptors are soft-assigned to the words based on their distances. Similarly to
VLAD, the final representation is obtained by concatenating the per-word
aggregated vectors.

Deep representations
Deep convolutional neural networks have been used to extract image descriptors
since the early days of deep learning. These descriptors are compact and can be
easily compared and used for ranking.
At the beginning of the deep learning era, the majority of models were trained
for classification. The first retrieval approaches used off-the-shelf deep
convolutional neural networks trained for classification on a large-scale general
purpose dataset, such as ImageNet, to extract features [27, 91, 122]. These methods
were not appropriate for two main reasons: first, the training data was too different
from the data used for the final task; and second, the loss was designed for
classification, and not for ranking.
The first problem has a straightforward solution: instead of using classification
data, we can train the model on the train partition of the retrieval data, which is
called fine-tuning [122]. However, this solution has two main drawbacks:

• The way that data is split into train and test partitions differs between retrieval
and classification. Classification models work with a closed set of classes,
meaning that all train data, as well as queries and gallery images used for
testing, belong to one of the predefined classes. Therefore, both train and test
splits are non-overlapping subsets of images from all classes. On the contrary,
retrieval datasets are split into train and test partitions based on the class labels;
some classes are selected for training, while all images from the remaining
classes are used for testing. Training a model on the train set of retrieval data
for classification can lead to overfitting to the selected training classes. The
model will try to associate the images that appear at test time with one of the
training classes, which is not appropriate. Even though using the data that is
collected for the specific task can improve the results obtained by training on
a general purpose dataset, there is still much room for improvement.

• The number of classes that can be used for training is limited by the available
memory. Retrieval datasets typically contain images of a large number of
classes or identities. A classification model usually has a fully-connected layer
that projects the output of the last convolutional layer to a single vector. The
number of parameters of this fully-connected layer, and hence its memory
footprint, is linearly proportional to the number of classes, and thus is not
appropriate for datasets with a large number of classes/categories.

Figure 1.5 – Dataset splits for classification and retrieval.

Several ranking losses were proposed in order to train a model that optimizes
distances between data points without predicting the classes to which they belong.
These losses group the data of one class into a cluster that is separated from the
clusters of the data from other classes. More about ranking losses will be presented
in Chapter 2.
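As a minimal illustration of a loss of this family, the following sketch, assuming PyTorch and random toy embeddings, implements the classic triplet loss with a margin; it is an example of the idea, not the exact formulation used in later chapters.

import torch
import torch.nn.functional as F

anchor = F.normalize(torch.randn(32, 128), dim=1)    # 32 anchor embeddings
positive = F.normalize(torch.randn(32, 128), dim=1)  # same-class embeddings
negative = F.normalize(torch.randn(32, 128), dim=1)  # different-class embeddings

margin = 0.2
d_ap = (anchor - positive).pow(2).sum(dim=1)  # squared distance to positives
d_an = (anchor - negative).pow(2).sum(dim=1)  # squared distance to negatives
loss = F.relu(d_ap - d_an + margin).mean()    # zero once d_an > d_ap + margin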
As retrieval problems can be domain specific, many approaches propose
domain-specific architectures or take advantage of known physical characteristics
of the objects. For example, in the case of person re-identification, we can expect
to find certain body parts in the images, such as the head, arms, legs and torso.
Also, there are attributes that can be used in addition to the general descriptors.
For example, information about the gender of a person, the length of their hair,
whether they wear glasses, or the type of clothes can be used to improve the
descriptor.


1.2.3 Evaluation of visual search methods


Let’s assume that we have to evaluate a model on a dataset of n c classes which
are not used at train time, and n i samples in total, divided into n q queries, and
n g = n i − n q gallery samples. Some of the metrics used for evaluation of both
retrieval and recognition systems are listed below.
rank@N measures the percentage of all queries from the test set that have at
least one sample from the same class among the first N retrieved samples.
Precision is a measure that calculates the percentage of relevant samples (true
positives) among all retrieved samples (true positives and false positives). P r eci si on
@ N is commonly used for evaluation of retrieval systems when the boundary be-
tween positive and negative samples is not known. This measure calculates the
percentage of relevant samples in the first N retrieved samples (Equation 1.2).

TP
pr eci si on = (1.1)
TP +FP
TP
pr eci si on@N = . (1.2)
N
Recall (true positive rate or sensitivity) measures the ratio between the number of
relevant retrieved samples (true positives) and the total number of positive samples
in the gallery (true positives and false negatives). Similarly to precision@N, we can
define recall@N, which calculates the ratio between the number of correctly
retrieved samples among the first N and the total number of positive samples $N_p$
(Equation 1.4).

\[ \mathrm{recall} = \frac{TP}{TP + FN} \tag{1.3} \]

\[ \mathrm{recall@}N = \frac{TP}{N_p}. \tag{1.4} \]

Average precision is a typical measure for instance retrieval that takes into account
the order of all retrieved samples from the gallery. It computes precision and recall
at every position in the ranked sequence of samples, which yields a precision-recall
curve. Average precision is the average value of precision(recall) over the recall
interval [0, 1] (Equation 1.5).

\[ \mathrm{AvgP} = \sum_{k=1}^{n_g} \mathrm{precision@}k \times \Delta\,\mathrm{recall@}k. \tag{1.5} \]

Mean average precision is a measure that calculates the area under the mean
recall-precision curve over all queries. It is calculated as the mean of the average
precision scores over all queries:

mAP = \frac{\sum_{q=1}^{N_q} AvgP(q)}{N_q}    (1.6)

Accuracy is a commonly used measure for the evaluation of recognition systems.
For a test set composed of positive and negative sample pairs, accuracy is calculated
as the percentage of correctly classified pairs.
False positive rate is the ratio between the number of negative samples that
are incorrectly classified and the total number of negative samples for a given
threshold:

FPR = \frac{FP}{FP + TN}    (1.7)
Receiver operating characteristic (ROC) is a curve that shows the trade-off
between TPR and FPR, and it is typically used for finding the optimal threshold
between positive and negative samples. For instance, if the distance threshold is
lower than the minimal distance between the query and gallery samples, no pair is
accepted, so both TPR and FPR are 0. On the other hand, if the threshold is higher
than the maximal distance between query and gallery, all pairs are accepted and
both TPR and FPR are 1. When two recognition systems are compared, the ROC
curve of the one with better class separability will be above the other, covering a
greater area under it.
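To make these definitions concrete, the following minimal NumPy sketch (names
and implementation details are ours, not part of the thesis) computes mAP and
rank@N for a set of L2-normalized query and gallery descriptors:

    import numpy as np

    def average_precision(rel):
        # rel: binary vector over the gallery, sorted by decreasing similarity
        # (1 = same class as the query); implements Equation 1.5
        n_pos = rel.sum()
        if n_pos == 0:
            return 0.0
        precision_at_k = rel.cumsum() / np.arange(1, len(rel) + 1)
        # recall increases by 1/n_pos exactly at the relevant positions
        return float((precision_at_k * rel).sum() / n_pos)

    def evaluate(q_feats, g_feats, q_labels, g_labels, N=1):
        # mAP (Equation 1.6) and rank@N, using the dot product as similarity
        order = np.argsort(-(q_feats @ g_feats.T), axis=1)
        aps, hits = [], 0
        for q in range(len(q_labels)):
            rel = (g_labels[order[q]] == q_labels[q]).astype(float)
            aps.append(average_precision(rel))
            hits += int(rel[:N].any())
        return float(np.mean(aps)), hits / len(q_labels)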

1.2.4 Datasets
In this thesis we use several retrieval and re-identification datasets.
The Market-1501 dataset [130] (Market) is a standard person re-ID benchmark
with images from 6 cameras of different resolutions. Deformable Part Model (DPM)
detections were annotated as containing one of the 1,501 identities, among which
751 are used for training and 750 for testing. The training set contains 12,936 images
with 3,368 query images. The gallery set is composed of images from the 750 test
identities and of distractor images, 19,732 images in total. There are two possible
evaluation scenarios for this database, one using a single query image and one with
multiple query images.
The MARS dataset [128] is an extension of Market that targets the retrieval of
gallery tracklets (i.e. sequences of images) rather than individual images. It contains
1,261 identities, divided into a training (631 IDs) and a test (630 IDs) set. The total
number of images is 1,067,516, among which 518,000 are used for training and the
remainder for testing.
The DukeMTMC-reID dataset [133] (Duke) was created by manually annotating
pedestrian bounding boxes every 120 frames of the videos from 8 cameras of the
original DukeMTMC dataset. It contains 16,522 images of 702 identities in the
training set, and 702 identities, 2,228 query and 17,661 gallery images in the test set.
The Person Search dataset [117] (PS) differs from the previous three as it was
created from images collected by hand-held cameras and frames from movies and
TV dramas. It can therefore be used to evaluate person re-identification in a setting
that does not involve a known camera network. It contains 18,184 images of 8,432
identities, among which 5,532 identities and 11,206 images are used for training,
and 2,900 identities and 6,978 images are used for testing.
Person re-identification large dataset. We merged eleven publicly available
datasets for person re-identification, including CUHK01 [56], CUHK02 [55], 3DPeS [3],
VIPeR [31], airport [47], MSMT17 [112], Market-1501 [130] and DukeMTMC [82]. The
merged dataset has 10.5k IDs and 178k images. We used both the training and testing
partitions of all the datasets except for Market-1501 and DukeMTMC-reID, and we
did not use images labeled as distractors or junk.
Stanford Online Products [73] is a retrieval dataset which contains 120k images
of 22.6k products. The dataset is split into two partitions: the training one, which
contains 59.5k images of 11.3k products, and the testing one, with 60.5k images of
11.3k products.
DeepFashion - In-Shop Clothes Retrieval [63] is a part of DeepFashion dataset
which is designed for instance retrieval. The dataset is made of 54.6k images of
11.7k clothing items. All the images are taken under controlled conditions.
The Caltech-UCSD Birds 200 (CUB-200) [114] is a small dataset that is com-
monly used for image retrieval. It has 6033 images of 200 categories of birds. Fol-
lowing the common practice for the retrieval task, we use the first 100 categories
for training, and the rest for testing. Additionally, we use bounding boxes that are
provided by the authors during both training and testing.
The VERI-Wild [64] is a re-identification dataset of vehicles in the wild. The
images were captured by 174 surveillance cameras over one month, resulting in
277,797 images of 30,671 training identities and three testing partitions: small
(3,000 identities, 38,861 images), medium (5,000 identities, 64,389 images), and
large (10,000 identities, 128,517 images).

2 Related Work

Image retrieval is the task of sorting a gallery set of images based on their relevance
to a query image, where the more relevant images are shown before the less
relevant ones. The more similar an image is to the query, the more relevant it is.
However, the semantics of the word similar can be very broad. Hence, a group
of machine learning algorithms called metric learning addresses a simplified
problem by learning distances between data points, assuming that the distance
between more similar data points (i.e. the same object under an implicit semantic)
is smaller than the distance between less similar ones (i.e. different objects under
the same implicit semantic).
Since the beginning of the deep learning era, the mainstream metric learning
techniques have become deep metric learning systems. These systems use a deep
neural network, often called a backbone, to embed data into a space in which
distances between data points can be measured. The backbone is trained to
construct an embedding space in which the distance between more similar data
points is smaller than the distance between less similar ones.
In this chapter we will present the most relevant works that are related and
influential for this thesis. We start by introducing the backbone architectures that
are most commonly used for metric learning. Next, we present the loss functions, as
well as hard negative mining strategies that are used for efficient training.

2.1 Backbone architectures


Most deep learning based methods for metric learning comprise backbone archi-
tecture, which is a deep convolutional neural network that extracts descriptors of
the input images. Depending on the application, the backbone can be a general
purpose architecture or a task specific one.


2.1.1 General purpose backbone architectures


General purpose architectures do not depend on the nature of the input images,
and they take only images as inputs. They are usually designed primarily for the
classification task and evaluated on the ImageNet dataset [19]. ImageNet contains
images from 1000 classes that are organized according to the WordNet hierarchy [70].
The pioneering deep architecture that won the ImageNet challenge in 2012 was
AlexNet [52]. This architecture is made of 5 convolutional layers, max-pooling layers,
dropout layers and 3 fully connected layers. The paper shows that using ReLU
nonlinearities instead of tanh can significantly speed up training.
Even though the VGG network [92] did not win the ImageNet challenge in 2014,
it attracted attention for its simple design and good performance. The authors
evaluated five configurations of the VGG network, with between 11 and 19 layers,
where all convolutional layers have very small 3×3 receptive fields. They show
that convolutional layers with a small receptive field (3×3) can be stacked without
spatial pooling in between, and effectively perform as a convolutional layer with a
bigger receptive field (5×5, 7×7, etc.). This configuration allows training a deeper
neural network with fewer parameters.
GoogLeNet [101] was the first network that used an inception module as a
building block (Figure 2.1). The idea of using an inception module comes from
the "intuition that visual information should be processed at various scales and
then aggregated so that the next stage can abstract features from the different scales
simultaneously” [101]. GoogLeNet has 22 layers, it is made of 9 inception modules
stacked upon each other, and it has 9 times fewer parameters than AlexNet.
The small number of parameters helps to avoid overfitting and allows for faster
training [7].

Figure 2.1 – Inception unit. Building block of GoogLeNet. (Source [101])

ResNet [37] won the ImageNet challenge in 2015. Instead of increasing the
network capacity in width, as proposed in [101], the authors of ResNet show that
training a very deep network can be beneficial. ResNet comes in 5 different
configurations, with 18, 34, 50, 101 or 152 layers. The architecture is made of
consecutive residual blocks, presented in Figure 2.2. The authors propose using
residual blocks in order to create a direct path from the shallow layers to the output
of the network. The direct path should ease the training of the shallow layers, and
thus allow the training of deeper networks.

Figure 2.2 – Building block of ResNet architecture. (Source: [37])

2.1.2 Task specific architectures


Task specific architectures are specially designed for data that is characteristic
of the task at hand. The architecture designers take advantage of known priors
of the input data. For example, the person re-identification task requires only
tight crops of images of people; these crops usually have the head in the upper
part, the arms and torso in the middle, and the legs in the lower part. Other
domains in which task specific networks can be used are face recognition, vehicle
re-identification, number-plate recognition, etc. In this section we focus on the
architectures designed for person re-identification, as it is the most relevant for this
thesis.
Many works in this line have focused on addressing the geometric alignment
problem via use of part detectors, pose estimation, or attention models. Spatial
transformer networks have been used to globally align images [132] and to localize
salient image regions for finding correspondences [79]. In a similar vein, [62, 127]
use multiple parallel sub-branches which learn, in an unsupervised manner, to
consistently attend to different human body parts. [94] uses a pre-trained pose
estimation network to provide explicit part localization, while a similar approach
[126] integrates a pose estimation network into their deep re-ID model. [129]
uses joint localization to create a new image that contains only the body parts.
Rather than localize parts, [60] represents images with fixed grids and learns cell
correspondences across camera views. Several works have proposed multi-scale
architectures with mechanisms for automatic scale selection [77] or scale fusion [14].
[54] combines a multi-scale architecture with unsupervised body part localization
using spatial transformer networks.

2.2 Loss functions


The first attempt to tackle retrieval using deep learning techniques was to adopt
the well-known classification loss. Even though it was easy to train a model with this
loss, and it brought decent results, it was not appropriate for the retrieval task. With
the further development of deep learning, new loss functions that were better suited
to the task were proposed. In this section we provide more details about the most
commonly used loss functions for instance retrieval, and we separate them into
three groups: classification losses, pairwise losses and listwise losses.

2.2.1 Classification losses


The classification loss was a straightforward solution for a great majority of tasks in
the very beginning of deep learning. Convolutional neural networks were trained to
classify images and, at the same time, image descriptors were extracted as outputs
of the penultimate fully-connected layer. When a model is trained for classification,
its last fully connected layer has a size equal to the number of classes c. This way the
output of the network gives c scores s_1, s_2, ..., s_c, where each score corresponds to
one class.
The most commonly used loss for classification is the cross-entropy loss. This
loss maximizes the approximated probability of the sample belonging to the correct
class. The probability is approximated by using the softmax function, which maps
class scores into values between 0 and 1:

\sigma(s_i) = \frac{\exp(s_i)}{\sum_{j=1}^{c} \exp(s_j)}    (2.1)

The cross-entropy loss for a sample that belongs to class i is defined as:

L_{cross-entropy} = -\log(\sigma(s_i))    (2.2)

The main disadvantages of using the cross-entropy loss for retrieval are: (1) poor
generalization and (2) poor scalability due to the size of the fully connected layer.
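As an illustration, Equations 2.1 and 2.2 amount to a few lines of NumPy (a minimal
sketch with our own naming):

    import numpy as np

    def cross_entropy(scores, target):
        # scores: vector of c class scores; target: index of the correct class
        # softmax (Equation 2.1), shifted by max(scores) for numerical stability
        p = np.exp(scores - scores.max())
        p /= p.sum()
        # cross-entropy loss (Equation 2.2)
        return -np.log(p[target])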


2.2.2 Pairwise losses


This group of losses includes some of the most popular loss functions used for
metric learning, instance retrieval, face recognition and re-identification. All these
losses have one thing in common: they optimize absolute or relative distances
between pairs of images; they minimize distances between descriptors that represent
samples from the same class/identity, and maximize those between samples coming
from different classes.
The pioneer of this group was the contrastive loss [33]. The authors of this
approach use a Siamese architecture with two streams. If the two input samples are
from the same class (Y = 0), the loss pushes their respective representations closer;
otherwise (Y = 1), it separates them if they are at a distance d smaller than the
margin m (see Equation 2.3).

L_{contrastive} = \frac{1}{2}(1 - Y) d^2 + \frac{1}{2} Y \max(0, m - d)^2    (2.3)
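A minimal PyTorch sketch of this loss over a batch of descriptor pairs (a hedged
illustration with our own naming, not the original authors' code):

    import torch

    def contrastive_loss(r1, r2, y, m=0.5):
        # r1, r2: (B, d) descriptor batches; y: (B,) 0 = same class, 1 = different
        d = (r1 - r2).pow(2).sum(dim=1).sqrt()            # Euclidean distance
        loss = 0.5 * (1 - y) * d.pow(2) \
             + 0.5 * y * torch.clamp(m - d, min=0).pow(2)
        return loss.mean()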
The triplet loss [88] requires a Siamese architecture with three streams, which is fed
with three images: an anchor image I_a, an image from the same class I_p, and an
image from any other class I_n. All three images are embedded into their descriptors
r_a, r_p and r_n. The triplet loss pushes samples from the same class closer to
each other, while separating samples from different classes whenever the difference
between the anchor-negative distance (d^- = ||r_a - r_n||_2) and the anchor-positive
distance (d^+ = ||r_a - r_p||_2) is smaller than a margin m.

L_{triplet} = \frac{1}{2} \max(0, d^{+} - d^{-} + m)    (2.4)
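In the same hedged style, Equation 2.4 can be sketched as:

    import torch

    def triplet_loss(ra, rp, rn, m=0.1):
        # ra, rp, rn: (B, d) anchor, positive and negative descriptors
        d_pos = (ra - rp).pow(2).sum(dim=1).sqrt()        # d+ in Equation 2.4
        d_neg = (ra - rn).pow(2).sum(dim=1).sqrt()        # d-
        return 0.5 * torch.clamp(d_pos - d_neg + m, min=0).mean()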
In [13] the authors propose adding an additional, fourth stream to the Siamese
architecture. In this case one stream is for the anchor, one for the positive and two
for negative images. The final objective is to have the anchor-positive distance
smaller than the anchor-negative distance, while making sure that the anchor-
negative distance is greater than the negative-negative distance.
Many approaches, inspired by the triplet loss, proposed various ways to optimize
the training time and the quality of the results by exploiting more of the information
available in a mini-batch. In [40] the authors propose creating mini-batches of P
classes and K images per class. In each training step they perform a forward pass of
all P × K images and obtain their descriptors. They propose two ways of optimizing
the main objective: batch hard and batch all. The batch hard triplet loss treats all
images from the mini-batch as anchors and, for each of them, selects the sample
from the same class that is furthest away as the positive, and the sample from a
different class that is closest to the anchor as the negative (see Equation 2.5).

L_{BH} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ m + \max_{p=1..K} ||r_a^i - r_p^i|| - \min_{\substack{j=1..P,\, j \neq i \\ n=1..K}} ||r_a^i - r_n^j|| \Big]_+    (2.5)

The batch all strategy calculates the loss based on all positive and negative pairs
from the mini-batch (Equation 2.6):

L_{BA} = \sum_{i=1}^{P} \sum_{a=1}^{K} \sum_{\substack{p=1 \\ p \neq a}}^{K} \sum_{\substack{j=1 \\ j \neq i}}^{P} \sum_{n=1}^{K} \big[ m + d_{j,a,n}^{i,a,p} \big]_+,    (2.6)

where d_{j,a,n}^{i,a,p} = ||r_a^i - r_p^i||_2 - ||r_a^i - r_n^j||_2.
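In practice, batch hard mining (Equation 2.5) is often implemented directly on the
mini-batch distance matrix. The following PyTorch sketch (our naming, assuming
the P × K sampling described above) illustrates the idea:

    import torch

    def batch_hard_triplet_loss(emb, labels, m=0.1):
        # emb: (B, d) descriptors of a P*K mini-batch; labels: (B,) class IDs
        dist = torch.cdist(emb, emb)                    # (B, B) pairwise distances
        same = labels[:, None] == labels[None, :]       # same-class mask
        # hardest positive: furthest sample of the same class (Equation 2.5)
        d_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
        # hardest negative: closest sample of a different class
        d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
        return torch.clamp(m + d_pos - d_neg, min=0).mean()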

Similarly to the batch all triplet loss, the structured loss [73] (Equation 2.7) and
the n-pair loss [93] (Equation 2.8) take advantage of all positive and negative pairs
in the mini-batch. In [93] the authors propose creating a mini-batch of N pairs
{(x_1, x_1^+), (x_2, x_2^+), ..., (x_N, x_N^+)} from N different classes. For each positive pair they
sample N − 1 negative samples, one from each of the other classes, and use them for
calculating the loss. In [73] the negative pairs are sampled inside the mini-batch, so
that, for each anchor-positive pair, the negative is one of the closest samples to
either the anchor or the positive.
J_{i,j} = \log \Big( \sum_{(i,k) \in N} \exp(m - d_{i,k}) + \sum_{(j,l) \in N} \exp(m - d_{j,l}) \Big) + d_{i,j},

L_{structured} = \frac{1}{2|P|} \sum_{(i,j) \in P} \max(0, J_{i,j}), \quad d_{i,j} = ||r_i - r_j||_2    (2.7)

L_{n-pair} = \frac{1}{N} \sum_{i=1}^{N} \log \Big( 1 + \sum_{j \neq i} \exp(r_i^T r_j^+ - r_i^T r_i^+) \Big)    (2.8)

2.2.3 Listwise losses


Even though the pairwise loss functions successfully optimize distances between
positive and negative pairs, they have two main disadvantages: first, they do not
explicitly optimize the ranking measure, and second, they are very hard to train
due to the lack of relevant samples in the later stages of training. The typical ranking
performance measure is mean average precision (mAP), which cannot be optimized
easily, as it requires ranking, which is not a differentiable operation. Inspired by
the Histogram loss [105], which proposed a differentiable approximation of the
histogram function, a new group of losses that directly optimize mAP appeared
[38, 39, 81].
In [105] the authors propose the Histogram loss. The method is based on
histograms that approximate the distributions of positive and negative similarities
inside a mini-batch. The loss is designed to separate the two distributions. This
objective does not directly optimize the ranking task, but indirectly it forces all
positive pairs to have a higher similarity than all negative pairs. It is the inspiration
for several listwise losses [38, 39, 81] that directly optimize the average precision.
The pioneer in this line is a differentiable approximation of average precision (AP)
for retrieval in Hamming space, which focuses especially on tie scenarios (where
both positive and negative samples belong to the same histogram bin) [38]. In
[39] the authors apply the same strategy to retrieval and patch matching tasks.
The mini-batch size has a great impact on the results, so the first approaches showed
results on patch matching, because these images are smaller and the backbone
architectures used have fewer parameters. In [81] the authors propose a way to
train a very deep CNN (such as ResNet101) with large images (800x800 pixels) while
optimizing the mAP loss on the whole training set. The method performs a full
forward pass of all images in the dataset and calculates the similarity matrix and
the loss; it then recomputes the descriptors of all images, stores their intermediate
tensors and accumulates the gradients. Once all gradients are accumulated, it
backpropagates the errors through the network. This method cannot easily scale to
larger datasets, due to its high computational cost per weight update.
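To give a flavor of how such losses sidestep the non-differentiable ranking operation,
the following PyTorch sketch relaxes the rank of each sample with sigmoids; it is a
generic illustration in the spirit of these works, not the exact formulation of [38, 39,
81] (names, the temperature tau, and the O(B^3) implementation are ours):

    import torch

    def soft_ap_loss(emb, labels, tau=0.01):
        # emb: (B, d) L2-normalized descriptors; labels: (B,) class IDs
        sim = emb @ emb.t()                               # (B, B) similarities
        B = sim.size(0)
        ne = ~torch.eye(B, dtype=torch.bool, device=emb.device)
        pos = (labels[None, :] == labels[:, None]) & ne   # positives per query
        # diff[q, i, j] = sigmoid((sim[q, j] - sim[q, i]) / tau) smoothly
        # approximates the indicator "j is ranked above i for query q"
        diff = torch.sigmoid((sim.unsqueeze(1) - sim.unsqueeze(2)) / tau)
        valid = ne.unsqueeze(1) & ne.unsqueeze(0)         # j != q and j != i
        rank_all = 1 + (diff * valid).sum(dim=2)          # soft rank in gallery
        rank_pos = 1 + (diff * (valid & pos.unsqueeze(1))).sum(dim=2)
        n_pos = pos.sum(dim=1).clamp(min=1)
        ap = (pos * rank_pos / rank_all).sum(dim=1) / n_pos
        return 1 - ap.mean()                              # minimize 1 - mAP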

2.3 Hard negative mining


The problem of finding relevant candidates for ranking losses (especially for the
triplet loss) has received a lot of attention in recent years, for both retrieval [40, 73, 88,
107, 111, 116] and tracking [118]. If a negative sample is too easy, the triplet loss
goes to zero and does not generate gradients for backpropagation, which in turn
makes training possible only for simple problems. One research line bypasses
this problem by proposing modifications of the softmax loss for easier training [61, 109].
A second research line focuses on offline sampling approaches. An offline re-weighting
of the loss can improve the quality of negative samples, but at a non-negligible cost
[116]. Taking advantage of extra knowledge about sub-categories within the dataset is
also advantageous for mining negative samples [111].
Another group of methods, widely known as online hard negative mining
(OLHN), takes advantage of the sample representations available at the mini-batch
level in order to improve the probability of retrieving relevant negatives for the triplet
loss [40, 73, 88, 107]. Most of these works create mini-batches of k × l images: k
random images for each of l random classes. The pioneering approach was introduced
by [88], called the semi-hard loss, where triplets are created from all anchor-positive
pairs in a mini-batch. The negative sample is chosen so that the loss is between 0
and the predefined margin m (see Equation 2.4).
The Lifted Embedding loss is proposed in [73], where, for each anchor-positive pair
in the mini-batch, the negative image is the one closest to either the anchor or the
positive.
In [40] the authors propose two strategies for sampling within a mini-batch,
which are extensions of the Lifted Embedding loss. The batch all loss is obtained
from all possible combinations of triplets inside the batch. The batch hard loss takes
all the images from the mini-batch as anchors of triplets. The positive is selected as
the furthest sample from the same class as the anchor in the mini-batch, while the
negative is the closest sample to the anchor among all the samples from different
classes in the mini-batch.
Curriculum sampling is proposed in [107], where the beginning of training is
performed using easy negative instances, and the complexity increases over time.
For each anchor, all negatives from the mini-batch are sorted according to their
distance to the anchor, and the representative negative is sampled with a Gaussian
distribution N(µ, σ). µ and σ are changed over time, so that µ goes from the maximum
distance to the minimum distance, while σ shrinks towards 0.
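A hedged sketch of this sampling step, under our own interpretation of [107] (the
schedule for µ and σ is an assumption, not the exact one from the paper):

    import numpy as np

    def curriculum_negative(dists, progress, sigma0=0.3):
        # dists: (n,) anchor-to-negative distances in the mini-batch
        # progress: training progress in [0, 1]
        order = np.argsort(-dists)                 # positions sorted far -> near
        n = len(dists)
        mu = progress * (n - 1)                    # slides from easy to hard
        sigma = max(1e-3, sigma0 * n * (1 - progress))  # spread shrinks to 0
        idx = int(np.clip(round(np.random.normal(mu, sigma)), 0, n - 1))
        return order[idx]                          # index of the chosen negative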
All of these approaches share the same drawback: they focus on the local
distribution of the data inside a mini-batch, while sampling the candidates for the
mini-batch randomly. A randomly created mini-batch is a good representation of
the global distribution, but it does not represent the local embedding space. As
relevant negative samples can be found in the local neighborhood of the anchor
sample, the probability of sampling useful triplets rises if the mini-batch is created
from samples that belong to the same local neighborhood.
Another research line comprises methods that use adversarial samples for met-
ric learning [12, 21]. In [21] the authors propose a way of training Siamese networks
by generating adversarial, potentially hard, negative samples for training with vari-
ations of the triplet loss. The descriptors of all three input images are used for
generating a synthetic, hard negative descriptor. This descriptor, together with the
anchor and positive, forms a triplet of descriptors that is used for calculating the
loss. Similarly, in [12] the authors propose a metric learning strategy that uses a set
of real and a set of synthetic pairs for training.
The fourth research line proposes online strategies for providing relevant nega-
tive samples prior to mini-batch formation [25, 36, 96, 108], and one of our contri-
butions belongs to this research line.
In [25] the authors propose a strategy that builds a tree of identities to facilitate the
sampling of relevant negatives for a given anchor. The method clearly improves the
quality of negative samples, but at the cost of updating the tree at every epoch. Also,
the tree construction is based on an identity-to-identity distance matrix, which thus
scales quadratically with the number of identities.
In [108] the authors explicitly face the problem of training a Siamese network with
100k identities. The basic idea is to generate a representation for each identity, and
to apply clustering on all the identities to generate clusters or subspaces, wherein
the identities in each subspace are similar. The authors propose to train a classifier
on a subset of identities, then use the classifier to generate image representations,
and finally perform k-means clustering in order to form the subspaces. The authors
do not update the clustering during training, thus the subspaces can become
sub-optimal in the later stages of training.
In [96] the authors propose a strategy in which anchor samples are compared to
“class signatures” in order to limit the number of sample-to-sample comparisons. A
stochastic process is added to avoid the oversampling of overly-difficult or noisy
classes. The class signatures are constantly updated thanks to an additional
classification loss. The method can be seen as a stochastic extension of [25] in which
the similarity between classes is computed at each step, while the class signatures
are updated in a more efficient way. Nonetheless, the number of additional distance
computations w.r.t. pure random sampling is proportional to the number of classes.
In [36] the authors propose a strategy for creating triplets of images from a subset
of the approximate nearest neighbors of the anchor image. This strategy requires a
forward pass over all training set images at the beginning of each epoch, followed
by graph construction and a search inside the subspace. This strategy increases
the speed w.r.t. exhaustive search (“...Given that O(N^2) is the best case complexity
for the naïve hard mining approach above, we can conclude that our method is
computationally more efficient...” [36]), while providing relevant triplets. Similarly to
[25], the main issue of this approach is that triplets are formed at the beginning of
each epoch. This strategy for triplet formation can be useful for small datasets, but
it does not guarantee a high quality of sampled triplets towards the end of the epoch
when training on datasets with a large number of images.
The main drawbacks of the approaches of [25], [108], [96] and [36] are: (1) a
high computational cost, and (2) poor scaling with the number of classes.

3 Motivation and contributions

In this chapter we present the outline of this thesis as well as the contributions
of each chapter. In Chapter 4, we compare and evaluate various solutions for the
person re-identification task proposed in the literature, and propose the best
combination to maximize performance. In Chapter 5, we propose a new strategy for
sampling hard negatives for efficient training. Finally, in Chapter 6, we propose a
new loss function for the explicit maximization of the area under the ROC curve.

3.1 Boundaries of the state of the art for person re-identification
In Chapter 4 we present an approach to the person re-identification problem that
combines a simple deep neural network with a curriculum learning training strat-
egy, and whose design choices were validated on several datasets. The result is
a simple yet powerful architecture that produces global image representations
that, when compared using a dot-product, outperforms state-of-the-art person
re-identification methods by large margins, including more sophisticated methods
that rely on attention models, extra annotations, or explicit alignment.
We identify a set of key practices to adopt, both for representing images effi-
ciently and for training such representations, when developing person re-ID models.
Many of these principles have been adopted in isolation in various related works.
However, we show that when these principles are applied jointly, they can
significantly improve performance. We evaluate different modeling and learning
choices that impact performance. A key conclusion is that curriculum learning is
critical for successfully training the image representation, and several of our
principles reflect this. Our method significantly improves over previously published
results on four standard benchmark datasets for person re-identification. For
instance, we show an absolute improvement of 8.1% mAP on the Market-1501
dataset compared with the state of the art in 2018, at the moment of publication. We
provide a qualitative analysis of the information captured by the visual embedding
produced by our architecture. Our analysis illustrates, in particular, the effectiveness
of the model in localizing image regions that are critical for re-ID without the need
for explicit attention or alignment mechanisms. We also show how individual
dimensions of the embedding selectively respond to localized semantic regions,
producing a high similarity between pairs of images of the same person.

3.2 Hard negative mining combined with existing losses
In Chapter 5 we present a computationally inexpensive and mini-batch size
independent online strategy for improved negative mining in large datasets, which
we named Bag of Negatives (BoN). The main advantages of BoN w.r.t. previous
methods are: (1) fewer training steps to converge, (2) better performance on
validation sets due to a better sampling of negatives, and (3) a negligible additional
computational cost w.r.t. the Siamese architecture training.
Our methodology does not require computing additional sample representations
nor their respective distances in order to select appropriate negatives. It can be
combined with any loss that requires a negative sample. Nonetheless, for simplicity,
we analyze the behaviour of BoN with triplet based losses, since they perform better
than the contrastive loss; analogous results are achieved with other losses, such as
the quadruplet loss. We also want to emphasize that this approach has been devised
with large datasets and computational efficiency in mind. Finally, the method has
only one relevant meta-parameter.

3.3 Loss for explicit maximization of the area under the ROC curve
In Chapter 6 we present the AUC loss, a new metric learning loss which explicitly
maximizes an underestimate of the area under the ROC curve at the mini-batch
level. We design an approximate, differentiable relaxation of the area under the ROC
curve by: (1) approximating the integral with a Riemann summation, which can be
computed efficiently while keeping the accuracy of the approximation high by using
a small step size ∆s; (2) approximating the Heaviside function with a sigmoidal-like
function, whose slope is the only relevant meta-parameter, which depends only on
the step size ∆s and can be numerically calculated. The AUC loss is simple, yet
effective and computationally inexpensive. Even though recall-precision and ROC
curves are not equivalent measures, it has been proven that a curve dominates in
ROC space if and only if it dominates in recall-precision space [17], which makes the
AUC loss
appropriate for both retrieval and recognition. We tested AUC loss on four publicly
available datasets, and showed that it achieved state-of-the-art performance for
both retrieval (which is measured by mAP and rank@N) and recognition (measured
by the area under the ROC curve).
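As background for Chapter 6: the AUC equals the probability that a random positive
pair scores higher than a random negative pair, so a generic differentiable relaxation
can be sketched by replacing the Heaviside step with a sigmoid. This is a simplified
illustration of the general idea only; the exact Riemann-sum formulation and the
slope calculation are given in Chapter 6, and the names and the slope value below
are ours:

    import torch

    def soft_auc_loss(pos_sims, neg_sims, slope=20.0):
        # pos_sims: similarities of positive pairs in the mini-batch
        # neg_sims: similarities of negative pairs
        # sigmoid(slope * (s+ - s-)) is a smooth surrogate for the
        # Heaviside step H(s+ - s-)
        diff = pos_sims[:, None] - neg_sims[None, :]
        auc = torch.sigmoid(slope * diff).mean()   # soft P(s+ > s-)
        return 1 - auc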

4 Good practices for person re-identification

4.1 Introduction
Person re-identification (re-ID) is the task of correctly identifying the images in a
database that contain the same person as a query image. It is highly relevant to
applications where the same person may have been captured by different cameras,
for example in surveillance camera networks or for content-based image or video
retrieval.
Re-ID has been heavily studied for more than two decades (please refer to [5] for
a review). Most works that address this problem have sought to improve either the
image representation, often by carefully hand-crafting its architecture, or the image
similarity metric. Following the great success of deep networks on a wide variety
of computer vision tasks, including image classification [37], and object detection
[80], a dominant paradigm in person re-ID has emerged, where methods use or
fine-tune successful deep architectures for the re-ID task [13, 57, 59, 95].
Person re-ID is challenging for several reasons. First, one typically assumes that
the individuals to be re-identified at testing time were never seen during the model’s
training phase. Second, the problem is large-scale in the sense that at testing time
one may need to re-identify thousands of individuals. An additional challenge is
that images of the same person have often been captured under different conditions
(including lighting, resolution, scale and perspective), by the different cameras. In
particular, the pose of the person may be vastly different between different views.
For example, one may need to match a frontal view of a person walking to a profile
view of the same person after they have mounted a bicycle (see example of such a
positive pair in the triplets illustrated in Figure 4.1). Lastly, most re-ID systems rely
on a pre-processing stage where a person detection algorithm is applied to images
in order to localize individuals. As such, they must be robust to detection errors
leading to truncated persons or poor alignment.
Recent works in the literature often introduce additional modules to their deep
networks to address the aforementioned challenges of scale and pose variations,
and detection errors. Some of these additional modules explicitly align body parts

29
Chapter 4. Good practices for person re-identification

between images [94, 126], for example by using pre-trained pose estimators or
human joint detectors. Others add attentional modules [79] or scale fusion [14].
Some use additional annotations such as attributes [95].
In this work, rather than focus on hand-crafting additional modules to address
the various challenges of re-ID, we adopt a different approach and focus instead
on designing an effective training procedure for deep image representations. In
particular, we draw inspiration from works on curriculum learning [6], which aim
to improve model convergence and performance by continually modulating the
difficulty of the task to be learned throughout the model’s training phase. Our
carefully designed learning approach only impacts training, which means that at
test time our approach is very efficient. Consequently, our approach results in a
compact but powerful architecture that produces global image representations
that, when compared using a dot-product, outperform state-of-the-art person re-
identification methods by large margins, including more sophisticated methods
that rely on extra annotations or explicit alignment.

4.1.1 Curriculum learning for re-ID


Curriculum strategies have been successfully applied to optimization problems
[6, 53], showing a positive impact on the speed and quality of the convergence. We
adopt this learning strategy in our approach via three complementary techniques.
First, we observe that standard pre-training strategies belong to the curriculum
strategy of ordering multiple tasks from the easiest to the most difficult [2]. In prac-
tice, we adopt a pre-training strategy in which we train our model to learn the task
of person ID classification (which requires the model to first recognize individuals
within a closed set of possible IDs) before training it for the more challenging task
of re-identifying persons unseen during training.
Second, we use a selection at the sample level. We feed our ranking loss following
a hard-negative mining strategy that samples triplets of increasing difficulty as
learning continues. Here, to evaluate the difficulty of a triplet we compare the
similarity between the query image in the triplet and the relevant image to the
similarity between the query image and the non-relevant image.
Third, we progressively increase the difficulty of the training images themselves.
Here, by difficulty we mean that the appearance of regions of the image is degraded.
Next, we describe our model architecture and how the above techniques are applied
during training.


4.2 Learning a global representation for re-ID


In this section we describe the design of our deep architecture and our strategy for
effectively training it for person re-ID.

4.2.1 Architecture design


The architecture of our image representation model resembles in most ways that
of standard deep image recognition models. However, it incorporates several im-
portant modifications that proved beneficial for image retrieval tasks [28, 78]. The
model contains a backbone convolutional network, pre-trained for image classifi-
cation, which is used to extract local activation features from input images of an
arbitrary size and aspect ratio. These local features are then max-pooled into a sin-
gle vector, fed to a fully-connected layer and `2 -normalized, producing a compact
vector whose dimension is independent of the image size. Figure 4.1 illustrates
these different components and identifies the design choices (#1 to #4) that we
evaluate in the experimental section (Section 4.3.2).
Different backbone convolutional neural networks, such as ResNet [37], ResNeXt
[119], Inception [102] and Densenet [42] can be used interchangeably in our archi-
tecture. In Section 4.3.2, we present results using several flavors of ResNet [37], and
show the influence of the number of convolutional layers on the accuracy of our
trained model.
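Schematically, the representation model described above can be sketched in a few
lines of PyTorch (a simplified illustration with our own naming, assuming a recent
torchvision; the actual implementation may differ):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ReIDNet(nn.Module):
        def __init__(self, dim=2048):
            super().__init__()
            backbone = models.resnet50(weights='IMAGENET1K_V1')
            # keep the convolutional trunk, drop the classification head
            self.conv = nn.Sequential(*list(backbone.children())[:-2])
            self.fc = nn.Linear(2048, dim)

        def forward(self, x):
            # x: (B, 3, H, W), arbitrary size and aspect ratio
            f = self.conv(x)                           # (B, 2048, h, w) local features
            f = f.amax(dim=(2, 3))                     # global max pooling
            f = self.fc(f)                             # fully-connected projection
            return nn.functional.normalize(f, dim=1)   # l2-normalization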

4.2.2 Architecture training


A key aspect of the previously described representation is that all the operations are
differentiable. Therefore, all the network weights (i.e. from both the convolutional
and the fully-connected layers) can be learned in an end-to-end manner.

Three-stream Siamese architecture. To train our representation, we use a
three-stream Siamese architecture in which the weights are shared between the
streams. This learning approach has been successfully used for person re-identification
[20, 40, 95] as well as for different retrieval tasks [28, 78]. Since the weights of the
convolutional layers and the fully-connected layer are independent of the size of the
input image, this Siamese architecture can process images of any size and aspect
ratio. The three-stream architecture takes image triplets as input, where each triplet
contains a query image I_q, a positive image I^+ (i.e. an image of the same person as in
the query image), and a negative image I^- (i.e. an image of a different person). Each
stream produces a compact representation for each image in the triplet, leading to


Figure 4.1 – Summary of the training approach. Image triplets are sampled and
fed to a three-stream Siamese architecture, trained with a ranking loss. Each stream
encompasses an image transformation, convolutional layers, a pooling step, a fully
connected layer, and an `2 -normalization. Weights of the model are shared across
streams. In red we illustrate the curriculum learning strategies: (1) pretraining for
classification (PFC), (2) hard triplet mining (HTM), (3) increasing image difficulty
(IID).

the descriptors q, d^+ and d^-, respectively. We then define the ranking triplet loss as

L(I_q, I^+, I^-) = \max(0, m + q^T d^- - q^T d^+),    (4.1)

where m is the margin. This loss ensures that the embedding of the positive image
I + is closer to the query image embedding I q than that of the negative image I − , by
at least a margin m.
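Since the descriptors are l2-normalized, Equation 4.1 compares dot-product
similarities rather than Euclidean distances; a short PyTorch sketch (our naming):

    import torch

    def ranking_triplet_loss(q, d_pos, d_neg, m=0.1):
        # q, d_pos, d_neg: (B, dim) l2-normalized descriptors (Equation 4.1)
        return torch.clamp(m + (q * d_neg).sum(1) - (q * d_pos).sum(1),
                           min=0).mean()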
We now discuss key practices for improved training of our model.

4.2.3 Applying curriculum learning principles


Our architecture is trained using the following strategies, illustrated in Figure 4.1.

Pretraining for Classification (PFC). First, we follow standard practice and use
networks pre-trained on ImageNet. Then, we perform the additional pre-training
step of fine-tuning the model on the training set of each re-ID dataset using a
classification loss. That is, we train our model for person identification, or ID
classification. The weights obtained for the convolutional layers are then used to
initialize the weights of the Siamese architecture described in the previous section.

Hard Triplet Mining (HTM). Mining hard triplets is crucial for learning. As al-
ready argued in [116], when applied naively, training with a triplet loss can lead to
underwhelming results. Here we follow the hard triplet mining strategy introduced
in [29]. First, we extract the features for a set of N randomly selected examples
using the current model and compute the loss of all possible triplets. Then, to select
triplets, we randomly select an image as a query and randomly pick a triplet for that
query from among the 25 triplets with the largest loss. To accelerate the process,
we only extract a new set of random examples after the model has been updated k
times with the desired batch size b. This is a simple and effective strategy which
yields good model convergence and final accuracy, although other hard triplet
mining strategies [116] could also be considered.
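A hedged NumPy sketch of this selection step for a single query (our own per-query
variant; the constant 25 follows the text above):

    import numpy as np

    def sample_hard_triplet(q, feats, labels, margin=0.1, top=25):
        # q: index of the query among the N pre-extracted examples
        sims = feats @ feats[q]                      # dot-product similarities
        pos = np.flatnonzero(labels == labels[q]); pos = pos[pos != q]
        neg = np.flatnonzero(labels != labels[q])
        # triplet loss for every (positive, negative) combination of this query
        losses = margin + sims[neg][None, :] - sims[pos][:, None]
        hardest = np.argsort(-losses.ravel())[:top]  # triplets with largest loss
        p, n = np.unravel_index(np.random.choice(hardest), losses.shape)
        return q, pos[p], neg[n]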

Increasing image difficulty (IID). Finally, to increase image difficulty (IID), we


adopt an image “cut-out” strategy, which consists of adding random noise to
random-sized regions of the image. We progressively increase the maximum size
of these regions during training, progressively producing more difficult examples.
This strategy improves the results because it serves two purposes: first it is a data
augmentation scheme that directly targets robustness to occlusion, and second it
allows for model regularization by acting as a “drop-out” mechanism at the image
level. As a result, this strategy avoids the over-fitting inherent to the small size of
the training sets and significantly improves the results. We also considered adding
standard augmentation strategies such as image flipping and cropping but found
no added improvement.
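A minimal sketch of progressive cut-out (the region sizes and the noise distribution
are our assumptions, not details given in the text):

    import numpy as np

    def cutout(img, max_frac):
        # img: (H, W, C) uint8; max_frac grows during training, so examples
        # become progressively more difficult
        h, w = img.shape[:2]
        ch = np.random.randint(1, max(2, int(h * max_frac) + 1))
        cw = np.random.randint(1, max(2, int(w * max_frac) + 1))
        y, x = np.random.randint(0, h - ch + 1), np.random.randint(0, w - cw + 1)
        # fill a random region with random noise
        img[y:y + ch, x:x + cw] = np.random.randint(0, 256, (ch, cw, img.shape[2]))
        return img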

4.3 Empirical evidence


4.3.1 Experimental details
Datasets. We consider four datasets for evaluation: Market-1501 [130], MARS [128],
DukeMTMC-reID [133] and Person Search [117]. More details about the datasets
can be found in Section 1.2.4.

Evaluation. We follow standard procedure for all datasets and report the mean
average precision (mAP) over all queries and the cumulative matching curve (CMC)
at rank-1 and rank-5 using the evaluation codes provided by the authors of the
datasets.


Training details. As mentioned in Section 4.2.1, for the convolutional layers of


our network we evaluate two different flavors of ResNet [37], ResNet-50 and ResNet-
101, and report results with both. For each, we start with the publicly available
pre-trained model on ImageNet, and fine-tune the weights of the convolutional
layers for person identification (i.e. classification) on the training set of the specific
re-ID dataset. To do this, we follow standard practice and extract random-sized
crops and then resize them to 224 × 224 pixels. We train using stochastic gradient
descent (SGD) with a momentum of 0.9, a weight decay of 5·10^{-5}, a batch size
of 128, and an initial learning rate of 10^{-2}, which we decrease to 10^{-4}. We use
the weights of this pre-trained network for the convolutional layers of our re-ID
architecture and we randomly initialize the fully-connected layer, whose output
we set to 2,048 dimensions. We then train the ranking network using our Siamese
architecture with input images of variable size, while fixing the largest side to 416
pixels, which we observed experimentally to be a good trade-off between efficiency
and performance. We use again SGD with a batch size of 64 and an initial learning
rate of 10−3 , which we decrease using a logarithmic rate that halves the learning rate
every 512 iterations. We observe in all our experiments that the model converges
after approximately 4,096 iterations. For hard triplet mining we set the number of
random examples to N = 5, 000 and the number of updates to k = 16. We set the
margin of the loss to m = 0.1. We use exactly the same training settings for all four
datasets.

4.3.2 Ablative study


In this section we first evaluate key design choices in our architecture. We evaluate
our approach and show how each curriculum learning-based strategy impacts the
final results, reported in Table 4.4, by removing each of them in turn. For this study
we use ResNet-101 as the backbone architecture.

Image transformation. We first focus on data augmentation (#2 in Figure 4.1). As


discussed in Section 4.2, we apply different transformations to the images at train-
ing time, namely flips, crops and cut-outs. Here we study how each transformation
impacts the final results, reported in Table 4.1. We observe that cut-out has a very
strong impact on the performance and renders the other two data augmentation
schemes superfluous. We believe that this is because cut-out makes our represen-
tation much more robust to occlusion, and also avoids over-fitting on such little
training data.
Second, we consider the impact of the input image size (#1). Images from the
Market dataset have a fixed size of 256×128, while images from Duke have a variable
size, with 256 × 128 pixels on average. In our experiments, we rescale images so
that the largest image dimension is either 256, 416, or 640 pixels, without distorting
the aspect ratio. We report results in Table 4.2 and observe that using a sufficiently
large resolution is key to achieving the best performance. Increasing the resolution
from 256 to 416 improves mAP by 3%, while increasing it further to 640 pixels shows
negligible improvement. We set the input size to 416 pixels for the rest of this paper.

Pooling. Table 4.3 (a) compares two pooling strategies (#4) over the feature map
produced by the convolutional layers. Since max pooling performs better than
average pooling on both datasets, we use it for the rest of this chapter.

Backbone architecture. Table 4.3 (b) compares different architectures for the
convolutional backbone of our network (#3). Results show that using ResNet-101
significantly improves the results compared with using ResNet-50 (about +5 mAP
for both datasets). The more memory-hungry ResNet-152 only marginally improves
the results.

Fine-tuning for classification. Table 4.3 (c) shows the importance of fine-tuning
the convolutional layers for the identity classification task before using the ranking
loss to adjust the weights of the whole network (#6). As discussed in Section 4.1.1,
training the model on tasks of increasing difficulty is highly beneficial.

Curriculum learning strategies. We observe that each curriculum strategy has


a significant impact on the performance, as shown in Table 4.4, ranging between
4-6% mAP for Market, and 2-3% for Duke. In particular, for Market, removing the
PFC step has a significant negative impact on the performance (4.1% of mAP),
showing that training the model on tasks of increasing difficulty is highly beneficial.
Removing our IID strategy decreases performance by 5.3%. We believe that IID has
a strong impact because it makes our representation more robust to occlusion and
also because it acts as a data augmentation strategy, thereby avoiding over-fitting
on such little training data. As mentioned previously, we also experimented with
training our model using standard augmentation strategies, such as random image
flipping and cropping, in addition to IID, but found no performance improvement,
either for Market or Duke. Removing HTM and instead choosing triplets at random
decreases the performance by 5.6% mAP.
Note that IID not only complements but also reinforces HTM. IID produces
images that are more difficult to re-identify by introducing random noise. These
images produce harder triplets, which boosts the effect of HTM. IID and HTM both
have an impact of more than 5% mAP (third and fourth rows versus the first row in
Table 4.4). We trained an additional model using neither IID nor HTM and found a


flip   crop   cut-out   Market   Duke
 -      -       -        75.9    69.6
 ✓      -       -        77.2    69.7
 -      ✓       -        76.8    69.4
 -      -       ✓        81.2    72.9
 ✓      ✓       ✓        81.2    72.9

Table 4.1 – Impact of different data augmentation strategies. We report mean


average precision (mAP) on Market and Duke.

Largest dimension Market Duke


256 pixels 78.2 69.2
416 pixels 81.2 72.9
640 pixels 81.2 73.1

Table 4.2 – Impact of the input image size. We report mean average precision
(mAP) on Market and Duke.

                                         Market   Duke
a) pooling strategy         average       80.1    71.4
                            max           81.2    72.9
b) backbone architecture    ResNet-50     76.3    67.6
                            ResNet-101    81.2    72.9
                            ResNet-152    81.4    74.0
c) pretraining for class.   no            77.1    71.1
                            yes           81.2    72.9

Table 4.3 – Top (a): influence of the pooling strategy. Middle (b): results for
different backbone architectures. Bottom (c): influence of pretraining the
network for classification before considering the triplet loss. We report mAP for
Market and Duke.

very large performance drop (-11%), confirming that the generation of more and
more difficult examples is highly beneficial when paired with the HTM strategy that
feeds the hardest triplets to the network.


PFC   IID   HTM   Market        Duke
 ✓     ✓     ✓    81.2          72.9
 -     ✓     ✓    77.1 (-4.1)   71.1 (-1.8)
 ✓     -     ✓    75.9 (-5.3)   69.6 (-3.3)
 ✓     ✓     -    75.6 (-5.6)   68.3 (-4.3)

Table 4.4 – Impact of different design choices. We report mean average precision
(mAP) on Market and Duke, using ResNet-101 as backbone architecture.

4.3.3 Comparison with the state of the art


Table 5.3 compares our approach to the state of the art. We report results using two
versions of our approach: one using ResNet-101 and one using ResNet-50, which
is less computationally expensive than either Inception V3 or ResNet-101. Our
ResNet-50-based method consistently outperforms all state-of-the-art methods by
large margins on all 4 re-ID datasets and all metrics. Our ResNet-101-based method
brings additional improvements across the board. We also report the performance of
our method with standard re-ranking1 and we again see large improvements with
respect to prior art that uses re-ranking, across all datasets and metrics.
Looking closely at the approaches that report results on these datasets, we first
note that our approach outperforms recent methods that also use variants of the
triplet loss and HTM [127]. As we show in this section, combining HTM with the
other strategies mentioned in Section 4.1.1 is crucial for effective training of our
image representation for Re-ID. It is also worth emphasizing that our approach
also outperforms recent works that propose complex models for multi-scale fusion
[14], localized attention [58, 127], or aligning images based on body parts [54, 87,
120, 126] using extra resources such as annotations or pre-trained detectors. As we
discuss in the next section, our model is able to discriminate between body regions
without such additional architectural modules.
We report results for the Person Search dataset in the last column of Table 5.3.
This dataset differs from traditional re-ID datasets in that the different views of
each person do not correspond to different cameras in a network. Nevertheless, our
approach performs quite well in this different scenario, achieving a large absolute
improvement over the previous best reported result [117], illustrating the generality
of our approach.

1 We expand both the query and the dataset by averaging the representation of the first 5 and 10
closest neighbors, respectively.


Figure 4.2 – For several queries from Market, we show the first 10 retrieved images
together with the mAP and the number of relevant images (in brackets) of that
query. Green (resp. red) outlines images that are relevant (resp. non-relevant) to
the query.

4.3.4 Qualitative analysis


In this section we perform a detailed analysis of our trained model’s performance
and induction biases.

Re-identification examples. In Figure 4.2, we show good results (left) and failure
cases (right) for several query images from the Market dataset. We see that our
method is able to correctly re-identify persons despite pose changes or strong scale
variations. We observe that failure cases are mostly due to confusions between two
people that are extremely difficult to differentiate even for a human annotator, or to
unusual settings (for instance the person holding a backpack in front of him as in


Figure 4.3 – Matching regions. For pairs of matching images, we show maps for the
top 5 dimensions that contribute most to the similarity. All these images are part of
the test set of Market-1501.

d.).

Localized responses and clothing landmark detection. In Section 4.2, we argued


that, using our proposed approach, we obtain an embedding that captures invari-
ance properties useful for re-ID. To qualitatively analyze this invariance, we use
Grad-Cam [89], a method for highlighting the discriminative regions that CNN-
based models activate to predict visual concepts. This is done by using the gradients
of these concepts flowing into the final convolutional layer. Similar to [30], given
two images, we select the 5 dimensions that contribute the most to the dot-product


similarity between their representations. Then, for each image, we propagate the
gradients of these 5 dimensions individually, and visualize their activations in the
last convolutional layer of our architecture. In Figure 4.3, we show several image
pairs and their respective activations for the top 5 dimensions.
We first note that each of these output dimensions is activated by fairly localized
image regions, and that the dimensions often reinforce one another, in that
image pairs are often activated by the same region. This suggests that the similarity
score is strongly influenced by localized image content. Interestingly, these local-
ized regions tend to contain body regions that can inform on the type of clothing
being worn. Examples in the figure include focus on the hem of a pair of shorts,
the collar of a shirt, and the edge of a sleeve. Therefore, rather than focusing on
aligning human body joints, the model appears to make decisions based on at-
tributes of clothing such as the length of a pair of pants or of a shirt’s sleeves. This
type of information has been leveraged explicitly for retrieval using the idea of
“fashion landmarks”, as described in [63]. Finally, we observe that some of the paired
responses go beyond appearance similarity and respond to each other at a more
abstract and semantic level. For instance, in the top right pair the strong response
of the first dimension to the bag in the first image seems to pair with the response
to the strap of the bag in the second image, the bag itself being occluded.

Implicit attention. We now qualitatively examine which parts of the images are
highly influential, independently of the images they are matched with. To do so,
given an image and its embedding, we select the first 50 dimensions with the
strongest activations. We then propagate and accumulate the gradients of these
dimensions, again using Grad-Cam [89], and visualize their activations in the last
convolutional layer in our architecture. As a result, we obtain a visualization that
highlights parts of the images that, a priori, will have the most impact on the final
results. This can be seen as a visualization of the implicit attention mechanism that
is at play in our learned embedding.
We show such implicit attention masks in Figure 4.4 across several images of the
same person, for three different persons. We first observe that the model attends
to regions known to drive attention in human vision, such as high-resolution text
(e.g. in rows 1 and 2). We also note that our model shows properties of contextual
attention, particularly when image regions become occluded. For example, when
the man in the second row faces the camera, text on his t-shirt and the hem of his
pants are attended to. However, when his back or side is to the camera, the model
focuses more intently on the straps of his backpack.


Figure 4.4 – We highlight regions that correspond to the most highly-activated
dimensions of the final descriptor. They focus on unique attributes, such as
backpacks, bags, or shoes.

Figure 4.5 – Performance comparison (mAP) in the presence of a large number of
distractors. [Plot of mAP [%] against the number of distractors [K] for Ours
[ResNet101], Ours [ResNet50] and Verif-Identif [ResNet50] [45].]

4.3.5 Re-ID in the presence of noise


To test the robustness of our model, we evaluate it in the presence of noise using
Market+500K [130], an extension of the Market dataset that contains an additional

set of 500K distractors. To generate these distractors, the authors first collected
ground-truth bounding boxes for persons in the images. They then computed the
IoU between each predicted bounding box and ground-truth bounding box for a
given image. A detection was labeled a distractor if its IoU with all ground-truth
annotations was lower than 20%.
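As an illustration, this labeling criterion can be written as follows (the (x1, y1, x2, y2) box format and helper names are ours, not from [130]):

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def is_distractor(detection, ground_truths, threshold=0.2):
    """A detection is a distractor if its IoU with every GT box is below 20%."""
    return all(iou(detection, gt) < threshold for gt in ground_truths)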
We evaluate our ResNet-50- and ResNet-101-based models, trained on Market,
on this expanded dataset, while increasing the number of distractors from 0 to 500K.
We selected distractors by randomly choosing them from the distractor set and
adding them to the gallery set. We compare our models with the previous state-
of-the-art results reported for this expanded dataset [131]. Both versions of our
model significantly outperform the state of the art, as presented in Figure 4.5. Note
that our ResNet-50 model, with 500K added distractors, still outperforms [131]’s
performance with 0 added distractors.

4.4 Conclusions
In this chapter we have proposed an approach to training person re-identification
models based on curriculum learning principles. We have shown that, by carefully
applying these principles to the training of a Siamese architecture with a triplet
loss, a compact architecture without additional hand-engineered modules can
outperform state-of-the-art methods with complex architectures, on 4 benchmark
datasets. Because our contribution only impacts the training phase, at test time our
approach remains simple and efficient, a key advantage for most applications.
Additionally, we found qualitative evidence that the different dimensions of our
representation specialize in a way that allows them to have strong, localized and
semantically discriminative responses in the presence of a positive image pair. This
suggests that our approach is able, to some extent, to implicitly capture what some
previous approaches have explicitly included in their visual representations.

5 Hard negative mining

5.1 Introduction
In this chapter, we propose an online strategy for mining samples that contributes
to a more efficient training of Siamese architectures, while providing better valida-
tion scores on several datasets. We tested our method on large datasets so that the
retrieval and re-identification problems cannot be easily solved using a classifica-
tion loss. We use a large person re-ID dataset by merging publicly available datasets
(similarly to [46]) and we show the results on the publicly available retrieval datasets
Stanford Online Products [73] and DeepFashion [63].

5.2 Motivation
The triplet loss (see Equation (5.1)) is based on the construction of triplets $i \in T$, each formed by an anchor sample $x_i^a$, a positive sample $x_i^p$ (belonging to the same class as the anchor) and a negative sample $x_i^n$. The samples are mapped into an embedding by a given function $f(\cdot)$, usually a deep convolutional network, whose parameters are learned by minimizing the loss $L$:

L = \frac{1}{n_t} \sum_{i \in T} \max\left(0, \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha\right)    (5.1)

The goal of the triplet loss is to ensure that the anchor-negative pairs are far from each other by a margin $\alpha$ with respect to the anchor-positive pair distance. It is well known that the most challenging part of using the triplet loss to train a metric learning system is generating triplets that produce a non-zero loss [88]. This is hard, since the number of all possible triplets in the dataset is proportional to the cube of the total number of images $N$ in the dataset, $|T| \sim N^3$, and the more the system trains, the less probable it is to find a negative for a given anchor-positive pair that provides a non-zero loss [88].
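For reference, a minimal PyTorch sketch of Equation (5.1) over a mini-batch of pre-formed triplets (variable names are ours):

import torch

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Equation (5.1) for embeddings f(x) of shape (n_t, D)."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)    # squared L2 anchor-positive
    d_an = (anchor - negative).pow(2).sum(dim=1)    # squared L2 anchor-negative
    return torch.relu(d_ap - d_an + margin).mean()  # average of max(0, ...)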
Let n be the average number of images per class, m the mini-batch size, k the number of images of each class in the mini-batch, and l the number of steps per epoch. For the sake of clarity, we introduce the notation $\hat{n}$ for the number of negative samples that produce a non-zero loss if used in conjunction with the triplet loss and an anchor-positive pair. The more we train the Siamese network, the smaller $\hat{n}$ becomes.
We propose a systematic cost analysis of each sampling method in terms of $n_e$, the extra number of forward passes to be computed per epoch, and $n_d$, the extra number of distances to be computed in order to select a set of negatives per mini-batch over an entire epoch. Additionally, we report the number of triplets per mini-batch $n_t$. The analysis is summarised in Table 5.1.

Table 5.1 – Comparison of sampling strategies.

Strategy                  n_e        n_d               n_t    additional computation
Random                    0          0                 m/3    -
Semi-hard                 0          (2b^2 - 2b)l      m/3    -
Batch hard                0          (9b^2 - 2b)l      m      -
Exhaustive                l(N - b)   lN^2              m/3    -
Hierarchical Tree [25]    N          N^2/2             N.A.   pre-training
Smart mining [36]         N          (N/i)^2           N.A.   extra global loss
100k IDs [108]            0          0                 N.A.   classifier, all feature extract, k-means
Stoch. class-based [96]   k(K - 1)   k(N/n + (K - 1))  N.A.   classifier, class signatures
Spectral Hashing [113]    N          0                 N.A.   PCA
Bag of Negatives          0          0                 N.A.   autoencoder
The “quality” of the retrieved negatives is also relevant, as pointed out in [116]:
negative samples have to be distributed such that the anchor-negative distance is al-
most uniformly distributed. More on this topic will be discussed in the experimental
section.
Sampling the negatives randomly from the whole dataset has complexity $O(1)$ but does not provide relevant negative samples except at the beginning of the training, since $p_{\hat{n}} = \hat{n}/(N - n) \simeq \hat{n}/N$. From now on we will omit n from the formula since it is negligible w.r.t. N.
Semi hard loss [88] employs a negative sampling strategy that has an increased cost, due to the fact that the additional computed distances scale polynomially with the mini-batch size. The improvement in $p_{\hat{n}}$ with respect to random sampling depends linearly on the number of triplets b. For this reason, the authors use huge mini-batches, in the order of 1800 samples. $p_{\hat{n}}$ is thus increased to $2b\hat{n}/N$, at the cost of large mini-batches and additional computation.
Batch hard loss [40] is an improved version of the semi hard loss where, thanks to a more controlled mini-batch creation and additional distance computations, the method exhibits $p_{\hat{n}} = m\hat{n}/N$. This strategy offers a 50% improvement in $p_{\hat{n}}$ w.r.t. the semi hard approach, but still provides a probability that depends on the mini-batch size. The additional cost in the distance computations is mitigated by a 3 times factor in the number of computed triplets.
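For concreteness, a compact PyTorch-style rendering of the batch hard selection of [40] (our illustrative sketch, not the original implementation):

import torch

def batch_hard_loss(emb, labels, margin=0.3):
    """For each anchor: hardest (furthest) positive, hardest (closest) negative."""
    dist = torch.cdist(emb, emb)                        # (m, m) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # class-equality mask
    pos = dist.clone(); pos[~same] = float("-inf")      # keep positives only
    neg = dist.clone(); neg[same] = float("inf")        # keep negatives only
    hardest_pos = pos.max(dim=1).values                 # note: self-distance is 0
    hardest_neg = neg.min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()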
An offline exhaustive search over the dataset provides $p_{\hat{n}} = \min(3\hat{n}/m, 1)$. This is, of course, not viable for large datasets. Nonetheless, for relatively small datasets, and with a proper sampling strategy over the $m(N - m)$ distances, exhaustive search provides excellent negative samples [116].
Hierarchical Tree sampling [25], 100k IDs [108], Smart Mining [36] and Stochas-
tic class-based hard example mining [96] are methods for sampling candidates prior
to mini-batch creation. Those methods can be combined with online hard mining
strategies (such as semi-hard and batch hard) and further increase the probability
of sampling relevant negative samples.
In [25] the authors propose sampling identities based on inter-class distances. The main drawback of this method is the high computational cost of creating the inter-class distance matrix. This matrix should be updated once per epoch, and it requires forward passes over the whole dataset ($O(N)$) and calculating all-vs-all sample distances ($O(N^2)$).
The method proposed in [108] for batch generation is based on hashing. This approach is faster than Hierarchical Tree, as it does not require any additional distance calculations nor extra embedding extraction. Its drawback is the complexity of generating the hash table, as it requires training a classifier on a subset of the dataset, extracting the features of all images from the train set and executing k-means clustering.
The method called Smart Mining [36] uses samples from an approximate nearest neighbourhood to create potentially relevant triplets. However, at the beginning of each epoch one full forward pass over the whole dataset is performed. In addition to this, in each training step $(N/i)^2$ distances are computed, where i is the number of neighbourhoods.
Stochastic class-based hard example mining [96] is a method that uses class signatures when creating triplets. This approach requires $k(K - 1)$ extra forward passes in each training step and $kN/n$ distances, where K is the number of classes in the mini-batch.
This brief study shows that the efficiency of relevant negative mining is a crucial issue.


Figure 5.1 – BoN strategy. Triplets with good quality negatives are formed using the information from the hash table. The resulting embedding is used to learn both the deep model and a linear projection that, in turn, provides a low-dimensional embedding. Its quantization provides (possibly) new entry positions in the hash table for the input images. The hash table and the linear autoencoder are updated at each training step with minimal overhead.

Also, increasing the probability of picking a relevant negative is key to the
improved performance from semi hard to batch hard strategy. Scalability to very
large datasets with a large number of classes is a necessity within the training of
Siamese architectures.
In this chapter we propose a novel method for batch creation, inspired by
Spectral Hashing [113]. In contrast to Spectral Hashing, which requires additional
forward passes of all images from the dataset, our method updates the hash table
online, with negligible computational cost.

5.3 Bag of Negatives


A negative sample whose representation is close to the anchor sample provides a
triplet that is more likely to produce non-zero loss. The main goal of BoN is pro-
viding these relevant negative samples using an algorithm that is computationally
inexpensive.
BoN is inspired by the Spectral Hashing method [113]. Nonetheless, we had


to introduce several changes in order to efficiently adapt it to the negative mining problem during training.
Spectral Hashing is a nearest neighbour search algorithm that is shown to per-
form better than Product Quantization while being simpler to implement and more
efficient at learning the hash function [113]. In terms of performance, it is inferior
with respect to methods that address the embedding compression and the quan-
tization as a whole problem, e.g. [11, 26]. However, we have to consider that the
embedding is changing during the training, thus a simpler but flexible approach is
preferred over methods providing better results at a greater computational cost.
The main approach of the Spectral Hashing method is to (1) learn a linear projection from the embedding space (of size e) to a lower dimensional space (of size s ≪ e) by means of a standard PCA, (2) apply the projection to a sample, (3) perform a 1-bit quantization over every dimension by thresholding at 0, and (4) group the s bits into an integer codeword (line 4 in Algorithm 2). The codeword represents the entry of a hash table. The underlying assumption is that samples falling into the same bin are neighbours in the high dimensional embedding. Of course, this assumption is over-optimistic, and the deviations from the optimum are mainly due to the following facts: (1) since s ≪ e we lose some information about the topology of the high dimensional embedding, and (2) the quantization is harsh and there is no actual control on the quantization error during the process. Nonetheless, experimental validation shows that the Spectral Hashing method indeed performs well in retrieval tasks [113].
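A sketch of steps (1)-(4) (the projection W and the per-dimension thresholds are placeholder names; for plain Spectral Hashing the thresholds are 0):

import numpy as np

def codeword(embedding, W, thresholds):
    """Project to s dimensions, binarize each, pack the s bits into an integer."""
    h = embedding @ W                               # (s,) low-dimensional code
    bits = (h > thresholds).astype(np.int64)        # 1-bit quantization
    return int(bits @ (1 << np.arange(len(bits))))  # integer hash-table entry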
However, the direct application of the Spectral Hashing (or any other nearest
neighbour algorithm) method to the problem of retrieving negative samples is
not straightforward because the embedding is dynamically changing during the
training. One can compute the whole embedding every certain number of steps,
plus computing the PCA, and the hash table, but this naïve strategy does not scale
well for large datasets. Consequently, we propose three main modifications to the
Spectral Hashing approach in order to have an online algorithm that mimics its
performance (see Figure 5.1 and Algorithm 1):

1. The PCA is substituted by a linear auto-encoder paired with an L2 reconstruction loss (Algorithm 1, line 8).

2. The quantization threshold is dynamically estimated, per dimension, instead of being fixed to 0 (Algorithm 2, line 9).

3. The hash table is dynamically updated (Algorithm 1, line 9).


Linear auto-encoder
Since online PCA estimation is in general computationally inefficient and potentially numerically unstable [10], we train a linear autoencoder (AE) paired with an L2 reconstruction loss, as in Equations (5.2), where h(x) is the projected sub-space of dimensionality s. The reconstruction loss should not modify the embedding space, therefore the gradients generated by the $L_{AE}$ loss are back-propagated only through the fully connected layers of the autoencoder.

h(x) = W_1 f(x) + b_1
\hat{f}(x) = W_2 h(x) + b_2    (5.2)
L_{AE} = \| f(x) - \hat{f}(x) \|_2^2

This approach can approximate a PCA computation, but it also allows non-orthogonal representations. The AE continuously models the projection that provides the codeword to the hash table update procedure. The added cost of learning such an AE is negligible w.r.t. the Siamese network training. The choice of s is related to two factors: (1) the smaller s, the more difficult it is to reconstruct (in the L2 sense) the original sized embedding, and (2) the bigger s, the larger the number of bins obtained after the binarization, more precisely $B = 2^s$. A detailed analysis of the behaviour of BoN as a function of s is presented in the experimental section.
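In PyTorch terms, the AE might be sketched as follows (a minimal illustration; the detach() call mirrors the requirement that reconstruction gradients never reach the embedding network):

import torch
import torch.nn as nn

class LinearAE(nn.Module):
    """Linear encoder/decoder trained with the L2 reconstruction loss of Eq. (5.2)."""
    def __init__(self, e=2048, s=18):
        super().__init__()
        self.enc = nn.Linear(e, s)   # h(x)  = W1 f(x) + b1
        self.dec = nn.Linear(s, e)   # f^(x) = W2 h(x) + b2

    def forward(self, f_x):
        f_x = f_x.detach()           # do not back-propagate into the embedding
        h = self.enc(f_x)
        loss = (self.dec(h) - f_x).pow(2).sum(dim=1).mean()
        return h, loss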

Dynamic quantization thresholds

Since the lower dimensional space h(x) changes dynamically during the training and the AE does not guarantee a zero-mean hidden representation, the correct thresholds µ for its binarization have to be estimated as a running mean: $\mu \leftarrow \beta \mu + (1 - \beta)\, h(x)$, where β controls how quickly the running average forgets old samples (Algorithm 2, line 9). In our experiments we noticed that varying β ∈ [0.95, 0.999] does not influence the results, so it is not a critical parameter to tune.
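As a one-line sketch (here generalized to a mini-batch of latent codes h, which is our assumption; the thesis formula is written for a single sample):

def update_thresholds(mu, h, beta=0.99):
    """Running-mean estimate of the per-dimension binarization thresholds."""
    return beta * mu + (1.0 - beta) * h.mean(dim=0)   # h: (batch, s)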

Hash table dynamic update


We maintain a hash table L (hash_table in Algorithm 2) that, for each entry indexed by an integer j, contains a collection of images identified by pairs (v, I(v)), where v is an integer that uniquely represents an image in the dataset (image_idx in Algorithm 2), and I(v) is an integer that uniquely represents the class of the given image (image_label in Algorithm 2). Also, we keep track of the latest hash entry for every image in the dataset, using integer values, such that C[v] = j (line 8 in Algorithm 2). In this way, updating the hash table has a very limited computational cost. The slowest part of the update procedure is removing the tuple (v, I(v)) from the bin to which it had been assigned (line 6 in Algorithm 2), which has a cost of $O(N/2^s)$. In terms of memory cost, assuming that both the class identifiers I(v) and the sample identifiers v can be represented with 4-byte integers, we need only a total of 4(N + 2N) bytes to store both the hash table L and the hash entries C. As an example, even a very large dataset with 10M images requires only 115 Megabytes for the hash table. The update procedure is described in Algorithm 2.

Algorithm 1 Main algorithm

1: Input: image_names, image_labels, s, batch_size, n_ids
2: hash_table, C, µ ← hash_init(image_names, image_labels, s)
3: net ← ImageNet_init
4: autoencoder ← random_init
5: while not converged do
6:   images, labels ← create_mini_batch(image_labels, hash_table, batch_size, n_ids)
7:   descriptors ← net(images)
8:   hash_vectors ← autoencoder(descriptors)
9:   hash_table, C, µ ← hash_update(hash_table, C, images, labels, hash_vectors, µ)
10:  backpropagate
11: end while

Algorithm 2 Hash Table Update

1: function HASH_UPDATE(hash_table, C, image_indexes, image_labels, hash_vectors, µ)
2:   for (image_idx, image_label, hash_vector) in (image_indexes, image_labels, hash_vectors) do
3:     old_hash ← C[image_idx]
4:     hash ← $\sum_{j=1}^{s} 2^{j-1} \cdot \mathrm{Heaviside}(hash\_vector[j] - \mu[j])$
5:     if hash ≠ old_hash then
6:       remove (image_idx, image_label) from hash_table[old_hash]
7:       add (image_idx, image_label) to hash_table[hash]
8:       C[image_idx] ← hash
9:       µ ← βµ + (1 − β) hash_vector
10:    end if
11:  end for
12:  return hash_table, C, µ
13: end function
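A minimal Python rendering of this update (a sketch with our own helper names; the bins form a dict from integer codeword to a list of (image_idx, image_label) pairs):

import numpy as np

def hash_update(hash_table, C, idxs, labels, hash_vectors, mu, beta=0.99):
    for idx, label, h in zip(idxs, labels, hash_vectors):
        old = C.get(idx)
        bits = (h > mu).astype(np.int64)
        code = int(bits @ (1 << np.arange(len(bits))))   # pack s bits into an int
        if code != old:
            if old is not None:
                hash_table[old].remove((idx, label))     # O(N / 2**s) on average
            hash_table.setdefault(code, []).append((idx, label))
            C[idx] = code
            mu = beta * mu + (1.0 - beta) * h            # running-mean thresholds
    return hash_table, C, mu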


Algorithm 3 Mini batch creation

1: function CREATE_MINI_BATCH(image_labels, hash_table, batch_size, n_classes)
2:   batch_labels ← empty list
3:   while length(batch_labels) < n_classes do
4:     anchor_idx ← random((1 to length(image_labels)), size=1)
5:     anchor_hash ← C[anchor_idx]
6:     if anchor_hash > 0 then
7:       classes ← classes from hash_table[anchor_hash]
8:       new_classes ← random(classes, size = min(length(classes), n_classes − length(batch_labels)))
9:     else
10:      new_classes ← random((1 to max(image_labels)), size = n_classes − length(batch_labels))
11:    end if
12:    add new_classes to batch_labels
13:  end while
14:  batch_idxs ← empty list
15:  for class in batch_labels do
16:    class_idxs ← indexes of class in image_labels
17:    add random(class_idxs, size = batch_size/n_classes) to batch_idxs
18:  end for
19:  return batch_idxs, batch_labels
20: end function

5.3.1 Bag of Negatives and pairwise losses


Bag of Negatives and triplet loss
The simplest way of using BoN is to create mini-batches by randomly sampling b
anchor-positive image pairs. For each pair, we sample a negative image randomly
among the images that belong to the same bin as the anchor. In case the anchor
belongs to a bin in which there are no other images from a different class, we sample
the negative image randomly from the whole dataset.

Bag of Negatives with batch hard loss


As BoN is able to provide relevant images for batch sampling, it can be easily
combined with a loss such as batch hard. It is important to create batches of size
m, which contain k images from l classes, for batch hard (see Algorithm 3). We set k = 2 for all the experiments, as we are focusing on showing the importance of
good negative sampling, and we want to avoid the results being influenced by hard
positive sampling. We randomly sample l classes that belong to the same bin as the
first, random sample (lines 4-9 in Algorithm 3). If the bin has only one element, we
sample the rest of the images needed for the batch randomly (line 10 in Algorithm
3). In case the number of classes in the bin is greater than one and lower than l ,
we append the missing classes from another bin, which is randomly chosen. The
process is repeated until sampling l classes. Once we have a set of l classes, we
choose k images randomly from the images that belong to that class (lines 16 and
17 in Algorithm 3).

5.4 Empirical evidence


5.4.1 Experimental details
Datasets
Person re-identification large dataset. We merged eleven publicly available datasets for person re-identification, including CUHK01 [56], CUHK02 [55], 3DPeS [3], VIPeR [31], airport [47], MSMT17 [112], Market-1501 [130] and DukeMTMC-reID [82]. The merged dataset has 10.5k IDs and 178k images. We used both the training and testing partitions of all the datasets except for Market-1501 and DukeMTMC-reID, and we did not use the images that are labeled as distractors or junk.
In addition to the person re-identification large dataset, we use Stanford Online Products [73] and DeepFashion - In-Shop Clothes Retrieval [63] for evaluation. For more details about these two datasets, see Section 1.2.4.

Training details
We use Inception-V3 as the backbone of our model. In particular, we take the convolutional layers and initialize them with the weights of a standard network pre-trained on ImageNet. The final descriptors are further globally max-pooled and ℓ2-normalized; the descriptor size is 2,048. The model is trained using the Adam optimizer, with an initial learning rate of $10^{-4}$ and a learning rate decay of 0.9 every 50k iterations. The images for person re-ID are resized to 192×384 pixels. At test time, we extract representations and compare them using the dot product.
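A possible PyTorch setup matching these choices (a sketch under the stated assumptions, using a recent torchvision; the exact layer cut of the thesis model is not specified):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import inception_v3

class Embedder(nn.Module):
    """Inception-V3 conv trunk + global max pooling + l2 norm (2048-d)."""
    def __init__(self):
        super().__init__()
        net = inception_v3(weights="IMAGENET1K_V1")  # ImageNet initialization
        drop = ("AuxLogits", "avgpool", "dropout", "fc")
        self.trunk = nn.Sequential(
            *[m for name, m in net.named_children() if name not in drop])

    def forward(self, x):
        x = self.trunk(x)                            # (B, 2048, H', W')
        x = F.adaptive_max_pool2d(x, 1).flatten(1)   # global max pooling
        return F.normalize(x, p=2, dim=1)            # l2-normalized descriptor

model = Embedder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.9)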

5.4.2 Analysis of Bag of Negatives


In this section we answer the following relevant questions: (1) How does BoN
compare to the exhaustive search? (2) How does BoN behave in terms of non-zero loss triplets? (3) How does BoN perform when changing the subspace dimension s? (4) How much overhead does it add to the training? and (5) How stable is the hash table during training? We provide all the analyses on the person re-identification dataset, as it is challenging and appropriate for testing all the algorithms mentioned in Section 5.2.


Figure 5.2 – Negative distances calculated in the whole dataset (x-axis) vs. negative distances calculated inside the bins (y-axis) for 100 anchors: minimal distance (blue) and average distance (red).


BoN vs exhaustive search


In the motivation section we explained that an exhaustive search in the embedding
provides always a non-zero loss negative sample, if existing. Nonetheless, its cost is
prohibitive when dealing with large datasets. Figure 5.2 shows the distance between
100 anchor samples and a set of negative samples at 200k steps of training for
BoN-random with s = 18. The abscissa shows the average distance (red) and the
minimal distance (blue) when applying an exhaustive search over the whole training
partition of the large person re-ID dataset. For the very same anchor sample, the
ordinate shows the distance where the search is limited to the samples belonging in
the same hash bin as the anchor.
An ideal hashing method would exhibit blue dots on the diagonal of the plot,
meaning that every hardest negative sample (closest to the anchor) lies in the corre-
sponding anchor bin. As it can be seen, the blue dots are not deviating excessively
from the diagonal, which means that BoN is able to retrieve negatives of good qual-
ity. Again, for the purpose of training a Siamese architecture with triplet loss, using
always the hardest negative can be disruptive and particularly dangerous due to


mislabeled samples [116]; indeed, the authors of [96] introduce stochasticity precisely to avoid this problem.
The analysis of the red dots is also of interest: as expected, on average, sampling
from the reference bin provides samples that are closer to the anchor sample w.r.t.
random sampling from the whole dataset.
It is worth mentioning that the difference between the minimal and average distances in the full embedding space is 0.7. This means that the data distribution is sparse and that random sampling can lead to choosing negative samples which are far from being hard. On the other hand, the distribution of the distances inside a bin has a smaller standard deviation, as the difference between the minimal and average distances is 0.2. This allows us to use random sampling inside the bins without decreasing the quality of the chosen samples.

Non-zero loss triplets analysis


Figure 5.3 shows the percentage of non-zero loss triplets (measured at train time) as
a function of the training mean Average Precision (mAP) for the random sampling,
BoN-Random, the semi hard loss, the batch hard loss and BoN-batch hard, Spectral
Hashing-batch hard, 100k IDs-batch hard and SPL-batch hard loss on the person
re-ID dataset. For all methods the margin is set to α = 0.3, the mini-batch size
is m = 48, and the leftmost point on the plot for every method is obtained at 10k
steps of training. As expected, the percentage of non-zero triplets for the random
sampling (blue line) starts at only 20% and decreases as the mAP increases; at
mAP=77.4 the non-zero triplets are less than 5% and the training is virtually unable
to learn anything else.
BoN-random (red line) significantly increases the number of non-zero loss
triplets w.r.t. the pure random sampling, without modifying the loss nor the way the
anchor-positive pairs are formed. The improvement is solely due to the improved
sampling of negatives.
BoN-random exhibits a behaviour similar to semi hard (green line) and batch hard (orange line) while providing, in general, more non-zero loss triplets. However, the nature of the improvement provided by our method and by batch hard is very different: BoN searches for negatives in a local region of the embedding space, while batch hard forms the triplets by explicitly seeking non-zero triplets within a mini-batch, but sampling from the whole embedding.
These two complementary strategies can be easily combined, as seen in Section 5.3.1. The combination inherits the benefits of both approaches: at 10k steps batch hard and BoN-batch hard have a similar mAP ≈ 83%, but BoN-batch hard (black line) has about 2 times more non-zero loss triplets, and it keeps more non-zero loss triplets systematically until the end of the training.

Figure 5.3 – Percentage of non-zero loss triplets per mini-batch as a function of mAP on the training set, for Random, BoN, semi hard, batch hard, 100k IDs + BH, SPL + BH, BoN + BH and SH + BH.

As will be seen in the comparison, this behaviour not only speeds up the training, but also provides better triplets, which leads to a significant improvement of the performance on the validation sets.
We measure the limitations of BoN by comparing it to the “gold standard”, Spectral Hashing. The combination of Spectral Hashing and batch hard requires
the following steps: (1) feature extraction on the whole training set, (2) reduction
of the feature size by PCA to the size s (s = 18) and (3) hash table construction; we
repeat this procedure every 5k steps. Given this hash table, batches are created the
same way as explained in 5.3.1. BoN-batch hard shows very similar behavior to the
Spectral Hashing - batch hard (magenta line): they both train quickly, obtaining
almost the same mAP after 10k steps, with high percentage of non-zero loss triplets.
During the whole training Spectral Hashing - batch hard is providing more non-zero
loss triplets. This is expected, as the hash table is updated at the same moment for
all the samples. However, this configuration does not scale for datasets with large
number of images.
We analyze the behavior of batch creation proposed in [108], using 10 clusters
as suggested by the authors. We use these clusters for creating the hash table and
we do not update it during the training. In addition to longer training time, this
method lacks flexibility in updating the hash table. In other words, samples that
are considered relevant negatives to an identity are set at the beginning of the
training and are static w.r.t. the training process. Moreover, a possible sub-optimal
clustering is going to be seriously detrimental to the training. At the beginning of the training, this method obtains a lower mAP on the train set (gray line) while having more non-zero loss triplets than batch hard. The number of relevant triplets decreases towards the end of the training, and both the accuracy and the percentage of non-zero loss triplets are inferior to those of BoN-batch hard.
Even though the Semantic-Preserving Loss (SPL) [18] has not been designed as a hashing method for hard negative mining, we consider it relevant to our work, and thus we adapted it to this purpose. We use the SPL loss (Equation 5.3) as a replacement for the reconstruction loss in BoN. In this case, the encoder is a fully connected layer with a tanh activation function that maps image descriptors $d_i$ into corresponding hash entries $h(x_i)$. Following the rationale proposed in [18], the similarity matrix S is a non-linear function of the dot product between the images' descriptors within a mini-batch (Equation 5.4): a pair of similar images (with dot product above a certain threshold, set to 0.6) is mapped to 1, otherwise it is mapped to −1. The minimization of Equation (5.3) should encourage the mapping of similar images to the same hash entry, thus providing useful negative samples as efficiently as BoN.

L_{SP} = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( \frac{1}{s}\, h(x_i)\, h(x_j) - S_{ij} \right)^2    (5.3)

S_{ij} = \begin{cases} 1, & d_i \cdot d_j > threshold \\ -1, & \text{otherwise} \end{cases}    (5.4)

BoN-BH shows superior results: it trains faster with more non-zero loss triplets (see Table 5.3). We believe that the advantage of BoN over SPL resides in the fact that the BoN AE loss does not depend on the relationship between mini-batch samples, thus providing a more stable hash table. Also, the quality of the SPL mapping depends on the quality of the mini-batch sampling, which in turn depends on the hash table itself; such a dependence can introduce a non-negligible instability in the training. Finally, in this context, SPL could be improved by adding an extra dedicated network that provides the mapping between the input image and the hash entry, instead of using the descriptor as an approximation of the input image; such a strategy would significantly increase the computational cost of the approach, and it is currently out of the scope of this chapter.
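For completeness, our adaptation can be sketched as follows (descriptors assumed ℓ2-normalized and hashes tanh-activated, as described above; names are ours):

import torch

def spl_loss(descriptors, hashes, s, threshold=0.6):
    """Equations (5.3)-(5.4): descriptors (m, D), hashes h(x) of shape (m, s)."""
    sim = descriptors @ descriptors.t()              # dot products d_i . d_j
    S = torch.where(sim > threshold,                 # target similarity matrix
                    torch.ones_like(sim), -torch.ones_like(sim))
    inner = hashes @ hashes.t() / s                  # (1/s) h(x_i) h(x_j)
    return ((inner - S) ** 2).mean()                 # mean over the m^2 pairs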

BoN-Random behavior varying s


Since s is the only relevant meta-parameter of BoN, we find it extremely important to discuss its influence on BoN performance. It is interesting to note that BoN-Random degenerates to pure random sampling for s = 0. Figure 5.4 shows the mAP results on the Market and Duke validation datasets for different values of s at 200k steps. As can be seen, the performance increases with s and reaches a maximum at s = 18.

Figure 5.4 – Validation mAP as a function of s, on Market-1501 and DukeMTMC-reID.

Nonetheless, with s = 22, BoN reaches its breaking point: the average number of samples per bin (for non-empty bins) becomes very low, such that BoN-Random starts to perform negative sampling over the whole dataset too frequently.

Training time
Table 5.2 presents the time needed to train a model for 100k steps, and the total time needed for convergence, for the batch hard and BoN-batch hard methods, as BoN provides the best results when combined with batch hard. Both experiments were conducted under the same conditions: we trained the models on a TITAN X GPU with non-augmented images of size 384×192 pixels, using Inception-V3 as the backbone architecture, initialized with the weights obtained from ImageNet pretraining. The relative overhead that BoN introduces is 3%. However, the model trained with BoN needs fewer steps to train, which means that the total training time is reduced 3.4 times. In other words, BoN saves 24.26 hours when trained with the batch hard loss, while significantly improving the performance of batch hard, as will be shown in Section 5.5.
We additionally measured the time needed for one full forward pass of all the images in the train partition of the person re-identification dataset, which is independent of the sampling strategy and loss function. The time to extract all the features is 11.5 minutes, which is equal to 1527 training steps of BoN-BH or 1572 steps of batch hard. All methods that require the computation of features in each epoch ([25, 36, 96, 108] and Spectral Hashing) therefore introduce an overhead of at least 42% at train time. BoN has equal or better performance than [25, 36, 96, 108] (see Table 5.4) while adding one order of magnitude less overhead.


Table 5.2 – Time required for training for 100k steps and until convergence.

Method           time for 100k steps [h]   convergence time [h]
batch hard       12.3                      34.4
BoN-batch hard   12.6                      10.1

Bin stability analysis


The majority of recently published strategies for hard triplet mining take a snapshot
of the full embedding in each epoch and create batches of hard triplets based on
that information. In contrast, we create mini-batches based on an online hash
table that stores all the training images in bins. In every training step, we move all
samples from the mini-batch from old to new hash bins, which are approximated
by the latent representations of the autoencoder that is trained to reconstruct the
embedding. Even though this strategy introduces minimal computational overhead,
it uses a noisy embedding approximation for triplet mining. In this section we
analyse the tendency of images moving from one bin to another, as well as the
Hamming distance between the former and current hashes (see Figure 5.5).
In each training step we assign all 2l images (2 images from each of l classes)
from the mini-batch to new bins. In the beginning of the training we sample one
image per identity randomly from the list of images that have not been sampled
as the first images, and the second image randomly. The images are initially not
assigned to any bin, so all of them are added to the hash table as new (orange bar
in figure 5.5). As the training continues, the number of newly inserted images is
reducing, as both images per ID could have been sampled earlier in the training.
While the hash table is not fully populated, the associated hash entry is unstable,
and the images are moving from one bin to another frequently. Once the embedding
becomes more stable, the percentage of images staying in the same bin increases
significantly (green bar). However, around 90% of images still keep moving.
Figure 5.5 shows the Hamming distance between the old and new hash entries for all images processed in a mini-batch. It can be noticed that, after 25000 steps, the bins become stable, and only 8% of the images move to a bin at a Hamming distance greater than 4. Since we set s = 16, a Hamming distance of at most 4 means that more than 75% of the bits are kept the same. The fact that 92% of the images move within a neighbourhood of bins at a Hamming distance smaller than 5 means that most of the images move to another bin with similar content.
The fact that images move from one bin to another is expected, and there are a couple of reasons for that.


Figure 5.5 – The percentage of samples that were added to the hash table or moved from one bin to another, as a function of the number of training steps. HD stands for the Hamming distance between the old and new hash entry.

First, the decision boundary that separates
bins is updated during training, which means that the samples that are close to
the boundary can easily move from one bin to another. Second, the embedding
changes through time, as does its compressed approximation, so an image that was
assigned to one bin can move and be closer to some other samples in a different
step of training. As mentioned above, the fact that images move in neighboring bins
is not a problem; it is actually beneficial to avoid sampling negative samples that
are either noisy or overly-difficult.

5.5 Results and comparison


In this section we perform a controlled comparison of our proposal with some
of the most commonly used ranking losses: triplet, semi hard and batch hard,
contrastive-batch hard and the three methods for triplet selection: hierarchical
tree [25], 100k IDs [108] and SPL [18]. We avoid extra variables (e.g. augmentation,
other architectures, etc.) that could mask the empirical results for other reasons not


related to negative sampling and triplet construction. For such reasons, we use the same mini-batch size for all the methods, the same pre-trained backbone, the same margin α and the same embedding size (see Subsection 5.4.1 for the details). [36] and [96] are not included in this comparison, since they require an extra loss which could corrupt the analysis; a performance comparison with these approaches is provided in Table 5.4.

Table 5.3 – mAP validation results at peak performance for every method.

Method                               #steps   Market (mAP / r1)   Duke (mAP / r1)
Random                               600k     28.1 / 47.5         22.5 / 37.6
BoN-Random                           440k     61.4 / 80.3         51.3 / 70.2
semi hard                            350k     59.3 / 76.8         53.5 / 71.3
batch hard                           280k     60.8 / 78.6         53.7 / 70.6
batch hard - contrastive             120k     48.9 / 66.9         37.8 / 56.5
SPL (reproduced) [18]-batch hard     200k     65.3 / 81.5         59.3 / 75.8
HT (reproduced) [25]-batch hard      310k     65.9 / 82.8         57.5 / 74.9
100k (reproduced) [108]-batch hard   90k      67.8 / 83.3         61.2 / 77.7
BoN-batch hard - contrastive         90k      59.4 / 77.0         51.9 / 70.6
BoN-batch hard                       80k      69.5 / 85.2         62.1 / 78.5
BoN-dynamic s-batch hard             70k      70.92 / 86.34       64.46 / 79.71
SH-batch hard                        100k     71.6 / 86.6         62.9 / 78.2
batch hard (2x batch)                70k      62.9 / 80.3         56.7 / 74.7

Table 5.3 shows the results of the comparison on the person re-identification dataset. As can be noticed, BoN-random clearly outperforms pure random sampling in fewer steps and provides validation mAPs comparable to semi hard and batch hard. Even though BoN improves the results of batch hard sampling when combined with the contrastive loss, it performs significantly worse than the original batch hard (combined with the triplet loss), so we performed all the experiments using the batch hard - triplet loss setting. Spectral Hashing - batch hard outperforms BoN-batch hard, which is expected, considering that BoN is an online approximation of Spectral Hashing. The numbers show that the margin between BoN and Spectral Hashing is only 1.5% on average over the two evaluation datasets. However, Spectral Hashing can be used only if the train set is reasonably small; its application to bigger datasets would thus be infeasible.


One can argue that the performance of BoN could easily be reached by just increasing the mini-batch size of the batch hard method. The experiment batch hard (2x batch) in Table 5.3 shows a training in which the mini-batch size has been doubled. As expected, in this case the method trains faster and performs better, but it still does not outperform BoN-batch hard. This experiment shows that BoN is a key component of the accelerated training and improved validation results of BoN-batch hard.
We implemented two methods for batch selection known in the literature, Hierarchical Tree (HT) [25] and 100k IDs [108], and combined them with batch hard. We followed the procedure described in [25] and computed the distance matrix between all the IDs every 5k steps. We formed a batch by randomly selecting one ID and taking the remaining l − 1 as its closest neighbors. For [108], we trained a classifier on the whole train set for 10k steps and used this model to create the hash table with 10 bins. Additionally, we adapted a state-of-the-art hashing method for image retrieval [18] to hard negative mining (see Section 5.4.2 for details). The results of all three methods confirm our hypothesis that batch sampling is important for improving and speeding up the training. However, none of them outperforms BoN in either speed or accuracy.
Even though BoN is specially designed to improve the training of Siamese networks on large datasets, we tested its influence on two small datasets, CUB-200 [114] and Market-1501 [130]. BoN improves the mAP on Market-1501 from 58.4 to 60.0, and from 36.1 to 37.9 on CUB-200. The improvement in these cases is smaller than in the experiments conducted on bigger datasets for two reasons: 1) batch hard is usually enough, since the probability that hard samples exist in the mini-batch is higher than in the case of large datasets; 2) choosing an optimal s becomes challenging: a small s does not contain enough information for reconstruction, while a bigger s leads to a degenerate solution.
Table 5.4 shows the comparison of BoN-batch hard with state-of-the-art approaches on the Stanford Online Products and DeepFashion In-Shop datasets. We trained BoN-batch hard using the same training parameters as explained in Section 5.4.1, with a few changes: inception_v1 was used as the backbone architecture (as in [25, 50, 73, 96]), with an extra fully connected layer with frozen weights after the max pooling that reduces the embedding size to 256. We used images of size 336 × 336 pixels (as in [96]) with data augmentation techniques such as random horizontal flipping, blurring, zooming in and out, and cutout. As the images in these datasets are more heterogeneous, the state-of-the-art methods usually do not use task-specific architectures.
We show that BoN-batch hard provides better or comparable results than both [50], which uses attention ensembles, and stochastic class-based [96], which in addition to having higher complexity enhances its performance by using second-order pooling [24], which introduces even more computational cost with respect to the baseline model. Additionally, BoN-batch hard performs better than DAML [21], which uses synthetic negative samples for training.

Table 5.4 – Validation results at peak performance for every method and dataset. * denotes the best number found in the literature for a method that uses additional attention ensembles. † denotes that the method uses bilinear pooling.

Method                         Stanford (r1 / r10)   inShop (r1 / r10)
lifted structured [73]         61.5 / 80.0           - / -
DAML [21]                      68.4 / 83.5           - / -
hierarchical tree [25]         74.8 / 88.3           80.9 / 94.3
sampling matters [116]         72.7 / 86.2           - / -
ABE-8^512 [50]*                76.3 / 86.4           87.3 / 96.7
Stochastic class-based [96]†   77.6 / 89.1           91.9 / 98.0
BoN-batch hard                 80.2 / 91.4           91.4 / 97.9

Our method achieves state-of-the-art results on the Stanford Online Products dataset, while being comparable to previously published methods evaluated on the inShop dataset. The nature of the Stanford Online Products dataset is more aligned with the problem that we are trying to solve: it has more training images than inShop (60k vs. 25.8k) as well as more classes (11.3k vs. 4k). We used the same s = 10 in both cases, so the hash table of Stanford Online Products was more densely populated. Better performance would probably be obtained by training the model on the inShop dataset with a smaller embedding size and a smaller s.

5.6 Conclusion and Future Works


In this chapter we introduced Bag of Negatives (BoN), a novel method for hard negative mining that accelerates and improves the training of Siamese networks and scales well to datasets with a large number of identities.
The main strengths of BoN are being computationally efficient and complementary to the popular batch hard approach. In fact, BoN provides a set of relevant negative samples, while batch hard provides the explicit hard negative selection process and an increased number of triplets per mini-batch; their combination


provides improved validation results thanks to a better sampling of negative candidates. We have also shown that the computational cost of BoN is negligible with respect to the gradient computations during stochastic gradient descent based learning. It is also far more efficient than similar negative mining algorithms in the literature, and it significantly speeds up the Spectral Hashing approach. Summarising, BoN is better and faster than previous hard negative mining methods.
The main disadvantage of BoN is the requirement of a user provided s parame-
ter. This parameter can be tuned by means of cross-validation or other standard
meta-parameter tuning techniques. Nonetheless, we consider that an automatic
strategy for tuning s would be very beneficial for the practical use of BoN on large
datasets. For such a reason, future work will address possible solutions on auto-
matic estimation of the s meta-parameter; since s must be a positive integer, one
possible line of research is the simultaneous use of several values of s combined
with an automated strategy of meta-parameter selection.

5.7 Appendix
So far we have explained how to use BoN with a single meta-parameter s that is set at the beginning of the training. In this section we propose an automatic strategy to dynamically adjust the parameter s during training. Our idea is to create a system based on the exploration/exploitation paradigm which selects the optimal s in every training iteration. We start training the system with s = 1 and explore its neighborhood. When another value of s starts providing more relevant training samples, we switch to it and explore its neighborhood, repeating this at every iteration.
to maximize the difficulty of the sampled pairs or triplets during training. Our
method does not rely on hard-coded scheduling, and does not require any prede-
fined parameters that depend on the dataset nor backbone architecture. We show
preliminary results of the method based on the results on person re-identification
dataset.

Automatic s parameter estimation


For the sake of simplicity, we define a BoN module of size s as a set that includes:

• an autoencoder whose latent representation is of dimensionality s,

• a hash table that has $2^s$ hash entries,

• an estimated loss.


The architecture of the autoencoder and the hash table are the same as described earlier in this chapter. We initialize the expected loss (el) of all BoN modules to 0.

Figure 5.6 – Dynamic s estimation. An example of a set of S BoN modules. The


third BoN module has the biggest expected loss, therefore the probability of
sampling from it in the next training iteration is 0.5, while the probability of
sampling from BoN modules 2 and 4 is 0.25.

When we train a model with a static s, we use one single BoN module. However, if the parameter s is not predefined, we aggregate several BoN modules, as shown in Figure 5.6. The size of every BoN module is equal to its ordinal number (the first BoN module is of size 1, the second of size 2, etc.). The total number of BoN modules S depends on the number of images in the dataset and is defined as:

S = \log(N) / \log(2),    (5.5)

where N is the number of images in the dataset. We train all autoencoders simulta-
neously, and update all hash tables in every training iteration.
At the beginning of every training step we sample the images for a mini-batch from the i-th BoN module, and we update its expected loss at the end of the iteration following the formula:

el_i = 0.99 \cdot el_i + 0.01 \cdot current\_loss.    (5.6)

After updating the i-th expected loss, we recompute the ordinal number of the BoN module that has the highest expected loss as:

h = \mathrm{argmax}([el_1, el_2, ..., el_S])    (5.7)

The BoN module that provides the samples for the next training iteration is either h − 1, h or h + 1, with probability 0.25, 0.5 and 0.25, respectively. If h = 1 we sample from BoN modules h or h + 1 with probability 0.75 and 0.25, and if h = S we sample the images from h − 1 or h with probability 0.25 and 0.75.
Dynamic s estimation introduces minimal per-iteration time and memory overhead. However, due to the slow switching between two active BoN modules, the full training can be slower than training with a fixed s.
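A compact sketch of this exploration/exploitation rule (0-indexed module ids and helper names are ours):

import random

def update_expected_loss(el, i, current_loss):
    """Running estimate of the i-th module's loss, Eq. (5.6)."""
    el[i] = 0.99 * el[i] + 0.01 * current_loss

def pick_module(el):
    """Sample the next BoN module around the arg-max of the expected losses."""
    h = max(range(len(el)), key=lambda j: el[j])     # Eq. (5.7)
    if h == 0:
        return random.choices([h, h + 1], weights=[0.75, 0.25])[0]
    if h == len(el) - 1:
        return random.choices([h - 1, h], weights=[0.25, 0.75])[0]
    return random.choices([h - 1, h, h + 1], weights=[0.25, 0.5, 0.25])[0]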

Results

Figure 5.7 – Dynamic s: the estimated s and its smoothed value as a function of the training iteration.

We trained a model on the re-identification dataset by using BoN with dynamic s estimation combined with the batch hard loss. We set the total number of BoN modules to S = 18. Figure 5.7 shows how the estimated s changes through time. In the beginning of the training s takes small values between 1 and 4, which is expected: the network is learning how to model the images, and even small values of s provide images that are useful for training. As the training progresses, the network needs harder samples to train on, which directly implies that the training images should be sampled from hash tables with more bins. We see that after around 2,000 training iterations s becomes almost constant, and oscillates between 12 and 13.

Table 5.5 – mAP validation results at peak performance for every method.

Method                               #steps   Market (mAP / r1)   Duke (mAP / r1)
Random                               600k     28.1 / 47.5         22.5 / 37.6
BoN-Random                           440k     61.4 / 80.3         51.3 / 70.2
semi hard                            350k     59.3 / 76.8         53.5 / 71.3
batch hard                           280k     60.8 / 78.6         53.7 / 70.6
batch hard - contrastive             120k     48.9 / 66.9         37.8 / 56.5
SPL (reproduced) [18]-batch hard     200k     65.3 / 81.5         59.3 / 75.8
HT (reproduced) [25]-batch hard      310k     65.9 / 82.8         57.5 / 74.9
100k (reproduced) [108]-batch hard   90k      67.8 / 83.3         61.2 / 77.7
BoN-batch hard - contrastive         90k      59.4 / 77.0         51.9 / 70.6
BoN-batch hard                       80k      69.5 / 85.2         62.1 / 78.5
BoN-dynamic s-batch hard             70k      70.92 / 86.34       64.46 / 79.71
SH-batch hard                        100k     71.6 / 86.6         62.9 / 78.2
batch hard (2x batch)                70k      62.9 / 80.3         56.7 / 74.7

Table 5.5 shows that training a model with BoN with dynamic s estimation can improve the results in terms of both accuracy and the number of steps needed for convergence. However, dynamic s introduces an additional memory overhead, which is still negligible with respect to the size of the backbone architecture.

6 Explicit maximization of area under the ROC curve

6.1 Introduction
The main objective of metric learning systems is to embed high dimensional data
(such as images, videos, or audio signals) into a lower dimensional space, while
ensuring that the data that comes from the same class or identity is embedded
within a cluster, which is separated from the clusters of data that belong to other
classes. On images, these systems were traditionally designed to extract patch
descriptors (such as SIFT or SURF), combined with the bag-of-words approach in
order to get a small size embedding which is representative of the input data.
These traditional approaches were replaced by newer deep learning methods
that compute the embedding by processing input data by deep neural networks,
and use this embedding for comparison of images. The neural networks are trained
by minimizing a loss function that models the desired structure of the embedding
space. Even though the most widely used losses for metric learning, such as the contrastive loss [15], triplet loss [88], quadruplet loss [13], classification loss [125], etc., train models to provide locally optimal solutions, they do not guarantee that the embedding will be a good representation of the data distribution on the full test set.
One notable exception is the loss presented in [81], in which the authors propose
direct maximization of the mean Average Precision (mAP) for solving the retrieval
task, which perfectly mimics the final metric learning goal. However, this loss
requires obtaining the vector representations of all training images by a full forward
pass through a deep convolutional neural network several times, in order to provide
a single gradient. The authors demonstrate the performance of the loss on training
data up to 43k images, which is significantly less than the size of commonly used
datasets nowadays.
The area under the ROC curve is a well known way of evaluating recognition
systems [9, 32, 34]. As the amount of available data, as well as computational power,
in the past were limited, the ROC curve based on the available samples was not
smooth, and the area below such an empirical curve was not accurate. Therefore,
in [34] the authors proposed two ways to approximate the real area under the ROC


curve: a Gaussian-based approximation and the Wilcoxon statistic. Maximization of the area under the ROC curve for the classification task was proposed in [9], using the Wilcoxon statistic. Even though this approach was appropriate in the past, we show in the experimental section that our proposal is clearly superior when evaluated on metric learning datasets.
We hypothesize that training a metric learning model by maximizing the area
under the ROC curve can induce an implicit ranking suitable for retrieval problems.
This hypothesis is supported by the fact that “a curve dominates in ROC space if
and only if it dominates in PR space” [17]. In this chapter we propose AUC loss, a
new metric learning loss which explicitly maximizes an underestimate of the area
under the ROC curve at the mini-batch level. We show how the area under the ROC
curve can be approximated by its differentiable relaxation, without the need for
extra hyper-parameter search. AUC loss is simple, yet effective and computationally
inexpensive. We tested AUC loss on four publicly available datasets, and showed
that it achieved state-of-the-art performance for both retrieval (which is measured
by mAP and rank@N) and recognition (measured by the area under the ROC curve).
Our goal is to create a loss that directly optimizes a metric that is used for testing, without extra hyper-parameter search. In this section we present: (1) commonly used evaluation metrics for metric learning, (2) the challenges in directly optimizing such evaluation metrics, and (3) a comparison between the two most relevant metrics: mAP and the area under the ROC curve.

6.1.1 Optimization of evaluation metrics


All evaluation metrics promote the scenario where, for every query, all positive samples from the gallery set appear before the negative samples in the retrieval response. All pairwise loss functions presented in Section 2.2.2 optimize a relaxed problem: they tend to move the embeddings of samples from the same class close to each other while separating them from the other samples in the same mini-batch. The mAP loss, instead, takes into account a more general picture of the full embedding space when optimizing the objective, rather than focusing only on the samples available in a mini-batch.
Even though the pairwise approaches described in section 2.2.2 provide models
that are trained to optimize relaxed versions of the main objective, all of them
have the same drawback: they require the user to select the optimal margin that
separates positive and negative samples prior to training the model. Hence, we
propose a novel loss function that maximizes the area under the ROC curve. This
maximization does not require specifying the margin between positive and negative
samples.


6.1.2 Mean Average Precision vs. area under ROC curve


Even though recall-precision and ROC curves are not equivalent measures, it has been proven that a curve dominates in ROC space if and only if it dominates in recall-precision space [17]. In other words, training a recognition system by maximizing the area under the ROC curve implicitly maximizes the mAP as well. Additionally, the mean average precision highly depends on the size of the test set on which it is calculated. On the contrary, TPR and FPR are relative measures that do not depend on the test size (see the results for three different test set sizes in Table 6.7), which makes the area under the ROC curve at the mini-batch level a relevant representation of the area under the ROC curve on the whole dataset. All this allows the maximization of the area under the ROC curve to optimize both recognition and retrieval systems.

6.2 AUC loss


In this section we present a loss function that explicitly maximizes the area under
the ROC curve. We first introduce the formula for calculating the area under the
ROC curve, which we relax to obtain its differentiable version, which is the base of
the proposed new AUC loss

6.2.1 Area under the ROC curve


Given a threshold value t, the True Positive Ratio (TPR) T(t) and the False Positive Ratio (FPR) F(t), the ROC curve is the parametrized curve t → (F(t), T(t)), as shown by the red line in Fig. 6.1. The area under the ROC curve can be written as follows:

A = \int_{t = t_{min}}^{t_{max}} T(t)\, \frac{dF(t)}{dt}\, dt.    (6.1)

Given a similarity function $f(d_1, d_2) \in [t_{min}, ..., t_{max}]$ for a pair of data points $d_1$ and $d_2$, and for the set of all positive pairs $P = \{(a_1, p_1), (a_2, p_2), \ldots, (a_{N_P}, p_{N_P})\}$ in the training set, the true positive ratio can be written as:

T(t) = \frac{\sum_{i}^{N_P} H(f(a_i, p_i) - t)}{N_P},    (6.2)

where $H(\cdot)$ is the Heaviside function.


[Figure 6.1 – The ROC curve (red line) and its approximation based on a set of
thresholds s (blue line); F(s) is on the horizontal axis and T(s) on the vertical
axis. The area under the approximated curve is calculated using the Trapezoidal
rule.]

For the set of all negative pairs N = {(a_1, n_1), (a_2, n_2), ...,
(a_{N_N}, n_{N_N})}, the false positive ratio can be written as:

F(t) = (1/N_N) Σ_{j=1}^{N_N} H(f(a_j, n_j) − t).    (6.3)
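To make these quantities concrete, the following minimal NumPy sketch evaluates T(t) and F(t) empirically; the similarity values below are illustrative placeholders, not values from our experiments.

import numpy as np

pos_sim = np.array([0.9, 0.7, 0.6, 0.4])    # f(a_i, p_i) for the positive pairs
neg_sim = np.array([0.5, 0.3, 0.2, -0.1])   # f(a_j, n_j) for the negative pairs

def tpr(t):
    # Eq. 6.2: fraction of positive pairs whose similarity exceeds t
    return np.mean(pos_sim > t)

def fpr(t):
    # Eq. 6.3: fraction of negative pairs whose similarity exceeds t
    return np.mean(neg_sim > t)

print(tpr(0.45), fpr(0.45))   # -> 0.75 0.25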

Plugging Equations 6.2 and 6.3 into Equation 6.1 we have:

A = ∫_{t=t_min}^{t_max} [ (1/N_P) Σ_{i=1}^{N_P} H(f(a_i, p_i) − t) ] · d/dt [ (1/N_N) Σ_{j=1}^{N_N} H(f(a_j, n_j) − t) ] dt.    (6.4)

However, this formula cannot be used for gradient based optimization because:
(1) the integral cannot be directly computed and (2) the Heaviside function has
zero gradient almost everywhere. Therefore, we propose two relaxations to obtain a
differentiable function that approximates the area under the ROC curve.


6.2.2 Differentiable relaxation of AUC


Integral to series (Riemann sum)
Firstly, we approximate the continuous function from Equation 6.4 with its discrete
representation. We apply the Trapezoidal rule with a uniform grid over the same
input range and obtain the numerical approximation of Equation 6.1:

A* = Σ_{s=t_min}^{t_max−∆s} [(T(s + ∆s) + T(s))/2] · (F(s) − F(s + ∆s)),    (6.5)

where s spans the interval [t_min, t_max] in S discrete steps of size ∆s = (t_max − t_min)/S.
This approximation corresponds to the area below the piece-wise linear blue curve
from Fig. 6.1. The number of steps is a relevant parameter since more steps provide
a better approximation of the integral. Taking into account that T (s) and F (s)
depend only on the parameter s, they can be calculated in parallel for a set of
values s ∈ {t mi n , t mi n + ∆s, ..., t max − ∆s}, allowing for an efficient implementation
on GPUs.
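As an illustration of Equation 6.5, the sketch below (reusing the tpr and fpr helpers from the previous snippet) evaluates T(s) and F(s) on a uniform threshold grid and sums the trapezoids; the range [−1, 1] is a natural choice for cosine similarities.

t_min, t_max, S = -1.0, 1.0, 40            # similarity range and number of steps
ds = (t_max - t_min) / S
grid = t_min + ds * np.arange(S + 1)       # S + 1 thresholds [t_min, ..., t_max]

T = np.array([tpr(s) for s in grid])
F = np.array([fpr(s) for s in grid])

# Eq. 6.5: trapezoids (T(s) + T(s + ds)) / 2 * (F(s) - F(s + ds))
A_star = np.sum((T[:-1] + T[1:]) / 2 * (F[:-1] - F[1:]))
print(A_star)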

Heaviside to sigmoid
The second step involves using a differentiable approximation of the Heaviside
function; we use the following sigmoidal-like function:

σ(x, t) = 1 / (1 + e^{−r(x−t)}).    (6.6)
This choice has three main rationales: (1) for large values of r this function becomes
a good approximation of the Heaviside function discontinuity, (2) it provides very
small gradients far from the discontinuity, and (3) it is symmetric around t, thus
producing an approximation error with zero mean. The family of sigmoidal functions
for r = 12.02 and ∆s = 0.2 is shown in Fig. 6.2.
These characteristics allow having relevant gradients in the area close to the
discontinuity while, at the same time, keeping the properties of the Heaviside
function and almost completely ignoring sample pairs whose similarity is very
different from the considered threshold t. The tuning of the parameter r is strictly
related to the step size ∆s and will be addressed in section 6.2.3.
Differently from [9], we choose to approximate multiple Heaviside functions
(one for each threshold) with the same number of sigmoidal functions. In this way
our loss provides abundant gradients for all relevant positive and negative pairs.
Using the approximation from Equation 6.6, we can re-write Equations 6.2 and 6.3
as follows:

T*(t) = (1/N_P) Σ_{i=1}^{N_P} σ(f(a_i, p_i), t),    (6.7)

F*(t) = (1/N_N) Σ_{j=1}^{N_N} σ(f(a_j, n_j), t).    (6.8)

[Figure 6.2 – Family of sigmoids σ(x, s) for ∆s = 0.2 and r = 12.02.]

Finally, we substitute T(·) and F(·) in Equation 6.5 with their respective
approximations T*(·) and F*(·) (Equations 6.7 and 6.8) and, for the sake of
simplicity, use the shorter notation f_{p_i} instead of f(a_i, p_i) and f_{n_i}
instead of f(a_i, n_i). We obtain the following differentiable AUC formula:

A** = Σ_{s=t_min}^{t_max−∆s} [ (1/(2N_P)) Σ_{i=1}^{N_P} (σ(f_{p_i}, s) + σ(f_{p_i}, s + ∆s)) ] · [ (1/N_N) Σ_{i=1}^{N_N} (σ(f_{n_i}, s) − σ(f_{n_i}, s + ∆s)) ].    (6.9)
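A minimal differentiable sketch of Equation 6.9 in PyTorch, assuming pos_sim and neg_sim are tensors of positive and negative pair similarities in [t_min, t_max] and r is one of the slopes discussed in the next section, could look as follows:

import torch

def auc_relaxed(pos_sim, neg_sim, ds=0.05, r=42.2, t_min=-1.0, t_max=1.0):
    S = int(round((t_max - t_min) / ds))
    grid = t_min + ds * torch.arange(S + 1)                   # thresholds s
    # one sigmoid per (pair, threshold) combination, Eq. 6.6
    sig_p = torch.sigmoid(r * (pos_sim[:, None] - grid[None, :]))
    sig_n = torch.sigmoid(r * (neg_sim[:, None] - grid[None, :]))
    # Eq. 6.9: relaxed T* trapezoid term times relaxed F* difference term
    t_term = (sig_p[:, :-1] + sig_p[:, 1:]).mean(dim=0) / 2.0
    f_term = (sig_n[:, :-1] - sig_n[:, 1:]).mean(dim=0)
    return (t_term * f_term).sum()

Being a composition of differentiable operations, this estimate can be maximized directly with gradient-based optimizers.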


6.2.3 AUC loss function


Equation 6.9 is a differentiable approximation of the area under the ROC curve on a
set of positive and negative pairs. Ideally, this equation would be applied to the
whole dataset to calculate a very tight approximation of the real area under the ROC
curve. However, accessing the whole dataset in one training step is computationally
expensive. Therefore, we calculate the approximated area under the ROC curve
based on the samples available at the mini-batch level.
Inspired by [40], we create mini-batches out of k samples from each of l classes,
and explore two different strategies for calculating the loss: batch all and batch
hard. In the batch all strategy we use all positive and all negative pairs when
calculating the loss, as follows:

L_AUC_BA = 1 − (1/(2 N_P N_N)) Σ_{s=t_min}^{t_max−∆s} [ Σ_{i=1}^{N_P} (σ(f_{p_i}, s) + σ(f_{p_i}, s + ∆s)) ] · [ Σ_{i=1}^{N_N} (σ(f_{n_i}, s) − σ(f_{n_i}, s + ∆s)) ].    (6.10)

The batch hard strategy calculates the loss based only on the similarities of
the hardest positive and negative samples for each sample in the mini-batch. If
N = kl is the mini-batch size, we can write the batch hard AUC loss as follows:

L_AUC_BH = 1 − (1/(2N²)) Σ_{s=t_min}^{t_max−∆s} [ Σ_{i=1}^{N} (σ(f_{p_i}, s) + σ(f_{p_i}, s + ∆s)) ] · [ Σ_{i=1}^{N} (σ(f_{n_i}, s) − σ(f_{n_i}, s + ∆s)) ].    (6.11)

Even though the batch all strategy takes into account all pairs from the mini-batch,
it leads to a much weaker underestimate of the AUC w.r.t. the batch hard strategy.
Additionally, the ideal scenario for training a model with AUC_BA would require
creating batches in which the number of positive pairs equals the number of negative
pairs, which is impossible. AUC_BH maximizes an underestimation of the area under
the full ROC curve at the mini-batch level. We show the experimental comparison of
the two strategies in section 6.3.2.
The AUC_BH loss defined in Formula 6.11 can be seen as a pairwise loss, as it
is calculated based on the similarities of image pairs. However, what makes AUC_BH
different from the other pairwise losses is that it does not directly optimize the
relations between positive and negative pairs, but rather maximizes the approximated
area under the ROC curve based on the pair similarities.


AUC metaparameters
The AUC loss function, as defined in Equation 6.11, has two metaparameters: (1) the
step size ∆s, and (2) the slope r of the sigmoid function. The step size is a
relevant parameter: the smaller the step, the more accurate the approximation of
the integral.

[Figure 6.3 – First order derivative of the sum of sigmoids over x, for ∆s = 0.2,
shown for r = 6, r = 12.02 and r = 25.]

The setting of the r parameter in Equation 6.6 is of vital importance for the proposed
approach. The value of r should be large enough to ensure a good approximation
of the Heaviside function while providing useful and well-balanced gradients for
a gradient-based optimization strategy. Fig. 6.3 shows the first order derivative of
the sum of sigmoidal functions over x for different values of r. A small r
leads to flat gradient magnitudes around the middle of the range, while significantly
decreasing the magnitude close to the edges of the range (blue line). On the other
hand, a large r introduces oscillations of the magnitude of the gradients
over the whole input range (orange line). The approximation of the integral over t
with a discrete summation can generate larger gradients for thresholds t that are
close to the grid points if the slope of the sigmoidal-like function is too large.
For this reason we look for the parameter r for which the square of the
second order derivative of the summation of all sigmoidal-like functions over x is
minimal¹:

r = arg min_r ∫_{t_min}^{t_max} ( d²/dx² Σ_{s=t_min}^{t_max} σ(x, s) )² dx.    (6.12)

In such a way, we force the magnitudes of the gradients generated for all values
of x to be almost independent of the relative position of x with respect to the grid
point s. We find a non-degenerate local minimum² of Equation 6.12 numerically for a
set of ∆s parameters (see Table 6.1). This setting, for ∆s = 0.2, is shown by the
red line in Fig. 6.3.

Table 6.1 – Optimal r for a set of ∆s parameters.

∆s      r
0.01    201.0
0.02    101.0
0.05    42.2
0.1     22.47
0.2     12.02

¹ The second order derivative of the sigmoidal function is not always positive, so we use its square for measuring the oscillations.
² The trivial solution of this equation is r = 0, which is degenerate and thus unacceptable.
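A possible numerical sketch of this search is shown below; it uses the analytic second derivative of the sigmoid and restricts the search to an interior local minimum over an illustrative bracket [5, 50], since the global minimum is the degenerate r = 0 solution.

import numpy as np

t_min, t_max, ds = -1.0, 1.0, 0.2
thresholds = np.arange(t_min, t_max + ds / 2, ds)   # sigmoid centers s
x = np.linspace(t_min, t_max, 2001)

def objective(r):
    # second derivative of sigma(x, s) over x: r^2 * sig * (1 - sig) * (1 - 2 sig)
    sig = 1.0 / (1.0 + np.exp(-r * (x[:, None] - thresholds[None, :])))
    d2 = (r ** 2 * sig * (1 - sig) * (1 - 2 * sig)).sum(axis=1)
    return np.trapz(d2 ** 2, x)                     # integral of Eq. 6.12

rs = np.linspace(5.0, 50.0, 451)
vals = np.array([objective(r) for r in rs])
# keep an interior local minimum of the objective
local = [i for i in range(1, len(rs) - 1) if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]]
print(rs[local[0]] if local else "no interior local minimum in the bracket")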

AUC implementation details


Now that we have described the theoretical characteristics of the AUC loss, we explain
the steps needed for its implementation. The AUC loss function has three input
parameters: (1) a list of N features extracted from the backbone architecture, (2) a
list of ground truth labels that correspond to the features, and (3) the step size ∆s
(line 1 in algorithm 4). We first calculate the similarities between all features
from the mini-batch (line 2 in algorithm 4). The similarity matrix is a square
matrix of size N × N, where the value of similarity[i, j] corresponds to the
similarity between features i and j.
We define the positive mask as a binary matrix of size N × N that indicates whether
the elements of the similarity matrix relate different images from the same class.
Similarly, the negative mask extracts the positions in the similarity matrix that
relate images from different classes. We mask the similarity matrix by element-wise
multiplying it with the positive mask and find the minimal similarity per row (line
3 in algorithm 4). In this way, we obtain a vector of the hardest positive
similarities (HPS) for each input feature. Similarly, we mask the similarity matrix
with the negative mask and find the maximum similarity per row, resulting in a vector
of size N of the hardest negative similarities (HNS), one per input feature (line 4
in algorithm 4).
We define a vector of thresholds as a step vector of size S + 1: [t_min, t_min +
∆s, ..., t_max] (line 5 in algorithm 4), and get the optimal slope for the given ∆s
from Table 6.1 (line 6 in algorithm 4). For each threshold from the step vector and
for each hardest positive similarity, we compute the value of the sigmoid defined in
Formula 6.6, and store it in a matrix σ+ of size N × (S + 1) (line 7 in algorithm 4).
Similarly, we obtain the σ− matrix based on the hardest negative similarities (line 8
in algorithm 4). We then obtain vectors s1 and s2 of length S from the σ+ and σ−
matrices (lines 9-13 in algorithm 4). Although we present the algorithm with a for
loop, this procedure is implemented with matrix operations in a parallel way,
exploiting the GPU parallelization capabilities. We get the estimated area under the
ROC curve R based on the samples from the mini-batch, as shown in line 14 of
algorithm 4. Finally, the AUC loss is calculated as 1 − R (line 15 in algorithm 4).

Algorithm 4 AUC_BH loss function

1: Input: features, labels, ∆s
2: calculate the matrix of similarities based on the features
3: find the lowest similarity per row among features that belong to the same class (hardest_positive_similarity or HPS)
4: find the highest similarity per row among features that do not belong to the same class (hardest_negative_similarity or HNS)
5: step_vector ← [t_min, t_min + ∆s, ..., t_max]
6: for the given ∆s, find the optimal slope from Table 6.1
7: σ+ ← 1 / (1 + e^{−slope·(HPS − step_vector)})
8: σ− ← 1 / (1 + e^{−slope·(HNS − step_vector)})
9: for j in 1, 2, ..., S do
10:    s ← step_vector[j]
11:    s1[j] ← Σ_{i=1}^{N} (σ+[i, s] + σ+[i, s + ∆s])
12:    s2[j] ← Σ_{i=1}^{N} (σ−[i, s] − σ−[i, s + ∆s])
13: end for
14: R ← (1/(2N²)) Σ_{j=1}^{S} s1[j] · s2[j]
15: AUC ← 1 − R
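A compact PyTorch sketch of Algorithm 4 could look as follows; it assumes l2-normalized embeddings (so the dot product gives similarities in [−1, 1]) and at least two images per class in the mini-batch, and takes the slope from Table 6.1.

import torch

SLOPES = {0.01: 201.0, 0.02: 101.0, 0.05: 42.2, 0.1: 22.47, 0.2: 12.02}   # Table 6.1

def auc_bh_loss(features, labels, ds=0.05, t_min=-1.0, t_max=1.0):
    sim = features @ features.t()                               # line 2
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    # lines 3-4: hardest positive (lowest) and hardest negative (highest) per row
    hps = sim.masked_fill(~(same & ~eye), float('inf')).min(dim=1).values
    hns = sim.masked_fill(same, float('-inf')).max(dim=1).values
    # lines 5-6: threshold grid and the matching slope
    S = int(round((t_max - t_min) / ds))
    grid = t_min + ds * torch.arange(S + 1, device=features.device)
    r = SLOPES[ds]
    # lines 7-8: sigma matrices of size N x (S + 1)
    sig_p = torch.sigmoid(r * (hps[:, None] - grid[None, :]))
    sig_n = torch.sigmoid(r * (hns[:, None] - grid[None, :]))
    # lines 9-14: trapezoidal estimate R of the area under the ROC curve
    s1 = (sig_p[:, :-1] + sig_p[:, 1:]).sum(dim=0)
    s2 = (sig_n[:, :-1] - sig_n[:, 1:]).sum(dim=0)
    R = (s1 * s2).sum() / (2 * len(labels) ** 2)
    return 1.0 - R                                              # line 15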


6.3 Empirical evidence


In this section we analyze the performance of the AUC loss under different settings
and training parameters. We perform experiments with different step sizes ∆s and
image sizes, and we compare the two AUC strategies: batch all and batch hard. We
evaluate the performance of the AUC loss on four publicly available datasets and
compare the results with current state-of-the-art approaches.

6.3.1 Experimental details


Datasets
We evaluate our method on four publicly available datasets: Stanford Online Prod-
ucts (SOP) [73], DeepFashion - In-Shop Clothes Retrieval [63], Caltech-UCSD
Birds 200 (CUB-200) [114] and VERI-Wild [64]. For more details about the datasets,
see Section 1.2.4.

Training details
In all the experiments we use ResNet50 as the backbone architecture, initialized
with the weights obtained from ImageNet classification pre-training. We take the
output of the last convolutional layer and apply global max pooling to obtain a
feature vector for each input image. We reduce the size of the feature vector to 512
with an orthogonally initialized fully connected layer. Finally, we l2-normalize the
vector. This normalization projects all vectors onto a hypersphere, which allows using
the dot product for calculating vector-to-vector similarities.
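A sketch of this embedding network, built on torchvision for illustration, could be:

import torch.nn as nn
import torch.nn.functional as F
import torchvision

class EmbeddingNet(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)      # ImageNet weights
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
        self.fc = nn.Linear(2048, dim)
        nn.init.orthogonal_(self.fc.weight)          # orthogonal initialization

    def forward(self, x):
        feat = self.trunk(x)                         # N x 2048 x h x w feature map
        feat = F.adaptive_max_pool2d(feat, 1).flatten(1)   # global max pooling
        return F.normalize(self.fc(feat), dim=1)     # l2-normalized embedding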
We train our models on the large scale datasets (all except CUB-200) by using the
ADAM optimizer with initial learning rate 10⁻⁴ and a decay of 0.9 every 10,000
steps. When training a model on a small dataset, such as CUB-200, using the ADAM
optimizer is not appropriate, as it could lead to overfitting. Therefore, we use the
SGD optimizer with initial learning rate 10⁻³, which is decayed by 0.1 every 3,000
steps.
We create each mini-batch out of 128 images, 2 images for each of 64 classes/identities,
unless stated otherwise. All images in a mini-batch are resized to either
224 × 224 or 256 × 256. In all the experiments we augment one of the two images
per class in a mini-batch. We use horizontal flipping, cutout, zoom-in/out, color
shift and motion blur as augmentation techniques.
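A minimal sketch of this batch construction (k = 2 images for each of l = 64 classes) is given below; labels_by_class, which maps a class id to the list of its image indices, is an illustrative placeholder.

import random

def sample_batch(labels_by_class, k=2, l=64):
    classes = random.sample(list(labels_by_class), l)   # l distinct classes
    batch = []
    for c in classes:
        imgs = labels_by_class[c]
        # sample with replacement if a class has fewer than k images
        picks = random.sample(imgs, k) if len(imgs) >= k else random.choices(imgs, k=k)
        batch.extend(picks)
    return batch   # k * l image indices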


6.3.2 AUC loss analysis


∆s parameter
In this section we analyse the impact of the ∆s parameter from Formula 6.11 on a
model trained on the Stanford Online Products dataset, following the implementation
details described in Section 6.3.1.
We compare the retrieval performance, represented by rank@1, of models trained
by optimizing the AUC loss function for different step sizes ∆s and show the results
in Table 6.2. The smaller the ∆s, the more precise the approximation of the integral,
which directly implies better performance, with negligible memory overhead.
When ∆s is equal to 0.2, the ROC curve is approximated by 11 points, which is
a poor approximation. On the contrary, by setting ∆s to 0.01 we obtain a better
approximation of the curve based on 201 points. On the other hand, a ∆s of 0.05
already represents the curve by 41 points, which is precise enough, and therefore
we set ∆s to 0.05 in our experiments.
Apart from providing a more precise approximation of the ROC curve, a smaller
step size also guarantees that more relevant points are used for training. Since the
first working point (s = −1, TPR = 1, FPR = 1) and the last one (s = 1, TPR = 0,
FPR = 0) are static and do not contribute to the training, a coarse grid leaves us
with few points that actually have an impact.
The memory requirements of the AUC loss are negligible with respect to the
memory needed for storing the backbone architecture. As an example, with a
mini-batch of 128 and embedding size 512, the memory required for calculating
the loss for ∆s = 0.01 is ≈ 1 MB.

Table 6.2 – Validation r@1 as a function of ∆s tested on the SOP dataset.

∆s        0.01    0.02    0.05    0.1     0.2
r@1 [%]   79.11   79.06   78.97   77.46   74.67

Batch all vs batch hard strategies
We trained two models on the Stanford Online Products dataset under the same
conditions (as described in section 6.3.1), and we created mini-batches of 190
images. We randomly sampled a set of 10 different classes, and from each one of them
we randomly chose 19 images, which results in N_N = 16242 negative and N_P = 1710
positive pairs. If a chosen class has fewer than 19 images, we sampled some images
more than once.
As shown in Table 6.3, AUC_BH optimizes a stronger underestimation of the
area, thus providing stronger and better gradients during the training.


Table 6.3 – Comparison of batch all and batch hard strategies on the Stanford Online
Products [73] dataset.

                   im size   mb size   R@1     R@10
AUC-BA^512_R50     224x224   190       64.70   80.45
AUC-BH^512_R50     224x224   190       75.72   89.13

AUC loss evaluation


In this section we compare the AUC loss with the triplet batch hard loss, as this loss
is a milestone for metric learning. We followed the same training parameters as
described in 6.3.1, used images resized to 224x224, and compared the results on
four publicly available datasets, as shown in Tables 6.4 - 6.7. Since the AUC loss
has no extra hyperparameters to tune, and the influence of the margin when training
a model with the triplet batch hard loss is minimal [40], we set the margin for the
triplet loss to 0.3.
The models trained on the CUB-200 dataset with the AUC loss outperform the
models trained with triplet batch hard in terms of rank 1 by 4.4% when trained with
original images, and by 5.71% when trained with crops.
On the SOP dataset AUC outperforms triplet batch hard by 3.95% in terms of rank
1 and by 2.89% in terms of rank 10, and by 1.23% and 0.6% on the inShop dataset. We
evaluated the models trained on the VERI-Wild dataset on all three test splits; AUC
achieved better performance by 3.8% in terms of mAP and 4.2% in terms of rank 1
on the small test split, by 4.67% and 5.74% on the medium split, and by 4.86% and
7.9% on the large split, in terms of mAP and rank 1 respectively.
In addition to comparing the numerical ranking results in terms of mAP,
rank 1 and rank 10, we compared the area under the ROC curves of all models. This
measure evaluates the separability between different classes, and confirms that the
AUC loss provides a more robust model with better class separability.
Finally, we conclude that the AUC loss is superior to the triplet batch hard loss
for both ranking (evaluated by measuring mAP and rank@N) and recognition tasks
(measured by calculating the area under the ROC curve), without additional
hyperparameter tuning or a relevant increase of computational cost w.r.t. the triplet loss.
Inspired by [9], we also implemented a variation of the loss based on the Wilcoxon
statistic, originally proposed and used for classification. We employ the batch hard
strategy by using only the hardest positive and hardest negative pairs for each input
image when calculating the loss:

L_Wilcoxon = 1 − (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} σ(f_{p_i} − f_{n_j}).    (6.13)
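For comparison with the AUC-BH sketch above, a possible implementation of this batch-hard Wilcoxon variant, reusing the hardest positive/negative similarity vectors hps and hns and a sigmoid slope r, is:

import torch

def wilcoxon_bh_loss(hps, hns, r=42.2):
    # a single sigmoid applied to every (hardest positive, hardest negative) difference
    diff = hps[:, None] - hns[None, :]              # N x N matrix of differences
    return 1.0 - torch.sigmoid(r * diff).mean()     # Eq. 6.13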

We compare the Wilcoxon loss with AUC under the same experimental settings
on four metric learning datasets, and present results in Tables 6.4-6.7. AUC provides
significantly better results on all datasets (8 points on R@1 on Stanford Online
Products, 14 points on CUB-200, 17% on CUB-200 crops, 9% on In-shop, and 19%
on VERI-Wild small evaluation subset).

Table 6.4 – Comparison of the AUC and the triplet batch hard loss functions on the
Stanford Online Products [73] dataset.

R@1 R@10 AUC


Triplet-BH 74.90 87.98 98.98
Wilcoxon 70.22 84.38 98.07
AUC-BH 78.85 90.87 99.00

Table 6.5 – Comparison of the AUC and the triplet batch hard loss functions on the
CUB-200-2011 [114] dataset.

             crop   R@1     R@8     AUC
Triplet-BH   ✗      53.79   84.26   87.51
Wilcoxon     ✗      44.59   76.77   79.10
AUC-BH       ✗      58.19   86.36   88.89
Triplet-BH   ✓      62.57   90.27   90.34
Wilcoxon     ✓      51.20   84.59   83.75
AUC-BH       ✓      68.28   92.31   92.76

We believe that the main advantage of AUC with respect to Wilcoxon statistics
is that AUC relies on a family of sigmoidal functions while Wilcoxon statistics ap-
proximates the area under the ROC curve based on the results of a single sigmoidal
function.


Table 6.6 – Comparison of the AUC and the triplet batch hard loss functions on the
In-shop Clothes [63] dataset.

R@1 R@10 AUC


Triplet-BH 89.81 97.24 98.62
Wilcoxon 82.63 94.18 96.28
AUC-BH 91.04 97.84 98.88

Table 6.7 – Comparison of the AUC and the triplet batch hard loss functions on the
VERI-Wild [64] dataset (small, medium and large test splits).

             Small                  Medium                 Large
             mAP    R@1    AUC      mAP    R@1    AUC      mAP    R@1    AUC
Triplet-BH   70.79  84.13  99.88    62.40  77.78  99.88    51.12  70.00  99.89
Wilcoxon     47.76  70.5   99.57    40.06  63.48  99.62    30.65  54.56  99.59
AUC-BH       75.66  89.26  99.89    68.49  84.76  99.89    58.81  79.49  99.89

6.3.3 Comparison with state of the art


Tables 6.8 - 6.14 present an extensive comparison with state-of-the-art methods on
four publicly available retrieval and re-identification datasets in terms of recall@k
and mean average precision (mAP), reporting also the image size, mini-batch size,
backbone architecture and embedding size.
We show the results on the Stanford Online Products dataset in Table 6.8. Due
to the diversity of product categories (bicycle, chair, lamp, etc.), the methods
trained on this dataset are not domain specific. Our method achieves state-of-the-art
performance when trained with images that are resized and cropped to 224x224
pixels, which is comparable with the image size used in the majority of state-of-the-
art methods. AUC outperforms all state-of-the-art methods, including ensemble
methods such as A-BIER [75], ABE-8 [50] and HDC [123]. AUC achieves better
performance than FastAP [8], even though FastAP uses category information when
sampling images for batches, while splitting them into smaller chunks for training
with an effectively bigger batch size. We further improve the performance of
our method by using images of size 256x256.
In Table 6.9 we show that AUC performs well on small datasets as well. We
trained our model in two scenarios: using the original images and using the crops of
the birds provided by the authors of the dataset. We treated both cases in the same
way, resizing the images to 256x256 and using a random 224x224 crop for training
and a central crop for testing. The AUC loss achieves results that are comparable
with the state of

Table 6.8 – Comparison with the state-of-the-art on the Stanford Online
Products [73] dataset. Embedding dimension is presented as a superscript and the
backbone architecture as a subscript. R stands for ResNet, G for GoogLeNet.

                                im size   mb size   R@1     R@10
Histogram Loss^512_G [105]      256       128       63.9    81.7
Binomial Deviance^512_G [105]   256       128       65.5    82.3
N-Pair-Loss^512_G [93]          -         120       67.7    83.8
Clustering^64_G [72]            227       128       67.0    83.7
Angular Loss^512_G [110]        256       128       70.9    85.0
HDC^384_G [123]                 -         100       69.5    84.4
Margin^128_R50 [116]            224       80        72.7    86.2
A-BIER^512_G [75]               224       -         74.2    86.9
HTL^128_G [25]                  224       50        74.8    88.3
ABE-8^512_G [50]                224       64        76.3    88.4
FastAP^512_R50 [8]              224       256       76.4    89.1
RaMBO^512_R50 loglog [84]       224       128       78.6    90.5
RankMI^128_R50 [48]             -         120       74.3    87.9
R-Margin^128_R50 [85]           224       160       78.5    -
AUC^512_R50                     224       128       78.97   91.11
AUC^512_R50                     256       128       80.32   91.89

the art when trained with original images. Our method outperforms all ensemble-
based methods, and shows comparable results with the newest state-of-the-art
methods. The only method that performs significantly better is RankMI [48]. Even
though RankMI outperforms AUC, it is computationally more expensive, as the
model is built out of two networks that are updated alternately and it has two extra
hyperparameters; also, the authors do not report the input image size. The R-Margin
model achieves 6.7% higher rank@1 on the CUB-200 dataset, while using a bigger
mini-batch, distance-based tuple mining, and ρ regularization. Additionally, this
model has an extra hyperparameter β and its results vary significantly with different
initialization values. We believe that the AUC loss leads to overfitting, due to its
strong gradients, when trained on small datasets. We improved the performance of
AUC by using image crops instead of whole images (see Table 6.10).
Another dataset appropriate for image retrieval is DeepFashion In-Shop. We
trained models with images resized to 224x224 and 256x256, and they achieved


Table 6.9 – Comparison with the state-of-the-art on the CUB-200-2011 [114]
dataset. Embedding dimension is presented as a superscript and the backbone
architecture as a subscript. R stands for ResNet, G for GoogLeNet.

                                im size   mb size   R@1     R@8
Histogram Loss^512_G [105]      256       128       50.3    82.4
N-Pair-Loss^64_G [93]           -         120       51.0    83.2
Binomial Deviance^512_G [105]   256       128       52.8    83.9
Angular Loss^512_G [110]        256       128       54.7    83.9
Clustering^64_G [72]            227       128       48.2    81.9
Smart Mining^64_G [36]          -         -         49.8    83.3
Margin^128_G [116]              224       128       63.8    90.0
HDC^384_G [123]                 -         100       53.6    85.6
HTL^128_G [25]                  224       50        57.1    86.5
A-BIER^512_G [75]               224       -         57.5    86.2
ABE-8^512_G [50]                224       64        60.6    87.7
R-Margin^128_R50 [85]           224       160       64.9    -
RaMBO^512_R50 loglog [84]       224       128       64.0    90.6
RankMI^128_R50 [48]             -         120       66.7    91.0
AUC^512_R50                     224       128       58.19   86.36
AUC^512_R50                     256       128       62.10   89.45

Table 6.10 – Comparison with the state of the art on the CUB-200-2011 [114]
cropped dataset. Embedding dimension is presented as a superscript and the
backbone architecture as a subscript. R stands for ResNet, G for GoogLeNet.

                               im size   mb size   R@1     R@8
PDDM Triplet^128_G [41]        -         64        50.9    82.5
PDDM Quadruplet^128_G [41]     -         64        58.3    88.4
HDC^384_G [123]                -         100       60.7    89.2
Margin^128_G [116]             224       128       63.9    90.6
AUC^512_R50                    224       128       68.28   92.31
AUC^512_R50                    256       128       70.81   93.53


Table 6.11 – Comparison with the state-of-the-art methods on the In-shop
Clothes [63] dataset. Embedding dimension is presented as a superscript and the
backbone architecture as a subscript. R stands for ResNet, G for GoogLeNet, V for
VGG.

                              im size   mb size   R@1     R@10
FashionNet_V [63]             -         -         53.0    73.0
HDC^384_G [123]               224       100       62.1    84.9
DREML^48_R18 [121]            256       128       78.4    93.7
HTL^128_G [25]                224       650       80.9    94.3
A-BIER^512_G [75]             224       -         83.1    95.1
ABE-8^512_G [50]              224       64        87.3    96.7
FastAP^512_R50 [8]            224       256       90.9    97.7
RaMBO^512_R50 loglog [84]     224       128       86.3    96.2
AUC^512_R50                   224       128       91.04   97.84
AUC^512_R50                   256       128       91.01   97.90

Table 6.12 – Comparison with the state-of-the-art methods on the VERI-Wild
small [64] dataset. Embedding dimension is presented as a superscript and the
backbone architecture as a subscript. R stands for ResNet, A for ad-hoc, M for
MobileNet.

                        im size   mb size   mAP     R@1
Veri-wild^1024_A [64]   224       -         35.11   64.03
MLSL^1024_M [1]         224       24        46.32   86.03
GLAMOR^512_R18 [97]     208       36        77.15   92.13
AUC^512_R50             224       128       75.66   89.26
AUC-BoN^512_R50         224       128       80.31   93.50
SAVER^2048_R50 [49]     256       -         83.40   96.90
UMTS^512_R50 [45]       256       64        72.70   84.50
PVEN^2048_R50 [67]      256       8         82.50   -
AUC-BoN^512_R50         256       128       82.14   94.43

comparable results. We believe that the bigger image size does not provide relevant
benefits on this dataset because there is not much room for improvement even


Table 6.13 – Comparison with the state-of-the-art methods on the VERI-Wild
medium [64] dataset. Embedding dimension is presented as a superscript and the
backbone architecture as a subscript. R stands for ResNet, A for ad-hoc, M for
MobileNet.

                        im size   mb size   mAP     R@1
Veri-wild^1024_A [64]   224       -         29.80   57.82
MLSL^1024_M [1]         224       24        42.37   83.00
AUC^512_R50             224       128       68.49   84.76
AUC-BoN^512_R50         224       128       74.55   91.22
SAVER^2048_R50 [49]     256       -         78.70   96.00
UMTS^512_R50 [45]       256       64        66.10   79.30
PVEN^2048_R50 [67]      256       8         77.00   -
AUC-BoN^512_R50         256       128       76.68   92.18

Table 6.14 – Comparison with the state-of-the-art methods on the VERI-Wild
large [64] dataset. Embedding dimension is presented as a superscript and the
backbone architecture as a subscript. R stands for ResNet, A for ad-hoc, M for
MobileNet.

                        im size   mb size   mAP     R@1
Veri-wild^1024_A [64]   224       -         22.78   49.43
MLSL^1024_M [1]         224       24        36.61   77.51
PGAN^512_R50 [124]      224       64        74.10   93.80
AUC^512_R50             224       128       58.81   79.49
AUC-BoN^512_R50         224       128       66.47   88.36
SAVER^2048_R50 [49]     256       -         71.30   94.10
SAFR^2048_R50 [98]      350       72        77.90   92.10
UMTS^512_R50 [45]       256       64        54.20   72.80
PVEN^2048_R50 [67]      256       8         69.70   -
AUC-BoN^512_R50         256       128       68.87   89.15

when trained with small images. The accuracy of both our model and the state of
the art on this dataset is higher than on the previously discussed Stanford Online
Products and CUB-200 datasets, due to the lower complexity of the retrieval task. In Table 6.11


we show that we achieve state-of-the-art results on this dataset. The only method
that achieves results comparable to AUC is FastAP, while using a mini-batch twice
as big as ours.
Finally, we tested our method on the VERI-Wild dataset, which is used for vehicle
re-identification. The majority of state-of-the-art models combine several loss
functions to achieve better results (e.g. [1, 49, 64, 67, 97, 98, 124]). Additionally,
several methods that are evaluated on this dataset use domain specific information
during training, such as the position of the mirrors and wheels, the color of the
vehicle, the side and front views, etc. These architectures are appropriate for
vehicle re-identification datasets, but they cannot be used for any other retrieval
problem. Even though our method is simpler and not domain specific, its performance
is comparable with domain specific state-of-the-art approaches, as shown in Tables
6.12 - 6.14. Taking into account that all images in this dataset share the same
characteristics (they are all vehicles) and that the number of classes is much
greater than the mini-batch size, we combine AUC with BoN, a state-of-the-art method
for hard negative sampling [22, 23]. This combination significantly improves the
performance of AUC alone. Finally, we train a model with AUC-BoN using bigger input
images, which further improves the results.
The model that achieves state-of-the-art results is SAFR [98], and it uses
significantly bigger input images than the ones we use in the experiments (350x350
pixels) and an embedding size of 2048, which is four times bigger than the embedding
we use. Implementing such an approach exceeds our hardware limitations. Additionally,
this model uses three loss functions: smoothed softmax, triplet and center loss, as
well as an unsupervised attention network. SAVER [49] is another method that performs
slightly better than AUC-BoN. However, this model uses a more complicated network
architecture that contains a variational autoencoder together with the ResNet50
backbone, and a combination of cross entropy and triplet losses.
The AUC loss outperforms state-of-the-art methods on large scale retrieval datasets,
and is comparable with the more complex models used for vehicle re-identification.


6.4 Conclusion and Future Works


In this chapter we presented a new loss function for metric learning that maximizes
the area under the ROC curve. The AUC loss, combined with the ResNet50 architecture
as a backbone, achieves results that are better than or comparable to those of more
complex state-of-the-art methods on both small and large datasets, without additional
hyperparameter tuning. We presented the performance of the AUC loss on metric
learning problems, but our future work will include variants of this loss which
could potentially bring benefits to other computer vision problems, such as image
classification and object detection. Additionally, we will explore the usage of this
loss for other unstructured data, such as audio and video.

7 Closing remark

7.1 Conclusions
In this thesis we have addressed the problem of image retrieval in three stages.
We started the research with an extensive analysis of the state-of-the-art methods
that were available at that time, and we combined the best practices from the
literature in order to train a model that outperforms the state of the art. We
noticed that the most commonly used loss for retrieval was the triplet loss, and
that its main disadvantage is that it is computationally expensive to find input
samples that generate gradients for the training. Therefore, in our second
contribution we addressed the problem of hard negative sampling by proposing an
online sampling strategy called BoN. Finally, having in mind that a curve dominates
in the ROC space if and only if it dominates in the precision-recall space, we
designed a new loss that maximizes a strong underestimate of the area under the ROC
curve, which is appropriate for both retrieval and recognition. Here, we take the
opportunity to summarize the findings of this work.
In the first part of this thesis we proposed a set of good practices for training
re-identification models. First, we showed that deeper backbone architectures
provide better results. We compared the performance of models trained with three
different input image sizes, and found that images smaller than 416x416 pixels
deteriorate the final results. We analyzed two pooling strategies that are applied
before the last fully connected layer, and found that max pooling outperforms
average pooling. We discussed the idea of curriculum learning for re-identification
through increasing the task difficulty as the training evolves. Our curriculum
learning approach was made of three strategies: 1) pre-training for classification,
2) increasing image difficulty through the amount of augmentation, and 3) hard
negative mining. We showed that each curriculum learning strategy has a positive
impact on the final result and that the best performance is achieved when all three
strategies are combined. We tested our approach on four publicly available datasets,
and compared the results with more complex, domain specific approaches. Our method
exceeded the previous state of the art by a large margin on all datasets.


Second, we discovered that sampling hard negatives for a mini-batch is of vital
importance for an efficient training. The majority of existing methods were based
on an exhaustive search over the images of the whole dataset or a part of it. We
proposed BoN, a novel online strategy for sampling negatives which improved
training in terms of speed and accuracy, without introducing significant additional
computational cost. BoN is made of one auto-encoder with two fully connected
layers and one hash table where we store the binarized latent representations of the
auto-encoder for all input images in every training step. We provided an extensive
analysis of BoN, and we summarize our findings here. We analyzed the distributions of
minimal and average distances between samples inside a hash bin and in the full
embedding. The experimental results show that the minimal negative distance inside
a hash bin is slightly bigger than the minimal distance in the full embedding
space, meaning that the samples taken from a hash bin are not the hardest, but are
good candidates for a ranking loss. We measured the percentage of non-zero loss
triplets in a mini-batch in every training step for eight different sampling strategies,
and we showed that BoN outperforms all other online strategies. BoN does not
introduce a significant time overhead per training iteration, and it reduces the total
training time by 70%. Experimental results show that in the advanced stages of
training around 90% of samples either stay in the same hash bin, or move to a bin
that is at a Hamming distance smaller than 5. We evaluated the performance of
BoN against other sampling strategies under the same conditions (same backbone,
image augmentation and training parameters), and showed that BoN reached the
best results in the shortest time on two publicly available person re-identification
datasets. We compared our approach with other state-of-the-art methods for metric
learning and showed comparable results.
Finally, we confirmed our hypothesis that training a model for retrieval can be
done by maximizing the area under the ROC curve, even though ROC is a measure
most suitable for recognition. We proposed the AUC loss, a loss specially designed
for the maximization of a strong underestimate of the area under the ROC curve at the
mini-batch level. We designed an approximated, differentiable relaxation of the area
under the ROC curve by: (1) approximating the integral with a Riemann summation,
which can be computed efficiently while keeping the accuracy of the approximation
high by using a small step size ∆s; (2) approximating the Heaviside function with
a sigmoidal-like function, whose slope depends only on the step size ∆s and can
be numerically calculated. We showed that the AUC loss has only one relevant
metaparameter, which depends only on the step size ∆s and can be easily estimated.
We provided empirical evidence for several aspects of AUC. First, we analyzed the
impact of the ∆s parameter on the final performance. Experimental results show that
the performance can be significantly improved by decreasing the value of ∆s down to
0.03, after which the improvement becomes negligible. Second, we compared


the optimization of the AUC on the mini-batch level by using all positive and
negative pairs (batch all strategy), with the strategy where the hardest positive
and negatives pairs are used (batch hard strategy), and we show that the AUC
batch hard loss provides significantly better results. Finally, we compared the
results of AUC with the benchmark triplet loss, and with Wilcoxon loss which also
optimizes the area under the ROC curve. We showed that AUC loss is superior
on four publicly available datasets. The AUC loss, combined with ResNet50 as a
backbone architecture, achieves state-of-the-art results on three publicly available
datasets that are most commonly used for metric learning. Additionally, the AUC
loss achieves comparable performance to the more complex, domain specific, state-
of-the-art methods for vehicle re-identification.

7.2 Future Work


In this thesis we proposed several ways to tackle the problem of image retrieval.
However, we have not explored the possibilities of using our proposals in other
domains of information retrieval, such as document retrieval, textual information
retrieval, digital audio retrieval, etc. Taking into account that our proposals are not
domain specific, we believe that they can be of great interest in these other domains.
We will test the AUC loss on other tasks where it is applicable, such as image
recognition or classification. Additionally, we plan to investigate the usage of the
approximations that we applied to the AUC loss in order to estimate the area under
the precision-recall curve. This area corresponds to the mean average precision
(mAP), and it is a representative measure for several tasks such as image retrieval,
re-identification, multi-class classification and object detection.
Even though image retrieval has been a popular task for many years, there are
still some open questions. First of all, we noticed that domain specific backbone
architectures bring significant improvements in the final results. However, current
domain specific architectures are designed based on prior knowledge about the
input image characteristics. We wonder if there is another, better way to represent
images for the image retrieval task, without using prior knowledge about the
input images. Can the fact that the final task is ranking and not classification
be used to improve the representations obtained by the backbone architectures?
For example, adding attention modules at different stages of general purpose
architectures could help detect which features are useful for comparison,
even though they may not be useful for classification.
Another question that we want to raise is: “do we have enough data for training
models for retrieval?” Many datasets that are commonly used for retrieval are either
too small to provide a generalized overview of the problem (such as CUB-200), or
bigger, but already very close to being saturated (inShop). What would be the next
step in terms of data? Do we need more images per class or do we need more classes?
Or maybe both? Is there a possibility to use synthetic data to improve the current
datasets?
Finally, we want to discuss the way neural networks are trained for retrieval. As
stated before, artificial neural networks were inspired by neuroscience, and they
are designed to mimic the information processing that happens in the brain. So
far, we have been training these neural networks by showing them large amounts
of images and their labels, and expecting them to learn what is similar. This way
of learning would be analogous to learning by repetition in psychology, which is
considered to be the least efficient way of learning. Would it be useful to train
models for retrieval by using some version of reinforcement learning in order to
improve training efficiency?
These, among many others, are the current open problems in image retrieval.
We hope that this summary and discussion can serve to motivate researchers to
take further steps in their future research.


7.3 Publications
• Bojana Gajic, Ariel Amato, Ramon Baldrich, Carlo Gatta. Maximization of the
Area Under the ROC Curve for Metric Learning. Under review for the International
Conference on Computer Vision, 2021.

• Bojana Gajic, Ariel Amato, Carlo Gatta. Fast hard negative mining for deep
metric learning. In Pattern Recognition 112, 2020.

• Bojana Gajic, Ariel Amato, Ramon Baldrich, Carlo Gatta. Bag of Negatives for
Siamese Architectures. British Machine Vision Conference. Cardiff, 2019.

• Jon Almazan, Bojana Gajic, Naila Murray, Diane Larlus. Re-ID done right:
towards good practices for person re-identification. arxiv preprint, 2018.

• Bojana Gajic, Eduard Vazquez, Ramon Baldrich. Evaluation of Deep Image
Descriptors for Texture Retrieval. VISAPP. Porto, 2017.

7.4 Patents
• Ariel Amato, Angel Domingo Sappa, Carlo Gatta, Bojana Gajic, Brent Boekestein.
Object detection based on object relation. US Patent App. 16/584,400, 2021.

• Jon Almazan, Bojana Gajic, Naila Murray, Diane Larlus-Larrondo. Training
and using a convolutional neural network for person re-identification. US
Patent App. 16/675,298, 2020.

Bibliography

[1] Saghir Alfasly, Yongjian Hu, Haoliang Li, Tiancai Liang, Xiaofeng Jin, Beibei
Liu, and Qingli Zhao. Multi-label-based similarity learning for vehicle re-
identification. IEEE Access, 7:162605–162616, 2019.
[2] Anastasia Pentina, Viktoriia Sharmanska, and Christoph H. Lampert. Curriculum
learning of multiple tasks. In Proc. CVPR, 2015.
[3] Davide Baltieri, Roberto Vezzani, and Rita Cucchiara. 3dpes: 3d people
dataset for surveillance and forensics. In Proceedings of the 2011 joint ACM
workshop on Human gesture and behavior understanding, pages 59–64, 2011.
[4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust
features. In European conference on computer vision, pages 404–417. Springer,
2006.
[5] Apurva Bedagkar-Gala and Shishir K Shah. A survey of approaches and trends
in person re-identification. Image and Vision Computing, 2014.
[6] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Cur-
riculum learning. In Proc. ICML, 2009.
[7] Keno K Bressem, Lisa C Adams, Christoph Erxleben, Bernd Hamm, Stefan M
Niehues, and Janis L Vahldiek. Comparing different deep learning architec-
tures for classification of chest radiographs. Scientific reports, 10(1):1–16,
2020.
[8] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep metric
learning to rank. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1861–1870, 2019.
[9] Toon Calders and Szymon Jaroszewicz. Efficient auc optimization for classifi-
cation. In European Conference on Principles of Data Mining and Knowledge
Discovery, pages 42–53. Springer, 2007.
[10] Hervé Cardot and David Degras. Online principal component analysis in
high dimension: Which algorithm to choose? International Statistical Review,
86(1):29–50, 2018.


[11] Miguel A Carreira-Perpinán and Ramin Raziperchikolaei. Hashing with binary
autoencoders. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 557–566, 2015.

[12] Shuo Chen, Chen Gong, Jian Yang, Xiang Li, Yang Wei, and Jun Li. Adversarial
metric learning. arXiv preprint arXiv:1802.03170, 2018.

[13] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond
triplet loss: a deep quadruplet network for person re-identification. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 403–412, 2017.

[14] Yanbei Chen, Xiatian Zhu, and Shaogang Gong. Person re-identification by
deep learning multi-scale representations. In Proc. ICCV Workshop, 2017.

[15] Sumit Chopra, Raia Hadsell, Yann LeCun, et al. Learning a similarity metric
discriminatively, with application to face verification. In CVPR (1), pages
539–546, 2005.

[16] Ondřej Chum, Jiří Matas, and Josef Kittler. Locally optimized ransac. In Joint
Pattern Recognition Symposium, pages 236–243. Springer, 2003.

[17] Jesse Davis and Mark Goadrich. The relationship between precision-recall
and roc curves. In Proceedings of the 23rd international conference on Machine
learning, pages 233–240, 2006.

[18] Cheng Deng, Erkun Yang, Tongliang Liu, Jie Li, Wei Liu, and Dacheng Tao.
Unsupervised semantic-preserving adversarial hashing for image search.
IEEE Transactions on Image Processing, 28(8):4032–4044, 2019.

[19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet:
A Large-Scale Hierarchical Image Database. In Proc. CVPR, 2009.

[20] Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao. Deep fea-
ture learning with relative distance comparison for person re-identification.
PR, 2015.

[21] Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, and Jie Zhou. Deep ad-
versarial metric learning. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2780–2789, 2018.

[22] Bojana Gajic, Ariel Amato, Ramon Baldrich, and Carlo Gatta. Bag of negatives
for siamese architectures. In Proc. BMVC, 2019.


[23] Bojana Gajic, Ariel Amato, and Carlo Gatta. Fast hard negative mining for
deep metric learning. In PR, 2020.

[24] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear
pooling. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 317–326, 2016.

[25] Weifeng Ge. Deep metric learning with hierarchical triplet loss. In Proceedings
of the European Conference on Computer Vision, pages 269–285, 2018.

[26] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. It-
erative quantization: A procrustean approach to learning binary codes for
large-scale image retrieval. IEEE transactions on pattern analysis and machine
intelligence, 35(12):2916–2929, 2012.

[27] Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. Multi-scale
orderless pooling of deep convolutional activation features. In European
conference on computer vision, pages 392–407. Springer, 2014.

[28] Albert Gordo, Jon Almazán, Jérome Revaud, and Diane Larlus. Deep image
retrieval: Learning global representations for image search. In Proc. ECCV,
2016.

[29] Albert Gordo, Jon Almazán, Jerome Revaud, and Diane Larlus. End-to-end
learning of deep visual representations for image retrieval. IJCV, 2017.

[30] Albert Gordo and Diane Larlus. Beyond instance-level image retrieval: Lever-
aging captions to learn a global visual representation. In Proc. CVPR, 2017.

[31] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with
an ensemble of localized features. In Proceedings of the European Conference
on Computer Vision, pages 262–275. Springer, 2008.

[32] David Marvin Green, John A Swets, et al. Signal detection theory and psy-
chophysics, volume 1. Wiley New York, 1966.

[33] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by
learning an invariant mapping. In 2006 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages
1735–1742. IEEE, 2006.

[34] James A Hanley and Barbara J McNeil. The meaning and use of the area under
a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.


[35] Christopher G Harris, Mike Stephens, et al. A combined corner and edge
detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988.

[36] Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond,
et al. Smart mining for deep metric learning. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2821–2829, 2017.

[37] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.

[38] Kun He, Fatih Cakir, Sarah Adel Bargal, and Stan Sclaroff. Hashing as tie-aware
learning to rank. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4023–4032, 2018.

[39] Kun He, Yan Lu, and Stan Sclaroff. Local descriptors optimized for average
precision. In The IEEE Conference on Computer Vision and Pattern Recogni-
tion (CVPR), June 2018.

[40] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet
loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

[41] Chen Huang, Chen Change Loy, and Xiaoou Tang. Local similarity-aware deep
feature embedding. In Advances in neural information processing systems,
pages 1262–1270, 2016.

[42] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger.
Densely connected convolutional networks. In Proc. CVPR, 2017.

[43] Esteve Jaulent. El ars generalis ultima de ramón llull: presupuestos metafísi-
cos y éticos. In Anales del Seminario de Historia de la Filosofía, volume 27,
pages 87–113. Universidad Complutense de Madrid, 2010.

[44] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating
local descriptors into a compact image representation. In 2010 IEEE computer
society conference on computer vision and pattern recognition, pages 3304–
3311. IEEE, 2010.

[45] Xin Jin, Cuiling Lan, Wenjun Zeng, and Zhibo Chen. Uncertainty-aware multi-
shot knowledge distillation for image-based object re-identification. arXiv
preprint arXiv:2001.05197, 2020.


[46] Mahdi M Kalayeh, Emrah Basaran, Muhittin Gökmen, Mustafa E Kamasak,
and Mubarak Shah. Human semantic parsing for person re-identification. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1062–1071, 2018.

[47] Srikrishna Karanam, Mengran Gou, Ziyan Wu, Angels Rates-Borras, Octavia
Camps, and Richard J. Radke. A systematic evaluation and benchmark for
person re-identification: Features, metrics, and datasets. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 41(3):523–536, 2019.

[48] Mete Kemertas, Leila Pishdad, Konstantinos G Derpanis, and Afsaneh Fazly.
Rankmi: A mutual information maximizing ranking loss. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
14362–14371, 2020.

[49] Pirazh Khorramshahi, Neehar Peri, Jun-cheng Chen, and Rama Chel-
lappa. The devil is in the details: Self-supervised attention for vehicle re-
identification. arXiv preprint arXiv:2004.06271, 2020.

[50] Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon.
Attention-based ensemble for deep metric learning. In Proceedings of the
European Conference on Computer Vision, pages 736–751, 2018.

[51] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.

[52] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classifica-
tion with deep convolutional neural networks. Advances in neural informa-
tion processing systems, 25:1097–1105, 2012.

[53] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning
for latent variable models. In Advances in neural information processing
systems, volume 1, page 2, 2010.

[54] Dangwei Li, Xiaotang Chen, Zhang Zhang, and Kaiqi Huang. Learning deep
context-aware features over body and latent parts for person re-identification.
In Proc. CVPR, 2017.

[55] Wei Li and Xiaogang Wang. Locally aligned feature transforms across views.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pages 3594–3601, 2013.


[56] Wei Li, Rui Zhao, and Xiaogang Wang. Human reidentification with trans-
ferred metric learning. In Proceedings of the Asian Conference on Computer
Vision, pages 31–44. Springer, 2012.

[57] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing
neural network for person re-identification. In Proc. CVPR, 2014.

[58] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for
person re-identification. In Proc. CVPR, 2018.

[59] Wentong Liao, Michael Ying Yang, Ni Zhan, and Bodo Rosenhahn. Triplet-
based deep similarity learning for person re-identification. In MSF Workshop,
2017.

[60] Weiyao Lin, Yang Shen, Junchi Yan, Mingliang Xu, Jianxin Wu, Jingdong Wang,
and Ke Lu. Learning correspondence structures for person re-identification.
IEEE Transactions on Image Processing, 26(5):2438–2453, 2017.

[61] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song.
Sphereface: Deep hypersphere embedding for face recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 212–220, 2017.

[62] Xihui Liu, Haiyu Zhao, Maoqing Tian, Lu Sheng, Jing Shao, Shuai Yi, Junjie Yan,
and Xiaogang Wang. Hydraplus-net: Attentive deep features for pedestrian
analysis. In Proceedings of the IEEE international conference on computer
vision, pages 350–359, 2017.

[63] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion:
Powering robust clothes recognition and retrieval with rich annotations. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 1096–1104, 2016.

[64] Yihang Lou, Yan Bai, Jun Liu, Shiqi Wang, and Lingyu Duan. Veri-wild: A large
dataset and a new method for vehicle re-identification in the wild. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3235–3243, 2019.

[65] David G Lowe. Distinctive image features from scale-invariant keypoints.
International journal of computer vision, 60(2):91–110, 2004.

[66] Jiri Matas, Ondrej Chum, Martin Urban, and Tomás Pajdla. Robust wide-
baseline stereo from maximally stable extremal regions. Image and vision
computing, 22(10):761–767, 2004.


[67] Dechao Meng, Liang Li, Xuejing Liu, Yadong Li, Shijie Yang, Zheng-Jun
Zha, Xingyu Gao, Shuhui Wang, and Qingming Huang. Parsing-based view-
aware embedding network for vehicle re-identification. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
7103–7112, 2020.

[68] Krystian Mikolajczyk and Cordelia Schmid. An affine invariant interest point
detector. In European conference on computer vision, pages 128–142. Springer,
2002.

[69] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local
descriptors. IEEE transactions on pattern analysis and machine intelligence,
27(10):1615–1630, 2005.

[70] George A Miller. Wordnet: a lexical database for english. Communications of
the ACM, 38(11):39–41, 1995.

[71] Carlton Wayne Niblack, Ron Barber, Will Equitz, Myron D Flickner, Eduardo H
Glasman, Dragutin Petkovic, Peter Yanker, Christos Faloutsos, and Gabriel
Taubin. Qbic project: querying images by content, using color, texture, and
shape. In Storage and retrieval for image and video databases, volume 1908,
pages 173–187. International Society for Optics and Photonics, 1993.

[72] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep
metric learning via facility location. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 5382–5390, 2017.

[73] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric
learning via lifted structured feature embedding. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4004–4012,
2016.

[74] Timo Ojala, Matti Pietikäinen, and David Harwood. A comparative study of
texture measures with classification based on featured distributions. Pattern
recognition, 29(1):51–59, 1996.

[75] Michael Opitz, Georg Waltner, Horst Possegger, and Horst Bischof. Deep
metric learning with bier: Boosting independent embeddings robustly. IEEE
transactions on pattern analysis and machine intelligence, 2018.

[76] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabu-
laries for image categorization. In 2007 IEEE conference on computer vision
and pattern recognition, pages 1–8. IEEE, 2007.


[77] Xuelin Qian, Yanwei Fu, Yu-Gang Jiang, Tao Xiang, and Xiangyang Xue. Multi-
scale deep learning architectures for person re-identification. In Proc. ICCV,
2017.

[78] Filip Radenovic, Giorgos Tolias, and Ondrej Chum. CNN image retrieval
learns from BoW: Unsupervised fine-tuning with hard examples. In Proc.
ECCV, 2016.

[79] Tanzila Rahman, Mrigank Rochan, and Yang Wang. Person re-identification
by localizing discriminative regions. In Proc. BMVC, 2017.

[80] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards
real-time object detection with region proposal networks. In Proc. NIPS, 2015.

[81] Jerome Revaud, Jon Almazan, Rafael S. Rezende, and Cesar Roberto de Souza.
Learning with average precision: Training image retrieval with a listwise loss.
In The IEEE International Conference on Computer Vision (ICCV), October
2019.

[82] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi.
Performance measures and a data set for multi-target, multi-camera tracking.
In Proceedings of the European Conference on Computer Vision, pages 17–35.
Springer, 2016.

[83] Herbert Robbins and Sutton Monro. A stochastic approximation method.
The Annals of Mathematical Statistics, pages 400–407, 1951.

[84] Michal Rolínek, Vít Musil, Anselm Paulus, Marin Vlastelica, Claudio Michaelis,
and Georg Martius. Optimizing rank-based metrics with blackbox differentia-
tion. arXiv preprint arXiv:1912.03500, 2019.

[85] Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Bjoern Ommer,
and Joseph Paul Cohen. Revisiting training strategies and generalization
performance in deep metric learning. arXiv preprint arXiv:2002.08473, 2020.

[86] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning
representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[87] M. Saquib Sarfraz, Arne Schumann, Andreas Eberle, and Rainer Stiefelhagen.
A pose-sensitive embedding for person re-identification with expanded cross
neighborhood re-ranking. In Proc. CVPR, 2018.

[88] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified
embedding for face recognition and clustering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[89] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna
Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations
from deep networks via gradient-based localization. In Proc. ICCV, 2017.

[90] Claude E. Shannon. Programming a computer for playing chess. Philosophical
Magazine, Ser. 7, 41(314):256–275, 1950.

[91] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carls-
son. CNN features off-the-shelf: An astounding baseline for recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, pages 806–813, 2014.

[92] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[93] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss
objective. In Advances in Neural Information Processing Systems, pages
1857–1865, 2016.

[94] Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Pose-
driven deep convolutional model for person re-identification. In Proc. ICCV,
2017.

[95] Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Deep attributes
driven multi-camera person re-identification. In Proc. ECCV, 2016.

[96] Yumin Suh, Bohyung Han, Wonsik Kim, and Kyoung Mu Lee. Stochastic
class-based hard example mining for deep metric learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages
7251–7259, 2019.

[97] Abhijit Suprem and Calton Pu. Looking GLAMORous: Vehicle re-id in hetero-
geneous cameras networks with global and local attention. arXiv preprint
arXiv:2002.02256, 2020.

[98] Abhijit Suprem, Calton Pu, and Joao Eduardo Ferreira. Small, accurate,
and fast vehicle re-id on the edge: The SAFR approach. arXiv preprint
arXiv:2001.08895, 2020.

[99] Michael J Swain and Dana H Ballard. Color indexing. International Journal of
Computer Vision, 7(1):11–32, 1991.

[100] Jonathan Swift. Gulliver’s travels. In Gulliver’s Travels, pages 27–266. Springer,
1995.

[101] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabi-
novich. Going deeper with convolutions. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[102] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew
Wojna. Rethinking the inception architecture for computer vision. In Proc.
CVPR, 2016.

[103] Tijmen Tieleman and G Hinton. Lecture 6.5-RMSProp: Divide the gradient by
a running average of its recent magnitude. COURSERA: Neural Networks for
Machine Learning, 4(2):26–31, 2012.

[104] Alan M Turing. Computing machinery and intelligence. In Parsing the Turing
Test, pages 23–65. Springer, 2009.

[105] Evgeniya Ustinova and Victor Lempitsky. Learning deep embeddings with
histogram loss. In Advances in Neural Information Processing Systems, pages
4170–4178, 2016.

[106] Remco C Veltkamp and Mirela Tanase. Content-based image retrieval sys-
tems: A survey. Technical report, Utrecht University, 2000.

[107] Cheng Wang, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang.
Mancs: A multi-task attentional network with curriculum sampling for person
re-identification. In Proceedings of the European Conference on Computer
Vision, pages 365–381, 2018.

[108] Chong Wang, Xue Zhang, and Xipeng Lan. How to train triplet networks
with 100k identities? In Proceedings of the IEEE International Conference on
Computer Vision Workshops, pages 1907–1915, 2017.

[109] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou,
Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face
recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5265–5274, 2018.

[110] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric
learning with angular loss. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2593–2601, 2017.

[111] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang,
James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity
with deep ranking. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1386–1393, 2014.

[112] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to
bridge domain gap for person re-identification. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.

[113] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In Advances
in Neural Information Processing Systems, pages 1753–1760, 2009.

[114] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona.
Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California
Institute of Technology, 2010.

[115] Wikipedia. Turing machine. https://en.wikipedia.org/wiki/Turing_machine,
June 2021.

[116] Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl.
Sampling matters in deep embedding learning. In Proceedings of the IEEE
International Conference on Computer Vision, pages 2840–2848, 2017.

[117] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint
detection and identification feature learning for person search. In Proc. CVPR,
2017.

[118] Yafu Xiao, Jing Li, Bo Du, Jia Wu, Jun Chang, and Wenfan Zhang. MeMu:
Metric correlation Siamese network and multi-class negative sampling for
visual tracking. Pattern Recognition, 100:107170, 2020.

[119] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Ag-
gregated residual transformations for deep neural networks. In Proc. CVPR,
2017.

[120] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-
aware compositional network for person re-identification. In Proc. CVPR,
2018.

[121] Hong Xuan, Richard Souvenir, and Robert Pless. Deep randomized ensembles
for metric learning. In Proceedings of the European Conference on Computer
Vision (ECCV), pages 723–734, 2018.

[122] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable
are features in deep neural networks? arXiv preprint arXiv:1411.1792, 2014.

[123] Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-aware deeply cascaded
embedding. In Proceedings of the IEEE International Conference on Computer
Vision, pages 814–823, 2017.

[124] Xinyu Zhang, Rufeng Zhang, Jiewei Cao, Dong Gong, Mingyu You, and Chun-
hua Shen. Part-guided attention learning for vehicle re-identification. arXiv
preprint arXiv:1909.06023, 2019.

[125] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training
deep neural networks with noisy labels. In Advances in Neural Information
Processing Systems, pages 8778–8788, 2018.

[126] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi,
Xiaogang Wang, and Xiaoou Tang. Spindle net: Person re-identification with
human body region guided feature decomposition and fusion. In Proc. CVPR,
2017.

[127] Liming Zhao, Xi Li, Yueting Zhuang, and Jingdong Wang. Deeply-learned
part-aligned representations for person re-identification. In Proc. ICCV, 2017.

[128] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and
Qi Tian. MARS: A video benchmark for large-scale person re-identification.
In Proc. ECCV, 2016.

[129] Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose invariant embed-
ding for deep person re-identification. arXiv, 2017.

[130] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and
Qi Tian. Scalable person re-identification: A benchmark. In Proceedings
of the IEEE International Conference on Computer Vision, pages 1116–1124,
2015.

[131] Zhedong Zheng, Liang Zheng, and Yi Yang. A discriminatively learned cnn
embedding for person re-identification. TOMM, 2017.

[132] Zhedong Zheng, Liang Zheng, and Yi Yang. Pedestrian alignment network for
large-scale person re-identification. arXiv, 2017.

[133] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated
by gan improve the person re-identification baseline in vitro. In Proceedings
of the IEEE International Conference on Computer Vision, pages 3754–3762,
2017.
