

Loop Closure Detection for Monocular Visual Odometry: Deep-Learning Approaches Comparison

Abstract
In order to decrease monocular visual odometry drift by detecting loop closures, this paper
presents a comparison between state-of-the-art 2-channel and Siamese Convolutional Neural
Networks. The work consists of training these networks so that they can robustly identify
loop closures. Since the task takes two input images, we train and test both the 2-channel
and the Siamese architecture of each network.
Keywords: visual odometry, loop-closure, deep learning, convolutional neural network

1. Introduction
The wide diffusion of unmanned vehicles in various fields raises the problem of estimating
robot position and orientation. Internal and external sensors, such as gyroscopes,
accelerometers or GPS, are used for this purpose. However, these sensors are subject to
drift, noise and jamming. In order to enhance navigation skills and position estimation
accuracy, a camera-based approach is adopted.

Monocular Visual Odometry is the operation of estimating the egomotion of an unmanned
vehicle using the input of a single camera attached to it, by matching features between
successive frames and recovering the path incrementally. Thus, visual odometry, as it is, is
unable to detect loop closure. This leads to drift growth. Therefore, loop closure informa-
tion is an important constraint not only to optimize estimation but also to recognize already
visited places.
The issue can be summarized as comparing a new place with previously visited ones: the
last incoming frame is compared to already captured and stored frames in order to recognize
the place. This task is often challenging due to changes in brightness, viewing angle and
the continuous appearance change of places, which leads to false positives or false negatives.
In other words, a false positive occurs when two different places are deemed similar by
the algorithm, while a false negative occurs when the same place looks different to the
algorithm.
While papers like [7], [8] and [9] used visual descriptors to build visual dictionaries and
Bag-of-Words (BoW), recent papers, like [6], used Convolutional Neural Networks (CNN)
in order to perform loop closure detection and scene recognition.
Figure 1: Before Loop Closure
Figure 2: After Loop Closure

In this paper we present a comparative study between different state-of-the-art CNN models.
Both 2-channel and Siamese variants are studied for each model. Section II presents related
work on deep learning approaches to loop closure detection. In Section III, we describe our
approach. Section IV shows our results, which are discussed in Section V.

2. Related Work
To detect loop closure using appearance-based methods, there are mainly two approaches:
hand-crafted methods like Bag-of-Words (BoW) and deep learning approaches, using neural
networks.

BoW was initially developed to extract features from texts in order to describe, model
and classify documents [10] by counting word frequencies. This technique was then adapted
to the computer vision context, where lexical words become visual words, namely, a set of
extracted visual features. There is a wide variety of features, like SIFT [11], SURF [12]
and ORB [13]. In computer vision, the BoW technique consists of extracting visual features,
clustering them and describing them using a histogram of visual words. This yields a
pose-invariant place description based on training image sequences. Thus, the resulting
model is limited to the training sequence, and loop closure detection can only be performed
in the known area. Fast Appearance-Based Mapping (FAB-MAP) [14] is an algorithm for place
recognition and mapping that uses BoW to define a probabilistic model of the environment.

Recently, a large variety of Convolutional Neural Network (CNN) methods has emerged.
Deep learning approaches are widely used for computer vision tasks, such as object
detection and classification [15], [16] and face recognition [17]. Such an approach allows
an abstract representation of the problem and makes its application more general than
the training context, in contrast to BoW. Some recent works have tried to solve the loop
closure problem using CNNs. While some papers like [18] proposed a novel architecture, others
used state-of-the-art, pretrained neural networks, like [2] and [6]. In fact, [2] evaluated
deep learning networks for loop closure detection in visual SLAM. This paper showed that
neural networks outperform BoW. It used pretrained models like AlexNet [19], CaffeNet [20]
and GoogLeNet [21]. It was based on multichannel networks, while we propose to use Siamese
networks in addition. It showed that AlexNet is the most accurate
in loop closure detection. Besides, [6] presented a Convolutional Neural Network approach
for estimating image-to-image similarity to detect loop closure in the visual SLAM process.
A 2-channel AlexNet was implemented to estimate the similarity between two input images;
it was tested on a cross-season dataset and showed satisfying results. [22] designed
a novel pyramid Siamese architecture to perform loop closure detection in a Simultaneous
Localization And Mapping context. This network processes RGB-D data as input.
Since we are dealing with image comparison, it is necessary to use a pair of images as
input to the neural network. [23] presented various deep learning architectures to compare
grayscale image patches. According to this paper, these architectures showed good
performance and better results than hand-crafted approaches.
In our paper we are interested in both the Siamese and the 2-channel architecture. These
architectures will be applied to state-of-the-art CNNs, winners of the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC). We will limit our study to three of them.

AlexNet [19] is the ILSVRC 2012 winner. It is composed of eight layers, of which five are
convolution layers, two are fully connected layers and one is a softmax layer; it has 60
million parameters and 650,000 neurons. While overlapping max pooling layers follow the
first two convolution layers, the third, fourth and fifth convolution layers are directly
connected. Then comes another overlapping max pooling layer, followed by two fully connected
layers. The top layer of the network is a softmax layer that can generate 1000 classes.
GoogLeNet [21] is the ILSVRC 2014 winner. It has 22 layers in total and around five
million parameters. This network stands out by its use of inception modules, of which it
has nine. These modules simultaneously perform different convolutions and concatenate the
resulting feature maps.
Residual Networks (ResNet) [24] won first place in the ILSVRC 2015. This architecture
introduced identity shortcut connections, called residual connections, which allow layer
skipping. In fact, before ResNet, CNNs were growing deeper and becoming hard to train due
to accuracy saturation. ResNet makes skipped layers fit a residual mapping instead of a
desired underlying mapping, and it converges faster than its plain equivalent.
Siamese networks were introduced in 1994 by Bromley et al. [25]. Initially, they aimed
at verifying signatures by learning convenient descriptors to compare inputs. This
architecture is made of two, or more, identical branches, called subnetworks, that share
the same weights and parameters (Fig. 3). Each branch takes a distinct input, but they
meet in the top conjoining layer, namely a contrastive loss function. During the training
phase, the back-propagation step simultaneously updates the weights of each subnetwork,
which means that fewer parameters need to be trained. In general, Siamese networks are
used for binary classification. Koch et al. employed a Siamese neural network in [26] to
perform one-shot learning on the MNIST dataset using a network pre-trained on a different
dataset, Omniglot.
Multi-channel architectures consider input images as a single image with multiple channels.
Figure 3: Siamese neural network architecture

In other terms, for a network that processes N RGB images (3 channels each), the input
patches can be seen as a single image with 3xN channels (Fig. 4). In order to detect loop
closure for visual SLAM, [6] estimated the similarity between two images by concatenating
their RGB channels. The approach employed AlexNet [19], which originally takes a 224x224x3
image; to measure image similarity, its input dimension became 224x224x6. According to
Zagoruyko et al. [23], such an architecture provides great flexibility and is fast to train.

Figure 4: Multi-Channel neural network
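
Concretely, forming such an input amounts to a single concatenation along the channel axis. The following minimal PyTorch sketch (our illustration; the paper provides no code) shows the operation:

```python
import torch

# Two RGB images as (channels, height, width) tensors.
img1 = torch.rand(3, 224, 224)
img2 = torch.rand(3, 224, 224)

# Stacking along the channel dimension yields a single 6-channel
# "image", the input expected by a 2-channel network.
pair = torch.cat([img1, img2], dim=0)
print(pair.shape)  # torch.Size([6, 224, 224])
```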

Our contribution in this paper consists in performing a thorough comparison between
the CNNs cited above, implemented in both the Siamese and the 2-channel architecture.
While [2] evaluated some neural networks as they are, and while [6] detected loop closure
using a 2-channel AlexNet, we propose in this work to evaluate six approaches to loop
closure detection: three 2-channel networks and three Siamese networks, based on AlexNet,
GoogLeNet and ResNet, respectively. We note that [2] and [6] dealt with visual
SLAM, which takes into account the whole map and map features, whilst we focus on position
estimation only. Finally, we justify the choice of AlexNet, GoogLeNet and ResNet by the
fact that they won the ILSVRC. Besides, AlexNet and GoogLeNet were used in [2].

3. Proposed Approach
Our work is divided into two main parts. We start with CNN adaptation and modification;
then, we prepare the dataset for training and testing. Loop closure detection is a task
based on image similarity. Therefore, we need to modify the input and output layers of the
models cited above: these CNNs were trained to classify 1000 objects, while we need
similarity estimation.

3.1. Networks adaptation and modification


3.1.1. Multi-channel architecture
All studied models undergo modifications that slightly change their top and bottom layers.
In fact, AlexNet and ResNet have a 224x224x3 input, while GoogLeNet has a 299x299x3 input.
In the case of the 2-channel architecture, their inputs become 224x224x6 and 299x299x6,
respectively. The RGB channels coming from each image are concatenated, as shown in Fig. 4.
This brings us back to the case of a single 6-channel image.
On the other hand, as the networks have to output the similarity of two images, the top
layer of each CNN is now made of a single neuron. Like [6], we use the sigmoid activation
function, which clamps the similarity to the [0, 1] range. Let x be the output of the last
fully connected layer; the similarity σ(x) is given by the following formula:
\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (1)
In this work, we trained the multi-channel models using the Mean Square Error (MSE) loss
function. It consists in summing the squared differences between the predicted values
$y_i^p$ and the ground truth values $y_i$, then dividing by $n$, the number of samples in
the batch:
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i^p - y_i \right)^2 \qquad (2)
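
As an illustration of both modifications, the following PyTorch sketch adapts AlexNet to the 2-channel setting. This is our own minimal reconstruction, not the authors' code: the layer indices assume the torchvision AlexNet implementation, and replacing the first convolution discards its pretrained weights.

```python
import torch.nn as nn
from torchvision import models

model = models.alexnet(pretrained=True)

# Bottom layer: accept a 6-channel input (two stacked RGB images).
# Note that the replaced convolution loses its pretrained weights.
model.features[0] = nn.Conv2d(6, 64, kernel_size=11, stride=4, padding=2)

# Top layer: a single neuron followed by a sigmoid, so the output is
# a similarity score clamped to [0, 1] as in Eq. (1).
model.classifier[6] = nn.Sequential(nn.Linear(4096, 1), nn.Sigmoid())

criterion = nn.MSELoss()  # MSE loss of Eq. (2)
```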

3.1.2. Siamese architecture


For the Siamese networks, our models keep their layers as they are. Instead, each model is
duplicated to form the branches of the Siamese architecture, as in Fig. 3. A Siamese
network takes an image pair as input, so each subnetwork takes a single image and generates
discriminative features for it. The top layers of the two branches are fed to a similarity
function layer. Classification loss functions are not suited to such a task, because the
problem is to differentiate the image pairs, not to classify them. A distance-based loss
function is therefore better suited. In this work, the contrastive loss function [4] is
used. It is given by the following formula:

CL = y\,d^2 + (1 - y)\,\max(m - d,\, 0)^2 \qquad (3)

where y is the ground truth. That is to say that y = 1 if the image pair is made of similar
images (loop closure in our case), y = 0 otherwise. The term d of the formula represents
the Euclidean distance between the discriminative features generated by each subnetwork.
Let $f_1$ and $f_2$ be the descriptions generated for each image; $d$ is then given by
$d = \sqrt{\sum_i (f_{1,i} - f_{2,i})^2} = \|f_1 - f_2\|_2$. Finally, m is a margin
term introduced to enforce the constraint: it is the minimum distance allowed between two
dissimilar images, so to minimize the loss, the distance between two dissimilar images
should be at least m. There are different ways to choose the margin m. A first way is to
adaptively update m during training, using the smallest error as the margin. A second way
is to use a constant margin m tuned manually.
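
The sketch below illustrates this setup in PyTorch: one shared-weight backbone applied to both images, trained with the contrastive loss of Eq. (3). It is a minimal sketch under our own assumptions (backbone choice, constant margin value), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ContrastiveLoss(nn.Module):
    """Contrastive loss of Hadsell et al. [4], Eq. (3), with a constant,
    manually tuned margin m (the second option above; 2.0 is an assumed
    value for illustration)."""
    def __init__(self, margin=2.0):
        super().__init__()
        self.margin = margin

    def forward(self, f1, f2, y):
        d = F.pairwise_distance(f1, f2)  # Euclidean distance between features
        return (y * d.pow(2)
                + (1 - y) * torch.clamp(self.margin - d, min=0).pow(2)).mean()

class SiameseNet(nn.Module):
    """A single backbone applied to both inputs: because it is the same
    module, the two branches share all weights, as in Fig. 3."""
    def __init__(self):
        super().__init__()
        self.branch = models.resnet50(pretrained=True)  # e.g. the ResNet variant

    def forward(self, x1, x2):
        return self.branch(x1), self.branch(x2)
```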

3.2. Dataset preparation


3.2.1. Pairs labeling
In this work, we based our dataset on the Visual Place Recognition Dataset from Bonn
University (Bonn subset)¹, which has the advantage of changing light conditions. We added
custom images taken from the Coppelia Robotics V-REP simulator, which we use for testing
and validating our processes. The dataset is made of RGB images (Table 1) and is divided
into two sets: negative pairs (without loop closure) and positive pairs (with loop closure)
(Fig. 5). In fact, each image of the dataset can form one or many positive pairs with other
images. From 2060 images, we generated 9000 pairs, split equally into positive and negative
pairs.

Table 1: Datasets specs


Dataset       VisualPlaceRecognition               Custom
Number        1035                                 1025
Image Size    960x540                              512x512
Type          RGB                                  RGB
Description   Outdoor, changing light conditions   Indoor, simulator generated

¹ http://www.ipb.uni-bonn.de/data/visual-place-recognition-datasets/

Figure 5: Samples from the dataset. Positive pairs (left), negative pairs (right).

3.2.2. Data preprocessing


In order to make our data suitable for the networks, we perform an online transformation.
In fact, our images have a size of 512x512 or 960x540, while our networks require 224x224
images (299x299 for GoogLeNet). First, each image is resized to 256x256. Then, we apply a
random crop to the 256x256 image to get a 224x224 image (299x299 for GoogLeNet).
We note that we apply the transformations to every single image and not to image pairs.
This allows us to improve network robustness by enlarging the range of spatial activation
statistics, and it helps take into account possible scene occlusion and obstruction.
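
A minimal torchvision sketch of this per-image pipeline for the 224x224 case (the transform classes are our assumption; the text only specifies the sizes):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),  # first resize every image to 256x256
    transforms.RandomCrop(224),     # then take a random 224x224 crop
    transforms.ToTensor(),
])
# Applied to each image of a pair independently, so the two crops can
# differ, which mimics partial occlusion and viewpoint change.
```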

4. Experiments and Results


In order to carry out a thorough comparative study between the networks, we used the same
hardware configuration, the same hyper-parameters and the same dataset for each network.
Our work is based on the Python framework PyTorch, and we used pretrained PyTorch models
for each kind of network: for GoogLeNet we used the inception_v3 implementation, and for
ResNet the resnet50 implementation. PyTorch is a machine learning library for Python,
released in 2016 by Facebook's team and based on the Torch framework. It provides GPU
acceleration for tensor computation, implements an automatic differentiation mechanism,
and offers many state-of-the-art pre-trained models.
All experiments were performed on the Google Colaboratory platform, using an Intel Xeon
2.20 GHz processor, 12 GB of RAM and an NVidia Tesla K80 GPU.

4.1. Training
In this work, we limit training to 30 epochs for each network, set the learning rate to
$10^{-4}$, and choose Stochastic Gradient Descent as the optimizer. We train on 80% of the
whole dataset, that is, 7200 image pairs, using batches of 11 image pairs each.
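
A minimal sketch of this training loop in the Siamese case; `model`, `criterion` and `train_set` (a dataset yielding (img1, img2, label) triples) are assumed to be defined as in the previous sections:

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(train_set, batch_size=11, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for epoch in range(30):
    for img1, img2, label in loader:
        optimizer.zero_grad()
        out1, out2 = model(img1, img2)       # Siamese forward pass
        loss = criterion(out1, out2, label)  # contrastive loss, Eq. (3)
        loss.backward()                      # back-propagation
        optimizer.step()
# 2-channel case: feed torch.cat([img1, img2], dim=1) to the adapted
# network and use the MSE loss of Eq. (2) instead.
```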
The following curves show the evolution of epoch duration in seconds. First, we compare
the 6-channel (Fig. 6) and Siamese networks (Fig. 7) separately. Then we perform an overall
comparison (Fig. 8).

Figure 6: 6-Channels training duration per epoch

Figure 7: Siamese training duration per epoch
Figure 8: Overall training duration per epoch

The training duration curves show that epoch durations are smoother for the Siamese
networks. Besides, the Siamese networks' training phase is shorter than that of the
6-channel networks. On the other hand, both GoogLeNet architectures take more time to
train than the ResNet and AlexNet architectures.

4.2. Validation
In order to validate the networks, we used 20% of our dataset, which amounts to 1800 pairs.
After 30 training epochs, the pretrained 6-channel AlexNet reaches 87.7% accuracy while the
Siamese AlexNet reaches 93% (Fig. 9). The Siamese GoogLeNet has 83.61% accuracy, and the
6-channel GoogLeNet reaches 82.38% after 25 training iterations (Fig. 10). Finally, the
Siamese ResNet arrives at 94.83% accuracy and the 6-channel ResNet reaches 93.55% (Fig. 11).

Figure 9: AlexNet validation accuracy

Figure 10: GoogLeNet validation accuracy
Figure 11: ResNet validation accuracy

Besides, our performance comparison relies on the confusion matrices of each of the six
networks (Table 2). The first column shows the multichannel networks; the second one is
dedicated to the Siamese networks.

Table 2: Networks Confusion Matrices
Multichannel Networks Siamese Networks
a) Multichannel AlexNet        b) Siamese AlexNet

c) Multichannel GoogLeNet      d) Siamese GoogLeNet

e) Multichannel ResNet         f) Siamese ResNet

Let us define the true positive rate as $\mathrm{TPR} = \frac{\text{TruePositive}}{\text{RealPositive}}$ and the true negative rate as $\mathrm{TNR} = \frac{\text{TrueNegative}}{\text{RealNegative}}$.
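
These rates can be computed directly from the four confusion-matrix counts, as in this small, hypothetical helper:

```python
def rates(tp, fn, tn, fp):
    """TPR, TNR and accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # TruePositive / RealPositive
    tnr = tn / (tn + fp)  # TrueNegative / RealNegative
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tpr, tnr, accuracy
```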
The following results are deduced from Table 2.

Table 3: Testing results


Network Accuracy TPR TNR
6-Chan AlexNet 87.7% 63.09% 96.2%
6-Chan GoogLeNet 82.38% 65.86% 96.65%
6-Chan ResNet 93.55% 84.64% 97.65%
Siamese AlexNet 93% 91.49% 94.53%
Siamese GoogLeNet 83.61% 88.83% 78.67%
Siamese ResNet 94.83% 96.42% 92.27%
5. Discussion
From the validation results presented above, we can clearly see that the Siamese network
versions outperform the multichannel ones. This is made clear in Tables 2 and 3. In Table
2, the confusion matrices are more 'diagonal' for the Siamese networks than for the
multichannel networks. Furthermore, in Figures 9, 10 and 11, the upper curves are those of
the Siamese architectures. This may be due to the fact that Siamese networks were introduced
to perform binary classification, which is our case. However, Table 3 shows that the
multichannel networks are slightly more specific.
On the other hand, Figures 9, 10 and 11 show that the Siamese ResNet is more accurate
than the other Siamese networks; the same goes for the multichannel ResNet. This can also
be deduced from the TPR values. Besides, ResNet's training duration is comparable to the
others'.

6. Conclusion and Future Work


In this work, we conducted a comparison between six convolutional neural networks for loop
closure detection in the context of Monocular Visual Odometry. We chose three
state-of-the-art CNNs, namely AlexNet, GoogLeNet and ResNet, and implemented a 6-channel
and a Siamese version of each one. We used open datasets for training, validation and
testing, and augmented the training dataset with simulator-generated images. The results
showed that the Siamese versions perform better than the 6-channel ones, especially in
terms of sensitivity: they have higher true positive rates.

In this paper, the images were already saved on disk, so the storage management issue did
not arise. In the real-time case, however, incoming frames have to be stored in order to
compare them to future images. That is why storage and real-time memory management is an
axis to explore in the future. Another future challenge is to incorporate a
deep-learning-based loop closure detection module into our monocular visual odometry
framework.

References
[1] Beeson, Patrick, Joseph Modayil, and Benjamin Kuipers. "Factoring the mapping problem: Mobile
robot map-building in the Hybrid Spatial Semantic Hierarchy." The International Journal of Robotics
Research (2009).
[2] Y. Xia, J. Li, L. Qi, H. Yu and J. Dong, ”An Evaluation of Deep Learning in Loop Closure Detection
for Visual SLAM,” 2017 IEEE International Conference on Internet of Things (iThings) and IEEE
Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing
(CPSCom) and IEEE Smart Data (SmartData), Exeter, 2017, pp. 85-91.
[3] Zagoruyko, Sergey & Komodakis, Nikos. (2015). Learning to Compare Image Patches via Convolutional
Neural Networks.

[4] Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality Reduction by Learning an Invariant
Mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition - Volume 2 (CVPR ’06), Vol. 2. IEEE Computer Society, Washington, DC, USA,
1735-1742.
[5] J. Bromley, I. Guyon, Y. Lecun, et al, Signature verification using a Siamese time delay neural network,
International Conference on Neural Information Processing Systems, Morgan Kaufmann Publishers Inc,
1993:737-744.
[6] Ma, Jiale & Qian, Kun & Ma, Xudong & Zhao, Wei. (2018). Reliable Loop Closure Detection Using
2-channel Convolutional Neural Networks for Visual SLAM. 5347-5352. 10.23919/ChiCC.2018.8483560.
[7] D. Filliat, "A visual bag of words method for interactive qualitative localization and mapping." IEEE
International Conference on Robotics and Automation, 2007: 3921-3926.
[8] M. Cummins and P. Newman, Highly Scalable Appearance Only SLAM - FAB-MAP 2.0, in Robotics:
Science and Systems (RSS), June 2009.
[9] Garcia-Fidalgo, Emilio & Ortiz, Alberto. (2018). iBoW-LCD: An Appearance-based Loop Closure
Detection Approach using Incremental Bags of Binary Words. IEEE Robotics and Automation Letters.
PP. 10.1109/LRA.2018.2849609.
[10] G. Salton and M. McGill. Introduction to modern information retrieval. McGraw-Hill computer science
series. McGraw-Hill.
[11] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer
Vision, 60(2):91–110, 2004.
[12] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In European Conference on
Computer Vision, pages 404–417. Springer, 2006.
[13] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In
Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011.
[14] Cummins, Mark and Newman, Paul. (2010). FAB-MAP: Appearance-Based Place Recognition
and Mapping using a Learned Visual Vocabulary Model. Proceedings of the 27th International
Conference on Machine Learning (ICML-10), June 21-24, 2010, pages 3-10.
[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional
networks. In CVPR, volume 1, page 3, 2017.
[16] Redmon, Joseph and Ali Farhadi. YOLO9000: Better, Faster, Stronger. 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (2017): 6517-6525.
[17] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and
clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
815–823, 2015.
[18] Merrill, Nathaniel and Huang, Guoquan. (2018). Lightweight Unsupervised Deep Loop Closure.
10.15607/RSS.2018.XIV.032.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural
Networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105, 2012.
[20] Jia, Yangqing & Shelhamer, Evan & Donahue, Jeff & Karayev, Sergey & Long, Jonathan & Girshick,
Ross & Guadarrama, Sergio & Darrell, Trevor. (2014). Caffe: Convolutional Architecture for Fast
Feature Embedding. MM 2014 - Proceedings of the 2014 ACM Conference on Multimedia.
10.1145/2647868.2654889.
[21] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2015.
[22] Zhang, Q., Mai, A., Menke, J., and Yang, A.Y. (2018). Loop Closure Detection with RGB-D Feature
Pyramid Siamese Networks. CoRR, abs/1811.09938.
[23] Zagoruyko, Sergey and Komodakis, Nikos. (2015). Learning to compare image patches via convolutional
neural networks. 4353-4361. 10.1109/CVPR.2015.7299064.
[24] Szegedy, Christian & Ioffe, Sergey & Vanhoucke, Vincent. (2016). Inception-v4, Inception-ResNet and
the Impact of Residual Connections on Learning. AAAI Conference on Artificial Intelligence.
[25] Bromley, Jane, et al. Signature verification using a siamese time delay neural network. Advances in

neural information processing systems. 1994.
[26] Koch, Gregory, Zemel, Richard and Salakhutdinov, Ruslan. "Siamese Neural Networks for One-shot
Image Recognition." ICML Deep Learning Workshop, 2015.

