
Parallelization of the PY3D monocular visual odometry algorithm

Dionata da Silva Nunes
Signal and Image Processing Laboratory and Electrical Engineering
Federal University of Rio Grande do Sul
Porto Alegre, Brasil 3308-3136
Email: dionata.nunes@gmail.com

Altamiro Amadeu Susin
Signal and Image Processing Laboratory and Electrical Engineering
Federal University of Rio Grande do Sul
Porto Alegre, Brasil 3308-3136
Email: altamiro.susin@ufrgs.br

Gustavo Ilha
Signal and Image Processing Laboratory and Electrical Engineering
Federal University of Rio Grande do Sul
Porto Alegre, Brasil 3308-3136
Email: cucailha@yahoo.com.br

Edison Pignaton De Freitas
Electrical Engineering
Federal University of Rio Grande do Sul
Porto Alegre, Brasil 3308-3136
Email: edison.pignaton@ufrgs.br

Abstract—This paper presents an improvement to the PY3D visual odometry algorithm aiming at real-time embedded solutions. The availability of multicore Systems on Chip demands new software approaches in order to better benefit from the computational power. In new complex digital system designs, power consumption is a severe restriction, and the clock frequency limitation pushes toward the exploitation of parallelism in order to cope with computationally intensive applications such as vision. We propose parallel branches by hardware replication where there is no data dependency, combined with a two-stage pipeline to complete the algorithm. The proposal minimizes downtime during processing, thus seeking better performance. The tests performed with the proposed method yielded good results, reaching twice the processing speed of the original algorithm. We conclude that the method is promising and can yield even greater gains by expanding the image subdivisions and applying further parallel processing. The number of cores that a processor provides can also help achieve even better performance gains.

1. Introduction

Currently, one of the relevant topics is Visual Simultaneous Localization and Mapping (VSLAM), due to its numerous advantages over other existing methods, such as cost, accessibility, precision, and independence from external signals, among others. Its applications can also be highlighted in robotics, augmented reality, and virtual-reality applications.

One of the problems with this type of application is the high response time. Applications that use this technology face many challenges, and one of the most pertinent is optimization and processing speed. This article proposes a method to minimize this problem through parallel processing, making it possible to perform more than one task at a time. The pipeline technique is also introduced to reduce processor downtime. The parallelism model is very useful for processing large amounts of information quickly and efficiently. This method has other inherent advantages due to the division of processing into tasks; one of the most important is the ability to distribute processing across the existing CPU cores. This distribution allows maximum utilization of the available resources, achieving considerable performance gains: in the case of this article, around 50 percent compared to the serialized version.

Some works that used this technique for performance improvements can be highlighted. Zhang's work [1] stands out for the parallel features used on NVIDIA GPUs; the implementation was centered on the consumption step, calculating particle weights and improving the response time [1]. Another recent method for improving the response time of visual SLAM was proposed by Pereira [2], whose proposal was to modify the Direct Sparse Odometry (DSO) SLAM algorithm and apply code parallelization techniques using OpenMP, an API that supports multi-platform shared-memory multiprocessing; several policy-based code modifications were made to make the SLAM algorithm considerably faster.

This article is structured as follows. It starts with a related works section, where the referenced works are explained and briefly compared with the proposed method. Next, a problem definition section contextualizes and defines the problem to be dealt with; right after, an overview of the proposal explains the method didactically, followed by a design and implementation section technically describing the work done. Then comes an experiments and results section, where the results are presented and discussed, and finally a conclusion section.
2. Related Works

The method proposed in this paper takes as reference works that also use parallelism to obtain performance gains, such as the already mentioned work by Pereira [2], which uses the OpenMP library to perform this task. The proposal of this paper does not use a ready-made library for parallel execution, but instead proposes an algorithm that parallelizes parts of the code to obtain better response times.

The work by Zhang [1] takes parallel processing to the Graphics Processing Unit (GPU), using an NVIDIA graphics card and the Compute Unified Device Architecture (CUDA) to achieve performance gains in Simultaneous Localization and Mapping (SLAM) algorithms, an approach that guarantees considerable performance gains by relying on hardware dedicated to the calculations. In the work proposed here, parallelism is performed on the CPU, without auxiliary hardware for the required calculations.

In the article by ENGEL [5], a method of visual odometry that uses a direct and sparse monocular configuration is presented, a combination indicated by the author as the best for real-time applications. For more robust applications, however, there is the indirect method, such as the one presented in the article by FORSTER [6], which proposed the SVO method (Semi-Direct Visual Odometry for Monocular and Multicamera Systems), a probabilistic depth estimation algorithm that allows tracking corners and weak edges in environments with little or high-frequency texture.

Another relevant proposal is the work of Mur-Artal [3], a robust system that implements the visual odometry algorithm. Among its relevant points, the good results regarding automatic corrections in difficult situations of movement disturbances can be highlighted.

One of the works researched, used both as a reference and as a source of material for the experiments, was the KITTI (Karlsruhe Institute of Technology) dataset. The project collected images using a car equipped with two color cameras and two PointGrey Flea2 grayscale video cameras (10 Hz, resolution: 1392 x 512 pixels, aperture: 90 degrees x 35 degrees), a Velodyne HDL-64E 3D laser scanner (10 Hz, 64-ray laser, range: 100 m), a GPS/IMU tracking unit with RTK correction signals (open-sky tracking errors < 5 cm) and a computer running a real-time database, collecting approximately 3 terabytes of data. [10]

3. Problem definition

The algorithms that implement the visual odometry method aim to allow an agent to estimate its spatial location using only visual information captured by cameras. There are variations in the technique that can be used; one of them concerns how the images are captured, which can follow three different lines: monocular, stereo and omnidirectional. The project in question was developed using the monocular method, as it is more attractive in terms of cost and simplicity, and more compact. Another classification related to visual odometry refers to the way information is processed, which can follow two main lines: the direct method and the indirect method. [5] [6]

In the direct method, information comes directly from the camera; in the indirect method, there is a pre-processing step on the images received from the camera, generating intermediate values that will be used in a second step referring to the probabilistic model. In this project, the indirect method was followed, generating geometric measures, such as points and vectors, and performing a geometric optimization, which in the case of the direct method is replaced by a photometric optimization. [4] [5]

The last definition made for this work concerns the amount of sampled points to be used. There are basically three levels of sampling: sparse, semi-dense and dense methods. The sparse methods use a limited amount of strategic image points, such as corners and edges; in Fig. 1 we have an example of the sparse method. There is no correlation between these points. In dense methods, the objective is to capture as many points as possible in the image, making a correlation between them. The project followed algorithms that implement the sparse method. [5]

Figure 1. Sparse sampling levels

To obtain performance gains in the reconstruction of three-dimensional maps using the sparse monocular visual odometry method, it is proposed to divide the input images into four parts to be processed in parallel, with each part of the image processed in two steps: extraction of points of interest and calculation of descriptors. The purpose of separating the method into two distinct steps is to be able to execute it as a pipeline: when an image's processing is already in the second step, the next image can be called to run the first stage of calculations, therefore minimizing downtime in the execution stream.

4. Proposal Overview

To process the images in the proposed method, the images are initially subdivided, so that each part of the image is processed in a new thread, unlike the traditional method in which the images are processed entirely in a single master thread. Fig. 2 represents the subdivisions made to the images in this work. For the tests, 4 subdivisions were produced, which can be scaled to new subdivisions in the future, depending on the processing cores existing in the hardware used. The images used to carry out the experiments were taken from the KITTI data set, a comparative reference set for visual odometry, which consists of 22 stereo sequences, with 11 sequences of real training trajectories and 11 sequences without ground truth for evaluation. [10]

Figure 2. Image subdivisions for parallel processing

To obtain even better performance gains, a pipeline for the algorithm was also implemented. For this part, the main tasks of the algorithm were subdivided into two large groups, called stage 1 and stage 2. These stages contain the tasks with the longest response time in the algorithm, which in its final step should generate a 3D map of the environment. Stage 1 contains the feature extraction calculation, in which the Shi-Tomasi algorithm is used to detect corners in the image. Stage 2 is assigned the task of tracking, in the next image, the features extracted in stage 1; Lucas-Kanade's algorithm was used for this task. The implementation of the pipeline allows parallel processing of stage 1 and stage 2: while features are being tracked in stage 2, it is already possible to extract new features from the next image. Fig. 3 shows a diagram of the operation flow of the proposed method, in which can be seen: image capture, image subdivision, the point extraction block (stage 1) and the descriptor calculation block (stage 2). Each stage is promoted to a new thread so that each stage can be processed in parallel.

Figure 3. Pipeline method application diagram

5. Design and Implementation

For the implementation, modifications were made to the PY3D visual odometry software developed by the Signal and Image Processing Laboratory (LaPSI), which is part of the Department of Electrical Engineering (DELET) and the Graduate Program in Electrical Engineering (PPGEE) of the Federal University of Rio Grande do Sul (UFRGS). In this algorithm, 3 main modifications were made:

• Implementation of image subdivisions;
• New offset calculations and image size;
• Code separation in two stages for parallelization.

Two inputs are necessary to generate the 3D map of the environment: the frame and the camera position. These inputs are needed to perform the calculations inherent to the process; Fig. 4 illustrates this step, after which the algorithm proceeds to the image subdivision stage. For the tests, a loop was implemented that loads the images from the KITTI data set and also the positions of the camera. For this implementation, the Python language was used, and the OpenCV library was used to handle the images.

Figure 4. Image Capture

In the original algorithm flow, everything is done in a single master thread, as seen in Fig. 5.

Figure 5. Pipeline method application diagram
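The four-quadrant subdivision with per-part offsets used by the proposal can be sketched as follows. This is a minimal illustration under our own naming, not the PY3D code (PY3D's splitImage and imgAttribute functions are only described in this paper, not reproduced here):

```python
import numpy as np

def split_image(image):
    """Split an image into four quadrants, as in the proposed method.

    Returns a list of (sub_image, (x_offset, y_offset)) pairs, where the
    offset locates each quadrant inside the original frame so that feature
    coordinates found in a quadrant can be mapped back to full-image
    coordinates.
    """
    h, w = image.shape[:2]
    mh, mw = h // 2, w // 2
    return [
        (image[:mh, :mw], (0, 0)),    # top-left quadrant
        (image[:mh, mw:], (mw, 0)),   # top-right quadrant
        (image[mh:, :mw], (0, mh)),   # bottom-left quadrant
        (image[mh:, mw:], (mw, mh)),  # bottom-right quadrant
    ]

# Example: a dummy 512 x 1392 grayscale frame (KITTI-like resolution).
frame = np.zeros((512, 1392), dtype=np.uint8)
parts = split_image(frame)
```

Mapping a quadrant-local feature coordinate back to the full frame is then just a matter of adding the stored offset to it.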

This proposal aims to break this flow into parallel processes, taking advantage of the various processing cores of the chip. To achieve efficient parallelization, it is necessary to divide the image into blocks distributed among the processors, in order to process several blocks of the same image. It is also possible to break the calculations into two stages and execute them in a pipeline (Fig. 6). So that the data resulting from the first stage can be used in the second stage, they are saved in a shared memory, from which they can be recovered and made available for the second stage to use to complete the calculations.
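The two-stage scheme with shared memory between the stages can be sketched with Python's threading module as follows. The stage functions here are placeholders standing in for the feature-extraction and tracking calculations, and the shared dictionary plays the role of the shared memory; none of the names are taken from the PY3D code:

```python
import threading

shared_results = {}        # shared memory between the two stages
lock = threading.Lock()    # serializes writes from the stage-1 threads

def stage1(part_id):
    """Placeholder for stage 1: point-of-interest extraction on one part."""
    features = [(part_id, i) for i in range(3)]   # dummy 'features'
    with lock:
        shared_results[part_id] = features

def stage2(part_id):
    """Placeholder for stage 2: tracking the features found in stage 1."""
    with lock:
        features = shared_results[part_id]
    return [(x, y + 1) for (x, y) in features]    # dummy 'tracked' points

# Stage 1 runs in one thread per image part...
threads = [threading.Thread(target=stage1, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
# ...and join() guarantees stage 1 has finished writing to the shared
# memory before stage 2 reads from it.
for t in threads:
    t.join()

tracked = [stage2(i) for i in range(4)]
```

The join() barrier between the stages is the same synchronization device the implementation relies on to avoid reading stage-1 results before they exist.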
Figure 6. Pipeline method application diagram

To ensure that the threads of the first stage are safely completed before the beginning of the second stage, a call to join() is made: it stops the execution of the current thread until the thread being joined is finished, and if the thread being joined is no longer active, the current thread does not need to wait. Fig. 7 shows the complete flowchart of the proposed system.

Figure 7. Flowchart

In the implementation of the image subdivisions, a function was developed to perform this task, splitImage(filename, numberCores). It takes only two parameters, the input image and the number of cores on the chip, and returns the images subdivided according to the number of cores. With Python's multiprocessing.cpu_count() function it is possible to find out the number of hardware cores, and the input image is read through the function cv2.imread(filename[, flags]) provided by OpenCV (Open Source Computer Vision Library), a multiplatform library for the development of computer vision applications, which contains modules for image processing that are useful in this implementation. After loading the input image, the developed function splitImage is called, in which the width and length of the image are extracted from the shape attribute, after which the image is proportionally clipped into four parts.

The tests carried out applied 4 subdivisions to the image. On the four new images, the developed function imgAttribute(filename) is called, which calculates the displacement (offset) of each one. The offsets in the original version of the code were defined at the center of the image, to be used as references in the calculation; since the input image has now been divided into 4 parts, the new offsets are defined as the lower right corner of image 1, the upper right corner of image 2, the lower left corner of image 3 and the upper left corner of image 4, as defined in Fig. 8.

Figure 8. New offset points

The new images are processed in threads (a way for a process to divide itself into two or more tasks that can be performed concurrently), going through two processing stages. In stage 1 are the point-of-interest extraction calculations. In the OpenCV library, some algorithms are available to perform this task. The algorithm chosen for this role was the Shi-Tomasi [11] corner detector, available in the cv2.goodFeaturesToTrack(...) function. The Shi-Tomasi detector has several input and output parameters. They can be seen in Fig. 9.

Figure 9. OpenCV function cv2.goodFeaturesToTrack

The image parameter receives the input image to be processed. The corners parameter receives the output vector with the corners detected by the algorithm. The maxCorners parameter indicates the maximum number of corners the algorithm returns. The qualityLevel parameter indicates the minimum value that a corner can have as weight. This parameter is multiplied by the value of the best corner detected by the algorithm, i.e., the smallest eigenvalue lambda. For example, if the best quality found is 3000 and the qualityLevel is 0.01, corners with a quality value less than 30 will be rejected. The minDistance parameter indicates the minimum distance that must exist between two corners. The mask parameter is used to shrink the region in which to search for corners, or to hide regions in which there is no interest in finding features. The blockSize parameter is used to indicate the size of the neighborhood centered on a pixel in which to look for corners. The useHarrisDetector parameter replaces the Shi-Tomasi detector with the Harris detector.

In stage 2, the descriptor calculations and optical flow tracking are reserved; for this task the Lucas-Kanade algorithm [12] was used. The principle of this algorithm is to search, in the following frame, for a pixel combination similar to the current features, using optical flow equations in a predefined neighborhood; the algorithm then returns a vector with the new coordinates of the positions of the detected features. The OpenCV library has a function that implements the Lucas-Kanade algorithm, cv2.calcOpticalFlowPyrLK(...). Lucas-Kanade optical flow tracking has several input parameters. They can be seen in Fig. 10.

Figure 10. OpenCV function cv2.calcOpticalFlowPyrLK

The prevImg parameter receives the previous grayscale frame of the processed video. The nextImg parameter receives the current frame of the video, also in grayscale. The prevPts parameter receives the coordinates of the selected features. The nextPts parameter predicts the location of the features in the current frame for easier tracking. The winSize parameter denotes the size of the window around the feature in which to match the pixel combination. The criteria parameter defines the stopping criterion when searching for the pixel combination; two options are available: stop after a certain number of iterations, using the cv2.TERM_CRITERIA_COUNT flag, or stop after the search window moves less than a given value, with the cv2.TERM_CRITERIA_EPS flag.

6. Experiments and Results

The algorithm performance tests were performed on the Windows 7 Professional 64-bit operating system. The hardware configuration used was as follows: an HP Z620 Workstation computer; an Intel(R) Xeon(R) E5-2643 v2 CPU @ 3.50 GHz, 3501 MHz, with 6 cores and 12 logical processors; and 32 GB of physical memory.

Fig. 11 presents a comparison of the execution times (in seconds) obtained with the serialized and parallelized versions of the PY3D algorithm for the first processing stage, the extraction of points of interest, which was configured to track 30000 points. Each point shown in Fig. 11 represents the time taken to process one image in the case of the original serial version and the time taken to process the four sub-images in the parallel version. The results show a significant performance of the parallelized version, which is on the order of twice as fast on average. To perform this time measurement, a subclass of the Thread class defined in Python, called timer(), is used; it is started by calling the start() function, capturing the time before and after processing, and applying formula 1 gives the time spent.

∆t = t − t0 (1)

To perform this measurement, 15 images taken from the KITTI dataset were used, with the same sequence of images for both the original algorithm and the parallel version.

Figure 11. Stage 1 processing comparison

Fig. 12 presents a comparison of times in the second processing stage, the calculation of the descriptors, in which a significant performance gain of the parallelized version over the serialized version is also observed, the former being twice as fast as the serial version. Analogously to the measurement performed for the first stage, here each point in Fig. 12 also represents one image for serial processing and four sub-images for parallel processing.

Figure 12. Stage 2 processing comparison

In Fig. 11 and Fig. 12 we see the potential of the parallel process as opposed to the serialized version of the PY3D algorithm. Through the use of threads, the subdivisions made in the input images are processed in parallel. This is only possible because threads are concurrent lines of execution: they execute simultaneously, but each with its own line, behaving as if they were different programs. Different programs use different memory areas, which makes communication more complex to perform, because it needs other techniques to exchange information, such as sockets, pipes, shared memories and others. For threads, on the other hand, the memory area is shared, making the exchange of information simpler and more efficient.

To guarantee the order of data access and prevent one thread from writing while another thread reads, some precautions were taken in this work. One of them was the use of the join() function to ensure the finalization of the thread set of the first stage before going to the second-stage processing set, which has a degree of dependence on the data processed in the first stage, thus avoiding possible occurrences of deadlocks.

The last comparative graph, Fig. 13, shows the total processing time of the images, and it can also be noted that performance gains of about two times faster are achieved. In Fig. 13 the performance gains are notorious; for a better understanding of why these gains occur, it is necessary to understand how parallel processing behaves in the CPU. First of all, it should be noted that the performance gains are directly linked to the number of processor cores that a computer has: on a computer with only one CPU, the threads do not run at the same time, and the operating system (OS) will schedule the threads to run a little at a time, only appearing to run concurrently. On the other hand, on a computer with more than one CPU, which in our case has 6 cores and 12 logical processors, as already mentioned, threads can run simultaneously, one on each available and free CPU. There may still be some concurrency issues, but the operating system can easily schedule around these potential problems.

Figure 13. Total Processing Time Comparison

One of the works that can be mentioned, and which also achieved good results, is the CUDA Accelerated Robot Localization and Mapping article [1], which used CUDA to parallelize the SLAM algorithm, enabling the use of functions running on the NVIDIA GPU, dedicated hardware for video processing. In the test performed in that article, the method was implemented and tested on a GeForce GTX 660 GPU and an Intel Core i5-3570K CPU [1]. The proposed implementation using CUDA texture memory increased performance by 11 times, an excellent result when it comes to running on dedicated video processing hardware. Correlating with the purpose of this paper, the paper discussed does not subdivide images in its parallelism implementation, so a combination of the two methods could be made for even more significant performance gains.

Analyzing the results of the article Pragma-oriented Parallelization of the Direct Sparse Odometry SLAM Algorithm [2], another work obtained very satisfactory gains, in large part because it was able to identify the bottleneck of the visual odometry algorithm used, which in this case was the addActiveFrame() function, which represented about 82 percent of the execution time. For C. Pereira's work [2], this analysis of the execution time of the functions was extremely important in order to define the best approach, which in this case was the CPU parallelization of the single loops in the code, using OpenMP directives. This article, in turn, has a slightly different approach, aiming to use the subdivision of images and a pipeline, but achieving gains as good as C. Pereira's [2]. In C. Pereira's work [2] performance gains were calculated using average execution times, i.e., the percentage reduction in execution time, for example, 43.5 percent.

One point that both works can share in the same analysis concerns the gains that can be obtained by increasing image resolution, which is easily explained: higher-resolution images mean more data per image, and therefore more calculations to be made, and in this situation more advantage is taken of the image parallelization.
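The time measurement pattern used in these experiments (capture the time before and after processing and apply formula 1, ∆t = t − t0) can be sketched as follows. The workload function is a placeholder of our own, not the PY3D code, and on a CPU-bound pure-Python workload the observed speedup will also depend on the interpreter's threading model:

```python
import threading
import time

def process_images(images):
    """Placeholder for one processing stage over a list of image parts."""
    return [sum(img) for img in images]

images = [list(range(10000)) for _ in range(4)]

# Serial version: t0 before, t after, delta_t = t - t0 (formula 1).
t0 = time.perf_counter()
serial_out = process_images(images)
t = time.perf_counter()
delta_t_serial = t - t0

# Parallel version: one thread per image part, joined before reading t.
t0 = time.perf_counter()
threads = [threading.Thread(target=process_images, args=([img],))
           for img in images]
for th in threads:
    th.start()
for th in threads:
    th.join()
delta_t_parallel = time.perf_counter() - t0
```

time.perf_counter() is used here because it is a monotonic clock intended for interval measurement, which is what formula 1 requires.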
7. Conclusion

It is concluded that the proposed parallel processing method is really promising, obtaining significant performance gains, on the order of twice as fast compared with the serial method of the original version of the PY3D algorithm, and considerably improving its response times.

The improvements can be even greater if the method is applied with a larger number of processing cores. This is also where future work can go deeper, that is, by further subdividing the images and performing parallel processing. Another point that can be explored in the future is the realization of this method on dedicated hardware, such as GPU (Graphics Processing Unit) processing.

References

[1] ZHANG, H.; MARTIN, F. CUDA accelerated robot localization and mapping. In: 2013 IEEE Conference on Technologies for Practical Robot Applications (TePRA). IEEE, 2013. p. 1-6.

[2] PEREIRA, Cesar; FALCAO, Gabriel; ALEXANDRE, Luís A. Pragma-oriented parallelization of the direct sparse odometry SLAM algorithm. In: 2019 27th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP). IEEE, 2019. p. 252-259.

[3] MUR-ARTAL, R.; MONTIEL, J. M. M.; TARDOS, J. D. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, v. 31, n. 5, p. 1147-1163, 2015.

[4] ENGEL, J.; SCHÖPS, T.; CREMERS, D. LSD-SLAM: Large-scale direct monocular SLAM. In: European Conference on Computer Vision. Springer, Cham, 2014. p. 834-849.

[5] ENGEL, Jakob; KOLTUN, Vladlen; CREMERS, Daniel. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 40, n. 3, p. 611-625, 2017.

[6] FORSTER, Christian et al. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, v. 33, n. 2, p. 249-265, 2016.

[7] SULEIMAN, Amr et al. Navion: A 2-mW fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones. IEEE Journal of Solid-State Circuits, v. 54, n. 4, p. 1106-1119, 2019.

[8] KIM, Hyungjin et al. RGB-D and magnetic sequence-based graph SLAM with kidnap recovery. In: 2018 18th International Conference on Control, Automation and Systems (ICCAS). IEEE, 2018. p. 1440-1443.

[9] PEREIRA, Fabio Irigon. High precision monocular visual odometry. 2018.

[10] GEIGER, A.; LENZ, P.; URTASUN, R. KITTI web site. 2012. [Online; accessed 2016-07-11]. Available from Internet: <http://www.cvlibs.net/datasets/kitti/eval odometry>.

[11] SHI, Jianbo; TOMASI, Carlo. Good features to track. In: Computer Vision and Pattern Recognition. 1994. p. 593-600.

[12] LUCAS, Bruce D. et al. An iterative image registration technique with an application to stereo vision. 1981.

[13] MATLOFF, Norman; HSU, Francis. Tutorial on threads programming with Python. University of California, 2007.

[14] YANG, Zhiyi; ZHU, Yating; PU, Yong. Parallel image processing based on CUDA. In: 2008 International Conference on Computer Science and Software Engineering. IEEE, 2008. p. 198-201.

[15] NEIRA, José; DAVISON, Andrew J.; LEONARD, John J. Guest editorial special issue on visual SLAM. IEEE Transactions on Robotics, v. 24, n. 5, p. 929-931, 2008.

[16] MOURAGNON, Etienne et al. Real time localization and 3D reconstruction. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). IEEE, 2006. p. 363-370.

[17] FUENTES-PACHECO, Jorge; RUIZ-ASCENCIO, José; RENDÓN-MANCHA, Juan Manuel. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, v. 43, n. 1, p. 55-81, 2015.

[18] MARQUES, R.; PAULINO, H.; ALEXANDRE, F.; MEDEIROS, P. D. Algorithmic skeleton framework for the orchestration of GPU computations. In: Euro-Par 2013. Lecture Notes in Computer Science, n. 8097. Springer-Verlag, 2013. p. 874-885.

[19] DAVISON, Andrew J. Real-time simultaneous localisation and mapping with a single camera. In: ICCV. 2003. p. 1403-1410.

[20] VELOSO, A.; MEIRA, W.; FERREIRA, R. et al. Asynchronous and anticipatory filter-stream based parallel algorithm for frequent itemset mining. In: European Conference on Principles of Data Mining and Knowledge Discovery, 2004.

[21] LI, Jincheng et al. Realization of CUDA-based real-time multi-camera visual SLAM in embedded systems. Journal of Real-Time Image Processing, p. 1-15, 2019.
