
Ego-motion estimation based on data fusion using

monocular camera and IMU of a Smartphone

Fabricio Amguaña1, Ronnie Esparza1, Fernando Muñoz1

1 Universidad de las Fuerzas Armadas ESPE, Sangolquí, Ecuador

Abstract. In this paper, our purpose is to develop a more portable, lower-cost system for ego-motion estimation and odometry, based on the state of the art and related work, this time using the inertial sensors and the camera of a smartphone. A first instance is presented in which video and IMU data taken from a public database are fused to estimate the trajectory followed by a vehicle.

Keywords: monocular camera, stereo vision camera, Lidar, feature points, estimation, ego-motion, indoor environments, outdoor environments, navigation system, visual odometry.

1 Introduction

Having a reliable navigation system is important for certain operations, since knowing the location of a person or object can prevent accidents or loss of human life. The applications are varied: for example, vehicle navigation, where the estimation is made in an outdoor environment, or the localization of people inside mines, where the environment is enclosed and the use of GPS is denied, so that an ego-motion estimation system is useful and would help safeguard human life [2][7].

Nowadays, navigation systems are integrated into smartphones, vehicles and even robots [1]. In some cases the use of GPS becomes a problem because of its limited accuracy and its dependence on connectivity; a viable alternative that has been tested successfully is computer-vision analysis to estimate the pose. Our goal is to study the different methods used to estimate motion with a smartphone, since this involves not only the use of its monocular camera but also the possibility of using its inertial sensors to improve the estimate.

As an academic contribution, we intend to present the different alternatives and concepts used by researchers, comparing the algorithms they applied to solve the problem. We also seek to make the information more compact, so that readers have an idea of what to do next and their time is well spent [6].

This document is organized as follows: Section 2 describes the related work, Section 3 presents our approach, Section 4 details the procedure followed to perform the simulations, and the test results and conclusions are given in Sections 5 and 6, respectively.

2 Related Works

The literature contains several works that show different methods for ego-motion estimation and navigation systems in indoor or outdoor environments; the hardware used also influences the motion estimate.

At the International Conference on Indoor Positioning and Indoor Navigation in 2012, J. Ágila, B. Link, P. Smith, and K. Wehrle presented a motion-estimation system oriented toward people with disabilities, using smartphones because they are portable devices and for their benefits such as connectivity, camera and sensors [4]. Their contribution was a system for indoor environments that analyzes the optical flow encoded in the motion vectors of a live video stream; they carried out tests with subjects inside a building to determine their trajectory and monitor each individual, in order to know whether a person was lost or had intentionally deviated from the route.

In 2015, M. Tommasi, P. Anedda, S. Pundlik, J. Zheng and G. Luo published the article "Smartphone sensor to improve image-based scene recognition for indoor localization" [7]. They propose a method where pairs of snapshots taken in different directions are matched against a set of 360° panoramic images, each representing a location in an indoor environment. Their method shows that the consistency between the calculated snapshot rotation and the rotation measured by the motion sensors helps reduce false matches.

A thesis presented in 2015, "A Monocular Visual Odometry System under Real-Time Restrictions" [8], addresses monocular visual odometry for estimating a robot's motion. This requires extracting information from the sequence of images taken as input, that is, locating the characteristic points of each captured image and tracking them along the images of the sequence. Considerations such as the elimination of anomalous data (external noise) with RANSAC are taken into account, with the limitation that the system does not run in real time.

In 2013, A. Geiger, P. Lenz, C. Stiller, and R. Urtasun created the KITTI dataset for use in mobile robotics and autonomous driving research [6]. In total, they recorded 6 hours of traffic scenarios at 10-100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The data are calibrated, synchronized and timestamped, and both rectified and raw image sequences are provided.

"Monocular Visual Odometry" is presented in 2013 by L. Ricci, where it explains a


somewhat-rhythm of monocular visual odometry, which has been implemented in
OpenCV / C ++. In conjunction with the KITTI data (using only one image of the stereo
data set). In addition they make use of the RANSAC method for the elimination of
anomalies, thus giving a more acceptable result.

3 Our Approach

Based on the previous work on odometry with a monocular camera, we intend to build a more portable, lower-cost system, this time using the inertial sensors and the camera of a smartphone.
Overview of the visual odometry system: consecutive images 𝐼𝑘 and 𝐼𝑘+1 enter the vision system, which performs feature detection, projection, anomaly elimination, state filtering and kinematic calculation.

As an initial job, the fusion of monocular camera data and inertial sensor data is done offline. The data are taken from a public database, the KITTI dataset, which presents recordings at 100 Hz of traffic in the city of Karlsruhe, Germany; they are separated into frames with their corresponding IMU and GPS data.
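As an illustration of this offline pairing, the sketch below assumes the standard KITTI raw-data layout (one OXTS text file per camera frame under oxts/data, images under image_02/data); the paths and field indices are assumptions for illustration, not part of our own recordings.

```python
import os

# Minimal sketch: pair each KITTI camera frame with its OXTS (IMU/GPS) record.
# Assumes the standard KITTI raw-data layout; paths are illustrative only.
drive_dir = "2011_09_26/2011_09_26_drive_0001_sync"
image_dir = os.path.join(drive_dir, "image_02", "data")
oxts_dir = os.path.join(drive_dir, "oxts", "data")

frames = sorted(f for f in os.listdir(image_dir) if f.endswith(".png"))
oxts_files = sorted(f for f in os.listdir(oxts_dir) if f.endswith(".txt"))

pairs = []
for frame_name, oxts_name in zip(frames, oxts_files):
    with open(os.path.join(oxts_dir, oxts_name)) as f:
        values = [float(x) for x in f.read().split()]
    # In the KITTI OXTS record, fields 11-13 are typically the accelerations (ax, ay, az).
    accel = values[11:14]
    pairs.append((os.path.join(image_dir, frame_name), accel))

print(f"{len(pairs)} frame/IMU pairs prepared for offline fusion")
```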

The reason for using this database is that, in order to merge camera and inertial-sensor data, the capture of each frame must coincide with the sensor readings, which turned out to be a problem since there are no applications on the market that allow such synchronization.
The planned solution is to develop an Android application where these data are stored automatically for further processing, laying the foundation for a system that works in real time.

4 Procedure

First, extensive research was carried out on projects already developed on the subject: papers, documents, theses and publications. Each of them describes the algorithm used to solve the problem, but none shares a complete implementation; only partial code in C++, Java, Matlab and other languages is available. We therefore collected these contributions and combined the most useful ones for the project to obtain the result shown below [5], [6], [7].

4.1. Data collection

The IMU data and the video were captured simultaneously using the same smartphone, making two videos: the first covering a certain trajectory inside the Sangolquí campus of the Universidad de las Fuerzas Armadas ESPE, and the second going completely around the campus perimeter, this time filming from inside a vehicle.

4.1.1. Inertial data: for the inertial data collection, a mobile application called "IMU + GPS-Stream" was used, which provides data from the accelerometer, gyroscope and magnetometer in the following format:

Timestamp [sec], sensorid, x, y, z, sensorid, x, y, z, sensorid, x,y,z, sensorid, x, y, z

Sensor id:
3. Accelerometer [m/s^2]
4. Gyroscope [rad/s]
5. Magnetometer [micro-Tesla]

Only the accelerometer data were used for the fusion with the images.
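As a minimal sketch of this step (assuming the comma-separated format shown above; field positions may differ between versions of the application, so this is illustrative rather than the exact parser used), one line of the stream can be reduced to its accelerometer reading as follows:

```python
def parse_accelerometer(line: str):
    """Extract (timestamp, ax, ay, az) from one "IMU + GPS-Stream" line.

    The line is assumed to be comma-separated:
    timestamp, sensorid, x, y, z, sensorid, x, y, z, ...
    Only the accelerometer group (sensor id 3) is kept, as in our fusion step.
    """
    fields = [float(v) for v in line.strip().split(",") if v.strip()]
    timestamp, rest = fields[0], fields[1:]
    # Walk the stream in groups of (sensorid, x, y, z).
    for i in range(0, len(rest) - 3, 4):
        if int(rest[i]) == 3:  # 3 = accelerometer [m/s^2]
            return timestamp, rest[i + 1], rest[i + 2], rest[i + 3]
    return None

# Example line (illustrative values only):
sample = "1533.204, 3, 0.12, 9.78, 0.05, 4, 0.001, 0.002, 0.000, 5, 22.1, -3.4, 40.2"
print(parse_accelerometer(sample))
```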

4.1.2. Video: we used the camera of a smartphone, a Sony Xperia X, considered a mid-range model, filming at an HD resolution of 720 × 1280 and at 30 fps.

Finally, the VLC media player application was used to separate the video into frames.

Data collection and video recording were done simultaneously because the application allows data to be captured in the background.
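Although VLC was used here, the same frame separation could also be scripted; the following is a hedged OpenCV sketch with an illustrative file name and output folder, not the procedure actually followed.

```python
import os
import cv2

# Minimal sketch: split the 30 fps smartphone video into individual frames.
# File names and the output folder are illustrative assumptions.
os.makedirs("frames", exist_ok=True)
video = cv2.VideoCapture("campus_drive.mp4")
index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    cv2.imwrite(f"frames/frame_{index:05d}.png", frame)
    index += 1
video.release()
print(f"Extracted {index} frames")
```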

4.2 Data correspondence

The problem was that the application does not specify the frequency of the data collection, so it was determined experimentally using the following formula, once all the video frames and IMU data (which outnumbered the frames) had been collected:

IMU data per frame = N_data / N_frame
where N_data is the total number of IMU samples recorded during filming and N_frame is the total number of frames; with this it was determined that between 7 and 8 samples were taken per frame.

With this, to achieve a more realistic approximation, the data were divided into groups of 8 and the fourth sample was taken as the intermediate point, thus matching the number of data with the number of frames.

However, because this number is not exact, the leftover samples were discarded so that the data coincide with the total number of frames.
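A minimal sketch of this correspondence step, under the assumption that the accelerometer samples are already in a list, could look as follows (the group size of 8 and the fourth sample as intermediate point follow the description above; names are illustrative):

```python
def match_imu_to_frames(imu_samples, n_frames, group_size=8, middle_index=3):
    """Keep one IMU sample per frame.

    The samples are split into consecutive groups of `group_size`; the sample
    at `middle_index` (the fourth one) is taken as the intermediate point for
    the corresponding frame. Leftover samples that do not fill a group, or
    exceed the number of frames, are discarded.
    """
    per_frame = []
    for start in range(0, len(imu_samples) - group_size + 1, group_size):
        per_frame.append(imu_samples[start + middle_index])
    return per_frame[:n_frames]

# Example: roughly 8 samples per frame, as determined experimentally.
samples = list(range(100))          # stand-in accelerometer readings
print(len(match_imu_to_frames(samples, n_frames=12)))  # -> 12
```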

4.3. Estimation

The procedure starts with the acquisition of images img_1 and img_2 at instants t1 and t2 respectively; these two images serve as the starting point for the motion calculation.

Then the FAST method is used to detect and track the characteristic points of the images. FAST was chosen because, according to Scaramuzza, although it is a corner detector, it shares with blob detectors such as SIFT or SURF the property of being scale invariant, which is an advantage in the changing scenery found in cities.

Next, correspondences between the points of contiguous images are searched for by calculating the optical flow between the two instants of time. A filtering process is then carried out, consisting of the elimination of anomalous values with the RANSAC method. Finally, the motion is calculated, its trajectory is plotted and the progress of the frames that make up the video is observed.
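A hedged OpenCV sketch of this estimation loop is shown below: FAST keypoints from the first frame are tracked into the second with Lucas-Kanade optical flow, the essential matrix is estimated with RANSAC to reject outliers, and the recovered rotation and translation are accumulated into a trajectory. The intrinsics and the scale value are placeholder assumptions, not our calibration.

```python
import cv2
import numpy as np

# Illustrative camera intrinsics; the real values come from calibration.
focal, pp = 718.8, (607.1, 185.1)

def estimate_step(img_1, img_2, R_total, t_total, scale=1.0):
    """One visual-odometry step between consecutive grayscale frames."""
    # 1. FAST feature detection on the first image.
    fast = cv2.FastFeatureDetector_create(threshold=25)
    keypoints = fast.detect(img_1, None)
    pts_1 = np.float32([k.pt for k in keypoints]).reshape(-1, 1, 2)

    # 2. Track the points into the second image (Lucas-Kanade optical flow).
    pts_2, status, _ = cv2.calcOpticalFlowPyrLK(img_1, img_2, pts_1, None)
    good = status.ravel() == 1
    pts_1, pts_2 = pts_1[good], pts_2[good]

    # 3. Essential matrix with RANSAC outlier rejection, then recover the motion.
    E, mask = cv2.findEssentialMat(pts_2, pts_1, focal=focal, pp=pp,
                                   method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_2, pts_1, focal=focal, pp=pp, mask=mask)

    # 4. Accumulate the motion (the scale would come from the IMU/GPS data).
    t_total = t_total + scale * R_total.dot(t)
    R_total = R.dot(R_total)
    return R_total, t_total

# Usage: R, t = np.eye(3), np.zeros((3, 1)); then call estimate_step per frame pair.
```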

5 Results

Below we present the results of running the program with the data taken while driving around the perimeter of the university, compared with Google satellite views.
Figure 1. Comparison of trajectories: (a) trajectory resulting from the estimation, (b) trajectory of the real road taken from Google Maps, (c) satellite image from Google Maps.

The program appears to work correctly, plotting a trajectory of the university perimeter that is consistent with the satellite views.

6 Conclusions

Performing the estimation offline is a way of testing and tuning the system according to our needs, since changes can be made faster and several optimization criteria can be tried. It can be considered an aid, or a preliminary step toward the implementation of a real-time system; in this way errors are found as early as possible and solved.

The code worked correctly on outdoor data processed offline; however, when moving on to real-time estimation, data acquisition will depend on the place where the tests are carried out and on the image-processing method, so further investigation into optimizing data acquisition is needed.

Even though the code seems to work, the scale was not analyzed; for future work, storing the data in a database in order to analyze the scale should be considered.

7 References

[1] I. Jorge et al., “Implementación del detector y descriptor SURF en dispositivos móviles con sistema operativo Android,” no. 5, pp. 5232–5237, 2014.
[2] K. Zindler, N. Geiß, K. Doll, and S. Heinlein, “Real-Time Ego-Motion Estimation Using Lidar and a Vehicle Model Based Extended Kalman Filter,” 2014.
[3] A. Milella and R. Siegwart, “Stereo-Based Ego-Motion Estimation Using Pixel Tracking and Iterative Closest Point,” 2006.
[4] J. Ágila, B. Link, P. Smith, and K. Wehrle, “Indoor Navigation on Wheels (and Without) using Smartphones,” 2012.
[5] A. Singh, “Monocular Visual Odometry,” p. 11, 2015.
[6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets Robotics: The KITTI Dataset,” 2016.
[7] Sunzuolei, “Monocular Visual Odometry with RANSAC-based Outlier Rejection,” 2012.
[8] Doctoral thesis, “Un Sistema de Odometría Visual Monocular bajo Restricciones de Tiempo Real,” 2015.
[9] L. Ricci, “Monocular Visual Odometry,” 2013.
