

Vehicle Position Estimation with Aerial Imagery from
Unmanned Aerial Vehicles
Friedrich Kruber1, Eduardo Sánchez Morales1,
Samarjit Chakraborty2, Michael Botsch1

Abstract— The availability of real-world data is a key element for novel developments in the fields of automotive and traffic research. Aerial imagery has the major advantage of recording multiple objects simultaneously and overcomes limitations such as occlusions. However, there are only few data sets available. This work describes a process to estimate a precise vehicle position from aerial imagery. A robust object detection is crucial for reliable results, hence the state-of-the-art deep neural network Mask-RCNN is applied for that purpose. Two training data sets are employed: The first one is optimized for detecting the test vehicle, while the second one consists of randomly selected images recorded on public roads. To reduce errors, several aspects are accounted for, such as the drone movement and the perspective projection from a photograph. The estimated position is compared with a reference system installed in the test vehicle. It is shown that a mean accuracy of 20 cm can be achieved with flight altitudes up to 100 m, Full-HD resolution and a frame-by-frame detection. A reliable position estimation is the basis for further data processing, such as obtaining additional vehicle state variables. The source code, training weights, labeled data and example videos are made publicly available. This supports researchers to create new traffic data sets with specific local conditions.

I. INTRODUCTION
Real-world data is essential for automotive research and traffic analysis. The publicly available data sets can be mainly split in two groups: The first group provides data from the vehicle perspective as provided in the KITTI [1], Waymo [2] or Audi A2D2 [3] data sets. These kinds of data sets boost research in terms of in-vehicle functionality, mainly computer vision tasks. In order to analyze traffic with an overall view on the situation, other approaches are preferable. Commonly used infrastructure sensing technologies like inductive loops provide accurate accumulated traffic data, while not being capable of providing individual trajectories of traffic participants. Aerial imagery from Unmanned Aerial Vehicles (UAV), usually drones, overcomes this limitation. Yet, only few bird's-eye-view data sets are available, while interest is growing due to technological progress. Recently published work and data sets are discussed in Section II. The behavior of traffic participants and infrastructural conditions differ throughout the world. This underlines the need for data according to the local specifics. The present work describes a process to generate reliable position data. Figure 1 depicts an example from the experiments of Section IV. For further details on how to obtain additional vehicle state variables from UAV imagery, given the position estimation, the reader is referred to [4].

Fig. 1: Bird's eye view on an orthorectified map of the test track at 50 m flight altitude. Depicted is the trajectory of the reference in black and of the drone in red, respectively. The car was driven in spirals to capture different vehicle poses and positions in the image frame. The blue rectangle and cross indicate the image frame and center. For estimation purposes, a true-to-scale T-junction is painted white on the test track.

Aerial remote sensing measurements have various advantages, such as dozens of objects can be captured in parallel with one sensor. UAVs are versatile by means of locations and covered area on ground. Also, the hovering position can be chosen to reduce occlusion. The generated data suits both research on individual traffic participant behavior and its prediction [5], [6], as well as accumulated traffic flow analysis [7], [8]. Batteries are the bottleneck, but this disadvantage can nowadays be compensated by tethered systems, which allow flight durations of several hours. Wind and water resistance, alongside the low-light capabilities of cameras, are constantly improving, but remain a weak point.

Before generating such bird's-eye-view data, several aspects need to be taken into account: Drones are often equipped with non-metric cameras, so that the distortion has to be removed. The videos are affected by some movement and rotation of the hovering drone. Estimating the location of a vehicle within its environment requires a fixed frame. This property can be achieved algorithmically using image registration techniques. Photographs yield a perspective projection, so that the detected objects are displaced compared to the ground truth. Section III details how these aspects can be addressed.

Own contributions

The three main contributions of the present work can be stated as follows:
First, a framework to obtain precise vehicle positions from UAV imagery based on instance segmentation and image registration is provided. To the knowledge of the authors, no other comparable open sourced framework is available.
Second, it is shown how the accuracy can be optimized compared to related work. Reducing the error is meaningful, for example, to associate a vehicle to its actual lane. Furthermore, a small error is necessary to detect lane changes at the right time instance and to compute criticality measures in a general sense. Accurate data acquisition is also essential to understand the locally characterized driving behavior and to develop algorithms based on it. A precise localization and representation of the vehicle's shape allows the use of simple trackers such as [9], which associates detections across frames in a video sequence.
Third, the method's capabilities and limitations are evaluated with an industrial grade reference system. This work is a step towards large-scale collection of traffic data using UAVs and discusses its feasibility and challenges.

1 Technische Hochschule Ingolstadt, Research Center CARISSMA, Esplanade 10, 85049 Ingolstadt, Germany, {firstname.lastname}@thi.de
2 University of North Carolina at Chapel Hill (UNC), USA, {firstname}@cs.unc.edu

II. RELATED WORK

Object detection and tracking via UAV gained attention over the past years. DroNet [10] investigates the real time capability of vehicle detection with small onboard hardware. DroNet is a lean implementation of the YOLO network [11], where the number of filters in each layer is reduced. DroNet outputs several frames per second (fps) with onboard hardware, but at the cost of lower detection performance and image resolution. The network struggles with variations in flight height and vehicle sizes. It outputs horizontal bounding boxes, which are not suitable for estimating certain variables such as orientation. The R3 network [12] enables the detection of rotated bounding boxes. R3 is a bounding box detector, while Mask R-CNN [13] yields instance segmentation. Reference [14] approaches vehicle detection via instance segmentation. One goal of [14] is to obtain a high detection rate at higher altitudes, while the present work pursues a precise position estimation at lower altitudes up to 100 m. Also, the experimental results are compared to a reference system in this paper.

The methodology of [15] to assess aerial remote sensing performance is comparable to this work. A test vehicle was equipped with a GPS logger to receive positions and speed. The images were geo-referenced to obtain a fixed frame. The main differences to the present work can be stated as follows: The detection algorithm compares the differences between two frames, hence identifying moving objects by localizing altered pixel values. This type of detector is prone to errors, e.g., during vehicle standstills, changing light conditions or due to the movement of vegetation, as stated by the authors. The output is a non-rotated bounding box, which fails to estimate the vehicle's shape and thereby worsens the position estimation. The reference sensor1 in [15] provides an accuracy of 20 cm at best, assuming the Differential GPS version. In the present work, the reference sensor's accuracy is 1 cm, which is necessary to compare at pixel level, see Section III-A. The images in [15] were processed with a Gaussian blur filter, which is claimed to eliminate high frequency noise. Applying such a filter blurs the edges and is counterproductive when applying a neural network detector. Finally, relief displacement was not taken into account, which causes an increasing error with growing distance to the principal point, see Section III-D.2. The authors state a normalized root mean square error of 0.55 m at a flight altitude of 60 m. By the same measure, the error obtained in the present work is much lower with 0.18 m at a flight altitude of 75 m and identical image resolution.

Except for DroNet, a missing publicly available implementation is common to all the above mentioned publications. Recently, the highD [16], inD [17] and INTERACTION [18] data sets were published. They provide processed traffic data obtained with drone and static camera images. While [16] provides German highway data, [18] offers urban sceneries like crossings and roundabouts and [17] furthermore includes pedestrians and cyclists. The Stanford campus data set mainly captures pedestrians and bicycles on a campus; the publication focuses on human trajectory prediction [19]. While [16], [17], [18] provide traffic scenario data sets, this paper describes the procedure to obtain vehicle positions and compares the results to a widely accepted reference. Finally, the code is open sourced for further improvements and to facilitate the generation of new data sets.

1 Video VBOX Pro

III. METHOD

Generating traffic data with UAVs is appealing, but certain challenges have to be mastered. First of all, the images are recorded with a flying object, i.e. a fixed frame has to be established. Second, photographs yield a perspective projection. The tops of objects are displaced from their bases in vertically recorded photographs, leading to a false interpretation when directly computing positions from their bounding boxes. Obtaining bounding boxes in a sequence of many images, on the other hand, requires matured detection techniques, which are limited by the accuracy and amount of labeled training data. Finally, when performing a benchmark, the mapping and synchronisation of both data sources have to be considered.

In the following, the main steps are described as depicted in Figure 2. Beforehand, the coordinate systems used in this work are explained.
Fig. 2: The overall process, from data recording to relief displacement correction: data generation (video and DGPS), pre-processing (image registration), object detection (rotated bounding box) and post-processing (mapping, relief displacement).

The vehicle moves on the Local Tangent Plane (LTP), where xL points east, yL north and zL upwards, with an arbitrary origin oL on the ground of the earth. The Local Car Plane (LCP) is defined according to the ISO 8855 norm, where xC points to the hood, yC to the left seat, zC upwards, with the origin oC at the center of gravity of the vehicle. For simplification, it is assumed that 1) the xC yC-plane is parallel to the xL yL-plane, 2) the centre of mass is identical to the geometrical centre, and 3) the sensor in the vehicle measures in the LCP. The Pixel Coordinate Frame (PCF) is a vertical image projection of the LTP, where xP and yP represent the axes, with the origin oP in one corner of the image. Quantities expressed in PCF are given in pixels (px). Throughout this work, vectors are represented in boldface and matrices in boldface, capital letters.

A. Data Recording

The data set was recorded on a test track. This gives degrees of freedom regarding arbitrary trajectories within the image frame. Experiments can be repeated with the same setup. On the other side, recording on public roads challenges the detector. Since test vehicles are a limited resource, a different approach is chosen to validate the detector to some extent. A second training set, recorded on public roads, is used to validate whether the test vehicle is detected in a robust manner. See Table II in Section IV for further details. The results suggest that, given a suitably large training data set, the detection on public roads also performs well.

Next, the recording process is detailed. Table I depicts the flight altitudes and Ground Sampling Distance (GSD) for the drone2 in use. The GSD is also known as photo scale or spatial resolution, see Section III-D.1 for details about the computation.

Flight altitude      50 m     75 m     100 m
Number of frames     14 532   15 217   24 106
GSD [cm/px]          3.5      5.2      6.9

TABLE I: Total count of frames and GSD per altitude. At higher altitudes a larger area on ground has to be covered, thus increasing the number of video frames.
algorithm [20] is used as a detector and descriptor. To match
Generally, for a vertical photograph, the GSD S is a the points between two images, the distances between the
function of the focal length f of the camera and flight altitude feature vectors are computed. If the distance fulfills a certain
H above ground: criterion, e. g., a nearest neighbor ratio matching strategy,
f a matching point on two images was found. The matches
S=
. (1) are then fed into the MLESAC algorithm [21] to eliminate
H
Varying altitudes brings flexibility in the trade-off between outliers. Lastly, a randomly selected subset of the remaining
GSD and captured area on the ground. The videos were matching points are used for the image scaling, rotation and
recorded with 50 fps and 4K resolution (3840 px x 2160 px) translation.
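To make this trade-off concrete, the ground width covered by one Full-HD frame follows directly from the GSD values in Table I; this is only a worked example with the paper's own numbers, not an additional measurement:

    covered width = rx · S ≈ 1920 px · 3.5 cm/px ≈ 67 m at H = 50 m,
    covered width = rx · S ≈ 1920 px · 6.9 cm/px ≈ 133 m at H = 100 m.

Doubling the altitude thus roughly doubles the covered ground width while halving the spatial resolution.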

A spiral template trajectory was driven to obtain different vehicle poses and to cover a large area of the image. The test vehicle was then equipped with a driving robot and a Satellite Navigation system3 that receives RTK correction data. This ensures an identical reproduction of the trajectory for all experiments, and a centimeter-accurate vehicle localization.

2 DJI Phantom 4 Pro V2
3 GeneSys ADMA-G-PRO+

B. Pre-Processing

The pre-processing consists generally of two parts: the camera calibration and the image registration. According to the manufacturer of the drone, the camera is shipped calibrated, so this step is skipped. The image registration is performed to overlay the sequential frames over the first frame to ensure a fixed frame. The registration applied in this work is composed of a correction of the orientation, translation, and scaling of the image. Figure 3 depicts an example of the registration result. This process involves three steps in order to find correspondences between two images: a feature detector, a descriptor and finally the feature matching. The goal of the detector is to find identical points of interest under varying viewing conditions. The descriptor is a feature vector, which describes the local area around the point of interest. For this work, the Speeded Up Robust Features algorithm [20] is used as detector and descriptor. To match the points between two images, the distances between the feature vectors are computed. If the distance fulfills a certain criterion, e.g., a nearest neighbor ratio matching strategy, a matching point on two images was found. The matches are then fed into the MLESAC algorithm [21] to eliminate outliers. Lastly, a randomly selected subset of the remaining matching points is used for the image scaling, rotation and translation.

Fig. 3: Registered image (left) and the raw drone image (right) of a Ground Control Point (GCP) from 1.5 m height. The left image borders are clipped due to translation and rotation. The red box depicts the object location from the first video frame, which was shot around 30 s beforehand.
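The sketch below illustrates such a registration step with OpenCV. It is not the authors' published implementation (see the repository [31] for that): SURF requires an opencv-contrib build with the non-free modules enabled, and since OpenCV provides RANSAC rather than MLESAC for outlier rejection, RANSAC is used here as a stand-in.

```python
import cv2
import numpy as np

def register_to_reference(ref_gray, frame_gray):
    """Estimate scale, rotation and translation that align frame_gray
    with the first (reference) frame, then warp the frame accordingly."""
    # Feature detector/descriptor (SURF needs opencv-contrib with the
    # non-free modules; cv2.ORB_create() is a drop-in alternative).
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp_ref, des_ref = surf.detectAndCompute(ref_gray, None)
    kp_frm, des_frm = surf.detectAndCompute(frame_gray, None)

    # Nearest-neighbor ratio matching strategy.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_frm, des_ref, k=2)
    good = [m for m, n in knn if m.distance < 0.7 * n.distance]

    src = np.float32([kp_frm[m.queryIdx].pt for m in good])
    dst = np.float32([kp_ref[m.trainIdx].pt for m in good])

    # Similarity transform (rotation, scale, translation) with robust
    # outlier rejection; the paper uses MLESAC, RANSAC is used here.
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                       ransacReprojThreshold=3.0)
    h, w = ref_gray.shape[:2]
    return cv2.warpAffine(frame_gray, M, (w, h))
```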


C. Object detection
The pre-processed images are fed into the Mask-RCNN implementation of [22], which is pre-trained on the Common Objects in Context (COCO) data set [23]. Mask R-CNN extends Faster R-CNN [24] by adding a parallel, Fully Convolutional Network [25] branch for instance segmentation, next to the classification and bounding box regression from Faster R-CNN. The network is a so called two stage detector: In the first stage, feature maps generated by a backbone network are fed into a Region Proposal Network, which outputs Regions of Interest (RoI). In the second stage, the predictions are performed within the RoIs. Additionally, a Feature Pyramid Network is included for detecting objects at different scales [26].

Transfer learning has been applied with two training sets: The first one contains 196 randomly selected and manually labeled images with the test vehicle being present on all images. The second set contains 133 randomly selected images with 1 987 manually labeled vehicles, recorded on public roads. This is done to examine the performance under general conditions. In the following, the two training sets are named "specialized" and "general". Details about the training procedure can be obtained from the code. Finally, the binary mask output is used to compute the smallest rectangle containing all mask pixels using an OpenCV [27] library.
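A sketch of this last step, assuming the detector's binary mask is available as a NumPy array; the paper only states that an OpenCV routine returns the smallest rectangle containing all mask pixels, and cv2.minAreaRect is used here as the natural candidate:

```python
import cv2
import numpy as np

def mask_to_rotated_box(mask):
    """Smallest rotated rectangle containing all mask pixels.
    Returns the four corners b1..b4 in the PCF (pixel coordinates)."""
    # Coordinates of all foreground pixels, as (x, y) pairs.
    ys, xs = np.nonzero(mask.astype(np.uint8))
    points = np.column_stack((xs, ys)).astype(np.float32)
    # Minimum-area (rotated) bounding rectangle and its corner points.
    rect = cv2.minAreaRect(points)      # (center, (w, h), angle)
    return cv2.boxPoints(rect)          # 4 x 2 array of corners
```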
D. Post-Processing steps

To complete the process, two more steps are performed. First, the output from the detector, given in PCF, has to be mapped on the LTP. Finally, the disturbing relief displacement is handled.

1) PCF Mapping: A comparison to the reference requires the mapping of the PCF on the LTP. For this, GCPs are placed on the xL yL-plane, in such a way that they are visible on the PCF. The i-th GCP is defined in LTP as gi,L = [xi,L  yi,L]^T, and in PCF as gi,P = [xi,P  yi,P]^T. The GSD S is calculated from two GCPs by

    S = |gi+1,L − gi,L| / |gi+1,P − gi,P|.    (2)

The i-th GCP can then be expressed in meters by g̃i = gi,P · S = [x̃i  ỹi]^T. The orientation offset δ from the LTP to the PCF is calculated as δ = θi − θi,L, with θi = atan2(ỹi+1 − ỹi, x̃i+1 − x̃i). θi,L is calculated by analogy. The GCP g̃i is then rotated as follows

    ĝi = R(δ) g̃i,    (3)

where R(·) is a 2D rotation matrix. The linear offsets from the LTP to the PCF are calculated by ∆ = ĝi − gi,L. Finally, a pixel pP = [xP  yP]^T on the PCF can be mapped to the LTP by

    pLP = R(δ) (pP · S) − ∆.    (4)

The next stage is to semantically define the four bounding box corners. It is assumed that the box covers the complete shape of the vehicle. The i-th corner of the bounding box is defined in PCF as bi = [xb,i,P  yb,i,P]^T, and the bounding box is defined in PCF as

    BP = [b1  b2  b3  b4].    (5)

The corners of the bounding box are mapped to the LTP as shown in Eq. (4) to obtain BPL. The geometric centre of the vehicle oveh is calculated by

    oveh = [ (max(BPL,1,i) + min(BPL,1,i)) / 2 ,  (max(BPL,2,i) + min(BPL,2,i)) / 2 ]^T,    (6)

for i = 1, . . . , 4. The dimensions of the detected vehicle are calculated next. Let

    ||b2 − b1|| < ||b3 − b1|| < ||b4 − b1||,    (7)

then ŵ = S · ||b2 − b1|| and l̂ = S · ||b3 − b1|| are the estimated width ŵ and length l̂ of the vehicle in meters.
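A compact NumPy sketch of the mapping chain in Eqs. (2)-(7); it is not the authors' implementation, and the helper names are illustrative:

```python
import numpy as np

def rotation_2d(delta):
    """2D rotation matrix R(delta)."""
    c, s = np.cos(delta), np.sin(delta)
    return np.array([[c, -s], [s, c]])

def mapping_parameters(g_ltp, g_pcf):
    """GSD S (Eq. 2), orientation offset delta and linear offset Delta,
    derived from two GCPs given in LTP (meters) and PCF (pixels)."""
    S = np.linalg.norm(g_ltp[1] - g_ltp[0]) / np.linalg.norm(g_pcf[1] - g_pcf[0])
    g_m = g_pcf * S                                   # GCPs expressed in meters
    d_p, d_l = g_m[1] - g_m[0], g_ltp[1] - g_ltp[0]
    delta = np.arctan2(d_p[1], d_p[0]) - np.arctan2(d_l[1], d_l[0])
    offset = rotation_2d(delta) @ g_m[0] - g_ltp[0]   # Delta = g_hat - g_L
    return S, delta, offset

def pcf_to_ltp(p_pcf, S, delta, offset):
    """Eq. (4): map a PCF point (px) to the LTP (m)."""
    return rotation_2d(delta) @ (p_pcf * S) - offset

def vehicle_centre_and_size(corners_pcf, S, delta, offset):
    """Eqs. (5)-(7): geometric centre in the LTP plus the estimated
    width/length in meters, from the four rotated-box corners (4x2, px)."""
    corners_ltp = np.array([pcf_to_ltp(b, S, delta, offset) for b in corners_pcf])
    centre = (corners_ltp.max(axis=0) + corners_ltp.min(axis=0)) / 2.0   # Eq. (6)
    d = np.sort(np.linalg.norm(corners_pcf - corners_pcf[0], axis=1))    # 0, w, l, diagonal
    return centre, S * d[1], S * d[2]                 # centre, w_hat, l_hat
```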
2) Relief displacement: Photographs yield a perspective projection. A variation in the elevation of an object results in a different scale and a displacement of the object. An increase in the elevation of an object causes the position of the feature to be displaced radially outwards from the principal point Oc [28].

Fig. 4: Geometry of the relief displacement, adapted from [28]. The red bar depicts an object of height h. Due to the perspective projection and R > 0, the top of the bar A is displaced on the photo compared to the bottom A'. The relief displacement d is the distance between the corresponding points a and a' in the PCF.

Assuming a vertical camera angle, the displacement can be computed from the similar triangles LOcA'' and AA'A'', according to Figure 4:

    D / h = R / H,    d / h = r / H,    (8)

where the second equation is expressed in GSD, with d defining the relief displacement and r the radial distance between oc and the displaced point a in the PCF, D defining the equivalent distance of d projected on ground, R the radial distance from Oc, H the flight altitude and h being the object height in the LTP. L is the camera lens exposure station, where light rays from the object intersect before being imaged at the camera's sensor. The relief displacement decreases with an increasing hovering altitude and is zero at Oc.
an increasing hovering altitude and is zero at Oc . 1.000
0.6
According to Eq. 8, the bounding box has to be shifted
radially. Two approaches are described: The first one requires 0.998 0.4
knowledge of the vehicle sizes, and the second one is an
approximation for unknown vehicle dimensions. Since the rotation 0.2
training is performed to detect the complete vehicle body, the 0.996 scaling
corner closest to Oc can be usually identified as the bottom of 0
0 10 20 30 40 50 60
the vehicle. So the height of this point is equal to the ground time [s]
clearance. Knowing the height of this corner, its displacement
Fig. 5: Image registration: Scaling and rotation for a period of 60 s:
is corrected as described in the following. Rotation in till 30 s. The scaling factor in with a drop
Defining the horizontal and vertical resolution of the image at around 35 s, caused by an altitude drop.
as rx and ry , the coordinates in PCF of bi w.r.t. the image
center are given by Eq. (10). Although this is only a coarse approximation, the

xb,i,img
 
xb,i,P − r2x
 overall error is reduced compared to the initial situation of
= r . (9) neglecting the displacement.
yb,i,img yb,i,P − 2y
The shift ∆x,P along the xP axis is calculated on the PCF IV. EXPERIMENTS
by In this section the main potential sources of errors are
xb,i,img · hb,i,L being discussed. Later, the results of estimating the position
∆x,P = , (10)
H will be presented.
where hb,i,L is the height of the i-th corner on the LTP. The
A. Sources of errors
shift for ∆y,P is computed by analogy along the yP axis. The
approximated coordinates bi,shift of bi are then given by The overall process involves several steps, of which all
 T affect the accuracy of the position estimation. Two groups
bi,shift = bi − ∆x,P ∆y,P . (11) of potential errors can be distinguished: The first group de-
scribes general issues appearing from aerial imagery captured
Let w be the width and l the known length of the vehicle by an UAV. Here, the registration and relief displacement
and b1 be the closest corner to the image centre. Then, b1 have the most significant influence. The second group is
is used for scaling bl and bw as follows only of concern, when comparing to a reference system.
w 
bw,scaled = · (bw − b1 ) + b1 , and (12) In this group, maximizing the distance between the GCPs
ŵ and localizing them precisely on the PCF is essential. The
following enumeration lists the main potential sources of
 
l
bl,scaled = · (bl − b1 ) + b1 , (13) errors:
ˆl
• Camera calibration and image registration,
where w is the element of BP associated with ||b2 − b1 || and
• Image compression,
l associated with ||b3 − b1 ||, respectively. The shifted centre
• Camera exposure time,
of the vehicle can then be calculated by
• Training data generation and object shape detection,
bw,scaled + bl,scaled • Location of the shape boundary on the vehicle body,
oveh,shift = . (14)
2 • Rotation and GSD with GCPs,
When gathering data on public roads, the vehicle dimensions • Sensor synchronisation.
are unknown and can not be estimated with a mono camera. Wide angle lenses have the preferable focal length to cap-
An approximation for the displacement can be performed ture a large area on ground. They are usually affected by
by assuming that two corners of the bounding box, which barrel distortion, which decreases the GSD with increasing
form ˆl and one of the corners is the closest to oc , the distance from the optical axis Oc . According to the drone
ground clearance is usually visible. The height of the ground manufacturer, the camera in use performs the corrections
clearance can be approximated with 15 cm for passenger cars automatically. Every pixel deviation in the feature detection
[29]. The remaining two corners can usually be referred to as and matching during the image registration process inevitably
the vehicle body shoulders, which protrude further than the leads to a deviation in rotation and GSD. Changing light
roof of the vehicle. The shoulders height is roughly half of conditions and the slight movements of the hovering drone
the vehicle height and can be approximated with 75 cm for affect the perception and thus influence the matching. Figure
passenger cars. Then all four corners can be shifted following 5 depicts a typical example of rotation and scaling, recorded
While achieving robust results, some potential outliers with a magnitude of approximately 0.1 % can be observed for the scaling parameter, which translates into a deviation of up to 12 cm at 100 m altitude. Filtering these variables was omitted to examine the robustness of the image registration algorithms.

The images are compressed in two ways. First, the resolution is reduced by half for both axes. Second, storing the images as JPEGs leads to lossy image compression. For example, smooth transitions can be found, which reduce the sharpness of edges. Another component is the camera exposure time. With exposure time, motion blur is induced, which can stretch the vehicle on the image or blur the edges. Therefore, short exposure times are preferable, but at the cost of less light exposure.

A robust semantic segmentation is crucial for achieving reliable results. Even though Mask-RCNN provides excellent results, see Section IV-B, minor deviations of at least one pixel can not be avoided. The deviations stem mainly from the manual image labeling and limited training data.

With a radially increasing distance from Oc, the relief displacement has a major influence on the results. The problem in correcting the displacement of vehicles is rooted in the complex shapes and varying heights. It is difficult to determine which part of the car has been detected exactly, even when inspected by a human: Assuming a flight altitude of 100 m and a vehicle detected close to the image border, e.g., a distance of 60 m to Oc, the displacement increases by 0.6 cm per cm change in object height. Detecting a feature of the vehicle at 30 cm height, instead of the body bottom (Section III-D.2), yields an error of 9 cm.

The last two sources of error are only of concern when comparing the results with a reference system. The mapping of the PCF to the LTP is based on localizing the GCPs, see Section III. Due to the limited resolution and image compression, a mis-localization of typically one pixel per GCP in the PCF can be induced. Hence, all variables concerning the mapping, namely the GSD S, orientation offset δ and linear offset ∆, see Section III-D.1, are affected. The resulting error depends on the distance between two GCPs, hence |gi+1,L − gi,L| should be maximized.

Synchronisation between the two sensors is attained via UTC time stamps. Since UTC time stamps can not be associated to a certain frame for the drone in use, an LED signal, triggered by the Pulse-per-second (PPS) signal from a satellite navigation receiver, was recorded. This appears to be the best solution, since the circuit delay within the receiver and the LED rising time can be neglected. The first video frame showing the illuminated LED is associated with the corresponding UTC time stamp. Hence, the maximum synchronisation error is limited here to 1/fps = 20 ms. During the experiments, the maximum speed was around 30 km/h, leading to a worst case error of 17 cm due to synchronisation.

B. Detection performance

A set of 50 images has been labeled for evaluation. Table II depicts the Average Precision (AP) results according to the PASCAL VOC (AP@IoU = 0.5) [30] and COCO (AP@IoU[0.5:0.05:0.95]) [23] definitions, where the Intersection over Union threshold is abbreviated as IoU.

Weights   Specialized               General
Metric    AP@0.5  AP@[0.5, 0.95]    AP@0.5  AP@[0.5, 0.95]
50 m      1.00    0.89              1.00    0.84
75 m      1.00    0.89              1.00    0.82
100 m     1.00    0.91              1.00    0.86

TABLE II: AP evaluation for all altitudes and both training sets.

The detection is robust and the AP is similar for all three flight heights. An AP@IoU = 0.5 of 1 exhibits a detection rate of 100 % for the evaluation images. That holds for all images (see Table I) detected with the specialized training weights. For the general weights, the detection rate is 99.97 % w.r.t. the images listed in Table I. This is reasonable, since the environment of the test track does not exhibit structures to be confused with a top-down view of a vehicle shape. It should be noted that the detection performance does not directly reflect the accuracy of the position estimation, since the bounding box is computed according to the outermost pixels of the shape. Assuming the outermost pixels are detected and they do match the actual vehicle body border, the position could still be computed correctly on a pixel level, although the IoU is less than one.

C. Position estimation

This section is concluded with the experimental results. Figure 6 depicts the graph for each flight altitude, both training weights and the three main processing steps, where "raw" depicts results for non-registered images, "reg" for registered images, and "reg+shift" for registered images with corrected relief displacement.

Image registration is the key to obtain reasonable results. The correction of the relief displacement improves the results by 0.8 px on a weighted average4. Note that the impact of the displacement is dependent on the distance R. Hence, data sets recording vehicles at the image border benefit more.

As mentioned before, the relief displacement is reduced with higher hovering altitudes. Remembering the issue regarding which vehicle part actually corresponds to the detected outer pixel (Section III) explains the best results in pixel measure for an altitude of 100 m. However, expressing the error in metric units, the error is lowest at 50 m. The experiments with the general training weights, which are based on images recorded on public roads, perform on weighted average only 0.3 px worse. This underlines the suitability of the framework for applications on public roads. Table III depicts a detailed comparison.

The mean error is 20 cm and 14 cm for a flight altitude of 100 m and 50 m, respectively. Regardless of the training weights, 90 % of all frames have an error of 7 px or less.

4 weighted by the number of frames per height, see Section III
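For reference, the statistics reported in Table III (median, mean and accumulated frequencies in px, mean in m) can be reproduced from per-frame errors along these lines; a sketch assuming the per-frame error is the Euclidean distance between the estimated and the reference vehicle centre in pixels:

```python
import numpy as np

def error_statistics(est_px, ref_px, gsd_m_per_px):
    """Per-frame Euclidean error in px plus the metrics listed in Table III."""
    err_px = np.linalg.norm(est_px - ref_px, axis=1)
    return {
        "median_px": np.median(err_px),
        "mean_px": err_px.mean(),
        "p90_px": np.percentile(err_px, 90),
        "p99_px": np.percentile(err_px, 99),
        "p99.9_px": np.percentile(err_px, 99.9),
        "mean_m": err_px.mean() * gsd_m_per_px,
    }
```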
Altitude 100 m 75 m 50 m
Weights Specialized General Specialized General Specialized General
Corrections reg reg+shift reg reg+shift reg reg+shift reg reg+shift reg reg+shift reg reg+shift
Median [px] 3.27 2.41 3.33 2.71 3.26 2.70 3.67 3.08 4.47 3.93 4.14 4.05
Mean [px] 3.87 2.95 3.96 3.27 3.75 2.99 4.09 3.34 4.53 3.98 4.47 4.26
90% [px] 7.39 6.01 7.52 6.50 6.98 5.28 7.42 5.80 7.04 6.08 7.75 7.12
99% [px] 11.19 8.88 11.33 9.58 9.67 8.08 9.60 8.27 9.97 8.10 10.81 10.05
99.9% [px] 11.85 9.72 12.23 12.07 11.15 9.63 11.07 10.04 10.83 8.95 12.89 13.27
Mean [m] 0.27 0.20 0.27 0.23 0.20 0.16 0.21 0.17 0.16 0.14 0.16 0.15

TABLE III: Accumulated frequency of the error: Depicted for all three altitudes, both training weights and corrections.

Summarizing this section and the experimental results leads to the following conclusions: 1) A robust image registration is crucial for a good performance. Omitting the effect of the relief displacement yields larger errors when hovering above the region of interest is not feasible and objects are detected throughout the complete image frame. 2) Considering the pixelwise results, similar performance can be observed for all three altitudes, which proves that data can be obtained from different flight heights by a single Mask-RCNN network. This advantage can also be helpful for different object sizes. 3) The best results in metric values are retrieved at lower altitudes. Alternatively, in order to capture a larger surface area, one can record at higher altitudes, increase the resolution and crop the image if necessary. 4) Regarding real-world applications, the vehicle can at least be associated to its lane as exemplified in Figure 1. Note that, to some extent, the error values reported in Figure 6 and Table III can be assigned to the synchronisation and mapping uncertainty, which is only of concern when benchmarking two data sources. 5) The vehicle to lane association also holds for the general training set, so that one can expect similar results for public roads, given an appropriate training data set.

Fig. 6: Cumulative frequency diagrams: The left column depicts the results for the specialized training weights, the right column for the general training weights. The error above each plot is depicted in [cm], and in [px] below each plot. Each row corresponds to one flight altitude (50 m, 75 m, 100 m); the curves show the raw, reg and reg+shift processing steps.

V. CONCLUSIONS

Vehicle detection by means of UAV imagery is an attractive option to generate data sets with relatively low effort. This paper describes an approach based on deep neural network object detection and automated image registration techniques with state-of-the-art algorithms. A procedure to reduce the impact of relief displacement originated by the perspective projection of vertical images is described as well. Additionally, an overview of the potential sources of errors, and how to minimize their impact, is given.

The estimated vehicle position is compared to a reference system. Recording the data on a test track with consistent conditions ensures meaningful results. It is shown that, without applying any time-smoothing techniques, the position can be estimated in a reliable manner. The mean error is 20 cm and 14 cm for a flight altitude of 100 m and 50 m, respectively. Furthermore, 90 % of the 53 855 independently evaluated frames have an error of 7 px or less. To highlight the generalization capabilities, the experiments were analysed for two training data sets. One is a specialized data set, while the second was recorded on public roads. Both sets perform at a similar level. This allows the framework to be used for a wide range of applications. Interested readers are referred to the repository [31], where the code, label data and example videos are made publicly available.

ACKNOWLEDGMENT

The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany (BMBF) in the framework of FH-Impuls (project number 03FH7I02IA). The authors thank the AUDI AG department for Testing Total Vehicle for supporting this work.
REFERENCES

[1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets Robotics: The KITTI Dataset," International Journal of Robotics Research (IJRR), 2013.
[2] "Waymo Open Dataset: An autonomous driving dataset," 2019. [Online]. Available: https://www.waymo.com/open
[3] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, and P. Schuberth, "A2D2: AEV Autonomous Driving Dataset," 2019. [Online]. Available: http://www.a2d2.audi
[4] E. Sánchez Morales, F. Kruber, M. Botsch, B. Huber, and A. García Higuera, "Accuracy Characterization of the Vehicle State Estimation from Aerial Imagery," in IEEE Intelligent Vehicles Symposium (IV), 2020.
[5] E. Sánchez Morales, R. Membarth, A. Gaull, P. Slusallek, T. Dirndorfer, A. Kammenhuber, C. Lauer, and M. Botsch, "Parallel Multi-Hypothesis Algorithm for Criticality Estimation in Traffic and Collision Avoidance," in IEEE Intelligent Vehicles Symposium (IV), 2019.
[6] P. Nadarajan, M. Botsch, and S. Sardina, "Machine Learning Architectures for the Estimation of Predicted Occupancy Grids in Road Traffic," Journal of Advances in Information Technology, 2018.
[7] L. Neubert, "Statistische Analyse von Verkehrsdaten und die Modellierung von Verkehrsfluss mittels zellularer Automaten," Ph.D. dissertation, Universität Duisburg, 2000.
[8] F. Kruber, J. Wurst, S. Chakraborty, and M. Botsch, "Highway traffic data: macroscopic, microscopic and criticality analysis for capturing relevant traffic scenarios and traffic modeling based on the highD data set," 2019. [Online]. Available: https://arxiv.org/abs/1903.04249
[9] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in IEEE International Conference on Image Processing (ICIP), 2016.
[10] C. Kyrkou, G. Plastiras, T. Theocharides, S. I. Venieris, and C. Bouganis, "DroNet: Efficient convolutional neural network detector for real-time UAV applications," in Design, Automation Test in Europe Conference Exhibition (DATE), 2018.
[11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] Q. Li, L. Mou, Q. Xu, Y. Zhang, and X. X. Zhu, "R3-Net: A Deep Network for Multi-oriented Vehicle Detection in Aerial Images and Videos," CoRR, 2018. [Online]. Available: http://arxiv.org/abs/1808.05560
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[14] L. Mou and X. X. Zhu, "Vehicle Instance Segmentation From Aerial Image and Video Using a Multitask Learning Residual Fully Convolutional Network," IEEE Transactions on Geoscience and Remote Sensing, 2018.
[15] G. Guido, V. Gallelli, D. Rogano, and A. Vitale, "Evaluating the accuracy of vehicle tracking data obtained from Unmanned Aerial Vehicles," International Journal of Transportation Science and Technology, 2016.
[16] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein, "The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems," in 21st International Conference on Intelligent Transportation Systems (ITSC), 2018.
[17] J. Bock, R. Krajewski, T. Moers, L. Vater, S. Runde, and L. Eckstein, "The inD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories at German Intersections," 2019. [Online]. Available: https://arxiv.org/abs/1911.07602
[18] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kümmerle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka, "INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps," arXiv:1910.03088, 2019.
[19] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, "Human Trajectory Prediction In Crowded Scenes," in European Conference on Computer Vision (ECCV), 2016.
[20] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," in Computer Vision – ECCV, 2006.
[21] P. Torr and A. Zisserman, "MLESAC: A New Robust Estimator with Application to Estimating Image Geometry," Computer Vision and Image Understanding, June 2000.
[22] Waleed Abdulla, "Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow," 2017. [Online]. Available: https://github.com/matterport/Mask_RCNN
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Computer Vision – ECCV, 2014.
[24] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Advances in Neural Information Processing Systems 28, 2015.
[25] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[27] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[28] Lillesand, Kiefer, and Chipman, Remote Sensing and Image Interpretation, 2003, vol. 5.
[29] Verband der TÜV e.V., "Merkblatt 751," 2008.
[30] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, 2010.
[31] F. Kruber and E. Sánchez Morales, "Vehicle Detection and State Estimation with Aerial Imagery." [Online]. Available: https://github.com/fkthi
