Vehicle Position Estimation with Aerial Imagery from Unmanned Aerial Vehicles
I. INTRODUCTION
Real-world data is essential for automotive research and traffic analysis. The publicly available data sets can be mainly split into two groups: The first group provides data from the vehicle perspective, as provided in the KITTI [1], Waymo [2] or Audi A2D2 [3] data sets. These kinds of data sets boost research in terms of in-vehicle functionality, mainly computer vision tasks. In order to analyze traffic with an overall view of the situation, other approaches are preferable. Commonly used infrastructure sensing technologies like inductive loops provide accurate accumulated traffic data, while not being capable of providing individual trajectories of traffic participants. Aerial imagery from Unmanned Aerial Vehicles (UAV), usually drones, overcomes this limitation. Yet, only few bird's-eye-view data sets are available, while interest is growing due to technological progress. Recently published work and data sets are discussed in Section II. The behavior of traffic participants and infrastructural conditions differ throughout the world. This underlines the need for data according to the local specifics. The present work describes a process to generate reliable position data. Figure 1 depicts an example from the experiments of Section IV.

Fig. 1: Bird's eye view of an orthorectified map of the test track at 50 m flight altitude. Depicted are the trajectories of the reference in black and the drone in red, respectively. The car was driven in spirals to capture different vehicle poses and positions in the image frame. The blue rectangle and cross indicate the image frame and center. For estimation purposes, a true-to-scale T-junction is painted white on the test track.

For further details on how to obtain additional vehicle state variables from UAV imagery, given the position estimation, the reader is referred to [4].

Aerial remote sensing measurements have various advantages: dozens of objects can be captured in parallel with one sensor, and UAVs are versatile in terms of locations and covered area on the ground. Also, the hovering position can be chosen to reduce occlusion. The generated data is suitable both for research on individual traffic participant behavior and its prediction [5], [6], and for accumulated traffic flow analysis [7], [8]. Batteries are the bottleneck, but this disadvantage can nowadays be compensated by tethered systems, which allow flight durations of several hours. Wind and water resistance, alongside the low-light capabilities of cameras, are constantly improving, but remain a weak point.
1 Technische Hochschule Ingolstadt, Research Center CARISSMA, Esplanade 10, 85049 Ingolstadt, Germany, {firstname.lastname}@thi.de
2 University of North Carolina at Chapel Hill (UNC), USA, {firstname}@cs.unc.edu

Before generating such bird's-eye-view data, several aspects need to be taken into account: Drones are often equipped with non-metric cameras, so that the distortion has to be removed. The videos are affected by some movement and rotation of the hovering drone. Estimating the location of a vehicle within its environment requires a fixed frame. This property can be achieved algorithmically using image registration techniques. Photographs yield a perspective projection, so that the detected objects are displaced compared to the ground truth. Section III details how these aspects can be addressed.

Own contributions

The three main contributions of the present work can be stated as follows:

First, a framework to obtain precise vehicle positions from UAV imagery based on instance segmentation and image registration is provided. To the knowledge of the authors, no other comparable open-sourced framework is available.

Second, it is shown how the accuracy can be optimized compared to related work. Reducing the error is meaningful, for example, to associate a vehicle with its actual lane. Furthermore, a small error is necessary to detect lane changes at the right time instant and to compute criticality measures in a general sense. Accurate data acquisition is also essential to understand locally characterized driving behavior and to develop algorithms based on it. A precise localization and representation of the vehicle's shape allows the use of simple trackers such as [9], which associates detections across frames in a video sequence.

Third, the method's capabilities and limitations are evaluated with an industrial-grade reference system. This work is a step towards large-scale collection of traffic data using UAVs and discusses its feasibility and challenges.

II. RELATED WORK

Object detection and tracking via UAV have gained attention over the past years. DroNet [10] investigates the real-time capability of vehicle detection with small onboard hardware. DroNet is a lean implementation of the YOLO network [11], where the number of filters in each layer is reduced. DroNet outputs several frames per second (fps) with onboard hardware, but at the cost of lower detection performance and image resolution. The network struggles with variations in flight height and vehicle sizes. It outputs horizontal bounding boxes, which are not suitable for estimating certain variables such as orientation. The R3 network [12] enables the detection of rotated bounding boxes. R3 is a bounding box detector, while Mask R-CNN [13] yields instance segmentation. Reference [14] approaches vehicle detection via instance segmentation. One goal of [14] is to obtain a high detection rate at higher altitudes, while the present work pursues a precise position estimation at lower altitudes up to 100 m. Also, the experimental results are compared to a reference system in this paper.

The methodology of [15] to assess aerial remote sensing performance is comparable to this work. A test vehicle was equipped with a GPS logger to receive positions and speed. The images were geo-referenced to obtain a fixed frame. The main differences to the present work can be stated as follows: The detection algorithm of [15] compares the differences between two frames, hence identifying moving objects by localizing altered pixel values. This type of detector is prone to errors, e.g., during vehicle standstills, changing light conditions or due to the movement of vegetation, as stated by the authors. The output is a non-rotated bounding box, which fails to estimate the vehicle's shape and thereby worsens the position estimation. The reference sensor1 in [15] provides an accuracy of 20 cm at best, assuming the Differential GPS version. In the present work, the reference sensor's accuracy is 1 cm, which is necessary to compare at pixel level, see Section III-A. The images in [15] were processed with a Gaussian blur filter, which is claimed to eliminate high-frequency noise. Applying such a filter blurs the edges and is counterproductive when applying a neural network detector. Finally, relief displacement was not taken into account, which causes an increasing error with growing distance to the principal point, see Section III-D.2. The authors state a normalized root mean square error of 0.55 m at a flight altitude of 60 m. By the same measure, the error obtained in the present work is much lower, with 0.18 m at a flight altitude of 75 m and identical image resolution.

1 Video VBOX Pro

Except for DroNet, a missing publicly available implementation is common to all the above-mentioned publications. Recently, the highD [16], inD [17] and INTERACTION [18] data sets were published. They provide processed traffic data obtained from drone and static camera images. While [16] provides German highway data, [18] offers urban sceneries like crossings and roundabouts, and [17] furthermore includes pedestrians and cyclists. The Stanford campus data set mainly captures pedestrians and bicycles on a campus; the corresponding publication focuses on human trajectory prediction [19]. While [16], [17], [18] provide traffic scenario data sets, this paper describes the procedure to obtain vehicle positions and compares the results to a widely accepted reference. Finally, the code is open-sourced for further improvements and to facilitate the generation of new data sets.

III. METHOD

Generating traffic data with UAVs is appealing, but certain challenges have to be mastered. First of all, the images are recorded with a flying object, i.e., a fixed frame has to be established. Second, photographs yield a perspective projection. The tops of objects are displaced from their bases in vertically recorded photographs, leading to a false interpretation when directly computing positions from their bounding boxes. Obtaining bounding boxes in a sequence of many images, on the other hand, requires matured detection techniques, which are limited by the accuracy and amount of labeled training data. Finally, when performing a benchmark, the mapping and synchronisation of both data sources have to be considered.

In the following, the main steps are described as depicted in Figure 2. Beforehand, the coordinate systems used in this work are explained.
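The coordinate-system details of Section III-D.1 (pixel coordinate frame PCF, local tangent plane LTP, ground sample distance S, orientation offset δ and linear offset ∆) are referenced but not reproduced in this excerpt. A minimal sketch of such a pixel-to-metric mapping, assuming a 2-D similarity transform fitted from two ground control points (GCPs) — function names and conventions are illustrative, not the paper's:

```python
import numpy as np

def pcf_to_ltp(p_pcf, gsd, delta_rad, offset):
    """Map pixel coordinates (PCF) to metric LTP coordinates via the
    ground sample distance S (gsd), an orientation offset delta and a
    linear offset Delta. Sketch only; the paper's exact conventions
    (Section III-D.1) may differ."""
    c, s = np.cos(delta_rad), np.sin(delta_rad)
    rot = np.array([[c, -s], [s, c]])
    return gsd * (rot @ np.asarray(p_pcf, dtype=float)) + np.asarray(offset, dtype=float)

def fit_mapping_from_gcps(g1_pcf, g2_pcf, g1_ltp, g2_ltp):
    """Recover S, delta and Delta from two GCPs observed in both frames.
    As noted in the text, a larger distance between the GCPs reduces the
    influence of per-pixel localization errors on the fitted mapping."""
    d_pcf = np.asarray(g2_pcf, float) - np.asarray(g1_pcf, float)
    d_ltp = np.asarray(g2_ltp, float) - np.asarray(g1_ltp, float)
    gsd = np.linalg.norm(d_ltp) / np.linalg.norm(d_pcf)   # meters per pixel
    delta = np.arctan2(d_ltp[1], d_ltp[0]) - np.arctan2(d_pcf[1], d_pcf[0])
    c, s = np.cos(delta), np.sin(delta)
    rot = np.array([[c, -s], [s, c]])
    offset = np.asarray(g1_ltp, float) - gsd * (rot @ np.asarray(g1_pcf, float))
    return gsd, delta, offset
```

Two GCPs provide four constraints, exactly matching the four degrees of freedom (scale, rotation, 2-D translation) of the similarity transform.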
Fig. 2: The overall process, from data recording to relief displacement correction: data generation (video and DGPS), pre-processing (image registration), object detection (rotated bounding box), and post-processing (mapping and relief displacement correction).
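The pre-processing step estimates the frame-to-frame scaling and rotation of the hovering drone (the quantities later tracked in Fig. 5); the paper matches SURF features [20] with a robust estimator [21]. A minimal sketch of the estimation step alone, assuming matched point pairs are already available and substituting a plain least-squares similarity fit for the robust MLESAC stage:

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity transform dst ~ s * R @ src + t for 2-D
    point sets (rows are points). Stand-in for the robust estimation the
    paper applies to matched features; real pipelines wrap this in a
    RANSAC/MLESAC loop to reject bad matches."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    n = len(src)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / n                       # 2x2 cross-covariance
    u, sing, vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(u @ vt))        # guard against reflections
    corr = np.diag([1.0, d])
    rot = u @ corr @ vt
    var_src = (sc ** 2).sum() / n
    scale = np.trace(np.diag(sing) @ corr) / var_src
    t = mu_d - scale * rot @ mu_s
    return scale, rot, t
```

The rotation angle plotted in Fig. 5 would then follow as `np.arctan2(rot[1, 0], rot[0, 0])`, and `scale` corresponds to the scaling factor.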
the cameras' sensor. The relief displacement decreases with an increasing hovering altitude and is zero at O_c.

According to Eq. 8, the bounding box has to be shifted radially. Two approaches are described: The first one requires knowledge of the vehicle sizes, and the second one is an approximation for unknown vehicle dimensions. Since the training is performed to detect the complete vehicle body, the corner closest to O_c can usually be identified as the bottom of the vehicle, so the height of this point is equal to the ground clearance. Knowing the height of this corner, its displacement is corrected as described in the following.

Defining the horizontal and vertical resolution of the image as r_x and r_y, the coordinates in PCF of b_i w.r.t. the image center are given by Eq. (9):

    (x_{b,i,img}, y_{b,i,img})^T = (x_{b,i,P} - r_x/2, y_{b,i,P} - r_y/2)^T .   (9)

The shift ∆_{x,P} along the x_P axis is calculated on the PCF by

    ∆_{x,P} = (x_{b,i,img} · h_{b,i,L}) / H ,   (10)

where h_{b,i,L} is the height of the i-th corner over the LTP. The shift ∆_{y,P} is computed by analogy along the y_P axis. The approximated coordinates b_{i,shift} of b_i are then given by

    b_{i,shift} = b_i - (∆_{x,P}, ∆_{y,P})^T .   (11)

Let w be the known width and l the known length of the vehicle and b_1 be the corner closest to the image centre. Then, b_1 is used for scaling b_l and b_w as follows:

    b_{w,scaled} = (w / ŵ) · (b_w - b_1) + b_1 , and   (12)

    b_{l,scaled} = (l / l̂) · (b_l - b_1) + b_1 ,   (13)

where ŵ is the element of B_P associated with ||b_2 - b_1|| and l̂ the element associated with ||b_3 - b_1||, respectively. The shifted centre of the vehicle can then be calculated by

    o_{veh,shift} = (b_{w,scaled} + b_{l,scaled}) / 2 .   (14)

When gathering data on public roads, the vehicle dimensions are unknown and cannot be estimated with a mono camera. An approximation for the displacement can be performed by assuming that, for the two corners of the bounding box which form l̂ and of which one is the closest to O_c, the ground clearance is usually visible. The height of the ground clearance can be approximated with 15 cm for passenger cars [29]. The remaining two corners can usually be referred to as the vehicle body shoulders, which protrude further than the roof of the vehicle. The shoulder height is roughly half of the vehicle height and can be approximated with 75 cm for passenger cars. Then all four corners can be shifted following Eq. (10). Although this is only a coarse approximation, the overall error is reduced compared to the initial situation of neglecting the displacement.

IV. EXPERIMENTS

In this section, the main potential sources of errors are discussed. Afterwards, the results of the position estimation are presented.

A. Sources of errors

The overall process involves several steps, all of which affect the accuracy of the position estimation. Two groups of potential errors can be distinguished: The first group describes general issues arising from aerial imagery captured by a UAV. Here, the registration and relief displacement have the most significant influence. The second group is only of concern when comparing to a reference system. In this group, maximizing the distance between the GCPs and localizing them precisely on the PCF is essential. The following enumeration lists the main potential sources of errors:

• Camera calibration and image registration,
• Image compression,
• Camera exposure time,
• Training data generation and object shape detection,
• Location of the shape boundary on the vehicle body,
• Rotation and GSD with GCPs,
• Sensor synchronisation.

Wide-angle lenses have a preferable focal length for capturing a large area on the ground. They are usually affected by barrel distortion, which decreases the GSD with increasing distance from the optical axis O_c. According to the drone manufacturer, the camera in use performs the corrections automatically. Every pixel deviation in the feature detection and matching during the image registration process inevitably leads to a deviation in rotation and GSD. Changing light conditions and the slight movements of the hovering drone affect the perception and thus influence the matching. Figure 5 depicts a typical example of rotation and scaling, recorded over a period of one minute at 100 m altitude. While achieving robust results, some potential outliers with a magnitude of approximately 0.1 % can be observed for the scaling parameter, which translates into a deviation of up to 12 cm at 100 m altitude. Filtering these variables was omitted to examine the robustness of the image registration algorithms.

Fig. 5: Image registration: scaling and rotation over a period of 60 s. The scaling factor shows a drop at around 35 s, caused by an altitude drop.

TABLE II: AP evaluation for all altitudes and both training sets.

Weights    Specialized                General
Metric     AP@0.5   AP@[0.5, 0.95]   AP@0.5   AP@[0.5, 0.95]
50 m       1.00     0.89             1.00     0.84
75 m       1.00     0.89             1.00     0.82
100 m      1.00     0.91             1.00     0.86
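The relief displacement correction of Eqs. (9)-(11) translates directly into code. A minimal sketch, assuming the corner height over ground and the flight altitude H are given in the same unit (meters), so the shift comes out in pixels:

```python
import numpy as np

def shift_corner(p_pcf, resolution, height, altitude):
    """Correct the relief displacement of one bounding-box corner.
    Implements Eqs. (9)-(11): express the corner relative to the image
    centre (Eq. 9), compute the radial shift proportional to the corner
    height over ground divided by the flight altitude H (Eq. 10 and its
    y-analogue), and subtract it (Eq. 11)."""
    p = np.asarray(p_pcf, dtype=float)
    centre = np.asarray(resolution, dtype=float) / 2.0
    p_img = p - centre                      # Eq. (9): coords w.r.t. image centre
    delta = p_img * height / altitude       # Eq. (10) per axis
    return p - delta                        # Eq. (11): shifted corner

# Example: at H = 100 m, a corner at pixel (1200, 640) in a 1920x1080
# image, assumed to lie at the 15 cm ground clearance [29], is shifted
# radially towards the image centre by (0.36, 0.15) pixels.
```

This also reproduces the rule of thumb stated in the text: 60 m of ground distance from the principal point at H = 100 m yields 0.6 units of displacement per unit of object height.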
The images are compressed in two ways. First, the resolution is reduced by half for both axes. Second, storing the images as JPEGs leads to lossy image compression. For example, smooth transitions can be found, which reduce the sharpness of edges. Another component is the camera exposure time. With longer exposure times, motion blur is induced, which can stretch the vehicle in the image or blur the edges. Therefore, short exposure times are preferable, but at the cost of less light exposure.

A robust semantic segmentation is crucial for achieving reliable results. Even though Mask R-CNN provides excellent results, see Section IV-B, minor deviations of at least one pixel cannot be avoided. The deviations stem mainly from the manual image labeling and limited training data.

With a radially increasing distance from O_c, the relief displacement has a major influence on the results. The problem in correcting the displacement of vehicles is rooted in their complex shapes and varying heights. It is difficult to determine which part of the car has been detected exactly, even when inspected by a human: Assuming a flight altitude of 100 m and a vehicle detected close to the image border, e.g., at a distance of 60 m to O_c, the displacement increases by 0.6 cm per cm change in object height. Detecting a feature of the vehicle at 30 cm height, instead of the body bottom (Section III-D.2), thus yields an error of 9 cm.

The last two sources of error are only of concern when comparing the results with a reference system. The mapping of the PCF to the LTP is based on localizing the GCPs, see Section III. Due to the limited resolution and image compression, a mis-localization of typically one pixel per GCP in the PCF can be induced. Hence, all variables concerning the mapping, namely the GSD S, orientation offset δ and linear offset ∆, see Section III-D.1, are affected. The resulting error depends on the distance between two GCPs, hence |g_{i+1,L} - g_{i,L}| should be maximized.

Synchronisation between the two sensors is attained via UTC time stamps. Since UTC time stamps cannot be associated with a certain frame for the drone in use, an LED signal, triggered by the pulse-per-second (PPS) signal from a satellite navigation receiver, was recorded. This appears to be the best solution, since the circuit delay within the receiver and the LED rise time can be neglected. The first video frame showing the illuminated LED is associated with the corresponding UTC time stamp. Hence, the maximum synchronisation error is limited here to 1/fps = 20 ms. During the experiments, the maximum speed was around 30 km/h, leading to a worst-case error of 17 cm due to synchronisation.

B. Detection performance

A set of 50 images has been labeled for evaluation. Table II depicts the Average Precision (AP) results according to the PASCAL VOC (AP@IoU = 0.5) [30] and COCO (AP@IoU[0.5:0.05:0.95]) [23] definitions, where the Intersection over Union threshold is abbreviated as IoU. The detection is robust and the AP is similar for all three flight heights. An AP@IoU = 0.5 of 1 exhibits a detection rate of 100 % for the evaluation images. That holds for all images (see Table I) detected with the specialized training weights. For the general weights, the detection rate is 99.97 % w.r.t. the images listed in Table I. This is reasonable, since the environment of the test track does not exhibit structures to be confused with a top-down view of a vehicle shape. It should be noted that the detection performance does not directly reflect the accuracy of the position estimation, since the bounding box is computed according to the outermost pixels of the shape. Assuming the outermost pixels are detected and they match the actual vehicle body border, the position could still be computed correctly on a pixel level, although the IoU is less than one.

C. Position estimation

This section is concluded with the experimental results. Figure 6 depicts the graphs for each flight altitude, both training weights and the three main processing steps, where raw depicts results for non-registered images, reg for registered images, and reg+shift for registered images with corrected relief displacement.

Fig. 6: Cumulative frequency diagrams of the error (raw, reg and reg+shift): The left column depicts the results for the specialized training weights, the right column for the general training weights. The error above each plot is depicted in [cm], and in [px] below each plot.

Image registration is the key to obtaining reasonable results. The correction of the relief displacement improves the results by 0.8 px on a weighted average4. Note that the impact of the displacement depends on the distance R. Hence, data sets recording vehicles at the image border benefit more. As mentioned before, the relief displacement is reduced with higher hovering altitudes. Recalling the issue regarding which vehicle part actually corresponds to the detected outermost pixel (Section III) explains the best results in pixel measure for an altitude of 100 m. However, expressed in metric units, the error is lowest at 50 m. The experiments with the general training weights, which are based on images recorded on public roads, perform on weighted average only 0.3 px worse. This underlines the suitability of the framework for applications on public roads. Table III depicts a detailed comparison.

4 weighted by the number of frames per height, see Section III

TABLE III: Accumulated frequency of the error: Depicted for all three altitudes, both training weights and corrections.

Altitude      |         100 m           |          75 m           |          50 m
Weights       | Specialized | General   | Specialized | General   | Specialized | General
Corrections   | reg  reg+shift | reg  reg+shift | reg  reg+shift | reg  reg+shift | reg  reg+shift | reg  reg+shift
Median [px]   | 3.27 2.41   | 3.33 2.71 | 3.26 2.70   | 3.67 3.08 | 4.47 3.93   | 4.14 4.05
Mean [px]     | 3.87 2.95   | 3.96 3.27 | 3.75 2.99   | 4.09 3.34 | 4.53 3.98   | 4.47 4.26
90% [px]      | 7.39 6.01   | 7.52 6.50 | 6.98 5.28   | 7.42 5.80 | 7.04 6.08   | 7.75 7.12
99% [px]      | 11.19 8.88  | 11.33 9.58 | 9.67 8.08  | 9.60 8.27 | 9.97 8.10   | 10.81 10.05
99.9% [px]    | 11.85 9.72  | 12.23 12.07 | 11.15 9.63 | 11.07 10.04 | 10.83 8.95 | 12.89 13.27
Mean [m]      | 0.27 0.20   | 0.27 0.23 | 0.20 0.16   | 0.21 0.17 | 0.16 0.14   | 0.16 0.15

The mean error is 20 cm and 14 cm for a flight altitude of 100 m and 50 m, respectively. Regardless of the training weights, 90 % of all frames have an error of 7 px or less.

Summarizing this section and the experimental results leads to the following conclusions: 1) A robust image registration is crucial for a good performance. Omitting the effect of the relief displacement yields larger errors when hovering above the region of interest is not feasible and objects are detected throughout the complete image frame. 2) Considering the pixelwise results, similar performance can be observed for all three altitudes, which proves that data can be obtained from different flight heights by a single Mask R-CNN network. This advantage can also be helpful for different object sizes. 3) The best results in metric values are retrieved at lower altitudes. Alternatively, in order to capture a larger surface area, one can record at higher altitudes, increase the resolution and crop the image if necessary. 4) A part of the error in Table III can be assigned to the synchronisation and mapping uncertainty, which is only of concern when benchmarking two data sources. 5) The vehicle-to-lane association also holds for the general training set, so that one can expect similar results for public roads, given an appropriate training data set.

V. CONCLUSIONS

The estimated vehicle position is compared to a reference system. Recording the data on a test track with consistent conditions ensures meaningful results. It is shown that, without applying any time-smoothing techniques, the position can be estimated in a reliable manner. The mean error is 20 cm and 14 cm for a flight altitude of 100 m and 50 m, respectively. Furthermore, 90 % of the 53 855 independently evaluated frames have an error of 7 px or less. To highlight the generalization capabilities, the experiments were analysed for two training data sets. One is a specialized data set, while the second was recorded on public roads. Both sets perform at a similar level. This allows the framework to be used for a wide range of applications. Interested readers are referred to the repository [31], where the code, label data and example videos are made publicly available.

ACKNOWLEDGMENT

The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany (BMBF) in the framework of FH-Impuls (project number 03FH7I02IA). The authors thank the AUDI AG department for Testing Total Vehicle for supporting this work.
REFERENCES

[1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets Robotics: The KITTI Dataset,” International Journal of Robotics Research (IJRR), 2013.
[2] “Waymo Open Dataset: An autonomous driving dataset,” 2019. [Online]. Available: https://www.waymo.com/open
[3] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. H. Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, and P. Schuberth, “A2D2: AEV Autonomous Driving Dataset,” 2019. [Online]. Available: http://www.a2d2.audi
[4] E. Sánchez Morales, F. Kruber, M. Botsch, B. Huber, and A. García Higuera, “Accuracy Characterization of the Vehicle State Estimation from Aerial Imagery,” in IEEE Intelligent Vehicles Symposium (IV), 2020.
[5] E. Sánchez Morales, R. Membarth, A. Gaull, P. Slusallek, T. Dirndorfer, A. Kammenhuber, C. Lauer, and M. Botsch, “Parallel Multi-Hypothesis Algorithm for Criticality Estimation in Traffic and Collision Avoidance,” in IEEE Intelligent Vehicles Symposium (IV), 2019.
[6] P. Nadarajan, M. Botsch, and S. Sardina, “Machine Learning Architectures for the Estimation of Predicted Occupancy Grids in Road Traffic,” Journal of Advances in Information Technology, 2018.
[7] L. Neubert, “Statistische Analyse von Verkehrsdaten und die Modellierung von Verkehrsfluss mittels zellularer Automaten,” Ph.D. dissertation, Universität Duisburg, 2000.
[8] F. Kruber, J. Wurst, S. Chakraborty, and M. Botsch, “Highway traffic data: macroscopic, microscopic and criticality analysis for capturing relevant traffic scenarios and traffic modeling based on the highD data set,” 2019. [Online]. Available: https://arxiv.org/abs/1903.04249
[9] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in IEEE International Conference on Image Processing (ICIP), 2016.
[10] C. Kyrkou, G. Plastiras, T. Theocharides, S. I. Venieris, and C. Bouganis, “DroNet: Efficient convolutional neural network detector for real-time UAV applications,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2018.
[11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] Q. Li, L. Mou, Q. Xu, Y. Zhang, and X. X. Zhu, “R3-Net: A Deep Network for Multi-oriented Vehicle Detection in Aerial Images and Videos,” CoRR, 2018. [Online]. Available: http://arxiv.org/abs/1808.05560
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[14] L. Mou and X. X. Zhu, “Vehicle Instance Segmentation From Aerial Image and Video Using a Multitask Learning Residual Fully Convolutional Network,” IEEE Transactions on Geoscience and Remote Sensing, 2018.
[15] G. Guido, V. Gallelli, D. Rogano, and A. Vitale, “Evaluating the accuracy of vehicle tracking data obtained from Unmanned Aerial Vehicles,” International Journal of Transportation Science and Technology, 2016.
[16] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein, “The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems,” in 21st International Conference on Intelligent Transportation Systems (ITSC), 2018.
[17] J. Bock, R. Krajewski, T. Moers, L. Vater, S. Runde, and L. Eckstein, “The inD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories at German Intersections,” 2019. [Online]. Available: https://arxiv.org/abs/1911.07602
[18] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kümmerle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka, “INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps,” arXiv:1910.03088, 2019.
[19] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Human Trajectory Prediction In Crowded Scenes,” in European Conference on Computer Vision (ECCV), 2016.
[20] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded Up Robust Features,” in Computer Vision – ECCV, 2006.
[21] P. Torr and A. Zisserman, “MLESAC: A New Robust Estimator with Application to Estimating Image Geometry,” Computer Vision and Image Understanding, June 2000.
[22] W. Abdulla, “Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow,” 2017. [Online]. Available: https://github.com/matterport/Mask_RCNN
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Computer Vision – ECCV, 2014.
[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advances in Neural Information Processing Systems 28, 2015.
[25] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[27] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.
[28] T. M. Lillesand, R. W. Kiefer, and J. W. Chipman, Remote Sensing and Image Interpretation, 5th ed., 2003.
[29] Verband der TÜV e.V., “Merkblatt 751,” 2008.
[30] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes (VOC) Challenge,” International Journal of Computer Vision, 2010.
[31] F. Kruber and E. Sánchez Morales, “Vehicle Detection and State Estimation with Aerial Imagery.” [Online]. Available: https://github.com/fkthi