Fig. 3. Sensor Setup. This figure illustrates the dimensions and mounting positions of the sensors (red) with respect to the vehicle body. Heights above ground are marked in green and measured with respect to the road surface. Transformations between sensors are shown in blue.
Fig. 2. Recording Zone. This figure shows the GPS traces of our recordings in the metropolitan area of Karlsruhe, Germany. Colors encode the GPS signal quality: Red tracks have been recorded with highest precision using RTK corrections, blue denotes the absence of correction signals. The black runs have been excluded from our data set as no GPS signal has been available.

date/
  date_drive/
    date_drive.zip
      image_0x/        x = {0,..,3}
        data/
          frame_number.png
        timestamps.txt
      oxts/
        data/
          frame_number.txt
        dataformat.txt
        timestamps.txt
      velodyne_points/
        data/
          frame_number.bin
        timestamps.txt
        timestamps_start.txt
        timestamps_end.txt
    date_drive_tracklets.zip
      tracklet_labels.xml
  date_calib.zip
    calib_cam_to_cam.txt
    calib_imu_to_velo.txt
    calib_velo_to_cam.txt

Fig. 4. Structure of the provided Zip-Files and their location within a global file structure that stores all KITTI sequences. Here, 'date' and 'drive' are placeholders, and 'image_0x' refers to the 4 video camera streams.

A. Data Description

All sensor readings of a sequence are zipped into a single file named date_drive.zip, where date and drive are placeholders for the recording date and the sequence number. The directory structure is shown in Fig. 4. Besides the raw recordings ('raw data'), we also provide post-processed data ('synced data'), i.e., rectified and synchronized video streams, on the dataset website.

Timestamps are stored in timestamps.txt and per-frame sensor readings are provided in the corresponding data sub-folders. Each line in timestamps.txt is composed of the date and time in hours, minutes and seconds. As the Velodyne laser scanner has a 'rolling shutter', three timestamp files are provided for this sensor: one for the start position (timestamps_start.txt) of a spin, one for the end position (timestamps_end.txt) of a spin, and one for the time where the laser scanner is facing forward and triggering the cameras (timestamps.txt). The data format in which each sensor stream is stored is as follows:

a) Images: Both color and grayscale images are stored with loss-less compression using 8-bit PNG files. The engine hood and the sky region have been cropped. To simplify working with the data, we also provide rectified images. The size of the images after rectification depends on the calibration parameters and is ∼0.5 Mpx on average. The original images before rectification are available as well.

b) OXTS (GPS/IMU): For each frame, we store 30 different GPS/IMU values in a text file: the geographic coordinates including altitude, global orientation, velocities, accelerations, angular rates, accuracies and satellite information. Accelerations and angular rates are both specified using two coordinate systems, one which is attached to the vehicle body (x, y, z) and one that is mapped to the tangent plane of the earth surface at that location (f, l, u). From time to time we encountered short (∼1 second) communication outages with the OXTS device, for which we interpolated all values linearly and set the last 3 entries to '-1' to indicate the missing information. More details are provided in dataformat.txt. Conversion utilities are provided in the development kit.

c) Velodyne: For efficiency, the Velodyne scans are stored as floating point binaries that are easy to parse using the C++ or MATLAB code provided. Each point is stored with its (x, y, z) coordinate and an additional reflectance value (r). While the number of points per scan is not constant, on average each file/frame has a size of ∼1.9 MB, which corresponds to ∼120,000 3D points and reflectance values. Note that the Velodyne laser scanner rotates continuously around its vertical axis (counter-clockwise), which can be taken into account using the timestamp files.

B. Annotations

For each dynamic object within the reference camera's field of view, we provide annotations in the form of 3D bounding box tracklets, represented in Velodyne coordinates. We define the classes 'Car', 'Van', 'Truck', 'Pedestrian', 'Person (sitting)', 'Cyclist', 'Tram' and 'Misc' (e.g., Trailers, Segways). The tracklets are stored in date_drive_tracklets.xml.
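The Velodyne binary layout described in c) above (consecutive float32 quadruples x, y, z, r) can be parsed in a few lines. A minimal NumPy sketch, using a placeholder file name and synthetic data rather than an actual KITTI recording:

```python
import numpy as np

def read_velodyne_scan(path):
    """Read one Velodyne frame: a flat binary stream of float32
    values, four per point (x, y, z, reflectance r)."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

# Self-contained demo: write a tiny synthetic scan, then parse it.
# ("example_scan.bin" is a placeholder, not a real KITTI file name.)
points = np.array([[1.0,  2.0, 0.5, 0.30],
                   [4.0, -1.0, 0.2, 0.90]], dtype=np.float32)
points.tofile("example_scan.bin")
scan = read_velodyne_scan("example_scan.bin")
print(scan.shape)  # (2, 4)
```

A real frame yields on the order of 120,000 rows; the reshape to four columns relies only on the flat float32 layout stated above.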
[Figure: a 3D bounding box tracklet, shown in object coordinates (length, width, height, rotation rz about the vertical axis) and in Velodyne coordinates.]
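A tracklet's 3D bounding box, given its dimensions (length l, width w, height h) and its rotation rz about the vertical axis as in the figure above, can be converted into its eight corner points in Velodyne coordinates. The sketch below is illustrative only: the yaw-only rotation and the bottom-face box origin are assumptions, and the function is not part of the official development kit.

```python
import numpy as np

def box_corners(translation, size, rz):
    """Return the 8 corners (3x8) of a 3D bounding box in Velodyne
    coordinates, given its center translation (x, y, z), its size
    (length l, width w, height h) and its yaw angle rz around the
    vertical z-axis. Assumed conventions: box origin on the bottom
    face, rotation about z only."""
    l, w, h = size
    # Axis-aligned corners around the origin, z measured from the bottom.
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    z = np.array([0., 0., 0., 0.,  h,  h,  h,  h])
    R = np.array([[np.cos(rz), -np.sin(rz), 0.0],
                  [np.sin(rz),  np.cos(rz), 0.0],
                  [0.0,         0.0,        1.0]])
    return R @ np.vstack([x, y, z]) + np.asarray(translation).reshape(3, 1)

# Hypothetical tracklet: a car-sized box 10 m ahead, rotated 90 degrees.
corners = box_corners(translation=(10.0, 2.0, -1.7),
                      size=(4.5, 1.8, 1.5), rz=np.pi / 2)
```

The resulting corners can then be mapped into any camera image with the projection equations given in the calibration section below.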
• GPS/IMU: x = forward, y = left, z = up

Notation: In the following, we write scalars in lower-case letters (a), vectors in bold lower-case (a) and matrices using bold-face capitals (A). 3D rigid-body transformations which take points from coordinate system a to coordinate system b will be denoted by T_a^b, with T for 'transformation'.

A. Synchronization

In order to synchronize the sensors, we use the timestamps of the Velodyne 3D laser scanner as a reference and consider each spin as a frame. We mounted a reed contact at the bottom of the continuously rotating scanner, triggering the cameras when facing forward. This minimizes the differences in the range and image observations caused by dynamic objects. Unfortunately, the GPS/IMU system cannot be synchronized that way. Instead, as it provides updates at 100 Hz, we collect the information with the closest timestamp to the laser scanner timestamp for a particular frame, resulting in a worst-case time difference of 5 ms between a GPS/IMU and a camera/Velodyne data package. Note that all timestamps are provided such that positioning information at any time can be easily obtained via interpolation. All timestamps have been recorded on our host computer using the system clock.

B. Camera Calibration

For calibrating the cameras intrinsically and extrinsically, we use the approach proposed in [11]. Note that all camera centers are aligned, i.e., they lie on the same x/y-plane. This is important as it allows us to rectify all images jointly.

The calibration parameters for each day are stored in row-major order in calib_cam_to_cam.txt using the following notation:

• s^(i) ∈ N^2 . . . . . . original image size (1392 × 512)
• K^(i) ∈ R^{3×3} . . . . calibration matrices (unrectified)
• d^(i) ∈ R^5 . . . . . . distortion coefficients (unrectified)
• R^(i) ∈ R^{3×3} . . . . rotation from camera 0 to camera i
• t^(i) ∈ R^{1×3} . . . . translation from camera 0 to camera i
• s_rect^(i) ∈ N^2 . . . . image size after rectification
• R_rect^(i) ∈ R^{3×3} . . rectifying rotation matrix
• P_rect^(i) ∈ R^{3×4} . . projection matrix after rectification

Here, i ∈ {0, 1, 2, 3} is the camera index, where 0 represents the left grayscale, 1 the right grayscale, 2 the left color and 3 the right color camera. Note that the variable definitions are compliant with the OpenCV library, which we used for warping the images. When working with the synchronized and rectified datasets only the variables with rect-subscript are relevant. Note that due to the pincushion distortion effect the images have been cropped such that the size of the rectified images is smaller than the original size of 1392 × 512 pixels.

The projection of a 3D point x = (x, y, z, 1)^T in rectified (rotated) camera coordinates to a point y = (u, v, 1)^T in the i'th camera image is given as

    y = P_rect^(i) x                                        (3)
with

    P_rect^(i) = | f_u^(i)   0         c_u^(i)   -f_u^(i) b_x^(i) |
                 | 0         f_v^(i)   c_v^(i)    0               |    (4)
                 | 0         0         1          0               |

the i'th projection matrix. Here, b_x^(i) denotes the baseline (in meters) with respect to reference camera 0. Note that in order to project a 3D point x in reference camera coordinates to a point y on the i'th image plane, the rectifying rotation matrix R_rect^(0) of the reference camera must be considered as well:

    y = P_rect^(i) R_rect^(0) x                             (5)

Here, R_rect^(0) has been expanded into a 4×4 matrix by appending a fourth zero-row and column, and setting R_rect^(0)(4,4) = 1.
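Equations (3)-(5) amount to a homogeneous matrix product followed by division by the third coordinate. A minimal NumPy sketch with made-up focal length, principal point and rectifying rotation, not values from an actual calibration file:

```python
import numpy as np

# Hypothetical calibration values (not from a real calib_cam_to_cam.txt):
# fu = fv = 700 px, principal point (600, 180), zero baseline (camera 0).
P_rect = np.array([[700.0,   0.0, 600.0, 0.0],
                   [  0.0, 700.0, 180.0, 0.0],
                   [  0.0,   0.0,   1.0, 0.0]])

def expand_to_4x4(R):
    """Expand a 3x3 rotation to 4x4 by appending a fourth zero-row
    and column and setting element (4,4) to 1, as used in Eq. (5)."""
    R4 = np.eye(4)
    R4[:3, :3] = R
    return R4

def project(P_rect, R_rect, x):
    """Eq. (5): y = P_rect^(i) R_rect^(0) x, then divide by the third
    coordinate to obtain pixel coordinates (u, v)."""
    y = P_rect @ expand_to_4x4(R_rect) @ x
    return y[:2] / y[2]

# With the identity as rectifying rotation, Eq. (5) reduces to Eq. (3).
x = np.array([2.0, 1.0, 10.0, 1.0])   # homogeneous point 10 m ahead
u, v = project(P_rect, np.eye(3), x)
print(u, v)  # 740.0 250.0
```

The same `project` helper applies to equations (7) and (8) below by multiplying the additional rigid-body transformations into the chain before the rectifying rotation.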
Fig. 8. Object Occurrence and Orientation Statistics of our Dataset. This figure shows the different types of objects occurring in our sequences (top) and the orientation histograms (bottom) for the two most predominant categories 'Car' and 'Pedestrian'.

C. Velodyne and IMU Calibration

We have registered the Velodyne laser scanner with respect to the reference camera coordinate system (camera 0) by initializing the rigid body transformation using [11]. Next, we optimized an error criterion based on the Euclidean distance of 50 manually selected correspondences and a robust measure on the disparity error with respect to the 3 top performing stereo methods in the KITTI stereo benchmark [8]. The optimization was carried out using Metropolis-Hastings sampling.
The rigid body transformation from Velodyne coordinates to camera coordinates is given in calib_velo_to_cam.txt:

• R_velo^cam ∈ R^{3×3} . . rotation matrix: velodyne → camera
• t_velo^cam ∈ R^{1×3} . . translation vector: velodyne → camera

Using

    T_velo^cam = | R_velo^cam   t_velo^cam |
                 | 0            1          |                (6)

a 3D point x in Velodyne coordinates gets projected to a point y in the i'th camera image as

    y = P_rect^(i) R_rect^(0) T_velo^cam x                  (7)

For registering the IMU/GPS with respect to the Velodyne laser scanner, we first recorded a sequence with an '∞'-loop and registered the (untwisted) point clouds using the Point-to-Plane ICP algorithm. Given two trajectories, this problem corresponds to the well-known hand-eye calibration problem, which can be solved using standard tools [12]. The rotation matrix R_imu^velo and the translation vector t_imu^velo are stored in calib_imu_to_velo.txt. A 3D point x in IMU/GPS coordinates gets projected to a point y in the i'th image as

    y = P_rect^(i) R_rect^(0) T_velo^cam T_imu^velo x       (8)

V. SUMMARY AND FUTURE WORK

In this paper, we have presented a calibrated, synchronized and rectified autonomous driving dataset capturing a wide range of interesting scenarios. We believe that this dataset will be highly useful in many areas of robotics and computer vision. In the future we plan on expanding the set of available sequences by adding additional 3D object labels for currently unlabeled sequences and recording new sequences, for example in difficult lighting situations such as at night, in tunnels, or in the presence of fog or rain. Furthermore, we plan on extending our benchmark suite by novel challenges. In particular, we will provide pixel-accurate semantic labels for many of the sequences.

REFERENCES

[1] G. Singh and J. Kosecka, "Acquiring semantics induced topology in urban environments," in ICRA, 2012.
[2] R. Paul and P. Newman, "FAB-MAP 3D: Topological mapping with spatial and visual appearance," in ICRA, 2010.
[3] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele, "Monocular visual scene understanding: Understanding multi-object traffic scenes," PAMI, 2012.
[4] D. Pfeiffer and U. Franke, "Efficient representation of traffic scenes by means of dynamic stixels," in IV, 2010.
[5] A. Geiger, M. Lauer, and R. Urtasun, "A generative model for 3d urban scene understanding from movable platforms," in CVPR, 2011.
[6] A. Geiger, C. Wojek, and R. Urtasun, "Joint 3d estimation of objects and scene layout," in NIPS, 2011.
[7] M. A. Brubaker, A. Geiger, and R. Urtasun, "Lost! Leveraging the crowd for probabilistic visual self-localization," in CVPR, 2013.
[8] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[9] M. Goebl and G. Faerber, "A real-time-capable hard- and software architecture for joint image and knowledge processing in cognitive automobiles," in IV, 2007.
[10] P. Osborne, "The mercator projections," 2008. [Online]. Available: http://mercator.myzen.co.uk/mercator.pdf
[11] A. Geiger, F. Moosmann, O. Car, and B. Schuster, "A toolbox for automatic calibration of range and camera sensors using a single shot," in ICRA, 2012.
[12] R. Horaud and F. Dornaika, "Hand-eye calibration," IJRR, vol. 14, no. 3, pp. 195-210, 1995.
Fig. 9. Number of Object Labels per Class and Image. This figure shows how often an object occurs in an image. Since our labeling efforts focused on cars and pedestrians, these are the most predominant classes here.

Fig. 10. Egomotion, Sequence Count and Length. This figure shows (from left to right) the egomotion (velocity and acceleration) of our recording platform for the whole dataset. Note that we excluded sequences with a purely static observer from these statistics. The length of the available sequences is shown as a histogram counting the number of frames per sequence. The rightmost figure shows the number of frames/images per scene category.