
Massachusetts Institute of Technology AgeLab Technical Report 2020-1

MIT DriveSeg (Manual) Dataset for Dynamic Driving Scene Segmentation

Li Ding¹, Jack Terwilliger¹, Rini Sherony², Bryan Reimer¹, and Lex Fridman¹

¹ Massachusetts Institute of Technology (MIT)
² Collaborative Safety Research Center, Toyota Motor North America

Abstract

This technical report summarizes and provides more detailed information about the MIT DriveSeg dataset [3], including technical aspects of data collection and annotation, and potential research directions.

1. Introduction

Semantic scene segmentation has primarily been addressed by forming representations of single images, with both supervised and unsupervised methods. The problem of semantic segmentation in dynamic scenes has recently begun to receive attention through video object segmentation approaches. What is not known is how much extra information the temporal dynamics of the visual scene carry that is complementary to the information available in the individual frames of the video. We provide the MIT DriveSeg dataset [3], a large-scale driving scene segmentation dataset, densely annotated for every pixel of every one of 5,000 video frames. The purpose of this dataset is to allow exploration of the value of temporal dynamics information for full scene segmentation in dynamic, real-world operating environments.

2. Related Work

There are many large-scale datasets for natural scenes and objects with semantic pixel-wise annotations, e.g. Pascal VOC [4], MS COCO [7], and ADE20k [13].

In the driving context, there are several well-developed datasets with dense semantic annotations, e.g. CamVid [1], KITTI [5], Cityscapes [2], Mapillary Vistas [8], and recently BDD [12]. Since our target problem is video scene parsing in the driving context, we group current public datasets into three categories:

1. Pixel-wise annotation on still images: no temporal information is provided, e.g. Mapillary Vistas [8].

2. Pixel-wise annotation on coarse video frames: either one annotated frame per short video clip, e.g. Cityscapes [2] and BDD [12], or consecutive video frames annotated at a low frequency, e.g. CamVid [1].

3. Pixel-wise annotation on dense video frames: e.g. the MIT DriveSeg dataset in this work.

Category 1 is the easiest to obtain and has the largest variability across scenes. Category 3 usually lacks variability, but is the most suitable for temporal modeling.

Among the above datasets, Mapillary Vistas contains 25k images with fine annotation, the largest number, but the images are all still images, i.e. without temporal connection between them. Both Cityscapes and BDD choose one frame from each short video clip for fine annotation. This is a common way to build a dataset, capturing most of the scene variability with the least annotation budget. However, it also leads current research to focus on single-frame algorithms, ignoring the rich temporal information contained in consecutive frames.

In order to address this problem, we collect and annotate a novel dataset, which has over 10k frames with fine annotation, from a single, untrimmed video of the front driving scene at 30 fps. A similar idea can be seen in CamVid, which also features annotated video frames, but only at a low frequency (1 fps) and at a small scale (fewer than 500 frames in total). Outside the driving scene domain, DAVIS [9] has dense pixel-wise semantic object annotations for trimmed video clips of around 60 frames each. Similarly, SegTrack [11] and SegTrack v2 [6] also feature pixel-wise video object annotation.

3. MIT DriveSeg Dataset

The two main purposes of developing the MIT DriveSeg dataset are:

1) to experiment with and develop a full-scene annotation system that is scalable with a large pool of workers, e.g. on Amazon Mechanical Turk (MTurk);

2) to create an open-source, densely annotated video driving scene dataset that can help with future research in various fields, e.g. spatiotemporal scene perception, predictive modeling, and semi-automatic annotation process development.

Figure 1: Examples from the proposed MIT DriveSeg dataset. Annotations are overlaid on frames.

3.1. Dataset Overview

We collect a long, untrimmed video (2 minutes 47 seconds, 5,000 frames in total) at 1080p (1920x1080) resolution and 30 fps, covering a single daytime driving trip around crowded city streets, and annotate it with fine, per-frame, pixel-wise semantic labels. Examples from the dataset are shown in Fig. 1.

3.2. Technical Aspects

Video recording. We use the camera model FDR-AX53 and place it on the hood of the vehicle (a Land Rover Evoque). The recording is at 1080p resolution and 30 fps with the XAVC S codec. The whole trip lasts 2 minutes 47 seconds, from which we extract a total of 5,000 frames.

Class definitions. We define 12 classes in our dataset: vehicle, pedestrian, road, sidewalk, bicycle, motorcycle, building, terrain (horizontal vegetation), vegetation (vertical vegetation), pole, traffic light, and traffic sign.
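As an illustration of how the released frames and label masks might be consumed, the following Python sketch maps the 12 classes to integer IDs and loads one frame with its annotation. The directory layout, file naming, and ID ordering here are assumptions made for the example, not a specification of the official release format.

```python
# Minimal sketch for loading a frame and its per-pixel class mask.
# NOTE: paths, file naming, and the ID ordering below are illustrative
# assumptions; consult the dataset release for the actual layout.
import numpy as np
from PIL import Image

# Hypothetical ID assignment following the order of the class list above.
CLASS_NAMES = [
    "vehicle", "pedestrian", "road", "sidewalk", "bicycle", "motorcycle",
    "building", "terrain", "vegetation", "pole", "traffic light", "traffic sign",
]
ID_TO_NAME = dict(enumerate(CLASS_NAMES))

def load_frame_and_mask(frame_idx, frames_dir="frames", masks_dir="annotations"):
    """Return the RGB frame and the integer class mask for one frame index."""
    frame = np.array(Image.open(f"{frames_dir}/{frame_idx:05d}.png").convert("RGB"))
    mask = np.array(Image.open(f"{masks_dir}/{frame_idx:05d}.png"))  # H x W of class IDs
    return frame, mask

frame, mask = load_frame_and_mask(0)
print({ID_TO_NAME[i]: int((mask == i).sum()) for i in np.unique(mask) if i in ID_TO_NAME})
```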
Open source. All 5,000 video frames and the corresponding annotations are released within the dataset. The dataset is made freely available to academic and non-academic entities for non-commercial purposes, such as academic research, teaching, and scientific publications. Permission is granted to use the data provided that you agree to the license terms (see Appendix A).

Dataset Split. We do not specify an official training / validation / testing split of the dataset. Instead, we release the whole dataset and encourage people to experiment with different split and sampling settings depending on the task, especially in the video domain. For the standard image semantic segmentation task, we suggest a split of 3,000 frames for training, 500 for validation, and 1,500 for testing.
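As one concrete reading of this suggestion, the sketch below partitions the 5,000 frame indices into contiguous training, validation, and testing subsets of 3,000, 500, and 1,500 frames. Contiguity is an assumption made here to preserve temporal continuity within each subset; other sampling schemes are equally valid.

```python
# One possible realization of the suggested 3,000 / 500 / 1,500 split.
# Contiguous blocks keep temporal continuity within each subset, which
# matters for video tasks; this is not an official split.
NUM_FRAMES = 5000
TRAIN, VAL, TEST = 3000, 500, 1500
assert TRAIN + VAL + TEST == NUM_FRAMES

frame_ids = list(range(NUM_FRAMES))
train_ids = frame_ids[:TRAIN]
val_ids = frame_ids[TRAIN:TRAIN + VAL]
test_ids = frame_ids[TRAIN + VAL:]
print(len(train_ids), len(val_ids), len(test_ids))  # 3000 500 1500
```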
4. Annotation

Existing full scene segmentation datasets rely on hiring a very small group of professional annotators to do full-frame annotation, which usually takes around 1.5 hours per image [2, 8]. This is reasonable because pixel-wise annotation is a high-complexity task that reaches the limits of human perception and motor function without specific training. However, in order to gain the scalability and flexibility to annotate at a potentially much larger scale, we develop an annotation tool that is web-based and easy to understand. We deploy our tool on MTurk, which provides a large pool of professional and non-professional workers.

Figure 2: Front-end of our annotation tool.

4.1. Annotation Tool

To support fine and quick annotations, we develop a web-based annotation tool following the common polygon annotator [10] design, implementing techniques such as zoom in/out and keyboard shortcuts. A screenshot of our annotation tool is shown in Fig. 2. To further promote the accuracy and efficiency of the annotation process, we also address several problems observed during the process and design specific improvements.

Task complexity. We found that annotators had difficulty when asked to annotate an entire scene at once, which required keeping many object classes in mind and working for a long, inflexible period. In response, we divided the task of annotating an entire scene into multiple subtasks, in each of which an annotator is responsible for annotating only one object class. We found that this largely reduces classification errors, improves the quality of our annotations, and reduces the time required to fully annotate each scene.

Limits of human perception. We found that many misclassification errors are due to the difficulty of recognizing objects in still scenes, yet these errors appear obviously incorrect in the video. In response, we designed the tool so that an annotator can step through consecutive frames quickly with keyboard shortcuts. The motion perceived when stepping through frames reduces classification errors.

4.2. Annotation Process

The intuition behind our annotation process is that small, simple tasks are preferable to large, complex tasks. By breaking down the semantic segmentation task into subtasks so that each worker is responsible for annotating only part of a scene, the annotations are: 1) easier to validate, 2) easier and more efficient to annotate, and 3) of higher quality. To accomplish this, we divide the work of annotating the video into tasks of 3 frames in which a worker is asked to draw polygons around only one class of object, e.g. vehicles.
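Since workers submit polygon vertices while the later validation and merging stages operate on pixel masks, a rasterization step sits in between. The sketch below shows one way to do this with PIL; it is an illustrative implementation, not the exact code behind the tool.

```python
from PIL import Image, ImageDraw
import numpy as np

def polygon_to_mask(polygon_xy, height, width):
    """Rasterize one worker-drawn polygon into a binary mask.

    polygon_xy: list of (x, y) vertex tuples in image coordinates.
    """
    canvas = Image.new("L", (width, height), 0)            # blank single-channel image
    ImageDraw.Draw(canvas).polygon(polygon_xy, outline=1, fill=1)
    return np.array(canvas, dtype=bool)

# Example: a triangular region in a 1920x1080 frame (vertex values are made up).
mask = polygon_to_mask([(100, 900), (400, 700), (700, 950)], height=1080, width=1920)
print(mask.sum(), "pixels inside the polygon")
```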

Our annotation process involves 4 stages: 1) task creation, 2) task distribution, 3) annotation validation, and 4) the assembly of sub-scene annotations into full-scene annotations.

For stage 1, the creation of tasks, we label which frames contain the classes we are interested in and group the frames into sets of 3. This stage removes cases where a worker is asked to annotate the presence of a class which is not present in a frame. Since this stage only requires labeling the frame numbers at which a member of a particular class enters the visual scene and the frame numbers at which the last member of that class leaves, it is much faster and cheaper than creating a semantic segmentation task for every frame and letting annotators discover that the class is absent. This approach creates significant time and cost savings, especially for rare classes, such as motorcycles in our case.
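To make stage 1 concrete, the following sketch turns per-class presence intervals into 3-frame, single-class tasks. The interval representation and example values are assumptions for illustration rather than the exact bookkeeping used during annotation.

```python
# Sketch of stage 1: build (class, frame-triplet) tasks from presence intervals.
# The interval format (first frame a class member enters, last frame the last
# member leaves) and the example values are assumed for illustration.
presence_intervals = {
    "vehicle": [(0, 4999)],          # present throughout the clip (example values)
    "motorcycle": [(1200, 1290)],    # rare class: only a short interval
}

def create_tasks(presence_intervals, frames_per_task=3):
    """Group frames into triplets and emit a task only where the class appears."""
    tasks = []
    for class_name, intervals in presence_intervals.items():
        for start, end in intervals:
            for first in range(start, end + 1, frames_per_task):
                frames = list(range(first, min(first + frames_per_task, end + 1)))
                tasks.append({"class": class_name, "frames": frames})
    return tasks

tasks = create_tasks(presence_intervals)
print(len(tasks), tasks[0])
```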
For stage 2, the distribution of tasks, we submit our tasks to MTurk and specify additional information which controls how our tasks are distributed:

Reward. This is the amount of money a worker receives for completing our task. We specify different rewards for different classes based on the estimated duration and effort of the annotation.

Qualifications. These allow us to limit the pool of workers who may work on our tasks based on 1) the worker's approval rate, calculated over all of the worker's work on the MTurk platform, 2) the total number of tasks the worker has completed, and 3) the qualification task we designed for every new worker taking our task for the first time, which is a test task that can be evaluated against known ground truth.
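For reference, a task configured this way could be submitted through the MTurk API roughly as follows, using the boto3 client. The annotation URL, reward amount, and qualification threshold shown here are placeholders, not the values used for the dataset.

```python
import boto3

# Hedged sketch of submitting one annotation task as an MTurk HIT.
mturk = boto3.client("mturk", region_name="us-east-1")

# ExternalQuestion pointing at the (hypothetical) hosted annotation tool URL.
question_xml = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/annotate?task_id=42</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>"""

response = mturk.create_hit(
    Title="Draw polygons around vehicles (3 frames)",
    Description="Outline all vehicles in three consecutive driving-scene frames.",
    Reward="0.35",                      # dollars as a string; placeholder, varies by class
    MaxAssignments=1,
    LifetimeInSeconds=3 * 24 * 3600,
    AssignmentDurationInSeconds=3600,
    Question=question_xml,
    QualificationRequirements=[
        {   # approval rate >= 95% across the worker's MTurk history (placeholder threshold)
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
        },
    ],
)
print(response["HIT"]["HITId"])
```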
For stage 3, annotation validation, we use both automated and manual processes for assessing the quality of worker annotations. In addition to the initial qualification task, workers are occasionally assigned additional test tasks, which are indistinguishable from non-ground-truth tasks, to check whether they are still following our instructions. If a worker's annotation deviates significantly from the ground truth, they are disqualified from working on our tasks in the future. The process of comparing a worker's annotations with the ground truth is automated by calculating the Jaccard distance. The threshold score is class dependent, since it is easier to score high on less complex objects like the road than on pedestrians. For our manual validation process, we visually verify that a worker's annotations are of sufficient quality, using a tool which steps through annotated frames like a video player and allows approving/rejecting work and blocking workers via key presses.
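The automated part of this check reduces to a mask comparison. The sketch below computes the Jaccard distance between a worker's binary mask and the ground truth and applies a class-dependent threshold; the threshold values are placeholders, not the ones used in practice.

```python
import numpy as np

# Placeholder thresholds: less complex classes tolerate less deviation.
MAX_JACCARD_DISTANCE = {"road": 0.05, "vehicle": 0.15, "pedestrian": 0.30}

def jaccard_distance(worker_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """1 - IoU between two binary masks of the same shape."""
    intersection = np.logical_and(worker_mask, gt_mask).sum()
    union = np.logical_or(worker_mask, gt_mask).sum()
    return 1.0 - (intersection / union if union > 0 else 1.0)

def passes_validation(worker_mask, gt_mask, class_name) -> bool:
    """Accept the annotation only if it stays within the class-specific tolerance."""
    return jaccard_distance(worker_mask, gt_mask) <= MAX_JACCARD_DISTANCE[class_name]
```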
For stage 4, the merging of sub-annotations, we combine the class-level annotations for a given frame into a full-scene annotation. For this task, we automatically compose the final full-scene annotation one class at a time. Our algorithm first draws the background classes, such as road and sidewalk, then stationary foreground objects, such as poles and buildings, and finally dynamic foreground objects, such as pedestrians and vehicles. In order for this to work, we carefully designed the instructions for each class so that they fit together harmoniously. The order in which we draw the classes dictates the instructions. When annotating the i-th class of n total classes, a worker must annotate the boundaries between objects of class i and classes j where j >= i. In other words, if we draw the road annotations before the vehicle annotations, workers do not need to draw the boundary between road and vehicle when annotating road, since this work will be handled by the workers annotating vehicles.
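The merging step amounts to painting per-class masks onto one label map in a fixed order, so that classes drawn later overwrite classes drawn earlier. A minimal sketch is shown below; the exact draw order and ID assignment are illustrative assumptions.

```python
import numpy as np

# Illustrative draw order: background first, dynamic foreground last.
DRAW_ORDER = ["road", "sidewalk", "terrain", "vegetation", "building",
              "pole", "traffic light", "traffic sign",
              "bicycle", "motorcycle", "vehicle", "pedestrian"]
CLASS_ID = {name: i + 1 for i, name in enumerate(DRAW_ORDER)}  # 0 = unlabeled

def compose_full_scene(class_masks: dict, height: int, width: int) -> np.ndarray:
    """Merge per-class binary masks into one label map, later classes on top."""
    label_map = np.zeros((height, width), dtype=np.uint8)
    for name in DRAW_ORDER:
        if name in class_masks:
            label_map[class_masks[name] > 0] = CLASS_ID[name]
    return label_map
```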
5. Research Directions

There are many potential research directions that can be pursued with this densely annotated video scene dataset. We provide a few open research questions where people may find the dataset helpful:

Spatio-temporal semantic segmentation. Having explored the value of temporal information in [3], we are interested in further research on novel ways to utilize temporal data, such as optical flow and driving state, to improve perception beyond what is possible with static images alone.

Predictive modeling. Can we know ahead of time what is going to happen on the road? Predictive power is an important component of human intelligence and can be crucial to the safety of autonomous driving. The dataset we provide is consistent in time and can therefore be used for predictive perception research.

Transfer learning. How much extra data do we need if we have a perception system trained in Europe and want to use it in Boston, US? In practice, much of the literature shows that pre-trained networks help generalization to other datasets or tasks. Transfer learning is key to training a deep neural network with limited data.

Deep learning with video encoding. Most current deep learning systems are based on RGB-encoded images. However, preserving the exact RGB values of every single frame in a video is too expensive in computation and storage.

Solving redundancy of video frames. How can we efficiently find useful data among visually similar frames? What is the best frame rate for a good perception system? One of the most important problems for real-time applications is the trade-off between efficiency and accuracy.

6. Conclusion

The MIT DriveSeg dataset [3] has been used to show the value of temporal dynamics information, and allows the computer vision community to explore modeling both short-term and long-term context as part of the driving scene segmentation task.

Acknowledgments

This work was in part supported by the Toyota Collaborative Safety Research Center. The views and conclusions expressed are those of the authors and do not necessarily reflect those of Toyota.

References

[1] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.

[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[3] L. Ding, J. Terwilliger, R. Sherony, B. Reimer, and L. Fridman. Value of temporal dynamics information in driving scene segmentation. arXiv preprint arXiv:1904.00758, 2019.

[4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.

[5] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[6] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In ICCV, 2013.

[7] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[8] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, pages 5000–5009, 2017.

[9] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.

[10] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1):157–173, 2008.

[11] D. Tsai, M. Flagg, and J. M. Rehg. Motion coherent tracking with multi-label MRF optimization. In BMVC, 2010.

[12] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687, 2018.

[13] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Appendix

A. License Agreement

The MIT DriveSeg Dataset is made freely available to academic and non-academic entities for non-commercial purposes such as academic research, teaching, and scientific publications. Permission is granted to use the data given that you agree:

1. That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, we (MIT, Toyota) do not accept any responsibility for errors or omissions.

2. That you include a reference to the MIT DriveSeg Dataset in any work that makes use of the dataset. For research papers, cite our preferred publication as listed on our website; for other media, cite our preferred publication as listed on our website or link to the website.

3. That you do not distribute this dataset or modified versions. It is permissible to distribute derivative works in so far as they are abstract representations of this dataset (such as models trained on it or additional annotations that do not directly include any of our data) and do not allow recovery of the dataset or anything similar in character.

4. That you may not use the dataset or any derivative work for commercial purposes, such as, for example, licensing or selling the data, or using the data with the purpose of procuring a commercial gain.

5. That all rights not expressly granted to you are reserved by us (MIT, Toyota).
