DOI: 10.1002/rob.21918
SURVEY ARTICLE

KEYWORDS
AI for self-driving vehicles, artificial intelligence, autonomous driving, deep learning for autonomous driving

1 | INTRODUCTION
Over the course of the last decade, deep learning and artificial intelligence (AI) became the main technologies behind many breakthroughs in computer vision (Krizhevsky, Sutskever, & Hinton, 2012), robotics (Andrychowicz et al., 2018), and natural language processing (NLP; Goldberg, 2017).¹ They also have a major impact in the autonomous driving revolution seen today both in academia and industry. Autonomous vehicles (AVs) and self-driving cars began to migrate from laboratory development and testing conditions to driving on public roads. Their deployment in our environmental landscape offers a decrease in road accidents and traffic congestion, as well as an improvement of our mobility in overcrowded cities. The title of "self-driving" may seem self-evident, but there are actually six Society of Automotive Engineers (SAE) Levels used to define autonomous driving: the SAE J3016 standard (SAE Committee, 2014) introduces a scale from 0 to 5 for grading vehicle automation. Lower SAE Levels feature basic driver assistance, whilst higher SAE Levels move towards vehicles requiring no human interaction whatsoever. Cars in the Level 5 category require no human input and typically will not even feature steering wheels or foot pedals.

Although most driving scenarios can be relatively simply solved with classical perception, path planning, and motion control methods, the remaining unsolved scenarios are corner cases in which traditional methods fail.

One of the first autonomous cars was developed by Ernst Dickmanns (Dickmanns & Graefe, 1988) in the 1980s. This paved the way for new

The authors are with Elektrobit Automotive and the Robotics, Vision and Control Laboratory (ROVIS Lab) at the Department of Automation and Information Technology, Transilvania University of Brasov, 500036 Brasov, Romania. See http://rovislab.com/sorin_grigorescu.html.
¹The articles referenced in this survey can be accessed at the web-page accompanying this paper, available at http://rovislab.com/survey_DL_AD.html
FIGURE 1 Deep learning-based self-driving car. The architecture can be implemented either as a sequential perception-planning-action pipeline (a) or as an End2End system (b). In the sequential pipeline case, the components can be designed either using AI and deep learning methodologies, or based on classical nonlearning approaches. End2End learning systems are mainly based on deep learning methods. A safety monitor is usually designed to ensure the safety of each module. AI, artificial intelligence [Color figure can be viewed at wileyonlinelibrary.com]
Given a route planned through the road network, the first task of an autonomous car is to understand and localize itself in the surrounding environment. On the basis of this representation, a continuous path is planned and the future actions of the car are determined by the behavior arbitration system. Finally, a motion control system reactively corrects errors generated in the execution of the planned motion. A review of classical non-AI design methodologies for these four components can be found in Paden, Cáp, Yong, Yershov, and Frazzoli (2016).

In the following, we give an introduction to deep learning and AI technologies used in autonomous driving, as well as surveying different methodologies used to design the hierarchical decision-making process described above. Additionally, we provide an overview of End2End learning systems, which encode this hierarchical process into a single deep learning architecture that directly maps sensory observations to control outputs.
3 | OVERVIEW OF DEEP LEARNING TECHNOLOGIES

In this section, we describe the basis of deep learning technologies used in AVs and comment on the capabilities of each paradigm. We focus on convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep reinforcement learning (DRL), which are the most common deep learning methodologies applied to autonomous driving.

Throughout the survey, we use the following notation to describe time-dependent sequences. The value of a variable is defined either for a single discrete timestep t, written as superscript ⟨t⟩, or as a discrete sequence defined in the ⟨t, t + k⟩ time interval, where k denotes the length of the sequence. For example, the value of a state variable z is defined either at discrete time t, as z⟨t⟩, or within a sequence interval, as z⟨t,t+k⟩. Vectors and matrices are indicated by bold symbols.
3.1 | Deep CNNs

CNNs are mainly used for processing spatial information, such as images, and can be viewed as image feature extractors and universal nonlinear function approximators (Bengio, Courville, & Vincent, 2013; Lecun, Bottou, Bengio, & Haffner, 1998). Before the rise of deep learning, computer vision systems used to be implemented based on handcrafted features, such as HAAR (Viola & Jones, 2001), local binary patterns (LBPs; Ojala, Pietikäinen, & Harwood, 1996), or histograms of oriented gradients (HoG; Dalal & Triggs, 2005). In comparison to these traditional handcrafted features, CNNs are able to automatically learn a representation of the feature space encoded within the training set.

CNNs can be loosely understood as an approximate analogy to the mammalian visual cortex. The visual information is received by the visual cortex in a crossed manner: the left visual cortex receives information from the right eye, whereas the right visual cortex is fed with visual data from the left eye. The information is processed according to the dual flux theory (Goodale & Milner, 1992), which states that the visual flow follows two main fluxes: a ventral flux, responsible for visual identification and object recognition, and a dorsal flux used for establishing spatial relations between objects. A CNN mimics the functioning of the ventral flux, in which different areas of the brain are sensitive to specific features in the visual field. The earlier brain cells in the visual cortex are activated by sharp transitions in the visual field of view, in the same way in which an edge detector highlights sharp transitions between the neighboring pixels in an image. These edges are further used in the brain to approximate object parts and finally to estimate abstract representations of objects.

A CNN is parametrized by its weights vector θ = [W, b], where W is the set of weights governing the interneural connections and b is the set of neuron bias values. The set of weights W is organized as image filters, with coefficients learned during training. Convolutional layers within a CNN exploit local spatial correlations of image pixels to learn translation-invariant convolution filters, which capture discriminant image features.

Consider a multichannel signal representation M_k in layer k, which is a channelwise integration of signal representations M_{k,c}, where c is a channel index. A signal representation can be generated in layer k + 1 as

M_{k+1,l} = φ(M_k ∗ w_{k,l} + b_{k,l}),  (1)

where w_{k,l} ∈ W is a convolutional filter with the same number of channels as M_k, b_{k,l} ∈ b represents the bias, l is a channel index, and ∗ denotes the convolution operation. φ(·) is an activation function applied to each pixel in the input signal. Typically, the rectified linear unit (ReLU) is the most commonly used activation function in computer vision applications (Krizhevsky, Sutskever, & Hinton, 2012). The final layer of a CNN is usually a fully connected layer which acts as an object discriminator on a high-level abstract representation of objects.
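As a concrete illustration of Equation (1), the following minimal sketch implements a single convolutional layer with a ReLU activation in PyTorch; the input size, number of filters, and kernel size are illustrative assumptions, not values prescribed by the surveyed methods.

```python
# A minimal sketch of Equation (1) using PyTorch: one convolutional layer,
# M_{k+1} = phi(M_k * w + b), with phi chosen as ReLU.
import torch
import torch.nn as nn

m_k = torch.randn(1, 3, 64, 64)             # M_k: batch x channels x height x width

conv = nn.Conv2d(in_channels=3,             # w_{k,l} has the same channels as M_k
                 out_channels=16,           # l = 1..16 output filters
                 kernel_size=3, padding=1)  # filter coefficients learned in training
phi = nn.ReLU()                             # activation applied to each pixel

m_k1 = phi(conv(m_k))                       # M_{k+1} = phi(M_k * w + b)
print(m_k1.shape)                           # torch.Size([1, 16, 64, 64])
```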
based on handcrafted features, such as HAAR (Viola & Jones,
clarity of explanation, we take as example the simple least‐squares
2001), local binary patterns (LBPs; Ojala, Pietikäinen, & Harwood,
error function, which can be used to drive the MLE process when
1996), or histograms of oriented gradients (HoG; Dalal & Triggs,
training regression estimators:
2005). In comparison to these traditional handcrafted features,
CNNs are able m
The optimization problem in Equation (2) is typically solved with stochastic gradient descent (SGD) and the backpropagation algorithm for gradient estimation (Rumelhart et al., 1986). In practice, different variants of SGD are used, such as Adam (Kingma & Ba, 2015) or AdaGrad (Duchi, Hazan, & Singer, 2011).
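The least-squares training procedure of Equation (2) and the SGD update can be sketched as follows; the small regression network and the synthetic data are assumptions made for illustration only.

```python
# A hedged sketch of the least-squares objective from Equation (2), minimized
# with stochastic gradient descent; network shape and data are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 8)                      # x_i: data samples
y = torch.randn(256, 1)                      # y_i: corresponding labels

R = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # response R(.; theta)
opt = torch.optim.SGD(R.parameters(), lr=1e-2)   # Adam/AdaGrad are common variants

for epoch in range(100):
    opt.zero_grad()
    loss = ((R(x) - y) ** 2).sum()           # sum_i (R(x_i; theta) - y_i)^2
    loss.backward()                          # backpropagation estimates the gradient
    opt.step()                               # SGD parameter update
```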
3.2 | Recurrent neural networks

RNNs are especially suited for processing temporal sequence data. An RNN unfolded over time resembles a conventional neural network; the only difference is that the learned weights in each unfolded copy of the network are averaged, thus enabling the network to share the same weights over time. [FIGURE 3: a recurrent neural network folded and unfolded over time, also referred to as a sequence-to-sequence model; over time t, the input s⟨t−τᵢ,t⟩ and output z⟨t+1,t+τₒ⟩ sequences share the same weights]

The main challenge in using basic RNNs is the vanishing gradient encountered during training. The gradient signal can end up being multiplied a large number of times, as many as the number of timesteps. Hence, a traditional RNN is not suitable for capturing long-term dependencies in sequence data. If a network is very deep, or processes long sequences, the gradient of the network's output would have a hard time in propagating back to affect the weights of the earlier layers, meaning that the weights of the network will not be effectively updated, ending up with very small weight values.

Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) networks are nonlinear function approximators for estimating temporal dependencies in sequence data. As opposed to traditional RNNs, LSTMs solve the vanishing gradient problem by incorporating three gates, which control the input, output, and memory state.

Recurrent layers exploit temporal correlations of sequence data to learn time-dependent neural structures. Consider the memory state c⟨t−1⟩ and the output state h⟨t−1⟩ in an LSTM network, sampled at timestep t − 1, as well as the input data s⟨t⟩ at time t. The opening or closing of a gate is controlled by a sigmoid function σ(·) of the current input signal s⟨t⟩ and the output signal of the last time point h⟨t−1⟩:

Γ_i⟨t⟩ = σ(W_i s⟨t⟩ + U_i h⟨t−1⟩ + b_i),  i ∈ {u, f, o},
c⟨t⟩ = Γ_f⟨t⟩ ⊙ c⟨t−1⟩ + Γ_u⟨t⟩ ⊙ tanh(W_c s⟨t⟩ + U_c h⟨t−1⟩ + b_c),
h⟨t⟩ = Γ_o⟨t⟩ ⊙ tanh(c⟨t⟩),

where Γ_u⟨t⟩, Γ_f⟨t⟩, and Γ_o⟨t⟩ denote the update, forget, and output gates, W_i represents the weights of the network's gates and memory cell multiplied with the input state, U_i are the weights governing the activations, and b_i denotes the set of neuron bias values. ⊙ symbolizes elementwise multiplication.

In a supervised learning setup, given a set of training sequences D = [(s₁⟨t−τᵢ,t⟩, z₁⟨t+1,t+τₒ⟩), …, (s_q⟨t−τᵢ,t⟩, z_q⟨t+1,t+τₒ⟩)], that is, q independent pairs of observed sequences with assignments z⟨t,t+τₒ⟩, one can train the response of an LSTM network Q(·; θ) using MLE:

θ̂ = arg max_θ ℒ(θ; D) = arg min_θ Σ_{i=1}^{q} Σ_{t=1}^{τₒ} l_t(Q(s_i⟨t−τᵢ,t⟩; θ), z_i⟨t+1,t+τₒ⟩).  (8)
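A minimal sketch of the sequence-to-sequence training setup of Equation (8) is given below, assuming a PyTorch LSTM and a deliberately simplified decoder that rolls the last hidden state out over the output horizon; the dimensions and data are illustrative assumptions.

```python
# A hedged sketch of training a sequence model Q(.; theta) as in Equation (8):
# an LSTM maps an observed input sequence to an output sequence, trained by
# minimizing a per-timestep loss. Horizons, sizes, and data are assumptions.
import torch
import torch.nn as nn

tau_i, tau_o, q = 10, 5, 64                  # input/output horizons, no. of sequences
s = torch.randn(q, tau_i, 4)                 # s^<t-tau_i, t>: observed input sequences
z = torch.randn(q, tau_o, 2)                 # z^<t+1, t+tau_o>: target output sequences

class SeqModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, 2)
    def forward(self, s):
        _, (h, c) = self.lstm(s)             # c: memory state, h: output state
        # repeat the last hidden state for tau_o steps (simplified decoder assumption)
        return self.head(h[-1]).unsqueeze(1).repeat(1, tau_o, 1)

Q = SeqModel()
opt = torch.optim.Adam(Q.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(Q(s), z)   # l(Q(s_i; theta), z_i) over i and t
    loss.backward()
    opt.step()
```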
3.3 | Deep reinforcement learning

In reinforcement learning (RL) based autonomous driving, the task is to learn an optimal driving policy for navigating from a starting state s_start⟨t⟩ to a destination state s_dest⟨t+k⟩, given an observation I⟨t⟩ at time t and the system's state s⟨t⟩. I⟨t⟩ represents the observed environment, whereas k is the number of timesteps required for reaching the destination state s_dest⟨t+k⟩.

In reinforcement learning terminology, the above problem can be modeled as a partially observable Markov decision process (POMDP) M ≔ (I, S, A, T, R, γ), where I is the set of observations, S is the set of states, A is the set of actions, T is the stochastic transition function between states, R is the reward function providing immediate rewards, and γ is the discount factor. At each timestep, the agent takes an action a⟨t⟩, receives a reward R⟨t+1⟩, and transits to the next state s⟨t+1⟩ following a transition function T_{s⟨t⟩,a⟨t⟩}^{s⟨t+1⟩}.
Considering the proposed reward function and an arbitrary state trajectory [s⟨0⟩, s⟨1⟩, …, s⟨k⟩] in observation space, at any time t̂ ∈ [0, 1, …, k], the associated cumulative future discounted reward is defined as

R⟨t̂⟩ = Σ_{t=t̂}^{k} γ^⟨t−t̂⟩ r⟨t⟩,  (9)

where r⟨t⟩ denotes the immediate reward at time t. The objective is to find the optimal action-value function Q*, that is, the maximum expected return achievable over a sequence of actions [a⟨t⟩, …, a⟨t+k⟩]:

Q*(s, a) = max_π 𝔼[R⟨t̂⟩ ∣ s⟨t̂⟩ = s, a⟨t̂⟩ = a, π],  (10)

where π is an action policy, viewed as a probability density function over a set of possible actions that can take place in a given state. The optimal policy selects, in every state, the action with the highest value:

∀ s ∈ S: π*(s) = arg max_{a∈A} Q*(s, a).  (11)

The optimal action-value function Q* satisfies the Bellman optimality equation (Bellman, 1957), which is a recursive formulation of Equation (10):

Q*(s, a) = Σ_{s′} T_{s,a}^{s′} (R_{s,a}^{s′} + γ·max_{a′} Q*(s′, a′)) = 𝔼_{s′}[R_{s,a}^{s′} + γ·max_{a′} Q*(s′, a′)],  (12)

where s′ is a possible next state and a′ the corresponding action. Deep Q-networks (DQN; Mnih et al., 2015) approximate the optimal action-value function with a neural network:

Q(s⟨t⟩, a⟨t⟩; Θ) ≈ Q*(s⟨t⟩, a⟨t⟩),  (14)

where Θ represents the parameters of the DQN. By taking into account the Bellman optimality equation (12), it is possible to train a DQN in a reinforcement learning manner through the minimization of the mean squared error. The optimal expected Q value can be estimated within a training iteration i based on a set of reference parameters Θ̄ᵢ calculated in a previous iteration i′:

y = r + γ·max_{a′} Q(s′, a′; Θ̄ᵢ),  (16)

where r = R_{s,a}^{s′}. On the basis of (16), the MLE function from Equation (8) can be applied for calculating the weights of the DQN. The gradient is approximated with random samples and the backpropagation algorithm, which uses SGD for training:

∇_{Θᵢ} ℒ = 𝔼_{s,a,r,s′}[(y − Q(s, a; Θᵢ)) ∇_{Θᵢ} Q(s, a; Θᵢ)].  (17)
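The DQN training step of Equations (14) to (17) can be sketched as follows; the network sizes and the randomly generated transition batch are assumptions, and a complete implementation would add a replay buffer and a schedule for refreshing the reference parameters.

```python
# A compact sketch of the DQN update from Equations (14)-(17): a target value
# y = r + gamma * max_a' Q(s', a'; Theta_bar) drives a mean squared error loss
# on Q(s, a; Theta). Shapes and the synthetic batch are illustrative.
import torch
import torch.nn as nn

n_states, n_actions, gamma = 8, 4, 0.99
Q = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
Q_bar = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
Q_bar.load_state_dict(Q.state_dict())        # reference parameters Theta_bar_i

opt = torch.optim.SGD(Q.parameters(), lr=1e-3)

s = torch.randn(32, n_states)                # random sample batch (s, a, r, s')
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, n_states)

with torch.no_grad():                        # Equation (16): target value y
    y = r + gamma * Q_bar(s_next).max(dim=1).values

q_sa = Q(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; Theta_i)
loss = ((y - q_sa) ** 2).mean()              # mean squared error, Equation (17)
opt.zero_grad(); loss.backward(); opt.step()
```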
The DRL community has made several independent improvements to the original DQN algorithm (Mnih et al., 2015). A study on how to combine these improvements on DRL has been provided by DeepMind in Hessel et al. (2018), where the combined algorithm, entitled Rainbow, was able to outperform the independently competing methods. DeepMind (Hessel et al., 2018) proposes six extensions to the base DQN, each addressing a distinct concern: double Q-learning, prioritized experience replay, dueling network architectures, multistep returns, distributional value learning, and noisy exploration layers. Inverse reinforcement learning approaches have also been proposed (Wulfmeier, Wang, & Posner, 2016) to learn from human driving demonstrations without needing to explore unsafe actions.

4 | DEEP LEARNING FOR DRIVING SCENE PERCEPTION AND LOCALIZATION
4.2 | Driving scene understanding

An autonomous car should be able to detect traffic participants and drivable areas, particularly in urban areas where a wide variety of object appearances and occlusions may appear. Deep learning-based perception, in particular CNNs, became the de facto standard in object detection and recognition, obtaining remarkable results in competitions, such as the ImageNet Large-Scale Visual Recognition Challenge (Russakovsky et al., 2015).

Different neural network architectures are used to detect objects as 2D regions of interest (Dai, Li, He, & Sun, 2016; Girshick, 2015; Iandola et al., 2016; Law & Deng, 2018; Redmon, Divvala, Girshick, & Farhadi, 2016). Single-stage detectors perform detection in a single pass, whereas double-stage detectors split the task into region candidates proposals and bounding boxes classification. In general, single-stage detectors do not provide the same performances as double-stage detectors, but are significantly faster.

If in-vehicle computation resources are scarce, one can use detectors, such as SqueezeNet (Iandola et al., 2016) or the one of J. Li, Peng, and Chang (2018), which are optimized to run on embedded hardware. These detectors usually have a smaller neural network architecture, making it possible to detect objects using a reduced number of operations, at the cost of detection accuracy.

A comparison between the object detectors described above is given in Figure 4, based on the Pascal visual object classes (VOC) 2012 data set and their measured mean average precision (mAP) with an intersection over union (IoU) value equal to 50 and 75, respectively.

⁴https://www.theverge.com/transportation/2018/4/19/17204044/tesla-waymo-self-driving-car-data-simulation
FIGURE 4 Object detection and recognition performance comparison. The evaluation has been performed on the Pascal VOC 2012 benchmarking database. The first four methods on the right represent single-stage detectors, whereas the remaining six are double-stage detectors. Due to their increased complexity, the runtime performance in frames-per-second (FPS) is lower for the case of double-stage detectors. IoU, intersection over union; mAP, mean average precision; SSD, Single Shot multibox Detector; VOC, visual object classes [Color figure can be viewed at wileyonlinelibrary.com]
6.1 | Learning controllers

Traditional controllers make use of an a priori model composed of fixed parameters. When robots or other autonomous systems are used in complex environments, such as driving, traditional controllers cannot foresee every possible situation that the system has to cope with. Unlike controllers with fixed parameters, learning controllers make use of training information to learn their models over time. With every gathered batch of training data, the approximation of the true system model becomes more accurate, thus enabling model flexibility, consistent uncertainty estimates, and anticipation of repeatable effects and disturbances that cannot be modeled before deployment (Ostafew, Collier, Schoellig, & Barfoot, 2015). Consider the following nonlinear, state-space system:

z⟨t+1⟩ = f_true(z⟨t⟩, u⟨t⟩),  (18)

with observable state z⟨t⟩ ∈ ℝⁿ and control input u⟨t⟩ ∈ ℝᵐ, at discrete time t. The true system f_true is not known exactly and is approximated by the sum of an a priori model and a learned dynamics model:

z⟨t+1⟩ = f(z⟨t⟩, u⟨t⟩) + h(z⟨t⟩),  (19)

where f(·) is the a priori model and h(·) is the learned model.

In previous works, learning controllers have been introduced based on simple function approximators, such as Gaussian process (GP) modeling (Meier, Hennig, & Schaal, 2014; Nguyen-Tuong, Peters, & Seeger, 2008; Ostafew et al., 2015; Ostafew, Schoellig, & Barfoot, 2016) or support vector regression (Sigaud, Salaün, & Padois, 2011).
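A minimal sketch of the decomposition in Equation (19) is given below, assuming a simple linear a priori model and a small neural network as the learned component h; the synthetic training triplets stand in for logged vehicle data.

```python
# A hedged sketch of Equation (19): the next state is predicted by a fixed
# a priori model f plus a learned correction h fitted on past observations.
import torch
import torch.nn as nn

def f_apriori(z, u):
    """Simplified a priori model: linear dynamics with nominal parameters."""
    return 0.9 * z + 0.1 * u

h = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 2))  # learned model h(z)
opt = torch.optim.Adam(h.parameters(), lr=1e-2)

# Past states/inputs (z, u, z_next) would come from driving logs; synthetic here.
z = torch.randn(128, 2); u = torch.randn(128, 2); z_next = torch.randn(128, 2)

for step in range(200):
    opt.zero_grad()
    pred = f_apriori(z, u) + h(z)            # z<t+1> = f(z<t>, u<t>) + h(z<t>)
    loss = nn.functional.mse_loss(pred, z_next)
    loss.backward()
    opt.step()
```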
Learning techniques are commonly used to learn a dynamics model which in turn improves an a priori system model in iterative learning control (ILC; Kapania & Gerdes, 2015; Ostafew, Schoellig, & Barfoot, 2013; Panomruttanarug, 2017; Z. Yang, Zhou, Li, & Wang, 2017b) and model predictive control (MPC; Drews, Williams, Goldfain, Theodorou, & Rehg, 2017; Lefevre, Carvalho, & Borrelli, 2015; Lefèvre, Carvalho, & Borrelli, 2016; Ostafew et al., 2015, 2016; Pan et al., 2018, 2017; Rosolia, Carvalho, & Borrelli, 2017).

ILC is a method for controlling systems which work in a repetitive mode, such as path tracking in self-driving cars. It has been successfully applied to navigation in off-road terrain (Ostafew et al., 2013), autonomous car parking (Panomruttanarug, 2017), and modeling of steering dynamics in an autonomous race car (Kapania & Gerdes, 2015). Multiple benefits are highlighted in these works.

MPC computes control actions based on the current observations and the system's dynamics given by a process model. A general review of MPC techniques for autonomous robots is given in Kamel, Hafez, and Yu (2018).

Learning has been used in conjunction with MPC to learn driving models (Lefevre et al., 2015; Lefèvre et al., 2016), driving dynamics for race cars operating at their handling limits (Drews et al., 2017; Rosolia et al., 2017), as well as to improve path tracking accuracy (Brunner, Rosolia, Gonzales, & Borrelli, 2017; Ostafew et al., 2015, 2016). These methods use learning mechanisms to identify nonlinear dynamics that are used in the MPC's trajectory cost function optimization. This enables one to better predict disturbances and the behavior of the vehicle, leading to optimal comfort and safety constraints applied to the control inputs. Training data are usually in the form of past vehicle states and observations. For example, CNNs can be used to compute a dense occupancy grid (OG) map in a local robot-centric coordinate system. The grid map is further passed to the MPC's cost function for optimizing the trajectory of the vehicle over a finite prediction horizon.

A major advantage of learning controllers is that they optimally combine traditional model-based control theory with learning algorithms. This makes it possible to still use established methodologies for controller design and stability analysis, together with a robust learning component applied at system identification and prediction levels.

6.2 | End2End learning control

In the context of autonomous driving, End2End learning control is defined as a direct mapping from sensory data to control commands. The inputs are usually from a high-dimensional features space (e.g., images or point clouds). As illustrated in Figure 1b, this is opposed to traditional processing pipelines, where at first objects are detected in the input image, after which a path is planned and finally the computed control values are executed. A summary of some of the most popular End2End learning systems is given in Table 1.
End2End learning can also be formulated as a backpropagation algorithm scaled up to complex models. The paradigm was first introduced in the 1990s, when the autonomous land vehicle in a neural network (ALVINN) system was built (Pomerleau, 1989). ALVINN was designed to follow a predefined road, steering according to the observed road's curvature. The next milestone in End2End driving is considered to be in the mid-2000s, when the DARPA Autonomous VEhicle (DAVE) managed to drive through an obstacle-filled road, after it has been trained on hours of human driving acquired in similar, but not identical, driving scenarios (Muller et al., 2006). Over the last couple of years, the technological advances in computing hardware have facilitated the usage of End2End learning models. The backpropagation algorithm for gradient estimation in deep networks is now efficiently implemented on parallel graphic processing units (GPUs). This kind of processing allows the training of large and complex network architectures.

[TABLE 1: Summary of End2End learning methods, with columns Name, Problem space, and Description. Abbreviations: C-LSTM, convolutional long short-term memory; CNN, convolutional neural network; DARPA, Defense Advanced Research Projects Agency; DAVE, DARPA Autonomous VEhicle; DQN, deep Q-network; DRL, deep reinforcement learning; E2E, End2End; FCN, fully convolutional network; LSTM, long short-term memory; MPC, model predictive control; R-FCN, region-based fully convolutional network; RNN, recurrent neural network]

Driving performance in End2End systems is often measured as the autonomy of the network, where each human intervention penalizes 6 s of the elapsed driving time:

Autonomy = (1 − (no. of interventions) · 6 s / elapsed time (s)) · 100.  (20)

Related work has focused on determining which elements in the input traffic image have the most influence on the network's steering output; the features learned by PilotNet are similar to the ones that are relevant to a human driver. The introduced FCN–LSTM method is designed to jointly train pixel-level supervised tasks using a fully convolutional encoder, together with motion prediction through a temporal encoder, combining visual and temporal dependencies of the input data.
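Equation (20) translates directly into code; the intervention count and elapsed time below are made-up inputs.

```python
# A direct transcription of the autonomy metric in Equation (20): each human
# intervention is penalized with 6 s of the elapsed driving time.
def autonomy(num_interventions: int, elapsed_time_s: float) -> float:
    """Percentage of time the network drives without human intervention."""
    return (1.0 - (num_interventions * 6.0) / elapsed_time_s) * 100.0

print(autonomy(num_interventions=10, elapsed_time_s=600.0))  # 90.0
```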
The ISO 26262 standard provides requirements for assuring safety, but does not address the unique characteristics of deep learning-based software. Salay, Queiroz, and Czarnecki (2017) address this gap by analyzing the places where machine learning can impact the standard and provide recommendations on how to accommodate this impact. These recommendations are focused towards the direction of identifying the hazards, implementing tools and mechanisms for fault and failure situations, but also ensuring complete training data sets and designing a multilevel architecture. The usage of specific techniques for various stages within the software development life-cycle is desired. The ISO 26262 standard also recommends the use of a Hazard Analysis and Risk Assessment (HARA) method to identify hazardous events.

Statistical learning defines the risk of a function f as the expected value of the loss of f under P:

R(f) = ∫ L(x, f(x), y) dP(x, y),  (21)

where X × Y is a random example space of observations x and labels y, distributed according to a probability distribution P(X, Y). The statistical learning problem consists of finding the function f that optimizes (i.e., minimizes) the risk R (Jose, 2018). For an algorithm's hypothesis h and loss function L, the expected loss on the training set is called the empirical risk of h:

R_emp(h) = (1/m) Σ_{i=1}^{m} L(x_i, h(x_i), y_i).  (22)
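The empirical risk of Equation (22) can be computed with a few lines of code; the hypothesis, loss, and data below are illustrative assumptions.

```python
# A minimal sketch of the empirical risk from Equation (22): the average loss
# of a hypothesis h over the m training examples.
def empirical_risk(h, L, samples):
    """Mean of L(x, h(x), y) over the training set."""
    return sum(L(x, h(x), y) for x, y in samples) / len(samples)

h = lambda x: 2.0 * x                          # hypothesis h
L = lambda x, fx, y: (fx - y) ** 2             # squared loss
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]    # (x_i, y_i) pairs
print(empirical_risk(h, L, data))              # 0.02
```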
The training set plays a key role in the safety of the deep learning component. The ISO 26262 standard states that the component behavior shall be fully specified and each refinement shall be verified with respect to its specification. This assumption is violated in the case of a deep learning system, where a training set is used instead of a specification. It is not clear how to ensure that the corresponding hazards are always mitigated. The training process is not a verification process, since the trained model will be correct by construction with respect to the training set, up to the limits of the model and the learning algorithm (Salay et al., 2017). Effects of these considerations are visible in the commercial AV market, where Black Swan events caused by data not present in the training set may lead to fatalities (McPherson, 2018).

This led to the development of specific and focused tools and techniques to help finding faults (McPherson, 2018). Chakarov, Nori, Rajamani, Sen, and Vijaykeerthy (2018) describe a technique for debugging misclassifications due to bad training data, whereas an approach for troubleshooting faults due to complex interactions between linked machine learning components is proposed in Nushi, Kamar, Horvitz, and Kossmann (2017). In Takanami, Sato, and Yang (2000), a white-box technique is used to inject faults onto a neural network by breaking the links or randomly changing the weights.

Detailed requirements shall be formulated and traced to hazards. Such a requirement can specify how the training, validation, and testing sets are obtained. Subsequently, the data gathered can be verified with respect to this specification. Furthermore, some specifications, for example, the fact that a vehicle cannot be wider than 3 m, can be used to reject false positive detections. Such properties are used even directly during the training process to improve the accuracy of the model (Katz, Barrett, Dill, Julian, & Kochenderfer, 2017).

Machine learning and deep learning techniques are starting to become effective and reliable even for safety-critical systems, even if the complete safety assurance for this type of systems is still an open question. Current standards and regulation from the automotive industry cannot be fully mapped to such systems, requiring the development of new safety standards targeted for deep learning.

8 | DATA SOURCES FOR TRAINING AUTONOMOUS DRIVING SYSTEMS

Undeniably, the usage of real-world data is a key requirement for training and testing an autonomous driving component. The high amount of data needed in the development stage of such components made data collection on public roads a valuable activity. To obtain a comprehensive description of the driving scene, the vehicle used for data collection is equipped with a variety of sensors, such as radar, LiDAR, GPS, cameras, inertial measurement units (IMU), and ultrasonic sensors. The sensor setup differs from vehicle to vehicle, depending on how the data are planned to be used. A common sensor setup for an AV is presented in Figure 7.

In the last years, mainly due to the large and increasing research interest in AVs, many driving data sets were made public and documented. They vary in size, sensor setup, and data format. Researchers need only to identify the proper data set which best fits their problem space. Janai et al. (2017) published a survey on a broad spectrum of data sets; these address the computer vision field in general, but only a few of them fit the autonomous driving topic.

The most comprehensive survey on publicly available data sets for self-driving vehicle algorithms can be found in Yin and Berger (2017). The paper presents 27 available data sets containing data recorded on public roads. The data sets are compared from different perspectives, such that the reader can select the one best suited for his task.

Despite our extensive search, we are yet to find a master data set that combines at least parts of the ones available. The reason may be
that there are no standard requirements for the data format and sensor setup. Each data set heavily depends on the objective of the algorithm for which the data were collected. Recently, the companies Scale® and nuTonomy® started to create one of the largest and most detailed self-driving data sets on the market to date.⁶ This includes Berkeley DeepDrive (F. Yu et al., 2018), a data set developed by researchers at Berkeley University. More relevant data sets from the literature are pending for merging.⁷

In Fridman et al. (2017), the authors present a study that seeks to collect and analyze large-scale naturalistic data of semi-autonomous driving to better characterize the state of the art of the current technology. The study involved 99 participants, 29 vehicles, 405,807 miles, and approximately 5.5 billion video frames. Unfortunately, the data collected in this study are not available for the public.

In the remainder of this section, we highlight the distinctive characteristics of the most relevant data sets that are publicly available (Table 2).

KITTI Vision Benchmark data set (KITTI; Geiger et al., 2013): Provided by the Karlsruhe Institute of Technology (KIT) from Germany, this data set fits the challenges of benchmarking stereo-vision, optical flow, 3D tracking, 3D object detection, or SLAM algorithms. It is known as the most prestigious data set in the self-driving vehicles domain. To this date it counts more than 2,000 citations in the literature. The data collection vehicle is equipped with multiple high-resolution color and gray-scale stereo cameras, a Velodyne 3D LiDAR and high-precision GPS/IMU sensors. In total, it provides 6 hr of driving data collected in both rural and highway traffic scenarios around Karlsruhe. The data set is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

NuScenes data set (Caesar et al., 2019): Constructed by nuTonomy, this data set contains 1,000 driving scenes collected from Boston and Singapore, two cities known for their dense traffic and highly challenging driving situations. To facilitate common computer vision tasks, such as object detection and tracking, the providers annotated 25 object classes with accurate 3D bounding boxes at 2 Hz over the entire data set. Collection of vehicle data is still in progress. The final data set will include approximately 1.4 million camera images, 400,000 LiDAR sweeps, 1.3 million RADAR sweeps, and 1.1 million object bounding boxes in 40,000 keyframes. The data set is provided under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Automotive multisensor data set (AMUSE; Koschorrek et al., 2013): Provided by Linköping University of Sweden, it consists of sequences recorded in various environments from a car equipped with an omnidirectional multicamera, height sensors, an IMU, a velocity sensor, and a GPS. The application programming interface (API) for reading these data sets is provided to the public, together with a collection of long multisensor and multicamera data streams stored in the given format. The data set is provided under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Ford campus vision and LiDAR data set (Ford; Pandey et al., 2011): Provided by the University of Michigan, this data set was collected using a Ford F250 pickup truck equipped with professional (Applanix POS-LV) and consumer (Xsens MTi-G) IMUs, a Velodyne LiDAR scanner, two push-broom forward looking Riegl LiDARs, and a Point Grey Ladybug3 omnidirectional camera system. The approximately 100 GB of data was recorded around the Ford Research campus and downtown Dearborn, MI in 2009. The data set is well suited to test various autonomous driving and SLAM algorithms.

Udacity data set (Udacity, 2018): The vehicle sensor setup contains monocular color cameras, GPS and IMU sensors, as well as a Velodyne 3D LiDAR. The size of the data set is 223 GB. The data are labeled and the user is provided with the corresponding steering angle that was recorded during the test runs by the human driver.

Cityscapes data set (Cityscapes, 2018): Provided by Daimler AG R&D, Germany; the Max Planck Institute for Informatics (MPI-IS), Germany; and the TU Darmstadt Visual Inference Group, Germany, the Cityscapes data set focuses on semantic understanding of urban street scenes, this being the reason for which it contains only stereo-vision color images. The diversity of the images is very large: 50 cities, different seasons (spring, summer, and fall), various weather conditions, and different scene dynamics. There are 5,000 images with fine annotations and 20,000 images with coarse annotations. Two important challenges have used this data set for benchmarking the development of algorithms for semantic segmentation (H. Zhao, Shi, Qi, Wang, & Jia, 2017) and instance segmentation (S. Liu, Jia, Fidler, & Urtasun, 2017).

The Oxford data set (Maddern et al., 2017): Provided by Oxford University, UK, the data set collection spanned over 1 year, resulting in over 1,000 km of recorded driving with almost 20 million images collected from six cameras mounted to the vehicle, along with LiDAR, GPS, and INS ground truth. Data were collected in all weather conditions, including heavy rain, night, direct sunlight, and snow. One of the particularities of this data set is that the vehicle frequently drove the same route over the period of a year, to enable researchers to investigate long-term localization and mapping for AVs in real-world, dynamic urban environments.

The Cambridge-driving Labeled Video data set (CamVid; Brostow et al., 2009): Provided by the University of Cambridge, UK, it is one of the most cited data sets from the literature and the first released publicly, containing a collection of videos with object class semantic labels, along with metadata annotations. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. The sensor setup is based on only one monocular camera mounted on the dashboard of the vehicle. The complexity of the scenes is quite low, the vehicle being driven only in urban areas with relatively low traffic and good weather conditions.

The Daimler pedestrian benchmark data set (Flohr & Gavrila, 2013): Provided by Daimler AG R&D and the University of Amsterdam, this data set fits the topics of pedestrian detection, classification,

⁶https://venturebeat.com/2018/09/14/scale-and-nutonomy-release-nuscenes-a-self-driving-dataset-with-over-1-4-million-images/
⁷https://scale.com/open-datasets
segmentation, and path prediction. Pedestrian data are observed from a traffic vehicle by using only on-board mono and stereo cameras. It is the first data set which contains pedestrians. Recently, the data set was extended with cyclist video samples captured with the same setup (X. Li et al., 2016).

Caltech pedestrian detection data set (Caltech; Dollar et al., 2009): Provided by the California Institute of Technology, US, the data set contains richly annotated videos, recorded from a moving vehicle, with challenging images of low resolution and frequently occluded people. There are approximately 10 hr of driving scenarios cumulating about 250,000 frames, with a total of 350,000 bounding boxes and 2,300 unique pedestrian annotations. The annotations include both temporal correspondences between bounding boxes and detailed occlusion labels.

Given the variety and complexity of the available databases, choosing one or more to develop and test an autonomous driving component may be difficult. As can be observed, the sensor setup varies among all the available databases. For localization and vehicle motion, the LiDAR and GPS/IMU sensors are necessary, with the most popular LiDAR sensors used being Velodyne (Velodyne, 2018) and Sick (Sick, 2018). Data recorded from a radar sensor is present only in the NuScenes data set. The radar manufacturers adopt proprietary data formats which are not public. Almost all available data sets include images captured from a video camera, while there is a balanced use of monocular and stereo cameras, mainly configured to capture gray-scale images. AMUSE and Ford are the only databases that use omnidirectional cameras.

Besides raw recorded data, the data sets usually contain miscellaneous files, such as annotations, calibration files, labels, and so forth. To cope with these files, the data set provider must offer tools and software that enable the user to read and postprocess the data. Splitting of the data sets is also an important factor to consider, because some of the data sets (e.g., Caltech, Daimler, and Cityscapes) already provide preprocessed data that is classified in different sets: training, testing, and validation. This enables benchmarking of desired algorithms against similar approaches to be consistent.

Another aspect to consider is the license type. The most commonly used license is Creative Commons Attribution-NonCommercial-ShareAlike 3.0. It allows the user to copy and redistribute the material in any medium or format, and also to remix, transform, and build upon it. The KITTI and NuScenes databases are examples of such a distribution license. The Oxford database uses a Creative Commons Attribution-NonCommercial 4.0 license which, compared with the first license type, does not force the user to distribute his contributions under the same license as the database. Opposite to that, the AMUSE database is licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0, which makes the database illegal to distribute if modifications of the material are made.

With very few exceptions, the data sets are collected from a single city, which is usually around university campuses or company locations in Europe, the US, or Asia. Germany is the most active country for driving recording vehicles. Unfortunately, all available data sets together cover a very small portion of the world map. One reason for this is the memory size of the data, which is in direct relation with the sensor setup and the quality. For example, the Ford data set takes around 30 GB for each driven kilometer, which means that covering an entire city would take hundreds of terabytes of driving data. The majority of the available data sets consider sunny, daylight, and urban conditions, these being ideal operating conditions for autonomous driving systems.

9 | COMPUTATIONAL HARDWARE AND DEPLOYMENT

Deploying deep learning algorithms on target edge devices is not a trivial task. The main limitations when it comes to vehicles are the price, performance issues, and power consumption. Therefore, embedded platforms are becoming essential for the integration of AI algorithms inside vehicles due to their portability, versatility, and energy efficiency.

The market leader in providing hardware solutions for deploying deep learning algorithms inside autonomous cars is NVIDIA®. DRIVE PX (NVIDIA) is an AI car computer which was designed to enable the automakers to focus directly on the software for AVs.

The newest version of the DRIVE PX architecture is based on two Tegra X2 (NVIDIA) systems on a chip (SoCs). Each SoC contains two Denver (NVIDIA) cores, four ARM A57 cores, and a GPU from the Pascal (NVIDIA) generation. NVIDIA® DRIVE PX is capable of performing real-time
environment perception, path planning, and localization. It combines deep learning, sensor fusion, and surround vision to improve the driving experience.

Introduced in September 2018, the NVIDIA® DRIVE AGX developer kit platform was presented as the world's most advanced self-driving car platform (NVIDIA), being based on the Volta technology (NVIDIA). It is available in two different configurations, namely DRIVE AGX Xavier and DRIVE AGX Pegasus. DRIVE AGX Xavier is a scalable open platform that can serve as an AI brain for self-driving vehicles, and is an energy-efficient computing platform, with 30 trillion operations per second, while meeting automotive standards like the ISO 26262 functional safety specification. NVIDIA® DRIVE AGX Pegasus improves the performance with an architecture which is built on two NVIDIA® Xavier processors and two state-of-the-art TensorCore GPUs.

A hardware platform used by the car makers for ADASs is the R-Car V3H SoC platform from Renesas Autonomy (Renesas). This SoC provides the possibility to implement high-performance computer vision with low power consumption. R-Car V3H is optimized for applications that involve the usage of stereo cameras, containing dedicated hardware for CNNs, dense optical flow, stereo-vision, and object classification. The hardware features four 1.0 GHz Arm Cortex-A53 MPCore cores, which makes R-Car V3H a suitable platform to deploy trained inference engines for solving specific deep learning tasks inside the automotive domain. Renesas also provides a similar SoC, called R-Car H3 (Renesas), which delivers improved computing capabilities and compliance with functional safety standards. Equipped with new CPU cores (Arm Cortex-A57), it can be used as an embedded platform for deploying various deep learning algorithms, compared with R-Car V3H, which is optimized only for CNNs.

A field-programmable gate array (FPGA) is another viable solution, showing great improvements in both performance and power consumption in deep learning applications. The suitability of FPGAs for running deep learning algorithms can be analyzed from four major perspectives: efficiency and power, raw computing power, flexibility, and functional safety. Our study is based on the research published by Intel (Nurvitadhi et al., 2017), Microsoft (Ovtcharov et al., 2015), and UCLA (Cong et al., 2018).

By reducing the latency in deep learning applications, FPGAs provide additional raw computing power. The memory bottlenecks associated with external memory accesses are reduced or even eliminated by the high amount of on-chip cache memory. In addition, FPGAs have the advantage of supporting a full range of data types, together with custom user-defined types.

FPGAs are optimized when it comes to efficiency and power consumption. The studies presented by manufacturers, like Microsoft and Xilinx, show that GPUs can consume up to 10 times more power than FPGAs when processing algorithms with the same computation complexity, demonstrating that FPGAs can be a much more suitable solution for deep learning applications in the automotive field.

In terms of flexibility, FPGAs are built with multiple architectures, which are a mix of hardware programmable resources, digital signal processors, and Processor Block RAM (BRAM) components. This architectural flexibility is suitable for deep and sparse neural networks, which are the state of the art for current machine learning applications. Another advantage is the possibility of connecting to various input and output peripheral devices, like sensors, network elements, and storage devices.

In the automotive field, functional safety is one of the most important challenges. FPGAs have been designed to meet the safety requirements of a wide range of applications, including ADAS. When compared with GPUs, which were originally built for graphics and high-performance computing systems, where functional safety is not necessary, FPGAs provide a significant advantage in developing driver assistance systems.

10 | DISCUSSION AND CONCLUSIONS

We have identified seven major areas that form open challenges in the field of autonomous driving. We believe that deep learning and AI will play a key role in overcoming these challenges:

Perception: In order for an autonomous car to safely navigate the driving scene, it must be able to understand its surroundings. Deep learning is the main technology behind a large number of perception systems. Although great progress has been reported with respect to accuracy in object detection and recognition (Z.-Q. Zhao, Zheng, Xu, & Wu, 2018), current systems are mainly designed to calculate 2D or 3D
bounding boxes for a couple of trained object classes, or to provide a segmented image of the driving environment. Future methods for perception should focus on increasing the levels of recognized details, making it possible to perceive and track more objects in real-time. Furthermore, additional work is required for bridging the gap between image- and LiDAR-based 3D perception (Wang et al., 2019), enabling the computer vision community to close the current debate on camera versus LiDAR as main perception sensors.

Short- to middle-term reasoning: In addition to a robust and accurate perception system, an AV should be able to reason about its driving behavior over a short (milliseconds) to middle (seconds to minutes) time horizon (Pendleton et al., 2017). AI and deep learning are promising tools that can be used for the high- and low-level path planning required for navigating the myriad of driving scenarios. Currently, the largest portion of papers in deep learning for self-driving cars is focused mainly on perception and End2End learning (Shalev-Shwartz et al., 2016; T. Zhang et al., 2016). Over the next period, we expect deep learning to play a significant role in the area of local trajectory estimation and planning. We consider long-term reasoning as solved, as provided by navigation systems. These are standard methods for selecting a route through the road network, from the car's current position to destination (Pendleton et al., 2017).

Availability of training data: "Data is the new oil" became lately one of the most popular quotes in the automotive industry. The effectiveness of deep learning systems is directly tied to the availability of training data. As a rule of thumb, current deep learning methods are also evaluated based on the quality of training data (Janai et al., 2017). The better the quality of the data, the higher the accuracy of the algorithm. The daily data recorded by an AV are on the order of petabytes. This poses challenges on the parallelization of the training procedure, as well as on the storage infrastructure. Simulation environments have been used in the last couple of years for bridging the gap between scarce data and deep learning's hunger for training examples. There is still a gap to be filled between the accuracy of a simulated world and real-world driving.

Learning corner cases: Most driving scenarios are considered solvable with classical methodologies. However, the remaining unsolved scenarios are corner cases which, until now, required the reasoning and intelligence of a human driver. To overcome corner cases, the generalization power of deep learning algorithms should be increased. Generalization in deep learning is of special importance in learning hazardous situations that can lead to accidents, especially due to the fact that training data for such corner cases are scarce. This also implies the design of one-shot and low-shot learning methods that can be trained on a reduced number of training examples.

Learning-based control methods: Classical controllers make use of an a priori model composed of fixed parameters. In a complex case, such as autonomous driving, these controllers cannot anticipate all driving situations. The ability of deep learning components to adapt based on past experiences can also be used to learn the parameters of the car's control system, thus better approximating the underlying true system model (Ostafew, 2016; Ostafew et al., 2016).

Functional safety: The usage of deep learning in safety-critical systems is still an open debate, efforts being made to bring the computational intelligence and functional safety communities closer to each other. Current safety standards, such as the ISO 26262, do not accommodate machine learning software (Salay et al., 2017). Although new data-driven design methodologies have been proposed, there are still open issues on the explainability, stability, or classification robustness of deep neural networks.

Real-time computing and communication: Finally, real-time requirements have to be fulfilled for processing the large amounts of data gathered from the car's sensors suite, as well as for updating the parameters of deep learning systems over high-speed communication lines (Nurvitadhi et al., 2017). These real-time constraints can be backed up by advances in semiconductor chips dedicated for self-driving cars, as well as by the rise of 5G communication networks.

10.1 | Final notes

AV technology has seen rapid progress in the past decade, especially due to advances in the area of AI and deep learning. Current AI methodologies are nowadays either used or taken into consideration when designing different components for a self-driving car. Deep learning approaches have influenced not only the design of traditional perception-planning-action pipelines, but have also enabled End2End learning systems, able to directly map sensory information to steering commands.

Driverless cars are complex systems which have to safely drive passengers or cargo from a starting location to a destination. Several challenges are encountered with the advent of AI-based AVs deployment on public roads. A major challenge is the difficulty in proving the functional safety of these vehicles, given the current formalism and explainability of neural networks. On top of this, deep learning systems rely on large training databases and require extensive computational hardware.

This paper has provided a survey on deep learning technologies used in autonomous driving. The survey of performance and computational requirements serves as a reference for system-level design of AI-based self-driving vehicles.

ACKNOWLEDGMENTS

The authors would like to thank Elektrobit Automotive for the infrastructure and research support.

ORCID

Sorin Grigorescu http://orcid.org/0000-0003-4763-5540
Bogdan Trasnea http://orcid.org/0000-0001-6169-1181
Tiberiu Cocias
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint 1804.02767.
Rehder, E., Quehl, J., & Stiller, C. (2017). Driving like a human: Imitation learning for path planning using convolutional neural networks. In International Conference on Robotics and Automation Workshops.
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 1137–1149.
Renesas. R-Car H3. Retrieved from https://www.renesas.com/sg/en/solutions/automotive/soc/r-car-h3.html/
Renesas. R-Car V3H. Retrieved from https://www.renesas.com/eu/en/solutions/automotive/soc/r-car-v3h.html/
Rosolia, U., Carvalho, A., & Borrelli, F. (2017). Autonomous racing using learning model predictive control. In 2017 American Control Conference (ACC) (pp. 5115–5120).
Rumelhart, D. E., McClelland, J. L., & PDP Research Group, C. (Eds.) (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations. Cambridge, MA: The MIT Press.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.
SAE Committee. (2014). Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems. https://www.sae.org/standards/content/j3016_201806/
Salay, R., Queiroz, R., & Czarnecki, K. (2017). An analysis of ISO 26262: Machine learning and safety in automotive software (SAE Technical Paper). https://www.sae.org/publications/technical-papers/content/2018-01-1075/
Sallab, A. E., Abdou, M., Perot, E., & Yogamani, S. (2017a). Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19), 70–76.
Sallab, A. E., Abdou, M., Perot, E., & Yogamani, S. (2017b). Deep reinforcement learning framework for autonomous driving. CoRR.
Sarlin, P., Debraine, F., Dymczyk, M., Siegwart, R., & Cadena, C. (2018). Leveraging deep visual descriptors for hierarchical efficient localization. In Proceedings of the 2nd Conference on Robot Learning (CoRL).
Schwarting, W., Alonso-Mora, J., & Rus, D. (2018). Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems, 1, 187–210.
Seeger, C., Müller, A., & Schwarz, L. (2016). Towards road type classification with occupancy grids. In Intelligent Vehicles Symposium—Workshop: DeepDriving—Learning Representations for Intelligent Vehicles. Gothenburg, Sweden: IEEE.
Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint, 1610.03295.
Shin, K., Kwon, Y. P., & Tomizuka, M. (2018). RoarNet: A robust 3D object detection based on region approximation refinement. 2019 IEEE Intelligent Vehicles Symposium (IV), 2510–2515.
Sick. (2018). Sick LiDAR for data collection. Retrieved from https://www.sick.com/
Sigaud, O., Salaün, C., & Padois, V. (2011). On-line regression algorithms for learning mechanical models of robots: A survey. Robotics and Autonomous Systems, 59(12), 1115–1129.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations 2015.
Sun, L., Peng, C., Zhan, W., & Tomizuka, M. (2018). A fast integrated planning and control framework for autonomous driving via imitation learning. In ASME 2018 Dynamic Systems and Control Conference (Vol. 3).
Sutton, R., & Barto, A. (1998). Introduction to reinforcement learning. Cambridge, MA: The MIT Press.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Takanami, I., Sato, M., & Yang, Y. P. (2000). A fault-value injection approach for multiple-weight-fault tolerance of MNNs. In Proceedings of the IEEE-INNS-ENNS (Vol. 3, pp. 515–520).
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic robotics (Intelligent robotics and autonomous agents). Cambridge, MA: The MIT Press.
Tinchev, G., Penate-Sanchez, A., & Fallon, M. (2019). Learning to see the wood for the trees: Deep laser localization in urban and natural environments on a CPU. IEEE Robotics and Automation Letters, 4(2), 1327–1334.
Treml, M., Arjona-Medina, J. A., Unterthiner, T., Durgesh, R., Friedmann, F., Schuberth, P., … Hochreiter, S. (2016). Speeding up semantic segmentation for autonomous driving. Whitepaper.
Udacity. (2018). Udacity data collection. Retrieved from http://academictorrents.com/collection/self-driving-cars
Ushani, A. K., & Eustice, R. M. (2018). Feature learning for scene flow estimation from LiDAR. In Proceedings of the 2nd Conference on Robot Learning (CoRL) (Vol. 87, pp. 283–292).
Valada, A., Vertens, J., Dhall, A., & Burgard, W. (2017). AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In 2017 IEEE International Conference on Robotics and Automation (ICRA) (pp. 4644–4651).
Varshney, K. R. (2016). Engineering safety in machine learning. In 2016 Information Theory and Applications Workshop (ITA) (pp. 1–5).
Velodyne. (2018). Velodyne LiDAR for data collection. https://velodynelidar.com/
Viola, P. A., & Jones, M. J. (2001). Rapid object detection using a boosted cascade of simple features. In 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), December 8–14, 2001 (pp. 511–518). Kauai, HI.
Walch, F., Hazirbas, C., Leal-Taixé, L., Sattler, T., Hilsenbeck, S., & Cremers, D. (2017). Image-based localization using LSTMs for structured feature correlation. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 627–637).
Wang, Y., Chao, W.-L., Garg, D., Hariharan, B., Campbell, M., & Weinberger, K. (2019). Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019).
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3), 279–292.
Wulfmeier, M., Wang, D. Z., & Posner, I. (2016). Watch this: Scalable cost-function learning for path planning in urban environments. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). abs/1607.02329.
Xu, H., Gao, Y., Yu, F., & Darrell, T. (2017). End-to-end learning of driving models from large-scale video datasets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Yang, S., Wang, W., Liu, C., Deng, K., & Hedrick, J. K. (2017a). Feature analysis and selection for training an end-to-end autonomous vehicle controller using the deep learning approach. In 2017 IEEE Intelligent Vehicles Symposium (Vol. 1).
Yang, Z., Zhou, F., Li, Y., & Wang, Y. (2017b). A novel iterative learning path-tracking control for nonholonomic mobile robots against initial shifts. International Journal of Advanced Robotic Systems, 14, 172988141771063.
Yin, H., & Berger, C. (2017). When to use what data set for your self-driving car algorithm: An overview of publicly available driving datasets. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC) (pp. 1–8).
Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., & Darrell, T. (2018). BDD100K: A diverse driving video database with scalable annotation tooling. arXiv preprint, 1805.04687.
Yu, L., Shao, X., Wei, Y., & Zhou, K. (2018). Intelligent land-vehicle model transfer trajectory planning method based on deep reinforcement learning. Sensors (Basel, Switzerland), 18, 1–22.
Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2017). Single-shot refinement neural network for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhang, T., Kahn, G., Levine, S., & Abbeel, P. (2016). Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In 2016 IEEE International Conference on Robotics and Automation (ICRA).
Zhao, H., Qi, X., Shen, X., Shi, J., & Jia, J. (2018). ICNet for real-time semantic segmentation on high-resolution images. In European Conference on Computer Vision (pp. 418–434).
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6230–6239).
Zhao, Z.-Q., Zheng, P., Xu, S. T., & Wu, X. (2018). Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232.
Zhou, Y., & Tuzel, O. (2018). VoxelNet: End-to-end learning for point cloud based 3D object detection. In IEEE Conference on Computer Vision and Pattern Recognition 2018 (pp. 4490–4499).
Zhu, H., Yuen, K.-V., Mihaylova, L. S., & Leung, H. (2017). Overview of environment perception for intelligent vehicles. IEEE Transactions on Intelligent Transportation Systems, 18, 2584–2601.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section.

How to cite this article: Grigorescu S, Trasnea B, Cocias T, Macesanu G. A survey of deep learning techniques for autonomous driving. J Field Robotics. 2019;1–25. https://doi.org/10.1002/rob.21918