Humanity has been waiting for self-driving cars for several decades. Thanks to
the extremely fast evolution of technology, this idea recently went from
“possible” to “commercially available in a Tesla”.
Deep learning is one of the main technologies that enabled self-driving. It’s a
versatile tool that can help solve almost any type of science or engineering
problem: it’s used in physics, for example, to analyze proton-proton collisions
in the Large Hadron Collider, just as well as in Google Lens to classify pictures.
Along the way, we’ll see how Tesla, Waymo, and Nvidia use CNN algorithms to
make their cars driverless or autonomous.
The first self-driving car was created in 1989: the Autonomous Land Vehicle In a
Neural Network (ALVINN). It used neural networks to detect lines, segment the
environment, navigate itself, and drive. It worked well, but it was limited by
slow processing power and insufficient data.
A self-driving system is typically broken down into four components:
1. Perception
2. Localization
3. Prediction
4. Decision Making, which includes high-level path planning, behaviour
arbitration, and motion controllers
Perception
One of the most important properties that self-driving cars must have is
perception, which helps the car see the world around itself, as well as
recognize and classify the things that it sees. In order to make good decisions,
the car needs to recognize objects instantly.
So, the car needs to see and classify traffic lights, pedestrians, road signs,
walkways, parking spots, lanes, and much more. Not only that, it also needs to
know the exact distance between itself and the objects around it. Perception is
more than just seeing and classifying; it enables the system to evaluate
distances and decide whether to slow down or brake.
To achieve such a high level of perception, a self-driving car typically relies on
three types of sensors:
1. Camera
2. LiDAR
3. RADAR
Camera
The camera provides vision to the car, enabling multiple tasks like
classification, segmentation, and localization. The cameras need to be high-
resolution and represent the environment accurately.
In order to make sure that the car receives visual information from every side
(front, back, left, and right), the images from these cameras are stitched
together to form a 360-degree view of the entire environment. The cameras
provide a long-range view of up to 200 meters as well as a short-range view for
more focused perception.
In some tasks like parking, the camera also provides a panoramic view for
better decision-making.
Even though cameras handle most perception-related tasks, they are of little use
in extreme conditions like fog, heavy rain, and especially at night. In such
conditions, all that cameras capture is noise and discrepancies, which can be
life-threatening.
To overcome these limitations, we need sensors that can work without light
and also measure distance.
LiDAR
LiDAR stands for Light Detection And Ranging. It’s a method of measuring the
distance to objects by firing a laser beam and then measuring how long it takes
to be reflected back.
A camera can only provide the car with images of what’s going around itself.
When it’s combined with the LiDAR sensor, it gains depth in the images – it
suddenly has a 3D perception of what’s going on around the car.
So, LiDAR perceives spatial information. And when this data is fed into deep
neural networks, the car can predict the actions of the objects or vehicles close
to it. This sort of technology is very useful in a complex driving scenario, like a
multi-exit intersection, where the car can analyze all other cars and make the
appropriate, safest decision.
In 2019, Elon Musk openly stated that “anyone relying on LiDAR is doomed…”.
Why? Well, LiDARs have limitations that can be catastrophic. For example, the
LiDAR sensor uses lasers (light) to measure the distance to nearby objects. It
works at night and in dark environments, but it can still fail when there’s noise
from rain or fog. That’s why we also need a RADAR sensor.
RADARs
Radio detection and ranging (RADAR) is a key component in many military and
consumer applications. It was first used by the military to detect objects. It
calculates distance using radio wave signals. Today, it’s used in many vehicles
and has become a primary component of the self-driving car.
RADARs are highly effective because they use radio waves instead of lasers, so
they work in any conditions.
It’s important to understand that radars are noisy sensors. This means that
even if the camera sees no obstacle, the radar will detect some obstacles.
The image above shows the self-driving car (in green) using LiDAR to detect
surrounding objects and to calculate their distance and shape. Compare the
same scene captured with the RADAR sensor below, and you can see a lot of
unnecessary noise.
The RADAR data should be cleaned in order to make good decisions and
predictions. We need to separate weak signals from strong ones; this is called
thresholding. We also use Fast Fourier Transforms (FFT) to filter and interpret
the signal.
If you look at the images above, you’ll notice that the RADAR and LiDAR signals
are point-based data. This data should be clustered so that it can be
interpreted properly. Clustering algorithms such as Euclidean clustering or
K-means clustering are used for this task, as sketched below.
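To make this concrete, here is a minimal Python sketch with made-up points and thresholds: it drops weak returns (thresholding) and then groups the remaining points with DBSCAN, a density-based Euclidean clustering method from scikit-learn; KMeans could be swapped in the same way.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy sketch: threshold noisy RADAR/LiDAR-style returns by signal strength,
# then group the surviving points into objects with density-based clustering.
# All numbers and names here are illustrative, not from a real sensor stack.
rng = np.random.default_rng(42)
points = rng.uniform(-50, 50, size=(500, 2))    # (x, y) returns in metres
strength = rng.uniform(0, 1, size=500)          # per-point signal strength

strong = points[strength > 0.6]                 # thresholding: drop weak returns
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(strong)

print(f"{len(strong)} strong returns grouped into {labels.max() + 1} clusters "
      f"({np.sum(labels == -1)} points left as noise)")
```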
Localization
Localization enables the self-driving car to know its exact position in the
environment, which the prediction and decision-making stages described below
depend on.
Prediction
The car has a 360-degree view of its environment that enables it to perceive
and capture all the information and process it. Once fed into the deep learning
algorithm, it can come up with all the possible moves that other road users
might make. It’s like a game where the player has a finite number of moves and
tries to find the best move to defeat the opponent.
The sensors in self-driving cars enable them to perform tasks like image
classification, object detection, segmentation, and localization. With various
forms of data representation, the car can make predictions about the objects
around it.
A deep learning algorithm can model such information (images and cloud data
points from LiDARs and RADARs) during training. The same model, but during
inference, can help the car to prepare for all the possible moves which involve
braking, halting, slowing down, changing lanes, and so on.
The role of deep learning is to interpret complex vision tasks, localize the car in
its environment, enhance perception, and actuate kinematic maneuvers in
self-driving cars. This ensures road safety and an easy commute as well.
But the tricky part is to choose the correct action out of a finite number of
actions.
Decision-making
In order to make a decision, the car should have enough information so that it
can select the necessary set of actions. We learned that the sensors help the
car to collect information and deep learning algorithms can be used for
localization and prediction.
To recap, localization enables the car to know its initial position, and prediction
generates a number of possible actions or moves based on the environment.
The question remains: which of the many predicted actions is best?
In order to tackle the problem of finding the best move, the deep learning
model is optimized with Bayesian optimization. There are also situations where
a framework consisting of both a hidden Markov model and Bayesian
optimization is used for decision-making.
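As a small illustration of that optimization step (a sketch under assumptions, not any manufacturer's actual pipeline), a Bayesian-style search over a model's hyperparameters can be run with a library such as Optuna, whose default TPE sampler performs Bayesian-style optimization; the objective function below is a stand-in for a real validation loss.

```python
import optuna

# Hypothetical objective: stands in for the validation loss of a driving model
# as a function of two hyperparameters. A real setup would train and evaluate
# the model here instead of returning a synthetic value.
def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    return (lr - 1e-3) ** 2 + (dropout - 0.2) ** 2   # pretend validation loss

study = optuna.create_study(direction="minimize")    # default TPE sampler
study.optimize(objective, n_trials=30)
print(study.best_params)
```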
CNNs used for self-driving cars
CNNs can capture different patterns as the depth of the network increases. For
example, the layers at the beginning of the network will capture edges, while
the deep layers will capture more complex features like the shape of the
objects (leaves in trees, or tires on a vehicle). This is the reason why CNNs are
the main algorithm in self-driving cars.
The key component of the CNN is the convolutional layer itself. It has a
convolutional kernel which is often called the filter matrix. The filter matrix is
convolved with a local region of the input image, which can be defined as:

y_{i,j} = Σ_m Σ_n w_{m,n} · x_{i+m, j+n} + b

Where:
x is the local region (patch) of the input image,
w is the filter (kernel) matrix,
b is the bias term, and
y_{i,j} is the corresponding value of the output feature map.
In practice, the dimension of the filter matrix is usually 3×3 or 5×5. During
training, the filter weights are constantly updated until they converge to
reasonable values. One of the key properties of CNNs is that the weights are
shared: the same filter parameters are applied across different locations of the
input. Shared parameters save a lot of memory and computation, while the
network can still learn diverse feature representations.
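To show what the convolution equation above computes, here is a minimal NumPy sketch of a single 3×3 filter sliding over a one-channel image; the array sizes and function name are illustrative only.

```python
import numpy as np

def conv2d_single(image, kernel, bias=0.0):
    """Naive 2D convolution of one channel: stride 1, no padding."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]           # local region x
            out[i, j] = np.sum(patch * kernel) + bias   # y_{i,j} = Σ w·x + b
    return out

image = np.random.rand(6, 6)     # toy 6x6 single-channel "image"
kernel = np.random.rand(3, 3)    # 3x3 filter matrix, shared across all positions
feature_map = conv2d_single(image, kernel)
print(feature_map.shape)         # (4, 4)
```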
The output of the convolution is usually fed to a nonlinear activation function.
The activation function introduces non-linearity, which enables the network to
solve problems that are not linearly separable. Commonly used activation
functions are Sigmoid, Tanh, and ReLU:

Sigmoid: σ(x) = 1 / (1 + e^(−x))
Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
ReLU: ReLU(x) = max(0, x)
It’s worth mentioning that ReLU is often the preferred activation function
because it converges faster than the other activation functions. In addition, the
output of the convolution layer is typically passed through a max-pooling layer,
which downsamples the feature map while keeping the most salient information
about the input image, such as texture.
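Putting the pieces together, a single convolution, ReLU, and max-pooling stage could be sketched in PyTorch as follows; the layer sizes are arbitrary and not taken from any production network.

```python
import torch
import torch.nn as nn

# A minimal perception block: convolution -> ReLU -> max pooling.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # 3x3 shared filters
    nn.ReLU(),                    # non-linear activation
    nn.MaxPool2d(kernel_size=2),  # downsample, keeping the strongest responses
)

x = torch.randn(1, 3, 64, 64)     # one dummy RGB frame
features = block(x)
print(features.shape)             # torch.Size([1, 16, 32, 32])
```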
The three important CNN properties that make them versatile and a primary
component of self-driving cars follow from the operations described above:
local receptive fields (each filter looks at a small region of the input), shared
weights, and spatial subsampling through pooling.
Next, we’ll discuss three CNN networks used by three companies pioneering
self-driving cars:
1. HydraNet by Tesla
2. ChauffeurNet by Google Waymo
3. Nvidia’s self-driving car network
HydraNet by Tesla
HydraNet was introduced by Mullapudi et al. in 2018. It was developed for
semantic segmentation and to improve computational efficiency at inference
time.
Take the context of self-driving cars. One input dataset can consist of static
environments like trees and road railings, another of the road and the lanes,
another of traffic lights, and so on. Each of these inputs is used to train a
different branch of the network. During inference, the gate chooses which
branches to run, and the combiner aggregates their outputs to make a final
decision.
In the case of Tesla, engineers have modified this network slightly, because it’s
difficult to segregate data for the individual tasks during inference. To
overcome that problem, they developed a shared backbone, usually built from
modified ResNet-50 blocks. This HydraNet is trained on data for all of the
tasks. Task-specific heads allow the model to predict task-specific outputs; the
heads are based on semantic segmentation architectures like U-Net.
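A minimal sketch of this shared-backbone, multi-head idea might look like the following; the layer sizes, task count, and names are hypothetical and not Tesla's actual implementation.

```python
import torch
import torch.nn as nn

# Toy "shared backbone + task-specific heads" sketch (hypothetical, not Tesla's code).
class TinyHydraNet(nn.Module):
    def __init__(self, num_classes_per_task=(4, 3)):
        super().__init__()
        # Shared backbone: in practice something like a modified ResNet-50.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # One lightweight head per task (e.g. lanes, traffic lights, ...).
        self.heads = nn.ModuleList(
            nn.Conv2d(32, n, kernel_size=1) for n in num_classes_per_task
        )

    def forward(self, x):
        features = self.backbone(x)              # computed once, shared by all heads
        return [head(features) for head in self.heads]

outputs = TinyHydraNet()(torch.randn(1, 3, 64, 64))
print([o.shape for o in outputs])
```

The key point is that the expensive backbone features are computed once and reused by every head, which is what makes inference efficient.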
The Tesla HydraNet can also project a bird’s-eye view, meaning it can create a
3D view of the environment from any angle, giving the car much more
dimensionality to navigate properly. It’s important to know that Tesla doesn’t
use LiDAR sensors. It relies on only two types of sensors: cameras and radar.
Although LiDAR explicitly creates depth perception for the car, Tesla’s
HydraNet is efficient enough to stitch together the visual information from the
car’s 8 cameras and create depth perception from that.
ChauffeurNet by Google Waymo
The CNN in ChauffeurNet is described as a convolutional feature network, or
FeatureNet, that extracts contextual feature representation shared by the
other networks. These representations are then fed to a recurrent agent
network (AgentRNN) that iteratively yields the prediction of successive points
in the driving trajectory.
The idea behind this network is to train a self-driving car using imitation
learning. In the paper “ChauffeurNet: Learning to Drive by Imitating the Best
and Synthesizing the Worst”, Bansal et al. argue that training a self-driving car
even with 30 million examples is not enough. To tackle that limitation, the
authors trained the car on synthetic data. This synthetic data introduced
deviations such as perturbations of the trajectory path, added obstacles, and
unnatural scenes. They found that such synthetic data trained the car much
more efficiently than normal data alone.
In the image above, the cyan path depicts the input route, the green box is the
self-driving car, the blue dots are the agent’s past positions, and the green dots
are the predicted future positions.
Self-driving car by Nvidia
Nvidia also uses a convolutional neural network as the primary algorithm for its
self-driving car. But unlike Tesla, it uses 3 cameras: one on each side and one
at the front.
The network is capable of operating on roads that don’t have lane markings,
including parking lots. It can also learn features and representations that are
necessary for detecting useful road features.
Deep reinforcement learning for self-driving cars
We discussed earlier how the neural network predicts a number of possible
actions from the perception data. But choosing an appropriate action requires
deep reinforcement learning (DRL). At the core of DRL, there are three
important variables:
1. State describes the current situation at a given time. In this case, it would
be the car’s position on the road.
2. Action describes all the possible moves that the car can make.
3. Reward is the feedback that the car receives whenever it takes a certain
action.
Generally, the agent is not told what to do or which actions to take. As we have
seen so far, in supervised learning the algorithm maps inputs to outputs. In
DRL, the algorithm instead learns by exploring the environment, and each
interaction yields a reward, which can be positive or negative. The goal of DRL
is to maximize the cumulative reward.
One key point to remember is that self-driving cars can’t be trained from
scratch on real-world roads, because that would be extremely dangerous.
Instead, self-driving cars are trained in simulators, where there’s no risk at all.
Commonly used simulators include:
1. CARLA
2. SUMMIT
3. AirSim
4. DeepDrive
5. Flow
These cars (agents) are trained for thousands of epochs in highly difficult
simulations before they’re deployed in the real world.
During training, the agent (the car) learns by taking a certain action in a certain
state. Based on this state-action pair, it receives a reward. This process
happens over and over again, and each time, the agent updates the rule it uses
to choose actions. This rule is called the policy.
The policy is described as how the agent makes decisions. It’s a decision-
making rule. The policy defines the behaviour of the agent at a given time.
Whenever a decision leads to a negative reward, the policy is adjusted. So, in
order to avoid negative rewards, the agent checks the quality of a certain
action in a given state. This quality is measured by the state-value function,
which can be computed using the Bellman expectation equation.
The Bellman expectation equation, along with the Markov Decision Process
(MDP), makes up the two core concepts of DRL. But when it comes to
self-driving cars,
we have to keep in mind that the observations from the perception data
should be mapped with the appropriate action and not just map the
underlying state to the action. This is where a partially observed decision
process or a Partially Observable Markov Decision Process (POMDP) is
required, which can make decisions based on the observation.
The action taken is transitioned into some new state and the agent is given a
reward. This process of evaluating a state, taking action, changing states, and
rewarding is repeated. Throughout the process, it’s the agent’s goal to
maximize the total amount of rewards.
The POMDP has six components and can be denoted as POMDP M := (I, S, A,
R, P, γ), where:
I: set of observations
S: finite set of states
A: finite set of actions
R: reward function
P: transition probability function
γ: discounting factor for future rewards
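To keep these components straight, here is a tiny illustrative container for that tuple; the field names mirror the list above, while the toy states, actions, and functions are hypothetical placeholders rather than a working planner.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Illustrative container for the POMDP tuple M := (I, S, A, R, P, γ).
@dataclass
class POMDP:
    observations: Sequence[Any]                    # I
    states: Sequence[Any]                          # S
    actions: Sequence[Any]                         # A
    reward: Callable[[Any, Any], float]            # R(s, a)
    transition: Callable[[Any, Any, Any], float]   # P(s' | s, a)
    gamma: float                                   # γ, discount factor

m = POMDP(
    observations=["camera_frame", "radar_points"],
    states=["lane_keeping", "lane_change"],
    actions=["brake", "accelerate", "steer_left", "steer_right"],
    reward=lambda s, a: 1.0 if (s, a) == ("lane_keeping", "accelerate") else -1.0,
    transition=lambda s, a, s_next: 0.5,   # toy placeholder probability
    gamma=0.95,
)
print(m.gamma)
```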
The objective of DRL is to find the policy that maximizes the reward at each
given time step or, in other words, to find the optimal action-value function
(Q-function).
Q-learning is one of the most commonly used DRL algorithms for self-driving
cars. It falls under the category of model-free learning, in which the agent tries
to approximate the optimal state-action values directly. The policy still
determines which state-action pairs, or Q-values, are visited and updated (see
the equation below). The goal is to find the optimal policy by interacting with
the environment and correcting the policy whenever the agent makes an error.
With enough samples or observation data, Q-learning will learn the optimal
state-action values. In practice, Q-learning has been shown to converge to the
optimal state-action values of an MDP with probability 1, provided that every
action is tried infinitely often in every state.
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where:
α ∈ [0,1] is the learning rate, which controls the degree to which Q-values are
updated at a given time step t,
γ is the discount factor, and
r_{t+1} is the reward received after taking action a_t in state s_t.
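To make the update rule concrete, here is a tabular Q-learning sketch on a toy, made-up environment; real self-driving agents approximate Q with deep networks rather than a small table, so treat this only as an illustration of the update.

```python
import numpy as np

# Tabular Q-learning on a toy "road" with made-up states, actions, and rewards.
n_states, n_actions = 5, 3          # e.g. 5 road positions; actions: keep / slow / change lane
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.2

def step(state, action):
    """Toy environment: small penalty each step, +1 for action 0 in the last state."""
    next_state = min(state + 1, n_states - 1)
    reward = 1.0 if (state == n_states - 1 and action == 0) else -0.01
    return next_state, reward

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    for _ in range(n_states):
        # ε-greedy exploration: mostly exploit, sometimes try a random action.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update: Q(s,a) += α [r + γ max_a' Q(s',a') − Q(s,a)]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q.round(2))
```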
It’s important to remember that the agent will discover the good and bad
actions through trial and error.
Conclusion
Self-driving cars aim to revolutionize car travel by making it safe and efficient.
In this article, we outlined some of the key components such as LiDAR, RADAR,
cameras, and most importantly – the algorithms that make self-driving cars
possible.
While it’s promising, there’s still a lot of room for improvement. For example,
current self-driving cars are at Level 2 out of the 5 levels of driving automation,
which means a human still has to be ready to intervene if necessary.
1. The algorithms used are not yet good enough to reliably perceive roads and
lanes, because some roads lack markings and other signs.