
Self-Driving Cars With Convolutional Neural Networks (CNN)

Nilesh Barla
10th August, 2023


Humanity has been waiting for self-driving cars for several decades. Thanks to
the extremely fast evolution of technology, this idea recently went from
“possible” to “commercially available in a Tesla”.

Deep learning is one of the main technologies that enabled self-driving. It’s a
versatile tool that can tackle almost any type of science or engineering problem –
it can be used in physics, for example, to analyze proton-proton collisions in the
Large Hadron Collider, just as well as in Google Lens to classify pictures.

In this article, we’ll focus on deep learning algorithms in self-driving cars –
convolutional neural networks (CNN). CNN is the primary algorithm that
these systems use to recognize and classify different parts of the road, and to
make appropriate decisions.

Along the way, we’ll see how Tesla, Waymo, and Nvidia use CNN algorithms to
make their cars driverless or autonomous.
You may also like: Experiment Tracking for Systems Powering Self-Driving Vehicles [Case Study with Waabi]

How do self-driving cars work?

The first self-driving car was developed in 1989: the Autonomous Land Vehicle
In a Neural Network (ALVINN). It used neural networks to detect lines,
segment the environment, navigate itself, and drive. It worked well, but it was
limited by slow processing power and insufficient data.

With today’s high-performance graphics cards, processors, and huge amounts
of data, self-driving is more powerful than ever. If it becomes mainstream, it
will reduce traffic congestion and increase road safety.

Self-driving cars are autonomous decision-making systems. They can process
streams of data from different sensors such as cameras, LiDAR, RADAR, GPS,
or inertial sensors. This data is then modeled using deep learning algorithms,
which then make decisions relevant to the environment the car is in.

A modular perception-planning-action pipeline | Source

The image above shows a modular perception-planning-action pipeline used
to make driving decisions. The key components of this method are the different
sensors that fetch data from the environment.
To understand the workings of self-driving cars, we need to examine the four
main parts:

1. Perception
2. Localization
3. Prediction
4. Decision Making
    High-level path planning
    Behaviour Arbitration
    Motion Controllers

Perception

One of the most important properties that self-driving cars must have is
perception, which helps the car see the world around itself, as well as
recognize and classify the things that it sees. In order to make good decisions,
the car needs to recognize objects instantly.

So, the car needs to see and classify traffic lights, pedestrians, road signs,
walkways, parking spots, lanes, and much more. Not only that, it also needs to
know the exact distance between itself and the objects around it. Perception is
more than just seeing and classifying, it enables the system to evaluate the
distance and decide to either slow down or brake.

To achieve such a high level of perception, a self-driving car typically relies on
three types of sensors:

1. Camera
2. LiDAR
3. RADAR

Camera

The camera provides vision to the car, enabling multiple tasks like
classification, segmentation, and localization. The cameras need to be high-
resolution and represent the environment accurately.

To make sure that the car receives visual information from every side – front,
back, left, and right – the images from multiple cameras are stitched together
to form a 360-degree view of the entire environment. These cameras provide a
wide-range view of up to 200 meters as well as a short-range view for more
focused perception.

Self-driving car’s camera | Source

In some tasks like parking, the camera also provides a panoramic view for
better decision-making.

Even though cameras handle most perception-related tasks, they’re hardly of
any use in extreme conditions like fog, heavy rain, and especially at night. In
such conditions, all that cameras capture is noise and discrepancies, which can
be life-threatening.

To overcome these limitations, we need sensors that can work without light
and also measure distance.

LiDAR

LiDAR stands for Light Detection And Ranging. It’s a method of measuring the
distance to objects by firing a laser beam and then measuring how long it takes
to be reflected back.

A camera alone can only provide the car with images of what’s going on around
it. When it’s combined with a LiDAR sensor, the car gains depth in the images –
it suddenly has a 3D perception of its surroundings.

So, LiDAR perceives spatial information. And when this data is fed into deep
neural networks, the car can predict the actions of the objects or vehicles close
to it. This sort of technology is very useful in a complex driving scenario, like a
multi-exit intersection, where the car can analyze all other cars and make the
appropriate, safest decision.


Object detection with LiDAR | Source

In 2019, Elon Musk openly stated that “anyone relying on LiDAR is doomed…”.
Why? Well, LiDARs have limitations that can be catastrophic. For example, the
LiDAR sensor uses lasers, i.e. light, to measure the distance to nearby objects.
It will work at night and in dark environments, but it can still fail when there’s
noise from rain or fog. That’s why we also need a RADAR sensor.

RADARs

Radio detection and ranging (RADAR) is a key component in many military and
consumer applications. It was first used by the military to detect objects. It
calculates distance using radio wave signals. Today, it’s used in many vehicles
and has become a primary component of the self-driving car.

RADARs are highly effective because they use radio waves instead of lasers, so
they work in any conditions.
Source

It’s important to understand that radars are noisy sensors. This means that
even if the camera sees no obstacle, the radar will detect some obstacles.

Source

The image above shows the self-driving car (in green) using LiDAR to detect
objects around and to calculate the distance and shape of the object. Compare
the same scene, but captured with the RADAR sensor below, and you can see a
lot of unnecessary noise.

Source

The RADAR data should be cleaned in order to make good decisions and
predictions. We need to separate weak signals from strong ones; this is called
thresholding. We also use Fast Fourier Transforms (FFT) to filter and interpret
the signal.

If you look at the images above, you’ll notice that the RADAR and LiDAR signals
are point-based data. This data should be clustered so that it can be
interpreted meaningfully. Clustering algorithms such as Euclidean clustering or
K-means clustering are used to achieve this task.
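
To make this concrete, here is a minimal sketch of the post-processing steps described above: an FFT over a raw RADAR return, amplitude thresholding, and clustering of the resulting point detections. It assumes NumPy and scikit-learn; the signal, the threshold, and the number of clusters are illustrative placeholders, not values from a real sensor pipeline (K-means is used here as a stand-in for the clustering step).

```python
# Hedged sketch of RADAR/LiDAR post-processing: FFT filtering, thresholding,
# and clustering of point detections. All numbers are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def filter_radar_signal(raw_signal: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Transform a raw RADAR return to the frequency domain and keep strong bins."""
    spectrum = np.fft.rfft(raw_signal)                    # Fast Fourier Transform
    magnitude = np.abs(spectrum)
    magnitude[magnitude < threshold * magnitude.max()] = 0.0   # thresholding weak signals
    return magnitude

def cluster_detections(points: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Group point-based detections (x, y) into object clusters."""
    # Assumes a fixed number of clusters; Euclidean clustering would be an alternative.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(points)

if __name__ == "__main__":
    signal = np.sin(np.linspace(0, 40 * np.pi, 1024)) + 0.3 * np.random.randn(1024)
    print(filter_radar_signal(signal).nonzero()[0][:10])  # indices of strong frequency bins
    detections = np.random.rand(200, 2) * 50.0            # fake (x, y) detections in metres
    print(np.unique(cluster_detections(detections)))      # cluster labels per detection
```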

Source

Localization

Localization algorithms in self-driving cars calculate the position and
orientation of the vehicle as it navigates – a science known as Visual Odometry
(VO).

VO works by matching key points in consecutive video frames. With each
frame, the key points are used as the input to a mapping algorithm. A mapping
algorithm such as Simultaneous Localization and Mapping (SLAM) computes the
position and orientation of each nearby object with respect to the previous
frame and helps to classify roads, pedestrians, and other objects around the
car.
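
As a rough illustration of the key-point matching step, here is a hedged sketch using OpenCV's ORB features and essential-matrix decomposition; the camera matrix K and the input frames are placeholders, and a production VO/SLAM stack would be considerably more involved.

```python
# Minimal visual-odometry sketch: match ORB key points between two consecutive
# frames and recover the relative rotation R and (up-to-scale) translation t.
# The camera intrinsics K and the frames are assumed to be provided by the caller.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=2000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def relative_pose(frame_prev: np.ndarray, frame_curr: np.ndarray, K: np.ndarray):
    """Estimate the camera motion between two consecutive grayscale frames."""
    kp1, des1 = orb.detectAndCompute(frame_prev, None)
    kp2, des2 = orb.detectAndCompute(frame_curr, None)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t   # orientation change and direction of motion between frames
```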

Source

Deep learning is generally used to improve the performance of VO and to
classify different objects. Neural networks such as PoseNet and VLocNet++
are some of the frameworks that use point data to estimate the 3D position
and orientation. These estimated 3D positions and orientations can be used to
derive scene semantics, as seen in the image below.

Source
Prediction

Understanding human drivers is a very complex task. It involves emotions
rather than logic, and these are all fueled by reactions. It becomes very
uncertain what the next action of nearby drivers or pedestrians will be, so a
system that can predict the actions of other road users can be very important
for road safety.

The car has a 360-degree view of its environment that enables it to perceive
and capture all the information and process it. Once fed into the deep learning
algorithm, it can come up with all the possible moves that other road users
might make. It’s like a game where the player has a finite number of moves and
tries to find the best move to defeat the opponent.

The sensors in self-driving cars enable them to perform tasks like image
classification, object detection, segmentation, and localization. With various
forms of data representation, the car can make predictions about the objects
around it.

A deep learning algorithm can model such information (images and point cloud
data from LiDARs and RADARs) during training. At inference time, the same
model can help the car prepare for all the possible moves, which involve
braking, halting, slowing down, changing lanes, and so on.

The role of deep learning is to interpret complex vision tasks, localize the car in
its environment, enhance perception, and actuate kinematic maneuvers in
self-driving cars. This ensures road safety as well as an easy commute.

But the tricky part is to choose the correct action out of a finite number of
actions.

Decision-making

Decision-making is vital in self-driving cars. They need a system that’s dynamic
and precise in an uncertain environment. It needs to take into account that not
all sensor readings will be true, and that humans can make unpredictable
choices while driving. These things can’t be measured directly. Even if we could
measure them, we can’t predict them with good accuracy.

A self-driving car moving towards an intersection | Source

The image above shows a self-driving car moving towards an intersection.
Another car, in blue, is also moving towards the intersection. In this scenario,
the self-driving car has to predict whether the other car will go straight, left, or
right. In each case, the car has to decide what maneuver it should perform to
prevent a collision.

In order to make a decision, the car should have enough information so that it
can select the necessary set of actions. We learned that the sensors help the
car to collect information and deep learning algorithms can be used for
localization and prediction.

To recap, localization enables the car to know its initial position, and prediction
generates a number of possible actions or moves based on the environment.
The question remains: which option is best out of the many predicted actions?

When it comes to making decisions, we use deep reinforcement learning (DRL).
More specifically, a decision-making algorithm called the Markov decision
process (MDP) lies at the heart of DRL (we’ll learn more about MDP in a later
section where we talk about reinforcement learning).

Usually, an MDP is used to predict the future behavior of the road users. We
should keep in mind that the scenario can get very complex if the number of
objects, especially moving ones, increases. This eventually increases the
number of possible moves for the self-driving car itself.

In order to tackle the problem of finding the best move for itself, the deep
learning model is optimized with Bayesian optimization. There are also
situations where a framework consisting of both a hidden Markov model and
Bayesian optimization is used for decision-making.

In general, decision-making in self-driving cars is a hierarchical process. This
process has four components:
Path or Route planning: Essentially, route planning is the first of four
decisions that the car must make. Entering the environment, the car should
plan the best possible route from its current position to the requested
destination. The idea is to find an optimal solution among all the other
solutions.
Behaviour Arbitration: Once the route is planned, the car needs to
navigate itself through the route. The car knows about the static elements,
like roads, intersections, average road congestion and more, but it can’t
know exactly what the other road users are going to be doing throughout
the journey. This uncertainty in the behavior of other road users is solved by
using probabilistic planning algorithms like MDPs.
Motion Planning: Once the behavior layer decides how to navigate through
a certain route, the motion planning system orchestrates the motion of the
car. The motion of the car must be feasible and comfortable for the
passenger. Motion planning includes speed of the vehicle, lane-changing,
and more, all of which should be relevant to the environment the car is in.
Vehicle Control: Vehicle control is used to execute the reference path from
the motion planning system.
Source

CNNs used for self-driving cars

Convolutional neural networks (CNN) are used to model spatial information,
such as images. CNNs are very good at extracting features from images, and
they’re often seen as universal non-linear function approximators.

CNNs can capture different patterns as the depth of the network increases. For
example, the layers at the beginning of the network will capture edges, while
the deep layers will capture more complex features like the shape of the
objects (leaves in trees, or tires on a vehicle). This is the reason why CNNs are
the main algorithm in self-driving cars.

The key component of the CNN is the convolutional layer itself. It has a
convolutional kernel which is often called the filter matrix. The filter matrix is
convolved with a local region of the input image, which can be defined as:

y = w * x + b

where:

the operator * represents the convolution operation,
w is the filter matrix and b is the bias,
x is the input,
y is the output.

In practice, the dimension of the filter matrix is usually 3×3 or 5×5. During the
training process, the filter matrix is constantly updated to reach reasonable
weights. One of the properties of CNNs is that the weights are shareable: the
same weight parameters can be used to represent two different
transformations in the network. Shared parameters save a lot of memory and
allow the network to learn more diverse feature representations.
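
Below is a minimal NumPy sketch of the convolution equation y = w * x + b, with a single 3×3 filter sliding over a small input (no padding, stride 1); the filter values and input are illustrative only.

```python
# Toy implementation of y = w * x + b for one filter over a grayscale patch.
import numpy as np

def conv2d(x: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights w are reused at every location: shared parameters.
            y[i, j] = np.sum(w * x[i:i + kh, j:j + kw]) + b
    return y

x = np.random.rand(8, 8)                                           # input image patch
w = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)    # 3x3 edge-like filter
print(conv2d(x, w, b=0.1).shape)                                   # -> (6, 6)
```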

The output of the convolution is usually fed to a nonlinear activation function.
The activation function enables the network to solve problems that are not
linearly separable, and these functions can map high-dimensional manifolds
into lower-dimensional manifolds. Commonly used activation functions are
Sigmoid, Tanh, and ReLU, which are defined as follows:

Sigmoid: σ(x) = 1 / (1 + e^(-x))
Tanh: tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))
ReLU: f(x) = max(0, x)

It’s worth mentioning that ReLU is the preferred activation function, because it
converges faster than the other activation functions. In addition to that, the
output of the convolution layer is usually passed through a max-pooling layer,
which downsamples the feature maps while keeping important information
about the input image, like the background and texture.
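
The same building block – convolution followed by ReLU and max-pooling – can be sketched in a few lines of PyTorch; the channel counts and input size below are arbitrary examples, not values from any production network.

```python
# Hedged PyTorch sketch of one CNN building block: convolution -> ReLU -> max-pooling.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # 3x3 filters
    nn.ReLU(),                       # non-linear activation; converges faster than sigmoid/tanh
    nn.MaxPool2d(kernel_size=2),     # downsample the feature maps by a factor of 2
)

x = torch.randn(1, 3, 64, 64)        # a dummy RGB image batch
print(block(x).shape)                # -> torch.Size([1, 16, 32, 32])
```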

The three important CNN properties that make them versatile and a primary
component of self-driving cars are:

local receptive fields,
shared weights,
spatial subsampling.

These properties reduce overfitting and store representations and features


that are vital for image classification, segmentation, localization, and more.

Source

Next, we’ll discuss three CNN networks that are used by three companies
pioneering self-driving cars:

1. HydraNet by Tesla
2. ChauffeurNet by Google Waymo
3. Nvidia self-driving car

HydraNet – semantic segmentation for self-driving cars

HydraNet was introduced by Ravi et al. in 2018. It was developed for semantic
segmentation, with the goal of improving computational efficiency at inference time.

Source

HydraNet is a dynamic architecture, so it can contain different CNN networks,
each assigned to a different task. These blocks or networks are called branches.
The idea of HydraNet is to take various inputs and feed them into task-specific
CNN networks.

Take the context of self-driving cars. One input dataset can be of static
environments like trees and road railings, another can be of the road and the
lanes, another of traffic lights and the road, and so on. These inputs are used to
train different branches. During inference, a gate chooses which branches to
run, and a combiner aggregates the branch outputs before making a final
decision.

In the case of Tesla, they have modified this network slightly because it’s
difficult to segregate data for the individual tasks during inference. To
overcome that problem, engineers at Tesla developed a shared backbone. The
shared backbones are usually modified ResNet-50 blocks.

This HydraNet is trained on data for all of the objects. There are task-specific
heads that allow the model to predict task-specific outputs. The heads are
based on semantic segmentation architectures like U-Net.
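
The following is an illustrative PyTorch sketch of this layout: one shared backbone feeding several task-specific heads. It is not Tesla's implementation; the tiny backbone stands in for a modified ResNet-50, and the head sizes are placeholders.

```python
# Hedged sketch of a HydraNet-style network: shared backbone + task-specific heads.
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    def __init__(self, num_outputs_per_task=(10, 4, 2)):
        super().__init__()
        # Shared backbone (a small stand-in for a modified ResNet-50)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One head per task (e.g. traffic lights, lane markings, static objects)
        self.heads = nn.ModuleList([nn.Linear(64, n) for n in num_outputs_per_task])

    def forward(self, x):
        features = self.backbone(x)                 # computed once, shared by all heads
        return [head(features) for head in self.heads]

outputs = MultiHeadNet()(torch.randn(2, 3, 128, 128))
print([o.shape for o in outputs])                   # one prediction tensor per task
```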

Source

The Tesla HydraNet can also project a bird’s-eye view, meaning it can create a
3D view of the environment from any angle, giving the car much more
dimensionality to navigate properly. It’s important to know that Tesla doesn’t
use LiDAR sensors; it relies on only two types of sensors, cameras and radar.
Although LiDAR explicitly creates depth perception for the car, Tesla’s
HydraNet is efficient enough to stitch together the visual information from the
car’s 8 cameras and create depth perception.

Source

ChauffeurNet: training a self-driving car using imitation learning

ChauffeurNet is an RNN-based neural network used by Google Waymo;
however, a CNN is actually one of its core components, used to extract features
from the perception system.

The CNN in ChauffeurNet is described as a convolutional feature network, or
FeatureNet, that extracts a contextual feature representation shared by the
other networks. These representations are then fed to a recurrent agent
network (AgentRNN) that iteratively yields the prediction of successive points
in the driving trajectory.
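
A rough sketch of this FeatureNet-plus-AgentRNN structure might look as follows; the layer sizes, the GRU cell, and the rendered bird's-eye input are assumptions made for illustration, not Waymo's actual architecture.

```python
# Hedged sketch of the ChauffeurNet idea: a CNN feature extractor feeding an RNN
# that unrolls future trajectory waypoints. All dimensions are illustrative.
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
    def forward(self, birdseye_render):
        return self.net(birdseye_render)            # contextual feature vector

class AgentRNN(nn.Module):
    def __init__(self, feat_dim=64, steps=10):
        super().__init__()
        self.steps = steps
        self.cell = nn.GRUCell(feat_dim, 128)
        self.to_point = nn.Linear(128, 2)           # (x, y) waypoint per step
    def forward(self, features):
        h = torch.zeros(features.size(0), 128)
        waypoints = []
        for _ in range(self.steps):                 # iteratively predict successive points
            h = self.cell(features, h)
            waypoints.append(self.to_point(h))
        return torch.stack(waypoints, dim=1)        # (batch, steps, 2)

feats = FeatureNet()(torch.randn(1, 3, 200, 200))
print(AgentRNN()(feats).shape)                      # -> torch.Size([1, 10, 2])
```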

The idea behind this network is to train a self-driving car using imitation
learning. In the paper released by Bansal et al “ChauffeurNet: Learning to Drive
by Imitating the Best and Synthesizing the Worst”, they argue that training a
self-driving car even with 30 million examples is not enough. In order to tackle
that limitation, the authors trained the car on synthetic data. This synthetic data
introduced deviations such as perturbations of the trajectory path, added
obstacles, unnatural scenes, and so on. They found that such synthetic data was
able to train the car much more efficiently than normal data alone.

Usually, self-driving has an end-to-end process, as we saw earlier, where the
perception system is part of a deep learning algorithm along with planning and
controlling. In the case of ChauffeurNet, the perception system is not a part of
the end-to-end process; instead, it’s a mid-level system where the network can
have different variations of input from the perception system.

Source

ChauffeurNet yields a driving trajectory by observing a mid-level
representation of the scene from the sensors, using the input along with
synthetic data to imitate an expert driver.

Source

In the image above, the cyan path depicts the input route, the green box is the
self-driving car, the blue dots are the agent’s past positions, and the green dots
are the predicted future positions.

Essentially, a mid-level representation doesn’t directly use raw sensor data as
input, factoring out the perception task, so we can combine real and simulated
data for easier transfer learning. This way, the network can create a high-level
bird’s eye view of the environment, which ultimately yields better decisions.

Nvidia self-driving car: a minimalist approach towards self-driving cars

Nvidia also uses a convolutional neural network as the primary algorithm for its
self-driving car. But unlike Tesla, it uses 3 cameras, one on each side and one
at the front. See the image below.

Source

The network is capable of operating on roads that don’t have lane markings,
including parking lots. It can also learn features and representations that are
necessary for detecting useful road features.

Compared to the explicit decomposition of the problem, such as lane marking
detection, path planning, and control, this end-to-end system optimizes all
processing steps at the same time.

Better performance is the result of internal components self-optimizing to
maximize overall system performance, instead of optimizing human-selected
intermediate criteria like lane detection. Such criteria are understandably
selected for ease of human interpretation, which doesn’t automatically
guarantee maximum system performance. Smaller networks are possible
because the system learns to solve the problem with a minimal number of
processing steps.
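
To give a feel for such an end-to-end network, here is a hedged PyTorch sketch in the spirit of Nvidia's published PilotNet design, mapping a single camera frame directly to a steering command; the layer sizes roughly follow the paper, but the snippet is illustrative only and omits training details.

```python
# Hedged end-to-end driving sketch: camera image in, steering angle out.
import torch
import torch.nn as nn

pilotnet = nn.Sequential(
    nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
    nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
    nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
    nn.Conv2d(48, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 10), nn.ReLU(),
    nn.Linear(10, 1),                  # single output: the steering command
)

frame = torch.randn(1, 3, 66, 200)     # 66x200 image crop, as in the published design
print(pilotnet(frame).shape)           # -> torch.Size([1, 1])
```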

Reinforcement learning used for self-driving cars

Reinforcement learning (RL) is a type of machine learning where an agent
learns by exploring and interacting with the environment. In this case, the self-
driving car is an agent.

Explore more applications of RL:

10 Real-Life Applications of Reinforcement Learning
7 Applications of Reinforcement Learning in Finance and Trading

We discussed earlier how the neural network predicts a number of actions
from the perception data. But, choosing an appropriate action requires deep
reinforcement learning (DRL). At the core of DRL, we have three important
variables:

1. State describes the current situation at a given time. In this case, it would
be a position on the road.
2. Action describes all the possible moves that the car can make.
3. Reward is feedback that the car receives whenever it takes a certain
action.

Generally, the agent is not told what to do or which actions to take. As we have
seen, in supervised learning the algorithm maps inputs to outputs. In DRL, the
algorithm learns by exploring the environment, and each interaction yields a
certain reward. The reward can be positive or negative. The goal of DRL is to
maximize the cumulative reward.
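
The state-action-reward loop can be sketched generically as below; DrivingEnv is a hypothetical toy environment with a gym-like interface, used only to show how the cumulative reward accumulates over an episode – real training would run against a simulator such as CARLA.

```python
# Toy agent-environment interaction loop illustrating state, action, and reward.
import random

class DrivingEnv:
    """Hypothetical stand-in: the 'state' is a lane offset, actions nudge it left/right."""
    def reset(self):
        self.position = 0.0
        return self.position
    def step(self, action):                      # action in {-1: left, 0: straight, +1: right}
        self.position += 0.1 * action + random.uniform(-0.02, 0.02)
        reward = 1.0 - abs(self.position)        # positive when centred, negative off-lane
        done = abs(self.position) > 1.0
        return self.position, reward, done

env = DrivingEnv()
state, total_reward = env.reset(), 0.0
for t in range(100):
    action = random.choice([-1, 0, 1])           # a real agent would use its policy here
    state, reward, done = env.step(action)
    total_reward += reward                       # the agent tries to maximize this sum
    if done:
        break
print(f"episode return: {total_reward:.2f}")
```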

In self-driving cars, the same procedure is followed: the network is trained on
perception data, where it learns what decision it should make. Because CNNs
are very good at extracting feature representations from the input, DRL
algorithms can be trained on those representations. Training a DRL algorithm
on these representations can yield good results because the extracted
representations transform higher-dimensional manifolds into simpler,
lower-dimensional ones. Training on these lower-dimensional representations
yields the efficiency that is required at inference time.

One key point to remember is that self-driving cars can’t be trained on real-
world roads, because that would be extremely dangerous. Instead, self-driving
cars are trained in simulators, where there’s no risk at all.

Some open-source simulators are:

1. CARLA
2. SUMMIT
3. AirSim
4. DeepDrive
5. Flow

A snapshot from Voyage Deepdrive | Source

These cars (agents) are trained for thousands of epochs with highly difficult
simulations before they’re deployed in the real world.

During training, the agent (the car) learns by taking a certain action in a certain
state. Based on this state-action pair, it receives a reward. This process
happens over and over again, and each time the agent updates its memory of
rewards. The way the agent chooses actions based on this experience is called
the policy.

The policy is described as how the agent makes decisions. It’s a decision-
making rule. The policy defines the behaviour of the agent at a given time.

For every negative decision the agent makes, the policy is updated. So, in order
to avoid negative rewards, the agent evaluates the quality of a certain action.
This quality is measured by the state-value function, which can be computed
using the Bellman expectation equation.
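
For reference, the state-value form of the Bellman expectation equation can be written as below (standard textbook notation; the article's original figure is not reproduced here):

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma \, V_{\pi}(S_{t+1}) \mid S_t = s \right]
```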

The Bellman expectation equation, along with the Markov decision process (MDP),
makes up the two core concepts of DRL. But when it comes to self-driving cars,
we have to keep in mind that observations from the perception data should be
mapped to the appropriate action, rather than just mapping the underlying
state to the action. This is where a Partially Observable Markov Decision
Process (POMDP) is required, which can make decisions based on observations.

Partially Observable Markov Decision Process used for self-driving cars

The Markov Decision Process gives us a way to formalize sequential
decision-making. When the agent interacts with the environment, it does so
sequentially over time. Each time the agent interacts with the environment, the
environment provides some representation of its state. Given this
representation of the state, the agent selects an action to take, as in the image
below.

The action taken transitions the environment into a new state, and the agent is
given a reward. This process of evaluating a state, taking an action, changing
states, and receiving a reward is repeated. Throughout the process, the agent’s
goal is to maximize the total amount of reward.

Let’s get a more constructive idea of the whole process:

1. At a given time t, the environment is in state S_t.
2. The agent observes the current state S_t and selects an action A_t.
3. The environment then transitions into a new state S_t+1, and the agent
simultaneously receives a reward R_t.

In a partially observable Markov decision process (POMDP), the agent
senses the environment state through observations received from the
perception data, takes a certain action, and then receives a reward.

The POMDP has six components and can be denoted as M := (I, S, A, R, P, γ), where:

I: set of observations
S: finite set of states
A: finite set of actions
R: reward function
P: transition probability function
γ: discount factor for future rewards
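
As a simple illustration, the tuple can be grouped in code; the dataclass below and its toy contents are purely a convenient notation for the six components, not any library's API.

```python
# Illustrative grouping of the POMDP tuple M := (I, S, A, R, P, γ).
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    observations: Sequence    # I: what the perception stack can actually sense
    states: Sequence          # S: finite set of environment states
    actions: Sequence         # A: finite set of manoeuvres
    reward: Callable          # R(s, a) -> float
    transition: Callable      # P(s' | s, a) -> probability
    gamma: float              # γ: discount factor for future rewards

toy = POMDP(
    observations=["clear", "obstacle_ahead"],
    states=["lane_keep", "lane_change", "braking"],
    actions=["accelerate", "brake", "steer_left", "steer_right"],
    reward=lambda s, a: 1.0 if (s, a) != ("braking", "accelerate") else -1.0,
    transition=lambda s_next, s, a: 1.0 / 3.0,      # placeholder uniform dynamics
    gamma=0.95,
)
print(toy.gamma)
```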

The objective of DRL is to find the desired policy that maximizes the reward at
each given time step or, in other words, to find an optimal action-value function
(Q-function).

Q-learning used for self-driving cars

Q-learning is one of the most commonly used DRL algorithms for self-driving
cars. It comes under the category of model-free learning. In model-free
learning, the agent tries to approximate the optimal state-action values. The
policy still determines which state-action pairs (Q-values) are visited and
updated (see the equation below). The goal is to find an optimal policy by
interacting with the environment and modifying the policy whenever the agent
makes an error.

With enough samples or observation data, Q-learning will learn the optimal
state-action values. In practice, Q-learning has been shown to converge to the
optimal state-action values for an MDP with probability 1, provided that every
action is sampled infinitely often in every state.

Q-learning can be described by the following update equation:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_t+1 + γ max_a Q(S_t+1, a) − Q(S_t, A_t)]

where:
α ∈ [0,1] is the learning rate. It controls the degree to which Q-values are
updated at a given time step t.
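
A compact tabular Q-learning sketch that matches this update rule is shown below; the driving-themed states and actions are a toy abstraction, and the hyperparameters are arbitrary examples rather than tuned values.

```python
# Tabular Q-learning sketch: epsilon-greedy action selection plus the standard update.
import random
from collections import defaultdict

actions = ["keep_lane", "change_left", "change_right", "brake"]
Q = defaultdict(float)                      # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.2       # learning rate, discount, exploration rate

def choose_action(state):
    if random.random() < epsilon:           # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploit the best known action

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```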


Source

It’s important to remember that the agent will discover the good and bad
actions through trial and error.

Conclusion

Self-driving cars aim to revolutionize car travel by making it safe and efficient.
In this article, we outlined some of the key components such as LiDAR, RADAR,
cameras, and most importantly – the algorithms that make self-driving cars
possible.

While it’s promising, there’s still a lot of room for improvement. For example,
current self-driving cars are generally at level 2 out of the 5 levels of driving
automation, which means a human still has to be ready to intervene if necessary.

A few things still need to be taken care of:

1. The algorithms used are not yet robust enough to perceive roads and lanes
reliably, because some roads lack markings and other signs.
