
2019 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE19)

Convolutional Neural Network for a Self-Driving Car in a Virtual Environment

Mohamed A. A. Babiker, Department of EEE, University of Khartoum, Khartoum, Sudan (mohmedazem@hotmail.com)
Mohamed A. O. Elawad, Department of EEE, University of Khartoum, Khartoum, Sudan (zackdad88@gmail.com)
Azza H. M. Ahmed, Department of EEE, University of Khartoum, Khartoum, Sudan (Knkn1989@hotmail.com)

Abstract—Convolutional neural networks (CNNs) are machine learning models achieving state-of-the-art results in a variety of computer vision tasks, decision making and visual recognition. For a long time, traditional computer-vision-based algorithms have been the primary method for analyzing camera footage used to assist safety functions, where decision making has been a product of manually constructed behaviors. During the last few years, deep learning has shown extraordinary capabilities for both visual recognition and decision making in end-to-end systems. This paper proposes a solution that introduces redundancy by combining deep learning methods with traditional computer-vision-based techniques to minimize unsafe behavior in autonomous vehicles. A CNN has been trained to map raw pixels from a single front-facing camera directly to steering commands. The objective was to build a simple and reliable algorithm for a self-driving car and to implement a system that allows autonomous driving.

Index Terms—Machine Learning, Self-Driving Car, Autonomous Driving, CNN, Simulation.

I. INTRODUCTION

Nowadays machine learning (ML) is becoming a popular research area in many disciplines such as genetics, pharmacological research, image classification and segmentation, speech recognition, natural language processing, robotics and stock market prediction. ML powers the recommendation system of Netflix and the search engine of Google, and many weather forecasting labs use ML algorithms to make predictions.

Convolutional Neural Networks (CNNs) [1] and other deep architectures have achieved remarkable results in the field of computer vision. In most cases they have surpassed earlier systems based on hand-crafted feature extraction and have set a new state of the art for tasks such as image classification, image captioning, object detection and semantic segmentation.

Previously, pattern recognition problems were handled through a pre-processing phase of hand-crafted feature extraction followed by a classifier. A key advantage of CNNs is that high-level features can be extracted automatically from the training data, which makes them especially useful for image recognition tasks. Also, because the convolution kernels scan the whole image, relatively few parameters need to be learned compared to the total number of operations.

The potential of CNNs with learned features has been well established over the last twenty years. However, they have only been embedded into real systems in the last couple of years, mainly thanks to two advances: first, the success of parallel graphics processing units (GPUs), on which CNN training heavily depends, and second, the availability of large datasets that can be used for training and testing, such as MNIST [17] and ILSVRC [2].

This paper describes a CNN that goes beyond pattern recognition. It learns the entire processing pipeline needed to steer an automobile. The CNN has been trained and tested with a dataset collected while driving in a simulated environment. After training, it is able to steer the car by generating control commands from the video of a front-facing center camera.

II. BACKGROUND

A. Real World Examples of Self-Driving Cars

Modern self-driving car companies do not use end-to-end learning, because it is infeasible to collect enough training data to cover all possible scenarios of real-world driving, where extremely high accuracy is necessary; the amount of data required to train an end-to-end system grows exponentially compared to a modular system. In a modular approach, the system is broken down into sub-modules with different responsibilities, such as pedestrian detection or path planning. Each module is then built either with machine learning or with manual engineering, depending on empirical success. Another important difference between real-world self-driving cars and the car model used in this paper is the form and size of the input. Companies like Waymo or Tesla combine data from multiple sensors and cameras to create a visual model of the car's surroundings. Expensive technologies such as LIDAR [3], radar and ultrasonic sensors are necessary to create a realistic 360-degree model of the environment. For example, the Tesla Model S uses a combination of wide, narrow and normal front cameras, side and rearward-looking side cameras and a single rear camera, twelve ultrasonic sensors in total, and on top of that a front-facing radar used primarily to detect the relative speed of objects in front of the car. Such cars also use variations of Bayesian Simultaneous Localization and Mapping (SLAM) [4] algorithms, which fuse the data from all sensors.

978-1-7281-1006-6/19/$31.00 ©2019 IEEE

B. Objective and Aim

The main motivation for this work is to avoid the need to recognize specific human-designated features, such as lane markings, guard rails or other vehicles, and to avoid creating a set of if-then-else rules based on observation of these features. This paper discusses the preliminary results of this new effort.

C. Approaches to Steering in Autonomous Driving

Currently, there are three main ways to handle steering in a self-driving vehicle:
• Non-AI approach (manual engineering)
• AI approach
• Combination of AI and non-AI

1) Non-AI Approach: The non-AI approach uses control theory to calculate a steering angle that keeps the vehicle on the desired trajectory, which is usually detected through computer vision algorithms. One of the most popular methods in control theory is the PID (Proportional Integral Derivative) controller [5]. The controller works in a loop which continuously calculates an error value e(t) as the difference between the vehicle's feedback and the command signal. Afterwards, a correction is calculated and applied.

2) AI Approach: The AI approach, in comparison to the previous one, does not calculate the precise steering angle using mathematical equations, but instead relies on an intelligent agent which chooses the best course of action. Such an agent can be trained with deep learning on large datasets of driving data, with the goal of recognizing road features and predicting the direction in which the vehicle should travel in order to follow the road. With the rise of deep learning and the recent accomplishments of companies like Google, Tesla and Uber in the field of machine learning, new possibilities arise. The combination of artificial neural networks being universal approximators [6], breakthroughs in research on CNNs and improvements in GPU performance makes the creation of an NN-based controller that can drive a car seem like a reachable goal.

D. Machine Learning

Mitchell (1997) provided one of the best-known definitions of machine learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." Consider an image classification problem with a dataset of both labeled and unlabeled images. The labels of the unlabeled images (the test set) are predicted through the task T using the set of labeled images (the training set), which constitutes the experience E. How well the labels are predicted from the features extracted from the labeled images is measured by the performance P. This problem is very hard to solve with a conventional computer program, given the huge amount of data to be processed and the complexity of the classification task.

E. Paradigms of Learning

Based on the way of learning, ML algorithms are divided into three types: supervised, unsupervised and reinforcement learning [14].

F. Deep Learning

Deep learning is based on observations of how the brain processes information, where it is believed that each level in the brain learns features at an increased level of abstraction [7], as shown in figure 1 below. The goal of deep learning in a machine learning context is to mimic this behavior in the form of a computer architecture.

Fig. 1. Image representing a convolutional neural network; each layer introduces a higher level of abstraction

G. End-To-End Deep Learning

The first big breakthrough in the use of end-to-end deep learning for self-driving cars dates to 1989 and the accomplishments of Pomerleau [8], who built the Autonomous Land Vehicle in a Neural Network (ALVINN). He used a simple feed-forward neural network with a single hidden layer of 29 units. In 2004, the Defense Advanced Research Projects Agency (DARPA) came up with a project known as the DARPA Autonomous Vehicle (DAVE) [9], in which a sub-scale radio-controlled (RC) car drove through a junk-filled alley. The training data for DAVE consists of human driving experience in different environments, with two cameras used to record the left and right steering commands given by the driver. DAVE served as groundwork for the most recent achievement of NVIDIA in the field of self-driving cars, DAVE-2. In the paper End to End Learning for Self-Driving Cars [10], NVIDIA published a state-of-the-art network architecture that benefits from modern convolutions and the processing power of present-day GPUs. Their car prototype was able to drive on highways and in simple traffic on local roads. This work is distinctive in the years of progress that allow us to apply more data and computational power to the task. Also, our experience with CNNs lets us make use of this powerful technology.

H. The Nvidia Model

The NVIDIA model in figure 2 describes the architecture of a CNN proposed by researchers at NVIDIA in the paper End to End Learning for Self-Driving Cars. The authors of the paper propose that this architecture demonstrates the advantages of end-to-end learning by optimizing all processing steps of the network simultaneously, while producing smaller systems through self-optimization of the system's components, rather

than having criteria selected by humans, which does not always guarantee the best performance of the system [11]. The architecture consists of 9 layers in total: one normalization layer followed by five convolutional layers and three fully connected layers.

Fig. 2. The Nvidia model architecture

I. Simulation

The simulator is a modeled environment used to train and test the CNN. It takes pre-recorded videos from a forward-facing on-board camera during human driving and generates images that approximate what would appear if the CNN were, instead, steering the vehicle. These test videos are time-synchronized with the recorded steering commands generated by the human driver.

III. METHODOLOGY

The task is to predict a self-driving car's steering wheel actions based on the input of a camera placed in front of the car. Our deep convolutional neural network performs this mapping from image to angle. An example image can be found in Figure 3, and an illustration of the steering wheel angle can be found in Figure 4.

Fig. 3. Description of the system's driving mode

The system has two modes of operation, a data gathering mode and a self-driving mode. The data gathering mode is active while the car is being driven manually. The camera saves its images together with the sensor measurements so that the neural network can later be trained on this data. When our system is driving the car, the camera feeds its images directly into the neural network, which then predicts the control signals.

Fig. 4. Illustration of the steering wheel angle

A. Data Collection

The data collection process begins by modifying the environment to our liking. The simulated cameras provide images in front of the vehicle as it drives on a variety of tracks that have sharp and smooth turns, road bumps, trees and traffic lights. A data collection script running in the background records the driver's steering, throttle and brake values; Table I shows a sample of the collected data.

TABLE I
SAMPLE OF COLLECTED DATA

Steering angle    Throttle    Brake
-0.1583           0.0747      0.0514
-0.1604           0.0802      0.0525
-0.0784           0.0862      0.0545

Each image from the simulation is labeled with the corresponding steering, throttle and brake values. Before the images from the camera are given as input to the network, they undergo several processing procedures (a code sketch of these steps is given below, after Section III-B):
• Selection procedure: data with the desired properties is selected and divided into a training set and a validation set; the training set consists of 80% of the data, while the remaining 20% goes to the validation set.
• Color scheme alteration procedure: it is possible to change the color scheme of the input images.
• The images are then cropped to remove the car front, re-sized to 160x320 and converted from RGB to YUV, because YUV takes human perception into account and allows reduced bandwidth for the chrominance components.
• The images are then augmented by adding artificial shifts and rotations to teach the network how to recover from a poor position or orientation. The magnitude of these perturbations is chosen randomly from a zero-mean normal distribution.
• The last step is to group the images and their associated steering angle, throttle and brake values into batches so that they can be fed to the neural network as NumPy [12] arrays.

B. Approach

For the obtained model to be able to clone the behavior of humans and drive the car in different environments and conditions without making mistakes, we used supervised learning as the approach for training the model.

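To make the preprocessing and augmentation steps listed in Section III-A concrete, a minimal sketch using OpenCV and NumPy is given below. The exact crop region, perturbation magnitudes and steering correction used by the authors are not stated in full, so the values and helper names here (CROP_TOP, shift_std, batch_generator, etc.) are assumptions for illustration only.

import cv2
import numpy as np

CROP_TOP, CROP_BOTTOM = 60, 25   # assumed rows removed (sky / car hood)

def preprocess(img_rgb):
    # Crop the car front, resize to 320x160 and convert RGB -> YUV,
    # as described in Section III-A.
    img = img_rgb[CROP_TOP:-CROP_BOTTOM, :, :]
    img = cv2.resize(img, (320, 160))            # (width, height)
    return cv2.cvtColor(img, cv2.COLOR_RGB2YUV)

def augment(img, steering, shift_std=2.0, angle_std=1.0):
    # Random shift and rotation drawn from zero-mean normal distributions,
    # simulating recovery from a poor position or orientation.
    h, w = img.shape[:2]
    dx = np.random.normal(0.0, shift_std)
    rot = np.random.normal(0.0, angle_std)
    m_shift = np.float32([[1, 0, dx], [0, 1, 0]])
    m_rot = cv2.getRotationMatrix2D((w / 2, h / 2), rot, 1.0)
    img = cv2.warpAffine(img, m_shift, (w, h))
    img = cv2.warpAffine(img, m_rot, (w, h))
    # A corresponding steering correction would be applied here (assumed).
    return img, steering

def batch_generator(images, labels, batch_size=600):
    # Group samples into NumPy arrays so they can be fed to the network.
    while True:
        idx = np.random.choice(len(images), batch_size)
        x = np.asarray([preprocess(images[i]) for i in idx], dtype=np.float32)
        y = np.asarray([labels[i] for i in idx], dtype=np.float32)
        yield x, y

In the actual pipeline the generator would also respect the 80/20 split between training and validation data described above.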
C. Neural Networks

Two neural networks are used in this project: a regression network that predicts the steering output, shown in figure 5, and a classifier for traffic light detection.

D. Architecture

The two NN architectures used in our experiments are feed-forward neural networks. Since we are working with image inputs, we use networks with convolutional layers as the first layers. The input layer takes images of size 160x320x3, where 160x320 are the height and width of the images and 3 is the number of channels (red, green, blue). It is followed by 5 convolutional layers with 24, 36, 48, 64 and 64 filters; the first three have filter sizes of 5x5 with strides of 2x2, and the final two have filter sizes of 3x3 with no strides. Increasing the number of filters at each layer helps with extracting details and features from the image. All the convolutional layers use the elu [15] activation function, because it tends to drive the cost towards zero faster than relu [15] and produces more accurate results. A dropout layer is added after the convolutional layers to prevent overfitting; as the name suggests, it randomly chooses a predefined percentage of neurons (50% in our case) and drops them out by disabling them. A flatten layer then converts the multidimensional output of the convolutional layers into a one-dimensional output. The last part of the network is 5 dense layers with 100, 100, 50, 10 and 3 neurons respectively, all with elu activations except the output layer, which has a linear activation because this is a regression problem. All the above layers are saved as a model of type Sequential [16] (a code sketch of this architecture is given below, after the software overview).

Fig. 5. Model architecture

The images are fed to the model as input data, and the steering angle, throttle and brake values as labels, in the form of NumPy arrays. The hyper-parameters (number of layers, number of neurons, steps per epoch, number of epochs, learning rate and batch size) were chosen empirically; after a number of trials the optimal values found were: learning rate = 0.0001, batch size = 500-600, number of epochs = 20, steps per epoch = 1000.

E. The Traffic Light Classifier Model Architecture

For this neural network a pre-trained model was used, the VGG16 model shown in figure 6. It was developed in 2014 and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The model is trained on the ImageNet dataset, a large dataset consisting of 1000 classes of objects, and it reaches an accuracy of 90.0%.

Fig. 6. VGG16 model architecture

In our application we only want to detect three classes, the three states of the traffic light (red, yellow and green). To do that, we take the pre-trained model and replace the last layer, which has 1000 neurons, with a dense layer that has 3 neurons and a softmax activation, because we only want the correct class neuron to fire each time a traffic light is detected (see the sketch below).

F. Software

This section describes the important software dependencies that are used in this project.

1) Unity: Unity is a cross-platform game engine commonly used for developing video games for platforms such as computers (Windows, Linux and Mac), mobile phones and consoles (Xbox, PlayStation, etc.). It was used to build the simulator of the car and its environment. The two main reasons for choosing Unity were previous experience as well as the fact that the project we used as a base for the simulator was built in Unity.

2) TensorFlow: An open-source framework created for mathematical operations on multidimensional data arrays (tensors). Because of the flexible architecture of the framework, different hardware configurations can be run using the same API.

3) Keras: A high-level neural networks API that runs on top of TensorFlow.

4) Socket.IO: The existing communication provided in the Udacity project was built with the high-level framework Socket.IO (SIO). SIO was developed mainly for use in web applications, and the choice to use it in the Udacity project was probably motivated by its simple integration in high-level applications such as Unity games and Python applications.
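For reference, the regression network described in Section III-D can be assembled in Keras (on top of TensorFlow, as stated in Section III-F) roughly as follows. This is a sketch based only on the layer description given above; the form of the normalization layer is not specified in the text, so the Lambda scaling is an assumption, and the authors' actual code may differ in detail.

from keras.models import Sequential
from keras.layers import Lambda, Conv2D, Dropout, Flatten, Dense
from keras.optimizers import Adam

def build_steering_model():
    model = Sequential()
    # Assumed normalization of pixel values to [-1, 1]; the paper only
    # mentions a normalization layer without giving its exact form.
    model.add(Lambda(lambda x: x / 127.5 - 1.0, input_shape=(160, 320, 3)))
    # Five convolutional layers: 24, 36, 48 filters (5x5, stride 2x2),
    # then 64 and 64 filters (3x3, stride 1), all with elu activations.
    model.add(Conv2D(24, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(36, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(48, (5, 5), strides=(2, 2), activation='elu'))
    model.add(Conv2D(64, (3, 3), activation='elu'))
    model.add(Conv2D(64, (3, 3), activation='elu'))
    model.add(Dropout(0.5))          # 50% dropout to reduce overfitting
    model.add(Flatten())
    # Dense head: 100, 100, 50, 10 with elu, then a 3-unit linear output
    # (steering, throttle, brake) since this is a regression problem.
    model.add(Dense(100, activation='elu'))
    model.add(Dense(100, activation='elu'))
    model.add(Dense(50, activation='elu'))
    model.add(Dense(10, activation='elu'))
    model.add(Dense(3))
    model.compile(optimizer=Adam(lr=1e-4), loss='mse')
    return model

The mean squared error loss and the Adam optimizer with a learning rate of 0.0001 correspond to the training setup reported in Section III-G.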

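Similarly, the traffic light classifier of Section III-E, a pre-trained VGG16 whose 1000-class output is replaced by a 3-class softmax, might be set up as in the following sketch. The 224x224 input size, the frozen convolutional base and the Flatten-based head are assumptions; the paper only states that the final 1000-neuron layer is replaced by a 3-neuron softmax layer.

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

def build_traffic_light_classifier():
    # Load VGG16 pre-trained on ImageNet without its 1000-class top.
    base = VGG16(weights='imagenet', include_top=False,
                 input_shape=(224, 224, 3))
    # Freeze the pre-trained convolutional layers (assumed); only the new
    # head is trained on the traffic light images.
    for layer in base.layers:
        layer.trainable = False
    x = Flatten()(base.output)
    # New output layer: 3 neurons with softmax for red, yellow and green.
    out = Dense(3, activation='softmax')(x)
    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model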
G. Model Training

After the data has been collected and processed, the networks are trained on the training set. A random batch of 600 consecutive samples is loaded from the CSV file; samples are drawn randomly until the 600 samples have been used, and different random batches are drawn until the whole training set has been exhausted. After every such epoch, the network is evaluated on the validation set. We used Google Colab, a cloud host that lets users train their models in a GPU-powered environment with Keras and TensorFlow preinstalled, for the training process. Before starting the training in Google Colab, we upload the data to a Google Drive account and link it to Colab. Colab is then used to start the training process. The training starts with the input batches supplied through a batch generator and with the model having random weights; the inputs are passed through the layers and the output is compared with the given labels. After calculating the mean squared error of the output, the weights are adjusted through backpropagation, with the adjustment done by the Adam optimizer [13] (learning rate = 0.0001). After each pass through all the inputs in the first epoch, a checkpoint function records the validation loss, and if the next epoch gives a lower validation loss, a copy of the model is saved as an .h5 file with the associated weights. After all epochs are done, the weights are saved in an .h5 file so they can be used later for testing and running the self-driving car.

H. Model Testing and Implementation

This process consists of two steps: predicting the outputs and sending the control values to the car in the simulator. The client side is set up using Anaconda, a tool for running Python code; it is used because it can install all the dependencies and libraries and create an environment where the code can operate easily.

After activating the environment, the simulator is started and a scene is chosen to begin autonomous driving by the model.

After a successful connection, the images from the simulator are received using the socketio functions and are fed to the trained model, which sends back the predicted control values. This process can be seen in figure 7 below, and a screenshot of the actual driving process can be seen in figure 8.

Fig. 7. Connection between the model and the simulator

Fig. 8. Screenshot of the actual driving process

IV. RESULTS

A. Key Results

• The model was able to clone the behavior of the human driver by mapping the input images recorded by the front camera into corresponding actions with an accuracy of 86%; figure 9 below shows the model accuracy over the epochs.

Fig. 9. Model accuracy over the epochs

• The model was able to detect useful road features on its own, i.e., with only the steering angle, throttle and brake amount applied by the human driver as labels.
• The image augmentation technique was useful for training the recovery driving scenarios.
• The helper traffic light model was able to detect and differentiate between the different states of the traffic light signal because of the strong ability of VGG16 to detect the features of a given input image.
• The VGG16 model was trained on the ImageNet dataset, which is a huge set with about 1000 classes. It learned to detect different textures, patterns, shapes and many other image features, which enabled it to detect traffic lights and their different states.

B. Hardware Limitations

The use of deep learning has increased exponentially during the last few years, thanks to developments in both hardware and algorithms. On the hardware side, without graphics processing unit (GPU) processing the training would take an enormous amount of time. This means that the process of developing, testing and running the DNN was very GPU dependent.

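For completeness, the training procedure described in Section III-G (batches supplied through a generator, mean squared error loss, the Adam optimizer with a learning rate of 0.0001, checkpointing to an .h5 file whenever the validation loss improves, 20 epochs and 1000 steps per epoch) can be sketched as follows. The build_steering_model and batch_generator helpers are the sketches shown earlier, and the variable and file names are assumptions, not the authors' actual scripts.

from keras.callbacks import ModelCheckpoint

# build_steering_model() and batch_generator() are the sketches shown earlier;
# the training and validation data are assumed to be loaded and split 80/20.
model = build_steering_model()
checkpoint = ModelCheckpoint('model-{epoch:02d}.h5',
                             monitor='val_loss',
                             save_best_only=True,
                             verbose=1)
model.fit_generator(batch_generator(train_images, train_labels, batch_size=600),
                    steps_per_epoch=1000,
                    epochs=20,
                    validation_data=batch_generator(valid_images, valid_labels,
                                                    batch_size=600),
                    validation_steps=len(valid_images) // 600,
                    callbacks=[checkpoint])
model.save('model.h5')   # final weights used later to drive the car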
C. Software Dependencies

The goal of the project was to implement neural nets rather than developing everything from scratch. To achieve this, we examined existing solutions and tried to use as many frameworks and libraries as we could. Though this helped us a lot and heavily reduced the development and implementation time, the approach came with another issue: many of the libraries, frameworks and solutions required a lot of dependencies. Many of these dependencies could be installed easily, but some of them had to be compiled manually, which in some cases took a lot of time or did not work at all. Unfortunately, we had to spend a significant amount of time managing software dependencies during the course of this project.

D. Simulator

During the initial phase of the project we decided to work with a simulator so as not to be dependent on hardware, and this was probably the most important decision we took; without it, the project would probably not have been finished, since implementing the CNN in a physical car would have required enormous work and resources just to build the car.

V. CONCLUSION AND FUTURE WORK

A. Conclusion

This work supports the notion that autonomous vehicles can operate using convolutional neural networks trained with simulation data. A neural network was implemented and trained using data taken from the simulator made by Udacity, and the performance of the neural network was then tested using test data also taken from the simulator; the system learned to drive the car autonomously. This work empirically demonstrated that CNNs are able to learn the entire task of lane and road following without manual decomposition into road or lane marking detection, semantic abstraction, path planning and control. The CNN is able to learn meaningful road features from very sparse training signals. This project investigated how a combination of traditional computer vision techniques and modern deep neural networks (DNNs) can minimize unsafe behavior in autonomous vehicles. Much of the development time was spent on setting up the environment and everything needed to conduct the investigation. The results indicate that the combination of computer vision and deep learning is one of the most accurate and safe approaches for autonomous driving; therefore we believe that future projects will benefit greatly from introducing a system that combines and utilizes the strengths of traditional computer-vision-based techniques with the computational potential of DNNs.

B. Future Research

The changes that we propose in this section would allow us to either test our system more extensively or improve its performance.
• To enhance the capabilities of our CNN, we suggest collecting more training data on many tracks with different lighting conditions. Certainly, this will improve the car's navigation accuracy. Moreover, additional functionalities can be added to the model, such as avoiding encountered obstacles on the track.
• Implementing the CNN on a physical car could have a great impact on the accuracy of the model. Training the model on a simulator alone is not enough; though the simulator is built to be as realistic as possible, it may miss some of the conditions and specifications of the real world.
• Future research can be done towards a framework in which better metrics can be used that give an indication of a system's driving behavior. The simulator offers opportunities for this, without the complications that are often encountered in the real world.

REFERENCES

[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, Winter 1989.
[2] Large Scale Visual Recognition Challenge (ILSVRC). URL: http://www.image-net.org/challenges/LSVRC/.
[3] Himmelsbach, M.; Mueller, A.; et al. LIDAR-based 3D object perception. In Proceedings of the 1st International Workshop on Cognition for Technical Systems, volume 1, 2008.
[4] Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: part I. IEEE Robotics & Automation Magazine, volume 13, no. 2, 2006: pp. 99-110, doi:10.1109/MRA.2006.1638022.
[5] Levine, W. S. The Control Handbook (Three Volume Set) (Electrical Engineering Handbook). Boca Raton, FL, USA: CRC Press, Inc., second edition, 2010, ISBN 1420073664, 9781420073669.
[6] Hornik, K.; Stinchcombe, M.; et al. Multilayer feedforward networks are universal approximators. Neural Networks, volume 2, no. 5, July 1989: pp. 359-366, ISSN 0893-6080, doi:10.1016/0893-6080(89)90020-8. Available from: http://dx.doi.org/10.1016/0893-6080(89)90020-8
[7] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2012.
[8] Dean A. Pomerleau. ALVINN, an autonomous land vehicle in a neural network. Technical report, Carnegie Mellon University, 1989. URL: http://repository.cmu.edu/cgi/viewcontent.cgi?article=2874&context=compsci.
[9] Net-Scale Technologies, Inc. Autonomous off-road vehicle control using end-to-end learning, July 2004. Final technical report. URL: http://net-scale.com/doc/net-scale-dave-report.pdf.
[10] Bojarski, M.; Del Testa, D.; et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[11] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
[12] NumPy. (n.d.). Retrieved from http://www.numpy.org/
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[14] Lison, Pierre. "An introduction to machine learning." (2015).
[15] Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter. "Fast and accurate deep network learning by exponential linear units (ELUs)." arXiv preprint arXiv:1511.07289 (2015).
[16] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
[17] Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6), 141-142.
