
DEGREE PROJECT IN TECHNOLOGY,

FIRST CYCLE, 15 CREDITS


STOCKHOLM, SWEDEN 2019

A comparison of genetic algorithm and reinforcement learning for autonomous driving

KTH Bachelor Thesis Report

Ziyi Xiang

KTH ROYAL INSTITUTE OF TECHNOLOGY


ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
DEGREE PROJECT IN TECHNOLOGY,
FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

A comparison between genetic algorithm and reinforcement learning for self-driving cars

Ziyi Xiang

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract

This paper compares two methods, reinforcement learning and a genetic algorithm, for designing an autonomous car's control system in a dynamic environment.

The research problem can be formulated as follows: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation through a dynamic environment?

In conclusion, the genetic algorithm outperforms reinforcement learning in mean learning time, despite the fact that the former shows a large variance; that is, the genetic algorithm provides better learning efficiency.

Keywords

Thesis, Machine learning, Genetic algorithm, Deep reinforcement learning, Autonomous driving

Abstract

This paper compares two methods, reinforcement learning and a genetic algorithm, for designing an autonomous car's control system in a dynamic environment.

The research problem can be formulated as follows: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation in a dynamic environment?

In conclusion, the genetic algorithm outperforms reinforcement learning in mean learning time, despite the fact that the former shows a large variance; that is, the genetic algorithm provides better learning efficiency.

Keywords

Thesis, Machine learning, Genetic algorithm, Deep reinforcement learning, Autonomous driving
Acknowledgements

I would like to offer my special thanks to my supervisor Jana Tumová as well as the examiner Örjan Ekeberg. The assistance provided by Jana was greatly appreciated. I would also like to extend my thanks to Shuai Wu, Oskar Nehlin and Wen Yin for their great advice.

Authors

Ziyi Xiang <zxiang@kth.com>


Information and Communication Technology
KTH Royal Institute of Technology

Place for Project

Stockholm, Sweden

Examiner

Örjan Ekeberg
KTH Royal Institute of Technology

Supervisor

Jana Tumová
KTH Royal Institute of Technology
Contents

1 Introduction
  1.1 Problem statement
  1.2 Delimitations

2 Theoretical Background
  2.1 Reinforcement learning
  2.2 Genetic algorithm

3 Methods
  3.1 Measurement evaluation
  3.2 The simulation
  3.3 The agent's control system
  3.4 The reinforcement learning's car controller
  3.5 The genetic algorithm's car controller

4 Result
  4.1 Reinforcement learning test
  4.2 Genetic algorithm test
  4.3 Comparison

5 Conclusion

6 Discussion
  6.1 Safety
  6.2 Implementation difficulty and time cost
  6.3 Further study

References

1 Introduction

Autonomous driving is widely debated among experts. It has good potential for better safety than human drivers, since the cars cannot get distracted and always obey traffic rules.

Autonomous cars are already being developed by many companies such as Volvo, Tesla, and Google. [6, 14] The cars use sensors such as radar or cameras to observe the environment. The movement control system predicts the environment based on the sensors' observations and makes the movement control decisions. These solutions can still be optimized, since full automation has not yet been achieved; most of them are at a partially automated level. [2]

The level classification system is based on driver intervention and attentiveness and was published by the Society of Automotive Engineers (SAE International) in 2014. [13] According to their definition, a partially automated car system has some control, such as speed and steering, but the driver must always be prepared to take control when needed, while a fully automated car can operate in all conditions without any driver input.

Autonomous cars are closely associated with machine learning and artificial intelligence. The machines in the industry need to be smarter in order to accomplish tasks of increasing difficulty. Machine learning is considered a technology that helps robots learn and interact with the environment.

Deep reinforcement learning is widely used and is a very efficient way to design AI behaviors. [5] On the other hand, the genetic algorithm has also been proven to be a successful technique for the optimization of automatic robot and vehicle guidance. [4] Both methods require a large sample data set to learn, which can lead to an expensive cost for producing sample data. [4, 5]

The purpose of this study is to provide safe driving behaviour using fewer learning samples.

1.1 Problem statement

This paper compares two methods, reinforcement learning and a genetic algorithm, for designing an autonomous car's control system in a dynamic environment. The research problem can be formulated as follows: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation through a dynamic environment? A dynamic environment means that the environment is a real-time simulation. The agents (a term describing the robots used in machine learning) continuously make decisions in this environment based on observations. The learning efficiency is defined by how many samples are used in order to obtain an acceptable result.

1.2 Delimitations

Due to the difficulty of simulating multiple cars simultaneously, the simulation is confined to a small racing circuit surrounded by walls. The simulation is a simplified version in the horizontal plane. The cars only have steering and acceleration control. Many factors such as pedestrians, traffic rules, weather, and mass are not taken into account. Therefore, the results can differ from reality.

2 Theoretical Background

This study focuses on reinforcement learning and genetic algorithms for designing a car control system. Both methods are suited to decision-making and optimization problems. Action prediction is a decision-making problem based on observations. An optimized solution to such a problem can be obtained by an optimization algorithm. [4, 5]

In this section, the paper introduces some basic theoretical concepts and research findings for both algorithms.

2.1 Reinforcement learning

2.1.1 Markov decision process

The Markov decision process (MDP) is a tuple [1]:

(S, A, P(s, a, s'), γ, R)

S denotes a set of states:

s ∈ S

The state is defined by the agent itself and the environment surrounding the agent, such as velocity, position, mass or distance to obstacles. In a real-life scenario, the environment can be observed by cameras and sensors; the data obtained from these observations in turn represents the state of the vehicle.

A denotes a set of actions:

a ∈ A

The agent can take different actions in a given state, such as accelerate and rotate. The action set A is defined by the available choices the agent can make in a particular state.

A transition function:

P(s, a, s')

The transition function gives the probability of landing in state s' when the agent takes action a from the current state s.

A discount factor γ indicates how much future rewards the agent should care about compared to current rewards.

A reward function:

R(s, a, s')

The reward function represents the reward the agent receives when it takes action a in state s and lands in the new state s'.

The agent constantly observes the environment and acts based on these observations. After the agent performs an action, the reward associated with the observation of what just happened is calculated. The policy (a strategy or guideline that tells the agent which action to take in each state) is then updated based on what the agent has learned, so that future decision-making is influenced by previous attempts. This creates a feedback cycle, and the process repeats until the agent finds a policy that maximizes its reward.

MDP describes the agent's decision-making process for optimal action selection [1]. The main objective is to find an optimized policy that maximizes future reward. A policy contains guidelines for the agent on which action to take in each and every state, whereas the optimized policy enables the agent to maximize the reward in every state by choosing the best possible action.

The mathematical formulation of this objective is defined as follows, where π denotes the policy [1]:

\max_{\pi} \; \mathbb{E}\Big[ \sum_{t=0}^{H} \gamma^{t} R(s_t, a_t, s_{t+1}) \,\Big|\, \pi \Big]

H denotes the horizon, which indicates the length of the finite sequence of states. The policy π is a prescription that maps each state to a corresponding action. The discount factor γ indicates how much future rewards the agent should care about compared to current rewards.

Dynamic programming and policy optimization are two key approaches to solving MDP problems. [1, 10]
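
Before turning to these methods, here is a small numerical sketch of the objective above: it computes the discounted sum of rewards for a made-up reward sequence, where both γ = 0.9 and the reward values are purely illustrative and not taken from the thesis.

# Illustrative only: the discounted return sum_{t=0}^{H} gamma^t * r_t
# for a made-up reward sequence; gamma = 0.9 is an arbitrary example value.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Horizon H = 2 here: 1.0 + 0.9 * 1.0 + 0.81 * (-5.0) = -2.15
print(discounted_return([1.0, 1.0, -5.0]))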

2.1.2 Dynamic programming

In dynamic programming, the program finds the policy that maximizes the expected reward based on past experiences. [1]

A well-known method for solving MDPs using the concept of dynamic programming is Q-learning. Q-learning learns the policy by trial and error from stored data. It uses past experiences to calculate the expected reward for each action in a given state and iteratively updates these estimates as new rewards are observed.

If a problem is defined with n possible actions and a total of m finite states, we can create an m-by-n matrix and fill in, for each state-action pair, the maximum reward obtainable from that point.
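
As a minimal sketch of this idea (not the thesis implementation), the following tabular Q-learning update maintains such an m-by-n table and moves each entry toward the observed reward plus the best estimated future reward; the problem size, learning rate and discount factor are assumed example values.

import numpy as np

# Minimal tabular Q-learning sketch; sizes and constants are illustrative
# assumptions, not values used in the thesis.
m_states, n_actions = 10, 3
Q = np.zeros((m_states, n_actions))      # the m-by-n table described above
alpha, gamma = 0.1, 0.9                  # example learning rate and discount

def q_update(state, action, reward, next_state):
    # Move Q(s, a) toward the observed reward plus the best future estimate.
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])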

The problem with dynamic programming is the huge time and space complexity when the problem size scales up. If the states and actions are numerous or infinite, there is not enough memory to store all the data and the calculation becomes slow. [1]

2.1.3 Policy optimization

In subsection 2.1.2, we introduced some concepts of reinforcement learning using dynamic programming. Dynamic programming has complexity problems; as a result, the maximum reward cannot be calculated efficiently. An alternative is to use a policy optimization method.

Policy optimization is a method where the agent directly learns the policy function without calculating the reward of each state. [10] The algorithm acts with the current policy and improves it through learning. The policy is not forced to choose the action that gives the highest reward; instead, it uses a probability function to randomly select actions in order to discover better solutions.

The objective of policy optimization is to find a policy function π with a parameter vector θ that maximizes the total reward. θ is a parameter, or weight, vector for the policy π. Gradient descent (an optimization method that iteratively adjusts its input to optimize the value of a function) can be used to update the policy by changing θ.

By using policy optimization, the algorithm increases the probability of taking actions that give higher rewards and decreases the probability of actions that performed worse than the latest experience. It evaluates the performance of the policy and uses it to influence the next iteration.

The optimization method slightly changes the θ of the policy based on the latest performance. Unlike Q-learning, it does not store offline data in memory; it learns directly from what the agent is doing. Once the policy is updated, the old experience is discarded.
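
A hedged sketch of what such an update can look like for a simple softmax policy with a linear score function is shown below; it only illustrates the general idea of nudging θ toward actions that yielded a higher return G, and it is not the ML-Agents implementation used later in the thesis.

import numpy as np

# REINFORCE-style sketch: theta parameterizes a softmax policy over discrete
# actions; G is the observed return for the sampled action. Illustrative only.
def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def policy_gradient_step(theta, state_features, action, G, lr=0.01):
    probs = softmax(theta @ state_features)        # pi(a | s) for all actions
    grad_log = -np.outer(probs, state_features)    # d log pi(action|s) / d theta
    grad_log[action] += state_features
    return theta + lr * G * grad_log               # step toward higher return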

2.1.4 Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a method based on the idea of policy optimization. [7] It has been shown to solve a wider range of problems than Q-learning, especially complex ones. It is the default reinforcement learning method used by OpenAI [7] and the Unity Machine Learning Agents Toolkit (ML-Agents). [8]

In policy optimization, the sampling is not efficient, because the data is used for a single policy update and then discarded. Moreover, the result is not stable, due to large changes in the distribution of observations and in the policy. If the program takes a step too far from the previous policy, it changes the entire distribution of behaviour in the environment; recovering the old policy becomes difficult, and the policy can end up in a bad position. Therefore, a more stable algorithm is required.

PPO limits the policy update by defining a maximum distance, which is called a region. The algorithm optimizes a local approximation and finds the optimal point within this bounded region; as a result, the updated policy can no longer move too far away from the old policy.

2.1.5 Neural network and deep reinforcement learning

Deep reinforcement learning is based on neural networks. A neural network is a matrix-based network system inspired by biological neural networks and animal brains. [3]

In deep reinforcement learning, the algorithm represents the policy in the form of a neural network. The basic idea is to identify the correct action by utilizing a neural network that maps states into actions.

Let us introduce a simple neural network. This neural network is a 2-dimensional structure consisting of multiple arrays. Each neuron inside a so-called layer holds a number, and the layer itself is an array of numbers.

Figure 2.1: A simple neural network with 2 hidden layers of size 3

The input layer is a set of numbers representing the state, or observations. The information passes through the network and is mapped into the actions, which are returned by the output layer.

For each pair of connected neurons i and j, there is a weight value between the two neurons. With each update, the neurons in the left layer update their connected neurons by adding their values multiplied by the connection weights. [3]

Once the neurons in the input layer receive the inputs, they update the connected neurons on the right by multiplying their own value with the connection weight and adding the result to the target neuron. This iteration repeats until all input values have passed through the network. A mathematical function such as the sigmoid function [3] is applied to limit the values between 0 and 1. This is used in models where probabilities in the range 0 to 1 have to be predicted.

Sigmoid function:

g(z) = \frac{1}{1 + e^{-z}}

The layers between the input layer and the output layer are called hidden layers. [3] The hidden layers are mainly used to increase the complexity of the network, which gives the neural network the capacity to create solutions with more advanced mappings. This is especially useful for solving large and complex problems, but increased complexity can also lead to increased difficulty in learning.

There are many variants of the neural network, but the basic concepts are similar. After observing the result, the network adjusts the weight values along the path using backpropagation. The details of backpropagation are not needed for understanding the following chapters and will therefore not be discussed here. For more detail and further explanation, see chapter 4 of the book "Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning" by James Stone. [11]

After training, the network should be able to find the pattern in the inputs and map them to the actions.

2.2 Genetic algorithm

Genetic algorithms are methods for solving optimization problems, inspired by Darwinian evolution and the survival of the fittest. They are based on natural selection and are widely used to design AI behaviors and machine learning applications. [9]

Genetic algorithms start from a set of randomly generated solutions to the given problem. [9] By observing the solutions' performance, the algorithm selects the most successful solutions for the reproduction of new solutions. The algorithm repeatedly creates multiple sample solutions and observes their performance. After evaluating multiple iterations, the solutions evolve toward the optimum.

The solutions are called chromosomes and are represented in the form of an array or a matrix. [9] An iteration in a genetic algorithm is called a generation, and the most successful agents in each generation are called parents. Each agent has its own individual solution in the form of an array or matrix, which is a mutated version of the reproduced solution inherited from its parents.

In each generation, the algorithm selects the two parents that performed best in the previous simulation and uses their chromosomes to create children. The parents' chromosomes are crossed over to create a new chromosome by combining their solutions. The program takes the child solution and replicates it to create multiple solutions, which are used in the next generation.

By randomly changing some values in the child solutions' chromosomes, the children mutate and become distinct from one another. The mutation makes the children diverse, which helps the algorithm explore the search space to find a better optimum. In a genetic algorithm, the simulation can define the number of children simulated in each generation, and the chromosomes in the first generation can be randomly generated. The mutation probability decreases as the results get better.
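
A minimal sketch of one such generation is given below; the fitness function, population size and mutation probability are illustrative assumptions rather than the configuration used in the thesis.

import random

# One genetic-algorithm generation: select the two fittest parents, cross over
# their chromosomes gene by gene, then mutate. 'evaluate' is a hypothetical
# fitness function (e.g. how long a car survived on the track).
def next_generation(population, evaluate, n_children=10, p_mutation=0.05):
    ranked = sorted(population, key=evaluate, reverse=True)
    father, mother = ranked[0], ranked[1]
    children = []
    for _ in range(n_children):
        child = [f if random.random() < 0.5 else m      # crossover
                 for f, m in zip(father, mother)]
        child = [random.uniform(0, 1) if random.random() < p_mutation else gene
                 for gene in child]                     # mutation
        children.append(child)
    return children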

Figure 2.2: Genetic algorithm illustration

2.2.1 Genetic algorithm using neural network

A neural network is an efficient way to represent the mapping between the state and the action. In deep reinforcement learning, the algorithm uses the neural network to represent the policy; by utilizing the network, it maps the states into the actions.

As mentioned in section 2.1.5, there is a weight value between each pair of connected neurons. These weight values can represent the chromosomes in a genetic algorithm. [9] In each generation, the program selects the parents, then crosses over and mutates their weight values to create their children. The children inherit their parents' weights and evolve toward the optimum after evaluating multiple generations. The weights in the first generation can be randomly generated.
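
One way to picture this, sketched below under the assumption that the chromosome is simply the concatenation of all weight matrices, is to flatten the network's weights into a single vector so that generic crossover and mutation operators can act on it.

import numpy as np

# Sketch: flatten a network's weight matrices into one chromosome vector and
# restore them afterwards; the layer shapes are whatever the network uses.
def to_chromosome(weight_matrices):
    return np.concatenate([w.ravel() for w in weight_matrices])

def from_chromosome(chromosome, shapes):
    weights, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        weights.append(chromosome[offset:offset + size].reshape(shape))
        offset += size
    return weights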

3 Methods

3.1 Measurement evaluation

A successful test is defined as a car in the respective method finishing 100 laps (figure 3.2) on the racing track without any collision. Any contact with the walls destroys the car. (figure 3.1)

The learning efficiency, or learning cost, is defined as the number of cars that are destroyed during a simulation.

Since a perfectly optimized solution is nearly impossible to find, we define an optimized solution as follows: if an agent can finish 100 laps without collision when controlled by the solution, the solution is said to be optimized. An algorithm with better learning efficiency destroys fewer cars before finding the optimized solution. The study therefore measures how many cars are destroyed on average before the solution is optimized, which answers the problem statement.

Figure 3.1: Aerial view of the track surrounded by walls

Figure 3.2: Aerial view of a lap

3.2 The simulation

3.2.1 The simulation environment and library

The simulator is created in Unity 3D, a widely used software package for game development. [8] In Unity, we can import 3D assets such as roads and cars from the asset store instead of making our own, which is time efficient. Unity also has useful plugins such as the Unity Machine Learning Agents Toolkit (ML-Agents) for training intelligent agents. [8] ML-Agents is a library that can be used to build reinforcement learning.

In Unity, we give objects special properties by attaching a C# script. Built-in functions such as movement control and collision detection let the objects interact with the environment.

3.2.2 The motion controller

The motion controller is a C# script attached to each car that allows our agents to take control of the car's position. (figure 3.2)

In our simulation, the car moves in the horizontal plane with a given speed and an angle giving its facing direction. The speed and the facing direction are controlled by our agents.

Here is the pseudocode for the rotation and acceleration control:

UpdateMovement(RotationAngleFromAgent, AccelerationSpeedFromAgent):
    directionAngle <- directionAngle + RotationAngleFromAgent
    speed <- speed + AccelerationSpeedFromAgent
    transform(currentPosition, directionAngle, speed)

The update function is called once per frame; it calculates the car's new rotation angle and current speed based on the neural network's output. The agent controls the motion by sending the update values to the UpdateMovement function.

Transform is a built-in Unity function, which changes the position of an object towards a direction.
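
In plain 2D kinematics, this per-frame update amounts to the sketch below; it replaces the Unity Transform call with an explicit position update, and the frame time dt and the use of degrees are assumptions made for illustration.

import math

# Sketch of the per-frame motion update in plain 2D kinematics; the thesis
# relies on Unity's Transform call instead. dt is an assumed frame time.
def update_movement(x, y, direction_angle, speed,
                    rotation_from_agent, acceleration_from_agent, dt=0.02):
    direction_angle += rotation_from_agent
    speed += acceleration_from_agent
    x += speed * math.cos(math.radians(direction_angle)) * dt
    y += speed * math.sin(math.radians(direction_angle)) * dt
    return x, y, direction_angle, speed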

3.2.3 The sensors

The sensor script allows the agent to observe the current state by continuously scanning the surrounding environment in each frame. The sensors consist of 7 laser beams, with 30-degree angles between them, placed at the front of each car. (figure 3.3)

The laser beams are built into Unity; they can classify the type of object they hit and calculate the distance from the current position to the target position. The multiple laser beams produce an array of distances, which can represent the input layer of the neural network.
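
A sketch of how the seven ray directions can be generated is given below; raycast_distance is a hypothetical stand-in for the engine's ray cast and is assumed to return the distance to the nearest wall along a given angle.

# Sketch: build 7 sensor directions, 30 degrees apart, centred on the car's
# heading; raycast_distance is a hypothetical stand-in for Unity's ray cast.
def read_sensors(heading_deg, raycast_distance, n_rays=7, spacing_deg=30):
    half = (n_rays - 1) / 2
    angles = [heading_deg + (i - half) * spacing_deg for i in range(n_rays)]
    return [raycast_distance(angle) for angle in angles]   # network input array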

Figure 3.3: Sensors

3.3 The agent’s control system

The study illustrates two high-level figures that contain an executive summary of the control system design.

The control system in reinforcement learning has a feedback cycle. (figure 3.4) The feedback cycle requires the states (the laser observations), the actions (the acceleration and rotation), and the reward system. The agent is connected to a neural network using the ML-Agents library, which automatically updates the neural network in each iteration. (refer to the theory in section 2.1.4)

In the genetic algorithm, the program does not need a reward system, but it requires functions that can select the parents and cross over and mutate the chromosomes. (figure 3.5)

The genetic algorithm builds on the most successful agents from the previous attempts. In each so-called generation, a chosen number of cars are created and simultaneously sent along the track. When all cars have collided with the walls, the algorithm picks the last two cars that stayed on the track as the parents.

In our simulation of the genetic algorithm, the program initiates a number of cars with randomly generated weights. In each generation, the program sends 5 or 10 cars along the path (figure 3.6); when all the cars have been destroyed, it starts a new generation with updated neural networks. This process repeats until the program reaches the goal we have set.

Figure 3.4: The high-level figure of implementing the reinforcement learning

Figure 3.5: The high-level figure of implementing the genetic algorithm

Figure 3.6: Simulation of GA with 5 cars per generation

3.4 The reinforcement learning’s car controller

The car controller for the reinforcement learning agent interacts with the neural network and sends the speed and rotation changes to the motion controller.

In each frame update, the car receives an observation, which is an array of floating-point values from the sensors, and sends the observation to the neural network. There is a total of 7 lasers, so the size of the input layer is also 7. After passing through the neural network, two corresponding actions are returned, the acceleration and the rotation, so the output layer has a size of 2. (figure 3.7)

The neural network is created inside ML-Agents, but we are able to configure its size. The size selection is based on the complexity of the problem. (refer to section 2.1.5) A larger neural network creates solutions with a more advanced mapping from the states to the actions, but it also increases the learning cost. A smaller network learns faster, but the mapping from the states to the actions is simpler, which means it can be difficult to solve large problems with multiple inputs.

The simulation chooses a neural network with a size of 2x64, i.e. 2 hidden layers of length 64, which is considered large enough to solve our problem.

The program can use a built-in ML-Agents function to directly access the output layer of the network. The first element of the output array represents the rotation direction, and the second element represents the acceleration. The rotation speed and acceleration speed are two predefined variables, which can be changed.

Here is the pseudocode.

Rotation(outputLayer[0]):
    if outputLayer[0] is 1 then
        // make a right turn
        directionAngle <- directionAngle + rotationSpeed
    if outputLayer[0] is 2 then
        // make a left turn
        directionAngle <- directionAngle - rotationSpeed
    if outputLayer[0] is 0 then
        // do nothing
    return directionAngle

Acceleration(outputLayer[1]):
    if outputLayer[1] is 1 then
        // accelerate
        speed <- speed + accelerationSpeed
    if outputLayer[1] is 2 then
        // decelerate
        speed <- speed - accelerationSpeed
    if outputLayer[1] is 0 then
        // do nothing
    return speed

The controller passes the observed state using a built-in function called AddVectorObs(). AddVectorObs() takes a vector, in our case coming from the sensors, and sends the vector to the neural network.

Figure 3.7: RL neural network

3.4.1 The reward system

The reward system defines how much a policy should be rewarded or punished based on the observed result.

If the car keeps a good distance from the wall, the action is rewarded; if the car drives too close to the wall or gets destroyed, it is punished.

A method called AddReward() is used to build the reward system. In our simulation, the safe distance from the walls is set to a length of 2.5, while the sensor vector is set to a length of 8.0. We create a constant variable r which represents the ratio of the reward gains.

In each frame update, if the distance between the car and the wall is in the range 0 to 2.5, the program adds a penalty of 5 times r for dangerous driving. If the car crashes, a penalty of 100 times r is added; otherwise it gets a reward of 1.

CheckReward(sensors):
    for i <- 0 to sensors.length
        if colliding then
            AddReward(-r * 100)
        else if sensors[i].distanceToWall <= 2.5 then
            AddReward(-r * 5)
        else
            AddReward(1)

The controller and the reward system are the only things that need to be implemented; ML-Agents contains the implementation of reinforcement learning in its library, which handles all the network changes and the iteration updates.

3.5 The genetic algorithm’s car controller

The size of the neural network remains the same as the network in reinforcement learning, which is 2x64.

The interaction with the neural network gives the agent rotation and acceleration control; it is implemented in a similar way to the reinforcement learning controller.

Rotation(outputLayer[0]):
    if outputLayer[0] is in range of 0.21 and 0.6 then
        // make a right turn
        directionAngle <- directionAngle + rotationSpeed
    if outputLayer[0] is in range of 0.61 and 1 then
        // make a left turn
        directionAngle <- directionAngle - rotationSpeed
    if outputLayer[0] is in range of 0 and 0.2 then
        // do nothing
    return directionAngle

Acceleration(outputLayer[1]):
    if outputLayer[1] is in range of 0.21 and 0.6 then
        // accelerate
        speed <- speed + accelerationSpeed
    if outputLayer[1] is in range of 0.61 and 1 then
        // decelerate
        speed <- speed - accelerationSpeed
    if outputLayer[1] is in range of 0 and 0.2 then
        // do nothing
    return speed

The neural network itself looks like a 2-dimensional matrix consisting of multiple arrays of different lengths (figure 3.8). The car controller receives the input array from the sensors and sends it to the neural network. The function passes it through the neural network using the weight matrix and maps the observation to the actions.

The genetic algorithm uses the weights to represent the chromosomes. The crossover and mutation functions are then applied to the matrix of weights. For each connected pair in the neural network there is a weight value, and the neural network has multiple layers, so the weights form a 3-dimensional matrix: i, j and k in weights[i][j][k] stand for the layer index, the position in the current layer, and the position in the previous layer. (figure 3.8)

Here is the pseudocode which describes the mapping from the observations to the
actions.

FeedForward(inputs):
    for i <- 1 to numberOfLayers - 1
        // from the second layer through the network
        for j <- 0 to currentLayerSize
            value <- 0
            for k <- 0 to previousLayerSize
                value += weights[i-1][j][k] * neurons[i-1][k]
                // the connection weight between the two neurons
                // times the value of the connected neuron in the previous layer
            neurons[i][j] <- Sigmoid(value)
    return neurons[numberOfLayers - 1]    // output layer

For each connected neuron in the left layer, the neuron multiplies the weight by its value and sends the result to the connected neurons on the right. Each neuron on the right then sums the results from all the neurons in the left layer and applies a sigmoid function [12] to bound the result between 0 and 1. This process repeats until all updates have passed through the hidden layers and a solution is returned.
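
For reference, a runnable NumPy equivalent of the FeedForward pseudocode is sketched below; it follows the thesis configuration of 7 inputs, two hidden layers of 64 neurons and 2 outputs, but the weights are random placeholders rather than a learned chromosome.

import numpy as np

# Runnable sketch of the forward pass: 7 inputs, two hidden layers of 64
# neurons, 2 outputs; random weights stand in for a learned chromosome.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [7, 64, 64, 2]
weights = [np.random.uniform(0, 1, (layer_sizes[i + 1], layer_sizes[i]))
           for i in range(len(layer_sizes) - 1)]

def feed_forward(inputs, weights):
    neurons = np.asarray(inputs, dtype=float)
    for w in weights:
        # Each neuron sums the weighted values of the previous layer,
        # then the sigmoid bounds the result between 0 and 1.
        neurons = sigmoid(w @ neurons)
    return neurons            # output layer: [rotation, acceleration]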

Figure 3.8: GA neural network

3.5.1 Selection, crossover and mutation functions

In each generation, the genetic algorithm selects the two agents that survive the longest and combines their chromosomes using the crossover function. As a result, the children randomly inherit from both parents.

Crossover(father, mother):
    for i, j, k to size of weights
        rand <- Random(0 or 1)
        if rand is 1 then
            weights[i][j][k] <- father[i][j][k]
        else
            weights[i][j][k] <- mother[i][j][k]
    return weights

The mutation function goes through all elements in the weight matrix and, with a probability value p, replaces each value with a random number.

Mutate(weights):
    for i, j, k to size of weights
        if RandomFloat(0 to 1) < p then
            weights[i][j][k] <- RandomFloat(0 to 1)
    return weights

4 Result

The results of the simulation are presented in this chapter and are used to answer the research question.

4.1 Reinforcement learning test

The statistics from the tables (table 4.1, table 4.2) are used to compare the learning cost of achieving the optimized solution. In the simulation of reinforcement learning, the program performs 10 tests with the same configuration to generate a reliable result. The simulation destroyed 3763 cars on average to optimize the solution, where the maximum learning cost is 7054 cars and the minimum is 2934. The data points tend to cluster around three thousand, which supports the reliability of this data set.

Table 4.1: Number of cars destroyed to finish 100 laps

Test      Cars destroyed
Test 1    3186
Test 2    2982
Test 3    3414
Test 4    4826
Test 5    3006
Test 6    7054
Test 7    3446
Test 8    3781
Test 9    2934
Test 10   3008

Figure 4.1: Number of cars destroyed to finish 100 laps

Reinforcement learning cannot find a good solution within just a few attempts; it takes a longer time to learn, but the variance of the learning cost is considered smaller and more stable. Unlike the genetic algorithm, reinforcement learning optimizes the solution in a stable way by constantly making small progress within a bounded region, which is defined by the ML-Agents implementation of reinforcement learning.

4.2 Genetic algorithm test

17 of 20 attempts finished 100 laps without colliding with the walls. (table 4.2)

The result we got from the genetic algorithm is 3005 cars on average when using 10 cars in each generation, which is lower than the 3763 from reinforcement learning.

The data points are spread far apart from each other, which means the simulation cannot reproduce the same results over multiple trials; the variation is large. The maximum learning cost is 18712, compared with a minimum of 819.

Table 4.2: Number of cars destroyed to finish 100 laps

Test      5 cars per generation    10 cars per generation
Test 1    12424                    3869
Test 2    13704                    2409
Test 3    9234                     4499
Test 4    8474                     819
Test 5    5234                     6559
Test 6    17263                    1789
Test 7    18712                    1299
Test 8    2164                     939
Test 9    5624                     3229
Test 10   15672                    4639

Figure 4.2: Number of cars destroyed to finish 100 laps

The statistics show that the genetic algorithm has a lower average learning cost, but it does not learn as stably as reinforcement learning. The genetic algorithm constantly creates new solutions by crossing over the parents' chromosomes; a problem is that two agents with the same performance might still have very different weight matrices. Combining such pairs is more likely to scramble the entire solution, which has a negative impact on the learning efficiency.
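
The reported averages can be reproduced directly from Tables 4.1 and 4.2; the snippet below also computes the sample standard deviation, which makes the difference in spread between the two methods explicit.

import statistics

# Mean and sample standard deviation of the learning costs in Tables 4.1/4.2.
rl        = [3186, 2982, 3414, 4826, 3006, 7054, 3446, 3781, 2934, 3008]
ga_5cars  = [12424, 13704, 9234, 8474, 5234, 17263, 18712, 2164, 5624, 15672]
ga_10cars = [3869, 2409, 4499, 819, 6559, 1789, 1299, 939, 3229, 4639]

for name, data in [("RL", rl), ("GA, 5 cars", ga_5cars), ("GA, 10 cars", ga_10cars)]:
    print(name, statistics.mean(data), statistics.stdev(data))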

4.3 Comparison

The genetic algorithm outperforms reinforcement learning on mean learning time, despite the fact that the former shows a large variance.

Referring to the box diagrams (figures 4.1 and 4.2), the genetic algorithm starts strong, finding optimized solutions efficiently, but this performance is unstable. Multiple trials have a small learning cost, which lowers the average learning cost, but with an increased test size this advantage could be surpassed by the reinforcement learning variant.

Reinforcement learning outperforms the genetic algorithm in one aspect, namely that reinforcement learning could teach the respective car to stay in the designated area of the road, while the genetic algorithm could not. This corresponds to safe driving and respecting traffic rules, although the randomly generated solutions from reinforcement learning can pose a hidden danger, especially when driving in unfamiliar places.

5 Conclusion

Both methods have been shown to be successful techniques for autonomous control optimization, but they perform differently.

In conclusion, the genetic algorithm outperforms reinforcement learning on mean learning time, despite the fact that the former shows a large variance; that is, the genetic algorithm provides better learning efficiency. This answers our research question: how does the learning efficiency of reinforcement learning compare with that of a genetic algorithm for autonomous navigation through a dynamic environment?

6 Discussion

At the current stage, the learning rates are very similar for both algorithms. It is difficult to draw a simple conclusion from the current simulation, but the genetic algorithm is likely to have a lower learning cost than reinforcement learning for finishing 100 laps. Reinforcement learning has a higher learning cost, but its results have a smaller variance compared with the genetic algorithm. A smaller variance indicates that the data points tend to be close to the mean, which gives better statistical significance to its mean.

6.1 Safety

Because of the randomly generated solutions, the cars from the genetic algorithm often make dangerous turns when steering and drive close to the wall, while the reinforcement learning cars do not. This is thought to be because reinforcement learning is designed to keep a distance in order to earn the reward, in contrast to the genetic algorithm, which has no sub-condition to achieve other than surviving 100 laps. The cars from reinforcement learning keep their distance, which means they have good potential for better safety than the genetic algorithm.

6.2 Implementation difficulty and time cost

The genetic algorithm is easy to implement, while reinforcement learning takes a longer time to design: small changes can lead to a big difference in performance, and without a good reward system the agents perform poorly. It can cost more cars and still not find a single solution. The training ground has a U-curve that requires a sharp turn, and the agents can easily get stuck if they cannot find a solution.

The genetic algorithm is easier to modify, for example by changing the mutation probability if the cars get stuck. The genetic algorithm also has a much better learning time for finishing 100 laps, simply because multiple agents learn in parallel in the genetic algorithm, while reinforcement learning trains one agent at a time.

6.3 Further study

The implementation can still be optimized by adjusting the configuration.

In the genetic algorithm, the mutation probability and the number of cars in each generation could be tuned for a better learning rate and accuracy. For reinforcement learning, there is room for improvement in the design of the reward system.

For each algorithm, we only performed 10 tests, which is quite small, due to time limitations. There are a number of laser detectors, but they do not cover all angles, so the agent cannot detect full information about the surrounding environment, which can lead to wrong predictions. The inputs and outputs are separate observations and actions, but ideally the output should be a combination of actions over a period of time, instead of a single action predicted per state.

Future studies could include moving obstacles, e.g. objects controlled by other agents; the input should then include a set of observations over time to be able to predict the moving direction of these objects. Future studies could focus on improving the simulation, increasing the input size, and allowing the agents to detect a moving obstacle and respond with a combination of actions. These changes would bring the study closer to a real-life scenario.

References

[1] Abbeel, Pieter. "MDPs-exact-methods". 2012. URL: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf, visited 2019-5-1.

[2] Bimbraw, Keshav. Autonomous Cars: Past, Present and Future - A Review of the Developments in the Last Century, the Present Scenario and the Expected Future of Autonomous Vehicle Technology. Tech. rep. Thapar University, 2015.

[3] Dabbura, Imad. "Coding Neural Network — Forward Propagation and Backpropagation". 2018. URL: https://towardsdatascience.com/coding-neural-network-forward-propagation-and-backpropagtion-ccf8cf369f76, visited 2019-4-1.

[4] Fleming, Peter and Purshouse, Robin. Genetic Algorithms In Control Systems Engineering. Tech. rep. The University of Sheffield, May 2002.

[5] Fridman, Lex. "MIT 6.S094: Deep Learning for Self-Driving Cars". 2019. URL: https://selfdrivingcars.mit.edu/resources/.

[6] Hendrickson, Josh. "What Are the Different Self-Driving Car "Levels" of Autonomy?" 2019. URL: https://www.howtogeek.com/401759/what-are-the-different-self-driving-car-levels-of-autonomy, visited 2019-5-10.

[7] Schulman, John et al. Proximal Policy Optimization Algorithms. Tech. rep. 2017.

[8] Juliani, A. et al. "Unity: A General Platform for Intelligent Agents". arXiv preprint arXiv:1809.0262. 2018. URL: https://github.com/Unity-Technologies/ml-agents, visited 2019-3-1.

[9] MathWorks. "What Is the Genetic Algorithm?" 2019. URL: https://de.mathworks.com/help/gads/what-is-the-genetic-algorithm.html, visited 2019-4-1.

[10] Meyer, David. Likelihood Ratio Policy Gradients for Reinforcement Learning. Tech. rep. University of Oregon, 2018.

[11] Stone, James. Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning. Apr. 2019, pp. 37–62. ISBN: 9780956372819.

[12] TutorialsPoint. "Genetic Algorithms Tutorial". 2016. URL: https://www.tutorialspoint.com/genetic_algorithms/genetic_algorithms_tutorial.pdf, visited 2019-3-1.

[13] SAE International. "SAE International Releases Updated Visual Chart for Its "Levels of Driving Automation" Standard for Self-Driving Vehicles". 2018. URL: https://www.sae.org/news/press-room/2018/12/sae-international-releases-updated-visual-chart-for-its-%E2%80%9Clevels-of-driving-automation%E2%80%9D-standard-for-self-driving-vehicles, visited 2019-5-10.

[14] White, Joseph and Khan, Shariq. "Waymo says it will build self-driving cars in Michigan". 2019. URL: https://www.reuters.com/article/us-autonomous-waymo/waymo-says-it-will-build-self-driving-cars-in-michigan-idUSKCN1PG22R, visited 2019-5-10.

TRITA-EECS-EX-2019:505

www.kth.se
