
Contents

1. INTRODUCTION
1.1 Machine Learning
1.2 Types of Learning
1.2.1 Supervised Learning
1.2.2 Unsupervised Learning
1.2.3 Reinforcement Learning
1.3 Motivation
1.4 History of Game Bots
1.5 Observations
2. Literature Survey
2.1 Reinforcement Learning
2.1.1 Exploration-Exploitation Trade-off
2.1.2 Feedback, Goals and Performance
2.1.3 The Reward Function
2.1.4 Experience Replay
2.1.5 Markov Decision Process
2.1.6 Discounted Future Reward
2.1.7 Q-learning
2.2 Neural Network
2.3 Value Iteration
2.4 The History of Game Programming
2.5 The Democratization of AI
2.6 Bot Types
3. System Requirements
3.1 Environment and Dependencies Needed
3.2 Hardware Requirements
4. Technology
4.1 Python
4.2 CUDA
4.3 cuDNN
4.4 OpenAI Gym
4.4.1 Installing OpenAI Gym using the pip command
4.4.2 Building OpenAI Gym from Source
4.5 TensorFlow
4.6 Keras
5. Implementation
5.1 Cartpole Game
5.2 Implementing Simple Neural Network using Keras
5.3 Implementing Mini Deep Q Network (DQN)
5.4 Remember
5.5 Replay
5.6 How The Agent Decides to Act
5.7 Hyper Parameters
5.8 Putting It All Together: Coding The Deep Q-Learning Agent
5.9 Let's Train the Agent
5.10 Result
6. Screenshots
6.1 TensorFlow detects the GPU on board
6.2 Agent Learning to play and maximize scores
6.3 Agent playing the game
7. Conclusion
8. Bibliography
Reinforcement Learning

1. INTRODUCTION
1.1 Machine Learning
Machine Learning is a sub-field of computer science that has evolved from the study of
pattern recognition. In Machine Learning, we explore the study and construction of algorithms that
can learn from data and also make predictions on data. Such algorithms operate by building a
model from example inputs in order to make data-driven predictions or decisions, rather than
following strictly static program instructions.

Machine Learning is related to computational statistics, which aims at designing algorithms for implementing statistical methods on computers. Machine Learning also has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine Learning is employed in many computing tasks where designing and programming explicit algorithms is infeasible.

1.2 Types of Learning:


There are three main types of learning in Machine Learning:

1.2.1 Supervised Learning:


In Supervised Learning, the learning algorithm learns from labeled training data, which consists of a set of training examples. Each example in the dataset contains an input object (a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function which can be used to map new examples, allowing it to correctly determine the class labels for unseen instances. This requires the algorithm to generalize from the training dataset to unseen or new data in a reasonable way.

Steps:

 Determine the type of training examples.

 Gather a training set.

 Determine the input feature representation of the learned function.


 Determine the structure of the learned function and corresponding learning algorithm.

 Complete the design. Run the learning algorithm on the gathered training set.

Finally, evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.

1.2.2 Unsupervised Learning


We now consider unsupervised learning, where we are given only input data, without any corresponding output labels. The goal is to discover "interesting structure" in the data; this is sometimes called knowledge discovery. Unlike supervised learning, we are not told what the desired output is for
each input. Unsupervised learning is arguably more typical of human and animal learning. It is
also more widely applicable than supervised learning since it does not require a human expert to
manually label the data. Labeled data is not only expensive to acquire, but it also contains relatively
little information, certainly not enough to reliably estimate the parameters of complex models.

Figure 1: Supervised vs Unsupervised Learning


1.2.3 Reinforcement Learning


Reinforcement learning is learning what to do, that is, how to map situations to actions so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but must instead discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. This is called delayed reward, and it is one of the two important characteristics of Reinforcement Learning (the other being trial-and-error search).
Reinforcement learning is different from supervised learning, the kind of learning studied
in most current research in machine learning, statistical pattern recognition, and artificial neural
networks. Supervised learning is learning from examples provided by a knowledgeable external
supervisor. This is an important kind of learning, but alone it is not adequate for learning from
interaction. One of the challenges that arise in reinforcement learning and not in other kinds of
learning is the trade-off between exploration and exploitation. To obtain a lot of rewards, a
reinforcement learning agent must prefer actions that it has tried in the past and found to be
effective in producing rewards.
But to discover such actions, it has to try actions that it has not selected before. The agent
has to exploit what it already knows in order to obtain a reward, but it also has to explore in order
to make better action selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task. The agent must try a variety of
actions and progressively favor those that appear to be best. On a stochastic task, each action must
be tried many times to gain a reliable estimate of its expected reward.

Main points in Reinforcement Learning:

• Input: The input is an initial state from which the model starts.
• Output: There are many possible outputs, as there are many possible solutions to a given problem.
• Training: Training is based on the input; the model returns a state and the user decides whether to reward or punish the model based on its output.
• The model continues to learn from this feedback.
• The best solution is decided based on the maximum reward.


The 'deep' in deep reinforcement learning means that it is the result of applying deep learning to reinforcement learning. The model uses Q-values to determine the best action. In traditional policy-based learning the action is chosen directly from the policy, whereas in Q-learning it is derived from the Bellman equation, with the values looked up in a Q-table.

In deep reinforcement learning, or deep Q-learning, we use a neural network to approximate the Q-values and decide future actions so as to maximize reward. Deep Q-Learning is known to fare better than policy gradients or a Q-table when there is a large number of states.

1.3 Motivation
Deep learning agents are still not fully understood, because what happens inside the network is not completely clear even to researchers. Researchers are working in this field to discover and solve real-life problems using Reinforcement Learning. Solving game AI using deep learning is forward-looking work, as we expect the agent to learn the way humans do, on a trial-and-error and reward-penalty basis. Playing against a bot that has been highly trained on a particular game can make the game really challenging for the human player and can also help in developing new insights into how to play the game better.

1.4 History of Game Bots


Most of the game bots in the past were hard-coded: the conditions are given in advance and the bots act according to them. Such bots are known as scripted bots. In 1997, IBM's Deep Blue computer defeated the chess Grandmaster Garry Kasparov. Since 2007, modern games have introduced more sophisticated bots that can react to realistic cues, such as the sounds made by a player and the footprints they leave behind. The "Mario AI Championship" competition asked participants to create a game bot that could play the Mario game; the bots were built using neural networks and could not see beyond the computer screen. In 2014, Google acquired DeepMind, an AI company that uses reinforcement learning to make a bot that is not pre-programmed but can learn to play arcade games like Space Invaders and Breakout.


1.5 Observations
If we ever want to do better than take random actions at each step, it’d probably be good
to actually know what our actions are doing to the environment.

The environment's step function returns exactly what we need. In fact, step returns four values:

• observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
• reward (float): the amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
• done (boolean): whether it is time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated (for example, perhaps the pole tipped too far, or you lost your last life).
• info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.

This is just an implementation of the classic “agent-environment loop”. Each time step, the
agent chooses an action, and the environment returns an observation and a reward.
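
As a concrete sketch of this loop, assuming the classic Gym API in which step returns the four values described above, a purely random agent could look like the following (CartPole is used here only as an example environment):

import gym

# A minimal sketch of the agent-environment loop using the classic Gym API.
env = gym.make('CartPole-v0')

for episode in range(5):
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()                    # random action for now
        observation, reward, done, info = env.step(action)    # the four return values
        total_reward += reward
    print("episode {} finished with total reward {}".format(episode, total_reward))

env.close()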


2. Literature Survey
Jürgen Schmidhuber discusses deep learning policies and their underlying mechanics in his research. In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects.

2.1 Reinforcement Learning


Reinforcement Learning is situated in between supervised learning and unsupervised
learning. It deals with learning in sequential decision-making problems where the feedback is
limited.

Figure 2: Example of interaction between an agent and its environment

2.1.1 Exploration-Exploitation Trade-off


If the complete dynamics of the problem are known, we can use methods like Dynamic Programming to compute optimal policies from this model. But in the real world we usually do not have complete knowledge of the environment, so we follow a 'trial-and-error' approach to learn a good policy: the agent needs to explore and interact with the environment by performing actions from its action set and receiving rewards or penalties. The agent interacts with the environment, performs actions, receives rewards (which are the only feedback that the agent gets) and formulates a policy to solve the problem. This approach means that the agent has to try out 'bad' actions, which can cause performance to drop, but this is important because the agent can never discover the optimal policy without trying out all the actions. To perform efficiently, the agent also has to 'exploit' what it already knows. Balancing these two demands is the exploration-exploitation problem.

2.1.2 Feedback, Goals and Performance


The feedback received in Reinforcement Learning is much more limited than the feedback received in Supervised Learning, where the model sees the 'correct output' for every training example, which makes it easy to measure the performance or predictive accuracy of the model. In reinforcement learning, the feedback is evaluative rather than instructive. Because the feedback is so limited (just the reward or penalty), the agent has to put in more effort to use it to evaluate and improve its policy for solving the problem. Another aspect of feedback and performance is the stochastic nature of the problem formulation. In supervised and unsupervised algorithms, the data is considered 'static' and does not change, so measuring performance is easy. In reinforcement learning the data is not static and can be considered a 'moving target': the policy changes over time, which means the rewards and states change over time as well. This problem of a changing distribution is called concept drift, and it demands special mechanisms to deal with it. In reinforcement learning, this problem is dealt with through exploration.

2.1.3 The Reward Function


The reward function tells the agent the reward that it receives for performing an action in a particular state. The state reward function is defined as R: S → ℝ, and it gives the reward obtained in state s. The other possible definitions of the reward function are R: S × A → ℝ and R: S × A × S → ℝ.

The second definition gives rewards for performing an action in a state, and the third gives rewards for particular transitions between states. All definitions are interchangeable, though the last one is convenient in model-free algorithms, because we usually need both the starting state and the resulting state when backing up values.
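
As a toy illustration (the states, actions and reward values below are invented purely for the example), the three signatures differ only in how much of the transition they look at:

# R: S -> R          reward depends only on the state reached
def reward_state(s):
    return 1.0 if s == 'goal' else 0.0

# R: S x A -> R      reward depends on the state and the action taken in it
def reward_state_action(s, a):
    return -1.0 if a == 'move_into_wall' else reward_state(s)

# R: S x A x S -> R  reward depends on the full transition (s, a, s')
def reward_transition(s, a, s_next):
    return 1.0 if s_next == 'goal' else -0.01   # small step penalty, illustrative only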

2.1.4 Experience Replay


Deep Q-Networks tend to forget what they have learned; Experience Replay is a technique to overcome this issue. We sample a random batch of experiences from a large table (or deque) and train the model on it again.

To perform experience replay we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}).

This means that instead of running Q-learning on state/action pairs as they occur during simulation or actual experience, the system stores the data discovered for [state, action, reward, next_state], typically in a large table. Note that this does not store associated values; this is the raw data to feed into action-value calculations later.

The learning phase is then logically separate from gaining experience, and is based on taking random samples from this table. You still want to interleave the two processes, acting and learning, because improving the policy leads to different behaviour that should explore actions closer to optimal ones, and you want to learn from those. However, you can split this however you like, e.g. take one step, then learn from three random prior steps. The Q-learning targets when using experience replay are the same as in the online version, so there is no new formula for that. The loss formula is also the one you would use for DQN without experience replay; the difference is only which s, a, r, s' you feed into it.
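
A minimal sketch of such a replay memory, assuming a fixed-size deque and uniform random sampling (the capacity and batch size below are arbitrary choices):

import random
from collections import deque

class ReplayMemory:
    """Stores raw (state, action, reward, next_state, done) tuples and samples them uniformly."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)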

In DQN, the DeepMind team also maintained two networks and switched which one was learning and which one was feeding in current action-value estimates as "bootstraps". This helped with the stability of the algorithm when using a non-linear function approximator. That is what the bar stands for in θ̄ᵢ: it denotes the alternate, frozen version of the weights (the target network).

Advantages of experience replay:

• More efficient use of previous experience, by learning from it multiple times. This is key when gaining real-world experience is costly, since you can get full use of it. The Q-learning updates are incremental and do not converge quickly, so multiple passes over the same data are beneficial, especially when there is low variance in immediate outcomes (reward, next state) given the same state-action pair.
• Better convergence behaviour when training a function approximator. This is partly because the data is more like the i.i.d. data assumed in most supervised learning convergence proofs.

Disadvantage of experience replay:

• It is harder to use multi-step learning algorithms, such as Q(λ), which can be tuned to give better learning curves by balancing between bias (due to bootstrapping) and variance (due to delays and randomness in long-term outcomes). Multi-step DQN with experience replay is one of the extensions explored in the paper Rainbow: Combining Improvements in Deep Reinforcement Learning.

2.1.5 Markov Decision Process

Now the question is, how do you formalize a reinforcement learning problem, so that you can
reason about it? The most common method is to represent it as a Markov decision process.

Suppose you are an agent, situated in an environment (e.g. Breakout game). The environment is
in a certain state (e.g. location of the paddle, location and direction of the ball, existence of every
brick and so on). The agent can perform certain actions in the environment (e.g. move the paddle
to the left or to the right). These actions sometimes result in a reward (e.g. increase in score).
Actions transform the environment and lead to a new state, where the agent can perform another
action, and so on. The rules for how you choose those actions are called policy. The environment
in general is stochastic, which means the next state may be somewhat random (e.g. when you
lose a ball and launch a new one, it goes towards a random direction).

Figure 3: Left: reinforcement learning problem. Right: Markov decision process.

The set of states and actions, together with rules for transitioning from one state to another and
for getting rewards, make up a Markov decision process. One episode of this process (e.g. one
game) forms a finite sequence of states, actions and rewards:

s_0, a_0, r_1, s_1, a_1, r_2, s_2, …, s_{n−1}, a_{n−1}, r_n, s_n

Here s_i represents the state, a_i the action and r_{i+1} the reward after performing the action. The episode ends with the terminal state s_n (e.g. the "game over" screen). A Markov decision process relies on the Markov assumption: the probability of the next state s_{i+1} depends only on the current state s_i and the performed action a_i, and not on preceding states or actions.

2.1.6 Discounted Future Reward

To perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. How should we go about that?

Given one run of Markov decision process, we can easily calculate the total reward for one
episode:


R = r_1 + r_2 + r_3 + … + r_n

Given that, the total future reward from time point t onward can be expressed as:

R_t = r_t + r_{t+1} + r_{t+2} + … + r_n

But because our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions. The further into the future we go, the more the outcome may diverge. For that reason it is common to use the discounted future reward instead:

R_t = r_t + γr_{t+1} + γ²r_{t+2} + … + γ^(n−t)r_n

Here γ is the discount factor between 0 and 1; the further into the future the reward is, the less we take it into consideration. It is easy to see that the discounted future reward at time step t can be expressed in terms of the same quantity at time step t+1:

R_t = r_t + γ(r_{t+1} + γ(r_{t+2} + …)) = r_t + γR_{t+1}

If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the
immediate rewards. If we want to balance between immediate and future rewards, we should set
discount factor to something like γ=0.9. If our environment is deterministic and the same actions
always result in same rewards, then we can set discount factor γ=1.

A good strategy for an agent would be to always choose an action that maximizes the discounted future reward.
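
The recursion above can be checked directly in code. A small sketch, with an arbitrary reward sequence and γ = 0.9:

# Discounted return computed directly and via the recursion R_t = r_t + gamma * R_{t+1}
rewards = [1.0, 1.0, 1.0, 1.0]   # arbitrary example rewards r_t, ..., r_n
gamma = 0.9

def discounted_return(rewards, gamma):
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

def discounted_return_recursive(rewards, gamma):
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

print(discounted_return(rewards, gamma))            # 3.439
print(discounted_return_recursive(rewards, gamma))  # same value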

2.1.7 Q-learning

In Q-learning we define a function Q(s,a) representing the discounted future reward when we
perform action a in state s, and continue optimally from that point on.

Q(s_t, a_t) = max_π R_{t+1}


The way to think about Q(s,a) is that it is “the best possible score at the end of game after
performing action a in state s”. It is called Q-function, because it represents the “quality” of
certain action in given state.

This may sound like quite a puzzling definition. How can we estimate the score at the end of the game if we know just the current state and action, and not the actions and rewards coming after that? We really can't. But as a theoretical construct we can assume the existence of such a function.

If you're still not convinced, consider what the implications of having such a function would be. Suppose you are in state s and pondering whether you should take action a or b. You want to select the action that results in the highest score at the end of the game. Once you have the magical Q-function, the answer becomes really simple: pick the action with the highest Q-value!

π(s) = argmax_a Q(s, a)

Here π represents the policy, the rule for how we choose an action in each state.

OK, how do we get that Q-function then? Let's focus on just one transition <s, a, r, s′>. Just like with discounted future rewards in the previous section, we can express the Q-value of state s and action a in terms of the Q-value of the next state s′:

Q(s, a) = r + γ max_{a′} Q(s′, a′)

This is called the Bellman equation. If you think about it, it is quite logical – maximum future
reward for this state and action is the immediate reward plus maximum future reward for the next
state.

The main idea in Q-learning is that we can iteratively approximate the Q-function using the
Bellman equation. In the simplest case the Q-function is implemented as a table, with states as
rows and actions as columns. The gist of Q-learning algorithm is as simple as the following:

initialize Q[numstates, numactions] arbitrarily
observe initial state s
repeat
    select and carry out an action a
    observe reward r and new state s'
    Q[s,a] = Q[s,a] + α(r + γ max_a' Q[s',a'] - Q[s,a])
    s = s'
until terminated

α in the algorithm is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account. In particular, when α=1, the two Q[s,a] terms cancel and the update is exactly the Bellman equation.

The max_a' Q[s',a'] that we use to update Q[s,a] is only an estimate, and in the early stages of learning it may be completely wrong. However, the estimates get more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value.
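
A minimal tabular version of this update, sketched here for a discrete Gym environment such as FrozenLake-v0 (an assumption; any environment with discrete states and actions works), with an epsilon-greedy action choice:

import numpy as np
import gym

# Tabular Q-learning sketch; FrozenLake-v0 is used only because its states are discrete.
env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))   # Q[s, a] initialised to zero

alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy: mostly exploit the table, sometimes explore
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s])
        s_next, r, done, _ = env.step(a)
        # Q-learning update: move Q[s, a] towards the Bellman target
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next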

2.2 Neural Network


Building neural network architectures, much like computer architecture in the early days of computing, is a tough job. Despite having tools to help in the process, what really matters is what the model is built for. Between Policy Gradients and Deep Q-Learning, Deep Q-Learning was the slightly better approach for this project: Policy Gradients do well for games with a small number of actions, whereas Deep Q-Learning can be used to choose the best option even when the action set is large.

2.3 Value Iteration


The policy iteration algorithm completely separates the evaluation and improvement
phases. In the evaluation step, the value function must be computed in the limit.


Figure 4: Value Iteration Function

Value Iteration is guaranteed to converge in the limit towards V*, i.e., the Bellman optimality equation holds for each state. A deterministic policy π for all states can be computed using the above equation.
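
A small sketch of value iteration for a generic MDP, assuming the model is given as transition probabilities P and expected rewards R (the data structures below are illustrative assumptions, not part of the project code):

def value_iteration(states, actions, P, R, gamma=0.99, theta=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the expected reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best action value under the current V
            best = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # stop once the values barely change
            return V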

Figure 5: The general algorithm for Reinforcement Learning

2.4 The History of Game Programming

Game Programmers used to use heuristic if-then-else type decisions to make educated guesses. We saw this in the earliest arcade video games such as Pong and Pacman. This trend was
the norm for a very long time. But game developers cannot predict every possible scenario and
edge case. Game developers then tried to mimic how humans would play a game and then modeled
human intelligence into a game bot. This is possible by generalizing and modeling intelligence to
solve any Atari game that the bot is made to play.

The bot has a deep learning neural network that lets it learn how to play the game after
being fed only the raw pixels of the game and the game controls.

2.5 The Democratization of AI


To avoid concentrating the power of AI in the hands of few, OpenAI was founded by Elon
Musk which seeks to democratize AI by making it accessible for all. The Gym and Universe
libraries used in the project are part of OpenAI.

2.6 Bot Types:


Static Bots are designed to follow pre-made waypoints for each level or map. They need a
unique waypoint file for each map to function.

Dynamic Bots, on the other hand, learn the levels and maps as they play. In video games, AI is
used to generate responsive, adaptive or intelligent behavior primarily in non-player characters
(NPCs), similar to human-like intelligence. In most games, the “AI” in the term “Game AI”
overstates its worth as a game. “Real” Artificial Intelligence addresses the fields of Machine
Learning and decision making based on arbitrary data input whereas “Game AI” generally plays
based on rules of thumb or heuristics.


3. System Requirements

Using the OpenAI Gym and Universe libraries mentioned above we can create a bot for the Atari
environments that are included in Gym.

3.1 Environment and Dependencies Needed

Supported Systems: Linux and OSX

Python Version: 2.7.x or 3.6.x

Dependencies: numpy, incremental, golang, libjpeg-turbo8-dev, TensorFlow, OpenAI Gym and

Keras

Application Support: CUDA, cuDNN

3.2 Hardware Requirements

Processor: Intel Core i5 or above

RAM: 8 GB

HDD: 256 GB

Graphics Processing Unit: An Nvidia GPU with at least 2 GB of VRAM


4. Technology
4.1 Python

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. Van Rossum led the language community until stepping down as leader in July 2018.

Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural. It also
has a comprehensive standard library.

Python interpreters are available for many operating systems. CPython, the reference
implementation of Python, is open source software and has a community-based development
model, as do nearly all of Python's other implementations. Python and CPython are managed by
the non-profit Python Software Foundation.

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by
SETL), capable of exception handling and interfacing with the Amoeba operating system. Its
implementation began in December 1989. Van Rossum's long influence on Python is reflected in
the title given to him by the Python community: Benevolent Dictator For Life (BDFL) – a post
from which he gave himself permanent vacation on July 12, 2018.

Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-
detecting garbage collector and support for Unicode.

Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not
completely backward-compatible. Many of its major features were backported to Python 2.6.x
and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at least
partially) the translation of Python 2 code to Python 3.


Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that
a large body of existing code could not easily be forward-ported to Python 3. In January 2017,
Google announced work on a Python 2.7 to Go transcompiler to improve performance under
concurrent workloads.

Python's large standard library, commonly cited as one of its greatest strengths, provides tools
suited to many tasks. For Internet-facing applications, many standard formats and protocols such
as MIME and HTTP are supported. It includes modules for creating graphical user interfaces,
connecting to relational databases, generating pseudorandom numbers, arithmetic with arbitrary-precision decimals, manipulating regular expressions, and unit testing.

Some parts of the standard library are covered by specifications (for example, the Web Server
Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but most modules are not.
They are specified by their code, internal documentation, and test suites (if supplied). However,
because most of the standard library is cross-platform Python code, only a few modules need
altering or rewriting for variant implementations.

As of March 2018, the Python Package Index (PyPI), the official repository for third-party
Python software, contains over 130,000 packages with a wide range of functionality, including:

 Graphical user interfaces


 Web frameworks
 Multimedia
 Databases
 Networking
 Test frameworks
 Automation
 Web scraping
 Documentation
 System administration
 Scientific computing
 Text processing
 Image processing


4.2 CUDA

CUDA is a parallel computing platform and application programming interface (API) model
created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled
graphics processing unit (GPU) for general purpose processing — an approach termed GPGPU
(General-Purpose computing on Graphics Processing Units). The CUDA platform is a software
layer that gives direct access to the GPU's virtual instruction set and parallel computational
elements, for the execution of compute kernels.

The CUDA platform is designed to work with programming languages such as C, C++, and
Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU
resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills
in graphics programming. Also, CUDA supports programming frameworks such as OpenACC
and OpenCL. When it was first introduced by Nvidia, the name CUDA was an acronym for
Compute Unified Device Architecture, but Nvidia subsequently dropped the use of the acronym.

The graphics processing unit (GPU), as a specialized computer processor, addresses the demands
of real-time high-resolution 3D graphics compute-intensive tasks. By 2012, GPUs had evolved
into highly parallel multi-core systems allowing very efficient manipulation of large blocks of
data. This design is more effective than general-purpose central processing units (CPUs) for algorithms in situations where processing large blocks of data is done in parallel.

4.3 cuDNN
cuDNN is a library for deep neural nets built using CUDA. It provides GPU accelerated
functionality for common operations in deep neural nets. You could use it directly yourself, but
other libraries like TensorFlow already have built abstractions backed by cuDNN.


Please note that CUDA and cuDNN are optional. However, they accelerate the process of learning and prediction by using the GPU. Both provide support for the 'tensorflow-gpu' library for accelerated functionality.
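
A quick way to check whether the installed TensorFlow build actually sees the GPU (this uses the TensorFlow 1.x API assumed throughout this project):

from tensorflow.python.client import device_lib

# Lists all devices TensorFlow can use; a GPU entry confirms CUDA/cuDNN are picked up.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)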

4.4 OpenAI Gym:


Gym is a collection of environments/problems designed for testing and developing
reinforcement learning algorithms. Gym allows us to use premade environments thus freeing us
from the work of having to create a complicated game environment. Gym is a library written in
Python and it contains multiple environments such as robot simulations and Atari games.

Gym is interesting for two reasons:

• RL is very general, encompassing all problems that involve making a sequence of decisions: for example, controlling a robot's motors so that it is able to run and jump, making business decisions like pricing and inventory management, or playing video games and board games. RL can even be applied to supervised learning problems with sequential or structured outputs.
• RL algorithms have started to achieve good results in many difficult environments. RL has a long history, but until recent advances in deep learning, it required lots of problem-specific engineering. DeepMind's Atari results, BRETT from Pieter Abbeel's group, and AlphaGo all used deep RL algorithms which did not make too many assumptions about their environment and thus can be applied in other settings.

However, RL research is also slowed down by two factors:

1. The need for better benchmarks. In supervised learning, progress has been driven by
large labeled datasets like ImageNet. In RL, the closest equivalent would be a large and
diverse collection of environments. However, the existing open-source collections of RL
environments don’t have enough variety, and they are often difficult to even set up and use.
2. Lack of standardization of environments used in publications. Subtle differences in the problem definition, such as the reward function or the set of actions, can drastically alter a task's difficulty. This issue makes it difficult to reproduce published research and compare results from different papers.

Gym is an attempt to fix both problems.

4.4.1 Installing OpenAI Gym using the pip command

OpenAI Gym is supported in Python 3.5 or higher versions. To install OpenAI Gym, we use the
following command:
pip install gym

4.4.2 Building OpenAI Gym from Source

Another option that we have using which we can install OpenAI Gym is cloning the Git
repository. To do this, we use the following commands

git clone https://github.com/openai/gym


cd gym
pip install -e .
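
Either way, the installation can be verified from a Python shell by creating one of the bundled environments (CartPole is used here only as an example):

import gym

env = gym.make('CartPole-v0')   # any bundled environment works here
print(env.observation_space)    # e.g. Box(4,)
print(env.action_space)         # e.g. Discrete(2)
env.close()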

4.5 TensorFlow

TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks. It is used for both research and production at
Google. It is a standard expectation in the industry to have experience in TensorFlow to work in
machine learning.

TensorFlow was developed by the Google Brain team for internal Google use. It was released under the Apache 2.0 open-source license on November 9, 2015.


TensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on February
11, 2017. While the reference implementation runs on single devices, TensorFlow can run on
multiple CPUs and GPUs (with optional CUDA and SYCL extensions for general-purpose
computing on graphics processing units). TensorFlow is available on 64-bit Linux, macOS,
Windows, and mobile computing platforms including Android and iOS.

Its flexible architecture allows for the easy deployment of computation across a variety of
platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge
devices.

TensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow
derives from the operations that such neural networks perform on multidimensional data arrays,
which are referred to as tensors. During the Google I/O Conference in June 2016, Jeff Dean
stated that 1,500 repositories on GitHub mentioned TensorFlow, of which only 5 were from
Google.
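
As a tiny illustration of this dataflow style (using the graph-and-session API of TensorFlow 1.x, which this project assumes), operations are first declared as a graph and only executed when a session runs them:

import tensorflow as tf

# Build a small dataflow graph: nothing is computed yet, only declared.
a = tf.constant([[1.0, 2.0]])        # 1x2 tensor
b = tf.constant([[3.0], [4.0]])      # 2x1 tensor
product = tf.matmul(a, b)            # node representing the matrix product

# Running the graph inside a session performs the actual computation.
with tf.Session() as sess:
    print(sess.run(product))         # [[11.]]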

Among the applications for which TensorFlow is the foundation are automated image-captioning software such as DeepDream. RankBrain now handles a substantial number of search queries, replacing and supplementing traditional static algorithm-based search results.

4.6 Keras

Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a Google engineer. Chollet is also the author of the Xception deep neural network model.

In 2017, Google's TensorFlow team decided to support Keras in TensorFlow's core library.
Chollet explained that Keras was conceived to be an interface rather than a standalone machine-learning framework. It offers a higher-level, more intuitive set of abstractions that make it easy to develop deep learning models regardless of the computational backend used. Microsoft added a CNTK backend to Keras as well, available as of CNTK v2.0.

Keras contains numerous implementations of commonly used neural-network building blocks such as layers, objectives, activation functions, optimizers, and a host of tools to make working with image and text data easier. The code is hosted on GitHub, and community support forums include the GitHub issues page and a Slack channel.

In addition to standard neural networks, Keras has support for convolutional and recurrent neural
networks. It supports other common utility layers like dropout, batch normalization, and pooling.

Keras allows users to productize deep models on smartphones (iOS and Android), on the web, or
on the Java Virtual Machine. It also allows use of distributed training of deep-learning models on
clusters of Graphics Processing Units (GPU) and Tensor processing units (TPU).


5. Implementation

5.1 Cartpole Game

Usually, training an agent to play an Atari game takes a while (from a few hours to a day). So we will make an agent play a simpler game called CartPole, using the same idea as in the DQN paper.

CartPole is one of the simplest environments in OpenAI Gym (a game simulator). The goal of CartPole is to balance a pole connected by one joint on top of a moving cart. Instead of pixel information, the state consists of 4 values, such as the angle of the pole and the position of the cart. The agent can move the cart by choosing action 0 or 1 at each step, pushing it left or right.

Gym makes interacting with the game environment really simple.

next_state, reward, done, info = env.step(action)

As we discussed above, the action can be either 0 or 1. If we pass one of those numbers, env, which represents the game environment, will emit the results. done is a boolean value telling whether the game has ended or not. The old state information, paired with the action, next_state and reward, is the information we need for training the agent.
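
A short sketch of setting up CartPole and looking at what the state actually contains (the printed values will differ from run to run):

import gym

env = gym.make('CartPole-v0')
state = env.reset()
print(state)                # 4 numbers: cart position, cart velocity, pole angle, pole velocity at tip
print(env.action_space)     # Discrete(2): push left (0) or push right (1)

next_state, reward, done, info = env.step(env.action_space.sample())
print(reward, done)         # reward is 1.0 for every step the pole stays up
env.close()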

5.2 Implementing Simple Neural Network using Keras

This section is not about deep learning or neural networks in general, so we will treat the neural net as a black-box algorithm that approximately maps inputs to outputs. It is basically an algorithm that learns from pairs of example input and output data, detects some kind of pattern, and predicts the output for unseen input data. Although the neural network itself is not the focus of this report, we should understand how it is used in the DQN algorithm.


Figure 6: Neural Network

Note that the neural net we are going to use is similar to the diagram above. We will have one input layer that receives 4 values and two hidden layers with 24 nodes each. The output layer has 2 nodes, since there are two buttons (0 and 1) in the game.

Keras makes it really simple to implement a basic neural network. The code below creates an empty neural net model. activation, loss and optimizer are the parameters that define the characteristics of the neural network, but we are not going to discuss them here.

# Neural Net for Deep Q Learning


# Sequential() creates the foundation of the layers.
model = Sequential()
# 'Dense' is the basic form of a neural network layer
# Input Layer of state size(4) and Hidden Layer with 24 nodes
model.add(Dense(24, input_dim=self.state_size, activation='relu'))
# Hidden layer with 24 nodes
model.add(Dense(24, activation='relu'))
# Output Layer with # of actions: 2 nodes (left, right)
model.add(Dense(self.action_size, activation='linear'))
# Create the model based on the information above
model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))

In order for a neural net to understand and predict based on the environment data, we have to feed it the information. The fit() method feeds input and output pairs to the model. The model then trains on those data to approximate the output based on the input.

This training process makes the neural net able to predict the reward value from a certain state.

model.fit(state, reward_value, epochs=1, verbose=0)

After training, the model can predict the output for unseen input. When you call the predict() function on the model, it will predict the reward of the current state based on the data you trained it on. Like so:

prediction = model.predict(state)

5.3 Implementing Mini Deep Q Network (DQN)

Normally in games, the reward directly relates to the score of the game. Imagine a situation where the pole in the CartPole game is tilted to the right. The expected future reward of pushing the right button will then be higher than that of pushing the left button, since it could yield a higher score as the pole survives longer.

In order to logically represent this intuition and train it, we need to express this as a formula that
we can optimize on. The loss is just a value that indicates how far our prediction is from the
actual target. For example, the prediction of the model could indicate that it sees more value in
pushing the right button when in fact it can gain more reward by pushing the left button. We
want to decrease this gap between the prediction and the target (loss). We will define our loss
function as follows:


Figure 7: Loss function

We first carry out an action a, and observe the reward r and the resulting new state s'. Based on the result, we calculate the maximum target Q and then discount it so that the future reward is worth less than the immediate reward (the same concept as an interest rate on money: immediate payment is always worth more than the same amount later). Lastly, we add the current reward to the discounted future reward to get the target value, target = r + γ max_a' Q(s', a'). Subtracting our current prediction from the target gives the loss, and squaring this value punishes large errors more and treats negative values the same as positive values.

Keras takes care of most of the difficult tasks for us. We just need to define our target. We can express the target in a magical one-liner in Python:

target = reward + gamma * np.amax(model.predict(next_state))

Keras does all the work of subtracting the target from the neural network output and squaring it. It also applies the learning rate we defined while creating the neural network model. This all happens inside the fit() function, which decreases the gap between our prediction and the target according to the learning rate. The approximation of the Q-value converges to the true Q-value as we repeat this updating process: the loss decreases and the score grows higher.
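
As a quick numeric check (with made-up numbers), suppose the agent receives reward r = 1, gamma is 0.95, and the network predicts Q-values [0.5, 0.8] for the next state. Then target = 1 + 0.95 * max(0.5, 0.8) = 1.76, and if the current prediction for the chosen action was 1.5, the squared loss for that sample is (1.76 − 1.5)² ≈ 0.0676.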


The most notable features of the DQN algorithm are the remember and replay methods. Both are pretty simple concepts. The original DQN architecture contains several more tweaks for better training, but we are going to stick to a simpler version for now.

5.4 Remember

One of the challenges for DQN is that the neural network used in the algorithm tends to forget previous experiences as it overwrites them with new ones. So we need a list of previous experiences and observations to re-train the model with those previous experiences. We will call this array of experiences memory and use a remember() function to append state, action, reward, and next state to the memory.

In our example, the memory list will have a form of:

memory = [(state, action, reward, next_state, done)...]

And the remember() function will simply store states, actions and resulting rewards in the memory, as below:

def remember(self, state, action, reward, next_state, done):


self.memory.append((state, action, reward, next_state, done))

done is just a boolean that indicates if the state is the final state.

5.5 Replay

The method that trains the neural net with experiences in the memory is called replay(). First, we sample some experiences from the memory and call them the minibatch.

minibatch = random.sample(self.memory, batch_size)

The above code creates the minibatch, which is just a random sample of elements of the memory, of size batch_size. We set the batch size to 32 for this example.

To make the agent perform well in the long term, we need to take into account not only the immediate rewards but also the future rewards we are going to get. In order to do this, we use a 'discount rate' or 'gamma'. This way the agent will learn to maximize the discounted future reward based on the given state.

# Sample minibatch from the memory
minibatch = random.sample(self.memory, batch_size)
# Extract information from each memory
for state, action, reward, next_state, done in minibatch:
    # if done, the target is just the final reward
    target = reward
    if not done:
        # predict the future discounted reward
        target = reward + self.gamma * \
            np.amax(self.model.predict(next_state)[0])
    # make the agent approximately map
    # the current state to the future discounted reward
    # We'll call that target_f
    target_f = self.model.predict(state)
    target_f[0][action] = target
    # Train the Neural Net with the state and target_f
    self.model.fit(state, target_f, epochs=1, verbose=0)

5.6 How The Agent Decides to Act

Our agent will at first select its action randomly a certain percentage of the time, called the 'exploration rate' or 'epsilon'. This is because at first it is better for the agent to try all kinds of things before it starts to see the patterns. When it is not deciding the action randomly, the agent will predict the reward value based on the current state and pick the action that will give the highest reward. np.argmax() is the function that picks the index of the highest value among the elements of act_values[0].

def act(self, state):
    if np.random.rand() <= self.epsilon:
        # The agent acts randomly
        return env.action_space.sample()
    # Predict the reward value based on the given state
    act_values = self.model.predict(state)
    # Pick the action based on the predicted reward
    return np.argmax(act_values[0])

act_values[0] looks like this: [0.67, 0.2], with each number representing the predicted reward of picking action 0 or 1. The argmax function picks the index with the highest value; in the example of [0.67, 0.2], argmax returns 0 because the value at index 0 is the highest.

5.7 Hyper Parameters

There are some parameters that have to be passed to a reinforcement learning agent. You will see these over and over again:

• episodes - the number of games we want the agent to play.
• gamma - aka the decay or discount rate, used to calculate the future discounted reward.
• epsilon - aka the exploration rate, the rate at which the agent randomly decides its action rather than using its prediction.
• epsilon_decay - we want to decrease the amount of exploration as the agent gets good at playing games.
• epsilon_min - we want the agent to explore at least this much.
• learning_rate - determines how much the neural net learns in each iteration.


5.8 Putting It All Together: Coding The Deep Q-Learning Agent

Each part of the agent was explained above. The code below implements everything we have talked about as a single, clean class called DQNAgent.

import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


# Deep Q-learning Agent
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0   # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # returns action

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = reward + self.gamma * \
                    np.amax(self.model.predict(next_state)[0])
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

5.9 Let’s Train the Agent

import gym
import numpy as np

if __name__ == "__main__":
    episodes = 1000  # number of games to play (an arbitrary choice, see section 5.7)
    # initialize gym environment and the agent
    env = gym.make('CartPole-v0')
    agent = DQNAgent(state_size=4, action_size=2)
    # Iterate the game
    for e in range(episodes):
        # reset state in the beginning of each game
        state = env.reset()
        state = np.reshape(state, [1, 4])
        # time_t represents each frame of the game
        # Our goal is to keep the pole upright as long as possible, up to a score of 500
        # the more time_t, the higher the score
        for time_t in range(500):
            # turn this on if you want to render
            # env.render()
            # Decide action
            action = agent.act(state)
            # Advance the game to the next frame based on the action.
            # Reward is 1 for every frame the pole survives
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, 4])
            # Remember the previous state, action, reward, next state and done flag
            agent.remember(state, action, reward, next_state, done)
            # make next_state the new current state for the next frame
            state = next_state
            # done becomes True when the game ends
            # e.g. the agent drops the pole
            if done:
                # print the score and break out of the loop
                print("episode: {}/{}, score: {}".format(e, episodes, time_t))
                break
        # train the agent with the experience of the episode
        # (only once enough transitions have been remembered)
        if len(agent.memory) > 32:
            agent.replay(32)

5.10 Result

In the beginning, the agent explores by acting randomly.


It goes through multiple phases of learning:

1. The cart learns to balance the pole.
2. But it goes out of bounds, ending the game.
3. It tries to move away from the bounds when it is too close to them, but drops the pole.
4. The cart masters balancing and controlling the pole.

After several hundred episodes, it starts to learn how to maximize the score.

The final result is the birth of a skillful CartPole game player!


6. Screenshots
6.1 TensorFlow detects the GPU on board

6.2 Agent Learning to play and maximize scores


6.3 Agent playing the game


7. Conclusion
We have described the game bots project as an infrastructure for multi-agent systems research that supports different platforms and is widely available. Reinforcement Learning is a widely used real-world problem-solving technique, and a Deep Learning based game bot can learn to play Atari and Flash games without any prior information about the game environment; it only needs to be fed the raw pixels of the game and the game controls. A well-trained and well-designed game bot can outperform human gamers.

These games can also be used to test various policy gradients in the field of applied Artificial
Intelligence in a safe manner without incurring major expenses.


8. Bibliography

• OpenAI Gym repository: https://github.com/openai/gym
• How to build an AI game bot using OpenAI Gym and Universe: https://medium.freecodecamp.org/how-to-build-an-ai-game-bot-using-openai-gym-and-universe-f2eb9bfbb40a
• Gal A. Kaminka, Manuela M. Veloso, Steve Schaffer, Chris Sollitto, Rogelio Adobbati, Andrew N. Marshall, Andrew Scholer, and Sheila Tejada. GameBots: A Flexible Test Bed for Multiagent Team Research. Communications of the ACM, 45(1):43–45, January 2002.
• Getting started with OpenAI Gym: http://www.pinchofintelligence.com/getting-started-openai-gym/
• School of AI tutorial: https://www.youtube.com/watch?v=79pmNdyxEGo&t=302s
• Deep Q-learning: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0
• Demystifying Deep Reinforcement Learning: https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
• Andrej Karpathy's blog: http://karpathy.github.io/2016/05/31/rl/
• https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html