Professional Documents
Culture Documents
1. INTRODUCTION ......................................................................................................................................... 1
1.1 Machine Learning................................................................................................................................ 1
1.2 Types of Learning: ............................................................................................................................... 1
1.2.1 Supervised Learning: .................................................................................................................... 1
1.2.2 Unsupervised Learning................................................................................................................. 2
1.2.3 Reinforcement Learning............................................................................................................... 3
1.3 Motivation........................................................................................................................................... 4
1.4 History on Game Bots ......................................................................................................................... 4
1.5 Observations ....................................................................................................................................... 5
2. Literature Survey ....................................................................................................................................... 6
2.1 Reinforcement Learning...................................................................................................................... 6
2.1.1 Exploration-Exploitation Trade-off .............................................................................................. 6
2.1.2 Feedback, Goals and Performance .............................................................................................. 7
2.1.3 The Reward Function ................................................................................................................... 7
2.1.4 Experience Replay ........................................................................................................................ 8
2.1.5 Markov Decision Process ............................................................................................................. 9
2.1.6 Discounted Future Reward ........................................................................................................ 10
2.1.7 Q-learning .................................................................................................................................. 11
2.2 Neural Network ................................................................................................................................. 13
2.3 Value Iteration .................................................................................................................................. 13
2.4 The History of Game Programming .................................................................................................. 14
2.5 The Democratization of AI ................................................................................................................ 15
2.6 Bot Types:.......................................................................................................................................... 15
3. System Requirements ............................................................................................................................. 16
3.1 Environment and Dependencies Needed ......................................................................................... 16
3.2 Hardware Requirements ................................................................................................................... 16
4. Technology .............................................................................................................................................. 17
4.1 Python ............................................................................................................................................... 17
4.2 CUDA ................................................................................................................................................. 19
4.3 cuDNN ............................................................................................................................................... 19
4.4 OpenAI Gym: ..................................................................................................................................... 20
4.4.1 Installing OpenAI Gym using the pip command......................................................................... 21
4.4.2 Building OpenAI Gym from Source ............................................................................................ 21
4.5 Tensorflow ........................................................................................................................................ 21
4.6 Keras.................................................................................................................................................. 22
5. Implementation ...................................................................................................................................... 24
5.1 Cartpole Game .................................................................................................................................. 24
5.2 Implementing Simple Neural Network using Keras .......................................................................... 24
5.3 Implementing Mini Deep Q Network (DQN)..................................................................................... 26
5.4 Remember......................................................................................................................................... 28
5.5 Replay................................................................................................................................................ 28
5.6 How The Agent Decides to Act ......................................................................................................... 29
5.7 Hyper Parameters ............................................................................................................................. 30
5.8 Putting It All Together: Coding The Deep Q-Learning Agent ............................................................ 31
5.9 Let’s Train the Agent ......................................................................................................................... 32
5.10 Result .............................................................................................................................................. 33
6. Screenshots ............................................................................................................................................. 35
6.1 TensorFlow detects the GPU on board ............................................................................................. 35
6.2 Agent Learning to play and maximize scores.................................................................................... 35
6.3 Agent playing the game .................................................................................................................... 36
7. Conclusion ............................................................................................................................................... 37
8. Bibilography ............................................................................................................................................ 38
Reinforcement Learning
1. INTRODUCTION
1.1 Machine Learning
Machine Learning is a sub-field of computer science that has evolved from the study of
pattern recognition. In Machine Learning, we explore the study and construction of algorithms that
can learn from data and also make predictions on data. Such algorithms operate by building a
model from example inputs in order to make data-driven predictions or decisions, rather than
following strictly static program instructions.
Machine Learning is related to computational statistics which aims at designing algorithms for
implementing statistical methods on computers. Machine Learning also has strong ties to
mathematical optimization, which can deliver methods, theory and application domains to the
field. Machine Learning is employed in many sets of computing tasks where designing and
programming explicit algorithms is possible.
Steps:
1
Reinforcement Learning
Determine the structure of the learned function and corresponding learning algorithm.
Complete the design. Run the learning algorithm on the gathered training set.
Evaluate the accuracy of the learned function. After parameter adjustment and learning, the
performance of the resulting function should be measured on a test set that is separate from the
training set.
2
Reinforcement Learning
Input: The input should be an initial state from which the model will start
Output: There are many possible output as there are variety of solution to a particular
problem
Training: The training is based upon the input, the model will return a state and the user
will decide to reward or punish the model based on its output.
The model keeps continues to learn.
The best solution is decided based on the maximum reward.
3
Reinforcement Learning
1.3 Motivation
Deep Learning agents are still unresolved because what is happening in the network isn’t
completely clear to many scholars. There are researchers working in this field to discover and solve
real-life problems using Reinforcement Learning. Solving Game AI’s using Deep Learning can be
a futuristic work as we expect the agent to learn as humans do in a trial-and-error and reward-
penalty basis. Playing with a bot that has been highly trained on a particular game can make it
really challenging for the human player and also help in developing new insights on how to play
the game better.
4
Reinforcement Learning
1.5 Observations
If we ever want to do better than take random actions at each step, it’d probably be good
to actually know what our actions are doing to the environment.
The environment’s step function returns exactly what we need. In fact, step returns four
values. These are:
This is just an implementation of the classic “agent-environment loop”. Each time step, the
agent chooses an action, and the environment returns an observation and a reward.
5
Reinforcement Learning
2. Literature Survey
Jürgen Schmidhuber talks about Deep Learning policies and underlying mechanics in his
research. In recent years, deep artificial neural networks (including recurrent ones) have won
numerous contests in pattern recognition and machine learning. Shallow and Deep Learners are
distinguished by the depth of their credit assignment paths, which are chains of possibly learnable,
causal links between actions and effects.
6
Reinforcement Learning
explore and interact with the environment by performing actions available to him in the action set
and get rewards or penalties. The agent interacts with the environment, performs actions, receives
rewards (which is the only feedback that the agent gets) and formulates a policy to crack the
problem. This approach means that the agent has to try out a ‘bad’ actions which can cause the
performance to drop but is important because the agent can never figure out the optimal policy
without trying out all the actions. To perform efficiently, the agent has to ‘exploit’ what it already
knows. Balancing these two things is the exploration-exploitation problem.
is obtained in the state S. The other possible definitions for the Reward Function are
R: S × A → R or R: S × A × S → R.
7
Reinforcement Learning
The first definition gives rewards for performing an action in a state and the second
definition gives rewards for particular transitions between states. All definitions are
interchangeable though the last one is convenient in model-free algorithms because we usually
need both the starting state and the resulting state in backing up values.
This means instead of running Q-learning on state/action pairs as they occur during simulation or
actual experience, the system stores the data discovered for [state, action, reward, next_state] -
typically in a large table. Note this does not store associated values - this is the raw data to feed
into action-value calculations later.
The learning phase is then logically separate from gaining experience, and based on taking
random samples from this table. You still want to interleave the two processes - acting and
learning - because improving the policy will lead to different behaviour that should explore
actions closer to optimal ones, and you want to learn from those. However, you can split this
how you like - e.g. take one step, learn from three random prior steps etc. The Q-Learning targets
when using experience replay use the same targets as the online version, so there is no new
formula for that. The loss formula given is also the one you would use for DQN without
experience replay. The difference is only which s, a, r, s', a' you feed into it.
In DQN, the DeepMind team also maintained two networks and switched which one was
learning and which one feeding in current action-value estimates as "bootstraps". This helped
with stability of the algorithm when using a non-linear function approximator. That's what the
bar stands for in θ ¯I - it denotes the alternate frozen version of the weights.
8
Reinforcement Learning
More efficient use of previous experience, by learning with it multiple times. This is key
when gaining real-world experience is costly, you can get full use of it. The Q-learning
updates are incremental and do not converge quickly, so multiple passes with the same
data is beneficial, especially when there is low variance in immediate outcomes (reward,
next state) given the same state, action pair.
Better convergence behaviour when training a function approximator. Partly this is
because the data is more like i.i.d. data assumed in most supervised learning convergence
proofs.
It is harder to use multi-step learning algorithms, such as Q(λ), which can be tuned to
give better learning curves by balancing between bias (due to bootstrapping) and variance
(due to delays and randomness in long-term outcomes). Multi-step DQN with experience-
replay DQN is one of the extensions explored in the paper Rainbow: Combining
Improvements in Deep Reinforcement Learning.
Now the question is, how do you formalize a reinforcement learning problem, so that you can
reason about it? The most common method is to represent it as a Markov decision process.
Suppose you are an agent, situated in an environment (e.g. Breakout game). The environment is
in a certain state (e.g. location of the paddle, location and direction of the ball, existence of every
brick and so on). The agent can perform certain actions in the environment (e.g. move the paddle
to the left or to the right). These actions sometimes result in a reward (e.g. increase in score).
Actions transform the environment and lead to a new state, where the agent can perform another
action, and so on. The rules for how you choose those actions are called policy. The environment
9
Reinforcement Learning
in general is stochastic, which means the next state may be somewhat random (e.g. when you
lose a ball and launch a new one, it goes towards a random direction).
The set of states and actions, together with rules for transitioning from one state to another and
for getting rewards, make up a Markov decision process. One episode of this process (e.g. one
game) forms a finite sequence of states, actions and rewards:
s0,a0,r1,s1,a1,r2,s2,…,sn−1,an−1,rn,sn
Here si represents the state, ai is the action and ri+1 is the reward after performing the action.
The episode ends with terminal state sn (e.g. “game over” screen). A Markov decision process
relies on the Markov assumption, that the probability of the next state si+1 depends only on
current state si and performed action ai, but not on preceding states or actions.
To perform well in long-term, we need to take into account not only the immediate rewards, but
also the future awards we are going to get. How should we go about that?
Given one run of Markov decision process, we can easily calculate the total reward for one
episode:
10
Reinforcement Learning
R = r1+r2+r3+…+rn
Given that, the total future reward from time point t onward can be expressed as:
Rt = rt+rt+1+rt+2+…+rn
But because our environment is stochastic, we can never be sure, if we will get the same rewards
the next time we perform the same actions. The more into the future we go, the more it may
diverge. For that reason it is common to use discounted future reward instead:
R t= rt+γrt+1+γ2rt+2…+γn−trn
Here γ is the discount factor between 0 and 1 – the more into the future the reward is, the less we
take it into consideration. It is easy to see, that discounted future reward at time step t can be
expressed in terms of the same thing at time step t+1:
Rt = rt+γ(rt+1+γ(rt+2+…))=rt+γRt+1
If we set the discount factor γ=0, then our strategy will be short-sighted and we rely only on the
immediate rewards. If we want to balance between immediate and future rewards, we should set
discount factor to something like γ=0.9. If our environment is deterministic and the same actions
always result in same rewards, then we can set discount factor γ=1.
A good strategy for an agent would be to always choose an action, that maximizes the discounted
future reward.
2.1.7 Q-learning
In Q-learning we define a function Q(s,a) representing the discounted future reward when we
perform action a in state s, and continue optimally from that point on.
Q(st,at)=maxπRt+1
11
Reinforcement Learning
The way to think about Q(s,a) is that it is “the best possible score at the end of game after
performing action a in state s”. It is called Q-function, because it represents the “quality” of
certain action in given state.
This may sound quite a puzzling definition. How can we estimate the score at the end of game, if
we know just current state and action, and not the actions and rewards coming after that? We
really can’t. But as a theoretical construct we can assume existence of such a function.
If you’re still not convinced, then consider what the implications of having such a function
would be. Suppose you are in state s and pondering whether you should take action a or b. You
want to select the action, that results in the highest score at the end of game. Once you have the
magical Q-function, the answer becomes really simple – pick the action with the highest Q-
value!
π(s) = argmaxaQ(s,a)
Here π represents the policy, the rule how we choose an action in each state.
OK, how do we get that Q-function then? Let’s focus on just one transition <s,a,r,s′>. Just like
with discounted future rewards in previous section we can express Q-value of state s and action a
in terms of Q-value of next state s′.
Q(s,a) = r+γmaxa′Q(s′,a′)
This is called the Bellman equation. If you think about it, it is quite logical – maximum future
reward for this state and action is the immediate reward plus maximum future reward for the next
state.
The main idea in Q-learning is that we can iteratively approximate the Q-function using the
Bellman equation. In the simplest case the Q-function is implemented as a table, with states as
rows and actions as columns. The gist of Q-learning algorithm is as simple as the following:
12
Reinforcement Learning
α in the algorithm is a learning rate that controls how much of the difference between previous
Q-value and newly proposed Q-value is taken into account. In particular, when α=1, then two
Q[s,a]-s cancel and the update is exactly the same as Bellman equation.
maxa’ Q[s',a'] that we use to update Q[s,a] is only an estimation and in early stages of learning it
may be completely wrong. However the estimations get more and more accurate with every
iteration and it has been shown, that if we perform this update enough times, then the Q-function
will converge and represent the true Q-value.
13
Reinforcement Learning
Value Iteration is guaranteed to converge in the limit towards V*, i.e., the Bellman
optimality Equation holds for each state. A deterministic policy ‘π’for all states can be computed
using the above equation.
Game Programmers used to use heuristic if-then-else type decisions to make educated
14
Reinforcement Learning
guesses. We saw this in the earliest arcade video games such as Pong and Pacman. This trend was
the norm for a very long time. But game developers cannot predict every possible scenario and
edge case. Game developers then tried to mimic how humans would play a game and then modeled
human intelligence into a game bot. This is possible by generalizing and modeling intelligence to
solve any Atari game that the bot is made to play.
The bot has a deep learning neural network that lets it learn how to play the game after
being fed only the raw pixels of the game and the game controls.
Dynamic Bots, on the other hand, learn the levels and maps as they play. In video games, AI is
used to generate responsive, adaptive or intelligent behavior primarily in non-player characters
(NPCs), similar to human-like intelligence. In most games, the “AI” in the term “Game AI”
overstates its worth as a game. “Real” Artificial Intelligence addresses the fields of Machine
Learning and decision making based on arbitrary data input whereas “Game AI” generally plays
based on rules of thumb or heuristics.
15
Reinforcement Learning
3. System Requirements
Using the OpenAI Gym and Universe libraries mentioned above we can create a bot for the Atari
environments that are included in Gym.
Keras
RAM: 8 GB
HDD: 256 GB
Graphics Processing Unit : A good Nvidia GPU with atleast 2GB of VRAM.
16
Reinforcement Learning
4. Technology
4.1 Python
Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural. It also
has a comprehensive standard library.
Python interpreters are available for many operating systems. CPython, the reference
implementation of Python, is open source software and has a community-based development
model, as do nearly all of Python's other implementations. Python and CPython are managed by
the non-profit Python Software Foundation.
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by
SETL), capable of exception handling and interfacing with the Amoeba operating system. Its
implementation began in December 1989. Van Rossum's long influence on Python is reflected in
the title given to him by the Python community: Benevolent Dictator For Life (BDFL) – a post
from which he gave himself permanent vacation on July 12, 2018.
Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-
detecting garbage collector and support for Unicode.
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not
completely backward-compatible. Many of its major features were backported to Python 2.6.x
and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at least
partially) the translation of Python 2 code to Python 3.
17
Reinforcement Learning
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that
a large body of existing code could not easily be forward-ported to Python 3. In January 2017,
Google announced work on a Python 2.7 to Go transcompiler to improve performance under
concurrent workloads.
Python's large standard library, commonly cited as one of its greatest strengths, provides tools
suited to many tasks. For Internet-facing applications, many standard formats and protocols such
as MIME and HTTP are supported. It includes modules for creating graphical user interfaces,
connecting to relational databases, generating pseudorandom numbers, arithmetic with arbitrary
precision decimals,[95] manipulating regular expressions, and unit testing.
Some parts of the standard library are covered by specifications (for example, the Web Server
Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but most modules are not.
They are specified by their code, internal documentation, and test suites (if supplied). However,
because most of the standard library is cross-platform Python code, only a few modules need
altering or rewriting for variant implementations.
As of March 2018, the Python Package Index (PyPI), the official repository for third-party
Python software, contains over 130,000 packages with a wide range of functionality, including:
18
Reinforcement Learning
4.2 CUDA
CUDA is a parallel computing platform and application programming interface (API) model
created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled
graphics processing unit (GPU) for general purpose processing — an approach termed GPGPU
(General-Purpose computing on Graphics Processing Units). The CUDA platform is a software
layer that gives direct access to the GPU's virtual instruction set and parallel computational
elements, for the execution of compute kernels.
The CUDA platform is designed to work with programming languages such as C, C++, and
Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU
resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills
in graphics programming.] Also, CUDA supports programming frameworks such as OpenACC
and OpenCL. When it was first introduced by Nvidia, the name CUDA was an acronym for
Compute Unified Device Architecture, but Nvidia subsequently dropped the use of the acronym.
The graphics processing unit (GPU), as a specialized computer processor, addresses the demands
of real-time high-resolution 3D graphics compute-intensive tasks. By 2012, GPUs had evolved
into highly parallel multi-core systems allowing very efficient manipulation of large blocks of
data. This design is more effective than general-purpose central processing unit (CPUs) for
algorithms in situations where processing large blocks of data is done in parallel.
4.3 cuDNN
cuDNN is a library for deep neural nets built using CUDA. It provides GPU accelerated
functionality for common operations in deep neural nets. You could use it directly yourself, but
other libraries like TensorFlow already have built abstractions backed by cuDNN.
19
Reinforcement Learning
Please note that CUDA and cuDNN are optional. However, they accelerate the process of
learning and prediction using the GPU. Both of these provide support to ‘tensorflow-gpu’ library
for accelerated functionality.
1. The need for better benchmarks. In supervised learning, progress has been driven by
large labeled datasets like ImageNet. In RL, the closest equivalent would be a large and
diverse collection of environments. However, the existing open-source collections of RL
environments don’t have enough variety, and they are often difficult to even set up and use.
2. Lack of standardization of environments used in publications. Subtle differences in the
problem definition, such as the reward function or the set of actions, can drastically alter a
20
Reinforcement Learning
task’s difficulty. This issue makes it difficult to reproduce published research and compare
results from different papers.
OpenAI Gym is supported in Python 3.5 or higher versions. To install OpenAI Gym, we use the
following command:
pip install gym
Another option that we have using which we can install OpenAI Gym is cloning the Git
repository. To do this, we use the following commands
4.5 Tensorflow
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks. It is used for both research and production at
Google. It is a standard expectation in the industry to have experience in TensorFlow to work in
machine learning.
TensorFlow was developed by the Google Brain team for internal Google use. It was released
under the Apache 2.0 open-source license on November 9, 2015
21
Reinforcement Learning
TensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on February
11, 2017. While the reference implementation runs on single devices, TensorFlow can run on
multiple CPUs and GPUs (with optional CUDA and SYCL extensions for general-purpose
computing on graphics processing units). TensorFlow is available on 64-bit Linux, macOS,
Windows, and mobile computing platforms including Android and iOS.
Its flexible architecture allows for the easy deployment of computation across a variety of
platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge
devices.
TensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow
derives from the operations that such neural networks perform on multidimensional data arrays,
which are referred to as tensors. During the Google I/O Conference in June 2016, Jeff Dean
stated that 1,500 repositories on GitHub mentioned TensorFlow, of which only 5 were from
Google.
Among the applications for which TensorFlow is the foundation, are automated image-
captioning software, such as DeepDream. RankBrain now handles a substantial number of search
queries, replacing and supplementing traditional static algorithm-based search results.
4.6 Keras
In 2017, Google's TensorFlow team decided to support Keras in TensorFlow's core library.
Chollet explained that Keras was conceived to be an interface rather than a standalone machine-
22
Reinforcement Learning
learning framework. It offers a higher-level, more intuitive set of abstractions that make it easy
to develop deep learning models regardless of the computational backend used. Microsoft added
a CNTK backend to Keras as well, available as of CNTK v2.0
In addition to standard neural networks, Keras has support for convolutional and recurrent neural
networks. It supports other common utility layers like dropout, batch normalization, and pooling.
Keras allows users to productize deep models on smartphones (iOS and Android), on the web, or
on the Java Virtual Machine. It also allows use of distributed training of deep-learning models on
clusters of Graphics Processing Units (GPU) and Tensor processing units (TPU).
23
Reinforcement Learning
5. Implementation
Usually, training an agent to play an Atari game takes a while (from few hours to a day). So we
will make an agent to play a simpler game called CartPole, but using the same idea used in the
paper.
CartPole is one of the simplest environments in OpenAI gym (a game simulator). As you can see
in the animation from the top, the goal of CartPole is to balance a pole connected with one joint
on top of a moving cart. Instead of pixel information, there are 4 kinds of information given by
the state, such as angle of the pole and position of the cart. An agent can move the cart by
performing a series of actions of 0 or 1 to the cart, pushing it left or right.
As we discussed above, action can be either 0 or 1. If we pass those numbers, env, which
represents the game environment, will emit the results. done is a boolean value telling whether
the game ended or not. The old stateinformation paired with action and next_state and reward is
the information we need for training the agent.
This post is not about deep learning or neural net. So we will consider neural net as just a black
box algorithm that approximately maps inputs to outputs. It is basically an algorithm that learns
on the pairs of examples input and output data, detects some kind of patterns, and predicts the
output based on an unseen input data. Though neural network itself is not the focus of this
article, we should understand how it is used in the DQN algorithm.
24
Reinforcement Learning
Note that the neural net we are going to use is similar to the diagram above. We will have one
input layer that receives 4 information and 3 hidden layers. But we are going to have 2 nodes in
the output layer since there are two buttons (0 and 1) for the game.
Keras makes it really simple to implement a basic neural network. The code below creates an
empty neural net model. activation, loss and optimizer are the parameters that define the
characteristics of the neural network, but we are not going to discuss it here.
25
Reinforcement Learning
optimizer=Adam(lr=self.learning_rate))
In order for a neural net to understand and predict based on the environment data, we have to
feed it the information. fit() method feeds input and output pairs to the model. Then the model
will train on those data to approximate the output based on the input.
This training process makes the neural net to predict the reward value from a certain state.
After training, the model now can predict the output from unseen input. When you call predict()
function on the model, the model will predict the reward of current state based on the data you
trained. Like so:
prediction = model.predict(state)
Normally in games, the reward directly relates to the score of the game. Imagine a situation
where the pole from CartPole game is tilted to the right. The expected future reward of pushing
right button will then be higher than that of pushing the left button since it could yield higher
score of the game as the pole survives longer.
In order to logically represent this intuition and train it, we need to express this as a formula that
we can optimize on. The loss is just a value that indicates how far our prediction is from the
actual target. For example, the prediction of the model could indicate that it sees more value in
pushing the right button when in fact it can gain more reward by pushing the left button. We
want to decrease this gap between the prediction and the target (loss). We will define our loss
function as follows:
26
Reinforcement Learning
We first carry out an action a, and observe the reward r and resulting new state s`. Based on the
result, we calculate the maximum target Q and then discount it so that the future reward is worth
less than immediate reward (It is a same concept as interest rate for money. Immediate payment
always worth more for same amount of money). Lastly, we add the current reward to the
discounted future reward to get the target value. Subtracting our current prediction from the
target gives the loss. Squaring this value allows us to punish the large loss value more and treat
the negative values same as the positive values.
Keras takes care of the most of the difficult tasks for us. We just need to define our target. We
can express the target in a magical one-liner in python.
Keras does all the work of subtracting the target from the neural network output and squaring it.
It also applies the learning rate we defined while creating the neural network model. This all
happens inside the fit() function. This function decreases the gap between our prediction to target
by the learning rate. The approximation of the Q-value converges to the true Q-value as we
repeat the updating process. The loss will decrease and score will grow higher.
27
Reinforcement Learning
The most notable features of the DQN algorithm are remember and replay methods. Both are
pretty simple concepts. The original DQN architecture contains a several more tweaks for better
training, but we are going to stick to a simpler version for now.
5.4 Remember
One of the challenges for DQN is that neural network used in the algorithm tends to forget the
previous experiences as it overwrites them with new experiences. So we need a list of previous
experiences and observations to re-train the model with the previous experiences. We will call
this array of experiences memory and use remember() function to append state, action, reward,
and next state to the memory.
And remember function will simply store states, actions and resulting rewards to the memory
like below:
done is just a boolean that indicates if the state is the final state.
5.5 Replay
A method that trains the neural net with experiences in the memory is called replay(). First, we
sample some experiences from the memory and call them minibath.
The above code will make minibatch, which is just a randomly sampled elements of the
memories of size batch_size. We set the batch size as 32 for this example.
28
Reinforcement Learning
To make the agent perform well in long-term, we need to take into account not only the
immediate rewards but also the future rewards we are going to get. In order to do this, we are
going to have a ‘discount rate’ or ‘gamma’. This way the agent will learn to maximize the
discounted future reward based on the given state.
Our agent will randomly select its action at first by a certain percentage, called ‘exploration rate’
or ‘epsilon’. This is because at first, it is better for the agent to try all kinds of things before it
starts to see the patterns. When it is not deciding the action randomly, the agent will predict the
reward value based on the current state and pick the action that will give the highest reward.
np.argmax() is the function that picks the highest value between two elements in the
act_values[0].
29
Reinforcement Learning
act_values[0] looks like this: [0.67, 0.2], each numbers representing the reward of picking action
0 and 1. And argmax function picks the index with the highest value. In the example of [0.67,
0.2], argmax returns 0 because the value in the 0th index is the highest.
There are some parameters that have to be passed to a reinforcement learning agent. You will see
these over and over again.
30
Reinforcement Learning
I explained each part of the agent in the above. The code below implements everything we’ve
talked about as a nice and clean class called DQNAgent.
31
Reinforcement Learning
if __name__ == "__main__":
# initialize gym environment and the agent
env = gym.make('CartPole-v0')
agent = DQNAgent(env)
# Iterate the game
for e in range(episodes):
# reset state in the beginning of each game
state = env.reset()
state = np.reshape(state, [1, 4])
# time_t represents each frame of the game
# Our goal is to keep the pole upright as long as possible until score of 500
# the more time_t the more score
32
Reinforcement Learning
5.10 Result
33
Reinforcement Learning
After several hundreds of episodes , it starts to learn how to maximize the score.
34
Reinforcement Learning
6. Screenshots
6.1 TensorFlow detects the GPU on board
35
Reinforcement Learning
36
Reinforcement Learning
7. Conclusion
We have described the Game bots project as an infrastructure for multi-agent systems
research that supports different platforms and is widely available. Reinforcement Learning is a
widely used real-world problem-solving technique and a Deep Learning based game bot can solve
any Atari and Flash game without having any prior information about the game environment. It
only needs to be fed the raw pixels of the game and the game controls.A well-trained and well-
designed game bot can outperform human gamers.
These games can also be used to test various policy gradients in the field of applied Artificial
Intelligence in a safe manner without incurring major expenses.
37
Reinforcement Learning
8. Bibilography
OpenAI Gym Repository
o https://github.com/openai/gym
https://medium.freecodecamp.org/how-to-build-an-ai-game-bot-using-openai-
gym-and-universe-f2eb9bfbb40a
GameBots: A Flexible Test Bed for Multiagent Team Research Gal A.
Kaminka, Manuela M. Veloso, Steve Schaffer, Chris Sollitto, Rogelio
Adobbati, Andrew N. Marshall, Andrew Scholer, and Sheila Tejada.
Communications of the ACM, 45(1):43–45, January 2002.
http://www.pinchofintelligence.com/getting-started-openai-gym/
School of AI’s tutorial:
https://www.youtube.com/watch?v=79pmNdyxEGo&t=302s
Deep Q- learninghttps://medium.com/emergent-future/simple-reinforcement-
learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-
d195264329d0
Demystifying Deep Reinforcement Learninghttps://neuro.cs.ut.ee/demystifying-
deep-reinforcement-learning/
Andrej Karpathy’s blog http://karpathy.github.io/2016/05/31/rl/
https://yanpanlau.github.io/2016/07/10/FlappyBird-Keras.html
38