
Dissecting Reinforcement Learning - Part.1

Massimiliano Patacchiola, Dec 9, 2016

Premise [This post is an introduction to reinforcement learning and it is meant to be the starting point for a reader who already has some machine learning background and is confident with a little bit of math and Python. When I study a new algorithm I always want to understand the underlying mechanisms and implement the algorithm from scratch using a programming language. I followed this approach in this post, which can be long to read but worthwhile.]

When I started to study reinforcement learning I did not find any good online resource which explained from the basics what reinforcement learning really is. Most of the (very good) blogs out there focus on the modern approaches (Deep Reinforcement Learning) and introduce the Bellman equation without a satisfying explanation. I turned my attention to books and I found the one by Russell and Norvig called Artificial Intelligence: A Modern Approach. This post is based on chapter 17 of the second edition, and it can be considered an extended review of that chapter. I will use the same mathematical notation as the authors, so you can use the book to cover some missing parts or vice versa. You can find the complete code used in this post in my github repository, together with the PDF version of the post. In the next section I will introduce Markov chains; if you already know this concept you can skip ahead…

In the beginning was Andrey Markov


Andrey Markov was a Russian mathematician who studied stochastic processes. Markov was particularly interested in systems that follow a chain of linked events. In 1906 Markov produced interesting results about discrete processes that he called chains. A Markov chain has a set of states $S = \{s_0, s_1, ..., s_m\}$ and a process that can move successively from one state to the other. Each move is a single step and is based on a transition model $T$. You should make some effort to remember the keywords in bold, we will use them extensively during the rest of the article. To summarise, a Markov chain is defined as follows:

1. Set of possible States: $S = \{s_0, s_1, ..., s_m\}$
2. Initial State: $s_0$
3. Transition Model: $T(s, s')$

There is something peculiar in a Markov chain that I did not mention. A Markov chain is based on the Markov Property. The Markov property states that given the present, the future is conditionally independent of the past. That is, the state the process is in now depends only on the state it was in at $t-1$. An example can simplify the digestion of Markov chains. Let's suppose we have a chain with only two states $s_0$ and $s_1$, where $s_0$ is the initial state. The process is in $s_0$ 90% of the time and it can move to $s_1$ the remaining 10% of the time. When the process is in state $s_1$ it will remain there 50% of the time. Given this data we can create a Transition Matrix $T$ as follows:

$$T = \begin{bmatrix} 0.90 & 0.10 \\ 0.50 & 0.50 \end{bmatrix}$$

The transition matrix is always a square matrix, and since we are dealing with probability distributions all the entries are between 0 and 1 and each row sums to 1. We can represent the Markov chain graphically: each state of the chain is a node and the transition probabilities are edges, with higher probabilities drawn as thicker edges.
Until now we did not mention time, but we have to, because Markov chains are dynamical processes which evolve in time. Let's suppose we have to guess where the process will be after 3 steps and after 50 steps. How can we do it? We are interested in chains that have a finite number of states and are time-homogeneous, meaning that the transition matrix does not change over time. Given these assumptions we can compute the k-step transition probabilities as the k-th power of the transition matrix. Let's do it in Numpy:

import numpy as np

#Declaring the Transition Matrix T


T = np.array([[0.90, 0.10],
[0.50, 0.50]])

#Obtaining T after 3 steps


T_3 = np.linalg.matrix_power(T, 3)
#Obtaining T after 50 steps
T_50 = np.linalg.matrix_power(T, 50)
#Obtaining T after 100 steps
T_100 = np.linalg.matrix_power(T, 100)

#Printing the matrices


print("T: " + str(T))
print("T_3: " + str(T_3))
print("T_50: " + str(T_50))
print("T_100: " + str(T_100))

T: [[ 0.9 0.1]
[ 0.5 0.5]]

T_3: [[ 0.844 0.156]


[ 0.78 0.22 ]]

T_50: [[ 0.83333333 0.16666667]


[ 0.83333333 0.16666667]]

T_100: [[ 0.83333333 0.16666667]


[ 0.83333333 0.16666667]]

Now we define the initial distribution, which represents the state of the system at k=0. Our system is composed of two states and we can model the initial distribution as a vector with two elements: the first element of the vector represents the probability of being in state $s_0$ and the second element the probability of being in state $s_1$. Let's suppose that we start from $s_0$; the vector $\mathbf{v}$ representing the initial distribution will have the form:

$$\mathbf{v} = (1, 0)$$

We can calculate the probability of being in a specific state after k iterations by multiplying the initial distribution and the transition matrix: $\mathbf{v} \cdot T^{k}$. Let's do it in Numpy:

import numpy as np

#Declaring the initial distribution


v = np.array([[1.0, 0.0]])
#Declaring the Transition Matrix T
T = np.array([[0.90, 0.10],
[0.50, 0.50]])

#Obtaining T after 3 steps


T_3 = np.linalg.matrix_power(T, 3)
#Obtaining T after 50 steps
T_50 = np.linalg.matrix_power(T, 50)
#Obtaining T after 100 steps
T_100 = np.linalg.matrix_power(T, 100)

#Printing the initial distribution


print("v: " + str(v))
print("v_1: " + str(np.dot(v,T)))
print("v_3: " + str(np.dot(v,T_3)))
print("v_50: " + str(np.dot(v,T_50)))
print("v_100: " + str(np.dot(v,T_100)))

v: [[ 1. 0.]]

v_1: [[ 0.9 0.1]]

v_3: [[ 0.844 0.156]]

v_50: [[ 0.83333333 0.16666667]]

v_100: [[ 0.83333333 0.16666667]]

What's going on? The process starts at $s_0$ and after one iteration we can be 90% sure it is still in that state. This is easy to grasp: our transition model says that the process can stay in $s_0$ with 90% probability, nothing new. Looking at the state distribution at k=3 we notice that there is something different. We are moving into the future and different branches are possible. If we want to find the probability of being in state $s_0$ after three iterations we should sum over all the possible branches that lead to $s_0$. A picture is worth a thousand words:

The probability of being in $s_0$ at $k=3$ is given by (0.729 + 0.045 + 0.045 + 0.025), which is equal to 0.844: we got the same result. Now let's suppose that at the beginning we have some uncertainty about the starting state of our process, and let's define another starting vector as follows:

$$\mathbf{v} = (0.5, 0.5)$$

That's it, with a probability of 50% we can start from $s_0$. Running the Python script again we print the results after 1, 3, 50 and 100 iterations:

v: [[ 0.5 0.5]]

v_1: [[ 0.7 0.3]]

v_3: [[ 0.812 0.188]]


v_50: [[ 0.83333333 0.16666667]]

v_100: [[ 0.83333333 0.16666667]]

This time the probability of being in $s_0$ at k=3 is lower (0.812), but in the long run we have the same outcome (0.8333333). What is happening in the long run? The results after 50 and 100 iterations are the same, and v_50 is equal to v_100 no matter which starting distribution we have. The chain converged to equilibrium, meaning that as time progresses it forgets about the starting distribution. But we have to be careful: the convergence is not always guaranteed. The dynamics of a Markov chain can be very complex, in particular it is possible to have transient and recurrent states. For our scope what we saw is enough. I suggest you take a look at the setosa.io blog because they have an interactive page for Markov chain visualization.
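To make the idea of equilibrium more concrete, here is a small sketch (not part of the original scripts) that computes the stationary distribution directly: it is the left eigenvector of $T$ associated with the eigenvalue 1, normalized so that its entries sum to one.

import numpy as np

T = np.array([[0.90, 0.10],
              [0.50, 0.50]])

#The equilibrium distribution v satisfies v = v.T (a left eigenvector of T
#with eigenvalue 1.0), so we take that eigenvector of the transposed matrix
#and normalize it.
eigenvalues, eigenvectors = np.linalg.eig(T.T)
stationary = np.real(eigenvectors[:, np.isclose(eigenvalues, 1.0)]).flatten()
stationary = stationary / np.sum(stationary)
print(stationary)  #[ 0.83333333  0.16666667]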

Markov Decision Process


In reinforcement learning we use a concept closely related to Markov chains: the Markov Decision Process (MDP). An MDP is a reinterpretation of Markov chains which includes an agent and a decision making stage. An MDP is defined by these components:

1. Set of possible States: $S = \{s_0, s_1, ..., s_m\}$
2. Initial State: $s_0$
3. Set of possible Actions: $A = \{a_0, a_1, ..., a_n\}$
4. Transition Model: $T(s, a, s')$
5. Reward Function: $R(s)$

As you can see we are introducing some new elements compared to Markov chains. The transition model now depends on the current state, the next state and the action of the agent. The transition model returns the probability of reaching state $s'$ if the action $a$ is taken in state $s$. But given $s$ and $a$, the model is conditionally independent of all previous states and actions (Markov Property). Moreover, there is the Reward function $R(s)$, which returns a real value every time the agent moves from one state to the other. (Attention: defining the Reward function to depend only on $s$ can be confusing. Russell and Norvig used this notation in the book to simplify the description; it does not change the problem in any significant way.) Since we have a reward function we can say that some states are more desirable than others, because when the agent moves into those states it receives a higher reward. On the contrary, there are states that are not desirable at all, because when the agent moves there it receives a negative reward.

Problem: the agent has to maximise the reward, avoiding states that return negative values and choosing the ones that return positive values.
Solution: find a policy $\pi(s)$ that returns the action with the highest reward.

The agent can try different policies but only one of them can be considered an optimal policy, denoted by $\pi^{*}$, which yields the highest expected utility. It is time to introduce an example that I am going to use throughout the post. This example is inspired by the simple environment presented by Russell and Norvig in chapter 17.1 of their book. Let's suppose we have a cleaning robot that has to reach a charging station. Our simple world is a 4x3 matrix where the starting point $s_0$ is at (1,1), the charging station at (4,3), dangerous stairs at (4,2), and an obstacle at (2,2). The robot has to find the best way to reach the charging station (Reward +1) and to avoid falling down the flight of stairs (Reward -1). Every time the robot takes a decision it is possible to have the interference of a stochastic factor (e.g. the ground is slippery, an evil cat is stinging the robot), which makes the robot diverge from the original path 20% of the time. If the robot decides to go ahead, in 10% of the cases it will end up in the state on its left and in 10% of the cases in the state on its right. If the robot hits the wall or the obstacle it will bounce back to the previous position. The main characteristics of this world are the following:

Discrete time and space


Fully observable
Infinite horizon
Known Transition Model

The environment is fully observable, meaning that the robot always knows which state it is in. The infinite horizon clause should be explained further. Infinite horizon means that there is no fixed time limit: if the agent has a policy for going back and forth between the same two states, it will go on forever. This assumption does not mean that in every episode the agent has to pass through an infinite series of states; when one of the two terminal states is reached, the episode stops. A representation of this world and the transition model are reported below. Be careful with the indexing used by Russell and Norvig, it can be confusing: they name each state of the world by its column and row, starting from the bottom-left corner.

I said that the aim of the robot is to find the best way to reach the charging station, but what does the best way mean? Depending on the type of reward the robot receives for each intermediate state we can have different optimal policies $\pi^{*}$. Let's suppose we are programming the firmware of the robot. Based on the battery level we give a different reward at each time step. The rewards for the two terminal states remain the same (charger=+1, stairs=-1). The obstacle at (2,2) is not a valid state and therefore there is no reward associated to it. Given these assumptions we can have four different cases:

1. $R(s) \leq -1.6284$: extremely low battery
2. $-0.4278 \leq R(s) \leq -0.085$: quite low battery
3. $-0.0221 \leq R(s) \leq 0$: slightly low battery
4. $R(s) > 0$: fully charged

For each one of these conditions we can try to guess which policy the agent will choose. In the extremely low battery scenario the agent receives such a high punishment that it only wants to stop the pain as soon as possible. Life is so painful that falling down the flight of stairs is a good choice. In the quite low battery scenario the agent takes the shortest path to the charging station, it does not care about falling down. In the slightly low battery case the robot does not take risks at all and it avoids the stairs at the cost of banging against the wall. Finally, in the fully charged case the agent remains in a steady state receiving a positive reward at each time step. Until now we know the kinds of policies that can emerge in specific environments with defined rewards, but there is still something I did not talk about: how can the agent choose the best policy?

The Bellman equation


The previous section finished with a question: how can the agent choose the best policy? To give an answer to this question I will present the Bellman equation. First of all we have to find a way to compare two policies. We can use the reward given at each state to obtain a measure of the utility of a state sequence. We define the utility of a state history as:

$$U_h = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \ldots + \gamma^n R(s_n)$$


The previous formula defines the Discounted Rewards of a state sequence, where $\gamma \in [0, 1]$ is called the discount factor. The discount factor describes the preference of the agent for current rewards over future rewards. A discount factor of 1.0 collapses the previous formula into additive rewards. The discounted rewards are not the only way we can estimate the utility, but they are the one giving fewer problems. For example, in the case of an infinite sequence of states the discounted reward still gives a finite utility (with rewards bounded by $R_{max}$, the geometric series sums to at most $R_{max}/(1-\gamma)$); moreover, we can also compare infinite sequences using the average reward obtained per time step. How do we compare the utility of single states? The utility $U(s)$ can be defined as:

$$U(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\right]$$

Let's recall that the utility is defined with respect to a policy $\pi$, which for simplicity I did not mention. Once we have the utilities, how can we choose the best action for the next state? Using the maximum expected utility principle, which says that a rational agent should choose an action that maximises its expected utility. We are a step closer to the Bellman equation. What we miss is to recall that the utility of a state $s$ is correlated with the utility of its neighbours $s'$, meaning:

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s,a,s')\, U(s')$$

We just derived the Bellman equation! Using the Bellman equation an agent can estimate the best action to take and find the optimal policy. Let's try to dissect this equation. First, the term $R(s)$ is something we have to include for sure: we are in state $s$ and we know the reward given for that state, so the utility must take it into account. Second, notice that the equation uses the transition model $T$, which is multiplied by the utility of the next state $s'$. If you think about it, this makes sense: a state which has a low probability of happening (like the 10% probability of moving to the left or to the right in our simplified world) will have a lower weight in the summation.

To empirically test the Bellman equation we are going to use our cleaning robot in the simplified 4x3 environment. In this example the reward for each non-terminal state is $R(s) = -0.04$. We can imagine having the utility values for each one of the states; for the moment you do not need to know how we got these values, imagine they appeared magically. In the same magical way we obtained the optimal policy for the world (useful to double-check whether what we will obtain from the Bellman equation makes sense). This image is very important, keep it in mind.
In our example we suppose the robot starts from the state (1,1). Using the Bellman equation we have to find the action with the highest utility among UP, LEFT, DOWN and RIGHT. We do not have the optimal policy, but we have the transition model and the utility values for each state. You have to recall the two main rules of our environment: (i) if the robot bounces against a wall it goes back to the previous state, and (ii) the selected action is executed only with a probability of 80%, in accordance with the transition model. Instead of dealing with those ugly numbers I want to show you a visual representation of the possible outcomes:

For each possible outcome I reported the utility and the probability given by the transition model. This corresponds to the first part of the Bellman equation. The next step is to calculate the product between the utility and the transition probability, then sum up the values for each action.
We found out that for state (1,1) the action UP has the highest value. This is in accordance with the optimal policy we magically got. This part of the Bellman equation returns the action that maximizes the expected utility of the subsequent state, which is what an optimal policy should do:

$$\pi^{*}(s) = \underset{a}{\operatorname{argmax}} \sum_{s'} T(s,a,s')\, U(s')$$

Now we have all the elements and we can plug the values into the Bellman equation, finding the utility of the state (1,1):

$$U(s_{11}) = -0.04 + 1.0 \times 0.7456 = 0.7056$$
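As a quick check, we can reproduce these numbers by hand. The snippet below is only a back-of-the-envelope sketch: it hard-codes the utilities of state (1,1) and of its reachable neighbours (taken from the magic utility values above) and applies the 80/10/10 transition model, remembering that bouncing against a wall leaves the robot where it is.

#Utilities of (1,1), of the state above it, and of the state on its right
u_here, u_up, u_right = 0.705, 0.762, 0.655

#Expected utility of each action: 80% intended direction, 10% left, 10% right
#(moving into a wall keeps the robot in place)
expected = {"UP":    0.8*u_up    + 0.1*u_here + 0.1*u_right,  #0.7456
            "LEFT":  0.8*u_here  + 0.1*u_up   + 0.1*u_here,   #0.7107
            "DOWN":  0.8*u_here  + 0.1*u_here + 0.1*u_right,  #0.7000
            "RIGHT": 0.8*u_right + 0.1*u_up   + 0.1*u_here}   #0.6707

print(max(expected, key=expected.get))   #UP
print(-0.04 + 1.0 * expected["UP"])      #0.7056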

The Bellman equation works! What we need now is a Python implementation of the equation to use in our simulated world. We are going to use the same terminology as in the previous sections. Our world has 4x3=12 possible states. The starting vector contains 12 values and the transition matrix is a huge 12x12x4 matrix (12 starting states, 12 next states, 4 actions) where most of the values are zeros (we can only move from one state to its neighbours). I generated the transition matrix using a script and I saved it as a Numpy matrix (you can download it here). In the script I defined the function return_state_utility(), which is an implementation of the Bellman equation. Using this function we are going to print the utility of the state (1,1) and check if it is the same we found previously:

import numpy as np

def return_state_utility(v, T, u, reward, gamma):
    """Return the state utility.

    @param v the state vector
    @param T transition matrix
    @param u utility vector
    @param reward for that state
    @param gamma discount factor
    @return the utility of the state
    """
    action_array = np.zeros(4)
    for action in range(0, 4):
        action_array[action] = np.sum(np.multiply(u, np.dot(v, T[:,:,action])))
    return reward + gamma * np.max(action_array)

def main():
    #Starting state vector
    #The agent starts from (1, 1)
    v = np.array([[0.0, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.0,
                   1.0, 0.0, 0.0, 0.0]])

    #Transition matrix loaded from file
    #(It is too big to write here)
    T = np.load("T.npy")

    #Utility vector
    u = np.array([[0.812, 0.868, 0.918,   1.0,
                   0.762,   0.0, 0.660,  -1.0,
                   0.705, 0.655, 0.611, 0.388]])

    #Defining the reward for state (1,1)
    reward = -0.04
    #Assuming that the discount factor is equal to 1.0
    gamma = 1.0

    #Use the Bellman equation to find the utility of state (1,1)
    utility_11 = return_state_utility(v, T, u, reward, gamma)
    print("Utility of state (1,1): " + str(utility_11))

if __name__ == "__main__":
    main()

Utility of state (1,1): 0.7056

That's great, we obtained exactly the same value! Until now we supposed that the utility values appeared magically. Instead of relying on a magician we want an algorithm to obtain these values. There is a problem: for $n$ possible states there are $n$ Bellman equations, and each equation contains $n$ unknowns. Using any linear algebra package it would be possible to solve these equations, but the problem is that they are not linear because of the max operator. What to do? We can use the value iteration algorithm…

The value iteration algorithm


The Bellman equation is the core of the value iteration algorithm for solving an MDP. Our objective is to find the utility (also called value) for each state. As we said, we cannot use a linear algebra library, so we need an iterative approach. We start with arbitrary initial utility values (usually zeros). Then we calculate the utility of a state using the Bellman equation and we assign it to the state. This update is called the Bellman update. Applying the Bellman update infinitely often we are guaranteed to reach an equilibrium. Once we have reached the equilibrium we have the utility values we were looking for, and we can use them to estimate the best move for each state. How do we know when the algorithm reaches the equilibrium? We need a stopping criterion. Comparing the utilities between two consecutive iterations, we can stop the algorithm when no state's utility changes by much:

$$||U_{k+1} - U_k|| < \epsilon \frac{1-\gamma}{\gamma}$$

This result is a consequence of the contraction property, which I will skip because it is well explained in chapter 17.2 of the book. Ok, it's time to implement the algorithm in Python. I will reuse the return_state_utility() function to update the utility vector u.

import numpy as np

def return_state_utility(v, T, u, reward, gamma):
    """Return the state utility.

    @param v the state vector
    @param T transition matrix
    @param u utility vector
    @param reward for that state
    @param gamma discount factor
    @return the utility of the state
    """
    action_array = np.zeros(4)
    for action in range(0, 4):
        action_array[action] = np.sum(np.multiply(u, np.dot(v, T[:,:,action])))
    return reward + gamma * np.max(action_array)

def main():
    #Change as you want
    tot_states = 12
    gamma = 0.999 #Discount factor
    iteration = 0 #Iteration counter
    epsilon = 0.01 #Stopping criteria small value

    #List containing the data for each iteration
    graph_list = list()

    #Transition matrix loaded from file (It is too big to write here)
    T = np.load("T.npy")

    #Reward vector
    r = np.array([-0.04, -0.04, -0.04,  +1.0,
                  -0.04,   0.0, -0.04,  -1.0,
                  -0.04, -0.04, -0.04, -0.04])

    #Utility vectors
    u = np.array([0.0, 0.0, 0.0, 0.0,
                  0.0, 0.0, 0.0, 0.0,
                  0.0, 0.0, 0.0, 0.0])
    u1 = np.array([0.0, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.0])

    while True:
        delta = 0
        u = u1.copy()
        iteration += 1
        graph_list.append(u)
        for s in range(tot_states):
            reward = r[s]
            v = np.zeros((1,tot_states))
            v[0,s] = 1.0
            u1[s] = return_state_utility(v, T, u, reward, gamma)
            delta = max(delta, np.abs(u1[s] - u[s])) #Stopping criteria
        if delta < epsilon * (1 - gamma) / gamma:
            print("=================== FINAL RESULT ==================")
            print("Iterations: " + str(iteration))
            print("Delta: " + str(delta))
            print("Gamma: " + str(gamma))
            print("Epsilon: " + str(epsilon))
            print("===================================================")
            print(u[0:4])
            print(u[4:8])
            print(u[8:12])
            print("===================================================")
            break

if __name__ == "__main__":
    main()

It is interesting to take a look at the stabilization of each utility during the convergence. Using matplotlib I drew the utility value of each state for 25 iterations.
Using the same code I ran different simulations with different values for the discount factor gamma. When the discount factor approaches 1.0 our prediction of the utilities gets more precise. In the limit case of gamma = 1.0 the algorithm never ends because we never reach the stopping criterion.
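The plotting code is not reported in the post; a minimal matplotlib sketch that produces this kind of figure from graph_list (for example by calling it inside main() just before the break) could look like this:

import numpy as np
import matplotlib.pyplot as plt

def plot_utility_history(graph_list):
    """Plot the utility of each state against the iteration number."""
    data = np.array(graph_list)  #shape: (iterations, tot_states)
    for state in range(data.shape[1]):
        plt.plot(data[:, state], label="state " + str(state))
    plt.xlabel("Iterations")
    plt.ylabel("Utility")
    plt.legend(loc="lower right", fontsize="small")
    plt.show()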

=================== FINAL RESULT ==================


Iterations: 9
Delta: 0.000304045
Gamma: 0.5
Epsilon: 0.001
===================================================
[[ 0.00854086 0.12551955 0.38243452 1. ]]
[[-0.04081336 0. 0.06628399 -1. ]]
[[-0.06241921 -0.05337728 -0.01991461 -0.07463402]]
===================================================

=================== FINAL RESULT ==================


Iterations: 16
Delta: 0.000104779638547
Gamma: 0.9
Epsilon: 0.001
===================================================
[[ 0.50939438 0.64958568 0.79536209 1. ]]
[[ 0.39844322 0. 0.48644002 -1. ]]
[[ 0.29628832 0.253867 0.34475423 0.12987275]]
===================================================

=================== FINAL RESULT ==================


Iterations: 29
Delta: 9.97973302774e-07
Gamma: 0.999
Epsilon: 0.001
===================================================
[[ 0.80796344 0.86539911 0.91653199 1. ]]
[[ 0.75696623 0. 0.65836281 -1. ]]
[[ 0.69968285 0.64882069 0.6047189 0.38150244]]
===================================================

There is another algorithm that allows us to find the utility vector and
at the same time an optimal policy, the policy iteration algorithm.

The policy iteration algorithm


With the value iteration algorithm we have a way to estimate the utility of each state. What we still miss is a way to estimate an optimal policy. In this section I am going to show you how we can use the policy iteration algorithm to find an optimal policy that maximizes the expected reward. No policy generates more reward than the optimal policy $\pi^{*}$. Policy iteration is guaranteed to converge, and at convergence the current policy and its utility function are the optimal policy and the optimal utility function. First of all, we define a policy $\pi$ assigning an action to each state. We can assign random actions to this policy, it does not matter. Using the return_state_utility() function (the Bellman equation) we can compute the expected utility of the policy. There is good news: we do not really need the complete version of the Bellman equation, which is:

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s,a,s')\, U(s')$$

Since we have a policy, and the policy associates an action to each state, we can get rid of the max operator and use a simplified version of the Bellman equation:

$$U(s) = R(s) + \gamma \sum_{s'} T(s,\pi(s),s')\, U(s')$$

Once we have evaluated the policy, we can improve it. Policy improvement is the second and last step of the algorithm. Our environment has a finite number of states and therefore a finite number of policies, and each iteration returns a better policy. I have implemented a function called return_policy_evaluation() containing the simplified version of the Bellman equation. Moreover, we need the function return_expected_action(), which returns the action with the highest utility based on the current values of u and T. To check what's going on I also created a print function that maps each action contained in the policy vector p to a symbol and prints it on the terminal.

import numpy as np

def return_policy_evaluation(p, u, r, T, gamma):
    """Return the policy utility.

    @param p policy vector
    @param u utility vector
    @param r reward vector
    @param T transition matrix
    @param gamma discount factor
    @return the utility vector u
    """
    for s in range(12):
        if not np.isnan(p[s]):
            v = np.zeros((1,12))
            v[0,s] = 1.0
            action = int(p[s])
            u[s] = r[s] + gamma * np.sum(np.multiply(u, np.dot(v, T[:,:,action])))
    return u

def return_expected_action(u, T, v):
    """Return the expected action.

    It returns an action based on the
    expected utility of doing a in state s,
    according to T and u. This action is
    the one that maximizes the expected
    utility.
    @param u utility vector
    @param T transition matrix
    @param v starting vector
    @return expected action (int)
    """
    actions_array = np.zeros(4)
    for action in range(4):
        #Expected utility of doing a in state s, according to T and u.
        actions_array[action] = np.sum(np.multiply(u, np.dot(v, T[:,:,action])))
    return np.argmax(actions_array)

def print_policy(p, shape):
    """Printing utility.

    Print the policy actions using symbols:
    ^, v, <, >   up, down, left, right
    *            terminal states
    #            obstacles
    """
    counter = 0
    policy_string = ""
    for row in range(shape[0]):
        for col in range(shape[1]):
            if(p[counter] == -1): policy_string += " * "
            elif(p[counter] == 0): policy_string += " ^ "
            elif(p[counter] == 1): policy_string += " < "
            elif(p[counter] == 2): policy_string += " v "
            elif(p[counter] == 3): policy_string += " > "
            elif(np.isnan(p[counter])): policy_string += " # "
            counter += 1
        policy_string += '\n'
    print(policy_string)

Now I am going to use these functions in a main loop that implements the policy iteration algorithm. I declared a new vector p containing the actions for each state. The stopping condition of the algorithm is based on the difference between the utility vectors of two consecutive iterations: the algorithm terminates when the improvement step has no effect (or only a very small effect) on the utilities.

def main():
    gamma = 0.999
    epsilon = 0.0001
    iteration = 0
    T = np.load("T.npy")
    #Generate the first policy randomly
    # NaN=Nothing, -1=Terminal, 0=Up, 1=Left, 2=Down, 3=Right
    p = np.random.randint(0, 4, size=(12)).astype(np.float32)
    p[5] = np.NaN
    p[3] = p[7] = -1
    #Utility vectors
    u = np.array([0.0, 0.0, 0.0, 0.0,
                  0.0, 0.0, 0.0, 0.0,
                  0.0, 0.0, 0.0, 0.0])
    #Reward vector
    r = np.array([-0.04, -0.04, -0.04,  +1.0,
                  -0.04,   0.0, -0.04,  -1.0,
                  -0.04, -0.04, -0.04, -0.04])

    while True:
        iteration += 1
        #1- Policy evaluation
        u_0 = u.copy()
        u = return_policy_evaluation(p, u, r, T, gamma)
        #Stopping criteria
        delta = np.absolute(u - u_0).max()
        if delta < epsilon * (1 - gamma) / gamma: break
        for s in range(12):
            if not np.isnan(p[s]) and not p[s]==-1:
                v = np.zeros((1,12))
                v[0,s] = 1.0
                #2- Policy improvement
                a = return_expected_action(u, T, v)
                if a != p[s]: p[s] = a
        print_policy(p, shape=(3,4))

    print("=================== FINAL RESULT ==================")
    print("Iterations: " + str(iteration))
    print("Delta: " + str(delta))
    print("Gamma: " + str(gamma))
    print("Epsilon: " + str(epsilon))
    print("===================================================")
    print(u[0:4])
    print(u[4:8])
    print(u[8:12])
    print("===================================================")
    print_policy(p, shape=(3,4))
    print("===================================================")

if __name__ == "__main__":
    main()

Running the script with gamma=0.999 and epsilon=0.0001 we get


convergence in 22 iterations with the following result:

=================== FINAL RESULT ==================


Iterations: 22
Delta: 9.03617490833e-08
Gamma: 0.999
Epsilon: 0.0001
===================================================
[ 0.80796344 0.86539911 0.91653199 1. ]
[ 0.75696624 0. 0.65836281 -1. ]
[ 0.69968295 0.64882105 0.60471972 0.38150427]
===================================================
> > > *
^ # ^ *
^ < < <
===================================================

The final policy returned by the algorithm is equal to the optimal policy. Moreover, using the simplified Bellman equation the algorithm managed to find good values for the utility vector. If we look at the policy evolution we will notice something interesting. At the beginning the policy is randomly generated. After four iterations the algorithm finds a sub-optimal policy and sticks to it until iteration 10, when it finds the optimal policy. From iteration 10 until iteration 22 the algorithm does not change the policy at all. A sub-optimal policy can be a problem in model-free reinforcement learning, because greedy agents can stick to it, but for the moment it is not a problem for us.

Policy iteration and value iteration: which is best? If you have many actions or you start from a fair policy then choose policy iteration. If you have few actions and the transitions are acyclic then choose value iteration. If you want the best of both worlds then take a look at the modified policy iteration algorithm.

Policy evaluation using linear algebra


I said that eliminating the max operator from the Bellman equation made our life easier because we could use any linear algebra package to calculate the utilities. In this last section I would like to show you how to reach the same conclusion using a linear algebra approach. In the simplified Bellman equation we have a linear system with $n$ variables and $n$ constraints. Remember that here we are dealing with matrices and vectors. Given a policy p and the action p[s] associated to the state s, the reward vector r, the transition matrix T and the discount factor gamma, we can estimate the utility in a single line of code:

u[s] = np.linalg.solve(np.identity(12) - gamma*T[:,:,int(p[s])], r)[s]


I used the Numpy method np.linalg.solve() that takes as input the
coefficient matrix A and an array of dependent values b, finding (if it
exists) the exact solution to a system of linear equations. If the
solution does not exist (e.g. the matrix is not square, or the row-
columns are not linearly independent), it is necessary to use the
least-squares approximation via the method np.linalg.lstsq(). In
both cases we get the solution to the system A x = b. For the matrix
A we pass the difference between an identity matrix I and gamma * T,
for the dependent array b we pass the reward vector r. Why we
pass as first parameter I - gamma*T ? We can derive this value
starting from the simplified Bellman equation:
𝐮 = 𝐫 + 𝛾𝑇𝐮
(𝐼 − 𝛾𝑇)𝐮 = 𝐫
𝐮 = (𝐼 − 𝛾𝑇)−1 𝐫

In fact, we could obtain u by implementing the last equation in Numpy:

u[s] = np.dot(np.linalg.inv(np.identity(12) - gamma*T[:,:,int(p[s])]), r)[s]

If you want to use the last expression when an exact solution does not exist, you need to invert the matrix using the pseudoinverse and the Numpy method np.linalg.pinv(). In the end, I prefer to use np.linalg.solve() or np.linalg.lstsq(), which do the same thing but are much more readable.
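Putting the pieces together, here is a minimal sketch (assuming the same T, r, gamma and policy vector p defined in the previous sections, with -1 marking terminal states and NaN the obstacle) that evaluates the whole utility vector of a fixed policy in one shot, by building the policy-specific transition matrix and solving the linear system:

import numpy as np

def policy_evaluation_linalg(p, r, T, gamma):
    """Evaluate a fixed policy by solving (I - gamma*T_pi) u = r.

    T_pi[s,:] is the row of T(s, a, s') for the action a = p[s] chosen
    by the policy; terminal states and the obstacle keep a zero row, so
    their utility is simply their reward.
    """
    tot_states = r.shape[0]
    T_pi = np.zeros((tot_states, tot_states))
    for s in range(tot_states):
        if not np.isnan(p[s]) and not p[s] == -1:
            T_pi[s, :] = T[s, :, int(p[s])]
    return np.linalg.solve(np.identity(tot_states) - gamma * T_pi, r)

Calling u = policy_evaluation_linalg(p, r, T, gamma) would replace the iterative evaluation step of policy iteration with a single exact solve.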

Conclusions
In this first part I summarised the fundamental ideas behind reinforcement learning. As an example, I used a finite environment with a predefined transition model. What happens if we do not have the transition model? In the next post I will introduce model-free reinforcement learning, which answers this question with a new set of interesting tools. You can find the full code in my github repository.

Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.

Resources
The dissecting-reinforcement-learning repository.

The setosa blog, containing a good-looking simulator for Markov chains.
Official github repository for the book “Artificial Intelligence: A Modern Approach”.

References
Bellman, R. (1957). A Markovian decision process (No. P-1066). RAND Corporation, Santa Monica, CA.

Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., & Edwards, D. D. (2003). Artificial intelligence: a modern approach (Vol. 2). Upper Saddle River, NJ: Prentice Hall.

Dissecting Reinforcement Learning - Part.2

Massimiliano Patacchiola, Jan 15, 2017

Welcome to the second part of the dissecting reinforcement learning series. If you managed to survive the first part then congratulations! You learnt the foundation of reinforcement learning, the dynamic programming approach. As promised, in this second part I will go deeper into model-free reinforcement learning (for prediction and control), giving an overview of Monte Carlo (MC) methods. This post is (weakly) connected to part one, and I will use the same terminology, examples and mathematical notation. I will merge some of the ideas presented by Russell and Norvig in Artificial Intelligence: A Modern Approach and the classical Reinforcement Learning: An Introduction by Sutton and Barto. In particular I will focus on chapter 21 (second edition) of the former and on chapter 5 (first edition) of the latter. Moreover, you can follow lecture 4 and lecture 5 of David Silver's course. For open versions of the books look at the resource section.

All right, now with the same spirit of the previous part I am going to
dissect one-by-one all the concepts we will step through.

Beyond dynamic programming


In the first post I introduced the two main algorithms for computing optimal policies: value iteration and policy iteration. We modelled the environment as a Markov decision process (MDP), and we used a transition model to describe the probability of moving from one state to the other. The transition model was stored in a matrix T and used to find the utility function $U^{*}$ and the best policy $\pi^{*}$. Here we must be careful with the mathematical notation. In the book of Sutton and Barto the utility function is called value function or state-value function and is indicated with the letter $V$. In order to keep everything uniform I will use the notation of Russell and Norvig, which uses the letter $U$ to identify the utility function. The two notations have the same meaning: they define the value of a state as the expected cumulative future discounted reward starting from that state. The reader should get used to different notations, it is a good form of mental gymnastics.

Now I would like to give a proper definition of model-free reinforcement learning, and in particular of passive and active reinforcement learning. In model-free reinforcement learning the first thing we miss is a transition model. In fact the name model-free stands for transition-model-free. The second thing we miss is the reward function $R(s)$, which gives the agent the reward associated to a particular state. In the passive approach we have a policy $\pi$ used by the agent to move in the environment. In state $s$ the agent always produces the action $a$ given by the policy $\pi$. The goal of the agent in passive reinforcement learning is to learn the utility function $U^{\pi}(s)$. Sutton and Barto call this case MC for prediction. It is also possible to estimate the optimal policy while moving in the environment. In this case we are in the active setting, and using the words of Sutton and Barto we will say that we are applying MC for control estimation. Here I will use again the example of the cleaning robot from the first post, but with a different setup.

The robot is in a 4x3 world with an unknown transition model. The only information about the environment is which states are available. Since the robot does not have the reward function, it does not know which state contains the charging station (+1) and which state contains the stairs (-1). Only in the passive case does the robot have a policy it can follow to move in the world. Finally, regarding the transition model, since the robot does not know what is going to happen after each action it can only assign unknown probabilities to each possible outcome. To summarize, in the passive case this is what we have:

1. Set of possible States: $S = \{s_0, s_1, ..., s_m\}$
2. Initial State: $s_0$
3. Set of possible Actions: $A = \{a_0, a_1, ..., a_n\}$
4. The policy $\pi$

In passive reinforcement learning our objective is to use the available information to estimate the utility function. How to do it?

The first thing the robot can do is to estimate the transition model by moving in the environment and keeping track of the number of times an action has been correctly executed. Once the transition model is available the robot can use either value iteration or policy iteration to get the utility function. In this sense, there are different techniques to find the transition model making use of Bayes rule and maximum likelihood estimation. Russell and Norvig mention these techniques in chapter 21.2.2 (Bayesian reinforcement learning). The problem with this approach is evident: estimating the values of a transition model can be expensive. In our 4x3 world it means estimating the values of a 12x12x4 (states x states x actions) table. Moreover, certain actions and some states can be very unlikely, making the entries in the transition table hard to estimate. Here I will focus on another technique, able to estimate the utility function without the transition model: the Monte Carlo method.
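Before moving on, here is a minimal sketch (not pursued in this post) of the counting approach mentioned above, i.e. the maximum likelihood estimate of the transition model; the state and action indices are illustrative:

import numpy as np

tot_states = 12
tot_actions = 4
transition_counter = np.zeros((tot_states, tot_states, tot_actions))  #N(s, s', a)
state_action_counter = np.zeros((tot_states, tot_actions))            #N(s, a)

def update_counters(s, a, s_prime):
    """Update the counters after observing the transition (s, a, s')."""
    transition_counter[s, s_prime, a] += 1
    state_action_counter[s, a] += 1

def estimate_transition_model():
    """Return the maximum likelihood estimate T(s, a, s') = N(s,a,s') / N(s,a)."""
    T = np.zeros((tot_states, tot_states, tot_actions))
    for s in range(tot_states):
        for a in range(tot_actions):
            if state_action_counter[s, a] > 0:
                T[s, :, a] = transition_counter[s, :, a] / state_action_counter[s, a]
    return T

As noted above, many entries of this 12x12x4 table would require a lot of experience to be estimated reliably, which is exactly the weakness that MC methods avoid.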

The Monte Carlo method


The Monte Carlo (MC) method was used for the first time in the 1930s by Enrico Fermi, who was studying neutron diffusion. Fermi did not publish anything on it; the modern version is due to Stanislaw Ulam, who invented it during the 1940s at Los Alamos. The idea behind MC is simple: just use randomness to solve a problem. For example, it is possible to use MC to estimate a multidimensional definite integral, a technique called MC integration. In artificial intelligence we can use MC tree search to find the best move in a game. DeepMind's AlphaGo defeated the Go world champion Lee Sedol using MC tree search combined with convolutional networks and deep reinforcement learning. Later on in this series we will discover how that was possible. The advantages of MC methods over the dynamic programming approach are the following:

1. MC allows learning optimal behaviour directly from interaction with the environment.
2. It is easy and efficient to focus MC methods on a small subset of the states.
3. MC can be used with simulations (sample models).

During the post I will analyse the first two points. The third point is less intuitive. In many applications it is easy to simulate episodes, but it can be extremely difficult to construct the transition model required by the dynamic programming techniques. In all these cases MC methods rule.

Now let's go back to our cleaning robot and see what it means to apply the MC method to this scenario. As usual the robot starts in state (1, 1) and follows its internal policy. At each step it records the reward obtained and saves a history of all the states visited until it reaches a terminal state. We define an episode as the sequence of states from the starting state to the terminal state. Let's suppose that our robot recorded the following three episodes:

The robot followed its internal policy, but an unknown transition model perturbed the trajectory, leading to undesired states. In the first and second episode, after some fluctuation the robot eventually reached the terminal state obtaining a positive reward. In the third episode the robot moved along a wrong path, reaching the stairs and falling down (reward: -1.0). The following is another representation of the three episodes:

Each occurrence of a state during the episode is called a visit. The concept of visit defines two different MC approaches:

1. First-Visit MC: $U^{\pi}(s)$ is defined as the average of the returns following the first visit to $s$ in a set of episodes.
2. Every-Visit MC: $U^{\pi}(s)$ is defined as the average of the returns following all the visits to $s$ in a set of episodes.

I will focus only on the First-Visit MC method in this post. What does return mean? The return is the sum of discounted rewards. I already presented the return in the first post when I introduced the Bellman equation and the utility of a state history.

$$\text{Return}(s) = \sum_{t=0}^{\infty} \gamma^{t} R(S_t)$$

Nothing new. We have the discount factor $\gamma$, the reward function $R(s)$ and $S_t$, the state reached at time $t$. We can calculate the return for the state (1,1) of the first episode, with $\gamma = 0.9$, as follows:
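The calculation is shown in a figure in the original post; assuming the first episode visits seven non-terminal states (each with reward -0.04) before reaching the charging station (+1.0), which is consistent with the value reported below, it reads:

$$\text{Return}(s_0) = \sum_{t=0}^{6} (0.9)^{t}(-0.04) + (0.9)^{7}(1.0) \approx -0.21 + 0.48 = 0.27$$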

The return for the first episode is 0.27. Following the same procedure we get the same result for the second episode. For the third episode we get a different return: -0.79. After the three episodes we came out with three different returns: 0.27, 0.27, -0.79. How can we use the returns to estimate the utilities? I will now introduce the core equation used in the MC method, which gives the utility of a state following the policy $\pi$:

$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(S_t)\right]$$

If you compare this equation with the equation used to calculate the return you will see only one difference: to obtain the utility function we take the expectation of the returns. That's it. To find the utility of a state we need to calculate the expectation of the returns for that state. In our example, after only three episodes the approximated utility for the state (1, 1) is: (0.27+0.27-0.79)/3=-0.08. However, an estimation based on only three episodes is inaccurate. We need more episodes in order to get the true value. Why do we need more episodes?

Here the MC terminology steps in. We can define $S_t$ to be a discrete random variable that can represent all the available states with a certain probability. Every time our robot enters a state it is like picking a value for the random variable $S_t$. For each state of each episode we can calculate the return and store it in a list. Repeating this process a large number of times is guaranteed to converge to the true utility. How is that possible? This is the result of a famous theorem known as the law of large numbers. Understanding the law of large numbers is crucial. Rolling a six-sided dice produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability. The expectation is 3.5 and can be calculated as the arithmetic mean: (1+2+3+4+5+6)/6=3.5. Using an MC approach we can obtain the same value, let's do it in Python:
import numpy as np

# Throwing a dice N times and evaluating the expectation


dice = np.random.randint(low=1, high=7, size=3)
print("Expectation (rolling 3 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=10)
print("Expectation (rolling 10 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=100)
print("Expectation (rolling 100 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=1000)
print("Expectation (rolling 1000 times): " + str(np.mean(dice)))
dice = np.random.randint(low=1, high=7, size=100000)
print("Expectation (rolling 100000 times): " + str(np.mean(dice)))

Expectation (rolling 3 times): 4.0


Expectation (rolling 10 times): 2.9
Expectation (rolling 100 times): 3.47
Expectation (rolling 1000 times): 3.481
Expectation (rolling 100000 times): 3.49948

As you can see the estimation of the expectation converges to the true value of 3.5. What we are doing in MC reinforcement learning is exactly the same, but in this case we want to estimate the utility of each state based on the return of each episode. Similarly to the dice, the more episodes we take into account, the more accurate our estimation will be.

Python implementation
As usual we will implement the algorithm in Python. I wrote a class
called GridWorld contained in the module gridworld.py available in
my GitHub repository. Using this class it is possible to create a grid
world of any size and add obstacles and terminal states. The
cleaning robot will move in the grid world following a specific policy.
Let’s bring to life our 4x3 world:

import numpy as np
from gridworld import GridWorld

# Declare our environment variables


# The world has 3 rows and 4 columns
env = GridWorld(3, 4)
# Define the state matrix
# Adding obstacle at position (1,1)
# Adding the two terminal states
state_matrix = np.zeros((3,4))
state_matrix[0, 3] = 1
state_matrix[1, 3] = 1
state_matrix[1, 1] = -1
# Define the reward matrix
# The reward is -0.04 for all states but the terminal
reward_matrix = np.full((3,4), -0.04)
reward_matrix[0, 3] = 1
reward_matrix[1, 3] = -1
# Define the transition matrix
# For each one of the four actions there is a probability
transition_matrix = np.array([[0.8, 0.1, 0.0, 0.1],
[0.1, 0.8, 0.1, 0.0],
[0.0, 0.1, 0.8, 0.1],
[0.1, 0.0, 0.1, 0.8]])
# Define the policy matrix
# 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT, NaN=Obstacle, -1=NoAction
# This is the optimal policy for world with reward=-0.04
policy_matrix = np.array([[1, 1, 1, -1],
[0, np.NaN, 0, -1],
[0, 3, 3, 3]])
# Set the matrices
env.setStateMatrix(state_matrix)
env.setRewardMatrix(reward_matrix)
env.setTransitionMatrix(transition_matrix)

In a few lines I defined a grid world with the properties of our example. The policy is the optimal policy for a reward of -0.04, as we saw in the first post. Now it is time to reset the environment (move the robot to the starting position) and use the render() method to display the world.

#Reset the environment


observation = env.reset()
#Display the world printing on terminal
env.render()

Running the snippet above we get the following print on screen:

- - - *
- # - *
○ - - -

I represented free positions with -, the two terminal states with *, obstacles with #, and the robot with ○. Now we can run an episode using a loop:

for _ in range(1000):
    action = policy_matrix[observation[0], observation[1]]
    observation, reward, done = env.step(action)
    print("")
    print("ACTION: " + str(action))
    print("REWARD: " + str(reward))
    print("DONE: " + str(done))
    env.render()
    if done: break

Given the transition matrix and the policy the most likely output of
the script will be something like this:

- - - * - - - * ○ - - *
- # - * ○ # - * - # - *
○ - - - - - - - - - - -

- ○ - * - - ○ * - - - ○
- # - * - # - * - # - *
- - - - - - - - - - - -

You can find the full example in the GitHub repository. If you are familiar with OpenAI Gym you will find many similarities with my code. I used the same structure and I implemented the same methods step(), reset() and render(). In particular the method step() moves forward to t+1 and returns the reward, the observation (position of the robot), and a variable called done which is True when the episode is finished (the robot reached a terminal state).

Now we have all we need to implement the MC method. Here I will use a discount factor of $\gamma = 0.999$, the best policy $\pi^{*}$ and the same transition model used in the previous post. Remember that with the current transition model the robot will go in the desired direction only 80% of the time. First of all, I wrote a function to estimate the return:

def get_return(state_list, gamma):
    counter = 0
    return_value = 0
    for visit in state_list:
        reward = visit[1]
        return_value += reward * np.power(gamma, counter)
        counter += 1
    return return_value

The function get_return() takes as input a list containing tuples (position, reward) and the discount factor gamma; the output is a value representing the return for that state list. Below is a quick sanity check, then we will use get_return() in a loop to collect the returns of each episode and estimate the utilities. That loop is crucial, so I added many comments to make it readable.
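As a sanity check, here is get_return() applied to a hypothetical three-visit episode (the positions are invented and, in fact, irrelevant, because only the rewards enter the computation):

episode = [((2,0), -0.04), ((2,1), -0.04), ((2,2), -0.04)]
print(get_return(episode, gamma=0.999))
#prints -0.04 - 0.04*0.999 - 0.04*0.999**2 = -0.11988...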

# Defining an empty utility matrix
utility_matrix = np.zeros((3,4))
# init with 1.0e-10 to avoid division by zero
running_mean_matrix = np.full((3,4), 1.0e-10)
gamma = 0.999 #discount factor
tot_epoch = 50000
print_epoch = 1000

for epoch in range(tot_epoch):
    #Starting a new episode
    episode_list = list()
    #Reset and return the first observation
    observation = env.reset(exploring_start=False)
    for _ in range(1000):
        # Take the action from the policy matrix
        action = policy_matrix[observation[0], observation[1]]
        # Move one step in the environment and get obs and reward
        observation, reward, done = env.step(action)
        # Append the visit in the episode list
        episode_list.append((observation, reward))
        if done: break
    # The episode is finished, now estimating the utilities
    counter = 0
    # Checkup to identify if it is the first visit to a state
    checkup_matrix = np.zeros((3,4))
    # This cycle is the implementation of First-Visit MC.
    # For each state stored in the episode list it checks if it
    # is the first visit and then estimates the return.
    for visit in episode_list:
        observation = visit[0]
        row = observation[0]
        col = observation[1]
        reward = visit[1]
        if(checkup_matrix[row, col] == 0):
            return_value = get_return(episode_list[counter:], gamma)
            running_mean_matrix[row, col] += 1
            utility_matrix[row, col] += return_value
            checkup_matrix[row, col] = 1
        counter += 1
    if(epoch % print_epoch == 0):
        print("Utility matrix after " + str(epoch+1) + " iterations:")
        print(utility_matrix / running_mean_matrix)

#Time to check the utility matrix obtained
print("Utility matrix after " + str(tot_epoch) + " iterations:")
print(utility_matrix / running_mean_matrix)

Executing this script will print the estimation of the utility matrix
every 1000 iterations:

Utility matrix after 1 iterations:


[[ 0.59184009 0.71385957 0.75461418 1. ]
[ 0.55124825 0. 0.87712296 0. ]
[ 0.510697 0. 0. 0. ]]

Utility matrix after 1001 iterations:


[[ 0.81379324 0.87288388 0.92520101 1. ]
[ 0.76332603 0. 0.73812382 -1. ]
[ 0.70553067 0.65729802 0. 0. ]]
Utility matrix after 2001 iterations:
[[ 0.81020502 0.87129531 0.92286107 1. ]
[ 0.75980199 0. 0.71287269 -1. ]
[ 0.70275487 0.65583747 0. 0. ]]

...

Utility matrix after 50000 iterations:


[[ 0.80764909 0.8650596 0.91610018 1. ]
[ 0.7563441 0. 0.65231439 -1. ]
[ 0.69873614 0.6478315 0. 0. ]]

As you can see the utility estimates get more and more accurate, and in the limit of infinite episodes they converge to the true values. In the first post we already found the utilities of this particular grid world using the dynamic programming techniques. Here we can compare the results obtained with MC and the ones obtained with dynamic programming:

If you observe the two utility matrices you will notice many similarities but also two important differences. The utility estimations for the states (4,1) and (3,1) are equal to zero. This can be considered one of the limitations and at the same time one of the advantages of MC methods. The policy we are using, the transition probabilities, and the fact that the robot always starts from the same position (bottom-left corner) are responsible for the wrong estimates in those states. Starting from the state (1,1) the robot will never reach those states and it cannot estimate the corresponding utilities. As I told you, this is a problem because we cannot estimate those values, but at the same time it is an advantage: in a very big grid world we can estimate the utilities only for the states we are interested in, saving time and resources and focusing only on a particular subspace of the world.

What can we do to estimate the values for all the states? A possible solution is called exploring starts and consists in starting from all the available states. This guarantees that all states will be visited in the limit of an infinite number of episodes. To enable exploring starts in our code the only thing to do is to set the parameter exploring_start in the reset() function to True as follows:

observation = env.reset(exploring_start=True)

Now every time a new episode begins the robot will start from a random position. Running the script again will result in the following estimations:

Utility matrix after 1 iterations:


[[ 0.87712296 0.918041 0.959 1. ]
[ 0.83624584 0. 0. 0. ]
[ 0. 0. 0. 0. ]]

Utility matrix after 1001 iterations:


[[ 0.81345829 0.8568502 0.91298468 1. ]
[ 0.76971062 0. 0.64240071 -1. ]
[ 0.71048183 0.65156625 0.62423942 0.3622782 ]]
Utility matrix after 2001 iterations:
[[ 0.80248079 0.85321 0.90835335 1. ]
[ 0.75558086 0. 0.64510648 -1. ]
[ 0.69689178 0.64712344 0.6096939 0.34484468]]

...

Utility matrix after 50000 iterations:


[[ 0.8077211 0.86449595 0.91575904 1. ]
[ 0.75630573 0. 0.65417382 -1. ]
[ 0.6989143 0.64707444 0.60495949 0.36857044]]

As you can see, this time we got the right values for the states (4,1) and (3,1). Until now we assumed that we had a policy and we used that policy to estimate the utility function. What to do when we do not have a policy? In this case there are other methods we can use. Russell and Norvig call this case active reinforcement learning. Following the definition of Sutton and Barto I will call this case model-free Monte Carlo control estimation.

Monte Carlo control


The MC methods for control (active) are slightly different from MC
methods for prediction (passive). In some sense the MC control
problem is more realistic because we need to estimate a policy
that is not given. The mechanism behind MC for control is the same
we used in the dynamic programming approach. In the Sutton and
Barto book it is called Generalised Policy Iteration or GPI. The GPI
is well explained by the policy iteration algorithm of the first post.
The policy iteration allowed finding the utility values for each state
and at the same time the optimal policy
𝜋∗
. The approach we used in policy iteration included two steps:

1. Policy evaluation:
𝑈 → 𝑈𝜋
2. Policy improvement:
𝜋 → 𝑔𝑟𝑒𝑒𝑑𝑦(𝑈)

The first step makes the utility function consistent with the current
policy (evaluation). The second step makes the policy
𝜋
greedy with respect to the current utility function (improvement).
The two steps work against each other, each creating a moving target
for the other, but together they make both the policy and the value
function approach their optimal values.
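
As a minimal sketch of this interplay (not the author's code; evaluate_policy
and the matrix shapes are placeholder assumptions), GPI can be written as a
loop that alternates the two steps:

import numpy as np

def generalised_policy_iteration(evaluate_policy, n_states, n_actions, iterations=100):
    """Sketch of GPI: evaluate_policy is assumed to return a Q matrix
    of shape (n_actions, n_states) consistent with the given policy."""
    policy = np.random.randint(0, n_actions, size=n_states)  # random initial policy
    for _ in range(iterations):
        q = evaluate_policy(policy)       # Evaluation: make Q consistent with the policy
        policy = np.argmax(q, axis=0)     # Improvement: make the policy greedy w.r.t. Q
    return policy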

Examining the second step we notice a new term: greedy. What
does greedy mean? A greedy algorithm makes the locally optimal
choice at each step. In our case greedy means taking, for each state,
the action with the highest utility and updating the policy with that
action. However, following only local cues does not (in general) lead
to optimal solutions. For example, choosing the highest utility at each
step in the following case leads to a negative reward.
How can the greedy strategy work? It works because the local
choice is evaluated with a utility function that is adjusted over time. At the
beginning the agent will follow many sub-optimal paths, but after a
while the utilities will start to converge to the true values and the
greedy strategy will lead to positive rewards. All reinforcement
learning methods can be described in terms of policy iteration and,
more specifically, in terms of GPI. Keeping the GPI idea in mind
will make the control methods easier to understand. In order to fully
understand the MC method for control I have to introduce another
topic: the Q-function.

Action Values and the Q-function


Until now we used the function
𝑈
called the utility function (aka value function, state-value function) as
a way to estimate the utility (value) of a state. More precisely, we
used
𝑈𝜋 (𝑠)
to estimate the value of a state
𝑠
under a policy
𝜋
. Now it’s time to introduce a new function called
𝑄
(aka action-value function) defined as follows:
𝑄𝜋 (𝑠, 𝑎) = 𝐸{Return𝑡 | 𝑠𝑡 = 𝑠, 𝑎𝑡 = 𝑎}

That’s it, the Q-function takes the action


𝑎
in state
𝑠
under the policy
𝜋
and returns the utility of that state-action pair. The Q-function is
defined as the expected return starting from
𝑠
, taking the action
𝑎
and thereafter following policy
𝜋
.

Why do we need the function Q in MC methods? In model-free
reinforcement learning the utilities of the states are not sufficient to
suggest a policy: one must explicitly estimate the utility of each
action. Thus the primary goal in MC methods for control is to
estimate the function 𝑄∗. What I said previously about GPI applies
also to the action-value function Q. Estimating the optimal action-value
function is not different from estimating the utility function. The first-visit
MC method for control averages the return obtained after a specific
state-action pair has been visited for the first time. We must think in
terms of state-action pairs and not in terms of states. When we
estimated the utility function 𝑈 we stored the utilities in a matrix having
the same dimensions as the world. Here we need a new way to represent
the action-value function Q, because we have to take the actions into
account. What we can do is to have a row for each action and a column
for each state. Imagine taking all the 12 states of our 4x3 grid world and
arranging them along a single row, then repeating the process for all the
four possible actions (up, right, down, left). The resulting (empty)
matrix is the following:
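
As a minimal sketch (using NumPy, and matching the indexing convention used
in the code later in this post), the empty matrix and the column index of a
state can be built like this:

import numpy as np

# One row per action (UP, RIGHT, DOWN, LEFT), one column per each of the 12 states.
state_action_matrix = np.zeros((4, 12))

# A state observed as (row, col) on the 3x4 NumPy grid maps to a single column index.
observation = (2, 0)                          # NumPy cell (2,0) is the bottom-left state (1,1)
col = observation[1] + (observation[0] * 4)   # -> column 8
print(state_action_matrix[:, col])            # Q-values of the four actions in state (1,1)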

The state-action matrix stores the utilities of executing a specific
action in a specific state, thus with a query to the matrix we can
estimate which action should be executed in order to obtain the
highest utility. In the MC control case we have to change our mindset
when analyzing an episode. Each state has an associated action, and
executing this action in that state leads to a new state and a reward.
Graphically we can represent an episode by pairing states with the
corresponding actions:

The episode above is the same one we used as an example in the MC for
prediction section. The robot starts at (1,1) and reaches the charging
station after seven visits. Here we can calculate the returns as usual.
Recall that we are under the assumption of first-visit MC, so we update
the entry for the state-action pair (1,2)-UP only once, even though this
pair is present twice in the episode. To estimate the utility we have to
decompose the episode and evaluate the return that follows the
first occurrence of each state-action pair. In our example, we have
to compute the return for the pair (1,1)-UP, the pair (1,2)-UP, the pair
(1,3)-DOWN, skip the pair (1,2)-UP (already updated), compute the return
for the pair (1,3)-RIGHT, etc. In the following image you can see this
process and how the returns are evaluated:

After this episode the matrix containing the values for the state-
action utilities can be updated. In our case the new matrix will
contain the following values:

After a second episode we will fill more entries in the table. Going on
in this way will eventually lead to a complete state-action table with
all the entries filled. This step is what is called evaluation in the GPI
framework. The second step of the algorithm is the improvement. In
the improvement step we take our randomly initialised policy 𝜋 and we
update it in the following way:

𝜋(𝑠) = argmax_𝑎 𝑄(𝑠, 𝑎)

That’s it, we are making the policy greedy choosing for each state
𝑠
appearing in the episode the action with maximal Q-value. For
example, if we consider the state (1,3) (top-left corner in the grid
world) we can update the entry of the policy matrix taking the action
with the highest value in the state-action table. In our case, after the
first episode the action with the highest value is RIGHT which has a
Q-value of 0.74.
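
In code this boils down to an argmax over the column associated with the
state (a toy sketch: the 0.74 comes from the example above, the other
Q-values are made up):

import numpy as np

# Illustrative Q-values of UP, RIGHT, DOWN, LEFT for state (1,3) after the first episode.
q_column = np.array([0.0, 0.74, 0.51, 0.0])
best_action = int(np.argmax(q_column))   # -> 1, i.e. RIGHT
# State (1,3) is the NumPy cell (0,0) of the 3x4 policy matrix.
policy_matrix = np.random.randint(low=0, high=4, size=(3, 4)).astype(np.float32)
policy_matrix[0, 0] = best_action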
In MC for control it is important to guarantee a uniform exploration
of all the state-action pairs. Following the policy
𝜋
it can happen that relevant state-action pairs are never visited.
Without returns the method will not improve. The solution is to use
exploring starts specifying that the first step of each episode starts
at a state-action pair and that every such pair has a non-zero
probability of being selected. It’s time to implement the algorithm in
Python.

Python implementation
I will use again the function get_return() but this time the input will
be a list containing tuples (observation, action, reward):

def get_return(state_list, gamma):
    """ Get the return for a list of (observation, action, reward) tuples.

    @return the return value for the list
    """
    counter = 0
    return_value = 0
    for visit in state_list:
        reward = visit[2]
        return_value += reward * np.power(gamma, counter)
        counter += 1
    return return_value
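
As a usage sketch (assuming numpy is imported as np, as in the rest of the
script), a short two-visit episode ending at the charging station would be
processed like this:

# The robot is at (2,3), moves RIGHT getting -0.04,
# then from (3,3) moves RIGHT reaching the charging station (+1.0).
episode_list = [((0, 1), 1, -0.04),   # state (2,3): NumPy cell (0,1), action RIGHT
                ((0, 2), 1,  1.0)]    # state (3,3): NumPy cell (0,2), action RIGHT
print(get_return(episode_list, gamma=0.999))  # -0.04 + 0.999*1.0 = 0.959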

I will use a function called update_policy() that makes the policy
greedy with respect to the current state-action function:

def update_policy(episode_list, policy_matrix, state_action_matrix):
    """ Update a policy, making it greedy with respect to the
    state-action matrix.

    @return the updated policy
    """
    for visit in episode_list:
        observation = visit[0]
        col = observation[1] + (observation[0]*4)
        if(policy_matrix[observation[0], observation[1]] != -1):
            policy_matrix[observation[0], observation[1]] = \
                np.argmax(state_action_matrix[:,col])
    return policy_matrix

The update_policy() function is part of the improvement step of the
GPI and it is fundamental for the convergence to an optimal policy. I
will also use the function print_policy(), already used in the
previous post, to print the policy using the symbols: ^, >, v, <, *, #. In
the main() function I initialized a random policy matrix and the
state_action_matrix that contains the utilities of each state-action
pair. The matrix can be initialized with zeros or random values, it
does not matter.

# Random policy matrix
policy_matrix = np.random.randint(low=0, high=4,
                                  size=(3, 4)).astype(np.float32)
policy_matrix[1,1] = np.NaN  # NaN for the obstacle at (1,1)
policy_matrix[0,3] = policy_matrix[1,3] = -1  # No action (terminal states)

# State-action matrix (init to zeros or to random values)
state_action_matrix = np.random.random_sample((4,12))  # Q

Finally, the main loop of the algorithm. This is not so different from
the loop used in MC prediction:

for epoch in range(tot_epoch):
    # Starting a new episode
    episode_list = list()
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    is_starting = True
    for _ in range(1000):
        # Take the action from the action matrix
        action = policy_matrix[observation[0], observation[1]]
        # If the episode just started then it is
        # necessary to choose a random action (exploring starts)
        if(is_starting):
            action = np.random.randint(0, 4)
            is_starting = False
        # Move one step in the environment and get
        # a new observation and the reward
        new_observation, reward, done = env.step(action)
        # Append the visit to the episode list
        episode_list.append((observation, action, reward))
        observation = new_observation
        if done: break
    # The episode is finished, now estimating the utilities
    counter = 0
    # Checkup matrix used to identify the first visit to a state-action pair
    checkup_matrix = np.zeros((4,12))
    # This cycle is the implementation of First-Visit MC.
    # For each state-action stored in the episode list it checks if
    # it is the first visit and then estimates the return.
    # This is the Evaluation step of the GPI.
    for visit in episode_list:
        observation = visit[0]
        action = visit[1]
        col = observation[1] + (observation[0]*4)
        row = action
        if(checkup_matrix[row, col] == 0):
            return_value = get_return(episode_list[counter:], gamma)
            running_mean_matrix[row, col] += 1
            state_action_matrix[row, col] += return_value
            checkup_matrix[row, col] = 1
        counter += 1
    # Policy Update (Improvement)
    policy_matrix = update_policy(episode_list,
                                  policy_matrix,
                                  state_action_matrix/running_mean_matrix)
    # Printing
    if(epoch % print_epoch == 0):
        print("")
        print("State-Action matrix after " + str(epoch+1) + " iterations:")
        print(state_action_matrix / running_mean_matrix)
        print("Policy matrix after " + str(epoch+1) + " iterations:")
        print(policy_matrix)
        print_policy(policy_matrix)

# Time to check the utility matrix obtained
print("Utility matrix after " + str(tot_epoch) + " iterations:")
print(state_action_matrix / running_mean_matrix)

If we compare the code above with the one used in MC for prediction
we will notice some important differences, for example the following
condition:

if(is_starting):
    action = np.random.randint(0, 4)
    is_starting = False

This condition enables the exploring starts. The MC algorithm will
converge to the optimal solution only if we ensure exploring starts.
In MC for control it is not sufficient to select random starting states:
during the iterations the algorithm will improve the policy only if all
the actions have a non-zero probability of being chosen. For this reason,
when the episode starts we have to select a random action; this
must be done only for the starting state.

There is another subtle difference. In the code I differentiate
between observation and new_observation, the observation at time 𝑡
and the observation at time 𝑡+1. What we need to store in our episode
list is the observation at 𝑡, the action taken at 𝑡 and the reward
obtained at 𝑡+1. Remember that we are interested in the utility of
taking a certain action in a certain state.

It is time to run the script. Before doing that, recall that for the
simple 4x3 grid world we already know the optimal policy. In the first
post we found the optimal policy for a reward of -0.04 (for non-terminal
states) and a transition model with 80-10-10 percent
probabilities. The optimal policy is the following:

Optimal policy:

> > > *
^ # ^ *
^ < < <

In the optimal policy the robot will move far away from the stairs at
state (4, 2) and will reach the charging station through the longest
path. Now, I will show you the evolution of the policy once we run the
script for MC control estimation:

Policy after 1 iterations:


^ > v *
< # v *
v > < >

...

Policy after 3001 iterations:


> > > *
> # ^ *
> > ^ <

...

Policy after 78001 iterations:


> > > *
^ # ^ *
^ < ^ <

...

Policy after 405001 iterations:


> > > *
^ # ^ *
^ < < <

...

Policy after 500000 iterations:


> > > *
^ # ^ *
^ < < <

At the beginning the MC method is initialised with a random policy,
therefore it is not a surprise that the first policy is complete
nonsense. After 3000 iterations the algorithm finds a sub-optimal
policy. Following this policy the robot moves close to the stairs in
order to reach the charging station. As we said in the previous post
this is risky, because the robot can fall down. At iteration 78000 the
algorithm finds another policy that is still sub-optimal but slightly
better than the previous one. Finally, at iteration 405000 the
algorithm finds the optimal policy and sticks to it until the end.

The MC method cannot converge to any sub-optimal policy.
From the GPI point of view this is obvious: if the algorithm converged
to a sub-optimal policy, then the utility function would eventually
converge to the utility function of that policy, which in turn would
cause the policy to change. Stability is reached only when both the
policy and the utility function are optimal. Convergence to this optimal
fixed point seems inevitable, but it has not yet been formally proved.

Conclusions
I would like to reflect for a moment on the beauty of the MC
algorithm. In MC for control the method can estimate the best policy
from nothing. The robot is moving in the environment trying different
actions and following the consequences of those actions until the
end. That’s all. The robot does not know the reward function, it does
not know the transition model and it does not have any policy to
follow. Nevertheless the algorithm improves until reaching the
optimal strategy.

Be careful: MC methods are not perfect. The fact that we have to
store a full episode before updating the utility function is a strong
limitation. It means that if you want to train a robot to drive a car,
you have to wait until the robot crashes into a wall before you can
update the policy. To overcome this problem we can use another
approach called Temporal Differencing (TD) learning. With TD methods
we can obtain the same results as MC methods, but we can update the
utility function after a single step. In the next post I will introduce
TD methods, the foundation of Q-Learning and Deep Reinforcement
Learning.

Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.

Resources
The complete code for MC prediction and MC control is
available on the dissecting-reinforcement-learning official
repository on GitHub.

Dadid Silver’s course (DeepMind) in particular lesson 4 [pdf]


[video] and lesson 5 [pdf][video].

Artificial intelligence: a modern approach. (chapters 17 and


21) Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., &
Edwards, D. D. (2003). Upper Saddle River: Prentice hall. [web]
[github]

Reinforcement learning: An introduction. Sutton, R. S., &


Barto, A. G. (1998). Cambridge: MIT press. [html]

Reinforcement learning: An introduction (second edition).


Sutton, R. S., & Barto, A. G. (2018). [pdf]

References
Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., & Edwards, D. D.
(2003). Artificial intelligence: a modern approach (Vol. 2). Upper
Saddle River: Prentice hall.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An
introduction (Vol. 1, No. 1). Cambridge: MIT press.
Dissecting Reinforcement
Learning-Part.3
Massimiliano Patacchiola Jan 29, 2017

Welcome to the third part of the “Dissecting Reinforcement Learning”
series. In the first and second posts we dissected dynamic
programming and Monte Carlo (MC) methods. The third major
group of methods in reinforcement learning is called Temporal
Differencing (TD). TD learning solves some of the problems of MC
learning, and in the conclusions of the second post I described one of
these problems: with MC it is necessary to wait until the end of the
episode before updating the utility function. This is a serious
problem because some applications can have very long episodes,
with learning delayed to the end of each one. Moreover, in some
environments the completion of the episode is not guaranteed. Here
we will see how TD methods solve these issues.

In this post I will start with a general introduction to the TD
approach and then move to the most important TD techniques: Sarsa
and Q-Learning. TD had a huge impact on reinforcement learning
and most of the recent publications (including Deep Reinforcement
Learning) are based on the TD approach. Additionally, we will see
how TD is connected with psychology through animal learning
experiments. If you want to read more about TD and animal learning
you should read chapter 14 in the second edition of Sutton and
Barto’s book (pdf) and the chapter entitled “Time-derivative models
of Pavlovian reinforcement”, which you can easily find on Google.
Some parts of this post are based on chapters 6 and 7 of the
classic “Reinforcement Learning: An Introduction”. If after reading
this post you are not satisfied, I suggest you take a look at Sutton's
article entitled “Learning to predict by the methods of
temporal differences”. If you want to read more about Sarsa and Q-
learning you can use the book of Russell and Norvig (chapter 21.3.2).
A short introduction to reinforcement learning and Q-Learning is also
provided by Mitchell in his book Machine Learning (1997) (chapter
13). Links to these resources are available in the last section of the
post.

Temporal Differencing (and rabbits)


The term Temporal Differencing was first used by Sutton back in
1988. Sutton has an interesting background: libertarian,
psychologist and computer scientist, interested in understanding
what we mean by intelligence and goal-directed behaviour. Have a
look at his personal page if you want to know more. The interesting
thing about Sutton’s research is that he motivated and explained TD
from the point of view of animal learning theory and showed that
the TD model solves many problems with a simple time-derivative
approach. I am sure you have already heard about the famous Pavlov
experiment in classical conditioning. Showing food to a dog elicits
a response (salivation). The response is called the unconditioned
response (UR) and it is caused by an unconditioned stimulus
(US). The UR is a natural reaction that does not depend on
previous experience. In a second phase we pair the stimulus (food)
with a neutral stimulus (e.g. a bell). After a while the dog will associate
the sound of the bell with the food and this association will elicit
salivation. The bell is called conditioned stimulus (CS) and the
response is the conditioned response (CR).

The same effect is studied with eyeblink conditioning in rabbits. A
mild puff of air is directed at the rabbit’s eyes. The UR in this case is
closing the eyelid, whereas the US is the air puff. During the
conditioning a red light (CS) is turned on before the air puff. The
conditioning creates an association between the light and the eye
blink. There are two types of arrangements of stimuli in classical
conditioning experiments. In delay conditioning, the CS extends
throughout the US without any interval. In trace conditioning there
is a time interval, called the trace interval, between CS and US. The
delay between CS and US is an important variable called the
interstimulus interval (ISI).
Learning about predictive relationships among stimuli is extremely
important for survival; this is the reason why it is widely present
among species ranging from mice to humans. Learning means
accurately predicting, at each point in time, the imminence-weighted
sum of future US intensity levels. In the eyeblink experiment it has
been observed that rabbits learn a weaker prediction for CSs
presented far in advance of the US. Studying the results on eyeblink
conditioning, Sutton and Barto (1990) found a correlation with the
TD framework. Reinforcement is weighted according to its
imminence (the length of the ISI): when slightly delayed it carries slightly
less weight, when long-delayed it carries very little weight, and so on.
This assumption is the core of the TD model of classical
conditioning and it is an extension of the Rescorla-Wagner model
(1972). If you read the previous posts you should find some
similarities with the concept of discounted rewards. The general
rule behind TD applies to rabbits and to artificial agents. This
general rule can be summarised as follows:

\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \left[ \text{Target} - \text{OldEstimate} \right]

The expression [Target − OldEstimate] is the estimation error, or δ,
which can be reduced by moving one step towards the real value
(the Target). The StepSize (sometimes called learning rate) is a
parameter that determines to what extent the error has to be
integrated into the new estimate. If StepSize=0 the agent does not
learn at all. If StepSize=1 the agent considers only the most recent
information. In some applications the StepSize changes at each time
step: when processing the k-th reward the parameter is set to 1/k.
However, in practice a constant value such as 0.1 is often used for
all steps. What is the Target in our case? From the second post we
know that we can estimate the utility of a state as the expectation of
the returns for that state. The Target is the expected return of the
state:

\text{Target} = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \right]

In MC methods, to estimate the Target we take into account all the
rewards collected until the end of the episode:

\text{Target} = E_{\pi}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots + \gamma^{k} r_{t+k+1} \right]

In TD learning we want to update the utility function after each
visit, therefore we do not have all the states and we do not have the
values of the rewards. The only information available is r_{t+1}, the
reward at t+1, and the utilities estimated previously. If we find a way to
express the Target using only those values we are done. To solve the
issue we can bootstrap, meaning that we use the current estimates to
build new estimates. This is the most important part: if we factor out
γ we obtain exactly the definition of U(s_{t+1}):

\text{Target} = E_{\pi}\left[ r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \dots + \gamma^{k-1} r_{t+k+1} \right) \right] = E_{\pi}\left[ r_{t+1} + \gamma U(s_{t+1}) \right]

We got what we wanted. The Target is now expressed by two
quantities, r_{t+1} and U(s_{t+1}), and both of them are known. Taking into
account all these considerations we can finally write the complete
update rule:

U(s_t) \leftarrow U(s_t) + \alpha \left[ r_{t+1} + \gamma U(s_{t+1}) - U(s_t) \right]

This update rule is fascinating. At the very first iteration we are
updating the utility table using completely wrong values. Think about
it: we initialised the utilities with random values (or zeros) and we
are taking one of these values at t+1 to update the state at t. How
can the algorithm converge to the real values? The magic
happens when the agent meets a terminal state for the first time.
In that particular case the returns obtained by TD and MC coincide.
Using again our cleaning robot example we can easily see the
difference between TD and MC learning and what each one does at
each step…

TD(0) Python implementation


The update rule found in the previous section is the simplest form of TD
learning, the TD(0) algorithm. TD(0) allows estimating the utility
values while following a specific policy. We are in the passive learning
case for prediction, and we are in model-free reinforcement
learning, meaning that we do not have the transition model. To
estimate the utility function we can only move in the world. Using
again the cleaning robot example, I want to show you what it means
to apply the TD algorithm to a single episode. I am going to
use the episode of the second post, where the robot starts at (1,1)
and reaches the terminal state at (4,3) after seven steps.
Applying the TD algorithm means moving step by step, considering
only the state at t and the state at t+1. That's it: after each step we
get the utility value and the reward at t+1 and we update the value at
t. The TD(0) algorithm ignores the past states, as shown by the
shadow I added above those states. Applying the algorithm to the
episode (γ=0.9, α=0.1) leads to the following changes in the utility
matrix:
The red frame highlights the utility value that has been updated at
each visit. The matrix is initialised with zeros. At k=1 the state (1,1) is
updated, since the robot is now in the state (1,2) and the first reward
(-0.04) is available. The calculation for updating the utility at (1,1) is:
0.0 + 0.1 (-0.04 + 0.9 (0.0) - 0.0) = -0.004. Similarly to (1,1),
the algorithm updates the state (1,2). At k=3 the robot goes back,
and the state (1,3) is updated using the value of (1,2):
0.0 + 0.1 (-0.04 + 0.9 (-0.004) - 0.0) = -0.00436. At k=4 the robot
changes direction again, and the algorithm updates the state (1,2) for
the second time: -0.004 + 0.1 (-0.04 + 0.9 (-0.00436) + 0.004) =
-0.0079924. The same process is applied until the end of the
episode.
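
These numbers can be verified directly with the update rule (a quick sketch
with α=0.1 and γ=0.9, as above):

alpha, gamma = 0.1, 0.9

u_11 = 0.0  + alpha * (-0.04 + gamma * 0.0  - 0.0)     # k=1: state (1,1) -> -0.004
u_12 = 0.0  + alpha * (-0.04 + gamma * 0.0  - 0.0)     # k=2: state (1,2) -> -0.004
u_13 = 0.0  + alpha * (-0.04 + gamma * u_12 - 0.0)     # k=3: state (1,3) -> -0.00436
u_12 = u_12 + alpha * (-0.04 + gamma * u_13 - u_12)    # k=4: state (1,2) -> -0.0079924
print(u_11, u_13, u_12)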

In the Python implementation we have to create a grid world as we
did in the second post, using the class GridWorld contained in the
module gridworld.py. I will use again the 4x3 world with a charging
station at (4,3) and the stairs at (4,2). The optimal policy and the
utility values of this world are the same ones we obtained in the
previous posts:

Optimal policy:    Utility Matrix:

> > > *            0.812  0.868  0.918  1.0
^ # ^ *            0.762  0.0    0.660 -1.0
^ < < <            0.705  0.655  0.611  0.388

The update rule of TD(0) can be implemented in a few lines:

def update_utility(utility_matrix, observation, new_observation,
                   reward, alpha, gamma):
    '''Return the updated utility matrix

    @param utility_matrix the matrix before the update
    @param observation the state observed at t
    @param new_observation the state observed at t+1
    @param reward the reward observed after the action
    @param alpha the step size (learning rate)
    @param gamma the discount factor
    @return the updated utility matrix
    '''
    u = utility_matrix[observation[0], observation[1]]
    u_t1 = utility_matrix[new_observation[0], new_observation[1]]
    utility_matrix[observation[0], observation[1]] += \
        alpha * (reward + gamma * u_t1 - u)
    return utility_matrix

The main loop is much simpler than the one used in MC methods. In
this case we do not have any first-visit constraint and the only thing
to do is to apply the update rule.

for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    for step in range(1000):
        # Take the action from the action matrix
        action = policy_matrix[observation[0], observation[1]]
        # Move one step in the environment and get obs and reward
        new_observation, reward, done = env.step(action)
        # Update the utility matrix using the TD(0) rule
        utility_matrix = update_utility(utility_matrix,
                                        observation, new_observation,
                                        reward, alpha, gamma)
        observation = new_observation
        if done: break  # return

The complete code, called temporal_differencing_prediction.py,
is available in the GitHub repository. For the moment it is important
to get the general idea behind the algorithm. Running the complete
code with gamma=0.999, alpha=0.1 and following the optimal policy
for a reward of -0.04 we obtain:

Utility matrix after 1 iterations:


[[-0.004 -0.0076 0.1 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]

Utility matrix after 2 iterations:


[[-0.00835924 -0.00085 0.186391 0. ]
[-0.0043996 0. 0. 0. ]
[-0.004 0. 0. 0. ]]

Utility matrix after 3 iterations:


[[-0.01520748 0.01385546 0.2677519 0. ]
[-0.00879473 0. 0. 0. ]
[-0.01163916 -0.0043996 -0.004 -0.004 ]]

...

Utility matrix after 100000 iterations:


[[ 0.83573452 0.93700432 0.94746457 0. ]
[ 0.77458346 0. 0.55444341 0. ]
[ 0.73526333 0.6791969 0.62499965 0.49556852]]

...

Utility matrix after 300000 iterations:


[[ 0.85999294 0.92663558 0.99565229 0. ]
[ 0.79879005 0. 0.69799246 0. ]
[ 0.75248148 0.69574141 0.65182993 0.34041743]]

We can now compare the utility matrix obtained with TD(0) and the
one obtained with Dynamic Programming in the first post:

Most of the values are similar. The main difference between the two
tables is the estimate of the two terminal states. TD(0) does not
work for terminal states because the update needs the reward and
the utility of the next state at t+1, and by definition after a terminal
state there is no next state. However, this is not a big issue. What we
want to know is the utility of the states near the terminal states. To
overcome the problem a simple conditional statement is often used:

if is_terminal(state) == True:
    utility_matrix[state] = reward
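
One possible way to handle this directly in the main loop (a sketch, not
necessarily how the repository script does it) is to assign the reward to the
terminal state when the done flag is returned:

new_observation, reward, done = env.step(action)
if done:
    # The utility of a terminal state is simply its reward (+1.0 or -1.0 here).
    utility_matrix[new_observation[0], new_observation[1]] = reward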

Great, we saw how TD(0) works. However, there is something I did not
talk about: what does the zero in the name of the algorithm mean? To
understand that I have to introduce eligibility traces.

TD(λ) and eligibility traces


As I told you in the previous section, the TD(0) algorithm does not
take into account past states. What matters in TD(0) is the current
state and the state at t+1. However, it would be useful to extend what
has been learned at t+1 also to previous states; this would accelerate
learning. To achieve this objective it is necessary to have a short-
term memory mechanism that stores the states visited in the last
steps. For each state s at time t we can define e_t(s), the eligibility
trace, as:

e_t(s) = \begin{cases} \gamma \lambda \, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda \, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}

Here γ is the discount factor and λ ∈ [0,1] is a decay parameter, called
trace-decay or accumulating trace, which defines the update weight of
each visited state. When 0 < λ < 1 the traces decrease in time, giving a
small weight to infrequently visited states. For the particular case λ=0 we
have TD(0), and only the previous prediction is updated. For λ=1 we
have TD(1), where all the previous predictions are equally updated.
TD(1) can be considered an extension of MC methods using a TD
framework. In MC methods we need to wait until the end of the episode
in order to update the states. In TD(1) we can update all the previous
states online; we do not need to wait for the end of the episode. Let's
see now what happens to the trace of a specific state during an episode.
I will take into account an episode with seven visits in which five states
are visited. The state s1 is visited twice during the episode. Let's see
what happens to its trace.

At the beginning the trace is equal to zero. After the first visit to s1
(second step) the trace goes up to 1 and then it starts decaying.
After the second visit (fourth step) +1 is added to the current value
(0.25), giving a trace of 1.25. After that point the state s1 is no
longer visited and the trace slowly goes to zero. How does TD(λ)
update the utility function? In TD(0) a uniform shadow was added in
the graphical illustration to represent the inaccessibility of previous
states. In TD(λ) the previous states are accessible, but they are
updated in proportion to their eligibility trace value. States with a
small eligibility trace will be updated by a small amount, whereas
states with a high eligibility trace will be substantially updated.
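
The trace of s1 described above can be reproduced with a few lines,
assuming (as the values 0.25 and 1.25 imply) that the product γλ is 0.5:

gamma, lambda_ = 1.0, 0.5    # assumption: gamma*lambda_ = 0.5, as implied by the figure
trace = 0.0
for step in range(1, 8):     # the episode has seven visits
    trace *= gamma * lambda_           # decay at every step
    if step in (2, 4):                 # s1 is visited at the second and fourth step
        trace += 1.0
    print(step, trace)                 # step 2 -> 1.0, step 4 -> 1.25, then decays to zero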

Graphically we can represent TD(λ) with a non-uniform shadow
partially hiding previous states. Now it's time to define the update
rule for TD(λ). Recall that the estimation error δ was defined in the
previous section as:

\delta_t = r_{t+1} + \gamma U(s_{t+1}) - U(s_t)

Using the traces, we can update the utility function as follows:

U_{t+1}(s) = U_t(s) + \alpha \, \delta_t \, e_t(s) \quad \text{for all } s \in S


To help you understand the differences between TD(0) and TD(λ) I
built a 4x3 grid world where the reward is zero for all states except
the two terminal states. The utility matrix is initialised with zeros.
The episode I will take into account has five visits: the robot starts at
state (1,1) and arrives at the charging station (4,3) following the
optimal path.

The results of the update for TD(0) and TD(λ) are the same (zero)
along all the visits but the last one. When the robot reaches the
charging station (reward +1.0) the update rule returns a positive
value. In TD(0) the result is propagated only to the previous state
(3,3). In TD(λ) the result is propagated back to all previous states
thanks to the eligibility trace. The decay value of the trace gives
more weight to the most recent states. As I told you, the eligibility
trace mechanism helps to speed up convergence. It is easy to
understand why, if you consider that in our example TD(0) needs five
episodes in order to reach the same result as TD(λ).

The Python implementation of TD(λ) is straightforward. We only
need to add an eligibility trace matrix and its update rule.

def update_utility(utility_matrix, trace_matrix, alpha, delta):
    '''Return the updated utility matrix

    @param utility_matrix the matrix before the update
    @param trace_matrix the eligibility traces matrix
    @param alpha the step size (learning rate)
    @param delta the error (Target - OldEstimate)
    @return the updated utility matrix
    '''
    utility_matrix += alpha * delta * trace_matrix
    return utility_matrix

def update_eligibility(trace_matrix, gamma, lambda_):
    '''Return the updated trace_matrix

    @param trace_matrix the eligibility traces matrix
    @param gamma the discount factor
    @param lambda_ the decaying value
    @return the updated trace_matrix
    '''
    trace_matrix = trace_matrix * gamma * lambda_
    return trace_matrix

The main loop introduces some new components compared to the
TD(0) case: the estimation of delta is done in a separate line, and the
trace_matrix is managed in two lines. First the trace of the visited
state is increased by +1, and then all the traces are decayed.

for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    for step in range(1000):
        # Take the action from the action matrix
        action = policy_matrix[observation[0], observation[1]]
        # Move one step in the environment and get obs and reward
        new_observation, reward, done = env.step(action)
        # Estimate the error delta (Target - OldEstimate)
        delta = reward + gamma * \
            utility_matrix[new_observation[0], new_observation[1]] - \
            utility_matrix[observation[0], observation[1]]
        # Adding +1 in the trace matrix (only the state visited)
        trace_matrix[observation[0], observation[1]] += 1
        # Update the utility matrix (all the states)
        utility_matrix = update_utility(utility_matrix, trace_matrix, alpha, delta)
        # Update the trace matrix (decaying) (all the states)
        trace_matrix = update_eligibility(trace_matrix, gamma, lambda_)
        observation = new_observation
        if done: break  # return

The complete code is available on the GitHub repository and it is
called temporal_differencing_prediction_trace.py. Running the
script we obtain the following utility matrices:

Utility matrix after 1 iterations:


[[ 0. 0.04595 0.1 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]

...

Utility matrix after 101 iterations:


[[ 0.90680695 0.98373981 1.05569002 0. ]
[ 0.8483302 0. 0.6750451 0. ]
[ 0.77096419 0.66967837 0.50653039 0.22760573]]

...

Utility matrix after 100001 iterations:


[[ 0.86030512 0.91323552 0.96350672 0. ]
[ 0.80914277 0. 0.82155788 0. ]
[ 0.76195244 0.71064599 0.68342933 0.48991829]]
...

Utility matrix after 300000 iterations:


[[ 0.87075806 0.92693723 0.97192601 0. ]
[ 0.82203398 0. 0.87812674 0. ]
[ 0.76923169 0.71845851 0.7037472 0.52270127]]

Comparing the final utility matrix with the one obtained without the
use of eligibility traces in TD(0) you will notice similar values. One
could ask: what's the advantage of using eligibility traces? The
eligibility traces version converges faster. This advantage becomes
clear when dealing with sparse rewards in a large state space. In that
case the eligibility trace mechanism can considerably speed up
convergence, propagating what has been learnt at t+1 back to the
states visited most recently.

SARSA: Temporal Differencing control


Now it is time to extend the TD method to the control case. Here we
are in the active scenario: we want to estimate the optimal policy
starting from a random one. We saw in the introduction that the final
update rule for the TD(0) case was:

U(s_t) \leftarrow U(s_t) + \alpha \left[ r_{t+1} + \gamma U(s_{t+1}) - U(s_t) \right]

The update rule is based on the tuple State-Reward-State. We are
in the control case and we use the Q-function (see the second post)
to estimate the best policy. The Q-function requires a state-action
pair as input. The TD algorithm for control is straightforward; have a
look at the update rule:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]

That’s it, we simply replaced U with Q but we must be careful


because there is a difference. Now we need a new value which is the
action at t+1. This is not a problem because it is contained in the Q-
matrix. In TD control the estimation is based on the tuple State-
Action-Reward-State-Action and this tuple gives the name to the
algorithm: SARSA. SARSA has been introduced in 1994 by Rummery
and Niranjan in the article “On-Line Q-Learning Using Connectionist
Systems” and was originally called modified Q-learning. In 1996
Sutton introduced the current name.

To get the intuition behind the algorithm, let's consider again a single
episode of an agent moving in a world. The robot starts at s0 and
after seven visits it reaches a terminal state at s5. For each state we
have an associated action. Moving forward, the algorithm takes into
account only the states at t and t+1. In the standard implementation of
SARSA the previous states are ignored, as shown by the shadow
on top of them in the graphical illustration. This is in line with the TD
framework, as explained in the TD(0) section. Now I would like to
summarise all the steps of the algorithm:

1. Move one step selecting a_t from π(s_t)
2. Observe: r_{t+1}, s_{t+1}, a_{t+1}
3. Update the state-action function Q(s_t, a_t)
4. Update the policy π(s_t) ← argmax_a Q(s_t, a)

In step 1 the agent selects one action from the policy and moves one
step forward. In step 2 the agent observes the reward, the new state
and the associated action. In step 3 the algorithm updates the state-
action function using the update rule. In step 4 we use the same
mechanism as in MC for control (see the second post): the policy π
is updated at each visit by choosing the action with the highest state-
action value. We are making the policy greedy. As for MC methods,
we use the exploring starts condition.

Can we apply the TD(λ) ideas to SARSA? Yes we can. SARSA(λ)
follows the same steps as TD(λ), implementing eligibility traces
to speed up convergence. The intuition behind the algorithm is
the same, but instead of applying the prediction method to the
states, SARSA(λ) applies it to state-action pairs. We have a trace for
each state-action pair and this trace is updated as follows:

e_t(s,a) = \begin{cases} \gamma \lambda \, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma \lambda \, e_{t-1}(s,a) & \text{otherwise} \end{cases}

To update the Q-function we use the following update rule:

Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, e_t(s,a) \quad \text{for all } s \in S

Considering that in this post I introduced many new concepts, I will
not proceed with the Python implementation of SARSA(λ). Consider
it homework and try to implement it by yourself. If what is explained
in the previous sections is not enough, you can read chapter 7.5
of Sutton and Barto's book.

SARSA: Python and ε-greedy policy


The Python implementation of SARSA requires a Numpy matrix
called state_action_matrix, which can be initialised with random
values or filled with zeros. Here you must remember that we defined
state_action_matrix as having one state for each column and one
action for each row (see the second post). For instance, in the 4x3 grid
world, with the query state_action_matrix[2, 0] we get the state-
action value for the state (1,3) (top-left corner) and action DOWN.
With the query state_action_matrix[0, 11] we get the state-action
value for the state (4,1) (bottom-right corner) and action UP. This
follows the convention of Russell and Norvig for naming the states,
with the bottom-left corner being the state (1,1). Note that in Python
we use the Numpy convention where [0, 0] is the top-left cell
of the grid world. SARSA is based on the following update rule
for the state-action matrix:

def update_state_action(state_action_matrix, observation, new_observation,
                        action, new_action, reward, alpha, gamma):
    '''Return the updated state-action matrix

    @param state_action_matrix the matrix before the update
    @param observation the state observed at t
    @param new_observation the state observed at t+1
    @param action the action at t
    @param new_action the action at t+1
    @param reward the reward observed after the action
    @param alpha the step size (learning rate)
    @param gamma the discount factor
    @return the updated state action matrix
    '''
    # Getting the values of Q at t and at t+1
    col = observation[1] + (observation[0]*4)
    q = state_action_matrix[action, col]
    col_t1 = new_observation[1] + (new_observation[0]*4)
    q_t1 = state_action_matrix[new_action, col_t1]
    # Applying the update rule
    state_action_matrix[action, col] += \
        alpha * (reward + gamma * q_t1 - q)
    return state_action_matrix

Moreover, since we are in the control case and we want to estimate a
policy, we also need a function that updates the policy:

def update_policy(policy_matrix, state_action_matrix, observation):
    '''Return the updated policy matrix

    @param policy_matrix the matrix before the update
    @param state_action_matrix the state-action matrix
    @param observation the state observed at t
    @return the updated policy matrix
    '''
    col = observation[1] + (observation[0]*4)
    # Getting the index of the action with the highest utility
    best_action = np.argmax(state_action_matrix[:, col])
    # Updating the policy
    policy_matrix[observation[0], observation[1]] = best_action
    return policy_matrix

The update_policy function makes the policy greedy, selecting the
action with the highest value in accordance with step 4 of the
algorithm. Finally, the main loop updates the state_action_matrix
and the policy_matrix at each visit of the episode.

for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    for step in range(1000):
        # Take the action from the action matrix
        action = policy_matrix[observation[0], observation[1]]
        # Move one step in the environment and get obs, reward and new action
        new_observation, reward, done = env.step(action)
        new_action = policy_matrix[new_observation[0], new_observation[1]]
        # Updating the state-action matrix
        state_action_matrix = update_state_action(state_action_matrix,
                                                  observation, new_observation,
                                                  action, new_action,
                                                  reward, alpha, gamma)
        # Updating the policy
        policy_matrix = update_policy(policy_matrix,
                                      state_action_matrix,
                                      observation)
        observation = new_observation
        if done: break

The complete Python script is available in the GitHub repository and
is called temporal_differencing_control_sarsa.py. Running the
script with alpha=0.001 and gamma=0.999 gives the optimal policy
after 180000 iterations.

Policy matrix after 1 iterations:


< v > *
^ # v *
> v v >
...

Policy matrix after 90001 iterations:


> > > *
^ # ^ *
^ < ^ <

...

Policy matrix after 180001 iterations:


> > > *
^ # ^ *
^ < < <

Does SARSA always converge to the optimal policy? The answer is
yes: SARSA converges with probability 1 as long as all the state-
action pairs are visited an infinite number of times. Russell and Norvig
call this assumption Greedy in the Limit of Infinite
Exploration (GLIE). A GLIE scheme must try each action in each
state an unbounded number of times to avoid having a finite
probability that an optimal action is missed because of an unusually
bad series of outcomes. In our grid world it can happen that an
unlucky initialisation produces a bad policy that keeps the agent far
from certain states. In the second post we used the assumption of
exploring starts to guarantee a uniform exploration of all the state-
action pairs. However, exploring starts can be hard to apply in a large
state space. An alternative solution is called the ε-greedy policy. An ε-
greedy policy explores all the states by taking the action with the
highest value most of the time, but with a small probability ε it selects
an action at random. After defining 0 ≤ σ ≤ 1 as a uniform random
number drawn at each time step, and A as the set containing all the
available actions, we select the action a as follows:

\pi(s) = \begin{cases} \operatorname{argmax}_{a} Q(s,a) & \text{if } \sigma > \epsilon \\ a \sim A(s) & \text{if } \sigma \leq \epsilon \end{cases}

In Python we can easily implement a function that returns an action
following the ε-greedy scheme:

def return_epsilon_greedy_action(policy_matrix, observation, epsilon=0.1):
    tot_actions = int(np.nanmax(policy_matrix) + 1)
    # Getting the greedy action
    action = int(policy_matrix[observation[0], observation[1]])
    # Probabilities of non-greedy actions
    non_greedy_prob = epsilon / tot_actions
    # Probability of the greedy action
    greedy_prob = 1 - epsilon + non_greedy_prob
    # Array containing a weight for each action
    weight_array = np.full((tot_actions), non_greedy_prob)
    weight_array[action] = greedy_prob
    # Sampling the action based on the weights
    return np.random.choice(tot_actions, 1, p=weight_array)
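
In the main loop the only change is the action selection line; a sketch of the
swap (casting to int so the sampled action can be used as an index):

# Inside the episode loop, instead of the greedy selection
#   action = policy_matrix[observation[0], observation[1]]
# we sample the action with the epsilon-greedy scheme:
action = int(return_epsilon_greedy_action(policy_matrix, observation, epsilon=0.1))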

In this naive implementation of the ε-greedy policy each non-greedy
action is given the same probability, but some actions may be better
than others. Using a softmax distribution (e.g. the Boltzmann
distribution) it is possible to give the highest probability to the
greedy action without treating all the others in the same way. Here, for
simplicity, I will use the naive approach. Exploring starts and the ε-
greedy policy do not exclude one another; they can coexist, and using
the two approaches at the same time can lead to faster convergence.
Let's try to extend the previous script with ε-greedy action selection
to see what happens. In the main loop we have to replace the
standard action selection with the ε-greedy one. Running the
script with gamma=0.999, alpha=0.001 and epsilon=0.1 gives the
optimal policy in 130000 iterations, meaning 50000 iterations less.
The complete code is part of the file
temporal_differencing_control_sarsa.py; you can enable or disable
the ε-greedy selection by commenting the corresponding line in the
main loop. How do we choose the value of ε? Most of the time a value
of 0.1 is a good choice. Choosing a value that is too high will cause
the algorithm to converge slowly (too much exploration). On the
other hand, a value that is too small does not guarantee that all
the state-action pairs are visited, leading to sub-optimal policies. This
issue is known as the exploration-exploitation dilemma and is one of
the fundamental problems of reinforcement learning. It is time to
introduce Q-learning, another algorithm for TD control estimation.

Q-learning: off-policy control


Q-learning was introduced by Watkins in his doctoral dissertation
and is considered one of the most important algorithms in
reinforcement learning. Understanding how it works means
understanding most of the following posts. Before proceeding any
further, make sure you got the following key concepts:

The Generalised Policy Iteration (GPI) (second post)
The Target term in TD learning (first section)
The update rule of SARSA (previous section)

Now we can proceed. In the control case we always used the policy
π to learn on the run, meaning that we updated π from experiences
sampled from π. This approach is called on-policy learning. There is
another way to learn about π which is called off-policy learning. In
off-policy learning we do not need a policy in order to update our Q-
function. Of course we can still generate a policy π based on the
action with the maximum utility (taken from our Q-function) but the
Q-function itself is updated thanks to a second policy µ that is not
updated. For instance, consider the first four iterations of an off-
policy algorithm applied to the 4x3 grid world. We can see how after
the random initialisation of π the states are updated step by step,
whereas the policy µ does not change at all.

What are the advantages of off-policy learning? First of all, using
off-policy learning it is possible to learn an optimal policy while
following an exploratory policy µ. Off-policy means learning by
observation. For example, we can find an optimal policy by watching a
robot that is following a sub-optimal policy. It is also possible to learn
about multiple policies while following one policy (e.g. a multi-robot
scenario). Moreover, in deep reinforcement learning we will see how
off-policy learning allows re-using old experiences generated from old
policies to improve the current policy (experience replay). The most
famous off-policy TD algorithm for control is called Q-Learning.
To understand how Q-learning works let's consider its update rule:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

Comparing the update rule of SARSA with the one of Q-learning you
will notice only one difference: the Target term. Here I report both of
them to simplify the comparison:

\text{Target[SARSA]} = r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1})
\text{Target[Q-learning]} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)

SARSA uses GPI to improve the policy π. The Target is estimated
through Q(s_{t+1}, a_{t+1}), which is based on the action a_{t+1} sampled from
the policy π. In SARSA improving π means improving the estimate
returned by Q(s_{t+1}, a_{t+1}). In Q-learning we have two policies, π and µ.
The action a needed to estimate Q(s_{t+1}, a) is taken directly
from the Q-function using the max operator. Q-learning makes an
update based on the greedy Q-value of the successor state, s_{t+1},
while SARSA uses the Q-value of the action a_{t+1} chosen by the
learning policy. This makes SARSA an on-policy algorithm, with
convergence depending on the learning policy. Q-learning does not
depend on the policy it follows and can converge while following a
completely random one. Now I would like to describe the Target term
used in Q-learning in more detail. First of all, recall that the value of
a_{t+1} cannot be sampled from the policy µ, because µ is not updated
during training (using it would break the GPI scheme). However,
having the Q-function we can update the policy π at each time step
as follows:

\pi(s_{t+1}) = \operatorname{argmax}_{a} Q(s_{t+1}, a)

At this point it is obvious that we do not really need the policy π for
choosing the action: we can simply use the term on the right-hand side
and rewrite the Target as the reward plus the discounted Q-value obtained
at s_{t+1} through a greedy selection:

\text{Target} = r_{t+1} + \gamma \, Q\!\left(s_{t+1}, \operatorname{argmax}_{a} Q(s_{t+1}, a)\right)

The inner expression corresponds to the highest Q-value at t+1,
meaning that the Target can be reduced to:

\text{Target} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)

That’s it, we have the Target used in the actual update rule and this
value follows the GPI scheme. Let’s see now all the steps involved in
Q-learning:
1. Move one step selecting at from µ(st)
2. Observe: rt+1, st+1
3. Update the state-action function Q(st,at)
4. (optional) Update the policy π(st)← argmax aQ(st,a)

There are some differences between the steps followed in SARSA
and the ones followed in Q-learning. Unlike SARSA, step 2 of Q-
learning does not consider a_{t+1}, the action at the next step. In this
sense Q-learning updates the state-action function using the tuple
State-Action-Reward-State. Comparing step 1 and step 4 you can
see that in step 1 of SARSA the action is sampled from π and then
the same policy is updated at step 4. In step 1 of Q-learning we
sample the action from the exploration policy µ, and (optionally) we
update the policy π at step 4. Step 4 is optional because the greedy
action can be obtained directly from the Q-function; calculating and
storing π can be a waste of resources.
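
The difference between the two Targets is easy to see in code (a sketch
reusing the state-action matrix convention and the variable names of the
snippets above):

# col_t1 is the column of the state observed at t+1.
# SARSA (on-policy): bootstrap on the action actually selected by the policy.
target_sarsa = reward + gamma * state_action_matrix[new_action, col_t1]
# Q-learning (off-policy): bootstrap on the best action according to Q itself.
target_qlearning = reward + gamma * np.max(state_action_matrix[:, col_t1])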

Also for Q-learning there is a version based on eligibility traces.
Actually there are two versions: Watkins's Q(λ) and Peng's Q(λ).
Here I will focus on Watkins's Q(λ), which was introduced by Watkins
in his doctoral dissertation. The idea behind the algorithm is similar to
TD(λ) and SARSA(λ). As in SARSA(λ) we are updating state-action
pairs, but with an important difference. In Q-learning there are two
policies: the exploratory policy µ used to sample actions and the
target policy π updated at each iteration. Because the action a is
chosen with ε-greedy, there is a chance of selecting an exploratory
action instead of a greedy one. In this case the eligibility traces for
all state-action pairs but the current one are set to zero.

e_t(s,a) = I_{s s_t} \cdot I_{a a_t} + \begin{cases} \gamma \lambda \, e_{t-1}(s,a) & \text{if } Q_{t-1}(s_t, a_t) = \max_{a} Q_{t-1}(s_t, a) \\ 0 & \text{otherwise} \end{cases}

The term I_{s s_t} is an identity indicator, equal to 1 if s = s_t and 0
otherwise; the same holds for I_{a a_t}. The estimation error δ is defined as:

\delta_t = r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t)

To update the Q-function we use the following update rule:

Q_{t+1}(s,a) = Q_t(s,a) + \alpha \, \delta_t \, e_t(s,a) \quad \text{for all } s \in S

Unfortunately, cutting off the traces when an exploratory non-greedy
action is taken loses much of the advantage of using eligibility
traces. As for SARSA(λ), I will not implement the Q(λ) algorithm in
the next section, to keep the post more compact. However, good
pseudo-code is given in chapter 7.6 of Sutton and Barto's book.

Q-learning: Python implementation


The Python implementation of the algorithm requires a random
policy called policy_matrix and an exploratory policy called
exploratory_policy_matrix. The first can be initialised randomly,
whereas the second can be any sub-optimal policy. The actions to be
executed at each visit are taken from exploratory_policy_matrix,
whereas the update rule of step 4 is applied to policy_matrix.
The code is very similar to the one used for SARSA; the main
difference is in the update rule for the state-action matrix:

def update_state_action(state_action_matrix, observation, new_observation,
                        action, reward, alpha, gamma):
    '''Return the updated state-action matrix

    @param state_action_matrix the matrix before the update
    @param observation the state observed at t
    @param new_observation the state observed at t+1
    @param action the action at t
    @param reward the reward observed after the action
    @param alpha the step size (learning rate)
    @param gamma the discount factor
    @return the updated state action matrix
    '''
    # Getting the values of Q at t and at t+1
    col = observation[1] + (observation[0]*4)
    q = state_action_matrix[action, col]
    col_t1 = new_observation[1] + (new_observation[0]*4)
    q_t1 = np.max(state_action_matrix[:, col_t1])
    # Applying the update rule
    state_action_matrix[action, col] += alpha * (reward + gamma * q_t1 - q)
    return state_action_matrix
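
The main loop is almost identical to the SARSA one: the action executed in
the environment now comes from the exploratory policy µ, while π is the
policy being improved. Below is a minimal sketch of what it could look like,
assuming the same GridWorld interface and the helper functions defined above
(the repository script may differ in the details):

for epoch in range(tot_epoch):
    # Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    for step in range(1000):
        # The action executed in the world comes from the exploratory policy mu
        action = exploratory_policy_matrix[observation[0], observation[1]]
        # Move one step in the environment and get obs and reward
        new_observation, reward, done = env.step(action)
        # Update Q using the max operator (off-policy target)
        state_action_matrix = update_state_action(state_action_matrix,
                                                  observation, new_observation,
                                                  action, reward, alpha, gamma)
        # (Optional) make the target policy pi greedy with respect to Q
        policy_matrix = update_policy(policy_matrix,
                                      state_action_matrix,
                                      observation)
        observation = new_observation
        if done: break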

Time for an example. Let's suppose you noticed that the cleaning
robot you bought last week does not follow an optimal policy when going
back to the charging station. The robot is following a sub-optimal
path that is unsafe. You want to find an optimal policy and propose
an upgrade to the manufacturer (and get hired!). There is a problem:
you do not have any access to the robot's firmware. The robot is
following its internal policy µ and this policy is inaccessible. What to
do?

What you can do is use an off-policy algorithm like Q-learning to
estimate an optimal policy. First of all you create a discrete version of
the room using the GridWorld class. Second, you get a camera and,
thanks to some markers, you estimate the position of the robot in the
real world and map it onto the grid world. At each time step you
have the position of the robot and the reward. The camera is
connected to your workstation, and on the workstation the
Q-learning Python script is running. Fortunately you do not have to
write the code from scratch, because a good starting script is
available on the dissecting-reinforcement-learning official repository
on GitHub and is called
temporal_differencing_control_qlearning.py. The script is based
on the usual 4x3 grid world, but it can be easily extended to more
complex scenarios.

Running the script with alpha=0.001, gamma=0.999 and epsilon=0.1,
the algorithm converges to the optimal policy in 300000 iterations.
Great! You got the optimal policy. However, there are two important
limitations. First, the algorithm converged after 300000 iterations,
meaning that you need 300000 episodes. You would probably have to
monitor the robot for months in order to collect all these episodes. In a
deterministic environment you can estimate the policy µ through
observation and then run the 300000 episodes in simulation.
However, in environments that are non-deterministic you need to
spend much more energy in order to find the motion model. This is
very time consuming. The second limitation is that the optimal policy
is valid only for the current room setup. When you change the
position of the charging station or the position of the obstacles you
have to find a new policy. We will see in a future post how to
generalise to much larger state spaces using supervised learning
and neural networks. For the moment let's focus on the results
achieved. Q-learning followed a policy µ which was sub-optimal and
estimated the optimal policy π∗ starting from a random policy π.

What is interesting about Q-learning is that it also converges when
the policy µ is an adversarial policy. Let's suppose that µ pushes
the robot as far as possible from the charging station and as close as
possible to the stairs. Is the algorithm going to converge in these
extreme conditions? Yes. Running the script with the same
parameters and an adversarial policy, the algorithm converges to
the optimal policy in 583001 iterations. This also shows, empirically,
that starting from a favourable policy speeds up convergence.

Conclusions
This post has summarised many important concepts in reinforcement learning. TD methods are widely used because of their simplicity and versatility. As in the second post, we divided TD methods into two families: prediction and control. The prediction TD algorithm has been called TD(0). Via eligibility traces it is possible to extend to previous states what has been learnt in the last one. The extension of TD(0) with eligibility traces is called TD(λ). The control algorithms in TD are called SARSA and Q-learning. The former is an on-policy algorithm that updates the policy while moving in the environment. The latter is an off-policy algorithm based on two separate policies, one being updated and the other used for moving in the world. Do TD methods converge faster than MC methods? There is no general mathematical proof, but in practice TD methods usually converge faster.

Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.

Resources
The complete code for TD prediction and TD control is available on the dissecting-reinforcement-learning official repository on GitHub.

David Silver’s course (DeepMind), in particular lesson 4 [pdf][video] and lesson 5 [pdf][video]

Christopher Watkins’ doctoral dissertation, which introduced Q-learning for the first time [pdf]

Machine Learning. Mitchell, T. (1997) [web]

Artificial intelligence: a modern approach (chapters 17 and 21). Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., & Edwards, D. D. (2003). Upper Saddle River: Prentice Hall. [web][github]

Reinforcement learning: An introduction. Sutton, R. S., & Barto, A. G. (1998). Cambridge: MIT Press. [html]

Reinforcement learning: An introduction (second edition). Sutton, R. S., & Barto, A. G. (2018). [pdf]

References
Bellman, R. (1957). A Markovian decision process (No. P-1066). RAND Corporation, Santa Monica, CA.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical conditioning II: Current research and theory, 2, 64-99.

Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering.

Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., & Edwards, D. D. (2003). Artificial intelligence: a modern approach (Vol. 2). Upper Saddle River: Prentice Hall.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1), 9-44.

Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement.

Watkins, C. J. C. H. (1989). Learning from delayed rewards (Doctoral dissertation, University of Cambridge).
Dissecting Reinforcement
Learning-Part.4
Massimiliano Patacchiola Feb 11, 2017

Here we are, the fourth episode of the “Dissecting Reinforcement Learning” series. In this post I will introduce another group of techniques widely used in reinforcement learning: Actor-Critic (AC) methods. I often define AC as a meta-technique that uses the methods introduced in the previous posts in order to learn. AC-based algorithms are among the most popular methods in reinforcement learning. For example, the Deep Deterministic Policy Gradient algorithm recently introduced by researchers at Google DeepMind is an actor-critic, model-free method. Moreover, the AC framework has many connections with neuroscience and animal learning, in particular with models of the basal ganglia (Takahashi et al. 2008).

AC methods are not accurately described in the books I usually suggest. For instance, in Russel and Norvig and in Mitchell’s book they are not covered at all. In the classical Sutton and Barto’s book there are only three short paragraphs (2.8, 6.6, 7.7); however, in the second edition a wider description of neuronal AC methods has been added in chapter 15 (neuroscience). A meta-classification of reinforcement learning techniques is covered in the article “Reinforcement Learning in a Nutshell”. Here, I will introduce AC methods starting from neuroscience. You can consider this post as the neuro-physiological counterpart of the third one, which introduced Temporal Differencing (TD) methods from a psychological and behaviouristic point of view.

Actor-Critic methods (and rats)


AC methods are deeply connected with neuroscience, therefore I will introduce this topic with a brief excursion into the neuroscience field. If you have a pure computational background you will learn something new. My objective is to give you a deeper insight into the reinforcement learning (extended) world. To understand this introduction you should be familiar with the basic structure of the nervous system. What is a neuron? How do neurons communicate using synapses and neurotransmitters? What is the cerebral cortex? You do not need to know the details; here I just want you to get the general scheme. Let’s start with dopamine. Dopamine is a neuromodulator involved in some of the most important processes in the human and animal brain. You can see dopamine as a messenger that allows neurons to communicate. Dopamine has an important role in multiple processes in the mammalian brain (e.g. learning, motivation, addiction), and it is produced in two specific areas: the substantia nigra pars compacta and the ventral tegmental area. These two areas have direct projections to another area of the brain, the striatum. The striatum is divided into two parts: ventral striatum and dorsal striatum. The output of the striatum is directed to motor areas and the prefrontal cortex, and it is involved in motor control and planning.
Most of the brain areas cited so far are part of the basal ganglia. Different models have found a connection between the basal ganglia and learning. In particular, it seems that the phasic activity of the dopaminergic neurons can encode an error. This error is very similar to the error in TD learning that I introduced in the third post. Before going into details I would like to simplify the basal ganglia mechanism by distinguishing between two groups:

1. Ventral striatum, substantia nigra, ventral tegmental area
2. Dorsal striatum and motor areas

There are no specific biological names for these groups, so I will create two labels for the occasion. The first group can evaluate the saliency of a stimulus based on the associated reward. At the same time it can estimate an error by comparing the expected result of the action with its direct consequences, and use this value to calibrate an executor. For these reasons I will call it the critic. The second group has direct access to actions but no way to estimate the utility of a stimulus; because of that I will call it the actor.
The interaction between actor and critic has an important role in
learning. In particular, well-established research has shown that basal
ganglia are involved in Pavlovian learning (see third post) and in
procedural (implicit) memory, meaning unconscious memories such
as skills and habits. On the other hand the acquisition of declarative
(explicit) memory, implied in the recollection of factual information,
seems to be connected with another brain area called hippocampus.
The only way actor and critic can communicate is through the
dopamine released from the substantia nigra after the activation of
the ventral striatum. Drug abuse can have an effect on the
dopaminergic system, altering the communication between actor
and critic. Experiments by Takahashi et al. (2007) showed that cocaine sensitization in rats can result in maladaptive decision-making. In particular, rather than being guided by long-term goals, the rats are driven by immediate rewards. This issue also arises in standard computational pipelines and is known as the credit assignment problem. For example, when playing chess it is not
easy to isolate the most salient actions that lead to the final victory
(or defeat).
To understand how the neuronal actor-critic mechanism was involved in the credit assignment problem, Takahashi et al. (2008) observed the performance of rats pre-sensitized with cocaine in a
Go/No-Go task. The procedure of a Go/No-Go task is simple. The
rat is in a small metallic box and it has to learn to poke a button with
the nose when a specific odour (cue) is released. If the rat pokes the
button when a positive odour is present it gets rewarded (with
delicious sugar). If the rat pokes the button when a negative odour is
present it gets punished (e.g. with a bitter substance such as
quinine). Positive and negative do not mean that the odours are pleasant or unpleasant; we can consider them neutral. Learning means associating a specific odour with reward or punishment. Finally,
if the rat does not move (No-Go) then neither reward nor punishment
are given. In total there are four possible conditions.

The interesting fact observed in those experiments is that rats pre-sensitized with cocaine do not learn the task. The most plausible explanation is that cocaine damages the basal ganglia and the signal returned by the critic gets distorted. To test this hypothesis Takahashi et al. (2008) sensitized a group of rats 1-3 months before the experiment and then compared it with a non-sensitized control group. The results of the experiment showed that the rats in the control group could learn how to obtain the reward (go) when the positive odour was presented and how to avoid the punishment (no-go) when the negative odour was presented. The observations of the basal ganglia showed that the ventral striatum (critic) developed some cue-selective neurons that fired only when the odour appeared. These neurons developed during training and their activity preceded the response in the dorsal striatum (actor).

On the other hand, the cocaine-sensitized rats did not show any kind of cue-selectivity during training. Moreover, post-mortem analysis showed that those rats did not develop cue-selective neurons in the ventral striatum (critic). These results confirm the hypothesis that the critic learns the value of the cue and instructs the actor about the action to execute.

In this section, I showed how the AC framework is deeply connected to the neurobiology of the mammalian brain. The model is elegant and it can explain phenomena such as Pavlovian learning and drug addiction. However, the elegance of the model should not prevent us from criticizing it. Be aware that there are several experiments that did not confirm it. For example, some forms of stimulus-reward learning can take place in the absence of dopamine. Moreover, dopamine cells can fire before the stimulus, meaning that their values cannot be used for the update. For a good review of neuronal AC models and their limits I suggest the article by Joel et al. (2002).

Now it’s time to turn our attention to math and code. How can we
build a computational model from the biological one?

Rewiring Actor-Critic methods


In the last section I presented a neuronal model of the basal ganglia based on the AC framework. Here, I will rewire that model using the reinforcement learning techniques we have studied until now. The objective is to obtain a computational version which can be used in generic cases (e.g. the 4x3 grid world). The first implementation of an AC algorithm is due to Witten (1977); however, the terms Actor and Critic were introduced later by Barto et al. (1983) to solve the pole-balancing problem. First of all, how can we represent the critic? In the neural version the critic does not have access to the actions. The input to the critic is the information obtained through the cerebral cortex, which we can compare to the information obtained by the agent through its sensors (state estimation). Moreover, the critic receives a reward as input, which comes directly from the environment. The critic can be represented by a utility function that is updated based on the reward signal received at each iteration. In model-free reinforcement learning we can use the TD(0) algorithm to represent the critic. The dopaminergic output from the substantia nigra and ventral tegmental area can be represented by the two signals returned by TD(0): the update value and the error estimation δ. In practice we use the update signal to improve the utility function and the error to update the actor. How can we represent the actor? In the neural system the actor receives an input from the cerebral cortex, which we can translate into sensor signals (current state). The dorsal striatum projects to the motor areas and executes an action. Similarly, we can use a state-action matrix containing the possible actions for each state. The action can be selected with an ε-greedy (or softmax) strategy and then updated using the error returned by the critic. As usual a picture is worth a thousand words:

We can summarize the steps of the AC algorithm as follows:

1. Produce the action a_t for the current state s_t
2. Observe the next state s_{t+1} and the reward r
3. Update the utility of state s_t (critic)
4. Update the probability of the action using the error δ (actor)

In step 1, the agent produces an action following the current policy. In the previous posts I used an ε-greedy strategy to select the action and to update the policy. Here, I will select the action using a softmax function:

$$P\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$$

After the action we observe the new state and the reward (step 2). In step 3 we plug the reward and the utilities of s_t and s_{t+1} into the standard update rule used in TD(0) (see third post):

$$U(s_t) \leftarrow U(s_t) + \alpha \big[ r_{t+1} + \gamma U(s_{t+1}) - U(s_t) \big]$$

In step 4 we use the error estimation δ to update the policy. In practical terms, step 4 consists of strengthening or weakening the probability of the action using the error δ and a positive step-size parameter β:

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$$

Like in the TD case, we can also integrate the eligibility traces mechanism (see third post). In the AC case we need two sets of traces, one for the actor and one for the critic. For the critic we need to store a trace for each state and update it as follows:

$$e_t(s) = \begin{cases} \gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t; \\ \gamma \lambda e_{t-1}(s) + 1 & \text{if } s = s_t; \end{cases}$$

Nothing is different from the TD(λ) method I introduced in the third post. Once we have estimated the trace we can update the state as follows:

$$U(s_t) \leftarrow U(s_t) + \alpha \delta_t e_t(s)$$

For the actor we have to store a trace for each state-action pair, similarly to SARSA and Q-learning. The traces can be updated as follows:

$$e_t(s,a) = \begin{cases} \gamma \lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t; \\ \gamma \lambda e_{t-1}(s,a) & \text{otherwise;} \end{cases}$$

Finally, the probability of choosing an action is updated as follows:

$$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t e_t(s_t, a_t)$$
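The Python example later in this post implements the plain AC updates without traces. For completeness, here is a minimal hedged sketch of how the two trace structures could be added; the array shapes assume the 4x3 grid world (critic as a 3x4 matrix, actor as a 4x12 matrix) and the variable names are my own, not the ones used in the repository scripts:

import numpy as np

def update_critic_traces(utility_matrix, trace_state, observation,
                         new_observation, reward, alpha, gamma, lambda_):
    '''Sketch of the TD(lambda) critic: one eligibility trace per state.'''
    u = utility_matrix[observation[0], observation[1]]
    u_t1 = utility_matrix[new_observation[0], new_observation[1]]
    delta = reward + gamma * u_t1 - u
    #Decay all the traces, then reinforce the trace of the visited state
    trace_state *= gamma * lambda_
    trace_state[observation[0], observation[1]] += 1.0
    #Every state is updated in proportion to its trace
    utility_matrix += alpha * delta * trace_state
    return utility_matrix, trace_state, delta

def update_actor_traces(state_action_matrix, trace_state_action, observation,
                        action, delta, beta, gamma, lambda_):
    '''Sketch of the actor update: one eligibility trace per state-action pair.'''
    col = observation[1] + (observation[0]*4)
    trace_state_action *= gamma * lambda_
    trace_state_action[action, col] += 1.0
    state_action_matrix += beta * delta * trace_state_action
    return state_action_matrix, trace_state_action

These two helpers could replace update_critic and update_actor in the main loop shown later, with both trace matrices reset to zero at the beginning of each episode.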

Great, we obtained our generic computational model to use in a standard reinforcement learning scenario. Now, I would like to close the loop by providing an answer to this simple question: does the computational model explain the neurobiological observations? Apparently yes. In the previous section we saw how Takahashi et al. (2008) observed some anomalies in the interaction between actor and critic in rats sensitized with cocaine. Drug abuse seems to deteriorate the dopaminergic feedback going from the critic to the actor. From the computational point of view we can observe a similar result when all the values U(s) are the same regardless of the current state. In this case the prediction error δ generated by the critic (with γ = 1) reduces to the immediately available reward:

$$\delta_t = r_{t+1} + U(s_{t+1}) - U(s_t) = r_{t+1}$$
This result explains why the credit assignment problem emerges during the training of cocaine-sensitized rats. The rats prefer the immediate reward and do not take into account the long-term drawbacks. Learning based only on immediate reward is not sufficient to master a complex Go/No-Go task, but in simpler tasks learning can be faster, with cocaine-sensitized rats performing better than the control group. However, for a neuroscientist explanations of this sort are too tidy. Recent work has highlighted the existence of multiple learning systems operating in parallel in the mammalian brain. Some of these systems (e.g. amygdala and/or nucleus accumbens) can replace a malfunctioning critic and compensate for the damage caused by cocaine sensitization. In conclusion, additional experiments are needed in order to shed light on the neuronal AC architecture. Now it is time for coding. In the next section I will show you how to implement an AC algorithm in Python and how to apply it to the cleaning robot example.

Actor-Critic Python implementation


Using the knowledge acquired in the previous posts we can easily create a Python script that implements an AC algorithm. As usual, I will use the cleaning robot example and the 4x3 grid world. To understand this example you have to read the rules of the grid world introduced in the first post. First of all I will describe the general architecture, then I will walk step by step through the algorithm in a single episode. Finally I will implement everything in Python. In the complete architecture we can represent the critic using a utility function (state matrix). The matrix is initialized with zeros and updated at each iteration through TD learning. For example, after the first step the robot moves from (1,1) to (1,2), obtaining a reward of -0.04. The actor is represented by a state-action matrix similar to the one used to model the Q-function. Each time a new state is observed an action is returned and the robot moves. To avoid clutter I will draw an empty state-action matrix, but imagine that the values inside the table have been initialized with random samples in the range [0,1].
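For concreteness, here is a minimal sketch of how the two structures could be initialized; the shapes assume the 4x3 grid world flattened into 12 state columns with 4 actions, matching the indexing used in the functions below:

import numpy as np

#Critic: one utility value per grid cell (3 rows x 4 columns), starting at zero
utility_matrix = np.zeros((3, 4))

#Actor: one preference value per (action, state) pair, 4 actions and 12 states,
#initialized with random samples in the range [0, 1]
state_action_matrix = np.random.uniform(0.0, 1.0, size=(4, 12))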

In the episode considered here the robot starts in the bottom-left corner at state (1,1) and it reaches the charging station (reward=+1.0) after seven steps.

The first thing to do is to take an action. A query to the state-action table (actor) returns the action vector for the current state, which in our case is [0.48, 0.08, 0.15, 0.37]. The action vector is passed to the softmax function, which turns it into a probability distribution [0.30, 0.20, 0.22, 0.27]. I sampled from the distribution using the NumPy method np.random.choice(), which returned the action UP. The softmax function takes as input the N-dimensional action vector x and returns an N-dimensional vector of real values in the range [0, 1] that add up to 1. The softmax function can be easily implemented in Python; however, differently from the original softmax equation, here I subtract np.max() in the exponents to avoid numerical overflow:

def softmax(x):
    '''Compute softmax values of array x.

    @param x the input array
    @return the softmax array
    '''
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)))
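As a quick check, here is a hedged usage sketch reproducing the numbers mentioned above; the action vector is the hypothetical one of state (1,1), and the sampled action depends on the random seed:

import numpy as np

action_array = np.array([0.48, 0.08, 0.15, 0.37])
action_distribution = softmax(action_array)
print(action_distribution)  #approximately [0.30, 0.20, 0.22, 0.27]

#Sampling an action index (0-3) from the softmax distribution
action = np.random.choice(4, 1, p=action_distribution)
print(action)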

After the action, a new state is reached and a reward is available (-0.04). It’s time to update the state value of the critic and to estimate the error δ. Here, I used the following parameters: α = 0.1, β = 1.0 and γ = 0.9. Applying the update rule (step 3 of the algorithm) we obtain the new value for the state (1,1): 0.0 + 0.1[-0.04 + 0.9(0.0) - 0.0] = -0.004. At the same time it is possible to calculate the error δ as follows: -0.04 + 0.9(0.0) - 0.0 = -0.04.

The robot is in a new state, and the error has been evaluated by the critic. Now the error has to be used to update the state-action table of the actor. In this step, the action UP for state (1,1) is weakened by adding the negative term δ. In the case of a positive δ the action would be strengthened.

We can repeat the same steps until the end of the episode. All the actions will be weakened except the last one, which will be strengthened by a factor of +1.0. Repeating the process for many episodes we get the optimal utility matrix and the optimal policy.

Time for the Python implementation. First of all, we have to create a function to update the utility matrix (critic). I called this function update_critic. The inputs are the utility_matrix, the observation and new_observation states, the reward, and the usual hyper-parameters. The function returns an updated utility matrix and the estimation error delta, which is used to update the actor.

def update_critic(utility_matrix, observation, new_observation,
                  reward, alpha, gamma):
    '''Return the updated utility matrix

    @param utility_matrix the matrix before the update
    @param observation the state observed at t
    @param new_observation the state observed at t+1
    @param reward the reward observed after the action
    @param alpha the step size (learning rate)
    @param gamma the discount factor
    @return the updated utility matrix
    @return the estimation error delta
    '''
    u = utility_matrix[observation[0], observation[1]]
    u_t1 = utility_matrix[new_observation[0], new_observation[1]]
    delta = reward + gamma * u_t1 - u
    utility_matrix[observation[0], observation[1]] += alpha * delta
    return utility_matrix, delta

The function update_actor is used to update the state-action matrix. The parameters passed to the function are the state_action_matrix, the observation, the action, the estimation error delta (returned by update_critic), and an optional beta_matrix, a per-pair visit counter that stores how many times each state-action pair has been visited and is used to obtain a decaying step size.

def update_actor(state_action_matrix, observation, action,
                 delta, beta_matrix=None):
    '''Return the updated state-action matrix

    @param state_action_matrix the matrix before the update
    @param observation the state observed at t
    @param action the action taken at time t
    @param delta the estimation error returned by the critic
    @param beta_matrix a visit counter for each state-action pair
    @return the updated matrix
    '''
    col = observation[1] + (observation[0]*4)
    if beta_matrix is None: beta = 1
    else: beta = 1 / beta_matrix[action, col]
    state_action_matrix[action, col] += beta * delta
    return state_action_matrix
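Here is a hedged usage sketch of the two functions on the first step of the episode described above. The matrix indices for states (1,1) and (1,2) and the action encoding are assumptions based on the grid-world indexing used in the scripts, with the bottom-left cell at row 2, column 0:

import numpy as np

utility_matrix = np.zeros((3, 4))
state_action_matrix = np.random.uniform(0.0, 1.0, size=(4, 12))

observation = (2, 0)      #assumed matrix index of state (1,1)
new_observation = (1, 0)  #assumed matrix index of state (1,2)
action = 0                #assumed index of the action UP
reward = -0.04

utility_matrix, delta = update_critic(utility_matrix, observation,
                                      new_observation, reward,
                                      alpha=0.1, gamma=0.9)
print(utility_matrix[2, 0])  #-0.004, as computed by hand above
print(delta)                 #-0.04

#The negative delta weakens the preference for UP in state (1,1)
state_action_matrix = update_actor(state_action_matrix, observation,
                                   action, delta, beta_matrix=None)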

The two functions are used in the main loop. The exploring-start assumption is once again used here to guarantee uniform exploration. The beta_matrix parameter is not used in this example, but it can easily be enabled (a sketch is given after the loop).

for epoch in range(tot_epoch):
    #Reset and return the first observation
    observation = env.reset(exploring_starts=True)
    for step in range(1000):
        #Estimating the action through Softmax
        col = observation[1] + (observation[0]*4)
        action_array = state_action_matrix[:, col]
        action_distribution = softmax(action_array)
        #Sampling an action using the probability
        #distribution returned by softmax
        action = np.random.choice(4, 1, p=action_distribution)
        #beta_matrix[action,col] += 1 #increment the counter
        #Move one step in the environment and get obs and reward
        new_observation, reward, done = env.step(action)
        #Updating the critic (utility_matrix) and getting the delta
        utility_matrix, delta = update_critic(utility_matrix, observation,
                                              new_observation, reward,
                                              alpha, gamma)
        #Updating the actor (state-action matrix)
        state_action_matrix = update_actor(state_action_matrix, observation,
                                           action, delta, beta_matrix=None)
        observation = new_observation
        if done: break
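For completeness, a minimal hedged sketch of how beta_matrix could be enabled: initialize a counter with the same shape as the state-action matrix, increment it at every visit inside the inner loop, and pass it to update_actor so that the actor's step size decays as 1/visits. The placement comments below refer to the loop above:

#Before the main loop: one counter per (action, state) pair
beta_matrix = np.zeros((4, 12))

#Inside the inner loop, right after sampling the action:
beta_matrix[action, col] += 1  #increment the visit counter

#And pass the counter to the actor update, so that beta = 1/visits:
state_action_matrix = update_actor(state_action_matrix, observation,
                                   action, delta, beta_matrix=beta_matrix)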

I uploaded the complete code to the official GitHub repository under the name actor_critic.py. Running the script with gamma = 0.999 and alpha = 0.001 I obtained the following utility matrices:

Utility matrix after 1001 iterations:


[[-0.02564938 0.07991029 0.53160489 0. ]
[-0.054659 0. 0.0329912 0. ]
[-0.06327405 -0.06371056 -0.0498283 -0.11859039]]

...

Utility matrix after 150001 iterations:


[[ 0.85010645 0.9017371 0.95437213 0. ]
[ 0.80030524 0. 0.68354459 0. ]
[ 0.72840853 0.55952242 0.60486472 0.39014426]]

...

Utility matrix after 300000 iterations:


[[ 0.84762914 0.90564964 0.95700181 0. ]
[ 0.79807688 0. 0.69751386 0. ]
[ 0.72844679 0.55459785 0.60332219 0.38933992]]

Comparing the result obtained with AC with the one obtained with dynamic programming in the first post, we notice a few differences. Similarly to the TD(0) estimate in the third post, the value of the two terminal states is zero. This is a consequence of the fact that we cannot estimate the update value for a terminal state, because after a terminal state there is no other state. As discussed in the third post this is not a big issue, since it does not affect the convergence, and it can be addressed with a simple conditional statement (a sketch is given below). From a practical point of view the results obtained with the AC algorithm can be unstable because there are more hyper-parameters to tune; however, the flexibility of the paradigm can often balance this drawback.
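As an illustration, here is one possible hedged version of that conditional fix, assuming we simply pin the utility of the terminal state to the reward collected when entering it; the done flag comes from env.step as in the loop above:

#Updating the critic (utility_matrix) and getting the delta
utility_matrix, delta = update_critic(utility_matrix, observation,
                                      new_observation, reward,
                                      alpha, gamma)
#Possible conditional fix: assign the entry reward to the terminal state
if done:
    utility_matrix[new_observation[0], new_observation[1]] = reward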

Actor-only and Critic-only methods


In the Sutton and Barto’s book AC methods are considered part of
TD methods. That makes sense, because the critic is an
implementation of the TD(0) algorithm and it is updated following the
same rule. The question is: why should we use AC methods
instead of TD learning? The main advantage of AC methods is that
the
𝛿
returned by the critic is an error value produced by an external
supervisor. We can use this value to adjust the policy with
supervised learning. The use of an external supervisor reduces the
variance when compared to pure Actor-only methods. These
aspects will be clearer when I will introduce function approximators
later on in the series. Another advantage of AC methods is that the
action selection requires minimal computation. Until now we
always had a discrete number of possible actions. When the action
space is continuous and the possible number of action infinite, it is
computationally prohibitive to search for the optimal action in this
infinite set. AC methods can represent the policy in a separate
discrete structure and use it to find the best action. Another
advantage of AC methods is their similarity to the brain
mechanisms of reward in the mammalian brain. This similarity
makes AC methods appealing as psychological and biological
models. To summarize there are three advantages in using AC
methods:

1. Variance reduction in function approximation.
2. Computational efficiency in continuous action spaces.
3. Similarity to the biological reward mechanisms of the mammalian brain.

The distinction between actor and critic is also very useful from a taxonomic point of view. In the article “Reinforcement Learning in a Nutshell” AC methods are considered a meta-category that can be used to assign all the techniques I have introduced until now to three macro-groups: AC methods, Critic-only, and Actor-only. Here I will follow a similar approach to give a wider view of what is available out there. In this post I introduced a possible architecture for an AC algorithm. In AC methods the actor and the critic are represented explicitly and trained separately, but we could ask: is it possible to use only the actor or only the critic? In previous posts we considered utility functions and policies. In dynamic programming these two entities collapsed into the value iteration and the policy iteration algorithms (see first post). Both of those algorithms are based on utility estimation, which allows the policy to converge thanks to the Generalised Policy Iteration (GPI) mechanism (see second post). Note that even in TD learning we rely on utility estimation (see third post), especially when the emphasis is on the policy (SARSA and Q-learning). All these methods can be broadly grouped into a category called Critic-only. Critic-only methods always build a policy on top of a utility function and, as I said, the utility function is the critic in the AC framework.

What if we search for an optimal policy without using a utility function? Is that possible? The answer is yes. We can search directly in policy space using an Actor-only approach. A class of algorithms
called REINFORCE (REward Increment = Nonnegative Factor x Offset
Reinforcement x Characteristic Eligibility) can be considered part of
the Actor-only group. REINFORCE measures the correlation between
the local behaviour and the global performance of the agent and
updates the weights of a neural network. To understand REINFORCE
it is necessary to know gradient descent and generalisation through
neural networks (which I will cover later in this series). Here, I would
like to focus more on another type of Actor-only techniques:
evolutionary algorithms. The evolutionary algorithm label can be applied to a wide range of techniques, but in reinforcement learning genetic algorithms are most often used. Genetic algorithms represent each policy as a possible solution to the agent’s problem. Imagine 10 cleaning robots working in parallel, each one using a different (randomly initialized) policy. After 100 episodes we can have an
estimation of how good the policy of each single robot is. We can
keep the best robots and randomly mutate their policies in order to
generate new ones. After some generations, evolution selects the
best policies and among them we can (probably) find the optimal
one. In classic reinforcement learning textbooks genetic algorithms are not covered, but I have had first-hand experience with them. When I was an undergraduate I did an internship at the Laboratory of Autonomous Robotics and Artificial Life (LARAL), where I used genetic algorithms in evolutionary robotics to investigate the decision-making strategies of simulated robots living
in different ecologies. I will spend more words on genetic algorithms
for reinforcement learning in the next post.

Conclusions
Starting from the neurobiology of the mammalian brain I introduced
AC methods, a class of reinforcement learning algorithms widely
used by the research community. The neuronal AC model can
describe phenomena like Pavlovian learning and drug addiction,
whereas its computational counterpart can be easily applied to
robotics and machine learning. The Python implementation is
straightforward and is based on the TD(0) algorithm introduced in
the third post. AC methods are also useful for taxonomic reasons: we can categorize TD algorithms as Critic-only methods, and techniques such as REINFORCE and genetic algorithms as Actor-only methods. In the next post I will focus on genetic algorithms, a family of methods that allows us to search directly in policy space without the need for a utility function.
Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.

Resources
The complete code for the Actor-Critic examples is available on the dissecting-reinforcement-learning official repository on GitHub.

Reinforcement learning: An introduction. Sutton, R. S., & Barto, A. G. (1998). Cambridge: MIT Press. [html]

Reinforcement learning: An introduction (second edition). Sutton, R. S., & Barto, A. G. (in progress). [pdf]

Reinforcement Learning in a Nutshell. Heidrich-Meisner, V., Lauer, M., Igel, C., & Riedmiller, M. A. (2007) [pdf]

References
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5), 834-846.

Heidrich-Meisner, V., Lauer, M., Igel, C., & Riedmiller, M. A. (2007, April). Reinforcement learning in a nutshell. In ESANN (pp. 277-288).

Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15(4), 535-547.

Takahashi, Y., Roesch, M. R., Stalnaker, T. A., & Schoenbaum, G. (2007). Cocaine exposure shifts the balance of associative encoding from ventral to dorsolateral striatum. Frontiers in Integrative Neuroscience, 1(11).

Takahashi, Y., Schoenbaum, G., & Niv, Y. (2008). Silencing the critics: understanding the effects of cocaine sensitization on dorsolateral and ventral striatum in the context of an actor/critic model. Frontiers in Neuroscience, 2, 14.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.

Witten, I. H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34(4), 286-295.
