Dissecting Reinforcement Learning-Part.1
Massimiliano Patacchiola Dec 9, 2016
$$ T = \begin{bmatrix} 0.90 & 0.10 \\ 0.50 & 0.50 \end{bmatrix} $$
import numpy as np

#Transition matrix of the two-state Markov chain
T = np.array([[0.90, 0.10],
              [0.50, 0.50]])
print("T: " + str(T))

T: [[ 0.9  0.1]
 [ 0.5  0.5]]
𝐯 = (1, 0)
import numpy as np

#Initial distribution: the chain starts in state s0
v = np.array([[1.0, 0.0]])
print("v: " + str(v))

v: [[ 1.  0.]]
The probability of being in $s_0$ at $k=3$ is given by (0.729 + 0.045 + 0.045 + 0.025), which is equal to 0.844.
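The same number can be checked with a few lines of NumPy; here is a minimal sketch using the matrix T defined above:

import numpy as np

T = np.array([[0.90, 0.10],
              [0.50, 0.50]])
v = np.array([[1.0, 0.0]])  #the chain starts in s0

#Distribution over the states after k=3 steps: v * T^3
print(np.dot(v, np.linalg.matrix_power(T, 3)))  #prints [[ 0.844  0.156]]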
We got the same result. Now let's suppose that at the beginning we have some uncertainty about the starting state of our process, and let's define another starting vector as follows:
𝐯 = (0.5, 0.5)
v: [[ 0.5, 0.5]]
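Starting from the uncertain vector the computation is the same; a quick sketch reusing the matrix T from the snippet above:

v = np.array([[0.5, 0.5]])
#Distribution after k=3 steps with the uncertain starting vector
print(np.dot(v, np.linalg.matrix_power(T, 3)))  #approximately [[ 0.812  0.188]]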
The agent can try different policies, but only one of them can be considered the optimal policy, denoted by $\pi^{*}$, which yields the highest expected utility. It is time to introduce an example that I am going to use throughout the post. This example is inspired by the simple environment presented by Russell and Norvig in chapter 17.1 of their book. Let's suppose we have a cleaning robot that has to reach a charging station. Our simple world is a 4x3 matrix where the starting point $s_0$ is at (1,1), the charging station at (4,3), dangerous stairs at (4,2), and an obstacle at (2,2). The robot has to find the best way to reach the charging station (Reward +1) and to avoid falling down the flight of stairs (Reward -1). Every time the robot takes a decision it is possible to have the interference of a stochastic factor (e.g. the ground is slippery, or an evil cat is stinging the robot), which makes the robot diverge from the original path 20% of the time. If the robot decides to go ahead, in 10% of the cases it will end up in the state on its left and in 10% of the cases in the state on its right. If the robot hits the wall or the obstacle it will bounce back to the previous position. The main characteristics of this world are the following:
I said that the aim of the robot is to find the best way to reach the charging station, but what does "the best way" mean? Depending on the type of reward the robot receives for each intermediate state we can have different optimal policies $\pi^{*}$. Let's suppose we are programming the firmware of the robot. Based on the battery level we give a different reward at each time step. The rewards for the two terminal states remain the same (charger=+1, stairs=-1). The obstacle at (2,2) is not a valid state and therefore there is no reward associated with it. Given these assumptions we can have four different cases:
1. R(s) ≤ −1.6284: extremely low battery
2. −0.4278 ≤ R(s) ≤ −0.085: quite low battery
3. −0.0221 ≤ R(s) ≤ 0: slightly low battery
4. R(s) > 0: fully charged
For each one of these conditions we can try to guess which policy the agent will choose. In the extremely low battery scenario the agent receives such a high punishment that it only wants to stop the pain as soon as possible. Life is so painful that falling down the flight of stairs is a good choice. In the quite low battery scenario the agent takes the shortest path to the charging station, without worrying about falling down the stairs. In the slightly low battery case the robot does not take risks at all and avoids the stairs at the cost of banging against the wall. Finally, in the fully charged case the agent remains in a steady state, receiving a positive reward at each time step. Until now we have seen the kind of policies that can emerge in specific environments with defined rewards, but there is still something I did not talk about: how can the agent choose the best policy?
$$ U(s) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \right] $$
For each possible outcome I reported the utility and the probability
given by the transition model. This corresponds to the first part of
the Bellman equation. The next step is to calculate the product
between the utility and the transition probability, then sum up
the value for each action.
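This product-and-sum can be wrapped in a small helper. The value iteration script later in the post calls a function named return_state_utility() with exactly this purpose; the body below is only a sketch consistent with the Bellman equation (the original implementation is in the repository):

import numpy as np

def return_state_utility(v, T, u, reward, gamma):
    '''Bellman update for the state selected by the one-hot row vector v.

    T is the transition matrix (states x states x actions) and u the utility vector.'''
    action_array = np.zeros(4)
    for action in range(4):
        #Product between utility and transition probability, summed over the next states
        action_array[action] = np.sum(np.multiply(u, np.dot(v, T[:, :, action])))
    #Immediate reward plus the discounted utility of the best action
    return reward + gamma * np.max(action_array)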
We found out that for state (1,1) the action UP has the highest
value. This is in accordance with the optimal policy we magically got.
This part of the Bellman equation returns the action that maximizes
the expected utility of the subsequent state, which is what an
optimal policy should do:
$$ \pi^{*}(s) = \underset{a}{\operatorname{argmax}} \sum_{s'} T(s, a, s') \, U(s') $$
Now we have all the elements and we can plug the values in the
Bellman equation finding the utility of the state (1,1):
import numpy as np

def main():
    #Starting state vector
    #The agent starts from (1, 1)
    v = np.array([[0.0, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.0,
                   1.0, 0.0, 0.0, 0.0]])
    #Transition matrix loaded from file (it is too big to write here)
    T = np.load("T.npy")
    #Utility vector
    u = np.array([[0.812, 0.868, 0.918,   1.0,
                   0.762,   0.0, 0.660,  -1.0,
                   0.705, 0.655, 0.611, 0.388]])
    #Reward of the non-terminal states and discount factor (assumed equal to 1.0 here)
    reward = -0.04
    gamma = 1.0
    #Bellman equation for state (1,1), using return_state_utility() defined above
    print("Utility of state (1,1): " + str(return_state_utility(v, T, u, reward, gamma)))

if __name__ == "__main__":
    main()
The value iteration algorithm keeps updating the utility vector until the change between two consecutive iterations is smaller than a threshold that depends on the discount factor:

$$ \| U_{k+1} - U_{k} \| < \epsilon \, \frac{1-\gamma}{\gamma} $$
import numpy as np

def main():
    #Change as you want
    tot_states = 12
    gamma = 0.999 #Discount factor
    iteration = 0 #Iteration counter
    epsilon = 0.01 #Stopping criteria small value
    graph_list = list() #List of the utility vectors, one per iteration

    #Transition matrix loaded from file (It is too big to write here)
    T = np.load("T.npy")

    #Reward vector
    r = np.array([-0.04, -0.04, -0.04,  +1.0,
                  -0.04,   0.0, -0.04,  -1.0,
                  -0.04, -0.04, -0.04, -0.04])

    #Utility vectors
    u = np.array([0.0, 0.0, 0.0,  0.0,
                  0.0, 0.0, 0.0,  0.0,
                  0.0, 0.0, 0.0,  0.0])
    u1 = np.array([0.0, 0.0, 0.0,  0.0,
                   0.0, 0.0, 0.0,  0.0,
                   0.0, 0.0, 0.0,  0.0])

    while True:
        delta = 0
        u = u1.copy()
        iteration += 1
        graph_list.append(u)
        for s in range(tot_states):
            reward = r[s]
            v = np.zeros((1, tot_states))
            v[0, s] = 1.0
            #Bellman update through the return_state_utility() helper defined above
            u1[s] = return_state_utility(v, T, u, reward, gamma)
            delta = max(delta, np.abs(u1[s] - u[s])) #Stopping criteria
        if delta < epsilon * (1 - gamma) / gamma:
            print("=================== FINAL RESULT ==================")
            print("Iterations: " + str(iteration))
            print("Delta: " + str(delta))
            print("Gamma: " + str(gamma))
            print("Epsilon: " + str(epsilon))
            print("===================================================")
            print(u[0:4])
            print(u[4:8])
            print(u[8:12])
            print("===================================================")
            break

if __name__ == "__main__":
    main()
There is another algorithm that allows us to find the utility vector and, at the same time, an optimal policy: the policy iteration algorithm.
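The script below relies on three helpers that are not reproduced in this post: return_policy_evaluation(), return_expected_action() and print_policy(). The sketches that follow are my own minimal versions, consistent with the equations above; the original implementations are in the repository.

import numpy as np

def return_policy_evaluation(p, u, r, T, gamma):
    '''One evaluation sweep: update the utility of each state under policy p.'''
    for s in range(12):
        if np.isnan(p[s]):
            continue  #obstacle: not a valid state
        if p[s] == -1:
            u[s] = r[s]  #terminal state: its utility is its reward
            continue
        v = np.zeros((1, 12))
        v[0, s] = 1.0
        action = int(p[s])
        u[s] = r[s] + gamma * np.sum(np.multiply(u, np.dot(v, T[:, :, action])))
    return u

def return_expected_action(u, T, v):
    '''Return the action that maximises the expected utility of the next state.'''
    actions_array = np.zeros(4)
    for action in range(4):
        actions_array[action] = np.sum(np.multiply(u, np.dot(v, T[:, :, action])))
    return np.argmax(actions_array)

def print_policy(p, shape):
    '''Print the policy using arrows, * for terminal states and # for the obstacle.'''
    symbols = {-1: " * ", 0: " ^ ", 1: " < ", 2: " v ", 3: " > "}
    policy_string = ""
    for row in range(shape[0]):
        for col in range(shape[1]):
            entry = p[row * shape[1] + col]
            policy_string += " # " if np.isnan(entry) else symbols[int(entry)]
        policy_string += "\n"
    print(policy_string)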
import numpy as np

def main():
    gamma = 0.999
    epsilon = 0.0001
    iteration = 0
    T = np.load("T.npy")

    #Generate the first policy randomly
    # NaN=Nothing, -1=Terminal, 0=Up, 1=Left, 2=Down, 3=Right
    p = np.random.randint(0, 4, size=(12)).astype(np.float32)
    p[5] = np.nan
    p[3] = p[7] = -1

    #Utility vector
    u = np.array([0.0, 0.0, 0.0,  0.0,
                  0.0, 0.0, 0.0,  0.0,
                  0.0, 0.0, 0.0,  0.0])

    #Reward vector
    r = np.array([-0.04, -0.04, -0.04,  +1.0,
                  -0.04,   0.0, -0.04,  -1.0,
                  -0.04, -0.04, -0.04, -0.04])

    while True:
        iteration += 1
        #1- Policy evaluation
        u_0 = u.copy()
        u = return_policy_evaluation(p, u, r, T, gamma)
        #Stopping criteria
        delta = np.absolute(u - u_0).max()
        if delta < epsilon * (1 - gamma) / gamma: break
        #2- Policy improvement
        for s in range(12):
            if not np.isnan(p[s]) and not p[s] == -1:
                v = np.zeros((1, 12))
                v[0, s] = 1.0
                a = return_expected_action(u, T, v)
                if a != p[s]: p[s] = a
        print_policy(p, shape=(3, 4))

if __name__ == "__main__":
    main()
Policy iteration and value iteration, which is best? If you have many actions or you start from a fair policy, then choose policy iteration. If you have few actions and the transitions are acyclic, then choose value iteration. If you want the best of both worlds, take a look at the modified policy iteration algorithm.
If you want to use the last expression, the matrix form of policy evaluation $u = (I - \gamma T)^{-1} r$, when an exact solution does not exist, you need to invert the matrix using the pseudoinverse and the NumPy method np.linalg.pinv(). In the end, I prefer to use np.linalg.solve() or np.linalg.lstsq(), which do the same thing but are much more readable.
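As a toy sketch, here is the closed-form evaluation $u = (I - \gamma T)^{-1} r$ applied to the two-state chain from the beginning of the post, solved in the three ways mentioned above (the reward values are made up purely for illustration):

import numpy as np

T = np.array([[0.90, 0.10],
              [0.50, 0.50]])
r = np.array([-0.04, 1.0])  #illustrative rewards
gamma = 0.9

A = np.eye(2) - gamma * T
u_pinv  = np.dot(np.linalg.pinv(A), r)          #pseudoinverse, works even when A is singular
u_solve = np.linalg.solve(A, r)                 #direct solve, more readable
u_lstsq = np.linalg.lstsq(A, r, rcond=None)[0]  #least squares, also handles the singular case
print(u_pinv, u_solve, u_lstsq)                 #the three vectors are identical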
Conclusions
In this first part I summarised the fundamental ideas behind reinforcement learning. As an example, I used a finite environment with a predefined transition model. What happens if we do not have the transition model? In the next post I will introduce model-free reinforcement learning, which answers this question with a new set of interesting tools. You can find the full code in my GitHub repository.
Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.
Resources
The dissecting-reinforcement-learning repository.
References
Bellman, R. (1957). A Markovian decision process (No. P-1066). RAND Corporation, Santa Monica, CA.
Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., & Edwards, D. D. (2003). Artificial intelligence: a modern approach (Vol. 2). Upper Saddle River: Prentice Hall.
Dissecting Reinforcement Learning-Part.2
Massimiliano Patacchiola Jan 15, 2017
All right, now with the same spirit of the previous part I am going to
dissect one-by-one all the concepts we will step through.
The first thing the robot can do is estimate the transition model by moving in the environment and keeping track of the number of times an action has been correctly executed. Once the transition model is available the robot can use either value iteration or policy iteration to get the utility function. There are different techniques for estimating the transition model that make use of Bayes' rule and maximum likelihood estimation; Russell and Norvig mention them in chapter 21.2.2 (Bayesian reinforcement learning). The problem with this approach is evident: estimating the values of a transition model can be expensive. In our 3x4 world it means estimating the values of a 12x12x4 (states x states x actions) table. Moreover, certain actions and some states can be very unlikely, making the entries in the transition table hard to estimate. Here I will focus on another technique, which can estimate the utility function without the transition model: the Monte Carlo (MC) method.
In this post I will analyse the first two points. The third point is less intuitive. In many applications it is easy to simulate episodes, but it can be extremely difficult to construct the transition model required by the dynamic programming techniques. In all these cases, MC methods rule.
Now let's go back to our cleaning robot and see what it means to apply the MC method to this scenario. As usual the robot starts in state (1, 1) and follows its internal policy. At each step it records the reward obtained and saves a history of all the states visited until it reaches a terminal state. We define an episode as the sequence of states from the starting state to the terminal state. Let's suppose that our robot recorded the following three episodes:
The robot followed its internal policy but an unknown transition
model perturbed the trajectory leading to undesired states. In the
first and second episode, after some fluctuation the robot eventually
reached the terminal state obtaining a positive reward. In the third
episode the robot moved along a wrong path reaching the stairs and
falling down (reward: -1.0). The following is another representation of
the three episodes:
1. First-Visit MC: $U^{\pi}(s)$ is defined as the average of the returns following the first visit to $s$ in a set of episodes.
2. Every-Visit MC: $U^{\pi}(s)$ is defined as the average of the returns following all the visits to $s$ in a set of episodes.
The return for the first episode is 0.27. Following the same procedure we get the same result for the second episode. For the third episode we get a different return: -0.79. After the three episodes we came out with three different returns: 0.27, 0.27, -0.79. How can we use the returns to estimate utilities? I will now introduce the core equation used in the MC method, which gives the utility of a state following the policy $\pi$:
$$ U^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} R(S_t) \right] $$
If you compare this equation with the equation used to calculate the
return you will see only one difference: to obtain the utility function
we take the expectation of the returns. That’s it. To find the utility
of a state we need to calculate the expectation of the returns for that
state. In our example after only three episodes the approximated
utility for the state (1, 1) is: (0.27+0.27-0.79)/3=-0.08. However, an
estimation based only on three episodes is inaccurate. We need
more episodes in order to get the true value. Why do we need more
episodes?
Python implementation
As usual we will implement the algorithm in Python. I wrote a class
called GridWorld contained in the module gridworld.py available in
my GitHub repository. Using this class it is possible to create a grid
world of any size and add obstacles and terminal states. The
cleaning robot will move in the grid world following a specific policy.
Let’s bring to life our 4x3 world:
import numpy as np
from gridworld import GridWorld
- - - *
- # - *
○ - - -
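The lines omitted between the imports and the rendering above create the environment and the policy followed by the robot. A minimal sketch of that setup is given below; the GridWorld constructor arguments and the policy values shown here are assumptions (the full setup, including the state, reward and transition matrices, is in the repository):

env = GridWorld(3, 4)  #assumption: a grid of 3 rows and 4 columns

#Policy matrix: 0=Up, 1=Left, 2=Down, 3=Right, -1=Terminal, NaN=Obstacle
#(illustrative values; the post uses the optimal policy found in the first post)
policy_matrix = np.array([[1,      1,      1,  -1],
                          [0, np.nan,      0,  -1],
                          [0,      3,      3,   3]])

observation = env.reset()  #the robot starts from (1,1), the bottom-left corner
env.render()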
for _ in range(1000):
    action = policy_matrix[observation[0], observation[1]]
    observation, reward, done = env.step(action)
    print("")
    print("ACTION: " + str(action))
    print("REWARD: " + str(reward))
    print("DONE: " + str(done))
    env.render()
    if done: break
Given the transition matrix and the policy the most likely output of
the script will be something like this:
- - - * - - - * ○ - - *
- # - * ○ # - * - # - *
○ - - - - - - - - - - -
- ○ - * - - ○ * - - - ○
- # - * - # - * - # - *
- - - - - - - - - - - -
You can find the full example in the GitHub repository. If you are
familiar with OpenAI Gym you will find many similarities with my
code. I used the same structure and I implemented the same
methods step(), reset() and render(). In particular, the method step() moves the environment forward from t to t+1 and returns the observation (the position of the robot), the reward, and a variable called done, which is True when the episode is finished (the robot has reached a terminal state).
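A condensed sketch of the prediction script's first-visit estimation loop is shown below. The helper get_return(), which the post refers to again later, computes the discounted return of an episode; its body here, together with the number of epochs, the discount factor and the variable names, is an assumption (the original script is in the repository). It builds on the env and policy_matrix defined above.

def get_return(state_list, gamma):
    '''Discounted return of an episode stored as (observation, reward) tuples.'''
    return_value = 0
    for i, visit in enumerate(state_list):
        reward = visit[1]
        return_value += reward * np.power(gamma, i)
    return return_value

utility_matrix = np.zeros((3, 4))               #sum of the returns for each state
running_mean_matrix = np.full((3, 4), 1.0e-10)  #number of visits (avoids division by zero)
gamma = 0.999

for epoch in range(50000):
    #Generate an episode following the policy
    episode_list = list()
    observation = env.reset(exploring_start=False)
    for _ in range(1000):
        action = policy_matrix[observation[0], observation[1]]
        new_observation, reward, done = env.step(action)
        episode_list.append((observation, reward))
        observation = new_observation
        if done: break
    #First-visit MC: average the return following the first visit to each state
    checkup_matrix = np.zeros((3, 4))
    for i, visit in enumerate(episode_list):
        position = visit[0]
        if checkup_matrix[position[0], position[1]] == 0:
            utility_matrix[position[0], position[1]] += get_return(episode_list[i:], gamma)
            running_mean_matrix[position[0], position[1]] += 1
            checkup_matrix[position[0], position[1]] = 1
    if epoch % 1000 == 0:
        print(utility_matrix / running_mean_matrix)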
Executing this script will print the estimation of the utility matrix
every 1000 iterations:
...
As you can see, the utility gets more and more accurate and, in the limit of infinitely many episodes, it converges to the true values. In the first post we already found the utilities of this particular grid world using the dynamic programming techniques. Here we can compare the results obtained with MC and the ones obtained with dynamic programming:

If you observe the two utility matrices you will notice many similarities, but also two important differences. The utility estimations for the states (4,1) and (3,1) are equal to zero. This can be considered one of the limitations and at the same time one of the advantages of MC methods. The policy we are using, the transition probabilities, and the fact that the robot always starts from the same position (bottom-left corner) are responsible for the wrong estimates in those states. Starting from the state (1,1) the robot will never reach those states and it cannot estimate the corresponding utilities. As I said, this is a problem because we cannot estimate those values, but at the same time it is an advantage. In a very big grid world we can estimate the utilities only for the states we are interested in, saving time and resources and focusing only on a particular subspace of the world.
observation = env.reset(exploring_start=True)
Now every time a new episode begins the robot will start from a random position. Running the script again results in the following estimates:
...
As you can see, this time we got the right values for the states (4,1) and (3,1). Until now we assumed that we had a policy and we used that policy to estimate the utility function. What should we do when we do not have a policy? In this case there are other methods we can use. Russell and Norvig call this case active reinforcement learning. Following the definition of Sutton and Barto I will call this case model-free Monte Carlo control estimation.
1. Policy evaluation: $U \rightarrow U^{\pi}$
2. Policy improvement: $\pi \rightarrow greedy(U)$
The first step makes the utility function consistent with the current policy (evaluation). The second step makes the policy $\pi$ greedy with respect to the current utility function (improvement). The two changes work against each other, each creating a moving target for the other, but together they make both the policy and the value function approach the optimum.
After this episode the matrix containing the values for the state-
action utilities can be updated. In our case the new matrix will
contain the following values:
After a second episode we will fill more entries in the table. Going on
in this way will eventually lead to a complete state-action table with
all the entries filled. This step is what is called evaluation in the GPI
framework. The second step of the algorithm is the improvement. In the improvement we take our randomly initialised policy $\pi$ and we update it in the following way:
That's it: we are making the policy greedy by choosing, for each state $s$ appearing in the episode, the action with the maximal Q-value. For example, if we consider the state (1,3) (top-left corner of the grid world) we can update the entry of the policy matrix by taking the action with the highest value in the state-action table. In our case, after the first episode the action with the highest value is RIGHT, which has a Q-value of 0.74.
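A sketch of this improvement step, assuming the state-action values are stored in a matrix with one row per action and one column per state (the function name and the column-flattening convention are assumptions):

def update_policy(episode_list, policy_matrix, state_action_matrix):
    '''Greedy improvement: for each state visited in the episode take the
    action with the highest value in the state-action table.'''
    for visit in episode_list:
        observation = visit[0]
        col = observation[1] + (observation[0] * 4)  #flatten (row, col) into a column index
        if policy_matrix[observation[0], observation[1]] != -1:
            policy_matrix[observation[0], observation[1]] = np.argmax(state_action_matrix[:, col])
    return policy_matrix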
In MC for control it is important to guarantee a uniform exploration of all the state-action pairs. Following the policy $\pi$ it can happen that relevant state-action pairs are never visited. Without returns the method will not improve. The solution is to use exploring starts, specifying that the first step of each episode starts at a state-action pair and that every such pair has a non-zero probability of being selected. It's time to implement the algorithm in Python.
Python implementation
I will use again the function get_return() but this time the input will
be a list containing tuples (observation, action, reward):
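A minimal sketch of this modified version, with the reward read from the third element of each tuple:

def get_return(state_list, gamma):
    '''Discounted return of an episode stored as (observation, action, reward) tuples.'''
    return_value = 0
    for i, visit in enumerate(state_list):
        reward = visit[2]  #the reward is now the third element of the tuple
        return_value += reward * np.power(gamma, i)
    return return_value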
Finally, the main loop of the algorithm. This is not so different from
the loop used in MC prediction:
If we compare the code below with the one used in MC for prediction
we will notice some important differences, for example the following
condition:
if(is_starting):
    action = np.random.randint(0, 4)
    is_starting = False
This condition satisfies the exploring starts requirement. The MC algorithm will converge to the optimal solution only if we ensure exploring starts. In MC for control it is not sufficient to select random starting states. During the iterations the algorithm will improve the policy only if all the actions have a non-zero probability of being chosen. For this reason, when the episode starts we have to select a random action; this must be done only for the starting state.
It is time to run the script. Before doing so, recall that for the simple 4x3 gridworld we already know the optimal policy. In the first post we found the optimal policy with a reward equal to -0.04 (for non-terminal states) and with a transition model having 80-10-10 percent probabilities. The optimal policy is the following:
Optimal policy:
In the optimal policy the robot will move far away from the stairs at
state (4, 2) and will reach the charging station through the longest
path. Now, I will show you the evolution of the policy once we run the
script for MC control estimation:
...
...
...
...
Conclusions
I would like to reflect for a moment on the beauty of the MC
algorithm. In MC for control the method can estimate the best policy
from nothing. The robot is moving in the environment trying different
actions and following the consequences of those actions until the
end. That’s all. The robot does not know the reward function, it does
not know the transition model and it does not have any policy to
follow. Nevertheless the algorithm improves until reaching the
optimal strategy.
Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.
Resources
The complete code for MC prediction and MC control is
available on the dissecting-reinforcement-learning official
repository on GitHub.
References
Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., & Edwards, D. D. (2003). Artificial intelligence: a modern approach (Vol. 2). Upper Saddle River: Prentice Hall.
Dissecting Reinforcement Learning-Part.3

$$ \text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[ \text{Target} - \text{OldEstimate} \right] $$

$$ \text{Target} = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \right] $$

$$ \text{Target} = E_{\pi}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots + \gamma^{k} r_{t+k+1} \right] $$

$$ \text{Target} = E_{\pi}\left[ r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \dots + \gamma^{k-1} r_{t+k+1} \right) \right] = E_{\pi}\left[ r_{t+1} + \gamma U(s_{t+1}) \right] $$
The main loop is much simpler than the one used in MC methods. In
this case we do not have any first-visit constraint and the only thing
to do is to apply the update rule.
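As a sketch, the TD(0) update rule applied to the utility matrix of our grid world can be written as a single function (names are assumptions):

def update_utility(utility_matrix, observation, new_observation, reward, alpha, gamma):
    '''Apply the TD(0) update rule to the state that has just been visited.'''
    u = utility_matrix[observation[0], observation[1]]
    u_t1 = utility_matrix[new_observation[0], new_observation[1]]
    utility_matrix[observation[0], observation[1]] += alpha * (reward + gamma * u_t1 - u)
    return utility_matrix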
...
...
We can now compare the utility matrix obtained with TD(0) and the
one obtained with Dynamic Programming in the first post:
Most of the values are similar. The main difference between the two tables is the estimate of the two terminal states. TD(0) does not work for terminal states because we need the reward and the utility of the next state at t+1, and by definition after a terminal state there is no other state. However, this is not a big issue. What we want to know is the utility of the states near the terminal states. To overcome the problem a simple conditional statement is often used:

if is_terminal(state):
    utility_matrix[state] = reward
Great, we saw how TD(0) works. However, there is something I did not talk about: what does the zero in the name of the algorithm mean? To understand what that zero means I have to introduce eligibility traces.
At the beginning the trace is equal to zero. After the first visit to s1 (second step) the trace goes up to 1 and then it starts decaying. After the second visit (fourth step) +1 is added to the current value (0.25), giving a final trace of 1.25. After that point the state s1 is no longer visited and the trace slowly goes to zero. How does TD(λ) update the utility function? In TD(0) we saw that a uniform shadow was added in the graphical illustration to represent the inaccessibility of previous states. In TD(λ) the previous states are accessible, but they are updated based on their eligibility trace values. States with a small eligibility trace will be updated by a small amount, whereas states with a high eligibility trace will be substantially updated.
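A sketch of this bookkeeping in code (variable names are assumptions; with γλ = 0.5 the numbers match the example above: the trace decays from 1 to 0.25 in two steps and jumps to 1.25 on the second visit). The TD(λ) version of the utility update replaces the TD(0) one sketched earlier:

def update_eligibility(trace_matrix, gamma, lambda_):
    '''Decay every trace at each time step.'''
    return trace_matrix * gamma * lambda_

def update_utility(utility_matrix, trace_matrix, delta, alpha):
    '''TD(lambda): every state is updated in proportion to its eligibility trace.'''
    utility_matrix += alpha * delta * trace_matrix
    return utility_matrix

#Inside the episode loop, at each step:
#trace_matrix = update_eligibility(trace_matrix, gamma, lambda_)  #decay all the traces
#trace_matrix[observation[0], observation[1]] += 1.0              #+1 on the visited state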
$$ \delta_t = r_{t+1} + \gamma U(s_{t+1}) - U(s_t) $$
The results of the update for TD(0) and TD(λ) are the same (zero) for all the visits but the last one. When the robot reaches the charging station (reward +1.0) the update rule returns a positive value. In TD(0) the result is propagated only to the previous state (3,3). In TD(λ) the result is propagated back to all previous states thanks to the eligibility trace. The decay value of the trace gives more weight to the most recently visited states. As I said, the eligibility trace mechanism helps to speed up the convergence. It is easy to understand why, if you consider that in our example TD(0) needs five episodes in order to reach the same result as TD(λ).
...
...
Comparing the final utility matrix with the one obtained without eligibility traces in TD(0) you will notice similar values. One could ask: what's the advantage of using eligibility traces? The eligibility trace version converges faster. This advantage becomes clear when dealing with sparse rewards in a large state space. In this case the eligibility trace mechanism can considerably speed up the convergence, propagating what has been learnt at t+1 back to the states visited earlier.
$$ U(s_t) \leftarrow U(s_t) + \alpha \left[ r_{t+1} + \gamma U(s_{t+1}) - U(s_t) \right] $$

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] $$
In step 1 the agent selects one action from the policy and moves one step forward. In step 2 the agent observes the reward, the new state and the associated action. In step 3 the algorithm updates the state-action function using the update rule. In step 4 we use the same mechanism as MC for control (see the second post): the policy π is updated at each visit by choosing the action with the highest state-action value. We are making the policy greedy. As for MC methods, we use the exploring starts condition.
Can we apply the TD(λ) ideas to SARSA? Yes we can. SARSA(λ) follows the same steps as TD(λ), implementing eligibility traces to speed up the convergence. The intuition behind the algorithm is the same, but instead of applying the prediction method to the states, SARSA(λ) applies it to state-action pairs. We have a trace for each state-action pair, and this trace is updated as follows:
...
Now we can proceed. In the control case we always used the policy
π to learn on the run, meaning that we updated π from experiences
sampled from π. This approach is called on-policy learning. There is
another way to learn about π which is called off-policy learning. In
off-policy learning we do not need a policy in order to update our Q-
function. Of course we can still generate a policy π based on the
action with the maximum utility (taken from our Q-function) but the
Q-function itself is updated thanks to a second policy µ that is not
updated. For instance, consider the first four iterations of an off-
policy algorithm applied to the 4x3 grid world. We can see how after
the random initialisation of π the states are updated step by step,
whereas the policy µ does not change at all.
What are the advantages of off-policy learning? First of all using off-
policy it is possible to learn about an optimal policy while following
an exploratory policy µ. Off-policy means learning by
observation. For example, we can find an optimal policy by observing a robot that is following a sub-optimal policy. It is also possible to learn
about multiple policies while following one policy (e.g. multi-robot
scenario). Moreover, in deep reinforcement learning we will see how
off-policy allows re-using old experiences generated from old
policies to improve the current policy (experience replay). The most
famous off-policy TD algorithm for control is called Q-Learning.
To understand how Q-learning works let's consider its update rule:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] $$
Comparing the update rule of SARSA and the one of Q-learning you
will notice only one difference: the Target term. Here I report both of
them to simplify the comparison:
$$ \text{Target[SARSA]} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) $$

$$ \text{Target[Q-learning]} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) $$
At this point it is obvious that we do not really need the policy π for choosing the action: we can simply use the term on the right and rewrite the Target as the discounted Q-value obtained at $s_{t+1}$ through a greedy selection:
That’s it, we have the Target used in the actual update rule and this
value follows the GPI scheme. Let’s see now all the steps involved in
Q-learning:
1. Move one step selecting $a_t$ from $\mu(s_t)$
2. Observe: $r_{t+1}$, $s_{t+1}$
3. Update the state-action function $Q(s_t, a_t)$
4. (optional) Update the policy $\pi(s_t) \leftarrow \underset{a}{\operatorname{argmax}} \ Q(s_t, a)$
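A sketch of the update in step 3, where the target uses the greedy max over the actions available at $s_{t+1}$ rather than the action actually taken (the function name and the column-flattening convention are assumptions):

def update_state_action(state_action_matrix, observation, new_observation,
                        action, reward, alpha, gamma):
    '''Q-learning update of the state-action matrix (one row per action, one column per state).'''
    col = observation[1] + (observation[0] * 4)
    col_t1 = new_observation[1] + (new_observation[0] * 4)
    q = state_action_matrix[action, col]
    q_t1 = np.max(state_action_matrix[:, col_t1])  #greedy: max over the actions, not the action taken
    state_action_matrix[action, col] += alpha * (reward + gamma * q_t1 - q)
    return state_action_matrix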
$$ e_t(s,a) = I_{ss_t} \cdot I_{aa_t} + \begin{cases} \gamma \lambda \, e_{t-1}(s,a) & \text{if } Q_{t-1}(s_t, a_t) = \max_{a} Q_{t-1}(s_t, a) \\ 0 & \text{otherwise} \end{cases} $$

The term $I_{ss_t}$ is an identity indicator and it is equal to 1 if $s = s_t$. The same for $I_{aa_t}$. The estimation error δ is defined as:

$$ \delta_t = r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) $$
Time for an example. Let’s suppose you noticed that the cleaning
robot bought last week does not follow an optimal policy while going
back to the charging station. The robot is following a sub-optimal
path that is unsafe. You want to find an optimal policy and propose
an upgrade to the manufacturer (and get hired!). There is a problem: you do not have access to the robot firmware. The robot is
following its internal policy µ and this policy is inaccessible. What to
do?
Conclusions
This post has summarised many important concepts in
reinforcement learning. TD methods are widely used because of their
simplicity and versatility. As in the second post, we divided TD methods into two families: prediction and control. The prediction TD
algorithm has been called TD(0). Via eligibility traces it is possible to
extend to previous states what has been learnt in the last one. The
extension of TD(0) with eligibility traces is called TD(λ). The control
algorithms in TD are called SARSA and Q-learning. The former is an
on-policy algorithm that updates the policy while moving in the
environment. The latter is an off-policy algorithm based on two
separate policies, one updated and the other used for moving in the
world. Do TD methods converge faster than MC methods? There is no general mathematical proof, but in practice TD methods usually converge faster.
Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.
Resources
The complete code for TD prediction and TD control is
available on the dissecting-reinforcement-learning official
repository on GitHub.
References
Bellman, R. (1957). A Markovian decision process (No. P-1066). RAND Corporation, Santa Monica, CA.
Dissecting Reinforcement Learning-Part.4

There are no specific biological names for these groups but I will
create two labels for the occasion. The first group can evaluate the
saliency of a stimulus based on the associated reward. At the same
time it can estimate an error measure comparing the result of the
action and the direct consequences, and use this value to calibrate
an executor. For these reasons I will call it the critic. The second
group has direct access to actions but no way to estimate the utility
of a stimulus, because of that I will call it the actor.
The interaction between actor and critic has an important role in
learning. In particular, well established research has shown that basal
ganglia are involved in Pavlovian learning (see third post) and in
procedural (implicit) memory, meaning unconscious memories such
as skills and habits. On the other hand the acquisition of declarative
(explicit) memory, implied in the recollection of factual information,
seems to be connected with another brain area called hippocampus.
The only way actor and critic can communicate is through the
dopamine released from the substantia nigra after the activation of
the ventral striatum. Drug abuse can have an effect on the
dopaminergic system, altering the communication between actor
and critic. Some experiments by Takahashi et al. (2007) showed that cocaine sensitization in rats can result in maladaptive decision-making. In particular, rather than being influenced by long-term goals, the rats are driven by immediate rewards. This issue also shows up in standard computational pipelines and is known as the credit assignment problem. For example, when playing chess it is not easy to isolate the most salient actions that lead to the final victory (or defeat).
To understand how the neuronal actor-critic mechanism was
involved in the credit assignment problem, Takahashi et al. (2008)
observed the performances of rats pre-sensitized with cocaine in a
Go/No-Go task. The procedure of a Go/No-Go task is simple. The
rat is in a small metallic box and it has to learn to poke a button with
the nose when a specific odour (cue) is released. If the rat pokes the
button when a positive odour is present it gets rewarded (with
delicious sugar). If the rat pokes the button when a negative odour is
present it gets punished (e.g. with a bitter substance such as
quinine). Positive and negative odours do not mean that they are
pleasant or unpleasant, we can consider them neutral. Learning
means to associate a specific odour to reward or punishment. Finally,
if the rat does not move (No-Go) then neither reward nor punishment
are given. In total there are four possible conditions.
On the other hand, the cocaine-sensitized rats did not show any kind of cue-selectivity during the training. Moreover, post-mortem analysis showed that those rats did not develop cue-selective neurons in the ventral striatum (critic). These results confirm the hypothesis that the critic learns the value of the cue and instructs the actor about the action to execute.
Now it’s time to turn our attention to math and code. How can we
build a computational model from the biological one?
After the action we observe the new state and the reward (step 2). In step 3 we plug the reward and the utilities of $s_t$ and $s_{t+1}$ into the standard update rule used in TD(0) (see the third post). As in TD(λ), we also keep an eligibility trace for each state:
$$ e_t(s) = \begin{cases} \gamma \lambda \, e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda \, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases} $$
Nothing different from the TD(λ) method I introduced in the third post: once the trace has been estimated, each state is updated in proportion to its trace.
For the actor we have to store a trace for each state-action pair, similarly to SARSA and Q-learning, and these traces are updated in the same way. Note that when the critic is not working and the utility of every state remains at zero, the error signal reduces to the immediate reward:

$$ \delta_t = r_{t+1} + \gamma U(s_{t+1}) - U(s_t) = r_{t+1} $$
This result explains why the credit assignment problem emerges
during the training of cocaine sensitized rats. The rats prefer the
immediate reward and do not take into account the long-term
drawbacks. Learning based only on the immediate reward is not sufficient to master a complex Go/No-Go task, but in simpler tasks learning can be faster, with cocaine-sensitized rats performing better than the control group. However, for a neuroscientist this sort of explanation is too tidy. Recent work has highlighted the existence
of multiple learning systems operating in parallel in the mammalian
brain. Some of these systems (e.g. amygdala and/or nucleus
accumbens) can replace a malfunctioning critic and compensate for the damage caused by cocaine sensitization. In conclusion, additional
experiments are needed in order to shed light on the neuronal AC
architecture. Now it is time for coding. In the next section I will show
you how to implement an AC algorithm in Python and how to apply it
to the cleaning robot example.
def softmax(x):
    '''Compute softmax values of array x.'''
    e_x = np.exp(x - np.max(x))  #subtract the max for numerical stability
    return e_x / np.sum(e_x)
The robot is in a new state, and the error has been evaluated by the critic. Now the error has to be used to update the state-action table of the actor. In this step, the action UP for state (1,1) is weakened by adding the negative term δ. In case of a positive δ the action would be strengthened.
We can repeat the same steps until the end of the episode. All the actions will be weakened except the last one, which will be strengthened by a factor of +1.0. Repeating the process for many episodes we get the optimal utility matrix and the optimal policy.
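The two functions mentioned below are the critic update and the actor update. Minimal sketches consistent with the steps described above are the following (names and bodies are assumptions; the beta_matrix parameter is accepted but ignored unless provided):

def update_critic(utility_matrix, observation, new_observation, reward, alpha, gamma):
    '''Critic: TD(0) update of the utility matrix, returning the error delta.'''
    u = utility_matrix[observation[0], observation[1]]
    u_t1 = utility_matrix[new_observation[0], new_observation[1]]
    delta = reward + gamma * u_t1 - u
    utility_matrix[observation[0], observation[1]] += alpha * delta
    return utility_matrix, delta

def update_actor(state_action_matrix, observation, action, delta, beta_matrix=None):
    '''Actor: strengthen or weaken the preference for the action taken using delta.'''
    col = observation[1] + (observation[0] * 4)  #flatten (row, col) into a column index
    beta = 1.0 if beta_matrix is None else 1.0 / beta_matrix[action, col]
    state_action_matrix[action, col] += beta * delta
    return state_action_matrix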
The two functions are used in the main loop. The exploring start
assumption is once again used here to guarantee uniform
exploration. The beta_matrix parameter has not been used in this
example, but it can be easily enabled.
...
...
Comparing the result obtained with AC and the one obtained with
dynamic programming in the first post we can notice a few
differences.
Similarly to the estimation of TD(0) in the third post the value of the
two terminal states is zero. This is the consequence of the fact that
we cannot estimate the update value for a terminal state, because
after a terminal state there is no other state. As discussed in the third
post this is not a big issue since it does not affect the convergence,
and can be addressed with a simple conditional statement. From a practical point of view the results obtained with the AC algorithm can be unstable because there are more hyper-parameters to tune; however, the flexibility of the paradigm often balances this drawback.
The distinction between actor and critic is also very useful from a
taxonomic point of view. In the article “Reinforcement Learning in a
Nutshell” AC methods are considered as a meta-category that can
be used to assign all the techniques I introduced until now to three
macro-groups: AC methods, Critic-only, Actor-only. Here I will follow
a similar approach to give a wider view on what is available out there.
In this post I introduced a possible architecture for an AC algorithm.
In AC methods the actor and the critic are represented explicitly and
trained separately, but we could ask: is it possible to use only the
actor or only the critic? In previous posts we considered utility
functions and policies. In dynamic programming these two entities collapsed into the value iteration and the policy iteration algorithms (see the first post). Both of those algorithms are based on utility estimation, which allows the policy to converge thanks to the Generalised Policy Iteration (GPI) mechanism (see the second post). Note that even in TD learning we rely on utility estimation (see the third post), especially when the emphasis is on the policy (SARSA and Q-learning). All
these methods can be broadly grouped in a category called Critic-
only. Critic-only methods always build a policy on top of a utility
function and as I said the utility function is the critic in the AC
framework.
Conclusions
Starting from the neurobiology of the mammalian brain I introduced
AC methods, a class of reinforcement learning algorithms widely
used by the research community. The neuronal AC model can
describe phenomena like Pavlovian learning and drug addiction,
whereas its computational counterpart can be easily applied to
robotics and machine learning. The Python implementation is
straightforward and is based on the TD(0) algorithm introduced in
the third post. AC methods are also useful for taxonomic reasons: we can categorise TD algorithms as Critic-only methods, and techniques such as REINFORCE and genetic algorithms as Actor-only methods. In the next post I will focus on genetic algorithms, methods that allow us to search directly in the policy space without the need for a utility function.
Index
1. [First Post] Markov Decision Process, Bellman Equation, Value
iteration and Policy Iteration algorithms.
2. [Second Post] Monte Carlo Intuition, Monte Carlo methods,
Prediction and Control, Generalised Policy Iteration, Q-function.
3. [Third Post] Temporal Differencing intuition, Animal Learning,
TD(0), TD(λ) and Eligibility Traces, SARSA, Q-learning.
4. [Fourth Post] Neurobiology behind Actor-Critic methods,
computational Actor-Critic methods, Actor-only and Critic-only
methods.
5. [Fifth Post] Evolutionary Algorithms introduction, Genetic
Algorithms in Reinforcement Learning, Genetic Algorithms for
policy selection.
6. [Sixth Post] Reinforcement learning applications, Multi-Armed
Bandit, Mountain Car, Inverted Pendulum, Drone landing, Hard
problems.
7. [Seventh Post] Function approximation, Intuition, Linear
approximator, Applications, High-order approximators.
8. [Eighth Post] Non-linear function approximation, Perceptron,
Multi Layer Perceptron, Applications, Policy Gradient.
Resources
The complete code for the Actor-Critic examples is available on
the dissecting-reinforcement-learning official repository on
GitHub.
References
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike
adaptive elements that can solve difficult learning control problems.
IEEE transactions on systems, man, and cybernetics, (5), 834-846.
Joel, D., Niv, Y., & Ruppin, E. (2002). Actor–critic models of the basal
ganglia: New anatomical and computational perspectives. Neural
networks, 15(4), 535-547.
Takahashi, Y., Schoenbaum, G., & Niv, Y. (2008). Silencing the critics:
understanding the effects of cocaine sensitization on dorsolateral
and ventral striatum in the context of an actor/critic model. Frontiers
in neuroscience, 2, 14.