
Chapter 6: Temporal Difference Learning

Objectives of this chapter:

Introduce Temporal Difference (TD) learning


Focus first on policy evaluation, or prediction, methods
Compare efficiency of TD learning with MC learning
Then extend to control methods



TD methods bootstrap and sample

‣ Bootstrapping: the update involves an estimate of the value function
  • TD and DP methods bootstrap
  • MC methods do not bootstrap
‣ Sampling: the update does not involve an expected value
  • TD and MC methods sample
  • Classical DP does not sample

6.1 TD Prediction

Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function vπ.

Both TD and Monte Carlo methods use experience to solve the prediction problem. Given some experience following a policy π, both methods update their estimate V of vπ for the nonterminal states St occurring in that experience. Monte Carlo methods wait until the return following the visit is known, then use that return as a target for V(St).

Recall: a simple every-visit Monte Carlo method suitable for nonstationary environments is

    V(St) ← V(St) + α[Gt − V(St)],

where Gt is the actual return following time t, and α is a constant step-size parameter. Let us call this method constant-α MC. Its target is the actual return after time t.

Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to V(St) (only then is Gt known), TD methods need wait only until the next time step. At time t+1 they immediately form a target and make a useful update using the observed reward Rt+1 and the estimate V(St+1). The simplest TD method, TD(0), is

    V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)].

In effect, the target for the Monte Carlo update is Gt, whereas the target for the TD update is Rt+1 + γV(St+1): an estimate of the return.
TD target for prediction

‣ The TD target: Rt+1 + γV(St+1)
  • it is an estimate like the MC target because it samples the expected value
  • it is an estimate like the DP target because it uses the current estimate V instead of the true vπ

TD(0) is also called one-step TD, because it is a special case of the TD(λ) and n-step TD methods developed in later chapters. The box below specifies TD(0) completely in procedural form.

Tabular TD(0) for estimating vπ

Input: the policy π to be evaluated
Initialize V(s) arbitrarily (e.g., V(s) = 0, for all s ∈ S+)
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe R, S'
        V(S) ← V(S) + α[R + γV(S') − V(S)]
        S ← S'
    until S is terminal
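The box above translates almost directly into code. Below is a minimal Python sketch of tabular TD(0) prediction; the environment interface (reset()/step() returning a reward, next state, and done flag) and the policy function are illustrative assumptions, not something specified by the slides.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
        """Tabular TD(0): estimate v_pi for a fixed policy.

        Assumed (hypothetical) interface: env.reset() -> state,
        env.step(action) -> (reward, next_state, done), policy(state) -> action.
        """
        V = defaultdict(float)                  # V(s) starts at 0 for every state
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                r, s_next, done = env.step(a)
                target = r if done else r + gamma * V[s_next]   # V(terminal) = 0
                V[s] += alpha * (target - V[s])                  # TD(0) update
                s = s_next
        return V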

Because TD(0) bases its update in part on an existing estimate, we say that it is a bootstrapping method, like DP.
cf. Dynamic Programming

    V(St) ← Eπ[Rt+1 + γV(St+1)] = Σa π(a|St) Σs',r p(s', r|St, a)[r + γV(s')]

[Backup diagram: the DP update is an expected update over all actions a, rewards r, and next states s'.]


Simple Monte Carlo

    V(St) ← V(St) + α[Gt − V(St)]

[Backup diagram: the MC target Gt is the complete sampled return, following the episode all the way to termination.]



Simplest TD method

    V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)]

[Backup diagram: the TD(0) target looks one step ahead, using the sampled reward Rt+1 and the current estimate of the value of St+1.]


Example: Driving Home

‣ Consider driving home:
  • each day you drive home
  • your goal is to try to predict how long it will take at particular stages
  • when you leave the office you note the time, the day of the week, and other relevant info
‣ Consider the policy evaluation or prediction task
Driving Home

Traffic is often slower in the rain, so you reestimate that it will take 35 minutes from then, or a total of 40 minutes. Fifteen minutes later you have completed the highway portion of your journey in good time. As you exit onto a secondary road you cut your estimate of total travel time to 35 minutes. Unfortunately, at this point you get stuck behind a slow truck, and the road is too narrow to pass. You end up having to follow the truck until you turn onto the side street where you live at 6:40. Three minutes later you are home. The sequence of states, times, and predictions is thus as follows:

                                  Elapsed Time   Predicted     Predicted
    State                          (minutes)     Time to Go    Total Time
    leaving office, friday at 6        0             30            30
    reach car, raining                 5             35            40
    exiting highway                   20             15            35
    2ndary road, behind truck         30             10            40
    entering home street              40              3            43
    arrive home                       43              0            43

The rewards in this example are the elapsed times on each leg of the journey. We are not discounting (γ = 1), and thus the return for each state is the actual time to go from that state. The value of each state is the expected time to go. The second column of numbers gives the current estimated value for each state encountered. A simple way to view the operation of Monte Carlo methods is to plot the predicted total time (the last column) over the sequence, as in Figure 6.2 (left).
Driving home as an RL problem

‣ Rewards = elapsed time on each leg of the journey (if we were minimizing travel time, what would the reward be?)
‣ γ = 1
‣ Gt = time to go from state St
‣ V(St) = expected time to get home from St
Updating our predictions

‣ Goal: update the prediction of total time from the office, while driving home
‣ With MC we would need to wait for termination—until we get home—then calculate Gt for each step of the episode, then apply our updates
Driving home: observed rewards along the way

                                  Elapsed Time          Predicted     Predicted
    State                          (minutes)      R     Time to Go    Total Time
    leaving office, friday at 6        0          5         30            30
    reach car, raining                 5         15         35            40
    exiting highway                   20         10         15            35
    2ndary road, behind truck         30         10         10            40
    entering home street              40          3          3            43
    arrive home                       43          –          0            43

‣ Task: update the value function as we go, based on the observed elapsed time (the R column)
Driving home: TD updates along the way (α = 1)

‣ update V(office) with α = 1?
  • V(s) ← V(s) + α[Rt+1 + γV(s') − V(s)]
  • V(office) ← V(office) + α[Rt+1 + γV(car) − V(office)]
  • new V(office) = 5 + 35 = 40; Δ = +10
‣ update V(car)?
  • new V(car) = 15 + 15 = 30; Δ = −5
‣ update V(exit)?
  • new V(exit) = 10 + 10 = 20; Δ = +5

(Footnote from the text: if this were a control problem with the objective of minimizing travel time, we would of course make the rewards the negative of the elapsed time. But since we are concerned here only with prediction (policy evaluation), we can keep things simple by using positive numbers.)
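To make the comparison concrete, here is a small Python script (a sketch; the lists below simply encode the table above) that reproduces these TD changes and the corresponding Monte Carlo changes for every state:

    # Rewards are the elapsed minutes on each leg; V is the predicted time to go
    # from each state (V(home) = 0).
    states  = ["office", "car", "exit highway", "2ndary road", "home street"]
    rewards = [5, 15, 10, 10, 3]
    V       = [30, 35, 15, 10, 3, 0]

    for t, s in enumerate(states):
        mc_change = sum(rewards[t:]) - V[t]          # MC target: the actual time to go
        td_change = rewards[t] + V[t + 1] - V[t]     # TD target: R + V(next state); gamma = 1, alpha = 1
        print(f"{s:13s}  MC change: {mc_change:+d}   TD change: {td_change:+d}")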
Changes recommended by Monte Carlo and TD methods (α = 1)

[Figure 6.2: Changes recommended in the driving home example by Monte Carlo methods (left) and TD methods (right). Each panel plots the predicted total travel time over the sequence of situations, from leaving the office to arriving home, together with the actual outcome (43 minutes). The arrows show the changes in predictions recommended by each method with α = 1; for constant-α MC these are exactly the errors between the estimated value (predicted time to go) in each state and the actual return (actual time to go).]
Advantages of TD learning
‣ TD methods do not require a model of the environment,
only experience
‣ TD methods can be fully incremental
‣ Make updates before knowing the final outcome
‣ Requires less memory
‣ Requires less peak computation
‣ You can learn without the final outcome, from incomplete
sequences
‣ Both MC and TD converge (under certain assumptions to
be detailed later), but which is faster?
Random walk

[Diagram: a 5-state Markov reward process, states A B C D E in a row with start at C; every transition has reward 0 except the transition from E to termination on the right, which has reward +1.]

‣ C is the start state; episodic, undiscounted (γ = 1)
‣ π chooses left or right with equal probability in all states
‣ termination at either end
‣ reward +1 on termination at the right, 0 otherwise
‣ what does vπ(s) tell us?
  • the probability of terminating on the right side from each state, under the random policy
  • what is vπ for [A B C D E]?
  • vπ = [1/6 2/6 3/6 4/6 5/6]
‣ Initialize V(s) = 0.5 for all s ∈ S

Figure 6.3: Results with the 5-state random walk. Above: the small Markov reward process generating the episodes. Left: results from a single run after various numbers of episodes. The estimate after 100 episodes is about as close as they ever get to the true values; with a constant step-size parameter (α = 0.1 in this example), the values fluctuate indefinitely in response to the outcomes of the most recent episodes. Right: learning curves for TD(0) and constant-α MC methods, for various values of α. The performance measure shown is the root mean-squared (RMS) error between the value function learned and the true value function, averaged over the five states; these data are averages over 100 different sequences of episodes. In all cases the approximate value function was initialized to the intermediate value V(s) = 0.5 for all s. The TD method was consistently better than the MC method on this task.
Values learned by TD(0) from one run, after various numbers of episodes

[Figure 6.3, left panel: the estimated values of states A–E after 0, 1, 10, and 100 episodes of a single TD(0) run, plotted against the true values. The estimate after 100 episodes is about as close as the estimates ever get to the true values; with a constant step size (α = 0.1) the values fluctuate indefinitely in response to the most recent episodes.]
TD and MC on the Random Walk

[Figure 6.3, right panel: empirical RMS error, averaged over states, as a function of walks/episodes, for constant-α MC (α = .01, .02, .03, .04) and TD(0) (α = .05, .1, .15). Data averaged over 100 sequences of episodes; the TD curves lie below the MC curves.]
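A compact sketch that reproduces this experiment in pure Python (no external libraries; the function and variable names below are my own, not from the slides):

    import random

    TRUE_V = [i / 6 for i in range(1, 6)]            # true values of A..E are 1/6 .. 5/6

    def rms_error(V):
        return (sum((v - t) ** 2 for v, t in zip(V, TRUE_V)) / 5) ** 0.5

    def run_td0(num_episodes, alpha, gamma=1.0):
        """TD(0) on the 5-state random walk; returns the RMS error after each episode."""
        V, errors = [0.5] * 5, []
        for _ in range(num_episodes):
            s = 2                                    # every episode starts in C
            while True:
                s_next = s + random.choice((-1, 1))
                if s_next < 0:                       # terminate on the left, reward 0
                    V[s] += alpha * (0.0 - V[s])
                    break
                if s_next > 4:                       # terminate on the right, reward +1
                    V[s] += alpha * (1.0 - V[s])
                    break
                V[s] += alpha * (gamma * V[s_next] - V[s])   # interior reward is 0
                s = s_next
            errors.append(rms_error(V))
        return errors

    def run_mc(num_episodes, alpha):
        """Constant-alpha every-visit MC; with gamma = 1 the return is the terminal reward."""
        V, errors = [0.5] * 5, []
        for _ in range(num_episodes):
            s, visited = 2, []
            while 0 <= s <= 4:
                visited.append(s)
                s += random.choice((-1, 1))
            G = 1.0 if s > 4 else 0.0
            for st in visited:
                V[st] += alpha * (G - V[st])
            errors.append(rms_error(V))
        return errors

    # Averaging run_td0(100, 0.05) and run_mc(100, 0.01) over ~100 runs reproduces Figure 6.3 (right).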
Batch Updating in TD and MC methods

Batch Updating: train completely on a finite amount of data,


e.g., train repeatedly on 10 episodes until convergence.

Compute updates according to TD or MC, but only update


estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating,


TD converges for sufficiently small α.

Constant-α MC also converges under these conditions, but to
a different answer!
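A sketch of batch updating for both methods, in Python. The episode encoding as (state, reward, next_state) transitions is my own choice, not from the slides:

    def batch_values(episodes, method="td", alpha=0.001, gamma=1.0, tol=1e-8):
        """Batch TD(0) or batch constant-alpha MC over a fixed set of episodes.

        Each episode is a list of (state, reward, next_state) transitions, with
        next_state = None on the terminating transition.  Increments are accumulated
        over the whole batch and applied only after each complete pass.
        """
        V = {s: 0.0 for ep in episodes for s, _, _ in ep}
        while True:
            delta = {s: 0.0 for s in V}
            for ep in episodes:
                if method == "td":
                    for s, r, s_next in ep:
                        target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                        delta[s] += alpha * (target - V[s])
                else:                                # constant-alpha MC
                    G = 0.0
                    for s, r, _ in reversed(ep):     # accumulate the return backwards
                        G = r + gamma * G
                        delta[s] += alpha * (G - V[s])
            for s in V:
                V[s] += delta[s]
            if max(abs(d) for d in delta.values()) < tol:
                return V

For example, on the eight episodes of the "You are the Predictor" example below, this sketch converges to V(A) = 0.75, V(B) = 0.75 with method="td" and to V(A) = 0.0, V(B) = 0.75 with method="mc".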



Random Walk under Batch Updating

‣ After each new episode, all episodes seen so far are treated as a batch
‣ This growing batch is repeatedly processed by TD and MC until convergence
‣ (Repeated 100 times)
Random Walk under Batch Updating: results

[Figure: RMS error averaged over states, under batch training, as a function of walks/episodes (0 to 100), averaged over the 100 runs. The batch TD curve lies below the batch MC curve throughout.]
You are the Predictor

Suppose you observe the following 8 episodes:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

Assume Markov states, no discounting (γ = 1).

V(B)?  0.75
V(A)?  0?
Example 6.4: You are the Predictor

Place yourself now in the role of the predictor of returns for an unknown Markov reward process. Consider the eight episodes above. The first episode started in state A, transitioned to B with a reward of 0, and then terminated from B with a reward of 0. The other seven episodes were even shorter, starting from B and terminating immediately. Given this batch of data, what are the optimal predictions, the best values for the estimates V(A) and V(B)?

‣ V(B): 6 out of 8 times we saw a return of 1, so V(B) = 3/4. Everyone would probably agree with this estimate, because six of the eight times in state B the process terminated immediately with a return of 1, and the other two times it terminated immediately with a return of 0.
‣ The batch MC prediction for V(A): 100% of the returns observed from A equal zero, so V(A) = 0.
‣ But what is the optimal value for the estimate V(A) given this data? There are two reasonable answers.
You are the Predictor

‣ Modeling the Markov process based on the observed training data: from A the process always went to B with reward 0, and we have already decided that V(B) = 3/4, so under this model V(A) must be 3/4 as well.

V(A)? 0.75


Optimality of TD(0)

‣ The prediction that best matches the training data is V(A) = 0:
  • it minimizes the mean-squared error between V(s) and the sample returns in the training set (zero error in our example)
  • under batch training, this is what constant-α MC converges to
‣ TD(0) achieves a different kind of optimality, where V(A) = 0.75:
  • this is the correct answer for the maximum-likelihood estimate of the Markov model generating the data
  • i.e., if we fit the best-fit Markov model, assume it is exactly correct, and then compute the predictions
  • this is called the certainty-equivalence estimate
  • this is what batch TD(0) converges to
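A small script computing both answers for this example (the encoding of the eight episodes below is my own):

    # The eight observed episodes, as (state, reward, next_state) transitions (None = termination).
    episodes = [[("A", 0, "B"), ("B", 0, None)]] + [[("B", 1, None)]] * 6 + [[("B", 0, None)]]

    # Batch MC answer: average the observed returns from each visit (gamma = 1).
    returns = {"A": [], "B": []}
    for ep in episodes:
        for i, (s, _, _) in enumerate(ep):
            returns[s].append(sum(r for _, r, _ in ep[i:]))
    print("batch MC:", {s: sum(g) / len(g) for s, g in returns.items()})   # {'A': 0.0, 'B': 0.75}

    # Certainty-equivalence answer (what batch TD(0) converges to): build the
    # maximum-likelihood model and solve it.  From A, 100% of transitions went to B
    # with reward 0, so V(A) = 0 + V(B); from B, the expected terminal reward is 6/8.
    V_B = 6 / 8
    V_A = 0 + V_B
    print("certainty-equivalence:", {"A": V_A, "B": V_B})                  # {'A': 0.75, 'B': 0.75}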
Advantages of TD

‣ If the process is Markov, then we expect the TD


estimate to produce lower error on future data
‣ This helps explain why TD methods converge more
quickly than MC in the batch setting
‣ TD(0) makes progress towards the certainty-
equivalence estimate without explicitly building the
model!
Summary so far
Introduced one-step tabular model-free TD methods
These methods bootstrap and sample, combining aspects of DP
and MC methods
TD methods are computationally congenial
If the world is truly Markov, then TD methods will learn faster
than MC methods
MC methods have lower error on past data, but higher error on
future data



Unified View

[Diagram: a two-dimensional space of methods. The horizontal axis is the width of the update (sample updates versus expected updates); the vertical axis is the depth (length) of the update. Temporal-difference learning and Dynamic Programming sit at the shallow, one-step (bootstrapping) end; Monte Carlo and Exhaustive Search sit at the deep, full-return end; multi-step bootstrapping methods lie in between.]
Learning an Action-Value Function

Estimate qπ for the current policy π.

[Diagram: a trajectory of state–action pairs, (St, At) → Rt+1 → (St+1, At+1) → Rt+2 → (St+2, At+2) → ...]

After every transition from a nonterminal state St, do this:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) − Q(St, At)]

If St+1 is terminal, then define Q(St+1, At+1) = 0.



Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]
        S ← S'; A ← A'
    until S is terminal

Figure 6.9: Sarsa: An on-policy TD control algorithm.
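A Python sketch of the boxed algorithm. The environment interface (reset/step/actions) and the eps_greedy helper are illustrative assumptions:

    import random
    from collections import defaultdict

    def sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        """On-policy TD control (Sarsa).  Hypothetical interface: env.reset() -> state,
        env.step(a) -> (reward, next_state, done), env.actions(state) -> list of actions."""
        Q = defaultdict(float)

        def eps_greedy(s):
            actions = env.actions(s)
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        for _ in range(num_episodes):
            s = env.reset()
            a = eps_greedy(s)
            done = False
            while not done:
                r, s_next, done = env.step(a)
                if done:
                    target = r                              # Q(terminal, .) = 0
                else:
                    a_next = eps_greedy(s_next)
                    target = r + gamma * Q[(s_next, a_next)]
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                if not done:
                    s, a = s_next, a_next
        return Q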


Windy Gridworld

[Gridworld diagram: start and goal states, with an upward crosswind of varying strength in the middle columns; actions are the four standard moves.]

undiscounted, episodic, reward = −1 until the goal is reached



Results of Sarsa on the Windy Gridworld

[Plot: cumulative episodes completed versus time steps; the increasing slope shows that ε-greedy Sarsa's policy improves steadily during learning.]



6.5 Q-learning: Off-Policy TD Control

One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989). Its simplest form, one-step Q-learning, is defined by

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ maxa Q(St+1, a) − Q(St, At)].     (6.6)

In this case, the learned action-value function, Q, directly approximates q*, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect in that it determines which state–action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated. As we observed in Chapter 5, this is a minimal requirement in the sense that any method guaranteed to find optimal behavior in the general case must require it. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, Q has been shown to converge with probability 1 to q*. What is the backup diagram for Q-learning? The rule (6.6) updates a state–action pair, so the top node, the root of the backup, must be a small, filled action node. The Q-learning algorithm is shown below in procedural form.

Q-learning (off-policy TD control):

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α[R + γ maxa Q(S', a) − Q(S, A)]
        S ← S'
    until S is terminal
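A Python sketch of one-step Q-learning, using the same hypothetical environment interface as the Sarsa sketch above:

    import random
    from collections import defaultdict

    def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        """Off-policy TD control (one-step Q-learning)."""
        Q = defaultdict(float)
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                actions = env.actions(s)
                if random.random() < epsilon:                # behavior policy: epsilon-greedy
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(s, act)])
                r, s_next, done = env.step(a)
                if done:
                    target = r
                else:
                    target = r + gamma * max(Q[(s_next, act)] for act in env.actions(s_next))
                Q[(s, a)] += alpha * (target - Q[(s, a)])    # target uses the max, not the action taken
                s = s_next
        return Q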
Cliffwalking

[Figure: the cliff-walking gridworld and reward-per-episode learning curves for Sarsa and Q-learning with ε-greedy action selection, ε = 0.1. Sarsa learns the longer but safer path and obtains more reward per episode during learning; Q-learning learns values for the optimal path along the cliff edge but occasionally falls into the cliff because of exploration.]



6.6 Expected Sarsa

Instead of the sampled value of the next state–action pair, use the expectation!

Consider the learning algorithm that is just like Q-learning except that instead of the maximum over next state–action pairs it uses the expected value, taking into account how likely each action is under the current policy. That is, consider the algorithm with the update rule

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ E[Q(St+1, At+1) | St+1] − Q(St, At)]
              = Q(St, At) + α[Rt+1 + γ Σa π(a|St+1) Q(St+1, a) − Q(St, At)],     (6.7)

but that otherwise follows the schema of Q-learning. Given the next state St+1, this algorithm moves deterministically in the same direction as Sarsa moves in expectation, and accordingly it is called Expected Sarsa. Its backup diagram is shown in Figure 6.12 (the backup diagrams for Q-learning and Expected Sarsa).

Expected Sarsa is more complex computationally than Sarsa but, in return, it eliminates the variance due to the random selection of At+1. Given the same amount of experience we might expect it to perform slightly better than Sarsa, and indeed it generally does. Figure 6.13 shows summary results on the cliff-walking task with Expected Sarsa compared to Sarsa and Q-learning. As an on-policy method, Expected Sarsa retains the significant advantage of Sarsa over Q-learning on this problem. In addition, Expected Sarsa shows a significant improvement over Sarsa over a wide range of values for the step-size parameter.

Expected Sarsa performs better than Sarsa (but costs more).
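A sketch of the only piece that changes relative to the earlier Sarsa and Q-learning sketches: the target. The policy_probs interface below is an illustrative assumption.

    def epsilon_greedy_probs(Q, actions, s, epsilon=0.1):
        """Action probabilities of an epsilon-greedy policy with respect to Q."""
        greedy = max(actions, key=lambda a: Q[(s, a)])
        probs = {a: epsilon / len(actions) for a in actions}
        probs[greedy] += 1.0 - epsilon
        return probs

    def expected_sarsa_target(Q, policy_probs, r, s_next, gamma=1.0):
        """Expected Sarsa target: r + gamma * sum_a pi(a|s') Q(s', a)."""
        expected_q = sum(p * Q[(s_next, a)] for a, p in policy_probs(s_next).items())
        return r + gamma * expected_q

    # e.g., policy_probs = lambda s: epsilon_greedy_probs(Q, env.actions(s), s)

Using this target in place of the sampled Q(S', A') in the Sarsa sketch (or the max in the Q-learning sketch) gives Expected Sarsa.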
Learning: An Introduction 36 of
Performance on the Cliff-walking Task
(van Seijen, van Hasselt, Whiteson, & Wiering, 2009)

A notable result is that for large values of α the Q values of Sarsa diverge. Although the policy is still improved over the initial random policy during the early stages of learning, divergence causes the policy to get worse in the long run.

[Figure 6.13: Interim (n = 100 episodes) and asymptotic (n = 100,000 episodes) performance of Sarsa, Q-learning, and Expected Sarsa on the cliff-walking task, plotted as average reward per episode against the step size α (0.1 to 1.0).]
Off-policy Expected Sarsa

Expected Sarsa generalizes to arbitrary behavior policies 𝜇, in which case it includes Q-learning as the special case in which the target policy π is the greedy policy. Nothing changes in the update rule (6.7).

[Figure 6.12: The backup diagrams for Q-learning and Expected Sarsa. Q-learning backs up the maximum over the next actions; Expected Sarsa backs up the average over the next actions, weighted by their probability under π.]

This idea seems to be new.
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 38
Maximization Bias Example

Tabular Q-learning: Q(St, At) ← Q(St, At) + α[Rt+1 + γ maxa Q(St+1, a) − Q(St, At)]

[Figure: a small episodic MDP. From the start state A the agent chooses between "right", which terminates with reward 0, and "wrong" (left), which leads to state B with reward 0; from B, many actions are available, each terminating with a reward drawn from N(−0.1, 1). The plot shows the percentage of episodes in which the wrong action is taken from A over the first 300 episodes: Q-learning with ε-greedy action selection initially takes the wrong action far more often than the 5% optimal rate because of maximization bias, while Double Q-learning is essentially unaffected.]
Double Q-Learning
(Hado van Hasselt, 2010)

• Train 2 action-value functions, Q1 and Q2
• Do Q-learning on both, but
  • never on the same time steps (Q1 and Q2 are independent)
  • pick Q1 or Q2 at random to be updated on each step
• If updating Q1, use Q2 for the value of the next state:

    Q1(St, At) ← Q1(St, At) + α[Rt+1 + γQ2(St+1, argmaxa Q1(St+1, a)) − Q1(St, At)]     (6.8)

• Action selections are (say) ε-greedy with respect to the sum of Q1 and Q2

The idea of doubled learning extends naturally to algorithms for full MDPs. The doubled learning algorithm analogous to Q-learning, called Double Q-learning, divides the time steps in two, perhaps by flipping a coin on each step. If the coin comes up heads, the update is (6.8) above; if it comes up tails, the same update is done with Q1 and Q2 switched, so that Q2 is updated. The two approximate value functions are treated completely symmetrically. The behavior policy can use both action-value estimates.

Double Q-Learning (procedural form)

Initialize Q1(s, a) and Q2(s, a), for all s ∈ S, a ∈ A(s), arbitrarily
Initialize Q1(terminal-state, ·) = Q2(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using the policy derived from Q1 and Q2 (e.g., ε-greedy in Q1 + Q2)
        Take action A, observe R, S'
        With 0.5 probability:
            Q1(S, A) ← Q1(S, A) + α[R + γQ2(S', argmaxa Q1(S', a)) − Q1(S, A)]
        else:
            Q2(S, A) ← Q2(S, A) + α[R + γQ1(S', argmaxa Q2(S', a)) − Q2(S, A)]
        S ← S'
    until S is terminal

Figure 6.15: Double Q-learning.
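A Python sketch of the boxed algorithm, using the same hypothetical environment interface as the earlier sketches:

    import random
    from collections import defaultdict

    def double_q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
        """Double Q-learning with two independent tabular estimates."""
        Q1, Q2 = defaultdict(float), defaultdict(float)

        def eps_greedy(s):
            actions = env.actions(s)
            if random.random() < epsilon:
                return random.choice(actions)
            # behave greedily with respect to the sum of the two estimates
            return max(actions, key=lambda a: Q1[(s, a)] + Q2[(s, a)])

        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = eps_greedy(s)
                r, s_next, done = env.step(a)
                # pick which estimate to update at random; use the other one to evaluate
                Qa, Qb = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
                if done:
                    target = r
                else:
                    best = max(env.actions(s_next), key=lambda act: Qa[(s_next, act)])
                    target = r + gamma * Qb[(s_next, best)]
                Qa[(s, a)] += alpha * (target - Qa[(s, a)])
                s = s_next
        return Q1, Q2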

Example of Maximization Bias

Are there algorithms that avoid maximization bias? To start, consider a bandit case in which we have noisy estimates of the value of each of many actions, obtained as sample averages of the rewards received on all the plays with each action. As we discussed above, there will be a positive maximization bias if we use the maximum of the estimates as an estimate of the maximum of the true values. One way to view the problem is that it is due to using the same samples (plays) both to determine the maximizing action and to estimate its value. Suppose we divided the plays into two sets and used them to learn two independent estimates, call them Q1(a) and Q2(a), each an estimate of the true value q(a), for all a ∈ A. We could then use one estimate, say Q1, to determine the maximizing action A* = argmaxa Q1(a), and the other, Q2, to provide the estimate of its value, Q2(A*) = Q2(argmaxa Q1(a)). This estimate will then be unbiased in the sense that E[Q2(A*)] = q(A*). We can also repeat the process with the roles of the two estimates reversed to yield a second unbiased estimate, Q1(argmaxa Q2(a)). This is the idea of doubled learning. Note that although we learn two estimates, only one estimate is updated on each play; doubled learning doubles the memory requirements, but there is no increase at all in the amount of computation per step.

The idea of doubled learning extends naturally to algorithms for full MDPs. For example, the doubled learning algorithm analogous to Q-learning, called Double Q-learning, divides the time steps in two, perhaps by flipping a coin on each step. If the coin comes up heads, the update is the Double Q-learning update:

    Q1(St, At) ← Q1(St, At) + α[Rt+1 + γQ2(St+1, argmaxa Q1(St+1, a)) − Q1(St, At)]
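A small Python demonstration of the bandit case described above (parameters are my own choices): the true value of every action is 0, so the true maximum is 0, yet the single (max-of-estimates) estimator comes out clearly positive while the double estimator does not.

    import random

    def bandit_bias_demo(num_actions=10, samples_per_action=10, runs=2000):
        """Single vs. double estimator of max_a q(a) when every true q(a) = 0."""
        single_total, double_total = 0.0, 0.0
        for _ in range(runs):
            # two independent sets of sample-average estimates for each action
            q1 = [sum(random.gauss(0, 1) for _ in range(samples_per_action)) / samples_per_action
                  for _ in range(num_actions)]
            q2 = [sum(random.gauss(0, 1) for _ in range(samples_per_action)) / samples_per_action
                  for _ in range(num_actions)]
            single_total += max(q1)                      # biased: uses q1 both to pick and to evaluate
            a_star = max(range(num_actions), key=lambda a: q1[a])
            double_total += q2[a_star]                   # unbiased: picks with q1, evaluates with q2
        print("single estimator of max_a q(a):", round(single_total / runs, 3))  # clearly > 0
        print("double estimator of max_a q(a):", round(double_total / runs, 3))  # approximately 0

    bandit_bias_demo()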
Afterstates

Usually, a state-value function evaluates states in which the agent can take an action.
But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
Why is this useful?
What is this in general?



Summary
Introduced one-step tabular model-free TD methods
These methods bootstrap and sample, combining aspects of
DP and MC methods
TD methods are computationally congenial
If the world is truly Markov, then TD methods will learn
faster than MC methods
MC methods have lower error on past data, but higher error
on future data
Extend prediction to control by employing some form of GPI
On-policy control: Sarsa, Expected Sarsa
Off-policy control: Q-learning, Expected Sarsa
Avoiding maximization bias with Double Q-learning
