6.1 TD Prediction

Both TD and Monte Carlo methods use experience to solve the prediction
problem: given some experience following a policy π, both methods update
their estimate V of vπ for the nonterminal states St occurring in that
experience.

‣ Policy Evaluation (the prediction problem): for a given policy π,
  compute the state-value function vπ
Roughly speaking, Monte Carlo methods wait until the return following the
visit is known, then use that return as a target for V(St). A simple
every-visit Monte Carlo method suitable for nonstationary environments is

    V(St) ← V(St) + α[Gt − V(St)],                                    (6.1)

where Gt is the actual return following time t, and α is a constant
step-size parameter (cf. Equation 2.4). Let us call this method
constant-α MC. Whereas Monte Carlo methods must wait until the end of the
episode to determine the increment to V(St) (only then is Gt known), TD
methods need wait only until the next time step. At time t + 1 they
immediately form a target and make a useful update using the observed
reward Rt+1 and the estimate V(St+1). The simplest TD method, known as
TD(0), is

    V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)].                       (6.2)
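As a concrete sketch, update (6.2) can be written in a few lines of
Python. The `env_step(s, a) -> (reward, next_state, done)` interface and
the dictionary value table are illustrative assumptions, not part of the
text:

```python
def td0_prediction(env_step, start_state, policy, num_episodes,
                   alpha=0.1, gamma=1.0):
    # Tabular TD(0): after each step, move V(s) toward the target
    # R + gamma * V(s') as in update (6.2). Unseen states default to 0.
    V = {}
    for _ in range(num_episodes):
        s, done = start_state, False
        while not done:
            r, s2, done = env_step(s, policy(s))
            target = r + gamma * (0.0 if done else V.get(s2, 0.0))
            V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
            s = s2
    return V
```

Constant-α MC (6.1) differs only in that it must wait until the episode
ends and then uses the full return Gt as the target.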
In effect, the target for the Monte Carlo update is Gt, whereas the
target for the TD update is Rt+1 + γV(St+1): an estimate of the return.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

TD target for prediction

Because TD(0) bases its update in part on an existing estimate, we say
that it is a bootstrapping method, like DP. We know from Chapter 3 that

cf. Dynamic Programming:
    V(St) ← Eπ[Rt+1 + γ V(St+1)]
          = Σ_a π(a|St) Σ_{s′,r} p(s′, r | St, a) [r + γ V(s′)]
Driving Home

your estimate of total travel time to 35 minutes. Unfortunately, at this
point you get stuck behind a slow truck, and the road is too narrow to
pass. You end up having to follow the truck until you turn onto the side
street where you live at 6:40. Three minutes later you are home. The
sequence of states, times, and predictions is thus as follows:

                                Elapsed Time   Predicted    Predicted
  State                         (minutes)      Time to Go   Total Time
  leaving office, friday at 6       0              30           30
  reach car, raining                5              35           40
  exiting highway                  20              15           35
  2ndary road, behind truck        30              10           40
  entering home street             40               3           43
  arrive home                      43               0           43

The rewards in this example are the elapsed times on each leg of the
journey.¹ We are not discounting (γ = 1), and thus the return for each
state is the actual time to go from that state. The value of each state
is the expected time to go. The second column of numbers gives the
current estimated value for each state encountered.

A simple way to view the operation of Monte Carlo methods is to plot the
predicted total time (the last column) over the sequence, as in
Figure 6.2 (left).
Driving home as an RL problem

                                Elapsed Time        Predicted    Predicted
  State                         (minutes)      R    Time to Go   Total Time
  leaving office, friday at 6       0          5        30           30
  reach car, raining                5         15        35           40
  exiting highway                  20         10        15           35
  2ndary road, behind truck        30         10        10           40
  entering home street             40          3         3           43
  arrive home                      43                    0           43

‣ Task: update the value function as we go, based on observed elapsed
  time (the Reward column, R)
Driving home

The arrows show the changes in predictions recommended by the constant-α
MC method (6.1), for α = 1. These are exactly the errors between the
estimated value (predicted time to go) in each state and the actual
return (actual time to go). For example, when you exited the highway you
thought it would take only 15 minutes more to get home, but in fact it
took 23 minutes.

¹ If this were a control problem with the objective of minimizing travel
time, then we would of course make the rewards the negative of the
elapsed time. But since we are concerned here only with prediction
(policy evaluation), we can keep things simple by using positive numbers.

‣ update V(office) with α = 1?
  • V(s) ← V(s) + α[Rt+1 + γV(s′) − V(s)]
  • V(office) ← V(office) + α[Rt+1 + γV(car) − V(office)]
  • new V(office) = 40; Δ = +10
‣ update V(car)?
  • V(car) = 30; Δ = −5
‣ update V(exit)?
  • V(exit) = 20; Δ = +5
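The three updates above can be checked directly; the numbers come from
the table, with α = γ = 1:

```python
# Current predictions (time to go) and, for each updated state, the
# observed reward on the next leg plus the next state's prediction.
V = {"office": 30, "car": 35, "exit": 15}
legs = [("office", 5, 35),   # office -> car took 5 min, V(car) = 35
        ("car", 15, 15),     # car -> exit took 15 min, V(exit) = 15
        ("exit", 10, 10)]    # exit -> 2ndary took 10 min, V(2ndary) = 10
alpha = gamma = 1.0
for s, r, v_next in legs:
    delta = r + gamma * v_next - V[s]   # TD error
    V[s] = V[s] + alpha * delta
    print(s, "->", V[s], "change", delta)
```

This reproduces the changes Δ = +10, −5, +5 listed above.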
Changes recommended by TD methods (α = 1)

(Figure 6.2 plots: predicted total travel time at each situation, from
leaving office through reach car, exiting highway, 2ndary road, and home
street to arrive home, with arrows from each prediction toward the actual
outcome.)

Figure 6.2: Changes recommended in the driving home example by Monte
Carlo methods (left) and TD methods (right).
Advantages of TD learning

‣ TD methods do not require a model of the environment, only experience
‣ TD methods can be fully incremental
  ‣ Make updates before knowing the final outcome
  ‣ Requires less memory
  ‣ Requires less peak computation
‣ You can learn without the final outcome, from incomplete sequences
‣ Both MC and TD converge (under certain assumptions to be detailed
  later), but which is faster?
Random walk

  0       0       0       0       0       1
      A       B       C       D       E
            start

‣ C is start state, episodic, undiscounted γ = 1
‣ π is left or right with equal probability in all states
‣ termination at either end
‣ what does vπ(s) tell us?
  • the probability of termination on the right side from each state
  • what is vπ for [A B C D E]?
  • vπ = [1/6 2/6 3/6 4/6 5/6]
‣ Initialize V(s) = 0.5 ∀s ∈ S

Figure 6.3: Results with the 5-state random walk. Above: The small Markov
reward process generating the episodes. Left: Results from a single run
after various numbers of episodes. The estimate after 100 episodes is
about as close as they ever get to the true values; with a constant
step-size parameter (α = 0.1 in this example), the values fluctuate
indefinitely in response to the outcomes of the most recent episodes.
Right: Learning curves for TD(0) and constant-α MC methods, for various
values of α. The performance measure shown is the root mean-squared (RMS)
error between the value function learned and the true value function,
averaged over the five states. These data are averages over 100 different
sequences of episodes. In all cases the approximate value function was
initialized to the intermediate value V(s) = 0.5, for all s. The TD
method was consistently better than the MC method on this task.
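A small simulation of this experiment can be sketched as follows; the
integer state encoding (1–5 for A–E, 0 and 6 terminal) is an assumption
made for convenience:

```python
import random

def walk_episode():
    # States 1..5 are A..E; 0 and 6 are terminal. Start in C (state 3).
    # All rewards are 0 except a reward of 1 for terminating on the right.
    s, steps = 3, []
    while 0 < s < 6:
        s2 = s + random.choice((-1, 1))
        steps.append((s, 1.0 if s2 == 6 else 0.0, s2))
        s = s2
    return steps

def td0_random_walk(num_episodes, alpha=0.05, gamma=1.0):
    V = {s: 0.5 for s in range(1, 6)}   # initialize V(s) = 0.5
    V[0] = V[6] = 0.0                   # terminal values
    for _ in range(num_episodes):
        for s, r, s2 in walk_episode():
            V[s] += alpha * (r + gamma * V[s2] - V[s])
    return V
```

After a few thousand episodes the estimates hover near the true values
1/6, 2/6, ..., 5/6; with a constant step size they keep fluctuating
rather than converging exactly, as the figure caption notes.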
Values learned by TD from one run, after various numbers of episodes

(Figure 6.3, left panel: estimated value of states A–E after 0, 1, 10,
and 100 episodes, plotted against the true values.)
TD and MC on the Random Walk

(Figure 6.3, right panel: empirical RMS error averaged over states, for
TD(0) and constant-α MC at several values of α, over 100 walks/episodes.)
Batch Updating in TD and MC methods

‣ After each new episode, all episodes seen so far are treated as a batch
‣ This growing batch is repeatedly processed by TD and MC until
  convergence

(Repeated 100 times)
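Batch TD(0) as described above can be sketched as repeated sweeps over
the stored episodes, accumulating all increments before applying them.
The episode representation below is an assumption for illustration:

```python
def batch_td0(episodes, alpha=0.001, gamma=1.0, tol=1e-6):
    # episodes: list of episodes, each a list of (s, r, s_next, done)
    # steps. Sweep the whole batch, summing the TD(0) increments, then
    # apply them together; stop once the increments become negligible.
    V = {}
    while True:
        inc = {}
        for episode in episodes:
            for s, r, s2, done in episode:
                target = r + gamma * (0.0 if done else V.get(s2, 0.0))
                inc[s] = inc.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
        if max(abs(d) for d in inc.values()) < tol:
            return V
        for s, d in inc.items():
            V[s] = V.get(s, 0.0) + d
```

With α small enough, the batch updates converge to a fixed point that is
independent of α; batch MC is the same loop with full returns Gt as the
targets.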
(BATCH TRAINING figure: RMS error averaged over states vs. walks/episodes
for batch TD and batch MC on the 5-state random walk; the TD curve lies
below the MC curve.)
You are the Predictor

Suppose you observe the following eight episodes (each line is one
episode's sequence of states and rewards):

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

V(B)? 0.75
V(A)? 0 (the batch Monte Carlo answer) or 0.75 (the batch TD answer)?
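The two answers can be computed from the eight episodes. Batch Monte
Carlo averages the observed returns from each state; batch TD(0)
converges to the values of the maximum-likelihood model of the process,
under which A always transitions to B. A sketch:

```python
# Each episode is a list of (state, reward) steps, as listed above.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: V(s) = average return observed after visiting s.
returns = {}
for ep in episodes:
    g = 0
    for s, r in reversed(ep):
        g += r                      # undiscounted return from s
        returns.setdefault(s, []).append(g)
v_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Batch TD(0) fixed point: V(B) is the mean one-step reward from B,
# and V(A) = 0 + V(B), since every observed A transition leads to B.
v_td = {"B": 6 / 8}
v_td["A"] = 0 + v_td["B"]

print(v_mc)   # {'B': 0.75, 'A': 0.0}
print(v_td)   # {'B': 0.75, 'A': 0.75}
```

Both methods agree that V(B) = 0.75; they disagree on V(A) because MC
minimizes error on the observed returns while TD generalizes through the
estimated transition structure.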
Multi-step bootstrapping

(Backup diagrams arranged by the depth (length) of the update, ranging
from one-step TD up to Monte Carlo, with exhaustive search at full width
and depth.)
Learning An Action-Value Function
In this case, the learned action-value function, Q, directly approximates
q∗, the optimal action-value function, independent of the policy being
followed. This dramatically simplifies the analysis of the algorithm and
enabled early convergence proofs. The policy still has an effect in that
it determines which state–action pairs are visited and updated. However,
all that is required for correct convergence is that all pairs continue
to be updated. As we observed in Chapter 5, this is a minimal requirement
in the sense that any method guaranteed to find optimal behavior in the
general case must require it. Under this assumption and a variant of the
usual stochastic approximation conditions on the sequence of step-size
parameters, Q has been shown to converge with probability 1 to q∗. The
Q-learning algorithm is shown in procedural form in Figure 6.10.

Q-learning

Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and
Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S′
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
        S ← S′
    until S is terminal

Figure 6.12: Q-learning: An off-policy TD control algorithm.

What is the backup diagram for Q-learning? The rule (6.6) updates a
state–action pair, so the top node, the root of the backup, must be a
small, filled action node.
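The boxed algorithm translates almost line for line into Python. The
environment interface (`env_reset`, `env_step`) below is an assumption
made for illustration:

```python
import random

def q_learning(env_reset, env_step, actions, num_episodes,
               alpha=0.5, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning with an epsilon-greedy behavior policy.
    # env_reset() -> start state; env_step(s, a) -> (r, s_next, done).
    Q = {}
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(num_episodes):
        s, done = env_reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                 # explore
            else:
                a = max(actions, key=lambda b: q(s, b))    # exploit
            r, s2, done = env_step(s, a)
            target = r if done else r + gamma * max(q(s2, b) for b in actions)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            s = s2
    return Q
```

Note that the learned Q approximates q∗ regardless of the ε-greedy
behavior, which is what makes the method off-policy.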
Cliffwalking

ε-greedy, ε = 0.1

6.6 Expected Sarsa

Consider the learning algorithm that is just like Q-learning except that
instead of the maximum over next state–action pairs it uses the expected
value, taking into account how likely each action is under the current
policy. That is, consider the algorithm with the update rule

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ Σ_a π(a|St+1) Q(St+1, a) − Q(St, At)],

but that otherwise follows the schema of Q-learning (as in Figure 6.10).
Given the next state, St+1, this algorithm moves deterministically in the
same direction as Sarsa moves in expectation, and accordingly it is
called Expected Sarsa. Its backup diagram is shown in Figure 6.12.

Figure 6.12: The backup diagrams for Q-learning and Expected Sarsa.

Expected Sarsa is more complex computationally than Sarsa but, in return,
it eliminates the variance due to the random selection of At+1. Given the
same amount of experience we might expect it to perform slightly better
than Sarsa, and indeed it generally does. Figure 6.13 shows summary
results on the cliff-walking task with Expected Sarsa compared to Sarsa
and Q-learning. As an on-policy method, Expected Sarsa retains the
significant advantage of Sarsa over Q-learning on this problem. In
addition, Expected Sarsa shows a significant improvement over Sarsa over
a wide range of step sizes.

‣ Expected Sarsa performs better than Sarsa (but costs more)
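The only change from Q-learning is the target. For an ε-greedy policy,
the expectation over next actions can be computed as below (a sketch; the
flat dictionary Q-table is an assumption):

```python
def expected_sarsa_target(Q, r, s_next, actions, epsilon, gamma):
    # Target r + gamma * sum_a pi(a|s') Q(s', a) for an epsilon-greedy
    # policy pi: probability epsilon/|A| on every action, plus the
    # remaining 1 - epsilon on a greedy action.
    q_vals = [Q.get((s_next, a), 0.0) for a in actions]
    expected_q = (epsilon / len(actions)) * sum(q_vals) \
                 + (1 - epsilon) * max(q_vals)
    return r + gamma * expected_q
```

Unlike Sarsa's sampled Q(S′, A′), this target has no variance from the
choice of A′, which is exactly the variance-elimination argument above.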
Performance on the Cliff-walking Task

van Seijen, van Hasselt, Whiteson, & Wiering 2009

One notable result is that for large values of α the Q values of Sarsa
diverge. Although the policy is still improved over the initial random
policy during the early stages of learning, divergence causes the policy
to get worse in the long run.

(Figure 6.13 plot: reward per episode vs. α from 0.1 to 1 for Sarsa,
Q-learning, and Expected Sarsa, showing both interim performance, after
100 episodes with n = 100, and asymptotic performance with n = 1E5.)

Figure 6.13: Interim and asymptotic performance of TD control methods on
the cliff-walking task.
6.5 Q-learning: Off-Policy TD Control

One of the most important breakthroughs in reinforcement learning was the
development of an off-policy TD control algorithm known as Q-learning
(Watkins, 1989). Its simplest form, one-step Q-learning, is defined by

Tabular Q-learning:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ max_a Q(St+1, a) − Q(St, At)].
Double Q-Learning

‣ Train 2 action-value functions, Q1 and Q2
‣ Do Q-learning on both, but update only one of them, chosen at random,
  on each step

Initialize S
Repeat (for each step of episode):
    Choose A from S using policy derived from Q1 and Q2 (e.g., ε-greedy
    in Q1 + Q2)
    Take action A, observe R, S′
    With 0.5 probability:
        Q1(S, A) ← Q1(S, A) + α(R + γ Q2(S′, argmax_a Q1(S′, a)) − Q1(S, A))
    else:
        Q2(S, A) ← Q2(S, A) + α(R + γ Q1(S′, argmax_a Q2(S′, a)) − Q2(S, A))
    S ← S′
until S is terminal

Hado van Hasselt 2010
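A single step of the boxed pseudocode can be sketched as follows; the
dictionary tables and flat `(state, action)` keys are assumptions:

```python
import random

def double_q_step(Q1, Q2, s, a, r, s2, done, actions,
                  alpha=0.1, gamma=1.0):
    # Flip a fair coin: one table chooses the argmax action at s', the
    # other evaluates it. Only the choosing table is updated.
    choose, evaluate = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda b: choose.get((s2, b), 0.0))
        target = r + gamma * evaluate.get((s2, a_star), 0.0)
    old = choose.get((s, a), 0.0)
    choose[(s, a)] = old + alpha * (target - old)
```

Since each step updates exactly one of Q1 and Q2, per-step computation
matches ordinary Q-learning; only the memory requirement doubles.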
Double Q-Learning

Are there algorithms that avoid maximization bias? To start, consider a
bandit ... one estimate, say Q1, to determine the maximizing action,
A∗ = argmax_a Q1(a), and the other, Q2, to provide the estimate of its
value, Q2(A∗) = Q2(argmax_a Q1(a)). This estimate will then be unbiased
in the sense that E[Q2(A∗)] = q(A∗). We can also repeat the process with
the role of the two estimates reversed to yield a second unbiased
estimate, Q1(argmax_a Q2(a)). This is the idea of doubled learning. Note
that although we learn two estimates, only one estimate is updated on
each play; doubled learning doubles the memory requirements, but there is
no increase at all in the amount of computation per step.

(Figure: % of wrong actions over episodes, comparing Q-learning and
Double Q-learning on a maximization-bias example; a 5% optimal reference
level is marked.)

The idea of doubled learning extends naturally to algorithms for full
MDPs. For example, the doubled learning algorithm analogous to
Q-learning, called Double Q-