[Diagram: the space of RL methods, organized by width of update (temporal-difference learning vs. dynamic programming and exhaustive search) and depth (length) of update (one-step temporal-difference learning vs. Monte Carlo), with multi-step bootstrapping in between.]
Chapter 7:
Multi-step Bootstrapping
Unifying Monte Carlo and TD
key algorithms: n-step TD, n-step Sarsa, Tree-backup, Q(𝜎)
n-step TD Prediction
[Backup diagrams: 1-step TD (i.e., TD(0)), 2-step TD, 3-step TD, …, n-step TD, …, ∞-step TD (i.e., Monte Carlo).]

Idea: Look farther into the future when you do TD — backup (1, 2, 3, …, n steps).
Mathematics of n-step TD Returns/Targets

…we called them one-step TD methods. More formally, consider the update of the estimated value of state $S_t$ as a result of the state–reward sequence $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$ (omitting the actions).
We know that in Monte Carlo backups the estimate of $v_\pi(S_t)$ is updated in the direction of the complete return:

$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T,$

where $T$ is the last time step of the episode. Let us call this quantity the target of the update. Whereas in Monte Carlo updates the target is the return, in one-step updates the target is the first reward plus the discounted estimated value of the next state, which we call the one-step return:

TD: $G_{t:t+1} \doteq R_{t+1} + \gamma V_t(S_{t+1}),$
where $V_t : \mathcal{S} \to \mathbb{R}$ here is the estimate at time $t$ of $v_\pi$ (we use $V_t$ to estimate the remaining return). The subscripts on $G_{t:t+1}$ indicate that it is a truncated return for time $t$, using rewards up until time $t+1$, with the discounted estimate $\gamma V_t(S_{t+1})$ taking the place of the other terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$ of the full return, as we discussed in the previous chapter. Our point now is that this idea makes just as much sense after two steps as it does after one. The target for a two-step update is the two-step return:
$G_{t:t+2} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2}),$

where now $\gamma^2 V_{t+1}(S_{t+2})$ corrects for the absence of the terms $\gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots + \gamma^{T-t-1} R_T$. Similarly, the target for an arbitrary n-step update is the n-step return:

n-step TD: $G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}),$

for all $n, t$ such that $n \ge 1$ and $0 \le t < T - n$. All n-step returns can be considered approximations to the full return, truncated after $n$ steps and then corrected for the remaining missing terms by $V_{t+n-1}(S_{t+n})$. If $t + n \ge T$ (if the n-step return extends to or beyond termination), then all the missing terms are taken as zero, and the n-step return is defined to be equal to the ordinary full return ($G_{t:t+n} \doteq G_t$ if $t + n \ge T$).
Forward View of TD(λ)

[Diagram: a trajectory unfolding forward in time — $S_t$, $R_{t+1}$, $S_{t+1}$, $R_{t+2}$, $S_{t+2}$, $R_{t+3}$, $S_{t+3}$, … — along a time axis.]
Recall the n-step return:

$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n.$

Note that n-step returns for $n > 1$ involve future rewards and states that are not available at the time of transition from $t$ to $t+1$. No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$. The natural algorithm for using n-step returns is thus to wait until then:

$V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big], \qquad 0 \le t < T,$

while the values of all other states remain unchanged: $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \ne S_t$. This algorithm is called n-step TD. Note that no changes at all are made during the first $n-1$ steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.
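As a concrete illustration of the n-step TD update just described, here is a minimal Python sketch (not the book's pseudocode). It assumes integer-indexed states and an environment object with `reset()` returning the start state and `step(state)` returning `(reward, next_state, done)` under the policy being evaluated — these interface details are assumptions of the example, not anything from the chapter.

```python
import numpy as np

def n_step_td(env, n, alpha, gamma, num_states, num_episodes):
    """Sketch of n-step TD prediction: estimate V for the policy driving env.step()."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        state = env.reset()
        states = [state]      # states[t] holds S_t
        rewards = [0.0]       # rewards[t] holds R_t (index 0 is a dummy)
        T = float('inf')      # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                reward, next_state, done = env.step(states[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1   # the time whose estimate is updated now
            if tau >= 0:
                # n-step return G_{tau:tau+n}, truncated at termination
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

As in the text, the update for time t is made only at time t + n, and the remaining updates for the last n − 1 states are completed after termination (the loop keeps running after `done` until τ reaches T − 1).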
[Diagram: a small random walk, states A B C D E with start in the middle; every reward is 0 except a reward of 1 at the right end.]

[Figure 7.2: Performance of n-step TD methods as a function of α, for various values of n (n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512), on a 19-state random walk task (Example 7.1). The vertical axis is the average RMS error over the 19 states and the first 10 episodes.]

An intermediate α is best. An intermediate n is best.
Do you think there is an optimal n? for every task?
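To get a feel for Figure 7.2, here is a sketch of that kind of experiment, reusing `n_step_td` from above. The 19-state random-walk environment and its true values are written from the description of Example 7.1 (reward −1 off the left end, +1 off the right end, γ = 1); treat those details, and the tiny single-run sweep, as assumptions of the sketch rather than a reproduction of the figure.

```python
import numpy as np

class RandomWalk:
    """19-state random walk: states 1..19, start in the middle; moves left or
    right with equal probability; terminates at 0 (reward -1) or 20 (reward +1)."""
    def __init__(self, n_states=19):
        self.n_states = n_states
        self.start = (n_states + 1) // 2

    def reset(self):
        return self.start

    def step(self, state):
        next_state = state + np.random.choice([-1, 1])
        if next_state == 0:
            return -1.0, next_state, True
        if next_state == self.n_states + 1:
            return 1.0, next_state, True
        return 0.0, next_state, False

env = RandomWalk()
true_v = np.array([s / 10.0 - 1.0 for s in range(env.n_states + 2)])  # linear true values

for n in [1, 2, 4, 8, 16]:
    for alpha in [0.1, 0.2, 0.4, 0.8]:
        V = n_step_td(env, n=n, alpha=alpha, gamma=1.0,
                      num_states=env.n_states + 2, num_episodes=10)
        rms = np.sqrt(np.mean((V[1:-1] - true_v[1:-1]) ** 2))
        print(f"n={n:2d}  alpha={alpha:.1f}  RMS error over states: {rms:.3f}")
```

Averaged over many runs (the figure uses the first 10 episodes and many repetitions), sweeps like this produce the U-shaped curves: an intermediate α, and an intermediate n, tend to do best.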
Conclusions Regarding n-step Methods (so far)

- Generalize Temporal-Difference and Monte Carlo learning methods, sliding from one to the other as n increases
  - n = 1 is TD as in Chapter 6
  - n = ∞ is MC as in Chapter 5
  - an intermediate n is often much better than either extreme
  - applicable to both continuing and episodic problems
- There is some cost in computation
  - need to remember the last n states
  - learning is delayed by n steps
  - per-step computation is small and uniform, like TD
- Everything generalizes nicely: error-reduction theory, Sarsa, off-policy by importance sampling, Expected Sarsa, Tree Backup
- The very general n-step Q(𝜎) algorithm includes everything!
…previous exercise) in place of the error in (7.2) would actually be a slightly different algorithm. Would it be a better algorithm or a worse one? Devise and program a small experiment to answer this question empirically.
Error-reduction property

The n-step return uses the value function $V_{t+n-1}$ to correct for the missing rewards beyond $R_{t+n}$. An important property of n-step returns is that their expectation is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in a worst-state sense. That is, the worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:

$\max_s \Big| \mathbb{E}_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \Big| \;\le\; \gamma^n \max_s \Big| V_{t+n-1}(s) - v_\pi(s) \Big|,$

for all $n \ge 1$ (the left side is the maximum error using the n-step return, the right side the maximum error using $V_{t+n-1}$). This is called the error reduction property of n-step returns. Because of the error reduction property, one can show formally that all n-step TD methods converge to the correct predictions under appropriate technical conditions. The n-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.
It’s much the same for action values
[Figure 7.3: The spectrum of n-step backups for state–action values. They range from the one-step backup of Sarsa(0) through n-step Sarsa to the ∞-step (Monte Carlo) backup, with the n-step Expected Sarsa backup on the far right.]

On-policy n-step Action-value Methods

From the n-step Sarsa pseudocode box:

  $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n,\,T)} \gamma^{\,i-\tau-1} R_i$
  If $\tau + n < T$, then $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$
  $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \big[ G - Q(S_\tau, A_\tau) \big]$
  If π is being learned, then ensure that $\pi(\cdot \mid S_\tau)$ is ε-greedy wrt Q
  Until $\tau = T - 1$

The main idea is to simply switch states for actions (state–action pairs) and use an ε-greedy policy. The backup diagrams for n-step Sarsa (shown in Figure 7.3), like those of n-step TD (Figure 7.1), are strings of alternating states and actions, except that the Sarsa ones all start and end with an action rather than a state. We redefine n-step returns (update targets) in terms of estimated action values:

$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \ge 1,\ 0 \le t < T - n,$

with $G_{t:t+n} \doteq G_t$ if $t + n \ge T$. The natural algorithm is then

n-step Sarsa: $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big], \qquad 0 \le t < T,$

while the values of all other states remain unchanged: $Q_{t+n}(s, a) = Q_{t+n-1}(s, a)$ for all $s, a$ such that $s \ne S_t$ or $a \ne A_t$. This is the algorithm we call n-step Sarsa. Pseudocode is shown in the box above, and an example of why it can speed up learning compared to one-step methods is given in Figure 7.4.
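A minimal sketch of n-step Sarsa in Python, paralleling the prediction code above. It is not the pseudocode box from the chapter; the tabular environment interface (`reset()`, `step(state, action)` returning `(reward, next_state, done)`), the ε-greedy helper, and the integer state/action indexing are assumptions of the example.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps, rng):
    """Pick an action epsilon-greedily with respect to Q[state]."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def n_step_sarsa(env, n, alpha, gamma, num_states, num_actions,
                 num_episodes, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, num_actions, eps, rng)
        states, actions, rewards = [state], [action], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                reward, next_state, done = env.step(states[t], actions[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(epsilon_greedy(Q, next_state, num_actions, eps, rng))
            tau = t - n + 1
            if tau >= 0:
                # n-step return in terms of estimated action values
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```

Because the behavior policy is derived from Q (ε-greedy), improving Q also improves the policy, which is the control setting illustrated in Figure 7.4.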
What about Expected Sarsa? The backup diagram for the n-step version of Expected Sarsa is shown on the far right in Figure 7.3. It consists of a linear string of sample actions and states, just as in n-step Sarsa, except that its last element is a branch over all action possibilities weighted, as always, by their probability under π. n-step Expected Sarsa is thus the same update with a slightly different n-step return: it can be described by the same update equation as n-step Sarsa, except with the n-step return redefined as

$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \sum_a \pi(a \mid S_{t+n})\, Q_{t+n-1}(S_{t+n}, a).$

[Backup diagrams: 1-step Sarsa (aka Sarsa(0)), 2-step Sarsa, 3-step Sarsa, …, n-step Sarsa, …, ∞-step Sarsa (aka Monte Carlo), and n-step Expected Sarsa.]
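The only change for n-step Expected Sarsa is the final term of the return: bootstrap from an expectation over actions under π instead of from the single sampled action. A small helper, assuming a tabular `Q` and an array `pi` of per-state action probabilities (both assumptions of this sketch):

```python
import numpy as np

def expected_n_step_return(rewards, states, tau, n, T, gamma, Q, pi):
    """G_{tau:tau+n} with an expected, rather than sampled, bootstrap term.

    rewards[i] holds R_i (index 0 unused), states[k] holds S_k,
    Q is a [num_states, num_actions] table, pi[s] a distribution over actions.
    """
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        s = states[tau + n]
        G += gamma ** n * float(np.dot(pi[s], Q[s]))   # sum_a pi(a|s) Q(s, a)
    return G
```

Swapping this return into the n-step Sarsa update above gives the n-step Expected Sarsa variant described here.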
Figure 7.4: Gridworld example of the speedup of policy learning due to the use of n-step methods. The first panel shows the path taken by an agent in a single episode, ending at a location of high reward, marked by the G. In this example the values were all initially 0, and all rewards were zero except for a positive reward at G. The arrows in the other two panels show which action values were strengthened as a result of this path by one-step and n-step Sarsa methods. The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last n actions of the sequence, so that much more is learned from the one episode.
Off-policy n-step Methods by Importance Sampling

7.3 n-step Off-policy Learning by Importance Sampling

Recall that off-policy learning is learning the value function for one policy, π, while following another policy, b. Often, π is the greedy policy for the current action-value-function estimate, and b is a more exploratory policy, perhaps ε-greedy. In order to use the data from b we must take into account the difference between the two policies, using their relative probability of taking the actions that were taken (see Section 5.5). In n-step methods, returns are constructed over n steps, so we are interested in the relative probability of just those n actions.

Recall the importance-sampling ratio:

$\rho_{t:h} \doteq \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$

For example, if any one of the actions would never be taken by π (i.e., $\pi(A_k \mid S_k) = 0$), then the n-step return should be given zero weight and be totally ignored. On the other hand, if by chance an action is taken that π would take with much greater probability than b does, then this will increase the weight that would otherwise be given to the return. This makes sense because that action is characteristic of π (and therefore we want to learn about it) but is selected only rarely by b, and thus rarely appears in the data; to make up for this we have to over-weight it when it does occur.

We get off-policy n-step methods by weighting updates with this ratio. For example, to make a simple off-policy version of n-step TD, the update for time t (actually made at time t + n) can simply be weighted by $\rho_{t:t+n-1}$:

Off-policy n-step TD: $V_{t+n}(S_t) \doteq V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big], \qquad 0 \le t < T,$   (7.7)

where $\rho_{t:t+n-1}$, called the importance sampling ratio, is the relative probability under the two policies of taking the n actions from $A_t$ to $A_{t+n-1}$ (cf. Eq. 5.3). Note that if the two policies are actually the same (the on-policy case), then the importance sampling ratio is always 1. Thus our new update (7.7) generalizes and can completely replace our earlier n-step TD update. Similarly, our previous n-step Sarsa update can be completely replaced by a simple off-policy form:

Off-policy n-step Sarsa: $Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n-1} \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big], \qquad 0 \le t < T.$

Note that the importance sampling ratio here starts one step later than for n-step TD (above). This is because here we are updating a state–action pair: we do not have to care how likely we were to select the action; now that we have selected it, we want to learn fully from what happens, with importance sampling only for subsequent actions.

Off-policy n-step Expected Sarsa: just like the above, except with the expected n-step return, and ρ goes only to t + n − 2. Pseudocode for the full algorithm is given in the box.
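A sketch of the importance-sampling machinery above. `pi_prob` and `b_prob` are hypothetical functions returning π(a|s) and b(a|s); the stored `states`, `actions`, and `rewards` lists follow the same indexing as the earlier sketches. Everything simply transcribes the equations, so treat it as an illustration rather than the chapter's algorithm box.

```python
def importance_sampling_ratio(pi_prob, b_prob, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi_prob(actions[k], states[k]) / b_prob(actions[k], states[k])
    return rho

def off_policy_n_step_td_update(V, states, actions, rewards, tau, n, T,
                                alpha, gamma, pi_prob, b_prob):
    """One off-policy n-step TD update: the error is weighted by rho_{tau:tau+n-1}."""
    rho = importance_sampling_ratio(pi_prob, b_prob, states, actions,
                                    tau, tau + n - 1, T)
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * V[states[tau + n]]
    V[states[tau]] += alpha * rho * (G - V[states[tau]])
    # For off-policy n-step Sarsa the ratio would instead be rho_{tau+1:tau+n-1},
    # starting one step later, as explained above.
```

If π and b are the same policy, every factor in the product is 1 and the update reduces to the on-policy n-step TD update.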
Off-policy Learning w/o Importance Sampling:
The n-step Tree Backup Algorithm
[Backup diagrams: Expected Sarsa (equivalently, 1-step Tree Backup), 2-step Tree Backup, and 3-step TB. Each starts from $S_t, A_t$, receives $R_{t+1}$ at $S_{t+1}$, branches over all actions $a$ weighted by π, and follows only the action actually taken, $A_{t+1}$, on to $R_{t+2}$, $S_{t+2}$, and its actions $a'$.]

1-step target (Expected Sarsa):

$R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$

2-step tree-backup target:

$R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1}) \Big( R_{t+2} + \gamma \sum_{a'} \pi(a' \mid S_{t+2})\, Q(S_{t+2}, a') \Big)$
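The two-step tree-backup target shown above can be written as a small function. `Q` is a tabular action-value array and `pi` an array of per-state action probabilities; both names, and the restriction to the two-step case from the slide, are assumptions of this sketch.

```python
import numpy as np

def two_step_tree_backup_target(r1, s1, a1, r2, s2, Q, pi, gamma):
    """Two-step tree-backup target:
    R_{t+1} + gamma * sum_{a != A_{t+1}} pi(a|S_{t+1}) Q(S_{t+1}, a)
            + gamma * pi(A_{t+1}|S_{t+1}) * (R_{t+2} + gamma * sum_a' pi(a'|S_{t+2}) Q(S_{t+2}, a'))
    """
    # leaves at the first branch: every action except the one actually taken
    off_branch = float(np.dot(pi[s1], Q[s1])) - pi[s1][a1] * Q[s1][a1]
    # the taken action A_{t+1} is followed one more step, then backed up as an expectation
    continuation = r2 + gamma * float(np.dot(pi[s2], Q[s2]))
    return r1 + gamma * off_branch + gamma * pi[s1][a1] * continuation
```

No importance sampling appears anywhere: actions other than the one taken contribute only through their probabilities under π, which is what makes the tree backup an off-policy method without importance sampling.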
…a pure expectation with no sampling. The random variable $\sigma_t$ might be set as a function of the state, action, or state–action pair at time $t$. We call this proposed new algorithm n-step Q(σ).
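The per-step choice that Q(σ) adds can be sketched as a one-step mixed bootstrap: with σ = 1 the target bootstraps from the sampled next action (Sarsa-like), with σ = 0 from the expectation over actions (tree-backup / Expected-Sarsa-like). The full n-step Q(σ) return composes this choice along the whole backup; only the per-step mixture is shown here, and the array names are assumptions of the sketch.

```python
import numpy as np

def sigma_mixed_bootstrap(r, s_next, a_next, Q, pi, gamma, sigma):
    """One-step target mixing a sampled backup (sigma = 1) with an expected backup (sigma = 0)."""
    sampled = Q[s_next, a_next]                        # bootstrap from the action actually taken
    expected = float(np.dot(pi[s_next], Q[s_next]))    # sum_a pi(a|s') Q(s', a)
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
```

Setting σ ≡ 1 at every step recovers n-step Sarsa, σ ≡ 0 recovers the n-step tree backup, and intermediate or state-dependent σ interpolates between them.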
A Unifying Algorithm: n-step Q(𝜎)
[Figure 7.5: The three kinds of n-step action-value backups considered so far in this chapter (4-step case) — 4-step Sarsa, 4-step Tree backup, and 4-step Expected Sarsa — plus a fourth kind, 4-step Q(σ), that unifies them all; its diagram marks each step with σ = 1 (sample) or σ = 0 (expectation), and the 'ρ's mark the transitions that are importance sampled in the off-policy case.]

Choose whether to sample or take the expectation on each step with σ(s).
Conclusions Regarding n-step Methods

- Generalize Temporal-Difference and Monte Carlo learning methods, sliding from one to the other as n increases
  - n = 1 is TD as in Chapter 6
  - n = ∞ is MC as in Chapter 5
  - an intermediate n is often much better than either extreme
  - applicable to both continuing and episodic problems
- There is some cost in computation
  - need to remember the last n states
  - learning is delayed by n steps
  - per-step computation is small and uniform, like TD
- Everything generalizes nicely: error-reduction theory, Sarsa, off-policy by importance sampling, Expected Sarsa, Tree Backup
- The very general n-step Q(𝜎) algorithm includes everything!