
Unified View

[Diagram: the space of methods laid out along two dimensions. The width of update separates sample updates (Temporal-difference learning, Monte Carlo) from full-width updates (Dynamic programming, Exhaustive search); the depth (length) of update runs from one-step updates (TD, DP) through Multi-step bootstrapping to full-episode updates (Monte Carlo, exhaustive search).]
Chapter 7:
Multi-step Bootstrapping
Unifying Monte Carlo and TD
key algorithms: n-step TD, n-step Sarsa, Tree-backup, Q(𝜎)
n-step TD Prediction
[Backup diagrams: 1-step TD (TD(0)), 2-step TD, 3-step TD, ..., n-step TD, ..., ∞-step TD (Monte Carlo).]

Idea: Look farther into the future when you do TD — backup over 1, 2, 3, …, n steps.
In the previous chapter we called them one-step TD methods. More formally, consider the update of the estimated value of state S_t as a result of the state–reward sequence S_t, R_{t+1}, S_{t+1}, R_{t+2}, ..., R_T, S_T (omitting the actions).
Mathematics of n-step TD Returns/Targets
We know that in Monte Carlo backups the estimate of v_π(S_t) is updated in the direction of the complete return:

    G_t ≐ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ··· + γ^{T−t−1} R_T,

where T is the last time step of the episode. Let us call this quantity the target of the update. Whereas in Monte Carlo updates the target is the return, in one-step updates the target is the first reward plus the discounted estimated value of the next state, which we call the one-step return:

    G_{t:t+1} ≐ R_{t+1} + γ V_t(S_{t+1}),

where V_t : S → ℝ is the estimate at time t of v_π. The subscripts on G_{t:t+1} indicate that it is a truncated return for time t, using rewards up until time t+1, with the discounted estimate γ V_t(S_{t+1}) taking the place of the other terms γ R_{t+2} + γ² R_{t+3} + ··· + γ^{T−t−1} R_T of the full return, as discussed in the previous chapter. Our point now is that this idea makes just as much sense after two steps as it does after one. The target for a two-step update is the two-step return:

    G_{t:t+2} ≐ R_{t+1} + γ R_{t+2} + γ² V_{t+1}(S_{t+2}),

where now γ² V_{t+1}(S_{t+2}) corrects for the absence of the terms γ² R_{t+3} + γ³ R_{t+4} + ··· + γ^{T−t−1} R_T. Similarly, the target for an arbitrary n-step update is the n-step return:

    G_{t:t+n} ≐ R_{t+1} + γ R_{t+2} + ··· + γ^{n−1} R_{t+n} + γ^n V_{t+n−1}(S_{t+n}),

for all n, t such that n ≥ 1 and 0 ≤ t < T − n. All n-step returns can be considered approximations to the full return, truncated after n steps and then corrected for the remaining missing terms by V_{t+n−1}(S_{t+n}). If t + n ≥ T (if the n-step return extends to or beyond termination), then all the missing terms are taken as zero, and the n-step return is defined to be equal to the ordinary full return (G_{t:t+n} ≐ G_t if t + n ≥ T).
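As one concrete reading of these definitions, here is a minimal Python sketch that computes G_{t:t+n} from a stored trajectory; the array layout (rewards[i] holding R_i, states[i] holding S_i) is only an assumed convention for illustration.

def n_step_return(rewards, states, V, t, n, gamma, T):
    """Compute the n-step return G_{t:t+n}.

    rewards[i] holds R_i (valid for i = 1..T), states[i] holds S_i (i = 0..T),
    and V maps a state to its current estimated value. If t + n >= T there is
    nothing to bootstrap from, so the result is the full Monte Carlo return G_t.
    """
    G = 0.0
    last = min(t + n, T)
    for i in range(t + 1, last + 1):
        G += gamma ** (i - t - 1) * rewards[i]   # discounted rewards R_{t+1}..R_{last}
    if t + n < T:                                # truncated early: correct with V
        G += gamma ** n * V[states[t + n]]
    return G

With n = 1 this is exactly the one-step TD target above; with n ≥ T − t it falls back to the complete return G_t.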
Forward View of TD(λ)

Look forward from each state to determine update from future states and rewards:

[Diagram: from state S_t, looking forward along the trajectory through R_{t+1}, S_{t+1}, R_{t+2}, S_{t+2}, R_{t+3}, S_{t+3}, ..., up to R_T, with time running to the right.]
n-step TD

Recall the n-step return:

    G_{t:t+n} ≐ R_{t+1} + γ R_{t+2} + ··· + γ^{n−1} R_{t+n} + γ^n V_{t+n−1}(S_{t+n}),

with G_{t:t+n} ≐ G_t if t + n ≥ T. Note that n-step returns for n > 1 involve future rewards and states that are not available at the time of the transition from t to t+1. No real algorithm can use the n-step return until after it has seen R_{t+n} and computed V_{t+n−1}; the first time these are available is t + n. The natural algorithm for using n-step returns is thus to wait until then:

    V_{t+n}(S_t) ≐ V_{t+n−1}(S_t) + α [ G_{t:t+n} − V_{t+n−1}(S_t) ],    0 ≤ t < T,

while the values of all other states remain unchanged: V_{t+n}(s) = V_{t+n−1}(s) for all s ≠ S_t. This algorithm is called n-step TD. Note that no changes at all are made during the first n − 1 steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.

n-step TD for estimating V ≈ v_π

Initialize V(s) arbitrarily, for all s ∈ S
Parameters: step size α ∈ (0, 1], a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n

Repeat (for each episode):
    Initialize and store S_0 ≠ terminal
    T ← ∞
    For t = 0, 1, 2, ...:
    |   If t < T, then:
    |       Take an action according to π(·|S_t)
    |       Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
    |       If S_{t+1} is terminal, then T ← t + 1
    |   τ ← t − n + 1    (τ is the time whose state's estimate is being updated)
    |   If τ ≥ 0:
    |       G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
    |       If τ + n < T, then G ← G + γ^n V(S_{τ+n})        (G_{τ:τ+n})
    |       V(S_τ) ← V(S_τ) + α [ G − V(S_τ) ]
    Until τ = T − 1
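A direct Python sketch of this box, bundled with a small random-walk environment (like the one on the next slide) so it actually runs end to end. The environment interface (reset() and step() returning (next_state, reward, done)) and the equiprobable random behaviour baked into step() are illustrative assumptions, not something the slides prescribe.

import numpy as np

def n_step_td(env, n, alpha, gamma, num_episodes, num_states):
    """n-step TD prediction; the policy is whatever env.step samples."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]              # S_0
        rewards = [0.0]                     # dummy so rewards[t+1] holds R_{t+1}
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(states[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                 # time whose state estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]   # bootstrap: G_{tau:tau+n}
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

class RandomWalk:
    """Random walk in the style of Figure 6.5: states 0..n-1, start in the
    middle, step left/right with equal probability, reward 1 only on exiting
    to the right."""
    def __init__(self, n_states=5):
        self.n = n_states
    def reset(self):
        return self.n // 2
    def step(self, s):
        s_next = s + np.random.choice([-1, 1])
        if s_next < 0:
            return s, 0.0, True             # left terminal, reward 0
        if s_next >= self.n:
            return s, 1.0, True             # right terminal, reward 1
        return s_next, 0.0, False

# Example: 2-step TD on the A-E walk; true values are 1/6, 2/6, ..., 5/6
V = n_step_td(RandomWalk(5), n=2, alpha=0.1, gamma=1.0,
              num_episodes=2000, num_states=5)
print(np.round(V, 2))

Changing n=2 to n=3 answers the "How about 3-step TD?" question on the next slide empirically.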
Random Walk Examples

[Diagram: a row of states A B C D E with terminal states at each end; episodes start in C; every transition has reward 0 except the transition from E into the right terminal state, which has reward 1.]
Figure 6.5: A small Markov process for generating random walks


How does 2-step TD work here?
How about 3-step TD?
In other words, which method learns faster? Which makes the more efficient use of limited data? At the current time this is an open question, in the sense that no one has been able to prove mathematically that one method converges faster than the other. In fact, it is not even clear what is the most appropriate formal way to phrase this question! In practice, however, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.
A Larger Example – 19-state Random Walk

[Plot: n-step TD results on the 19-state random walk: average RMS error over the 19 states and the first 10 episodes, plotted as a function of α for n = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512.]
An intermediate α is best
Figure 7.2: Performance of n-step TD methods as a function of α, for various values of n, on a 19-state random walk task (Example 7.1).

An intermediate n is best
Do you think there is an optimal n? For every task?
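A small experiment sketch in the spirit of Figure 7.2, reusing the n_step_td and RandomWalk sketches from the code block earlier (here with 19 states). For brevity it measures RMS error after 10 episodes rather than averaging over the first 10, and it uses the 0/1-reward walk, whose true values are i/20 for the i-th state counting from 1.

import numpy as np

true_v = np.arange(1, 20) / 20.0                     # true values of states 1..19

def rms_error(n, alpha, runs=50):
    errs = []
    for run in range(runs):
        np.random.seed(run)                          # RandomWalk samples via np.random
        V = n_step_td(RandomWalk(19), n=n, alpha=alpha, gamma=1.0,
                      num_episodes=10, num_states=19)
        errs.append(np.sqrt(np.mean((V - true_v) ** 2)))
    return float(np.mean(errs))

for n in [1, 2, 4, 8, 16]:                           # coarse grid, just to compare
    err, best_alpha = min((rms_error(n, a), a) for a in [0.1, 0.2, 0.4, 0.8])
    print(f"n={n:2d}  best alpha={best_alpha:.1f}  RMS error={err:.3f}")

Sweeping a finer grid (and averaging the error over the first 10 episodes, as in the figure) should reproduce the pattern in which an intermediate n with a suitably chosen α does best.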
Conclusions Regarding n-step Methods (so far)
Generalize Temporal-Difference and Monte Carlo learning
methods, sliding from one to the other as n increases
n = 1 is TD as in Chapter 6
n = ∞ is MC as in Chapter 5
an intermediate n is often much better than either extreme
applicable to both continuing and episodic problems
There is some cost in computation
need to remember the last n states
learning is delayed by n steps
per-step computation is small and uniform, like TD
Everything generalizes nicely: error-reduction theory, Sarsa, off-
policy by importance sampling, Expected Sarsa, Tree Backup
The very general n-step Q(𝜎) algorithm includes everything!
Exercise (programming): with an n-step method the value estimates change from step to step, so an algorithm that used the sum of TD errors (see previous exercise) in place of the error in (7.2) would actually be a slightly different algorithm. Would it be a better algorithm or a worse one? Devise and program a small experiment to answer this question empirically.
Error-reduction property
The n-step return uses the value function V_{t+n−1} to correct for the missing rewards beyond R_{t+n}. An important property of n-step returns is that their expectation is guaranteed to be a better estimate of v_π than V_{t+n−1} is, in a worst-state sense. That is, the worst error of the expected n-step return is guaranteed to be less than or equal to γ^n times the worst error under V_{t+n−1}:

    max_s | E_π[ G_{t:t+n} | S_t = s ] − v_π(s) |  ≤  γ^n max_s | V_{t+n−1}(s) − v_π(s) |,

for all n ≥ 1 (maximum error using the n-step return on the left, maximum error using V on the right). This is called the error reduction property of n-step returns. Because of the error reduction property, one can show formally that all n-step TD methods converge to the correct predictions under appropriate technical conditions. The n-step TD methods thus form a family of sound methods, with one-step TD methods and Monte Carlo methods as extreme members.
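A quick sketch of where the γ^n factor comes from (not in the slides; the standard argument, treating V_{t+n−1} as fixed). By the n-step Bellman equation for v_π,

    v_π(s) = E_π[ R_{t+1} + γ R_{t+2} + ··· + γ^{n−1} R_{t+n} + γ^n v_π(S_{t+n}) | S_t = s ],

so subtracting this from the expected n-step return cancels all the reward terms and leaves only the bootstrap terms:

    E_π[ G_{t:t+n} | S_t = s ] − v_π(s) = γ^n E_π[ V_{t+n−1}(S_{t+n}) − v_π(S_{t+n}) | S_t = s ],

whose magnitude is at most γ^n max_{s'} | V_{t+n−1}(s') − v_π(s') |, which gives the bound above.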
It’s much the same for action values

[Figure 7.3: The spectrum of n-step backups for state–action values. They range from one-step Sarsa (Sarsa(0)) through 2-step, 3-step, and n-step Sarsa up to ∞-step Sarsa (Monte Carlo), with n-step Expected Sarsa at the far right.]

On-policy n-step Action-value Methods

The main idea is to simply switch states for actions (state–action pairs) and then use an ε-greedy policy. The backup diagrams for n-step Sarsa (shown in Figure 7.3), like those of n-step TD (Figure 7.1), are strings of alternating states and actions, except that the Sarsa ones all start and end with an action rather than a state. The original Sarsa of the previous chapter we henceforth call one-step Sarsa, or Sarsa(0). We redefine the n-step returns (update targets) in terms of estimated action values:

    G_{t:t+n} ≐ R_{t+1} + γ R_{t+2} + ··· + γ^{n−1} R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n}),    n ≥ 1, 0 ≤ t < T − n,

with G_{t:t+n} ≐ G_t if t + n ≥ T. The natural algorithm, which we call n-step Sarsa, is then

    Q_{t+n}(S_t, A_t) ≐ Q_{t+n−1}(S_t, A_t) + α [ G_{t:t+n} − Q_{t+n−1}(S_t, A_t) ],    0 ≤ t < T,

while the values of all other states and actions remain unchanged: Q_{t+n}(s, a) = Q_{t+n−1}(s, a) for all s, a such that s ≠ S_t or a ≠ A_t. Complete pseudocode is given in the n-step Sarsa box, and an example of why n-step methods can speed up learning compared to one-step methods is given in Figure 7.4.

What about Expected Sarsa? The backup diagram for the n-step version of Expected Sarsa is shown on the far right in Figure 7.3. It consists of a linear string of sample actions and states, just as in n-step Sarsa, except that its last element is a branch over all action possibilities weighted, as always, by their probability under π. n-step Expected Sarsa is the same update as n-step Sarsa, with a slightly different n-step return:

    G_{t:t+n} ≐ R_{t+1} + γ R_{t+2} + ··· + γ^{n−1} R_{t+n} + γ^n Σ_a π(a|S_{t+n}) Q_{t+n−1}(S_{t+n}, a),

for all n and t such that n ≥ 1 and 0 ≤ t ≤ T − n.
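A compact Python sketch of on-policy n-step Sarsa with an ε-greedy policy, as one concrete reading of the updates above. The tabular environment interface (env.reset(), env.step(state, action) returning (next_state, reward, done), plus env.num_states and env.num_actions) is an assumed convention, not something fixed by the slides.

import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Sample an action epsilon-greedily with respect to Q[s]."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q[s])))
    return int(np.argmax(Q[s]))

def n_step_sarsa(env, n, alpha, gamma, epsilon, num_episodes, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        S = [env.reset()]
        A = [epsilon_greedy(Q, S[0], epsilon, rng)]
        R = [0.0]                       # dummy so R[t+1] holds R_{t+1}
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(S[t], A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(epsilon_greedy(Q, s_next, epsilon, rng))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[S[tau + n], A[tau + n]]   # G_{tau:tau+n}
                Q[S[tau], A[tau]] += alpha * (G - Q[S[tau], A[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q

Because action selection is always ε-greedy with respect to the current Q, this matches the "ensure that π(·|S_τ) is ε-greedy wrt Q" line of the pseudocode.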


n-step Sarsa update steps (from the full pseudocode box):

| If τ ≥ 0:
|     G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
|     If τ + n < T, then G ← G + γ^n Q(S_{τ+n}, A_{τ+n})        (G_{τ:τ+n})
|     Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α [ G − Q(S_τ, A_τ) ]
|     If π is being learned, then ensure that π(·|S_τ) is ε-greedy wrt Q
Until τ = T − 1

n-step Methods Can Accelerate Learning

[Figure 7.4 panels: the path taken; action values increased by one-step Sarsa; action values increased by 10-step Sarsa. G marks the high-reward goal location.]

Figure 7.4: Gridworld example of the speedup of policy learning due to the use of n-step methods. The first panel shows the path taken by an agent in a single episode, ending at a location of high reward, marked by the G. In this example the values were all initially 0, and all rewards were zero except for a positive reward at G. The arrows in the other two panels show which action values were strengthened as a result of this path by one-step and n-step Sarsa methods. The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last n actions of the sequence, so that much more is learned from the one episode.
Off-policy n-step Methods by Importance Sampling

Recall that off-policy learning is learning the value function for one policy, π, while following another policy, b. Often, π is the greedy policy for the current action-value-function estimate, and b is a more exploratory policy, perhaps ε-greedy. In order to use the data from b we must take into account the difference between the two policies, using their relative probability of taking the actions that were taken (see Section 5.5). In n-step methods, returns are constructed over n steps, so we are interested in the relative probability of just those n actions.

Recall the importance-sampling ratio, the relative probability under the two policies of taking the n actions from A_t to A_{t+n−1} (cf. Eq. 5.3):

    ρ_{t:h} ≐ Π_{k=t}^{min(h, T−1)} π(A_k|S_k) / b(A_k|S_k).

For example, if any one of the actions would never be taken by π (i.e., π(A_k|S_k) = 0), then the n-step return should be given zero weight and be totally ignored. On the other hand, if an action is taken that π would take with much greater probability than b does, then this will increase the weight that would otherwise be given to the return. This makes sense, because that action is characteristic of π (and therefore we want to learn about it) but is selected only rarely by b and thus rarely appears in the data; to make up for this we have to over-weight it when it does occur.

We get off-policy methods by weighting updates by this ratio. For n-step TD, the update for time t (actually made at time t + n) can simply be weighted by ρ_{t:t+n−1}:

Off-policy n-step TD:

    V_{t+n}(S_t) ≐ V_{t+n−1}(S_t) + α ρ_{t:t+n−1} [ G_{t:t+n} − V_{t+n−1}(S_t) ],    0 ≤ t < T.

Note that if the two policies are actually the same (the on-policy case), then the importance sampling ratio is always 1, so this new update generalizes and can completely replace our earlier n-step TD update. Similarly, our previous n-step Sarsa update can be completely replaced by a simple off-policy form:

Off-policy n-step Sarsa:

    Q_{t+n}(S_t, A_t) ≐ Q_{t+n−1}(S_t, A_t) + α ρ_{t+1:t+n−1} [ G_{t:t+n} − Q_{t+n−1}(S_t, A_t) ],    0 ≤ t < T.

Note that the importance sampling ratio here starts one step later than for n-step TD above. This is because here we are updating a state–action pair: we do not have to care how likely we were to select the action; now that we have selected it, we want to learn fully from what happens, with importance sampling only for the subsequent actions.

Off-policy n-step Expected Sarsa: just like the above, except with the expected n-step return as the target and with the importance-sampling ratio ending one step earlier, since the expectation over actions at time t + n takes care of the final step.
Off-policy Learning w/o Importance Sampling:
The n-step Tree Backup Algorithm
[Backup diagrams: 1-step Tree Backup (identical to Expected Sarsa), 2-step Tree Backup, and 3-step Tree Backup. Each starts from S_t, A_t and follows the sampled transition through R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, ...; at each intermediate state the actions not taken dangle off to the side, weighted by π.]

1-step target (Expected Sarsa):

    R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a)

2-step Tree Backup target:

    R_{t+1} + γ Σ_{a ≠ A_{t+1}} π(a|S_{t+1}) Q(S_{t+1}, a)
           + γ π(A_{t+1}|S_{t+1}) ( R_{t+2} + γ Σ_{a'} π(a'|S_{t+2}) Q(S_{t+2}, a') )
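Reading the diagrams as a recursion gives a compact way to compute the target. A hedged Python sketch follows; the argument layout (lists of sampled rewards/states/actions, a Q table, and a policy matrix pi[s, a]) is an assumed convention, not from the slides.

import numpy as np

def tree_backup_target(rewards, states, actions, Q, pi, t, n, gamma, T):
    """Recursive n-step tree-backup target G_{t:t+n}.

    rewards[i] = R_i, states[i] = S_i, actions[i] = A_i (actions only exist
    for i < T). Q is a |S| x |A| table, pi[s, a] the target policy's
    probabilities. At each intermediate state the actions *not* taken
    contribute their current estimates, weighted by pi; the action actually
    taken is followed one more step.
    """
    if t + 1 >= T:                       # last reward of the episode: no bootstrap
        return rewards[t + 1]
    s_next = states[t + 1]
    expected = np.dot(pi[s_next], Q[s_next])        # sum_a pi(a|s') Q(s', a)
    if n == 1:
        # base case: 1-step tree backup is exactly the Expected Sarsa target
        return rewards[t + 1] + gamma * expected
    a_next = actions[t + 1]
    # expectation over the untaken actions + follow the taken action deeper
    untaken = expected - pi[s_next, a_next] * Q[s_next, a_next]
    deeper = tree_backup_target(rewards, states, actions, Q, pi,
                                t + 1, n - 1, gamma, T)
    return rewards[t + 1] + gamma * (untaken + pi[s_next, a_next] * deeper)

With n = 1 the recursion bottoms out in the Expected Sarsa target; deeper n keeps following the action actually taken while backing up the untaken actions from their current estimates, which is what lets tree backup be off-policy without importance sampling.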
On each step one can either sample the action actually taken (as in Sarsa) or take a pure expectation over actions with no sampling (as in the tree backup). The random variable σ_t might be set as a function of the state, action, or state–action pair at time t, with σ_t = 1 denoting full sampling and σ_t = 0 denoting a pure expectation. We call this proposed new algorithm n-step Q(σ).
A Unifying Algorithm: n-step Q(𝜎)
[Backup diagrams: 4-step Sarsa, 4-step Tree backup, 4-step Expected Sarsa, and 4-step Q(σ). In the Q(σ) diagram each step is marked σ = 1 (sample; a ρ appears when off-policy) or σ = 0 (expectation).]

Choose whether to sample or take the expectation on each step, with σ(s).

Figure 7.5: The three kinds of n-step action-value backups considered so far in this chapter (4-step case), plus a fourth kind of backup that unifies them all. The 'ρ's indicate the transitions where importance sampling is needed in the off-policy case.
Conclusions Regarding n-step Methods
Generalize Temporal-Difference and Monte Carlo learning
methods, sliding from one to the other as n increases
n = 1 is TD as in Chapter 6
n = ∞ is MC as in Chapter 5
an intermediate n is often much better than either extreme
applicable to both continuing and episodic problems
There is some cost in computation
need to remember the last n states
learning is delayed by n steps
per-step computation is small and uniform, like TD
Everything generalizes nicely: error-reduction theory, Sarsa, off-
policy by importance sampling, Expected Sarsa, Tree Backup
The very general n-step Q(𝜎) algorithm includes everything!
