
Reinforcement Learning

Temporal Difference Learning

Temporal difference learning, TD prediction, Q-learning, eligibility traces.


(many slides from Marc Toussaint & David Silver)

Hung Ngo & Vien Ngo


MLR Lab, University of Stuttgart
Outline
Monte-Carlo Learning

Temporal-Difference (TD) Learning

SARSA, Q-learning

Eligibility Traces

2/50
Quick Review: Bellman (Optimality) Equations

Bellman equation for a fixed policy π:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· | s_t = s; π }
       = E{ r_{t+1} | s_t = s; π } + γ E{ r_{t+2} + γ r_{t+3} + ··· | s_t = s; π }
       = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) E{ r_{t+2} + γ r_{t+3} + ··· | s_{t+1} = s' }
       = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')

Matrix form:  V^π = R^π + γ P^π V^π   ⟹   V^π = (I − γ P^π)^{−1} R^π

Bellman optimality equation:

V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
V*(s) = Q*(π*(s), s)

Dynamic Programming planning methods: VI, QI, PI, async. DP, etc.
3/50
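To connect these equations to the DP planning methods listed above, here is a minimal tabular value-iteration sketch (not from the slides), assuming the known MDP is given as NumPy arrays P[s, a, s'] and R[s, a]; the array names, shapes, and tolerance are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration on a known MDP.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns an approximation of V* and the corresponding greedy policy.
    """
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|a,s) V(s')
        Q = R + gamma * (P @ V)              # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # greedy policy pi*(s) = argmax_a Q(s, a)
        V = V_new
```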
Learning in MDPs

Assume unknown MDP ⟨S, A, P, R, γ⟩ (unknown P, R).
Accumulating experience while interacting with the world:

D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H

What could the RL agent learn from the data?

learn to predict the next state: P(s' | s, a) (or P^a_{ss'})
learn to predict the immediate reward: P(r | s, a, s') (or R^a_{ss'})
learn to predict value: (s, a) ↦ Q(s, a)
learn to predict actions: π(s, a) (or π(a|s))
learn to control: π*(s)
4/50
Introduction to Model-Free Methods

Monte-Carlo prediction/control methods

Temporal Difference learning (prediction)

On-policy SARSA, off-policy Q-Learning (control) algorithms

Behavior policy (used to act) can be either the same as (on-policy) or
different from (off-policy) the
estimation policy (the one being evaluated; prediction problem), or the
target policy (greedy policy being evaluated and improved; control problem).

7/50
Monte-Carlo Policy Evaluation
MC policy evaluation: first-visit and every-visit methods

Algorithm 1 MC-PE: policy evaluation of policy π

1: Given π; a set of returns U(s) = ∅ for each s ∈ S
2: while (!converged) do
3:   Generate an episode ξ = (s_0, a_0, r_1, . . . , s_{T−1}, a_{T−1}, r_T, s_T) using π
4:   for each state s_t in ξ do
5:     Compute the return of s_t:  R_t = R(s_t) = Σ_{t'=t}^{T−1} γ^{t'−t} r_{t'+1}
6:     Either:
         U(s_t) ← U(s_t) ∪ {R(s_t)} once per episode (first-visit), or
         U(s_t) ← U(s_t) ∪ {R(s_t)} at every occurrence (every-visit)
7: Return V^π(s) = average U(s)

Converges! as the number of visits to all s goes to infinity.

(Introduction to RL, Sutton & Barto 1998)
8/50
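A minimal first-visit MC policy-evaluation sketch (illustrative, not from the slides), assuming each episode generated by π is given as a list of (state, reward) pairs and that the sampled returns U(s) are kept in a dict of lists.

```python
from collections import defaultdict

def mc_first_visit(episodes, gamma=0.95):
    """First-visit Monte-Carlo policy evaluation.

    episodes: iterable of episodes, each a list of (s_t, r_{t+1}) pairs.
    Returns V(s) as the average of the first-visit returns U(s).
    """
    U = defaultdict(list)                       # sampled returns per state
    for episode in episodes:
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)        # first occurrence of s
        G = 0.0
        for t in reversed(range(len(episode))): # accumulate returns backwards
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:             # first-visit condition
                U[s].append(G)
    return {s: sum(u) / len(u) for s, u in U.items()}
```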
On-Policy MC Control Algorithm

Algorithm 2 On-policy MC Control Algorithm

1: Init an initial policy π_0.
2: while (!converged) do
3:   Policy Evaluation:       π_k → Q^{π_k}(s, a), ∀s, a
4:   Policy Improvement (greedy action selection):
       π_{k+1}(s) = argmax_a Q^{π_k}(s, a), ∀s

Generalized policy iteration framework

Need sufficient exploration!!! exploring starts, ε-greedy policy, etc.
9/50
ε-Greedy Policy
The ε-greedy policy:
greedy: a* = argmax_{a'∈A} Q(s, a') with probability 1 − ε
randomized exploration: a = rand(|A|) with probability ε

Construct a new ε-greedy policy π_{k+1} upon the value functions Q^{π_k}(s, a) of the previous policy: Policy improvement!

Q^{π_k}(s, π_{k+1}(s)) = Σ_a π_{k+1}(a|s) Q^{π_k}(s, a)
  = (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) max_a Q^{π_k}(s, a)
  ≥ (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) Σ_a [ (π_k(a|s) − ε/|A|) / (1 − ε) ] Q^{π_k}(s, a)
  = Σ_a π_k(a|s) Q^{π_k}(s, a) = V^{π_k}(s)

Therefore V^{π_{k+1}}(s) ≥ V^{π_k}(s)
10/50
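For concreteness, an ε-greedy action-selection sketch, assuming Q is a NumPy array of shape (num_states, num_actions); the function name and signature are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=np.random.default_rng()):
    """Greedy action w.p. 1 - epsilon, uniformly random action w.p. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # randomized exploration
    return int(np.argmax(Q[s]))                # greedy action
```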
MC Control Algorithm with ε-Greedy
Choose actions using ε-greedy.
Converges! if ε_k decreases to zero through time (GLIE), e.g. ε_k = 1/k.
A policy is GLIE (Greedy in the Limit with Infinite Exploration) if
all state-action pairs are visited infinitely often, and
in the limit (k → ∞) the policy becomes greedy:

lim_{k→∞} π_k(a|s) = δ_a( argmax_{a'} Q_k(s, a') )

where δ is a Dirac function.
11/50
Temporal difference (TD) learning
TD Prediction (TD(0))
On-policy TD Control (SARSA Algorithm)
Off-policy TD Control (Q-Learning)
Eligibility Traces (TD(λ))

12/50
Temporal difference (TD) Prediction

Recall  DP:  V^π(s) = E[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ],
        MC:  V(s) ← V(s) + α [ R(s) − V(s) ],   α = 1 / n_{R(s)}

TD learning: Given (s, a, r, s'), how to update V_new(s) ← V_old(s)?

V_new(s) = (1 − α) V_old(s) + α [ r + γ V_old(s') ]        (TD target in brackets)
         = V_old(s) + α [ r + γ V_old(s') − V_old(s) ]     (TD error in brackets)

Reinforcement:
more reward than expected: r + γ V_old(s') > V_old(s)  ⇒  increase V(s)
less reward than expected: r + γ V_old(s') < V_old(s)  ⇒  decrease V(s)

DP & TD bootstrap: learn a guess V̂(s) from a guess V̂(s').
13/50
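A one-function sketch of this tabular TD(0) update for a single transition, assuming V is a dict or array of value estimates; the names and the terminal-state handling are illustrative assumptions.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """One TD(0) backup: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    target = r if terminal else r + gamma * V[s_next]   # TD target
    td_error = target - V[s]                            # TD error
    V[s] += alpha * td_error
    return td_error
```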
TD Prediction vs. MC Prediction

Figure: TD backup diagram. TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.

Figure: MC backup diagrams.

14/50


TD Prediction vs. MC Prediction

TD:
TD can learn before the termination of episodes.
TD can be used for both non-episodic and episodic tasks.
The update depends on a single stochastic transition ⇒ lower variance.
Updates use bootstrapping ⇒ the estimate has some bias.
TD updates exploit the Markov property.

MC:
MC learning must wait until the end of episodes.
MC only works for episodic tasks.
The update depends on a sequence of many stochastic transitions ⇒ much larger variance.
Unbiased estimate.
MC updates do not exploit the Markov property, hence MC can be effective in non-Markovian environments.
15/50
TD vs. MC: A Random Walk Example

The value of each state is a prediction: the probability of terminating on the right if starting from that state.

16/50
On-Policy TD Control: SARSA

Figure: Learning on tuple (s, a, r, s', a'): SARSA

Q-value update:

Q_{t+1}(s, a) = Q_t(s, a) + α [ r_t + γ Q_t(s', a') − Q_t(s, a) ]
18/50
On-Policy TD Control: SARSA

Algorithm 3 SARSA Algorithm

1:  Init Q(s, a) arbitrarily ∀s ∈ S, a ∈ A, and Q(s*, ·) = 0 for all terminal states s*.
2:  while (!converged) do
3:    Init a starting state s
4:    Select an action a from s using a policy π derived from Q (e.g. ε-greedy)
5:    /* Execute one episode */
6:    while (!episode-terminated) do
7:      Execute a, observe r, s'
8:      Derive π from Q, then select action a' from s'
9:      Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t [ r + γ Q_t(s', a') − Q_t(s, a) ]
10:     s ← s'; a ← a'   (behavior policy = target policy)
19/50
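A compact SARSA episode loop as a sketch, assuming a Gym-style environment where `env.reset()` returns a state index and `env.step(a)` returns `(s_next, r, done)`, and a tabular NumPy array Q; the interface and hyperparameters are assumptions for illustration.

```python
import numpy as np

def sarsa_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1,
                  rng=np.random.default_rng()):
    """One episode of on-policy SARSA; updates Q in place."""
    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)
        # On-policy target: bootstrap with the action a' actually chosen by the behavior policy.
        target = r if done else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
```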
The Q values of SARSA converge w.p.1 to the optimal values as long as

the learning rates satisfy the stochastic approximation conditions
    Σ_t α_t(s, a) = ∞,   Σ_t α_t²(s, a) < ∞

the policies π_t(s, a) derived from Q_t(s, a) are GLIE policies.

GLIE: Greedy in the Limit with Infinite Exploration
20/50
SARSA on Windy Gridworld Example

Reward is −1.0 on every step until the goal state (cell G) is reached.

Each move is shifted upward by the wind; the wind strength (number of cells shifted upward) is given below each column.

21/50
The y-axis shows the accumulated number of times the goal was reached.

22/50
Off-Policy TD Control: Q-Learning

Off-Policy MC?

Importance sampling to estimate the following expectation of returns:

E_{τ∼P_π(τ)}[ φ(τ) ] = ∫ φ(τ) P_π(τ) dτ
                     = ∫ φ(τ) (P_π(τ) / P_μ(τ)) P_μ(τ) dτ
                     = E_{τ∼P_μ(τ)}[ (P_π(τ) / P_μ(τ)) φ(τ) ]
                     ≈ (1/N) Σ_{i=1}^N (P_π(τ_i) / P_μ(τ_i)) φ(τ_i)

Denote by P_π(τ) the trajectory distribution under policy π, where τ = {s_0, a_0, s_1, a_1, ···}:

P_π(τ) = P_0(s_0) Π_t P(s_{t+1} | s_t, a_t) π(a_t | s_t)
24/50
Assume in MC control that the behavior policy used to generate the data is μ(a|s) (i.e. P_μ(τ)).
The target policy is π(a|s) (i.e. P_π(τ)).
Set the importance weights as:

w_t = P_π(τ_t) / P_μ(τ_t) = Π_{i=t}^T π(a_i | s_i) / μ(a_i | s_i)

The MC value update becomes (when observing a return φ_t):

V(s_t) ← V(s_t) + α ( w_t φ_t − V(s_t) )
25/50
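A small sketch of this importance-weighted MC update, assuming the per-step action probabilities of the target policy π and the behavior policy μ along the episode tail are available as lists; names are illustrative.

```python
import math

def off_policy_mc_update(V, s_t, G_t, pi_probs, mu_probs, alpha=0.1):
    """Weight the observed return G_t by w_t = prod_i pi(a_i|s_i) / mu(a_i|s_i)."""
    w_t = math.prod(p / m for p, m in zip(pi_probs, mu_probs))
    V[s_t] += alpha * (w_t * G_t - V[s_t])
```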
Off-Policy TD Control: Q-Learning

Off-Policy TD?

The term r_t + γ V(s_{t+1}) is estimated by importance sampling.

The TD value update becomes (given a transition (s_t, a_t, r_t, s_{t+1})):

V(s_t) ← V(s_t) + α [ (π(a_t|s_t) / μ(a_t|s_t)) (r_t + γ V(s_{t+1})) − V(s_t) ]
26/50
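And the corresponding one-step importance-weighted TD update as a sketch, assuming `pi(a, s)` and `mu(a, s)` return the action probabilities of the target and behavior policies; purely illustrative.

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.95):
    """TD(0) with a per-step importance ratio rho = pi(a|s) / mu(a|s) on the target."""
    rho = pi(a, s) / mu(a, s)
    V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])
```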
Off-Policy TD Control: Q-Learning
Q-learning (Watkins, 1989): Given a new experience (s, a, r, s')

Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
            = Q_old(s, a) + α [ r − Q_old(s, a) + γ max_{a'} Q_old(s', a') ]

Reinforcement:
more reward than expected (r + γ max_{a'} Q_old(s', a') > Q_old(s, a))  ⇒  increase Q(s, a)
less reward than expected (r + γ max_{a'} Q_old(s', a') < Q_old(s, a))  ⇒  decrease Q(s, a)
27/50
Q-Learning

Algorithm 4 Q-Learning Algorithm

1:  Init Q(s, a) arbitrarily ∀s ∈ S, a ∈ A, and Q(s*, ·) = 0 for all terminal states s*.
2:  while (!converged) do
3:    Init a starting state s
4:    /* Execute one episode */
5:    while (!episode-terminated) do
6:      Select an action a from s using a policy derived from Q (e.g. ε-greedy)
7:      Execute a, observe r, s'
8:      Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t [ r + γ max_{a'} Q_t(s', a') − Q_t(s, a) ]
9:      s ← s'

The behavior policy is μ, e.g. ε-greedy w.r.t. Q(s, a).
The target policy is greedy: π(s_t) = argmax_a Q(s_t, a).
⇒ off-policy learning: behavior policy ≠ target policy
28/50
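A compact Q-learning episode loop as a sketch, again assuming a Gym-style `env` whose `step(a)` returns `(s_next, r, done)`; the interface and hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1,
                       rng=np.random.default_rng()):
    """One episode of off-policy Q-learning; updates Q in place."""
    s = env.reset()
    done = False
    while not done:
        # Behavior policy: epsilon-greedy w.r.t. the current Q.
        if rng.random() < epsilon:
            a = int(rng.integers(Q.shape[1]))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Target policy is greedy: bootstrap with max_a' Q(s', a').
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```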
Q-learning convergence with prob. 1
Q-learning is a stochastic approximation of Q-Iteration:

Q-learning:   Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
Q-Iteration:  ∀s, a:  Q_{k+1}(s, a) = R(s, a) + γ Σ_{s'} P(s'|a, s) max_{a'} Q_k(s', a')

We've shown convergence of Q-Iteration to Q*.

Convergence of Q-learning:
Q-Iteration is a deterministic update: Q_{k+1} = T(Q_k)
Q-learning is a stochastic version: Q_{k+1} = (1 − α) Q_k + α [ T(Q_k) + η_k ]
η_k is zero mean!
29/50
Q-learning convergence with prob. 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy

Σ_t α_t(s, a) = ∞,   Σ_t α_t²(s, a) < ∞

(Watkins and Dayan, Q-learning. Machine Learning, 1992)

30/50
Q-learning vs. SARSA: The Cliff Example

31/50
Q-Learning impact
Q-Learning was the first provably convergent direct adaptive optimal
control algorithm

Great impact on the field of Reinforcement Learning


smaller representation than models
automatically focuses attention on where it is needed,
i.e., no sweeps through state space
though it does not solve the exploration versus exploitation issue:
ε-greedy, optimistic initialization, etc.
off-policy control: learning many control policies simultaneously while
following a single behavior policy

32/50
Backup Diagram: DP

V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ] = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Full-width backup: |S| grows exponentially with the number of state variables ⇒ curse of dimensionality.
33/50
Backup Diagram: MC

V(s) ← V(s) + α [ R(s) − V(s) ],   α = 1 / n_{R(s)}

Sample backup: constant complexity, independent of |S|.
34/50
Backup Diagram: TD

V(s_t) ← V(s_t) + α δ_t,  with TD-error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Sample backup: constant complexity, independent of |S|.
35/50
Backup Diagram: TD(λ = 0)

V(s_t) ← V(s_t) + α δ_t,  with TD-error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Sample backup: constant complexity, independent of |S|.
36/50
n-Step TD Update
Temporal Difference: based on a single experience (s_0, r_1, s_1)

V_new(s_0) = V_old(s_0) + α [ r_1 + γ V_old(s_1) − V_old(s_0) ]

Longer sequence of experience? e.g. (s_0, r_1, r_2, r_3, s_3)

Temporal credit assignment, think further backwards: receiving r_3 also tells us something about V(s_0)

V_new(s_0) = V_old(s_0) + α [ r_1 + γ r_2 + γ² r_3 + γ³ V_old(s_3) − V_old(s_0) ]
37/50
n-Step TD Update: Forward View
Let the TD target look n steps into the future:

TD (1-step), 2-step, 3-step, ..., n-step, ..., Monte Carlo

Define the n-step return:

R_t^{(n)} = r_{t+1} + γ r_{t+2} + . . . + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
38/50
n-Step TD Update: Forward View
n-step TD learning:

V_t^{(n)}(s) ← V_t^{(n)}(s) + α [ R_t^{(n)} − V_t^{(n)}(s) ]

where R_t^{(n)} is the n-step return

R_t^{(n)} = r_{t+1} + γ r_{t+2} + . . . + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})

The offline¹ value update up to time T:

V(s) ← V(s) + Σ_{t=0}^{T−1} ΔV_t(s)

Error reduction: max_s | V_t^{(n)}(s) − V^π(s) | ≤ γ^n max_s | V_t(s) − V^π(s) |

Can we derive an efficient online update?

¹ Offline: V_t(s) is constant within an episode, for all s.
39/50
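A small helper computing the n-step return R_t^(n) from a reward slice and the current value estimates; the array layout and names are illustrative assumptions.

```python
def n_step_return(rewards, V, s_end, gamma=0.95, terminal=False):
    """R_t^(n) = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n}).

    rewards: [r_{t+1}, ..., r_{t+n}], s_end: s_{t+n}.
    """
    n = len(rewards)
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    if not terminal:
        G += gamma**n * V[s_end]      # bootstrap from V(s_{t+n})
    return G
```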
TD(λ): Forward View
TD(λ) is the weighted average of n-step returns with different n
λ-decay weights (1 − λ) λ^{n−1}

TD(λ), λ-return:

Look into the future, do an MC evaluation for each n, then average the n-step returns weightedly:

R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}

Figure: weighting of the n-step returns: (1 − λ), (1 − λ)λ, (1 − λ)λ², ..., with the remaining weight λ^{T−t−1} on the final (λ = 1, Monte-Carlo) return.
40/50
TD(λ): Forward View
TD(λ) is the weighted average of n-step returns with different n
λ-decay weights (1 − λ) λ^{n−1}

R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}
41/50
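A sketch of the λ-return as the (1 − λ)λ^{n−1}-weighted combination of the n-step returns of one episode; it assumes the list of n-step returns (e.g. from the `n_step_return` helper above) has already been computed, with the last entry being the full Monte-Carlo return, which receives the remaining weight λ^{T−t−1}.

```python
def lambda_return(n_step_returns, lam=0.9):
    """R_t^lambda = (1 - lam) * sum_n lam^(n-1) R_t^(n), truncated at the episode end."""
    G = sum((1 - lam) * lam**(n - 1) * R
            for n, R in enumerate(n_step_returns[:-1], start=1))
    # The final (Monte-Carlo) return carries all remaining weight lam^(T-t-1).
    G += lam**(len(n_step_returns) - 1) * n_step_returns[-1]
    return G
```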
TD(λ): Forward View

Figure: forward view along a trajectory s_t, s_{t+1}, s_{t+2}, s_{t+3}, ... with rewards r_{t+1}, r_{t+2}, r_{t+3}, ..., r_T.

Update the value function towards the λ-return R_t^λ.

The forward view looks into the future to compute R_t^λ.
Not directly implementable because it is acausal: it uses, at each step, knowledge of what will happen many steps later.
42/50
TD(λ): Backward View

Forward view provides theory.
Backward view provides mechanism (approximating the forward view):
Update online, every step, from incomplete sequences.
Key concept: eligibility traces.
Identical offline updates (proof in Section 7.4 of Sutton & Barto's book).
43/50
TD(λ): Eligibility Traces

Credit assignment problem: did the bell or the light cause the shock?

Frequency heuristic: assign credit to the most frequent states.
Recency heuristic: assign credit to the most recent states.

Eligibility traces combine both heuristics and are updated for all states:

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
44/50
TD(λ): Backward View

Keep an eligibility trace e(s) for every state s.
Update the value V(s) for every state s ⇒ wide-spreading!!!
In proportion to the TD-error δ_t and the eligibility trace e_t(s):

V(s) ← V(s) + α δ_t e_t(s),  with δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Figure: the TD-error δ_t is propagated backwards to the recently visited states s_{t−3}, s_{t−2}, s_{t−1}, s_t in proportion to their eligibility traces e_t(s).
45/50
TD(λ) with Eligibility Traces: Online Tabular

e(s_t) ← e(s_t) + 1
∀s:  V_new(s) = V_old(s) + α e(s) [ r_{t+1} + γ V_old(s_{t+1}) − V_old(s_t) ]
∀s:  e(s) ← γλ e(s)

Initialize V(s) arbitrarily (but set to 0 if s is terminal)
Repeat (for each episode):
    Initialize E(s) = 0, for all s ∈ S
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe reward R and next state S'
        δ ← R + γ V(S') − V(S)
        E(S) ← E(S) + 1                    (accumulating traces)
          or E(S) ← (1 − α) E(S) + 1       (dutch traces)
          or E(S) ← 1                      (replacing traces)
        For all s ∈ S:
            V(s) ← V(s) + α δ E(s)
            E(s) ← γλ E(s)
        S ← S'
    until S is terminal
46/50
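A minimal online tabular TD(λ) sketch (backward view, accumulating traces), assuming a Gym-style `env`, a fixed behavior policy `policy(s)`, and V as a NumPy array over states; all names and the interface are illustrative assumptions.

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.95, lam=0.9):
    """One episode of online tabular TD(lambda) with accumulating eligibility traces."""
    E = np.zeros_like(V)                  # eligibility traces e(s)
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
        E[s] += 1.0                       # accumulating trace
        V += alpha * delta * E            # spread the TD error to all traced states
        E *= gamma * lam                  # decay all traces
        s = s_next
```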
Unified View

47/50
Tabular SARSA(λ)

... gradually based on the approximate values for the current policy. The policy improvement can be done in many different ways, as we have seen throughout this book. For example, the simplest approach is to use the ε-greedy policy with respect to the current action-value estimates. Figure 7.13 shows the complete Sarsa(λ) algorithm for this case.

Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        δ ← R + γ Q(S', A') − Q(S, A)
        E(S, A) ← E(S, A) + 1                  (accumulating traces)
          or E(S, A) ← (1 − α) E(S, A) + 1     (dutch traces)
          or E(S, A) ← 1                       (replacing traces)
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + α δ E(s, a)
            E(s, a) ← γλ E(s, a)
        S ← S'; A ← A'
    until S is terminal

Figure 7.13: Tabular Sarsa(λ).

Remember to initialize Q(s*, ·) = 0 for all terminal states s*!
48/50
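For comparison with the TD(λ) sketch above, a single backward-view Sarsa(λ) step with replacing traces, assuming tabular NumPy arrays Q and E of shape (num_states, num_actions); illustrative only.

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, done,
                      alpha=0.1, gamma=0.95, lam=0.9):
    """One backward-view Sarsa(lambda) update with a replacing trace; updates Q, E in place."""
    delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
    E[s, a] = 1.0                 # replacing trace for the visited (s, a)
    Q += alpha * delta * E        # update all traced state-action pairs
    E *= gamma * lam              # decay all traces
```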
Tabular SARSA(λ): Example

Left: A single trajectory to the goal (*). All rewards were zero except for a positive reward at the goal location.
Middle: A single action value was strengthened as a result of this path by one-step Sarsa.
Right: Many action values were strengthened as a result of this path by Sarsa(λ).

49/50
Tabular Q(λ)
n-step return: r_{t+1} + γ r_{t+2} + . . . + γ^{n−1} r_{t+n} + γ^n max_a Q_t(s_{t+n}, a)

Q(λ) algorithm by Watkins:

Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        A* ← argmax_a Q(S', a)   (if A' ties for the max, then A* ← A')
        δ ← R + γ Q(S', A*) − Q(S, A)
        E(S, A) ← E(S, A) + 1                  (accumulating traces)
          or E(S, A) ← (1 − α) E(S, A) + 1     (dutch traces)
          or E(S, A) ← 1                       (replacing traces)
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + α δ E(s, a)
            If A' = A*, then E(s, a) ← γλ E(s, a)
            else E(s, a) ← 0
        S ← S'; A ← A'
    until S is terminal

Figure 7.16: Tabular version of Watkins's Q(λ) algorithm.

Remember to initialize Q(s*, ·) = 0 for all terminal states s*!
50/50
