
Reinforcement Learning

Temporal Difference Learning

Temporal difference learning, TD prediction, Q-learning, eligibility traces.


(many slides from Marc Toussaint & David Silver)

Hung Ngo & Vien Ngo


MLR Lab, University of Stuttgart
Outline
Monte-Carlo Learning

Temporal-Difference (TD) Learning

SARSA, Q-learning

Eligibility Traces

2/50
Quick Review: Bellman (Optimality) Equations

Bellman equation for a fixed policy π:

V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ··· | s_t = s; π }
       = E{ r_{t+1} | s_t = s; π } + γ E{ r_{t+2} + γ r_{t+3} + ··· | s_t = s; π }
       = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) E{ r_{t+2} + γ r_{t+3} + ··· | s_{t+1} = s' }
       = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')

Matrix form:  V^π = R^π + γ P^π V^π   ⟹   V^π = (I − γ P^π)^{−1} R^π

Bellman optimality equation:

V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
V*(s) = Q*(π*(s), s)

Dynamic Programming planning methods: VI, QI, PI, async. DP, etc.
3/50
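To connect these equations to the DP planning methods listed above, here is a minimal tabular value-iteration sketch (not from the slides), assuming the known MDP is given as NumPy arrays P[s, a, s'] and R[s, a]; the array names, shapes, and tolerance are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Tabular value iteration on a known MDP.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns an approximation of V* and the corresponding greedy policy.
    """
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Bellman optimality backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|a,s) V(s')
        Q = R + gamma * (P @ V)              # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # greedy policy pi*(s) = argmax_a Q(s, a)
        V = V_new
```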
Learning in MDPs

Assume unknown MDP ⟨S, A, P, R, γ⟩ (unknown P, R).
Accumulating experience while interacting with the world:

D = {(s_t, a_t, r_{t+1}, s_{t+1})}_{t=1}^H

What could the RL agent learn from the data?

learn to predict the next state: P(s' | s, a) (or P^a_{ss'})
learn to predict the immediate reward: P(r | s, a, s') (or R^a_{ss'})
learn to predict value: (s, a) ↦ Q(s, a)
learn to predict actions: π(s, a) (or π(a|s))
learn to control: π*(s)
4/50
Introduction to Model-Free Methods

Monte-Carlo prediction/control methods

Temporal Difference learning (prediction)

On-policy SARSA, off-policy Q-Learning (control) algorithms

Behavior policy (used to act) can be either the same as (on-policy) or
different from (off-policy) the
estimation policy (the one being evaluated; prediction problem), or the
target policy (greedy policy being evaluated and improved; control problem).

7/50
Monte-Carlo Policy Evaluation
MC policy evaluation: first-visit and every-visit methods

Algorithm 1 MC-PE: policy evaluation of policy π

1: Given π; a set of returns U(s) = ∅ for each s ∈ S
2: while (!converged) do
3:   Generate an episode ξ = (s_0, a_0, r_1, . . . , s_{T−1}, a_{T−1}, r_T, s_T) using π
4:   for each state s_t in ξ do
5:     Compute the return of s_t:  R_t = R(s_t) = Σ_{t'=t}^{T−1} γ^{t'−t} r_{t'+1}
6:     Either:
         U(s_t) ← U(s_t) ∪ {R(s_t)} once per episode (first-visit), or
         U(s_t) ← U(s_t) ∪ {R(s_t)} at every occurrence (every-visit)
7: Return V^π(s) = average U(s)

Converges! as the number of visits to all s goes to infinity.

(Introduction to RL, Sutton & Barto 1998)
8/50
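A minimal first-visit MC policy-evaluation sketch (illustrative, not from the slides), assuming each episode generated by π is given as a list of (state, reward) pairs and that the sampled returns U(s) are kept in a dict of lists.

```python
from collections import defaultdict

def mc_first_visit(episodes, gamma=0.95):
    """First-visit Monte-Carlo policy evaluation.

    episodes: iterable of episodes, each a list of (s_t, r_{t+1}) pairs.
    Returns V(s) as the average of the first-visit returns U(s).
    """
    U = defaultdict(list)                       # sampled returns per state
    for episode in episodes:
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)        # first occurrence of s
        G = 0.0
        for t in reversed(range(len(episode))): # accumulate returns backwards
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:             # first-visit condition
                U[s].append(G)
    return {s: sum(u) / len(u) for s, u in U.items()}
```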
On-Policy MC Control Algorithm

Algorithm 2 On-policy MC Control Algorithm

1: Init an initial policy π_0.
2: while (!converged) do
3:   Policy Evaluation:       π_k → Q^{π_k}(s, a), ∀s, a
4:   Policy Improvement (greedy action selection):
       π_{k+1}(s) = argmax_a Q^{π_k}(s, a), ∀s

Generalized policy iteration framework

Need sufficient exploration!!! exploring starts, ε-greedy policy, etc.
9/50
ε-Greedy Policy
The ε-greedy policy:
greedy: a* = argmax_{a'∈A} Q(s, a') with probability 1 − ε
randomized exploration: a = rand(|A|) with probability ε

Construct a new ε-greedy policy π_{k+1} upon the value functions Q^{π_k}(s, a) of the previous policy: Policy improvement!

Q^{π_k}(s, π_{k+1}(s)) = Σ_a π_{k+1}(a|s) Q^{π_k}(s, a)
  = (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) max_a Q^{π_k}(s, a)
  ≥ (ε/|A|) Σ_a Q^{π_k}(s, a) + (1 − ε) Σ_a [ (π_k(a|s) − ε/|A|) / (1 − ε) ] Q^{π_k}(s, a)
  = Σ_a π_k(a|s) Q^{π_k}(s, a) = V^{π_k}(s)

Therefore V^{π_{k+1}}(s) ≥ V^{π_k}(s)
10/50
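For concreteness, an ε-greedy action-selection sketch, assuming Q is a NumPy array of shape (num_states, num_actions); the function name and signature are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=np.random.default_rng()):
    """Greedy action w.p. 1 - epsilon, uniformly random action w.p. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # randomized exploration
    return int(np.argmax(Q[s]))                # greedy action
```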
MC Control Algorithm with ε-Greedy
Choose actions using ε-greedy.
Converges! if ε_k decreases to zero through time (GLIE), e.g. ε_k = 1/k.
A policy is GLIE (Greedy in the Limit with Infinite Exploration) if
all state-action pairs are visited infinitely often, and
in the limit (k → ∞) the policy becomes greedy:

lim_{k→∞} π_k(a|s) = δ_a( argmax_{a'} Q_k(s, a') )

where δ is a Dirac function.
11/50
Temporal difference (TD) learning
TD Prediction (TD(0))
On-policy TD Control (SARSA Algorithm)
Off-policy TD Control (Q-Learning)
Eligibility Traces (TD(λ))

12/50
Temporal difference (TD) Prediction

Recall  DP:  V^π(s) = E[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ],
        MC:  V(s) ← V(s) + α [ R(s) − V(s) ],   α = 1 / n_{R(s)}

TD learning: Given (s, a, r, s'), how to update V_new(s) ← V_old(s)?

V_new(s) = (1 − α) V_old(s) + α [ r + γ V_old(s') ]        (TD target in brackets)
         = V_old(s) + α [ r + γ V_old(s') − V_old(s) ]     (TD error in brackets)

Reinforcement:
more reward than expected: r + γ V_old(s') > V_old(s)  ⇒  increase V(s)
less reward than expected: r + γ V_old(s') < V_old(s)  ⇒  decrease V(s)

DP & TD bootstrap: learn a guess V̂(s) from a guess V̂(s').
13/50
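A one-function sketch of this tabular TD(0) update for a single transition, assuming V is a dict or array of value estimates; the names and the terminal-state handling are illustrative assumptions.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """One TD(0) backup: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    target = r if terminal else r + gamma * V[s_next]   # TD target
    td_error = target - V[s]                            # TD error
    V[s] += alpha * td_error
    return td_error
```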
TD Prediction vs. MC Prediction

Figure: TD backup diagram. TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.

Figure: MC backup diagrams.

14/50


TD Prediction vs. MC Prediction

TD:
TD can learn before the termination of episodes.
TD can be used for both non-episodic and episodic tasks.
The update depends on a single stochastic transition ⇒ lower variance.
Updates use bootstrapping ⇒ the estimate has some bias.
TD updates exploit the Markov property.

MC:
MC learning must wait until the end of episodes.
MC only works for episodic tasks.
The update depends on a sequence of many stochastic transitions ⇒ much larger variance.
Unbiased estimate.
MC updates do not exploit the Markov property, hence MC can be effective in non-Markovian environments.
15/50
TD vs. MC: A Random Walk Example

The value of each state is a prediction: the probability of terminating on the right if starting from that state.

16/50
On-Policy TD Control: SARSA

Figure: Learning on tuple (s, a, r, s', a'): SARSA

Q-value update:

Q_{t+1}(s, a) = Q_t(s, a) + α [ r_t + γ Q_t(s', a') − Q_t(s, a) ]
18/50
On-Policy TD Control: SARSA

Algorithm 3 SARSA Algorithm

1:  Init Q(s, a) arbitrarily ∀s ∈ S, a ∈ A, and Q(s*, ·) = 0 for all terminal states s*.
2:  while (!converged) do
3:    Init a starting state s
4:    Select an action a from s using a policy π derived from Q (e.g. ε-greedy)
5:    /* Execute one episode */
6:    while (!episode-terminated) do
7:      Execute a, observe r, s'
8:      Derive π from Q, then select action a' from s'
9:      Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t [ r + γ Q_t(s', a') − Q_t(s, a) ]
10:     s ← s'; a ← a'   (behavior policy = target policy)
19/50
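A compact SARSA episode loop as a sketch, assuming a Gym-style environment where `env.reset()` returns a state index and `env.step(a)` returns `(s_next, r, done)`, and a tabular NumPy array Q; the interface and hyperparameters are assumptions for illustration.

```python
import numpy as np

def sarsa_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1,
                  rng=np.random.default_rng()):
    """One episode of on-policy SARSA; updates Q in place."""
    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)
        # On-policy target: bootstrap with the action a' actually chosen by the behavior policy.
        target = r if done else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
```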
The Q values of SARSA converge w.p.1 to the optimal values as long as

the learning rates satisfy the stochastic approximation conditions
    Σ_t α_t(s, a) = ∞,   Σ_t α_t²(s, a) < ∞

the policies π_t(s, a) derived from Q_t(s, a) are GLIE policies.

GLIE: Greedy in the Limit with Infinite Exploration
20/50
SARSA on Windy Gridworld Example

Reward is −1.0 on every step until the goal state (cell G) is reached.

Each move is shifted upward by the wind; the wind strength (number of cells shifted upward) is given below each column.

21/50
The y-axis shows the accumulated number of times the goal was reached.

22/50
Off-Policy TD Control: Q-Learning

Off-Policy MC?

Importance sampling to estimate the following expectation of returns:

E_{τ∼P_π(τ)}[ φ(τ) ] = ∫ φ(τ) P_π(τ) dτ
                     = ∫ φ(τ) (P_π(τ) / P_μ(τ)) P_μ(τ) dτ
                     = E_{τ∼P_μ(τ)}[ (P_π(τ) / P_μ(τ)) φ(τ) ]
                     ≈ (1/N) Σ_{i=1}^N (P_π(τ_i) / P_μ(τ_i)) φ(τ_i)

Denote by P_π(τ) the trajectory distribution under policy π, where τ = {s_0, a_0, s_1, a_1, ···}:

P_π(τ) = P_0(s_0) Π_t P(s_{t+1} | s_t, a_t) π(a_t | s_t)
24/50
Assume in MC control that the behavior policy used to generate the data is μ(a|s) (i.e. P_μ(τ)).
The target policy is π(a|s) (i.e. P_π(τ)).
Set the importance weights as:

w_t = P_π(τ_t) / P_μ(τ_t) = Π_{i=t}^T π(a_i | s_i) / μ(a_i | s_i)

The MC value update becomes (when observing a return φ_t):

V(s_t) ← V(s_t) + α ( w_t φ_t − V(s_t) )
25/50
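A small sketch of this importance-weighted MC update, assuming the per-step action probabilities of the target policy π and the behavior policy μ along the episode tail are available as lists; names are illustrative.

```python
import math

def off_policy_mc_update(V, s_t, G_t, pi_probs, mu_probs, alpha=0.1):
    """Weight the observed return G_t by w_t = prod_i pi(a_i|s_i) / mu(a_i|s_i)."""
    w_t = math.prod(p / m for p, m in zip(pi_probs, mu_probs))
    V[s_t] += alpha * (w_t * G_t - V[s_t])
```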
Off-Policy TD Control: Q-Learning

Off-Policy TD?

The term r_t + γ V(s_{t+1}) is estimated by importance sampling.

The TD value update becomes (given a transition (s_t, a_t, r_t, s_{t+1})):

V(s_t) ← V(s_t) + α [ (π(a_t|s_t) / μ(a_t|s_t)) (r_t + γ V(s_{t+1})) − V(s_t) ]
26/50
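And the corresponding one-step importance-weighted TD update as a sketch, assuming `pi(a, s)` and `mu(a, s)` return the action probabilities of the target and behavior policies; purely illustrative.

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.95):
    """TD(0) with a per-step importance ratio rho = pi(a|s) / mu(a|s) on the target."""
    rho = pi(a, s) / mu(a, s)
    V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])
```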
Off-Policy TD Control: Q-Learning
Q-learning (Watkins, 1989): Given a new experience (s, a, r, s')

Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
            = Q_old(s, a) + α [ r − Q_old(s, a) + γ max_{a'} Q_old(s', a') ]

Reinforcement:
more reward than expected (r + γ max_{a'} Q_old(s', a') > Q_old(s, a))  ⇒  increase Q(s, a)
less reward than expected (r + γ max_{a'} Q_old(s', a') < Q_old(s, a))  ⇒  decrease Q(s, a)
27/50
Q-Learning

Algorithm 4 Q-Learning Algorithm

1:  Init Q(s, a) arbitrarily ∀s ∈ S, a ∈ A, and Q(s*, ·) = 0 for all terminal states s*.
2:  while (!converged) do
3:    Init a starting state s
4:    /* Execute one episode */
5:    while (!episode-terminated) do
6:      Select an action a from s using a policy derived from Q (e.g. ε-greedy)
7:      Execute a, observe r, s'
8:      Update: Q_{t+1}(s, a) = Q_t(s, a) + α_t [ r + γ max_{a'} Q_t(s', a') − Q_t(s, a) ]
9:      s ← s'

The behavior policy is μ, e.g. ε-greedy w.r.t. Q(s, a).
The target policy is greedy: π(s_t) = argmax_a Q(s_t, a).
⇒ off-policy learning: behavior policy ≠ target policy
28/50
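A compact Q-learning episode loop as a sketch, again assuming a Gym-style `env` whose `step(a)` returns `(s_next, r, done)`; the interface and hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_learning_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1,
                       rng=np.random.default_rng()):
    """One episode of off-policy Q-learning; updates Q in place."""
    s = env.reset()
    done = False
    while not done:
        # Behavior policy: epsilon-greedy w.r.t. the current Q.
        if rng.random() < epsilon:
            a = int(rng.integers(Q.shape[1]))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Target policy is greedy: bootstrap with max_a' Q(s', a').
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```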
Q-learning convergence with prob. 1
Q-learning is a stochastic approximation of Q-Iteration:

Q-learning:   Q_new(s, a) = (1 − α) Q_old(s, a) + α [ r + γ max_{a'} Q_old(s', a') ]
Q-Iteration:  ∀s, a:  Q_{k+1}(s, a) = R(s, a) + γ Σ_{s'} P(s'|a, s) max_{a'} Q_k(s', a')

We've shown convergence of Q-Iteration to Q*.

Convergence of Q-learning:
Q-Iteration is a deterministic update: Q_{k+1} = T(Q_k)
Q-learning is a stochastic version: Q_{k+1} = (1 − α) Q_k + α [ T(Q_k) + η_k ]
η_k is zero mean!
29/50
Q-learning convergence with prob. 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy

Σ_t α_t(s, a) = ∞,   Σ_t α_t²(s, a) < ∞

(Watkins and Dayan, Q-learning. Machine Learning, 1992)

30/50
Q-learning vs. SARSA: The Cliff Example

31/50
Q-Learning impact
Q-Learning was the first provably convergent direct adaptive optimal
control algorithm

Great impact on the field of Reinforcement Learning


smaller representation than models
automatically focuses attention on where it is needed,
i.e., no sweeps through state space
though it does not solve the exploration versus exploitation issue:
ε-greedy, optimistic initialization, etc.
off-policy control: learning many control policies simultaneously while
following a single behavior policy

32/50
Backup Diagram: DP

V^π(s) = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ] = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Full-width backup: |S| grows exponentially with the number of state variables ⇒ curse of dimensionality.
33/50
Backup Diagram: MC

V(s) ← V(s) + α [ R(s) − V(s) ],   α = 1 / n_{R(s)}

Sample backup: constant complexity, independent of |S|.
34/50
Backup Diagram: TD

V(s_t) ← V(s_t) + α δ_t,  with TD-error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Sample backup: constant complexity, independent of |S|.
35/50
Backup Diagram: TD(λ = 0)

V(s_t) ← V(s_t) + α δ_t,  with TD-error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Sample backup: constant complexity, independent of |S|.
36/50
n-Step TD Update
Temporal Difference: based on a single experience (s_0, r_1, s_1)

V_new(s_0) = V_old(s_0) + α [ r_1 + γ V_old(s_1) − V_old(s_0) ]

Longer sequence of experience? e.g. (s_0, r_1, r_2, r_3, s_3)

Temporal credit assignment, think further backwards: receiving r_3 also tells us something about V(s_0)

V_new(s_0) = V_old(s_0) + α [ r_1 + γ r_2 + γ² r_3 + γ³ V_old(s_3) − V_old(s_0) ]
37/50
n-Step TD Update: Forward View
Let the TD target look n steps into the future:

TD (1-step), 2-step, 3-step, ..., n-step, ..., Monte Carlo

Define the n-step return:

R_t^{(n)} = r_{t+1} + γ r_{t+2} + . . . + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
38/50
n-Step TD Update: Forward View
n-step TD learning:

V_t^{(n)}(s) ← V_t^{(n)}(s) + α [ R_t^{(n)} − V_t^{(n)}(s) ]

where R_t^{(n)} is the n-step return

R_t^{(n)} = r_{t+1} + γ r_{t+2} + . . . + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})

The offline¹ value update up to time T:

V(s) ← V(s) + Σ_{t=0}^{T−1} ΔV_t(s)

Error reduction: max_s | V_t^{(n)}(s) − V^π(s) | ≤ γ^n max_s | V_t(s) − V^π(s) |

Can we derive an efficient online update?

¹ Offline: V_t(s) is constant within an episode, for all s.
39/50
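A small helper computing the n-step return R_t^(n) from a reward slice and the current value estimates; the array layout and names are illustrative assumptions.

```python
def n_step_return(rewards, V, s_end, gamma=0.95, terminal=False):
    """R_t^(n) = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n}).

    rewards: [r_{t+1}, ..., r_{t+n}], s_end: s_{t+n}.
    """
    n = len(rewards)
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    if not terminal:
        G += gamma**n * V[s_end]      # bootstrap from V(s_{t+n})
    return G
```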
TD(λ): Forward View
TD(λ) is the weighted average of n-step returns with different n
λ-decay weights (1 − λ) λ^{n−1}

TD(λ), λ-return:

Look into the future, do an MC evaluation for each n, then average the n-step returns weightedly:

R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}

Figure: weighting of the n-step returns: (1 − λ), (1 − λ)λ, (1 − λ)λ², ..., with the remaining weight λ^{T−t−1} on the final (λ = 1, Monte-Carlo) return.
40/50
TD(λ): Forward View
TD(λ) is the weighted average of n-step returns with different n
λ-decay weights (1 − λ) λ^{n−1}

R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}
41/50
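A sketch of the λ-return as the (1 − λ)λ^{n−1}-weighted combination of the n-step returns of one episode; it assumes the list of n-step returns (e.g. from the `n_step_return` helper above) has already been computed, with the last entry being the full Monte-Carlo return, which receives the remaining weight λ^{T−t−1}.

```python
def lambda_return(n_step_returns, lam=0.9):
    """R_t^lambda = (1 - lam) * sum_n lam^(n-1) R_t^(n), truncated at the episode end."""
    G = sum((1 - lam) * lam**(n - 1) * R
            for n, R in enumerate(n_step_returns[:-1], start=1))
    # The final (Monte-Carlo) return carries all remaining weight lam^(T-t-1).
    G += lam**(len(n_step_returns) - 1) * n_step_returns[-1]
    return G
```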
TD(λ): Forward View

Figure: forward view along a trajectory s_t, s_{t+1}, s_{t+2}, s_{t+3}, ... with rewards r_{t+1}, r_{t+2}, r_{t+3}, ..., r_T.

Update the value function towards the λ-return R_t^λ.

The forward view looks into the future to compute R_t^λ.
Not directly implementable because it is acausal: it uses, at each step, knowledge of what will happen many steps later.
42/50
TD(λ): Backward View

Forward view provides theory.
Backward view provides mechanism (approximating the forward view):
Update online, every step, from incomplete sequences.
Key concept: eligibility traces.
Identical offline updates (proof in Section 7.4 of Sutton & Barto's book).
43/50
TD(λ): Eligibility Traces

Credit assignment problem: did the bell or the light cause the shock?

Frequency heuristic: assign credit to the most frequent states.
Recency heuristic: assign credit to the most recent states.

Eligibility traces combine both heuristics and are updated for all states:

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
44/50
TD(λ): Backward View

Keep an eligibility trace e(s) for every state s.
Update the value V(s) for every state s ⇒ wide-spreading!!!
In proportion to the TD-error δ_t and the eligibility trace e_t(s):

V(s) ← V(s) + α δ_t e_t(s),  with δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

Figure: the TD-error δ_t is propagated backwards to the recently visited states s_{t−3}, s_{t−2}, s_{t−1}, s_t in proportion to their eligibility traces e_t(s).
45/50
TD(λ) with Eligibility Traces: Online Tabular

e(s_t) ← e(s_t) + 1
∀s:  V_new(s) = V_old(s) + α e(s) [ r_{t+1} + γ V_old(s_{t+1}) − V_old(s_t) ]
∀s:  e(s) ← γλ e(s)

Initialize V(s) arbitrarily (but set to 0 if s is terminal)
Repeat (for each episode):
    Initialize E(s) = 0, for all s ∈ S
    Initialize S
    Repeat (for each step of episode):
        A ← action given by π for S
        Take action A, observe reward R and next state S'
        δ ← R + γ V(S') − V(S)
        E(S) ← E(S) + 1                    (accumulating traces)
          or E(S) ← (1 − α) E(S) + 1       (dutch traces)
          or E(S) ← 1                      (replacing traces)
        For all s ∈ S:
            V(s) ← V(s) + α δ E(s)
            E(s) ← γλ E(s)
        S ← S'
    until S is terminal
46/50
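A minimal online tabular TD(λ) sketch (backward view, accumulating traces), assuming a Gym-style `env`, a fixed behavior policy `policy(s)`, and V as a NumPy array over states; all names and the interface are illustrative assumptions.

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.95, lam=0.9):
    """One episode of online tabular TD(lambda) with accumulating eligibility traces."""
    E = np.zeros_like(V)                  # eligibility traces e(s)
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
        E[s] += 1.0                       # accumulating trace
        V += alpha * delta * E            # spread the TD error to all traced states
        E *= gamma * lam                  # decay all traces
        s = s_next
```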
Unified View

47/50
Tabular SARSA(λ)

... gradually based on the approximate values for the current policy. The policy improvement can be done in many different ways, as we have seen throughout this book. For example, the simplest approach is to use the ε-greedy policy with respect to the current action-value estimates. Figure 7.13 shows the complete Sarsa(λ) algorithm for this case.

Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        δ ← R + γ Q(S', A') − Q(S, A)
        E(S, A) ← E(S, A) + 1                  (accumulating traces)
          or E(S, A) ← (1 − α) E(S, A) + 1     (dutch traces)
          or E(S, A) ← 1                       (replacing traces)
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + α δ E(s, a)
            E(s, a) ← γλ E(s, a)
        S ← S'; A ← A'
    until S is terminal

Figure 7.13: Tabular Sarsa(λ).

Remember to initialize Q(s*, ·) = 0 for all terminal states s*!
48/50
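For comparison with the TD(λ) sketch above, a single backward-view Sarsa(λ) step with replacing traces, assuming tabular NumPy arrays Q and E of shape (num_states, num_actions); illustrative only.

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, done,
                      alpha=0.1, gamma=0.95, lam=0.9):
    """One backward-view Sarsa(lambda) update with a replacing trace; updates Q, E in place."""
    delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
    E[s, a] = 1.0                 # replacing trace for the visited (s, a)
    Q += alpha * delta * E        # update all traced state-action pairs
    E *= gamma * lam              # decay all traces
```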
Tabular SARSA(λ): Example

Left: A single trajectory to the goal (*). All rewards were zero except for a positive reward at the goal location.
Middle: A single action value was strengthened as a result of this path by one-step Sarsa.
Right: Many action values were strengthened as a result of this path by Sarsa(λ).

49/50
Tabular Q(λ)
n-step return: r_{t+1} + γ r_{t+2} + . . . + γ^{n−1} r_{t+n} + γ^n max_a Q_t(s_{t+n}, a)

Q(λ) algorithm by Watkins:

Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        A* ← argmax_a Q(S', a)   (if A' ties for the max, then A* ← A')
        δ ← R + γ Q(S', A*) − Q(S, A)
        E(S, A) ← E(S, A) + 1                  (accumulating traces)
          or E(S, A) ← (1 − α) E(S, A) + 1     (dutch traces)
          or E(S, A) ← 1                       (replacing traces)
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + α δ E(s, a)
            If A' = A*, then E(s, a) ← γλ E(s, a)
            else E(s, a) ← 0
        S ← S'; A ← A'
    until S is terminal

Figure 7.16: Tabular version of Watkins's Q(λ) algorithm.

Remember to initialize Q(s*, ·) = 0 for all terminal states s*!
50/50
