SARSA, Q-learning
Eligibility Traces
Quick Review: Bellman (Optimality) Equations
Matrix form: $V^\pi = R^\pi + \gamma P^\pi V^\pi \;\Rightarrow\; V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$
Bellman optimality equation
$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$
$\pi^*(s) = \arg\max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$
$V^*(s) = Q^*(s, \pi^*(s)), \quad \forall s$
Dynamic Programming planning methods: VI, QI, PI, async. DP, etc.
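As a concrete illustration of the Bellman optimality backup, a minimal value-iteration sketch (not from the slides; the array layout R[a, s], P[a, s, s'] and the tolerance are assumptions):

import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-8):
    """Tabular VI: apply the Bellman optimality backup until convergence.
    R[a, s] is the reward, P[a, s, s'] the transition probability."""
    V = np.zeros(R.shape[1])
    while True:
        Q = R + gamma * (P @ V)        # Q[a, s] = R(a, s) + gamma * sum_s' P(s'|a,s) V(s')
        V_new = Q.max(axis=0)          # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # V* and a greedy policy
        V = V_new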
Learning in MDPs
Assume an unknown MDP $\langle S, A, P, R, \gamma \rangle$ (unknown $P$, $R$).
Accumulate experience while interacting with the world.
Introduction to Model-Free Methods
Monte-Carlo Policy Evaluation
MC policy evaluation: First-visit and every-visit methods
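A minimal first-visit MC policy-evaluation sketch (not from the slides; episodes are assumed to be lists of (state, reward) pairs, where the reward is the one received on leaving that state):

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:              # episode: [(s_0, r_1), (s_1, r_2), ...]
        G = 0.0
        first_return = {}                 # state -> return from its first visit
        for s, r in reversed(episode):    # walk backwards, accumulating returns
            G = r + gamma * G
            first_return[s] = G           # earlier visits overwrite later ones
        for s, G in first_return.items():
            returns_sum[s] += G
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

For every-visit MC, one would instead add the return of every occurrence of s to the averages.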
On-Policy MC Control Algorithm

Policy evaluation step: $\pi_k \rightarrow Q^{\pi_k}(s, a), \quad \forall s, a$
ε-Greedy Policy

The ε-greedy policy:
greedy: $a = \arg\max_{a' \in A} Q(s, a')$ with probability $1 - \varepsilon$
randomized exploration: $a \sim \mathrm{Uniform}(A)$ with probability $\varepsilon$

Construct a new ε-greedy policy $\pi_{k+1}$ upon the value function $Q^{\pi_k}(s, a)$ of the previous policy: policy improvement!

$Q^{\pi_k}(s, \pi_{k+1}(s)) = \sum_a \pi_{k+1}(a \mid s)\, Q^{\pi_k}(s, a)$
$= \frac{\varepsilon}{|A|} \sum_a Q^{\pi_k}(s, a) + (1 - \varepsilon) \max_a Q^{\pi_k}(s, a)$
$\geq \frac{\varepsilon}{|A|} \sum_a Q^{\pi_k}(s, a) + (1 - \varepsilon) \sum_a \frac{\pi_k(a \mid s) - \varepsilon/|A|}{1 - \varepsilon}\, Q^{\pi_k}(s, a)$
$= \sum_a \pi_k(a \mid s)\, Q^{\pi_k}(s, a) = V^{\pi_k}(s)$
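A minimal ε-greedy action-selection sketch (the numpy Q-table with one row per state is an assumption):

import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """Greedy action w.p. 1 - epsilon; uniform random action w.p. epsilon."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])   # randomized exploration
    return int(np.argmax(Q[s]))                # greedy: argmax_a Q(s, a)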
MC Control Algorithm with ε-Greedy

Choose actions using ε-greedy.
Converges if $\varepsilon_k$ decreases to zero over time (GLIE), e.g. $\varepsilon_k = 1/k$; a sketch of the resulting control loop follows below.
A policy is GLIE (Greedy in the Limit with Infinite Exploration) if:
All state-action pairs are visited infinitely often.
In the limit ($k \to \infty$), the policy becomes greedy.
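A minimal sketch of the on-policy MC control loop with $\varepsilon_k = 1/k$ (not from the slides; a classic Gym-style interface env.reset(), env.step(), env.action_space.n and hashable states are assumptions):

import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0):
    """On-policy every-visit MC control with a GLIE epsilon schedule."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))        # action values
    counts = defaultdict(lambda: np.zeros(nA))   # visit counts per (s, a)
    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                            # GLIE: eps_k -> 0
        episode, s, done = [], env.reset(), False
        while not done:
            # eps-greedy action selection
            if np.random.rand() < eps:
                a = np.random.randint(nA)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # incremental-mean MC updates, walking the episode backwards
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            counts[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / counts[s][a]
    return Q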
Temporal difference (TD) learning
TD Prediction (TD(0))
On-policy TD Control (SARSA Algorithm)
Off-policy TD Control (Q-Learning)
Eligibility Traces (TD(λ))
Temporal difference (TD) Prediction

Recall DP: $V^\pi(s) = \mathbb{E}\big[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \big]$
MC: $V(s) \leftarrow V(s) + \alpha \big[ R(s) - V(s) \big], \quad \alpha = \frac{1}{n_R(s)}$

Reinforcement:
more reward than expected: $r + \gamma V_{old}(s') > V_{old}(s)$ → increase $V(s)$
less reward than expected: $r + \gamma V_{old}(s') < V_{old}(s)$ → decrease $V(s)$

DP & TD bootstrap: learn a guess $\hat{V}(s)$ from a guess $\hat{V}(s')$.
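The implied TD(0) update $V(s) \leftarrow V(s) + \alpha\big[r + \gamma V(s') - V(s)\big]$ as a minimal sketch (the value table V and the terminal flag are assumptions):

def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r if terminal else r + gamma * V[s_next]
    td_error = target - V[s]       # > 0: more reward than expected, V(s) increases
    V[s] += alpha * td_error
    return td_error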
TD Prediction vs. MC Prediction
TD vs. MC: A Random Walk Example
On-Policy TD Control: SARSA
The experience quintuple $(s, a, r, s', a')$ gives SARSA its name.

Q-value update:
$Q_{t+1}(s, a) = Q_t(s, a) + \alpha \big[ r_t + \gamma Q_t(s', a') - Q_t(s, a) \big]$
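The same update as a minimal code sketch (the tabular layout Q[s][a] is an assumption):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy SARSA step: bootstrap from the action a' actually taken in s'."""
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error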
Q values of SARSA converge w.p.1 to the optimal values as long as the learning rates satisfy the stochastic approximation conditions:

$\sum_t \alpha_t(s, a) = \infty, \qquad \sum_t \alpha_t^2(s, a) < \infty$
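For example, the schedule $\alpha_t = 1/t$ satisfies both conditions: $\sum_{t=1}^{\infty} \frac{1}{t} = \infty$ (the harmonic series diverges) while $\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty$. A constant rate $\alpha_t \equiv \alpha > 0$ satisfies the first condition but not the second. (A standard check, not from the slides.)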
SARSA on Windy Gridworld Example
The y-axis shows the cumulative number of times the goal has been reached.
Off-Policy TD Control: Q-Learning
Off-Policy MC? Weight each sampled return by the importance-sampling ratio of target policy $\pi$ over behavior policy $\mu$:

$w_t = \frac{P_\pi(\tau_t)}{P_\mu(\tau_t)} = \prod_{i=t}^{T} \frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}$
Off-Policy TD?
Q-learning (Watkins, 1989): given a new experience $(s, a, r, s')$

Reinforcement:
more reward than expected ($r + \gamma \max_a Q_{old}(s', a) > Q_{old}(s, a)$) → increase $Q(s, a)$
less reward than expected ($r + \gamma \max_a Q_{old}(s', a) < Q_{old}(s, a)$) → decrease $Q(s, a)$
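A minimal Q-learning update sketch matching the rule above (the numpy Q-table with one row per state is an assumption):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Off-policy Q-learning step: bootstrap from the best action in s',
    regardless of which action the behavior policy will take there."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]    # > 0: more reward than expected -> increase Q(s, a)
    Q[s, a] += alpha * td_error
    return td_error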
Q-learning convergence with prob. 1

Q-learning is a stochastic approximation of Q-Iteration:
Q-Iteration is a deterministic update: $Q_{k+1} = T(Q_k)$
Q-learning is a stochastic version: $Q_{k+1} = (1 - \alpha) Q_k + \alpha \big[ T(Q_k) + \eta_k \big]$
$\eta_k$ is zero mean!
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy

$\sum_t \alpha_t(s, a) = \infty, \qquad \sum_t \alpha_t^2(s, a) < \infty$
Q-learning vs. SARSA: The Cliff Example
Q-Learning impact
Q-Learning was the first provably convergent direct adaptive optimal control algorithm.
Backup Diagram: DP
$V^\pi(s) = \mathbb{E}_\pi\big[ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \big] = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]$

Full-width backup: $|S|$ grows exponentially with the number of state variables → curse of dimensionality.
Backup Diagram: MC
$V(s) \leftarrow V(s) + \alpha \big[ R(s) - V(s) \big], \quad \alpha = \frac{1}{n_R(s)}$
Backup Diagram: TD
Backup Diagram: TD(λ = 0)
n-Step TD Update

Temporal Difference: based on a single experience $(s_0, r_1, s_1)$
n-Step TD Update: Forward View

Let the TD target look n steps into the future.
n-step TD learning: $V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$, where $\Delta V_t(s_t) = \alpha \big[ R_t^{(n)} - V_t(s_t) \big]$

Offline: $V_t(s)$ is constant within an episode, for all $s$.
TD(λ): Forward View

TD(λ) is the weighted average of n-step returns with different n
λ-decay weights: $(1 - \lambda)\lambda^{n-1}$
The λ-return: $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
In an episodic task, all weight beyond termination, $\lambda^{T-t-1}$, falls on the full (Monte-Carlo) return.
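A sketch computing $R_t^\lambda$ for one step of a finished episode (not from the slides; the list layout rewards[k] = r_{k+1} and values[k] ≈ V(s_k) is an assumption):

def lambda_return(rewards, values, t, lam, gamma):
    """R_t^lambda = (1 - lam) * sum_n lam^(n-1) * R_t^(n); the leftover weight
    lam^(T-t-1) goes to the full Monte-Carlo return, as in the slide."""
    T = len(rewards)
    G, result = 0.0, 0.0
    weight, discount = 1.0 - lam, 1.0
    for n in range(1, T - t):                  # n-step returns that still bootstrap
        G += discount * rewards[t + n - 1]     # ... + gamma^(n-1) * r_{t+n}
        result += weight * (G + discount * gamma * values[t + n])
        weight *= lam
        discount *= gamma
    G += discount * rewards[T - 1]             # full return: no bootstrap term
    return result + (lam ** (T - t - 1)) * G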
[Figure: forward view — from state $s_t$, the targets look ahead through $r_{t+1}, s_{t+1}, r_{t+2}, s_{t+2}, r_{t+3}, s_{t+3}, \ldots, r_T$.]
TD(λ): Eligibility Traces

$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & s \neq s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & s = s_t \end{cases}$
TD(λ): Backward View

[Figure: backward view — the current TD error at $s_t$ is broadcast back to previously visited states $s_{t-1}, s_{t-2}, s_{t-3}, \ldots$, each weighted by its eligibility trace $e_t$.]
TD(λ) with Eligibility Traces: Online Tabular

$e(s_t) \leftarrow e(s_t) + 1$
$\forall s: \; V_{new}(s) = V_{old}(s) + \alpha\, e(s) \big[ r_{t+1} + \gamma V_{old}(s_{t+1}) - V_{old}(s_t) \big]$
$\forall s: \; e(s) \leftarrow \gamma\lambda\, e(s)$
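The three lines above as one online TD(λ) step with accumulating traces (numpy arrays V and e over states are assumptions):

import numpy as np

def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam):
    """Online TD(lambda): update every state in proportion to its eligibility."""
    e[s] += 1.0                              # bump the trace of the visited state
    delta = r + gamma * V[s_next] - V[s]     # one-step TD error
    V += alpha * delta * e                   # broadcast update to all states
    e *= gamma * lam                         # decay all traces
    return delta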
Tabular SARSA(λ)

Policy improvement can be done in many ways; the simplest is to use the ε-greedy policy with respect to the current action-value estimates. The complete Sarsa(λ) algorithm for this case:
Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        δ ← R + γQ(S', A') − Q(S, A)
        E(S, A) ← E(S, A) + 1             (accumulating traces)
        or E(S, A) ← (1 − α)E(S, A) + 1   (dutch traces)
        or E(S, A) ← 1                    (replacing traces)
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + αδE(s, a)
            E(s, a) ← γλE(s, a)
        S ← S'; A ← A'
    until S is terminal
Tabular SARSA(λ): Example
Left: A single trajectory to goal (*). All rewards were zero except for a
positive reward at the goal location.
Middle: A single action value was strengthened as a result of this path
by one-step Sarsa.
Right: Many action values were strengthened as a result of this path by Sarsa(λ).
Tabular Q(λ)

n-step return: $r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n \max_a Q_t(s_{t+n}, a)$
Q(λ) algorithm by Watkins
Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
Repeat (for each episode):
    E(s, a) = 0, for all s ∈ S, a ∈ A(s)
    Initialize S, A
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        A* ← argmax_a Q(S', a)   (if A' ties for the max, then A* ← A')
        δ ← R + γQ(S', A*) − Q(S, A)
        E(S, A) ← E(S, A) + 1             (accumulating traces)
        or E(S, A) ← (1 − α)E(S, A) + 1   (dutch traces)
        or E(S, A) ← 1                    (replacing traces)
        For all s ∈ S, a ∈ A(s):
            Q(s, a) ← Q(s, a) + αδE(s, a)
            If A' = A*, then E(s, a) ← γλE(s, a)
            else E(s, a) ← 0
        S ← S'; A ← A'
    until S is terminal
Remember to initialize $Q(s, \cdot) = 0$ for all terminal states $s$!