Policy at step t, $\pi_t$: a mapping from states to action probabilities,

$\pi_t(a \mid s)$ = probability that $A_t = a$ when $S_t = s$
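As a concrete illustration, a tabular stochastic policy is just a table of action probabilities indexed by state. A minimal Python sketch; the states, actions, and probabilities here are made up for illustration:

    import random

    # Hypothetical policy pi_t(a | s) for a two-state, two-action MDP.
    pi = {
        "s0": {"left": 0.7, "right": 0.3},
        "s1": {"left": 0.1, "right": 0.9},
    }

    def sample_action(pi, s):
        """Draw A_t ~ pi_t(. | S_t = s)."""
        actions = list(pi[s])
        weights = [pi[s][a] for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    print(sample_action(pi, "s0"))  # "left" with probability 0.7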
                state values   action values
  prediction    $v_\pi$        $q_\pi$
  control       $v_*$          $q_*$
  estimates     $V_t(s)$       $Q_t(s, a)$
Values are expected returns (see the sampling sketch after this list):
• The value of a state, given a policy:
  $v_\pi(s) = \mathbb{E}\{G_t \mid S_t = s,\ A_{t:\infty} \sim \pi\}$,   $v_\pi : \mathcal{S} \to \mathbb{R}$
• The value of a state-action pair, given a policy:
  $q_\pi(s, a) = \mathbb{E}\{G_t \mid S_t = s,\ A_t = a,\ A_{t+1:\infty} \sim \pi\}$,   $q_\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$
• The optimal value of a state:
  $v_*(s) = \max_\pi v_\pi(s)$,   $v_* : \mathcal{S} \to \mathbb{R}$
• The optimal value of a state-action pair:
  $q_*(s, a) = \max_\pi q_\pi(s, a)$,   $q_* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$
• Optimal policy: $\pi_*$ is an optimal policy if and only if
  $\pi_*(a \mid s) > 0$ only where $q_*(s, a) = \max_b q_*(s, b)$, for all $s \in \mathcal{S}$
• In other words, $\pi_*$ is optimal iff it is greedy with respect to $q_*$.
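Because values are expected returns, they can be estimated by averaging sample returns. A minimal Monte Carlo sketch, assuming a hypothetical episode generator sample_episode(s) that runs $\pi$ from state s and returns the reward sequence:

    def mc_state_value(sample_episode, s, gamma, n_episodes=1000):
        """Estimate v_pi(s) = E[G_t | S_t = s] by averaging sampled returns.

        sample_episode(s) is assumed to run pi from s until termination
        and return the rewards [R_{t+1}, R_{t+2}, ...].
        """
        total = 0.0
        for _ in range(n_episodes):
            g = 0.0
            for r in reversed(sample_episode(s)):  # G_t = R_{t+1} + gamma * G_{t+1}
                g = r + gamma * g
            total += g
        return total / n_episodes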
[Slide figure: optimal policy example, with the corresponding returns]
[Figure: Gridworld, state-value function for the equiprobable random policy; $\gamma = 0.9$]
[Figure: golf example, $q_*(s, \mathrm{driver})$, with sand and green regions]
Summary: Optimal Value Functions

❐ For finite MDPs, policies can be partially ordered:
  $\pi \ge \pi'$ if and only if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$
❐ There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all $\pi_*$.
❐ Optimal policies share the same optimal state-value function:
  $v_*(s) = \max_\pi v_\pi(s)$ for all $s \in \mathcal{S}$
❐ Optimal policies also share the same optimal action-value function:
  $q_*(s, a) = \max_\pi q_\pi(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$
  This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.

[Sidebar: Summary of Notation. Capital letters are used for random variables; lower case letters are used for the values of random variables and for scalar functions; quantities required to be real-valued vectors are written in bold lower case. $s$: state; $a$: action; $\mathcal{S}$: set of all nonterminal states; $\mathcal{S}^+$: set of all states, including the terminal state; $\mathcal{A}(s)$: set of actions possible in state $s$; $t$: discrete time step; $T$: final time step of an episode; $S_t$: state at $t$; $A_t$: action at $t$; $G_t$: return following $t$.]
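The last point is what makes $q_*$ directly useful for control: choosing $\arg\max_a q_*(s, a)$ needs no model and no lookahead. A minimal sketch with a hypothetical q_star table:

    # Hypothetical optimal action values, for illustration only.
    q_star = {
        ("s0", "left"): 1.0, ("s0", "right"): 2.5,
        ("s1", "left"): 0.4, ("s1", "right"): 0.1,
    }

    def greedy_action(q, s, actions):
        """A greedy (hence optimal) choice: any a maximizing q_*(s, a)."""
        return max(actions, key=lambda a: q[(s, a)])

    print(greedy_action(q_star, "s0", ["left", "right"]))  # "right"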
Why Optimal State-Value Functions are Useful

[Figure: a) gridworld, b) $v_*$, c) $\pi_*$]
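$v_*$ is useful because a one-step lookahead that is greedy with respect to $v_*$ yields an optimal action. A sketch assuming access to the dynamics as a hypothetical table p[(s, a)] of (prob, s_next, r) triples:

    def greedy_wrt_v(p, v_star, s, actions, gamma):
        """Pick argmax_a sum_{s',r} p(s', r | s, a) * (r + gamma * v_star[s'])."""
        def backup(a):
            return sum(prob * (r + gamma * v_star[s2]) for prob, s2, r in p[(s, a)])
        return max(actions, key=backup)

Note that this one-step lookahead needs the dynamics p, whereas acting greedily with respect to $q_*$ (previous slide) does not.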
$$\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \\
    &= R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots \right) \\
    &= R_{t+1} + \gamma G_{t+1}
\end{aligned}$$
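The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ lets all returns of an episode be computed in one backward pass over the rewards. A small sketch with made-up rewards:

    def all_returns(rewards, gamma):
        """Map [R_1, R_2, ..., R_T] to [G_0, G_1, ..., G_{T-1}]."""
        g, gs = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
            gs.append(g)
        return gs[::-1]

    print(all_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]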
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]$$
This is a set of equations (in fact, linear), one for each state.
The value function for $\pi$ is its unique solution.
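Concretely, for a small finite MDP the system can be solved directly: in matrix form $v_\pi = r_\pi + \gamma P_\pi v_\pi$, so $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$. A NumPy sketch with a hypothetical three-state chain (P_pi is the policy-averaged transition matrix, r_pi the expected one-step reward; both made up here):

    import numpy as np

    gamma = 0.9
    P_pi = np.array([[0.5, 0.5, 0.0],   # hypothetical P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
                     [0.0, 0.5, 0.5],
                     [0.0, 0.0, 1.0]])
    r_pi = np.array([1.0, 0.0, 0.0])    # hypothetical expected next reward from each state

    # Unique solution of v = r_pi + gamma * P_pi @ v.
    v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
    print(v_pi)  # approx [1.818, 0.0, 0.0]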
$$\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \,\middle|\, S_t = s \right] \\
         &= \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \,\middle|\, S_t = s \right] \\
         &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_\pi(s') \right]
\end{aligned}$$

[Backup diagrams: for $v_\pi$ and for $q_\pi$]
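Read as an assignment, the same expected update gives iterative policy evaluation: sweep over the states, replacing each v(s) by the right-hand side of the Bellman equation until the values stop changing. A sketch using the same hypothetical table formats as above (pi[s][a] is $\pi(a \mid s)$; p[(s, a)] lists (prob, s_next, r) triples):

    def policy_evaluation(states, actions, pi, p, gamma, theta=1e-8):
        """Iterate v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) (r + gamma v(s'))."""
        v = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                new = sum(pi[s][a] * sum(prob * (r + gamma * v[s2])
                                         for prob, s2, r in p[(s, a)])
                          for a in actions)
                delta = max(delta, abs(new - v[s]))
                v[s] = new
            if delta < theta:
                return v

With $\gamma < 1$ the update is a contraction, so the sweep converges to the unique $v_\pi$ above.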
[Figure: Gridworld, state-value function for the equiprobable random policy; $\gamma = 0.9$]