Value Function
[Figure: two-state Markov chain on states H (happy) and S (sad), with self-loops P_HH = 0.8 and P_SS = 0.7 and cross transitions P_HS = 0.2 and P_SH = 0.3.]

P := [ 0.8  0.2
       0.3  0.7 ]
▶ Inertia ⇒ happy or sad today, likely to stay happy or sad tomorrow (P_HH = 0.8, P_SS = 0.7)
▶ But when sad, the inertia is a little weaker (P_HH > P_SS)
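As a quick check of these numbers, here is a minimal sketch (Python/NumPy assumed; the chain is the one in the figure above, and the 60/40 conclusion follows from the arithmetic, not from the slides):

```python
import numpy as np

# Two-state happy/sad chain: states 0 = H, 1 = S.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])

# The long-run fraction of time in each state is the stationary
# distribution: the left eigenvector of P for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()
print(pi)  # ~[0.6, 0.4]: happy about 60% of days in the long run
```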
[Figure: four-state Markov chain on states HH, HS, SH, SS encoding the moods of two consecutive days, with e.g. self-loop P_{HH,HH} = 0.9 and self-loop P_{SS,SS} = 0.7.]

P := [ 0.9  0.1  0    0
       0    0    0.4  0.6
       0.8  0.2  0    0
       0    0    0.3  0.7 ]
▶ More time spent happy or sad increases the likelihood of staying happy or sad, e.g. P[H tomorrow | HH] = 0.9 > P[H tomorrow | SH] = 0.8 (see the sampling sketch below)
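A short sampling sketch (Python assumed) of the two-day chain; the augmented state (yesterday, today) is exactly what makes the process Markov even though tomorrow's mood depends on two days:

```python
import numpy as np

# Two-day mood chain: state = (yesterday, today).
rng = np.random.default_rng(0)
states = ["HH", "HS", "SH", "SS"]
P = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.0, 0.0, 0.4, 0.6],
              [0.8, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.3, 0.7]])

s = 0                               # start happy two days in a row
for _ in range(10):
    s = rng.choice(4, p=P[s])       # next augmented state
    print(states[s], end=" ")
```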
▶ Step to the right with probability p, to the left with probability 1 − p

[Figure: random-walk chain on states . . . , i − 1, i, i + 1, . . . ; each right arrow has probability p.]

P_{i,i+1} = p,   P_{i,i−1} = 1 − p
▶ P_{ij} = 0 for all other transitions
[Figure: three sample random-walk trajectories, position (in steps) versus time, for different values of p.]
▶ With p > 1/2 the walk diverges to the right (grows unbounded almost surely)
▶ With p < 1/2 it diverges to the left
▶ With p = 1/2 it always comes back to visit the origin (almost surely), as the sketch below illustrates
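A minimal simulation sketch (Python assumed) of the three regimes:

```python
import numpy as np

# 1D random walk: step +1 with probability p, -1 with probability 1 - p.
rng = np.random.default_rng(0)

def walk(p, n_steps=1000):
    steps = np.where(rng.random(n_steps) < p, 1, -1)
    return np.cumsum(steps)

# With p > 1/2 the endpoint drifts right, with p < 1/2 left,
# and with p = 1/2 it hovers around the origin.
for p in (0.45, 0.5, 0.55):
    print(p, walk(p)[-1])
```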
[Figure: sample two-dimensional random-walk trajectories; vertical axis: latitude (North-South).]

▶ Take a step in a random direction: East, West, South, or North
▶ States are pairs of coordinates (x, y)
⇒ x = 0, ±1, ±2, . . . and y = 0, ±1, ±2, . . .
▶ Each of the four directions is chosen with probability 1/4, e.g.

P[x(t + 1) = i, y(t + 1) = j + 1 | x(t) = i, y(t) = j] = 1/4

and similarly (probability 1/4 each) for the other three directions
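The same experiment in two dimensions, as a sketch (Python assumed):

```python
import numpy as np

# 2D random walk: one unit East, West, North, or South, each w.p. 1/4.
rng = np.random.default_rng(0)
moves = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])  # E, W, N, S

path = np.cumsum(moves[rng.integers(0, 4, size=1000)], axis=0)
print(path[-1])  # final (x, y) coordinates after 1000 steps
```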
[Figure: chain on states 0, . . . , i − 1, i, i + 1, . . . , J with right-step probability p and left-step probability 1 − p.]

▶ States 0 and J are called absorbing. Once there, the chain stays there forever
▶ The rest are transient states. Visits to them stop almost surely (see the sketch below)
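Absorption probabilities for this chain can be computed directly from a linear system; a sketch under assumed parameters (J = 10, p = 0.4; neither value is from the slides):

```python
import numpy as np

# Gambler's-ruin style chain on states 0..J with absorbing ends.
# h[i] = P(absorbed at J | start at i) satisfies
# h[i] = p*h[i+1] + (1-p)*h[i-1] for 0 < i < J, with h[0] = 0, h[J] = 1.
J, p = 10, 0.4
A = np.eye(J + 1)
b = np.zeros(J + 1)
b[J] = 1.0
for i in range(1, J):
    A[i, i + 1] -= p
    A[i, i - 1] -= 1 - p
h = np.linalg.solve(A, b)
print(h)  # absorption-at-J probability from each starting state
```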
Value Function
[Figure: tree from S0 through the possible actions a1, a2, . . . to the successor state s.]

P[S1 = s | S0] = Σ_{i=1}^∞ P[S1 = s, A0 = a_i | S0]

▶ Conditioning on A0

P[S1 = s | S0] = Σ_{i=1}^∞ P[S1 = s | A0 = a_i, S0] P[A0 = a_i | S0]
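A tiny numeric check of this marginalization, with illustrative numbers that are not from the slides:

```python
# P[S1 = s | S0] = sum_i P[S1 = s | A0 = a_i, S0] * P[A0 = a_i | S0]
policy = [0.5, 0.5]    # P[A0 = a_i | S0] for two hypothetical actions
trans = [0.8, 0.2]     # P[S1 = s | A0 = a_i, S0] for the same actions

p_marginal = sum(pa * pt for pa, pt in zip(policy, trans))
print(p_marginal)      # 0.5*0.8 + 0.5*0.2 = 0.5
```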
▶ Note that once we have defined the policy, we have a Markov chain
T_T = (S0, A0, R1, S1, A1, . . . , R_{T−1}, S_{T−1}, A_{T−1}, RT, ST)

▶ T can be finite or infinite and it is called the horizon
▶ We want to compute the probability of a given trajectory, that is, P[T_T]

P[T_T] = P[S0, A0, R1, S1, A1, . . . , R_{T−1}, S_{T−1}, A_{T−1}, RT, ST]

▶ Let us condition on T_{T−1} = (S0, A0, R1, S1, A1, . . . , R_{T−1}, S_{T−1})
⇒ By the Markov property, P[T_T] = π(A_{T−1} | S_{T−1}) p(ST, RT | S_{T−1}, A_{T−1}) P[T_{T−1}]
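A sketch of this factorization on a hypothetical finite MDP (all names and numbers below are assumptions for illustration, and rewards are marginalized out):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi[s][a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s][a][s']

def trajectory_prob(states, actions, p0):
    """P[T_T] = p0(s0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a] * P[states[t], a, states[t + 1]]
    return prob

print(trajectory_prob([0, 1, 0], [1, 0], p0=np.array([1.0, 0.0])))
```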
Value Function
▶ Since we are talking about sequential processes, we will use the notation

P(St+1 ∈ S′, Rt+1 ∈ R′ | St, At) = ∫_{S′×R′} p(s, r | St, At) ds dr

▶ The transition dynamics are (let us forget about the rewards)

St+1 = St + At + ξt

▶ And we also have the expression for the density of the one-transition case

p(s′ | s) = ∫_A p(s′ | s, a) π(a | s) da
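A sampling sketch of these dynamics (the Gaussian choice for ξt and the proportional policy are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1  # assumed noise scale

def step(s, a):
    # S_{t+1} = S_t + A_t + xi_t, with xi_t ~ N(0, sigma^2) assumed
    return s + a + sigma * rng.standard_normal()

s = 0.0
for t in range(5):
    a = -0.5 * s          # hypothetical policy pushing the state to 0
    s = step(s, a)
    print(t, round(s, 4))
```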
Value Function
▶ Define the return after time t as the sum of the rewards from time t + 1 onward
▶ For episodic tasks with finite horizon T

Gt = Rt+1 + Rt+2 + · · · + RT

▶ The goal is to balance the pole in the upright position for as long as possible
⇒ It is a continuing task ⇒ T = ∞, γ ∈ (0, 1)
⇒ A possible reward: Rt = 1 if θt ∈ [170, 190] (degrees), zero otherwise
▶ We also need the cart to stay in a specified range
⇒ A possible reward: Rt = −1 if |x| > 1
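A sketch of one way to combine the two reward terms above into a single function (summing them is an assumption; the slide lists them separately):

```python
def reward(theta_deg: float, x: float) -> float:
    # +1 while the pole angle is near upright (170-190 degrees)
    r = 1.0 if 170.0 <= theta_deg <= 190.0 else 0.0
    # -1 whenever the cart leaves the allowed range |x| <= 1
    if abs(x) > 1.0:
        r -= 1.0
    return r

print(reward(180.0, 0.0))  # 1.0: balanced pole, centered cart
print(reward(120.0, 1.5))  # -1.0: fallen pole, cart out of range
```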
Value Function
[Figure: happy/sad MDP on states H and S; for each action (e.g. drink, study) the arrows are labeled with a transition probability p ∈ {0.2, 0.8, 1} and a reward r ∈ {−10, 10, 20, 40}.]
▶ Some actions are good despite being bad in the short term
⇒ Studying while happy ⇒ it is hard to assign it credit
▶ Exploration vs. exploitation (see the ε-greedy sketch below)
⇒ If we start happy and we drink, we might think it is the best option
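One standard remedy for locking onto "drink" too early is ε-greedy action selection; a minimal sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore uniformly at random,
    # otherwise exploit the action with the best current estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

print(epsilon_greedy(np.array([1.0, 0.5])))  # usually 0, sometimes 1
```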
▶ The policy of an agent is the rule by which the actions are chosen
▶ Formally, it is a mapping from states to probability distributions over actions
▶ If the agent applies the policy π at time t, then π(a|s) is the probability of choosing At = a given that the state is St = s
▶ Policies are stationary in the RL framework (they do not depend on t)
▶ To evaluate the quality of the policy we use the expected return (ER)
▶ The value function is the ER when starting in s and following π
"∞ #
X k
vπ (s) = Eπ [Gt |St = s] = Eπ γ Rt+k+1 |St = s , for all s ∈ S
k=0
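A Monte Carlo sketch of this definition on the two-state chain from earlier (the per-state rewards and the truncation of the infinite sum are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2], [0.3, 0.7]])  # state transitions under pi
r = np.array([1.0, -1.0])               # assumed reward on entering a state
gamma = 0.9

def episode_return(s, steps=200):
    g, discount = 0.0, 1.0
    for _ in range(steps):               # truncates the infinite sum
        s = rng.choice(2, p=P[s])
        g += discount * r[s]
        discount *= gamma
    return g

print(np.mean([episode_return(0) for _ in range(2000)]))  # ~v_pi(H)
```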
▶ Notice that for any j ≠ k,  p(r_k) = ∫_R p(r_j, r_k) dr_j

vπ(s) = Σ_{k=0}^∞ γ^k ∫_R r_k p(r_k | S0 = s) dr_k

▶ It is convenient to write

p(r_k | S0 = s0) = ∫_{S²×A×R} p(r_k, s_k | s_{k−1}, a_{k−1}) π(a_{k−1} | s_{k−1}) p(s_{k−1}, r_{k−1} | s0) ds_k dr_{k−1} ds_{k−1} da_{k−1}
π⋆ = argmax_π vπ(s)
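For finite chains the value function can also be computed exactly: since vπ = rπ + γ Pπ vπ, it solves a linear system. A sketch with the same assumed numbers as the Monte Carlo example above:

```python
import numpy as np

P = np.array([[0.8, 0.2], [0.3, 0.7]])  # transitions under the policy
r_next = np.array([1.0, -1.0])          # assumed reward on entering a state
gamma = 0.9

r_pi = P @ r_next                        # expected immediate reward per state
v = np.linalg.solve(np.eye(2) - gamma * P, r_pi)
print(v)  # matches the Monte Carlo estimate up to sampling/truncation error
```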
Value Function
▶ Recall that the return is given by Gt = Σ_{k=0}^∞ γ^k R_{t+k+1}
▶ Splitting off the first term and substituting l = k − 1,

Gt = Rt+1 + γ Σ_{k=1}^∞ γ^{k−1} R_{t+k+1} = Rt+1 + γ Σ_{l=0}^∞ γ^l R_{t+1+l+1} = Rt+1 + γ Gt+1

▶ Taking expectations of both sides yields the Bellman equations

vπ(s) = Eπ[ Rt+1 + γ vπ(St+1) | St = s ]
qπ(s, a) = E[ Rt+1 + γ vπ(St+1) | St = s, At = a ]
▶ Let us define a "better" policy than π, for instance π′(s) = argmax_{a∈A} qπ(s, a)
▶ In which sense is the policy π′ "better" than π? (see the sketch below)
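A sketch of this greedy improvement step (the q-table values are hypothetical):

```python
import numpy as np

q = np.array([[1.0, 2.5],   # hypothetical q_pi(s, a), one row per state
              [0.3, -0.8]])

pi_improved = q.argmax(axis=1)  # pi'(s) = argmax_a q_pi(s, a)
print(pi_improved)              # action index chosen in each state
```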
▶ Let us check the Bellman optimality equation for the Grid World
▶ Let's focus on the state at the top left corner, v⋆(s) = 22, with γ = 0.9
⇒ up or left ⇒ Rt+1 + γ v⋆(s′) = −1 + 0.9 × 22 = 18.8
⇒ down ⇒ Rt+1 + γ v⋆(s′) = 0 + 0.9 × 19.8 = 17.82
⇒ right ⇒ Rt+1 + γ v⋆(s′) = 0 + 0.9 × 24.4 = 21.96
⇒ The maximum, 21.96, recovers v⋆(s) = 22 up to rounding, as the optimality equation requires
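The same check in code (rewards, successor values, and γ copied from the bullet points above):

```python
gamma = 0.9
# action: (immediate reward, v*(s') of the resulting state)
candidates = {
    "up":    (-1.0, 22.0),   # bumping off the grid: r = -1, stay in place
    "left":  (-1.0, 22.0),
    "down":  ( 0.0, 19.8),
    "right": ( 0.0, 24.4),
}
backups = {a: r + gamma * v for a, (r, v) in candidates.items()}
print(backups)                # "right" attains the maximum, 21.96
print(max(backups.values()))  # ~v*(s) = 22 up to rounding
```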