The agent has no knowledge of the MDP. We denote the return for state $s$ in trajectory $i$ as $G_{s,i}$. Starting from $S_0 = A$ and under policy $\pi$, we obtain the trajectories and returns ($\gamma = 1$) of the first $N = 3$ episodes:
Episode 0: $A \xrightarrow{D,\,+1} C$, so $G_{A,0} = 1$
Episode 1: $A \xrightarrow{R,\,-1} B \xrightarrow{D,\,+1} C$, so $G_{A,1} = -1 + 1 = 0$
Episode 2: $A \xrightarrow{D,\,+1} C$, so $G_{A,2} = 1$
Exercise: Monte Carlo prediction for 3-state grid world
From these $N = 3$ episodes, the agent can estimate the state-value function for A under policy $\pi$:

$$v_\pi(A) = \mathbb{E}_\pi[G_t \mid S_t = A] \approx \frac{1}{N}\sum_{i=0}^{N-1} G_{A,i} = \frac{1}{3}\bigl(G_{A,0} + G_{A,1} + G_{A,2}\bigr) = \frac{2}{3}$$
Similarly, the value function for state B under $\pi$ can be estimated by the agent from experience. B is visited only in episode 1, so the estimate averages a single return:

$$v_\pi(B) = \mathbb{E}_\pi[G_t \mid S_t = B] \approx \frac{1}{N_B}\sum_{j} G_{B,j} = G_{B,1} = 1$$

where the sum runs over the $N_B = 1$ episodes in which B is visited.
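To make the computation concrete, here is a minimal Python sketch of first-visit Monte Carlo prediction on these three episodes. The tuple encoding and variable names are illustrative assumptions, not code from the course:

```python
# Sketch: first-visit Monte Carlo estimates from the three episodes above.
# Each episode is a list of (state, action, reward) steps; the terminal
# state C is not stored as a step. gamma = 1, as in the exercise.
episodes = [
    [("A", "D", +1)],                    # episode 0: A --D/+1--> C
    [("A", "R", -1), ("B", "D", +1)],    # episode 1: A --R/-1--> B --D/+1--> C
    [("A", "D", +1)],                    # episode 2: A --D/+1--> C
]

returns = {}                             # state -> list of first-visit returns
for episode in episodes:
    seen = set()
    for t, (s, _, _) in enumerate(episode):
        if s not in seen:                # first visit to s in this episode
            seen.add(s)
            g = sum(r for _, _, r in episode[t:])   # gamma = 1: plain sum of rewards
            returns.setdefault(s, []).append(g)

v_hat = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(v_hat)   # {'A': 0.666..., 'B': 1.0}
```

The averages reproduce the estimates above: $v_\pi(A) \approx 2/3$ and $v_\pi(B) \approx 1$.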
In DP:
$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_{\pi_*}$$

Both policy evaluation and improvement rely on the Bellman equation:

$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\,\bigl(r + \gamma v_k(s')\bigr)$$

$$\pi_{k+1}(s) = \arg\max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl(r + \gamma v_{\pi_k}(s')\bigr)$$
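A minimal sketch of this DP evaluation/improvement loop, assuming the dynamics are given as a dict mapping (state, action) to a list of (probability, next_state, reward) tuples, and that the policy is deterministic (the form the greedy improvement produces). States absent from `actions` are treated as terminal with value 0; all names are illustrative:

```python
def policy_evaluation(policy, p, states, gamma=1.0, theta=1e-8):
    """Iterative policy evaluation for a deterministic policy {s: a}."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(prob * (r + gamma * v.get(s2, 0.0))
                        for prob, s2, r in p[(s, a)])
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new
        if delta < theta:
            return v

def policy_improvement(v, p, actions, gamma=1.0):
    """Greedy policy w.r.t. v: argmax_a sum_{s',r} p(s',r|s,a)(r + gamma v(s'))."""
    return {s: max(acts, key=lambda a: sum(prob * (r + gamma * v.get(s2, 0.0))
                                           for prob, s2, r in p[(s, a)]))
            for s, acts in actions.items()}

def policy_iteration(p, actions, gamma=1.0):
    """Alternate evaluation (E) and improvement (I) until the policy is stable."""
    policy = {s: acts[0] for s, acts in actions.items()}
    while True:
        v = policy_evaluation(policy, p, list(actions), gamma)
        new_policy = policy_improvement(v, p, actions, gamma)
        if new_policy == policy:
            return policy, v
        policy = new_policy
```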
In MC:
$$\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_{\pi_*}$$

$\pi_{k+1}(s)$ is the greedy policy with respect to $q_k(s,a)$, for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$:

$$\pi_{k+1}(s) = \arg\max_a q_k(s,a)$$
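A minimal sketch of this MC version of generalized policy iteration, under these assumptions: `generate_episode(policy)` is an assumed helper that samples one episode as (state, action, reward) tuples with enough exploration (e.g. exploring starts) that every (s, a) pair keeps being visited, and every-visit incremental averaging is used for the $q$ estimates:

```python
from collections import defaultdict

def mc_control(generate_episode, actions, n_episodes, gamma=1.0):
    """MC control sketch: every-visit q estimates plus greedy policy improvement."""
    q = defaultdict(float)        # running average of returns for each (s, a)
    counts = defaultdict(int)
    policy = {s: acts[0] for s, acts in actions.items()}
    for _ in range(n_episodes):
        episode = generate_episode(policy)     # list of (state, action, reward)
        g = 0.0
        for s, a, r in reversed(episode):      # backward pass: G_t = r + gamma * G_{t+1}
            g = r + gamma * g
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]   # incremental mean
            policy[s] = max(actions[s], key=lambda a_: q[(s, a_)])  # greedy w.r.t. q
    return policy, q
```

Unlike DP, this update never uses $p(s',r \mid s,a)$: the agent improves the policy purely from sampled returns, which is why MC control estimates $q_\pi$ rather than $v_\pi$.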