Chapter 6
Yifan Wang
Aug 2019
Exercise 6.1
By the setup, δt = Rt+1 + γVt(St+1) − Vt(St).
And the TD update is Vt+1(S) = Vt(S) + α[Rt+1 + γVt(S′) − Vt(S)], so each step changes the value of the updated state by αδt and leaves all other values unchanged.
Let ut+1 = Vt+1(St+1) − Vt(St+1) denote the change that the step-t update makes to the value of St+1. With these definitions in mind we continue as follows:

Gt − Vt(St) = Rt+1 + γGt+1 − Vt(St) + γVt(St+1) − γVt(St+1)
= δt + γ[Gt+1 − Vt(St+1)]
= δt + γ[Gt+1 − Vt+1(St+1)] + γut+1
= Σ_{k=t}^{T−1} γ^{k−t} δk + Σ_{k=t}^{T−1} γ^{k−t+1} uk+1

Thus, the additional amount that must be added to the sum of TD errors is Σ_{k=t}^{T−1} γ^{k−t+1} uk+1.
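
Below is a small numerical check of this identity (my own sketch, not part of the solution): it runs TD(0) on one episode of the 5-state random walk, applies the updates during the episode, and confirms that both sides agree for every t. All names and constants are illustrative.

import numpy as np

# Verify G_t − V_t(S_t) = Σ_k γ^{k−t} δ_k + Σ_k γ^{k−t+1} u_{k+1}
# with TD(0) updates applied during a single random-walk episode.
rng = np.random.default_rng(0)
alpha, gamma = 0.1, 1.0
V = np.zeros(7)                      # states 1..5 are A..E; 0 and 6 terminal
s, steps = 3, []                     # start in the middle (state C)
while s not in (0, 6):
    s_next = s + rng.choice((-1, 1))
    r = 1.0 if s_next == 6 else 0.0
    delta = r + gamma * V[s_next] - V[s]
    V_new = V.copy()
    V_new[s] += alpha * delta        # the TD(0) update made at this step
    u = V_new[s_next] - V[s_next]    # u_{t+1}: change to the value of S_{t+1}
    steps.append((r, delta, u, V[s]))
    V, s = V_new, s_next

T = len(steps)
rewards = [st[0] for st in steps]
for t in range(T):
    G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    rhs = sum(gamma ** (k - t) * steps[k][1] +
              gamma ** (k - t + 1) * steps[k][2] for k in range(t, T))
    assert abs((G_t - steps[t][3]) - rhs) < 1e-12
print("identity holds for every t in this episode")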
Exercise 6.2
In the driving-home scenario, only the states near the changed part of the route need their values adjusted, and TD can do exactly that by bootstrapping from the still-accurate old estimates. Monte Carlo would instead bring an average improvement to all states, especially the highway parts, which do not need much adjustment at all but would still fluctuate under Monte Carlo updates. Finally, TD handles such empirical problems well, as Example 6.2 will show.
Exercise 6.3
Exercise 6.4
…or changing the weight given to different states would be handy; this will be covered in much later chapters.
Exercise 6.5
Exercise 6.6
Let PX(L) denote the probability of eventually terminating on the left when starting from state X, and PX(R) the probability of terminating on the right. Then:

PE(R) = 1 − PE(L)
= 1 − PE(D) · PD(L)
= 1 − PE(D) · [PD(C) · PC(L) + PD(E) · PE(L)]

By symmetry, PC(L) = 0.5. Thus:

PE(L) = 0.5 · [0.5 · 0.5 + 0.5 · PE(L)]

We have PE(L) = 1/6, and so the true value of E is PE(R) = 5/6.
Similarly, one can easily work out the rest of the states. I think the book used this method because it is simple and mathematically exact.
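
The other natural method is to solve the Bellman equations directly as a linear system; here is a minimal NumPy sketch (my own illustration) that recovers 1/6, 2/6, ..., 5/6.

import numpy as np

# Solve v(s) = 0.5*v(s-1) + 0.5*v(s+1) for states A..E, with
# v(left terminal) = 0 and v(right terminal) = 1.
n = 5                               # states A, B, C, D, E
A = np.zeros((n, n))
b = np.zeros(n)
for i in range(n):
    A[i, i] = 1.0
    if i > 0:
        A[i, i - 1] = -0.5          # neighbor to the left
    if i < n - 1:
        A[i, i + 1] = -0.5          # neighbor to the right
b[n - 1] = 0.5                      # right neighbor of E is the +1 terminal
v = np.linalg.solve(A, b)
print(v)                            # [1/6, 2/6, 3/6, 4/6, 5/6]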
Exercise 6.7
Exercise 6.8
G(St, At) − Q(St, At) = Rt+1 + γG(St+1, At+1) − Q(St, At) + γQ(St+1, At+1) − γQ(St+1, At+1)
= δt + γ[G(St+1, At+1) − Q(St+1, At+1)]
= δt + γδt+1 + γ²[G(St+2, At+2) − Q(St+2, At+2)]
= Σ_{k=t}^{T−1} γ^{k−t} δk

where δt = Rt+1 + γQ(St+1, At+1) − Q(St, At) and, as in (6.6), Q is assumed not to change during the episode.
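
As a quick check of this identity (my own sketch, with made-up numbers): hold Q fixed over an arbitrary episode and compare Gt − Q(St, At) against the discounted sum of TD errors.

import numpy as np

# For an arbitrary episode and an arbitrary *fixed* Q, the MC error
# G_t − Q(S_t, A_t) should equal the sum of discounted TD errors.
rng = np.random.default_rng(1)
gamma = 0.9
T = 6
rewards = rng.normal(size=T)                 # R_1 .. R_T
q = rng.normal(size=T + 1)                   # Q(S_t, A_t) for t = 0..T
q[T] = 0.0                                   # terminal: value 0
deltas = rewards + gamma * q[1:] - q[:-1]    # delta_t for t = 0..T-1
for t in range(T):
    G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
    td_sum = sum(gamma ** (k - t) * deltas[k] for k in range(t, T))
    assert abs((G_t - q[t]) - td_sum) < 1e-12
print("identity verified")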
Exercise 6.9
Exercise 6.10
Exercise 6.11
This is the best answer I have seen so far, so I will just quote it: link. (I tried to write it in my own words several times, but not well enough.)
Exercise 6.12
During the update, Sarsa uses the greedy action A′ it already chose from S′, backs up Q(S′, A′), and then advances with A′. Q-learning, however, uses the updated Q to re-choose the greedy action A*. Because A′ may not be the same action as the new greedy choice A* that maximizes Q(S′, A*), the two methods can behave differently.
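
A toy sketch of the point (my own, with made-up numbers): in a state that transitions to itself, the Sarsa agent has already committed to the old greedy action, while the Q-learning agent re-chooses after the update. The Q values coincide here; the next actions differ.

import numpy as np

alpha, gamma = 0.5, 1.0
Q = np.array([[1.0, 1.1]])        # one state, two actions; action 1 is greedy
s = s2 = 0                        # the state transitions to itself
a = int(np.argmax(Q[s]))          # greedy: action 1
r = -5.0                          # a bad reward for the greedy action

# Sarsa: choose A' greedily *before* updating, then back up Q(S', A')
q_sarsa = Q.copy()
a2 = int(np.argmax(q_sarsa[s2]))  # still action 1
q_sarsa[s, a] += alpha * (r + gamma * q_sarsa[s2, a2] - q_sarsa[s, a])

# Q-learning: back up max_a Q(S', a); the *next* action is then
# re-chosen greedily from the updated Q and may no longer be action 1
q_qlearn = Q.copy()
q_qlearn[s, a] += alpha * (r + gamma * np.max(q_qlearn[s2]) - q_qlearn[s, a])
next_a = int(np.argmax(q_qlearn[s2]))

print(q_sarsa, "next action:", a2)        # Sarsa is committed to action 1
print(q_qlearn, "next action:", next_a)   # Q-learning re-chooses action 0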
Exercise 6.13
(by burmecia)
" #
X
Q1 (St , At ) ← Q1 (St , At ) + α Rt+1 + γ π(a|St+1 )Q2 (St+1 , a) − Q1 (St , At )
a
1 − +
|A(s)| if a =a (Q1 (s, a0 ) + Q2 (s, a0 ))
π(a|s) =
otherwise
|A(s)|
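
A minimal sketch of one such update step (my own rendering of the equations above; all names are illustrative):

import numpy as np

def eps_greedy_probs(q_sum_row, eps):
    """Epsilon-greedy distribution over actions w.r.t. Q1 + Q2."""
    n = len(q_sum_row)
    probs = np.full(n, eps / n)
    probs[np.argmax(q_sum_row)] += 1.0 - eps
    return probs

def double_expected_sarsa_update(Q1, Q2, s, a, r, s2, alpha, gamma, eps, rng):
    # With probability 0.5, swap the roles of Q1 and Q2.
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1
    pi = eps_greedy_probs(Q1[s2] + Q2[s2], eps)   # policy from Q1 + Q2
    target = r + gamma * np.dot(pi, Q2[s2])       # expectation under pi of Q2
    Q1[s, a] += alpha * (target - Q1[s, a])

# Example usage with made-up numbers:
rng = np.random.default_rng(0)
Q1, Q2 = np.zeros((3, 2)), np.zeros((3, 2))
double_expected_sarsa_update(Q1, Q2, s=0, a=1, r=1.0, s2=2,
                             alpha=0.1, gamma=0.9, eps=0.1, rng=rng)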
Exercise 6.14