
Solutions to Reinforcement Learning by Sutton

Chapter 5
Yifan Wang

May 2019

Exercise 5.1

1. This is due to the policy that the player keeps hitting until reaching 20 or 21. Hitting carries a high risk of going bust, which produces the low values just before 20 and 21. At 20 or 21, however, the player sticks and is very likely to win, especially since the dealer sticks at 17 or higher.

2. The value drops along the whole last row on the left because when the dealer shows an ace, the dealer has a good chance of beating the player, since the ace can count as 11. The value of the dealer's ace therefore already reflects the dealer's winning chances from being able to use it as either 1 or 11. No other card has this flexibility, which is why the ace creates the gap.

3. The frontmost values are higher in the upper diagrams because there the player holds a usable ace, which can count as either 1 or 11. This flexibility makes the player better off, for the same reason the dealer's ace causes the drop in the leftmost row.

Exercise 5.2

No. In blackjack the same state never occurs twice within an episode, so the first-visit and every-visit methods are essentially the same.



Exercise 5.3

A drawing problem, so I will not draw it here. The diagram has the max over $q_\pi$ at the bottom.

Exercise 5.4

Recall from Section 2.4:


\begin{align*}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i \\
&= \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big) \\
&= \frac{1}{n}\Big(R_n + (n-1)\,\frac{1}{n-1}\sum_{i=1}^{n-1} R_i\Big) \\
&= \frac{1}{n}\big(R_n + (n-1)Q_n\big) \\
&= \frac{1}{n}\big(R_n + nQ_n - Q_n\big) \\
&= Q_n + \frac{1}{n}\big(R_n - Q_n\big)
\end{align*}

Then the pseudocode for Monte Carlo ES can be improved in the same incremental way:


\begin{align*}
Q_n(S_t, A_t) &= \frac{1}{n}\sum_{i=1}^{n} G_i(S_t, A_t) \\
&= \frac{1}{n}\Big(G_n(S_t, A_t) + \sum_{i=1}^{n-1} G_i(S_t, A_t)\Big) \\
&= \frac{1}{n}\Big(G_n(S_t, A_t) + (n-1)\,\frac{1}{n-1}\sum_{i=1}^{n-1} G_i(S_t, A_t)\Big) \\
&= \frac{1}{n}\big(G_n(S_t, A_t) + (n-1)Q_{n-1}(S_t, A_t)\big) \\
&= \frac{1}{n}\big(G_n(S_t, A_t) + nQ_{n-1}(S_t, A_t) - Q_{n-1}(S_t, A_t)\big) \\
&= Q_{n-1}(S_t, A_t) + \frac{1}{n}\big(G_n(S_t, A_t) - Q_{n-1}(S_t, A_t)\big)
\end{align*}
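As a rough illustration, here is a minimal Python sketch (my own naming, not the book's boxed pseudocode) of how the Returns$(S_t, A_t)$ lists can be replaced by a visit counter $N(S_t, A_t)$ together with the incremental update derived above:

from collections import defaultdict

Q = defaultdict(float)   # incremental action-value estimates Q(s, a)
N = defaultdict(int)     # number of returns averaged into each Q(s, a)

def update_episode(episode, gamma=1.0):
    """Incremental first-visit update; episode is a list of (state, action, reward)."""
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = gamma * G + r                       # return following time t
        first_visit = (s, a) not in [(x[0], x[1]) for x in episode[:t]]
        if first_visit:
            N[(s, a)] += 1
            # Q_n = Q_{n-1} + (G_n - Q_{n-1}) / n, replacing the Returns list
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]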

Exercise 5.5

Because the target and behavior policies are the same, every importance-sampling ratio equals one:
\[
\rho_{t:T(t)-1} = 1, \qquad \sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1} = |\mathcal{J}(s)| = 10.
\]
The off-policy estimators therefore reduce to the on-policy ones, and:

first-visit:
\[
V(s) = 10
\]

every-visit:
\[
V(s) = \frac{1}{10}(1 + 2 + 3 + \dots + 10) = 5.5
\]
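As a quick sanity check, the two estimates can be reproduced numerically for the single episode described in the exercise (ten steps, reward +1 per step, undiscounted, the same state visited at every step); a small Python sketch:

rewards = [1] * 10                                          # ten steps, +1 each, gamma = 1

# the return following each visit t counts the remaining rewards
returns = [sum(rewards[t:]) for t in range(len(rewards))]   # [10, 9, ..., 1]

first_visit_estimate = returns[0]                           # 10
every_visit_estimate = sum(returns) / len(returns)          # (10 + 9 + ... + 1) / 10 = 5.5
print(first_visit_estimate, every_visit_estimate)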


Exercise 5.6

Equation (5.6):
\[
V(s) \doteq \frac{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{J}(s)} \rho_{t:T(t)-1}}
\]

For $Q$, the action $A_t$ at time $t$ is given, so the first importance-sampling factor is dropped and the ratio starts at $t+1$:
\[
Q(s,a) \doteq \frac{\sum_{t \in \mathcal{J}(s,a)} \rho_{t+1:T(t)-1}\, G_t}{\sum_{t \in \mathcal{J}(s,a)} \rho_{t+1:T(t)-1}}
\]
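To make the formula concrete, here is a small Python sketch (my own naming and episode format, not code from the book) that computes this weighted importance-sampling estimate of $Q(s,a)$ from a batch of recorded episodes:

def weighted_is_q(episodes, pi, b, s, a, gamma=1.0):
    """Weighted IS estimate of Q(s, a).
    episodes: lists of (state, action, reward); pi(a, s), b(a, s): policy probabilities."""
    num, den = 0.0, 0.0
    for ep in episodes:
        for t, (s_t, a_t, _) in enumerate(ep):
            if (s_t, a_t) != (s, a):
                continue
            # return G_t following time t
            G = sum(gamma ** k * r for k, (_, _, r) in enumerate(ep[t:]))
            # the ratio starts at t + 1 because the action at time t is given
            rho = 1.0
            for s_k, a_k, _ in ep[t + 1:]:
                rho *= pi(a_k, s_k) / b(a_k, s_k)
            num += rho * G
            den += rho
    return num / den if den != 0 else 0.0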

4
Exercise 5.7

The weighted-average (weighted importance-sampling) estimator needs a few episodes before its bias starts to decrease. In particular, when some ratio $\rho_{t:T(t)-1}$ is large, the early estimate is pulled strongly toward those episodes, which is why the error first rises. Once enough episodes have been collected, the average stabilizes and the bias shrinks.

Exercise 5.8

For the first-visit method we have
\[
\mathbb{E}_b\!\left[\left(\prod_{t=0}^{T-1} \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\, G_0\right)^{\!2}\,\right].
\]

For the every-visit method there is some difference:
\begin{align*}
&\mathbb{E}_b\!\left[\left(\frac{1}{T-1}\sum_{k=1}^{T-1}\prod_{t=0}^{k} \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\, G_0\right)^{\!2}\,\right] \\
&= \underbrace{0.5}_{\text{policy prob.}} \cdot \underbrace{0.1}_{\text{prob.\ of length 1}} \cdot \underbrace{2^{2\cdot1}}_{\text{square of }\rho} \\
&\quad + \frac{1}{2}\Big(\underbrace{0.5\cdot0.9\cdot0.5\cdot0.1\cdot2^{2\cdot2}}_{\text{second visit}} + \underbrace{0.5\cdot0.1\cdot2^{2\cdot1}}_{\text{first visit}}\Big) \\
&\quad + \frac{1}{3}\Big(\underbrace{0.5\cdot0.9\cdot0.5\cdot0.9\cdot0.5\cdot0.1\cdot2^{2\cdot3}}_{\text{third visit}} + \underbrace{0.5\cdot0.9\cdot0.5\cdot0.1\cdot2^{2\cdot2}}_{\text{second visit}} + \underbrace{0.5\cdot0.1\cdot2^{2\cdot1}}_{\text{first visit}}\Big) \\
&\quad + \dots \\
&= 0.1 \sum_{k=1}^{\infty} \frac{1}{k} \sum_{l=0}^{k-1} 0.9^{l}\cdot 2^{l}\cdot 2 \\
&= 0.2 \sum_{k=1}^{\infty} \frac{1}{k} \sum_{l=0}^{k-1} 1.8^{l} = \infty.
\end{align*}

On the other hand, consider the weighted-average method:
\begin{align*}
&\mathbb{E}_b\!\left[\left(\frac{1}{|\mathcal{J}(s)|}\sum_{k=1}^{T-1}\prod_{t=0}^{k} \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\, G_0\right)^{\!2}\,\right] \\
&\quad \dots \\
&\le 0.1 \sum_{k=1}^{\infty} \frac{1}{k}\,\frac{1}{\big(\sum_{i=0}^{k} 2^{i}\big)^{2}} \sum_{l=0}^{k-1} 0.9^{l}\cdot 2^{l}\cdot 2 \\
&\le 0.2 \sum_{k=1}^{\infty} \frac{1}{k}\,\frac{1}{2^{2k}} \sum_{l=0}^{k-1} 1.8^{l} \\
&\le 0.2 \sum_{k=1}^{\infty} \frac{1}{k}\, 2^{-2k}\, k\, 1.8^{k} \\
&\le \sum_{k=1}^{\infty} \frac{1}{k}\, 2^{-2k}\, k\, 2^{k} \\
&= \sum_{k=1}^{\infty} 2^{-k} = 1
\end{align*}
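To see the contrast empirically, here is a small simulation sketch of Example 5.5's one-state MDP (my own code: under the behavior policy both actions are equally likely, the target policy always chooses left, left loops back with probability 0.9 and otherwise terminates with reward +1, right terminates with reward 0). The ordinary importance-sampling estimate keeps fluctuating wildly across runs, reflecting the infinite variance derived above:

import random

def generate_episode():
    """One episode under the behavior policy; returns (actions, final reward)."""
    actions = []
    while True:
        a = random.choice(["left", "right"])
        actions.append(a)
        if a == "right":
            return actions, 0.0           # right: terminate with reward 0
        if random.random() < 0.1:
            return actions, 1.0           # left terminates with prob 0.1, reward +1

def ordinary_is_estimate(n_episodes):
    total = 0.0
    for _ in range(n_episodes):
        actions, G = generate_episode()
        # pi(left) = 1, b(left) = 0.5, so each factor is 2; any 'right' makes rho = 0
        rho = 0.0 if "right" in actions else 2.0 ** len(actions)
        total += rho * G
    return total / n_episodes

print([round(ordinary_is_estimate(100_000), 3) for _ in range(5)])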

Exercise 5.9

Similar to Exercise 5.4:


\begin{align*}
V_n(S_t) &= \frac{1}{n}\sum_{i=1}^{n} G_i(t) \\
&= \dots \\
&= V_{n-1}(S_t) + \frac{1}{n}\big(G_n(t) - V_{n-1}(S_t)\big)
\end{align*}


Exercise 5.10

\begin{align*}
V_{n+1} &\doteq \frac{\sum_{k=1}^{n} W_k G_k}{\sum_{k=1}^{n} W_k} \\
&= \frac{W_n G_n}{\sum_{k=1}^{n} W_k} + \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k} \cdot \frac{\sum_{k=1}^{n-1} W_k}{\sum_{k=1}^{n} W_k} \\
&= \frac{W_n G_n}{C_n} + V_n\,\frac{C_{n-1}}{C_n} \\
&= V_n + \frac{W_n G_n}{C_n} + V_n\,\frac{C_{n-1}}{C_n} - V_n \\
&= V_n + \frac{W_n G_n}{C_n} + \frac{V_n C_{n-1} - V_n C_n}{C_n} \\
&= V_n + \frac{W_n G_n}{C_n} - \frac{V_n W_n}{C_n} \\
&= V_n + \frac{W_n}{C_n}\big(G_n - V_n\big),
\end{align*}
using $C_n = C_{n-1} + W_n = \sum_{k=1}^{n} W_k$.
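A minimal sketch of this update rule as it would sit inside an off-policy prediction loop (my own variable names); $C$ accumulates the weights and $V$ moves toward $G_n$ by a step of $W_n / C_n$:

def weighted_update(V, C, G, W):
    """One step of the incremental weighted-average update derived above."""
    C = C + W
    V = V + (W / C) * (G - V)
    return V, C

# example: three weighted returns
V, C = 0.0, 0.0
for G, W in [(1.0, 2.0), (0.0, 4.0), (1.0, 1.0)]:
    V, C = weighted_update(V, C, G, W)
print(V)   # 3/7, the same as (2*1 + 4*0 + 1*1) / (2 + 4 + 1)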

Exercise 5.11

W is only updated when $A_t = \pi(S_t)$; otherwise the inner loop exits. Since $\pi$ is deterministic, $\pi(A_t \mid S_t) = 1$ whenever the update is reached, so the ratio $\pi(A_t \mid S_t)/b(A_t \mid S_t)$ reduces to $1/b(A_t \mid S_t)$.
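A sketch of the inner loop of off-policy MC control (my own naming; b_prob(a, s) stands for the behavior policy's probability), showing why the weight update can use $1/b(A_t \mid S_t)$: the loop exits as soon as $A_t \neq \pi(S_t)$, and whenever the update is reached the deterministic greedy policy gives $\pi(A_t \mid S_t) = 1$:

def update_from_episode(episode, Q, C, pi, b_prob, actions, gamma=1.0):
    """episode: list of (S, A, R); Q, C: dicts keyed by (state, action); pi: dict keyed by state."""
    G, W = 0.0, 1.0
    for S, A, R in reversed(episode):
        G = gamma * G + R
        C[(S, A)] = C.get((S, A), 0.0) + W
        Q[(S, A)] = Q.get((S, A), 0.0) + (W / C[(S, A)]) * (G - Q[(S, A)])
        pi[S] = max(actions, key=lambda a: Q.get((S, a), 0.0))   # greedy, deterministic
        if A != pi[S]:
            break                        # pi(A|S) = 0, so the ratio becomes zero
        W *= 1.0 / b_prob(A, S)          # pi(A|S) = 1 for the greedy action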

Exercise 5.12

Programming problem; see GitHub.

Exercise 5.13

From (5.12),
\[
\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}\,\frac{\pi(A_{t+1} \mid S_{t+1})}{b(A_{t+1} \mid S_{t+1})}\,\frac{\pi(A_{t+2} \mid S_{t+2})}{b(A_{t+2} \mid S_{t+2})} \cdots \frac{\pi(A_{T-1} \mid S_{T-1})}{b(A_{T-1} \mid S_{T-1})}\, R_{t+1}.
\]

However, by (5.13) each individual factor has expectation one:
\[
\mathbb{E}\!\left[\frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}\right] \doteq \sum_a b(a \mid S_k)\,\frac{\pi(a \mid S_k)}{b(a \mid S_k)} = \sum_a \pi(a \mid S_k) = 1.
\]

Moreover, every importance-sampling factor after time $t$ is independent of $R_{t+1}$, so the expectation of the product factorizes:
\[
\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y].
\]

Going back to (5.12) and applying (5.13) to each factor after time $t$, we indeed obtain (5.14):
\[
\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}].
\]
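As a quick numerical check of (5.14), consider a toy two-step episode (my own construction: two actions, the behavior policy is uniform, the target policy picks action 0 with probability 0.7, and the reward $R_1$ depends only on $A_0$). Both expectations converge to the same value:

import random

pi = {0: 0.7, 1: 0.3}      # target policy probabilities
b  = {0: 0.5, 1: 0.5}      # behavior policy probabilities

def sample():
    a0 = random.choice([0, 1])
    a1 = random.choice([0, 1])
    r1 = 1.0 if a0 == 0 else 0.0                       # R_1 depends only on A_0
    full  = (pi[a0] / b[a0]) * (pi[a1] / b[a1]) * r1   # rho_{0:1} * R_1
    short = (pi[a0] / b[a0]) * r1                      # rho_{0:0} * R_1
    return full, short

n = 200_000
sum_full = sum_short = 0.0
for _ in range(n):
    f, s = sample()
    sum_full += f
    sum_short += s
print(sum_full / n, sum_short / n)   # both approach 0.7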

Exercise 5.14

Skip 
