Paper / Subject Code: 53173 / Reinforcement Learning (DLOC - V)
Time: 3 Hours Marks: 80
Note: 1. Question 1 is compulsory
2. Answer any three out of the remaining five questions.
3. Assume any suitable data wherever required and justify the same.
Q1 a) Describe generalized policy iteration. [5]
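As a point of reference for a), generalized policy iteration can be sketched on a made-up two-state MDP; the states, actions, rewards, and γ below are illustrative assumptions, not part of the question:

```python
# Hedged sketch: alternate policy evaluation and greedy improvement on a
# tiny deterministic MDP (states 0, 1; actions "stay"/"go"; made-up dynamics).
gamma = 0.9
P = {0: {"stay": (0, 0.0), "go": (1, 1.0)},   # P[s][a] = (next state, reward)
     1: {"stay": (1, 2.0), "go": (0, 0.0)}}
policy = {0: "stay", 1: "stay"}
V = {0: 0.0, 1: 0.0}
stable = False
while not stable:
    # policy evaluation: sweep until values settle under the current policy
    for _ in range(200):
        for s in P:
            s2, r = P[s][policy[s]]
            V[s] = r + gamma * V[s2]
    # policy improvement: act greedily with respect to the current V
    stable = True
    for s in P:
        best = max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])
        if best != policy[s]:
            policy[s] = best
            stable = False
print(policy)   # converges to {0: 'go', 1: 'stay'}
```

The point of the sketch is the interleaving: evaluation makes V consistent with the policy, improvement makes the policy greedy with respect to V, and the process stops when neither changes the other.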
b) Suppose γ = 0.9 and the reward sequence is R1 = 2 followed by an infinite sequence of 7s. What are G1 and G0? [5]
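A quick numerical check for b), assuming the standard discounted-return definition G_t = R_{t+1} + γ·G_{t+1} with R1 denoting the first reward:

```python
# Hedged check: gamma = 0.9, first reward 2, then an infinite tail of 7s.
gamma = 0.9
G1 = 7 / (1 - gamma)    # geometric series 7 + 0.9*7 + 0.9^2*7 + ... = 70.0
G0 = 2 + gamma * G1     # R1 + gamma * G1 = 2 + 63 = 65.0
print(G1, G0)
```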
c) What are the key features of reinforcement learning? [5]
d) Describe temporal-difference (TD) prediction with an example. [5]
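For d), the tabular TD(0) update can be sketched as follows; the state names, step size α, and γ are illustrative assumptions:

```python
# Hedged sketch of tabular TD(0) prediction: after observing a transition
# (s, r, s'), move V(s) a fraction alpha toward the bootstrapped target
# r + gamma * V(s').
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

V = {"A": 0.0, "B": 0.0}
td0_update(V, "A", 1.0, "B")   # V["A"] moves 10% of the way toward 1 + 0.9*0
```

Unlike Monte Carlo prediction, the update uses the current estimate V(s') instead of waiting for the full return, so learning happens after every step.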
Q2 a) Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4. [10]
Consider applying to this problem a bandit algorithm using ε-greedy action selection, sample-average action-value estimates, and initial estimates of Q1(a) = 0, for all a. Suppose the initial sequence of actions and rewards is A1 = 1, R1 = 1, A2 = 2, R2 = 1, A3 = 2, R3 = 2, A4 = 2, R4 = 2, A5 = 3, R5 = 0. On some of these time steps the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?
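One way to check the reasoning in Q2 a) is to replay the given sequence with sample-average estimates and flag the steps where the chosen action was not greedy; on those steps the ε case must have occurred, while it could have occurred on any step (this is a hedged sketch, not part of the question):

```python
# Replay the given action/reward sequence with sample-average Q estimates
# (Q1(a) = 0 for all a) and record steps whose action was not greedy.
actions = [1, 2, 2, 2, 3]
rewards = [1, 1, 2, 2, 0]
Q = {a: 0.0 for a in (1, 2, 3, 4)}
N = {a: 0 for a in (1, 2, 3, 4)}
definitely_random = []
for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = {x for x in Q if Q[x] == max(Q.values())}
    if a not in greedy:
        definitely_random.append(t)   # non-greedy choice: epsilon case for sure
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]         # incremental sample-average update
print(definitely_random)
```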
Q3 a) Illustrate through an example the use of Monte Carlo Methods. [10]
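As one possible illustration for Q3 a), first-visit Monte Carlo prediction averages sampled returns per state; the toy two-state chain below is an assumption, not from the paper:

```python
# Hedged illustration: first-visit Monte Carlo prediction on a toy episodic
# chain A --(+1)--> B --(+1)--> terminal, gamma = 1, fixed policy.
gamma = 1.0
episodes = [[("A", 1.0), ("B", 1.0)]] * 3   # (state, reward) pairs, 3 episodes
returns = {"A": [], "B": []}
for ep in episodes:
    G = 0.0
    first_visit_return = {}
    for state, reward in reversed(ep):      # accumulate the return backwards
        G = reward + gamma * G
        first_visit_return[state] = G       # later overwrites keep the FIRST visit
    for state, g in first_visit_return.items():
        returns[state].append(g)
V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)   # {'A': 2.0, 'B': 1.0}
```

No model of the environment is used: the value estimates come purely from averaging complete sampled returns, which is the defining feature of Monte Carlo methods.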
b) Describe the application of reinforcement learning to the real-world problem of Job-Shop Scheduling. [10]
b) Explain how upper confidence bound (UCB) action selection generally performs [10]
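For reference, a minimal sketch of UCB1-style selection, choosing the action maximizing Q(a) + c·sqrt(ln t / N(a)); the helper name `ucb_select` and the example numbers are illustrative assumptions:

```python
import math

# Hedged sketch of UCB action selection: prefer the action whose value
# estimate plus exploration bonus c * sqrt(ln t / N(a)) is largest;
# untried actions are taken first (their bonus is effectively infinite).
def ucb_select(Q, N, t, c=2.0):
    for a in Q:
        if N[a] == 0:
            return a
    return max(Q, key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# example (made-up numbers): the rarely tried arm wins on its bonus
choice = ucb_select({1: 0.5, 2: 0.6}, {1: 10, 2: 1}, t=11)
```

The bonus shrinks as N(a) grows, so exploration is directed at actions whose estimates are still uncertain rather than chosen uniformly at random as in ε-greedy.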
Q6 a) What are Goals and Rewards? Explain with a suitable example. [10]
__________________________
42752 Page 1 of 1