Paper / Subject Code: 53173 / Reinforcement Learning (DLOC - V)

Time: 3 Hours                                                    Marks: 80

Note: 1. Question 1 is compulsory.

2. Answer any three out of the remaining five questions.

3. Assume any suitable data wherever required and justify the same.

Q1 a) Describe generalized policy iteration. [5]
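
For reference, a minimal sketch of one instance of generalized policy iteration, namely policy iteration, in which evaluation and improvement alternate until the policy is stable. The two-state MDP is hypothetical, chosen only to make the loop runnable.

```python
# Minimal sketch: policy iteration as an instance of generalized policy
# iteration (GPI). The two-state MDP below is hypothetical.
# P[s][a] -> list of (prob, next_state, reward) transitions.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def evaluate(pi, V, theta=1e-6):
    # Policy evaluation: sweep until V is (nearly) consistent with pi.
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][pi[s]])
            delta, V[s] = max(delta, abs(v - V[s])), v
        if delta < theta:
            return V

def improve(pi, V):
    # Policy improvement: make pi greedy with respect to the current V.
    stable = True
    for s in P:
        best = max(P[s], key=lambda a: sum(p * (r + GAMMA * V[s2])
                                           for p, s2, r in P[s][a]))
        if best != pi[s]:
            pi[s], stable = best, False
    return stable

pi, V = {0: 0, 1: 0}, {0: 0.0, 1: 0.0}
while not improve(pi, evaluate(pi, V)):  # alternate until the policy is stable
    pass
print(pi, V)  # greedy policy and its value function
```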

b) Suppose γ = 0.9 and the reward sequence is R1 = 2 followed by an infinite sequence of 7s. What are G1 and G0? [5]
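
A worked sketch of this computation, assuming the textbook convention $G_t = R_{t+1} + \gamma G_{t+1}$:

$$G_1 = \sum_{k=0}^{\infty} \gamma^k \cdot 7 = \frac{7}{1 - 0.9} = 70, \qquad G_0 = R_1 + \gamma G_1 = 2 + 0.9 \times 70 = 65.$$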

c) What are the key features of reinforcement learning? [5]

d) Describe temporal-difference (TD) prediction with an example. [5]
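
For reference, a minimal runnable sketch of tabular TD(0) prediction; the three-state chain and its rewards are hypothetical.

```python
# Minimal sketch: tabular TD(0) prediction for a fixed policy on a
# hypothetical 3-state chain (0 -> 1 -> 2, reward 1 on reaching state 2).
ALPHA, GAMMA = 0.1, 0.9
V = {s: 0.0 for s in range(3)}  # state 2 is terminal

def step(s):
    s2 = s + 1
    return s2, (1.0 if s2 == 2 else 0.0)

for _ in range(1000):  # episodes
    s = 0
    while s != 2:
        s2, r = step(s)
        # TD(0): nudge V(s) toward the one-step bootstrapped target.
        V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
        s = s2
print(V)  # V[1] -> 1.0 and V[0] -> 0.9 as the estimates converge
```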

Q2 a) Consider a k-armed bandit problem with k = 4 actions, denoted 1, 2, 3, and 4. Consider applying to this problem a bandit algorithm using ε-greedy action selection, sample-average action-value estimates, and initial estimates of Q1(a) = 0, for all a. Suppose the initial sequence of actions and rewards is A1 = 1, R1 = 1, A2 = 2, R2 = 1, A3 = 2, R3 = 2, A4 = 2, R4 = 2, A5 = 3, R5 = 0. On some of these time steps the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred? [10]
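
One way to reason about this, sketched in code: replay the given trace, recompute the sample-average estimates before each choice, and flag any action outside the current greedy set (only the ε case can produce it; an action inside the greedy set could still have come from the random branch, hence "possibly").

```python
# Sketch: replay the trace and recompute the sample-average estimates
# Q(a) before each selection, to see when the chosen action was
# non-greedy (so the epsilon case must have fired).
trace = [(1, 1.0), (2, 1.0), (2, 2.0), (2, 2.0), (3, 0.0)]  # (A_t, R_t)
rewards = {a: [] for a in (1, 2, 3, 4)}

for t, (a, r) in enumerate(trace, start=1):
    Q = {b: (sum(rs) / len(rs) if rs else 0.0) for b, rs in rewards.items()}
    greedy = {b for b in Q if Q[b] == max(Q.values())}
    tag = "possibly epsilon" if a in greedy else "DEFINITELY epsilon"
    print(f"t={t}: Q={Q}, greedy={greedy}, chose {a} -> {tag}")
    rewards[a].append(r)
```
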
b) What is reinforcement learning? Explain the elements of reinforcement learning. [10]


Q3 a) Illustrate through an example the use of Monte Carlo Methods. [10]
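
For reference, a minimal sketch of first-visit Monte Carlo prediction; the episodic task with a noisy terminal reward is hypothetical.

```python
# Minimal sketch: first-visit Monte Carlo prediction. Returns are averaged
# over complete episodes, so no model of the dynamics is needed.
import random

GAMMA = 1.0
returns = {s: [] for s in (0, 1)}  # observed returns per state
V = {s: 0.0 for s in (0, 1)}

def episode():
    # Hypothetical episodic task: 0 -> 1 -> terminal, noisy final reward.
    return [(0, 0.0), (1, random.gauss(1.0, 0.5))]  # (state, reward)

for _ in range(5000):
    traj, G = episode(), 0.0
    for t in reversed(range(len(traj))):  # accumulate returns backwards
        s, r = traj[t]
        G = r + GAMMA * G
        if s not in {traj[i][0] for i in range(t)}:  # first visit only
            returns[s].append(G)
            V[s] = sum(returns[s]) / len(returns[s])
print(V)  # both estimates approach 1.0
```
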
b) Describe the application of reinforcement learning to the real-world problem of Job-Shop Scheduling. [10]

Q4 a) Describe temporal-difference (TD) control using Q-Learning. [10]
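
For reference, a minimal sketch of a single Q-learning backup; the states, actions, and values are hypothetical.

```python
# Minimal sketch: one Q-learning update. This off-policy TD control rule
# bootstraps from the greedy action in the next state, regardless of the
# action actually taken there.
ALPHA, GAMMA = 0.5, 0.9
Q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.0, ("s1", "right"): 4.0}

def q_learning_update(s, a, r, s2, actions=("left", "right")):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    target = r + GAMMA * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

q_learning_update("s0", "right", 1.0, "s1")
print(Q[("s0", "right")])  # 0.5 * (1.0 + 0.9*4.0 - 0.0) = 2.3
```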


b) Explain how upper confidence bound (UCB) action selection generally performs better than ε-greedy action selection, with a suitable example. [10]
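
For reference, the UCB selection rule in the usual Sutton and Barto notation:

$$A_t = \operatorname*{arg\,max}_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right],$$

where $N_t(a)$ counts how often action $a$ has been selected so far and $c > 0$ sets the exploration strength. The square-root term shrinks as an action is tried, so UCB directs exploration toward uncertain actions instead of exploring uniformly at random as ε-greedy does.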


Q5 a) Describe Monte Carlo Estimation of Action Values. [10]


b) Describe asynchronous dynamic programming with an example. [10]
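
A minimal runnable sketch of the idea on a hypothetical two-state MDP: states are backed up one at a time, in arbitrary order, with values overwritten in place rather than in full systematic sweeps.

```python
# Minimal sketch: asynchronous (in-place, arbitrary-order) value iteration
# on a hypothetical two-state MDP. P[s][a] -> (next_state, reward).
import random

GAMMA = 0.9
P = {0: {0: (0, 0.0), 1: (1, 1.0)},
     1: {0: (0, 0.0), 1: (1, 2.0)}}
V = {0: 0.0, 1: 0.0}

for _ in range(2000):
    s = random.choice([0, 1])  # back up a single, arbitrarily chosen state
    V[s] = max(r + GAMMA * V[s2] for s2, r in P[s].values())
print(V)  # approaches the same V* as full-sweep value iteration
```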


Q6 a) What are Goals and Rewards? Explain with a suitable example. [10]

b) Explain Tracking a Nonstationary Problem with an example. [10]
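
For reference, the constant step-size update usually discussed here:

$$Q_{n+1} = Q_n + \alpha \,(R_n - Q_n), \qquad \alpha \in (0, 1],$$

an exponential recency-weighted average: reward $R_i$ receives weight $\alpha(1-\alpha)^{n-i}$, so recent rewards dominate and the estimate keeps tracking a drifting reward distribution, unlike the sample average whose effective step size $1/n$ vanishes.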


__________________________