
EE675A : Introduction to Reinforcement Learning

Problem Set 3 (Q1 to Q3) solutions


04/05/2023
Lecturer: Subramanya Swamy Peruru                    Scribe: A. V. Jayanth Reddy

1 Contextual bandits
Q1: Given an over-determined system of linear equations Ay = b, where there are more equations than unknowns, assuming (A^T A)^{-1} exists, we generally use the concept of pseudo-inverse to compute ŷ = (A^T A)^{-1} A^T b as the estimate of y. Find ŷ if

    A = [  1   3 ]        [ 17 ]
        [  5   7 ] ,  b = [ 19 ]
        [ 11  13 ]        [ 23 ]
A1: To compute the pseudo-inverse estimate, we use the formula ŷ = (A^T A)^{-1} A^T b.

    A^T A = [ 147  181 ]
            [ 181  227 ]

    (A^T A)^{-1} = [  0.3734  -0.2977 ]
                   [ -0.2977   0.2418 ]

    (A^T A)^{-1} A^T = [ -0.5197  -0.2171   0.2368 ]
                       [  0.4276   0.2039  -0.1316 ]

From this, the estimate is

    ŷ = (A^T A)^{-1} A^T b = [ -7.513 ]
                             [  8.118 ]
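
As a quick numerical check (a minimal NumPy sketch, not part of the original solution), the estimate above can be reproduced from the normal equations, or more stably with np.linalg.lstsq:

    import numpy as np

    # Over-determined system from Q1: three equations, two unknowns
    A = np.array([[1.0, 3.0],
                  [5.0, 7.0],
                  [11.0, 13.0]])
    b = np.array([17.0, 19.0, 23.0])

    # Pseudo-inverse estimate: y_hat = (A^T A)^{-1} A^T b
    y_hat = np.linalg.inv(A.T @ A) @ A.T @ b
    print(y_hat)            # approx [-7.513, 8.118]

    # Equivalent (and numerically more stable) least-squares solve
    y_hat_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(y_hat_lstsq)      # same estimate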
Q2: Consider a contextual bandits scenario in which the true mean µ_a(x) = θ_a^T x of an arm a is a linear function of the context vector x. Here θ_a and x are n × 1 vectors, where n is the number of features in the context vector. Assume that we have two arms a_1 and a_2, and the samples (context, action, reward) observed by the agent in the first 6 rounds are as follows:

    ([1, 3]^T, a_1, r = 17), ([7, 13]^T, a_2, r = 2), ([5, 7]^T, a_1, r = 2),
    ([5, 3]^T, a_2, r = 1), ([11, 13]^T, a_1, r = 23), ([5, 7]^T, a_2, r = 9).

If the context seen in the 7th round is [2, 1]^T, which arm is played by the agent in that round if it uses a greedy policy?

A2: Given the samples (context, action, reward) observed by the agent in the first 6 rounds, we can estimate the parameters θ_a using linear regression. For each arm, we collect the samples in which that arm was selected and use them to estimate the corresponding parameters. Specifically, we use the least squares estimate

    θ̂_a = (X_a^T X_a)^{-1} X_a^T r_a

For a_1:

    X_{a1} = [  1   3 ]        r_{a1} = [ 17 ]
             [  5   7 ]                 [  2 ]
             [ 11  13 ]                 [ 23 ]

For a_2:

    X_{a2} = [ 7  13 ]         r_{a2} = [ 2 ]
             [ 5   3 ]                  [ 1 ]
             [ 5   7 ]                  [ 9 ]

From this we obtain the estimates θ̂_1 ≈ [-3.8, 4.6]^T and θ̂_2 ≈ [0.6, 0.0]^T.
To determine the arm to select in the 7th round using a greedy policy, we compute the estimated expected reward of each arm for the observed context x = [2, 1]^T:

    µ_1 = θ̂_1^T x ≈ 2(-3.8) + 4.6 = -3.0,        µ_2 = θ̂_2^T x ≈ 2(0.6) + 0.0 = 1.2

Since the estimated reward of arm 2 is higher than that of arm 1, the agent selects arm 2 in the 7th round under a greedy policy.
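
The same computation can be sketched in NumPy (illustrative, not part of the original notes): fit θ̂_a for each arm from the rounds in which it was played, then pick the arm with the larger predicted reward for the round-7 context.

    import numpy as np

    # Contexts and rewards from the rounds where each arm was played (Q2)
    X_a1 = np.array([[1.0, 3.0], [5.0, 7.0], [11.0, 13.0]])
    r_a1 = np.array([17.0, 2.0, 23.0])
    X_a2 = np.array([[7.0, 13.0], [5.0, 3.0], [5.0, 7.0]])
    r_a2 = np.array([2.0, 1.0, 9.0])

    def least_squares(X, r):
        # Per-arm estimate: theta_hat = (X^T X)^{-1} X^T r
        return np.linalg.inv(X.T @ X) @ X.T @ r

    theta1 = least_squares(X_a1, r_a1)               # approx [-3.82, 4.65]
    theta2 = least_squares(X_a2, r_a2)               # approx [ 0.60, 0.03]

    x_new = np.array([2.0, 1.0])                     # context in round 7
    mu = np.array([theta1 @ x_new, theta2 @ x_new])  # approx [-2.99, 1.23]
    print("greedy arm:", np.argmax(mu) + 1)          # prints 2 (arms 1-indexed here)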

2 Policy Gradient Algorithm


Q3: Consider a parametric representation of policy π given by

    π(a; {θ_b}_{b∈A}) = exp(θ_a) / Σ_{b∈A} exp(θ_b),    for a ∈ A.

Here A is the set of actions, and {θb }b∈A is the parameter vector that characterizes the policy
π. Assuming there are only two actions a1 and a2 , derive the policy gradient update for this case.

A3: In the policy gradient algorithm, the parameter update is given by

    θ_i^{(t+1)} = θ_i^{(t)} + α E[ R_t · d log π(A_t; θ) / dθ_i ],

and in practice the expectation is replaced by a single sample, i.e. the reward R_t and the action A_t actually observed at round t.

For the softmax policy, π(a_i; θ) = exp(θ_i) / Σ_{j∈A} exp(θ_j), so

    d log π(a_i; θ) / dθ_k = d( θ_i − log Σ_{j∈A} exp(θ_j) ) / dθ_k
                           = 1_{i=k} − exp(θ_k) / Σ_{j∈A} exp(θ_j)
                           = 1_{i=k} − π(a_k; θ).

With two actions we have π(a_1) = exp(θ_1) / (exp(θ_1) + exp(θ_2)) and π(a_2) = exp(θ_2) / (exp(θ_1) + exp(θ_2)).

If we picked arm 1 at round t, the update is

    θ_1^{(t+1)} = θ_1^{(t)} + α R_t (1 − π(a_1)),        θ_2^{(t+1)} = θ_2^{(t)} + α R_t (0 − π(a_2))

If we picked arm 2 at round t, the update is

    θ_2^{(t+1)} = θ_2^{(t)} + α R_t (1 − π(a_2)),        θ_1^{(t+1)} = θ_1^{(t)} + α R_t (0 − π(a_1))
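
A minimal NumPy sketch of this two-arm update (illustrative only; the step size α = 0.1 and the simulated reward distributions are placeholder assumptions, not given in the problem):

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)            # preferences theta_1, theta_2
    alpha = 0.1                    # step size (placeholder value)
    true_means = [1.0, 2.0]        # hypothetical mean rewards, only for the demo

    def softmax(theta):
        z = np.exp(theta - theta.max())   # subtract max for numerical stability
        return z / z.sum()

    for t in range(1000):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)                   # sample an arm from pi
        reward = true_means[a] + rng.normal()     # observe a noisy reward
        # Stochastic policy-gradient step: d log pi(a)/d theta_k = 1{k=a} - pi(a_k)
        grad_log_pi = (np.arange(2) == a).astype(float) - pi
        theta = theta + alpha * reward * grad_log_pi

    print(softmax(theta))   # the preference for the higher-reward arm grows

In practice a baseline (such as the running average reward) is often subtracted from R_t to reduce the variance of this update.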
