
EE675A : Introduction to Reinforcement Learning

Problem Set 3 (Q1 to Q3) solutions


04/05/2023
Lecturer: Subrahmanya Swamy Peruru Scribe: A. V. Jayanth Reddy

1 Contextual bandits
Q1: Given an over-determined system of linear equations Ay = b, where there are more equations than unknowns, and assuming (A^T A)^{-1} exists, we generally use the concept of the pseudo-inverse to compute ŷ = (A^T A)^{-1} A^T b as the estimate of y. Find ŷ if

A = \begin{bmatrix} 1 & 3 \\ 5 & 7 \\ 11 & 13 \end{bmatrix}, \qquad b = \begin{bmatrix} 17 \\ 19 \\ 23 \end{bmatrix}
A1: To compute the estimate, we use the pseudo-inverse formula ŷ = (A^T A)^{-1} A^T b.

A^T A = \begin{bmatrix} 147 & 181 \\ 181 & 227 \end{bmatrix}

(A^T A)^{-1} = \begin{bmatrix} 0.3734 & -0.2977 \\ -0.2977 & 0.2418 \end{bmatrix}

(A^T A)^{-1} A^T = \begin{bmatrix} -0.5197 & -0.2171 & 0.2368 \\ 0.4276 & 0.2039 & -0.1316 \end{bmatrix}

From this, we obtain the estimate ŷ = (A^T A)^{-1} A^T b = \begin{bmatrix} -7.513 \\ 8.11 \end{bmatrix}.
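As a quick numerical check (not part of the original solution), the same estimate can be reproduced with NumPy; the explicit normal-equations formula and the built-in least-squares solver should agree for this small, well-conditioned problem.

```python
import numpy as np

# System from Q1: more equations (3) than unknowns (2)
A = np.array([[1.0, 3.0],
              [5.0, 7.0],
              [11.0, 13.0]])
b = np.array([17.0, 19.0, 23.0])

# Explicit pseudo-inverse via the normal equations: y_hat = (A^T A)^{-1} A^T b
y_hat = np.linalg.inv(A.T @ A) @ A.T @ b
print(y_hat)  # approximately [-7.513, 8.117]

# Cross-check with the built-in least-squares solver
y_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(y_hat, y_lstsq))  # True
```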
Q2: Consider a contextual bandits scenario in which the true mean µ_a(x) = θ_a^T x of an arm a is a linear function of the context vector x. Here θ_a and x are n × 1 vectors, where n is the number of features in the context vector. Assume that we have two arms a_1 and a_2, and the samples (context, action, reward) observed by the agent in the first 6 rounds are as follows:

\left(\begin{bmatrix} 1 \\ 3 \end{bmatrix}, a_1, r = 17\right), \left(\begin{bmatrix} 7 \\ 13 \end{bmatrix}, a_2, r = 2\right), \left(\begin{bmatrix} 5 \\ 7 \end{bmatrix}, a_1, r = 2\right), \left(\begin{bmatrix} 5 \\ 3 \end{bmatrix}, a_2, r = 1\right), \left(\begin{bmatrix} 11 \\ 13 \end{bmatrix}, a_1, r = 23\right), \left(\begin{bmatrix} 5 \\ 7 \end{bmatrix}, a_2, r = 9\right)

If the context seen in the 7th round is \begin{bmatrix} 2 \\ 1 \end{bmatrix}, what arm is played by the agent in that round if it uses a greedy policy?

A2: Given the samples (context, action, reward) observed by the agent in the first 6 rounds, we can estimate the parameters θ_a using linear regression. For each arm, we collect the samples where that arm was selected and use them to estimate the corresponding parameters. Specifically, we use the least squares method to estimate θ_a as follows:

θ_a = (X_a^T X_a)^{-1} X_a^T r_a

For a_1: X_{a_1} = \begin{bmatrix} 1 & 3 \\ 5 & 7 \\ 11 & 13 \end{bmatrix}, \quad r_{a_1} = \begin{bmatrix} 17 \\ 2 \\ 23 \end{bmatrix}

For a_2: X_{a_2} = \begin{bmatrix} 7 & 13 \\ 5 & 3 \\ 5 & 7 \end{bmatrix}, \quad r_{a_2} = \begin{bmatrix} 2 \\ 1 \\ 9 \end{bmatrix}

From this we obtain the estimates θ_1 ≈ \begin{bmatrix} -3.8 \\ 4.6 \end{bmatrix} and θ_2 ≈ \begin{bmatrix} 0.6 \\ 0.0 \end{bmatrix}.

To determine the arm to select in the 7th round using a greedy policy, we compute the estimated expected reward of each arm for the observed context x_7 = [2, 1]^T:

µ_1 = θ_1^T x_7 ≈ -3.0, \qquad µ_2 = θ_2^T x_7 ≈ 1.2

Since the estimated reward for arm 2 is higher than that of arm 1, the agent selects arm 2 in the 7th round under a greedy policy.
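A minimal sketch of this procedure in NumPy (the variable names are illustrative, not from the notes): fit a per-arm least-squares estimate from that arm's own (context, reward) pairs, then act greedily on the 7th-round context.

```python
import numpy as np

# Per-arm data from the first 6 rounds (contexts as rows, one reward per row)
X1 = np.array([[1.0, 3.0], [5.0, 7.0], [11.0, 13.0]]); r1 = np.array([17.0, 2.0, 23.0])
X2 = np.array([[7.0, 13.0], [5.0, 3.0], [5.0, 7.0]]);  r2 = np.array([2.0, 1.0, 9.0])

# Least-squares estimate theta_a = (X_a^T X_a)^{-1} X_a^T r_a for each arm
theta1 = np.linalg.lstsq(X1, r1, rcond=None)[0]   # approx [-3.82, 4.65]
theta2 = np.linalg.lstsq(X2, r2, rcond=None)[0]   # approx [0.60, 0.03]

# Greedy action on the 7th-round context
x7 = np.array([2.0, 1.0])
mu = np.array([theta1 @ x7, theta2 @ x7])          # approx [-3.0, 1.2]
print("greedy arm:", np.argmax(mu) + 1)            # arm 2
```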

2 Policy Gradient Algorithm


Q3: Consider a parametric representation of policy π given by

π(a; {θ_b}_{b∈A}) = \frac{\exp(θ_a)}{\sum_{b∈A} \exp(θ_b)}, \quad \text{for } a ∈ A.

Here A is the set of actions, and {θ_b}_{b∈A} is the parameter vector that characterizes the policy π. Assuming there are only two actions a_1 and a_2, derive the policy gradient update for this case.

A3: In the policy gradient algorithm, the parameter update is given by

θ_i^{(t+1)} = θ_i^{(t)} + α\, E\left[ R_t \frac{d \log π(a_t; θ)}{dθ_i} \right]

For the softmax policy

π(a_i; θ) = \frac{\exp(θ_i)}{\sum_{j∈A} \exp(θ_j)},

we have

\frac{d \log π(a_i; θ)}{dθ_k} = \frac{d}{dθ_k}\left( θ_i - \log \sum_{j∈A} \exp(θ_j) \right) = 1_{\{i=k\}} - \frac{\exp(θ_k)}{\sum_{j∈A} \exp(θ_j)} = 1_{\{i=k\}} - π(a_k; θ).

For two actions, π(a_1) = \frac{\exp(θ_1)}{\exp(θ_1) + \exp(θ_2)} and π(a_2) = \frac{\exp(θ_2)}{\exp(θ_1) + \exp(θ_2)}.

If we picked arm 1, then the (stochastic) update is

θ_1^{(t+1)} = θ_1^{(t)} + α R_t (1 − π(a_1)); \qquad θ_2^{(t+1)} = θ_2^{(t)} + α R_t (0 − π(a_2))

If we picked arm 2, then the update is

θ_2^{(t+1)} = θ_2^{(t)} + α R_t (1 − π(a_2)); \qquad θ_1^{(t+1)} = θ_1^{(t)} + α R_t (0 − π(a_1))
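As an illustration, here is a small sketch (not from the notes) of this two-arm softmax policy-gradient update on a toy bandit; the reward means, step size, and horizon are made-up values for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0])   # assumed toy reward means; arm 2 is better
theta = np.zeros(2)                  # softmax preferences
alpha = 0.1                          # step size

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for t in range(5000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                      # sample an arm from the policy
    R = rng.normal(true_means[a], 1.0)           # observe a noisy reward
    grad = -pi                                   # d log pi(a) / d theta_k = 1{k=a} - pi_k
    grad[a] += 1.0
    theta += alpha * R * grad                    # stochastic policy-gradient update

print(softmax(theta))  # probability mass should typically concentrate on arm 2
```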

Solutions for Quiz I Practice Problems
26/01/2023
Lecturer: Dr. S.S. Peruru Scribe: Swastik Sharma (21104286) & Himanshu Sood (190381)

Exercise 4
In the class, we have looked at Thompson sampling for the Gaussian case, where the reward distributions are {N(µ(a), 1)}_{a∈A} and the prior distributions for all the arms are N(0, 1). Now, assume that instead of the standard Gaussian N(0, 1) as a prior, we have the following Gaussian distributions as priors: {N(v_0(a), σ_0(a))}_{a∈A}.

1. Prove that the posterior of an arm a after observing one reward sample r from the reward distribution N(µ(a), 1) is given by

N\left( \frac{v_0(a) + (σ_0(a))^2 r}{1 + (σ_0(a))^2}, \; \frac{(σ_0(a))^2}{1 + (σ_0(a))^2} \right)

2. Using the above result, write the Thompson sampling algorithm for this case.

Solution:

Given,
Prior : {N (v0 (a), σ0 (a))}a∈A
Rewards : {N (µ(a), 1)}a∈A
We know from Bayes' theorem that:

Pr(θ | r) ∝ Pr(r | θ) · Pr(θ)     (1)

where r is one sample from the reward distribution and θ is a random variable sampled from the prior distribution, over which we are updating our belief.

For Gaussian distributions, the right-hand side of (1) can be written as

Pr(θ | r) ∝ Pr(r | θ) · Pr(θ)

∝ \exp\left( -\frac{(r - θ)^2}{2} \right) · \exp\left( -\frac{(θ - v_0(a))^2}{2(σ_0(a))^2} \right)

∝ \exp\left( -\left[ \frac{(r - θ)^2}{2} + \frac{(θ - v_0(a))^2}{2(σ_0(a))^2} \right] \right)

∝ \exp\left( -\left[ \frac{θ^2}{2} - rθ + \frac{θ^2}{2(σ_0(a))^2} - \frac{θ\, v_0(a)}{(σ_0(a))^2} \right] \right)   (keeping only the terms that involve θ)

∝ \exp\left( -\frac{1}{2(σ_0(a))^2}\left[ θ^2\big((σ_0(a))^2 + 1\big) - 2θ\big(r(σ_0(a))^2 + v_0(a)\big) \right] \right)

∝ \exp\left( -\frac{(σ_0(a))^2 + 1}{2(σ_0(a))^2}\left[ θ^2 - 2θ\, \frac{r(σ_0(a))^2 + v_0(a)}{(σ_0(a))^2 + 1} \right] \right)

∝ \exp\left( -\frac{(σ_0(a))^2 + 1}{2(σ_0(a))^2}\left( θ - \frac{r(σ_0(a))^2 + v_0(a)}{(σ_0(a))^2 + 1} \right)^2 \right)

The last expression is proportional to a Gaussian density, so

Pr(θ | r) = N\left( \frac{r(σ_0(a))^2 + v_0(a)}{(σ_0(a))^2 + 1}, \; \frac{(σ_0(a))^2}{(σ_0(a))^2 + 1} \right)     (2)

Hence proved.
The Thompson Sampling algorithm for this case is represented in Algorithm 1.

Algorithm 1 Thompson Sampling for Gaussian Priors

1: Initialize, for every arm a ∈ A, the posterior parameters to the prior: v(a) = v_0(a), σ^2(a) = (σ_0(a))^2
2: for t ≥ 1 do
3:     for each arm a do
4:         Sample θ̃_t(a) from N(v(a), σ^2(a))
5:     end for
6:     Play a(t) = argmax_a θ̃_t(a)
7:     Observe reward r_t
8:     Update the posterior of the played arm a(t) using (2): v(a(t)) ← \frac{r_t\, σ^2(a(t)) + v(a(t))}{σ^2(a(t)) + 1}, \quad σ^2(a(t)) ← \frac{σ^2(a(t))}{σ^2(a(t)) + 1}
9: end for
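A minimal Python sketch of this algorithm, under an assumed toy setting (the arm means, priors, and horizon below are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.2, 0.5, 0.9])      # assumed toy arm means (reward noise has variance 1)
v = np.zeros(3)                              # posterior means, initialized to the prior v0(a)
var = np.ones(3)                             # posterior variances, initialized to sigma0(a)^2

for t in range(2000):
    samples = rng.normal(v, np.sqrt(var))    # one posterior sample per arm
    a = int(np.argmax(samples))              # play the arm with the largest sample
    r = rng.normal(true_means[a], 1.0)       # observe a unit-variance Gaussian reward
    # One-sample posterior update from equation (2)
    v[a] = (r * var[a] + v[a]) / (var[a] + 1.0)
    var[a] = var[a] / (var[a] + 1.0)

print(v)  # posterior means should approach the true arm means
```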

Exercise 5
Consider two Gaussian distributions N(µ_a, σ) and N(µ_b, σ). Prove that the KL divergence between these two distributions is \frac{1}{2σ^2}(µ_a − µ_b)^2.

Solution:

Given two distributions, let them be named:

P = N (µa , σ)

Q = N (µb , σ)
As they are normal distributions, their densities are

P(x) ≜ \frac{1}{\sqrt{2πσ^2}} \exp\left( -\frac{(x - µ_a)^2}{2σ^2} \right), \qquad Q(x) ≜ \frac{1}{\sqrt{2πσ^2}} \exp\left( -\frac{(x - µ_b)^2}{2σ^2} \right)

From the definition of the KL divergence between two distributions:

KL(P||Q) = \int_x P(x) \log\left( \frac{P(x)}{Q(x)} \right) dx

= \int_x P(x) \log \exp\left( -\frac{(x - µ_a)^2 - (x - µ_b)^2}{2σ^2} \right) dx

= \int_x P(x) \cdot \frac{(x - µ_b)^2 - (x - µ_a)^2}{2σ^2}\, dx

= \frac{1}{2σ^2}\left[ -\int_x P(x)(x - µ_a)^2 dx + \int_x P(x)(x - µ_b)^2 dx \right]

= \frac{1}{2σ^2}\left[ -σ^2 + \int_x P(x)(x - µ_a + µ_a - µ_b)^2 dx \right]

(because \int_x P(x)(x - µ_a)^2 dx = σ^2, and we add and subtract µ_a)

= \frac{1}{2σ^2}\left[ -σ^2 + \int_x P(x)(x - µ_a)^2 dx + \int_x P(x)(µ_a - µ_b)^2 dx + 2(µ_a - µ_b)\int_x P(x)(x - µ_a) dx \right]

= \frac{1}{2σ^2}\left[ -σ^2 + σ^2 + (µ_a - µ_b)^2 \right]

(because \int_x P(x)(x - µ_a)^2 dx = σ^2, \int_x P(x)(x - µ_a) dx = 0, and \int_x P(x) dx = 1)

= \frac{(µ_a - µ_b)^2}{2σ^2}

Hence it is proved that the KL divergence between N(µ_a, σ) and N(µ_b, σ) is \frac{1}{2σ^2}(µ_a − µ_b)^2.
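A quick numerical sanity check of this closed form (the particular values of µ_a, µ_b, and σ are arbitrary): estimate KL(P||Q) by Monte Carlo and compare it with (µ_a − µ_b)^2 / (2σ^2).

```python
import numpy as np

rng = np.random.default_rng(2)
mu_a, mu_b, sigma = 1.0, 3.0, 2.0        # arbitrary example values

def log_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

x = rng.normal(mu_a, sigma, size=1_000_000)          # samples from P
kl_mc = np.mean(log_pdf(x, mu_a, sigma) - log_pdf(x, mu_b, sigma))
kl_closed = (mu_a - mu_b)**2 / (2 * sigma**2)

print(kl_mc, kl_closed)   # both approximately 0.5
```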

Exercise 6
Recall the hypothesis testing problem with two distributions N(0, 1) and N(∆, 1). We are given T samples from one of these two distributions, and we have to predict from which of the two distributions the samples were drawn. Assume ∆ = \frac{1}{\sqrt{T}}. Use the above theorem to show that the prediction of hypothesis testing can go wrong with a constant probability (i.e., the constant does not depend on T).

Solution:

Given that there are two distributions N(0, 1) and N(∆, 1), and that ∆ = \frac{1}{\sqrt{T}}, where T is the number of samples.

After taking N samples, the sample mean is distributed as follows under the two hypotheses; let these distributions be named P and Q respectively:

P = N\left(0, \frac{1}{N}\right), \qquad Q = N\left(∆, \frac{1}{N}\right)

From the definition of the KL divergence between two distributions:

KL(P||Q) = \int_x P(x) \log\left( \frac{P(x)}{Q(x)} \right) dx

= \int_x P(x) \log\left( \frac{\exp\left(-\frac{N}{2} x^2\right)}{\exp\left(-\frac{N}{2}(x - ∆)^2\right)} \right) dx

= \int_x P(x) \log \exp\left( -\frac{N}{2}\left[ x^2 - (x - ∆)^2 \right] \right) dx

= -\frac{N}{2} \int_x P(x)\left[ x^2 - (x - ∆)^2 \right] dx

= -\frac{N}{2} \int_x P(x)\left[ 2x∆ - ∆^2 \right] dx

= -N∆ \int_x P(x)\, x\, dx + \frac{N}{2} \int_x P(x)\, ∆^2\, dx

Therefore, since \int_x P(x)\, x\, dx = 0 and \int_x P(x)\, dx = 1, the KL divergence of the distributions P and Q can be written as

KL(P||Q) = \frac{∆^2 · N}{2}     (3)

We know from the Bretagnolle–Huber inequality that for two distributions P and Q on the same sample space, and for any event A, we have:

P(A^c) + Q(A) ≥ \frac{1}{2} \exp\{-KL(P, Q)\}     (4)

Here A is the event that the sample mean is at most ∆/2 (in which case we predict that the samples came from the first distribution), and A^c is the complement of that event, in which the sample mean exceeds ∆/2.

Here, P(A^c) signifies the probability of a wrong prediction when the samples come from distribution P, and Q(A) signifies the probability of a wrong prediction when the samples come from distribution Q.

From (3) and (4) we get:

P(A^c) + Q(A) ≥ \frac{1}{2} \exp\left( -\frac{∆^2 · N}{2} \right)     (5)

Since ∆ = \frac{1}{\sqrt{T}} = \frac{1}{\sqrt{N}} (here N = T), substituting in (5) yields:

P(A^c) + Q(A) ≥ \frac{1}{2} \exp\left( -\frac{1}{2} \right)

P(A^c) + Q(A) ≥ 0.3033     (6)

Therefore, the analysis shows that the prediction of hypothesis testing can go wrong with a constant probability (the two error probabilities sum to at least 0.3033), independent of T.
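An illustrative simulation (an assumed setup, not part of the original solution): threshold the sample mean at ∆/2 and check that the summed error probability stays above (1/2)e^{−1/2} ≈ 0.3033 no matter how large T is.

```python
import numpy as np

rng = np.random.default_rng(3)

def error_sum(T, trials=200_000):
    """Estimate P(wrong | N(0,1)) + P(wrong | N(delta,1)) for the threshold test at delta/2."""
    delta = 1.0 / np.sqrt(T)
    # The mean of T i.i.d. unit-variance samples is Gaussian with variance 1/T,
    # so we can simulate the sample means directly.
    m0 = rng.normal(0.0, 1.0 / np.sqrt(T), size=trials)
    m1 = rng.normal(delta, 1.0 / np.sqrt(T), size=trials)
    return np.mean(m0 > delta / 2) + np.mean(m1 <= delta / 2)

bound = 0.5 * np.exp(-0.5)
for T in (10, 100, 1000, 10_000):
    print(T, round(error_sum(T), 3), ">=", round(bound, 4))   # stays near 0.617, above 0.3033
```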

EE675A - IITK, 2022-23-II

Lecture : Sample Quiz

Lecturer: Subrahmanya Swamy Peruru Scribe: Harsha Kurma, Aravindasai Bura

1 Exercise
Let X be a random variable which takes values in [0, 1]. Let E[X] = µ, and let µ̂_N denote the sample average obtained by observing N i.i.d. samples of X. Suggest a number N such that µ̂_N does not deviate from µ by more than 0.01 with a very high probability (say, with probability 0.99).

Hoeffding's Inequality: The probability that the estimated mean µ̂_N deviates from the true mean µ by more than ϵ is upper bounded by 2e^{-2ϵ^2 N}, where N is the number of samples:

P[|µ̂_N − µ| ≥ ϵ] ≤ 2e^{-2ϵ^2 N}

We require that µ̂_N does not deviate from µ by more than 0.01 with a very high probability (0.99), i.e., with ϵ = 0.01 the deviation probability must be at most 0.01:

P[|µ̂_N − µ| ≥ 0.01] ≤ 0.01

It therefore suffices to choose N such that

2e^{-2(0.01)^2 N} ≤ 0.01

Taking the logarithm on both sides and solving further,

N ≥ \frac{\ln 200}{2(0.01)^2} ≈ 26492
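A one-line check of this number, simply evaluating the Hoeffding bound above:

```python
import math

epsilon, delta = 0.01, 0.01
# Smallest N with 2 * exp(-2 * epsilon^2 * N) <= delta
N = math.ceil(math.log(2 / delta) / (2 * epsilon**2))
print(N)  # 26492
```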

2 Exercise
Prove that the ϵ-greedy algorithm, where ϵ is some fixed constant, incurs a regret that grows linearly in T.
In the ϵ-greedy algorithm, in every round an arm is played uniformly at random with probability ϵ, and with probability 1 − ϵ the greedy arm (the arm with the highest sample-mean estimate) is played.

The per-round regret of playing arm a(t) is

∆(a(t)) = µ(a^*) − µ(a(t))

so the cumulative regret and its expectation are

R(T) = \sum_{t=1}^{T} ∆(a(t))

E[R(T)] = E\left[ \sum_{t=1}^{T} ∆(a(t)) \right] = \sum_{t=1}^{T} E[∆(a(t))] = \sum_{t=1}^{T} \sum_{a∈A} ∆(a)\, P(a(t) = a)

Let â(t) = \arg\max_{a∈A} µ̂_t(a) denote the greedy arm at round t.

(Note that â is the best arm based on the sample estimates, whereas a^* is the actual best arm that gives us the maximum expected reward.)

Suppose there are k arms. With probability ϵ an arm is chosen uniformly at random, so each arm (including the greedy arm) receives probability ϵ/k from the exploration step. The greedy arm is additionally played with probability 1 − ϵ. Hence

P(a(t) = â) = (1 − ϵ) + ϵ/k

P(a(t) = a) = ϵ/k for every other arm a, so in all cases

P(a(t) = a) ≥ ϵ/k
E[R(T)] = \sum_{t=1}^{T} \sum_{a∈A} ∆(a)\, P(a(t) = a) ≥ \sum_{t=1}^{T} \sum_{a∈A} ∆(a)\, \frac{ϵ}{k} = \frac{ϵ T}{k} \sum_{a∈A} ∆(a)

Since ϵ, k, and \sum_{a∈A} ∆(a) are constants that do not depend on T (and \sum_{a∈A} ∆(a) > 0 whenever at least one arm is suboptimal), the expected regret grows linearly in T.
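A small simulation sketch of this effect (the two-arm Bernoulli instance and parameters below are invented for illustration): with a fixed ϵ, the realized cumulative regret grows proportionally to T, in line with the (ϵT/k) Σ_a ∆(a) lower bound derived above.

```python
import numpy as np

rng = np.random.default_rng(4)
means = np.array([0.3, 0.7])            # assumed Bernoulli arm means; arm 2 is optimal
eps, T = 0.2, 20_000
counts = np.zeros(2)
sums = np.zeros(2)
regret = 0.0

for t in range(T):
    if rng.random() < eps or counts.min() == 0:
        a = rng.integers(2)                          # explore uniformly at random
    else:
        a = int(np.argmax(sums / counts))            # exploit the empirical best arm
    r = float(rng.random() < means[a])               # Bernoulli reward
    counts[a] += 1; sums[a] += r
    regret += means.max() - means[a]

# Realized regret is of the same order as the (eps*T/k) * sum_a Delta(a) term
print(regret, "vs.", eps * T / 2 * (means.max() - means.min()))
```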

3 Exercise
In the UCB algorithm discussed in our class, we had

UCB_t(a) := µ̄_t(a) + \sqrt{\frac{2 \ln T}{n_t(a)}}     (2)

To execute this algorithm, we need to know the value of T, i.e., we should know the total number of rounds we are going to play. Let us assume that we do not know the value of T in advance and consider the following variant of UCB:

UCB_t(a) := µ̄_t(a) + \sqrt{\frac{\ln t}{n_t(a)}}     (3)

Prove that this UCB variant also has a similar regret bound. HINT: The proof proceeds in a similar fashion; just make the necessary changes to reflect this new formulation.
Step 1:
Recall that an arm a can be played at time t + 1 only if UCB_t(a) ≥ UCB_t(a^*).
First, calculate the probability that the true mean µ(a) lies outside the confidence interval that we have at time t, i.e., for all actions a ∈ A and 0 < t < T:

P\left( µ(a) ∉ \left( µ̄_t(a) − \sqrt{\frac{\ln t}{n_t(a)}},\; µ̄_t(a) + \sqrt{\frac{\ln t}{n_t(a)}} \right) \right)     (4)

Using Hoeffding's inequality,

P[|µ̄_t(a) − µ(a)| ≥ ϵ] ≤ 2e^{-2ϵ^2 n_t(a)}

and taking ϵ = \sqrt{\frac{\ln t}{n_t(a)}}, we get

2e^{-2ϵ^2 n_t(a)} = 2e^{-2 \ln t} = \frac{2}{t^2}
Step 2:
In the proof discussed in class for UCB (with the \sqrt{2 \ln T / n_t(a)} bonus), we assumed that for every time 1 ≤ t ≤ T and every arm a ∈ A, our confidence intervals are correct with high probability. For a given arm a at a particular round t, the probability that the true mean lies outside the bound is 2/T^4. Taking a union bound over all time slots 1 ≤ t ≤ T and all K arms, and assuming K ≤ T (a reasonable assumption, since we should have enough rounds to play each arm at least once),

P(\text{confidence interval violated for at least one arm or one time slot}) ≤ \frac{2TK}{T^4} ≤ O\left(\frac{1}{T^2}\right)

For the UCB variant with bonus \sqrt{\frac{\ln t}{n_t(a)}}:
If we do a similar analysis for this version of UCB, using Hoeffding's inequality,

P(\text{violation for arm } a \text{ at round } t) ≤ \frac{2}{t^2} \quad ∀ a ∈ A     (6)

P(\text{violation for at least one arm in at least one round}) ≤ \sum_{t=1}^{T} \sum_{a∈A} \frac{2}{t^2} = K \sum_{t=1}^{T} \frac{2}{t^2} > K \quad \text{as } T → ∞     (7)

Conclusion: this bound is greater than 1, which is useless; we cannot hope that our confidence intervals are correct at all times simultaneously.

Step 3: For the UCB version done in class we have shown that

i) A bad arm cannot be played too many times.

ii) If an arm is played a lot of times, it is not a very bad arm.

Similarly, here we show: "If a suboptimal arm has been played a sufficient number of times, then the probability of playing it again is very small," i.e.,

P\left( a(t+1) = a,\; n_t(a) ≥ \frac{4 \ln t}{(∆(a))^2} \right) ≤ \frac{4}{t^2}     (8)

Note: we use the following properties.
1. If arm a is played at time t + 1, then

UCB_t(a) ≥ UCB_t(a^*)     (9)

µ̄_t(a) + ϵ_t(a) ≥ µ̄_t(a^*) + ϵ_t(a^*)     (10)

µ̄_t(a^*) − µ̄_t(a) ≤ ϵ_t(a) − ϵ_t(a^*)     (11)

2. Also, if the confidence intervals of both a and a^* hold at time t, then

∆(a) = µ(a^*) − µ(a)     (12)

∆(a) ≤ µ̄_t(a^*) + ϵ_t(a^*) − (µ̄_t(a) − ϵ_t(a))     (13)

∆(a) ≤ 2ϵ_t(a)     (14)

∆(a) ≤ 2\sqrt{\frac{\ln t}{n_t(a)}}     (15)

n_t(a) ≤ \frac{4 \ln t}{(∆(a))^2}     (16)

Each confidence interval is violated only with small probability:

P(\text{confidence interval violation for arm } a \text{ at time } t) ≤ \frac{2}{t^2}, \quad \text{i.e.,} \quad P\left( |µ(a) − µ̄_t(a)| ≥ \sqrt{\frac{\ln t}{n_t(a)}} \right) ≤ \frac{2}{t^2}     (17)

Using this property for both a and a^*, we get

P(\text{confidence interval going wrong for either } a \text{ or } a^*) ≤ \frac{4}{t^2}     (19)

P(\text{confidence intervals correct for both } a \text{ and } a^*) ≥ 1 − \frac{4}{t^2}     (20)

Since playing arm a at time t + 1 while n_t(a) ≥ \frac{4 \ln t}{(∆(a))^2} contradicts (16) whenever both confidence intervals hold, this event can only occur when at least one of the two confidence intervals is violated. Hence we can say that

P\left( a(t+1) = a,\; n_t(a) ≥ \frac{4 \ln t}{(∆(a))^2} \right) ≤ \frac{4}{t^2}     (22)

Step 4: Show that

E[n_T(a)] ≤ \frac{4 \ln T}{(∆(a))^2} + 8

This is nothing but "a bad arm is not played many times".
Express

E[n_T(a)] = 1 + E\left[ \sum_{t=K+1}^{T} 1_{\{a(t)=a\}} \right]

(every arm is played once in the first K rounds, which accounts for the 1).

Divide the second term into two parts:

E\left[ \sum_{t=K+1}^{T} 1_{\{a(t)=a,\; n_t(a) ≤ \frac{4\ln t}{(∆(a))^2}\}} \right] + E\left[ \sum_{t=K+1}^{T} 1_{\{a(t)=a,\; n_t(a) ≥ \frac{4\ln t}{(∆(a))^2}\}} \right]

For the first term, arm a can be played while its count is still below \frac{4\ln t}{(∆(a))^2} ≤ \frac{4\ln T}{(∆(a))^2} at most that many times, so its total contribution is

E\left[ \sum_{t=K+1}^{T} 1_{\{a(t)=a,\; n_t(a) ≤ \frac{4\ln t}{(∆(a))^2}\}} \right] ≤ \frac{4\ln T}{(∆(a))^2}     (23)

For the second term, using (22),

E\left[ \sum_{t=K+1}^{T} 1_{\{a(t)=a,\; n_t(a) ≥ \frac{4\ln t}{(∆(a))^2}\}} \right] = \sum_{t=K+1}^{T} P\left( a(t)=a,\; n_t(a) ≥ \frac{4\ln t}{(∆(a))^2} \right) ≤ \sum_{t=K+1}^{T} \frac{4}{t^2} ≤ 8
Step 5: Calculate E[R(T; a)] using E[n_T(a)] from Step 4, and use E[R(T)] = \sum_a E[R(T; a)]:

E[R(T)] = \sum_a E[R(T; a)] = \sum_a E[n_T(a)]\, ∆(a)     (24)

E[R(T)] ≤ \sum_a \left( \frac{4 \ln T}{∆(a)} + 8\, ∆(a) \right)     (25)

which is of the same order as the regret bound of the UCB algorithm discussed in class.
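A brief sketch of this anytime variant (the Bernoulli arms and horizon are illustrative assumptions, not from the notes): each arm's index uses the \sqrt{\ln t / n_t(a)} bonus, so the horizon T never needs to be known in advance.

```python
import numpy as np

rng = np.random.default_rng(5)
means = np.array([0.2, 0.5, 0.8])    # assumed Bernoulli arm means
K, T = len(means), 50_000
counts = np.zeros(K)
sums = np.zeros(K)
regret = 0.0

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                          # play each arm once first
    else:
        ucb = sums / counts + np.sqrt(np.log(t) / counts)  # anytime bonus: sqrt(ln t / n_t(a))
        a = int(np.argmax(ucb))
    r = float(rng.random() < means[a])
    counts[a] += 1; sums[a] += r
    regret += means.max() - means[a]

print("cumulative regret:", regret)   # grows roughly like sum_a 4 ln T / Delta(a), i.e. logarithmically in T
```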
