
EE 737 Spring 2019-20

Mid-semester examination
Max marks: 30 Time: ∼ 2 hours

Instructions

• This is a take-home, open-notes exam, administered on the basis of the following honour code. You are not to violate the following terms.
– You are to complete this exam in a single sitting.
– The exam is designed to be completed in 2 hours but you are allowed to spend longer on it.
– You are allowed to refer to your notes and the text (by Lattimore and Szepesvári). You are however not allowed to (i) interact with anyone else, (ii) refer to any other source, or (iii) use the Internet.
• The honour code takes effect as soon as you have read a question, fully or partially.
• There are three questions and each question carries 10 marks.
• Turn in your completed answer scripts in the dropbox marked EE737 in the EE office by 5 pm, Feb. 28 (Friday).

Questions

1. Design and analysis of UCB algorithm for bounded rewards. In class, we analysed UCB algorithms for the
stochastic multi-armed bandit problem assuming that the arm rewards are 1-subgaussian. You now have to come up
with a variant of UCB specialized for the case where the arm distributions have bounded support. Note that bounded
support implies subgaussianity, so your regret bounds should ideally be better than what one would obtain using the
algorithms covered in class, since your algorithm is tailored to the special case of bounded rewards.
Specifically, consider the stochastic multi-armed bandit problem with k arms, the reward distribution of each arm
being supported on [0, 1]. The horizon length is n.
(a) Describe your UCB algorithm clearly.
(b) Derive an upper bound on the expected regret Rn .
(c) Extra credit: Compare the above regret bound to the one obtained using a subgaussianity assumption.
Hint: You might want to recall Hoeffding's inequality: consider i.i.d. random variables \(X_1, X_2, \ldots, X_n\) having
support \([a, b]\) and mean \(\mu\). Then for \(\epsilon > 0\),
\[
\mathbb{P}\left(\frac{\sum_{i=1}^{n} X_i}{n} - \mu \geq \epsilon\right) \leq e^{-2n\epsilon^2/(b-a)^2},
\]
\[
\mathbb{P}\left(\frac{\sum_{i=1}^{n} X_i}{n} - \mu \leq -\epsilon\right) \leq e^{-2n\epsilon^2/(b-a)^2}.
\]

2. A test of your ability to concentrate (probability). Suppose that Sn is a Binomial(n, p) random variable (i.e., it
represents the number of heads seen over n independent tosses of a biased coin, where the probability of heads on
each toss equals p). For a ∈ (p, 1), show that

P (Sn > na) ≤ e−nγ ,

where γ = a log(a/p) + (1 − a) log[(1 − a)/(1 − p)].
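As a numerical sanity check (not a proof), one can compare the exact binomial tail against \(e^{-n\gamma}\) for sample values of \(n\), \(p\), and \(a\) (the values below are arbitrary illustrative choices):

```python
# Numerical check (not a proof) of the Chernoff bound in Question 2:
# compare the exact Binomial(n, p) tail P(S_n > na) with e^{-n*gamma}.
import math

def binomial_tail(n, p, a):
    """Exact P(S_n > n*a) for S_n ~ Binomial(n, p)."""
    threshold = n * a
    return sum(math.comb(n, m) * p**m * (1 - p)**(n - m)
               for m in range(n + 1) if m > threshold)

def chernoff_bound(n, p, a):
    # gamma = a log(a/p) + (1 - a) log[(1 - a)/(1 - p)]
    gamma = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
    return math.exp(-n * gamma)

n, p, a = 50, 0.3, 0.5
tail, bound = binomial_tail(n, p, a), chernoff_bound(n, p, a)
print(tail, bound)  # the exact tail should not exceed the bound
```

The exponent \(\gamma\) is the Kullback-Leibler divergence between Bernoulli(\(a\)) and Bernoulli(\(p\)), which is why the bound is tight on the exponential scale.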

3. Strengthen your adversary. For this problem, consider the worst-case learning from expert advice setting. Specifi-
cally, there are k experts, and yi,t ∈ [0, 1] denotes the loss experienced by Expert i at time t. At each time t ∈ [n]:
• the algorithm picks an expert At ∈ [k] (using only information gathered until time t),
• the loss yi,t of each expert i is revealed,
• the algorithm experiences loss yAt ,t .
Note that the problem instance is defined by the adversarially chosen table of costs y = (yi,t , 1 ≤ i ≤ k, 1 ≤ t ≤ n).
Now, suppose that we define the following stronger notion of regret associated with your (possibly randomized)
algorithm \(\mathcal{A}\):
\[
\bar{R}_n(\mathcal{A}) = \sup_{y} \, \mathbb{E}\left[\sum_{t=1}^{n} y_{A_t,t} - \min_{1 \leq i \leq k} \sum_{t=1}^{n} y_{i,t}\right].
\]

Here the expectation is with respect to any randomization performed by the algorithm.
(a) Why is R̄n (A) a stronger notion of regret compared to the one analysed in class?
(b) Prove that for any algorithm,
\[
\bar{R}_n(\mathcal{A}) \geq n\left(1 - \frac{1}{k}\right).
\]
This means that sub-linear regret (under this stronger notion of regret) is impossible!
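To build intuition for part (b), the following sketch plays an adaptive adversary against one particular deterministic algorithm (follow-the-leader, used here purely as an example): the adversary assigns loss 1 to whichever expert the algorithm picks and 0 to the others. This is an illustration of the lower-bound mechanism for deterministic algorithms; the full proof must also handle randomized algorithms.

```python
# Sketch: against a deterministic algorithm, an adversary that assigns loss 1
# to the chosen expert (and 0 to the rest) forces total loss n, while by the
# pigeonhole principle some expert accumulates loss at most n/k.

def adversarial_regret(algorithm, k, n):
    """Play n rounds; the adversary reacts to the algorithm's deterministic picks."""
    cum_loss = [0.0] * k   # cumulative loss of each expert
    alg_loss = 0.0
    for t in range(n):
        a = algorithm(cum_loss, t)                       # pick based on history
        losses = [1.0 if i == a else 0.0 for i in range(k)]
        alg_loss += losses[a]
        for i in range(k):
            cum_loss[i] += losses[i]
    return alg_loss - min(cum_loss)

def follow_the_leader(cum_loss, t):
    # deterministic example policy: pick the expert with the smallest
    # cumulative loss so far (ties broken by lowest index)
    return min(range(len(cum_loss)), key=lambda i: cum_loss[i])

k, n = 4, 100
regret = adversarial_regret(follow_the_leader, k, n)
print(regret, n * (1 - 1 / k))  # regret matches the n(1 - 1/k) lower bound
```

Here the algorithm cycles through the experts, each accumulates loss n/k, and the realized regret is exactly n(1 - 1/k).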
