
CS747: Foundations of Intelligent and Learning Agents

Assignment 1

Author: Jishnu Basavaraju
Roll No.: 180050021

September 9, 2021

Contents

Introduction
Code Base and Datasets
    bandit.py
    instances
Output generation
    outputData.txt
Task 1
    Epsilon Greedy
    UCB
    KL-UCB
    Thompson Sampling
    Results
    Observations
Task 2
    Results
    Observations
Task 3
    Results
    Observations
Task 4
    Results
    Observations
References
    references.txt
Submission Directory

Introduction
This assignment tests the implementation of the regret minimisation algorithms
discussed in class. Different algorithms for sampling the arms of a stochastic
multi-armed bandit are implemented and compared. The assignment is divided
into four tasks.
Task 1: Implementing Epsilon-Greedy, UCB, KL-UCB, and Thompson Sampling
Task 2: Optimising the scaling coefficient of the exploration bonus in UCB
Task 3: Modifying UCB to tackle a scenario in which the rewards come from a
finite set contained in [0, 1], which generalises the Bernoulli case
Task 4: Using the UCB-based algorithm to maximise the expected number of times
the reward obtained exceeds a given threshold

Code Base and Datasets


bandit.py
The entire assignment runs from a single Python file, bandit.py. It takes various
command-line arguments to meet the requirements of the different tasks. The help
section for bandit.py is shown below.

Figure 1: Usage of bandit.py

On execution, bandit.py always prints its output in the format:
instance, algorithm, random seed, epsilon, scale, threshold, horizon, REG, HIGHS.
Depending on the task, some of these values may be preset to defaults.

instances
All the data needed to run the assignment are present in the instances folder.
It contains four sub-folders, one for each task, and each sub-folder contains
bandit instances appropriate for running that task.

Output generation
For the required testing of this assignment, bandit.py was run as described below.

To generate data for Task 1, run bandit.py for every combination of
• instance from "../instances/instances-task1/i-1.txt"; "../instances/instances-task1/i-2.txt"; "../instances/instances-task1/i-3.txt"
• algorithm from epsilon-greedy-t1 (with epsilon set to 0.02); ucb-t1; kl-ucb-t1; thompson-sampling-t1
• horizon from 100; 400; 1600; 6400; 25600; 102400
• random seed from 0; 1; ...; 49

To generate data for Task 2, run bandit.py for every combination of
• instance from "../instances/instances-task2/i-1.txt"; "../instances/instances-task2/i-2.txt"; "../instances/instances-task2/i-3.txt"; "../instances/instances-task2/i-4.txt"; "../instances/instances-task2/i-5.txt"
• algorithm set to ucb-t2
• scale from 0.02; 0.04; 0.06; ...; 0.3
• horizon equal to 10000
• random seed from 0; 1; ...; 49

To generate data for Task 3, run bandit.py for every combination of
• instance from "../instances/instances-task3/i-1.txt"; "../instances/instances-task3/i-2.txt"
• algorithm set to alg-t3
• horizon from 100; 400; 1600; 6400; 25600; 102400
• random seed from 0; 1; ...; 49

To generate data for Task 4, run bandit.py for every combination of
• instance from "../instances/instances-task4/i-1.txt"; "../instances/instances-task4/i-2.txt"
• algorithm set to alg-t4
• threshold from 0.2; 0.6
• horizon from 100; 400; 1600; 6400; 25600; 102400
• random seed from 0; 1; ...; 49

outputData.txt
The output from executing all these 9150 commands (3600 for Task 1 + 3750 for
Task 2 + 600 for Task 3 + 1200 for Task 4) is stored in the file outputData.txt.

Task 1
In this task, Epsilon Greedy, UCB, KL-UCB and Thompson Sampling are implemented
and tested on three different bandit instances with 2, 5 and 25 arms respectively.

Epsilon Greedy
The εG3 strategy was implemented. A number is drawn uniformly at random from
[0, 1]. If it is less than epsilon, exploration is performed (an arm is chosen
uniformly at random); otherwise exploitation is performed (the arm with the
maximum empirical mean is chosen).
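
A minimal sketch of this selection rule, assuming the per-arm empirical means are kept in a NumPy array (the names eps_greedy_arm and emp_means are illustrative, not necessarily those used in bandit.py):

import numpy as np

def eps_greedy_arm(emp_means, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best empirical mean."""
    if rng.random() < epsilon:                    # uniform draw from [0, 1)
        return int(rng.integers(len(emp_means)))  # explore: pick an arm at random
    return int(np.argmax(emp_means))              # exploit: arm with max empirical mean

# Example: 3-arm bandit, epsilon = 0.02, seeded generator
rng = np.random.default_rng(0)
arm = eps_greedy_arm(np.array([0.4, 0.55, 0.3]), 0.02, rng)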

UCB
The UCB value of each arm $a$ at time $t$ is calculated as
$$ucb_a^t = \hat{p}_a^t + \sqrt{\frac{2 \ln t}{u_a^t}}$$
where $\hat{p}_a^t$ is the empirical mean reward of arm $a$ at time $t$ and $u_a^t$
is the number of times arm $a$ has been pulled before time $t$. The arm with the
maximum UCB at time $t$ is chosen every time. Note that, in my implementation, an
arm with no pulls gets a UCB of infinity (rather than causing a zero-division
error) and is therefore picked first. As a result, every arm is pulled once before
the actual algorithm takes over.
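
A minimal sketch of this selection rule, assuming the per-arm empirical means and pull counts are kept in NumPy arrays (names are illustrative, not necessarily those used in bandit.py):

import numpy as np

def ucb_arm(emp_means, counts, t):
    """Pick the arm with the highest UCB value at time t."""
    # Unpulled arms get an infinite bonus, so every arm is pulled once first.
    bonus = np.where(counts == 0, np.inf,
                     np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(counts, 1)))
    return int(np.argmax(emp_means + bonus))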

KL-UCB
The KL-UCB value of each arm $a$ at time $t$ is calculated as
$$\text{kl-ucb}_a^t = \max \left\{ q \in [\hat{p}_a^t, 1] \;:\; u_a^t \, KL(\hat{p}_a^t, q) \le \ln t + c \ln \ln t \right\}$$
where $\hat{p}_a^t$ is the empirical mean reward of arm $a$ at time $t$, $u_a^t$ is
the number of times arm $a$ has been pulled before time $t$, and $c \ge 3$ is the
scaling factor. The KL-UCB value is found via binary search. The arm with the
maximum KL-UCB at time $t$ is chosen every time. Before the algorithm starts, each
arm is pulled once.
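
A minimal sketch of the binary search, using the Bernoulli KL divergence $KL(p, q) = p \ln\frac{p}{q} + (1-p) \ln\frac{1-p}{1-q}$ (function and variable names are illustrative, not necessarily those used in bandit.py):

import math

def kl_div(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with clamping for safety."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_value(p_hat, pulls, t, c=3.0, tol=1e-4):
    """Largest q in [p_hat, 1] with pulls * KL(p_hat, q) <= ln t + c ln ln t."""
    bound = math.log(t) + c * math.log(max(math.log(t), 1e-12))
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_div(p_hat, mid) <= bound:
            lo = mid   # mid still satisfies the constraint, search higher
        else:
            hi = mid   # mid violates the constraint, search lower
    return lo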

Thompson Sampling
For this algorithm, at time $t$, a sample $x_a^t$ is drawn for each arm $a$ such that
$$x_a^t \sim \text{Beta}(s_a^t + 1, f_a^t + 1)$$
where $s_a^t$ is the number of successes of arm $a$ up to time $t$ and $f_a^t$ is
the number of failures of arm $a$ up to time $t$. The arm with the maximum sample
$x_a^t$ is chosen every time.
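
A minimal sketch of this sampling rule, assuming the per-arm success and failure counts are kept in NumPy arrays (names are illustrative, not necessarily those used in bandit.py):

import numpy as np

def thompson_arm(successes, failures, rng):
    """Draw one Beta(s+1, f+1) sample per arm and pick the arm with the largest sample."""
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

# Example: 2-arm bandit with 3/1 and 1/2 successes/failures, seeded generator
rng = np.random.default_rng(0)
arm = thompson_arm(np.array([3, 1]), np.array([1, 2]), rng)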

Results
The plots obtained for each instance file, with 2, 5 and 25 bandit arms
respectively, are given below.

Figure 2: Log Horizon vs Regret for different methods for Dataset-1 (2 arms)

Figure 3: Log Horizon vs Regret for different methods for Dataset-2 (5 arms)
Figure 4: Log Horizon vs Regret for different methods for Dataset-3 (25 arms)

Observations
As observed from the graphs, a general ordering of regret is followed:
epsilon-greedy > UCB > KL-UCB > Thompson Sampling, and it is more pronounced for
instances with fewer arms. This is consistent with the theoretical deductions made
in class. An anomaly can be spotted in the 25-arm case, where UCB incurs more
regret than epsilon-greedy. This can be explained by the larger number of arms: a
longer horizon is needed before the difference between these two algorithms
becomes visible, and at larger horizons this case should also behave in the
expected way. Also, in every case the regret increases with the horizon: as we
sample more times, more sub-optimal pulls are bound to accumulate.

Task 2
In this task, we try to optimise the scaling coefficient $c$ in UCB:
$$ucb_a^t = \hat{p}_a^t + \sqrt{\frac{c \ln t}{u_a^t}}$$
The default value of $c$ is 2, but in this task it is varied to control the amount
of exploration performed, across 5 different bandit instances. The same code as
for UCB in the previous section is used, with $c$ introduced as a parameter in the
exploration bonus.
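
A minimal sketch of the modified selection rule (same assumed arrays as in Task 1; only the constant inside the exploration bonus changes, and c = 2 recovers standard UCB):

import numpy as np

def ucb_scaled_arm(emp_means, counts, t, c):
    """UCB arm selection with an adjustable exploration scale c (illustrative sketch)."""
    bonus = np.where(counts == 0, np.inf,
                     np.sqrt(c * np.log(max(t, 1)) / np.maximum(counts, 1)))
    return int(np.argmax(emp_means + bonus))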

Results
The plot of regret against the value of c for the 5 different instances is as follows.

Figure 5: Plot of c value vs Regret in 5 bandit instances

Observations
From the graph, a general trend can be seen: the regret first decreases with c and
then remains more or less constant. A closer look at the datasets supports this.
All five datasets have 2 arms, with the arm probabilities closest together in
set-5, next closest in set-4, and so on. The dataset with the most widely
separated arms, dataset-1, attains its minimum regret at c = 0.04; dataset-2 at
c = 0.08; dataset-3 at c = 0.12; dataset-4 at c = 0.18; and dataset-5 at c = 0.30.
Bandits whose arm probabilities are close together require more exploration before
the estimates saturate, while bandits with well-separated arms need less.
Therefore, the value of c that minimises regret increases as the arm probabilities
get closer, which matches the practical observations above.

Task 3
In this task, the UCB strategy is adapted to tackle a scenario in which the rewards
come from a finite set contained in [0, 1], which generalises the Bernoulli case.
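
As an illustration, a minimal sketch of one standard adaptation is given below: the UCB rule of Task 1 applied directly to the empirical mean of the [0, 1] rewards. This is an assumed sketch, not necessarily the exact rule implemented in bandit.py:

import numpy as np

def ucb_general_arm(reward_sums, counts, t):
    """Assumed sketch: UCB over the empirical mean of rewards in [0, 1]."""
    emp_means = np.where(counts == 0, 0.0, reward_sums / np.maximum(counts, 1))
    # Unpulled arms get an infinite bonus, so every arm is pulled once first.
    bonus = np.where(counts == 0, np.inf,
                     np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(counts, 1)))
    return int(np.argmax(emp_means + bonus))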

Results
Plots of Log Horizon vs Regret for the provided instances are as follows.

Figure 6: Log Horizon vs Regret for Dataset-1
Figure 7: Log Horizon vs Regret for Dataset-2

Observations
Similar to Task 1, the regret increases with the horizon. The regret for
Instance-2 is slightly higher than that for Instance-1.

Task 4
In this task, the UCB strategy is the same as in Task 3, but the objective is to
maximise the number of times the obtained reward exceeds a given threshold. This
is implemented by simply thresholding the rewards (if reward > threshold, the
reward is treated as 1, else as 0) and then proceeding with the strategy of Task 3.
The quantity HIGHS is the total reward under this thresholded reward. The regret
is calculated as the difference between the maximum possible HIGHS and the HIGHS
obtained.
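
A minimal sketch of the thresholding step and the corresponding regret, under the assumption that the maximum possible HIGHS is the horizon times the best arm's probability of exceeding the threshold (function and variable names are illustrative):

def thresholded_reward(reward, threshold):
    """Convert a raw reward in [0, 1] into the 0/1 'HIGH' signal."""
    return 1 if reward > threshold else 0

def highs_regret(best_high_prob, highs, horizon):
    """Assumed regret: horizon * best per-pull HIGH probability minus the HIGHS obtained."""
    return horizon * best_high_prob - highs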

Results
The plots obtained for the two datasets at different threshold values are as follows.

Figure 8: Log Horizon vs Regret with Threshold = 0.2 for Dataset-1
Figure 9: Log Horizon vs Regret with Threshold = 0.6 for Dataset-1

Figure 10: Log Horizon vs Regret with Threshold = 0.2 for Dataset-2
Figure 11: Log Horizon vs Regret with Threshold = 0.6 for Dataset-2

Observations
The key observation is that in Dataset-1 the regret at the lower threshold is
lower than the regret at the higher threshold, while in Dataset-2 the opposite is
the case. This can be explained by taking a close look at the datasets. In
dataset-1, the majority of pulls end up exceeding the threshold 0.2; in fact, for
2 out of the 3 arms, the probability of getting a reward below 0.2 is zero. So
most pulls cross both thresholds, and MAX-HIGHS is similar for the two thresholds;
but HIGHS for the higher threshold is smaller, since crossing it is the less
probable of the two events, so the regret for the higher threshold is larger. In
dataset-2, according to the data, not many pulls cross the threshold 0.6, so both
MAX-HIGHS and HIGHS shrink proportionately, bringing down the regret for the
higher threshold. Hence the regret for the lower threshold is larger here.

References
• Various NumPy functions from https://numpy.org/doc/
• Binary search on a continuous interval, idea from https://codereview.stackexchange.com/questions/168730/binary-search-on-real-space
• Plotting with matplotlib from https://matplotlib.org/stable/contents.html
• Python help from https://docs.python.org/3.8/

references.txt
All these references can also be found in the file references.txt of the submission directory.

Submission Directory
A single file, submission.tar.gz, is submitted, which contains the following files:

submission.tar.gz
    submission/
        bandit.py
        outputData.txt
        references.txt
        report.pdf
