Class Exercise

MAB demo, Thompson sampling, RL_sell like a wolf, Q-learning


Soubhagya Dash
PGP/25/116
Section A

MAB:
Here we apply reinforcement learning to the multi-armed bandit (MAB) problem to find the slot machine that gives the maximum reward. We simulated a set of slot machines from which we could select any machine between 1 and 3. The function's result is the prize we receive for selecting that machine. As we play repeatedly, the computer receives additional data, and through the reinforcement learning procedure the model is fine-tuned, so its predicted reward for each machine depends on the number of rounds played so far.
In the exploration phase, the more rounds we played, the more accurate the result became. Hence, the model was able to identify the luckiest machine accurately.
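
The effect of the number of rounds can be illustrated with a minimal sketch. The payout probabilities and the names WIN_PROBABILITY, play and estimate_rewards are assumptions made for illustration, not the values or code used in class.

```python
import random

# Hypothetical payout probabilities for machines 1..3 (assumed, not from the exercise).
WIN_PROBABILITY = {1: 0.2, 2: 0.5, 3: 0.7}


def play(machine, rng):
    """Return a reward of 1 if the chosen machine pays out, else 0."""
    return 1 if rng.random() < WIN_PROBABILITY[machine] else 0


def estimate_rewards(rounds, rng):
    """Play each round on a uniformly random machine and average the rewards."""
    totals = {m: 0 for m in WIN_PROBABILITY}
    counts = {m: 0 for m in WIN_PROBABILITY}
    for _ in range(rounds):
        machine = rng.choice(list(WIN_PROBABILITY))
        totals[machine] += play(machine, rng)
        counts[machine] += 1
    # A machine that was never played keeps an estimate of 0.0.
    return {m: totals[m] / counts[m] if counts[m] else 0.0 for m in totals}


rng = random.Random(0)
for rounds in (30, 3000):
    estimates = estimate_rewards(rounds, rng)
    best = max(estimates, key=estimates.get)
    print(rounds, "rounds:", estimates, "best machine:", best)
```

With only a few rounds the averages are noisy; with thousands of rounds they settle close to the assumed probabilities.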

Online Advertising:
Here, we focused on the banners of a firm that wants to market a product on many websites, with the aim of identifying the banner with the highest click-through rate (CTR).
First, we used A/B testing to find the online banner that generates the highest reward. The algorithm identified D as the highest-rewarding option, but it was inefficient: it kept exploring throughout, so it could not fully exploit what it had learned from the dataset.
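Next, we considered the epsilon-greedy algorithm. Its distinguishing feature is that it chooses between exploration and exploitation at random on each round: with probability epsilon it explores a randomly chosen banner, and otherwise it exploits the banner with the best estimate so far. This keeps a balance between the two.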
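
A minimal sketch of epsilon-greedy for the banner problem follows. The click-through rates and the epsilon value are hypothetical, chosen only so that banner D is the best, and the function name epsilon_greedy is an assumption.

```python
import random


def epsilon_greedy(ctrs, rounds, epsilon, rng):
    """Run epsilon-greedy over banners whose true click rates are `ctrs`."""
    n = len(ctrs)
    shows = [0] * n   # how often each banner was displayed
    clicks = [0] * n  # how often each banner was clicked
    for _ in range(rounds):
        if rng.random() < epsilon:
            banner = rng.randrange(n)  # explore: pick any banner at random
        else:
            # Exploit: pick the banner with the best observed click rate.
            rates = [clicks[i] / shows[i] if shows[i] else 0.0 for i in range(n)]
            banner = rates.index(max(rates))
        shows[banner] += 1
        clicks[banner] += 1 if rng.random() < ctrs[banner] else 0
    return shows, clicks


# Hypothetical true CTRs for banners A..E (assumed for illustration).
names = ["A", "B", "C", "D", "E"]
shows, clicks = epsilon_greedy([0.1, 0.3, 0.2, 0.8, 0.4], 5000, 0.1, random.Random(0))
print(dict(zip(names, shows)))
```

A larger epsilon spends more rounds exploring; a smaller one commits sooner to the current best estimate.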
Thirdly, we used the Upper Confidence Bound (UCB) algorithm, which takes into account the potential of actions that have not been explored much. The uncertainty of each action's estimate is calculated, and the potential of every action is then obtained by adding the estimate of its value to a measure of how uncertain that estimate is. This sum is the UCB, and the action with the highest UCB is chosen.
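
The "estimate plus uncertainty" idea can be sketched as below. This uses the common UCB1 bonus sqrt(2 ln t / n), which is an assumption: the class demo may have used a different uncertainty term. The click-through rates are hypothetical.

```python
import math
import random


def ucb1(ctrs, rounds, rng):
    """Run UCB1 over banners whose true click rates are `ctrs`."""
    n = len(ctrs)
    shows = [0] * n
    clicks = [0] * n
    for t in range(1, rounds + 1):
        if t <= n:
            banner = t - 1  # show every banner once so each has an estimate
        else:
            # Score = estimated value + uncertainty bonus (shrinks as shows grow).
            scores = [
                clicks[i] / shows[i] + math.sqrt(2 * math.log(t) / shows[i])
                for i in range(n)
            ]
            banner = scores.index(max(scores))
        shows[banner] += 1
        clicks[banner] += 1 if rng.random() < ctrs[banner] else 0
    return shows, clicks


# Hypothetical true CTRs for banners A..E (assumed for illustration).
shows, clicks = ucb1([0.1, 0.3, 0.2, 0.8, 0.4], 5000, random.Random(0))
print(shows)
```

Banners that have been shown rarely get a large bonus, so they are tried again until the estimate is reliable enough to rule them out.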

Thompson Sampling:
In this, I learned to use the beta distribution to choose the best slot machine by looking at how many times the user has won and how many times the user has lost on each machine.
The result indicates that the fourth machine, which has the highest conversion rate, was chosen. Therefore, it is a good slot machine for placing our coins.
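
A minimal sketch of Thompson sampling with the beta distribution is given below. The conversion rates are hypothetical (chosen so that the fourth machine is the best), and the Beta(wins + 1, losses + 1) form is the usual choice, assumed here rather than taken from the class code.

```python
import random


def thompson_sampling(rates, rounds, rng):
    """Run Thompson sampling over machines whose true win rates are `rates`."""
    n = len(rates)
    wins = [0] * n
    losses = [0] * n
    for _ in range(rounds):
        # Draw one plausible win rate per machine from Beta(wins + 1, losses + 1).
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
        machine = samples.index(max(samples))  # play the machine with the best draw
        if rng.random() < rates[machine]:
            wins[machine] += 1
        else:
            losses[machine] += 1
    return wins, losses


# Hypothetical conversion rates for five machines (assumed for illustration).
wins, losses = thompson_sampling([0.2, 0.4, 0.3, 0.7, 0.5], 3000, random.Random(0))
plays = [w + l for w, l in zip(wins, losses)]
print("plays per machine:", plays)
```

As wins and losses accumulate, each machine's beta distribution narrows, so the draws increasingly favour the machine with the highest observed conversion rate.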

Sell like a wolf:
This algorithm is used to evaluate several sales methods and choose the best one. We used random selection and the Thompson sampling model to determine the strategy chosen in each round, as well as the quantity and total value of the prizes. Then we computed the relative return, which comes to 98%. From the histogram, we can see that method "6" was selected most often, indicating that it is the best.
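
As a sketch of how such a relative return might be computed, the function below assumes it is defined as the percentage gain of Thompson sampling's total reward over random selection's total reward. That definition, the function name and the example totals are assumptions, not taken from the exercise.

```python
def relative_return(reward_thompson, reward_random):
    """Percentage gain of Thompson sampling over random selection (assumed definition)."""
    return 100 * (reward_thompson - reward_random) / reward_random


# Hypothetical totals, chosen only to illustrate the arithmetic.
print(relative_return(198, 100))
```

Under this definition, a relative return of 98% would mean Thompson sampling earned nearly twice the reward of random selection.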

Robots warehouse demo:
This is a method based on artificial intelligence that helps warehouses with their logistics by finding the optimal, quickest route from source to destination.
Initially, we worked out the three fundamental ideas of reinforcement learning: states, actions, and rewards. For each action performed by the robot, we give the model a reward: if it selects the correct path/action, it is rewarded. Then we determine how to travel from each state to the destination. Finally, we create the AI model or function responsible for determining the quickest path.
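
The states, actions and rewards described above can be sketched with tabular Q-learning. The warehouse layout (LINKS), the reward of 100 for reaching the goal, and the values of alpha and gamma are all hypothetical choices for illustration, not the setup used in class.

```python
import random

# Hypothetical warehouse: each location (state) lists the locations reachable from it.
LINKS = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["B", "D", "F"],
    "F": ["E"],
}


def train(goal, rng, steps=2000, alpha=0.9, gamma=0.75):
    """Learn Q-values by trying random moves; an action is 'move to a neighbour'."""
    q = {s: {nxt: 0.0 for nxt in LINKS[s]} for s in LINKS}
    starts = [s for s in LINKS if s != goal]
    for _ in range(steps):
        state = rng.choice(starts)
        nxt = rng.choice(LINKS[state])
        reward = 100.0 if nxt == goal else 0.0  # reward only for reaching the goal
        future = 0.0 if nxt == goal else max(q[nxt].values())
        # Temporal-difference update towards reward + discounted future value.
        q[state][nxt] += alpha * (reward + gamma * future - q[state][nxt])
    return q


def best_route(q, start, goal):
    """Follow the highest Q-value from each location until the goal is reached."""
    route = [start]
    while route[-1] != goal and len(route) <= len(LINKS):
        here = route[-1]
        route.append(max(q[here], key=q[here].get))
    return route


q = train("F", random.Random(0))
print(best_route(q, "A", "F"))
```

Because the discount factor shrinks the value of distant rewards, moves on shorter routes end up with higher Q-values, so following the best action at each location gives the quickest path in this layout.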
