Sarbani Mishra
PGP/24/111
Multi-armed bandits:
Learnt the exploration vs. exploitation dilemma through a classic case: creating a game of
chance.
The case models the basic principle on which casino slot machines operate.
By creating some virtual slot machines, we try to maximize the total reward by identifying the
luckiest machine.
We define a class for a single slot machine that gives a reward drawn from a normal (Gaussian)
distribution.
>class GaussianBanditGame(object):
>np.random.shuffle(self.bandits)
A particular reward can differ wildly from the average reward we expect from that machine. How
much it differs depends on the variance of the reward distribution.
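To make the point about variance concrete, here is a minimal sketch of a single Gaussian slot machine and a small set of them. The class name GaussianBandit, the method pull_lever, and the (mean, stdev) values are assumptions for illustration and are not taken from the original notebook.

```python
import numpy as np


class GaussianBandit:
    """One slot machine whose reward is drawn from a normal distribution."""

    def __init__(self, mean=0.0, stdev=1.0):
        self.mean = mean
        self.stdev = stdev

    def pull_lever(self, rng):
        # A single reward can land far from the mean when stdev is large.
        return rng.normal(self.mean, self.stdev)


rng = np.random.default_rng(0)
# Hypothetical machines; the (mean, stdev) pairs are made up for illustration.
bandits = [GaussianBandit(5, 3), GaussianBandit(6, 2), GaussianBandit(1, 5)]
rng.shuffle(bandits)  # hide which machine is which

# Pull each lever many times and compare the sample averages.
averages = [np.mean([b.pull_lever(rng) for _ in range(1000)]) for b in bandits]
best = int(np.argmax(averages))
print("luckiest machine index:", best, "average reward:", round(averages[best], 2))
```

With only a handful of pulls per machine the averages can be misleading, which is exactly why the exploration budget matters.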
Online Ads:
Another application of the MAB model. Here we use the MAB model to identify the ad with the best
CTR (click-through rate). The reward for each ad comes from a different Bernoulli distribution.
>class BernoulliBandit(object):
- Starting A/B/n testing with 5 ads, each encoded with a different Bernoulli probability, which
creates the randomness.
We are looking for the ad with the highest average reward, which is calculated by the algorithm and
shown by the print command. In our case this was “Ad D”.
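The A/B/n test described above can be sketched roughly as follows. This is only an illustration: the method name display_ad, the function ab_n_test, and the click probabilities are assumed values, not the ones used in the original exercise.

```python
import numpy as np


class BernoulliBandit:
    """One ad: each impression yields a click (1) with probability p, else 0."""

    def __init__(self, p):
        self.p = p

    def display_ad(self, rng):
        return int(rng.random() < self.p)


def ab_n_test(ads, n_impressions, rng):
    """Show ads uniformly at random and track each ad's average reward."""
    n_ads = len(ads)
    counts = np.zeros(n_ads, dtype=int)
    q = np.zeros(n_ads)  # running average reward (estimated CTR) per ad
    for _ in range(n_impressions):
        i = rng.integers(n_ads)  # pure exploration: every ad is equally likely
        reward = ads[i].display_ad(rng)
        counts[i] += 1
        q[i] += (reward - q[i]) / counts[i]  # incremental mean update
    return q, counts


# Hypothetical click probabilities for five ads (assumed for illustration).
names = ["A", "B", "C", "D", "E"]
ads = [BernoulliBandit(p) for p in (0.004, 0.016, 0.02, 0.028, 0.012)]
q, counts = ab_n_test(ads, 10_000, np.random.default_rng(0))
print("best performing ad so far:", names[int(np.argmax(q))])
```

Because the ads are picked uniformly, the test keeps spending impressions on weak ads for its whole duration.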
Plotting the curve using Cufflinks 0.17.3 (pip install cufflinks).
We can see that the average reward went to 2.7%, which is the expected CTR for ad D.
This algorithm runs the model through 10K impressions and predicts the rate. But it can be
improved, because the model doesn’t take any precaution to write off actions that are clearly
failing. The CTR is 2.98%.
Upper Confidence Bounds:
With this algorithm the CTR improved to 3.8%. The algorithm knows when to stop exploration and
start exploitation. It systematically and dynamically allocates the budget to alternatives that
need exploration.
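A rough sketch of how an upper-confidence-bound rule can allocate impressions is given below. It assumes the common form "average reward + c * sqrt(ln t / n)" for the bound. The exploration constant c, the function names, and the click probabilities are assumptions for illustration, so the numbers will not match the 3.8% above.

```python
import numpy as np


def ucb_select(q, counts, t, c=1.0):
    """Pick the ad with the highest upper confidence bound at step t."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])  # show every ad once before comparing bounds
    bonus = c * np.sqrt(np.log(t) / counts)  # shrinks as an ad gets more data
    return int(np.argmax(q + bonus))


def run_ucb(click_probs, n_impressions, c=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n_ads = len(click_probs)
    counts = np.zeros(n_ads, dtype=int)
    q = np.zeros(n_ads)
    clicks = 0
    for t in range(1, n_impressions + 1):
        i = ucb_select(q, counts, t, c)
        reward = int(rng.random() < click_probs[i])
        counts[i] += 1
        q[i] += (reward - q[i]) / counts[i]
        clicks += reward
    return q, counts, clicks / n_impressions


# Hypothetical click probabilities (assumed for illustration).
q, counts, ctr = run_ucb([0.004, 0.016, 0.02, 0.028, 0.012], 10_000, c=0.1)
print("impressions per ad:", counts, "overall CTR:", ctr)
```

Ads that have been shown rarely get a large bonus, so they are explored; once their estimate is reliable and low, the bonus no longer rescues them and the budget moves to better ads.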
Thompson sampling:
Thompson sampling is a heuristic for choosing actions that addresses the exploration-exploitation
dilemma in the multi-armed bandit problem.
After creating a dataset, it picks the best slot machine by sampling from a beta distribution for
each machine, and then updates that machine's wins and losses.
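The win/loss update described above can be sketched like this. It is a minimal version assuming Bernoulli rewards and a Beta(wins + 1, losses + 1) belief per machine. The function name and the win probabilities are made up for illustration.

```python
import numpy as np


def thompson_sampling(win_probs, n_rounds, seed=0):
    """Play n_rounds, choosing the machine with the highest sampled win rate."""
    rng = np.random.default_rng(seed)
    n_machines = len(win_probs)
    wins = np.zeros(n_machines, dtype=int)
    losses = np.zeros(n_machines, dtype=int)
    for _ in range(n_rounds):
        # Draw one plausible win rate per machine from its beta distribution.
        samples = rng.beta(wins + 1, losses + 1)
        i = int(np.argmax(samples))
        # Play the chosen machine and record the outcome.
        if rng.random() < win_probs[i]:
            wins[i] += 1
        else:
            losses[i] += 1
    return wins, losses


# Hypothetical win probabilities (assumed for illustration).
wins, losses = thompson_sampling([0.2, 0.5, 0.8], 2000)
print("plays per machine:", wins + losses)
```

Machines with few plays have wide beta distributions, so they still get sampled occasionally. As evidence accumulates, most plays go to the machine with the highest win rate.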
Using the model to figure out quickly which strategy has the highest conversion rate while spending
the minimum amount.
Customers get a popup ad; in the suggestions we can provide the same content that the ad describes.
After the algorithm has been trained and executed, it can be sent to production.
In real life, the robot will navigate through the points in order to cover the route. We can
define the path with incremental rewards to plot the course of the robots in the warehouse.
Most sensible path : ['E', 'I', 'J', 'F', 'B', 'C', 'G']
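As a rough illustration of how such a route could be learned from rewards, here is a small sketch. The notes do not name the algorithm, so Q-learning is an assumption here. The six-location layout, the reward values, and the function name learn_route are all hypothetical, and this is not the warehouse map that produced the path above.

```python
import numpy as np

# Hypothetical warehouse layout (assumed): which locations are directly connected.
edges = [("A", "B"), ("B", "C"), ("B", "E"), ("D", "E"), ("C", "F")]
locations = sorted({name for edge in edges for name in edge})
idx = {name: i for i, name in enumerate(locations)}


def learn_route(start, goal, gamma=0.75, alpha=0.9, n_iters=2000, seed=0):
    n = len(locations)
    R = np.zeros((n, n))
    for a, b in edges:
        R[idx[a], idx[b]] = R[idx[b], idx[a]] = 1  # small reward for a legal move
    R[idx[goal], idx[goal]] = 1000  # large reward at the goal location
    Q = np.zeros((n, n))
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        s = rng.integers(n)  # random current location
        a = rng.choice(np.flatnonzero(R[s] > 0))  # random legal move
        # Temporal-difference update toward reward plus discounted future value.
        Q[s, a] += alpha * (R[s, a] + gamma * Q[a].max() - Q[s, a])
    # Follow the highest Q-value from the start until the goal is reached.
    route, s = [start], idx[start]
    while s != idx[goal] and len(route) <= n:
        s = int(np.argmax(Q[s]))
        route.append(locations[s])
    return route


print(learn_route("D", "F"))
```

The reward grows as the robot gets closer to the goal, so following the largest Q-value at each location traces the route.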