
Reinforcement Learning

Sarbani Mishra

PGP/24/111

Multi-armed bandits:

We learnt the exploration vs. exploitation dilemma through a classic case. It is about creating a game of chance.

It is the classic case behind the principal mechanism on which casino slot machines operate.

By creating some virtual slot machines, we will try to maximize the total reward by identifying the luckiest machine.

>import numpy as np

Then we define a class for a single slot machine that gives a reward drawn from a normal (Gaussian) distribution.

>class GaussianBanditGame(object):

Then we shuffle the bandits so that their order is random:

>np.random.shuffle(self.bandits)
A particular reward can be wildly different from the average reward we expect from that machine. This depends on the variance of the reward distribution.
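
A minimal sketch of how these pieces could fit together, assuming one class for a single machine and one for the game that shuffles several machines; the method names, default values, and machine means below are assumptions for illustration, not the original code:

import numpy as np

class GaussianBandit(object):
    # A single slot machine whose reward is drawn from a normal distribution
    def __init__(self, mean=0, stdev=1):
        self.mean = mean      # true average payout, unknown to the player
        self.stdev = stdev    # spread of the payouts around that average

    def pull_lever(self):
        # One play of the machine: sample a reward around the (unknown) mean
        return np.round(np.random.normal(self.mean, self.stdev), 1)

class GaussianBanditGame(object):
    # A game made of several machines, presented in a random order
    def __init__(self, bandits):
        self.bandits = bandits
        np.random.shuffle(self.bandits)   # hide which machine is which
        self.total_reward = 0
        self.n_played = 0

    def play(self, choice):
        # Play the chosen machine and keep a running total of the reward
        reward = self.bandits[choice].pull_lever()
        self.total_reward += reward
        self.n_played += 1
        return reward

game = GaussianBanditGame([GaussianBandit(m) for m in (5, 6, 1)])  # assumed means
print(game.play(0), game.play(0), game.play(0))   # single pulls can differ a lot
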

Online Ads:

This is one more application of the MAB model. Here we want to use the MAB model to identify the ad with the best CTR (click-through rate). The reward for each ad comes from a different Bernoulli distribution.

Adding a bandit based on the Bernoulli distribution:

>class BernoulliBandit(object):
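
A minimal sketch of such a Bernoulli bandit; p is the ad's true but unknown click probability, and the method name is an assumption:

class BernoulliBandit(object):
    # An ad whose reward is 1 (a click) with probability p and 0 otherwise
    def __init__(self, p):
        self.p = p   # the ad's true, unknown click-through rate

    def display_ad(self):
        # One impression: the user clicks with probability p
        return np.random.binomial(n=1, p=self.p)
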

- Starting A/B/n testing with 5 ads, each encoded with a different Bernoulli success probability, which creates the randomness

We are looking at the average reward, which is calculated by the algorithm and shown by the print command; in our case the best-performing ad was "Ad D".
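
A rough sketch of the A/B/n test, assuming each ad is shown uniformly at random while a running average reward is kept per ad; the CTRs below are made up purely for illustration:

# Five ads with made-up true CTRs; the best ad is unknown to the test
ads = {name: BernoulliBandit(p)
       for name, p in zip("ABCDE", [0.004, 0.016, 0.02, 0.027, 0.031])}

names = list(ads.keys())
n_ads = len(names)
n_test = 10000                 # impressions spent on the test (assumed)
Q = np.zeros(n_ads)            # running average reward per ad
N = np.zeros(n_ads)            # impressions shown per ad
total_reward = 0
avg_rewards = []               # overall running average, used for plotting later

for i in range(n_test):
    a = np.random.randint(n_ads)          # A/B/n: every ad shown uniformly at random
    r = ads[names[a]].display_ad()        # 1 if the user clicked, else 0
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]             # incremental average update
    total_reward += r
    avg_rewards.append(total_reward / (i + 1))

print("Best performing ad:", names[int(np.argmax(Q))])
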
Plotting the curve using Cufflinks 0.17.3 (pip install cufflinks).

We can see that the average reward went to 2.7%, which is the expected CTR for ad D.
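
A plotting sketch, assuming the running averages from the loop above were collected in avg_rewards; cufflinks attaches an iplot method to pandas objects:

import pandas as pd
import cufflinks as cf

cf.go_offline()   # render plotly charts offline in the notebook
df_reward = pd.DataFrame({"A/B/n": avg_rewards})
df_reward.iplot(title="Average reward over impressions",
                xTitle="Impression", yTitle="Avg. reward (CTR)")
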

Use of the Epsilon-Greedy algorithm to understand the exploration-exploitation problem:

The Epsilon-Greedy algorithm helps to improve the CTR.

>greedy_list = ['e-greedy: 0.1']

This algorithm runs the model through the 10K impressions and estimates the rate. But it can still be improved, because the model does nothing to write off actions that are clearly failing. The CTR here is 2.98%.
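
A sketch of epsilon-greedy selection with epsilon = 0.1, reusing the ads, names and n_ads from the A/B/n sketch above; with probability epsilon a random ad is explored, otherwise the current best ad is exploited:

eps = 0.1                      # exploration probability, as in 'e-greedy: 0.1'
n_prod = 10000                 # the 10K impressions mentioned above
Q = np.zeros(n_ads)            # running average reward per ad
N = np.zeros(n_ads)
total_reward = 0

for i in range(n_prod):
    if np.random.uniform() <= eps:
        a = np.random.randint(n_ads)   # explore: show a random ad
    else:
        a = int(np.argmax(Q))          # exploit: show the current best ad
    r = ads[names[a]].display_ad()
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]
    total_reward += r

print("e-greedy average CTR:", total_reward / n_prod)
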
Upper Confidence Bounds:

With this algorithm the CTR improved to 3.8%. This algorithm knows when to stop exploring and start exploiting: it systematically and dynamically allocates the budget to alternatives that need exploration.
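
A sketch of the UCB idea, again reusing the ads from above: each ad's estimated reward gets a confidence bonus that shrinks as the ad is shown more, so exploration tapers off by itself. The constant c is an assumed value:

c = 0.1                        # exploration constant (assumed)
n_prod = 10000
Q = np.zeros(n_ads)
N = np.zeros(n_ads)
total_reward = 0

for t in range(1, n_prod + 1):
    if N.min() == 0:
        a = int(np.argmin(N))                     # show every ad at least once
    else:
        ucb = Q + c * np.sqrt(np.log(t) / N)      # average reward + confidence bonus
        a = int(np.argmax(ucb))                   # bonus shrinks as an ad is shown more
    r = ads[names[a]].display_ad()
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]
    total_reward += r

print("UCB average CTR:", total_reward / n_prod)
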

Thompson sampling:

Thompson sampling is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem.
After creating a dataset, it picks the best slot machine by drawing from a Beta distribution for each machine and updating its counts of wins and losses.
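
A sketch of that Beta-distribution bookkeeping, reusing the ads from above; alpha counts wins, beta counts losses, and the arm with the highest posterior draw is played:

alphas = np.ones(n_ads)        # 1 + number of wins (clicks) per ad
betas = np.ones(n_ads)         # 1 + number of losses (no clicks) per ad
total_reward = 0

for i in range(10000):
    theta = np.random.beta(alphas, betas)   # one draw per ad from its Beta posterior
    a = int(np.argmax(theta))               # play the arm whose draw is highest
    r = ads[names[a]].display_ad()
    alphas[a] += r                          # a win makes the posterior more optimistic
    betas[a] += 1 - r                       # a loss makes it more pessimistic
    total_reward += r

print("Thompson sampling average CTR:", total_reward / 10000)
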

Thompson sampling for customer conversion via advertisements:

We use the model to figure out which strategy has the highest conversion rate, as quickly as possible and by spending the minimum amount.

Customers get a pop-up ad; as a suggestion, we can provide the same content that the ad describes.

Building the environment for the simulation.


Computing the relative return of each strategy and plotting a histogram of the relative return against the respective strategies.
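
A rough sketch of the simulation, assuming a matrix X where X[i, s] is 1 if customer i would convert under strategy s; the number of customers and strategies, the conversion rates, and the random-selection baseline are all assumptions for illustration:

np.random.seed(42)
n_customers, n_strategies = 10000, 9
conv_rates = np.random.uniform(0.01, 0.20, n_strategies)   # assumed true rates
# X[i, s] = 1 if customer i would convert when shown strategy s
X = (np.random.uniform(size=(n_customers, n_strategies)) < conv_rates).astype(int)

alphas = np.ones(n_strategies)
betas = np.ones(n_strategies)
reward_ts, reward_rand = 0, 0

for i in range(n_customers):
    s_ts = int(np.argmax(np.random.beta(alphas, betas)))   # Thompson sampling choice
    s_rand = np.random.randint(n_strategies)                # random-selection baseline
    r = X[i, s_ts]
    alphas[s_ts] += r
    betas[s_ts] += 1 - r
    reward_ts += r
    reward_rand += X[i, s_rand]

relative_return = (reward_ts - reward_rand) / reward_rand * 100
print("Relative return over random selection: %.0f%%" % relative_return)
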

(Q-learning) Finding a path through a warehouse using a positive reward system

First, we need to create an environment by defining the rewards:
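
A sketch of how the rewards can be defined, assuming twelve warehouse locations A to L; the connectivity below is illustrative, and only the idea of a reward matrix with a large reward on the goal matters:

# Map each warehouse location to a state index
location_to_state = {ch: i for i, ch in enumerate("ABCDEFGHIJKL")}

# R[s, s'] = 1 if the robot is allowed to move from s to s', 0 otherwise.
R = np.zeros((12, 12))
allowed_moves = [("A", "B"), ("B", "C"), ("B", "F"), ("C", "G"), ("D", "H"),
                 ("E", "I"), ("F", "J"), ("G", "H"), ("H", "L"),
                 ("I", "J"), ("J", "K"), ("K", "L")]
for u, v in allowed_moves:
    i, j = location_to_state[u], location_to_state[v]
    R[i, j] = R[j, i] = 1      # moves are possible in both directions

# A large reward on the goal cell makes the goal attractive, e.g. location G
goal = location_to_state["G"]
R[goal, goal] = 1000
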


Building the AI solution with Q-learning and implementing it:

The same Q-learning algorithm can be used to teach an AI to follow a route or find the goal in a labyrinth. This is done by giving incentives to the algorithm in order to teach it how to navigate.
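
A sketch of the Q-learning loop over that reward matrix, with assumed values for the learning rate and discount factor:

gamma = 0.75    # discount factor (assumed value)
alpha = 0.9     # learning rate (assumed value)
Q = np.zeros((12, 12))

for _ in range(50000):                        # many random training transitions
    s = np.random.randint(12)                 # pick a random current state
    playable = np.where(R[s] > 0)[0]          # actions allowed from that state
    if len(playable) == 0:
        continue
    s_next = int(np.random.choice(playable))  # take a random allowed action
    # Temporal-difference update: move Q towards reward + discounted best future value
    td = R[s, s_next] + gamma * np.max(Q[s_next]) - Q[s, s_next]
    Q[s, s_next] += alpha * td
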

After the algorithm has learned and been executed, it can be sent to production. In real life, the robot will navigate through the points in order to cover the route. We can define the path with incremental rewards to plot the course of the robots in the warehouse.
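
Once Q is trained, the route can be read off greedily: from each location, move to the neighbour with the highest Q-value until the goal is reached. A sketch:

state_to_location = {i: ch for ch, i in location_to_state.items()}

def route(start, goal):
    # Greedily follow the highest-Q move from each location until the goal
    path = [start]
    current = start
    while current != goal and len(path) < 20:   # cap the length to avoid loops
        s = location_to_state[current]
        current = state_to_location[int(np.argmax(Q[s]))]
        path.append(current)
    return path

print(route("E", "G"))
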

Most sensible path: ['E', 'I', 'J', 'F', 'B', 'C', 'G']
