
Stateless Reinforcement Learning

Prabuchandran K.J.

Assistant Professor, IIT Dharwad

26 February 2020
Outline

Multi-Arm Bandit Problem

Application of Bandits

Simple Bandit Problem

Approaches to solve Bandit Problems

Value based algorithms - UCB, Thompson sampling

Policy based algorithms - Reinforce

Concluding Remarks
Multi-Arm Bandit (MAB) Problem

Choose the slot machine


Stochastic Bandits Problem Setting

K choices or K arms {1, 2, . . . , K}

Each arm a has an average reward Q(a)

The reward for arm a is drawn from an underlying distribution P(a)

Collect as much reward as possible in T rounds.


Simple Solution

Determine a∗ = arg maxa Q(a)

Play the optimal arm a∗ in all T rounds

What is the total reward collected? Q(a∗ )T

Problem: Average reward unknown for all arms and has to be learnt
Revisit: Stochastic Bandits Problem Setting

K choices or K arms {1, 2, . . . , K}

Each arm a has an average reward Q(a)

The reward for arm a is drawn from an underlying distribution P(a)

The reward distributions and the average rewards are unknown

Collect as much reward as possible, on average, in T rounds.


Application of Bandits

Clinical Trials

Network Routing

Online Ad Placement

Game Design


Explore vs Exploit Dilemma

Current Success Estimates


Bernoulli Bandit

2 Arms/Coins

Each coin has probability of head Q(1) and Q(2)

Head/Success yields reward 1 and Tail/Failure yields 0

The average rewards of coin 1 and coin 2 are the probabilities Q(1) and Q(2), and they are unknown

In each of T rounds t ∈ {1, 2, . . . , T}, choose either coin 1 or coin 2

Goal is to maximize the expected total reward (TR) collected

TR = E[R1 + R2 + · · · + RT]
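
To make the setting concrete, here is a minimal simulation of this two-coin problem (a sketch in Python with NumPy; the class and function names are illustrative and not from the slides, and arms are indexed 0 and 1 in code rather than 1 and 2).

```python
import numpy as np

class BernoulliBandit:
    """Two-armed Bernoulli bandit: arm a pays 1 with probability Q(a), else 0."""
    def __init__(self, q, rng=None):
        self.q = np.asarray(q, dtype=float)      # true head probabilities (unknown to the learner)
        self.rng = rng or np.random.default_rng()

    def pull(self, arm):
        # Reward is 1 (head) with probability q[arm], 0 (tail) otherwise.
        return int(self.rng.random() < self.q[arm])

def total_reward(bandit, arms):
    """Play the given sequence of arms and return the total reward collected."""
    return sum(bandit.pull(a) for a in arms)

# Example: T = 1000 rounds of always playing arm 0 on a bandit with Q = (0.6, 0.4).
bandit = BernoulliBandit([0.6, 0.4])
print(total_reward(bandit, [0] * 1000))          # roughly Q(a*) * T = 600 on average
```
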
Approaches to solve Bandit Problems

Value based (Indirect)

  - Maintain an estimate of the quality of each action based on the rewards
  - Decide which action to pick based on these quality estimates

Policy Based (Direct)

  - Directly keep preferences for actions by maintaining a probability for each action
  - Update the action probabilities based on the rewards seen
Value based solution approach
Greedy Algorithm

Keep track of the proportion of heads

Head estimates are Q̂t (1) and Q̂t (2)

Q̂t(a) = ( Σ_{k=1}^{t} Rk · I{ak = a} ) / ( Σ_{k=1}^{t} I{ak = a} ),   a = 1, 2

Pick at+1 = arg max{Q̂t (1), Q̂t (2)}

Greedy can pick the suboptimal arm for O(T) rounds
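
A minimal sketch of this greedy rule, assuming the illustrative BernoulliBandit class from the earlier sketch (Python with NumPy; names are not from the slides):

```python
import numpy as np

def greedy(bandit, T, rng=None):
    rng = rng or np.random.default_rng()
    K = 2
    heads = np.zeros(K)    # total reward (number of heads) collected per arm
    pulls = np.zeros(K)    # number of plays per arm
    total = 0
    for t in range(T):
        # Proportion-of-heads estimate Q̂_t(a); unplayed arms default to 0.
        q_hat = np.divide(heads, pulls, out=np.zeros(K), where=pulls > 0)
        # Greedy choice, ties broken at random.
        a = rng.choice(np.flatnonzero(q_hat == q_hat.max()))
        r = bandit.pull(a)
        heads[a] += r
        pulls[a] += 1
        total += r
    return total
```

Because it never explores deliberately, an unlucky early tail on the better coin can lock this rule onto the worse coin, which is why it can play the suboptimal arm O(T) times.
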


UCB Algorithm

Keep track of the proportion of heads for each coin

Keep track of the number of plays of each coin N1 (t), N2 (t)

Head estimates are Q̂t (1) and Q̂t (2)

Pick at+1 = arg max{Q̂t (1) + bonus(1), Q̂t (2) + bonus(2)}, where
bonus(i) = √( 2 ln t / Ni(t) ),   i = 1, 2

UCB picks the suboptimal arm in only O(lg T) rounds
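
A sketch of the UCB rule under the same assumptions (Python with NumPy, reusing the illustrative BernoulliBandit; each arm is played once to initialize before the bonus is well defined):

```python
import numpy as np

def ucb(bandit, T):
    K = 2
    heads = np.zeros(K)
    pulls = np.zeros(K)
    total = 0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                  # play each arm once to initialize
        else:
            q_hat = heads / pulls                      # proportion of heads per arm
            bonus = np.sqrt(2.0 * np.log(t) / pulls)   # exploration bonus √(2 ln t / N_i(t))
            a = int(np.argmax(q_hat + bonus))
        r = bandit.pull(a)
        heads[a] += r
        pulls[a] += 1
        total += r
    return total
```
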


Bayesian Treatment

Treat each coin's success probability as a random variable

For each of these random variables, assume an underlying distribution

Based on the reward observations, update the distributions of the success probabilities


Exploring choice of distribution

Which distribution can we use? Uniform distribution

This is the same as the Beta(α, β) distribution with α = β = 1


Beta distribution

f(p; α, β) = p^(α−1) (1 − p)^(β−1) / B(α, β),   0 ≤ p ≤ 1

B(α, β) - Beta function, α, β > 0
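
For intuition about the shape parameters, the density can be evaluated numerically; a small sketch assuming SciPy is available (values are printed, not asserted):

```python
from scipy.stats import beta

for a, b in [(1, 1), (100, 10)]:
    # beta.pdf(p, a, b) evaluates the Beta(α, β) density at p.
    print(a, b, [round(beta.pdf(p, a, b), 3) for p in (0.1, 0.5, 0.9)])
# Beta(1, 1) is flat (uniform); Beta(100, 10) concentrates near p ≈ 0.9.
```
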


Beta density for different α and β

(Plots shown for: α = β = 1; α = β = 100; α = β = 1000; α = 100, β = 10; α = 10, β = 100; α = β = 0.9)


Bayesian Update : Prior to Posterior

Suppose we toss a coin and obtain a head; how do we update the distribution?

Should we increase α or β?

What are αnew and βnew?


αnew = αold + 1 and βnew = βold
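
This conjugate prior-to-posterior update is a one-liner; a sketch with an illustrative helper name:

```python
def beta_update(alpha, beta_, reward):
    """Posterior Beta parameters after one Bernoulli observation (1 = head, 0 = tail)."""
    if reward == 1:
        return alpha + 1, beta_        # head: increase alpha
    return alpha, beta_ + 1            # tail: increase beta

# Starting from the uniform prior Beta(1, 1) and observing a head:
print(beta_update(1, 1, 1))            # (2, 1)
```
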
Thompson sampling

For coin 1, maintain an uncertainty model Beta(α1, β1)

Similarly, for coin 2 maintain an uncertainty model Beta(α2, β2)

Whenever a head or tail comes from a coin, we know how to update its parameters

If a head comes, increase α; if a tail comes, increase β


Thompson sampling ctd

How do we choose which coin to play at time t?

Sample a value for coin 1 and a value for coin 2 from their current models

Q̂(1) ∼ beta(α1 , β1 )

Q̂(2) ∼ beta(α2 , β2 )

at+1 = arg max{Q̂(1), Q̂(2)}

Thompson sampling picks the suboptimal arm in only O(lg T) rounds
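
Putting the pieces together, a minimal Thompson sampling loop for the two coins might look like the sketch below (Python with NumPy, reusing the illustrative BernoulliBandit and starting from uniform Beta(1, 1) priors):

```python
import numpy as np

def thompson_sampling(bandit, T, rng=None):
    rng = rng or np.random.default_rng()
    K = 2
    alpha = np.ones(K)   # Beta prior parameters per arm, Beta(1, 1) = uniform
    beta_ = np.ones(K)
    total = 0
    for t in range(T):
        # Sample a plausible head probability for each coin from its current model.
        samples = rng.beta(alpha, beta_)
        a = int(np.argmax(samples))          # play the arm with the highest sample
        r = bandit.pull(a)
        # Bayesian update of the pulled arm only.
        alpha[a] += r
        beta_[a] += 1 - r
        total += r
    return total
```
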


Thompson sampling Summary

Maintain uncertainty models for each of the arms

Sample rewards from the uncertainty model for each of the arms

Pull the arm that has the highest sample reward

Update the uncertainty model of the pulled arm in the Bayesian way

Thompson sampling achieves O(lg T ) regret


Policy based solution approach
Policy Search

Maintain preferences for actions

πt(a), a = 1, 2 - the probability of picking action a at time t

Directly update these probabilities based on the observed rewards

Let at denote the action taken at time t


How to update?

Suppose we received reward 1, then

  - we increase the probability of that action:
    πt+1(at) = πt(at) + α(1 − πt(at)), where α is the stepsize parameter
  - we decrease the probability of the other action:
    πt+1(a) = πt(a) + α(0 − πt(a)), a ≠ at

Suppose we received reward 0, then

  - we decrease the probability of that action:
    πt+1(at) = πt(at) + β(0 − πt(at)), where β is the stepsize parameter
  - we increase the probability of the other action:
    πt+1(a) = πt(a) + β(1 − πt(a)), a ≠ at
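
A sketch of one such update step for the two-action case (Python; the action probabilities are stored as a length-2 list and the function name is illustrative). On reward 1 the chosen action moves toward probability 1 with step α and the other action shrinks correspondingly; on reward 0 the chosen action moves toward 0 with step β and the other action grows, so the two probabilities always sum to 1.

```python
def update_policy(pi, chosen, reward, alpha=0.1, beta=0.1):
    """One learning-automata style update of the action probabilities pi = [pi(1), pi(2)]."""
    other = 1 - chosen
    pi = list(pi)
    if reward == 1:
        pi[chosen] += alpha * (1 - pi[chosen])   # move the chosen action toward probability 1
        pi[other]  += alpha * (0 - pi[other])    # shrink the other action
    else:
        pi[chosen] += beta * (0 - pi[chosen])    # move the chosen action toward probability 0
        pi[other]  += beta * (1 - pi[other])     # grow the other action
    return pi

print(update_policy([0.5, 0.5], chosen=0, reward=1))   # [0.55, 0.45] with alpha = 0.1
```
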
Three Simple Algorithms

If α = β, then we have the LR−P algorithm

If α ≫ β, then we have the LR−εP algorithm

If β = 0, then we have the LR−I algorithm


Policy Parameterization

Policy depends on a set of parameters θ = (θ1 , θ2 )

π(a = 1; θ) = e^(θ1) / (e^(θ1) + e^(θ2))

θ = (θ1 , θ2 ) specifies probabilities of choosing actions

Each θ indirectly represents a policy

How good is a parameter θ?

Define a performance measure for each θ


Performance measure for θ

ρ(θ) - Evaluation of policy for parameters θ

ρ(θ) = E[Reward] = Σa Q(a) π(a; θ)

ρ(θ) - Expected payoff for policy θ
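
If the true averages Q(a) were known, ρ(θ) could be computed directly. A small sketch (Python with NumPy, illustrative names) that evaluates the softmax probabilities for a given θ and the resulting expected payoff:

```python
import numpy as np

def softmax_policy(theta):
    """pi(a; theta) = exp(theta_a) / sum_b exp(theta_b)."""
    z = np.exp(theta - np.max(theta))   # subtract the max for numerical stability
    return z / z.sum()

def rho(theta, q):
    """rho(theta) = sum_a Q(a) * pi(a; theta)."""
    return float(np.dot(q, softmax_policy(theta)))

print(rho(np.array([0.0, 0.0]), np.array([0.6, 0.4])))   # ≈ 0.5 for the uniform policy
```
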

How can we update these parameters?


Gradient Ascent

θ(t+1) = θ(t) + αt ∇ρ(θ(t)), where αt is the step size parameter

Move in the direction of the gradient, which locally increases ρ(θ)
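
Since Q(a), and hence ∇ρ(θ), is unknown, a REINFORCE-style scheme replaces the true gradient with the sample estimate Rt ∇ log π(at; θ), which is unbiased for ∇ρ(θ). A minimal sketch under that assumption (Python with NumPy, reusing the illustrative BernoulliBandit and softmax_policy from the earlier sketches):

```python
import numpy as np

def reinforce(bandit, T, step=0.1, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.zeros(2)                       # start from the uniform policy
    for t in range(T):
        pi = softmax_policy(theta)
        a = rng.choice(2, p=pi)               # sample an action from pi(.; theta)
        r = bandit.pull(a)
        # Gradient of log pi(a; theta) for the softmax: e_a - pi
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta += step * r * grad_log_pi       # stochastic gradient ascent on rho(theta)
    return softmax_policy(theta)

# With Q = (0.6, 0.4), the learned policy should shift probability mass toward arm 0.
print(reinforce(BernoulliBandit([0.6, 0.4]), T=5000))
```
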
