Bandits
Prabuchandran K.J.
26 February 2020
Outline
Applications of Bandits
Concluding Remarks
Multi-Armed Bandit (MAB) Problem
Stochastic Bandits Problem Setting
Each choice (arm) a has an average reward Q(a).
Problem: the average reward is unknown for every arm and has to be learnt.
Bernoulli Bandit
2 Arms/Coins
The average rewards of coin 1 and coin 2, Q(1) and Q(2), are probabilities, and they are unknown.
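As an illustration, the two-coin setting can be simulated with a small sketch. The true means 0.4 and 0.6 below are made-up values; in the actual problem they are unknown to the learner:

```python
import random

# Illustrative (assumed) true means Q(1), Q(2); unknown to the learner.
TRUE_Q = {1: 0.4, 2: 0.6}

def pull(arm):
    """Flip coin `arm`: reward 1 (head) with probability Q(arm), else 0 (tail)."""
    return 1 if random.random() < TRUE_Q[arm] else 0
```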
Sample-mean estimate of each arm's average reward after t pulls:

\[
\hat{Q}_t(a) = \frac{\sum_{k=1}^{t} R_k \, I_{a_k}(a)}{N_a(t)}, \qquad a = 1, 2,
\]

where \(I_{a_k}(a)\) indicates that arm \(a\) was pulled at step \(k\), and \(N_a(t) = \sum_{k=1}^{t} I_{a_k}(a)\) is the number of times arm \(a\) has been pulled up to time \(t\).
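A minimal sketch of this sample-mean estimate (the function name and list-based bookkeeping are my own):

```python
def sample_mean_estimates(arms, rewards, n_arms=2):
    """Compute Q_hat_t(a) = (total reward from arm a) / (number of pulls of arm a)."""
    totals = [0.0] * (n_arms + 1)  # index 0 unused; arms are 1-indexed
    counts = [0] * (n_arms + 1)
    for a, r in zip(arms, rewards):
        totals[a] += r
        counts[a] += 1
    return {a: (totals[a] / counts[a] if counts[a] else 0.0)
            for a in range(1, n_arms + 1)}
```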
UCB Algorithm

Pick \(a_{t+1} = \arg\max \{\hat{Q}_t(1) + \text{bonus}(1),\; \hat{Q}_t(2) + \text{bonus}(2)\}\), where

\[
\text{bonus}(i) = \sqrt{\frac{2 \ln t}{N_i(t)}}, \qquad i = 1, 2.
\]
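The UCB rule can be sketched as follows, assuming `q_hat` and `counts` are dicts keyed by arm; giving unpulled arms an infinite bonus is the usual convention so that every arm is tried at least once:

```python
import math

def ucb_pick(q_hat, counts, t):
    """Pick the arm maximizing Q_hat_t(i) + sqrt(2 ln t / N_i(t))."""
    def score(i):
        if counts[i] == 0:
            return float('inf')  # unpulled arm: try it first
        return q_hat[i] + math.sqrt(2 * math.log(t) / counts[i])
    return max(q_hat.keys(), key=score)
```

Note how the bonus favors arm 2 below even though the estimates are tied, because arm 2 has been pulled less often.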
Bayesian Update: Prior to Posterior

Beta prior on the coin's bias \(p\):

\[
f(p; \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)}, \qquad 0 \le p \le 1.
\]

Should we increase α or β?
Whenever a head or a tail is observed from the coin, we know how to update the parameters: a head increments α, and a tail increments β.
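This prior-to-posterior update is a one-liner; the helper name below is my own:

```python
def update_beta(alpha, beta, reward):
    """Beta-Bernoulli conjugate update: a head (reward 1) increments alpha,
    a tail (reward 0) increments beta."""
    if reward == 1:
        return alpha + 1, beta
    return alpha, beta + 1
```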
Thompson Sampling

Sample values from coin 1 as well as coin 2 based on the current model:

\[
\hat{Q}(1) \sim \text{Beta}(\alpha_1, \beta_1), \qquad \hat{Q}(2) \sim \text{Beta}(\alpha_2, \beta_2).
\]
Thompson Sampling Summary
Sample rewards from the uncertainty model for each of the arms.
Pull the arm with the largest sampled reward.
Update the uncertainty model of the pulled arm in the Bayesian way.
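Putting the three steps together, a sketch of the full loop for the two-coin case. The Beta(1, 1) (uniform) priors are an assumed starting point, and `pull` can be any function returning a 0/1 reward:

```python
import random

def thompson_sampling(pull, n_rounds=1000):
    """Thompson sampling for a 2-armed Bernoulli bandit; arms are 1 and 2."""
    params = {1: [1, 1], 2: [1, 1]}  # Beta(alpha, beta) posterior per arm
    total = 0
    for _ in range(n_rounds):
        # Step 1: sample a plausible mean reward for each arm from its posterior
        samples = {a: random.betavariate(*params[a]) for a in params}
        # Step 2: pull the arm with the largest sampled reward
        arm = max(samples, key=samples.get)
        r = pull(arm)
        total += r
        # Step 3: Bayesian update of the pulled arm only
        if r == 1:
            params[arm][0] += 1
        else:
            params[arm][1] += 1
    return params, total
```

Because each arm's pull count equals α + β − 2, the returned posteriors also record how often each arm was chosen.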
Revisit: Bernoulli Bandit. 2 arms/coins whose average rewards Q(1) and Q(2) are unknown probabilities.
Policy Parameterization

Softmax policy with parameters \(\theta = (\theta_1, \theta_2)\):

\[
\pi(a = 1; \theta) = \frac{e^{\theta_1}}{e^{\theta_1} + e^{\theta_2}}, \qquad \pi(a = 2; \theta) = 1 - \pi(a = 1; \theta).
\]
The objective is the expected reward under the policy:

\[
\rho(\theta) = \mathbb{E}[\text{Reward}] = \sum_{a} Q(a)\, \pi(a; \theta).
\]
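One standard way to maximize ρ(θ) is stochastic gradient ascent using the REINFORCE estimator \(\nabla_\theta \log \pi(a; \theta) \cdot R\). The sketch below is an assumed instantiation, not taken verbatim from the slides; arms are indexed 0 and 1 for convenience, and the learning rate is arbitrary:

```python
import math, random

def softmax(theta):
    """pi(a; theta) for a 2-armed softmax policy, arms indexed 0 and 1."""
    m = max(theta)  # subtract max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(theta, pull, lr=0.1):
    """One stochastic-gradient ascent step on rho(theta).
    Uses grad log pi(a; theta)_i = 1{i == a} - pi(i; theta)."""
    pi = softmax(theta)
    a = 0 if random.random() < pi[0] else 1  # sample an arm from the policy
    r = pull(a)
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - pi[i]
        theta[i] += lr * r * grad_log
    return theta
```

Repeated steps shift probability mass toward the arm with the higher average reward, since high-reward pulls reinforce the parameter of the arm that produced them.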