Exploration vs Exploitation
Marcello Restelli
March–April, 2015
Exploration vs Exploitation Dilemma
Online decision making involves a fundamental choice:
Exploitation: make the best decision given current information
Exploration: gather more information
The best long–term strategy may involve short–term sacrifices
Gather enough information to make the best overall decisions
Examples
ε–Greedy

    a_t = a_t*             with probability 1 − ε
          a random action  with probability ε
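As a sketch, the ε–greedy rule above can be written in a few lines of Python (the value list Q and the rng parameter are illustrative choices, not from the slides):

```python
import random

def epsilon_greedy(Q, epsilon, rng=random):
    """ε-greedy selection over estimated action values Q (a list).

    With probability epsilon pick a uniformly random arm,
    otherwise the greedy arm arg max_a Q[a].
    """
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])
```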
Softmax
Bias exploration towards promising actions
Softmax action selection methods grade action probabilities by estimated values
The most common softmax uses a Gibbs (or Boltzmann) distribution:

    π(a|s) = e^{Q(s,a)/τ} / Σ_{a'∈A} e^{Q(s,a')/τ}

τ is a “computational” temperature:
τ → ∞: π(a|s) = 1/|A| (uniform random)
τ → 0: greedy
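The Gibbs distribution can be sketched as follows (the max-subtraction is a standard numerical-stability trick, not part of the slide's formula):

```python
import math

def softmax_probs(Q, tau):
    """Gibbs/Boltzmann action probabilities: π(a) ∝ exp(Q[a] / τ)."""
    m = max(Q)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp((q - m) / tau) for q in Q]
    z = sum(exps)
    return [e / z for e in exps]
```

With a very large τ the probabilities approach uniform; with a very small τ the mass concentrates on the greedy action.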
Multi–Arm Bandits (MABs)
A multi–armed bandit is a tuple ⟨A, R⟩
A is a set of N possible actions (one per machine = arm)
R(r|a) is an unknown probability distribution of rewards given the chosen action
At each time step t the agent selects an action a_t ∈ A
The environment generates a reward r_t ∼ R(·|a_t)
The goal is to maximize the cumulative reward: Σ_{t=1}^T r_t
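A minimal environment matching this definition, assuming Gaussian reward distributions (the class name, arm means, and pull() interface are illustrative):

```python
import random

class GaussianBandit:
    """N-armed bandit: pulling arm a yields a reward r ~ N(means[a], sigma)."""
    def __init__(self, means, sigma=1.0, seed=0):
        self.means = list(means)
        self.sigma = sigma
        self.rng = random.Random(seed)

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, a):
        # the environment generates r_t ~ R(·|a_t)
        return self.rng.gauss(self.means[a], self.sigma)
```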
Regret
The count N_t(a) is the expected number of selections of action a
The gap ∆_a is the difference in value between action a and the optimal action a*: ∆_a = V* − Q(a)
Regret is a function of the gaps and the counts:

    L_t = E[ Σ_{t=1}^T (V* − Q(a_t)) ]
        = Σ_{a∈A} E[N_t(a)] (V* − Q(a))
        = Σ_{a∈A} E[N_t(a)] ∆_a
We consider algorithms that estimate Q̂_t(a) ≈ Q(a)
Estimate the value of each action by Monte–Carlo evaluation:

    Q̂_t(a) = (1/N_t(a)) Σ_{t=1}^T r_t 1(a_t = a)
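The Monte–Carlo estimate above can be maintained incrementally, which is equivalent to the batch sample mean:

```python
class RunningQ:
    """Incremental form of the Monte-Carlo estimate Q̂_t(a): per-arm sample means."""
    def __init__(self, n_arms):
        self.N = [0] * n_arms      # N_t(a): selection counts
        self.Q = [0.0] * n_arms    # Q̂_t(a): sample means

    def update(self, a, r):
        self.N[a] += 1
        self.Q[a] += (r - self.Q[a]) / self.N[a]   # incremental mean update
```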
The ε–greedy algorithm continues to explore forever
With probability 1 − ε select a = arg max_{a∈A} Q̂(a)
With probability ε select a random action
N = 10 possible actions
Q(a) are chosen randomly from a normal distribution N(0, 1)
Rewards r_t are also normal: N(Q(a_t), 1)
1000 plays
Results averaged over 2000 trials
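A single run of this testbed can be sketched as follows (one run only; the slides average 2000 such runs, and plotting is omitted):

```python
import random

def run_testbed(n_arms=10, steps=1000, epsilon=0.1, seed=0):
    """One ε-greedy run of the 10-armed testbed: Q(a) ~ N(0,1), r_t ~ N(Q(a_t), 1).

    Returns the average reward per step.
    """
    rng = random.Random(seed)
    true_q = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]   # hidden action values
    counts = [0] * n_arms
    est_q = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(n_arms)                       # explore
        else:
            a = max(range(n_arms), key=lambda i: est_q[i])  # exploit
        r = rng.gauss(true_q[a], 1.0)
        counts[a] += 1
        est_q[a] += (r - est_q[a]) / counts[a]              # running mean
        total += r
    return total / steps
```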
Optimistic Initial Values
All methods depend on Q_0(a), i.e., they are biased
Encourage exploration: initialize action values optimistically
Decaying ε_t–Greedy Algorithm
Frequentist formulation
(Q(a_1), Q(a_2), . . .) are unknown deterministic parameters
Policy: choose an arm based on the observation history
Total reward over a finite horizon of length T
Bandit and MDP
Multi–armed bandit as a class of MDP
N independent arms with fully observable states [Z_1(t), . . . , Z_N(t)]
One arm is activated at each time
The active arm changes state (known Markov process) and offers reward R_i(Z_i(t))
Passive arms are frozen and generate no reward
Why is sampling stochastic processes with unknown distributions an MDP?
The state of each arm is the posterior distribution f_i(t) (information state)
For an active arm, f_i(t + 1) is updated from f_i(t) and the new observation
For a passive arm, f_i(t + 1) = f_i(t)
Solving a MAB using DP: exponential complexity w.r.t. N
Gittins Index
The index structure of the optimal policy [Gittins ’74]
Assign each state of each arm a priority index
Activate the arm with the highest current index value
Complexity
Arms are decoupled (one N–dimensional problem becomes N separate 1–dimensional problems)
Linear complexity in N
Forward induction
Polynomial (cubic) in the state–space size of a single arm
Computational cost is too high for real–time use
Approximations of the Gittins index
Thompson sampling
Incomplete learning
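Thompson sampling, listed above as a tractable approximation, can be sketched for Bernoulli arms (the Beta(1,1) prior is an illustrative choice):

```python
import random

def thompson_step(successes, failures, rng=random):
    """One round of Thompson sampling for Bernoulli arms with Beta(1,1) priors.

    Draw one sample from each arm's Beta posterior and play the arm
    whose sample is largest.
    """
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])
```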
Optimism in Face of Uncertainty
The performance of any algorithm is determined by the similarity between the optimal arm and the other arms
Hard problems have similar–looking arms with different means
This is formally described by the gap ∆_a and the similarity in distributions KL(R(·|a) || R(·|a*))

Theorem (Lai and Robbins)
Asymptotic total regret is at least logarithmic in the number of steps:

    lim_{t→∞} L_t ≥ log t · Σ_{a | ∆_a > 0} ∆_a / KL(R(·|a) || R(·|a*))
Upper Confidence Bounds
Estimate an upper confidence Û_t(a) for each action value
such that Q(a) ≤ Q̂_t(a) + Û_t(a) with high probability
This depends on the number of times N_t(a) that action a has been selected:
Small N_t(a) ⇒ large Û_t(a) (estimated value is uncertain)
Large N_t(a) ⇒ small Û_t(a) (estimated value is accurate)
Select the action maximizing the Upper Confidence Bound (UCB):

    a_t = arg max_{a∈A} [ Q̂_t(a) + Û_t(a) ]
Hoeffding’s Inequality

Theorem (Hoeffding’s Inequality)
Let X_1, . . . , X_t be i.i.d. random variables in [0, 1], and let
X̄_t = (1/t) Σ_{τ=1}^t X_τ be the sample mean. Then

    P[ E[X] > X̄_t + u ] ≤ e^{−2 t u²}
Pick a probability p that the true value exceeds the UCB
Now solve for U_t(a):

    e^{−2 N_t(a) U_t(a)²} = p

    U_t(a) = √( −log p / (2 N_t(a)) )
This leads to the UCB1 algorithm:

    a_t = arg max_{a∈A} [ Q̂(a) + √(2 log t / N_t(a)) ]

Theorem
At time T, the expected total regret of the UCB policy is at most

    E[L_T] ≤ 8 log T · Σ_{a | ∆_a > 0} 1/∆_a + (1 + π²/3) · Σ_{a∈A} ∆_a
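The UCB1 rule can be sketched as follows (playing each untried arm once stands in for the infinite bound at N_t(a) = 0):

```python
import math

def ucb1_select(Q, N, t):
    """UCB1: play each arm once, then pick arg max_a Q[a] + sqrt(2 log t / N[a])."""
    for a, n in enumerate(N):
        if n == 0:
            return a  # untried arm: its confidence bound is unbounded
    return max(range(len(Q)),
               key=lambda a: Q[a] + math.sqrt(2.0 * math.log(t) / N[a]))
```

Note how a rarely tried arm can win on its exploration bonus even with a lower estimated value.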
Example: UCB vs ε–Greedy on a 10–armed Bandit
Other Upper Confidence Bounds
Upper confidence bounds can be applied to other inequalities:
Bernstein’s inequality
Empirical Bernstein’s inequality
Chernoff inequality
Azuma’s inequality
...
The Adversarial Bandit Setting
N arms
At each round t = 1, . . . , T
The learner chooses I_t ∈ {1, . . . , N}
At the same time the adversary selects a reward vector r_t = (r_{1,t}, . . . , r_{N,t}) ∈ [0, 1]^N
The learner receives the reward r_{I_t,t}, while the rewards of the other arms are not observed
Variation on Softmax: EXP3.P
It is possible to drive regret down by annealing τ
EXP3.P: Exponential–weight algorithm for exploration and exploitation
The probability of choosing arm a at time t is

    π_t(a) = (1 − β) w_t(a) / Σ_{a'∈A} w_t(a') + β/|A|

    w_{t+1}(a) = w_t(a) e^{η r_t(a)/π_t(a)}   if arm a is pulled at t
                 w_t(a)                       otherwise

η > 0 and β > 0 are the parameters of the algorithm
Regret: E[L_T] ≤ O(√(T |A| log |A|))
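A minimal sketch of the exponential-weight update above (the parameter values and class interface are illustrative; the full EXP3.P adds confidence terms omitted here):

```python
import math
import random

class Exp3:
    """Exponential-weight algorithm for adversarial bandits (EXP3-style sketch)."""
    def __init__(self, n_arms, eta=0.1, beta=0.1, seed=0):
        self.w = [1.0] * n_arms
        self.eta = eta
        self.beta = beta
        self.rng = random.Random(seed)

    def probs(self):
        # π(a) = (1 - β) w(a) / Σ w(a') + β / |A|
        z = sum(self.w)
        k = len(self.w)
        return [(1.0 - self.beta) * wi / z + self.beta / k for wi in self.w]

    def draw(self):
        p = self.probs()
        return self.rng.choices(range(len(p)), weights=p)[0]

    def update(self, a, r):
        # importance-weighted exponential update, only for the pulled arm
        self.w[a] *= math.exp(self.eta * r / self.probs()[a])
```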
MAB with Infinitely Many Arms
Unstructured set of actions
UCB Arm–Increasing Rule
Structured set of actions
Linear Bandits
Lipschitz Bandits
Unimodal Bandits
Bandits in trees
The Contextual Bandit Setting
At each round t = 1, . . . , T
The world produces some context s ∈ S
The learner chooses I_t ∈ {1, . . . , N}
The world reacts with reward R_{I_t,t}
There are no dynamics
Learn a good policy (low regret) for choosing actions given the context
Ideas
Run a different MAB for each distinct context
Perform generalization over contexts
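The first idea, one MAB per distinct context, can be sketched with ε–greedy sub-bandits (the context keys, ε, and update rule are illustrative):

```python
import random
from collections import defaultdict

class PerContextBandit:
    """An independent ε-greedy bandit for each distinct context."""
    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.N = defaultdict(lambda: [0] * n_arms)    # counts per context
        self.Q = defaultdict(lambda: [0.0] * n_arms)  # value estimates per context

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)
        q = self.Q[context]
        return max(range(self.n_arms), key=lambda a: q[a])

    def update(self, context, a, r):
        n = self.N[context]
        q = self.Q[context]
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # running mean per (context, arm)
```

This scales poorly with many contexts, which is why the second idea (generalizing over contexts) matters in practice.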
Exploration–Exploitation in MDPs
Multi–armed bandits are one–step decision–making problems
MDPs represent sequential decision–making problems
What exploration–exploitation approaches can be used in MDPs?
How can the efficiency of an RL algorithm be measured in formal terms?
Sample Complexity
Let V_t^A = E[ Σ_{τ=0}^∞ γ^τ r_{t+τ} | s_0, a_0, r_1, . . . , s_{t−1}, a_{t−1}, r_t, s_t ]

Definition
An algorithm with sample complexity that is polynomial in 1/ε, log(1/δ), N, K, 1/(1 − γ), R_max is called PAC–MDP (probably approximately correct in MDPs)
Exploration Strategies
PAC–MDP Efficiency
Other approaches with exponential complexity:
variance minimization of action–values
optimistic value initialization
state bonuses: frequency, recency, error, . . .
Examples of PAC–MDP efficient approaches:
model–based: E³, R–MAX, MBIE
model–free: Delayed Q–learning
Bayesian RL: optimal exploration strategy, but only tractable for very simple problems
Explicit–Exploit–or–Explore (E³) Algorithm
Kearns and Singh, 2002
Model–based approach with polynomial sample complexity (PAC–MDP)
uses optimism in the face of uncertainty
assumes knowledge of the maximum reward
Maintains counts for states and actions to quantify confidence in the model estimates
A state s is known if all actions in s have been executed sufficiently often
From the observed data, E³ constructs two MDPs:
MDP_known: includes the known states with (approximately exact) estimates of P and R (drives exploitation)
MDP_unknown: MDP_known without rewards + a special state s0 where the agent receives the maximum reward (drives exploration)
E³ Sketch
Input: state s
Output: action a

if s is known then
    Plan in MDP_known
    if the resulting plan has value above some threshold then
        return first action of the plan
    else
        Plan in MDP_unknown
        return first action of the plan
    end if
else
    return the action with the fewest observations in s
end if
E³ Example
R–MAX
Brafman and Tennenholtz, 2002
Similar to E³: implicit instead of explicit exploration
Based on the reward function

    R^{R–MAX}(s, a) = R(s, a)   if c(s, a) ≥ m   ((s, a) known)
                      R_max     if c(s, a) < m   ((s, a) unknown)

Sample Complexity: O( |S|² |A| / (ε³ (1 − γ)⁶) )
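The R–MAX reward model can be sketched as follows (the dict-based counts and argument names are illustrative):

```python
def rmax_reward(counts, r_hat, s, a, m, r_max):
    """Optimistic R-MAX reward model.

    Return the empirical estimate r_hat[(s, a)] once (s, a) is 'known'
    (tried at least m times); otherwise return the optimistic value r_max,
    which drives the agent to explore under-sampled state-action pairs.
    """
    if counts.get((s, a), 0) >= m:
        return r_hat[(s, a)]
    return r_max
```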
Limitations of E³ and R–MAX
Optimistic initialization
Greedy action selection
Q–learning updates in batches (no learning rate)
Updates only if values significantly decrease
Sample complexity: O( |S| |A| / (ε⁴ (1 − γ)⁸) )
Bayesian RL
The agent maintains a distribution (belief) b(m) over MDP models m
typically, the MDP structure is fixed and the belief is over the parameters
the belief is updated after each observation (s, a, r, s′): b → b′
only tractable for very simple problems
Bayes–optimal policy π* = arg max_π V^π(b, s)
no other policy leads to more reward in expectation w.r.t. the prior distribution over MDPs
solves the exploration–exploitation tradeoff implicitly: minimizes uncertainty about the parameters, while exploiting where it is certain
is not PAC–MDP efficient!