
Reinforcement Learning

Exploration vs Exploitation

Marcello Restelli

March–April, 2015
Exploration vs Exploitation Dilemma

Online decision making involves a fundamental choice:

Exploitation: make the best decision given current information
Exploration: gather more information

The best long–term strategy may involve short–term sacrifices
Gather enough information to make the best overall decisions
Examples

Restaurant Selection
Exploitation: Go to your favorite restaurant
Exploration: Try a new restaurant

Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert

Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location

Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move

Clinical Trial
Exploitation: Choose the best treatment so far
Exploration: Try a new treatment
Common Approaches in RL

ε–Greedy

    a_t = \begin{cases} a_t^* & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}

Softmax
Bias exploration towards promising actions
Softmax action selection methods grade action probabilities by estimated values
The most common softmax uses a Gibbs (or Boltzmann) distribution:

    \pi(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in A} e^{Q(s,a')/\tau}}

τ is a “computational” temperature:
τ → ∞: π(a|s) = 1/|A| (uniform random)
τ → 0: greedy action selection
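To make the two rules concrete, here is a minimal Python sketch (illustrative only; `q_values` is assumed to be an array of current action–value estimates):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, tau):
    """Sample an action from the Gibbs/Boltzmann distribution with temperature tau."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```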
Multi–Arm Bandits (MABs)

A multi–armed bandit is a tuple ⟨A, R⟩
A is a set of N possible actions (one per machine = arm)
R(r|a) is an unknown probability distribution of rewards given the chosen action
At each time step t the agent selects an action a_t ∈ A
The environment generates a reward r_t ∼ R(·|a_t)
The goal is to maximize the cumulative reward \sum_{t=1}^{T} r_t
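For concreteness, a small Gaussian bandit environment in the spirit of the 10–armed testbed described later in these slides; the class name and interface are illustrative assumptions:

```python
import numpy as np

class GaussianBandit:
    """N-armed bandit: arm means Q(a) drawn from N(0, 1), rewards r_t ~ N(Q(a_t), 1)."""
    def __init__(self, n_arms=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.true_q = self.rng.normal(0.0, 1.0, size=n_arms)   # unknown to the agent

    def pull(self, action):
        return self.rng.normal(self.true_q[action], 1.0)
```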
Regret

The action–value is the mean reward for action a:

    Q(a) = \mathbb{E}[r \mid a]

The optimal value V^* is

    V^* = Q(a^*) = \max_{a \in A} Q(a)

The regret is the opportunity loss for one step:

    l_t = \mathbb{E}[V^* - Q(a_t)]

The total regret is the total opportunity loss:

    L_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(V^* - Q(a_t)\right)\right]

Maximize cumulative reward ≡ minimize total regret
Counting Regret

The count N_t(a) is the expected number of selections of action a
The gap ∆_a is the difference in value between action a and the optimal action a^*: ∆_a = V^* − Q(a)
Regret is a function of the gaps and the counts:

    L_t = \mathbb{E}\left[\sum_{t=1}^{T} \left(V^* - Q(a_t)\right)\right]
        = \sum_{a \in A} \mathbb{E}[N_t(a)] \left(V^* - Q(a)\right)
        = \sum_{a \in A} \mathbb{E}[N_t(a)] \, \Delta_a

A good algorithm ensures small counts for large gaps
Problem: the gaps are not known!
Greedy Algorithm

We consider algorithms that estimate Q̂_t(a) ≈ Q(a)
Estimate the value of each action by Monte–Carlo evaluation:

    \hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{t=1}^{T} r_t \, \mathbf{1}(a_t = a)

The greedy algorithm selects the action with the highest estimated value:

    a_t^* = \arg\max_{a \in A} \hat{Q}_t(a)

Greedy can lock onto a suboptimal action forever
⇒ Greedy has linear total regret
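A sketch of the greedy agent with incremental Monte–Carlo estimates (the incremental update is an equivalent rewriting of the batch average above, not something stated on the slide):

```python
import numpy as np

class GreedyAgent:
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)        # N_t(a)
        self.q_hat = np.zeros(n_arms)         # Q̂_t(a)

    def act(self):
        return int(np.argmax(self.q_hat))     # always exploit, never explore

    def update(self, action, reward):
        self.counts[action] += 1
        # incremental sample mean, equivalent to the Monte-Carlo average
        self.q_hat[action] += (reward - self.q_hat[action]) / self.counts[action]
```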
ε–Greedy Algorithm

The ε–greedy algorithm continues to explore forever
With probability 1 − ε select a = arg max_{a∈A} Q̂(a)
With probability ε select a random action
A constant ε guarantees a minimum regret at every step:

    l_t \geq \frac{\epsilon}{|A|} \sum_{a \in A} \Delta_a

⇒ ε–greedy has linear total regret
ε–Greedy on the 10–Armed Testbed

N = 10 possible actions
The Q(a) are chosen randomly from a normal distribution N(0, 1)
Rewards r_t are also normal: N(Q(a_t), 1)
1000 plays
Results averaged over 2000 trials
Optimistic Initial Values

All methods depend on Q_0(a), i.e., they are biased by the initial estimates
Encourage exploration: initialize the action values optimistically
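A one-line illustration of the idea, assuming the rewards are known to be bounded above by some value `q_max` (a hypothetical bound):

```python
import numpy as np

n_arms, q_max = 10, 5.0
q_hat = np.full(n_arms, q_max)   # optimistic initial estimates: every arm looks worth trying
```

Until an arm has been sampled enough to pull its estimate down, the greedy rule keeps trying it, which yields exploration without explicit randomness.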
Decaying ε_t–Greedy Algorithm

Pick a decay schedule for ε_1, ε_2, . . .
Consider the following schedule:

    c > 0
    d = \min_{a \mid \Delta_a > 0} \Delta_a
    \epsilon_t = \min\left\{1, \frac{c |A|}{d^2 t}\right\}

Decaying ε_t–greedy has logarithmic asymptotic total regret!
Unfortunately, this schedule requires advance knowledge of the gaps
Goal: find an algorithm with sublinear regret for any multi–armed bandit (without knowledge of R)
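A direct transcription of this schedule (purely illustrative; in practice `d` is unavailable because the gaps are unknown):

```python
def epsilon_schedule(t, c, d, n_arms):
    """epsilon_t = min(1, c*|A| / (d^2 * t)) for t >= 1."""
    return min(1.0, c * n_arms / (d ** 2 * t))
```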
Two Formulations

Bayesian formulation
(Q(a_1), Q(a_2), . . . ) are random variables with prior distribution (f_1, f_2, . . . )
Policy: choose an arm based on the priors (f_1, f_2, . . . ) and the observation history
Total discounted reward over an infinite horizon:

    V^{\pi}(f_1, f_2, \ldots) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \beta^t R^{\pi}(t) \,\middle|\, f_1, f_2, \ldots\right], \qquad 0 < \beta < 1

Frequentist formulation
(Q(a_1), Q(a_2), . . . ) are unknown deterministic parameters
Policy: choose an arm based on the observation history
Total reward over a finite horizon of length T
Bandit and MDP

Multi–armed bandit as a class of MDP
N independent arms with fully observable states [Z_1(t), . . . , Z_N(t)]
One arm is activated at each time step
The active arm changes state (according to a known Markov process) and offers reward R_i(Z_i(t))
Passive arms are frozen and generate no reward

Why is sampling stochastic processes with unknown distributions an MDP?
The state of each arm is the posterior distribution f_i(t) (information state)
For an active arm, f_i(t + 1) is updated from f_i(t) and the new observation
For a passive arm, f_i(t + 1) = f_i(t)
Solving a MAB using DP has exponential complexity w.r.t. N
Gittins Index

The index structure of the optimal policy [Gittins ’74]
Assign each state of each arm a priority index
Activate the arm with the highest current index value

Complexity
Arms are decoupled (one N-dimensional problem becomes N separate 1-dimensional problems)
Linear complexity in N
Forward induction
Polynomial (cubic) complexity in the state space size of a single arm
Computational cost is too high for real-time use

Approximations of the Gittins index
Thompson sampling
Incomplete learning
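Thompson sampling is only mentioned by name here; a minimal Beta–Bernoulli sketch of the idea (the conjugate Beta prior and 0/1 rewards are assumptions for illustration, not from the slides):

```python
import numpy as np

def thompson_step(successes, failures, rng=np.random.default_rng()):
    """Sample one value per arm from its Beta posterior and play the arm with the largest sample."""
    samples = rng.beta(np.asarray(successes) + 1.0, np.asarray(failures) + 1.0)
    return int(np.argmax(samples))
```

After observing the 0/1 reward of the chosen arm, increment its success or failure count; the posterior then concentrates and exploration fades automatically.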
Optimism in the Face of Uncertainty

The more uncertain we are about an action–value,
the more important it is to explore that action:
it could turn out to be the best action
Lower Bound

The performance of any algorithm is determined by the similarity between the optimal arm and the other arms
Hard problems have similar–looking arms with different means
This is formally described by the gap ∆_a and the similarity in distributions KL(R(·|a) ‖ R(·|a^*))

Theorem (Lai and Robbins)
Asymptotic total regret is at least logarithmic in the number of steps:

    \lim_{t \to \infty} L_t \geq \log t \sum_{a \mid \Delta_a > 0} \frac{\Delta_a}{KL(R(\cdot \mid a) \,\|\, R(\cdot \mid a^*))}
Upper Confidence Bounds

Estimate an upper confidence Û_t(a) for each action value
Such that Q(a) ≤ Q̂_t(a) + Û_t(a) with high probability
This depends on the number of times N_t(a) that action a has been selected
Small N_t(a) ⇒ large Û_t(a) (estimated value is uncertain)
Large N_t(a) ⇒ small Û_t(a) (estimated value is accurate)
Select the action maximizing the Upper Confidence Bound (UCB):

    a_t = \arg\max_{a \in A} \left( \hat{Q}_t(a) + \hat{U}_t(a) \right)
Hoeffding’s Inequality

Theorem (Hoeffding’s Inequality)
Let X_1, . . . , X_t be i.i.d. random variables in [0, 1], and let \bar{X}_t = \frac{1}{t} \sum_{\tau=1}^{t} X_\tau be the sample mean. Then

    \mathbb{P}\left[\mathbb{E}[X] > \bar{X}_t + u\right] \leq e^{-2 t u^2}

We will apply Hoeffding’s Inequality to the rewards of the bandit, conditioned on selecting action a:

    \mathbb{P}\left[Q(a) > \hat{Q}_t(a) + U_t(a)\right] \leq e^{-2 N_t(a) U_t(a)^2}
Calculating Upper Confidence Bounds

Pick a probability p that the true value exceeds the UCB
Now solve for U_t(a):

    e^{-2 N_t(a) U_t(a)^2} = p
    U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}

Reduce p as we observe more rewards, e.g., p = t^{-4}
This ensures that we select the optimal action as t → ∞:

    U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}
UCB1

This leads to the UCB1 algorithm:

    a_t = \arg\max_{a \in A} \left( \hat{Q}_t(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right)

Theorem
At time T, the expected total regret of the UCB policy is at most

    \mathbb{E}[L_T] \leq 8 \log T \sum_{a \mid \Delta_a > 0} \frac{1}{\Delta_a} + \left(1 + \frac{\pi^2}{3}\right) \sum_{a \in A} \Delta_a
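A compact UCB1 sketch; initializing by pulling each arm once is a common convention that the slide leaves implicit:

```python
import numpy as np

def ucb1_action(q_hat, counts, t):
    """arg max over a of Q̂_t(a) + sqrt(2 log t / N_t(a)); untried arms are pulled first."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))          # force one pull of every arm first
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(np.asarray(q_hat) + bonus))
```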
Example: UCB vs ε–Greedy on the 10–Armed Bandit

(comparison figure omitted)
Other Upper Confidence Bounds

Upper confidence bounds can be derived from other concentration inequalities:
Bernstein’s inequality
Empirical Bernstein’s inequality
Chernoff’s inequality
Azuma’s inequality
. . .
The Adversarial Bandit Setting

N arms
At each round t = 1, . . . , T
The learner chooses I_t ∈ {1, . . . , N}
At the same time the adversary selects a reward vector r_t = (r_{1,t}, . . . , r_{N,t}) ∈ [0, 1]^N
The learner receives the reward r_{I_t,t}, while the rewards of the other arms are not observed
Variation on Softmax
EXP3.P

It is possible to drive regret down by annealing τ
EXP3.P: Exponential-weight algorithm for exploration and exploitation
The probability of choosing arm a at time t is

    \pi_t(a) = (1 - \beta) \frac{w_t(a)}{\sum_{a' \in A} w_t(a')} + \frac{\beta}{|A|}

with weights updated as

    w_{t+1}(a) = \begin{cases} w_t(a) \, e^{\eta \, r_t(a) / \pi_t(a)} & \text{if arm } a \text{ is pulled at time } t \\ w_t(a) & \text{otherwise} \end{cases}

η > 0 and β > 0 are the parameters of the algorithm

Regret: \mathbb{E}[L_T] \leq O\left(\sqrt{T |A| \log |A|}\right)
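An illustrative sketch of the exponential-weight scheme (this is the basic EXP3 form of the update; EXP3.P additionally adds a confidence term to the reward estimates, which is omitted here):

```python
import numpy as np

def exp3_probs(weights, beta):
    """Mix the normalized weights with uniform exploration beta/|A|."""
    w = np.asarray(weights, dtype=float)
    return (1.0 - beta) * w / w.sum() + beta / len(w)

def exp3_update(weights, probs, arm, reward, eta):
    """Importance-weighted exponential update applied only to the pulled arm."""
    w = np.array(weights, dtype=float)
    w[arm] *= np.exp(eta * reward / probs[arm])
    return w
```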
MAB with Infinitely Many Arms

Unstructured set of actions
UCB Arm-Increasing Rule

Structured set of actions
Linear bandits
Lipschitz bandits
Unimodal bandits
Bandits in trees
The Contextual Bandit Setting

At each round t = 1, . . . , T
The world produces some context s ∈ S
The learner chooses I_t ∈ {1, . . . , N}
The world reacts with reward R_{I_t,t}
There are no dynamics: the context does not evolve as a consequence of the chosen actions
Learn a good policy (low regret) for choosing actions given the context

Ideas (see the sketch below)
Run a different MAB for each distinct context
Perform generalization over contexts
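A sketch of the first idea: keep an independent UCB1 learner per distinct context (the class and its interface are illustrative assumptions):

```python
from collections import defaultdict
import numpy as np

class PerContextUCB1:
    """One independent UCB1 bandit per observed context."""
    def __init__(self, n_arms):
        self.counts = defaultdict(lambda: np.zeros(n_arms))   # N_t(a) per context
        self.q_hat = defaultdict(lambda: np.zeros(n_arms))    # Q̂_t(a) per context
        self.steps = defaultdict(int)

    def act(self, context):
        self.steps[context] += 1
        c, q, t = self.counts[context], self.q_hat[context], self.steps[context]
        if np.any(c == 0):
            return int(np.argmin(c))                          # try every arm once per context
        return int(np.argmax(q + np.sqrt(2.0 * np.log(t) / c)))

    def update(self, context, arm, reward):
        c, q = self.counts[context], self.q_hat[context]
        c[arm] += 1
        q[arm] += (reward - q[arm]) / c[arm]
```

This only pays off when contexts repeat often; the second idea (generalizing over contexts) is what makes large context spaces tractable.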
Exploration–Exploitation in MDPs

Multi–armed bandits are one–step decision–making problems
MDPs represent sequential decision–making problems
What exploration–exploitation approaches can be used in MDPs?
How can we measure the efficiency of an RL algorithm in formal terms?
Sample Complexity

Let M be an MDP with N states, K actions, discount factor γ ∈ [0, 1), and maximal reward R_max > 0
Let A be an algorithm that acts in the environment, producing the experience s_0, a_0, r_1, s_1, a_1, r_2, . . .
Let

    V_t^A = \mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau} \,\middle|\, s_0, a_0, r_1, \ldots, s_{t-1}, a_{t-1}, r_t, s_t\right]

Let V^* be the value function of the optimal policy

Definition [Kakade, 2003]
Let ε > 0 be a prescribed accuracy and δ > 0 be an allowed probability of failure. The expression η(ε, δ, N, K, γ, R_max) is a sample complexity bound for algorithm A if, independently of the choice of s_0, with probability at least 1 − δ, the number of timesteps such that V_t^A < V^*(s_t) − ε is at most η(ε, δ, N, K, γ, R_max).
Efficient Exploration

Definition
An algorithm whose sample complexity is polynomial in 1/ε, log(1/δ), N, K, 1/(1 − γ), and R_max is called PAC-MDP (probably approximately correct in MDPs)
Exploration Strategies
PAC–MDP Efficiency

ε–greedy and Boltzmann exploration are not PAC-MDP efficient: their sample complexity is exponential in the number of states
Other approaches with exponential sample complexity:
variance minimization of action–values
optimistic value initialization
state bonuses: frequency, recency, error, . . .
Examples of PAC–MDP efficient approaches:
model–based: E^3, R–MAX, MBIE
model–free: Delayed Q–learning
Bayesian RL: optimal exploration strategy, but only tractable for very simple problems
Explicit–Exploit–or–Explore (E^3) Algorithm
Kearns and Singh, 2002

Model–based approach with polynomial sample complexity (PAC–MDP)
uses optimism in the face of uncertainty
assumes knowledge of the maximum reward
Maintains counts for states and actions to quantify confidence in the model estimates
A state s is known if all actions in s have been executed sufficiently often
From the observed data, E^3 constructs two MDPs:
MDP_known: includes the known states with (approximately exact) estimates of P and R (drives exploitation)
MDP_unknown: MDP_known without reward, plus a special state s_0 where the agent receives the maximum reward (drives exploration)
E^3 Sketch

Input: state s
Output: action a

if s is known then
    plan in MDP_known
    if the resulting plan has value above some threshold then
        return the first action of the plan
    else
        plan in MDP_unknown
        return the first action of the plan
    end if
else
    return the action with the fewest observations in s
end if
E^3 Example

(illustrative figures omitted)
R–MAX
Brafman and Tennenholtz, 2002

Similar to E^3, but with implicit instead of explicit exploration
Based on the optimistic reward function

    R^{R\text{-}MAX}(s, a) = \begin{cases} R(s, a) & \text{if } c(s, a) \geq m \quad ((s, a) \text{ known}) \\ R_{max} & \text{if } c(s, a) < m \quad ((s, a) \text{ unknown}) \end{cases}

Sample complexity: O\left(\frac{|S|^2 |A|}{\epsilon^3 (1 - \gamma)^6}\right)
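A sketch of this optimistic reward used during planning (the names `counts`, `r_hat`, and the threshold `m` are illustrative):

```python
def rmax_reward(s, a, counts, r_hat, r_max, m):
    """Empirical reward estimate for known (s, a) pairs, the optimistic R_max otherwise."""
    if counts[(s, a)] >= m:     # (s, a) has been visited often enough: known
        return r_hat[(s, a)]
    return r_max                # unknown: optimism in the face of uncertainty
```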
Limitations of E^3 and R–MAX

E^3 and R–MAX are called “efficient” because their sample complexity scales only polynomially in the number of states
In natural environments, however, the number of states is enormous: it is exponential in the number of state variables
Hence E^3 and R–MAX scale exponentially in the number of state variables
Generalization over states and actions is crucial for exploration
Relational Reinforcement Learning: tries to model the structure of environments
KWIK-R–MAX: an extension of R–MAX that is polynomial in the number of parameters used to approximate the state transition model
Delayed Q–learning

Optimistic initialization
Greedy action selection
Q–learning updates in batches (no learning rate)
Updates only if values significantly decrease
Sample complexity: O\left(\frac{|S| |A|}{\epsilon^4 (1 - \gamma)^8}\right)
Bayesian RL

The agent maintains a distribution (belief) b(m) over MDP models m
typically, the MDP structure is fixed and the belief is over its parameters
the belief is updated after each observation (s, a, r, s'): b → b'
only tractable for very simple problems
Bayes–optimal policy: π^* = arg max_π V^π(b, s)
no other policy leads to more reward in expectation w.r.t. the prior distribution over MDPs
solves the exploration–exploitation tradeoff implicitly: minimizes uncertainty about the parameters, while exploiting where it is certain
is not PAC-MDP efficient!
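For intuition only, a minimal sketch of a belief over transition probabilities kept as Dirichlet pseudo-counts (the parameterization is an assumption for illustration; the slides do not commit to one):

```python
from collections import defaultdict

class DirichletTransitionBelief:
    """Belief b over P(s'|s,a), stored as Dirichlet pseudo-counts, updated per observation."""
    def __init__(self, prior=1.0):
        self.prior = prior
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1.0      # an observation (s, a, r, s') maps b to b'

    def posterior_mean(self, s, a):
        row = self.counts[(s, a)]
        total = sum(row.values()) + self.prior * len(row)
        return {s2: (c + self.prior) / total for s2, c in row.items()}
```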
