
Exploration-Exploitation in Reinforcement Learning (Deep RL)


Alessandro Lazaric
Facebook AI Research
The Four Ingredients Recipe

1 Build accurate estimators


2 Evaluate the uncertainty of the prediction
3 Define a mechanism to combine estimation and uncertainty
4 Execute the best policy

The Four Ingredients Recipe

Optimism in the face of uncertainty


1 Build accurate estimators
  $\widehat{M}_t \;\Rightarrow\; g^{\pi}_{\widehat{M}_t}$

2 Evaluate the uncertainty of the estimators
  $B^r_t(s,a) := \big[\,\hat r_t(s,a) - \beta^r_t(s,a),\; \hat r_t(s,a) + \beta^r_t(s,a)\,\big]$
  $B^p_t(s,a) := \big\{\, p(\cdot|s,a) \in \Delta(S) : \|p(\cdot|s,a) - \hat p_t(\cdot|s,a)\|_1 \le \beta^p_t(s,a) \,\big\}$

3 Define a mechanism to combine estimation and uncertainty
  $\pi_t = \arg\max_{\pi}\; \max_{M \in \mathcal{M}_t}\; g^{\pi}_{M}$

The Four Ingredients Recipe

Posterior Sampling
1 Build accurate estimators
2 Evaluate the uncertainty of the estimators
  $\forall\,\Theta, \quad \mathbb{P}(M^* \in \Theta \mid H_t, \mu_1) = \mu_t(\Theta)$, with $\mu_t$ updated using Bayes' rule

3 Define a mechanism to combine estimation and uncertainty
  $\widetilde{M}_t \sim \mu_t, \qquad \pi_t = \arg\max_{\pi}\; g^{\pi}_{\widetilde{M}_t}$

“Practical” Limitations
Optimism in the face of uncertainty
Confidence intervals
  $\beta^r_t(s,a) \propto \sqrt{\frac{\log\big(N_t(s,a)/\delta\big)}{N_t(s,a)}} \qquad\qquad \beta^p_t(s,a) \propto \sqrt{\frac{S\,\log\big(N_t(s,a)/\delta\big)}{N_t(s,a)}}$
Solving
  $\pi_t = \arg\max_{\pi}\; \max_{M \in \mathcal{M}_t}\; g^{\pi}_{M}$

Posterior sampling
Posterior over the dynamics for each state-action pair
  $\mathrm{Dirichlet}\big(N_t(s'_1|s,a),\, N_t(s'_2|s,a),\, \ldots,\, N_t(s'_S|s,a)\big)$
Updating/sampling from unstructured or non-conjugate posteriors

1 Optimistic Exploration in Deep RL

2 Random Exploration in Deep RL

3 Summary

Count-based Exploration
General Scheme

1 Estimate a "proxy" for the number of visits $\widetilde N(s_t)$

2 Add an exploration bonus to the rewards
  $\tilde r^+_t = r_t + c\,\sqrt{\dfrac{1}{\widetilde N(s_t)}}$

3 Run any DeepRL algorithm on $(s_t, a_t, \tilde r^+_t, s_{t+1})$

What happened to optimism in the dynamics?
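
A minimal Python sketch of steps 1-2 above (not from the slides): visits are counted through a generic hash of the state; the bonus scale c and the hash function are illustrative assumptions.

import math
from collections import defaultdict

class CountBonus:
    def __init__(self, c=1.0, hash_fn=hash):
        self.c = c                       # bonus scale (hyper-parameter)
        self.hash_fn = hash_fn           # proxy: map raw states to discrete keys
        self.counts = defaultdict(int)   # pseudo-counts N~(s)

    def shaped_reward(self, state, reward):
        key = self.hash_fn(state)
        self.counts[key] += 1
        # r~+_t = r_t + c * sqrt(1 / N~(s_t))
        return reward + self.c * math.sqrt(1.0 / self.counts[key])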

Extended Value Iteration

$v_{n+1}(s) = L_t v_n(s) = \max_{(a,r,p)\,\in\, A(s)\times B^r_t(s,a)\times B^p_t(s,a)} \big\{\, r + p^{\mathsf T} v_n \,\big\}$
$\qquad = \max_{a\in A(s)} \Big[\, \max_{r\in B^r_t(s,a)} r \;+\; \max_{p\in B^p_t(s,a)} p^{\mathsf T} v_n \,\Big]$
$\qquad = \max_{a\in A(s)} \Big[\, \hat r_t(s,a) + \beta^r_t(s,a) \;+\; \max_{p\in B^p_t(s,a)} p^{\mathsf T} v_n \,\Big]$
$\qquad \le \max_{a\in A(s)} \Big\{\, \hat r_t(s,a) + \beta^r_t(s,a) + \|p - \hat p_t(\cdot|s,a)\|_1\,\|v_n\|_\infty + \hat p_t^{\mathsf T} v_n \,\Big\}$
$\qquad \le \max_{a\in A(s)} \Big\{\, \hat r_t(s,a) + \beta^r_t(s,a) + C\sqrt{S}\,\beta^r_t(s,a) + \hat p_t^{\mathsf T} v_n \,\Big\}$

(+) Exploration bonus $(1 + C\sqrt{S})\,\beta^r_t(s,a)$ for the reward
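
A minimal Python sketch of the resulting bonus-augmented value-iteration step for a tabular MDP; the array shapes and the constant playing the role of $(1 + C\sqrt{S})$ are illustrative assumptions.

import numpy as np

def optimistic_vi_step(v, r_hat, beta_r, p_hat, bonus_scale):
    """One application of the (upper-bounded) optimistic operator L_t.
    r_hat, beta_r: (S, A) arrays; p_hat: (S, A, S) array; v: (S,) array.
    bonus_scale plays the role of (1 + C*sqrt(S)) in the last line above."""
    q = r_hat + bonus_scale * beta_r + p_hat @ v   # optimistic Q(s, a)
    return q.max(axis=1)                           # greedy over actions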

Count-based Exploration
Tang et al. [2017]

Use locality-sensitive hashing to discretize the input


• Encode the state into a k-dim vector by random projection
• Use the sign to discretize
Count on the discrete hashed states
(−) Difficult to define a good hashing function
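
A minimal SimHash-style counting sketch along the lines described above (random projection, then sign); the projection dimension k and the flattening of the state are assumptions.

import numpy as np
from collections import defaultdict

class SimHashCounter:
    def __init__(self, state_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # fixed random projection
        self.counts = defaultdict(int)

    def count(self, state):
        # phi(s) = sign(A s): a k-bit code; count visits per code
        code = tuple((self.A @ np.asarray(state, dtype=float).ravel() > 0).astype(int))
        self.counts[code] += 1
        return self.counts[code]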

Count-based Exploration
Tang et al. [2017]

Entropy loss for the auto-encoder


“Binarization” loss for the “projection”

Count-based Exploration
Tang et al. [2017]

Use all past history to update the AE


AE should not be updated too often

Count-based Exploration
Bellemare et al. [2016], Ostrovski et al. [2017]

Density estimation over a countable set X


 
$\rho_n(x) = \rho(x \mid x_1, \ldots, x_n) \approx \mathbb{P}\big(X = x \mid x_1, \ldots, x_n\big)$

Recording probability
  $\rho'_n(x) = \rho(x \mid x_1, \ldots, x_n, x) \approx \mathbb{P}\big(X = x \mid x_1, \ldots, x_n, X_{n+1} = x\big)$

Pseudo "local" and "total" counts $\widetilde N_n(x)$ and $\tilde n$ such that
  $\dfrac{\widetilde N_n(x)}{\tilde n} = \rho_n(x), \qquad \dfrac{\widetilde N_n(x) + 1}{\tilde n + 1} = \rho'_n(x) \quad\Rightarrow\quad \widetilde N_n(x) = \dfrac{\rho_n(x)\,\big(1 - \rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}$

(+) Any density estimation algorithm (accurate for images)
(−) Density estimation in continuous spaces is hard
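
A minimal sketch of the pseudo-count formula above; the density_model object with prob()/update() methods is an assumed interface, not the papers' implementation.

def pseudo_count(density_model, x):
    rho = density_model.prob(x)        # rho_n(x): probability before observing x
    density_model.update(x)            # add one more occurrence of x to the model
    rho_prime = density_model.prob(x)  # rho'_n(x): recoding probability
    if rho_prime <= rho:               # guard: the formula assumes rho' > rho
        return 0.0
    return rho * (1.0 - rho_prime) / (rho_prime - rho)   # N~_n(x)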

Count-based Exploration
Bellemare et al. [2016], Ostrovski et al. [2017]

Montezuma!

Prediction-based Exploration
Burda et al. [2018]

Sources of prediction errors


1 Amount of data (+)
2 Stochasticity (e.g., noisy-TV) (−)
3 Model misspecification (−)
4 Learning dynamics (−)

Prediction-based Exploration
Burda et al. [2018]

Randomly initialize two instances of the same NN (target $\theta^*$ and predictor $\theta_0$)
  $f_{\theta^*} : S \to \mathbb{R}; \qquad f_{\theta} : S \to \mathbb{R}$

Train the prediction network to minimize the loss w.r.t. the target network
  $\theta_n = \arg\min_{\theta} \sum_{t=1}^{n} \big(f_{\theta}(s_t) - f_{\theta^*}(s_t)\big)^2$

Build the "intrinsic" reward
  $r^E_t = \big(f_{\theta}(s_t) - f_{\theta^*}(s_t)\big)^2$

(+) No influence from stochastic transitions
(+) No model misspecification ($f_{\theta}$ can exactly predict $f_{\theta^*}$)
(+) Influence of learning dynamics can be reduced
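
A minimal PyTorch sketch of the scheme above (fixed random target, trained predictor); the network sizes, scalar output, and optimizer settings are illustrative assumptions.

import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, state_dim, hidden=64, lr=1e-4):
        super().__init__()
        def make():
            return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.target = make()                   # f_{theta*}: fixed, randomly initialized
        self.predictor = make()                # f_theta: trained to imitate the target
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, s):
        # r^E_t = (f_theta(s_t) - f_{theta*}(s_t))^2, no gradient needed here
        with torch.no_grad():
            return (self.predictor(s) - self.target(s)).pow(2).squeeze(-1)

    def update(self, s):
        # minimize the squared prediction error w.r.t. the fixed target
        loss = (self.predictor(s) - self.target(s)).pow(2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()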

Prediction-based Exploration
Burda et al. [2018]

General architecture
Separate extrinsic reward $r^I_t$ and intrinsic reward $r^E_t$
PPO with two heads to estimate $V^I$ and $V^E$
Greedy policy w.r.t. $V^I + c\,V^E$

“Tricks”
Rewards should be in the same range
Use different discount factors for intrinsic and extrinsic rewards
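
A small sketch of the two "tricks" above, under the assumption that the intrinsic reward is normalized by the standard deviation of past intrinsic rewards; the constants and the class name are illustrative, not the paper's implementation.

import numpy as np

class RewardMixer:
    def __init__(self, c=1.0, gamma_ext=0.999, gamma_int=0.99):
        self.c = c                          # weight of the intrinsic value head
        self.gamma_ext = gamma_ext          # discount used for the extrinsic return
        self.gamma_int = gamma_int          # (different) discount for the intrinsic return
        self.intrinsic_history = []

    def normalize_intrinsic(self, r_int):
        # keep the two rewards in a comparable range
        self.intrinsic_history.append(r_int)
        return r_int / (np.std(self.intrinsic_history) + 1e-8)

    def combine_values(self, v_ext, v_int):
        return v_ext + self.c * v_int       # greedy w.r.t. V^I + c V^E in the slide's notation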

1 Optimistic Exploration in Deep RL

2 Random Exploration in Deep RL

3 Summary

Randomized Exploration
General Scheme

1 Estimate the parameters θ for either the policy or the value function

2 Add randomness to the parameters: $\tilde\theta = \theta + \text{noise}$

3 Run the corresponding (greedy) policy

Note: the randomness needs to represent "uncertainty"
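
A minimal sketch of the scheme above for a linear Q-function; the fixed noise scale sigma is an assumption and, as noted, should ideally reflect the actual parameter uncertainty.

import numpy as np

def randomized_greedy_action(theta, features, sigma=0.1, rng=None):
    """theta: (d, A) weights of a linear Q-function; features: (d,) state features."""
    if rng is None:
        rng = np.random.default_rng()
    theta_tilde = theta + sigma * rng.standard_normal(theta.shape)  # theta~ = theta + noise
    q_values = features @ theta_tilde                               # Q~(s, a) for every action a
    return int(np.argmax(q_values))                                 # act greedily w.r.t. Q~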

Bootstrap DQN
Osband et al. [2016]

Define multiple value functions $Q^k$

Update the functions with different datasets

Share part of the architecture

Bootstrap DQN
Osband et al. [2016]

$M_t$ (the distribution of the masks $m_t$) determines the type of bootstrapping strategy

$g^k_t = m^k_t \,\big(y^Q_t - Q^k(s_t, a_t; \theta)\big)\, \nabla_{\theta} Q^k(s_t, a_t; \theta)$
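
A minimal PyTorch sketch of the masked per-head loss corresponding to the gradient above; Bernoulli masks and the shared-torso/K-heads split are assumptions about the architecture.

import torch

def bootstrapped_loss(q_heads, targets, actions, masks):
    """q_heads: list of K tensors of shape (batch, n_actions); targets y^Q_t: (batch,);
    actions: (batch,) long tensor; masks m_t: (batch, K) tensor in {0, 1}."""
    loss = torch.zeros(())
    for k, q in enumerate(q_heads):
        q_sa = q.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q^k(s_t, a_t; theta)
        td_error = targets - q_sa                             # y^Q_t - Q^k(s_t, a_t; theta)
        loss = loss + (masks[:, k] * td_error.pow(2)).mean()  # head k only sees masked samples
    return loss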




Bootstrap DQN
Osband et al. [2016]

Ensemble DQN: ensemble policy?


Thompson DQN: resample at each step

Noisy Networks
Fortunato et al. [2017]

Normal NN layer: $y = wx + b$
Double the parameters with a mean and a variance: $w \to (\mu^w, \sigma^w)$ and $b \to (\mu^b, \sigma^b)$
Whenever a layer is evaluated, draw $\varepsilon^w, \varepsilon^b \sim D$
Evaluate the "random" layer as
  $y = (\mu^w + \sigma^w \odot \varepsilon^w)\,x + \mu^b + \sigma^b \odot \varepsilon^b$
Let $\zeta = (\mu^w, \sigma^w, \mu^b, \sigma^b)$ and define the expected loss
  $\bar L(\zeta) = \mathbb{E}_{\varepsilon}\big[L(\zeta, \varepsilon)\big]$
Gradient estimation
  $\nabla_{\zeta} \bar L(\zeta) = \mathbb{E}_{\varepsilon}\big[\nabla_{\zeta} L(\zeta, \varepsilon)\big] \approx \frac{1}{n}\sum_{i=1}^{n} \nabla_{\zeta} L(\zeta, \varepsilon_i)$

Noisy Networks
Fortunato et al. [2017]

Noise models
Independent noise $\varepsilon_{i,j}$ for each weight $i$ at layer $j$
Factorized noise $\varepsilon_{i,j} = f(\varepsilon_i)\,f(\varepsilon_j)$ (e.g., $f(x) = \mathrm{sgn}(x)\sqrt{|x|}$)
Independent noise for the target and online networks
  $y_t = r_t + \max_{a'} Q(s'_t, a'; \varepsilon', \zeta^-), \qquad L_t(\zeta, \varepsilon) = \big(y_t - Q(s_t, a_t; \varepsilon, \zeta)\big)^2$
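
A minimal PyTorch sketch of a noisy linear layer with the factorized noise above; the initialization constants follow common practice and are assumptions.

import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_dim, out_dim, sigma0=0.5):
        super().__init__()
        bound = 1.0 / math.sqrt(in_dim)
        self.mu_w = nn.Parameter(torch.empty(out_dim, in_dim).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_dim, in_dim), sigma0 * bound))
        self.mu_b = nn.Parameter(torch.zeros(out_dim))
        self.sigma_b = nn.Parameter(torch.full((out_dim,), sigma0 * bound))
        self.in_dim, self.out_dim = in_dim, out_dim

    @staticmethod
    def _f(x):
        # f(x) = sgn(x) * sqrt(|x|) from the factorized-noise model above
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_in = self._f(torch.randn(self.in_dim, device=x.device))
        eps_out = self._f(torch.randn(self.out_dim, device=x.device))
        eps_w = torch.outer(eps_out, eps_in)      # eps_{i,j} = f(eps_i) f(eps_j)
        w = self.mu_w + self.sigma_w * eps_w      # mu^w + sigma^w (.) eps^w
        b = self.mu_b + self.sigma_b * eps_out    # mu^b + sigma^b (.) eps^b
        return x @ w.t() + b                      # y = (mu^w + sigma^w (.) eps^w) x + ...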

Bayesian DQN
Azizzadenesheli et al. [2018]

Same tools as in linear bandits

1 Bayesian linear regression with a given feature map $\phi(s) \in \mathbb{R}^d$
  and a given target vector $y_a$ for each action
  $\mu_a = (\Phi_a^{\mathsf T} \Phi_a)^{-1} \Phi_a^{\mathsf T} y_a, \qquad \Sigma_a = \Phi_a^{\mathsf T} \Phi_a$

2 Draw a weight vector at random: $w_a \sim \mathcal{N}\big(\mu_a, \Sigma_a^{-1}\big)$

3 Run the corresponding (greedy) policy
  $a_t = \arg\max_{a} w_a^{\mathsf T} \phi(s_t)$

4 Train $\phi$ with a standard NN
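
A minimal sketch of steps 1-3 on top of fixed features φ(s); the ridge term lam and noise variance sigma2 are assumptions added to keep the covariance invertible.

import numpy as np

def thompson_action(Phi, Y, phi_s, lam=1.0, sigma2=1.0, rng=None):
    """Phi: dict action -> (n_a, d) feature matrix; Y: dict action -> (n_a,) targets;
    phi_s: (d,) features of the current state."""
    if rng is None:
        rng = np.random.default_rng()
    actions, q_samples = list(Phi), []
    for a in actions:
        d = Phi[a].shape[1]
        Sigma_a = Phi[a].T @ Phi[a] / sigma2 + lam * np.eye(d)       # (regularized) precision
        mu_a = np.linalg.solve(Sigma_a, Phi[a].T @ Y[a] / sigma2)
        w_a = rng.multivariate_normal(mu_a, np.linalg.inv(Sigma_a))  # w_a ~ N(mu_a, Sigma_a^-1)
        q_samples.append(w_a @ phi_s)
    return actions[int(np.argmax(q_samples))]                        # greedy on the sampled Q-values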

Randomized Prior
Osband et al. [2018]

Computational generation of posterior samples for linear Bayesian regression

Let $f_{\theta}(x) = x^{\mathsf T}\theta$ and $y_i = x_i^{\mathsf T}\theta + \epsilon_i$
Generate a sample of $\theta \mid \mathcal{D}_n$ from the linear Bayesian posterior:
1 Compute $\tilde y_i \sim \mathcal{N}(y_i, \sigma^2)$ and $\tilde\theta \sim \mathcal{N}(\theta_0, \Sigma_0)$ (prior)
2 Compute
  $\arg\min_{\theta}\; \sum_{i=1}^{n} \|\tilde y_i - f_{\theta}(x_i)\|_2^2 + \frac{\sigma^2}{\lambda}\,\|\theta - \tilde\theta\|^2$
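
A minimal sketch of the two steps above for the linear case, using the closed-form solution of the regularized least-squares problem; σ, λ, and the prior parameters are inputs the caller must supply.

import numpy as np

def sample_linear_posterior(X, y, theta0, Sigma0, sigma=1.0, lam=1.0, rng=None):
    """X: (n, d) inputs, y: (n,) targets; returns one approximate sample of theta | D_n."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    y_tilde = y + sigma * rng.standard_normal(n)             # y~_i ~ N(y_i, sigma^2)
    theta_tilde = rng.multivariate_normal(theta0, Sigma0)    # theta~ ~ N(theta_0, Sigma_0) (prior)
    # argmin_theta ||y~ - X theta||^2 + (sigma^2 / lam) ||theta - theta~||^2
    A = X.T @ X + (sigma**2 / lam) * np.eye(d)
    b = X.T @ y_tilde + (sigma**2 / lam) * theta_tilde
    return np.linalg.solve(A, b)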

1 Optimistic Exploration in Deep RL

2 Random Exploration in Deep RL

3 Summary

Summary

Exploration bonus techniques


Random exploration techniques

Thank you!
Resources
Reinforcement Learning
Books
• Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., New York, NY, USA, 1994
• Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1.
MIT press Cambridge, 1998
• Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II.
Athena Scientific, 3rd edition, 2007
• Csaba Szepesvari. Algorithms for Reinforcement Learning.
Morgan and Claypool Publishers, 2010
Courses (with good references for exploration)
• Nan Jiang. CS598: Statistical Reinforcement Learning.
http://nanjiang.cs.illinois.edu/cs598/
• Emma Brunskill. CS234: Reinforcement Learning, Winter 2019.
http://web.stanford.edu/class/cs234/index.html
• Alessandro Lazaric. MVA: Reinforcement Learning.
http://chercheurs.lille.inria.fr/~lazaric/Webpage/Teaching.html
• Alexandre Proutiere. Reinforcement learning: A graduate course.
http://www.it.uu.se/research/systems_and_control/education/2017/relearn/
Resources

Exploration-Exploitation and Regret Minimization


Books
• Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic
multi-armed bandit problems.
Foundations and Trends in Machine Learning, 5(1):1–122, 2012
• Tor Lattimore and Csaba Szepesvári. Bandit algorithms.
Pre-publication version, 2018.
URL http://downloads.tor-lattimore.com/banditbook/book.pdf

Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian
deep q-networks. CoRR, abs/1802.04412, 2018. URL http://arxiv.org/abs/1802.04412.
Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying
count-based exploration and intrinsic motivation. In NIPS, pages 1471–1479, 2016.
Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 3rd edition, 2007.
Emma Brunskill. CS234: Reinforcement Learning, Winter 2019. http://web.stanford.edu/class/cs234/index.html.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit
problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation.
CoRR, abs/1810.12894, 2018. URL http://arxiv.org/abs/1810.12894.
Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih,
Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for
exploration. CoRR, abs/1706.10295, 2017.
Nan Jiang. CS598: Statistical Reinforcement Learning. http://nanjiang.cs.illinois.edu/cs598/.
Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Pre-publication version, 2018. URL
http://downloads.tor-lattimore.com/banditbook/book.pdf.
Alessandro Lazaric. MVA: Reinforcement Learning.
http://chercheurs.lille.inria.fr/~lazaric/Webpage/Teaching.html.
Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped
DQN. CoRR, abs/1602.04621, 2016. URL http://arxiv.org/abs/1602.04621.
Ian Osband, John Aslanides, and Albin Cassirer. Randomized Prior Functions for Deep Reinforcement Learning.
arXiv e-prints, art. arXiv:1806.03335, Jun 2018.

Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with
neural density models. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2721–2730.
PMLR, 2017.
Alexandre Proutiere. Reinforcement learning: A graduate course.
http://www.it.uu.se/research/systems_and_control/education/2017/relearn/.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley &
Sons, Inc., New York, NY, USA, 1994.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press
Cambridge, 1998.
Csaba Szepesvari. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.
Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck,
and Pieter Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. In
NIPS, pages 2750–2759, 2017.

