Exploration in Deep RL
Alessandro Lazaric
The Four Ingredients Recipe
Posterior Sampling
1. Build accurate estimators
2. Evaluate the uncertainty of the estimators
“Practical” Limitations
Optimism in the face of uncertainty
- Confidence intervals:
$$\beta_t^r(s,a) \propto \sqrt{\frac{\log(N_t(s,a)/\delta)}{N_t(s,a)}}, \qquad \beta_t^p(s,a) \propto \sqrt{\frac{S\,\log(N_t(s,a)/\delta)}{N_t(s,a)}}$$
- Solving:
$$\pi_t = \arg\max_{\pi}\; \max_{M \in \mathcal{M}_t} g_M^{\pi}$$

Posterior sampling
- Posterior over the dynamics for any state-action pair
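As a concrete instance of the posterior-sampling route, here is a minimal tabular PSRL-style sketch. It assumes a Dirichlet posterior over the dynamics of each state-action pair, a Gaussian posterior over mean rewards, and a hypothetical `solve_mdp` planner; the names and priors are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def psrl_step(counts, reward_sum, solve_mdp):
    """One posterior-sampling iteration: draw an MDP from the posterior
    and plan in it (PSRL-style).

    counts[s, a, s'] : observed transition counts (Dirichlet posterior)
    reward_sum[s, a] : accumulated rewards (Gaussian posterior on the mean)
    solve_mdp(p, r)  : hypothetical planner returning the optimal policy
    """
    n_states, n_actions, _ = counts.shape
    p = np.zeros_like(counts, dtype=float)
    for s in range(n_states):
        for a in range(n_actions):
            # Dirichlet draw for the dynamics of this state-action pair.
            p[s, a] = np.random.dirichlet(counts[s, a] + 1.0)
    n = counts.sum(axis=-1) + 1.0
    # Gaussian posterior on the mean rewards (unit-variance noise assumed).
    r = np.random.normal(reward_sum / n, 1.0 / np.sqrt(n))
    # The sampled MDP is solved exactly; its optimal policy is played next.
    return solve_mdp(p, r)
```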
Outline
1. Optimistic Exploration in Deep RL
2. Randomized Exploration
3. Summary
Count-based Exploration
General Scheme
Extended Value Iteration
$$v_{n+1}(s) = \mathcal{L}_t v_n(s) = \max_{(a,r,p)\,\in\, A(s)\times B_t^r(s,a)\times B_t^p(s,a)} \Big\{ r + p^{\top} v_n \Big\}$$
$$= \max_{a\in A(s)} \Big[ \max_{r\in B_t^r(s,a)} r + \max_{p\in B_t^p(s,a)} p^{\top} v_n \Big]$$
$$= \max_{a\in A(s)} \Big[ \widehat{r}_t(s,a) + \beta_t^r(s,a) + \max_{p\in B_t^p(s,a)} p^{\top} v_n \Big]$$
$$\le \max_{a\in A(s)} \Big\{ \widehat{r}_t(s,a) + \beta_t^r(s,a) + \|p - \widehat{p}_t(\cdot|s,a)\|_1 \|v_n\|_\infty + \widehat{p}_t^{\top} v_n \Big\}$$
$$\le \max_{a\in A(s)} \Big\{ \widehat{r}_t(s,a) + \beta_t^r(s,a) + C\sqrt{S}\,\beta_t^r(s,a) + \widehat{p}_t^{\top} v_n \Big\}$$

- Exploration bonus $(1 + C\sqrt{S})\,\beta_t^r(s,a)$ for the reward
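The inner maximization over $p$ has a simple closed-form solution when $B_t^p(s,a)$ is an L1 ball around $\widehat{p}_t$, as in UCRL2: shift as much mass as the radius allows onto the next state with the largest $v_n$. A minimal NumPy sketch of that step:

```python
import numpy as np

def optimistic_transition_value(p_hat, beta_p, v):
    """Inner maximization of extended value iteration over the L1 ball
    B = {p : ||p - p_hat||_1 <= beta_p}  (UCRL2-style).

    Moves probability mass onto the state with the largest value v_n and
    removes the same amount from the lowest-value states first.
    """
    p = p_hat.astype(float).copy()
    best = int(np.argmax(v))
    extra = min(1.0 - p[best], beta_p / 2.0)  # mass added to the best state
    p[best] += extra
    # Remove the same amount of mass, starting from the worst states.
    for s in np.argsort(v):
        if extra <= 0:
            break
        removable = min(p[s], extra) if s != best else 0.0
        p[s] -= removable
        extra -= removable
    return float(p @ v)
```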
Count-based Exploration
Tang et al. [2017]
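The slides' figures are not reproduced here; the core mechanism of Tang et al. [2017] is to hash states (SimHash on the state vector or a learned embedding), count hash codes, and pay a bonus $\beta/\sqrt{n(\phi(s))}$. A minimal sketch, where the projection size, $\beta$, and raw-vector input are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

class SimHashCounter:
    """Count-based bonus via SimHash (Tang et al., 2017): states are mapped
    to k-bit codes phi(s) = sgn(A s) and counts are kept per code."""

    def __init__(self, state_dim, k=32, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(size=(k, state_dim))  # fixed random projection
        self.counts = defaultdict(int)
        self.beta = beta

    def bonus(self, state):
        # Hash the state, update its count, and return the decaying bonus.
        code = tuple((self.A @ state > 0).astype(np.int8))
        self.counts[code] += 1
        return self.beta / np.sqrt(self.counts[code])
```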
Count-based Exploration
Bellemare et al. [2016], Ostrovski et al. [2017]
Recoding probability
$$\frac{\widetilde{N}_n(x)}{\widetilde{n}} = \rho_n(x); \qquad \frac{\widetilde{N}_n(x)+1}{\widetilde{n}+1} = \rho'_n(x) \;\;\Rightarrow\;\; \widetilde{N}_n(x) = \frac{\rho_n(x)\,\big(1-\rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}$$
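In code, the pseudo-count needs only the density model's probability of $x$ before and after one extra update on $x$. A minimal sketch, where `prob` and `update` are assumed names for the density-model interface:

```python
def pseudo_count(model, x):
    """Pseudo-count from a density model (Bellemare et al., 2016).

    model.prob(x)   : probability the current model assigns to x (rho_n)
    model.update(x) : train the model on one more occurrence of x
    (both method names are assumptions about the density-model interface)
    """
    rho = model.prob(x)        # rho_n(x)
    model.update(x)
    rho_prime = model.prob(x)  # recoding probability rho'_n(x)
    # N_n(x) = rho (1 - rho') / (rho' - rho); guard against rho' <= rho.
    denom = max(rho_prime - rho, 1e-12)
    return rho * (1.0 - rho_prime) / denom
```

The exploration bonus is then, e.g., proportional to $1/\sqrt{\widetilde{N}_n(x)}$.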
Count-based Exploration
Bellemare et al. [2016], Ostrovski et al. [2017]
Montezuma!
Prediction-based Exploration
Burda et al. [2018]
$$f_{\theta^*}: \mathcal{S} \to \mathbb{R}; \qquad f_{\theta}: \mathcal{S} \to \mathbb{R}$$
Train the prediction network by minimizing the loss w.r.t. the (fixed, randomly initialized) target network:
$$\theta_n = \arg\min_{\theta} \sum_{t=1}^{n} \big( f_{\theta}(s_t) - f_{\theta^*}(s_t) \big)^2$$
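A minimal PyTorch sketch of the two networks and the prediction-error bonus (the MLP architecture and layer sizes are assumptions; the paper uses convolutional networks on Atari frames):

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation (Burda et al., 2018): the prediction
    error of a trainable network against a fixed random target network
    serves as the intrinsic reward."""

    def __init__(self, state_dim, feat_dim=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))
        self.target = mlp()     # f_theta*: random, never trained
        self.predictor = mlp()  # f_theta : trained to imitate the target
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, s):
        # Novel states are poorly predicted -> large error -> large bonus.
        with torch.no_grad():
            t = self.target(s)
        err = (self.predictor(s) - t).pow(2).mean(dim=-1)
        return err  # training loss; use err.detach() as the intrinsic reward
```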
Prediction-based Exploration
Burda et al. [2018]
General architecture
- Separate extrinsic reward $r_t^E$ and intrinsic reward $r_t^I$
- PPO with two value heads to estimate $V^E$ and $V^I$
- Greedy policy w.r.t. $V^E + cV^I$

"Tricks" (see the sketch below)
- Keep the two rewards in the same range
- Use different discount factors for intrinsic and extrinsic rewards
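A sketch of how the two streams might be combined in the PPO update, in the spirit of the bullets above; the coefficient $c$ and the running normalization are assumptions consistent with the "same range" trick:

```python
def combined_advantage(adv_ext, adv_int, int_reward_std, c=0.5):
    """Combine the advantages from the two value heads V^E and V^I.

    The two streams come from returns computed with *different* discount
    factors (e.g., a larger gamma for the extrinsic reward). Dividing the
    intrinsic stream by a running std of intrinsic rewards keeps the two
    in a comparable range.
    """
    return adv_ext + c * adv_int / (int_reward_std + 1e-8)
```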
Randomized Exploration
General Scheme
Bootstrap DQN
Osband et al. [2016]
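The figures are not reproduced; the mechanism is $K$ Q-heads on a shared torso, with one head sampled per episode and each transition training only the heads selected by its bootstrap mask. A minimal PyTorch sketch (sizes and head count are assumptions):

```python
import torch
import torch.nn as nn

class BootstrappedDQN(nn.Module):
    """Bootstrapped DQN (Osband et al., 2016): K Q-value heads over a
    shared torso; a head sampled at the start of each episode drives
    temporally extended ("deep") exploration."""

    def __init__(self, state_dim, n_actions, n_heads=10):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(128, n_actions) for _ in range(n_heads))

    def forward(self, s, k):
        # Q-values of head k only; during training, a transition updates
        # head k only if its bootstrap mask m_k = 1.
        return self.heads[k](self.torso(s))

# At episode start: k = torch.randint(len(net.heads), ()).item()
# then act greedily w.r.t. net(s, k) for the whole episode.
```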
Noisy Networks
Fortunato et al. [2017]
- Normal NN layer: $y = wx + b$
- Double the parameters with a mean and a variance: $w \to (\mu^w, \sigma^w)$ and $b \to (\mu^b, \sigma^b)$
- Whenever the layer is evaluated, draw $\varepsilon^w, \varepsilon^b \sim D$
- Evaluate the "random" layer as
$$y = (\mu^w + \sigma^w \odot \varepsilon^w)\,x + \mu^b + \sigma^b \odot \varepsilon^b$$
- Let $\zeta = (\mu^w, \sigma^w, \mu^b, \sigma^b)$ and define the expected loss $\bar{L}(\zeta) = \mathbb{E}_\varepsilon\big[L(\zeta, \varepsilon)\big]$
- Gradient estimation:
$$\nabla_\zeta \bar{L}(\zeta) = \mathbb{E}_\varepsilon\big[\nabla_\zeta L(\zeta, \varepsilon)\big] \approx \frac{1}{n}\sum_{i=1}^{n} \nabla_\zeta L(\zeta, \varepsilon_i)$$
Noisy Networks
Fortunato et al. [2017]
Noise models
- Independent noise $\varepsilon_{i,j}$ for each individual weight of the layer
- Factorized noise $\varepsilon_{i,j} = f(\varepsilon_i)f(\varepsilon_j)$ (e.g., $f(x) = \mathrm{sgn}(x)\sqrt{|x|}$)
- Independent noise for the target and online networks:
$$y_t = r_t + \gamma \max_{a'} Q(s'_t, a'; \varepsilon', \zeta^-); \qquad L_t(\zeta, \varepsilon) = \big(y_t - Q(s_t, a_t; \varepsilon, \zeta)\big)^2$$
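A minimal PyTorch sketch of a noisy linear layer with the factorized scheme (the initialization constants follow common practice for this layer and are assumptions here):

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Noisy linear layer (Fortunato et al., 2017), factorized Gaussian
    noise: y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b."""

    def __init__(self, in_dim, out_dim, sigma0=0.5):
        super().__init__()
        bound = 1.0 / math.sqrt(in_dim)
        self.mu_w = nn.Parameter(
            torch.empty(out_dim, in_dim).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(
            torch.full((out_dim, in_dim), sigma0 / math.sqrt(in_dim)))
        self.mu_b = nn.Parameter(torch.zeros(out_dim))
        self.sigma_b = nn.Parameter(
            torch.full((out_dim,), sigma0 / math.sqrt(in_dim)))
        self.in_dim, self.out_dim = in_dim, out_dim

    @staticmethod
    def _f(x):  # f(x) = sgn(x) sqrt(|x|), as in the factorized scheme
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        # Fresh noise at every evaluation; eps_{i,j} = f(eps_i) f(eps_j).
        eps_in = self._f(torch.randn(self.in_dim))
        eps_out = self._f(torch.randn(self.out_dim))
        w = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)
        b = self.mu_b + self.sigma_b * eps_out
        return x @ w.T + b
```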
Bayesian DQN
Azizzadenesheli et al. [2018]
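The slides' figures are lost; the method replaces the last DQN layer with Bayesian linear regression on the learned features and selects actions by Thompson sampling from the resulting posterior over Q-values. A minimal sketch of one action's weight posterior (prior and noise scales are assumptions):

```python
import numpy as np

def sample_q_weights(Phi, y, sigma2=1.0, prior_var=10.0):
    """Thompson sampling for Bayesian DQN (Azizzadenesheli et al., 2018):
    Bayesian linear regression of targets y on last-layer features Phi,
    then one posterior draw of the weights for action selection.

    Phi : (n, d) feature matrix; y : (n,) regression targets.
    """
    d = Phi.shape[1]
    prec = Phi.T @ Phi / sigma2 + np.eye(d) / prior_var  # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ Phi.T @ y / sigma2
    return np.random.multivariate_normal(mean, cov)  # one Thompson sample
```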
Randomized Prior Functions
Osband et al. [2018]
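The core construction of Osband et al. [2018]: each ensemble member predicts $f_\theta(s) + \beta\,p(s)$, where $p$ is a fixed, randomly initialized prior network that is never trained. A minimal PyTorch sketch (`make_net` is an assumed network factory; $\beta$ is a hyperparameter):

```python
import torch.nn as nn

class NetWithPrior(nn.Module):
    """Randomized prior functions (Osband et al., 2018): predictions are
    Q(s) = f_theta(s) + beta * p(s), with p random and never trained.
    Training f_theta on bootstrapped data keeps the prior's randomness
    alive in the ensemble, mimicking a posterior."""

    def __init__(self, make_net, beta=3.0):
        super().__init__()
        self.net = make_net()    # trainable
        self.prior = make_net()  # fixed random prior
        for p in self.prior.parameters():
            p.requires_grad_(False)
        self.beta = beta

    def forward(self, s):
        return self.net(s) + self.beta * self.prior(s)
```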
Summary
Thank you!
Resources
Reinforcement Learning
Books
• Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., New York, NY, USA, 1994
• Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1.
MIT press Cambridge, 1998
• Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II.
Athena Scientific, 3rd edition, 2007
• Csaba Szepesvari. Algorithms for Reinforcement Learning.
Morgan and Claypool Publishers, 2010
Courses (with good references for exploration)
• Nan Jiang. CS598: Statistical Reinforcement Learning.
http://nanjiang.cs.illinois.edu/cs598/
• Emma Brunskill. CS234: Reinforcement Learning, Winter 2019.
http://web.stanford.edu/class/cs234/index.html
• Alessandro Lazaric. MVA Reinforcement Learning.
http://chercheurs.lille.inria.fr/~lazaric/Webpage/Teaching.html
• Alexandre Proutiere. Reinforcement Learning: A Graduate Course.
http://www.it.uu.se/research/systems_and_control/education/2017/relearn/
Resources
Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian
deep Q-networks. CoRR, abs/1802.04412, 2018. URL http://arxiv.org/abs/1802.04412.
Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying
count-based exploration and intrinsic motivation. In NIPS, pages 1471–1479, 2016.
Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 3rd edition, 2007.
Emma Brunskill. CS234: Reinforcement Learning, Winter 2019. http://web.stanford.edu/class/cs234/index.html.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit
problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation.
CoRR, abs/1810.12894, 2018. URL http://arxiv.org/abs/1810.12894.
Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih,
Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for
exploration. CoRR, abs/1706.10295, 2017.
Nan Jiang. CS598: Statistical Reinforcement Learning. http://nanjiang.cs.illinois.edu/cs598/.
Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Pre-publication version, 2018. URL
http://downloads.tor-lattimore.com/banditbook/book.pdf.
Alessandro Lazaric. MVA Reinforcement Learning.
http://chercheurs.lille.inria.fr/~lazaric/Webpage/Teaching.html.
Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped
DQN. CoRR, abs/1602.04621, 2016. URL http://arxiv.org/abs/1602.04621.
Ian Osband, John Aslanides, and Albin Cassirer. Randomized Prior Functions for Deep Reinforcement Learning.
arXiv e-prints, art. arXiv:1806.03335, Jun 2018.
Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with
neural density models. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2721–2730.
PMLR, 2017.
Alexandre Proutiere. Reinforcement Learning: A Graduate Course.
http://www.it.uu.se/research/systems_and_control/education/2017/relearn/.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley &
Sons, Inc., New York, NY, USA, 1994.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press
Cambridge, 1998.
Csaba Szepesvari. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.
Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck,
and Pieter Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. In
NIPS, pages 2750–2759, 2017.