
Exploration-Exploitation in Reinforcement Learning (Deep RL)


Alessandro Lazaric
Facebook AI Research
The Four Ingredients Recipe

1 Build accurate estimators


2 Evaluate the uncertainty of the prediction
3 Define a mechanism to combine estimation and uncertainty
4 Execute the best policy

The Four Ingredients Recipe

Optimism in the face of uncertainty


1 Build accurate estimators
  $\widehat{M}_t \;\Rightarrow\; g^{\pi}_{\widehat{M}_t}$

2 Evaluate the uncertainty of the estimators
  $B^r_t(s,a) := \big[\,\hat r_t(s,a) - \beta^r_t(s,a),\; \hat r_t(s,a) + \beta^r_t(s,a)\,\big]$
  $B^p_t(s,a) := \big\{\, p(\cdot|s,a) \in \Delta(S) : \|p(\cdot|s,a) - \hat p_t(\cdot|s,a)\|_1 \le \beta^p_t(s,a) \,\big\}$

3 Define a mechanism to combine estimation and uncertainty
  $\pi_t = \arg\max_{\pi}\; \max_{M \in \mathcal{M}_t}\; g^{\pi}_{M}$

The Four Ingredients Recipe

Posterior Sampling
1 Build accurate estimators
2 Evaluate the uncertainty of the estimators
  $\forall\,\Theta, \quad \mathbb{P}(M^* \in \Theta \mid H_t, \mu_1) = \mu_t(\Theta)$, with $\mu_t$ updated using Bayes' rule

3 Define a mechanism to combine estimation and uncertainty
  $\widetilde{M}_t \sim \mu_t, \qquad \pi_t = \arg\max_{\pi}\; g^{\pi}_{\widetilde{M}_t}$

“Practical” Limitations
Optimism in the face of uncertainty
Confidence intervals
  $\beta^r_t(s,a) \propto \sqrt{\frac{\log\big(N_t(s,a)/\delta\big)}{N_t(s,a)}} \qquad\qquad \beta^p_t(s,a) \propto \sqrt{\frac{S\,\log\big(N_t(s,a)/\delta\big)}{N_t(s,a)}}$
Solving
  $\pi_t = \arg\max_{\pi}\; \max_{M \in \mathcal{M}_t}\; g^{\pi}_{M}$

Posterior sampling
Posterior over the dynamics for each state-action pair
  $\mathrm{Dirichlet}\big(N_t(s'_1|s,a),\, N_t(s'_2|s,a),\, \ldots,\, N_t(s'_S|s,a)\big)$
Updating/sampling from unstructured or non-conjugate posteriors

1 Optimistic Exploration in Deep RL

2 Random Exploration in Deep RL

3 Summary

Count-based Exploration
General Scheme

1 Estimate a "proxy" for the number of visits $\widetilde N(s_t)$

2 Add an exploration bonus to the rewards
  $\tilde r^+_t = r_t + c\,\sqrt{\dfrac{1}{\widetilde N(s_t)}}$

3 Run any DeepRL algorithm on $(s_t, a_t, \tilde r^+_t, s_{t+1})$

What happened to optimism in the dynamics?
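
A minimal Python sketch of steps 1-2 above (not from the slides): visits are counted through a generic hash of the state; the bonus scale c and the hash function are illustrative assumptions.

import math
from collections import defaultdict

class CountBonus:
    def __init__(self, c=1.0, hash_fn=hash):
        self.c = c                       # bonus scale (hyper-parameter)
        self.hash_fn = hash_fn           # proxy: map raw states to discrete keys
        self.counts = defaultdict(int)   # pseudo-counts N~(s)

    def shaped_reward(self, state, reward):
        key = self.hash_fn(state)
        self.counts[key] += 1
        # r~+_t = r_t + c * sqrt(1 / N~(s_t))
        return reward + self.c * math.sqrt(1.0 / self.counts[key])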

Extended Value Iteration

$v_{n+1}(s) = L_t v_n(s) = \max_{(a,r,p)\,\in\, A(s)\times B^r_t(s,a)\times B^p_t(s,a)} \big\{\, r + p^{\mathsf T} v_n \,\big\}$
$\qquad = \max_{a\in A(s)} \Big[\, \max_{r\in B^r_t(s,a)} r \;+\; \max_{p\in B^p_t(s,a)} p^{\mathsf T} v_n \,\Big]$
$\qquad = \max_{a\in A(s)} \Big[\, \hat r_t(s,a) + \beta^r_t(s,a) \;+\; \max_{p\in B^p_t(s,a)} p^{\mathsf T} v_n \,\Big]$
$\qquad \le \max_{a\in A(s)} \Big\{\, \hat r_t(s,a) + \beta^r_t(s,a) + \|p - \hat p_t(\cdot|s,a)\|_1\,\|v_n\|_\infty + \hat p_t^{\mathsf T} v_n \,\Big\}$
$\qquad \le \max_{a\in A(s)} \Big\{\, \hat r_t(s,a) + \beta^r_t(s,a) + C\sqrt{S}\,\beta^r_t(s,a) + \hat p_t^{\mathsf T} v_n \,\Big\}$

(+) Exploration bonus $(1 + C\sqrt{S})\,\beta^r_t(s,a)$ for the reward
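
A minimal Python sketch of the resulting bonus-augmented value-iteration step for a tabular MDP; the array shapes and the constant playing the role of $(1 + C\sqrt{S})$ are illustrative assumptions.

import numpy as np

def optimistic_vi_step(v, r_hat, beta_r, p_hat, bonus_scale):
    """One application of the (upper-bounded) optimistic operator L_t.
    r_hat, beta_r: (S, A) arrays; p_hat: (S, A, S) array; v: (S,) array.
    bonus_scale plays the role of (1 + C*sqrt(S)) in the last line above."""
    q = r_hat + bonus_scale * beta_r + p_hat @ v   # optimistic Q(s, a)
    return q.max(axis=1)                           # greedy over actions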

Count-based Exploration
Tang et al. [2017]

Use locality-sensitive hashing to discretize the input


• Encode the state into a k-dim vector by random projection
• Use the sign to discretize
Count on the discrete hashed states
(−) Difficult to define a good hashing function
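
A minimal SimHash-style counting sketch along the lines described above (random projection, then sign); the projection dimension k and the flattening of the state are assumptions.

import numpy as np
from collections import defaultdict

class SimHashCounter:
    def __init__(self, state_dim, k=32, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))  # fixed random projection
        self.counts = defaultdict(int)

    def count(self, state):
        # phi(s) = sign(A s): a k-bit code; count visits per code
        code = tuple((self.A @ np.asarray(state, dtype=float).ravel() > 0).astype(int))
        self.counts[code] += 1
        return self.counts[code]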

Count-based Exploration
Tang et al. [2017]

Entropy loss for the auto-encoder


“Binarization” loss for the “projection”

Count-based Exploration
Tang et al. [2017]

Use all past history to update the AE


AE should not be updated too often

Count-based Exploration
Bellemare et al. [2016], Ostrovski et al. [2017]

Density estimation over a countable set X


 
$\rho_n(x) = \rho(x \mid x_1, \ldots, x_n) \approx \mathbb{P}\big(X = x \mid x_1, \ldots, x_n\big)$

Recording probability
  $\rho'_n(x) = \rho(x \mid x_1, \ldots, x_n, x) \approx \mathbb{P}\big(X = x \mid x_1, \ldots, x_n, X_{n+1} = x\big)$

Pseudo "local" and "total" counts $\widetilde N_n(x)$ and $\tilde n$ such that
  $\dfrac{\widetilde N_n(x)}{\tilde n} = \rho_n(x), \qquad \dfrac{\widetilde N_n(x) + 1}{\tilde n + 1} = \rho'_n(x) \quad\Rightarrow\quad \widetilde N_n(x) = \dfrac{\rho_n(x)\,\big(1 - \rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}$

(+) Any density estimation algorithm (accurate for images)
(−) Density estimation in continuous spaces is hard
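
A minimal sketch of the pseudo-count formula above; the density_model object with prob()/update() methods is an assumed interface, not the papers' implementation.

def pseudo_count(density_model, x):
    rho = density_model.prob(x)        # rho_n(x): probability before observing x
    density_model.update(x)            # add one more occurrence of x to the model
    rho_prime = density_model.prob(x)  # rho'_n(x): recoding probability
    if rho_prime <= rho:               # guard: the formula assumes rho' > rho
        return 0.0
    return rho * (1.0 - rho_prime) / (rho_prime - rho)   # N~_n(x)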

Count-based Exploration
Bellemare et al. [2016], Ostrovski et al. [2017]

Montezuma!

Prediction-based Exploration
Burda et al. [2018]

Sources of prediction errors


1 Amount of data (+)
2 Stochasticity (e.g., noisy-TV) (−)
3 Model misspecification (−)
4 Learning dynamics (−)

Prediction-based Exploration
Burda et al. [2018]

Randomly initialize two instances of the same NN (target $\theta^*$ and predictor $\theta_0$)
  $f_{\theta^*} : S \to \mathbb{R}; \qquad f_{\theta} : S \to \mathbb{R}$

Train the prediction network to minimize the loss w.r.t. the target network
  $\theta_n = \arg\min_{\theta} \sum_{t=1}^{n} \big(f_{\theta}(s_t) - f_{\theta^*}(s_t)\big)^2$

Build the "intrinsic" reward
  $r^E_t = \big(f_{\theta}(s_t) - f_{\theta^*}(s_t)\big)^2$

(+) No influence from stochastic transitions
(+) No model misspecification ($f_{\theta}$ can exactly predict $f_{\theta^*}$)
(+) Influence of learning dynamics can be reduced
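
A minimal PyTorch sketch of the scheme above (fixed random target, trained predictor); the network sizes, scalar output, and optimizer settings are illustrative assumptions.

import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, state_dim, hidden=64, lr=1e-4):
        super().__init__()
        def make():
            return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.target = make()                   # f_{theta*}: fixed, randomly initialized
        self.predictor = make()                # f_theta: trained to imitate the target
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def intrinsic_reward(self, s):
        # r^E_t = (f_theta(s_t) - f_{theta*}(s_t))^2, no gradient needed here
        with torch.no_grad():
            return (self.predictor(s) - self.target(s)).pow(2).squeeze(-1)

    def update(self, s):
        # minimize the squared prediction error w.r.t. the fixed target
        loss = (self.predictor(s) - self.target(s)).pow(2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()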

Prediction-based Exploration
Burda et al. [2018]

General architecture
Separate extrinsic reward $r^I_t$ and intrinsic reward $r^E_t$
PPO with two heads to estimate $V^I$ and $V^E$
Greedy policy w.r.t. $V^I + c\,V^E$

“Tricks”
Rewards should be in the same range
Use different discount factors for intrinsic and extrinsic rewards
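
A small sketch of the two "tricks" above, under the assumption that the intrinsic reward is normalized by the standard deviation of past intrinsic rewards; the constants and the class name are illustrative, not the paper's implementation.

import numpy as np

class RewardMixer:
    def __init__(self, c=1.0, gamma_ext=0.999, gamma_int=0.99):
        self.c = c                          # weight of the intrinsic value head
        self.gamma_ext = gamma_ext          # discount used for the extrinsic return
        self.gamma_int = gamma_int          # (different) discount for the intrinsic return
        self.intrinsic_history = []

    def normalize_intrinsic(self, r_int):
        # keep the two rewards in a comparable range
        self.intrinsic_history.append(r_int)
        return r_int / (np.std(self.intrinsic_history) + 1e-8)

    def combine_values(self, v_ext, v_int):
        return v_ext + self.c * v_int       # greedy w.r.t. V^I + c V^E in the slide's notation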

1 Optimistic Exploration in Deep RL

2 Random Exploration in Deep RL

3 Summary

Randomized Exploration
General Scheme

1 Estimate the parameters θ for either the policy or the value function

2 Add randomness to the parameters: $\tilde\theta = \theta + \text{noise}$

3 Run the corresponding (greedy) policy

Note: the randomness needs to represent "uncertainty"
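
A minimal sketch of the scheme above for a linear Q-function; the fixed noise scale sigma is an assumption and, as noted, should ideally reflect the actual parameter uncertainty.

import numpy as np

def randomized_greedy_action(theta, features, sigma=0.1, rng=None):
    """theta: (d, A) weights of a linear Q-function; features: (d,) state features."""
    if rng is None:
        rng = np.random.default_rng()
    theta_tilde = theta + sigma * rng.standard_normal(theta.shape)  # theta~ = theta + noise
    q_values = features @ theta_tilde                               # Q~(s, a) for every action a
    return int(np.argmax(q_values))                                 # act greedily w.r.t. Q~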

Bootstrap DQN
Osband et al. [2016]

Define multiple value functions $Q^k$

Update the functions with different datasets

Share part of the architecture

Bootstrap DQN
Osband et al. [2016]

$M_t$ (the distribution of the masks $m_t$) determines the type of bootstrapping strategy

$g^k_t = m^k_t \,\big(y^Q_t - Q^k(s_t, a_t; \theta)\big)\, \nabla_{\theta} Q^k(s_t, a_t; \theta)$
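
A minimal PyTorch sketch of the masked per-head loss corresponding to the gradient above; Bernoulli masks and the shared-torso/K-heads split are assumptions about the architecture.

import torch

def bootstrapped_loss(q_heads, targets, actions, masks):
    """q_heads: list of K tensors of shape (batch, n_actions); targets y^Q_t: (batch,);
    actions: (batch,) long tensor; masks m_t: (batch, K) tensor in {0, 1}."""
    loss = torch.zeros(())
    for k, q in enumerate(q_heads):
        q_sa = q.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q^k(s_t, a_t; theta)
        td_error = targets - q_sa                             # y^Q_t - Q^k(s_t, a_t; theta)
        loss = loss + (masks[:, k] * td_error.pow(2)).mean()  # head k only sees masked samples
    return loss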




Bootstrap DQN
Osband et al. [2016]

Ensemble DQN: ensemble policy?


Thompson DQN: resample at each step

Noisy Networks
Fortunato et al. [2017]

Normal NN layer: $y = wx + b$
Double the parameters with a mean and a variance: $w \to (\mu^w, \sigma^w)$ and $b \to (\mu^b, \sigma^b)$
Whenever a layer is evaluated, draw $\varepsilon^w, \varepsilon^b \sim D$
Evaluate the "random" layer as
  $y = (\mu^w + \sigma^w \odot \varepsilon^w)\,x + \mu^b + \sigma^b \odot \varepsilon^b$
Let $\zeta = (\mu^w, \sigma^w, \mu^b, \sigma^b)$ and define the expected loss
  $\bar L(\zeta) = \mathbb{E}_{\varepsilon}\big[L(\zeta, \varepsilon)\big]$
Gradient estimation
  $\nabla_{\zeta} \bar L(\zeta) = \mathbb{E}_{\varepsilon}\big[\nabla_{\zeta} L(\zeta, \varepsilon)\big] \approx \frac{1}{n}\sum_{i=1}^{n} \nabla_{\zeta} L(\zeta, \varepsilon_i)$

Noisy Networks
Fortunato et al. [2017]

Noise models
Independent noise $\varepsilon_{i,j}$ for each weight $i$ at layer $j$
Factorized noise $\varepsilon_{i,j} = f(\varepsilon_i)\,f(\varepsilon_j)$ (e.g., $f(x) = \mathrm{sgn}(x)\sqrt{|x|}$)
Independent noise for the target and online networks
  $y_t = r_t + \max_{a'} Q(s'_t, a'; \varepsilon', \zeta^-), \qquad L_t(\zeta, \varepsilon) = \big(y_t - Q(s_t, a_t; \varepsilon, \zeta)\big)^2$
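
A minimal PyTorch sketch of a noisy linear layer with the factorized noise above; the initialization constants follow common practice and are assumptions.

import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, in_dim, out_dim, sigma0=0.5):
        super().__init__()
        bound = 1.0 / math.sqrt(in_dim)
        self.mu_w = nn.Parameter(torch.empty(out_dim, in_dim).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_dim, in_dim), sigma0 * bound))
        self.mu_b = nn.Parameter(torch.zeros(out_dim))
        self.sigma_b = nn.Parameter(torch.full((out_dim,), sigma0 * bound))
        self.in_dim, self.out_dim = in_dim, out_dim

    @staticmethod
    def _f(x):
        # f(x) = sgn(x) * sqrt(|x|) from the factorized-noise model above
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_in = self._f(torch.randn(self.in_dim, device=x.device))
        eps_out = self._f(torch.randn(self.out_dim, device=x.device))
        eps_w = torch.outer(eps_out, eps_in)      # eps_{i,j} = f(eps_i) f(eps_j)
        w = self.mu_w + self.sigma_w * eps_w      # mu^w + sigma^w (.) eps^w
        b = self.mu_b + self.sigma_b * eps_out    # mu^b + sigma^b (.) eps^b
        return x @ w.t() + b                      # y = (mu^w + sigma^w (.) eps^w) x + ...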

Bayesian DQN
Azizzadenesheli et al. [2018]

Same tools as in linear bandits

1 Bayesian linear regression with a given feature map $\phi(s) \in \mathbb{R}^d$
  and a given target vector $y_a$ for each action
  $\mu_a = (\Phi_a^{\mathsf T} \Phi_a)^{-1} \Phi_a^{\mathsf T} y_a, \qquad \Sigma_a = \Phi_a^{\mathsf T} \Phi_a$

2 Draw a weight vector at random: $w_a \sim \mathcal{N}\big(\mu_a, \Sigma_a^{-1}\big)$

3 Run the corresponding (greedy) policy
  $a_t = \arg\max_{a} w_a^{\mathsf T} \phi(s_t)$

4 Train $\phi$ with a standard NN
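
A minimal sketch of steps 1-3 on top of fixed features φ(s); the ridge term lam and noise variance sigma2 are assumptions added to keep the covariance invertible.

import numpy as np

def thompson_action(Phi, Y, phi_s, lam=1.0, sigma2=1.0, rng=None):
    """Phi: dict action -> (n_a, d) feature matrix; Y: dict action -> (n_a,) targets;
    phi_s: (d,) features of the current state."""
    if rng is None:
        rng = np.random.default_rng()
    actions, q_samples = list(Phi), []
    for a in actions:
        d = Phi[a].shape[1]
        Sigma_a = Phi[a].T @ Phi[a] / sigma2 + lam * np.eye(d)       # (regularized) precision
        mu_a = np.linalg.solve(Sigma_a, Phi[a].T @ Y[a] / sigma2)
        w_a = rng.multivariate_normal(mu_a, np.linalg.inv(Sigma_a))  # w_a ~ N(mu_a, Sigma_a^-1)
        q_samples.append(w_a @ phi_s)
    return actions[int(np.argmax(q_samples))]                        # greedy on the sampled Q-values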

Randomized Prior
Osband et al. [2018]

Computational generation of posterior samples for linear Bayesian regression

Let $f_{\theta}(x) = x^{\mathsf T}\theta$ and $y_i = x_i^{\mathsf T}\theta + \epsilon_i$
Generate a sample of $\theta \mid \mathcal{D}_n$ from the linear Bayesian posterior:
1 Compute $\tilde y_i \sim \mathcal{N}(y_i, \sigma^2)$ and $\tilde\theta \sim \mathcal{N}(\theta_0, \Sigma_0)$ (prior)
2 Compute
  $\arg\min_{\theta}\; \sum_{i=1}^{n} \|\tilde y_i - f_{\theta}(x_i)\|_2^2 + \frac{\sigma^2}{\lambda}\,\|\theta - \tilde\theta\|^2$
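
A minimal sketch of the two steps above for the linear case, using the closed-form solution of the regularized least-squares problem; σ, λ, and the prior parameters are inputs the caller must supply.

import numpy as np

def sample_linear_posterior(X, y, theta0, Sigma0, sigma=1.0, lam=1.0, rng=None):
    """X: (n, d) inputs, y: (n,) targets; returns one approximate sample of theta | D_n."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    y_tilde = y + sigma * rng.standard_normal(n)             # y~_i ~ N(y_i, sigma^2)
    theta_tilde = rng.multivariate_normal(theta0, Sigma0)    # theta~ ~ N(theta_0, Sigma_0) (prior)
    # argmin_theta ||y~ - X theta||^2 + (sigma^2 / lam) ||theta - theta~||^2
    A = X.T @ X + (sigma**2 / lam) * np.eye(d)
    b = X.T @ y_tilde + (sigma**2 / lam) * theta_tilde
    return np.linalg.solve(A, b)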

1 Optimistic Exploration in Deep RL

2 Random Exploration in Deep RL

3 Summary

Summary

Exploration bonus techniques


Random exploration techniques

Thank you!
Resources
Reinforcement Learning
Books
• Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., New York, NY, USA, 1994
• Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1.
MIT press Cambridge, 1998
• Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II.
Athena Scientific, 3rd edition, 2007
• Csaba Szepesvari. Algorithms for Reinforcement Learning.
Morgan and Claypool Publishers, 2010
Courses (with good references for exploration)
• Nan Jiang. CS598: Statistical Reinforcement Learning.
http://nanjiang.cs.illinois.edu/cs598/
• Emma Brunskill. CS234: Reinforcement Learning, Winter 2019.
http://web.stanford.edu/class/cs234/index.html
• Alessandro Lazaric. MVA: Reinforcement Learning.
http://chercheurs.lille.inria.fr/~lazaric/Webpage/Teaching.html
• Alexandre Proutiere. Reinforcement learning: A graduate course.
http://www.it.uu.se/research/systems_and_control/education/2017/relearn/
Resources

Exploration-Exploitation and Regret Minimization


Books
• Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic
multi-armed bandit problems.
Foundations and Trends in Machine Learning, 5(1):1–122, 2012
• Tor Lattimore and Csaba Szepesvári. Bandit algorithms.
Pre-publication version, 2018.
URL http://downloads.tor-lattimore.com/banditbook/book.pdf

Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through bayesian
deep q-networks. CoRR, abs/1802.04412, 2018. URL http://arxiv.org/abs/1802.04412.
Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying
count-based exploration and intrinsic motivation. In NIPS, pages 1471–1479, 2016.
Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 3rd edition, 2007.
Emma Brunskill. CS234: Reinforcement Learning, Winter 2019. http://web.stanford.edu/class/cs234/index.html.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit
problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation.
CoRR, abs/1810.12894, 2018. URL http://arxiv.org/abs/1810.12894.
Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih,
Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for
exploration. CoRR, abs/1706.10295, 2017.
Nan Jiang. CS598: Statistical Reinforcement Learning. http://nanjiang.cs.illinois.edu/cs598/.
Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Pre-publication version, 2018. URL
http://downloads.tor-lattimore.com/banditbook/book.pdf.
Alessandro Lazaric. MVA: Reinforcement Learning.
http://chercheurs.lille.inria.fr/~lazaric/Webpage/Teaching.html.
Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped
DQN. CoRR, abs/1602.04621, 2016. URL http://arxiv.org/abs/1602.04621.
Ian Osband, John Aslanides, and Albin Cassirer. Randomized Prior Functions for Deep Reinforcement Learning.
arXiv e-prints, art. arXiv:1806.03335, Jun 2018.

Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with
neural density models. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2721–2730.
PMLR, 2017.
Alexandre Proutiere. Reinforcement learning: A graduate course.
http://www.it.uu.se/research/systems_and_control/education/2017/relearn/.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley &
Sons, Inc., New York, NY, USA, 1994.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press
Cambridge, 1998.
Csaba Szepesvari. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.
Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck,
and Pieter Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. In
NIPS, pages 2750–2759, 2017.

