[Figure: the reinforcement learning loop: generate samples (i.e. run the policy) → fit a model / estimate the return → improve the policy]
What’s wrong?
• consecutive samples are strongly correlated: they come from the same trajectory
• the target value is always changing: each gradient step moves the same network that defines the target
One fix: fit to a whole dataset of transitions at once.
Fitted Q-iteration
1. collect a dataset of transitions {(s_i, a_i, s'_i, r_i)} using some policy
2. set y_i ← r_i + γ max_{a'} Q_φ(s'_i, a')
3. set φ ← argmin_φ Σ_i ‖Q_φ(s_i, a_i) − y_i‖²
repeat steps 2–3, occasionally returning to step 1 for fresh data (sketch below)
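A minimal sketch of this loop in PyTorch, assuming a small discrete-action Q-network; the network shape, optimizer, and step counts are illustrative choices, not part of the algorithm:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99   # illustrative sizes

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_iteration(states, actions, rewards, next_states,
                       n_outer=10, n_fit_steps=100):
    """Fitted Q-iteration on a fixed dataset of transitions (tensors).
    Terminal-state masking is omitted for brevity."""
    for _ in range(n_outer):
        # step 2: y_i = r_i + gamma * max_a' Q_phi(s'_i, a'), held fixed below
        with torch.no_grad():
            y = rewards + GAMMA * q_net(next_states).max(dim=1).values
        # step 3: ordinary supervised regression of Q_phi(s_i, a_i) onto y_i
        for _ in range(n_fit_steps):
            q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```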
Another solution: replay buffers
• store transitions in a dataset of transitions (the “replay buffer”) and train on mini-batches sampled from it, rather than on the latest step
+ samples are no longer correlated
+ this is off-policy Q-learning: the buffer can be filled by any policy
(a minimal buffer sketch follows)
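A minimal sketch of such a buffer; the capacity and uniform sampling here are illustrative defaults:

```python
import random
from collections import deque

class ReplayBuffer:
    """A dataset of transitions; the oldest are evicted once capacity is reached."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation between
        # consecutive transitions from the same trajectory
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = map(list, zip(*batch))
        return states, actions, rewards, next_states
```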
Putting it together: Q-learning with a replay buffer
1. collect transitions using some policy, add them to the dataset of transitions B (the “replay buffer”)
2. sample a mini-batch from B
3. take a gradient step on the Q-function; repeat steps 2–3 K times
• K = 1 is common, though larger K is more efficient
• still off-policy Q-learning: B mixes data from many past policies
(a sketch of the full loop follows)
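A sketch of the loop, reusing the hypothetical `q_net`, `opt`, and `ReplayBuffer` from the sketches above and assuming a Gym-style `env` with a discrete action space:

```python
import torch

def q_learning_with_replay(env, q_net, opt, buffer,
                           n_steps=10_000, K=1, batch_size=64,
                           gamma=0.99, eps=0.1):
    state, _ = env.reset()
    for _ in range(n_steps):
        # 1. collect a transition with an epsilon-greedy policy, add it to B
        if torch.rand(()).item() < eps:
            action = env.action_space.sample()
        else:
            s_t = torch.as_tensor(state, dtype=torch.float32)
            action = q_net(s_t).argmax().item()
        next_state, reward, terminated, truncated, _ = env.step(action)
        buffer.add(state, action, reward, next_state)
        state = env.reset()[0] if (terminated or truncated) else next_state
        if len(buffer) < batch_size:
            continue
        # 2.-3. repeated K times: sample a batch from B, take one gradient step
        for _ in range(K):
            s, a, r, s2 = buffer.sample(batch_size)
            s = torch.as_tensor(s, dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            with torch.no_grad():
                y = r + gamma * q_net(s2).max(dim=1).values
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```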
What’s wrong?
• the target y = r + γ max_{a'} Q_φ(s', a') uses the same parameters φ being updated, so this is not quite supervised regression: the targets move with every gradient step
Solution: compute targets with a separate target network Q_{φ'}, so the targets don’t change in the inner loop!
“Classic” deep Q-learning algorithm (DQN)
1. take some action a_i and observe (s_i, a_i, s'_i, r_i), add it to B
2. sample a mini-batch {(s_j, a_j, s'_j, r_j)} from B uniformly
3. compute y_j = r_j + γ max_{a'} Q_{φ'}(s'_j, a') using the target network Q_{φ'}
4. φ ← φ − α Σ_j dQ_φ/dφ(s_j, a_j) (Q_φ(s_j, a_j) − y_j), which is just SGD on a regression objective
5. update φ': copy φ every N steps
(sketch below)
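A sketch of steps 3–5 with a target network; `q_net` and `opt` are the hypothetical network and optimizer from the earlier sketches, and the copy period N = 1000 is illustrative:

```python
import copy
import torch

target_net = copy.deepcopy(q_net)   # Q_phi': frozen copy of the current network
TARGET_PERIOD = 1000                # "N" in step 5

def dqn_update(step, s, a, r, s2, gamma=0.99):
    # 3. targets use the *target* parameters phi', so they stay fixed while
    #    phi changes: the inner loop is plain supervised regression
    with torch.no_grad():
        y = r + gamma * target_net(s2).max(dim=1).values
    # 4. one gradient step on phi: just SGD on the squared error
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5. copy phi into phi' every N steps
    if step % TARGET_PERIOD == 0:
        target_net.load_state_dict(q_net.state_dict())
```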
A more general view
• process 1: data collection, feeding the dataset of transitions (the “replay buffer”)
• process 2: target update, copying the current parameters φ into the target parameters φ'
• process 3: Q-function regression, updating the current parameters φ on batches from the buffer
The algorithms differ in how fast each process runs: online Q-learning runs all three at the same rate, DQN runs processes 1 and 3 together with a slow process 2, and fitted Q-iteration nests process 3 inside process 2 inside process 1 (sketch below).
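One way to make this view concrete: treat the three processes as callables run at different periods. The scheduler below is an illustrative sketch, not an algorithm from the slides; only the relative periods distinguish the algorithms.

```python
def generalized_q_learning(collect, update_target, regress,
                           n_steps=10_000,
                           collect_period=1, target_period=1000, grad_period=1):
    """Run the three processes at different rates.
    collect:       process 1, adds transitions to the replay buffer
    update_target: process 2, copies current parameters into target parameters
    regress:       process 3, gradient step on the Q-function
    """
    for step in range(n_steps):
        if step % collect_period == 0:
            collect()
        if step % target_period == 0:
            update_target()
        if step % grad_period == 0:
            regress()

# e.g. DQN interleaves processes 1 and 3 every step with a slow process 2;
# fitted Q-iteration instead nests 3 inside 2 inside 1 as explicit loops.
```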
What about continuous actions? The problem is the max over actions:
• this max in the policy: π(s) = argmax_a Q_φ(s, a)
• this max in the target: y_j = r_j + γ max_{a'} Q_{φ'}(s'_j, a'), which is particularly problematic (inner loop of training)
How can we perform the max?
• gradient-based optimization (e.g., SGD) is a bit slow in the inner loop
• the action space is typically low-dimensional – what about stochastic optimization?
Q-learning with stochastic optimization
Simple solution: max_a Q(s, a) ≈ max{Q(s, a_1), …, Q(s, a_N)}, with the actions a_1, …, a_N sampled from some distribution (e.g., uniform); see the sketch below
+ dead simple
+ efficiently parallelizable
- not very accurate (more accurate variants, such as the cross-entropy method, iteratively refit the sampling distribution)
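A sketch of the sampled max, assuming a critic `q(states, actions)` that scores state-action pairs and a box-shaped action space; uniform sampling is the “dead simple” variant:

```python
import torch

def sampled_max_q(q, state, action_low, action_high, n_samples=64):
    """Approximate max_a Q(s, a) by the best of N uniformly sampled actions.
    state: (1, state_dim) tensor; action_low/high: (action_dim,) tensors."""
    action_dim = action_low.shape[0]
    # one batched forward pass evaluates all candidates, so this parallelizes well
    candidates = action_low + (action_high - action_low) * torch.rand(n_samples, action_dim)
    values = q(state.expand(n_samples, -1), candidates)   # shape (n_samples,)
    best = values.argmax()
    return values[best], candidates[best]   # approximate max and argmax
```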
NAF: Normalized Advantage Functions
• use a function class that is easy to maximize: Q_φ(s, a) = −½ (a − μ_φ(s))ᵀ P_φ(s) (a − μ_φ(s)) + V_φ(s)
• then argmax_a Q_φ(s, a) = μ_φ(s) and max_a Q_φ(s, a) = V_φ(s), so the max comes for free (sketch below)
+ no change to algorithm
+ just as efficient as Q-learning
- loses representational power
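A sketch of the NAF parameterization with hypothetical networks `mu`, `v`, and `log_p`; NAF uses a full positive-definite P(s), while a diagonal one keeps this sketch short:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 2   # illustrative sizes

mu = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))
v = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
log_p = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))

def naf_q(state, action):
    """Q(s, a) = -1/2 (a - mu(s))^T P(s) (a - mu(s)) + V(s), P(s) diagonal > 0."""
    diff = action - mu(state)
    p_diag = log_p(state).exp()                       # positive diagonal of P(s)
    advantage = -0.5 * (p_diag * diff * diff).sum(dim=-1, keepdim=True)
    return advantage + v(state)

# the max is closed-form: argmax_a Q(s, a) = mu(s) and max_a Q(s, a) = V(s),
# so targets y = r + gamma * V(s') need no inner-loop optimization
```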
• Multi-step Q-learning: replace the one-step target with an N-step return, y_t = Σ_{t'=t}^{t+N−1} γ^{t'−t} r_{t'} + γ^N max_a Q_{φ'}(s_{t+N}, a); this gives less biased targets when the Q-function is still inaccurate, but is only strictly correct on-policy (sketch below)
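A sketch of the N-step target for one stored sub-trajectory; `rewards` holds r_t, …, r_{t+N−1} as a tensor and `q_target` is a target network over discrete actions (both names are assumptions of this sketch):

```python
import torch

def n_step_target(rewards, final_state, q_target, gamma=0.99):
    """y_t = sum_{t'=t}^{t+N-1} gamma^(t'-t) * r_{t'} + gamma^N * max_a Q_phi'(s_{t+N}, a)"""
    N = rewards.shape[0]
    discounts = gamma ** torch.arange(N, dtype=torch.float32)
    bootstrap = q_target(final_state).max(dim=-1).values   # max_a Q_phi'(s_{t+N}, a)
    return (discounts * rewards).sum() + gamma ** N * bootstrap
```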