[Figure: Agent-environment interaction loop — at each time step t the agent observes state s_t and reward r_t and takes action a_t; the environment returns reward r_{t+1} and next state s_{t+1}.]
• Types of RL algorithms:
– Value-based – learn the optimal action-value function Q*(s, a)
• Derive the policy from Q*(s, a) – Q-Learning
– Policy-based – search directly for the optimal policy π*
• Q-Learning update rule (a runnable sketch follows below):
Q(s, a) ← Q(s, a) + α Δ(s, a, r, s'),  where  Δ(s, a, r, s') = r + γ max_a' Q(s', a') − Q(s, a)
• Agent interacts with the environment and obtains samples:
< current state, action, reward, next state > – < s, a, r, s' >
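As a concrete illustration, here is a minimal tabular Q-learning step in Python. The `env.reset`/`env.step` interface is an assumed Gym-style convention and the sizes are placeholders; none of these names come from the slides:

```python
import numpy as np

n_states, n_actions = 16, 4          # placeholder sizes for a small discrete task
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))  # tabular Q(s, a)

def q_learning_step(env, s):
    # epsilon-greedy: explore with probability epsilon, else act greedily
    if np.random.rand() < epsilon:
        a = np.random.randint(n_actions)
    else:
        a = int(np.argmax(Q[s]))
    s_next, r, done, _ = env.step(a)  # obtain a sample <s, a, r, s'>
    # TD error: delta(s, a, r, s') = r + gamma * max_a' Q(s', a') - Q(s, a)
    delta = r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a]
    Q[s, a] += alpha * delta          # Q(s, a) <- Q(s, a) + alpha * delta
    return s_next, done
```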
[Figure: Function approximator with parameter vector w — inputs s, a (and samples r, s' for training); output Q(s, a, w).]
• Use a neural network with stochastic gradient descent and back propagation, where:

error(s, a, r, s', w) = r + γ max_a' Q(s', a', w) − Q(s, a, w)

Mean Squared Error (MSE) = E[(r + γ max_a' Q(s', a', w) − Q(s, a, w))²]

• Weight update (the gradient ∇_w Q(s, a, w) is obtained with back propagation; a sketch follows):

w ← w + α [r + γ max_a' Q(s', a', w) − Q(s, a, w)] ∇_w Q(s, a, w)
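A minimal sketch of this semi-gradient update, assuming a linear approximator Q(s, a, w) = w[a]·s for brevity (with a neural network, ∇_w Q would come from back propagation instead; all names and sizes here are illustrative):

```python
import numpy as np

gamma, alpha = 0.99, 0.01
n_features, n_actions = 8, 4              # illustrative sizes
w = np.zeros((n_actions, n_features))     # parameter vector w, one row per action

def q(s, a):
    # Q(s, a, w) for a linear approximator: dot product of weights and features
    return w[a] @ s

def sgd_update(s, a, r, s_next):
    # error(s, a, r, s', w) = r + gamma * max_a' Q(s', a', w) - Q(s, a, w)
    target = r + gamma * max(q(s_next, a2) for a2 in range(n_actions))
    error = target - q(s, a)
    # For a linear Q, grad_w Q(s, a, w) is just the feature vector s (in row a);
    # a neural network would obtain this gradient via back propagation.
    w[a] += alpha * error * s             # w <- w + alpha * error * grad_w Q
```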
[Figure: Q-Learning with a neural network — the diagram's numbered stages 1-7 correspond to the steps below; the network used in stage 1 and in stage 2 is the same, and the weight delta Δw is applied through gradient descent.]

Initialize network weights
Repeat for each episode:
  Initialize s
  Repeat for each step of episode:
    Choose a from s using ε-greedy policy (*) – 1
    Take action a, observe r, s'
    Obtain current Q(s, a, w) and gradient ∇_w Q(s, a, w) – 2
    Calculate max next-state Q value, max_a' Q(s', a', w) – 3
    Calculate target – 4
    Calculate error – 5
    Calculate weights delta Δw – 6
    Update weights – retrain network – 7
    s ← s'
  Until s is terminal

(*) ε-greedy policy: with probability ε, select a random a; otherwise select argmax_a Q(s, a, w)
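A compact sketch of one episode of this loop in Python with PyTorch (a framework choice made here for illustration; the slides do not prescribe one). The Gym-style `env` interface, the network shape, and the hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

n_obs, n_actions = 4, 2                 # illustrative sizes
gamma, alpha, epsilon = 0.99, 1e-3, 0.1

# A single Q-network computes both Q(s, a, w) and max_a' Q(s', a', w)
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.SGD(q_net.parameters(), lr=alpha)

def run_episode(env):
    s, done = env.reset(), False          # assumed Gym-style interface
    while not done:
        # 1: choose a from s using the epsilon-greedy policy
        if torch.rand(1).item() < epsilon:
            a = torch.randint(n_actions, (1,)).item()
        else:
            with torch.no_grad():
                a = q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
        s_next, r, done, _ = env.step(a)  # take action a, observe r, s'
        # 2: current Q(s, a, w), with the gradient tracked for back propagation
        q_sa = q_net(torch.as_tensor(s, dtype=torch.float32))[a]
        # 3: max next-state Q value, max_a' Q(s', a', w), same network
        with torch.no_grad():
            q_next = q_net(torch.as_tensor(s_next, dtype=torch.float32)).max()
        # 4: target;  5: error (squared, so its gradient matches the update rule)
        target = r + gamma * q_next * (0.0 if done else 1.0)
        loss = (target - q_sa) ** 2
        # 6-7: weight delta via back propagation, then a gradient-descent update
        opt.zero_grad()
        loss.backward()
        opt.step()
        s = s_next
```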
[Figure: DQN — online network Q(s, a, w), separate target network Q̂(s', a', ŵ), and experience replay memory D; the target weights are reset ŵ ← w every C updates.]

Initialize replay memory D, network weights w, and target network weights ŵ = w
For episode = 1, M do
  For t = 1, T do
    Select action a using an ε-greedy policy (*) – 1
    Execute action a and observe next state s' and reward r
    Store transition (s, a, r, s') in D
    Sample random mini-batch of transitions (s, a, r, s') from D
    Obtain Q(s, a, w) and gradient ∇_w Q(s, a, w) – 2
    Calculate max_a' Q̂(s', a', ŵ) – 3
    Calculate mini-batch targets, y – 4
    Calculate error – 5
    Calculate weights delta Δw – 6
    Update weights w for network Q – retrain network, performing gradient descent – 7
    Every C updates reset Q̂ = Q (ŵ ← w) – 8
  End For
End For

(*) ε-greedy policy: with probability ε, select a random a; otherwise select argmax_a Q(s, a, w)
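A sketch of the DQN mini-batch update in the same PyTorch style, assuming a replay memory `buffer` that is being filled elsewhere with (s, a, r, s', done) tuples; all names and hyperparameters are illustrative:

```python
import copy
import random
import torch
import torch.nn as nn

n_obs, n_actions = 4, 2                  # illustrative sizes
gamma, batch_size, C = 0.99, 32, 1000

q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)        # Q-hat with target weights w-hat
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer, step = [], 0                     # replay memory D, update counter

def dqn_update():
    global step
    # Sample a random mini-batch of transitions (s, a, r, s', done) from D
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = (torch.as_tensor(x) for x in zip(*batch))
    # 2: Q(s, a, w) for the actions actually taken
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s2.float()).max(dim=1).values    # 3: max_a' Q-hat
        y = r.float() + gamma * q_next * (1 - done.float())  # 4: targets y
    loss = nn.functional.mse_loss(q_sa, y)                   # 5: error
    opt.zero_grad(); loss.backward(); opt.step()             # 6-7: update w
    step += 1
    if step % C == 0:                                        # 8: reset Q-hat = Q
        target_net.load_state_dict(q_net.state_dict())
```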
• Double DQN
– Addresses the DQN overestimation problem caused by the max_a' Q(s', a') operation in the updates
– Uses two separate networks: one to determine the maximizing action and the other to estimate its Q-value; the two roles alternate on each step (see the sketch below)
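The difference is easiest to see in the target computation. Here is a sketch contrasting the two targets; it follows the common deep variant in which the online network selects the action and the target network evaluates it (variable names and tensor shapes are illustrative assumptions):

```python
import torch

def dqn_target(r, s2, done, target_net, gamma=0.99):
    # Standard DQN: one network both selects and evaluates the maximizing
    # action, which biases the target upward (overestimation).
    with torch.no_grad():
        return r + gamma * target_net(s2).max(dim=1).values * (1 - done)

def double_dqn_target(r, s2, done, q_net, target_net, gamma=0.99):
    # Double DQN: one network picks the maximizing action, the other
    # estimates that action's Q-value.
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1, keepdim=True)        # action selection
        q_eval = target_net(s2).gather(1, a_star).squeeze(1)  # action evaluation
        return r + gamma * q_eval * (1 - done)
```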