Tutorial: Deep Reinforcement Learning
David Silver, Google DeepMind
Outline
Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models
Reinforcement Learning: AI = RL
RL is a general-purpose framework for artificial intelligence
We seek a single agent which can solve any human-level task
The essence of an intelligent agent
Powerful RL requires powerful representations
Outline
Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models
Deep Representations
A deep representation is a composition of many functions
x --w1--> h1 --w2--> ... --wn--> hn --wn+1--> y
Its gradient can be backpropagated by the chain rule
∂y/∂x = ∂y/∂h_n · ... · ∂h_2/∂h_1 · ∂h_1/∂x
and the gradient with respect to each weight follows the same chain:
∂y/∂w_k = ∂y/∂h_k · ∂h_k/∂w_k
Deep Neural Network
A deep neural network is typically composed of:
Linear transformations: h_{k+1} = W h_k
Non-linear activation functions: h_{k+2} = f(h_{k+1})
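A minimal numpy sketch of such a composition (layer sizes, weights, and the ReLU choice are illustrative, not from the tutorial):

```python
import numpy as np

def relu(x):
    # non-linear activation f, applied elementwise
    return np.maximum(0.0, x)

def forward(x, weights):
    """Alternate linear transformations h_{k+1} = W h_k with
    non-linear activations h_{k+2} = f(h_{k+1})."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h   # final linear layer produces the output y

# Example: a 3-layer network mapping a 4-dim input to a scalar output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]
y = forward(rng.normal(size=4), weights)
```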
Weight Sharing
Recurrent neural network shares weights between timesteps
h_0 --w--> h_1 --w--> h_2 --w--> ... --w--> h_n
where each h_t is computed from h_{t-1} and input x_t and emits output y_t, reusing the same weights at every timestep
Convolutional neural network shares weights between local regions
[Figure: each hidden unit h_1, h_2 is computed from its local input region using the same shared weights w_1, w_2]
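A small sketch of recurrent weight sharing, assuming a plain tanh RNN cell; the same matrices are reused at every timestep, so the parameter count is independent of sequence length:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, h0):
    # The SAME weights (W_h, W_x) are applied at every timestep
    h, hs = h0, []
    for x in xs:                         # x_1, x_2, ..., x_n
        h = np.tanh(W_h @ h + W_x @ x)   # h_t depends on h_{t-1} and x_t
        hs.append(h)
    return hs                            # one hidden state (and hence output) per step

rng = np.random.default_rng(0)
hs = rnn_forward([rng.normal(size=3) for _ in range(5)],
                 rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4))
```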
Loss Function
A loss function l(y) measures the goodness of output y, e.g.
Mean-squared error: l(y) = ||y* - y||²
Log likelihood: l(y) = log P[y* | x]
The loss is appended to the forward computation
x --w1--> h1 --w2--> ... --wn--> hn --wn+1--> y --> l(y)
Gradient of loss is appended to the backward computation
∂l(y)/∂x = ∂l(y)/∂y · ∂y/∂h_n · ... · ∂h_2/∂h_1 · ∂h_1/∂x
and similarly ∂l(y)/∂w_k for every weight w_k
Stochastic Gradient Descent
!"#$%"&'(%")*+,#''!"#$%%" (%")*+,#'.+/0+,#
I
I
Minimise expected loss L(w ) = Ex [l(y )]
y !"#$%&'('%$&#()&*+$,*$#&&&$$$$."'%"$
Follow
the gradient of L(w )
'*$%/0'*,('*$.'("$("#$1)*%('*$
l(y )
,22&3'/,(&
%,*$0#$)+#4$($
(1)
w.
l(y )
L(w )
%&#,(#$,*$#&&&$1)*%('*$$$$$$$$$$$$$
.
= Ex
= Ex
.
w
l(y )
w (k)
y !"#$2,&(',5$4'11#&#*(',5$1$("'+$#&&&$
Adjust w in direction of ve gradient
1)*%('*$$$$$$$$$$$$$$$$6$("#$7&,4'#*($
l(y )
%,*$*.$0#$)+#4$($)24,(#$("#$
w =
2 w
'*(#&*,5$8,&',05#+$'*$("#$1)*%('*$
where ,22&3'/,(&
is a stepsize9,*4$%&'('%:;$$$$$$
parameter
<&,4'#*($4#+%#*($=>?
Deep Supervised Learning
Deep neural networks have achieved remarkable success
Simple ingredients solve supervised learning problems:
   Use a deep network as a function approximator
   Define a loss function
   Optimise parameters end-to-end by SGD
Scales well with memory/data/computation
Solves the representation learning problem
State-of-the-art for images, audio, language, ...
Can we follow the same recipe for RL?
Outline
Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models
Policies and Value Functions
Policy π is a behaviour function selecting actions given states:
a = π(s)
Value function Q^π(s, a) is the expected total reward from state s and action a under policy π:
Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a]
It answers: how good is action a in state s?
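For intuition, a single Monte-Carlo sample of the quantity inside that expectation can be computed from a reward sequence as below (the rewards and γ = 0.9 are made-up numbers):

```python
def discounted_return(rewards, gamma=0.9):
    """One sample of r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    Q^pi(s, a) is the expectation of this over trajectories starting
    with (s, a) and then following policy pi."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))   # 1 + 0.9*0 + 0.81*2 = 2.62
```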
Approaches To Reinforcement Learning
Policy-based RL
   Search directly for the optimal policy π*
   This is the policy achieving maximum future reward
Value-based RL
   Estimate the optimal value function Q*(s, a)
   This is the maximum value achievable under any policy
Model-based RL
   Build a transition model of the environment
   Plan (e.g. by lookahead) using the model
Deep Reinforcement Learning
Can we apply deep learning to RL?
Use deep network to represent value function / policy / model
Optimise the value function / policy / model end-to-end
Using stochastic gradient descent
Outline
Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models
Bellman Equation
Bellman expectation equation unrolls the value function Q^π:
Q^π(s, a) = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a]
          = E_{s', a'}[r + γ Q^π(s', a') | s, a]
Bellman optimality equation unrolls the optimal value function Q*:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]
Policy iteration algorithms solve the Bellman expectation equation:
Q_{i+1}(s, a) = E_{s', a'}[r + γ Q_i(s', a') | s, a]
Value iteration algorithms solve the Bellman optimality equation:
Q_{i+1}(s, a) = E_{s'}[r + γ max_{a'} Q_i(s', a') | s, a]
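A tabular sketch of value iteration on the Bellman optimality equation, assuming the transition probabilities P[s, a, s'] and expected rewards R[s, a] are known; the tiny MDP below is invented purely to make the code run:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iters=100):
    """Q_{i+1}(s, a) = E_{s'}[ r + gamma * max_{a'} Q_i(s', a') ]"""
    Q = np.zeros_like(R)
    for _ in range(n_iters):
        Q = R + gamma * P @ Q.max(axis=1)   # one backup for every (s, a)
    return Q

# 2 states, 2 actions (hypothetical numbers)
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.0, 1.0]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[s, a]
print(q_value_iteration(P, R))
```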
Policy Iteration with Non-Linear Sarsa
Represent the value function by a Q-network with weights w:
Q(s, a, w) ≈ Q^π(s, a)
Define the objective function by the mean-squared error in Q-values:
L(w) = E[(r + γ Q(s', a', w) - Q(s, a, w))²]
where the term r + γ Q(s', a', w) is the target
Leading to the following Sarsa gradient:
∂L(w)/∂w = E[(r + γ Q(s', a', w) - Q(s, a, w)) ∂Q(s, a, w)/∂w]
Optimise the objective end-to-end by SGD, using ∂L(w)/∂w
Value Iteration with Non-Linear Q-Learning
Represent the value function by a deep Q-network with weights w:
Q(s, a, w) ≈ Q*(s, a)
Define the objective function by the mean-squared error in Q-values:
L(w) = E[(r + γ max_{a'} Q(s', a', w) - Q(s, a, w))²]
where the term r + γ max_{a'} Q(s', a', w) is the target
Leading to the following Q-learning gradient:
∂L(w)/∂w = E[(r + γ max_{a'} Q(s', a', w) - Q(s, a, w)) ∂Q(s, a, w)/∂w]
Optimise the objective end-to-end by SGD, using ∂L(w)/∂w
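To make the gradient concrete without a deep network, the sketch below does one semi-gradient Q-learning update with a linear approximator, where ∂Q(s, a, w)/∂w is just the feature vector; features, rewards, and step-size are illustrative:

```python
import numpy as np

def q_learning_step(w, phi, phi_next, a, r, gamma=0.99, alpha=0.01):
    """Q(s, a, w) = w[a] . phi(s), so dQ/dw[a] = phi(s).
    Update: w[a] <- w[a] + alpha * (r + gamma * max_a' Q(s', a', w) - Q(s, a, w)) * phi(s)"""
    td_error = r + gamma * max(w[b] @ phi_next for b in range(len(w))) - w[a] @ phi
    w[a] += alpha * td_error * phi
    return w

rng = np.random.default_rng(0)
w = np.zeros((2, 4))                        # 2 actions, 4 state features
w = q_learning_step(w, rng.normal(size=4), rng.normal(size=4), a=1, r=1.0)
```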
Example: TD-Gammon
[Figure: backgammon board position s is fed into a neural network with weights w, which outputs the value V(s, w)]
Self-Play Non-Linear Sarsa
Initialised with random weights
Trained by games of self-play
Using non-linear Sarsa with an afterstate value function:
Q(s, a, w) = E[V(s', w)]
Greedy policy improvement (no exploration)
Algorithm converged in practice (not true for other games)
TD-Gammon defeated world champion Luigi Villa 7-1
(Tesauro, 1992)
New TD-Gammon Results
Stability Issues with Deep RL
Naive Q-learning oscillates or diverges with neural nets
1. Data is sequential
   Successive samples are correlated, non-iid
2. Policy changes rapidly with slight changes to Q-values
   Policy may oscillate
   Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
   Naive Q-learning gradients can be large and unstable when backpropagated
Deep Q-Networks
DQN provides a stable solution to deep value-based RL
1. Use experience replay
   Break correlations in data, bring us back to an iid setting
   Learn from all past policies
   Using off-policy Q-learning
2. Freeze target Q-network
   Avoid oscillations
   Break correlations between Q-network and target
3. Clip rewards or normalise network adaptively to a sensible range
   Robust gradients
Stable Deep RL (1): Experience Replay
To remove correlations, build a dataset from the agent's own experience
Take action a_t according to an ε-greedy policy
Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample a random minibatch of transitions (s, a, r, s') from D
Optimise the MSE between the Q-network and the Q-learning targets, e.g.
L(w) = E_{s,a,r,s' ~ D}[(r + γ max_{a'} Q(s', a', w) - Q(s, a, w))²]
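A minimal replay-memory sketch; the capacity and batch size are arbitrary illustrative choices, not the values used in DQN:

```python
import random
from collections import deque

class ReplayMemory:
    """Store transitions (s, a, r, s') as the agent acts; sample random
    minibatches later to break temporal correlations in the training data."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off the end

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform sampling over all stored (hence all past policies')
        # transitions, approximately i.i.d.
        return random.sample(self.buffer, batch_size)
```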
Stable Deep RL (2): Fixed Target Q-Network
To avoid oscillations, fix the parameters used in the Q-learning target
Compute Q-learning targets w.r.t. old, fixed parameters w⁻:
r + γ max_{a'} Q(s', a', w⁻)
Optimise the MSE between the Q-network and the Q-learning targets:
L(w) = E_{s,a,r,s' ~ D}[(r + γ max_{a'} Q(s', a', w⁻) - Q(s, a, w))²]
Periodically update the fixed parameters: w⁻ ← w
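A sketch of one gradient step against a frozen target network, written in PyTorch on a synthetic minibatch; the tiny fully-connected Q-network and hyperparameters are stand-ins, not the DQN architecture:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(q_net)          # holds the old, fixed parameters w^-
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def dqn_step(s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                  # targets use w^-; no gradient flows back
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Synthetic minibatch of 32 transitions, just to exercise the update
s, s_next = torch.randn(32, 4), torch.randn(32, 4)
a, r, done = torch.randint(0, 2, (32,)), torch.randn(32), torch.zeros(32)
dqn_step(s, a, r, s_next, done)

target_net.load_state_dict(q_net.state_dict())   # periodically: w^- <- w
```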
Reinforcement Learning in Atari
[Figure: agent-environment loop; at each time-step the agent observes state s_t, selects action a_t, and receives reward r_t]
DQN in Atari
End-to-end learning of values Q(s, a) from pixels s
Input state s is a stack of raw pixels from the last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is the change in score for that step
Network architecture and hyperparameters fixed across all games
[Mnih et al.]
DQN Results in Atari
DQN Demo
How much does DQN help?
Game             Q-learning   + Target Q   + Replay   + Replay + Target Q
Breakout                  3           10        241                   317
Enduro                   29          142        831                  1006
River Raid             1453         2868       4103                  7447
Seaquest                276         1003        823                  2894
Space Invaders          302          373        826                  1089
Stable Deep RL (3): Reward/Value Range
DQN clips the rewards to [-1, +1]
This prevents Q-values from becoming too large
Ensures gradients are well-conditioned
But it can't tell the difference between small and large rewards
Better approach: normalise the network output,
e.g. via batch normalisation
Demo: Normalized DQN in Pac-Man
Outline
Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models
Policy Gradient for Continuous Actions
Represent the policy by a deep network a = π(s, u) with weights u
Define the objective function as the total discounted reward:
J(u) = E[r_1 + γ r_2 + γ² r_3 + ...]
Optimise the objective end-to-end by SGD,
i.e. adjust the policy parameters u to achieve more reward
Deterministic Policy Gradient
The gradient of the policy objective is given by
∂J(u)/∂u = E_s[∂Q^π(s, a)/∂u]
         = E_s[∂Q^π(s, a)/∂a · ∂π(s, u)/∂u]
The policy gradient is the direction that most improves Q
Deterministic Actor-Critic
Use two networks:
Actor is a policy π(s, u) with parameters u:
s --u_1--> ... --u_n--> a
Critic is a value function Q(s, a, w) with parameters w:
s, a --w_1--> ... --w_n--> Q
The critic provides the loss function for the actor
The gradient backpropagates from the critic into the actor: ∂Q/∂a is passed back through the actor network to update u
Deterministic Actor-Critic: Learning Rule
Critic estimates the value of the current policy by Q-learning:
∂L(w)/∂w = E[(r + γ Q(s', π(s'), w) - Q(s, a, w)) ∂Q(s, a, w)/∂w]
Actor updates the policy in the direction that improves Q:
∂J(u)/∂u = E_s[∂Q(s, a, w)/∂a · ∂π(s, u)/∂u]
Deterministic Deep Policy Gradient (DDPG)
Naive actor-critic oscillates or diverges with neural nets
DDPG provides a stable solution:
1. Use experience replay for both actor and critic
2. Freeze target networks to avoid oscillations
∂L(w)/∂w = E_{s,a,r,s' ~ D}[(r + γ Q(s', π(s', u⁻), w⁻) - Q(s, a, w)) ∂Q(s, a, w)/∂w]
∂J(u)/∂u = E_{s,a,r,s' ~ D}[∂Q(s, a, w)/∂a · ∂π(s, u)/∂u]
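A compressed PyTorch sketch of one DDPG-style update, with small fully-connected actor and critic networks standing in for the convnets; following Lillicrap et al., the target networks here track the learned networks with a slow "soft" update. Shapes, learning rates, and the synthetic minibatch are illustrative:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_step(s, a, r, s_next, gamma=0.99, tau=0.001):
    # Critic: Q-learning against the frozen target networks (u^-, w^-)
    with torch.no_grad():
        q_next = critic_target(torch.cat([s_next, actor_target(s_next)], 1)).squeeze(1)
        target = r + gamma * q_next
    critic_loss = F.mse_loss(critic(torch.cat([s, a], 1)).squeeze(1), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: follow dQ/da * dpi/du by maximising Q(s, pi(s, u), w)
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update: target parameters slowly track the learned parameters
    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)

# Synthetic minibatch of 64 transitions (4-dim states, 1-dim actions)
s, a = torch.randn(64, 4), torch.randn(64, 1)
r, s_next = torch.randn(64), torch.randn(64, 4)
ddpg_step(s, a, r, s_next)
```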
DDPG for Continuous Control
End-to-end learning of the control policy from raw pixels s
Input state s is a stack of raw pixels from the last 4 frames
Two separate convnets are used for Q and π
Physics are simulated in MuJoCo
[Figure: the actor convnet π(s) outputs action a; the critic convnet Q(s, a) evaluates it]
[Lillicrap et al.]
DDPG Demo
Outline
Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models
Model-Based RL
Learn a transition model of the environment:
p(r, s' | s, a)
Plan using the transition model,
e.g. lookahead using the transition model to find optimal actions
[Figure: lookahead search tree branching over actions (left/right) at each step]
Deep Models
Represent the transition model p(r, s' | s, a) by a deep network
Define an objective function measuring the goodness of the model,
e.g. the number of bits to reconstruct the next state
Optimise the objective by SGD
(Gregor et al.)
DARN Demo
Challenges of Model-Based RL
Compounding errors
Errors in the transition model compound over the trajectory
By the end of a long trajectory, rewards can be totally wrong
Model-based RL has failed (so far) in Atari
Deep networks of value/policy can plan implicitly:
Each layer of the network performs an arbitrary computational step
An n-layer network can look ahead n steps
Are transition models required at all?
Deep Learning in Go
Monte-Carlo search
Monte-Carlo search (MCTS) simulates future trajectories
Builds a large lookahead search tree with millions of positions
State-of-the-art 19×19 Go programs use MCTS
e.g. the first strong Go program, MoGo
(Gelly et al.)
Convolutional Networks
12-layer convnet trained to predict expert moves
Raw convnet (looking at 1 position, no search at all)
Equals the performance of MoGo with a 10^5-position search tree
(Maddison et al.)
Program                   Accuracy
Human 6-dan               52%
12-Layer ConvNet          55%
8-Layer ConvNet*          44%
Prior state-of-the-art    31-39%
*Clarke & Storkey

Program                   Winning rate
GnuGo                     97%
MoGo (100k)               46%
Pachi (10k)               47%
Pachi (100k)              11%
Conclusion
RL provides a general-purpose framework for AI
RL problems can be solved by end-to-end deep learning
A single agent can now solve many challenging tasks
Reinforcement learning + deep learning = AI
Questions?
"The only stupid question is the one you never asked" (Rich Sutton)