
Tutorial: Deep Reinforcement Learning

David Silver, Google DeepMind

Outline

Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models

Reinforcement Learning: AI = RL

Powerful RL requires powerful representations

Outline

Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models

Deep Representations

[Figure: a deep representation is a composition of many functions,
x → h_1 → ... → h_n → y, with weights w_1, ..., w_{n+1}; the gradients of
each stage are chained together in the backward computation.]

Deep Neural Network

A deep neural network is typically composed of:

- Linear transformations: h_{k+1} = W h_k
- Non-linear activation functions: h_{k+2} = f(h_{k+1})
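This composition can be sketched in a few lines of NumPy; a minimal illustration, where the layer sizes and the choice of ReLU activation are assumptions made here, not taken from the slides:

```python
import numpy as np

def relu(x):
    # Non-linear activation function f
    return np.maximum(0.0, x)

def forward(x, weights):
    """Forward pass: alternate linear maps h_{k+1} = W h_k
    with non-linearities h_{k+2} = f(h_{k+1})."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)        # linear transformation + activation
    return weights[-1] @ h     # final linear layer produces the output y

rng = np.random.default_rng(0)
# Illustrative sizes: 4-dim input, two hidden layers of width 8, scalar output
weights = [rng.normal(size=(8, 4)),
           rng.normal(size=(8, 8)),
           rng.normal(size=(1, 8))]
y = forward(rng.normal(size=4), weights)
```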

Weight Sharing
Recurrent neural network shares weights between time-steps:

[Figure: an RNN unrolled through time. Inputs x_1, x_2, ..., x_n feed hidden
states h_0 → h_1 → ... → h_n, each hidden state emitting an output y_t; the
same weights are reused at every time-step.]
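The weight sharing can be made concrete with a toy unrolled RNN; the tanh activation and the dimensions below are illustrative assumptions:

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, W_y):
    """Unroll a recurrent net: the SAME weights (W_h, W_x, W_y)
    are applied at every time-step."""
    h = np.zeros(W_h.shape[0])
    ys = []
    for x in xs:                        # inputs x_1, x_2, ..., x_n
        h = np.tanh(W_h @ h + W_x @ x)  # h_t depends on h_{t-1} and x_t
        ys.append(W_y @ h)              # output y_t read out from h_t
    return ys

rng = np.random.default_rng(0)
W_h = rng.normal(size=(5, 5))   # shared hidden-to-hidden weights
W_x = rng.normal(size=(5, 3))   # shared input-to-hidden weights
W_y = rng.normal(size=(2, 5))   # shared hidden-to-output weights
ys = rnn_forward([rng.normal(size=3) for _ in range(4)], W_h, W_x, W_y)
```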

Loss Function

- Mean-squared error: l(y) = ||y* − y||²
- Log likelihood: l(y) = log P[y* | x]

[Figure: the loss l(y) is appended to the forward computation,
x → h_1 → ... → h_n → y → l(y).]

Gradient of loss is appended to the backward computation

!"#\$%"&'(%")*+,#'-'!"#\$%%" (%")*+,#'.+/0+,#
I
I

Minimise expected loss L(w ) = Ex [l(y )]

y !"#\$%&'('%\$&#()&*+\$,*\$#&&-&\$\$\$\$."'%"\$
Follow
the gradient of L(w )

'*\$%-/0'*,('-*\$.'("\$("#\$1)*%('-*\$

l(y )
,22&-3'/,(-&

 %,*\$0#\$)+#4\$(-\$
(1)
w.
l(y )
L(w )
%&#,(#\$,*\$#&&-&\$1)*%('-*\$\$\$\$\$\$\$\$\$\$\$\$\$
.
= Ex
= Ex
.
w

l(y )
w (k)

y !"#\$2,&(',5\$4'11#&#*(',5\$-1\$("'+\$#&&-&\$
Adjust w in direction of -ve gradient

1)*%('-*\$\$\$\$\$\$\$\$\$\$\$\$\$\$\$\$6\$("#\$7&,4'#*(\$
l(y )
%,*\$*-.\$0#\$)+#4\$(-\$)24,(#\$("#\$
w =
2 w
'*(#&*,5\$8,&',05#+\$'*\$("#\$1)*%('-*\$
where ,22&-3'/,(-&
is a step-size9,*4\$%&'('%:;\$\$\$\$\$\$
parameter

<&,4'#*(\$4#+%#*(\$=>?
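The gradient-descent recipe above can be sketched end to end; a minimal example, assuming a linear model and squared-error loss (both illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

def sample():
    # Draw an input x and a noisy target y* from an illustrative linear model
    x = rng.normal(size=2)
    return x, w_true @ x + 0.01 * rng.normal()

w = np.zeros(2)
alpha = 0.1                        # step-size parameter
for _ in range(2000):
    x, y_star = sample()
    y = w @ x                      # forward pass
    grad = 2 * (y - y_star) * x    # d l(y)/dw for l(y) = ||y* - y||^2
    w -= (alpha / 2) * grad        # adjust w in direction of -ve gradient
```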

- Deep neural networks have achieved remarkable success
- Simple ingredients solve supervised learning problems:
  - Use deep network as a function approximator
  - Define loss function
  - Optimise parameters end-to-end by SGD

Can we follow the same recipe for RL?

Outline

Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models

Policies and Value Functions

- Policy π is a behaviour function selecting actions given states: a = π(s)
- Value function Q^π(s, a) is expected total reward
  from state s and action a under policy π:

    Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a ]

  "How good is action a in state s?"

Approaches to Reinforcement Learning

- Policy-based RL: search directly for the optimal policy π*
- Value-based RL: estimate the optimal value function Q*(s, a)
- Model-based RL: build a model of the environment and plan using it

Deep RL represents the value function, policy, or model by a deep neural
network, and optimises its loss function using stochastic gradient descent.

Outline

Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models


Bellman Equation

- Bellman expectation equation unrolls the value function Q^π:

    Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a ]
              = E_{s',a'}[ r + γ Q^π(s', a') | s, a ]

- Bellman optimality equation unrolls the optimal value function Q*:

    Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

- Policy iteration algorithms solve the Bellman expectation equation:

    Q_{i+1}(s, a) = E_{s',a'}[ r + γ Q_i(s', a') | s, a ]

- Value iteration algorithms solve the Bellman optimality equation:

    Q_{i+1}(s, a) = E_{s'}[ r + γ max_{a'} Q_i(s', a') | s, a ]
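The value-iteration backup can be run directly on a small tabular problem; the two-state MDP below is invented purely for illustration:

```python
import numpy as np

# Tiny deterministic MDP (illustrative): 2 states, 2 actions.
# Taking action a in state s moves to next_state[s, a] with reward[s, a].
next_state = np.array([[0, 1], [0, 1]])
reward = np.array([[0.0, 0.0], [1.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(200):
    # Bellman optimality backup: Q_{i+1}(s,a) = r + gamma * max_{a'} Q_i(s',a')
    V = Q.max(axis=1)                  # greedy value of each state
    Q = reward + gamma * V[next_state]

# Optimal behaviour: stay in state 1 taking action 1, earning reward 2
# each step, so Q*(1,1) = 2 / (1 - 0.9) = 20
```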


Q-Networks

- Represent value function by Q-network with weights w:

    Q(s, a, w) ≈ Q^π(s, a)

- Define objective function by mean-squared error in Q-values:

    L(w) = E[ ( r + γ Q(s', a', w) − Q(s, a, w) )² ]

  where r + γ Q(s', a', w) is the target

- Leading to the following gradient:

    ∂L(w)/∂w = E[ ( r + γ Q(s', a', w) − Q(s, a, w) ) ∂Q(s, a, w)/∂w ]

Deep Q-Learning

- Represent value function by deep Q-network with weights w:

    Q(s, a, w) ≈ Q*(s, a)

- Define objective function by mean-squared error in Q-values:

    L(w) = E[ ( r + γ max_{a'} Q(s', a', w) − Q(s, a, w) )² ]

  where r + γ max_{a'} Q(s', a', w) is the target

- Leading to the following Q-learning gradient:

    ∂L(w)/∂w = E[ ( r + γ max_{a'} Q(s', a', w) − Q(s, a, w) ) ∂Q(s, a, w)/∂w ]

- Optimise objective end-to-end by SGD, using ∂L(w)/∂w
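One semi-gradient Q-learning step can be sketched with a linear function approximator standing in for the deep network; the feature representation and step sizes below are illustrative assumptions:

```python
import numpy as np

def q_value(w, phi):
    # Linear function approximator: Q(s, a, w) = w . phi(s, a)
    return w @ phi

def q_learning_step(w, phi_sa, r, next_phis, gamma=0.9, alpha=0.5):
    """One semi-gradient Q-learning update:
    delta = r + gamma * max_a' Q(s',a',w) - Q(s,a,w)   (the TD error)
    w    += alpha * delta * dQ(s,a,w)/dw               (dQ/dw = phi here)"""
    target = r + gamma * max(q_value(w, p) for p in next_phis)
    delta = target - q_value(w, phi_sa)
    return w + alpha * delta * phi_sa

# Terminal-style transition: reward 1, zero next-state features,
# so Q(s, a, w) should approach 1 under repeated updates
w = np.zeros(2)
phi_sa = np.array([1.0, 0.0])
for _ in range(50):
    w = q_learning_step(w, phi_sa, r=1.0, next_phis=[np.zeros(2)])
```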

Example: TD-Gammon

[Figure: a backgammon board; the network estimates an afterstate value
V(s, w).]

Using non-linear Sarsa with afterstate value function:

    Q(s, a, w) = E[ V(s', w) ]

(Tesauro, 1992)

Stability Issues with Deep RL

Naive Q-learning oscillates or diverges with neural nets

1. Data is sequential
   - Successive samples are correlated, non-iid
2. Policy changes rapidly with slight changes to Q-values
   - Policy may oscillate
   - Distribution of data can swing from one extreme to another
3. Scale of rewards and Q-values is unknown
   - Naive Q-learning gradients can be large and unstable when
     backpropagated

Deep Q-Networks

DQN provides a stable solution to deep value-based RL

1. Use experience replay
   - Break correlations in data, bring us back to iid setting
   - Learn from all past policies
   - Using off-policy Q-learning
2. Freeze target Q-network
   - Avoid oscillations
   - Break correlations between Q-network and target

Optimise MSE between Q-network and Q-learning targets, e.g.

    L(w) = E_{s,a,r,s'~D}[ ( r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w) )² ]

where D is a replay memory of past transitions and w⁻ are the frozen
target-network weights.


[Figure: agent-environment loop: at each step the agent observes state s_t,
takes action a_t, and receives reward r_t.]
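The two DQN stabilisers can be sketched together; a minimal illustration using a linear Q-function as a stand-in for the deep network, where the class, its parameters, and the toy transitions are all hypothetical:

```python
import random
from collections import deque

import numpy as np

class DQNSketch:
    """Minimal sketch of DQN's two stabilisers (illustrative, linear Q):
    (1) experience replay, (2) a frozen target network w_minus."""

    def __init__(self, n_features, n_actions, gamma=0.99, alpha=0.05):
        self.w = np.zeros((n_actions, n_features))   # online Q-network
        self.w_minus = self.w.copy()                 # frozen target network
        self.buffer = deque(maxlen=10_000)           # replay memory D
        self.gamma, self.alpha = gamma, alpha

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def train_step(self, batch_size=32):
        # Sample transitions at random from D, breaking temporal correlations
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        for s, a, r, s_next in batch:
            # Q-learning target uses the FROZEN weights w_minus
            target = r + self.gamma * (self.w_minus @ s_next).max()
            delta = target - self.w[a] @ s
            self.w[a] += self.alpha * delta * s      # semi-gradient step

    def update_target(self):
        # Periodically copy online weights into the frozen target
        self.w_minus = self.w.copy()

# Usage: a two-state toy where action 0 in state s0 yields reward 1
agent = DQNSketch(n_features=2, n_actions=2)
s0, s1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
agent.store(s0, 0, 1.0, s1)
agent.store(s1, 0, 0.0, s1)
for i in range(300):
    agent.train_step()
    if i % 10 == 0:
        agent.update_target()
```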

DQN in Atari

- End-to-end learning of values Q(s, a) from pixels s
- Input state s is stack of raw pixels from last 4 frames
- Output is Q(s, a) for each joystick/button position
- Reward is change in score for that step

[Mnih et al.]

DQN Demo

How much does DQN help?

Game             Q-learning   Q-learning   Q-learning   Q-learning
                              + Target Q   + Replay     + Replay
                                                        + Target Q
Breakout               3           10           241          317
Enduro                29          142           831         1006
River Raid          1453         2868          4103         7447
Seaquest             276         1003           823         2894
Space Invaders       302          373           826         1089

Demo: Normalized DQN in PacMan

Outline

Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models

Deep Policy Networks

- Represent policy by deep network a = π(s, u) with weights u
- Define objective function as total discounted reward:

    J(u) = E[ r_1 + γ r_2 + γ² r_3 + ... ]

- The gradient of the policy is given by:

    ∂J(u)/∂u = E_s[ ∂Q^π(s, a)/∂u ]
             = E_s[ ∂Q^π(s, a)/∂a · ∂π(s, u)/∂u ]

- Policy gradient is the direction that most improves Q
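The chain rule behind this gradient can be checked on a toy problem; the quadratic critic and linear policy below are invented purely for illustration:

```python
# Illustrative check of the deterministic policy-gradient chain rule:
# dJ/du = dQ/da * dpi/du, for a linear policy a = pi(s, u) = u * s
# and a known critic Q(s, a) = -(a - 2s)^2 (maximised at a = 2s).

def policy(s, u):
    return u * s

def dq_da(s, a):
    # Derivative of Q(s, a) = -(a - 2s)^2 with respect to the action a
    return -2.0 * (a - 2.0 * s)

u = 0.0
s = 1.0
for _ in range(100):
    a = policy(s, u)
    grad_J = dq_da(s, a) * s      # chain rule: dQ/da * dpi/du (dpi/du = s)
    u += 0.1 * grad_J             # ascend the policy gradient
# u converges towards 2, where Q is maximised
```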

Deterministic Actor-Critic

Use two networks:

- An actor network a = π(s, u) with weights u
- A critic network Q(s, a, w) with weights w

[Figure: the actor and critic networks, with forward passes
s → π(s, u) → a and (s, a) → Q(s, a, w), and the corresponding backward
passes for their gradients.]

- Critic estimates value of current policy by Q-learning:

    ∂L(w)/∂w = E[ ( r + γ Q(s', π(s'), w) − Q(s, a, w) ) ∂Q(s, a, w)/∂w ]

- Actor updates policy in direction that improves Q:

    ∂J(u)/∂u = E_s[ ∂Q(s, a, w)/∂a · ∂π(s, u)/∂u ]

Deep Deterministic Policy Gradient (DDPG)

1. Use experience replay for both actor and critic
2. Freeze target network to avoid oscillations

    ∂L(w)/∂w = E_{s,a,r,s'~D}[ ( r + γ Q(s', π(s', u⁻), w⁻) − Q(s, a, w) )
                               ∂Q(s, a, w)/∂w ]

    ∂J(u)/∂u = E_{s,a,r,s'~D}[ ∂Q(s, a, w)/∂a · ∂π(s, u)/∂u ]

where u⁻, w⁻ are the frozen target-network weights.
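The two updates can be sketched together in a toy setting; a rough illustration with a linear actor and critic in one dimension, where the class, parameters, and feature choices are all hypothetical:

```python
import numpy as np

class DDPGSketch:
    """Illustrative DDPG-style updates with a linear actor and critic and
    frozen target networks (hypothetical minimal 1-D state/action setting)."""

    def __init__(self, gamma=0.9, alpha=0.05):
        self.w = np.zeros(3)          # critic weights: Q = w . [s, a, 1]
        self.u = 0.0                  # actor weight:  pi(s) = u * s
        self.w_t = self.w.copy()      # frozen target critic
        self.u_t = self.u             # frozen target actor
        self.gamma, self.alpha = gamma, alpha

    def critic_update(self, s, a, r, s_next):
        # Target uses the frozen actor and critic, as in the equations above
        a_next = self.u_t * s_next
        target = r + self.gamma * (self.w_t @ np.array([s_next, a_next, 1.0]))
        feat = np.array([s, a, 1.0])
        delta = target - self.w @ feat
        self.w += self.alpha * delta * feat   # dQ/dw = feat for a linear Q

    def actor_update(self, s):
        dq_da = self.w[1]                     # dQ/da for this linear critic
        self.u += self.alpha * dq_da * s      # chain rule: dQ/da * dpi/du

    def update_targets(self):
        self.w_t, self.u_t = self.w.copy(), self.u

# Usage: a single rewarding transition nudges the critic up, and the actor
# then follows the critic's action-gradient
agent = DDPGSketch()
agent.critic_update(s=1.0, a=1.0, r=1.0, s_next=0.0)
agent.actor_update(s=1.0)
```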

DDPG in Simulated Physics

- End-to-end learning of control policy from raw pixels s
- Input state s is stack of raw pixels from last 4 frames
- Two separate convnets are used for Q and π
- Physics are simulated in MuJoCo

[Figure: convnet architectures for Q(s, a) and π(s).]

[Lillicrap et al.]

DDPG Demo

Outline

Deep Learning
Reinforcement Learning
Deep Value Functions
Deep Policies
Deep Models

Model-Based RL

- Learn a transition model of the environment:

    p(r, s' | s, a)

- Plan using the transition model

[Figure: a lookahead search tree, branching over actions (left, right) at
each step.]
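A tabular sketch of this idea: estimate the transition model from experience, then plan on the learned model with value iteration. The two-state MDP below is invented for illustration:

```python
import numpy as np

counts = np.zeros((2, 2, 2))     # counts[s, a, s'] of observed transitions
reward_sum = np.zeros((2, 2))    # summed rewards per (s, a)
n_sa = np.zeros((2, 2))          # visit counts per (s, a)

def observe(s, a, r, s_next):
    counts[s, a, s_next] += 1
    reward_sum[s, a] += r
    n_sa[s, a] += 1

# True dynamics (hidden from the planner): action a moves to state a;
# reward 1 only for taking action 1 in state 1.
rng = np.random.default_rng(0)
s = 0
for _ in range(500):
    a = int(rng.integers(2))
    s_next = a
    observe(s, a, float(s == 1 and a == 1), s_next)
    s = s_next

p = counts / n_sa[:, :, None]    # learned model p(s' | s, a)
r_hat = reward_sum / n_sa        # learned expected reward r(s, a)
gamma = 0.9

# Plan on the LEARNED model with value iteration
Q = np.zeros((2, 2))
for _ in range(200):
    V = Q.max(axis=1)
    Q = r_hat + gamma * (p * V).sum(axis=2)
```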

Deep Models

- Represent the transition model by a deep network
- Optimise objective by SGD

(Gregor et al.)

DARN Demo

Challenges of Model-Based RL

- Compounding errors
  - Errors in the transition model compound over the trajectory
  - By the end of a long trajectory, rewards can be totally wrong
- Model-based RL has failed (so far) in Atari

Are transition models required at all?


Deep Learning in Go

Monte-Carlo search
- Monte-Carlo tree search (MCTS) simulates future trajectories
- Builds large lookahead search tree with millions of positions
- State-of-the-art 19×19 Go programs use MCTS
- e.g. first strong Go program, MoGo (Gelly et al.)

Convolutional networks
- 12-layer convnet trained to predict expert moves
- Raw convnet (looking at 1 position, no search at all)
- Equals performance of MoGo with a 10⁵-position search tree

Move-prediction accuracy:

Program                   Accuracy
Human 6-dan               52%
12-Layer ConvNet          55%
8-Layer ConvNet*          44%
Prior state-of-the-art    31-39%
*Clarke & Storkey

Winning rate of the raw 12-layer convnet against Go programs:

Program        Winning rate
GnuGo          97%
MoGo (100k)    46%
Pachi (10k)    47%
Pachi (100k)   11%

Conclusion

Reinforcement learning + deep learning = AI

Questions?

"The only stupid question is the one you never asked" - Rich Sutton