
Practical RL

Week 6 @ spring 2019

Policy gradient methods

Yandex Research
Small experiment

The next slide contains a question

Please respond as fast as you can!

Small experiment

left or right?
Small experiment

Right! Ready for the next one?


Small experiment

What's Q(s,right) under gamma=0.99?


Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world

              True   (A)   (B)
  Q(s0,a0)       1     1     2
  Q(s0,a1)       2     2     1
  Q(s1,a0)       3     3     3
  Q(s1,a1)     100    50   100

Q: Which prediction is better (A/B)?
Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world

              True   (A)   (B)
  Q(s0,a0)       1     1     2
  Q(s0,a1)       2     2     1
  Q(s1,a0)       3     3     3
  Q(s1,a1)     100    50   100

              (A): better policy    (B): less MSE
Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world

              True   (A)   (B)
  Q(s0,a0)       1     1     2
  Q(s0,a1)       2     2     1
  Q(s1,a0)       3     3     3
  Q(s1,a1)     100    50   100

              (A): better policy    (B): less MSE

Q-learning will prefer the worse policy (B)!
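A quick numeric sanity check of the table above (illustration only; the numbers are exactly those in the table):

```python
import numpy as np

# Q-values from the table; rows = states s0, s1; columns = actions a0, a1
true_q = np.array([[1.0, 2.0], [3.0, 100.0]])
pred_a = np.array([[1.0, 2.0], [3.0, 50.0]])
pred_b = np.array([[2.0, 1.0], [3.0, 100.0]])

for name, pred in [("A", pred_a), ("B", pred_b)]:
    mse = np.mean((pred - true_q) ** 2)
    same_policy = (pred.argmax(axis=1) == true_q.argmax(axis=1)).all()
    print(f"{name}: MSE = {mse:6.1f}, greedy policy matches the optimal one: {same_policy}")
# A: MSE =  625.0, greedy policy matches the optimal one: True
# B: MSE =    0.5, greedy policy matches the optimal one: False
```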
Conclusion

● Often computing Q-values is harder than picking optimal actions!

● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)

Q: what algorithm works that way?
(of those we studied)
Conclusion

● Often computing Q-values is harder than picking optimal actions!

● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)

Q: what algorithm works that way?
e.g. the crossentropy method
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]

how humans survived

π(run | s) = 1
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)

● Stochastic policy:      a ~ π_θ(a|s)

Q: Any case where stochastic is better?
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)

● Stochastic policy:      a ~ π_θ(a|s)   (e.g. rock-paper-scissors)

Q: Any case where stochastic is better?
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)
  – same action each time
  – e.g. genetic algos (week 0), deterministic policy gradient

● Stochastic policy:      a ~ π_θ(a|s)
  – sampling takes care of exploration
  – e.g. crossentropy method, policy gradient

Q: how to represent a policy in a continuous action space?
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)
  – same action each time
  – e.g. genetic algos (week 0), deterministic policy gradient

● Stochastic policy:      a ~ π_θ(a|s)
  – sampling takes care of exploration
  – e.g. crossentropy method, policy gradient

Q: how to represent a policy in a continuous action space?
categorical, normal, a mixture of normals, whatever
Two approaches
● Value based:
  – Learn a value function Q_θ(s, a) or V_θ(s)
  – Infer the policy:  a = argmax_a Q_θ(s, a)

● Policy based:
  – Explicitly learn the policy π_θ(a|s) or π_θ(s) → a
  – Implicitly maximize reward over the policy
Recap: crossentropy method
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate

  θ_{i+1} = θ_i + α·∇ Σ_i log π_θ(a_i|s_i) · [s_i, a_i ∈ Elite]
Recap: crossentropy method
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate

  θ_{i+1} = θ_i + α·∇ Σ_i log π_θ(a_i|s_i) · [s_i, a_i ∈ Elite]

TD version: elite = (s, a) pairs that have the highest G(s, a)
(select elites independently for each state)
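A minimal sketch of this update in PyTorch, assuming a softmax policy network; the elite (s, a) batch below is random placeholder data standing in for the concatenated best sessions:

```python
import torch
import torch.nn as nn

n_state_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(n_state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# placeholder elite (s, a) pairs; in practice: concatenate the M best of N sampled sessions
elite_states = torch.rand(64, n_state_dim)
elite_actions = torch.randint(0, n_actions, (64,))

# maximize log-probability of elite actions == minimize cross-entropy on the elite set
log_probs = torch.log_softmax(policy(elite_states), dim=-1)
loss = -log_probs[torch.arange(64), elite_actions].mean()
opt.zero_grad(); loss.backward(); opt.step()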
Policy gradient main idea

Why so complicated?
We'd rather maximize reward directly!

Objective

Expected reward:

  J = E_{s~p(s), a~π_θ(a|s), ...} R(s, a, s', a', ...)

Expected discounted reward:

  J = E_{s~p(s), a~π_θ(a|s)} G(s, a)
Objective

Expected reward (the R(z) setting):

  J = E_{s~p(s), a~π_θ(a|s), ...} R(s, a, s', a', ...)

Expected discounted reward, with G(s, a) = r + γ·G(s', a'):

  J = E_{s~p(s), a~π_θ(a|s)} G(s, a)
Objective
Consider a 1-step process for simplicity:

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds
Objective
Consider a 1-step process for simplicity:

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  p(s): state visitation frequency (may depend on the policy)
  R(s, a): reward for the 1-step session

Q: how do we compute that?
Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  J ≈ (1/N) · Σ_{i=0}^{N} R(s, a)

  (sample N sessions under the current policy)
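As an illustration, a sketch of this Monte-Carlo estimate of J on CartPole with a random policy (assumes the classic gym API, where reset() returns an observation and step() returns four values):

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")

def session_reward(policy_fn, t_max=200):
    """Play one session and return its total (undiscounted) reward."""
    s, total = env.reset(), 0.0
    for _ in range(t_max):
        s, r, done, _ = env.step(policy_fn(s))
        total += r
        if done:
            break
    return total

random_policy = lambda s: env.action_space.sample()
J_estimate = np.mean([session_reward(random_policy) for _ in range(100)])
print("J ≈", J_estimate)
```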
Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  J ≈ (1/N) · Σ_{i=0}^{N} R(s, a)

  (sample N sessions)

Can we optimize the policy now?


Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  (the parameters θ “sit” inside π_θ(a|s))

  J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} R(s, a)

We don't know how to compute dJ/dθ.


Optimization
Finite differences
  – Change the policy a little, evaluate

    ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
  – Good old crossentropy method
  – Maximize probability of “elite” actions
Optimization
Finite differences
  – Change the policy a little, evaluate

    ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
  – Good old crossentropy method
  – Maximize probability of “elite” actions

Q: any problems with those two?


Optimization
Finite differences
  – Change the policy a little, evaluate

    ∇J ≈ (J_{θ+ε} − J_θ) / ε

  – VERY noisy, especially if both J's are estimated from samples

Stochastic optimization
  – Good old crossentropy method
  – Maximize probability of “elite” actions
  – “quantile convergence” problems with stochastic MDPs
Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

Wish list:
  – Analytical gradient
  – Easy/stable approximations
Logderivative trick
Simple math

  ∇ log π(z) = ???

(try the chain rule)
Logderivative trick
Simple math

  ∇ log π(z) = (1 / π(z)) · ∇π(z)

  π(z) · ∇ log π(z) = ∇π(z)
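A small numeric check of this identity, assuming a softmax policy over three actions and using central finite differences (illustration only):

```python
import numpy as np

# check  π(z)·∇log π(z) = ∇π(z)  numerically for a softmax policy over 3 actions
theta = np.random.randn(3)
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()
z, eps = 0, 1e-6                                   # pick action z; π(z) = softmax(theta)[z]

def fd_grad(f):
    """Central finite-difference gradient of a scalar function of theta."""
    return np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                     for e in np.eye(3)])

grad_pi = fd_grad(lambda t: softmax(t)[z])
grad_log_pi = fd_grad(lambda t: np.log(softmax(t)[z]))

print(np.allclose(softmax(theta)[z] * grad_log_pi, grad_pi))   # True (up to fd error)
```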
Policy gradient
Analytical inference

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) R(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)
Policy gradient
Analytical inference

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) R(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)

  ∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇ log π_θ(a|s) R(s, a) da ds

Q: anything curious about that formula?
Policy gradient
Analytical inference

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) R(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)

  ∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇ log π_θ(a|s) R(s, a) da ds

  … that's an expectation :)
REINFORCE (bandit)
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} ∇ log π_θ(a|s) · R(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J
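A minimal sketch of this loop for a 3-armed bandit with made-up rewards, assuming a softmax policy parameterized directly by its logits (PyTorch):

```python
import torch

n_arms = 3
logits = torch.zeros(n_arms, requires_grad=True)        # policy parameters θ
opt = torch.optim.SGD([logits], lr=0.1)
arm_rewards = torch.tensor([1.0, 5.0, 2.0])             # made-up bandit rewards

for _ in range(200):
    probs = torch.softmax(logits, dim=-1)
    actions = torch.multinomial(probs, num_samples=32, replacement=True)   # "N sessions"
    rewards = arm_rewards[actions]                       # R(s, a); a bandit has no state
    log_pi = torch.log_softmax(logits, dim=-1)[actions]
    loss = -(log_pi * rewards).mean()                    # gradient of this is -∇J estimate
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=-1))                     # mass concentrates on the best arm (index 1)
```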
Discounted reward case
● Replace R with Q :)
  (the true action value, a.k.a. E[G(s, a)])

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) Q(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)

  ∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇ log π_θ(a|s) Q(s, a) da ds

  … that's an expectation :)
REINFORCE (discounted)
● Policy gradient

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)

● Approximate with sampling

  ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)
REINFORCE algorithm
We can estimate Q using G

  Q_π(s_t, a_t) = E_{s'} G(s_t, a_t)

  G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

  [diagram: a trajectory … → s, a, r → s', a', r' → s'', a'', r'' → …]
Recap: discounted rewards
  G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
      = r_t + γ·(r_{t+1} + γ·r_{t+2} + ...)
      = r_t + γ·G_{t+1}

We can use this to compute all G's in linear time.
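A sketch of that linear-time pass (the rewards below are made up):

```python
def get_cumulative_returns(rewards, gamma=0.99):
    """G_t = r_t + γ·G_{t+1}, computed right-to-left in a single pass."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

print(get_cumulative_returns([0, 0, 1, 0, -10], gamma=0.5))
# [-0.375, -0.75, -1.5, -5.0, -10.0]
```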
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J
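A sketch of one such step written as a surrogate loss in PyTorch; states, actions and returns here are random placeholders standing in for a batch collected from N sessions:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# placeholders for a batch of (s, a, G) gathered from N sampled sessions
states = torch.rand(128, state_dim)
actions = torch.randint(0, n_actions, (128,))
returns = torch.rand(128)                                # G(s, a) estimates of Q(s, a)

log_pi = torch.log_softmax(policy(states), dim=-1)[torch.arange(128), actions]
loss = -(log_pi * returns).mean()                        # minimizing this ascends ∇J
opt.zero_grad(); loss.backward(); opt.step()             # one REINFORCE step
```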
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J

Q: is it off-policy or on-policy?
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J

Actions are sampled under the current policy ⇒ on-policy.
Value-based vs policy-based

Value-based:
  ● Q-learning, SARSA, MCTS, value iteration
  ● Solves a harder problem
  ● Artificial exploration
  ● Learns from partial experience (temporal difference)
  ● Evaluates the strategy for free :)

Policy-based:
  ● REINFORCE, CEM
  ● Solves an easier problem
  ● Innate exploration
  ● Innate stochasticity
  ● Supports continuous action spaces
  ● Learns from full sessions only?
Value-based vs policy-based

Value-based:
  ● Q-learning, SARSA, MCTS, value iteration
  ● Solves a harder problem
  ● Artificial exploration
  ● Learns from partial experience (temporal difference)
  ● Evaluates the strategy for free :)

Policy-based (we'll learn much more soon!):
  ● REINFORCE, CEM
  ● Solves an easier problem
  ● Innate exploration
  ● Innate stochasticity
  ● Supports continuous action spaces
  ● Learns from full sessions only
REINFORCE baselines
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

What is better for learning:
a random action in a good state, or a great action in a bad state?
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...

Q: Can you simplify the second term?
Note that b(s) does not depend on a.
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...

  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s)  =  E_{s~p(s)} [ b(s) · E_{a~π_θ(a|s)} ∇ log π_θ(a|s) ]  =  0
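Spelling out why the inner expectation vanishes (this uses only the log-derivative trick and Σ_a π_θ(a|s) = 1):

```latex
\mathbb{E}_{a \sim \pi_\theta(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s)
  = \sum_a \pi_\theta(a \mid s) \, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = \nabla_\theta 1 = 0 .
```

(For continuous actions, replace the sum with an integral.)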
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)

The gradient direction doesn't change!
REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]
REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]

If b(s) correlates with Q(s, a), the variance decreases.


REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]

Q: can you suggest any such b(s)?


REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]

Naive baseline: b = moving average of Q over all (s, a);
then Var[b(s)] = 0 and Cov[Q, b] > 0.
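A tiny illustration of the naive moving-average baseline on made-up returns: subtracting b makes the multiplier of ∇ log π(a|s) small and signed rather than uniformly large:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(50.0, 10.0, size=1000)        # placeholder G(s, a): large and all positive

baseline, beta = 0.0, 0.05                          # running (moving) average of observed returns
centered = []
for G in returns:
    centered.append(G - baseline)                   # this is what multiplies ∇log π(a|s)
    baseline = (1 - beta) * baseline + beta * G

print(np.mean(np.square(returns)))                  # ≈ 2600: every update gets a big positive push
print(np.mean(np.square(centered)))                 # much smaller once the baseline catches up
```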
REINFORCE baselines
Better baseline: b(s) = V(s)

  ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · (Q(s, a) − V(s))

Q: but how do we predict V(s)?
Actor-critic
● Learn both V(s) and π_θ(a|s)
● Hope for the best of both worlds :)
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

Q: how can we estimate A(s, a) from (s, a, r, s') and the V-function?
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

  A(s, a) = Q(s, a) − V(s)

  Q(s, a) = r + γ·V(s')

  A(s, a) = r + γ·V(s') − V(s)
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

  A(s, a) = Q(s, a) − V(s)

  Q(s, a) = r + γ·V(s')          (also: n-step version)

  A(s, a) = r + γ·V(s') − V(s)
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

  A(s, a) = r + γ·V(s') − V(s)

  ∇J_actor ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)

  (treat A(s, a) as a constant: don't backpropagate through it)
Advantage actor-critic

  [diagram: one model with parameters W takes state s and outputs both π_θ(a|s) and V_θ(s)]

Improve the policy:

  ∇J_actor ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)

Improve the value:

  L_critic ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} (V_θ(s) − [r + γ·V(s')])²
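A minimal PyTorch sketch of these two updates with a shared body and separate policy/value heads; the batch of transitions is random placeholder data, and the losses are simply summed with equal weight:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
policy_head, value_head = nn.Linear(64, n_actions), nn.Linear(64, 1)
params = [*body.parameters(), *policy_head.parameters(), *value_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

# placeholder batch of transitions (s, a, r, s', done)
s, s_next = torch.rand(32, state_dim), torch.rand(32, state_dim)
a = torch.randint(0, n_actions, (32,))
r, done = torch.rand(32), torch.zeros(32)

v = value_head(body(s)).squeeze(-1)
v_next = value_head(body(s_next)).squeeze(-1)
target = r + gamma * v_next * (1 - done)             # r + γ·V(s'), zeroed at terminal states
advantage = (target - v).detach()                    # "treat as constant": no gradient through A

log_pi = torch.log_softmax(policy_head(body(s)), dim=-1)[torch.arange(32), a]
actor_loss = -(log_pi * advantage).mean()            # descending this ascends ∇J_actor
critic_loss = (v - target.detach()).pow(2).mean()    # TD(0) regression for V_θ(s)

opt.zero_grad(); (actor_loss + critic_loss).backward(); opt.step()
```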
Continuous action spaces

What if there are continuously many actions?

  ● Robot control: set motor voltages
  ● Trading: assign money to equities

How does the algorithm change?
Continuous action spaces

What if there are continuously many actions?

  ● Robot control: set motor voltages
  ● Trading: assign money to equities

How does the algorithm change?

It doesn't :)
Just plug in a different formula for π(a|s), e.g. a normal distribution.
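For example, a sketch of a normal-distribution policy for a 1-D continuous action in PyTorch (the state-independent log-σ parameter is just one common choice, not the only option):

```python
import torch
import torch.nn as nn

state_dim = 3
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
log_sigma = nn.Parameter(torch.zeros(1))             # state-independent std (a common choice)

s = torch.rand(16, state_dim)                        # placeholder states
dist = torch.distributions.Normal(mu_net(s).squeeze(-1), log_sigma.exp())
a = dist.sample()                                    # a ~ π_θ(a|s), a real-valued action
log_pi = dist.log_prob(a)                            # plug this into the same ∇J formula
```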
Asynchronous advantage actor-critic

● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage

Read more: https://arxiv.org/abs/1602.01783
IMPALA
● Massively parallel
● Separate actor / learner processes
● Small experience replay w/ importance sampling
Read more: https://arxiv.org/abs/1802.01561
Duct tape zone
● V(s) errors are less important than in Q-learning
  – the actor still learns even if the critic is random, just slower

● Regularize with entropy
  – to prevent premature convergence

● Learn on parallel sessions
  – or use a super-small experience replay

● Use log-softmax for numerical stability
  (see the sketch below)
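A sketch of the entropy bonus and the log-softmax trick (placeholder logits; the 0.01 entropy weight is a typical value, not a prescription):

```python
import torch

logits = torch.randn(32, 6, requires_grad=True)      # placeholder policy logits for a batch

log_probs = torch.log_softmax(logits, dim=-1)        # never take log(softmax(...)) in two steps
probs = log_probs.exp()
entropy = -(probs * log_probs).sum(dim=-1).mean()    # H(π(·|s)), averaged over the batch

actor_loss = torch.zeros(())                         # stands in for -E[log π(a|s)·A(s, a)]
loss = actor_loss - 0.01 * entropy                   # entropy bonus fights premature collapse
loss.backward()
```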


Asynchronous advantage actor-critic

Let's code!

