
Practical RL

Week 6 @ spring 2019

Policy gradient methods

Yandex Research
Small experiment

The next slide contains a question

Please respond as fast as you can!

Small experiment

left or right?
Small experiment

Right! Ready for the next one?


Small experiment

What's Q(s,right) under gamma=0.99?


Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world

              True   (A)   (B)
  Q(s0,a0)       1     1     2
  Q(s0,a1)       2     2     1
  Q(s1,a0)       3     3     3
  Q(s1,a1)     100    50   100

Q: Which prediction is better (A/B)?
Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world

              True   (A)   (B)
  Q(s0,a0)       1     1     2
  Q(s0,a1)       2     2     1
  Q(s1,a0)       3     3     3
  Q(s1,a1)     100    50   100

              (A): better policy    (B): less MSE
Approximation error
DQN is trained to minimize

L ≈ E[ (Q(s_t, a_t) − (r_t + γ·max_{a'} Q(s_{t+1}, a')))² ]

Simple 2-state world

              True   (A)   (B)
  Q(s0,a0)       1     1     2
  Q(s0,a1)       2     2     1
  Q(s1,a0)       3     3     3
  Q(s1,a1)     100    50   100

              (A): better policy    (B): less MSE

Q-learning will prefer the worse policy (B)!
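A quick numeric sanity check of the table above (illustration only; the numbers are exactly those in the table):

```python
import numpy as np

# Q-values from the table; rows = states s0, s1; columns = actions a0, a1
true_q = np.array([[1.0, 2.0], [3.0, 100.0]])
pred_a = np.array([[1.0, 2.0], [3.0, 50.0]])
pred_b = np.array([[2.0, 1.0], [3.0, 100.0]])

for name, pred in [("A", pred_a), ("B", pred_b)]:
    mse = np.mean((pred - true_q) ** 2)
    same_policy = (pred.argmax(axis=1) == true_q.argmax(axis=1)).all()
    print(f"{name}: MSE = {mse:6.1f}, greedy policy matches the optimal one: {same_policy}")
# A: MSE =  625.0, greedy policy matches the optimal one: True
# B: MSE =    0.5, greedy policy matches the optimal one: False
```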
Conclusion

● Often computing Q-values is harder than picking optimal actions!

● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)

Q: what algorithm works that way?
(of those we studied)
Conclusion

● Often computing Q-values is harder than picking optimal actions!

● We could avoid learning value functions by directly learning the agent's policy π_θ(a|s)

Q: what algorithm works that way?
e.g. the crossentropy method
NOT how humans survived
argmax[
Q(s,pet the tiger)
Q(s,run from tiger)
Q(s,provoke tiger)
Q(s,ignore tiger)
]

how humans survived

π(run | s) = 1
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)

● Stochastic policy:      a ~ π_θ(a|s)

Q: Any case where stochastic is better?
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)

● Stochastic policy:      a ~ π_θ(a|s)   (e.g. rock-paper-scissors)

Q: Any case where stochastic is better?
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)
  – same action each time
  – e.g. genetic algos (week 0), deterministic policy gradient

● Stochastic policy:      a ~ π_θ(a|s)
  – sampling takes care of exploration
  – e.g. crossentropy method, policy gradient

Q: how to represent a policy in a continuous action space?
Policies
In general, two kinds

● Deterministic policy:   a = π_θ(s)
  – same action each time
  – e.g. genetic algos (week 0), deterministic policy gradient

● Stochastic policy:      a ~ π_θ(a|s)
  – sampling takes care of exploration
  – e.g. crossentropy method, policy gradient

Q: how to represent a policy in a continuous action space?
categorical, normal, a mixture of normals, whatever
Two approaches
● Value based:
  – Learn a value function Q_θ(s, a) or V_θ(s)
  – Infer the policy:  a = argmax_a Q_θ(s, a)

● Policy based:
  – Explicitly learn the policy π_θ(a|s) or π_θ(s) → a
  – Implicitly maximize reward over the policy
Recap: crossentropy method
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate

  θ_{i+1} = θ_i + α·∇ Σ_i log π_θ(a_i|s_i) · [s_i, a_i ∈ Elite]
Recap: crossentropy method
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions
  – elite = take M best sessions and concatenate

  θ_{i+1} = θ_i + α·∇ Σ_i log π_θ(a_i|s_i) · [s_i, a_i ∈ Elite]

TD version: elite = (s, a) pairs that have the highest G(s, a)
(select elites independently for each state)
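A minimal sketch of this update in PyTorch, assuming a softmax policy network; the elite (s, a) batch below is random placeholder data standing in for the concatenated best sessions:

```python
import torch
import torch.nn as nn

n_state_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(n_state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

# placeholder elite (s, a) pairs; in practice: concatenate the M best of N sampled sessions
elite_states = torch.rand(64, n_state_dim)
elite_actions = torch.randint(0, n_actions, (64,))

# maximize log-probability of elite actions == minimize cross-entropy on the elite set
log_probs = torch.log_softmax(policy(elite_states), dim=-1)
loss = -log_probs[torch.arange(64), elite_actions].mean()
opt.zero_grad(); loss.backward(); opt.step()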
Policy gradient main idea

Why so complicated?
We'd rather maximize reward directly!

Objective

Expected reward:

  J = E_{s~p(s), a~π_θ(a|s), ...} R(s, a, s', a', ...)

Expected discounted reward:

  J = E_{s~p(s), a~π_θ(a|s)} G(s, a)
Objective

Expected reward (the R(z) setting):

  J = E_{s~p(s), a~π_θ(a|s), ...} R(s, a, s', a', ...)

Expected discounted reward, with G(s, a) = r + γ·G(s', a'):

  J = E_{s~p(s), a~π_θ(a|s)} G(s, a)
Objective
Consider a 1-step process for simplicity:

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds
Objective
Consider a 1-step process for simplicity:

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  p(s): state visitation frequency (may depend on the policy)
  R(s, a): reward for the 1-step session

Q: how do we compute that?
Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  J ≈ (1/N) · Σ_{i=0}^{N} R(s, a)

  (sample N sessions under the current policy)
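As an illustration, a sketch of this Monte-Carlo estimate of J on CartPole with a random policy (assumes the classic gym API, where reset() returns an observation and step() returns four values):

```python
import gym
import numpy as np

env = gym.make("CartPole-v0")

def session_reward(policy_fn, t_max=200):
    """Play one session and return its total (undiscounted) reward."""
    s, total = env.reset(), 0.0
    for _ in range(t_max):
        s, r, done, _ = env.step(policy_fn(s))
        total += r
        if done:
            break
    return total

random_policy = lambda s: env.action_space.sample()
J_estimate = np.mean([session_reward(random_policy) for _ in range(100)])
print("J ≈", J_estimate)
```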
Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  J ≈ (1/N) · Σ_{i=0}^{N} R(s, a)

  (sample N sessions)

Can we optimize the policy now?


Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

  (the parameters θ “sit” inside π_θ(a|s))

  J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} R(s, a)

We don't know how to compute dJ/dθ.


Optimization
Finite differences
  – Change the policy a little, evaluate

    ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
  – Good old crossentropy method
  – Maximize probability of “elite” actions
Optimization
Finite differences
  – Change the policy a little, evaluate

    ∇J ≈ (J_{θ+ε} − J_θ) / ε

Stochastic optimization
  – Good old crossentropy method
  – Maximize probability of “elite” actions

Q: any problems with those two?


Optimization
Finite differences
  – Change the policy a little, evaluate

    ∇J ≈ (J_{θ+ε} − J_θ) / ε

  – VERY noisy, especially if both J's are estimated from samples

Stochastic optimization
  – Good old crossentropy method
  – Maximize probability of “elite” actions
  – “quantile convergence” problems with stochastic MDPs
Objective

  J = E_{s~p(s), a~π_θ(a|s)} R(s, a) = ∫_s p(s) ∫_a π_θ(a|s) R(s, a) da ds

Wish list:
  – Analytical gradient
  – Easy/stable approximations
Logderivative trick
Simple math

  ∇ log π(z) = ???

(try the chain rule)
Logderivative trick
Simple math

  ∇ log π(z) = (1 / π(z)) · ∇π(z)

  π(z) · ∇ log π(z) = ∇π(z)
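A small numeric check of this identity, assuming a softmax policy over three actions and using central finite differences (illustration only):

```python
import numpy as np

# check  π(z)·∇log π(z) = ∇π(z)  numerically for a softmax policy over 3 actions
theta = np.random.randn(3)
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()
z, eps = 0, 1e-6                                   # pick action z; π(z) = softmax(theta)[z]

def fd_grad(f):
    """Central finite-difference gradient of a scalar function of theta."""
    return np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                     for e in np.eye(3)])

grad_pi = fd_grad(lambda t: softmax(t)[z])
grad_log_pi = fd_grad(lambda t: np.log(softmax(t)[z]))

print(np.allclose(softmax(theta)[z] * grad_log_pi, grad_pi))   # True (up to fd error)
```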
Policy gradient
Analytical inference

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) R(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)
Policy gradient
Analytical inference

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) R(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)

  ∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇ log π_θ(a|s) R(s, a) da ds

Q: anything curious about that formula?
Policy gradient
Analytical inference

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) R(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)

  ∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇ log π_θ(a|s) R(s, a) da ds

  … that's an expectation :)
REINFORCE (bandit)
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} ∇ log π_θ(a|s) · R(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J
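A minimal sketch of this loop for a 3-armed bandit with made-up rewards, assuming a softmax policy parameterized directly by its logits (PyTorch):

```python
import torch

n_arms = 3
logits = torch.zeros(n_arms, requires_grad=True)        # policy parameters θ
opt = torch.optim.SGD([logits], lr=0.1)
arm_rewards = torch.tensor([1.0, 5.0, 2.0])             # made-up bandit rewards

for _ in range(200):
    probs = torch.softmax(logits, dim=-1)
    actions = torch.multinomial(probs, num_samples=32, replacement=True)   # "N sessions"
    rewards = arm_rewards[actions]                       # R(s, a); a bandit has no state
    log_pi = torch.log_softmax(logits, dim=-1)[actions]
    loss = -(log_pi * rewards).mean()                    # gradient of this is -∇J estimate
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, dim=-1))                     # mass concentrates on the best arm (index 1)
```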
Discounted reward case
● Replace R with Q :)
  (the true action value, a.k.a. E[G(s, a)])

  ∇J = ∫_s p(s) ∫_a ∇π_θ(a|s) Q(s, a) da ds

  recall:  π(z) · ∇ log π(z) = ∇π(z)

  ∇J = ∫_s p(s) ∫_a π_θ(a|s) ∇ log π_θ(a|s) Q(s, a) da ds

  … that's an expectation :)
REINFORCE (discounted)
● Policy gradient

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)

● Approximate with sampling

  ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)
REINFORCE algorithm
We can estimate Q using G

  Q_π(s_t, a_t) = E_{s'} G(s_t, a_t)

  G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

  [diagram: a trajectory … → s, a, r → s', a', r' → s'', a'', r'' → …]
Recap: discounted rewards
  G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
      = r_t + γ·(r_{t+1} + γ·r_{t+2} + ...)
      = r_t + γ·G_{t+1}

We can use this to compute all G's in linear time.
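A sketch of that linear-time pass (the rewards below are made up):

```python
def get_cumulative_returns(rewards, gamma=0.99):
    """G_t = r_t + γ·G_{t+1}, computed right-to-left in a single pass."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

print(get_cumulative_returns([0, 0, 1, 0, -10], gamma=0.5))
# [-0.375, -0.75, -1.5, -5.0, -10.0]
```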
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J
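A sketch of one such step written as a surrogate loss in PyTorch; states, actions and returns here are random placeholders standing in for a batch collected from N sessions:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# placeholders for a batch of (s, a, G) gathered from N sampled sessions
states = torch.rand(128, state_dim)
actions = torch.randint(0, n_actions, (128,))
returns = torch.rand(128)                                # G(s, a) estimates of Q(s, a)

log_pi = torch.log_softmax(policy(states), dim=-1)[torch.arange(128), actions]
loss = -(log_pi * returns).mean()                        # minimizing this ascends ∇J
opt.zero_grad(); loss.backward(); opt.step()             # one REINFORCE step
```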
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J

Q: is it off-policy or on-policy?
REINFORCE algorithm
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

  – Ascend:  θ_{i+1} ← θ_i + α·∇J

Actions are sampled under the current policy ⇒ on-policy.
Value-based vs policy-based

Value-based:
  ● Q-learning, SARSA, MCTS, value iteration
  ● Solves a harder problem
  ● Artificial exploration
  ● Learns from partial experience (temporal difference)
  ● Evaluates the strategy for free :)

Policy-based:
  ● REINFORCE, CEM
  ● Solves an easier problem
  ● Innate exploration
  ● Innate stochasticity
  ● Supports continuous action spaces
  ● Learns from full sessions only?
Value-based vs policy-based

Value-based:
  ● Q-learning, SARSA, MCTS, value iteration
  ● Solves a harder problem
  ● Artificial exploration
  ● Learns from partial experience (temporal difference)
  ● Evaluates the strategy for free :)

Policy-based (we'll learn much more soon!):
  ● REINFORCE, CEM
  ● Solves an easier problem
  ● Innate exploration
  ● Innate stochasticity
  ● Supports continuous action spaces
  ● Learns from full sessions only
REINFORCE baselines
● Initialize NN weights θ_0 ← random

● Loop:
  – Sample N sessions z under the current policy π_θ(a|s)
  – Evaluate the policy gradient

    ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · Q(s, a)

What is better for learning:
a random action in a good state, or a great action in a bad state?
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...

Q: Can you simplify the second term?
Note that b(s) does not depend on a.
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...

  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s)  =  E_{s~p(s)} [ b(s) · E_{a~π_θ(a|s)} ∇ log π_θ(a|s) ]  =  0
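Spelling out why the inner expectation vanishes (this uses only the log-derivative trick and Σ_a π_θ(a|s) = 1):

```latex
\mathbb{E}_{a \sim \pi_\theta(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s)
  = \sum_a \pi_\theta(a \mid s) \, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = \nabla_\theta 1 = 0 .
```

(For continuous actions, replace the sum with an integral.)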
REINFORCE baselines
We can subtract an arbitrary baseline b(s):

  ∇J = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · (Q(s, a) − b(s)) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)  −  E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · b(s) = ...

  ... = E_{s~p(s), a~π_θ(a|s)} ∇ log π_θ(a|s) · Q(s, a)

The gradient direction doesn't change!
REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]
REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]

If b(s) correlates with Q(s, a), the variance decreases.


REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]

Q: can you suggest any such b(s)?


REINFORCE baselines
● Gradient direction ∇J stays the same
● Variance may change

Gradient variance: Var[Q(s, a) − b(s)], as a random variable over (s, a):

  Var[Q(s, a) − b(s)] = Var[Q(s, a)] − 2·Cov[Q(s, a), b(s)] + Var[b(s)]

Naive baseline: b = moving average of Q over all (s, a);
then Var[b(s)] = 0 and Cov[Q, b] > 0.
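A tiny illustration of the naive moving-average baseline on made-up returns: subtracting b makes the multiplier of ∇ log π(a|s) small and signed rather than uniformly large:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(50.0, 10.0, size=1000)        # placeholder G(s, a): large and all positive

baseline, beta = 0.0, 0.05                          # running (moving) average of observed returns
centered = []
for G in returns:
    centered.append(G - baseline)                   # this is what multiplies ∇log π(a|s)
    baseline = (1 - beta) * baseline + beta * G

print(np.mean(np.square(returns)))                  # ≈ 2600: every update gets a big positive push
print(np.mean(np.square(centered)))                 # much smaller once the baseline catches up
```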
REINFORCE baselines
Better baseline: b(s) = V(s)

  ∇J ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · (Q(s, a) − V(s))

Q: but how do we predict V(s)?
Actor-critic
● Learn both V(s) and π_θ(a|s)
● Hope for the best of both worlds :)
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

Q: how can we estimate A(s, a) from (s, a, r, s') and the V-function?
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

  A(s, a) = Q(s, a) − V(s)

  Q(s, a) = r + γ·V(s')

  A(s, a) = r + γ·V(s') − V(s)
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

  A(s, a) = Q(s, a) − V(s)

  Q(s, a) = r + γ·V(s')          (also: n-step version)

  A(s, a) = r + γ·V(s') − V(s)
Advantage actor-critic

Idea: learn both π_θ(a|s) and V_θ(s).
Use V_θ(s) to learn π_θ(a|s) faster!

  A(s, a) = r + γ·V(s') − V(s)

  ∇J_actor ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)

  (treat A(s, a) as a constant: don't backpropagate through it)
Advantage actor-critic

  [diagram: one model with parameters W takes state s and outputs both π_θ(a|s) and V_θ(s)]

Improve the policy:

  ∇J_actor ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} ∇ log π_θ(a|s) · A(s, a)

Improve the value:

  L_critic ≈ (1/N) · Σ_{i=0}^{N} Σ_{s,a ∈ z_i} (V_θ(s) − [r + γ·V(s')])²
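A minimal PyTorch sketch of these two updates with a shared body and separate policy/value heads; the batch of transitions is random placeholder data, and the losses are simply summed with equal weight:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
policy_head, value_head = nn.Linear(64, n_actions), nn.Linear(64, 1)
params = [*body.parameters(), *policy_head.parameters(), *value_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

# placeholder batch of transitions (s, a, r, s', done)
s, s_next = torch.rand(32, state_dim), torch.rand(32, state_dim)
a = torch.randint(0, n_actions, (32,))
r, done = torch.rand(32), torch.zeros(32)

v = value_head(body(s)).squeeze(-1)
v_next = value_head(body(s_next)).squeeze(-1)
target = r + gamma * v_next * (1 - done)             # r + γ·V(s'), zeroed at terminal states
advantage = (target - v).detach()                    # "treat as constant": no gradient through A

log_pi = torch.log_softmax(policy_head(body(s)), dim=-1)[torch.arange(32), a]
actor_loss = -(log_pi * advantage).mean()            # descending this ascends ∇J_actor
critic_loss = (v - target.detach()).pow(2).mean()    # TD(0) regression for V_θ(s)

opt.zero_grad(); (actor_loss + critic_loss).backward(); opt.step()
```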
Continuous action spaces

What if there are continuously many actions?

  ● Robot control: set motor voltages
  ● Trading: assign money to equities

How does the algorithm change?
Continuous action spaces

What if there are continuously many actions?

  ● Robot control: set motor voltages
  ● Trading: assign money to equities

How does the algorithm change?

It doesn't :)
Just plug in a different formula for π(a|s), e.g. a normal distribution.
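For example, a sketch of a normal-distribution policy for a 1-D continuous action in PyTorch (the state-independent log-σ parameter is just one common choice, not the only option):

```python
import torch
import torch.nn as nn

state_dim = 3
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
log_sigma = nn.Parameter(torch.zeros(1))             # state-independent std (a common choice)

s = torch.rand(16, state_dim)                        # placeholder states
dist = torch.distributions.Normal(mu_net(s).squeeze(-1), log_sigma.exp())
a = dist.sample()                                    # a ~ π_θ(a|s), a real-valued action
log_pi = dist.log_prob(a)                            # plug this into the same ∇J formula
```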
Asynchronous advantage actor-critic

● Parallel game sessions
● Async multi-CPU training
● No experience replay
● LSTM policy
● N-step advantage

Read more: https://arxiv.org/abs/1602.01783
IMPALA
● Massively parallel
● Separate actor / learner processes
● Small experience replay w/ importance sampling
Read more: https://arxiv.org/abs/1802.01561
Duct tape zone
● V(s) errors are less important than in Q-learning
  – the actor still learns even if the critic is random, just slower

● Regularize with entropy
  – to prevent premature convergence

● Learn on parallel sessions
  – or use a super-small experience replay

● Use log-softmax for numerical stability
  (see the sketch below)
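A sketch of the entropy bonus and the log-softmax trick (placeholder logits; the 0.01 entropy weight is a typical value, not a prescription):

```python
import torch

logits = torch.randn(32, 6, requires_grad=True)      # placeholder policy logits for a batch

log_probs = torch.log_softmax(logits, dim=-1)        # never take log(softmax(...)) in two steps
probs = log_probs.exp()
entropy = -(probs * log_probs).sum(dim=-1).mean()    # H(π(·|s)), averaged over the batch

actor_loss = torch.zeros(())                         # stands in for -E[log π(a|s)·A(s, a)]
loss = actor_loss - 0.01 * entropy                   # entropy bonus fights premature collapse
loss.backward()
```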


Asynchronous advantage actor-critic

Let's code!

