
Planning and Optimal Control


8. Policy Gradient Methods

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)


Institute for Computer Science
University of Hildesheim, Germany

Planning and Optimal Control

Syllabus
A. Models for Sequential Data
Tue. 22.10. (1) 1. Markov Models
Tue. 29.10. (2) 2. Hidden Markov Models
Tue. 5.11. (3) 3. State Space Models
Tue. 12.11. (4) 3b. (ctd.)
B. Models for Sequential Decisions
Tue. 19.11. (5) 1. Markov Decision Processes
Tue. 26.11. (6) 1b. (ctd.)
Tue. 3.12. (7) 1c. (ctd.)
Tue. 10.12. (8) 2. Monte Carlo and Temporal Difference Methods
Tue. 17.12. (9) 3. Q Learning
Tue. 24.12. — — Christmas Break —
Tue. 7.1. (10) 4. Policy Gradient Methods
Tue. 14.1. (11) tba
Tue. 21.1. (12) tba
Tue. 28.1. (13) 8. Reinforcement Learning for Games
Tue. 4.2. (14) Q&A

Planning and Optimal Control

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 1. The Policy Gradient Theorem

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 1. The Policy Gradient Theorem

• Q Learning: learn the (optimal) action value function and derive the policy from it.

• Policy Gradient Methods: learn a parametrized policy directly.

Planning and Optimal Control 1. The Policy Gradient Theorem

Example: Continuous Mountain Car

S := [-1.2, 0.5] \times [-0.07, 0.07], \quad s =: (x, v) \text{ (position and velocity)}
A := [-1, +1] \text{ (acceleration)}

x_{t+1} := \mathrm{clip}(x_t + v_t)
v_{t+1} := \mathrm{clip}(v_t + 0.001\, a_t - 0.0025 \cos(3 x_t))

p(x_0) := \mathrm{unif}([-0.6, -0.4]), \quad v_0 := 0
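A minimal Python sketch of these dynamics. The class name, the per-step reward of −1, and the termination condition at the right border are illustrative assumptions, not given on the slide:

```python
import numpy as np

class ContinuousMountainCar:
    """Sketch of the dynamics above; reward and termination are assumptions."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.reset()

    def reset(self):
        self.x = self.rng.uniform(-0.6, -0.4)   # x0 ~ unif([-0.6, -0.4])
        self.v = 0.0                            # v0 := 0
        return np.array([self.x, self.v])

    def step(self, a):
        a = float(np.clip(a, -1.0, 1.0))        # A = [-1, +1]
        x_new = np.clip(self.x + self.v, -1.2, 0.5)
        v_new = np.clip(self.v + 0.001 * a - 0.0025 * np.cos(3 * self.x),
                        -0.07, 0.07)
        self.x, self.v = float(x_new), float(v_new)
        done = self.x >= 0.5                    # assumed: episode ends at the right border
        reward = -1.0                           # assumed: cost of -1 per step
        return np.array([self.x, self.v]), reward, done
```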
Planning and Optimal Control 1. The Policy Gradient Theorem

Idea
• let π be a parametrized policy with parameters θ (for a simple concrete parametrization, see the sketch after this slide):

    \pi(a \mid s; \theta)

• e.g., a neural network with input s whose output defines a distribution over actions a

• view its value function V^π at the unique starting state s_0 as a function of θ:

    V(\theta) := V^\pi(s_0; \theta)

• finding the optimal policy then means maximizing V w.r.t. θ
• e.g., by gradient ascent:

    \theta^{(t+1)} := \theta^{(t)} + \alpha_t \nabla_\theta V(\theta^{(t)})
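For a small discrete action set, one possible parametrization is a linear softmax policy; the sketch below (class name and feature vector are illustrative assumptions, the slides leave the parametrization open) also computes ∇ log π(a | s; θ), which is needed by all methods in this chapter:

```python
import numpy as np

class SoftmaxPolicy:
    """Sketch: pi(a | s; theta) = softmax(theta @ phi(s))_a for discrete actions."""

    def __init__(self, n_features, n_actions, rng=None):
        self.theta = np.zeros((n_actions, n_features))
        self.rng = rng if rng is not None else np.random.default_rng()

    def probs(self, phi_s):
        scores = self.theta @ phi_s
        scores -= scores.max()                 # numerical stability
        p = np.exp(scores)
        return p / p.sum()

    def sample(self, phi_s):
        p = self.probs(phi_s)
        return self.rng.choice(len(p), p=p)

    def grad_log(self, phi_s, a):
        # d/dtheta log pi(a | s; theta) = (1[a = b] - pi(b | s)) phi(s) for each row b
        p = self.probs(phi_s)
        g = -np.outer(p, phi_s)
        g[a] += phi_s
        return g
```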

Planning and Optimal Control 1. The Policy Gradient Theorem

State Distribution of a Policy


state visiting frequencies of a policy π:

    \eta^\pi(s) := E\big(|\{S_t = s \mid t = 1{:}T\}|\big)

consistency:

    \eta^\pi(s) = p(S_0 = s) + \sum_{s' \in S} \eta^\pi(s') \sum_{a \in A} \pi(a \mid s')\, p(s \mid s', a)

    \eta^\pi = \big(I - (P^\pi)^T\big)^{-1} p_0

with
    P^\pi := \Big(\sum_{a \in A} \pi(a \mid s')\, p(s \mid s', a)\Big)_{(s', s) \in S^2}  (state transition matrix under π),
    p_0 := \big(p(S_0 = s)\big)_{s \in S}  (initial state distribution),
    and I the identity matrix.

state distribution of policy π (on-policy distribution):

    \mu^\pi(s) := \frac{\eta^\pi(s)}{\sum_{s' \in S} \eta^\pi(s')}
Planning and Optimal Control 1. The Policy Gradient Theorem

Discounted State Distribution of a Policy


discounted state visiting frequencies of a policy π:

    \eta^\pi(s) := E\Big(\sum_t \gamma^t\, I(S_t = s)\Big)

• visiting a state at time t contributes weight γ^t
• equals the plain visiting frequency for γ = 1

consistency:

    \eta^\pi(s) = p(S_0 = s) + \sum_{s' \in S} \eta^\pi(s')\, \gamma \sum_{a \in A} \pi(a \mid s')\, p(s \mid s', a)

    \eta^\pi = \big(I - \gamma (P^\pi)^T\big)^{-1} p_0

discounted state distribution of policy π:

    \mu^\pi(s) := \frac{\eta^\pi(s)}{\sum_{s' \in S} \eta^\pi(s')}
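For a finite MDP, the discounted state distribution can be computed directly from the matrix form above. A minimal sketch, with an assumed array layout (γ = 1 recovers the undiscounted frequencies of the previous slide, provided the matrix is invertible, e.g. for episodic MDPs with an absorbing terminal state):

```python
import numpy as np

def discounted_state_distribution(P, pi, p0, gamma=0.99):
    """P[s, a, s1]: transition probs, pi[s, a]: policy, p0[s]: initial distribution."""
    n_states = P.shape[0]
    # P_pi[s, s1] = sum_a pi(a | s) p(s1 | s, a): state transition matrix under pi
    P_pi = np.einsum("sa,sat->st", pi, P)
    # eta = (I - gamma (P_pi)^T)^{-1} p0
    eta = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, p0)
    mu = eta / eta.sum()
    return eta, mu
```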

Planning and Optimal Control 1. The Policy Gradient Theorem

Policy Gradient Theorem

    \nabla_\theta V^\pi(s_0) \propto \sum_s \mu^\pi(s) \sum_a \nabla \pi(a \mid s; \theta)\, Q^\pi(s, a)

Planning and Optimal Control 1. The Policy Gradient Theorem

Policy Gradient Theorem / Proof


    \begin{aligned}
    \nabla V^\pi(s) &= \nabla \sum_a \pi(a \mid s)\, Q^\pi(s, a) \\
    &= \sum_a \Big( \nabla\pi(a \mid s)\, Q^\pi(s, a) + \pi(a \mid s)\, \nabla Q^\pi(s, a) \Big) \\
    &= \sum_a \Big( \nabla\pi(a \mid s)\, Q^\pi(s, a) + \pi(a \mid s)\, \nabla \sum_{s', r} p(s', r \mid s, a)\,(r + \gamma V^\pi(s')) \Big) \\
    &= \sum_a \Big( \nabla\pi(a \mid s)\, Q^\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \gamma\, \nabla V^\pi(s') \Big) \\
    &\overset{\text{rec.}}{=} \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) + \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \gamma
       \sum_{a'} \Big( \nabla\pi(a' \mid s')\, Q^\pi(s', a') + \pi(a' \mid s') \sum_{s''} p(s'' \mid s', a')\, \gamma\, \nabla V^\pi(s'') \Big) \\
    &\overset{\text{rec.}}{=} \sum_{s'} \sum_{k=0}^{\infty} \Pr(s \to s', k, \pi)\, \gamma^k \sum_a \nabla\pi(a \mid s')\, Q^\pi(s', a)
    \end{aligned}

with Pr(s → s', k, π) the probability of reaching state s' from s in k steps when following π.

Note: γ is missing in [Sutton and Barto, 2018, p. 325].
Planning and Optimal Control 1. The Policy Gradient Theorem

Policy Gradient Theorem / Proof


    \begin{aligned}
    \nabla V^\pi(s) &= \sum_{s'} \sum_{k=0}^{\infty} \Pr(s \to s', k, \pi)\, \gamma^k \sum_a \nabla\pi(a \mid s')\, Q^\pi(s', a) \\
    \nabla V^\pi(s_0) &= \sum_{s} \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\, \gamma^k \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &= \sum_s \eta^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a)
    \end{aligned}

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE)

    \begin{aligned}
    \nabla V^\pi(s_0) &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &= E\Big(\sum_a \nabla\pi(a \mid S_t)\, Q^\pi(S_t, a)\Big) \\
    &= E\Big(\sum_a \pi(a \mid S_t)\, \frac{\nabla\pi(a \mid S_t)}{\pi(a \mid S_t)}\, Q^\pi(S_t, a)\Big) \\
    &= E\Big(\frac{\nabla\pi(A_t \mid S_t)}{\pi(A_t \mid S_t)}\, V_t\Big) \\
    &= E\big(V_t\, \nabla \log \pi(A_t \mid S_t)\big)
    \end{aligned}

update rule:

    \theta := \theta + \alpha\, v_t\, \nabla \log \pi(a_t \mid s_t)

Note: The observed value V_t is also called the return in the literature.
Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE) / Alg.

learn-opt-policy-discounted-mc-policygrad(S, A, γ, s_term, N, α):
    initialize parameters θ of policy π
    for n := 1, ..., N:
        (s, a, r, T) := generate-episode(S, A, s_term, π)
        for t := 0, ..., T − 1:
            v_t := \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}
            θ := θ + α v_t ∇ log π(a_t | s_t)
    return π

where
• s_term: terminal state with zero reward
• α: learning rate
• returning π actually means returning its parameters θ
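A minimal Python sketch of this algorithm for a finite action set. It reuses the SoftmaxPolicy sketched earlier; the episodic environment interface `env.reset()` / `env.step(a)` and the feature map `phi` are illustrative assumptions:

```python
def reinforce(env, policy, phi, gamma, n_episodes, alpha):
    """Monte Carlo policy gradient; `policy` is e.g. the SoftmaxPolicy sketched earlier."""
    for _ in range(n_episodes):
        # generate one episode under the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = policy.sample(phi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # accumulate discounted returns backwards and update theta
        v = 0.0
        for t in reversed(range(len(rewards))):
            v = rewards[t] + gamma * v        # v_t = sum_{t' >= t} gamma^{t'-t} r_{t'}
            policy.theta += alpha * v * policy.grad_log(phi(states[t]), actions[t])
    return policy
```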

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Example

• short-corridor gridworld from S to G; assume all states are indistinguishable; reward −1 per step
• ε := 0.1

[Figure: J(θ) = v_{π_θ}(S) plotted against the probability of selecting the right action. ε-greedy right and ε-greedy left achieve start-state values of less than −44 and −82, respectively; the optimal stochastic policy selects right with probability ≈ 0.59 and achieves a value of about −11.6.]

[source: [Sutton and Barto, 2018, p.323]]
Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Example

[Figure 13.1: REINFORCE on the short-corridor gridworld (Example 13.1). Total reward per episode (G_0), averaged over 100 runs, for step sizes α = 2^{−12}, 2^{−13}, 2^{−14} over 1000 episodes; with a good step size, the total reward per episode approaches the optimal value of the start state v_*(s_0).]

[source: [Sutton and Barto, 2018, p.328]]
Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline)

• baseline b : S → R, not depending on actions a:

    \begin{aligned}
    \nabla V^\pi(s_0) &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &= \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, \big(Q^\pi(s, a) - b(s)\big)
    \end{aligned}

as

    \sum_a \nabla\pi(a \mid s)\, b(s) = b(s)\, \nabla \sum_a \pi(a \mid s) = b(s)\, \nabla 1 = 0

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline)

    \begin{aligned}
    \nabla V^\pi(s_0) &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, \big(Q^\pi(s, a) - b(s)\big) \\
    &= E\Big(\sum_a \nabla\pi(a \mid S_t)\, \big(Q^\pi(S_t, a) - b(S_t)\big)\Big) \\
    &= E\Big(\sum_a \pi(a \mid S_t)\, \frac{\nabla\pi(a \mid S_t)}{\pi(a \mid S_t)}\, \big(Q^\pi(S_t, a) - b(S_t)\big)\Big) \\
    &= E\Big(\frac{\nabla\pi(A_t \mid S_t)}{\pi(A_t \mid S_t)}\, \big(V_t - b(S_t)\big)\Big) \\
    &= E\big(\big(V_t - b(S_t)\big)\, \nabla \log \pi(A_t \mid S_t)\big)
    \end{aligned}

update rule:

    \theta := \theta + \alpha\, \big(v_t - b(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline)

• often a current estimate V̂ of the value function is used as baseline:

    b(s) := \hat{V}(s; \eta)

• update for the parameters η of V̂:

    \eta := \eta + \alpha_2\, \big(v_t - \hat{V}(s_t)\big)\, \nabla \hat{V}(s_t)

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline) / Alg.

learn-opt-policy-discounted-mc-policygrad-base(S, A, γ, s_term, N, α_1, α_2):
    initialize parameters η of value function V̂
    initialize parameters θ of policy π
    for n := 1, ..., N:
        (s, a, r, T) := generate-episode(S, A, s_term, π)
        for t := 0, ..., T − 1:
            v_t := \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}
            v̂_t := V̂(s_t)
            η := η + α_2 (v_t − v̂_t) ∇V̂(s_t)
            θ := θ + α_1 (v_t − v̂_t) ∇ log π(a_t | s_t)
    return π

where
• s_term: terminal state with zero reward
• α_1, α_2: learning rates for π and V̂, respectively
• returning π actually means returning its parameters θ
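A minimal sketch of the baseline variant, with a linear value estimate V̂(s; η) = ηᵀφ(s) as an illustrative assumption (same assumed `env` / `phi` interface and SoftmaxPolicy as before):

```python
import numpy as np

def reinforce_with_baseline(env, policy, phi, n_features, gamma, n_episodes,
                            alpha1, alpha2):
    eta = np.zeros(n_features)                 # parameters of V_hat(s) = eta @ phi(s)
    for _ in range(n_episodes):
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = policy.sample(phi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        v = 0.0
        for t in reversed(range(len(rewards))):
            v = rewards[t] + gamma * v                   # v_t
            f = phi(states[t])
            delta = v - eta @ f                          # v_t - V_hat(s_t)
            eta += alpha2 * delta * f                    # baseline (critic) update
            policy.theta += alpha1 * delta * policy.grad_log(f, actions[t])
    return policy
```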

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Example

[Figure 13.2: Adding a baseline to REINFORCE can make it learn much faster, as illustrated here on the short-corridor gridworld (Example 13.1): REINFORCE with baseline (α_θ = 2^{−9}, α_w = 2^{−6}) vs. REINFORCE (α = 2^{−13}); total reward per episode, averaged over 100 runs, over 1000 episodes. The step size used for REINFORCE is the one at which it performs best (to the nearest power of two; see Figure 13.1).]

[source: [Sutton and Barto, 2018, p.330]]
Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

TD Policy Gradients: Actor-Critic Methods

To derive the MC policy gradient / REINFORCE:

    E\Big(\sum_a \pi(a \mid S_t)\, Q^\pi(S_t, a)\Big) = E(V_t)

To derive the TD policy gradient / Actor-Critic:

    E\Big(\sum_a \pi(a \mid S_t)\, Q^\pi(S_t, a)\Big) = E(R_t + \gamma V_{t+1})

then plug in \hat{V}(S_{t+1}) for V_{t+1}:

    \nabla V^\pi(s_0) \overset{\text{approx.}}{\propto} E\big(\big(R_t + \gamma \hat{V}(S_{t+1})\big)\, \nabla \log \pi(A_t \mid S_t)\big)

and subtract the baseline \hat{V}(S_t):

    \nabla V^\pi(s_0) \overset{\text{approx.}}{\propto} E\big(\big(R_t + \gamma \hat{V}(S_{t+1}) - \hat{V}(S_t)\big)\, \nabla \log \pi(A_t \mid S_t)\big)

update rule:

    \theta := \theta + \alpha\, \big(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)
Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

TD Policy Gradients (Actor Critic)


learn-opt-policy-discounted-td-policygrad(S, A, γ, s_term, N, α_1, α_2):
    initialize parameters η of value function V̂
    initialize parameters θ of policy π
    for n := 1, ..., N:
        s := new-process()
        v̂ := V̂(s)
        while s ≠ s_term:
            a := π(s)
            (r, s') := execute-action(s, a)
            v̂' := V̂(s')
            δ := r + γ v̂' − v̂
            η := η + α_2 δ ∇V̂(s)
            θ := θ + α_1 δ ∇ log π(a | s)
            s := s', v̂ := v̂'
    return π

where
• new-process() sets up a new process
• execute-action(s, a) executes action a in the process in state s

limitations (addressed by the variant on the next slide):
• purely online updates → replay memory
• plain gradient descent steps → general model update algorithm
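A minimal sketch of this online actor-critic loop, again with a linear V̂ and the SoftmaxPolicy as illustrative assumptions (terminal states are bootstrapped with value 0, matching the zero-reward terminal state):

```python
import numpy as np

def actor_critic(env, policy, phi, n_features, gamma, n_episodes, alpha1, alpha2):
    eta = np.zeros(n_features)                 # parameters of V_hat(s) = eta @ phi(s)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            f = phi(s)
            a = policy.sample(f)
            s_next, r, done = env.step(a)
            v_hat = eta @ f
            v_hat_next = 0.0 if done else eta @ phi(s_next)
            delta = r + gamma * v_hat_next - v_hat        # TD error
            eta += alpha2 * delta * f                     # critic update
            policy.theta += alpha1 * delta * policy.grad_log(f, a)   # actor update
            s = s_next
    return policy
```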

Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

TD Policy Gradients (Actor Critic) / Replay Memory


learn-opt-policy-discounted-td-policygrad′(S, A, γ, s_term, N, α):
    initialize parameters η of value function V̂
    initialize parameters θ of policy π
    D := ∅
    for n := 1, ..., N:
        s := new-process()
        while s ≠ s_term:
            a := π(s)
            (r, s') := execute-action(s, a)
            D := D ∪ {(s, a, r, s')}
            D′_1 := {((s, a), r + γ V̂(s')) | (s, a, r, s') ∈ D}
            D′_2 := {(s, a, caseweight = r + γ V̂(s') − V̂(s)) | (s, a, r, s') ∈ D}
            V̂ := update-model(V̂, D′_1)
            π := update-model(π, D′_2)
            s := s'
    return π

where
• new-process() sets up a new process
• execute-action(s, a) executes action a in the process in state s
• update-model(m, D) refits model m on data set D with any suitable learning algorithm
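A minimal Python sketch of the replay-memory variant. All arguments are assumed stand-ins: `v_hat` is a callable value model, `policy` a sampling policy model, and `update_model` any supervised learner that returns the refitted model; unlike the pseudocode, it refits on a sampled mini-batch instead of all of D, as a practical simplification:

```python
import random

def td_policygrad_replay(env, policy, v_hat, update_model, gamma, n_episodes,
                         batch_size=32):
    D = []                                          # replay memory of transitions
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy.sample(s)
            s_next, r, done = env.step(a)
            D.append((s, a, r, s_next))
            batch = random.sample(D, min(batch_size, len(D)))
            # D'_1: regression targets for the value model
            D1 = [((s_, a_), r_ + gamma * v_hat(s1_)) for (s_, a_, r_, s1_) in batch]
            # D'_2: case weights (TD errors) for the policy model
            D2 = [(s_, a_, r_ + gamma * v_hat(s1_) - v_hat(s_))
                  for (s_, a_, r_, s1_) in batch]
            v_hat = update_model(v_hat, D1)
            policy = update_model(policy, D2)
            s = s_next
    return policy
```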

Planning and Optimal Control 4. Deterministic Policy Gradients

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 4. Deterministic Policy Gradients

Policy Value
• for non-finite MDPs, use the expected value over starting states (the policy value) to define the optimality of a policy:

    V(\pi) := E_{s_0 \sim p}\big(V^\pi(s_0)\big) = \int_{s_0} V^\pi(s_0)\, p(s_0)\, ds_0 = E_{s \sim \mu^\pi,\, a \sim \pi}\big(r(s, a)\big)

• the policy gradient theorem is for stochastic policies

    \pi : S \times A \to [0, 1]

• is there a corresponding result for deterministic policies?

    \pi : S \to A

Planning and Optimal Control 4. Deterministic Policy Gradients

Deterministic Policy Gradients

• stochastic policy gradient:

    \nabla_\theta V(\pi) = E_{s \sim \eta^\pi,\, a \sim \pi}\big(\nabla_\theta \log \pi(a \mid s; \theta)\, Q^\pi(s, a)\big)

• deterministic policy gradient:

    \nabla_\theta V(\pi) = E_{s \sim \eta^\pi}\big(\nabla_\theta \pi(s; \theta)\, \nabla_a Q^\pi(s, a)\big|_{a = \pi(s)}\big)

• the deterministic policy gradient is the limit of the stochastic one:

    \lim_{\pi' \to \pi} \nabla_\theta V(\pi') = \nabla_\theta V(\pi), \quad \pi' \text{ a stochastic policy},\ \pi \text{ a deterministic policy}

• see Silver et al. [2014]
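A minimal PyTorch sketch of one deterministic policy gradient actor update (not the algorithm from the paper; network shapes, the critic, and its training are illustrative assumptions). Ascending Q(s, π(s; θ)) in θ applies exactly the chain rule ∇_θ π(s; θ) ∇_a Q^π(s, a)|_{a = π(s)}:

```python
import torch

state_dim, action_dim = 3, 1                   # illustrative sizes
actor = torch.nn.Sequential(                   # deterministic policy pi(s; theta)
    torch.nn.Linear(state_dim, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, action_dim), torch.nn.Tanh())
critic = torch.nn.Sequential(                  # action value estimate Q(s, a)
    torch.nn.Linear(state_dim + action_dim, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1))
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-3)

def deterministic_pg_step(states):
    """One actor update on a batch of states (assumed sampled from eta^pi)."""
    actions = actor(states)                                  # a = pi(s; theta)
    q = critic(torch.cat([states, actions], dim=1))          # Q(s, pi(s; theta))
    loss = -q.mean()                                         # ascend Q <=> descend -Q
    actor_opt.zero_grad()
    loss.backward()    # chain rule: grad_theta pi(s) * grad_a Q(s, a)|_{a = pi(s)}
    actor_opt.step()   # only the actor is updated; training the critic is omitted here
```

For example, `deterministic_pg_step(torch.randn(64, state_dim))` performs one update on a random batch of states.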

Planning and Optimal Control 4. Deterministic Policy Gradients

Overview

what to learn \ from what to learn   from episodes                     from transitions

value function V^π                   Monte Carlo for V^π               Temporal Differences (TD)
action value function Q^π            Monte Carlo for Q^π               SARSA
optimal action value function Q_*    —                                 Q Learning
optimal policy π_*                   Monte Carlo Policy Gradient /     TD Policy Gradient /
                                     REINFORCE                         Actor Critic
Planning and Optimal Control 4. Deterministic Policy Gradients

Summary
• Policy gradient methods parametrize the policy directly, π(a | s; θ),
  instead of parametrizing the action value function Q(s, a; θ)
  and then deriving the policy as π(s) := arg max_a Q(s, a; θ).

• Gradients of the value function can be computed from gradients of the policy via the policy gradient theorem:

    \nabla_\theta V^\pi(s_0) \propto \sum_s \mu^\pi(s) \sum_a \nabla \pi(a \mid s; \theta)\, Q^\pi(s, a)

• Combined with observed values (Monte Carlo approach), gradient ascent on the (implicit) value function leads to a simple update rule for the policy parameters θ, called REINFORCE:

    \theta := \theta + \alpha\, v_t\, \nabla \log \pi(a_t \mid s_t)

• Any baseline for the observed value can be used and often accelerates convergence,
  especially a current estimate of the value function:

    \theta := \theta + \alpha\, \big(v_t - \hat{V}(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)
Planning and Optimal Control 4. Deterministic Policy Gradients

Summary (2/2)

• Instead of using observed values (MC approach), one can also combine policy gradients with temporal differences, which yields actor-critic methods:

    \theta := \theta + \alpha\, \big(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)

• Then, besides the policy model (the actor), a second model for the value function (the critic) is required.

Planning and Optimal Control

Further Readings
• Reinforcement Learning:
  • Olivier Sigaud, Frederick Garcia (2010): Reinforcement Learning, ch. 1 in Sigaud and Buffet [2010].
  • Sutton and Barto [2018]

Planning and Optimal Control

References
Olivier Sigaud and Olivier Buffet, editors. Markov Decision Processes in Artificial Intelligence. Wiley, 2010.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient
algorithms. In ICML, 2014.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

