
Planning and Optimal Control


8. Policy Gradient Methods

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)


Institute for Computer Science
University of Hildesheim, Germany

Planning and Optimal Control

Syllabus
A. Models for Sequential Data
Tue. 22.10. (1) 1. Markov Models
Tue. 29.10. (2) 2. Hidden Markov Models
Tue. 5.11. (3) 3. State Space Models
Tue. 12.11. (4) 3b. (ctd.)
B. Models for Sequential Decisions
Tue. 19.11. (5) 1. Markov Decision Processes
Tue. 26.11. (6) 1b. (ctd.)
Tue. 3.12. (7) 1c. (ctd.)
Tue. 10.12. (8) 2. Monte Carlo and Temporal Difference Methods
Tue. 17.12. (9) 3. Q Learning
Tue. 24.12. — — Christmas Break —
Tue. 7.1. (10) 4. Policy Gradient Methods
Tue. 14.1. (11) tba
Tue. 21.1. (12) tba
Tue. 28.1. (13) 8. Reinforcement Learning for Games
Tue. 4.2. (14) Q&A

Planning and Optimal Control

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 1. The Policy Gradient Theorem

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 1. The Policy Gradient Theorem

• Q Learning: learn the (optimal) action value function and derive the policy from it.

• Policy Gradient Methods: learn a parametrized policy directly.

Planning and Optimal Control 1. The Policy Gradient Theorem

Example: Continuous Mountain Car

S := [-1.2, 0.5] \times [-0.07, 0.07], \quad s =: (x, v) \text{ (position and velocity)}
A := [-1, +1] \text{ (acceleration)}

x_{t+1} := \mathrm{clip}(x_t + v_t)
v_{t+1} := \mathrm{clip}(v_t + 0.001\, a_t - 0.0025 \cos(3 x_t))

p(x_0) := \mathrm{unif}([-0.6, -0.4]), \quad v_0 := 0
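A minimal Python sketch of these dynamics. The class name, the per-step reward of −1, and the termination condition at the right border are illustrative assumptions, not given on the slide:

```python
import numpy as np

class ContinuousMountainCar:
    """Sketch of the dynamics above; reward and termination are assumptions."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.reset()

    def reset(self):
        self.x = self.rng.uniform(-0.6, -0.4)   # x0 ~ unif([-0.6, -0.4])
        self.v = 0.0                            # v0 := 0
        return np.array([self.x, self.v])

    def step(self, a):
        a = float(np.clip(a, -1.0, 1.0))        # A = [-1, +1]
        x_new = np.clip(self.x + self.v, -1.2, 0.5)
        v_new = np.clip(self.v + 0.001 * a - 0.0025 * np.cos(3 * self.x),
                        -0.07, 0.07)
        self.x, self.v = float(x_new), float(v_new)
        done = self.x >= 0.5                    # assumed: episode ends at the right border
        reward = -1.0                           # assumed: cost of -1 per step
        return np.array([self.x, self.v]), reward, done
```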
Planning and Optimal Control 1. The Policy Gradient Theorem

Idea
• let π be a parametrized policy with parameters θ (for a simple concrete parametrization, see the sketch after this slide):

    \pi(a \mid s; \theta)

• e.g., a neural network with input s whose output defines a distribution over actions a

• view its value function V^π at the unique starting state s_0 as a function of θ:

    V(\theta) := V^\pi(s_0; \theta)

• finding the optimal policy then means maximizing V w.r.t. θ
• e.g., by gradient ascent:

    \theta^{(t+1)} := \theta^{(t)} + \alpha_t \nabla_\theta V(\theta^{(t)})
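For a small discrete action set, one possible parametrization is a linear softmax policy; the sketch below (class name and feature vector are illustrative assumptions, the slides leave the parametrization open) also computes ∇ log π(a | s; θ), which is needed by all methods in this chapter:

```python
import numpy as np

class SoftmaxPolicy:
    """Sketch: pi(a | s; theta) = softmax(theta @ phi(s))_a for discrete actions."""

    def __init__(self, n_features, n_actions, rng=None):
        self.theta = np.zeros((n_actions, n_features))
        self.rng = rng if rng is not None else np.random.default_rng()

    def probs(self, phi_s):
        scores = self.theta @ phi_s
        scores -= scores.max()                 # numerical stability
        p = np.exp(scores)
        return p / p.sum()

    def sample(self, phi_s):
        p = self.probs(phi_s)
        return self.rng.choice(len(p), p=p)

    def grad_log(self, phi_s, a):
        # d/dtheta log pi(a | s; theta) = (1[a = b] - pi(b | s)) phi(s) for each row b
        p = self.probs(phi_s)
        g = -np.outer(p, phi_s)
        g[a] += phi_s
        return g
```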

Planning and Optimal Control 1. The Policy Gradient Theorem

State Distribution of a Policy


state visiting frequencies of a policy π:

    \eta^\pi(s) := E\big(|\{S_t = s \mid t = 1{:}T\}|\big)

consistency:

    \eta^\pi(s) = p(S_0 = s) + \sum_{s' \in S} \eta^\pi(s') \sum_{a \in A} \pi(a \mid s')\, p(s \mid s', a)

    \eta^\pi = \big(I - (P^\pi)^T\big)^{-1} p_0

with
    P^\pi := \Big(\sum_{a \in A} \pi(a \mid s')\, p(s \mid s', a)\Big)_{(s', s) \in S^2}  (state transition matrix under π),
    p_0 := \big(p(S_0 = s)\big)_{s \in S}  (initial state distribution),
    and I the identity matrix.

state distribution of policy π (on-policy distribution):

    \mu^\pi(s) := \frac{\eta^\pi(s)}{\sum_{s' \in S} \eta^\pi(s')}
Planning and Optimal Control 1. The Policy Gradient Theorem

Discounted State Distribution of a Policy


discounted state visiting frequencies of a policy π:

    \eta^\pi(s) := E\Big(\sum_t \gamma^t\, I(S_t = s)\Big)

• visiting a state at time t contributes weight γ^t
• equals the plain visiting frequency for γ = 1

consistency:

    \eta^\pi(s) = p(S_0 = s) + \sum_{s' \in S} \eta^\pi(s')\, \gamma \sum_{a \in A} \pi(a \mid s')\, p(s \mid s', a)

    \eta^\pi = \big(I - \gamma (P^\pi)^T\big)^{-1} p_0

discounted state distribution of policy π:

    \mu^\pi(s) := \frac{\eta^\pi(s)}{\sum_{s' \in S} \eta^\pi(s')}
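For a finite MDP, the discounted state distribution can be computed directly from the matrix form above. A minimal sketch, with an assumed array layout (γ = 1 recovers the undiscounted frequencies of the previous slide, provided the matrix is invertible, e.g. for episodic MDPs with an absorbing terminal state):

```python
import numpy as np

def discounted_state_distribution(P, pi, p0, gamma=0.99):
    """P[s, a, s1]: transition probs, pi[s, a]: policy, p0[s]: initial distribution."""
    n_states = P.shape[0]
    # P_pi[s, s1] = sum_a pi(a | s) p(s1 | s, a): state transition matrix under pi
    P_pi = np.einsum("sa,sat->st", pi, P)
    # eta = (I - gamma (P_pi)^T)^{-1} p0
    eta = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, p0)
    mu = eta / eta.sum()
    return eta, mu
```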

Planning and Optimal Control 1. The Policy Gradient Theorem

Policy Gradient Theorem

    \nabla_\theta V^\pi(s_0) \propto \sum_s \mu^\pi(s) \sum_a \nabla \pi(a \mid s; \theta)\, Q^\pi(s, a)

Planning and Optimal Control 1. The Policy Gradient Theorem

Policy Gradient Theorem / Proof


    \begin{aligned}
    \nabla V^\pi(s) &= \nabla \sum_a \pi(a \mid s)\, Q^\pi(s, a) \\
    &= \sum_a \Big( \nabla\pi(a \mid s)\, Q^\pi(s, a) + \pi(a \mid s)\, \nabla Q^\pi(s, a) \Big) \\
    &= \sum_a \Big( \nabla\pi(a \mid s)\, Q^\pi(s, a) + \pi(a \mid s)\, \nabla \sum_{s', r} p(s', r \mid s, a)\,(r + \gamma V^\pi(s')) \Big) \\
    &= \sum_a \Big( \nabla\pi(a \mid s)\, Q^\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \gamma\, \nabla V^\pi(s') \Big) \\
    &\overset{\text{rec.}}{=} \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) + \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \gamma
       \sum_{a'} \Big( \nabla\pi(a' \mid s')\, Q^\pi(s', a') + \pi(a' \mid s') \sum_{s''} p(s'' \mid s', a')\, \gamma\, \nabla V^\pi(s'') \Big) \\
    &\overset{\text{rec.}}{=} \sum_{s'} \sum_{k=0}^{\infty} \Pr(s \to s', k, \pi)\, \gamma^k \sum_a \nabla\pi(a \mid s')\, Q^\pi(s', a)
    \end{aligned}

with Pr(s → s', k, π) the probability of reaching state s' from s in k steps when following π.

Note: γ is missing in [Sutton and Barto, 2018, p. 325].
Planning and Optimal Control 1. The Policy Gradient Theorem

Policy Gradient Theorem / Proof


    \begin{aligned}
    \nabla V^\pi(s) &= \sum_{s'} \sum_{k=0}^{\infty} \Pr(s \to s', k, \pi)\, \gamma^k \sum_a \nabla\pi(a \mid s')\, Q^\pi(s', a) \\
    \nabla V^\pi(s_0) &= \sum_{s} \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi)\, \gamma^k \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &= \sum_s \eta^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a)
    \end{aligned}

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE)

    \begin{aligned}
    \nabla V^\pi(s_0) &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &= E\Big(\sum_a \nabla\pi(a \mid S_t)\, Q^\pi(S_t, a)\Big) \\
    &= E\Big(\sum_a \pi(a \mid S_t)\, \frac{\nabla\pi(a \mid S_t)}{\pi(a \mid S_t)}\, Q^\pi(S_t, a)\Big) \\
    &= E\Big(\frac{\nabla\pi(A_t \mid S_t)}{\pi(A_t \mid S_t)}\, V_t\Big) \\
    &= E\big(V_t\, \nabla \log \pi(A_t \mid S_t)\big)
    \end{aligned}

update rule:

    \theta := \theta + \alpha\, v_t\, \nabla \log \pi(a_t \mid s_t)

Note: The observed value V_t is also called the return in the literature.
Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE) / Alg.

learn-opt-policy-discounted-mc-policygrad(S, A, γ, s_term, N, α):
    initialize parameters θ of policy π
    for n := 1, ..., N:
        (s, a, r, T) := generate-episode(S, A, s_term, π)
        for t := 0, ..., T − 1:
            v_t := \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}
            θ := θ + α v_t ∇ log π(a_t | s_t)
    return π

where
• s_term: terminal state with zero reward
• α: learning rate
• returning π actually means returning its parameters θ
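A minimal Python sketch of this algorithm for a finite action set. It reuses the SoftmaxPolicy sketched earlier; the episodic environment interface `env.reset()` / `env.step(a)` and the feature map `phi` are illustrative assumptions:

```python
def reinforce(env, policy, phi, gamma, n_episodes, alpha):
    """Monte Carlo policy gradient; `policy` is e.g. the SoftmaxPolicy sketched earlier."""
    for _ in range(n_episodes):
        # generate one episode under the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = policy.sample(phi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        # accumulate discounted returns backwards and update theta
        v = 0.0
        for t in reversed(range(len(rewards))):
            v = rewards[t] + gamma * v        # v_t = sum_{t' >= t} gamma^{t'-t} r_{t'}
            policy.theta += alpha * v * policy.grad_log(phi(states[t]), actions[t])
    return policy
```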

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Example

• short-corridor gridworld from S to G; assume all states are indistinguishable; reward −1 per step
• ε := 0.1

[Figure: J(θ) = v_{π_θ}(S) plotted against the probability of selecting the right action. ε-greedy right and ε-greedy left achieve start-state values of less than −44 and −82, respectively; the optimal stochastic policy selects right with probability ≈ 0.59 and achieves a value of about −11.6.]

[source: [Sutton and Barto, 2018, p.323]]
Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Example

[Figure 13.1: REINFORCE on the short-corridor gridworld (Example 13.1). Total reward per episode (G_0), averaged over 100 runs, for step sizes α = 2^{−12}, 2^{−13}, 2^{−14} over 1000 episodes; with a good step size, the total reward per episode approaches the optimal value of the start state v_*(s_0).]

[source: [Sutton and Barto, 2018, p.328]]
Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline)

• baseline b : S → R, not depending on actions a:

    \begin{aligned}
    \nabla V^\pi(s_0) &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, Q^\pi(s, a) \\
    &= \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, \big(Q^\pi(s, a) - b(s)\big)
    \end{aligned}

as

    \sum_a \nabla\pi(a \mid s)\, b(s) = b(s)\, \nabla \sum_a \pi(a \mid s) = b(s)\, \nabla 1 = 0

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline)

    \begin{aligned}
    \nabla V^\pi(s_0) &\propto \sum_s \mu^\pi(s) \sum_a \nabla\pi(a \mid s)\, \big(Q^\pi(s, a) - b(s)\big) \\
    &= E\Big(\sum_a \nabla\pi(a \mid S_t)\, \big(Q^\pi(S_t, a) - b(S_t)\big)\Big) \\
    &= E\Big(\sum_a \pi(a \mid S_t)\, \frac{\nabla\pi(a \mid S_t)}{\pi(a \mid S_t)}\, \big(Q^\pi(S_t, a) - b(S_t)\big)\Big) \\
    &= E\Big(\frac{\nabla\pi(A_t \mid S_t)}{\pi(A_t \mid S_t)}\, \big(V_t - b(S_t)\big)\Big) \\
    &= E\big(\big(V_t - b(S_t)\big)\, \nabla \log \pi(A_t \mid S_t)\big)
    \end{aligned}

update rule:

    \theta := \theta + \alpha\, \big(v_t - b(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline)

• often a current estimate V̂ of the value function is used as baseline:

    b(s) := \hat{V}(s; \eta)

• update for the parameters η of V̂:

    \eta := \eta + \alpha_2\, \big(v_t - \hat{V}(s_t)\big)\, \nabla \hat{V}(s_t)

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Monte Carlo Policy Gradient (REINFORCE w. Baseline) / Alg.

learn-opt-policy-discounted-mc-policygrad-base(S, A, γ, s_term, N, α_1, α_2):
    initialize parameters η of value function V̂
    initialize parameters θ of policy π
    for n := 1, ..., N:
        (s, a, r, T) := generate-episode(S, A, s_term, π)
        for t := 0, ..., T − 1:
            v_t := \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}
            v̂_t := V̂(s_t)
            η := η + α_2 (v_t − v̂_t) ∇V̂(s_t)
            θ := θ + α_1 (v_t − v̂_t) ∇ log π(a_t | s_t)
    return π

where
• s_term: terminal state with zero reward
• α_1, α_2: learning rates for π and V̂, respectively
• returning π actually means returning its parameters θ
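A minimal sketch of the baseline variant, with a linear value estimate V̂(s; η) = ηᵀφ(s) as an illustrative assumption (same assumed `env` / `phi` interface and SoftmaxPolicy as before):

```python
import numpy as np

def reinforce_with_baseline(env, policy, phi, n_features, gamma, n_episodes,
                            alpha1, alpha2):
    eta = np.zeros(n_features)                 # parameters of V_hat(s) = eta @ phi(s)
    for _ in range(n_episodes):
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = policy.sample(phi(s))
            s_next, r, done = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s_next
        v = 0.0
        for t in reversed(range(len(rewards))):
            v = rewards[t] + gamma * v                   # v_t
            f = phi(states[t])
            delta = v - eta @ f                          # v_t - V_hat(s_t)
            eta += alpha2 * delta * f                    # baseline (critic) update
            policy.theta += alpha1 * delta * policy.grad_log(f, actions[t])
    return policy
```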

Planning and Optimal Control 2. Monte Carlo Policy Gradient (REINFORCE)

Example

[Figure 13.2: Adding a baseline to REINFORCE can make it learn much faster, as illustrated here on the short-corridor gridworld (Example 13.1): REINFORCE with baseline (α_θ = 2^{−9}, α_w = 2^{−6}) vs. REINFORCE (α = 2^{−13}); total reward per episode, averaged over 100 runs, over 1000 episodes. The step size used for REINFORCE is the one at which it performs best (to the nearest power of two; see Figure 13.1).]

[source: [Sutton and Barto, 2018, p.330]]
Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

TD Policy Gradients: Actor-Critic Methods

To derive the MC policy gradient / REINFORCE:

    E\Big(\sum_a \pi(a \mid S_t)\, Q^\pi(S_t, a)\Big) = E(V_t)

To derive the TD policy gradient / Actor-Critic:

    E\Big(\sum_a \pi(a \mid S_t)\, Q^\pi(S_t, a)\Big) = E(R_t + \gamma V_{t+1})

then plug in \hat{V}(S_{t+1}) for V_{t+1}:

    \nabla V^\pi(s_0) \overset{\text{approx.}}{\propto} E\big(\big(R_t + \gamma \hat{V}(S_{t+1})\big)\, \nabla \log \pi(A_t \mid S_t)\big)

and subtract the baseline \hat{V}(S_t):

    \nabla V^\pi(s_0) \overset{\text{approx.}}{\propto} E\big(\big(R_t + \gamma \hat{V}(S_{t+1}) - \hat{V}(S_t)\big)\, \nabla \log \pi(A_t \mid S_t)\big)

update rule:

    \theta := \theta + \alpha\, \big(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)
Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

TD Policy Gradients (Actor Critic)


learn-opt-policy-discounted-td-policygrad(S, A, γ, s_term, N, α_1, α_2):
    initialize parameters η of value function V̂
    initialize parameters θ of policy π
    for n := 1, ..., N:
        s := new-process()
        v̂ := V̂(s)
        while s ≠ s_term:
            a := π(s)
            (r, s') := execute-action(s, a)
            v̂' := V̂(s')
            δ := r + γ v̂' − v̂
            η := η + α_2 δ ∇V̂(s)
            θ := θ + α_1 δ ∇ log π(a | s)
            s := s', v̂ := v̂'
    return π

where
• new-process() sets up a new process
• execute-action(s, a) executes action a in the process in state s

limitations (addressed by the variant on the next slide):
• purely online updates → replay memory
• plain gradient descent steps → general model update algorithm
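A minimal sketch of this online actor-critic loop, again with a linear V̂ and the SoftmaxPolicy as illustrative assumptions (terminal states are bootstrapped with value 0, matching the zero-reward terminal state):

```python
import numpy as np

def actor_critic(env, policy, phi, n_features, gamma, n_episodes, alpha1, alpha2):
    eta = np.zeros(n_features)                 # parameters of V_hat(s) = eta @ phi(s)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            f = phi(s)
            a = policy.sample(f)
            s_next, r, done = env.step(a)
            v_hat = eta @ f
            v_hat_next = 0.0 if done else eta @ phi(s_next)
            delta = r + gamma * v_hat_next - v_hat        # TD error
            eta += alpha2 * delta * f                     # critic update
            policy.theta += alpha1 * delta * policy.grad_log(f, a)   # actor update
            s = s_next
    return policy
```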

Planning and Optimal Control 3. TD Policy Gradients: Actor-Critic Methods

TD Policy Gradients (Actor Critic) / Replay Memory


learn-opt-policy-discounted-td-policygrad′(S, A, γ, s_term, N, α):
    initialize parameters η of value function V̂
    initialize parameters θ of policy π
    D := ∅
    for n := 1, ..., N:
        s := new-process()
        while s ≠ s_term:
            a := π(s)
            (r, s') := execute-action(s, a)
            D := D ∪ {(s, a, r, s')}
            D′_1 := {((s, a), r + γ V̂(s')) | (s, a, r, s') ∈ D}
            D′_2 := {(s, a, caseweight = r + γ V̂(s') − V̂(s)) | (s, a, r, s') ∈ D}
            V̂ := update-model(V̂, D′_1)
            π := update-model(π, D′_2)
            s := s'
    return π

where
• new-process() sets up a new process
• execute-action(s, a) executes action a in the process in state s
• update-model(m, D) refits model m on data set D with any suitable learning algorithm
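A minimal Python sketch of the replay-memory variant. All arguments are assumed stand-ins: `v_hat` is a callable value model, `policy` a sampling policy model, and `update_model` any supervised learner that returns the refitted model; unlike the pseudocode, it refits on a sampled mini-batch instead of all of D, as a practical simplification:

```python
import random

def td_policygrad_replay(env, policy, v_hat, update_model, gamma, n_episodes,
                         batch_size=32):
    D = []                                          # replay memory of transitions
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy.sample(s)
            s_next, r, done = env.step(a)
            D.append((s, a, r, s_next))
            batch = random.sample(D, min(batch_size, len(D)))
            # D'_1: regression targets for the value model
            D1 = [((s_, a_), r_ + gamma * v_hat(s1_)) for (s_, a_, r_, s1_) in batch]
            # D'_2: case weights (TD errors) for the policy model
            D2 = [(s_, a_, r_ + gamma * v_hat(s1_) - v_hat(s_))
                  for (s_, a_, r_, s1_) in batch]
            v_hat = update_model(v_hat, D1)
            policy = update_model(policy, D2)
            s = s_next
    return policy
```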

Planning and Optimal Control 4. Deterministic Policy Gradients

Outline

1. The Policy Gradient Theorem

2. Monte Carlo Policy Gradient (REINFORCE)

3. TD Policy Gradients: Actor-Critic Methods

4. Deterministic Policy Gradients

Planning and Optimal Control 4. Deterministic Policy Gradients

Policy Value
• for non-finite MDPs, use the expected value over starting states (the policy value) to define the optimality of a policy:

    V(\pi) := E_{s_0 \sim p}\big(V^\pi(s_0)\big) = \int_{s_0} V^\pi(s_0)\, p(s_0)\, ds_0 = E_{s \sim \mu^\pi,\, a \sim \pi}\big(r(s, a)\big)

• the policy gradient theorem is for stochastic policies

    \pi : S \times A \to [0, 1]

• is there a corresponding result for deterministic policies?

    \pi : S \to A

Planning and Optimal Control 4. Deterministic Policy Gradients

Deterministic Policy Gradients

• stochastic policy gradient:

    \nabla_\theta V(\pi) = E_{s \sim \eta^\pi,\, a \sim \pi}\big(\nabla_\theta \log \pi(a \mid s; \theta)\, Q^\pi(s, a)\big)

• deterministic policy gradient:

    \nabla_\theta V(\pi) = E_{s \sim \eta^\pi}\big(\nabla_\theta \pi(s; \theta)\, \nabla_a Q^\pi(s, a)\big|_{a = \pi(s)}\big)

• the deterministic policy gradient is the limit of the stochastic one:

    \lim_{\pi' \to \pi} \nabla_\theta V(\pi') = \nabla_\theta V(\pi), \quad \pi' \text{ a stochastic policy},\ \pi \text{ a deterministic policy}

• see Silver et al. [2014]
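A minimal PyTorch sketch of one deterministic policy gradient actor update (not the algorithm from the paper; network shapes, the critic, and its training are illustrative assumptions). Ascending Q(s, π(s; θ)) in θ applies exactly the chain rule ∇_θ π(s; θ) ∇_a Q^π(s, a)|_{a = π(s)}:

```python
import torch

state_dim, action_dim = 3, 1                   # illustrative sizes
actor = torch.nn.Sequential(                   # deterministic policy pi(s; theta)
    torch.nn.Linear(state_dim, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, action_dim), torch.nn.Tanh())
critic = torch.nn.Sequential(                  # action value estimate Q(s, a)
    torch.nn.Linear(state_dim + action_dim, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1))
actor_opt = torch.optim.SGD(actor.parameters(), lr=1e-3)

def deterministic_pg_step(states):
    """One actor update on a batch of states (assumed sampled from eta^pi)."""
    actions = actor(states)                                  # a = pi(s; theta)
    q = critic(torch.cat([states, actions], dim=1))          # Q(s, pi(s; theta))
    loss = -q.mean()                                         # ascend Q <=> descend -Q
    actor_opt.zero_grad()
    loss.backward()    # chain rule: grad_theta pi(s) * grad_a Q(s, a)|_{a = pi(s)}
    actor_opt.step()   # only the actor is updated; training the critic is omitted here
```

For example, `deterministic_pg_step(torch.randn(64, state_dim))` performs one update on a random batch of states.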

Planning and Optimal Control 4. Deterministic Policy Gradients

Overview

what to learn \ from what to learn   from episodes                     from transitions

value function V^π                   Monte Carlo for V^π               Temporal Differences (TD)
action value function Q^π            Monte Carlo for Q^π               SARSA
optimal action value function Q_*    —                                 Q Learning
optimal policy π_*                   Monte Carlo Policy Gradient /     TD Policy Gradient /
                                     REINFORCE                         Actor Critic
Planning and Optimal Control 4. Deterministic Policy Gradients

Summary
• Policy gradient methods parametrize the policy directly, π(a | s; θ),
  instead of parametrizing the action value function Q(s, a; θ)
  and then deriving the policy as π(s) := arg max_a Q(s, a; θ).

• Gradients of the value function can be computed from gradients of the policy via the policy gradient theorem:

    \nabla_\theta V^\pi(s_0) \propto \sum_s \mu^\pi(s) \sum_a \nabla \pi(a \mid s; \theta)\, Q^\pi(s, a)

• Combined with observed values (Monte Carlo approach), gradient ascent on the (implicit) value function leads to a simple update rule for the policy parameters θ, called REINFORCE:

    \theta := \theta + \alpha\, v_t\, \nabla \log \pi(a_t \mid s_t)

• Any baseline for the observed value can be used and often accelerates convergence,
  especially a current estimate of the value function:

    \theta := \theta + \alpha\, \big(v_t - \hat{V}(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)
Planning and Optimal Control 4. Deterministic Policy Gradients

Summary (2/2)

• Instead of using observed values (MC approach), one can also combine policy gradients with temporal differences, which yields actor-critic methods:

    \theta := \theta + \alpha\, \big(r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)\big)\, \nabla \log \pi(a_t \mid s_t)

• Then, besides the policy model (the actor), a second model for the value function (the critic) is required.

Planning and Optimal Control

Further Readings
• Reinforcement Learning:
  • Olivier Sigaud, Frederick Garcia (2010): Reinforcement Learning, ch. 1 in Sigaud and Buffet [2010].
  • Sutton and Barto [2018]

Planning and Optimal Control

References
Olivier Sigaud and Olivier Buffet, editors. Markov Decision Processes in Artificial Intelligence. Wiley, 2010.
David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient
algorithms. In ICML, 2014.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

