
Conservative Policy Iteration

Recap

Recall Policy Iteration (PI) with known (P, r).

Assume the MDP is known, so we can compute A^{π_old}(s, a) exactly for all s, a. PI updates the policy as:

π_new(s) = arg max_a A^{π_old}(s, a)

i.e., pick an action that has the largest advantage against π_old at every state s.

Maximizing the advantage is great, as it gives monotonic improvement:

Q^{π_new}(s, a) ≥ Q^{π_old}(s, a), ∀s, a

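To make the recap concrete, here is a minimal tabular sketch of exact PI with a known (P, r); the array shapes and the helper names policy_evaluation / policy_iteration are illustrative choices for this sketch, not anything prescribed by the lecture.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """Exact Q^pi for a deterministic policy pi (array of actions), known P (S x A x S) and r (S x A)."""
    S, A, _ = P.shape
    P_pi = P[np.arange(S), pi]            # S x S transition matrix under pi
    r_pi = r[np.arange(S), pi]            # S-dim reward vector under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi = (I - gamma P_pi)^{-1} r_pi
    Q = r + gamma * P @ V                 # Q^pi(s, a), shape S x A
    return Q, V

def policy_iteration(P, r, gamma, max_iters=1000):
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    for _ in range(max_iters):
        Q, V = policy_evaluation(P, r, pi, gamma)
        adv = Q - V[:, None]              # A^{pi_old}(s, a)
        pi_new = np.argmax(adv, axis=1)   # pi_new(s) = argmax_a A^{pi_old}(s, a)
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new
    return pi
```
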
Recap

Recall Policy Iteration (PI):

π_new(s) = arg max_a A^{π_old}(s, a)

Performance Difference Lemma (PDL): for all s_0 ∈ S,

V^{π_new}(s_0) − V^{π_old}(s_0) = (1/(1 − γ)) 𝔼_{s,a ∼ d^{π_new}_{s_0}} [A^{π_old}(s, a)]

i.e., the advantage against π_old, averaged over π_new's own distribution.
Today:
Conservative Policy Iteration

Q: How do we enforce incremental policy updates and ensure monotonic improvement?


Outline

1. Greedy Policy Selection (via reduction to regression) and recap of API

2. Conservative Policy Iteration

3. Monotonic Improvement of CPI


Setting and Notation

Discounted infinite-horizon MDP:

ℳ = {S, A, γ, r, P, μ}

State visitation distribution: d^π_μ(s) = (1 − γ) Σ_{h=0}^{∞} γ^h ℙ^π_h(s; μ)

As we will consider large-scale, unknown MDPs here, we start with a (restricted) policy class Π:

Π = {π : S ↦ A}

(We can think of each policy as a classifier. From now on, think of a deterministic policy as a special case of a stochastic policy.)

Note that the optimal policy π⋆ may not be in Π.

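Since d^π_μ will serve as a sampling distribution throughout, here is a minimal sketch of one standard way to draw s ∼ d^π_μ given only rollout access, assuming a gymnasium-style environment; the helper name sample_from_d_mu and the restart-on-termination heuristic are assumptions of this sketch, not part of the lecture.

```python
import numpy as np

def sample_from_d_mu(env, policy, gamma, rng=None):
    """Draw one state s ~ d^pi_mu (illustrative sketch).

    Uses d^pi_mu(s) = (1 - gamma) * sum_h gamma^h P^pi_h(s; mu):
    draw a horizon H with P(H = h) = (1 - gamma) * gamma^h, then roll out pi
    for H steps from s_0 ~ mu and return the state reached.
    """
    rng = rng or np.random.default_rng()
    H = rng.geometric(1.0 - gamma) - 1       # support {0, 1, 2, ...}
    s, _ = env.reset()                        # s_0 ~ mu
    for _ in range(H):
        a = policy(s)
        s, _, terminated, truncated, _ = env.step(a)
        if terminated or truncated:
            s, _ = env.reset()                # restart if the episode ends early (heuristic)
    return s
```
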

Recall Approximate Policy Iteration (API)

Given the current policy π^t, let's find a new policy that has a large local advantage over π^t under d^{π^t}_μ,

i.e., let's aim to (approximately) solve the following program:

arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]     (Greedy Policy Selector)

How do we implement such a greedy policy selector? We talked about a reduction to regression.


Implementing the Approximate Greedy Policy Selector via Regression

We can do a reduction to regression via advantage function approximation:

ℱ = {f : S × A ↦ ℝ}     (each f ≈ A^{π^t})

Π = {π(s) = arg max_a f(s, a) : f ∈ ℱ}

Collect data {s_i, a_i, y_i} with s_i ∼ d^{π^t}_μ, a_i ∼ U(A), 𝔼[y_i] = A^{π^t}(s_i, a_i)

Regression oracle:

Â^t = arg min_{f∈ℱ} Σ_i (f(s_i, a_i) − y_i)²

Act greedily w.r.t. the estimator Â^t (as we hope Â^t ≈ A^{π^t}):

π̂(s) = arg max_a Â^t(s, a), ∀s

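As one possible (hypothetical) instantiation of this reduction, the sketch below fits Â^t by least squares over a linear class on features φ(s, a) and returns the induced greedy policy; the feature map phi and the data format are assumptions of the sketch, not prescribed by the lecture.

```python
import numpy as np

def fit_advantage(phi, data):
    """Least-squares regression oracle: A_hat = argmin_f sum_i (f(s_i, a_i) - y_i)^2,
    for the linear class f(s, a) = phi(s, a) @ w. `data` is a list of (s_i, a_i, y_i)
    with s_i ~ d^{pi_t}_mu, a_i ~ Uniform(A), and E[y_i] = A^{pi_t}(s_i, a_i)."""
    X = np.stack([phi(s, a) for s, a, _ in data])
    y = np.array([yi for _, _, yi in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def greedy_policy(phi, w, actions):
    """pi_hat(s) = argmax_a A_hat(s, a), where A_hat(s, a) = phi(s, a) @ w."""
    def pi_hat(s):
        scores = [phi(s, a) @ w for a in actions]
        return actions[int(np.argmax(scores))]
    return pi_hat
```
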
Successful Regression Ensures an Approximate Greedy Operator

Given data {s_i, a_i, y_i} with s_i ∼ d^{π^t}_μ, a_i ∼ U(A), 𝔼[y_i] = A^{π^t}(s_i, a_i), and the regression oracle

Â^t = arg min_{f∈ℱ} Σ_i (f(s_i, a_i) − y_i)²,

assume the regression is successful, i.e.,

𝔼_{s∼d^{π^t}_μ, a∼U(A)} (Â^t(s, a) − A^{π^t}(s, a))² ≤ δ

Then π̂(s) = arg max_a Â^t(s, a) is an approximate greedy policy:

𝔼_{s∼d^{π^t}_μ} A^{π^t}(s, π̂(s)) ≥ max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} A^{π^t}(s, π(s)) − O(√δ)

Summary So Far

By reduction to supervised learning (i.e., via regression to approximate A^{π^t} under d^{π^t}_μ), we can expect to find an approximate greedy optimizer π̂ such that

𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π̂(s))] ≈ max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]

Throughout this lecture, we will simply assume we can achieve arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] exactly.

Outline

1. Greedy Policy Selection

2. Conservative Policy Iteration

3. Monotonic Improvement of CPI


The Failure Case of API: Abrupt Distribution Change

API is not guaranteed to succeed (consider the advantage function approximation setting).

[Figure: advantage estimates plotted against (s, a). The estimate Â^t, fit under d^{π^t}_μ, induces the greedy policy π^{t+1}; refitting under d^{π^{t+1}}_μ gives Â^{t+1}, whose greedy policy jumps back again.]

Oscillation between the two updates: no monotonic improvement.

Key Idea of CPI: Incremental Updates (No Abrupt Distribution Change)

Let's design the policy update rule so that d^{π^{t+1}}_μ and d^{π^t}_μ are not that different!

Recall the Performance Difference Lemma:

V^{π^{t+1}} − V^{π^t} = (1/(1 − γ)) 𝔼_{s∼d^{π^{t+1}}_μ} [A^{π^t}(s, π^{t+1}(s))]

If d^{π^t}_μ ≈ d^{π^{t+1}}_μ, then

𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π^{t+1}(s))] ≈ 𝔼_{s∼d^{π^{t+1}}_μ} [A^{π^t}(s, π^{t+1}(s))]

and the left-hand side is something we know how to optimize: the Greedy Policy Selector.


The CPI Algorithm

Initialize π^0.
For t = 0, 1, 2, …

1. Greedy Policy Selector:

π′ ∈ arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]

2. If max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] ≤ ε, return π^t.

3. Incremental Update:

π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

Q: Why is this incremental? In what sense?
Q: Can we get monotonic policy improvement?

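Putting the three steps together, here is a minimal sketch of the CPI loop for tabular stochastic policies stored as S × A arrays; the oracles greedy_selector and local_advantage stand in for step 1 and the termination test, and are assumed to be provided rather than implemented here.

```python
import numpy as np

def cpi(pi0, greedy_selector, local_advantage, alpha, eps, max_iters=100):
    """Conservative Policy Iteration (sketch).

    pi0:              initial stochastic policy, array of shape (S, A), rows sum to 1
    greedy_selector:  pi_t -> pi_prime (S x A, e.g. one-hot rows) approximating
                      argmax_{pi in Pi} E_{s ~ d^{pi_t}_mu}[A^{pi_t}(s, pi(s))]
    local_advantage:  (pi_t, pi_prime) -> E_{s ~ d^{pi_t}_mu}[A^{pi_t}(s, pi_prime(s))]
    """
    pi = pi0
    for t in range(max_iters):
        pi_prime = greedy_selector(pi)                 # 1. greedy policy selector
        if local_advantage(pi, pi_prime) <= eps:       # 2. termination test
            return pi
        pi = (1 - alpha) * pi + alpha * pi_prime       # 3. incremental (conservative) update
    return pi
```
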

Today: Policy Optimization

1. Greedy Policy Selection

2. Conservative Policy Iteration

3. Monotonic Improvement of CPI


Q1: Why is this incremental? In what sense?

π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

Key observation 1:

For any state s, we have ∥π^{t+1}(⋅ | s) − π^t(⋅ | s)∥₁ ≤ 2α

Key observation 2 (Lemma 12.1 in AJKS):

For any two policies π and π′, if ∥π(⋅ | s) − π′(⋅ | s)∥₁ ≤ δ for all s, then ∥d^π_μ(⋅) − d^{π′}_μ(⋅)∥₁ ≤ γδ/(1 − γ)

So CPI ensures an incremental update of the state distribution, i.e., ∥d^{π^{t+1}}_μ(⋅) − d^{π^t}_μ(⋅)∥₁ ≤ 2γα/(1 − γ)

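A quick numeric sanity check of Key observation 1 on a single toy state (the probabilities below are made up for illustration):

```python
import numpy as np

alpha = 0.1
pi_t = np.array([0.7, 0.2, 0.1])         # pi^t(. | s) for one state s (toy numbers)
pi_prime = np.array([0.0, 0.0, 1.0])     # pi'(. | s)
pi_next = (1 - alpha) * pi_t + alpha * pi_prime

# ||pi^{t+1}(.|s) - pi^t(.|s)||_1 = alpha * ||pi' - pi^t||_1 <= 2 * alpha
l1 = np.abs(pi_next - pi_t).sum()
print(l1, "<=", 2 * alpha)               # approximately 0.18 <= 0.2
```
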
Monotonic Improvement before Termination

Recall CPI:
1. Greedy Policy Selector: π′ ∈ arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]
2. If max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] ≤ ε, return π^t
3. Incremental Update: π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

Before we terminate, we have a non-trivial average local advantage:

𝔸 := 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π′(s))] ≥ ε

Can we translate the local advantage 𝔸 into V^{π^{t+1}} − V^{π^t}? (Yes, by the PDL.)

(1 − γ)(V^{π^{t+1}}_μ − V^{π^t}_μ) = 𝔼_{s∼d^{π^{t+1}}_μ} [𝔼_{a∼π^{t+1}(⋅|s)} A^{π^t}(s, a)]

= 𝔼_{s∼d^{π^{t+1}}_μ} [α A^{π^t}(s, π′(s))]
(Why? π^{t+1} = (1 − α)π^t + απ′, and 𝔼_{a∼π^t(⋅|s)} A^{π^t}(s, a) = 0, so only the απ′ part contributes.)

= 𝔼_{s∼d^{π^t}_μ} [α A^{π^t}(s, π′(s))] + (𝔼_{s∼d^{π^{t+1}}_μ} [α A^{π^t}(s, π′(s))] − 𝔼_{s∼d^{π^t}_μ} [α A^{π^t}(s, π′(s))])

≥ 𝔼_{s∼d^{π^t}_μ} [α A^{π^t}(s, π′(s))] − (α/(1 − γ)) ∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁
(using |A^{π^t}(s, a)| ≤ 1/(1 − γ) for bounded rewards)

≥ α𝔸 − 2γα²/(1 − γ)²
(using ∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁ ≤ 2γα/(1 − γ))

Setting α = (1 − γ)²𝔸/(4γ) and dividing both sides by (1 − γ) gives

V^{π^{t+1}}_μ − V^{π^t}_μ ≥ 𝔸²(1 − γ)/(8γ)

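For reference, the step-size choice and the resulting guaranteed improvement can be packaged in a tiny helper; this is only a sketch that assumes the measured local advantage 𝔸 and the discount γ are given.

```python
def cpi_step(A_local, gamma):
    """Step size alpha = (1 - gamma)^2 * A / (4 * gamma) and the per-update
    improvement lower bound A^2 * (1 - gamma) / (8 * gamma) from the derivation above."""
    alpha = (1 - gamma) ** 2 * A_local / (4 * gamma)
    improvement_lb = A_local ** 2 * (1 - gamma) / (8 * gamma)
    return alpha, improvement_lb

# e.g., with gamma = 0.9 and local advantage A = 0.5:
# alpha = 0.01 * 0.5 / 3.6 ~ 0.00139, improvement >= 0.25 * 0.1 / 7.2 ~ 0.0035
```
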
Summary of CPI So Far

Recall CPI:
1. Greedy Policy Selector: π′ ∈ arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]
2. If max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] ≤ ε, return π^t
3. Incremental Update: π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

1. Incremental update (Lemma 12.1 in AJKS):

∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁ ≤ 2γα/(1 − γ)

2. Before termination, monotonic improvement (Thm 12.2 in AJKS), by trading off the local advantage against the distribution change:

(1 − γ)(V^{π^{t+1}}_μ − V^{π^t}_μ) ≥ α 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π′(s))] − (α/(1 − γ)) ∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁ ≥ αε − 2γα²/(1 − γ)²

By setting the step size α properly,

V^{π^{t+1}}_μ − V^{π^t}_μ ≥ ε²(1 − γ)/(8γ)

Summary for today:

1. Algorithm: Conservative Policy Iteration: find the locally greedy policy, and move toward it a little bit.

2. A small change in the policies results in a small change in the state distributions.

3. Unlike API, the incremental policy update ensures monotonic improvement.