
Conservative Policy Iteration

Recap

Recall Policy Iteration (PI) with known (P, r).

Assume the MDP is known, so we can compute A^{π_old}(s, a) exactly for all s, a. PI updates the policy as:

π_new(s) = arg max_a A^{π_old}(s, a)

i.e., pick an action that has the largest advantage against π_old at every state s.

Maximizing the advantage is great, as it gives monotonic improvement:

Q^{π_new}(s, a) ≥ Q^{π_old}(s, a), ∀s, a

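To make the recap concrete, here is a minimal tabular sketch of exact PI with a known (P, r); the array shapes and the helper names policy_evaluation / policy_iteration are illustrative choices for this sketch, not anything prescribed by the lecture.

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """Exact Q^pi for a deterministic policy pi (array of actions), known P (S x A x S) and r (S x A)."""
    S, A, _ = P.shape
    P_pi = P[np.arange(S), pi]            # S x S transition matrix under pi
    r_pi = r[np.arange(S), pi]            # S-dim reward vector under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # V^pi = (I - gamma P_pi)^{-1} r_pi
    Q = r + gamma * P @ V                 # Q^pi(s, a), shape S x A
    return Q, V

def policy_iteration(P, r, gamma, max_iters=1000):
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    for _ in range(max_iters):
        Q, V = policy_evaluation(P, r, pi, gamma)
        adv = Q - V[:, None]              # A^{pi_old}(s, a)
        pi_new = np.argmax(adv, axis=1)   # pi_new(s) = argmax_a A^{pi_old}(s, a)
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new
    return pi
```
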
Recap

Recall Policy Iteration (PI):

π_new(s) = arg max_a A^{π_old}(s, a)

Performance Difference Lemma (PDL): for all s_0 ∈ S,

V^{π_new}(s_0) − V^{π_old}(s_0) = (1/(1 − γ)) 𝔼_{s,a ∼ d^{π_new}_{s_0}} [A^{π_old}(s, a)]

i.e., the advantage against π_old, averaged over π_new's own distribution.
Today:
Conservative Policy Iteration

Q: How do we enforce incremental policy updates and ensure monotonic improvement?


Outline

1. Greedy Policy Selection (via reduction to regression) and recap of API

2. Conservative Policy Iteration

3. Monotonic Improvement of CPI


Setting and Notation

Discounted infinite-horizon MDP:

ℳ = {S, A, γ, r, P, μ}

State visitation distribution: d^π_μ(s) = (1 − γ) Σ_{h=0}^{∞} γ^h ℙ^π_h(s; μ)

As we will consider large-scale, unknown MDPs here, we start with a (restricted) policy class Π:

Π = {π : S ↦ A}

(We can think of each policy as a classifier. From now on, think of a deterministic policy as a special case of a stochastic policy.)

Note that the optimal policy π⋆ may not be in Π.

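Since d^π_μ will serve as a sampling distribution throughout, here is a minimal sketch of one standard way to draw s ∼ d^π_μ given only rollout access, assuming a gymnasium-style environment; the helper name sample_from_d_mu and the restart-on-termination heuristic are assumptions of this sketch, not part of the lecture.

```python
import numpy as np

def sample_from_d_mu(env, policy, gamma, rng=None):
    """Draw one state s ~ d^pi_mu (illustrative sketch).

    Uses d^pi_mu(s) = (1 - gamma) * sum_h gamma^h P^pi_h(s; mu):
    draw a horizon H with P(H = h) = (1 - gamma) * gamma^h, then roll out pi
    for H steps from s_0 ~ mu and return the state reached.
    """
    rng = rng or np.random.default_rng()
    H = rng.geometric(1.0 - gamma) - 1       # support {0, 1, 2, ...}
    s, _ = env.reset()                        # s_0 ~ mu
    for _ in range(H):
        a = policy(s)
        s, _, terminated, truncated, _ = env.step(a)
        if terminated or truncated:
            s, _ = env.reset()                # restart if the episode ends early (heuristic)
    return s
```
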

Recall Approximate Policy Iteration (API)

Given the current policy π^t, let's find a new policy that has a large local advantage over π^t under d^{π^t}_μ,

i.e., let's aim to (approximately) solve the following program:

arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]     (Greedy Policy Selector)

How do we implement such a greedy policy selector? We talked about a reduction to regression.


Implementing the Approximate Greedy Policy Selector via Regression

We can do a reduction to regression via advantage function approximation:

ℱ = {f : S × A ↦ ℝ}     (each f ≈ A^{π^t})

Π = {π(s) = arg max_a f(s, a) : f ∈ ℱ}

Collect data {s_i, a_i, y_i} with s_i ∼ d^{π^t}_μ, a_i ∼ U(A), 𝔼[y_i] = A^{π^t}(s_i, a_i)

Regression oracle:

Â^t = arg min_{f∈ℱ} Σ_i (f(s_i, a_i) − y_i)²

Act greedily w.r.t. the estimator Â^t (as we hope Â^t ≈ A^{π^t}):

π̂(s) = arg max_a Â^t(s, a), ∀s

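As one possible (hypothetical) instantiation of this reduction, the sketch below fits Â^t by least squares over a linear class on features φ(s, a) and returns the induced greedy policy; the feature map phi and the data format are assumptions of the sketch, not prescribed by the lecture.

```python
import numpy as np

def fit_advantage(phi, data):
    """Least-squares regression oracle: A_hat = argmin_f sum_i (f(s_i, a_i) - y_i)^2,
    for the linear class f(s, a) = phi(s, a) @ w. `data` is a list of (s_i, a_i, y_i)
    with s_i ~ d^{pi_t}_mu, a_i ~ Uniform(A), and E[y_i] = A^{pi_t}(s_i, a_i)."""
    X = np.stack([phi(s, a) for s, a, _ in data])
    y = np.array([yi for _, _, yi in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def greedy_policy(phi, w, actions):
    """pi_hat(s) = argmax_a A_hat(s, a), where A_hat(s, a) = phi(s, a) @ w."""
    def pi_hat(s):
        scores = [phi(s, a) @ w for a in actions]
        return actions[int(np.argmax(scores))]
    return pi_hat
```
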
Successful Regression Ensures an Approximate Greedy Operator

Given data {s_i, a_i, y_i} with s_i ∼ d^{π^t}_μ, a_i ∼ U(A), 𝔼[y_i] = A^{π^t}(s_i, a_i), and the regression oracle

Â^t = arg min_{f∈ℱ} Σ_i (f(s_i, a_i) − y_i)²,

assume the regression is successful, i.e.,

𝔼_{s∼d^{π^t}_μ, a∼U(A)} (Â^t(s, a) − A^{π^t}(s, a))² ≤ δ

Then π̂(s) = arg max_a Â^t(s, a) is an approximate greedy policy:

𝔼_{s∼d^{π^t}_μ} A^{π^t}(s, π̂(s)) ≥ max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} A^{π^t}(s, π(s)) − O(√δ)

Summary So Far

By reduction to supervised learning (i.e., via regression to approximate A^{π^t} under d^{π^t}_μ), we can expect to find an approximate greedy optimizer π̂ such that

𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π̂(s))] ≈ max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]

Throughout this lecture, we will simply assume we can achieve arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] exactly.

Outline

1. Greedy Policy Selection

2. Conservative Policy Iteration

3. Monotonic Improvement of CPI


The Failure Case of API: Abrupt Distribution Change

API is not guaranteed to succeed (consider the advantage function approximation setting).

[Figure: advantage estimates plotted against (s, a). The estimate Â^t, fit under d^{π^t}_μ, induces the greedy policy π^{t+1}; refitting under d^{π^{t+1}}_μ gives Â^{t+1}, whose greedy policy jumps back again.]

Oscillation between the two updates: no monotonic improvement.

Key Idea of CPI: Incremental Updates (No Abrupt Distribution Change)

Let's design the policy update rule so that d^{π^{t+1}}_μ and d^{π^t}_μ are not that different!

Recall the Performance Difference Lemma:

V^{π^{t+1}} − V^{π^t} = (1/(1 − γ)) 𝔼_{s∼d^{π^{t+1}}_μ} [A^{π^t}(s, π^{t+1}(s))]

If d^{π^t}_μ ≈ d^{π^{t+1}}_μ, then

𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π^{t+1}(s))] ≈ 𝔼_{s∼d^{π^{t+1}}_μ} [A^{π^t}(s, π^{t+1}(s))]

and the left-hand side is something we know how to optimize: the Greedy Policy Selector.


The CPI Algorithm

Initialize π^0.
For t = 0, 1, 2, …

1. Greedy Policy Selector:

π′ ∈ arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]

2. If max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] ≤ ε, return π^t.

3. Incremental Update:

π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

Q: Why is this incremental? In what sense?
Q: Can we get monotonic policy improvement?

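Putting the three steps together, here is a minimal sketch of the CPI loop for tabular stochastic policies stored as S × A arrays; the oracles greedy_selector and local_advantage stand in for step 1 and the termination test, and are assumed to be provided rather than implemented here.

```python
import numpy as np

def cpi(pi0, greedy_selector, local_advantage, alpha, eps, max_iters=100):
    """Conservative Policy Iteration (sketch).

    pi0:              initial stochastic policy, array of shape (S, A), rows sum to 1
    greedy_selector:  pi_t -> pi_prime (S x A, e.g. one-hot rows) approximating
                      argmax_{pi in Pi} E_{s ~ d^{pi_t}_mu}[A^{pi_t}(s, pi(s))]
    local_advantage:  (pi_t, pi_prime) -> E_{s ~ d^{pi_t}_mu}[A^{pi_t}(s, pi_prime(s))]
    """
    pi = pi0
    for t in range(max_iters):
        pi_prime = greedy_selector(pi)                 # 1. greedy policy selector
        if local_advantage(pi, pi_prime) <= eps:       # 2. termination test
            return pi
        pi = (1 - alpha) * pi + alpha * pi_prime       # 3. incremental (conservative) update
    return pi
```
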

Today: Policy Optimization

1. Greedy Policy Selection

2. Conservative Policy Iteration

3. Monotonic Improvement of CPI


Q1: Why is this incremental? In what sense?

π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

Key observation 1:

For any state s, we have ∥π^{t+1}(⋅ | s) − π^t(⋅ | s)∥₁ ≤ 2α

Key observation 2 (Lemma 12.1 in AJKS):

For any two policies π and π′, if ∥π(⋅ | s) − π′(⋅ | s)∥₁ ≤ δ for all s, then ∥d^π_μ(⋅) − d^{π′}_μ(⋅)∥₁ ≤ γδ/(1 − γ)

So CPI ensures an incremental update of the state distribution, i.e., ∥d^{π^{t+1}}_μ(⋅) − d^{π^t}_μ(⋅)∥₁ ≤ 2γα/(1 − γ)

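A quick numeric sanity check of Key observation 1 on a single toy state (the probabilities below are made up for illustration):

```python
import numpy as np

alpha = 0.1
pi_t = np.array([0.7, 0.2, 0.1])         # pi^t(. | s) for one state s (toy numbers)
pi_prime = np.array([0.0, 0.0, 1.0])     # pi'(. | s)
pi_next = (1 - alpha) * pi_t + alpha * pi_prime

# ||pi^{t+1}(.|s) - pi^t(.|s)||_1 = alpha * ||pi' - pi^t||_1 <= 2 * alpha
l1 = np.abs(pi_next - pi_t).sum()
print(l1, "<=", 2 * alpha)               # approximately 0.18 <= 0.2
```
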
Monotonic Improvement before Termination

Recall CPI:
1. Greedy Policy Selector: π′ ∈ arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]
2. If max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] ≤ ε, return π^t
3. Incremental Update: π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

Before we terminate, we have a non-trivial average local advantage:

𝔸 := 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π′(s))] ≥ ε

Can we translate the local advantage 𝔸 into V^{π^{t+1}} − V^{π^t}? (Yes, by the PDL.)

(1 − γ)(V^{π^{t+1}}_μ − V^{π^t}_μ) = 𝔼_{s∼d^{π^{t+1}}_μ} [𝔼_{a∼π^{t+1}(⋅|s)} A^{π^t}(s, a)]

= 𝔼_{s∼d^{π^{t+1}}_μ} [α A^{π^t}(s, π′(s))]
(Why? π^{t+1} = (1 − α)π^t + απ′, and 𝔼_{a∼π^t(⋅|s)} A^{π^t}(s, a) = 0, so only the απ′ part contributes.)

= 𝔼_{s∼d^{π^t}_μ} [α A^{π^t}(s, π′(s))] + (𝔼_{s∼d^{π^{t+1}}_μ} [α A^{π^t}(s, π′(s))] − 𝔼_{s∼d^{π^t}_μ} [α A^{π^t}(s, π′(s))])

≥ 𝔼_{s∼d^{π^t}_μ} [α A^{π^t}(s, π′(s))] − (α/(1 − γ)) ∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁
(using |A^{π^t}(s, a)| ≤ 1/(1 − γ) for bounded rewards)

≥ α𝔸 − 2γα²/(1 − γ)²
(using ∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁ ≤ 2γα/(1 − γ))

Setting α = (1 − γ)²𝔸/(4γ) and dividing both sides by (1 − γ) gives

V^{π^{t+1}}_μ − V^{π^t}_μ ≥ 𝔸²(1 − γ)/(8γ)

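For reference, the step-size choice and the resulting guaranteed improvement can be packaged in a tiny helper; this is only a sketch that assumes the measured local advantage 𝔸 and the discount γ are given.

```python
def cpi_step(A_local, gamma):
    """Step size alpha = (1 - gamma)^2 * A / (4 * gamma) and the per-update
    improvement lower bound A^2 * (1 - gamma) / (8 * gamma) from the derivation above."""
    alpha = (1 - gamma) ** 2 * A_local / (4 * gamma)
    improvement_lb = A_local ** 2 * (1 - gamma) / (8 * gamma)
    return alpha, improvement_lb

# e.g., with gamma = 0.9 and local advantage A = 0.5:
# alpha = 0.01 * 0.5 / 3.6 ~ 0.00139, improvement >= 0.25 * 0.1 / 7.2 ~ 0.0035
```
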
Summary of CPI So Far

Recall CPI:
1. Greedy Policy Selector: π′ ∈ arg max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))]
2. If max_{π∈Π} 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π(s))] ≤ ε, return π^t
3. Incremental Update: π^{t+1}(⋅ | s) = (1 − α) π^t(⋅ | s) + α π′(⋅ | s), ∀s

1. Incremental update (Lemma 12.1 in AJKS):

∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁ ≤ 2γα/(1 − γ)

2. Before termination, monotonic improvement (Thm 12.2 in AJKS), by trading off the local advantage against the distribution change:

(1 − γ)(V^{π^{t+1}}_μ − V^{π^t}_μ) ≥ α 𝔼_{s∼d^{π^t}_μ} [A^{π^t}(s, π′(s))] − (α/(1 − γ)) ∥d^{π^{t+1}}_μ − d^{π^t}_μ∥₁ ≥ αε − 2γα²/(1 − γ)²

By setting the step size α properly,

V^{π^{t+1}}_μ − V^{π^t}_μ ≥ ε²(1 − γ)/(8γ)

Summary for today:

1. Algorithm: Conservative Policy Iteration: find the locally greedy policy, and move toward it a little bit.

2. A small change in the policies results in a small change in the state distributions.

3. Unlike API, the incremental policy update ensures monotonic improvement.