Recap
Assume the MDP is known and we can compute A^{π_old}(s, a) exactly for all s, a. Policy Iteration (PI) updates the policy as:
π_new(s) = arg max_a A^{π_old}(s, a),
i.e., pick an action that has the largest advantage against π_old at every state s.
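Since the advantage is exact and tabular here, the update is one line. A minimal sketch, assuming A^{π_old} is given as an (S × A) array:

```python
import numpy as np

def pi_update(A_old: np.ndarray) -> np.ndarray:
    """Exact PI update: pi_new(s) = argmax_a A^{pi_old}(s, a).

    A_old: (S, A) array of exact advantages under the current policy.
    Returns a deterministic policy as action indices, shape (S,).
    """
    return A_old.argmax(axis=1)
```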
Setting and Notation
ℳ = {S, A, γ, r, P, μ}
State visitation: d_μ^π(s) = (1 − γ) ∑_{h=0}^∞ γ^h ℙ_h^π(s; μ)
As we will consider large-scale unknown MDPs here, we start with a (restricted) function class Π:
Π = {π : S ↦ A}
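In the tabular (known-MDP) case the state visitation is computable directly from its definition. A minimal sketch, truncating the infinite sum once γ^h is negligible:

```python
import numpy as np

def state_visitation(P, pi, mu, gamma, tol=1e-10):
    """Discounted state visitation d_mu^pi for a tabular MDP.

    P:  transitions, shape (S, A, S'), P[s, a, s'] = P(s' | s, a)
    pi: deterministic policy, shape (S,), pi[s] in {0, ..., A-1}
    mu: initial state distribution, shape (S,)
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]            # (S, S'): transition matrix under pi
    d, nu, coef = np.zeros(S), mu.copy(), 1.0
    # d_mu^pi(s) = (1 - gamma) * sum_h gamma^h * Pr_h^pi(s; mu)
    while coef > tol:
        d += (1.0 - gamma) * coef * nu
        nu = nu @ P_pi                    # state distribution at step h + 1
        coef *= gamma
    return d
```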
Recall Approximate Policy Iteration (API)
Given the current policy π^t, let's find a new policy that has a large local advantage over π^t under d_μ^{π^t}:
π^{t+1} = arg max_{π∈Π} 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π(s))]
Implementing Approximate Greedy Policy Selector via Regression
ℱ = {f : S × A ↦ ℝ} (≈ A^{π^t})
Π = {π(s) = arg max_a f(s, a) : f ∈ ℱ}
Regression oracle:
Â^t = arg min_{f∈ℱ} ∑_i (f(s_i, a_i) − y_i)²
Act greedily w.r.t. the estimator Â^t (as we hope Â^t ≈ A^{π^t}):
π^{t+1}(s) = arg max_a Â^t(s, a)
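A minimal sketch of this reduction, assuming a linear class ℱ with a user-supplied feature map `features(s, a)` (hypothetical) and regression targets y_i that estimate A^{π^t}(s_i, a_i):

```python
import numpy as np

def regression_oracle(features, states, actions, targets):
    """A_hat^t = argmin_{f in F} sum_i (f(s_i, a_i) - y_i)^2, for linear F.

    features: assumed feature map (s, a) -> R^d; any least-squares
    regressor over F could be swapped in here.
    """
    X = np.stack([features(s, a) for s, a in zip(states, actions)])
    w, *_ = np.linalg.lstsq(X, np.asarray(targets), rcond=None)
    return w

def greedy_policy(w, features, num_actions):
    """pi^{t+1}(s) = argmax_a A_hat^t(s, a): act greedily w.r.t. the estimator."""
    def pi(s):
        return max(range(num_actions), key=lambda a: features(s, a) @ w)
    return pi
```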
Successful Regression ensures approximate greedy operator
Regression oracle:
Â^t = arg min_{f∈ℱ} ∑_i (f(s_i, a_i) − y_i)²
Suppose the regression succeeds, i.e.,
𝔼_{s∼d_μ^{π^t}, a∼U(A)}[(Â^t(s, a) − A^{π^t}(s, a))²] ≤ δ
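To make the success condition concrete, here is a minimal Monte Carlo sketch of its left-hand side; the sampler `sample_state` (drawing s ∼ d_μ^{π^t}, e.g., by rolling out π^t with γ-geometric stopping) and the two advantage functions are assumed given:

```python
import numpy as np

def regression_error(A_hat, A_true, sample_state, num_actions, n=100_000, seed=0):
    """Monte Carlo estimate of E_{s ~ d_mu^{pi^t}, a ~ U(A)} (A_hat - A^{pi^t})^2."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n):
        s = sample_state()                  # s ~ d_mu^{pi^t} (assumed sampler)
        a = int(rng.integers(num_actions))  # a ~ Uniform(A)
        total += (A_hat(s, a) - A_true(s, a)) ** 2
    return total / n                        # should be <= delta if regression succeeded
```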
Summary So Far:
By reduction to supervised learning (i.e., via regression to approximate A^{π^t} under d_μ^{π^t}), we can expect to find an approximate greedy optimizer π̂. From now on, we will simply assume we can achieve arg max_{π∈Π} 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π(s))].
The Failure case of API: Abrupt distribution change
API cannot be guaranteed to succeed (let's think about the advantage function approximation setting).
(Figure: advantage estimates plotted over (s, a): the true A^{π^t}, the estimator Â^t, which induces π^{t+1} by greedy selection, and the next estimator Â^{t+1}. The greedy jump from π^t to π^{t+1} abruptly changes the state distribution, under which the fitted estimators are no longer accurate.)
Key Idea of CPI: Incremental Update—No Abrupt Distribution Change
Let's design a policy update rule such that d_μ^{π^{t+1}} and d_μ^{π^t} are not that different!
Recall the Performance Difference Lemma (PDL):
V_μ^{π^{t+1}} − V_μ^{π^t} = (1/(1 − γ)) 𝔼_{s∼d_μ^{π^{t+1}}}[A^{π^t}(s, π^{t+1}(s))]
If d_μ^{π^t} ≈ d_μ^{π^{t+1}}, then the right-hand side can be estimated under the current distribution d_μ^{π^t}, which we can sample from.
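As a sanity check, the PDL can be verified numerically on a small random tabular MDP (an illustrative setup, not from the lecture; all quantities are computed exactly by solving linear systems):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random tabular MDP.
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = P(. | s, a)
r = rng.uniform(size=(S, A))
mu = np.full(S, 1.0 / S)
idx, I = np.arange(S), np.eye(S)

def values(pi):
    """V^pi and A^pi for a deterministic policy pi (array of action indices)."""
    P_pi, r_pi = P[idx, pi], r[idx, pi]
    V = np.linalg.solve(I - gamma * P_pi, r_pi)   # V = r_pi + gamma P_pi V
    Q = r + gamma * P @ V
    return V, Q - V[:, None]                      # advantage A^pi(s, a)

def visitation(pi):
    """d_mu^pi = (1 - gamma) * mu (I - gamma P_pi)^{-1} (row-vector form)."""
    return (1 - gamma) * mu @ np.linalg.inv(I - gamma * P[idx, pi])

pi_t, pi_t1 = rng.integers(A, size=S), rng.integers(A, size=S)
V_t, A_t = values(pi_t)
V_t1, _ = values(pi_t1)

lhs = mu @ (V_t1 - V_t)                                   # V_mu^{pi^{t+1}} - V_mu^{pi^t}
rhs = visitation(pi_t1) @ A_t[idx, pi_t1] / (1 - gamma)   # PDL right-hand side
assert np.isclose(lhs, rhs)
```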
The CPI Algorithm
Initialize π^0.
For t = 0, 1, …:
1. Greedy Policy Selector: π′ = arg max_{π∈Π} 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π(s))]
2. If max_{π∈Π} 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π(s))] ≤ ε, return π^t.
3. Incremental Update: π^{t+1}(⋅|s) = (1 − α)π^t(⋅|s) + απ′(⋅|s)
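Putting the pieces together, a minimal end-to-end sketch of CPI on a known tabular MDP (the function name and structure are mine; advantages and the greedy selector are exact here, whereas at scale both would come from the regression oracle above):

```python
import numpy as np

def cpi(P, r, mu, gamma, eps, max_iters=10_000):
    """Conservative Policy Iteration sketch.

    P: (S, A, S') transitions, r: (S, A) rewards, mu: (S,) initial dist.
    Policies are stochastic: pi[s, a] = pi(a | s).
    """
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)                       # pi^0: uniform
    I, idx = np.eye(S), np.arange(S)
    for _ in range(max_iters):
        P_pi = np.einsum('sa,sat->st', pi, P)           # P^pi(s' | s)
        V = np.linalg.solve(I - gamma * P_pi, (pi * r).sum(1))
        adv = r + gamma * P @ V - V[:, None]            # A^{pi^t}(s, a)
        d = (1 - gamma) * mu @ np.linalg.inv(I - gamma * P_pi)
        greedy = adv.argmax(1)                          # 1. greedy selector pi'
        local_adv = d @ adv[idx, greedy]                # average local advantage
        if local_adv <= eps:                            # 2. termination test
            return pi
        # Step size from the analysis below, clamped to stay a valid mixture.
        alpha = min(1.0, (1 - gamma) ** 2 * local_adv / (4 * gamma))
        pi = (1 - alpha) * pi + alpha * np.eye(A)[greedy]   # 3. incremental update
    return pi
```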
Key observation 1:
CPI ensures an incremental update, i.e., ∥d_μ^{π^{t+1}}(⋅) − d_μ^{π^t}(⋅)∥₁ ≤ 2γα/(1 − γ).
(Intuitively, at every step π^{t+1} acts differently from π^t with probability at most α, so for small α the two state distributions stay close.)
Recall CPI:
1. Greedy Policy Selector: π′ = arg max_{π∈Π} 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π(s))]
2. If max_{π∈Π} 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π(s))] ≤ ε, return π^t.
3. Incremental Update: π^{t+1}(⋅|s) = (1 − α)π^t(⋅|s) + απ′(⋅|s)

Before we terminate, we have a non-trivial average local advantage:
𝔸_t := 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π′(s))] ≥ ε

Can we translate the local advantage 𝔸_t into V^{π^{t+1}} − V^{π^t}? (Yes, by the PDL.) Since 𝔼_{a∼π^t(⋅|s)}[A^{π^t}(s, a)] = 0, the mixture update gives 𝔼_{a∼π^{t+1}(⋅|s)}[A^{π^t}(s, a)] = αA^{π^t}(s, π′(s)), so by the PDL:

(1 − γ)(V_μ^{π^{t+1}} − V_μ^{π^t}) = 𝔼_{s∼d_μ^{π^{t+1}}}[αA^{π^t}(s, π′(s))]
= 𝔼_{s∼d_μ^{π^t}}[αA^{π^t}(s, π′(s))] + 𝔼_{s∼d_μ^{π^{t+1}}}[αA^{π^t}(s, π′(s))] − 𝔼_{s∼d_μ^{π^t}}[αA^{π^t}(s, π′(s))]
≥ 𝔼_{s∼d_μ^{π^t}}[αA^{π^t}(s, π′(s))] − (α/(1 − γ))∥d_μ^{π^{t+1}} − d_μ^{π^t}∥₁
≥ α𝔸_t − 2γα²/(1 − γ)²
= 𝔸_t²(1 − γ)²/(8γ)   (setting α = (1 − γ)²𝔸_t/(4γ))

Dividing by (1 − γ): V_μ^{π^{t+1}} − V_μ^{π^t} ≥ 𝔸_t²(1 − γ)/(8γ).
(The third line uses |A^{π^t}| ≤ 1/(1 − γ); the fourth uses Key observation 1.)
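To get a feel for the scales involved (an illustrative instance; the numbers are not from the slides): with γ = 0.9 and 𝔸_t = 0.1,

α = (1 − γ)²𝔸_t/(4γ) = (0.01 × 0.1)/3.6 ≈ 2.8 × 10⁻⁴,  and  V_μ^{π^{t+1}} − V_μ^{π^t} ≥ 𝔸_t²(1 − γ)/(8γ) = (0.01 × 0.1)/7.2 ≈ 1.4 × 10⁻⁴.

The step size is tiny; this is exactly what makes the update "conservative".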
Recall CPI:
1. Incremental update (Lemma 12.1 in AJKS): ∥d_μ^{π^{t+1}} − d_μ^{π^t}∥₁ ≤ 2γα/(1 − γ).
2. Before termination, the greedy policy selector guarantees 𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π′(s))] ≥ ε.
Combining the two, exactly as before:
(1 − γ)(V_μ^{π^{t+1}} − V_μ^{π^t}) ≥ α𝔼_{s∼d_μ^{π^t}}[A^{π^t}(s, π′(s))] − (α/(1 − γ))∥d_μ^{π^{t+1}} − d_μ^{π^t}∥₁ ≥ αε − 2γα²/(1 − γ)²
By setting the step size α properly (α = (1 − γ)²ε/(4γ)), every iteration before termination improves:
V_μ^{π^{t+1}} − V_μ^{π^t} ≥ ε²(1 − γ)/(8γ)
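A quick consequence (standard, cf. AJKS Chapter 12, assuming rewards in [0, 1] so that values lie in [0, 1/(1 − γ)]): the per-iteration improvement bounds how long CPI can run, since the total improvement is at most 1/(1 − γ):

(1/(1 − γ)) ÷ (ε²(1 − γ)/(8γ)) = 8γ/((1 − γ)²ε²),

so CPI makes at most O(1/((1 − γ)²ε²)) improving updates before the termination test fires.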
Summary for today:
Find the local greedy policy, and move towards it a little bit