Shusen Wang
Policy Function Approximation
Policy Function π(a|s)
• Can we directly learn a policy function? If there are only a few states and actions, then yes, we can.
• Draw a table (matrix) and learn the entries.
• What if there are too many (or infinite) states or actions?
[Figure: policy network. The state sₜ is mapped by a Conv layer to a feature vector; Dense and Softmax layers then output action probabilities, e.g. “left”: 0.2, “right”: 0.1, “up”: 0.7.]
Policy Network π(a|s; θ)
• ∑ₐ∈𝒜 π(a|s; θ) = 1.
• Here, 𝒜 = {“left”, “right”, “up”} is the set of all actions.
• That is why we use softmax activation.
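As a sketch, the softmax output layer can be illustrated with a minimal linear policy network; the Conv and Dense layers are collapsed into one linear map, and the sizes and weights below are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def policy_network(state, W, b):
    """Stand-in for the Conv+Dense+Softmax network: the Conv/Dense
    layers are collapsed into a single linear map for illustration."""
    return softmax(W @ state + b)

# Made-up sizes: a 4-dimensional state feature and 3 actions
# ("left", "right", "up"), with randomly initialized parameters.
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
probs = policy_network(rng.normal(size=4), W, b)
print(probs)  # a probability distribution over the 3 actions
```

Because of the softmax, the three outputs are always positive and sum to 1, so they can be read as π(a|s; θ).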
State-Value Function Approximation
Action-Value Function
Definition: Discounted return.
• Uₜ = Rₜ + γ·Rₜ₊₁ + γ²·Rₜ₊₂ + γ³·Rₜ₊₃ + ⋯
• The return depends on actions Aₜ, Aₜ₊₁, Aₜ₊₂, ⋯ and states Sₜ, Sₜ₊₁, Sₜ₊₂, ⋯
• Actions are random: ℙ[A = a | S = s] = π(a|s). (Policy function.)
• States are random: ℙ[S′ = s′ | S = s, A = a] = p(s′|s, a). (State transition.)
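The discounted return satisfies the recursion Uₜ = Rₜ + γ·Uₜ₊₁, which gives a one-pass way to compute it from a finite reward sequence:

```python
def discounted_return(rewards, gamma):
    """U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + ...  computed by the
    backward recursion U_t = R_t + γ·U_{t+1}."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```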
• Observe state s.
• Update policy by gradient ascent: θ ← θ + β · ∂V(s; θ)/∂θ.
• The term ∂V(s; θ)/∂θ is called the policy gradient.
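The ascent rule above can be sketched on a toy objective; the quadratic V below is a made-up stand-in for the true state-value function, just to show that θ ← θ + β·∂V/∂θ climbs toward a maximizer:

```python
import numpy as np

# A toy stand-in objective (NOT a real state-value function):
# V(θ) = −‖θ − 2‖², maximized at θ = 2 in every coordinate.
def grad_V(theta):
    return -2.0 * (theta - 2.0)  # ∂V/∂θ

theta = np.zeros(3)
beta = 0.1
for _ in range(100):
    theta = theta + beta * grad_V(theta)  # gradient ASCENT step

print(theta)  # approaches the maximizer [2, 2, 2]
```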
Policy Gradient
Reference
• Sutton et al.: Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
Policy Gradient
Definition: Approximate state-value function.
• V(s; θ) = ∑ₐ π(a|s; θ) · Q_π(s, a).
• Policy gradient: ∂V(s; θ)/∂θ = ∑ₐ ∂π(a|s; θ)/∂θ · Q_π(s, a). (Here Q_π is treated as independent of θ, for simplicity.)
• Chain rule: ∂ log π(θ)/∂θ = (1/π(θ)) · ∂π(θ)/∂θ.
• ⟹ π(θ) · ∂ log π(θ)/∂θ = π(θ) · (1/π(θ)) · ∂π(θ)/∂θ = ∂π(θ)/∂θ.
• ⟹ ∂V(s; θ)/∂θ = ∑ₐ π(a|s; θ) · ∂ log π(a|s; θ)/∂θ · Q_π(s, a).
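The chain-rule identity π·∂log π/∂θ = ∂π/∂θ can be checked numerically. The sketch below uses a toy softmax policy over 3 actions with made-up logits θ, and compares both sides using central finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Made-up logits θ for a 3-action softmax policy; a is a fixed action.
theta = np.array([0.5, -1.0, 2.0])
a = 0
eps = 1e-6

grad_pi = np.zeros_like(theta)     # ∂π(a)/∂θ by central differences
grad_logpi = np.zeros_like(theta)  # ∂log π(a)/∂θ by central differences
for i in range(theta.size):
    d = np.zeros_like(theta)
    d[i] = eps
    grad_pi[i] = (softmax(theta + d)[a] - softmax(theta - d)[a]) / (2 * eps)
    grad_logpi[i] = (np.log(softmax(theta + d)[a])
                     - np.log(softmax(theta - d)[a])) / (2 * eps)

pi_a = softmax(theta)[a]
max_err = np.max(np.abs(pi_a * grad_logpi - grad_pi))
print(max_err)  # ≈ 0: π·∂log π/∂θ matches ∂π/∂θ
```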
Policy Gradient
Definition: Approximate state-value function.
• V(s; θ) = ∑ₐ π(a|s; θ) · Q_π(s, a) = 𝔼_A[Q_π(s, A)].
• The expectation is taken w.r.t. the random variable A ~ π(·|s; θ).
Policy Gradient
Policy gradient:
∂V(s; θ)/∂θ = 𝔼_{A~π(·|s; θ)}[ ∂ log π(A|s; θ)/∂θ · Q_π(s, A) ].
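This equality of the sum and the expectation can be verified by Monte Carlo; the toy softmax policy and the Q values below are assumptions made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Toy 3-action softmax policy with made-up logits θ and Q values.
rng = np.random.default_rng(0)
theta = np.array([0.3, -0.2, 0.8])
Q = np.array([1.0, 2.0, 0.5])

pi = softmax(theta)
dlogpi = np.eye(3) - pi  # row a: ∂log π(a)/∂θ = e_a − π for softmax logits

# Exact sum: Σ_a π(a)·∂log π(a)/∂θ·Q(a).
exact = (pi[:, None] * dlogpi * Q[:, None]).sum(axis=0)

# Monte Carlo: average g(A) = ∂log π(A)/∂θ·Q(A) over samples A ~ π.
acts = rng.choice(3, size=200_000, p=pi)
mc = (dlogpi[acts] * Q[acts, None]).mean(axis=0)

print(np.max(np.abs(exact - mc)))  # small: they agree up to sampling noise
```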
Calculate Policy Gradient
• Randomly sample an action â according to π(·|s; θ).
• Define g(â, θ) = ∂ log π(â|s; θ)/∂θ · Q_π(s, â).
• By the definition of g, 𝔼_A[g(A, θ)] = ∂V(s; θ)/∂θ.
• Thus g(â, θ) is an unbiased estimate of the policy gradient ∂V(s; θ)/∂θ.
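Putting the pieces together, here is a hedged sketch of the stochastic update θ ← θ + β·g(â, θ) on a single-state toy problem: the policy is a plain softmax over its logits θ, and the fixed Q vector is a made-up stand-in for the true Q_π:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
theta = np.zeros(3)                # softmax logits play the role of θ
Q = np.array([0.0, 1.0, 3.0])      # made-up action values; action 2 is best
beta = 0.1

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)            # sample â ~ π(·|s;θ)
    g = (np.eye(3)[a] - pi) * Q[a]     # g(â,θ) = ∂log π(â)/∂θ · Q(â)
    theta = theta + beta * g           # ascent step θ ← θ + β·g(â,θ)

print(softmax(theta))  # probability mass concentrates on the best action
```

Each step uses only one sampled action, yet because g is unbiased, the noisy updates climb V(s; θ) on average.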
Calculate Policy Gradient