
Policy-Based Reinforcement Learning

Shusen Wang
Policy Function Approximation
Policy Function π(a|s)

• The policy function π(a|s) is a probability density function (PDF).
• It takes state s as input.
• It outputs a probability for each action, e.g.,
  π(left | s) = 0.2,
  π(right | s) = 0.1,
  π(up | s) = 0.7.
• The action a is then randomly sampled from this distribution.
Can we directly learn a policy function π(a|s)?

• If there are only a few states and actions, then yes, we can.
• Draw a table (matrix) and learn the entries. (See the sketch below.)
• What if there are too many (or infinite) states or actions?

[Table: rows are states s_1, s_2, s_3, ⋯; columns are actions a_1, a_2, a_3, a_4, ⋯; each entry stores π(a | s).]

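To make the tabular case concrete, below is a minimal Python sketch (not from the slides) of such a policy table for a hypothetical problem with 3 states and 4 actions; the sizes and random initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular policy: one row per state, one column per action.
# Each row is a probability distribution pi(. | s), so rows sum to 1.
n_states, n_actions = 3, 4
logits = rng.normal(size=(n_states, n_actions))
policy_table = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def sample_action(state: int) -> int:
    """Draw an action a ~ pi(. | state) from the table."""
    return int(rng.choice(n_actions, p=policy_table[state]))

print(policy_table[0].sum())   # ~1.0 (each row is a distribution)
print(sample_action(0))        # a sampled action index in {0, 1, 2, 3}
```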
Policy Network π(a|s; θ)

Policy network: use a neural net to approximate π(a|s).
• Use the policy network π(a|s; θ) to approximate π(a|s).
• θ: trainable parameters of the neural net.

[Figure: state s_t → Conv → feature → Dense → Softmax → action probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.]
Policy Network π(a|s; θ)

• ∑_{a∈𝒜} π(a|s; θ) = 1.
• Here, 𝒜 = {"left", "right", "up"} is the set of all actions.
• That is why we use the softmax activation. (A sketch of such a network follows.)
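A minimal PyTorch sketch of such a policy network (not from the slides), assuming the state is already a flat feature vector, so a dense layer stands in for the Conv block in the figure; the layer sizes and names are arbitrary.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a | s; theta): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),   # stands in for the Conv feature extractor
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)   # softmax: outputs sum to 1

# Usage: sample an action a ~ pi(. | s; theta).
policy = PolicyNetwork(state_dim=8, n_actions=3)
s = torch.randn(8)                       # placeholder state features
probs = policy(s)                        # e.g., something like [0.2, 0.1, 0.7]
a = torch.multinomial(probs, 1).item()   # index of the sampled action
```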
State-Value Function Approximation
Action-Value Function

Definition: Discounted return.
• U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯

• The return depends on actions A_t, A_{t+1}, A_{t+2}, ⋯ and states S_t, S_{t+1}, S_{t+2}, ⋯
• Actions are random: ℙ(A = a | S = s) = π(a|s). (Policy function.)
• States are random: ℙ(S′ = s′ | S = s, A = a) = p(s′|s, a). (State transition.)

Definition: Action-value function.
• Q_π(s_t, a_t) = 𝔼[U_t | S_t = s_t, A_t = a_t].
• The expectation is taken w.r.t. the actions A_{t+1}, A_{t+2}, A_{t+3}, ⋯ and the states S_{t+1}, S_{t+2}, S_{t+3}, ⋯
State-Value Function

Definition: State-value function.
• V_π(s_t) = 𝔼_A[Q_π(s_t, A)] = ∑_a π(a|s_t) · Q_π(s_t, a).
• The action A ~ π(·|s_t) is integrated out.
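For instance, with the action probabilities from the earlier example and some made-up action values Q_π(s_t, left) = 1, Q_π(s_t, right) = 2, Q_π(s_t, up) = 3, the state value is V_π(s_t) = 0.2·1 + 0.1·2 + 0.7·3 = 2.5.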
Policy-Based Reinforcement Learning

Definition: State-value function.
• V_π(s_t) = 𝔼_A[Q_π(s_t, A)] = ∑_a π(a|s_t) · Q_π(s_t, a).

Approximate the state-value function:
• Approximate the policy function π(a|s_t) by the policy network π(a|s_t; θ).
• Approximate the value function V_π(s_t) by:
  V(s_t; θ) = ∑_a π(a|s_t; θ) · Q_π(s_t, a).
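As a sketch of what this approximation looks like in code, reusing the hypothetical PolicyNetwork from the earlier sketch and made-up Q values:

```python
import torch

# V(s; theta) = sum_a pi(a | s; theta) * Q_pi(s, a).
# `PolicyNetwork` is the earlier sketch; the Q values are made up.
policy = PolicyNetwork(state_dim=8, n_actions=3)
s = torch.randn(8)                            # placeholder state features
q_values = torch.tensor([1.0, 2.0, 3.0])      # stand-in for Q_pi(s, .)
v_approx = (policy(s) * q_values).sum()       # scalar approximation V(s; theta)
```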
Policy-Based Reinforcement Learning

Definition: Approximate state-value function.
• V(s; θ) = ∑_a π(a|s; θ) · Q_π(s, a).

Policy-based learning: learn θ that maximizes J(θ) = 𝔼_S[V(S; θ)].

How to improve θ? Policy gradient ascent!
• Observe state s.
• Update the policy by: θ ← θ + β · ∂V(s; θ)/∂θ.
• The term ∂V(s; θ)/∂θ is called the policy gradient.
Policy Gradient

Reference

• Sutton et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
Policy Gradient
Definition: Approximate state-value function.
• V(s; θ) = ∑_a π(a|s; θ) · Q_π(s, a).

Policy gradient: the derivative of V(s; θ) w.r.t. θ.

• ∂V(s; θ)/∂θ = ∂[∑_a π(a|s; θ) · Q_π(s, a)] / ∂θ
             = ∑_a ∂[π(a|s; θ) · Q_π(s, a)] / ∂θ        (push the derivative inside the summation)
             = ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a).            (pretend Q_π is independent of θ; it may not be true)

Note: This derivation is over-simplified and not rigorous.
Policy Gradient
Definition: Approximate state-value function.
• V(s; θ) = ∑_a π(a|s; θ) · Q_π(s, a).

Policy gradient: the derivative of V(s; θ) w.r.t. θ.

• ∂V(s; θ)/∂θ = ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a)
             = ∑_a π(a|s; θ) · ∂log π(a|s; θ)/∂θ · Q_π(s, a)
             = 𝔼_A[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].

• Chain rule: ∂log π(θ)/∂θ = (1/π(θ)) · ∂π(θ)/∂θ.
• ⟹ π(θ) · ∂log π(θ)/∂θ = π(θ) · (1/π(θ)) · ∂π(θ)/∂θ = ∂π(θ)/∂θ.

The expectation is taken w.r.t. the random variable A ~ π(·|s; θ).
Policy Gradient

Policy gradient:
∂V(s; θ)/∂θ = 𝔼_{A~π(·|s; θ)}[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].
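As a quick sanity check of this identity (not from the slides), the sketch below compares the exact gradient ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a) with the Monte Carlo estimate 𝔼_A[∂log π(A|s; θ)/∂θ · Q_π(s, A)] for a softmax policy over three actions; the logits and Q values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = rng.normal(size=3)           # softmax logits, one per action
Q = np.array([1.0, 2.0, 0.5])        # made-up Q_pi(s, a) values

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = pi(theta)

# Exact gradient: sum_a dpi(a|s;theta)/dtheta * Q_pi(s, a).
# For a softmax, dpi(a)/dtheta_j = pi(a) * (1{a=j} - pi(j)).
exact = (p * Q) @ (np.eye(3) - p)

# Monte Carlo estimate: E_A[ dlog pi(A|s;theta)/dtheta * Q_pi(s, A) ],
# using dlog pi(a)/dtheta_j = 1{a=j} - pi(j).
a_samples = rng.choice(3, size=200_000, p=p)
grad_log = np.eye(3)[a_samples] - p           # one row per sampled action
mc = (grad_log * Q[a_samples, None]).mean(axis=0)

print(exact)   # the two printed vectors should be close
print(mc)
```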
Calculate Policy Gradient

Policy gradient: ∂V(s; θ)/∂θ = 𝔼_{A~π(·|s; θ)}[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].

1. Randomly sample an action â according to π(·|s; θ).
2. Calculate 𝐠(â, θ) = ∂log π(â|s; θ)/∂θ · Q_π(s, â).
   • By the definition of 𝐠, 𝔼_A[𝐠(A, θ)] = ∂V(s; θ)/∂θ.
   • Thus 𝐠(â, θ) is an unbiased estimate of ∂V(s; θ)/∂θ.
3. Use 𝐠(â, θ) as an approximation to the policy gradient ∂V(s; θ)/∂θ. (A code sketch follows below.)
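A hedged PyTorch sketch of these three steps (not from the slides), reusing the hypothetical PolicyNetwork from the earlier sketch; q_value stands for some estimate of Q_π(s, â), which is exactly the question the next section addresses.

```python
import torch

def sampled_policy_gradient(policy, s, q_value):
    """Monte Carlo estimate g(a_hat, theta) of the policy gradient at state s.

    `policy` is the PolicyNetwork sketched earlier; `q_value` is some
    estimate of Q_pi(s, a_hat).
    """
    probs = policy(s)                                # pi(. | s; theta)
    dist = torch.distributions.Categorical(probs)
    a_hat = dist.sample()                            # step 1: a_hat ~ pi(. | s; theta)
    log_prob = dist.log_prob(a_hat)                  # log pi(a_hat | s; theta)
    # Step 2: g(a_hat, theta) = dlog pi(a_hat|s;theta)/dtheta * Q_pi(s, a_hat).
    grads = torch.autograd.grad(log_prob * q_value, list(policy.parameters()))
    return a_hat, grads                              # step 3: use g as the gradient estimate
```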
Update policy network using policy gradient
Algorithm

1. Observe the state s_t.
2. Randomly sample an action a_t according to π(·|s_t; θ_t).
3. Compute q_t ≈ Q_π(s_t, a_t) (some estimate). How?
4. Differentiate the policy network: 𝐝_{θ,t} = ∂log π(a_t|s_t; θ)/∂θ, evaluated at θ = θ_t.
5. (Approximate) policy gradient: 𝐠(a_t, θ_t) = q_t · 𝐝_{θ,t}.
6. Update the policy network: θ_{t+1} = θ_t + β · 𝐠(a_t, θ_t).
Option 1: REINFORCE (this answers the "How?" in step 3; a code sketch follows this list).
• Play the game to the end and generate the trajectory:
  s_1, a_1, r_1, s_2, a_2, r_2, ⋯, s_T, a_T, r_T.
• Compute the discounted return u_t = ∑_{k=t}^{T} γ^{k−t} · r_k, for all t.
• Since Q_π(s_t, a_t) = 𝔼[U_t], we can use u_t to approximate Q_π(s_t, a_t).
• ⟹ Use q_t = u_t.
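A minimal REINFORCE sketch (not from the slides) under these conventions, assuming the hypothetical PolicyNetwork from earlier and a standard torch optimizer; trajectory is a finished episode given as a list of (state, action, reward) tuples.

```python
import torch

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE update from a finished episode (s_1,a_1,r_1), ..., (s_T,a_T,r_T)."""
    # Discounted returns u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards.
    returns, u = [], 0.0
    for _, _, r in reversed(trajectory):
        u = r + gamma * u
        returns.append(u)
    returns.reverse()

    # Gradient ascent on sum_t log pi(a_t | s_t; theta) * u_t
    # (q_t = u_t; the optimizer minimizes, so negate the objective).
    loss = 0.0
    for (s, a, _), u_t in zip(trajectory, returns):
        probs = policy(torch.as_tensor(s, dtype=torch.float32))
        log_prob = torch.distributions.Categorical(probs).log_prob(torch.tensor(a))
        loss = loss - log_prob * u_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```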
Option 2: Approximate Q_π using a neural network.
• This leads to the actor-critic method.
Summary
Policy-Based Learning

• If a good policy function π is known, the agent can be controlled by the policy: randomly sample a_t ∼ π(·|s_t).
• Approximate the policy function π(a|s) by the policy network π(a|s; θ).
• Learn the policy network with the policy gradient algorithm.
• The policy gradient algorithm learns θ that maximizes 𝔼_S[V(S; θ)].
Thank you!
