
Policy-Based Reinforcement Learning

Shusen Wang
Policy Function Approximation
Policy Function π(a|s)

• The policy function π(a|s) is a probability density function (PDF).
• It takes state s as input.
• It outputs a probability for each action, e.g.,
  π(left | s) = 0.2,
  π(right | s) = 0.1,
  π(up | s) = 0.7.
• The action a is then randomly sampled from this distribution.
Can we directly learn a policy function π(a|s)?

• If there are only a few states and actions, then yes, we can.
• Draw a table (matrix) and learn the entries. (See the sketch below.)
• What if there are too many (or infinite) states or actions?

[Table: rows are states s_1, s_2, s_3, ⋯; columns are actions a_1, a_2, a_3, a_4, ⋯; each entry stores π(a | s).]

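To make the tabular case concrete, below is a minimal Python sketch (not from the slides) of such a policy table for a hypothetical problem with 3 states and 4 actions; the sizes and random initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular policy: one row per state, one column per action.
# Each row is a probability distribution pi(. | s), so rows sum to 1.
n_states, n_actions = 3, 4
logits = rng.normal(size=(n_states, n_actions))
policy_table = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

def sample_action(state: int) -> int:
    """Draw an action a ~ pi(. | state) from the table."""
    return int(rng.choice(n_actions, p=policy_table[state]))

print(policy_table[0].sum())   # ~1.0 (each row is a distribution)
print(sample_action(0))        # a sampled action index in {0, 1, 2, 3}
```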
Policy Network π(a|s; θ)

Policy network: use a neural net to approximate π(a|s).
• Use the policy network π(a|s; θ) to approximate π(a|s).
• θ: trainable parameters of the neural net.

[Figure: state s_t → Conv → feature → Dense → Softmax → action probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.]
Policy Network π(a|s; θ)

• ∑_{a∈𝒜} π(a|s; θ) = 1.
• Here, 𝒜 = {"left", "right", "up"} is the set of all actions.
• That is why we use the softmax activation. (A sketch of such a network follows.)
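A minimal PyTorch sketch of such a policy network (not from the slides), assuming the state is already a flat feature vector, so a dense layer stands in for the Conv block in the figure; the layer sizes and names are arbitrary.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a | s; theta): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),   # stands in for the Conv feature extractor
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)   # softmax: outputs sum to 1

# Usage: sample an action a ~ pi(. | s; theta).
policy = PolicyNetwork(state_dim=8, n_actions=3)
s = torch.randn(8)                       # placeholder state features
probs = policy(s)                        # e.g., something like [0.2, 0.1, 0.7]
a = torch.multinomial(probs, 1).item()   # index of the sampled action
```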
State-Value Function Approximation
Action-Value Function

Definition: Discounted return.
• U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + γ³·R_{t+3} + ⋯

• The return depends on actions A_t, A_{t+1}, A_{t+2}, ⋯ and states S_t, S_{t+1}, S_{t+2}, ⋯
• Actions are random: ℙ(A = a | S = s) = π(a|s). (Policy function.)
• States are random: ℙ(S′ = s′ | S = s, A = a) = p(s′|s, a). (State transition.)

Definition: Action-value function.
• Q_π(s_t, a_t) = 𝔼[U_t | S_t = s_t, A_t = a_t].
• The expectation is taken w.r.t. the actions A_{t+1}, A_{t+2}, A_{t+3}, ⋯ and the states S_{t+1}, S_{t+2}, S_{t+3}, ⋯
State-Value Function

Definition: State-value function.
• V_π(s_t) = 𝔼_A[Q_π(s_t, A)] = ∑_a π(a|s_t) · Q_π(s_t, a).
• The action A ~ π(·|s_t) is integrated out.
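For instance, with the action probabilities from the earlier example and some made-up action values Q_π(s_t, left) = 1, Q_π(s_t, right) = 2, Q_π(s_t, up) = 3, the state value is V_π(s_t) = 0.2·1 + 0.1·2 + 0.7·3 = 2.5.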
Policy-Based Reinforcement Learning

Definition: State-value function.
• V_π(s_t) = 𝔼_A[Q_π(s_t, A)] = ∑_a π(a|s_t) · Q_π(s_t, a).

Approximate the state-value function:
• Approximate the policy function π(a|s_t) by the policy network π(a|s_t; θ).
• Approximate the value function V_π(s_t) by:
  V(s_t; θ) = ∑_a π(a|s_t; θ) · Q_π(s_t, a).
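As a sketch of what this approximation looks like in code, reusing the hypothetical PolicyNetwork from the earlier sketch and made-up Q values:

```python
import torch

# V(s; theta) = sum_a pi(a | s; theta) * Q_pi(s, a).
# `PolicyNetwork` is the earlier sketch; the Q values are made up.
policy = PolicyNetwork(state_dim=8, n_actions=3)
s = torch.randn(8)                            # placeholder state features
q_values = torch.tensor([1.0, 2.0, 3.0])      # stand-in for Q_pi(s, .)
v_approx = (policy(s) * q_values).sum()       # scalar approximation V(s; theta)
```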
Policy-Based Reinforcement Learning

Definition: Approximate state-value function.
• V(s; θ) = ∑_a π(a|s; θ) · Q_π(s, a).

Policy-based learning: learn θ that maximizes J(θ) = 𝔼_S[V(S; θ)].

How to improve θ? Policy gradient ascent!
• Observe state s.
• Update the policy by: θ ← θ + β · ∂V(s; θ)/∂θ.
• The term ∂V(s; θ)/∂θ is called the policy gradient.
Policy Gradient

Reference

• Sutton et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
Policy Gradient
Definition: Approximate state-value function.
• V(s; θ) = ∑_a π(a|s; θ) · Q_π(s, a).

Policy gradient: the derivative of V(s; θ) w.r.t. θ.

• ∂V(s; θ)/∂θ = ∂[∑_a π(a|s; θ) · Q_π(s, a)] / ∂θ
             = ∑_a ∂[π(a|s; θ) · Q_π(s, a)] / ∂θ        (push the derivative inside the summation)
             = ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a).            (pretend Q_π is independent of θ; it may not be true)

Note: This derivation is over-simplified and not rigorous.
Policy Gradient
Definition: Approximate state-value function.
• V(s; θ) = ∑_a π(a|s; θ) · Q_π(s, a).

Policy gradient: the derivative of V(s; θ) w.r.t. θ.

• ∂V(s; θ)/∂θ = ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a)
             = ∑_a π(a|s; θ) · ∂log π(a|s; θ)/∂θ · Q_π(s, a)
             = 𝔼_A[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].

• Chain rule: ∂log π(θ)/∂θ = (1/π(θ)) · ∂π(θ)/∂θ.
• ⟹ π(θ) · ∂log π(θ)/∂θ = π(θ) · (1/π(θ)) · ∂π(θ)/∂θ = ∂π(θ)/∂θ.

The expectation is taken w.r.t. the random variable A ~ π(·|s; θ).
Policy Gradient

Policy gradient:
∂V(s; θ)/∂θ = 𝔼_{A~π(·|s; θ)}[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].
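As a quick sanity check of this identity (not from the slides), the sketch below compares the exact gradient ∑_a ∂π(a|s; θ)/∂θ · Q_π(s, a) with the Monte Carlo estimate 𝔼_A[∂log π(A|s; θ)/∂θ · Q_π(s, A)] for a softmax policy over three actions; the logits and Q values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = rng.normal(size=3)           # softmax logits, one per action
Q = np.array([1.0, 2.0, 0.5])        # made-up Q_pi(s, a) values

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

p = pi(theta)

# Exact gradient: sum_a dpi(a|s;theta)/dtheta * Q_pi(s, a).
# For a softmax, dpi(a)/dtheta_j = pi(a) * (1{a=j} - pi(j)).
exact = (p * Q) @ (np.eye(3) - p)

# Monte Carlo estimate: E_A[ dlog pi(A|s;theta)/dtheta * Q_pi(s, A) ],
# using dlog pi(a)/dtheta_j = 1{a=j} - pi(j).
a_samples = rng.choice(3, size=200_000, p=p)
grad_log = np.eye(3)[a_samples] - p           # one row per sampled action
mc = (grad_log * Q[a_samples, None]).mean(axis=0)

print(exact)   # the two printed vectors should be close
print(mc)
```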
Calculate Policy Gradient

Policy gradient: ∂V(s; θ)/∂θ = 𝔼_{A~π(·|s; θ)}[ ∂log π(A|s; θ)/∂θ · Q_π(s, A) ].

1. Randomly sample an action â according to π(·|s; θ).
2. Calculate 𝐠(â, θ) = ∂log π(â|s; θ)/∂θ · Q_π(s, â).
   • By the definition of 𝐠, 𝔼_A[𝐠(A, θ)] = ∂V(s; θ)/∂θ.
   • Thus 𝐠(â, θ) is an unbiased estimate of ∂V(s; θ)/∂θ.
3. Use 𝐠(â, θ) as an approximation to the policy gradient ∂V(s; θ)/∂θ. (A code sketch follows below.)
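A hedged PyTorch sketch of these three steps (not from the slides), reusing the hypothetical PolicyNetwork from the earlier sketch; q_value stands for some estimate of Q_π(s, â), which is exactly the question the next section addresses.

```python
import torch

def sampled_policy_gradient(policy, s, q_value):
    """Monte Carlo estimate g(a_hat, theta) of the policy gradient at state s.

    `policy` is the PolicyNetwork sketched earlier; `q_value` is some
    estimate of Q_pi(s, a_hat).
    """
    probs = policy(s)                                # pi(. | s; theta)
    dist = torch.distributions.Categorical(probs)
    a_hat = dist.sample()                            # step 1: a_hat ~ pi(. | s; theta)
    log_prob = dist.log_prob(a_hat)                  # log pi(a_hat | s; theta)
    # Step 2: g(a_hat, theta) = dlog pi(a_hat|s;theta)/dtheta * Q_pi(s, a_hat).
    grads = torch.autograd.grad(log_prob * q_value, list(policy.parameters()))
    return a_hat, grads                              # step 3: use g as the gradient estimate
```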
Update policy network using policy gradient
Algorithm

1. Observe the state s_t.
2. Randomly sample an action a_t according to π(·|s_t; θ_t).
3. Compute q_t ≈ Q_π(s_t, a_t) (some estimate). How?
4. Differentiate the policy network: 𝐝_{θ,t} = ∂log π(a_t|s_t; θ)/∂θ, evaluated at θ = θ_t.
5. (Approximate) policy gradient: 𝐠(a_t, θ_t) = q_t · 𝐝_{θ,t}.
6. Update the policy network: θ_{t+1} = θ_t + β · 𝐠(a_t, θ_t).
Option 1: REINFORCE (this answers the "How?" in step 3; a code sketch follows this list).
• Play the game to the end and generate the trajectory:
  s_1, a_1, r_1, s_2, a_2, r_2, ⋯, s_T, a_T, r_T.
• Compute the discounted return u_t = ∑_{k=t}^{T} γ^{k−t} · r_k, for all t.
• Since Q_π(s_t, a_t) = 𝔼[U_t], we can use u_t to approximate Q_π(s_t, a_t).
• ⟹ Use q_t = u_t.
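A minimal REINFORCE sketch (not from the slides) under these conventions, assuming the hypothetical PolicyNetwork from earlier and a standard torch optimizer; trajectory is a finished episode given as a list of (state, action, reward) tuples.

```python
import torch

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One REINFORCE update from a finished episode (s_1,a_1,r_1), ..., (s_T,a_T,r_T)."""
    # Discounted returns u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards.
    returns, u = [], 0.0
    for _, _, r in reversed(trajectory):
        u = r + gamma * u
        returns.append(u)
    returns.reverse()

    # Gradient ascent on sum_t log pi(a_t | s_t; theta) * u_t
    # (q_t = u_t; the optimizer minimizes, so negate the objective).
    loss = 0.0
    for (s, a, _), u_t in zip(trajectory, returns):
        probs = policy(torch.as_tensor(s, dtype=torch.float32))
        log_prob = torch.distributions.Categorical(probs).log_prob(torch.tensor(a))
        loss = loss - log_prob * u_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```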
Option 2: Approximate Q_π using a neural network.
• This leads to the actor-critic method.
Summary
Policy-Based Learning

• If a good policy function π is known, the agent can be controlled by the policy: randomly sample a_t ∼ π(·|s_t).
• Approximate the policy function π(a|s) by the policy network π(a|s; θ).
• Learn the policy network with the policy gradient algorithm.
• The policy gradient algorithm learns θ that maximizes 𝔼_S[V(S; θ)].
Thank you!
