
Actor-Critic Methods

Shusen Wang
Value-Based Methods · Actor-Critic Methods · Policy-Based Methods
Value Network and Policy Network
State-Value Function Approximation
Definition: State-value function.
• V_π(s) = ∑_a π(a|s) · Q_π(s, a) ≈ ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Policy network (actor):
• Use a neural net π(a|s; 𝛉) to approximate π(a|s).
• 𝛉: trainable parameters of the neural net.

Value network (critic):
• Use a neural net q(s, a; 𝐰) to approximate Q_π(s, a).
• 𝐰: trainable parameters of the neural net.
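
The approximation above says the state value is just the probability-weighted average of the critic's action values. A minimal sketch in plain Python, reusing the illustrative probabilities and action values that appear in the network figures below (the numbers are examples, not outputs of a trained model):

# Hypothetical outputs for the three Mario actions ["left", "right", "up"].
probs = [0.2, 0.1, 0.7]              # pi(a|s; theta), must sum to 1
q_values = [2000.0, 1000.0, 3000.0]  # q(s, a; w)

# V(s; theta, w) = sum_a pi(a|s; theta) * q(s, a; w)
v = sum(p * q for p, q in zip(probs, q_values))
print(v)  # 0.2*2000 + 0.1*1000 + 0.7*3000 = 2600.0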
Policy Network (Actor): π(a|s; 𝛉)
• Input: state s, e.g., a screenshot of Super Mario.
• Output: probability distribution over the actions.
• Let 𝒜 be the set of all actions, e.g., 𝒜 = {"left", "right", "up"}.
• ∑_{a∈𝒜} π(a|s; 𝛉) = 1. (That is why we use a softmax activation.)

[Figure: state s → Conv → Dense → Softmax → probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.]
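
A minimal PyTorch sketch of such a policy network, assuming an 84×84 grayscale screenshot as the state and the three actions above; the layer sizes are illustrative assumptions, not taken from the slides:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a|s; theta): screenshot in, probability distribution over actions out."""
    def __init__(self, num_actions=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),   # 9x9 feature map for an 84x84 input
            nn.Linear(256, num_actions),
        )

    def forward(self, state):                        # state: (batch, 1, 84, 84)
        logits = self.dense(self.conv(state))
        return torch.softmax(logits, dim=-1)         # softmax => probabilities sum to 1

For example, PolicyNetwork()(torch.zeros(1, 1, 84, 84)) returns a (1, 3) tensor of action probabilities.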
Value Network (Critic): q(s, a; 𝐰)
• Input: state s.
• Output: action values of all the actions.

[Figure: state s → Conv (feature) → Dense → action values, e.g., q(s, "left"; 𝐰) = 2000, q(s, "right"; 𝐰) = 1000, q(s, "up"; 𝐰) = 3000.]
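
A matching PyTorch sketch of the critic; it takes only the state and outputs one action value per action, so there is no softmax at the end (again, layer sizes are illustrative assumptions):

import torch.nn as nn

class ValueNetwork(nn.Module):
    """q(s, a; w): screenshot in, one action value per action out."""
    def __init__(self, num_actions=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),             # raw action values, no softmax
        )

    def forward(self, state):
        return self.dense(self.conv(state))          # shape: (batch, num_actions)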
Actor-Critic Method

[Figure: the two networks, policy network (actor) and value network (critic).]

Train the Neural Networks
Train the networks
Definition: State-value function approximated using neural networks.
• V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Training: Update the parameters 𝛉 and 𝐰.
• Update the policy network π(a|s; 𝛉) to increase the state-value V(s; 𝛉, 𝐰).
  • Actor gradually performs better.
  • Supervision is purely from the value network (critic).
• Update the value network q(s, a; 𝐰) to better estimate the return.
  • Critic's judgement becomes more accurate.
  • Supervision is purely from the rewards.
Train the networks
Definition: State-value function approximated using neural networks.
• V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Training: Update the parameters 𝛉 and 𝐰.
1. Observe the state s_t.
2. Randomly sample action a_t according to π(·|s_t; 𝛉_t).
3. Perform a_t and observe the new state s_{t+1} and reward r_t.
4. Update 𝐰 (in the value network) using temporal difference (TD).
5. Update 𝛉 (in the policy network) using the policy gradient.
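
Step 2 (sampling an action from the actor's output) can be written with a categorical distribution. A minimal sketch, assuming the PolicyNetwork from the earlier slide:

import torch

def sample_action(policy_net, state):
    """a_t ~ pi(.|s_t; theta): draw a random action index from the policy's probabilities."""
    with torch.no_grad():
        probs = policy_net(state)                    # shape (1, num_actions), rows sum to 1
    return torch.distributions.Categorical(probs=probs).sample().item()

Steps 4 and 5 are spelled out on the next two slides.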
Update value network q using TD

• Compute q(s_t, a_t; 𝐰_t) and q(s_{t+1}, a_{t+1}; 𝐰_t).
• TD target: y_t = r_t + γ · q(s_{t+1}, a_{t+1}; 𝐰_t).
• Loss: L(𝐰) = (1/2) · [q(s_t, a_t; 𝐰) − y_t]².
• Gradient descent: 𝐰_{t+1} = 𝐰_t − α · ∂L(𝐰)/∂𝐰 |_{𝐰=𝐰_t}.
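
A sketch of this TD update in PyTorch, assuming the ValueNetwork above and an SGD optimizer built as torch.optim.SGD(value_net.parameters(), lr=alpha); the discount factor below is an illustrative assumption:

import torch

def td_update(value_net, optimizer, s_t, a_t, r_t, s_next, a_next, gamma=0.98):
    """One TD step: gradient descent on L(w) = (q(s_t, a_t; w) - y_t)^2 / 2."""
    q_t = value_net(s_t)[0, a_t]                          # q(s_t, a_t; w)
    with torch.no_grad():                                 # the TD target is treated as a constant
        y_t = r_t + gamma * value_net(s_next)[0, a_next]  # y_t = r_t + gamma * q(s_{t+1}, a_{t+1}; w)
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()                                       # dL(w)/dw
    optimizer.step()                                      # w_{t+1} = w_t - alpha * dL/dw
    return loss.item()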
Update policy network π using policy gradient
Definition: State-value function approximated using neural networks.
• V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Policy gradient: Derivative of V(s_t; 𝛉, 𝐰) w.r.t. 𝛉.
• Let g(a, 𝛉) = ∂ log π(a|s_t; 𝛉)/∂𝛉 · q(s_t, a; 𝐰).
• ∂V(s_t; 𝛉, 𝐰)/∂𝛉 = 𝔼_A[g(A, 𝛉)], where A ∼ π(·|s_t; 𝛉). (Here q(s_t, a; 𝐰) is treated as independent of 𝛉.)

Algorithm: Update the policy network using stochastic policy gradient.
• Random sampling: a ∼ π(·|s_t; 𝛉_t). (Thus g(a, 𝛉) is an unbiased estimate of the policy gradient.)
• Stochastic gradient ascent: 𝛉_{t+1} = 𝛉_t + β · g(a, 𝛉_t).
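
A sketch of the stochastic policy-gradient step in PyTorch, assuming the two networks above and a policy optimizer built as torch.optim.SGD(policy_net.parameters(), lr=beta). Gradient ascent on log π(a|s_t; 𝛉) · q(s_t, a; 𝐰) is implemented as gradient descent on its negation, with the critic's value treated as a constant coefficient:

import torch

def policy_gradient_update(policy_net, policy_optimizer, value_net, s_t):
    """One stochastic policy-gradient step: theta <- theta + beta * g(a, theta)."""
    probs = policy_net(s_t)                               # pi(.|s_t; theta), shape (1, num_actions)
    dist = torch.distributions.Categorical(probs=probs)
    a = dist.sample()                                     # random sampling => unbiased gradient estimate
    with torch.no_grad():
        q_value = value_net(s_t)[0, a.item()]             # q(s_t, a; w), constant w.r.t. theta
    loss = -(dist.log_prob(a).squeeze() * q_value)        # -log pi(a|s_t; theta) * q(s_t, a; w)
    policy_optimizer.zero_grad()
    loss.backward()                                       # gradient of -g(a, theta)
    policy_optimizer.step()                               # descent on -g == ascent on g
    return a.item()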
Actor-Critic Method

[Figure: the policy network (actor) observes state s from the environment and outputs action a.]
Actor-Critic Method

[Figure: given state s, the policy network (actor) outputs action a, the value network (critic) outputs value q, and the environment returns reward r and the next state.]
Actor-Critic Method: Update Actor

[Figure: to update the actor, the critic's value q for the actor's action a in state s supervises the policy network.]
Actor-Critic Method: Update Critic

[Figure: to update the critic, the reward r from the environment supervises the value network's value q for (s, a).]
Summary of Algorithm
1. Observe state s_t and randomly sample a_t ∼ π(·|s_t; 𝛉_t).
2. Perform a_t; then the environment gives the new state s_{t+1} and reward r_t.
3. Randomly sample ã_{t+1} ∼ π(·|s_{t+1}; 𝛉_t). (Do not perform ã_{t+1}!)
4. Evaluate the value network: q_t = q(s_t, a_t; 𝐰_t) and q_{t+1} = q(s_{t+1}, ã_{t+1}; 𝐰_t).
5. Compute the TD error: δ_t = q_t − (r_t + γ · q_{t+1}); the term r_t + γ · q_{t+1} is the TD target.
6. Differentiate the value network: d_{𝐰,t} = ∂q(s_t, a_t; 𝐰)/∂𝐰 |_{𝐰=𝐰_t}.
7. Update the value network: 𝐰_{t+1} = 𝐰_t − α · δ_t · d_{𝐰,t}.
8. Differentiate the policy network: d_{𝛉,t} = ∂ log π(a_t|s_t; 𝛉)/∂𝛉 |_{𝛉=𝛉_t}.
9. Update the policy network: 𝛉_{t+1} = 𝛉_t + β · q_t · d_{𝛉,t}.
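
The nine steps map almost line for line onto code. Below is a compact single-iteration sketch, assuming the PolicyNetwork and ValueNetwork sketches from earlier and a hypothetical environment whose step(action) returns (next_state, reward); the discount factor and the learning rates alpha and beta are illustrative. Manual parameter loops are used only to mirror steps 7 and 9 literally; in practice one would use torch.optim optimizers.

import torch

def actor_critic_step(env, policy_net, value_net, s_t, gamma=0.98, alpha=1e-3, beta=1e-3):
    """One iteration of the actor-critic update (steps 1-9 of the summary)."""
    with torch.no_grad():
        # 1-2. Observe s_t, sample a_t ~ pi(.|s_t; theta_t), and perform a_t.
        a_t = torch.distributions.Categorical(probs=policy_net(s_t)).sample()
        s_next, r_t = env.step(a_t.item())                # hypothetical env interface
        # 3. Sample a~_{t+1} ~ pi(.|s_{t+1}; theta_t), but do NOT perform it.
        a_next = torch.distributions.Categorical(probs=policy_net(s_next)).sample()
        # 4-5. Evaluate the critic and compute the TD error delta_t = q_t - (r_t + gamma * q_{t+1}).
        q_t = value_net(s_t)[0, a_t.item()]
        q_next = value_net(s_next)[0, a_next.item()]
        delta_t = q_t - (r_t + gamma * q_next)

    # 6-7. d_{w,t} = dq(s_t, a_t; w)/dw, then w_{t+1} = w_t - alpha * delta_t * d_{w,t}.
    value_net.zero_grad()
    value_net(s_t)[0, a_t.item()].backward()
    with torch.no_grad():
        for p in value_net.parameters():
            p -= alpha * delta_t * p.grad

    # 8-9. d_{theta,t} = d log pi(a_t|s_t; theta)/d theta, then theta_{t+1} = theta_t + beta * q_t * d_{theta,t}.
    policy_net.zero_grad()
    dist = torch.distributions.Categorical(probs=policy_net(s_t))
    dist.log_prob(a_t).squeeze().backward()
    with torch.no_grad():
        for p in policy_net.parameters():
            p += beta * q_t * p.grad

    return s_next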
Summary
Policy Network and Value Network

Definition: State-value function.
• V_π(s) = ∑_a π(a|s) · Q_π(s, a).

Definition: function approximation using neural networks.
• Approximate the policy function π(a|s) by π(a|s; 𝛉) (actor).
• Approximate the value function Q_π(s, a) by q(s, a; 𝐰) (critic).
Roles of Actor and Critic

During training
• Agent is controlled by the policy network (actor): a_t ∼ π(·|s_t; 𝛉).
• Value network q (critic) provides the actor with supervision.

After training
• Agent is controlled by the policy network (actor): a_t ∼ π(·|s_t; 𝛉).
• Value network q (critic) will not be used.
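
After training, only the actor is queried; a minimal sketch, assuming the PolicyNetwork above:

import torch

def act(policy_net, state):
    """Control the agent after training: a_t ~ pi(.|s_t; theta). The critic is not used."""
    with torch.no_grad():
        probs = policy_net(state)
    return torch.distributions.Categorical(probs=probs).sample().item()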
Training
Update the policy network (actor) by policy gradient.
• Seek to increase the state-value: V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).
• Compute the policy gradient: ∂V(s; 𝛉, 𝐰)/∂𝛉 = 𝔼_A[∂ log π(A|s; 𝛉)/∂𝛉 · q(s, A; 𝐰)].
• Perform gradient ascent.

Update the value network (critic) by TD learning.
• Predicted action value: q_t = q(s_t, a_t; 𝐰).
• TD target: y_t = r_t + γ · q(s_{t+1}, a_{t+1}; 𝐰).
• Gradient: ∂[(q_t − y_t)²/2]/∂𝐰 = (q_t − y_t) · ∂q(s_t, a_t; 𝐰)/∂𝐰.
• Perform gradient descent.
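
For completeness, the policy-gradient identity above follows in two steps, treating q(s, a; 𝐰) as independent of 𝛉:

∂V(s; 𝛉, 𝐰)/∂𝛉 = ∑_a ∂π(a|s; 𝛉)/∂𝛉 · q(s, a; 𝐰)
              = ∑_a π(a|s; 𝛉) · ∂ log π(a|s; 𝛉)/∂𝛉 · q(s, a; 𝐰)      (since ∂π/∂𝛉 = π · ∂ log π/∂𝛉)
              = 𝔼_{A∼π(·|s; 𝛉)}[ ∂ log π(A|s; 𝛉)/∂𝛉 · q(s, A; 𝐰) ].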
Thank you!
