
Actor-Critic Methods

Shusen Wang
Value-Based Methods · Actor-Critic Methods · Policy-Based Methods
Value Network and Policy Network
State-Value Function Approximation
Definition: State-value function.
• V_π(s) = ∑_a π(a|s) · Q_π(s, a) ≈ ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Policy network (actor):
• Use a neural net π(a|s; 𝛉) to approximate π(a|s).
• 𝛉: trainable parameters of the neural net.

Value network (critic):
• Use a neural net q(s, a; 𝐰) to approximate Q_π(s, a).
• 𝐰: trainable parameters of the neural net.
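
The approximation above says the state value is just the probability-weighted average of the critic's action values. A minimal sketch in plain Python, reusing the illustrative probabilities and action values that appear in the network figures below (the numbers are examples, not outputs of a trained model):

# Hypothetical outputs for the three Mario actions ["left", "right", "up"].
probs = [0.2, 0.1, 0.7]              # pi(a|s; theta), must sum to 1
q_values = [2000.0, 1000.0, 3000.0]  # q(s, a; w)

# V(s; theta, w) = sum_a pi(a|s; theta) * q(s, a; w)
v = sum(p * q for p, q in zip(probs, q_values))
print(v)  # 0.2*2000 + 0.1*1000 + 0.7*3000 = 2600.0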
Policy Network (Actor): π(a|s; 𝛉)
• Input: state s, e.g., a screenshot of Super Mario.
• Output: probability distribution over the actions.
• Let 𝒜 be the set of all actions, e.g., 𝒜 = {"left", "right", "up"}.
• ∑_{a∈𝒜} π(a|s; 𝛉) = 1. (That is why we use a softmax activation.)

[Figure: state s → Conv → Dense → Softmax → probabilities, e.g., "left": 0.2, "right": 0.1, "up": 0.7.]
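
A minimal PyTorch sketch of such a policy network, assuming an 84×84 grayscale screenshot as the state and the three actions above; the layer sizes are illustrative assumptions, not taken from the slides:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a|s; theta): screenshot in, probability distribution over actions out."""
    def __init__(self, num_actions=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),   # 9x9 feature map for an 84x84 input
            nn.Linear(256, num_actions),
        )

    def forward(self, state):                        # state: (batch, 1, 84, 84)
        logits = self.dense(self.conv(state))
        return torch.softmax(logits, dim=-1)         # softmax => probabilities sum to 1

For example, PolicyNetwork()(torch.zeros(1, 1, 84, 84)) returns a (1, 3) tensor of action probabilities.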
Value Network (Critic): q(s, a; 𝐰)
• Input: state s.
• Output: action values of all the actions.

[Figure: state s → Conv (feature) → Dense → action values, e.g., q(s, "left"; 𝐰) = 2000, q(s, "right"; 𝐰) = 1000, q(s, "up"; 𝐰) = 3000.]
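
A matching PyTorch sketch of the critic; it takes only the state and outputs one action value per action, so there is no softmax at the end (again, layer sizes are illustrative assumptions):

import torch.nn as nn

class ValueNetwork(nn.Module):
    """q(s, a; w): screenshot in, one action value per action out."""
    def __init__(self, num_actions=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),             # raw action values, no softmax
        )

    def forward(self, state):
        return self.dense(self.conv(state))          # shape: (batch, num_actions)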
Actor-Critic Method

[Figure: the two networks, policy network (actor) and value network (critic).]

Train the Neural Networks
Train the networks
Definition: State-value function approximated using neural networks.
• V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Training: Update the parameters 𝛉 and 𝐰.
• Update the policy network π(a|s; 𝛉) to increase the state-value V(s; 𝛉, 𝐰).
  • Actor gradually performs better.
  • Supervision is purely from the value network (critic).
• Update the value network q(s, a; 𝐰) to better estimate the return.
  • Critic's judgement becomes more accurate.
  • Supervision is purely from the rewards.
Train the networks
Definition: State-value function approximated using neural networks.
• V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Training: Update the parameters 𝛉 and 𝐰.
1. Observe the state s_t.
2. Randomly sample action a_t according to π(·|s_t; 𝛉_t).
3. Perform a_t and observe the new state s_{t+1} and reward r_t.
4. Update 𝐰 (in the value network) using temporal difference (TD).
5. Update 𝛉 (in the policy network) using the policy gradient.
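
Step 2 (sampling an action from the actor's output) can be written with a categorical distribution. A minimal sketch, assuming the PolicyNetwork from the earlier slide:

import torch

def sample_action(policy_net, state):
    """a_t ~ pi(.|s_t; theta): draw a random action index from the policy's probabilities."""
    with torch.no_grad():
        probs = policy_net(state)                    # shape (1, num_actions), rows sum to 1
    return torch.distributions.Categorical(probs=probs).sample().item()

Steps 4 and 5 are spelled out on the next two slides.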
Update value network q using TD

• Compute q(s_t, a_t; 𝐰_t) and q(s_{t+1}, a_{t+1}; 𝐰_t).
• TD target: y_t = r_t + γ · q(s_{t+1}, a_{t+1}; 𝐰_t).
• Loss: L(𝐰) = (1/2) · [q(s_t, a_t; 𝐰) − y_t]².
• Gradient descent: 𝐰_{t+1} = 𝐰_t − α · ∂L(𝐰)/∂𝐰 |_{𝐰=𝐰_t}.
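
A sketch of this TD update in PyTorch, assuming the ValueNetwork above and an SGD optimizer built as torch.optim.SGD(value_net.parameters(), lr=alpha); the discount factor below is an illustrative assumption:

import torch

def td_update(value_net, optimizer, s_t, a_t, r_t, s_next, a_next, gamma=0.98):
    """One TD step: gradient descent on L(w) = (q(s_t, a_t; w) - y_t)^2 / 2."""
    q_t = value_net(s_t)[0, a_t]                          # q(s_t, a_t; w)
    with torch.no_grad():                                 # the TD target is treated as a constant
        y_t = r_t + gamma * value_net(s_next)[0, a_next]  # y_t = r_t + gamma * q(s_{t+1}, a_{t+1}; w)
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()                                       # dL(w)/dw
    optimizer.step()                                      # w_{t+1} = w_t - alpha * dL/dw
    return loss.item()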
Update policy network π using policy gradient
Definition: State-value function approximated using neural networks.
• V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).

Policy gradient: Derivative of V(s_t; 𝛉, 𝐰) w.r.t. 𝛉.
• Let g(a, 𝛉) = ∂ log π(a|s_t; 𝛉)/∂𝛉 · q(s_t, a; 𝐰).
• ∂V(s_t; 𝛉, 𝐰)/∂𝛉 = 𝔼_A[g(A, 𝛉)], where A ∼ π(·|s_t; 𝛉). (Here q(s_t, a; 𝐰) is treated as independent of 𝛉.)

Algorithm: Update the policy network using stochastic policy gradient.
• Random sampling: a ∼ π(·|s_t; 𝛉_t). (Thus g(a, 𝛉) is an unbiased estimate of the policy gradient.)
• Stochastic gradient ascent: 𝛉_{t+1} = 𝛉_t + β · g(a, 𝛉_t).
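
A sketch of the stochastic policy-gradient step in PyTorch, assuming the two networks above and a policy optimizer built as torch.optim.SGD(policy_net.parameters(), lr=beta). Gradient ascent on log π(a|s_t; 𝛉) · q(s_t, a; 𝐰) is implemented as gradient descent on its negation, with the critic's value treated as a constant coefficient:

import torch

def policy_gradient_update(policy_net, policy_optimizer, value_net, s_t):
    """One stochastic policy-gradient step: theta <- theta + beta * g(a, theta)."""
    probs = policy_net(s_t)                               # pi(.|s_t; theta), shape (1, num_actions)
    dist = torch.distributions.Categorical(probs=probs)
    a = dist.sample()                                     # random sampling => unbiased gradient estimate
    with torch.no_grad():
        q_value = value_net(s_t)[0, a.item()]             # q(s_t, a; w), constant w.r.t. theta
    loss = -(dist.log_prob(a).squeeze() * q_value)        # -log pi(a|s_t; theta) * q(s_t, a; w)
    policy_optimizer.zero_grad()
    loss.backward()                                       # gradient of -g(a, theta)
    policy_optimizer.step()                               # descent on -g == ascent on g
    return a.item()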
Actor-Critic Method

[Figure: the policy network (actor) observes state s from the environment and outputs action a.]
Actor-Critic Method

[Figure: given state s, the policy network (actor) outputs action a, the value network (critic) outputs value q, and the environment returns reward r and the next state.]
Actor-Critic Method: Update Actor

[Figure: to update the actor, the critic's value q for the actor's action a in state s supervises the policy network.]
Actor-Critic Method: Update Critic

[Figure: to update the critic, the reward r from the environment supervises the value network's value q for (s, a).]
Summary of Algorithm
1. Observe state s_t and randomly sample a_t ∼ π(·|s_t; 𝛉_t).
2. Perform a_t; then the environment gives the new state s_{t+1} and reward r_t.
3. Randomly sample ã_{t+1} ∼ π(·|s_{t+1}; 𝛉_t). (Do not perform ã_{t+1}!)
4. Evaluate the value network: q_t = q(s_t, a_t; 𝐰_t) and q_{t+1} = q(s_{t+1}, ã_{t+1}; 𝐰_t).
5. Compute the TD error: δ_t = q_t − (r_t + γ · q_{t+1}); the term r_t + γ · q_{t+1} is the TD target.
6. Differentiate the value network: d_{𝐰,t} = ∂q(s_t, a_t; 𝐰)/∂𝐰 |_{𝐰=𝐰_t}.
7. Update the value network: 𝐰_{t+1} = 𝐰_t − α · δ_t · d_{𝐰,t}.
8. Differentiate the policy network: d_{𝛉,t} = ∂ log π(a_t|s_t; 𝛉)/∂𝛉 |_{𝛉=𝛉_t}.
9. Update the policy network: 𝛉_{t+1} = 𝛉_t + β · q_t · d_{𝛉,t}.
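
The nine steps map almost line for line onto code. Below is a compact single-iteration sketch, assuming the PolicyNetwork and ValueNetwork sketches from earlier and a hypothetical environment whose step(action) returns (next_state, reward); the discount factor and the learning rates alpha and beta are illustrative. Manual parameter loops are used only to mirror steps 7 and 9 literally; in practice one would use torch.optim optimizers.

import torch

def actor_critic_step(env, policy_net, value_net, s_t, gamma=0.98, alpha=1e-3, beta=1e-3):
    """One iteration of the actor-critic update (steps 1-9 of the summary)."""
    with torch.no_grad():
        # 1-2. Observe s_t, sample a_t ~ pi(.|s_t; theta_t), and perform a_t.
        a_t = torch.distributions.Categorical(probs=policy_net(s_t)).sample()
        s_next, r_t = env.step(a_t.item())                # hypothetical env interface
        # 3. Sample a~_{t+1} ~ pi(.|s_{t+1}; theta_t), but do NOT perform it.
        a_next = torch.distributions.Categorical(probs=policy_net(s_next)).sample()
        # 4-5. Evaluate the critic and compute the TD error delta_t = q_t - (r_t + gamma * q_{t+1}).
        q_t = value_net(s_t)[0, a_t.item()]
        q_next = value_net(s_next)[0, a_next.item()]
        delta_t = q_t - (r_t + gamma * q_next)

    # 6-7. d_{w,t} = dq(s_t, a_t; w)/dw, then w_{t+1} = w_t - alpha * delta_t * d_{w,t}.
    value_net.zero_grad()
    value_net(s_t)[0, a_t.item()].backward()
    with torch.no_grad():
        for p in value_net.parameters():
            p -= alpha * delta_t * p.grad

    # 8-9. d_{theta,t} = d log pi(a_t|s_t; theta)/d theta, then theta_{t+1} = theta_t + beta * q_t * d_{theta,t}.
    policy_net.zero_grad()
    dist = torch.distributions.Categorical(probs=policy_net(s_t))
    dist.log_prob(a_t).squeeze().backward()
    with torch.no_grad():
        for p in policy_net.parameters():
            p += beta * q_t * p.grad

    return s_next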
Summary
Policy Network and Value Network

Definition: State-value function.
• V_π(s) = ∑_a π(a|s) · Q_π(s, a).

Definition: function approximation using neural networks.
• Approximate the policy function π(a|s) by π(a|s; 𝛉) (actor).
• Approximate the value function Q_π(s, a) by q(s, a; 𝐰) (critic).
Roles of Actor and Critic

During training
• Agent is controlled by the policy network (actor): a_t ∼ π(·|s_t; 𝛉).
• Value network q (critic) provides the actor with supervision.

After training
• Agent is controlled by the policy network (actor): a_t ∼ π(·|s_t; 𝛉).
• Value network q (critic) will not be used.
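
After training, only the actor is queried; a minimal sketch, assuming the PolicyNetwork above:

import torch

def act(policy_net, state):
    """Control the agent after training: a_t ~ pi(.|s_t; theta). The critic is not used."""
    with torch.no_grad():
        probs = policy_net(state)
    return torch.distributions.Categorical(probs=probs).sample().item()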
Training
Update the policy network (actor) by policy gradient.
• Seek to increase the state-value: V(s; 𝛉, 𝐰) = ∑_a π(a|s; 𝛉) · q(s, a; 𝐰).
• Compute the policy gradient: ∂V(s; 𝛉, 𝐰)/∂𝛉 = 𝔼_A[∂ log π(A|s; 𝛉)/∂𝛉 · q(s, A; 𝐰)].
• Perform gradient ascent.

Update the value network (critic) by TD learning.
• Predicted action value: q_t = q(s_t, a_t; 𝐰).
• TD target: y_t = r_t + γ · q(s_{t+1}, a_{t+1}; 𝐰).
• Gradient: ∂[(q_t − y_t)²/2]/∂𝐰 = (q_t − y_t) · ∂q(s_t, a_t; 𝐰)/∂𝐰.
• Perform gradient descent.
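
For completeness, the policy-gradient identity above follows in two steps, treating q(s, a; 𝐰) as independent of 𝛉:

∂V(s; 𝛉, 𝐰)/∂𝛉 = ∑_a ∂π(a|s; 𝛉)/∂𝛉 · q(s, a; 𝐰)
              = ∑_a π(a|s; 𝛉) · ∂ log π(a|s; 𝛉)/∂𝛉 · q(s, a; 𝐰)      (since ∂π/∂𝛉 = π · ∂ log π/∂𝛉)
              = 𝔼_{A∼π(·|s; 𝛉)}[ ∂ log π(A|s; 𝛉)/∂𝛉 · q(s, A; 𝐰) ].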
Thank you!
