Shusen Wang
Value-Based Methods | Actor-Critic Methods | Policy-Based Methods
Value Network and Policy Network
State-Value Function Approximation
Definition: State-value function.
• 𝑉_𝜋(𝑠) = ∑_𝑎 𝜋(𝑎|𝑠) ⋅ 𝑄_𝜋(𝑠, 𝑎).
Policy Network (Actor): 𝜋(𝑎|𝑠; 𝛉)
[Figure: state 𝑠 → Conv → Dense → Softmax → action probabilities, e.g. "left": 0.2, "right": 0.1, "up": 0.7.]
Value Network (Critic): 𝑞(𝑠, 𝑎; 𝐰)
• Input: state 𝑠.
• Output: action-values of all the actions.
[Figure: state 𝑠 → Conv → feature → Dense → action-values, e.g. 𝑞(𝑠, "left"; 𝐰) = 2000, 𝑞(𝑠, "right"; 𝐰) = 1000, 𝑞(𝑠, "up"; 𝐰) = 3000.]
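The two networks can be sketched as follows. This is a minimal illustration, assuming a discrete action space and a flat state feature vector; the class names, layer sizes, and the use of a small MLP in place of the slides' convolutional layers are assumptions for brevity, not part of the original method.

```python
# Minimal PyTorch sketch of the actor and critic networks (illustrative only).
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Actor: pi(a | s; theta). Outputs a probability for every action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(s), dim=-1)  # probabilities sum to 1

class ValueNetwork(nn.Module):
    """Critic: q(s, a; w). Outputs the action-value of every action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)  # one q-value per action, no softmax
```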
Actor-Critic Method
Update the value network 𝑞 using TD learning:
• Loss: 𝐿(𝐰) = ½ [𝑞(𝑠_𝑡, 𝑎_𝑡; 𝐰) − 𝑦_𝑡]², where 𝑦_𝑡 is the TD target.
• Gradient descent: 𝐰_{𝑡+1} = 𝐰_𝑡 − 𝛼 ⋅ ∂𝐿(𝐰)/∂𝐰 |_{𝐰 = 𝐰_𝑡}.
Update the policy network 𝜋 using policy gradient.
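As a rough illustration of the TD update above, the snippet below performs one gradient-descent step on the loss, assuming the ValueNetwork sketch from earlier; the transition values, 𝛾, and 𝛼 are placeholders rather than values from the slides.

```python
# One TD-learning step for the critic (illustrative sketch).
import torch

gamma, alpha = 0.98, 1e-3                      # placeholder hyperparameters
critic = ValueNetwork(state_dim=4, n_actions=3)
optimizer = torch.optim.SGD(critic.parameters(), lr=alpha)

s_t,  a_t  = torch.randn(4), 2                 # placeholder transition
s_t1, a_t1 = torch.randn(4), 0
r_t = 1.0

q_t = critic(s_t)[a_t]                         # q(s_t, a_t; w)
with torch.no_grad():                          # TD target y_t is held fixed
    y_t = r_t + gamma * critic(s_t1)[a_t1]

loss = 0.5 * (q_t - y_t) ** 2                  # L(w) = 1/2 [q(s_t,a_t;w) - y_t]^2
optimizer.zero_grad()
loss.backward()                                # dL(w)/dw
optimizer.step()                               # w <- w - alpha * dL/dw
```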
Definition: State-value function approximated using neural networks.
• 𝑉(𝑠; 𝛉, 𝐰) = ∑_𝑎 𝜋(𝑎|𝑠; 𝛉) ⋅ 𝑞(𝑠, 𝑎; 𝐰).
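For instance, plugging in the illustrative outputs shown in the figures above, 𝑉(𝑠; 𝛉, 𝐰) = 0.2 ⋅ 2000 + 0.1 ⋅ 1000 + 0.7 ⋅ 3000 = 2600.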
Actor-Critic Method
[Figure: interaction loop. The policy network (actor) observes state 𝑠 from the environment and outputs action 𝑎.]
Actor-Critic Method: Update Actor
[Figure: state 𝑠 feeds both the policy network (actor) and the value network (critic); the actor outputs action 𝑎, and the critic's value 𝑞 supervises the actor.]
Actor-Critic Method: Update Critic
[Figure: the same interaction loop; the critic observes state 𝑠 and action 𝑎.]
Summary of Algorithm
1. Observe state 𝑠_𝑡 and randomly sample 𝑎_𝑡 ∼ 𝜋(⋅ | 𝑠_𝑡; 𝛉_𝑡).
2. Perform 𝑎_𝑡; then the environment gives new state 𝑠_{𝑡+1} and reward 𝑟_𝑡.
3. Randomly sample ã_{𝑡+1} ∼ 𝜋(⋅ | 𝑠_{𝑡+1}; 𝛉_𝑡). (Do not perform ã_{𝑡+1}!)
4. Evaluate value network: 𝑞_𝑡 = 𝑞(𝑠_𝑡, 𝑎_𝑡; 𝐰_𝑡) and 𝑞_{𝑡+1} = 𝑞(𝑠_{𝑡+1}, ã_{𝑡+1}; 𝐰_𝑡).
5. Compute TD error: 𝛿_𝑡 = 𝑞_𝑡 − (𝑟_𝑡 + 𝛾 ⋅ 𝑞_{𝑡+1}), where 𝑟_𝑡 + 𝛾 ⋅ 𝑞_{𝑡+1} is the TD target.
6. Differentiate value network: 𝐝_{𝐰,𝑡} = ∂𝑞(𝑠_𝑡, 𝑎_𝑡; 𝐰)/∂𝐰 |_{𝐰 = 𝐰_𝑡}.
7. Update value network: 𝐰_{𝑡+1} = 𝐰_𝑡 − 𝛼 ⋅ 𝛿_𝑡 ⋅ 𝐝_{𝐰,𝑡}.
8. Differentiate policy network: 𝐝_{𝛉,𝑡} = ∂ log 𝜋(𝑎_𝑡 | 𝑠_𝑡; 𝛉)/∂𝛉 |_{𝛉 = 𝛉_𝑡}.
9. Update policy network: 𝛉_{𝑡+1} = 𝛉_𝑡 + 𝛽 ⋅ 𝑞_𝑡 ⋅ 𝐝_{𝛉,𝑡}.
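A minimal sketch of one full iteration (steps 1-9) is given below, assuming the PolicyNetwork and ValueNetwork sketches above, a hypothetical Gym-style `env`, and SGD optimizers with learning rates 𝛼 (critic) and 𝛽 (actor); none of these names or values come from the slides.

```python
# One iteration of the actor-critic algorithm above (illustrative sketch).
import torch

def actor_critic_step(env, s_t, actor, critic, actor_opt, critic_opt, gamma=0.98):
    s_t = torch.as_tensor(s_t, dtype=torch.float32)

    # 1. Sample a_t ~ pi(. | s_t; theta_t).
    a_t = torch.multinomial(actor(s_t), 1).item()

    # 2. Perform a_t; the environment returns s_{t+1} and r_t.
    s_t1, r_t, done, _ = env.step(a_t)
    s_t1 = torch.as_tensor(s_t1, dtype=torch.float32)

    with torch.no_grad():
        # 3. Sample a~_{t+1} ~ pi(. | s_{t+1}; theta_t); do not perform it.
        a_t1 = torch.multinomial(actor(s_t1), 1).item()
        # 4. Evaluate the value network.
        q_t  = critic(s_t)[a_t]                  # q_t, used as a constant below
        q_t1 = critic(s_t1)[a_t1]
        # 5. TD error: delta_t = q_t - (r_t + gamma * q_{t+1}).
        delta_t = q_t - (r_t + gamma * q_t1)

    # 6.-7. Critic: w <- w - alpha * delta_t * d q(s_t, a_t; w)/d w,
    # implemented as one SGD step on the surrogate delta_t * q(s_t, a_t; w).
    critic_opt.zero_grad()
    (delta_t * critic(s_t)[a_t]).backward()
    critic_opt.step()

    # 8.-9. Actor: theta <- theta + beta * q_t * d log pi(a_t | s_t; theta)/d theta,
    # i.e. gradient ascent, implemented by descending on the negative.
    actor_opt.zero_grad()
    (-q_t * torch.log(actor(s_t)[a_t])).backward()
    actor_opt.step()

    return s_t1, done
```

The optimizers would be created once, e.g. torch.optim.SGD(critic.parameters(), lr=alpha) and torch.optim.SGD(actor.parameters(), lr=beta), and the function called repeatedly along an episode. As in step 9 above, the policy gradient here is scaled by 𝑞_𝑡.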
Summary
Roles of Actor and Critic
During training
• The agent is controlled by the policy network (actor): 𝑎_𝑡 ∼ 𝜋(⋅ | 𝑠_𝑡; 𝛉).
• The value network 𝑞 (critic) provides the actor with supervision.
After training
• The agent is controlled by the policy network (actor): 𝑎_𝑡 ∼ 𝜋(⋅ | 𝑠_𝑡; 𝛉).
• The value network 𝑞 (critic) will not be used.
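As a small illustration of the deployed actor, the sketch below reuses the hypothetical `actor` and Gym-style `env` from above: actions are sampled from 𝜋(⋅ | 𝑠; 𝛉) and the critic is never queried.

```python
# After training: the actor alone controls the agent (illustrative sketch).
import torch

s, done = env.reset(), False
while not done:
    with torch.no_grad():
        probs = actor(torch.as_tensor(s, dtype=torch.float32))
    a = torch.multinomial(probs, 1).item()    # a_t ~ pi(. | s_t; theta)
    s, r, done, _ = env.step(a)               # the critic is not used
```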
Training
Update the policy network (actor) by policy gradient.
• Seek to increase the state value: 𝑉(𝑠; 𝛉, 𝐰) = ∑_𝑎 𝜋(𝑎|𝑠; 𝛉) ⋅ 𝑞(𝑠, 𝑎; 𝐰).
• Compute the policy gradient: ∂𝑉(𝑠; 𝛉)/∂𝛉 = 𝔼_𝐴[ ∂ log 𝜋(𝐴|𝑠; 𝛉)/∂𝛉 ⋅ 𝑞(𝑠, 𝐴; 𝐰) ].
• Perform gradient ascent. In practice, the expectation over 𝐴 ∼ 𝜋(⋅ | 𝑠; 𝛉) is approximated with a single sampled action, which gives steps 8 and 9 of the algorithm above.