
AlphaGo

Shusen Wang
Disclaimer

• What is taught in this lecture is not exactly the same as in the original
AlphaGo papers [1, 2].
• There are simplifications here.

Reference

1. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
2. Silver et al. Mastering the game of Go without human knowledge. Nature, 2017.
Go Game

• The standard Go board has a 19×19 grid of lines, containing 361 points.
• State: the arrangement of black stones, white stones, and empty points.
• The state s can be represented as a 19×19×2 tensor of 0s and 1s.
• (AlphaGo actually uses a 19×19×48 tensor to store additional information.)
• Action: place a stone on a vacant point.
• Action space: 𝒜 ⊆ {1, 2, 3, ⋯ , 361}.
• Go is very complex: the number of possible sequences of actions is astronomically large, far exceeding the number of atoms in the observable universe.
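
To make the 19×19×2 encoding above concrete, here is a minimal Python sketch. The move-list input format is an assumption for illustration only; AlphaGo's actual 48 feature planes are much richer.

import numpy as np

def encode_state(moves):
    """Encode a Go position as a 19x19x2 binary tensor.

    moves: list of (row, col, color) with color in {"black", "white"}.
    Plane 0 marks black stones, plane 1 marks white stones.
    (AlphaGo itself uses 48 feature planes; this is the simplified
    2-plane encoding from the slide.)
    """
    state = np.zeros((19, 19, 2), dtype=np.float32)
    for row, col, color in moves:
        plane = 0 if color == "black" else 1
        state[row, col, plane] = 1.0
    return state

# Example: two stones on the board.
s = encode_state([(3, 3, "black"), (15, 16, "white")])
print(s.shape)  # (19, 19, 2)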
AlphaGo
High-Level Ideas
Training and Execution

Training in 3 steps:

1. Initialize the policy network using behavior cloning.
   (Supervised learning from human experience.)
2. Train the policy network using policy gradient. (Two policy networks play against each other.)
3. After training the policy network, use it to train a value network.

Execution (actually playing Go games):

• Do Monte Carlo Tree Search (MCTS) using the policy and value networks.
Policy Network
State (of AlphaGo Zero)

• AlphaGo Zero encodes the state as a 19×19×17 tensor.

Policy Network (of AlphaGo Zero)

[Diagram: state (19×19×17 tensor) → Conv layers → Dense layer → π(1 | s; θ), π(2 | s; θ), ⋯ , π(361 | s; θ): a probability distribution over the 361 actions.]
Policy Network (of AlphaGo)

[Diagram: state (19×19×48 tensor) → Conv layers → π(1 | s; θ), π(2 | s; θ), ⋯ , π(361 | s; θ): a probability distribution over the 361 actions.]
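
A rough sketch of such a policy network in PyTorch, assuming the simplified 19×19×2 input from the Go Game slide. The class name PolicyNet, the layer sizes, and the two-conv-layer trunk are illustrative assumptions, not the architecture from the papers.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Conv layers followed by a dense layer that outputs a
    probability distribution over the 361 points."""
    def __init__(self, in_planes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * 19 * 19, 361)

    def forward(self, s):
        # s: (batch, in_planes, 19, 19)
        h = self.conv(s)
        logits = self.fc(h.flatten(start_dim=1))
        return torch.softmax(logits, dim=-1)  # pi(1|s), ..., pi(361|s)

pi = PolicyNet()
probs = pi(torch.zeros(1, 2, 19, 19))
print(probs.shape, probs.sum().item())  # torch.Size([1, 361]), ~1.0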
Initialize Policy Network by Behavior Cloning
• Initially, the network's parameters are random.
• If two policy networks play against each other, they would take random actions.
• It would take very long before they make reasonable actions.
• Humans' sequences of actions have been recorded. (The KGS dataset has records of 160K games.)
• Behavior cloning: let the policy network imitate human players.
• After behavior cloning, the policy network beats advanced amateur players.

[Figure: "Learning from humans' records": the five formal games between Fan Hui and AlphaGo (Games 1-5, shown move by move). AlphaGo won all five, by 2.5 points or by resignation.]
Behavior Cloning

Behavior cloning is not reinforcement learning!

• Reinforcement learning: supervision is from rewards given by the environment.
• Imitation learning: supervision is from experts' actions.
  • The agent does not see rewards.
  • The agent simply imitates the experts' actions.
• Behavior cloning is one of the imitation learning methods.
• Behavior cloning is simply classification (or regression).
Behavior Cloning

• Observe a state s_t (from a human game record).
• The policy network makes a prediction:
  p_t = [π(1 | s_t; θ), ⋯ , π(361 | s_t; θ)] ∈ (0, 1)^361.
  (In the figure, candidate moves are predicted with probabilities 0.4, 0.3, 0.2, and 0.1.)
• The expert's action is a_t* = 281.
• Let y_t ∈ {0, 1}^361 be the one-hot encoding of a_t* = 281.
• Loss = CrossEntropy(y_t, p_t).
• Use gradient descent to update the policy network.
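
A minimal sketch of one behavior-cloning update consistent with the recipe above, reusing the illustrative PolicyNet sketched earlier. The optimizer choice and the 1e-12 numerical guard are assumptions.

import torch

def behavior_cloning_step(policy_net, optimizer, s_t, expert_action):
    """One supervised update: make pi(expert_action | s_t) larger.

    s_t:           tensor of shape (1, planes, 19, 19)
    expert_action: integer index in [0, 360]
                   (the slide's a_t* = 281 would be index 280 if 0-based)
    """
    p_t = policy_net(s_t)                       # (1, 361) probabilities
    # CrossEntropy(y_t, p_t) with one-hot y_t reduces to -log p_t[expert_action].
    loss = -torch.log(p_t[0, expert_action] + 1e-12)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage (illustrative):
# net = PolicyNet()
# opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# behavior_cloning_step(net, opt, torch.zeros(1, 2, 19, 19), expert_action=280)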
After behavior cloning…

• Suppose the current state s_t has appeared in the training data.
• Then the policy network imitates the expert's action a_t. (Which is a good action!)

Question: Why bother doing RL after behavior cloning?

• What if the current state s_t has not appeared in the training data?
• Then the policy network's action a_t can be bad.
• The number of possible states is too big.
• There is a big chance that s_t has not appeared in the training data.

Behavior cloning + RL beats behavior cloning alone with an 80% chance.


Train Policy Network Using Policy Gradient
Reinforcement learning of policy network

• The player's parameters are the latest parameters of the policy network.
• The opponent's parameters are randomly selected from previous iterations.

Player (agent): policy network with the latest parameters
vs.
Opponent (environment): policy network with old parameters
Reinforcement learning of policy network

Reinforcement learning is guided by rewards.

• Suppose a game ends at step T.
• Rewards:
  • r_1 = r_2 = r_3 = ⋯ = r_{T−1} = 0. (While the game has not ended.)
  • r_T = +1 (winner).
  • r_T = −1 (loser).
• Recall that the return is defined by u_t = Σ_{i=t}^{T} r_i. (No discount here.)
  • Winner's returns: u_1 = u_2 = ⋯ = u_T = +1.
  • Loser's returns: u_1 = u_2 = ⋯ = u_T = −1.
Policy Gradient

Policy gradient: the derivative of the state-value function V(s; θ) w.r.t. θ.

• Recall that ∂ log π(a_t | s_t; θ) / ∂θ · Q_π(s_t, a_t) is an approximate policy gradient.
• By definition, the action-value is Q_π(s_t, a_t) = E[U_t | s_t, a_t].
• Thus, we can replace Q_π(s_t, a_t) by the observed return u_t.
• Approximate policy gradient: ∂ log π(a_t | s_t; θ) / ∂θ · u_t.
Train policy network using policy gradient

Repeat the following:

• Two policy networks play a game to the end. (Player vs. opponent.)
• Get a trajectory: s_1, a_1, s_2, a_2, ⋯ , s_T, a_T.
• After the game ends, update the player's policy network:
  • The player's returns: u_1 = u_2 = ⋯ = u_T. (Either +1 or −1.)
  • Sum of approximate policy gradients: g = Σ_{t=1}^{T} ∂ log π(a_t | s_t; θ) / ∂θ · u_t.
  • Update the policy network: θ ← θ + β · g.
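
A sketch of this self-play update in PyTorch, again reusing the illustrative PolicyNet. Gradient ascent on Σ_t log π(a_t | s_t; θ) · u_t is implemented as gradient descent on its negative.

import torch

def policy_gradient_update(policy_net, optimizer, states, actions, u):
    """REINFORCE-style update after one self-play game.

    states:  tensor (T, planes, 19, 19) -- the player's observed states
    actions: long tensor (T,) of the action indices the player took
    u:       +1.0 if the player won, -1.0 if it lost (same return at every step)
    """
    probs = policy_net(states)                                          # (T, 361)
    log_probs = torch.log(probs.gather(1, actions.view(-1, 1)).squeeze(1) + 1e-12)
    # Ascent on sum_t log pi(a_t|s_t;theta) * u_t  ==  descent on its negative.
    loss = -(log_probs * u).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage (illustrative):
# net, opt = PolicyNet(), torch.optim.SGD(net.parameters(), lr=1e-4)
# policy_gradient_update(net, opt, torch.zeros(30, 2, 19, 19),
#                        torch.randint(0, 361, (30,)), u=+1.0)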
Play Go using the policy network

• The policy network π has been learned.
• Observing the current state s_t, randomly sample an action a_t ∼ π(· | s_t; θ).

• The learned policy network π is strong, but not strong enough.
• A small mistake may change the game result.
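
Sampling a move from the learned policy, as a small sketch; the legal_mask argument, marking vacant points, is an assumed input.

import torch

def sample_action(policy_net, s_t, legal_mask):
    """Draw a_t ~ pi(. | s_t; theta), restricted to legal moves."""
    with torch.no_grad():
        p = policy_net(s_t).squeeze(0)          # (361,)
    p = p * legal_mask.float()                  # zero out occupied points
    p = p / p.sum()                             # renormalize over legal moves
    return torch.multinomial(p, num_samples=1).item()

# Usage (illustrative):
# a_t = sample_action(net, torch.zeros(1, 2, 19, 19),
#                     torch.ones(361, dtype=torch.bool))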
Train the Value Network
State-Value Function

Definition: state-value function.

• V_π(s) = E[U_t | S_t = s], where U_t = +1 (if win) and −1 (if lose).
• The expectation is taken with respect to
  • the future actions A_t, A_{t+1}, ⋯ , A_T, and
  • the future states S_{t+1}, S_{t+2}, ⋯ , S_T.

Approximate the state-value function using a value network.

• Use a neural network v(s; w) to approximate V_π(s).
• It evaluates how good the current situation is.
Policy & Value Network (AlphaGo Zero)

[Diagram: state (19×19×17 tensor) → shared Conv layers →
  Policy head: π(1 | s; θ), ⋯ , π(361 | s; θ);
  Value head: state value (a scalar).]
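
A minimal two-head sketch of the network in this diagram. The trunk depth and layer widths are illustrative assumptions; the real AlphaGo Zero network is a deep residual network.

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared conv trunk with a policy head (361 probs) and a value head (scalar)."""
    def __init__(self, in_planes=17):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Linear(64 * 19 * 19, 361)
        self.value_head = nn.Sequential(nn.Linear(64 * 19 * 19, 64), nn.ReLU(),
                                        nn.Linear(64, 1), nn.Tanh())

    def forward(self, s):
        h = self.trunk(s).flatten(start_dim=1)
        pi = torch.softmax(self.policy_head(h), dim=-1)   # distribution over 361 moves
        v = self.value_head(h).squeeze(-1)                # value in (-1, 1)
        return pi, v

# pi, v = PolicyValueNet()(torch.zeros(1, 17, 19, 19))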
Train the value network

After finishing training the policy network, train the value network.

Repeat the following:

1. Play a game to the end.
  • If win, let u_1 = u_2 = ⋯ = u_T = +1.
  • If lose, let u_1 = u_2 = ⋯ = u_T = −1.
2. Loss: L = Σ_{t=1}^{T} ½ · (v(s_t; w) − u_t)².
3. Update: w ← w − α · ∂L/∂w.
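
One regression update matching the loss L = Σ_t ½(v(s_t; w) − u_t)² above, as a sketch. It assumes a value_net mapping a batch of states to scalar values and an assumed optimizer.

import torch

def value_update(value_net, optimizer, states, u):
    """states: (T, planes, 19, 19); u: +1.0 if the game was won, -1.0 if lost."""
    v = value_net(states).squeeze(-1)            # (T,) predicted values
    target = torch.full_like(v, u)               # u_1 = ... = u_T = u
    loss = 0.5 * ((v - target) ** 2).sum()       # L = sum_t 1/2 (v(s_t;w) - u_t)^2
    optimizer.zero_grad()
    loss.backward()                              # computes dL/dw
    optimizer.step()                             # w <- w - alpha * dL/dw
    return loss.item()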
Monte Carlo Tree Search
How do humans play Go?

Players must look ahead two or more steps.
• Suppose I choose action a_t.
• What will be my opponent's action? (His action leads to state s_{t+1}.)
• What will be my action a_{t+1} upon observing s_{t+1}?
• What will be my opponent's action? (His action leads to state s_{t+2}.)
• What will be my action a_{t+2} upon observing s_{t+2}?


• If you can exhaustively foresee all the possibilities, you will win.
• Strange: I went forward in time... to view alternate futures. To see all the possible outcomes of the coming conflict.
• Quill: How many did you see?
• Strange: Fourteen million six hundred and five.

• Stark: How many did we win?


• Strange: … One.
Select actions by look-ahead search

Main idea:
• Randomly select an action a.
• Look ahead and see whether a leads to a win or a loss.
• Repeat this procedure many times.
• Choose the action a that has the highest score.

[Figure (from the AlphaGo paper): the search tree alternates the player's actions and the opponent's actions. In the evaluation phase, a leaf is evaluated in two ways: using the value network v_θ, and by running a rollout to the end of the game with the fast rollout policy p_π. The backup phase propagates the results up the tree.]
Monte Carlo Tree Search (MCTS)

Every simulation of Monte Carlo Tree Search (MCTS) has 4 steps:

1. Selection: The player makes an action a. (An imaginary action, not an actual move.)
2. Expansion: The opponent makes an action; the state updates. (Also an imaginary action, made by the policy network.)
3. Evaluation: Evaluate the state value to get a score v, and play the game to the end to receive a reward r. Assign the score (v + r)/2 to action a.
4. Backup: Use the scores (v + r)/2 to update the action-values.
Step 1: Selection

Question: Observing the root state s_t, which action shall we explore?

First, for every valid action a, calculate the score:
  score(a) = Q(a) + η · π(a | s_t; θ) / (1 + N(a)).

• Q(a): the action-value computed by MCTS. (To be defined in Step 4.)
• π(a | s_t; θ): the learned policy network.
• N(a): given s_t, how many times we have selected a so far.

Second, the action with the highest score is selected.
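
The selection rule as a small numpy sketch; the arrays Q, N, and pi would be maintained by the search, and η is a tunable exploration weight.

import numpy as np

def select_action(Q, N, pi, legal, eta=1.0):
    """score(a) = Q(a) + eta * pi(a|s_t;theta) / (1 + N(a)); pick the best legal a.

    Q, N, pi, legal: arrays of length 361 (action-values, visit counts,
    policy-network probabilities, and a boolean mask of valid actions).
    """
    score = Q + eta * pi / (1.0 + N)
    score = np.where(legal, score, -np.inf)   # never select an invalid move
    return int(np.argmax(score))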


Step 2: Expansion

Question: What will be the opponent's action?

• Given a_t, the opponent's action a_t′ will lead to the new state s_{t+1}.
• The opponent's action is randomly sampled from a_t′ ∼ π(· | s_t′; θ).
  (In the figure, the two candidate replies have probabilities 0.4 and 0.6.)
  Here, s_t′ is the state observed by the opponent.
• The state-transition probability p(s_{t+1} | s_t, a_t) is unknown.
• Use the policy function π as the state-transition function p.
Step 3: Evaluation

Run a fast rollout to the end of the game (step T):
• Player's action: a_k ∼ π(· | s_k; θ).
• Opponent's action: a_k′ ∼ π(· | s_k′; θ).
• Receive the reward r_T at the end:
  • Win: r_T = +1.
  • Lose: r_T = −1.

Evaluate the state s_{t+1}:
• v(s_{t+1}; w): the output of the value network.

Record the value
  V(s_{t+1}) = ½ · v(s_{t+1}; w) + ½ · r_T.


Step 4: Backup

• MCTS repeats such simulations many times.
• Each child of a_t thus accumulates multiple recorded values V(s_{t+1}).
• Update the action-value:
  Q(a_t) = mean(the recorded V values).
• The Q values will be used in Step 1 (selection).
Revisit Step 1: Selection

First, for every valid action a, calculate the score:
  score(a) = Q(a) + η · π(a | s_t; θ) / (1 + N(a)).

Second, the action with the highest score is selected.


Decision Making after MCTS

• N(a): how many times a has been selected during the search.
• After MCTS, the player makes the actual decision:
  a_t = argmax_a N(a).
MCTS: Summary

• MCTS has 4 steps: selection, expansion, evaluation, and backup.
• To perform one action, AlphaGo repeats the 4 steps many times to calculate Q(a) and N(a) for every action a.
• AlphaGo executes the action a with the highest N(a) value.
• To perform the next action, AlphaGo does MCTS all over again. (Initialize Q(a) and N(a) to zeros.)
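
Putting the four steps together: a heavily simplified, one-ply sketch of the search loop described above. The callables policy, value, legal_moves, opponent_step, and rollout_to_end are assumptions standing in for the trained networks and the game simulator; the real AlphaGo search builds a deep tree and batches its network evaluations.

import numpy as np

def mcts_decide(s_t, policy, value, legal_moves, opponent_step, rollout_to_end,
                num_simulations=1600, eta=1.0):
    """Run MCTS at the root state s_t and return the move to actually play.

    policy(s)           -> length-361 array of move probabilities (policy network)
    value(s)            -> scalar in [-1, 1] (value network)
    legal_moves(s)      -> boolean array of length 361
    opponent_step(s, a) -> next state s_{t+1} after our imaginary move a and the
                           opponent's sampled reply (Step 2: expansion)
    rollout_to_end(s)   -> +1 / -1, result of a fast rollout from s (Step 3)
    """
    pi = policy(s_t)
    legal = legal_moves(s_t)
    Q = np.zeros(361)                 # mean of recorded V values per action
    N = np.zeros(361)                 # visit counts per action
    records = [[] for _ in range(361)]

    for _ in range(num_simulations):
        # Step 1: selection -- highest score among valid actions.
        score = np.where(legal, Q + eta * pi / (1.0 + N), -np.inf)
        a = int(np.argmax(score))
        # Step 2: expansion -- the opponent replies, giving the new state.
        s_next = opponent_step(s_t, a)
        # Step 3: evaluation -- average the value network and a fast rollout.
        V = 0.5 * value(s_next) + 0.5 * rollout_to_end(s_next)
        # Step 4: backup -- update the records, Q(a), and N(a).
        records[a].append(V)
        N[a] += 1
        Q[a] = np.mean(records[a])

    # Actual decision: the most-visited action.
    return int(np.argmax(N))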
Summary
Training and Execution

Training in 3 steps:

1. Train a policy network using behavior cloning.
2. Further train the policy network using the policy gradient algorithm.
3. Train a value network.

Execution (actually playing Go games):

• Select the "best" action by Monte Carlo Tree Search.
• To perform one action, AlphaGo repeats the MCTS simulation many times.
AlphaGo Zero
AlphaGo Zero vs. AlphaGo

• AlphaGo Zero is stronger than AlphaGo. (It won 100-0 against AlphaGo.)
• Differences:
  • AlphaGo Zero does not use human experience. (No behavior cloning.)
  • MCTS is used to train the policy network.
Is behavior cloning useless?

• AlphaGo Zero does not use human experience. (No behavior cloning.)
  • For the game of Go, human experience turned out to be harmful.
• In general, is behavior cloning useless?
  • What if a surgery robot (randomly initialized) were trained purely by performing surgery, without using any human experience?
Training of the policy network (AlphaGo Zero)

AlphaGo Zero uses MCTS during training. (AlphaGo does not.)

1. Observe state s_t.
2. Predictions made by the policy network:
   p = [π(a = 1 | s_t; θ), ⋯ , π(a = 361 | s_t; θ)] ∈ ℝ^361.
3. Predictions made by MCTS:
   n = normalize([N(a = 1), N(a = 2), ⋯ , N(a = 361)]) ∈ ℝ^361.
4. Loss: L = CrossEntropy(n, p).
5. Use ∂L/∂θ to update θ.
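
A sketch of one such update, matching steps 1-5 above. The visit counts would come from the MCTS routine, and the network and optimizer objects are the assumed ones from the earlier sketches.

import torch

def alphago_zero_policy_update(policy_net, optimizer, s_t, visit_counts):
    """Pull the network's prediction p toward the MCTS visit distribution n."""
    n = visit_counts / visit_counts.sum()          # step 3: normalize N(a)
    p = policy_net(s_t).squeeze(0)                 # step 2: network prediction, (361,)
    loss = -(n * torch.log(p + 1e-12)).sum()       # step 4: CrossEntropy(n, p)
    optimizer.zero_grad()
    loss.backward()                                # step 5: dL/dtheta
    optimizer.step()
    return loss.item()

# Usage (illustrative):
# counts = torch.rand(361)
# alphago_zero_policy_update(net, opt, torch.zeros(1, 2, 19, 19), counts)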
Reference

• AlphaGo:
  • Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
• AlphaGo Zero:
  • Silver et al. Mastering the game of Go without human knowledge. Nature, 2017.
