AlphaGo
Shusen Wang
Disclaimer
• What is taught in this lecture is not exactly the same as in the original AlphaGo papers [1, 2].
• The lecture makes simplifications.
References
1. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
2. Silver et al. Mastering the game of Go without human knowledge. Nature, 2017.
Go Game
• The board has 19×19 = 361 positions; each move places a stone on one of them.

Policy Network
[Figure: Policy network 𝜋(a | s; 𝛉). Conv layers followed by a Dense layer map the state s to the move probabilities 𝜋(1 | s; 𝛉), 𝜋(2 | s; 𝛉), ⋯, 𝜋(361 | s; 𝛉).]
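The figure can be read as a small neural network. Below is a minimal, illustrative PyTorch sketch; the layer widths, the 17-plane input encoding (matching the 19×19×17 state tensor mentioned later), and the name `PolicyNet` are assumptions for illustration, not the original AlphaGo architecture:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Sketch of a policy network pi(a | s; theta).

    Input:  state feature, a 17x19x19 tensor (channels first).
    Output: probabilities over the 361 board positions.
    Layer sizes are illustrative, not AlphaGo's real architecture.
    """
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(17, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.dense = nn.Linear(64 * 19 * 19, 361)

    def forward(self, s):                        # s: (batch, 17, 19, 19)
        h = self.conv(s).flatten(start_dim=1)    # (batch, 64*19*19)
        return torch.softmax(self.dense(h), dim=1)  # (batch, 361)
```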
[Figure: Board diagrams of the five formal games between Fan Hui and AlphaGo (Games 1 to 5), with the move numbers played. AlphaGo won all five games, e.g., by 2.5 points or by resignation.]

Learning from Human Records
• If two randomly initialized policy networks play against each other, it would take very long before they make reasonable actions.
• Instead, start from recorded human games, in which every move is recorded. (The KGS dataset has 160K games' records.)
• Behavior cloning: let the policy network imitate the recorded human moves.
• After behavior cloning, the policy network beats advanced amateur players.
Behavior Cloning
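Behavior cloning here is ordinary supervised learning: a board state from a recorded human game is the input, and the move the human actually played (one of 361 positions) is the classification label. A minimal sketch of one training step, reusing the illustrative `PolicyNet` above (`states` and `human_moves` are assumed placeholder tensors):

```python
import torch
import torch.nn.functional as F

policy_net = PolicyNet()   # the illustrative network sketched earlier
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)

def behavior_cloning_step(states, human_moves):
    """One supervised step: push pi(a | s; theta) toward the human's move.

    states:      (batch, 17, 19, 19) board features from human games
    human_moves: (batch,) move indices in 0..360 played by the humans
    """
    probs = policy_net(states)                          # (batch, 361)
    # Cross-entropy between the predicted distribution and the human move.
    loss = F.nll_loss(torch.log(probs + 1e-12), human_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```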
Reinforcement learning of the policy network
[Figure: Self-play. Player (agent): the policy network with the latest parameters, vs. opponent (environment): the policy network with old parameters.]
Reinforcement learning is guided by rewards.
• Two policy networks play a game to the end (player vs. opponent).
• Get a trajectory: s_1, a_1, s_2, a_2, ⋯, s_T, a_T.
• After the game ends, update the player's policy network.
• The player's returns: u_1 = u_2 = ⋯ = u_T (either +1 or −1).
• Sum of approximate policy gradients: g = ∑_{t=1}^{T} [∂ log 𝜋(a_t | s_t; 𝛉) / ∂𝛉] · u_t.
• Update the policy network: 𝛉 ← 𝛉 + 𝛽 · g.
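A minimal sketch of this update, continuing the illustrative network above: after one self-play game, every (s_t, a_t) in the player's trajectory is weighted by the single return u (+1 or −1), and minimizing the negated sum realizes the gradient-ascent step 𝛉 ← 𝛉 + 𝛽 · g:

```python
def policy_gradient_update(trajectory, u, beta=1e-4):
    """REINFORCE-style update of the player's policy network.

    trajectory: [(s_1, a_1), ..., (s_T, a_T)] from one self-play game
    u:          game outcome for the player: +1 (win) or -1 (lose)
    """
    optimizer = torch.optim.SGD(policy_net.parameters(), lr=beta)
    loss = 0.0
    for s, a in trajectory:
        probs = policy_net(s.unsqueeze(0))            # (1, 361)
        # -log pi(a_t | s_t; theta) * u_t: minimizing this performs
        # gradient ascent on sum_t log pi(a_t | s_t; theta) * u_t.
        loss = loss - torch.log(probs[0, a] + 1e-12) * u
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```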
Play Go using the policy network
• At each step t, sample the move from the policy network: a_t ∼ 𝜋(· | s_t; 𝛉).
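A minimal sketch of this sampling step (again with the illustrative `policy_net` from earlier):

```python
def play_move(s):
    """Sample the next move a_t ~ pi(. | s_t; theta)."""
    with torch.no_grad():
        probs = policy_net(s.unsqueeze(0))[0]   # (361,)
    return torch.multinomial(probs, num_samples=1).item()
```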
Value Network
[Figure: Value network v(s; w). The state feature (a 19×19×17 tensor) passes through Conv layers and a value head, producing the state value (a scalar).]
Train the value network
After finishing training the policy network, train the value network.
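Training the value network is a regression problem: label every state of a finished self-play game with the final outcome u (+1 or −1) and minimize the squared error. A minimal sketch; the `ValueNet` layer sizes are assumptions for illustration, not the original architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueNet(nn.Module):
    """Sketch of the value network: 19x19x17 state feature -> scalar value."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(17, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),       # value in [-1, +1]
        )

    def forward(self, s):                        # s: (batch, 17, 19, 19)
        h = self.conv(s).flatten(start_dim=1)
        return self.head(h).squeeze(1)           # (batch,)

value_net = ValueNet()
v_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-4)

def value_update(states, u):
    """Regress v(s_t; w) toward the game outcome u for every visited s_t."""
    target = torch.full((states.shape[0],), float(u))
    loss = F.mse_loss(value_net(states), target)
    v_optimizer.zero_grad()
    loss.backward()
    v_optimizer.step()
    return loss.item()
```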
• If you can exhaustively foresee all the possibilities, you will win.
• Strange: I went forward in time... to view alternate futures. To see all the possible outcomes of the coming conflict.
• Quill: How many did you see?
• Strange: Fourteen million six hundred and five.
As the AlphaGo paper [1] puts it, each leaf node "is evaluated in two ways: using the value network vθ; and by running a rollout to the end of the game with the fast rollout policy pπ".
Monte Carlo Tree Search (MCTS)
Every simulation of Monte Carlo Tree Search (MCTS) has 4 steps:
• Step 1: Selection.
• Step 2: Expansion.
• Step 3: Evaluation.
• Step 4: Backup.
Step 3: Evaluation
[Figure: Rollout from state s_t. After action a_t, the player's and the opponent's actions alternate (s_{t+1}, a_{t+1}, s_{t+2}, ⋯) until the terminal state s_T; a fast rollout policy plays these moves.]
• Run a rollout to the end of the game (step T).
• Player's action: a_k ∼ 𝜋(· | s_k; 𝛉).
• Opponent's action: a_k′ ∼ 𝜋(· | s_k′; 𝛉).
• Receive reward r_T at the end: win means r_T = +1, lose means r_T = −1.
• Record the value of the new state by combining the value network and the rollout outcome: V(s_{t+1}) = ½ · v(s_{t+1}; w) + ½ · r_T.
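Putting Step 3 in code: the new state is scored by both the value network and the rollout outcome, and the two scores are averaged. A minimal sketch; `fast_rollout` is an assumed placeholder for playing the game to the end with the fast rollout policy:

```python
def evaluate(s_next):
    """Step 3: record V(s_{t+1}) = 1/2 * v(s_{t+1}; w) + 1/2 * r_T."""
    with torch.no_grad():
        v = value_net(s_next.unsqueeze(0)).item()  # value network's estimate
    r_T = fast_rollout(s_next)  # assumed helper: +1 if the player wins, else -1
    return 0.5 * v + 0.5 * r_T
```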
Step 4: Backup
[Figure: Each child of a_t keeps a list of recorded values, e.g., V₁⁽¹⁾, V₂⁽¹⁾, V₃⁽¹⁾, ⋯ for one child and V₁⁽²⁾, V₂⁽²⁾, V₃⁽²⁾, ⋯ for another.]
• MCTS repeats such a simulation many times.
• Each child of a_t has multiple recorded values V(s_{t+1}).
• Update the action value: Q(a_t) = mean of the recorded V(s_{t+1}) values.
• The Q values will be used in Step 1 (selection).
Decision making after MCTS
[Figure: Each candidate action a⁽¹⁾, a⁽²⁾, a⁽³⁾, ⋯ has an action value Q(a) and a visit count N(a).]
• N(a): how many times action a has been selected so far.
• After MCTS, the player makes the actual decision: a_t = argmax_a N(a).
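The bookkeeping in Steps 1 and 4 fits in a few dictionaries: each simulation appends one recorded V(s_{t+1}) to the chosen child, Q(a) is the mean of a child's records, and the final decision is the most-visited action. A minimal sketch; the selection score Q(a) + η · 𝜋(a | s; 𝛉) / (1 + N(a)) is a simplified PUCT-style rule, and η, `prior`, and `opponent_responds` are assumptions for illustration:

```python
from collections import defaultdict

records = defaultdict(list)  # records[a]: all recorded V(s_{t+1}) for child a
N = defaultdict(int)         # N[a]: how many times action a has been selected

def Q(a):
    """Step 4 (backup): Q(a) is the mean of the recorded values."""
    return sum(records[a]) / len(records[a]) if records[a] else 0.0

def select(prior, eta=5.0):
    """Step 1 (selection): trade off Q, the policy prior, and visit counts."""
    return max(range(361), key=lambda a: Q(a) + eta * prior[a] / (1 + N[a]))

def simulate(s, prior):
    """One MCTS simulation: select, (expand,) evaluate, back up."""
    a = select(prior)
    N[a] += 1
    s_next = opponent_responds(s, a)      # assumed helper: opponent's reply
    records[a].append(evaluate(s_next))   # evaluate() from the Step 3 sketch

def decide():
    """After many simulations, play a_t = argmax_a N(a)."""
    return max(N, key=N.get)
```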
MCTS: Summary
• For each move, MCTS runs many simulations (Steps 1 to 4), then plays the most frequently selected action.

Training in 3 steps:
1. Behavior cloning: initialize the policy network by imitating humans' recorded games.
2. Reinforcement learning: improve the policy network by self-play.
3. Train the value network (after the policy network is trained).
AlphaGo Zero
• AlphaGo Zero does not use human experience. (No behavior cloning.)
• For the game of Go, human experience turned out to be harmful: AlphaGo Zero, trained without it, outperformed the original AlphaGo.
References
• AlphaGo: Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
• AlphaGo Zero: Silver et al. Mastering the game of Go without human knowledge. Nature, 2017.