The game of Go has long been viewed as the most challenging of classic games for artificial intelligence due to its enormous search space and the difficulty of evaluating board positions and moves. We introduce a new approach to computer Go that uses value networks to evaluate board positions and policy networks to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte-Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte-Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
AlphaGo combines the policy and value networks in an MCTS algorithm (Figure 3) that selects actions by lookahead search. Each edge (s, a) of the search tree stores an action value Q(s, a), visit count N(s, a), and prior probability P(s, a). The tree is traversed by simulation (i.e. descending the tree in complete games without backup), starting from the root state. At each time-step t of each simulation, an action a_t is selected from state s_t,

a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ),

so as to maximize the action value plus a bonus u(s, a) ∝ P(s, a) / (1 + N(s, a)) that is proportional to the prior probability but decays with repeated visits, encouraging exploration.
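As a concrete illustration, this selection step can be sketched as follows, using the exploration bonus u(s, a) ∝ P(s, a) / (1 + N(s, a)) described in the text. The `Edge` container and the constant name `c_puct` are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Edge:
    # Statistics stored on each edge (s, a) of the search tree.
    Q: float = 0.0   # mean action value Q(s, a)
    N: int = 0       # visit count N(s, a)
    P: float = 0.0   # prior probability P(s, a) from the policy network

def select_action(edges, c_puct=5.0):
    """Pick the action maximizing Q(s, a) + u(s, a), where the bonus
    u(s, a) = c_puct * P(s, a) / (1 + N(s, a)) decays with visits."""
    return max(edges.items(),
               key=lambda item: item[1].Q + c_puct * item[1].P / (1 + item[1].N))[0]

# A rarely visited move with a high prior can outrank a well-explored one:
edges = {0: Edge(Q=0.5, N=10, P=0.2), 1: Edge(Q=0.0, N=0, P=0.8)}
print(select_action(edges))  # -> 1 (bonus 4.0 beats 0.5 + ~0.09)
```

The decaying bonus is what shifts the search from exploring the policy network's suggestions toward exploiting the moves with the best empirical value.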
At the end of simulation n, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge,

N(s, a) = Σ_{i=1}^{n} 1(s, a, i),
Q(s, a) = (1 / N(s, a)) Σ_{i=1}^{n} 1(s, a, i) V(s_L^i),

where s_L^i is the leaf node from the i-th simulation, and 1(s, a, i) indicates whether edge (s, a) was traversed during the i-th simulation.
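A minimal sketch of this backup step, assuming each simulation records the edges it traversed together with the leaf evaluation V(s_L). The incremental-mean update is an illustrative but equivalent way of maintaining Q(s, a) as the mean of all leaf evaluations passing through the edge:

```python
class Edge:
    def __init__(self):
        self.N = 0    # visit count N(s, a)
        self.Q = 0.0  # mean evaluation Q(s, a)

def backup(path, leaf_value):
    """Update the visit count and mean evaluation of every edge
    traversed during one simulation.

    path: edges visited on the way to the leaf; leaf_value: V(s_L)."""
    for edge in path:
        edge.N += 1
        # Incremental mean: after n updates this equals
        # (1 / N) * sum of leaf values from simulations through the edge.
        edge.Q += (leaf_value - edge.Q) / edge.N
```

Updating a running mean in place avoids storing every simulation's leaf value per edge.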
Evaluating policy and value networks requires several orders of magnitude more computation than traditional search heuristics. To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1,202 CPUs and 176 GPUs. The Methods section provides full details of asynchronous and distributed MCTS.
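The division of labour above — CPU threads running simulations while network evaluations are batched onto accelerators — can be sketched with a request queue and a background evaluator. The stub `evaluate_batch`, the queue protocol, and the function names are illustrative stand-ins, not AlphaGo's actual pipeline (which the Methods section details):

```python
import queue
import threading

def evaluate_batch(positions):
    # Stand-in for a batched GPU forward pass of the policy/value
    # networks; here it just returns a dummy value per position.
    return [0.0 for _ in positions]

def evaluator_loop(requests, batch_size=8):
    """Background worker: drain pending evaluation requests, run them
    through the networks as one batch, and deliver each result."""
    while True:
        item = requests.get()
        if item is None:              # shutdown signal
            return
        batch = [item]
        while len(batch) < batch_size:
            try:
                nxt = requests.get_nowait()
            except queue.Empty:
                break
            if nxt is None:
                requests.put(None)    # re-post shutdown for the next pass
                break
            batch.append(nxt)
        values = evaluate_batch([pos for pos, _ in batch])
        for (_, reply), v in zip(batch, values):
            reply.put(v)              # hand the value back to the searcher

def run_simulation(requests, position):
    """Search thread: descend the tree on CPU (elided here), then queue
    the leaf position for network evaluation and wait for its value."""
    reply = queue.Queue(maxsize=1)
    requests.put((position, reply))
    return reply.get()
```

Batching amortizes the per-call overhead of the networks across many concurrent simulations, which is what makes interleaving slow neural evaluations with fast tree traversal practical.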