

Chapter 1

Progressive Strategies for
Monte-Carlo Tree Search

G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk and H.J. van den Herik
MICC-IKAT Games and AI Group, Faculty of Humanities and Sciences, Universiteit Maastricht, P.O. Box 616, 6200 MD Maastricht, The Netherlands. Emails: {g.chaslot, m.winands, uiterwijk, herik}@micc.unimaas.nl

B. Bouzy
CRIP5, Centre de Recherche en Informatique de Paris 5, Université Paris 5 Descartes, 45, rue des Saints Pères, 75270 Cedex 06, France. Email: bouzy@math-info.univ-paris5.fr

1.1 Introduction

Two-person zero-sum games with perfect information have been addressed by many AI researchers with great success for fifty years [van den Herik et al. (2002)]. The classical approach is to use the alpha-beta framework combined with a dedicated static evaluation function. This evaluation function is applied to the leaf nodes of a tree. If the node represents a terminal position (or a position stored in a database) it produces an exact value. Otherwise heuristic knowledge is used to estimate the value of the leaf node. This technique led to excellent results in many games (e.g., Chess and Checkers).
However, building an evaluation function based on heuristic knowledge for a non-terminal position is a difficult and time-consuming task in several games; the most notorious example is the game of Go [Bouzy and Cazenave (2001)]. This is probably one of the reasons why Go programs have so far not achieved a strong level, despite intensive research and the additional use of knowledge-based methods.
Recently, Bouzy and Helmstetter (2003) proposed to use Monte-Carlo
simulations as an evaluation function. Monte-Carlo simulations are basi-
cally fast self-play games. Nevertheless, this approach remained too slow to
achieve a satisfying search depth. More recently, a different use of Monte-
Carlo simulations within a tree-search context has been proposed [Chaslot
et al. (2006); Kocsis and Szepesvári (2006); Coulom (2007)]. The method,
which we call “Monte-Carlo Tree Search”, is not a classical tree search fol-
lowed by a Monte-Carlo evaluation, but rather a best-first search guided
by the results of previous Monte-Carlo simulations. This method uses two main strategies, which serve different purposes. (1) A simulation strategy
decides on the moves played in the Monte-Carlo simulations [Bouzy (2005);
Gelly and Wang (2006)]. (2) A selection strategy, derived from the Multi-
Armed Bandit problem, is able to increase the quality of the move-selection
when the number of simulations grows [Gelly and Wang (2006); Coquelin
and Munos (2007)]. In this chapter, we propose to develop a soft transition between the simulation strategy and the selection strategy. The resulting “progressive strategies” for Monte-Carlo Tree Search enable, among other things, the application of time-consuming heuristics to promising positions.
This chapter is organized as follows. In Section 1.2, we present the Monte-Carlo Tree Search method. In Section 1.3 we describe progressive strategies. Section 1.4 presents the experiments, performed with our Go program Mango. Section 1.5 summarizes the contributions of this chapter and gives an outlook on future research.

1.2 Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) is a best-first search algorithm which does not need game-dependent heuristic knowledge. Its aim is to select the best move by exploring the search space pseudo-randomly. At the beginning of the search, exploration is random. Using the results of previous explorations, the algorithm becomes able to predict the most promising moves more accurately, and thus exploration becomes narrower. The basic structure of MCTS is given below.


1.2.1 Structure of MCTS


In MCTS, a node i contains at least the following two variables: (1) the value v_i (usually the average of the results of the simulated games that visited this node), and (2) the visit count n_i. MCTS usually starts with a tree containing only the root node.
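As a concrete illustration (not code from the chapter), such a node might be sketched in Python as follows; the class and field names are our own, and the heuristic_value field anticipates the progressive strategies of Section 1.3.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Minimal MCTS node holding the value v_i, the visit count n_i, and tree links."""
    move: Optional[object] = None            # move leading from the parent to this node
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0                          # n_i: simulated games that visited this node
    total_result: float = 0.0                # sum of the results (+1 / -1) of those games
    heuristic_value: float = 0.0             # H_B, used later by the progressive strategies

    @property
    def value(self) -> float:
        """v_i: average result of the simulated games that visited this node."""
        return self.total_result / self.visits if self.visits else 0.0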
Monte-Carlo Tree Search consists of four steps, repeated as long as there
is time left. (1) The tree is traversed from the root node to a leaf node (L),
using a selection strategy. (2) An expansion strategy is called to store one
(or more) children of L in the tree. (3) A simulation strategy plays moves
in self-play until the end of the game is reached. (4) The result R of this
“simulated” game is +1 in case of a win for Black (the first player in Go),
and −1 in case of a win for White. R is backpropagated in the tree according
to a backpropagation strategy. This mechanism is outlined in Figure 1.1.
The four steps of MCTS are explained in more detail below.

[Figure 1.1 shows the four steps, repeated X times: in the Selection step, the selection function is applied recursively until a leaf node is reached; in the Expansion step, one or more nodes might be created; in the Simulation step, one simulated game is played; in the Backpropagation step, the result of this game is backpropagated in the tree.]

Fig. 1.1 Scheme of a Monte-Carlo Tree Search.

Selection is the strategic task that selects one of the children of a given
node. It controls the balance between exploitation and exploration. On the
one hand, the task is often to select the move that leads to the best results
(exploitation). On the other hand, the least promising moves still have to be
explored, due to the uncertainty of the evaluation (exploration). This prob-
lem is similar to the Multi-Armed Bandit problem [Coquelin and Munos
(2007)]. As an example, we mention here the strategy UCT [Kocsis and Szepesvári (2006)] (UCT stands for Upper Confidence bounds applied to Trees). This strategy is easy to implement and is used in many programs.


The essence is choosing the move i which maximizes Formula 1.1:

    $v_i + C \times \sqrt{\frac{\ln N}{n_i}}$    (1.1)

where v_i is the value of the node i, n_i is the visit count of i, and N is the visit count of the parent node of i. C is a coefficient, which has to be tuned experimentally.
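As an illustration only, a UCT selection step following Formula 1.1 might look as below, reusing the Node sketch from Subsection 1.2.1; the value of C and the handling of unvisited children are assumptions, not taken from the chapter.

import math

def uct_select(parent: "Node", C: float = 0.7) -> "Node":
    """Return the child maximizing v_i + C * sqrt(ln N / n_i) (Formula 1.1)."""
    N = parent.visits

    def uct_score(child: "Node") -> float:
        if child.visits == 0:
            # Try every stored child at least once before trusting the formula.
            return float("inf")
        return child.value + C * math.sqrt(math.log(N) / child.visits)

    return max(parent.children, key=uct_score)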
Expansion is the strategic task that, for a given leaf node L, decides
whether this node will be expanded by storing some of its children in mem-
ory. The simplest rule, proposed by [Coulom (2007)], is to expand one
node per simulation. The node expanded corresponds to the first position
encountered that was not stored yet.
Simulation (also called playout) is the strategic task that selects moves
in self-play until the end of the game. This task might consist of playing
plain random moves or – better – pseudo-random moves. Indeed, the use
of an adequate simulation strategy has been shown to improve the level of
play significantly [Bouzy (2005); Gelly et al. (2006)]. The main idea is to
play better moves by using patterns, capture considerations, and proximity
to the last move.
Backpropagation is the procedure which backpropagates the result of a
simulated game (win/loss) to the nodes it had to traverse to reach the leaf.
The value v_i of a node is computed by taking the average of the results of
all simulated games made through this node.
Finally, the move played by the program is the child of the root with
the highest visit count.
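Putting the four steps and the final move choice together, a bare-bones sketch of the MCTS loop could look as follows. The game-state methods (copy, play, legal_moves, is_terminal, random_playout_result) are placeholders for whatever the actual program provides; only the control flow follows the description above, and Node is the sketch from Subsection 1.2.1.

def mcts(root_state, root: "Node", n_simulations: int, select_child) -> object:
    """Run MCTS and return the move to play (the most-visited child of the root)."""
    for _ in range(n_simulations):
        node, state = root, root_state.copy()

        # (1) Selection: descend through the part of the tree stored in memory.
        while node.children:
            node = select_child(node)
            state.play(node.move)

        # (2) Expansion: store the children of the first position not yet in the tree.
        if not state.is_terminal():
            node.children = [Node(move=m, parent=node) for m in state.legal_moves()]
            node = node.children[0]
            state.play(node.move)

        # (3) Simulation: play moves in self-play until the end of the game.
        result = state.random_playout_result()   # +1 for a Black win, -1 for a White win

        # (4) Backpropagation: update every node on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_result += result
            node = node.parent

    return max(root.children, key=lambda c: c.visits).move

Here select_child would be the UCT rule of Formula 1.1, or one of the progressive selection rules described in Section 1.3.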

1.3 Progressive Strategies

When a node has been visited a few times, a well-tuned simulation strategy
chooses moves more accurately than a selection strategy. However, when a
node has been visited quite often, the selection strategy is more accurate, because it improves with the number of games played [Chaslot et al.
(2006); Kocsis and Szepesvári (2006); Coquelin and Munos (2007); Coulom
(2007)].
In the current programs, there is no transition phase between the simu-
lation strategy and the selection strategy. For instance, in Rémi Coulom’s
program CrazyStone the simulation strategy is applied in a node if it
has been visited fewer than 81 times. Otherwise, the selection strategy is
applied [Coulom (2007)].


We propose to develop a strategy that performs a soft transition between the simulation strategy and the selection strategy. Such a strategy uses (i) the heuristic knowledge available for the simulation strategy, (ii) the information available for the selection strategy, and (iii) time-expensive domain knowledge. “Progressive strategies” are similar to a simulation strategy when few games have been played, and converge to a selection strategy when numerous games have been played.

In the following two subsections we describe the two progressive strategies used in Mango, our Go-playing program: Progressive Bias (Subsection 1.3.1) and Progressive Unpruning (Subsection 1.3.2). Subsection 1.3.3 describes how to implement them efficiently with the use of time-expensive domain knowledge.

1.3.1 Progressive Bias


The aim of this strategy is to direct the search according to time-expensive heuristic knowledge. For that purpose, the selection strategy is modified according to this knowledge. The influence of this modification may be important when few games have been played, but should decrease quickly to ensure that the strategy converges to a selection strategy. We modified the UCT selection in the following way. Instead of selecting the move which maximizes Formula 1.1, we select the move which maximizes Formula 1.2:

    $v_i + C \times \sqrt{\frac{\ln N}{n_i}} + f(n_i)$    (1.2)

In Mango, we chose experimentally f(n_i) = H_B / n_i, where H_B is a coefficient representing heuristic knowledge, which depends only on the board configuration B. N, C, and n_i are the same as in Section 1.2. We give more details on the construction of H_B in Subsection 1.3.3.
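A sketch of a selection step with Progressive Bias, along the lines of Formula 1.2 and the choice f(n_i) = H_B / n_i, is given below; as in the earlier sketch, the constant C and the treatment of unvisited children are illustrative assumptions, and heuristic_value stands for the cached knowledge term H_B of Subsection 1.3.3.

import math

def progressive_bias_select(parent: "Node", C: float = 0.7) -> "Node":
    """Return the child maximizing v_i + C * sqrt(ln N / n_i) + H_B / n_i (Formula 1.2)."""
    N = parent.visits

    def score(child: "Node") -> float:
        if child.visits == 0:
            return float("inf")
        exploration = C * math.sqrt(math.log(N) / child.visits)
        bias = child.heuristic_value / child.visits   # f(n_i) = H_B / n_i, fades as n_i grows
        return child.value + exploration + bias

    return max(parent.children, key=score)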

1.3.2 Progressive Unpruning


We have seen in Mango that when the time available is low and simultaneously the branching factor is high, MCTS performs poorly. Our solution consists in (1) reducing artificially the branching factor at the beginning of the search, and (2) increasing it progressively as more time becomes available. When a node i is created, all its children are “pruned”³ except a certain number U(0) of them. The children of the node i are progressively “unpruned”. In Mango, we use a function U(n_i) which gives the number of unpruned moves depending on the number n_i of games played through the node. The moves with the highest values of H_B are unpruned first.

³ A node is pruned if it cannot be accessed in the simulated games.

1.3.3 Using Time-Expensive Heuristics


The two previous strategies require the computation of a heuristic value H_B for a given board configuration B. In Mango, H_B is computed by using the pattern matching described in [Bouzy and Chaslot (2005)], in combination with the proximity to the last move. The exact formula uses coefficients tuned with simulated annealing. The time consumed to compute H_B is on the order of a millisecond, which is around 1000 times slower than playing
a move in a simulated game. To avoid a speed reduction in the program’s
performance, we perform the time-consuming pattern matching only once
per node, when a certain threshold of games has been played through this
node. This threshold was set to 10 simulated games in Mango. With this
setting, the speed of the program was only reduced by 4 percent. This
speed reduction is low because the number of nodes that have been visited
more than 10 times is low compared to the number of moves played in the
simulated games.
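One way to realize this lazy computation of H_B is sketched below; compute_heuristic stands in for Mango's pattern matcher and is not part of the chapter, while the threshold of 10 simulated games follows the text.

HEURISTIC_THRESHOLD = 10   # simulated games through a node before H_B is computed

def maybe_compute_heuristics(node: "Node", board, compute_heuristic) -> None:
    """Run the time-expensive pattern matching once per node, and only after the
    node has been visited HEURISTIC_THRESHOLD times."""
    if node.visits < HEURISTIC_THRESHOLD or getattr(node, "heuristics_done", False):
        return
    for child in node.children:
        # H_B depends only on the board configuration reached by the child's move.
        child.heuristic_value = compute_heuristic(board, child.move)
    node.heuristics_done = True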

1.4 Experiments

The main test bed for the experiments was 13 × 13 Go. This size is more interesting than 9 × 9 because the level of basic MCTS programs is lower there. We preferred 13 × 13 over 19 × 19 because games are played faster, and so MCTS programs can be analyzed more extensively. Our program was tested in online tournaments, against the latest version of GNU Go, and in self-play. Below we briefly describe three types of experiments.
Tournaments. Table 1.1 presents the results of Mango in the tournaments it entered in 2007. The progressive strategy was used in each of these tournaments, and constituted the main strength of Mango.

Table 1.1 Results of Mango in recent tournaments.

Tournament          Board Size   Participants   Mango's rank
KGS January 2007    13 × 13      10             2nd
KGS March 2007      19 × 19      12             4th
KGS April 2007      13 × 13      10             3rd
KGS May 2007        13 × 13      7              2nd

Mango against GNU Go. We tested the playing strength of our program against the latest version of GNU Go (3.7.10), at level 10, with a time setting of 30 minutes per player. Mango won 58 percent of the 200 games played when using Progressive Unpruning and Progressive Bias. Without these strategies, Mango won 25 percent of the 200 games.
Self-play experiment. For this experiment, we decreased the time to 10
seconds per move, in order to be able to play more games. The version of
Mango using Progressive Bias and Progressive Unpruning won 81 percent
of the 500 games played against Mango without progressive strategy.
Based on these experiments, we may conclude that progressive strate-
gies increase the level of play of our program Mango significantly.

1.5 Conclusion

In this chapter we described progressive strategies, which are a method of using heuristic knowledge in Monte-Carlo Tree Search. We presented two strategies to achieve this goal: Progressive Unpruning and Progressive Bias. Furthermore, we proposed a mechanism to avoid a speed reduction when these strategies are used with time-expensive domain heuristics.
From the experiments we may conclude that these techniques increase the
playing strength of our 13 × 13 Go program Mango significantly. In future
research, we will study in detail the influence of different parameters, such
as the speed of unpruning and the biasing formula. In particular, we want to
apply the work by Coquelin and Munos (2007) for designing the progressive
bias. We also want to enhance our domain knowledge by taking “groups”
and “life and death” into consideration.

Acknowledgments
This work is financed by the Dutch Organization for Scientific Research
(NWO) for the project Go for Go, grant number 612.066.409.


Bibliography

Bouzy, B. (2005). Associating Domain-Dependent Knowledge and Monte-Carlo Approaches within a Go Program, Information Sciences, Heuristic Search and Computer Game Playing IV 175, 4, pp. 247–257.
Bouzy, B. and Cazenave, T. (2001). Computer Go: An AI-oriented Survey, Arti-
ficial Intelligence 132, 1, pp. 39–103.
Bouzy, B. and Chaslot, G. (2005). Bayesian Generation and Integration of K-
nearest-neighbor Patterns for 19x19 Go, in G. Kendall and S. Lucas (eds.),
IEEE 2005 Symposium on Computational Intelligence in Games, Essex,
UK, pp. 176–181.
Bouzy, B. and Helmstetter, B. (2003). Monte-Carlo Go Developments, in H. J.
van den Herik, H. Iida and E. A. Heinz (eds.), Proceedings of the 10th
Advances in Computer Games Conference (ACG-10) (Kluwer Academic),
pp. 159–174.
Chaslot, G., Saito, J.-T., Bouzy, B., Uiterwijk, J. W. H. M. and Herik, H. J. van
den (2006). Monte-Carlo Strategies for Computer Go, in Proceedings of the
18th BeNeLux Conference on Artificial Intelligence, pp. 83–90.
Coquelin, P.-A. and Munos, R. (2007). Bandit Algorithm for Tree Search, Tech-
nical Report 6141, INRIA.
Coulom, R. (2007). Efficient Selectivity and Backup Operators in Monte-Carlo
Tree Search, in H. J. van den Herik, P. Ciancarini and H. H. L. M. Donkers
(eds.), Proceedings of the 5th Computers and Games Conference (CG 2006),
to be published in LNCS, Springer-Verlag, Berlin.
Gelly, S. and Wang, Y. (2006). Exploration Exploitation in Go: UCT for Monte-
Carlo Go, in Twentieth Annual Conference on Neural Information Process-
ing Systems (NIPS 2006).
Gelly, S., Wang, Y., Munos, R. and Teytaud, O. (2006). Modifications of UCT
with Patterns in Monte-Carlo Go, Technical Report 6062, INRIA.
Herik, H. J. van den, Uiterwijk, J. W. H. M. and Rijswijck, J. van (2002). Games
Solved: Now and in the Future, Artificial Intelligence 134, 1–2, pp. 277–
311.
Kocsis, L. and Szepesvári, C. (2006). Bandit Based Monte-Carlo Planning, in
J. Fürnkranz, T. Scheffer and M. Spiliopoulou (eds.), Machine Learning: ECML 2006, Lecture Notes in Artificial Intelligence 4212, pp. 282–293.
