

Chapter 1

Progressive Strategies for
Monte-Carlo Tree Search

G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk and H.J. van den Herik
MICC-IKAT Games and AI Group, Faculty of Humanities and Sciences, Universiteit Maastricht, P.O. Box 616, 6200 MD Maastricht, The Netherlands. Emails: {g.chaslot, m.winands, uiterwijk, herik}@micc.unimaas.nl

B. Bouzy
CRIP5, Centre de Recherche en Informatique de Paris 5, Université Paris 5 Descartes, 45, rue des Saints Pères, 75270 Cedex 06, France. Email: bouzy@math-info.univ-paris5.fr

1.1 Introduction

Two-person zero-sum games with perfect information have been addressed by many AI researchers with great success for fifty years [van den Herik et al. (2002)]. The classical approach is to use the alpha-beta framework combined with a dedicated static evaluation function. This evaluation function is applied to the leaf nodes of a tree. If the node represents a terminal position (or a position stored in a database) it produces an exact value. Otherwise heuristic knowledge is used to estimate the value of the leaf node. This technique led to excellent results in many games (e.g., Chess and Checkers).
However, building an evaluation function based on heuristic knowledge for a non-terminal position is a difficult and time-consuming task in several games; the most notorious example is the game of Go [Bouzy and Cazenave (2001)]. This is probably one of the reasons why Go programs have so far not achieved a strong level, despite intensive research and the additional use of knowledge-based methods.
Recently, Bouzy and Helmstetter (2003) proposed to use Monte-Carlo
simulations as an evaluation function. Monte-Carlo simulations are basi-
cally fast self-play games. Nevertheless, this approach remained too slow to
achieve a satisfying search depth. More recently, a different use of Monte-
Carlo simulations within a tree-search context has been proposed [Chaslot
et al. (2006); Kocsis and Szepesvári (2006); Coulom (2007)]. The method,
which we call “Monte-Carlo Tree Search”, is not a classical tree search fol-
lowed by a Monte-Carlo evaluation, but rather a best-first search guided
by the results of previous Monte-Carlo simulations. This method uses two main strategies, which serve different purposes. (1) A simulation strategy
decides on the moves played in the Monte-Carlo simulations [Bouzy (2005);
Gelly and Wang (2006)]. (2) A selection strategy, derived from the Multi-
Armed Bandit problem, is able to increase the quality of the move-selection
when the number of simulations grows [Gelly and Wang (2006); Coquelin
and Munos (2007)]. In this chapter, we propose to develop a soft transition between the simulation strategy and the selection strategy. The resulting “progressive strategies” for Monte-Carlo Tree Search enable, among other things, the application of time-consuming heuristics to promising positions.
This chapter is organized as follows. In Section 1.2, we present the Monte-Carlo Tree Search method. In Section 1.3 we describe progressive strategies. Section 1.4 presents the experiments, performed with our Go program Mango. Section 1.5 summarizes the contributions of this chapter and gives an outlook on future research.

1.2 Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) is a best-first search algorithm which does not need game-dependent heuristic knowledge. Its aim is to select the best move by exploring the search space pseudo-randomly. At the beginning of the search, exploration is random. Using the results of previous explorations, the algorithm becomes able to predict the most promising moves more accurately, and thus exploration becomes narrower. The basic structure of MCTS is given below.


1.2.1 Structure of MCTS


In MCTS, a node i contains at least the following two variables: (1) the value v_i (usually the average of the results of the simulated games that visited this node), and (2) the visit count n_i. MCTS usually starts with a tree containing only the root node.
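As a concrete illustration (not code from the chapter), such a node might be sketched in Python as follows; the class and field names are our own, and the heuristic_value field anticipates the progressive strategies of Section 1.3.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Minimal MCTS node holding the value v_i, the visit count n_i, and tree links."""
    move: Optional[object] = None            # move leading from the parent to this node
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0                          # n_i: simulated games that visited this node
    total_result: float = 0.0                # sum of the results (+1 / -1) of those games
    heuristic_value: float = 0.0             # H_B, used later by the progressive strategies

    @property
    def value(self) -> float:
        """v_i: average result of the simulated games that visited this node."""
        return self.total_result / self.visits if self.visits else 0.0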
Monte-Carlo Tree Search consists of four steps, repeated as long as there
is time left. (1) The tree is traversed from the root node to a leaf node (L),
using a selection strategy. (2) An expansion strategy is called to store one
(or more) children of L in the tree. (3) A simulation strategy plays moves
in self-play until the end of the game is reached. (4) The result R of this
“simulated” game is +1 in case of a win for Black (the first player in Go),
and −1 in case of a win for White. R is backpropagated in the tree according
to a backpropagation strategy. This mechanism is outlined in Figure 1.1.
The four steps of MCTS are explained in more detail below.

[Figure 1.1 shows the four steps, repeated X times: in the Selection step, the selection function is applied recursively until a leaf node is reached; in the Expansion step, one or more nodes might be created; in the Simulation step, one simulated game is played; in the Backpropagation step, the result of this game is backpropagated in the tree.]

Fig. 1.1 Scheme of a Monte-Carlo Tree Search.

Selection is the strategic task that selects one of the children of a given
node. It controls the balance between exploitation and exploration. On the
one hand, the task is often to select the move that leads to the best results
(exploitation). On the other hand, the least promising moves still have to be
explored, due to the uncertainty of the evaluation (exploration). This prob-
lem is similar to the Multi-Armed Bandit problem [Coquelin and Munos
(2007)]. As an example, we mention here the strategy UCT [Kocsis and Szepesvári (2006)] (UCT stands for Upper Confidence bounds applied to Trees). This strategy is easy to implement and is used in many programs.


The essence is choosing the move i which maximizes Formula 1.1:

    $v_i + C \times \sqrt{\frac{\ln N}{n_i}}$    (1.1)

where v_i is the value of the node i, n_i is the visit count of i, and N is the visit count of the parent node of i. C is a coefficient, which has to be tuned experimentally.
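As an illustration only, a UCT selection step following Formula 1.1 might look as below, reusing the Node sketch from Subsection 1.2.1; the value of C and the handling of unvisited children are assumptions, not taken from the chapter.

import math

def uct_select(parent: "Node", C: float = 0.7) -> "Node":
    """Return the child maximizing v_i + C * sqrt(ln N / n_i) (Formula 1.1)."""
    N = parent.visits

    def uct_score(child: "Node") -> float:
        if child.visits == 0:
            # Try every stored child at least once before trusting the formula.
            return float("inf")
        return child.value + C * math.sqrt(math.log(N) / child.visits)

    return max(parent.children, key=uct_score)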
Expansion is the strategic task that, for a given leaf node L, decides
whether this node will be expanded by storing some of its children in mem-
ory. The simplest rule, proposed by [Coulom (2007)], is to expand one
node per simulation. The node expanded corresponds to the first position
encountered that was not stored yet.
Simulation (also called playout) is the strategic task that selects moves
in self-play until the end of the game. This task might consist of playing
plain random moves or – better – pseudo-random moves. Indeed, the use
of an adequate simulation strategy has been shown to improve the level of
play significantly [Bouzy (2005); Gelly et al. (2006)]. The main idea is to
play better moves by using patterns, capture considerations, and proximity
to the last move.
Backpropagation is the procedure which backpropagates the result of a
simulated game (win/loss) to the nodes it had to traverse to reach the leaf.
The value v_i of a node is computed by taking the average of the results of
all simulated games made through this node.
Finally, the move played by the program is the child of the root with
the highest visit count.
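Putting the four steps and the final move choice together, a bare-bones sketch of the MCTS loop could look as follows. The game-state methods (copy, play, legal_moves, is_terminal, random_playout_result) are placeholders for whatever the actual program provides; only the control flow follows the description above, and Node is the sketch from Subsection 1.2.1.

def mcts(root_state, root: "Node", n_simulations: int, select_child) -> object:
    """Run MCTS and return the move to play (the most-visited child of the root)."""
    for _ in range(n_simulations):
        node, state = root, root_state.copy()

        # (1) Selection: descend through the part of the tree stored in memory.
        while node.children:
            node = select_child(node)
            state.play(node.move)

        # (2) Expansion: store the children of the first position not yet in the tree.
        if not state.is_terminal():
            node.children = [Node(move=m, parent=node) for m in state.legal_moves()]
            node = node.children[0]
            state.play(node.move)

        # (3) Simulation: play moves in self-play until the end of the game.
        result = state.random_playout_result()   # +1 for a Black win, -1 for a White win

        # (4) Backpropagation: update every node on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_result += result
            node = node.parent

    return max(root.children, key=lambda c: c.visits).move

Here select_child would be the UCT rule of Formula 1.1, or one of the progressive selection rules described in Section 1.3.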

1.3 Progressive Strategies

When a node has been visited a few times, a well-tuned simulation strategy
chooses moves more accurately than a selection strategy. However, when a
node has been visited quite often, the selection strategy is more accurate, because it improves with the number of games played [Chaslot et al.
(2006); Kocsis and Szepesvári (2006); Coquelin and Munos (2007); Coulom
(2007)].
In the current programs, there is no transition phase between the simu-
lation strategy and the selection strategy. For instance, in Rémi Coulom’s
program CrazyStone the simulation strategy is applied in a node if it
has been visited fewer than 81 times. Otherwise, the selection strategy is
applied [Coulom (2007)].


We propose to develop a strategy that performs a soft transition between the simulation strategy and the selection strategy. Such a strategy uses (i) the heuristic knowledge available for the simulation strategy, (ii) the information available for the selection strategy, and (iii) time-expensive domain knowledge. “Progressive strategies” are similar to a simulation strategy when few games have been played, and converge to a selection strategy when numerous games have been played.

In the following two subsections we describe the two progressive strategies used in Mango, our Go-playing program: Progressive Bias (Subsection 1.3.1) and Progressive Unpruning (Subsection 1.3.2). Subsection 1.3.3 describes how to implement them efficiently with the use of time-expensive domain knowledge.

1.3.1 Progressive Bias


The aim of this strategy is to direct the search according to time-expensive heuristic knowledge. For that purpose, the selection strategy is modified according to this knowledge. The influence of this modification may be important when few games have been played, but should decrease quickly to ensure that the strategy converges to a selection strategy. We modified the UCT selection in the following way. Instead of selecting the move which maximizes Formula 1.1, we select the move which maximizes Formula 1.2:

    $v_i + C \times \sqrt{\frac{\ln N}{n_i}} + f(n_i)$    (1.2)

In Mango, we chose experimentally f(n_i) = H_B / n_i, where H_B is a coefficient representing heuristic knowledge, which depends only on the board configuration B. N, C, and n_i are the same as in Section 1.2. We give more details on the construction of H_B in Subsection 1.3.3.
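A sketch of a selection step with Progressive Bias, along the lines of Formula 1.2 and the choice f(n_i) = H_B / n_i, is given below; as in the earlier sketch, the constant C and the treatment of unvisited children are illustrative assumptions, and heuristic_value stands for the cached knowledge term H_B of Subsection 1.3.3.

import math

def progressive_bias_select(parent: "Node", C: float = 0.7) -> "Node":
    """Return the child maximizing v_i + C * sqrt(ln N / n_i) + H_B / n_i (Formula 1.2)."""
    N = parent.visits

    def score(child: "Node") -> float:
        if child.visits == 0:
            return float("inf")
        exploration = C * math.sqrt(math.log(N) / child.visits)
        bias = child.heuristic_value / child.visits   # f(n_i) = H_B / n_i, fades as n_i grows
        return child.value + exploration + bias

    return max(parent.children, key=score)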

1.3.2 Progressive Unpruning


We have seen in Mango that when the time available is low and simultaneously the branching factor is high, MCTS performs poorly. Our solution consists in (1) reducing artificially the branching factor at the beginning of the search, and (2) increasing it progressively as more time becomes available. When a node i is created, all its children are “pruned”³ except a certain number U(0) of them. The children of the node i are progressively “unpruned”. In Mango, we use a function U(n_i) which gives the number of unpruned moves depending on the number n_i of games played through the node. The moves with the highest values of H_B are unpruned first.

³ A node is pruned if it cannot be accessed in the simulated games.

1.3.3 Using Time-Expensive Heuristics


The two previous strategies require the computation of a heuristic value H_B for a given board configuration B. In Mango, H_B is computed by using the pattern matching described in [Bouzy and Chaslot (2005)], in combination with the proximity to the last move. The exact formula uses coefficients tuned with simulated annealing. The time consumed to compute H_B is on the order of a millisecond, which is around 1000 times slower than playing
a move in a simulated game. To avoid a speed reduction in the program’s
performance, we perform the time-consuming pattern matching only once
per node, when a certain threshold of games has been played through this
node. This threshold was set to 10 simulated games in Mango. With this
setting, the speed of the program was only reduced by 4 percent. This
speed reduction is low because the number of nodes that have been visited
more than 10 times is low compared to the number of moves played in the
simulated games.
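One way to realize this lazy computation of H_B is sketched below; compute_heuristic stands in for Mango's pattern matcher and is not part of the chapter, while the threshold of 10 simulated games follows the text.

HEURISTIC_THRESHOLD = 10   # simulated games through a node before H_B is computed

def maybe_compute_heuristics(node: "Node", board, compute_heuristic) -> None:
    """Run the time-expensive pattern matching once per node, and only after the
    node has been visited HEURISTIC_THRESHOLD times."""
    if node.visits < HEURISTIC_THRESHOLD or getattr(node, "heuristics_done", False):
        return
    for child in node.children:
        # H_B depends only on the board configuration reached by the child's move.
        child.heuristic_value = compute_heuristic(board, child.move)
    node.heuristics_done = True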

1.4 Experiments

The main test bed for the experiments was 13 × 13 Go. This size is more interesting than 9 × 9 because the level of basic MCTS programs is lower there. We preferred 13 × 13 over 19 × 19 because games are played faster, and so MCTS programs can be analyzed more extensively. Our program was tested in online tournaments, against the latest version of GNU Go, and in self-play. Below we briefly describe three types of experiments.
Tournaments. Table 1.1 presents the results of Mango in the tournaments it entered in 2007. The progressive strategy was used in each of these tournaments, and constituted the main strength of Mango.

Table 1.1 Results of Mango in recent tournaments.

Tournament          Board Size   Participants   Mango's rank
KGS January 2007    13 × 13      10             2nd
KGS March 2007      19 × 19      12             4th
KGS April 2007      13 × 13      10             3rd
KGS May 2007        13 × 13      7              2nd

Mango against GNU Go. We tested the playing strength of our program against the latest version of GNU Go (3.7.10), at level 10, with a time setting of 30 minutes per player. Mango won 58 percent of the 200 games played when using Progressive Unpruning and Progressive Bias. Without these strategies, Mango won 25 percent of the 200 games.
Self-play experiment. For this experiment, we decreased the time to 10
seconds per move, in order to be able to play more games. The version of
Mango using Progressive Bias and Progressive Unpruning won 81 percent
of the 500 games played against Mango without progressive strategy.
Based on these experiments, we may conclude that progressive strate-
gies increase the level of play of our program Mango significantly.

1.5 Conclusion

In this chapter we described progressive strategies, which are a method of using heuristic knowledge in Monte-Carlo Tree Search. We presented two strategies to achieve this goal: Progressive Unpruning and Progressive Bias. Furthermore, we proposed a mechanism to avoid a speed reduction when these strategies are used with time-expensive domain heuristics.
From the experiments we may conclude that these techniques increase the
playing strength of our 13 × 13 Go program Mango significantly. In future
research, we will study in detail the influence of different parameters, such
as the speed of unpruning and the biasing formula. In particular, we want to
apply the work by Coquelin and Munos (2007) for designing the progressive
bias. We also want to enhance our domain knowledge by taking “groups”
and “life and death” into consideration.

Acknowledgments
This work is financed by the Dutch Organization for Scientific Research
(NWO) for the project Go for Go, grant number 612.066.409.


Bibliography

Bouzy, B. (2005). Associating Domain-Dependent Knowledge and Monte-Carlo Approaches within a Go Program, Information Sciences, Heuristic Search and Computer Game Playing IV 175, 4, pp. 247–257.
Bouzy, B. and Cazenave, T. (2001). Computer Go: An AI-oriented Survey, Arti-
ficial Intelligence 132, 1, pp. 39–103.
Bouzy, B. and Chaslot, G. (2005). Bayesian Generation and Integration of K-
nearest-neighbor Patterns for 19x19 Go, in G. Kendall and S. Lucas (eds.),
IEEE 2005 Symposium on Computational Intelligence in Games, Essex,
UK, pp. 176–181.
Bouzy, B. and Helmstetter, B. (2003). Monte-Carlo Go Developments, in H. J.
van den Herik, H. Iida and E. A. Heinz (eds.), Proceedings of the 10th
Advances in Computer Games Conference (ACG-10) (Kluwer Academic),
pp. 159–174.
Chaslot, G., Saito, J.-T., Bouzy, B., Uiterwijk, J. W. H. M. and Herik, H. J. van
den (2006). Monte-Carlo Strategies for Computer Go, in Proceedings of the
18th BeNeLux Conference on Artificial Intelligence, pp. 83–90.
Coquelin, P.-A. and Munos, R. (2007). Bandit Algorithm for Tree Search, Tech-
nical Report 6141, INRIA.
Coulom, R. (2007). Efficient Selectivity and Backup Operators in Monte-Carlo
Tree Search, in H. J. van den Herik, P. Ciancarini and H. H. L. M. Donkers
(eds.), Proceedings of the 5th Computers and Games Conference (CG 2006),
to be published in LNCS, Springer-Verlag, Berlin.
Gelly, S. and Wang, Y. (2006). Exploration Exploitation in Go: UCT for Monte-
Carlo Go, in Twentieth Annual Conference on Neural Information Process-
ing Systems (NIPS 2006).
Gelly, S., Wang, Y., Munos, R. and Teytaud, O. (2006). Modifications of UCT
with Patterns in Monte-Carlo Go, Technical Report 6062, INRIA.
Herik, H. J. van den, Uiterwijk, J. W. H. M. and Rijswijck, J. van (2002). Games
Solved: Now and in the Future, Artificial Intelligence 134, 1–2, pp. 277–
311.
Kocsis, L. and Szepesvári, C. (2006). Bandit Based Monte-Carlo Planning, in
J. Fürnkranz, T. Scheffer and M. Spiliopoulou (eds.), Machine Learning: ECML 2006, Lecture Notes in Artificial Intelligence 4212, pp. 282–293.
