
Monte-Carlo Tree Search

Alan Fern

Introduction
Rollout does not guarantee optimality or near optimality
It only guarantees policy improvement (under certain conditions)

Theoretical Question:
Can we develop Monte-Carlo methods that give us near-optimal policies?
With computation that does NOT depend on the number of states!
This was an open theoretical question until the late 1990s.

Practical Question:
Can we develop Monte-Carlo methods that improve
smoothly and quickly with more computation time?
Multi-stage rollout can improve arbitrarily, but each stage adds a
big increase in time per move.
Look-Ahead Trees
Rollout can be viewed as performing one level of search
on top of the base policy
[Figure: rollout's one-level search over root actions a1, a2, ..., ak]

Maybe we should search for multiple levels.

In deterministic games and search problems it is common
to build a look-ahead tree at a state to select best action
Can we generalize this to general stochastic MDPs?

Sparse Sampling is one such algorithm


Strong theoretical guarantees of near optimality

Online Planning with Look-Ahead Trees
At each state we encounter in the environment we
build a look-ahead tree of depth h and use it to
estimate optimal Q-values of each action
Select action with highest Q-value estimate

s = current state
Repeat
  T = BuildLookAheadTree(s)   ;; sparse sampling or UCT
                              ;; tree provides Q-value estimates for root actions
  a = BestRootAction(T)       ;; action with best Q-value
  Execute action a in environment
  s = the resulting state
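
A minimal Python sketch of this loop. All names here are illustrative assumptions, not from the slides: build_lookahead_tree could be sparse sampling or UCT, best_root_action reads the root Q-value estimates, and the environment object is assumed to have step(a) and is_terminal(s) methods.

def online_planning_loop(env, s, build_lookahead_tree, best_root_action):
    # Online planning: replan from scratch at every state encountered.
    while not env.is_terminal(s):
        tree = build_lookahead_tree(s)   # sparse sampling or UCT
        a = best_root_action(tree)       # action with best root Q-value estimate
        s = env.step(a)                  # execute a; observe the resulting state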

Planning with Look-Ahead Trees

[Figure: as the real-world state/action sequence unfolds, a new look-ahead tree
is built at each encountered state s, branching on actions a1 and a2 and on
sampled next states s11 ... s1w and s21 ... s2w, down to leaf rewards
R(s11,a1), R(s11,a2), ..., R(s2w,a2).]
Expectimax Tree (depth 1)
Alternate max nodes (max over actions) and expectation nodes
(weighted average over states)

[Figure: root max node s with actions a and b; each action leads to an
expectation node whose children s1, s2, ..., sn are weighted by the
probability of occurring after taking that action in s]

After taking each action there is a distribution over
next states (nature's moves)
The value of an action depends on the immediate
reward and the weighted value of those next states
Expectimax Tree (depth 2)
Alternate max nodes (max over actions) and expectation nodes

[Figure: depth-2 expectimax tree rooted at max node s; actions a and b lead to
expectation nodes, whose child states again branch on actions a and b]

Max Nodes: value equal to max of values of child expectation nodes


Expectation Nodes: value is weighted average of values of its children
Compute values from bottom up (leaf values = 0).
Select root action with best value.
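
As a concrete illustration (not from the slides), here is a minimal recursive sketch of this bottom-up computation in Python, assuming access to the exact transition model via a hypothetical transitions(s, a) that returns (probability, next_state, reward) triples and an actions(s) function:

def expectimax_value(s, h, actions, transitions):
    """Depth-h expectimax value of state s; leaf values are 0."""
    if h == 0:
        return 0.0
    best = float("-inf")
    for a in actions(s):                      # max node: maximize over actions
        q = 0.0                               # expectation node for (s, a)
        for prob, s_next, reward in transitions(s, a):
            q += prob * (reward + expectimax_value(s_next, h - 1, actions, transitions))
        best = max(best, q)
    return best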
Expectimax Tree (depth H)
[Figure: depth-H expectimax tree rooted at s; k = # actions, n = # states per
expectation node]

In general we can grow the tree to any horizon H.

Size depends on the size of the state space. Bad! Size of tree = O((k·n)^H)
Sparse Sampling Tree (depth H)
[Figure: depth-H sparse sampling tree rooted at s; k = # actions, w = # sampled
states per action]
Replace expectation with average over w samples

w will typically be much smaller than n. Size of tree = O((k·w)^H)
Sparse Sampling [Kearns et al., 2002]
The Sparse Sampling algorithm computes root value via depth first expansion
Return value estimate V*(s,h) of state s and estimated optimal action a*
SparseSampleTree(s, h, w)
  If h == 0 Then Return [0, null]
  For each action a in s
    Q*(s,a,h) = 0
    For i = 1 to w
      Simulate taking a in s, resulting in state si and reward ri
      [V*(si,h-1), ai*] = SparseSampleTree(si, h-1, w)
      Q*(s,a,h) = Q*(s,a,h) + ri + V*(si,h-1)
    Q*(s,a,h) = Q*(s,a,h) / w    ;; estimate of Q*(s,a,h)
  V*(s,h) = max_a Q*(s,a,h)      ;; estimate of V*(s,h)
  a* = argmax_a Q*(s,a,h)
  Return [V*(s,h), a*]
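
A Python sketch of the same recursion, under the assumption of a generative model simulate(s, a) that returns a sampled (next_state, reward) pair and an actions(s) function (both names are hypothetical):

def sparse_sample(s, h, w, actions, simulate):
    """Return (value estimate, greedy action) for state s with horizon h."""
    if h == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions(s):
        total = 0.0
        for _ in range(w):                    # w sampled children per action
            s_next, reward = simulate(s, a)
            v_next, _ = sparse_sample(s_next, h - 1, w, actions, simulate)
            total += reward + v_next
        q = total / w                         # Monte-Carlo estimate of Q*(s,a,h)
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action

The cost is O((k·w)^h) recursive calls, matching the tree size above, but it is independent of the number of states in the MDP.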

SparseSample(h=2, w=2)
Alternate max and averaging levels.
[Figure: root max node s with action a; its averaging node is about to be
expanded with w = 2 sampled children]

Max Nodes: value equals the max of the values of its child averaging nodes
Average Nodes: value is the average of the values of its w sampled children
Compute values from bottom up (leaf values = 0).
Select root action with best value.

SparseSample(h=2, w=2): building up the estimate
[Figure sequence: for root action a, the two recursive calls
SparseSample(h=1, w=2) on the sampled next states return value estimates 10
and 0, so action a's averaging node gets value (10 + 0) / 2 = 5.]

SparseSample(h=2, w=2)
[Figure: action a's averaging node has value 5 (average of sampled values 10
and 0); action b's averaging node has value 3 (average of sampled values 2
and 4).]

Select action a since its value is larger than b's.
Sparse Sampling (Cont'd)
For a given desired accuracy, how large
should the sampling width and depth be?
Answered by Kearns, Mansour, and Ng (1999)

Good news: we can give values for w and H that
achieve a policy arbitrarily close to optimal
Values are independent of state-space size!
First near-optimal general MDP planning algorithm
whose runtime didn't depend on the size of the state space

Bad news: the theoretical values are typically
still intractably large, and also exponential in H
Exponential in H is the best we can do in general
In practice: use a small H and a heuristic at the leaves
Sparse Sampling (Cont'd)
In practice: use small H & evaluate leaves with a heuristic
For example, if we have a policy, leaves could be evaluated by
estimating the policy value (i.e. average reward across runs of the policy)

[Figure: sparse sampling tree over root actions a1 and a2, truncated at a
small depth; each leaf is evaluated by policy simulations]
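
One way to realize such a leaf evaluator, sketched under the same assumed generative model simulate(s, a) and a base policy(s) function, is to average the return of a few truncated policy rollouts:

def policy_leaf_value(s, policy, simulate, num_runs=10, run_length=20):
    """Estimate a leaf's value as the average reward over short policy rollouts."""
    total = 0.0
    for _ in range(num_runs):
        state, ret = s, 0.0
        for _ in range(run_length):
            state, reward = simulate(state, policy(state))
            ret += reward
        total += ret
    return total / num_runs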
Uniform vs. Adaptive Tree Search
Sparse sampling wastes time
on bad parts of tree
Devotes equal resources to each
state encountered in the tree
Would like to focus on most
promising parts of tree

But how to control exploration of new parts of
the tree vs. exploiting promising parts?
Need an adaptive bandit algorithm that explores
more effectively

What now?

Adaptive Monte-Carlo Tree Search


UCB Bandit Algorithm
UCT Monte-Carlo Tree Search

Bandits: Cumulative Regret Objective
Problem: find arm-pulling strategy such that the
expected total reward at time n is close to the best
possible (one pull per time step)
Optimal (in expectation) is to pull optimal arm n times
UniformBandit is a poor choice: it wastes time on bad arms
Must balance exploring machines to find good payoffs
and exploiting current knowledge

[Figure: k slot machines (arms) a1, a2, ..., ak]

Cumulative Regret Objective

Protocol: At time step n the algorithm picks an
arm a_n based on what it has seen so far and
receives reward r_n (a_n and r_n are random variables).

Expected Cumulative Regret E[Reg_n]:
difference between the optimal expected cumulative
reward and the expected cumulative reward of our
strategy at time n

  E[Reg_n] = n * E[r*] - sum_{t=1..n} E[r_t]

  (r* is the reward of the optimal arm)
UCB Algorithm for Minimizing Cumulative Regret
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the
multiarmed bandit problem. Machine learning, 47(2), 235-256.

Q(a) : average reward for trying action a (in


our single state s) so far
n(a) : number of pulls of arm a so far
Action choice by UCB after n pulls:

  a_n = argmax_a [ Q(a) + sqrt( 2 ln n / n(a) ) ]

Assumes rewards in [0,1]. We can always
normalize if we know the max value.
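
A sketch of this rule in Python; Q and counts are assumed to be dictionaries maintained by the caller, with every arm already pulled at least once so that n(a) > 0:

import math

def ucb_choose(arms, Q, counts):
    """Pick the arm maximizing Q(a) + sqrt(2 ln n / n(a))."""
    n = sum(counts[a] for a in arms)          # total number of pulls so far
    return max(arms, key=lambda a: Q[a] + math.sqrt(2 * math.log(n) / counts[a]))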
UCB: Bounded Sub-Optimality
  a_n = argmax_a [ Q(a) + sqrt( 2 ln n / n(a) ) ]

Value term: favors actions that looked good historically.
Exploration term: actions get an exploration bonus that grows with ln(n).

Expected number of pulls of a sub-optimal arm a after n total pulls is bounded by:

  (8 / Δ_a^2) * ln n

where Δ_a is the sub-optimality gap of arm a

Doesn't waste much time on sub-optimal arms, unlike uniform sampling!


UCB Performance Guarantee
[Auer, Cesa-Bianchi, & Fischer, 2002]

Theorem: The expected cumulative regret of UCB,
E[Reg_n], after n arm pulls is bounded by O(log n)

Is this good?
Yes. The average per-step regret is O( (log n) / n )

Theorem: No algorithm can achieve a better
expected regret (up to constant factors)

What now?

Adaptive Monte-Carlo Tree Search


UCB Bandit Algorithm
UCT Monte-Carlo Tree Search

UCT Algorithm [Kocsis & Szepesvari, 2006]

UCT is an instance of Monte-Carlo Tree Search


Applies principle of UCB
Similar theoretical properties to sparse sampling
Much better anytime behavior than sparse sampling

Famous for yielding a major advance in
computer Go

A growing number of success stories


Monte-Carlo Tree Search
Builds a sparse look-ahead tree rooted at current state by
repeated Monte-Carlo simulation of a rollout policy
During construction each tree node s stores:
state-visitation count n(s)
action counts n(s,a)
action values Q(s,a)

Repeat until time is up


1. Execute rollout policy starting from root until horizon
(generates a state-action-reward trajectory)
2. Add first node not in current tree to the tree
3. Update statistics of each tree node s on trajectory
Increment n(s) and n(s,a) for selected action a

Update Q(s,a) by total reward observed after the node
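
A skeleton of this loop in Python, a sketch rather than the exact implementation behind the slides. It assumes a generative model simulate(s, a) returning (next_state, reward), hashable states, and a rollout_policy callable that covers both the tree-policy and default-policy phases discussed below:

import time
from collections import defaultdict

def mcts(root, horizon, time_budget, rollout_policy, simulate):
    """Build a sparse look-ahead tree at `root` by repeated Monte-Carlo simulation."""
    n_s = defaultdict(int)      # state-visitation counts n(s)
    n_sa = defaultdict(int)     # action counts n(s,a)
    q_sa = defaultdict(float)   # action values Q(s,a)
    tree = {root}
    deadline = time.time() + time_budget
    while time.time() < deadline:
        # 1. Execute the rollout policy from the root until the horizon.
        trajectory, s = [], root
        for _ in range(horizon):
            a = rollout_policy(s, tree, n_s, n_sa, q_sa)
            s_next, r = simulate(s, a)
            trajectory.append((s, a, r))
            s = s_next
        # 2. Add the first state on the trajectory that is not yet in the tree.
        for s_t, _, _ in trajectory:
            if s_t not in tree:
                tree.add(s_t)
                break
        # 3. Update the statistics of every tree node on the trajectory.
        for i, (s_t, a_t, _) in enumerate(trajectory):
            if s_t in tree:
                future_reward = sum(r for _, _, r in trajectory[i:])
                n_s[s_t] += 1
                n_sa[(s_t, a_t)] += 1
                # keep Q(s,a) as the running average of observed future reward
                q_sa[(s_t, a_t)] += (future_reward - q_sa[(s_t, a_t)]) / n_sa[(s_t, a_t)]
    return n_s, n_sa, q_sa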


Example MCTS

For illustration purposes we will assume the MDP is:


Deterministic
Only non-zero rewards are at terminal/leaf nodes

The algorithm is well-defined without these
assumptions
Iteration 1
Current World State
[Figure: initially the tree is a single leaf (the current state). The rollout
policy is executed from the root until a terminal state with reward 1 is
reached; the first state off the tree is added as a new tree node with value
estimate 1.]

Assume all non-zero reward occurs at terminal nodes.

Iteration 2
Current World State
[Figure: a second rollout from the root reaches a terminal state with reward 0;
a second new tree node with value estimate 0 is added.]
Iteration 3
Current World State
[Figure: the root now has value 1/2 (two trajectories), with child nodes valued
1 and 0. The rollout policy descends through the tree and then continues below
it; the trajectory ends at a terminal state with reward 0, and a new tree node
with value 0 is added beneath the child valued 1.]
Iteration 4
Current World State
[Figure: the root value is now 1/3, with children valued 1/2 and 0. The rollout
policy again descends through the tree and continues with new samples; this
trajectory ends with reward 1, and the counts and values of all tree nodes
along it are updated.]

What should the rollout policy be?


Rollout Policies
Monte-Carlo Tree Search algorithms mainly
differ on their choice of rollout policy
Rollout policies have two distinct phases
Tree policy: selects actions at nodes already in tree
(each action must be selected at least once)
Default policy: selects actions after leaving tree

Key Idea: the tree policy can use statistics
collected from previous trajectories to
intelligently expand the tree in the most
promising direction
Rather than uniformly explore actions at each node
UCT Algorithm [Kocsis & Szepesvari, 2006]

Basic UCT uses random default policy


In practice often use hand-coded or learned policy

Tree policy is based on UCB:


Q(s,a) : average reward received in trajectories so
far after taking action a in state s
n(s,a) : number of times action a taken in s
n(s) : number of times state s encountered

  UCT(s) = argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ]

  K is a theoretical constant that must be
  selected empirically in practice
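
A sketch of this tree policy at a single tree node, using the same count and value tables (defaultdicts) as in the MCTS skeleton above; untried actions are selected first, since each action must be tried at least once:

import math

def uct_tree_policy(s, actions, n_s, n_sa, q_sa, K):
    """UCT rule: argmax_a Q(s,a) + K * sqrt(ln n(s) / n(s,a))."""
    untried = [a for a in actions if n_sa[(s, a)] == 0]
    if untried:
        return untried[0]
    return max(actions,
               key=lambda a: q_sa[(s, a)] + K * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)]))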
When all of a node's actions have been tried once, select actions at that node
according to the tree policy:

  UCT(s) = argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ]

[Figure sequence: starting from a tree whose root has value 1/3 and whose
children (for actions a1 and a2) have values 1/2 and 0, each simulation selects
actions inside the tree with the UCT rule, continues with the default policy
once it leaves the tree, adds one new node, and backs up the observed reward
(0, then 1, then 0, ...). The counts n(s), n(s,a) and values Q(s,a) along the
trajectory are updated after every simulation; the root estimate moves from
1/3 to 1/4 to 2/5 over these iterations.]
UCT Recap

To select an action at a state s


Build a tree using N iterations of Monte-Carlo
tree search
Default policy is uniform random

Tree policy is based on UCB rule

Select action that maximizes Q(s,a)


(note that this final action selection does not take
the exploration term into account, just the Q-value
estimate)

The more simulations, the more accurate the estimates
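
Putting the pieces together, the final choice at the root can be as simple as this greedy step over the learned Q-values (a sketch using the tables from the earlier skeleton); note that it ignores the exploration bonus:

def best_root_action(root, actions, q_sa):
    """Final choice: maximize Q at the root, without the exploration term."""
    return max(actions, key=lambda a: q_sa[(root, a)])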

Some Successes
Computer Go
Klondike Solitaire (wins 40% of games)
General Game Playing Competition
Real-Time Strategy Games
Combinatorial Optimization
Crowd Sourcing

List is growing

Usually these extend UCT in some way


Practical Issues
Selecting K
There is no fixed rule
Experiment with different values in your domain
Rule of thumb: try values of K that are of the
same order of magnitude as the reward signal you
expect

The best value of K may depend on the
number of iterations N
Usually a single value of K will work well for values
of N that are large enough
Practical Issues
UCT can have trouble building deep trees when
actions can result in a large # of possible next states
Each time we try action a we get a new state (new leaf) and
very rarely resample an existing leaf

[Figure: at state s, each trial of action a or b generates a brand-new sampled
child state]
Practical Issues
Degenerates into a depth 1 tree search
Solution: We have a sparseness parameter
that controls how many new children of an
action will be generated
[Figure: at most a fixed number of sampled children are kept under each action
at state s]

Set the sparseness parameter to something like 5 or 10,
then see whether smaller values hurt performance
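
One simple way to implement such a sparseness cap, sketched with illustrative names: keep at most max_children sampled next states per (state, action) pair and resample from the existing ones afterwards.

import random

def sample_child(s, a, simulate, children, max_children=5):
    """Cap the number of distinct sampled children per (state, action) pair."""
    kids = children.setdefault((s, a), [])
    if len(kids) < max_children:
        child = simulate(s, a)          # (next_state, reward): generate a new child
        kids.append(child)
        return child
    return random.choice(kids)          # otherwise revisit an existing child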
Some Improvements
Use domain knowledge to handcraft a more
intelligent default policy than random
E.g. don't choose obviously stupid actions
In Go a hand-coded default policy is used

Learn a heuristic function to evaluate positions
Use the heuristic function to initialize leaf nodes
(otherwise initialized to zero)
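
For instance, a new tree node's statistics could be seeded from a learned heuristic instead of zero; a sketch, where heuristic(s) and the prior weight n0 are assumptions rather than anything prescribed by the slides:

def init_leaf(s, actions, n_s, n_sa, q_sa, heuristic, n0=1):
    """Initialize a new tree node's values from a heuristic instead of zero."""
    n_s[s] = n0 * len(actions)
    for a in actions:
        n_sa[(s, a)] = n0              # pretend n0 prior visits per action
        q_sa[(s, a)] = heuristic(s)    # heuristic estimate of the node's value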
