Alan Fern
Introduction
Rollout does not guarantee optimality or near-optimality.
It only guarantees policy improvement (under certain conditions).
Theoretical Question:
Can we develop Monte-Carlo methods that give us near-optimal policies, with computation that does NOT depend on the number of states?
This was an open theoretical question until the late 90s.
Practical Question:
Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?
Multi-stage rollout can improve arbitrarily, but each stage adds a large increase in time per move.
Look-Ahead Trees
Rollout can be viewed as performing one level of search on top of the base policy.
[Figure: root state branching on actions a1, a2, …, ak]
In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action.
Can we generalize this to general stochastic MDPs?
Online Planning with Look-Ahead Trees
At each state we encounter in the environment, we build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action.
We then select the action with the highest Q-value estimate.
s = current state
Repeat:
    T = BuildLookAheadTree(s)    ;; sparse sampling or UCT
                                 ;; tree provides Q-value estimates for root actions
    a = BestRootAction(T)        ;; action with best Q-value estimate
    Execute action a in the environment
    s = the resulting state
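As a concrete illustration, here is a minimal Python sketch of this replanning loop. The environment interface (reset/step) and the planner hooks build_tree and best_root_action are hypothetical placeholders, not from the slides; any tree-building planner (sparse sampling, UCT) can be plugged in.

def online_planning_loop(env, build_tree, best_root_action, num_steps=100):
    # Minimal sketch: plan from scratch at every encountered state.
    s = env.reset()                       # current state
    for _ in range(num_steps):
        tree = build_tree(s)              # e.g., sparse sampling or UCT
        a = best_root_action(tree)        # root action with best Q-value estimate
        s, reward, done = env.step(a)     # execute the action in the environment
        if done:
            s = env.reset()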
Planning with Look-Ahead Trees
[Figure: along the real-world state/action sequence, a fresh look-ahead tree is built at each encountered state s. Each tree branches on the actions a1, a2 and its leaves hold sampled rewards such as R(s11,a1), R(s1w,a2), R(s2w,a1).]
Expectimax Trees
Alternate max nodes (max over actions) and expectation nodes (weighted average over successor states).
[Figure: depth-1 expectimax tree rooted at s. Actions a and b lead to expectation nodes over successor states s1, …, sn, weighted by the probability of occurring after taking the action in s.]
[Figure: a full expectimax tree of depth H with n successor states per action, next to a sampled tree of depth H that keeps only w sampled successor states per action.]
Replace each exact expectation with an average over w samples.
w will typically be much smaller than n. Size of the sampled tree = O((k·w)^H) for k actions, versus O((k·n)^H) for the full tree.
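For reference, the quantities the tree computes can be written out explicitly. This is the standard finite-horizon expectimax backup; the discount factor γ is an assumption here (set γ = 1 for the undiscounted backup used in the pseudocode below):

\[
\begin{aligned}
V^*(s,0) &= 0,\\
Q^*(s,a,h) &= \mathbb{E}_{s' \sim T(s,a,\cdot)}\big[\, R(s,a) + \gamma\, V^*(s',h-1) \,\big],\\
V^*(s,h) &= \max_{a} Q^*(s,a,h).
\end{aligned}
\]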
Sparse Sampling [Kearns et al., 2002]
The Sparse Sampling algorithm computes the root value via depth-first expansion.
It returns a value estimate V*(s,h) of state s and an estimated optimal action a*.
SparseSampleTree(s, h, w)
  If h == 0 Then Return [0, null]
  For each action a in s
    Q*(s,a,h) = 0
    For i = 1 to w
      Simulate taking a in s, resulting in state si and reward ri
      [V*(si,h-1), a*] = SparseSampleTree(si, h-1, w)
      Q*(s,a,h) = Q*(s,a,h) + ri + V*(si,h-1)
    Q*(s,a,h) = Q*(s,a,h) / w        ;; average over the w samples
  V*(s,h) = max_a Q*(s,a,h)
  Return [V*(s,h), arg max_a Q*(s,a,h)]
[Figure: example trace of SparseSampleTree with h = 1 and w = 2, showing two sampled successor states per action a1, a2 at the root and the resulting value estimates.]
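A runnable Python version of the recursion above, as a sketch: the interfaces actions(s) and simulate(s, a) -> (next_state, reward) are assumed stand-ins for the MDP's generative model, and rewards are undiscounted to match the pseudocode.

def sparse_sample_tree(s, h, w, actions, simulate):
    """Sparse sampling: returns (value estimate V*(s,h), best action).
    actions(s) and simulate(s, a) are assumed generative-model hooks."""
    if h == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions(s):
        total = 0.0
        for _ in range(w):
            s2, r = simulate(s, a)        # one sampled transition
            v2, _ = sparse_sample_tree(s2, h - 1, w, actions, simulate)
            total += r + v2               # undiscounted, as in the pseudocode
        q = total / w                     # average over the w samples
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action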
Uniform vs. Adaptive Tree Search
Sparse sampling wastes time on bad parts of the tree: it devotes equal resources to each state encountered in the tree.
We would like to focus computation on the most promising parts of the tree.
What now?
Bandits: Cumulative Regret Objective
Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step).
Optimal (in expectation) is to pull the optimal arm n times.
UniformBandit is a poor choice --- it wastes time on bad arms.
We must balance exploring machines to find good payoffs against exploiting current knowledge.
[Figure: k slot-machine arms a1, a2, …, ak]
Cumulative Regret Objective
Expected cumulative regret after n pulls:
E[Reg_n] = n·μ* − E[ Σ_{t=1..n} r_t ]
where μ* is the expected reward of the best arm and r_t is the reward received at step t.
UCB Algorithm for Minimizing Cumulative Regret
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235-256.
a_n = arg max_a [ Q(a) + sqrt( 2·ln n / n(a) ) ]
where Q(a) is the average reward observed for arm a and n(a) is the number of times a has been pulled in the first n steps.
Is this good?
Yes. The expected average per-step regret is O( (log n) / n ).
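A minimal Python sketch of this rule. The arms interface (a list of zero-argument reward functions returning values in [0, 1]) is illustrative, not from the slides.

import math, random

def ucb1(arms, n_pulls):
    """UCB1: arms is a list of zero-argument functions, each
    returning a stochastic reward in [0, 1]."""
    k = len(arms)
    counts = [0] * k                  # n(a): times each arm was pulled
    values = [0.0] * k                # Q(a): running average reward
    for a in range(k):                # pull each arm once to initialize
        counts[a] = 1
        values[a] = arms[a]()
    for n in range(k + 1, n_pulls + 1):
        # Pull the arm maximizing Q(a) + sqrt(2 ln n / n(a)).
        a = max(range(k), key=lambda i:
                values[i] + math.sqrt(2 * math.log(n) / counts[i]))
        r = arms[a]()
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # incremental average
    return values, counts

# Example: two Bernoulli arms with success probabilities 0.4 and 0.6.
values, counts = ucb1([lambda: float(random.random() < 0.4),
                       lambda: float(random.random() < 0.6)], 1000)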
What now?
UCT Algorithm [Kocsis & Szepesvari, 2006]
[Figure: UCT grows its tree one node per iteration on a small example. In iteration 3 the search descends from the current world state (root estimate 1/2), adds a new leaf, and the rollout policy completes the trajectory to a terminal state with reward 1 or 0. By iteration 4 the root estimate has been updated to 1/3, and later to 2/3, as terminal rewards are backed up along the visited path.]
UCT(s) = arg max_a [ Q(s,a) + K·sqrt( ln n(s) / n(s,a) ) ]
K is a theoretical constant that must be selected empirically in practice.
When all of a node's actions have been tried at least once, the action at that node is selected according to the tree policy above, i.e., the arg max of Q(s,a) + K·sqrt( ln n(s) / n(s,a) ).
[Figure: successive UCT iterations on the example. The tree policy descends through internal nodes whose value estimates evolve across iterations (root estimate 1/3, then 1/4, then 2/5), hands off to the rollout policy at the tree frontier, and terminal rewards (0 or 1) are backed up along the visited path.]
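The following is a compact Python sketch of this scheme, not a reference implementation. The interfaces actions(s) (legal actions) and simulate(s, a) -> (next_state, reward, done) are assumed, and a uniformly random rollout policy stands in for the base policy.

import math, random

class Node:
    """Per-state node: visit count plus per-action statistics."""
    def __init__(self):
        self.n = 0        # visit count n(s)
        self.n_a = {}     # n(s, a): per-action visit counts
        self.q = {}       # Q(s, a): running average of sampled returns
        self.child = {}   # action -> child Node

def rollout(s, actions, simulate, depth):
    """Random rollout policy from s; returns the sampled return."""
    total = 0.0
    for _ in range(depth):
        s, r, done = simulate(s, random.choice(actions(s)))
        total += r
        if done:
            break
    return total

def uct_iteration(node, s, actions, simulate, depth, K=1.0):
    """One UCT iteration: descend with the tree policy, expand one new
    action, finish with the rollout policy, and back the return up."""
    if depth == 0:
        return 0.0
    acts = actions(s)
    untried = [a for a in acts if a not in node.n_a]
    if untried:
        a = random.choice(untried)                 # expand a new action
        node.child[a] = Node()
        s2, r, done = simulate(s, a)
        ret = r if done else r + rollout(s2, actions, simulate, depth - 1)
    else:
        # Tree policy: arg max_a Q(s,a) + K * sqrt(ln n(s) / n(s,a))
        a = max(acts, key=lambda b: node.q[b] +
                K * math.sqrt(math.log(node.n) / node.n_a[b]))
        s2, r, done = simulate(s, a)
        ret = r if done else r + uct_iteration(node.child[a], s2, actions,
                                               simulate, depth - 1, K)
    node.n += 1                                    # back up the result
    node.n_a[a] = node.n_a.get(a, 0) + 1
    node.q[a] = node.q.get(a, 0.0) + (ret - node.q.get(a, 0.0)) / node.n_a[a]
    return ret

def uct_best_action(s0, actions, simulate, iterations=1000, depth=20):
    root = Node()
    for _ in range(iterations):
        uct_iteration(root, s0, actions, simulate, depth)
    return max(root.q, key=root.q.get)             # best root action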
UCT Recap
UCT repeatedly simulates trajectories from the current state: the tree policy (a UCB rule over actions) guides the descent, the rollout policy completes each trajectory, and the observed returns are backed up to refine the Q-value estimates. It is an anytime algorithm: more iterations yield better estimates, with effort focused on the most promising parts of the tree.
Some Successes
Computer Go
Klondike Solitaire (wins 40% of games)
General Game Playing Competition
Real-Time Strategy Games
Combinatorial Optimization
Crowdsourcing
The list is growing.
Practical Issues
UCT can have trouble building deep trees when actions can result in a large number of possible next states.
Each time we try action a we get a new state (a new leaf) and very rarely resample an existing leaf.
The search then degenerates into a depth-1 tree search.
Solution: introduce a sparseness parameter that caps how many new children of an action will be generated; once the cap is reached, existing children are resampled, letting the tree grow deep.
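A sketch of one way such a cap can be implemented. The max_children parameter, the children dictionary, and the simulate interface are illustrative stand-ins, not from the slides.

import random

def sample_child(children, s, a, simulate, max_children=5):
    """Capped successor sampling for action a at state s.
    children maps each action to its list of previously sampled
    (next_state, reward, done) outcomes; max_children is the
    sparseness parameter."""
    outcomes = children.setdefault(a, [])
    if len(outcomes) < max_children:
        outcome = simulate(s, a)          # draw a fresh successor (new leaf)
        outcomes.append(outcome)
        return outcome
    return random.choice(outcomes)        # reuse an existing sampled leaf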