3 AI - Search III
[Figure: a state-space landscape, showing a shoulder, a local maximum, a "flat" local maximum, and the current state]
x ← x + α∇f(x)
∇f = (∂f/∂x1, ∂f/∂y1, ∂f/∂x2, ∂f/∂y2, ∂f/∂x3, ∂f/∂y3)
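To make the update rule concrete, here is a minimal Python sketch of gradient ascent with a numeric gradient; the toy objective f, the step size alpha, and the finite-difference eps are illustrative assumptions, not from the slides.

# Minimal gradient-ascent sketch (illustrative: f, alpha, eps are assumptions)
def numeric_grad(f, x, eps=1e-6):
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))  # central difference
    return g

def ascent_step(f, x, alpha=0.1):
    # x <- x + alpha * grad f(x)
    return [xi + alpha * gi for xi, gi in zip(x, numeric_grad(f, x))]

# Toy objective with its maximum at (1, -2)
f = lambda v: -(v[0] - 1) ** 2 - (v[1] + 2) ** 2
x = [0.0, 0.0]
for _ in range(100):
    x = ascent_step(f, x)
print(x)  # approaches [1.0, -2.0]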
[Table: classification of games as deterministic vs. chance]
Computer games
• Single game playing: a program that plays one game
• General Game Playing (GGP): a program that plays more than one game
[Figure: a (partial) game tree for tic-tac-toe; MAX (X) and MIN (O) alternate moves from the empty board down to TERMINAL states with Utility −1, 0, +1 (to MAX)]
[Figure: a two-ply game tree; the MIN nodes B, C, D (reached by moves b1–b3, c1–c3, d1–d3) evaluate to 3, 2, 2 over the leaves 3 12 8 / 2 4 6 / 14 5 2, so the minimax value of the MAX root is 3]
def Minimax-Search(game, state)
  player ← game.To-Move(state) // the player whose turn it is to move
  value, move ← Max-Value(game, state)
  return move // an action

def Max-Value(game, state)
  if game.Is-Terminal(state) then return game.Utility(state, player), null
  // the utility function gives the final numeric value of the state to player
  v ← −∞
  for each a in game.Actions(state) do
    v2, a2 ← Min-Value(game, game.Result(state, a))
    if v2 > v then
      v, move ← v2, a
  return v, move // a (utility, move) pair

def Min-Value(game, state)
  if game.Is-Terminal(state) then return game.Utility(state, player), null
  v ← +∞
  for each a in game.Actions(state) do
    v2, a2 ← Max-Value(game, game.Result(state, a))
    if v2 < v then
      v, move ← v2, a
  return v, move
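The same algorithm as runnable Python; the game interface (to_move, is_terminal, utility, actions, result) is an assumed API mirroring the pseudocode above, a sketch rather than a fixed library.

# Minimax in Python; the `game` interface is an assumed API
import math

def minimax_search(game, state):
    player = game.to_move(state)

    def max_value(state):
        if game.is_terminal(state):
            return game.utility(state, player), None
        v, move = -math.inf, None
        for a in game.actions(state):
            v2, _ = min_value(game.result(state, a))
            if v2 > v:
                v, move = v2, a
        return v, move

    def min_value(state):
        if game.is_terminal(state):
            return game.utility(state, player), None
        v, move = math.inf, None
        for a in game.actions(state):
            v2, _ = max_value(game.result(state, a))
            if v2 < v:
                v, move = v2, a
        return v, move

    return max_value(state)[1]  # the best move for the player to move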
[Figure: a path in the game tree from the MAX root down (through elided subtrees) to a MIN node with value V]
α is the best value (to MAX) found so far off the current path
If V is worse than α, MAX will avoid it ⇒ prune that branch
β is defined similarly for MIN
[Figure: alpha–beta trace over the two-ply tree with leaves 3 12 8 / 2 4 6 / 14 5 2; branches marked X are pruned]
The first leaf below the first MIN node B has value 3, so B is worth at most 3.
The second leaf has value 12; MIN would avoid it, so B is still worth at most 3.
The third leaf has value 8, so once all of B's successors are seen, B is worth exactly 3.
The first leaf below the second MIN node C has value 2, so C is worth at most 2; since B is already worth 3, MAX would never choose C, and C's remaining successors need not be examined: pruning.
The leaves below the third MIN node D are 14, 5, and 2, so D is worth 2; the MAX root therefore has minimax value 3.
def Min-Value(game, state, α, β)
  if game.Is-Terminal(state) then return game.Utility(state, player), null
  v ← +∞
  for each a in game.Actions(state) do
    v2, a2 ← Max-Value(game, game.Result(state, a), α, β)
    if v2 < v then
      v, move ← v2, a
      β ← Min(β, v)
    if v ≤ α then return v, move // prune: MAX already has a better alternative
  return v, move
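For comparison, a runnable Python sketch of the full alpha–beta search (both value functions), under the same assumed game interface as the minimax sketch:

# Alpha-beta search in Python; `game` is the same assumed API as above
import math

def alphabeta_search(game, state):
    player = game.to_move(state)

    def max_value(state, alpha, beta):
        if game.is_terminal(state):
            return game.utility(state, player), None
        v, move = -math.inf, None
        for a in game.actions(state):
            v2, _ = min_value(game.result(state, a), alpha, beta)
            if v2 > v:
                v, move = v2, a
                alpha = max(alpha, v)
            if v >= beta:   # prune: MIN already has a better alternative
                return v, move
        return v, move

    def min_value(state, alpha, beta):
        if game.is_terminal(state):
            return game.utility(state, player), None
        v, move = math.inf, None
        for a in game.actions(state):
            v2, _ = max_value(game.result(state, a), alpha, beta)
            if v2 < v:
                v, move = v2, a
                beta = min(beta, v)
            if v <= alpha:  # prune: MAX already has a better alternative
                return v, move
        return v, move

    return max_value(state, -math.inf, math.inf)[1]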
[Figure: a backgammon board, with points numbered 0–25]
[Figure: a game tree with MAX, CHANCE, and MIN levels over the leaves 2 4 7 4 6 0 5 −2; each CHANCE node takes the expected value of its children, here avg(2, 4) = 3 and avg(0, −2) = −1]
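A small sketch of the expectiminimax rule the figure illustrates: MAX and MIN nodes take the max/min of their children, chance nodes the probability-weighted average. The tagged-tuple tree encoding is an illustrative assumption, not from the slides.

# Expectiminimax over a toy tagged-tuple tree (illustrative encoding):
# ('max'|'min', children), ('chance', [(prob, child), ...]), or a number leaf
def expectiminimax(node):
    if isinstance(node, (int, float)):
        return node
    tag, children = node
    if tag == 'max':
        return max(expectiminimax(c) for c in children)
    if tag == 'min':
        return min(expectiminimax(c) for c in children)
    if tag == 'chance':
        return sum(p * expectiminimax(c) for p, c in children)
    raise ValueError(tag)

# The left chance node from the figure, with equal odds:
left = ('chance', [(0.5, ('min', [2, 4])), (0.5, ('min', [7, 4]))])
print(expectiminimax(left))  # 3.0, matching the figure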
After 100 iterations: (a) selection descends to the node with 27 wins for black out of 35 playouts; (b) the selected node is expanded and a playout is run, ending in a win for black; (c) the playout's result is back-propagated up the tree (each count incremented by 1)
MCTS
a. Selection: starting at the root, a child is recursively selected to
descend through the tree until the most urgent expandable node is reached
b1. Expansion: one (or more) child nodes are added to expand the
tree, according to the available actions
b2. Simulation: a simulation is run from the new node(s) according
to the default policy to produce an outcome (random playout)
c. Backpropagation: the simulation result is “backed up” through
the selected nodes to update their statistics
• Tree Policy: selects a leaf node from the nodes already contained
within the search tree (selection and expansion)
– focuses on the important parts of the tree, attempting to
balance
– – exploration: looking in states that have not been well
sampled yet (that have had few playouts)
– – exploitation: looking in states which appear to be promising
(that have done well in past playouts)
• Default (Value) Policy: play out the domain from a given non-
terminal state to produce a value estimate (simulation and eval-
uation) — actions chosen after the tree policy steps have been
completed
– in the simplest case, to make uniform random moves
– values of intermediate states do not have to be evaluated, unlike
in depth-limited minimax
Exploitation-exploration
Exploitation-exploration dilemma
– one needs to balance the exploitation of the action currently
believed to be optimal with the exploration of other actions that
currently appear suboptimal but may turn out to be superior in the
long run
There are a number of variations of the tree policy, e.g. UCT (see the
next page)
Theorem: UCT allows MCTS to converge to the minimax tree and is
thus optimal, given enough time (and memory)
UCB1(n) = U(n)/N(n) + C × √( log N(Parent(n)) / N(n) )
– the first term, U(n)/N(n), is the exploitation term: the average utility of n (favoring the child with the highest average utility)
– C (usually √2) is a constant that balances exploitation and exploration
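The formula transcribes directly into Python; the node fields utility, playouts, and parent are assumed names matching U(n), N(n), and Parent(n):

import math

def ucb1(n, C=math.sqrt(2)):
    # n.utility = U(n), n.playouts = N(n), n.parent.playouts = N(Parent(n))
    if n.playouts == 0:
        return math.inf  # always prefer unvisited children
    return (n.utility / n.playouts
            + C * math.sqrt(math.log(n.parent.playouts) / n.playouts))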
def MCTS(state)
  tree ← Node(state)
  while Is-Time-Remaining() do // within the computational budget
    leaf ← Select(tree) // e.g. by UCB1
    child ← Expand(leaf)
    result ← Simulate(child) // self-play
    Back-Propagate(result, child)
  return the move in Actions(state) whose node has the highest number of playouts
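Putting the four phases together, a compact runnable UCT sketch in Python; it reuses the ucb1 function above and the same assumed game interface as the earlier sketches. For simplicity, playout results are scored from the root player's perspective throughout (a full two-player version would flip the sign at opponent nodes).

import random

class Node:
    def __init__(self, state, parent=None, move=None, actions=()):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = list(actions)  # actions not yet expanded
        self.utility = 0.0            # U(n): total utility of playouts through n
        self.playouts = 0             # N(n): number of playouts through n

def mcts(game, state, iterations=1000):
    player = game.to_move(state)
    root = Node(state, actions=game.actions(state))
    for _ in range(iterations):
        # Selection: descend by UCB1 while the node is fully expanded
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=ucb1)
        # Expansion: add one child for an untried action
        if node.untried:
            a = node.untried.pop()
            s = game.result(node.state, a)
            node.children.append(Node(s, parent=node, move=a,
                                      actions=game.actions(s)))
            node = node.children[-1]
        # Simulation: uniform random playout (the default policy)
        s = node.state
        while not game.is_terminal(s):
            s = game.result(s, random.choice(game.actions(s)))
        result = game.utility(s, player)
        # Backpropagation: update statistics along the selected path
        while node is not None:
            node.playouts += 1
            node.utility += result
            node = node.parent
    # the move whose node has the highest number of playouts
    return max(root.children, key=lambda c: c.playouts).move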
a. Selection: choose the action a with maximum value
Q(s, a) = (1/N(s, a)) × Σ_{s′ | s,a→s′} V(s′)
plus an upper confidence bound U(s, a) ∝ P(s, a)/(1 + N(s, a)) (stored prior probability P and visit count N)
(each simulation traverses the tree; no rollout is needed)
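As a sketch, that per-edge score can be written as follows; the names total_value, visits, prior are illustrative, and the √(parent visits) factor follows the PUCT form used in the AlphaGo papers:

import math

def select_score(total_value, visits, prior, parent_visits, c_puct=1.0):
    q = total_value / visits if visits else 0.0  # Q(s,a): mean value of successors
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)  # U ∝ P/(1+N)
    return q + u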
def Alpha0(state)
  inputs: rules, the game rules
          scores, the game scores
          board, the board representation // historic data and colors for the players
  create root node with state s0, initially random play
  while within computational budget do
    αθ ← Mcts(s(fθ)) // search guided by the network fθ
    a ← Move(s, αθ)
  return a // BestMove(αθ)
[Figure: a simple maze; the agent starts at S (1,1) and must reach G (3,3)]
The competitive ratio – the total cost of the path that the agent
actually travels (online cost) divided by the cost of the path it would
follow if it knew the search space in advance (offline cost) ⇐ as small as possible

Online search expands nodes in a local order; e.g., DepthFirst and
HillClimbing have exactly this property
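A tiny worked illustration of the ratio (the numbers are invented for the example):

# Competitive ratio (illustrative numbers)
online_cost = 9    # steps the agent actually traveled while exploring
offline_cost = 5   # shortest path cost with full knowledge of the space
print(online_cost / offline_cost)  # 1.8; smaller is better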