
Monte Carlo search in graph structures applied to MDPs

Luisa A. de Almeida and Carlos H. C. Ribeiro


Computer Science Division, Technological Institute of Aeronautics (ITA)
Pça. Mal. Eduardo Gomes 50, Vila das Acácias 12228-900
São José dos Campos – SP – Brasil
Emails: luamaralalmeida@gmail.com, carlos@ita.br

Abstract: We present a new adaptation of the Upper Confidence bounds applied to Trees (UCT) algorithm that uses transpositions of nodes in the tree to build a graph. We show how this adaptation performs in some domains of the International Probabilistic Planning Competition (IPPC), with very good results in sparse domains. We also discuss the factors that probably lead to its poor performance in dense domains.

Keywords: Markov Decision Processes; Probabilistic Planner; Upper Confidence bounds applied to Trees (UCT); Transpositions.

I. INTRODUCTION

Markov decision processes (MDPs) are stochastic processes for modeling decision making in which the outcome at each stage depends on the action taken and on some uncertainty. MDPs have been widely used in many areas since their formulation in the study of stochastic optimal control by Bellman [1].

Most research on MDPs is concerned with problems too complex to be solved exactly, which makes it necessary to settle for approximate policies [2]. The best known algorithms for this kind of problem are Value Iteration [3][1], Policy Iteration [4], Real Time Dynamic Programming (RTDP) [5], and Upper Confidence bounds applied to Trees (UCT) [6]. The last one has become particularly well known and successful in optimal planning because it does not use the transition probabilities in its computation. That is important when each state-action pair may lead to many different states, that is, when the transition function is very dense and it is hard to compute all the probabilities needed to find a good policy.

A probabilistic planner based on the UCT algorithm won the last two editions of the International Probabilistic Planning Competition (IPPC). The planner, called Prost [7], introduced some modifications to UCT to improve its performance in several domains of the competition, such as pruning superfluous actions, adding an initialization procedure and detecting reward locks. The evaluation of Prost shows that basic UCT did very well in dense domains, unlike in sparse domains; only by combining UCT with the modifications proposed in Prost did the planner perform well in sparse domains.

In this paper, we propose a new adaptation of the UCT algorithm that performs very well in sparse domains, called Upper Confidence bounds applied to Graphs with Maximum update (UCG-max). UCG-max treats the structure of an MDP as a graph, exploiting transpositions. We also discuss why this new version of UCT does not perform well in dense domains.

II. FINITE HORIZON MDPs AS DECISION TREES

A finite horizon MDP can be defined as a tuple <S, A, P, R>, where S is a finite set of states; A is a finite set of possible actions; P: S × A × S → [0,1] is the transition function, which gives the probability P(s'|s, a) of moving to state s' when applying action a in state s; and R: S × A × S → ℝ is the reward function, which gives the reward associated with the transition from state s to s' under action a, denoted R(s, a, s') [8].

An MDP with horizon H can be represented as a decision tree of height 2H, as shown in Figure 1. States are represented by decision nodes (rectangles) and the uncertainties associated with each state-action pair are represented by random nodes (circles). Every decision node has exactly one successor for each possible action.

Figure 1: Example of representation of an MDP with horizon 1 as a decision tree.

In a decision tree, the expected value of a decision node is the maximum of the expected values of its successors, because the best policy is to choose a path that maximizes the reward. The expected value of a random node is the weighted mean of the expected values of its successors, where the weights are the transition probabilities. In a decision tree used to represent an MDP, the expected value of the leaves is zero.
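
As an illustration of these two backup rules, the sketch below evaluates a tiny hand-built tree for a hypothetical horizon-1 MDP with two actions and two outcome states per action. The states, probabilities and rewards are invented for the example (they are not the ones of Figure 1), and rewards are accumulated along transitions in the same way the rollout of Section III does (reward plus value of the successor).

Listing: value of a small decision tree (illustrative sketch).__________
// Bottom-up evaluation of a decision tree for a hypothetical horizon-1 MDP.
// Decision node: maximum over successors; random node: probability-weighted mean.
import java.util.List;

public class DecisionTreeValue {

    // A random node: outcome probabilities, immediate rewards and successor values.
    record RandomNode(double[] probs, double[] rewards, double[] successorValues) {
        double value() {
            // Weighted mean of (reward + successor value); weights are the probabilities.
            double v = 0.0;
            for (int i = 0; i < probs.length; i++) {
                v += probs[i] * (rewards[i] + successorValues[i]);
            }
            return v;
        }
    }

    // A decision node: its value is the maximum over its random-node successors.
    static double decisionValue(List<RandomNode> actions) {
        return actions.stream().mapToDouble(RandomNode::value).max().orElse(0.0);
    }

    public static void main(String[] args) {
        // Horizon 1: the leaves have value zero, so all successor values are 0.
        RandomNode a1 = new RandomNode(new double[]{0.8, 0.2},
                                       new double[]{10.0, 0.0},
                                       new double[]{0.0, 0.0});  // expected value 8.0
        RandomNode a2 = new RandomNode(new double[]{0.5, 0.5},
                                       new double[]{6.0, 6.0},
                                       new double[]{0.0, 0.0});  // expected value 6.0
        System.out.println(decisionValue(List.of(a1, a2)));      // prints 8.0 (the maximum)
    }
}
________________________________________________________________________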

III. UCT

Upper Confidence bounds applied to Trees (UCT) [6] is a Monte Carlo planning algorithm that balances the exploration-exploitation dilemma. Given a state of an MDP, the algorithm performs rollouts up to a certain depth, building a partial decision tree, in order to find a good policy and return the action that should be taken.
A UCT node n keeps the following information [7]:
- s: the state of the MDP;
- d: the depth of the node in the tree;
- N^k: the number of rollouts in which n was chosen, after k rollouts;
- R^k: the expected reward estimated for node n after k rollouts;
- {n_1, n_2, ..., n_m}: the successors of node n.

A UCT random node also keeps the information about which action it refers to.

Each rollout is a path from the root to one leaf of the tree, built recursively.

When a random node is visited in the path, its successor is chosen by sampling the transition function, as Monte Carlo algorithms usually do.

When a decision node is visited in the path, it is analyzed to decide which action should be taken. First, all possible actions should be chosen once, so UCT randomly takes one of the actions that have not been chosen yet. After that, UCT chooses the action, i.e., the successor n_i, that maximizes the formula:

    B * sqrt( log(n.N^k) / n_i.N^k ) + n_i.R^k        (1)

The first term of the formula favors actions that have been chosen few times and the second term favors actions that have higher expected rewards. This is the fundamental idea of UCT.

The parameter B should grow linearly with the expected reward of the optimal policy [7]. Since the optimal policy is not known, it has to be estimated; so, for each node n, the parameter B is the expected reward of the node after the previous rollouts.
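
The sketch below shows how formula (1) can be implemented for action selection. The Node fields and the estimate of B follow the description above, but the class and method names are ours, for illustration only.

Listing: action selection with formula (1) (illustrative sketch).________
// UCT action selection: pick the successor maximizing
//   B * sqrt(log(n.N^k) / n_i.N^k) + n_i.R^k,
// with B estimated as the current expected reward of the node itself.
import java.util.ArrayList;
import java.util.List;

public class UctSelection {

    static class Node {
        int numVisits;          // N^k: number of rollouts in which this node was chosen
        double expectedReward;  // R^k: expected reward estimated after k rollouts
        List<Node> successors = new ArrayList<>();
    }

    static Node chooseUCTAction(Node n) {
        double b = n.expectedReward;    // estimate of B, as described in the text
        Node best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Node ni : n.successors) {
            // Every action has already been tried once, so ni.numVisits > 0 here.
            double score = b * Math.sqrt(Math.log(n.numVisits) / ni.numVisits)
                         + ni.expectedReward;
            if (score > bestScore) {
                bestScore = score;
                best = ni;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Node root = new Node();
        root.numVisits = 10;
        root.expectedReward = 5.0;
        for (double r : new double[]{4.0, 6.0}) {
            Node child = new Node();
            child.numVisits = 5;
            child.expectedReward = r;
            root.successors.add(child);
        }
        // Both children are equally explored, so the one with higher R^k wins.
        System.out.println(chooseUCTAction(root).expectedReward);  // 6.0
    }
}
________________________________________________________________________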
After the successor of a node is chosen, the path continues until the leaf and then comes back in the opposite direction. In this backpropagation, the expected rewards of the visited nodes are updated. If the node is a decision node, its expected reward becomes the maximum over its successors; if it is a random node, its expected reward becomes the average over the previous rollouts.

Algorithm 1 describes UCT in more detail.

Algorithm 1: Pseudo-code of UCT.________________________________________
Action findAction(State s0):
    Node n0 = buildDecisionNode(s0);
    while (!timeout) do
        rollout(n0, maxDepth);
    return bestAction(n0);

double rollout(Node n, int depth):
    value = 0;
    incNumVisits(n);
    if (isDecision(n) && depth != 0) then
        if (allActionsVisited(n)) then
            Action a = chooseUCTAction(n);              // formula (1)
        else
            Action a = randomChooseDifferentAction(n);  // try every action once first
        Node nextNode = buildRandomNode(n.s, a);
        value = rollout(nextNode, depth - 1);
    else if (isDecision(n) && depth == 0) then          // leaf: expected value is zero
        value = 0;
    else                                                // random node
        State nextState = sample(n.s, n.a);             // sample the transition function
        reward = getReward(n.s, n.a, nextState);
        Node nextNode = buildDecisionNode(nextState);
        value = reward + rollout(nextNode, depth - 1);
    updateReward(n, value);
    return value;
________________________________________________________________________

The function updateReward updates the reward of the node, adding the information of the current rollout. Although the reward of a decision node is updated with the maximum over the previous rollouts, a decision node propagates to its parent the reward of the current rollout.

IV. UCT ALGORITHM FOR GRAPHS

Figure 1 shows how a simple MDP can be represented as a tree. In that figure, the node corresponding to state S2 is represented twice at level 2. A natural idea, then, is to represent that node only once, treating it as a transposition, as in Figure 2.

Figure 2: Representation of the MDP of Figure 1 as a graph.

The idea of dealing with transpositions has been explored in other works, such as [9] and [10]. Transpositions can be used to make UCT converge with fewer rollouts, since the update of the expected reward of a node can be used to update the expected reward of all the nodes that precede it in some rollout.

Both [9] and [10] show possible adaptations of the UCT algorithm to deal with transpositions. When a transposition is used to update all the nodes that precede some node, it is necessary to decide which update rule to use. The rules proposed in [9] and [10] update the expected reward with a weighted mean over all successors. Both works show good results for these adaptations of UCT in games like Go and indicate that the approach is worthwhile when the simulation is significantly more time consuming than the update.

In this paper, we propose a new adaptation of UCT, called Upper Confidence bounds applied to Graphs with Maximum update (UCG-max), which updates the expected reward of decision nodes with the maximum of the expected rewards of their successors, instead of a weighted mean. Besides that modification, other changes are needed to adapt the algorithm, and they are detailed below.
UCG-max performs rollouts in a decision graph that is built progressively. In the graph, identical states at the same level are represented by a single node. It is important to associate each node with its level to guarantee that the graph is acyclic. Each UCG-max node keeps the same information as a UCT node, plus information about its parents: a random node has exactly one parent, but a decision node has a list of parents. In addition, random nodes keep track of how many times each successor was selected and of the reward associated with each successor.
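
A minimal sketch of this bookkeeping is given below. The map keyed by (state, level) that merges transpositions and all field names are our illustration of the description above, not the authors' data structures.

Listing: UCG-max node bookkeeping (illustrative sketch).________________
// Decision nodes are shared per (state, level); random nodes record per-successor
// visit counts and rewards. Names are illustrative, not taken from the paper's code.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UcgNodes {

    static class DecisionNode {
        final String state;                                     // the MDP state
        final int level;                                        // level keeps the graph acyclic
        int numVisits;
        double expectedReward;
        final List<RandomNode> parents = new ArrayList<>();     // possibly many parents
        final List<RandomNode> successors = new ArrayList<>();  // one per action
        DecisionNode(String state, int level) { this.state = state; this.level = level; }
    }

    static class RandomNode {
        final String action;
        DecisionNode parent;                                    // exactly one parent
        int numVisits;
        double expectedReward;
        // Per-successor statistics used later by the weighted-mean update.
        final Map<DecisionNode, Integer> successorCounts = new HashMap<>();
        final Map<DecisionNode, Double> successorRewards = new HashMap<>();
        RandomNode(String action) { this.action = action; }
    }

    // Transposition table: the same state at the same level maps to a unique node.
    static final Map<String, DecisionNode> table = new HashMap<>();

    static DecisionNode getOrCreate(String state, int level) {
        return table.computeIfAbsent(state + "@" + level, k -> new DecisionNode(state, level));
    }

    public static void main(String[] args) {
        // Reaching state "S2" at level 2 through two different paths yields the same node.
        System.out.println(getOrCreate("S2", 2) == getOrCreate("S2", 2));  // true
    }
}
________________________________________________________________________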
In UCG-max, the path from the root to a leaf is built as in UCT, using the same formula to choose which action should be taken at a decision node. The difference is in the backpropagation, which is performed from the leaf to the root as a breadth-first search, passing through all the predecessors of each node. In this backpropagation, the expected reward of a decision node is updated with the maximum of the expected rewards of its successors, and the expected reward of a random node is updated with the weighted mean of its successors, using as weights the number of times each successor was selected. Note that the value a decision node propagates to its parents is this maximum value, unlike in the original UCT formulation.

In the backpropagation, the variable n.N^k (the number of rollouts in which n was chosen after k rollouts) is not updated. Therefore, when formula (1) is applied at a decision node, the value n_i.N^k is really the number of rollouts that went from this decision node to the corresponding successor n_i, no matter how many times that successor was reached during backpropagation. This is fundamental to ensure that exploration influences the choice of actions in a proper way.

The pseudo-code of UCG-max is described below:

Algorithm 2: Pseudo-code of UCG-max.____________________________________
Action findAction(State s0):
    Node n0 = buildDecisionNode(s0);
    while (!timeout)
        rollout(n0, maxDepth);
    return bestAction(n0);

void rollout(Node n, int depth):
    incNumVisits(n);
    if (isDecision(n) && depth != 0) then
        if (allActionsVisited(n)) then
            Action a = chooseUCTAction(n);
        else
            Action a = randomChooseDifferentAction(n);
        Node nextNode = buildRandomNode(n.s, a);
        rollout(nextNode, depth - 1);
    else if (isDecision(n) && depth == 0) then      // leaf reached: backpropagate
        BFSinverse(n);
    else                                            // random node
        State nextState = sample(n.s, n.a);
        reward = getReward(n.s, n.a, nextState);
        Node nextNode = buildDecisionNode(nextState);
        incNumVisitsOfSuccessor(n, nextNode);
        setRewardOfSuccessor(n, nextNode, reward);
        rollout(nextNode, depth - 1);

void BFSinverse(Node n):
    Queue nodesToVisit = new Queue;
    nodesToVisit.enqueue(n);
    while (!nodesToVisit.isEmpty())
        n = nodesToVisit.dequeue();
        if (isDecision(n)) then
            updateMaxReward(n);                     // maximum over successors
            for (Node parent : n.getParents())
                if (!nodesToVisit.contains(parent)) then
                    nodesToVisit.enqueue(parent);
        else
            updateWeightedMeanReward(n);            // weighted mean over successors
            if (!nodesToVisit.contains(n.getParent())) then
                nodesToVisit.enqueue(n.getParent());
________________________________________________________________________
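
Algorithm 2 leaves updateMaxReward and updateWeightedMeanReward abstract. The sketch below is one possible reading of those two updates, based on the per-successor counts and rewards described above; the function and variable names are ours, and the inclusion of the immediate reward in the weighted mean follows how rewards are stored by setRewardOfSuccessor.

Listing: the two backpropagation updates (illustrative sketch).__________
// Decision node: maximum over successor estimates.
// Random node: mean of (immediate reward + successor estimate), weighted by how many
// times each successor was sampled. This is an interpretation, not the paper's code.
import java.util.List;
import java.util.Map;

public class UcgUpdates {

    static double updateMaxReward(List<Double> successorRewards) {
        return successorRewards.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    static double updateWeightedMeanReward(Map<String, Integer> counts,
                                           Map<String, Double> rewards,
                                           Map<String, Double> successorValues) {
        double total = 0.0;
        int visits = 0;
        for (String succ : counts.keySet()) {
            int c = counts.get(succ);
            total += c * (rewards.get(succ) + successorValues.get(succ));
            visits += c;
        }
        return visits == 0 ? 0.0 : total / visits;
    }

    public static void main(String[] args) {
        System.out.println(updateMaxReward(List.of(2.0, 5.0, 3.0)));  // 5.0
        System.out.println(updateWeightedMeanReward(
                Map.of("s1", 3, "s2", 1),        // s1 sampled 3 times, s2 once
                Map.of("s1", 1.0, "s2", 0.0),    // immediate rewards observed
                Map.of("s1", 4.0, "s2", 8.0)));  // (3*(1+4) + 1*(0+8)) / 4 = 5.75
    }
}
________________________________________________________________________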
else To run both UCG-max and UCT, we considered two
State nextState=sample(n.s, n.a); parameters: the number of rollout and the maximum
reward=getReward(n.s ,n.a,nextState); depth. The maximum depth is the depth of the graph that
Node nextNode=buildDecisionNode(nextState); will be built, which correspond to double of the horizon
considered to the MDP. For UCG-max, we consider 2000 Table 3: Average of normalized score in domain “Game
rollouts and maximum depth equal to 10 (horizon equal to of Life”.
5). Despite the horizon of the competition be 40, we can
consider a smaller one, since decisions far in the future Instance UCG-max UCT UCT
(2000,10) (2000,10) (10000,30)
usually influence the expected reward of a policy less than
immediate actions [7]. For UCT, we consider two 1 0,121 0,945 0,724
configurations: 2000 rollouts with maximum depth 10; 2 0,177 0,624 0,776
and 10000 rollouts with maximum depth 30. 3 0,018 1,000 1,000
We exhibit the average of normalized score and time 4 0,087 1,000 0,940
spent in each instance, based on 30 times running. The 5 0,307 1,000 0,793
normalized score is obtained with the normalized 6 0,305 0,741 0,979
constants (worst and best rewards obtained by the 7 0,133 1,000 0,960
algorithms of the competition), giving a score between 8 0,000 0,987 1,000
zero and one. Sometimes, the algorithms played here got a 9 0,089 0,943 0,837
worse or better reward than the normalized constants of 10 0,134 1,000 0,905
the competition, but in those cases we consider the score
zero and one, respectively.
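
This clamped normalization can be computed as in the short sketch below; the constant names are ours, and the worst/best values come from the IPPC results, which are not reproduced here.

Listing: normalized score (illustrative sketch).________________________
// Linear rescaling between the competition's worst and best rewards, clamped to [0, 1].
public class NormalizedScore {

    static double normalizedScore(double reward, double worstReward, double bestReward) {
        double score = (reward - worstReward) / (bestReward - worstReward);
        // Rewards outside the competition's range are clamped to 0 or 1.
        return Math.max(0.0, Math.min(1.0, score));
    }

    public static void main(String[] args) {
        // Hypothetical numbers, only to show the clamping behaviour.
        System.out.println(normalizedScore(50.0, 0.0, 100.0));   // 0.5
        System.out.println(normalizedScore(120.0, 0.0, 100.0));  // 1.0 (above the best constant)
        System.out.println(normalizedScore(-10.0, 0.0, 100.0));  // 0.0 (below the worst constant)
    }
}
________________________________________________________________________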

Table 1: Average normalized score in the domain "SysAdmin".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          0.998               0.963           0.930
2          0.864               0.724           0.875
3          0.843               0.839           1.000
4          0.558               0.779           0.790
5          0.108               0.715           1.000
6          0.000               0.378           0.692
7          0.183               0.468           0.689
8          0.012               0.601           1.000
9          0.109               0.612           0.883
10         0.211               0.836           1.000

Table 2: Average time spent (sec) in the domain "SysAdmin".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          194                 2               43
2          149                 1               53
3          12                  2               93
4          10                  2               92
5          7                   8               108
6          7                   8               111
7          14                  10              105
8          22                  10              110
9          16                  12              171
10         8                   5               122

Table 3: Average normalized score in the domain "Game of Life".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          0.121               0.945           0.724
2          0.177               0.624           0.776
3          0.018               1.000           1.000
4          0.087               1.000           0.940
5          0.307               1.000           0.793
6          0.305               0.741           0.979
7          0.133               1.000           0.960
8          0.000               0.987           1.000
9          0.089               0.943           0.837
10         0.134               1.000           0.905

Table 4: Average time spent (sec) in the domain "Game of Life".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          24                  1               18
2          24                  1               9
3          61                  1               10
4          106                 1               15
5          57                  3               47
6          171                 3               58
7          259                 4               84
8          274                 5               37
9          258                 5               24
10         271                 5               47

Table 5: Average normalized score in the domain "Elevators".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          1.000               0.000           0.000
2          0.946               0.046           0.000
3          0.598               0.000           0.000
4          1.000               0.160           0.042
5          0.747               0.046           0.066
6          0.908               0.000           0.156
7          1.000               0.220           0.083
8          0.889               0.000           0.022
9          0.885               0.115           0.000
10         1.000               0.314           0.255

Table 6: Average time spent (sec) in the domain "Elevators".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          58                  1               8
2          171                 3               16
3          140                 3               17
4          13                  2               10
5          120                 4               19
6          146                 3               20
7          31                  2               13
8          151                 4               21
9          113                 3               22
10         10                  2               13

Table 7: Average normalized score in the domain "Navigation".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          1.000               0.000           0.570
2          0.910               0.260           0.237
3          1.009               0.095           0.138
4          0.000               0.000           0.000
5          0.202               0.000           0.202
6          0.000               0.000           0.000
7          0.000               0.000           0.000
8          1.000               0.000           0.000
9          0.000               0.000           0.000
10         0.000               0.000           0.000

Table 8: Average time spent (sec) in the domain "Navigation".

Instance   UCG-max (2000,10)   UCT (2000,10)   UCT (10000,30)
1          33                  1               8
2          64                  1               9
3          58                  1               8
4          10                  1               9
5          77                  1               8
6          14                  1               8
7          10                  1               9
8          90                  1               6
9          14                  1               2
10         12                  1               2

Looking at the results in the dense domains, we can observe that UCG-max did not get good scores and spent more time. In fact, UCG-max performs even worse than UCT with the same parameters in most cases. We can infer that this happens because in UCG-max the decision nodes propagate the maximum reward of their successors instead of the weighted mean; a probable cause, discussed further in Section VI, is that after a limited number of rollouts in dense domains the reward estimates are not reliable enough for the maximum update.

Analyzing the results in the sparse domains, we can see that UCG-max has much higher scores than UCT, despite the longer time spent. In the domain "Elevators", even increasing the number of rollouts and the depth, UCT performs very badly.

There is no relation between the time spent by UCG-max in the sparse domains and the instance number. The algorithm got the top score of "Elevators" instance 10 in 10 sec, but spent 113 sec in "Elevators" instance 9 to get an ordinary score. The time spent is related to the structure of the graph, which depends on many things, including the random numbers drawn in the Monte Carlo process.

Another interesting analysis is the comparison of the results with the winning planners of the competition. Table 9 shows the normalized scores by domain of UCG-max, Prost [7] and Glutton [11].

Table 9: Normalized scores of UCG-max, Prost and Glutton for each domain.

Domain         UCG-max (2000,10)   Prost*   Glutton*
SysAdmin       0.388               0.998    0.321
Game of Life   0.137               0.999    0.682
Elevators      0.897               0.987    0.968
Navigation     0.412               0.440    1.000
*The results of Prost and Glutton are taken from IPPC 2011.

Glutton, the second place of the competition, uses a Real Time Dynamic Programming approach [5], with better results in the sparse domains. Prost, based on UCT, achieved a great performance across the whole competition, because UCT performs very well in dense domains and Prost introduced techniques to improve the performance in sparse domains.

It is important to note that the maximum time allowed in the competition for each run of an instance is 36 sec, and UCG-max took more than that in many instances. The purpose here is not to evaluate the algorithm as a candidate for the competition, but rather to analyze the influence of transpositions in UCT and to provide groundwork for future work on how and when transpositions can be used to improve UCT performance.

VI. FUTURE WORKS

As future work, we intend to change the rule used to propagate the expected reward of decision nodes, designing a rule that balances the maximum and the weighted mean of the successors. We verified empirically that it is preferable to use the maximum update in sparse domains and the weighted mean update in dense domains. Probably, this is related to the confidence in the information gathered after the previous rollouts: if the information is reliable, the maximum update can be used without harm.

We also intend to apply the techniques used in Prost (pruning superfluous actions, adding an initialization procedure and detecting reward locks) to UCG-max and see how the algorithm performs.
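
As an illustration of the kind of balanced rule mentioned above, one simple possibility (our speculation, not something evaluated in this paper) is a convex combination of the two updates whose weight grows with the number of visits:

Listing: a possible balanced update rule (speculative sketch).___________
// Interpolates between the weighted-mean and the maximum update, trusting the
// maximum more as the node accumulates visits. Illustration only, not evaluated here.
public class BalancedUpdate {

    // alpha >= 0 controls how fast the estimate moves from the weighted mean to the maximum.
    static double balancedReward(double weightedMean, double maximum, int numVisits, double alpha) {
        double trust = 1.0 - 1.0 / (1.0 + alpha * numVisits);  // grows from 0 towards 1
        return (1.0 - trust) * weightedMean + trust * maximum;
    }

    public static void main(String[] args) {
        // With few visits the estimate stays near the weighted mean (2.0);
        // with many visits it approaches the maximum (5.0).
        System.out.println(balancedReward(2.0, 5.0, 1, 0.1));     // ~2.27
        System.out.println(balancedReward(2.0, 5.0, 1000, 0.1));  // ~4.97
    }
}
________________________________________________________________________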
VII. CONCLUSIONS

The proposed algorithm, UCG-max, had a good performance in the sparse domains, considering the reward obtained after applying the policy. Despite that, the time consumed by the execution of the algorithm was very large, because every time a node was updated all of its successors had to be visited.

It was also verified that UCT has the opposite characteristic to UCG-max, since UCT performs better in the dense domains. Probably this occurs because different functions are used to propagate the reward accumulated at each decision node: the maximum or the weighted mean.

The results observed here can be used to investigate new ways of adapting the UCT algorithm to graph structures. It would be interesting to look for an intermediate solution that incorporates characteristics of both UCT and UCG-max, chosen after an automatic analysis of the domain.

REFERENCES

[1] R. E. Bellman, “A Markov decision process”, Journal of Mathematics and Mechanics 6, 679–684, 1957.
[2] R. S. Sutton, “On the Significance of Markov Decision
Processes”, in Proc. of ICANN, 1997.
[3] R. Bellman, “Dynamic Programming”, Princeton University
Press, 1957.
[4] R. Howard, “Dynamic programming and Markov decision
processes”, Cambridge, MA, 1960.
[5] A. Barto, S. Bradtke and S. Singh, “Learning to act using real-
time dynamic programming”, Artificial Intelligence 72:81–138,
1995.
[6] L. Kocsis and C. Szepesvári, “Bandit Based Monte-Carlo
Planning”, in Proceedings of the 17th European Conference on
Machine Learning (ECML), 282–293, 2006.
[7] T. Keller and P. Eyerich, “Prost: Probabilistic planning based on
uct,” in International Conference on Automated Planning and
Scheduling, 2012.
[8] M. Puterman, “Markov Decision Processes: Discrete Stochastic
Dynamic Programming”, Wiley, 1994.
[9] B. E. Childs, J. H. Brodeur and L. Kocsis, “Transpositions and
move groups in monte carlo tree search,” in CIG-08, 389–395,
2008.
[10] A. Saffidine, T. Cazenave and J. Méhat, “UCD : Upper
Confidence bound for rooted Directed acyclic graphs”,
Knowledge-Based Systems 34, 26-33, 2011.
[11] A. Kolobov, P. Dai, Mausam and D. Weld, “Reverse Iterative
Deepening for Finite-Horizon MDPs with Large Branching
Factors”, in Proceedings of the 22nd International Conference on
Automated Planning and Scheduling (ICAPS), 2012.
