
Model-based Bayesian Exploration

Richard Dearden
Department of Computer Science
University of British Columbia
Vancouver, BC V6T 1Z4, Canada
dearden@cs.ubc.ca

Nir Friedman
Institute of Computer Science
Hebrew University
Jerusalem 91904, Israel
nir@cs.huji.ac.il

David Andre
Computer Science Division, 387 Soda Hall
University of California
Berkeley, CA 94720-1776, USA
dandre@cs.berkeley.edu

Abstract

Reinforcement learning systems are often concerned with balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information: the expected improvement in future decision quality arising from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states.

In this paper we investigate ways to represent and reason about this uncertainty in algorithms where the system attempts to learn a model of its environment. We explicitly represent uncertainty about the parameters of the model and build probability distributions over Q-values based on these. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation.

1 Introduction

Reinforcement learning addresses the problem of how an agent should learn to act in dynamic environments. This is an important learning paradigm for domains where the agent must consider sequences of actions to be made throughout its lifetime. The framework underlying much of reinforcement learning is that of Markov Decision Processes (MDPs). These processes describe the effects of actions in a stochastic environment, and the possible rewards at various states of the environment. If we have an MDP we can compute the choice of actions that maximizes the expected future reward. The task in reinforcement learning is to achieve this level of performance when the underlying MDP is not known in advance.

A central debate in reinforcement learning is over the use of models. Model-free approaches attempt to learn near-optimal policies without explicitly estimating the dynamics of the surrounding environment. This is usually done by directly approximating a value function that measures the desirability of each environment state. On the other hand, model-based approaches attempt to estimate a model of the environment's dynamics and use it to compute an estimate of the expected value of actions in the environment.

A common argument for model-based approaches is that by learning a model the agent can avoid costly repetition of steps in the environment. Instead, the agent can use the model to learn the effects of its actions at various states. This can lead to a significant reduction in the number of steps actually executed by the learner, since it can "learn" from simulated steps in the model (Sutton 1990).

Virtually all of the existing model-based approaches in the literature use simple estimation methods to learn the environment, and keep a point estimate of the environment dynamics. Such estimates ignore the agent's uncertainty about various aspects of the environment's dynamics.

In this paper, we advocate a Bayesian approach to model-based reinforcement learning. We show that under fairly reasonable assumptions we can represent the posterior distribution over possible models given our past experience. This is done with essentially the same cost as maintaining point estimates. Our methods thus allow us to continually update this distribution over possible models as we perform actions in the environment.

By representing a distribution over possible models, we can quantify our uncertainty as to what are the best actions to perform. This gives us a handle on the exploitation vs. exploration problem. Roughly speaking, this problem involves the dilemma of whether to explore (perform new actions that can lead us to uncharted territories) or to exploit (perform actions that have the best performance according to our current knowledge). Clearly, the uncertainty about our model and our expectations as to the range of possible results of actions play crucial roles in this problem.

In a precursor to this work, Dearden et al. (1998) introduce a Bayesian model-free approach in which uncertainty about the Q-values of actions is represented using probability distributions. By explicitly reasoning about uncertainty over Q-values, they direct exploration specifically toward poorly known regions of the state space. Their approach is based on a decision-theoretic approach to action selection: the agent should choose actions based on the value of the information it can expect to learn by performing them (Howard 1966). Dearden et al. propose a measure that balances the expected gains in performance from exploration, in the form of improved policies, with the expected cost of doing a potentially suboptimal action. This measure is computed from probability distributions over the Q-values of actions.

In this paper, we show how to use the posterior distribution over possible models to estimate the distribution of possible Q-values, and then use these to select actions. This use of models allows us to avoid the problem faced by model-free exploration methods, such as the one used by Dearden et al., that need to perform repeated actions to propagate values from one state to another. The main question is how to estimate these Q-values from our distribution of possible models. We present several methods of stochastic sampling to approximate these Q-value distributions. We then evaluate the performance of the resulting Bayesian learning agents on test environments that are designed to fool many exploration methods.

In Section 2 we briefly review the definition of MDPs and the definition of reinforcement learning problems. In Section 3 we discuss a Bayesian approach for learning models. In Section 4 we review the notion of Q-value distributions and the use of value of information for directing exploration. In Section 5 we propose several sampling methods for estimating Q-value distributions based on the uncertainty about the underlying model. In Section 6 we discuss several approaches for generalizing from the samples we get from the aforementioned methods, and how this generalization can improve our algorithms. In Section 7 we compare our methods to Prioritized Sweeping (Moore & Atkeson 1993), a well-known model-based reinforcement learning procedure.

2 Background

We assume the reader is familiar with the basic concepts of MDPs (see, e.g., (Kaelbling, Littman & Moore 1996)). We will use the following notation: an MDP is a 4-tuple (S, A, p_T, p_R), where S is a set of states, A is a set of actions, p_T(s -a-> t) is a transition model that captures the probability of reaching state t after we execute action a at state s, and p_R(s -a-> r) is a reward model that captures the probability of receiving reward r after executing a at state s. For the remainder of this paper, we assume that possible rewards are a finite subset R of the real numbers.

In this paper, we focus on infinite-horizon MDPs with a discount factor γ. The agent's aim is to maximize the expected discounted total reward it receives. Equivalently, we can compute an optimal value function V* and a Q-function Q*. These functions satisfy the Bellman equations:

    V*(s) = \max_{a ∈ A} Q*(s, a),

where

    Q*(s, a) = E_{p_R(s -a-> r)}[r | s, a] + γ \sum_{s' ∈ S} p_T(s -a-> s') V*(s').

If the agent has access to V* or Q*, it can optimize its expected reward by choosing the action a at s that maximizes Q*(s, a). Given a model, we can compute Q* using a variety of methods, including value iteration. In this method we repeatedly update an estimate Q of Q* by applying the Bellman equations to get new values of Q(s, a) for some (or all) of the states.
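To make this concrete, here is a minimal value-iteration sketch for a tabular MDP. The array layout and the stopping tolerance are our own illustrative assumptions; the paper itself does not prescribe an implementation.

```python
import numpy as np

def value_iteration(p_T, r_mean, gamma=0.95, tol=1e-6):
    """Compute Q* for a tabular MDP.

    p_T[s, a, t]  -- transition probabilities p_T(s -a-> t)
    r_mean[s, a]  -- expected immediate reward for (s, a)
    """
    n_states, n_actions = r_mean.shape
    V = np.zeros(n_states)
    while True:
        # Bellman backup: Q(s,a) = E[r | s,a] + gamma * sum_t p_T(s -a-> t) V(t)
        Q = r_mean + gamma * p_T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q
        V = V_new
```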
Reinforcement learning procedures attempt to achieve an optimal policy when the agent does not know p_T and p_R. Since we do not know the dynamics of the underlying MDP, we cannot compute the Q-value function directly. However, we can estimate it. In model-free approaches one usually estimates Q by treating each step in the environment as a sample from the underlying dynamics. These samples are then used for performing updates of the Q-values based on the Bellman equations. In model-based reinforcement learning one usually directly estimates p_T(s -a-> t) and p_R(s -a-> r). The standard approach is then to act as though these approximations are correct, compute Q*, and use it to choose actions.

A standard problem in learning is balancing between planning (i.e., choosing a policy) and execution. Ideally, the agent would compute the optimal value function for its model of the environment each time it updates it. This scheme is unrealistic, since finding the optimal policy for a given model is a non-trivial computational task. Fortunately, we can approximate this scheme if we notice that the approximate model changes only slightly at each step. We can hope that the value function from the previous model can be easily "repaired" to reflect these changes. This approach was pursued in the DYNA (Sutton 1990) framework, where after the execution of an action, the agent updates its model of the environment, and then performs some bounded number of value propagation steps to update its approximation of the value function. Each value-propagation step locally enforces the Bellman equation by setting

    \hat{V}(s) <- \max_a \hat{Q}(s, a),   where   \hat{Q}(s, a) = E_{\hat{p}_R(s -a-> r)}[r] + γ \sum_{s' ∈ S} \hat{p}_T(s -a-> s') \hat{V}(s'),

\hat{p}_T(s -a-> s') and \hat{p}_R(s -a-> r) are the agent's approximate model, and \hat{V} is the agent's approximation of the value function.

This raises the question of which states should be updated. Prioritized Sweeping (Moore & Atkeson 1993) is a method that estimates to what extent states would change their value as a consequence of new knowledge of the MDP dynamics or previous value propagations. States are assigned priorities based on the expected size of changes in their values, and states with the highest priority are the ones for which we perform value propagation.
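The following sketch illustrates the flavour of such a priority-driven update loop. It is only a rough approximation of Moore and Atkeson's algorithm, and the dense array layout is an assumption of ours.

```python
import heapq
import numpy as np

def prioritized_sweeping(p_T, r_mean, V, start_state, gamma=0.95,
                         n_updates=50, theta=1e-4):
    """Propagate value changes outward from start_state, largest changes first."""
    queue = [(-np.inf, start_state)]            # max-heap via negated priorities
    for _ in range(n_updates):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        q_s = r_mean[s] + gamma * p_T[s] @ V    # Bellman backup at s
        delta = abs(q_s.max() - V[s])
        V[s] = q_s.max()
        if delta > theta:
            # A predecessor s2 can change by at most gamma * p_T(s2 -a-> s) * delta.
            pred_priority = gamma * p_T[:, :, s].max(axis=1) * delta
            for s2 in np.nonzero(pred_priority > theta)[0]:
                heapq.heappush(queue, (-pred_priority[s2], int(s2)))
    return V
```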
3 Bayesian Model Learning

In this section we describe how to maintain a Bayesian posterior distribution over MDPs given our experiences in the environment. At each step in the environment, we start at state s, choose an action a, and then observe a new state t and a reward r. We summarize our experience by a sequence of experience tuples (s, a, r, t).

A Bayesian approach to this learning problem is to maintain a belief state over the possible MDPs. Thus, a belief state μ defines a probability density P(M | μ).
Given an experience tuple (s, a, r, t) we can compute the posterior belief state, which we denote μ ∘ (s, a, r, t), by Bayes rule:

    P(M | μ ∘ (s, a, r, t)) ∝ P((s, a, r, t) | M) P(M | μ) = p_T(s -a-> t | M) p_R(s -a-> r | M) P(M | μ).

Thus, the Bayesian approach starts with some prior probability distribution over all possible MDPs (we assume that the sets of possible states, actions and rewards are delimited in advance). As we gain experience, the approach focuses the mass of the posterior distribution on those MDPs in which the observed experience tuples are most probable.

An immediate question is whether we can represent these prior and posterior distributions over an infinite number of MDPs. We show that this is possible by adopting results from Bayesian learning of probabilistic models, such as Bayesian networks (Heckerman 1998). Under carefully chosen assumptions, we can represent such priors and posteriors in any of several compact manners. We discuss one such choice below.

To formally represent our problem, we consider the parameterization of MDPs. The simplest parameterization is table based, where there are parameters θ^T_{s,a,t} and θ^R_{s,a,r} for the transition and reward models. Thus, for each choice of s and a, the parameters θ^T_{s,a} = {θ^T_{s,a,t} : t ∈ S} define a distribution over possible states, and the parameters θ^R_{s,a} = {θ^R_{s,a,r} : r ∈ R} define a distribution over possible rewards.¹

¹ The methods we describe are easily extended to other parameterizations. In particular, we can consider continuous distributions, e.g., Gaussians, over rewards. For clarity of discussion, we focus on multinomial distributions throughout the paper.

We say that our prior satisfies parameter independence if it has the product form:

    P(θ | μ) = \prod_{s,a} P(θ^T_{s,a} | μ) \prod_{s,a} P(θ^R_{s,a} | μ).      (1)

Thus, the prior distribution over the parameters of each local probability term in the MDP is independent of the prior over the others. It turns out that this form is maintained as we incorporate evidence at each stage in the learning.

Proposition 3.1: If the belief state P(θ | μ) satisfies parameter independence, then P(θ | μ ∘ (s, a, r, t)) also satisfies parameter independence.

As a consequence, the posterior after we incorporate an arbitrarily long sequence of experience tuples also has the product form of (1).

Parameter independence allows us to reformulate the learning problem as a collection of unrelated local learning problems. In each of these, we have to estimate a probability distribution over all states or over all rewards. The question is how to learn these distributions. We can use well-known Bayesian methods for learning standard distributions such as multinomials or Gaussian distributions (Degroot 1986).

For the case of discrete multinomials, which we have assumed in our transition and reward models, we can use Dirichlet priors to represent Pr(θ^T_{s,a}) and Pr(θ^R_{s,a}). These priors are conjugate, and thus the posterior after each observed experience tuple will also be a Dirichlet distribution. In addition, Dirichlet distributions can be described using a small number of hyper-parameters. See Appendix A for a review of Dirichlet priors and their properties.

In the case of most MDPs studied in reinforcement learning, we expect the transition model to be sparse: there are only a few states that can result from a particular action at a particular state. Unfortunately, if the state space is large, learning with a Dirichlet prior can require many examples to recognize that most possible states are highly unlikely. This problem is addressed by a recent method of learning sparse-multinomial priors (Friedman & Singer 1999). Without going into details, the sparse-multinomial priors have the same general properties as Dirichlet priors, but assume that the observed outcomes are from some small subset of the set of possible ones. The sparse Dirichlet priors make predictions as though only the observed outcomes are possible, except that they also assign some probability to novel outcomes. In the MDP setting, a novel outcome is a transition to a state t that was not reached from s previously by executing a. See Appendix A for a brief summary of sparse-multinomial priors and their properties.

For both the Dirichlet and its sparse-multinomial extension, we need to maintain the number of times, N(s -a-> t), state t is observed after executing action a at state s, and similarly, N(s -a-> r) for rewards. With the prior distributions over the parameters of the MDP, these counts define a posterior distribution over MDPs. This representation allows us both to predict the probability of the next transition and reward, and also to compute the probability of every possible MDP and to sample from the distribution of MDPs.

To summarize, we assumed parameter independence, and that for each prior in (1) we have either a Dirichlet or a sparse-multinomial prior. The consequence is that the posterior can be represented compactly. This enables us to estimate a distribution over MDPs at each stage.

It is easy to extend this discussion to more compact parameterizations of the transition and reward models. For example, if each state is described by several attributes, we might use a Bayesian network to capture the MDP dynamics. Such a structure requires fewer parameters and thus we can learn it with fewer examples. Nonetheless, much of the above discussion and the conclusions about parameter independence and Dirichlet priors apply to these models (Heckerman 1998).

Standard model-based learning methods maintain a point estimate of the model. These point estimates are often close to the mean prediction of the Bayesian method. However, these point estimates do not capture the uncertainty about the model. In this paper, we examine how knowledge of this uncertainty can be exploited to improve exploration.
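As an illustration of the bookkeeping this section implies, the sketch below maintains Dirichlet counts for a single (s, a) pair and uses them for both prediction and sampling; the class and method names are ours, and the reward model would be handled analogously.

```python
import numpy as np

class DirichletTransitionModel:
    """Posterior over theta^T_{s,a} for a single (s, a) pair."""

    def __init__(self, n_states, alpha=1.0):
        # Hyper-parameters of the Dirichlet prior (one per successor state).
        self.alpha = np.full(n_states, alpha)
        # Counts N(s -a-> t) of observed transitions.
        self.counts = np.zeros(n_states)

    def update(self, t):
        # Conjugacy: the posterior is Dirichlet(alpha + counts).
        self.counts[t] += 1

    def predict(self):
        # Posterior mean: expected transition probabilities.
        post = self.alpha + self.counts
        return post / post.sum()

    def sample(self, rng):
        # Draw one transition distribution from the posterior.
        return rng.dirichlet(self.alpha + self.counts)

rng = np.random.default_rng(0)
model = DirichletTransitionModel(n_states=5)
model.update(t=2)
print(model.predict(), model.sample(rng))
```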

4 Value of Information Exploration

In a recent paper, Dearden et al. (1998) examined model-free Bayesian reinforcement learning. Their approach builds on the notion of Q-value distributions. Recall that Q*(s, a) is the expected reward if we execute a at s and then continue with optimal selection of actions. Since during learning we are uncertain about the model, there is a distribution over the Q-values at each pair (s, a). This distribution is induced by the belief state over possible MDPs, and the Q-values for each of these MDPs. In the model-free case, Dearden et al. propose an approach for estimating Q-value distributions without building a model. This approach makes several strong assumptions that are clearly violated in MDPs. In the next section, we show how we can use our representation of the posterior over models to give estimates of Q-value distributions. Before we do that, we briefly review how Dearden et al. use the Q-value distributions for selecting actions, as we use this method in the current work.

The approach of Dearden et al. is based on the decision-theoretic ideas of value of information (Howard 1966). The application of these ideas in this context is reminiscent of their use in tree search (Russell & Wefald 1991), which can also be seen as a form of exploration. The idea is to balance the expected gains from exploration, in the form of improved policies, against the expected cost of doing a potentially suboptimal action.

To formally define the approach, we need some notation. We denote by q_{s,a} a possible value of Q*(s, a) in some MDP. We treat these quantities as random variables that depend on our belief state. (For clarity of the following discussion, we do not explicitly reference the belief state in the mathematical notation.) We now consider what can be gained by learning the true value q*_{s,a} of q_{s,a}. How would this knowledge change the agent's future rewards? Clearly, if this knowledge does not change the agent's policy, then future rewards would not change. Thus, the only interesting scenarios are those where the new knowledge does change the agent's policy. This can happen in two cases: (a) when the new knowledge shows that an action previously considered sub-optimal is revealed as the best choice (given the agent's beliefs about other actions), and (b) when the new knowledge indicates that an action that was previously considered best is actually inferior to other actions.

For case (a), suppose that a_1 is the best action; that is, E[q_{s,a_1}] ≥ E[q_{s,a'}] for all other actions a'. Moreover, suppose that the new knowledge indicates that a is a better action; that is, q*_{s,a} > E[q_{s,a_1}]. Then we expect the agent to gain q*_{s,a} - E[q_{s,a_1}] by virtue of performing a instead of a_1. For case (b), suppose that a_1 is the action with the highest expected value and a_2 is the second-best action. If the new knowledge indicates that q*_{s,a_1} < E[q_{s,a_2}], then the agent should perform a_2 instead of a_1 and we expect it to gain E[q_{s,a_2}] - q*_{s,a_1}.

Combining these arguments, the gain from learning the exact value q*_{s,a} of q_{s,a} is:

    gain_{s,a}(q*_{s,a}) = E[q_{s,a_2}] - q*_{s,a}   if a = a_1 and q*_{s,a} < E[q_{s,a_2}],
                           q*_{s,a} - E[q_{s,a_1}]   if a ≠ a_1 and q*_{s,a} > E[q_{s,a_1}],
                           0                          otherwise,

where, again, a_1 and a_2 are the actions with the best and second-best expected values respectively. Since the agent does not know in advance what value will be revealed for q*_{s,a}, we need to compute the expected gain given our prior beliefs. Hence the expected value of perfect information about q_{s,a} is:

    VPI(s, a) = \int_{-∞}^{∞} gain_{s,a}(x) Pr(q_{s,a} = x) dx.      (2)

The computation of this integral depends on how we represent our distributions over q_{s,a}. We return to this issue below.

The value of perfect information gives an upper bound on the myopic value of information for exploring action a. The expected cost incurred for this exploration is given by the difference between the value of a and the value of the current best action, i.e., \max_{a'} E[q_{s,a'}] - E[q_{s,a}]. This suggests we choose the action that maximizes

    VPI(s, a) - (\max_{a'} E[q_{s,a'}] - E[q_{s,a}]).

Clearly, this strategy is equivalent to choosing the action that maximizes:

    E[q_{s,a}] + VPI(s, a).

We see that the value of exploration estimate is used as a way of boosting the desirability of different actions. When the agent is confident of the estimated Q-values, the VPI of each action is small and the agent will simply choose the action with the highest expected value.

5 Estimating Q-Value Distributions

How do we estimate the Q-value distributions? We now examine several methods of different complexity and bias.

5.1 Naive Sampling

Perhaps the simplest approach is to simulate the definition of a Q-value distribution. Since there are an infinite number of possible MDPs, we cannot afford to compute Q-values for each. Instead, we sample k MDPs M_1, ..., M_k from the distribution Pr(M | μ). We can solve each MDP using standard techniques (e.g., value iteration or linear programming). For each state s and action a, we then have a sample of solutions q^1_{s,a}, ..., q^k_{s,a}, where q^i_{s,a} is the optimal Q-value Q*(s, a) in the i'th MDP. From this sample we can estimate properties of the Q-distribution. For generality, we denote the weight of each sample, given μ, as w^i_μ. Initially these weights are all equal to 1.

Given these samples, we can estimate the mean Q-value as

    E[q_{s,a}] ≈ (1 / \sum_i w^i_μ) \sum_i w^i_μ q^i_{s,a}.

Similarly, we can estimate the VPI by summing over the k MDPs:

    VPI(s, a) ≈ (1 / \sum_i w^i_μ) \sum_i w^i_μ gain_{s,a}(q^i_{s,a}).

This approach is straightforward; however, it requires an efficient sampling procedure. Here again the assumptions we made about the priors help us. If our prior has the form of (1), then we can sample each distribution (p_T(s -a-> ·) or p_R(s -a-> ·)) independently of the rest. Thus, the sampling problem reduces to sampling from "simple" posterior distributions. For Dirichlet priors there are known sampling methods. For sparse-multinomials the problem is a bit more complex, but solvable. In Appendix A we describe both sampling methods.
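A sketch of how the sampled Q-values might be turned into the action scores E[q_{s,a}] + VPI(s, a) follows; the array names are our own, and the gain function is the one defined in Section 4.

```python
import numpy as np

def vpi_scores(q_samples, weights):
    """Score actions by E[q_{s,a}] + VPI(s, a) from sampled Q-values.

    q_samples[i, a] -- Q-value of action a in the i-th sampled MDP
    weights[i]      -- weight w^i of the i-th sample
    """
    w = weights / weights.sum()
    means = w @ q_samples                         # E[q_{s,a}] for each action
    order = np.argsort(means)[::-1]
    a1, a2 = order[0], order[1]                   # best and second-best actions

    scores = np.empty(len(means))
    for a in range(len(means)):
        q = q_samples[:, a]
        if a == a1:
            gain = np.maximum(means[a2] - q, 0.0)   # case a = a_1
        else:
            gain = np.maximum(q - means[a1], 0.0)   # case a != a_1
        scores[a] = means[a] + w @ gain             # E[q_{s,a}] + estimated VPI
    return scores
```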

5.2 Importance Sampling


The term Pr((s, a , r, t) I M ) is easily computed from M ,
An immediate problem with the naive sampling approach and Pr((s, a , r, t ) I p ) can be easily computed based on our
is that it requires several global computations (e.g., comput- posteriors. Thus, we can easily re-weight the sampled mod-
ing value functions for MDPs) to evaluate each action made
els after each experience is recorded and use the weighted
by the agent. This is clearly too expensive. One possible sum for choosing actions. Note that re-weighting of models
way of avoiding these repeated computations is to reuse the is fast, and since we have already computed the Q-value for
same sampled MDPs for several steps. To do so, we can use each pair (s, a) in each of the models, no additional compu-
ideas from impoflance sampling. tations are needed.
In importance sampling we want to a sample from Of course, the original set of models we sampled be-
P r ( M ( p') but for some reasons, we actually sample from comes irrelevant as we learn more about the underlying
P r ( M I p). We adjust the weight of each sample appro- MDP. We can use the total weight of the sampled MDPs
priately to correct for the difference between the sampling to track how unlikely they are given the observations. Ini-
distribution (e.g., P r ( M I p)) and the target distribution tially this weight is k . As we learn more it usually be-
(e.g.. P r ( M 1 pi)): comes smaller. When it becomes smaller than some thresh-
. P r ( M i 1 p') , old k,in, we sample k - k,in new MDPs from our current
w;, = belief state, assigning each one weight 1 and thus bringing
Pr(Mi 1 p ) the total weight of the sample to k again. We then need only
We now use the weighted sum of samples to estimate the the MDPs.
mean and the VPI of different actions. It is easy to verify Summarize, we MDPs, them, and use
that the weighted sample leads to correct prediction when the k Q-values to estimate properties of the Q-value distri-
we have a large number of samples. In practice, the success bution. We re-weight the samples at each step to reflect our
of importance sampling depends on the difference between newly gained knowledge. Finally, we have an automatic
the two distributions. If an MDP M has low probability ac- for detecting when new samples are required.
cording to P r ( M I p), then the probability of sampling it is 5.3 Global Sampling with Repair
small, even if P r ( M 1 p') is high.
Fortunately for us, the differences between the beliefs The global sampling approach of the previous section has
before and after observing an experience tuple are usually One serious deficiency. It involves computing global
small. We can easily show that tions tc. MDPs which can be very expensive. Although we
can reuse MDPs from previous steps, this approach still re-
Proposition 5.1: auires us to samole new MDPs and solve them suite often.
An alternative idea is to keep updating each of the sam-
- P r ( M I P 0 (s, a , r , t ) )
w~~(s,a,r,t)- P pled MDPs. Recall that after observing an experience tuple
P r ( M I P)
"J

(s, a , r , t), we only change the posterior over Of,, and %:,.
-
- pr((s,a ,r,t) I M ) Thus, instead of re-weighting the sample M ~ we , can up-
Pr((s, a , r , t ) I P) w P date, or repair, it by re-sampling %f , and O:,,. If the orig-
Model based Bayesian Exploration 155
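Proposition 5.1 suggests a simple incremental re-weighting scheme. The sketch below shows the transition factor only (the reward factor is analogous); the attribute and argument names are assumptions of ours.

```python
def reweight(samples, weights, counts_T, prior_alpha, s, a, t):
    """Multiply each weight w^i by Pr(s -a-> t | M_i) / Pr(s -a-> t | mu)."""
    post = prior_alpha + counts_T[s, a]        # Dirichlet posterior over theta^T_{s,a}
    pred = post[t] / post.sum()                # Pr(s -a-> t | mu), the posterior predictive
    for i, M in enumerate(samples):
        weights[i] *= M.p_T[s, a, t] / pred    # likelihood ratio of Proposition 5.1
    return weights
```

When the total weight falls below the threshold k_min, fresh MDPs would be sampled from the current posterior as described above.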

5.3 Global Sampling with Repair

The global sampling approach of the previous sections has one serious deficiency. It involves computing global solutions to MDPs, which can be very expensive. Although we can reuse MDPs from previous steps, this approach still requires us to sample new MDPs and solve them quite often. An alternative idea is to keep updating each of the sampled MDPs. Recall that after observing an experience tuple (s, a, r, t), we only change the posterior over θ^T_{s,a} and θ^R_{s,a}. Thus, instead of re-weighting the sample M^i, we can update, or repair, it by re-sampling θ^T_{s,a} and θ^R_{s,a}. If the original sample M^i was sampled from Pr(M | μ), then it easily follows that the repaired M^i is sampled from Pr(M | μ ∘ (s, a, r, t)).

Of course, once we modify M^i its Q-value function changes. However, all of these changes are consequences of the new values of the dynamics at (s, a). Thus, we can use prioritized sweeping to update the Q-values computed for M^i. This sweeping performs several Bellman updates to correct the values of states that are affected by the change in the model.²

² Generalized prioritized sweeping (Andre, Friedman & Parr 1997) allows us to extend prioritized sweeping to these approximate settings. When using approximate models or value functions, one must address the problem of calculating the states on which to estimate the priority.

This suggests the following algorithm. Initially, we sample k MDPs from our prior belief state. At each step we:

- Observe an experience tuple (s, a, r, t).
- Update Pr(θ^T_{s,a}) by t, and Pr(θ^R_{s,a}) by r.
- For each i = 1, ..., k, sample θ^{T,i}_{s,a} and θ^{R,i}_{s,a} from the new Pr(θ^T_{s,a}) and Pr(θ^R_{s,a}), respectively.
- For each i = 1, ..., k, run a local instantiation of prioritized sweeping to update the Q-value function of M^i.

Thus, our approach is quite similar to standard model-based learning with prioritized sweeping, but instead of running one instantiation of prioritized sweeping, we run k instantiations in parallel, one for each sampled MDP. The repair to the sampled MDPs ensures that they constitute a sample from the current belief state, and the local instantiations of prioritized sweeping ensure that the Q-values computed in each of these MDPs are a good approximation to the true values.

As with the other approaches we have described, after we invoke the k prioritized sweeping instances we use the k samples from each q_{s,a} to select the next actions using VPI computations.
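One step of the repair scheme might look like the sketch below. The full-backup `sweep` is a crude stand-in for a local prioritized-sweeping routine, and the container layout is our own assumption.

```python
def sweep(M, reward_values, gamma=0.95, n_updates=10):
    """Crude stand-in for prioritized sweeping: a few full Bellman backups on M.Q."""
    r_mean = M.p_R @ reward_values                       # expected reward for each (s, a)
    for _ in range(n_updates):
        M.Q = r_mean + gamma * M.p_T @ M.Q.max(axis=1)

def repair_step(samples, counts_T, counts_R, alpha_T, alpha_R,
                reward_values, s, a, r_idx, t, rng):
    """Update the posterior for (s, a) and repair each sampled MDP M^i."""
    counts_T[s, a, t] += 1                               # update Pr(theta^T_{s,a})
    counts_R[s, a, r_idx] += 1                           # update Pr(theta^R_{s,a})
    for M in samples:
        # Re-sample only the local parameters; the rest of M^i is unchanged.
        M.p_T[s, a] = rng.dirichlet(alpha_T + counts_T[s, a])
        M.p_R[s, a] = rng.dirichlet(alpha_R + counts_R[s, a])
        # Restore M^i's Q-values around the changed dynamics.
        sweep(M, reward_values)
```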
Figure 1: Mean and variance of the Q-value distribution for a state, plotted as a function of time (x-axis: iterations). Note that the means of each method converge to the true value of the state at the same time that the variances approach zero.

Figure 1 shows a single run of learning where the actions selected were fixed and each of the three methods was used to estimate the Q-values of a state. Initially the means and variances are very high, but as the agent gains more experience, the means converge on the true value of the state, and the variances tend towards zero. These results suggest that the repair and importance sampling approaches both provide reasonable approximations to naive global sampling.

5.4 Local Sampling

Until now we have considered using global samples of MDPs. An alternative approach is to try to maintain for each (s, a) an estimate of the Q-value distribution, and to update these distributions using a local, Bellman-update-like, propagation rule. To understand this approach, recall the Bellman equation:

    q_{s,a} = E_{p_R(s -a-> r)}[r] + γ \sum_{s' ∈ S} p_T(s -a-> s') \max_{a'} q_{s',a'}.

In our current setting, the terms q_{s',a'} are random variables that depend on our current estimates of the Q-value distributions. The probabilities p_T(s -a-> s') are also random variables that depend on our posterior on θ^T_{s,a}, and finally E[p_R(s -a-> r)] is also a random variable that depends on the posterior on θ^R_{s,a}. Thus, we can sample from q_{s,a} by jointly sampling from all of these distributions, i.e., q_{s',a'} for all states, p_T(s -a-> s'), and p_R(s -a-> r), and then computing the Q-value. If we repeat this sampling step k times, we get k samples from a single Bellman iteration for q_{s,a}.

Starting with our beliefs about the model and about the Q-value distributions of all states, we can sample from the distribution of q_{s,a}. To make this procedure manageable, we assume that we can sample from each q_{s',a'} independently. This assumption does not hold in general MDPs, since the distributions of different Q-values are correlated (by the Bellman equation). However, we might hope that the exponential decay will weaken these dependencies.

We are now left with the question of how to use the k samples from q_{s,a}. The simplest approach is to use the samples as a representation of our approximation of the distribution of q_{s,a}. We can compute the mean and VPI from a set of samples, as we did in the global sampling approach. Similarly, we can re-sample from this representation by randomly choosing one of the points. This results in a method that is similar to recent sampling methods that have been used successfully in monitoring complex dynamic processes (Kanazawa, Koller & Russell 1995).

This gives us a method for performing a Bellman update on our Q-value distributions. To get a good estimate of these distributions we need to repeat these updates. Here we can use a prioritized-sweeping-like algorithm that performs updates based on an estimate of which Q-value distribution can be most affected by the updates to other Q-value distributions.
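A sketch of one such local sampling update for a single (s, a) pair, under the independence assumption just described; the representation of q_samples (k stored samples per state-action pair) and the other names are ours.

```python
import numpy as np

def local_bellman_samples(q_samples, counts_T, counts_R, reward_values,
                          alpha, s, a, k, gamma, rng):
    """Draw k new samples of q_{s,a} by sampling the local model and successor Q-values."""
    n_states = len(q_samples)
    n_actions = len(q_samples[0])
    new = np.empty(k)
    for i in range(k):
        p_T = rng.dirichlet(alpha + counts_T[s, a])      # sample theta^T_{s,a}
        p_R = rng.dirichlet(alpha + counts_R[s, a])      # sample theta^R_{s,a}
        expected_r = p_R @ reward_values
        v_next = np.empty(n_states)
        for s2 in range(n_states):
            # One draw of q_{s',a'} per action, then the max, as in the Bellman equation.
            draws = [rng.choice(q_samples[s2][a2]) for a2 in range(n_actions)]
            v_next[s2] = max(draws)
        new[i] = expected_r + gamma * p_T @ v_next
    return new
```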

Figure 2: Samples, Gaussian approximation, and kernel estimates of a Q-value distribution after 100, 300, and 700 steps of naive global sampling on the same run as Figure 1.

6 Generalization and Smoothing

In the approaches described above we generated samples from the Q-value distributions, and effectively used a collection of points to represent the approximation to the Q-value distribution. A possible problem with this representation is that we use a fairly simplistic representation to describe a complex distribution. This suggests that we should generalize from the k samples by using standard generalization methods.

This is particularly important in the local sampling approach. Here we also use our representation of the Q-value distribution to propagate samples for other Q-value distributions. Experience from monitoring tasks in stochastic processes suggests that introducing generalization can drastically improve performance (Koller & Fratkina 1998).

Perhaps the simplest approach to generalizing from the k samples is to assume that the Q-value distribution has a particular parametric form, and then to fit the parameters to the samples. The first approach that comes to mind is fitting a Gaussian to the k samples. This captures the first two moments of the sample, and allows simple generalization. Unfortunately, because of the max() terms in the Bellman equations, we expect the Q-value distribution to be skewed in the positive direction. If this skew is strong, then fitting a Gaussian would be a poor generalization from the sample.

At the other end of the spectrum are non-parametric approaches. One of the simplest is kernel estimation (see for example (Bishop 1995)). In this approach, we approximate the distribution over Q(s, a) by a sum of Gaussians with a fixed variance, one for each sample. This approach can be effective if we are careful in choosing the variance parameter. Too small a variance will lead to a spiky distribution; too large a variance will lead to an overly smooth and flat distribution. We use a simple rule for estimating the kernel width as a function of the mean (squared) distance between points.³

³ This rule is motivated by a leave-one-out cross-validation estimate of the kernel width. Let q^1, ..., q^k be the k samples. We want to find the kernel width σ that maximizes the term \sum_i log [ (1/(k-1)) \sum_{j ≠ i} f(q^i | q^j, σ²) ], where f(q^i | q^j, σ²) is the Gaussian pdf with mean q^j and variance σ². Using Jensen's inequality, this term is bounded from below by a sum of log-densities, and we choose the σ that maximizes the bound. Proposition 6.1: The value of σ² that maximizes \sum_i \sum_{j ≠ i} log f(q^i | q^j, σ²) is the average squared distance among samples, d̄ = (1/(k(k-1))) \sum_i \sum_{j ≠ i} (q^i - q^j)².
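A small sketch of the kernel-width rule and the resulting kernel estimate, as we have reconstructed them; the function names are ours.

```python
import numpy as np

def kernel_width_sq(q):
    """Kernel variance set from the average squared distance between samples."""
    diff = q[:, None] - q[None, :]
    k = len(q)
    return (diff ** 2).sum() / (k * (k - 1))

def kernel_density(x, q):
    """Kernel (Parzen) estimate of the Q-value density at the points x."""
    var = kernel_width_sq(q)
    z = (x[:, None] - q[None, :]) / np.sqrt(var)
    # Equal-weight mixture of Gaussians, one centred on each sample.
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(q) * np.sqrt(2 * np.pi * var))
```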
Of course, there are many other generalization methods we might consider using here, such as mixture distributions. However, these two approaches provide us with initial ideas on the effect of generalization in this context.

We must also compute the VPI of a set of generalized distributions made up of Gaussians or kernel estimates. This is simply a matter of solving the integral given in Equation 2, where Pr(q_{s,a} = x) is computed from the generalized probability distribution for state s and action a. This integration can be simplified to a term whose main cost is an evaluation of the cdf of a Gaussian distribution (e.g., see (Russell & Wefald 1991)). This function is implemented in most language libraries (e.g., via the erf() function in the C library), and thus can be computed quite efficiently.
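For a Gaussian approximation q_{s,a} ~ N(m, σ²), the VPI integral of Equation 2 reduces to a closed form in the normal pdf and cdf. A sketch (the function name and the use of scipy are our own choices):

```python
from scipy.stats import norm

def gaussian_vpi(m, sigma, is_best, threshold):
    """Closed-form VPI for q_{s,a} ~ N(m, sigma^2).

    threshold is E[q_{s,a_2}] if this action is currently best (is_best=True),
    and E[q_{s,a_1}] otherwise, matching the gain function of Section 4.
    """
    if is_best:
        d = threshold - m          # E[max(E[q_{s,a_2}] - q, 0)]
    else:
        d = m - threshold          # E[max(q - E[q_{s,a_1}], 0)]
    return d * norm.cdf(d / sigma) + sigma * norm.pdf(d / sigma)
```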
Figure 2 shows the effects of Gaussian approximation and kernel estimation smoothing (using the computed kernel width) on the sample values used to generate the Q-distributions in Figure 1 for three different time steps. Early in the run Gaussian approximation produces a very poor approximation because the samples are quite widely spread and very skewed, while kernel estimation provides a much better approximation to the observed distribution. For this reason, we expect kernel estimation to perform better than Gaussian approximation for computing VPI.

7 Experimental Results

Figure 3 shows two domains of the type on which we have tested our algorithms. Each is a maze in which the agent begins at the point marked S and must collect the flag F and deliver it to the goal G. The agent receives a reward of 1 for each flag it collects and then moves to the goal state, and the problem is then reset. If the agent enters the square marked T (a trap) it receives a reward of -10. Each action (up, down, left, right) succeeds with probability 0.9 if that direction is clear, and with probability 0.1 moves the agent perpendicular to the desired direction. The "trap" domain has 18 states, the "maze" domain 56.

Figure 3: The (a) "trap" and (b) larger maze domains.

We evaluate the algorithms by computing the average (over 10 runs) future discounted reward received by the agent. We use this measure rather than the value of the learned policy because exploratory agents rarely actually follow either the greedy policy they have discovered or their current exploration policy for very long. For comparison we use prioritized sweeping (Moore & Atkeson 1993) with the T_bored parameter optimized for each problem.

Figure 4 shows the performance of a representative sample of our algorithms on the trap domain.

Unless they are based on a very small number of samples, all of the Bayesian exploration methods outperform prioritized sweeping. This is due to their more cautious approach to the trap state. Although they are uncertain about it, they know that its value is probably bad, and hence do not explore it further after a small number of visits.

Figure 4: Discounted future reward received for the "trap" domain.

Figure 5 compares prioritized sweeping with our Q-value estimation techniques on the larger maze domain. As the graph shows, our techniques perform better than prioritized sweeping early in the learning process. They explore more widely initially, and do a better job of avoiding the trap state once they find it. Of the three techniques, global sampling performs best, although its computational requirements are considerable: about ten times as much as sampling with repair. Importance sampling runs about twice as fast as global sampling but converges relatively late on this problem, and did not converge on all trials.

Figure 5: Comparison of Q-value estimation techniques on the larger maze domain.

Figure 6 shows the relative performance of the three smoothing methods, again on the larger domain. To exaggerate the effects of smoothing, only 20 samples were used to produce this graph. Kernel estimation performs very well, while no smoothing failed to find the optimal (two-flag) strategy on two out of ten runs. Gaussian approximation was slow to settle on a policy; it continued to make exploratory actions after 1500 steps, while all the other algorithms had converged by then.

Figure 6: The effects of smoothing techniques on performance in the large maze domain.

We are currently investigating the performance of the algorithm on both more complex maze domains and random MDPs, and also the effectiveness of the local sampling approach we have described.

8 Discussion

This paper makes two main contributions. First, we show how to maintain Bayesian belief states about MDPs. We show that this can be done in a simple manner by using ideas that appear in Bayesian learning of probabilistic models. Second, we discuss how to use the Bayesian belief state to choose actions in a way that balances exploration and exploitation. We adapt the value of information approach of Dearden et al. (1998) to this model-based setup and show how to approximate the Q-value distributions needed for making these choices.

A recent approach to exploration that is related to our work is that of Kearns and Singh (1998). Their approach divides the set of states into two groups. The known states are ones for which the learner is quite confident about the transition probabilities. That is, the learner believes that its estimate of the transition probabilities is close enough to the true distribution. All other states are considered unknown states. In Kearns and Singh's proposal, the learner constructs a policy over the known states. This policy takes into account both exploitation and the possibility of finding better rewards in unknown states (which are considered as highly rewarding). When it finds itself in an unknown state, the agent chooses actions randomly. The algorithm proceeds in phases; after each one it reclassifies the states and recomputes the policy on the known states.

Kearns and Singh's proposal is significant in that it is the first one for which we have polynomial guarantees on the number of steps needed to get to a good policy. However, this algorithm was not implemented or tested, and it is not clear how fast it learns in real domains.

Our exploration strategy also keeps a record of how confident we are in each state (i.e., a Bayesian posterior), and also chooses actions based on their expected rewards (both known rewards, and possible exploration rewards). The main difference is that we do not commit to a binary classification of states, but instead choose actions in a way that takes into account the possible value of doing the exploration. This leads to exploitation even before we are extremely confident in the dynamics at every state in the "interesting" parts of the domain.

There are several directions for future research. First, we are currently conducting experiments on larger domains to show how our method scales up. We are also interested in applying it to more compact model representations (e.g., using dynamic Bayesian networks), and to problems with continuous state spaces.

Finally, the most challenging future direction is to deal with the actual value of information of an action rather than myopic estimates. This problem can be stated as an MDP over belief states. However, this MDP is extremely large, and requires some approximations to find good policies quickly. Some of the ideas we introduced here, such as the re-weighting of sampled MDPs, might allow us to address this computational task.

Acknowledgements

We are grateful for useful comments from Craig Boutilier and Stuart Russell. Richard Dearden was supported by a Killam Predoctoral Fellowship and by IRIS Phase-III project "Dealing with Actions" (BAC). Some of this work was done while Nir Friedman was at U.C. Berkeley. Nir Friedman and David Andre were supported in part by ARO under the MURI program "Integrated Approach to Intelligent Systems", grant number DAAH04-96-1-0341, and by ONR under grant number N00014-97-1-0941. Nir Friedman was also supported through the generosity of the Michael Sacher Trust. David Andre was also supported by a DoD National Defense Science and Engineering Grant.

A Dirichlet and Sparse-Multinomial Priors

Let X be a random variable that can take L possible values from a set Σ. Without loss of generality, let Σ = {1, ..., L}. We are given a training set D that contains the outcomes of N independent draws x^1, ..., x^N of X from an unknown multinomial distribution P*. The multinomial estimation problem is to find a good approximation for P*.

This problem can be stated as the problem of predicting the outcome x^{N+1} given x^1, ..., x^N. Given a prior distribution over the possible multinomial distributions, the Bayesian estimate is:

    P(x^{N+1} | D, ξ) = \int P(x^{N+1} | θ, ξ) P(θ | D, ξ) dθ,      (3)

where θ = (θ_1, ..., θ_L) is a vector that describes the possible values of the (unknown) probabilities P*(1), ..., P*(L), and ξ is the "context" variable that denotes all other assumptions about the domain.

The posterior probability of θ can be rewritten as:

    P(θ | D, ξ) ∝ P(θ | ξ) \prod_i θ_i^{N_i},

where N_i is the number of occurrences of the symbol i in the training data.

Dirichlet distributions are a parametric family that is conjugate to the multinomial distribution. That is, if the prior distribution is from this family, so is the posterior. A Dirichlet prior for X is specified by hyper-parameters α_1, ..., α_L, and has the form:

    P(θ | ξ) ∝ \prod_i θ_i^{α_i - 1}   (where \sum_i θ_i = 1 and θ_i ≥ 0 for all i),

where the proportionality constant is a normalizing term that ensures that this is a legal density function (i.e., the integral of P(θ | ξ) over all parameter values is 1). Given a Dirichlet prior, the initial prediction for each value of X is

    P(x^1 = i | ξ) = \int θ_i P(θ | ξ) dθ = α_i / \sum_j α_j.

It is easy to see that, if the prior is a Dirichlet prior with hyper-parameters α_1, ..., α_L, then the posterior is a Dirichlet with hyper-parameters α_1 + N_1, ..., α_L + N_L. Thus, we get that the prediction for x^{N+1} is

    P(x^{N+1} = i | D, ξ) = (α_i + N_i) / \sum_j (α_j + N_j).      (4)

In some situations we would like to sample a vector θ according to the distribution P(θ | ξ). This can be done using a simple procedure: sample values y_1, ..., y_L such that each y_i ~ Gamma(α_i, 1), and then normalize to get a probability distribution, where Gamma(α, β) is the Gamma distribution. Procedures for sampling from these distributions can be found in (Ripley 1987).
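The Gamma-based sampler, as a short sketch (numpy's random generator is an implementation choice of ours):

```python
import numpy as np

def sample_dirichlet(alphas, rng):
    """Sample theta ~ Dirichlet(alphas) via independent Gamma(alpha_i, 1) draws."""
    y = rng.gamma(shape=alphas, scale=1.0)
    return y / y.sum()

rng = np.random.default_rng(0)
print(sample_dirichlet(np.array([2.0, 1.0, 0.5]), rng))
```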
Friedman and Singer (1999) introduce a structured prior that captures our uncertainty about the set of "feasible" values of X. Define a random variable V that takes values from the set 2^Σ of possible subsets of Σ. The intended semantics of this variable is that if we know the value of V, then θ_i > 0 iff i ∈ V.

Clearly, the hypothesis V = Σ' (for Σ' ⊆ Σ) is consistent with the training data only if Σ' contains all the indices i for which N_i > 0. We denote by Σ° the set of observed symbols. That is, Σ° = {i : N_i > 0}, and we let k° = |Σ°|.

Suppose we know the value of V.

Given this assumption, we can define a Dirichlet prior over possible multinomial distributions θ by using the same hyper-parameter α for each symbol in V. Formally, we define the prior:

    P(θ | V) ∝ \prod_{i ∈ V} θ_i^{α - 1}   (where \sum_i θ_i = 1 and θ_i = 0 for all i ∉ V).      (5)

Using Eq. (4), we have that:

    P(x^{N+1} = i | V, D) = (α + N_i) / (|V| α + N)   if i ∈ V,
                            0                          otherwise.      (6)

Now consider the case where we are uncertain about the actual set of feasible outcomes. We construct a two-tiered prior over the values of V. We start with a prior over the size of V, and assume that all sets of the same cardinality have the same prior probability. We let the random variable S denote the cardinality of V. We assume that we are given a distribution P(S = k) for k = 1, ..., L. We define the prior over sets to be P(V | S = k) = (L choose k)^{-1}. This prior is a sparse-multinomial with parameters α and Pr(S = k).

Friedman and Singer show how we can efficiently predict using this prior.

Theorem A.1: (Friedman & Singer 1999) Given a sparse-multinomial prior, the probability of the next symbol is

    P(x^{N+1} = i | D) = C(D, L) (α + N_i) / (k° α + N)   if i ∈ Σ°,
                         (1 / (L - k°)) (1 - C(D, L))      if i ∉ Σ°,

where

    C(D, L) = \sum_{k ≥ k°} [(k° α + N) / (k α + N)] P(S = k | D).

Moreover,

    P(S = k | D) = m_k / \sum_{k' ≥ k°} m_{k'},

where

    m_k = P(S = k) [k! / (k - k°)!] [Γ(k α) / Γ(k α + N)]

and Γ(x) = \int_0^∞ t^{x-1} e^{-t} dt is the gamma function.

Thus, we can think of C(D, L) as a scaling factor that we apply to the Dirichlet prediction that assumes that we have seen all of the feasible symbols. The quantity 1 - C(D, L) is the probability mass assigned to novel (i.e., unseen) outcomes.

In some of the methods discussed above we need to sample a parameter vector from a sparse-multinomial prior. Probable parameter vectors according to such a prior are sparse, i.e., they contain few non-zero entries. The choice of the non-zero entries among the outcomes that were not observed is done with uniform probability. This presents a complication, since each sample will depend on some unobserved states. To "smooth" this behaviour we sample from the distribution over V combined with the novel event. We sample a value of k from P(S = k | D). We then sample from the Dirichlet distribution of dimension k where the first k° elements are assigned hyper-parameter α + N_i, and the rest are assigned hyper-parameter α. The sampled vector of probabilities describes the probability of outcomes in Σ° and of k - k° additional events. We combine these latter probabilities into the probability of the novel event.

References

Andre, D., Friedman, N. & Parr, R. (1997), Generalized prioritized sweeping, in 'Advances in Neural Information Processing Systems', Vol. 10.

Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press, Oxford.

Dearden, R., Friedman, N. & Russell, S. (1998), Bayesian Q-learning, in 'Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98)'.

Degroot, M. H. (1986), Probability and Statistics, 2nd edn, Addison-Wesley, Reading, Mass.

Friedman, N. & Singer, Y. (1999), Efficient Bayesian parameter estimation in large discrete domains, in 'Advances in Neural Information Processing Systems 11', MIT Press, Cambridge, Mass.

Heckerman, D. (1998), A tutorial on learning with Bayesian networks, in M. I. Jordan, ed., 'Learning in Graphical Models', Kluwer, Dordrecht, Netherlands.

Howard, R. A. (1966), 'Information value theory', IEEE Transactions on Systems Science and Cybernetics SSC-2, 22-26.

Kaelbling, L. P., Littman, M. L. & Moore, A. W. (1996), 'Reinforcement learning: A survey', Journal of Artificial Intelligence Research 4, 237-285.

Kanazawa, K., Koller, D. & Russell, S. (1995), Stochastic simulation algorithms for dynamic probabilistic networks, in 'Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI-95)', Morgan Kaufmann, Montreal.

Kearns, M. & Singh, S. (1998), Near-optimal performance for reinforcement learning in polynomial time, in 'Proceedings of the Fifteenth Int. Conf. on Machine Learning', Morgan Kaufmann.

Koller, D. & Fratkina, R. (1998), Using learning for approximation in stochastic processes, in 'Proceedings of the Fifteenth International Conference on Machine Learning', Morgan Kaufmann, San Francisco, Calif.

Moore, A. W. & Atkeson, C. G. (1993), 'Prioritized sweeping: reinforcement learning with less data and less time', Machine Learning 13, 103-130.

Ripley, B. D. (1987), Stochastic Simulation, Wiley, NY.

Russell, S. J. & Wefald, E. H. (1991), Do the Right Thing: Studies in Limited Rationality, MIT Press, Cambridge, Mass.

Sutton, R. S. (1990), Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, in 'Proceedings of the Seventh Int. Conf. on Machine Learning', Morgan Kaufmann, pp. 216-224.