
The 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems
October 11-15, 2009, St. Louis, USA

Bayesian Reinforcement Learning in Continuous POMDPs with Gaussian Processes

Patrick Dallaire, Camille Besse, Stephane Ross and Brahim Chaib-draa

P. Dallaire, C. Besse and B. Chaib-draa are with the Department of Computer Science, Laval University, Quebec, Canada. S. Ross is with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.

Abstract— Partially Observable Markov Decision Processes (POMDPs) provide a rich mathematical model to handle real-world sequential decision processes but require a known model to be solved by most approaches. However, mainstream POMDP research focuses on the discrete case, which complicates its application to most realistic problems that are naturally modeled using continuous state spaces. In this paper, we consider the problem of optimal control in continuous and partially observable environments when the parameters of the model are unknown. We advocate the use of Gaussian Process Dynamical Models (GPDMs), which allow the model to be learned through experience with the environment. Our results on the blimp problem show that the approach can learn good models of the sensors and actuators in order to maximize long-term rewards.

I. INTRODUCTION

In the past few decades, Reinforcement Learning (RL) has emerged as an elegant and popular technique to handle decision problems when the model is unknown. Reinforcement learning is a general technique that allows an agent to learn the best way to behave, i.e. to maximize expected return, from repeated interactions with the environment. One of the most fundamental open questions in RL is that of exploration-exploitation: namely, how should the agent choose actions during the learning phase, in order to both maximize its knowledge of the model as needed to better achieve the objective later (to explore), and maximize current achievement of the objective based on what is already known about the domain (to exploit). Under some (reasonably general) conditions on the exploratory behavior, it has been shown that RL eventually learns the optimal action-selection behavior. However, these conditions do not specify how to optimally trade off exploration and exploitation so as to maximize utilities throughout the life of the agent, including during the learning phase as well as beyond.

Model-based Bayesian RL (BRL) is a recent extension of RL that has gained significant interest from the AI community, as it can jointly optimize performance during learning and beyond, within the standard Bayesian inference paradigm. In this framework, prior information about the problem (including uncertainty) is represented in parametric form, and Bayesian inference is used to incorporate any new information about the model. Thus, the exploration-exploitation problem can be handled as an explicit sequential decision problem, where the agent seeks to maximize future expected return with respect to its current uncertainty on the model. An important limitation of this approach is that making decisions is significantly more complex, since it involves reasoning about all possible models and courses of action. In addition, most work to date on this framework has been limited to cases where full knowledge of the agent's state is available at every time step [1], [2], [3], [4].

Mainstream frameworks for model-based BRL in POMDPs [5], [6] are only applicable in domains described by finite sets of states, actions and observations, where the Dirichlet distribution is used as a posterior over discrete distributions. This is an important limitation in practice, as many practical problems are naturally described by continuous state, action and observation spaces. For instance, in robot navigation problems, the state is usually described by continuous variables such as the robot's position, orientation, velocity and angular velocity; the choice of action controls the forward and angular acceleration, which are both continuous; and the observations provide a noisy estimate of the robot's state based on its sensors, which would also be continuous.

One recent framework for model-based BRL in continuous POMDPs is the Bayes-Adaptive Continuous POMDP [7], which extends the previously proposed Bayes-Adaptive POMDP model for discrete domains to continuous domains. However, this framework assumes that the parametric forms of the transition and observation functions are known and that only the mean vector and covariance matrix of the noise random variables are unknown. Hence, this framework has limited applicability when the transition and observation functions are completely unknown.

This paper investigates a model-based BRL framework that can handle domains that are both partially observable and continuous without assuming any parametric form for the transition, observation and reward functions. To achieve this, we use Gaussian Process priors to learn these functions, and then propose a planning algorithm which selects the actions that maximize long-term expected rewards under the current model.

II. RELATED WORK

In the case of continuous POMDPs, the literature is relatively sparse. In general, the common approach is to assume a discretization of the state space and therefore an incomplete model of the underlying system.
A first approach to continuous-state POMDPs is undoubtedly Thrun's approach [8], where a belief is viewed as a set of weighted samples, which can be considered a particular (degenerate) case of a Gaussian mixture representation. The author presents a Monte Carlo algorithm for learning to act in POMDPs with real-valued state and action spaces, thus paying tribute to the fact that a large number of real-world problems are continuous in nature. A reinforcement learning value iteration algorithm is used to learn the value function over belief states, and a sample-based version of nearest neighbors is then used to generalize across states.

A second approach to continuous-state POMDPs has been proposed by Porta et al. [9]. In this paper, the authors show that the optimal finite-horizon value function over the infinite POMDP belief space is piecewise linear and convex. This value function is defined by a finite set of α-functions, expressed as linear combinations of Gaussians, assuming that the sensor, action and reward models are also Gaussian-based. They then show that, for a fairly general class of POMDP models in which all functions of interest are modeled by Gaussian mixtures, all belief updates and value iteration backups can be carried out analytically and exactly. Experimental results, using the previously proposed randomized point-based value iteration algorithm PERSEUS [10], have been demonstrated on a simple robot planning task in a continuous domain. The same authors have extended their work to deal with continuous action and observation sets by designing effective sampling approaches [11]. The basic idea is that general continuous POMDPs can be cast in the continuous-state POMDP paradigm via sampling strategies. This work, contrary to the approach presented in this paper, assumes the POMDP model is known, which is very restrictive for the type of real applications we consider.

To sum up, and as previously stated, the common approach to continuous POMDPs assumes a discretization of the state space, which can be a poor model for the underlying system. This remains true if we cast the general continuous POMDP in the continuous-state POMDP paradigm as Porta and colleagues did [11]. Our approach avoids this by proposing a "real" continuous POMDP where (i) the state space, (ii) the action space and (iii) the observation space are all continuous and potentially multidimensional.

Other approaches related to our work revolve around the modeling of time-series data using dynamical systems [12]. In the case of nonlinear time-series analysis, Wang and colleagues introduced Gaussian Process Dynamical Models (GPDMs) to learn models of human pose and motion from high-dimensional motion capture data. Their GPDM is a latent variable model comprising a low-dimensional latent space with associated dynamics, as well as a map from the latent space to an observation space. The authors marginalized out the model parameters in closed form by using Gaussian process priors for both the dynamical and observation mappings, which results in a nonparametric model for dynamical systems. Their approach is thus based on a continuous hidden Markov model and does not include actions and rewards.

III. CONTINUOUS POMDP

We consider a continuous POMDP to be defined by the tuple (S, A, Z, T, O, R, b_1, γ):

• S ⊂ R^m: The state space, which is continuous and potentially multidimensional.
• A ⊂ R^n: The action space, which is continuous and potentially multidimensional. It is assumed here that A is a closed subset of R^n, so that the optimal control, as defined below, exists.
• Z ⊂ R^p: The observation space, which is continuous and potentially multidimensional.
• T : S × A × S → [0, ∞): The transition function, which specifies the conditional probability density T(s, a, s') = p(s'|s, a) of moving to state s', given the current state s and the action a performed by the agent.
• O : S × A × Z → [0, ∞): The observation function, which specifies the conditional probability density O(s', a, z') = p(z'|s', a) of observing z' when moving to state s' after doing action a.
• R : S × A × R → [0, ∞): The reward function, which specifies the conditional probability density R(s', a, r') = p(r'|s', a) of getting reward r', given that the agent performed action a and reached state s'.
• b_1: The initial state distribution (a density over S).
• γ: The discount factor.

The posterior distribution over the current state s of the agent, denoted b and referred to as the belief, can be maintained via Bayes' rule as follows:

    b^a_z(s') \propto O(s', a, z') \int_S T(s, a, s') \, b(s) \, ds    (1)

The agent's behavior is determined by its policy π, which specifies the probability of performing action a in belief state b. The optimal policy π* is the one that maximizes the expected sum of discounted rewards over the infinite horizon and is obtained by solving Bellman's equation:

    V^*(b) = \max_{a \in A} \left[ g(b, a) + \gamma \int_Z f(z \mid b, a) \, V^*(b^a_z) \, dz \right]    (2)

where V* is the value function of the optimal policy and is the fixed point of Bellman's equation. Also,

    g(b, a) = \int_S b(s) \int_S T(s, a, s') \int_{\mathbb{R}} r \, R(s', a, r) \, dr \, ds' \, ds

is the expected reward when performing a in belief state b, and

    f(z \mid b, a) = \int_S \int_S O(s', a, z) \, T(s, a, s') \, b(s) \, ds \, ds'

is the conditional probability density of observing z after doing action a in belief b. The next belief state, obtained by performing a Bayes' rule update of b with action a and observation z, is denoted by b^a_z.
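To make the belief update of equation (1) concrete, the following sketch maintains a simple particle approximation of b and applies one Bayes update for a given action and observation. It is only an illustration of the update rule, not part of the paper's method; trans_sample and obs_density are hypothetical stand-ins for the densities T and O.

```python
import numpy as np

def belief_update(particles, weights, a, z, trans_sample, obs_density, rng):
    """One Monte Carlo Bayes update of a particle belief, mirroring eq. (1).

    particles    : (N, m) array of state samples representing the belief b
    weights      : (N,) importance weights of those samples
    a, z         : action performed and observation received
    trans_sample : (s, a, rng) -> s'      draws s' ~ T(s, a, .)   [hypothetical]
    obs_density  : (s', a, z) -> float    evaluates O(s', a, z)   [hypothetical]
    """
    # Propagate every particle through the transition density T(s, a, .)
    next_particles = np.array([trans_sample(s, a, rng) for s in particles])
    # Re-weight by the observation likelihood O(s', a, z)
    likelihoods = np.array([obs_density(sp, a, z) for sp in next_particles])
    new_weights = weights * likelihoods
    # Normalization plays the role of the proportionality sign in eq. (1)
    new_weights /= new_weights.sum()
    return next_particles, new_weights
```

With Gaussian transition and observation noise, the two callables could, for instance, wrap scipy.stats.multivariate_normal; a resampling step would typically be added to avoid weight degeneracy.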

IV. GP-POMDP

In this paper, we consider the problem of optimal control in such a continuous POMDP where T, O and R are unknown a priori. In order to make optimal decisions, the model is learned from sequences of actions and observations through Gaussian Process (GP) modeling [13], [14]. GPs are a class of probabilistic models which focus on the points where a function is instantiated, through a Gaussian distribution over the function space. A Gaussian distribution is usually parameterized by a mean vector and a covariance matrix; in the case of GPs, these two parameters are functions of the space on which the process operates.

For our problem of optimal control, we propose to use a Gaussian Process Dynamical Model (GPDM) [12] to learn the transition, observation and reward functions, and then propose a planning algorithm which selects the action that maximizes long-term expected rewards under the current model.

In order to use GPs to learn the transition, observation and reward models, we first assume that the dynamics can be expressed in the following form:

    s_t = T'(s_{t-1}, a_{t-1}) + \epsilon_T
    z_t = O'(s_t, a_{t-1}) + \epsilon_O    (3)
    r_t = R'(s_t, a_{t-1}) + \epsilon_R

where ε_T, ε_O and ε_R are zero-mean Gaussian white noise. The functions T', O' and R' are unknown deterministic functions that return, respectively, the next state, observation and reward. We seek to learn this model and maintain a maximum likelihood estimate of the state trajectory by using a GPDM with optimization methods.

Let us denote by S = [s_1, s_2, ..., s_{N+1}]^T the state sequence matrix, A = [a_1, a_2, ..., a_N]^T the action sequence matrix, Z = [z_2, z_3, ..., z_{N+1}]^T the observation sequence matrix and r = [r_2, r_3, ..., r_{N+1}]^T the reward sequence vector.

A. Gaussian Process Dynamical Model

A GPDM consists of a mapping from a latent space, which is assumed to follow first-order Markov dynamics, to an observation space. This probabilistic mapping, for the purpose of POMDPs, is defined as S × A → Z × R and represents the observation-reward function, where the actions are fully observable and the states are latent variables. The dynamic mapping in latent space is S × A → S and corresponds to the transition function. Both mappings are defined as linear combinations of nonlinear basis functions:

    s_t = \sum_i b_i \, \phi_i(s_{t-1}, a_{t-1}) + n_s
    y_t = \sum_j c_j \, \psi_j(s_t, a_{t-1}) + n_y    (4)

Here, B = [b_1, b_2, ...] and C = [c_1, c_2, ...] are weights for the basis functions φ_i and ψ_j, and n_s and n_y are zero-mean, time-invariant white Gaussian noise. The joint observation space is denoted Y = Z × R, and therefore y_t is the observation vector z_t augmented with the reward r_t. According to Bayesian methodology, the unknown parameters of the model should be marginalized out. This can be done in closed form [15], [16] by specifying an isotropic Gaussian prior on the columns of C, yielding the multivariate Gaussian likelihood

    p(Y \mid S, A, \bar{\alpha}) = \frac{|W|^N}{\sqrt{(2\pi)^{N(p+1)} |K_Y|^{p+1}}} \exp\left( -\frac{1}{2} \mathrm{tr}\!\left( K_Y^{-1} Y W^2 Y^\top \right) \right)    (5)

where Y = [y_2, y_3, ..., y_{N+1}]^T is the sequence of joint observations-rewards and K_Y is a kernel matrix computed with the hyperparameters ᾱ = {α_1, α_2, ..., W}. The matrix W is diagonal and contains p + 1 scaling factors to account for the different variances in the observed data dimensions. Using a single kernel function for both states and actions, the elements of the kernel matrix are (K_Y)_{i,j} = k_Y([s_i, a_{i-1}], [s_j, a_{j-1}]). Note that, for the observation case, the action is the one performed at the previous time step. The kernel used for this mapping is, with x_i = [s_i, a_{i-1}], the common radial basis function:

    k_Y(x, x') = \alpha_1 \exp\left( -\frac{\alpha_2}{2} \|x - x'\|^2 \right) + \alpha_3^{-1} \delta_{x,x'}    (6)

where the hyperparameter α_1 represents the output variance, α_2 is the inverse width of the RBF, which represents the smoothness of the function, and α_3^{-1} is the variance of the additive noise n_y.

For the dynamic mapping in the latent space, similar methods are applied, but they need to account for the Markov property. By specifying an isotropic Gaussian prior on the columns of B, the marginalization can be done in closed form. The resulting probability density over latent trajectories is:

    p(S \mid A, \bar{\beta}) = \frac{p(s_1)}{\sqrt{(2\pi)^{N(m+n)} |K_X|^{m+n}}} \exp\left( -\frac{1}{2} \mathrm{tr}\!\left( K_X^{-1} S_{out} S_{out}^\top \right) \right)    (7)

Here, p(s_1) is the initial state distribution b_1, assumed isotropic Gaussian, and S_out = [s_2, ..., s_{N+1}]^T is the matrix of latent coordinates representing the unobserved states. The N × N kernel matrix K_X is constructed from X = [x_1, ..., x_N]^T, where x_i = [s_i, a_i] with 1 ≤ i ≤ t − 1. The kernel function for the dynamic mapping is:

    k_X(x, x') = \beta_1 \exp\left( -\frac{\beta_2}{2} \|x - x'\|^2 \right) + \beta_3 \, x^\top x' + \beta_4^{-1} \delta_{x,x'}    (8)

where the additional hyperparameter β_3 represents the output scale of the linear term. Since all hyperparameters are unknown and, following [17], uninformative priors are applied so that p(ᾱ) ∝ \prod_i \alpha_i^{-1} and p(β̄) ∝ \prod_i \beta_i^{-1}, this results in a probabilistic interpretation of the action-observation sequence:

    p(Z, r, S, \bar{\alpha}, \bar{\beta} \mid A) = p(Y \mid S, A, \bar{\alpha}) \, p(S \mid A, \bar{\beta}) \, p(\bar{\alpha}) \, p(\bar{\beta})    (9)
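As an illustration of the covariance functions in equations (6) and (8), the following sketch evaluates both kernels and builds the corresponding Gram matrices. The function and variable names are ours, the delta terms are implemented as a Kronecker delta on training indices, and the quadratic-time loops are kept purely for readability; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def k_Y(x, xp, alpha, same_index=False):
    """Observation-reward kernel of eq. (6): RBF plus additive noise."""
    a1, a2, a3 = alpha
    rbf = a1 * np.exp(-0.5 * a2 * np.sum((x - xp) ** 2))
    noise = (1.0 / a3) if same_index else 0.0  # delta term on training points
    return rbf + noise

def k_X(x, xp, beta, same_index=False):
    """Latent dynamics kernel of eq. (8): RBF + linear term + additive noise."""
    b1, b2, b3, b4 = beta
    rbf = b1 * np.exp(-0.5 * b2 * np.sum((x - xp) ** 2))
    linear = b3 * float(np.dot(x, xp))
    noise = (1.0 / b4) if same_index else 0.0
    return rbf + linear + noise

def gram_matrix(X, kernel, hyper):
    """Kernel matrix K with K[i, j] = kernel(X[i], X[j], hyper)."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j], hyper, same_index=(i == j))
    return K

# Illustrative usage (shapes are assumptions): rows of X are the inputs
# x_i = [s_i, a_{i-1}] stacked row-wise, with three observation hyperparameters.
# X = np.hstack([S[:-1], A])
# KY = gram_matrix(X, k_Y, (1.0, 1.0, 100.0))
```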

To learn the GPDM, it has been proposed to minimize the joint negative log-posterior of the unknowns, −ln p(S, ᾱ, β̄, W | Z, r, A), which is given by:

    L = \frac{m+n}{2} \ln |K_X| + \frac{1}{2} \mathrm{tr}\!\left( K_X^{-1} S_{out} S_{out}^\top \right) + \frac{1}{2} s_1^\top s_1 - N \ln |W| + \frac{p+1}{2} \ln |K_Y| + \frac{1}{2} \mathrm{tr}\!\left( K_Y^{-1} Y W^2 Y^\top \right) + \sum_i \ln \alpha_i + \sum_i \ln \beta_i + C    (10)

For more details on GPDM learning methods, see [12]. The resulting MAP estimate of {S, ᾱ, β̄, W} is then used to estimate the transition and observation-reward functions:

    s_{t+1} = k_X([s_t, a_t])^\top K_X^{-1} S_{out}
    y_{t+1} = k_Y([s_{t+1}, a_t])^\top K_Y^{-1} Y    (11)

Here, k_X and k_Y denote the vectors of covariances between the test point and their respective training sets, with respect to the hyperparameters ᾱ and β̄ respectively. Since S is part of a MAP estimation, the states are assumed to be the most probable under the current model, and the last element is therefore considered the current state. This model estimate, combined with the latent trajectory S, corresponds to our complete belief conditioned on observations, actions and rewards.

B. Online Planning

By using the MAP estimate to update the belief of the agent conditioned on observations in the environment, we are able to address the question of how the agent should act given its current belief. To achieve this, we adopt an online planning algorithm which predicts future trajectories of actions, observations and rewards to estimate the expected return of each sampled immediate action. Each prediction is made with the corresponding Gaussian process mean. Therefore, we approximate the value function with a finite horizon for each predicted state. Afterward, the agent simply performs the sampled immediate action with the highest estimate. In fact, the sole purpose of Algorithm 1 is the evaluation of the learned GP-POMDP model:

Algorithm 1 PLANNING(b, M, E, d)
 1: if d = 0 then Return 0, null
 2: r* ← −∞
 3: for i = 1 to M do
 4:   Sample a ∈ A uniformly
 5:   Predict z|b, a and r|b, a
 6:   b' ← LEARNGPDM(b, a, z, r)
 7:   r', a' ← PLANNING(b', E, E, d − 1)
 8:   r ← r + γ r'
 9:   if r > r* then (r*, a*) ← (r, a)
10: end for
11: Return r*, a*

At time t, the agent would call the function PLANNING(b_t, M, E, d), where b_t is its current belief, M is the number of actions to sample at the first level (selection) of the tree, E is the number of actions to sample at the next levels (evaluation) of the tree, and d is the depth of the tree, i.e. the planning horizon. The agent would then execute the action a_t = a* in the environment and compute its new belief b_{t+1} using the function LEARNGPDM(b_t, a_t, z_{t+1}, r_{t+1}) after observing z_{t+1} and r_{t+1}. As described earlier, learning the GPDM is done by optimizing equation (10), which returns the maximum a posteriori state sequence and hyperparameters. The planning algorithm is then called again with the new belief b_{t+1} to choose the next action to perform, and so on.

V. EXPERIMENTS

In this paper, we investigate the problem of learning to control the height of an autonomous blimp online, without knowing its pre-defined physical model. We have opted for blimps because, in comparison to other flying vehicles, they have the advantage of operating at relatively low speed and they keep their altitude without necessarily moving. Furthermore, they are not as sensitive to control errors as helicopters, for instance [18]. In fact, the problem of controlling a blimp has been studied intensively in the past, particularly in the control community. Zhang and colleagues [19] introduced a PID controller combined with a vision system to guide a blimp. For their part, Wyeth and Barron [20] used a reactive controller, whereas Rao et al. [21] used a fuzzy logic controller. Other researchers have used the non-linear dynamics to control several flight phases [22], [23].

All these approaches assume prior knowledge about the dynamics or pre-defined models. Our approach does not assume such prior knowledge (the prior is a zero-mean Gaussian process) and aims to learn the control policy directly, without relying on a priori information about the payload, the temperature, or the air pressure. In this context, the approach closest to the one described here has recently been presented by Rottmann et al. [18]. They applied model-free Monte Carlo reinforcement learning and used Gaussian processes to deal with the continuous state-action space. More precisely, they assumed a completely observable continuous state-action space and used filtered state estimates to approximate the Q-function with a Gaussian process. Conversely, our model-based approach aims to learn the model by using a Gaussian process prior and to approximate the value function with the learned posterior. In fact, the blimp application is an illustrative example that supports the applicability of Bayesian reinforcement learning in continuous POMDPs with Gaussian processes, an idea that can certainly be used in more complex realistic applications.

The goal of our experiments is to validate the use of GPDMs for online model identification and state estimation, combined with a planning algorithm for online decision-making. The agent aims to maintain a target height of 0 while using as little energy as possible. Each episode starts at zero height and zero velocity and is run for 100 time steps. The time discretization is 1 second, according to the simulator dynamics.

Fig. 1. Averaged received reward.
Fig. 2. Averaged distance from the target height.

Observations are the height (m) and the velocity (m/s) with additive zero-mean Gaussian noise of 1 cm standard deviation. Rewards are corrupted with the same Gaussian noise. The blimp dynamics are simulated using a deterministic transition function where the resulting state, comprising height and velocity, is then degraded by zero-mean isotropic Gaussian noise with 0.5 cm standard deviation. The continuous action set is defined as A = [−1, 1], where the bounds represent the maximum downward and upward actions.

To learn and plan from observations and rewards, we trained a GPDM at each time step, providing the planning algorithm with a model and the associated believed trajectory. In order to ensure a good tradeoff between exploration and exploitation, we randomly sampled actions according to a time-decreasing probability. The planning parameters were set to M = 25 samples for the action selection level, E = 10 samples for the state evaluation levels, d = 3 as tree depth, and a discount factor γ = 0.9.

The reward function was crucial for our experiment, since the agent has only a few steps to learn and the search depth of the planning phase is limited. Consequently, the chosen reward function is the sum of two Gaussians. The first Gaussian, with a standard deviation of 1 meter, gives half a point for being roughly around the goal. The second one, with a 5 cm standard deviation, gives another half point if the blimp is only a few centimeters away from its goal. A cost for actions is also incurred to force the blimp to reach its goal with minimal energy usage. All observed rewards are corrupted by Gaussian noise with 0.1 variance.

Figure 1 shows the averaged reward received by the agent. We notice that the curve stabilizes around a return of 0.8. This value corresponds to the reward received when the blimp is 5 cm away from the target height while not doing any significant action. In Figure 2, we observe that the blimp's averaged distance from height 0 seems to stabilize around 10 cm. Figure 3 shows the prediction error on the observation-reward sequence using equation (11) with the last state estimate. We define the error as the sum of the absolute errors in predicting the noisy observations and rewards. Figure 4 shows that most trajectories have a large variance at the beginning, and that the variance then decreases as the agent gets more observations. Each box represents the 25th and 75th percentiles, the central mark is the median, and the whiskers extend to the non-outlier range. A comparison with a random policy, which is not reported, showed that this policy rapidly diverges to over 3 meters from the goal.

VI. DISCUSSION

Making decisions in stochastic and partially observable environments with an unknown model is a very important problem, because a parametric model is often difficult to specify. To address this problem, we proposed a continuous formulation of the POMDP where the model's functions are assumed to be Gaussian processes. This assumption allowed us to build on the work on Gaussian Process Dynamical Models [12], adding the action and reward parts to fulfill the POMDP's requirements. Although the agent had no prior information, our experimental results on the blimp problem show that the learned model is able to provide useful predictions for our online planning algorithm. Considering the chosen noise variances, most agents achieved reasonable performance by receiving good averaged rewards over time.

However, comparing our approach to previous ones on the basis of control performance is hard, since the model is learned and only very few data are used to do so. To compare the approaches under control performance criteria, two points should first be considered. First, learning from larger data sets using sparsification techniques would surely improve prediction performance. Second, once the model is learned, adding filtering methods instead of the maximum a posteriori estimate would also help in the decision process.

Indeed, the maximum a posteriori estimation of the state trajectory limits the learning efficiency. While using a Gaussian distribution over the state trajectory could provide better results, it also entails learning with uncertain data sets which map Gaussian inputs to Gaussian outputs. Such a learning procedure is one of the interesting research avenues.
Fig. 3. Averaged prediction error.
Fig. 4. Box plot of the heights distributions.

Moreover, as we proposed a reinforcement learning algorithm which uses Bayesian learning for the model, the model's uncertainty should be used in the planning phase to obtain an exploration-exploitation trade-off. Further improvements include learning from different trajectories, as well as finding better and separate kernel functions for states and actions, especially when the transition function is stochastic while actions are noise-free, as in our example. Finally, it is possible to use the learning procedure for model evaluation by replacing the Gaussian processes' prior mean with the functions to evaluate.

To conclude, results on the blimp control task are promising. Further investigations on Bayesian reinforcement learning in continuous POMDPs with Gaussian processes are currently in progress. A harder control problem is currently being considered, using a Gaussian approximation instead of a simple point estimate for the state trajectory estimation. The posterior distribution over the model's functions is thus computed from inferred sequences of belief states.

REFERENCES

[1] M. Duff, "Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes," Ph.D. dissertation, University of Massachusetts Amherst, Amherst, MA, 2002.
[2] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans, "Bayesian sparse sampling for on-line reward optimization," in Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005, pp. 956–963.
[3] P. S. Castro and D. Precup, "Using linear programming for Bayesian exploration in Markov decision processes," in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 2437–2442.
[4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan, "An analytic solution to discrete Bayesian reinforcement learning," in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 697–704.
[5] F. Doshi, J. Pineau, and N. Roy, "Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs," in Proceedings of the 25th International Conference on Machine Learning (ICML), 2008, pp. 256–263.
[6] S. Ross, B. Chaib-draa, and J. Pineau, "Bayes-adaptive POMDPs," in Advances in Neural Information Processing Systems 20, 2008, pp. 1225–1232.
[7] S. Ross, B. Chaib-draa, and J. Pineau, "Bayesian reinforcement learning in continuous POMDPs with application to robot navigation," in Proceedings of the 2008 IEEE International Conference on Robotics and Automation (ICRA), 2008, pp. 2845–2851.
[8] S. Thrun, "Monte Carlo POMDPs," Advances in Neural Information Processing Systems, vol. 12, pp. 1064–1070, 2000.
[9] J. Porta, M. Spaan, and N. Vlassis, "Robot planning in partially observable continuous domains," in Robotics: Science and Systems, Cambridge, MA, 2005.
[10] M. Spaan and N. Vlassis, "Perseus: Randomized point-based value iteration for POMDPs," Journal of Artificial Intelligence Research, vol. 24, pp. 195–220, 2005.
[11] J. Porta, N. Vlassis, M. Spaan, and P. Poupart, "Point-based value iteration for continuous POMDPs," Journal of Machine Learning Research, vol. 7, pp. 2329–2367, 2006.
[12] J. Wang, D. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 283–298, 2008.
[13] A. O'Hagan, "Some Bayesian numerical analysis," Bayesian Statistics, vol. 4, pp. 345–363, 1992.
[14] C. Williams and D. Barber, "Bayesian classification with Gaussian processes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1342–1351, 1998.
[15] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
[16] R. M. Neal, Bayesian Learning for Neural Networks (Lecture Notes in Statistics), 1st ed. Springer, 1996.
[17] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models," in Advances in Neural Information Processing Systems 18. MIT Press, 2006, pp. 1441–1448.
[18] A. Rottmann, C. Plagemann, P. Hilgers, and W. Burgard, "Autonomous blimp control using model-free reinforcement learning in a continuous state and action space," in Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007, pp. 1895–1900.
[19] H. Zhang and J. Ostrowski, "Visual servoing with dynamics: Control of an unmanned blimp," in Proceedings of the 1999 IEEE International Conference on Robotics and Automation, vol. 1, 1999.
[20] G. Wyeth and I. Barron, "An autonomous blimp," in Proceedings of the IEEE International Conference on Field and Service Robotics, 1997, pp. 464–470.
[21] J. Rao, Z. Gong, J. Luo, and S. Xie, "A flight control and navigation system of a small size unmanned airship," in Proceedings of the 2005 IEEE International Conference on Mechatronics and Automation, vol. 3, 2005.
[22] S. Gomes and J. Ramos Jr., "Airship dynamic modeling for autonomous operation," in Proceedings of the 1998 IEEE International Conference on Robotics and Automation, vol. 4, 1998.
[23] E. Hygounenc, I. Jung, P. Soueres, and S. Lacroix, "The autonomous blimp project of LAAS-CNRS: Achievements in flight control and terrain mapping," International Journal of Robotics Research, vol. 23, no. 4, pp. 473–511, 2004.

