List of FIXME's

Add a paragraph about the differences between state estimation and pattern recognition. Include remarks of Tine that pattern recognition can be seen as Multiple Model (see chapter about parameter estimation).
Not clear: the introduction says nothing about sections 4-5.
Include information from Herman's URKS course here; among other things, say something about the choice of the prior.
Is there a difference between accuracy and precision?
Include cross-reference to introductory application examples document?
I guess
KG: sounds weird for continuous systems.
Is this a true constraint?
Do we ever use these kinds of models with uncertainty "directly" on the inputs?
Describe the one-to-one relationship between functional representation and PDF notation somewhere.
Even I don't understand anymore what I was meaning :)
Introduce the general Bayesian approach first: not applied to time-dependent systems [109].
If so, add an example!
To add: continuous-time models (differential equations) and discrete-time models (difference equations).
TL: there also exist "belief networks", "graphical models", "Bayesian networks", etc. Do they belong here? Synonyms?
TL: u, θ_f and f?
Both graph and equation modeling.
Add references, among others Isard and Blake for the condensation algorithm.
KG: Discuss the algorithm in more detail, assuming the reader knows what MC techniques are; see also the appendix of course.
Figure out how exactly this works.
Uses the EKF as proposal density.
TL: do not understand the next two.
TL: move to the MC chapter.
Needs to be extended.
KG: lose correlation between measured features in the map due to the inaccurately known pose of the robot, or not.
KG: Is optimizing this pdf, without taking into account the state, the best way to do parameter estimation?
KG: Look for a solution for this!! IMHO only easy to solve for linear systems and Gaussian distributions.
And Grid-based HMMs?
Work this out further.
KG: Relate this to Pattern Recognition.
Relation to model: MDP - Markov Models with reward; POMDP - Hidden Markov Models with reward.
KG: Look for a better formulation.
KG: Maybe add an index to enumerate the constraints.
Contents

I Introduction
1 Introduction
1.1 Application examples
1.2 Overview of this report
3 System modeling
3.1 Continuous state variables, equation modeling
3.2 Continuous state variables, network modeling
3.3 Discrete state variables, Finite State Machine modeling
3.3.1 Markov Chains/Models
3.3.2 Hidden Markov Models (HMMs)
II Algorithms
5 Parameter learning
5.1 Augmenting the state space
5.2 EM algorithm
5.3 Multiple Model Filtering
6 Decision Making
6.1 Problem formulation
6.2 Performance criteria for accuracy of the estimates
6.3 Trajectory generation
6.4 Optimization algorithms
6.5 If the sequence of actions is restricted to a parameterized trajectory
6.6 Markov Decision Processes
6.7 Partially Observable Markov Decision Processes
6.8 Model-free learning algorithms
7 Model selection
D Particle filters
D.1 Introduction
D.2 Joint a posteriori density
D.2.1 Importance sampling
D.2.2 Sequential importance sampling (SIS)
D.3 Theory vs. reality
D.3.1 Resampling (SIR)
D.3.2 Choice of the proposal density
D.4 Literature
D.5 Software
Part I

Introduction
Chapter 1
Introduction
This document aims to compare different Bayesian (also referred to as probabilistic) filters (or estimators) with respect to their appropriateness for the state/parameter estimation of (dynamical) systems. By Bayesian or probabilistic we simply mean that we try to model uncertainty explicitly. E.g., when measuring the dimensions of an object with a 3D coordinate measuring machine, a Bayesian approach does not only provide the estimates for these dimensions, it also gives the accuracy of these estimates. The approach will be illustrated with examples from multiple domains, but most algorithms will be applied to the (static) localization problem of objects. This report aims to verify which simplifying assumptions the different filters make. The goal of this document is to provide a kind of manual that helps you decide which filter is appropriate to solve your estimation problem.
A lot of people only speak of "good and better" filters. This suggests that they don't understand the problem they're dealing with: there are no such things as good, better and best filters. Some filters are just more appropriate (faster and more accurate) for solving specific problems. Simply testing a certain filter on a certain problem is not a good way of solving it. One should start by analyzing the problem, check which model assumptions are justified, and then decide which filter is most appropriate to solve the problem. One should be able to predict more or less (rather more) whether the filter will give good results or not.
Figure 1.1: Mobile robot platform Lias, equipped with a laser scanner (arrow). Note that the laser scanner should be much lower than in this photo to be able to recognize transport pallets on the ground!
constituted by a bunch of distance measurements in radial order (every 0.5°). The vector containing these measurements is
denoted as z_k. Depending on the location (position x, y and orientation θ, see Figure 1.2) of the pallet, a number of clusters (coming from the legs ("pootjes") of the transport pallet) will be visible on the scan in a certain geometrical order.

Figure 1.2: (a) Photo of a transport pallet; (b) scan of a transport pallet made by a radial laser scanner; (c) definition of x, y and θ.

Because the robot has to move towards the pallet, the position and orientation of the pallet with respect to the robot will change according
to robot motion. We cannot immediately estimate the location from the raw laser scanner measurements: the location of
the transport pallet is a hidden variable or hidden state of our dynamic system. We can denote the location of the transport
pallet with respect to the robot at timestep k as the vector x(k). A concrete location will then be denoted as x_k:
\[ x_k = \begin{pmatrix} x_k \\ y_k \\ \theta_k \end{pmatrix} \]
If we know the state vector x(k) = xk , we can predict the measurements of the laser scanner (a vector where each compo-
nent will be a distance at a certain angle of the laser scanner) at timestep k through a measurement model z(k) = g(x(k)).
This measurement model incorporates information about the geometry of the transport pallet, the sensor characteristics, and about its (the measurement model's) inaccuracy. Indeed, neither the sensor nor the measurement model is perfectly known. Therefore, the sensor measurement prediction is not 100% sure (not infinitely accurate), even if the state is known.
Therefore, in a Bayesian context, the measurement prediction is characterised by a likelihood probability density function
(PDF):
\[ P\big(z(k) \mid x(k) = x_k\big) \]
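As an illustration (not from the report: the beam geometry, noise level and function names below are invented for the sketch), such a likelihood can be evaluated as a Gaussian around the predicted measurement:

```python
import math

# Hypothetical measurement model g for ONE laser beam: the predicted distance
# from the robot to a pallet at relative pose (x, y, theta). Toy geometry: the
# real g would also depend on the pallet shape and the beam angle.
def g(state):
    x, y, theta = state
    return math.hypot(x, y)

SIGMA = 0.02  # assumed sensor noise standard deviation [m]

# Likelihood P(z(k) | x(k) = x_k): a Gaussian centered on the prediction g(x_k).
def likelihood(z, state):
    mu = g(state)
    return math.exp(-0.5 * ((z - mu) / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))

state = (3.0, 4.0, 0.0)          # predicted distance: 5.0 m
p_near = likelihood(5.01, state)  # measurement close to the prediction
p_far = likelihood(5.20, state)   # measurement far from the prediction
```

A measurement close to the predicted distance gets a much higher likelihood than one far away, which is exactly the information the filter uses in its correction step.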
But we are interested in the reverse problem, i.e. calculating the pdf over x(k) once a measurement z(k) = z_k is made: P(x(k) | z_k).
Application of Bayes’ rule (often called inference) allows us to calculate the location of the pallet given this measurement
and the prior pdf P (x(k)). This a priori estimate is the knowledge (pdf) we have about the state x before the measurement
z(k) = z k is made (due to initial knowledge, previous measurements, . . . ). Note that P (z k ) is constant and independent
of x(k) and hence is just a “normalising factor” in the equation.
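Written out for this example, Bayes' rule reads:

```latex
P\big(x(k) \mid z(k) = z_k\big)
  = \frac{P\big(z(k) = z_k \mid x(k)\big)\, P\big(x(k)\big)}{P(z_k)} .
```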
When moving with the robot towards the transport pallet, the relative location of the pallet with respect to the robot changes.
When the robot motion is known, the changes in x can be calculated. In order to know the robot motion, the robot is equipped with so-called internal sensors: encoders at the driving wheels and a gyroscope. These internal sensors are used
to calculate the translational velocity v and the angular velocity ω of the robot. In this example, vk and ωk are supposed
to be perfectly known at each time tk (ideal encoders and gyroscope, no wheel slip, . . . ). We consider the velocities as the
inputs uk to our dynamical system:
\[ u_k = \begin{pmatrix} v_k \\ \omega_k \end{pmatrix} \]
We can model our system through the system equations (or model/process equations), if the time step ∆t is small enough. Note that we immediately made a discrete-time model of our system! With a vector function, we denote this as
\[ x(k) = f\big(x(k-1), u_{k-1}\big). \]
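The report does not spell out f for this example; one plausible form, assuming a first-order discretization of unicycle kinematics in world coordinates, would be:

```latex
f\!\begin{pmatrix} x_{k-1} \\ y_{k-1} \\ \theta_{k-1} \end{pmatrix}
  = \begin{pmatrix}
      x_{k-1} + v_{k-1}\,\Delta t \cos\theta_{k-1} \\
      y_{k-1} + v_{k-1}\,\Delta t \sin\theta_{k-1} \\
      \theta_{k-1} + \omega_{k-1}\,\Delta t
    \end{pmatrix} .
```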
The uncertainty over x(k − 1) will be propagated to x(k); moreover, because of the inaccuracy of the system model, the uncertainty over x(k) will grow. In a Bayesian context, we calculate the pdf over x(k), given the pdf over x(k − 1) and the input u_{k−1},
\[ P\big(x(k) \mid P(x(k-1)), u_{k-1}\big), \]
and obtain for the system equation
\[ P(x(k)) = \int P\big(x(k) \mid x(k-1), u_{k-1}\big)\, P(x(k-1))\, dx(k-1). \]
object) are collected in the state vector x. The location of the fixed object is described with respect to a fixed world frame,
the location of the manipulated object is described with respect to a frame on the robot end effector. Therefore, the state is
static, i.e. the real values of these locations do not change during the experiment.
The measurements at a certain time tk are collected in the vector z k (these are 6 contact force and moment measurements,
6 translational and rotational velocities of the manipulated object and/or 6 position and orientation measurements of the
manipulated object). A measurement model describes the relation between these measurements and the state vector:
\[ g_k\big(z(k), x(k)\big) = 0. \]
The model g is different for the different measurement types (velocities, forces, . . . ) and for the different contacts between the contacting objects (point-plane, edge-edge, . . . ).
Example 1.4 Pattern recognition examples such as OCR and speech recognition.
But unfortunately, . . . , this estimation is deterministic (least-squares approach). The measurement errors on the measured points are not taken into account. I think the measurement error is considered to be negligible with respect to the desired surface accuracy, and in order to assume this, an awful lot of measurement points are taken and "filtered" beforehand into a smaller bunch of "measured points". However, when using a Bayesian approach the number of measurement points will be lower, i.e., just enough to get the desired surface accuracy. Moreover, the measurement machine and touching device probably do not have the same accuracy in the different touch directions, which is not at all taken into account in the current (non-Bayesian) approach.
Reverse engineering problems can be seen as a SLAM (Simultaneous Localization and Mapping) problem between different points.
• Chapter 5 describes how inaccurately known parameters of your system and measurement models can also be esti-
mated;
Figure 2.1: A model should contain only those properties of the physical system that are relevant for the application in which it will be used. Hence the relation world-model is not a one-to-one relation.
3. State: Every model can be fully described at a certain instant in time by all of its states. Different models of the same system can result either in dynamic states (dynamic model) or in static states (static model).
Example 2.2 Localization of a transport pallet with a mobile robot. The location of the transport pallet with respect to the mobile robot is dynamic; with respect to the world it is static (provided that during the experiment this pallet is not moved).
4. Parameter: a value that, although it can be unknown and should thus be estimated, is constant (in time) in the physical model.
Example 2.3 When using an ultrasonic sensor with an additive Gaussian sensor characteristic but an unknown (constant) variance σ², this variance is considered a parameter of the model. However, when a certain sensor has a behaviour that is dependent on the temperature, we consider the temperature to be a state of the system. So the distinction parameter/state can depend on the chosen model. When localising a transport pallet with a mobile robot, the diameter of the wheel+tyre will in most models be a parameter, but for some applications it will be necessary to model the diameter as a state (suppose the robot odometry has to be known very accurately in an environment with highly varying temperature).
CHAPTER 2. DEFINITIONS AND PROBLEM DESCRIPTION
5. Inputs/measurements:
6. PDF/Information/Accuracy/Precision
Remark 2.2 a “physically moving” system does not necessarily imply that the estimation problem has a dynamic state!
When identifying the masses and lengths of the robot links, the whole robot can be moving around, but the parameters to
estimate (masses, lengths) are constant.
System model A lot of engineering problems require the estimation of the system state in order to be able to control the system (= process). The state vector is called static when it does not change in time, or dynamic when it changes according to the system model as a function of the previous value of the state itself and an input. The input, measured by proprioceptive ("internal") sensors, describes how the state changes; it does not give an absolute measure for the actual state value. The system model is subject to uncertainty (often denoted as noise); the noise characteristics (the probability density function, or some of its characteristics, e.g. its mean and covariance) are supposed to be known.

Example 2.4 When a mobile robot wants to move around autonomously, it needs to know its location (state). This state is dynamic, since the robot location changes whenever the robot moves. The inputs to the system can be e.g. the currents sent to the different motors of the mobile robot, or the velocity of the wheels measured by encoders, . . . The system model describes how the robot's location changes with these inputs. However, "unmodeled" effects such as slipping wheels, flexible tires, etc. occur. These effects should be reflected in the system model uncertainty.
Measurement model The uncertainty in the system model makes the state estimate more and more uncertain in time. To
cope with this, the system needs some exteroceptive sensors (“external” sensors) whose measurements yield information
about the absolute value of the state.
When these sensors do not directly and accurately observe the state, i.e. when there is no one-to-one relationship between states and observations, a filter or estimator is used to calculate the state estimate. This process is called state estimation ("localization" in mobile robotics). The filter contains information about the system (through the system model) and about the sensors (through the measurement model, which expresses the relation between state, sensor parameters (see example below) and measurements). In this case, the measurement model is subject to uncertainty, e.g. due to the sensor noise/uncertainty, of which the characteristics (probability density function, or some of its characteristics) are supposed to be known.
Example 2.5 If a mobile robot is not equipped with an “accurate enough” (“enough” means here enough for a particular
goal we want to achieve) GPS system, the state variables (denoting the robot’s location) are not “directly” observable from
the system. This is for example the case when it has only infrared sensors which measure the distances to the environment’s
objects. When the robot is equipped with a laser scanner and each scan point is considered to be a measurement, the current
angle of the laser scanner is a sensor parameter and the measurement is a scalar (distance to the nearest object in a certain
direction). We can also consider the measurements at all angles of the laser scanner at once. In this case, our measurement
is a vector and our model uses no sensor parameters.
Parameters
Remark 2.3 The above description uses the restriction that the system and measurement models and their noise characteris-
tics are perfectly known. Chapter 5 extends the problem to system and measurement models with uncertainty characteristics
described by parameters that are inaccurately known, but constant.
2.3. BAYESIAN APPROACH
Table 2.1: Symbols used in this text and synonyms often found in literature.

Symbol   Name
x        state vector, hidden state/values
z        measurement vector, observations, sensor data, sensor measurement
u        input vector
s        sensor parameters
f        system model, process model, dynamics (functional notation)
g        measurement model, observation model, sensing model
θ_f      parameters of the system model and its uncertainty characteristics
θ_g      parameters of the measurement model and its uncertainty characteristics
Notations Table 2.1 lists the symbols used in the rest of this text and some synonyms often found in literature. x(k), z(k), u(k) and s(k) denote these variables at a certain discrete time instant t = k; x_k, z_k, u_k, s_k, f_k and g_k describe specific values for these variables. We also define:
X(k) = {x(0), . . . , x(k)};  Z(k) = {z(1), . . . , z(k)};
U(k) = {u(0), . . . , u(k)};  S(k) = {s(1), . . . , s(k)};
X_k = {x_0, . . . , x_k};  Z_k = {z_1, . . . , z_k};
U_k = {u_0, . . . , u_k};  S_k = {s_1, . . . , s_k};
F_k = {f_0, . . . , f_k};  G_k = {g_1, . . . , g_k}.
Remark 2.4 Note that the variables x(k), z(k), u(k), s(k) for different time steps k still indicate the same variables; e.g., x(k − 1) and x(k) denote in fact "the same variable": they correspond to the same state space. The notation x(k), where the time is indicated at the variable itself, is introduced in order to have "readable" equations. Indeed, if we denote the time step as a subscript to the pdf function P(.), formulas become very ugly, because most of the used pdf functions are functions of a lot of variables (x, z, u, s, θ_f, . . . ), most of which, though not all, are specified at certain (and even different) time steps.
This conditional PDF is often called the a posteriori pdf and denoted by Post(x(k)).
Calculating Post(x(k)) is called diagnostic reasoning: given the effects (the data), find the internal (not directly measured) variables (state) that can explain them. This is much harder than causal reasoning: given the internal variables (state), predict the effects (the data). Think of a disease (state) and its symptoms (data): finding the disease, given the symptoms (diagnostic reasoning), is much harder than predicting the symptoms of a certain disease (causal reasoning).
Bayes' rule relates the diagnostic problem (calculating Post(x(k))) to two causal problems:
\[ \mathrm{Post}(x(k)) = \alpha\, P\big(z_k \mid x_k, Z_{k-1}, U_{k-1}, S_k, \theta_f, \theta_g, F_{k-1}, G_k, P(x(0))\big)\; P\big(x_k \mid Z_{k-1}, U_{k-1}, S_k, \theta_f, \theta_g, F_{k-1}, G_k, P(x(0))\big) \tag{2.2} \]
where
\[ \alpha = \frac{1}{P\big(z_k \mid Z_{k-1}, U_{k-1}, S_k, \theta_f, \theta_g, F_{k-1}, G_k, P(x(0))\big)} \]
is a normalizer (i.e. independent of the state random variable). The terms in Bayes' rule are often described as
\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} . \]
Eq. (2.2) is valid for all possible values of x(k), which we write as:
\[ \mathrm{Post}(x(k)) = \alpha\, P\big(z_k \mid x(k), Z_{k-1}, U_{k-1}, S_k, \theta_f, \theta_g, F_{k-1}, G_k, P(x(0))\big)\; P\big(x(k) \mid Z_{k-1}, U_{k-1}, S_k, \theta_f, \theta_g, F_{k-1}, G_k, P(x(0))\big). \tag{2.3} \]
The last factor of this expression is the pdf over x at time k, just before the measurement is taken, and is further on denoted as Prior(x(k)):
\[ \mathrm{Prior}(x(k)) \triangleq P\big(x(k) \mid Z_{k-1}, U_{k-1}, S_k, \theta_f, \theta_g, F_{k-1}, G_k, P(x(0))\big). \]
Remark 2.5 Expression (2.1) is also known as the filtering distribution. Another formulation of the problem estimates the joint distribution Post(X(k)):
\[ \mathrm{Post}(X(k)) = P\big(X(k) \mid Z_k, U_{k-1}, S_k, \theta_f, \theta_g, F_{k-1}, G_k, P(X(0))\big) \tag{2.4} \]
Remark 2.6 As previously noted, the model parameters θ_f and θ_g in formulas (2.1)-(2.4) are supposed to be known. This limits the problem to a pure state estimation problem (namely estimating x(k) or X(k)). In some cases, the model parameters are not accurately known and also need to be estimated ("parameter learning"). This leads to a concurrent state-estimation-and-parameter-learning problem, which is discussed in Chapter 5.
Apart from the Markov assumptions, Eqs. (2.6) and (2.7) do not make any assumptions, neither on the nature of the hidden variables to be estimated (discrete, continuous), nor on the nature of the system and measurement models (graphs, equations, . . . ).
Remark 2.7 We talk about Markov models and not Markov systems: a system can be modeled in different ways, and it is possible that for the same system both Markovian and non-Markovian models can be written. E.g., think of the following one-dimensional system: a body is moving in one direction with a constant acceleration (an apple falling from a tree under gravity). We are interested in the position x(k) of the body at all times k. When the state is chosen to be the object's position, x = [x], the model is not Markovian, as the state at the last time step is not enough to predict the state evolution; at least the states from two different time steps are necessary for this prediction. When the state is chosen to be the object's position and velocity, x = [x v]^T, the state evolution can be predicted from only one state estimate.
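This remark is easy to check numerically; a small sketch (the constants are illustrative) compares both state choices for the constant-acceleration body:

```python
dt, g0 = 0.1, 9.81  # time step [s] and constant acceleration [m/s^2]

# Markovian state choice: x = (position, velocity). ONE past state suffices.
def f(state):
    x, v = state
    return (x + v * dt + 0.5 * g0 * dt * dt, v + g0 * dt)

# Non-Markovian state choice: x = (position,). Predicting x(k) needs TWO
# past positions: x(k) = 2 x(k-1) - x(k-2) + g0 * dt**2.
def f_two_step(x_prev, x_prev2):
    return 2.0 * x_prev - x_prev2 + g0 * dt * dt

state = (0.0, 0.0)       # start at rest at the origin
positions = [0.0]
for _ in range(10):
    state = f(state)
    positions.append(state[0])

# Both recursions reproduce the exact trajectory x(t) = 0.5 * g0 * t**2.
x2 = f_two_step(positions[1], positions[0])
```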
Remark 2.8 Are there systems which cannot be modeled with Markov models?
Remark 2.9 Note that some pdfs are conditioned on some value of x(k), while others are conditioned on Post(x(k)). In literature both are denoted as "x(k)" behind the conditional sign "|"; in this text, however, we do not use this double notation, in order to stress the difference between conditioning on a value of x(k) and conditioning on the pdf of x(k).
E.g., Prior(x(k)) = P(x(k) | u_{k−1}, θ_f, f_{k−1}, Post(x(k − 1))) indicates the pdf over x(k), given the known values u_{k−1}, θ_f, f_{k−1} and the pdf Post(x(k − 1)). Hence, this formula expresses how the pdf over x(k − 1) propagates to the pdf over x(k) through the process model.
E.g., the likelihood P(z_k | x(k), s_k, θ_g, g_k) indicates the probability of a measurement z_k, given the known values s_k, θ_g, g_k and the currently considered value of the state x(k). Hence, this formula expresses the sensor characteristic: what is the pdf over z(k), given a state estimate and the measurement model. This sensor characteristic does not depend on which values of x(k) are more or less probable (it does not depend on the pdf over x(k)).
Remark 2.10 Proof of Eq. (2.6). To keep the derivation somewhat clearer, u_{k−1}, θ_f and f_{k−1} are replaced by the single symbol H_{k−1}. Eq. (2.6) is
\[ P\big(x(k) \mid \mathrm{Post}(x(k-1)), H_{k-1}\big) = \int P\big(x(k) \mid x(k-1), H_{k-1}\big)\, \mathrm{Post}(x(k-1))\, dx(k-1) \tag{2.8} \]
This follows from two observations:
1. the pdf over x(k − 1), given the posterior pdf over x(k − 1) and H_{k−1}, is the posterior pdf itself, i.e. P(x(k − 1) | Post(x(k − 1)), H_{k−1}) = Post(x(k − 1));
2. the new state is independent of the pdf over the previous state if the value of the previous state is given, i.e. P(x(k) | x(k − 1), Post(x(k − 1)), H_{k−1}) = P(x(k) | x(k − 1), H_{k−1}).
E.g., given
• the probabilities that today it rains (0.3) or that it doesn't rain (0.7) (Post(x(k − 1)));
• the transition probabilities that the weather is the same as the day before (0.9) or not (0.1);
• the knowledge that it does rain today (x(k − 1)),
what are the chances that it will rain tomorrow (P(x(k) | x(k − 1), Post(x(k − 1)), H_{k−1}))? The pdf of rain tomorrow (0.9) only depends on the fact that it rains today, x(k − 1), and the transition probability, and not on Post(x(k − 1))!
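The same arithmetic as a sketch:

```python
# Weather example: conditioning on a VALUE of x(k-1) vs. on its pdf.
post_today = {"rain": 0.3, "dry": 0.7}             # Post(x(k-1))
trans = {"rain": {"rain": 0.9, "dry": 0.1},        # P(x(k) | x(k-1))
         "dry":  {"rain": 0.1, "dry": 0.9}}

# Given the value x(k-1) = "rain", Post(x(k-1)) drops out entirely:
p_rain_given_rain = trans["rain"]["rain"]          # 0.9

# Without the value, marginalize over Post(x(k-1)) as in Eq. (2.8):
p_rain_tomorrow = sum(post_today[s] * trans[s]["rain"] for s in post_today)
# 0.3 * 0.9 + 0.7 * 0.1 = 0.34
```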
Chapter 3

System modeling
Modeling the system corresponds to (i) choosing a state, e.g. for a map-building problem it can be the status (occupied/free) of grid points, positions of features, . . . ; (ii) choosing the measurements (choosing the sensors); and (iii) writing down the system and measurement models. This chapter describes how (Markovian) system and measurement models can be written down: a system with a continuous state space is modeled by equations (Section 3.1) or by a network (Section 3.2); a system with a discrete state space is modeled by a Finite State Machine (FSM) (Section 3.3).
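The equation models referred to below are commonly written in the additive-noise state-space form (one possible form; the noise can also enter non-additively):

```latex
x(k) = f\big(x(k-1),\, u_{k-1}\big) + w_{k-1}, \qquad
z(k) = g\big(x(k),\, s_k\big) + v_k .
```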
where
• both f () and g() can be (and most often are!) non-linear functions
• w_{k−1} and v_k are noises (uncertainties) for which the stochastic distribution (or at least some of its characteristics) is supposed to be known. v and w are mutually uncorrelated and uncorrelated between sampling times (this is a necessary condition for the model to be Markovian).
Examples of models with correlated uncertainties:
– correlation between process and measurement uncertainty: when a measurement changes the state, e.g. when measuring the speed of electrons (or other elementary particles) by photons, momentum is exchanged at the collision and the velocity of the electron will be different after this measurement (thanks to Wouter for the example);
– correlation of the process uncertainty over time: deviations from the model (process noise) which depend on the current state or on unmodeled effects such as humidity;
– correlation of the measurement uncertainty over time: a not explicitly modeled temperature drift of the sensor.
Note that the uk−1 and sk are assumed to be exact (not stochastic variables). If e.g. the proprioceptive sensors (which
measure uk−1 ) are inaccurate, this uncertainty is modeled by wk−1 .
Markov chains (sometimes called first-order Markov chains) are models of a category of systems most often denoted as Finite State Machines or automata. These are systems that have a finite number of states. At any time instant, the system is in a certain state, and can go from one state to another, depending on a random process, a discrete PDF, an input to the system, or a combination of these. Figure 3.1 shows a graph representation of a system that changes from state to state depending on a discrete PDF only, i.e.
\[ P\big(x(k) = \text{State 3} \mid x(k-1) = \text{State 2}\big) = a_{23} \]
The name first-order Markov chains, which is sometimes used in literature, stems from the fact that the probability of being in a certain state x_k at step k depends only on the previous time instant. This is what we called Markov models in the previous section. Some authors consider Markov models in a broader sense, and use the term "first-order Markov chains" to denote what we mean in this text by Markov chains.
In literature, the transition matrix (a discrete version of the system equation!) is often represented by A.
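A sketch of one prediction step with such a matrix (the numbers are illustrative; row i holds P(x(k) = State j | x(k − 1) = State i), so each row sums to one):

```python
import numpy as np

# Transition matrix A: A[i, j] = P(x(k) = State j+1 | x(k-1) = State i+1).
A = np.array([[0.8, 0.15, 0.05],
              [0.2, 0.60, 0.20],
              [0.0, 0.30, 0.70]])

p = np.array([0.3, 0.7, 0.0])  # current belief over the three states
p_next = p @ A                  # one discrete prediction step
```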
Literature
• “First paper”: [94]
• Good introduction: [42], [61]: here measurements are defined as inherently linked to the transition between two states, whereas the normal approach considers them linked to a certain state. But the two approaches are entirely equivalent (this can be seen by redefining the state space; see e.g. section 2.9.2 on p. 35 of [61]). See also http://www.univ-st-etienne.fr/eurise/pdupont/bib/hmm.html.
3.3. DISCRETE STATE VARIABLES, FINITE STATE MACHINE MODELING
Software
Extensions Standard HMMs are not very powerful models and are appropriate for very particular cases only, so some extensions have been made to be able to use them for more complex and thus more realistic situations:

In a standard HMM the time spent in a state implicitly follows a geometric distribution. As this is for most systems very unrealistic, Variable Duration HMMs [70, 71] solve this problem by introducing an extra, parametric, pdf P(D_j = d) (i.e. a pdf predicting how long one typically stays in state j) to model the duration in a certain state. These are very appropriate for speech recognition.
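The duration argument can be made concrete in a small sketch (hypothetical numbers): a self-transition probability a_jj implies the geometric law P(D_j = d) = a_jj^(d−1) (1 − a_jj), which a VDHMM replaces by an arbitrary parametric duration pdf:

```python
def geometric_duration_pmf(a_jj, d):
    """Implicit duration pdf of a standard HMM with self-transition a_jj."""
    return a_jj ** (d - 1) * (1.0 - a_jj)

# A VDHMM replaces the geometric law by an explicit pdf P(D_j = d);
# the numbers below are purely illustrative.
DURATION_PMF = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}

def explicit_duration_pmf(d):
    return DURATION_PMF.get(d, 0.0)

# The geometric law always peaks at d = 1; the explicit pdf can peak anywhere,
# e.g. at d = 3 here, as a phoneme duration model in speech recognition might.
```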
2 http://www.kulnet.kuleuven.ac.be/LDP/HOWTO/Speech-Recognition-HOWTO/index.html
Part II
Algorithms
Chapter 4
State Estimation Algorithms
Literature describes different filters that calculate Bel(x(k)) or Bel(X(k)) for specific system and measurement models.
Some of these algorithms calculate the full Belief function, others only some of its characteristics (mean, covariance, . . . ).
This chapter gives an overview of the basic recursive (i.e., Markov) filters, without claiming to give a complete enumeration
of the existing filters.
To be able to determine which filter is applicable to a certain problem, one should verify certain things:
2. Do we represent the pdfs involved as parametric distributions, or do we use sampling techniques in order to handle non-parametric distributions?
3. Are we solving a position tracking problem or a global localisation problem (unimodal or multimodal distributions)?
...
This section uses the previously defined symbols (x_k, z_k, . . . ). The detailed algorithms in the appendices, however, are described with the symbols most common in the literature for each specific filter.
Filter Markov Chains for discrete state variables directly solve Equations (2.6)–(2.7) for all possible values of the state. For continuous state variables they use numerical techniques, such as Monte Carlo methods (often abbreviated as MC, see chapter 8), in order to "discretize" the state space1. Another applied discretization technique is the use of a grid over the entire state space. The corresponding filters are called MC Markov Chains and Grid-based Markov Chains. The Grid-based filters sample the state space in a uniform way, whereas the MC filters apply a different kind of sampling, most often referred to as importance sampling (see chapter 8; hence the name "particle filters"). Monte Carlo (particle) filters are also often referred to as the Condensation algorithm (mainly in vision applications), Survival of the fittest, or bootstrap filters. The most general and perhaps clearest term appears to be sequential Monte Carlo methods. Extensions include, among others:
• Smoothing the particles' posterior distribution by a Markov Chain MC move step [38]
• Taking better proposal distributions than the system transition pdf [38]: prior editing (not good), rejection methods, the Auxiliary particle filter [91], the Extended Kalman particle filter, the Unscented Kalman particle filter
1 Note that for continuous pdfs which can be parameterized, this discretization is not necessary; filters for these systems are described in section 4.4.
• any-time implementations
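A minimal bootstrap (sequential importance resampling) sketch, for a hypothetical 1-D random-walk system with Gaussian measurement noise (all numbers illustrative): each cycle propagates the particles through the system model, weights them with the measurement likelihood, and resamples:

```python
import math
import random

random.seed(0)

def bootstrap_step(particles, z, motion_std=0.5, meas_std=1.0):
    """One bootstrap-filter cycle on a 1-D random-walk system."""
    # 1. Prediction: sample from the system transition pdf.
    predicted = [x + random.gauss(0.0, motion_std) for x in particles]
    # 2. Correction: importance weights from the measurement likelihood.
    weights = [math.exp(-0.5 * ((z - x) / meas_std) ** 2) for x in predicted]
    total = sum(weights)
    weights = [w / total for w in weights]
    # 3. Resampling: "survival of the fittest".
    return random.choices(predicted, weights=weights, k=len(predicted))

particles = [random.uniform(-10.0, 10.0) for _ in range(1000)]
for z in [2.0, 2.1, 1.9, 2.0]:           # measurements of a state near 2
    particles = bootstrap_step(particles, z)
estimate = sum(particles) / len(particles)   # posterior mean, close to 2
```

Using the transition pdf as proposal (step 1) is precisely what the extensions above try to improve upon.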
Literature
Filter HMM filter algorithms typically calculate all state variables instead of just the last one: they solve Eq. (2.4) instead of Eq. (2.1). However, they do not estimate the whole probability distribution Bel(X(k)); they just give the sequence of states X^k = x_0, . . . , x_k for which the joint a posteriori distribution Bel(X(k)) is maximal. The filter algorithm is often called the Viterbi algorithm (based on the Forward-Backward algorithm). The version of both these algorithms for VDHMMs is fully described in appendix A. The algorithms for MCHMMs should be easy to derive from these algorithms.
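A sketch of the Viterbi recursion on a toy two-state HMM (the classic weather example; all probabilities hypothetical, log-probabilities omitted for brevity). It returns the single maximizing state sequence, not the full Bel(X(k)):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence argmax_X P(X | Z) for a toy HMM."""
    # V[s]: probability of the best path ending in state s at the current step.
    V = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    path = {s: [s] for s in states}
    for z in obs[1:]:
        V_new, path_new = {}, {}
        for s in states:
            prob, prev = max((V[p] * trans_p[p][s] * emit_p[s][z], p)
                             for p in states)
            V_new[s] = prob
            path_new[s] = path[prev] + [s]
        V, path = V_new, path_new
    best = max(V, key=V.get)
    return path[best]

# Hypothetical two-state weather HMM.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
sequence = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
# sequence == ["Sunny", "Rainy", "Rainy"]
```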
• Verify whether MCHMM filters sample the whole distribution, or whether they also just provide a state sequence that maximizes eq. 2.4.
• Connection with MC Markov Chains! Is there a difference? I think the only difference is that MCHMMs search a solution to the more general problem (eq. 2.4), whereas MC Markov Chains just estimate the last hidden state x_k (eq. 2.1).
• Add HMM bookmarks?
Filter KFs estimate two characteristics of the pdf Bel(x(k)), namely the minimum-mean-squared-error (MMSE) estimate and its covariance. Hence, their use is mainly restricted to unimodal distributions. A big advantage of KFs over the other filters is that they are computationally less expensive. The KF algorithm is described in appendix B.
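A scalar sketch of one KF cycle (hypothetical 1-D system x_k = f·x_{k−1} + b·u + w, z_k = h·x_k + v, with illustrative noise levels q and r); note that only the MMSE estimate and its covariance are propagated:

```python
def kalman_step(x, P, z, u=0.0, f=1.0, b=0.0, h=1.0, q=0.01, r=0.1):
    """One scalar KF cycle for x_k = f x_{k-1} + b u + w,  z_k = h x_k + v."""
    # Prediction with the system model.
    x_pred = f * x + b * u
    P_pred = f * P * f + q
    # Correction with the measurement model.
    K = P_pred * h / (h * P_pred * h + r)        # Kalman gain
    x_new = x_pred + K * (z - h * x_pred)
    P_new = (1.0 - K * h) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0                      # prior mean and covariance
for z in [0.9, 1.1, 1.0, 0.95]:      # noisy measurements of a constant near 1
    x, P = kalman_step(x, P, z)
# x is now close to 1, and P has shrunk well below the prior covariance
```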
Literature
Extensions KFs are often applied to systems with non-linear system and/or measurement functions:
• Unimodal: the (Iterated) Extended KF [8] and Unscented KF [102] linearize the nonlinear system and measurement
equations.
• Multimodal: Gaussian sum filters [5] (often called multi hypothesis tracking in mobile robotics): for every mode
(every Gaussian) an EKF is run.
Remark 4.1 Note that the KF doesn't assume Gaussian pdfs; but for Gaussian pdfs the two characteristics estimated by the KF fully describe Bel(x(k)).
Filter The filter calculates the full (exponential) Bel(x(k)); the algorithm is given in appendix C.
Literature [33]
4.6 Concluding
Filter                  | X | P(X)        | Varia
Grid-based Markov Chain | C | any         | computationally expensive
MC Markov Chain         | C | any         | subdivide (rejection, Metropolis, . . . )
HMM                     | D | any         | x = max P(X), eq. (2.4)
VDHMM                   | D | any         | x = max P(X), eq. (2.4)
MCHMM                   | C | any         | ?????
KF                      | C | unimodal    | f() and g() linear
EKF, UKF                | C | unimodal    | f() and g() not too nonlinear
Gaussian sum            | C | multimodal  | f() and g() not too nonlinear
Daum                    | C | exponential | rare cases (appendix C)
Chapter 5
Parameter learning
All Bayesian approaches use explicit system and measurement models of their environment. In some cases, the construction of models good enough to approximate the system state in a satisfying manner is impossible. Speech is an ideal example: every person has a different way of pronouncing different letters (such as in "Bruhhe"). The system and measurement models and the characteristics of their uncertainties are then written as functions of inaccurately known parameters, collected in the vectors θ_f and θ_g respectively. In a Bayesian context, estimation of those parameters would typically be done by maintaining a pdf over the space of all possible parameter values. The inaccurately known parameters θ_f and θ_g have to be estimated online, next to the estimation of the state variables. This is often called parameter learning (mapping in mobile robotics). The initial state estimation problem of Chapters 2–4 is augmented to a concurrent-state-estimation-and-parameter-learning problem ("simultaneous localization and mapping (SLAM)" or "concurrent mapping and localization (CML)" in mobile robotics terminology).
To simplify the notation of the following equations, θ_f and θ_g are collected into one parameter vector θ = [θ_f^T θ_g^T]^T. Remark that any estimate for this vector is valid for all time steps (parameters are constant in time . . . ).
If the parameter vector θ comes from a limited discrete distribution, the problem can be solved by multiple model filtering (Section 5.3). If it does not, the only 'right' way (IMHO) to handle the concurrent-state-estimation-and-parameter-learning problem is to augment the state vector with the inaccurately known parameters (Section 5.1). However, if a lot of parameters are inaccurately known, the resulting state estimation problem has up till now only been solved successfully with Kalman Filters (on problems that obey the corresponding assumptions). In other cases, the computationally less expensive Expectation-Maximization algorithm (EM, Section 5.2) is often used as an alternative. The EM algorithm subdivides the problem into two steps: a state estimation step and a parameter learning step. The algorithm is a method for searching a local maximum of the pdf P(Z^k | θ) (consider this pdf as a function of θ).
Parameter learning is also sometimes called model building. IMHO, this can be used to construct models in which some parameters are not accurately known, or in situations where it is very difficult to construct an off-line, analytical model. I'll try to clarify this with the example of the localization of a transport pallet by a mobile robot equipped with a laser scanner.
It is very difficult (but not impossible) to create off-line a fully correct measurement distribution (i.e. one taking sensor uncertainty/characteristics into account) for a state x = [x, y, θ]^T:

P(z_k | x(k) = [x_k y_k θ_k]^T, s_k, θ_g, g_k)
Figure 5.1 illustrates this. Experiments should point out whether off-line construction of this likelihood function is faster
than learning.
Filters Augmenting the state space is possible for all state estimators, as long as the new state, system and measurement models still obey the estimator's assumptions. In the specific case of a Kalman Filter, estimating state and parameters simultaneously by augmenting the state vector is called "Joint Kalman Filtering" [122].
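A sketch of such augmentation (hypothetical 1-D system with an unknown constant bias b, noise-free measurements for brevity): the state [x] is extended to [x, b] with the trivial parameter dynamics b_k = b_{k−1}, and one KF estimates both jointly:

```python
def joint_kf_step(s, P, z, q=1e-4, r=0.1):
    """Augmented system: x_k = x_{k-1} + b_{k-1} + w, b_k = b_{k-1};
    measurement z_k = x_k + v.  So F = [[1, 1], [0, 1]] and H = [1, 0]."""
    x, b = s
    # Prediction: s_pred = F s, P_pred = F P F^T + Q (process noise on x only).
    s_pred = [x + b, b]
    P_pred = [[P[0][0] + P[0][1] + P[1][0] + P[1][1] + q, P[0][1] + P[1][1]],
              [P[1][0] + P[1][1], P[1][1]]]
    # Correction: gain K = P_pred H^T / (H P_pred H^T + r).
    S = P_pred[0][0] + r
    K = [P_pred[0][0] / S, P_pred[1][0] / S]
    innov = z - s_pred[0]
    s_new = [s_pred[0] + K[0] * innov, s_pred[1] + K[1] * innov]
    P_new = [[(1 - K[0]) * P_pred[0][0], (1 - K[0]) * P_pred[0][1]],
             [P_pred[1][0] - K[1] * P_pred[0][0],
              P_pred[1][1] - K[1] * P_pred[0][1]]]
    return s_new, P_new

true_b = 0.5
s, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]   # prior: b unknown, variance 1
x_true = 0.0
for _ in range(50):
    x_true += true_b                  # the true system drifts with bias 0.5
    s, P = joint_kf_step(s, P, x_true)
# s[1] now approaches the unknown parameter b = 0.5
```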
[Figure 5.1: figure content lost in extraction]
5.2 EM algorithm
As described in the introduction, augmenting the state space with many parameters often leads to computational difficulties if a KF is not a good model for the (non-linear) system. The EM algorithm is an often used technique for these cases. However, it is not a Bayesian technique for parameter estimation and (thus :-) not an ideal solution for parameter estimation!
The EM algorithm consists of two steps:
The EM algorithm consists of two steps:
1. the E-step (or state estimation step): the pdf over all previous states X(k) is estimated based on the current best parameter estimate θ^{k-1}:

P(X(k) | Z^k, U^{k-1}, S^k, θ^{k-1}, F^{k-1}, G^k, P(X(0)))
Remark 5.1 Note that this is a batch method with a non-constant evaluation time: for every new map, the whole state sequence is recalculated, which makes it not very well suited for real-time applications.
With this pdf, the expected value of the logarithm of the complete-data likelihood function P(X(k), Z^k | U^{k-1}, S^k, θ, F^{k-1}, G^k, P(X(0))) is evaluated:

Q(θ, θ^{k-1}) = E[ log P(X(k), Z^k | U^{k-1}, θ, . . . , P(X(0))) | P(X(k) | Z^k, U^{k-1}, θ^{k-1}, . . . , P(X(0))) ]   (5.1)

Here E[ f(X(k)) | P(X(k) | Z^k, U^{k-1}, S^k, θ^{k-1}, F^{k-1}, G^k, P(X(0))) ] means that the expectation of the function f(X(k)) is sought when X(k) is a random variable distributed according to the a posteriori pdf, i.e. the integral ∫ f(X(k)) P(X(k) | Z^k, U^{k-1}, S^k, θ^{k-1}, F^{k-1}, G^k, P(X(0))) dX(k).
NOTE: θ^{k-1} is not a parameter of this function, but its value does influence the function! The evaluation of this integral can be done with e.g. Monte Carlo methods. If we are using a particle filter (see appendix D), expression (5.1) reduces to
Q(θ, θ^{k-1}) = Σ_{i=1}^{N} log P(X_i(k), Z^k | U^{k-1}, S^k, θ, F^{k-1}, G^k, P(X(0)))
where X_i(k) denotes the i-th sample of the complete-data likelihood pdf (which we don't know). Application of Bayes' rule and the Markov assumption on the previous expression gives

Q(θ, θ^{k-1}) ≈ Σ_{i=1}^{N} log [ P(Z^k | X_i(k), U^{k-1}, S^k, θ, F^{k-1}, G^k, P(X(0))) P(X_i(k) | U^{k-1}, S^k, θ, F^{k-1}, G^k, P(X(0))) ]
             = Σ_{i=1}^{N} log [ P(Z^k | X_i(k), S^k, θ_g, G^k) P(X_i(k) | U^{k-1}, θ_f, F^{k-1}, P(X(0))) ]

The first factor inside the logarithm is the measurement equation, with θ considered as a parameter and specific values for the state and the measurement. The second factor is the result of a dead-reckoning exercise, with θ considered as a parameter. However, we don't know this pdf as a function of θ :-(.
2. the M-step (or parameter learning step): a new estimate θ^k is calculated for which the (incomplete-data) likelihood function increases:

p(Z^k | U^{k-1}, S^k, θ^k, F^{k-1}, G^k, P(X(0))) > p(Z^k | U^{k-1}, S^k, θ^{k-1}, F^{k-1}, G^k, P(X(0))).   (5.2)

This estimate θ^k is calculated as the θ which maximizes the expected value of the logarithm of the complete-data likelihood function:

θ^k = argmax_θ Q(θ, θ^{k-1});   (5.3)

or at least increases it (this version of the EM algorithm is called the Generalized EM algorithm (GEM)):

Q(θ^k, θ^{k-1}) > Q(θ^{k-1}, θ^{k-1})   (5.4)

Appendix E proves that a solution to (5.3) or (5.4) satisfies (5.2).
Remark 5.2 Note that in this section, the superscript k in θ^k refers to the estimate for θ in the kth iteration. This estimate is valid for all timesteps because θ is static.
Remark 5.3 Sometimes the E-step calculates p(X(k), Z^k | U^{k-1}, S^k, θ^{k-1}, F^{k-1}, G^k, P(X(0))) instead of p(X(k) | Z^k, U^{k-1}, S^k, θ^{k-1}, F^{k-1}, G^k, P(X(0))). Both differ only in a factor p(Z^k | U^{k-1}, S^k, θ^{k-1}, F^{k-1}, G^k, P(X(0))). This factor is independent of the variable θ and hence does not affect the M-step of the algorithm.
Remark 5.4 Note that the EM algorithm calculates at each time step the full pdf over X, but it only calculates one θ which
maximizes or increases Q(θ, θ k−1 ).
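As a generic illustration of the E- and M-steps (deliberately not the state-space version above): EM for a two-component Gaussian mixture with unknown means, where the hidden variable is the component label and θ = (μ1, μ2); all data and numbers below are synthetic, and the variances are assumed known (unit) for brevity:

```python
import math
import random

random.seed(1)

# Synthetic data: two Gaussian clusters with true means 0 and 4 (unit variance).
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(4.0, 1.0) for _ in range(200)])

def em_step(data, mu1, mu2):
    def lik(x, mu):                      # unnormalized N(mu, 1) likelihood
        return math.exp(-0.5 * (x - mu) ** 2)
    # E-step: posterior responsibility of component 1 for each sample,
    # based on the current estimate theta^{k-1} = (mu1, mu2).
    resp = [lik(x, mu1) / (lik(x, mu1) + lik(x, mu2)) for x in data]
    # M-step: theta^k = argmax Q(theta, theta^{k-1}); closed form here.
    mu1 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
    mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / sum(1 - r for r in resp)
    return mu1, mu2

mu1, mu2 = -1.0, 1.0                 # crude initial parameter guess
for _ in range(30):
    mu1, mu2 = em_step(data, mu1, mu2)
# (mu1, mu2) now approximate the true means (0, 4)
```

Each iteration provably does not decrease the incomplete-data likelihood, mirroring (5.2); like the general algorithm it only finds a local maximum.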
Filters
1. All HMM filters allow the use of EM. The algorithm is most often known as the Baum-Welch algorithm (appendix A gives the concrete formulas for the VDHMM; for a derivation starting from the general EM algorithm, see [61]). And Grid-based HMMs? In the case of MCHMMs, where pdfs are non-parametric, the danger of overfitting is real and regularization is absolutely necessary. Typically cross-validation techniques are used to avoid this (shrinkage and annealing).
Work this out further
1. Model detection (model selection, model switching, multiple model, multiple model hypothesis testing, . . . ) filters
try to identify the “correct” model, the other models are neglected.
2. Model fusion (interacting multiple model, . . . ) filters calculate a weighted state estimate between the models.
Filters Multiple Model Filtering is possible with all filtering algorithms; in practice, however, it is almost only applied with Kalman Filters, because most other filters are computationally too complex to run several of them in parallel.
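A minimal sketch of the model-detection variant for a parameter θ from a limited discrete set (a static scalar example with hypothetical numbers; with dynamic states, each model value would run its own filter in parallel): each measurement reweights the model probabilities with its likelihood:

```python
import math

thetas = [0.0, 1.0, 2.0]             # the limited discrete parameter set
probs = [1.0 / 3.0] * 3              # uniform prior over the three models

def update(probs, z, meas_std=1.0):
    """Reweight the model probabilities with each model's measurement
    likelihood (measurement model: z_k = theta + v, v ~ N(0, meas_std^2))."""
    lik = [math.exp(-0.5 * ((z - t) / meas_std) ** 2) for t in thetas]
    post = [p * l for p, l in zip(probs, lik)]
    total = sum(post)
    return [p / total for p in post]

for z in [1.9, 2.2, 2.1, 1.8]:       # data generated by the theta = 2 model
    probs = update(probs, z)
best = thetas[probs.index(max(probs))]    # model detection picks theta = 2
```

Model fusion would instead keep all models and output the probability-weighted mixture of their estimates.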
Chapter 6
Decision Making
In the previous chapters, we learned how to process measurements in order to obtain estimates for states and parameters. When we have a closer look at the system's process and measurement functions () and (), we see that the system's states and measurements are influenced by the input to the system. This input can be in the process function (e.g. an acceleration input), or in the measurement function (e.g. a parameter of the sensor). The previous chapters assumed that these inputs were given and known. This chapter is about planning (decision making), about the choice of the inputs (control signals, actions). Indeed, a different input can lead to more accurate estimates of the states and/or parameters. So, we want to optimize the input in some way to get "the best possible estimates" (optimal experiment design) and meanwhile perform the task "as well as possible", i.e. to perform active sensing.
An example is mobile robot navigation in a known map. The robot is unsure about its exact position in the map and needs to determine the action that best reveals where it is in the map. Some people make the distinction between active localization and active sensing: the former then refers to robot motion decisions, the latter to sensing decisions (e.g. when a robot is allowed to fire only one sensor at a time).
Section 6.1 formulates the active sensing problem. The performance criteria U_j, which measure the gain in accuracy of the estimates, are explained in section 6.2. Section 6.3 describes possible ways to model the input trajectories. Section 6.4 discusses some optimization procedures. Section 6.8 discusses model-free learning, i.e. the case where no model (or not yet an exact model) of the system is available.
This criterion (or cost function) is a weighted sum of expected costs. The optimal policy π^0 is the one that minimizes this function1. The cost function consists of
1 The index 0 denotes that π^0 contains all actions starting from time 0
1. j terms α_j U_j(...) characterizing the minimization of expected uncertainties U_j(...) (maximization of expected information extraction), and
2. l terms β_l C_l(...) denoting other expected costs and utilities C_l(...), such as time, energy, distances to obstacles, and distance to the goal.

The weighting coefficients α_j and β_l are chosen by the designer and reflect his personal preferences. A reward/cost can be associated both with an action a and with the arrival in a certain state x.
If both the goal configuration and the intermediate time evolution of the system are important with respect to the calculation of the cost function, the terms U_j(...) and C_l(...) are themselves functions of the U_{j,k}(...) and C_{l,k}(...) at different time steps k. If the probability distribution over the state at the goal configuration p(x_N | x_0, π^0) fully determines the rewards, these terms reduce to their last components and V is calculated using U_{j,N} and C_{l,N} only.
V is to be minimized with respect to the sequence of actions, under certain constraints:

c(x_0, . . . , x_N, π^0) ≤ c_max.   (6.4)

The thresholds c_max express for instance maximal allowed velocities and accelerations, maximal steering angle, minimum distance to obstacles, etc.
The problem can be a finite-horizon problem (over a fixed, finite number of time steps) or an infinite-horizon problem (N = ∞). For infinite horizon problems [15, 93]:
• the problem can be posed as one in which we wish to maximize the expected average reward per time step, or the expected total reward;
• in some cases, the problem itself is structured so that the reward is bounded (e.g. a goal reward and a cost for all actions; once in the goal state, one stays at no cost);
• sometimes, one uses a discount factor ("discounting"): rewards in the far future have less weight than rewards in the near future.
6.2 Performance criteria for accuracy of the estimates
• loss function based on the covariance matrix: The covariance matrix P of the estimated pdf of the state x is a measure of the uncertainty of the estimate. Since no scalar function can capture all aspects of a matrix, no loss function suits the needs of every experiment. Minimization of a scalar loss function of the posterior covariance matrix is extensively described in the literature on optimal experiment design [47, 92], where several scalar loss functions have been proposed:
– D-optimal design: minimizes det(P) or log(det(P)). The minimum is invariant to any transformation of the variables x with a nonsingular Jacobian (e.g. scaling). Unfortunately, this measure does not allow one to verify task completion.
– A-optimal design: minimizes the trace tr(P). Unlike D-optimal design, A-optimal design does not have the invariance property. The measure does not even make sense physically if the target states have inconsistent units. On the other hand, this measure allows one to verify task completion (pessimistic).
– L-optimal design: minimizes the weighted trace tr(W P). A proper choice of the matrix W can render the L-optimal design criterion invariant to transformations of the variables x with a nonsingular Jacobian: W has units and is also transformed accordingly. A special case of L-optimal design is the tolerance-weighted L-optimal design [34, 53], which proposes a natural choice of W depending on the desired standard deviations / tolerances at task completion. The value of this scalar function has a direct relation to task completion.
– E-optimal design: minimizes the maximum eigenvalue λ_max(P). Like A-optimal design, this is not invariant to transformations of x, nor does the measure make sense physically if the target states have inconsistent units; but the measure allows one to verify task completion (pessimistic).
• loss function based on the entropy: Entropy is a measure of the uncertainty represented by a probability distribution. This measure carries more information about the pdf than the covariance matrix alone, which is important for multimodal distributions consisting of several small peaks. Entropy is defined as H(x) = E[− log p(x)]. For a discrete distribution (p(x = x_1) = p_1, . . . , p(x = x_n) = p_n) this is:

H(x) = − Σ_{i=1}^{n} p_i log p_i   (6.5)

If we take the change in entropy between the prior distribution p(x|Z_k) and the conditional distribution p(x|Z_{k+1}), this measure corresponds to the mutual information (see appendix G.5). Note that the entropy of the conditional distribution p(x|Z_{k+1}) is not equal to the entropy of the posterior distribution p(x|Z_{k+1}) (see appendix G.3)!
– the Kullback-Leibler distance or relative entropy is a measure for the goodness of fit or closeness of two distributions:

D(p_2(x) || p_1(x)) = E[ log (p_2(x) / p_1(x)) ];   (6.8)

where the expected value E[.] is calculated with respect to p_2(x). For discrete distributions:

D(p_2(x) || p_1(x)) = Σ_{i=1}^{n} p_{2,i} log p_{2,i} − Σ_{i=1}^{n} p_{2,i} log p_{1,i}   (6.9)
Note that the change in entropy and the relative entropy are different measures. The change in entropy only
quantifies how much the form of the pdfs changes; the relative entropy also incorporates a measure of how
much the pdf moves: if p1 (x) and p2 (x) are the same pdf, but translated to another mean value, the change in
entropy is zero, while the relative entropy is not. The question of which measure is best to use for active sensing
is not an issue as the decision making is based on the expectations of the change in entropy or relative entropy,
which are equal.
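The discrete entropy (6.5) and relative entropy (6.9) can be computed directly; the sketch below (hypothetical pdfs) illustrates the point above: a purely translated pdf leaves the entropy unchanged while the relative entropy is nonzero:

```python
import math

def entropy(p):
    """H(x) = -sum_i p_i log p_i (eq. 6.5), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def relative_entropy(p2, p1):
    """D(p2 || p1) (eq. 6.9); assumes p1 > 0 wherever p2 > 0."""
    return sum(p2i * math.log(p2i / p1i)
               for p2i, p1i in zip(p2, p1) if p2i > 0.0)

p1 = [0.7, 0.1, 0.1, 0.1]     # peak at the first state
p2 = [0.1, 0.7, 0.1, 0.1]     # same shape, peak shifted one state
# entropy(p1) == entropy(p2), but relative_entropy(p2, p1) > 0
```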
Remark: Minimizing the covariance matrix is often a more appropriate active sensing criterion than minimizing an entropy
function of the full pdf. This is the case when we want to estimate our state unambiguously, i.e. when we want to use
one value for the state estimate, and reduce the uncertainty of this estimate maximally. The entropy will not always be
a good measure because for multimodal distributions (ambiguity in the estimate) the entropy can be very small while the
uncertainty on any possible state estimate is still large. With the expected value of the distribution as estimate, the covariance
matrix indicates how uncertain this estimate is.
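The covariance-based loss functions above are simple scalar functions of P; a dependency-free sketch on a hypothetical 2x2 posterior covariance (closed-form eigenvalues for the 2x2 case):

```python
import math

P = [[4.0, 1.0],                     # hypothetical 2x2 posterior covariance
     [1.0, 2.0]]

det_P = P[0][0] * P[1][1] - P[0][1] * P[1][0]      # D-optimal: det(P)
trace_P = P[0][0] + P[1][1]                        # A-optimal: tr(P)
W = [[0.25, 0.0],                    # L-optimal weights, e.g. 1/tolerance^2
     [0.0, 1.0]]
trace_WP = W[0][0] * P[0][0] + W[1][1] * P[1][1]   # tr(W P) for diagonal W
mean = 0.5 * trace_P                 # eigenvalues of a symmetric 2x2 matrix:
gap = math.sqrt(mean ** 2 - det_P)   # mean +/- sqrt(mean^2 - det)
lambda_max = mean + gap                            # E-optimal: lambda_max(P)
```

Active sensing would then pick the action whose expected posterior covariance minimizes the chosen scalar.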
• The evolution of a_k can be restricted to a trajectory, described by a reference trajectory and a parametrized deviation from this trajectory. In this way, the optimization problem is reduced to a finite-dimensional, parameterized optimization problem. An example is the parameterization of the deviation as a finite sine/cosine series.
• A more general way to describe the trajectory is as a sequence of freely chosen actions that are not restricted to a certain form of trajectory. The optimization of such a sequence of decisions over time and under uncertainty is called dynamic programming. At execution time, the state of the system is known at every time step. If there is no measurement uncertainty at execution time, the problem is a Markov Decision Process (MDP), for which the optimal policy can be calculated before the task execution for each possible state at every possible time step in the execution (a policy that maximizes the total future expected reward).
If the measurements are noisy, the problem is a Partially Observable Markov Decision Process (POMDP). This means that at execution time the state of the system is not known; only a probability distribution over the states can be calculated. For this case, we need an optimal policy for every possible probability distribution at every possible time step. Needless to say, this complicates the solution a lot.
• Linear programming [90]: linear objective function and constraints, which may include both equalities and inequali-
ties. Two basic methods:
– simplex method: each step is to move from one vertex of the feasible set to an adjacent one with a lower value
of the objective function.
– the interior-point methods, e.g. the primal-dual interior point methods: they require all iterates to satisfy the
inequality constraints in the problem strictly.
• Convex programming (e.g. semidefinite programming) [21]: convex (or linear) objective function and constraints,
which may include both equalities and inequalities.
• Unconstrained optimization
– Line search methods: starts by fixing the direction (Steepest descent direction, any-descent direction, Newton
direction, Quasi-Newton direction, conjugate gradient direction), then identifies an approximate step distance
(with lower function value).
– Trust region methods: first choose a maximum distance, then approximate the objective function in that region (linearly or quadratically), and then seek a direction and step length (steepest descent direction and Cauchy point, Newton direction, Quasi-Newton direction, conjugate gradient direction).
• Constrained optimization: e.g. reduced-gradient methods, sequential linear and quadratic programming methods and
methods based on Lagrangians, penalty functions, augmented Lagrangians.
2. Global optimization methods: The Global Optimization website by Arnold Neumaier2 gives a nice overview of various
optimization problems and solutions.
2 http://solon.cma.univie.ac.at/~neum/glopt.html
• Deterministic
– Branch and Bound methods: Mixed Integer Programming, Constraint Satisfaction Techniques, DC-Methods,
Interval Methods, Stochastic Methods
– Homotopy
– Relaxation
• Stochastic
– Evolutionary computation: genetic algorithms (not good), evolution strategies (good), evolutionary program-
ming, etc
– Adaptive Stochastic Methods: (good)
– Simulated Annealing (not good)
– Clustering
– 2-phase
• smooth approximations: treat the value function V and/or decision rules π as smooth, flexible functions of the state
x and a finite-dimensional parameter vector θ
6.6 Markov Decision Processes
Discrete MDP problems can be solved exactly, whereas the solutions to continuous MDPs can generally only be approx-
imated. Approximate solution methods may also be attractive for solving discrete MDPs with a large number of possible
states or actions.
Standard methods to solve:
Value iteration: optimal solution for finite and infinite horizon problems. For every state x_{k-1} it is rather straightforward to know the immediate reward associated with an action a_{k-1} (a 1-step policy): R(x_{k-1}, a_{k-1}). The goal however is to find the policy π*_0 that maximizes the (expected) reward over the long term (N steps). The future reward is a function of the starting state/pdf x_{k-1} and the executed policy π_k = (a_{k-1}, . . . , a_{N-1}) at time k − 1. For a discrete state space:

V^{π_{k-1}}(x_{k-1}) = R(x_{k-1}, a_{k-1}) + γ Σ_{x_k} P(x_k | x_{k-1}, a_{k-1}) V^{π_k}(x_k)   (6.11)

a_{k-1} = argmax_a [ R(x_{k-1}, a) + γ ∫ V(x_k) p(x_k | x_{k-1}, a) dx_k ]   (6.12)

Bellman's equation:

V_{k-1} = max_a [ R(x_{k-1}, a) + γ ∫ V(x_k) p(x_k | x_{k-1}, a) dx_k ]   (6.13)

a_{k-1} = argmax_a [ R(x_{k-1}, a) + γ Σ_{x_k} V(x_k) p(x_k | x_{k-1}, a) ]   (6.14)
We exploit the sequential structure of the problem: the optimization problem minimizes (or maximizes) V, written as a succession of sequential problems each solved with only one of the N variables a_i. This way of optimizing is called dynamic programming (DP)3 and was introduced by Richard Bellman [10] with his Principle of Optimality, also known as Bellman's principle:
An optimal policy π*_{k-1} has the property that whatever the initial state x_{k-1} and the initial decision a_{k-1} are, the remaining decisions π*_k must constitute an optimal policy with regard to the state x_k resulting from the first decision (x_{k-1}, a_{k-1}).

The intuitive justification of this principle is simple: if π*_k were not optimal as stated, we would be able to maximize the reward further by switching to an optimal policy for the subproblem once we reach x_k. This makes a recursive calculation of the optimal policy possible: an optimal policy for the system when N − i time steps remain can be obtained from the optimal policy for the next time step (i.e. when N − i − 1 steps remain). This is expressed in the Bellman equation (aka functional equation), for a discrete state space:

V^{π*_{k-1}}(x_{k-1}) = max_{a_{k-1}} E[ R(x_{k-1}, a_{k-1}) + γ Σ_{x_k} P(x_k | x_{k-1}, a_{k-1}) V^{π*_k}(x_k) ]   (6.16)
The solution of the MDP problem with dynamic programming is called value iteration [10]. The algorithm starts with the value function V^{π*_N}(x_N) = R(x_N) and computes the value function for one more time step (V^{π*_{k-1}}) based on V^{π*_k}, using Bellman's equation (6.16), until V^{π*_0}(x_0) is obtained. This method works for both finite and infinite MDPs. For infinite horizon problems Bellman's equation is iterated till convergence.
Note that the algorithm may be quite time consuming, since the optimization in the DP must be carried out ∀x_k, ∀a_k (the curse of dimensionality).
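A value-iteration sketch on a hypothetical 3-state, 2-action MDP (all rewards and transition probabilities illustrative; state 2 is an absorbing goal state):

```python
# Hypothetical 3-state, 2-action MDP: P[x][a][x2] = P(x2 | x, a),
# R[x][a] = immediate reward, gamma = discount factor.
gamma = 0.9
R = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]           # state 2 is the "goal"
P = [
    [[0.8, 0.2, 0.0], [0.5, 0.5, 0.0]],            # transitions from state 0
    [[0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],            # transitions from state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],            # the goal is absorbing
]

def value_iteration(tol=1e-8):
    """Iterate Bellman's equation (6.16) until convergence."""
    V = [0.0, 0.0, 0.0]
    while True:
        V_new = [max(R[x][a] + gamma * sum(P[x][a][x2] * V[x2]
                                           for x2 in range(3))
                     for a in range(2)) for x in range(3)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

V = value_iteration()
# The greedy policy w.r.t. V picks, in each state, the action of eq. (6.14).
policy = [max(range(2), key=lambda a: R[x][a] +
              gamma * sum(P[x][a][x2] * V[x2] for x2 in range(3)))
          for x in range(3)]
```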
Policy iteration: optimal solution for infinite horizon problems. Policy iteration is an iterative technique similar to dynamic programming, introduced by Howard [58]. The algorithm starts with any policy (for all states), called π^0. The following iterations are performed:

1. evaluate the value function V^{π^i}(x) for the current policy with an (iterative) policy evaluation algorithm;
2. improve the policy with a policy improvement algorithm: ∀x, find the action a* that maximizes

Q(a, x) = R(x, a) + γ Σ_{x'} P(x' | a, x) V^{π^i}(x')   (6.18)

If Q(a*, x) > V^{π^i}(x), let π^{i+1}(x) = a*; else keep π^{i+1}(x) = π^i(x).
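A policy-iteration sketch on the same kind of hypothetical MDP, with an iterative policy evaluation step and the improvement step of eq. (6.18):

```python
# Hypothetical 3-state, 2-action MDP: P[x][a][x2] = P(x2 | x, a).
gamma = 0.9
R = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
P = [
    [[0.8, 0.2, 0.0], [0.5, 0.5, 0.0]],
    [[0.1, 0.6, 0.3], [0.0, 0.2, 0.8]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation of V^pi for the current policy."""
    V = [0.0, 0.0, 0.0]
    for _ in range(sweeps):
        V = [R[x][policy[x]] + gamma * sum(P[x][policy[x]][x2] * V[x2]
                                           for x2 in range(3))
             for x in range(3)]
    return V

def policy_iteration():
    policy = [0, 0, 0]                       # start from an arbitrary policy
    while True:
        V = evaluate(policy)
        # Improvement step: argmax_a Q(a, x) of eq. (6.18) in every state.
        new = [max(range(2), key=lambda a: R[x][a] +
                   gamma * sum(P[x][a][x2] * V[x2] for x2 in range(3)))
               for x in range(3)]
        if new == policy:
            return policy, V
        policy = new

policy, V = policy_iteration()
```

Capping `sweeps` at a small number turns this exact evaluation into the modified policy algorithm described next.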
Modified policy algorithm: optimal solution for infinite horizon problems. The modified policy algorithm [93] is a combination of the policy iteration and value iteration methods. Like policy iteration, the algorithm contains a policy improvement step and a policy evaluation step. However, the evaluation step is not carried out exactly. The key insight is that one need not evaluate a policy exactly in order to improve it. The policy evaluation step is solved approximately by executing a limited number of value iterations. Like value iteration, it is an iterative method, starting with a value V^{π_N} and iterating till convergence.
Linear programming: optimal solution for infinite horizon problems [93, 36, 105]. The value function for a discrete infinite horizon MDP problem follows from the linear program:

min_V Σ_x V(x)   (6.19)

s.t. V(x) ≥ R(x, a) + γ Σ_{x'} V(x') p(x' | x, a)   (6.20)

where a and x range over all possible actions and states. Linear programs are solved with (1) the Simplex Method or (2) the Interior Point Method [90]. Linear programming is generally less efficient than the previously mentioned techniques because it does not exploit the dynamic programming structure of the problem. However, [118] showed that it is sometimes a good solution.
Approximations without enumeration of the state space: approximate solutions for finite and infinite horizon problems. The previously mentioned methods are optimal algorithms to solve MDPs. Unfortunately, we can only find exact solutions for small MDPs, because these methods produce optimal policies in explicit form (i.e. in a tabular manner that enumerates the state space). For larger MDPs, we must resort to approximate solutions [19], [101].
To this point our discussion of MDPs has used an explicit or extensional representation for the set of states (and actions), in which states are enumerated directly. We identify three ways in which structural regularities can be recognized, represented, and exploited computationally to solve MDPs effectively without enumeration of the state space:
• simplifying assumptions (such as observability, no process uncertainty, goal satisfaction, time-separable value functions, . . . ) can make the problem computationally easier to solve. In the AI literature, many different models are presented which can in most cases be viewed as special cases of MDPs and POMDPs.
• in many cases it is advantageous to use a compact (factored) representation of the states, actions and rewards. The components of a problem's solution, i.e. the policy and the optimal value function, are also candidates for compact structured representation. The following algorithms use such factored representations to avoid iterating explicitly over the entire set of states and actions:
– aggregation and abstraction techniques: these techniques allow the explicit or implicit grouping of states that
are indistinguishable with respect to certain characteristics (e.g. the value function or the optimal action choice).
– decomposition techniques: (i) techniques relying on reachability and serial decomposition: an MDP is broken
into various pieces, each of which is solved independently; the solutions are then pieced together or used to
guide the search for a global solution. The reachability analysis restricts attention to "relevant" regions of the
state space; and (ii) parallel decomposition, in which an MDP is broken into a set of sub-MDPs that are "run in
parallel". Specifically, at each stage of the (global) decision process, the state of each subprocess is affected.
While most of these methods provide approximate solutions, some of them offer optimality guarantees in general, and most
can provide optimal solutions under suitable assumptions.
4 One way to formulate the problem as a graph search is to make each node of the graph correspond to a state. The initial and goal states can then be
identified, and the search can proceed either forward or backward through the graph, or in both directions simultaneously.
44 CHAPTER 6. DECISION MAKING
Limited lookahead: approximate solution for finite and infinite horizon problems. The limited lookahead approach truncates the time horizon and uses at each stage a decision based on a lookahead of a small number of stages. The simplest
possibility is to use a one-step lookahead policy.
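A one-step lookahead policy can be sketched as below; the dictionary-based model format and the heuristic value estimate V_approx are assumptions made for illustration:

```python
def one_step_lookahead(x, actions, R, P, V_approx, gamma=0.95):
    """Pick the action maximizing immediate reward plus the discounted,
    approximate value of the successor states (one-step lookahead)."""
    best_action, best_q = None, float("-inf")
    for a in actions:
        q = R[(x, a)] + gamma * sum(p * V_approx[x2] for x2, p in P[(x, a)].items())
        if q > best_q:
            best_action, best_q = a, q
    return best_action

# Two states, two actions; V_approx is some heuristic value estimate.
R = {(0, "stay"): 0.0, (0, "go"): 0.1, (1, "stay"): 1.0, (1, "go"): 0.0}
P = {(0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
     (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0}}
V_approx = {0: 0.0, 1: 5.0}
```

Replacing V_approx by the exact optimal value function would recover the optimal policy; the point of limited lookahead is that even a crude V_approx often yields a reasonable decision.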
FIXME: MDP: expectation over the process noise; POMDP: expectation over the state, the process noise and the measurement noise. For a continuous state space:

V^{π*_{k−1}}(x_{k−1}) = max_{a_{k−1}} E[ R(x_{k−1}, a_{k−1}) + γ ∫_{x_k} P(x_k | x_{k−1}, a_{k−1}) V^{π*_k}(x_k) dx_k ]    (6.22)
Unfortunately, in many practical cases an analytical solution is not possible, and one has to resort to numerical execution of
the DP algorithm. This may be quite time consuming, since the optimization in the DP must be carried out ∀x_k, ∀a_k (and ∀z_k for a
POMDP). This means that the state space must be discretized in some way (if it is not already a finite set): the
curse of dimensionality.
What is a POMDP?
Original books/papers on POMDP: [41], [7]
Survey algorithms: Lovejoy [74]
E.g. for mobile robotics: [99, 24, 65, 51, 67, 108, 66] (generally they minimize the expected entropy and look one step
ahead)
This model has been analyzed by transforming it into an equivalent continuous-state MDP in which the system state is a
pdf (a probability distribution) over the unobserved states of the POMDP, and the transition probabilities are derived through
Bayes' rule. Because of the continuity of the state space, the algorithms are complicated and limited.
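For a discrete POMDP, the Bayes' rule derivation of the belief-state transition can be sketched as follows (the array layout and the toy observation model are assumptions of the example):

```python
import numpy as np

def belief_update(b, a, z, P, O):
    """One Bayes' rule step for the belief state of a POMDP:
    b'(x') ~ p(z|x',a) * sum_x p(x'|x,a) b(x), normalized.
    b: (n,) belief; P[a]: (n,n) transition matrix; O[a]: (n, n_obs) obs probs."""
    predicted = b @ P[a]                   # prediction through the system model
    unnormalized = O[a][:, z] * predicted  # correction by the measurement
    return unnormalized / unnormalized.sum()

P = {0: np.eye(2)}                           # action 0: the state does not change
O = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}  # p(z | x', a=0)
b1 = belief_update(np.array([0.5, 0.5]), 0, 0, P, O)
```

The belief b is exactly the "system state" of the equivalent continuous-state MDP mentioned above: it lives on the probability simplex, which is why exact value iteration over beliefs is so much harder than over discrete states.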
Exact algorithms for general POMDPs are intractable for all but the smallest problems, so algorithmic solutions rely heavily on approximation. Only solution methods that exploit special structure in a specific problem class or
approximations by heuristics (such as aggregation and discretisation of MDPs) may be quite efficient.
1. We can convert the POMDP into a belief-state MDP and compute the exact V(b) for that [83]. This is the optimal approach,
but it is often computationally intractable. We can then consider approximating either the value function V(·), the belief
state b, or both.
• exact V, exact b: the value function is piecewise linear and convex. Hence, it can be represented by a limited number
of vectors α. This is used as the basis of exact algorithms for computing V(b) (cf. MDP value iteration algorithms):
the enumeration algorithm [111, 78, 44], one-pass algorithm [111], linear support algorithm [27], witness algorithm [72],
and incremental pruning algorithm [125]; (an overview of the first three algorithms can be found in [74], and of the first
four in [25]). Current computing power can only solve finite horizon POMDPs with a few dozen
discretized states.
• approx V, exact b: use a function approximator with "better" properties than piecewise linear, e.g. polynomial functions, Fourier expansion, wavelet expansion, the output of a neural network, cubic splines, etc. [57]. This is generally
more efficient, but may poorly represent the optimal solution.
• exact V, approx. b: [74] the computation of the belief state b (Bayesian inference) can be inefficient. Approximating
b can be done (i) by contracting the belief space using particle filters, Monte Carlo or grid based methods, etc. (see
previous chapters on estimation). The optimal value function or policy for the discrete problem may then be extended
to a suboptimal value function or policy for the original problem through some form of interpolation; or (ii) by finite
memory approximations.
• approx V, approx b: combinations of the above. E.g. [114] uses a particle filter to approximate the belief state and a
nearest neighbor function approximator for V.
2. Sometimes, the structure of the POMDP can be used to compute exact tree-structured value functions and policies (e.g.
structure in the form of a DBN) [20].
3. We can also solve the underlying MDP and use that as the basis of various heuristics. Two examples are [26]:
• compute the most likely state x∗ = arg maxx b(x) and use this as the “observed state” in the MDP instead of the
belief b(x).
• active localization (greedy, exploiting): execute the actions that optimize the reward
• active exploration (exploring): execute actions to experience states which we might otherwise never see. We hope to
choose actions that maximize knowledge gain of the map (parameters).
• use the observations to learn the system model, see [46] where a CML algorithm is used to build a map (model) using
an augmented state vector. This model then determines the optimal policy. This is called Indirect RL.
• use the observations to improve the value function and policy, no system model is learned. This is called Direct RL.
Chapter 7
Model selection
FIXME: TL:
Model selection: [124] each description was designed to pursue a different goal, so each criterion might be the best for
achieving its goal. n: sample size (number of measurements); k: model dimension (number of parameters in θ).
• Akaike's Information Criterion (AIC) [1, 2, 3, 4, 103, 49]. The Akaike framework defines the success of inference
by how close the selected hypothesis is to the true hypothesis, where closeness is measured by the Kullback-Leibler
distance (largest predictive accuracy): select the model with the highest value of log(L(θ̂)) − k. The predictive accuracy of a family
tells you how well the best-fitting member of that family can expect to predict new data.
• Bayesian Information Criterion (BIC) [104]: we should choose the theory that has the greatest probability (i.e. probability that the hypothesis is true): select the model with the highest value of log(L(θ̂)) − (k/2) log(n). This selects a simpler model (smaller
k) than AIC. A family's average likelihood tells you how well, on average, the different members of the family fit the
data at hand.
• Minimum description length criterion (MDL) [97, 98, 121]
• various methods of cross validation (e.g. [119, 123])
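The first two criteria reduce to simple penalized log-likelihoods. The sketch below scores polynomial fits of increasing order on synthetic linear data; the Gaussian noise model and the parameter count k are assumptions of the example:

```python
import numpy as np

def aic_score(loglik, k):
    return loglik - k                      # highest log L(theta_hat) - k wins

def bic_score(loglik, k, n):
    return loglik - 0.5 * k * np.log(n)    # highest log L(theta_hat) - (k/2) log n wins

def gaussian_loglik(residuals):
    # Maximized Gaussian log-likelihood with the MLE noise variance plugged in.
    n = len(residuals)
    s2 = np.mean(residuals ** 2)
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1.0)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x + 0.1 * rng.standard_normal(100)   # truly linear data

scores = {}
for degree in (1, 5):
    resid = y - np.polyval(np.polyfit(x, y, degree), x)
    k = degree + 2                              # coefficients + noise variance
    scores[degree] = bic_score(gaussian_loglik(resid), k, len(x))
```

Because the data are genuinely linear, the extra likelihood gained by the degree-5 fit is small compared to its (k/2) log n penalty, so BIC prefers the simpler model, illustrating the overfitting discussion below.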
when p(H1) = p(H2) = 0.5. Likelihood tells which model is good for the observed data. This is not necessarily a
good model for the system (a good predictive model), because of overfitting: it fits the data better than the real model. E.g. the
most likely second order model will be better than the most likely linear model. (The linear model is a special
case of the second order model.) Scientists interpret the data as favoring the simpler model, but the likelihood does not.
When the models are equally complex, likelihood is OK (= AIC for these cases). Why not the likelihood difference? It is not
invariant to scaling...
The Bayes factor is hard to evaluate, especially in high dimensions. Approximating Bayes factors: BIC.
• Kullback-Leibler information: between model and reality. We do not have the real model... ⇒ AIC
• Akaike information criterion (AIC) [1] [Sakamoto, Y., Ishiguro, M. and Kitagawa, G. 1986 Akaike information
criterion statistics. Dordrecht: Kluwer Academic Publishers]
p(Z_k|H) is the likelihood of the likeliest case (i.e. the k model parameters that maximize
p(Z_k|H)). k: number of parameters in the distribution. The model giving the minimum value of AIC should be
selected. It does not choose the model in which the likelihood of the data is the largest, but also takes the order of the
system model into account. AIC is a natural sample estimate of expected Kullback-Leibler information (as a result of
asymptotic theory). AIC: H1 is estimated to be more predictively accurate than H2 if and only if
p(Z_k|H1) / p(Z_k|H2) ≥ exp(k1 − k2)    (7.5)
log p(Z_k|Hi) = log p(Z_k|Hi, θ̂) − (k/2) log n + O(1)    (7.6)
= log-likelihood at MLE − penalty    (7.7)
Approximate Bayes factors, penalty terms: AIC: k; BIC: (k/2) log n; RIC: k log k.
• posterior Bayes factors [Aitkin, M 1991 Posterior Bayes Factors, journal of the Royal Statistical Society B 1: 110-
128.]
p(x|Z_k, H1) / p(x|Z_k, H2) > κ    (7.8)
Occam factor - likelihood. The likelihood for a model Mi is the average likelihood of its parameters θi:

p(Z_k|Mi) = ∫ p(θi|Mi) p(Z_k|θi, Mi) dθi    (7.9)

This is approximately equal to p(Z_k|Mi) ≈ p(Z_k|θ̂i, Mi) · p(θ̂i|Mi) Δθi = maximum likelihood × Occam factor. The Occam factor penalizes
models for wasted volume of parameter space.
Part III
Numerical Techniques
Chapter 8
Monte Carlo Techniques
8.1 Introduction
Monte Carlo methods are a group of methods in which physical or mathematical problems are solved by using random
number generators. The name "Monte Carlo" was chosen by Metropolis during the Manhattan Project of World War II,
because of the similarity of statistical simulation to games of chance (the capital of Monaco being a center for gambling
and similar pursuits). Monte Carlo methods were first used to perform simulations of the collision behaviour of particles
during their transport within a material (to make predictions about how long it takes them to collide).
Monte Carlo techniques provide us with a number of ways to solve one or both of the following problems:
• Sampling from a certain pdf (that is, sampling FROM a pdf, not to be confused with sampling a certain signal or a (probability
density) function as often used in signal processing). The first methods (the "real" Monte Carlo methods) are also
called importance sampling, whereas the others are called uniform sampling1.
Importance sampling methods represent the posterior density by a set of N random samples (often called particles,
from where the name particle filters). Both methods are presented in figure 8.1. It can be proved that these representation methods are dual.
Remark 8.1 Note that equation 2.6 is of the type of eq. 8.1!
Note that the latter equation is easily solved once we are able to sample from p(x):
I ≈ ∑_{i=1}^{N} h(x_i)    (8.2)
PROOF Suppose we have a random variable x, distributed according to a pdf p(x): x ∼ p(x). Then any function f_n(x) is
also a random variable. Let x_i be a random sample drawn from p(x) and define

F = ∑_{i=1}^{N} λ_n f_n(x_i)    (8.3)
1 To make the confusion complete, importance sampling is also the term used to denote a certain algorithm to perform (importance) sampling.
Figure 8.1: Difference between uniform and importance sampling. Note that the uniform samples only fully characterize the pdf if every
sample xi is accompanied by a weight wi = p(xi ).
Starting from the Chebychev inequality or the central limit theorem (asymptotically for N → ∞), one can obtain expressions
that indicate how good the approximation of I is.
Remark 8.2 Note that for uniform sampling (as in grid based methods), we can approximate the integral as
I ≈ ∑_{i=1}^{N} h(x_i) p(x_i)    (8.5)
The following sections describe several methods for (importance) sampling from certain distributions. We start with discrete
distributions in section 8.2. The other sections describe techniques for sampling from continuous distributions.
8.2 Sampling from a discrete distribution
Example 8.1 Suppose we want to sample from a discrete distribution p(x1 = 0.6, x2 = 0.2, x3 = 0.2). Generate ui with
the uniform random number generator: if ui ≤ 0.6, the sample belongs to the first category; if 0.6 < ui ≤ 0.8, to
the second; . . .
This results in the following algorithm, taking O(N log N) time to draw the samples:
However, more efficient methods based on arithmetic coding exist [75]. [96], p. 96, uses ordered uniform samples, which
allows drawing N samples in O(N) time.
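A minimal sketch of the ordered-uniform-samples idea for drawing from the discrete distribution of example 8.1 (the helper name and array layout are our own):

```python
import numpy as np

def sample_discrete(p, n, rng):
    """Draw n samples from a discrete distribution p by inverting its
    cumulative sums; sorted uniforms allow a single O(n + len(p)) sweep."""
    cdf = np.cumsum(p)
    cdf[-1] = 1.0                        # guard against floating point round-off
    u = np.sort(rng.uniform(size=n))     # ordered uniform samples
    out = np.empty(n, dtype=int)
    j = 0
    for i, ui in enumerate(u):
        while ui > cdf[j]:               # walk the cdf once, left to right
            j += 1
        out[i] = j
    return out

rng = np.random.default_rng(0)
s = sample_discrete([0.6, 0.2, 0.2], 10_000, rng)
freqs = np.bincount(s, minlength=3) / len(s)
```

Because the uniforms are sorted, the cdf pointer j only moves forward, which is what turns the naive O(N log N) search into a linear sweep.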
and thus

p(y) = p(x) / |dy/dx|
Suppose we want to generate samples from a certain pdf p(x). If we take the transformation function y = f(x) to be the
cumulative distribution function (cdf) of p(x), p(y) will be a uniform distribution on the interval [0, 1]. So, if we have
an analytic form of p(x), and we can find the inverse cdf f^{−1} of p(x), sampling is straightforward (algorithm 3). An
example of a (basic) RNG is rand() in the C standard library.
The obtained samples x_i are exact samples from p(x).
Algorithm 3 Inversion sampling (U[0, 1] denotes the uniform distribution on the interval [0, 1])
for i = 1 to N do
Sample u_i from U[0, 1]
x_i = f^{−1}(u_i) where f(x) = ∫_{−∞}^{x} p(x̃) dx̃
end for
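Algorithm 3 is straightforward when the inverse cdf is available in closed form; the exponential density p(x) = λ exp(−λx) is used below purely as an illustration:

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Inversion sampling (algorithm 3) for p(x) = lam * exp(-lam * x):
    the cdf is f(x) = 1 - exp(-lam * x), so f^{-1}(u) = -ln(1 - u) / lam."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

samples = sample_exponential(2.0, 50_000)
mean = sum(samples) / len(samples)       # should approach 1/lam = 0.5
```

When no closed-form inverse exists (as for the Beta distribution of figure 8.2), one inverts the cdf numerically or falls back on the approximative (discretized) variant mentioned below.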
Figure 8.2: Illustration of inversion sampling: 50 uniformly generated samples transformed through the cumulative Beta distribution.
The right hand side shows that these samples are indeed samples of a Beta distribution
FIXME: add this as an example of inversion sampling
. . . are independent samples from a standard normal distribution. There also exist variations on this method, such as the
approximative inversion sampling method. This is the same approach, but applied to a discrete approximation of the
distribution we want to sample from.
Figure 8.3: Illustration of inversion sampling: Histogram of transformed samples should approach uniform distribution
Algorithm 5 is sometimes referred to as Sampling Importance Resampling (SIR). It was originally described by Rubin [100]
to do inference in a Bayesian context. Rubin drew samples from the prior distribution and assigned a weight to each of them
according to its likelihood. Samples from the posterior distribution were then obtained by resampling from this
discrete set.
Remark 8.3 Note also that the tails of the proposal density should be as heavy or heavier than those of the desired pdf, to
avoid degeneracy of the weight factor.
Figure 8.4: Illustration of importance sampling. Generating samples of a Beta distribution via a Gaussian with the same mean and
standard deviation as the Beta distribution. The histogram compares the samples generated via importance sampling with samples
generated via inversion sampling. 50000 samples were generated from the Gaussian to get 5000 samples from the Beta distribution.
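The importance sampling scheme of figure 8.4 (Gaussian proposal, Beta target) can be sketched as follows; the Beta(2, 5) parameters and the moment-matched proposal are assumptions of the example:

```python
import numpy as np
from math import gamma

def beta_pdf(x, p, q):
    c = gamma(p + q) / (gamma(p) * gamma(q))
    inside = (x > 0) & (x < 1)
    xs = np.clip(x, 1e-12, 1 - 1e-12)         # avoid invalid powers outside (0,1)
    return np.where(inside, c * xs ** (p - 1) * (1 - xs) ** (q - 1), 0.0)

def sir(target_pdf, n_out, n_prop, mu, sigma, rng):
    """Sampling Importance Resampling: draw from a Gaussian proposal,
    weight by target/proposal, then resample from the weighted set."""
    x = rng.normal(mu, sigma, size=n_prop)
    proposal = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    w = target_pdf(x) / proposal
    w /= w.sum()
    return x[rng.choice(n_prop, size=n_out, p=w)]

p_, q_ = 2, 5
mu = p_ / (p_ + q_)                                           # Beta mean
sigma = np.sqrt(p_ * q_ / ((p_ + q_) ** 2 * (p_ + q_ + 1)))   # Beta std
rng = np.random.default_rng(0)
samples = sir(lambda x: beta_pdf(x, p_, q_), 5_000, 50_000, mu, sigma, rng)
```

Proposals that land outside (0, 1) get weight zero and are never resampled, which illustrates Remark 8.3: the proposal's tails must cover the target, or the weights degenerate.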
8.6 Markov Chain Monte Carlo (MCMC) methods
• Metropolis sampling
• Single component Metropolis–Hastings
• Gibbs sampling
• Slice sampling
These algorithms and more variations are more thoroughly discussed in [88, 75, 55].
• Choose a proposal density q(x, x^(t)), which can (but need not) depend on the current sample x^(t). Contrary
to the previous sampling methods, the proposal density doesn't have to be similar to p(x). It can be any density from
which we can draw samples. We assume we can evaluate p(x) for all x.
Choose also an initial state x_0 of the Markov chain.
• At every time step t, a new state x̃ is generated from this proposal density q(x, x^(t)). To decide if this new state will
be accepted, we compute

a = [p(x̃) / p(x^(t))] · [q(x^(t), x̃) / q(x̃, x^(t))]    (8.8)

If a ≥ 1, the new state x̃ is accepted and x^(t+1) = x̃; else the new state is accepted with probability a (this means:
sample a uniform random variable u_i; if a ≥ u_i, then x^(t+1) = x̃, else x^(t+1) = x^(t)).
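A minimal sketch of this scheme with a symmetric Gaussian proposal, so that the q-ratio in (8.8) cancels, targeting a Beta(2, 5) density (the target and step size are assumptions of the example):

```python
import numpy as np

def metropolis(log_p, x0, n_steps, step, rng):
    """Metropolis sampling: Gaussian proposal centred on the current state;
    accept the proposal with probability min(1, p(x_new)/p(x))."""
    x, lp = x0, log_p(x0)
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_new = x + step * rng.standard_normal()
        lp_new = log_p(x_new)
        if np.log(rng.uniform()) < lp_new - lp:   # accept
            x, lp = x_new, lp_new
        chain[t] = x                              # on rejection, keep the old state
    return chain

def log_beta25(x):                                # unnormalized log Beta(2, 5)
    return np.log(x) + 4.0 * np.log(1.0 - x) if 0.0 < x < 1.0 else float("-inf")

rng = np.random.default_rng(0)
chain = metropolis(log_beta25, 0.5, 20_000, 0.2, rng)
burned = chain[1_000:]                            # discard the burn-in period
```

Working with log p(x) avoids underflow, and the chain only ever needs p up to a normalization constant, which is the practical appeal of MCMC.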
[Figure: rejection sampling, a scaled Gaussian envelope (factor · dnorm(x, mu, sigma)) over a Student t density; plot residue omitted]
[Figure 8.6 panel annotations: first sample; first proposal; accepted → second sample = first proposal; proposal rejected, 4th sample = 3rd sample]
Figure 8.6: Demonstration of MCMC for a Beta distribution with a Gaussian proposal density
The Beta target density is in black, the gaussian proposal (centered around the current sample) in red, blue denotes that the proposal is
accepted, green denotes the proposal is rejected
60 CHAPTER 8. MONTE CARLO TECHNIQUES
The resulting histogram of MCMC sampling with 1000 samples is shown in figure 8.7.
Figure 8.7: 1000 samples drawn from a Beta distribution with MCMC (Gaussian proposal). Histogram of those samples.
We will prove later that, asymptotically, the samples generated from this Markov chain are samples from p(x).
Note, however, that the generated samples are not i.i.d. draws from p(x).
Efficiency considerations
Run length and burn-in period As mentioned, the samples generated by the algorithm are only asymptotically samples
from p(x). This means we have to throw away a number of samples at the beginning of the algorithm (called the burn-in
period). Since the generated samples are also dependent (on each other), we have to make sure that our Markov chain
explores the whole state space by running it long enough.
Typically one uses an approximation of the form

E[f(x) | p(x)] ≈ (1/(n − m)) ∑_{i=m+1}^{n} f(x_i)    (8.9)

m denotes the burn-in period and n (the run length) should be large enough to assure the required precision and that the
whole state space is explored.
There exist several convergence diagnostics for determining both m and n [55]. The total number of samples n depends
strongly on the ratio (typical step size of the Markov chain) / (representative length of the state space) of the algorithm (sometimes also called the convergence ratio, although this term
can be misleading).
This typical step size of the Markov chain depends on the choice of the proposal density q(). To explore the whole state
space efficiently (some authors speak about a well mixing Markov chain), it should be of the same order of magnitude as the
smallest length scale of p(x). One way to determine this stopping time, given a required precision, is using the variance of
the estimate in equation (8.9) (called the Monte Carlo variance), but this is very hard because of the dependence between
the different samples. The most obvious method is starting several chains in parallel and comparing the different estimates.
One way to improve mixing is to use a reparametrisation (use with care, because these can destroy conditional independence
properties).
Convergence diagnostics is still an active area of research, and the ultimate solution still has to appear!
FIXME: include remark about posterior correlation to the speed of mixing
Independence If the typical step size of the Markov chain is ε and the representative length of the state space is L, it typically
takes ≈ (1/f)(L/ε)² steps to generate 2 independent samples, with f the number of rejections (FIXME: verify why). The fact that samples
are correlated constitutes in most cases hardly a problem for the evaluation of quantities of interest such as E[f(x) | p(x)].
A way to avoid (some) dependence is obtained by starting different chains in parallel.
8.6. MARKOV CHAIN MONTE CARLO (MCMC) METHODS 61
Why?
Definition 8.2 (Irreducibility) A Markov Chain is called irreducible if we can get from any state x into another state y
within a finite amount of time.
Remark 8.4 For discrete Markov Chains, this means that irreducible Markov Chains cannot be decomposed into parts
which do not interact.
Definition 8.3 (Invariant/Stationary Distribution) A distribution function p(x) is called the stationary or invariant distribution of a Markov chain with transition kernel T(x̃, x) if and only if

p(x̃) = ∫ T(x̃, x) p(x) dx    (8.10)
Definition 8.4 (Aperiodicity – Acyclicity) An irreducible Markov chain is called aperiodic/acyclic if there isn't any distribution function which allows something of the form

p(x̃) = ∫ · · · ∫ T(x̃, . . .) · · · T(. . . , x) p(x) d . . . dx    (8.11)
Definition 8.5 (Time reversibility – Detailed balance) An irreducible, aperiodic Markov chain is said to be time reversible if

T(x^a, x^b) p(x^b) = T(x^b, x^a) p(x^a),    (8.12)
What is more important, the detailed balance property implies the invariance of the distribution p(x) under the Markov
Chain transition kernel T (x̃, x):
PROOF Combine eq. (8.12) with the fact that

∫ T(x^a, x^b) dx^a = 1.

This yields

∫ T(x^a, x^b) p(x^b) dx^a = ∫ T(x^b, x^a) p(x^a) dx^a
p(x^b) = ∫ T(x^b, x^a) p(x^a) dx^a,

q.e.d.
Definition 8.6 (Ergodicity) ergodicity = aperiodicity + irreducibility
It can also be proven that any ergodic chain that satisfies the detailed balance equation (8.12) will eventually converge to
the invariant distribution of that chain, p(x), from any initial distribution f_0(x).
So, to prove that the Metropolis Algorithm does provide us with samples of p(x), we have to prove that this density is the
invariant distribution for the Markov Chain with transition kernel defined by the MCMC algorithm.
where I(·) denotes the indicator function (taking the value 1 if its argument is true, and 0 otherwise). The chance of
arriving in a state x ≠ x^t is just the first term of equation (8.14). The chance of staying in x^t, on the other hand,
consists of 2 contributions: either x^t was generated from the proposal density q and accepted, or another state was generated
and rejected: the integral "sums" over all possible rejections!
Detailed Balance We can still wonder why the minimum is taken! To satisfy the detailed balance property
One can verify that the definition we took in (8.13) satisfies this need. If we would not take the minimum, this would
not be the case!
Remark 8.5 Note that we should also prove that this chain is ergodic, but that is the case for most proposal densities!
Metropolis sampling [76] is a variant of Metropolis–Hastings sampling that supposes that the proposal density is symmetric
around the current state.
The independence sampler is an implementation of the Metropolis–Hastings algorithm in which the proposal distribution is
independent of the current state. This approach only works well if the proposal distribution is a good approximation of p
(and heavier tailed to avoid getting stuck in the tails).
For complex multivariate densities, it can be very difficult to come up with an appropriate proposal density that explores the
whole state space fast enough. Therefore, it is often easier to divide the state space vector x into a number of components:
where x.i denotes the i-th component of x. We can then update those components one by one. One can prove that this
doesn’t affect the invariant distribution of the Markov Chain. The acceptance function then becomes
a(x̃_{.i}, x^{(t)}_{.i}, x^{(t)}_{.−i}) = min( 1, [p(x̃_{.i}, x^{(t)}_{.−i}) q(x^{(t)}_{.i} | x̃_{.i}, x^{(t)}_{.−i})] / [p(x^{(t)}_{.i}, x^{(t)}_{.−i}) q(x̃_{.i} | x^{(t)}_{.i}, x^{(t)}_{.−i})] )    (8.15)
Gibbs sampling is a special case of the previous method. It can be seen as an M(RT)² algorithm where the proposal
distributions are the conditional distributions of the joint density p(x). Gibbs sampling can be seen as a Metropolis method
where every proposal is always accepted.
Gibbs sampling is probably the most popular form of MCMC sampling because it can easily be applied to inference problems. This has to do with the concept of conditional conjugacy explained in the next paragraphs.
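As an illustration of Gibbs sampling, not taken from this text: sampling a standard bivariate normal with correlation ρ, whose full conditionals are themselves normal, so every draw is accepted:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, rng):
    """Gibbs sampling for a zero-mean bivariate normal with unit variances:
    the full conditionals are x1|x2 ~ N(rho*x2, 1-rho^2) and symmetrically."""
    sd = np.sqrt(1.0 - rho ** 2)
    x1 = x2 = 0.0
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        x1 = rho * x2 + sd * rng.standard_normal()  # draw from p(x1 | x2)
        x2 = rho * x1 + sd * rng.standard_normal()  # draw from p(x2 | x1)
        out[i] = (x1, x2)
    return out

rng = np.random.default_rng(0)
samples = gibbs_bivariate_normal(0.8, 20_000, rng)
```

Note that there is no accept/reject step at all; the price is that consecutive samples are highly correlated when ρ is close to 1, which is exactly the mixing issue discussed earlier.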
Example 8.2 The family of Gamma distributions X ∼ Gamma(r, α) (r is called the shape, α is called the rate; sometimes
the scale s = 1/α is used) is the conjugate family if the likelihood is an exponential distribution. X is Gamma distributed if

P(x) = (α^r / Γ(r)) x^{r−1} e^{−αx}    (8.16)
The mean and variance are E(X) = r/α and Var(X) = r/α². If the likelihood P(Z_1 . . . Z_k | X) is of the form
x^k e^{−x ∑_{i=1}^{k} Z_i} (i.e. according to an exponential distribution, and supposing the measurements are independent given the
state), then the posterior will also be Gamma distributed. (The interested reader can verify that the posterior is distributed
∼ Gamma(r + k, α + ∑_{i=1}^{k} Z_i) as an exercise.)
This means inference can be executed very fast and easily. Therefore, conjugate densities are often (mis)used by Bayesians,
although they do not always correctly reflect the a priori belief.
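With the Gamma(shape r, rate α) convention defined above, the conjugate update for k independent exponential measurements is a one-liner (a sketch; the posterior shape is r + k and the rate α + ∑ Z_i):

```python
def gamma_posterior(r, alpha, measurements):
    """Conjugate update: a Gamma(r, alpha) prior combined with an
    exponential likelihood for k measurements Z_1..Z_k yields a
    Gamma(r + k, alpha + sum(Z_i)) posterior."""
    return r + len(measurements), alpha + sum(measurements)

post_shape, post_rate = gamma_posterior(2, 1.0, [0.5, 1.5, 1.0])
post_mean = post_shape / post_rate     # E(X) = r/alpha for a Gamma density
```

This is what "inference can be executed very fast" means in practice: the posterior is obtained by updating two numbers, with no sampling or numerical integration at all.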
For multi-parameter problems, conjugate families are very hard to find, but many multi-parameter problems do exhibit
conditional conjugacy. This means the joint posterior itself has a very complicated form (and is thus hard to sample from)
but its conditionals have nice simple forms.
See also the BUGS software. BUGS is a free, but not open-source, software package for Bayesian inference that uses Gibbs
sampling.
• It also uses the conditional distributions of the joint density p(x) as proposal densities, but these can be hard to
evaluate, so a simplified approach is used.
Slice sampling [87, 86, 85] can be seen as a “combination” of rejection sampling and Gibbs sampling. It is similar to
rejection sampling in the sense that it provides samples that are uniformly distributed in the area/volume/hypervolume
delimited by the density function. In this sense, both these approaches introduce an auxiliary variable u and sample
from the joint distribution p(x, u), which is a uniform distribution. Obtaining samples from p(x) then just consists of
marginalizing over u!
Slice sampling uses, contrary to rejection sampling, a Markov Chain to generate these uniform samples. The proposal
densities are similar to those in Gibbs sampling (but not completely).
The algorithm has several versions: stepping out, doubling, . . . We refer to [85] for an elaborate discussion of them.
Algorithm 7 describes the stepping out version for a 1D pdf. We illustrate this with a simple 1D example in figure 8.8 on
page 64. The resulting histogram is shown in figure 8.9. Although there is still a parameter that has to be chosen, unlike in
the case of Metropolis sampling, this length scale doesn't influence the complexity of the algorithm as badly.
8.6.7 Conclusions
Drawbacks of Markov Chain Monte Carlo methods are the fact that samples are correlated (although this is generally not
a problem) and that, in some cases, it is hard to set some parameters in order to be able to explore the whole state space
efficiently. To speed up the process of generating independent samples, Hybrid Monte Carlo methods were developed.
[Figure 8.8: demonstration of stepping-out slice sampling on a 1D density (panels show the width w and accepted/rejected points); plot residue omitted]
Figure 8.9: Resulting histogram for 5000 samples of a beta density generated with slice sampling
Monte Carlo methods can be divided into iterative methods (MCMC methods), such as M(RT)² sampling and Gibbs sampling, and non-iterative methods.
8.10 Literature
• First paper about Monte Carlo methods: [77];
first paper about MCMC by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller: [76], generalised by Hastings in
1970 [56]
• SIR: [100]
• Good tutorials: [89] (very well explained, but not fully complete), [75, 64]. There is an excellent book about MCMC
by Gilks et al. [55].
• Overview of all methods and combination with Markov techniques [88, 75]
8.11 Software
• Octave Demonstrations of most Monte Carlo methods by David Mackay: MCMC.tgz4
• My own demonstrations of Monte Carlo methods, used to generate the figures in this chapter, and written in R are
here5
• Radford Neal has some C-software for Markov Chain Monte Carlo and other Monte Carlo-methods here7
• BUGS8
4 http://wol.ra.phy.cam.ac.uk/mackay/itprnn/code/mcmc/mcmc.tgz
5 http://www.mech.kuleuven.ac.be/ kgadeyne/downloads/R/
6 http://wol.ra.phy.cam.ac.uk/mackay/itprnn/code/metrop/Welcome.html
7 http://www.cs.toronto.edu/ radford/fbm.software.html
8 http://www.mrc-bsu.cam.ac.uk/bugs/
Appendix A
Variable Duration HMM Filters
πi = P (q1 = Si )
If there are m possible measurements (observations) vi (1 ≤ i ≤ m), a measurement sequence from t = 1 until t is denoted
as O1 O2 . . . Ot where each Ok (1 ≤ k ≤ t) corresponds to one of the possible measurements vj (1 ≤ j ≤ m). bij denotes
the probability of measuring vj , given state Si .
The duration densities p_i(d), denoting the probability of staying d time units in S_i, are typically exponential densities, so
the duration is modeled by 2n + 1 parameters. The parameter D contains the maximal duration in all states i (mainly to
simplify the calculations, see also [70, 71]).
Remark that the filters for the VDHMM increase both the computation time (×D²/2) and the memory requirements (×D)
with respect to the standard HMM filters.
1. Given a measurement sequence (OS) O = O1 O2 · · · OT , and a model λ, calculate the probability of seeing this OS
(solved by the forward-backward algorithm in section A.1).
2. Given a measurement sequence (OS) O = O_1 O_2 · · · O_T, and a model λ, calculate the state sequence (SS) that most
likely generated this OS (solved by the Viterbi algorithm in section A.2).
3. Adapt the model parameters A, B and π (parameter learning or training of the model, solved by the Baum-Welch
algorithm, see section A.3)
Note that the actual inference problem (finding the most probable state sequence) is solved by the Viterbi algorithm. Note also that the Viterbi algorithm does not construct a belief PDF over all possible state sequences; it only gives you the ML estimator!
Suppose
αt (i) = P (O1 O2 . . . Ot , Si ends at t|λ) (A.1)
α_t(i) is the probability that the part of the measurement sequence from t = 1 until t is seen and that the FSM is in state S_i at time t and jumps to another state at time t + 1.
If t = 1, then
α1 (i) = P (O1 , Si ends at t = 1|λ) (A.2)
The probability that S_i ends at t = 1 equals the probability that the FSM starts in S_i (π_i) and stays there 1 timestep (p_i(1)). Furthermore, O₁ should be measured. Since all these phenomena are supposed to be independent¹, this results in:
α₁(i) = π_i p_i(1) b_i(O₁)   (A.3)
For t = 2,
α₂(i) = P(O₁O₂, S_i ends at t = 2 | λ)   (A.4)
This probability consists of 2 parts. Either the FSM started in S_i and stayed there for 2 time units, or it was in another state S_j for 1 time step and after that one time unit in S_i. That results in
α₂(i) = π_i p_i(2) ∏_{s=1}^{2} b_i(O_s) + Σ_{j=1}^{N} α₁(j) a_{ji} p_i(1) b_i(O₂)   (A.5)
Induction leads to the general case (as long as t ≤ D, the maximal duration time possible):
α_t(i) = π_i p_i(t) ∏_{s=1}^{t} b_i(O_s) + Σ_{j=1}^{N} Σ_{d=1}^{t−1} α_{t−d}(j) a_{ji} p_i(d) ∏_{s=t+1−d}^{t} b_i(O_s)   (A.6)
If t > D:
α_t(i) = Σ_{j=1}^{N} Σ_{d=1}^{D} α_{t−d}(j) a_{ji} p_i(d) ∏_{s=t+1−d}^{t} b_i(O_s)   (A.7)
Since
α_T(i) = P(O₁, O₂, ..., O_T, S_i ends at t = T | λ),   (A.8)
P(O|λ) = Σ_{i=1}^{N} α_T(i)   (A.9)
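The forward recursion (A.6)-(A.7) can be sketched directly in code. A minimal NumPy sketch, with the assumed conventions A[j, i] = a_ji, p_dur[i, d] = p_i(d) for d = 0..D, and B[i, v] = b_i(v):

```python
import numpy as np

def vdhmm_forward(O, pi, A, p_dur, B, D):
    """Forward pass for a variable duration HMM, eqs. (A.6)-(A.7).
    Returns all alpha_t(i) and P(O | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T + 1, N))                 # alpha[t, i] for t = 1..T
    for t in range(1, T + 1):
        for i in range(N):
            total = 0.0
            if t <= D:                           # initial-segment term of (A.6)
                total += pi[i] * p_dur[i, t] * np.prod(B[i, O[:t]])
            for d in range(1, min(D, t - 1) + 1):
                obs = np.prod(B[i, O[t - d:t]])  # prod_{s=t+1-d}^{t} b_i(O_s)
                total += (alpha[t - d] @ A[:, i]) * p_dur[i, d] * obs
            alpha[t, i] = total
    return alpha, alpha[T].sum()                 # P(O|lambda) = sum_i alpha_T(i)
```

Note that the duration bound d ≤ min(D, t − 1) reproduces both cases: for t ≤ D it equals t − 1 as in (A.6), and for t > D it equals D as in (A.7).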
The recursion starts here at time T. That is why we change the index t into T − k:
Note that this definition is complementary with that of α_t(j), which leads to
Analogous to the calculation of the α's, the recursion step can be split into two parts:
For k ≤ D:
β_{T−k}(i) = Σ_{j=1}^{N} a_{ij} p_j(k) ∏_{s=T−k+1}^{T} b_j(O_s) + Σ_{j=1}^{N} Σ_{d=1}^{k−1} β_{T−k+d}(j) a_{ij} p_j(d) ∏_{s=T−k+1}^{T−k+d} b_j(O_s)   (A.13)
For k > D:
β_{T−k}(i) = Σ_{j=1}^{N} Σ_{d=1}^{D} β_{T−k+d}(j) a_{ij} p_j(d) ∏_{s=T−k+1}^{T−k+d} b_j(O_s)   (A.14)
Suppose
δ_t(i) = max_{q₁q₂...q_{t−1}} P(q₁q₂...q_t = S_i ends at t, O₁O₂...O_t | λ)   (A.15)
δ_t(i) is the maximum of all probabilities belonging to all possible paths at time t. That means it represents the most probable sequence arriving in S_i.
Then (cf. the definition of α_t(i))
δ₁(i) = P(q₁ = S_i and q₂ ≠ S_i, O₁ | λ)   (A.16)
This means the FSM started in S_i and stayed there for one time step. Furthermore, O₁ should have been measured. So
δ₁(i) = π_i p_i(1) b_i(O₁)   (A.17)
At t = 2,
δ₂(i) = max_{q₁} P(q₁q₂ = S_i and q₃ ≠ S_i, O₁O₂ | λ)   (A.18)
Either the FSM stayed 2 time units in S_i and both O₁ and O₂ have been measured in state S_i; or the FSM was for one time step in another state S_j, in which O₁ was measured, and jumped to state S_i at time t = 2, in which O₂ was measured:
δ₂(i) = max[ max_{1≤j≤N} { δ₁(j) a_{ji} p_i(1) b_i(O₂) }, π_i p_i(2) b_i(O₁) b_i(O₂) ]   (A.19)
For t ≤ D:
δ_t(i) = max[ max_{1≤j≤N, 1≤d<t} { δ_{t−d}(j) a_{ji} p_i(d) ∏_{s=t−d+1}^{t} b_i(O_s) }, π_i p_i(t) ∏_{s=1}^{t} b_i(O_s) ]   (A.20)
For all t > D:
δ_t(i) = max_{1≤j≤N} max_{1≤d<D} { δ_{t−d}(j) a_{ji} p_i(d) ∏_{s=t−d+1}^{t} b_i(O_s) }   (A.21)
Note that, except for the presence of a second term in (A.20), the only difference between (A.20) and (A.21) is the bounds of the maximisation, which avoid referencing δ's that do not exist. E.g., suppose t = 1, d = 3; this would lead to terms like δ_{1−3}(i) in eq. (A.20). However, δ_t(i) does not exist for t < 0.
A.2.2 Backtracking
The δ_t(i)'s alone are not sufficient to determine the most probable state sequence. Indeed, when all δ_t(i)'s are known, the maximum
δ_T(i) = max_{q₁q₂...q_{T−1}} P(q₁q₂...q_T = S_i, O₁O₂...O_T | λ)   ∀i: 1 ≤ i ≤ N   (A.22)
allows us to determine the most probable state at time t = T, q*_T, but this does not solve the problem of finding the most probable sequence (i.e. how we arrived in that state). This can be solved by determining, together with the calculation of all δ_t(i), the arguments (i.e. how long the FSM stayed in S_i and where it came from before it was in S_i) that maximise δ_t(i). Therefore we define ψ_t(i) and τ_t(i).
If ψ_t(i) = k and τ_t(i) = l, then
δ_t(i) = δ_{t−l}(k) a_{ki} p_i(l) ∏_{s=t−l+1}^{t} b_i(O_s) ≥ δ_{t−d}(j) a_{ji} p_i(d) ∏_{s=t−d+1}^{t} b_i(O_s)   ∀j: 1 ≤ j ≤ N, ∀d: 1 ≤ d ≤ D   (A.23)
τ_t(i) = Arg max_{1≤d<D} max_{1≤j≤N} { δ_{t−d}(j) a_{ji} p_i(d) ∏_{s=t−d+1}^{t} b_i(O_s) }   (A.25)
gives us the missing parameter necessary to determine τ_T(i) and ψ_T(i) to start the first step of the backtracking part of the algorithm. That part constructs, starting from t = T, the most probable state sequence q*₁q*₂...q*_T. This can be done as follows: one knows that
q*_T = S_{κ_T}   (A.28)
But according to the definition of ψ_t(i) and τ_t(i), we also know that
and that
for i = τ_T(κ_T): q*_{T−i} = S_j with j = ψ_T(κ_T)   (A.30)
In this way we know both the last τ_T(κ_T) elements of q* and the previous state S_j, so with τ_t(j) and ψ_t(j) we can start the recursion.
An example of such a backtracking procedure can be seen in figure A.1. After calculation of all δ's, it appears that κ_T = arg max_i δ_T(i) = 3. Starting from state 3 and checking the values of ψ_T(κ_T) and τ_T(κ_T), these appear to be equal to 2 and 3 respectively. The FSM thus stayed 3 time steps in state 3, and before that it was in state 2.
[Figure A.1: Example of the backtracking procedure for an FSM with N states: κ_T = 3, ψ_T(κ_T) = 2, τ_T(κ_T) = 3; ψ_{T−3}(2) = N, τ_{T−3}(2) = 2.]
Note that β*_t(i) is only defined for t from 0 until T − 1 (instead of from t = 1 until t = T).
Since the condition on α*_t(i) is that S_i starts at t + 1, and that on α_t(i) that S_i ends at t, the following relationship is easy to derive. With eq. (A.1), eq. (A.31) becomes
α*_t(j) = Σ_{i=1}^{N} α_t(i) a_{ij}   (A.33)
Analogously,
β*_t(i) = Σ_{d=1}^{D} β_{t+d}(i) p_i(d) ∏_{s=t+1}^{t+d} b_i(O_s)   (A.34)
Note that this formula has to be modified for all t starting from t = T − D.
Different consecutive measurements are assumed independent. The first factor of the product equals (see eq. (A.1)) α_t(i). Applying Bayes' rule to the second factor of eq. (A.43) gives
Eq. (A.32) allows us to conclude that the first factor of this expression equals β*_t(j). Since from eq. (A.43) we can conclude that S_i ends at time t, the second factor of this product is nothing but a_{ij}. The sum of all these factors equals the numerator of eq. (A.39). The denominator of that equation is a normalisation factor.
3. The formulas for b_i(k) and p_i(d) can be derived in a similar way.
b_i(k) = [ Σ_{t=1, O_t=k}^{T} ( Σ_{τ<t} α*_τ(i) β*_τ(i) − Σ_{τ<t} α_τ(i) β_τ(i) ) ] / [ Σ_{k=1}^{M} Σ_{t=1, O_t=k}^{T} ( Σ_{τ<t} α*_τ(i) β*_τ(i) − Σ_{τ<t} α_τ(i) β_τ(i) ) ]   (A.44)
Notes:
• The iteration formula for b_i(k) sums over all indices t for which O_t = k; in other words, it first filters the input.
• Denominators are normalisation factors.
A.4 Case study: Estimating first order geometrical parameters by the use of
VDHMM’s
This problem has already been studied extensively (refs to be added) with Kalman filters.
• Measurement vectors: stem from Twist × Wrench = 0. Different CFs should give rise to different clusters in hyperspace and thus allow the construction of a measurement vector.
Appendix B
Kalman filter (KF)
B.1 Notations
The state estimate at time step k, based on the measurements up to time step i, is denoted as x̂_{k|i}; its covariance matrix is P_{k|i}. x̂_{k|k−1} is called the predicted state estimate and x̂_{k|k} the updated state estimate. The initial state estimate x̂_{0|0} and its covariance matrix P_{0|0} represent the prior knowledge. w_{k−1} and v_k are the process and measurement uncertainty; they are random vector sequences with zero mean and known covariance matrices Q_{k−1} and R_k.
where
K_k = P_{k|k−1} G_k^T S_k^{−1};   (B.7)
S_k = G″_k R_k G″_k^T + G_k P_{k|k−1} G_k^T.   (B.8)
System update:
Before a system update is calculated (time step k − 1), the distribution Post(x(k − 1)) is Gaussian with mean x̂_{k−1|k−1} and covariance matrix P_{k−1|k−1} (n is the dimension of the state vector x):
Post(x(k − 1)) = |(2π)^n P_{k−1|k−1}|^{−1/2} exp( −½ (x(k−1) − x̂_{k−1|k−1})^T P_{k−1|k−1}^{−1} (x(k−1) − x̂_{k−1|k−1}) ).   (B.9)
wk−1 is a zero mean gaussian process uncertainty with covariance matrix Qk−1 .
Out of this:
As
∫_{−∞}^{∞} e^{−½ g(x(k−1), x(k))} dx(k − 1) = |(2π)^n C_{k−1}|^{1/2};   (B.21)
This is a Gaussian distribution with mean and covariance as obtained with the Kalman filter equations (B.18)-(B.19).
Measurement update:
Before the measurement is processed, x has the probability distribution Prior(x(k)), (B.22).
The measurement equation is z_k = g′_k(s_k, θ_{g,k}) + G_k x(k) + G″_k v_k. The probability of measuring the value z_k for a certain x(k), given the measurement covariance G″_k R_k G″_k^T, is (m is the dimension of the measurement vector z):
p(z_k | x(k), s_k, θ_g, g_k) = |(2π)^m G″_k R_k G″_k^T|^{−1/2} exp( −½ (g′_k(s_k,θ_{g,k}) + G_k x(k) − z_k)^T (G″_k R_k G″_k^T)^{−1} (g′_k(s_k,θ_{g,k}) + G_k x(k) − z_k) ).   (B.23)
Post(x(k)) is proportional to the product of (B.22) and (B.23):
Post(x(k)) ∼ exp( −½ (x(k) − x̂_{k|k−1})^T P_{k|k−1}^{−1} (x(k) − x̂_{k|k−1}) − ½ (g′_k(s_k,θ_{g,k}) + G_k x(k) − z_k)^T (G″_k R_k G″_k^T)^{−1} (g′_k(s_k,θ_{g,k}) + G_k x(k) − z_k) )   (B.24)
¹ Use the matrix inversion lemma for the expression of P_{k|k−1}^{−1}. Since
P_{k|k−1} = F_{k−1} P_{k−1|k−1} F_{k−1}^T + F″_{k−1} Q_{k−1} F″_{k−1}^T,
the lemma gives
P_{k|k−1}^{−1} = (F″_{k−1} Q_{k−1} F″_{k−1}^T)^{−1} − (F″_{k−1} Q_{k−1} F″_{k−1}^T)^{−1} F_{k−1} ( F_{k−1}^T (F″_{k−1} Q_{k−1} F″_{k−1}^T)^{−1} F_{k−1} + P_{k−1|k−1}^{−1} )^{−1} F_{k−1}^T (F″_{k−1} Q_{k−1} F″_{k−1}^T)^{−1}.
Post(x(k)) ∼ exp( −½ (x(k) − x̂_{k|k})^T P_{k|k}^{−1} (x(k) − x̂_{k|k}) );   (B.25)
P_{k|k}^{−1} = P_{k|k−1}^{−1} + G_k^T (G″_k R_k G″_k^T)^{−1} G_k;   (B.26)
x̂_{k|k} = P_{k|k} [ G_k^T (G″_k R_k G″_k^T)^{−1} (z_k − g′_k(s_k, θ_{g,k})) + P_{k|k−1}^{−1} x̂_{k|k−1} ].   (B.27)
This shows that the new distribution is again a Gaussian distribution. The mean and covariance are the ones obtained with the Kalman filter equations: the update P_{k|k}^{−1} is as in formula (B.26); the update x̂_{k|k} equals x̂_{k|k−1} + K_k (z_k − g′_k(s_k, θ_{g,k}) − G_k x̂_{k|k−1}) with K_k = P_{k|k−1} G_k^T (G″_k R_k G″_k^T + G_k P_{k|k−1} G_k^T)^{−1}.
B.4 Kalman smoother
• p( X(k) | Z^k, U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0)) )
• log p( X(k), Z^k | U^{k−1}, S^k, θ, F^{k−1}, G^k, P(X(0)) ) = log( p(x₁) ∏_{i=2}^{k} p(x_i | x_{i−1}) ∏_{i=1}^{k} p(z_i | x_i) )
• write Q
Appendix C
Daum's exact nonlinear filter
discrete time measurements (remark: a filter for continuous time measurements has also been developed [32]):
z_k = g(x_k, t_k, v_k)   (C.3)
Assumptions:
• p(x, t) is nowhere vanishing, is twice continuously differentiable in x and continuously differentiable in t; furthermore, p(x, t) approaches zero sufficiently fast as ||x|| → ∞ such that it satisfies Eq. (C.15);
• p(z_k | x_k) is nowhere vanishing and is twice continuously differentiable in x_k and z_k;
• for a given initial condition p(x, t_k), Eq. (C.15) has a unique bounded solution for all x and t_k ≤ t ≤ t_{k+1}.
∂p/∂t = −(∂p/∂x) f − p tr(∂f/∂x) + ½ tr(Q ∂²p/∂x²)   (C.15)
θ(x, t) (Eq (C.6)):
∂θ/∂t = (∂θ/∂x)(Q r^T − f) + ξ − ½ A θ;   (C.16)
C.2.2 On-line
ψ(Z k , t) on-line
system (Eq (C.9)):
dψ(t)/dt = A^T(t) ψ(t) + Γ(t);   (C.17)
measurement (Eqs. (C.5), (C.8) and Bayes’ formula):
where ψ̄(tk ) is the value of ψ before a measurement at time tk (solution of Eq. (C.17)) and ψ(tk ) is the value of ψ
immediately after the measurement z k at time tk . The initial condition (right before the first measurement) is ψ̄(t1 ) = 0.
Appendix D
Particle filters
D.1 Introduction
As mentioned in section 4.1, an assumption these filters make is the Markov assumption. The observations are also assumed
to be conditionally independent given the state.
with P ost (X(k)) as defined in (2.4) on page 20 and E [f |p] denoting the expected value of the function f under the pdf p.
In a sampling based approach, we estimate the a posteriori distribution by drawing N samples from it:
Post(X(k)) ≈ (1/N) Σ_{i=1}^{N} δ_{X^i_k}(X_k)   (D.2)
where X^i_k denotes the i-th sample drawn from Post(X(k)) and δ denotes the Dirac function.
where Prop(X(k)) is the proposal distribution. It is a pdf with the same arguments as Post(X(k)) (as defined in equation 2.4), but it has a different “form”.
Suppose we denote a certain sample (instantiation) of X(k) as X^i(k), and the ratio between the values of the a posteriori and proposal pdfs at that sample as w(X^i(k)) (or w^i in a shortened version):
w(X^i(k)) = w^i = Post(X^i(k)) / Prop(X^i(k))   (D.4)
We can thus obtain an estimate of our expected value with N samples of our proposal distribution:
E[h(X(k)) | Post(X(k))] ≈ (1/N) Σ_{i=1}^{N} h(X^i(k)) w(X^i(k))   (D.6)
This still doesn't allow for a recursive solution of our problem. Indeed, at a certain timestep k, this means we should choose a proposal density and sample N × k samples of dimension R^n from this proposal density. If, however, we were able to formulate our problem in a recursive way, this would allow us to keep the number of samples we have to generate at a certain time instant constant (N).
Remark D.1 Note that this approach leaves us with samples of the joint a posteriori density! It can be proved that, provided
that enough samples are drawn, by taking the last vector of each of these samples, one obtains samples of the marginal pdf!
Remark D.2 Note that there also exist particle filters that use Monte Carlo sampling methods other than importance sampling. Markov chain Monte Carlo methods are often too computationally complex, but rejection methods [16] are also used!
To avoid too heavy notation, we'll combine some symbols as we already did before (see section 2.4, remark 2.10 on page 21):
H_{k−1} = {u_{k−1}, θ_f, f_{k−1}}   (D.7)
I_k = {s_k, θ_g, g_k}   (D.8)
With these symbols, we rephrase the 3 most important equations for Markov systems (2.5, 2.6, 2.7 from section 2.4 on page 20) here for the joint a posteriori density:
Remark D.3 Note that the prediction step does not contain an integral here. Note also that we can formulate the iteration step here as a simple product of distributions, and do not really have to take the two-step prediction/correction approach!
Post(X(k)) = P( X(k) | Post(X(k − 1)), H_{k−1}, I_k, z_k )
We can obtain this in a recursive way, using Bayes' rule and the Markov assumption:
P( X(k) = X_k | Post(X(k−1)), H_{k−1}, I_k, z_k )
= P(z_k | X_k, Post(X(k−1)), H_{k−1}, I_k) P(X_k | Post(X(k−1)), H_{k−1}, I_k) / P(z_k | Post(X(k−1)), H_{k−1}, I_k)
= P(z_k | x_k, I_k) P(X_k | Post(X(k−1)), H_{k−1}, I_k) / P(z_k | Post(X(k−1)), H_{k−1}, I_k)
= P(z_k | x_k, I_k) P(X_{k−1}, x_k | Post(X(k−1)), H_{k−1}, I_k) / P(z_k | Post(X(k−1)), H_{k−1}, I_k)
= P(z_k | x_k, I_k) P(x_k | X_{k−1}, Post(X(k−1)), H_{k−1}, I_k) P(X_{k−1} | Post(X(k−1)), H_{k−1}, I_k) / P(z_k | Post(X(k−1)), H_{k−1}, I_k)
= P(z_k | x_k, I_k) P(x_k | x_{k−1}, H_{k−1}) Post(X(k−1)) / P(z_k | Post(X(k−1)), H_{k−1}, I_k)
= [ P(z_k | x_k, I_k) P(x_k | x_{k−1}, H_{k−1}) / P(z_k) ] Post(X(k−1))   (D.9)
And thus we obtain the following recursive formula for Post(X(k)):
Post(X(k)) = [ P(z_k | x_k, I_k) P(x_k | x_{k−1}, H_{k−1}) / P(z_k) ] Post(X(k − 1))   (D.10)
PROOF Suppose we have S samples {x^i, i = 1, ..., S} of a pdf p(x). For each of these samples, we know the pdf p(y | x = x^i) and we can sample from this distribution (thus obtaining y^i). Can we combine the x^i and y^i to obtain samples of the joint pdf p(x, y) = p(y|x) p(x)?
If the above can be proved, this allows us to solve the problem recursively. Indeed (see also eq. (2.1) on page 19),
Prop(X(k)) = Q( X(k) = X_k | Z^k, U^{k−1}, S^k, θ_f, θ_g, F^{k−1}, G^k, P(x(0)) )
= Q( X_{k−1}, x_k | Z^k, U^{k−1}, S^k, θ_f, θ_g, F^{k−1}, G^k, P(x(0)) )
= Q( x_k | X_{k−1}, Z^k, ..., P(x(0)) ) Q( X_{k−1} | Z^k, ..., P(x(0)) )   (D.11)
= Q( x_k | x_{k−1}, z_k, H_{k−1}, I_k ) Prop(X(k − 1))
We can thus use this formula recursively, starting from an a priori proposal distribution.
Combining those two recursions, i.e. starting from the definition of the weights (D.4) and using both the recursion of the proposal density (D.11) and that of the a posteriori density (D.10), we obtain
w(X^i(k)) = Post(X^i(k)) / Prop(X^i(k)) = [ P(z_k | x^i_k, I_k) P(x^i_k | x^i_{k−1}, H_{k−1}) / ( P(z_k) Q(x^i_k | x^i_{k−1}, z_k, H_{k−1}, I_k) ) ] w(X^i(k − 1))
Is the unknown normalizing factor α = 1/P(z_k) a serious problem? No: this factor does not depend on the estimated state vector and can thus be put before the integral in eq. (D.5).
We can avoid the unknown normalizing factor α by working with normalized weights w̃(X^i(k)):
w̃(X^i(k)) = w(X^i(k)) / Σ_{j=1}^{N} w(X^j(k))   (D.13)
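One full iteration of a sampling-importance-resampling (SIR) particle filter can then be sketched as follows, for a hypothetical scalar random-walk model x_k = x_{k−1} + w, z_k = x_k + v with Gaussian noise (the model and all numbers are assumptions for illustration). Taking the process model as the proposal, the incremental weights reduce to the likelihood p(z_k | x_k):

```python
import numpy as np

rng = np.random.default_rng(1)

def sir_step(particles, z, q_std, r_std):
    """One SIR iteration: predict, weight by the likelihood,
    normalize as in (D.13), and resample."""
    N = len(particles)
    particles = particles + rng.normal(0.0, q_std, N)     # prediction (proposal)
    w = np.exp(-0.5 * ((z - particles) / r_std) ** 2)     # unnormalized weights
    w /= w.sum()                                          # normalized weights (D.13)
    return particles[rng.choice(N, size=N, p=w)]          # resampling
```

The resampling step replaces the normalized weights by (approximately) uniform ones, which counters the weight degeneracy mentioned below.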
D.4 Literature
• The SMC homepage1 has lots of useful links to papers, videos, software . . .
• Sebastian Thrun and others have written several papers about the application of Particle filters in applications [81, 79,
35].
D.5 Software
• The Bayesian Filtering Library (BFL)2 of Klaas Gadeyne contains (amongst others) C++ support for particle filters.
• The Player Stage Project3 has a particle filter implementation in C for mobile robots
• Bayes++4 also contains an implementation of a SIR filter (and several other “schemes” for Bayesian filtering)
1 http://www-sigproc.eng.cam.ac.uk/smc/
2 http://people.mech.kuleuven.ac.be/˜kgadeyne/bfl.html
3 http://playerstage.sourceforge.net/
4 http://www.acfr.usyd.edu.au/technology/bayesianfilter/Bayes++.htm
Appendix E
The EM algorithm, M-step, proofs
E[ log p(X(k) | Z^k, H^k, θ_k) | Z^k, H^k, θ_{k−1} ] − E[ log p(X(k) | Z^k, H^k, θ_{k−1}) | Z^k, H^k, θ_{k−1} ]
= ∫ [ log p(X(k) | Z^k, H^k, θ_k) − log p(X(k) | Z^k, H^k, θ_{k−1}) ] p(X(k) | Z^k, H^k, θ_{k−1}) dX(k)
= ∫ log[ p(X(k) | Z^k, H^k, θ_k) / p(X(k) | Z^k, H^k, θ_{k−1}) ] p(X(k) | Z^k, H^k, θ_{k−1}) dX(k)
≤ 0;
hence it will increase the (incomplete-data) likelihood function itself (Eq (5.2)).
We know that
p(X(k), Z^k | H^k, θ) = p(X(k) | Z^k, H^k, θ) p(Z^k | H^k, θ);
hence:
log p(Z^k | H^k, θ) = log p(X(k), Z^k | H^k, θ) − log p(X(k) | Z^k, H^k, θ).
When averaging over X, given the pdf p(X(k) | Z^k, H^k, θ_{k−1}) calculated in the E-step, this becomes (the term on the
The first two terms on the right hand side equal Q(θ_k, θ_{k−1}) − Q(θ_{k−1}, θ_{k−1}). When Eq. (5.3) or Eq. (5.4) is satisfied, this is strictly positive. Part one of the proof showed that the last two terms on the right hand side give a positive sum, for all values of θ_k. This means that, if Eq. (5.3) or Eq. (5.4) is satisfied, the left hand side is positive, i.e. θ_k increases the logarithm of the incomplete-data likelihood function, and hence also increases the incomplete-data likelihood function itself (Eq. (5.2)).
Appendix F
Bayesian (belief) networks
A Bayesian network provides a model representation for the joint distribution of a set of variables in terms of conditional
and prior probabilities, in which the orientations of the arrows represent influence (usually though not always of a causal
nature), such that these conditional probabilities for these particular orientations are relatively straightforward to specify.
When data are observed, then typically an inference procedure is required. This involves calculating marginal probabilities
conditional on the observed data using Bayes’ theorem, which is diagrammatically equivalent to reversing one or more of
the Bayesian network arrows.
Features:
• Conditional independence properties can be used to simplify the general factorization formula for the joint probability.
In some cases, this can be very important to provide an efficient basis for the implementation of some MCMC variants
such as Gibbs sampling [55].
• That result can be expressed by the use of a DAG
A Bayesian network is a directed acyclic graph (DAG) whose structure defines a set of conditional independence properties (often denoted as ⊥⊥). This follows from the fact that any PDF can be factorised.
Recursive factorization:
P(X₁, ..., X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))
Marginalising over a childless node is equivalent to simply removing it, together with any edges to it from its parents.
The nodes of a directed acyclic graph can always be linearly ordered so that for each node X all of its parents Pa(X) precede it in the ordering. This is called a topological ordering.
Directed Markov property
A variable is conditionally independent of its non-descendants given its parents:
X ⊥⊥ nd(X) | parents(X)
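The recursive factorization above can be sketched on a hypothetical three-node chain A → B → C with binary variables; the conditional probability tables are made-up numbers for illustration:

```python
# CPTs (assumptions for this sketch)
p_a = {0: 0.4, 1: 0.6}
p_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # p_b[a][b] = P(B=b | A=a)
p_c = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # p_c[b][c] = P(C=c | B=b)

def joint(a, b, c):
    # Recursive factorization over the DAG: P(A, B, C) = P(A) P(B|A) P(C|B)
    return p_a[a] * p_b[a][b] * p_c[b][c]
```

Marginalising over the childless node C indeed just removes it: summing joint(a, b, c) over c leaves P(A = a) P(B = b | A = a).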
Appendix G
Entropy and information
The concept of entropy H arises from an equally important concept called (self)-information I. The following sections
define these concepts and the relation between them. A good book on this subject is [29].
• If all the p_i are equal, p_i = 1/n, then H should be a monotonically increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events.
• If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. E.g. for three possible values p₁ = 1/2, p₂ = 1/3 and p₃ = 1/6: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + ½ H(2/3, 1/3).
[106, 107] prove that the only H satisfying the above assumptions is of the form:
H(x) = −K Σ_{i=1}^{n} p_i log p_i   (G.1)
where any choice of "log" is possible; this changes only the units of the entropy result (e.g. log₂: [bits], ln: [nats]). Shannon also extended this to the continuous case (differential entropy):
H(x) = −∫_{−∞}^{∞} p(x) log p(x) dx = E[−log p(x)]   (G.3)
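The discrete form (G.1) is a one-liner; the sketch below uses K = 1 and base-2 logarithms (entropy in bits) and can be used to verify the decomposition property H(1/2, 1/3, 1/6) = H(1/2, 1/2) + ½ H(2/3, 1/3) numerically:

```python
import math

def entropy_bits(p):
    """Shannon entropy (G.1) with K = 1 and base-2 logarithm, in bits.
    Terms with p_i = 0 contribute nothing (0 log 0 := 0)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)
```

For a fair coin this gives exactly 1 bit, as expected.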
Mutual information I(x, y) is the reduction in the uncertainty of x due to the knowledge of y. For discrete distributions:
I(x, y) = Σ_X Σ_Y p(x, y) log[ p(x, y) / (p(x) p(y)) ]   (G.20)
= D( p(x, y) || p(x) p(y) )   (G.21)
The relation between entropy and mutual information is (see figure G.1):
Principle of maximum entropy [60]: When making inferences based on incomplete information, the pdf with maximum
entropy is the least biased estimate possible on the given information; i.e. it is maximally noncommittal with regard to
missing information.
The intuition is that we should make the least possible additional assumptions about p.
It turns out that there is always a unique maximal entropy measure.
Principle of minimum cross entropy [69, 68]: among the pdfs that satisfy the given constraints, choose the one that is as close to the prior distribution as possible (in the Kullback-Leibler sense). For a uniform prior this is equivalent to maximizing the Shannon entropy (section G.6).
[Figure G.1: Relation between the entropies H(x), H(y), H(x, y) and the mutual information.]
i.e. the maximum likelihood estimation (or the maximum a posteriori probability estimation) is looking for a point x̂, which
is not necessarily unique, that minimizes the Kullback-Leibler distance between p(Z k |x) and the empirical distribution
p(Z k ) (possibly modified by the prior).
Appendix H
Fisher information matrix and Cramér-Rao lower bound
The inverse of the Fisher information matrix determines a lower bound on the covariance matrix of the estimate that can
be obtained with an efficient estimator, given the measurements. Note that the covariance matrix is a good measure of the
uncertainty on the estimate if we are interested in a single value estimate: with the expected value of the distribution as
estimate, the covariance matrix expresses the covariance of the deviations between this estimate and the real value1 . For
a multimodal distribution with small peaks, the covariance matrix will be large, in contrast to the entropy measures which
will be small. If, on the other hand, we are not interested in a single value estimate e.g. because our estimate is intrinsically
multimodal, the covariance matrix measure is not a good measure.
The next section describes the Fisher information matrix and Cramér-Rao lower bound for the estimation of a non random
state vector, Section H.2 for a random state vector. The original derivation of the Fisher information matrix and the Cramér-
Rao lower bound is made for the non random case: given a number of measurements, we want to estimate a static state
(parameter) x. The random case is an extension to Bayesian estimation: given a number of measurements and an a priori
distribution of the state x, we want to estimate the state x. The extension is also valid for dynamic states, changing in time
according to a process function with process uncertainty.
For more info, see [120].
where ∇_x = [∂/∂x₁ ... ∂/∂x_n]^T is the gradient operator with respect to x = [x₁ ... x_n], and ∇_x ∇_x^T is the Hessian operator. E[·] is the expected value with respect to p(Z^k | x). This measure was introduced by Fisher as a measure of the amount of information about x present in the measurements. The elements of the matrix I(x) are:
I_ij(x) = E[ −∂² ln p(Z^k | x) / (∂x_i ∂x_j) ]   (H.3)
I(x*) is the Fisher information matrix evaluated at the true state vector x*. The matrix inequality (H.4) means that var(T) − I^{−1}(x*) is positive semi-definite. The bound depends on the actual state value; hence, it is not possible to compute the bound in real estimation cases where the states are unknown. However, the bound can be used to analyse and evaluate estimators in simulations.
The unbiased estimator T(x) is efficient if var(T) = I^{−1}(x*). Note that it is possible that no estimator meets this lower bound.
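This use of the bound in simulation can be sketched with the textbook case of estimating the mean x of a Gaussian with known σ from k measurements, where I(x) = k/σ² and the sample mean is efficient, so its variance should match I^{−1}(x) = σ²/k. All numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of the Cramer-Rao bound for the Gaussian mean
sigma, k, x_true, runs = 2.0, 50, 3.0, 20_000
samples = rng.normal(x_true, sigma, (runs, k))
estimates = samples.mean(axis=1)        # efficient, unbiased estimator
crlb = sigma ** 2 / k                   # inverse Fisher information, 0.08
```

The empirical variance of the estimates over the runs should match crlb up to Monte Carlo error.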
The Fisher information matrix for a random state vector x_k is defined as the covariance of the gradient of the total log-probability, that is:
• I_{k|k} can be divided into I_{k|k,D} and I_{k|k,P} (provided that these exist):
I_{k|k,D} is the information obtained from the data; I_{k|k,P} represents the information in the prior distribution p(x_k).
• The information matrix can also be described in function of the posterior distribution p(xk |Z k ):
The Cramér-Rao bound for a random state vector xk is called the Van Trees version of the Cramér-Rao bound, or the
posterior Cramér-Rao bound [117]. As was the case for the estimation of non random states, the Cramér-Rao lower bound
is the inverse of the Fisher information matrix I k|k .
If we obtain an efficient estimator for x_k, the Fisher information will simply be given by the inverse of the error covariance matrix of the state: I_{k|k} = P_k^{−1}.
P_{k,β} ≥ I_{ββ} − I_{βα} I_{αα}^{−1} I_{αβ}.   (H.19)
H.3 Entropy and Fisher
The Fisher information represents the local behaviour of the relative entropy: it indicates the rate of change in information in a given direction of the probability manifold. For two distributions p(z|x) and p(z|x′) [68]:
D( p(z|x) || p(z|x′) ) ∼ ½ I(x) (x − x′)²;   (H.22)
I(x) = Σ_z p(z|x) ( ∂/∂x ln p(z|x) )²   (H.23)
Bibliography
[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. Petrov and F. Csaki, editors, Proceedings of the Second International Symposium in Information Theory, pages 267–281. Akadémiai Kiadó, Budapest, Hungary, 1973.
[2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.
[3] H. Akaike. On the entropy maximization principle. In P. Krishniah, editor, Applications of Statistics, pages 27–41. North-Holland, Amsterdam, 1977.
[4] H. Akaike. Prediction and entropy. In A. Atkinson and S. Fienberg, editors, A Celebration of Statistics, pages 1–24. Springer, New York, 1985.
[5] D. Alspach and H. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Transactions on Automatic Control, 17(4):439–448, August 1972.
[6] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing, 50(2):174–188, February 2002. http://www-sigproc.eng.cam.ac.uk/~sm224/ieeepstut.ps.
[7] K. J. Astrom. Optimal control of markov decision processes with incomplete state estimation. J. Math. Anal. Appl.,
10:174–205, 1965.
[8] Y. Bar-Shalom and X. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1993.
[9] A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence,
72:81–138, 1995.
[10] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
[11] R. Bellman. A markov decision process. Journal of Mathematical Mechanics, 6:679–684, 1957.
[12] V. Beneš. Exact finite-dimensional filters for certain diffusions with nonlinear drift. Stochastics, 5:65–92, 1981.
[13] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley series in probability and statistics. John Wiley & Sons,
repr. edition, 2001.
[14] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volume I. Athena Scientific, Belmont Massachusetts,
1995.
[15] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volume II. Athena Scientific, Belmont Massachusetts,
1995.
[16] E. Bølviken, P. Acklam, N. Christophersen, and J.-M. Størdal. Monte Carlo filters for non-linear state estimation.
Automatica, 37(2):177–183, 2001. http://www.math.uio.no/˜erikb/automatica.pdf.
[17] B. Bonet and H. Geffner. Planning with incomplete information as heuristic search in belief space. In Proc. of the 5th International Conference on AI Planning and Scheduling, AAAI Press, pages 52–61, Colorado, 2000.
[18] B. Bonet and H. Geffner. Planning as heuristic search. Artificial Intelligence, Special issue on Heuristic Search,
129(1–2):5–33, 2001.
[19] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage.
Journal of Artificial Intelligence Research, 11:1–94, 1999.
[20] C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact
representations. AAA, 2:1168–1175, 1996.
[21] S. Boyd and L. Vandenberghe. Convex Optimization. http://www.ee.ucla.edu/∼vandenbe/publications.html. Course
reader for EE364 (Stanford) and EE236B (UCLA), and draft of a book that will be published in 2003.
[22] G. Calafiore, M. Indri, and B. Bona. Robot dynamic calibration: Optimal trajectories and experimental parameter estimation. IEEE Trans. on AC, 13(5):730–740, 1997.
[23] G. Casella and E. I. George. Explaining the Gibbs Sampler. The American Statistician, 46(3):167–174, 1992.
[24] A. Cassandra, L. Kaelbling, and J. Kurien. Acting Under Uncertainty: Discrete Bayesian Models for Mobile-Robot Navigation. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 1996. http://www.cs.brown.edu/people/lpk/iros96.ps.
[25] A. R. Cassandra. Optimal policies for partially observable Markov decision processes. Technical Report CS-94-14, Brown University, Department of Computer Science, Providence RI, 1994. http://www.cs.brown.edu/publications/techreports/reports/CS-94-14.html.
[26] A. R. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis,
U. Brown, 1998.
[27] H.-T. Cheng. Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British
Columbia, British Columbia, Canada, 1988.
[28] S. Chib and E. Greenberg. Understanding the Metropolis–Hastings Algorithm. The American Statistician, 49(4):327–
335, 1995.
[29] T. M. Cover and J. A. Thomas, editors. Elements of Information Theory. Wiley Series in Telecommunications.
Wiley-Interscience, 1991.
[30] H. Cramér. Mathematical methods of Statistics. Princeton. Princeton University Press, New Jersey, 1946.
[31] F. Daum. The Fisher-Darmois-Koopman-Pitman theorem for random processes. In Proc. of the 1986 IEEE Conference on Decision and Control, pages 1043–1044.
[32] F. Daum. Solution of the Zakai equation by separation of variables. IEEE Trans. Autom. Control, AC-32(10), 1987.
[33] F. Daum. New exact nonlinear filters. In e. J. C. Spall, editor, Bayesian Analysis of Time Series and Dynamic Models,
chapter 8, pages 199–226. Marcel Dekker inc., New York, 1988.
[34] J. De Geeter. Constrained system state estimation and task-directed sensing. PhD thesis, K.U.Leuven, Department
of Mechanical engineering, div. PMA, Celestijnenlaan 300B, 3001 Leuven, Belgium, 1998.
[35] F. Dellaert, D. Fox, W. Burgard, and S. Thrun. Monte Carlo localization for mobile robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’99), Detroit, Michigan, 1999.
[36] F. d’Epenoux. Sur un problème de production et de stockage dans l’aléatoire. Revue Française de Recherche Opérationnelle, 14:3–16, 1960.
[37] A. Doucet. Monte Carlo Methods for Bayesian Estimation of Hidden Markov Models. PhD thesis, Univ. Paris-Sud, Orsay, 1997. In French.
[38] A. Doucet. On Sequential Simulation-Based Methods for Bayesian Filtering. Technical Report CUED/F-
INFENG/TR.310, Signal Processing Group, Dept. of Engineering, University of Cambridge, 1998.
[39] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer–Verlag, January 2001.
[40] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
[41] A. Drake. Observation of Markov Processes Through a Noisy Channel. PhD thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, 1962.
[42] R. Dugad and U. Desai. A tutorial on Hidden Markov Models. Technical Report SPANN-96.1, Indian Institute of Technology, Dept. of Electrical Engineering, Signal Processing and Artificial Neural Networks Laboratory, Bombay, Powai, Mumbai 400 076 India, May 1996. http://vision.ai.uiuc.edu/dugad/newhmmtut.ps.gz.
[43] D. Dugué. Applications des propriétés de la limite au sens du calcul des probabilités à l’étude des diverses questions
d’estimation. Ecol. Poly., 3(4):305–372, 1937.
[44] J. N. Eagle. The optimal search for a moving target when the search path is constrained. Operations research,
32(5):1107–1115, 1984.
[45] G. J. Erickson and C. R. Smith, editors. Maximum-Entropy and Bayesian Methods in Science and Engineering.
Vol. 1: Foundations; Vol. 2: Applications, Dordrecht, The Netherlands, 1988. Kluwer Academic Publishers.
[46] H. J. S. Feder, J. J. Leonard, and C. M. Smith. Adaptive mobile robot navigation and mapping. International Journal
of Robotics Research, 18(7):650–668, July 1999.
[47] V. Fedorov. Theory of optimal experiments. Academic press, New York, 1972.
[48] R. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, A, 222:309–368, 1922.
[49] M. Forster and E. Sober. How to tell when simpler, more unified, or less ad hoc theories will provide more accurate predictions. British Journal for the Philosophy of Science, 45:1–35, 1994.
[50] D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte Carlo localization: Efficient position estimation for mobile robots. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI’99), Orlando, FL, 1999.
[51] D. Fox, W. Burgard, and S. Thrun. Active Markov localization for mobile robots. Robotics and Autonomous Systems, 25:195–207, 1998.
[52] D. Fox, W. Burgard, and S. Thrun. Markov localization for mobile robots in dynamic environments. Journal of Artificial Intelligence Research, 11, 1999.
[53] J. De Geeter, J. De Schutter, H. Bruyninckx, H. Van Brussel, and M. Decréton. Tolerance-weighted L-optimal experiment design: a new approach to task-directed sensing. Advanced Robotics, 13(4):401–416, 1999.
[54] A. E. Gelfand and A. F. M. Smith. Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85(410):398–409, June 1990.
[55] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman &
Hall, London, first edition, 1996.
[56] W. K. Hastings. Monte Carlo sampling methods using Markov Chains and their applications. Biometrika, 57:97–107,
1970.
[57] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94, 2000.
[58] R. A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, Massachusetts, 1960.
[59] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical
Statistics, 5(3):299–314, 1996.
[60] E. T. Jaynes. How does the brain do plausible reasoning? Technical Report 421, Stanford University Microwave
Laboratory, 1957. Reprinted in [45, Vol. 1, p. 1–24].
[61] F. Jelinek. Statistical methods for speech recognition. MIT Press, 1997.
[62] M. I. Jordan, editor. Learning in Graphical Models. Adaptive Computation and Machine Learning. MIT Press,
London, England, 1999. ISBN 0262600323.
[63] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82:35–45, 1960.
[64] M. H. Kalos and P. A. Whitlock. Monte Carlo Methods, Volume I: Basics. Wiley-Interscience, New York, 1986.
[65] S. Koenig and R. Simmons. Solving robot navigation problems with initial pose uncertainty using real-time heuristic
search. In Proceedings of the International Conference on Artificial Intelligence Planning Systems, pages 154–153,
1998.
[66] S. Kristensen. Sensor planning with bayesian decision theory. Robotics and Autonomous Systems, 19:273–286, 1997.
[67] B. J. A. Kröse and R. Bunschoten. Probabilistic localization by appearance models and active vision. In IEEE Conference on Robotics and Automation, Detroit, May 1999.
[68] S. Kullback. Information theory and statistics. John Wiley & Sons, New York, NY, 1959.
[69] S. Kullback and R. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[70] S. E. Levinson. Continuously Variable Duration Hidden Markov Models for speech analysis. In Int. Conf. on Acoustics, Speech, and Signal Processing, volume 2, pages 1241–1244. AT&T Bell Lab., April 1986.
[71] S. E. Levinson. Continuously Variable Duration Hidden Markov Models for speech recognition. Computer, Speech
and Language, 1:29–45, 1986.
[72] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Efficient dynamic-programming updates in partially observable Markov decision processes. Technical Report CS-95-19, Brown University, Department of Computer Science, Providence, RI, 1995.
[73] M. L. Littman, T. L. Dean, and L. P. Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence, 1995.
[74] W. S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 18:47–65, 1991.
[75] D. J. C. MacKay. Information theory, inference and learning algorithms. Textbook in preparation. http://wol.
ra.phy.cam.ac.uk/mackay/itprnn/, 1999.
[76] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
[77] N. Metropolis and S. Ulam. The Monte Carlo Method. Journal of the American Statistical Association, 44(247):335–341, 1949.
[78] G. E. Monahan. A survey of partially observable decision processes: Theory, models and algorithms. Management
Science, 28(1):1–16, 1982.
[79] M. Montemerlo and S. Thrun. Simultaneous localization and mapping with unknown data association. In Proceedings of the 2003 ICRA, pages 1985–1991, Taipei, Taiwan, September 2003. IEEE.
[80] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 593–598, 2002.
[81] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003.
[82] K. Murphy and S. Russell. Sequential Monte Carlo Methods in Practice, chapter Rao-Blackwellised particle filtering for dynamic Bayesian networks, pages 499–516. Statistics for Engineering and Information Science. Springer–Verlag, January 2001.
[84] C. Musso, N. Oudjane, and F. LeGland. Sequential Monte Carlo Methods in Practice, chapter Improving regularised particle filters, page ??. Statistics for Engineering and Information Science. Springer–Verlag, January 2001.
[85] R. M. Neal. Markov Chain Monte Carlo Methods Based on ‘Slicing’ the Density Function. Technical Report 9722, Dept. of Statistics and Dept. of Computer Science, University of Toronto, Toronto, Ontario, Canada, November 1997. http://www.cs.utoronto.ca/˜radford/slice.abstract.html.
[86] R. M. Neal. Slice Sampling. Technical Report 2005, Dept. of Statistics, University of Toronto, Toronto, Ontario, Canada, August 2000. http://www.cs.toronto.edu/˜radford/slc-samp.abstract.html.
[88] R. M. Neal. Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1,
University of Toronto, Department of Computer Science, 1993.
[90] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.
[91] M. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 1999. Forthcoming.
[93] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons,
Wiley series in probability and mathematical statistics, New York, 1994.
[94] L. R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of
the IEEE, 77(2):257–286, 1989.
[95] C. R. Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta
Mathematical Society, 37:81–91, 1945.
[97] J. Rissanen. Modeling by the shortest data description. Automatica, 14:465–71, 1978.
[98] J. Rissanen. Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49:223–239,
1987.
[99] N. Roy, W. Burgard, D. Fox, and S. Thrun. Coastal navigation - mobile robot navigation with uncertainty in dynamic
environments. In Proceedings of the IEEE International Conference on Robotics and Automation, Detroit, MI,
volume 1, pages 35–40, May 1999.
[100] D. B. Rubin. Bayesian Statistics 3, chapter Using the SIR algorithm to simulate posterior distributions, pages 395–402. Oxford University Press, 1988.
[101] J. Rust. Numerical dynamic programming in economics. In H. Amman, D. Kendrick, and J. Rust, editors, Handbook
of Computational Economics, pages 619–729. Elsevier, Amsterdam, 1996.
[102] S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Transactions on Automatic Control, 45(3):477–482, March 2000.
[103] Y. Sakamoto, M. Ishiguro, and G. Kitagawa. Akaike Information Criterion Statistics. Kluwer, Dordrecht, 1986.
[104] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
[105] P. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.
[106] C. Shannon. A mathematical theory of communication, i. The Bell System Technical Journal, 27:379–423, July
1948.
[107] C. Shannon. A mathematical theory of communication, ii. The Bell System Technical Journal, 27:623–656, October
1948.
[108] R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proceedings of the
fourteenth International Joint Conference on Artificial Intelligence, Montréal, Québec, Canada, pages 1080–1087.
Springer-Verlag, Berlin, Germany, 1995.
[110] A. F. M. Smith and A. E. Gelfand. Bayesian Statistics Without Tears: A Sampling–Resampling Perspective. The American Statistician, 46(2):84–88, 1992.
[111] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University,
Stanford, California, 1971.
[112] R. Sutton and A. Barto. Reinforcement Learning, An introduction. The MIT Press, 1998.
[113] J. Swevers, C. Ganseman, D. B. Tükel, J. De Schutter, and H. Van Brussel. Optimal robot excitation and identification.
IEEE Transactions on Robotics and Automation, 13(5):730–739, October 1997.
[114] S. Thrun. Monte Carlo POMDPs. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 1064–1070. MIT Press, 2000.
[115] S. Thrun and J. Langford. Monte Carlo Hidden Markov Models. Technical Report CMU-CS-98-179, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA 15213, 1998. http://www.cs.cmu.edu/afs/cs.cmu.edu/user/thrun/public_html/papers/thrun.hmm.html.
[116] S. Thrun, J. Langford, and D. Fox. Monte Carlo Hidden Markov Models: Learning non-parametric models of partially observable stochastic processes. In ??, editor, Proceedings of the Sixteenth International Conference on Machine Learning, page ??, 1999. http://www.cs.cmu.edu/afs/cs.cmu.edu/user/thrun/public_html/papers/thrun.mchmm.html.
[117] P. Tichavský, C. H. Muravchik, and A. Nehorai. Posterior Cramér-Rao bounds for discrete-time nonlinear filtering.
IEEE Transactions on Signal Processing, 46(5):1386–1396, May 1998.
[118] M. Trick and S. Zin. A linear programming approach to solving stochastic dynamic programs. Technical report,
Carnegie-Mellon University, manuscript, 1993.
[119] P. Turney. A theory of cross-validation error. The Journal of Theoretical and Experimental Artificial Intelligence, 6:361–392, 1994.
[120] H. L. Van Trees. Detection, Estimation and Modulation Theory, Vol I. Wiley and Sons, New York, 1968.
[121] C. Wallace and P. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society B,
49:240–65, 1987.
[122] E. Wan and A. Nelson. Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In M. Mozer and T. Petsche, editors, Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference, NIPS-9, 1997.
[123] D. Xiang and G. Wahba. A generalized approximate cross validation for smoothing splines with non-Gaussian data. Statistica Sinica, 6:675–692, 1996.
[124] A. Zellner, H. A. Keuzenkamp, and M. McAleer. Simplicity, Inference and Modelling. Keeping it Sophisticatedly
Simple. Cambridge University Press, Cambridge, UK, 2001.
[125] N. Zhang and W. Liu. Planning in stochastic domains: Problem characteristics and approximation. Technical Report
HKUST-CS96031, Hong Kong University of Science and Technology, 1996.