
The application of probabilistic techniques for the state/parameter

estimation of (dynamical) systems and pattern recognition problems

Klaas Gadeyne & Tine Lefebvre


Division Production Engineering, Machine Design and Automation (PMA)
Department of Mechanical Engineering, Katholieke Universiteit Leuven
[Klaas.Gadeyne],[Tine.Lefebvre]@mech.kuleuven.ac.be

14th July 2004


List of FIXME's

Add a paragraph about the differences between state estimation and pattern recognition. Include remarks of Tine that pattern recognition can be seen as Multiple Model (see chapter about parameter estimation) (p. 14)
Not clear: the introduction says nothing about Sections 4-5 (p. 15)
Include information from Herman's URKS course here, among other things say something about the choice of the prior (p. 17)
Is there a difference between accuracy and precision? (p. 17)
Include cross-reference to the introductory application examples document? (p. 17)
I guess (p. 18)
KG: sounds weird for continuous systems (p. 18)
Is this a true constraint? (p. 18)
Do we ever use these kinds of models with uncertainty "directly" on the inputs? (p. 18)
Describe the one-to-one relationship between the functional representation and the PDF notation somewhere (p. 19)
Even I don't understand anymore what I meant :) (p. 19)
Introduce the general Bayesian approach first: not applied to time-dependent systems [109] (p. 19)
If so, add an example! (p. 21)
To add: continuous-time models (differential equations) and discrete-time models (difference equations) (p. 23)
TL: there are also "belief networks", "graphical models", "Bayesian networks", etc. Do they belong here? Are they synonyms? (p. 24)
TL: u, θ_f and f? (p. 25)
Both graph and equation modeling (p. 29)
Add more references, among others Isard and Blake for the Condensation algorithm (p. 29)
KG: Discuss the algorithm in more detail, assuming the reader knows what MC techniques are; see also the appendix, of course (p. 29)
Figure out how exactly this works (p. 29)
Uses the EKF as proposal density (p. 29)
TL: do not understand the next two (p. 30)
TL: move to the MC chapter (p. 30)
Needs to be extended (p. 31)
KG: lose correlation between measured features in the map due to the inaccurately known pose of the robot, or not (p. 33)
KG: Is optimizing this pdf, without taking the state into account, the best way to do parameter estimation? (p. 33)
KG: Look for a solution to this!! IMHO only easy to solve for linear systems and Gaussian distributions (p. 35)
And Grid-based HMMs? (p. 36)
Work this out further (p. 36)
KG: Relate this to Pattern Recognition (p. 36)
Relation to the models: MDP = Markov Models with reward; POMDP = Hidden Markov Models with reward (p. 37)
KG: Look for a better formulation (p. 38)
KG: Maybe add an index to enumerate the constraints (p. 38)
TL: this chapter is still a mess (p. 47)
Prove this as an example of inversion sampling (p. 54)
Sentence is far too qualitative instead of quantitative (p. 54)
Add example (p. 55)
Discuss Adaptive Rejection Sampling [55] (p. 55)
Do some further research on this (p. 60)
Add a 2D example explaining this (p. 60)
Include a remark about the influence of posterior correlation on the speed of mixing (p. 60)
Verify why (p. 60)
Check this (p. 62)
Conjugacy should be explained in Chapter 2, where Bayes' rule is explained and the choice of the prior distribution is somewhat motivated (p. 63)
Add a plot to illustrate this (p. 63)
Fill this in further (p. 63)
To be filled in (p. 63)
Add illustration (p. 66)
KG: Add other Monte Carlo methods to this (p. 66)
TL: I don't see it (p. 69)
TL: READ THIS SECTION UP TO HERE (p. 69)
? state sequence ? (p. 73)
Work this out! (p. 75)
TL: still need to think about the <constant in x> thing (p. 81)
Rework the layout of this chapter. Is it at all possible to derive the second part? (p. 83)
Something is wrong here with that 1/N. Find out why this is not allowed and has to be replaced by normalised weights (p. 84)
Explain! (p. 84)
The last line of equation (D.9) is not correct! The denominator is not equal to the probability of the last measurement "tout court" (p. 84)
The proof is given in Chapter 5 of the algorithmic data analysis course GM28 (p. 85)
This is a preliminary version of this text, as you should have noticed :-) (p. 85)
This and the next section should still be written (p. 86)
Include algorithm (p. 86)
Include a number of important variants and describe them (p. 86)
Update this! (p. 86)
Check this (p. 86)
TL: to add: not necessarily 1 iteration per measurement, rather lots of iterations (p. 87)
KG: So far this chapter consists of some notes I took while reading [62] and [55] (p. 89)
Add an example to explain the difference between (non-)acyclic and directed (p. 89)
Notation: parent - child node: add example (p. 89)
Add an example (p. 89)
This section is not OK; I have heard the bell ring but don't know where the clapper hangs (i.e. I only have a vague idea) (p. 97)

Contents

I Introduction 9

1 Introduction 11
1.1 Application examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Overview of this report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Definitions and Problem description 17


2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Bayesian approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Markov assumption and Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 System modeling 23
3.1 Continuous state variables, equation modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Continuous state variables, network modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Discrete state variables, Finite State Machine modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Markov Chains/Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Hidden Markov Models (HMMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

II Algorithms 27

4 State estimation algorithms 29


4.1 Grid based and Monte Carlo Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Hidden Markov Model filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Kalman filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Exact Nonlinear Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5 Rao-Blackwellised filtering algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.6 Concluding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5 Parameter learning 33
5.1 Augmenting the state space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3 Multiple Model Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


6 Decision Making 37
6.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6.2 Performance criteria for accuracy of the estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 Trajectory generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.4 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.5 If the sequence of actions is restricted to a parameterized trajectory . . . . . . . . . . . . . . . . . . . . . . 40
6.6 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.7 Partially Observable Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6.8 Model-free learning algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

7 Model selection 47

III Numerical Techniques 49

8 Monte Carlo techniques 51


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
8.2 Sampling from a discrete distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.3 Inversion sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.4 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.5 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.6 Markov Chain Monte Carlo (MCMC) methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.6.1 The Metropolis-Hasting algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.6.2 Metropolis sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.6.3 The independence sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.6.4 Single component Metropolis–Hastings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.6.5 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.6.6 Slice sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.6.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.7 Reducing random walk behaviour and other tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8.8 Overview of Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.9 Applications of Monte Carlo techniques in recursive markovian state and parameter estimation . . . . . . . 66
8.10 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.11 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

A Variable Duration HMM filters 69


A.1 Algorithm 1 : The Forward-Backward algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.1.1 The forward algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.1.2 The backward procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.2 The Viterbi algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
A.2.1 Inductive calculation of the weights δt (i) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
A.2.2 Backtracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
A.3 Parameter learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
A.4 Case study: Estimating first order geometrical parameters by the use of VDHMM’s . . . . . . . . . . . . . 75

B Kalman Filter (KF) 77


B.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
B.2 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
B.3 Kalman Filter, derived from Bayes’ rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
B.4 Kalman Smoother . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
B.5 EM with Kalman Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

C Daum’s Exact Nonlinear Filter 81


C.1 Systems for which this filter is applicable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
C.2 Update equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
C.2.1 Off-line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
C.2.2 On-line . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

D Particle filters 83
D.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
D.2 Joint a posteriori density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
D.2.1 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
D.2.2 Sequential importance sampling (SIS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
D.3 Theory vs. reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
D.3.1 Resampling (SIR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
D.3.2 Choice of the proposal density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
D.4 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
D.5 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

E The EM algorithm, M-step, proofs 87

F Bayesian (belief) networks 89


F.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
F.2 Inference in Bayesian networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

G Entropy and information 91


G.1 Shannon entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
G.2 Joint entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
G.3 Conditional entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
G.4 Relative entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
G.5 Mutual information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
G.6 Principle of maximum entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
G.7 Principle of minimum cross entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
G.8 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

H Fisher information matrix and Cramér-Rao lower bound 95


H.1 Non random state vector estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
H.1.1 Fisher information matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
H.1.2 Cramér-Rao lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
H.2 Random state vector estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
H.2.1 Fisher information matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
H.2.2 Alternative expressions for the information matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 96
H.2.3 Cramér-Rao lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
H.2.4 Example: Gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
H.2.5 Example: Kalman Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
H.2.6 Example: Cramér-Rao lower bound on a part of the state vector . . . . . . . . . . . . . . . . . . . 97
H.3 Entropy and Fisher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Part I

Introduction

Chapter 1

Introduction

This document aims to compare different Bayesian (also referred to as probabilistic) filters (or estimators) with respect
to their appropriateness for the state/parameter estimation of (dynamical) systems. By Bayesian or probabilistic we simply
mean that we try to model uncertainty explicitly. For example, when measuring the dimensions of an object with a 3D coordinate
measuring machine, a Bayesian approach does not only provide the estimates for these dimensions, it also gives the accuracy
of these estimates. The approach will be illustrated with examples from multiple domains, but most algorithms will be
applied to the (static) localization problem of objects. This report examines which simplifying assumptions the different
filters make. The goal of this document is to provide a kind of manual that helps you decide which filter is appropriate to
solve your estimation problem.
A lot of people speak only of "good and better" filters. This shows that they do not fully understand the problem they are dealing
with: there are no such things as good, better and best filters. Some filters are just more appropriate (faster and more
accurate) for solving specific problems. Simply testing a certain filter on a certain problem is not a good way of solving it.
One should start by analyzing the problem, checking which model assumptions are justified, and then deciding
which filter is most appropriate to solve the problem. One should be able to predict more or less (rather more than less) whether the
filter will give good results or not.

1.1 Application examples


We will try to clarify all the filtering algorithms we describe by applying them to a number of examples.
Example 1.1 Localization of a transport pallet with a mobile robot platform.
A mobile robot platform is equipped with a radial laser scanner (as in figure 1.1) to be able to localize objects (such as a
transport pallet) in its environment. Figure 1.2 shows a photo and a scan of such a transport pallet. A laser scan image

Figure 1.1: Mobile Robot Platform Lias, equipped with a laser scanner (arrow). Note that the laser scanner should be much lower than on
this photo to be able to recognize transport pallets on the ground!

consists of a set of distance measurements in radial order (one every 0.5°). The vector containing these measurements is


denoted as z_k. Depending on the location (position x, y and orientation θ, see figure 1.2) of the pallet, a number of clusters
(coming from the feet of the transport pallet) will be visible on the scan in a certain geometrical order. Because the

Figure 1.2: Laser scanning of a transport pallet. (a) Photo of a transport pallet; (b) scan of a transport pallet made by a radial laser scanner; (c) definition of x, y and θ.

robot has to move towards the pallet, the position and orientation of the pallet with respect to the robot will change according
to the robot motion. We cannot immediately estimate the location from the raw laser scanner measurements: the location of
the transport pallet is a hidden variable or hidden state of our dynamic system. We denote the location of the transport
pallet with respect to the robot at timestep k as the vector x(k). A concrete location will then be denoted as x_k:

x_k = [x_k, y_k, θ_k]^T

If we know the state vector x(k) = x_k, we can predict the measurements of the laser scanner (a vector where each component
will be a distance at a certain angle of the laser scanner) at timestep k through a measurement model z(k) = g(x(k)).
This measurement model incorporates information about the geometry of the transport pallet, the sensor characteristics
and its own (the measurement model's) inaccuracy. Indeed, neither the sensor nor the measurement model is perfectly
known. Therefore, the sensor measurement prediction is not 100% certain (not infinitely accurate), even if the state is known.
In a Bayesian context, the measurement prediction is therefore characterised by a likelihood probability density function
(PDF):

P( z(k) | x(k) = x_k )

But we are interested in the reverse problem, i.e. in calculating the pdf over x(k) once a measurement z(k) = z_k is made:

P( x(k) | z_k ).

Fortunately, the insights of a guy named Bayes lead to the following equality:

P( x_k | z_k ) = P( z_k | x_k ) P( x_k ) / P( z_k ).

This can be written for all values of x(k):

P( x(k) | z_k ) = P( z_k | x(k) ) P( x(k) ) / P( z_k ).

Application of Bayes’ rule (often called inference) allows us to calculate the location of the pallet given this measurement
and the prior pdf P (x(k)). This a priori estimate is the knowledge (pdf) we have about the state x before the measurement
z(k) = z k is made (due to initial knowledge, previous measurements, . . . ). Note that P (z k ) is constant and independent
of x(k) and hence is just a “normalising factor” in the equation.
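To make this inference concrete, here is a minimal numerical sketch over a small, discretised set of candidate pallet poses. The candidate poses and all prior and likelihood values below are invented for illustration only; they do not come from a real scan or measurement model.

```python
# Bayes' rule on a discretised set of candidate pallet poses (all numbers invented).
candidate_poses = [(1.0, 0.0, 0.0), (1.2, 0.1, 0.1), (1.4, -0.1, 0.2)]  # (x, y, theta)

prior = [0.5, 0.3, 0.2]          # P(x(k) = x_k): knowledge before the scan arrives
likelihood = [0.02, 0.10, 0.01]  # P(z_k | x(k) = x_k): from the measurement model

evidence = sum(l * p for l, p in zip(likelihood, prior))           # P(z_k), the normalising factor
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]  # P(x(k) = x_k | z_k)

for pose, post in zip(candidate_poses, posterior):
    print(pose, round(post, 3))
```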
When moving with the robot towards the transport pallet, the relative location of the pallet with respect to the robot changes.
When the robot motion is known, the changes in x can be calculated. In order to know the robot motion, the robot is
equipped with so-called internal sensors: encoders at the driving wheels and a gyroscope. These internal sensors are used

to calculate the translational velocity v and the angular velocity ω of the robot. In this example, v_k and ω_k are supposed
to be perfectly known at each time t_k (ideal encoders and gyroscope, no wheel slip, . . . ). We consider the velocities as the
inputs u_k to our dynamical system:

u_k = [v_k, ω_k]^T

We can model our system through the system equations (or model/process equations)

x_k = x_{k-1} - v_{k-1} cos(θ_{k-1}) ∆t
y_k = y_{k-1} - v_{k-1} sin(θ_{k-1}) ∆t
θ_k = θ_{k-1} - ω_{k-1} ∆t

if the time step ∆t is small enough. Note that we immediately made a discrete-time model of our system! With a vector function,
we denote this as

x(k) = f( x(k - 1), u_{k-1} ).
The uncertainty over x(k - 1) will be propagated to x(k); moreover, because of the inaccuracy of the system model, the
uncertainty over x(k) will increase. In a Bayesian context, we calculate the pdf over x(k), given the pdf over x(k - 1) and
the input u_{k-1}:

P( x(k) | P(x(k - 1)), u_{k-1} )

and obtain for the system equation

P( x(k) ) = ∫ P( x(k) | x(k - 1), u_{k-1} ) P( x(k - 1) ) dx(k - 1)
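As an illustration of how this propagation can be carried out numerically, the sketch below represents P(x(k - 1)) by a set of samples and pushes each sample through the (noisy) process model of this example; the velocities, time step and noise levels are assumed values, not taken from the real platform.

```python
import math
import random

def f(x, u, dt=0.1):
    """Process model of Example 1.1: pallet pose relative to the robot."""
    px, py, theta = x
    v, w = u
    return (px - v * math.cos(theta) * dt,
            py - v * math.sin(theta) * dt,
            theta - w * dt)

random.seed(0)
# Samples representing P(x(k-1)): pallet roughly 1 m in front of the robot.
samples = [(random.gauss(1.0, 0.05), random.gauss(0.0, 0.05), random.gauss(0.0, 0.02))
           for _ in range(1000)]
u_prev = (0.3, 0.05)  # assumed translational and angular velocity, the input u_{k-1}

# Each propagated sample gets an extra small perturbation for the model inaccuracy,
# so the spread of the sample cloud (the uncertainty over x(k)) grows.
predicted = [tuple(c + random.gauss(0.0, 0.01) for c in f(x, u_prev)) for x in samples]
print(len(predicted), predicted[0])
```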

Example 1.2 Estimation of object locations during force-controlled compliant motion.


Compliant motion tasks are robot tasks in which the robot manipulates a (moving) object that at the same time is in contact
with the (typically fixed) environment. Examples are assembly of two pieces (a simple example is given in figure 1.3),
deburring of a casting piece, etc. The aim of autonomous compliant motion is to execute these tasks when the locations
(positions and orientations) of the objects in contact are not accurately known at the beginning of the task. Based on position,
velocity and force measurements, the robot will estimate the locations before or during the task execution. In industrial (i.e.
structured) environments this reduces the time and costs necessary to position the pieces very accurately; in less structured
environments (houses, nature,...) this is the only way to perform tasks which require precise relative positioning of the
contacting objects. The locations of both contacting objects (typically 12 variables: 3 positions and 3 orientations for each

Figure 1.3: Assembly of a cube (manipulated object) in a corner (environment object)



object) are collected in the state vector x. The location of the fixed object is described with respect to a fixed world frame,
the location of the manipulated object is described with respect to a frame on the robot end effector. Therefore, the state is
static, i.e. the real values of these locations do not change during the experiment.
The measurements at a certain time tk are collected in the vector z k (these are 6 contact force and moment measurements,
6 translational and rotational velocities of the manipulated object and/or 6 position and orientation measurements of the
manipulated object). A measurement model describes the relation between these measurements and the state vector:

g k (z(k), x(k)) = 0;

The model g is different for the different measurement types (velocities, forces, . . . ) and for different contacts between the
contacting objects (point-plane, edge-edge, . . . ).

Example 1.3 Localization of objects with force-controlled robots (local sensors).

Figure 1.4: Localization of a cube in 3 dofs with a touch sensor

Example 1.4 Pattern recognition examples such as OCR and speech recognition.

Figure 1.5: Easy OCR problem

Example 1.5 Measuring a known object with a 3D coordinate measuring machine


e.g. to check the accuracy of the positioning of holes, quality control;
known, parametrized geometry;
measurement points on known parts of the object; estimate the parameters accurately.

Example 1.6 Reverse engineering: Info on the Metris website1


The user selects the points corresponding to the part of the object on which the surface has to fit. This surface can be a
primitive entity such as a cylinder, a sphere, a plane, etc., or a free-form surface, e.g. modeled by a NURBS curve or surface. In the
latter case the user also defines the surface smoothing, which determines the number of parameters in the free-form surface
(let's say the "order" of the surface model). The Reverse Engineering program estimates the parameters of the surface
(e.g. the radius of the sphere, the parameters of the NURBS surface, etc.).
1 http://www.metris.be/

But unfortunately this estimation is deterministic (a least-squares approach). The measurement errors on the measured
points are not taken into account... I think the measurement error is considered to be negligible with respect to the desired
surface accuracy, and to justify this an awful lot of measurement points are taken and "filtered" beforehand into
a smaller set of "measured points". However, when using a Bayesian approach the number of measurement points can
be lower, i.e., just enough to reach the desired surface accuracy. Moreover, the measuring machine and touching device
probably do not have the same accuracy in the different touch directions, which is not at all taken into account in the
current (non-Bayesian) approach.
Reverse engineering problems can be seen as a SLAM (Simultaneous Localization and Mapping) problem between different points.

Example 1.7 Holonic systems

Example 1.8 Modal analysis?

1.2 Overview of this report


• Chapter 2 defines the state estimation problem and various symbols and terms;
• Chapter 3 handles possible ways to model your system;

• Chapter 4 gives an overview of different state estimation algorithms;

• Chapter 5 describes how inaccurately known parameters of your system and measurement models can also be esti-
mated;

• Chapter 6 discusses decision making (planning/active sensing);


• Chapter 8 describes Monte Carlo techniques.

Detailed filter algorithms are provided in appendix.


Chapter 2

Definitions and Problem description


2.1 Definitions

1. System: any (physical) system an engineer would want to control/describe/use/model.


2. Model: a mathematical/graphical description of a system. A model should be an accurate enough image of the
system in order to be "useful" (e.g. to control the system). This implies that a physical system can be modeled by
different models (figure 2.1). Note that in the context of state estimation, the accuracy of certain parts of the model
will determine the accuracy of the state estimates.

Figure 2.1: A model should contain only those properties of the physical system that are relevant for the application in which it will be
used. Hence the relation world-model is not a one-to-one relation.


For a dynamical model, the output at any time instant depends on its history (i.e. the dynamical model has memory),
not just on the present input as in a static model. The "memory" of the dynamical model is described by a
dynamical state, which has to be known in order to predict the output of the model.
Example 2.1 A car:
input: pushing of gaspedal (corresponds to car acceleration)
output: velocity of car
state: current velocity of car.

3. State: Every model can be fully described at a certain instant in time by all of its states. Different models of the
same system can result in dynamic states (dynamic model) or in static states (static model).
Example 2.2 Localization of a transport pallet with a mobile robot.
The location of the transport pallet with respect to the mobile robot is dynamic; with respect to the world it is static
(provided that the pallet is not moved during the experiment).

4. Parameter: a value that is constant (in time) in the physical model, although it can be unknown and thus may have
to be estimated.
Example 2.3 When using an ultrasonic sensor with an additive Gaussian sensor characteristic but an unknown (constant)
variance σ^2, this variance is considered a parameter of the model. However, when a certain sensor has
a behaviour that is dependent on the temperature, we consider the temperature to be a state of the system. So the
distinction parameter/state can depend on the chosen model. When localising a transport pallet with a mobile robot,
the diameter of the wheel+tyre will in most models be a parameter, but for some applications it will be necessary to
model the diameter as a state (suppose the robot odometry has to be known very accurately in an environment with
strongly varying temperature).


5. Inputs/measurements:

6. PDF/Information/Accuracy/Precision

Remark 2.1 Difference between a static state and a parameter.

For physical systems, the distinction is rather easy to make. E.g. when localising a transport pallet with a fixed position (in
a world frame) and unknown dimensions (length and width), the location variables are states of the system, while the length
and the width would be parameters.
For systems of which the state has no physical meaning, the distinction can be hard to make (this does not (have to) mean
that the state/parameters are hard to estimate). One could say that a static state is constant during the experiment (but can
change), whilst a parameter is always constant (in a given model).
It is not very important to make a strict distinction between a static state and a parameter, as for the estimation problem both
are treated equally.

Remark 2.2 A "physically moving" system does not necessarily imply that the estimation problem has a dynamic state!
When identifying the masses and lengths of the robot links, the whole robot can be moving around, but the parameters to
estimate (masses, lengths) are constant.

2.2 Problem description

System model A lot of engineering problems require the estimation of the system state in order to be able to control the
system (= process). The state vector is called static when it does not change in time, or dynamic when it changes according
to the system model as a function of the previous value of the state itself and an input. The input, measured by proprioceptive
("internal") sensors, describes how the state changes; it does not give an absolute measure for the actual state value. The
system model is subject to uncertainty (often denoted as noise); the noise characteristics (the probability density function,
or some of its characteristics, e.g. its mean and covariance) are supposed to be known.

Example 2.4 When a mobile robot wants to move around autonomously, it needs to know its location (state). This state is
dynamic, since the robot location changes whenever the robot moves. The inputs to the system can be e.g. the currents sent to
the different motors of the mobile robot, or the velocities of the wheels measured by encoders, . . . The system model describes
how the robot's location changes with these inputs. However, "unmodeled" effects such as slipping wheels, flexible tires,
etc. occur. These effects should be reflected in the system model uncertainty.

Measurement model The uncertainty in the system model makes the state estimate more and more uncertain in time. To
cope with this, the system needs some exteroceptive ("external") sensors whose measurements yield information
about the absolute value of the state.
When these sensors do not directly and accurately observe the state, i.e. when there is no one-to-one relationship between
states and observations, a filter or estimator is used to calculate the state estimate. This process is called state estimation
("localization" in mobile robotics). The filter contains information about the system (through the system model) and about
the sensors (through the measurement model, which expresses the relation between state, sensor parameters (see example
below) and measurements). The measurement model is also subject to uncertainty, e.g. due to sensor noise, of
which the characteristics (probability density function, or some of its characteristics) are supposed to be known.

Example 2.5 If a mobile robot is not equipped with an “accurate enough” (“enough” means here enough for a particular
goal we want to achieve) GPS system, the state variables (denoting the robot’s location) are not “directly” observable from
the system. This is for example the case when it has only infrared sensors which measure the distances to the environment’s
objects. When the robot is equipped with a laser scanner and each scan point is considered to be a measurement, the current
angle of the laser scanner is a sensor parameter and the measurement is a scalar (distance to the nearest object in a certain
direction). We can also consider the measurements at all angles of the laser scanner at once. In this case, our measurement
is a vector and our model uses no sensor parameters.

Parameters

Remark 2.3 The above description uses the restriction that the system and measurement models and their noise characteris-
tics are perfectly known. Chapter 5 extends the problem to system and measurement models with uncertainty characteristics
described by parameters that are inaccurately known, but constant.

Symbol Name
x state vector, hidden state/values
z measurement vector, observations, sensor data, sensor measurement
u input vector
s sensor parameters
f system model, process model, dynamics (functional notation)
g measurement model, observation model, sensing model
θf parameters of the system model and its uncertainty characteristics
θg parameters of the measurement model and its uncertainty characteristics

Table 2.1: Symbol names

Notations Table 2.1 lists the symbols used in the rest of this text and some synonyms often found in literature. x(k),
z(k), u(k) and s(k) denote these variables at a certain discrete time instant t = k; x_k, z_k, u_k, s_k, f_k and g_k describe
specific values for these variables. We also define:

X(k) = [x(0) . . . x(k)];   Z(k) = [z(1) . . . z(k)];
U(k) = [u(0) . . . u(k)];   S(k) = [s(1) . . . s(k)];
X_k = [x_0 . . . x_k];      Z_k = [z_1 . . . z_k];
U_k = [u_0 . . . u_k];      S_k = [s_1 . . . s_k];
F_k = [f_0 . . . f_k];      G_k = [g_1 . . . g_k].

Remark 2.4 Note that the variables x(k), z(k), u(k), s(k) for different time steps k still indicate the same variables, e.g.
x(k - 1) and x(k) denote in fact "the same variable": they correspond to the same state space. The notation x(k), where
the time is indicated at the variable itself, is introduced in order to have "readable" equations. Indeed, if we denote the time
step as a subscript to the pdf function P(.), formulas become very unwieldy because most of the pdf functions used are
functions of many variables (x, z, u, s, θ_f, . . . ), most of which, though not all, are specified at certain (and even different) time
steps.

2.3 Bayesian approach

For a given system and measurement model, inputs, sensor parameters and sensor measurements, our goal is to estimate
the state x(k). Due to the uncertainty in both the system and measurement models, a Bayesian approach (i.e. modeling the
uncertainty explicitly by a probability density function) is appropriate to solve this problem. A Probability Density Function
(PDF) of the variable x(k) is denoted as P(x(k)). x(k) is often called the random variable, although most of the time it
is not random at all.
The probability that the random variable equals a specific value x_k is (i) for a discrete state space P(x(k) = x_k); and (ii) for
a continuous state space

P( x_k ≤ x(k) ≤ x_k + dx_k ) = P( x(k) = x_k ) dx_k.

Further in this text, both discrete and continuous variables are denoted as P(x_k)!
Probabilistic filters (Bayesian filters) calculate the pdf over the variable x(k) given (denoted in the formulas by "|") the
previous measurements Z(k) = Z_k, inputs U(k - 1) = U_{k-1}, sensor parameters S(k) = S_k, the model parameters θ_f
and θ_g, the system and measurement models F_{k-1} and G_k, and the prior pdf P(x(0)):

Post(x(k)) ≜ P( x(k) | Z_k, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(x(0)) )          (2.1)

This conditional PDF is often called the a posteriori pdf and denoted by Post(x(k)).
Calculating Post(x(k)) is called diagnostic reasoning: given the consequences (the data), find the internal (not directly measured)
variables (the state) that can explain them. This is much harder than causal reasoning: given the internal variables (state),
predict the consequences (the data). Think of a disease (state) and its symptoms (data): finding the disease, given the symptoms
(diagnostic reasoning), is much harder than predicting the symptoms of a certain disease (causal reasoning).
Bayes' rule relates the diagnostic problem (calculating Post(x(k))) to two causal problems:

Post(x(k)) = α P( z_k | x_k, Z_{k-1}, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(x(0)) )
             · P( x_k | Z_{k-1}, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(x(0)) )          (2.2)

where

α = 1 / P( z_k | Z_{k-1}, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(x(0)) )

is a normalizer (i.e. independent of the state random variable). The terms in Bayes' rule are often described as

posterior = ( likelihood · prior ) / evidence.

Eq. (2.2) is valid for all possible values of x(k), which we write as:

Post(x(k)) = α P( z_k | x(k), Z_{k-1}, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(x(0)) )
             · P( x(k) | Z_{k-1}, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(x(0)) ).          (2.3)

The last factor of this expression is the pdf over x at time k, just before the measurement is taken, and is further on denoted
as Prior(x(k)):

Prior(x(k)) ≜ P( x(k) | Z_{k-1}, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(x(0)) ).


Remark 2.5 Expression (2.1) is also known as the filtering distribution. Another formulation of the problem estimates the
joint distribution Post(X(k)):

Post(X(k)) = P( X(k) | Z_k, U_{k-1}, S_k, θ_f, θ_g, F_{k-1}, G_k, P(X(0)) )          (2.4)
Remark 2.6 As previously noted, the model parameters θ_f and θ_g in formulas (2.1)–(2.4) are supposed to be known.
This limits the problem to a pure state estimation problem (namely estimating x(k) or X(k)). In some cases the model
parameters are not accurately known and also need to be estimated ("parameter learning"). This leads to a concurrent
state-estimation-and-parameter-learning problem and is discussed in Chapter 5.

2.4 Markov assumption and Markov Models


Most filtering algorithms are formulated in a recursive way, in order to assure a fixed, known computation time per time step.
A recursive formulation of problem (2.3) is possible for a specific class of system models: the Markov Models.
The Markov assumption states that x(k) depends only on x(k - 1) (and of course u_{k-1}, θ_f and f_{k-1}) and that z(k)
depends only on x(k) (and of course s_k, θ_g and g_k). This means that Post(x(k - 1)) incorporates all information about the
previous data—being the measurements Z_{k-1}, inputs U_{k-2}, sensor parameters S_{k-1}, models F_{k-2} and G_{k-1} and the
prior P(x(0))—needed to calculate Post(x(k)). Hence, for Markov Models, (2.1) reduces to:

Post(x(k)) = P( x(k) | z_k, u_{k-1}, s_k, θ_f, θ_g, f_{k-1}, g_k, Post(x(k - 1)) )          (2.5)

and (2.3) to:

Post(x(k)) = α P( z_k | x(k), u_{k-1}, s_k, θ_f, θ_g, f_{k-1}, g_k, Post(x(k - 1)) )
             · P( x(k) | u_{k-1}, s_k, θ_f, θ_g, f_{k-1}, g_k, Post(x(k - 1)) )
           = α P( z_k | x(k), s_k, θ_g, g_k ) · P( x(k) | u_{k-1}, θ_f, f_{k-1}, Post(x(k - 1)) )

Markov filters typically solve this equation in two steps:

1. the process update (system update, prediction update)

   Prior(x(k)) = P( x(k) | u_{k-1}, θ_f, f_{k-1}, Post(x(k - 1)) )
               = ∫ P( x(k) | u_{k-1}, θ_f, f_{k-1}, x(k - 1) ) Post(x(k - 1)) dx(k - 1)          (2.6)

2. the measurement update (correction update)

   Post(x(k)) = α P( z_k | x(k), s_k, θ_g, g_k ) Prior(x(k)).          (2.7)

Apart from the Markov assumption, Eqs. (2.6) and (2.7) make no further assumptions, neither on the nature of the hidden variables
to be estimated (discrete, continuous), nor on the nature of the system and measurement models (graphs, equations, . . . ).
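As a minimal sketch of Eqs. (2.6)–(2.7), the code below runs one process update and one measurement update for a one-dimensional state discretised on a grid; the process model, the noise levels and the measurement value are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

grid = np.linspace(0.0, 5.0, 201)        # discretised one-dimensional state space
post = gaussian(grid, 2.0, 0.5)          # Post(x(k-1)), up to normalisation
post /= post.sum()

# Process update, Eq. (2.6): assumed model x(k) = x(k-1) + 0.3 with process noise sigma = 0.2.
trans = gaussian(grid[:, None], grid[None, :] + 0.3, 0.2)   # P(x(k) | x(k-1)) on the grid
trans /= trans.sum(axis=0, keepdims=True)
prior = trans @ post                                        # the integral becomes a sum

# Measurement update, Eq. (2.7): assumed measurement z_k = 2.6 with sensor noise sigma = 0.4.
likelihood = gaussian(grid, 2.6, 0.4)                       # P(z_k | x(k))
post_k = likelihood * prior
post_k /= post_k.sum()                                      # alpha, the normaliser
print("most probable state:", grid[np.argmax(post_k)])
```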
Remark 2.7 We talk about Markov Models and not Markov Systems: a system can be modeled in different ways, and it is
possible that for the same system both Markovian and non-Markovian models can be written. E.g. think of the following one-
dimensional system: a body is moving in one direction with a constant acceleration (an apple falling from a tree under gravity).
We are interested in the position x(k) of the body at all times k. When the state is chosen to be the object's position,
x = [x], the model is not Markovian, as the state at the last time step is not enough to predict the state evolution; at least
the states from two different time steps are necessary for this prediction. When the state is chosen to be the object's position
x and velocity v, x = [x v]^T, the state evolution can be predicted from only one state estimate.
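A small sketch of this remark (the gravitational acceleration and the time step are assumed values): with the state [position, velocity] one past state is enough to predict the next one, whereas with the position alone two successive positions are needed.

```python
# Falling body with constant acceleration g (Remark 2.7); g and dt are assumed values.
g, dt = 9.81, 0.1

# Markovian choice: x = [position, velocity]; one past state predicts the next state.
def step(pos, vel):
    return pos + vel * dt, vel + g * dt

# Non-Markovian choice: x = [position]; the last position alone is not enough, because
# the (changing) velocity must be recovered from two successive positions.
def step_positions_only(pos_prev, pos):
    vel = (pos - pos_prev) / dt          # finite-difference estimate of the velocity
    return pos + vel * dt + g * dt ** 2  # prediction of the next position from the last two

p, v = 0.0, 0.0
p1, v1 = step(p, v)
p2, _ = step(p1, v1)
print(p2, step_positions_only(p, p1))    # both predictions agree for this model
```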
Remark 2.8 Are there systems which cannot be modeled with Markov models?

Remark 2.9 Note that some pdfs are conditioned on some value of x(k), while others are conditioned on Post(x(k)).
In the literature both are denoted as "x(k)" behind the conditional sign "|"; in this text, however, we do not use this double
notation, in order to stress the difference between conditioning on a value of x(k) and conditioning on the pdf of x(k).
E.g. Prior(x(k)) = P( x(k) | u_{k-1}, θ_f, f_{k-1}, Post(x(k - 1)) ) indicates the pdf over x(k), given the known values
u_{k-1}, θ_f, f_{k-1} and the pdf Post(x(k - 1)). Hence, this formula expresses how the pdf over x(k - 1) propagates to the
pdf over x(k) through the process model.
E.g. the likelihood P( z_k | x(k), s_k, θ_g, g_k ) indicates the probability of a measurement z_k, given the known values s_k, θ_g,
g_k and the currently considered value of the state x(k). Hence, this formula expresses the sensor characteristic: what is
the pdf over z(k), given a state estimate and the measurement model. This sensor characteristic does not depend on which
values of x(k) are more or less probable (it does not depend on the pdf over x(k)).
Remark 2.10 Proof of Eq. (2.6). To keep the derivation somewhat clearer, u_{k-1}, θ_f and f_{k-1} are replaced by the
single symbol H_{k-1}. Eq. (2.6) is

P( x(k) | Post(x(k - 1)), H_{k-1} ) = ∫ P( x(k) | x(k - 1), H_{k-1} ) Post(x(k - 1)) dx(k - 1)          (2.8)

We prove this as follows:

P( x(k) | Post(x(k - 1)), H_{k-1} )
  = ∫ P( x(k), x(k - 1) | Post(x(k - 1)), H_{k-1} ) dx(k - 1)
  = ∫ P( x(k) | x(k - 1), Post(x(k - 1)), H_{k-1} ) P( x(k - 1) | Post(x(k - 1)), H_{k-1} ) dx(k - 1)
  = ∫ P( x(k) | x(k - 1), H_{k-1} ) Post(x(k - 1)) dx(k - 1)

The last simplifications can be made because

1. the pdf over x(k - 1), given the posterior pdf over x(k - 1) and H_{k-1}, is that posterior pdf itself, i.e.
   P( x(k - 1) | Post(x(k - 1)), H_{k-1} ) = Post(x(k - 1));
2. the new state is independent of the pdf over the previous state if the value of the previous state is given, i.e.
   P( x(k) | x(k - 1), Post(x(k - 1)), H_{k-1} ) = P( x(k) | x(k - 1), H_{k-1} ).
E.g. given
• the probabilities that today it rains (0.3) or that it doesn't rain (0.7), i.e. Post(x(k - 1));
• the transition probabilities that the weather is the same as the day before (0.9) or not (0.1);
• the knowledge that it does rain today (the value of x(k - 1));
what are the chances that it will rain tomorrow, P( x(k) | x(k - 1), Post(x(k - 1)), H_{k-1} )? The pdf of rain
tomorrow (0.9) only depends on the fact that it rains today, x(k - 1), and on the transition probability, and not on
Post(x(k - 1))!
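In code, the difference between conditioning on the value of x(k - 1) and propagating the pdf over x(k - 1) via Eq. (2.6) looks as follows, using the numbers of the rain example above:

```python
# Rain example: conditioning on a value of x(k-1) versus propagating its pdf.
post_today = {"rain": 0.3, "dry": 0.7}               # Post(x(k-1)): pdf over today's weather
transition = {"rain": {"rain": 0.9, "dry": 0.1},     # P(x(k) | x(k-1)): same weather with prob. 0.9
              "dry":  {"rain": 0.1, "dry": 0.9}}

# Given that it *does* rain today, the pdf over tomorrow ignores Post(x(k-1)):
print(transition["rain"]["rain"])                    # 0.9

# Without that knowledge, Eq. (2.6) propagates the pdf over x(k-1):
p_rain_tomorrow = sum(p * transition[s]["rain"] for s, p in post_today.items())
print(p_rain_tomorrow)                               # 0.9*0.3 + 0.1*0.7 = 0.34
```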

Concluding Figure 2.2 summarises the state estimation problem under these different assumptions.



Figure 2.2: The state estimation problem under different assumptions: to estimate x(k) from the system and measurement models, the Bayesian approach calculates Post(x(k)) with Bayes' rule, Eq. (2.3); adding the Markov assumptions, Post(x(k)) is calculated recursively via Eqs. (2.6)–(2.7).


Chapter 3

System modeling
Modeling the system corresponds to (i) choosing a state, e.g. for a map-building problem it can be the status (occupied/free)
of grid points, positions of features, . . . ; (ii) choosing the measurements (choosing the sensors); and (iii) writing down the
system and measurement models. This chapter describes how (Markovian) system and measurement models can be written
down: a system with a continuous state space is modeled by equations (Section 3.1) or by a network (Section 3.2); a system
with a discrete state space is modeled by a Finite State Machine (FSM) (Section 3.3).

3.1 Continuous state variables, equation modeling


Modelling by equations:

x_k = f_{k-1}( x_{k-1} [, u_{k-1}, θ_f ], w_{k-1} )          (3.1)

z_k = g_k( x_k [, s_k, θ_g ], v_k )          (3.2)

where

• both f() and g() can be (and most often are!) non-linear functions;

• [ ] denotes an optional argument;

• w_{k-1} and v_k are noises (uncertainties) for which the stochastic distribution (or at least some of its characteristics)
is supposed to be known. v and w are mutually uncorrelated and uncorrelated between sampling times (this is a
necessary condition for the model to be Markovian).
Examples of models with correlated uncertainties:

– correlation between process and measurement uncertainty: when a measurement changes the state, e.g. when
measuring the speed of electrons (or other elementary particles) by means of photons, an impulse is exchanged at the
collision and the velocity of the electron will be different after this measurement (thanks to Wouter for the
example);
– correlation of the process uncertainty over time: deviations from the model (process noise) which depend on the
current state or on unmodeled effects such as humidity;
– correlation of the measurement uncertainty over time: a not explicitly modeled temperature drift of the sensor.

Note that the uk−1 and sk are assumed to be exact (not stochastic variables). If e.g. the proprioceptive sensors (which
measure uk−1 ) are inaccurate, this uncertainty is modeled by wk−1 .
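To show what such an equation model looks like in code, here is a sketch that simulates Eqs. (3.1)–(3.2) for an invented scalar system; the particular functions f and g, the constant input and the noise levels are assumptions, chosen only to illustrate the structure.

```python
import math
import random

random.seed(1)

def f(x_prev, u_prev, w_prev):
    # non-linear process model, Eq. (3.1), for a scalar state
    return 0.9 * x_prev + math.sin(u_prev) + w_prev

def g(x, v):
    # non-linear measurement model, Eq. (3.2)
    return x ** 2 / 10.0 + v

# Simulate the system; w and v are drawn independently at every step,
# i.e. mutually uncorrelated and uncorrelated over time, as required above.
x, states, measurements = 1.0, [], []
for k in range(20):
    w = random.gauss(0.0, 0.1)    # process noise w_{k-1}
    v = random.gauss(0.0, 0.05)   # measurement noise v_k
    x = f(x, 0.5, w)              # input u_{k-1} assumed constant at 0.5
    states.append(x)
    measurements.append(g(x, v))
print(states[-1], measurements[-1])
```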

3.2 Continuous state variables, network modeling


Neural networks and Bayesian neural networks (section still to be written).


3.3 Discrete state variables, Finite State Machine modeling


3.3.1 Markov Chains/Models

Figure 3.1: Finite State Machine or Markov Chain: Graph model

Markov chains (sometimes called first-order Markov chains) are models of a category of systems that is most often denoted
as Finite State Machines or automata. These are systems that have a finite number of states. At any time instant, the system
is in a certain state, and can go from one state to another one, depending on a random process, a discrete PDF, an input to
the system or a combination of these. Figure 3.1 shows a graph representation of a system that changes from state to state
depending on a discrete PDF only, i.e.

P( x(k) = State 3 | x(k - 1) = State 2 ) = a_23

The name first-order Markov chains, which is sometimes used in the literature, stems from the fact that the probability of being
in a certain state x_k at step k depends only on the previous time instant. This is what we called Markov Models in the
previous section. Some authors consider Markov Models in a broader sense, and use the term "first-order Markov chains"
to denote what we mean in this text by Markov chains.
In the literature, the transition matrix (a discrete version of the system equation!) is often represented by A.
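A small sketch of such a Markov chain with three states; the transition probabilities a_ij below are invented, and the matrix A is used to propagate the state distribution over time:

```python
import numpy as np

# A[i, j] = P(x(k) = state j | x(k-1) = state i); every row sums to one.
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

p = np.array([1.0, 0.0, 0.0])   # the chain starts in state 1 with certainty
for k in range(10):
    p = p @ A                   # propagate the state distribution one step
print(p)                        # approaches the stationary distribution of the chain
```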

3.3.2 Hidden Markov Models (HMMs)


Model First of all, the name Hidden Markov Model (HMM) is chosen rather badly. All dynamical systems being modeled
have hidden state variables, so a Hidden Markov Model should be a model of a dynamical system that does not make
any assumptions except the Markov assumption. However, in the literature, HMMs refer to models with the following extra
assumptions:
• The state space is discrete, i.e. there is a finite number of possible hidden states x (e.g. a mobile robot moving in a
topological map: at the kitchen door, in the bedroom, . . . ).
• The measurement (observation) space is discrete.
The difference between a Hidden Markov Model and a "normal" Markov Chain is that the states of a normal Markov
Chain are observable (and hence there is no estimation problem!). In other words, for Markov Models there is a
unique relationship between the state and the observation or measurement (no uncertainty), whilst for Hidden Markov
Models the uncertainty between a certain measurement and the state it stems from is modeled by a probability density (see
figure 3.2).
Because of the discrete state and measurement spaces, each HMM can be represented as λ = (A, B, π), where e.g. B_ij =
P( z(k) = z_j | x(k) = x_i ). The matrix A represents f(), B represents g(), and π determines in which state the HMM
starts. The filter algorithms for HMMs are described in Section 4.2.
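A minimal sketch of one recursive filtering step for such an HMM λ = (A, B, π); the matrices below are invented, and the full algorithms (forward-backward, Viterbi) are given in Section 4.2 and Appendix A:

```python
import numpy as np

A = np.array([[0.9, 0.1],         # A[i, j] = P(x(k) = x_j | x(k-1) = x_i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],    # B[i, j] = P(z(k) = z_j | x(k) = x_i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])         # initial state distribution

def hmm_filter_step(post_prev, z_index):
    prior = A.T @ post_prev       # process update over the discrete states
    post = B[:, z_index] * prior  # measurement update with the observed symbol
    return post / post.sum()      # normalise

post = pi
for z in [0, 0, 2, 1]:            # an assumed observation sequence (indices into B's columns)
    post = hmm_filter_step(post, z)
print(post)
```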

Literature
• "First paper": [94]
• Good introduction: [42], [61]. Here measurements are defined as inherently linked to the transition between
two states, whereas the normal approach considers them linked to a certain state. But the two approaches are entirely
equivalent (this can be seen by redefining the state space, see e.g. section 2.9.2 on p. 35 of [61]). See also
http://www.univ-st-etienne.fr/eurise/pdupont/bib/hmm.html.

Figure 3.2: Difference between a Markov Model and a Hidden Markov Model

Software

• See the Speech Recognition HOWTO2

Extensions Standard HMMs are not very powerful models and appropriate for very particular cases only, so some exten-
sions have been made to be able to use them for more complex and thus realistic situations:

• Variable Duration HMMs


Standard HMMs consider the probability of staying in a particular state to be an exponentially decaying function of time,

P( x(k) = x_i | x(k - l) = x_i ) ∼ e^(-l)

As this is very unrealistic for most systems, Variable Duration HMMs [70, 71] solve this problem by introducing an
extra, parametric pdf P(D_j = d) (i.e. a pdf predicting how long one typically stays in state j) to model the duration
in a certain state. These are very appropriate for speech recognition.

• Monte Carlo HMMs

Monte Carlo HMMs [115, 116], also referred to as Generalized HMMs (GHMMs), extend the standard HMMs
towards continuous state and measurement spaces. Whereas e.g. in a normal HMM transitions between states are modeled
by a matrix A, an MCHMM uses a non-parametric pdf to model state transitions (like a(x_k | x_{k-1}, u_{k-1}, f_{k-1})).
Because they make no assumptions about any of the parameters involved, nor about the nature of the
pdfs, in my opinion GHMM filters can be used to describe strongly non-linear problems such as the localization of
transport pallets with a laser scanner (memory/time requirements??), if defined as a dynamical system.

2 http://www.kulnet.kuleuven.ac.be/LDP/HOWTO/Speech-Recognition-HOWTO/index.html
Part II

Algorithms

Chapter 4

State estimation algorithms

Literature describes different filters that calculate Bel(x(k)) or Bel(X(k)) for specific system and measurement models.
Some of these algorithms calculate the full Belief function, others only some of its characteristics (mean, covariance, . . . ).
This chapter gives an overview of the basic recursive (i.e., Markov) filters, without claiming to give a complete enumeration
of the existing filters.
To be able to determine which filter is applicable to a certain problem, one should verify certain things:

1. Is X a continuous or a discrete variable? (Eqs/graph)

2. Do we represent the pdfs involved as parametric distributions, or do we use sampling techniques to represent non-parametric distributions?

3. Are we solving a position tracking problem or a global localisation problem (unimodal or multimodal distributions)
...

This section uses the previously defined symbols (x_k, z_k, . . . ). The detailed algorithms in the appendices, however, are described with the symbols most commonly used in the literature for each specific filter.

4.1 Grid based and Monte Carlo Markov Chains


Model The only assumption Markov Chains make is the Markov assumption. Thus, they make no assumptions on the nature of x, nor on the nature of the pdfs that are used.

Filter Markov Chains for discrete state variables directly solve Equations (2.6)–(2.7) for all possible values of the state. For continuous state variables they use numerical techniques, such as Monte Carlo methods (often abbreviated as MC, see chapter 8), in order to “discretize” the state space1 . Another applied discretization technique is the use of a grid over the entire state space. The corresponding filters are called MC Markov Chains and Grid-based Markov Chains. The Grid-based filters sample the state space in a uniform way, whereas the MC filters apply a different kind of sampling, most often referred to as importance sampling (see chapter 8; this is where the name “particle filters” comes from). Monte Carlo (particle) filters are also often referred to as the Condensation algorithm (mainly in vision applications), Survival of the fittest, or bootstrap filters. The most general and maybe most clear term appears to be sequential Monte Carlo methods.

Particle Filters

• The basics: The SIS filter [39, 38]
• To avoid the degeneracy of the sample weights: The SIR filter [100, 38, 52]

• Smoothing the particles’ posterior distribution by a Markov chain Monte Carlo (MCMC) move step [38]

• Taking better proposal distributions than the system transition pdf [38]: prior editing (not good), rejection methods, the auxiliary particle filter [91], the Extended Kalman particle filter (which uses the EKF as proposal density), the Unscented Kalman particle filter
1 Note that for continuous pdfs which can be parameterized, this discretization is not necessary; filters for these systems are described in section 4.4.


• any-time implementations

The detailed algorithms are described in appendix D.
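As a rough sketch of the SIS/SIR idea (not the detailed algorithms of appendix D), the following Python fragment tracks a scalar state with a bootstrap (SIR) particle filter. The system function f, the measurement function g, the noise levels and the measurement values are all hypothetical.

import numpy as np

rng = np.random.default_rng(1)
N = 500                               # number of particles

def f(x, u):   return x + u           # hypothetical system function
def g(x):      return x**2 / 20.0     # hypothetical measurement function

def sir_step(particles, weights, u, z, sigma_w=0.5, sigma_v=1.0):
    # 1. Prediction: propagate each particle through the system model (proposal = transition pdf).
    particles = f(particles, u) + rng.normal(0.0, sigma_w, size=particles.shape)
    # 2. Correction: weight each particle by the measurement likelihood p(z | x).
    weights = weights * np.exp(-0.5 * ((z - g(particles)) / sigma_v) ** 2)
    weights /= np.sum(weights)
    # 3. Resampling (the "R" in SIR): counteracts degeneracy of the sample weights.
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N)

particles = rng.normal(0.0, 2.0, size=N)   # samples from the prior Bel(x(0))
weights = np.full(N, 1.0 / N)
for k in range(20):
    particles, weights = sir_step(particles, weights, u=1.0, z=(k + 1) ** 2 / 20.0)
print("posterior mean estimate:", np.average(particles, weights=weights))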

Literature

• first general paper?


• Good tutorials: [52] (Markov Localisation), [50] (= Monte Carlo version of [52]), [6]

4.2 Hidden Markov Model filters


In the literature, people do not write about “HMM filters”: they only speak about the different algorithms for HMMs. We chose this name to stress the similarities between the different techniques.

Model Finite state machines, see section 3.3.

Filter HMM filter algorithms typically calculate all state variables instead of just the last one: they solve Eq. (2.4) instead of Eq. (2.1). However, they do not estimate the whole probability distribution Bel(X(k)); they just give the sequence of states X^k = {x_0, . . . , x_k} for which the joint a posteriori distribution Bel(X(k)) is maximal. The filter algorithm is often called the Viterbi algorithm (based on the Forward-Backward algorithm). The version of both these algorithms for VDHMMs is fully described in appendix A. The algorithms for MCHMMs should be easy to derive from these algorithms. A small sketch of the Viterbi recursion for a standard HMM is given below.
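The sketch below (Python, hypothetical two-state, two-symbol model) returns the single state sequence maximizing the joint posterior, not the full Bel(X(k)); it is only meant to make the recursion concrete.

import numpy as np

def viterbi(A, B, pi, observations):
    """Most likely hidden state sequence for a discrete HMM lambda = (A, B, pi)."""
    n_states, T = A.shape[0], len(observations)
    log_delta = np.log(pi) + np.log(B[:, observations[0]])   # best log-prob ending in each state
    backptr = np.zeros((T, n_states), dtype=int)
    for k in range(1, T):
        trans = log_delta[:, None] + np.log(A)                # trans[i, j]: come from i, go to j
        backptr[k] = np.argmax(trans, axis=0)
        log_delta = np.max(trans, axis=0) + np.log(B[:, observations[k]])
    # Backtrack from the best final state.
    path = [int(np.argmax(log_delta))]
    for k in range(T - 1, 0, -1):
        path.append(backptr[k, path[-1]])
    return path[::-1]

# Hypothetical two-state, two-symbol model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 0, 1, 1, 0]))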

Literature and software See 3.3.2.


TODO

• Verify whether MCHMM filters sample the whole distribution or whether they also just provide a state sequence that maximizes eq. 2.4.
• Connection with MC Markov Chains! Is there a difference? I think the only difference is that MCHMMs search a solution to the more general problem (eq. 2.4), whereas MC Markov Chains just estimate the last hidden state x_k (eq. 2.1).
• Add HMM bookmarks?

4.3 Kalman filters


Model Kalman filters are filters for equation models with continuous state variable X and with functions f () and g() that
are linear in the state and uncertainties; i.e. eqs. (3.1)-(3.2) are:

x_k = F_{k−1} x_{k−1} + f'_{k−1}(u_{k−1}, θ_f) + F''_{k−1} w_{k−1}

z_k = G_k x_k + g'_k(s_k, θ_g) + G''_k v_k

F_{k−1}, F''_{k−1}, G_k and G''_k are matrices.

Filter KFs estimate two characteristics of the pdf Bel(x(k)), namely the minimum-mean-squared-error (MMSE) estimate and the covariance. Hence, their use is mainly restricted to unimodal distributions. A big advantage of KFs over the other filters is that KFs are computationally less expensive. The KF algorithm is described in appendix B; a small sketch of one prediction/correction cycle is given below.
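The Python sketch below runs one KF cycle for a linear model of the form above. The matrices, noise covariances and measurements are hypothetical, and the input terms f'(·) and g'(·) are taken as known offsets.

import numpy as np

def kf_step(x, P, F, Q, G, R, z, u_offset=0.0, z_offset=0.0):
    """One Kalman filter cycle: returns the MMSE estimate and covariance of Bel(x(k))."""
    # Prediction with the (linear) system model x_k = F x_{k-1} + f'(u, theta_f) + w.
    x_pred = F @ x + u_offset
    P_pred = F @ P @ F.T + Q
    # Correction with the (linear) measurement model z_k = G x_k + g'(s, theta_g) + v.
    innovation = z - (G @ x_pred + z_offset)
    S = G @ P_pred @ G.T + R
    K = P_pred @ G.T @ np.linalg.inv(S)            # Kalman gain
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x)) - K @ G) @ P_pred
    return x_new, P_new

# Hypothetical 1D constant-velocity example: state = [position, velocity].
F = np.array([[1.0, 1.0], [0.0, 1.0]])
G = np.array([[1.0, 0.0]])                         # only the position is measured
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.2]:
    x, P = kf_step(x, P, F, Q, G, R, np.array([z]))
print(x, P)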

Literature

• first general paper [63]


• Good tutorial: [8]

Extensions KFs are often applied to systems with non-linear system and/or measurement functions:

• Unimodal: the (Iterated) Extended KF [8] linearizes the nonlinear system and measurement equations; the Unscented KF [102] approximates them with the unscented transform.

• Multimodal: Gaussian sum filters [5] (often called multi hypothesis tracking in mobile robotics): for every mode (every Gaussian) an EKF is run.

Remark 4.1 Note that the KF does not assume Gaussian pdfs; but for Gaussian pdfs the two characteristics estimated by the KF fully describe Bel(x(k)).

4.4 Exact Nonlinear Filters


Model For some equation models with continuous state variables, pdf (2.1) can be represented by a fixed finite-dimensional sufficient statistic (the Kalman Filter is a special case for Gaussian pdfs). [33] describes the systems for which the exponential family of probability distributions is a sufficient statistic, see appendix C.

Filter The filter calculates the full (exponential) Bel(x(k)), the algorithm is given in appendix C.

Literature [33]

Extension: approximations to other systems [33].

4.5 Rao-Blackwellised filtering algorithms


In certain cases where, conditioned on some of the variables of the joint a posteriori distribution, the remaining ones can be estimated analytically, a mixed analytical/sample-based algorithm can be used, combining the advantages of both worlds [82]. The FastSLAM algorithm [81, 79, 80] is a nice example of this.

4.6 Concluding
Filter                    X   P(X)          Varia
Grid-based Markov Chain   C   any           Computationally expensive
MC Markov Chain           C   any           Subdivide (rejection, metropolis, . . . )
HMM                       D   any           x = max P(X), eq. (2.4)
VDHMM                     D   any           x = max P(X), eq. (2.4)
MCHMM                     C   any           ?????
KF                        C   unimodal      f() and g() linear
EKF, UKF                  C   unimodal      f() and g() not too nonlinear
Gaussian sum              C   multimodal    f() and g() not too nonlinear
Daum                      C   exponential   rare cases (appendix C)
Chapter 5

Parameter learning

All Bayesian approaches use explicit system and measurement models of their environment. In some cases, the construction
of good enough models to approximate the system state in a satisfying manner is impossible. Speech is an ideal example:
every person has a different way of pronouncing different letters (such as in “Bruhhe”). The system and measurement
models and the characteristics of their uncertainties are written as functions of inaccurately known parameters, collected in the vectors θ_f and θ_g respectively. In a Bayesian context, estimation of those parameters would typically be done by maintaining a pdf over the space of all possible parameter values. The inaccurately known parameters θ_f and θ_g have to be estimated online, next to the estimation of the state variables. This is often called parameter learning (mapping in mobile robotics). The initial state estimation problem of Chapters 2–4 is augmented to a concurrent-state-estimation-and-parameter-learning problem (“simultaneous localization and mapping (SLAM)” or “concurrent mapping and localization (CML)” in mobile robotics terminology). To simplify the notation of the following equations, θ_f and θ_g are collected into one parameter vector θ = [θ_f ; θ_g]. Remark that any estimate for this vector is valid for all time steps (parameters are constant in time . . . ).
If the parameter vector θ comes from a limited discrete distribution, the problem can be solved by multiple model filtering (Section 5.3). However, if the parameter vector θ does not come from a limited discrete distribution, (IMHO) the only ‘right’ way to handle the concurrent-state-estimation-and-parameter-learning problem is to augment the state vector with the inaccurately known parameters (Section 5.1). However, if a lot of parameters are inaccurately known, up till now the resulting state estimation problem has only been successfully solved with Kalman Filters (on problems that obey the corresponding assumptions). In other cases, the computationally less expensive Expectation-Maximization algorithm (EM, Section 5.2) is often used as an alternative. The EM algorithm subdivides the problem into two steps: one state estimation step and one parameter learning step. The algorithm is a method for searching a local maximum of the pdf P(z_k | θ) (consider this pdf as a function of θ).
Parameter learning is also sometimes called model building. IMHO, this can be used to construct models in which some parameters are not accurately known, or in situations where it is very difficult to construct an off-line, analytical model. I’ll try to clarify this with the example of the localization of a transport pallet with a mobile robot, equipped with a laser scanner.
It is very difficult (but not impossible) to create off-line a fully correct measurement distribution (i.e. taking sensor uncertainty/characteristics into account) for a state x = [x, y, θ]^T:

P(z_k | x(k) = [x_k y_k θ_k]^T, s_k, θ_g, g_k)
Figure 5.1 illustrates this. Experiments should point out whether off-line construction of this likelihood function is faster
than learning.

5.1 Augmenting the state space


In order to solve the concurrent-state-estimation-and-parameter-learning problem, the state vector can be augmented with the model parameters: x ←− [x ; θ]. These parameters are then estimated within the state estimation problem.

Filters Augmenting the state space is possible for all state estimators, as long as the new state, system and measurement
model still obey the estimator’s assumptions. In the specific case of a Kalman Filter, estimating state and parameters
simultaneously by augmenting the state vector is called “Joint Kalman Filtering”, [122].
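As a sketch of the idea (using a Kalman filter as in section 4.3 and purely hypothetical numbers), an unknown but constant parameter θ, here a drift-like term in a scalar system, is simply appended to the state; the joint filter then estimates state and parameter together.

import numpy as np

# Hypothetical scalar system x_k = x_{k-1} + theta + w, measurement z_k = x_k + v,
# with theta an unknown constant parameter.  Augmented state: [x, theta].
F_aug = np.array([[1.0, 1.0],        # x_k     = x_{k-1} + theta_{k-1}
                  [0.0, 1.0]])       # theta_k = theta_{k-1}  (parameters are constant)
G_aug = np.array([[1.0, 0.0]])       # only x is measured
Q = np.diag([0.01, 1e-6])            # (almost) no process noise on the parameter
R = np.array([[0.25]])

rng = np.random.default_rng(2)
true_theta, x_true = 0.5, 0.0
x_est, P = np.zeros(2), np.diag([1.0, 1.0])
for k in range(50):
    x_true += true_theta
    z = x_true + rng.normal(0.0, 0.5)
    # "Joint Kalman Filtering": prediction and correction on the augmented state.
    x_pred, P_pred = F_aug @ x_est, F_aug @ P @ F_aug.T + Q
    S = G_aug @ P_pred @ G_aug.T + R
    K = P_pred @ G_aug.T @ np.linalg.inv(S)
    x_est = x_pred + (K @ (np.array([z]) - G_aug @ x_pred))
    P = (np.eye(2) - K @ G_aug) @ P_pred
print("estimated state and parameter:", x_est)   # the theta estimate should approach 0.5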


Figure 5.1: Illustration of the complexity of the measurement model of a transport pallet. The figure shows two pallets in different positions. Imagine how to set up the pdf P(z_k | x(k) = [x_k y_k θ_k]^T, s_k). The pallet at the upper right doesn’t cause much trouble. The location of the pallet at the lower left, however, causes more trouble. First, for every possible location, one has to find the intersection of the laser beam (with orientation s_k) and the pallet. This is already quite complicated. But, most likely, there will also be uncertainty on s_k, such that some particular laser beams (such as the dash-dotted one in the figure) can actually reflect on either one leg (“poot”) of the pallet or on the other one further behind, and we would obtain a kind of multi-modal Gaussian with two peaks. So for some cases the measurement function becomes really complex.

5.2 EM algorithm

As described in the introduction, augmenting the state space with many parameters often leads to computational difficulties if a KF is not a good model for the (non-linear) system. The EM algorithm is an often used technique for these cases. However, it is not a Bayesian technique for parameter estimation and (thus :-) not an ideal solution for parameter estimation!
The EM algorithm consists of two steps:

1. the E-step (or state estimation step)

the pdf over all previous states X(k) is estimated based on the current best parameter estimate θ^{k−1}:

P(X(k) | Z^k, U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0)))

This problem is a state estimation problem as described in the previous chapter.

Remark 5.1 Note that this is a batch method with a non-constant evaluation time!! For every new map, we recalculate the whole state sequence, which makes it not very well suited for real-time applications.

With this pdf, the expected value of the logarithm of the complete-data likelihood function P(X(k), Z^k | U^{k−1}, S^k, θ, F^{k−1}, G^k, P(X(0))) is evaluated:

Q(θ, θ^{k−1}) = E[ log P(X(k), Z^k | U^{k−1}, S^k, θ, F^{k−1}, G^k, P(X(0))) | P(X(k) | Z^k, U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))) ]    (5.1)

Here E[ f(X(k)) | P(X(k) | Z^k, U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))) ] means that the expectation of the function f(X(k)) is sought when X(k) is a random variable distributed according to the a posteriori pdf P(X(k) | Z^k, U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))). E.g. for a continuous state variable this means:

Q(θ, θ^{k−1}) = ∫ log P(X(k), Z^k | U^{k−1}, S^k, θ, F^{k−1}, G^k, P(X(0))) P(X(k) | Z^k, U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))) dX(k).

NOTE: θ^{k−1} is not a parameter of this function, but its value does influence the function! The evaluation of this integral can be done with e.g. Monte Carlo methods. If we are using a particle filter (see appendix D), expression (5.1) reduces to

Q(θ, θ^{k−1}) = Σ_{i=1}^{N} log P(X^i(k), Z^k | U^{k−1}, S^k, θ, F^{k−1}, G^k, P(X(0)))

where X^i(k) denotes the i-th sample of the complete-data likelihood pdf (which we don’t know). Application of Bayes’ rule and the Markov assumption on the previous expression gives

Q(θ, θ^{k−1}) = Σ_{i=1}^{N} log [ P(Z^k | X^i(k), U^{k−1}, S^k, θ, F^{k−1}, G^k, P(X(0))) P(X^i(k) | U^{k−1}, S^k, θ, F^{k−1}, G^k, P(X(0))) ]
             = Σ_{i=1}^{N} log [ P(Z^k | X^i(k), S^k, θ_g, G^k) P(X^i(k) | U^{k−1}, θ_f, F^{k−1}, P(X(0))) ]

The first factor in the log product is the measurement model, with θ considered as a parameter and with specific values for the state and the measurement. The second factor is the result of a dead-reckoning exercise, with θ considered as a parameter. However, we don’t know this pdf as a function of θ :-(.

2. the M-step (or parameter learning step)

a new estimate θ^k is calculated for which the (incomplete-data) likelihood function increases:

p(Z^k | U^{k−1}, S^k, θ^k, F^{k−1}, G^k, P(X(0))) > p(Z^k | U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))).    (5.2)

This estimate θ^k is calculated as the θ which maximizes the expected value of the logarithm of the complete-data likelihood function:

θ^k = argmax_θ Q(θ, θ^{k−1});    (5.3)

or at least increases it (this version of the EM algorithm is called the Generalized EM algorithm (GEM)):

Q(θ^k, θ^{k−1}) > Q(θ^{k−1}, θ^{k−1})    (5.4)

Appendix E proves that a solution to (5.3) or (5.4) satisfies (5.2).
Remark 5.2 Note that in this section, the superscript k in θ^k refers to the estimate for θ in the k-th iteration. This estimate is valid for all timesteps because θ is static.
Remark 5.3 Sometimes the E-step calculates p(X(k), Z^k | U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))) instead of p(X(k) | Z^k, U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))). Both differ only by a factor p(Z^k | U^{k−1}, S^k, θ^{k−1}, F^{k−1}, G^k, P(X(0))). This factor is independent of the variable θ and hence does not affect the M-step of the algorithm.
Remark 5.4 Note that the EM algorithm calculates at each iteration the full pdf over X, but it only calculates one θ which maximizes or increases Q(θ, θ^{k−1}).
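A minimal sketch of the EM loop above, in Python, for a hypothetical scalar model x_k = θ x_{k−1} + w, z_k = x_k + v with the system gain θ unknown. The E-step is done very crudely with a bootstrap particle filter that keeps whole trajectories (playing the role of the samples X^i(k)); the M-step has a closed form for this model. This is only an illustration of the structure, not the Baum-Welch or Dual KF algorithms of the appendices.

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical scalar model: x_k = theta * x_{k-1} + w,  z_k = x_k + v, theta unknown.
theta_true, sigma_w, sigma_v, T = 0.8, 0.3, 0.5, 100
x = np.zeros(T); z = np.zeros(T)
for k in range(1, T):
    x[k] = theta_true * x[k - 1] + rng.normal(0, sigma_w)
    z[k] = x[k] + rng.normal(0, sigma_v)

def e_step(theta, n_particles=300):
    """Crude E-step: sample state trajectories X^i from the posterior given theta,
    using a bootstrap particle filter that resamples whole trajectories."""
    traj = np.zeros((n_particles, T))
    for k in range(1, T):
        traj[:, k] = theta * traj[:, k - 1] + rng.normal(0, sigma_w, n_particles)
        w = np.exp(-0.5 * ((z[k] - traj[:, k]) / sigma_v) ** 2)
        traj = traj[rng.choice(n_particles, n_particles, p=w / w.sum())]
    return traj

def m_step(traj):
    """M-step: the theta maximizing Q(theta, theta_old) has a closed form here."""
    return np.sum(traj[:, 1:] * traj[:, :-1]) / np.sum(traj[:, :-1] ** 2)

theta = 0.1                      # initial guess theta^0
for it in range(10):             # EM iterations
    theta = m_step(e_step(theta))
print("estimated theta:", theta)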

Filters

1. All HMM filters allow the use of EM. The algorithm is most often known as the Baum-Welch algorithm (appendix A gives the concrete formulas for the VDHMM; for a derivation starting from the general EM algorithm, see [61]). In the case of MCHMMs, where the pdfs are non-parametric, the danger of overfitting is real and regularization is absolutely necessary. Typically cross-validation techniques are used to avoid this (shrinkage and annealing).

2. Dual Kalman Filtering [122]. The algorithm is described in appendix B.

5.3 Multiple Model Filtering


When the parameters are discrete and there is only a limited number of possible parameters, the concurrent-state-estimation-and-parameter-learning problem can be solved by a Multiple Model Filter. A Multiple Model Filter considers a fixed number of models, one for each possible value of the parameters. So, in each filter the parameters are different but known (the different models can also have a different structure or a different parameterization). For each of the models a separate filter is run. Two kinds of Multiple Model Filters exist:

1. Model detection (model selection, model switching, multiple model, multiple model hypothesis testing, . . . ) filters try to identify the “correct” model; the other models are neglected.
2. Model fusion (interacting multiple model, . . . ) filters calculate a weighted state estimate between the models.

Filters Multiple Model Filtering is possible with all filtering algorithms; however, in practice it is almost only applied with Kalman Filters, because most other filters are computationally too complex to run several of them in parallel. A small sketch of the model-probability update is given below.
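The Python sketch below illustrates the model-detection variant: one (here trivial, hypothetical) filter per candidate parameter value, with the model probabilities updated by the measurement likelihoods. In practice each “filter” would be, e.g., a Kalman filter reporting its innovation likelihood.

import numpy as np

rng = np.random.default_rng(4)

# Hypothetical setup: a static state measured directly, z_k = theta + v, and three
# candidate parameter values; one "filter" (here just the known likelihood) per model.
thetas = np.array([0.0, 1.0, 2.0])          # the limited discrete set of parameters
prior = np.full(3, 1.0 / 3.0)               # P(model j) before any measurement
sigma_v = 0.5

def model_likelihood(z, theta):
    """Likelihood p(z | model): in a real multiple model filter this would be the
    innovation likelihood returned by the j-th filter."""
    return np.exp(-0.5 * ((z - theta) / sigma_v) ** 2)

posterior = prior.copy()
for _ in range(20):
    z = 1.0 + rng.normal(0.0, sigma_v)       # data generated by the second model
    posterior *= model_likelihood(z, thetas) # Bayes' rule over the model index
    posterior /= posterior.sum()

print("model probabilities:", posterior)     # should concentrate on theta = 1.0
# Model detection: pick the argmax; model fusion (IMM-like): weight the state estimates.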
Chapter 6

Decision Making
In the previous chapters, we learned how to process measurements in order to obtain estimates for states and parameters. When we have a closer look at the system’s process and measurement functions, we see that the system’s states and measurements are influenced by the input to the system. This input can be in the process function (e.g. an acceleration input), or in the measurement function (e.g. a parameter of the sensor). The previous chapters assumed that these inputs were given and known. This chapter is about planning (decision making), about the choice of the inputs (control signals, actions). Indeed, a different input can lead to more accurate estimates of the states and/or parameters. So, we want to optimize the input in some way to get “the best possible estimates” (optimal experiment design) and in the meantime perform the task “as well as possible”, i.e. to perform active sensing.
An example is mobile robot navigation in a known map. The robot is unsure about its exact position in the map and needs to choose the action that best determines where it is in the map. Some people make a distinction between active localization and active sensing. The former then refers to robot motion decisions, the latter to sensing decisions (e.g. when a robot is allowed to fire only one sensor at a time).
Section 6.1 formulates the active sensing problem. The performance criteria Uj which measure the gain in accuracy of
the estimates are explained in section 6.2. Section 6.3 describes possible ways to model the input trajectories. Section 6.4
discusses some optimization procedures. Section 6.8 discusses model-free learning, i.e. when there is no model (or not yet
an exact model) of the system available.

6.1 Problem formulation


We consider a dynamic system described by the state space model

xk+1 = f (xk , uk , η k ) (6.1)

z k+1 = h(xk+1 , sk+1 , ξ k+1 ) (6.2)


where x is the system state vector, f and h nonlinear system and measurement functions, z is the measurement vector, η
and ξ are respectively system and measurement noises. u stands for the input vector of the state function, s stands for a
sensor parameter vector as input of the measurement function (an example is the focal length of a camera). The subscripts
k and k + 1 stand for the time step. The system’s states and measurements are influenced by the inputs u and s. Further, we make no distinction and denote both inputs to the system with a_k = [u_k, s_{k+1}] (actions). Conventional systems consisting only of control and estimation components assume that these inputs are given and known. Intelligent systems should be able to perform active sensing.
A first thing we have to do is choose a multiobjective performance criterion (often called value function or return function) that determines when the result of a sequence of actions π_0 = {a_0, . . . , a_{N−1}}1 (also called policy) is considered to be “better” than the result of another policy:

V* = min_{π_0} V(·) = min_{π_0} { Σ_j α_j U_j(...) + Σ_l β_l C_l(...) }    (6.3)

This criterion (or cost function) is a weighted sum of expected costs: the optimal policy π_0 is the one that minimizes this function. The cost function consists of
1 The index 0 denotes that π contains all actions starting from time 0


1. j terms αj Uj (...) characterizing the minimization of expected uncertainties Uj (...) (maximization of expected infor-
mation extraction) and
2. l terms βl Cl (...) denoting other expected costs and utilities Cl (...), such as time, energy, distances to obstacles, distance
to the goal.

The weighting coefficients α_j and β_l are chosen by the designer and reflect his personal preferences. A reward/cost can be associated both with an action a and with the arrival in a certain state x.

If both the goal configuration and the intermediate time evolution of the system are important with respect to the calculation of the cost function, the terms U_j(...) and C_l(...) are themselves a function of the U_{j,k}(...) and C_{l,k}(...) at different time steps k. If the probability distribution over the state at the goal configuration p(x_N | x_0, π_0) fully determines the rewards, these components are reduced to their last terms and V is calculated by using U_{j,N} and C_{l,N} only.
V is to be minimized with respect to the sequence of actions under certain constraints
c(x0 , . . . , xN , π 0 ) ≤ cmax . (6.4)

The thresholds cmax express for instance maximal allowed velocities and acceleration, maximal steering angle, minimum
distance to obstacles, etc.
The problem could be a finite-horizon (over a fixed, finite number of time steps) or an infinite-horizon problem (N = ∞).
For infinite horizon problems: [15, 93]

• the problem can be posed as one in which we wish to maximize expected average reward per time step, or expected
total reward;
• in some cases, the problem itself is structured so that reward is bounded (e.g. goal reward, all actions: cost), once in
goal state: stay at no cost;
• sometimes, one uses a discount factor (“discounting”): rewards in the far future have less weight than rewards in the near future.

6.2 Performance criteria for accuracy of the estimates


The terms U_{j,k}(...) represent (i) the expected uncertainty of the system about its state; or (ii) this uncertainty compared to the accuracy needed for the task completion. In a Bayesian framework, the characterization of the uncertainty of the estimate is based on a scalar loss function of its probability density function. Since no scalar function can capture all aspects of a pdf, no function suits the needs of every experiment. Commonly used functions are based on a loss function of the covariance matrix of the pdf or on the entropy of the full pdf.
Active sensing is looking for the actions which minimize

• the posterior pdf: p = ... in the following formulas

• the “distance” between the prior and the posterior pdf: p1 = ... and p2 = ... in the following formulas
• the “distance” between the posterior and the goal pdf: p1 = ... and p2 = ... in the following formulas
• the posterior covariance matrix (P = P_post in the following functions)
• the inverse of the Fisher information matrix I [48], which describes the posterior covariance matrix of an efficient estimator (P = I^{−1} in the following functions). Appendix H gives more details on the Fisher information matrix and the Cramér-Rao bound.

• loss function based on the covariance matrix: The covariance matrix P of the estimated pdf of state x is a measure
of the uncertainty of the estimate. Since no scalar function can capture all aspects of a matrix, no loss function
suits the needs of every experiment. Minimization of a scalar loss function of the posterior covariance matrix is
extensively described in the literature of optimal experiment design [47, 92] where several scalar loss functions have
been proposed:
– D-optimal design: minimizes det(P) or log(det(P)). The minimum is invariant to any transformation of the variables x with a nonsingular Jacobian (e.g. scaling). Unfortunately, this measure does not allow one to verify task completion.

– A-optimal design: minimizes the trace tr(P ). Unlike D-optimal design, A-optimal design does not have the
invariance property. The measure does not even make sense physically if the target states have inconsistent
units. On the other hand, this measure allows to verify task completion (pessimistic).
– L-optimal design: minimizes the weighted trace tr(W P ). A proper choice of the matrix W can render the
L-optimal design criterium invariant to transformations of the variables x with a nonsingular Jacobian: W
has units and is also transformed accordingly. A special case of L-optimal design is the tolerance-weighted
L-optimal design [34, 53], which proposes a natural choice of W depending on the desired standard deviations
/ tolerances at task completion. The value of this scalar function has a direct relation to the task completion.
– E-optimal design: minimizes the maximum eigenvalue λ_max(P). Like A-optimal design, this is not invariant to transformations of x, nor does the measure make sense physically if the target states have inconsistent units; but the measure allows one to verify task completion (pessimistic).
• loss function based on the entropy: Entropy is a measure of uncertainty represented by the probability distribution.
This measure has more information about the pdf than only the covariance matrix, which is important for multi-
modal distributions, consisting of several small peaks. Entropy is defined as: H(x) = E[− log p(x)]. For a discrete
distribution (p(x = x1 ) = p1 , . . . , p(x = xn ) = pn ) this is:
H(x) = − Σ_{i=1}^{n} p_i log p_i    (6.5)

for continuous distributions:

H(x) = − ∫_{−∞}^{∞} p(x) log p(x) dx    (6.6)
Appendix G describes the concept of entropy in more detail. Some entropy based performance criteria are:
– the entropy of the distribution: H(x) = E[− log p(x)]. !! not invariant to transformation of x !!??
– the change in entropy between two distributions p1 (x) and p2 (x):

H2 (x) − H1 (x) = E[− log p2 (x)] − E[− log p1 (x)] (6.7)

If we take the change between the entropy of the prior distribution p(x|Z_k) and that of the conditional distribution p(x|Z_{k+1}), this measure corresponds to the mutual information (see appendix G.5). Note that the entropy of the conditional distribution p(x|Z_{k+1}) is not equal to the entropy of the posterior distribution p(x|Z_{k+1}) (see appendix G.3)!
– the Kullback-Leibler distance or relative entropy is a measure for the goodness of fit or closeness of two distri-
butions:
D(p2(x)||p1(x)) = E[ log (p2(x)/p1(x)) ];    (6.8)

where the expected value E[.] is calculated with respect to p2(x). For discrete distributions:

D(p2(x)||p1(x)) = Σ_{i=1}^{n} p_{2,i}(x) log p_{2,i}(x) − Σ_{i=1}^{n} p_{2,i}(x) log p_{1,i}(x)    (6.9)

For continuous distributions:

D(p2(x)||p1(x)) = ∫_{−∞}^{∞} p2(x) log p2(x) dx − ∫_{−∞}^{∞} p2(x) log p1(x) dx    (6.10)

Note that the change in entropy and the relative entropy are different measures. The change in entropy only
quantifies how much the form of the pdfs changes; the relative entropy also incorporates a measure of how
much the pdf moves: if p1 (x) and p2 (x) are the same pdf, but translated to another mean value, the change in
entropy is zero, while the relative entropy is not. The question of which measure is best to use for active sensing
is not an issue as the decision making is based on the expectations of the change in entropy or relative entropy,
which are equal.

Remark: Minimizing the covariance matrix is often a more appropriate active sensing criterion than minimizing an entropy
function of the full pdf. This is the case when we want to estimate our state unambiguously, i.e. when we want to use
one value for the state estimate, and reduce the uncertainty of this estimate maximally. The entropy will not always be
a good measure because for multimodal distributions (ambiguity in the estimate) the entropy can be very small while the
uncertainty on any possible state estimate is still large. With the expected value of the distribution as estimate, the covariance
matrix indicates how uncertain this estimate is.
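A small numerical sketch of some of the criteria listed above, in Python: the scalar loss functions of a posterior covariance matrix, and entropy/relative entropy for discrete pdfs. The covariance matrix, the tolerance-based weighting W and the example pdfs are all arbitrary, hypothetical numbers.

import numpy as np

P = np.array([[0.04, 0.01],        # hypothetical posterior covariance of the estimate
              [0.01, 0.09]])
W = np.diag([1.0 / 0.1**2, 1.0 / 0.2**2])   # hypothetical tolerance weights (L-optimal design)

criteria = {
    "D-optimal (log det P)":   np.log(np.linalg.det(P)),
    "A-optimal (trace P)":     np.trace(P),
    "L-optimal (trace W P)":   np.trace(W @ P),
    "E-optimal (lambda_max)":  np.max(np.linalg.eigvalsh(P)),
}
print(criteria)

# Entropy and Kullback-Leibler distance for discrete pdfs p1 (prior) and p2 (posterior).
p1 = np.array([0.25, 0.25, 0.25, 0.25])
p2 = np.array([0.7, 0.1, 0.1, 0.1])
entropy = lambda p: -np.sum(p * np.log(p))
print("change in entropy:", entropy(p2) - entropy(p1))
print("relative entropy D(p2||p1):", np.sum(p2 * np.log(p2 / p1)))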

6.3 Trajectory generation


The description of the possible sequences of actions a_k can be done in different ways. This has a major impact on the optimization problem to be solved afterwards (section 6.4).

• The evolution of a_k can be restricted to a trajectory, described by a reference trajectory and a parametrized deviation from this trajectory. In this way, the optimization problem is reduced to a finite-dimensional, parameterized optimization problem. An example is the parameterization of the deviation as a finite sine/cosine series.
• A more general way to describe the trajectory is as a sequence of freely chosen actions that are not restricted to a certain form of trajectory. The optimization of such a sequence of decisions over time and under uncertainty is called dynamic programming. At execution time, the state of the system is known at any time step. If there is no measurement uncertainty at execution time, the problem is a Markov Decision Process (MDP), for which the optimal policy can be calculated before the task execution for each possible state at every possible time step in the execution (a policy that maximizes the total future expected reward).
If the measurements are noisy, the problem is a Partially Observable Markov Decision Process (POMDP). This means that at execution time the state of the system is not known; only a probability distribution over the states can be calculated. For this case, we need an optimal policy for every possible probability distribution at every possible time step. Needless to say, this complicates the solution a lot.

6.4 Optimization algorithms

6.5 If the sequence of actions is restricted to a parameterized trajectory


E.g. dynamical robot identification [22, 113].
The optimization can have different forms, depending on the function to optimize and the constraints: linear programming,
constrained nonlinear least squares methods, convex optimization, etc. The references in this section are just examples, and
not necessarily to the earliest nor the most famous works.
A. Local optimum = global optimum:

• Linear programming [90]: linear objective function and constraints, which may include both equalities and inequali-
ties. Two basic methods:
– simplex method: each step is to move from one vertex of the feasible set to an adjacent one with a lower value
of the objective function.
– the interior-point methods, e.g. the primal-dual interior point methods: they require all iterates to satisfy the
inequality constraints in the problem strictly.
• Convex programming (e.g. semidefinite programming) [21]: convex (or linear) objective function and constraints,
which may include both equalities and inequalities.

B. Nonlinear, non-convex problems: 1. Local optimization methods [90]:

• Unconstrained optimization
– Line search methods: start by fixing the direction (steepest descent direction, any-descent direction, Newton direction, Quasi-Newton direction, conjugate gradient direction), then identify an approximate step distance (with lower function value).
– Trust region methods: first choose a maximum distance, then approximate the objective function in that region (linear or quadratic) and then seek a direction and step length (steepest descent direction and Cauchy point, Newton direction, Quasi-Newton direction, conjugate gradient direction).
• Constrained optimization: e.g. reduced-gradient methods, sequential linear and quadratic programming methods and methods based on Lagrangians, penalty functions, augmented Lagrangians.

2. Global optimization methods: The Global Optimization website by Arnold Neumaier2 gives a nice overview of various
optimization problems and solutions.
2 http://solon.cma.univie.ac.at/∼neum/glopt.html

• Deterministic

– Branch and Bound methods: Mixed Integer Programming, Constraint Satisfaction Techniques, DC-Methods,
Interval Methods, Stochastic Methods
– Homotopy
– Relaxation

• Stochastic

– Evolutionary computation: genetic algorithms (not good), evolution strategies (good), evolutionary program-
ming, etc
– Adaptive Stochastic Methods: (good)
– Simulated Annealing (not good)

• Hybrids: ad-hoc or involved combinations of the above

– Clustering
– 2-phase

6.6 Markov Decision Processes


Original books and papers that describe MDPs: [10, 11, 58]
Modern works on MDPs: [14, 15, 73, 93]
** What is MDP **
If the sequence of actions is not restricted to a parametrized trajectory, then the optimization problem has a different struc-
ture: (PO)MDP. This could be a finite-horizon, i.e. over a fixed finite number of time steps (N is finite), or an infinite-horizon
problem (N = ∞). For every state it is rather straightforward to know the immediate reward being associated to every action
(1 step policy). The goal however is to find the policy that maximizes the reward over a long term (N steps).

The optimal policy is π*_0 if V^{π*_0}(x_0) ≥ V^{π_0}(x_0), ∀π_0, x_0. For large problems (many states, many possible actions, large N, . . . ) it is computationally not tractable to calculate all value functions V^{π_0}(x_0) for all policies π_0.
Some techniques have been developed that exploit the fact that an infinite-horizon problem will have an optimal stationary
policy, a characteristic not shared by their finite horizon counterparts.
Although MDPs can be both continuous or discrete systems, we will focus on the discrete (discrete actions / states) stochastic
version of the optimal control problem. Extensions to real-valued states and observations can be made. There are two basic
strategies for approximating the solution to a continuous MDP [101]:

• discrete approximations: grid, Monte Carlo [114], . . .

• smooth approximations: treat the value function V and/or decision rules π as smooth, flexible functions of the state
x and a finite-dimensional parameter vector θ

Discrete MDP problems can be solved exactly, whereas the solutions to continuous MDPs can generally only be approx-
imated. Approximate solution methods may also be attractive for solving discrete MDPs with a large number of possible
states or actions.
Standard methods to solve:

Value iteration: optimal solution for finite and infinite horizon problems ** For every state x_{k−1} it is rather straightforward to know the immediate reward associated with an action a_{k−1} (1-step policy): R(x_{k−1}, a_{k−1}). The goal however is to find the policy π*_0 that maximizes the (expected) reward over the long term (N steps). The future reward is a function of the starting state/pdf x_{k−1} and the executed policy π_{k−1} = (a_{k−1}, . . . , a_{N−1}) at time k − 1:

V^{π_{k−1}}(x_{k−1}) = R(x_{k−1}, a_{k−1}) + γ Σ_{x_k} P(x_k | x_{k−1}, a_{k−1}) V^{π_k}(x_k)    (6.11)

This is a backward recursive calculation, with 0 ≤ γ ≤ 1.

a_{k−1} = arg max_a [ R(x_{k−1}, a) + γ ∫_{x_k} V(x_k) p(x_k | x_{k−1}, a) dx_k ]    (6.12)

Bellman’s equation:

V_{k−1} = max_a [ R(x_{k−1}, a) + γ ∫_{x_k} V(x_k) p(x_k | x_{k−1}, a) dx_k ]    (6.13)

a_{k−1} = arg max_a [ R(x_{k−1}, a) + γ Σ_{x_k} V(x_k) p(x_k | x_{k−1}, a) ]    (6.14)

Bellman’s equation:

V_{k−1} = max_a [ R(x_{k−1}, a) + γ Σ_{x_k} V(x_k) p(x_k | x_{k−1}, a) ]    (6.15)

** We exploit the sequential structure of the problem: the optimization problem minimizes (or maximizes) V, written as a succession of sequential problems to be solved with only one of the N variables a_i. This way of optimizing is called dynamic programming (DP)3 and was introduced by Richard Bellman [10] with his Principle of Optimality, also known as Bellman’s principle:

An optimal policy π_{k−1} has the property that whatever the initial state x_{k−1} and the initial decision a_{k−1} are, the remaining decisions π_k must constitute an optimal policy with regard to the state x_k resulting from the first decision (x_{k−1}, a_{k−1}).
The intuitive justification of this principle is simple: if π*_k were not optimal as stated, we would be able to maximize the reward further by switching to an optimal policy for the subproblem once we reach x_k. This makes a recursive calculation of the optimal policy possible: finding an optimal policy for the system when N − i time steps remain can be obtained by using the optimal policy for the next time step (i.e. when N − i − 1 steps remain), and is expressed in the Bellman equation (aka functional equation):
for discrete state space:
V^{π*_{k−1}}(x_{k−1}) = max_{a_{k−1}} E[ R(x_{k−1}, a_{k−1}) + γ Σ_{x_k} P(x_k | x_{k−1}, a_{k−1}) V^{π*_k}(x_k) ]    (6.16)

for a continuous state space:

V^{π*_{k−1}}(x_{k−1}) = max_{a_{k−1}} E[ R(x_{k−1}, a_{k−1}) + γ ∫_{x_k} P(x_k | x_{k−1}, a_{k−1}) V^{π*_k}(x_k) dx_k ]    (6.17)

(For an MDP the expectation is over the process noise.)

The solution of the MDP problem with dynamic programming is called value iteration [10]. The algorithm starts with the value function V^{π*_N}(x_N) = R(x_N) and computes the value function for one more time step (V^{π*_{k−1}}) based on (V^{π*_k}) using Bellman’s equation (6.16) until V^{π*_0}(x_0) is obtained. This method works for both finite and infinite horizon MDPs. For infinite horizon problems Bellman’s equation is iterated till convergence.
Note that the algorithm may be quite time consuming, since the optimization in the DP must be carried out ∀x_k, ∀a_k (curse of dimensionality). A small value iteration sketch for a toy discrete MDP is given below.
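The Python fragment below iterates Bellman’s equation for a small discrete MDP until convergence; the transition probabilities P(x'|x,a) and rewards R(x,a) are hypothetical toy numbers.

import numpy as np

# Hypothetical 3-state, 2-action MDP: P[a, x, x'] = P(x' | x, a), R[x, a].
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])  # action 1
R = np.array([[0.0, -1.0], [0.0, -1.0], [1.0, 1.0]])                 # reward for (x, a)
gamma = 0.9

V = np.zeros(3)                                   # start from V = 0 (or R(x_N))
for _ in range(200):                              # iterate Bellman's equation till convergence
    Q = R + gamma * np.einsum("axy,y->xa", P, V)  # Q(x, a) = R(x, a) + gamma * sum_x' P * V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)                         # optimal stationary policy
print("V*:", V, "policy:", policy)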

Policy iteration: optimal solution for infinite horizon problems Policy iteration is an iterative technique similar to dynamic programming, introduced by Howard [58]. The algorithm starts with any policy (for all states), called π^0. The following iterations are performed:

1. evaluate the value function V^{π^i}(x) for the current policy with an (iterative) policy evaluation algorithm
2. improve the policy with a policy improvement algorithm: ∀x, find the action a* that maximizes

Q(a, x) = R(x, a) + γ Σ_{x′} P(x′ | a, x) V^{π^i}(x′)    (6.18)

if Q(a*, x) > V^{π^i}(x), let π^{i+1}(x) = a*, else keep π^{i+1}(x) = π^i(x).

The iterations stop when π^{i+1}(x) = π^i(x), ∀x.


3 dynamic programming: optimization in a dynamic context; “dynamic”: time plays a significant role



Modified policy algorithm: optimal solution for infinite horizon problems The modified policy algorithm [93] is a combination of the policy iteration and value iteration methods. Like policy iteration, the algorithm contains a policy improvement step and a policy evaluation step. However, the evaluation step is not done exactly. The key insight is that one need not evaluate a policy exactly in order to improve it. The policy evaluation step is solved approximately by executing a limited number of value iterations. Like value iteration, it is an iterative method starting with a value V^{π_N} and iterating till convergence.

Linear programming: optimal solution for infinite horizon problems [93, 36, 105] The value function for a discrete infinite horizon MDP problem follows from:

min_V Σ_x V(x)    (6.19)

s.t. V(x) ≥ R(x, a) + γ Σ_{x′} V(x′) p(x′ | x, a)    (6.20)

for all possible actions a and states x. Linear programs are solved with (1) the Simplex Method or (2) the Interior Point Method [90]. Linear programming is generally less efficient than the previously mentioned techniques because it does not exploit the dynamic programming structure of the problem. However, [118] showed that it is sometimes a good solution.

State based search methods (AI planning): optimal solution [19]

The solution here is to build suitable structures (e.g. a graph4 , a set of clauses, . . . ) and then search them. The heuristic search can be in state space [18] or in belief space [17]. These methods explicitly search the state or belief space with a heuristic that estimates the cost from this state or belief to the goal state or belief. Several planning heuristics have been proposed. The simplest one is a greedy search where we select the best node for expansion and forget about the rest.
Real time dynamic programming [9] is a combination of value iteration for dynamic programming and a greedy heuristic
search. Real time dynamic programming is guaranteed to yield optimal solutions for a large class of finite-state MDPs.
Dynamic programming algorithms generally require explicit enumeration of the state space at each iteration, while search
techniques enumerate only reachable states. However, at sufficient depth in the search tree, individual states can be enumer-
ated multiple times, whereas they are considered only once per stage in dynamic programming.

Approximations without enumeration of the state space: approximate solutions for finite and infinite horizon problems The previously mentioned methods are optimal algorithms to solve MDPs. Unfortunately, we can only find exact solutions for small MDPs, because these methods produce optimal policies in explicit form (i.e. in a tabular manner that enumerates the state space). For larger MDPs, we must resort to approximate solutions [19], [101].
Up to this point our discussion of MDPs has used an explicit or extensional representation for the set of states (and actions), in which states are enumerated directly. We identify the following ways in which structural regularities can be recognized, represented, and exploited computationally to solve MDPs effectively without enumeration of the state space:

• simplifying assumptions such as observability, no process uncertainty, goal satisfaction, time-separable value functions, . . . can make the problem computationally easier to solve. In the AI literature, many different models are presented which can in most cases be viewed as special cases of MDPs and POMDPs.
• in many cases it is advantageous to use a compact (factored) representation of the states, actions and rewards. The components of a problem’s solution, i.e. the policy and the optimal value function, are also candidates for compact structured representation. The following algorithms use these factored representations to avoid iterating explicitly over the entire set of states and actions:
– aggregation and abstraction techniques: these techniques allow the explicit or implicit grouping of states that are indistinguishable with respect to certain characteristics (e.g. the value function or the optimal action choice).
– decomposition techniques: (i) techniques relying on reachability and serial decomposition: an MDP is broken into various pieces, each of which is solved independently; the solutions are then pieced together or used to guide the search for a global solution. The reachability analysis restricts the attention to “relevant” regions of the state space. And (ii) parallel decomposition, in which an MDP is broken into a set of sub-MDPs that are “run in parallel”. Specifically, at each stage of the (global) decision process, the state of each subprocess is affected.

While most of these methods provide approximate solutions, some of them offer optimality guarantees in general, and most can provide optimal solutions under suitable assumptions.
4 One way to formulate the problem as a graph search is to make each node of the graph correspond to a state. The initial and goal states can then be identified, and the search can proceed either forward or backward through the graph, or in both directions simultaneously.

Limited lookahead: approximate solution for finite and infinite horizon problems. The limited lookahead approach truncates the time horizon and uses at each stage a decision based on a lookahead of a small number of stages. The simplest possibility is to use a one-step lookahead policy.

6.7 Partially Observable Markov Decision Processes


**
For a discrete state space:

V^{π*_{k−1}}(x_{k−1}) = max_{a_{k−1}} E[ R(x_{k−1}, a_{k−1}) + γ Σ_{x_k} P(x_k | x_{k−1}, a_{k−1}) V^{π*_k}(x_k) ]    (6.21)

(for an MDP the expectation is over the process noise; for a POMDP it is over the state, the process noise and the measurement noise), and for a continuous state space:

V^{π*_{k−1}}(x_{k−1}) = max_{a_{k−1}} E[ R(x_{k−1}, a_{k−1}) + γ ∫_{x_k} P(x_k | x_{k−1}, a_{k−1}) V^{π*_k}(x_k) dx_k ]    (6.22)

Unfortunately, in many practical cases an analytical solution is not possible, and one has to resort to numerical execution of the DP algorithm. This may be quite time consuming, since the optimization in the DP must be carried out ∀x_k, ∀a_k (and ∀z_k for a POMDP). This means that the state space must be discretized in some way (if it is not already a finite set): the curse of dimensionality.
** What is POMDP **
Original books/papers on POMDP: [41], [7]
Survey algorithms: Lovejoy [74]
E.g. for mobile robotics: [99, 24, 65, 51, 67, 108, 66] (generally they minimize the expected entropy and look one step
ahead)

This model has been analyzed by transforming it into an equivalent continuous-state MDP in which the system state is a pdf (a set of probability distributions) over the unobserved states of the POMDP, and the transition probabilities are derived through Bayes’ rule. Because of the continuity of the state space, the algorithms are complicated and limited.
Exact algorithms for general POMDPs are intractable for all but the smallest problems, so that algorithmic solutions will rely heavily on approximation. Only solution methods that exploit the special structure of a specific problem class, or approximations by heuristics (such as aggregation and discretisation of MDPs), may be quite efficient.
1. We can convert the POMDP into a belief-state MDP and compute the exact V(b) for that [83]. This is the optimal approach, but it is often computationally intractable. We can then consider approximating either the value function V(...), the belief state b, or both.

• exact V,exact b: the value function is piecewise linear and convex. Hence, it can be represented by a limited number
of vectors α. This is used as a basis of exact algorithms for computing V (b) (cfr MDP value iteration algorithms):
enumeration algorithm [111, 78, 44], one-pass algorithm [111], linear support algorithm [27], witness algorithm [72],
incremental pruning algorithm [125]; (an overview of the first three algorithms can be found in [74], and of the first
four algorithms [25]). Current computing power can only solve finite horizon problems POMDPs with a few dozen
discretized states.

• approx V, exact b: use function approximator with ”better” properties than piece-wise linear, e.g. polynomial func-
tions, Fourier expansion, wavelet expansion, output of a neural network, cubic splines, etc [57]. This is generally
more efficient, but may poorly represent the optimal solution.

• exact V, approx. b: [74] the computation of the belief state b (Bayesian inference) can be inefficient. Approximating b can be done (i) by contracting the belief space, using particle filters on a Monte Carlo or grid based basis, etc. (see the previous chapters on estimation); the optimal value function or policy for the discrete problem may then be extended to a suboptimal value function or policy for the original problem through some form of interpolation; or (ii) by finite memory approximations.

• approx V, approx b: combinations of the above. E.g. [114] uses a particle filter to approximate the belief state and uses a nearest neighbor function approximator for V.

2. Sometimes, the structure of the POMDP can be used to compute exact tree-structured value functions and policies (e.g. structure in the form of a DBN) [20].
3. We can also solve the underlying MDP and use that as the basis of various heuristics. Two examples are [26]:

• compute the most likely state x* = arg max_x b(x) and use this as the “observed state” in the MDP instead of the belief b(x).

• define Q(b, a) = Σ_x b(x) Q_MDP(x, a): the Q-MDP approximation

6.8 Model-free learning algorithms


In the previous sections, a model of the system was available. By this we mean that, given an initial state and an action, it was possible to calculate the next state (or the next probability distribution over the states). This makes planning of actions possible.
In this section we look at possible algorithms in the absence of such a model.
Reinforcement learning (RL) [112] can be performed without having such a model; the value functions are then learned at execution time. Therefore, the system needs to choose a balance between its localization (optimal policy) and the new information it can gather about the environment (optimal learning):

• active localization (greedy, exploiting): execute the actions that optimize the reward

• active exploration (exploring): execute actions to experience states which we might otherwise never see. We hope to
choose actions that maximize knowledge gain of the map (parameters).

Reinforcement learning can improve its model knowledge in different ways:

• use the observations to learn the system model, see [46] where a CML algorithm is used to build a map (model) using
an augmented state vector. This model then determines the optimal policy. This is called Indirect RL.

• use the observations to improve the value function and policy, no system model is learned. This is called Direct RL.
Chapter 7

Model selection
Model selection: [124] each criterion was designed to pursue a different goal, so each criterion might be the best for achieving its goal. n: sample size (number of measurements); k: model dimension (number of parameters in θ).

• Akaike’s Information Criterion (AIC) [1, 2, 3, 4, 103, 49]. The Akaike framework defines the success of inference by how close the selected hypothesis is to the true hypothesis, where closeness is measured by the Kullback-Leibler distance (largest predictive accuracy): choose the model with the highest value of log(L(θ̂)) − k. The predictive accuracy of a family tells you how well the best-fitting member of that family can be expected to predict new data (a small numerical sketch of AIC and BIC is given after this list).
• Bayesian Information Criterion (BIC) [104]: we should choose the theory that has the greatest probability (i.e. probability that the hypothesis is true): choose the model with the highest value of log(L(θ̂)) − (k/2) log(n). This selects a simpler model (smaller k) than AIC. A family’s average likelihood tells you how well, on average, the different members of the family fit the data at hand.
• Minimum description length criterion (MDL) [97, 98, 121]
• various methods of cross validation (e.g. [119, 123])
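The sketch below (Python) computes the AIC and BIC scores as defined in the list above for two hypothetical candidate models whose maximized log-likelihoods are assumed to be already known; the numbers are invented for illustration only.

import numpy as np

# Hypothetical maximized log-likelihoods log L(theta_hat) of two candidate models,
# fitted to the same n measurements, with k free parameters each.
n = 200
candidates = {"linear":       {"loglik": -310.0, "k": 2},
              "second order": {"loglik": -305.0, "k": 4}}

for name, m in candidates.items():
    aic = m["loglik"] - m["k"]                        # AIC score as defined above
    bic = m["loglik"] - 0.5 * m["k"] * np.log(n)      # BIC score as defined above
    print(f"{name:12s}  AIC = {aic:8.2f}  BIC = {bic:8.2f}")
# The model with the highest score is selected; BIC penalizes extra parameters more
# heavily than AIC (for n > e^2), so it tends to pick the simpler model.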

two models, hypotheses H1 and H2 ,

• Likelihood ratio = Bayes factor:

p(Z_k|H1) / p(Z_k|H2) > κ    (7.1)

The posterior odds are

p(H1|Z_k) / p(H2|Z_k) = [ p(H1) / p(H2) ] · [ p(Z_k|H1) / p(Z_k|H2) ] > κ    (7.2)

Posterior odds = Prior odds × Bayes factor    (7.3)

When p(H1) = p(H2) = 0.5, the posterior odds reduce to the Bayes factor. The likelihood tells which model is good for the observed data. This is not necessarily a good model for the system (a good predictive model), because of overfitting: it fits the data better than the real model; e.g. the most likely second-order model will fit better than the most likely linear model (the linear model is a special case of the second-order model). Scientists interpret the data as favoring the simpler model, but the likelihood does not. When the models are equally complex, the likelihood is OK (= AIC for these cases). Why not a likelihood difference?? It is not invariant to scaling. . .
The Bayes factor is hard to evaluate, especially in high dimensions. Approximating Bayes factors: BIC.
• Kullback-Leibler information: between the model and the real system. We do not have the real system. . . ⇒ AIC
• Akaike information criterion (AIC) [1] [Sakamoto, Y., Ishiguro, M. and Kitagawa, G. 1986 Akaike information criterion statistics. Dordrecht: Kluwer Academic Publishers]

AIC = log p(Z_k|H) − k    (7.4)

p(Z_k|H) is the likelihood of the likeliest case (i.e. the k model parameters that maximize p(Z_k|H))!! k: number of parameters in the distribution. The model giving the highest value of (7.4) should be selected. It does not choose the model for which the likelihood of the data is largest, but also takes the order of the system model into account. AIC is a natural sample estimate of expected Kullback-Leibler information (as a result of asymptotic theory). AIC: H1 is estimated to be more predictively accurate than H2 if and only if

p(Z_k|H1) / p(Z_k|H2) ≥ exp(k1 − k2)    (7.5)

• variations on AIC (e.g. [Hurvich and Tsai 1989])


• Bayesian Information Criterion (BIC) [104]: approximate p(Z_k|H_i) = ∫_θ p(Z_k|θ, H_i) p(θ|H_i) dθ:

log p(Z_k|H_i) = log p(Z_k|H_i, θ̂) − (k/2) log n + O(1)    (7.6)
              = log-likelihood at the MLE − penalty    (7.7)

Approximate Bayes factors; penalty terms: AIC: k, BIC: (k/2) log n, RIC: k log k

• posterior Bayes factors [Aitkin, M 1991 Posterior Bayes Factors, journal of the Royal Statistical Society B 1: 110-
128.]

• Neyman-Pearson hypothesis tests [Cover and Thomas 1991] FREQUENTIST

• a Bayesian counterpart based on the posterior ratio test:

p(x|Z_k, H1) / p(x|Z_k, H2) > κ    (7.8)

Occam factor and likelihood. The likelihood for a model M_i is the average likelihood over its parameters θ_i:

p(Z_k|M_i) = ∫ p(θ_i|M_i) p(Z_k|θ_i, M_i) dθ_i    (7.9)

This is approximately equal to p(Z_k|M_i) ≈ p(Z_k|θ̂_i, M_i) · (δθ_i/Δθ_i) = maximum likelihood × Occam factor, where δθ_i is the width of the posterior over θ_i and Δθ_i the width of the prior. The Occam factor penalizes models for wasted volume of parameter space.
Part III

Numerical Techniques

Chapter 8

Monte Carlo techniques

8.1 Introduction

Monte-Carlo methods are a group of methods in which physical or mathematical problems are solved by using random number generators. The name “Monte Carlo” was chosen by Metropolis during the Manhattan Project of World War II, because of the similarity of statistical simulation to games of chance, and because the capital of Monaco was a center for gambling and similar pursuits. Monte Carlo methods were first used to perform simulations of the collision behaviour of particles during their transport within a material (to make predictions about how long it takes them to collide).

Monte Carlo techniques provide us with a number of ways to solve one or both of the following problems:

• Sampling from a certain pdf (that is, sampling FROM it, not to be confused with sampling a signal or a (probability density) function as often done in signal processing). The first group of methods (the “real” Monte Carlo methods) is also called importance sampling, whereas the other is called uniform sampling1 .
Importance sampling methods represent the posterior density by a set of N random samples (often called particles, whence the name particle filters). Both methods are presented in figure 8.1. It can be proved that these representation methods are dual.

• Estimating the value of

I = ∫ h(x) p(x) dx    (8.1)

Remark 8.1 Note that equation 2.6 is of the same type as eq. (8.1)!

Note that the latter equation is easily solved once we are able to sample from p(x):

I ≈ (1/N) Σ_{i=1}^{N} h(x^i)    (8.2)

where x^i is a sample drawn from p(x) (often denoted as x^i ∼ p(x)).

PROOF Suppose we have a random variable x, distributed according to a pdf p(x): x ∼ p(x). Then any function fn(x) is also a random variable. Let x^i be a random sample drawn from p(x) and define

F = Σ_{i=1}^{N} λn fn(x^i)    (8.3)

1 To make the confusion complete, importance sampling is also the term used to denote a certain algorithm to perform (importance) sampling.

51
52 CHAPTER 8. MONTE CARLO TECHNIQUES

Figure 8.1: Difference between uniform and importance sampling. Note that the uniform samples only fully characterize the pdf if every sample x^i is accompanied by a weight w^i = p(x^i).

F is also a random variable. The expectation of the random variable F is then

E_{p(x)}[F] = ⟨F⟩ = E_{p(x)}[ Σ_{i=1}^{N} λn fn(x^i) ]
            = Σ_{i=1}^{N} λn E_{p(x)}[fn(x^i)]
            = Σ_{i=1}^{N} λn E_{p(x)}[fn(x)]    (8.4)

Now suppose λn = 1/N and fn(x) = h(x) ∀n; then

E_{p(x)}[F] = Σ_{i=1}^{N} (1/N) E_{p(x)}[h(x)] = E_{p(x)}[h(x)] = I

This means that, if N is large enough, our estimate will converge to I.

Starting from the Chebyshev inequality or the central limit theorem (asymptotically for N → ∞), one can obtain expressions that indicate how good the approximation of I is.

Remark 8.2 Note that for uniform sampling (as in grid-based methods), we can approximate the integral as

I ≈ Σ_{i=1}^{N} h(x^i) p(x^i)    (8.5)
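As a quick illustration of eq. (8.2), the following R sketch estimates I = E[h(x)] for a made-up choice of h(x) = x² and p(x) the standard normal density (exact value 1):

## Monte Carlo estimate of an expectation, eq. (8.2)
set.seed(2)
N  <- 1e5
xs <- rnorm(N)          # x^i ~ p(x)
I_hat <- mean(xs^2)     # (1/N) * sum of h(x^i)
I_hat                   # close to 1; the error shrinks like 1/sqrt(N)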

The following sections describe several methods for (importance) sampling from certain distributions. We start with discrete
distributions in section 8.2. The other sections describe techniques for sampling from continuous distributions.

8.2 Sampling from a discrete distribution


Sampling from a discrete distribution is fairly simple: Just use a uniform random number generator (RNG) in the interval
[0, 1].

Example 8.1 Suppose we want to sample from a discrete distribution with p(x1) = 0.6, p(x2) = 0.2, p(x3) = 0.2. Generate u^i with the uniform random number generator: if u^i ≤ 0.6, the sample belongs to the first category; if 0.6 < u^i ≤ 0.8, it belongs to the second; . . .

This results in the following algorithm, taking O(N log N) time to draw the samples:

Algorithm 1 Basic resampling algorithm


Construct the cumulative distribution of the sample distribution P(xi): CDF(xi).
Sample N samples ui (1 ≤ i ≤ N) from a uniform density U[0, 1]
Lookup in Cumulative PDF
for i = 1 to N do
j=0
while ui > CDF (xj ) do
j++
end while
Add xj to sample list
end for
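A minimal R sketch of Algorithm 1 for the distribution of Example 8.1 (findInterval performs the CDF lookup; the built-in sample() does the same job):

set.seed(3)
p   <- c(0.6, 0.2, 0.2)
cdf <- cumsum(p)
N   <- 10000
u   <- runif(N)
idx <- findInterval(u, cdf) + 1     # category index 1, 2 or 3
table(idx) / N                      # empirical frequencies, close to p
## equivalently: sample(1:3, N, replace = TRUE, prob = p)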

However, more efficient methods based on arithmetic coding exist [75]. [96], p. 96, uses ordered uniform samples, which allows drawing N samples in O(N) time:

Algorithm 2 Ordered resampling


Construct the cumulative distribution of the sample distribution P(xi): CDF(xi).
Sample N samples ui (1 ≤ i ≤ N) from a uniform density U[0, 1]
Take the N-th root of uN: uN = uN^(1/N)
for i = N − 1 to 1 do
  Rescale sample: ui = ui^(1/i) · ui+1
end for
Lookup in Cumulative PDF: j = 0
for i = 1 to N do
while ui > CDF (xj ) do
j++
end while
Add xj to sample list
end for

8.3 Inversion sampling


Suppose we can sample from one distribution (in particular, all RNGs allow us to sample from a uniform distribution). If
we transform a variable x into another one y = f (x), the invariance rule says that:

p(x)dx = p(y)dy (8.6)

and thus

p(y) = p(x) / (dy/dx)

Suppose we want to generate samples from a certain pdf p(x). If we take the transformation function y = f(x) to be the cumulative distribution function (cdf) of p(x), p(y) will be a uniform distribution on the interval [0, 1]. So, if we have an analytic form of p(x), and we can find the inverse cdf f^{-1} of p(x), sampling is straightforward (algorithm 3). An example of a (basic) RNG is rand() in the C standard library.
The obtained samples x^i are exact samples from p(x).

Algorithm 3 Inversion sampling (U [0, 1] denotes the uniform distribution on the interval [0, 1])
for i = 1 to N do
  Sample ui from a U[0, 1]
  xi = f^{-1}(ui) where f(x) = ∫_{−∞}^{x} p(x′) dx′
end for
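A minimal R sketch of Algorithm 3 for a Beta target (the Beta(2, 5) parameters are assumed for illustration); qbeta() plays the role of the inverse cdf f^{-1}:

set.seed(4)
N <- 5000
u <- runif(N)               # u^i ~ U[0,1]
x <- qbeta(u, 2, 5)         # x^i = f^{-1}(u^i): exact Beta(2,5) samples
hist(pbeta(x, 2, 5))        # transformed back: approximately uniform (cf. fig. 8.3)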

Figure 8.2: Illustration of inversion sampling: 50 uniformly generated samples transformed through the cumulative Beta distribution. The right hand side shows that these samples are indeed samples of a Beta distribution.

This approach is illustrated in figures 8.2 and 8.3.2


An important example of this method is the Box–Muller method used to draw samples from a normal distribution (see e.g. [64]). When u1, u2 are independent and uniformly distributed, then

x1 = sqrt(−2 log u1) cos(2π u2)
x2 = sqrt(−2 log u1) sin(2π u2)

are independent samples from a standard normal distribution. There also exist variations on this method, such as the approximative inversion sampling method: the same approach, but applied to a discrete approximation of the distribution we want to sample from.
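A quick R sketch of the Box–Muller transform (sample sizes made up for illustration):

set.seed(5)
N  <- 5000
u1 <- runif(N); u2 <- runif(N)
x1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
x2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)
c(mean(x1), sd(x1))    # approximately 0 and 1: standard normal samples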

8.4 Importance sampling


In many cases p(x) is too complex to be able to compute f^{-1}, so inversion sampling isn't possible. A possible approach is then to approximate p(x) by a function q(x), often called the proposal density [75] or the importance function [38, 37] (to which the inversion technique might be applicable). This technique, as described in algorithm 4, was originally meant to provide an approximation of eq. (8.1). "Real" samples from p(x) can also be approximated with this technique [13]: see algorithm 5.
Note that, the further p() and q() are apart, the bigger the ratio M/N should be to converge "fast enough"; otherwise too many samples M are necessary in order to get a decent approximation.
2 All figures in this chapter were made in R [59]

Figure 8.3: Illustration of inversion sampling: the histogram of the transformed samples should approach a uniform distribution.

Algorithm 4 Integral estimation using Importance Sampling

for i = 1 to N do
  Sample xi ∼ q(x) {e.g. with the inversion technique}
  wi = p(xi) / q(xi)
end for
I ≈ (1 / Σ_{i=1}^{N} wi) Σ_{i=1}^{N} h(xi) wi

Algorithm 5 is sometimes referred to as Sampling Importance Resampling (SIR). It was originally described by Rubin [100] to do inference in a Bayesian context: Rubin drew samples from the prior distribution and assigned a weight to each of them according to its likelihood. The samples from the posterior distribution were then obtained by resampling from the resulting discrete set.

Remark 8.3 Note also that the tails of the proposal density should be as heavy as or heavier than those of the desired pdf, to avoid degeneracy of the weight factor.

This approach is illustrated in figure 8.4.
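A minimal R sketch of algorithms 4 and 5, in the spirit of figure 8.4: a Beta target (the Beta(2, 5) parameters are assumed) and a Gaussian proposal with the same mean and standard deviation; the integral estimated is E[x] (exact value 2/7), followed by a SIR resampling step.

set.seed(6)
p  <- function(x) dbeta(x, 2, 5)
mu <- 2 / 7
sg <- sqrt(2 * 5 / ((2 + 5)^2 * (2 + 5 + 1)))   # same mean and sd as the Beta
M  <- 50000
xt <- rnorm(M, mu, sg)                    # x~^i ~ q(x)
w  <- p(xt) / dnorm(xt, mu, sg)           # importance weights p/q
sum(xt * w) / sum(w)                      # self-normalised estimate of E[x], approx 2/7
## SIR: resample N << M particles proportionally to their weights
N  <- 5000
xs <- sample(xt, N, replace = TRUE, prob = w)   # approximate Beta(2,5) samples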

8.5 Rejection sampling


Another way to get the sampling job done is rejection sampling (figure 8.5). In this case we use a proposal density q(x) for p(x) such that

c × q(x) > p(x)   ∀x    (8.7)

We then generate samples from q. For each sample x^i, we generate a value, uniformly drawn from the interval [0, c·q(x^i)]. If the generated value is smaller than p(x^i), the sample is accepted, else the sample is rejected. This approach is illustrated by algorithm 6 and figure 8.5. This kind of sampling is only interesting if the number of rejections is small: the acceptance rate (as calculated in algorithm 6) should be as close to 1 as possible, which again requires that the proposal density q approximates p(x) fairly well. One can prove that for high-dimensional problems rejection sampling is not appropriate at all, because of eq. (8.7).

Figure 8.4: Illustration of importance sampling. Generating samples of a Beta distribution via a Gaussian with the same mean and standard deviation as the Beta distribution. The histogram compares the samples generated via importance sampling with some samples generated via inversion sampling. 50000 samples were generated from the Gaussian to get 5000 samples from the Beta distribution.

Algorithm 5 Generating samples using Importance Sampling

Require: M >> N
for i = 1 to M do
  Sample x̃i ∼ q(x) {e.g. with the inversion technique}
  wi = p(x̃i) / q(x̃i)
end for
for i = 1 to N do
  Sample xi ∼ (x̃j, wj), 1 ≤ j ≤ M {discrete distribution!}
end for

Algorithm 6 Rejection Sampling algorithm

j = 1, i = 1
repeat
  Sample x̃j ∼ q(x)
  Sample uj from U[0, c · q(x̃j)]
  if uj < p(x̃j) then
    xi = x̃j {accepted}
    i++
  end if
  j++
until i = N
Acceptance rate = N / j
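A minimal R sketch of Algorithm 6. For simplicity it uses a uniform proposal q = U[0, 1] with a constant envelope c (rather than the scaled Gaussian of figure 8.5); the Beta(2, 5) target is assumed for illustration.

set.seed(7)
p <- function(x) dbeta(x, 2, 5)
c_env <- optimize(p, c(0, 1), maximum = TRUE)$objective   # c*q(x) >= p(x) with q(x) = 1
N <- 5000; acc <- numeric(0); n_prop <- 0
while (length(acc) < N) {
  x_try <- runif(1)                        # sample from q
  u     <- runif(1, 0, c_env)              # uniform on [0, c*q(x_try)]
  n_prop <- n_prop + 1
  if (u < p(x_try)) acc <- c(acc, x_try)   # accept
}
N / n_prop                                 # acceptance rate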

8.6 Markov Chain Monte Carlo (MCMC) methods


The previous methods only work well if the proposal density q(x) approximates p(x) fairly well. In practice, this is often utopian. Markov chain MC methods use Markov chains to sample from pdfs and don't suffer from this drawback, but they provide us with correlated samples and it takes a large number of transition steps to explore the whole state space. This section first discusses the most general principle of MCMC sampling (the Metropolis–Hastings algorithm), and then focuses on some particular implementations:

• Metropolis sampling
• Single component Metropolis–Hastings
• Gibbs sampling
• Slice sampling

These algorithms and more variations are more thoroughly discussed in [88, 75, 55].

8.6.1 The Metropolis–Hastings algorithm


This algorithm is often referred to as the M (RT )2 algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller and Teller [76]),
although its most general formulation is due to Hastings [56]. Therefore, it is called the Metropolis–Hastings algorithm. It
provides us with samples from p(x) by using a Markov chain:

• Choose a proposal density q(x, x^(t)), which may (but need not) depend on the current sample x^(t). Contrary to the previous sampling methods, the proposal density doesn't have to be similar to p(x). It can be any density from which we can draw samples. We assume we can evaluate p(x) for all x.
Choose also an initial state x^(0) of the Markov chain.

• At every timestep t, a new state x̃ is generated from this proposal density q(x, x^(t)). To decide if this new state will be accepted, we compute

a = [p(x̃) / p(x^(t))] · [q(x^(t), x̃) / q(x̃, x^(t))]    (8.8)

If a ≥ 1, the new state x̃ is accepted and x^(t+1) = x̃; else the new state is accepted with probability a (this means: sample a random uniform variable u^i; if a ≥ u^i, then x^(t+1) = x̃, else x^(t+1) = x^(t)).

This approach is illustrated in figure 8.6.
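A minimal R sketch of this algorithm, mirroring figure 8.6: a Beta(2, 5) target (assumed for illustration) and a Gaussian random-walk proposal centred on the current sample; since this proposal is symmetric, the q-ratio in eq. (8.8) cancels.

set.seed(8)
p <- function(x) dbeta(x, 2, 5)
n_steps <- 5000; sigma <- 0.1                       # sigma is the step size
x <- numeric(n_steps); x[1] <- 0.5                  # initial state
for (t in 1:(n_steps - 1)) {
  x_prop <- rnorm(1, x[t], sigma)                   # x~ ~ q(x, x^(t))
  a <- p(x_prop) / p(x[t])                          # symmetric q: only the p-ratio remains
  x[t + 1] <- if (runif(1) < a) x_prop else x[t]    # accept with probability min(1, a)
}
samples <- x[-(1:1000)]                             # discard a burn-in period
hist(samples)                                       # approaches the Beta density (cf. fig. 8.7)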



Figure 8.5: Rejection sampling: a Student-t target density and a scaled Gaussian proposal.



Figure 8.6: Demonstration of MCMC for a Beta distribution with a Gaussian proposal density. The Beta target density is in black, the Gaussian proposal (centered around the current sample) in red; blue denotes that the proposal is accepted, green that it is rejected.

The resulting histogram of MCMC sampling with 1000 samples is shown in figure 8.7. We will prove later that, asymptotically, the samples generated from this Markov chain are samples from p(x). Note however that the generated samples are not i.i.d. draws from p(x).

Figure 8.7: 1000 samples drawn from a Beta distribution with MCMC (Gaussian proposal), and the histogram of those samples.

Efficiency considerations

Run length and burn-in period As mentioned, the samples generated by the algorithm are only asymptotically samples from p(x). This means we have to throw away a number of samples at the beginning of the run (the so-called burn-in period). Since the generated samples also depend on each other, we have to make sure that our Markov chain explores the whole state space by running it long enough.
Typically one uses an approximation of the form

E[f(x) | p(x)] ≈ (1/(n − m)) Σ_{i=m+1}^{n} f(x^i).    (8.9)

m denotes the burn-in period and n (the run length) should be large enough to assure the required precision and the fact that the whole state space is explored.
There exist several convergence diagnostics for determining both m and n [55]. The total number of samples n depends strongly on the ratio (typical step size of the Markov chain) / (representative length of the state space) of the algorithm (sometimes also called the convergence ratio, although this term can be misleading).
This typical step size ε of the Markov chain depends on the choice of the proposal density q(). To explore the whole state space efficiently (some authors speak about a well mixing Markov chain), ε should be of the same order of magnitude as the smallest length scale of p(x). One way to determine the stopping time, given a required precision, is to use the variance of the estimate in equation (8.9) (the Monte Carlo variance), but this is very hard because of the dependence between the different samples. The most obvious method is to start several chains in parallel and to compare the different estimates.
One way to improve mixing is to use a reparametrisation (use with care, because these can destroy conditional independence properties).
Convergence diagnostics is still an active area of research, and the ultimate solution still has to appear!

Independence If the typical step size of the Markov chain is ε and the representative length of the state space is L, it typically takes ≈ (1/f)(L/ε)² steps to generate two independent samples, where f is related to the fraction of accepted proposals. The fact that samples are correlated is, however, in most cases hardly a problem for the evaluation of quantities of interest such as E[f(x) | p(x)]. A way to avoid (some of) the dependence is to start different chains in parallel.

Why?

Why on earth does this method generate samples from p(x)?


Let’s start with some definitions of Markov Chains.
Definition 8.1 (Markov Chain) A (continuous) Markov chain can be specified by an initial pdf f^(0)(x) and a transition pdf or transition kernel T(x̃, x). The pdf describing the state at the (t + 1)-th iteration of the Markov chain, f^(t+1)(x), is given by

f^(t+1)(x̃) = ∫ T(x̃, x) f^(t)(x) dx.

Definition 8.2 (Irreducibility) A Markov Chain is called irreducible if we can get from any state x into another state y
within a finite amount of time.
Remark 8.4 For discrete Markov Chains, this means that irreducible Markov Chains cannot be decomposed into parts
which do not interact.

Definition 8.3 (Invariant/Stationary Distribution) A distribution function p(x) is called the stationary or invariant distribution of a Markov chain with transition kernel T(x̃, x) if and only if

p(x̃) = ∫ T(x̃, x) p(x) dx    (8.10)

Definition 8.4 (Aperiodicity – Acyclicity) An irreducible Markov chain is called aperiodic/acyclic if there isn't any distribution function which allows something of the form

p(x̃) = ∫ · · · ∫ T(x̃, . . .) . . . T(. . . , x) p(x) d . . . dx    (8.11)

where the dots denote a finite number of transitions!

Definition 8.5 (Time reversibility – Detailed balance) An irreducible, aperiodic Markov Chain is said to be time re-
versible if
T (xa , xb )p(xb ) = T (xb , xa )p(xa ), (8.12)

What is more important, the detailed balance property implies the invariance of the distribution p(x) under the Markov chain transition kernel T(x̃, x):

PROOF Combine eq. (8.12) with the fact that

∫ T(x^a, x^b) dx^a = 1.

This yields

∫ T(x^a, x^b) p(x^b) dx^a = ∫ T(x^b, x^a) p(x^a) dx^a
p(x^b) = ∫ T(x^b, x^a) p(x^a) dx^a,

q.e.d.
Definition 8.6 (ergodicity) ergodicity = aperiodicity + irreducibility

It can also be proven that any ergodic chain that satisfies the detailed balance equation (8.12) will eventually converge to the invariant distribution p(x) of that chain, starting from any distribution function f^(0)(x).
So, to prove that the Metropolis algorithm does provide us with samples of p(x), we have to prove that this density is the invariant distribution of the Markov chain with the transition kernel defined by the MCMC algorithm.

Transition Kernel Define

a(x, x^(t)) = min( 1, [p(x) q(x^(t), x)] / [p(x^(t)) q(x, x^(t))] ).    (8.13)

The transition kernel of the MCMC is then

T(x, x^(t)) = q(x, x^(t)) · a(x, x^(t)) + I(x = x^(t)) [ 1 − ∫ q(y | x^(t)) a(y, x^(t)) dy ]    (8.14)

where I(·) denotes the indicator function (taking the value 1 if its argument is true, and 0 otherwise). The probability of arriving in a state x ≠ x^t is just the first term of equation (8.14). The probability of staying in x^t, on the other hand, consists of two contributions: either x^t was generated from the proposal density q and accepted, or another state was generated and rejected; the integral "sums" over all possible rejections!

Detailed Balance We can still wonder why the minimum is taken. To satisfy the detailed balance property,

T(x, x^(t)) p(x^(t)) = T(x^(t), x) p(x)
q(x, x^(t)) a(x, x^(t)) p(x^(t)) = q(x^(t), x) a(x^(t), x) p(x)
a(x, x^(t)) / a(x^(t), x) = [q(x^(t), x) p(x)] / [q(x, x^(t)) p(x^(t))]

One can verify that the definition we took in (8.13) satisfies this requirement. If we would not take the minimum, this would not be the case!

Remark 8.5 Note that we should also prove that this chain is ergodic, but that is the case for most proposal densities!

8.6.2 Metropolis sampling

Metropolis sampling [76] is a variant of Metropolis–Hastings sampling that supposes that the proposal density is symmetric around the current state.

8.6.3 The independence sampler

The independence sampler is an implementation of the Metropolis–Hastings algorithm in which the proposal distribution is
independent of the current state. This approach only works well if the proposal distribution is a good approximation of p
(and heavier tailed to avoid getting stuck in the tails).

8.6.4 Single component Metropolis–Hastings

For complex multivariate densities, it can be very difficult to come up with an appropriate proposal density that explores the
whole state space fast enough. Therefore, it is often easier to divide the state space vector x into a number of components:

x = {x.1 x.2 . . . x.n }

where x.i denotes the i-th component of x. We can then update those components one by one. One can prove that this
doesn’t affect the invariant distribution of the Markov Chain. The acceptance function then becomes

a(x̃.i, x^t.i, x^t.−i) = min( 1, [p(x̃.i, x^t.−i) q(x^t.i | x̃.i, x^t.−i)] / [p(x^t.i, x^t.−i) q(x̃.i | x^t.i, x^t.−i)] ),    (8.15)

where x^t.−i = {x^(t+1).1 . . . x^(t+1).(i−1), x^t.(i+1), . . . , x^t.n} denotes the state vector of which the first i − 1 components have already been updated, without component i.

8.6.5 Gibbs sampling

Gibbs sampling is a special case of the previous method. It can be seen as an M (RT )2 algorithm, where the proposal
distributions are the conditional distributions of the joint density p(x). Gibbs sampling can be seen as a Metropolis method
where every proposal is always accepted.
Gibbs sampling is probably the most popular form of MCMC sampling because it can easily be applied to inference prob-
lems. This has to do with the concept of conditional conjugacy explained in the next paragraphs.

Conjugacy and Conditional Conjugacy


Conjugacy is an extremely interesting property when doing Bayesian inference. For a given likelihood function, a family/class of analytical pdfs is said to be the conjugate family of that likelihood if the posterior belongs to the same pdf family.

Example 8.2 The family of Gamma distributions X ∼ Gamma(r, α) (r is called the shape, α the rate; sometimes the scale s = 1/α is used instead) is the conjugate family if the likelihood is an exponential distribution. X is Gamma distributed if

P(x) = (α^r / Γ[r]) x^(r−1) e^(−αx)    (8.16)

The mean and variance are E(X) = r/α and Var(X) = r/α². If the likelihood P(Z1 . . . Zk | X) is of the form x^k e^(−x Σ_{i=1}^{k} Zi) (i.e. according to an exponential distribution, and supposing the measurements are independent given the state), then the posterior will also be Gamma distributed. (The interested reader can verify as an exercise that the posterior is distributed ∼ Gamma(r + k, α + Σ_{i=1}^{k} Zi). :-)

This means inference can be executed very fast and easily. Therefore, conjugate densities are often (mis)used by Bayesians, although they do not always correctly reflect the a priori belief.
For multi-parameter problems, conjugate families are very hard to find, but many multi-parameter problems do exhibit conditional conjugacy. This means the joint posterior itself has a very complicated form (and is thus hard to sample from) but its conditionals have nice, simple forms.
See also the BUGS software3. BUGS is a free, but not open-source, software package for Bayesian inference that uses Gibbs sampling.
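A minimal R sketch of Gibbs sampling; the target (a bivariate standard normal with correlation ρ, chosen because its full conditionals are themselves normal) is assumed purely for illustration. Each component is drawn exactly from its conditional, so every "proposal" is accepted.

set.seed(9)
rho <- 0.8; n_steps <- 5000
x1 <- numeric(n_steps); x2 <- numeric(n_steps)
for (t in 2:n_steps) {
  x1[t] <- rnorm(1, rho * x2[t - 1], sqrt(1 - rho^2))   # sample from p(x1 | x2)
  x2[t] <- rnorm(1, rho * x1[t],     sqrt(1 - rho^2))   # sample from p(x2 | x1), using the new x1
}
cor(x1[-(1:500)], x2[-(1:500)])   # close to rho after discarding a burn-in period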

8.6.6 Slice sampling


This is a Markov chain MC method that tries to eliminate the drawbacks of the two previous methods:

• It is more robust with respect to the choice of parameters such as the step size ε.

• It also uses the conditional distributions of the joint density p(x) as proposal densities, but since these can be hard to evaluate, a simplified approach is used.

Slice sampling [87, 86, 85] can be seen as a “combination” of rejection sampling and Gibbs sampling. It is similar to
rejection sampling in the sense that it provides samples that are uniformly distributed in the area/volume/hypervolume
delimited by the density function. In this sense, both these approaches introduce an auxiliary variable u and sample
from the joint distribution p(x, u), which is a uniform distribution. Obtaining samples from p(x) then just consists of
marginalizing over u!
Slice sampling uses, contrary to rejection sampling, a Markov Chain to generate these uniform samples. The proposal
densities are similar to those in Gibbs sampling (but not completely).
The algorithm has several versions: stepping out, doubling, . . . We refer to [85] for an elaborate discussion of them. Algorithm 7 describes the stepping-out version for a 1D pdf. We illustrate this with a simple 1D example in figure 8.8 on page 64. The resulting histogram is shown in figure 8.9. Although there is still a parameter that has to be chosen, unlike in the case of Metropolis sampling this length scale doesn't influence the complexity of the algorithm as badly.

8.6.7 Conclusions
Drawbacks of Markov chain Monte Carlo methods are the fact that samples are correlated (although this is generally not a problem) and that, in some cases, it is hard to set some parameters in order to be able to explore the whole state space efficiently. To speed up the process of generating independent samples, hybrid Monte Carlo methods were developed.

8.7 Reducing random walk behaviour and other tricks


• Dynamical Monte Carlo methods
3 http://www.mrc-bsu.cam.ac.uk/bugs/

Figure 8.8: Illustration of the slice sampling algorithm.



Algorithm 7 Slice Sampling algorithm (1D stepping-out version)

Choose x1 in the domain of p(x)
Choose interval length w
for i = 1 to N do
  Sample ui from a U[0, p(xi)]
  Sample ri from a U[0, 1]
  L = xi − ri × w
  R = xi + (1 − ri) × w
  repeat
    L −= w
  until p(L) < ui
  repeat
    R += w
  until p(R) < ui
  Sample x̃i+1 ∼ U[L, R]
  while p(x̃i+1) < ui do
    if x̃i+1 < xi then
      L = x̃i+1
    else
      R = x̃i+1
    end if
    Sample x̃i+1 ∼ U[L, R]
  end while
  xi+1 = x̃i+1
end for
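A minimal R sketch of the stepping-out slice sampler above, for a Beta(2, 5) target and step size w assumed for illustration:

p <- function(x) dbeta(x, 2, 5)
set.seed(10)
N <- 5000; w <- 0.2
x <- numeric(N); x[1] <- 0.5
for (i in 1:(N - 1)) {
  u <- runif(1, 0, p(x[i]))              # vertical level: the slice is {x : p(x) > u}
  r <- runif(1)
  L <- x[i] - r * w; R <- x[i] + (1 - r) * w
  while (p(L) > u) L <- L - w            # step out to the left
  while (p(R) > u) R <- R + w            # step out to the right
  repeat {                               # sample within [L, R], shrinking on rejection
    xn <- runif(1, L, R)
    if (p(xn) > u) break
    if (xn < x[i]) L <- xn else R <- xn
  }
  x[i + 1] <- xn
}
hist(x, breaks = 40)                     # approaches the Beta density (cf. fig. 8.9)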

Figure 8.9: Resulting histogram for 5000 samples of a Beta density generated with slice sampling.

• Hybrid Monte Carlo methods


• Overrelaxation
• Simulated annealing: can be seen as importance sampling, where the proposal distribution q(x) is p(x)^(1/T). T represents a temperature, and the higher T, the more flattened the proposal distribution becomes. This can be very useful in cases where p(x) is a multimodal density with well-separated modes. Heating the target will flatten the modes and put more probability weight in between them.

• Stepping stones: to solve the same problem as before, especially in conjunction with Gibbs sampling or single-component MCMC sampling, where movement only happens parallel to the coordinate axes.

• MCMCMC: Metropolis-coupled MCMC (multiple chains in parallel, with different proposals that all differ only gradually, swapping states between the different chains), also to eliminate problems with (too) well-separated modes.

• Simulated tempering: sort of a combination of simulated annealing and MCMCMC, but very tricky.

• Auxiliary variables: introduce some extra variables u and choose a convenient conditional density q(u | x) such that q*(x, u) = q(u | x) p(x) is easier to sample from than the original distribution. Note that choosing q might not be the simplest of things though.

8.8 Overview of Monte Carlo methods


Figure 8.10 gives an overview of all discussed methods.

Figure 8.10: Overview of the different MC methods: the iterative (MCMC) methods comprise the Metropolis methods (M(RT)² sampling and Gibbs sampling) and slice sampling; the non-iterative methods are rejection sampling and importance sampling.



8.9 Applications of Monte Carlo techniques in recursive markovian state and


parameter estimation
• SIS: Sequential Importance Sampling: See appendix D about particle filters.

8.10 Literature
• First paper about Monte Carlo methods: [77];
first paper about MCMC by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller: [76], generalised by Hastings in
1970 [56]

• SIR: [100]

• Good tutorials: [89] (very well explained, but not fully complete), [75, 64]. There is an excellent book about MCMC by Gilks et al. [55].

• Overview of all methods and combination with Markov techniques [88, 75]

• Other interesting papers about MCMC: [110, 54, 28, 23]

8.11 Software
• Octave Demonstrations of most Monte Carlo methods by David Mackay: MCMC.tgz4

• My own demonstrations of Monte Carlo methods, used to generate the figures in this chapter, and written in R are
here5

• Perl demonstration of metropolis method by Mackay here6

• Radford Neal has some C-software for Markov Chain Monte Carlo and other Monte Carlo-methods here7

• BUGS8

4 http://wol.ra.phy.cam.ac.uk/mackay/itprnn/code/mcmc/mcmc.tgz
5 http://www.mech.kuleuven.ac.be/~kgadeyne/downloads/R/
6 http://wol.ra.phy.cam.ac.uk/mackay/itprnn/code/metrop/Welcome.html
7 http://www.cs.toronto.edu/~radford/fbm.software.html
8 http://www.mrc-bsu.cam.ac.uk/bugs/
Appendix A

Variable Duration HMM filters

In this section we describe the filters for the VDHMM.


A VDHMM with n possible states and m possible measurements is characterised by λ = (A^{n×n}, B^{n×m}, π^n, D), where e.g. aij denotes the (discrete!) transition probability to go from state i (denoted as Si) to state j (Sj).
A state sequence from t = 1 to t is denoted as q1 q2 . . . qt, where each qk (1 ≤ k ≤ t) corresponds to one of the possible states Sj (1 ≤ j ≤ n).
The vector π denotes the initial state probabilities so

πi = P (q1 = Si )

If there are m possible measurements (observations) vi (1 ≤ i ≤ m), a measurement sequence from t = 1 until t is denoted
as O1 O2 . . . Ot where each Ok (1 ≤ k ≤ t) corresponds to one of the possible measurements vj (1 ≤ j ≤ m). bij denotes
the probability of measuring vj , given state Si .
The duration densities pi(d), denoting the probability of staying d time units in Si, are typically exponential densities, so the duration is modeled by 2n + 1 parameters. The parameter D contains the maximal duration in any state i (mainly to simplify the calculations, see also [70, 71]).
Remark that the filters for the VDHMM increase both the computation time (×D²/2) and the memory requirements (×D) with respect to the standard HMM filters.

Three different algorithms for (VD)HMMs

1. Given a measurement sequence (OS) O = O1 O2 · · · OT and a model λ, calculate the probability of seeing this OS (solved by the forward–backward algorithm in section A.1).

2. Given a measurement sequence (OS) O = O1 O2 · · · OT and a model λ, calculate the state sequence (SS) that most likely generated this OS (solved by the Viterbi algorithm in section A.2).

3. Adapt the model parameters A, B and π (parameter learning or training of the model, solved by the Baum–Welch algorithm, see section A.3).

Note that the actual inference problem (finding the most probable state sequence) is solved by the Viterbi algorithm. Note also that the Viterbi algorithm does not construct a belief PDF over all possible state sequences; it only gives you the ML estimator!

A.1 Algorithm 1 : The Forward-Backward algorithm

A.1.1 The forward algorithm

Suppose
αt (i) = P (O1 O2 . . . Ot , Si ends at t|λ) (A.1)


αt(i) is the probability that the part of the measurement sequence from t = 1 until t is seen, that the FSM is in state Si at time t, and that it jumps to another state at time t + 1.
If t = 1, then

α1(i) = P(O1, Si ends at t = 1 | λ)    (A.2)

The probability that Si ends at t = 1 equals the probability that the FSM starts in Si (πi) and stays there 1 time step (pi(1)). Furthermore O1 should be measured. Since all these phenomena are supposed to be independent1, this results in:

α1(i) = πi pi(1) bi(O1)    (A.3)

For t = 2,

α2(i) = P(O1 O2, Si ends at t = 2 | λ)    (A.4)

This probability consists of two parts: either the FSM started in Si and stayed there for 2 time units, or it was in another state Sj for 1 time step and after that one time unit in Si. That results in

α2(i) = πi pi(2) ∏_{s=1}^{2} bi(Os) + Σ_{j=1}^{N} α1(j) aji pi(1) bi(O2)    (A.5)

Induction leads to the general case (as long as t ≤ D, the maximal duration time possible):

αt(i) = πi pi(t) ∏_{s=1}^{t} bi(Os) + Σ_{j=1}^{N} Σ_{d=1}^{t−1} αt−d(j) aji pi(d) ∏_{s=t+1−d}^{t} bi(Os)    (A.6)

If t > D:

αt(i) = Σ_{j=1}^{N} Σ_{d=1}^{D} αt−d(j) aji pi(d) ∏_{s=t+1−d}^{t} bi(Os)    (A.7)

Since

αT(i) = P(O1, O2, . . . , OT, Si ends at t = T | λ),    (A.8)

P(O | λ) = Σ_{i=1}^{N} αT(i)    (A.9)
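As a sanity check, the following R sketch implements the forward recursion for the ordinary-HMM special case of eqs. (A.6)–(A.9), i.e. every state is occupied exactly one time step (D = 1, pi(1) = 1); the transition matrix, measurement matrix and initial probabilities are made up for illustration.

forward_prob <- function(A, B, pi0, obs) {
  n <- length(pi0); Tlen <- length(obs)
  alpha <- matrix(0, Tlen, n)
  alpha[1, ] <- pi0 * B[, obs[1]]                  # alpha_1(i) = pi_i b_i(O_1)
  for (t in 2:Tlen)                                # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(O_t)
    alpha[t, ] <- as.vector(alpha[t - 1, ] %*% A) * B[, obs[t]]
  sum(alpha[Tlen, ])                               # P(O | lambda) = sum_i alpha_T(i)
}
A   <- matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow = TRUE)   # a_ij
B   <- matrix(c(0.7, 0.3, 0.1, 0.9), 2, byrow = TRUE)   # b_i(v_k)
pi0 <- c(0.5, 0.5)
forward_prob(A, B, pi0, obs = c(1, 1, 2, 2))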

A.1.2 The backward procedure

This is a simple variant on the forward algorithm.

βt (i) = P (Ot+1 Ot+2 . . . OT |Si ends at t, λ) (A.10)

The recursion starts here at time T. That is why we change the index t into T − k:

βT −k (i) = P (OT −k+1 OT −k+2 . . . OT |Si ends at t = T − k, λ) (A.11)

Note that this definition is complementary to that of αt(i), which leads to

αt(i) βt(i) / P(O | λ)
= [P(O1 O2 . . . Ot, Si ends at t | λ) × P(Ot+1 Ot+2 . . . OT | Si ends at t, λ)] / P(O | λ)
= [P(O1 O2 . . . Ot | Si ends at t, λ) × P(Si ends at t | λ) × P(Ot+1 Ot+2 . . . OT | Si ends at t, λ)] / P(O | λ)
= [P(O | Si ends at t, λ) × P(Si ends at t | λ)] / P(O | λ)
= P(Si ends at t | O, λ)    (A.12)
1 Not always the case in the real world??

Analogously to the calculation of the α's, the recursion step can be split into two parts.
For k ≤ D:

βT−k(i) = Σ_{j=1}^{N} aij pj(k) ∏_{s=T−k+1}^{T} bj(Os) + Σ_{j=1}^{N} Σ_{d=1}^{k−1} βT−k+d(j) aij pj(d) ∏_{s=T−k+1}^{T−k+d} bj(Os)    (A.13)

For k > D:

βT−k(i) = Σ_{j=1}^{N} Σ_{d=1}^{D} βT−k+d(j) aij pj(d) ∏_{s=T−k+1}^{T−k+d} bj(Os)    (A.14)

A.2 The Viterbi algorithm

A.2.1 Inductive calculation of the weights δt (i)

Suppose

δt(i) = max_{q1 q2 ... qt−1} P(q1 q2 . . . qt = Si ends at t, O1 O2 . . . Ot | λ)    (A.15)

δt(i) is the maximum of the probabilities of all possible paths at time t, i.e. it represents the most probable way of arriving in Si.
Then (cf. the definition of αt(i))

δ1(i) = P(q1 = Si and q2 ≠ Si, O1 | λ)    (A.16)

This means the FSM started in Si and stayed there for one time step. Furthermore O1 should have been measured. So

δ1(i) = πi pi(1) bi(O1)    (A.17)

At t = 2,

δ2(i) = max_{q1} P(q1 q2 = Si and q3 ≠ Si, O1 O2 | λ)    (A.18)

Either the FSM stayed 2 time units in Si and both O1 and O2 have been measured in state Si; or the FSM was for one time step in another state Sj, in which O1 was measured, and jumped to state Si at time t = 2, in which O2 was measured:

δ2(i) = max[ max_{1≤j≤N} ( δ1(j) aji pi(1) bi(O2) ), πi pi(2) bi(O1) bi(O2) ]    (A.19)

In the general case ∀t ≤ D, one comes to

δt(i) = max[ max_{1≤j≤N} max_{1≤d<t} ( δt−d(j) aji pi(d) ∏_{s=t−d+1}^{t} bi(Os) ), πi pi(t) ∏_{s=1}^{t} bi(Os) ]    (A.20)

For all t > D:

δt(i) = max_{1≤j≤N} max_{1≤d<D} { δt−d(j) aji pi(d) ∏_{s=t−d+1}^{t} bi(Os) }    (A.21)

Note that, except for the presence of a second term in (A.20), the only difference between (A.20) and (A.21) lies in the bounds of the maximum, in order to avoid referencing δ's that do not exist. E.g. suppose t = 1, d = 3: this would lead to terms like δ1−3(i) in eq. (A.20). However, δt(i) does not exist for t < 1.

A.2.2 Backtracking

The δt(i)'s alone are not sufficient to determine the most probable state sequence. Indeed, when all δt(i)'s are known, the maximum

δT(i) = max_{q1 q2 ... qT−1} P(q1 q2 . . . qT = Si, O1 O2 . . . OT | λ)   ∀i: 1 ≤ i ≤ N    (A.22)

allows us to determine the most probable state at time t = T, q*T, but this does not solve the problem of finding the most probable sequence (i.e. how we arrived in that state). This can be solved by determining, together with the calculation of all δt(i), the arguments (i.e. how long the FSM stayed in Si and where it came from before it was in Si) that maximise δt(i). Therefore we define ψt(i) and τt(i).
If ψt(i) = k and τt(i) = l, then

δt(i) = δt−l(k) aki pi(l) ∏_{s=t−l+1}^{t} bi(Os) ≥ δt−d(j) aji pi(d) ∏_{s=t−d+1}^{t} bi(Os)   ∀j: 1 ≤ j ≤ N, ∀d: 1 ≤ d ≤ D    (A.23)

Note that D is to be replaced by t if t ≤ D.


Put in a more mathematically correct way:

ψt(i) = Arg max_{1≤j≤N} max_{1≤d<D} { δt−d(j) aji pi(d) ∏_{s=t−d+1}^{t} bi(Os) }    (A.24)

τt(i) = Arg max_{1≤d<D} max_{1≤j≤N} { δt−d(j) aji pi(d) ∏_{s=t−d+1}^{t} bi(Os) }    (A.25)

All variables can be determined recursively.


Suppose

κt = Arg max_{1≤i≤N} { δt(i) }    (A.26)

Then the equation

κT = Arg max_{1≤i≤N} { δT(i) }    (A.27)

gives us the missing parameter i necessary to determine the τT(i) and ψT(i) to start the first step of the backtracking part of the algorithm. That part constructs, starting from t = T, the most probable state sequence q*1 q*2 . . . q*T. This can be done as follows. One knows that

q*T = SκT    (A.28)

But according to the definition of ψt(i) and τt(i), we also know that

∀i | 0 ≤ i < τT(κT): q*T−i = SκT    (A.29)

and that

for i = τT(κT): q*T−i = Sj with j = ψT(κT)    (A.30)

In this way we know both the last τT(κT) elements of q* and the previous state Sj, so with τt(j) and ψt(j) we can restart the recursion.
An example of such a backtracking procedure can be seen in figure A.1. After calculation of all δ's, it appears that κT = arg maxi δT(i) = 3. Starting from state 3 and looking up the values of ψT(κT) and τT(κT), these appear to be equal to 2 and 3 respectively. The FSM thus stayed 3 time steps in state 3, and before that it was in state 2.

A.3 Parameter learning


We start by defining two new forward–backward variables

αt∗ (i) = P (O1 O2 . . . Ot , Si starts at t + 1|λ) (A.31)

βt∗ (i) = P (Ot+1 Ot+2 . . . OT |Si starts at t + 1, λ) (A.32)



Figure A.1: Backtracking with the Viterbi algorithm (here κT = 3, ψT(κT) = 2, τT(κT) = 3 and ψT−3(2) = N, τT−3(2) = 2).

Note that β*t(i) is only defined for t from 0 until T − 1 (instead of from t = 1 until t = T).
Since the condition on α*t(i) is that Si starts at t + 1, and that on αt(i) that Si ends at t, the following relationship is easy to derive. With eq. (A.1), eq. (A.31) becomes

α*t(j) = Σ_{i=1}^{N} αt(i) aij    (A.33)

Analogously,

β*t(i) = Σ_{d=1}^{D} βt+d(i) pi(d) ∏_{s=t+1}^{t+d} bi(Os)    (A.34)

Note that this formula has to be modified for all t starting from t = T − D.2

The re-estimation formulas

1. The re-estimation formula for πi:

πi = πi β*0(i) / P(O | λ)    (A.35)

Intuitively this formula can be explained as follows: β*0(i) is the probability that the complete measurement sequence is observed, given that q1 = Si.

β0∗ (i) = P (O1 O2 . . . OT |Si starts at t = 1, λ) (A.36)

By multiplication of this parameter with πi and applying Bayes’ rule:

πi β0∗ (i) = P (O1 O2 . . . OT |Si starts at t = 1, λ) × P (Si starts at t = 1|λ)


= P (O1 O2 . . . OT and Si starts at t = 1|λ)
= P (O, Si starts at t = 1|λ) (A.37)

Applying Bayes’ rule again

P (O, Si starts at t = 1|λ) = P (Si starts at t = 1|O, λ) × P (O|λ) (A.38)

so that eq. (A.35) follows from eqs. (A.37) and (A.38).


2 The relation t + d ≤ T must always keep holding, which introduces an extra difficulty in the implementation.

2. The re-estimation formula for aij:

aij = [ Σ_{t=1}^{T} αt(i) aij β*t(j) ] / [ Σ_{j=1}^{N} Σ_{t=1}^{T} αt(i) aij β*t(j) ]    (A.39)

Intuitively one can say that

aij = (# transitions from i to j) / (# transitions from i)    (A.40)
We're looking for

Σ_{t=1}^{T} P(Si ends at t, Sj starts at t + 1 | O, λ)    (A.41)

Each term of this sum can be written as

P(Si ends at t, Sj starts at t + 1 | O, λ) = P(Si ends at t, Sj starts at t + 1, O | λ) / P(O | λ)    (A.42)

Writing the numerator of this expression in full gives

P(Si ends at t, Sj starts at t + 1, O | λ)
= P(Si ends at t, Sj starts at t + 1, O1, O2, . . . , OT | λ)
= P(O1 O2 . . . Ot, Si ends at t | λ) × P(Ot+1 Ot+2 . . . OT, Sj starts at t + 1 | λ)    (A.43)

Different consecutive measurements are assumed to be independent. The first factor of the product equals αt(i) (see eq. (A.1)). Applying Bayes' rule to the second factor of eq. (A.43) gives

P(Ot+1 Ot+2 . . . OT | Sj starts at t + 1, λ) × P(Sj starts at t + 1 | λ)

Eq. (A.32) allows us to conclude that the first factor of this expression equals β*t(j). Since from eq. (A.43) we can conclude that Si ends at time t, the second factor of this product is nothing but aij. The sum of all these terms equals the numerator of eq. (A.39). The denominator of that equation is a normalisation factor.
3. The formulas for bi(k) and pi(d) can be derived in a similar way.

bi(k) = [ Σ_{t=1, Ot=k}^{T} ( Σ_{τ<t} α*τ(i) β*τ(i) − Σ_{τ<t} ατ(i) βτ(i) ) ] / [ Σ_{k=1}^{M} Σ_{t=1, Ot=k}^{T} ( Σ_{τ<t} α*τ(i) β*τ(i) − Σ_{τ<t} ατ(i) βτ(i) ) ]    (A.44)

bi(k) = (# times that vk has been measured in state i) / (# times that a measurement has been made in state i)

pi(d) = [ Σ_{t=1}^{T} α*t(i) pi(d) βt+d(i) ∏_{s=t+1}^{t+d} bi(Os) ] / [ Σ_{d=1}^{D} Σ_{t=1}^{T} α*t(i) pi(d) βt+d(i) ∏_{s=t+1}^{t+d} bi(Os) ]    (A.45)

pi(d) = (# times that d time units have been spent in state i) / (# times that state i was visited)

Notes:

• The iteration formula for bi(k) sums over all indices t for which Ot = k; in other words, it first filters the input.
• The denominators are normalisation factors.

A.4 Case study: Estimating first order geometrical parameters by the use of
VDHMM’s
This problem has already been studied extensively with Kalman filters (references to be added).

• States: the different CFs (contact formations).

• Measurement vectors: stem from Twist × Wrench = 0. Different CFs should give rise to different clusters in hyperspace and thus allow the construction of a measurement vector.

• The state transition matrix A comes from the planner.

• The duration estimation comes from ?? (the planner??)

• π comes from the planner.


Appendix B

Kalman Filter (KF)

system and measurement equations:


x(k) = F_{k−1} x(k − 1) + f′_{k−1}(u_{k−1}, θ_{f,k−1}) + F″_{k−1} w_{k−1}    (B.1)
z_k = G_k x(k) + g′_k(s_k, θ_{g,k}) + G″_k v_k    (B.2)

For nonlinear systems: use the linearized equations :)

B.1 Notations
The state estimate at time step k, based on the measurements up to time step i, is denoted as x̂_{k|i}; its covariance matrix is P_{k|i}. x̂_{k|k−1} is called the predicted state estimate and x̂_{k|k} the updated state estimate. The initial state estimate x̂_{0|0} and its covariance matrix P_{0|0} represent the prior knowledge. w_{k−1} and v_k are the process and measurement uncertainty; they are random vector sequences with zero mean and known covariance matrices Q_{k−1} and R_k.

B.2 Kalman Filter


Kalman Filter algorithm [8]:

x̂_{k|k−1} = F_{k−1} x̂_{k−1|k−1} + f′_{k−1}(u_{k−1}, θ_{f,k−1});    (B.3)
P_{k|k−1} = F_{k−1} P_{k−1|k−1} F^T_{k−1} + F″_{k−1} Q_{k−1} F″^T_{k−1};    (B.4)
x̂_{k|k} = x̂_{k|k−1} + K_k (z_k − (G_k x̂_{k|k−1} + g′_k(s_k, θ_{g,k})));    (B.5)
P_{k|k} = P_{k|k−1} − K_k S_k K^T_k;    (B.6)

where

K_k = P_{k|k−1} G^T_k S^{−1}_k;    (B.7)
S_k = G″_k R_k G″^T_k + G_k P_{k|k−1} G^T_k.    (B.8)
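A minimal R sketch of one predict/update cycle of eqs. (B.3)–(B.8), for a scalar constant-position model (F = G = 1, no input terms f′, g′); all numbers are made up for illustration.

kf_step <- function(xhat, P, z, F = 1, G = 1, Q = 0.01, R = 0.1) {
  # prediction, eqs. (B.3)-(B.4)
  xpred <- F * xhat
  Ppred <- F * P * F + Q
  # update, eqs. (B.5)-(B.8)
  S <- G * Ppred * G + R
  K <- Ppred * G / S
  list(xhat = xpred + K * (z - G * xpred),
       P    = Ppred - K * S * K)
}
est <- list(xhat = 0, P = 1)
for (z in c(0.9, 1.1, 1.05, 0.98)) est <- kf_step(est$xhat, est$P, z)
est   # posterior mean and covariance after four measurements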

B.3 Kalman Filter, derived from Bayes’ rule


We assume linear measurement and process equations, a Gaussian uncertainty distribution on the state estimate, and white, additive Gaussian uncertainties on the measurement and process equations.

System update:
Before a system update is calculated (time step k − 1), the distribution Post(x(k − 1)) is Gaussian with mean x̂_{k−1|k−1} and covariance matrix P_{k−1|k−1} (n is the dimension of the state vector x):

Post(x(k − 1)) = |(2π)^n P_{k−1|k−1}|^{−1/2} exp[ −½ (x(k−1) − x̂_{k−1|k−1})^T P^{−1}_{k−1|k−1} (x(k−1) − x̂_{k−1|k−1}) ].    (B.9)


The system dynamics can be written as:

x(k) = F_{k−1} x(k − 1) + f′_{k−1}(u_{k−1}, θ_{f,k−1}) + F″_{k−1} w_{k−1}.    (B.10)

w_{k−1} is a zero-mean Gaussian process uncertainty with covariance matrix Q_{k−1}.
Out of this:

p(x(k) | u_{k−1}, θ_f, f_{k−1}, x(k − 1)) = |(2π)^n F″_{k−1} Q_{k−1} F″^T_{k−1}|^{−1/2} exp[ −½ (x(k) − F_{k−1} x(k−1) − f′_{k−1}(u_{k−1}, θ_{f,k−1}))^T (F″_{k−1} Q_{k−1} F″^T_{k−1})^{−1} (x(k) − F_{k−1} x(k−1) − f′_{k−1}(u_{k−1}, θ_{f,k−1})) ].    (B.11–B.12)

The distribution of x(k) is then:

Prior(x(k)) = ∫_{−∞}^{∞} p(x(k) | u_{k−1}, θ_f, f_{k−1}, x(k − 1)) Post(x(k − 1)) dx(k − 1)    (B.13)
            = c1 ∫_{−∞}^{∞} e^{−½ f(x(k))} e^{−½ g(x(k−1), x(k))} dx(k − 1);    (B.14)

where c1 is independent of x(k − 1) and x(k), and1:

f(x(k)) = c2 + (x(k) − x̂_{k|k−1})^T P^{−1}_{k|k−1} (x(k) − x̂_{k|k−1});    (B.15)
g(x(k − 1), x(k)) = (x(k − 1) − h(x(k)))^T C^{−1}_{k−1} (x(k − 1) − h(x(k)));    (B.16)
h(x(k)) = C_{k−1} [ F^T_{k−1} (F″_{k−1} Q_{k−1} F″^T_{k−1})^{−1} (x(k) − f′_{k−1}(u_{k−1}, θ_{f,k−1})) + P^{−1}_{k−1|k−1} x̂_{k−1|k−1} ];    (B.17)
x̂_{k|k−1} = F_{k−1} x̂_{k−1|k−1} + f′_{k−1}(u_{k−1}, θ_{f,k−1});    (B.18)
P_{k|k−1} = F_{k−1} P_{k−1|k−1} F^T_{k−1} + F″_{k−1} Q_{k−1} F″^T_{k−1};    (B.19)
C_{k−1} = [ F^T_{k−1} (F″_{k−1} Q_{k−1} F″^T_{k−1})^{−1} F_{k−1} + P^{−1}_{k−1|k−1} ]^{−1}.    (B.20)

As

∫_{−∞}^{∞} e^{−½ g(x(k−1), x(k))} dx(k − 1) = |(2π)^n C_{k−1}|^{1/2}    (B.21)

is independent of x(k), we get

Prior(x(k)) = |(2π)^n P_{k|k−1}|^{−1/2} e^{−½ (x(k) − x̂_{k|k−1})^T P^{−1}_{k|k−1} (x(k) − x̂_{k|k−1})}.    (B.22)

This is a Gaussian distribution with mean and covariance as obtained with the Kalman filter equations (B.18)–(B.19).

Measurement update:
Before the measurement is processed, x has the probability distribution Prior(x(k)), (B.22).
The measurement equation is z_k = g′_k(s_k, θ_{g,k}) + G_k x(k) + G″_k v_k. The probability of measuring the value z_k for a certain x(k), given the measurement covariance G″_k R_k G″^T_k, is (m is the dimension of the measurement vector z):

p(z_k | x(k), s_k, θ_g, g_k) = |(2π)^m G″_k R_k G″^T_k|^{−1/2} exp[ −½ (g′_k(s_k, θ_{g,k}) + G_k x(k) − z_k)^T (G″_k R_k G″^T_k)^{−1} (g′_k(s_k, θ_{g,k}) + G_k x(k) − z_k) ].    (B.23)

Post(x(k)) is proportional to the product of (B.22) and (B.23):

Post(x(k)) ∼ exp[ −½ (x(k) − x̂_{k|k−1})^T P^{−1}_{k|k−1} (x(k) − x̂_{k|k−1}) − ½ (g′_k(s_k, θ_{g,k}) + G_k x(k) − z_k)^T (G″_k R_k G″^T_k)^{−1} (g′_k(s_k, θ_{g,k}) + G_k x(k) − z_k) ]    (B.24)

1 Use the matrix inversion lemma for the expression of P^{−1}_{k|k−1}:
P_{k|k−1} = F_{k−1} P_{k−1|k−1} F^T_{k−1} + F″_{k−1} Q_{k−1} F″^T_{k−1}
P^{−1}_{k|k−1} = (F″_{k−1} Q_{k−1} F″^T_{k−1})^{−1} − (F″_{k−1} Q_{k−1} F″^T_{k−1})^{−1} F_{k−1} (F^T_{k−1} (F″_{k−1} Q_{k−1} F″^T_{k−1})^{−1} F_{k−1} + P^{−1}_{k−1|k−1})^{−1} F^T_{k−1} (F″_{k−1} Q_{k−1} F″^T_{k−1})^{−1}

The part that is dependent on x(k) can be written as:

Post(x(k)) ∼ e^{−½ (x(k) − x̂_{k|k})^T P^{−1}_{k|k} (x(k) − x̂_{k|k})};    (B.25)
P^{−1}_{k|k} = P^{−1}_{k|k−1} + G^T_k (G″_k R_k G″^T_k)^{−1} G_k;    (B.26)
x̂_{k|k} = P_{k|k} [ G^T_k (G″_k R_k G″^T_k)^{−1} (z_k − g′_k(s_k, θ_{g,k})) + P^{−1}_{k|k−1} x̂_{k|k−1} ].    (B.27)

This shows that the new distribution is again a Gaussian distribution. The mean and covariance are the ones obtained with the Kalman filter equations: the update of P^{−1}_{k|k} is as in formula (B.26); the update x̂_{k|k} equals x̂_{k|k−1} + K_k (z_k − g′_k(s_k, θ_{g,k}) − G_k x̂_{k|k−1}) with K_k = P_{k|k−1} G^T_k (G″_k R_k G″^T_k + G_k P_{k|k−1} G^T_k)^{−1}.

B.4 Kalman Smoother


Given Z^k, not only the pdf over x(k) but also the pdf over X(k) may be needed. The algorithm that computes an estimate of x(j), j < k, given Z^k is called the Kalman smoother.

B.5 EM with Kalman Filters


E-step

• compute p(X(k) | Z^k, U^{k−1}, S^k, θ_{k−1}, F_{k−1}, G_k, P(X(0)))

log p(X(k), Z^k | U^{k−1}, S^k, θ, F_{k−1}, G_k, P(X(0))) = log( p(x_1) ∏_{i=2}^{k} p(x_i | x_{i−1}) ∏_{i=1}^{k} p(z_i | x_i) )

• write Q

M-step Differentiate Q with respect to θ and maximize it.


Appendix C

Daum’s Exact Nonlinear Filter


Daum [33]:“Filtering problems with fixed finite-dimensional sufficient statistics are,
in some vague intuitive sense, extremely rare, but if such problems can be identified
and solved, the reward is very great.”
An exact filter that includes both the Kalman Filter [63] and the Beneš filter [12]. The description here: continuous process
equation and discrete measurements (“hybrid setup”). Daum’s filter is based on the exponential family of probability
distributions. The conditional density is said to belong to an exponential family if it is of the form:
p(x, t|Zk ) = a(x, t)b(Zk , t) exp[θT (x, t)φ(Zk , t)], (C.1)
where a(x, t) and b(Zk , t) are non-negative scalar valued functions. For smooth nowhere-vanishing conditional densities,
the exponential family is the most general class that has a sufficient statistic with fixed finite dimension (Fisher-Darmois-
Koopman-Pitman theorem, [31]). The practical significance of a fixed finite-dimensional sufficient statistic is that the
storage requirements and computational complexity do not grow as more and more measurements are accumulated.
continuous system update equation (called the Itô stochastic differential equation):
dx(t) = f (x(t), t)dt + G(t)dw (C.2)
Remark C.1 Further in this text we will denote x at a specific moment tk as xk .

discrete time measurements (Remark, a filter for continuous time measurements is also developed [32]):
z k = g(xk , tk , v k ) (C.3)

• dimension state: n, dimension measurement: m


• process noise w(t), with dw/dt zero-mean white noise, independent of x(t0). E(dw dw^T) = I dt;

• v_k independent of {w(t)} and x(t0); v_k has statistically independent values at discrete points in time.

Assumptions:

• p(x, t) is nowhere vanishing, is twice continuously differentiable in x and continuously differentiable in t, further-
more, p(x, t) approaches zero sufficiently fast as ||x|| → ∞ such that it satisfies Eq. (C.15);
• p(z k |xk ) is nowhere vanishing, is twice continuously differentiable in xk and z k
• for a given initial condition p(x, tk ), Eq. (C.15) has a unique bounded solution for all x and tk ≤ t ≤ tk+1 .

The solution to this filtering problem1 is:

p(x, t | Z^k) ∼ p(x, t) exp[θ^T(x, t) ψ(Z^k, t)]    (C.5)

where p(x, t) is the unconditional density. p(x, t) and θ(x, t) are independent of the measurements Z^k and can be calculated off-line (partial differential equations). ψ(Z^k, t) is calculated on-line (ordinary differential equations).

1 Unnormalized, i.e. ∫ p(x, t | Z^k) dx is not necessarily unity:

p(x, t | Z^k) = p(x, t) exp[θ^T(x, t) ψ(Z^k, t)] / ∫ p(x, t) exp[θ^T(x, t) ψ(Z^k, t)] dx    (C.4)


C.1 Systems for which this filter is applicable


The system (C.2)–(C.3) has the exponential pdf (C.5) as a sufficient statistic if, ∀z_k, M × M matrices A(t) and B_j(t) (j = 1, . . . , M) and an M-vector c(z_k, t_k) can be found such that the following equations have solutions θ(x, t) and ψ(t) (θ and ψ are M-vectors):

∂θ/∂t = (∂θ/∂x)(Q r^T − f) + ½ ξ − A θ;    (C.6)
½ (∂θ/∂x) Q (∂θ/∂x)^T = Σ_{j=1}^{M} θ_j B_j;    (C.7)
log[p(z_k | x)] = c^T(z_k, t_k) θ(x, t_k) + <whatever is constant in x>;    (C.8)
dψ(t)/dt = A^T(t) ψ(t) + Γ(t);    (C.9)

where

r = r(x, t) = (∂p(x, t)/∂x) / p(x, t);    (C.10)
Q = G G^T;    (C.11)
θ = [θ_1(x, t), . . . , θ_M(x, t)]^T;    (C.12)
ξ = [ξ_1, . . . , ξ_M]^T;  ξ_j = tr(Q ∂²θ_j/∂x²);    (C.13)
Γ = Γ(t) = [Γ_1, . . . , Γ_M]^T;  Γ_j = ψ^T B_j ψ.    (C.14)

C.2 Update equations


C.2.1 Off-line
p(x, t) (Fokker–Planck equation corresponding to Eq. (C.2)):

∂p/∂t = −(∂p/∂x) f − p tr(∂f/∂x) + ½ tr(Q ∂²p/∂x²)    (C.15)

θ(x, t) (Eq. (C.6)):

∂θ/∂t = (∂θ/∂x)(Q r^T − f) + ½ ξ − A θ;    (C.16)

C.2.2 On-line

ψ(Z^k, t) on-line.
System (Eq. (C.9)):

dψ(t)/dt = A^T(t) ψ(t) + Γ(t);    (C.17)
measurement (Eqs. (C.5), (C.8) and Bayes’ formula):

ψ(tk ) = ψ̄(tk ) + c(z k , tk ); (C.18)

where ψ̄(tk ) is the value of ψ before a measurement at time tk (solution of Eq. (C.17)) and ψ(tk ) is the value of ψ
immediately after the measurement z k at time tk . The initial condition (right before the first measurement) is ψ̄(t1 ) = 0.
Appendix D

Particle filters

D.1 Introduction
As mentioned in section 4.1, an assumption these filters make is the Markov assumption. The observations are also assumed
to be conditionally independent given the state.

D.2 Joint a posteriori density


In the most general formulation, we want to estimate a characteristic of our joint a posteriori distribution Post(X(k)) (e.g. the mean value, which means that h(X) = X):

E[h(X) | Post(X(k))] = ∫ h(X(k)) Post(X(k)) dX(k)    (D.1)

with Post(X(k)) as defined in (2.4) on page 20 and E[f | p] denoting the expected value of the function f under the pdf p.
In a sampling-based approach, we estimate the a posteriori distribution by drawing N samples from it:

Post(X(k)) ≈ (1/N) Σ_{i=1}^{N} δ_{X^i_k}(X_k)    (D.2)

where X^i_k denotes the i-th sample drawn from Post(X(k)) and δ denotes the Dirac function.

D.2.1 Importance sampling


The goal of all particle filters is to estimate characteristics of Post(X(k)) by using samples drawn from it. Because we don't know Post(X(k)) (and even if we did, it would be very hard to draw samples from it because of its complex shape), we'll use importance sampling (see chapter 8) to approximate the posterior, and we'll take the differences into account by multiplying the samples with their associated weights. This means that we'll approximate the expected value of eq. (D.1) as follows:

E[h(X(k)) | Post(X(k))] = ∫ h(X(k)) Post(X(k)) dX(k)
                        = ∫ h(X(k)) [Post(X(k)) / Prop(X(k))] Prop(X(k)) dX(k)    (D.3)

where Prop(X(k)) is the proposal distribution. It is a pdf with the same arguments as Post(X(k)) (as defined in equation 2.4), but it has a different "form".
Suppose we denote a certain sample (instantiation) of X(k) as X^i(k) and the ratio between the value of the a posteriori and the proposal pdf at that sample as w(X^i(k)) (or w^i in a shortened version):

w(X^i(k)) = w^i = Post(X^i(k)) / Prop(X^i(k))    (D.4)



then w(X(k)) is a function of X(k) and equation (D.3) becomes

E[h(X(k)) | Post(X(k))] = ∫ h(X(k)) w(X(k)) Prop(X(k)) dX(k)    (D.5)

We can thus obtain an estimate of our expected value with N samples of our proposal distribution:

E[h(X(k)) | Post(X(k))] ≈ (1/N) Σ_{i=1}^{N} h(X^i(k)) w(X^i(k))    (D.6)

This still doesn't allow for a recursive solution of our problem. Indeed, at a certain timestep k, this means we should choose a proposal density and sample N × k samples of dimension R^n from this proposal density. If, however, we would be able to formulate our problem in a recursive way, this would allow us to keep the number of samples we have to generate at a certain time instant constant (N).
Remark D.1 Note that this approach leaves us with samples of the joint a posteriori density! It can be proved that, provided
that enough samples are drawn, by taking the last vector of each of these samples, one obtains samples of the marginal pdf!

Remark D.2 Note that there also exist particle filters that use other Monte Carlo sampling methods than importance sampling. Markov chain Monte Carlo methods are often too computationally complex, but rejection methods [16] are also used!

D.2.2 Sequential importance sampling (SIS)


Obtaining the joint a posteriori distribution in a recursive way

To avoid too heavy a notation, we combine some symbols as we already did before (see section 2.4, remark 2.10 on page 21):

H_{k-1} = \{ u_{k-1}, \theta_f, f_{k-1} \}   (D.7)
I_k = \{ s_k, \theta_g, g_k \}   (D.8)

With these symbols, we rephrase the 3 most important equations for Markov systems (2.5, 2.6, 2.7 from section 2.4 on page 20) here for the joint a posteriori density:
Remark D.3 (FIXME: explain!) Note that the prediction step does not contain an integral here. Note also that we can formulate the iteration step here as a simple product of distributions, and do not really have to take the two-step prediction-correction approach!

Post(X(k)) = P(X(k) = X_k | Post(X(k-1)), H_{k-1}, I_k, z_k)

We can obtain this in a recursive way, using Bayes' rule and the Markov assumption:

P(X(k) = X_k | Post(X(k-1)), H_{k-1}, I_k, z_k)
 = \frac{P(z_k | X_k, Post(X(k-1)), H_{k-1}, I_k) \, P(X_k | Post(X(k-1)), H_{k-1}, I_k)}{P(z_k | Post(X(k-1)), H_{k-1}, I_k)}
 = \frac{P(z_k | x_k, I_k) \, P(X_k | Post(X(k-1)), H_{k-1}, I_k)}{P(z_k | Post(X(k-1)), H_{k-1}, I_k)}
 = \frac{P(z_k | x_k, I_k) \, P(X_{k-1}, x_k | Post(X(k-1)), H_{k-1}, I_k)}{P(z_k | Post(X(k-1)), H_{k-1}, I_k)}   (D.9)
 = \frac{P(z_k | x_k, I_k) \, P(x_k | X_{k-1}, Post(X(k-1)), H_{k-1}, I_k) \, P(X_{k-1} | Post(X(k-1)), H_{k-1}, I_k)}{P(z_k | Post(X(k-1)), H_{k-1}, I_k)}
 = \frac{P(z_k | x_k, I_k) \, P(x_k | x_{k-1}, H_{k-1}) \, Post(X(k-1))}{P(z_k | Post(X(k-1)), H_{k-1}, I_k)}
 = \frac{P(z_k | x_k, I_k) \, P(x_k | x_{k-1}, H_{k-1})}{P(z_k)} \, Post(X(k-1))

And thus we obtain the following recursive formula for Post(X(k)):

Post(X(k)) = \frac{P(z_k | x_k, I_k) \, P(x_k | x_{k-1}, H_{k-1})}{P(z_k)} \, Post(X(k-1))   (D.10)

Obtaining the Proposal distribution in a recursive way

PROOF Suppose we dispose of S samples x^i, i = 1 ... S, of a pdf p(x). For each of these samples, we know the pdf p(y | x = x^i) and we can sample from this distribution (thus obtaining y^i). Can we combine the x^i and y^i to obtain samples of the joint pdf p(x, y) = p(y|x) p(x)?

If the above can be proved, this allows us to solve the problem recursively. Indeed (see also eq. (2.1) on page 19),

Prop(X(k)) = Q(X(k) = X_k | Z^k, U^{k-1}, S^k, \theta_f, \theta_g, F^{k-1}, G^k, P(x(0)))
           = Q(X_{k-1}, x_k | Z^k, U^{k-1}, S^k, \theta_f, \theta_g, F^{k-1}, G^k, P(x(0)))
           = Q(x_k | X_{k-1}, Z^k, ..., P(x(0))) \, Q(X_{k-1} | Z^k, ..., P(x(0)))   (D.11)
           = Q(x_k | x_{k-1}, z_k, H_{k-1}, I_k) \, Prop(X(k-1))

We can thus use this formula recursively, starting from an a priori proposal distribution.

Combining those 2

Starting from the definition of the weights (D.4), and using both formulas for the recursion of the proposal density (D.11)
and the a posteriori density (D.10), we obtain

w(X^i(k)) = \frac{Post(X^i(k))}{Prop(X^i(k))}
          = \frac{Post(X^i(k-1)) \, \frac{P(z_k | x_k, I_k) \, P(x_k | x_{k-1}, H_{k-1})}{P(z_k)}}{Prop(X^i(k-1)) \, Q(x_k | x_{k-1}, z_k, H_{k-1}, I_k)}   (D.12)
          = \alpha \, w(X^i(k-1)) \, \frac{P(z_k | x_k, I_k) \, P(x_k | x_{k-1}, H_{k-1})}{Q(x_k | x_{k-1}, z_k, H_{k-1}, I_k)}

The unknown normalizing factor \alpha = \frac{1}{P(z_k)} looks like a serious problem, but it is not: this factor does not depend on the estimated state vector and can thus be put in front of the integral in eq. (D.5).
We can avoid the unknown normalizing factor \alpha by working with normalized weights \tilde{w}(X^i(k)):

\tilde{w}(X^i(k)) = \frac{w(X^i(k))}{\sum_{i=1}^{N} w(X^i(k))}   (D.13)

This results in algorithm 8.

Algorithm 8 Generic particle filter algorithm
  Sample N samples from the a priori density
  for i = 1 to N do
    Sample x^i_k from Q(x_k | x^i_{k-1}, z_k, H_{k-1}, I_k)
    Assign the particle a weight according to Eq. (D.12)
  end for
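As an illustration only, the sketch below implements one iteration of this generic algorithm in Python under strong simplifying assumptions: a scalar random-walk system model with Gaussian noise is used both as system model and as proposal density Q (the "bootstrap" choice), so that the weight update of Eq. (D.12) reduces to a multiplication with the measurement likelihood. The model, the noise levels and all names are made up; this is not the document's implementation.

# Schematic sketch of one SIS iteration (Algorithm 8 / Eq. (D.12)) for an assumed
# 1-D random-walk state with a direct noisy measurement of the state.
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

def sis_step(particles, weights, z_k, sigma_proc=0.5, sigma_meas=0.3):
    """Propagate every particle through the proposal and re-weight it."""
    # 1. Sample x_k^i from Q; here Q is chosen equal to the system model P(x_k | x_{k-1})
    particles = particles + rng.normal(0.0, sigma_proc, size=particles.shape)
    # 2. Weight update, Eq. (D.12): with this Q the ratio reduces to P(z_k | x_k)
    weights = weights * gauss_pdf(z_k, particles, sigma_meas)
    # 3. Normalise, Eq. (D.13); this also absorbs the unknown factor alpha
    weights = weights / np.sum(weights)
    return particles, weights

# Usage: start from N samples of the a priori density and process a fake measurement sequence
N = 500
particles = rng.normal(0.0, 1.0, size=N)
weights = np.full(N, 1.0 / N)
for z_k in [0.2, 0.4, 0.3]:            # illustrative measurements
    particles, weights = sis_step(particles, weights, z_k)
print("posterior mean estimate:", np.sum(weights * particles))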

D.3 Theory vs. reality


After a few iteration steps, one (or a few) of the weights becomes very large (close to one), whereas the other weights become negligible. This is called the degeneracy phenomenon. There are 2 solutions for this: resampling and a good choice of the proposal density (you did notice we didn't tell you anything yet about the choice of the proposal density, didn't you?).
This last issue (the choice of the proposal density) is not only important to avoid degeneracy, it also strongly influences the variance of the sample weights and thus the convergence of the filter!

D.3.1 Resampling (SIR)


FIXME: this and the next section should still be written. Discuss basic resampling and the sample impoverishment problem.
FIXME: include algorithm. Resampling can be done in O(N).
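As a rough illustration of such an O(N) scheme, the sketch below implements systematic resampling: it replaces the weighted particle set by an equally weighted set in which particles are duplicated roughly in proportion to their weights. This is one common choice, not necessarily the algorithm meant above, and the names are illustrative.

# Sketch of systematic resampling in O(N): a single forward sweep over the cumulative
# weights and over N evenly spaced positions (with one common random offset).
import numpy as np

def systematic_resample(particles, weights, rng=np.random.default_rng(2)):
    N = len(weights)
    positions = (rng.uniform() + np.arange(N)) / N
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                  # guard against round-off
    indices = np.zeros(N, dtype=int)
    i, j = 0, 0
    while i < N:                          # both indices only move forward, hence O(N)
        if positions[i] < cumulative[j]:
            indices[i] = j
            i += 1
        else:
            j += 1
    return particles[indices], np.full(N, 1.0 / N)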

D.3.2 Choice of the proposal density


FIXME: include a number of variants and describe them. Discuss the ideal (minimal variance) choice: [38]. In reality this is almost never possible! Some other variants are:

• The auxiliary particle filter [91]

• The regularized particle filter [84]

D.4 Literature
• The SMC homepage1 has lots of useful links to papers, videos, software . . .

• Good tutorials: [6]


• Arnaud Doucet and others have written several interesting papers [40, 38] and books [39, 37] about particle filters

• Sebastian Thrun and others have written several papers about applications of particle filters [81, 79, 35].

FIXME: update this! • ...

D.5 Software
• The Bayesian Filtering Library (BFL)2 of Klaas Gadeyne contains (amongst others) C++ support for particle filters.

• The Player Stage Project3 has a particle filter implementation in C for mobile robots (FIXME: check this)

• Bayes++4 also contains an implementation of a SIR filter (and several other “schemes” for Bayesian filtering)

1 http://www-sigproc.eng.cam.ac.uk/smc/
2 http://people.mech.kuleuven.ac.be/˜kgadeyne/bfl.html
3 http://playerstage.sourceforge.net/
4 http://www.acfr.usyd.edu.au/technology/bayesianfilter/Bayes++.htm
Appendix E

The EM algorithm, M-step, proofs


The M-step of the EM algorithm calculates a \theta = \theta_k that increases Q(\theta, \theta_{k-1}) (Eq. (5.3) or Eq. (5.4)). This estimate is guaranteed to increase the (incomplete-data) likelihood function of Eq. (5.2). The proof is given here.
For ease of notation, "U^{k-1}, S^k, F^{k-1}, G^k, P(X(0))" is abbreviated as H^k.

Proof: part one


In this paragraph, we prove that, \forall \theta_k,

E[ \log p(X(k) | Z^k, H^k, \theta_k) | Z^k, H^k, \theta_{k-1} ] - E[ \log p(X(k) | Z^k, H^k, \theta_{k-1}) | Z^k, H^k, \theta_{k-1} ] \le 0.

Indeed,

E[ \log p(X(k) | Z^k, H^k, \theta_k) | Z^k, H^k, \theta_{k-1} ] - E[ \log p(X(k) | Z^k, H^k, \theta_{k-1}) | Z^k, H^k, \theta_{k-1} ]

 = \int [ \log p(X(k) | Z^k, H^k, \theta_k) - \log p(X(k) | Z^k, H^k, \theta_{k-1}) ] \, p(X(k) | Z^k, H^k, \theta_{k-1}) \, dX(k)

 = \int \log \frac{p(X(k) | Z^k, H^k, \theta_k)}{p(X(k) | Z^k, H^k, \theta_{k-1})} \, p(X(k) | Z^k, H^k, \theta_{k-1}) \, dX(k)

because \log(x) \le x - 1, \forall x,

 \le \int \left[ \frac{p(X(k) | Z^k, H^k, \theta_k)}{p(X(k) | Z^k, H^k, \theta_{k-1})} - 1 \right] p(X(k) | Z^k, H^k, \theta_{k-1}) \, dX(k)

 = \int p(X(k) | Z^k, H^k, \theta_k) \, dX(k) - \int p(X(k) | Z^k, H^k, \theta_{k-1}) \, dX(k)

 = 0.

Proof: part two


In this paragraph we show that a \theta_k that increases Q(\theta, \theta_{k-1}) will increase the logarithm of the (incomplete-data) likelihood function,

\log p(Z^k | H^k, \theta_k) > \log p(Z^k | H^k, \theta_{k-1});   (E.1)

hence it will increase the (incomplete-data) likelihood function itself (Eq. (5.2)).
We know that

p(X(k), Z^k | H^k, \theta) = p(X(k) | Z^k, H^k, \theta) \, p(Z^k | H^k, \theta);

hence:

\log p(Z^k | H^k, \theta) = \log p(X(k), Z^k | H^k, \theta) - \log p(X(k) | Z^k, H^k, \theta).

When averaging over X(k), given the pdf p(X(k) | Z^k, H^k, \theta_{k-1}) calculated in the E-step, this becomes (the term on the left hand side is independent of X(k)):

\log p(Z^k | H^k, \theta) = E[ \log p(X(k), Z^k | H^k, \theta) | Z^k, H^k, \theta_{k-1} ] - E[ \log p(X(k) | Z^k, H^k, \theta) | Z^k, H^k, \theta_{k-1} ].

Hence the change in Eq. (E.1) between two updates is:

\log p(Z^k | H^k, \theta_k) - \log p(Z^k | H^k, \theta_{k-1}) =
  E[ \log p(X(k), Z^k | H^k, \theta_k) | Z^k, H^k, \theta_{k-1} ] - E[ \log p(X(k), Z^k | H^k, \theta_{k-1}) | Z^k, H^k, \theta_{k-1} ]
  - E[ \log p(X(k) | Z^k, H^k, \theta_k) | Z^k, H^k, \theta_{k-1} ] + E[ \log p(X(k) | Z^k, H^k, \theta_{k-1}) | Z^k, H^k, \theta_{k-1} ].

The first two terms on the right hand side equal Q(\theta_k, \theta_{k-1}) - Q(\theta_{k-1}, \theta_{k-1}). When Eq. (5.3) or Eq. (5.4) is satisfied, this is strictly positive. Part one of the proof showed that the last two terms on the right hand side give a non-negative sum, for all values of \theta_k. This implies that, if Eq. (5.3) or Eq. (5.4) is satisfied, the left hand side is positive, i.e. \theta_k increases the logarithm of the incomplete-data likelihood function, and hence also increases the incomplete-data likelihood function itself (Eq. (5.2)).
Appendix F

Bayesian (belief) networks


F.1 Introduction
Definition F.1 (Belief networks) (from [62]) Belief networks are a widely applicable formalism for compactly representing
the joint probability distribution over a set of random variables.

A Bayesian network provides a model representation for the joint distribution of a set of variables in terms of conditional
and prior probabilities, in which the orientations of the arrows represent influence (usually though not always of a causal
nature), such that these conditional probabilities for these particular orientations are relatively straightforward to specify.
When data are observed, then typically an inference procedure is required. This involves calculating marginal probabilities
conditional on the observed data using Bayes’ theorem, which is diagrammatically equivalent to reversing one or more of
the Bayesian network arrows.
Features:

• Conditional independence properties can be used to simplify the general factorization formula for the joint probability.
In some cases, this can be very important to provide an efficient basis for the implementation of some MCMC variants
such as Gibbs sampling [55].
• The resulting factorization can be expressed by the use of a DAG

A Bayesian Network is a directed acyclic graph (DAG), whose structure defines a set of conditional independence (often denoted as \perp\perp) properties. This follows from the fact that any PDF can be factorised as

P(X_1, ..., X_n) = P(X_1 | X_2 ... X_n) ... P(X_{n-1} | X_n) P(X_n)

Recursive factorization:

P(X_1, ..., X_n) = \prod_{i=1}^{n} P(X_i | parents(X_i))
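As a small illustration of this recursive factorization, the sketch below evaluates the joint probability of a hypothetical three-node chain A -> B -> C with binary variables; all probability tables are made up.

# The joint P(A, B, C) is obtained from the DAG factorization P(A) P(B|A) P(C|B)
# instead of from a full 2^3 table.
p_A = {0: 0.6, 1: 0.4}                                       # P(A)
p_B_given_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}     # P(B | A), keyed as [a][b]
p_C_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}     # P(C | B), keyed as [b][c]

def joint(a, b, c):
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Consistency check: the factorized joint sums to one over all configurations
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))   # 1.0

# Marginalising over the childless node C simply removes its factor
print(sum(joint(1, 0, c) for c in (0, 1)), p_A[1] * p_B_given_A[1][0])       # equal values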

Marginalising over a childless node is equivalent to simply removing it and any edges to it from its parents.
Directed acyclic graphs can always have their nodes linearly ordered so that for each node X all of its parents Pa(X) precede it in the ordering. This is called a topological ordering.
Directed Markov property
A variable is conditionally independent of its non-descendants given its parents:

X \perp\perp nd(X) | parents(X)

where nd(X) denotes the non-descendants of X.


Undirected graphical models, also called Markov Random Fields (MRFs)

F.2 Inference in Bayesian networks

Appendix G

Entropy and information

The concept of entropy H arises from an equally important concept called (self)-information I. The following sections
define these concepts and the relation between them. A good book on this subject is [29].

G.1 Shannon entropy


Shannon [106, 107] defined a measure of the “amount of uncertainty” or “the amount of chaos” or “the lack of information”
represented by a probability distribution: the Shannon entropy or informational entropy.
Shannon looked for a measure of uncertainty of a discrete probability distribution (p(x = x1 ) = p1 , . . ., p(x = xn ) = pn )
with the following properties [106, 107]:

• H should be continuous in the pi

• If all the pi are equal, pi = 1/n, then H should be a monotonic increasing function of n. With equally likely events
there is more choice, or uncertainty, when there are more possible events.

• If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. E.g. with three possible values p_1 = 1/2, p_2 = 1/3, and p_3 = 1/6: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 H(2/3, 1/3)

[106, 107] prove that the only H satisfying the three above assumptions is of the form:

H(x) = -K \sum_{i=1}^{n} p_i \log p_i   (G.1)

where K is a positive constant. Shannon defined entropy as

H(x) = -\sum_{i=1}^{n} p_i \log p_i = E[-\log p(x)]   (G.2)

where any choice of "log" is possible; this changes only the units of the entropy result (e.g. log_2: [bits], ln: [nats]). He also extended this to the continuous case (differential entropy):

H(x) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx = E[-\log p(x)]   (G.3)

E.g. for a Gaussian distribution^1 (d-dimensional state):

p(x) = \frac{1}{\sqrt{(2\pi)^d |P|}} \, e^{-\frac{1}{2}(x-\mu)^T P^{-1} (x-\mu)}   (G.4)

H(x) = \frac{1}{2} \log\left( (2\pi e)^d |P| \right) = \log \sqrt{(2\pi e)^d |P|}   (G.5)
1 The Gaussian distribution has an important special entropy characterization: under the assumption of a fixed covariance matrix, the function that maximizes the entropy is Gaussian [106, 107].


where |P | is the determinant of the covariance matrix.
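As a small numerical illustration of Eq. (G.2) and Eq. (G.5), the sketch below computes the entropy of a discrete uniform distribution and the differential entropy of a Gaussian; natural logarithms are used (results in nats), and the numbers are made up.

import numpy as np

# Discrete case, Eq. (G.2): a uniform distribution over 4 outcomes has entropy log(4)
p = np.array([0.25, 0.25, 0.25, 0.25])
print(-np.sum(p * np.log(p)), np.log(4))          # both equal 1.386... nats

# Continuous case, Eq. (G.5): d-dimensional Gaussian with covariance matrix P
P = np.array([[1.0, 0.3],
              [0.3, 2.0]])
d = P.shape[0]
print(0.5 * np.log((2.0 * np.pi * np.e)**d * np.linalg.det(P)))   # differential entropy in nats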


There is one important difference between the entropy of continuous and discrete distributions. In the discrete case, the
entropy measures the randomness of the chance variable in an absolute way. In the continuous case, the measurement is
relative to the coordinate system: this means that if we change the coordinates, the entropy will change. The entropy in
the continuous case can be considered as a measure of randomness relative to an assumed standard, namely the coordinate
system chosen with each small volume element dx1 . . . dxn given equal weight. As the scale of measurements is set to an
arbitrary zero corresponding to a uniform distribution over this unit volume, the entropy of a continuous distribution can
be negative. Differences between two entropies of the pdf expressed in the same coordinate system, however, do not depend
on the choice of this coordinate frame.

G.2 Joint entropy


The joint entropy is defined as the entropy of the joint distribution.
For discrete distributions:

H(x, y) = -\sum_{X} \sum_{Y} p(x, y) \log p(x, y)   (G.6)

X and Y define all possible values for x and y.

For continuous distributions:

H(x, y) = -\int_{X} \int_{Y} p(x, y) \log p(x, y) \, dy \, dx   (G.7)

G.3 Conditional entropy


The conditional entropy is not the entropy of the posterior conditional distribution p(y | x = x_k); instead it is defined as follows.
For discrete distributions:

H(y|x) = -\sum_{X} \sum_{Y} p(x, y) \log p(y|x) = -\sum_{X} \sum_{Y} p(x, y) \log \frac{p(x, y)}{p(x)}   (G.8)

For continuous distributions:

H(y|x) = -\int_{X} \int_{Y} p(x, y) \log p(y|x) \, dx \, dy = -\int_{X} \int_{Y} p(x, y) \log \frac{p(x, y)}{p(x)} \, dx \, dy   (G.9)

Some (in)equalities related to the conditional entropy are:

H(x, y) = H(x) + H(y|x) = H(y) + H(x|y)   (G.10)
H(y|x) \neq H(x|y)   (G.11)
H(x, y|z) = H(x|z) + H(y|x, z)   (G.12)
H(y|x) \le H(y)   (G.13)

G.4 Relative entropy


The concept of relative entropy is also known under the names Kullback-Leibler information or Kullback-Leibler distance [69, 68], mutual entropy, informational divergence, information for discrimination or cross entropy. It represents a measure for the goodness of fit or closeness of two distributions p_1(x) and p_2(x):

D(p_2(x) || p_1(x)) = E\left[ \log \frac{p_2(x)}{p_1(x)} \right]   (G.14)

For discrete distributions:

D(p_2(x) || p_1(x)) = \sum_{i=1}^{n} p_2(x_i) \log \frac{p_2(x_i)}{p_1(x_i)}   (G.15)
                    = \sum_{i=1}^{n} p_2(x_i) \log p_2(x_i) - \sum_{i=1}^{n} p_2(x_i) \log p_1(x_i)   (G.16)

For continuous distributions:

D(p_2(x) || p_1(x)) = \int_{-\infty}^{\infty} p_2(x) \log \frac{p_2(x)}{p_1(x)} \, dx   (G.17)
                    = \int_{-\infty}^{\infty} p_2(x) \log p_2(x) \, dx - \int_{-\infty}^{\infty} p_2(x) \log p_1(x) \, dx   (G.18)

Note: the relative entropy is not symmetric:

D(p_2(x) || p_1(x)) \neq D(p_1(x) || p_2(x))   (G.19)
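As a small numerical illustration of Eq. (G.15) and of the asymmetry stated in Eq. (G.19), the sketch below evaluates the relative entropy between two made-up discrete distributions.

import numpy as np

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])

def kl(q, p):
    """D(q || p) = sum_i q_i log(q_i / p_i), in nats."""
    return np.sum(q * np.log(q / p))

print(kl(p2, p1))   # D(p2 || p1)
print(kl(p1, p2))   # D(p1 || p2): a different value, the measure is not symmetric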

G.5 Mutual information

Mutual information I(x, y) is the reduction in the uncertainty of x due to the knowledge of y.
For discrete distributions:

I(x, y) = \sum_{X} \sum_{Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}   (G.20)
        = D(p(x, y) || p(x) p(y))   (G.21)

For continuous distributions:

I(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy   (G.22)
        = D(p(x, y) || p(x) p(y))   (G.23)

I(x, y) is always non-negative: I(x, y) ≥ 0.


x says as much about y as y says about x:
I(x, y) = I(y, x) (G.24)

The relation between entropy and mutual information is (see figure G.1):

I(x, y) = H(x) - H(x|y)   (G.25)
        = H(y) - H(y|x)   (G.26)
        = H(x) + H(y) - H(x, y)   (G.27)
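As a small check of these relations, the sketch below computes the mutual information of a made-up 2x2 joint probability table, once directly from Eq. (G.20) and once via Eq. (G.27).

import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])          # joint P(x, y), illustrative numbers
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H = lambda p: -np.sum(p * np.log(p))     # Shannon entropy in nats (all entries > 0 here)

I_direct = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))   # Eq. (G.20)
I_from_H = H(p_x) + H(p_y) - H(p_xy)                          # Eq. (G.27)
print(I_direct, I_from_H)                # identical up to round-off, and >= 0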

G.6 Principle of maximum entropy

Principle of maximum entropy [60]: When making inferences based on incomplete information, the pdf with maximum
entropy is the least biased estimate possible on the given information; i.e. it is maximally noncommittal with regard to
missing information.
The intuition is that we should make the least possible additional assumptions about p.
It turns out that there is always a unique maximal entropy measure.

G.7 Principle of minimum cross entropy

Principle of minimum cross entropy [69, 68]: among the pdfs satisfying the given constraints, choose the one that is as close to the prior distribution as possible in the Kullback-Leibler sense. When the prior distribution is uniform, this is equivalent to maximizing the Shannon entropy (section G.6).

[Figure G.1: Relation between entropy and mutual information. Venn-diagram view: H(x, y) covers both H(x) and H(y); their overlap is I(x, y), and the remaining parts are H(x|y) and H(y|x).]

G.8 Maximum likelihood estimation


The maximum likelihood estimation is equivalent to the minimum Kullback-Leibler distance estimation:

\hat{x} = \arg\min_{x} D(p(Z^k) || p(Z^k | x))   (G.28)

i.e. the maximum likelihood estimation (or the maximum a posteriori probability estimation) is looking for a point \hat{x}, which is not necessarily unique, that minimizes the Kullback-Leibler distance between p(Z^k | x) and the empirical distribution p(Z^k) (possibly modified by the prior).
Appendix H

Fisher information matrix and Cramér-Rao


lower bound

The inverse of the Fisher information matrix determines a lower bound on the covariance matrix of the estimate that can
be obtained with an efficient estimator, given the measurements. Note that the covariance matrix is a good measure of the
uncertainty on the estimate if we are interested in a single value estimate: with the expected value of the distribution as
estimate, the covariance matrix expresses the covariance of the deviations between this estimate and the real value1 . For
a multimodal distribution with small peaks, the covariance matrix will be large, in contrast to the entropy measures which
will be small. If, on the other hand, we are not interested in a single value estimate e.g. because our estimate is intrinsically
multimodal, the covariance matrix measure is not a good measure.
The next section describes the Fisher information matrix and Cramér-Rao lower bound for the estimation of a non random
state vector, Section H.2 for a random state vector. The original derivation of the Fisher information matrix and the Cramér-
Rao lower bound is made for the non random case: given a number of measurements, we want to estimate a static state
(parameter) x. The random case is an extension to Bayesian estimation: given a number of measurements and an a priori
distribution of the state x, we want to estimate the state x. The extension is also valid for dynamic states, changing in time
according to a process function with process uncertainty.
For more info, see [120].

H.1 Non random state vector estimation


H.1.1 Fisher information matrix
The Fisher information matrix [48] for a non random state (parameter) vector is defined as the covariance of the gradient of the log-likelihood, that is:

I(x) = E[ (\nabla_x \ln p(Z^k|x)) (\nabla_x \ln p(Z^k|x))^T ]   (H.1)
     = -E[ \nabla_x \nabla_x^T \ln p(Z^k|x) ]   (H.2)

where \nabla_x = [\frac{\partial}{\partial x_1} ... \frac{\partial}{\partial x_n}]^T is the gradient operator with respect to x = [x_1 ... x_n], and \nabla_x \nabla_x^T is the Hessian matrix. E[.] is the expected value with respect to p(Z^k|x). This measure was introduced by Fisher as a measure of the amount of information about x, present in the measurements. The elements of the matrix I(x) are:

I_{ij}(x) = E\left[ -\frac{\partial^2 \ln p(Z^k|x)}{\partial x_i \partial x_j} \right]   (H.3)

H.1.2 Cramér-Rao lower bound


The inverse of the Fisher information matrix, also called the Cramér-Rao lower bound, is a lower bound on the covariance matrix^2 for an unbiased estimator T(x) of x [43, 95, 30]:

var(T) \ge I^{-1}(x^*).   (H.4)
1 Note that this is the estimate which has the smallest covariance of the deviations to the real state.
2 The assumption of the normality of the estimate is not necessary.

95
96 APPENDIX H. FISHER INFORMATION MATRIX AND CRAMÉR-RAO LOWER BOUND

I(x^*) is the Fisher information matrix evaluated at the true state vector x^*. The matrix inequality (H.4) means that var(T) - I^{-1}(x^*) is positive semi-definite. The bound above depends on the actual state value. Hence, it is not possible to compute the bound in real estimation cases where the states are unknown. However, the bound can be used to analyse and evaluate estimators in simulations.
The unbiased estimator T(x) is efficient if var(T) = I^{-1}(x^*). Note that it is possible that no estimator meets this lower bound.
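As a small Monte Carlo illustration of this bound, consider the simple assumed model of N i.i.d. Gaussian measurements z_i ~ N(x, sigma^2) of a fixed scalar x with known sigma. For this model I(x) = N / sigma^2, and the sample mean is an unbiased, efficient estimator, so its empirical variance should coincide with I^{-1}(x^*) = sigma^2 / N. All numbers below are made up.

import numpy as np

rng = np.random.default_rng(3)
x_true, sigma, N, trials = 2.0, 0.5, 20, 20_000

# Repeat the experiment many times and look at the spread of the sample-mean estimator
estimates = np.array([np.mean(rng.normal(x_true, sigma, size=N)) for _ in range(trials)])

print("empirical var(T):", np.var(estimates))
print("CRLB sigma^2 / N:", sigma**2 / N)    # the two numbers nearly coincide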

H.2 Random state vector estimation


The Fisher information matrix as defined above is for the estimation of a non random state. In the Bayesian approach to estimation, the state vector is random (uncertain) with an a priori probability distribution. The definition of the Fisher information matrix is extended to this case and, as was the case for the estimation of non random states, the inverse of this Fisher information matrix is also the Cramér-Rao lower bound for the mean square error [120, 117].

H.2.1 Fisher information matrix

The Fisher information matrix for a random state vector x_k is defined as the covariance of the gradient of the total log-probability, that is:

I_{k|k} = E[ (\nabla_{x_k} \ln p(x_k, Z^k)) (\nabla_{x_k} \ln p(x_k, Z^k))^T ]   (H.5)
        = -E[ \nabla_{x_k} \nabla_{x_k}^T \ln p(x_k, Z^k) ]   (H.6)

i.e. the elements of the matrix I_{k|k} are

I_{k|k,ij} = E\left[ -\frac{\partial^2 \ln p(Z^k, x_k)}{\partial x_{k,i} \partial x_{k,j}} \right]   (H.7)

The mean E[.] is taken over the distribution p(x_k, Z^k).

H.2.2 Alternative expressions for the information matrix

• I_{k|k} can be divided into I_{k|k,D} and I_{k|k,P} (provided that these exist):

  I_{k|k} = -E[ \nabla_{x_k} \nabla_{x_k}^T \ln p(x_k, Z^k) ]   (H.8)
  I_{k|k,D} + I_{k|k,P} = E[ -\nabla_{x_k} \nabla_{x_k}^T \ln p(Z^k|x_k) ] + E[ -\nabla_{x_k} \nabla_{x_k}^T \ln p(x_k) ]   (H.9)

  I_{k|k,D} is the information obtained from the data, I_{k|k,P} represents the information in the prior distribution p(x_k).

• The information matrix can also be described as a function of the posterior distribution p(x_k|Z^k):

  I_{k|k} = -E[ \nabla_{x_k} \nabla_{x_k}^T \ln p(x_k, Z^k) ]   (H.10)
          = E[ -\nabla_{x_k} \nabla_{x_k}^T \ln p(x_k|Z^k) ] + E[ -\nabla_{x_k} \nabla_{x_k}^T \ln p(Z^k) ]   (H.11)
          = E[ -\nabla_{x_k} \nabla_{x_k}^T \ln p(x_k|Z^k) ]   (H.12)

• A recursive formulation is possible for Markovian models:

  p(Z^k, x_k) = p(Z^{k-1}, x_k) \, p(z_k|x_k)   (H.13)

  I_{k|k} = I_{k|k-1} - E[ \nabla_{x_k} \nabla_{x_k}^T \ln p(z_k|x_k) ]   (H.14)

H.2.3 Cramér-Rao lower bound

The Cramér-Rao bound for a random state vector xk is called the Van Trees version of the Cramér-Rao bound, or the
posterior Cramér-Rao bound [117]. As was the case for the estimation of non random states, the Cramér-Rao lower bound
is the inverse of the Fisher information matrix I k|k .
H.3. ENTROPY AND FISHER 97

H.2.4 Example: Gaussian distribution


If p(x_k|Z^k) is Gaussian:

I_{k|k} = E[ -\nabla_{x_k} \nabla_{x_k}^T \ln p(x_k|Z^k) ]   (H.15)
        = E[ -\nabla_{x_k} \nabla_{x_k}^T ( c_0 - \frac{1}{2} (x_k - \mu_k)^T P_k^{-1} (x_k - \mu_k) ) ]   (H.16)
        = E[ P_k^{-1} ]   (H.17)

If we obtain an efficient estimator for x_k, the Fisher information will simply be given by the inverse of the error covariance matrix of the state: I_{k|k} = P_k^{-1}.

H.2.5 Example: Kalman Filtering


For a linear system model, the Fisher information will be given by the Kalman filter formulas for the covariance matrix: I_{k|k} = P_k^{-1}.
For a nonlinear system model, the Fisher information will be given by the extended Kalman filter formulas for the covariance matrix if all derivatives are evaluated at the true state value.

H.2.6 Example: Cramér-Rao lower bound on a part of the state vector


Assume that the state vector x_k is decomposed into two parts x_k = [x_{k,\alpha}^T \; x_{k,\beta}^T]^T, and the information matrix I_{k|k} is correspondingly decomposed into blocks

I_{k|k} = \begin{bmatrix} I_{\alpha\alpha} & I_{\alpha\beta} \\ I_{\beta\alpha} & I_{\beta\beta} \end{bmatrix}   (H.18)

then, assuming that I_{\alpha\alpha}^{-1} exists, the covariance matrix of the estimate of x_{k,\beta} is bounded by

P_{k,\beta} \ge \left( I_{\beta\beta} - I_{\beta\alpha} I_{\alpha\alpha}^{-1} I_{\alpha\beta} \right)^{-1}.   (H.19)
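As a small numerical check of Eq. (H.19), the sketch below verifies that the beta-block of the inverse information matrix equals the inverse of the Schur complement; the information matrix and the partitioning are made up.

import numpy as np

I_kk = np.array([[4.0, 1.0, 0.5],
                 [1.0, 3.0, 0.2],
                 [0.5, 0.2, 2.0]])       # a positive definite information matrix (illustrative)
na = 2                                   # the first two states form x_alpha, the last one x_beta
I_aa, I_ab = I_kk[:na, :na], I_kk[:na, na:]
I_ba, I_bb = I_kk[na:, :na], I_kk[na:, na:]

bound_schur = np.linalg.inv(I_bb - I_ba @ np.linalg.inv(I_aa) @ I_ab)
bound_block = np.linalg.inv(I_kk)[na:, na:]
print(bound_schur, bound_block)          # identical: the lower bound on the covariance of x_beta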

H.3 Entropy and Fisher


There is a relation between entropy and the Fisher information matrix, namely de Bruijn's identity. If x is a random variable with finite variance and pdf p(x), and y is an independent, normally distributed random variable with mean 0 and variance 1:

\frac{\partial}{\partial t} H_e(x + \sqrt{t}\, y) = \frac{1}{2} I(x + \sqrt{t}\, y)   (H.20)

If the limit exists as t \to 0:

\frac{\partial}{\partial t} H_e(x + \sqrt{t}\, y) \Big|_{t=0} = \frac{1}{2} I(x)   (H.21)

Fisher information represents the local behaviour of the relative entropy: it indicates the rate of change of information in a given direction of the probability manifold. For two distributions p(z|x) and p(z|x_0) [68]:

D(p(z|x) || p(z|x_0)) \sim \frac{1}{2} I(x) (x - x_0)^2   (H.22)

I(x) = \sum_{Z} p(z|x) \left( \frac{\partial}{\partial x} \ln p(z|x) \right)^2   (H.23)
Bibliography

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. Petrov and F. Csaki, editors, Proceedings of the Second International Symposium in Information Theory, pages 267–81. Akadémiai Kiadó, Budapest, Hungary, 1973.

[2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–23, 1974.

[3] H. Akaike. On the entropy maximization principle. In P. Krishniah, editor, Applications of Statistics, pages 27–41. North-Holland, Amsterdam, 1977.

[4] H. Akaike. Prediction and entropy. In A. Atkinson and S. Fienberg, editors, A Celebration of Statistics, pages 1–24. Springer, New York, 1985.

[5] D. Alspach and H. Sorenson. Nonlinear bayesian estimation using gaussian sum approximations. IEEE Transactions
on Automatic Control, 17(4):439–448, August 1972.

[6] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A Tutorial on Particle Filters for Online Nonlinear/Non-
gaussian Bayesian Tracking. IEEE Transactions on Signal Processing, 50(2):174–188, february 2002. http:
//www-sigproc.eng.cam.ac.uk/˜sm224/ieeepstut.ps.

[7] K. J. Astrom. Optimal control of markov decision processes with incomplete state estimation. J. Math. Anal. Appl.,
10:174–205, 1965.

[8] Bar-Shalom and X. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1993.

[9] A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence,
72:81–138, 1995.

[10] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.

[11] R. Bellman. A markov decision process. Journal of Mathematical Mechanics, 6:679–684, 1957.

[12] V. Beneš. Exact finite-dimensional filters for certain diffusions with nonlinear drift. Stochastics, 5:65–92, 1981.

[13] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley series in probability and statistics. John Wiley & Sons,
repr. edition, 2001.

[14] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volume I. Athena Scientific, Belmont Massachusetts,
1995.

[15] D. P. Bertsekas. Dynamic Programming and Optimal Control, Volume II. Athena Scientific, Belmont Massachusetts,
1995.

[16] E. Bølviken, P. Acklam, N. Christophersen, and J.-M. Størdal. Monte Carlo filters for non-linear state estimation.
Automatica, 37(2):177–183, 2001. http://www.math.uio.no/˜erikb/automatica.pdf.

[17] B. Bonet and H. Geffner. Planning with incomplete information as heuristic search in belief space. In Proc. of the
5th International Conference on AI PLanning and Scheduling,AAAI Press, pages 52–61, Colorado, 2000.

[18] B. Bonet and H. Geffner. Planning as heuristic search. Artificial Intelligence, Special issue on Heuristic Search,
129(1–2):5–33, 2001.

[19] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage.
Journal of Artificial Intelligence Research, 11:1–94, 1999.


[20] C. Boutilier and D. Poole. Computing optimal policies for partially observable decision processes using compact
representations. AAA, 2:1168–1175, 1996.
[21] S. Boyd and L. Vandenberghe. Convex Optimization. http://www.ee.ucla.edu/∼vandenbe/publications.html. Course
reader for EE364 (Stanford) and EE236B (UCLA), and draft of a book that will be published in 2003.
[22] G. Calafiore, M. Indri, and B. Bona. Robot dynamic calibration: Optimal trajectories and experiment al parameter
estimation. IEEE Trans. on AC, 13(5):730–740, 1997.
[23] G. Casella and E. I. George. Explaining the Gibbs Sampler. The American Statistician, 46(3):167–174, 1992.
[24] A. Cassandra, L. Kaelbling, and J. Kurien. Acting Under Uncertainty: Discrete Bayesian Models for Mobile-Robot
Navigation,. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 1996. http:
//www.cs.brown.edu/people/lpk/iros96.ps.
[25] A. R. Cassandra. Optimal policies for partially observable markov decision processes. Tech-
nical Report CS-94-14, Brown University, Department of Computer Science, Providence RI,
http://www.cs.brown.edu/publications/techreports/reports/CS-94-14.html 1994.
[26] A. R. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis,
U. Brown, 1998.
[27] H.-T. Cheng. Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British
Columbia, British Columbia, Canada, 1988.
[28] S. Chib and E. Greenberg. Understanding the Metropolis–Hastings Algorithm. The American Statistician, 49(4):327–
335, 1995.
[29] T. M. Cover and J. A. Thomas, editors. Elements of Information Theory. Wiley Series in Telecommunications.
Wiley-Interscience, 1991.
[30] H. Cramér. Mathematical methods of Statistics. Princeton. Princeton University Press, New Jersey, 1946.
[31] F. Daum. The fisher-darmois-koopman-pitman theorem for random processes. In Proc. of the 1986 IEEE Conference
on Decision and Control, pages 1043–1044.
[32] F. Daum. Solution of the zakai equation by separation of variables. IEEE Trans. Autom. Control. AC-32(10), 1987.
[33] F. Daum. New exact nonlinear filters. In e. J. C. Spall, editor, Bayesian Analysis of Time Series and Dynamic Models,
chapter 8, pages 199–226. Marcel Dekker inc., New York, 1988.
[34] J. De Geeter. Constrained system state estimation and task-directed sensing. PhD thesis, K.U.Leuven, Department
of Mechanical engineering, div. PMA, Celestijnenlaan 300B, 3001 Leuven, Belgium, 1998.
[35] F. Dellaert, D. Fox, W. Burgard, and S. Thrun. Monte carlo localization for mobile robots. In Proceedings of the
IEEE International Conference on Robotics and Automation (ICRA’99), Detroit, Michigan, 1999.
[36] F. d’Epenoux. Sur un problème de production et de stockage dans l’aléatoire. Revue Francaise Recherche Opra-
tionelle, 14:3–16, 1960.
[37] A. Doucet. Monte Carlo Methods for Bayesian Estimation of Hidden Markov Models. PhD thesis, Univ. Paris-Sud,
Orsay, 1997. in french.
[38] A. Doucet. On Sequential Simulation-Based Methods for Bayesian Filtering. Technical Report CUED/F-
INFENG/TR.310, Signal Processing Group, Dept. of Engineering, University of Cambridge, 1998.
[39] A. Doucet, N. de Freytas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Statistics for
engineering and information science. Springer–Verlag, january 2001.
[40] A. Doucet, S. Godsill, and C. Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics
and Computing, 10(3):197–208, 2000.
[41] A. Drake. Observation of Markov Processes Through a Noisy Chanel. PhD thesis, Massachusetts Institute of Tech-
nology, Cambridge, Massachusetts, 1962.
[42] R. Dugad and U. Desai. A tutorial on Hidden Markov Models. Technical Report SPANN-96.1, Indian institute of
Technology, dept. of electrical engineering, Signal Processing and Artificial Neural Networks Laboratory, Bombay,
Powai, Mumbai 400 076 India, may 1996. http://vision.ai.uiuc.edu/dugad/newhmmtut.ps.gz.

[43] D. Dugué. Applications des propriétés de la limite au sens du calcul des probabilités à l’étude des diverses questions
d’estimation. Ecol. Poly., 3(4):305–372, 1937.

[44] J. N. Eagle. The optimal search for a moving target when the search path is constrained. Operations research,
32(5):1107–1115, 1984.

[45] G. J. Erickson and C. R. Smith, editors. Maximum-Entropy and Bayesian Methods in Science and Engineering.
Vol. 1: Foundations; Vol. 2: Applications, Dordrecht, The Netherlands, 1988. Kluwer Academic Publishers.

[46] H. J. S. Feder, J. J. Leonard, and C. M. Smith. Adaptive mobile robot navigation and mapping. International Journal
of Robotics Research, 18(7):650–668, July 1999.

[47] V. Fedorov. Theory of optimal experiments. Academic press, New York, 1972.

[48] R. Fisher. On the mathematical foundations of theoretical statistics. Pilosophical Transactions of the Royal Society,
A,, 222:309–368, 1922.

[49] M. Forster and E. Sober. How to tell when simpler, more unified, or less ad hoc theories will provide more accurate
predictions. British Joural for the Philosophy of Science, 45:1–35, 1994.

[50] D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte carlo localization: Efficient position estimation for mobile
robots. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI’99), Orlando, FL, 1999.

[51] D. Fox, W. Burgard, and S. Thrun. Active markov localization for mobile robots. volume 25, pages 195–207, 1998.

[52] D. Fox, W. Burgard, and S. Thrun. Markov localization for mobile robots in dynamic environments. Journal of
Artificial Intelligence Research, 11, 1999.

[53] J. D. Geeter, J. D. Schutter, H. Bruyninckx, H. V. Brussel, and M. Decréton. Tolerance-weighted L-optimal experi-
ment design: a new approach to task-directed sensing. Advanced Robotics, 13(4):401–416, 1999.

[54] A. E. Gelfland and A. F. M. Smith. Sampling-Based Approaches to Calculating Marginal Densities. Journal of the
American Statistical Association, 85(410):398–409, june 1990.

[55] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman &
Hall, London, first edition, 1996.

[56] W. K. Hastings. Monte Carlo sampling methods using Markov Chains and their applications. Biometrika, 57:97–107,
1970.

[57] M. Hauskrecht. Value-function approximations for partally observable markov decision processes. Journal of Artifi-
cial Intelligence Research, 13:33–94, 2000.

[58] R. A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, Massachusetts, 1960.

[59] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical
Statistics, 5(3):299–314, 1996.

[60] E. T. Jaynes. How does the brain do plausible reasoning? Technical Report 421, Stanford University Microwave
Laboratory, 1957. Reprinted in [45, Vol. 1, p. 1–24].

[61] F. Jelinek. Statistical methods for speech recognition. MIT Press, 1997.

[62] M. I. Jordan, editor. Learning in Graphical Models. Adaptive Computation and Machine Learning. MIT Press,
London, England, 1999. ISBN 0262600323.

[63] R. E. Kalman. A new approach to linear filtering and prediction problems. 82:34–45, 1960.

[64] M. H. Kalos and P. A. Whitlock. Monte Carlo methods, volume I: Basics of Wiley-intersience publications. Wiley,
New York, 1986.

[65] S. Koenig and R. Simmons. Solving robot navigation problems with initial pose uncertainty using real-time heuristic
search. In Proceedings of the International Conference on Artificial Intelligence Planning Systems, pages 154–153,
1998.

[66] S. Kristensen. Sensor planning with bayesian decision theory. Robotics and Autonomous Systems, 19:273–286, 1997.

[67] G. J. A. Kröse and R. Bunschoten. Probabilistic localization by appearance models and active vision. In IEEE
conference on Robotics and Automation, Detroit, May 1999.

[68] S. Kullback. Information theory and statistics. New York, NY, 1959.

[69] S. Kullback and R. Leibler. On information and sufficiency. Annals of mathematical Statistics, 22:79–86, 1951.

[70] S. E. Levinson. Continuously Variable Duration Hidden Markov Models for speech analysis. In Int. Conf. on
Acoustics, Speech, and Signal Processing, volume 2, pages 1241–1244. AT&T Bell Lab., april 1986.

[71] S. E. Levinson. Continuously Variable Duration Hidden Markov Models for speech recognition. Computer, Speech
and Language, 1:29–45, 1986.

[72] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Efficient dynamic-programming updates in partially observ-
able markov decision processes. Technical Report CS-95-19, Brown University, Department of Computer Science,
Providence RI, 1995.

[73] M. L. Littman, T. L. Dean, and L. P. Kaelbling. On the complexity of solving markov decision problems. In
Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence, 1995.

[74] W. S. Lovejoy. A survey of algorithmic methods for partially observed markov decision processes. Annals of
Operations Research, 18:47–65, 1991.

[75] D. J. C. MacKay. Information theory, inference and learning algorithms. Textbook in preparation. http://wol.
ra.phy.cam.ac.uk/mackay/itprnn/, 1999.

[76] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equations of state calculations by
fast computing machine. Journal of Chemical Physics, 21:1087–1091, 1963.

[77] N. Metropolis and S. Ulam. The Monte Carlo Method. Journal of the American Statistical Association, 1949.

[78] G. E. Monahan. A survey of partially observable decision processes: Theory, models and algorithms. Management
Science, 28(1):1–16, 1982.

[79] M. Montemerlo and S. Thrun. Simultaneous localization and mapping with unknown data association. In Proceedings
of the 2003 ICRA, pages 1985 – 1991, Taipei, Taiwan, September 2003. IEEE.

[80] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. Fastslam: A factored solution to the simultaneous localization
and mapping problem. In Proceedings of the eighteenth National Conference on Artificial Intelligence, pages 593–
598, 2002.

[81] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. Fastslam 2.0: An improved particle filtering algorithm for
simultaneous localization and mapping that provably converges. In Proceedings of the eighteenth International Joint
Conference on Artificial Intelligence, 2003.

[82] K. Murphy and S. Russell. Sequential Monte Carlo Methods in Practice, chapter RaoBlackwellised particle filtering
for dynamic Bayesian networks, pages 499–516. Statistics for engineering and information science. Springer–Verlag,
january 2001.

[83] K. P. Murphy. A survey of POMDP solution techniques. Technical report, http://citeseer.nj.nec.com/murphy00survey.html, September 2000.

[84] C. Musso, N. Oudjane, and F. LeGland. Sequential Monte Carlo Methods in Practice, chapter Improving regularised
particle filters, page ?? Statistics for engineering and information science. Springer–Verlag, january 2001.

[85] M. Neal, Radford. Markov Chain Monte Carlo Methods Based on ‘Slicing’ the Density Function. Technical Report
9722, Dept. of Statistics and dept. of Computer Science, University of Toronto, Toronto, Ontario, Canada, november
1997. http://www.cs.utoronto.ca/˜radford/slice.abstract.html.

[86] M. Neal, Radford. Slice Sampling. Technical Report 2005, Dept. of Statistics, University of Toronto, Toronto, On-
tario, Canada, august 2000. http://www.cs.toronto.edu/˜radford/slc-samp.abstract.html.

[87] M. Neal, Radford. Slice Sampling. Annals of Statistics, 2002. To appear.

[88] R. M. Neal. Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1,
University of Toronto, Department of Computer Science, 1993.

[89] NN. Introduction to monte carlo methods. CSEP. http://csep1.phy.ornl.gov/mc/mc.html.

[90] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.

[91] M. Pitt and N. Shephard. Filtering via simulation: auxiliary particle filter. Journal of the American Statistical
Association, 1999. forthcoming.

[92] F. Pukelsheim. Optimal Design of Experiments. New York, NY, 1993.

[93] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons,
Wiley series in probability and mathematical statistics, New York, 1994.

[94] L. R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of
the IEEE, 77(2):257–286, 1989.

[95] C. R. Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta
Mathematical Society, 37:81–91, 1945.

[96] B. D. Ripley. Stochastic Simulation. John Wiley and Sons, 1987.

[97] J. Rissanen. Modeling by the shortest data description. Automatica, 14:465–71, 1978.

[98] J. Rissanen. Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B, 49:223–239,
1987.

[99] N. Roy, W. Burgard, D. Fox, and S. Thrun. Coastal navigation - mobile robot navigation with uncertainty in dynamic
environments. In Proceedings of the IEEE International Conference on Robotics and Automation, Detroit, MI,
volume 1, pages 35–40, May 1999.

[100] D. B. Rubin. Bayesian Statistics 3, chapter Using the SIR algorithm to simulate posterior distributions, pages 395–
402. Oxford University Press, 1988. Using the SIR algorithm to simulate posterior distributions.

[101] J. Rust. Numerical dynamic programming in economics. In H. Amman, D. Kendrick, and J. Rust, editors, Handbook
of Computational Economics, pages 619–729. Elsevier, Amsterdam, 1996.

[102] J. U. S. Julier and H. Durrant-Whyte. A new method for the nonlinear transformation of means and covariances in
filters and estimators. IEEE Transactions on Automatic Control, 45(3):477–482, March 2000.

[103] Y. Sakamoto, M. Ishiguro, and G. Kitagawa. Akaike Information Criterion Statistics. Kluwer, Dordrecht, 1986.

[104] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.

[105] P. Schweitzer and A. Seidmann. Generalized polynomial approximations in markovian decision processes. Journal
of Mathematical Analysis and Applications, 110:568–582, 1985.

[106] C. Shannon. A mathematical theory of communication, i. The Bell System Technical Journal, 27:379–423, July
1948.

[107] C. Shannon. A mathematical theory of communication, ii. The Bell System Technical Journal, 27:623–656, October
1948.

[108] R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proceedings of the
fourteenth International Joint Conference on Artificial Intelligence, Montréal, Québec, Canada, pages 1080–1087.
Springer-Verlag, Berlin, Germany, 1995.

[109] D. Sivia. Data analysis: a Bayesian tutorial. 1996.

[110] A. F. M. Smith and A. E. Gelfland. Bayesian Statistics Without Tears: A Sampling–Resampling Perspective. The
American Statistician, 46(2):84–88, 1992.

[111] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University,
Stanford, California, 1971.

[112] R. Sutton and A. Barto. Reinforcement Learning, An introduction. The MIT Press, 1998.

[113] J. Swevers, C. Ganseman, D. B. Tükel, J. De Schutter, and H. Van Brussel. Optimal robot excitation and identification.
IEEE Transactions on Robotics and Automation, 13(5):730–739, October 1997.

[114] S. Thrun. Monte Carlo POMDPs. In S. A. Solla, T. K. Leen, and K. R. Muller, editors, Advances in Neural Processing
Systems, volume 12, pages 1064–1070. MIT Press, 2000.
[115] S. Thrun and J. Langford. Monte Carlo Hidden Markov Models. Technical Report CMU-CS-98-179, Carnegie
Mellon University, School of computer science, Pittsburgh, PA 15213, 1998. http://www.cs.cmu.edu/afs/
cs.cmu.edu/user/thrun/public_html/papers/thru%n.hmm.html.

[116] S. Thrun, J. Langford, and D. Fox. Monte Carlo Hidden Markov Models: Learning non-parametric models of
partially observable stochastic processes. In ??, editor, Proceeding of The Sixteenth International Conference on Ma-
chine Learning, page ??, 1999. http://www.cs.cmu.edu/afs/cs.cmu.edu/user/thrun/public_
html/papers/thru%n.mchmm.html.

[117] P. Tichavský, C. H. Muravchik, and A. Nehorai. Posterior Cramér-Rao bounds for discrete-time nonlinear filtering.
IEEE Transactions on Signal Processing, 46(5):1386–1396, May 1998.

[118] M. Trick and S. Zin. A linear programming approach to solving stochastic dynamic programs. Technical report,
Carnegie-Mellon University, manuscript, 1993.

[119] P. Turney. A theory of cross-validation error. The Journal of Theoretical and Experimental Artificial Intelligence,
6:361–92, 1994.

[120] H. L. Van Trees. Detection, Estimation and Modulation Theory, Vol I. Wiley and Sons, New York, 1968.

[121] C. Wallace and P. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society B,
49:240–65, 1987.

[122] E. Wan and A. Nelson. Dual kalman filtering methods for nonlinear prediction, estimation, and smoothing. In
J. Mozer and Petsche, editors, In Advances in Neural Information Processing Systems: Proceedings of the 1996
Conference , NIPS-9, 1997.

[123] D. Xiang and G. Wahba. A generalized approximate cross validation for smoothing splines with non-Gaussian data.
Statistica Sinica, 6:675–92, 1996.

[124] A. Zellner, H. A. Keuzenkamp, and M. McAleer. Simplicity, Inference and Modelling. Keeping it Sophisticatedly
Simple. Cambridge University Press, Cambridge, UK, 2001.

[125] N. Zhang and W. Liu. Planning in stochastic domains: Problem characteristics and approximation. Technical Report
HKUST-CS96031, Hong Kong University of Science and Technology, 1996.
