The Kalman Filter Explained
Tristan Fletcher
www.cs.ucl.ac.uk/staff/T.Fletcher/
December 3, 2010
1 Introduction

The aim of this document is to derive the filtering equations for the simplest Linear Dynamical System case, the Kalman Filter, outline the filter's implementation, do the same for the smoothing equations, and conclude with parameter learning in an LDS (calibrating the Kalman Filter).
The document is based closely on Bishop's [1] and Ghahramani's [2] texts, but is more suitable for those who wish to understand every aspect of the mathematics required and see how it all comes together in a procedural sense.
2 Model Specification

The simplest form of Linear Dynamical System (LDS) models a discrete time process where a latent variable h is updated every time step by a constant linear state transition A with the addition of zero-mean Gaussian noise η^h:

h_t = A h_{t-1} + η^h_t    where η^h_t ∼ N(0, Σ_H)
⇒ p(h_t | h_{t-1}) ∼ N(A h_{t-1}, Σ_H)    (2.1)
This latent variable is observed through a constant linear function of the latent state B, also subject to zero-mean Gaussian noise η^v:

v_t = B h_t + η^v_t    where η^v_t ∼ N(0, Σ_V)
⇒ p(v_t | h_t) ∼ N(B h_t, Σ_V)    (2.2)
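The generative model in (2.1) and (2.2) can be sketched in a few lines of NumPy. The constant-velocity A, position-only B and the noise levels below are illustrative assumptions of this sketch, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])     # state transition (position + velocity)
B = np.array([[1.0, 0.0]])     # we observe only the position
Sigma_H = 0.01 * np.eye(2)     # latent noise covariance Σ_H
Sigma_V = 0.25 * np.eye(1)     # observation noise covariance Σ_V

T = 50
h = np.zeros((T, 2))
v = np.zeros((T, 1))
h[0] = rng.multivariate_normal(np.zeros(2), np.eye(2))  # h_1 ~ N(µ_0, σ²_0)
for t in range(T):
    if t > 0:
        # h_t = A h_{t-1} + η^h_t
        h[t] = A @ h[t - 1] + rng.multivariate_normal(np.zeros(2), Sigma_H)
    # v_t = B h_t + η^v_t
    v[t] = B @ h[t] + rng.multivariate_normal(np.zeros(1), Sigma_V)
```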
We wish to infer the probability distribution of h_t given the observations up to that point in time v_{1:t}, i.e. p(h_t | v_{1:t}), which can be expressed recursively. Starting with an initial distribution for our latent variable given the first observation:

p(h_1 | v_1) ∝ p(v_1 | h_1) p(h_1)

and the assumption that h_1 has a Gaussian distribution:

p(h_1) ∼ N(µ_0, σ²_0)

values for p(h_t | v_{1:t}) for subsequent values of t can be found by iteration:

p(h_t | v_{1:t}) = p(h_t | v_t, v_{1:t-1})
                 = p(h_t, v_t | v_{1:t-1}) / p(v_t | v_{1:t-1})
                 ∝ p(h_t, v_t | v_{1:t-1})
                 = ∫_{h_{t-1}} p(h_t, v_t | v_{1:t-1}, h_{t-1}) p(h_{t-1} | v_{1:t-1})
                 = ∫_{h_{t-1}} p(v_t | v_{1:t-1}, h_{t-1}, h_t) p(h_t | h_{t-1}, v_{1:t-1}) p(h_{t-1} | v_{1:t-1})

h_t ⊥ v_{1:t-1} | h_{t-1}  ⇒   = ∫_{h_{t-1}} p(v_t | v_{1:t-1}, h_{t-1}, h_t) p(h_t | h_{t-1}) p(h_{t-1} | v_{1:t-1})

v_t ⊥ h_{t-1} | v_{1:t-1}, h_t  ⇒   = ∫_{h_{t-1}} p(v_t | v_{1:t-1}, h_t) p(h_t | h_{t-1}) p(h_{t-1} | v_{1:t-1})

                 = ∫_{h_{t-1}} p(v_t | h_t) p(h_t | h_{t-1}) p(h_{t-1} | v_{1:t-1})    (2.3)
Since the distributions described in (2.1) and (2.2) are both Gaussian, and the operations in (2.3), multiplication followed by integration, yield Gaussians when performed on Gaussians, we know that p(h_t | v_{1:t}) will itself be a Gaussian:

p(h_t | v_{1:t}) ∼ N(µ_t, σ²_t)    (2.4)

and the task is to derive the expressions for µ_t and σ²_t.
3 Deriving the state estimate variance

If we define ∆h ≡ h − E[ĥ], where h denotes the actual value of the latent variable, ĥ is its estimated value, E[ĥ] is the expected value of this estimate and F is the covariance of the estimate error, then:

F_{t|t-1} = E[∆h_t ∆h_t^T]
          = E[(A ∆h_{t-1} + ∆η^h_t)(A ∆h_{t-1} + ∆η^h_t)^T]
          = E[A ∆h_{t-1} ∆h_{t-1}^T A^T + ∆η^h_t ∆η^{hT}_t + ∆η^h_t ∆h_{t-1}^T A^T + A ∆h_{t-1} ∆η^{hT}_t]
          = A E[∆h_{t-1} ∆h_{t-1}^T] A^T + E[∆η^h_t ∆η^{hT}_t]
          = A F_{t-1|t-1} A^T + Σ_H    (3.1)

(the cross terms vanish because the noise is zero-mean and independent of the state error).
The subscript in F_{t|t-1} denotes the fact that this is F's value before an observation is made at time t (i.e. its a priori value), while F_{t|t} denotes its value after an observation is made (its a posteriori value). This more informative notation allows the update equation in (2.1) to be expressed as follows:

ĥ_{t|t-1} = A ĥ_{t-1|t-1}    (3.2)
Once we have an observation (and are therefore dealing with posterior values), we can define ε_t as the difference between the observation we'd expect to see given our estimate of the latent state (its a priori value) and the one actually observed, i.e.:

ε_t = v_t − B ĥ_{t|t-1}    (3.3)

Now that we have an observation, if we wish to add a correction to our a priori estimate that is proportional to the error ε_t, we can use a coefficient κ:

ĥ_{t|t} = ĥ_{t|t-1} + κ ε_t    (3.4)
This allows us to express F_{t|t} recursively:

F_{t|t} = Cov(h_t − ĥ_{t|t})
        = Cov(h_t − (ĥ_{t|t-1} + κ ε_t))
        = Cov(h_t − (ĥ_{t|t-1} + κ(v_t − B ĥ_{t|t-1})))
        = Cov(h_t − (ĥ_{t|t-1} + κ(B h_t + η^v_t − B ĥ_{t|t-1})))
        = Cov(h_t − ĥ_{t|t-1} − κ B h_t − κ η^v_t + κ B ĥ_{t|t-1})
        = Cov((I − κB)(h_t − ĥ_{t|t-1}) − κ η^v_t)
        = Cov((I − κB)(h_t − ĥ_{t|t-1})) + Cov(κ η^v_t)
        = (I − κB) Cov(h_t − ĥ_{t|t-1}) (I − κB)^T + κ Cov(η^v_t) κ^T
        = (I − κB) F_{t|t-1} (I − κB)^T + κ Σ_V κ^T
        = (F_{t|t-1} − κ B F_{t|t-1})(I − κB)^T + κ Σ_V κ^T
        = F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} (κB)^T + κ B F_{t|t-1} (κB)^T + κ Σ_V κ^T
        = F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} B^T κ^T + κ(B F_{t|t-1} B^T + Σ_V) κ^T    (3.5)

If we define the innovation variance as S_t = B F_{t|t-1} B^T + Σ_V, then (3.5) becomes:

F_{t|t} = F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} B^T κ^T + κ S_t κ^T    (3.6)
4 Minimizing the state estimate variance

If we wish to minimize the variance of F_{t|t}, we can use the mean square error (MSE) measure:

E[‖h_t − ĥ_{t|t}‖²] = Tr(Cov(h_t − ĥ_{t|t})) = Tr(F_{t|t})    (4.1)

The only coefficient we have control over is κ, so we wish to find the κ that gives us the minimum MSE, i.e. we need to find κ such that:

δTr(F_{t|t}) / δκ = 0

(3.6) ⇒ δTr(F_{t|t-1} − κ B F_{t|t-1} − F_{t|t-1} B^T κ^T + κ S_t κ^T) / δκ = 0
⇒ κ = F_{t|t-1} B^T S_t^{-1}    (4.2)

This optimum value for κ in terms of minimizing the MSE is known as the Kalman Gain and will be denoted K.
If we multiply both sides of (4.2) on the right by S_t K^T:

K S_t K^T = F_{t|t-1} B^T K^T    (4.3)

Substituting this into (3.6):

F_{t|t} = F_{t|t-1} − K B F_{t|t-1} − F_{t|t-1} B^T K^T + F_{t|t-1} B^T K^T
        = (I − K B) F_{t|t-1}    (4.4)
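The algebra above can be checked numerically: with K = F_{t|t-1} B^T S_t^{-1}, the general form (3.5) collapses to the (I − KB) F_{t|t-1} of (4.4), and perturbing κ away from K increases Tr(F_{t|t}). A small sketch with randomly generated (assumed) matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2                                  # latent and observed dimensions
B = rng.standard_normal((m, n))
M = rng.standard_normal((n, n))
F = M @ M.T + n * np.eye(n)                  # F_{t|t-1}: a random SPD matrix
M = rng.standard_normal((m, m))
Sigma_V = M @ M.T + m * np.eye(m)            # Σ_V: another random SPD matrix

S = B @ F @ B.T + Sigma_V                    # innovation variance S_t
K = F @ B.T @ np.linalg.inv(S)               # Kalman gain (4.2)

def posterior_cov(kappa):
    """Right-hand side of (3.5) for an arbitrary coefficient κ."""
    I = np.eye(n)
    return (I - kappa @ B) @ F @ (I - kappa @ B).T + kappa @ Sigma_V @ kappa.T

# (3.5) at κ = K equals the simplified form (4.4) ...
assert np.allclose(posterior_cov(K), (np.eye(n) - K @ B) @ F)
# ... and any other κ has a larger trace, i.e. a larger MSE (4.1)
assert np.trace(posterior_cov(K + 0.1)) > np.trace(posterior_cov(K))
```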
5 Filtered Latent State Estimation Procedure (The Kalman Filter)

The procedure for estimating the state of h_t, which, when using the MSE-optimal value for κ, is called Kalman Filtering, proceeds as follows:

1. Choose initial values for ĥ and F (i.e. ĥ_{0|0} and F_{0|0}).

2. Advance the latent state estimate:
   ĥ_{t|t-1} = A ĥ_{t-1|t-1}

3. Advance the estimate covariance:
   F_{t|t-1} = A F_{t-1|t-1} A^T + Σ_H

4. Make an observation v_t.

5. Calculate the innovation:
   ε_t = v_t − B ĥ_{t|t-1}

6. Calculate S_t:
   S_t = B F_{t|t-1} B^T + Σ_V

7. Calculate K:
   K = F_{t|t-1} B^T S_t^{-1}

8. Update the latent state estimate:
   ĥ_{t|t} = ĥ_{t|t-1} + K ε_t

9. Update the estimate covariance (from (4.4)):
   F_{t|t} = (I − K B) F_{t|t-1}

10. Cycle through stages 2 to 9 for each time step.

Note that ĥ_{t|t} and F_{t|t} correspond to µ_t and σ²_t from (2.4).
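The ten steps above can be sketched directly in NumPy (a minimal illustration; the function name and argument layout are our own, not from the text):

```python
import numpy as np

def kalman_filter(vs, A, B, Sigma_H, Sigma_V, h0, F0):
    """Run steps 2-9 over observations vs (shape T x obs_dim), starting
    from the step-1 initial values h0 = ĥ_{0|0} and F0 = F_{0|0}."""
    h, F = h0, F0
    h_filt, F_filt = [], []
    for v in vs:
        h = A @ h                              # 2. advance state estimate (3.2)
        F = A @ F @ A.T + Sigma_H              # 3. advance covariance (3.1)
        eps = v - B @ h                        # 4-5. observe, innovation (3.3)
        S = B @ F @ B.T + Sigma_V              # 6. innovation variance S_t
        K = F @ B.T @ np.linalg.inv(S)         # 7. Kalman gain (4.2)
        h = h + K @ eps                        # 8. state update (3.4)
        F = (np.eye(len(h)) - K @ B) @ F       # 9. covariance update (4.4)
        h_filt.append(h)
        F_filt.append(F)
    return np.array(h_filt), np.array(F_filt)
```

On a one-dimensional random walk, for example, F_{t|t} settles to a steady value well below the initial uncertainty F_{0|0}.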
6 Smoothed Latent State Estimation

The smoothed probability of the latent variable is the probability of the value it had at time t given a sequence of T observations, i.e. p(h_t | v_{1:T}). Unlike the Kalman Filter, which can be updated with each observation, one has to wait until T observations have been made and then retrospectively calculate the probability of the latent variable's value at time t, where t < T.
Commencing at the final time step in the sequence (t = T) and working backwards to the start (t = 1), p(h_t | v_{1:T}) can be evaluated as follows:
p(h_t | v_{1:T}) = ∫_{h_{t+1}} p(h_t | h_{t+1}, v_{1:T}) p(h_{t+1} | v_{1:T})

h_t ⊥ v_{t+1:T} | h_{t+1}  ⇒  p(h_t | h_{t+1}, v_{1:T}) = p(h_t | h_{t+1}, v_{1:t})

∴ p(h_t | v_{1:T}) = ∫_{h_{t+1}} p(h_t | h_{t+1}, v_{1:t}) p(h_{t+1} | v_{1:T})
                  = ∫_{h_{t+1}} [p(h_{t+1}, h_t | v_{1:t}) / p(h_{t+1} | v_{1:t})] p(h_{t+1} | v_{1:T})
                  = ∫_{h_{t+1}} [p(h_{t+1} | h_t, v_{1:t}) p(h_t | v_{1:t}) / p(h_{t+1} | v_{1:t})] p(h_{t+1} | v_{1:T})

h_{t+1} ⊥ v_{1:t} | h_t  ⇒   = ∫_{h_{t+1}} [p(h_{t+1} | h_t) p(h_t | v_{1:t}) / p(h_{t+1} | v_{1:t})] p(h_{t+1} | v_{1:T})    (6.1)
As before, we know that p(h_t | v_{1:T}) will be a Gaussian and we will need to establish its mean and variance at each t, i.e. in a similar manner to (2.4):

p(h_t | v_{1:T}) ∼ N(h^s_t, F^s_t)    (6.2)

Using the filtered values ĥ_{t|t} and F_{t|t} calculated in the previous section for each time step, the procedure for estimating the smoothed parameters h^s_t and F^s_t works backwards from the last time step in the sequence, i.e. from t = T, as follows:

1. Set h^s_T and F^s_T to ĥ_{T|T} and F_{T|T} from steps 8 and 9 in section 5.

2. Calculate A^s_t:
   A^s_t = (A F_{t|t})^T (A F_{t|t} A^T + Σ_H)^{-1}

3. Calculate S^s_t:
   S^s_t = F_{t|t} − A^s_t A F_{t|t}

4. Calculate the smoothed latent variable estimate h^s_t:
   h^s_t = A^s_t h^s_{t+1} + ĥ_{t|t} − A^s_t A ĥ_{t|t}

5. Calculate the smoothed estimate covariance F^s_t:
   F^s_t = ½ [(A^s_t F^s_{t+1} A^{sT}_t + S^s_t) + (A^s_t F^s_{t+1} A^{sT}_t + S^s_t)^T]

6. Calculate the smoothed cross-variance X^s_t:
   X^s_t = A^s_t F^s_t + h^s_t ĥ^T_{t|t}

7. Cycle through stages 2 to 6 for each time step, backwards through the sequence from t = T to t = 1.
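Steps 1 to 7 amount to a backward (Rauch-Tung-Striebel-style) pass over the filtered output. A sketch, assuming h_filt and F_filt hold the ĥ_{t|t} and F_{t|t} from section 5 (the function name and array layout are assumptions of this sketch):

```python
import numpy as np

def smooth(h_filt, F_filt, A, Sigma_H):
    """Backward pass of section 6: returns h^s_t, F^s_t and X^s_t for all t."""
    T, n = h_filt.shape
    h_s, F_s = h_filt.copy(), F_filt.copy()        # 1. values at t = T
    X_s = np.zeros_like(F_filt)
    for t in range(T - 2, -1, -1):
        P = A @ F_filt[t] @ A.T + Sigma_H
        As = (A @ F_filt[t]).T @ np.linalg.inv(P)  # 2. A^s_t
        Ss = F_filt[t] - As @ A @ F_filt[t]        # 3. S^s_t
        h_s[t] = As @ h_s[t + 1] + h_filt[t] - As @ A @ h_filt[t]   # 4. h^s_t
        M = As @ F_s[t + 1] @ As.T + Ss            # 5. F^s_t, symmetrized
        F_s[t] = 0.5 * (M + M.T)
        X_s[t] = As @ F_s[t] + np.outer(h_s[t], h_filt[t])          # 6. X^s_t
    return h_s, F_s, X_s
```

Conditioning on the whole sequence can only sharpen the estimate, so the smoothed covariances at interior time steps sit below their filtered counterparts.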
7 Expectation Maximization (Calibrating the Kalman Filter)

The procedures outlined in the previous sections are fine if we assume that we know the values in the parameter set θ = {µ_0, σ²_0, A, Σ_H, B, Σ_V}, but in order to learn these values we will need to perform the Expectation Maximization algorithm.
The joint probability of T time steps of the latent and observable variables is:

p(h_{1:T}, v_{1:T}) = p(h_1) ∏_{t=2}^T p(h_t | h_{t-1}) ∏_{t=1}^T p(v_t | h_t)    (7.1)
Making the dependence on the parameters explicit, the likelihood of the model given the parameter set θ is:

p(h_{1:T}, v_{1:T} | θ) = p(h_1 | µ_0, σ²_0) ∏_{t=2}^T p(h_t | h_{t-1}, A, Σ_H) ∏_{t=1}^T p(v_t | h_t, B, Σ_V)    (7.2)

Taking logs gives us the model's log likelihood:

ln p(h_{1:T}, v_{1:T} | θ) = ln p(h_1 | µ_0, σ²_0) + Σ_{t=2}^T ln p(h_t | h_{t-1}, A, Σ_H) + Σ_{t=1}^T ln p(v_t | h_t, B, Σ_V)    (7.3)
We will deal with each of the three components of (7.3) in turn. Using V to represent the set of observations up to and including time t (i.e. v_{1:t}), H for h_{1:T}, θ^old to represent our parameter values before an iteration of the EM loop, the superscript n to represent the value of a parameter after an iteration of the loop, c to represent terms that are not dependent on µ_0 or σ²_0, λ to represent (σ²_0)^{-1} and Q = E_{H|θ^old}[ln p(H, V | θ)], we will first find the expected value for p(h_1 | µ_0, σ²_0):

Q = −½ ln|σ²_0| − E_{H|θ^old}[½ (h_1 − µ_0)^T λ (h_1 − µ_0)] + c
  = −½ ln|σ²_0| − ½ E_{H|θ^old}[h_1^T λ h_1 − h_1^T λ µ_0 − µ_0^T λ h_1 + µ_0^T λ µ_0] + c
  = ½ (ln|λ| − Tr(λ E_{H|θ^old}[h_1 h_1^T − h_1 µ_0^T − µ_0 h_1^T + µ_0 µ_0^T])) + c    (7.4)
In order to find the µ_0 which maximizes the expected log likelihood described in (7.4), we will differentiate it wrt µ_0 and set the differential to zero:

δQ/δµ_0 = 2λ µ_0 − 2λ E[h_1] = 0
⇒ µ^n_0 = E[h_1]    (7.5)

Proceeding in a similar manner to establish the maximal λ:

δQ/δλ = ½ (σ²_0 − E[h_1 h_1^T] + E[h_1] µ_0^T + µ_0 E[h_1^T] − µ_0 µ_0^T) = 0

which, on substituting µ^n_0 = E[h_1], gives:

σ²_0^n = E[h_1 h_1^T] − E[h_1] E[h_1^T]    (7.6)
In order to optimize for A and Σ_H we will substitute for p(h_t | h_{t-1}, A, Σ_H) in (7.3), giving:

Q = −(T − 1)/2 ln|Σ_H| − E_{H|θ^old}[½ Σ_{t=2}^T (h_t − A h_{t-1})^T Σ_H^{-1} (h_t − A h_{t-1})] + c    (7.7)

Maximizing with respect to these parameters then gives:

A^n = (Σ_{t=2}^T E[h_t h_{t-1}^T]) (Σ_{t=2}^T E[h_{t-1} h_{t-1}^T])^{-1}    (7.8)

Σ^n_H = 1/(T − 1) Σ_{t=2}^T (E[h_t h_t^T] − A^n E[h_{t-1} h_t^T] − E[h_t h_{t-1}^T] A^{nT} + A^n E[h_{t-1} h_{t-1}^T] A^{nT})    (7.9)
In order to determine values for B and Σ_V we substitute for p(v_t | h_t, B, Σ_V) in (7.3) to give:

Q = −T/2 ln|Σ_V| − E_{H|θ^old}[½ Σ_{t=1}^T (v_t − B h_t)^T Σ_V^{-1} (v_t − B h_t)] + c    (7.10)

Maximizing this with respect to B and Σ_V gives:

B^n = (Σ_{t=1}^T v_t E[h_t^T]) (Σ_{t=1}^T E[h_t h_t^T])^{-1}    (7.11)

Σ^n_V = 1/T Σ_{t=1}^T (v_t v_t^T − B^n E[h_t] v_t^T − v_t E[h_t^T] B^{nT} + B^n E[h_t h_t^T] B^{nT})    (7.12)
Using the values calculated from the smoothing procedure in section 6:

E[h_t] = h^s_t
E[h_t h_t^T] = F^s_t
E[h_t h_{t-1}^T] = X^s_t
We can now set out the procedure for parameter learning using Expectation Maximization:

1. Choose starting values for the parameters θ = {µ_0, σ²_0, A, Σ_H, B, Σ_V}.

2. Using the parameter set θ, calculate the filtered statistics ĥ_{t|t} and F_{t|t} for each time step as described in section 5.

3. Using the parameter set θ, calculate the smoothed statistics h^s_t, F^s_t and X^s_t for each time step as described in section 6.

4. Update A:
   A^n = (Σ_{t=2}^T h^s_t h^{sT}_{t-1} + Σ_{t=2}^T X^s_t) (Σ_{t=2}^T h^s_{t-1} h^{sT}_{t-1} + Σ_{t=2}^T F^s_{t-1})^{-1}

5. Update Σ_H:
   Σ^n_H = (T − 1)^{-1} [Σ_{t=2}^T h^s_t h^{sT}_t + Σ_{t=2}^T F^s_t − A^n (Σ_{t=2}^T h^s_t h^{sT}_{t-1} + Σ_{t=2}^T X^s_t)^T]

6. Update B:
   B^n = (Σ_{t=1}^T v_t h^{sT}_t) (Σ_{t=1}^T h^s_t h^{sT}_t + Σ_{t=1}^T F^s_t)^{-1}

7. Update Σ_V:
   Σ^n_V = T^{-1} [Σ_{t=1}^T v_t v_t^T − B^n (Σ_{t=1}^T v_t h^{sT}_t)^T]

8. Update µ_0:
   µ^n_0 = h^s_1

9. Update σ²_0:
   σ²_0^n = F^s_1 + h^s_1 h^{sT}_1 − µ^n_0 µ^{nT}_0

10. Iterate steps 2 to 9 a given number of times, or until the difference between parameter values from successive iterations falls below a predefined threshold.
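The M-step half of this loop (steps 4 to 9) can be sketched as a single function. The moment bookkeeping below follows (7.8)-(7.12) with the smoothed statistics substituted in; the function name and array layout are assumptions of this sketch:

```python
import numpy as np

def m_step(v, h_s, F_s, X_s):
    """One M-step: v is T x m, h_s is T x n, F_s and X_s are T x n x n."""
    T = len(v)
    # second moments E[h_t h_t^T] = h^s_t h^sT_t + F^s_t for every t
    hh = np.einsum('ti,tj->tij', h_s, h_s) + F_s
    # 4. update A (sums run over t = 2..T)
    num = np.einsum('ti,tj->ij', h_s[1:], h_s[:-1]) + X_s[1:].sum(axis=0)
    A_n = num @ np.linalg.inv(hh[:-1].sum(axis=0))
    # 5. update Σ_H
    Sigma_H_n = (hh[1:].sum(axis=0) - A_n @ num.T) / (T - 1)
    # 6. update B
    vh = np.einsum('ti,tj->ij', v, h_s)
    B_n = vh @ np.linalg.inv(hh.sum(axis=0))
    # 7. update Σ_V
    Sigma_V_n = (np.einsum('ti,tj->ij', v, v) - B_n @ vh.T) / T
    # 8-9. update µ_0 and σ²_0 (the mean terms cancel, leaving F^s_1)
    mu0_n = h_s[0]
    sigma0_n = F_s[0] + np.outer(h_s[0], h_s[0]) - np.outer(mu0_n, mu0_n)
    return A_n, Sigma_H_n, B_n, Sigma_V_n, mu0_n, sigma0_n
```

The full EM loop alternates this with the filtering and smoothing passes of sections 5 and 6.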
References
[1] C. M. Bishop, Pattern Recognition and Machine Learning (Information
Science and Statistics). Springer, 2006.
[2] Z. Ghahramani and G. Hinton, “Parameter estimation for linear dynam-
ical systems,” Tech. Rep., 1996.
