
EM Algorithm

Jur van den Berg

Kalman Filtering vs. Smoothing

Dynamics and observation model:
  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,      v_t ~ N(0, R)

Kalman filter:
  Compute X_t | Y_0 = y_0, ..., Y_t = y_t
  Real-time, given data so far

Kalman smoother:
  Compute X_t | Y_0 = y_0, ..., Y_T = y_T, for 0 <= t <= T
  Post-processing, given all data

EM Algorithm

  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,      v_t ~ N(0, R)

Kalman smoother:
  Compute the distributions X_0, ..., X_T
  given parameters A, C, Q, R and data y_0, ..., y_T.

EM algorithm:
  Simultaneously optimize X_0, ..., X_T and A, C, Q, R
  given data y_0, ..., y_T.

Probability vs. Likelihood

Probability: predict unknown outcomes based on known parameters:
  p(x | θ)

Likelihood: estimate unknown parameters based on known outcomes:
  L(θ | x) = p(x | θ)

Coin-flip example:
  θ is the probability of heads (parameter)
  x = HHHTTH is the outcome

Likelihood for Coin-flip Example

Probability of outcome given parameter:
  p(x = HHHTTH | θ = 0.5) = 0.5^6 ≈ 0.016

Likelihood of parameter given outcome:
  L(θ = 0.5 | x = HHHTTH) = p(x | θ) = 0.5^6 ≈ 0.016

The likelihood is maximal when θ = 2/3 ≈ 0.667.
The likelihood function is not a probability density.
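A quick numerical illustration of this slide. This is a sketch; the function name `likelihood` and the grid search are ours, not part of the slides:

```python
import numpy as np

def likelihood(theta, heads, tails):
    # L(theta | x) = p(x | theta) for an i.i.d. sequence of coin flips.
    return theta**heads * (1.0 - theta)**tails

# Outcome x = HHHTTH: 4 heads, 2 tails.
L_half = likelihood(0.5, 4, 2)
print(L_half)  # 0.5^6 = 0.015625, i.e. about 0.016

# Scanning a grid shows the likelihood peaks at theta = 4/6 = 2/3.
grid = np.linspace(0.0, 1.0, 601)
theta_best = grid[np.argmax(likelihood(grid, 4, 2))]
```

Note that the likelihoods need not sum to one over θ, which is why the likelihood function is not a probability density.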

Likelihood for Continuous Distributions

Six samples {-3, -2, -1, 1, 2, 3} believed to be drawn from some Gaussian N(0, σ²).

Likelihood L(σ | {-3, -2, -1, 1, 2, 3}):
  p(x = -3 | σ) · p(x = -2 | σ) · ... · p(x = 3 | σ)

Maximum likelihood:
  σ = sqrt( ((-3)² + (-2)² + (-1)² + 1² + 2² + 3²) / 6 ) ≈ 2.16
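The maximum-likelihood σ can be verified numerically. A minimal sketch (variable and function names are ours):

```python
import numpy as np

samples = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])

def log_likelihood(sigma):
    # Sum of log N(x; 0, sigma^2) over the six samples.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - samples**2 / (2 * sigma**2))

# Closed-form maximizer for zero-mean data: sigma^2 = mean of squared samples.
sigma_mle = np.sqrt(np.mean(samples**2))
print(sigma_mle)  # sqrt(28/6) ≈ 2.16

# Numerical check: a grid search lands on (almost) the same value.
grid = np.linspace(0.5, 5.0, 1000)
sigma_grid = grid[np.argmax([log_likelihood(s) for s in grid])]
```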

Likelihood for Stochastic Model

Dynamics model:
  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,      v_t ~ N(0, R)

Suppose x_t and y_t are given for 0 <= t <= T; what is the likelihood of A, C, Q and R?

  L(A, C, Q, R | x, y) = p(x, y | A, C, Q, R) = Π_{t=0}^{T-1} p(x_{t+1} | x_t) · Π_{t=0}^{T} p(y_t | x_t)

Compute the log-likelihood: log p(x, y | A, C, Q, R)
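As a sanity check, this likelihood can be evaluated directly on simulated data. A sketch with small illustrative matrices (all parameter values and function names are ours, not from the slides); the true parameters score higher than a perturbed A:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D state / 1-D observation model (illustrative values).
A = np.array([[1.0, 0.1], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])

def log_gauss(x, mean, cov):
    """log N(x; mean, cov) for a multivariate normal."""
    k = len(x)
    d = x - mean
    return -0.5 * (k * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + d @ np.linalg.solve(cov, d))

def log_likelihood(A, C, Q, R, xs, ys):
    """log p(x, y) = sum_t log p(x_{t+1}|x_t) + sum_t log p(y_t|x_t)."""
    ll = sum(log_gauss(xs[t + 1], A @ xs[t], Q) for t in range(len(xs) - 1))
    ll += sum(log_gauss(ys[t], C @ xs[t], R) for t in range(len(ys)))
    return ll

# Simulate a short trajectory from the model and evaluate its likelihood.
T = 20
xs = [np.array([0.0, 1.0])]
ys = [C @ xs[0] + rng.multivariate_normal(np.zeros(1), R)]
for t in range(T):
    xs.append(A @ xs[-1] + rng.multivariate_normal(np.zeros(2), Q))
    ys.append(C @ xs[-1] + rng.multivariate_normal(np.zeros(1), R))

ll_true = log_likelihood(A, C, Q, R, xs, ys)
ll_bad = log_likelihood(2 * A, C, Q, R, xs, ys)
```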

Log-likelihood

  log p(x, y | A, C, Q, R) = log ( Π_{t=0}^{T-1} p(x_{t+1} | x_t) · Π_{t=0}^{T} p(y_t | x_t) )
                           = Σ_{t=0}^{T-1} log p(x_{t+1} | x_t) + Σ_{t=0}^{T} log p(y_t | x_t) = ...

The multivariate normal distribution N(μ, Σ) has pdf:
  p(x) = (2π)^{-k/2} |Σ|^{-1/2} exp( -1/2 (x - μ)^T Σ^{-1} (x - μ) )

From the model: x_{t+1} ~ N(A x_t, Q) and y_t ~ N(C x_t, R), so

  ... = Σ_{t=0}^{T-1} ( -1/2 log|Q| - 1/2 (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
      + Σ_{t=0}^{T} ( -1/2 log|R| - 1/2 (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const

Log-likelihood #2

  Σ_{t=0}^{T-1} ( -1/2 log|Q| - 1/2 (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
  + Σ_{t=0}^{T} ( -1/2 log|R| - 1/2 (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const = ...

Using a = Tr(a) if a is scalar, and bringing the summation inward:

  ... = -T/2 log|Q| - 1/2 Σ_{t=0}^{T-1} Tr( (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
        - (T+1)/2 log|R| - 1/2 Σ_{t=0}^{T} Tr( (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const

Log-likelihood #3

  -T/2 log|Q| - 1/2 Σ_{t=0}^{T-1} Tr( (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
  - (T+1)/2 log|R| - 1/2 Σ_{t=0}^{T} Tr( (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const = ...

Using Tr(AB) = Tr(BA) and Tr(A) + Tr(B) = Tr(A + B):

  ... = -T/2 log|Q| - 1/2 Tr( Q^{-1} Σ_{t=0}^{T-1} (x_{t+1} - A x_t)(x_{t+1} - A x_t)^T )
        - (T+1)/2 log|R| - 1/2 Tr( R^{-1} Σ_{t=0}^{T} (y_t - C x_t)(y_t - C x_t)^T ) + const

Log-likelihood #4

  -T/2 log|Q| - 1/2 Tr( Q^{-1} Σ_{t=0}^{T-1} (x_{t+1} - A x_t)(x_{t+1} - A x_t)^T )
  - (T+1)/2 log|R| - 1/2 Tr( R^{-1} Σ_{t=0}^{T} (y_t - C x_t)(y_t - C x_t)^T ) + const = ...

Expanding the quadratic terms gives:

l(A, C, Q, R | x, y) =
  -T/2 log|Q| - 1/2 Tr( Q^{-1} Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T ) )
  - (T+1)/2 log|R| - 1/2 Tr( R^{-1} Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T ) ) + const

Maximize likelihood

log is a monotone function:
  max log(f(x))  <=>  max f(x)

Maximize l(A, C, Q, R | x, y) in turn for A, C, Q and R:
  Solve ∂l(A, C, Q, R | x, y) / ∂A = 0 for A
  Solve ∂l(A, C, Q, R | x, y) / ∂C = 0 for C
  Solve ∂l(A, C, Q, R | x, y) / ∂Q = 0 for Q
  Solve ∂l(A, C, Q, R | x, y) / ∂R = 0 for R

Matrix derivatives

Defined for scalar functions f : R^{n×m} -> R.

Key identities:
  ∂(x^T A x) / ∂x = x^T (A^T + A)
  ∂ Tr(B^T A B) / ∂B = B^T (A^T + A)
  ∂ Tr(BA) / ∂A = B^T
  ∂ log|A| / ∂A = A^{-T}
  Tr(AB) = Tr(BA) = Tr(B^T A^T)
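These identities are easy to sanity-check numerically. A minimal sketch verifying ∂ log|A| / ∂A = A^{-T} by central finite differences (the test matrix is an arbitrary well-conditioned choice of ours):

```python
import numpy as np

# A diagonally dominant matrix, so |A| > 0 and log|A| is well defined.
rng = np.random.default_rng(1)
A = 3 * np.eye(3) + 0.1 * rng.standard_normal((3, 3))

# Entry-wise central finite differences of log|A|.
eps = 1e-6
grad = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A)
        E[i, j] = eps
        grad[i, j] = (np.log(np.linalg.det(A + E))
                      - np.log(np.linalg.det(A - E))) / (2 * eps)

# The identity predicts the gradient is the transposed inverse.
analytic = np.linalg.inv(A).T
print(np.max(np.abs(grad - analytic)))  # close to zero
```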

Optimizing A

Derivative:
  ∂l(A, C, Q, R | x, y) / ∂A = -1/2 Q^{-1} Σ_{t=0}^{T-1} ( -2 x_{t+1} x_t^T + 2 A x_t x_t^T )

Maximizer:
  A = ( Σ_{t=0}^{T-1} x_{t+1} x_t^T ) ( Σ_{t=0}^{T-1} x_t x_t^T )^{-1}

Optimizing C

Derivative:
  ∂l(A, C, Q, R | x, y) / ∂C = -1/2 R^{-1} Σ_{t=0}^{T} ( -2 y_t x_t^T + 2 C x_t x_t^T )

Maximizer:
  C = ( Σ_{t=0}^{T} y_t x_t^T ) ( Σ_{t=0}^{T} x_t x_t^T )^{-1}

Optimizing Q

Derivative with respect to the inverse:
  ∂l(A, C, Q, R | x, y) / ∂Q^{-1} = T/2 Q - 1/2 Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T )

Maximizer:
  Q = 1/T Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T )

Optimizing R

Derivative with respect to the inverse:
  ∂l(A, C, Q, R | x, y) / ∂R^{-1} = (T+1)/2 R - 1/2 Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T )

Maximizer:
  R = 1/(T+1) Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T )
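The four maximizers translate directly into code. A minimal numpy sketch assuming fully observed states xs[0..T] and observations ys[0..T] (function and variable names are ours; in the EM setting the sums over states are replaced by their expected values, as the later slides describe). On noise-free data generated by a known A and C, the estimates recover the true matrices and Q, R shrink to zero:

```python
import numpy as np

def m_step(xs, ys):
    """Closed-form maximizers for A, C, Q, R given sequences xs[0..T], ys[0..T]."""
    T = len(xs) - 1
    # A = (sum x_{t+1} x_t^T)(sum x_t x_t^T)^{-1}, sums over t = 0..T-1.
    Sxx  = sum(np.outer(xs[t],     xs[t]) for t in range(T))
    Sx1x = sum(np.outer(xs[t + 1], xs[t]) for t in range(T))
    A = Sx1x @ np.linalg.inv(Sxx)
    # C = (sum y_t x_t^T)(sum x_t x_t^T)^{-1}, sums over t = 0..T.
    Sxx_all = sum(np.outer(xs[t], xs[t]) for t in range(T + 1))
    Syx     = sum(np.outer(ys[t], xs[t]) for t in range(T + 1))
    C = Syx @ np.linalg.inv(Sxx_all)
    # Q = 1/T sum (x_{t+1} - A x_t)(x_{t+1} - A x_t)^T (expanded form above).
    Q = sum(np.outer(xs[t + 1] - A @ xs[t], xs[t + 1] - A @ xs[t])
            for t in range(T)) / T
    # R = 1/(T+1) sum (y_t - C x_t)(y_t - C x_t)^T.
    R = sum(np.outer(ys[t] - C @ xs[t], ys[t] - C @ xs[t])
            for t in range(T + 1)) / (T + 1)
    return A, C, Q, R

# Noise-free data from a known system (illustrative values).
A_true = np.array([[0.9, 0.2], [0.0, 0.8]])
C_true = np.array([[1.0, 1.0]])
xs = [np.array([1.0, -1.0])]
for t in range(30):
    xs.append(A_true @ xs[-1])
ys = [C_true @ x for x in xs]
A_hat, C_hat, Q_hat, R_hat = m_step(xs, ys)
```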

EM-algorithm

  x_{t+1} = A x_t + w_t,  w_t ~ N(0, Q)
  y_t = C x_t + v_t,      v_t ~ N(0, R)

Start with initial guesses of A, C, Q, R.

Kalman smoother (E-step):
  Compute the distributions X_0, ..., X_T
  given data y_0, ..., y_T and A, C, Q, R.

Update parameters (M-step):
  Update A, C, Q, R such that the
  expected log-likelihood is maximized.

Repeat until convergence (local optimum).

Kalman Smoother

for (t = 0; t < T; ++t)  // Kalman filter
  x_{t+1|t} = A x_{t|t}
  P_{t+1|t} = A P_{t|t} A^T + Q
  K_{t+1} = P_{t+1|t} C^T ( C P_{t+1|t} C^T + R )^{-1}
  x_{t+1|t+1} = x_{t+1|t} + K_{t+1} ( y_{t+1} - C x_{t+1|t} )
  P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}

for (t = T - 1; t >= 0; --t)  // Backward pass
  L_t = P_{t|t} A^T P_{t+1|t}^{-1}
  x_{t|T} = x_{t|t} + L_t ( x_{t+1|T} - x_{t+1|t} )
  P_{t|T} = P_{t|t} + L_t ( P_{t+1|T} - P_{t+1|t} ) L_t^T
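The pseudocode above maps directly onto numpy. A minimal sketch (the function name and the prior arguments x0 = x_{0|0}, P0 = P_{0|0} are our choices; like the pseudocode, y_0 is assumed folded into the initial estimate, since the loop only consumes y_1, ..., y_T):

```python
import numpy as np

def kalman_smoother(A, C, Q, R, ys, x0, P0):
    """Forward Kalman filter followed by the backward pass.
    Returns smoothed means x_{t|T}, covariances P_{t|T}, and gains L_t."""
    T = len(ys) - 1
    # xf[t], Pf[t] hold x_{t|t}, P_{t|t}; xp[t], Pp[t] hold x_{t|t-1}, P_{t|t-1}.
    xf, Pf = [x0], [P0]
    xp, Pp = [None], [None]
    for t in range(T):
        x_pred = A @ xf[t]                    # x_{t+1|t} = A x_{t|t}
        P_pred = A @ Pf[t] @ A.T + Q          # P_{t+1|t} = A P_{t|t} A^T + Q
        S = C @ P_pred @ C.T + R
        K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain K_{t+1}
        xf.append(x_pred + K @ (ys[t + 1] - C @ x_pred))
        Pf.append(P_pred - K @ C @ P_pred)
        xp.append(x_pred)
        Pp.append(P_pred)
    # Backward pass: L_t = P_{t|t} A^T P_{t+1|t}^{-1}.
    xs_s, Ps_s, Ls = [None] * (T + 1), [None] * (T + 1), [None] * T
    xs_s[T], Ps_s[T] = xf[T], Pf[T]
    for t in range(T - 1, -1, -1):
        L = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])
        xs_s[t] = xf[t] + L @ (xs_s[t + 1] - xp[t + 1])
        Ps_s[t] = Pf[t] + L @ (Ps_s[t + 1] - Pp[t + 1]) @ L.T
        Ls[t] = L
    return xs_s, Ps_s, Ls

# Example: smooth noisy position readings of a constant-velocity model
# (all parameter values are illustrative choices of ours).
rng = np.random.default_rng(2)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])
ys = [np.array([0.1 * t + rng.standard_normal()]) for t in range(30)]
xs_s, Ps_s, Ls = kalman_smoother(A, C, Q, R, ys, np.zeros(2), np.eye(2))
```

Conditioning on all data can only shrink the uncertainty, so each smoothed covariance P_{t|T} is bounded above (in the positive-semidefinite order) by the corresponding filtered covariance.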

Update Parameters

The likelihood is in terms of x, but only the distributions X are available:

l(A, C, Q, R | x, y) =
  -T/2 log|Q| - 1/2 Tr( Q^{-1} Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T ) )
  - (T+1)/2 log|R| - 1/2 Tr( R^{-1} Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T ) ) + const

The likelihood function is linear in x_t, x_t x_t^T and x_t x_{t+1}^T.
Expected likelihood: replace them with

  E( X_t | y ) = x_{t|T}
  E( X_t X_t^T | y ) = P_{t|T} + x_{t|T} x_{t|T}^T
  E( X_t X_{t+1}^T | y ) = x_{t|t} x_{t+1|T}^T + L_t ( P_{t+1|T} + ( x_{t+1|T} - x_{t+1|t} ) x_{t+1|T}^T )

Use the maximizers to update A, C, Q and R.

Convergence

Convergence to a local optimum is guaranteed.
Similar to coordinate ascent.

Conclusion

EM-algorithm to simultaneously optimize state estimates and model parameters.
Given "training data", the EM-algorithm can be used (off-line) to learn the
model for subsequent use in (real-time) Kalman filters.

Next time
Learning from demonstrations
Dynamic Time Warping
