Kalman Filter
Nonlinear State Space Models
Particle Filtering
Andrew Lesniewski
Baruch College
New York
Fall 2019
Outline
1 Warm-up: Recursive Least Squares
2 Kalman Filter
3 Nonlinear State Space Models
4 Particle Filtering
OLS regression
As a motivation for the remainder of this lecture, we consider the standard linear model

Y = X^T β + ε,   (1)

where Y ∈ R, X ∈ R^k, and ε ∈ R is noise (this includes the model with an intercept as a special case, in which the first component of X is assumed to be 1).
Given n observations x_1, …, x_n and y_1, …, y_n of X and Y, respectively, the ordinary least squares (OLS) regression leads to the following estimated value of the coefficient β:

β̂_n = (X_n^T X_n)^{-1} X_n^T Y_n,   (2)

where X_n denotes the n × k matrix with rows x_1^T, …, x_n^T, and Y_n denotes the vector (y_1, …, y_n)^T, respectively.
Suppose now that X and Y consists of a streaming set of data, and each new
observation leads to an updated value of the estimated β.
It is not optimal to redo the entire calculation above after a new observation
arrives in order to find the updated value.
Instead, we shall derive a recursive algorithm that stores the last calculated
value of the estimated β, and updates it to incorporate the impact of the latest
observation.
To this end, we introduce the notation:

P_n = (X_n^T X_n)^{-1},
B_n = X_n^T Y_n,   (4)

so that β̂_n = P_n B_n. A new observation (x_{n+1}, y_{n+1}) updates these quantities as follows:

P_{n+1}^{-1} = P_n^{-1} + x_{n+1} x_{n+1}^T,
B_{n+1} = B_n + x_{n+1} y_{n+1}.   (5)
We now invoke the matrix inversion lemma (the Sherman-Morrison-Woodbury formula),

(A + BD)^{-1} = A^{-1} − A^{-1} B (I + D A^{-1} B)^{-1} D A^{-1},   (6)

valid for a square invertible matrix A, and matrices B and D such that the operations above are defined.
This yields the following relation:

P_{n+1} = P_n − P_n x_{n+1} (1 + x_{n+1}^T P_n x_{n+1})^{-1} x_{n+1}^T P_n
        = P_n − K_{n+1} x_{n+1}^T P_n,

where

K_{n+1} = P_n x_{n+1} (1 + x_{n+1}^T P_n x_{n+1})^{-1}.
Next, we define the prediction error

ε̂_{n+1} = y_{n+1} − x_{n+1}^T β̂_n.

Since y_{n+1} = x_{n+1}^T β̂_n + ε̂_{n+1}, the update of B_n can be written as

B_{n+1} = B_n + x_{n+1} x_{n+1}^T β̂_n + x_{n+1} ε̂_{n+1}.

Consequently,

P_{n+1}^{-1} β̂_{n+1} = P_n^{-1} β̂_n + x_{n+1} x_{n+1}^T β̂_n + x_{n+1} ε̂_{n+1}
                      = (P_n^{-1} + x_{n+1} x_{n+1}^T) β̂_n + x_{n+1} ε̂_{n+1}
                      = P_{n+1}^{-1} β̂_n + x_{n+1} ε̂_{n+1}.
In other words,

β̂_{n+1} = β̂_n + P_{n+1} x_{n+1} ε̂_{n+1}.

However, from the definition of K_{n+1}, P_{n+1} x_{n+1} = K_{n+1}, and so

β̂_{n+1} = β̂_n + K_{n+1} ε̂_{n+1}.
The algorithm can be summarized as follows. We initialize β̂_0 (e.g. at 0) and P_0 (e.g. at I), and iterate:

ε̂_{n+1} = y_{n+1} − x_{n+1}^T β̂_n,
K_{n+1} = P_n x_{n+1} (1 + x_{n+1}^T P_n x_{n+1})^{-1},
P_{n+1} = P_n − K_{n+1} x_{n+1}^T P_n,   (7)
β̂_{n+1} = β̂_n + K_{n+1} ε̂_{n+1}.
Note that (i) we no longer have to store the (potentially large) matrices X_n and Y_n, and (ii) the computationally expensive operation of inverting the matrix X_n^T X_n is replaced with a small number of simpler operations (the only "inversion" required is that of the scalar 1 + x_{n+1}^T P_n x_{n+1}).
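The recursion (7) translates directly into a few lines of code. The following is a minimal numpy sketch (the function and variable names are ours, not part of the lecture); it processes a synthetic stream of observations and recovers the true coefficients.

```python
import numpy as np

def rls_update(beta, P, x, y):
    """One step of the recursive least squares recursion (7)."""
    eps = y - x @ beta                    # prediction error
    K = P @ x / (1.0 + x @ P @ x)         # gain vector K_{n+1}
    P_new = P - np.outer(K, x @ P)        # P_{n+1} = P_n - K_{n+1} x^T P_n
    beta_new = beta + K * eps             # updated coefficient estimate
    return beta_new, P_new

# Streaming example: recover beta_true from noisy observations.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
beta, P = np.zeros(3), 1e6 * np.eye(3)    # diffuse initialization of P_0
for _ in range(500):
    x = rng.normal(size=3)
    y = x @ beta_true + 0.1 * rng.normal()
    beta, P = rls_update(beta, P, x, y)
```

Note that only β̂ and the k × k matrix P are stored between observations, exactly as promised by point (i) above.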
We now move on to the main topic of these notes, the Kalman filter and its
generalizations.
A state space model (SSM) is a time series model in which the time series Yt is
interpreted as the result of a noisy observation of a stochastic process Xt .
The values of the variables Xt and Yt can be continuous (scalar or vector) or
discrete.
Graphically, an SSM is represented as a chain of hidden states X_1 → X_2 → ⋯ → X_t → ⋯, with an arrow from each state X_t down to its observation Y_t.
SSMs belong to the realm of Bayesian inference, and they have been
successfully applied in many fields to solve a broad range of problems.
Our discussion of SSMs follows largely [2].
The model is specified by the transition and observation densities:

X_t ∼ p(X_t | X_{t−1}),
Y_t ∼ p(Y_t | X_t).   (9)
The most well studied SSM is the Kalman filter, for which the processes above are linear and the sources of randomness are Gaussian. Namely, a linear state space model has the form:

X_{t+1} = G X_t + ε_{t+1},
Y_t = H X_t + η_t,   (10)

where the disturbances are Gaussian with E(ε_t) = E(η_t) = 0, E(ε_t ε_s^T) = δ_{ts} Q, and E(η_t η_s^T) = δ_{ts} R. Here δ_{ts} denotes Kronecker's delta, and Q and R are known positive definite (covariance) matrices. We also assume that the components of ε_t and η_s are independent of each other for all t and s.
The matrices Q and R may depend deterministically on time.
The first of the equations in (10) is called the state equation, while the second
one is referred to as the observation equation.
Kalman filter
In principle, any inference for this model can be done using the standard
methods of multivariate statistics.
However, these methods require storing large amounts of data and inverting
tn × tn matrices. Notice that, as new data arrive, the storage requirements and
matrix dimensionality increase.
This is frequently computationally intractable and impractical.
Instead, the Kalman filter relies on a recursive approach which does not require
significant storage resources and involves inverting n × n matrices only.
We will go through a detailed derivation of this recursion.
The purpose of filtering is to update the knowledge of the system each time a
new observation is made.
We define the one period predictor µ_{t+1} and its covariance P_{t+1}, based on the observations up to time t:

µ_{t+1} = E(X_{t+1} | Y_{1:t}),
P_{t+1} = Var(X_{t+1} | Y_{1:t}),   (12)

as well as the filtered estimates

µ_{t|t} = E(X_t | Y_{1:t}),
P_{t|t} = Var(X_t | Y_{1:t}).   (13)
We let

v_t = Y_t − E(Y_t | Y_{1:t−1})   (14)

denote the one period prediction error, or innovation. In Homework Assignment #6 we show that the v_t's are mutually independent. Note that

v_t = Y_t − E(H X_t + η_t | Y_{1:t−1})
    = Y_t − H µ_t.
Now we notice that the variance F_t = Var(v_t | Y_{1:t−1}) of the innovation is given by

F_t = H P_t H^T + R.   (17)
Lemma
First, we will establish the following Lemma. Let X and Y be jointly Gaussian random vectors with

E(X) = µ_X,   E(Y) = µ_Y,

and covariance structure

Var(X) = Σ_XX,   Cov(X, Y) = Σ_XY,   Var(Y) = Σ_YY.

Then

E(X | Y) = µ_X + Σ_XY Σ_YY^{-1} (Y − µ_Y),   (18)

and

Var(X | Y) = Σ_XX − Σ_XY Σ_YY^{-1} Σ_XY^T.   (19)
Proof: Consider the random variable

Z = X − Σ_XY Σ_YY^{-1} (Y − µ_Y).

A direct calculation gives

E(Z) = µ_X,
Var(Z) = E((Z − µ_X)(Z − µ_X)^T)
       = Σ_XX − Σ_XY Σ_YY^{-1} Σ_XY^T.

Finally,

Cov(Y, Z) = E(Y (Z − µ_X)^T)
          = E(Y (X − µ_X)^T) − E(Y (Y − µ_Y)^T) Σ_YY^{-1} Σ_XY^T
          = 0.
Since Y and Z are jointly Gaussian and uncorrelated, they are independent. Hence E(Z | Y) = E(Z) = µ_X, and we have

E(X | Y) = µ_X + Σ_XY Σ_YY^{-1} (Y − µ_Y),

which proves (18). Also, conditioned on Y, X and Z differ by a constant vector, and so

Var(X | Y) = Var(Z | Y)
           = Var(Z)
           = Σ_XX − Σ_XY Σ_YY^{-1} Σ_XY^T,

which proves (19).
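As a quick numerical sanity check of (18) and (19), one can simulate a jointly Gaussian pair and verify that Z is indeed uncorrelated with Y, and that Var(Z) matches the conditional variance. The covariance values below are illustrative, not from the lecture.

```python
import numpy as np

# Illustrative scalar case: Var(X) = 2.0, Cov(X, Y) = 0.6, Var(Y) = 1.5.
S_XX, S_XY, S_YY = 2.0, 0.6, 1.5
mu_X, mu_Y = 1.0, -0.5

# Conditional variance from (19); for scalars, Sigma_XY^T = Sigma_XY.
cond_var = S_XX - S_XY ** 2 / S_YY

# Simulate jointly Gaussian (X, Y) and form Z = X - S_XY S_YY^{-1} (Y - mu_Y).
rng = np.random.default_rng(2)
L = np.linalg.cholesky(np.array([[S_XX, S_XY], [S_XY, S_YY]]))
xy = np.array([mu_X, mu_Y]) + rng.normal(size=(100_000, 2)) @ L.T
Z = xy[:, 0] - S_XY / S_YY * (xy[:, 1] - mu_Y)

# Cov(Y, Z) should be close to zero, and Var(Z) close to cond_var.
cov_YZ = np.cov(Z, xy[:, 1])[0, 1]
```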
We now apply the Lemma to X_t and v_t, conditionally on Y_{1:t−1}. Since v_t = H(X_t − µ_t) + η_t and η_t is independent of X_t, note that

Cov(X_t, v_t | Y_{1:t−1}) = E(X_t (X_t − µ_t)^T H^T | Y_{1:t−1})
                          = E((X_t − µ_t)(X_t − µ_t)^T | Y_{1:t−1}) H^T   (21)
                          = P_t H^T,
The Lemma then yields

µ_{t|t} = µ_t + P_t H^T F_t^{-1} v_t,
Var(X_t | Y_{1:t}) = P_t − P_t H^T F_t^{-1} H P_t.

From the definition (13) of P_{t|t}, this can be stated as the following relation:

µ_{t|t} = µ_t + K_t v_t,
P_{t|t} = (I − K_t H) P_t,

where

K_t = P_t H^T F_t^{-1}.   (24)

The matrix K_t is referred to as the Kalman gain.
Now we are ready to establish recursions for µ_t and P_t. From the state equation (the first equation in (10)) we have

µ_{t+1} = E(G X_t + ε_{t+1} | Y_{1:t})
        = G µ_{t|t}.   (25)

Furthermore,

P_{t+1} = Var(G X_t + ε_{t+1} | Y_{1:t})
        = G P_{t|t} G^T + Q.   (26)

Relations (25) and (26) are called the prediction step of the Kalman filter.
Using this notation, we can write the full system of recursive relations for updating from t to t + 1 in the following form:

v_t = Y_t − H µ_t,
F_t = H P_t H^T + R,
K_t = P_t H^T F_t^{-1},
µ_{t|t} = µ_t + K_t v_t,   (28)
P_{t|t} = (I − K_t H) P_t,
µ_{t+1} = G µ_{t|t},
P_{t+1} = G P_{t|t} G^T + Q,

for t = 1, 2, ….
The initial values µ1 and P1 are assumed to be known (consequently,
X1 ∼ N(µ1 , P1 )).
If the matrices G, H, Q, R depend (deterministically) on t, the formulas above
remain valid with G replaced by Gt , etc.
In particular, if constant terms are added to the right hand sides, X_{t+1} = G X_t + C_t + ε_{t+1} and Y_t = H X_t + D_t + η_t, the recursion becomes:

v_t = Y_t − H µ_t − D_t,
F_t = H P_t H^T + R,
K_t = P_t H^T F_t^{-1},
µ_{t|t} = µ_t + K_t v_t,   (30)
P_{t|t} = (I − K_t H) P_t,
µ_{t+1} = G µ_{t|t} + C_t,
P_{t+1} = G P_{t|t} G^T + Q,

for t = 1, 2, ….
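The recursion (28) is straightforward to implement. Below is a minimal numpy sketch of one full iteration (the function name and the illustrative AR(1) parameters are ours, not part of the lecture):

```python
import numpy as np

def kalman_step(mu, P, y, G, H, Q, R):
    """One iteration of the Kalman recursion (28): filtering step, then prediction step."""
    v = y - H @ mu                               # innovation v_t
    F = H @ P @ H.T + R                          # innovation variance F_t
    K = P @ H.T @ np.linalg.inv(F)               # Kalman gain K_t
    mu_filt = mu + K @ v                         # filtered mean mu_{t|t}
    P_filt = (np.eye(len(mu)) - K @ H) @ P       # filtered covariance P_{t|t}
    mu_next = G @ mu_filt                        # one period predictor mu_{t+1}
    P_next = G @ P_filt @ G.T + Q                # predictor covariance P_{t+1}
    return mu_next, P_next, v, F

# A scalar AR(1) state observed with noise (illustrative parameters).
G, H = np.array([[0.9]]), np.array([[1.0]])
Q, R = np.array([[0.1]]), np.array([[0.5]])
mu, P = np.array([0.0]), np.array([[1.0]])
for y in [0.3, -0.1, 0.4]:
    mu, P, v, F = kalman_step(mu, P, np.array([y]), G, H, Q, R)
```

Note that only the current µ and P are carried from step to step, and only the n × n matrix F is inverted, in line with the storage and cost remarks above.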
A. Lesniewski Time Series Analysis
State smoothing
State smoothing refers to the process of estimation of the values of the states
X1 , . . . , XT , given the entire observation set.
The objective is thus to (recursively) determine the conditional mean

X̂_t = E(X_t | Y_{1:T}).   (31)
An analysis similar to the derivation of the Kalman filter leads to the following result (see [2] for the derivation). The smoothing process consists of two phases:
(i) a forward sweep of the Kalman filter (28) for t = 1, …, T,
(ii) a backward recursion running from t = T down to t = 1; the explicit formulas are given in [2].
Forecasting
Forecasting d steps ahead amounts to computing the predictor

X*_{t+d} = E(X_{t+d} | Y_{1:t}),

with variance

P*_{t+d} = E((X_{t+d} − X*_{t+d})(X_{t+d} − X*_{t+d})^T | Y_{1:t}),   (35)

and

Y*_{t+d} = E(Y_{t+d} | Y_{1:t}),   (36)

with variance

V*_{t+d} = E((Y_{t+d} − Y*_{t+d})(Y_{t+d} − Y*_{t+d})^T | Y_{1:t}),   (37)

respectively.
For d = 1, these quantities are given by the prediction step of the Kalman filter:

X*_{t+1} = G µ_{t|t},   (38)
P*_{t+1} = G P_{t|t} G^T + Q,   (39)
Y*_{t+1} = H µ_{t+1},   (40)
V*_{t+1} = H P*_{t+1} H^T + R.   (41)

For d > 1, we iterate:

X*_{t+d} = G X*_{t+d−1},   (42)
P*_{t+d} = G P*_{t+d−1} G^T + Q,   (43)
Y*_{t+d} = H X*_{t+d},   (44)
V*_{t+d} = H P*_{t+d} H^T + R.   (45)
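Formulas (38)-(45) can be sketched as a short helper that returns, for each horizon d, the tuple (X*, P*, Y*, V*). The function name and the illustrative parameters are ours.

```python
import numpy as np

def forecast(mu_filt, P_filt, G, H, Q, R, d):
    """d-step ahead forecasts: start from (38)-(39), then iterate (42)-(45)."""
    X, Pf = G @ mu_filt, G @ P_filt @ G.T + Q        # (38)-(39)
    out = []
    for _ in range(d):
        Y, V = H @ X, H @ Pf @ H.T + R               # (44)-(45); also (40)-(41) for d = 1
        out.append((X, Pf, Y, V))
        X, Pf = G @ X, G @ Pf @ G.T + Q              # (42)-(43)
    return out

# Illustrative scalar model.
G, H = np.array([[0.5]]), np.array([[1.0]])
Q, R = np.array([[0.1]]), np.array([[0.2]])
preds = forecast(np.array([1.0]), np.array([[1.0]]), G, H, Q, R, d=3)
```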
So far we have left a number of model parameters unspecified, namely:
(i) the initial values µ1 and P1 that enter the probability distribution N(µ1 , P1 )
of the state X1 ,
(ii) the matrices G and H,
(iii) the variances Q and R of the disturbances in the state and observation
equations, respectively.
We denote the unspecified model parameters collectively by θ.
If µ1 and P1 are known, the remaining parameters θ can be estimated by means
of MLE.
The joint density of the observations factorizes as

p(Y_{1:T} | θ) = ∏_{t=1}^T p(Y_t | Y_{1:t−1}),   (46)

and, since v_t | Y_{1:t−1} ∼ N(0, F_t), the negative log likelihood takes the form

−log L(θ | Y_{1:T}) = (1/2) ∑_{t=1}^T [ log det(F_t) + v_t^T F_t^{-1} v_t ] + const.   (47)
For each t, the quantities v_t and F_t entering the log likelihood function are calculated by running the Kalman filter. Searching for the minimum of the negative log likelihood (47) using an efficient algorithm such as BFGS, we find estimates of θ.
In the case of unknown µ1 and P1 , one can use the diffuse log likelihood
method, which is discussed in detail in [2].
Alternatively, one can regard µ_1 and P_1 as hyperparameters of the model.
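As an illustration of (46)-(47), the sketch below evaluates the negative log likelihood of the scalar model X_{t+1} = g X_t + ε_{t+1}, Y_t = X_t + η_t by running the Kalman filter, and minimizes it over θ = (g, q). The parameterization and all names are our own choices for the example; in practice one would estimate R and the remaining parameters as well.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, ys, mu1=0.0, P1=1.0, r=0.5):
    """Negative log likelihood (47) for X_{t+1} = g X_t + eps (Var q), Y_t = X_t + eta (Var r)."""
    g, q = theta
    mu, P, nll = mu1, P1, 0.0
    for y in ys:
        v = y - mu                       # innovation v_t
        F = P + r                        # innovation variance F_t
        nll += 0.5 * (np.log(F) + v * v / F)
        K = P / F                        # Kalman gain
        mu_f, P_f = mu + K * v, (1 - K) * P
        mu, P = g * mu_f, g * g * P_f + q
    return nll

# Simulate data from g = 0.8, q = 0.2, r = 0.5 and estimate theta.
rng = np.random.default_rng(1)
x, ys = 0.0, []
for _ in range(300):
    x = 0.8 * x + rng.normal(scale=np.sqrt(0.2))
    ys.append(x + rng.normal(scale=np.sqrt(0.5)))
res = minimize(neg_log_lik, x0=[0.5, 0.5], args=(np.array(ys),),
               method="L-BFGS-B", bounds=[(-0.99, 0.99), (1e-4, 5.0)])
```

The text suggests BFGS; we use L-BFGS-B here only so that stationarity and positivity bounds on (g, q) can be imposed.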
The recursion relations defining the Kalman filter can be extended to the case of
more general state space models.
Some of these extensions are straightforward (such as adding constant terms to
the right hand sides of the state and observation equations in (10)), others are
fundamentally more complicated.
In the remainder of this lecture we summarize these extensions following [2].
X_1 → X_2 → ⋯ → X_t → ⋯
 ↓     ↓          ↓            (48)
Y_1    Y_2   ⋯    Y_t   ⋯
The state process is Markov: the distribution of X_t depends only on X_{t−1},

X_t ∼ p(X_t | X_{t−1}).   (49)

On the other hand, Y_t depends only on the value of the state variable X_t:

Y_t ∼ p(Y_t | X_t).   (50)
In general, the state and observation equations take the form

X_t = G(X_{t−1}, ε_t),
Y_t = H(X_t, η_t),   (51)

an example of which is the stochastic volatility model

X_{t+1} = a + X_t + ε_{t+1},
Y_t = exp(X_t) η_t,   (52)

where
(i) G(x) and H(x) are differentiable functions on R^r,
(ii) the disturbances ε_t and η_t are mutually and serially uncorrelated with mean zero and covariances Q(X_t) and R(X_t), respectively (we do not require that their distributions are Gaussian),
(iii) X_1 has mean µ_1 and variance P_1, and is uncorrelated with all noises.
We denote by G′_t and H′_t the matrices of first derivatives (Jacobi matrices) of G and H, evaluated at µ_{t|t} and µ_t, respectively. We now expand G and H in Taylor series to first order around these points,

G(X_t) ≈ G(µ_{t|t}) + G′_t (X_t − µ_{t|t}),
H(X_t) ≈ H(µ_t) + H′_t (X_t − µ_t),

and Q and R to zeroth order, Q(X_t) ≈ Q(µ_{t|t}), R(X_t) ≈ R(µ_t).
Applying the Kalman filter formulas to this SSM, we obtain the following EKF recursion:

v_t = Y_t − H(µ_t),
F_t = H′_t P_t H′_t^T + R(µ_t),
K_t = P_t H′_t^T F_t^{-1},
µ_{t|t} = µ_t + K_t v_t,   (55)
P_{t|t} = (I − K_t H′_t) P_t,
µ_{t+1} = G(µ_{t|t}),
P_{t+1} = G′_t P_{t|t} G′_t^T + Q(µ_{t|t}),

for t = 1, 2, ….
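One iteration of (55) can be sketched as follows. The function accepts G, H and their Jacobians as callables; the scalar model used in the usage example (sin dynamics, squared observation) is purely illustrative and not from the lecture.

```python
import numpy as np

def ekf_step(mu, P, y, G, G_jac, H, H_jac, Q, R):
    """One iteration of the EKF recursion (55); G, H are functions, *_jac their Jacobians."""
    Hp = H_jac(mu)                               # H'_t, evaluated at the predictor mu_t
    v = y - H(mu)                                # innovation
    F = Hp @ P @ Hp.T + R(mu)                    # innovation variance
    K = P @ Hp.T @ np.linalg.inv(F)              # gain
    mu_f = mu + K @ v                            # filtered mean mu_{t|t}
    P_f = (np.eye(len(mu)) - K @ Hp) @ P         # filtered covariance P_{t|t}
    Gp = G_jac(mu_f)                             # G'_t, evaluated at the filtered mean
    return G(mu_f), Gp @ P_f @ Gp.T + Q(mu_f), v, F

# Illustrative scalar model: X_{t+1} = sin(X_t) + eps, Y_t = X_t^2 + eta.
G  = lambda x: np.sin(x);         G_jac = lambda x: np.diag(np.cos(x))
H  = lambda x: x ** 2;            H_jac = lambda x: np.diag(2 * x)
Q  = lambda x: np.array([[0.1]]); R = lambda x: np.array([[0.2]])
mu, P = np.array([0.5]), np.array([[1.0]])
for y in [0.4, 0.2, 0.6]:
    mu, P, v, F = ekf_step(mu, P, np.array([y]), G, G_jac, H, H_jac, Q, R)
```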
The EKF works well if the functions G and H are weakly nonlinear; for strongly nonlinear models its performance may be poor.
Other extensions of the Kalman filter have been developed, including the
unscented Kalman filter (UKF).
It is based on a different principle than the EKF: rather than approximating G and
H by linear expressions, one matches approximately the first and second
moments of a nonlinear function of a Gaussian random variable, see [2] for
details.
Another approach to inference in nonlinear SSMs is via Monte Carlo (MC) techniques known as particle filters, a.k.a. sequential Monte Carlo (SMC).
Particle filtering
The object of interest is the joint smoothing distribution of the states given the observations,

p(X_{1:t} | Y_{1:t}) = p(X_{1:t}, Y_{1:t}) / p(Y_{1:t}).   (56)
We derive the following recursion relation for the joint smoothing distribution:

p(X_{1:t} | Y_{1:t}) = p(X_{1:t−1} | Y_{1:t−1}) p(Y_t | X_t) p(X_t | X_{t−1}) / p(Y_t | Y_{1:t−1}).   (58)

Filtering recursion
The filtering distribution is calculated based on the arrival of the new observation
Yt .
Namely, applying Bayes' rule and the fact that Y_t depends on X_t only,

p(X_t | Y_{1:t}) = p(Y_t, X_t | Y_{1:t−1}) / p(Y_t | Y_{1:t−1})
                 = p(Y_t | X_t, Y_{1:t−1}) p(X_t | Y_{1:t−1}) / ∫ p(Y_t | x_t) p(x_t | Y_{1:t−1}) dx_t   (59)
                 = p(Y_t | X_t) p(X_t | Y_{1:t−1}) / ∫ p(Y_t | x_t) p(x_t | Y_{1:t−1}) dx_t.
The difficulty with this recursion is clear: there is a complicated integral in the
denominator, which cannot in general be calculated in closed form.
In some special cases this can be done: for example, in the case of a linear
Gaussian state space model, this integral is Gaussian and can be calculated.
The recursion above leads then to the Kalman filter.
Instead of trying to evaluate the integral numerically, we will develop a Monte
Carlo based approach for approximately solving recursions (58) and (59).
Importance sampling
Suppose we are faced with Monte Carlo evaluation of the expected value

E(f(X_{1:t}) | Y_{1:t}) = ∫ f(x_{1:t}) p(x_{1:t} | Y_{1:t}) dx_{1:t}.   (60)
The idea of importance sampling is to rewrite this integral in terms of a proposal distribution g:

E(f(X_{1:t}) | Y_{1:t}) = ∫ f(x_{1:t}) [p(x_{1:t} | Y_{1:t}) / g(x_{1:t} | Y_{1:t})] g(x_{1:t} | Y_{1:t}) dx_{1:t}.   (61)
The proposal distribution should be chosen so that it is easy to sample from it.
The algorithm proceeds as follows:
1. Choose a proposal distribution g(x_{1:t} | Y_{1:t}).
2. Draw N samples of paths x_{1:t}^1, …, x_{1:t}^N from the proposal distribution, and assign to each of them a weight proportional to the ratio of the target and proposal distributions:

w_t^j ∝ p(x_{1:t}^j | Y_{1:t}) / g(x_{1:t}^j | Y_{1:t}).   (62)
The expectation (60) is then approximated by

Ê_N(f(X_{1:t}) | Y_{1:t}) = ∑_{j=1}^N ŵ_t^j f(x_{1:t}^j),   (63)

where the normalized weights are given by

ŵ_t^j = w_t^j / ∑_{i=1}^N w_t^i.   (64)
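Steps (62)-(64) are easy to demonstrate on a toy example: estimate E(X) = 1 for a N(1, 1) target using samples from a N(0, 4) proposal. The densities and constants below are our own illustration.

```python
import numpy as np

# Self-normalized importance sampling, as in (62)-(64).
rng = np.random.default_rng(3)
N = 200_000
xs = rng.normal(0.0, 2.0, size=N)                # draws from the proposal g = N(0, 4)
log_p = -0.5 * (xs - 1.0) ** 2                   # unnormalized log target, N(1, 1)
log_g = -0.5 * (xs / 2.0) ** 2                   # unnormalized log proposal
w = np.exp(log_p - log_g)                        # unnormalized weights, cf. (62)
w_hat = w / w.sum()                              # normalized weights, cf. (64)
est = np.sum(w_hat * xs)                         # estimate of E(X) = 1, cf. (63)
```

Because the weights are only known up to a constant, the normalization (64) is what makes the estimator usable without knowing p(Y_{1:t}).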
We choose a proposal distribution which factorizes sequentially:

g(X_{1:t} | Y_{1:t}) = g(X_{1:t}, Y_{1:t}) / g(Y_{1:t})
                     = g(X_t | X_{1:t−1}, Y_{1:t}) g(X_{1:t−1}, Y_{1:t}) / g(Y_{1:t})
                     = g(X_t | X_{1:t−1}, Y_{1:t}) g(X_{1:t−1} | Y_{1:t}).
Once the sample of X1:t−1 has been generated from g(X1:t−1 | Y1:t−1 ), its value
is independent of the observation Yt , and so g(X1:t−1 | Y1:t ) = g(X1:t−1 | Y1:t−1 ).
We can thus write the result of the calculation above as the following recursion:

g(X_{1:t} | Y_{1:t}) = g(X_t | X_{1:t−1}, Y_{1:t}) g(X_{1:t−1} | Y_{1:t−1}).
The second factor on the RHS of this equation, g(X1:t−1 | Y1:t−1 ), is the proposal
distribution built out of the paths that have already been generated in the
previous steps.
A new set of samples x_t^1, …, x_t^N is drawn from the first factor g(X_t | X_{1:t−1}, Y_{1:t}). We then append the newly simulated values x_t^1, …, x_t^N to the simulated paths x_{1:t−1}^1, …, x_{1:t−1}^N of length t − 1. We thus obtain simulated paths x_{1:t}^1, …, x_{1:t}^N of length t.
We initialize this recursion with w_1^j = 1. The densities p(Y_t | X_t) and p(X_t | X_{t−1}) are determined by the state and observation equations (51). The only quantity that needs to be computed at each iteration is the incremental weight

w̃_t^j = p(Y_t | x_t^j) p(x_t^j | x_{t−1}^j) / g(x_t^j | x_{1:t−1}^j, Y_{1:t}),

so that w_t^j ∝ w̃_t^j w_{t−1}^j.
As a result of each iteration, SIS produces N Monte Carlo paths x_{1:t}^1, …, x_{1:t}^N along with the unnormalized importance weights w_t^1, …, w_t^N. These paths are referred to as particles.
We define the normalized weights by

ŵ_t^j = w_t^j / ∑_{i=1}^N w_t^i.   (68)
The filtering distribution is then approximated by the weighted empirical distribution

p̂(X_{1:t} | Y_{1:t}) = ∑_{j=1}^N ŵ_t^j δ(X_{1:t} − x_{1:t}^j),   (69)

and the corresponding estimator of (60) is

Ê(f(X_{1:t}) | Y_{1:t}) = ∑_{j=1}^N ŵ_t^j f(x_{1:t}^j).   (70)
The predictive likelihood is estimated by

p̂(Y_t | Y_{1:t−1}) = ∑_{j=1}^N ŵ_{t−1}^j w̃_t^j.   (71)
It has been observed that in practice SIS suffers from the weight degeneracy
problem.
This manifests itself in the rapid increase of the variance of the distribution of the
importance weights as the number of time steps t increases.
As t increases, all the probability density gets eventually allocated to a single
particle.
That particle’s normalized weight converges to one, while the normalized weights
of the other particles converge to zero, and the SIS estimator becomes a
function of a single sample.
The degeneracy problem is mitigated by adding a resampling step to SIS. The algorithm proceeds as follows:
1. Generate N samples x_1^j from the proposal distribution and compute the weights

w_1^j = p(x_1^j) / g(x_1^j).
2. For t = 2, …, T:
(i) Generate N samples x_t^j ∼ g(X_t | x_{t−1}^j, Y_{1:t}) and compute the importance weights

w_t^j ∝ [p(Y_t | x_t^j) p(x_t^j | x_{t−1}^j) / g(x_t^j | x_{t−1}^j, Y_{1:t})] w_{t−1}^j.
(ii) Compute the normalized weights

ŵ_t^j = w_t^j / ∑_{i=1}^N w_t^i.   (72)

(iii) Resample N particles with probabilities ŵ_t^1, …, ŵ_t^N, and define w_t^j = 1/N.
After every iteration, once the particles have been generated, quantities of
interest can be estimated.
The filtering distribution is approximated as before by

p̂_N(X_{1:t} | Y_{1:t}) = ∑_{j=1}^N ŵ_t^j δ(X_{1:t} − x_{1:t}^j),   (73)

while, since ŵ_{t−1}^j = 1/N after resampling, the predictive likelihood estimate (71) becomes

p̂(Y_t | Y_{1:t−1}) = ∑_{j=1}^N ŵ_{t−1}^j w̃_t^j   (74)
                   ≈ (1/N) ∑_{j=1}^N w̃_t^j.   (75)
Bootstrap filter
The efficacy of the algorithms presented above depends on the choice of the
proposal distribution.
The simplest choice of the proposal distribution is the transition density itself,

g(X_t | X_{1:t−1}, Y_{1:t}) = p(X_t | X_{t−1}).

This choice is called the prior kernel, and the corresponding particle filter is called the bootstrap filter. With the prior kernel, the incremental weight reduces to

w̃_t = p(Y_t | X_t).
The prior kernel is an example of a blind proposal: it does not use the current
observation Yt .
Despite this, the bootstrap filter performs well in a number of situations.
Another popular version is the auxiliary particle filter, see [1] and [2].
References
[1] Creal, D.: A survey of sequential Monte Carlo methods for economics and finance, Econometric Reviews, 31, 245-296 (2012).
[2] Durbin, J., and Koopman, S. J.: Time Series Analysis by State Space Methods, Oxford University Press (2012).