
State Space Expectation Propagation:

Efficient Inference Schemes for Temporal Gaussian Processes

William J. Wilkinson 1 Paul E. Chang 1 Michael Riis Andersen 2 Arno Solin 1

Abstract
We formulate approximate Bayesian inference in non-conjugate temporal and spatio-temporal Gaussian process models as a simple parameter update rule applied during Kalman smoothing. This viewpoint encompasses most inference schemes, including expectation propagation (EP), the classical (Extended, Unscented, etc.) Kalman smoothers, and variational inference. We provide a unifying perspective on these algorithms, showing how replacing the power EP moment matching step with linearisation recovers the classical smoothers. EP provides some benefits over the traditional methods via introduction of the so-called cavity distribution, and we combine these benefits with the computational efficiency of linearisation, providing extensive empirical analysis demonstrating the efficacy of various algorithms under this unifying framework. We provide a fast implementation of all methods in JAX.

Figure 1. Filtering and smoothing in the Banana classification task. Training data represented by coloured points, the decision boundaries by black lines, and the predictive mean for the class label by the colour map. The vertical dimension is treated as the 'spatial' input and the horizontal as the sequential ('temporal') dimension. Forward sweep on the left, backward sweep on the right. Top panels (a) show the EKF forward and RTS backward pass; bottom panels (b) show the 2nd iteration of Extended Expectation Propagation (EEP).

1. Introduction

Gaussian processes (GPs, Rasmussen & Williams, 2006) are a nonlinear probabilistic modelling tool that combine well calibrated uncertainty estimates with the ability to encode prior information, and as such they are an increasingly effective method for many difficult machine learning tasks.
The well known limitations of GPs are (i) their cubic scaling in the number of data, and (ii) their intractability when the observation model is non-Gaussian.

For (i), a wide variety of methods have been proposed (e.g. Hensman et al., 2013; Salimbeni & Deisenroth, 2017; Wang et al., 2019), with perhaps the most common being the sparse-GP approach (Quiñonero-Candela & Rasmussen, 2005), which summarises the GP posterior through a subset of 'inducing' points. However, when the data exhibits a natural ordering—as in temporal or spatio-temporal tasks—many GP priors can be rewritten in closed form in terms of stochastic differential equations (SDEs, Särkkä & Solin, 2019), allowing for linear-time exact inference via Kalman filtering (Hartikainen & Särkkä, 2010; Reece & Roberts, 2010). This link is beneficial in scenarios such as climate modelling or audio signal analysis, which exhibit both high and low-frequency behaviour. A sparse-GP analogy to the Shannon–Nyquist theorem (Tobar, 2019) tells us that audio signals, for example, necessarily require tens of thousands of inducing points per second of data, rendering sparse approximations infeasible for all but the shortest of time series. This strongly motivates our reformulation of temporal GPs as SDEs for efficient inference.

1 Department of Computer Science, Aalto University, Finland. 2 Department of Applied Mathematics and Computer Science, Technical University of Denmark, Denmark. Correspondence to: William J. Wilkinson <william.wilkinson@aalto.fi>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
For limitation (ii), a wide variety of approximative inference methods have been considered, with the current gold-standard being various sampling schemes (see Gelman et al., 2013, for an overview), variational methods (Opper & Archambeau, 2009; Titsias, 2009; Wainwright & Jordan, 2008), and expectation propagation (EP, Bui et al., 2017; Jylänki et al., 2011; Kuss & Rasmussen, 2006; Minka, 2001). Despite Minka's original work having its foundations in filtering and smoothing, all the special characteristics of temporal models have not been thoroughly leveraged in the machine learning community. We extend recent work on approximate inference under the state space paradigm, and provide a framework that unifies EP and traditional methods such as the Extended (EKF, Bar-Shalom et al., 2001) and Unscented Kalman filters (UKF, Julier et al., 1995; 2000). Our framework provides ways to trade off accuracy and computation, and we show that an iterated version of the EKF with EP-style updates can be efficient and easy to implement, whilst providing good performance in cases where the likelihood model is not locally highly nonlinear. For completeness, we also formulate variational inference in the same setting.

We show that such tools are not limited to one-dimensional-input models; instead they only require us to treat a single dimension of the data sequentially (regardless of whether it is actually ordered, or represents time). We apply our methods to multi-dimensional problems such as 2D classification (see Fig. 1) and 2D log-Gaussian Cox processes.

Our main contributions are: (i) We formulate EP as a Kalman smoother, showing how it unifies many classical smoothing methods, providing an efficient framework for inference in temporal GPs. (ii) We show how to rewrite common machine learning tasks (likelihoods) into canonical state space form, and provide extensive analysis demonstrating performance in many modelling scenarios. (iii) We show how the state space framework can be extended beyond the one-dimensional case, applying it to multidimensional classification and regression tasks, where we still enjoy linear-time inference over the sequential dimension. (iv) We provide fast JAX code for inference and learning with all the methods described in this paper, available at https://github.com/AaltoML/kalman-jax.

2. Background

Gaussian processes (GPs, Rasmussen & Williams, 2006) form a non-parametric family of probability distributions on function spaces, and are completely characterized by a mean function µ(t) : R → R and a covariance function κ(t, t′) : R × R → R. Let {(t_k, y_k)}_{k=1}^n denote a set of n input–output pairs for a time series (we first consider the 1D input case), then GP models typically take the form

f(t) ∼ GP(µ(t), κ(t, t′)),    y | f ∼ ∏_{k=1}^n p(y_k | f(t_k)),    (1)

which defines the prior for the latent function f : R → R and the observation model for y_k. For Gaussian observation models, the posterior distribution p(f | y) is also Gaussian and can be obtained analytically, but non-Gaussian likelihoods render the posterior intractable and approximate inference methods must be applied.

2.1. State Space Models for Gaussian Processes

In signal processing, the canonical (discrete-time) state space model formulation is (e.g., Bar-Shalom et al., 2001):

x_k = g(x_{k−1}, q_k),    (2a)
y_k = h(x_k, σ_k),    (2b)

for time instances t_k, where x_k ∈ R^s is the discrete-time state sequence, y_k ∈ R^d is a measurement sequence, q_k ∼ N(0, Q_k) is the process noise, and σ_k ∼ N(0, Σ_k) is Gaussian measurement noise. The model dynamics (prior) are defined by the nonlinear mapping g(·, ·), while the observation/measurement model (likelihood) is given in terms of the mapping h(·, ·). We restrict the model dynamics g(·, ·) to be linear-Gaussian—defining an s-dimensional Gaussian process. The dynamical model Eq. (2a) becomes,

x_k = A_k x_{k−1} + q_k,    q_k ∼ N(0, Q_k),    (3)

which is characterised by the transition matrix A_k and process noise covariance Q_k. Whilst we restrict our interest to latent Gaussian dynamics, the inference methods presented later apply to more general nonlinear state estimation settings, where the prior is not necessarily a GP (see Sec. 5).

The motivation for linking the machine learning GP formalism with state space models comes from the special structure in temporal or spatio-temporal problems, where the data points have a natural ordering with respect to the temporal dimension. If the GP prior in Eq. (1) admits a Markovian structure, the model can be rewritten in the form of Eq. (3). We leverage the link between the kernel and state space forms of GPs (Särkkä & Solin, 2019; Särkkä et al., 2013), which comes through linear time-invariant SDEs:

ẋ(t) = F x(t) + L w(t),    such that    f(t) = H x(t),    (4)

where w(t) is a white noise process, and F ∈ R^{s×s}, L ∈ R^{s×v}, H ∈ R^{m×s} are the feedback, noise effect, and measurement matrices, respectively. Many widely used covariance functions admit this form exactly or approximately (e.g., the Matérn class, polynomial, noise, constant, squared-exponential, rational quadratic, periodic, and sums/products thereof). Särkkä & Solin (2019) discuss methods for constructing the required matrices for many GP models. Key to this formulation is that linear time-invariant SDEs are guaranteed to have a closed-form discrete-time solution in the form of a linear Gaussian state space model as in Eq. (3). We leverage this link in order to apply sequential inference schemes to temporal and spatio-temporal GP models.
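To make the kernel-to-state-space conversion concrete, the sketch below builds the standard Matérn-3/2 state space model and its closed-form discrete-time transition and process noise matrices, following the construction in Särkkä & Solin (2019). This is a minimal, illustrative sketch: the function names, argument choices, and the stationary-covariance discretisation shortcut are assumptions of this example, not code taken from the kalman-jax library.

```python
# Minimal sketch: Matern-3/2 GP prior as an SDE, Eq. (4), discretised to Eq. (3).
import jax.numpy as jnp
from jax.scipy.linalg import expm

def matern32_state_space(variance, lengthscale):
    lam = jnp.sqrt(3.0) / lengthscale
    F = jnp.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])            # feedback matrix
    L = jnp.array([[0.0], [1.0]])                                  # noise effect matrix
    H = jnp.array([[1.0, 0.0]])                                    # measurement matrix
    Pinf = jnp.array([[variance, 0.0], [0.0, lam**2 * variance]])  # stationary state covariance
    return F, L, H, Pinf

def discretise(F, Pinf, dt):
    # A_k = expm(F dt); for a stationary prior, Q_k = Pinf - A_k Pinf A_k^T.
    A = expm(F * dt)
    Q = Pinf - A @ Pinf @ A.T
    return A, Q

F, L, H, Pinf = matern32_state_space(variance=1.0, lengthscale=0.5)
A, Q = discretise(F, Pinf, dt=0.1)
```

The same pattern (build F, L, H, P_inf once, then discretise per time step) applies to the other kernels listed above, with higher state dimension s.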
2.2. Extended and Unscented State Estimation

Many nonlinear variants of the Kalman filter have been developed to deal with the measurement model in Eq. (2b) (see Särkkä, 2013, for an overview). The most widely known are the EKF (e.g. Bar-Shalom et al., 2001) and UKF (Julier et al., 1995). The EKF linearises h(x_k, σ_k) via a first-order Taylor series expansion, which in turn results in linear Gaussian approximations to all the required Kalman update equations. We discuss the approach in detail in Sec. 3.2.

The UKF is a member of a wider class of Gaussian filtering methods (Ito & Xiong, 2000), which approximate the Kalman update equations via statistical linearisation rather than a Taylor expansion. Statistical linearisation is generally intractable, involving expectations that must be computed numerically (shown in App. B). Choosing the Unscented transform as the numerical integration method results in the UKF, but other sigma-point methods can also be used (see, e.g., Ito & Xiong, 2000; Kokkala et al., 2016; Šimandl & Duník, 2009; Wu et al., 2005; 2006)—e.g. using Gauss–Hermite cubature gives the Gauss–Hermite Kalman filter.

2.3. Expectation Propagation

Expectation propagation (EP) is a general framework for approximating probability distributions proposed by Minka (2001). EP and its extension Power-EP (PEP, Minka, 2004) have been extensively studied for Gaussian process models and shown to provide state-of-the-art results (Bui et al., 2017; Jylänki et al., 2011; Kuss & Rasmussen, 2006). EP approximates the target distribution p(f | y) with an approximation q(f) that factorises in the same way as the target,

p(f | y) ∝ ∏_{k=1}^n p(y_k | f_k) p(f) ≈ q(f) ∝ ∏_{k=1}^n q_k^site(f_k) p(f).    (5)

The likelihood approximations q_k^site(f_k) ≈ p(y_k | f_k) are usually referred to as sites. For GP models, the sites are chosen to be Gaussians and hence the global approximation q(f) is also Gaussian. The sites are updated in an iterative fashion by minimizing local Kullback–Leibler divergences between the so-called tilted distributions, p̂_k(f_k) = (1/Z_k) p(y_k | f_k) q_k^cav(f_k), and their approximation using the site,

q_k^site* = arg min_{q_k^site} KL[ p̂_k(f_k) ‖ q_k^site(f_k) q_k^cav(f_k) ],    (6)

where q_k^cav(f_k) is the cavity distribution: q_k^cav(f_k) ∝ q(f_k)/q_k^site(f_k). The KL-divergence in Eq. (6) is minimized using moment matching (Minka, 2001), i.e. q_k is chosen such that the approximation q_k^site q_k^cav matches the first two moments of the tilted distribution p̂_k. This process is iterated for all sites until convergence. Power EP is a generalization of EP, where the KL-divergence is generalized to the α-divergence (Minka, 2005). Minka (2004) showed that PEP can be implemented using the EP algorithm, by raising the site terms in the tilted distributions to a power of α.

3. Methods

We consider non-conjugate (i.e. non-Gaussian likelihood) Gaussian process models with input t, i.e. time, which have a dual kernel (left) and discrete state space (right) form for the prior (Särkkä et al., 2013),

f(t) ∼ GP(µ(t), K_θ(t, t′)),    x_k = A_{θ,k} x_{k−1} + q_k,    (7)

where f(t) = (f^{(1)}(t), ..., f^{(m)}(t))^⊤ ∈ R^m are GPs, x_k = (x_k^{(1)}, ..., x_k^{(m)})^⊤ ∈ R^s is the latent state vector containing the GP dynamics, and y_k ∈ R^d are observations. Each x_k^{(i)} contains the state dynamics for one GP. Using notation f_k = f(t_k), we define a time-varying linear map H_k ∈ R^{m×s} from state space to function space, such that f_k = H_k x_k (the time-varying mapping allows us to naturally incorporate spatial inducing points when considering multidimensional input models, see Sec. 3.6). The likelihood (left) / state observation model (right) are

y_k ∼ p(y_k | f_k)    vs.    y_k = h(f_k, σ_k).    (8)

Measurement model h(f_k, σ_k) is a (nonlinear) function of f_k and observation noise σ_k ∼ N(0, Σ_k), and can generally be derived for continuous likelihoods or approximated for discrete ones by letting h(f_k, σ_k) ≈ E[y_k | f_k] + ε_k, ε_k ∼ N(0, Cov[y_k | f_k]). See Sec. 4 and App. I for derivations of some common models. We aim to calculate the posterior over the states, p(x_k | y_1, ..., y_n), known as the smoothing solution, which can be obtained via application of a Gaussian filter (to obtain the filtering solution p(x_k | y_1, ..., y_k)) followed by a Gaussian smoother. If h(·, ·) is linear, i.e. p(y_k | f_k) is Gaussian, then the Kalman filter and Rauch–Tung–Striebel (RTS, Rauch et al., 1965) smoother return the closed-form solution.

3.1. Power EP as a Gaussian Smoother

Our inference methods approximate the filtering distributions with Gaussians, p(x_k | y_{1:k}) ≈ N(x_k | m_k^filt, P_k^filt). The prediction step remains the same as in the standard Kalman filter: m_k^pred = A_{θ,k} m_{k−1}^filt, and P_k^pred = A_{θ,k} P_{k−1}^filt A_{θ,k}^⊤ + Q_{θ,k}. The resulting distribution provides a means by which to calculate the EP cavity, q_k^cav(f_k) = N(f_k | µ_k^cav, Σ_k^cav), on the first forward pass:

µ_k^cav = H_k m_k^pred,    Σ_k^cav = H_k P_k^pred H_k^⊤.    (9)

In this sense, we can view the first pass of the Kalman filter as an effective way to initialise the EP parameters.
Figure 2. Comparison between EP, VI and iterated linearisation (EKS, GHKS). When the measurement function h is approximately linear in the region of the prior (or the cavity / posterior in the full algorithm) (left), linearisation ĥ provides a similar result to EP / VI. When h is highly nonlinear (right), the posterior approximations have different properties. 20-point Gauss–Hermite quadrature is used for all methods except EKS. All methods are iterated 10 times except EP, which does not require iteration for a single data point.

To account for the non-Gaussian likelihood in the update step we follow Nickisch et al. (2018), introducing an intermediary step in which the parameters of the sites, q_k^site(f_k) = N(f_k | µ_k^site, Σ_k^site), are set via moment matching and stored before continuing with the Kalman updates. This PEP formulation, with power α, makes use of the fact that the required moments can be calculated via the derivatives of the log-normaliser, L_k, of the tilted distribution (see Seeger, 2005). Letting ∇L_k ∈ R^m and ∇²L_k ∈ R^{m×m} be the Jacobian and Hessian of L_k w.r.t. µ_k^cav respectively, this gives the following site update rule,

Power expectation propagation:
L_k = log E_{N(f_k | µ_k^cav, Σ_k^cav)}[ p^α(y_k | f_k) ],
Σ_k^site = −α ( Σ_k^cav + (∇²L_k)^{−1} ),    (10)
µ_k^site = µ_k^cav − (∇²L_k)^{−1} ∇L_k.

After the mean and covariance of our new likelihood approximation have been calculated, we proceed with a modified set of linear Kalman updates,

S_k = H_k P_k^pred H_k^⊤ + Σ_k^site,
K_k = P_k^pred H_k^⊤ S_k^{−1},    (11)
m_k^filt = m_k^pred + K_k ( µ_k^site − H_k m_k^pred ),
P_k^filt = P_k^pred − K_k S_k K_k^⊤.

As in Wilkinson et al. (2019), we augment the RTS smoother with another moment matching step where the cavity distribution is calculated by removing (a fraction α of) the local site from the marginal smoothing distribution, i.e. the posterior, p(x_k | y_{1:n}) = N(x_k | m_k^post, P_k^post),

Σ_k^cav = ( (H_k P_k^post H_k^⊤)^{−1} − α (Σ_k^site)^{−1} )^{−1},    (12)
µ_k^cav = Σ_k^cav ( (H_k P_k^post H_k^⊤)^{−1} H_k m_k^post − α (Σ_k^site)^{−1} µ_k^site ).

Moment matching is again performed via Eq. (10) using this new cavity. The site parameters, µ_k^site, Σ_k^site, are stored to be used on the next forward (filtering) pass. App. F discusses methods for avoiding numerical issues that could occur due to the subtraction of covariance matrices in Eq. (12). Algorithm 1 summarises the full learning algorithm, and App. G describes how the marginal likelihood, p(y | θ), is computed to enable hyperparameter learning.
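As an illustration of the moment matching step in Eq. (10), the sketch below computes the log-normaliser with Gauss–Hermite quadrature for a one-dimensional latent and obtains its derivatives with JAX autodiff. It is a minimal sketch under simplifying assumptions (scalar f_k, quadrature order 20, a hypothetical Poisson log-likelihood as the example); it is not the kalman-jax implementation.

```python
# Sketch of the PEP site update, Eq. (10), for a scalar latent.
import jax
import jax.numpy as jnp
import numpy as np

x_gh, w_gh = np.polynomial.hermite_e.hermegauss(20)        # nodes/weights for weight e^{-x^2/2}
x_gh = jnp.array(x_gh)
w_gh = jnp.array(w_gh / np.sqrt(2.0 * np.pi))               # normalised so the weights sum to 1

def site_update(log_lik, y, mu_cav, var_cav, alpha):
    def log_normaliser(m):
        # L_k = log E_{N(f | m, var_cav)}[p^alpha(y | f)], approximated by quadrature.
        f = m + jnp.sqrt(var_cav) * x_gh
        return jax.scipy.special.logsumexp(alpha * log_lik(y, f), b=w_gh)
    dL = jax.grad(log_normaliser)(mu_cav)
    d2L = jax.grad(jax.grad(log_normaliser))(mu_cav)
    var_site = -alpha * (var_cav + 1.0 / d2L)               # Eq. (10), scalar case
    mu_site = mu_cav - dL / d2L
    return mu_site, var_site

# Example with a Poisson likelihood and log link, p(y | f) = Poisson(y | exp(f)).
log_poisson = lambda y, f: y * f - jnp.exp(f) - jax.scipy.special.gammaln(y + 1.0)
mu_site, var_site = site_update(log_poisson, 2.0, 0.0, 1.0, alpha=1.0)
```

For a conjugate (Gaussian) likelihood this update recovers the exact observation mean and variance, which is a useful sanity check when prototyping new likelihoods.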
3.2. Unifying Power EP and Extended Kalman Filtering

In the above inference scheme, a computational saving can be gained by noticing that when h(·, ·) is linear, L_k can be calculated in closed form. This fact has been exploited previously to aid inference in GP dynamical systems (Deisenroth & Mohamed, 2012). Fig. 2 demonstrates that such an approximation can be accurate when h(·, ·) is locally linear, or when the cavity variance is small. Using a first-order Taylor series expansion about the mean µ_k^cav, we obtain

h(f_k, σ_k) ≈ J_{f_k} (f_k − µ_k^cav) + h(µ_k^cav, 0) + J_{σ_k} σ_k,    (13)

a linear function of f_k and σ_k ∼ N(0, Σ_k), such that p(y_k | f_k) ≈ N(y_k | ĥ(f_k), J_{σ_k} Σ_k J_{σ_k}^⊤), where ĥ(f_k) = J_{f_k}(f_k − µ_k^cav) + h(µ_k^cav, 0). Here J_{f_k} = J_f |_{µ_k^cav, 0} ∈ R^{d×m} and J_{σ_k} = J_σ |_{µ_k^cav, 0} ∈ R^{d×d} are the Jacobians of h(·, ·) w.r.t. f_k and σ_k evaluated at the mean, respectively.

In order to frame approximate inference in the same setting as EP, we seek the site update rule implied by this linearisation. If J_f is invertible, then writing down such a rule would be trivial, but since this is not generally the case we instead use the EP moment matching steps, Eq. (10), which give,

L_k = log E_{N(f_k | µ_k^cav, Σ_k^cav)}[ N^α(y_k | ĥ(f_k), J_{σ_k} Σ_k J_{σ_k}^⊤) ]
    = c + log N(y_k | h(µ_k^cav, 0), α^{−1} Σ̂_k),    (14)

where Σ̂_k = J_{σ_k} Σ_k J_{σ_k}^⊤ + α J_{f_k} Σ_k^cav J_{f_k}^⊤. Taking the derivatives of this log-Gaussian w.r.t. the cavity mean, we get

∇L_k = ∂L_k / ∂µ_k^cav = α J_{f_k}^⊤ Σ̂_k^{−1} v_k,    (15)
∇²L_k = ∂²L_k / ∂µ_k^cav ∂(µ_k^cav)^⊤ = −α J_{f_k}^⊤ Σ̂_k^{−1} J_{f_k},

where v_k = y_k − h(µ_k^cav, 0). It is important to note that we have assumed the derivative of Σ̂_k to be zero, even though it depends on µ_k^cav. This assumption is crucial in ensuring that the updates are consistent, since it reflects the knowledge that the model is now linear (see Deisenroth & Mohamed (2012) for detailed discussion). Now we update the site in closed form (App. C gives the derivation),

Extended expectation propagation:
Σ_k^site = ( J_{f_k}^⊤ (J_{σ_k} Σ_k J_{σ_k}^⊤)^{−1} J_{f_k} )^{−1},    (16)
µ_k^site = µ_k^cav + ( Σ_k^site + α Σ_k^cav ) J_{f_k}^⊤ Σ̂_k^{−1} v_k.

The result when we use Eq. (16) (with α = 1) to modify the filter updates, Eq. (11), is exactly the EKF (see App. D for the proof). Additionally, since these updates are now available in closed form, taking the limit α → 0 is now possible and avoids the matrix subtractions and inversions in Eq. (12), which can be costly and unstable. This is not possible prior to linearisation because the intractable integrals also depend on α. App. H describes our full algorithm.

Algorithm 1 Sequential inference & learning algorithm
  Input: {t_k, y_k}_{k=1}^n, θ_0, α, and learning rate ρ
  update rule ← Eq. (10), Eq. (16), Eq. (18) or Eq. (20)
  for i = 1 to num_iters do
    build model, Eq. (7), with θ_{i−1};    m_0^filt, P_0^filt ← 0, P_∞
    for k = 1 to n do
      m_k^pred, P_k^pred ← predict(m_{k−1}^filt, P_{k−1}^filt)
      if i = 1 then
        initialise µ_k^site, Σ_k^site via update rule using α = 1 and m_k^pred, P_k^pred as cavity / posterior
      end if
      m_k^filt, P_k^filt ← update(m_k^pred, P_k^pred, µ_k^site, Σ_k^site)
      e_k = −log p(y_k | y_{1:k−1}, θ_{i−1})    (see App. G)
    end for
    for k = n − 1 to 1 do
      m_k^post, P_k^post ← smooth(m_{k+1}^post, P_{k+1}^post, m_k^filt, P_k^filt)
      update µ_k^site, Σ_k^site via update rule
    end for
    θ_i = θ_{i−1} + ρ ∇_θ Σ_k e_k    (update hyperparameters)
  end for
  Return: posterior mean and covariance: m^post, P^post

3.3. Power EP and the Unscented/GH Kalman Filters

We now consider the relationship between EP and general Gaussian filters, which use the likelihood approximation

p(y_k | f_k) ≈ N( y_k | µ_k + C_k^⊤ (Σ_k^cav)^{−1} (f_k − µ_k^cav),  S_k − C_k^⊤ (Σ_k^cav)^{−1} C_k ),    (17)

where µ_k, S_k and C_k are the Kalman mean, innovation and cross-covariance terms respectively, given in App. B. Eq. (17) amounts to statistical linear regression (Särkkä, 2013) of h(f_k, σ_k). Letting µ_k^cav = H_k m_k^pred, Σ_k^cav = H_k P_k^pred H_k^⊤ and using the Unscented transform / Gauss–Hermite to approximate µ_k, S_k and C_k results in the UKF / GHKF. This approximation has a similar form to the EKF (which uses analytical linearisation, see Fig. 2 for comparison), and as in Sec. 3.2 we can insert the Gaussian likelihood approximation into Eq. (10) to derive an iterated algorithm that matches the Gaussian filters on the first forward pass, but then refines the linearisation using EP style updates. This provides the following site update rule (see App. E):

Statistically linearised expectation propagation:
Σ_k^site = −α Σ_k^cav + ( Ω_k^⊤ Σ̃_k^{−1} Ω_k )^{−1},    (18)
µ_k^site = µ_k^cav + ( Ω_k^⊤ Σ̃_k^{−1} Ω_k )^{−1} Ω_k^⊤ Σ̃_k^{−1} v_k,

where v_k = y_k − µ_k, Σ̃_k = S_k + (α − 1) C_k^⊤ (Σ_k^cav)^{−1} C_k, and

Ω_k = ∂µ_k / ∂µ_k^cav = ∫∫ h(f_k, σ_k) (Σ_k^cav)^{−1} (f_k − µ_k^cav) N(f_k | µ_k^cav, Σ_k^cav) N(σ_k | 0, Σ_k) df_k dσ_k.    (19)
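The closed-form EEP site update of Eq. (16) is straightforward to write with automatic differentiation. The sketch below is illustrative only (function and variable names are assumptions, and the measurement model h is supplied by the user); it is not the kalman-jax implementation.

```python
# Sketch of the extended EP (EEP) site update, Eq. (16), using JAX Jacobians.
import jax
import jax.numpy as jnp

def eep_site_update(h, mu_cav, Sigma_cav, y, Sigma_noise, alpha):
    zero_noise = jnp.zeros_like(y)
    v = y - h(mu_cav, zero_noise)                                # residual v_k = y_k - h(mu_cav, 0)
    Jf = jax.jacobian(h, argnums=0)(mu_cav, zero_noise)          # dh/df at the cavity mean
    Js = jax.jacobian(h, argnums=1)(mu_cav, zero_noise)          # dh/dsigma at the cavity mean
    R = Js @ Sigma_noise @ Js.T                                  # linearised observation noise
    Sigma_hat = R + alpha * Jf @ Sigma_cav @ Jf.T                # as in Eq. (14)
    Sigma_site = jnp.linalg.inv(Jf.T @ jnp.linalg.solve(R, Jf))  # Eq. (16), covariance
    mu_site = mu_cav + (Sigma_site + alpha * Sigma_cav) @ Jf.T @ jnp.linalg.solve(Sigma_hat, v)
    return mu_site, Sigma_site
```

Because only h and its Jacobians are needed, prototyping a new likelihood amounts to writing the single function h(f, sigma).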
3.4. Nonlinear Kalman Smoothers

Iterated versions of nonlinear filter-smoothers have been developed to address the fact that the forward prediction, N(x_k^pred | m_k^pred, P_k^pred), may not be the optimal distribution about which to perform linearisation. It is argued that the posterior, N(x_k^post | m_k^post, P_k^post), obtained via smoothing, provides a better estimate of the region in which the likelihood affects the posterior (García-Fernández et al., 2015). These iterated smoothers (Bell, 1994) can be seen as special cases of the algorithms described in Sec. 3.2 and Sec. 3.3, where the posterior is used to perform the linearisation in place of the cavity, i.e. α = 0. The classical smoothers seek a linear approximation to the likelihood p(y_k | f_k) ≈ N(y_k | B_k f_k + b_k, E_k) via a Taylor expansion, Eq. (13), or SLR, Eq. (17), and then store parameters B_k, b_k, E_k to be used during the next forward pass. Instead, we use the current posterior approximation to compute the site parameters via Eq. (16) or Eq. (18), which differs slightly from the standard presentation of these algorithms. We argue that framing the Kalman smoothers as site update rules is beneficial in that it allows for direct comparison with EP, but also that introduction of the cavity is beneficial. The cavity may be a better distribution about which to linearise than the posterior, since it does not already include the effect of the local data. However, Table 1 shows that setting α = 0 typically provides the best performance.

3.5. Variational Inference with Natural Gradients

Variational inference (VI) is an alternative to EP, often favoured due to its convergence guarantees and ease of implementation. If VI is formulated such that the variational parameters of the approximate posterior q(f) are the likelihood (i.e. site) mean and covariance, as in Eq. (5), then it can also be framed as a site update rule during Kalman smoothing (Chang et al., 2020). This parametrisation is in fact the optimal one, as discussed in Opper & Archambeau (2009), but is often avoided because the resulting optimisation problem is non-convex (instead it is common to declare a variational distribution over the full posterior, q(f) = N(m, K), and optimise m, K with respect to the evidence lower bound; Adam et al. (2020) show how to perform natural gradient VI under this parametrisation).

We present here the VI site update rule, based on conjugate-computation variational inference (CVI, Khan & Lin, 2017), in order to show its similarity to EP, and to enable direct comparison between the algorithms. CVI sidesteps the issues with the optimal parametrisation by showing that natural gradient VI can be performed via local site parameter updates that avoid directly differentiating the evidence lower bound. The updates can be written,

Variational inference (with natural gradients):
L̃_k = E_{N(f_k | µ_k^post, Σ_k^post)}[ log p(y_k | f_k) ],
Σ_k^site = −( ∇²L̃_k )^{−1},    (20)
µ_k^site = µ_k^post − ( ∇²L̃_k )^{−1} ∇L̃_k,

where ∇L̃_k ∈ R^m and ∇²L̃_k ∈ R^{m×m} are the Jacobian and Hessian of L̃_k w.r.t. µ_k^post respectively.

3.6. Spatio-Temporal Filtering and Smoothing

The methodology presented in the previous sections for temporal models directly lends itself to generalisations in spatio-temporal modelling. We consider a GP prior which is separable in the sequential (temporal) input t and the remaining (spatial) input(s) r: κ(r, t; r′, t′) = κ_r(r, r′) κ_t(t, t′).

Following Särkkä et al. (2013), we extend the state x(t) of the system via m coupled temporal processes. These processes are associated with inducing points {r_{u,j}}_{j=1}^m in the spatial domain. The measurement model matrix now projects the latent state at time t_k from the inducing processes in the state to function space by,

H_k = [ K_{f_k,u} K_{u,u}^{−1} ] ⊗ H_t,    (21)

with Gram matrices K_{f_k,u} = κ_r(r_k, r_{u,j}) and K_{u,u} = κ_r(r_{u,j}, r_{u,j}) for j = 1, 2, ..., m, where H_t is the measurement model matrix for the GP prior. If the data lie on a fixed set of spatial points {r_j}_{j=1}^m (an irregular grid of m points), the above expression simplifies to H_k = I_m ⊗ H_t and the model becomes exact (see Hartikainen, 2013; Solin, 2016, for further details and discussion, also covering non-separable models).

3.7. Fast Implementation Using JAX

Sequential inference in GPs is extremely efficient, however optimising the model hyperparameters involves differentiating functions with large loops. When using automatic differentiation this typically results in a massive computational graph with large compilation overheads, memory usage and runtime. Previous approaches have avoided this issue either by using finite differences (Nickisch et al., 2018), which are slow when the number of parameters is large, or by reformulating the model to exploit linear algebra methods applicable to sparse precision matrices (Durrande et al., 2019).

We utilise the following features of the differential programming Python framework, JAX (Bradbury et al., 2018): (i) we avoid 'unrolling' of for-loops, i.e. instead of building a large graph of repeated operations, a smaller graph is recursively called, reducing the compilation overhead and memory; (ii) we just-in-time (JIT) compile the loops, to avoid the cost of graph retracing; (iii) we use accelerated linear algebra (XLA) to speed up the underlying filtering/smoothing operations. Combined, these features result in an extremely fast implementation, and based on this we provide a fully featured temporal GP framework with all models and inference methods, available at https://github.com/AaltoML/kalman-jax.
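To illustrate points (i)–(iii) above, the sketch below writes a filtering sweep with jax.lax.scan (so the loop body is traced once rather than unrolled) and JIT-compiles the whole sweep. This is a simplified, illustrative Kalman filter over Gaussian sites, not the library's actual API, and the per-step energy term is a simplification of the marginal-likelihood computation described in App. G.

```python
# Sketch of a JIT-compiled linear-time filtering sweep using jax.lax.scan.
import jax
import jax.numpy as jnp

@jax.jit
def filter_sweep(m0, P0, A, Q, H, site_means, site_covs):
    def step(carry, site):
        m, P = carry
        mu_site, Sigma_site = site
        m_pred = A @ m                                   # prediction
        P_pred = A @ P @ A.T + Q
        S = H @ P_pred @ H.T + Sigma_site                # innovation covariance, Eq. (11)
        K = jnp.linalg.solve(S.T, H @ P_pred).T          # Kalman gain P_pred H^T S^{-1}
        m_new = m_pred + K @ (mu_site - H @ m_pred)
        P_new = P_pred - K @ S @ K.T
        ek = -jax.scipy.stats.multivariate_normal.logpdf(mu_site, H @ m_pred, S)
        return (m_new, P_new), (m_new, P_new, ek)

    _, (ms, Ps, energies) = jax.lax.scan(step, (m0, P0), (site_means, site_covs))
    return ms, Ps, jnp.sum(energies)                     # filtered moments and energy
```

Because the function is a pure map from hyperparameter-dependent inputs to an energy, jax.grad can differentiate straight through the scan for hyperparameter learning, without unrolling the loop.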
4. Empirical Analysis

Table 1 examines the performance of 17 methods from our GP framework on 7 benchmark tasks of varying data size and model complexity. Blanks (—) in the table represent scenarios where the method does not scale practically to the size of the task. First we demonstrate that state space approximate inference schemes are competitive with two state-of-the-art baseline methods on three small datasets.

Table 1. Normalised negative log predictive density (NLPD) results with 10-fold cross-validation. Smaller is better. Blank entries (—) represent scenarios where the method does not scale to the size of the task. EEP, UEP, and GHEP are the iterated smoothers with linearisation. EP(U) and EP(GH) are state space EP, where the intractable moment matching is performed via the Unscented transform or Gauss–Hermite, respectively. Linearisation performs poorly on the heteroscedastic noise task, however EEP performs well on the audio task since it is the only method capable of maintaining full site cross-covariance terms without compromising stability. The state space methods are able to match the performance of the non-sequential (batch) EP and VGP baselines.

                        MOTORCYCLE   COAL        BANANA      BINARY      AUDIO        AIRLINE     RAINFOREST
# data points           133          333         400         10K         22K          36K         125K
Input dimension         1            1           2           1           1            1           2
Likelihood              Heterosced.  Poisson     Bernoulli   Bernoulli   Product      Poisson     Poisson
Linearisation:
EEP (α = 1)             0.855±0.25   0.922±0.11  0.228±0.07  0.536±0.01  −0.433±0.04  0.142±0.01  0.325±0.01
EEP (α = 0.5)           0.855±0.25   0.922±0.11  0.228±0.07  0.536±0.01  −0.499±0.03  0.142±0.01  0.321±0.01
EEP (α = 0) / EKS       0.855±0.25   0.922±0.11  0.229±0.07  0.537±0.01  −0.570±0.04  0.142±0.01  0.309±0.01
UEP (α = 1)             0.745±0.28   0.922±0.11  0.217±0.08  0.536±0.01  −0.471±0.02  0.142±0.01  —
UEP (α = 0.5)           0.745±0.28   0.922±0.11  0.217±0.08  0.536±0.01  −0.474±0.02  0.142±0.01  —
UEP (α = 0) / UKS       0.745±0.28   0.922±0.11  0.217±0.08  0.536±0.01  −0.484±0.02  0.142±0.01  —
GHEP (α = 1)            0.750±0.26   0.922±0.11  0.217±0.08  0.536±0.01  —            0.142±0.01  —
GHEP (α = 0.5)          0.747±0.27   0.922±0.11  0.217±0.08  0.536±0.01  —            0.142±0.01  —
GHEP (α = 0) / GHKS     0.746±0.27   0.922±0.11  0.217±0.08  0.536±0.01  —            0.142±0.01  —
Moment matching:
EP(U) (α = 1)           0.696±0.59   0.922±0.11  0.217±0.08  0.536±0.01  −0.321±0.15  0.143±0.01  —
EP(U) (α = 0.5)         0.479±0.30   0.922±0.11  0.217±0.08  0.536±0.01  −0.327±0.20  0.143±0.01  —
EP(U) (α ≈ 0)           0.465±0.29   0.924±0.11  0.222±0.08  0.536±0.01  0.011±0.30   0.143±0.01  —
EP(GH) (α = 1)          0.569±0.41   0.922±0.11  0.217±0.08  0.536±0.01  —            0.142±0.01  —
EP(GH) (α = 0.5)        0.531±0.38   0.922±0.11  0.217±0.08  0.536±0.01  —            0.142±0.01  —
EP(GH) (α ≈ 0)          0.444±0.32   0.922±0.11  0.217±0.08  0.536±0.01  —            0.142±0.01  —
VI:
VI(U)                   0.444±0.31   0.922±0.11  0.217±0.08  0.536±0.01  −0.204±0.02  0.142±0.01  —
VI(GH)                  0.495±0.34   0.922±0.11  0.217±0.08  0.536±0.01  —            0.142±0.01  —
Baselines:
EP(batch, GH)           0.441±0.30   0.922±0.11  0.216±0.10  —           —            —           —
VGP(batch, GH)          0.444±0.30   0.922±0.11  0.219±0.09  —           —            —           —

We compare against batch EP (see Sec. 2.3) and a variational GP (VGP, Opper & Archambeau, 2009, with order n + n² parameters to ensure convexity of the objective, as in GPflow, Matthews et al., 2017). We then compare our methods on four large data tasks to which the baselines are not applicable. Note that sparse variants of EP and VGP are not suited to these long time series containing high-frequency behaviour (e.g., Fig. 3c) that cannot be summarised by a few thousand inducing points (see Sec. 1 for discussion).

We evaluate all methods via negative log predictive density (NLPD) using 10-fold cross-validation, with each method run for 250 iterations (baselines are run until convergence). Gauss–Hermite integration uses 20^q cubature points, whereas the Unscented transform uses 2q² + 1 (we use the symmetric 5th order cubature rule, i.e. UT5, Kokkala et al., 2016; McNamee & Stenger, 1967), where q is the dimensionality of the integral (typically the number of GPs that are nonlinearly combined in the likelihood). For standard EP, where a power of zero is not possible, we set α = 0.01.

We optimise the model hyperparameters by maximising the marginal likelihood p(y | θ) separately for each method (see App. G for details), hence the results in Table 1 are affected by both training and inference, demonstrating their applicability as practical machine learning algorithms. The baseline methods scale as O(n³), while all the sequential schemes scale as O(ns³).

Results. Table 1 confirms it is possible to achieve state-of-the-art performance with sequential inference. The log-Gaussian Cox process and classification experiments return consistent results across all methods. However, Audio and Rainforest involve multidimensional sites, making them difficult tasks. In such cases, EEP performs well because it is the only method capable of maintaining full site covariance terms whilst remaining stable. Statistical linearisation suffers less from a reduction of cubature points than EP moment matching or VI updates, as shown by the performance of UEP on Audio. EP generally performed well, but EEP matches its performance sometimes whilst being the only method applicable to the Rainforest task. In other cases, cubature methods outperform linearisation, particularly on the Motorcycle task.

Motorcycle (heteroscedastic noise). The motorcycle crash dataset (Silverman, 1985) contains 131 non-uniformly spaced measurements from an accelerometer placed on a motorcycle helmet during impact, over a period of 60 ms. It is a challenging benchmark (e.g., Tolvanen et al., 2014), due to the heteroscedastic noise variance. We model both the process itself and the measurement noise scale with independent GP priors with Matérn-3/2 kernels: y_k | f_k^{(1)}, f_k^{(2)} ∼ N(y_k | f^{(1)}(t_k), [φ(f^{(2)}(t_k))]²), with softplus link function φ(f) = log(1 + e^f) to ensure positive noise scale. The full EP and VGP baselines were implemented and hand-tailored for this task. For VGP, we used GPflow 2 (Matthews et al., 2017) with a custom model. VI and EP (α ≈ 0) performed well, however the linearisation-based methods failed to capture the time-varying noise (see App. I for discussion).
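Written as a state observation model, Eq. (8), the heteroscedastic likelihood above corresponds to the measurement function sketched below (with σ_k ∼ N(0, 1)). The function name is an assumption for illustration; it is not taken from the kalman-jax code.

```python
# Sketch of the Motorcycle measurement model: y = f1 + softplus(f2) * sigma.
import jax.numpy as jnp

def softplus(x):
    return jnp.log1p(jnp.exp(x))

def h_heteroscedastic(f, sigma):
    # f = [f1, f2]: f1 is the signal GP, f2 controls the (positive) noise scale.
    return f[0] + softplus(f[1]) * sigma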
Figure 3. Examples of non-conjugate GP models: (a) Motorcycle (heteroscedastic), (b) Coal mining (Poisson), (c) Airline accidents (Poisson), (d) Rainforest (Poisson). Panels show the EP posterior mean and 95% confidence together with the EKF and EEP (5 iterations) estimates. In the motorcycle task, EP is capable of modelling the time-varying noise component. The log-Gaussian Cox processes (b)–(d) are well approximated via linearisation, and iterating improves the match to the EP posterior.

Table 2. Run times in seconds (mean across 10 runs). We report time to evaluate the marginal likelihood on the forward pass and perform the site updates on the smoothing pass.

            MOTORCYCLE   COAL     BANANA    BINARY    AUDIO    AIRLINE   RAINFOREST
EEP         0.015        0.013    0.135     0.100     1.176    23.941    37.441
UEP         0.017        0.015    0.144     0.113     1.661    24.583    —
GHEP        0.020        0.016    0.146     0.120     —        23.709    —
EP(U)       0.018        0.015    0.143     0.108     1.713    23.492    —
EP(GH)      0.019        0.016    0.150     0.127     —        23.777    —
VI(U)       0.017        0.017    0.142     0.098     1.619    23.796    —
VI(GH)      0.018        0.016    0.145     0.123     —        23.611    —
Time steps  133          333      400       10000     22050    35959     500
State dim.  6            3        45        4         15       59        500

Coal (log-Gaussian Cox process). The coal mining disaster dataset (Vanhatalo et al., 2013) contains 191 explosions that killed ten or more men in Britain between 1851–1962. We use a log-Gaussian Cox process, i.e. an inhomogeneous Poisson process (approximated with a Poisson likelihood for n = 333 equal time interval bins). We use a Matérn-5/2 GP prior with likelihood p(y | f) ≈ ∏_{k=1}^n Poisson(y_k | exp(f(t̂_k))), where t̂_k is the bin coordinate and y_k the number of disasters in the bin. This model reaches posterior consistency in the limit of bin width going to zero (Tokdar & Ghosh, 2007). Since linearisation requires a continuous likelihood, we approximate the discrete Poisson with a Gaussian by noticing that its first two moments are equal to the intensity λ(t) = exp(f(t)), giving y_k | f_k ∼ N(y_k | λ(t̂_k), λ(t̂_k)) approximately. See App. I for details.

Airline (log-Gaussian Cox process). The airline accidents data (Nickisch et al., 2018) consists of 1210 dates of commercial airline accidents between 1919–2017. We use a log-Gaussian Cox process with bin width of one day, leading to n = 35,959 observations. The prior has multiple components, κ(t, t′) = κ_Mat.^{ν=5/2}(t, t′) + κ_per.^{1 year}(t, t′) κ_Mat.^{ν=3/2}(t, t′) + κ_per.^{1 week}(t, t′) κ_Mat.^{ν=3/2}(t, t′), capturing a long-term trend, time-of-year variation (with decay), and day-of-week variation (with decay). The state dimension is s = 59.

Binary (1D classification). As a 1D classification task, we create a long binary time series, n = 10,000, using the generating function y(t) = sign{ (1/2) sin(4πt) / (0.25πt + 1) + σ_t }, with σ_t ∼ N(0, 0.25²). Our GP prior has a Matérn-7/2 kernel, s = 4, and the sigmoid function ψ(f) = (1 + e^{−f})^{−1} maps R → [0, 1] (logit classification). See App. I for derivation of the approximate continuous state observation model, h(f_k, σ_k) = ψ(f_k) + sqrt(ψ(f_k)(1 − ψ(f_k))) σ_k.

Audio (product of GPs). We apply a simplified version of the Gaussian Time-Frequency model from Wilkinson et al. (2019) to half a second of human speech, sampled at 44.1 kHz, n = 22,050. The prior consists of 3 quasi-periodic (κ_exp(t, t′) κ_cos(t, t′)) 'subband' GPs, and 3 smooth (κ_Mat-5/2(t, t′)) 'amplitude' GPs. The likelihood consists of a sum of the product of these processes with additive noise and a softplus mapping φ(·) for the positive amplitudes: y_k | f_k ∼ N( Σ_{i=1}^3 φ(f_{i,k}^{amp.}) f_{i,k}^{sub.}, σ_k² ). The nonlinear interaction of 6 GPs (s = 15) in the likelihood makes this a challenging task. EEP performs best since it is capable of maintaining full site covariance terms without compromising stability. UEP outperforms EP and VI since statistical linearisation is still accurate when using few cubature points.
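The moment-matched Gaussian approximations described above for the discrete likelihoods can also be written as measurement functions h(f, σ), as in Eq. (8). The sketch below follows the expressions given in the text (Poisson: mean and variance both λ = exp(f); Bernoulli with logit link: mean ψ(f), variance ψ(f)(1 − ψ(f))); the function names are illustrative assumptions.

```python
# Sketch of approximate continuous observation models for discrete likelihoods.
import jax.numpy as jnp

def h_poisson(f, sigma):
    lam = jnp.exp(f)                              # intensity lambda(t) = exp(f(t))
    return lam + jnp.sqrt(lam) * sigma            # y ~ N(lambda, lambda), approximately

def h_bernoulli(f, sigma):
    p = 1.0 / (1.0 + jnp.exp(-f))                 # logit link, psi(f)
    return p + jnp.sqrt(p * (1.0 - p)) * sigma    # matches E[y | f] and Var[y | f]
```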
Figure 4. Inference schemes on the two-dimensional Banana classification task: (a) VGP (full), (b) VGP (sparse), (c) EKF, (d) EEP, (e) EEP on the full data set. The coloured points represent training data and the black lines are decision boundaries. (a) is the baseline variational GP method (VGP in Table 1). (b) shows the sparse variant of the VGP baseline (Generalized FITC) with 15 inducing points (black dots). In (c)–(e), the vertical dimension is treated as the 'spatial' input (m = 15 inducing points shown with lines) and the horizontal as the sequential ('temporal') dimension. Our formulation using the EKF (c) works well, but is further improved by the EP-like iteration in (d). In (e), the method is applied to the full data set (n = 5400).

4.1. Spatio-Temporal Models

As presented in Sec. 3.6, the sequential inference schemes are also applicable to spatio-temporal problems. We illustrate this via two spatial problems, treating one spatial input as the sequential dimension ('time') and the other as 'space'.

Banana (2D classification). The banana data set, n = 400, is a common 2D classification benchmark (e.g., Hensman et al., 2015). We use the logit likelihood with a separable space-time kernel: κ(r, t; r′, t′) = κ_Mat.^{ν=5/2}(t, t′) κ_Mat.^{ν=5/2}(r, r′). The vertical dimension is treated as space r and the horizontal as the sequential ('temporal') dimension t. We use m = 15 inducing points in r (see Sec. 3.6), visualised by lines in Fig. 4(c)–(e). The state dimension is s = 3m = 45. Fig. 4 shows that the EKF provides a similar solution to the VGP baseline of Hensman et al. (2015), and an even closer match is obtained by EEP (3 iterations). The forward and backward passes are visualised in Fig. 1. The method is also applicable to the larger (n = 5400) version of the data set (Fig. 4e).

Rainforest (2D log-Gaussian Cox process). We study the density of a single tree species, Trichilia tuberculata, from a 1000 m × 500 m region of a rainforest in Panama (Condit, 1998; Hubbell et al., 1999; 2005). We segment the space into 4 m² bins, giving a 500 × 250 grid with 125,000 data points (n = 500 time steps), and use a log-Gaussian Cox process (Fig. 3d). The space-time GP has a separable Matérn-3/2 kernel. We do not use a sparse approximation in r; instead we have m = 250 temporal processes, so s = 2m = 500.

Run times. Table 2 compares the time taken for all methods to make a single training step on a MacBook Pro with 2.3 GHz Intel Core i5 and 16 GB RAM using JAX. For tasks with one-dimensional sites all methods are similar, however EEP is faster than the cubature methods for the Audio task which involves 6-dimensional sites. The gridded data of the Rainforest task requires a 250-dimensional site parameter update which is impractical for most methods. Conversely, in the Banana task data points are handled one by one, such that only one-dimensional updates are required.

5. Discussion and Conclusions

We argue that development of methods capable of naturally handling sequential data is crucial to extend the applicability of GPs beyond short time series. EP was originally inspired by, and derived from, Kalman filtering and here we make the case that a return to sequential methods is desirable for large spatio-temporal problems. We present a flexible and efficient framework for sequential learning that encompasses many state-of-the-art approximate inference schemes, whilst also illuminating the connections between modern day inference methods and traditional filtering approaches.

Our theoretical contributions confirm that using linearisation in place of EP moment matching results in iterated algorithms that exactly match the classical nonlinear Kalman filters on the first pass, and also generalise the classical smoothers by refining the linearisations via multiple passes through the data. These algorithms are fast and scale to high-dimensional spatio-temporal problems more effectively than EP and VI. The methods based on Taylor series approximations only require one evaluation of the likelihood (and its Jacobian) for each data point, as opposed to cubature methods, and these algorithms make it particularly straightforward to prototype and implement new likelihood models.

We provide a detailed examination of the different properties of all these methods on five time series and two spatial tasks, showing that the state space framework for GPs is applicable beyond one-dimensional problems. We have also highlighted the scenarios in which such methods might fail: linearisation is a poor approximation when the cavities are diffuse (high variance) and the likelihood is highly nonlinear, but cubature methods do not scale well to high dimensions.
Acknowledgements

We acknowledge funding from the Academy of Finland (grant numbers 308640 and 324345) and from Innovation Fund Denmark (grant number 8057-00036A). Our results were obtained using computational resources provided by the Aalto Science-IT project. Thanks to S. T. John for help implementing the VGP baseline for the heteroscedastic noise model.

References

Adam, V., Eleftheriadis, S., Artemev, A., Durrande, N., and Hensman, J. Doubly sparse variational Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108 of Proceedings of Machine Learning Research, pp. 2874–2884. PMLR, 2020.

Bar-Shalom, Y., Li, X.-R., and Kirubarajan, T. Estimation with Applications to Tracking and Navigation. Wiley-Interscience, 2001.

Bell, B. M. The iterated Kalman smoother as a Gauss–Newton method. SIAM Journal on Optimization, 4(3):626–636, 1994.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: Composable transformations of Python+NumPy programs, 2018. http://github.com/google/jax.

Bui, T. D., Yan, J., and Turner, R. E. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research (JMLR), 18(1):3649–3720, 2017.

Chang, P. E., Wilkinson, W. J., Khan, M. E., and Solin, A. Fast variational learning in state-space Gaussian process models. In International Workshop on Machine Learning for Signal Processing (MLSP), 2020.

Condit, R. Tropical Forest Census Plots. Springer-Verlag and R. G. Landes Company, Berlin, Germany, and Georgetown, Texas, 1998.

Deisenroth, M. and Mohamed, S. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems (NIPS), volume 25, pp. 2609–2617. Curran Associates, Inc., 2012.

Durrande, N., Adam, V., Bordeaux, L., Eleftheriadis, S., and Hensman, J. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pp. 2780–2789. PMLR, 2019.

García-Fernández, Á. F., Svensson, L., Morelande, M. R., and Särkkä, S. Posterior linearization filter: Principles and implementation using sigma points. IEEE Transactions on Signal Processing, 63(20):5561–5573, 2015.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. Bayesian Data Analysis. Chapman & Hall/CRC Press, third edition, 2013.

Hartikainen, J. Sequential Inference for Latent Temporal Gaussian Process Models. Doctoral dissertation, Aalto University, Helsinki, Finland, 2013.

Hartikainen, J. and Särkkä, S. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2010.

Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 282–290. AUAI Press, 2013.

Hensman, J., Matthews, A., and Ghahramani, Z. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38 of Proceedings of Machine Learning Research, pp. 351–360. PMLR, 2015.

Hubbell, S., Foster, R., O'Brien, S., Harms, K., Condit, R., Wechsler, B., Wright, S., and Loo de Lao, S. Light-gap disturbances, recruitment limitation, and tree diversity in a neotropical forest. Science, 283:554–557, 1999.

Hubbell, S., Condit, R., and Foster, R. Barro Colorado forest census plot data. URL: http://ctfs.si.edu/webatlas/datasets/bci/, 2005.

Ito, K. and Xiong, K. Gaussian filters for nonlinear filtering problems. IEEE Transactions on Automatic Control, 45(5):910–927, 2000.

Julier, S. J., Uhlmann, J. K., and Durrant-Whyte, H. F. A new approach for filtering nonlinear systems. In Proceedings of the American Control Conference, volume 3, pp. 1628–1632, 1995.

Julier, S. J., Uhlmann, J. K., and Durrant-Whyte, H. F. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Transactions on Automatic Control, 45(3):477–482, 2000.

Jylänki, P., Vanhatalo, J., and Vehtari, A. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research (JMLR), 12:3227–3257, 2011.
Khan, M. and Lin, W. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 878–887. PMLR, 2017.

Kokkala, J., Solin, A., and Särkkä, S. Sigma-point filtering and smoothing based parameter estimation in nonlinear dynamic systems. Journal of Advances in Information Fusion, 11(1):15–30, 2016.

Kuss, M. and Rasmussen, C. E. Assessing approximations for Gaussian process classification. In Advances in Neural Information Processing Systems (NIPS), volume 18, pp. 699–706. MIT Press, 2006.

Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research (JMLR), 18(40):1–6, 2017.

McNamee, J. and Stenger, F. Construction of fully symmetric numerical integration formulas. Numerische Mathematik, 10(4):327–344, 1967.

Minka, T. Divergence measures and message passing. Technical report, 2005.

Minka, T. P. Expectation propagation for approximate Bayesian inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), volume 17, pp. 362–369. AUAI Press, 2001.

Minka, T. P. Power EP. Technical report, 2004.

Nickisch, H., Solin, A., and Grigorievskiy, A. State space Gaussian processes with non-Gaussian likelihood. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp. 3789–3798. PMLR, 2018.

Opper, M. and Archambeau, C. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6(Dec):1939–1959, 2005.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Rauch, H. E., Tung, F., and Striebel, C. T. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1445–1450, 1965.

Reece, S. and Roberts, S. An introduction to Gaussian processes for the Kalman filter expert. In Proceedings of the 13th Conference on Information Fusion (FUSION), pp. 1–9. IEEE, 2010.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), volume 32, pp. 4588–4599. Curran Associates, Inc., 2017.

Särkkä, S. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.

Särkkä, S. and Solin, A. Applied Stochastic Differential Equations. Cambridge University Press, 2019.

Särkkä, S., Solin, A., and Hartikainen, J. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing. IEEE Signal Processing Magazine, 30(4):51–61, 2013.

Seeger, M. Expectation propagation for exponential families. Technical report, 2005.

Silverman, B. W. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society: Series B (Methodological), 47(1):1–21, 1985.

Šimandl, M. and Duník, J. Derivative-free estimation methods: New results and performance analysis. Automatica, 45(7):1749–1757, 2009.

Solin, A. Stochastic Differential Equation Methods for Spatio-Temporal Gaussian Process Regression. Doctoral dissertation, Aalto University, Helsinki, Finland, 2016.

Titsias, M. K. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5 of Proceedings of Machine Learning Research, pp. 567–574. PMLR, 2009.

Tobar, F. Band-limited Gaussian processes: The sinc kernel. In Advances in Neural Information Processing Systems (NeurIPS), pp. 12728–12738. Curran Associates, Inc., 2019.

Tokdar, S. T. and Ghosh, J. K. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.

Tolvanen, V., Jylänki, P., and Vehtari, A. Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE, 2014.
Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research (JMLR), 14(Apr):1175–1179, 2013.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Wang, K., Pleiss, G., Gardner, J., Tyree, S., Weinberger, K. Q., and Wilson, A. G. Exact Gaussian processes on a million data points. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, pp. 14622–14632. Curran Associates, Inc., 2019.

Wilkinson, W. J., Andersen, M. R., Reiss, J. D., Stowell, D., and Solin, A. End-to-end probabilistic inference for nonstationary audio analysis. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pp. 6776–6785. PMLR, 2019.

Wu, Y., Hu, D., Wu, M., and Hu, X. Unscented Kalman filtering for additive noise case: Augmented vs. non-augmented. In Proceedings of the American Control Conference, volume 6, pp. 4051–4055. IEEE, 2005.

Wu, Y., Hu, D., Wu, M., and Hu, X. A numerical-integration perspective on Gaussian filters. IEEE Transactions on Signal Processing, 54(8):2910–2921, 2006.
Supplementary Material:
State Space Expectation Propagation

A. Nomenclature

Vectors: bold lowercase. Matrices: bold uppercase.

n: Number of time steps
m: Number of latent functions / processes
s: State dimensionality
d: Output dimensionality
t ∈ R: Time (input)
r: Space (input, of arbitrary dimension)
k: Time index, t_k, k = 1, ..., n
y_k ∈ R^d: Observation (output)
y ∈ R^{d×n}: Collection of outputs, (y_1, y_2, ..., y_n)
θ: Vector of model (hyper)parameters
κ(t, t′): Covariance function (kernel)
K(t, t′): Multi-output covariance function
µ(t): Mean function
µ(t): Multi-output mean function
σ_k: Measurement noise
Σ_k: Measurement noise covariance
f(t) : R → R: Latent function (Gaussian process)
f(t) : R → R^m: Vector of latent functions, f_k = f(t_k)
f ∈ R^{m×n}: Collection of latents, (f(t_1), ..., f(t_n))
H_k ∈ R^{m×s}: State→function mapping
h(f_k, σ_k): Measurement model, (R^m, R^d) → R^d
x(t) : R → R^s: State vector, f(t) = H_k x(t)
x_k ∈ R^s: State variable, x_k = x(t_k) ∼ N(m_k, P_k)
F ∈ R^{s×s}: Feedback matrix (continuous)
L ∈ R^{s×v}: Noise effect matrix (continuous)
Q_c ∈ R^{v×v}: White noise spectral density (continuous)
A_k ∈ R^{s×s}: Dynamic model (discrete)
q_k ∈ R^s: State space process noise (discrete)
Q_k ∈ R^{s×s}: Process noise covariance (discrete)
P_∞ ∈ R^{s×s}: Stationary state covariance (prior)
m_k ∈ R^{s×1}: State mean
P_k ∈ R^{s×s}: State covariance
K_k ∈ R^{s×d}: Kalman gain
G_k ∈ R^{s×s}: Smoother gain
J_f ∈ R^{d×m}: Jacobian of h w.r.t. f_k
J_σ ∈ R^{d×d}: Jacobian of h w.r.t. σ_k
α: EP power / fraction
L_k: Log-normaliser of true posterior update
q_k^site(f_k): EP site (approximate likelihood), q_k^site(f_k) ∼ N(µ_k^site, Σ_k^site)
q_k^cav(f_k): EP cavity (leave-one-out posterior), q_k^cav(f_k) ∼ N(µ_k^cav, Σ_k^cav)

B. Gaussian Filtering

Given observation model p(y_k | f_k) = N(y_k | f_k, Σ_k) for f_k = H_k x_k, along with current filter predictions p(x_k | y_{1:k−1}) = N(x_k | m_k^pred, P_k^pred), the Kalman filter update equations are,

µ_k = H_k m_k^pred,
S_k = H_k P_k^pred H_k^⊤ + Σ_k,
C_k = P_k^pred H_k^⊤,    (22)
K_k = C_k S_k^{−1},
m_k = m_k^pred + K_k (y_k − µ_k),
P_k = P_k^pred − K_k S_k K_k^⊤.

For nonlinear measurement models, y_k = h(f_k, σ_k), letting µ_k^cav = H_k m_k^pred and Σ_k^cav = H_k P_k^pred H_k^⊤, the statistical linear regression equations for the general Gaussian filtering methods are,

µ_k = ∫∫ h(f_k, σ_k) N(f_k | µ_k^cav, Σ_k^cav) N(σ_k | 0, Σ_k) df_k dσ_k,
S_k = ∫∫ (h(f_k, σ_k) − µ_k)(h(f_k, σ_k) − µ_k)^⊤ N(f_k | µ_k^cav, Σ_k^cav) N(σ_k | 0, Σ_k) df_k dσ_k,    (23)
C_k = ∫∫ (f_k − µ_k^cav)(h(f_k, σ_k) − µ_k)^⊤ N(f_k | µ_k^cav, Σ_k^cav) N(σ_k | 0, Σ_k) df_k dσ_k.

Note that in the additive noise case, h(f_k, σ_k) = h̃(f_k) + σ_k, these can be simplified to,

µ_k = ∫ h̃(f_k) N(f_k | µ_k^cav, Σ_k^cav) df_k,
S_k = ∫ [ (h̃(f_k) − µ_k)(h̃(f_k) − µ_k)^⊤ + Cov[y_k | f_k] ] N(f_k | µ_k^cav, Σ_k^cav) df_k,    (24)
C_k = ∫ (f_k − µ_k^cav)(h̃(f_k) − µ_k)^⊤ N(f_k | µ_k^cav, Σ_k^cav) df_k,

for σ_k ∼ N(0, Σ_k = Cov[y_k | f_k]). Note that we include the case where Σ_k is a nonlinear function of f_k, which occurs in our approximations to discrete likelihoods presented in App. I. Here we have used h̃(f_k) = E[y_k | f_k].
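For a scalar latent function, the additive-noise SLR moments of Eq. (24) can be computed with Gauss–Hermite quadrature as sketched below. This is a minimal sketch under simplifying assumptions (one-dimensional f_k, 20 quadrature points, user-supplied mean and variance functions); the paper uses multivariate cubature rules in practice, and the names here are illustrative only.

```python
# Sketch of the statistical linear regression moments, Eq. (24), scalar case.
import jax.numpy as jnp
import numpy as np

x_gh, w_gh = np.polynomial.hermite_e.hermegauss(20)
x_gh = jnp.array(x_gh)
w_gh = jnp.array(w_gh / np.sqrt(2.0 * np.pi))

def slr_moments(h_tilde, obs_var, mu_cav, var_cav):
    # h_tilde(f) = E[y | f];  obs_var(f) = Cov[y | f]
    f = mu_cav + jnp.sqrt(var_cav) * x_gh
    hf = h_tilde(f)
    mu = jnp.sum(w_gh * hf)                                  # mu_k
    S = jnp.sum(w_gh * ((hf - mu) ** 2 + obs_var(f)))        # S_k
    C = jnp.sum(w_gh * (f - mu_cav) * (hf - mu))             # C_k
    return mu, S, C
```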
C. Closed-form Site Updates in Sec. 3.2

Here we derive in full the closed form site updates after analytical linearisation in Sec. 3.2. Plugging the derivatives from Eq. (15) into the updates in Eq. (10) we get,

µ_k^site = µ_k^cav + ( J_{f_k}^⊤ Σ̂_k^{−1} J_{f_k} )^{−1} J_{f_k}^⊤ Σ̂_k^{−1} v_k,    (25)
Σ_k^site = −α Σ_k^cav + ( J_{f_k}^⊤ Σ̂_k^{−1} J_{f_k} )^{−1},

where v_k = y_k − h(µ_k^cav, 0). By the matrix inversion lemma, and letting R_k = J_{σ_k} Σ_k J_{σ_k}^⊤,

Σ̂_k^{−1} = R_k^{−1} − R_k^{−1} J_{f_k} ( (α Σ_k^cav)^{−1} + J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1} J_{f_k}^⊤ R_k^{−1},    (26)

so that

J_{f_k}^⊤ Σ̂_k^{−1} J_{f_k} = W_k − W_k ( (α Σ_k^cav)^{−1} + W_k )^{−1} W_k,    (27)

where W_k = J_{f_k}^⊤ R_k^{−1} J_{f_k}. Applying the matrix inversion lemma for a second time we obtain

( J_{f_k}^⊤ Σ̂_k^{−1} J_{f_k} )^{−1}
  = W_k^{−1} − W_k^{−1} W_k ( W_k W_k^{−1} W_k − ( (α Σ_k^cav)^{−1} + W_k ) )^{−1} W_k W_k^{−1}
  = W_k^{−1} + α Σ_k^cav
  = ( J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1} + α Σ_k^cav.    (28)

We can also write

( J_{f_k}^⊤ Σ̂_k^{−1} J_{f_k} )^{−1} J_{f_k}^⊤ Σ̂_k^{−1} = ( ( J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1} + α Σ_k^cav ) J_{f_k}^⊤ ( R_k + α J_{f_k} Σ_k^cav J_{f_k}^⊤ )^{−1}.    (29)

Together the above calculations give the approximate site mean and covariance as

Σ_k^site = ( J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1},
µ_k^site = µ_k^cav + ( Σ_k^site + α Σ_k^cav ) J_{f_k}^⊤ ( R_k + α J_{f_k} Σ_k^cav J_{f_k}^⊤ )^{−1} v_k.    (30)

D. Analytical Linearisation in EP (α = 1) Results in an Iterated Version of the EKF

Here we prove the result given in Sec. 3.2: a single pass of the proposed EP-style algorithm with analytical linearisation (i.e. a first order Taylor series approximation) is exactly equivalent to the EKF. Plugging the closed form site updates, Eq. (16), with α = 1 (since the filter predictions can be interpreted as the cavity with the full site removed), into our modified Kalman filter update equations, Eq. (11), we get a new set of Kalman updates in which the latent noise terms are determined by scaling the observation noise with the Jacobian of the state. Crucially, on the first forward pass the Kalman prediction is used as the cavity such that Σ_k^cav = H_k P_k^pred H_k^⊤:

S_k = Σ_k^cav + ( J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1},
K_k = P_k^pred H_k^⊤ S_k^{−1},    (31)
m_k = m_k^pred + K_k S_k J_{f_k}^⊤ ( R_k + J_{f_k} Σ_k^cav J_{f_k}^⊤ )^{−1} v_k,
P_k = P_k^pred − K_k S_k K_k^⊤,

where R_k = J_{σ_k} Σ_k J_{σ_k}^⊤. This can be rewritten to explicitly show that there are two innovation covariance terms, S_k and Ŝ_k, which act on the state mean and covariance separately:

Linearised update step:
Ŝ_k = Σ_k^cav + ( J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1},
S_k = J_{f_k} Σ_k^cav J_{f_k}^⊤ + R_k,
K̂_k = P_k^pred H_k^⊤ Ŝ_k^{−1},    (32)
K_k = P_k^pred H_k^⊤ J_{f_k}^⊤ S_k^{−1},
m_k = m_k^pred + K_k v_k,
P_k = P_k^pred − K̂_k Ŝ_k K̂_k^⊤.

Now we calculate the inverse of Ŝ_k:

Ŝ_k^{−1} = ( Σ_k^cav + ( J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1} )^{−1}
         = J_{f_k}^⊤ R_k^{−1} J_{f_k} − J_{f_k}^⊤ R_k^{−1} J_{f_k} ( (Σ_k^cav)^{−1} + J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1} J_{f_k}^⊤ R_k^{−1} J_{f_k},    (33)

and the inverse of S_k:

S_k^{−1} = ( J_{f_k} Σ_k^cav J_{f_k}^⊤ + R_k )^{−1}
         = R_k^{−1} − R_k^{−1} J_{f_k} ( (Σ_k^cav)^{−1} + J_{f_k}^⊤ R_k^{−1} J_{f_k} )^{−1} J_{f_k}^⊤ R_k^{−1},    (34)
which shows that F. Avoiding Numerical Issues When
Computing the Cavity
Ŝ−1 > −1
k = Jfk Sk Jfk , (35)
Computing the cavity distribution in Eq. (12) involves the
and hence, recalling that Rk = Jrk Σk J>rk , Eq. (32) simpli- subtraction of two PSD covariance matrices. The result is
fies to give exactly the extended Kalman filter updates: not guaranteed to be PSD and not guaranteed to be invertible,
EKF update step: which can lead to numerical issues. If fk is one-dimensional,
then no such issue occurs. In the higher-dimensional case
Sk = Jfk Hk Ppred
k H> > >
k J f k + J σk Σ k J σk , issues can be avoided by discarding the cross-covariances
Kk = Ppred H> > −1 such that Eq. (12) involves only element-wise subtraction
k k Jfk Sk ,
(36) of scalars. If using cubature to perform moment matching /
mk = mpred
k + Kk (yk − h(Hk mpred
k , 0)), linearisation, then this results in a loss of accuracy. However,
Pk = Ppred
k − Kk Sk K>
k.
for the Taylor series approximations (EKF / EKS / EEP) the
cross-covariances are discarded anyway.
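As a concrete illustration of the update step in Eq. (36), the sketch below computes the EKF update in JAX, with the two Jacobians obtained by automatic differentiation rather than supplied in closed form. It is a minimal sketch under our own naming assumptions, not the interface of the accompanying implementation.

import jax
import jax.numpy as jnp

def ekf_update(m_pred, P_pred, y, H, h, Sigma):
    # EKF update of Eq. (36); h(f, sigma) is the measurement model and the
    # Jacobians are evaluated at the predicted mean f = H m_pred, sigma = 0.
    f_pred = H @ m_pred
    sigma0 = jnp.zeros(Sigma.shape[0])
    Jf = jax.jacobian(h, argnums=0)(f_pred, sigma0)        # d x m, dh/df
    Js = jax.jacobian(h, argnums=1)(f_pred, sigma0)        # d x d, dh/dsigma
    v = y - h(f_pred, sigma0)                              # residual
    S = Jf @ H @ P_pred @ H.T @ Jf.T + Js @ Sigma @ Js.T   # innovation covariance
    K = jnp.linalg.solve(S, Jf @ H @ P_pred).T             # gain, K = P^pred H^T Jf^T S^-1
    m = m_pred + K @ v
    P = P_pred - K @ S @ K.T
    return m, P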
E. General Gaussian Filter Site Updates in Sec. 3.3

Here we derive in full the site updates after statistical linear regression in Sec. 3.3. The Gaussian likelihood approximation results in,

  L_k = log E_{q_k^cav}[ N(y_k | µ_k + C_k^T (Σ_k^cav)^-1 (f_k − µ_k^cav), R_k)^α ]
      = c + log N(y_k | µ_k, α^-1 Σ̃_k),                                              (37)

where q_k^cav = N(f_k | µ_k^cav, Σ_k^cav), R_k = S_k − C_k^T (Σ_k^cav)^-1 C_k and Σ̃_k = R_k + α C_k^T (Σ_k^cav)^-1 C_k, for µ_k, S_k and C_k given in Eq. (23) with µ_k^cav = H_k m_k^pred, Σ_k^cav = H_k P_k^pred H_k^T. Taking the derivatives of this log-Gaussian w.r.t. the cavity mean, we get

  ∇L_k = ∂L_k / ∂µ_k^cav = α Ω_k^T Σ̃_k^-1 v_k,
  ∇²L_k = ∂²L_k / (∂µ_k^cav ∂(µ_k^cav)^T) = −α Ω_k^T Σ̃_k^-1 Ω_k,                     (38)

where v_k = y_k − µ_k and

  Ω_k = ∂µ_k / ∂µ_k^cav
      = ∫∫ h(f_k, σ_k) ((Σ_k^cav)^-1 (f_k − µ_k^cav))^T N(f_k | µ_k^cav, Σ_k^cav) N(σ_k | 0, Σ_k) df_k dσ_k.   (39)

As in Sec. 3.2, to ensure consistency we have assumed here that the derivative of Σ̃_k is zero, despite the fact that it depends on µ_k^cav.

Plugging the derivatives from Eq. (38) into the updates in Eq. (10) we get,

  µ_k^site = µ_k^cav + (Ω_k^T Σ̃_k^-1 Ω_k)^-1 Ω_k^T Σ̃_k^-1 v_k,
  Σ_k^site = −α Σ_k^cav + (Ω_k^T Σ̃_k^-1 Ω_k)^-1.                                     (40)

F. Avoiding Numerical Issues When Computing the Cavity

Computing the cavity distribution in Eq. (12) involves the subtraction of two PSD covariance matrices. The result is not guaranteed to be PSD and not guaranteed to be invertible, which can lead to numerical issues. If f_k is one-dimensional, then no such issue occurs. In the higher-dimensional case issues can be avoided by discarding the cross-covariances such that Eq. (12) involves only element-wise subtraction of scalars. If using cubature to perform moment matching / linearisation, then this results in a loss of accuracy. However, for the Taylor series approximations (EKF / EKS / EEP) the cross-covariances are discarded anyway.

An alternative approach which does not trade off accuracy is to instead compute the cavity by taking the product of the forward and backward filtering distributions, an approach known as two-filter smoothing (Särkkä, 2013), and then include a fraction (1 − α) of the local site. This method only involves products of PSD matrices, which is more numerically stable. We did not implement this approach here.

G. Marginal Likelihood Calculation During Filtering

The marginal likelihood, p(y | θ), is used as an optimisation objective for hyperparameter learning. The marginal likelihood can be written as a product of conditional terms (dropping the dependence on θ for notational convenience),

  p(y) = p(y_1) p(y_2 | y_1) p(y_3 | y_{1:2}) ∏_{k=4}^{n} p(y_k | y_{1:k−1}).        (41)

Each term can be computed via numerical integration during the Kalman filter by noticing that,

  p(y_k | y_{1:k−1}) = ∫ p(y_k | x_k, y_{1:k−1}) p(x_k | y_{1:k−1}) dx_k
                     = ∫ p(y_k | f_k = H x_k) p(x_k | y_{1:k−1}) dx_k.               (42)

The first component in the integral is the likelihood, and the second term is the forward filter prediction.

The Taylor series methods (EKF / EKS / EEP) aim to avoid numerical integration, and hence use an alternative approximation to the marginal likelihood based on the linearised model, as shown in Algorithm 2.
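To illustrate Eq. (42), the sketch below approximates a single log p(y_k | y_{1:k−1}) term by Gauss-Hermite quadrature for a scalar latent f_k = H x_k. The function name, and the assumption that log_lik is vectorised over the quadrature nodes, are our own; this is not the interface of the accompanying implementation.

import numpy as np
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def log_predictive(m_pred, P_pred, H, y, log_lik, num_points=20):
    # Gauss-Hermite approximation to log p(y_k | y_{1:k-1}) in Eq. (42),
    # assuming a scalar f_k = H x_k. log_lik(y, f) must accept an array of
    # f values (the quadrature nodes) and return the log-likelihoods.
    mu = jnp.squeeze(H @ m_pred)                   # predictive mean of f_k
    var = jnp.squeeze(H @ P_pred @ H.T)            # predictive variance of f_k
    x, w = np.polynomial.hermite_e.hermegauss(num_points)   # probabilists' rule
    f = mu + jnp.sqrt(var) * x                     # nodes of N(f | mu, var)
    # int p(y | f) N(f | mu, var) df  ~=  sum_i w_i p(y | f_i) / sqrt(2 pi)
    return logsumexp(log_lik(y, f) + jnp.log(w)) - 0.5 * jnp.log(2 * jnp.pi)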
H. The EEP Algorithm

Algorithm 2  EEP: Extended Expectation Propagation, a globally iterated Extended Kalman filter with power EP-style updates such that linearisation is performed w.r.t. the cavity mean.

Input: {t_k, y_k}_{k=1}^{n}, A_k, Q_k, P_∞, Σ_k                    data, discrete state space model and obs. noise
       h, H_k, J_f, J_σ, α                                          measurement model, Jacobians and EP power
m_0 ← 0, P_0 ← P_∞, e_{1:n} = 0                                     initial state
while not converged do                                              iterated EP-style loop
  for k = 1 to n do                                                 forward pass (FILTERING)
    m_k ← A_k m_{k−1}                                               predict mean
    P_k ← A_k P_{k−1} A_k^T + Q_k                                   predict covariance
    if has label y_k then
      Σ_k^cav ← H_k P_k H_k^T                                       predict = forward cavity
      µ_k^cav ← H_k m_k
      v_k ← y_k − h(µ_k^cav, 0)                                     residual
      J_fk ← J_f |_{µ_k^cav, 0};  J_σk ← J_σ |_{µ_k^cav, 0}          evaluate Jacobians
      if first iteration then
        Σ_k^site ← (J_fk^T (J_σk Σ_k J_σk^T)^-1 J_fk)^-1            match moments to get site covariance...
        µ_k^site ← µ_k^cav + (Σ_k^site + Σ_k^cav) J_fk^T (J_σk Σ_k J_σk^T + J_fk Σ_k^cav J_fk^T)^-1 v_k    ...and site mean (α = 1)
      end if
      S_k ← H_k P_k H_k^T + Σ_k^site                                innovation
      K_k ← P_k H_k^T S_k^-1                                        Kalman gain
      m_k ← m_k + K_k (µ_k^site − µ_k^cav)                          update mean
      P_k ← P_k − K_k S_k K_k^T                                     update covariance
      E_k ← J_σk Σ_k J_σk^T + J_fk Σ_k^cav J_fk^T
      e_k ← (1/2) log |2π E_k| + (1/2) v_k^T E_k^-1 v_k             energy
    end if
  end for
  for k = n − 1 to 1 do                                             backward pass (SMOOTHING)
    G_k ← P_k A_{k+1}^T (A_{k+1} P_k A_{k+1}^T + Q_{k+1})^-1        smoothing gain
    m_k ← m_k + G_k (m_{k+1} − A_{k+1} m_k)                         update
    P_k ← P_k + G_k (P_{k+1} − A_{k+1} P_k A_{k+1}^T − Q_{k+1}) G_k^T
    if has label y_k then
      Σ_k^cav ← ((H_k P_k H_k^T)^-1 − α (Σ_k^site)^-1)^-1           remove site to get cavity covariance...
      µ_k^cav ← Σ_k^cav ((H_k P_k H_k^T)^-1 H_k m_k − α (Σ_k^site)^-1 µ_k^site)    ...and cavity mean
      J_fk ← J_f |_{µ_k^cav, 0};  J_σk ← J_σ |_{µ_k^cav, 0}          evaluate Jacobians
      v_k ← y_k − h(µ_k^cav, 0)                                     residual
      Σ_k^site ← (J_fk^T (J_σk Σ_k J_σk^T)^-1 J_fk)^-1              match moments to get site covariance...
      µ_k^site ← µ_k^cav + (Σ_k^site + α Σ_k^cav) J_fk^T (J_σk Σ_k J_σk^T + α J_fk Σ_k^cav J_fk^T)^-1 v_k    ...and site mean
    end if
  end for
end while
Return: E[f(t_k)] = H_k m_k;  V[f(t_k)] = H_k P_k H_k^T             posterior marginal mean and variance
        log p(y | θ) ≈ − Σ_{k=1}^{n} e_k                            log marginal likelihood
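The "match moments" step of Algorithm 2 (the closed-form site update of Eq. (16)) amounts to a few lines of linear algebra. The sketch below is our own illustration of that step in JAX; the function name and argument conventions are assumptions rather than the accompanying code's API.

import jax.numpy as jnp

def site_update(mu_cav, Sigma_cav, y, h, Jf, Js, Sigma, alpha=1.0):
    # Closed-form power-EP site update used in Algorithm 2 (Eq. (16)).
    # Jf and Js are the Jacobians of h w.r.t. f and sigma, evaluated at
    # (mu_cav, 0), i.e. linearisation is performed about the cavity mean.
    R = Js @ Sigma @ Js.T                              # effective observation noise
    v = y - h(mu_cav, jnp.zeros(Sigma.shape[0]))       # residual
    Sigma_site = jnp.linalg.inv(Jf.T @ jnp.linalg.solve(R, Jf))
    gain = Jf.T @ jnp.linalg.inv(R + alpha * Jf @ Sigma_cav @ Jf.T)
    mu_site = mu_cav + (Sigma_site + alpha * Sigma_cav) @ gain @ v
    return mu_site, Sigma_site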
I. Continuous Measurement Model Approximations for Sec. 4

The next subsections provide further details of the model formulations used in the experiments (i.e., how to write down approximative measurement models for common tasks such as heteroscedastic noise modelling, Poisson likelihoods, or logistic classification).

I.1. Heteroscedastic Noise Model

The heteroscedastic noise model contains one GP for the mean, f^(1), and another for the time-varying observation noise, f^(2), both with Matérn-3/2 covariance functions. The GP priors are independent,

  f^(1)(t) ∼ GP(0, κ(t, t′)),
  f^(2)(t) ∼ GP(0, κ(t, t′)),                                                        (43)

and the likelihood model is

  y | f^(1), f^(2) ∼ ∏_{k=1}^{n} N(y_k | f_k^(1), [φ(f_k^(2))]²).                    (44)

The corresponding state space observation model is

  h(f_k, σ_k) = f_k^(1) + φ(f_k^(2)) σ_k,                                            (45)

where σ_k ∼ N(0, 1) and φ(f) = log(1 + exp(f − 1/2)). The Jacobians w.r.t. the (two-dimensional) latent GPs f_k and the noise variable σ_k are,

  J_f(f_k, σ_k) = ∂h / ∂f_k^T = [1, φ′(f_k^(2)) σ_k],                                (46a)
  J_σ(f_k, σ_k) = ∂h / ∂σ_k^T = φ(f_k^(2)),                                          (46b)

where the derivative of the softplus is the sigmoid function:

  φ′(f) = 1 / (1 + exp(−f + 1/2)).                                                   (47)

In practice a problem arises when using the above linearisation. Since the mean of σ_k is zero, the Jacobian w.r.t. f^(2) disappears when evaluated at the mean regardless of the value of f^(2). This means that the second latent function is never updated, which results in poor performance, as shown in Table 1. We found that statistical linearisation suffers from a similar issue, providing little importance to the latent function that models the noise, which highlights a potential weakness of linearisation-based methods.

Fig. 7 plots a breakdown of the different components in the posterior for the motorcycle crash data set.

I.2. Log-Gaussian Cox Process

For a log-Gaussian Cox process, binning the data into subregions and assuming the process has locally constant intensity in these subregions allows us to use a Poisson likelihood, p(y | f) ≈ ∏_{k=1}^{n} Poisson(y_k | exp(f(t̂_k))), where t̂_k is the bin coordinate and y_k the number of data points in it. However, the Poisson is a discrete probability distribution and the EKF and EEP methods require the observation model to be differentiable. Therefore we use a Gaussian approximation, noticing that the first two moments of the Poisson distribution are equal to the intensity λ_k = exp(f_k).

We have a GP prior over f:

  f(t) ∼ GP(0, K(t, t′)),                                                            (48)

and the approximate Gaussian likelihood is

  p(y | f) = ∏_{k=1}^{n} N(y_k | exp(f_k), diag[exp(f_k)]),                          (49)

where diag[exp(f_k)] is a diagonal matrix whose entries are the elements of exp(f_k). This implies the following state space observation model:

  h(f_k, σ_k) = exp(f_k) + diag[exp(f_k/2)] σ_k,                                     (50)

where σ_k ∼ N(0, I). The EKF and EEP algorithms require the Jacobians of h(f_k, σ_k) with respect to f_k and σ_k, which are given by,

  J_f(f_k, σ_k) = ∂h / ∂f_k^T = diag[exp(f_k) + (1/2) diag[exp(f_k/2)] σ_k],         (51a)
  J_σ(f_k, σ_k) = ∂h / ∂σ_k^T = diag[exp(f_k/2)].                                    (51b)
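As an illustration of Eqs. (50)-(51), the approximate Poisson measurement model can be written directly in JAX, with the Jacobians recovered by forward-mode autodiff instead of the closed forms. The function name and example values below are our own, not part of the accompanying implementation.

import jax
import jax.numpy as jnp

def h_lgcp(f, sigma):
    # Gaussian approximation to the Poisson measurement model, Eq. (50):
    # mean exp(f) plus noise scaled by the Poisson standard deviation exp(f/2).
    return jnp.exp(f) + jnp.exp(f / 2.0) * sigma

# Jacobians by forward-mode autodiff; these match the closed forms in Eq. (51).
f0 = jnp.array([0.3, -1.2])                  # example latent values (ours)
s0 = jnp.zeros(2)
Jf = jax.jacfwd(h_lgcp, argnums=0)(f0, s0)   # diag[exp(f) + 0.5 exp(f/2) sigma]
Js = jax.jacfwd(h_lgcp, argnums=1)(f0, s0)   # diag[exp(f/2)]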
I.3. Bernoulli (Logistic Classification)

As in standard GP classification we place a GP prior over the latent function f, and use a Bernoulli likelihood by mapping f through the logistic function ψ(f) = 1 / (1 + exp(−f)),

  f(t) ∼ GP(0, κ(t, t′)),                                                            (52a)
  y | f ∼ ∏_{k=1}^{n} Bern(ψ(f(t_k))).                                               (52b)

As with the Poisson likelihood, we wish to approximate the Bernoulli with a distribution that has continuous support. We form a Gaussian approximation whose mean and variance are equal to that of the Bernoulli distribution, which has mean E[y | f] = ψ(f) and variance Var[y | f] = ψ(f)(1 − ψ(f)), giving:

  y | f ∼ ∏_{k=1}^{n} N(y_k | ψ(f_k), ψ(f_k)(1 − ψ(f_k))).                           (53)

Therefore the approximate state space observation model is

  h(f_k, σ_k) = 1 / (1 + exp(−f_k)) + (exp(f_k/2) / (1 + exp(f_k))) σ_k,             (54)

and the Jacobians are

  J_f(f_k, σ_k) = ∂h / ∂f_k = exp(f_k) / (1 + exp(f_k))² + ((exp(f_k/2) − exp(3f_k/2)) / (2 (1 + exp(f_k))²)) σ_k,   (55a)
  J_σ(f_k, σ_k) = ∂h / ∂σ_k = exp(f_k/2) / (1 + exp(f_k)).                           (55b)

I.4. Bernoulli (Probit Classification)

The Probit likelihood can be constructed similarly to the Logistic model above, by simply swapping the logistic function for the Normal CDF: ψ(f) = Φ(f) = ∫_{−∞}^{f} N(x | 0, 1) dx.
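A similar sketch for the logistic classification model of Eqs. (53)-(55), again under our own naming assumptions: the measurement model is the Bernoulli mean plus noise scaled by the Bernoulli standard deviation, and the autodiff Jacobians should agree with the closed forms in Eq. (55) when evaluated at (f_k, 0).

import jax
import jax.numpy as jnp

def h_logistic(f, sigma):
    # Gaussian approximation to the Bernoulli (logistic) likelihood, Eq. (54).
    psi = jax.nn.sigmoid(f)                         # E[y | f]
    return psi + jnp.sqrt(psi * (1.0 - psi)) * sigma

# Scalar-case Jacobians, matching Eq. (55) at (f_k, 0):
Jf = jax.grad(h_logistic, argnums=0)(0.7, 0.0)      # dh/df = exp(f)/(1 + exp(f))^2
Js = jax.grad(h_logistic, argnums=1)(0.7, 0.0)      # dh/dsigma = exp(f/2)/(1 + exp(f))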
J. Supplementary Figures for Sec. 4

J.1. Rainforest

[Figure 5: (a) tree locations; (b) posterior mean of the log-intensity, log λ(r, t). Axes: t, spatial dimension 1 (metres); r, spatial dimension 2 (metres).]

Figure 5. The data (a) are 12,929 tree locations in a rainforest. They are binned into a grid of 500 × 250 and we apply a log-Gaussian Cox process using EEP for inference. The posterior log-intensity is shown in (b).

J.2. Audio

[Figure 6: (a) audio signal y; (b) first component; (c) second component; (d) third component. Each component panel shows the periodic subband (posterior mean) and the positive amplitude (posterior mean). Horizontal axis: t, time (seconds).]

Figure 6. Analysis of a recording of female speech (a), duration 0.5 seconds, sampled at 44.1 kHz, n = 22,050. The three-component GP prior is overly simple given the true harmonic structure of the data, but the model is able to uncover high-, medium-, and low-frequency behaviour (b)-(d) along with their positive amplitude envelopes (shown in red). Only the posterior means are shown. The posterior for the signal (not shown) is produced by multiplying the periodic components by their amplitudes and summing the three resulting signals (see Sec. 4 for more details regarding the model). The sub-components have been rescaled for visualisation purposes.
J.3. Motorcycle

[Figure 7, six panels: left column SLEP, right column EP. Top: f^(1) (mean component); middle: φ(f^(2)) (noise standard deviation component); bottom: full posterior. Each panel shows the posterior mean, 95% confidence region, and observations. Axes: time (milliseconds) vs. accelerometer reading.]

Figure 7. Model components for the motorcycle crash experiment. Left is the SLEP method (with Gauss-Hermite cubature, i.e. GHEP) and right is the EP equivalent. The linearisation-based methods fail to incorporate the heteroscedastic noise, whereas EP captures rich time-varying behaviour. The top plots are the posterior for f^(1)(t) (the mean process), the middle plots show the posterior for f^(2)(t) (the observation noise process), and the bottom plots show the full model.
