
Gaussian Process Summer School

Second introduction to GPs and Kernel Design

Nicolas Durrande – PROWLER.io

@NicolasDurrande – nicolas@prowler.io

September 2020
Introduction

The probability density function of a normal random variable is:

f(x) = 1/(σ√(2π)) exp( −(x − µ)² / (2σ²) )

[Figure: the normal density plotted for x between −4 and 4]
The parameters µ and σ² correspond to the mean and variance:

µ = E[X]
σ² = E[X²] − E[X]²

The variance is positive.


Definition
We say that a vector Y = (Y1 , . . . , Yn )T follows a multivariate
normal distribution if any linear combination of Y follows a normal
distribution:
∀α ∈ Rⁿ, αᵀ Y ∼ N

Two examples and one counter-example:


[Figure: three scatter plots of (Y1, Y2) samples — two Gaussian examples and one counter-example]
The pdf of a multivariate Gaussian is:
 
f_Y(x) = 1/( (2π)^(n/2) |Σ|^(1/2) ) exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) ).

It is parametrised by
  the mean vector µ = E[Y]
  the covariance matrix Σ = E[Y Yᵀ] − E[Y] E[Y]ᵀ (i.e. Σi,j = cov(Yi, Yj))

[Figure: surface plot of a 2D Gaussian density over (x1, x2)]

A covariance matrix is symmetric (Σi,j = Σj,i) and positive semi-definite:

∀α ∈ Rⁿ, αᵀ Σ α ≥ 0.

Conditional distribution
2D multivariate Gaussian conditional distribution:

p(y1 | y2 = α) = p(y1, α) / p(α)
              = exp(quadratic in y1 and α) / const
              = exp(quadratic in y1) / const
              = Gaussian distribution!

[Figure: a 2D Gaussian density f_Y sliced at y2 = α; the slice is a Gaussian with mean µc and variance Σc]

The conditional distribution is still Gaussian!

3D Example
3D multivariate Gaussian conditional distribution:

[Figure: conditional distribution of a 3D Gaussian, shown over (x1, x2, x3)]

Conditional distribution
Let (Y1, Y2) be a Gaussian vector (Y1 and Y2 may both be vectors):

[Y1; Y2] ∼ N( [µ1; µ2], [Σ11, Σ12; Σ21, Σ22] ).

The conditional distribution of Y1 given Y2 is:

Y1 | Y2 ∼ N(µcond, Σcond)

with  µcond = E[Y1 | Y2] = µ1 + Σ12 Σ22⁻¹ (Y2 − µ2)
      Σcond = cov[Y1, Y1 | Y2] = Σ11 − Σ12 Σ22⁻¹ Σ21
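These formulas can be applied directly. A minimal NumPy sketch for a 2D Gaussian, where the joint mean, covariance and the observed value of Y2 are made-up illustrative numbers:

```python
import numpy as np

# Illustrative joint distribution of (Y1, Y2)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

y2_obs = 1.5  # observed value of Y2

# Conditional distribution Y1 | Y2 = y2_obs (1D blocks, so Sigma22 is a scalar)
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y2_obs - mu[1])
Sigma_cond = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(mu_cond, Sigma_cond)  # 1.2 0.36
```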

Gaussian processes
[Figure: sample paths Y(x) of a Gaussian process]

Definition
A random process Z over D ⊂ Rd is said to be Gaussian if

∀n ∈ N, ∀xi ∈ D, (Z (x1 ), . . . , Z (xn )) is multivariate normal.

⇒ Demo: https://github.com/awav/interactive-gp
We write Z ∼ N (m(.), k(., .)):
m : D → R is the mean function m(x) = E[Z (x)]
k : D × D → R is the covariance function (i.e. kernel):

k(x, y ) = cov(Z (x), Z (y ))

The mean m can be any function, but not the kernel:


Theorem (Loève)
k is a GP covariance
⟺
k is symmetric, k(x, y) = k(y, x), and positive semi-definite: for all n ∈ N, for all xi ∈ D, for all αi ∈ R,

∑_{i=1}^n ∑_{j=1}^n αi αj k(xi, xj) ≥ 0

Proving that a function is psd is often difficult. However there are a
lot of functions that have already been proven to be psd:
squared exp.   k(x, y) = σ² exp( −(x − y)² / (2θ²) )

Matérn 5/2     k(x, y) = σ² (1 + √5 |x − y|/θ + 5|x − y|²/(3θ²)) exp( −√5 |x − y|/θ )

Matérn 3/2     k(x, y) = σ² (1 + √3 |x − y|/θ) exp( −√3 |x − y|/θ )

exponential    k(x, y) = σ² exp( −|x − y|/θ )

Brownian       k(x, y) = σ² min(x, y)

white noise    k(x, y) = σ² δx,y

constant       k(x, y) = σ²

linear         k(x, y) = σ² x y

When k is a function of x − y, the kernel is called stationary.

σ² is called the variance and θ the lengthscale.
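As a point of reference, two of these kernels written as plain NumPy functions; this is a sketch (the parameter names variance and lengthscale mirror σ² and θ), not a library implementation:

```python
import numpy as np

def squared_exp(x, y, variance=1.0, lengthscale=1.0):
    # k(x, y) = sigma^2 exp(-(x - y)^2 / (2 theta^2))
    return variance * np.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)

def matern32(x, y, variance=1.0, lengthscale=1.0):
    # k(x, y) = sigma^2 (1 + sqrt(3)|x - y|/theta) exp(-sqrt(3)|x - y|/theta)
    r = np.sqrt(3.0) * np.abs(x - y) / lengthscale
    return variance * (1.0 + r) * np.exp(-r)

# Covariance matrix on a grid, via broadcasting
x = np.linspace(-3, 3, 5)
K = matern32(x[:, None], x[None, :], lengthscale=0.5)
print(K.shape)  # (5, 5)
```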
Examples of kernels in gpflow:

[Figure: k(x, 0.0) for GPflow kernels — Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear (k(x, 1.0)), Polynomial (k(x, 1.0)), ArcCosine]
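A hedged sketch of how panels like these can be produced with GPflow 2.x (only a subset of the kernels is shown; the plotting layout is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import gpflow

x = np.linspace(-3, 3, 200).reshape(-1, 1)
zero = np.zeros((1, 1))

kernels = {
    "Matern12": gpflow.kernels.Matern12(),
    "Matern32": gpflow.kernels.Matern32(),
    "Matern52": gpflow.kernels.Matern52(),
    "RBF": gpflow.kernels.SquaredExponential(),
}

fig, axes = plt.subplots(1, len(kernels), figsize=(12, 3))
for ax, (name, k) in zip(axes, kernels.items()):
    ax.plot(x, k(x, zero).numpy())  # evaluate k(x, 0.0)
    ax.set_title(name)
plt.show()
```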
Associated samples

[Figure: sample paths drawn from GP priors with each of the kernels above — Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear, Polynomial, ArcCosine]
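Sample paths like these can be drawn by evaluating the kernel on a fine grid and sampling from the corresponding multivariate normal. A minimal NumPy sketch (the small jitter added to the diagonal is an assumption for numerical stability):

```python
import numpy as np

def sample_gp_prior(kernel_fn, x, n_samples=3, jitter=1e-8):
    # Zero-mean GP prior: samples = L u with K = L L^T and u ~ N(0, I)
    K = kernel_fn(x[:, None], x[None, :]) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    return L @ np.random.randn(len(x), n_samples)

x = np.linspace(-3, 3, 200)
se = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
paths = sample_gp_prior(se, x)
print(paths.shape)  # (200, 3)
```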
Gaussian process regression
We assume we have observed a function f for a set of points
X = (X1 , . . . , Xn ):

[Figure: the observed values f(Xi) at the design points X]
The vector of observations is F = f(X) (i.e. Fi = f(Xi)).

Since f is unknown, we make the general assumption that it is a sample path of a Gaussian process Z ∼ N(0, k):
[Figure: prior sample paths of Z ∼ N(0, k)]
The posterior distribution Z (·)|Z (X ) = F :
Is still a Gaussian process
Can be computed analytically

It is N (m(·), c(·, ·)) with:

m(x) = E[Z(x) | Z(X) = F]
     = k(x, X) k(X, X)⁻¹ F

c(x, y) = cov[Z(x), Z(y) | Z(X) = F]
        = k(x, y) − k(x, X) k(X, X)⁻¹ k(X, y)
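These two formulas translate almost line for line into code. A minimal noise-free GPR sketch in NumPy, using a Cholesky solve rather than an explicit matrix inverse (the jitter and the toy data below are assumptions for the example):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gpr_posterior(kernel_fn, X, F, x_new, jitter=1e-8):
    # Posterior mean and covariance of a zero-mean GP conditioned on Z(X) = F
    K_XX = kernel_fn(X[:, None], X[None, :]) + jitter * np.eye(len(X))
    K_xX = kernel_fn(x_new[:, None], X[None, :])
    K_xx = kernel_fn(x_new[:, None], x_new[None, :])

    c, low = cho_factor(K_XX)
    alpha = cho_solve((c, low), F)        # k(X, X)^-1 F
    V = cho_solve((c, low), K_xX.T)       # k(X, X)^-1 k(X, x)

    mean = K_xX @ alpha                   # m(x)    = k(x, X) k(X, X)^-1 F
    cov = K_xx - K_xX @ V                 # c(x, y) = k(x, y) - k(x, X) k(X, X)^-1 k(X, y)
    return mean, cov

# Toy example with a squared exponential kernel
k = lambda a, b: np.exp(-0.5 * (a - b) ** 2 / 0.3 ** 2)
X = np.array([-0.5, 0.0, 0.7])
F = np.sin(3 * X)
x_new = np.linspace(-1, 1, 100)
m, C = gpr_posterior(k, X, F, x_new)
```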

Samples from the posterior distribution

[Figure: sample paths of Y(x) | Y(X) = F]
It can be summarized by a mean function and 95% confidence
intervals.

[Figure: posterior mean and 95% confidence intervals of Y(x) | Y(X) = F]

A few remarkable properties of GPR models
They (can) interpolate the data-points.
The prediction variance does not depend on the observations.
The mean predictor does not depend on the variance
parameter.
The mean (usually) comes back to zero when predicting far
away from the observations.
Can we prove them?
Reminder:
m(x) = k(x, X) k(X, X)⁻¹ F
c(x, y) = k(x, y) − k(x, X) k(X, X)⁻¹ k(X, y)

⇒ Demo https://durrande.shinyapps.io/gp_playground

Why do we like GP models that much?
Data efficiency: the prior encapsulates lots of information
[Figure: a handful of observations f (left) and the resulting GP posterior Y(x) | Y(X) = F (right)]

Uncertainty quantification: this is useful for
• Risk assessment
• Active learning (Talk from Javier on Day 3)

(Mathematically tractable!)

A few words on GPR Complexity

Storage footprint is O(n²): we have to store the covariance matrix, which is n × n.
Complexity is O(n³): we have to invert the covariance matrix (or compute the Cholesky factor and apply triangular solves).
Storage footprint is often the first limit to be reached.

The maximal number of observation points is between 1 000 and 10 000.

What if we have more data? ⇒ Talk from Zhenwen tomorrow

Parameter estimation

The choice of the kernel parameters has a great influence on the
model. ⇒ Demo https://durrande.shinyapps.io/gp_playground

In order to choose a prior that is suited to the data at hand, we can


search for the parameters that maximise the model likelihood.

Definition
The likelihood of a distribution with a density fX given some
observations X1 , . . . , Xp is:
L = ∏_{i=1}^{p} fX(Xi)

In the GPR context, we often have only one observation of the
vector F . The likelihood is then:
 
L(σ², θ) = f_{Z(X)}(F) = 1/( (2π)^(n/2) |k(X, X)|^(1/2) ) exp( −½ Fᵀ k(X, X)⁻¹ F ).

It is thus possible to maximise L – or log(L) – with respect to the


kernel’s parameters in order to find a well suited prior.

Why is the likelihood linked to good model predictions? They are


linked by the product rule:

f_{Z(X)}(F) = f(F1) × f(F2 | F1) × f(F3 | F1, F2) × · · · × f(Fn | F1, . . . , Fn−1)
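In practice this maximisation is done numerically. A hedged sketch with GPflow 2.x, which fits the kernel parameters (and a noise variance) by minimising the negative log marginal likelihood; the toy data and the Matern52 choice are illustrative:

```python
import numpy as np
import gpflow

# Toy data (illustrative)
X = np.random.uniform(-1, 1, (30, 1))
Y = np.sin(3 * X) + 0.1 * np.random.randn(30, 1)

model = gpflow.models.GPR(data=(X, Y), kernel=gpflow.kernels.Matern52())

# Maximise the likelihood w.r.t. sigma^2, theta (and the noise variance) with L-BFGS
opt = gpflow.optimizers.Scipy()
opt.minimize(model.training_loss, model.trainable_variables)

gpflow.utilities.print_summary(model)  # fitted parameter values
```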

Model validation

The idea is to introduce new data and to compare the model
prediction with reality

[Figure: model predictions Z(x) | Z(X) = F compared with test observations over [0, 1]]

Two (ideally three) things should be checked:


Is the mean accurate?
⇒ Mean square error (0.038 for the plot above)
Do the confidence intervals make sense?
⇒ Percentage of points in confidence intervals...
Are the predicted covariances right?
The predicted distribution can be tested by normalising the
residuals.

According to the model, Ft ∼ N(m(Xt), c(Xt, Xt)).

c(Xt, Xt)^(−1/2) (Ft − m(Xt)) should thus be independent N(0, 1) variables:

[Figure: histogram of the standardised residuals against the N(0, 1) density, and the corresponding normal Q-Q plot]
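A short sketch of this whitening step, using the Cholesky factor of the predicted covariance as the matrix square root (a standard but not unique choice):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def standardised_residuals(F_test, mean, cov, jitter=1e-8):
    # Whiten the residuals: L^-1 (F_t - m(X_t)) with c(X_t, X_t) = L L^T.
    # Under a correct model these should look like independent N(0, 1) draws.
    L = cholesky(cov + jitter * np.eye(len(cov)), lower=True)
    return solve_triangular(L, F_test - mean, lower=True)
```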

When no test set is available, another option is to consider cross-validation methods such as leave-one-out.

The steps are:


1. build a model based on all observations except one
2. compute the model error at this point
This procedure can be repeated for all the design points in order to get a vector of errors (a minimal sketch follows).
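A minimal leave-one-out loop, reusing the gpr_posterior helper and the X, F, k toy example sketched earlier:

```python
import numpy as np

loo_errors = np.zeros(len(X))
for i in range(len(X)):
    keep = np.arange(len(X)) != i                    # all observations except the i-th
    m_i, _ = gpr_posterior(k, X[keep], F[keep], X[i:i+1])
    loo_errors[i] = F[i] - m_i[0]                    # model error at the left-out point

print(np.mean(loo_errors ** 2))  # leave-one-out mean square error
```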

Model to be tested:

[Figure: GPR model Z(x) | Z(X) = F built from all the observations over [0, 1]]

Step 1:

[Figure: the leave-one-out procedure applied at one of the design points]

Step 2:

[Figure: the leave-one-out procedure applied at another design point]

Step 3:

[Figure: the leave-one-out procedure applied at another design point]
We finally obtain:

MSE = 0.24 and Q² = 0.34.

We can also look at the residual distribution, but computing their


joint distribution is not as straightforward as previously.

[Figure: histogram of the leave-one-out standardised residuals and the corresponding normal Q-Q plot]

Choosing the kernel

Changing the kernel has a huge impact on the model:

[Figure: the same data modelled with a Gaussian (squared exponential) kernel and with an exponential kernel]

This is because changing the kernel means changing the prior on f

[Figure: prior sample paths for the Gaussian kernel and for the exponential kernel]

Kernels encode the prior belief about the function to approximate...


they should be chosen accordingly!

In order to choose a kernel, one should gather all possible information about the function to approximate...
Is it stationary?
Is it differentiable, what’s its regularity?
Do we expect particular trends?
Do we expect particular patterns (periodicity, cycles,
additivity)?

It is common to try various kernels and to assess the model accuracy


(test set or leave-one-out).

Furthermore, it is often interesting to try some input remapping


such as x → log(x), x → exp(x), ...

We have seen previously:
Theorem (Loève)
k corresponds to the covariance of a GP
⟺
k is a symmetric positive semi-definite function:

∑_{i=1}^n ∑_{j=1}^n αi αj k(xi, xj) ≥ 0

for all n ∈ N, for all xi ∈ D, for all αi ∈ R.

For a few kernels, it is possible to prove they are psd directly from
the definition.
k(x, y ) = δx,y
k(x, y ) = 1

For most of them a direct proof from the definition is not possible.
The following theorem is helpful for stationary kernels:
Theorem (Bochner)
A continuous stationary function k(x, y) = k̃(|x − y|) is positive definite if and only if k̃ is the Fourier transform of a finite positive measure:

k̃(t) = ∫_R e^(−iωt) dµ(ω)

Example

We consider the following measure:

[Figure: the measure µ(ω)]

Its Fourier transform gives k̃(t) = sin(t)/t:

[Figure: the function k̃(t) = sin(t)/t]

As a consequence, k(x, y) = sin(x − y)/(x − y) is a valid covariance function.
Usual kernels

Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels:

The Gaussian is the Fourier transform of itself ⇒ it is psd.
Matérn kernels are the Fourier transforms of 1/(1 + ω²)^p ⇒ they are psd.

Unusual kernels

The inverse Fourier transform of a (symmetrised) sum of Gaussians gives (A. Wilson, ICML 2013):

[Figure: a spectral density µ(ω) and, via the Fourier transform, the corresponding kernel k̃(t)]

The obtained kernel is parametrised by its spectrum.

Unusual kernels

The sample paths have the following shape:

[Figure: sample paths of the resulting kernel]

Making new from old

Making new from old:

Kernels can be:

Summed together
  on the same space: k(x, y) = k1(x, y) + k2(x, y)
  on the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)
Multiplied together
  on the same space: k(x, y) = k1(x, y) × k2(x, y)
  on the tensor space: k(x, y) = k1(x1, y1) × k2(x2, y2)
Composed with a function
  k(x, y) = k1(f(x), f(y))

All these operations preserve positive definiteness.

How can this be useful?
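A hedged sketch of these combinations in GPflow 2.x, where kernels overload + and * and active_dims selects which input dimension each kernel acts on:

```python
import gpflow

# Sum and product on the same input space
k_sum = gpflow.kernels.Matern32() + gpflow.kernels.Linear()
k_prod = gpflow.kernels.SquaredExponential() * gpflow.kernels.Cosine()

# Sum over the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)
k_tensor = (gpflow.kernels.SquaredExponential(active_dims=[0])
            + gpflow.kernels.SquaredExponential(active_dims=[1]))

# Composition with a function can be emulated by transforming the inputs
# (e.g. passing log(x) instead of x) before building the model.
```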

Sum of kernels over the same input space

Property

k(x, y ) = k1 (x, y ) + k2 (x, y )


is a valid covariance structure.
This can be proved directly from the p.s.d. definition.

Example
[Figure: Matern12 k(x, 0.03) + Linear k(x, 0.03) = Sum k(x, 0.03)]

Sum of kernels over the same input space
Z ∼ N(0, k1 + k2) can be seen as Z = Z1 + Z2 where Z1, Z2 are independent with Z1 ∼ N(0, k1), Z2 ∼ N(0, k2):

k(x, y ) = k1 (x, y ) + k2 (x, y )

Example
[Figure: sample paths Z1(x) + Z2(x) = Z(x)]

Sum of kernels over the same space
Example: The Mauna Loa observatory dataset [GPML 2006]
This famous dataset compiles the monthly CO2 concentration in
Hawaii since 1958.
[Figure: monthly CO2 concentration at Mauna Loa since 1958]

Let’s try to predict the concentration for the next 20 years.

Sum of kernels over the same space
We first consider a squared-exponential kernel:
k(x, y) = σ² exp( −(x − y)² / θ² )

[Figure: the resulting GPR predictions]
The results are terrible!


Sum of kernels over the same space
What happens if we sum both kernels?

k(x, y) = krbf1(x, y) + krbf2(x, y)


[Figure: GPR predictions with the sum of the two squared-exponential kernels]

The model is drastically improved!


Sum of kernels over the same space
We can try the following kernel:

k(x, y) = σ0² x² y² + krbf1(x, y) + krbf2(x, y) + kper(x, y)


[Figure: the resulting GPR predictions]

Once again, the model is significantly improved.
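A hedged GPflow 2.x sketch of such a composite kernel; the lengthscale values, the Linear × Linear construction for the σ0² x² y² term and the period are assumptions for illustration, not the exact settings behind the figures:

```python
import gpflow

k_quad = gpflow.kernels.Linear() * gpflow.kernels.Linear()        # sigma_0^2 x^2 y^2 term
k_rbf1 = gpflow.kernels.SquaredExponential(lengthscales=50.0)     # long-term trend
k_rbf2 = gpflow.kernels.SquaredExponential(lengthscales=5.0)      # medium-term variations
k_per = gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential(),
                                period=1.0)                       # yearly cycle

kernel = k_quad + k_rbf1 + k_rbf2 + k_per
# model = gpflow.models.GPR(data=(X, Y), kernel=kernel)  # X: years, Y: CO2 concentration
```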


Sum of kernels over tensor space
Property

k(x, y) = k1 (x1 , y1 ) + k2 (x2 , y2 )


is a valid covariance structure.
[Figure: k1(x1, y1) + k2(x2, y2) = k(x, y), shown as surfaces over [0, 1]²]

Remark:
From a GP point of view, k is the kernel of
Z (x) = Z1 (x1 ) + Z2 (x2 )
Sum of kernels over tensor space
We can have a look at a few sample paths from Z :

[Figure: sample paths of Z(x) = Z1(x1) + Z2(x2) over [0, 1]²]

⇒ They are additive (up to a modification)

Additive kernels over the tensor space are very useful for
  approximating additive functions
  building models over high-dimensional input spaces

Sum of kernels over tensor space

Remarks
It is straightforward to show that the mean predictor is additive:

m(x) = (k1(x, X) + k2(x, X)) k(X, X)⁻¹ F
     = k1(x1, X1) k(X, X)⁻¹ F + k2(x2, X2) k(X, X)⁻¹ F
     =           m1(x1)       +           m2(x2)

⇒ The model shares the prior behaviour.

The sub-models can be interpreted as GP regression models with observation noise:

m1(x1) = E( Z1(x1) | Z1(X1) + Z2(X2) = F )

Sum of kernels over tensor space

Remark
The prediction variance has interesting features
[Figure: prediction variance over (x1, x2) with a kernel product vs. with a kernel sum]

Sum of kernels over tensor space
This property can be used to construct a design of experiments that covers the space with only a constant × d points.
[Figure: prediction variance over (x1, x2) for such a design]

Product over the same space
Property

k(x, y ) = k1 (x, y ) × k2 (x, y )


is valid covariance structure.

Example
We consider the product of a squared exponential with a cosine:

[Figure: squared exponential kernel × cosine kernel = resulting product kernel]

Product over the tensor space
Property

k(x, y) = k1 (x1 , y1 ) × k2 (x2 , y2 )


is valid covariance structure.
Example
We multiply two squared exponential kernels

[Figure: product of the two 1D squared exponential kernels, shown as surfaces over [0, 1]²]

Calculation shows we obtain the usual 2D squared exponential kernel.
Composition with a function

Property
Let k1 be a kernel over D1 × D1 and f be an arbitrary function
D → D1 , then
k(x, y ) = k1 (f (x), f (y ))
is a kernel over D × D.
Proof:

∑_i ∑_j ai aj k(xi, xj) = ∑_i ∑_j ai aj k1(f(xi), f(xj)) ≥ 0,

which is non-negative because k1 is psd (take yi = f(xi) in its definition).

Remarks:
k corresponds to the covariance of Z (x) = Z1 (f (x))
This can be seen as a (nonlinear) rescaling of the input space

Example
We consider f(x) = 1/x and a Matérn 3/2 kernel k1(x, y) = (1 + |x − y|) e^(−|x − y|).

We obtain:

[Figure: the resulting kernel (left) and sample paths (right)]

All these transformations can be combined!
Example
k(x, y ) = f (x)f (y )k1 (x, y ) is a valid kernel.

This can be illustrated with f(x) = 1/x and k1(x, y) = (1 + |x − y|) e^(−|x − y|):

[Figure: the resulting kernel (left) and sample paths (right)]

Can we automate the construction of the covariance?

Automatic statistician [Duvenaud 2013, Steinruecken 2019]


It considers a set of possible
kernel functions
kernel combinations (+, ×, change-point)
and uses a greedy approach to find the kernel that minimises

BIC = −2 log(L) + #param log(n)

The automatic statistician also generates human-readable reports!

Conclusion: GPR and kernel design in practice

The various steps for building a GPR model are:

1. Get the data (design of experiments)
   What is the overall evaluation budget?
   What is my model for?

2. Choose a kernel. Do we have any specific knowledge we can include in it?

3. Estimate the parameters
   Maximum likelihood
   Cross-validation
   Multi-start

4. Validate the model
   Test set
   Leave-one-out to check mean and confidence intervals
   Leave-k-out to check predicted covariances

Remark: it is common to iterate over steps 2, 3 and 4.
In practice, the following errors may appear:
• Error: Cholesky decomposition failed
• Error: the matrix is not positive definite
In practice, invertibility issues arise when observation points are close to one another. This is especially true if
  the kernel corresponds to very regular sample paths (squared exponential for example)
  the range (or length-scale) parameters are large

In order to avoid numerical problems during optimization, one can (see the sketch below):
  add some (very) small observation noise
  impose a maximum bound on length-scales
  impose a minimum bound on the noise variance
  avoid using the Gaussian kernel
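A short GPflow 2.x sketch of the first remedy; the noise value is an assumption, and X, Y stand for the training arrays:

```python
import gpflow

# A small observation noise keeps k(X, X) + sigma_n^2 I well conditioned
model = gpflow.models.GPR(data=(X, Y),
                          kernel=gpflow.kernels.SquaredExponential(),
                          noise_variance=1e-4)

# The equivalent "jitter" trick in plain NumPy would add e.g. 1e-8 * np.eye(n)
# to the covariance matrix before computing its Cholesky factor.
```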

Want to know more?
Applying linear operators to GPs

Effect of a linear operator

Property (Ginsbourger 2013)


Let L be a linear operator that commutes with the covariance, then
k(x, y ) = Lx (Ly (k1 (x, y ))) is a kernel.

Example
We want to approximate a function [0, 1] → R that is symmetric
with respect to 0.5. We will consider 2 linear operators:
L1 : f(x) → f(x) if x < 0.5, and f(1 − x) if x ≥ 0.5

L2 : f(x) → ( f(x) + f(1 − x) ) / 2.

Effect of a linear operator
Example
Associated sample paths are
k1 = L1 (L1 (k)) k2 = L2 (L2 (k))
[Figure: sample paths for k1 = L1(L1(k)) (left) and k2 = L2(L2(k)) (right) over [0, 1]]

The differentiability is not always respected!

Effect of a linear operator

These linear operators are projections onto a space of symmetric functions:

[Figure: f in H projected onto the subspace Hsym, giving L1 f and L2 f]

What about the optimal projection?

⇒ This can be difficult... but it raises interesting questions!

Application to sensitivity analysis

The analysis of the influence of the various variables of a
d-dimensional function f is often based on the FANOVA:
f(x) = f0 + ∑_{i=1}^d fi(xi) + ∑_{i<j} fi,j(xi, xj) + · · · + f1,...,d(x)

where ∫ fI(xI) dxi = 0 if i ∈ I.

The expressions of the fI are:

f0 = ∫ f(x) dx
fi(xi) = ∫ f(x) dx−i − f0
fi,j(xi, xj) = ∫ f(x) dx−ij − fi(xi) − fj(xj) + f0

Can we obtain a similar decomposition for a GP?

Samples with zero integrals
We are interested in building a GP such that the integral of each sample is exactly zero.

Idea: project a GP Z onto a space of functions with zero integrals, giving Z0:

[Figure: projection of Z onto the space of zero-integral functions]

It can be proved that the orthogonal projection is

Z0(x) = Z(x) − ( ∫ k(x, s) ds · ∫ Z(s) ds ) / ( ∫∫ k(s, t) ds dt )

The associated kernel is:

k0(x, y) = k(x, y) − ( ∫ k(x, s) ds · ∫ k(y, s) ds ) / ( ∫∫ k(s, t) ds dt )
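A hedged numerical sketch of k0 for a 1D kernel on [0, 1], with the integrals approximated by trapezoidal quadrature on a grid (the grid size and the example kernel are arbitrary choices):

```python
import numpy as np

def zero_integral_kernel(k, n_quad=200):
    # k0(x, y) = k(x, y) - (int k(x,s)ds)(int k(y,s)ds) / (intint k(s,t)dsdt) on [0, 1]
    s = np.linspace(0.0, 1.0, n_quad)
    denom = np.trapz(np.trapz(k(s[:, None], s[None, :]), s, axis=1), s)

    def k0(x, y):
        ix = np.trapz(k(x, s), s)   # int k(x, s) ds
        iy = np.trapz(k(y, s), s)   # int k(y, s) ds
        return k(x, y) - ix * iy / denom

    return k0

se = lambda a, b: np.exp(-0.5 * (a - b) ** 2 / 0.2 ** 2)
k0 = zero_integral_kernel(se)
print(k0(0.3, 0.7))
```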

Such 1-dimensional kernels are great when combined as ANOVA kernels:

k(x, y) = ∏_{i=1}^d ( 1 + k0(xi, yi) )
        = 1 + ∑_i k0(xi, yi) + ∑_{i<j} k0(xi, yi) k0(xj, yj) + · · · + ∏_{i=1}^d k0(xi, yi)
            (additive part)      (2nd order interactions)               (full interaction)

10d example
Let us consider the test function f : [0, 1]^10 → R with ε ∼ N(0, 1) observation noise:

x ↦ 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + ε
The steps for approximating f with GPR are:

1. Learn f on a DoE (here LHS maximin with 180 points)
2. Get the optimal values for the kernel parameters using MLE
3. Build a model based on the kernel ∏(1 + k0)

The structure of the kernel allows m to be split into sub-models:

 
m(x) = ( 1 + ∑_i k0(xi, Xi) + ∑_{i≠j} k0(xi, Xi) k0(xj, Xj) + . . . ) k(X, X)⁻¹ F
     = m0 + ∑_i mi(xi) + ∑_{i≠j} mi,j(xi, xj) + · · · + m1,...,d(x)
The univariate sub-models are:

 
[Figure: the estimated univariate sub-models — recall f(x) = 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + N(0, 1)]

Application: Periodicity detection

Periodicity detection
We will now discuss the detection of periodicity

Given a few observations, can we extract the periodic part of a signal?

[Figure: a noisy signal observed at a few points over [0, 15]]

As previously, we will build a decomposition of the process into two independent GPs:
Z = Zp + Za
where Zp is a GP in the span of the Fourier basis
B(t) = (sin(t), cos(t), . . . , sin(nt), cos(nt))ᵀ.

Property
It can be proved that the kernel of Zp and Za are

kp(x, y) = B(x)ᵀ G⁻¹ B(y)
ka(x, y) = k(x, y) − kp(x, y)

where G is the Gram matrix associated to B in the RKHS.

As previously, a decomposition of the model comes with a
decomposition of the kernel
m(x) = ( kp(x, X) + ka(x, X) ) k(X, X)⁻¹ F
     = kp(x, X) k(X, X)⁻¹ F + ka(x, X) k(X, X)⁻¹ F
       (periodic sub-model mp)    (aperiodic sub-model ma)

and we can associate a prediction variance to the sub-models:

vp(x) = kp(x, x) − kp(x, X) k(X, X)⁻¹ kp(X, x)
va(x) = ka(x, x) − ka(x, X) k(X, X)⁻¹ ka(X, x)

Example
For the observations shown previously we obtain:

[Figure: full model = periodic sub-model + aperiodic sub-model]

Can we do any better?

Initially, the kernels are parametrised by 2 variables:

k(x, y, σ², θ)

but writing k as a sum allows the parameters of the sub-kernels to be tuned independently.

Let k* be defined as

k*(x, y, σp², σa², θp, θa) = kp(x, y, σp², θp) + ka(x, y, σa², θa)

Furthermore, we include a 5th parameter in k ∗ accounting for the


period by changing the Fourier basis:

Bω(t) = (sin(ωt), cos(ωt), . . . , sin(nωt), cos(nωt))ᵀ

MLE of the 5 parameters of k ∗ gives:

[Figure: fitted model = periodic sub-model + aperiodic sub-model, after joint MLE of the five parameters]

We will now illustrate the use of these kernels for gene expression
analysis.

The 24-hour cycle of days can be observed in the oscillations of
many physiological processes of living beings.
Examples
Body temperature, jet lag, sleep, ... but also observed for plants,
micro-organisms, etc.

This phenomenon is called the circadian rhythm and the


mechanism driving this cycle is the circadian clock.

To understand how the circadian clock operates at the gene level,


biologists look at the temporal evolution of gene expression.

The aim of measuring gene expression is to quantify the activity of genes:

The mRNA concentration is measured with microarray experiments

Experiments to study the circadian clock are typically:
1. Expose the organism to a 12h light / 12h dark cycle
2. At t = 0, transfer to constant light
3. Perform a microarray experiment every 4 hours to measure gene expression

Regulators of the circadian clock are often rhythmically regulated.


⇒ identifying periodically expressed genes gives an insight on
the overall mechanism.

We can apply this method to study the circadian rhythm in organisms. We used Arabidopsis data from Edward 2006.

The dimension of the data is:


22810 genes
13 time points

Edward 2006 gives a list of the 3504 most periodically expressed


genes. The comparison with our approach gives:
21767 genes with the same label (2461 per. and 19306
non-per.)
1043 genes with different labels
Let’s look at genes with different labels:
[Figure: expression profiles of eight genes with conflicting labels — At1g60810, At4g10040, At5g41480, At3g08000, At1g06290, At5g48900, At3g03900, At2g36400 — marked as periodic for Edward or periodic for our approach]

