Gaussian Process Regression and Kernel Design
@NicolasDurrande – nicolas@prowler.io
September 2020
Introduction
The probability density function of a normal random variable is:
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

[Figure: the normal density f(x) plotted against x.]
The parameters µ and σ² correspond to the mean and variance:
\mu = E[X], \qquad \sigma^2 = E[X^2] - E[X]^2
[Figure: scatter plots of samples (Y1, Y2) from three 2-dimensional Gaussian distributions with different covariance matrices.]
The pdf of a multivariate Gaussian is:
f_Y(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right).
It is parametrised by:
the mean vector µ = E[Y]
the covariance matrix Σ = E[YY^T] − E[Y]E[Y]^T (i.e. Σ_{i,j} = cov(Y_i, Y_j))
[Figure: density surface of a 2-dimensional Gaussian over (x1, x2).]
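As a hands-on illustration, the density above can be evaluated and sampled with scipy; a minimal sketch where the mean and covariance values are arbitrary choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])            # mean vector E[Y] (arbitrary example)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])       # covariance matrix (arbitrary example)

rv = multivariate_normal(mean=mu, cov=Sigma)
print(rv.pdf([0.5, 0.5]))            # evaluate f_Y at a point
samples = rv.rvs(size=1000)          # draw samples of (Y1, Y2)
print(samples.mean(axis=0))          # empirical mean, approaches mu
print(np.cov(samples.T))             # empirical covariance, approaches Sigma
```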
Conditional distribution
2D multivariate Gaussian conditional distribution:
p(y_1 \mid y_2 = \alpha) = \frac{p(y_1, \alpha)}{p(\alpha)}
                         = \frac{\exp(\text{quadratic in } y_1 \text{ and } \alpha)}{\text{const}}
                         = \frac{\exp(\text{quadratic in } y_1)}{\text{const}}
                         = \text{a Gaussian distribution!}

[Figure: a 2-dimensional Gaussian density f_Y over (x1, x2); the conditional slice is Gaussian with mean µc and variance Σc.]
3D Example
3D multivariate Gaussian conditional distribution:
[Figure: conditional distribution of a 3-dimensional Gaussian over (x1, x2, x3).]
Conditional distribution
Let (Y1 , Y2 ) be a Gaussian vector (Y1 and Y2 may both be
vectors):
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right).
The conditional distribution of Y_1 given Y_2 = \alpha is then Gaussian, with mean \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (\alpha - \mu_2) and covariance \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.
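A minimal numpy sketch of this conditioning rule, assuming the joint mean and covariance blocks are given (the numerical values below are purely illustrative):

```python
import numpy as np

def condition(mu1, mu2, S11, S12, S22, alpha):
    """Mean and covariance of Y1 | Y2 = alpha for a jointly Gaussian (Y1, Y2)."""
    A = S12 @ np.linalg.inv(S22)              # Sigma_12 Sigma_22^{-1}
    return mu1 + A @ (alpha - mu2), S11 - A @ S12.T

# Toy example with two strongly correlated scalar components.
mu_c, S_c = condition(np.zeros(1), np.zeros(1),
                      np.array([[1.0]]), np.array([[0.9]]),
                      np.array([[1.0]]), np.array([0.5]))
```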
Gaussian processes
[Figure: sample paths Y(x) of a Gaussian process plotted against x.]
Definition
A random process Z over D ⊂ R^d is said to be Gaussian if all its finite-dimensional distributions are Gaussian: for any n and any x_1, ..., x_n ∈ D, the vector (Z(x_1), ..., Z(x_n)) is multivariate normal.
⇒ Demo: https://github.com/awav/interactive-gp
We write Z ∼ N (m(.), k(., .)):
m : D → R is the mean function m(x) = E[Z (x)]
k : D × D → R is the covariance function (i.e. kernel): k(x, y) = cov(Z(x), Z(y))
Proving that a function is psd is often difficult. However, many functions have already been proven to be psd:
squared exp. \quad k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{2\theta^2}\right)
Matérn 5/2 \quad k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{5}|x - y|}{\theta} + \frac{5|x - y|^2}{3\theta^2}\right) \exp\left(-\frac{\sqrt{5}|x - y|}{\theta}\right)
Matérn 3/2 \quad k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{3}|x - y|}{\theta}\right) \exp\left(-\frac{\sqrt{3}|x - y|}{\theta}\right)
exponential \quad k(x, y) = \sigma^2 \exp\left(-\frac{|x - y|}{\theta}\right)
Brownian \quad k(x, y) = \sigma^2 \min(x, y)
constant \quad k(x, y) = \sigma^2
linear \quad k(x, y) = \sigma^2 xy
[Figure: kernel gallery showing k(x, ·) for Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear, Polynomial and ArcCosine.]
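As a sketch, two entries of this list can be implemented in a few lines of numpy (the names k_se and k_matern32 and the default parameter values are our own choices, not a library API):

```python
import numpy as np

def k_se(x, y, sigma2=1.0, theta=0.2):
    """Squared-exponential kernel from the list above."""
    return sigma2 * np.exp(-(x - y) ** 2 / (2 * theta ** 2))

def k_matern32(x, y, sigma2=1.0, theta=0.2):
    """Matern 3/2 kernel from the list above."""
    d = np.abs(x - y)
    return sigma2 * (1 + np.sqrt(3) * d / theta) * np.exp(-np.sqrt(3) * d / theta)

# Broadcasting evaluates the kernel on all pairs of inputs at once.
x = np.linspace(-3, 3, 100)
K = k_se(x[:, None], x[None, :])   # 100 x 100 Gram matrix
```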
Associated samples
[Figure: observations of an unknown function f plotted against x.]
Since f is unknown, we make the general assumption that it is a sample path of a Gaussian process Z ∼ N (0, k):
[Figure: sample paths Y(x) drawn from the prior Gaussian process.]
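A minimal sketch of drawing prior sample paths on a grid, assuming a squared-exponential kernel (the small diagonal jitter is a standard numerical safeguard, not part of the model):

```python
import numpy as np

k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))   # squared exp. kernel

x = np.linspace(-1, 1, 200)                         # grid of input locations
K = k(x[:, None], x[None, :])                       # prior covariance k(X, X)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))   # jitter for stability
Z = L @ np.random.randn(len(x), 5)                  # five sample paths of Z ~ N(0, k)
```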
The posterior distribution Z (·)|Z (X ) = F :
Is still a Gaussian process
Can be computed analytically
Samples from the posterior distribution
[Figure: sample paths from the posterior distribution Y(x)|Y(X) = F.]
It can be summarized by a mean function and 95% confidence
intervals.
[Figure: posterior mean of Y(x)|Y(X) = F with 95% confidence intervals.]
A few remarkable properties of GPR models
They (can) interpolate the data points.
The prediction variance does not depend on the observations.
The mean predictor does not depend on the variance parameter.
The mean (usually) comes back to zero when predicting far away from the observations.
Can we prove them?
Reminder:
m(x) = k(x, X) k(X, X)^{-1} F
c(x, y) = k(x, y) − k(x, X) k(X, X)^{-1} k(X, y)
⇒ Demo https://durrande.shinyapps.io/gp_playground
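The two reminder formulas above translate directly into numpy; a sketch for the zero-mean, noise-free setting (the toy inputs X and observations F are made up for illustration):

```python
import numpy as np

def gpr(x, X, F, k):
    """Posterior mean m(x) and covariance c(x, y) of Z(.) | Z(X) = F."""
    Kxx = k(X[:, None], X[None, :])              # k(X, X)
    KxX = k(x[:, None], X[None, :])              # k(x, X)
    m = KxX @ np.linalg.solve(Kxx, F)            # k(x,X) k(X,X)^{-1} F
    c = k(x[:, None], x[None, :]) - KxX @ np.linalg.solve(Kxx, KxX.T)
    return m, c

k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))   # squared exp.
X = np.array([-0.6, -0.2, 0.3, 0.8])                      # toy inputs
F = np.array([-0.5, 0.3, 0.6, -0.1])                      # toy observations
m, c = gpr(np.linspace(-1, 1, 100), X, F, k)
```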
Why do we like GP models that much?
Data efficiency: the prior encapsulates lots of information.
[Figure: two panels comparing the posterior Y(x)|Y(X) = F with the underlying function f.]
(Mathematically tractable!)
A few words on GPR Complexity
The main computational bottleneck is the n × n matrix k(X, X): storing it takes O(n²) memory and factorizing it O(n³) time, which limits vanilla GPR to a few thousand observations.
Parameter estimation
The choice of the kernel parameters has a great influence on the
model. ⇒ Demo https://durrande.shinyapps.io/gp_playground
Definition
The likelihood of a distribution with a density f_X given some observations X_1, ..., X_p is:
L = \prod_{i=1}^{p} f_X(X_i)
In the GPR context, we often have only one observation of the
vector F . The likelihood is then:
L(\sigma^2, \theta) = f_{Z(X)}(F) = \frac{1}{(2\pi)^{n/2} |k(X, X)|^{1/2}} \exp\left(-\frac{1}{2} F^T k(X, X)^{-1} F\right).
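In practice one works with the log of this likelihood; a sketch using a Cholesky factorization and assuming a squared-exponential kernel (log |k(X, X)|^{1/2} equals the sum of the logs of the diagonal of the factor):

```python
import numpy as np

def log_lik(sigma2, theta, X, F):
    """Log of the likelihood above, via a Cholesky factor of k(X, X)."""
    K = sigma2 * np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * theta ** 2))
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, F))   # k(X,X)^{-1} F
    # log |k(X,X)|^{1/2} is the sum of the logs of the diagonal of L
    return (-0.5 * F @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))
```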
Model validation
The idea is to introduce new data and to compare the model
prediction with reality
[Figure: model predictions Z(x)|Z(X) = F on test data, together with a histogram of the standardized residuals and a normal Q-Q plot.]
When no test set is available, another option is to consider cross
validation methods such as leave-one-out.
Model to be tested:
[Figure: the model Z(x)|Z(X) = F fitted to all observations.]
Step 1:
[Figure: the model refitted without the first observation; its prediction is compared to the held-out point.]
Step 2:
[Figure: the model refitted without the second observation.]
Step 3:
[Figure: the model refitted without the third observation.]
We finally obtain:
[Figure: histogram of the leave-one-out standardized residuals and the associated normal Q-Q plot.]
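A sketch of how these leave-one-out residuals could be computed for a zero-mean GPR model (if the model is well specified, the standardized residuals should look approximately standard normal):

```python
import numpy as np

def loo_residuals(X, F, k):
    """Standardized leave-one-out residuals of a zero-mean GPR model."""
    res = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i                  # drop observation i
        Kxx = k(X[keep, None], X[None, keep])
        kx = k(X[i], X[keep])
        m = kx @ np.linalg.solve(Kxx, F[keep])         # LOO mean at X[i]
        v = k(X[i], X[i]) - kx @ np.linalg.solve(Kxx, kx)   # LOO variance
        res.append((F[i] - m) / np.sqrt(v))
    return np.array(res)   # roughly standard normal if the model fits

k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))
X = np.linspace(0, 1, 10)
r = loo_residuals(X, np.sin(6 * X), k)
```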
Choosing the kernel
Changing the kernel has a huge impact on the model.
This is because changing the kernel means changing the prior on f.
In order to choose a kernel, one should gather all possible information about the function to approximate:
Is it stationary?
Is it differentiable, what’s its regularity?
Do we expect particular trends?
Do we expect particular patterns (periodicity, cycles,
additivity)?
We have seen previously:
Theorem (Loève)
k is a valid covariance function (i.e. there exists a random process with covariance k) if and only if it is symmetric and positive semi-definite.
For a few kernels, it is possible to prove they are psd directly from
the definition.
k(x, y) = \delta_{x,y}
k(x, y) = 1
For most of them a direct proof from the definition is not possible.
The following theorem is helpful for stationary kernels:
Theorem (Bochner)
A continuous stationary function k(x, y ) = k̃(|x − y |) is positive
definite if and only if k̃ is the Fourier transform of a finite positive
measure:
\tilde{k}(t) = \int_{\mathbb{R}} e^{-i\omega t} \, d\mu(\omega)
Example
Starting from a spectral measure µ that is uniform on an interval:
[Figure: the spectral measure µ(ω).]
Its Fourier transform gives \tilde{k}(t) = \frac{\sin(t)}{t}:
[Figure: the function k̃(t).]
As a consequence, k(x, y) = \frac{\sin(x - y)}{x - y} is a valid covariance function.
Usual kernels
Unusual kernels
[Figure: a spectral measure µ(ω) and, via the Fourier transform F, the corresponding function k̃(t).]
Making new from old:
Kernels can be combined to build new ones, for example by summing or multiplying kernels, or by composing them with deterministic functions.
Sum of kernels over the same input space
Property
If k1 and k2 are kernels over D × D, then k1 + k2 is also a kernel over D × D.
Example
[Figure: a Matérn 1/2 kernel, a linear kernel, and their sum, each plotted as k(x, 0.03).]
Sum of kernels over the same input space
Z ∼ N (0, k1 + k2) can be seen as Z = Z1 + Z2, where Z1 and Z2 are independent with Z1 ∼ N (0, k1) and Z2 ∼ N (0, k2).
Example
[Figure: sample paths of Z1(x) and Z2(x), and of their sum Z(x) = Z1(x) + Z2(x).]
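A sketch of this additive construction, summing samples drawn with a Matérn 3/2 and a squared-exponential kernel (the equivalence with sampling from N(0, k1 + k2) holds in distribution):

```python
import numpy as np

def sample(K, n=1):
    """Draw n centred Gaussian samples with covariance K."""
    L = np.linalg.cholesky(K + 1e-6 * np.eye(K.shape[0]))
    return L @ np.random.randn(K.shape[0], n)

x = np.linspace(-3, 3, 200)
d = np.abs(x[:, None] - x[None, :])
K1 = (1 + np.sqrt(3) * d) * np.exp(-np.sqrt(3) * d)   # Matern 3/2
K2 = np.exp(-d ** 2 / 2)                              # squared exponential
z = sample(K1) + sample(K2)    # same distribution as sampling N(0, k1 + k2)
```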
Sum of kernels over the same space
Example: The Mauna Loa observatory dataset [GPML 2006]
This famous dataset compiles the monthly CO2 concentration in
Hawaii since 1958.
[Figure: monthly CO2 concentration (ppm) at Mauna Loa, from 1958 onwards.]
Sum of kernels over the same space
We first consider a squared-exponential kernel:
k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{\theta^2}\right)
[Figure: GPR predictions of the CO2 data with squared-exponential kernels fitted with two different parameter settings.]
Sum of kernels over the same space
What happens if we sum both kernels?
[Figure: GPR prediction of the CO2 data with the sum of the two kernels.]
Sum of kernels over the same space
We can try the following kernel:
[Figure: GPR prediction of the CO2 data with this kernel.]
Remark:
From a GP point of view, k is the kernel of
Z (x) = Z1 (x1 ) + Z2 (x2 )
Sum of kernels over tensor space
We can have a look at a few sample paths from Z :
[Figure: three sample paths of Z over the (x1, x2) unit square.]
Sum of kernels over tensor space
Remarks
It is straightforward to show that the mean predictor is additive
m(x) = (k_1(x, X) + k_2(x, X)) k(X, X)^{-1} F = \underbrace{k_1(x_1, X_1) k(X, X)^{-1} F}_{m_1(x_1)} + \underbrace{k_2(x_2, X_2) k(X, X)^{-1} F}_{m_2(x_2)}
Sum of kernels over tensor space
Remark
The prediction variance has interesting features
[Figure: prediction variance over (x1, x2) for a kernel product (left) and a kernel sum (right).]
Sum of kernels over tensor space
This property can be used to construct a design of experiment that
covers the space with only cst × d points.
[Figure: an axis-aligned design of experiments covering the (x1, x2) space.]
Product over the same space
Property
If k1 and k2 are kernels over D × D, then k1 × k2 is also a kernel over D × D.
Example
We consider the product of a squared exponential with a cosine:
[Figure: the squared-exponential kernel, the cosine kernel, and their product.]
Product over the tensor space
Property
If k1 and k2 are kernels over D1 × D1 and D2 × D2, then k(x, y) = k1(x1, y1) × k2(x2, y2) is a kernel over (D1 × D2) × (D1 × D2).
Composition with a function
Property
Let k1 be a kernel over D1 × D1 and f be an arbitrary function
D → D1 , then
k(x, y ) = k1 (f (x), f (y ))
is a kernel over D × D.
Proof:
\sum_{i} \sum_{j} a_i a_j k(x_i, x_j) = \sum_{i} \sum_{j} a_i a_j k_1(\underbrace{f(x_i)}_{y_i}, \underbrace{f(x_j)}_{y_j}) \ge 0
Remarks:
k corresponds to the covariance of Z (x) = Z1 (f (x))
This can be seen as a (nonlinear) rescaling of the input space
Example
We consider f(x) = 1/x and a Matérn 3/2 kernel k1(x, y) = (1 + |x − y|) e^{−|x−y|}.
We obtain:
[Figure: the resulting kernel and associated sample paths over [0, 1]; the samples vary faster as x approaches 0.]
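A numpy sketch of this warping construction (the grid starts at 0.05 because f(x) = 1/x diverges at 0; all values are illustrative):

```python
import numpy as np

k1 = lambda x, y: (1 + np.abs(x - y)) * np.exp(-np.abs(x - y))   # Matern 3/2
f = lambda x: 1.0 / x                                            # the warping

x = np.linspace(0.05, 1.0, 150)        # start at 0.05: f diverges at x = 0
K = k1(f(x[:, None]), f(x[None, :]))   # Gram matrix of k(x, y) = k1(f(x), f(y))
```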
All these transformations can be combined!
Example
k(x, y ) = f (x)f (y )k1 (x, y ) is a valid kernel.
This can be illustrated with f(x) = 1/x and k1(x, y) = (1 + |x − y|) e^{−|x−y|}:
[Figure: the kernel f(x)f(y) k1(x, y) and associated sample paths over [0, 1]; the amplitude grows as x approaches 0 since the variance is f(x)².]
Can we automate the construction of the covariance?
Conclusion: GPR and kernel design in practice
The various steps for building a GPR model are:
1. Get the data (design of experiments)
2. Choose the kernel
3. Estimate the parameters
4. Validate the model
Remarks
It is common to iterate over steps 2, 3 and 4.
In practice, the following errors may appear:
• Error: Cholesky decomposition failed
• Error: the matrix is not positive definite
In practice, invertibility issues arise when observation points are close together. This is especially true if:
the kernel corresponds to very regular sample paths (squared-exponential for example)
the range (or length-scale) parameters are large
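A common practical workaround, which is a numerical fix rather than part of the theory, is to add a small jitter to the diagonal before factorizing; a sketch:

```python
import numpy as np

def safe_cholesky(K, jitter=1e-8, max_tries=5):
    """Cholesky factorization with increasing diagonal jitter."""
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
        except np.linalg.LinAlgError:
            jitter *= 10                    # relax and try again
    raise np.linalg.LinAlgError("matrix is numerically not positive definite")
```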
Want to know more?
Applying linear operators to GPs
Effect of a linear operator
Example
We want to approximate a function [0, 1] → R that is symmetric
with respect to 0.5. We will consider 2 linear operators:
L_1 : f \mapsto \begin{cases} f(x) & x < 0.5 \\ f(1 - x) & x \ge 0.5 \end{cases}
L_2 : f \mapsto \frac{f(x) + f(1 - x)}{2}.
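For L2 the induced kernel follows from bilinearity of the covariance; a sketch (the name k_sym is our own choice):

```python
import numpy as np

def k_sym(x, y, k):
    """Kernel of L2(Z): covariance of (Z(x) + Z(1-x))/2 and (Z(y) + Z(1-y))/2."""
    return 0.25 * (k(x, y) + k(x, 1 - y) + k(1 - x, y) + k(1 - x, 1 - y))

k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))   # base kernel
x = np.linspace(0, 1, 100)
K2 = k_sym(x[:, None], x[None, :], k)   # samples of N(0, K2) are symmetric
```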
Effect of a linear operator
Example
Associated sample paths are
[Figure: sample paths Y(x) associated with k1 = L1(L1(k)) (left) and k2 = L2(L2(k)) (right); both are symmetric with respect to x = 0.5.]
Effect of a linear operator
[Figure: schematic of the function space H and its symmetric subspace Hsym, with a function f and its images L1 f and L2 f.]
Application to sensitivity analysis
The analysis of the influence of the various variables of a d-dimensional function f is often based on the FANOVA decomposition:
f(x) = f_0 + \sum_{i=1}^{d} f_i(x_i) + \sum_{i<j} f_{i,j}(x_i, x_j) + \dots + f_{1,\dots,d}(x)
where \int f_I(x_I) \, dx_i = 0 whenever i \in I.
The expressions of the f_I are:
f_0 = \int f(x) \, dx
f_i(x_i) = \int f(x) \, dx_{-i} - f_0
f_{i,j}(x_i, x_j) = \int f(x) \, dx_{-ij} - f_i(x_i) - f_j(x_j) + f_0
Samples with zero integrals
We are interested in building a GP such that the integrals of the samples are exactly zero.
Idea: project a GP Z onto the space of functions with zero integrals, which gives a process Z0.
[Figure: schematic of the projection of Z onto Z0.]
The associated kernel is:
k_0(x, y) = k(x, y) - \frac{\int k(x, s) \, ds \int k(y, s) \, ds}{\iint k(s, t) \, ds \, dt}
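A numerical sketch of k0 on [0, 1], approximating the integrals by trapezoidal quadrature (the base kernel and grid sizes are illustrative):

```python
import numpy as np

k = lambda x, y: np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))   # base kernel on [0, 1]

def k0(x, y, n_quad=200):
    """Zero-integral kernel, with the integrals approximated by quadrature."""
    s = np.linspace(0, 1, n_quad)
    Ix = np.trapz(k(x[:, None], s[None, :]), s, axis=1)   # int k(x, s) ds
    Iy = np.trapz(k(y[:, None], s[None, :]), s, axis=1)   # int k(y, s) ds
    Z = np.trapz(np.trapz(k(s[:, None], s[None, :]), s, axis=1), s)
    return k(x[:, None], y[None, :]) - np.outer(Ix, Iy) / Z

K0 = k0(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
```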
10d example
Let us consider the test function f : [0, 1]^{10} → R, observed with ε ∼ N (0, 1) noise:
x \mapsto 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon
The steps for approximating f with GPR follow the procedure described above.
Recall that we had f(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, 1).
Application: Periodicity detection
Periodicity detection
We will now discuss the detection of periodicity.
[Figure: noisy observations of a signal on t ∈ [0, 15] whose periodic component we want to detect.]
As previously, we will build a decomposition of the process into two independent GPs:
Z = Z_p + Z_a
where Z_p is a GP in the span of the Fourier basis
B(t) = (\sin(t), \cos(t), \dots, \sin(nt), \cos(nt))^t.
Property
It can be proved that the kernels of Z_p and Z_a can be written in closed form in terms of k and B.
As previously, a decomposition of the model comes with a
decomposition of the kernel
m(x) = (k_p(x, X) + k_a(x, X)) k(X, X)^{-1} F = \underbrace{k_p(x, X) k(X, X)^{-1} F}_{\text{periodic sub-model } m_p} + \underbrace{k_a(x, X) k(X, X)^{-1} F}_{\text{aperiodic sub-model } m_a}
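A sketch of this decomposition, assuming the kernel functions kp and ka are available as callables (their closed-form expressions are not reproduced here):

```python
import numpy as np

def sub_models(x, X, F, kp, ka):
    """Split the GPR mean predictor into periodic and aperiodic parts."""
    K = kp(X[:, None], X[None, :]) + ka(X[:, None], X[None, :])   # k(X, X)
    w = np.linalg.solve(K, F)                 # k(X, X)^{-1} F
    mp = kp(x[:, None], X[None, :]) @ w       # periodic sub-model m_p
    ma = ka(x[:, None], X[None, :]) @ w       # aperiodic sub-model m_a
    return mp, ma
```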
Example
For the observations shown previously we obtain:
[Figure: the full model shown as the sum of its periodic and aperiodic sub-models.]
Initially, the kernels are parametrised by 2 variables:
k(x, y; \sigma^2, \theta)
MLE of the 5 parameters of k ∗ gives:
[Figure: the fitted model and its periodic and aperiodic components after maximum-likelihood estimation.]
We will now illustrate the use of these kernels for gene expression analysis.
The 24 hour cycle of days can be observed in the oscillations of
many physiological processes of living beings.
Examples
Body temperature, jet lag, sleep, ... but also observed for plants,
micro-organisms, etc.
The aim of gene expression analysis is to measure the activity of genes:
The mRNA concentration is measured with microarray experiments.
Experiments to study the circadian clock typically proceed as follows:
1. Expose the organism to a 12h light / 12h dark cycle
2. At t = 0, transfer it to constant light
3. Perform a microarray experiment every 4 hours to measure gene expression
We can apply this method to study the circadian rhythm in organisms. We used Arabidopsis data from Edward 2006.