
Gaussian Process Regression

4F10 Pattern Recognition, 2010

Carl Edward Rasmussen

Department of Engineering, University of Cambridge

November 11th - 16th, 2010

Rasmussen (Engineering, Cambridge) Gaussian Process Regression November 11th - 16th, 2010 1 / 40
The Prediction Problem
[Figure: CO2 concentration (ppm) plotted against year, 1960-2020; a "?" marks the future values to be predicted.]
The Gaussian Distribution

The Gaussian distribution is given by

p(x|μ, Σ) = N(μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

where μ is the mean vector and Σ the covariance matrix.

Conditionals and Marginals of a Gaussian

[Figure: contours of a two-dimensional joint Gaussian, shown twice; the left panel illustrates taking a conditional, the right panel a marginal.]

Both the conditionals and the marginals of a joint Gaussian are again Gaussian.

Conditionals and Marginals of a Gaussian
In algebra, if x and y are jointly Gaussian

p(x, y) = N( [a; b], [A B; Bᵀ C] ),

the marginal distribution of x is

p(x, y) = N( [a; b], [A B; Bᵀ C] )  ⟹  p(x) = N(a, A),

and the conditional distribution of x given y is

p(x, y) = N( [a; b], [A B; Bᵀ C] )  ⟹  p(x|y) = N(a + BC⁻¹(y − b), A − BC⁻¹Bᵀ),

where x and y can be scalars or vectors.
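The factorization p(x, y) = p(x|y) p(y) gives a direct numerical check of the conditional formula. A NumPy sketch (the numbers are arbitrary; the slides themselves use Octave):

```python
import numpy as np

def gauss_pdf(z, mu, S):
    # density of N(mu, S) at z, for small dimensions
    d = len(mu)
    r = z - mu
    return np.exp(-0.5 * r @ np.linalg.solve(S, r)) / np.sqrt((2*np.pi)**d * np.linalg.det(S))

# a joint Gaussian over scalar x and y, partitioned as on the slide
a, b = 1.0, -2.0
A, B, C = 2.0, 0.6, 1.5
mu = np.array([a, b])
S = np.array([[A, B], [B, C]])

x, y = 0.7, 0.3
m_c = a + B / C * (y - b)          # conditional mean  a + B C^{-1} (y - b)
v_c = A - B / C * B                # conditional variance  A - B C^{-1} B^T

joint = gauss_pdf(np.array([x, y]), mu, S)
factored = (gauss_pdf(np.array([x]), np.array([m_c]), np.array([[v_c]]))
            * gauss_pdf(np.array([y]), np.array([b]), np.array([[C]])))
print(np.isclose(joint, factored))  # True: p(x, y) = p(x|y) p(y)
```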

What is a Gaussian Process?

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.

Informally: infinitely long vector ≃ function

Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

A Gaussian distribution is fully specified by a mean vector μ and covariance matrix Σ:

f = (f₁, ..., fₙ)ᵀ ∼ N(μ, Σ), indexes i = 1, ..., n

A Gaussian process is fully specified by a mean function m(x) and covariance function k(x, x′):

f(x) ∼ GP( m(x), k(x, x′) ), indexes: x
The marginalization property

Thinking of a GP as a Gaussian distribution with an infinitely long mean vector and an infinite-by-infinite covariance matrix may seem impractical...

...luckily we are saved by the marginalization property:

Recall:

p(x) = ∫ p(x, y) dy.

For Gaussians:

p(x, y) = N( [a; b], [A B; Bᵀ C] )  ⟹  p(x) = N(a, A)

Random functions from a Gaussian Process

Example one-dimensional Gaussian process:

f(x) ∼ GP( m(x) = 0, k(x, x′) = exp(−½ (x − x′)²) ).

To get an indication of what this distribution over functions looks like, focus on a finite subset of function values f = (f(x₁), f(x₂), ..., f(xₙ))ᵀ, for which

f ∼ N(0, Σ), where Σᵢⱼ = k(xᵢ, xⱼ).

Then plot the coordinates of f as a function of the corresponding x values.
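This recipe takes only a few lines; a NumPy sketch (the grid and random seed are arbitrary choices):

```python
import numpy as np

def k_se(a, b):
    # squared exponential covariance from the slide, unit length scale
    return np.exp(-0.5 * (np.asarray(a)[:, None] - np.asarray(b)[None, :])**2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 101)
K = k_se(x, x) + 1e-9 * np.eye(x.size)                   # jitter for numerical stability
f = np.linalg.cholesky(K) @ rng.standard_normal(x.size)  # one random function
print(f.shape)  # (101,)
```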

Some values of the random function
[Figure: the sampled function values f(x) plotted against input x ∈ [−5, 5]; outputs roughly within ±1.5.]
Joint Generation

To generate a random sample from a D-dimensional joint Gaussian with covariance matrix K and mean vector m (in Octave or Matlab):

z = randn(D,1);
y = chol(K)'*z + m;

where chol returns the Cholesky factor R such that RᵀR = K.

Thus, the covariance of y is:

E[(y − m)(y − m)ᵀ] = E[Rᵀ z zᵀ R] = Rᵀ E[z zᵀ] R = Rᵀ I R = K.
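The same recipe in NumPy (note that numpy.linalg.cholesky returns the lower-triangular factor L with L Lᵀ = K, so no transpose is needed; the covariance K below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 3
A = rng.standard_normal((D, D))
K = A @ A.T + D * np.eye(D)          # an arbitrary symmetric positive definite covariance
m = np.zeros(D)

L = np.linalg.cholesky(K)            # lower triangular, L @ L.T == K
y = L @ rng.standard_normal(D) + m   # one sample from N(m, K)

print(np.allclose(L @ L.T, K))  # True: the factor reproduces K

# the empirical covariance of many samples approaches K
Y = L @ rng.standard_normal((D, 100000))
print(np.allclose(np.cov(Y), K, atol=0.2))
```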

Sequential Generation

Factorize the joint distribution

p(f₁, ..., fₙ | x₁, ..., xₙ) = ∏ᵢ₌₁ⁿ p(fᵢ | fᵢ₋₁, ..., f₁, xᵢ, ..., x₁),

and generate function values sequentially.

What do the individual terms look like? For Gaussians:

p(x, y) = N( [a; b], [A B; Bᵀ C] )  ⟹  p(x|y) = N(a + BC⁻¹(y − b), A − BC⁻¹Bᵀ)

Do try this at home!
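A NumPy sketch of doing exactly that, generating one point at a time from the conditional (the SE covariance and grid are arbitrary choices):

```python
import numpy as np

def k_se(a, b):
    return np.exp(-0.5 * (np.asarray(a)[:, None] - np.asarray(b)[None, :])**2)

rng = np.random.default_rng(2)
xs = np.linspace(-5, 5, 40)
f = []
for i, x in enumerate(xs):
    if i == 0:
        mean, var = 0.0, 1.0                           # the unconditional prior
    else:
        kx = k_se([x], xs[:i])[0]                      # B in the slide's formula
        Ci = k_se(xs[:i], xs[:i]) + 1e-9 * np.eye(i)   # C
        w = np.linalg.solve(Ci, kx)
        mean = w @ np.array(f)                         # B C^{-1} (f_{1:i-1} - 0)
        var = 1.0 - w @ kx                             # A - B C^{-1} B^T
    f.append(mean + np.sqrt(max(var, 0.0)) * rng.standard_normal())
f = np.array(f)
print(f.shape)  # (40,)
```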

Function drawn at random from a Gaussian Process with Gaussian covariance

[Figure: a surface drawn at random from a two-dimensional Gaussian process with Gaussian covariance, over inputs roughly in [−6, 6] × [−6, 6].]
The Linear Model

In the linear model, the output is assumed to be a linear function of the input:

y = xᵀw + ε.

The weights w can be fit by least squares:

ŵ = argmin_w Σᵢ (y⁽ⁱ⁾ − x⁽ⁱ⁾ᵀw)².

Equivalently, if the noise is Gaussian, ε ∼ N(0, σₙ²), we maximize the likelihood:

p(y|X, w) = ∏ᵢ p(y⁽ⁱ⁾ | x⁽ⁱ⁾, w) = ∏ᵢ (2πσₙ²)^(−1/2) exp(−(y⁽ⁱ⁾ − x⁽ⁱ⁾ᵀw)² / (2σₙ²)),

where x⁽ⁱ⁾ is the i-th column of the matrix X. The solution is

ŵ = (XXᵀ)⁻¹Xy.
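With inputs stored as the columns of X, the closed form matches an off-the-shelf least-squares solver; a NumPy check (the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
D, n = 2, 50
X = rng.standard_normal((D, n))          # inputs are the COLUMNS of X
w_true = np.array([1.5, -0.7])
y = X.T @ w_true + 0.1 * rng.standard_normal(n)

w = np.linalg.solve(X @ X.T, X @ y)      # w = (X X^T)^{-1} X y
print(np.allclose(w, np.linalg.lstsq(X.T, y, rcond=None)[0]))  # True
```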

Basis functions, or Linear-in-the-Parameters Models

The linear model is of little use if the data doesn't follow a linear relationship. But it can be generalized using basis functions:

y = φ(x)ᵀw + ε,

with a similar maximum likelihood solution:

ŵ = (Φ(X)Φ(X)ᵀ)⁻¹ Φ(X) y.

Common examples include polynomials and Gaussian basis functions.

Practical problems include overfitting and selection of basis functions.
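A NumPy sketch with Gaussian basis functions (the centres, widths, and toy data are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 60)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

centres = np.linspace(-3, 3, 9)
Phi = np.exp(-0.5 * (x[:, None] - centres[None, :])**2)  # n x m design matrix
w = np.linalg.lstsq(Phi, y, rcond=None)[0]               # maximum likelihood weights
resid = y - Phi @ w
print(resid.std() < 0.2)  # the basis expansion fits sin(x) down to the noise level
```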

Maximum likelihood, parametric model
Supervised parametric learning:

data: x, y
model: y = f_w(x) + ε

Gaussian likelihood:

p(y|x, w, Mᵢ) ∝ ∏_c exp(−½ (y_c − f_w(x_c))² / σ²_noise).

Maximize the likelihood:

w_ML = argmax_w p(y|x, w, Mᵢ).

Make predictions by plugging in the ML estimate:

p(y*|x*, w_ML, Mᵢ)
Bayesian Inference, parametric model
Supervised parametric learning:

data: x, y
model: y = f_w(x) + ε

Gaussian likelihood:

p(y|x, w, Mᵢ) ∝ ∏_c exp(−½ (y_c − f_w(x_c))² / σ²_noise).

Parameter prior:

p(w|Mᵢ)

Posterior parameter distribution by Bayes' rule, p(a|b) = p(b|a)p(a)/p(b):

p(w|x, y, Mᵢ) = p(w|Mᵢ) p(y|x, w, Mᵢ) / p(y|x, Mᵢ)

Bayesian Inference, parametric model, cont.

Making predictions:

p(y*|x*, x, y, Mᵢ) = ∫ p(y*|w, x*, Mᵢ) p(w|x, y, Mᵢ) dw

Marginal likelihood:

p(y|x, Mᵢ) = ∫ p(w|Mᵢ) p(y|x, w, Mᵢ) dw.

Model probability:

p(Mᵢ|x, y) = p(Mᵢ) p(y|x, Mᵢ) / p(y|x)

Problem: integrals are intractable for most interesting models!

Non-parametric Gaussian process models
In our non-parametric model, the "parameters" are the function itself!

Gaussian likelihood:

y | x, f(x), Mᵢ ∼ N(f, σ²_noise I)

(Zero mean) Gaussian process prior:

f(x) | Mᵢ ∼ GP( m(x) ≡ 0, k(x, x′) )

leads to a Gaussian process posterior

f(x) | x, y, Mᵢ ∼ GP( m_post(x) = k(x, x)[K(x, x) + σ²_noise I]⁻¹ y,
                      k_post(x, x′) = k(x, x′) − k(x, x)[K(x, x) + σ²_noise I]⁻¹ k(x, x′) ),

and a Gaussian predictive distribution:

y* | x*, x, y, Mᵢ ∼ N( k(x*, x)ᵀ[K + σ²_noise I]⁻¹ y,
                       k(x*, x*) + σ²_noise − k(x*, x)ᵀ[K + σ²_noise I]⁻¹ k(x*, x) )
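The predictive equations take only a few lines; a NumPy sketch on toy data (SE kernel and noise level are arbitrary choices):

```python
import numpy as np

def k_se(a, b):
    return np.exp(-0.5 * (np.asarray(a)[:, None] - np.asarray(b)[None, :])**2)

rng = np.random.default_rng(5)
x = np.linspace(-4, 4, 20)                    # training inputs
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
s2 = 0.01                                     # noise variance sigma_noise^2

xs = np.array([0.5])                          # a test input
K = k_se(x, x) + s2 * np.eye(x.size)          # K + sigma_noise^2 I
ks = k_se(x, xs)[:, 0]                        # k(x*, x)

mean = ks @ np.linalg.solve(K, y)                            # predictive mean
var = k_se(xs, xs)[0, 0] + s2 - ks @ np.linalg.solve(K, ks)  # predictive variance
print(f"mean {mean:.2f}, variance {var:.4f}")
```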


Prior and Posterior

[Figure: functions drawn from the GP prior (left) and from the posterior conditioned on observations (right); input x ∈ [−5, 5], outputs roughly within ±2.]

Predictive distribution:

p(y*|x*, x, y) = N( k(x*, x)ᵀ[K + σ²_noise I]⁻¹ y,
                    k(x*, x*) + σ²_noise − k(x*, x)ᵀ[K + σ²_noise I]⁻¹ k(x*, x) )
Factor Graph for Gaussian Process

[Figure: factor graph with a plate over the training set (latent f(xᵢ) with observed y(xᵢ)), a plate over the test set (latent f(xⱼ) with prediction y(xⱼ)), and further latent variables f(xₗ) for other indices, all joined by a common factor.]

A Factor Graph is a graphical representation of a multivariate distribution. Nodes are random variables, black boxes are factors. The factors induce dependencies between the variables to which they have edges. Open nodes are stochastic (free) and shaded nodes are observed (clamped). Plates indicate repetitions.

The predictive distribution for test case y(xⱼ) depends only on the corresponding latent variable f(xⱼ).

Adding other variables (without observations) doesn't change the distributions.

This explains why we can make inference using a finite amount of computation!
Some interpretation
Recall our main result:

f* | X*, X, y ∼ N( K(X*, X)[K(X, X) + σₙ²I]⁻¹ y,
                   K(X*, X*) − K(X*, X)[K(X, X) + σₙ²I]⁻¹ K(X, X*) ).

The mean is linear in two ways:

μ(x*) = k(x*, X)[K(X, X) + σₙ²I]⁻¹ y = Σ_{c=1}^n β_c y⁽ᶜ⁾ = Σ_{c=1}^n α_c k(x*, x⁽ᶜ⁾).

The last form is most commonly encountered in the kernel literature.

The variance is the difference between two terms:

V(x*) = k(x*, x*) − k(x*, X)[K(X, X) + σₙ²I]⁻¹ k(X, x*),

the first term being the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.
The marginal likelihood
Log marginal likelihood:

log p(y|x, Mᵢ) = −½ yᵀK⁻¹y − ½ log |K| − (n/2) log(2π)

is the combination of a data fit term and a complexity penalty. Occam's Razor is automatic.

Learning in Gaussian process models involves finding

the form of the covariance function, and

any unknown (hyper-) parameters θ.

This can be done by optimizing the marginal likelihood:

∂ log p(y|x, θ, Mᵢ)/∂θⱼ = ½ yᵀK⁻¹ (∂K/∂θⱼ) K⁻¹y − ½ trace(K⁻¹ ∂K/∂θⱼ)
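The gradient expression can be checked against finite differences; a NumPy sketch for the SE kernel with noise, differentiating with respect to the length scale ℓ (the data and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(-3, 3, 15)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
s2, n = 0.01, x.size

def K_and_dK(ell):
    d2 = (x[:, None] - x[None, :])**2
    Kf = np.exp(-0.5 * d2 / ell**2)
    return Kf + s2 * np.eye(n), Kf * d2 / ell**3   # K and dK/d ell

def logml(ell):
    K, _ = K_and_dK(ell)
    a = np.linalg.solve(K, y)
    return -0.5 * y @ a - 0.5 * np.linalg.slogdet(K)[1] - 0.5 * n * np.log(2 * np.pi)

ell = 1.3
K, dK = K_and_dK(ell)
Ki_y = np.linalg.solve(K, y)
grad = 0.5 * Ki_y @ dK @ Ki_y - 0.5 * np.trace(np.linalg.solve(K, dK))

eps = 1e-6
fd = (logml(ell + eps) - logml(ell - eps)) / (2 * eps)
print(np.isclose(grad, fd, atol=1e-4))  # True: analytic gradient matches finite differences
```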

Example: Fitting the length scale parameter
Parameterized covariance function: k(x, x′) = v² exp(−(x − x′)²/(2ℓ²)) + σₙ² δ_{xx′}.

[Figure: observations with the mean posterior predictive function for three length scales (too short, good, too long), x ∈ [−10, 10].]

The mean posterior predictive function is plotted for 3 different length scales (the green curve corresponds to optimizing the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this!
Why, in principle, does Bayesian Inference work?
Occam's Razor

[Figure: evidence P(Y|Mᵢ) plotted over all possible data sets Y; a "too simple" model puts mass on few data sets, a "too complex" model spreads its mass thinly, and a "just right" model assigns the most mass to data sets like the observed one.]
An illustrative analogous example
Imagine the simple task of fitting the variance, σ², of a zero-mean Gaussian to a set of n scalar observations.

The log likelihood is

log p(y|μ, σ²) = −½ yᵀIy/σ² − ½ log |Iσ²| − (n/2) log(2π)
From random functions to covariance functions

Consider the class of linear functions:

f(x) = ax + b, where a ∼ N(0, α), and b ∼ N(0, β).

We can compute the mean function:

μ(x) = E[f(x)] = ∬ f(x) p(a) p(b) da db = ∫ ax p(a) da + ∫ b p(b) db = 0,

and covariance function:

k(x, x′) = E[(f(x) − 0)(f(x′) − 0)] = ∬ (ax + b)(ax′ + b) p(a) p(b) da db

= ∫ a²xx′ p(a) da + ∫ b² p(b) db + (x + x′) ∬ ab p(a) p(b) da db = αxx′ + β.
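The result is easy to confirm by simulation; a NumPy sketch drawing many random lines (α, β and the inputs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta = 2.0, 0.5
a = np.sqrt(alpha) * rng.standard_normal(200000)   # a ~ N(0, alpha)
b = np.sqrt(beta) * rng.standard_normal(200000)    # b ~ N(0, beta)

x1, x2 = 0.7, -1.3
emp = np.mean((a * x1 + b) * (a * x2 + b))         # Monte Carlo E[f(x1) f(x2)]
theory = alpha * x1 * x2 + beta
print(np.isclose(emp, theory, atol=0.05))  # True
```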

From random functions to covariance functions II
Consider the class of functions (sums of squared exponentials):

f(x) = lim_{n→∞} (1/n) Σᵢ γᵢ exp(−(x − i/n)²), where γᵢ ∼ N(0, 1), ∀i

     = ∫ γ(u) exp(−(x − u)²) du, where γ(u) ∼ N(0, 1), ∀u.

The mean function is:

μ(x) = E[f(x)] = ∫ exp(−(x − u)²) ∫ γ p(γ) dγ du = 0,

and the covariance function:

E[f(x)f(x′)] = ∫ exp(−(x − u)² − (x′ − u)²) du

= ∫ exp(−2(u − (x + x′)/2)² + (x + x′)²/2 − x² − x′²) du ∝ exp(−(x − x′)²/2).

Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian-shaped basis functions placed everywhere, not just at your training points!
Using finitely many basis functions may be dangerous!

[Figure: predictions from a model with finitely many basis functions, x ∈ [−10, 10]; the error bars collapse away from the basis function centres.]

Model Selection in Practice; Hyperparameters
There are two types of task: choosing the form and the parameters of the covariance function.

Typically, our prior is too weak to quantify aspects of the covariance function. We use a hierarchical model with hyperparameters. E.g., in ARD (Automatic Relevance Determination):

k(x, x′) = v₀² exp(− Σ_{d=1}^D (x_d − x′_d)²/(2v_d²)), hyperparameters θ = (v₀, v₁, ..., v_D, σₙ²).

[Figure: random functions of (x₁, x₂) for v₁ = v₂ = 1 (left), v₁ = v₂ = 0.32 (middle), and v₁ = 0.32, v₂ = 1 (right).]
Rational quadratic covariance function

The rational quadratic (RQ) covariance function:

k_RQ(r) = (1 + r²/(2αℓ²))^(−α)

with α, ℓ > 0 can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales.

Using τ = ℓ⁻² and p(τ|α, β) ∝ τ^(α−1) exp(−ατ/β):

k_RQ(r) = ∫ p(τ|α, β) k_SE(r|τ) dτ

∝ ∫ τ^(α−1) exp(−ατ/β) exp(−τr²/2) dτ ∝ (1 + r²/(2αℓ²))^(−α).
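The limiting behaviour is easy to see numerically; a NumPy sketch comparing the RQ form to the SE for small and large α:

```python
import numpy as np

r = np.linspace(0, 3, 50)
ell = 1.0
se = np.exp(-0.5 * r**2 / ell**2)
rq = lambda alpha: (1 + r**2 / (2 * alpha * ell**2))**(-alpha)

print(np.max(np.abs(rq(2.0) - se)) > 0.01)   # True: visibly different for small alpha
print(np.max(np.abs(rq(1e6) - se)) < 1e-4)   # True: essentially the SE in the limit
```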

Rational quadratic covariance function II

[Figure: left, the RQ covariance for α = 1/2 and α = 2 against input distance; right, corresponding random functions f(x) for x ∈ [−5, 5].]

The α → ∞ limit of the RQ covariance function is the SE.

Matérn covariance functions
Stationary covariance functions can be based on the Matérn form:

k(x, x′) = (1/(Γ(ν)2^(ν−1))) [ (√(2ν)/ℓ) |x − x′| ]^ν K_ν( (√(2ν)/ℓ) |x − x′| ),

where K_ν is the modified Bessel function of the second kind of order ν, and ℓ is the characteristic length scale.

Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the hyperparameter ν can control the degree of smoothness.

Special cases:

k_{ν=1/2}(r) = exp(−r/ℓ): Laplacian covariance function, Brownian motion (Ornstein-Uhlenbeck)

k_{ν=3/2}(r) = (1 + √3 r/ℓ) exp(−√3 r/ℓ) (once differentiable)

k_{ν=5/2}(r) = (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ) (twice differentiable)

k_{ν→∞}(r) = exp(−r²/(2ℓ²)): smooth (infinitely differentiable)
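The three closed-form special cases can be written down directly; a NumPy sketch (unit length scale):

```python
import numpy as np

def matern12(r):
    return np.exp(-r)                                       # Ornstein-Uhlenbeck
def matern32(r):
    return (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)   # once differentiable
def matern52(r):
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

r = np.linspace(0, 3, 100)
for k in (matern12, matern32, matern52):
    v = k(r)
    print(np.isclose(v[0], 1.0), np.all(np.diff(v) < 0))  # unit variance, monotone decay
```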

Matérn covariance functions II

Univariate Matérn covariance function with unit characteristic length scale and
unit variance:

[Figure: left, Matérn covariances for ν = 1/2, 1, 2 against input distance; right, corresponding sample functions f(x) for x ∈ [−5, 5].]

Periodic, smooth functions
To create a distribution over periodic functions of x, we can first map the inputs to u = (sin(x), cos(x))ᵀ, and then measure distances in the u space. Combined with the SE covariance function with characteristic length scale ℓ, we get:

k_periodic(x, x′) = exp(−2 sin²(π(x − x′))/ℓ²)

[Figure: three periodic functions drawn at random, x ∈ [−2, 2]; left ℓ > 1, and right ℓ < 1.]
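A quick check that this construction really is periodic; a NumPy sketch (writing the covariance with sin²(π(x − x′)), which makes the period 1 in x − x′):

```python
import numpy as np

def k_periodic(x, xp, ell=1.0):
    return np.exp(-2 * np.sin(np.pi * (x - xp))**2 / ell**2)

d = 0.3
print(np.isclose(k_periodic(0.0, d), k_periodic(0.0, d + 1.0)))  # True: period 1
print(np.isclose(k_periodic(0.0, 0.0), 1.0))                     # True: unit variance
```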


The Prediction Problem
[Figure: CO2 concentration (ppm) plotted against year, 1960-2020; a "?" marks the future values to be predicted.]
Covariance Function
The covariance function consists of several terms, parameterized by a total of 11 hyperparameters:

long-term smooth trend (squared exponential)
k₁(x, x′) = θ₁² exp(−(x − x′)²/(2θ₂²)),

seasonal trend (quasi-periodic smooth)
k₂(x, x′) = θ₃² exp(−2 sin²(π(x − x′))/θ₅²) exp(−½(x − x′)²/θ₄²),

short- and medium-term anomaly (rational quadratic)
k₃(x, x′) = θ₆² (1 + (x − x′)²/(2θ₈θ₇²))^(−θ₈),

noise (independent Gaussian, and dependent)
k₄(x, x′) = θ₉² exp(−(x − x′)²/(2θ₁₀²)) + θ₁₁² δ_{xx′}.

k(x, x′) = k₁(x, x′) + k₂(x, x′) + k₃(x, x′) + k₄(x, x′)
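The full covariance can be assembled directly from the four terms; a NumPy sketch (the θ values below are placeholders, not the fitted hyperparameters from the slides):

```python
import numpy as np

th = np.ones(11)   # placeholder hyperparameters theta_1 .. theta_11

def k_co2(x, xp, th=th):
    d = x - xp
    k1 = th[0]**2 * np.exp(-d**2 / (2 * th[1]**2))
    k2 = th[2]**2 * np.exp(-2 * np.sin(np.pi * d)**2 / th[4]**2
                           - d**2 / (2 * th[3]**2))
    k3 = th[5]**2 * (1 + d**2 / (2 * th[7] * th[6]**2))**(-th[7])
    k4 = th[8]**2 * np.exp(-d**2 / (2 * th[9]**2)) + th[10]**2 * (d == 0)
    return k1 + k2 + k3 + k4

x = np.linspace(0, 5, 30)
K = k_co2(x[:, None], x[None, :])
eig = np.linalg.eigvalsh((K + K.T) / 2)
print(np.allclose(K, K.T), eig.min() > -1e-8)  # symmetric and positive semi-definite
```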

Let's try this with the gpml software (http://www.gaussianprocess.org/gpml).


Long- and medium-term mean predictions

[Figure: long-term (left) and medium-term (right) mean predictions of CO2 concentration (ppm), 1960-2020.]

Mean Seasonal Component
[Figure: contour plot of the mean seasonal component of CO2 (ppm) against month (J-D) and year (1960-2020).]

Seasonal component: magnitude θ₃ = 2.4 ppm, decay-time θ₄ = 90 years.

Dependent noise: magnitude θ₉ = 0.18 ppm, decay θ₁₀ = 1.6 months.

Independent noise: magnitude θ₁₁ = 0.19 ppm.

Optimize or integrate out? See MacKay [?].

Conclusions

Gaussian processes are an intuitive, powerful and practical approach to inference, learning and prediction.

Bayesian inference is tractable, neatly addressing model complexity issues.

Predictions contain sensible error-bars, reflecting their confidence.

Many other models are (crippled versions of) GPs: Relevance Vector Machines (RVMs), Radial Basis Function (RBF) networks, splines, neural networks.

More on Gaussian Processes

Rasmussen and Williams


Gaussian Processes for Machine Learning,
MIT Press, 2006.
http://www.GaussianProcess.org/gpml

Gaussian process web (code, papers, etc): http://www.GaussianProcess.org
