
Chapter 16 Low-rank Methods to Speed Up the Inference of Gaussian Processes (Lecture on 02/25/2021)
We are trying to fit $y = f(x) + \epsilon$ with $\epsilon \sim N(0, \tau^2)$, and we put a Gaussian process prior on $f$. We have shown how to draw post burn-in samples $(\mu^{(1)}, (\sigma^2)^{(1)}, (\tau^2)^{(1)}, \phi^{(1)}), \cdots, (\mu^{(L)}, (\sigma^2)^{(L)}, (\tau^2)^{(L)}, \phi^{(L)})$. Then, using these post burn-in samples, we can obtain $\theta^{(1)}, \cdots, \theta^{(L)}$.

With post burn-in samples $(\theta^{(1)}, \mu^{(1)}, (\tau^2)^{(1)}, (\sigma^2)^{(1)}), \cdots, (\theta^{(L)}, \mu^{(L)}, (\tau^2)^{(L)}, (\sigma^2)^{(L)})$, we can simulate $f(\tilde{x})$ at new locations $\tilde{x}_1, \cdots, \tilde{x}_m$ by noticing that

$$
\begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \\ f(\tilde{x}_1) \\ \vdots \\ f(\tilde{x}_m) \end{pmatrix} \Bigg|\, \mu, \tau^2, \sigma^2, \phi \sim N\left(\mu 1, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right) \tag{16.1}
$$

Denote $\theta = (f(x_1), \cdots, f(x_n))^T$ and $\tilde{\theta} = (f(\tilde{x}_1), \cdots, f(\tilde{x}_m))^T$; we have

$$
\tilde{\theta} \mid \theta, \mu, \tau^2, \sigma^2, \phi \sim N\left(\mu 1 + \Sigma_{21}\Sigma_{11}^{-1}(\theta - \mu 1),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right) \tag{16.2}
$$

Therefore, posterior samples $\tilde{\theta}^{(l)}$ can be simulated from (16.2) using the posterior samples $(\theta^{(l)}, \mu^{(l)}, (\tau^2)^{(l)}, (\sigma^2)^{(l)}, \phi^{(l)})$. Finally, to sample the corresponding $y$, since $\tilde{y}_k = f(\tilde{x}_k) + \epsilon_k$, we can use the posterior samples $f(\tilde{x}_k)^{(l)}$ and sample $\tilde{y}_k^{(l)}$ from $N\big(f(\tilde{x}_k)^{(l)}, (\tau^2)^{(l)}\big)$.
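For concreteness, here is a minimal Python/NumPy sketch of one such draw from (16.2), assuming the exponential covariance $\sigma^2\exp(-\phi d)$ used later in this lecture; the helper `exp_cov`, the function name `draw_f_tilde`, and the toy inputs are illustrative, not part of the notes.

```python
import numpy as np

def exp_cov(xa, xb, sigma2, phi):
    """Covariance sigma^2 * exp(-phi * |x - x'|) between two sets of 1-d locations."""
    return sigma2 * np.exp(-phi * np.abs(xa[:, None] - xb[None, :]))

def draw_f_tilde(x, x_tilde, theta, mu, sigma2, phi, rng):
    """One draw of (f(x~_1), ..., f(x~_m)) from the conditional (16.2)."""
    Sigma11 = exp_cov(x, x, sigma2, phi)
    Sigma12 = exp_cov(x, x_tilde, sigma2, phi)
    Sigma22 = exp_cov(x_tilde, x_tilde, sigma2, phi)
    cond_mean = mu + Sigma12.T @ np.linalg.solve(Sigma11, theta - mu)
    cond_cov = Sigma22 - Sigma12.T @ np.linalg.solve(Sigma11, Sigma12)
    cond_cov = cond_cov + 1e-10 * np.eye(x_tilde.size)   # small jitter for numerical stability
    return rng.multivariate_normal(cond_mean, cond_cov)

# One posterior-predictive draw of y~ given one post burn-in sample (theta, mu, sigma2, tau2, phi)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
x_tilde = np.array([0.25, 0.75])
theta = np.sin(2 * np.pi * x)            # stand-in for a posterior sample of f at the observed x
mu, sigma2, tau2, phi = 0.0, 1.0, 0.1, 3.0
f_tilde = draw_f_tilde(x, x_tilde, theta, mu, sigma2, phi, rng)
y_tilde = f_tilde + rng.normal(0.0, np.sqrt(tau2), size=x_tilde.size)
```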
In the Bayesian inference of a Gaussian process, since $y \sim N(\mu 1, \sigma^2 H(\phi) + \tau^2 I)$, when running MCMC we have to evaluate, at each iteration, the likelihood

$$
\frac{1}{|\det(\sigma^2 H(\phi) + \tau^2 I)|^{1/2}} \exp\left\{-\frac{1}{2}(y - \mu 1)^T \big(\sigma^2 H(\phi) + \tau^2 I\big)^{-1} (y - \mu 1)\right\} \tag{16.3}
$$

Therefore, at each iteration we need to evaluate the inverse and determinant of $\sigma^2 H(\phi) + \tau^2 I$, which is done by computing the Cholesky decomposition of $\sigma^2 H(\phi) + \tau^2 I$. The Cholesky decomposition has computational complexity $O(n^3)$; when $n$ is large, even for $n = 20000$, fitting a GP may take months. This is known as the big-$n$ problem in Gaussian processes.
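To make the cost concrete, a NumPy/SciPy sketch of evaluating the log of (16.3) through a Cholesky factorization might look as follows; the exponential correlation for $H(\phi)$ and the function name are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_marginal_loglik(y, x, mu, sigma2, tau2, phi):
    """Log of (16.3), up to an additive constant, via a Cholesky factorization."""
    n = y.size
    H = np.exp(-phi * np.abs(x[:, None] - x[None, :]))   # correlation matrix H(phi)
    V = sigma2 * H + tau2 * np.eye(n)                    # n x n covariance of y
    c, low = cho_factor(V)                               # O(n^3): the expensive step
    r = y - mu
    quad = r @ cho_solve((c, low), r)                    # (y - mu 1)^T V^{-1} (y - mu 1)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))            # log det(V) from the Cholesky factor
    return -0.5 * quad - 0.5 * logdet
```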

Various versions of the GP have been proposed that are computationally less expensive. We will
discuss two such methods:

1. Low-rank method;

2. Sparse method.

Low-rank method

Rather than using a GP specification directly on $f(x)$ in $y = f(x) + \epsilon$, low-rank methods express $f(x)$ through a basis representation. For example, we may write

$$
f(x) = \mu + \sum_{j=1}^{S} K(x, x_j^*, \xi)\lambda_j \tag{16.4}
$$

where $K(\cdot, \cdot, \xi)$ is a pre-specified basis function and $x_1^*, \cdots, x_S^*$ are called the knot points. We also have $S \ll n$.

There are several possible choices for the basis functions, such as:

a. Spline basis

b. Bezier kernel

Different choices of basis functions give different types of models. The choice of basis functions and the choice of knots completely determine the low-rank method.

There is a trade-off between computational efficiency and inferential accuracy when choosing $S$. If we make $S$ larger, computational efficiency suffers but accuracy improves; if we make $S$ smaller, computational efficiency improves but statistical accuracy decreases.

Suppose we fix the basis functions and the knots; this leads to

$$
y = f(x) + \epsilon = \mu + \sum_{j=1}^{S} K(x, x_j^*, \xi)\lambda_j + \epsilon \tag{16.5}
$$

Note that $\lambda_1, \cdots, \lambda_S$, $\xi$ and $\mu$ are parameters we need to estimate.

We have

$$
\begin{aligned}
y_1 &= \mu + \sum_{j=1}^{S} K(x_1, x_j^*, \xi)\lambda_j + \epsilon_1 \\
&\cdots\cdots\cdots \\
y_n &= \mu + \sum_{j=1}^{S} K(x_n, x_j^*, \xi)\lambda_j + \epsilon_n
\end{aligned} \tag{16.6}
$$

with $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)$.

Written in matrix form, we have

$$
y = \mu 1 + K\lambda + \epsilon \tag{16.7}
$$

with $\epsilon \sim N(0, \tau^2 I)$, $\lambda = (\lambda_1, \cdots, \lambda_S)^T$, and the $n \times S$ matrix $K$ given by

$$
K = \begin{pmatrix}
K(x_1, x_1^*, \xi) & \cdots & K(x_1, x_S^*, \xi) \\
\vdots & & \vdots \\
K(x_n, x_1^*, \xi) & \cdots & K(x_n, x_S^*, \xi)
\end{pmatrix} \tag{16.8}
$$
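For instance, with a Gaussian kernel of bandwidth $\xi$ as the basis function (one possible choice, not prescribed by the notes), the $n \times S$ matrix $K$ in (16.8) could be assembled as in this sketch:

```python
import numpy as np

def basis_matrix(x, knots, xi):
    """n x S matrix with entries K(x_i, x*_j, xi); here K is a Gaussian kernel with bandwidth xi."""
    d = x[:, None] - knots[None, :]
    return np.exp(-0.5 * (d / xi) ** 2)

# Toy usage: n = 500 observations, S = 10 knots, so K is 500 x 10
x = np.linspace(0.0, 1.0, 500)
knots = np.linspace(0.0, 1.0, 10)
K = basis_matrix(x, knots, xi=0.1)
```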

We want to estimate $\mu, \lambda, \tau^2, \xi \mid y_1, \cdots, y_n$. We assign priors for the parameters as

$$
\mu \sim N(m_\mu, \sigma_\mu^2), \quad \lambda \sim N(0, \Sigma_1), \quad \tau^2 \sim IG(a_\tau, b_\tau), \quad \xi \sim p(\xi) \tag{16.9}
$$

Since $y \sim N(\mu 1 + K\lambda, \tau^2 I)$ and $\lambda \sim N(0, \Sigma_1)$, we can integrate out $\lambda$, which gives us

$$
y \sim N(\mu 1, K\Sigma_1 K^T + \tau^2 I) \tag{16.10}
$$
As before, we have to evaluate the likelihood $N(y \mid \mu 1, K\Sigma_1 K^T + \tau^2 I)$ at each MCMC iteration when drawing samples. The log-likelihood is

$$
l = -\frac{(y - \mu 1)^T \big(K\Sigma_1 K^T + \tau^2 I\big)^{-1} (y - \mu 1)}{2} - \frac{1}{2}\log\big(\det(K\Sigma_1 K^T + \tau^2 I)\big) \tag{16.11}
$$

Then use the Woodbury matrix identity,

$$
(A + UCV)^{-1} = A^{-1} - A^{-1}U\big(C^{-1} + VA^{-1}U\big)^{-1}VA^{-1} \tag{16.12}
$$

where $A, U, C, V$ are $n \times n$, $n \times S$, $S \times S$, and $S \times n$ matrices, respectively. The crucial thing to notice is that $C^{-1} + VA^{-1}U$ is an $S \times S$ matrix. Therefore, take $U = K$, $V = K^T$, $C = \Sigma_1$, and $A = \tau^2 I$, so that $A^{-1} = \frac{1}{\tau^2} I$ is trivial to obtain. The main computational difficulty comes from evaluating $\big(C^{-1} + VA^{-1}U\big)^{-1}$, the inverse of an $S \times S$ matrix, whose computational complexity is $O(S^3)$. Since $S \ll n$, this achieves computational efficiency.
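A quick numerical sanity check of (16.12) under the choices $U = K$, $V = K^T$, $C = \Sigma_1$, $A = \tau^2 I$, with arbitrary toy values (a sketch, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, S, tau2 = 200, 10, 0.5
K = rng.normal(size=(n, S))
Sigma1 = np.eye(S)

# Direct O(n^3) inverse of A + UCV = tau^2 I + K Sigma1 K^T
direct = np.linalg.inv(tau2 * np.eye(n) + K @ Sigma1 @ K.T)

# Woodbury: only the S x S matrix C^{-1} + V A^{-1} U is inverted, at O(S^3) cost
A_inv = np.eye(n) / tau2
small = np.linalg.inv(np.linalg.inv(Sigma1) + K.T @ A_inv @ K)
woodbury = A_inv - A_inv @ K @ small @ K.T @ A_inv

print(np.allclose(direct, woodbury))   # True
```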

We will also need to evaluate the determinant of $K\Sigma_1 K^T + \tau^2 I$, which can be simplified using the following result.

Lemma 16.1 (Matrix determinant lemma) For $n \times n$, $n \times S$, $S \times S$ and $S \times n$ matrices $A, U, C$ and $V$, we have

$$
\det(A + UCV) = \det\big(C^{-1} + VA^{-1}U\big)\det(A)\det(C) \tag{16.13}
$$

Therefore, in our case, computing the determinant of the $n \times n$ matrix $K\Sigma_1 K^T + \tau^2 I$ can be done by computing the determinants of the $S \times S$ matrices $\Sigma_1^{-1} + \frac{1}{\tau^2} K^T K$ and $\Sigma_1$. Since $A = \tau^2 I$, both $\det(A)$ and $A^{-1}$ are easy to obtain.
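Putting the Woodbury identity and the matrix determinant lemma together, the log-likelihood (16.11) can be evaluated at $O(nS^2 + S^3)$ cost without ever forming an $n \times n$ matrix. A sketch (the function name and interface are illustrative; it uses a Cholesky factorization of the $S \times S$ matrix, see the remark below):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lowrank_loglik(y, K, Sigma1, mu, tau2):
    """Log-likelihood of y ~ N(mu 1, K Sigma1 K^T + tau^2 I) without forming any n x n matrix."""
    n, S = K.shape
    r = y - mu
    # S x S inner matrix  M = Sigma1^{-1} + (1/tau^2) K^T K
    M = np.linalg.inv(Sigma1) + (K.T @ K) / tau2
    cM, low = cho_factor(M)
    # Woodbury (16.12): r^T (K Sigma1 K^T + tau^2 I)^{-1} r
    Ktr = K.T @ r
    quad = (r @ r) / tau2 - (Ktr @ cho_solve((cM, low), Ktr)) / tau2 ** 2
    # Matrix determinant lemma (16.13): log det(K Sigma1 K^T + tau^2 I)
    logdet = (2.0 * np.sum(np.log(np.diag(cM)))
              + np.linalg.slogdet(Sigma1)[1]
              + n * np.log(tau2))
    return -0.5 * quad - 0.5 * logdet
```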

The inverse and determinant of a matrix can be obtained together by computing the Cholesky decomposition of the matrix. Suppose the Cholesky decomposition of a matrix $A$ is $A = LL^T$; then $A^{-1} = (L^{-1})^T(L^{-1})$ and $\det(A) = (\det(L))^2$. Since $L$ is a lower triangular matrix, the determinant of $L$ is the product of its diagonal entries.
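In code, the remark above amounts to the following small NumPy/SciPy sketch:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

A = np.array([[4.0, 1.0], [1.0, 3.0]])
L = cholesky(A, lower=True)                           # A = L L^T
L_inv = solve_triangular(L, np.eye(2), lower=True)
A_inv = L_inv.T @ L_inv                               # A^{-1} = (L^{-1})^T (L^{-1})
det_A = np.prod(np.diag(L)) ** 2                      # det(A) = (det L)^2 = (prod of diagonals)^2
```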

For low-rank methods, the choice of kernel functions $K(\cdot, \cdot, \xi)$ and the choice of knot points $x_1^*, \cdots, x_S^*$ seem to affect the results.

People are interested in whether the choice of the kernel functions can be motivated from the Gaussian process itself. That leads to the idea of the predictive process model. In a predictive process model, you only choose the set of knots; the set of kernel functions is then automatically determined.

Suppose the original nonlinear regression model is $y = f(x) + \epsilon$ and suppose $f(x) \sim GP(\mu, \sigma^2 \exp(-\phi d))$. Let us define a set of knots $x_1^*, \cdots, x_S^*$, and define $f^* = (f(x_1^*), \cdots, f(x_S^*))^T$. We have

$$
\mathrm{Cov}(f(x_i), f^*) = \begin{pmatrix} \sigma^2 \exp(-\phi|x_i - x_1^*|) \\ \vdots \\ \sigma^2 \exp(-\phi|x_i - x_S^*|) \end{pmatrix} \tag{16.14}
$$

and $\mathrm{Var}(f^*)$ is an $S \times S$ matrix whose $(k, l)$th entry is given by $\sigma^2 \exp(-\phi|x_k^* - x_l^*|) = \mathrm{Cov}(f(x_k^*), f(x_l^*))$.

Predictive process model

Let $y_i = \mu + \mathrm{Cov}(f(x_i), f^*)\,\mathrm{Var}(f^*)^{-1} f^* + \epsilon_i$. Denote $\mathrm{Cov}(f(x_i), f^*) = c(x_i)$ and $C^* = \mathrm{Var}(f^*)$; we then have

$$
\begin{aligned}
y_1 &= \mu + c(x_1)(C^*)^{-1}f^* + \epsilon_1 \\
&\cdots\cdots\cdots \\
y_n &= \mu + c(x_n)(C^*)^{-1}f^* + \epsilon_n
\end{aligned} \tag{16.15}
$$

Written in matrix form, we have

$$
y = \mu 1 + \begin{pmatrix} c(x_1) \\ \vdots \\ c(x_n) \end{pmatrix} (C^*)^{-1} f^* + \epsilon \tag{16.16}
$$

where $\begin{pmatrix} c(x_1) \\ \vdots \\ c(x_n) \end{pmatrix}$ is an $n \times S$ matrix, $(C^*)^{-1}$ is an $S \times S$ matrix, and $f^*$ is an $S \times 1$ vector. Thinking of the low-rank method, set

$$
K = \begin{pmatrix} c(x_1) \\ \vdots \\ c(x_n) \end{pmatrix} (C^*)^{-1} \tag{16.17}
$$

and we have

$$
y = \mu 1 + K f^* + \epsilon \tag{16.18}
$$

It looks like a low-rank method, but the choice of the basis functions is implicitly determined by
the GP specification on f .
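Under the exponential covariance $\sigma^2\exp(-\phi d)$ assumed earlier, a sketch (with an illustrative function name) of building the implied $n \times S$ matrix $K$ of (16.17):

```python
import numpy as np

def predictive_process_K(x, knots, sigma2, phi):
    """n x S matrix K = [c(x_1); ...; c(x_n)] (C*)^{-1} implied by the GP covariance."""
    C = sigma2 * np.exp(-phi * np.abs(x[:, None] - knots[None, :]))           # rows are c(x_i)
    C_star = sigma2 * np.exp(-phi * np.abs(knots[:, None] - knots[None, :]))  # C* = Var(f*)
    return np.linalg.solve(C_star, C.T).T                                     # C (C*)^{-1}
```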

To do posterior inference for the predictive process, since $y \sim N(\mu 1 + Kf^*, \tau^2 I)$ and $f^* \sim N(0, C^*)$, marginalizing out $f^*$ we have $y \sim N(\mu 1, \tau^2 I + KC^*K^T)$. We have to run Gibbs within Metropolis-Hastings (Gibbs for $\mu$ and M-H for the other parameters) to draw samples of $\mu, \tau^2, \phi, \sigma^2$.
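Since $\tau^2 I + KC^*K^T$ is exactly the low-rank form of (16.10) with $\Sigma_1 = C^*$, the efficient likelihood evaluation sketched in the low-rank section can be reused at every M-H step. A usage fragment, assuming the hypothetical `predictive_process_K` and `lowrank_loglik` helpers sketched above and toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 300)
knots = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)
mu, tau2, sigma2, phi = 0.0, 0.09, 1.0, 3.0

# One marginal-likelihood evaluation inside an M-H step:
# y ~ N(mu 1, K C* K^T + tau^2 I), which is the low-rank form with Sigma1 = C*.
K = predictive_process_K(x, knots, sigma2, phi)
C_star = sigma2 * np.exp(-phi * np.abs(knots[:, None] - knots[None, :]))
ll = lowrank_loglik(y, K, C_star, mu, tau2)
```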

If the full model is $y_i = \mu + f(x_i) + \epsilon_i$ where $f \sim GP(0, C_\nu)$, the predictive process is $y_i = \mu + \mathrm{Cov}(f(x_i), f^*)\,\mathrm{Var}(f^*)^{-1} f^* + \tilde{\epsilon}_i$, where $\tilde{\epsilon}_i$ denotes the error term of the predictive process model. Notice that $\mathrm{Cov}(f(x_i), f^*)\,\mathrm{Var}(f^*)^{-1} f^* = E(f(x_i) \mid f^*)$, which implies that the predictive process model can also be written as

$$
y_i = \mu + E(f(x_i) \mid f^*) + \tilde{\epsilon}_i \tag{16.19}
$$

The variability of an observation from the full model is $\mathrm{Var}(f(x_i)) + \mathrm{Var}(\epsilon_i)$. The variability of an observation from the predictive process model is $\mathrm{Var}(E(f(x_i) \mid f^*)) + \mathrm{Var}(\tilde{\epsilon}_i)$. Since the variability of an observation should be the same under the two models, we have

$$
\mathrm{Var}(f(x_i)) + \mathrm{Var}(\epsilon_i) = \mathrm{Var}(E(f(x_i) \mid f^*)) + \mathrm{Var}(\tilde{\epsilon}_i) \tag{16.20}
$$

By the law of total variance, $\mathrm{Var}(f(x_i)) = \mathrm{Var}(E(f(x_i) \mid f^*)) + E(\mathrm{Var}(f(x_i) \mid f^*))$, so (16.20) implies

$$
\mathrm{Var}(E(f(x_i) \mid f^*)) + E(\mathrm{Var}(f(x_i) \mid f^*)) + \mathrm{Var}(\epsilon_i) = \mathrm{Var}(E(f(x_i) \mid f^*)) + \mathrm{Var}(\tilde{\epsilon}_i)
$$

and since $E(\mathrm{Var}(f(x_i) \mid f^*)) \geq 0$, we can finally get

$$
\mathrm{Var}(\tilde{\epsilon}_i) \geq \mathrm{Var}(\epsilon_i) \tag{16.21}
$$

Equation (16.21) tells us that if we simulate data from a full Gaussian process model and then try to fit a predictive process model to that data, we will always observe an overestimation of the error variance. It appears that as $S$ increases, the predictive process becomes a better approximation of the full GP model. In the extreme case where $S = n$ and the data points are taken as the knot points, we have $E(f(x_i) \mid f^*) = f(x_i)$.
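A small numerical check of this inequality under the exponential covariance and arbitrary toy values: the variance of the conditional expectation never exceeds $\mathrm{Var}(f(x_i)) = \sigma^2$, so the leftover variability must be absorbed by $\tilde{\epsilon}_i$.

```python
import numpy as np

sigma2, phi = 1.0, 3.0
knots = np.linspace(0.0, 1.0, 10)
C_star = sigma2 * np.exp(-phi * np.abs(knots[:, None] - knots[None, :]))   # Var(f*)

x_i = 0.37                                                   # an arbitrary non-knot location
c = sigma2 * np.exp(-phi * np.abs(x_i - knots))              # Cov(f(x_i), f*)
var_cond_mean = c @ np.linalg.solve(C_star, c)               # Var(E(f(x_i) | f*)) = c^T (C*)^{-1} c
print(var_cond_mean, "<=", sigma2)                           # stays below Var(f(x_i)) = sigma^2
```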
