With post burn-in samples $(\theta^{(1)}, \mu^{(1)}, (\tau^2)^{(1)}, (\sigma^2)^{(1)}), \cdots, (\theta^{(L)}, \mu^{(L)}, (\tau^2)^{(L)}, (\sigma^2)^{(L)})$, we can simulate $f(\tilde{x})$ at new locations $\tilde{x}_1, \cdots, \tilde{x}_m$ by noticing that

$$\begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \\ f(\tilde{x}_1) \\ \vdots \\ f(\tilde{x}_m) \end{pmatrix} \,\Big|\, \mu, \tau^2, \sigma^2, \phi \;\sim\; N\left(\mu\mathbf{1},\; \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right) \tag{16.1}$$
Denote $\theta = (f(x_1), \cdots, f(x_n))^T$ and $\tilde{\theta} = (f(\tilde{x}_1), \cdots, f(\tilde{x}_m))^T$; then

$$\tilde{\theta} \mid \theta, \mu, \tau^2, \sigma^2, \phi \;\sim\; N\Big(\mu\mathbf{1} + \Sigma_{21}\Sigma_{11}^{-1}(\theta - \mu\mathbf{1}),\; \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Big) \tag{16.2}$$
Therefore, posterior samples $\tilde{\theta}^{(l)}$ can be simulated from (16.2) using the posterior samples $(\theta^{(l)}, \mu^{(l)}, (\tau^2)^{(l)}, (\sigma^2)^{(l)}, \phi^{(l)})$. Finally, to sample the corresponding $\tilde{y}$: since $\tilde{y} = f(\tilde{x}_k) + \epsilon_k$, we can use the posterior samples $f(\tilde{x}_k)^{(l)}$ and draw $\tilde{y}^{(l)}$ from $N\big(f(\tilde{x}_k)^{(l)}, (\tau^2)^{(l)}\big)$.
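The predictive simulation above can be sketched numerically. This is a minimal illustrative sketch, assuming the exponential covariance $\sigma^2\exp(-\phi|x - x'|)$; the hyperparameter values are made up, and the vector `theta` stands in for one posterior draw of $f$ at the observed locations.

```python
import numpy as np

def exp_cov(x1, x2, sigma2, phi):
    # sigma^2 * exp(-phi * |x_i - x_j|) for all pairs of locations
    return sigma2 * np.exp(-phi * np.abs(x1[:, None] - x2[None, :]))

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)              # observed locations x_1, ..., x_n
x_new = np.array([0.25, 0.6])          # new locations x~_1, ..., x~_m
mu, tau2, sigma2, phi = 0.0, 0.1, 1.0, 2.0   # one (illustrative) posterior draw
theta = rng.normal(size=20)            # stand-in for a posterior sample of f(x)

S11 = exp_cov(x, x, sigma2, phi)
S12 = exp_cov(x, x_new, sigma2, phi)
S22 = exp_cov(x_new, x_new, sigma2, phi)

# Conditional mean and covariance from (16.2)
A = np.linalg.solve(S11, S12)          # Sigma_11^{-1} Sigma_12
cond_mean = mu + A.T @ (theta - mu)
cond_cov = S22 - S12.T @ A
f_new = rng.multivariate_normal(cond_mean, cond_cov)

# Corresponding y~: add observation noise with variance tau^2
y_new = f_new + rng.normal(scale=np.sqrt(tau2), size=f_new.shape)
```

In an MCMC workflow this would run once per retained posterior draw $l$, producing a full posterior predictive sample at the new locations.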
In the Bayesian inference of a Gaussian process, since $y \sim N(\mu\mathbf{1}, \sigma^2 H(\phi) + \tau^2 I)$, running MCMC requires evaluating, at every iteration, the likelihood

$$\frac{1}{|\det(\sigma^2 H(\phi) + \tau^2 I)|^{1/2}} \exp\Big\{-\frac{1}{2}(y - \mu\mathbf{1})^T \big(\sigma^2 H(\phi) + \tau^2 I\big)^{-1} (y - \mu\mathbf{1})\Big\} \tag{16.3}$$

(up to a multiplicative constant). Inverting this $n \times n$ matrix and computing its determinant costs $O(n^3)$, which becomes prohibitive for large $n$.
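As a concrete sketch of the bottleneck, the log of (16.3) can be evaluated as below, assuming the exponential correlation $H(\phi)_{ij} = \exp(-\phi|x_i - x_j|)$; the function name and parameter values are illustrative, not from the notes.

```python
import numpy as np

def gp_loglik(y, x, mu, sigma2, tau2, phi):
    """Log of (16.3), up to the additive constant -(n/2) log(2*pi).

    Builds the n x n matrix sigma^2 H(phi) + tau^2 I and factors it;
    this is the O(n^3) step that motivates cheaper GP variants.
    """
    n = len(y)
    H = np.exp(-phi * np.abs(x[:, None] - x[None, :]))   # exponential correlation
    V = sigma2 * H + tau2 * np.eye(n)
    r = y - mu
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * r @ np.linalg.solve(V, r) - 0.5 * logdet

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = rng.normal(size=50)
ll = gp_loglik(y, x, mu=0.0, sigma2=1.0, tau2=0.1, phi=2.0)
```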
Various versions of GP that are computationally less expensive have been proposed. We will discuss two such methods:
1. the low-rank method;
2. the sparse method.
Low-rank method
Rather than using a GP specification directly on $f(x)$ in $y = f(x) + \epsilon$, low-rank methods express $f(x)$ as a basis representation. For example, we may write

$$f(x) = \mu + \sum_{j=1}^{S} K(x, x^*_j, \xi)\lambda_j \tag{16.4}$$

where $K(\cdot, \cdot, \xi)$ is a pre-specified basis function, $x^*_1, \cdots, x^*_S$ are called the knot points, and $S \ll n$.
There are several possible choices for the basis functions, such as:
a. Spline basis
b. Bezier kernel
Different choices of basis functions give different types of models; the choice of basis functions together with the choice of knots completely determines the low-rank method.
There is a trade-off between computational efficiency and inferential accuracy when choosing $S$: a larger $S$ hurts computational efficiency but improves accuracy, while a smaller $S$ improves computational efficiency at the cost of statistical accuracy.
If we fix the basis functions and the knots, this leads to

$$y = f(x) + \epsilon = \mu + \sum_{j=1}^{S} K(x, x^*_j, \xi)\lambda_j + \epsilon \tag{16.5}$$
Since we have

$$\begin{aligned} y_1 &= \mu + \sum_{j=1}^{S} K(x_1, x^*_j, \xi)\lambda_j + \epsilon_1 \\ &\;\;\vdots \\ y_n &= \mu + \sum_{j=1}^{S} K(x_n, x^*_j, \xi)\lambda_j + \epsilon_n \end{aligned} \tag{16.6}$$

with $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)$, we can write this in matrix form as
$$y = \mu\mathbf{1} + K\lambda + \epsilon \tag{16.7}$$

with $\epsilon \sim N(0, \tau^2 I)$, $\lambda = (\lambda_1, \cdots, \lambda_S)^T$, and the $n \times S$ matrix $K$ given by

$$K = \begin{pmatrix} K(x_1, x^*_1, \xi) & \cdots & K(x_1, x^*_S, \xi) \\ \vdots & & \vdots \\ K(x_n, x^*_1, \xi) & \cdots & K(x_n, x^*_S, \xi) \end{pmatrix} \tag{16.8}$$
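The construction of $K$ and the model (16.7) can be sketched as follows. The Gaussian kernel $K(x, x^*, \xi) = \exp(-\xi(x - x^*)^2)$ is just one illustrative choice of basis function, and $\Sigma_1 = I$, $\xi = 50$ are made-up values for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 200, 10                       # S << n
x = np.linspace(0, 1, n)
knots = np.linspace(0, 1, S)         # knot points x*_1, ..., x*_S

# n x S basis matrix from (16.8); a Gaussian kernel is one possible choice
xi = 50.0
K = np.exp(-xi * (x[:, None] - knots[None, :]) ** 2)

# Simulate from (16.7): y = mu*1 + K lambda + eps
mu, tau2 = 0.0, 0.05
lam = rng.normal(size=S)             # lambda ~ N(0, I), i.e. Sigma_1 = I here
y = mu + K @ lam + rng.normal(scale=np.sqrt(tau2), size=n)
```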
The priors are specified as

$$\mu \sim N(m_\mu, \sigma^2_\mu), \quad \lambda \sim N(0, \Sigma_1), \quad \tau^2 \sim IG(a_\tau, b_\tau), \quad \xi \sim p(\xi) \tag{16.9}$$

Marginalizing over $\lambda$ gives

$$y \sim N(\mu\mathbf{1}, K\Sigma_1 K^T + \tau^2 I) \tag{16.10}$$
As before, you will have to evaluate the likelihood $N(y \mid \mu\mathbf{1}, K\Sigma_1 K^T + \tau^2 I)$ at each MCMC iteration when drawing samples. The log-likelihood is

$$\ell = -\frac{(y - \mu\mathbf{1})^T (K\Sigma_1 K^T + \tau^2 I)^{-1} (y - \mu\mathbf{1})}{2} - \frac{1}{2}\log\big(\det(K\Sigma_1 K^T + \tau^2 I)\big) \tag{16.11}$$
Inverting the $n \times n$ matrix $K\Sigma_1 K^T + \tau^2 I$ directly would again cost $O(n^3)$. Instead, we can apply the Sherman–Morrison–Woodbury identity

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1} \tag{16.12}$$

with $A = \tau^2 I$, $U = K$, $C = \Sigma_1$, and $V = K^T$. The only matrix requiring a nontrivial inversion is $C^{-1} + VA^{-1}U$, which is an $S \times S$ matrix, so the computational complexity of the inversion is $O(S^3)$. Since $S \ll n$, this achieves computational efficiency.
Similarly, the determinant can be computed with the matrix determinant lemma

$$\det(A + UCV) = \det(C^{-1} + VA^{-1}U)\det(A)\det(C) \tag{16.13}$$
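Both identities can be checked numerically. This sketch uses the identification above ($A = \tau^2 I$, $U = K$, $C = \Sigma_1$, $V = K^T$) with illustrative sizes and $\Sigma_1 = I$; determinants are compared on the log scale for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S = 100, 5
tau2 = 0.5
K = rng.normal(size=(n, S))
Sigma1 = np.eye(S)                       # illustrative prior covariance of lambda
M = tau2 * np.eye(n) + K @ Sigma1 @ K.T  # K Sigma_1 K^T + tau^2 I

# Woodbury (16.12): only the S x S matrix `small` needs a real inversion
A_inv = np.eye(n) / tau2                 # A = tau^2 I inverts trivially
small = np.linalg.inv(Sigma1) + K.T @ A_inv @ K       # C^{-1} + V A^{-1} U
M_inv = A_inv - A_inv @ K @ np.linalg.solve(small, K.T @ A_inv)
assert np.allclose(M_inv, np.linalg.inv(M))

# Determinant lemma (16.13) on the log scale:
# log det(M) = log det(small) + n log(tau^2) + log det(Sigma_1), the last term 0 here
ld_full = np.linalg.slogdet(M)[1]
ld_small = np.linalg.slogdet(small)[1]
assert np.allclose(ld_full, ld_small + n * np.log(tau2))
```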
The inverse and determinant of a matrix can be obtained together by a Cholesky decomposition. Suppose the Cholesky decomposition of a matrix $A$ is $A = LL^T$; then $A^{-1} = (L^{-1})^T(L^{-1})$ and $\det(A) = (\det(L))^2$. Since $L$ is a lower triangular matrix, the determinant of $L$ is the product of its diagonal entries.
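These Cholesky facts can be sketched as follows; the matrix $A$ is an arbitrary symmetric positive definite example.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
A = B @ B.T + 6 * np.eye(6)          # a symmetric positive definite matrix

L = np.linalg.cholesky(A)            # A = L L^T, with L lower triangular
L_inv = np.linalg.inv(L)             # would be a triangular solve in practice
A_inv = L_inv.T @ L_inv              # A^{-1} = (L^{-1})^T (L^{-1})
assert np.allclose(A_inv, np.linalg.inv(A))

# det(A) = (det(L))^2, and det(L) is the product of its diagonal entries
det_A = np.prod(np.diag(L)) ** 2
assert np.allclose(det_A, np.linalg.det(A))
```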
For low-rank methods, both the choice of kernel functions $K(\cdot, \cdot, \xi)$ and the choice of knot points $x^*_1, \cdots, x^*_S$ affect the results.
A natural question is whether the choice of the kernel functions can be motivated from the Gaussian process itself. That leads to the idea of the predictive process model. In a predictive process model, you only choose the set of knots; the kernel functions are then automatically determined.
Let $f^* = (f(x^*_1), \cdots, f(x^*_S))^T$ denote the process values at the knots. Under the exponential covariance,

$$c(x_i)^T := \mathrm{Cov}(f(x_i), f^*) = \begin{pmatrix} \sigma^2 \exp(-\phi|x_i - x^*_1|) \\ \vdots \\ \sigma^2 \exp(-\phi|x_i - x^*_S|) \end{pmatrix} \tag{16.14}$$

and $C^* := \mathrm{Var}(f^*)$ is an $S \times S$ matrix whose $(k, l)$th entry is given by $\sigma^2 \exp(-\phi|x^*_k - x^*_l|) = \mathrm{Cov}(f(x^*_k), f(x^*_l))$. The predictive process replaces $f(x_i)$ by its conditional expectation given the knots, $E(f(x_i) \mid f^*) = c(x_i)(C^*)^{-1}f^*$, giving

$$\begin{aligned} y_1 &= \mu + c(x_1)(C^*)^{-1}f^* + \epsilon_1 \\ &\;\;\vdots \\ y_n &= \mu + c(x_n)(C^*)^{-1}f^* + \epsilon_n \end{aligned} \tag{16.15}$$
In matrix form,

$$y = \mu\mathbf{1} + \begin{pmatrix} c(x_1) \\ \vdots \\ c(x_n) \end{pmatrix} (C^*)^{-1} f^* + \epsilon \tag{16.16}$$

Writing

$$K = \begin{pmatrix} c(x_1) \\ \vdots \\ c(x_n) \end{pmatrix} (C^*)^{-1} \tag{16.17}$$

we have

$$y = \mu\mathbf{1} + Kf^* + \epsilon \tag{16.18}$$

It looks like a low-rank method, but the choice of the basis functions is implicitly determined by the GP specification on $f$.
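The construction of $K$ in (16.17) can be sketched as follows, assuming the exponential covariance; all parameter values are illustrative. Note that at a knot location the row of $K$ reduces to an indicator, so the predictive process interpolates $f^*$ exactly at the knots.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, phi, tau2, mu = 1.0, 3.0, 0.1, 0.0
n, S = 100, 8
x = np.linspace(0, 1, n)
knots = np.linspace(0, 1, S)                   # x*_1, ..., x*_S

def cov(a, b):
    # sigma^2 exp(-phi |a_i - b_j|) for all pairs
    return sigma2 * np.exp(-phi * np.abs(a[:, None] - b[None, :]))

C_star = cov(knots, knots)                     # Var(f*), entries as in the text
c = cov(x, knots)                              # row i is c(x_i) from (16.14)
K = c @ np.linalg.inv(C_star)                  # (16.17)

# Simulate from the predictive process model (16.18)
f_star = rng.multivariate_normal(np.zeros(S), C_star)
y = mu + K @ f_star + rng.normal(scale=np.sqrt(tau2), size=n)
```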
The variability of an observation under the full model is $\mathrm{Var}(f(x_i)) + \mathrm{Var}(\epsilon_i)$. The variability of an observation under the predictive process model is $\mathrm{Var}(E(f(x_i) \mid f^*)) + \mathrm{Var}(\tilde{\epsilon}_i)$. Since the variability of an observation should be the same for different models, we have

$$\mathrm{Var}(f(x_i)) + \mathrm{Var}(\epsilon_i) = \mathrm{Var}(E(f(x_i) \mid f^*)) + \mathrm{Var}(\tilde{\epsilon}_i) \tag{16.20}$$

By the law of total variance, $\mathrm{Var}(f(x_i)) = \mathrm{Var}(E(f(x_i) \mid f^*)) + E(\mathrm{Var}(f(x_i) \mid f^*))$, so

$$\mathrm{Var}(E(f(x_i) \mid f^*)) + E(\mathrm{Var}(f(x_i) \mid f^*)) + \mathrm{Var}(\epsilon_i) = \mathrm{Var}(E(f(x_i) \mid f^*)) + \mathrm{Var}(\tilde{\epsilon}_i)$$

which implies

$$\mathrm{Var}(\tilde{\epsilon}_i) \geq \mathrm{Var}(\epsilon_i) \tag{16.21}$$

(16.21) tells us that if we simulate data from a full Gaussian process model and fit a predictive process model to those data, we will always observe an overestimation of the error variance. As $S$ increases, the predictive process becomes a better approximation of the full GP model. In the extreme case when $S = n$ and the data points themselves are taken as the knot points, $E(f(x_i) \mid f^*) = f(x_i)$.
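The inequality (16.21) can be checked numerically: under the exponential covariance, $\mathrm{Var}(E(f(x_i) \mid f^*)) = c(x_i)(C^*)^{-1}c(x_i)^T \leq \sigma^2 = \mathrm{Var}(f(x_i))$, with equality exactly at the knots. This sketch uses illustrative parameter values.

```python
import numpy as np

sigma2, phi = 1.0, 3.0
x = np.linspace(0, 1, 50)
knots = np.linspace(0, 1, 5)

def cov(a, b):
    # sigma^2 exp(-phi |a_i - b_j|) for all pairs
    return sigma2 * np.exp(-phi * np.abs(a[:, None] - b[None, :]))

C_star = cov(knots, knots)
c = cov(x, knots)

# Var(E(f(x_i) | f*)) = c(x_i) (C*)^{-1} c(x_i)^T, computed for every i at once
var_pp = np.einsum('ij,ij->i', c @ np.linalg.inv(C_star), c)

# The predictive process variance never exceeds the full-model variance sigma^2,
# so Var(eps~_i) must absorb the difference, as in (16.21)
assert np.all(var_pp <= sigma2 + 1e-9)
# Equality holds exactly at the knot locations (x[0] = 0 is a knot here)
assert np.isclose(var_pp[0], sigma2)
```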