
Chapter 14 MCMC for Continuous Distributions, Gaussian Processes (Lecture on 02/18/2021)

We studied why the Metropolis-Hastings chain converges to the full posterior when the parameter
space is discrete. Now we study the case when the parameter space is continuous. When the
Markov chain is defined on a continuous state space, we cannot use a transition probability matrix.
We are still dealing with a Markov chain of the form {Xn : n ≥ 1}, but now each Xn ∈ S, where S is not
discrete.

Definition 14.1 (Transition kernel) The transition kernel T(x, x′) for making a transition from
Xn = x to Xn+1 = x′ is a function of x and x′ such that:

a. T(x, ⋅) is a density function;

b. T(⋅, x′) is a measurable function.

You can think of T(x, ⋅) as the conditional distribution of Xn+1 | Xn = x.

Let g(x′|x) be the proposal density for proposing x′ from x, and let a(x, x′) be the probability
of accepting x′ when the current state is x. Then the transition kernel for Metropolis-Hastings can be
described as

$$
T(x, x') =
\begin{cases}
g(x'|x)\,a(x, x') & x' \neq x \\[4pt]
1 - \int g(x'|x)\,a(x, x')\,dx' & x' = x
\end{cases}
\tag{14.1}
$$

The first expression in (14.1) is the probability that you propose x′, which is g(x′|x), and the
proposal then gets accepted, which happens with probability a(x, x′). For the second expression
in (14.1), ∫ g(x′|x) a(x, x′) dx′ is the probability of proposing any x′ and having it accepted.
Thus, 1 − ∫ g(x′|x) a(x, x′) dx′ is the probability that no proposed x′ is accepted, so the chain
stays where it is, i.e., x′ = x.

Definition 14.2 (Stationary distribution in the continuous case) If a transition kernel satisfies

$$
\int T(x, x')\,\pi(x)\,dx = \pi(x'), \quad \forall x'
\tag{14.2}
$$

then π(⋅) is known as the stationary distribution for the transition kernel.

Notice the similarity with the definition of the stationary distribution in the discrete case
(Definition 9.1), where the stationary distribution satisfies πj = Σi πi pij, ∀j.
Here we just replace the sum with an integral.

Proposition 14.1 (Detailed balance condition for the continuous case) Similar to the discrete case, if
∃ π(⋅) such that

$$
T(x, x')\,\pi(x) = T(x', x)\,\pi(x'), \quad \forall x, x'
\tag{14.3}
$$

then π(⋅) is the stationary distribution, and we say that the detailed balance condition is satisfied.
Let p(x) be the full posterior distribution. For Metropolis-Hastings, the acceptance probability is

$$
a(x, x') = \min\!\left(1, \frac{p(x')\,g(x|x')}{p(x)\,g(x'|x)}\right)
\tag{14.4}
$$

If x′ ≠ x, we then get

$$
\begin{aligned}
T(x, x')\,p(x) &= g(x'|x)\,a(x, x')\,p(x) \\
&= g(x'|x)\,p(x)\,\min\!\left(1, \frac{p(x')\,g(x|x')}{p(x)\,g(x'|x)}\right)
\end{aligned}
\tag{14.5}
$$

We consider two cases. In the first case, assume p(x′)g(x|x′) < p(x)g(x′|x); then (14.5)
becomes

$$
T(x, x')\,p(x) = g(x'|x)\,p(x)\,\frac{p(x')\,g(x|x')}{p(x)\,g(x'|x)} = p(x')\,g(x|x')
\tag{14.6}
$$

In addition, since p(x)g(x′|x) > p(x′)g(x|x′) in this case, the min in the acceptance probability
a(x′, x) equals 1, so

$$
\begin{aligned}
T(x', x)\,p(x') &= g(x|x')\,a(x', x)\,p(x') \\
&= g(x|x')\,p(x')\,\min\!\left(1, \frac{p(x)\,g(x'|x)}{p(x')\,g(x|x')}\right) \\
&= g(x|x')\,p(x')
\end{aligned}
\tag{14.7}
$$
Therefore, by (14.6) and (14.7), we get

$$
T(x, x')\,p(x) = T(x', x)\,p(x'), \quad \forall x \neq x'
\tag{14.8}
$$

and the condition is trivially satisfied for x′ = x.

For the second case, assume p(x′)g(x|x′) ≥ p(x)g(x′|x); the proof follows in a similar way.
Thus,

$$
T(x, x')\,p(x) = T(x', x)\,p(x'), \quad \forall x, x'
\tag{14.9}
$$

Therefore, p(x) is the stationary distribution. Assuming the regularity conditions hold, p(x) is also
the limiting distribution. This is why, when doing MCMC with Metropolis-Hastings, you ultimately
sample from the full posterior distribution.
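
To make the mechanics concrete, here is a minimal sketch of a random-walk Metropolis-Hastings sampler in Python. The function name, the target `log_post`, the proposal scale `prop_sd`, and the starting value are illustrative choices, not from the lecture; with a symmetric Gaussian proposal, g(x′|x) = g(x|x′), so the acceptance probability (14.4) reduces to min(1, p(x′)/p(x)).

```python
import numpy as np

def metropolis_hastings(log_post, x0, prop_sd=1.0, n_iter=5000, seed=0):
    """Random-walk Metropolis-Hastings for a continuous target.

    log_post: function returning the log posterior density at a point.
    x0:       starting value of the chain.
    prop_sd:  standard deviation of the Gaussian proposal g(x'|x).
    """
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    x, logp_x = x0, log_post(x0)
    for i in range(n_iter):
        # Propose x' from the symmetric proposal g(x'|x) = N(x, prop_sd^2).
        x_prop = x + prop_sd * rng.standard_normal()
        logp_prop = log_post(x_prop)
        # Accept with probability a(x, x') = min(1, p(x')/p(x)),
        # since the symmetric proposal density cancels in (14.4).
        if np.log(rng.uniform()) < logp_prop - logp_x:
            x, logp_x = x_prop, logp_prop
        chain[i] = x
    return chain

# Example: sample from a standard normal "posterior".
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
```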

Now consider the Gibbs sampler. Let p(θ1, θ2) be the full posterior distribution for (θ1, θ2), and let
p(θ1|θ2) and p(θ2|θ1) be the conditional distributions of θ1|θ2 and θ2|θ1. The transition
kernel is given by

$$
T((\theta_1, \theta_2), (\theta_1', \theta_2')) = p(\theta_1'|\theta_2)\,p(\theta_2'|\theta_1')
\tag{14.10}
$$

For the Gibbs sampler, the detailed balance condition does not hold, so we need to use the definition
of the stationary distribution directly.

We want to prove

$$
\iint T((\theta_1, \theta_2), (\theta_1', \theta_2'))\,p(\theta_1, \theta_2)\,d\theta_1\,d\theta_2 = p(\theta_1', \theta_2'), \quad \forall\, \theta_1', \theta_2'
\tag{14.11}
$$

The L.H.S. can be written as

$$
\begin{aligned}
\text{L.H.S.} &= \iint p(\theta_1'|\theta_2)\,p(\theta_2'|\theta_1')\,p(\theta_1, \theta_2)\,d\theta_1\,d\theta_2 \\
&= p(\theta_2'|\theta_1') \int_{\theta_2} p(\theta_1'|\theta_2)\left(\int_{\theta_1} p(\theta_1, \theta_2)\,d\theta_1\right) d\theta_2 \\
&= p(\theta_2'|\theta_1') \int_{\theta_2} p(\theta_1'|\theta_2)\,p(\theta_2)\,d\theta_2 \\
&= p(\theta_2'|\theta_1')\,p(\theta_1') \\
&= p(\theta_1', \theta_2') = \text{R.H.S.}
\end{aligned}
\tag{14.12}
$$
This implies that p(θ1, θ2) is the stationary distribution for the Gibbs sampling kernel. Under
ergodicity, the stationary distribution is also the limiting distribution. Thus, the Gibbs sampler is
also valid.

This validation is for the Gibbs sampler with two parameters. A similar argument can be
generalized to Gibbs samplers with more parameters.
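
As an illustration, here is a minimal sketch of a two-parameter Gibbs sampler; the standard bivariate normal target with correlation rho and all parameter values are illustrative choices, not from the lecture. For this target the full conditionals are the well-known univariate normals θ1|θ2 ∼ N(ρθ2, 1−ρ²) and θ2|θ1 ∼ N(ρθ1, 1−ρ²), and alternating draws from them form the transition kernel in (14.10).

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)
    theta1, theta2 = 0.0, 0.0
    chain = np.empty((n_iter, 2))
    for i in range(n_iter):
        theta1 = rng.normal(rho * theta2, sd)   # draw from p(theta1 | theta2)
        theta2 = rng.normal(rho * theta1, sd)   # draw from p(theta2 | theta1)
        chain[i] = (theta1, theta2)
    return chain

samples = gibbs_bivariate_normal()
```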

Gaussian Process

Definition 14.3 (Gaussian Process) A Gaussian process is a stochastic process {Xt : t ≥ 0}
whose finite dimensional distributions are multivariate normal.

Why is the Gaussian process so important? Consider an example where the data (yi, xi) for
i = 1, ⋯, n look like Figure 14.1.
[Figure: scatter of y versus x (x from 0 to 5, y roughly from −4 to 4), with a linear fit and a spline fit overlaid.]

FIGURE 14.1: An example when we may need to use a Gaussian process to fit the data.

We may first consider fitting the data using linear regression, that is

$$
y_i = \mu + x_i\beta + \epsilon_i
\tag{14.13}
$$

which is shown as the blue line in Figure 14.1. It is not a good fit to the data: there is clearly an
inherent nonlinear relationship between x and y that is not captured by the linear regression
model. Therefore, we consider fitting a nonlinear regression model

$$
y_i = f(x_i) + \epsilon_i
\tag{14.14}
$$

where f is not a linear function.

There are many different ways to make f nonlinear. For example, f can be piecewise polynomial
or, more generally, f can be constructed using spline functions. The idea behind this kind of
method is to fit locally linear or polynomial functions to the data, then add constraints so that the
function is smooth at the boundaries. However, there are some issues with this kind of
technique. First of all, you need to know how many spline basis functions to use. Every
spline function is represented by some knots, and you have to decide where to put the knots; there is
no automatic way to do this. These drawbacks lead to the use of a Gaussian process to
estimate the nonlinear function f. The Gaussian process is much more automatic than spline
functions and is also flexible.

f(⋅) is an unknown function and our job is to put a prior distribution on f(⋅). Here f(⋅) is an
infinite dimensional quantity, and a stochastic process can act as a prior on an infinite dimensional
function. This leads to the idea of using a Gaussian process prior on f(⋅).

Formally, we say f(⋅) ∼ GP(μ, Cν(d, θ)). Here we inherently assume that the fitted Gaussian
process is stationary, which means Cov(f(x), f(x′)) is just a function of x − x′.
f(⋅) ∼ GP(μ, Cν(d, θ)) implies that E(f(xi)) = μ for all xi and
Cov(f(xi), f(xj)) = Cν(d, θ), where d = |xi − xj| and θ, ν are parameters. Cν(⋅, ⋅) is called
the covariance kernel of the Gaussian process.

Question: How do we choose the covariance kernel?

Definition 14.4 (Matern covariance kernel) The Matern family of covariance kernels is defined as

$$
C_\nu(d, \phi, \sigma^2) = \sigma^2\,\frac{2^{1-\nu}}{\Gamma(\nu)}\,\bigl(\sqrt{2\nu}\,\phi d\bigr)^{\nu}\,K_\nu\bigl(\sqrt{2\nu}\,\phi d\bigr)
\tag{14.15}
$$

where Kν(⋅) is the modified Bessel function of the second kind.

For more about the Matern family of covariance functions, one can refer to Stein (1999).

Here are some special cases of the Matern covariance function:

- ν = 1/2: Cν(d, ϕ, σ²) = σ² exp(−dϕ), which is also known as the exponential covariance kernel;
- ν = 3/2: Cν(d, ϕ, σ²) = σ²(1 + √3 dϕ) exp(−√3 dϕ);
- ν = 5/2: Cν(d, ϕ, σ²) = σ²(1 + √5 dϕ + (5/3) d²ϕ²) exp(−√5 dϕ);
- ν → ∞: Cν(d, ϕ, σ²) = σ² exp(−d²ϕ²/2), which is also known as the Gaussian covariance kernel.
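
A direct transcription of these special cases into code may be helpful; this is a minimal sketch, with the function name and the restriction to the four cases above being my own choices.

```python
import numpy as np

def matern_cov(d, phi, sigma2, nu):
    """Matern covariance C_nu(d, phi, sigma2) for nu = 1/2, 3/2, 5/2, or inf."""
    if nu == 0.5:                       # exponential kernel
        return sigma2 * np.exp(-d * phi)
    if nu == 1.5:
        return sigma2 * (1 + np.sqrt(3) * d * phi) * np.exp(-np.sqrt(3) * d * phi)
    if nu == 2.5:
        return sigma2 * (1 + np.sqrt(5) * d * phi + (5.0 / 3.0) * d**2 * phi**2) \
               * np.exp(-np.sqrt(5) * d * phi)
    if nu == np.inf:                    # Gaussian kernel
        return sigma2 * np.exp(-(d * phi) ** 2 / 2)
    raise ValueError("only nu in {1/2, 3/2, 5/2, inf} implemented here")
```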
If you draw a function from a Gaussian process with the covariance kernel specified by Matern,
the function you draw will be ⌊ν⌋ times differentiable. By choosing ν appropriately, we can
control the smoothness of the functions we draw from the Gaussian process prior. For example,

- If ν = 1/2, we will draw functions which are nowhere differentiable;
- If ν = 3/2, we will draw functions which are once differentiable;
- If ν → ∞, we will draw functions which are infinitely differentiable (smooth functions).


This result implies that if we use a GP prior with a Matern covariance kernel with ν = 1/2, then the
posterior will place probability 1 on functions which are nowhere differentiable.
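
To see how ν controls smoothness, here is a sketch that draws one sample path from the GP prior on a grid for each of the special cases, reusing the matern_cov function sketched above; the grid, ϕ = 1, σ² = 1, and the small jitter added for numerical stability are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 200)
d = np.abs(x[:, None] - x[None, :])      # pairwise distances |x_i - x_j|

for nu in (0.5, 1.5, 2.5, np.inf):
    K = matern_cov(d, phi=1.0, sigma2=1.0, nu=nu) + 1e-8 * np.eye(len(x))
    # Sample paths get visibly smoother as nu increases.
    f = rng.multivariate_normal(np.zeros(len(x)), K)
```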

Sometimes people also use powered exponential covariance kernels.

Definition 14.5 (Powered exponential covariance kernels) The powered exponential covariance
kernel is defined as

$$
C(d, \phi, \sigma^2, \alpha) = \sigma^2 \exp(-\phi d^{\alpha})
\tag{14.16}
$$

where α is called the power of the powered exponential kernel. If α = 1, the covariance
function is called the exponential kernel, while if α = 2, it is called the Gaussian kernel.
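
A one-line version of (14.16), with α = 1 and α = 2 recovering the exponential and Gaussian kernels above; the function and parameter names are illustrative.

```python
import numpy as np

def powered_exponential(d, phi, sigma2, alpha):
    """Powered exponential covariance C(d) = sigma2 * exp(-phi * d**alpha)."""
    return sigma2 * np.exp(-phi * d**alpha)
```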
Suppose we want to fit y = f(x) + ϵ, and we have data (yi, xi) for i = 1, ⋯, n. How do we fit the
model? How do we find p(f | y1, x1, ⋯, yn, xn)? We will discuss these questions later.

References

Stein, Michael L. 1999. Interpolation of Spatial Data: Some Theory for Kriging. New York, NY: Springer.
