
Bayesian Inference for Normal Mean and Precision Under Independent Priors


This note describes Bayesian inference for Normal data where the variance
is unknown. The example also illustrates the use of a simple Markov Chain
Monte-Carlo procedure. The development here is adapted from Hoff (2009),
Sections 5.2, 6.2 and 6.3.
The likelihood
Suppose that we have i.i.d. data $Y_1, \ldots, Y_n \sim N(\theta, 1/\tau^2)$.
Here $\tau^2 = 1/V(Y_i)$ is the reciprocal of the variance, which is called the
precision.
The density for the likelihood for a single observation is:

$$f(y_i \mid \theta, \tau^2) = \sqrt{\frac{\tau^2}{2\pi}} \exp\left\{-\tau^2 (y_i - \theta)^2/2\right\} = \text{dnorm}(y_i, \theta, 1/\tau)$$

where dnorm$(y_i, \theta, 1/\tau)$ is the R function for the density, and we use $1/\tau$ here
since the function takes the standard deviation, $1/\tau$, as the third argument.
Note that this is simply the Normal likelihood

$$f(y_i \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-(y_i - \theta)^2/(2\sigma^2)\right\}$$

substituting $\tau^2 = 1/\sigma^2$. Similarly, the likelihood for the full data is:
$$f(y_1, \ldots, y_n \mid \theta, \tau^2) = \prod_{i=1}^{n} \sqrt{\frac{\tau^2}{2\pi}} \exp\left\{-\tau^2 (y_i - \theta)^2/2\right\} = (2\pi)^{-n/2} (\tau^2)^{n/2} \exp\left\{-(\tau^2/2) \sum_{i=1}^{n} (y_i - \theta)^2\right\} = \text{prod(dnorm}(y, \theta, 1/\tau)\text{)}$$

where $y = (y_1, \ldots, y_n)$ is a vector, and prod is the R function which takes the
product of the elements in a vector, here $(\text{dnorm}(y_1, \theta, 1/\tau), \ldots, \text{dnorm}(y_n, \theta, 1/\tau))$.
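For concreteness, this can be evaluated directly in R. The sketch below uses a made-up data vector and made-up values of theta and tau2 purely for illustration; note that the third argument of dnorm is the standard deviation $1/\tau = 1/\sqrt{\tau^2}$.

## Evaluate the Normal likelihood at a candidate (theta, tau^2).
## The data y and the values of theta and tau2 are illustrative only.
y     <- c(9.4, 10.1, 8.7, 11.2, 10.5)
theta <- 10      # candidate value of the mean
tau2  <- 0.5     # candidate value of the precision

prod(dnorm(y, theta, 1/sqrt(tau2)))             # likelihood: sd = 1/tau = 1/sqrt(tau2)
sum(dnorm(y, theta, 1/sqrt(tau2), log = TRUE))  # log-likelihood, numerically safer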
The Gamma Distribution
Here we introduce the Gamma distribution, which we will use as a prior
on the precision $\tau^2$. This is a distribution with support on $(0, \infty)$, which
generalizes the $\chi^2$ distributions that you may have encountered before.
A random variable $\gamma \sim \text{Gamma}(a, b)$ if:

$$f(\gamma) = \text{dgamma}(\gamma, a, b) = \begin{cases} \dfrac{b^a}{\Gamma(a)} \gamma^{a-1} e^{-b\gamma} & \gamma > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $a, b > 0$; dgamma is the R function for the density.
Some other useful facts about Gamma distributions:

- The $\chi^2(k)$ distribution on $k$ degrees of freedom corresponds to $\text{Gamma}(k/2, 1/2)$.
- $\text{Gamma}(1, b)$ is the same as $\text{Exp}(b)$.
- If $X \sim \text{Gamma}(a, b)$ then $kX \sim \text{Gamma}(a, b/k)$.
- If $X_1 \sim \text{Gamma}(a_1, b)$ and $X_2 \sim \text{Gamma}(a_2, b)$ independently, then $X_1 + X_2 \sim \text{Gamma}(a_1 + a_2, b)$.
- If $U \sim \text{Unif}(0, 1)$, then $-\log U \sim \text{Gamma}(1, 1) = \text{Exp}(1)$.
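A couple of these facts are easy to spot-check numerically in R; the sample size and seed below are arbitrary.

## Numerical spot-checks of two of the facts above (illustrative only).
set.seed(1)
x <- rgamma(1e5, shape = 3/2, rate = 1/2)   # Gamma(k/2, 1/2) with k = 3
w <- rchisq(1e5, df = 3)                    # chi-squared on 3 degrees of freedom
c(mean(x), mean(w))                         # both should be close to 3

u <- runif(1e5)
mean(-log(u))                               # close to 1, the mean of Exp(1) = Gamma(1, 1)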
Prior Distribution on $\theta$ and $\tau^2$
We will use a prior distribution for $\theta$ and $\tau^2$ under which $\theta$ and $\tau^2$ are independent, with

$$\theta \sim N(\mu_0, 1/\tau_0^2), \qquad \tau^2 \sim \text{Gamma}(\nu_0/2, \; \nu_0/(2\sigma_0^2)),$$

where $\mu_0$ and $\tau_0^2, \nu_0, \sigma_0^2 > 0$ are constants specifying our prior distributions.
Thus the prior density is given by:

$$f(\theta, \tau^2) = f(\theta) f(\tau^2) = \text{dnorm}(\theta, \mu_0, 1/\tau_0) \, \text{dgamma}(\tau^2, \nu_0/2, \nu_0/(2\sigma_0^2))$$
Under this prior distribution our beliefs about $\theta$ and $\tau^2$ are unrelated.
This independence makes sense in contexts where our prior information about
the population mean and variance arises from different sources. However,
this prior is not conjugate to the normal likelihood; $\theta$ and $\tau^2$ are no longer
independent in the posterior distribution.
(Note that a prior distribution on $\tau^2$ automatically implies a prior on
$\sigma^2 = 1/\tau^2$; the resulting distribution on $\sigma^2$ is called the inverse-gamma
distribution.)
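For instance, one could draw from this prior in R as follows. This is only a sketch: the hyperparameter values are arbitrary illustrative choices, not values used in the lab.

## Draw from the independent prior; hyperparameter values are illustrative.
mu0   <- 5      # prior mean of theta
tau20 <- 0.04   # prior precision of theta (prior sd = 1/sqrt(tau20) = 5)
nu0   <- 2      # first hyperparameter of the Gamma prior, as parameterized above
s20   <- 1      # sigma_0^2 in the Gamma prior above

theta.pri  <- rnorm(1e4, mu0, 1/sqrt(tau20))
tau2.pri   <- rgamma(1e4, nu0/2, nu0/(2*s20))
sigma2.pri <- 1/tau2.pri        # implied inverse-gamma draws for the variance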
Posterior Distribution on $\theta$ and $\tau^2$
Using Bayes' rule we obtain:

$$f(\theta, \tau^2 \mid y_1, \ldots, y_n) \propto f(\theta) f(\tau^2) f(y_1, \ldots, y_n \mid \theta, \tau^2).$$
We would like to look at plots of this joint posterior $f(\theta, \tau^2 \mid y_1, \ldots, y_n)$
and also to compute the marginal posterior distributions $f(\tau^2 \mid y_1, \ldots, y_n)$ and
$f(\theta \mid y_1, \ldots, y_n)$.

However, it turns out that under this prior the marginal posterior for $\tau^2$,
$f(\tau^2 \mid y_1, \ldots, y_n)$, is not a standard distribution. Consequently we will need
to use other techniques in order to proceed. We now describe two methods:

1. Evaluating the posterior $f(\theta, \tau^2 \mid y_1, \ldots, y_n)$ on a grid of values and
then using this as an approximation.

2. Using a Gibbs sampler, which is a form of Markov Chain Monte-Carlo
(MCMC), to obtain a sequence of dependent samples
$(\theta^{(1)}, \tau^{2(1)}), \ldots, (\theta^{(N)}, \tau^{2(N)})$.

The former method is simpler to understand, but cannot be scaled up to
problems with more than a few parameters. The latter approach can be
applied to models with thousands of parameters.
Using a discrete approximation to the posterior
We know the posterior distribution up to a constant of proportionality, $c$:

$$f(\theta, \tau^2 \mid y_1, \ldots, y_n) = c \, f(\theta) f(\tau^2) f(y_1, \ldots, y_n \mid \theta, \tau^2),$$

where $c = 1/f(y_1, \ldots, y_n)$. Consequently, we can approximate this density by con-
structing a two-dimensional grid of equally spaced values for $\theta$ and $\tau^2$, evaluating
the density at each point and then normalizing by summing over all the possible values:
Let $\{\theta_1, \ldots, \theta_G\}$ be a sequence of equally spaced values for $\theta$, and similarly
let $\{\tau^2_1, \ldots, \tau^2_H\}$ be a sequence of equally spaced values for $\tau^2$. We define a
discrete approximation $f_{DA}(\theta_k, \tau^2_l \mid y_1, \ldots, y_n)$ to the posterior as follows:

$$f_{DA}(\theta_k, \tau^2_l \mid y_1, \ldots, y_n) = \frac{f(\theta_k) f(\tau^2_l) f(y_1, \ldots, y_n \mid \theta_k, \tau^2_l)}{\sum_{g=1}^{G} \sum_{h=1}^{H} f(\theta_g) f(\tau^2_h) f(y_1, \ldots, y_n \mid \theta_g, \tau^2_h)}$$

where $\theta_k \in \{\theta_1, \ldots, \theta_G\}$ and $\tau^2_l \in \{\tau^2_1, \ldots, \tau^2_H\}$. In effect, we have replaced
our continuous bivariate prior distribution $f(\theta, \tau^2)$ with a discrete bivariate
approximation taking values on the grid. Thus although $f(\theta_k, \tau^2_l \mid y_1, \ldots, y_n)$
is a joint density, $f_{DA}(\theta_k, \tau^2_l \mid y_1, \ldots, y_n)$ is a joint mass function.
Computation of marginal distributions from this approximation is then
simple:

$$f_{DA}(\tau^2_l \mid y_1, \ldots, y_n) = \sum_{g=1}^{G} f_{DA}(\theta_g, \tau^2_l \mid y_1, \ldots, y_n)$$
(See R code from the lab.)
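The lab code is not reproduced here, but a minimal sketch of the idea, using the same made-up data, hyperparameters and arbitrary grid limits as in the earlier sketches, might look like this:

## Discrete (grid) approximation to the joint posterior -- illustrative sketch.
## y, the hyperparameters and the grid ranges are all made-up values.
y <- c(9.4, 10.1, 8.7, 11.2, 10.5)
mu0 <- 5; tau20 <- 0.04; nu0 <- 2; s20 <- 1

theta.grid <- seq(5, 15, length.out = 100)     # G = 100 values for theta
tau2.grid  <- seq(0.01, 2, length.out = 100)   # H = 100 values for tau^2

post <- matrix(NA, 100, 100)                   # rows index theta, columns index tau^2
for (k in 1:100) {
  for (l in 1:100) {
    post[k, l] <- dnorm(theta.grid[k], mu0, 1/sqrt(tau20)) *
      dgamma(tau2.grid[l], nu0/2, nu0/(2*s20)) *
      prod(dnorm(y, theta.grid[k], 1/sqrt(tau2.grid[l])))
  }
}
post <- post / sum(post)                       # normalize: f_DA on the grid

tau2.marg  <- colSums(post)                    # f_DA(tau^2_l | y): sum over theta
theta.marg <- rowSums(post)                    # f_DA(theta_k | y): sum over tau^2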
Note that if $G = H = 100$ then we have 10,000 points in the 2-d grid.
Thus this general approach will become infeasible in other problems where
we may have large numbers of parameters, since we would have $(100)^p$ points
with $p$ variables.
Using a Gibbs sampler
Although it is difficult to compute marginal posterior distributions, it is much
simpler to find conditional posterior distributions:

$$f(\theta \mid \tau^2, y_1, \ldots, y_n) \propto f(\theta) f(y_1, \ldots, y_n \mid \theta, \tau^2).$$

This conditional posterior for $\theta$ given $\tau^2$ is identical to that which we obtained
in the case where $\tau^2$ was known (and hence also $\sigma^2$ was known). Thus:
$$f(\theta \mid \tau^2, y_1, \ldots, y_n) = \text{dnorm}(\theta, \mu_*, 1/\tau_*)$$

where

$$\mu_* = \frac{\mu_0 \tau_0^2 + \bar{y}\, n \tau^2}{\tau_0^2 + n \tau^2} \qquad \text{and} \qquad \tau_*^2 = \tau_0^2 + n \tau^2$$

(these expressions are identical to those that we saw for the mean and vari-
ance of the posterior distribution of the mean in Lecture 9, simply replacing
variances with precisions).
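In R, a single draw from this full conditional could be coded as below; the values of y, the hyperparameters and the current precision are the same illustrative choices as in the earlier sketches.

## One draw of theta from its full conditional, given the current precision tau2.
## Illustrative values, matching the earlier sketches.
y <- c(9.4, 10.1, 8.7, 11.2, 10.5); n <- length(y); ybar <- mean(y)
mu0 <- 5; tau20 <- 0.04                   # prior mean and prior precision of theta
tau2 <- 0.5                               # current value of the precision

mu.star   <- (mu0*tau20 + ybar*n*tau2) / (tau20 + n*tau2)
tau2.star <- tau20 + n*tau2
theta     <- rnorm(1, mu.star, 1/sqrt(tau2.star))   # sd = 1/tau.star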
The conditional posterior distribution of $\tau^2$ given $\theta$ also takes a fairly
simple form:

$$f(\tau^2 \mid \theta, y_1, \ldots, y_n) \propto f(\tau^2) f(y_1, \ldots, y_n \mid \theta, \tau^2)$$
$$\propto (\tau^2)^{(\nu_0/2) - 1} \exp\left\{-\nu_0 \tau^2/(2\sigma_0^2)\right\} \times (\tau^2)^{n/2} \exp\left\{-(\tau^2/2) \sum_{i=1}^{n} (y_i - \theta)^2\right\}$$
$$= (\tau^2)^{((\nu_0 + n)/2) - 1} \exp\left\{-\left[\nu_0/\sigma_0^2 + \sum_{i=1}^{n} (y_i - \theta)^2\right] \tau^2 / 2\right\}$$
This is the form of a $\text{Gamma}(a_*, b_*)$ density with

$$a_* = (\nu_0 + n)/2 \qquad \text{and} \qquad b_* = \left(\nu_0/\sigma_0^2 + \sum_{i=1}^{n} (y_i - \theta)^2\right)/2.$$
Notice that

$$\sum_{i=1}^{n} (y_i - \theta)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - \theta)^2 = (n-1)\,\text{var}(y) + n(\bar{y} - \theta)^2$$

where var(y) is the sample variance function in R applied to the vector $y$.
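So, given the current value of $\theta$, a draw of $\tau^2$ from its full conditional could be coded as follows; again this is a sketch with the same illustrative data and hyperparameters as above.

## One draw of tau^2 from its full conditional, given the current theta.
## Illustrative values, matching the earlier sketches.
y <- c(9.4, 10.1, 8.7, 11.2, 10.5); n <- length(y); ybar <- mean(y)
nu0 <- 2; s20 <- 1                          # Gamma prior hyperparameters
theta <- 10                                 # current value of theta

a.star <- (nu0 + n)/2
b.star <- (nu0/s20 + (n - 1)*var(y) + n*(ybar - theta)^2) / 2
tau2   <- rgamma(1, a.star, b.star)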
Our general scheme for generating the (dependent) sequence of pairs
$(\theta^{(i)}, \tau^{2(i)})$ is then to sample in turn from the conditional posteriors
$f(\theta \mid \tau^2, y_1, \ldots, y_n)$ and $f(\tau^2 \mid \theta, y_1, \ldots, y_n)$:

0. Select starting values $(\theta^{(0)}, \tau^{2(0)})$, set $i = 0$.

1. Sample $\theta^{(i+1)} \sim f(\theta \mid \tau^{2(i)}, y_1, \ldots, y_n)$.

2. Sample $\tau^{2(i+1)} \sim f(\tau^2 \mid \theta^{(i+1)}, y_1, \ldots, y_n)$.

3. Add $(\theta^{(i+1)}, \tau^{2(i+1)})$ to our sequence of values; increase $i$ by 1 and go
to 1.
If we generate a long enough sequence of values then after an initial
burn-in period our samples will no longer depend on the starting values and
will be samples from the posterior distribution. Thus one often throws away
the first 50 or 100 samples. (How many need to be thrown away depends on
how much dependence is present in the chain, and on whether the starting values
came from the middle of the distribution.)
We may then look at scatterplots and histograms of these samples in order
to understand the joint and marginal posterior distributions. We can also
compute means, variances, medians etc., for these posterior distributions.
It is also easy to obtain the posterior distribution over $\sigma^2 = 1/\tau^2$ by
simply taking the reciprocals of the sampled precisions, i.e. letting $\sigma^{2(i)} = 1/\tau^{2(i)}$.
(See R code from the lab.)
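Again, the lab code is not reproduced here; a minimal self-contained sketch of the Gibbs sampler, using the same made-up data and hyperparameters as in the earlier sketches, might look like this:

## Gibbs sampler for (theta, tau^2) under the independent prior -- illustrative sketch.
y <- c(9.4, 10.1, 8.7, 11.2, 10.5)            # made-up data
mu0 <- 5; tau20 <- 0.04; nu0 <- 2; s20 <- 1   # made-up hyperparameters
n <- length(y); ybar <- mean(y)

N <- 10000
theta.samp <- tau2.samp <- numeric(N)
tau2 <- 1/var(y)                              # starting value for the precision

set.seed(1)
for (i in 1:N) {
  # 1. theta | tau2, y
  mu.star   <- (mu0*tau20 + ybar*n*tau2) / (tau20 + n*tau2)
  tau2.star <- tau20 + n*tau2
  theta     <- rnorm(1, mu.star, 1/sqrt(tau2.star))

  # 2. tau2 | theta, y
  a.star <- (nu0 + n)/2
  b.star <- (nu0/s20 + (n - 1)*var(y) + n*(ybar - theta)^2) / 2
  tau2   <- rgamma(1, a.star, b.star)

  theta.samp[i] <- theta
  tau2.samp[i]  <- tau2
}

keep <- 101:N                                 # discard the first 100 draws as burn-in
sigma2.samp <- 1/tau2.samp[keep]              # posterior draws of the variance
c(mean(theta.samp[keep]), mean(sigma2.samp))  # posterior means, for example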