Conjugate Bayesian analysis of the Gaussian distribution
Kevin P. Murphy∗
murphyk@cs.ubc.ca
Last updated October 3, 2007
1 Introduction
The Gaussian or normal distribution is one of the most widely used in statistics. Estimating its parameters using
Bayesian inference and conjugate priors is also widely used. The use of conjugate priors allows all the results to be
derived in closed form. Unfortunately, different books use different conventions on how to parameterize the various
distributions (e.g., put the prior on the precision or the variance, use an inverse gamma or inverse chi-squared, etc),
which can be very confusing for the student. In this report, we summarize all of the most commonly used forms. We
provide detailed derivations for some of these results; the rest can be obtained by simple reparameterization. See the
appendix for the definitions of the distributions that are used.
2 Normal prior
Let us consider Bayesian estimation of the mean of a univariate Gaussian, whose variance is assumed to be known.
(We discuss the unknown variance case later.)
2.1 Likelihood
Let D = (x1 , . . . , xn ) be the data. The likelihood is
$$p(D|\mu, \sigma^2) = \prod_{i=1}^n p(x_i|\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big) \tag{1}$$
Hence
$$p(D|\mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\Big(-\frac{1}{2\sigma^2}\big[ns^2 + n(\bar{x} - \mu)^2\big]\Big) \tag{8}$$
$$\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2} \exp\Big(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\Big) \exp\Big(-\frac{ns^2}{2\sigma^2}\Big) \tag{9}$$
where s² = (1/n) Σᵢ (xᵢ − x̄)² is the sample variance.

2.2 Prior

The natural conjugate prior is a Gaussian,
$$p(\mu) = N(\mu|\mu_0, \sigma_0^2) \propto \exp\Big(-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\Big)$$
(Do not confuse σ0², which is the variance of the prior, with σ², which is the variance of the observation noise.) (A natural conjugate prior is one that has the same form as the likelihood.)
2.3 Posterior
The posterior is proportional to the likelihood times the prior, p(µ|D) ∝ p(D|µ)p(µ). Since the product of two Gaussians is a Gaussian, we will rewrite this in the form
$$p(\mu|D) \propto \exp\Big(-\frac{\mu^2}{2}\Big[\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\Big] + \mu\Big[\frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma^2}\Big] - \Big[\frac{\mu_0^2}{2\sigma_0^2} + \frac{\sum_i x_i^2}{2\sigma^2}\Big]\Big) \tag{16}$$
$$\stackrel{\text{def}}{=} \exp\Big(-\frac{1}{2\sigma_n^2}(\mu^2 - 2\mu\mu_n + \mu_n^2)\Big) = \exp\Big(-\frac{1}{2\sigma_n^2}(\mu - \mu_n)^2\Big) \tag{17}$$
Matching the coefficients of µ², we have
$$-\frac{\mu^2}{2\sigma_n^2} = -\frac{\mu^2}{2}\Big[\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\Big] \tag{18}$$
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \tag{19}$$
$$\sigma_n^2 = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2} = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \tag{20}$$
Figure 1: Sequentially updating a Gaussian mean starting with a prior centered on µ0 = 0. The true parameters are µ∗ = 0.8
(unknown), (σ 2 )∗ = 0.1 (known). Notice how the data quickly overwhelms the prior, and how the posterior becomes narrower.
Source: Figure 2.12 [Bis06].
Hence
$$\mu_n = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\bar{x} = \sigma_n^2\Big(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}\Big) \tag{24}$$
This operation of matching first and second powers of µ is called completing the square.
Another way to understand these results is to work with the precision of a Gaussian, which is 1/variance (high precision means low variance, low precision means high variance). Let
$$\lambda = 1/\sigma^2 \tag{25}$$
$$\lambda_0 = 1/\sigma_0^2 \tag{26}$$
$$\lambda_n = 1/\sigma_n^2 \tag{27}$$
Figure 2: Bayesian estimation of the mean of a Gaussian from one sample. (a) Weak prior N (0, 10). (b) Strong prior N (0, 1). In
the latter case, we see the posterior mean is “shrunk” towards the prior mean, which is 0. Figure produced by gaussBayesDemo.
In terms of precisions, the posterior updates become
$$\lambda_n = \lambda_0 + n\lambda \tag{28}$$
$$\mu_n = \frac{n\lambda\bar{x} + \lambda_0\mu_0}{\lambda_n} = w\mu_{ML} + (1 - w)\mu_0 \tag{29--30}$$
where nx̄ = Σᵢ xᵢ, µ_ML = x̄, and w = nλ/λn. The precision of the posterior λn is the precision of the prior λ0 plus one contribution of data precision λ for each observed data point. Also, we see the mean of the posterior is a convex combination of the prior mean and the MLE, with weights proportional to the relative precisions.
To gain further insight into these equations, consider the effect of sequentially updating our estimate of µ (see Figure 1). After observing one data point x (so n = 1), we have the following posterior mean:
$$\mu_1 = \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\mu_0 + \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}x \tag{31}$$
$$= \mu_0 + (x - \mu_0)\frac{\sigma_0^2}{\sigma^2 + \sigma_0^2} \tag{32}$$
$$= x - (x - \mu_0)\frac{\sigma^2}{\sigma^2 + \sigma_0^2} \tag{33}$$
The first equation is a convex combination of the prior and MLE. The second equation is the prior mean adjusted towards the data x. The third equation is the data x adjusted towards the prior mean; this is called shrinkage. These are all equivalent ways of expressing the tradeoff between likelihood and prior. See Figure 2 for an example.
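As a concrete illustration, here is a minimal MATLAB sketch of these updates (the data and hyperparameter values below are made up for the example):

    % Posterior for mu with known sigma^2 (Equations 20, 24 and 28).
    sigma2 = 0.1;               % known observation noise variance
    mu0 = 0; sigma20 = 1;       % prior N(mu0, sigma20)
    x = [0.9 0.7 1.1 0.6];      % hypothetical data
    n = numel(x); xbar = mean(x);

    lambda  = 1/sigma2; lambda0 = 1/sigma20;           % precisions
    lambdan = lambda0 + n*lambda;                      % Equation 28
    sigma2n = 1/lambdan;                               % Equation 20
    mun     = sigma2n*(mu0/sigma20 + n*xbar/sigma2);   % Equation 24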
2.4 Posterior predictive

The posterior predictive is given by
$$p(x|D) = \int p(x|\mu)p(\mu|D)\,d\mu \tag{34}$$
$$= \int N(x|\mu, \sigma^2)\, N(\mu|\mu_n, \sigma_n^2)\,d\mu \tag{35}$$
$$= N(x|\mu_n, \sigma_n^2 + \sigma^2)$$
since we assume that the residual error is conditionally independent of the parameter. Thus the predictive variance is the uncertainty due to the observation noise, σ², plus the uncertainty due to the parameters, σn².
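Continuing the sketch above, the predictive density at a hypothetical test point xstar is then a Gaussian whose variance is the sum of the two terms (normpdf is from the Statistics Toolbox):

    % Posterior predictive at a test point (Equation 35's result).
    xstar = 0.5;
    pred  = normpdf(xstar, mun, sqrt(sigma2n + sigma2));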
2.5 Marginal likelihood
Writing m = µ0 and τ 2 = σ02 for the hyper-parameters, we can derive the marginal likelihood as follows:
$$\ell = p(D|m, \sigma^2, \tau^2) = \int \Big[\prod_{i=1}^n N(x_i|\mu, \sigma^2)\Big] N(\mu|m, \tau^2)\,d\mu \tag{41}$$
$$= \frac{\sigma}{(\sqrt{2\pi}\sigma)^n\sqrt{n\tau^2 + \sigma^2}} \exp\Big(-\frac{\sum_i x_i^2}{2\sigma^2} - \frac{m^2}{2\tau^2}\Big) \exp\Big(\frac{\frac{\tau^2 n^2\bar{x}^2}{\sigma^2} + \frac{\sigma^2 m^2}{\tau^2} + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)}\Big) \tag{42}$$
To derive this, define S ≝ 1/σ and T ≝ 1/τ, and let c collect the terms that do not depend on µ:
$$c = \frac{\exp\big(-\frac{1}{2}(S^2\sum_i x_i^2 + T^2 m^2)\big)}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)} \tag{47}$$
So
$$\ell = c\int \exp\Big(-\tfrac{1}{2}(S^2 n + T^2)\Big[\mu^2 - 2\mu\frac{S^2 n\bar{x} + T^2 m}{S^2 n + T^2}\Big]\Big)\,d\mu \tag{48}$$
$$= c\, \exp\Big(\frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)}\Big) \int \exp\Big(-\tfrac{1}{2}(S^2 n + T^2)\Big[\mu - \frac{S^2 n\bar{x} + T^2 m}{S^2 n + T^2}\Big]^2\Big)\,d\mu \tag{49}$$
$$= c\, \exp\Big(\frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)}\Big) \frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} \tag{50}$$
$$= \frac{\exp\big(-\frac{1}{2}(S^2\sum_i x_i^2 + T^2 m^2)\big)}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)}\, \exp\Big(\frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)}\Big) \frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} \tag{51}$$
Now
$$\frac{1}{\sqrt{(2\pi)/T^2}}\, \frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} = \frac{\sigma}{\sqrt{n\tau^2 + \sigma^2}} \tag{52}$$
and
$$\frac{\big(\frac{n\bar{x}}{\sigma^2} + \frac{m}{\tau^2}\big)^2}{2\big(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\big)} = \frac{(n\bar{x}\tau^2 + m\sigma^2)^2}{2\sigma^2\tau^2(n\tau^2 + \sigma^2)} \tag{53}$$
$$= \frac{n^2\bar{x}^2\tau^2/\sigma^2 + \sigma^2 m^2/\tau^2 + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)} \tag{54}$$
So
$$p(D) = \frac{\sigma}{(\sqrt{2\pi}\sigma)^n\sqrt{n\tau^2 + \sigma^2}} \exp\Big(-\frac{\sum_i x_i^2}{2\sigma^2} - \frac{m^2}{2\tau^2}\Big) \exp\Big(\frac{\frac{\tau^2 n^2\bar{x}^2}{\sigma^2} + \frac{\sigma^2 m^2}{\tau^2} + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)}\Big) \tag{55}$$
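The following MATLAB sketch evaluates the log of Equation 55 and sanity-checks it by numerically integrating µ out of Equation 41 (all data and hyperparameter values are made up; normpdf requires the Statistics Toolbox):

    % Log marginal likelihood of Equation 55, plus a quadrature check.
    sigma = 0.5; tau = 1; m = 0;
    x = [0.9 0.7 1.1 0.6]; n = numel(x); xbar = mean(x);

    logZ = log(sigma) - n*log(sqrt(2*pi)*sigma) - 0.5*log(n*tau^2 + sigma^2) ...
         - sum(x.^2)/(2*sigma^2) - m^2/(2*tau^2) ...
         + (tau^2*n^2*xbar^2/sigma^2 + sigma^2*m^2/tau^2 + 2*n*xbar*m) ...
           / (2*(n*tau^2 + sigma^2));

    % Brute-force check: integrate the likelihood against the prior.
    f = @(mu) arrayfun(@(u) prod(normpdf(x, u, sigma))*normpdf(u, m, tau), mu);
    logZcheck = log(integral(f, -20, 20));   % should agree with logZ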
3 Normal-Gamma prior

We now consider the case where both the mean µ and the precision λ = 1/σ² are unknown.

3.2 Prior

The conjugate prior is the normal-Gamma:
$$NG(\mu, \lambda|\mu_0, \kappa_0, \alpha_0, \beta_0) \stackrel{\text{def}}{=} N(\mu|\mu_0, (\kappa_0\lambda)^{-1})\, Ga(\lambda|\alpha_0, \text{rate} = \beta_0) \tag{63}$$
$$= \frac{1}{Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0)}\, \lambda^{1/2} \exp\Big(-\frac{\kappa_0\lambda}{2}(\mu - \mu_0)^2\Big)\, \lambda^{\alpha_0 - 1} e^{-\lambda\beta_0} \tag{64}$$
$$= \frac{1}{Z_{NG}}\, \lambda^{\alpha_0 - 1/2} \exp\Big(-\frac{\lambda}{2}\big[\kappa_0(\mu - \mu_0)^2 + 2\beta_0\big]\Big) \tag{65}$$
$$Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0) = \frac{\Gamma(\alpha_0)}{\beta_0^{\alpha_0}}\Big(\frac{2\pi}{\kappa_0}\Big)^{1/2} \tag{66}$$
Figure 3: Some normal-Gamma densities, plotted over (µ, λ), including NG(κ = 2.0, a = 1.0, b = 1.0) and NG(κ = 2.0, a = 3.0, b = 1.0).
The marginal on µ is obtained by integrating out λ. Using ∫ λ^(a−1) e^(−λb) dλ = Γ(a)/bᵃ with a = α0 + 1/2 and b = β0 + (κ0/2)(µ − µ0)², we get
$$p(\mu) \propto \frac{\Gamma(a)}{b^a} \tag{69}$$
$$\propto b^{-a} \tag{70}$$
$$= \Big(\beta_0 + \frac{\kappa_0}{2}(\mu - \mu_0)^2\Big)^{-\alpha_0 - 1/2} \tag{71}$$
$$\propto \Big(1 + \frac{1}{2\alpha_0}\frac{\alpha_0\kappa_0(\mu - \mu_0)^2}{\beta_0}\Big)^{-(2\alpha_0 + 1)/2} \tag{72}$$
which we recognize as a t-distribution, t_{2α0}(µ|µ0, β0/(α0 κ0)).
3.3 Posterior

The posterior can be derived as follows:
$$p(\mu, \lambda|D) \propto \lambda^{1/2} e^{-(\lambda/2)(\kappa_0 + n)(\mu - \mu_n)^2} \tag{81}$$
$$\times\, \lambda^{\alpha_0 + n/2 - 1} e^{-\beta_0\lambda}\, e^{-(\lambda/2)\sum_i (x_i - \bar{x})^2}\, e^{-(\lambda/2)\frac{\kappa_0 n(\bar{x} - \mu_0)^2}{\kappa_0 + n}} \tag{82}$$
$$\propto N(\mu|\mu_n, ((\kappa_0 + n)\lambda)^{-1}) \times Ga(\lambda|\alpha_0 + n/2, \beta_n) \tag{83}$$
where
$$\beta_n = \beta_0 + \tfrac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} \tag{84}$$
In summary, the posterior is p(µ, λ|D) = NG(µ, λ|µn, κn, αn, βn) with
$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n}, \quad \kappa_n = \kappa_0 + n, \quad \alpha_n = \alpha_0 + n/2, \quad \beta_n = \beta_0 + \tfrac{1}{2}\sum_i (x_i - \bar{x})^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{2(\kappa_0 + n)}$$
We see that the posterior sum of squares, βn, combines the prior sum of squares, β0, the sample sum of squares, Σᵢ (xᵢ − x̄)², and a term due to the discrepancy between the prior mean and the sample mean. As can be seen from Figure 3, the range of probable values for µ and σ² can be quite large even for moderate n. Keep this picture in mind whenever someone claims to have "fit a Gaussian" to their data.
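A minimal MATLAB sketch of these updates (illustrative values):

    % Normal-Gamma posterior update (the summary equations above).
    mu0 = 0; kappa0 = 1; alpha0 = 1; beta0 = 1;
    x = [0.9 0.7 1.1 0.6]; n = numel(x); xbar = mean(x);

    kappan = kappa0 + n;
    mun    = (kappa0*mu0 + n*xbar)/kappan;
    alphan = alpha0 + n/2;
    betan  = beta0 + 0.5*sum((x - xbar).^2) ...
           + kappa0*n*(xbar - mu0)^2/(2*kappan);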
3.3.1 Posterior marginals

The posterior marginals are (using Equation 72)
$$p(\lambda|D) = Ga(\lambda|\alpha_n, \beta_n)$$
$$p(\mu|D) = t_{2\alpha_n}(\mu|\mu_n, \beta_n/(\alpha_n\kappa_n))$$
The posterior predictive for m new data points D_new is
$$p(D_{new}|D) = \frac{p(D_{new}, D)}{p(D)} \tag{96}$$
$$= \frac{Z_{n+m}}{Z_0}(2\pi)^{-(n+m)/2}\, \frac{Z_0}{Z_n}(2\pi)^{n/2} \tag{97}$$
$$= \frac{Z_{n+m}}{Z_n}(2\pi)^{-m/2} \tag{98}$$
$$= \frac{\Gamma(\alpha_{n+m})}{\Gamma(\alpha_n)} \frac{\beta_n^{\alpha_n}}{\beta_{n+m}^{\alpha_{n+m}}} \Big(\frac{\kappa_n}{\kappa_{n+m}}\Big)^{1/2} (2\pi)^{-m/2} \tag{99}$$
In the special case that m = 1, it can be shown (see below) that this is a T-distribution:
$$p(x|D) = t_{2\alpha_n}\Big(x\Big|\mu_n, \frac{\beta_n(\kappa_n + 1)}{\alpha_n\kappa_n}\Big) \tag{100}$$
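Continuing the normal-Gamma sketch above, Equation 100 can be evaluated with the Statistics Toolbox's tpdf by standardizing to the location-scale form:

    % Posterior predictive (Equation 100) via the standard t density.
    nu = 2*alphan;                            % degrees of freedom
    s2 = betan*(kappan + 1)/(alphan*kappan);  % squared scale
    xstar = 0.5;
    pred  = tpdf((xstar - mun)/sqrt(s2), nu)/sqrt(s2);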
To derive the m = 1 result, we proceed as follows. (This proof is by Xiang Xuan, and is based on [GH94, p10].) When m = 1, the posterior parameters are α_{n+1} = αn + 1/2, κ_{n+1} = κn + 1, and β_{n+1} = βn + ½Σᵢ₌₁¹(xᵢ − x̄)² + κn(x − µn)²/(2(κn + 1)). Use the fact that when m = 1 we have x₁ = x̄ (since there is only one observation), hence ½Σᵢ₌₁¹(xᵢ − x̄)² = 0. Letting x denote D_new, β_{n+1} is
$$\beta_{n+1} = \beta_n + \frac{\kappa_n(x - \mu_n)^2}{2(\kappa_n + 1)} \tag{104}$$
With the improper reference prior p(µ, λ) ∝ 1/λ (obtained as κ0 → 0, α0 → −1/2, β0 → 0), the posterior is
$$p(\mu, \lambda|D) = NG\Big(\mu_n = \bar{x},\ \kappa_n = n,\ \alpha_n = (n-1)/2,\ \beta_n = \tfrac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2\Big) \tag{112}$$
The marginal for the mean is
$$p(\mu|D) = t_{n-1}\Big(\mu\Big|\bar{x}, \frac{\sum_i (x_i - \bar{x})^2}{n(n-1)}\Big) \tag{113}$$
which corresponds to the frequentist sampling distribution of the MLE µ̂. Thus in this case, the confidence interval and the credible interval coincide.
4 Gamma prior
If µ is known, and only λ is unknown (e.g., when implementing Gibbs sampling), we can use the following results, which can be derived by simplifying the results for the normal-Gamma model.
4.1 Likelihood
$$p(D|\lambda) \propto \lambda^{n/2} \exp\Big(-\frac{\lambda}{2}\sum_{i=1}^n (x_i - \mu)^2\Big) \tag{114}$$
4.2 Prior

The conjugate prior is the Gamma distribution, p(λ) = Ga(λ|α0, rate = β0).

4.3 Posterior

The posterior is
$$p(\lambda|D) = Ga(\lambda|\alpha_n, \beta_n), \quad \alpha_n = \alpha_0 + n/2, \quad \beta_n = \beta_0 + \tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^2$$
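For example, a Gibbs-sampling update for λ can be sketched in MATLAB as follows (illustrative values; note that gamrnd is parameterized by shape and scale, so we pass 1/βn):

    % One Gibbs draw of lambda given mu (Section 4.3 posterior).
    mu = 0.8; alpha0 = 1; beta0 = 1;
    x = [0.9 0.7 1.1 0.6]; n = numel(x);
    alphan = alpha0 + n/2;
    betan  = beta0 + 0.5*sum((x - mu).^2);
    lambda = gamrnd(alphan, 1/betan);   % Ga(alphan, rate = betan)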
5 Normal-inverse-chi-squared (NIX) prior

An alternative parameterization of the unknown mean, unknown variance model works with σ² directly rather than with the precision.

5.2 Prior

The normal-inverse-chi-squared prior is
$$N I\chi^2(\mu, \sigma^2|\mu_0, \kappa_0, \nu_0, \sigma_0^2) \stackrel{\text{def}}{=} N(\mu|\mu_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2|\nu_0, \sigma_0^2)$$
Figure 4: The NIχ²(µ0, κ0, ν0, σ0²) distribution. µ0 is the prior mean and κ0 is how strongly we believe this; σ0² is the prior variance and ν0 is how strongly we believe this. (a) µ0 = 0, κ0 = 1, ν0 = 1, σ0² = 1. Notice that the contour plot (underneath the surface) is shaped like a "squashed egg". (b) We increase the strength of our belief in the mean, so it gets narrower: µ0 = 0, κ0 = 5, ν0 = 1, σ0² = 1. (c) We increase the strength of our belief in the variance, so it gets narrower: µ0 = 0, κ0 = 1, ν0 = 5, σ0² = 1. (d) We strongly believe the mean and variance are 0.5: µ0 = 0.5, κ0 = 5, ν0 = 5, σ0² = 0.5. These plots were produced with NIXdemo2.
See Figure 4 for some plots. The hyperparameters µ0 and σ²/κ0 can be interpreted as the location and scale of µ, and the hyperparameters ν0 and σ0² as the degrees of freedom and scale of σ².
For future reference, it is useful to note that the quadratic term in the prior can be written as
$$Q_0(\mu) = S_0 + \kappa_0(\mu - \mu_0)^2 \tag{127}$$
$$= \kappa_0\mu^2 - 2(\kappa_0\mu_0)\mu + (\kappa_0\mu_0^2 + S_0) \tag{128}$$
where S0 = ν0 σ0².
5.3 Posterior

(The following derivation is based on [Lee04, p67].) The posterior is
$$p(\mu, \sigma^2|D) \propto N(\mu|\mu_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2|\nu_0, \sigma_0^2)\, p(D|\mu, \sigma^2) \tag{129}$$
$$\propto \sigma^{-1}(\sigma^2)^{-(\nu_0/2 + 1)} \exp\Big(-\frac{1}{2\sigma^2}\big[\nu_0\sigma_0^2 + \kappa_0(\mu_0 - \mu)^2\big]\Big) \tag{130}$$
$$\times\, (\sigma^2)^{-n/2} \exp\Big(-\frac{1}{2\sigma^2}\big[ns^2 + n(\bar{x} - \mu)^2\big]\Big) \tag{131}$$
$$\propto \sigma^{-3}(\sigma^2)^{-\nu_n/2} \exp\Big(-\frac{1}{2\sigma^2}\big[\nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2\big]\Big) = N I\chi^2(\mu_n, \kappa_n, \nu_n, \sigma_n^2) \tag{132}$$
Matching powers of σ², we find
$$\nu_n = \nu_0 + n \tag{133}$$
To derive the other terms, we will complete the square. Let S0 = ν0 σ0² and Sn = νn σn² for brevity. Grouping the terms inside the exponential, we have
$$S_0 + \kappa_0(\mu_0 - \mu)^2 + ns^2 + n(\bar{x} - \mu)^2 = (S_0 + \kappa_0\mu_0^2 + ns^2 + n\bar{x}^2) + \mu^2(\kappa_0 + n) - 2(\kappa_0\mu_0 + n\bar{x})\mu \tag{134}$$
Comparing to Equation 128, we have
$$\kappa_n = \kappa_0 + n \tag{135}$$
$$\kappa_n\mu_n = \kappa_0\mu_0 + n\bar{x} \tag{136}$$
$$S_n + \kappa_n\mu_n^2 = S_0 + \kappa_0\mu_0^2 + ns^2 + n\bar{x}^2 \tag{137}$$
$$S_n = S_0 + ns^2 + \kappa_0\mu_0^2 + n\bar{x}^2 - \kappa_n\mu_n^2 \tag{138}$$
One can rearrange this to get
$$S_n = S_0 + ns^2 + (\kappa_0^{-1} + n^{-1})^{-1}(\mu_0 - \bar{x})^2 \tag{139}$$
$$= S_0 + ns^2 + \frac{n\kappa_0}{\kappa_0 + n}(\mu_0 - \bar{x})^2 \tag{140}$$
We see that the posterior sum of squares, Sn = νn σn², combines the prior sum of squares, S0 = ν0 σ0², the sample sum of squares, ns², and a term due to the uncertainty in the mean.
In summary,
$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_n} \tag{141}$$
$$\kappa_n = \kappa_0 + n \tag{142}$$
$$\nu_n = \nu_0 + n \tag{143}$$
$$\sigma_n^2 = \frac{1}{\nu_n}\Big(\nu_0\sigma_0^2 + \sum_i (x_i - \bar{x})^2 + \frac{n\kappa_0}{\kappa_0 + n}(\mu_0 - \bar{x})^2\Big) \tag{144}$$
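A minimal MATLAB sketch of Equations 141-144 (illustrative values):

    % NIX posterior update (Equations 141-144).
    mu0 = 0; kappa0 = 1; nu0 = 1; sigma20 = 1;
    x = [0.9 0.7 1.1 0.6]; n = numel(x); xbar = mean(x);

    kappan  = kappa0 + n;
    mun     = (kappa0*mu0 + n*xbar)/kappan;
    nun     = nu0 + n;
    sigma2n = (nu0*sigma20 + sum((x - xbar).^2) ...
             + n*kappa0*(mu0 - xbar)^2/kappan)/nun;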
The modes of the marginal posterior are
$$\text{mode}[\mu|D] = \mu_n \tag{149}$$
$$\text{mode}[\sigma^2|D] = \frac{\nu_n\sigma_n^2}{\nu_n + 2} \tag{150}$$
The marginal on µ is obtained by integrating out σ². Writing A = νn σn² + κn(µn − µ)² and α = (νn + 1)/2, the integral ∫ (σ²)^(−(α+1)) exp(−A/(2σ²)) dσ² is proportional to A^(−α), so
$$p(\mu|D) \propto A^{-\alpha} \tag{164}$$
$$= (\nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2)^{-(\nu_n + 1)/2} \tag{165}$$
$$\propto \Big(1 + \frac{\kappa_n}{\nu_n\sigma_n^2}(\mu - \mu_n)^2\Big)^{-(\nu_n + 1)/2} \tag{166}$$
$$\propto t_{\nu_n}(\mu|\mu_n, \sigma_n^2/\kappa_n) \tag{167}$$
5.4 Marginal likelihood

Repeating the derivation of the posterior, but keeping track of the normalization constants, gives the following:
$$p(D) = \int\!\!\int p(D|\mu, \sigma^2)\, p(\mu, \sigma^2)\,d\mu\,d\sigma^2 \tag{168}$$
$$= \frac{Z_p(\mu_n, \kappa_n, \nu_n, \sigma_n^2)}{Z_p(\mu_0, \kappa_0, \nu_0, \sigma_0^2)}\, \frac{1}{Z_{N\ell}} \tag{169}$$
$$= \frac{\sqrt{\kappa_0}}{\sqrt{\kappa_n}}\, \frac{\Gamma(\nu_n/2)}{\Gamma(\nu_0/2)}\, \frac{(\nu_0\sigma_0^2/2)^{\nu_0/2}}{(\nu_n\sigma_n^2/2)^{\nu_n/2}}\, \frac{1}{(2\pi)^{n/2}} \tag{170}$$
$$= \frac{\Gamma(\nu_n/2)}{\Gamma(\nu_0/2)}\, \sqrt{\frac{\kappa_0}{\kappa_n}}\, \frac{(\nu_0\sigma_0^2)^{\nu_0/2}}{(\nu_n\sigma_n^2)^{\nu_n/2}}\, \frac{1}{\pi^{n/2}} \tag{171}$$
where Z_p is the normalizer of the NIχ² prior and Z_{Nℓ} = (2π)^(n/2) is the constant from the likelihood.
The posterior predictive is given by
$$p(x|D) = \frac{p(x, D)}{p(D)} \tag{173}$$
$$= \frac{\Gamma((\nu_n + 1)/2)}{\Gamma(\nu_n/2)}\, \sqrt{\frac{\kappa_n}{\kappa_n + 1}}\, \frac{(\nu_n\sigma_n^2)^{\nu_n/2}}{\big(\nu_n\sigma_n^2 + \frac{\kappa_n}{\kappa_n + 1}(x - \mu_n)^2\big)^{(\nu_n + 1)/2}}\, \frac{1}{\pi^{1/2}} \tag{174}$$
$$= \frac{\Gamma((\nu_n + 1)/2)}{\Gamma(\nu_n/2)}\, \Big(\frac{\kappa_n}{(\kappa_n + 1)\pi\nu_n\sigma_n^2}\Big)^{1/2}\, \Big(1 + \frac{\kappa_n(x - \mu_n)^2}{(\kappa_n + 1)\nu_n\sigma_n^2}\Big)^{-(\nu_n + 1)/2} \tag{175}$$
$$= t_{\nu_n}\Big(x\Big|\mu_n, \frac{(1 + \kappa_n)\sigma_n^2}{\kappa_n}\Big) \tag{176}$$
With the uninformative prior obtained in the limit κ0 → 0, ν0 → −1, σ0² → 0, the posterior parameters become
$$\mu_n = \bar{x} \tag{178}$$
$$\nu_n = n - 1 \tag{179}$$
$$\kappa_n = n \tag{180}$$
$$\sigma_n^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1} \tag{181}$$
so the posterior is
$$p(\mu, \sigma^2|D) \propto \sigma^{-n-2} \exp\Big(-\frac{1}{2\sigma^2}\Big[\sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\Big]\Big) \tag{182}$$
with marginals
$$p(\sigma^2|D) = \chi^{-2}\Big(\sigma^2\Big|n - 1, \frac{\sum_i (x_i - \bar{x})^2}{n - 1}\Big) \tag{183}$$
$$p(\mu|D) = t_{n-1}\Big(\mu\Big|\bar{x}, \frac{\sum_i (x_i - \bar{x})^2}{n(n - 1)}\Big) \tag{184}$$
which are very closely related to the sampling distribution of the MLE. The posterior predictive is
$$p(x|D) = t_{n-1}\Big(x\Big|\bar{x}, \frac{(1 + n)\sum_i (x_i - \bar{x})^2}{n(n - 1)}\Big) \tag{185}$$
Note that [Min00] argues that Jeffreys' principle says the uninformative prior should be of the form
$$\lim_{k \to 0} N(\mu|\mu_0, \sigma^2/k)\, \chi^{-2}(\sigma^2|k, \sigma_0^2) \propto (2\pi\sigma^2)^{-1/2}(\sigma^2)^{-1} \propto \sigma^{-3} \tag{186}$$
6 Normal-inverse-Gamma (NIG) prior

Another common parameterization of the unknown mean, unknown variance model uses the normal-inverse-Gamma prior.

6.1 Likelihood

The likelihood can be written in this form:
$$p(D|\mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}}(\sigma^2)^{-n/2} \exp\Big(-\frac{1}{2\sigma^2}\big[ns^2 + n(\bar{x} - \mu)^2\big]\Big) \tag{189}$$
6.2 Prior

The normal-inverse-Gamma prior is
$$N IG(\mu, \sigma^2|m_0, V_0, a_0, b_0) \stackrel{\text{def}}{=} N(\mu|m_0, \sigma^2 V_0)\, IG(\sigma^2|a_0, b_0)$$
This is equivalent to the NIχ² prior, where we make the following substitutions:
$$m_0 = \mu_0 \tag{192}$$
$$V_0 = \frac{1}{\kappa_0} \tag{193}$$
$$a_0 = \frac{\nu_0}{2} \tag{194}$$
$$b_0 = \frac{\nu_0\sigma_0^2}{2} \tag{195}$$
6.3 Posterior

We can show that the posterior is also NIG; it follows directly from the NIχ² results using the above substitutions. (Deriving the bn term requires some tedious algebra.)
6.3.1 Posterior marginals
To be derived.
6.4 Marginal likelihood
For the marginal likelihood, substitute the NIG quantities into Equation 171.
7 Multivariate Normal prior

We now consider Bayesian estimation of the mean of a multivariate Gaussian whose covariance Σ is assumed known.

7.1 Prior

7.2 Likelihood

The likelihood can be summarized in terms of the sample mean:
$$p(D|\mu, \Sigma) \propto N\Big(\bar{x}\Big|\mu, \frac{1}{N}\Sigma\Big) \tag{209}$$

7.3 Posterior
7.4 Posterior predictive
8 Normal-Wishart prior
The multivariate analog of the normal-gamma prior is the normal-Wishart prior. Here we just state the results without proof; see [DeG70, p178] for details. We assume X is d-dimensional.
8.1 Likelihood
$$p(D|\mu, \Lambda) = (2\pi)^{-nd/2}|\Lambda|^{n/2} \exp\Big(-\tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Lambda(x_i - \mu)\Big) \tag{216}$$
8.2 Prior
Posterior marginals
The MAP estimates are given by
$$\mu_n = \bar{x}, \quad T_n = S, \quad \kappa_n = n, \quad \nu_n = n - 1 \tag{237}$$
and the posterior predictive is
$$p(x|D) = t_{n-d}\Big(x\Big|\bar{x}, \frac{S(n + 1)}{n(n - d)}\Big) \tag{240}$$
9 Normal-Inverse-Wishart prior
The multivariate analog of the normal inverse chi-squared (NIX) distribution is the normal inverse Wishart (NIW) (see
also [GCSR04, p85]).
9.1 Likelihood

The likelihood is
$$p(D|\mu, \Sigma) \propto |\Sigma|^{-\frac{n}{2}} \exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)\Big) \tag{241}$$
$$= |\Sigma|^{-\frac{n}{2}} \exp\Big(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}S)\Big) \tag{242}$$
where S = Σᵢ (xᵢ − µ)(xᵢ − µ)ᵀ is the matrix of sums of squares about µ (the trace form follows from xᵀAx = tr(Axxᵀ)).
9.2 Prior

The natural conjugate prior is the normal-inverse-Wishart:
$$N IW(\mu, \Sigma|\mu_0, \kappa_0, \nu_0, \Lambda_0) \stackrel{\text{def}}{=} N(\mu|\mu_0, \tfrac{1}{\kappa_0}\Sigma)\, IW_{\nu_0}(\Sigma|\Lambda_0^{-1})$$
9.3 Posterior

The posterior is
$$\Sigma|D \sim IW_{\nu_n}(\Lambda_n^{-1}) \tag{255}$$
$$\mu|D \sim t_{\nu_n - d + 1}\Big(\mu_n, \frac{\Lambda_n}{\kappa_n(\nu_n - d + 1)}\Big) \tag{256}$$
To see the connection with the scalar case, note that Λn plays the role of νn σn² (the posterior sum of squares), so for d = 1
$$\frac{\Lambda_n}{\kappa_n(\nu_n - d + 1)} = \frac{\Lambda_n}{\kappa_n\nu_n} = \frac{\sigma_n^2}{\kappa_n} \tag{257}$$
9.4 Posterior predictive
$$p(x|D) = t_{\nu_n - d + 1}\Big(\mu_n, \frac{\Lambda_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)}\Big) \tag{258}$$
To see the connection with the scalar case, note that for d = 1
$$\frac{\Lambda_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)} = \frac{\Lambda_n(\kappa_n + 1)}{\kappa_n\nu_n} = \frac{\sigma_n^2(\kappa_n + 1)}{\kappa_n} \tag{259}$$
9.5 Marginal likelihood

The posterior is given by
$$p(\mu, \Sigma|D) = \frac{1}{p(D)}\, \frac{1}{Z_0} N IW'(\mu, \Sigma|\alpha_0)\, \frac{1}{(2\pi)^{nd/2}} N'(D|\mu, \Sigma) \tag{260}$$
$$= \frac{1}{Z_n} N IW'(\mu, \Sigma|\alpha_n) \tag{261}$$
where
$$N IW'(\mu, \Sigma|\alpha_0) = |\Sigma|^{-((\nu_0 + d)/2 + 1)} \exp\Big(-\frac{1}{2}\mathrm{tr}(\Lambda_0\Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^T\Sigma^{-1}(\mu - \mu_0)\Big) \tag{262}$$
$$N'(D|\mu, \Sigma) = |\Sigma|^{-\frac{n}{2}} \exp\Big(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}S)\Big) \tag{263}$$
are the unnormalized prior and likelihood. Hence
$$p(D) = \frac{Z_n}{Z_0}\, \frac{1}{(2\pi)^{nd/2}} = \frac{2^{\nu_n d/2}\Gamma_d(\nu_n/2)(2\pi/\kappa_n)^{d/2}}{|\Lambda_n|^{\nu_n/2}}\, \frac{|\Lambda_0|^{\nu_0/2}}{2^{\nu_0 d/2}\Gamma_d(\nu_0/2)(2\pi/\kappa_0)^{d/2}}\, \frac{1}{(2\pi)^{nd/2}} \tag{264}$$
$$= \frac{1}{(2\pi)^{nd/2}}\, \frac{2^{\nu_n d/2}(2\pi/\kappa_n)^{d/2}}{2^{\nu_0 d/2}(2\pi/\kappa_0)^{d/2}}\, \frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\, \frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}} \tag{265}$$
$$= \frac{1}{\pi^{nd/2}}\, \frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\, \frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}}\, \Big(\frac{\kappa_0}{\kappa_n}\Big)^{d/2} \tag{266}$$
This reduces to Equation 171 if d = 1.
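Here is a MATLAB sketch of Equation 266 for made-up data, with the multivariate gamma function of Equation 298 implemented inline via gammaln, and using the standard NIW updates (the multivariate analog of Equations 141-144):

    % Log marginal likelihood for the NIW model (Equation 266).
    rng(0); d = 2; n = 20;
    X = randn(n, d);                           % hypothetical data (rows)
    mu0 = zeros(d,1); kappa0 = 1; nu0 = d + 1; Lambda0 = eye(d);
    xbar = mean(X, 1)'; Xc = X - xbar';
    S = Xc'*Xc;                                % scatter about xbar

    kappan  = kappa0 + n; nun = nu0 + n;
    Lambdan = Lambda0 + S + (kappa0*n/kappan)*(xbar - mu0)*(xbar - mu0)';

    lgd  = @(a) d*(d-1)/4*log(pi) + sum(gammaln(a + (1 - (1:d))/2));
    logp = -(n*d/2)*log(pi) + lgd(nun/2) - lgd(nu0/2) ...
         + (nu0/2)*log(det(Lambda0)) - (nun/2)*log(det(Lambdan)) ...
         + (d/2)*log(kappa0/kappan);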
9.6 Reference analysis

A noninformative (Jeffreys) prior is p(µ, Σ) ∝ |Σ|^(−(d+1)/2), which is the limit κ0 → 0, ν0 → −1, |Λ0| → 0 [GCSR04, p88]. Then the posterior becomes
$$\mu_n = \bar{x} \tag{267}$$
$$\kappa_n = n \tag{268}$$
$$\nu_n = n - 1 \tag{269}$$
$$\Lambda_n = S = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T \tag{270}$$
$$p(\Sigma|D) = IW_{n-1}(\Sigma|S) \tag{271}$$
$$p(\mu|D) = t_{n-d}\Big(\mu\Big|\bar{x}, \frac{S}{n(n - d)}\Big) \tag{272}$$
$$p(x|D) = t_{n-d}\Big(x\Big|\bar{x}, \frac{S(n + 1)}{n(n - d)}\Big) \tag{273}$$
Note that [Min00] argues that Jeffreys' principle says the uninformative prior should be of the form
$$\lim_{k \to 0} N(\mu|\mu_0, \Sigma/k)\, IW_k(\Sigma|k\Sigma) \propto |2\pi\Sigma|^{-\frac{1}{2}}|\Sigma|^{-(d+1)/2} \propto |\Sigma|^{-(\frac{d}{2} + 1)} \tag{274}$$
10 Appendix: standard distributions

10.1 Gamma distribution

The Gamma distribution, using the shape and rate parameterization, is
$$Ga(x|\text{shape} = a, \text{rate} = b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-xb}, \quad x, a, b > 0$$
Figure 5: Some Ga(a, b) distributions. If a < 1, the peak is at 0. As we increase b, we squeeze everything leftwards and upwards.
Figures generated by gammaDistPlot2.
Note that the shape parameter controls the shape; the scale parameter merely defines the measurement scale (the
horizontal axis). The rate parameter is just the inverse of the scale. See Figure 5 for some examples. This distribution
has the following properties (using the rate parameterization):
$$\text{mean} = \frac{a}{b} \tag{277}$$
$$\text{mode} = \frac{a - 1}{b} \quad \text{for } a \ge 1 \tag{278}$$
$$\text{var} = \frac{a}{b^2} \tag{279}$$
10.2 Inverse Gamma distribution
Let X ∼ Ga(shape = a, rate = b) and Y = 1/X. Then it is easy to show that Y ∼ IG(shape = a, scale = b), where
the inverse Gamma distribution is given by
$$IG(x|\text{shape} = a, \text{scale} = b) = \frac{b^a}{\Gamma(a)}\, x^{-(a+1)} e^{-b/x}, \quad x, a, b > 0 \tag{280}$$
Figure 6: Some inverse gamma distributions (a=shape, b=rate). These plots were produced by invchi2plot.
$$\text{mean} = \frac{b}{a - 1}, \quad a > 1 \tag{281}$$
$$\text{mode} = \frac{b}{a + 1} \tag{282}$$
$$\text{var} = \frac{b^2}{(a - 1)^2(a - 2)}, \quad a > 2 \tag{283}$$
See Figure 6 for some plots. We see that increasing b just stretches the horizontal axis, but increasing a moves the
peak up and closer to the left.
There is also another parameterization, using the rate (inverse scale):
$$IG(x|\text{shape} = \alpha, \text{rate} = \beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha}\, x^{-(\alpha+1)} e^{-1/(\beta x)}, \quad x, \alpha, \beta > 0 \tag{284}$$
10.3 Scaled inverse chi-squared

The scaled inverse chi-squared distribution is defined as
$$\chi^{-2}(x|\nu, \sigma^2) = \frac{(\nu\sigma^2/2)^{\nu/2}}{\Gamma(\nu/2)}\, x^{-(\nu/2 + 1)} \exp\Big(-\frac{\nu\sigma^2}{2x}\Big), \quad x, \nu, \sigma^2 > 0$$
It is the distribution of νσ²/X where X ∼ χ²_ν. See Figure 7 for some plots.
Figure 7: Some inverse scaled χ2 distributions. These plots were produced by invchi2plot.
10.4 Wishart distribution

Let X be a p × p symmetric positive definite matrix. The Wishart distribution with dof ν and scale matrix S is
$$W i_\nu(X|S) = \frac{1}{Z}\, |X|^{(\nu - p - 1)/2} \exp\Big(-\tfrac{1}{2}\mathrm{tr}(S^{-1}X)\Big), \quad Z = 2^{\nu p/2}\Gamma_p(\nu/2)|S|^{\nu/2}$$
where Γ_p is the multivariate gamma function,
$$\Gamma_p(\alpha) = \pi^{p(p-1)/4} \prod_{i=1}^p \Gamma\Big(\alpha + \frac{1 - i}{2}\Big)$$
(So Γ1(α) = Γ(α).) The mean and mode are given by (see also [Pre05])
$$\text{mean} = \nu S \tag{293}$$
$$\text{mode} = (\nu - p - 1)S, \quad \nu > p + 1 \tag{294}$$
Figure 8: Some samples from the Wishart distribution. Left: ν = 2, right: ν = 10. We see that if ν = 2 (the smallest valid value in 2 dimensions), we often sample nearly singular matrices. As ν increases, we put more mass on the S matrix. If S = I₂, the samples would look (on average) like circles. Generated by wishplot.
There is an alternative parameterization:
$$W i(X|a, B) = \frac{|B|^a}{\Gamma_p(a)}\, |X|^{a - (p+1)/2} \exp\big[-\mathrm{tr}(BX)\big] \tag{295}$$
We require that B is a p × p symmetric positive definite matrix, and 2a > p − 1. If p = 1, so B is a scalar, this reduces to the Ga(shape = a, rate = b) density.
To get some intuition for this distribution, note that the diagonal of AB contains the inner products of the rows of A with the corresponding columns of B (and tr(AB) is their sum). In Matlab notation we have

    diag(A*B) = [a(1,:)*b(:,1), ..., a(n,:)*b(:,n)]

If X ∼ Wi_ν(S), then we are performing a kind of template matching between the columns of X and S⁻¹ (recall that both X and S are symmetric). This is a natural way to define the distance between two matrices.
10.5 Inverse Wishart

This is the multidimensional generalization of the inverse Gamma. Consider a d × d positive definite (covariance) matrix X, a dof parameter ν > d − 1, and a psd matrix S. Some authors (e.g., [GCSR04, p574]) use this parameterization:
$$IW_\nu(X|S^{-1}) = \frac{1}{Z}\, |X|^{-(\nu + d + 1)/2} \exp\Big(-\frac{1}{2}\mathrm{Tr}(SX^{-1})\Big) \tag{296}$$
$$Z = \frac{2^{\nu d/2}\Gamma_d(\nu/2)}{|S|^{\nu/2}} \tag{297}$$
where
$$\Gamma_d(\nu/2) = \pi^{d(d-1)/4} \prod_{i=1}^d \Gamma\Big(\frac{\nu + 1 - i}{2}\Big) \tag{298}$$
The distribution has mean
$$E\,X = \frac{S}{\nu - d - 1} \tag{299}$$
In Matlab, use iwishrnd. In the 1d case, we have
$$\chi^{-2}(\Sigma|\nu_0, \sigma_0^2) = IW_{\nu_0}(\Sigma|(\nu_0\sigma_0^2)^{-1}) \tag{300}$$
Other authors (e.g., [Pre05, p117]) use a slightly different formulation (with 2d < ν):
$$IW_\nu(X|Q) = \Big[2^{(\nu - d - 1)d/2}\, \pi^{d(d-1)/4} \prod_{j=1}^d \Gamma\big((\nu - d - j)/2\big)\Big]^{-1} \tag{301}$$
$$\times\, |Q|^{(\nu - d - 1)/2}\, |X|^{-\nu/2} \exp\Big(-\frac{1}{2}\mathrm{Tr}(X^{-1}Q)\Big) \tag{302}$$
which has mean
$$E\,X = \frac{Q}{\nu - 2d - 2} \tag{303}$$
10.6 Student t distribution

The generalized t-distribution is given as
$$t_\nu(x|\mu, \sigma^2) = c\Big[1 + \frac{1}{\nu}\Big(\frac{x - \mu}{\sigma}\Big)^2\Big]^{-(\frac{\nu + 1}{2})} \tag{304}$$
$$c = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\, \frac{1}{\sqrt{\nu\pi}\,\sigma} \tag{305}$$
where c is the normalization constant. µ is the mean, ν > 0 is the degrees of freedom, and σ² > 0 is the scale. (Note that the ν parameter is often written as a subscript.) In Matlab, use tpdf.
The distribution has the following properties:
$$\text{mean} = \mu, \quad \nu > 1 \tag{306}$$
$$\text{mode} = \mu \tag{307}$$
$$\text{var} = \frac{\nu\sigma^2}{\nu - 2}, \quad \nu > 2 \tag{308}$$
Note: if x ∼ t_ν(µ, σ²), then
$$\frac{x - \mu}{\sigma} \sim t_\nu \tag{309}$$
which corresponds to a standard t-distribution with µ = 0, σ² = 1:
$$t_\nu(x) = \frac{\Gamma((\nu + 1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\, (1 + x^2/\nu)^{-(\nu + 1)/2} \tag{310}$$
In Figure 9, we plot the density for different parameter values. As ν → ∞, the T approaches a Gaussian. T-distributions are like Gaussian distributions with heavy tails. Hence they are more robust to outliers (see Figure 10).
If ν = 1, this is called a Cauchy distribution. This is an interesting distribution since if X ∼ Cauchy, then E[X] does not exist, since the corresponding integral diverges. Essentially this is because the tails are so heavy that samples from the distribution can get very far from the center µ.
It can be shown that the t-distribution is like an infinite sum of Gaussians, where each Gaussian has a different precision:
$$p(x|\mu, a, b) = \int N(x|\mu, \tau^{-1})\, Ga(\tau|a, \text{rate} = b)\,d\tau = t_{2a}(x|\mu, b/a) \tag{311}$$
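A quick Monte Carlo check of this identity in MATLAB (illustrative parameter values; gamrnd is from the Statistics Toolbox and takes shape and scale):

    % Equation 311 check: a Gamma mixture of Gaussians is a Student t.
    rng(0); a = 3; b = 2; mu = 0; m = 1e5;
    tau = gamrnd(a, 1/b, m, 1);          % tau ~ Ga(a, rate = b)
    xs  = mu + randn(m, 1)./sqrt(tau);   % x | tau ~ N(mu, 1/tau)
    nu = 2*a; s2 = b/a;                  % implied t parameters
    [var(xs), nu/(nu - 2)*s2]            % the two should be close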
Figure 9: Student t-distributions T (µ, σ 2 , ν) for µ = 0. The effect of σ is just to scale the horizontal axis. As ν→∞, the
distribution approaches a Gaussian. See studentTplot.
Figure 10: Fitting a Gaussian and a Student distribution to some data (left) and to some data with outliers (right). The Student
distribution (red) is much less affected by outliers than the Gaussian (green). Source: [Bis06] Figure 2.16.
Figure 11: Left: T distribution in 2d with dof=2 and Σ = 0.1I2 . Right: Gaussian density with Σ = 0.1I2 and µ = (0, 0); we see
it goes to zero faster. Produced by multivarTplot.
10.7 Multivariate t distribution

The multivariate t-distribution in d dimensions is given by
$$t_\nu(x|\mu, \Sigma) = \frac{\Gamma(\nu/2 + d/2)}{\Gamma(\nu/2)}\, \frac{|\Sigma|^{-1/2}}{(\nu\pi)^{d/2}}\, \Big[1 + \frac{1}{\nu}(x - \mu)^T\Sigma^{-1}(x - \mu)\Big]^{-(\frac{\nu + d}{2})}$$
where Σ is called the scale matrix (since it is not exactly the covariance matrix). This has fatter tails than a Gaussian: see Figure 11. In Matlab, use mvtpdf.
The distribution has the following properties:
$$E\,x = \mu \ \text{if}\ \nu > 1 \tag{315}$$
$$\text{mode}\,x = \mu \tag{316}$$
$$\text{Cov}\,x = \frac{\nu}{\nu - 2}\Sigma \ \text{for}\ \nu > 2 \tag{317}$$
(The following results are from [Koo03, p328].) Suppose Y ∼ T(µ, Σ, ν) and we partition the variables into 2 blocks. Then the marginals are
$$Y_i \sim T(\mu_i, \Sigma_{ii}, \nu) \tag{318}$$
and the conditionals are also Student t (see [Koo03, p328] for the exact form).
We can sample from y ∼ T(µ, Σ, ν) by sampling x ∼ T(0, I, ν) and then transforming y = µ + Rᵀx, where R = chol(Σ), so RᵀR = Σ.
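A MATLAB sketch of this recipe; note that a standard multivariate t vector is a Gaussian vector divided by a single shared √(w/ν) factor with w ∼ χ²_ν, not a vector of independent scalar t draws:

    % Sample y ~ T(mu, Sigma, nu) via the Cholesky transform.
    rng(0); nu = 5; mu = [0; 0]; Sigma = [1 0.5; 0.5 1];
    R = chol(Sigma);                  % R'*R = Sigma
    z = randn(2, 1); w = chi2rnd(nu);
    x = z*sqrt(nu/w);                 % x ~ T(0, I, nu)
    y = mu + R'*x;                    % y ~ T(mu, Sigma, nu)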
References
[Bis06] C. Bishop. Pattern recognition and machine learning. Springer, 2006.
[BL01] P. Baldi and A. Long. A Bayesian framework for the analysis of microarray expression data: regularized
t-test and statistical inferences of gene changes. Bioinformatics, 17(6):509–519, 2001.
[BS94] J. Bernardo and A. Smith. Bayesian Theory. John Wiley, 1994.
[DeG70] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
[DHMS02] D. Denison, C. Holmes, B. Mallick, and A. Smith. Bayesian methods for nonlinear classification and
regression. Wiley, 2002.
[DMP+06] F. Demichelis, P. Magni, P. Piergiorgi, M. Rubin, and R. Bellazzi. A hierarchical Naive Bayes model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics, 7:514, 2006.
[GCSR04] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. Chapman and Hall, 2004. 2nd
edition.
[GH94] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft
Research, 1994.
[Koo03] Gary Koop. Bayesian econometrics. Wiley, 2003.
[Lee04] Peter Lee. Bayesian statistics: an introduction. Arnold Publishing, 2004. Third edition.
[Min00] T. Minka. Inferring a Gaussian distribution. Technical report, MIT, 2000.
[Pre05] S. J. Press. Applied multivariate analysis, using Bayesian and frequentist methods of inference. Dover,
2005. Second edition.