Gaussian Distribution
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
B. Vidakovic, Bayesian Statistics for Engineering, Online Course at Georgia Tech.
Learn how to perform inference for 𝝁 and 𝚲 for the multivariate Gaussian using the appropriate prior models and their uninformative limits.
Learn how to perform inference for 𝝁 and 𝜮 for the multivariate Gaussian using the appropriate prior models and their uninformative limits.
It weights the data and the prior information in making probabilistic inferences.
Marginal distribution: m(x) = ∫ f(x|θ) π(θ) dθ
Posterior mean: θ̂ = 𝔼_{p(θ|x)}[θ] = ∫ θ π(θ|x) dθ
Posterior quantiles: Pr(θ ≥ a | x) = ∫_a^∞ π(θ|x) dθ
We have:
g(x̂ | x) = ∫ g(x̂, θ | x) dθ = ∫ π(x̂, θ, x)/m(x) dθ = ∫ [π(x̂, θ, x)/φ(θ, x)] [φ(θ, x)/m(x)] dθ
= ∫ f(x̂ | θ, x) π(θ | x) dθ = ∫ f(x̂ | θ) π(θ | x) dθ
Compare this with the form of the normalizing factor (computed for x̂) in Bayes' rule:
m(x̂) = ∫ f(x̂ | θ) π(θ) dθ
μ | X ~ N(μ_N, σ_N²), with
μ_N = σ²/(Nσ_0² + σ²) μ_0 + Nσ_0²/(Nσ_0² + σ²) μ_ML,   μ_ML = (1/N) Σ_{n=1}^N x_n,   and
1/σ_N² = 1/σ_0² + N/σ²
Figure: prior, likelihood, and posterior for inference of the mean of a 1D Gaussian with known variance (gaussInferParamsMean1d from PMTK).
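The update above is easy to verify numerically. A minimal Python/NumPy sketch (not the PMTK code; all names are illustrative), assuming the data are stored in a 1D array:

```python
import numpy as np

def posterior_mean_known_variance(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_N, sigma_N^2) for the mean of a 1D Gaussian with
    known variance sigma_sq and prior N(mu0, sigma0_sq)."""
    N = len(x)
    mu_ml = np.mean(x)                                   # maximum-likelihood mean
    sigma_N_sq = 1.0 / (1.0 / sigma0_sq + N / sigma_sq)  # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_N = sigma_N_sq * (mu0 / sigma0_sq + N * mu_ml / sigma_sq)
    return mu_N, sigma_N_sq

# Example: prior N(0, 1), known sigma^2 = 1, a few observations
rng = np.random.default_rng(0)
x = rng.normal(loc=0.8, scale=1.0, size=10)
print(posterior_mean_known_variance(x, mu0=0.0, sigma0_sq=1.0, sigma_sq=1.0))
```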
The Gamma distribution has a finite integral if 𝑎 > 0, and the distribution itself
is finite if 𝑎 ≥ 1.
Inference of Precision with Known Mean
The posterior takes the form:
p(λ | X, μ) ∝ ∏_{n=1}^N f(x_n | λ) Gamma(λ | a_0, b_0) ∝ λ^{N/2} λ^{a_0 − 1} exp[ −b_0 λ − (λ/2) Σ_{n=1}^N (x_n − μ)² ]
p(λ | X, μ) = Gamma(λ | a_N, b_N),   a_N = a_0 + N/2,   b_N = b_0 + (1/2) Σ_{n=1}^N (x_n − μ)² = b_0 + (N/2) σ²_ML
The results above are identical to inference of the variance σ² using as prior InvGamma(σ² | a_0, b_0), resulting in a posterior InvGamma(σ² | a_N, b_N).
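A corresponding sketch of the Gamma update for the precision, again an illustration rather than the PMTK implementation:

```python
import numpy as np

def posterior_precision_known_mean(x, mu, a0, b0):
    """Gamma(a_N, b_N) posterior for the precision lambda of a 1D Gaussian
    with known mean mu and prior Gamma(a0, b0) (rate parameterization)."""
    N = len(x)
    a_N = a0 + 0.5 * N
    b_N = b0 + 0.5 * np.sum((x - mu) ** 2)   # = b0 + (N/2) * sigma^2_ML
    return a_N, b_N                          # posterior mean of lambda is a_N / b_N
```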
Gamma and Inverse Gamma
Here ν_0 represents the strength of the prior and σ_0² encodes the value of the prior. With this, the posterior takes the form:
p(σ² | D, μ) = χ⁻²(σ² | ν_N, σ_N²),   ν_N = ν_0 + N,   σ_N² = (ν_0 σ_0² + Σ_{i=1}^N (x_i − μ)²)/ν_N
The posterior dof ν_N is the prior dof ν_0 plus N. The posterior sum of squares ν_N σ_N² = ν_0 σ_0² + Σ_{i=1}^N (x_i − μ)² is the prior sum of squares plus the data sum of squares.
Figure: sequential updating of the posterior for σ² with known mean (gaussSeqUpdateSigma1D from PMTK).
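The scaled-inverse-chi-squared update can be sketched the same way (a minimal illustration; the function name is hypothetical):

```python
import numpy as np

def posterior_variance_known_mean(x, mu, v0, sigma0_sq):
    """Scaled-inverse-chi-squared posterior chi^-2(v_N, sigma_N^2) for sigma^2
    with known mean mu; equivalent to an InvGamma(v_N/2, v_N*sigma_N^2/2) posterior."""
    N = len(x)
    v_N = v0 + N
    sigma_N_sq = (v0 * sigma0_sq + np.sum((x - mu) ** 2)) / v_N
    return v_N, sigma_N_sq
```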
Consider now inference for both μ and λ for scalar data. By inspection of the likelihood, a conjugate prior must be proportional to [λ^{1/2} exp(−λμ²/2)]^β exp(cλμ − dλ); normalizing, this is the Normal-Gamma prior
p(μ, λ) = p(μ | λ) p(λ) = N(μ | μ_0, (βλ)⁻¹) Gamma(λ | a, b),   μ_0 = c/β,   a = 1 + β/2,   b = d − c²/(2β)
Recall the form of the Gamma distribution:
Gamma(λ | a, b) = (b^a/Γ(a)) λ^{a−1} e^{−bλ}
The posterior is again Normal-Gamma:
p(μ, λ | X) = N(μ | μ_N, ((β + N)λ)⁻¹) Gamma(λ | a_N, b_N),
μ_N = (βμ_0 + Σ_{n=1}^N x_n)/(β + N),   a_N = a + N/2,   b_N = b + (1/2) Σ_{n=1}^N (x_n − x̄)² + βN(x̄ − μ_0)²/(2(β + N))
Figure: contour plot of the Normal-Gamma prior with a = 5, b = 6, μ_0 = 0, β = 2 (MATLAB code).
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior).
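A sketch of the Normal-Gamma update for scalar data, assuming the (μ_0, β, a_0, b_0) parameterization used above (names are illustrative):

```python
import numpy as np

def normal_gamma_posterior(x, mu0, beta0, a0, b0):
    """Posterior Normal-Gamma(mu_N, beta_N, a_N, b_N) for (mu, lambda) of a 1D
    Gaussian under the prior N(mu|mu0, (beta0*lam)^-1) Gamma(lam|a0, b0)."""
    N = len(x)
    xbar = np.mean(x)
    beta_N = beta0 + N
    mu_N = (beta0 * mu0 + N * xbar) / beta_N
    a_N = a0 + 0.5 * N
    b_N = b0 + 0.5 * np.sum((x - xbar) ** 2) + 0.5 * beta0 * N * (xbar - mu0) ** 2 / beta_N
    return mu_N, beta_N, a_N, b_N
```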
Posterior for 𝝁 and 𝝈𝟐 for Scalar Data
We can also work directly with σ². We use the normal inverse chi-squared (NIX) distribution:
NIχ²(μ, σ² | m_0, κ_0, ν_0, σ_0²) = N(μ | m_0, σ²/κ_0) χ⁻²(σ² | ν_0, σ_0²)
∝ (1/σ)(σ²)^{−(ν_0/2 + 1)} exp( −[ν_0 σ_0² + κ_0(μ − m_0)²]/(2σ²) ) = (σ²)^{−(ν_0+3)/2} exp( −[ν_0 σ_0² + κ_0(μ − m_0)²]/(2σ²) )
where χ⁻²(σ² | ν_0, σ_0²) = IG(σ² | ν_0/2, ν_0 σ_0²/2) ∝ (σ²)^{−(ν_0/2 + 1)} exp( −ν_0 σ_0²/(2σ²) )
Similarly to our earlier calculations, the posterior is given as:
p(μ, σ² | D) = NIχ²(μ, σ² | m_N, κ_N, ν_N, σ_N²),
m_N = (κ_0 m_0 + N x̄)/κ_N,   κ_N = κ_0 + N,   ν_N = ν_0 + N,
ν_N σ_N² = ν_0 σ_0² + Σ_{i=1}^N (x_i − x̄)² + (N κ_0/(κ_0 + N)) (m_0 − x̄)²
Figure: posterior p(μ, σ² | D) under the NIX model (NIXdemo2 from PMTK).
The posterior marginal for μ is a Student's T:
p(μ | D) = ∫ p(μ, σ² | D) dσ² = T(μ | m_N, σ_N²/κ_N, ν_N),   𝔼[μ | D] = m_N
Mean: m_N (ν_N > 1),   Mode: m_N,   Var: ν_N σ_N²/((ν_N − 2) κ_N) (ν_N > 2)
The posterior predictive is:
p(x | D) = ∫∫ p(x | μ, σ²) p(μ, σ² | D) dμ dσ² = T(x | m_N, σ_N²(1 + κ_N)/κ_N, ν_N)
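As an illustration, a minimal sketch of the NIX update and the Student-t posterior predictive using scipy.stats.t (function names are illustrative; SciPy's scale argument is a standard-deviation-like scale, hence the square root):

```python
import numpy as np
from scipy.stats import t as student_t

def nix_posterior(x, m0, kappa0, v0, sigma0_sq):
    """NIX posterior hyperparameters (m_N, kappa_N, v_N, sigma_N^2) for (mu, sigma^2)."""
    N = len(x)
    xbar = np.mean(x)
    kappa_N = kappa0 + N
    v_N = v0 + N
    m_N = (kappa0 * m0 + N * xbar) / kappa_N
    ss = np.sum((x - xbar) ** 2)
    sigma_N_sq = (v0 * sigma0_sq + ss + (N * kappa0 / kappa_N) * (m0 - xbar) ** 2) / v_N
    return m_N, kappa_N, v_N, sigma_N_sq

def posterior_predictive(x_new, m_N, kappa_N, v_N, sigma_N_sq):
    """Student-t posterior predictive density T(x | m_N, sigma_N^2 (1+kappa_N)/kappa_N, v_N)."""
    scale = np.sqrt(sigma_N_sq * (1.0 + kappa_N) / kappa_N)   # scipy uses a std-dev-like scale
    return student_t.pdf(x_new, df=v_N, loc=m_N, scale=scale)
```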
Let us revisit all these results with an uninformative prior:*
p(μ, σ²) ∝ p(μ) p(σ²) ∝ σ⁻² ∝ NIχ²(μ, σ² | μ_0 = 0, κ_0 = 0, ν_0 = −1, σ_0² = 0)
* Recall that: NIχ²(μ, σ² | m_0, κ_0, ν_0, σ_0²) ∝ (σ²)^{−(ν_0+3)/2} exp( −[ν_0 σ_0² + κ_0(μ − m_0)²]/(2σ²) )
** Recall that: p(μ, σ² | D) = NIχ²(μ, σ² | m_N, κ_N, ν_N, σ_N²),   m_N = (κ_0 m_0 + N x̄)/κ_N,   κ_N = κ_0 + N,   ν_N = ν_0 + N,   ν_N σ_N² = ν_0 σ_0² + Σ_{i=1}^N (x_i − x̄)² + (N κ_0/(κ_0 + N)) (m_0 − x̄)²
Marginal Posterior for 𝝁 and Credible Interval
The predictive distribution becomes:
p(x | D) = T(x | m_N, σ_N²(1 + κ_N)/κ_N, ν_N) = T(x | x̄, s²(1 + N)/N, N − 1)
where s² = (1/(N − 1)) Σ_{i=1}^N (x_i − x̄)² = (N/(N − 1)) σ²_MLE is the sample variance.
The marginal posterior for μ is:
p(μ | D) = T(μ | x̄, s²/N, N − 1)
var[μ | D] = ν_N σ_N²/((ν_N − 2) κ_N) = (N − 1) s²/((N − 3) N) ≈ s²/N
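For a concrete credible interval, one can evaluate the Student-t marginal directly. A minimal sketch, assuming the uninformative-prior result above (the function name is illustrative):

```python
import numpy as np
from scipy.stats import t as student_t

def mu_credible_interval(x, alpha=0.05):
    """Central (1 - alpha) credible interval for mu under the uninformative prior,
    using p(mu|D) = T(mu | xbar, s^2/N, N-1)."""
    N = len(x)
    xbar = np.mean(x)
    s = np.std(x, ddof=1)                                   # sample standard deviation
    dist = student_t(df=N - 1, loc=xbar, scale=s / np.sqrt(N))
    return dist.interval(1.0 - alpha)                       # e.g. the central 95% interval
```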
For the multivariate Gaussian with known covariance Σ and a Gaussian prior N(μ | μ_0, Σ_0), the log-posterior is quadratic in μ:
ln p(μ | X) = −(1/2) Σ_{n=1}^N (x_n − μ)ᵀ Σ⁻¹ (x_n − μ) − (1/2)(μ − μ_0)ᵀ Σ_0⁻¹ (μ − μ_0) + const
So the variance and mean of the posterior p(μ | X, Σ) = N(μ | μ_N, Σ_N) are:
Σ_N⁻¹ = Σ_0⁻¹ + N Σ⁻¹
μ_N = Σ_N ( N Σ⁻¹ μ_ML + Σ_0⁻¹ μ_0 ),   μ_ML = (1/N) Σ_{n=1}^N x_n
For an uninformative prior, Σ_0 = ∞ I, we obtain p(μ | X, Σ) = N(μ | μ_ML, (1/N) Σ).
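A minimal multivariate sketch of this update (illustrative names; a numerically careful implementation would use Cholesky solves instead of explicit inverses):

```python
import numpy as np

def posterior_mean_known_cov(X, mu0, Sigma0, Sigma):
    """Posterior N(mu_N, Sigma_N) for the mean of a multivariate Gaussian with
    known covariance Sigma and prior N(mu0, Sigma0). X has shape (N, D)."""
    N = X.shape[0]
    mu_ml = X.mean(axis=0)
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_N = np.linalg.inv(Sigma0_inv + N * Sigma_inv)      # Sigma_N^-1 = Sigma_0^-1 + N Sigma^-1
    mu_N = Sigma_N @ (N * Sigma_inv @ mu_ml + Sigma0_inv @ mu0)
    return mu_N, Sigma_N
```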
Posterior Distribution of Precision 𝚲
We now discuss how to compute 𝑝(𝚲|𝒟, 𝝁) for a 𝐷 −dimensional Gaussian.
The likelihood has the form
p(D | μ, Λ) ∝ |Λ|^{N/2} exp( −(1/2) Tr(Λ S_μ) ),   S_μ = Σ_{n=1}^N (x_n − μ)(x_n − μ)ᵀ
The corresponding conjugate prior is known as the Wishart distribution:
Wi(Λ | W_0, ν_0) = B |Λ|^{(ν_0 − D − 1)/2} exp( −(1/2) Tr(Λ W_0⁻¹) ),   ν_0 > D − 1 (dof),   W_0 sym. pos. def. (D × D)
B(W_0, ν_0) = |W_0|^{−ν_0/2} ( 2^{ν_0 D/2} π^{D(D−1)/4} ∏_{i=1}^D Γ((ν_0 + 1 − i)/2) )⁻¹
Γ_D(ν_0/2) = π^{D(D−1)/4} ∏_{i=1}^D Γ((ν_0 + 1 − i)/2)   (multivariate Gamma function)
The Wishart Vs. the Gamma Distributions
For Λ ~ Wishart_ν(S) in k dimensions, the mode is (ν − k − 1) S for ν ≥ k + 1.
For x_i ~ N(0, Σ), the scatter matrix S = Σ_{i=1}^N x_i x_iᵀ ~ Wi(S | Σ, N), with 𝔼[S] = N Σ.
Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Posterior Distribution of 𝚺
We similarly discuss computing 𝑝(𝚺|𝒟, 𝝁). The likelihood has the form
p(D | μ, Σ) ∝ |Σ|^{−N/2} exp( −(1/2) Tr(Σ⁻¹ S_μ) ),   S_μ = Σ_{n=1}^N (x_n − μ)(x_n − μ)ᵀ
The corresponding conjugate prior is known as the inverse Wishart:
InvWi(Σ | S_0, ν_0) ∝ |Σ|^{−(ν_0 + D + 1)/2} exp( −(1/2) Tr(Σ⁻¹ S_0) ),   ν_0 > D − 1,   S_0 sym. pos. def.,   Σ (D × D) pos. def.
𝜈0 + 𝐷 + 1 controls the strength of the prior, and hence plays a role
analogous to the sample size 𝑁.
The prior scatter matrix is here 𝑺0.
If Λ ~ Wi(Λ | S, ν) then Σ = Λ⁻¹ ~ InvWi(Σ | S⁻¹, ν).
Steven W. Nydick, The Wishart and Inverse Wishart Distributions, Report, 2012.
A. Gelman, J. Carlin, H. Stern and D. Rubin, Bayesian Data Analysis, 2004
Inverse Wishart Distribution
W ~ Inv-Wishart_ν(S⁻¹), ν > D − 1 (dof, real):
p(W) = Inv-Wishart_ν(W | S⁻¹) ∝ |W|^{−(ν + k + 1)/2} exp( −(1/2) Tr(S W⁻¹) )
mode = (ν + k + 1)⁻¹ S,   mean = (ν − k − 1)⁻¹ S for ν > k + 1
For k = 1, InvWi_ν(σ² | S⁻¹) = InvGamma(σ² | ν/2, S/2).   (If θ ~ Gamma(a, b), then 1/θ ~ InvGamma(a, b).)
If Λ = Σ⁻¹ ~ Wi_ν(S), then Σ ~ InvWi_ν(S⁻¹).
Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Wishart Distribution
If Λ ~ 𝒲_D(Λ | T, ν) then
p(Λ) = Wi(Λ | T, ν) = (1/Z_W) |Λ|^{(ν − D − 1)/2} exp( −(1/2) Tr(Λ T⁻¹) )
Z_W = 2^{νD/2} |T|^{ν/2} Γ_D(ν/2)
where 𝜈 > 𝐷 − 1 degrees of freedom (real), 𝑻 > 0 scale matrix (𝐷 × 𝐷, pos.
def.), 𝚲 (𝐷 × 𝐷) positive definite matrix, Γ𝐷 is the multivariate gamma function
Multiplying the likelihood by the inverse Wishart prior gives
p(Σ | D, μ) ∝ |Σ|^{−N/2} |Σ|^{−(ν_0 + D + 1)/2} exp( −(1/2) Tr(Σ⁻¹ (S_μ + S_0)) ) = InvWi(Σ | S_N, N + ν_0),   S_N = S_μ + S_0
The posterior scatter matrix 𝑺𝑁 is the prior scatter matrix 𝑺0 plus the data
scatter matrix 𝑺𝝁.
Alternatively, we can use the formula provided in Ledoit & Wolf and Schaefer & Strimmer
which is the optimal frequentist estimate (for squared loss).
This loss function for covariance matrices ignores the positive definite constraint but results in
a simple estimator (see PMTK function shrinkcov).
For the prior covariance matrix, 𝑺0, it is common to use the following (data-
dependent) prior:
S_0 = diag(Σ̂_MLE)
Ledoit, O. and M. Wolf (2004b). A well conditioned estimator for large dimensional covariance matrices. J. of Multivariate Analysis
88(2), 365– 411.
Ledoit, O. and M. Wolf (2004a). Honey, I Shrunk the Sample Covariance Matrix. J. of Portfolio Management 31(1).
Schaefer, J. and K. Strimmer (2005). A shrinkage approach to largescale covariance matrix estimation and implications for
functional genomics. Statist. Appl. Genet. Mol. Biol 4(32).
Thus we see that the diagonal entries are equal to their MLE estimates, and
the off diagonal elements are “shrunk” somewhat towards 0 (shrinkage
estimation, or regularized estimation).
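A hedged sketch of this kind of shrinkage estimator follows; the exact constants (the prior strength N0 and the scaling of the prior scatter matrix) are assumptions for illustration, not the slide's prescription:

```python
import numpy as np

def map_shrinkage_cov(X, mu, N0):
    """MAP-style shrinkage estimate of Sigma with known mean mu, using a prior
    scatter matrix proportional to diag(Sigma_MLE). N0 plays the role of the
    prior strength (nu_0 + D + 1); the exact constants are an assumption here."""
    N, D = X.shape
    Xc = X - mu
    Sigma_mle = (Xc.T @ Xc) / N                  # MLE of the covariance (known mean)
    S0 = N0 * np.diag(np.diag(Sigma_mle))        # prior scatter matrix ~ diag(Sigma_MLE)
    S_mu = N * Sigma_mle                         # data scatter matrix
    Sigma_map = (S0 + S_mu) / (N0 + N)           # diagonal kept at MLE, off-diagonals shrunk toward 0
    return Sigma_map
```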
Figure: eigenvalues in descending order of the true covariance matrix (solid black) and of the MLE and MAP estimates, for different numbers of samples (see the legend).
For joint inference of μ and Λ, the likelihood can be written as:
p(D | μ, Λ) = (2π)^{−ND/2} |Λ|^{N/2} exp( −(1/2) Σ_{n=1}^N (x_n − μ)ᵀ Λ (x_n − μ) )
= (2π)^{−ND/2} |Λ|^{N/2} exp( −(N/2)(μ − x̄)ᵀ Λ (μ − x̄) ) exp( −(1/2) Tr(Λ S_x̄) ),   S_x̄ = Σ_{n=1}^N (x_n − x̄)(x_n − x̄)ᵀ
p(μ, Λ) = NWi(μ, Λ | μ_0, κ_0, ν_0, T_0) = N(μ | μ_0, (κ_0 Λ)⁻¹) Wi(Λ | T_0, ν_0)
= (1/Z_0) |Λ|^{1/2} exp( −(κ_0/2)(μ − μ_0)ᵀ Λ (μ − μ_0) ) |Λ|^{(ν_0 − D − 1)/2} exp( −(1/2) Tr(Λ T_0⁻¹) )
Z_0 = (2π/κ_0)^{D/2} 2^{ν_0 D/2} Γ_D(ν_0/2) |T_0|^{ν_0/2},   Γ_D the multivariate Gamma function
Combining gives:
p(μ, Λ, D) ∝ exp( −(N/2)(μ − x̄)ᵀ Λ (μ − x̄) − (κ_0/2)(μ − μ_0)ᵀ Λ (μ − μ_0) ) |Λ|^{(N + ν_0 − D)/2} exp( −(1/2) Tr(Λ (T_0⁻¹ + S_x̄)) )
M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 8)
Inference for Both 𝝁 and 𝚲
p(μ, Λ, D) ∝ exp( −(N/2)(μ − x̄)ᵀ Λ (μ − x̄) − (κ_0/2)(μ − μ_0)ᵀ Λ (μ − μ_0) ) |Λ|^{(N + ν_0 − D)/2} exp( −(1/2) Tr(Λ (T_0⁻¹ + S_x̄)) )
We can complete the square in μ as follows:
p(μ, Λ, D) ∝ |Λ|^{1/2} exp( −((κ_0 + N)/2) (μ − (κ_0 μ_0 + N x̄)/(κ_0 + N))ᵀ Λ (μ − (κ_0 μ_0 + N x̄)/(κ_0 + N)) )
× |Λ|^{(ν_0 + N − D − 1)/2} exp( −(1/2) Tr( Λ [ T_0⁻¹ + S_x̄ + N x̄ x̄ᵀ + κ_0 μ_0 μ_0ᵀ − (κ_0 μ_0 + N x̄)(κ_0 μ_0 + N x̄)ᵀ/(κ_0 + N) ] ) )
= |Λ|^{1/2} exp( −((κ_0 + N)/2) (μ − μ_N)ᵀ Λ (μ − μ_N) ) |Λ|^{(ν_0 + N − D − 1)/2} exp( −(1/2) Tr( Λ [ T_0⁻¹ + S_x̄ + (κ_0 N/(κ_0 + N)) (x̄ − μ_0)(x̄ − μ_0)ᵀ ] ) )
This can be written as:
p(μ, Λ, D) ∝ NWi(μ, Λ | μ_N, κ_N, ν_N, T_N) = N(μ | μ_N, (κ_N Λ)⁻¹) Wi(Λ | T_N, ν_N)
μ_N = (κ_0 μ_0 + N x̄)/(κ_0 + N)
T_N⁻¹ = T_0⁻¹ + S_x̄ + (κ_0 N/(κ_0 + N)) (x̄ − μ_0)(x̄ − μ_0)ᵀ,   where S_x̄ = Σ_{i=1}^N (x_i − x̄)(x_i − x̄)ᵀ
κ_N = κ_0 + N,   ν_N = ν_0 + N
Z_N = (2π/κ_N)^{D/2} 2^{ν_N D/2} Γ_D(ν_N/2) |T_N|^{ν_N/2}
The posterior marginals are:*
p(Λ | D) = Wi(Λ | T_N, ν_N)
p(μ | D) = T_{ν_N − D + 1}( μ | μ_N, [κ_N (ν_N − D + 1) T_N]⁻¹ )
* Refer to this report for these results based on Bernardo and Smith.
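A minimal sketch of the Normal-Wishart update of (μ_0, κ_0, ν_0, T_0) to (μ_N, κ_N, ν_N, T_N) (illustrative names, explicit inverses for brevity):

```python
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, nu0, T0):
    """Posterior Normal-Wishart hyperparameters (mu_N, kappa_N, nu_N, T_N)
    for (mu, Lambda) of a multivariate Gaussian. X has shape (N, D)."""
    N, D = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S_xbar = Xc.T @ Xc                             # scatter matrix about the sample mean
    kappa_N = kappa0 + N
    nu_N = nu0 + N
    mu_N = (kappa0 * mu0 + N * xbar) / kappa_N
    d = (xbar - mu0).reshape(-1, 1)
    TN_inv = np.linalg.inv(T0) + S_xbar + (kappa0 * N / kappa_N) * (d @ d.T)
    return mu_N, kappa_N, nu_N, np.linalg.inv(TN_inv)
```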
Appendix: Marginal of Normal-Wishart
Consider the Normal-Wishart distribution
p(μ, Λ) = NWi(μ, Λ | μ_0, κ_0, ν_0, T_0) = N(μ | μ_0, (κ_0 Λ)⁻¹) Wi(Λ | T_0, ν_0)
∝ |Λ|^{1/2} exp( −(κ_0/2)(μ − μ_0)ᵀ Λ (μ − μ_0) ) |Λ|^{(ν_0 − D − 1)/2} exp( −(1/2) Tr(Λ T_0⁻¹) )
= |Λ|^{(ν_0 − D)/2} exp( −(1/2) Tr( Λ [T_0⁻¹ + κ_0 (μ − μ_0)(μ − μ_0)ᵀ] ) ) = |Λ|^{(ν_0 + 1 − D − 1)/2} exp( −(1/2) Tr(Λ A⁻¹) )
A⁻¹ = T_0⁻¹ + κ_0 (μ − μ_0)(μ − μ_0)ᵀ
This is a Wishart distribution 𝒲_D(Λ | A, ν_0 + 1), so when integrating out Λ we pick up the normalization factor |A|^{(ν_0+1)/2}:
p(μ) ∝ |T_0⁻¹ + κ_0 (μ − μ_0)(μ − μ_0)ᵀ|^{−(ν_0+1)/2} = |T_0⁻¹|^{−(ν_0+1)/2} ( 1 + κ_0 (μ − μ_0)ᵀ T_0 (μ − μ_0) )^{−(ν_0+1)/2}
∝ ( 1 + (1/(ν_0 + 1 − D)) (μ − μ_0)ᵀ [κ_0 (ν_0 + 1 − D) T_0] (μ − μ_0) )^{−(ν_0+1)/2}
We can immediately recognize that this is a Student's T_{ν_0+1−D}( μ_0, T_0⁻¹/(κ_0 (ν_0 + 1 − D)) ).
MAP Estimates for 𝝁 and 𝚲
p(μ, Λ, D) ∝ NWi(μ, Λ | μ_N, κ_N, ν_N, T_N) = N(μ | μ_N, (κ_N Λ)⁻¹) Wi(Λ | T_N, ν_N)
μ_N = (κ_0 μ_0 + N x̄)/(κ_0 + N),   T_N⁻¹ = T_0⁻¹ + S_x̄ + (κ_0 N/(κ_0 + N)) (x̄ − μ_0)(x̄ − μ_0)ᵀ,   where S_x̄ = Σ_{i=1}^N (x_i − x̄)(x_i − x̄)ᵀ
κ_N = κ_0 + N,   ν_N = ν_0 + N
Differentiating the equation at the top of this slide, we can also derive the MAP estimates as:
(μ̂, Λ̂) = argmax_{μ,Λ} p(μ, Λ, D)
μ̂ = ( Σ_{i=1}^N x_i + κ_0 μ_0 )/( N + κ_0 )
Λ̂⁻¹ = [ Σ_{i=1}^N (x_i − μ̂)(x_i − μ̂)ᵀ + κ_0 (μ̂ − μ_0)(μ̂ − μ_0)ᵀ + T_0⁻¹ ]/( N + ν_0 − D )
In the uninformative limit of the Normal-Wishart prior
p(μ, Λ) = N(μ | μ_0, (κ_0 Λ)⁻¹) Wi(Λ | T_0, ν_0),   obtained as κ_0 → 0, ν_0 → −1, |T_0| → ∞,
the posterior marginals and posterior predictive for this prior are given as:
p(Λ | D) = Wi_{N−1}(Λ | S_x̄⁻¹)
p(μ | D) = T_{N−D}( μ | x̄, S_x̄/(N(N − D)) )
p(x | D) = T_{N−D}( x | x̄, S_x̄(N + 1)/(N(N − D)) )
For inference of both μ and Σ, the likelihood can be written as:
p(D | μ, Σ) = (2π)^{−ND/2} |Σ|^{−N/2} exp( −(N/2)(μ − x̄)ᵀ Σ⁻¹ (μ − x̄) ) exp( −(1/2) Tr(Σ⁻¹ S_x̄) )
The conjugate prior is given as the product of a Gaussian and the Inverse
Wishart distribution:
NIW(μ, Σ | μ_0, κ_0, S_0, ν_0) = N(μ | μ_0, (1/κ_0) Σ) × InvWi(Σ | S_0, ν_0)
= (1/Z_NIW) |Σ|^{−1/2} exp( −(κ_0/2)(μ − μ_0)ᵀ Σ⁻¹ (μ − μ_0) ) |Σ|^{−(ν_0 + D + 1)/2} exp( −(1/2) Tr(Σ⁻¹ S_0) )
= (1/Z_NIW) |Σ|^{−(ν_0 + D + 2)/2} exp( −(1/2)[ κ_0 (μ − μ_0)ᵀ Σ⁻¹ (μ − μ_0) + Tr(Σ⁻¹ S_0) ] )
Z_NIW = 2^{ν_0 D/2} Γ_D(ν_0/2) (2π/κ_0)^{D/2} |S_0|^{−ν_0/2},   Γ_D the multivariate Gamma function
Inference for 𝝁 and 𝚺
p(μ, Σ | D) ∝ |Σ|^{−(N + ν_0 + D + 2)/2} exp( −(1/2)[ κ_0 (μ − μ_0)ᵀ Σ⁻¹ (μ − μ_0) + N (μ − x̄)ᵀ Σ⁻¹ (μ − x̄) ] − (1/2) Tr( Σ⁻¹ [S_0 + S_x̄] ) )
Completing the square in μ,
κ_0 (μ − μ_0)ᵀ Σ⁻¹ (μ − μ_0) + N (μ − x̄)ᵀ Σ⁻¹ (μ − x̄) = (κ_0 + N)(μ − μ_N)ᵀ Σ⁻¹ (μ − μ_N) + (κ_0 N/(κ_0 + N)) (x̄ − μ_0)ᵀ Σ⁻¹ (x̄ − μ_0)
Thus:
p(μ, Σ | D) ∝ |Σ|^{−(ν_N + D + 2)/2} exp( −(κ_N/2)(μ − μ_N)ᵀ Σ⁻¹ (μ − μ_N) − (1/2) Tr(Σ⁻¹ S_N) ) = NIW(μ, Σ | μ_N, κ_N, S_N, ν_N)
where:
μ_N = (κ_0 μ_0 + N x̄)/κ_N,   κ_N = κ_0 + N,   ν_N = ν_0 + N
S_N = S_0 + S_x̄ + (κ_0 N/κ_N)(x̄ − μ_0)(x̄ − μ_0)ᵀ = S_0 + S + κ_0 μ_0 μ_0ᵀ − κ_N μ_N μ_Nᵀ,   where S = Σ_{i=1}^N x_i x_iᵀ
The posterior mean is a convex combination of the prior mean and the MLE
with strength 𝜅0 + 𝑁.
The posterior scatter matrix S_N is the prior scatter matrix S_0 plus the empirical scatter matrix S_x̄ plus an extra term, (κ_0 N/κ_N)(x̄ − μ_0)(x̄ − μ_0)ᵀ, due to the uncertainty in the mean.
The mode of the joint posterior is (μ_N, S_N/(ν_N + D + 2)).
It is interesting to note that this mode is almost the same as the MAP estimate
shown next – it differs by 1 in the denominator as the mode above is the mode
of the joint posterior and not of the marginal.
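The NIW update has the same shape as the earlier ones; a minimal sketch under the formulas above (illustrative names):

```python
import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, S0):
    """Posterior NIW hyperparameters (mu_N, kappa_N, nu_N, S_N) for (mu, Sigma)."""
    N, D = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S_xbar = Xc.T @ Xc                                  # empirical scatter matrix
    kappa_N = kappa0 + N
    nu_N = nu0 + N
    mu_N = (kappa0 * mu0 + N * xbar) / kappa_N
    d = (xbar - mu0).reshape(-1, 1)
    S_N = S0 + S_xbar + (kappa0 * N / kappa_N) * (d @ d.T)
    return mu_N, kappa_N, nu_N, S_N
```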
The Posterior Marginals of 𝝁 and 𝚺
The posterior marginals for μ and Σ are:
p(Σ | D) = ∫ p(μ, Σ | D) dμ = IW(Σ | S_N, ν_N),   with mode Σ̂_MAP = S_N/(ν_N + D + 1) and mean 𝔼[Σ | D] = S_N/(ν_N − D − 1)
p(μ | D) = ∫ p(μ, Σ | D) dΣ = T_{ν_N − D + 1}( μ | μ_N, S_N/(κ_N (ν_N − D + 1)) )
It is not surprising that the last marginal is a Student's T, which we know can be represented as a mixture of Gaussians.
To see the connection with the scalar case (D = 1), note that S_N plays the role of the posterior sum of squares ν_N σ_N² (you may want to revisit the results for the scalar NIX analysis).
Recall that the Student's T distribution has heavier tails than the Gaussian but rapidly becomes Gaussian-like.
To see the connection of the above expression with the scalar case, note:
S_N/(κ_N (ν_N − D + 1)) = ν_N σ_N²/(κ_N ν_N) = σ_N²/κ_N
For the marginal likelihood, it is convenient to write the prior and the likelihood in unnormalized form:
NIW′(μ, Σ | μ_0, κ_0, S_0, ν_0) = |Σ|^{−(ν_0 + D + 2)/2} exp( −(1/2) Tr(Σ⁻¹ S_0) − (κ_0/2)(μ − μ_0)ᵀ Σ⁻¹ (μ − μ_0) )
N′_D(D | μ, Σ) = |Σ|^{−N/2} exp( −(N/2)(μ − x̄)ᵀ Σ⁻¹ (μ − x̄) − (1/2) Tr(Σ⁻¹ S_x̄) )
In the last two expressions, (·)′ denotes the unnormalized likelihood and prior.
K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see the calculation of the marginal likelihood for the 1D analysis of the Normal-Inverse-Chi-Squared prior in Section 5).
Non Informative Prior
The uninformative Jeffreys prior is p(μ, Σ) ∝ |Σ|^{−(D+1)/2}. This is obtained in the limit κ_0 → 0, ν_0 → −1, S_0 → 0:
p(μ, Σ | D) ∝ |Σ|^{−(ν_0 + D + 2 + N)/2} exp( −(1/2) Tr(Σ⁻¹ [S_0 + S_x̄]) − (1/2)[ κ_0 (μ − μ_0)ᵀ Σ⁻¹ (μ − μ_0) + N (μ − x̄)ᵀ Σ⁻¹ (μ − x̄) ] )
In this case, we have:
μ_N = (κ_0 μ_0 + N x̄)/κ_N = x̄,   κ_N = κ_0 + N = N,   ν_N = ν_0 + N = N − 1,   S_N = S_x̄ = Σ_{i=1}^N (x_i − x̄)(x_i − x̄)ᵀ
The posterior marginals are then given as:
p(Σ | D) = IW(Σ | S_N, ν_N) = IW(Σ | S_x̄, N − 1)
p(μ | D) = T_{ν_N − D + 1}( μ | μ_N, S_N/(κ_N (ν_N − D + 1)) ) = T_{N−D}( μ | x̄, S_x̄/(N(N − D)) )
A common alternative is to use the limit
lim_{κ_0 → 0} N(μ | μ_0, Σ/κ_0) InvWi(Σ | S_0, ν_0) with ν_0 = 0 (instead of ν_0 = −1) and S_0 = 0, which gives p(μ, Σ) ∝ |Σ|^{−(D/2 + 1)} = NIW(μ, Σ | 0, 0, 0 I, 0).
A data-dependent, weakly informative alternative starts from the NIW prior
NIW(μ, Σ | μ_0, κ_0, S_0, ν_0) = N(μ | μ_0, (1/κ_0) Σ) InvWi(Σ | S_0, ν_0) = (1/Z_NIW) exp( −(1/2)[ κ_0 (μ − μ_0)ᵀ Σ⁻¹ (μ − μ_0) + Tr(Σ⁻¹ S_0) ] ) |Σ|^{−(ν_0 + D + 2)/2}
and sets: S_0 = diag(S_x̄)/N and ν_0 = D + 2 to ensure 𝔼[Σ] = S_0, with μ_0 = x̄ and κ_0 = 0.01.
Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection, IMS Lecture Notes.
Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155-181.
Given posterior samples of Σ, we can compute the corresponding correlation coefficients R_ij = Σ_ij/√(Σ_ii Σ_jj).
We can then use kernel density estimation to produce, for plotting purposes, a smooth approximation to the univariate posterior density of R_ij, and a Monte Carlo estimate of 𝔼[R_ij | D].
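A hedged Monte Carlo sketch of this procedure using SciPy's invwishart and gaussian_kde (parameterized so that scale=S_N, df=ν_N matches the IW(Σ | S_N, ν_N) convention above; names are illustrative):

```python
import numpy as np
from scipy.stats import invwishart, gaussian_kde

def correlation_posterior_kde(S_N, nu_N, i, j, n_samples=5000, seed=0):
    """Monte Carlo approximation of the posterior of the correlation R_ij:
    draw Sigma ~ IW(S_N, nu_N), form R_ij, and smooth the samples with a KDE."""
    Sigmas = invwishart(df=nu_N, scale=S_N).rvs(size=n_samples, random_state=seed)
    R = Sigmas[:, i, j] / np.sqrt(Sigmas[:, i, i] * Sigmas[:, j, j])
    return R.mean(), gaussian_kde(R)   # Monte Carlo E[R_ij|D] and a smooth density estimate
```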
Figure: plots of the Wishart distribution (wiPlotDemo from PMTK).