
Conjugate Bayesian Analysis of the Gaussian Distribution
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 10, 2020
Statistical Computing and Machine Learning, Fall 2020


References
• C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource).
• A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003.
• J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource).
• D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
• B. Vidakovic, Bayesian Statistics for Engineering, Online Course at Georgia Tech.
• C. Bishop, Pattern Recognition and Machine Learning (PRML), Chapter 2.
• M. Jordan, An Introduction to Probabilistic Graphical Models, Chapter 8 (pre-print).
• K. Murphy, Machine Learning: A Probabilistic Perspective, Chapters 2 and 4.
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007.
Contents
• Prior, likelihood and posterior; posterior point estimates; predictive distribution; univariate Gaussian with unknown mean; inference of the precision with known mean; scaled inverse chi-squared prior for the variance; inference on the mean and precision.
• Inference for the multivariate Gaussian: unknown mean; unknown precision; MAP estimation and shrinkage; unknown mean and precision; marginal likelihood; non-informative priors; the Wishart distribution.


Goals
The goals for today's lecture are:
• Familiarize ourselves with the calculations needed to compute the joint posterior and posterior marginals of 𝜇 and 𝜎² for a univariate Gaussian.
• Understand the concept of conjugate priors needed for inferring 𝜇 and 𝜎².
• Understand the concept of uninformative priors.
• Learn how to perform inference for 𝝁 and 𝚲 for the multivariate Gaussian using the appropriate prior models and their uninformative limits.
• Learn how to perform inference for 𝝁 and 𝚺 for the multivariate Gaussian using the appropriate prior models and their uninformative limits.


Posterior 𝝅(𝜽|𝒙): Inference and Prediction
• The posterior combines the prior and the likelihood:
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}$$
• It weights the data and the prior information in making probabilistic inferences.
• The posterior distribution is also useful in estimating the probability of observing a future outcome (prediction).
• The normalizing factor is given as:
$$m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$$


Posterior Inference: Point Estimates
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}$$
• Maximum a posteriori (MAP) estimate:
$$\theta^{*} = \arg\max_{\theta} \log \pi(\theta \mid x) = \arg\max_{\theta} \left[ \log f(x \mid \theta) + \log \pi(\theta) \right]$$
• Posterior mean:
$$\hat{\theta} = \mathbb{E}_{\pi(\theta \mid x)}[\theta] = \int \theta\, \pi(\theta \mid x)\, d\theta$$
• Posterior quantiles:
$$\Pr\left(\theta \ge a\right) = \int_{a}^{\infty} \pi(\theta \mid x)\, d\theta$$
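As a quick illustration of these point estimates, here is a minimal Python sketch (not part of the original slides) that evaluates the posterior on a grid for a 1D Gaussian likelihood with known noise scale and a Gaussian prior; the data values, prior parameters and the threshold a = 1.5 are illustrative assumptions.

```python
# Grid approximation of the posterior and its point estimates (illustrative example).
import numpy as np
from scipy.stats import norm

x_data = np.array([2.1, 1.7, 2.5])            # assumed observations
sigma = 1.0                                    # assumed known noise std
theta = np.linspace(-5, 10, 20001)             # grid over the parameter
dtheta = theta[1] - theta[0]

log_prior = norm.logpdf(theta, loc=0.0, scale=3.0)
log_lik = norm.logpdf(x_data[:, None], loc=theta, scale=sigma).sum(axis=0)
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta                    # normalization plays the role of m(x)

theta_map = theta[np.argmax(post)]             # MAP estimate
theta_mean = (theta * post).sum() * dtheta     # posterior mean
pr_ge_a = post[theta >= 1.5].sum() * dtheta    # Pr(theta >= a) for a = 1.5
print(theta_map, theta_mean, pr_ge_a)
```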


Posterior Predictive Distribution
• Suppose we have observed 𝒙 and we want to make a prediction about (future) unknown observables: what is the probability of observing data 𝒙̂ if we have already observed data 𝒙?
• This means finding 𝑔(𝒙̂|𝒙).
• We have:
$$g(\hat{\boldsymbol{x}} \mid \boldsymbol{x}) = \int g(\hat{\boldsymbol{x}}, \theta \mid \boldsymbol{x})\, d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{m(\boldsymbol{x})}\, d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{\phi(\theta, \boldsymbol{x})} \frac{\phi(\theta, \boldsymbol{x})}{m(\boldsymbol{x})}\, d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta, \boldsymbol{x})\, \pi(\theta \mid \boldsymbol{x})\, d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta)\, \pi(\theta \mid \boldsymbol{x})\, d\theta$$
• Compare this with the form of the normalizing factor (computed for 𝒙̂) in Bayes' rule:
$$m(\hat{\boldsymbol{x}}) = \int f(\hat{\boldsymbol{x}} \mid \theta)\, \pi(\theta)\, d\theta$$


Bayesian Inference for the Gaussian
• Consider $X = \{x_1, x_2, \ldots, x_N\} \sim \mathcal{N}(\mu, \sigma^2)$ with prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.
• We proved in the last lecture that the posterior is given as $\mu \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2)$ with
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
\quad\Longleftrightarrow\quad
\sigma_N^2 = \frac{\sigma^2 \sigma_0^2}{N \sigma_0^2 + \sigma^2},$$
and
$$\mu_N = \sigma_N^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N} x_n}{\sigma^2} \right)
= \frac{\sigma^2 \mu_0 + N \sigma_0^2\, \mu_{ML}}{N \sigma_0^2 + \sigma^2},
\qquad \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$
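A small Python sketch (not the lecture's PMTK code) of this conjugate update; the single observation y = 3 and the two prior variances follow the example on the next slide, while the noise variance 𝜎² = 1 is an assumption.

```python
# Conjugate update for the mean of a Gaussian with known variance.
import numpy as np

def posterior_mean_known_variance(x, sigma2, mu0, sigma02):
    """Return (mu_N, sigma_N^2) for the conjugate Gaussian-mean update."""
    x = np.asarray(x, dtype=float)
    N = x.size
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)           # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_ml = x.mean()
    mu_N = sigma_N2 * (mu0 / sigma02 + N * mu_ml / sigma2)   # precision-weighted average
    return mu_N, sigma_N2

# Single noisy observation y = 3 with sigma^2 = 1 (assumed):
print(posterior_mean_known_variance([3.0], sigma2=1.0, mu0=0.0, sigma02=1.0))   # strong prior N(0, 1)
print(posterior_mean_known_variance([3.0], sigma2=1.0, mu0=0.0, sigma02=5.0))   # weak prior N(0, 5)
```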


Inferring the Mean of a Gaussian
[Figure: gaussInferParamsMean1d from PMTK. Two panels (prior variance = 1.00 and prior variance = 5.00) showing the prior, likelihood and posterior densities.]
• Inference about 𝑥 given a single noisy observation 𝑦 = 3.
• (a) Strong prior 𝒩(0, 1): the posterior mean is "shrunk" towards the prior mean, which is 0.
• (b) Weak prior 𝒩(0, 5): the posterior mean is similar to the MLE.
Inference of Precision with Known Mean
• Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1}),\ n = 1, \ldots, N$. We want to infer the precision 𝜆 = 1/𝜎² with the mean 𝜇 taken as known.
• The likelihood takes the form:
$$p(X \mid \lambda) = \prod_{n=1}^{N} f(x_n \mid \lambda) \propto \lambda^{N/2} \exp\!\left( -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right)$$
• The corresponding "conjugate prior" (a prior that results in a posterior of the same family as the prior) should be proportional to the product of a power of 𝜆 and the exponential of a linear function of 𝜆. This corresponds to the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}, \quad \lambda > 0, \qquad
\mathbb{E}[\lambda] = \frac{a}{b}, \quad \mathrm{var}[\lambda] = \frac{a}{b^2}$$
• The Gamma distribution has a finite integral if 𝑎 > 0, and the distribution itself is finite (it does not diverge at 𝜆 = 0) if 𝑎 ≥ 1.
Inference of Precision with Known Mean
• The posterior takes the form:
$$p(\lambda \mid X, \mu) \propto \prod_{n=1}^{N} f(x_n \mid \lambda)\, \mathrm{Gamma}(\lambda \mid a_0, b_0)
\propto \lambda^{N/2 + a_0 - 1} \exp\!\left( -b_0 \lambda - \frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right)$$
• We can immediately see that the posterior is also a Gamma distribution:
$$p(\lambda \mid X, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad
a_N = a_0 + \frac{N}{2}, \qquad
b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2}\, \sigma_{ML}^2$$
• Here $\sigma_{ML}^2$ is the MLE of the variance.
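A small Python sketch of this Gamma posterior update for the precision with known mean; the prior values (a₀, b₀) and the simulated data are illustrative assumptions.

```python
# Gamma posterior for the precision lambda = 1/sigma^2 with known mean mu.
import numpy as np
from scipy.stats import gamma

def precision_posterior(x, mu, a0, b0):
    x = np.asarray(x, dtype=float)
    N = x.size
    aN = a0 + N / 2.0
    bN = b0 + 0.5 * np.sum((x - mu) ** 2)    # = b0 + (N/2) * sigma2_ML
    return aN, bN

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=np.sqrt(10.0), size=50)
aN, bN = precision_posterior(x, mu=5.0, a0=1.0, b0=1.0)
# scipy's gamma uses a shape/scale parameterization; rate b corresponds to scale = 1/b
post_lambda = gamma(a=aN, scale=1.0 / bN)
print(post_lambda.mean(), 1.0 / post_lambda.mean())   # E[lambda|X] = aN/bN and a rough sigma^2 estimate
```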


Inference of Precision with Known Mean
• The effect of observing 𝑁 data points is to increase the value of 𝑎 by 𝑁/2 (i.e. ½ for each data point). Thus we interpret the parameter 𝑎₀ as 2𝑎₀ 'effective' prior observations.
$$p(\lambda \mid X, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad
a_N = a_0 + \frac{N}{2}, \qquad
b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2}\, \sigma_{ML}^2$$
• Each measurement contributes a variance $\sigma_{ML}^2 / 2$ to the parameter 𝑏. Since we have 2𝑎₀ effective prior measurements, each of them contributes to 𝑏 an effective prior variance $\sigma_0^2$ defined by
$$b_0 = \frac{2a_0}{2}\, \sigma_0^2 \quad\Longrightarrow\quad \sigma_0^2 = \frac{b_0}{a_0}.$$
The interpretation of a conjugate prior in terms of effective data points is typical for the exponential family of distributions.
• The results above are identical to inference of the variance 𝜎² using as prior $\mathrm{InvGamma}(\sigma^2 \mid a_0, b_0)$, resulting in a posterior $\mathrm{InvGamma}(\sigma^2 \mid a_N, b_N)$.
Gamma and Inverse Gamma
• If $\lambda \sim \mathrm{Gamma}(\lambda \mid a, b)$ then $\lambda^{-1} \sim \mathrm{InvGamma}(\lambda^{-1} \mid a, b)$.
• Here: $\lambda \sim \mathrm{Gamma}(\lambda \mid a, b)$ implies $\sigma^2 = \lambda^{-1} \sim \mathrm{InvGamma}(\sigma^2 \mid a, b)$.
• Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004.


Scaled Inverse Chi-Squared Prior
• An alternative prior for 𝜎² is the scaled inverse chi-squared distribution*:
$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{InvGamma}\!\left(\sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right)
\propto (\sigma^2)^{-\nu_0/2 - 1} \exp\!\left( -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right)$$
• Here 𝜈₀ represents the strength of the prior and 𝜎₀² encodes the value of the prior. With this, the posterior takes the form:
$$\chi^{-2}(\sigma^2 \mid \mathcal{D}, \mu) = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad
\nu_N = \nu_0 + N, \qquad
\sigma_N^2 = \frac{\nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2}{\nu_N}$$
• The posterior dof 𝜈_N is the prior dof 𝜈₀ plus 𝑁. The posterior sum of squares $\sigma_N^2 \nu_N = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2$ is the prior sum of squares plus the data sum of squares. An uninformative prior corresponds to zero virtual sample size, 𝜈₀ = 0. This is the prior $p(\sigma^2) \propto \sigma^{-2}$.
• This parameterization is arguably more appealing than the (𝑎, 𝑏) one, since 𝜈₀ and 𝜎₀² have a direct interpretation as a prior sample size and a prior variance.

* Often denoted as $\text{Scale-Inv-}\chi^2(\sigma^2 \mid \nu_0, \sigma_0^2)$, with
$\text{mean} = \dfrac{\nu_0 \sigma_0^2}{\nu_0 - 2}$ for $\nu_0 > 2$,
$\text{var} = \dfrac{2 \nu_0^2 \sigma_0^4}{(\nu_0 - 2)^2 (\nu_0 - 4)}$ for $\nu_0 > 4$,
$\text{mode} = \dfrac{\nu_0 \sigma_0^2}{\nu_0 + 2}$.
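A Python sketch of this scaled inverse chi-squared update, expressed through scipy's inverse Gamma using χ⁻²(𝜎²|𝜈, 𝜎²₀) = IG(𝜈/2, 𝜈𝜎²₀/2); the prior values and simulated data mimic the uninformative setting of the next slide and are assumptions.

```python
# Scaled inverse chi-squared posterior for sigma^2 with known mean, via scipy's invgamma.
import numpy as np
from scipy.stats import invgamma

def scaled_inv_chi2_posterior(x, mu, v0, sigma02):
    x = np.asarray(x, dtype=float)
    N = x.size
    vN = v0 + N
    sigmaN2 = (v0 * sigma02 + np.sum((x - mu) ** 2)) / vN
    return vN, sigmaN2

rng = np.random.default_rng(1)
x = rng.normal(5.0, np.sqrt(10.0), size=100)
vN, sigmaN2 = scaled_inv_chi2_posterior(x, mu=5.0, v0=0.001, sigma02=1.0)
post = invgamma(a=vN / 2.0, scale=vN * sigmaN2 / 2.0)   # posterior over sigma^2
print(post.mean())   # should be near the true variance 10 for large N
```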


Sequential Update of the Posterior for 𝝈²
• For $D = 1$: $\mathrm{IW}(\sigma^2 \mid \nu, s^{-1}) = \mathrm{IG}\!\left(\sigma^2 \,\Big|\, \frac{\nu}{2}, \frac{s}{2}\right)$, with $s = \nu \sigma^2$.
[Figure: gaussSeqUpdateSigma1D from PMTK. Sequential posteriors of 𝜎² for N = 2, 5, 50, 100; prior IW(𝜈 = 0.001, S = 0.001); true 𝜎² = 10.]
• Sequential update of the posterior for 𝜎², starting from an uninformative prior $\mathrm{IW}(\sigma^2 \mid \nu_0 = 0.001,\ s_0 = \nu_0 \sigma_0^2 = 0.001)$. The scalar version of the inverse Wishart is the same as the inverse Gamma. The data were generated from 𝒩(5, 10).
• Gelman 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3):515–533.
Bayesian Inference: Unknown Mean and Precision
• Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1}),\ n = 1, \ldots, N$. We want to infer both the precision 𝜆 = 1/𝜎² and the mean 𝜇.
• The likelihood takes the form:
$$p(X \mid \mu, \lambda) = \prod_{n=1}^{N} f(x_n \mid \mu, \lambda)
\propto \lambda^{N/2} \exp\!\left( -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right)
= \left[ \lambda^{1/2} \exp\!\left( -\frac{\lambda \mu^2}{2} \right) \right]^{N}
\exp\!\left( \lambda \mu \sum_{n=1}^{N} x_n - \frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \right)$$
• We need a prior that has a similar functional form in terms of 𝜆 and 𝜇:
$$p(\mu, \lambda) \propto \left[ \lambda^{1/2} \exp\!\left( -\frac{\lambda \mu^2}{2} \right) \right]^{\beta}
\exp\!\left( c \lambda \mu - d \lambda \right)
= \underbrace{\exp\!\left( -\frac{\beta \lambda}{2} \left( \mu - \frac{c}{\beta} \right)^2 \right)}_{\propto\, p(\mu \mid \lambda)}
\underbrace{\lambda^{\beta/2} \exp\!\left( -\left( d - \frac{c^2}{2\beta} \right) \lambda \right)}_{\propto\, p(\lambda)}$$
Bayesian Inference: Unknown Mean and Precision
$$p(\mu, \lambda) \propto \left[ \lambda^{1/2} \exp\!\left( -\frac{\lambda \mu^2}{2} \right) \right]^{\beta}
\exp\!\left( c \lambda \mu - d \lambda \right)
= \underbrace{\lambda^{1/2} \exp\!\left( -\frac{\beta \lambda}{2} \left( \mu - \frac{c}{\beta} \right)^2 \right)}_{\propto\, p(\mu \mid \lambda)}
\underbrace{\lambda^{(\beta - 1)/2} \exp\!\left( -\left( d - \frac{c^2}{2\beta} \right) \lambda \right)}_{\propto\, p(\lambda)}$$
• We can easily identify that the prior is of the Normal-Gamma form:
$$p(\mu, \lambda) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_0 = \frac{c}{\beta},\ (\beta \lambda)^{-1} \right)
\mathrm{Gamma}\!\left( \lambda \,\Big|\, a = 1 + \frac{\beta}{2},\ b = d - \frac{c^2}{2\beta} \right)$$
• Recall the form of the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}$$


Bayesian Inference: Unknown Mean and Precision
• Combining the likelihood and prior, we can re-arrange and write:
$$p(\mu, \lambda \mid X) \propto \lambda^{N/2 + a - 1}
\exp\!\left( -\lambda \left[ b + \frac{1}{2} \sum_{n=1}^{N} x_n^2 + \frac{\beta \mu_0^2}{2} \right] \right)
\lambda^{1/2} \exp\!\left( -\frac{(N + \beta)\lambda}{2} \left[ \mu^2 - 2\mu\, \frac{\beta \mu_0 + \sum_{n=1}^{N} x_n}{N + \beta} \right] \right)$$
• Completing the square on the second exponent gives:
$$p(\mu, \lambda \mid X) \propto
\underbrace{\lambda^{a_N - 1} \exp\!\left( -b_N \lambda \right)}_{\mathrm{Gamma}(\lambda \mid a_N, b_N)}
\underbrace{\lambda^{1/2} \exp\!\left( -\frac{(N + \beta)\lambda}{2} \left( \mu - \mu_N \right)^2 \right)}_{\propto\, \mathcal{N}\left( \mu \mid \mu_N,\ ((N + \beta)\lambda)^{-1} \right)}$$
with
$$a_N = a + \frac{N}{2}, \qquad
b_N = b + \frac{1}{2} \sum_{n=1}^{N} x_n^2 + \frac{\beta \mu_0^2}{2} - \frac{\left( \beta \mu_0 + \sum_{n=1}^{N} x_n \right)^2}{2(N + \beta)}, \qquad
\mu_N = \frac{\beta \mu_0 + \sum_{n=1}^{N} x_n}{N + \beta}.$$
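A Python sketch of this Normal-Gamma update in the parameterization 𝑝(𝜇, 𝜆) = 𝒩(𝜇|𝜇₀, (𝛽𝜆)⁻¹) Gamma(𝜆|𝑎, 𝑏); the hyperparameter values and simulated data are illustrative assumptions, and the equivalent "centered" form of 𝑏_N is used.

```python
# Normal-Gamma posterior update for unknown mean mu and precision lambda.
import numpy as np

def normal_gamma_posterior(x, mu0, beta0, a0, b0):
    x = np.asarray(x, dtype=float)
    N, xbar = x.size, x.mean()
    betaN = beta0 + N
    muN = (beta0 * mu0 + N * xbar) / betaN
    aN = a0 + N / 2.0
    # Equivalent to b + 0.5*sum(x^2) + 0.5*beta*mu0^2 - (beta*mu0 + sum(x))^2 / (2*(N+beta))
    bN = b0 + 0.5 * np.sum((x - xbar) ** 2) + 0.5 * N * beta0 * (xbar - mu0) ** 2 / betaN
    return muN, betaN, aN, bN

rng = np.random.default_rng(2)
x = rng.normal(2.0, 1.5, size=200)
muN, betaN, aN, bN = normal_gamma_posterior(x, mu0=0.0, beta0=2.0, a0=5.0, b0=6.0)
print(muN, aN / bN)        # posterior mean of mu and E[lambda | X] = aN / bN
```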


The Normal-Gamma Distribution
$$p(\mu, \lambda) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_0 = \frac{c}{\beta},\ (\beta \lambda)^{-1} \right)
\mathrm{Gamma}\!\left( \lambda \,\Big|\, a = 1 + \frac{\beta}{2},\ b = d - \frac{c^2}{2\beta} \right)$$
[Figure (MatLab code): contour plot of the Normal-Gamma density over (𝜇, 𝜆) for 𝑎 = 5, 𝑏 = 6, 𝜇₀ = 0, 𝛽 = 2.]
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior).
Posterior for 𝝁 and 𝝈² for Scalar Data
• We can also work directly with 𝜎². We use the normal inverse chi-squared (NIX) distribution:
$$N\chi^{-2}(\mu, \sigma^2 \mid m_0, \kappa_0, \nu_0, \sigma_0^2)
= \mathcal{N}(\mu \mid m_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2)
\propto (\sigma^2)^{-(\nu_0 + 3)/2} \exp\!\left( -\frac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - m_0)^2}{2\sigma^2} \right),$$
where $\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{IG}\!\left( \sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2} \right)$.
• Similarly to our earlier calculations, the posterior is given as:
$$p(\mu, \sigma^2 \mid \mathcal{D}) = N\chi^{-2}(\mu, \sigma^2 \mid m_N, \kappa_N, \nu_N, \sigma_N^2),$$
$$m_N = \frac{\kappa_0 m_0 + N \bar{x}}{\kappa_N} = \frac{\kappa_0}{\kappa_0 + N} m_0 + \frac{N}{\kappa_0 + N} \bar{x}, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N,$$
$$\nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{N \kappa_0}{\kappa_0 + N} (m_0 - \bar{x})^2.$$
• The posterior marginal for 𝜎² and its posterior expectation are:
$$p(\sigma^2 \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\mu = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad
\mathbb{E}[\sigma^2 \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2}\, \sigma_N^2.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior; Section 6 provides the analysis for a normal-inverse-Gamma prior).
Normal Inverse 𝝌² Distribution
[Figure: NIXdemo2 from PMTK. Three surface/contour plots of 𝑁𝐼𝜒²(𝜇₀ = 0, 𝜅₀ = 1, 𝜈₀ = 1, 𝜎₀² = 1), 𝑁𝐼𝜒²(𝜇₀ = 0, 𝜅₀ = 5, 𝜈₀ = 1, 𝜎₀² = 1) and 𝑁𝐼𝜒²(𝜇₀ = 0, 𝜅₀ = 1, 𝜈₀ = 5, 𝜎₀² = 1).]
• The 𝑁𝐼𝜒²(𝜇₀, 𝜅₀, 𝜈₀, 𝜎₀²) distribution. 𝜇₀ is the prior mean and 𝜅₀ is how strongly we believe this; 𝜎₀² is the prior variance and 𝜈₀ is how strongly we believe this. (a) The contour plot (underneath the surface) is shaped like a "squashed egg". (b) We increase the strength of our belief in the mean, so the distribution gets narrower in 𝜇. (c) We increase the strength of our belief in the variance, so it gets narrower in 𝜎².
Posterior for an Uninformative Prior for 𝝁 and 𝝈²
• The posterior marginal for 𝜇 is Student's 𝒯:
$$p(\mu \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\sigma^2
= \mathcal{T}(\mu \mid m_N, \sigma_N^2 / \kappa_N, \nu_N), \qquad
\mathbb{E}[\mu \mid \mathcal{D}] = m_N.$$
Recall the Student's 𝒯 density:
$$p(x \mid \mu, \sigma^2, \nu) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
\left( \frac{1}{\nu \pi \sigma^2} \right)^{1/2}
\left[ 1 + \frac{(x - \mu)^2}{\nu \sigma^2} \right]^{-(\nu + 1)/2},$$
with mean 𝜇 (for 𝜈 > 1), mode 𝜇, and variance $\dfrac{\nu \sigma^2}{\nu - 2}$ (for 𝜈 > 2).
• The posterior predictive is:
$$p(x \mid \mathcal{D}) = \iint p(x \mid \mu, \sigma^2)\, p(\mu, \sigma^2 \mid \mathcal{D})\, d\mu\, d\sigma^2
= \mathcal{T}\!\left( x \mid m_N,\ \frac{\sigma_N^2 (1 + \kappa_N)}{\kappa_N},\ \nu_N \right).$$
• Let us revisit all these results with an uninformative prior:*
$$p(\mu, \sigma^2) \propto p(\mu)\, p(\sigma^2) \propto \sigma^{-2}
\propto N\chi^{-2}(\mu, \sigma^2 \mid \mu_0 = 0, \kappa_0 = 0, \nu_0 = -1, \sigma_0^2 = 0).$$
• With this prior, the posterior becomes:**
$$p(\mu, \sigma^2 \mid \mathcal{D}) = N\chi^{-2}\!\left( \mu, \sigma^2 \mid m_N = \bar{x},\ \kappa_N = N,\ \nu_N = N - 1,\ \sigma_N^2 = s^2 \right), \qquad
s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{N}{N - 1}\, \sigma_{MLE}^2.$$

* Recall that $N\chi^{-2}(\mu, \sigma^2 \mid m_0, \kappa_0, \nu_0, \sigma_0^2) \propto (\sigma^2)^{-(\nu_0 + 3)/2} \exp\!\left( -\dfrac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - m_0)^2}{2\sigma^2} \right)$.
** Recall that $p(\mu, \sigma^2 \mid \mathcal{D}) = N\chi^{-2}(\mu, \sigma^2 \mid m_N, \kappa_N, \nu_N, \sigma_N^2)$ with
$m_N = \dfrac{\kappa_0 m_0 + N\bar{x}}{\kappa_0 + N}$, $\kappa_N = \kappa_0 + N$, $\nu_N = \nu_0 + N$,
$\nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \dfrac{N \kappa_0}{\kappa_0 + N} (m_0 - \bar{x})^2$.
Marginal Posterior for 𝝁 and Credible Interval
• The predictive distribution becomes:
$$p(x \mid \mathcal{D}) = \mathcal{T}\!\left( x \mid m_N,\ \frac{\sigma_N^2 (1 + \kappa_N)}{\kappa_N},\ \nu_N \right)
= \mathcal{T}\!\left( x \mid \bar{x},\ \frac{s^2 (1 + N)}{N},\ N - 1 \right).$$
• 𝑠 is the sample standard deviation. The marginal posterior for 𝜇 becomes:
$$p(\mu \mid \mathcal{D}) = \mathcal{T}\!\left( \mu \mid \bar{x},\ \frac{s^2}{N},\ N - 1 \right), \qquad
s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{N}{N - 1}\, \sigma_{MLE}^2, \qquad
\mathrm{var}[\mu \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2} \frac{s^2}{N} = \frac{N - 1}{N - 3} \frac{s^2}{N} \approx \frac{s^2}{N}.$$
• The standard error of the mean is defined as
$$\sqrt{\mathrm{var}[\mu \mid \mathcal{D}]} \approx \frac{s}{\sqrt{N}}.$$
• An approximate 95% posterior credible interval is thus:
$$I_{0.95}(\mu \mid \mathcal{D}) = \bar{x} \pm 2 \frac{s}{\sqrt{N}}.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9).
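A Python sketch comparing the exact Student-𝒯 credible interval with the approximate x̄ ± 2 s/√N interval above; the simulated data are an illustrative assumption.

```python
# 95% credible interval for mu under the uninformative prior (exact T vs. approximation).
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=40)
N, xbar = x.size, x.mean()
s2 = np.sum((x - xbar) ** 2) / (N - 1)           # unbiased sample variance
scale = np.sqrt(s2 / N)                           # scale of T(mu | xbar, s^2/N, N-1)

lo, hi = t.interval(0.95, df=N - 1, loc=xbar, scale=scale)     # exact 95% credible interval
print((lo, hi))
print((xbar - 2 * np.sqrt(s2 / N), xbar + 2 * np.sqrt(s2 / N)))  # approximate interval
```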
Multivariate Gaussian: Posterior of 𝝁
• Consider a known covariance 𝚺 and a Gaussian prior 𝒩(𝝁₀, 𝚺₀), with the posterior for the unknown mean 𝝁 taking the form:
$$p(\boldsymbol{\mu} \mid \boldsymbol{X}) \propto p(\boldsymbol{\mu}) \prod_{n=1}^{N} p(\boldsymbol{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$
• This posterior is the exponential of a quadratic in 𝝁:
$$-\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu})
= -\frac{1}{2} \boldsymbol{\mu}^T \left( \boldsymbol{\Sigma}_0^{-1} + N \boldsymbol{\Sigma}^{-1} \right) \boldsymbol{\mu}
+ \boldsymbol{\mu}^T \left( \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 + \boldsymbol{\Sigma}^{-1} \sum_{n=1}^{N} \boldsymbol{x}_n \right) + \text{const}$$
• So the covariance and mean of the posterior $p(\boldsymbol{\mu} \mid \boldsymbol{X}, \boldsymbol{\Sigma}) = \mathcal{N}(\boldsymbol{\mu} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$ are:
$$\boldsymbol{\Sigma}_N^{-1} = \boldsymbol{\Sigma}_0^{-1} + N \boldsymbol{\Sigma}^{-1}, \qquad
\boldsymbol{\mu}_N = \boldsymbol{\Sigma}_N \left( \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 + N \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{ML} \right)
= \boldsymbol{\Sigma}_0 \left( \boldsymbol{\Sigma}_0 + \tfrac{1}{N} \boldsymbol{\Sigma} \right)^{-1} \boldsymbol{\mu}_{ML}
+ \tfrac{1}{N} \boldsymbol{\Sigma} \left( \boldsymbol{\Sigma}_0 + \tfrac{1}{N} \boldsymbol{\Sigma} \right)^{-1} \boldsymbol{\mu}_0, \qquad
\boldsymbol{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n.$$
• For an uninformative prior, $\boldsymbol{\Sigma}_0 \to \infty \boldsymbol{I}$:
$\ p(\boldsymbol{\mu} \mid \boldsymbol{X}, \boldsymbol{\Sigma}) = \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_{ML}, \tfrac{1}{N}\boldsymbol{\Sigma} \right)$.
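A Python sketch of this multivariate update for 𝝁 with known covariance; the covariance, prior and simulated data below are illustrative assumptions.

```python
# Posterior N(mu | mu_N, Sigma_N) for the mean of a multivariate Gaussian with known Sigma.
import numpy as np

def posterior_mean_mvn(X, Sigma, mu0, Sigma0):
    X = np.atleast_2d(X)
    N = X.shape[0]
    mu_ml = X.mean(axis=0)
    Lam, Lam0 = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
    SigmaN = np.linalg.inv(Lam0 + N * Lam)                 # Sigma_N^-1 = Sigma_0^-1 + N Sigma^-1
    muN = SigmaN @ (Lam0 @ mu0 + N * Lam @ mu_ml)
    return muN, SigmaN

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=50)
muN, SigmaN = posterior_mean_mvn(X, Sigma, mu0=np.zeros(2), Sigma0=10.0 * np.eye(2))
print(muN)   # close to the MLE because the prior is weak
```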
Posterior Distribution of Precision 𝚲
• We now discuss how to compute 𝑝(𝚲|𝒟, 𝝁) for a 𝐷-dimensional Gaussian. The likelihood has the form
$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}) \propto |\boldsymbol{\Lambda}|^{N/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\mu} \boldsymbol{\Lambda} \right) \right), \qquad
\boldsymbol{S}_{\mu} = \sum_{n} (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T.$$
• The corresponding conjugate prior is known as the Wishart distribution:
$$\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{W}_0, \nu_0)
= B\, |\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{W}_0^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
\nu_0 > D - 1 \ (\text{dof}), \quad \boldsymbol{W}_0 \ \text{sym. pos. def.} \ (D \times D),$$
$$B(\boldsymbol{W}_0, \nu_0) = |\boldsymbol{W}_0|^{-\nu_0/2}
\left( 2^{\nu_0 D/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left( \frac{\nu_0 + 1 - i}{2} \right) \right)^{-1}
= \frac{|\boldsymbol{W}_0|^{-\nu_0/2}}{2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right)}, \qquad
\Gamma_D\!\left( \frac{\nu_0}{2} \right) = \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left( \frac{\nu_0 + 1 - i}{2} \right)
\ \ (\text{multivariate Gamma function}).$$
• Clearly the posterior is $\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{W}_N, \nu_N)$ with $\boldsymbol{W}_N^{-1} = \boldsymbol{W}_0^{-1} + \boldsymbol{S}_{\mu}$ and $\nu_N = \nu_0 + N$. For 𝐷 = 1, the Wishart becomes the Gamma for 𝜆 that we used earlier for the univariate Gaussian:
$$\mathcal{W}i(\lambda \mid W_N, \nu_N) = \mathrm{Gamma}\!\left( \lambda \,\Big|\, \frac{\nu_N}{2}, \frac{W_N^{-1}}{2} \right).$$
The Wishart Vs. the Gamma Distributions
• Mode of the Wishart: $\text{mode} = (\nu - k - 1)\, \boldsymbol{S}$ for $\nu \geq k + 1$.
• For $\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})$, the scatter matrix $\boldsymbol{S} = \sum_{i=1}^{N} \boldsymbol{x}_i \boldsymbol{x}_i^T \sim \mathcal{W}i(\boldsymbol{S} \mid \boldsymbol{\Sigma}, N)$, with $\mathbb{E}[\boldsymbol{S}] = N \boldsymbol{\Sigma}$.
• Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004.
Posterior Distribution of 𝚺
• We similarly discuss computing 𝑝(𝚺|𝒟, 𝝁). The likelihood has the form
$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\mu} \boldsymbol{\Sigma}^{-1} \right) \right), \qquad
\boldsymbol{S}_{\mu} = \sum_{n} (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T.$$
• The corresponding conjugate prior is known as the inverse Wishart:
$$\mathrm{InvWi}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0) \propto |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_0 \boldsymbol{\Sigma}^{-1} \right) \right), \qquad
\nu_0 > D - 1, \quad \boldsymbol{S}_0 \ \text{sym. pos. def.}, \quad \boldsymbol{\Sigma} \ (D \times D) \ \text{pos. def.}$$
• $\nu_0 + D + 1$ controls the strength of the prior, and hence plays a role analogous to the sample size 𝑁.
• The prior scatter matrix is here 𝑺₀.
• If $\boldsymbol{\Sigma}^{-1} \sim \mathcal{W}i(\cdot \mid \boldsymbol{S}, \nu)$ then $\boldsymbol{\Sigma} \sim \mathrm{InvWi}(\cdot \mid \boldsymbol{S}^{-1}, \nu)$.
• Steven W. Nydick, The Wishart and Inverse Wishart Distributions, Report, 2012.
• A. Gelman, J. Carlin, H. Stern and D. Rubin, Bayesian Data Analysis, 2004.
Inverse Wishart Distribution
W ~ Inv  Wishart ( S ) for v  D  1 (dof , real )

p (W )  Inv  Wishart (W | S )

for v  D  1
mode   v  k  1 S
1

  S 1
For k = 1, InvWi  | ,S  InvGamma   2 | , 
2 If  ~ Gamma (a,b) 

~ InvGamma (a,b)
 2 2
If  1 ~ Wi  , S    ~ InvWi  , S 1 
Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 28
Wishart Distribution
• If 𝜦 ~ 𝒲_D(𝜦|𝑻, 𝜈) then
$$p(\boldsymbol{\Lambda}) = \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}, \nu)
= \frac{1}{Z_W} |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
Z_W = 2^{\nu D/2}\, |\boldsymbol{T}|^{\nu/2}\, \Gamma_D\!\left( \frac{\nu}{2} \right),$$
where 𝜈 > 𝐷 − 1 degrees of freedom (real), 𝑻 > 0 is the scale matrix (𝐷 × 𝐷, pos. def.), 𝚲 is a (𝐷 × 𝐷) positive definite matrix, and Γ_D is the multivariate gamma function.
• The following properties are useful:
$$\mathbb{E}[\boldsymbol{\Lambda}] = \nu \boldsymbol{T}, \qquad
\text{mode} = (\nu - D - 1)\, \boldsymbol{T} \ \text{for} \ \nu \geq D + 1, \qquad
\mathrm{Var}[\Lambda_{ij}] = \nu \left( T_{ij}^2 + T_{ii} T_{jj} \right).$$
• The Wishart distribution is a generalization to multiple dimensions of the gamma distribution.
Posterior Distribution of 𝚺
• Multiplying the likelihood and prior, we find that the posterior is also inverse Wishart:
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) \propto
|\boldsymbol{\Sigma}|^{-N/2} \exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\mu} \boldsymbol{\Sigma}^{-1} \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_0 \boldsymbol{\Sigma}^{-1} \right) \right)
= |\boldsymbol{\Sigma}|^{-(N + \nu_0 + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{S}_{\mu} + \boldsymbol{S}_0 \right) \boldsymbol{\Sigma}^{-1} \right] \right),$$
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) = \mathrm{InvWi}(\boldsymbol{\Sigma} \mid \nu_N = N + \nu_0,\ \boldsymbol{S}_N = \boldsymbol{S}_{\mu} + \boldsymbol{S}_0).$$
• The posterior strength 𝜈_N = 𝜈₀ + 𝑁 is the prior strength 𝜈₀ plus the number of observations 𝑁.
• The posterior scatter matrix 𝑺_N is the prior scatter matrix 𝑺₀ plus the data scatter matrix 𝑺_𝝁.
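A Python sketch of this inverse-Wishart update for 𝚺 with known 𝝁, using scipy.stats.invwishart (its scale argument corresponds to the scatter matrix 𝑺_N); the prior values and simulated data are illustrative assumptions.

```python
# Inverse-Wishart posterior for Sigma with known mean mu.
import numpy as np
from scipy.stats import invwishart

def sigma_posterior(X, mu, S0, nu0):
    X = np.atleast_2d(X)
    resid = X - mu
    S_mu = resid.T @ resid                  # data scatter matrix around the known mean
    return nu0 + X.shape[0], S0 + S_mu      # (nu_N, S_N)

rng = np.random.default_rng(5)
Sigma_true = np.array([[3.0, 1.0], [1.0, 2.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma_true, size=200)
nuN, SN = sigma_posterior(X, mu=np.zeros(2), S0=np.eye(2), nu0=4)
post = invwishart(df=nuN, scale=SN)
print(post.mean())     # E[Sigma | D] = S_N / (nu_N - D - 1), close to Sigma_true
```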


MAP Estimation
• From the mode of the inverse Wishart and
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) = \mathrm{InvWi}(\boldsymbol{\Sigma} \mid \nu_N, \boldsymbol{S}_N), \qquad
\boldsymbol{S}_N = \boldsymbol{S}_{\mu} + \boldsymbol{S}_0, \qquad \nu_N = N + \nu_0,$$
we conclude that the MAP estimate is:
$$\boldsymbol{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1} = \frac{\boldsymbol{S}_{\mu} + \boldsymbol{S}_0}{N + \nu_0 + D + 1} = \frac{\boldsymbol{S}_{\mu} + \boldsymbol{S}_0}{N + N_0}, \qquad N_0 = \nu_0 + D + 1.$$
• For an improper prior, 𝑺₀ = 𝟎 and 𝑁₀ = 0, so $\boldsymbol{\Sigma}_{MAP} = \boldsymbol{S}_{\mu}/N = \boldsymbol{\Sigma}_{MLE}$.
• Consider now the use of a proper informative prior, which is necessary whenever 𝐷/𝑁 is large. Let $\boldsymbol{\mu} = \bar{\boldsymbol{x}}$, so that $\boldsymbol{S}_{\mu} = \boldsymbol{S}_{\bar{x}}$. Rewrite the MAP estimate as a convex combination of the prior mode and the MLE:
$$\boldsymbol{\Sigma}_{MAP} = \frac{\boldsymbol{S}_{\bar{x}} + \boldsymbol{S}_0}{N + N_0}
= \frac{N_0}{N + N_0} \frac{\boldsymbol{S}_0}{N_0} + \frac{N}{N + N_0} \frac{\boldsymbol{S}_{\bar{x}}}{N}
= \lambda \boldsymbol{\Sigma}_0 + (1 - \lambda)\, \boldsymbol{\Sigma}_{MLE}, \qquad
\boldsymbol{\Sigma}_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode}), \qquad
\lambda = \frac{N_0}{N + N_0},$$
where 𝜆 controls the amount of shrinkage towards the prior.
MAP Estimation
$$\boldsymbol{\Sigma}_{MAP} = \lambda \boldsymbol{\Sigma}_0 + (1 - \lambda)\, \boldsymbol{\Sigma}_{MLE}, \qquad
\boldsymbol{\Sigma}_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode})$$
• One can set 𝜆 by cross validation.
• Alternatively, we can use the closed-form formula provided by Ledoit & Wolf and Schaefer & Strimmer, which is the optimal frequentist estimate (for squared loss). This loss function for covariance matrices ignores the positive definite constraint but results in a simple estimator (see the PMTK function shrinkcov).
• For the prior covariance matrix 𝑺₀, it is common to use the following (data-dependent) prior:
$$\boldsymbol{\Sigma}_0 = \mathrm{diag}\!\left( \boldsymbol{\Sigma}_{MLE} \right)$$
• Ledoit, O. and M. Wolf (2004b). A well conditioned estimator for large dimensional covariance matrices. J. of Multivariate Analysis 88(2), 365–411.
• Ledoit, O. and M. Wolf (2004a). Honey, I Shrunk the Sample Covariance Matrix. J. of Portfolio Management 31(1).
• Schaefer, J. and K. Strimmer (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol 4(32).


MAP Shrinkage Estimation
$$\boldsymbol{\Sigma}_0 = \mathrm{diag}\!\left( \boldsymbol{\Sigma}_{MLE} \right)$$
• In this case, the MAP estimate is:
$$\boldsymbol{\Sigma}_{MAP}(i, j) = \begin{cases}
\boldsymbol{\Sigma}_{MLE}(i, j) & \text{if } i = j \\
(1 - \lambda)\, \boldsymbol{\Sigma}_{MLE}(i, j) & \text{otherwise}
\end{cases}$$
• Thus we see that the diagonal entries are equal to their MLE estimates, and the off-diagonal elements are "shrunk" somewhat towards 0 (shrinkage estimation, or regularized estimation).
• The benefits of MAP estimation are illustrated next (see also the sketch below). We consider fitting a 50-dimensional Gaussian to 𝑁 = 100, 𝑁 = 50 and 𝑁 = 25 data points.
• The MAP estimate is always well-conditioned, unlike the MLE.
• The eigenvalue spectrum of the MAP estimate is much closer to that of the true matrix than the MLE's.
• The eigenvectors, however, are unaffected.
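A Python sketch of this shrinkage estimate (diagonal kept at the MLE, off-diagonals scaled by 1 − 𝜆), not the lecture's PMTK shrinkcov; 𝜆 = 0.9, 𝐷 = 50 and 𝑁 = 25 mirror the next slide, while the synthetic true covariance is an assumption.

```python
# Shrinkage (regularized) covariance estimate and its conditioning vs. the MLE.
import numpy as np

def shrink_cov(X, lam):
    X = np.atleast_2d(X)
    Sigma_mle = np.cov(X, rowvar=False, bias=True)            # MLE of the covariance
    Sigma_map = (1.0 - lam) * Sigma_mle + lam * np.diag(np.diag(Sigma_mle))
    return Sigma_mle, Sigma_map

rng = np.random.default_rng(6)
D, N = 50, 25
A = rng.normal(size=(D, D))
Sigma_true = A @ A.T / D + np.eye(D)                          # assumed true covariance
X = rng.multivariate_normal(np.zeros(D), Sigma_true, size=N)
Sigma_mle, Sigma_map = shrink_cov(X, lam=0.9)
print(np.linalg.cond(Sigma_mle), np.linalg.cond(Sigma_map))   # MAP is far better conditioned
```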
Posterior Distribution of 𝚺
[Figure: shrinkcovDemo from PMTK. Three panels (N = 100, N = 50, N = 25, each with D = 50) of eigenvalue spectra; the legends report condition numbers such as true k = 10.00, MLE k = 71, 4.1e+16, 3.7e+17, and MAP k = 8.62, 8.85, 21.09.]
• Estimating a covariance matrix in 𝐷 = 50 dimensions using 𝑁 ∈ {100, 50, 25} samples.
• Eigenvalues in descending order for the true covariance matrix (solid black), the MLE (dotted blue) and the MAP estimate (dashed red) with 𝜆 = 0.9.
• The condition number 𝑘 of each matrix is also given in the legend.


Inference for Both 𝝁 and 𝚲
• Suppose 𝒙₁, 𝒙₂, …, 𝒙_N ~ (i.i.d.) 𝒩(𝝁, 𝜦⁻¹). We do not know 𝝁 or 𝚲.
• When both sets of parameters are unknown, a conjugate family of priors is one in which
$$\boldsymbol{\Lambda} \sim \mathcal{W}i(\boldsymbol{\Lambda} \mid \nu_0, \boldsymbol{T}_0)
\quad \text{and} \quad
\boldsymbol{\mu} \mid \boldsymbol{\Lambda} \sim \mathcal{N}\!\left( \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right).$$
• The Wishart distribution is the multivariate analog of the Gamma distribution (its extension to positive definite matrices). If a matrix 𝑼 has the Wishart distribution, then 𝑼⁻¹ has the inverse-Wishart distribution. The resulting 𝑝(𝝁, 𝚲| 𝝁₀, 𝜅₀, 𝑻₀, 𝜈₀) is the Gaussian-Wishart distribution.
• The quantity 𝜈₀ is a positive scalar, while 𝑻₀ is a positive definite matrix. They play roles analogous to 𝑎 and 𝑏, respectively, in the Gamma distribution.
• Other parameters of the prior are the mean vector 𝝁₀ and 𝜅₀, which represents the 'a priori number of observations'.
Inference for Both 𝝁 and 𝚲
• Denoting $\boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$, the likelihood and prior distributions are given as:
$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda})
= (2\pi)^{-ND/2} |\boldsymbol{\Lambda}|^{N/2}
\exp\!\left( -\frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Lambda} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right)
= (2\pi)^{-ND/2} |\boldsymbol{\Lambda}|^{N/2}
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \bar{\boldsymbol{x}}) \right)
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\bar{x}} \boldsymbol{\Lambda} \right) \right),$$
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) = \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_0, \kappa_0, \nu_0, \boldsymbol{T}_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right) \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_0, \nu_0)
= \frac{1}{Z_0} |\boldsymbol{\Lambda}|^{1/2}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_0^{-1} \boldsymbol{\Lambda} \right) \right),$$
$$Z_0 = \left( \frac{2\pi}{\kappa_0} \right)^{D/2} 2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right) |\boldsymbol{T}_0|^{\nu_0/2},
\qquad \Gamma_D \ \text{the multivariate Gamma function}.$$
• Combining gives:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(N + \nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}} \right) \boldsymbol{\Lambda} \right] \right).$$
• M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 8).
Inference for Both 𝝁 and 𝚲
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(N + \nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}} \right) \boldsymbol{\Lambda} \right] \right)$$
• We can complete the square in 𝝁 as follows:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
|\boldsymbol{\Lambda}|^{1/2} \exp\!\left( -\frac{\kappa_0 + N}{2}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)^T
\boldsymbol{\Lambda}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 + N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ N \bar{\boldsymbol{x}} \bar{\boldsymbol{x}}^T + \kappa_0 \boldsymbol{\mu}_0 \boldsymbol{\mu}_0^T
- \frac{(\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}})(\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}})^T}{\kappa_0 + N} \right) \boldsymbol{\Lambda} \right] \right).$$
• We can simplify this as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
|\boldsymbol{\Lambda}|^{1/2} \exp\!\left( -\frac{\kappa_0 + N}{2}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)^T
\boldsymbol{\Lambda}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 + N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T \right) \boldsymbol{\Lambda} \right] \right).$$


Inference for Both 𝝁 and 𝚲
• This can be written as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T,
\quad \text{where } \boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T,$$
$$\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N.$$
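A Python sketch of this Normal-Wishart update; the hyperparameters and simulated data are illustrative placeholders, not the lecture's choices.

```python
# Normal-Wishart posterior update: returns (mu_N, kappa_N, nu_N, T_N).
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, nu0, T0):
    X = np.atleast_2d(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    resid = X - xbar
    S_xbar = resid.T @ resid
    kappaN, nuN = kappa0 + N, nu0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    d = (xbar - mu0).reshape(-1, 1)
    TN_inv = np.linalg.inv(T0) + S_xbar + (kappa0 * N / kappaN) * (d @ d.T)
    return muN, kappaN, nuN, np.linalg.inv(TN_inv)

rng = np.random.default_rng(7)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=100)
muN, kappaN, nuN, TN = normal_wishart_posterior(X, mu0=np.zeros(2), kappa0=1.0, nu0=3.0, T0=np.eye(2))
print(muN, nuN * TN)     # posterior mean of mu and E[Lambda | D] = nu_N * T_N
```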


Normalization Factors
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
|\boldsymbol{\Lambda}|^{1/2} \exp\!\left( -\frac{\kappa_0 + N}{2}
(\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 + N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \boldsymbol{T}_N^{-1} \boldsymbol{\Lambda} \right] \right)$$
• Using the normalization constants of the multivariate Gaussian and the Wishart:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) = \frac{1}{Z_N} |\boldsymbol{\Lambda}|^{1/2}
\exp\!\left( -\frac{\kappa_N}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right)
|\boldsymbol{\Lambda}|^{(\nu_N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_N^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
Z_N = \left( \frac{2\pi}{\kappa_N} \right)^{D/2} 2^{\nu_N D/2}\, \Gamma_D\!\left( \frac{\nu_N}{2} \right) |\boldsymbol{T}_N|^{\nu_N/2}.$$
• Similarly, for the prior the normalization factor is:
$$Z_0 = \left( \frac{2\pi}{\kappa_0} \right)^{D/2} 2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right) |\boldsymbol{T}_0|^{\nu_0/2}.$$
• The marginal likelihood is computed as a ratio of normalization constants using $p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) = p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda})\, p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) / p(\mathcal{D})$:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0\, (2\pi)^{ND/2}}
= \frac{1}{\pi^{ND/2}}\,
\frac{\Gamma_D\!\left( \frac{\nu_N}{2} \right)}{\Gamma_D\!\left( \frac{\nu_0}{2} \right)}\,
\frac{|\boldsymbol{T}_N|^{\nu_N/2}}{|\boldsymbol{T}_0|^{\nu_0/2}}
\left( \frac{\kappa_0}{\kappa_N} \right)^{D/2}.$$
Posterior Marginals for 𝝁 and 𝚲
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N.$$
• The posterior marginals can be derived as:
$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu} \,\Big|\, \boldsymbol{\mu}_N,\
\left[ \kappa_N (\nu_N - D + 1)\, \boldsymbol{T}_N \right]^{-1} \right).$$
* Refer to this report for these results, based on Bernardo and Smith.
Appendix: Marginal of Normal-Wishart
• Consider the Normal-Wishart distribution
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_0, \kappa_0, \nu_0, \boldsymbol{T}_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_0, \nu_0)
\propto \exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_0^{-1} \boldsymbol{\Lambda} \right) \right).$$
• Let us write this in the form:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda})
\propto |\boldsymbol{\Lambda}|^{(\nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \underbrace{\left( \boldsymbol{T}_0^{-1}
+ \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)(\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \right)}_{\boldsymbol{A}^{-1}} \boldsymbol{\Lambda} \right] \right)
= |\boldsymbol{\Lambda}|^{(\nu_0 + 1 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{A}^{-1} \boldsymbol{\Lambda} \right) \right).$$
• This is an (unnormalized) Wishart distribution $\mathcal{W}i_D(\boldsymbol{\Lambda} \mid \boldsymbol{A}, \nu_0 + 1)$, so when integrating out 𝜦 we pick up its normalization factor $\propto |\boldsymbol{A}|^{(\nu_0 + 1)/2}$:
$$p(\boldsymbol{\mu}) \propto |\boldsymbol{A}|^{(\nu_0 + 1)/2}
= \left| \boldsymbol{T}_0^{-1} + \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)(\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \right|^{-(\nu_0 + 1)/2}.$$
• The matrix determinant lemma says $\det\!\left( \boldsymbol{A} + \boldsymbol{u}\boldsymbol{v}^T \right) = \left( 1 + \boldsymbol{v}^T \boldsymbol{A}^{-1} \boldsymbol{u} \right) \det(\boldsymbol{A})$. Applying this to the formula above:
$$p(\boldsymbol{\mu}) \propto \left( 1 + \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{T}_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)^{-(\nu_0 + 1)/2}
= \left( 1 + \frac{1}{\nu_0 + 1 - D}
(\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \left[ \kappa_0 (\nu_0 + 1 - D)\, \boldsymbol{T}_0 \right] (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)^{-\left[ (\nu_0 + 1 - D) + D \right]/2}.$$
• We can immediately recognize that this is a Student's
$\mathcal{T}_{\nu_0 + 1 - D}\!\left( \boldsymbol{\mu}_0,\ \dfrac{\boldsymbol{T}_0^{-1}}{\kappa_0 (\nu_0 + 1 - D)} \right)$.
MAP Estimates for 𝝁 and 𝚲
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N.$$
• Differentiating the equation at the top of this slide, we can also derive the MAP estimates
$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Lambda}}) = \arg\max_{\boldsymbol{\mu}, \boldsymbol{\Lambda}} p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})$:
$$\hat{\boldsymbol{\mu}} = \frac{\sum_{i=1}^{N} \boldsymbol{x}_i + \kappa_0 \boldsymbol{\mu}_0}{N + \kappa_0}, \qquad
\hat{\boldsymbol{\Lambda}}^{-1} = \frac{\sum_{i=1}^{N} (\boldsymbol{x}_i - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_i - \hat{\boldsymbol{\mu}})^T
+ \kappa_0 (\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}_0)(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}_0)^T + \boldsymbol{T}_0^{-1}}{N + \nu_0 - D}.$$
• These reduce to the MLE by setting $\kappa_0 = 0$, $\nu_0 = D$, $\boldsymbol{T}_0^{-1} = \boldsymbol{0}$.
Inference for Both 𝝁 and 𝚲
• The posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu}_N,\
\frac{(\kappa_N + 1)\, \boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)} \right).$$
• Another useful reference prior considers
$$\boldsymbol{\mu}_0 = \boldsymbol{0}, \qquad \kappa_0 = 0, \qquad \nu_0 = -1, \qquad \boldsymbol{T}_0^{-1} = \boldsymbol{0}.$$
• This results in the following form for the prior:*
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) \propto |\boldsymbol{\Lambda}|^{-(D + 1)/2}.$$

* Recall the general form:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda})
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_0, \nu_0)
= \frac{1}{Z} |\boldsymbol{\Lambda}|^{1/2}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_0^{-1} \boldsymbol{\Lambda} \right) \right).$$


Reference Prior: Posterior Marginals and Predictive
• The posterior parameters are also simplified as:*
$$\boldsymbol{\mu}_N = \bar{\boldsymbol{x}}, \qquad \boldsymbol{T}_N^{-1} = \boldsymbol{S}_{\bar{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1.$$
• The posterior marginals and posterior predictive for this prior are given as:
$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathcal{W}i_{N-1}\!\left( \boldsymbol{\Lambda} \mid \boldsymbol{S}_{\bar{x}}^{-1} \right), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \boldsymbol{\mu} \,\Big|\, \bar{\boldsymbol{x}},\ \frac{\boldsymbol{S}_{\bar{x}}}{N (N - D)} \right), \qquad
p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \bar{\boldsymbol{x}},\ \frac{\boldsymbol{S}_{\bar{x}} (N + 1)}{N (N - D)} \right).$$

* Recall the general form of the posterior:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \quad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T, \quad
\kappa_N = \kappa_0 + N, \quad \nu_N = \nu_0 + N,$$
$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu} \,\Big|\, \boldsymbol{\mu}_N,\ \frac{\boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)} \right), \qquad
p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu}_N,\ \frac{(\kappa_N + 1)\, \boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)} \right).$$
Inference for 𝝁 and 𝚺
• For the case of the multivariate Gaussian of a 𝐷-dimensional variable 𝒙, 𝒩(𝒙|𝝁, 𝚺), with both the mean and covariance unknown, the likelihood is of the form:
$$\prod_{n=1}^{N} \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right)
= (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\boldsymbol{x}}) \right)
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_{\bar{x}} \right) \right).$$
• The conjugate prior is given as the product of a Gaussian and an inverse Wishart (the Normal-Inverse-Wishart, NIW):
$$\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{1}{\kappa_0}\boldsymbol{\Sigma} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0)
= \frac{1}{Z_{NIW}} |\boldsymbol{\Sigma}|^{-1/2}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)$$
$$= \frac{1}{Z_{NIW}}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2},$$
$$Z_{NIW} = 2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right)
\left( \frac{2\pi}{\kappa_0} \right)^{D/2} |\boldsymbol{S}_0|^{-\nu_0/2},
\qquad \Gamma_D \ \text{the multivariate Gamma function}.$$
Inference for 𝝁 and 𝚺
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) \propto
|\boldsymbol{\Sigma}|^{-\frac{N}{2} - \frac{\nu_0 + D + 2}{2}}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left[ \boldsymbol{\Sigma}^{-1} \left( \boldsymbol{S}_{\bar{x}} + \boldsymbol{S}_0 \right) \right]
- \frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)$$
• One can show by completing the square in 𝝁 that:
$$N (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
+ \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
= (\kappa_0 + N) \left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)^T
\boldsymbol{\Sigma}^{-1}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0).$$
• Thus:
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) \propto
|\boldsymbol{\Sigma}|^{-\frac{N + \nu_0 + D + 2}{2}}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left[ \boldsymbol{\Sigma}^{-1} \left( \boldsymbol{S}_{\bar{x}} + \boldsymbol{S}_0
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T \right) \right]
- \frac{\kappa_N}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right),$$
$$\text{where } \boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad \kappa_N = \kappa_0 + N.$$


The Posterior of 𝝁 and 𝚺
• The posterior is NIW, given as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_N, \kappa_N, \boldsymbol{S}_N, \nu_N)
= \frac{1}{Z_{NIW}} |\boldsymbol{\Sigma}|^{-1/2}
\exp\!\left( -\frac{\kappa_N}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right)
|\boldsymbol{\Sigma}|^{-(\nu_N + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_N \right) \right),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}
= \frac{\kappa_0}{\kappa_0 + N} \boldsymbol{\mu}_0 + \frac{N}{\kappa_0 + N} \bar{\boldsymbol{x}}, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N,$$
$$\boldsymbol{S}_N = \boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T
= \boldsymbol{S}_0 + \boldsymbol{S} + \kappa_0 \boldsymbol{\mu}_0 \boldsymbol{\mu}_0^T - \kappa_N \boldsymbol{\mu}_N \boldsymbol{\mu}_N^T,
\qquad \boldsymbol{S} = \sum_{i=1}^{N} \boldsymbol{x}_i \boldsymbol{x}_i^T.$$
• The posterior mean is a convex combination of the prior mean and the MLE, with strength 𝜅₀ + 𝑁.
• The posterior scatter matrix 𝑺_N is the prior scatter matrix 𝑺₀ plus the empirical scatter matrix 𝑺_x̄ plus an extra term due to the uncertainty in the mean, which creates its own scatter matrix. A sketch of this update is given below.
• Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.
• Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
• Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007.
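A Python sketch of this NIW update (𝝁_N, 𝜅_N, 𝜈_N, 𝑺_N); the hyperparameters and simulated data are illustrative assumptions.

```python
# Normal-Inverse-Wishart posterior update for (mu, Sigma).
import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, S0):
    X = np.atleast_2d(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    resid = X - xbar
    S_xbar = resid.T @ resid
    kappaN, nuN = kappa0 + N, nu0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    d = (xbar - mu0).reshape(-1, 1)
    SN = S0 + S_xbar + (kappa0 * N / kappaN) * (d @ d.T)
    return muN, kappaN, nuN, SN

rng = np.random.default_rng(8)
Sigma_true = np.array([[2.0, 0.8], [0.8, 1.5]])
X = rng.multivariate_normal([0.5, -0.5], Sigma_true, size=500)
muN, kappaN, nuN, SN = niw_posterior(X, mu0=np.zeros(2), kappa0=0.01, nu0=4.0, S0=np.eye(2))
print(muN, SN / (nuN - 2 - 1))   # posterior mean and E[Sigma | D] = S_N / (nu_N - D - 1)
```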
MAP Estimate of (𝝁, 𝚺)
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{S}_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N, \qquad
\boldsymbol{S}_N = \boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T.$$
• The mode of the joint posterior is:
$$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})
= \arg\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \left( \boldsymbol{\mu}_N,\ \frac{\boldsymbol{S}_N}{\nu_N + D + 2} \right).$$
• For 𝜅₀ = 0, this becomes:
$$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})
= \left( \bar{\boldsymbol{x}},\ \frac{\boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}}{\nu_0 + N + D + 2} \right).$$
• It is interesting to note that this mode is almost the same as the MAP estimate shown next: it differs by 1 in the denominator, because the mode above is the mode of the joint posterior and not of the marginal.
The Posterior Marginals of 𝝁 and 𝚺
• The posterior marginals for 𝚺 and 𝝁 are:
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D})\, d\boldsymbol{\mu}
= \mathrm{IW}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_N, \nu_N), \qquad
\boldsymbol{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1}, \qquad
\mathbb{E}[\boldsymbol{\Sigma} \mid \mathcal{D}] = \frac{\boldsymbol{S}_N}{\nu_N - D - 1},$$
$$p(\boldsymbol{\mu} \mid \mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D})\, d\boldsymbol{\Sigma}
= \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu} \,\Big|\, \boldsymbol{\mu}_N,\
\frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \right).$$
• It is not surprising that the last marginal is a Student's 𝒯, which we know can be represented as a mixture of Gaussians.
• To see the connection for the scalar case (𝐷 = 1), note that 𝑺_N plays the role of the posterior sum of squares $\nu_N \sigma_N^2$ (you may want to revisit the results for the scalar case of simultaneously estimating 𝜇 and 𝜎²):
$$\frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \ \longrightarrow\ \frac{\nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{\sigma_N^2}{\kappa_N}.$$
The Posterior Predictive of 𝝁 and 𝚺
• The posterior predictive 𝑝(𝒙|𝒟) = 𝑝(𝒙, 𝒟)/𝑝(𝒟) can be evaluated as:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \iint \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\,
\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)\, d\boldsymbol{\mu}\, d\boldsymbol{\Sigma}
= \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{x} \,\Big|\, \boldsymbol{\mu}_N,\
\frac{(\kappa_N + 1)\, \boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \right).$$
• Recall that the Student's 𝒯 distribution has heavier tails than the Gaussian but rapidly becomes Gaussian-like.
• To see the connection of the above expression with the scalar case, note:
$$\frac{(\kappa_N + 1)\, \boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \ \longrightarrow\
\frac{(\kappa_N + 1)\, \nu_N \sigma_N^2}{\kappa_N \nu_N} = \left( 1 + \frac{1}{\kappa_N} \right) \sigma_N^2.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9).
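A Python sketch of evaluating this posterior predictive with scipy's multivariate Student-𝒯 (available in scipy ≥ 1.6); the posterior quantities below are illustrative hard-coded values of the kind produced by the NIW update sketch on the earlier slide, not results from the lecture.

```python
# Posterior predictive density p(x | D) as a multivariate Student-t.
import numpy as np
from scipy.stats import multivariate_t

D = 2
muN = np.array([0.5, -0.5])                   # illustrative posterior mean (assumed)
kappaN, nuN = 500.01, 504.0                   # illustrative posterior strengths (assumed)
SN = np.array([[1000.0, 400.0], [400.0, 750.0]])   # illustrative posterior scatter (assumed)

df = nuN - D + 1
shape = (kappaN + 1.0) * SN / (kappaN * df)   # (kappa_N + 1) S_N / (kappa_N (nu_N - D + 1))
predictive = multivariate_t(loc=muN, shape=shape, df=df)
print(predictive.pdf(np.array([0.0, 0.0])))   # density of a new point under p(x | D)
```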


Marginal Likelihood
• The posterior can be written as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \frac{1}{p(\mathcal{D})} \frac{1}{Z_0} \frac{1}{(2\pi)^{ND/2}}
\mathcal{N}'(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\,
\mathrm{NIW}'(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid a_0)
= \frac{1}{Z_N} \mathrm{NIW}'(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid a_N),$$
$$\mathrm{NIW}'(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid a_0)
= |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left( \boldsymbol{S}_0 \boldsymbol{\Sigma}^{-1} \right)
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right), \qquad
\mathcal{N}'(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_{\mu} \right) \right).$$
• In the last two expressions, (·)′ denotes the unnormalized likelihood and prior.
• The marginal likelihood is then:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0\, (2\pi)^{ND/2}}
= \frac{1}{\pi^{ND/2}}\,
\frac{\Gamma_D\!\left( \frac{\nu_N}{2} \right)}{\Gamma_D\!\left( \frac{\nu_0}{2} \right)}\,
\frac{|\boldsymbol{S}_0|^{\nu_0/2}}{|\boldsymbol{S}_N|^{\nu_N/2}}
\left( \frac{\kappa_0}{\kappa_N} \right)^{D/2}.$$
• Note that for 𝐷 = 1, this reduces to the familiar equation:
$$p(\mathcal{D}) = \frac{1}{\pi^{N/2}}\,
\frac{\Gamma\!\left( \frac{\nu_N}{2} \right)}{\Gamma\!\left( \frac{\nu_0}{2} \right)}\,
\frac{\left( \nu_0 \sigma_0^2 \right)^{\nu_0/2}}{\left( \nu_N \sigma_N^2 \right)^{\nu_N/2}}
\left( \frac{\kappa_0}{\kappa_N} \right)^{1/2}.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see the calculation of the marginal likelihood for the 1D analysis of the Normal-Inverse-Chi-Squared prior in Section 5).
Non-Informative Prior
• The uninformative Jeffreys prior is $p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-(D + 1)/2}$. This is obtained in the limit
$$\kappa_0 \to 0, \qquad \nu_0 \to -1, \qquad |\boldsymbol{S}_0| \to 0.$$
• In this case, the posterior parameters become:
$$\boldsymbol{\mu}_N = \bar{\boldsymbol{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad
\boldsymbol{S}_N = \boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T.$$
• The posterior marginals are then given as:
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}) = \mathrm{IW}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_{\bar{x}}, N - 1), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \boldsymbol{\mu} \,\Big|\, \bar{\boldsymbol{x}},\
\frac{\boldsymbol{S}_{\bar{x}}}{N (N - D)} \right).$$
• Also, the posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \boldsymbol{x} \,\Big|\, \bar{\boldsymbol{x}},\
\frac{\boldsymbol{S}_{\bar{x}} (N + 1)}{N (N - D)} \right).$$
• Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004 (p. 88).
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9).
Non-Informative Jeffreys Prior
• Based on the report of Minka below, the uninformative prior should instead be
$$\lim_{k \to 0} \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{\boldsymbol{\Sigma}}{k} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid k\boldsymbol{I}, k)
\propto |\boldsymbol{\Sigma}|^{-(D/2 + 1)}
= \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{0}, 0, 0\boldsymbol{I}, 0),$$
i.e. 𝜈₀ = 0 instead of 𝜈₀ = −1.
• Recall the general form:
$$\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{1}{\kappa_0}\boldsymbol{\Sigma} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0)
= \frac{1}{Z_{NIW}}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}.$$
• Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.


Data-Dependent Weakly Informative Prior
• Often, a data-dependent weakly informative prior is recommended (see Chipman et al. and Fraley and Raftery):
$$\text{Set } \boldsymbol{S}_0 = \frac{\mathrm{diag}(\boldsymbol{S}_{\bar{x}})}{N}, \qquad \nu_0 = D + 2 \ \text{ to ensure } \mathbb{E}[\boldsymbol{\Sigma}] = \boldsymbol{S}_0, \qquad \text{and} \qquad \boldsymbol{\mu}_0 = \bar{\boldsymbol{x}}, \qquad \kappa_0 = 0.01.$$
• Recall the general form:
$$\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{1}{\kappa_0}\boldsymbol{\Sigma} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0)
= \frac{1}{Z_{NIW}}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}.$$
• Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
• Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.


Visualization of the Wishart Distribution
• 𝒲𝒾 is a distribution over matrices, so it is difficult to plot its PDF. However, one can sample from it and, in 2D, use the eigenvectors of each sampled matrix to define an ellipse, as we have done for the 2D Gaussian.
$$\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{W}_0, \nu_0)
= B\, |\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{W}_0^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
B(\boldsymbol{W}_0, \nu_0) = |\boldsymbol{W}_0|^{-\nu_0/2}
\left( 2^{\nu_0 D/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left( \frac{\nu_0 + 1 - i}{2} \right) \right)^{-1}.$$
[Figure: wiPlotDemo from PMTK. A 3×3 grid of ellipses for samples from Wi(dof = 3.0, S); the panel title reports E = [9.5, -0.1; -0.1, 1.9] and 𝜌 = -0.018.]
• Samples 𝚺 ∼ 𝒲𝒾(𝑺, 𝜈), with 𝑺 = [3.1653, −0.0262; −0.0262, 0.6477] and 𝜈 = 3.
• The sampled matrices are highly variable, and some are nearly singular. As 𝜈 increases, the sampled matrices are more concentrated on the prior 𝑺.
Visualization of the Wishart Distribution
• For the off-diagonal elements, one can sample matrices from the distribution and then compute their distribution empirically.
• We can convert each sampled matrix to a correlation matrix, and thus compute a Monte Carlo approximation to, e.g., the mean correlation:
$$\mathbb{E}[R_{ij}] \approx \frac{1}{S} \sum_{s=1}^{S} R_{ij}^{(s)}, \qquad
\boldsymbol{\Sigma}^{(s)} \sim \mathcal{W}i(\boldsymbol{\Sigma}, \nu), \qquad
R_{ij}^{(s)} = \frac{\Sigma_{ij}^{(s)}}{\sqrt{\Sigma_{ii}^{(s)}\, \Sigma_{jj}^{(s)}}}.$$
• We can then use kernel density estimation to produce, for plotting purposes, a smooth approximation to the univariate density of 𝑅_ij.
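A Python sketch of this Monte Carlo procedure (not the lecture's wiPlotDemo): sample Wishart matrices, convert each to a correlation coefficient, and smooth the samples with a kernel density estimate; the scale matrix is taken from the figure on the previous slide, and the sample count is an assumption.

```python
# Monte Carlo approximation of the correlation-coefficient marginal under a Wishart.
import numpy as np
from scipy.stats import wishart, gaussian_kde

S = np.array([[3.1653, -0.0262], [-0.0262, 0.6477]])    # scale matrix from the figure
nu, n_samples = 3, 5000
samples = wishart(df=nu, scale=S).rvs(size=n_samples)   # shape (n_samples, 2, 2)

r12 = samples[:, 0, 1] / np.sqrt(samples[:, 0, 0] * samples[:, 1, 1])
kde = gaussian_kde(r12)                                  # smooth density of R_12 for plotting
grid = np.linspace(-1, 1, 201)
print(r12.mean(), kde(grid).max())   # E[R_12] estimate and a point on the smoothed density
```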


Visualization of the Wishart Distribution
• Plots of the marginals of the diagonal elements (which are Gamma), and the sample-based marginal on the correlation coefficient.
• If 𝜈 = 3, there is a lot of uncertainty about the correlation coefficient 𝜌 (nearly uniform on [−1, 1]).
[Figure: wiPlotDemo from PMTK. Panels showing the Gamma marginals of the diagonal elements and the sample-based density of the correlation coefficient 𝜌(1, 2).]
