
Conjugate Bayesian Analysis of the Gaussian Distribution
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 10, 2020
Statistical Computing and Machine Learning, Fall 2020


References
• C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource).
• A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003.
• J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource).
• D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
• B. Vidakovic, Bayesian Statistics for Engineering, Online Course at Georgia Tech.
• C. Bishop, Pattern Recognition and Machine Learning (PRML), Chapter 2.
• M. Jordan, An Introduction to Probabilistic Graphical Models, Chapter 8 (pre-print).
• K. Murphy, Machine Learning: A Probabilistic Perspective, Chapters 2 and 4.
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007.
Contents
• Prior, likelihood and posterior; posterior point estimates; predictive distribution; univariate Gaussian with unknown mean; inference of the precision with known mean; scaled inverse chi-squared prior for the variance; inference on the mean and precision.
• Inference for the multivariate Gaussian: unknown mean; unknown precision; MAP estimation and shrinkage; unknown mean and precision; marginal likelihood; non-informative priors; the Wishart distribution.


Goals
The goals for today's lecture are:
• Familiarize ourselves with the calculations needed to compute the joint posterior and posterior marginals of 𝜇 and 𝜎² for a univariate Gaussian.
• Understand the concept of conjugate priors needed for inferring 𝜇 and 𝜎².
• Understand the concept of uninformative priors.
• Learn how to perform inference for 𝝁 and 𝚲 for the multivariate Gaussian using the appropriate prior models and their uninformative limits.
• Learn how to perform inference for 𝝁 and 𝚺 for the multivariate Gaussian using the appropriate prior models and their uninformative limits.


Posterior 𝝅(𝜽|𝒙): Inference and Prediction
• The posterior combines the prior and the likelihood:
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}$$
• It weights the data and the prior information in making probabilistic inferences.
• The posterior distribution is also useful in estimating the probability of observing a future outcome (prediction).
• The normalizing factor is given as:
$$m(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$$


Posterior Inference: Point Estimates
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{m(x)}$$
• Maximum a posteriori (MAP) estimate:
$$\theta^{*} = \arg\max_{\theta} \log \pi(\theta \mid x) = \arg\max_{\theta} \left[ \log f(x \mid \theta) + \log \pi(\theta) \right]$$
• Posterior mean:
$$\hat{\theta} = \mathbb{E}_{\pi(\theta \mid x)}[\theta] = \int \theta\, \pi(\theta \mid x)\, d\theta$$
• Posterior quantiles:
$$\Pr\left(\theta \ge a\right) = \int_{a}^{\infty} \pi(\theta \mid x)\, d\theta$$
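As a quick illustration of these point estimates, here is a minimal Python sketch (not part of the original slides) that evaluates the posterior on a grid for a 1D Gaussian likelihood with known noise scale and a Gaussian prior; the data values, prior parameters and the threshold a = 1.5 are illustrative assumptions.

```python
# Grid approximation of the posterior and its point estimates (illustrative example).
import numpy as np
from scipy.stats import norm

x_data = np.array([2.1, 1.7, 2.5])            # assumed observations
sigma = 1.0                                    # assumed known noise std
theta = np.linspace(-5, 10, 20001)             # grid over the parameter
dtheta = theta[1] - theta[0]

log_prior = norm.logpdf(theta, loc=0.0, scale=3.0)
log_lik = norm.logpdf(x_data[:, None], loc=theta, scale=sigma).sum(axis=0)
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta                    # normalization plays the role of m(x)

theta_map = theta[np.argmax(post)]             # MAP estimate
theta_mean = (theta * post).sum() * dtheta     # posterior mean
pr_ge_a = post[theta >= 1.5].sum() * dtheta    # Pr(theta >= a) for a = 1.5
print(theta_map, theta_mean, pr_ge_a)
```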


Posterior Predictive Distribution
• Suppose we have observed 𝒙 and we want to make a prediction about (future) unknown observables: what is the probability of observing data 𝒙̂ if we have already observed data 𝒙?
• This means finding 𝑔(𝒙̂|𝒙).
• We have:
$$g(\hat{\boldsymbol{x}} \mid \boldsymbol{x}) = \int g(\hat{\boldsymbol{x}}, \theta \mid \boldsymbol{x})\, d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{m(\boldsymbol{x})}\, d\theta
= \int \frac{\pi(\hat{\boldsymbol{x}}, \theta, \boldsymbol{x})}{\phi(\theta, \boldsymbol{x})} \frac{\phi(\theta, \boldsymbol{x})}{m(\boldsymbol{x})}\, d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta, \boldsymbol{x})\, \pi(\theta \mid \boldsymbol{x})\, d\theta
= \int f(\hat{\boldsymbol{x}} \mid \theta)\, \pi(\theta \mid \boldsymbol{x})\, d\theta$$
• Compare this with the form of the normalizing factor (computed for 𝒙̂) in Bayes' rule:
$$m(\hat{\boldsymbol{x}}) = \int f(\hat{\boldsymbol{x}} \mid \theta)\, \pi(\theta)\, d\theta$$


Bayesian Inference for the Gaussian
• Consider $X = \{x_1, x_2, \ldots, x_N\} \sim \mathcal{N}(\mu, \sigma^2)$ with prior $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.
• We proved in the last lecture that the posterior is given as $\mu \mid X \sim \mathcal{N}(\mu_N, \sigma_N^2)$ with
$$\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
\quad\Longleftrightarrow\quad
\sigma_N^2 = \frac{\sigma^2 \sigma_0^2}{N \sigma_0^2 + \sigma^2},$$
and
$$\mu_N = \sigma_N^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N} x_n}{\sigma^2} \right)
= \frac{\sigma^2 \mu_0 + N \sigma_0^2\, \mu_{ML}}{N \sigma_0^2 + \sigma^2},
\qquad \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$
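A small Python sketch (not the lecture's PMTK code) of this conjugate update; the single observation y = 3 and the two prior variances follow the example on the next slide, while the noise variance 𝜎² = 1 is an assumption.

```python
# Conjugate update for the mean of a Gaussian with known variance.
import numpy as np

def posterior_mean_known_variance(x, sigma2, mu0, sigma02):
    """Return (mu_N, sigma_N^2) for the conjugate Gaussian-mean update."""
    x = np.asarray(x, dtype=float)
    N = x.size
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)           # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    mu_ml = x.mean()
    mu_N = sigma_N2 * (mu0 / sigma02 + N * mu_ml / sigma2)   # precision-weighted average
    return mu_N, sigma_N2

# Single noisy observation y = 3 with sigma^2 = 1 (assumed):
print(posterior_mean_known_variance([3.0], sigma2=1.0, mu0=0.0, sigma02=1.0))   # strong prior N(0, 1)
print(posterior_mean_known_variance([3.0], sigma2=1.0, mu0=0.0, sigma02=5.0))   # weak prior N(0, 5)
```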


Inferring the Mean of a Gaussian
[Figure: gaussInferParamsMean1d from PMTK. Two panels (prior variance = 1.00 and prior variance = 5.00) showing the prior, likelihood and posterior densities.]
• Inference about 𝑥 given a single noisy observation 𝑦 = 3.
• (a) Strong prior 𝒩(0, 1): the posterior mean is "shrunk" towards the prior mean, which is 0.
• (b) Weak prior 𝒩(0, 5): the posterior mean is similar to the MLE.
Inference of Precision with Known Mean
• Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1}),\ n = 1, \ldots, N$. We want to infer the precision 𝜆 = 1/𝜎² with the mean 𝜇 taken as known.
• The likelihood takes the form:
$$p(X \mid \lambda) = \prod_{n=1}^{N} f(x_n \mid \lambda) \propto \lambda^{N/2} \exp\!\left( -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right)$$
• The corresponding "conjugate prior" (a prior that results in a posterior of the same family as the prior) should be proportional to the product of a power of 𝜆 and the exponential of a linear function of 𝜆. This corresponds to the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}, \quad \lambda > 0, \qquad
\mathbb{E}[\lambda] = \frac{a}{b}, \quad \mathrm{var}[\lambda] = \frac{a}{b^2}$$
• The Gamma distribution has a finite integral if 𝑎 > 0, and the distribution itself is finite (it does not diverge at 𝜆 = 0) if 𝑎 ≥ 1.
Inference of Precision with Known Mean
• The posterior takes the form:
$$p(\lambda \mid X, \mu) \propto \prod_{n=1}^{N} f(x_n \mid \lambda)\, \mathrm{Gamma}(\lambda \mid a_0, b_0)
\propto \lambda^{N/2 + a_0 - 1} \exp\!\left( -b_0 \lambda - \frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right)$$
• We can immediately see that the posterior is also a Gamma distribution:
$$p(\lambda \mid X, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad
a_N = a_0 + \frac{N}{2}, \qquad
b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2}\, \sigma_{ML}^2$$
• Here $\sigma_{ML}^2$ is the MLE of the variance.
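A small Python sketch of this Gamma posterior update for the precision with known mean; the prior values (a₀, b₀) and the simulated data are illustrative assumptions.

```python
# Gamma posterior for the precision lambda = 1/sigma^2 with known mean mu.
import numpy as np
from scipy.stats import gamma

def precision_posterior(x, mu, a0, b0):
    x = np.asarray(x, dtype=float)
    N = x.size
    aN = a0 + N / 2.0
    bN = b0 + 0.5 * np.sum((x - mu) ** 2)    # = b0 + (N/2) * sigma2_ML
    return aN, bN

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=np.sqrt(10.0), size=50)
aN, bN = precision_posterior(x, mu=5.0, a0=1.0, b0=1.0)
# scipy's gamma uses a shape/scale parameterization; rate b corresponds to scale = 1/b
post_lambda = gamma(a=aN, scale=1.0 / bN)
print(post_lambda.mean(), 1.0 / post_lambda.mean())   # E[lambda|X] = aN/bN and a rough sigma^2 estimate
```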


Inference of Precision with Known Mean
• The effect of observing 𝑁 data points is to increase the value of 𝑎 by 𝑁/2 (i.e. ½ for each data point). Thus we interpret the parameter 𝑎₀ as 2𝑎₀ 'effective' prior observations.
$$p(\lambda \mid X, \mu) = \mathrm{Gamma}(\lambda \mid a_N, b_N), \qquad
a_N = a_0 + \frac{N}{2}, \qquad
b_N = b_0 + \frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 = b_0 + \frac{N}{2}\, \sigma_{ML}^2$$
• Each measurement contributes a variance $\sigma_{ML}^2 / 2$ to the parameter 𝑏. Since we have 2𝑎₀ effective prior measurements, each of them contributes to 𝑏 an effective prior variance $\sigma_0^2$ defined by
$$b_0 = \frac{2a_0}{2}\, \sigma_0^2 \quad\Longrightarrow\quad \sigma_0^2 = \frac{b_0}{a_0}.$$
The interpretation of a conjugate prior in terms of effective data points is typical for the exponential family of distributions.
• The results above are identical to inference of the variance 𝜎² using as prior $\mathrm{InvGamma}(\sigma^2 \mid a_0, b_0)$, resulting in a posterior $\mathrm{InvGamma}(\sigma^2 \mid a_N, b_N)$.
Gamma and Inverse Gamma
• If $\lambda \sim \mathrm{Gamma}(\lambda \mid a, b)$ then $\lambda^{-1} \sim \mathrm{InvGamma}(\lambda^{-1} \mid a, b)$.
• Here: $\lambda \sim \mathrm{Gamma}(\lambda \mid a, b)$ implies $\sigma^2 = \lambda^{-1} \sim \mathrm{InvGamma}(\sigma^2 \mid a, b)$.
• Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004.


Scaled Inverse Chi-Squared Prior
• An alternative prior for 𝜎² is the scaled inverse chi-squared distribution*:
$$\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{InvGamma}\!\left(\sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right)
\propto (\sigma^2)^{-\nu_0/2 - 1} \exp\!\left( -\frac{\nu_0 \sigma_0^2}{2\sigma^2} \right)$$
• Here 𝜈₀ represents the strength of the prior and 𝜎₀² encodes the value of the prior. With this, the posterior takes the form:
$$\chi^{-2}(\sigma^2 \mid \mathcal{D}, \mu) = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad
\nu_N = \nu_0 + N, \qquad
\sigma_N^2 = \frac{\nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2}{\nu_N}$$
• The posterior dof 𝜈_N is the prior dof 𝜈₀ plus 𝑁. The posterior sum of squares $\sigma_N^2 \nu_N = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \mu)^2$ is the prior sum of squares plus the data sum of squares. An uninformative prior corresponds to zero virtual sample size, 𝜈₀ = 0. This is the prior $p(\sigma^2) \propto \sigma^{-2}$.
• This parameterization is arguably more appealing than the (𝑎, 𝑏) one, since 𝜈₀ and 𝜎₀² have a direct interpretation as a prior sample size and a prior variance.

* Often denoted as $\text{Scale-Inv-}\chi^2(\sigma^2 \mid \nu_0, \sigma_0^2)$, with
$\text{mean} = \dfrac{\nu_0 \sigma_0^2}{\nu_0 - 2}$ for $\nu_0 > 2$,
$\text{var} = \dfrac{2 \nu_0^2 \sigma_0^4}{(\nu_0 - 2)^2 (\nu_0 - 4)}$ for $\nu_0 > 4$,
$\text{mode} = \dfrac{\nu_0 \sigma_0^2}{\nu_0 + 2}$.
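A Python sketch of this scaled inverse chi-squared update, expressed through scipy's inverse Gamma using χ⁻²(𝜎²|𝜈, 𝜎²₀) = IG(𝜈/2, 𝜈𝜎²₀/2); the prior values and simulated data mimic the uninformative setting of the next slide and are assumptions.

```python
# Scaled inverse chi-squared posterior for sigma^2 with known mean, via scipy's invgamma.
import numpy as np
from scipy.stats import invgamma

def scaled_inv_chi2_posterior(x, mu, v0, sigma02):
    x = np.asarray(x, dtype=float)
    N = x.size
    vN = v0 + N
    sigmaN2 = (v0 * sigma02 + np.sum((x - mu) ** 2)) / vN
    return vN, sigmaN2

rng = np.random.default_rng(1)
x = rng.normal(5.0, np.sqrt(10.0), size=100)
vN, sigmaN2 = scaled_inv_chi2_posterior(x, mu=5.0, v0=0.001, sigma02=1.0)
post = invgamma(a=vN / 2.0, scale=vN * sigmaN2 / 2.0)   # posterior over sigma^2
print(post.mean())   # should be near the true variance 10 for large N
```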


Sequential Update of the Posterior for 𝝈²
• For $D = 1$: $\mathrm{IW}(\sigma^2 \mid \nu, s^{-1}) = \mathrm{IG}\!\left(\sigma^2 \,\Big|\, \frac{\nu}{2}, \frac{s}{2}\right)$, with $s = \nu \sigma^2$.
[Figure: gaussSeqUpdateSigma1D from PMTK. Sequential posteriors of 𝜎² for N = 2, 5, 50, 100; prior IW(𝜈 = 0.001, S = 0.001); true 𝜎² = 10.]
• Sequential update of the posterior for 𝜎², starting from an uninformative prior $\mathrm{IW}(\sigma^2 \mid \nu_0 = 0.001,\ s_0 = \nu_0 \sigma_0^2 = 0.001)$. The scalar version of the inverse Wishart is the same as the inverse Gamma. The data were generated from 𝒩(5, 10).
• Gelman 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1(3):515–533.
Bayesian Inference: Unknown Mean and Precision
• Consider $x_n \sim \mathcal{N}(x_n \mid \mu, \lambda^{-1}),\ n = 1, \ldots, N$. We want to infer both the precision 𝜆 = 1/𝜎² and the mean 𝜇.
• The likelihood takes the form:
$$p(X \mid \mu, \lambda) = \prod_{n=1}^{N} f(x_n \mid \mu, \lambda)
\propto \lambda^{N/2} \exp\!\left( -\frac{\lambda}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right)
= \left[ \lambda^{1/2} \exp\!\left( -\frac{\lambda \mu^2}{2} \right) \right]^{N}
\exp\!\left( \lambda \mu \sum_{n=1}^{N} x_n - \frac{\lambda}{2} \sum_{n=1}^{N} x_n^2 \right)$$
• We need a prior that has a similar functional form in terms of 𝜆 and 𝜇:
$$p(\mu, \lambda) \propto \left[ \lambda^{1/2} \exp\!\left( -\frac{\lambda \mu^2}{2} \right) \right]^{\beta}
\exp\!\left( c \lambda \mu - d \lambda \right)
= \underbrace{\exp\!\left( -\frac{\beta \lambda}{2} \left( \mu - \frac{c}{\beta} \right)^2 \right)}_{\propto\, p(\mu \mid \lambda)}
\underbrace{\lambda^{\beta/2} \exp\!\left( -\left( d - \frac{c^2}{2\beta} \right) \lambda \right)}_{\propto\, p(\lambda)}$$
Bayesian Inference: Unknown Mean and Precision
$$p(\mu, \lambda) \propto \left[ \lambda^{1/2} \exp\!\left( -\frac{\lambda \mu^2}{2} \right) \right]^{\beta}
\exp\!\left( c \lambda \mu - d \lambda \right)
= \underbrace{\lambda^{1/2} \exp\!\left( -\frac{\beta \lambda}{2} \left( \mu - \frac{c}{\beta} \right)^2 \right)}_{\propto\, p(\mu \mid \lambda)}
\underbrace{\lambda^{(\beta - 1)/2} \exp\!\left( -\left( d - \frac{c^2}{2\beta} \right) \lambda \right)}_{\propto\, p(\lambda)}$$
• We can easily identify that the prior is of the Normal-Gamma form:
$$p(\mu, \lambda) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_0 = \frac{c}{\beta},\ (\beta \lambda)^{-1} \right)
\mathrm{Gamma}\!\left( \lambda \,\Big|\, a = 1 + \frac{\beta}{2},\ b = d - \frac{c^2}{2\beta} \right)$$
• Recall the form of the Gamma distribution:
$$\mathrm{Gamma}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{a-1} e^{-b\lambda}$$


Bayesian Inference: Unknown Mean and Precision
• Combining the likelihood and prior, we can re-arrange and write:
$$p(\mu, \lambda \mid X) \propto \lambda^{N/2 + a - 1}
\exp\!\left( -\lambda \left[ b + \frac{1}{2} \sum_{n=1}^{N} x_n^2 + \frac{\beta \mu_0^2}{2} \right] \right)
\lambda^{1/2} \exp\!\left( -\frac{(N + \beta)\lambda}{2} \left[ \mu^2 - 2\mu\, \frac{\beta \mu_0 + \sum_{n=1}^{N} x_n}{N + \beta} \right] \right)$$
• Completing the square on the second exponent gives:
$$p(\mu, \lambda \mid X) \propto
\underbrace{\lambda^{a_N - 1} \exp\!\left( -b_N \lambda \right)}_{\mathrm{Gamma}(\lambda \mid a_N, b_N)}
\underbrace{\lambda^{1/2} \exp\!\left( -\frac{(N + \beta)\lambda}{2} \left( \mu - \mu_N \right)^2 \right)}_{\propto\, \mathcal{N}\left( \mu \mid \mu_N,\ ((N + \beta)\lambda)^{-1} \right)}$$
with
$$a_N = a + \frac{N}{2}, \qquad
b_N = b + \frac{1}{2} \sum_{n=1}^{N} x_n^2 + \frac{\beta \mu_0^2}{2} - \frac{\left( \beta \mu_0 + \sum_{n=1}^{N} x_n \right)^2}{2(N + \beta)}, \qquad
\mu_N = \frac{\beta \mu_0 + \sum_{n=1}^{N} x_n}{N + \beta}.$$
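A Python sketch of this Normal-Gamma update in the parameterization 𝑝(𝜇, 𝜆) = 𝒩(𝜇|𝜇₀, (𝛽𝜆)⁻¹) Gamma(𝜆|𝑎, 𝑏); the hyperparameter values and simulated data are illustrative assumptions, and the equivalent "centered" form of 𝑏_N is used.

```python
# Normal-Gamma posterior update for unknown mean mu and precision lambda.
import numpy as np

def normal_gamma_posterior(x, mu0, beta0, a0, b0):
    x = np.asarray(x, dtype=float)
    N, xbar = x.size, x.mean()
    betaN = beta0 + N
    muN = (beta0 * mu0 + N * xbar) / betaN
    aN = a0 + N / 2.0
    # Equivalent to b + 0.5*sum(x^2) + 0.5*beta*mu0^2 - (beta*mu0 + sum(x))^2 / (2*(N+beta))
    bN = b0 + 0.5 * np.sum((x - xbar) ** 2) + 0.5 * N * beta0 * (xbar - mu0) ** 2 / betaN
    return muN, betaN, aN, bN

rng = np.random.default_rng(2)
x = rng.normal(2.0, 1.5, size=200)
muN, betaN, aN, bN = normal_gamma_posterior(x, mu0=0.0, beta0=2.0, a0=5.0, b0=6.0)
print(muN, aN / bN)        # posterior mean of mu and E[lambda | X] = aN / bN
```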


The Normal-Gamma Distribution
$$p(\mu, \lambda) = \mathcal{N}\!\left( \mu \,\Big|\, \mu_0 = \frac{c}{\beta},\ (\beta \lambda)^{-1} \right)
\mathrm{Gamma}\!\left( \lambda \,\Big|\, a = 1 + \frac{\beta}{2},\ b = d - \frac{c^2}{2\beta} \right)$$
[Figure (MatLab code): contour plot of the Normal-Gamma density over (𝜇, 𝜆) for 𝑎 = 5, 𝑏 = 6, 𝜇₀ = 0, 𝛽 = 2.]
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior).
Posterior for 𝝁 and 𝝈² for Scalar Data
• We can also work directly with 𝜎². We use the normal inverse chi-squared (NIX) distribution:
$$N\chi^{-2}(\mu, \sigma^2 \mid m_0, \kappa_0, \nu_0, \sigma_0^2)
= \mathcal{N}(\mu \mid m_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2)
\propto (\sigma^2)^{-(\nu_0 + 3)/2} \exp\!\left( -\frac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - m_0)^2}{2\sigma^2} \right),$$
where $\chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) = \mathrm{IG}\!\left( \sigma^2 \,\Big|\, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2} \right)$.
• Similarly to our earlier calculations, the posterior is given as:
$$p(\mu, \sigma^2 \mid \mathcal{D}) = N\chi^{-2}(\mu, \sigma^2 \mid m_N, \kappa_N, \nu_N, \sigma_N^2),$$
$$m_N = \frac{\kappa_0 m_0 + N \bar{x}}{\kappa_N} = \frac{\kappa_0}{\kappa_0 + N} m_0 + \frac{N}{\kappa_0 + N} \bar{x}, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N,$$
$$\nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{N \kappa_0}{\kappa_0 + N} (m_0 - \bar{x})^2.$$
• The posterior marginal for 𝜎² and its posterior expectation are:
$$p(\sigma^2 \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\mu = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \qquad
\mathbb{E}[\sigma^2 \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2}\, \sigma_N^2.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (provides additional results for posterior marginals, posterior predictive, and reference results for an uninformative prior; Section 6 provides the analysis for a normal-inverse-Gamma prior).
Normal Inverse 𝝌² Distribution
[Figure: NIXdemo2 from PMTK. Three surface/contour plots of 𝑁𝐼𝜒²(𝜇₀ = 0, 𝜅₀ = 1, 𝜈₀ = 1, 𝜎₀² = 1), 𝑁𝐼𝜒²(𝜇₀ = 0, 𝜅₀ = 5, 𝜈₀ = 1, 𝜎₀² = 1) and 𝑁𝐼𝜒²(𝜇₀ = 0, 𝜅₀ = 1, 𝜈₀ = 5, 𝜎₀² = 1).]
• The 𝑁𝐼𝜒²(𝜇₀, 𝜅₀, 𝜈₀, 𝜎₀²) distribution. 𝜇₀ is the prior mean and 𝜅₀ is how strongly we believe this; 𝜎₀² is the prior variance and 𝜈₀ is how strongly we believe this. (a) The contour plot (underneath the surface) is shaped like a "squashed egg". (b) We increase the strength of our belief in the mean, so the distribution gets narrower in 𝜇. (c) We increase the strength of our belief in the variance, so it gets narrower in 𝜎².
Posterior for an Uninformative Prior for 𝝁 and 𝝈²
• The posterior marginal for 𝜇 is Student's 𝒯:
$$p(\mu \mid \mathcal{D}) = \int p(\mu, \sigma^2 \mid \mathcal{D})\, d\sigma^2
= \mathcal{T}(\mu \mid m_N, \sigma_N^2 / \kappa_N, \nu_N), \qquad
\mathbb{E}[\mu \mid \mathcal{D}] = m_N.$$
Recall the Student's 𝒯 density:
$$p(x \mid \mu, \sigma^2, \nu) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
\left( \frac{1}{\nu \pi \sigma^2} \right)^{1/2}
\left[ 1 + \frac{(x - \mu)^2}{\nu \sigma^2} \right]^{-(\nu + 1)/2},$$
with mean 𝜇 (for 𝜈 > 1), mode 𝜇, and variance $\dfrac{\nu \sigma^2}{\nu - 2}$ (for 𝜈 > 2).
• The posterior predictive is:
$$p(x \mid \mathcal{D}) = \iint p(x \mid \mu, \sigma^2)\, p(\mu, \sigma^2 \mid \mathcal{D})\, d\mu\, d\sigma^2
= \mathcal{T}\!\left( x \mid m_N,\ \frac{\sigma_N^2 (1 + \kappa_N)}{\kappa_N},\ \nu_N \right).$$
• Let us revisit all these results with an uninformative prior:*
$$p(\mu, \sigma^2) \propto p(\mu)\, p(\sigma^2) \propto \sigma^{-2}
\propto N\chi^{-2}(\mu, \sigma^2 \mid \mu_0 = 0, \kappa_0 = 0, \nu_0 = -1, \sigma_0^2 = 0).$$
• With this prior, the posterior becomes:**
$$p(\mu, \sigma^2 \mid \mathcal{D}) = N\chi^{-2}\!\left( \mu, \sigma^2 \mid m_N = \bar{x},\ \kappa_N = N,\ \nu_N = N - 1,\ \sigma_N^2 = s^2 \right), \qquad
s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{N}{N - 1}\, \sigma_{MLE}^2.$$

* Recall that $N\chi^{-2}(\mu, \sigma^2 \mid m_0, \kappa_0, \nu_0, \sigma_0^2) \propto (\sigma^2)^{-(\nu_0 + 3)/2} \exp\!\left( -\dfrac{\nu_0 \sigma_0^2 + \kappa_0 (\mu - m_0)^2}{2\sigma^2} \right)$.
** Recall that $p(\mu, \sigma^2 \mid \mathcal{D}) = N\chi^{-2}(\mu, \sigma^2 \mid m_N, \kappa_N, \nu_N, \sigma_N^2)$ with
$m_N = \dfrac{\kappa_0 m_0 + N\bar{x}}{\kappa_0 + N}$, $\kappa_N = \kappa_0 + N$, $\nu_N = \nu_0 + N$,
$\nu_N \sigma_N^2 = \nu_0 \sigma_0^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \dfrac{N \kappa_0}{\kappa_0 + N} (m_0 - \bar{x})^2$.
Marginal Posterior for 𝝁 and Credible Interval
• The predictive distribution becomes:
$$p(x \mid \mathcal{D}) = \mathcal{T}\!\left( x \mid m_N,\ \frac{\sigma_N^2 (1 + \kappa_N)}{\kappa_N},\ \nu_N \right)
= \mathcal{T}\!\left( x \mid \bar{x},\ \frac{s^2 (1 + N)}{N},\ N - 1 \right).$$
• 𝑠 is the sample standard deviation. The marginal posterior for 𝜇 becomes:
$$p(\mu \mid \mathcal{D}) = \mathcal{T}\!\left( \mu \mid \bar{x},\ \frac{s^2}{N},\ N - 1 \right), \qquad
s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{N}{N - 1}\, \sigma_{MLE}^2, \qquad
\mathrm{var}[\mu \mid \mathcal{D}] = \frac{\nu_N}{\nu_N - 2} \frac{s^2}{N} = \frac{N - 1}{N - 3} \frac{s^2}{N} \approx \frac{s^2}{N}.$$
• The standard error of the mean is defined as
$$\sqrt{\mathrm{var}[\mu \mid \mathcal{D}]} \approx \frac{s}{\sqrt{N}}.$$
• An approximate 95% posterior credible interval is thus:
$$I_{0.95}(\mu \mid \mathcal{D}) = \bar{x} \pm 2 \frac{s}{\sqrt{N}}.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9).
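A Python sketch comparing the exact Student-𝒯 credible interval with the approximate x̄ ± 2 s/√N interval above; the simulated data are an illustrative assumption.

```python
# 95% credible interval for mu under the uninformative prior (exact T vs. approximation).
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=40)
N, xbar = x.size, x.mean()
s2 = np.sum((x - xbar) ** 2) / (N - 1)           # unbiased sample variance
scale = np.sqrt(s2 / N)                           # scale of T(mu | xbar, s^2/N, N-1)

lo, hi = t.interval(0.95, df=N - 1, loc=xbar, scale=scale)     # exact 95% credible interval
print((lo, hi))
print((xbar - 2 * np.sqrt(s2 / N), xbar + 2 * np.sqrt(s2 / N)))  # approximate interval
```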
Multivariate Gaussian: Posterior of 𝝁
• Consider a known covariance 𝚺 and a Gaussian prior 𝒩(𝝁₀, 𝚺₀), with the posterior for the unknown mean 𝝁 taking the form:
$$p(\boldsymbol{\mu} \mid \boldsymbol{X}) \propto p(\boldsymbol{\mu}) \prod_{n=1}^{N} p(\boldsymbol{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$
• This posterior is the exponential of a quadratic in 𝝁:
$$-\frac{1}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}_0^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu})
= -\frac{1}{2} \boldsymbol{\mu}^T \left( \boldsymbol{\Sigma}_0^{-1} + N \boldsymbol{\Sigma}^{-1} \right) \boldsymbol{\mu}
+ \boldsymbol{\mu}^T \left( \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 + \boldsymbol{\Sigma}^{-1} \sum_{n=1}^{N} \boldsymbol{x}_n \right) + \text{const}$$
• So the covariance and mean of the posterior $p(\boldsymbol{\mu} \mid \boldsymbol{X}, \boldsymbol{\Sigma}) = \mathcal{N}(\boldsymbol{\mu} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$ are:
$$\boldsymbol{\Sigma}_N^{-1} = \boldsymbol{\Sigma}_0^{-1} + N \boldsymbol{\Sigma}^{-1}, \qquad
\boldsymbol{\mu}_N = \boldsymbol{\Sigma}_N \left( \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\mu}_0 + N \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{ML} \right)
= \boldsymbol{\Sigma}_0 \left( \boldsymbol{\Sigma}_0 + \tfrac{1}{N} \boldsymbol{\Sigma} \right)^{-1} \boldsymbol{\mu}_{ML}
+ \tfrac{1}{N} \boldsymbol{\Sigma} \left( \boldsymbol{\Sigma}_0 + \tfrac{1}{N} \boldsymbol{\Sigma} \right)^{-1} \boldsymbol{\mu}_0, \qquad
\boldsymbol{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n.$$
• For an uninformative prior, $\boldsymbol{\Sigma}_0 \to \infty \boldsymbol{I}$:
$\ p(\boldsymbol{\mu} \mid \boldsymbol{X}, \boldsymbol{\Sigma}) = \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_{ML}, \tfrac{1}{N}\boldsymbol{\Sigma} \right)$.
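A Python sketch of this multivariate update for 𝝁 with known covariance; the covariance, prior and simulated data below are illustrative assumptions.

```python
# Posterior N(mu | mu_N, Sigma_N) for the mean of a multivariate Gaussian with known Sigma.
import numpy as np

def posterior_mean_mvn(X, Sigma, mu0, Sigma0):
    X = np.atleast_2d(X)
    N = X.shape[0]
    mu_ml = X.mean(axis=0)
    Lam, Lam0 = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
    SigmaN = np.linalg.inv(Lam0 + N * Lam)                 # Sigma_N^-1 = Sigma_0^-1 + N Sigma^-1
    muN = SigmaN @ (Lam0 @ mu0 + N * Lam @ mu_ml)
    return muN, SigmaN

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mean=[1.0, -1.0], cov=Sigma, size=50)
muN, SigmaN = posterior_mean_mvn(X, Sigma, mu0=np.zeros(2), Sigma0=10.0 * np.eye(2))
print(muN)   # close to the MLE because the prior is weak
```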
Posterior Distribution of Precision 𝚲
• We now discuss how to compute 𝑝(𝚲|𝒟, 𝝁) for a 𝐷-dimensional Gaussian. The likelihood has the form
$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}) \propto |\boldsymbol{\Lambda}|^{N/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\mu} \boldsymbol{\Lambda} \right) \right), \qquad
\boldsymbol{S}_{\mu} = \sum_{n} (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T.$$
• The corresponding conjugate prior is known as the Wishart distribution:
$$\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{W}_0, \nu_0)
= B\, |\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{W}_0^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
\nu_0 > D - 1 \ (\text{dof}), \quad \boldsymbol{W}_0 \ \text{sym. pos. def.} \ (D \times D),$$
$$B(\boldsymbol{W}_0, \nu_0) = |\boldsymbol{W}_0|^{-\nu_0/2}
\left( 2^{\nu_0 D/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left( \frac{\nu_0 + 1 - i}{2} \right) \right)^{-1}
= \frac{|\boldsymbol{W}_0|^{-\nu_0/2}}{2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right)}, \qquad
\Gamma_D\!\left( \frac{\nu_0}{2} \right) = \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left( \frac{\nu_0 + 1 - i}{2} \right)
\ \ (\text{multivariate Gamma function}).$$
• Clearly the posterior is $\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{W}_N, \nu_N)$ with $\boldsymbol{W}_N^{-1} = \boldsymbol{W}_0^{-1} + \boldsymbol{S}_{\mu}$ and $\nu_N = \nu_0 + N$. For 𝐷 = 1, the Wishart becomes the Gamma for 𝜆 that we used earlier for the univariate Gaussian:
$$\mathcal{W}i(\lambda \mid W_N, \nu_N) = \mathrm{Gamma}\!\left( \lambda \,\Big|\, \frac{\nu_N}{2}, \frac{W_N^{-1}}{2} \right).$$
The Wishart Vs. the Gamma Distributions
• Mode of the Wishart: $\text{mode} = (\nu - k - 1)\, \boldsymbol{S}$ for $\nu \geq k + 1$.
• For $\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})$, the scatter matrix $\boldsymbol{S} = \sum_{i=1}^{N} \boldsymbol{x}_i \boldsymbol{x}_i^T \sim \mathcal{W}i(\boldsymbol{S} \mid \boldsymbol{\Sigma}, N)$, with $\mathbb{E}[\boldsymbol{S}] = N \boldsymbol{\Sigma}$.
• Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004.
Posterior Distribution of 𝚺
• We similarly discuss computing 𝑝(𝚺|𝒟, 𝝁). The likelihood has the form
$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\mu} \boldsymbol{\Sigma}^{-1} \right) \right), \qquad
\boldsymbol{S}_{\mu} = \sum_{n} (\boldsymbol{x}_n - \boldsymbol{\mu})(\boldsymbol{x}_n - \boldsymbol{\mu})^T.$$
• The corresponding conjugate prior is known as the inverse Wishart:
$$\mathrm{InvWi}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0) \propto |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_0 \boldsymbol{\Sigma}^{-1} \right) \right), \qquad
\nu_0 > D - 1, \quad \boldsymbol{S}_0 \ \text{sym. pos. def.}, \quad \boldsymbol{\Sigma} \ (D \times D) \ \text{pos. def.}$$
• $\nu_0 + D + 1$ controls the strength of the prior, and hence plays a role analogous to the sample size 𝑁.
• The prior scatter matrix is here 𝑺₀.
• If $\boldsymbol{\Sigma}^{-1} \sim \mathcal{W}i(\cdot \mid \boldsymbol{S}, \nu)$ then $\boldsymbol{\Sigma} \sim \mathrm{InvWi}(\cdot \mid \boldsymbol{S}^{-1}, \nu)$.
• Steven W. Nydick, The Wishart and Inverse Wishart Distributions, Report, 2012.
• A. Gelman, J. Carlin, H. Stern and D. Rubin, Bayesian Data Analysis, 2004.
Inverse Wishart Distribution
W ~ Inv  Wishart ( S ) for v  D  1 (dof , real )

p (W )  Inv  Wishart (W | S )

for v  D  1
mode   v  k  1 S
1

  S 1
For k = 1, InvWi  | ,S  InvGamma   2 | , 
2 If  ~ Gamma (a,b) 

~ InvGamma (a,b)
 2 2
If  1 ~ Wi  , S    ~ InvWi  , S 1 
Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004
Statistical Computing and Machine Learning, Fall 2020, N. Zabaras 28
Wishart Distribution
• If 𝜦 ~ 𝒲_D(𝜦|𝑻, 𝜈) then
$$p(\boldsymbol{\Lambda}) = \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}, \nu)
= \frac{1}{Z_W} |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
Z_W = 2^{\nu D/2}\, |\boldsymbol{T}|^{\nu/2}\, \Gamma_D\!\left( \frac{\nu}{2} \right),$$
where 𝜈 > 𝐷 − 1 degrees of freedom (real), 𝑻 > 0 is the scale matrix (𝐷 × 𝐷, pos. def.), 𝚲 is a (𝐷 × 𝐷) positive definite matrix, and Γ_D is the multivariate gamma function.
• The following properties are useful:
$$\mathbb{E}[\boldsymbol{\Lambda}] = \nu \boldsymbol{T}, \qquad
\text{mode} = (\nu - D - 1)\, \boldsymbol{T} \ \text{for} \ \nu \geq D + 1, \qquad
\mathrm{Var}[\Lambda_{ij}] = \nu \left( T_{ij}^2 + T_{ii} T_{jj} \right).$$
• The Wishart distribution is a generalization to multiple dimensions of the gamma distribution.
Posterior Distribution of 𝚺
• Multiplying the likelihood and prior, we find that the posterior is also inverse Wishart:
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) \propto
|\boldsymbol{\Sigma}|^{-N/2} \exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\mu} \boldsymbol{\Sigma}^{-1} \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2} \exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_0 \boldsymbol{\Sigma}^{-1} \right) \right)
= |\boldsymbol{\Sigma}|^{-(N + \nu_0 + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{S}_{\mu} + \boldsymbol{S}_0 \right) \boldsymbol{\Sigma}^{-1} \right] \right),$$
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) = \mathrm{InvWi}(\boldsymbol{\Sigma} \mid \nu_N = N + \nu_0,\ \boldsymbol{S}_N = \boldsymbol{S}_{\mu} + \boldsymbol{S}_0).$$
• The posterior strength 𝜈_N = 𝜈₀ + 𝑁 is the prior strength 𝜈₀ plus the number of observations 𝑁.
• The posterior scatter matrix 𝑺_N is the prior scatter matrix 𝑺₀ plus the data scatter matrix 𝑺_𝝁.
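A Python sketch of this inverse-Wishart update for 𝚺 with known 𝝁, using scipy.stats.invwishart (its scale argument corresponds to the scatter matrix 𝑺_N); the prior values and simulated data are illustrative assumptions.

```python
# Inverse-Wishart posterior for Sigma with known mean mu.
import numpy as np
from scipy.stats import invwishart

def sigma_posterior(X, mu, S0, nu0):
    X = np.atleast_2d(X)
    resid = X - mu
    S_mu = resid.T @ resid                  # data scatter matrix around the known mean
    return nu0 + X.shape[0], S0 + S_mu      # (nu_N, S_N)

rng = np.random.default_rng(5)
Sigma_true = np.array([[3.0, 1.0], [1.0, 2.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma_true, size=200)
nuN, SN = sigma_posterior(X, mu=np.zeros(2), S0=np.eye(2), nu0=4)
post = invwishart(df=nuN, scale=SN)
print(post.mean())     # E[Sigma | D] = S_N / (nu_N - D - 1), close to Sigma_true
```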


MAP Estimation
• From the mode of the inverse Wishart and
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}) = \mathrm{InvWi}(\boldsymbol{\Sigma} \mid \nu_N, \boldsymbol{S}_N), \qquad
\boldsymbol{S}_N = \boldsymbol{S}_{\mu} + \boldsymbol{S}_0, \qquad \nu_N = N + \nu_0,$$
we conclude that the MAP estimate is:
$$\boldsymbol{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1} = \frac{\boldsymbol{S}_{\mu} + \boldsymbol{S}_0}{N + \nu_0 + D + 1} = \frac{\boldsymbol{S}_{\mu} + \boldsymbol{S}_0}{N + N_0}, \qquad N_0 = \nu_0 + D + 1.$$
• For an improper prior, 𝑺₀ = 𝟎 and 𝑁₀ = 0, so $\boldsymbol{\Sigma}_{MAP} = \boldsymbol{S}_{\mu}/N = \boldsymbol{\Sigma}_{MLE}$.
• Consider now the use of a proper informative prior, which is necessary whenever 𝐷/𝑁 is large. Let $\boldsymbol{\mu} = \bar{\boldsymbol{x}}$, so that $\boldsymbol{S}_{\mu} = \boldsymbol{S}_{\bar{x}}$. Rewrite the MAP estimate as a convex combination of the prior mode and the MLE:
$$\boldsymbol{\Sigma}_{MAP} = \frac{\boldsymbol{S}_{\bar{x}} + \boldsymbol{S}_0}{N + N_0}
= \frac{N_0}{N + N_0} \frac{\boldsymbol{S}_0}{N_0} + \frac{N}{N + N_0} \frac{\boldsymbol{S}_{\bar{x}}}{N}
= \lambda \boldsymbol{\Sigma}_0 + (1 - \lambda)\, \boldsymbol{\Sigma}_{MLE}, \qquad
\boldsymbol{\Sigma}_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode}), \qquad
\lambda = \frac{N_0}{N + N_0},$$
where 𝜆 controls the amount of shrinkage towards the prior.
MAP Estimation
$$\boldsymbol{\Sigma}_{MAP} = \lambda \boldsymbol{\Sigma}_0 + (1 - \lambda)\, \boldsymbol{\Sigma}_{MLE}, \qquad
\boldsymbol{\Sigma}_0 = \frac{\boldsymbol{S}_0}{N_0} \ (\text{prior mode})$$
• One can set 𝜆 by cross validation.
• Alternatively, we can use the closed-form formula provided by Ledoit & Wolf and Schaefer & Strimmer, which is the optimal frequentist estimate (for squared loss). This loss function for covariance matrices ignores the positive definite constraint but results in a simple estimator (see the PMTK function shrinkcov).
• For the prior covariance matrix 𝑺₀, it is common to use the following (data-dependent) prior:
$$\boldsymbol{\Sigma}_0 = \mathrm{diag}\!\left( \boldsymbol{\Sigma}_{MLE} \right)$$
• Ledoit, O. and M. Wolf (2004b). A well conditioned estimator for large dimensional covariance matrices. J. of Multivariate Analysis 88(2), 365–411.
• Ledoit, O. and M. Wolf (2004a). Honey, I Shrunk the Sample Covariance Matrix. J. of Portfolio Management 31(1).
• Schaefer, J. and K. Strimmer (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statist. Appl. Genet. Mol. Biol 4(32).


MAP Shrinkage Estimation
$$\boldsymbol{\Sigma}_0 = \mathrm{diag}\!\left( \boldsymbol{\Sigma}_{MLE} \right)$$
• In this case, the MAP estimate is:
$$\boldsymbol{\Sigma}_{MAP}(i, j) = \begin{cases}
\boldsymbol{\Sigma}_{MLE}(i, j) & \text{if } i = j \\
(1 - \lambda)\, \boldsymbol{\Sigma}_{MLE}(i, j) & \text{otherwise}
\end{cases}$$
• Thus we see that the diagonal entries are equal to their MLE estimates, and the off-diagonal elements are "shrunk" somewhat towards 0 (shrinkage estimation, or regularized estimation).
• The benefits of MAP estimation are illustrated next (see also the sketch below). We consider fitting a 50-dimensional Gaussian to 𝑁 = 100, 𝑁 = 50 and 𝑁 = 25 data points.
• The MAP estimate is always well-conditioned, unlike the MLE.
• The eigenvalue spectrum of the MAP estimate is much closer to that of the true matrix than the MLE's.
• The eigenvectors, however, are unaffected.
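A Python sketch of this shrinkage estimate (diagonal kept at the MLE, off-diagonals scaled by 1 − 𝜆), not the lecture's PMTK shrinkcov; 𝜆 = 0.9, 𝐷 = 50 and 𝑁 = 25 mirror the next slide, while the synthetic true covariance is an assumption.

```python
# Shrinkage (regularized) covariance estimate and its conditioning vs. the MLE.
import numpy as np

def shrink_cov(X, lam):
    X = np.atleast_2d(X)
    Sigma_mle = np.cov(X, rowvar=False, bias=True)            # MLE of the covariance
    Sigma_map = (1.0 - lam) * Sigma_mle + lam * np.diag(np.diag(Sigma_mle))
    return Sigma_mle, Sigma_map

rng = np.random.default_rng(6)
D, N = 50, 25
A = rng.normal(size=(D, D))
Sigma_true = A @ A.T / D + np.eye(D)                          # assumed true covariance
X = rng.multivariate_normal(np.zeros(D), Sigma_true, size=N)
Sigma_mle, Sigma_map = shrink_cov(X, lam=0.9)
print(np.linalg.cond(Sigma_mle), np.linalg.cond(Sigma_map))   # MAP is far better conditioned
```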
Posterior Distribution of 𝚺
[Figure: shrinkcovDemo from PMTK. Three panels (N = 100, N = 50, N = 25, each with D = 50) of eigenvalue spectra; the legends report condition numbers such as true k = 10.00, MLE k = 71, 4.1e+16, 3.7e+17, and MAP k = 8.62, 8.85, 21.09.]
• Estimating a covariance matrix in 𝐷 = 50 dimensions using 𝑁 ∈ {100, 50, 25} samples.
• Eigenvalues in descending order for the true covariance matrix (solid black), the MLE (dotted blue) and the MAP estimate (dashed red) with 𝜆 = 0.9.
• The condition number 𝑘 of each matrix is also given in the legend.


Inference for Both 𝝁 and 𝚲
• Suppose 𝒙₁, 𝒙₂, …, 𝒙_N ~ (i.i.d.) 𝒩(𝝁, 𝜦⁻¹). We do not know 𝝁 or 𝚲.
• When both sets of parameters are unknown, a conjugate family of priors is one in which
$$\boldsymbol{\Lambda} \sim \mathcal{W}i(\boldsymbol{\Lambda} \mid \nu_0, \boldsymbol{T}_0)
\quad \text{and} \quad
\boldsymbol{\mu} \mid \boldsymbol{\Lambda} \sim \mathcal{N}\!\left( \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right).$$
• The Wishart distribution is the multivariate analog of the Gamma distribution (its extension to positive definite matrices). If a matrix 𝑼 has the Wishart distribution, then 𝑼⁻¹ has the inverse-Wishart distribution. The resulting 𝑝(𝝁, 𝚲| 𝝁₀, 𝜅₀, 𝑻₀, 𝜈₀) is the Gaussian-Wishart distribution.
• The quantity 𝜈₀ is a positive scalar, while 𝑻₀ is a positive definite matrix. They play roles analogous to 𝑎 and 𝑏, respectively, in the Gamma distribution.
• Other parameters of the prior are the mean vector 𝝁₀ and 𝜅₀, which represents the 'a priori number of observations'.
Inference for Both 𝝁 and 𝚲
• Denoting $\boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T$, the likelihood and prior distributions are given as:
$$p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda})
= (2\pi)^{-ND/2} |\boldsymbol{\Lambda}|^{N/2}
\exp\!\left( -\frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Lambda} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right)
= (2\pi)^{-ND/2} |\boldsymbol{\Lambda}|^{N/2}
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \bar{\boldsymbol{x}}) \right)
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{S}_{\bar{x}} \boldsymbol{\Lambda} \right) \right),$$
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) = \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_0, \kappa_0, \nu_0, \boldsymbol{T}_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right) \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_0, \nu_0)
= \frac{1}{Z_0} |\boldsymbol{\Lambda}|^{1/2}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_0^{-1} \boldsymbol{\Lambda} \right) \right),$$
$$Z_0 = \left( \frac{2\pi}{\kappa_0} \right)^{D/2} 2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right) |\boldsymbol{T}_0|^{\nu_0/2},
\qquad \Gamma_D \ \text{the multivariate Gamma function}.$$
• Combining gives:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(N + \nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}} \right) \boldsymbol{\Lambda} \right] \right).$$
• M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 8).
Inference for Both 𝝁 and 𝚲
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(N + \nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}} \right) \boldsymbol{\Lambda} \right] \right)$$
• We can complete the square in 𝝁 as follows:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
|\boldsymbol{\Lambda}|^{1/2} \exp\!\left( -\frac{\kappa_0 + N}{2}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)^T
\boldsymbol{\Lambda}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 + N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ N \bar{\boldsymbol{x}} \bar{\boldsymbol{x}}^T + \kappa_0 \boldsymbol{\mu}_0 \boldsymbol{\mu}_0^T
- \frac{(\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}})(\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}})^T}{\kappa_0 + N} \right) \boldsymbol{\Lambda} \right] \right).$$
• We can simplify this as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
|\boldsymbol{\Lambda}|^{1/2} \exp\!\left( -\frac{\kappa_0 + N}{2}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)^T
\boldsymbol{\Lambda}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 + N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \left( \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T \right) \boldsymbol{\Lambda} \right] \right).$$


Inference for Both 𝝁 and 𝚲
• This can be written as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T,
\quad \text{where } \boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T,$$
$$\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N.$$
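A Python sketch of this Normal-Wishart update; the hyperparameters and simulated data are illustrative placeholders, not the lecture's choices.

```python
# Normal-Wishart posterior update: returns (mu_N, kappa_N, nu_N, T_N).
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, nu0, T0):
    X = np.atleast_2d(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    resid = X - xbar
    S_xbar = resid.T @ resid
    kappaN, nuN = kappa0 + N, nu0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    d = (xbar - mu0).reshape(-1, 1)
    TN_inv = np.linalg.inv(T0) + S_xbar + (kappa0 * N / kappaN) * (d @ d.T)
    return muN, kappaN, nuN, np.linalg.inv(TN_inv)

rng = np.random.default_rng(7)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=100)
muN, kappaN, nuN, TN = normal_wishart_posterior(X, mu0=np.zeros(2), kappa0=1.0, nu0=3.0, T0=np.eye(2))
print(muN, nuN * TN)     # posterior mean of mu and E[Lambda | D] = nu_N * T_N
```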


Normalization Factors
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) \propto
|\boldsymbol{\Lambda}|^{1/2} \exp\!\left( -\frac{\kappa_0 + N}{2}
(\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 + N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \boldsymbol{T}_N^{-1} \boldsymbol{\Lambda} \right] \right)$$
• Using the normalization constants of the multivariate Gaussian and the Wishart:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) = \frac{1}{Z_N} |\boldsymbol{\Lambda}|^{1/2}
\exp\!\left( -\frac{\kappa_N}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right)
|\boldsymbol{\Lambda}|^{(\nu_N - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_N^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
Z_N = \left( \frac{2\pi}{\kappa_N} \right)^{D/2} 2^{\nu_N D/2}\, \Gamma_D\!\left( \frac{\nu_N}{2} \right) |\boldsymbol{T}_N|^{\nu_N/2}.$$
• Similarly, for the prior the normalization factor is:
$$Z_0 = \left( \frac{2\pi}{\kappa_0} \right)^{D/2} 2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right) |\boldsymbol{T}_0|^{\nu_0/2}.$$
• The marginal likelihood is computed as a ratio of normalization constants using $p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D}) = p(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda})\, p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) / p(\mathcal{D})$:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0\, (2\pi)^{ND/2}}
= \frac{1}{\pi^{ND/2}}\,
\frac{\Gamma_D\!\left( \frac{\nu_N}{2} \right)}{\Gamma_D\!\left( \frac{\nu_0}{2} \right)}\,
\frac{|\boldsymbol{T}_N|^{\nu_N/2}}{|\boldsymbol{T}_0|^{\nu_0/2}}
\left( \frac{\kappa_0}{\kappa_N} \right)^{D/2}.$$
Posterior Marginals for 𝝁 and 𝚲
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N.$$
• The posterior marginals can be derived as:
$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu} \,\Big|\, \boldsymbol{\mu}_N,\
\left[ \kappa_N (\nu_N - D + 1)\, \boldsymbol{T}_N \right]^{-1} \right).$$
* Refer to this report for these results, based on Bernardo and Smith.
Appendix: Marginal of Normal-Wishart
• Consider the Normal-Wishart distribution
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_0, \kappa_0, \nu_0, \boldsymbol{T}_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_0, \nu_0)
\propto \exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_0^{-1} \boldsymbol{\Lambda} \right) \right).$$
• Let us write this in the form:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda})
\propto |\boldsymbol{\Lambda}|^{(\nu_0 - D)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left[ \underbrace{\left( \boldsymbol{T}_0^{-1}
+ \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)(\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \right)}_{\boldsymbol{A}^{-1}} \boldsymbol{\Lambda} \right] \right)
= |\boldsymbol{\Lambda}|^{(\nu_0 + 1 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{A}^{-1} \boldsymbol{\Lambda} \right) \right).$$
• This is an (unnormalized) Wishart distribution $\mathcal{W}i_D(\boldsymbol{\Lambda} \mid \boldsymbol{A}, \nu_0 + 1)$, so when integrating out 𝜦 we pick up its normalization factor $\propto |\boldsymbol{A}|^{(\nu_0 + 1)/2}$:
$$p(\boldsymbol{\mu}) \propto |\boldsymbol{A}|^{(\nu_0 + 1)/2}
= \left| \boldsymbol{T}_0^{-1} + \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)(\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \right|^{-(\nu_0 + 1)/2}.$$
• The matrix determinant lemma says $\det\!\left( \boldsymbol{A} + \boldsymbol{u}\boldsymbol{v}^T \right) = \left( 1 + \boldsymbol{v}^T \boldsymbol{A}^{-1} \boldsymbol{u} \right) \det(\boldsymbol{A})$. Applying this to the formula above:
$$p(\boldsymbol{\mu}) \propto \left( 1 + \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{T}_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)^{-(\nu_0 + 1)/2}
= \left( 1 + \frac{1}{\nu_0 + 1 - D}
(\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \left[ \kappa_0 (\nu_0 + 1 - D)\, \boldsymbol{T}_0 \right] (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)^{-\left[ (\nu_0 + 1 - D) + D \right]/2}.$$
• We can immediately recognize that this is a Student's
$\mathcal{T}_{\nu_0 + 1 - D}\!\left( \boldsymbol{\mu}_0,\ \dfrac{\boldsymbol{T}_0^{-1}}{\kappa_0 (\nu_0 + 1 - D)} \right)$.
MAP Estimates for 𝝁 and 𝚲
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N.$$
• Differentiating the equation at the top of this slide, we can also derive the MAP estimates
$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Lambda}}) = \arg\max_{\boldsymbol{\mu}, \boldsymbol{\Lambda}} p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})$:
$$\hat{\boldsymbol{\mu}} = \frac{\sum_{i=1}^{N} \boldsymbol{x}_i + \kappa_0 \boldsymbol{\mu}_0}{N + \kappa_0}, \qquad
\hat{\boldsymbol{\Lambda}}^{-1} = \frac{\sum_{i=1}^{N} (\boldsymbol{x}_i - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_i - \hat{\boldsymbol{\mu}})^T
+ \kappa_0 (\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}_0)(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu}_0)^T + \boldsymbol{T}_0^{-1}}{N + \nu_0 - D}.$$
• These reduce to the MLE by setting $\kappa_0 = 0$, $\nu_0 = D$, $\boldsymbol{T}_0^{-1} = \boldsymbol{0}$.
Inference for Both 𝝁 and 𝚲
• The posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu}_N,\
\frac{(\kappa_N + 1)\, \boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)} \right).$$
• Another useful reference prior considers
$$\boldsymbol{\mu}_0 = \boldsymbol{0}, \qquad \kappa_0 = 0, \qquad \nu_0 = -1, \qquad \boldsymbol{T}_0^{-1} = \boldsymbol{0}.$$
• This results in the following form for the prior:*
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda}) \propto |\boldsymbol{\Lambda}|^{-(D + 1)/2}.$$

* Recall the general form:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda})
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, (\kappa_0 \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_0, \nu_0)
= \frac{1}{Z} |\boldsymbol{\Lambda}|^{1/2}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{T}_0^{-1} \boldsymbol{\Lambda} \right) \right).$$


Reference Prior: Posterior Marginals and Predictive
• The posterior parameters are also simplified as:*
$$\boldsymbol{\mu}_N = \bar{\boldsymbol{x}}, \qquad \boldsymbol{T}_N^{-1} = \boldsymbol{S}_{\bar{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1.$$
• The posterior marginals and posterior predictive for this prior are given as:
$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathcal{W}i_{N-1}\!\left( \boldsymbol{\Lambda} \mid \boldsymbol{S}_{\bar{x}}^{-1} \right), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \boldsymbol{\mu} \,\Big|\, \bar{\boldsymbol{x}},\ \frac{\boldsymbol{S}_{\bar{x}}}{N (N - D)} \right), \qquad
p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \bar{\boldsymbol{x}},\ \frac{\boldsymbol{S}_{\bar{x}} (N + 1)}{N (N - D)} \right).$$

* Recall the general form of the posterior:
$$p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \mathcal{D})
= \mathcal{N}\mathcal{W}i(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{T}_N)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_N, (\kappa_N \boldsymbol{\Lambda})^{-1} \right)
\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \quad
\boldsymbol{T}_N^{-1} = \boldsymbol{T}_0^{-1} + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T, \quad
\kappa_N = \kappa_0 + N, \quad \nu_N = \nu_0 + N,$$
$$p(\boldsymbol{\Lambda} \mid \mathcal{D}) = \mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{T}_N, \nu_N), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu} \,\Big|\, \boldsymbol{\mu}_N,\ \frac{\boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)} \right), \qquad
p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu}_N,\ \frac{(\kappa_N + 1)\, \boldsymbol{T}_N^{-1}}{\kappa_N (\nu_N - D + 1)} \right).$$
Inference for 𝝁 and 𝚺
• For the case of the multivariate Gaussian of a 𝐷-dimensional variable 𝒙, 𝒩(𝒙|𝝁, 𝚺), with both the mean and covariance unknown, the likelihood is of the form:
$$\prod_{n=1}^{N} \mathcal{N}(\boldsymbol{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{1}{2} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right)
= (2\pi)^{-ND/2} |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\boldsymbol{x}}) \right)
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_{\bar{x}} \right) \right).$$
• The conjugate prior is given as the product of a Gaussian and an inverse Wishart (the Normal-Inverse-Wishart, NIW):
$$\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{1}{\kappa_0}\boldsymbol{\Sigma} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0)
= \frac{1}{Z_{NIW}} |\boldsymbol{\Sigma}|^{-1/2}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)$$
$$= \frac{1}{Z_{NIW}}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2},$$
$$Z_{NIW} = 2^{\nu_0 D/2}\, \Gamma_D\!\left( \frac{\nu_0}{2} \right)
\left( \frac{2\pi}{\kappa_0} \right)^{D/2} |\boldsymbol{S}_0|^{-\nu_0/2},
\qquad \Gamma_D \ \text{the multivariate Gamma function}.$$
Inference for 𝝁 and 𝚺
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) \propto
|\boldsymbol{\Sigma}|^{-\frac{N}{2} - \frac{\nu_0 + D + 2}{2}}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left[ \boldsymbol{\Sigma}^{-1} \left( \boldsymbol{S}_{\bar{x}} + \boldsymbol{S}_0 \right) \right]
- \frac{N}{2} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right)$$
• One can show by completing the square in 𝝁 that:
$$N (\boldsymbol{\mu} - \bar{\boldsymbol{x}})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \bar{\boldsymbol{x}})
+ \kappa_0 (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
= (\kappa_0 + N) \left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)^T
\boldsymbol{\Sigma}^{-1}
\left( \boldsymbol{\mu} - \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N} \right)
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0).$$
• Thus:
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}) \propto
|\boldsymbol{\Sigma}|^{-\frac{N + \nu_0 + D + 2}{2}}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left[ \boldsymbol{\Sigma}^{-1} \left( \boldsymbol{S}_{\bar{x}} + \boldsymbol{S}_0
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T \right) \right]
- \frac{\kappa_N}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right),$$
$$\text{where } \boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad \kappa_N = \kappa_0 + N.$$


The Posterior of 𝝁 and 𝚺
• The posterior is NIW, given as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_N, \kappa_N, \boldsymbol{S}_N, \nu_N)
= \frac{1}{Z_{NIW}} |\boldsymbol{\Sigma}|^{-1/2}
\exp\!\left( -\frac{\kappa_N}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_N)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_N) \right)
|\boldsymbol{\Sigma}|^{-(\nu_N + D + 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_N \right) \right),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}
= \frac{\kappa_0}{\kappa_0 + N} \boldsymbol{\mu}_0 + \frac{N}{\kappa_0 + N} \bar{\boldsymbol{x}}, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N,$$
$$\boldsymbol{S}_N = \boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T
= \boldsymbol{S}_0 + \boldsymbol{S} + \kappa_0 \boldsymbol{\mu}_0 \boldsymbol{\mu}_0^T - \kappa_N \boldsymbol{\mu}_N \boldsymbol{\mu}_N^T,
\qquad \boldsymbol{S} = \sum_{i=1}^{N} \boldsymbol{x}_i \boldsymbol{x}_i^T.$$
• The posterior mean is a convex combination of the prior mean and the MLE, with strength 𝜅₀ + 𝑁.
• The posterior scatter matrix 𝑺_N is the prior scatter matrix 𝑺₀ plus the empirical scatter matrix 𝑺_x̄ plus an extra term due to the uncertainty in the mean, which creates its own scatter matrix. A sketch of this update is given below.
• Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.
• Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
• Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007.
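A Python sketch of this NIW update (𝝁_N, 𝜅_N, 𝜈_N, 𝑺_N); the hyperparameters and simulated data are illustrative assumptions.

```python
# Normal-Inverse-Wishart posterior update for (mu, Sigma).
import numpy as np

def niw_posterior(X, mu0, kappa0, nu0, S0):
    X = np.atleast_2d(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    resid = X - xbar
    S_xbar = resid.T @ resid
    kappaN, nuN = kappa0 + N, nu0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    d = (xbar - mu0).reshape(-1, 1)
    SN = S0 + S_xbar + (kappa0 * N / kappaN) * (d @ d.T)
    return muN, kappaN, nuN, SN

rng = np.random.default_rng(8)
Sigma_true = np.array([[2.0, 0.8], [0.8, 1.5]])
X = rng.multivariate_normal([0.5, -0.5], Sigma_true, size=500)
muN, kappaN, nuN, SN = niw_posterior(X, mu0=np.zeros(2), kappa0=0.01, nu0=4.0, S0=np.eye(2))
print(muN, SN / (nuN - 2 - 1))   # posterior mean and E[Sigma | D] = S_N / (nu_N - D - 1)
```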
MAP Estimate of (𝝁, 𝚺)
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{S}_N),$$
$$\boldsymbol{\mu}_N = \frac{\kappa_0 \boldsymbol{\mu}_0 + N \bar{\boldsymbol{x}}}{\kappa_0 + N}, \qquad
\kappa_N = \kappa_0 + N, \qquad \nu_N = \nu_0 + N, \qquad
\boldsymbol{S}_N = \boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}
+ \frac{\kappa_0 N}{\kappa_0 + N} (\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)(\bar{\boldsymbol{x}} - \boldsymbol{\mu}_0)^T.$$
• The mode of the joint posterior is:
$$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})
= \arg\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \left( \boldsymbol{\mu}_N,\ \frac{\boldsymbol{S}_N}{\nu_N + D + 2} \right).$$
• For 𝜅₀ = 0, this becomes:
$$(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})
= \left( \bar{\boldsymbol{x}},\ \frac{\boldsymbol{S}_0 + \boldsymbol{S}_{\bar{x}}}{\nu_0 + N + D + 2} \right).$$
• It is interesting to note that this mode is almost the same as the MAP estimate shown next: it differs by 1 in the denominator, because the mode above is the mode of the joint posterior and not of the marginal.
The Posterior Marginals of 𝝁 and 𝚺
• The posterior marginals for 𝚺 and 𝝁 are:
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D})\, d\boldsymbol{\mu}
= \mathrm{IW}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_N, \nu_N), \qquad
\boldsymbol{\Sigma}_{MAP} = \frac{\boldsymbol{S}_N}{\nu_N + D + 1}, \qquad
\mathbb{E}[\boldsymbol{\Sigma} \mid \mathcal{D}] = \frac{\boldsymbol{S}_N}{\nu_N - D - 1},$$
$$p(\boldsymbol{\mu} \mid \mathcal{D}) = \int p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D})\, d\boldsymbol{\Sigma}
= \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{\mu} \,\Big|\, \boldsymbol{\mu}_N,\
\frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \right).$$
• It is not surprising that the last marginal is a Student's 𝒯, which we know can be represented as a mixture of Gaussians.
• To see the connection for the scalar case (𝐷 = 1), note that 𝑺_N plays the role of the posterior sum of squares $\nu_N \sigma_N^2$ (you may want to revisit the results for the scalar case of simultaneously estimating 𝜇 and 𝜎²):
$$\frac{\boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \ \longrightarrow\ \frac{\nu_N \sigma_N^2}{\kappa_N \nu_N} = \frac{\sigma_N^2}{\kappa_N}.$$
The Posterior Predictive of 𝝁 and 𝚺
• The posterior predictive 𝑝(𝒙|𝒟) = 𝑝(𝒙, 𝒟)/𝑝(𝒟) can be evaluated as:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \iint \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\,
\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_N, \kappa_N, \nu_N, \boldsymbol{S}_N)\, d\boldsymbol{\mu}\, d\boldsymbol{\Sigma}
= \mathcal{T}_{\nu_N - D + 1}\!\left( \boldsymbol{x} \,\Big|\, \boldsymbol{\mu}_N,\
\frac{(\kappa_N + 1)\, \boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \right).$$
• Recall that the Student's 𝒯 distribution has heavier tails than the Gaussian but rapidly becomes Gaussian-like.
• To see the connection of the above expression with the scalar case, note:
$$\frac{(\kappa_N + 1)\, \boldsymbol{S}_N}{\kappa_N (\nu_N - D + 1)} \ \longrightarrow\
\frac{(\kappa_N + 1)\, \nu_N \sigma_N^2}{\kappa_N \nu_N} = \left( 1 + \frac{1}{\kappa_N} \right) \sigma_N^2.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9).
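A Python sketch of evaluating this posterior predictive with scipy's multivariate Student-𝒯 (available in scipy ≥ 1.6); the posterior quantities below are illustrative hard-coded values of the kind produced by the NIW update sketch on the earlier slide, not results from the lecture.

```python
# Posterior predictive density p(x | D) as a multivariate Student-t.
import numpy as np
from scipy.stats import multivariate_t

D = 2
muN = np.array([0.5, -0.5])                   # illustrative posterior mean (assumed)
kappaN, nuN = 500.01, 504.0                   # illustrative posterior strengths (assumed)
SN = np.array([[1000.0, 400.0], [400.0, 750.0]])   # illustrative posterior scatter (assumed)

df = nuN - D + 1
shape = (kappaN + 1.0) * SN / (kappaN * df)   # (kappa_N + 1) S_N / (kappa_N (nu_N - D + 1))
predictive = multivariate_t(loc=muN, shape=shape, df=df)
print(predictive.pdf(np.array([0.0, 0.0])))   # density of a new point under p(x | D)
```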


Marginal Likelihood
• The posterior can be written as:
$$p(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \mathcal{D}, \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \frac{1}{p(\mathcal{D})} \frac{1}{Z_0} \frac{1}{(2\pi)^{ND/2}}
\mathcal{N}'(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\,
\mathrm{NIW}'(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid a_0)
= \frac{1}{Z_N} \mathrm{NIW}'(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid a_N),$$
$$\mathrm{NIW}'(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid a_0)
= |\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left( \boldsymbol{S}_0 \boldsymbol{\Sigma}^{-1} \right)
- \frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0) \right), \qquad
\mathcal{N}'(\mathcal{D} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
= |\boldsymbol{\Sigma}|^{-N/2}
\exp\!\left( -\frac{1}{2} \mathrm{tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_{\mu} \right) \right).$$
• In the last two expressions, (·)′ denotes the unnormalized likelihood and prior.
• The marginal likelihood is then:
$$p(\mathcal{D}) = \frac{Z_N}{Z_0\, (2\pi)^{ND/2}}
= \frac{1}{\pi^{ND/2}}\,
\frac{\Gamma_D\!\left( \frac{\nu_N}{2} \right)}{\Gamma_D\!\left( \frac{\nu_0}{2} \right)}\,
\frac{|\boldsymbol{S}_0|^{\nu_0/2}}{|\boldsymbol{S}_N|^{\nu_N/2}}
\left( \frac{\kappa_0}{\kappa_N} \right)^{D/2}.$$
• Note that for 𝐷 = 1, this reduces to the familiar equation:
$$p(\mathcal{D}) = \frac{1}{\pi^{N/2}}\,
\frac{\Gamma\!\left( \frac{\nu_N}{2} \right)}{\Gamma\!\left( \frac{\nu_0}{2} \right)}\,
\frac{\left( \nu_0 \sigma_0^2 \right)^{\nu_0/2}}{\left( \nu_N \sigma_N^2 \right)^{\nu_N/2}}
\left( \frac{\kappa_0}{\kappa_N} \right)^{1/2}.$$
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (see the calculation of the marginal likelihood for the 1D analysis of the Normal-Inverse-Chi-Squared prior in Section 5).
Non-Informative Prior
• The uninformative Jeffreys prior is $p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma}|^{-(D + 1)/2}$. This is obtained in the limit
$$\kappa_0 \to 0, \qquad \nu_0 \to -1, \qquad |\boldsymbol{S}_0| \to 0.$$
• In this case, the posterior parameters become:
$$\boldsymbol{\mu}_N = \bar{\boldsymbol{x}}, \qquad \kappa_N = N, \qquad \nu_N = N - 1, \qquad
\boldsymbol{S}_N = \boldsymbol{S}_{\bar{x}} = \sum_{i=1}^{N} (\boldsymbol{x}_i - \bar{\boldsymbol{x}})(\boldsymbol{x}_i - \bar{\boldsymbol{x}})^T.$$
• The posterior marginals are then given as:
$$p(\boldsymbol{\Sigma} \mid \mathcal{D}) = \mathrm{IW}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_{\bar{x}}, N - 1), \qquad
p(\boldsymbol{\mu} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \boldsymbol{\mu} \,\Big|\, \bar{\boldsymbol{x}},\
\frac{\boldsymbol{S}_{\bar{x}}}{N (N - D)} \right).$$
• Also, the posterior predictive is:
$$p(\boldsymbol{x} \mid \mathcal{D}) = \mathcal{T}_{N - D}\!\left( \boldsymbol{x} \,\Big|\, \bar{\boldsymbol{x}},\
\frac{\boldsymbol{S}_{\bar{x}} (N + 1)}{N (N - D)} \right).$$
• Bayesian Data Analysis, A. Gelman, J. Carlin, H. Stern and D. Rubin, 2004 (p. 88).
• K. Murphy, Conjugate Bayesian Analysis of the Gaussian Distribution, 2007 (Section 9).
Non-Informative Jeffreys Prior
• Based on the report of Minka below, the uninformative prior should instead be
$$\lim_{k \to 0} \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{\boldsymbol{\Sigma}}{k} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid k\boldsymbol{I}, k)
\propto |\boldsymbol{\Sigma}|^{-(D/2 + 1)}
= \mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{0}, 0, 0\boldsymbol{I}, 0),$$
i.e. 𝜈₀ = 0 instead of 𝜈₀ = −1.
• Recall the general form:
$$\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{1}{\kappa_0}\boldsymbol{\Sigma} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0)
= \frac{1}{Z_{NIW}}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}.$$
• Minka, T. (2000). Inferring a Gaussian distribution. Technical report, MIT.


Data-Dependent Weakly Informative Prior
• Often, a data-dependent weakly informative prior is recommended (see Chipman et al. and Fraley and Raftery):
$$\text{Set } \boldsymbol{S}_0 = \frac{\mathrm{diag}(\boldsymbol{S}_{\bar{x}})}{N}, \qquad \nu_0 = D + 2 \ \text{ to ensure } \mathbb{E}[\boldsymbol{\Sigma}] = \boldsymbol{S}_0, \qquad \text{and} \qquad \boldsymbol{\mu}_0 = \bar{\boldsymbol{x}}, \qquad \kappa_0 = 0.01.$$
• Recall the general form:
$$\mathrm{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_0, \kappa_0, \boldsymbol{S}_0, \nu_0)
= \mathcal{N}\!\left( \boldsymbol{\mu} \mid \boldsymbol{\mu}_0, \frac{1}{\kappa_0}\boldsymbol{\Sigma} \right)
\mathrm{InvWis}(\boldsymbol{\Sigma} \mid \boldsymbol{S}_0, \nu_0)
= \frac{1}{Z_{NIW}}
\exp\!\left( -\frac{\kappa_0}{2} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}_0)
- \frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{S}_0 \right) \right)
|\boldsymbol{\Sigma}|^{-(\nu_0 + D + 2)/2}.$$
• Chipman, H., E. George, and R. McCulloch (2001). The practical implementation of Bayesian Model Selection. Model Selection. IMS Lecture Notes.
• Fraley, C. and A. Raftery (2007). Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering. J. of Classification 24, 155–181.


Visualization of the Wishart Distribution
• 𝒲𝒾 is a distribution over matrices, so it is difficult to plot its PDF. However, one can sample from it and, in 2D, use the eigenvectors of each sampled matrix to define an ellipse, as we have done for the 2D Gaussian.
$$\mathcal{W}i(\boldsymbol{\Lambda} \mid \boldsymbol{W}_0, \nu_0)
= B\, |\boldsymbol{\Lambda}|^{(\nu_0 - D - 1)/2}
\exp\!\left( -\frac{1}{2} \mathrm{Tr}\!\left( \boldsymbol{W}_0^{-1} \boldsymbol{\Lambda} \right) \right), \qquad
B(\boldsymbol{W}_0, \nu_0) = |\boldsymbol{W}_0|^{-\nu_0/2}
\left( 2^{\nu_0 D/2}\, \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\!\left( \frac{\nu_0 + 1 - i}{2} \right) \right)^{-1}.$$
[Figure: wiPlotDemo from PMTK. A 3×3 grid of ellipses for samples from Wi(dof = 3.0, S); the panel title reports E = [9.5, -0.1; -0.1, 1.9] and 𝜌 = -0.018.]
• Samples 𝚺 ∼ 𝒲𝒾(𝑺, 𝜈), with 𝑺 = [3.1653, −0.0262; −0.0262, 0.6477] and 𝜈 = 3.
• The sampled matrices are highly variable, and some are nearly singular. As 𝜈 increases, the sampled matrices are more concentrated on the prior 𝑺.
Visualization of the Wishart Distribution
• For the off-diagonal elements, one can sample matrices from the distribution and then compute their distribution empirically.
• We can convert each sampled matrix to a correlation matrix, and thus compute a Monte Carlo approximation to, e.g., the mean correlation:
$$\mathbb{E}[R_{ij}] \approx \frac{1}{S} \sum_{s=1}^{S} R_{ij}^{(s)}, \qquad
\boldsymbol{\Sigma}^{(s)} \sim \mathcal{W}i(\boldsymbol{\Sigma}, \nu), \qquad
R_{ij}^{(s)} = \frac{\Sigma_{ij}^{(s)}}{\sqrt{\Sigma_{ii}^{(s)}\, \Sigma_{jj}^{(s)}}}.$$
• We can then use kernel density estimation to produce, for plotting purposes, a smooth approximation to the univariate density of 𝑅_ij.
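A Python sketch of this Monte Carlo procedure (not the lecture's wiPlotDemo): sample Wishart matrices, convert each to a correlation coefficient, and smooth the samples with a kernel density estimate; the scale matrix is taken from the figure on the previous slide, and the sample count is an assumption.

```python
# Monte Carlo approximation of the correlation-coefficient marginal under a Wishart.
import numpy as np
from scipy.stats import wishart, gaussian_kde

S = np.array([[3.1653, -0.0262], [-0.0262, 0.6477]])    # scale matrix from the figure
nu, n_samples = 3, 5000
samples = wishart(df=nu, scale=S).rvs(size=n_samples)   # shape (n_samples, 2, 2)

r12 = samples[:, 0, 1] / np.sqrt(samples[:, 0, 0] * samples[:, 1, 1])
kde = gaussian_kde(r12)                                  # smooth density of R_12 for plotting
grid = np.linspace(-1, 1, 201)
print(r12.mean(), kde(grid).max())   # E[R_12] estimate and a point on the smoothed density
```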


Visualization of the Wishart Distribution
• Plots of the marginals of the diagonal elements (which are Gamma), and the sample-based marginal on the correlation coefficient.
• If 𝜈 = 3, there is a lot of uncertainty about the correlation coefficient 𝜌 (nearly uniform on [−1, 1]).
[Figure: wiPlotDemo from PMTK. Panels showing the Gamma marginals of the diagonal elements and the sample-based density of the correlation coefficient 𝜌(1, 2).]
