The Evidence Approximation for Regression Models
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
The evidence approximation for our regression example, Empirical Bayes for linear
regression, Effective number of regression parameters
Following closely: C. M. Bishop, Pattern Recognition and Machine Learning (Springer, 2006), Ch. 3, and K. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012).
The ratio of model evidences for two models, $\dfrac{p(\mathcal{D}\mid\mathcal{M}_i)}{p(\mathcal{D}\mid\mathcal{M}_j)}$, is known as a Bayes factor.

Given the posterior over models, the predictive distribution follows by model averaging:
$$p(t \mid x, \mathcal{D}) = \sum_i p(t \mid x, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$$
This has the form of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions $p(t \mid x, \mathcal{M}_i, \mathcal{D})$ of the individual models, weighted by the posterior probabilities $p(\mathcal{M}_i \mid \mathcal{D})$ of those models.
A simple approximation to model averaging is to use the single most probable model
alone to make predictions. This is known as model selection.
$$p(\mathcal{D}\mid\mathcal{M}_i) = \int p(\mathcal{D}\mid \boldsymbol{w}_i, \mathcal{M}_i)\, p(\boldsymbol{w}_i\mid\mathcal{M}_i)\, d\boldsymbol{w}_i$$
and, by Bayes' theorem,
$$p(\boldsymbol{w}_i\mid\mathcal{D},\mathcal{M}_i) = \frac{p(\mathcal{D}\mid\boldsymbol{w}_i,\mathcal{M}_i)\, p(\boldsymbol{w}_i\mid\mathcal{M}_i)}{p(\mathcal{D}\mid\mathcal{M}_i)}$$
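The second identity holds for any value of $\boldsymbol{w}_i$, which gives a direct way to read off the evidence in conjugate models. Below is a minimal Python sketch (ours, not from the lecture; the toy Gaussian-mean model, the data, and all names are assumptions) checking $p(\mathcal{D}\mid\mathcal{M}) = p(\mathcal{D}\mid w)\,p(w\mid\mathcal{M})/p(w\mid\mathcal{D},\mathcal{M})$ against the closed-form marginal likelihood:

```python
# Sketch: the identity p(D|M) = p(D|w) p(w|M) / p(w|D,M) holds for ANY w,
# so for a conjugate Gaussian-mean model the evidence can be read off exactly.
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 2.0                  # assumed noise std and prior std
x = rng.normal(0.5, sigma, size=10)    # synthetic data
N = len(x)

# Conjugate posterior over the mean w: N(mu_N, s2_N)
s2_N = 1.0 / (1.0 / tau**2 + N / sigma**2)
mu_N = s2_N * x.sum() / sigma**2

w = 0.3                                # any test point works
log_evidence = (norm.logpdf(x, w, sigma).sum()          # log p(D|w)
                + norm.logpdf(w, 0.0, tau)              # + log p(w)
                - norm.logpdf(w, mu_N, np.sqrt(s2_N)))  # - log p(w|D)

# Closed form: marginally, x ~ N(0, sigma^2 I + tau^2 11^T)
cov = sigma**2 * np.eye(N) + tau**2 * np.ones((N, N))
print(log_evidence, multivariate_normal.logpdf(x, np.zeros(N), cov))
```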
The marginal likelihood (Bayesian evidence) $p(\mathcal{D}\mid\mathcal{M}_i)$ can be viewed as the probability that the data set $\mathcal{D}$ would be generated by parameter values sampled at random from the prior of model class $\mathcal{M}_i$.
Model classes that are too simple are unlikely to generate the data set $\mathcal{D}$. Model classes that are too complex can generate many possible data sets, so it is again unlikely that they generate this particular data set $\mathcal{D}$.
Approximating the posterior $p(w\mid\mathcal{D})$ as sharply peaked around $w_{MAP}$ with width $\Delta w_{\text{posterior}}$, and the prior as flat with width $\Delta w_{\text{prior}}$ so that $p(w) \simeq 1/\Delta w_{\text{prior}}$, we obtain
$$p(\mathcal{D}) = \int p(\mathcal{D}\mid w)\, p(w)\, dw \simeq p(\mathcal{D}\mid w_{MAP})\, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$$
Taking logs, and assuming a model with $M$ parameters that all share comparable width ratios,
$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}\mid w_{MAP}) + M \ln\!\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)$$
Since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, the second term is negative, and the size of this complexity penalty increases linearly with $M$. As we increase the complexity of the model,
the first term increases, because a more complex model is better able to fit the data,
whereas the second term decreases due to the dependence on $M$.
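As a numeric illustration (the numbers here are ours, not from the lecture): suppose each parameter's posterior width is a tenth of its prior width, $\Delta w_{\text{posterior}}/\Delta w_{\text{prior}} = 0.1$. Each parameter then contributes $\ln 0.1 \approx -2.3$ to the log evidence, so models with $M = 1, 3, 9$ parameters pay complexity penalties of roughly $-2.3$, $-6.9$ and $-20.7$; the more complex model is favored only if its fit term $\ln p(\mathcal{D}\mid w_{MAP})$ improves by more than this margin.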
Let us think of the regression model and consider the models $\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$ representing linear, quadratic and cubic fits.
The data sets $\mathcal{D}$ are ordered in complexity: for a given model, we choose $\boldsymbol{w}$ from the prior $p(\boldsymbol{w})$, then sample the data from $p(\mathcal{D}\mid\boldsymbol{w})$.
[Figure: schematic of the evidences $p(\mathcal{D}\mid\mathcal{M}_1)$, $p(\mathcal{D}\mid\mathcal{M}_2)$, $p(\mathcal{D}\mid\mathcal{M}_3)$ over data sets $\mathcal{D}$ ordered in complexity]
Optimal Model Complexity
A 1st-order polynomial has little variability and generates data sets that are all similar, so its $p(\mathcal{D})$ is confined to a small region of the $\mathcal{D}$ axis.
A 9th-order polynomial can generate a wide variety of different data sets, so its $p(\mathcal{D})$ is spread over a large region of the $\mathcal{D}$ axis.
Because the distributions $p(\mathcal{D}\mid\mathcal{M}_i)$ are normalized, a particular data set $\mathcal{D}_0$ can have the highest evidence under the model of intermediate complexity.
Optimal Model Comparison
Bayesian model comparison will, on average over data sets $\mathcal{D}$, favor the correct model.
Let $\mathcal{M}_1$ be the correct model and $\mathcal{M}_2$ another model. We can show that the expected evidence for model $\mathcal{M}_1$ is higher, using the definition and the non-negativity of the Kullback-Leibler divergence:
$$\mathrm{KL}\big(p(\mathcal{D}\mid\mathcal{M}_1)\,\big\|\,p(\mathcal{D}\mid\mathcal{M}_2)\big) = \int p(\mathcal{D}\mid\mathcal{M}_1)\,\ln\frac{p(\mathcal{D}\mid\mathcal{M}_1)}{p(\mathcal{D}\mid\mathcal{M}_2)}\,d\mathcal{D} \;\ge\; 0$$
That is, the log Bayes factor $\ln\{p(\mathcal{D}\mid\mathcal{M}_1)/p(\mathcal{D}\mid\mathcal{M}_2)\}$, averaged with respect to the exact probability $p(\mathcal{D}\mid\mathcal{M}_1)$, is non-negative.
This analysis assumes that the true distribution from which the data are generated is
contained in our class of models.
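A hedged Monte Carlo sketch in Python (ours; the two toy model classes are assumptions) illustrating this inequality: individual data sets drawn from $\mathcal{M}_1$ may favor $\mathcal{M}_2$, but the average log Bayes factor is non-negative:

```python
# Draw data sets from the "true" model M1 and average the log Bayes factor
# ln p(D|M1) - ln p(D|M2); by the KL inequality the average is >= 0,
# even though individual data sets may favor M2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
logBF = []
for _ in range(20000):
    D = rng.normal(0.0, 1.0)                 # D ~ p(D|M1) = N(0, 1)
    logBF.append(norm.logpdf(D, 0, 1.0)      # ln p(D|M1)
                 - norm.logpdf(D, 0, 2.0))   # ln p(D|M2) = N(0, 4)
logBF = np.array(logBF)
print("E[ln BF] ~", logBF.mean())                          # positive: KL > 0
print("fraction of data sets favoring M2:", (logBF < 0).mean())
```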
For large data sets $\mathcal{D}$ (relative to the number of model parameters), the parameter posterior is approximately Gaussian around $\boldsymbol{\theta}_m^{MAP}$ (equivalently, use a 2nd-order Taylor expansion of the log-posterior):
$$p(\boldsymbol{\theta}_m\mid\mathcal{D}, M_m) \simeq (2\pi)^{-d/2}\,|\boldsymbol{A}|^{1/2} \exp\!\left\{-\frac{1}{2}\big(\boldsymbol{\theta}_m-\boldsymbol{\theta}_m^{MAP}\big)^{T} \boldsymbol{A}\,\big(\boldsymbol{\theta}_m-\boldsymbol{\theta}_m^{MAP}\big)\right\},$$
$$A_{ij} = -\left.\frac{\partial^2 \log p(\boldsymbol{\theta}_m\mid\mathcal{D}, M_m)}{\partial\theta_{m,i}\,\partial\theta_{m,j}}\right|_{\boldsymbol{\theta}_m^{MAP}}$$
Using the Laplace approximation for the posterior of the parameters and evaluating $p(\mathcal{D}\mid M_m) = p(\boldsymbol{\theta}_m,\mathcal{D}\mid M_m)/p(\boldsymbol{\theta}_m\mid\mathcal{D},M_m)$ at $\boldsymbol{\theta}_m^{MAP}$:
$$\log p(\mathcal{D}\mid M_m) \simeq \log p(\boldsymbol{\theta}_m^{MAP},\mathcal{D}\mid M_m) - \log p(\boldsymbol{\theta}_m^{MAP}\mid\mathcal{D},M_m)$$
$$\simeq \log p(\mathcal{D}\mid\boldsymbol{\theta}_m^{MAP}, M_m) + \log p(\boldsymbol{\theta}_m^{MAP}\mid M_m) + \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{A}|$$
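A minimal 1-D Python sketch (ours; the Bernoulli-logistic toy model is an assumption) of the Laplace evidence formula above, checked against numerical quadrature:

```python
# ln p(D) ~ ln p(D|t*) + ln p(t*) + (d/2) ln 2pi - (1/2) ln|A|, with d = 1.
# Toy model: Bernoulli data with logistic link, standard normal prior on theta.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

y = np.array([1, 1, 0, 1, 1, 1, 0, 1])          # toy observations
sig = lambda t: 1.0 / (1.0 + np.exp(-t))

def log_joint(t):                                # ln p(D|theta) + ln p(theta)
    return (y * np.log(sig(t)) + (1 - y) * np.log(1 - sig(t))).sum() \
           - 0.5 * t**2 - 0.5 * np.log(2 * np.pi)

t_map = minimize_scalar(lambda t: -log_joint(t)).x
# A = negative 2nd derivative of the log joint at the mode:
# Bernoulli curvature N*sig*(1-sig) plus prior precision 1
A = len(y) * sig(t_map) * (1 - sig(t_map)) + 1.0
laplace = log_joint(t_map) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A)

exact = np.log(quad(lambda t: np.exp(log_joint(t)), -10, 10)[0])
print(laplace, exact)   # the two log evidences should agree closely
```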
The fully Bayesian predictive distribution (dependence on $x$ and $\boldsymbol{x}$ not shown, to simplify the notation) is
$$p(t\mid\boldsymbol{t}) = \iiint p(t\mid\boldsymbol{w},\beta)\, p(\boldsymbol{w}\mid\boldsymbol{t},\alpha,\beta)\, p(\alpha,\beta\mid\boldsymbol{t})\, d\boldsymbol{w}\, d\alpha\, d\beta,$$
where $p(t\mid\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big)$ and $p(\boldsymbol{w}\mid\boldsymbol{t},\alpha,\beta) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N)$, with
$$\boldsymbol{m}_N = \beta\,\boldsymbol{S}_N\boldsymbol{\Phi}^T\boldsymbol{t}, \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$
$$p(t\mid\boldsymbol{t}) \simeq p(t\mid\boldsymbol{t},\hat{\alpha},\hat{\beta}) = \int p(t\mid\boldsymbol{w},\hat{\beta})\, p(\boldsymbol{w}\mid\boldsymbol{t},\hat{\alpha},\hat{\beta})\, d\boldsymbol{w},$$
where $(\hat{\alpha},\hat{\beta})$ is the mode of $p(\alpha,\beta\mid\boldsymbol{t})$, which is assumed to be sharply peaked.
If the prior over $(\alpha,\beta)$ is relatively flat, then in the evidence framework the values $\hat{\alpha},\hat{\beta}$ are obtained by maximizing the marginal likelihood function $p(\boldsymbol{t}\mid\alpha,\beta)$.
$$p(\boldsymbol{t}\mid\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int \exp\{-E(\boldsymbol{w})\}\, d\boldsymbol{w},$$
where $M$ is the dimensionality of $\boldsymbol{w}$, and we have defined
$$E(\boldsymbol{w}) = \beta E_D(\boldsymbol{w}) + \alpha E_W(\boldsymbol{w}) = \frac{\beta}{2}\|\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{w}\|^2 + \frac{\alpha}{2}\boldsymbol{w}^T\boldsymbol{w}$$
Completing the square in $\boldsymbol{w}$,
$$E(\boldsymbol{w}) = E(\boldsymbol{m}_N) + \frac{1}{2}(\boldsymbol{w}-\boldsymbol{m}_N)^T\boldsymbol{A}\,(\boldsymbol{w}-\boldsymbol{m}_N)$$
We have introduced here:
$$\boldsymbol{A} = \alpha\boldsymbol{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad E(\boldsymbol{m}_N) = \frac{\beta}{2}\|\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{m}_N\|^2 + \frac{\alpha}{2}\boldsymbol{m}_N^T\boldsymbol{m}_N, \qquad \boldsymbol{m}_N = \beta\,\boldsymbol{A}^{-1}\boldsymbol{\Phi}^T\boldsymbol{t}$$
Note that the Hessian matrix $\boldsymbol{A}$ corresponds to the matrix of second derivatives of the error function: $\boldsymbol{A} = \nabla\nabla E(\boldsymbol{w})$.
The Evidence Approximation
The integral over 𝒘 can now be evaluated simply by appealing to the standard result
for the normalization coefficient of a multivariate Gaussian, giving
$$\int \exp\{-E(\boldsymbol{w})\}\, d\boldsymbol{w} = \exp\{-E(\boldsymbol{m}_N)\}\int \exp\!\left\{-\frac{1}{2}(\boldsymbol{w}-\boldsymbol{m}_N)^T\boldsymbol{A}\,(\boldsymbol{w}-\boldsymbol{m}_N)\right\} d\boldsymbol{w} = \exp\{-E(\boldsymbol{m}_N)\}\,(2\pi)^{M/2}\,|\boldsymbol{A}|^{-1/2}$$
We can then write the log of the marginal likelihood in the form
$$p(\boldsymbol{t}\mid\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2} e^{-E(\boldsymbol{m}_N)}\,(2\pi)^{M/2}\,|\boldsymbol{A}|^{-1/2}$$
$$\ln p(\boldsymbol{t}\mid\alpha,\beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\boldsymbol{m}_N) - \frac{1}{2}\ln|\boldsymbol{A}| - \frac{N}{2}\ln 2\pi$$
M : number of parameters in the model
N : size of training dataset
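A short Python sketch (ours; the sinusoidal toy data and the fixed values of $\alpha,\beta$ are assumptions) that evaluates this closed-form log evidence over polynomial orders, in the spirit of the evidence-versus-$M$ figure below:

```python
# Closed-form log evidence ln p(t|alpha,beta) for Bayesian linear regression,
# scanned over the polynomial order of the design matrix.
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # toy data

for order in range(10):                               # polynomial order 0..9
    Phi = np.vander(x, order + 1, increasing=True)    # columns 1, x, ..., x^order
    print(order, log_evidence(Phi, t, alpha=5e-3, beta=25.0))
```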
[Figure: model evidence $\ln p(\boldsymbol{t}\mid\alpha,\beta)$ plotted versus the polynomial order $M$ ($0$ to $9$)]
To maximize
$$\ln p(\boldsymbol{t}\mid\alpha,\beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\boldsymbol{m}_N) - \frac{1}{2}\ln|\boldsymbol{A}| - \frac{N}{2}\ln 2\pi$$
with respect to $\alpha$, define the eigenvalue equation $\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$, so that $\boldsymbol{A}$ has eigenvalues $\alpha+\lambda_i$. Now consider the derivative of the term involving $\ln|\boldsymbol{A}|$ with respect to $\alpha$:
$$\frac{d}{d\alpha}\ln|\boldsymbol{A}| = \frac{d}{d\alpha}\ln\prod_i(\lambda_i+\alpha) = \frac{d}{d\alpha}\sum_i\ln(\lambda_i+\alpha) = \sum_i\frac{1}{\lambda_i+\alpha}$$
The stationary points of $\ln p(\boldsymbol{t}\mid\alpha,\beta)$ with respect to $\alpha$ therefore satisfy
$$0 = \frac{M}{2\alpha} - \frac{1}{2}\boldsymbol{m}_N^T\boldsymbol{m}_N - \frac{1}{2}\sum_i\frac{1}{\lambda_i+\alpha}$$
Multiplying through by $2\alpha$ and rearranging, we obtain
$$\alpha\,\boldsymbol{m}_N^T\boldsymbol{m}_N = M - \alpha\sum_i\frac{1}{\lambda_i+\alpha} = \sum_i\frac{\lambda_i}{\lambda_i+\alpha} \equiv \gamma, \qquad \alpha = \frac{\gamma}{\boldsymbol{m}_N^T\boldsymbol{m}_N}$$
This is an implicit solution for $\alpha$, since both $\gamma$ and $\boldsymbol{m}_N$ depend on $\alpha$. It suggests the iteration:
1. Choose an initial $\alpha$.
2. Calculate $\boldsymbol{m}_N = \beta\,\boldsymbol{A}^{-1}\boldsymbol{\Phi}^T\boldsymbol{t}$ with $\boldsymbol{A} = \alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$, and $\gamma = \sum_i\dfrac{\lambda_i}{\lambda_i+\alpha}$ from $\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$.
3. Re-estimate $\alpha = \gamma/\big(\boldsymbol{m}_N^T\boldsymbol{m}_N\big)$, and repeat until convergence.
Consider now the derivative of $\ln p(\boldsymbol{t}\mid\alpha,\beta)$ with respect to $\beta$. The eigenvalues $\lambda_i$ of $\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ are proportional to $\beta$, so $d\lambda_i/d\beta = \lambda_i/\beta$, and therefore
$$\frac{d}{d\beta}\ln|\boldsymbol{A}| = \frac{d}{d\beta}\sum_i\ln(\lambda_i+\alpha) = \frac{1}{\beta}\sum_i\frac{\lambda_i}{\lambda_i+\alpha} = \frac{\gamma}{\beta}$$
Setting the derivative with respect to $\beta$ equal to zero, the stationary point of the marginal likelihood therefore satisfies
$$0 = \frac{N}{2\beta} - \frac{1}{2}\sum_{n=1}^N\big\{t_n - \boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2 - \frac{\gamma}{2\beta}$$
Rearranging gives the re-estimation formula
$$\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^N\big\{t_n - \boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2$$
Again this is an implicit solution: choose an initial $\beta$, calculate $\boldsymbol{m}_N$ and $\gamma$, re-estimate $\beta$, and repeat. If both $\alpha$ and $\beta$ are being learned from the data, the two updates are alternated.
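A compact Python sketch (ours; the synthetic data and all names are assumptions) of the alternating fixed-point scheme summarized above, re-estimating both $\alpha$ and $\beta$:

```python
# Alternate: m_N, gamma, then alpha = gamma / (m_N^T m_N) and
# 1/beta = sum_n (t_n - m_N^T phi_n)^2 / (N - gamma).
import numpy as np

def evidence_fixed_point(Phi, t, alpha=1.0, beta=1.0, iters=100):
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)       # eigenvalues of Phi^T Phi
    for _ in range(iters):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        lam = beta * eig0                        # eigenvalues of beta Phi^T Phi
        gamma = np.sum(lam / (lam + alpha))      # effective # of parameters
        alpha = gamma / (m_N @ m_N)              # re-estimate alpha
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)   # re-estimate beta
    return alpha, beta, gamma, m_N

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)    # noise std 0.2
Phi = np.vander(x, 9, increasing=True)                    # M = 9 features
alpha, beta, gamma, m_N = evidence_fixed_point(Phi, t)
print(f"alpha={alpha:.4f}  beta={beta:.1f}  gamma={gamma:.2f} of M={Phi.shape[1]}")
# beta should land near 1/0.2^2 = 25 for this toy data set.
```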
[Figures: log evidence plotted versus $\log\lambda$ and versus $\log\alpha$ (hyperparameter axes $\log\beta$, $\log\alpha$); produced by running linregPolyVsRegDemo from PMTK3]
The key advantage of the evidence procedure over cross-validation (CV) is that it allows a different $\alpha_j$ to be used for every feature.
The least-squares (MLE) prediction on the training set is $\hat{\boldsymbol{y}} = \sum_{j=1}^M \boldsymbol{u}_j\boldsymbol{u}_j^T\boldsymbol{t}$, while the ridge predictions are
$$\hat{\boldsymbol{y}} = \boldsymbol{\Phi}\boldsymbol{m}_N = \boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T\boldsymbol{V}\big(\lambda\boldsymbol{I}+\boldsymbol{S}^2\big)^{-1}\boldsymbol{S}\boldsymbol{U}^T\boldsymbol{t} = \sum_{j=1}^M \boldsymbol{u}_j\,\frac{\sigma_j^2}{\sigma_j^2+\lambda}\,\boldsymbol{u}_j^T\boldsymbol{t},$$
where the $\sigma_j$ are the singular values of $\boldsymbol{\Phi}$. Note that the $\sigma_j^2$ are also the eigenvalues of $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$. We follow the notation from Murphy; use $\sigma_j^2 \leftarrow \lambda_j/\beta$ to recover the standard notation from Bishop.
Note that directions of small $\sigma_j$ do not contribute to the ridge estimate. Directions of small $\sigma_j$ correspond to directions of high posterior variance; for a uniform prior we have seen that $\mathrm{cov}[\boldsymbol{w}\mid\mathcal{D}] = \beta^{-1}\big(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}$. It is in these directions that the ridge estimate shrinks the most.
We can now define the effective number of degrees of freedom as
$$\mathrm{dof}(\lambda) = \sum_{j=1}^M \frac{\sigma_j^2}{\sigma_j^2+\lambda}$$
As $\lambda\to 0$, $\mathrm{dof}(\lambda)\to M$, and as $\lambda\to\infty$, $\mathrm{dof}(\lambda)\to 0$.
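A small Python sketch (ours; random toy matrices) verifying both SVD identities above, the shrunken ridge predictions and $\mathrm{dof}(\lambda)$:

```python
# Ridge predictions shrink each left-singular direction u_j by
# sigma_j^2 / (sigma_j^2 + lambda); dof(lambda) sums these shrinkage factors.
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(50, 8))         # toy design matrix, M = 8
t = rng.normal(size=50)
lam = 2.0

U, S, Vt = np.linalg.svd(Phi, full_matrices=False)
shrink = S**2 / (S**2 + lam)

y_svd = U @ (shrink * (U.T @ t))       # sum_j u_j sigma_j^2/(sigma_j^2+lam) u_j^T t
m_N = np.linalg.solve(lam * np.eye(8) + Phi.T @ Phi, Phi.T @ t)
y_direct = Phi @ m_N                   # ridge prediction Phi m_N
print(np.allclose(y_svd, y_direct))    # True: the two forms agree

print("dof(lam) =", shrink.sum(), "of M =", Phi.shape[1])
print("dof as lam -> 0:", (S**2 / (S**2 + 1e-12)).sum())   # approaches M
```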
Effective Number of Regression Parameters
Write the posterior mean in terms of the MLE solution:
$$\boldsymbol{m}_N = \beta\big(\alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^T\boldsymbol{t} = \big(\alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\,\boldsymbol{w}_{MLE}$$
Consider the contours of the likelihood, $\frac{\beta}{2}\|\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{w}\|^2 = \text{const}$, and of the prior, $\boldsymbol{w}^T\boldsymbol{w} = \text{const}$, in which the axes in parameter space have been rotated to align with the eigenvectors $\boldsymbol{u}_i$ of $\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$. Then
$$\boldsymbol{m}_N = \sum_i\frac{\lambda_i}{\lambda_i+\alpha}\,\boldsymbol{u}_i\boldsymbol{u}_i^T\boldsymbol{w}_{MLE} = \sum_i\frac{\lambda_i}{\lambda_i+\alpha}\,w_{i,MLE}\,\boldsymbol{u}_i$$
For $\alpha = 0$, the mode of the posterior is given by the MLE solution $\boldsymbol{w}_{ML}$, whereas for nonzero $\alpha$ the mode is at $\boldsymbol{w}_{MAP} = \boldsymbol{m}_N$.
In the direction $w_1$, $\lambda_1$ is small compared with $\alpha$, so $\lambda_1/(\lambda_1+\alpha)$ is close to zero, and the corresponding MAP value $w_1^{MAP} = \frac{\lambda_1}{\lambda_1+\alpha}\,w_1^{MLE}$ is also close to zero.
In the direction $w_2$, $\lambda_2$ is large compared with $\alpha$, so $\lambda_2/(\lambda_2+\alpha)$ is close to unity, and the MAP value of $w_2$ is close to its MLE value.
Since $0 \le \dfrac{\lambda_i}{\lambda_i+\alpha} \le 1$, the effective number of well-determined parameters $\gamma = \sum_i\dfrac{\lambda_i}{\lambda_i+\alpha}$ satisfies $0 \le \gamma \le M$.
[Figure: contours of the likelihood and the prior in rotated parameter coordinates $(w_1, w_2)$, showing the MLE and MAP modes]
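A brief Python check (ours; random toy matrices) of the eigen-direction form of the posterior mean used above:

```python
# Verify m_N = sum_i (lam_i / (lam_i + alpha)) (u_i^T w_MLE) u_i,
# with (beta Phi^T Phi) u_i = lam_i u_i.
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.normal(size=(40, 5))
t = rng.normal(size=40)
alpha, beta = 0.5, 25.0

H = beta * Phi.T @ Phi
lam, U = np.linalg.eigh(H)                    # columns of U are the u_i
w_mle = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

m_direct = beta * np.linalg.solve(alpha * np.eye(5) + H, Phi.T @ t)
m_eigen = U @ ((lam / (lam + alpha)) * (U.T @ w_mle))
print(np.allclose(m_direct, m_eigen))         # True: each direction is shrunk
```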
Effective Number of Regression Parameters
In directions $w_i$ for which $\lambda_i \ll \alpha$, the ratio $\lambda_i/(\lambda_i+\alpha)$ is close to zero, and the corresponding components of
$$\boldsymbol{m}_N = \big(\alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\,\boldsymbol{w}_{MLE}$$
are shrunk toward zero; such parameters are not well determined by the data.
Compare the evidence-framework estimate of the noise variance with its ML counterpart:
$$\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^N\big\{t_n-\boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2, \qquad \frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^N\big\{t_n-\boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2$$
These formulas express the variance as an average of the squared differences between the targets and the model predictions, but the factor $N-\gamma$ corrects for the $\gamma$ well-determined parameters fitted to the data. This is analogous to estimating the variance of a Gaussian: one degree of freedom is used up in fitting the mean, and the corrected (MAP) variance estimate, with $N-1$ in place of $N$, accounts for that.
[Figures (MATLAB code and data): plots of $\boldsymbol{m}_N^T\boldsymbol{m}_N$ and of the individual parameters $w_i$ (curves labeled 1 to 10) versus the effective number of parameters $\gamma$]
For the simulation, $\alpha$ is varied over the range $0 \le \alpha \le \infty$, causing $\gamma$ to vary in the range $0 \le \gamma \le M$.
For $N \gg M$, all of the parameters are well determined and $\gamma \simeq M$, so the re-estimate for the noise variance reduces to
$$\frac{1}{\beta} = \frac{1}{N-M}\sum_{n=1}^N\big\{t_n-\boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2$$