
The Evidence Approximation for Regression Models
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 30, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
• Evidence approximation, Bayesian model comparison, Model averaging, Model evidence, Occam’s Razor, Model complexity, Approximating the model evidence, Optimal model complexity, Laplace approximation, BIC criterion

• The evidence approximation for our regression example, Empirical Bayes for linear regression, Effective number of regression parameters

Following closely:

• Chris Bishop’s PRML book, Chapter 3
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 7
• Regression using parametric discriminative models in pmtk3 (run TutRegr.m in Pmtk3)



Goals
• The goals for today’s lecture include the following:

  • Understand how to apply the evidence methodology to regression problems

  • Learn how Empirical Bayes controls model complexity and the effective number of parameters



The Evidence Approximation
• The evidence procedure can be used to perform feature selection.

• The evidence procedure is also useful when comparing different kinds of models:

$$p(\mathcal{D}\mid m) \;=\; \int\!\!\int p(\mathcal{D}\mid \mathbf{w}, m)\, p(\mathbf{w}\mid m, \boldsymbol{\eta})\, p(\boldsymbol{\eta}\mid m)\, d\mathbf{w}\, d\boldsymbol{\eta} \;\approx\; \max_{\boldsymbol{\eta}} \int p(\mathcal{D}\mid \mathbf{w}, m)\, p(\mathbf{w}\mid m, \boldsymbol{\eta})\, p(\boldsymbol{\eta}\mid m)\, d\mathbf{w}$$

• It is important to (at least approximately) integrate over 𝜼 rather than setting it arbitrarily.

• Variational Bayes models our uncertainty in 𝜼 rather than computing point estimates.



Bayesian Model Comparison
• The Bayesian view of model comparison involves the use of probabilities to represent uncertainty in the choice of model.

• Suppose we wish to compare 𝐿 models {ℳ𝑖}, 𝑖 = 1, . . . , 𝐿. Here a model refers to a probability distribution over the observed data 𝒟. Our uncertainty is expressed through a prior 𝑝(ℳ𝑖). Given a training set 𝒟, the posterior distribution is:

$$\underbrace{p(\mathcal{M}_i \mid \mathcal{D})}_{\text{posterior}} \;\propto\; \underbrace{p(\mathcal{M}_i)}_{\text{prior}}\;\underbrace{p(\mathcal{D}\mid\mathcal{M}_i)}_{\text{model evidence or marginal likelihood}}$$

• We have already defined the Bayes factor as the ratio of two model evidences,

$$\frac{p(\mathcal{D}\mid\mathcal{M}_i)}{p(\mathcal{D}\mid\mathcal{M}_j)}$$



Model Averaging and Model Selection
• Once we know the posterior distribution over models, the predictive distribution is given by

$$p(t\mid\mathbf{x},\mathcal{D}) \;=\; \sum_{i=1}^{L} p(t\mid\mathbf{x},\mathcal{M}_i,\mathcal{D})\, p(\mathcal{M}_i\mid\mathcal{D})$$

where 𝐱 is the test input and 𝒟 the training data.

• This has the form of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions $p(t\mid\mathbf{x},\mathcal{M}_i,\mathcal{D})$ of the individual models, weighted by the posterior probabilities $p(\mathcal{M}_i\mid\mathcal{D})$ of those models.

• A simple approximation to model averaging is to use the single most probable model alone to make predictions. This is known as model selection.
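The distinction between averaging and selection is easy to see in a small numerical sketch (the evidence values and per-model predictive densities below are made-up placeholders, not from the lecture; they only illustrate the mechanics):

```python
import numpy as np

# Hypothetical log evidences ln p(D|M_i) for three models and a uniform model prior
log_evidence = np.array([-52.3, -48.1, -49.0])   # assumed numbers, illustration only
log_prior = np.log(np.full(3, 1/3))

# Posterior over models: p(M_i|D) proportional to p(M_i) p(D|M_i), normalized in log space
log_post = log_prior + log_evidence
log_post -= np.logaddexp.reduce(log_post)
post = np.exp(log_post)

# Assumed per-model predictive densities p(t|x, M_i, D) at some test point
pred = np.array([0.41, 0.55, 0.52])

p_bma = np.sum(post * pred)           # model averaging: mixture weighted by p(M_i|D)
p_selected = pred[np.argmax(post)]    # model selection: keep the single most probable model

print(post, p_bma, p_selected)
```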



Model Evidence
• For a model governed by a set of parameters 𝒘, the model evidence is given, from the sum and product rules of probability, by

$$p(\mathcal{D}\mid\mathcal{M}_i) \;=\; \int p(\mathcal{D}\mid\mathbf{w}_i,\mathcal{M}_i)\, p(\mathbf{w}_i\mid\mathcal{M}_i)\, d\mathbf{w}_i$$

• From a sampling perspective, the marginal likelihood can be viewed as the probability of generating the data set 𝒟 from a model whose parameters are sampled at random from the prior.

• We can also see the model evidence as the normalizing factor in Bayes’ theorem for the parameters:

$$p(\mathbf{w}_i\mid\mathcal{D},\mathcal{M}_i) \;=\; \frac{p(\mathcal{D}\mid\mathbf{w}_i,\mathcal{M}_i)\, p(\mathbf{w}_i\mid\mathcal{M}_i)}{p(\mathcal{D}\mid\mathcal{M}_i)}$$
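The sampling interpretation can be made literal with a simple Monte Carlo estimate, p(𝒟) ≈ (1/S) Σₛ p(𝒟|𝒘⁽ˢ⁾) with 𝒘⁽ˢ⁾ drawn from the prior. A minimal sketch for a linear-Gaussian regression model (the data, α, and β are simulated here purely for illustration; the closed-form expression used for comparison is the one derived later in this lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and design matrix (assumed here, purely for illustration)
N, M = 20, 3
Phi = rng.normal(size=(N, M))
beta = 25.0                                    # noise precision
t = Phi @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=beta**-0.5, size=N)
alpha = 2.0                                    # prior precision, p(w) = N(w | 0, alpha^{-1} I)

# Monte Carlo: p(D) = E_{w ~ prior}[ p(D|w) ]  ~  (1/S) sum_s p(D | w_s)
S = 200_000
w = rng.normal(scale=alpha**-0.5, size=(S, M))             # samples from the prior
resid = t[None, :] - w @ Phi.T                             # (S, N) residuals
log_lik = 0.5*N*np.log(beta/(2*np.pi)) - 0.5*beta*np.sum(resid**2, axis=1)
log_evidence_mc = np.logaddexp.reduce(log_lik) - np.log(S)

# Closed form ln p(t | alpha, beta) for this linear-Gaussian model (derived later)
A = alpha*np.eye(M) + beta*Phi.T @ Phi
mN = beta * np.linalg.solve(A, Phi.T @ t)
E_mN = 0.5*beta*np.sum((t - Phi @ mN)**2) + 0.5*alpha*mN @ mN
log_evidence = (0.5*M*np.log(alpha) + 0.5*N*np.log(beta) - E_mN
                - 0.5*np.linalg.slogdet(A)[1] - 0.5*N*np.log(2*np.pi))

print(log_evidence_mc, log_evidence)    # the two estimates should roughly agree
```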



Occam’s Razor and Model Selection
• Compare model classes ℳ𝑖 using their posterior probability given the data 𝒟:

$$p(\mathcal{M}_i\mid\mathcal{D}) \;\propto\; p(\mathcal{M}_i)\, p(\mathcal{D}\mid\mathcal{M}_i), \qquad p(\mathcal{D}\mid\mathcal{M}_i) = \int p(\mathcal{D}\mid\mathbf{w}_i,\mathcal{M}_i)\, p(\mathbf{w}_i\mid\mathcal{M}_i)\, d\mathbf{w}_i$$

• The marginal likelihood (Bayesian evidence) 𝑝(𝒟|ℳ𝑖) is viewed as the probability that randomly selected parameter values from the prior of that model class would generate the data set 𝒟.

• Simple model classes are unlikely to generate 𝒟. Too complex model classes can generate many data sets, so it is unlikely that they generate the particular data set 𝒟.

[Figure: schematic of 𝑝(𝒟|ℳ𝑖) against data sets 𝒟 ordered in complexity, for model classes of increasing complexity.]

• Bayesian inference automatically implements Occam’s razor principle: prefer simple explanations over complex ones.



Approximating the Model Evidence
• For a given model (we omit the conditioning on ℳ𝑖) with a single parameter 𝑤, consider the approximation (using Bayes’ rule, evaluated at 𝑤𝑀𝐴𝑃):

$$p(\mathcal{D}) \;=\; \int p(\mathcal{D}\mid w)\, p(w)\, dw \;=\; \left.\frac{p(\mathcal{D}\mid w)\, p(w)}{p(w\mid\mathcal{D})}\right|_{w_{MAP}} \;\simeq\; p(\mathcal{D}\mid w_{MAP})\,\frac{\Delta w_{posterior}}{\Delta w_{prior}}$$

where the posterior is assumed to be sharply peaked around 𝑤𝑀𝐴𝑃 (with width Δ𝑤𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟) and the prior is taken flat with width Δ𝑤𝑝𝑟𝑖𝑜𝑟.*

• Taking logs we obtain

$$\ln p(\mathcal{D}) \;\simeq\; \ln p(\mathcal{D}\mid w_{MAP}) + \underbrace{\ln\frac{\Delta w_{posterior}}{\Delta w_{prior}}}_{\text{negative}}$$

Note: the evidence is not defined if the prior is improper.
Negative
* This is certainly not as accurate as the Laplace approximation.
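A quick numerical sanity check of this width-ratio approximation (everything here is assumed: n Gaussian observations with unknown mean w, known σ, a flat prior on [−a, a], and Δw_posterior taken to be the effective width √(2π)·σ/√n of the peaked likelihood):

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(1)

# Assumed toy setup: n observations from N(w, sigma^2) with unknown mean w,
# flat prior on [-a, a], so p(w) = 1 / Delta_w_prior with Delta_w_prior = 2a.
n, sigma, a = 25, 1.0, 10.0
x = rng.normal(loc=0.3, scale=sigma, size=n)
w_map = x.mean()                                   # MAP = MLE under the flat prior

def log_lik(w):
    return np.sum(stats.norm.logpdf(x, loc=w, scale=sigma))

# "Exact" evidence p(D) = integral of p(D|w) p(w) dw, by quadrature
evidence, _ = integrate.quad(lambda w: np.exp(log_lik(w)) / (2*a), -a, a, points=[w_map])

# Width-ratio approximation: p(D) ~ p(D|w_MAP) * dw_post / dw_prior
dw_post = np.sqrt(2*np.pi) * sigma / np.sqrt(n)    # effective width of the peaked posterior
approx = np.exp(log_lik(w_map)) * dw_post / (2*a)

print(np.log(evidence), np.log(approx))            # the two log evidences nearly coincide
```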
Optimal Model Complexity
• For a model with 𝑀 parameters, we can make a similar approximation for each parameter in turn. Assuming that all parameters have the same ratio of Δ𝑤𝑝𝑜𝑠𝑡𝑒𝑟𝑖𝑜𝑟 / Δ𝑤𝑝𝑟𝑖𝑜𝑟,

$$\ln p(\mathcal{D}) \;\simeq\; \ln p(\mathcal{D}\mid \mathbf{w}_{MAP}) + M\,\ln\frac{\Delta w_{posterior}}{\Delta w_{prior}}$$

• The size of the complexity penalty increases linearly with 𝑀. As we increase the complexity of the model,
  • the 1st term increases, because a more complex model is better able to fit the data,
  • whereas the 2nd term decreases due to the dependence on 𝑀.

• The optimal model complexity determined by maximum evidence will be given by a trade-off between these two terms.



Optimal Model Complexity
• The marginal likelihood favors models of intermediate complexity.

• Let us think of the regression model and consider the models ℳ1, ℳ2 and ℳ3 representing linear, quadratic and cubic fitting.

• The data 𝒟 are ordered in complexity: for a given model, we choose 𝒘 from the prior 𝑝(𝒘), then sample the data from 𝑝(𝒟|𝒘).

[Figure: 𝑝(𝒟|ℳ1), 𝑝(𝒟|ℳ2), 𝑝(𝒟|ℳ3) plotted against data sets 𝒟 ordered in complexity.]
Optimal Model Complexity
• A 1st order polynomial has little variability and generates data sets that are similar to each other, so its 𝑝(𝒟) is confined to a small region of the 𝒟 axis.

• A 9th order polynomial generates a variety of different data sets, and so its 𝑝(𝒟) is spread over a large region of the 𝒟 axis.

• Because the 𝑝(𝒟|ℳ𝑖) are normalized, a particular 𝒟0 can have the highest evidence for the model of intermediate complexity.

• The simpler model cannot fit the data well, whereas the more complex model spreads its predictive probability over too broad a range of data sets.
Optimal Model Comparison
• A Bayesian model comparison in an average (over the data 𝒟) sense will favor the correct model.

• Let ℳ1 be the correct model and ℳ2 another model. We can show that the evidence for model ℳ1 is higher on average. Using the definition and properties of the Kullback-Leibler divergence,

$$\mathrm{KL}\big(p(\mathcal{D}\mid\mathcal{M}_1)\,\|\,p(\mathcal{D}\mid\mathcal{M}_2)\big) \;=\; \int p(\mathcal{D}\mid\mathcal{M}_1)\,\ln\frac{p(\mathcal{D}\mid\mathcal{M}_1)}{p(\mathcal{D}\mid\mathcal{M}_2)}\, d\mathcal{D} \;\geq\; 0,$$

i.e. the log Bayes factor, averaged with respect to the exact distribution 𝑝(𝒟|ℳ1), is non-negative.

• This analysis assumes that the true distribution from which the data are generated is contained in our class of models.



Laplace Approximation
• As we have seen earlier, the Laplace approximation provides a Gaussian approximation of the parameter posterior about the maximum a posteriori (MAP) parameter estimate. Denote our parameters here by 𝜽, of dimension 𝑑.

• Consider a data set 𝒟 and 𝑀 models ℳ𝑖, 𝑖 = 1, . . . , 𝑀 with corresponding parameters 𝜽𝑖, 𝑖 = 1, . . . , 𝑀. We compare models using the posteriors:

$$p(\mathcal{M}\mid\mathcal{D}) \;\propto\; p(\mathcal{M})\, p(\mathcal{D}\mid\mathcal{M})$$

• For large data sets 𝒟 (relative to the number of model parameters), the parameter posterior is approximately Gaussian around $\boldsymbol{\theta}_m^{MAP}$ (it can be obtained from a 2nd order Taylor expansion of the log-posterior):

$$p(\boldsymbol{\theta}_m\mid\mathcal{D},\mathcal{M}_m) \;\simeq\; (2\pi)^{-d/2}\,|\mathbf{A}|^{1/2}\exp\!\Big(-\tfrac{1}{2}\big(\boldsymbol{\theta}_m-\boldsymbol{\theta}_m^{MAP}\big)^T\mathbf{A}\,\big(\boldsymbol{\theta}_m-\boldsymbol{\theta}_m^{MAP}\big)\Big), \qquad A_{ij} = -\left.\frac{\partial^2\log p(\boldsymbol{\theta}_m\mid\mathcal{D},\mathcal{M}_m)}{\partial\theta_{m,i}\,\partial\theta_{m,j}}\right|_{\boldsymbol{\theta}_m^{MAP}}$$



Laplace Approximation
• We can write the model evidence as

$$p(\mathcal{D}\mid\mathcal{M}_m) \;=\; \frac{p(\boldsymbol{\theta}_m,\mathcal{D}\mid\mathcal{M}_m)}{p(\boldsymbol{\theta}_m\mid\mathcal{D},\mathcal{M}_m)}$$

• Using the Laplace approximation for the posterior of the parameters and evaluating the equation above at $\boldsymbol{\theta}_m^{MAP}$:

$$\log p(\mathcal{D}\mid\mathcal{M}_m) \;\simeq\; \log p(\boldsymbol{\theta}_m^{MAP},\mathcal{D}\mid\mathcal{M}_m) - \log p(\boldsymbol{\theta}_m^{MAP}\mid\mathcal{D},\mathcal{M}_m)$$
$$\simeq\; \log p(\mathcal{D}\mid\boldsymbol{\theta}_m^{MAP},\mathcal{M}_m) + \log p(\boldsymbol{\theta}_m^{MAP}\mid\mathcal{M}_m) + \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{A}|$$

• This Laplace approximation is used often for model comparison.
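A minimal numerical sketch of this recipe (the model is assumed: a toy logistic regression with a Gaussian prior, where the evidence is genuinely intractable; the MAP is found numerically and the Hessian 𝑨 by finite differences):

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(2)

# Toy non-conjugate model (assumed, purely for illustration): logistic regression with a
# Gaussian prior p(theta) = N(0, alpha^{-1} I), so the evidence has no closed form.
N, d, alpha = 40, 3, 1.0
X = rng.normal(size=(N, d))
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ np.array([1.0, -2.0, 0.5])))).astype(float)

def log_joint(theta):
    """ln p(D|theta, M) + ln p(theta|M): Bernoulli likelihood plus Gaussian prior."""
    z = X @ theta
    loglik = -np.sum(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))
    logprior = 0.5*d*np.log(alpha/(2*np.pi)) - 0.5*alpha*theta @ theta
    return loglik + logprior

# 1) MAP estimate
theta_map = optimize.minimize(lambda th: -log_joint(th), np.zeros(d)).x

# 2) A_ij = - d^2 log p(theta|D,M) / dtheta_i dtheta_j at the MAP (central differences;
#    the evidence is constant in theta, so the Hessian of the log joint suffices)
def neg_hessian(f, th, eps=1e-4):
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i]*eps, np.eye(d)[j]*eps
            H[i, j] = -(f(th + ei + ej) - f(th + ei - ej)
                        - f(th - ei + ej) + f(th - ei - ej)) / (4*eps**2)
    return H

A = neg_hessian(log_joint, theta_map)

# 3) Laplace estimate of the log model evidence
log_evidence = (log_joint(theta_map) + 0.5*d*np.log(2*np.pi)
                - 0.5*np.linalg.slogdet(A)[1])
print(theta_map, log_evidence)
```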

• Other approximations are also very useful:
  • Bayesian Information Criterion (BIC) (in the limit of large 𝑁)
  • MCMC (sampling approach)
  • Variational methods
Bayesian Information Criterion
• We start with the Laplace approximation in the limit of large data sets 𝑁,

$$\log p(\mathcal{D}\mid\mathcal{M}_m) \;\simeq\; \log p(\mathcal{D}\mid\boldsymbol{\theta}_m^{MAP},\mathcal{M}_m) + \log p(\boldsymbol{\theta}_m^{MAP}\mid\mathcal{M}_m) + \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\mathbf{A}|$$

• As 𝑁 grows, 𝑨 grows as 𝑁𝑨0 for some fixed matrix 𝑨0, thus

$$\log|\mathbf{A}| = \log|N\mathbf{A}_0| = \log\big(N^d|\mathbf{A}_0|\big) = d\log N + \log|\mathbf{A}_0| \;\xrightarrow{\;N\to\infty\;}\; d\log N$$

• Then the Laplace approximation is simplified as (the BIC approximation):

$$\log p(\mathcal{D}\mid\mathcal{M}_m) \;\simeq\; \log p(\mathcal{D}\mid\boldsymbol{\theta}_m^{MAP},\mathcal{M}_m) - \frac{d}{2}\log N \qquad (\text{limit } N\to\infty)$$

• Note interesting properties of the (easy to compute) BIC:
  • No dependence on the prior
  • One can use the MLE rather than the MAP estimate of $\boldsymbol{\theta}_m$
  • If not all parameters are well determined from the data, 𝑑 = number of effective parameters.
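A sketch of BIC-based model comparison for the polynomial regression running example (the sinusoidal data are simulated here; the MLE is used in place of the MAP, and d counts the polynomial weights plus the noise variance):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data (assumed): noisy samples of a sinusoid, as in the running example
N = 30
x = np.linspace(0, 1, N)
t = np.sin(2*np.pi*x) + rng.normal(scale=0.3, size=N)

def bic_score(order):
    """BIC ~ ln p(D | theta_MLE) - (d/2) ln N for a polynomial of the given order."""
    Phi = np.vander(x, order + 1, increasing=True)       # design matrix with order+1 weights
    w_mle, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    resid = t - Phi @ w_mle
    sigma2 = np.mean(resid**2)                            # MLE of the noise variance
    loglik = -0.5*N*np.log(2*np.pi*sigma2) - 0.5*N        # Gaussian log likelihood at the MLE
    d = order + 2                                         # weights + noise variance
    return loglik - 0.5*d*np.log(N)

for order in range(10):
    print(order, round(bic_score(order), 2))              # intermediate orders typically score best
```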
The Evidence Approximation
• Let us return to our regression problem. The fully Bayesian predictive distribution is given by

$$p(t\mid\mathbf{t}) \;=\; \int\!\!\int\!\!\int p(t\mid\mathbf{w},\beta)\, p(\mathbf{w}\mid\mathbf{t},\alpha,\beta)\, p(\alpha,\beta\mid\mathbf{t})\, d\mathbf{w}\, d\alpha\, d\beta$$

where the hyperparameters 𝛼 and 𝛽 are now random variables, the dependence on 𝑥 and 𝐱 is not shown to simplify the notation, and

$$p(t\mid\mathbf{w},\beta) = \mathcal{N}\big(t\mid y(\mathbf{x},\mathbf{w}),\beta^{-1}\big), \qquad p(\mathbf{w}\mid\mathbf{t},\alpha,\beta) = \mathcal{N}(\mathbf{w}\mid\mathbf{m}_N,\mathbf{S}_N), \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi},$$

but this integral is intractable. Approximate it with

$$p(t\mid\mathbf{t}) \;\simeq\; p(t\mid\mathbf{t},\hat{\alpha},\hat{\beta}) \;=\; \int p(t\mid\mathbf{w},\hat{\beta})\, p(\mathbf{w}\mid\mathbf{t},\hat{\alpha},\hat{\beta})\, d\mathbf{w}$$

where $(\hat{\alpha},\hat{\beta})$ is the mode of $p(\alpha,\beta\mid\mathbf{t})$, which is assumed to be sharply peaked.

• This approach is a.k.a. empirical Bayes, type II or generalized maximum likelihood, or the evidence approximation.
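A sketch of the resulting predictive distribution for fixed point estimates (α̂, β̂). The toy sinusoidal data and the Gaussian basis are assumed, and the values α̂ = 5×10⁻³, β̂ = 11.1 are borrowed from the slides’ example; the predictive mean and variance are the standard Gaussian results m_Nᵀφ(x*) and 1/β̂ + φ(x*)ᵀS_Nφ(x*) from the earlier linear-regression lecture:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data and Gaussian basis (assumed, for illustration)
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2*np.pi*x) + rng.normal(scale=0.3, size=N)
centers = np.linspace(0, 1, 9)

def design(x):
    # bias + 9 Gaussian basis functions of width 0.1
    return np.column_stack([np.ones_like(x)] +
                           [np.exp(-(x - c)**2 / (2*0.1**2)) for c in centers])

alpha_hat, beta_hat = 5e-3, 11.1                 # assumed point estimates of the hyperparameters

Phi = design(x)
S_N_inv = alpha_hat*np.eye(Phi.shape[1]) + beta_hat*Phi.T @ Phi   # S_N^{-1} = a I + b Phi^T Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta_hat * S_N @ Phi.T @ t                                   # m_N = b S_N Phi^T t

# Predictive p(t*|x*, t, a, b) = N(t* | m_N^T phi(x*), 1/b + phi(x*)^T S_N phi(x*))
x_star = np.linspace(0, 1, 5)
phi_star = design(x_star)
pred_mean = phi_star @ m_N
pred_var = 1.0/beta_hat + np.einsum('ij,jk,ik->i', phi_star, S_N, phi_star)
print(np.c_[x_star, pred_mean, np.sqrt(pred_var)])
```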



The Evidence Approximation
• From Bayes’ theorem, the posterior distribution for 𝛼 and 𝛽 is given by

$$p(\alpha,\beta\mid\mathbf{t}) \;\propto\; p(\mathbf{t}\mid\alpha,\beta)\, p(\alpha,\beta)$$

• If the prior is relatively flat, then in the evidence framework the values $\hat{\alpha},\hat{\beta}$ are obtained by maximizing the marginal likelihood function 𝑝(𝒕|𝛼, 𝛽).

• The marginal likelihood (evidence) 𝑝(𝒕|𝛼, 𝛽) is obtained by integrating over the parameters 𝒘, so that

$$p(\mathbf{t}\mid\alpha,\beta) \;=\; \int p(\mathbf{t}\mid\mathbf{w},\beta)\, p(\mathbf{w}\mid\alpha)\, d\mathbf{w} \;=\; \int \frac{\beta^{N/2}}{(2\pi)^{N/2}}\, e^{-\beta E_D(\mathbf{w})}\;\mathcal{N}\big(\mathbf{w}\mid\mathbf{0},\tfrac{1}{\alpha}\mathbf{I}_{M\times M}\big)\, d\mathbf{w}$$

• One can evaluate this integral using the completion-of-the-square procedure.



The Evidence Approximation
• We can write the evidence function in the form

$$p(\mathbf{t}\mid\alpha,\beta) \;=\; \Big(\frac{\beta}{2\pi}\Big)^{N/2}\Big(\frac{\alpha}{2\pi}\Big)^{M/2}\int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}$$

where 𝑀 is the dimensionality of 𝒘, and we have defined

$$E(\mathbf{w}) \;=\; \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w}) \;=\; \frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{w}\|^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$
$$E(\mathbf{w}) \;=\; E(\mathbf{m}_N) + \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)$$

• We have introduced here:

$$\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N, \qquad \mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

• Note that the Hessian matrix 𝑨 corresponds to the matrix of 2nd derivatives of the error function:

$$\mathbf{A} \;=\; \nabla\nabla E(\mathbf{w})$$
The Evidence Approximation
• The integral over 𝒘 can now be evaluated simply by appealing to the standard result for the normalization coefficient of a multivariate Gaussian, giving

$$\int\exp\{-E(\mathbf{w})\}\,d\mathbf{w} \;=\; \exp\{-E(\mathbf{m}_N)\}\int\exp\Big\{-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)\Big\}\,d\mathbf{w} \;=\; \exp\{-E(\mathbf{m}_N)\}\,(2\pi)^{M/2}\,|\mathbf{A}|^{-1/2}$$

• We can then write the log of the marginal likelihood in the form

$$p(\mathbf{t}\mid\alpha,\beta) \;=\; \Big(\frac{\beta}{2\pi}\Big)^{N/2}\Big(\frac{\alpha}{2\pi}\Big)^{M/2}\int\exp\{-E(\mathbf{w})\}\,d\mathbf{w} \;=\; \Big(\frac{\beta}{2\pi}\Big)^{N/2}\Big(\frac{\alpha}{2\pi}\Big)^{M/2}e^{-E(\mathbf{m}_N)}\,(2\pi)^{M/2}\,|\mathbf{A}|^{-1/2}$$

$$\ln p(\mathbf{t}\mid\alpha,\beta) \;=\; \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$$

where 𝑀 is the number of parameters in the model and 𝑁 the size of the training data set.
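This expression is straightforward to transcribe into code. A sketch (the sinusoidal data are simulated, and α = 5×10⁻³, β = 11.1 mirror the example on the next slide; M = order + 1 for a polynomial model):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy sinusoidal data (assumed), mirroring the example that follows
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2*np.pi*x) + rng.normal(scale=0.3, size=N)
alpha, beta = 5e-3, 11.1

def log_evidence(Phi, t, alpha, beta):
    """ln p(t|a,b) = M/2 ln a + N/2 ln b - E(m_N) - 1/2 ln|A| - N/2 ln(2 pi)."""
    N, M = Phi.shape
    A = alpha*np.eye(M) + beta*Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E_mN = 0.5*beta*np.sum((t - Phi @ m_N)**2) + 0.5*alpha*m_N @ m_N
    return (0.5*M*np.log(alpha) + 0.5*N*np.log(beta) - E_mN
            - 0.5*np.linalg.slogdet(A)[1] - 0.5*N*np.log(2*np.pi))

# Evidence for polynomial models of increasing order (M = order + 1 parameters)
for order in range(10):
    Phi = np.vander(x, order + 1, increasing=True)
    print(order, round(log_evidence(Phi, t, alpha, beta), 2))
```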



The Evidence Approximation
• Example: sinusoidal data, polynomial regression, with 𝛼 = 5×10⁻³ and 𝛽 = 11.1 (MatLab code and data available).

[Figure: log of the model evidence ln 𝑝(𝒕|𝛼, 𝛽) vs. the order 𝑀 of the polynomial.]

[Figure: root-mean-square error evaluated on the training and test data for various 𝑀.]

• From the plot of the model evidence for given 𝛼 and 𝛽, we see that the evidence favors the model with 𝑀 = 5 (4th degree polynomial).

• Looking at the non-Bayesian approach (the RMS training/test errors), one cannot distinguish the performance of polynomials of orders 3 to 8.


Maximizing the Evidence Function
• Let us first consider the maximization of 𝑝(𝒕|𝛼, 𝛽) with respect to 𝛼. This can be done by first defining the eigenvector equation

$$\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\mathbf{u}_i \;=\; \lambda_i\mathbf{u}_i$$

Thus $\mathbf{A} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ has eigenvalues 𝛼 + 𝜆𝑖.

• Now consider the derivative of the term involving ln|𝑨| in

$$\ln p(\mathbf{t}\mid\alpha,\beta) \;=\; \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$$

with respect to 𝛼. We have

$$\frac{d}{d\alpha}\ln|\mathbf{A}| \;=\; \frac{d}{d\alpha}\ln\prod_i(\lambda_i+\alpha) \;=\; \frac{d}{d\alpha}\sum_i\ln(\lambda_i+\alpha) \;=\; \sum_i\frac{1}{\lambda_i+\alpha}$$



Maximizing the Evidence Function
• Recall that

$$\frac{d}{d\alpha}\ln|\mathbf{A}| \;=\; \frac{d}{d\alpha}\sum_i\ln(\lambda_i+\alpha) \;=\; \sum_i\frac{1}{\lambda_i+\alpha}$$

• Thus the stationary points of

$$\ln p(\mathbf{t}\mid\alpha,\beta) \;=\; \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi),$$
$$E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N, \qquad \mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{t}, \qquad \mathbf{A} = \alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi},$$

with respect to 𝛼 satisfy

$$0 \;=\; \frac{M}{2\alpha} - \frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N - \frac{1}{2}\sum_i\frac{1}{\lambda_i+\alpha}$$

• Multiplying through by 2𝛼 and rearranging, we obtain the implicit solution for 𝛼:

$$\alpha\,\mathbf{m}_N^T\mathbf{m}_N \;=\; M - \alpha\sum_i\frac{1}{\lambda_i+\alpha} \;=\; \sum_i\frac{\lambda_i}{\lambda_i+\alpha} \;\equiv\; \gamma \qquad\Longrightarrow\qquad \alpha = \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}$$

1. Choose 𝛼. 2. Calculate 𝒎𝑁 and 𝛾. 3. Re-estimate 𝛼, and iterate.



Maximizing the Evidence Function
Implicit Solution for Computing 𝛂

1. Choose 𝛼.

2. Calculate $\mathbf{m}_N$ and 𝛾:
$$\mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{t}, \qquad \mathbf{A} = \alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad \big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\mathbf{u}_i = \lambda_i\mathbf{u}_i, \qquad \gamma = \sum_i\frac{\lambda_i}{\lambda_i+\alpha}$$

3. Re-estimate
$$\alpha = \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}$$
and return to step 2 until convergence. A sketch of this iteration is given below.
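A sketch of this iteration with 𝛽 held fixed (toy data assumed; the eigenvalues λᵢ of βΦᵀΦ do not depend on α, so they are computed once):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy regression problem (assumed)
N, M, beta = 30, 10, 11.1
Phi = rng.normal(size=(N, M))
t = Phi @ rng.normal(scale=0.5, size=M) + rng.normal(scale=beta**-0.5, size=N)

lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)        # eigenvalues of beta Phi^T Phi (fixed)

alpha = 1.0                                          # step 1: choose alpha
for _ in range(100):
    A = alpha*np.eye(M) + beta*Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)       # step 2: compute m_N and gamma
    gamma = np.sum(lam / (lam + alpha))
    alpha_new = gamma / (m_N @ m_N)                  # step 3: re-estimate alpha
    if abs(alpha_new - alpha) < 1e-8 * alpha:
        alpha = alpha_new
        break
    alpha = alpha_new

print(alpha, gamma)
```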



Maximizing the Evidence Function
• We can similarly maximize the log marginal likelihood

$$\ln p(\mathbf{t}\mid\alpha,\beta) \;=\; \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln(2\pi)$$

with respect to 𝛽.

• To do this, we note that the eigenvalues 𝜆𝑖 defined by $\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\mathbf{u}_i = \lambda_i\mathbf{u}_i$ are proportional to 𝛽, and hence $d\lambda_i/d\beta = \lambda_i/\beta$, giving

$$\frac{d}{d\beta}\ln|\mathbf{A}| \;=\; \frac{d}{d\beta}\sum_i\ln(\lambda_i+\alpha) \;=\; \frac{1}{\beta}\sum_i\frac{\lambda_i}{\lambda_i+\alpha} \;=\; \frac{\gamma}{\beta}$$



Maximizing the Evidence Function
• Using

$$E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N, \qquad \frac{d}{d\beta}\ln|\mathbf{A}| = \frac{\gamma}{\beta}, \qquad \gamma = \sum_i\frac{\lambda_i}{\lambda_i+\alpha}, \qquad \mathbf{m}_N = \beta\mathbf{A}^{-1}\boldsymbol{\Phi}^T\mathbf{t},$$

and setting the derivative of $\ln p(\mathbf{t}\mid\alpha,\beta)$ with respect to 𝛽 equal to zero, the stationary point of the marginal likelihood satisfies

$$0 \;=\; \frac{N}{2\beta} - \frac{1}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 - \frac{\gamma}{2\beta}$$

• Rearranging, we obtain the implicit solution for 𝛽:

$$\frac{1}{\beta} \;=\; \frac{1}{N-\gamma}\sum_{n=1}^{N}\big(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2$$

1. Choose 𝛽. 2. Calculate 𝒎𝑁 and 𝛾. 3. Re-estimate 𝛽, and iterate (see the joint 𝛼–𝛽 sketch below).
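Extending the 𝛼-only sketch above to also re-estimate 𝛽 gives the full evidence procedure (again on assumed toy data; since the λᵢ are eigenvalues of βΦᵀΦ, they are rescaled whenever β changes):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy data with Gaussian basis functions (assumed)
N = 30
x = rng.uniform(0, 1, N)
t = np.sin(2*np.pi*x) + rng.normal(scale=0.3, size=N)
centers = np.linspace(0, 1, 9)
Phi = np.column_stack([np.ones(N)] + [np.exp(-(x - c)**2/(2*0.1**2)) for c in centers])
M = Phi.shape[1]

eig0 = np.linalg.eigvalsh(Phi.T @ Phi)          # eigenvalues of Phi^T Phi; lambda_i = beta*eig0_i

alpha, beta = 1.0, 1.0                          # initial guesses
for _ in range(200):
    A = alpha*np.eye(M) + beta*Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    lam = beta * eig0
    gamma = np.sum(lam / (lam + alpha))
    alpha_new = gamma / (m_N @ m_N)                          # alpha update
    beta_new = (N - gamma) / np.sum((t - Phi @ m_N)**2)      # 1/beta = sum(...)^2 / (N - gamma)
    if np.isclose(alpha_new, alpha) and np.isclose(beta_new, beta):
        alpha, beta = alpha_new, beta_new
        break
    alpha, beta = alpha_new, beta_new

print(alpha, beta, gamma)   # at convergence, 2*E(m_N) should be close to N (a property shown later)
```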



Maximizing the Evidence Function
• It is interesting to note that in the evidence framework (using the optimal computed values of 𝛼 and 𝛽), the following is true:

$$E(\mathbf{m}_N) \;=\; \frac{N}{2}$$

• This can be easily shown using the results derived earlier:

$$E(\mathbf{m}_N) = \frac{\beta}{2}\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$$

with

$$\alpha = \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N} \qquad\text{and}\qquad \frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\big(t_n-\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 = \frac{1}{N-\gamma}\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\|^2$$
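Spelling out the substitution that the slide leaves implicit: the optimal 𝛽 gives $\beta\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\|^2 = N-\gamma$ and the optimal 𝛼 gives $\alpha\,\mathbf{m}_N^T\mathbf{m}_N = \gamma$, so

$$2E(\mathbf{m}_N) \;=\; \underbrace{\beta\,\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}_N\|^2}_{=\,N-\gamma} \;+\; \underbrace{\alpha\,\mathbf{m}_N^T\mathbf{m}_N}_{=\,\gamma} \;=\; N .$$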



Empirical Bayes for Linear Regression
• We show next the results from the PMTK implementation of the empirical Bayes procedure for picking the hyperparameters in the prior.

• We choose 𝜼 = (𝛼, 𝛽) to maximize the marginal likelihood, where 𝛽 = 1/𝜎² is the precision of the observation noise and 𝛼 is the precision of the prior, 𝑝(𝒘) = 𝑁(𝒘|𝟎, 𝛼⁻¹𝑰).

• This is known as the evidence procedure.

  • MacKay, D. (1995b). Probable networks and plausible predictions — a review of practical Bayesian methods for supervised neural networks. Network.
  • Buntine, W. and A. Weigend (1991). Bayesian backpropagation. Complex Systems 5, 603–643.
  • MacKay, D. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation 11(5), 1035–1068.



Empirical Bayes for Linear Regression
• The evidence procedure provides an alternative to using cross validation.

• In the Figure, the log marginal likelihood is plotted for different values of 𝛼, as well as the maximum value found by the optimizer.

[Figure: log evidence vs. log 𝛼, with the maximum found by the optimizer marked. Run linregPolyVsRegDemo from PMTK3.]

• We obtain the same result as 5-fold CV (𝛽 = 1/𝜎² is fixed in both methods).



Empirical Bayes for Linear Regression
[Figure: (left) the CV estimate of the MSE and the negative log marginal likelihood plotted against log 𝜆; (right) the log evidence plotted against log 𝛼. Run linregPolyVsRegDemo from PMTK3.]

• The key advantage of the evidence procedure over CV is that it allows a different 𝛼𝑗 to be used for every feature.



Effective Parameters and Ridge Regression
• Consider the SVD of 𝜱 = 𝑼𝑺𝑽𝑇.

• We can write

$$\mathbf{m}_N \;=\; \beta\big(\alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^T\mathbf{t} \;=\; \mathbf{V}\big(\lambda\mathbf{I}+\mathbf{S}^2\big)^{-1}\mathbf{S}\mathbf{U}^T\mathbf{t}, \qquad \lambda = \frac{\alpha}{\beta}.$$

• The least squares (MLE) prediction on the training set is $\hat{\mathbf{y}} = \sum_{j=1}^{M}\mathbf{u}_j\mathbf{u}_j^T\mathbf{t}$, while the ridge predictions are

$$\hat{\mathbf{y}} \;=\; \boldsymbol{\Phi}\mathbf{m}_N \;=\; \mathbf{U}\mathbf{S}\mathbf{V}^T\,\mathbf{V}\big(\lambda\mathbf{I}+\mathbf{S}^2\big)^{-1}\mathbf{S}\mathbf{U}^T\mathbf{t} \;=\; \sum_{j=1}^{M}\mathbf{u}_j\,\frac{\sigma_j^2}{\sigma_j^2+\lambda}\,\mathbf{u}_j^T\mathbf{t},$$

where 𝜎𝑗 are the singular values of 𝜱. Note that 𝜎𝑗² are also the eigenvalues of 𝜱𝑇𝜱. (We follow the notation from Murphy; use $\sigma_j^2 \leftarrow \lambda_j/\beta$ to recover the standard notation from Bishop.)

• Note that directions of small 𝜎𝑗 don’t contribute to the ridge estimate. Directions of small 𝜎𝑗 correspond to directions of high posterior variance: for a uniform prior we have seen that $\mathrm{cov}(\mathbf{w}\mid\mathcal{D}) = \beta^{-1}\big(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}$. It is in these directions that the ridge estimate shrinks the most.

• We can now define the effective number of degrees of freedom as

$$\mathrm{dof}(\lambda) \;=\; \sum_{j=1}^{M}\frac{\sigma_j^2}{\sigma_j^2+\lambda}.$$

For 𝜆 → 0, dof(𝜆) → 𝑀, and for 𝜆 → ∞, dof(𝜆) → 0.
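A sketch of the dof computation from the SVD (the design matrix is random here, purely to illustrate the limiting behavior):

```python
import numpy as np

rng = np.random.default_rng(8)
Phi = rng.normal(size=(30, 10))                  # assumed design matrix (N=30, M=10)
s = np.linalg.svd(Phi, compute_uv=False)         # singular values sigma_j of Phi

def dof(lam):
    """Effective degrees of freedom: sum_j sigma_j^2 / (sigma_j^2 + lambda)."""
    return np.sum(s**2 / (s**2 + lam))

for lam in [0.0, 0.1, 1.0, 10.0, 1e6]:
    print(lam, round(dof(lam), 3))               # decreases from M toward 0 as lambda grows
```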
Effective Number of Regression Parameters
$$\mathbf{m}_N \;=\; \beta\big(\alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^T\mathbf{t} \;=\; \big(\alpha\mathbf{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{MLE} \;=\; \sum_i \mathbf{u}_i\frac{\lambda_i}{\lambda_i+\alpha}\mathbf{u}_i^T\mathbf{w}_{MLE} \;=\; \sum_i\Big(\frac{\lambda_i}{\lambda_i+\alpha}\,w_{i,MLE}\Big)\mathbf{u}_i$$

• Consider the contours of the likelihood and the prior, in which the axes in parameter space have been rotated to align with the eigenvectors 𝒖𝑖 of $\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ (eigenvalues 𝜆𝑖).

[Figure: contours of the likelihood $\beta\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{w}\|^2 = \text{const}$ and of the prior $\alpha\,\mathbf{w}^T\mathbf{w} = \text{const}$ in the rotated coordinates (𝑤1, 𝑤2), with 𝜆1 small and 𝜆2 large compared to 𝛼.]

• For 𝛼 = 0, the mode of the posterior is given by the MLE solution 𝒘𝑀𝐿, whereas for nonzero 𝛼 the mode is at 𝒘𝑀𝐴𝑃 = 𝒎𝑁.

• In the direction 𝑤1, 𝜆1 is small compared with 𝛼, so 𝜆1/(𝜆1 + 𝛼) is close to zero, and the corresponding MAP value $w_1^{MAP} = \frac{\lambda_1}{\lambda_1+\alpha}\,w_1^{MLE}$ is also close to zero.

• In the direction 𝑤2, 𝜆2 is large compared with 𝛼, so 𝜆2/(𝜆2 + 𝛼) is close to unity, and the MAP value of 𝑤2 is close to its MLE value.

• Since $0 \le \frac{\lambda_i}{\lambda_i+\alpha} \le 1$, we have $0 \le \gamma = \sum_i\frac{\lambda_i}{\lambda_i+\alpha} \le M$.
Effective Number of Regression Parameters
• In directions 𝑤𝑖 for which 𝜆𝑖 ≪ 𝛼, the ratio 𝜆𝑖/(𝜆𝑖 + 𝛼) is close to zero, and the corresponding MAP value of 𝑤𝑖 is also close to zero. These are directions in which the likelihood function is relatively insensitive to the parameter value, and so the parameter has been set to a small value by the prior.

• The quantity

$$\gamma \;=\; \sum_i\frac{\lambda_i}{\lambda_i+\alpha}, \qquad 0 \le \gamma \le M,$$

therefore measures the effective total number of well determined parameters.


Effective Number of Regression Parameters
• We can obtain some insight into the equation for 𝛽,

$$\frac{1}{\beta} \;=\; \frac{1}{N-\gamma}\sum_{n=1}^{N}\big(t_n-\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2,$$

by comparing it with the MLE result derived in an earlier lecture:

$$\frac{1}{\beta_{ML}} \;=\; \frac{1}{N}\sum_{n=1}^{N}\big(t_n-\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2$$

• Both formulas express the variance as an average of the squared differences between the targets and the model predictions.

• They differ in that the number of data points 𝑁 in the MLE result is replaced by 𝑁 − 𝛾 in the Bayesian result.



Effective Number of Regression Parameters
$$\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\big(t_n-\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 \qquad\text{vs.}\qquad \frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\big(t_n-\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2$$

• The effective number of parameters determined by the data is 𝛾.

• The remaining 𝑀 − 𝛾 parameters are set to small values by the prior.

• This is reflected in the Bayesian result for the variance, which has the factor 𝑁 − 𝛾 in the denominator, correcting for the bias of the MLE.

• These results are analogous to the estimation of the variance of a Gaussian:

$$\sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^{N}\big(x_n-\mu_{ML}\big)^2 \qquad\text{vs.}\qquad \sigma_{MAP}^2 = \frac{1}{N-1}\sum_{n=1}^{N}\big(x_n-\mu_{ML}\big)^2$$

One degree of freedom has been used to fit the mean, and the MAP estimate for the variance accounts for that.



Effective Number of Regression Parameters
• We illustrate the evidence framework for setting hyperparameters using the sinusoidal synthetic data, together with 9 Gaussian basis functions. The total number of parameters is thus 𝑀 = 10, including the bias.

• For simplicity, we set 𝛽 = 11.1 (the true value) and use the evidence framework to determine 𝛼 only.

[Figure: 𝛾 and 𝛼 𝒎𝑁𝑇𝒎𝑁 vs. ln 𝛼 (their intersection gives the optimal 𝛼), and ln 𝑝(𝒕|𝛼) together with the test set error vs. ln 𝛼, with the minimum generalization error marked. MatLab code and data available.]
Effective Number of Regression Parameters
• We can also see how 𝛼 controls the magnitude of the parameters {𝑤𝑖} by plotting the individual parameters (posterior means) versus the effective number 𝛾 of parameters. We use the Gaussian basis model.

[Figure: parameters 𝑤𝑖, 𝑖 = 1, . . . , 10, versus 𝛾. MatLab code and data available.]

• For the simulation, 𝛼 is varied over 0 ≤ 𝛼 ≤ ∞, causing 𝛾 to vary in the range 0 ≤ 𝛾 ≤ 𝑀.



Case of N>>M
• For 𝑁 ≫ 𝑀, all of the parameters are well determined by the data, because 𝚽𝑇𝚽 involves an implicit sum over data points, and so the eigenvalues 𝜆𝑖 increase with the size of the data set.

• In this case, 𝛾 = 𝑀, and the re-estimation equations for 𝛼 and 𝛽 become

$$\alpha \;=\; \frac{M}{\mathbf{m}_N^T\mathbf{m}_N} \qquad (\gamma = M)$$

$$\frac{1}{\beta} \;=\; \frac{1}{N-M}\sum_{n=1}^{N}\big(t_n-\mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2 \qquad (\gamma = M)$$

• These results do not require computing the eigenspectrum of the Hessian.
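As a quick check of the 𝛾 = 𝑀 claim:

$$\gamma \;=\; \sum_{i=1}^{M}\frac{\lambda_i}{\lambda_i+\alpha}\;\xrightarrow{\;\lambda_i\gg\alpha\;}\; M,$$

since each 𝜆𝑖 is an eigenvalue of 𝛽𝚽𝑇𝚽, which grows with the number of data points 𝑁, while 𝛼 remains fixed.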

