The Evidence Approximation for Regression Models
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
The evidence approximation for our regression example, Empirical Bayes for linear
regression, Effective number of regression parameters
Following closely: C. M. Bishop, Pattern Recognition and Machine Learning (Springer, 2006), Ch. 3, and K. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, 2012).
The ratio of model evidences for two models, $\dfrac{p(\mathcal{D}\mid\mathcal{M}_i)}{p(\mathcal{D}\mid\mathcal{M}_j)}$, is known as a Bayes factor.

Given the posterior over models, the predictive distribution follows by model averaging:
$$p(t \mid x, \mathcal{D}) = \sum_i p(t \mid x, \mathcal{M}_i, \mathcal{D})\, p(\mathcal{M}_i \mid \mathcal{D})$$
This has the form of a mixture distribution in which the overall predictive distribution is obtained by averaging the predictive distributions $p(t \mid x, \mathcal{M}_i, \mathcal{D})$ of the individual models, weighted by the posterior probabilities $p(\mathcal{M}_i \mid \mathcal{D})$ of those models.
A simple approximation to model averaging is to use the single most probable model
alone to make predictions. This is known as model selection.
$$p(\mathcal{D}\mid\mathcal{M}_i) = \int p(\mathcal{D}\mid \boldsymbol{w}_i, \mathcal{M}_i)\, p(\boldsymbol{w}_i\mid\mathcal{M}_i)\, d\boldsymbol{w}_i$$
and, by Bayes' theorem,
$$p(\boldsymbol{w}_i\mid\mathcal{D},\mathcal{M}_i) = \frac{p(\mathcal{D}\mid\boldsymbol{w}_i,\mathcal{M}_i)\, p(\boldsymbol{w}_i\mid\mathcal{M}_i)}{p(\mathcal{D}\mid\mathcal{M}_i)}$$
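The second identity holds for any value of $\boldsymbol{w}_i$, which gives a direct way to read off the evidence in conjugate models. Below is a minimal Python sketch (ours, not from the lecture; the toy Gaussian-mean model, the data, and all names are assumptions) checking $p(\mathcal{D}\mid\mathcal{M}) = p(\mathcal{D}\mid w)\,p(w\mid\mathcal{M})/p(w\mid\mathcal{D},\mathcal{M})$ against the closed-form marginal likelihood:

```python
# Sketch: the identity p(D|M) = p(D|w) p(w|M) / p(w|D,M) holds for ANY w,
# so for a conjugate Gaussian-mean model the evidence can be read off exactly.
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, tau = 1.0, 2.0                  # assumed noise std and prior std
x = rng.normal(0.5, sigma, size=10)    # synthetic data
N = len(x)

# Conjugate posterior over the mean w: N(mu_N, s2_N)
s2_N = 1.0 / (1.0 / tau**2 + N / sigma**2)
mu_N = s2_N * x.sum() / sigma**2

w = 0.3                                # any test point works
log_evidence = (norm.logpdf(x, w, sigma).sum()          # log p(D|w)
                + norm.logpdf(w, 0.0, tau)              # + log p(w)
                - norm.logpdf(w, mu_N, np.sqrt(s2_N)))  # - log p(w|D)

# Closed form: marginally, x ~ N(0, sigma^2 I + tau^2 11^T)
cov = sigma**2 * np.eye(N) + tau**2 * np.ones((N, N))
print(log_evidence, multivariate_normal.logpdf(x, np.zeros(N), cov))
```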
The marginal likelihood (Bayesian evidence) $p(\mathcal{D}\mid\mathcal{M}_i)$ can be viewed as the probability that the data set $\mathcal{D}$ would be generated by parameter values sampled at random from the prior of model class $\mathcal{M}_i$.
Model classes that are too simple are unlikely to generate the data set $\mathcal{D}$. Model classes that are too complex can generate many possible data sets, so it is again unlikely that they generate this particular data set $\mathcal{D}$.
Approximating the posterior $p(w\mid\mathcal{D})$ as sharply peaked around $w_{MAP}$ with width $\Delta w_{\text{posterior}}$, and the prior as flat with width $\Delta w_{\text{prior}}$ so that $p(w) \simeq 1/\Delta w_{\text{prior}}$, we obtain
$$p(\mathcal{D}) = \int p(\mathcal{D}\mid w)\, p(w)\, dw \simeq p(\mathcal{D}\mid w_{MAP})\, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$$
Taking logs, and assuming a model with $M$ parameters that all share comparable width ratios,
$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}\mid w_{MAP}) + M \ln\!\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)$$
Since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, the second term is negative, and the size of this complexity penalty increases linearly with $M$. As we increase the complexity of the model,
the first term increases, because a more complex model is better able to fit the data,
whereas the second term decreases due to the dependence on $M$.
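As a numeric illustration (the numbers here are ours, not from the lecture): suppose each parameter's posterior width is a tenth of its prior width, $\Delta w_{\text{posterior}}/\Delta w_{\text{prior}} = 0.1$. Each parameter then contributes $\ln 0.1 \approx -2.3$ to the log evidence, so models with $M = 1, 3, 9$ parameters pay complexity penalties of roughly $-2.3$, $-6.9$ and $-20.7$; the more complex model is favored only if its fit term $\ln p(\mathcal{D}\mid w_{MAP})$ improves by more than this margin.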
Let us think of the regression model and consider the models $\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$ representing linear, quadratic and cubic fits.
The data sets $\mathcal{D}$ are ordered in complexity: for a given model, we choose $\boldsymbol{w}$ from the prior $p(\boldsymbol{w})$, then sample the data from $p(\mathcal{D}\mid\boldsymbol{w})$.
[Figure: schematic of the evidences $p(\mathcal{D}\mid\mathcal{M}_1)$, $p(\mathcal{D}\mid\mathcal{M}_2)$, $p(\mathcal{D}\mid\mathcal{M}_3)$ over data sets $\mathcal{D}$ ordered in complexity]
Optimal Model Complexity
A 1st-order polynomial has little variability and generates data sets that are all similar, so its $p(\mathcal{D})$ is confined to a small region of the $\mathcal{D}$ axis.
A 9th-order polynomial can generate a wide variety of different data sets, so its $p(\mathcal{D})$ is spread over a large region of the $\mathcal{D}$ axis.
Because the distributions $p(\mathcal{D}\mid\mathcal{M}_i)$ are normalized, a particular data set $\mathcal{D}_0$ can have the highest evidence under the model of intermediate complexity.
Optimal Model Comparison
Bayesian model comparison will, on average over data sets $\mathcal{D}$, favor the correct model.
Let $\mathcal{M}_1$ be the correct model and $\mathcal{M}_2$ another model. We can show that the expected evidence for model $\mathcal{M}_1$ is higher, using the definition and the non-negativity of the Kullback-Leibler divergence:
$$\mathrm{KL}\big(p(\mathcal{D}\mid\mathcal{M}_1)\,\big\|\,p(\mathcal{D}\mid\mathcal{M}_2)\big) = \int p(\mathcal{D}\mid\mathcal{M}_1)\,\ln\frac{p(\mathcal{D}\mid\mathcal{M}_1)}{p(\mathcal{D}\mid\mathcal{M}_2)}\,d\mathcal{D} \;\ge\; 0$$
That is, the log Bayes factor $\ln\{p(\mathcal{D}\mid\mathcal{M}_1)/p(\mathcal{D}\mid\mathcal{M}_2)\}$, averaged with respect to the exact probability $p(\mathcal{D}\mid\mathcal{M}_1)$, is non-negative.
This analysis assumes that the true distribution from which the data are generated is
contained in our class of models.
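A hedged Monte Carlo sketch in Python (ours; the two toy model classes are assumptions) illustrating this inequality: individual data sets drawn from $\mathcal{M}_1$ may favor $\mathcal{M}_2$, but the average log Bayes factor is non-negative:

```python
# Draw data sets from the "true" model M1 and average the log Bayes factor
# ln p(D|M1) - ln p(D|M2); by the KL inequality the average is >= 0,
# even though individual data sets may favor M2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
logBF = []
for _ in range(20000):
    D = rng.normal(0.0, 1.0)                 # D ~ p(D|M1) = N(0, 1)
    logBF.append(norm.logpdf(D, 0, 1.0)      # ln p(D|M1)
                 - norm.logpdf(D, 0, 2.0))   # ln p(D|M2) = N(0, 4)
logBF = np.array(logBF)
print("E[ln BF] ~", logBF.mean())                          # positive: KL > 0
print("fraction of data sets favoring M2:", (logBF < 0).mean())
```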
For large data sets $\mathcal{D}$ (relative to the number of model parameters), the parameter posterior is approximately Gaussian around $\boldsymbol{\theta}_m^{MAP}$ (equivalently, use a 2nd-order Taylor expansion of the log-posterior):
$$p(\boldsymbol{\theta}_m\mid\mathcal{D}, M_m) \simeq (2\pi)^{-d/2}\,|\boldsymbol{A}|^{1/2} \exp\!\left\{-\frac{1}{2}\big(\boldsymbol{\theta}_m-\boldsymbol{\theta}_m^{MAP}\big)^{T} \boldsymbol{A}\,\big(\boldsymbol{\theta}_m-\boldsymbol{\theta}_m^{MAP}\big)\right\},$$
$$A_{ij} = -\left.\frac{\partial^2 \log p(\boldsymbol{\theta}_m\mid\mathcal{D}, M_m)}{\partial\theta_{m,i}\,\partial\theta_{m,j}}\right|_{\boldsymbol{\theta}_m^{MAP}}$$
Using the Laplace approximation for the posterior of the parameters and evaluating $p(\mathcal{D}\mid M_m) = p(\boldsymbol{\theta}_m,\mathcal{D}\mid M_m)/p(\boldsymbol{\theta}_m\mid\mathcal{D},M_m)$ at $\boldsymbol{\theta}_m^{MAP}$:
$$\log p(\mathcal{D}\mid M_m) \simeq \log p(\boldsymbol{\theta}_m^{MAP},\mathcal{D}\mid M_m) - \log p(\boldsymbol{\theta}_m^{MAP}\mid\mathcal{D},M_m)$$
$$\simeq \log p(\mathcal{D}\mid\boldsymbol{\theta}_m^{MAP}, M_m) + \log p(\boldsymbol{\theta}_m^{MAP}\mid M_m) + \frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{A}|$$
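A minimal 1-D Python sketch (ours; the Bernoulli-logistic toy model is an assumption) of the Laplace evidence formula above, checked against numerical quadrature:

```python
# ln p(D) ~ ln p(D|t*) + ln p(t*) + (d/2) ln 2pi - (1/2) ln|A|, with d = 1.
# Toy model: Bernoulli data with logistic link, standard normal prior on theta.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

y = np.array([1, 1, 0, 1, 1, 1, 0, 1])          # toy observations
sig = lambda t: 1.0 / (1.0 + np.exp(-t))

def log_joint(t):                                # ln p(D|theta) + ln p(theta)
    return (y * np.log(sig(t)) + (1 - y) * np.log(1 - sig(t))).sum() \
           - 0.5 * t**2 - 0.5 * np.log(2 * np.pi)

t_map = minimize_scalar(lambda t: -log_joint(t)).x
# A = negative 2nd derivative of the log joint at the mode:
# Bernoulli curvature N*sig*(1-sig) plus prior precision 1
A = len(y) * sig(t_map) * (1 - sig(t_map)) + 1.0
laplace = log_joint(t_map) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A)

exact = np.log(quad(lambda t: np.exp(log_joint(t)), -10, 10)[0])
print(laplace, exact)   # the two log evidences should agree closely
```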
The fully Bayesian predictive distribution (dependence on $x$ and $\boldsymbol{x}$ not shown, to simplify the notation) is
$$p(t\mid\boldsymbol{t}) = \iiint p(t\mid\boldsymbol{w},\beta)\, p(\boldsymbol{w}\mid\boldsymbol{t},\alpha,\beta)\, p(\alpha,\beta\mid\boldsymbol{t})\, d\boldsymbol{w}\, d\alpha\, d\beta,$$
where $p(t\mid\boldsymbol{w},\beta) = \mathcal{N}\big(t\mid y(x,\boldsymbol{w}),\beta^{-1}\big)$ and $p(\boldsymbol{w}\mid\boldsymbol{t},\alpha,\beta) = \mathcal{N}(\boldsymbol{w}\mid\boldsymbol{m}_N,\boldsymbol{S}_N)$, with
$$\boldsymbol{m}_N = \beta\,\boldsymbol{S}_N\boldsymbol{\Phi}^T\boldsymbol{t}, \qquad \boldsymbol{S}_N^{-1} = \alpha\boldsymbol{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$
$$p(t\mid\boldsymbol{t}) \simeq p(t\mid\boldsymbol{t},\hat{\alpha},\hat{\beta}) = \int p(t\mid\boldsymbol{w},\hat{\beta})\, p(\boldsymbol{w}\mid\boldsymbol{t},\hat{\alpha},\hat{\beta})\, d\boldsymbol{w},$$
where $(\hat{\alpha},\hat{\beta})$ is the mode of $p(\alpha,\beta\mid\boldsymbol{t})$, which is assumed to be sharply peaked.
If the prior over $(\alpha,\beta)$ is relatively flat, then in the evidence framework the values $\hat{\alpha},\hat{\beta}$ are obtained by maximizing the marginal likelihood function $p(\boldsymbol{t}\mid\alpha,\beta)$.
$$p(\boldsymbol{t}\mid\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int \exp\{-E(\boldsymbol{w})\}\, d\boldsymbol{w},$$
where $M$ is the dimensionality of $\boldsymbol{w}$, and we have defined
$$E(\boldsymbol{w}) = \beta E_D(\boldsymbol{w}) + \alpha E_W(\boldsymbol{w}) = \frac{\beta}{2}\|\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{w}\|^2 + \frac{\alpha}{2}\boldsymbol{w}^T\boldsymbol{w}$$
Completing the square in $\boldsymbol{w}$,
$$E(\boldsymbol{w}) = E(\boldsymbol{m}_N) + \frac{1}{2}(\boldsymbol{w}-\boldsymbol{m}_N)^T\boldsymbol{A}\,(\boldsymbol{w}-\boldsymbol{m}_N)$$
We have introduced here:
$$\boldsymbol{A} = \alpha\boldsymbol{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad E(\boldsymbol{m}_N) = \frac{\beta}{2}\|\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{m}_N\|^2 + \frac{\alpha}{2}\boldsymbol{m}_N^T\boldsymbol{m}_N, \qquad \boldsymbol{m}_N = \beta\,\boldsymbol{A}^{-1}\boldsymbol{\Phi}^T\boldsymbol{t}$$
Note that the Hessian matrix $\boldsymbol{A}$ corresponds to the matrix of second derivatives of the error function: $\boldsymbol{A} = \nabla\nabla E(\boldsymbol{w})$.
The Evidence Approximation
The integral over 𝒘 can now be evaluated simply by appealing to the standard result
for the normalization coefficient of a multivariate Gaussian, giving
$$\int \exp\{-E(\boldsymbol{w})\}\, d\boldsymbol{w} = \exp\{-E(\boldsymbol{m}_N)\}\int \exp\!\left\{-\frac{1}{2}(\boldsymbol{w}-\boldsymbol{m}_N)^T\boldsymbol{A}\,(\boldsymbol{w}-\boldsymbol{m}_N)\right\} d\boldsymbol{w} = \exp\{-E(\boldsymbol{m}_N)\}\,(2\pi)^{M/2}\,|\boldsymbol{A}|^{-1/2}$$
We can then write the log of the marginal likelihood in the form
$$p(\boldsymbol{t}\mid\alpha,\beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2} e^{-E(\boldsymbol{m}_N)}\,(2\pi)^{M/2}\,|\boldsymbol{A}|^{-1/2}$$
$$\ln p(\boldsymbol{t}\mid\alpha,\beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\boldsymbol{m}_N) - \frac{1}{2}\ln|\boldsymbol{A}| - \frac{N}{2}\ln 2\pi$$
M : number of parameters in the model
N : size of training dataset
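A short Python sketch (ours; the sinusoidal toy data and the fixed values of $\alpha,\beta$ are assumptions) that evaluates this closed-form log evidence over polynomial orders, in the spirit of the evidence-versus-$M$ figure below:

```python
# Closed-form log evidence ln p(t|alpha,beta) for Bayesian linear regression,
# scanned over the polynomial order of the design matrix.
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    E = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # toy data

for order in range(10):                               # polynomial order 0..9
    Phi = np.vander(x, order + 1, increasing=True)    # columns 1, x, ..., x^order
    print(order, log_evidence(Phi, t, alpha=5e-3, beta=25.0))
```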
[Figure: model evidence $\ln p(\boldsymbol{t}\mid\alpha,\beta)$ plotted versus the polynomial order $M$ ($0$ to $9$)]
To maximize
$$\ln p(\boldsymbol{t}\mid\alpha,\beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\boldsymbol{m}_N) - \frac{1}{2}\ln|\boldsymbol{A}| - \frac{N}{2}\ln 2\pi$$
with respect to $\alpha$, define the eigenvalue equation $\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$, so that $\boldsymbol{A}$ has eigenvalues $\alpha+\lambda_i$. Now consider the derivative of the term involving $\ln|\boldsymbol{A}|$ with respect to $\alpha$:
$$\frac{d}{d\alpha}\ln|\boldsymbol{A}| = \frac{d}{d\alpha}\ln\prod_i(\lambda_i+\alpha) = \frac{d}{d\alpha}\sum_i\ln(\lambda_i+\alpha) = \sum_i\frac{1}{\lambda_i+\alpha}$$
The stationary points of $\ln p(\boldsymbol{t}\mid\alpha,\beta)$ with respect to $\alpha$ therefore satisfy
$$0 = \frac{M}{2\alpha} - \frac{1}{2}\boldsymbol{m}_N^T\boldsymbol{m}_N - \frac{1}{2}\sum_i\frac{1}{\lambda_i+\alpha}$$
Multiplying through by $2\alpha$ and rearranging, we obtain
$$\alpha\,\boldsymbol{m}_N^T\boldsymbol{m}_N = M - \alpha\sum_i\frac{1}{\lambda_i+\alpha} = \sum_i\frac{\lambda_i}{\lambda_i+\alpha} \equiv \gamma, \qquad \alpha = \frac{\gamma}{\boldsymbol{m}_N^T\boldsymbol{m}_N}$$
This is an implicit solution for $\alpha$, since both $\gamma$ and $\boldsymbol{m}_N$ depend on $\alpha$. It suggests the iteration:
1. Choose an initial $\alpha$.
2. Calculate $\boldsymbol{m}_N = \beta\,\boldsymbol{A}^{-1}\boldsymbol{\Phi}^T\boldsymbol{t}$ with $\boldsymbol{A} = \alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$, and $\gamma = \sum_i\dfrac{\lambda_i}{\lambda_i+\alpha}$ from $\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$.
3. Re-estimate $\alpha = \gamma/\big(\boldsymbol{m}_N^T\boldsymbol{m}_N\big)$, and repeat until convergence.
Consider now the derivative of $\ln p(\boldsymbol{t}\mid\alpha,\beta)$ with respect to $\beta$. The eigenvalues $\lambda_i$ of $\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ are proportional to $\beta$, so $d\lambda_i/d\beta = \lambda_i/\beta$, and therefore
$$\frac{d}{d\beta}\ln|\boldsymbol{A}| = \frac{d}{d\beta}\sum_i\ln(\lambda_i+\alpha) = \frac{1}{\beta}\sum_i\frac{\lambda_i}{\lambda_i+\alpha} = \frac{\gamma}{\beta}$$
Setting the derivative with respect to $\beta$ equal to zero, the stationary point of the marginal likelihood therefore satisfies
$$0 = \frac{N}{2\beta} - \frac{1}{2}\sum_{n=1}^N\big\{t_n - \boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2 - \frac{\gamma}{2\beta}$$
Rearranging gives the re-estimation formula
$$\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^N\big\{t_n - \boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2$$
Again this is an implicit solution: choose an initial $\beta$, calculate $\boldsymbol{m}_N$ and $\gamma$, re-estimate $\beta$, and repeat. If both $\alpha$ and $\beta$ are being learned from the data, the two updates are alternated.
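A compact Python sketch (ours; the synthetic data and all names are assumptions) of the alternating fixed-point scheme summarized above, re-estimating both $\alpha$ and $\beta$:

```python
# Alternate: m_N, gamma, then alpha = gamma / (m_N^T m_N) and
# 1/beta = sum_n (t_n - m_N^T phi_n)^2 / (N - gamma).
import numpy as np

def evidence_fixed_point(Phi, t, alpha=1.0, beta=1.0, iters=100):
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)       # eigenvalues of Phi^T Phi
    for _ in range(iters):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        lam = beta * eig0                        # eigenvalues of beta Phi^T Phi
        gamma = np.sum(lam / (lam + alpha))      # effective # of parameters
        alpha = gamma / (m_N @ m_N)              # re-estimate alpha
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)   # re-estimate beta
    return alpha, beta, gamma, m_N

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)    # noise std 0.2
Phi = np.vander(x, 9, increasing=True)                    # M = 9 features
alpha, beta, gamma, m_N = evidence_fixed_point(Phi, t)
print(f"alpha={alpha:.4f}  beta={beta:.1f}  gamma={gamma:.2f} of M={Phi.shape[1]}")
# beta should land near 1/0.2^2 = 25 for this toy data set.
```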
[Figures: log evidence plotted versus $\log\lambda$ and versus $\log\alpha$ (hyperparameter axes $\log\beta$, $\log\alpha$); produced by running linregPolyVsRegDemo from PMTK3]
The key advantage of the evidence procedure over cross-validation (CV) is that it allows a different $\alpha_j$ to be used for every feature.
The least-squares (MLE) prediction on the training set is $\hat{\boldsymbol{y}} = \sum_{j=1}^M \boldsymbol{u}_j\boldsymbol{u}_j^T\boldsymbol{t}$, while the ridge predictions are
$$\hat{\boldsymbol{y}} = \boldsymbol{\Phi}\boldsymbol{m}_N = \boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^T\boldsymbol{V}\big(\lambda\boldsymbol{I}+\boldsymbol{S}^2\big)^{-1}\boldsymbol{S}\boldsymbol{U}^T\boldsymbol{t} = \sum_{j=1}^M \boldsymbol{u}_j\,\frac{\sigma_j^2}{\sigma_j^2+\lambda}\,\boldsymbol{u}_j^T\boldsymbol{t},$$
where the $\sigma_j$ are the singular values of $\boldsymbol{\Phi}$. Note that the $\sigma_j^2$ are also the eigenvalues of $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$. We follow the notation from Murphy; use $\sigma_j^2 \leftarrow \lambda_j/\beta$ to recover the standard notation from Bishop.
Note that directions of small $\sigma_j$ do not contribute to the ridge estimate. Directions of small $\sigma_j$ correspond to directions of high posterior variance; for a uniform prior we have seen that $\mathrm{cov}[\boldsymbol{w}\mid\mathcal{D}] = \beta^{-1}\big(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}$. It is in these directions that the ridge estimate shrinks the most.
We can now define the effective number of degrees of freedom as
$$\mathrm{dof}(\lambda) = \sum_{j=1}^M \frac{\sigma_j^2}{\sigma_j^2+\lambda}$$
As $\lambda\to 0$, $\mathrm{dof}(\lambda)\to M$, and as $\lambda\to\infty$, $\mathrm{dof}(\lambda)\to 0$.
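A small Python sketch (ours; random toy matrices) verifying both SVD identities above, the shrunken ridge predictions and $\mathrm{dof}(\lambda)$:

```python
# Ridge predictions shrink each left-singular direction u_j by
# sigma_j^2 / (sigma_j^2 + lambda); dof(lambda) sums these shrinkage factors.
import numpy as np

rng = np.random.default_rng(4)
Phi = rng.normal(size=(50, 8))         # toy design matrix, M = 8
t = rng.normal(size=50)
lam = 2.0

U, S, Vt = np.linalg.svd(Phi, full_matrices=False)
shrink = S**2 / (S**2 + lam)

y_svd = U @ (shrink * (U.T @ t))       # sum_j u_j sigma_j^2/(sigma_j^2+lam) u_j^T t
m_N = np.linalg.solve(lam * np.eye(8) + Phi.T @ Phi, Phi.T @ t)
y_direct = Phi @ m_N                   # ridge prediction Phi m_N
print(np.allclose(y_svd, y_direct))    # True: the two forms agree

print("dof(lam) =", shrink.sum(), "of M =", Phi.shape[1])
print("dof as lam -> 0:", (S**2 / (S**2 + 1e-12)).sum())   # approaches M
```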
Effective Number of Regression Parameters
Write the posterior mean in terms of the MLE solution:
$$\boldsymbol{m}_N = \beta\big(\alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^T\boldsymbol{t} = \big(\alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\,\boldsymbol{w}_{MLE}$$
Consider the contours of the likelihood, $\frac{\beta}{2}\|\boldsymbol{t}-\boldsymbol{\Phi}\boldsymbol{w}\|^2 = \text{const}$, and of the prior, $\boldsymbol{w}^T\boldsymbol{w} = \text{const}$, in which the axes in parameter space have been rotated to align with the eigenvectors $\boldsymbol{u}_i$ of $\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$. Then
$$\boldsymbol{m}_N = \sum_i\frac{\lambda_i}{\lambda_i+\alpha}\,\boldsymbol{u}_i\boldsymbol{u}_i^T\boldsymbol{w}_{MLE} = \sum_i\frac{\lambda_i}{\lambda_i+\alpha}\,w_{i,MLE}\,\boldsymbol{u}_i$$
For $\alpha = 0$, the mode of the posterior is given by the MLE solution $\boldsymbol{w}_{ML}$, whereas for nonzero $\alpha$ the mode is at $\boldsymbol{w}_{MAP} = \boldsymbol{m}_N$.
In the direction $w_1$, $\lambda_1$ is small compared with $\alpha$, so $\lambda_1/(\lambda_1+\alpha)$ is close to zero, and the corresponding MAP value $w_1^{MAP} = \frac{\lambda_1}{\lambda_1+\alpha}\,w_1^{MLE}$ is also close to zero.
In the direction $w_2$, $\lambda_2$ is large compared with $\alpha$, so $\lambda_2/(\lambda_2+\alpha)$ is close to unity, and the MAP value of $w_2$ is close to its MLE value.
Since $0 \le \dfrac{\lambda_i}{\lambda_i+\alpha} \le 1$, the effective number of well-determined parameters $\gamma = \sum_i\dfrac{\lambda_i}{\lambda_i+\alpha}$ satisfies $0 \le \gamma \le M$.
[Figure: contours of the likelihood and the prior in rotated parameter coordinates $(w_1, w_2)$, showing the MLE and MAP modes]
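A brief Python check (ours; random toy matrices) of the eigen-direction form of the posterior mean used above:

```python
# Verify m_N = sum_i (lam_i / (lam_i + alpha)) (u_i^T w_MLE) u_i,
# with (beta Phi^T Phi) u_i = lam_i u_i.
import numpy as np

rng = np.random.default_rng(5)
Phi = rng.normal(size=(40, 5))
t = rng.normal(size=40)
alpha, beta = 0.5, 25.0

H = beta * Phi.T @ Phi
lam, U = np.linalg.eigh(H)                    # columns of U are the u_i
w_mle = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

m_direct = beta * np.linalg.solve(alpha * np.eye(5) + H, Phi.T @ t)
m_eigen = U @ ((lam / (lam + alpha)) * (U.T @ w_mle))
print(np.allclose(m_direct, m_eigen))         # True: each direction is shrunk
```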
Effective Number of Regression Parameters
In directions $w_i$ for which $\lambda_i \ll \alpha$, the ratio $\lambda_i/(\lambda_i+\alpha)$ is close to zero, and the corresponding components of
$$\boldsymbol{m}_N = \big(\alpha\boldsymbol{I}+\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)^{-1}\big(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\big)\,\boldsymbol{w}_{MLE}$$
are shrunk toward zero; such parameters are not well determined by the data.
Compare the evidence-framework estimate of the noise variance with its ML counterpart:
$$\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^N\big\{t_n-\boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2, \qquad \frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^N\big\{t_n-\boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2$$
These formulas express the variance as an average of the squared differences between the targets and the model predictions, but the factor $N-\gamma$ corrects for the $\gamma$ well-determined parameters fitted to the data. This is analogous to estimating the variance of a Gaussian: one degree of freedom is used up in fitting the mean, and the corrected (MAP) variance estimate, with $N-1$ in place of $N$, accounts for that.
[Figures (MATLAB code and data): plots of $\boldsymbol{m}_N^T\boldsymbol{m}_N$ and of the individual parameters $w_i$ (curves labeled 1 to 10) versus the effective number of parameters $\gamma$]
For the simulation, $\alpha$ is varied over the range $0 \le \alpha \le \infty$, causing $\gamma$ to vary in the range $0 \le \gamma \le M$.
For $N \gg M$, all of the parameters are well determined and $\gamma \simeq M$, so the re-estimate for the noise variance reduces to
$$\frac{1}{\beta} = \frac{1}{N-M}\sum_{n=1}^N\big\{t_n-\boldsymbol{m}_N^T\boldsymbol{\phi}(x_n)\big\}^2$$