Regularization Methods
Prof. Nicholas Zabaras
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
Multi-output Regression
Centered Data
Learn how to compute the bias parameter and perform data centering.
Setting the gradient with respect to 𝒘 to zero, and solving for 𝒘 as before, we obtain
$$\boldsymbol{w} = \left(\lambda \boldsymbol{I} + \boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{t}$$
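A minimal MATLAB sketch of this closed-form solution (the basis, data, and value of lambda below are illustrative assumptions, not from the lecture):

% Regularized least squares: w = (lambda*I + Phi'*Phi) \ (Phi'*t)
N = 30;
x = linspace(0, 1, N)';                    % illustrative 1D inputs
t = sin(2*pi*x) + 0.1*randn(N, 1);         % noisy targets
Phi = [ones(N,1), x, x.^2, x.^3];          % polynomial basis, M = 4
lambda = 0.1;                              % regularization coefficient
M = size(Phi, 2);
w = (lambda*eye(M) + Phi'*Phi) \ (Phi'*t); % backslash solve avoids an explicit inverse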
Regularization limits the effective model complexity (the appropriate number of basis
functions).
The problem of choosing the appropriate number of basis functions is thus replaced with the problem of finding a suitable value of the regularization coefficient 𝜆.
For sparsity-inducing regularizers (e.g. $q \leq 1$ below), $\lambda$ controls how many of the weights $w_j$, and hence how many basis functions, remain non-zero.
The regularized error function with a more general regularizer takes the form

$$\frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right\}^{2} + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^{q}$$
[Figure: contours of the regularizer term $\frac{\lambda}{2}\sum_{j}|w_j|^{q}$ in the $(w_1, w_2)$ plane for $q = 0.5$, $q = 1$, $q = 2$, and $q = 4$.]
MATLAB code
𝑞 = 1 is known as the Lasso regularizer. These plots show only the regularizer term
with 𝜆 = 0.7334.
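A minimal MATLAB sketch of how such regularizer-term contour plots can be produced (the grid range and resolution are illustrative assumptions):

% Contours of the regularizer term (lambda/2) * (|w1|^q + |w2|^q)
lambda = 0.7334;
[w1, w2] = meshgrid(linspace(-10, 10, 200));
qs = [0.5, 1, 2, 4];
for i = 1:4
    subplot(1, 4, i);
    E = (lambda/2) * (abs(w1).^qs(i) + abs(w2).^qs(i));
    contour(w1, w2, E);                   % regularizer term only, no data term
    title(sprintf('q = %g', qs(i)));
    axis square;
end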
𝑞 = 2 corresponds to the quadratic regularizer.
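To illustrate how $\lambda$ controls sparsity under the lasso ($q = 1$) but not under the quadratic regularizer ($q = 2$), here is a hedged MATLAB sketch; it assumes the Statistics and Machine Learning Toolbox for lasso and ridge, and the data are illustrative:

% Lasso (q = 1) drives weights exactly to zero; ridge (q = 2) only shrinks them.
rng(0);
X = randn(100, 10);                                 % illustrative design matrix
t = X(:, 1:3) * [2; -1; 0.5] + 0.1*randn(100, 1);   % only 3 relevant features
wLasso = lasso(X, t, 'Lambda', 0.1);                % several entries exactly zero
wRidge = ridge(t, X, 0.1);                          % all entries shrunk, none zero
fprintf('non-zero lasso weights: %d of 10\n', nnz(wLasso));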
Minimizing the regularized error can equivalently be viewed as minimizing the unregularized sum-of-squares error subject to the constraint

$$\sum_{j=1}^{M}\left|w_j\right|^{q} \leq \eta$$

[Figure: contours of the unregularized error function and the constraint region for the quadratic ($q = 2$) and lasso ($q = 1$) regularizers.]
Introducing a Lagrange multiplier $\lambda/2$ for this constraint yields the objective

$$\frac{1}{2}\sum_{n=1}^{N}\left\{t_n - \boldsymbol{w}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right\}^{2} + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^{q} \qquad (*)$$

This is identical to our regularized least squares (RLS) in its dependence on $\boldsymbol{w}$.
For a particular 𝜆 > 0, let 𝒘∗ (𝜆) be the solution of the RLS in (*).
Then minimizing $(*)$ is equivalent to the constrained problem with $\eta = \sum_{j=1}^{M}\left|w_j^{*}(\lambda)\right|^{q}$.
Recall the general constrained minimization problem with its Karush-Kuhn-Tucker (KKT) conditions and Lagrangian:

$$\min_{\boldsymbol{x}} f(\boldsymbol{x}), \quad \text{subject to } g(\boldsymbol{x}) \geq 0$$

$$\lambda \geq 0, \qquad g(\boldsymbol{x}) \geq 0, \qquad \lambda\, g(\boldsymbol{x}) = 0$$

$$L(\boldsymbol{x}, \lambda) = f(\boldsymbol{x}) - \lambda\, g(\boldsymbol{x})$$
Multiple Outputs: Isotropic Covariance
If we want to predict 𝐾 > 1 target variables, we use the same basis for all
components of the target vector:
$$p(\boldsymbol{t} \mid \boldsymbol{x}, \boldsymbol{W}, \beta) = \mathcal{N}\left(\boldsymbol{t} \mid \boldsymbol{y}(\boldsymbol{x}, \boldsymbol{W}),\ \beta^{-1}\boldsymbol{I}\right) = \mathcal{N}\left(\boldsymbol{t} \mid \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}),\ \beta^{-1}\boldsymbol{I}\right)$$
Given observed inputs $\boldsymbol{X} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$ and targets $\boldsymbol{T} = [\boldsymbol{t}_1, \ldots, \boldsymbol{t}_N]^{T}$, we obtain the log-likelihood function

$$\ln p(\boldsymbol{T} \mid \boldsymbol{X}, \boldsymbol{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\left(\boldsymbol{t}_n \mid \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n),\ \beta^{-1}\boldsymbol{I}\right) = \frac{NK}{2}\ln\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\boldsymbol{t}_n - \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right\|^{2}$$
Maximizing with respect to $\boldsymbol{W}$ gives

$$\underbrace{\boldsymbol{W}_{ML}}_{M \times K} = \underbrace{\left(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}}_{M \times M}\ \underbrace{\boldsymbol{\Phi}^{T}}_{M \times N}\ \underbrace{\boldsymbol{T}}_{N \times K}$$
If we examine this result for each target variable 𝑡𝑘 , we have (take the 𝑘th column of
𝑾 and 𝑻):
$$\boldsymbol{w}_{k,ML} = \left(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{t}_k = \boldsymbol{\Phi}^{\dagger}\boldsymbol{t}_k$$
which is identical to the single output case (so there is decoupling between the target
variables).
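A minimal MATLAB check of this decoupling (the sizes and data below are illustrative):

% The multi-output MLE solves K independent single-output problems.
rng(1);
N = 50; M = 4; K = 3;
Phi = [ones(N,1), randn(N, M-1)];           % illustrative design matrix
T = Phi * randn(M, K) + 0.05*randn(N, K);   % K target columns
W_ml = (Phi'*Phi) \ (Phi'*T);               % all outputs at once
w1   = (Phi'*Phi) \ (Phi'*T(:,1));          % first output on its own
norm(W_ml(:,1) - w1)                        % ~ 0: the columns decouple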
The same analysis applies with a general noise covariance $\boldsymbol{\Sigma}$:

$$p(\boldsymbol{t} \mid \boldsymbol{x}, \boldsymbol{W}, \boldsymbol{\Sigma}) = \mathcal{N}\left(\boldsymbol{t} \mid \boldsymbol{y}(\boldsymbol{x}, \boldsymbol{W}),\ \boldsymbol{\Sigma}\right) = \mathcal{N}\left(\boldsymbol{t} \mid \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}),\ \boldsymbol{\Sigma}\right)$$
Given observed inputs $\boldsymbol{X} = \{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\}$ and targets $\boldsymbol{T} = [\boldsymbol{t}_1, \ldots, \boldsymbol{t}_N]^{T}$, we obtain the log-likelihood function:

$$\ln p(\boldsymbol{T} \mid \boldsymbol{X}, \boldsymbol{W}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \mathcal{N}\left(\boldsymbol{t}_n \mid \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n),\ \boldsymbol{\Sigma}\right) = -\frac{N}{2}\ln\left|\boldsymbol{\Sigma}\right| - \frac{1}{2}\sum_{n=1}^{N}\left(\boldsymbol{t}_n - \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right)^{T}\boldsymbol{\Sigma}^{-1}\left(\boldsymbol{t}_n - \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right) + \text{const}$$
Setting the gradient with respect to $\boldsymbol{W}$ to zero,

$$\sum_{n=1}^{N} \boldsymbol{\Sigma}^{-1}\left(\boldsymbol{t}_n - \boldsymbol{W}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right)\boldsymbol{\phi}(\boldsymbol{x}_n)^{T} = \boldsymbol{0} \quad\Rightarrow\quad \underbrace{\boldsymbol{W}_{ML}}_{M \times K} = \left(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{T}$$

so $\boldsymbol{W}_{ML}$ is the same as in the isotropic case and is independent of $\boldsymbol{\Sigma}$.
For the ML estimate of 𝚺, use the result for the MLE of the covariance of a
multivariate Gaussian:
$$\boldsymbol{\Sigma}_{ML} = \frac{1}{N}\sum_{n=1}^{N}\left(\boldsymbol{t}_n - \boldsymbol{W}_{ML}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right)\left(\boldsymbol{t}_n - \boldsymbol{W}_{ML}^{T}\boldsymbol{\phi}(\boldsymbol{x}_n)\right)^{T}$$
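A minimal MATLAB sketch of both estimates (the covariance and data below are illustrative):

% W_ML is independent of Sigma; Sigma_ML is built from the residuals of W_ML.
rng(2);
N = 200; M = 4; K = 2;
Phi = [ones(N,1), randn(N, M-1)];
C = [0.2 0.05; 0.05 0.1];                  % true noise covariance (illustrative)
T = Phi * randn(M, K) + randn(N, K) * chol(C);
W_ml = (Phi'*Phi) \ (Phi'*T);              % same form as the isotropic case
R = T - Phi*W_ml;                          % N x K residual matrix
Sigma_ml = (R'*R) / N                      % (1/N) * sum of outer products, ~ C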
Consider again the ML (least-squares) solution

$$\boldsymbol{w}_{ML} = \left(\boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{T}\boldsymbol{t}$$

The residual $\boldsymbol{t} - \boldsymbol{y} = \boldsymbol{t} - \boldsymbol{\Phi}\boldsymbol{w}$, where $y_n = y(\boldsymbol{x}_n, \boldsymbol{w})$, is orthogonal to each basis vector $\boldsymbol{\varphi}_j$, i.e. such that:

$$\boldsymbol{\Phi}^{T}\left(\boldsymbol{t} - \boldsymbol{\Phi}\boldsymbol{w}\right) = \boldsymbol{0}$$

These are the normal equations we derived earlier.
[Figure: the prediction $\boldsymbol{y} = \boldsymbol{\Phi}\boldsymbol{w}$ is the orthogonal projection of $\boldsymbol{t}$ onto the $M$-dimensional subspace $S$ spanned by the basis vectors $\boldsymbol{\varphi}_0, \ldots, \boldsymbol{\varphi}_{M-1}$.] Here $\boldsymbol{\phi}(\boldsymbol{x})^{T} = \left(\phi_0(\boldsymbol{x}),\ \phi_1(\boldsymbol{x}),\ \ldots,\ \phi_{M-1}(\boldsymbol{x})\right)$.
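A quick MATLAB check of this orthogonality (the basis and data are illustrative):

% The least-squares residual is orthogonal to every column of Phi.
rng(3);
N = 40;
x = linspace(0, 1, N)';
Phi = [ones(N,1), x, x.^2];
t = sin(2*pi*x) + 0.1*randn(N, 1);
w = (Phi'*Phi) \ (Phi'*t);
max(abs(Phi' * (t - Phi*w)))   % ~ 1e-14: the normal equations hold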
Minimizing the sum-of-squares error with respect to the bias $w_0$ gives

$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j, \qquad \bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\boldsymbol{x}_n)$$
The bias parameter 𝑤0 compensates for the difference between the averages of the
target values and the weighted sum of the averages of the basis function values.
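A minimal numeric check of this relation in MATLAB (the basis and data are illustrative):

% Verify: w0 = tbar - sum_j w_j * phibar_j at the ML solution.
rng(4);
N = 60;
x = rand(N, 1);
Phi = [ones(N,1), x, x.^2];              % phi_0 = 1 plus M-1 = 2 basis functions
t = 1 + 3*x - 2*x.^2 + 0.05*randn(N,1);
w = (Phi'*Phi) \ (Phi'*t);               % w(1) is the bias w0
phibar = mean(Phi(:, 2:end))';           % averages of the non-constant basis
w(1) - (mean(t) - w(2:end)'*phibar)      % ~ 0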
Let us assume that the input data are centered in each dimension such that:

$$\bar{\phi}_i = \frac{1}{N}\sum_{j=1}^{N}\phi_i(\boldsymbol{x}_j) = 0, \qquad i = 1, \ldots, M-1$$
The design matrix and its columns are:

$$\boldsymbol{\Phi}^{T} = \left[\boldsymbol{\phi}(\boldsymbol{x}_1)\ \ \boldsymbol{\phi}(\boldsymbol{x}_2)\ \cdots\ \boldsymbol{\phi}(\boldsymbol{x}_N)\right], \qquad \boldsymbol{\phi}(\boldsymbol{x}_i)^{T} = \left(\phi_1(\boldsymbol{x}_i),\ \phi_2(\boldsymbol{x}_i),\ \ldots,\ \phi_{M-1}(\boldsymbol{x}_i)\right)$$

$$\boldsymbol{\Phi} = \begin{bmatrix} \phi_1(\boldsymbol{x}_1) & \phi_2(\boldsymbol{x}_1) & \cdots & \phi_{M-1}(\boldsymbol{x}_1) \\ \phi_1(\boldsymbol{x}_2) & \phi_2(\boldsymbol{x}_2) & \cdots & \phi_{M-1}(\boldsymbol{x}_2) \\ \vdots & \vdots & & \vdots \\ \phi_1(\boldsymbol{x}_N) & \phi_2(\boldsymbol{x}_N) & \cdots & \phi_{M-1}(\boldsymbol{x}_N) \end{bmatrix} = \left[\boldsymbol{\varphi}_1\ \ \boldsymbol{\varphi}_2\ \cdots\ \boldsymbol{\varphi}_{M-1}\right], \qquad \boldsymbol{\varphi}_i^{T} = \left(\phi_i(\boldsymbol{x}_1),\ \phi_i(\boldsymbol{x}_2),\ \ldots,\ \phi_i(\boldsymbol{x}_N)\right)$$
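A small MATLAB sketch of centering the columns of the design matrix (the basis choice is illustrative):

% Center each basis-function column so that sum_j phi_i(x_j) = 0.
N = 50;
x = rand(N, 1);
Phi = [x, x.^2, sin(2*pi*x)];        % M-1 = 3 non-constant basis functions
Phic = Phi - ones(N,1)*mean(Phi);    % subtract the column means
max(abs(sum(Phic)))                  % ~ 0: the columns are now centered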
The mean of the output is equally likely to be positive or negative. Let us put an
improper prior 𝑝(𝜇) ∝ 1 and integrate 𝜇 out.
With the model $\boldsymbol{t} = \mu\,\boldsymbol{1}_N + \boldsymbol{\Phi}\boldsymbol{w} + \boldsymbol{\epsilon}$ and centered basis columns, integrating out $\mu$ gives

$$p(\boldsymbol{t} \mid \boldsymbol{w}, \beta) = \int p(\boldsymbol{t} \mid \mu, \boldsymbol{w}, \beta)\, p(\mu)\, d\mu \propto \exp\left(-\frac{\beta}{2}\left(\boldsymbol{t} - \bar{t}\,\boldsymbol{1}_N - \boldsymbol{\Phi}\boldsymbol{w}\right)^{T}\left(\boldsymbol{t} - \bar{t}\,\boldsymbol{1}_N - \boldsymbol{\Phi}\boldsymbol{w}\right)\right)$$
Our model is now simplified if instead of $\boldsymbol{t}$ we use the centered output $\tilde{\boldsymbol{t}} = \boldsymbol{t} - \bar{t}\,\boldsymbol{1}_N$, and the likelihood is simply written as:

$$p(\tilde{\boldsymbol{t}} \mid \boldsymbol{x}, \boldsymbol{w}, \beta) \propto \exp\left(-\frac{\beta}{2}\left(\tilde{\boldsymbol{t}} - \boldsymbol{\Phi}\boldsymbol{w}\right)^{T}\left(\tilde{\boldsymbol{t}} - \boldsymbol{\Phi}\boldsymbol{w}\right)\right)$$
Recall that the MLE estimate for $\mu$ is

$$\hat{\mu} = \bar{t} - \sum_{j=1}^{M-1} \bar{\phi}_j\, w_j,$$

where $\bar{\phi}_1, \ldots, \bar{\phi}_{M-1}$ are formed by taking the average of each column of $\boldsymbol{\Phi}$ (these vanish here, since the columns are centered).
A Note on Data Centering: MLE of 𝑤0
As an example, consider a linear regression model of the form
$$\mathbb{E}\left[y \mid \boldsymbol{x}\right] = w_0 + \boldsymbol{w}^{T}\boldsymbol{x}$$
In the context of MLE, for example, we need to minimize:

$$\min_{w_0,\,\boldsymbol{w}} \sum_{i=1}^{N}\left(t_i - w_0 - \boldsymbol{w}^{T}\boldsymbol{x}_i\right)^{2}$$
Minimization with respect to $w_0$ gives:

$$\sum_{i=1}^{N}\left(t_i - w_0 - \boldsymbol{w}^{T}\boldsymbol{x}_i\right) = 0 \quad\Rightarrow\quad N w_0 = N\bar{t} - N\,\boldsymbol{w}^{T}\bar{\boldsymbol{x}} \quad\Rightarrow\quad \hat{w}_0 = \bar{t} - \boldsymbol{w}^{T}\bar{\boldsymbol{x}}$$
where:

$$\bar{\boldsymbol{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_M \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum_{i=1}^{N} x_{i1} \\ \frac{1}{N}\sum_{i=1}^{N} x_{i2} \\ \vdots \\ \frac{1}{N}\sum_{i=1}^{N} x_{iM} \end{bmatrix}, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i$$
Thus:

$$\hat{w}_0 = \bar{t} - \bar{\boldsymbol{x}}^{T}\boldsymbol{w}$$
A Note on Data Centering: MLE of 𝑤
Substituting the bias term in our objective function gives:

$$\min_{\boldsymbol{w}} \sum_{i=1}^{N}\left(t_i - \bar{t} + \boldsymbol{w}^{T}\bar{\boldsymbol{x}} - \boldsymbol{w}^{T}\boldsymbol{x}_i\right)^{2} = \min_{\boldsymbol{w}} \sum_{i=1}^{N}\left(t_i - \bar{t} - \boldsymbol{w}^{T}\left(\boldsymbol{x}_i - \bar{\boldsymbol{x}}\right)\right)^{2}$$
We thus first compute the MLE of $\boldsymbol{w}$ using the centered input and output as follows:

$$\hat{\boldsymbol{w}} = \left(\boldsymbol{X}_c^{T}\boldsymbol{X}_c\right)^{-1}\boldsymbol{X}_c^{T}\boldsymbol{t}_c = \left(\sum_{i=1}^{N}\left(\boldsymbol{x}_i - \bar{\boldsymbol{x}}\right)\left(\boldsymbol{x}_i - \bar{\boldsymbol{x}}\right)^{T}\right)^{-1}\sum_{i=1}^{N}\left(\boldsymbol{x}_i - \bar{\boldsymbol{x}}\right)\left(t_i - \bar{t}\right)$$

where the centered data are

$$\boldsymbol{X}_c = \boldsymbol{X} - \boldsymbol{1}_N\,\bar{\boldsymbol{x}}^{T} = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1M} - \bar{x}_M \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2M} - \bar{x}_M \\ \vdots & \vdots & & \vdots \\ x_{N1} - \bar{x}_1 & x_{N2} - \bar{x}_2 & \cdots & x_{NM} - \bar{x}_M \end{bmatrix}, \qquad \boldsymbol{t}_c = \boldsymbol{t} - \bar{t}\,\boldsymbol{1}_N,$$

with $\bar{\boldsymbol{x}}$ and $\bar{t}$ as defined above.
We can then compute the MLE estimate of $w_0$ as follows:

$$\hat{w}_0 = \bar{t} - \bar{\boldsymbol{x}}^{T}\hat{\boldsymbol{w}}$$
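A minimal MATLAB sketch of this two-step centering procedure (the data are illustrative); it should agree with a direct fit that includes an explicit intercept column:

% Two-step fit via centering: (1) w from centered data, (2) w0 from the means.
rng(5);
N = 100; M = 3;
X = randn(N, M) + 2;                        % inputs, deliberately not centered
t = 1.5 + X*[0.5; -2; 1] + 0.1*randn(N,1);
xbar = mean(X)';  tbar = mean(t);
Xc = X - ones(N,1)*xbar';                   % centered inputs
tc = t - tbar;                              % centered outputs
w  = (Xc'*Xc) \ (Xc'*tc);                   % MLE of w from centered data
w0 = tbar - xbar'*w;                        % MLE of the bias
A = [ones(N,1), X];                         % direct fit with intercept column
wFull = (A'*A) \ (A'*t);
max(abs(wFull - [w0; w]))                   % ~ 0: the two fits agree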