
Ridge Regression and Regularization Methods
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 23, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
 Ridge Regression, Lasso Regularizer and Sparse Solutions

 Multi-output Regression

 Geometry of least squares

 Computing the Bias Parameter

 Centered Data

Goals
 The goals for today's lecture are the following:

 Learn about various regularized least-squares solutions to regression problems, and in particular about Ridge regression

 Understand the geometry of least squares

 Understand how to solve multi-output regression problems

 Learn how to compute the bias parameter and how to perform data centering



Regularized Least Squares – Ridge Regression
 Consider the error function:
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
(data term + regularization term)

 With the sum-of-squares error function and a quadratic regularizer, we get
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$$

 Setting the gradient with respect to 𝒘 to zero, and solving for 𝒘 as before, we obtain
$$\mathbf{w} = \left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

 This is a trivial extension of the least-squares solution we encountered earlier; the method is known as regularized least squares or ridge regression.
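 As an added illustration (not part of the original slides), a minimal NumPy sketch of the closed-form ridge solution above; the synthetic data, polynomial basis, and value of 𝜆 are hypothetical choices:

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """Closed-form ridge solution w = (lam*I + Phi^T Phi)^(-1) Phi^T t."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(A, Phi.T @ t)

# Hypothetical example: polynomial basis on synthetic 1D data.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
Phi = np.vander(x, N=10, increasing=True)          # N x M design matrix (M = 10)
print(ridge_solution(Phi, t, lam=1e-3))
```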



Regularized Least Squares
 Regularized solution:
$$\mathbf{w} = \left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

 Regularization limits the effective model complexity (the effective number of basis functions used).

 The problem of selecting the appropriate number of basis functions is thus replaced by that of finding a suitable value of the regularization coefficient 𝜆.

 𝜆 controls how many of the weights 𝑤𝑗 (i.e. basis functions) remain effectively non-zero.



Regularized Least Squares
 With a more general regularizer, we have
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^q$$
[Figure: contour plots of the regularizer term alone for 𝑞 = 0.5, 𝑞 = 1, 𝑞 = 2 and 𝑞 = 4, over 𝑤1, 𝑤2 ∈ [−10, 10].]

MatLab code

 𝑞 = 1 is known as the Lasso regularizer. These plots show only the regularizer term
with 𝜆 = 0.7334.
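 As an added illustration (not from the slides), a short Python/Matplotlib sketch that reproduces contour plots of the regularizer term alone for 𝑞 = 0.5, 1, 2, 4 with 𝜆 = 0.7334 (the lecture's own plots were generated with the linked MatLab code):

```python
import numpy as np
import matplotlib.pyplot as plt

lam = 0.7334                                   # value quoted on the slide
w1, w2 = np.meshgrid(np.linspace(-10, 10, 201), np.linspace(-10, 10, 201))

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, q in zip(axes, [0.5, 1, 2, 4]):
    reg = 0.5 * lam * (np.abs(w1) ** q + np.abs(w2) ** q)   # regularizer term only
    ax.contour(w1, w2, reg, levels=15)
    ax.set_title(f"Regularizer term, q = {q}")
    ax.set_xlabel("w1"); ax.set_ylabel("w2")
plt.tight_layout()
plt.show()
```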



Regularized Least Squares
 With a more general regularizer, we have
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^q$$

MatLab code
 𝑞 = 2 corresponds to the quadratic regularizer.



Regularized Least Squares
 Lasso tends to generate sparser solutions than a quadratic regularizer – if 𝜆 is large,
some of the 𝑤𝑗 → 0 (here 𝑤1 = 0).
[Figure: contours of the unregularized error function together with the constraint region for 𝑞 = 2 and 𝑞 = 1.]

 Constraint:
$$\sum_{j=1}^{M}\left|w_j\right|^q \le \eta$$

 Here, we note that the regularized least-squares solution is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint shown above, for some value of 𝜂 (see the proof next).
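 As an added numerical illustration (not from the slides), a small sketch using scikit-learn's Ridge and Lasso on synthetic data in which only a few basis functions are relevant; the data, alpha values, and threshold are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, M = 50, 10
Phi = rng.standard_normal((N, M))
w_true = np.zeros(M)
w_true[:3] = [1.5, -2.0, 0.7]                      # only 3 relevant basis functions
t = Phi @ w_true + 0.1 * rng.standard_normal(N)

ridge = Ridge(alpha=1.0).fit(Phi, t)               # quadratic (q = 2) regularizer
lasso = Lasso(alpha=0.1).fit(Phi, t)               # lasso (q = 1) regularizer
print("ridge non-zero weights:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 10
print("lasso non-zero weights:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically close to 3
```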
Regularized Least Squares
 Let us write the constraint in the equivalent form:
$$\frac{1}{2}\left(\sum_{j=1}^{M}\left|w_j\right|^q - \eta\right) \le 0$$

 This leads to the following Lagrangian function:
$$L(\mathbf{w},\lambda) = \frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\left(\sum_{j=1}^{M}\left|w_j\right|^q - \eta\right)$$

 This is identical, in its dependence on 𝒘, to our regularized least squares (RLS) objective:
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^q \qquad (*)$$
 For a particular 𝜆 > 0, let 𝒘∗ (𝜆) be the solution of the RLS in (*).

 From the Kuhn-Tucker optimality conditions for 𝐿(𝒘, 𝜆), we then see that:
$$\eta = \sum_{j=1}^{M}\left|w_j^*\right|^q$$



Kuhn-Tucker Optimality Conditions
 Consider the following constrained minimization problem:
$$\min_{\mathbf{x}} f(\mathbf{x}), \quad \text{subject to} \quad g(\mathbf{x}) \le 0$$

 This is equivalent to the minimization with respect to 𝒙 and 𝜆 of the following Lagrangian:
$$\min_{\mathbf{x},\lambda} L(\mathbf{x},\lambda) = f(\mathbf{x}) + \lambda g(\mathbf{x})$$

subject to the following Kuhn-Tucker conditions:
$$\lambda \ge 0, \qquad g(\mathbf{x}) \le 0, \qquad \lambda\, g(\mathbf{x}) = 0$$

 Note that for maximization problems, the Lagrangian should be modified as:
$$L(\mathbf{x},\lambda) = f(\mathbf{x}) - \lambda g(\mathbf{x})$$
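 As a quick added sanity check (not from the slides), consider the hypothetical one-dimensional problem 𝑓(𝑥) = 𝑥² with constraint 𝑔(𝑥) = 1 − 𝑥 ≤ 0:
$$L(x,\lambda) = x^2 + \lambda(1-x), \qquad \frac{\partial L}{\partial x} = 2x - \lambda = 0 \;\Rightarrow\; x = \lambda/2$$
If 𝜆 = 0 then 𝑥 = 0, which violates 𝑔(𝑥) ≤ 0; hence 𝜆 > 0 and complementary slackness forces 𝑔(𝑥) = 0, giving 𝑥 = 1, 𝜆 = 2, which indeed satisfies 𝜆 ≥ 0, 𝑔(𝑥) ≤ 0 and 𝜆𝑔(𝑥) = 0.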
Multiple Outputs – Isotropic Covariance
 If we want to predict 𝐾 > 1 target variables, we use the same basis for all components of the target vector:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{W},\beta) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{y}(\mathbf{x},\mathbf{W}),\,\beta^{-1}\mathbf{I}\right) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}),\,\beta^{-1}\mathbf{I}\right)$$
where 𝑾 is an 𝑀 × 𝐾 matrix and 𝒕 is 𝐾-dimensional.

 Given observed inputs, $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$, and targets, $\mathbf{T} = \left[\mathbf{t}_1,\ldots,\mathbf{t}_N\right]^T$, we obtain the log likelihood function
$$\ln p(\mathbf{T}\mid\mathbf{X},\mathbf{W},\beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(\mathbf{t}_n\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n),\,\beta^{-1}\mathbf{I}\right) = \frac{NK}{2}\ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$$



K-Independent Regression Problems
$$\ln p(\mathbf{T}\mid\mathbf{X},\mathbf{W},\beta) = \frac{NK}{2}\ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$$

 As before, we can maximize this function with respect to 𝑾, giving

$$\mathbf{W}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{T}$$
where $\mathbf{W}_{ML}$ is $M\times K$, $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is $M\times M$, $\boldsymbol{\Phi}^T$ is $M\times N$, and $\mathbf{T}$ is $N\times K$.
 If we examine this result for each target variable 𝑡𝑘 , we have (take the 𝑘th column of
𝑾 and 𝑻):
$$\mathbf{w}_{k,\,ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k = \boldsymbol{\Phi}^{\dagger}\mathbf{t}_k$$
which is identical to the single output case (so there is decoupling between the target
variables).

 We thus obtain 𝐾 independent regression problems.
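 As an added numerical check (not from the slides) that the 𝐾 output columns decouple; the synthetic design matrix and targets are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 60, 4, 3
Phi = rng.standard_normal((N, M))                           # N x M design matrix
T = Phi @ rng.standard_normal((M, K)) + 0.05 * rng.standard_normal((N, K))

# Joint ML solution for all K outputs at once: W_ML = (Phi^T Phi)^(-1) Phi^T T.
W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)              # M x K

# Solving K single-output problems, one column of T at a time, gives the same result.
W_cols = np.column_stack(
    [np.linalg.lstsq(Phi, T[:, k], rcond=None)[0] for k in range(K)])
print(np.allclose(W_ml, W_cols))                            # True: the problems decouple
```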



Multiple Outputs – Full Covariance
 Let us repeat the earlier formulation but with covariance matrix 𝚺. If we want to predict 𝐾 > 1 target variables, we use the same basis for all components of the target vector:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{W},\boldsymbol{\Sigma}) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{y}(\mathbf{x},\mathbf{W}),\,\boldsymbol{\Sigma}\right) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}),\,\boldsymbol{\Sigma}\right)$$
where 𝑾 is an 𝑀 × 𝐾 matrix of parameters.

 Given observed inputs, $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$, and targets, $\mathbf{T} = \left[\mathbf{t}_1,\ldots,\mathbf{t}_N\right]^T$, we obtain the log likelihood function $\ln p(\mathbf{T}\mid\mathbf{X},\mathbf{W},\boldsymbol{\Sigma})$:
$$\sum_{n=1}^{N}\ln\mathcal{N}\left(\mathbf{t}_n\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n),\,\boldsymbol{\Sigma}\right) = -\frac{N}{2}\ln\left|\boldsymbol{\Sigma}\right| - \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)$$



Multiple Outputs – Full Covariance
ln p T | X , W ,     ln     t n  W T  ( xn )   1  t n  W T  ( xn ) 
N 1 N T

2 2 n 1

 We maximize this function with respect to 𝑾,

$$\mathbf{0} = \sum_{n=1}^{N}\boldsymbol{\Sigma}^{-1}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)^T \;\Rightarrow\; \mathbf{W}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{T} \quad (M\times K)$$

 For the ML estimate for 𝚺, use the result for the MLE of the covariance of a
multivariate Gaussian:

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\left(\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^T$$

 Note that each column of 𝑾𝑀𝐿 is of the form

$$\mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$
as seen for the isotropic noise distribution, and is independent of 𝚺!
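 As an added illustration (not from the slides), a minimal NumPy sketch computing 𝑾𝑀𝐿 (independently of 𝚺) and then the plug-in MLE of 𝚺; the synthetic data and noise covariance are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 80, 4, 2
Phi = rng.standard_normal((N, M))
noise = rng.multivariate_normal(np.zeros(K), [[0.2, 0.05], [0.05, 0.1]], size=N)
T = Phi @ rng.standard_normal((M, K)) + noise               # correlated output noise

# W_ML does not depend on Sigma.
W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)

# Plug W_ML into the Gaussian covariance MLE: Sigma = (1/N) sum_n r_n r_n^T.
R = T - Phi @ W_ml                                          # N x K residual matrix
Sigma_ml = (R.T @ R) / N
print(Sigma_ml)                                             # roughly the noise covariance
```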


Geometry of Least Squares
 We seek a geometrical interpretation of the least-squares solution in an 𝑁-dimensional space: 𝒕 is a vector in that space with components 𝑡1, . . . , 𝑡𝑁 (𝑁 > 𝑀), and each basis function defines a vector
$$\boldsymbol{\varphi}_j = \left(\phi_j(\mathbf{x}_1),\ldots,\phi_j(\mathbf{x}_N)\right)^T$$

 The least-squares regression function 𝒚 = 𝚽𝒘 is obtained as the orthogonal projection of the data vector 𝒕 onto the subspace spanned by the vectors 𝝋𝑗.

 Note that 𝝋𝑗 is here the 𝑗-th column of 𝚽:
$$\boldsymbol{\Phi} = \left[\boldsymbol{\varphi}_0 \;\; \boldsymbol{\varphi}_1 \;\; \cdots \;\; \boldsymbol{\varphi}_{M-1}\right] = \begin{pmatrix}\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1)\\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2)\\ \vdots & \vdots & & \vdots\\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)\end{pmatrix}$$



Geometry of Least Squares
 We are looking for 𝒘 such that the projection error
$$\mathbf{t} - \mathbf{y} = \mathbf{t} - \boldsymbol{\Phi}\mathbf{w}$$
is orthogonal to each basis vector 𝝋𝑗, i.e. such that
$$\boldsymbol{\Phi}^T\left(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\right) = \mathbf{0}$$

 These are the normal equations we derived earlier.

[Figure: the target vector 𝒕 and its orthogonal projection 𝒚 = 𝚽𝒘 (with components 𝑦𝑛 = 𝑦(𝒙𝑛, 𝒘)) onto the 𝑀-dimensional subspace 𝑆 spanned by the vectors 𝝋𝑗.]

$$\boldsymbol{\Phi}^T = \begin{pmatrix}\boldsymbol{\varphi}_0^T\\ \boldsymbol{\varphi}_1^T\\ \vdots\\ \boldsymbol{\varphi}_{M-1}^T\end{pmatrix}, \qquad \boldsymbol{\Phi} = \left[\boldsymbol{\varphi}_0 \;\; \boldsymbol{\varphi}_1 \;\; \cdots \;\; \boldsymbol{\varphi}_{M-1}\right]$$
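 As an added numerical check (not from the slides) of the normal equations: the residual 𝒕 − 𝚽𝒘 of the least-squares fit is orthogonal to every column of 𝚽; the random 𝚽 and 𝒕 are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 20, 5
Phi = rng.standard_normal((N, M))
t = rng.standard_normal(N)

# Least-squares solution and the corresponding projection y of t onto col(Phi).
w = np.linalg.lstsq(Phi, t, rcond=None)[0]
y = Phi @ w

# The residual t - y is orthogonal to every column of Phi (normal equations).
print(np.allclose(Phi.T @ (t - y), 0.0, atol=1e-8))         # True (up to round-off)
```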



Computing the Bias Parameter
 If we make the bias parameter 𝑤0 explicit, then the error function becomes
$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x}_n)\right\}^2$$
 Setting the derivative with respect to 𝑤0 equal to zero, and solving for 𝑤0, we obtain
$$w_0 = \frac{1}{N}\sum_{n=1}^{N}t_n - \frac{1}{N}\sum_{n=1}^{N}\sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x}_n)$$
$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j\bar{\phi}_j, \qquad \bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(\mathbf{x}_n)$$

 The bias parameter 𝑤0 compensates for the difference between the averages of the
target values and the weighted sum of the averages of the basis function values.
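 As an added numerical check (not from the slides) of the identity 𝑤0 = 𝑡̄ − Σ𝑗 𝑤𝑗 𝜙̄𝑗: fit on column-centered basis functions and centered targets, then recover 𝑤0 from the averages; the synthetic data and basis are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
x = rng.uniform(-1.0, 1.0, N)
t = 3.0 + 2.0 * x - x**2 + 0.05 * rng.standard_normal(N)    # true w0 = 3, w = (2, -1)

Phi = np.column_stack([x, x**2])                            # basis functions phi_1, phi_2

# Fit on centered data, then recover the bias from the averages.
w = np.linalg.lstsq(Phi - Phi.mean(axis=0), t - t.mean(), rcond=None)[0]
w0 = t.mean() - Phi.mean(axis=0) @ w                        # w0 = t_bar - sum_j w_j phi_bar_j
print(w0, w)                                                # approximately 3.0 and (2.0, -1.0)
```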



A Note on Data Centering: Likelihood
 In linear regression, it helps to center the data in a way that does not require us to
compute the offset term 𝑤0 ≡ 𝜇. Write the likelihood as:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\mu,\beta) \propto \exp\left(-\frac{\beta}{2}\left(\mathbf{t} - \mu\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t} - \mu\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)\right)$$
 Let us assume that the input data are centered in each dimension such that:
$$\sum_{j=1}^{N}\phi_i(\mathbf{x}_j) = 0 \qquad \forall\, i = 1,\ldots,M-1$$
$$\boldsymbol{\Phi} = \begin{pmatrix}\boldsymbol{\phi}(\mathbf{x}_1)^T\\ \boldsymbol{\phi}(\mathbf{x}_2)^T\\ \vdots\\ \boldsymbol{\phi}(\mathbf{x}_N)^T\end{pmatrix} = \begin{pmatrix}\phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1)\\ \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2)\\ \vdots & \vdots & & \vdots\\ \phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)\end{pmatrix} = \left[\boldsymbol{\phi}_1 \;\; \boldsymbol{\phi}_2 \;\; \cdots \;\; \boldsymbol{\phi}_{M-1}\right], \qquad \boldsymbol{\phi}_i = \left(\phi_i(\mathbf{x}_1),\ldots,\phi_i(\mathbf{x}_N)\right)^T$$
 The mean of the output is equally likely to be positive or negative. Let us put an
improper prior 𝑝(𝜇) ∝ 1 and integrate 𝜇 out.



A Note on Data Centering: Likelihood
 Introducing $\bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i$, the marginal likelihood becomes:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) \propto \int \exp\left(-\frac{\beta}{2}\Big(\underbrace{\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}}_{\mathbf{A}} - (\mu - \bar{t})\mathbf{1}_N\Big)^T\Big(\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w} - (\mu - \bar{t})\mathbf{1}_N\Big)\right)d\mu$$
 Completing the square in 𝜇 gives (using the centering of 𝜱):
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) \propto \int \exp\left(-\frac{\beta}{2}\left[(\mu - \bar{t})^2 N - 2(\mu - \bar{t})\,\mathbf{A}^T\mathbf{1}_N + \mathbf{A}^T\mathbf{A}\right]\right)d\mu$$
where the cross term vanishes,
$$\mathbf{A}^T\mathbf{1}_N = \left(\mathbf{1}_N^T\mathbf{t} - \bar{t}N\right) - \mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{1}_N = 0 - 0 = 0,$$
so the Gaussian integral over 𝜇 contributes only a constant and
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\left(\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)\right)$$
 Our model is now simplified if instead of 𝒕 we use the centered output $\hat{\mathbf{t}} = \mathbf{t} - \bar{t}\mathbf{1}_N$, and the likelihood is simply written as:
$$p(\hat{\mathbf{t}}\mid\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\left(\hat{\mathbf{t}} - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\hat{\mathbf{t}} - \boldsymbol{\Phi}\mathbf{w}\right)\right)$$

 Recall that the MLE estimate for 𝜇 is
$$\hat{\mu} = \bar{t} - \sum_{j=1}^{M-1}\bar{\phi}_j\, w_j$$
where $\bar{\phi}_1,\ldots,\bar{\phi}_{M-1}$ are formed by taking the average of each column of 𝜱.
A Note on Data Centering: MLE of 𝑤0
 As an example, consider a linear regression model of the form
$$\mathbb{E}\left[y \mid \mathbf{x}\right] = w_0 + \mathbf{w}^T\mathbf{x}$$

 In the context e.g. of MLE, we need to minimize
$$\min_{w_0,\mathbf{w}} \sum_{i=1}^{N}\left(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\right)^2$$

 Minimization with respect to 𝑤0 gives:
$$\sum_{i=1}^{N}\left(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\right) = 0 \;\Rightarrow\; w_0 N = \bar{t}N - N\mathbf{w}^T\bar{\mathbf{x}} \;\Rightarrow\; \hat{w}_0 = \bar{t} - \mathbf{w}^T\bar{\mathbf{x}}$$
where
$$\bar{\mathbf{x}} = \begin{pmatrix}\bar{x}_1\\ \bar{x}_2\\ \vdots\\ \bar{x}_M\end{pmatrix} = \begin{pmatrix}\sum_{i=1}^{N} x_{i1}/N\\ \sum_{i=1}^{N} x_{i2}/N\\ \vdots\\ \sum_{i=1}^{N} x_{iM}/N\end{pmatrix}, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i$$

 Thus:
$$\hat{w}_0 = \bar{t} - \bar{\mathbf{x}}^T\mathbf{w}$$
A Note on Data Centering: MLE of 𝑤
 Substituting the bias term in our objective function gives:
$$\min_{\mathbf{w}} \sum_{i=1}^{N}\left(t_i - \bar{t} - \mathbf{w}^T\mathbf{x}_i + \mathbf{w}^T\bar{\mathbf{x}}\right)^2 = \min_{\mathbf{w}} \sum_{i=1}^{N}\left(t_i - \bar{t} - \mathbf{w}^T\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\right)^2$$

 Minimization with respect to 𝒘 gives:
$$\left[\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)^T\right]\hat{\mathbf{w}} = \sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(t_i - \bar{t}\right)$$
 We thus first compute the MLE of 𝒘 using the centered inputs and outputs as follows:
$$\hat{\mathbf{w}} = \left(\mathbf{X}_c^T\mathbf{X}_c\right)^{-1}\mathbf{X}_c^T\mathbf{t}_c = \left[\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)^T\right]^{-1}\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(t_i - \bar{t}\right)$$
where
$$\mathbf{X}_c = \mathbf{X} - \bar{\mathbf{X}} = \mathbf{X} - \mathbf{1}_N\bar{\mathbf{x}}^T = \begin{pmatrix}\mathbf{x}_1^T - \bar{\mathbf{x}}^T\\ \mathbf{x}_2^T - \bar{\mathbf{x}}^T\\ \vdots\\ \mathbf{x}_N^T - \bar{\mathbf{x}}^T\end{pmatrix} = \begin{pmatrix}x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1M}-\bar{x}_M\\ x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2M}-\bar{x}_M\\ \vdots & \vdots & & \vdots\\ x_{N1}-\bar{x}_1 & x_{N2}-\bar{x}_2 & \cdots & x_{NM}-\bar{x}_M\end{pmatrix},$$
$$\mathbf{t}_c = \mathbf{t} - \bar{\mathbf{t}} = \mathbf{t} - \mathbf{1}_N\bar{t}, \qquad \bar{\mathbf{x}} = \begin{pmatrix}\sum_{i=1}^{N}x_{i1}/N\\ \vdots\\ \sum_{i=1}^{N}x_{iM}/N\end{pmatrix}, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i$$
 We can then compute the MLE estimate of 𝑤0 as follows:
$$\hat{w}_0 = \bar{t} - \bar{\mathbf{x}}^T\hat{\mathbf{w}}$$
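 As an added illustration (not from the slides), a minimal NumPy sketch of this centered-data recipe, checked against a direct fit with an explicit bias column; the synthetic 𝑿 and 𝒕 are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 100, 3
X = rng.standard_normal((N, M)) + 2.0                       # uncentered inputs
t = 1.5 + X @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.standard_normal(N)

# MLE of w from centered inputs and outputs, then w0 from the averages.
x_bar, t_bar = X.mean(axis=0), t.mean()
Xc, tc = X - x_bar, t - t_bar
w_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ tc)
w0_hat = t_bar - x_bar @ w_hat

# Check against a direct least-squares fit with an explicit bias column.
W_full = np.linalg.lstsq(np.column_stack([np.ones(N), X]), t, rcond=None)[0]
print(np.allclose(W_full, np.concatenate([[w0_hat], w_hat])))   # True
```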
