
Ridge Regression and Regularization Methods
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

September 23, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
 Ridge Regression, Lasso Regularizer and Sparse Solutions

 Multi-output Regression

 Geometry of least squares

 Computing the Bias Parameter

 Centered Data

Goals
 The goals for today's lecture are the following:

 Learn about various regularized least-squares solutions to regression problems, and in particular about Ridge regression

 Understand the geometry of least squares

 Understand how to solve multi-output regression problems

 Learn how to compute the bias parameter and how to perform data centering



Regularized Least Squares – Ridge Regression
 Consider the error function:
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
(data term + regularization term)

 With the sum-of-squares error function and a quadratic regularizer, we get
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}$$

 Setting the gradient with respect to 𝒘 to zero, and solving for 𝒘 as before, we obtain
$$\mathbf{w} = \left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

 This is a trivial extension of the least-squares solution we encountered earlier; the method is known as regularized least squares or ridge regression.
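 As an added illustration (not part of the original slides), a minimal NumPy sketch of the closed-form ridge solution above; the synthetic data, polynomial basis, and value of 𝜆 are hypothetical choices:

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """Closed-form ridge solution w = (lam*I + Phi^T Phi)^(-1) Phi^T t."""
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(A, Phi.T @ t)

# Hypothetical example: polynomial basis on synthetic 1D data.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
Phi = np.vander(x, N=10, increasing=True)          # N x M design matrix (M = 10)
print(ridge_solution(Phi, t, lam=1e-3))
```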



Regularized Least Squares
 Regularized solution:
$$\mathbf{w} = \left(\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

 Regularization limits the effective model complexity (the effective number of basis functions used).

 The problem of selecting the appropriate number of basis functions is thus replaced by that of finding a suitable value of the regularization coefficient 𝜆.

 𝜆 controls how many of the weights 𝑤𝑗 (i.e. basis functions) remain effectively non-zero.



Regularized Least Squares
 With a more general regularizer, we have
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^q$$
[Figure: contour plots of the regularizer term alone for 𝑞 = 0.5, 𝑞 = 1, 𝑞 = 2 and 𝑞 = 4, over 𝑤1, 𝑤2 ∈ [−10, 10].]

MatLab code

 𝑞 = 1 is known as the Lasso regularizer. These plots show only the regularizer term
with 𝜆 = 0.7334.
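 As an added illustration (not from the slides), a short Python/Matplotlib sketch that reproduces contour plots of the regularizer term alone for 𝑞 = 0.5, 1, 2, 4 with 𝜆 = 0.7334 (the lecture's own plots were generated with the linked MatLab code):

```python
import numpy as np
import matplotlib.pyplot as plt

lam = 0.7334                                   # value quoted on the slide
w1, w2 = np.meshgrid(np.linspace(-10, 10, 201), np.linspace(-10, 10, 201))

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, q in zip(axes, [0.5, 1, 2, 4]):
    reg = 0.5 * lam * (np.abs(w1) ** q + np.abs(w2) ** q)   # regularizer term only
    ax.contour(w1, w2, reg, levels=15)
    ax.set_title(f"Regularizer term, q = {q}")
    ax.set_xlabel("w1"); ax.set_ylabel("w2")
plt.tight_layout()
plt.show()
```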



Regularized Least Squares
 With a more general regularizer, we have
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^q$$

MatLab code
 𝑞 = 2 corresponds to the quadratic regularizer.



Regularized Least Squares
 Lasso tends to generate sparser solutions than a quadratic regularizer – if 𝜆 is large,
some of the 𝑤𝑗 → 0 (here 𝑤1 = 0).
[Figure: contours of the unregularized error function together with the constraint region for 𝑞 = 2 and 𝑞 = 1.]

 Constraint:
$$\sum_{j=1}^{M}\left|w_j\right|^q \le \eta$$

 Here, we note that the regularized least-squares solution is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint shown above, for some value of 𝜂 (see the proof next).
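 As an added numerical illustration (not from the slides), a small sketch using scikit-learn's Ridge and Lasso on synthetic data in which only a few basis functions are relevant; the data, alpha values, and threshold are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, M = 50, 10
Phi = rng.standard_normal((N, M))
w_true = np.zeros(M)
w_true[:3] = [1.5, -2.0, 0.7]                      # only 3 relevant basis functions
t = Phi @ w_true + 0.1 * rng.standard_normal(N)

ridge = Ridge(alpha=1.0).fit(Phi, t)               # quadratic (q = 2) regularizer
lasso = Lasso(alpha=0.1).fit(Phi, t)               # lasso (q = 1) regularizer
print("ridge non-zero weights:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 10
print("lasso non-zero weights:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically close to 3
```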
Regularized Least Squares
 Let us write the constraint in the equivalent form:
$$\frac{1}{2}\left(\sum_{j=1}^{M}\left|w_j\right|^q - \eta\right) \le 0$$

 This leads to the following Lagrangian function:
$$L(\mathbf{w},\lambda) = \frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\left(\sum_{j=1}^{M}\left|w_j\right|^q - \eta\right)$$

 This is identical, in its dependence on 𝒘, to our regularized least squares (RLS) objective:
$$\frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M}\left|w_j\right|^q \qquad (*)$$
 For a particular 𝜆 > 0, let 𝒘∗ (𝜆) be the solution of the RLS in (*).

 From the Kuhn-Tucker optimality conditions for 𝐿(𝒘, 𝜆), we then see that:
$$\eta = \sum_{j=1}^{M}\left|w_j^*\right|^q$$



Kuhn-Tucker Optimality Conditions
 Consider the following constrained minimization problem:
$$\min_{\mathbf{x}} f(\mathbf{x}), \quad \text{subject to} \quad g(\mathbf{x}) \le 0$$

 This is equivalent to the minimization with respect to 𝒙 and 𝜆 of the following Lagrangian:
$$\min_{\mathbf{x},\lambda} L(\mathbf{x},\lambda) = f(\mathbf{x}) + \lambda g(\mathbf{x})$$

subject to the following Kuhn-Tucker conditions:
$$\lambda \ge 0, \qquad g(\mathbf{x}) \le 0, \qquad \lambda\, g(\mathbf{x}) = 0$$

 Note that for maximization problems, the Lagrangian should be modified as:
$$L(\mathbf{x},\lambda) = f(\mathbf{x}) - \lambda g(\mathbf{x})$$
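 As a quick added sanity check (not from the slides), consider the hypothetical one-dimensional problem 𝑓(𝑥) = 𝑥² with constraint 𝑔(𝑥) = 1 − 𝑥 ≤ 0:
$$L(x,\lambda) = x^2 + \lambda(1-x), \qquad \frac{\partial L}{\partial x} = 2x - \lambda = 0 \;\Rightarrow\; x = \lambda/2$$
If 𝜆 = 0 then 𝑥 = 0, which violates 𝑔(𝑥) ≤ 0; hence 𝜆 > 0 and complementary slackness forces 𝑔(𝑥) = 0, giving 𝑥 = 1, 𝜆 = 2, which indeed satisfies 𝜆 ≥ 0, 𝑔(𝑥) ≤ 0 and 𝜆𝑔(𝑥) = 0.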
Multiple Outputs – Isotropic Covariance
 If we want to predict 𝐾 > 1 target variables, we use the same basis for all components of the target vector:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{W},\beta) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{y}(\mathbf{x},\mathbf{W}),\,\beta^{-1}\mathbf{I}\right) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}),\,\beta^{-1}\mathbf{I}\right)$$
where 𝑾 is an 𝑀 × 𝐾 matrix and 𝒕 is 𝐾-dimensional.

 Given observed inputs, $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$, and targets, $\mathbf{T} = \left[\mathbf{t}_1,\ldots,\mathbf{t}_N\right]^T$, we obtain the log likelihood function
$$\ln p(\mathbf{T}\mid\mathbf{X},\mathbf{W},\beta) = \sum_{n=1}^{N}\ln\mathcal{N}\left(\mathbf{t}_n\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n),\,\beta^{-1}\mathbf{I}\right) = \frac{NK}{2}\ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$$



K-Independent Regression Problems
$$\ln p(\mathbf{T}\mid\mathbf{X},\mathbf{W},\beta) = \frac{NK}{2}\ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{n=1}^{N}\left\|\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right\|^2$$

 As before, we can maximize this function with respect to 𝑾, giving

$$\mathbf{W}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{T}$$
where $\mathbf{W}_{ML}$ is $M\times K$, $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ is $M\times M$, $\boldsymbol{\Phi}^T$ is $M\times N$, and $\mathbf{T}$ is $N\times K$.
 If we examine this result for each target variable 𝑡𝑘 , we have (take the 𝑘th column of
𝑾 and 𝑻):
$$\mathbf{w}_{k,\,ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}_k = \boldsymbol{\Phi}^{\dagger}\mathbf{t}_k$$
which is identical to the single output case (so there is decoupling between the target
variables).

 We thus obtain 𝐾 independent regression problems.
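 As an added numerical check (not from the slides) that the 𝐾 output columns decouple; the synthetic design matrix and targets are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 60, 4, 3
Phi = rng.standard_normal((N, M))                           # N x M design matrix
T = Phi @ rng.standard_normal((M, K)) + 0.05 * rng.standard_normal((N, K))

# Joint ML solution for all K outputs at once: W_ML = (Phi^T Phi)^(-1) Phi^T T.
W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)              # M x K

# Solving K single-output problems, one column of T at a time, gives the same result.
W_cols = np.column_stack(
    [np.linalg.lstsq(Phi, T[:, k], rcond=None)[0] for k in range(K)])
print(np.allclose(W_ml, W_cols))                            # True: the problems decouple
```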



Multiple Outputs – Full Covariance
 Let us repeat the earlier formulation but with covariance matrix 𝚺. If we want to predict 𝐾 > 1 target variables, we use the same basis for all components of the target vector:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{W},\boldsymbol{\Sigma}) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{y}(\mathbf{x},\mathbf{W}),\,\boldsymbol{\Sigma}\right) = \mathcal{N}\left(\mathbf{t}\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}),\,\boldsymbol{\Sigma}\right)$$
where 𝑾 is an 𝑀 × 𝐾 matrix of parameters.

 Given observed inputs, $\mathbf{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$, and targets, $\mathbf{T} = \left[\mathbf{t}_1,\ldots,\mathbf{t}_N\right]^T$, we obtain the log likelihood function $\ln p(\mathbf{T}\mid\mathbf{X},\mathbf{W},\boldsymbol{\Sigma})$:
$$\sum_{n=1}^{N}\ln\mathcal{N}\left(\mathbf{t}_n\mid\mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n),\,\boldsymbol{\Sigma}\right) = -\frac{N}{2}\ln\left|\boldsymbol{\Sigma}\right| - \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)$$



Multiple Outputs – Full Covariance
ln p T | X , W ,     ln     t n  W T  ( xn )   1  t n  W T  ( xn ) 
N 1 N T

2 2 n 1

 We maximize this function with respect to 𝑾,

$$\mathbf{0} = \sum_{n=1}^{N}\boldsymbol{\Sigma}^{-1}\left(\mathbf{t}_n - \mathbf{W}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\boldsymbol{\phi}(\mathbf{x}_n)^T \;\Rightarrow\; \mathbf{W}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{T} \quad (M\times K)$$

 For the ML estimate for 𝚺, use the result for the MLE of the covariance of a
multivariate Gaussian:

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)\left(\mathbf{t}_n - \mathbf{W}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^T$$

 Note that each column of 𝑾𝑀𝐿 is of the form

$$\mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$
as seen for the isotropic noise distribution, and is independent of 𝚺!
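 As an added illustration (not from the slides), a minimal NumPy sketch computing 𝑾𝑀𝐿 (independently of 𝚺) and then the plug-in MLE of 𝚺; the synthetic data and noise covariance are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 80, 4, 2
Phi = rng.standard_normal((N, M))
noise = rng.multivariate_normal(np.zeros(K), [[0.2, 0.05], [0.05, 0.1]], size=N)
T = Phi @ rng.standard_normal((M, K)) + noise               # correlated output noise

# W_ML does not depend on Sigma.
W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)

# Plug W_ML into the Gaussian covariance MLE: Sigma = (1/N) sum_n r_n r_n^T.
R = T - Phi @ W_ml                                          # N x K residual matrix
Sigma_ml = (R.T @ R) / N
print(Sigma_ml)                                             # roughly the noise covariance
```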


Geometry of Least Squares
 We seek a geometrical interpretation of the least-squares solution in an 𝑁-dimensional space: 𝒕 is a vector in that space with components 𝑡1, . . . , 𝑡𝑁 (𝑁 > 𝑀), and each basis function defines a vector
$$\boldsymbol{\varphi}_j = \left(\phi_j(\mathbf{x}_1),\ldots,\phi_j(\mathbf{x}_N)\right)^T$$

 The least-squares regression function 𝒚 = 𝚽𝒘 is obtained as the orthogonal projection of the data vector 𝒕 onto the subspace spanned by the vectors 𝝋𝑗.

 Note that 𝝋𝑗 is here the 𝑗-th column of 𝚽:
$$\boldsymbol{\Phi} = \left[\boldsymbol{\varphi}_0 \;\; \boldsymbol{\varphi}_1 \;\; \cdots \;\; \boldsymbol{\varphi}_{M-1}\right] = \begin{pmatrix}\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1)\\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2)\\ \vdots & \vdots & & \vdots\\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)\end{pmatrix}$$



Geometry of Least Squares
 We are looking for 𝒘 such that the projection error
$$\mathbf{t} - \mathbf{y} = \mathbf{t} - \boldsymbol{\Phi}\mathbf{w}$$
is orthogonal to each basis vector 𝝋𝑗, i.e. such that
$$\boldsymbol{\Phi}^T\left(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\right) = \mathbf{0}$$

 These are the normal equations we derived earlier.

[Figure: the target vector 𝒕 and its orthogonal projection 𝒚 = 𝚽𝒘 (with components 𝑦𝑛 = 𝑦(𝒙𝑛, 𝒘)) onto the 𝑀-dimensional subspace 𝑆 spanned by the vectors 𝝋𝑗.]

$$\boldsymbol{\Phi}^T = \begin{pmatrix}\boldsymbol{\varphi}_0^T\\ \boldsymbol{\varphi}_1^T\\ \vdots\\ \boldsymbol{\varphi}_{M-1}^T\end{pmatrix}, \qquad \boldsymbol{\Phi} = \left[\boldsymbol{\varphi}_0 \;\; \boldsymbol{\varphi}_1 \;\; \cdots \;\; \boldsymbol{\varphi}_{M-1}\right]$$
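 As an added numerical check (not from the slides) of the normal equations: the residual 𝒕 − 𝚽𝒘 of the least-squares fit is orthogonal to every column of 𝚽; the random 𝚽 and 𝒕 are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 20, 5
Phi = rng.standard_normal((N, M))
t = rng.standard_normal(N)

# Least-squares solution and the corresponding projection y of t onto col(Phi).
w = np.linalg.lstsq(Phi, t, rcond=None)[0]
y = Phi @ w

# The residual t - y is orthogonal to every column of Phi (normal equations).
print(np.allclose(Phi.T @ (t - y), 0.0, atol=1e-8))         # True (up to round-off)
```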



Computing the Bias Parameter
 If we make the bias parameter 𝑤0 explicit, then the error function becomes
$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x}_n)\right\}^2$$
 Setting the derivative with respect to 𝑤0 equal to zero, and solving for 𝑤0, we obtain
$$w_0 = \frac{1}{N}\sum_{n=1}^{N}t_n - \frac{1}{N}\sum_{n=1}^{N}\sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x}_n)$$
$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j\bar{\phi}_j, \qquad \bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(\mathbf{x}_n)$$

 The bias parameter 𝑤0 compensates for the difference between the averages of the
target values and the weighted sum of the averages of the basis function values.
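 As an added numerical check (not from the slides) of the identity 𝑤0 = 𝑡̄ − Σ𝑗 𝑤𝑗 𝜙̄𝑗: fit on column-centered basis functions and centered targets, then recover 𝑤0 from the averages; the synthetic data and basis are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
x = rng.uniform(-1.0, 1.0, N)
t = 3.0 + 2.0 * x - x**2 + 0.05 * rng.standard_normal(N)    # true w0 = 3, w = (2, -1)

Phi = np.column_stack([x, x**2])                            # basis functions phi_1, phi_2

# Fit on centered data, then recover the bias from the averages.
w = np.linalg.lstsq(Phi - Phi.mean(axis=0), t - t.mean(), rcond=None)[0]
w0 = t.mean() - Phi.mean(axis=0) @ w                        # w0 = t_bar - sum_j w_j phi_bar_j
print(w0, w)                                                # approximately 3.0 and (2.0, -1.0)
```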



A Note on Data Centering: Likelihood
 In linear regression, it helps to center the data in a way that does not require us to
compute the offset term 𝑤0 ≡ 𝜇. Write the likelihood as:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\mu,\beta) \propto \exp\left(-\frac{\beta}{2}\left(\mathbf{t} - \mu\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t} - \mu\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)\right)$$
 Let us assume that the input data are centered in each dimension such that:
$$\sum_{j=1}^{N}\phi_i(\mathbf{x}_j) = 0 \qquad \forall\, i = 1,\ldots,M-1$$
$$\boldsymbol{\Phi} = \begin{pmatrix}\boldsymbol{\phi}(\mathbf{x}_1)^T\\ \boldsymbol{\phi}(\mathbf{x}_2)^T\\ \vdots\\ \boldsymbol{\phi}(\mathbf{x}_N)^T\end{pmatrix} = \begin{pmatrix}\phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1)\\ \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2)\\ \vdots & \vdots & & \vdots\\ \phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)\end{pmatrix} = \left[\boldsymbol{\phi}_1 \;\; \boldsymbol{\phi}_2 \;\; \cdots \;\; \boldsymbol{\phi}_{M-1}\right], \qquad \boldsymbol{\phi}_i = \left(\phi_i(\mathbf{x}_1),\ldots,\phi_i(\mathbf{x}_N)\right)^T$$
 The mean of the output is equally likely to be positive or negative. Let us put an
improper prior 𝑝(𝜇) ∝ 1 and integrate 𝜇 out.



A Note on Data Centering: Likelihood
 Introducing $\bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i$, the marginal likelihood becomes:
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) \propto \int \exp\left(-\frac{\beta}{2}\Big(\underbrace{\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}}_{\mathbf{A}} - (\mu - \bar{t})\mathbf{1}_N\Big)^T\Big(\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w} - (\mu - \bar{t})\mathbf{1}_N\Big)\right)d\mu$$
 Completing the square in 𝜇 gives (using the centering of 𝜱):
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) \propto \int \exp\left(-\frac{\beta}{2}\left[(\mu - \bar{t})^2 N - 2(\mu - \bar{t})\,\mathbf{A}^T\mathbf{1}_N + \mathbf{A}^T\mathbf{A}\right]\right)d\mu$$
where the cross term vanishes,
$$\mathbf{A}^T\mathbf{1}_N = \left(\mathbf{1}_N^T\mathbf{t} - \bar{t}N\right) - \mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{1}_N = 0 - 0 = 0,$$
so the Gaussian integral over 𝜇 contributes only a constant and
$$p(\mathbf{t}\mid\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\left(\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\mathbf{t} - \bar{t}\mathbf{1}_N - \boldsymbol{\Phi}\mathbf{w}\right)\right)$$
 Our model is now simplified if instead of 𝒕 we use the centered output $\hat{\mathbf{t}} = \mathbf{t} - \bar{t}\mathbf{1}_N$, and the likelihood is simply written as:
$$p(\hat{\mathbf{t}}\mid\mathbf{x},\mathbf{w},\beta) \propto \exp\left(-\frac{\beta}{2}\left(\hat{\mathbf{t}} - \boldsymbol{\Phi}\mathbf{w}\right)^T\left(\hat{\mathbf{t}} - \boldsymbol{\Phi}\mathbf{w}\right)\right)$$

 Recall that the MLE estimate for 𝜇 is
$$\hat{\mu} = \bar{t} - \sum_{j=1}^{M-1}\bar{\phi}_j\, w_j$$
where $\bar{\phi}_1,\ldots,\bar{\phi}_{M-1}$ are formed by taking the average of each column of 𝜱.
A Note on Data Centering: MLE of 𝑤0
 As an example, consider a linear regression model of the form
$$\mathbb{E}\left[y \mid \mathbf{x}\right] = w_0 + \mathbf{w}^T\mathbf{x}$$

 In the context e.g. of MLE, we need to minimize
$$\min_{w_0,\mathbf{w}} \sum_{i=1}^{N}\left(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\right)^2$$

 Minimization with respect to 𝑤0 gives:
$$\sum_{i=1}^{N}\left(t_i - w_0 - \mathbf{w}^T\mathbf{x}_i\right) = 0 \;\Rightarrow\; w_0 N = \bar{t}N - N\mathbf{w}^T\bar{\mathbf{x}} \;\Rightarrow\; \hat{w}_0 = \bar{t} - \mathbf{w}^T\bar{\mathbf{x}}$$
where
$$\bar{\mathbf{x}} = \begin{pmatrix}\bar{x}_1\\ \bar{x}_2\\ \vdots\\ \bar{x}_M\end{pmatrix} = \begin{pmatrix}\sum_{i=1}^{N} x_{i1}/N\\ \sum_{i=1}^{N} x_{i2}/N\\ \vdots\\ \sum_{i=1}^{N} x_{iM}/N\end{pmatrix}, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i$$

 Thus:
$$\hat{w}_0 = \bar{t} - \bar{\mathbf{x}}^T\mathbf{w}$$
A Note on Data Centering: MLE of 𝑤
 Substituting the bias term in our objective function gives:
$$\min_{\mathbf{w}} \sum_{i=1}^{N}\left(t_i - \bar{t} - \mathbf{w}^T\mathbf{x}_i + \mathbf{w}^T\bar{\mathbf{x}}\right)^2 = \min_{\mathbf{w}} \sum_{i=1}^{N}\left(t_i - \bar{t} - \mathbf{w}^T\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\right)^2$$

 Minimization with respect to 𝒘 gives:
$$\left[\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)^T\right]\hat{\mathbf{w}} = \sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(t_i - \bar{t}\right)$$
 We thus first compute the MLE of 𝒘 using the centered inputs and outputs as follows:
$$\hat{\mathbf{w}} = \left(\mathbf{X}_c^T\mathbf{X}_c\right)^{-1}\mathbf{X}_c^T\mathbf{t}_c = \left[\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)^T\right]^{-1}\sum_{i=1}^{N}\left(\mathbf{x}_i - \bar{\mathbf{x}}\right)\left(t_i - \bar{t}\right)$$
where
$$\mathbf{X}_c = \mathbf{X} - \bar{\mathbf{X}} = \mathbf{X} - \mathbf{1}_N\bar{\mathbf{x}}^T = \begin{pmatrix}\mathbf{x}_1^T - \bar{\mathbf{x}}^T\\ \mathbf{x}_2^T - \bar{\mathbf{x}}^T\\ \vdots\\ \mathbf{x}_N^T - \bar{\mathbf{x}}^T\end{pmatrix} = \begin{pmatrix}x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1M}-\bar{x}_M\\ x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2M}-\bar{x}_M\\ \vdots & \vdots & & \vdots\\ x_{N1}-\bar{x}_1 & x_{N2}-\bar{x}_2 & \cdots & x_{NM}-\bar{x}_M\end{pmatrix},$$
$$\mathbf{t}_c = \mathbf{t} - \bar{\mathbf{t}} = \mathbf{t} - \mathbf{1}_N\bar{t}, \qquad \bar{\mathbf{x}} = \begin{pmatrix}\sum_{i=1}^{N}x_{i1}/N\\ \vdots\\ \sum_{i=1}^{N}x_{iM}/N\end{pmatrix}, \qquad \bar{t} = \frac{1}{N}\sum_{i=1}^{N}t_i$$
 We can then compute the MLE estimate of 𝑤0 as follows:
$$\hat{w}_0 = \bar{t} - \bar{\mathbf{x}}^T\hat{\mathbf{w}}$$
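 As an added illustration (not from the slides), a minimal NumPy sketch of this centered-data recipe, checked against a direct fit with an explicit bias column; the synthetic 𝑿 and 𝒕 are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M = 100, 3
X = rng.standard_normal((N, M)) + 2.0                       # uncentered inputs
t = 1.5 + X @ np.array([0.5, -1.0, 2.0]) + 0.05 * rng.standard_normal(N)

# MLE of w from centered inputs and outputs, then w0 from the averages.
x_bar, t_bar = X.mean(axis=0), t.mean()
Xc, tc = X - x_bar, t - t_bar
w_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ tc)
w0_hat = t_bar - x_bar @ w_hat

# Check against a direct least-squares fit with an explicit bias column.
W_full = np.linalg.lstsq(np.column_stack([np.ones(N), X]), t, rcond=None)[0]
print(np.allclose(W_full, np.concatenate([[w0_hat], w_hat])))   # True
```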
