1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Regression
In the regression task, the response attribute Y is modeled as an (unknown) function of the independent attributes X1, X2, · · · , Xd, plus a random error term ε:

Y = f(X1, X2, · · · , Xd) + ε = f(X) + ε
Linear Regression
f(X) = β + ω1 X1 + ω2 X2 + · · · + ωd Xd = β + Σ_{i=1}^{d} ωi Xi = β + ω^T X
β is the true (unknown) bias term, ωi is the true (unknown) regression coefficient
or weight for attribute Xi , and ω = (ω1 , ω2 , · · · , ωd )T is the true d-dimensional
weight vector.
f specifies a hyperplane in R^{d+1}, where ω is the weight vector that is normal or orthogonal to the hyperplane, and β is the intercept or offset term.
f is completely specified by the d + 1 parameters comprising β and ωi , for
i = 1, · · · , d.
Linear regression
A common approach to estimating the bias and regression coefficients is the method of least squares.
Given the training data D with points x i and response values yi (for i = 1, · · · , n),
we seek values b and w , so as to minimize the sum of squared residual errors
(SSE)
SSE = Σ_{i=1}^{n} ε_i^2 = Σ_{i=1}^{n} (yi − ŷi)^2 = Σ_{i=1}^{n} (yi − b − w^T x_i)^2

where ŷi = f(x_i) = b + w^T x_i is the predicted value for x_i.
Bivariate Regression
The residual error is εi = yi − ŷi, and the best-fit line is the one that minimizes the SSE:

min_{b,w} SSE = Σ_{i=1}^{n} ε_i^2 = Σ_{i=1}^{n} (yi − ŷi)^2 = Σ_{i=1}^{n} (yi − b − w · xi)^2

Differentiating with respect to b and setting the result to zero, we have

b = µY − w · µX
Bivariate Regression
Differentiating with respect to w, we obtain

∂SSE/∂w = −2 Σ_{i=1}^{n} xi (yi − b − w · xi) = 0

⟹ Σ_{i=1}^{n} xi · yi − b Σ_{i=1}^{n} xi − w Σ_{i=1}^{n} xi^2 = 0

⟹ Σ_{i=1}^{n} xi · yi − µY Σ_{i=1}^{n} xi + w · µX Σ_{i=1}^{n} xi − w Σ_{i=1}^{n} xi^2 = 0

⟹ w = (Σ_{i=1}^{n} xi · yi − n · µX · µY) / (Σ_{i=1}^{n} xi^2 − n · µX^2)

Equivalently, in terms of the centered attributes,

w = Σ_{i=1}^{n} (xi − µX)(yi − µY) / Σ_{i=1}^{n} (xi − µX)^2 = σXY / σX^2 = cov(X, Y) / var(X)
Bivariate Regression
Consider the two attributes petal length (X; the predictor variable) and petal width (Y; the response variable) in the Iris dataset (n = 150).
(Figure: scatter plot of X: petal length versus Y: petal width for the 150 Iris points.)
Bivariate Regression
Example
The least-squares estimates are w = 0.4164 and b = −0.3665, yielding the fitted line

ŷ = −0.3665 + 0.4164 · x
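These coefficients can be computed directly from the covariance formula above. A minimal NumPy sketch (the array names x and y for the petal-length and petal-width columns are assumptions, not part of the original example):

import numpy as np

def bivariate_least_squares(x, y):
    """Fit y = b + w * x by minimizing the sum of squared errors."""
    mu_x, mu_y = x.mean(), y.mean()
    xc = x - mu_x                                # centered predictor
    w = np.dot(xc, y - mu_y) / np.dot(xc, xc)    # w = cov(X, Y) / var(X)
    b = mu_y - w * mu_x                          # b = mu_Y - w * mu_X
    return b, w

# usage, assuming x and y hold the 150 petal-length / petal-width values:
# b, w = bivariate_least_squares(x, y)
# sse = np.sum((y - (b + w * x)) ** 2)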
Bivariate Regression
Example
Scatter plot of petal length (X) versus petal width (Y). The solid (black) circle shows the mean point; the residual error is shown for two sample points, x9 and x35.
(Figure: the best-fit line, with the residual errors ε9 and ε35 marked for points x9 and x35.)
Geometry of Bivariate Regression
The predicted response vector is Ŷ = b · 1 + w · X, and the residual error vector is ε = Y − Ŷ.

(Figure: the vectors 1 and X in the n-dimensional point space, with the predicted vector Ŷ lying in their span; θ is the angle between 1 and X.)
Geometry of Bivariate Regression
Orthogonal decomposition of X into µX · 1 and the centered vector X̄.

Even though 1 and X are linearly independent and form a basis for the column space, they need not be orthogonal.

We can create an orthogonal basis by decomposing X into a component along 1 and a component orthogonal to 1:

X = µX · 1 + (X − µX · 1) = µX · 1 + X̄

where X̄ = X − µX · 1 is the centered attribute vector.

(Figure: decomposition of X into the component µX · 1 along 1 and the orthogonal component X̄ = X − µX · 1; θ is the angle between 1 and X.)
Geometry of Regression
The optimal Ŷ that minimizes the error is the orthogonal projection of Y onto the subspace spanned by 1 and X.

The residual error vector ε is thus orthogonal to the subspace spanned by 1 and X, and its squared length (or magnitude) equals the SSE value.

Summarizing, the fitted coefficients are obtained as scalar projections:

µY = proj_1(Y)        w = proj_X̄(Y)        b = µY − w · µX

(Figure: Y and its projection Ŷ onto the subspace spanned by 1 and X̄; the scalar projections µY (onto 1) and w (onto X̄) are marked, and ε = Y − Ŷ is orthogonal to the subspace.)
Geometry of Regression
Example
Let us consider the regression of petal width (Y) on petal length (X) for the Iris dataset, with n = 150. First, we center X by subtracting the mean µX = 3.759 to obtain X̄.

Next, we compute the scalar projections of Y onto 1 and X̄, to obtain

µY = proj_1(Y) = Y^T 1 / 1^T 1 = 179.8 / 150 = 1.1987

w = proj_X̄(Y) = Y^T X̄ / X̄^T X̄ = 193.16 / 463.86 = 0.4164

We can compute the SSE value as the squared length of the residual error vector:

SSE = ||ε||^2 = ||Y − Ŷ||^2 = (Y − Ŷ)^T (Y − Ŷ) = 6.343
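The same numbers fall out of a few dot products. A sketch of this projection computation (again assuming arrays x and y for petal length and petal width):

import numpy as np

def bivariate_regression_projections(x, y):
    """Geometric view: scalar projections of Y onto 1 and onto the centered X."""
    n = len(x)
    ones = np.ones(n)
    mu_y = np.dot(y, ones) / np.dot(ones, ones)    # proj_1(Y) = mean of Y
    x_bar = x - x.mean()                           # centered attribute vector
    w = np.dot(y, x_bar) / np.dot(x_bar, x_bar)    # projection of Y onto the centered X
    b = mu_y - w * x.mean()
    resid = y - (b + w * x)                        # residual vector, orthogonal to span{1, X}
    sse = np.dot(resid, resid)                     # SSE = squared length of the residual
    return mu_y, w, b, sse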
Multiple Regression
In multiple regression, each point is augmented with an extra attribute xi0 = 1, so that the predicted response is

ŷi = w0 xi0 + w1 xi1 + w2 xi2 + · · · + wd xid = w̃^T x̃i

where w̃ = (w0, w1, · · · , wd)^T is the augmented weight vector and x̃i = (1, xi1, · · · , xid)^T is the augmented point.
Multiple Regression
The multiple regression task is to find the best fitting hyperplane defined by w̃ that minimizes the SSE:

min_{w̃} SSE = Σ_{i=1}^{n} ε_i^2 = ||ε||^2 = ||Y − Ŷ||^2
            = (Y − Ŷ)^T (Y − Ŷ) = Y^T Y − 2 Y^T Ŷ + Ŷ^T Ŷ
            = Y^T Y − 2 Y^T (D̃ w̃) + (D̃ w̃)^T (D̃ w̃)
            = Y^T Y − 2 w̃^T (D̃^T Y) + w̃^T (D̃^T D̃) w̃

Setting the gradient with respect to w̃ to zero yields the normal equations, whose solution is

w̃ = (D̃^T D̃)^{−1} D̃^T Y
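This closed-form solution translates directly into NumPy. A sketch (D is assumed to be the n × d data matrix and y the response vector; np.linalg.solve is used instead of forming the inverse explicitly):

import numpy as np

def multiple_regression(D, y):
    """Solve the normal equations (D~^T D~) w~ = D~^T Y for the augmented weights."""
    n = D.shape[0]
    D_tilde = np.column_stack([np.ones(n), D])   # augment with the X0 = 1 column
    w_tilde = np.linalg.solve(D_tilde.T @ D_tilde, D_tilde.T @ y)
    return w_tilde

# y_hat = np.column_stack([np.ones(len(y)), D]) @ w_tilde
# sse   = np.sum((y - y_hat) ** 2)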
Multiple Regression
Example
Consider the predictor attributes sepal length (X1) and petal length (X2) and the response attribute petal width (Y) for the Iris dataset with n = 150 points; we want to learn the multiple regression model.
(Figure: 3D scatter plot of the Iris points over X1 = sepal length and X2 = petal length, with Y = petal width.)
Multiple Regression
Example
We set X0 = 1_150 (the column of all ones), so that D̃ ∈ R^{150×3}. We then compute D̃^T D̃ and its inverse:

D̃^T D̃ = [ 150.0    876.50   563.80 ]        (D̃^T D̃)^{−1} = [  0.793  −0.176   0.064 ]
         [ 876.5   5223.85  3484.25 ]                        [ −0.176   0.041  −0.017 ]
         [ 563.8   3484.25  2583.00 ]                        [  0.064  −0.017   0.009 ]

We also compute D̃^T Y, given as

D̃^T Y = (179.80, 1127.65, 868.97)^T

Multiplying, w̃ = (D̃^T D̃)^{−1} D̃^T Y; in particular, the coefficient for petal length is w2 = 0.45.
Multiple Regression
Example
The figure shows the fitted hyperplane and the residual error for each point. Positive residuals (i.e., εi > 0 or ŷi > yi) are shown in white, while negative residuals (i.e., εi < 0 or ŷi < yi) are shown in gray. The SSE value for the model is 6.18.
(Figure: the fitted hyperplane over X1 and X2, with the residuals drawn from each point to the plane.)
Multiple-Regression Algorithm
QR-Factorization and Geometric Approach
Example
Consider the multiple regression of sepal length (X1 ) and petal length (X2 )
on petal width (Y ) for the Iris dataset with n = 150 points.
The Gram–Schmidt orthogonalization of D̃ results in the following QR-factorization:

D̃ = (X0  X1  X2) = (U0  U1  U2) · [ 1  5.843  3.759 ]
                                  [ 0  1      1.858 ]
                                  [ 0  0      1     ]

where Q = (U0  U1  U2) and R is the upper-triangular matrix shown.

Q ∈ R^{150×3}, and ∆, the diagonal matrix of squared norms of the basis vectors, and its inverse are

∆ = diag(150, 102.17, 111.35)        ∆^{−1} = diag(0.00667, 0.00979, 0.00898)
QR-Factorization and Geometric Approach
Example
The new (orthogonal) basis vectors, written in terms of the original attribute vectors, are

U0 = X0
U1 = −5.843 · X0 + X1
U2 = 7.095 · X0 − 1.858 · X1 + X2
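The same fit can be obtained from a QR-factorization in code. A sketch using NumPy's built-in QR (which returns orthonormal columns rather than the unnormalized Gram–Schmidt basis above, so its Q and R differ from the U vectors and the triangular matrix shown, but the fitted model is identical):

import numpy as np

def regression_via_qr(D, y):
    """Least-squares fit via QR-factorization of the augmented data matrix."""
    n = D.shape[0]
    D_tilde = np.column_stack([np.ones(n), D])   # augmented data matrix D~
    Q, R = np.linalg.qr(D_tilde)                 # D~ = Q R, Q has orthonormal columns
    # D~ w~ = Y reduces to the triangular system R w~ = Q^T Y
    w_tilde = np.linalg.solve(R, Q.T @ y)
    return w_tilde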
Multiple Regression: Stochastic Gradient Descent
Here w̃^t denotes the estimate of the weight vector at step t. In stochastic gradient descent (SGD), we update the weight vector by considering only one (random) point at each iteration:

w̃^{t+1} = w̃^t − η · ∇_{w̃}(x̃k) = w̃^t + η · (yk − x̃k^T w̃^t) · x̃k
Multiple Regression: SGD Algorithm
Multiple Regression: SGD (D, Y, η, ε):
1  D̃ ← (1  D) // augment data with the X0 = 1 column
2  t ← 0 // step/iteration counter
3  w̃^t ← random vector in R^{d+1} // initial weight vector
4  repeat
5    foreach k = 1, 2, · · · , n (in random order) do
6      ∇_{w̃}(x̃k) ← −(yk − x̃k^T w̃^t) · x̃k // compute gradient at x̃k
7      w̃^{t+1} ← w̃^t − η · ∇_{w̃}(x̃k) // update estimate for w̃
8      t ← t + 1
9  until ||w̃^t − w̃^{t−1}|| ≤ ε
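A Python rendering of this pseudocode (a sketch; the convergence test is applied once per pass over the data rather than after every single update, and max_epochs is an added safeguard not in the listing):

import numpy as np

def sgd_regression(D, y, eta=0.001, eps=0.0001, max_epochs=10000, seed=None):
    """Stochastic gradient descent for multiple regression (one point per update)."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    D_tilde = np.column_stack([np.ones(n), D])               # augment data
    w = rng.standard_normal(d + 1)                           # initial weight vector
    for _ in range(max_epochs):
        w_prev = w.copy()
        for k in rng.permutation(n):                         # points in random order
            grad_k = -(y[k] - D_tilde[k] @ w) * D_tilde[k]   # gradient at point k
            w = w - eta * grad_k                             # update the weight estimate
        if np.linalg.norm(w - w_prev) <= eps:                # convergence check (per epoch)
            break
    return w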
Multiple Regression: SGD
Example
Multiple regression of sepal length (X1 ) and petal length (X2 ) on the
response attribute petal width (Y ) for the Iris dataset with n = 150 points.
Using the exact approach and using SGD (with η = 0.001 and ε = 0.0001), we obtain essentially the same multiple regression model, with only a slight difference in the bias term.
The SSE value for the exact method is 6.179, whereas for SGD it is 6.181.
Ridge Regression
For linear regression, Ŷ lies in the span of the column vectors comprising the
augmented data matrix D̃.
Often the data is noisy and uncertain, and, therefore, instead of fitting the model
to the data exactly, it may be better to fit a more robust model.
Regularization constrains the solution vector w̃ to have a small norm.
Besides minimizing ||Y − Ŷ||^2, we add a regularization term α · ||w̃||^2 to the objective:

min_{w̃} J(w̃) = ||Y − Ŷ||^2 + α · ||w̃||^2 = ||Y − D̃ w̃||^2 + α · ||w̃||^2
α ≥ 0 controls the tradeoff between minimizing the squared norm of the weight
vector and the squared error.
Ridge Regression
Setting the gradient of J(w̃) to zero yields the closed-form solution

w̃ = (D̃^T D̃ + α · I)^{−1} D̃^T Y

where I is the (d + 1) × (d + 1) identity matrix.
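A minimal NumPy sketch of this closed-form solution (this version penalizes the bias term w0 along with the rest of w̃; the unpenalized-bias variant is discussed a few slides below):

import numpy as np

def ridge_regression(D, y, alpha):
    """Closed-form ridge solution w~ = (D~^T D~ + alpha I)^{-1} D~^T Y."""
    n, d = D.shape
    D_tilde = np.column_stack([np.ones(n), D])        # augment with X0 = 1
    A = D_tilde.T @ D_tilde + alpha * np.eye(d + 1)
    return np.linalg.solve(A, D_tilde.T @ y)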
Ridge Regression
Example
Consider the predictor attributes sepal length (X1) and petal length (X2) and the response attribute petal width (Y) for the Iris dataset with n = 150 points; we want to learn the ridge regression model.
(Figure: 3D scatter plot of the Iris points over X1, X2, and Y, used for the ridge regression example.)
Ridge Regression
Example
We obtain different lines of best fit for different values of the regularization constant α:

α = 0 :    Ŷ = −0.367 + 0.416 · X    ||w̃||^2 = ||(−0.367, 0.416)^T||^2 = 0.308    SSE = 6.34
α = 10 :   Ŷ = −0.244 + 0.388 · X    ||w̃||^2 = ||(−0.244, 0.388)^T||^2 = 0.210    SSE = 6.75
α = 100 :  Ŷ = −0.021 + 0.328 · X    ||w̃||^2 = ||(−0.021, 0.328)^T||^2 = 0.108    SSE = 9.97
Ridge Regression
Example
(Figure: lines of best fit for α = 0, 10, and 100 over X: petal length versus petal width.)
Ridge Regression: Unpenalized Bias Term
Often in L2 regularized regression we do not want to penalize the bias term w0 ,
since it simply provides the intercept information.
Consider the new regularized objective where w = (w1 , w2 , · · · , wd )T without w0 :
min_w J(w) = ||Y − w0 · 1 − D w||^2 + α · ||w||^2
           = ||Y − w0 · 1 − Σ_{i=1}^{d} wi · Xi||^2 + α · Σ_{i=1}^{d} wi^2

Setting the partial derivative with respect to w0 to zero gives w0 = µY − w^T µ, where µ is the vector of attribute means. Substituting back, the objective can be written in terms of the centered response Ȳ and centered data matrix D̄:

min_w J(w) = ||Ȳ − D̄ w||^2 + α · ||w||^2
When we do not penalize w0, we again obtain different lines of best fit for different values of the regularization constant α; for a given α (e.g., α = 10), the resulting model differs from the one obtained when w0 is also penalized.
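A sketch of the unpenalized-bias variant under the centering formulation above (the recovery w0 = µY − µ^T w follows from setting the derivative with respect to w0 to zero; function and variable names here are illustrative):

import numpy as np

def ridge_unpenalized_bias(D, y, alpha):
    """Ridge regression that leaves the intercept w0 unpenalized, via centering."""
    mu, mu_y = D.mean(axis=0), y.mean()
    D_c, y_c = D - mu, y - mu_y                     # centered data and response
    d = D.shape[1]
    # ridge on the centered data: w = (D_c^T D_c + alpha I)^{-1} D_c^T Y_c
    w = np.linalg.solve(D_c.T @ D_c + alpha * np.eye(d), D_c.T @ y_c)
    w0 = mu_y - mu @ w                              # recover the unpenalized intercept
    return w0, w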
Ridge Regression: Stochastic Gradient Descent
Instead of inverting the matrix (D̃^T D̃ + α · I), as called for in the exact ridge regression solution, we can employ the stochastic gradient descent algorithm.

The gradient (with J(w̃) scaled by 1/2 for convenience) is

∇_{w̃} = ∂J(w̃)/∂w̃ = −D̃^T Y + (D̃^T D̃) w̃ + α · w̃

Using (batch) gradient descent, we can iteratively compute w̃ as follows:

w̃^{t+1} = w̃^t − η · ∇_{w̃} = (1 − η · α) · w̃^t + η · D̃^T (Y − D̃ w̃^t)

In SGD, we update the weight vector by considering only one (random) point at each time:

w̃^{t+1} = w̃^t − η · ∇_{w̃}(x̃k) = (1 − η·α/n) · w̃^t + η · (yk − x̃k^T w̃^t) · x̃k
Ridge Regression: SGD Algorithm
Ridge Regression: SGD (D, Y, η, ε):
1  D̃ ← (1  D) // augment data with the X0 = 1 column
2  t ← 0 // step/iteration counter
3  w̃^t ← random vector in R^{d+1} // initial weight vector
4  repeat
5    foreach k = 1, 2, · · · , n (in random order) do
6      ∇_{w̃}(x̃k) ← −(yk − x̃k^T w̃^t) · x̃k + (α/n) · w̃^t // gradient at x̃k
7      w̃^{t+1} ← w̃^t − η · ∇_{w̃}(x̃k) // update estimate for w̃
8      t ← t + 1
9  until ||w̃^t − w̃^{t−1}|| ≤ ε
Ridge Regression: SGD
Example
We apply ridge regression on the Iris dataset (n = 150), using petal length (X )
as the independent attribute, and petal width (Y ) as the response variable.
Using SGD (with η = 0.001 and ε = 0.0001), we obtain different lines of best fit for different values of the regularization constant α.
Kernel Regression
Kernel regression generalizes linear regression to the non-linear case, i.e., it finds a non-linear fit to the data that minimizes the squared error, along with regularization.

φ(xi) maps the input point xi to the feature space.

For regularized regression, we have to solve the following objective in feature space:

min_{w̃} J(w̃) = ||Y − Ŷ||^2 + α · ||w̃||^2 = ||Y − D̃_φ w̃||^2 + α · ||w̃||^2

The optimal weight vector can be written as a linear combination of the mapped points, w̃ = D̃_φ^T c, where the mixture coefficient vector is

c = (K̃ + α · I)^{−1} Y

with I ∈ R^{n×n} the n × n identity matrix and K̃ = D̃_φ D̃_φ^T the augmented kernel matrix.
Kernel Regression
Ŷ = D̃_φ w̃ = D̃_φ D̃_φ^T c = K̃ (K̃ + α · I)^{−1} Y
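A compact sketch of kernel ridge regression that mirrors these equations and the algorithm on the next slide (the kernel is passed in as a function; all names here are illustrative):

import numpy as np

def kernel_ridge_fit(X, y, kernel, alpha):
    """Compute the mixture coefficients c = (K~ + alpha I)^{-1} Y."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    K_tilde = 1.0 + K                                     # augmented kernel matrix
    return np.linalg.solve(K_tilde + alpha * np.eye(n), y)

def kernel_ridge_predict(z, X, c, kernel):
    """Predicted response for a test point z."""
    k_z = 1.0 + np.array([kernel(z, x_i) for x_i in X])   # augmented kernel vector
    return c @ k_z

# example kernel choices (illustrative):
# linear:                  kernel = lambda a, b: np.dot(a, b)
# inhomogeneous quadratic: kernel = lambda a, b: (1.0 + np.dot(a, b)) ** 2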
Kernel Regression Algorithm
Kernel-Regression (D, Y, K, α):
1  K ← {K(xi, xj)}_{i,j=1,...,n} // standard kernel matrix
2  K̃ ← 1 + K // augmented kernel matrix
3  c ← (K̃ + α · I)^{−1} Y // compute mixture coefficients
4  Ŷ ← K̃ c

Testing (z, D, K, c):
5  K̃z ← 1 + K(z, xi), for all xi ∈ D
6  ŷ ← c^T K̃z
Kernel Regression on Iris
Example
(Figure: scatter plot of the data points, X versus Y.)
Kernel Regression on Iris
Example
Using a linear kernel, the fitted model is Ŷ = 0.168 · X.
Kernel Regression on Iris
Example
The linear (in gray) and quadratic (in black) fits are shown. The SSE is 13.82 for the linear kernel and 4.33 for the quadratic kernel. The quadratic kernel (as expected) gives a much better fit to the data.
(Figure: the data with the linear fit (gray) and quadratic fit (black) overlaid, X versus Y.)
Kernel ridge regression
Example
Consider the Iris principal components dataset, where X1 and X2 denote the first
two principal components.
The response variable Y is binary, with value 1 corresponding to Iris-virginica
(points on the top right, with Y value 1) and 0 corresponding to Iris-setosa
and Iris-versicolor (other two groups of points, with Y value 0).
(Figure: 3D scatter plot of the Iris principal components data over X1 and X2, with the binary response Y.)
Kernel ridge regression
Example
The figure shows the fitted regression plane using a linear kernel with ridge value α = 0.01.
(Figure: the fitted plane from the linear kernel over X1 and X2.)
Kernel ridge regression
Example
The figure shows the fitted model when we use an inhomogeneous quadratic kernel with α = 0.01.
(Figure: the fitted surface from the inhomogeneous quadratic kernel over X1 and X2, with the binary response Y.)
The SSE for the linear model is 15.47, whereas for the quadratic kernel it is 8.44, indicating a better fit to the training data.
L1 Regression: Lasso
min_w J(w) = (1/2) · ||Y − D w||^2 + α · ||w||_1
L1 Regression: Lasso
The use of the L1 norm leads to sparsity in the solution vector.

Ridge regression reduces the magnitude of the regression coefficients wi, but they may remain small yet non-zero.
L1 regression can drive the coefficients to zero, resulting in a more interpretable
model, especially when there are many predictor attributes.
The Lasso objective comprises two parts: the squared error term ||Y − D w||^2, which is convex and differentiable, and the L1 penalty term

α · ||w||_1 = α Σ_{i=1}^{d} |wi|

which is convex but not differentiable at wi = 0.
L1 Regression: Subgradients
(Figure: the absolute value function f(w) = |w|, which is not differentiable at w = 0.)
L1 Regression: Subgradients
(Figure: two subgradients of f(w) = |w| at w = 0, with slopes m = 0.25 and m = −0.5.)
Subgradients and Subdifferential
The set of all the subgradients at w is called the subdifferential, denoted as ∂|w |.
The subdifferential of f (w ) = |w | at w = 0 is given as ∂|w | = [−1, 1].
Considering all the cases, the subdifferential for f(w) = |w| is:

∂|w| = 1         if w > 0
       −1        if w < 0
       [−1, 1]   if w = 0
When the derivative exists, the subdifferential is unique and corresponds to the
derivative (or gradient).
When the derivative does not exist, the subdifferential corresponds to a set of subgradients.
Bivariate L1 Regression
Consider the bivariate L1 regression, where we have a single independent attribute X and a response attribute Y (both centered). The bivariate regression model is given as

ŷi = w · xi

The Lasso objective can then be written as

min_w J(w) = (1/2) Σ_{i=1}^{n} (yi − w · xi)^2 + α · |w|
Corresponding to the three cases for the subdifferential of the absolute value function, we have three cases to consider:

Case I (w > 0 and ∂|w| = 1): w = η · X^T Y − η · α. Since w > 0, we have η · X^T Y > η · α, i.e., |η · X^T Y| > η · α.

Case II (w < 0 and ∂|w| = −1): w = η · X^T Y + η · α. Since w < 0, we have η · X^T Y < −η · α, i.e., |η · X^T Y| > η · α.

Case III (w = 0 and ∂|w| ∈ [−1, 1]): w ∈ [η · X^T Y − η · α, η · X^T Y + η · α]. However, since w = 0, |η · X^T Y| ≤ η · α.

The three cases can be written compactly via the soft-threshold function:

w = S_{η·α}(η · X^T Y)
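This compact update translates directly into code. A sketch of the soft-threshold function and the bivariate L1 weight (the step/scaling parameter η is taken as given, as in the example that follows; names are illustrative):

import numpy as np

def soft_threshold(z, tau):
    """S_tau(z): shrink z toward zero, returning 0 whenever |z| <= tau."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0

def bivariate_l1_weight(x, y, alpha, eta):
    """Bivariate Lasso weight w = S_{eta*alpha}(eta * X^T Y), with X and Y centered."""
    xc, yc = x - x.mean(), y - y.mean()
    return soft_threshold(eta * (xc @ yc), eta * alpha)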
L1 -Regression Algorithm
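One standard way to solve the general L1 objective is cyclical coordinate descent, soft-thresholding one coordinate at a time. A hedged sketch (this is a common Lasso solver consistent with the objective above, not necessarily the exact listing from the book; the centering used to handle the intercept is an assumption):

import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * max(abs(z) - tau, 0.0)

def lasso_coordinate_descent(D, y, alpha, n_iters=1000, tol=1e-6):
    """Minimize (1/2) ||Y - D w||^2 + alpha * ||w||_1 by cyclical coordinate descent."""
    mu, mu_y = D.mean(axis=0), y.mean()
    Dc, yc = D - mu, y - mu_y                          # center so no bias term is needed
    n, d = Dc.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        w_old = w.copy()
        for j in range(d):
            r_j = yc - Dc @ w + Dc[:, j] * w[j]        # partial residual excluding coordinate j
            w[j] = soft_threshold(Dc[:, j] @ r_j, alpha) / (Dc[:, j] @ Dc[:, j])
        if np.linalg.norm(w - w_old) <= tol:           # stop when the coordinates stabilize
            break
    b = mu_y - mu @ w                                  # intercept for the uncentered data
    return b, w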
L1 Regression
Example
We apply L1 regression to the full Iris dataset with n = 150 points, and four
independent attributes, namely sepal-width (X1 ), sepal-length (X2 ),
petal-width (X3 ), and petal-length (X4 ).
The Iris type attribute comprises the response variable Y . There are three Iris
types, namely Iris-setosa, Iris-versicolor, and Iris-virginica, which
are coded as 0, 1 and 2, respectively.
The L1 regression models for different values of α (with η = 0.0001) are shown below:
α = 0 :   Ŷ = +0.19 − 0.11 · X1 − 0.05 · X2 + 0.23 · X3 + 0.61 · X4    SSE = 6.96     ||w||_1 = 0.44
α = 1 :   Ŷ = −0.08 − 0.08 · X1 − 0.02 · X2 + 0.25 · X3 + 0.52 · X4    SSE = 7.09     ||w||_1 = 0.34
α = 5 :   Ŷ = −0.55 + 0.00 · X1 + 0.00 · X2 + 0.36 · X3 + 0.17 · X4    SSE = 8.82     ||w||_1 = 0.16
α = 10 :  Ŷ = −0.58 + 0.00 · X1 + 0.00 · X2 + 0.42 · X3 + 0.00 · X4    SSE = 10.15    ||w||_1 = 0.18
Note the sparsity-inducing effect for α = 5 and α = 10, which drives some of the wi to zero.
L1 Regression
Example
L2: the coefficients for X1 and X2 are small, and therefore less important, but they are not zero.

L1: the coefficients for X1 and X2 are exactly zero, leaving only X3 and X4; Lasso can thus act as an automatic feature selection approach.
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info