
Linear Regression

CS771: Introduction to Machine Learning

Piyush Rai
Linear Regression

- Given: Training data with input-output pairs $(\boldsymbol{x}_n, y_n)$, $n = 1, 2, \ldots, N$, with inputs $\boldsymbol{x}_n \in \mathbb{R}^D$ and outputs $y_n \in \mathbb{R}$

- Goal: Learn a model to predict the output for new test inputs

- Assume the function that approximates the I/O relationship to be a linear model
  $y_n \approx f(\boldsymbol{x}_n) = \boldsymbol{w}^\top \boldsymbol{x}_n \qquad (n = 1, 2, \ldots, N)$
  (can also write all of them compactly using matrix-vector notation as $\boldsymbol{y} \approx \boldsymbol{X}\boldsymbol{w}$)

- Let's write the total error or "loss" of this model over the training data as
  $L(\boldsymbol{w}) = \sum_{n=1}^{N} \ell\big(y_n, f(\boldsymbol{x}_n)\big)$
  where each term measures the prediction error or "loss" or "deviation" of the model on a single training input

- The goal of learning is to find the $\boldsymbol{w}$ that minimizes this loss and also does well on test data

- Unlike models such as KNN and DT, here we have an explicit problem-specific objective (loss function) that we wish to optimize
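To make the notation concrete, here is a minimal NumPy sketch (synthetic data, purely illustrative) of the per-example predictions $\boldsymbol{w}^\top\boldsymbol{x}_n$, the equivalent matrix-vector form $\boldsymbol{X}\boldsymbol{w}$, and the total loss when the per-example loss is squared error:

```python
import numpy as np

# Minimal sketch: a linear model y ~ Xw and its total squared-error loss.
# The data and weights here are synthetic, just for illustration.
rng = np.random.default_rng(0)
N, D = 100, 5                                # N training examples, D features
X = rng.normal(size=(N, D))                  # inputs, one row per example
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)    # noisy outputs

w = rng.normal(size=D)                       # some candidate weight vector

# Per-example predictions f(x_n) = w^T x_n, written two equivalent ways
preds_loop = np.array([w @ x_n for x_n in X])
preds_mat = X @ w                            # compact matrix-vector form
assert np.allclose(preds_loop, preds_mat)

# Total loss over the training data (squared error per example)
loss = np.sum((y - preds_mat) ** 2)
print(loss)
```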
Linear Regression: Pictorially

- Linear regression is like fitting a line or (hyper)plane to a set of points

  [Figure: left panel, output vs. the original (single) input feature, where a nonlinear curve is needed; right panel, output vs. two transformed features $[z_1, z_2] = \phi(x)$ (Feature 1, Feature 2), where a (linear) plane can be fit]

- What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface?

- Do linear models become useless in such cases? No. We can even fit a curve using a linear model after suitably transforming the inputs

- The transformation can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later

- The line/plane must also predict outputs well on the unseen (test) inputs
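As a concrete sketch of the "transform the inputs, then fit a linear model" idea, the snippet below (synthetic data, with a hypothetical quadratic transform chosen just for illustration) maps a single feature $x$ to $\phi(x) = [1, x, x^2]$ and runs ordinary least squares on the transformed features; the model is still linear in $\boldsymbol{w}$ even though the fitted curve is not linear in $x$:

```python
import numpy as np

# Sketch: fit a nonlinear (quadratic) relationship with a *linear* model
# by first transforming the input. The data here is synthetic.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)                           # single original feature
y = 2.0 * x**2 - 1.0 * x + rng.normal(scale=0.5, size=200)

# Predefined feature transform phi(x) = [1, x, x^2] (constant term gives an intercept)
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed features (still linear in w)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # roughly [0, -1, 2], recovering the underlying curve
```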
Loss Functions for Regression

- Many possible loss functions for regression problems. The choice of loss function usually depends on the nature of the data; also, some loss functions result in an easier optimization problem than others

- Squared loss: $(y_n - f(\boldsymbol{x}_n))^2$
  Very commonly used for regression. Leads to an easy-to-solve optimization problem

- Absolute loss: $|y_n - f(\boldsymbol{x}_n)|$
  Grows more slowly than squared loss; thus better suited when the data has some outliers (inputs on which the model makes large errors)

- Huber loss: squared loss for small errors (say up to $\delta$); absolute loss for larger errors. Good for data with outliers

- $\epsilon$-insensitive loss (a.k.a. Vapnik loss): $\max\big(0,\ |y_n - f(\boldsymbol{x}_n)| - \epsilon\big)$
  Zero loss for small errors (say up to $\epsilon$); absolute loss for larger errors. Note: can also use squared loss instead of absolute loss

  [Figure: plots of the four losses as a function of the error $y_n - f(\boldsymbol{x}_n)$; the Huber loss changes behavior at $\pm\delta$, and the $\epsilon$-insensitive loss is zero on $[-\epsilon, \epsilon]$]
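A minimal sketch of these four losses as NumPy functions of the residual $e = y_n - f(\boldsymbol{x}_n)$. The thresholds `delta` and `eps` are illustrative choices (not values fixed by the slides), and the Huber loss below uses one common parameterization that keeps the two pieces continuous:

```python
import numpy as np

def squared_loss(e):
    return e ** 2

def absolute_loss(e):
    return np.abs(e)

def huber_loss(e, delta=1.0):
    # Squared loss for |e| <= delta, absolute (linear) loss beyond that;
    # the pieces are matched so the loss stays continuous.
    return np.where(np.abs(e) <= delta,
                    0.5 * e ** 2,
                    delta * (np.abs(e) - 0.5 * delta))

def eps_insensitive_loss(e, eps=0.5):
    # Zero loss for errors up to eps, absolute loss beyond that
    return np.maximum(0.0, np.abs(e) - eps)

errors = np.linspace(-3, 3, 7)
print(huber_loss(errors), eps_insensitive_loss(errors))
```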
Linear Regression with Squared Loss

- In this case, the loss function will be
  $L(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$
  In matrix-vector notation, we can write it compactly as $\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2$

- Let us find the $\boldsymbol{w}$ that optimizes (minimizes) the above squared loss. This is the "least squares" (LS) problem (Gauss-Legendre, 18th century)

- We need calculus and optimization to do this!

- The LS problem can be solved easily and has a closed-form solution (closed-form solutions to ML problems are rare):
  $\boldsymbol{w} = \arg\min_{\boldsymbol{w}} \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2 = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$

- This requires a matrix inversion, which can be expensive. There are ways to handle this; will see later
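A minimal NumPy sketch of the closed-form least squares solution on synthetic data (this transcribes the formula literally with an explicit inverse; later slides discuss why one would avoid forming the inverse in practice):

```python
import numpy as np

# Closed-form least squares: w = (X^T X)^{-1} X^T y
rng = np.random.default_rng(2)
N, D = 200, 4
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=N)

w_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # literal transcription of the formula
print(w_hat)                               # close to w_true
```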
Proof: A Bit of Calculus/Optimization (more on this later)

- We wanted to find the minima of $L(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$

- Let us apply a basic rule of calculus: take the first derivative of $L(\boldsymbol{w})$ w.r.t. $\boldsymbol{w}$ and set it to zero. By the chain rule,
  $\frac{\partial L(\boldsymbol{w})}{\partial \boldsymbol{w}} = \sum_{n=1}^{N} 2\,(y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)\,\frac{\partial (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)}{\partial \boldsymbol{w}} = 0$
  (the partial derivative of the dot product $\boldsymbol{w}^\top \boldsymbol{x}_n$ w.r.t. each element of $\boldsymbol{w}$ gives $\boldsymbol{x}_n$, so the derivative of the term in parentheses is $-\boldsymbol{x}_n$, the same size as $\boldsymbol{w}$)

- Using the fact $\frac{\partial\, \boldsymbol{w}^\top \boldsymbol{x}_n}{\partial \boldsymbol{w}} = \boldsymbol{x}_n$, we get
  $\sum_{n=1}^{N} 2\,\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\top \boldsymbol{w}) = 0$

- To separate $\boldsymbol{w}$ to get a solution, we write the above as
  $\sum_{n=1}^{N} y_n \boldsymbol{x}_n - \sum_{n=1}^{N} \boldsymbol{x}_n \boldsymbol{x}_n^\top \boldsymbol{w} = 0$

- Solving for $\boldsymbol{w}$,
  $\boldsymbol{w} = \Big(\sum_{n=1}^{N} \boldsymbol{x}_n \boldsymbol{x}_n^\top\Big)^{-1} \Big(\sum_{n=1}^{N} y_n \boldsymbol{x}_n\Big) = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$
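As a quick sanity check (a sketch on synthetic data), the summation form $\big(\sum_n \boldsymbol{x}_n\boldsymbol{x}_n^\top\big)^{-1}\sum_n y_n\boldsymbol{x}_n$ and the matrix form $(\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{y}$ give the same answer:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Summation form: sum_n x_n x_n^T  and  sum_n y_n x_n
A = sum(np.outer(x_n, x_n) for x_n in X)
b = sum(y_n * x_n for x_n, y_n in zip(X, y))
w_sum = np.linalg.inv(A) @ b

# Matrix form: (X^T X)^{-1} X^T y
w_mat = np.linalg.inv(X.T @ X) @ X.T @ y

assert np.allclose(w_sum, w_mat)
```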
Problem(s) with the Solution!

- We minimized the objective w.r.t. $\boldsymbol{w}$ and got
  $\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$

- Problem: The matrix $\boldsymbol{X}^\top \boldsymbol{X}$ may not be invertible
  - This may lead to non-unique solutions for $\boldsymbol{w}$

- Problem: Overfitting, since we only minimized the loss defined on the training data
  - Weights may become arbitrarily large to fit the training data perfectly
  - Such weights may however perform poorly on the test data

- One solution: Minimize a regularized objective $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$
  - $R(\boldsymbol{w})$ is called the regularizer and measures the "magnitude" of $\boldsymbol{w}$
  - $\lambda$ is the regularization hyperparameter; it controls how much we wish to regularize (needs to be tuned via cross-validation)
  - The regularizer will prevent the elements of $\boldsymbol{w}$ from becoming too large
  - Reason: now we are minimizing training error + magnitude of the weight vector
Regularized Least Squares (a.k.a. Ridge Regression)

- Recall that the regularized objective is of the form $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$

- One possible/popular regularizer: the squared Euclidean ($\ell_2$ squared) norm of $\boldsymbol{w}$:
  $R(\boldsymbol{w}) = \|\boldsymbol{w}\|_2^2 = \boldsymbol{w}^\top \boldsymbol{w}$

- With this regularizer, we have the regularized least squares problem
  $\boldsymbol{w} = \arg\min_{\boldsymbol{w}} \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2 + \lambda\, \boldsymbol{w}^\top \boldsymbol{w}$

- Proceeding just like the LS case, we can find the optimal $\boldsymbol{w}$, which is given by
  $\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$

- Why is the method called "ridge" regression? Look at the form of the solution: we are adding a small value $\lambda$ to the diagonals of the $D \times D$ matrix $\boldsymbol{X}^\top \boldsymbol{X}$ (like adding a ridge/mountain to some land)
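A minimal NumPy sketch of the closed-form ridge solution on synthetic data. The value of `lam` below is an arbitrary illustrative choice; in practice it would be tuned via cross-validation as noted above:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I_D)^{-1} X^T y."""
    D = X.shape[1]
    # Literal transcription of the formula; note lam added to the diagonal
    return np.linalg.inv(X.T @ X + lam * np.eye(D)) @ (X.T @ y)

rng = np.random.default_rng(4)
N, D = 100, 10
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

w_ridge = ridge_fit(X, y, lam=1.0)
print(np.linalg.norm(w_ridge))   # regularization keeps the weights' norm in check
```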
A Closer Look at Regularization

- Remember: in general, weights with large magnitude are bad since they can cause overfitting on the training data and may not work well on test data

- The regularized objective we minimized is
  $L_{reg}(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2 + \lambda\, \boldsymbol{w}^\top \boldsymbol{w}$

- Minimizing it w.r.t. $\boldsymbol{w}$ gives a solution for $\boldsymbol{w}$ that
  - Keeps the training error small
  - Has a small squared norm $\|\boldsymbol{w}\|_2^2 = \boldsymbol{w}^\top \boldsymbol{w}$. Good because, consequently, the individual entries of the weight vector are also prevented from becoming too large

- Small entries in $\boldsymbol{w}$ are good since they lead to "smooth" models. A model that is not "smooth" may have test-data predictions that change drastically even with small changes in some feature's value

- Example: a typical $\boldsymbol{w}$ learned without regularization, on training data containing two examples with (almost) the same feature vectors but very different outputs (maybe one of the two is an outlier):

  $\boldsymbol{x}_n = [1.2,\ 0.5,\ 2.4,\ 0.3,\ 0.8,\ 0.1,\ 0.9,\ 2.1]$, $\quad y_n = 0.8$
  $\boldsymbol{x}_m = [1.2,\ 0.5,\ 2.4,\ 0.3,\ 0.8,\ 0.1,\ 0.9,\ 2.1]$ (differing from $\boldsymbol{x}_n$ in just one feature by a small amount), $\quad y_m = 100$

  $\boldsymbol{w} = [3.2,\ 1.8,\ 1.3,\ 2.1,\ 10000,\ 2.5,\ 3.1,\ 0.1]$

  Just to fit the training data, where one of the inputs was possibly an outlier, one weight (10000) became too big. Such a weight vector will possibly do poorly on normal test inputs
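To see why such a large weight makes the model non-smooth, here is a small sketch using the weight vector from the example above: perturbing the feature that gets multiplied by the weight 10000 by a tiny amount (a hypothetical perturbation of 0.01, chosen just for illustration) changes the prediction enormously:

```python
import numpy as np

# Weight vector learned without regularization (from the example) and a training input
w = np.array([3.2, 1.8, 1.3, 2.1, 10000, 2.5, 3.1, 0.1])
x = np.array([1.2, 0.5, 2.4, 0.3, 0.8, 0.1, 0.9, 2.1])

# Perturb the 5th feature (the one multiplied by the huge weight) by a tiny amount
x_perturbed = x.copy()
x_perturbed[4] += 0.01

print(w @ x, w @ x_perturbed)   # predictions differ by 10000 * 0.01 = 100
```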
Other Ways to Control Overfitting

- Use a regularizer defined by other norms, e.g.,
  - $\ell_1$ norm regularizer: $\|\boldsymbol{w}\|_1 = \sum_{d=1}^{D} |w_d|$
  - $\ell_0$ norm regularizer: $\|\boldsymbol{w}\|_0 = \#\mathrm{nnz}(\boldsymbol{w})$ (counts the number of nonzeros in $\boldsymbol{w}$)

- When should I use these regularizers instead of the $\ell_2$ regularizer? Use them if you have a very large number of features but many irrelevant ones; these regularizers can help in automatic feature selection

- How exactly do they give automatic feature selection? Using such regularizers gives a sparse weight vector as the solution. "Sparse" means many entries in $\boldsymbol{w}$ will be zero or near zero; those features will be considered irrelevant by the model and will not influence the prediction (a small numeric sketch of these norms follows after this slide)

- Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later)

- Use non-regularization based approaches
  - Early stopping (stop training as soon as we have a decent validation-set accuracy)
  - Dropout (in each iteration, don't update some of the weights)
  - Injecting noise in the inputs

- All of these are very popular ways to control overfitting in deep learning models. More on these later when we talk about deep learning
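A tiny sketch of the two norms on a hypothetical sparse weight vector (the vector below is made up just to show what each norm measures):

```python
import numpy as np

# A hypothetical sparse weight vector: most entries are exactly zero,
# i.e., the corresponding features would not influence the prediction.
w = np.array([0.0, 2.5, 0.0, 0.0, -1.2, 0.0, 0.3, 0.0])

l1_norm = np.sum(np.abs(w))       # ||w||_1 = sum_d |w_d|
l0_norm = np.count_nonzero(w)     # ||w||_0 = number of nonzero entries

print(l1_norm, l0_norm)           # 4.0 and 3
```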
Linear Regression as Solving a System of Linear Equations

- The form of the linear regression model $y_n \approx \boldsymbol{w}^\top \boldsymbol{x}_n$ is akin to a system of linear equations

- Assuming $N$ training examples with $D$ features each, we have
  First training example: $\ y_1 = x_{11} w_1 + x_{12} w_2 + \ldots + x_{1D} w_D$
  Second training example: $\ y_2 = x_{21} w_1 + x_{22} w_2 + \ldots + x_{2D} w_D$
  $\quad\vdots$
  N-th training example: $\ y_N = x_{N1} w_1 + x_{N2} w_2 + \ldots + x_{ND} w_D$
  (Note: here $x_{nd}$ denotes the $d$-th feature of the $n$-th training example)

- This is a system of $N$ equations in $D$ unknowns ($w_1, w_2, \ldots, w_D$)

- However, in regression we rarely have $N = D$, but rather $N < D$ or $N > D$
  - Thus we have an underdetermined ($N < D$) or overdetermined ($N > D$) system
  - Methods to solve over/underdetermined systems can be used for linear regression as well
  - Many of these methods don't require expensive matrix inversion

- Solving linear regression as a system of linear equations: the solution
  $\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$
  can be viewed as solving $\boldsymbol{A}\boldsymbol{w} = \boldsymbol{b}$, where $\boldsymbol{A} = \boldsymbol{X}^\top \boldsymbol{X}$ and $\boldsymbol{b} = \boldsymbol{X}^\top \boldsymbol{y}$: a system of linear equations with $D$ equations and $D$ unknowns. Now solve this!
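A minimal sketch of this view on synthetic data: form $\boldsymbol{A} = \boldsymbol{X}^\top\boldsymbol{X}$ and $\boldsymbol{b} = \boldsymbol{X}^\top\boldsymbol{y}$ and hand the $D \times D$ system to a linear-system solver instead of explicitly inverting the matrix (np.linalg.lstsq goes one step further and works directly on the over/underdetermined system $\boldsymbol{X}\boldsymbol{w} \approx \boldsymbol{y}$):

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 500, 20                      # overdetermined: more equations than unknowns
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# Normal equations A w = b, with A = X^T X (D x D) and b = X^T y (D x 1)
A = X.T @ X
b = X.T @ y
w_solve = np.linalg.solve(A, b)     # solves the system without forming A^{-1}

# Alternatively, solve the overdetermined system X w ~ y directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_solve, w_lstsq)
```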
Next Lecture

- Solving linear regression using iterative optimization methods
  - Faster and don't require matrix inversion
- Brief intro to optimization techniques
