
Linear Regression

Linear Regression: Pictorially
 Linear regression is like fitting a line or (hyper)plane to a set of points

[Figure: left panel shows the original (single) feature, where a nonlinear curve is needed to fit the output against the input; right panel shows two transformed features z1, z2 obtained via a feature mapping 𝜙(x), where a (linear) plane can fit the output against feature 1 and feature 2.]

 What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases?

 No. We can even fit a curve using a linear model after suitably transforming the inputs. The transformation can be predefined or learned (e.g., using kernel methods or a deep neural network based feature extractor). More on this later

 The line/plane must also predict the outputs well on unseen (test) inputs
Simplest Possible Linear Regression Model
 This is the base model for all of statistical machine learning
 x is an input variable with a single feature
 y is the value we are trying to predict
 The regression model is

        y = w0 + w1·x + ε

   Two parameters to estimate: the slope of the line w1 and the y-intercept w0
 ε is the unexplained, random, or error component
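Below is a minimal NumPy sketch of this model: we simulate data from y = w0 + w1·x + ε with made-up values of w0, w1, and the noise level, then recover the two parameters with a degree-1 polynomial fit.

```python
import numpy as np

# Simulate the simple linear regression model y = w0 + w1*x + eps
# (w0_true, w1_true and the noise scale are illustrative values).
rng = np.random.default_rng(0)
w0_true, w1_true = 2.0, 3.5
x = rng.uniform(0, 10, size=100)          # single input feature
eps = rng.normal(0, 1.0, size=100)        # unexplained random error component
y = w0_true + w1_true * x + eps

# Fit a line; np.polyfit returns coefficients from highest degree down,
# i.e., [slope, intercept].
w1_hat, w0_hat = np.polyfit(x, y, deg=1)
print(w0_hat, w1_hat)                     # should be close to 2.0 and 3.5
```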

Solving the regression problem
 We basically want to find the {w0, w1} that minimize the deviations from the predictor line

 How do we do it?
 Iterate over all possible w values along the two dimensions?
 Same, but smarter? [next class]
 No, we can do this in closed form with just plain calculus
 Very few optimization problems in ML have closed-form solutions; the ones that do are interesting for that reason
Parameter estimation via calculus
 We just need to set the partial derivatives of the squared-error objective to zero (a sketch of the full derivation is given below)
 Simplifying these equations gives closed-form expressions for w0 and w1
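A standard derivation sketch for the single-feature model (x̄ and ȳ denote the sample means of the inputs and outputs):

```latex
% Least-squares objective for y = w_0 + w_1 x + \varepsilon
E(w_0, w_1) = \sum_{i=1}^{N} \left(y_i - w_0 - w_1 x_i\right)^2

% Setting the partial derivatives to zero:
\frac{\partial E}{\partial w_0} = -2 \sum_{i=1}^{N} \left(y_i - w_0 - w_1 x_i\right) = 0, \qquad
\frac{\partial E}{\partial w_1} = -2 \sum_{i=1}^{N} x_i \left(y_i - w_0 - w_1 x_i\right) = 0

% Simplifying:
w_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad
w_0 = \bar{y} - w_1 \bar{x}
```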

More generally
 Given: Training data with N input-output pairs {(x_n, y_n)}, n = 1, 2, …, N

 Goal: Learn a model f to predict the output for new test inputs

 Assume the function that approximates the I/O relationship to be a linear model

        y_n ≈ f(x_n) = wᵀx_n    (n = 1, 2, …, N)

   Can also write all of them compactly using matrix-vector notation as y ≈ Xw

 Let's write the total error or "loss" of this model over the training data as

        L(w) = Σ_{n=1}^N ℓ(y_n, f(x_n))

   where ℓ measures the prediction error or "loss" or "deviation" of the model on a single training input

 Goal of learning is to find the w that minimizes this loss + does well on test data

 Unlike models like KNN and DT, here we have an explicit problem-specific objective (loss function) that we wish to optimize for
Linear Regression with Squared Loss
 In this case, the loss function will be

        L(w) = Σ_{n=1}^N (y_n − wᵀx_n)²

   In matrix-vector notation, can write it compactly as L(w) = ‖y − Xw‖²

 Let us find the w that optimizes (minimizes) the above squared loss
   This is the "least squares" (LS) problem (Gauss-Legendre, 18th century)

 We need calculus and optimization to do this!

 The LS problem can be solved easily and has a closed-form solution

        ŵ = (XᵀX)⁻¹ Xᵀy

   (Link to a nice derivation)
   Note: requires matrix inversion, which can be expensive. There are ways to handle this.
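A minimal NumPy sketch of this closed-form solution on synthetic data; solving the normal equations with np.linalg.solve avoids forming the inverse explicitly and is cheaper and more stable than a literal matrix inversion.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))              # N x D matrix of training inputs
w_true = np.array([1.0, -2.0, 0.5])      # hypothetical true weights
y = X @ w_true + 0.1 * rng.normal(size=N)

# Closed-form least squares: w_hat = (X^T X)^{-1} X^T y,
# computed by solving the normal equations (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                             # should be close to w_true
```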
Alternative loss functions
 Choice of loss function usually depends on the nature of the data. Also, some loss functions result in an easier optimization problem than others

 Many possible loss functions for regression problems:

 Squared loss: (y_n − f(x_n))². Very commonly used for regression. Leads to an easy-to-solve optimization problem.

 Absolute loss: |y_n − f(x_n)|. Grows more slowly than squared loss, thus better suited when the data has some outliers (inputs on which the model makes large errors).

 Huber loss: squared loss for small errors (say up to δ), absolute loss for larger errors. Good for data with outliers.

 ε-insensitive loss (a.k.a. Vapnik loss): zero loss for small errors (say up to ε), and |y_n − f(x_n)| − ε for larger errors. Note: can also use squared loss instead of absolute loss for the larger errors.

[Figure: plots of the four losses as functions of the error y_n − f(x_n), with thresholds −δ, δ for the Huber loss and −ε, ε for the ε-insensitive loss.]
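A small NumPy sketch of these losses as functions of the residual r = y_n − f(x_n); δ and ε are illustrative thresholds, and the Huber loss below uses the usual 1/2 factor and offset so its two pieces join smoothly.

```python
import numpy as np

def squared_loss(r):
    return r ** 2

def absolute_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond that (continuous at the joint).
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def eps_insensitive_loss(r, eps=0.5):
    # Zero inside the epsilon "tube", absolute loss minus eps outside it.
    return np.maximum(0.0, np.abs(r) - eps)

r = np.linspace(-3, 3, 7)                 # a few example residuals
print(squared_loss(r), absolute_loss(r))
print(huber_loss(r), eps_insensitive_loss(r))
```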
How do we ensure generalization?
 We minimized the objective L(w) w.r.t. w and got

        ŵ = (XᵀX)⁻¹ Xᵀy

 Problem: The matrix XᵀX may not be invertible
 This may lead to non-unique solutions for ŵ

 Problem: Overfitting, since we only minimized the loss defined on the training data
 Weights may become arbitrarily large in order to fit the training data perfectly
 Such weights may however perform poorly on the test data

 One solution: Minimize a regularized objective L(w) + λR(w)
 R(w) is called the regularizer and measures the "magnitude" of w
 The regularizer will prevent the elements of w from becoming too large
 λ is the regularization hyperparameter; it controls how much we wish to regularize (needs to be tuned via cross-validation)
 Reason: Now we are minimizing training error + magnitude of the weight vector
Regularized Least Squares (a.k.a. Ridge Regression)
 Recall that the regularized objective is of the form L(w) + λR(w)

 One possible/popular regularizer: the squared Euclidean (squared ℓ₂) norm of w

        R(w) = ‖w‖₂² = wᵀw

 With this regularizer, we have the regularized least squares problem as

        ŵ = arg min_w Σ_{n=1}^N (y_n − wᵀx_n)² + λ wᵀw

 Proceeding just like the LS case, we can find the optimal w, which is given by

        ŵ = (XᵀX + λ I_D)⁻¹ Xᵀy

 Why is the method called "ridge" regression? Look at the form of the solution: we are adding a small value λ to the diagonals of the D×D matrix XᵀX (like adding a ridge/mountain to some land)
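A minimal NumPy sketch of this ridge solution; lam stands in for λ, and the synthetic data mirrors the earlier least-squares sketch.

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Closed-form ridge regression: w = (X^T X + lam * I_D)^{-1} X^T y."""
    D = X.shape[1]
    # Adding lam to the diagonal makes X^T X + lam*I invertible even when
    # X^T X alone is singular (e.g., collinear features or D > N).
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.1))
```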
A closer look at regularization
 Remember: in general, weights with large magnitude are bad since they can cause overfitting on the training data and may not work well on test data

 The regularized objective we minimized is

        L_reg(w) = Σ_{n=1}^N (y_n − wᵀx_n)² + λ wᵀw

 Minimizing it w.r.t. w gives a solution for w that
 Keeps the training error small
 Has a small squared ℓ₂ norm. Good because, consequently, the individual entries of the weight vector w are also prevented from becoming too large
 Small entries in w are good since they lead to "smooth" models; a model that is not "smooth" may change its test-data predictions drastically even with small changes in some feature's value

 A typical w learned without regularization:

        w = [3.2  1.8  1.3  2.1  10000  2.5  3.1  0.1]

   Just to fit the training data, where one of the inputs was possibly an outlier, this weight became too big. Such a weight vector will possibly do poorly on normal test inputs.

 For example, consider two training examples with essentially the same feature vectors, only differing in just one feature by a small amount, but with very different outputs (maybe one of these two training examples is an outlier):

        x_n = [1.2  0.5  2.4  0.3  0.8  0.1  0.9  2.1],   y_n = 0.8
        x_m = [1.2  0.5  2.4  0.3  0.8  0.1  0.9  2.1],   y_m = 100
Other Ways to Control Overfitting
 Use a regularizer R(w) defined by other norms, e.g., the ℓ₁ norm regularizer

        ‖w‖₁ = Σ_{d=1}^D |w_d|

   or the ℓ₀ norm regularizer

        ‖w‖₀ = nnz(w)   (counts the number of nonzeros in w)

 When should I use these regularizers instead of the ℓ₂ regularizer? Use them if you have a very large number of features but many irrelevant features. These regularizers can help in automatic feature selection.
 Using such regularizers gives a sparse weight vector w as the solution. Sparse means many entries in w will be zero or near zero; those features will be considered irrelevant by the model and will not influence the prediction. (A small sketch is given after this list.)
 Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later)

 Use non-regularization based approaches
 Early stopping (stop training as soon as we have a decent validation-set accuracy)
 Dropout (in each iteration, don't update some of the weights)
 Injecting noise into the inputs
 All of these are very popular ways to control overfitting in deep learning models. More on these later when we talk about deep learning
Linear Regression as Solving a System of Linear Equations
 The form of the lin. reg. model is akin to a system of linear equations
 Assuming N training examples with D features each, we have

        First training example:    y₁ = x₁₁w₁ + x₁₂w₂ + … + x₁D w_D
        Second training example:   y₂ = x₂₁w₁ + x₂₂w₂ + … + x₂D w_D
        ⋮
        N-th training example:     y_N = x_N1 w₁ + x_N2 w₂ + … + x_ND w_D

   Note: here x_nd denotes the d-th feature of the n-th training example
   N equations and D unknowns here (w₁, w₂, …, w_D)

 However, in regression, we rarely have N = D, but rather N < D or N > D
 Thus we have an underdetermined (N < D) or overdetermined (N > D) system
 Methods to solve over/underdetermined systems can be used for lin-reg as well
 Many of these methods don't require expensive matrix inversion

 Solving lin-reg as a system of linear equations: Xw = y, where X is N×D, w is D×1, and y is N×1, i.e., a system with N equations and D unknowns. Now solve this! (Compare with the closed-form solution ŵ = (XᵀX)⁻¹ Xᵀy.)
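A minimal NumPy sketch of treating linear regression as an over- or underdetermined linear system; np.linalg.lstsq returns a least-squares solution of Xw ≈ y without explicitly forming (XᵀX)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case: N > D (more equations than unknowns).
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.05 * rng.normal(size=100)
w_over, *_ = np.linalg.lstsq(X, y, rcond=None)

# Underdetermined case: N < D (fewer equations than unknowns);
# lstsq returns the minimum-norm solution among the infinitely many solutions.
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)
w_under, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_over.shape, w_under.shape)       # (5,) and (20,)
```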
Next Lecture
 Solving linear regression using iterative optimization methods
 Faster and don’t require matrix inversion
 Brief intro to optimization techniques
