
Regression

aka how much is my car worth?

Prof Manuel Roveri – manuel.roveri@polimi.it

ICT4Dev Summer School


Regression Problem

Can we learn to predict a target variable


(e.g., MEDV) by observing input features
(e.g., CRIM, ZN, INDUS, …)?

2
Regression Problem

Let’s start with a simple example … can we predict Y from X?

3
Regression Problem

Let’s start with a simple example … can we predict Y from X?

$y = w_0 + w_1 x$

where $w_0$ is the intercept and $w_1$ is the slope.

In the example
• $w_0 = 15$
• $w_1 = 0.8$

4
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

5
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

• Many alternative solutions

6
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

• Many alternative solutions
• Linear modeling is an assumption

7
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

• Many alternative solutions
• Linear modeling is an assumption

Which one is the best?

8
Linear Regression

Given observed pairs $\langle x_i, y_i \rangle$, $i = 1, \dots, N$, find the best linear model for the noisy observations

$y = w_0 + w_1 x$

• Infinite solutions exist
• Need to define an optimality criterion

Least Squares Estimation!

9
Least Squares Regression

The Residual Sum of Squares (RSS) is used to evaluate the model. It is defined as the sum of the squared residuals $e_i = y_i - \hat{y}_i$, i.e.,

$RSS = e_1^2 + e_2^2 + \dots + e_N^2$

Rewriting as a function of the parameters $w_0$ and $w_1$, we obtain

$RSS(w_0, w_1) = \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_i) \right)^2$
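A minimal Python sketch of this RSS computation for candidate parameters; the x and y arrays below are synthetic stand-ins, not the slide's data:

```python
import numpy as np

def rss(w0, w1, x, y):
    """Residual Sum of Squares for the model y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# Illustrative data: y = 15 + 0.8 x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 50)
y = 15 + 0.8 * x + rng.normal(scale=5.0, size=x.size)

print(rss(15, 0.8, x, y))   # RSS of the "true" parameters
print(rss(0, 1.0, x, y))    # a worse candidate gives a larger RSS
```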

10
Least Squares Regression

Then, the goal is to find the values of the weights / parameters $w_0, w_1$ (sometimes named $\beta_0, \beta_1$) which minimize the RSS

$\hat{w}_0, \hat{w}_1 = \arg\min_{w_0, w_1} RSS(w_0, w_1) = \arg\min_{w_0, w_1} \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_i) \right)^2$

This is a function minimization problem we can solve by


• Computing the gradient $\nabla RSS(w)$ w.r.t. the weights $w = [w_0, w_1]$
• Solving $\nabla RSS(w) = 0$.
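For the univariate case, setting the gradient to zero yields the familiar closed-form solution. A hedged Python sketch (the x and y arrays are assumed to be NumPy arrays like those in the previous snippet):

```python
import numpy as np

def fit_simple_least_squares(x, y):
    """Closed-form least squares for y = w0 + w1 * x, from grad RSS = 0."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# w0_hat, w1_hat = fit_simple_least_squares(x, y)  # should be close to 15 and 0.8
```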

11
Multivariate Linear Regression

Suppose we have M features; then the multivariate regression problem is

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

Example:
A regression with 2 features $[x_1, x_2]$ and 1 target variable $y$. The least squares solution is a plane chosen by minimizing the distances between the observations and the plane.

12
Multivariate Linear Regression

Suppose we have M features; then the multivariate regression problem is

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

$RSS = \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_M x_{iM}) \right)^2 = \sum_{i=1}^{N} \left( y_i - \left( w_0 + \sum_{j=1}^{M} w_j x_{ij} \right) \right)^2$

$RSS(w) = (y - Xw)^\top (y - Xw)$

13
Multivariate Linear Regression

Suppose we have M features; then the multivariate regression problem is

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

In matrix form:

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1M} \\ 1 & x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix}$

• $y$ is an N x 1 vector of target values
• $X$ is an N x (M+1) data matrix
• $w$ is an (M+1) x 1 vector of weights

$RSS(w) = (y - Xw)^\top (y - Xw)$

14
Multivariate Linear Regression

$RSS(w) = (y - Xw)^\top (y - Xw)$

This is a quadratic function, thus convex, thus it has a unique minimum. Let's compute the derivatives:

$\frac{\partial RSS}{\partial w} = -2 X^\top (y - Xw)$

Setting the derivatives equal to zero and solving for $w$ we obtain

$w = (X^\top X)^{-1} X^\top y$

which are known as the normal equations for the least squares problem. The matrix $(X^\top X)^{-1} X^\top$ is the Moore-Penrose pseudo-inverse of $X$; computing it can be quite expensive with many samples.
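A minimal NumPy sketch of the normal equations on synthetic data (in practice `np.linalg.lstsq` or the pseudo-inverse is usually preferred over an explicit inverse for numerical stability):

```python
import numpy as np

# Illustrative data: N samples, M features
rng = np.random.default_rng(0)
N, M = 200, 3
X_feat = rng.normal(size=(N, M))
true_w = np.array([2.0, 1.0, -0.5, 3.0])           # [w0, w1, w2, w3]
X = np.hstack([np.ones((N, 1)), X_feat])           # N x (M+1) design matrix
y = X @ true_w + rng.normal(scale=0.1, size=N)     # noisy targets

# Normal equations: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically safer) alternative via the pseudo-inverse
w_hat_pinv = np.linalg.pinv(X) @ y
print(w_hat, w_hat_pinv)
```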
15
Non-linearity of data

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

Multivariate linear regression assumes


• Linear relationship between features and target variables
• Additive relationship between features and target variables

Linear models may not be sufficient to fit the observed data, as a linear relationship between features and target may not hold.

16
Generalized Linear Regression

Given a set of input variables $x$, a set of $N$ examples $\langle x_i, y_i \rangle$ and a set of D features $f_j$ computed from the input variables $x_i$, we get the model

$y = w_0 + w_1 f_1(x) + w_2 f_2(x) + \dots + w_D f_D(x) + \epsilon$

• $f_j(\cdot)$ identifies variables derived from the original inputs
• $f_j(\cdot)$ could be derived from an existing variable, e.g., the squared value, a trigonometric function, the age given the date of birth, etc.

Note: the solution can still be computed via the matrix pseudo-inverse (but you get a non-linear model)

$w = \left( f(X)^\top f(X) \right)^{-1} f(X)^\top y$
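A hedged sketch of this idea with a hand-crafted feature map; the choice f(x) = [1, x, x², sin x] below is purely illustrative:

```python
import numpy as np

def feature_map(x):
    """Derived features f_j(x): here 1, x, x^2 and sin(x) as an illustration."""
    return np.column_stack([np.ones_like(x), x, x ** 2, np.sin(x)])

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
y = 1.0 + 0.5 * x ** 2 + rng.normal(scale=0.2, size=x.size)

F = feature_map(x)                 # N x (D+1) matrix of derived features
w = np.linalg.pinv(F) @ y          # same pseudo-inverse solution as before
y_hat = F @ w                      # non-linear in x, still linear in w
```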
17
Polynomial Regression

Input: LSTAT - % lower status of the population


Output: MEDV - Median value of owner-occupied homes in $1000's

18
Polynomial Regression

Given a set of examples associating $LSTAT_i$ values to $MEDV_i$ values, nonlinear regression finds a function $f(\cdot)$ such that

$MEDV_i = f(LSTAT_i) + \epsilon_i$

where $\epsilon_i$ is the error to be minimized.

A polynomial model would fit the data points with a function of the form

$f(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_p x^p$
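A minimal sketch of fitting such a polynomial with NumPy; the degree and the data points below are placeholders, not the actual LSTAT/MEDV values:

```python
import numpy as np

# Placeholder arrays standing in for LSTAT (x) and MEDV (y)
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([35.0, 25.0, 20.0, 17.0, 15.0, 14.0])

degree = 2
coeffs = np.polyfit(x, y, deg=degree)     # least squares polynomial fit
poly = np.poly1d(coeffs)

print(poly(12.0))                         # predicted MEDV for LSTAT = 12
```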

19
Polynomial Regression

20
Polynomial Regression

21
Polynomial Regression

Which one do you prefer?

22
Coefficient of Determination $R^2$

Total sum of squares: $TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2$

Coefficient of determination: $R^2 = 1 - \frac{RSS}{TSS}$

What about new data?

$R^2$ measures how well the regression line approximates the real data points. When $R^2$ is 1, the regression line perfectly fits the data.
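A small Python sketch of this computation; the y and y_hat arrays are illustrative placeholders:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination R^2 = 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

# Illustrative check: a perfect fit gives R^2 = 1
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))          # 1.0
print(r_squared(y, y * 0.9))    # < 1.0
```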
23
Model Evaluation

Would it be feasible to evaluate students using exactly the same problems solved in class?

Overfitting: perfect output on the training data, terrible outcome on data it has never seen before.

Models should be evaluated using data that have not been used to build
the model itself:
• Training data will be used to build the model
• Test data will be used to evaluate the model performance
24
Hold Out Evaluation

Reserves a certain amount for testing and uses the remainder for training
• Too small training sets might result in poor weight estimation
• Too small test sets might result in a poor estimation of future performance

Typically,
• Reserve 2/3 for training and 1/3 for testing (but percentage may vary)
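A hedged scikit-learn sketch of this split; the 1/3 test fraction follows the slide, and the X, y data below are synthetic stand-ins for the housing features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)

# 2/3 train, 1/3 test as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on the held-out test data
```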

25
Hold Out Evaluation on Housing Data

Given the original dataset, split the data into 2/3 train and 1/3 test and
then apply linear regression using polynomials of increasing degree.

26
Cross-Validation

For small or “unbalanced” datasets, Hold Out samples might not be


representative, thus k-fold Cross-Validation can be used:

• Data is split into k subsets of equal size


• Each subset in turn is used for testing and the remainder for training
• The error estimates are averaged to yield an overall error estimate
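A hedged scikit-learn sketch of k-fold cross-validation; k=10 is just a common choice, and X, y are synthetic as in the hold-out example:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())   # averaged score and its spread over folds
```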

27
Cross-Validation

For small or “unbalanced” datasets, Hold Out samples might not be representative, thus k-fold Cross-Validation can be used (K=10 usually gets accurate estimates):

[Figure: 10-fold cross-validation. The data is split into folds 1-10; in partition p1 fold 1 is used for testing and the rest for training, in p2 fold 2 is the test fold, and so on up to p10. The whole procedure is sometimes repeated multiple times (e.g., 10).]
28
K-fold Cross-Validation on Housing Data

Given the original dataset, perform k-fold Cross-Validation on linear


regression using polynomials of increasing degree.
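A possible scikit-learn sketch of this experiment; synthetic one-feature data stands in for the housing dataset, and the degrees tried are arbitrary:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 15 + 0.8 * x[:, 0] - 0.05 * x[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

for degree in (1, 2, 3, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=10, scoring="r2")
    print(degree, scores.mean())   # cross-validated R^2 per polynomial degree
```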

29
Regression: the machine learning perspective
Example: Increasing Sales by Advertising

31
What can we ask of the data?

• Is there a relationship between advertising budget and sales?


• How strong is the relationship between advertising budget and sales?
• Which media contribute to sales?
• How accurately can we estimate the effect of each medium on sales?
• Is the relationship linear?
• Is there synergy among the advertising media?
32
Simple linear regression

Let's assume that a linear relationship exists between Y and X

We say sales regress on TV through some parameters:

$\text{sales} \approx w_0 + w_1 \times \text{TV}$

• Model coefficients $w_0$ and $w_1$
• After training, a new data point can be predicted as $\hat{y} = \hat{w}_0 + \hat{w}_1 x$

Given a dataset, the coefficients above can be estimated by using least squares to minimize the Residual Sum of Squares
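A hedged sketch of fitting such a model with scikit-learn; the TV/sales numbers below are made-up illustrative values, not the Advertising dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative advertising data: TV budget and observed sales
tv = np.array([[10.0], [50.0], [100.0], [150.0], [200.0]])
sales = np.array([7.0, 9.0, 13.0, 15.5, 18.0])

model = LinearRegression().fit(tv, sales)
print(model.intercept_, model.coef_[0])     # estimated w0 and w1
print(model.predict([[120.0]]))             # predicted sales for TV = 120
```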

33
Example: TV Advertising vs Sales

34
Least squares fitting

Least squares fitting minimizes RSS (Residual Sum of Squares)

$RSS = \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_i) \right)^2$

obtaining the following estimates

$\hat{w}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}$

where $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$

35
Least squares solution …

36
Least squares fitting

Least squares fitting minimizes RSS (Residual Sum of Squares), obtaining the estimates

$\hat{w}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}$

where $\bar{x}$ and $\bar{y}$ are the sample means.

But are they any good?
37
Population regression line

Recall from Statistical Learning theory the underlying hypothesis

$Y = w_0 + w_1 X + \epsilon$

• $w_0$: the Y value when X = 0
• $w_1$: the average increase in Y due to a unitary increase in X
• $\epsilon$: the error term captures all the rest …

This model is known as the “population regression line”

• the best linear approximation of the true model
• it might differ from the least squares regression line

The population regression line is to the mean of a distribution as the least squares regression line is to the sample mean …
38
Example: population regression line

Recall here the model variance in the bias-variance trade-off

39
Parameters confidence intervals

In general, the error variance is not known, but it can be estimated from the residuals (if the model fits properly)

From standard errors we can compute confidence intervals for the linear regression parameters, e.g., the 95% confidence intervals for the parameters are approximately

$\hat{w}_j \pm 2 \cdot SE(\hat{w}_j)$

i.e., the true slope is, with 95% probability, in the range $[\hat{w}_1 - 2 \cdot SE(\hat{w}_1), \hat{w}_1 + 2 \cdot SE(\hat{w}_1)]$ (strictly, the factor 2 should be the 97.5% quantile of a t-distribution)
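A hedged sketch of such an interval for the slope, using the usual closed-form standard error and the t quantile mentioned above; the data is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 15 + 0.8 * x + rng.normal(scale=2.0, size=n)

# Least squares estimates
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Residual variance estimate and standard error of the slope
residuals = y - (w0 + w1 * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)
se_w1 = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

t975 = stats.t.ppf(0.975, df=n - 2)          # 97.5% quantile of t(n-2)
print(w1 - t975 * se_w1, w1 + t975 * se_w1)  # 95% CI for the slope
```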

40
Standard error & linear regression

The (squared) standard error for the mean estimator represents the average distance of the sample mean from the real mean

$SE(\hat{\mu})^2 = \frac{\sigma^2}{N}$

We have formulae for the standard errors of the linear regression coefficients

$SE(\hat{w}_0)^2 = \sigma^2 \left[ \frac{1}{N} + \frac{\bar{x}^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \right], \qquad SE(\hat{w}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$

These formulae assume

• uncorrelated errors …
• … having the same (unknown) variance $\sigma^2$

41
Example: TV Advertising data
If we consider the 95% confidence intervals:

• for the intercept we have a range that can be read as the expected sales without any advertising
• for the slope we have a range that can be read as the average impact of TV advertising

42
