
Regression

aka how much is my car worth?

Prof Manuel Roveri – manuel.roveri@polimi.it

ICT4Dev Summer School


Regression Problem

Can we learn to predict a target variable


(e.g., MEDV) by observing input features
(e.g., CRIM, ZN, INDUS, …)?

2
Regression Problem

Let’s start with a simple example … can we predict Y from X?

3
Regression Problem

Let’s start with a simple example … can we predict Y from X?

$y = w_0 + w_1 x$

where $w_0$ is the intercept and $w_1$ is the slope.

In the example
• $w_0 = 15$
• $w_1 = 0.8$

4
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

5
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

• Many alternative solutions

6
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

• Many alternative solutions
• Linear modeling is an assumption

7
Regression Problem

Let’s add noise … can we still predict Y from X?


Noisy observations

• Many alternative solutions
• Linear modeling is an assumption

Which one is the best?

8
Linear Regression

Given observed pairs $\langle x_i, y_i \rangle$, $i = 1, \dots, N$, find the best linear model for the noisy observations

$y = w_0 + w_1 x$

• Infinite solutions exist
• Need to define an optimality criterion

Least Squares Estimation!

9
Least Squares Regression

The Residual Sum of Squares (RSS) is used to evaluate the model. It is defined as the sum of the squared residuals $e_i = y_i - \hat{y}_i$, i.e.,

$RSS = e_1^2 + e_2^2 + \dots + e_N^2$

Rewriting as a function of the parameters $w_0$ and $w_1$, we obtain

$RSS(w_0, w_1) = \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_i) \right)^2$
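A minimal Python sketch of this RSS computation for candidate parameters; the x and y arrays below are synthetic stand-ins, not the slide's data:

```python
import numpy as np

def rss(w0, w1, x, y):
    """Residual Sum of Squares for the model y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals ** 2)

# Illustrative data: y = 15 + 0.8 x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 50)
y = 15 + 0.8 * x + rng.normal(scale=5.0, size=x.size)

print(rss(15, 0.8, x, y))   # RSS of the "true" parameters
print(rss(0, 1.0, x, y))    # a worse candidate gives a larger RSS
```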

10
Least Squares Regression

Then, the goal is to find the values of the weights / parameters $w_0, w_1$ (sometimes named $\beta_0, \beta_1$) which minimize the RSS

$\hat{w}_0, \hat{w}_1 = \arg\min_{w_0, w_1} RSS(w_0, w_1) = \arg\min_{w_0, w_1} \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_i) \right)^2$

This is a function minimization problem we can solve by


• Computing the gradient $\nabla RSS(w)$ w.r.t. the weights $w = [w_0, w_1]$
• Solving $\nabla RSS(w) = 0$.
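For the univariate case, setting the gradient to zero yields the familiar closed-form solution. A hedged Python sketch (the x and y arrays are assumed to be NumPy arrays like those in the previous snippet):

```python
import numpy as np

def fit_simple_least_squares(x, y):
    """Closed-form least squares for y = w0 + w1 * x, from grad RSS = 0."""
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# w0_hat, w1_hat = fit_simple_least_squares(x, y)  # should be close to 15 and 0.8
```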

11
Multivariate Linear Regression

Suppose we have M features; then the multivariate regression problem is

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

Example:
A regression with 2 features $[x_1, x_2]$ and 1 target variable $y$. The least squares solution is a plane chosen by minimizing the distances between the observations and the plane.

12
Multivariate Linear Regression

Suppose we have M features; then the multivariate regression problem is

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

$RSS = \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_{i1} + w_2 x_{i2} + \dots + w_M x_{iM}) \right)^2 = \sum_{i=1}^{N} \left( y_i - \left( w_0 + \sum_{j=1}^{M} w_j x_{ij} \right) \right)^2$

$RSS(w) = (y - Xw)^\top (y - Xw)$

13
Multivariate Linear Regression

Suppose we have M features; then the multivariate regression problem is

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

In matrix form:

$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1M} \\ 1 & x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix}$

• $y$ is an N x 1 vector of target values
• $X$ is an N x (M+1) data matrix
• $w$ is an (M+1) x 1 vector of weights

$RSS(w) = (y - Xw)^\top (y - Xw)$

14
Multivariate Linear Regression

$RSS(w) = (y - Xw)^\top (y - Xw)$

This is a quadratic function, thus convex, thus it has a unique minimum. Let's compute the derivatives:

$\frac{\partial RSS}{\partial w} = -2 X^\top (y - Xw)$

Setting the derivatives equal to zero and solving for $w$ we obtain

$w = (X^\top X)^{-1} X^\top y$

which are known as the normal equations for the least squares problem. The matrix $(X^\top X)^{-1} X^\top$ is the Moore-Penrose pseudo-inverse of $X$; computing it can be quite expensive with many samples.
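A minimal NumPy sketch of the normal equations on synthetic data (in practice `np.linalg.lstsq` or the pseudo-inverse is usually preferred over an explicit inverse for numerical stability):

```python
import numpy as np

# Illustrative data: N samples, M features
rng = np.random.default_rng(0)
N, M = 200, 3
X_feat = rng.normal(size=(N, M))
true_w = np.array([2.0, 1.0, -0.5, 3.0])           # [w0, w1, w2, w3]
X = np.hstack([np.ones((N, 1)), X_feat])           # N x (M+1) design matrix
y = X @ true_w + rng.normal(scale=0.1, size=N)     # noisy targets

# Normal equations: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically safer) alternative via the pseudo-inverse
w_hat_pinv = np.linalg.pinv(X) @ y
print(w_hat, w_hat_pinv)
```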
15
Non-linearity of data

$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_M x_M + \epsilon$

Multivariate linear regression assumes


• Linear relationship between features and target variables
• Additive relationship between features and target variables

Linear models may not be sufficient to fit the observed data, as a linear relationship between features and target may not hold.

16
Generalized Linear Regression

Given a set of input variables $x$, a set of $N$ examples $\langle x_i, y_i \rangle$ and a set of D features $f_j$ computed from the input variables $x_i$, we get the model

$y = w_0 + w_1 f_1(x) + w_2 f_2(x) + \dots + w_D f_D(x) + \epsilon$

• $f_j(\cdot)$ identifies variables derived from the original inputs
• $f_j(\cdot)$ could be derived from an existing variable, e.g., the squared value, a trigonometric function, the age given the date of birth, etc.

Note: the solution can still be computed via the matrix pseudo-inverse (but you get a non-linear model)

$w = \left( f(X)^\top f(X) \right)^{-1} f(X)^\top y$
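A hedged sketch of this idea with a hand-crafted feature map; the choice f(x) = [1, x, x², sin x] below is purely illustrative:

```python
import numpy as np

def feature_map(x):
    """Derived features f_j(x): here 1, x, x^2 and sin(x) as an illustration."""
    return np.column_stack([np.ones_like(x), x, x ** 2, np.sin(x)])

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100)
y = 1.0 + 0.5 * x ** 2 + rng.normal(scale=0.2, size=x.size)

F = feature_map(x)                 # N x (D+1) matrix of derived features
w = np.linalg.pinv(F) @ y          # same pseudo-inverse solution as before
y_hat = F @ w                      # non-linear in x, still linear in w
```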
17
Polynomial Regression

Input: LSTAT - % lower status of the population


Output: MEDV - Median value of owner-occupied homes in $1000's

18
Polynomial Regression

Given a set of examples associating $LSTAT_i$ values to $MEDV_i$ values, nonlinear regression finds a function $f(\cdot)$ such that

$MEDV_i = f(LSTAT_i) + \epsilon_i$

where $\epsilon_i$ is the error to be minimized.

A polynomial model would fit the data points with a function of the form

$f(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_p x^p$
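A minimal sketch of fitting such a polynomial with NumPy; the degree and the data points below are placeholders, not the actual LSTAT/MEDV values:

```python
import numpy as np

# Placeholder arrays standing in for LSTAT (x) and MEDV (y)
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([35.0, 25.0, 20.0, 17.0, 15.0, 14.0])

degree = 2
coeffs = np.polyfit(x, y, deg=degree)     # least squares polynomial fit
poly = np.poly1d(coeffs)

print(poly(12.0))                         # predicted MEDV for LSTAT = 12
```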

19
Polynomial Regression

20
Polynomial Regression

21
Polynomial Regression

Which one do you prefer?

22
Coefficient of Determination $R^2$

Total sum of squares: $TSS = \sum_{i=1}^{N} (y_i - \bar{y})^2$

Coefficient of determination: $R^2 = 1 - \frac{RSS}{TSS}$

What about new data?

$R^2$ measures how well the regression line approximates the real data points. When $R^2$ is 1, the regression line perfectly fits the data.
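A small Python sketch of this computation; the y and y_hat arrays are illustrative placeholders:

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination R^2 = 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

# Illustrative check: a perfect fit gives R^2 = 1
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))          # 1.0
print(r_squared(y, y * 0.9))    # < 1.0
```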
23
Model Evaluation

Would it be feasible to evaluate students using exactly the same problems solved in class?

Overfitting: perfect output on the training data, terrible outcome on data it has never seen before.

Models should be evaluated using data that have not been used to build
the model itself:
• Training data will be used to build the model
• Test data will be used to evaluate the model performance
24
Hold Out Evaluation

Reserves a certain amount for testing and uses the remainder for training
• Too small training sets might result in poor weight estimation
• Too small test sets might result in a poor estimation of future performance

Typically,
• Reserve 2/3 for training and 1/3 for testing (but percentage may vary)
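A hedged scikit-learn sketch of this split; the 1/3 test fraction follows the slide, and the X, y data below are synthetic stand-ins for the housing features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)

# 2/3 train, 1/3 test as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on the held-out test data
```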

25
Hold Out Evaluation on Housing Data

Given the original dataset, split the data into 2/3 train and 1/3 test and
then apply linear regression using polynomials of increasing degree.

26
Cross-Validation

For small or “unbalanced” datasets, Hold Out samples might not be


representative, thus k-fold Cross-Validation can be used:

• Data is split into k subsets of equal size


• Each subset in turn is used for testing and the remainder for training
• The error estimates are averaged to yield an overall error estimate
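A hedged scikit-learn sketch of k-fold cross-validation; k=10 is just a common choice, and X, y are synthetic as in the hold-out example:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores.mean(), scores.std())   # averaged score and its spread over folds
```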

27
Cross-Validation

For small or “unbalanced” datasets, Hold Out samples might not be representative, thus k-fold Cross-Validation can be used (K=10 usually gets accurate estimates):

[Figure: 10-fold cross-validation. The data is split into folds 1-10; in partition p1 fold 1 is used for testing and the rest for training, in p2 fold 2 is the test fold, and so on up to p10. The whole procedure is sometimes repeated multiple times (e.g., 10).]
28
K-fold Cross-Validation on Housing Data

Given the original dataset, perform k-fold Cross-Validation on linear


regression using polynomials of increasing degree.
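A possible scikit-learn sketch of this experiment; synthetic one-feature data stands in for the housing dataset, and the degrees tried are arbitrary:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 15 + 0.8 * x[:, 0] - 0.05 * x[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

for degree in (1, 2, 3, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=10, scoring="r2")
    print(degree, scores.mean())   # cross-validated R^2 per polynomial degree
```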

29
Regression: the machine learning perspective
Example: Increasing Sales by Advertising

31
What can we ask of the data?

• Is there a relationship between advertising budget and sales?


• How strong is the relationship between advertising budget and sales?
• Which media contribute to sales?
• How accurately can we estimate the effect of each medium on sales?
• Is the relationship linear?
• Is there synergy among the advertising media?
32
Simple linear regression

Let's assume that a linear relationship exists between Y and X

We say sales regress on TV through some parameters:

$\text{sales} \approx w_0 + w_1 \times \text{TV}$

• Model coefficients $w_0$ and $w_1$
• After training, a new data point can be predicted as $\hat{y} = \hat{w}_0 + \hat{w}_1 x$

Given a dataset, the coefficients above can be estimated by using least squares to minimize the Residual Sum of Squares
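A hedged sketch of fitting such a model with scikit-learn; the TV/sales numbers below are made-up illustrative values, not the Advertising dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative advertising data: TV budget and observed sales
tv = np.array([[10.0], [50.0], [100.0], [150.0], [200.0]])
sales = np.array([7.0, 9.0, 13.0, 15.5, 18.0])

model = LinearRegression().fit(tv, sales)
print(model.intercept_, model.coef_[0])     # estimated w0 and w1
print(model.predict([[120.0]]))             # predicted sales for TV = 120
```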

33
Example: TV Advertising vs Sales

34
Least squares fitting

Least squares fitting minimizes RSS (Residual Sum of Squares)

$RSS = \sum_{i=1}^{N} \left( y_i - (w_0 + w_1 x_i) \right)^2$

obtaining the following estimates

$\hat{w}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}$

where $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$

35
Least squares solution …

36
Least squares fitting

Least squares fitting minimizes RSS (Residual Sum of Squares), obtaining the estimates

$\hat{w}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}$

where $\bar{x}$ and $\bar{y}$ are the sample means.

But are they any good?
37
Population regression line

Recall from Statistical Learning theory the underlying hypothesis

$Y = w_0 + w_1 X + \epsilon$

• $w_0$: the Y value when X = 0
• $w_1$: the average increase in Y due to a unitary increase in X
• $\epsilon$: the error term captures all the rest …

This model is known as the “population regression line”

• the best linear approximation of the true model
• it might differ from the least squares regression line

The population regression line is to the mean of a distribution as the least squares regression line is to the sample mean …
38
Example: population regression line

Recall here the model variance in the bias-variance trade-off

39
Parameters confidence intervals

In general, the error variance is not known, but it can be estimated from the residuals (if the model fits properly)

From standard errors we can compute confidence intervals for the linear regression parameters, e.g., the 95% confidence intervals for the parameters are approximately

$\hat{w}_j \pm 2 \cdot SE(\hat{w}_j)$

i.e., the true slope is, with 95% probability, in the range $[\hat{w}_1 - 2 \cdot SE(\hat{w}_1), \hat{w}_1 + 2 \cdot SE(\hat{w}_1)]$ (strictly, the factor 2 should be the 97.5% quantile of a t-distribution)
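A hedged sketch of such an interval for the slope, using the usual closed-form standard error and the t quantile mentioned above; the data is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 15 + 0.8 * x + rng.normal(scale=2.0, size=n)

# Least squares estimates
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

# Residual variance estimate and standard error of the slope
residuals = y - (w0 + w1 * x)
sigma2 = np.sum(residuals ** 2) / (n - 2)
se_w1 = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))

t975 = stats.t.ppf(0.975, df=n - 2)          # 97.5% quantile of t(n-2)
print(w1 - t975 * se_w1, w1 + t975 * se_w1)  # 95% CI for the slope
```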

40
Standard error & linear regression

The (squared) standard error for the mean estimator represents the average distance of the sample mean from the real mean

$SE(\hat{\mu})^2 = \frac{\sigma^2}{N}$

We have formulae for the standard errors of the linear regression coefficients

$SE(\hat{w}_0)^2 = \sigma^2 \left[ \frac{1}{N} + \frac{\bar{x}^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \right], \qquad SE(\hat{w}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$

These formulae assume

• uncorrelated errors …
• … having the same (unknown) variance $\sigma^2$

41
Example: TV Advertising data
If we consider the 95% confidence intervals:

• for the intercept we have a range that can be read as the expected sales without any advertising
• for the slope we have a range that can be read as the average impact of TV advertising

42
