
Machine Learning and Data Analytics

Linear regression – Part 1

Dr. Rossana Cavagnini

Deutsche Post Chair – Optimization of Distribution Networks (DPO)


RWTH Aachen University

mlda@dpo.rwth-aachen.de
Simple linear regression

Agenda

1 Simple linear regression


Overview
Assessing the accuracy of the coefficient estimates
Assessing the accuracy of the model

DPO MLDA 2
Simple linear regression


Overview

Regression: find the line that best fits the data


Context:
n observation pairs (measurements of X and Y):
(x1, y1), (x2, y2), ..., (xn, yn)
Predict a quantitative response Y on the basis of a single predictor variable X:
Y ≈ β0 + β1 X

Y: target (output)
X: input
β0: intercept
β1: slope (steepness)


Estimating the coefficients

Example: Advertising
Regress sales onto TV advertising: sales ≈ β0 + β1 TV
[Figure: scatter plot of Sales (vertical axis) against TV (horizontal axis). Regression line in blue and errors in gray.]
Error = actual sales − predicted sales
ŷi = β̂0 + β̂1 xi: prediction for Y based on the ith value of X
Residual: ei = yi − ŷi
A good fit minimizes the residuals

What will a good fit minimize?

1 The sum of the errors? → No. Positive and negative errors cancel each other.
2 The sum of the absolute values of the errors? → No. There can be multiple regression lines minimizing this sum.
3 The sum of squared errors? → Yes. There is only one line minimizing this sum (provided the xi are not all identical).
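The cancellation argument in point 1 can be checked numerically. The following is a minimal sketch on hypothetical toy data (not the Advertising data): a flat line through the mean of Y and the perfectly fitting line both have a sum of errors equal to zero, so that criterion cannot tell them apart, while the sum of squared errors can.

```python
def sum_of_errors(x, y, beta0, beta1):
    """Plain sum of the errors y_i - (beta0 + beta1 * x_i)."""
    return sum(yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y))


def sum_of_squared_errors(x, y, beta0, beta1):
    """Sum of the squared errors (the quantity that least squares minimizes)."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data lying exactly on the line y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

flat_line = (6.0, 0.0)   # intercept = mean of y, slope = 0: clearly a poor fit
exact_line = (0.0, 2.0)  # the line the data were generated from

for name, (b0, b1) in [("flat", flat_line), ("exact", exact_line)]:
    print(name, sum_of_errors(x, y, b0, b1), sum_of_squared_errors(x, y, b0, b1))
```

Both lines report a sum of errors of zero, but only the exact line has a sum of squared errors of zero.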


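OLS
The ordinary least squares (OLS) criterion minimizes the residual sum of squares (RSS):
\[ \mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 \]
by choosing the least squares coefficient estimates:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]
where \( \bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i \) and \( \bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i \).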
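The closed-form estimates translate directly into code. The sketch below uses only the Python standard library; the function names `ols_fit` and `rss` and the toy data are illustrative assumptions, not part of the lecture material.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) for y ≈ beta0 + beta1 * x via least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def rss(x, y, beta0, beta1):
    """Residual sum of squares for the line beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (not the Advertising data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = ols_fit(x, y)
print(f"beta0_hat = {beta0_hat:.4f}, beta1_hat = {beta1_hat:.4f}")
print(f"RSS at the estimates: {rss(x, y, beta0_hat, beta1_hat):.4f}")
```

Moving either coefficient away from the estimates can only increase the RSS, which is an easy sanity check for an implementation like this one.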


Example: Advertising
How many units will be sold by spending 1000 USD in TV advertising?
Regress sales onto TV advertising: sales ≈ β0 + β1 TV.
[Figure, left: scatter plot of Sales against TV. Right: RSS as a function of β0 and β1; the red dot corresponds to the pair of least squares estimates.]
OLS: β̂0 = 7.03 and β̂1 = 0.0475
By spending 1000 USD on TV advertising, 47.5 additional units are sold.
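The arithmetic behind this statement can be sketched as follows. The units are an assumption here: the TV budget is taken to be in thousands of USD and sales in thousands of units, which is consistent with how the slides interpret the intercept later on. The function name `predict_sales` is illustrative.

```python
BETA0_HAT = 7.03    # estimated intercept (assumed unit: thousands of units sold)
BETA1_HAT = 0.0475  # estimated slope (thousands of units per thousand USD of TV budget)


def predict_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget in thousands of USD."""
    return BETA0_HAT + BETA1_HAT * tv_budget


baseline = predict_sales(0.0)       # no TV advertising
with_1000_usd = predict_sales(1.0)  # 1000 USD of TV advertising
extra_units = (with_1000_usd - baseline) * 1000  # convert thousands of units to units

print(f"Predicted sales without TV advertising: {baseline * 1000:.0f} units")
print(f"Additional units for 1000 USD: {extra_units:.1f}")
```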
Assessing the accuracy of the coefficient estimates

The population regression line is the best linear approximation to the true relationship between X and Y:
Y = β0 + β1 X + ε

β0, β1: unknown coefficients, estimated by β̂0, β̂1
The error term (ε) captures what we miss with the simple model:
The true relationship is probably not linear
Other variables cause variation in Y
Measurement error

The least squares line is characterized by the least squares coefficient estimates
The true relationship is generally not known, while the least squares line can always be computed

Population regression line vs least squares line

[Figure: two panels plotting Y against X.]
Left: the red line is the population regression line, the blue line is the least squares line.
Right: the light blue lines are 10 least squares lines, each fitted to a different set of observations.
The least squares line uses information from a sample to estimate characteristics of a large population
For a particular set of observations, the estimates may underestimate or overestimate the true coefficients
Averaged over a huge number of data sets, the estimates are expected to be exactly equal to the true values (the estimators are unbiased)
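A small simulation illustrates this behaviour. The sketch below is an assumption-laden toy experiment: the true line Y = 2 + 3X + ε, the noise level, the range of X and the sample sizes are all hypothetical choices and are not taken from the slides.

```python
import random


def ols_fit(x, y):
    """Least squares estimates (beta0_hat, beta1_hat) for y ≈ beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    return y_bar - beta1 * x_bar, beta1


def simulate_estimates(n_datasets, n_obs, beta0=2.0, beta1=3.0, sigma=2.0, seed=0):
    """Fit one least squares line per simulated data set drawn from the same population."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_datasets):
        x = [rng.uniform(-2.0, 2.0) for _ in range(n_obs)]
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        estimates.append(ols_fit(x, y))
    return estimates


estimates = simulate_estimates(n_datasets=2000, n_obs=50)
slopes = [b1 for _, b1 in estimates]
mean_beta0 = sum(b0 for b0, _ in estimates) / len(estimates)
mean_beta1 = sum(slopes) / len(slopes)

print(f"single-sample slopes range from {min(slopes):.2f} to {max(slopes):.2f}")
print(f"average estimates: beta0 ≈ {mean_beta0:.3f}, beta1 ≈ {mean_beta1:.3f}")
```

Individual fits scatter on both sides of the true slope, while the averages over many data sets land close to the assumed true values 2 and 3.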

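How far will a single estimate be from the actual value?

Standard error of µ̂ (how the deviation shrinks with n):
\[ \mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n} \]
Standard errors (SE) of the estimators:
\[ \mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
where σ² = Var(ε) (generally not known)

SE(β̂1)² (and, for a given x̄, SE(β̂0)²) is small when the xi are more spread out, since Σ (xi − x̄)² appears in the denominator

σ is estimated from the data by the residual standard error:
\[ \mathrm{RSE} = \sqrt{\mathrm{RSS}/(n - 2)} \]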
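These formulas can be sketched in a few lines, with the unknown σ replaced by the RSE. The function name, the returned dictionary keys, and the toy data are illustrative assumptions.

```python
import math


def coefficient_standard_errors(x, y):
    """Estimate RSE, SE(beta0_hat) and SE(beta1_hat), plugging in RSE for sigma."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations (n - 2 degrees of freedom)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))  # estimate of sigma
    return {
        "beta0": beta0,
        "beta1": beta1,
        "rse": rse,
        "se_beta0": rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx),
        "se_beta1": rse / math.sqrt(sxx),
    }


# Hypothetical toy data (not the Advertising data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

result = coefficient_standard_errors(x, y)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```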

What are standard errors useful for?

1. Confidence intervals
A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter
There is approximately a 95% chance that the interval [β̂1 − 2 SE(β̂1), β̂1 + 2 SE(β̂1)] will contain the true value of β1.
Example: Advertising
β0 has a 95% confidence interval of [6.130, 7.935]
β1 has a 95% confidence interval of [0.042, 0.053]
With no advertising, sales will, on average, fall somewhere between 6130 and 7935 units
For each 1000 USD increase in TV advertising, there is an average increase in sales of between 42 and 53 units.
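The "± 2 standard errors" rule can be sketched as below, using the coefficient and standard error values reported for the Advertising example in the coefficient table of these slides. The function name is an illustrative assumption, and the factor 2 is the approximation stated above rather than an exact quantile.

```python
def approx_confidence_interval(estimate, std_error, multiplier=2.0):
    """Approximate 95% confidence interval: estimate ± multiplier * standard error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


# Values as reported in the Advertising coefficient table of these slides
beta0_ci = approx_confidence_interval(7.0325, 0.4578)
beta1_ci = approx_confidence_interval(0.0475, 0.0027)

print(f"beta0: [{beta0_ci[0]:.3f}, {beta0_ci[1]:.3f}]")
print(f"beta1: [{beta1_ci[0]:.3f}, {beta1_ci[1]:.3f}]")
```

The resulting intervals are slightly wider than the ones quoted above; the difference presumably comes from using an exact t-distribution quantile (a little below 2) instead of the factor 2.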

2. Hypothesis tests on the coefficients
H0: β1 = 0: No relationship between X and Y (Y = β0 + ε)
HA: β1 ≠ 0: Some relationship between X and Y
Is β̂1 far from zero? How far is “far enough”? → This depends on the accuracy of β̂1 (SE(β̂1))
The t-statistic measures the number of standard deviations that β̂1 is away from 0:
\[ t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)} \]
If SE(β̂1) is small, even a small β̂1 suggests that β1 ≠ 0 (a relationship between X and Y)
If SE(β̂1) is large, β̂1 must be large in absolute value to reject the null hypothesis
The p-value is the probability of observing any value equal to |t| or larger in absolute value, assuming that β1 = 0
Small p-value: there is an association between the predictor and the response, i.e., reject the null hypothesis. Typical cutoff values: 5% or 1%
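The test can be sketched with the standard library alone. One loud assumption: the p-value below uses a standard normal approximation to the t-distribution, which is reasonable only for large n; an exact computation would use the t-distribution with n − 2 degrees of freedom. The function names are illustrative.

```python
import math


def t_statistic(estimate, std_error, null_value=0.0):
    """Number of standard errors between the estimate and the null value."""
    return (estimate - null_value) / std_error


def approx_two_sided_p_value(t):
    """P(|Z| >= |t|) for a standard normal Z (assumption: n large enough)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Rounded TV coefficient and standard error from the Advertising example
t_tv = t_statistic(0.0475, 0.0027)
p_tv = approx_two_sided_p_value(t_tv)

print(f"t = {t_tv:.2f}")
print(f"approximate p-value = {p_tv:.3g}")
print("reject H0 at the 1% level" if p_tv < 0.01 else "do not reject H0 at the 1% level")
```

With the rounded inputs the t-statistic comes out near 17.6 rather than the 17.67 reported for the example, which is presumably computed from unrounded values.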

Example: Advertising

            Coefficient   Std. Error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         < 0.0001
TV          0.0475        0.0027       17.67         < 0.0001

β̂0 and β̂1 are very large relative to their standard errors → the t-statistics are large
The p-values are close to zero → β0 ≠ 0 and β1 ≠ 0 (reject the null hypotheses)
β0 ≠ 0: in the absence of TV advertising, sales are non-zero
β1 ≠ 0: there is a relationship between TV and sales


Assessing the accuracy of the model

1. The sum of squared errors (RSS) is an absolute measure (it grows as the number of points increases)
2. The Residual Standard Error (RSE) is an estimate of the standard deviation of ε, i.e., the average amount that the response will deviate from the true regression line:
\[ \mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \]
Absolute measure (expressed in units of Y)

3. The R² statistic measures the proportion of variability in Y that can be explained using X:
\[ R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \]
where the total sum of squares is \( \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \)
It is a relative measure (0 ≤ R² ≤ 1)
Close to 1: a large proportion of the variability in the response can be explained by the regression
Close to 0: the regression does not explain much of the variability in the response.
For simple linear regression, R² = [Corr(X, Y)]²
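The identity R² = [Corr(X, Y)]² can be verified numerically. This is a minimal sketch on hypothetical toy data; the function name `fit_and_score` and the returned tuple layout are assumptions made for illustration.

```python
import math


def fit_and_score(x, y):
    """Fit y ≈ beta0 + beta1 * x by least squares and return (rss, tss, r2, corr)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)  # variability around the mean of y
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - rss / tss
    corr = sxy / math.sqrt(sxx * tss)  # sample correlation between x and y
    return rss, tss, r2, corr


# Hypothetical toy data (not the Advertising data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

rss, tss, r2, corr = fit_and_score(x, y)
print(f"RSS = {rss:.4f}, TSS = {tss:.4f}")
print(f"R^2 = {r2:.6f}, Corr(X, Y)^2 = {corr ** 2:.6f}")
```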


Remark:
\[ \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{vs} \quad \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Both use the actual data points yi of the training data set
RSS measures the variability left unexplained after performing the regression (differences with respect to the predictions)
TSS measures the total variability in Y before the regression is performed (differences with respect to the mean of the data points)
The ratio RSS/TSS measures how good the model is compared with simply predicting the mean value ȳ for every observation
A low ratio: the errors with respect to the predictions are small compared with the errors with respect to the mean → the model explains the data well (R² close to 1)


Example: Advertising

Quantity                  Value
Residual Standard Error   3.26
R²                        0.612

RSE: actual sales in each market deviate from the true regression line by approx. 3.26 thousand units (3,260 units), on average
R²: about 61% (just under 2/3) of the variability in sales is explained by a linear regression on the TV budget
