Linear Regression Part 1

Machine Learning and Data Analytics
Linear regression – Part 1
Dr. Rossana Cavagnini
Deutsche Post Chair – Optimization of Distribution Networks (DPO)

RWTH Aachen University
mlda@dpo.rwth-aachen.de
Agenda
1 Simple linear regression

Overview
Assessing the accuracy of the coefficient estimates
Assessing the accuracy of the model
Simple linear regression
Overview
Regression: find the line that best fits the data
Context:
n observation pairs (measurements of X and Y):
(x1, y1), (x2, y2), · · · , (xn, yn)
Predict a quantitative response Y on the basis of a single predictor variable X:
Y ≈ β0 + β1 X
Y: target (output)
X: input (predictor)
β0: intercept (value of Y at X = 0)
β1: slope (steepness)
Estimating the coefficients
Example: Advertising
Regress sales onto TV advertising: sales ≈ β0 + β1 TV
[Figure: scatter plot of Sales against TV, with the regression line in blue and the errors in gray]
Error = actual sales − predicted sales
ŷi = β̂0 + β̂1 xi: prediction for Y based on the ith value of X
Residual: ei = yi − ŷi
A good fit minimizes the residuals
What will a good fit minimize?
1 The sum of the errors? → No. Positive and negative errors cancel each other.
2 The sum of the absolute values of the errors? → No. There can be multiple regression lines minimizing this sum.
3 The sum of squared errors? → Yes. There is only one line minimizing this sum.
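The three candidate criteria can be checked numerically. The sketch below is only an illustration: the two toy data sets, the candidate lines and the helper name `residuals` are assumptions made for this example and are not part of the lecture material.

```python
def residuals(xs, ys, intercept, slope):
    """Errors e_i = y_i - (intercept + slope * x_i) for a candidate line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# 1) Sum of errors: two very different lines through the point of means (3, 4).
xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
good = residuals(xs, ys, intercept=2.2, slope=0.6)    # least squares line for this data
bad = residuals(xs, ys, intercept=19.0, slope=-5.0)   # slope with the wrong sign

sum_good, sum_bad = sum(good), sum(bad)               # both (numerically) zero
sse_good = sum(e ** 2 for e in good)                  # about 2.4
sse_bad = sum(e ** 2 for e in bad)                    # 316.0

# 2) Sum of absolute errors: two different lines reach the same minimum value.
xs2, ys2 = [0, 0, 1, 1], [0, 1, 0, 1]
flat = residuals(xs2, ys2, intercept=0.5, slope=0.0)
diagonal = residuals(xs2, ys2, intercept=0.0, slope=1.0)

sae_flat = sum(abs(e) for e in flat)                  # 2.0
sae_diagonal = sum(abs(e) for e in diagonal)          # 2.0
sse_flat = sum(e ** 2 for e in flat)                  # 1.0
sse_diagonal = sum(e ** 2 for e in diagonal)          # 2.0

print(sum_good, sum_bad, sse_good, sse_bad)
print(sae_flat, sae_diagonal, sse_flat, sse_diagonal)
```

In the first data set the plain sum of errors cannot tell the two lines apart, while the sum of squares clearly prefers the first one. In the second data set both lines attain the smallest possible sum of absolute errors, whereas the sum of squares separates them.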
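OLS
The ordinary least squares (OLS) criterion minimizes the Residual Sum of Squares (RSS):
$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2$$
by defining the least squares coefficient estimates:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$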
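The closed-form estimates translate almost directly into code. The following is a minimal sketch in plain Python. The function names `ols_fit` and `rss` and the toy data are assumptions chosen for illustration, not part of the slides.

```python
from statistics import mean

def ols_fit(xs, ys):
    """Return (beta0_hat, beta1_hat) from the closed-form least squares formulas."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

def rss(xs, ys, beta0, beta1):
    """Residual sum of squares of the line y = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, used only to exercise the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

beta0_hat, beta1_hat = ols_fit(xs, ys)
best_rss = rss(xs, ys, beta0_hat, beta1_hat)
print(beta0_hat, beta1_hat, best_rss)

# Moving either coefficient away from the estimate can only increase the RSS.
print(rss(xs, ys, beta0_hat + 0.1, beta1_hat) > best_rss)
print(rss(xs, ys, beta0_hat, beta1_hat - 0.1) > best_rss)
```

For this toy data the sums are Σ(xi − x̄)(yi − ȳ) = 34.6 and Σ(xi − x̄)² = 17.5, so the slope is 34.6/17.5 ≈ 1.977 and the intercept is 7 − 1.977 · 3.5 ≈ 0.08.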
How many units will be sold by spending 1000 USD in TV advertising?
Regress sales onto TV advertising: sales ≈ β0 + β1 TV.
[Figure, left: scatter plot of Sales against TV with the least squares line. Right: RSS as a function of β0 and β1; the red dot corresponds to the pair of least squares estimates]
OLS: β̂0 = 7.03 and β̂1 = 0.0475
For each additional 1000 USD spent on TV advertising, about 47.5 additional units are sold (TV is measured in thousands of USD and sales in thousands of units)
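The arithmetic behind this interpretation can be written out explicitly. This is a small sketch under the assumption that TV is measured in thousands of USD and sales in thousands of units, which matches the way the confidence intervals are read later in the slides. The function name `predicted_sales` is hypothetical.

```python
# Least squares estimates reported for the Advertising example.
BETA0_HAT = 7.03
BETA1_HAT = 0.0475

def predicted_sales(tv_budget):
    """Predicted sales (thousands of units) for a TV budget in thousands of USD."""
    return BETA0_HAT + BETA1_HAT * tv_budget

# Predicted units sold with a TV budget of exactly 1000 USD (TV = 1).
units_at_1000_usd = predicted_sales(1.0) * 1000

# Additional units gained by raising the budget by 1000 USD (here from 100 to 101).
extra_units = (predicted_sales(101.0) - predicted_sales(100.0)) * 1000

print(units_at_1000_usd)  # about 7077.5
print(extra_units)        # about 47.5
```

The slope alone answers the "additional units" question. The intercept is needed as well when the total predicted sales at a given budget are of interest.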
Assessing the accuracy of the coefficient estimates
The population regression line is the best linear approximation to the true relationship between X and Y:
Y = β0 + β1 X + ε
β0 and β1 are the unknown population coefficients; β̂0 and β̂1 are their estimators
The error term (ε) captures what we miss with the simple model:
  The true relationship is probably not linear
  Other variables cause variation in Y
  Measurement error
The least squares line is characterized by the least squares regression coefficient estimates
The true relationship is generally not known, while the least squares line can always be computed
Population regression line vs least squares line
[Figure: two panels plotting Y against X]
Left: the red line is the population regression line, the blue line is the least squares line.
Right: the light blue lines are 10 least squares lines, each computed from a different set of observations.
The least squares line uses information from a sample to estimate characteristics of a large population
For a particular set of observations, the estimates may underestimate or overestimate the true coefficients
If the estimates are averaged over a huge number of observation sets, the average is expected to equal the true value exactly (the estimators are unbiased)
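The behaviour in the right panel can be reproduced with a small simulation. The sketch below is illustrative only: the population coefficients, the noise level, the sample size and the number of repetitions are all hypothetical values chosen for this example.

```python
import random
from statistics import mean

def ols_fit(xs, ys):
    """Least squares estimates (beta0_hat, beta1_hat) for simple linear regression."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = sxy / sxx
    return y_bar - beta1_hat * x_bar, beta1_hat

# Hypothetical population regression line Y = 2 + 3 X + eps, with sd(eps) = 2.
BETA0, BETA1, SIGMA = 2.0, 3.0, 2.0

rng = random.Random(0)  # fixed seed so that the run is reproducible

def fit_on_new_sample(n=100):
    """Draw a fresh sample of n observations and return its least squares fit."""
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for x in xs]
    return ols_fit(xs, ys)

fits = [fit_on_new_sample() for _ in range(1000)]
slopes = [b1 for _, b1 in fits]
intercepts = [b0 for b0, _ in fits]

# Individual fits miss the true slope in both directions ...
print(min(slopes), max(slopes))
# ... but their averages are close to the population coefficients.
avg_intercept, avg_slope = mean(intercepts), mean(slopes)
print(avg_intercept, avg_slope)
```

Each single fit plays the role of one light blue line, and the average over many fits approaches the red population line.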
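How far will a single estimate be from the actual value?

Standard error of µ̂, the sample mean used as an estimate of the population mean µ (how the deviation shrinks with n):
$$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}$$
Standard error (SE) of the estimators:
$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where σ² = Var(ε) (generally not known)
SE(β̂1)² is small when the xi are more spread out; for a fixed x̄ the same holds for SE(β̂0)²

Residual standard error (the estimate of σ):
$$\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n - 2)}$$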
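The standard error formulas can be evaluated directly to see the effect of the spread of the xi. In this sketch the function names and the two sets of x values are assumptions for illustration, and σ is simply set to 1 rather than estimated.

```python
from math import sqrt
from statistics import mean

def coefficient_standard_errors(xs, sigma):
    """Return (SE(beta0_hat), SE(beta1_hat)) for a given noise standard deviation."""
    n = len(xs)
    x_bar = mean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    se_beta0 = sigma * sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_beta1 = sigma / sqrt(sxx)
    return se_beta0, se_beta1

def residual_standard_error(ys, fitted):
    """RSE = sqrt(RSS / (n - 2)), used in place of the unknown sigma."""
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sqrt(rss / (len(ys) - 2))

# Two hypothetical designs with the same mean (5) but a different spread.
narrow = [4.0, 4.5, 5.0, 5.5, 6.0]
wide = [1.0, 3.0, 5.0, 7.0, 9.0]

se_narrow = coefficient_standard_errors(narrow, sigma=1.0)
se_wide = coefficient_standard_errors(wide, sigma=1.0)
print(se_narrow)
print(se_wide)
```

With Σ(xi − x̄)² equal to 2.5 for the narrow design and 40 for the wide one, both standard errors shrink for the wide design. In practice σ is replaced by the RSE computed from the fitted values.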
What are standard errors useful for?
1. Confidence intervals
A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter
There is approximately a 95% chance that the interval [β̂1 − 2 SE(β̂1), β̂1 + 2 SE(β̂1)] will contain the true value of β1.
β0 has a 95% confidence interval of [6.130, 7.935]
β1 has a 95% confidence interval of [0.042, 0.053]
With no advertising, sales will, on average, fall somewhere between 6130 and 7935 units
For each 1000 USD increase in TV advertising, there is an average increase in sales of between 42 and 53 units.
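The "estimate ± 2 standard errors" rule can be applied to the Advertising estimates. The sketch below uses the estimates and standard errors reported in the coefficient table of the slides. The function name `approx_ci95` is hypothetical, and the multiplier 2 is the rule of thumb rather than an exact quantile.

```python
def approx_ci95(estimate, std_error, multiplier=2.0):
    """Approximate 95% confidence interval: estimate +/- multiplier * std_error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

# Estimates and standard errors from the Advertising regression.
b0_low, b0_high = approx_ci95(7.0325, 0.4578)
b1_low, b1_high = approx_ci95(0.0475, 0.0027)

print(b0_low, b0_high)  # about 6.117 and 7.948
print(b1_low, b1_high)  # about 0.0421 and 0.0529
```

The interval for β0 comes out slightly wider than the quoted [6.130, 7.935]. This is presumably because the quoted interval uses an exact t-distribution quantile (a little below 2) instead of the rule-of-thumb multiplier.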
2. Hypothesis tests on the coefficients
H0: β1 = 0: No relationship between X and Y (Y = β0 + ε)
HA: β1 ≠ 0: Some relationship between X and Y
Is β̂1 far from zero? How far is "far enough"? → This depends on the accuracy of β̂1, i.e. on SE(β̂1)
The t-statistic measures the number of standard deviations that β̂1 is away from 0:
$$t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}$$
If SE(β̂1) is small, even a small β̂1 suggests that β1 ≠ 0 (a relationship between X and Y)
If SE(β̂1) is large, β̂1 must be large in absolute value to reject the null hypothesis
The p-value is the probability of observing any value equal to |t| or larger in absolute value, assuming that β1 = 0
Small p-value: there is an association between the predictor and the response, i.e. reject the null hypothesis. Typical cutoff values: 5% or 1%
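The t-statistic and an approximate p-value can be computed with the standard library alone. This sketch assumes a normal approximation to the t-distribution, which is reasonable only for large samples. The function names are hypothetical, and the inputs are the rounded TV estimate and standard error from the Advertising example.

```python
from math import erfc, sqrt

def t_statistic(estimate, std_error, null_value=0.0):
    """Number of standard errors between the estimate and the null value."""
    return (estimate - null_value) / std_error

def two_sided_p_value_normal(t):
    """P(|Z| >= |t|) for a standard normal Z (large-sample approximation)."""
    return erfc(abs(t) / sqrt(2.0))

# TV coefficient of the Advertising regression (rounded values).
t_tv = t_statistic(0.0475, 0.0027)
p_tv = two_sided_p_value_normal(t_tv)

print(t_tv)  # about 17.59
print(p_tv)  # far below the 1% cutoff
```

The value differs slightly from the reported t-statistic of 17.67 because the inputs here are rounded. An exact p-value would use the t-distribution with n − 2 degrees of freedom instead of the normal distribution.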
             Coefficient   Std. Error   t-statistic   p-value
Intercept    7.0325        0.4578       15.36         < 0.0001
TV           0.0475        0.0027       17.67         < 0.0001

β̂0 and β̂1 are very large relative to their standard errors → the t-statistics are large
The p-values are close to zero → β0 ≠ 0 and β1 ≠ 0 (reject the null hypotheses)
β0 ≠ 0: in the absence of TV advertising, sales are non-zero
β1 ≠ 0: there is a relationship between TV and sales
Assessing the accuracy of the model
1. The sum of squared errors is an absolute measure (it increases as the number of points increases)
2. The Residual Standard Error (RSE) is an estimate of the standard deviation of ε, i.e., the average amount by which the response deviates from the true regression line:
$$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Absolute measure (expressed in units of Y)
3. The R² statistic measures the proportion of variability in Y that can be explained using X:
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},$$
where the total sum of squares is $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
It is a relative measure (0 ≤ R² ≤ 1)
Close to 1: a large proportion of the variability in the response can be explained by the regression
Close to 0: the regression does not explain much of the variability in the response.
For simple linear regression, R² = [Corr(X, Y)]²
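Both accuracy measures, and the link between R² and the correlation, can be verified on a small data set. The sketch below uses hypothetical toy data, and all variable names are chosen for illustration.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares fit.
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# RSS uses the predictions, TSS uses the mean of the responses.
rss = sum((y - (beta0_hat + beta1_hat * x)) ** 2 for x, y in zip(xs, ys))
tss = syy

rse = sqrt(rss / (n - 2))       # absolute measure, in units of Y
r2 = 1.0 - rss / tss            # relative measure between 0 and 1
corr = sxy / sqrt(sxx * syy)    # sample correlation of X and Y

print(rse, r2, corr ** 2)
```

For this data RSS ≈ 2.4 and TSS = 6, so R² ≈ 0.6, which coincides with the squared correlation 6² / (10 · 6) = 0.6.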
Remark:
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{vs} \quad \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Both use the actual data points of the training data set (yi)
RSS measures the variability left after performing the regression (differences with respect to the predictions)
TSS measures the variability in Y before the regression is performed (differences with respect to the mean of the data points)
The ratio RSS/TSS measures how good the model is compared with a constant prediction equal to the mean value ȳ
A low ratio: a low residual error with respect to the actual values, compared with a high error with respect to the mean → the model explains more of the variability
Quantity                   Value
Residual Standard Error    3.26
R²                         0.612
RSE: actual sales in each market deviate from the true regression line by approx. 3.26 thousand units (3260 units), on average
R²: about 61% (just under 2/3) of the variability in sales is explained by a linear regression on the TV budget
