mlda@dpo.rwth-aachen.de
Simple linear regression
Agenda
DPO MLDA 2
Simple linear regression
Overview
Simple linear regression Assessing the accuracy of the coefficient estimates
Assessing the accuracy of the model
Overview
$Y \approx \beta_0 + \beta_1 X$
$Y$: target (output)
$X$: input
$\beta_0$: intercept
$\beta_1$: slope (steepness)
Example: Advertising
Regress sales onto TV advertising: sales ≈ β0 + β1 TV
[Figure: Sales plotted against TV]
Residual: $e_i = y_i - \hat{y}_i$
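The residual definition can be sketched in a few lines of Python. The data and the coefficient values below are hypothetical toy numbers chosen for illustration, not the Advertising data set, and `residuals` is a helper name introduced here.

```python
def residuals(x, y, beta0, beta1):
    """Return e_i = y_i - yhat_i for the line yhat = beta0 + beta1 * x."""
    return [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]


# Hypothetical toy data (assumption): TV budget and sales for four markets.
tv = [0.0, 100.0, 200.0, 300.0]
sales = [7.5, 11.0, 17.5, 21.0]

# Hypothetical candidate line, not a fitted one.
e = residuals(tv, sales, beta0=7.0, beta1=0.05)
print(e)
```

A positive residual means the observed point lies above the line, a negative one means it lies below.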
OLS
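The ordinary least squares (OLS) criterion minimizes the residual sum of squares (RSS):
$$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2$$
The minimizing values are the least squares coefficient estimates:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where $\bar{y} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} \equiv \frac{1}{n} \sum_{i=1}^{n} x_i$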
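The closed-form estimates translate directly into code. The following is a minimal sketch in plain Python; the function names `ols_fit` and `rss` and the toy data are assumptions made for illustration, not part of the lecture material.

```python
def ols_fit(x, y):
    """Least squares estimates (beta0_hat, beta1_hat) for y ~ beta0 + beta1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def rss(x, y, beta0, beta1):
    """Residual sum of squares of the line beta0 + beta1 * x."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (assumption), roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = ols_fit(x, y)
print(beta0_hat, beta1_hat, rss(x, y, beta0_hat, beta1_hat))
```

Moving either coefficient away from the estimate can only increase the RSS, which is a quick sanity check on any implementation.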
Example: Advertising
How many units will be sold by spending 1000 USD on TV advertising?
Regress sales onto TV advertising: sales ≈ β0 + β1 TV.
[Figure: Sales plotted against TV, annotated with RSS]
The population regression line is the best linear approximation to the true relationship between $X$ and $Y$:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Coefficients $\beta_0$, $\beta_1$: unknown, estimated by $\hat{\beta}_0$, $\hat{\beta}_1$
The error term ($\varepsilon$) captures what we miss with the simple model:
The true relationship is probably not linear
Other variables cause variation in $Y$
Measurement error
The least squares line is characterized by the least squares regression coefficient estimates $\hat{\beta}_0$, $\hat{\beta}_1$
The true relationship is generally not known, while the least squares line can always be computed from the data
[Figure: two panels plotting Y against X]
Left: the red line is the population regression line, the blue line is the least squares line.
Right: the light blue lines are 10 least squares lines, each fitted to a different set of observations.
The least squares line uses information from a sample to estimate characteristics of a large population
For a particular set of observations, the estimates may underestimate or overestimate the true coefficients
Averaged over a huge number of observation sets, the estimates are expected to equal the true values exactly (the estimators are unbiased)
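The behaviour described above can be reproduced with a small simulation. This is a sketch under stated assumptions: the population line $Y = 2 + 3X + \varepsilon$, the noise level, the sample sizes and the helper names are all hypothetical choices made for illustration.

```python
import random


def ols_fit(x, y):
    """Least squares estimates (beta0_hat, beta1_hat)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    return y_bar - beta1_hat * x_bar, beta1_hat


def simulate_slopes(beta0, beta1, sigma, n_obs, n_sets, seed):
    """Fit one least squares line per simulated observation set; return the slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sets):
        x = [rng.uniform(-2.0, 2.0) for _ in range(n_obs)]
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        slopes.append(ols_fit(x, y)[1])
    return slopes


# Assumed population line Y = 2 + 3X + eps with sd(eps) = 2 (hypothetical values).
slopes_10 = simulate_slopes(2.0, 3.0, sigma=2.0, n_obs=100, n_sets=10, seed=1)
slopes_many = simulate_slopes(2.0, 3.0, sigma=2.0, n_obs=100, n_sets=2000, seed=2)

mean_slope = sum(slopes_many) / len(slopes_many)
print("10 individual slopes:", [round(s, 2) for s in slopes_10])
print("mean of 2000 slopes:", round(mean_slope, 3))
```

Individual slopes scatter on both sides of the true slope, while their average over many observation sets settles close to it.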
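$$\mathrm{Var}(\hat{\mu}) = \mathrm{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}$$
Standard error (SE) of the estimators:
$$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where $\sigma^2 = \mathrm{Var}(\varepsilon)$ (generally not known)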
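Since $\sigma^2$ is generally not known, the formulas need an estimate of it. The sketch below assumes the usual plug-in choice $\hat{\sigma}^2 = \mathrm{RSS}/(n-2)$; the function name `standard_errors` and the toy data are hypothetical.

```python
import math


def standard_errors(x, y):
    """Return (SE(beta0_hat), SE(beta1_hat)) with sigma^2 estimated by RSS / (n - 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    rss = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)  # assumption: plug-in estimate of Var(eps)
    se_beta0 = math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / s_xx))
    se_beta1 = math.sqrt(sigma2_hat / s_xx)
    return se_beta0, se_beta1


# Hypothetical toy data (assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
se_beta0, se_beta1 = standard_errors(x, y)
print(se_beta0, se_beta1)
```

Both standard errors shrink when the $x_i$ are more spread out, because the denominator $\sum (x_i - \bar{x})^2$ grows.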
1. Confidence intervals
A 95% confidence interval is defined as the range of values such that with 95%
probability, the range will contain the true unknown value of the parameter
There is approx. a 95% chance that the interval $[\hat{\beta}_1 - 2\,\mathrm{SE}(\hat{\beta}_1),\ \hat{\beta}_1 + 2\,\mathrm{SE}(\hat{\beta}_1)]$ will contain the true value of $\beta_1$.
Example: Advertising
β0 has a 95% confidence interval of [6.130, 7.935]
β1 has a 95% confidence interval of [0.042, 0.053]
With no advertising, sales will, on average, fall somewhere between 6130 and 7935 units
For each 1000 USD increase in TV advertising, sales increase on average by between 42 and 53 units
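The two-standard-error rule and the unit conversion in this example can be checked with a short sketch. The helper names are hypothetical. The estimate and standard error below are back-calculated from the interval [0.042, 0.053] under the assumption that it was built with exactly two standard errors, so they are implied values rather than reported ones.

```python
def approx_ci95(estimate, se):
    """Approximate 95% interval: estimate plus or minus two standard errors."""
    return estimate - 2 * se, estimate + 2 * se


def implied_estimate_and_se(lower, upper):
    """Invert the two-SE rule: midpoint and a quarter of the interval width."""
    return (lower + upper) / 2, (upper - lower) / 4


# Interval for beta1 as stated in the example.
beta1_hat, se_beta1 = implied_estimate_and_se(0.042, 0.053)
lower, upper = approx_ci95(beta1_hat, se_beta1)

# Sales are in thousands of units in this example, so multiply by 1000.
units_low, units_high = 1000 * lower, 1000 * upper
print(beta1_hat, se_beta1, units_low, units_high)
```

The same conversion applied to the interval for β0 gives the 6130 to 7935 units quoted above.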
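2. Hypothesis tests
The null hypothesis $H_0: \beta_1 = 0$ (no relationship between $X$ and $Y$) is tested with the t-statistic $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$
If $\mathrm{SE}(\hat{\beta}_1)$ is small, even a small $\hat{\beta}_1$ provides evidence that $\beta_1 \neq 0$ (a relationship between $X$ and $Y$)
If $\mathrm{SE}(\hat{\beta}_1)$ is large, $\hat{\beta}_1$ must be large in absolute value to reject the null hypothesis
The p-value is the probability of observing a value equal to $|t|$ or larger in absolute value, assuming that $\beta_1 = 0$
Small p-value: there is an association between the predictor and the response, i.e. reject the null hypothesis. Typical cutoff values: 5% or 1%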
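A minimal sketch of the test in Python follows. The exact p-value comes from a t-distribution with $n - 2$ degrees of freedom; to stay within the standard library, this sketch assumes $n$ is large and replaces it with the standard normal distribution. The function names and the input values are hypothetical.

```python
from statistics import NormalDist


def t_statistic(beta_hat, se, null_value=0.0):
    """Number of standard errors between the estimate and the null value."""
    return (beta_hat - null_value) / se


def approx_p_value(t):
    """Two-sided p-value under a normal approximation (assumes large n)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Hypothetical estimate and standard error (assumption, for illustration only).
t = t_statistic(beta_hat=0.5, se=0.1)
p = approx_p_value(t)
print(t, p)
print("reject H0 at the 5% level:", p < 0.05)
```

For small samples the normal approximation understates the p-value, so a t-distribution routine from a statistics library is the safer choice there.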
Example: Advertising
$\hat{\beta}_0$ and $\hat{\beta}_1$ are very large relative to their Std. Error → the t-statistics are large
The p-values are close to zero → $\beta_0 \neq 0$ and $\beta_1 \neq 0$ (reject the null hypotheses)
$\beta_0 \neq 0$: in the absence of TV advertising, sales are non-zero
$\beta_1 \neq 0$: there is a relationship between TV and sales
1. The sum of squared errors (RSS) is an absolute measure: it grows as the number of points increases
2. The residual standard error (RSE) is an estimate of the standard deviation of $\varepsilon$, i.e., roughly the average amount by which the response deviates from the true regression line.
$$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
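As a sketch, the RSE can be computed from the observed and fitted values alone. The function name `rse` and the numbers are hypothetical; the fitted values are assumed to come from a least squares line with two estimated coefficients, which is where the $n - 2$ comes from.

```python
import math


def rse(y, y_hat):
    """Residual standard error: sqrt(RSS / (n - 2)) for simple linear regression."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(rss / (n - 2))


# Hypothetical observed values and fitted values (assumption).
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [2.04, 4.03, 6.02, 8.01, 10.0]
print(rse(y, y_hat))
```

Unlike the RSS, the RSE is on the scale of the response and does not grow just because more points are added.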
Remark:
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{vs} \quad \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Both are computed from the actual data points of the training set ($y_i$)
RSS measures the variability left after performing the regression (differences with respect to the predictions)
TSS measures the total variability in $Y$ before the regression is performed (differences with respect to the mean of the data points)
The ratio RSS/TSS compares the model with a baseline that always predicts the mean $\bar{y}$; $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$
A low ratio: the residuals around the fitted line are small relative to the deviations from the mean → the model explains most of the variability ($R^2$ close to 1)
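The comparison of RSS and TSS can be sketched as follows, using the standard definition $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$. The function name `r_squared` and the toy numbers are hypothetical.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS for observed values y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)  # variability around the mean
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # variability around the fit
    return 1.0 - rss / tss


# Hypothetical observed values and fitted values (assumption).
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [2.04, 4.03, 6.02, 8.01, 10.0]
r2 = r_squared(y, y_hat)

# Baseline: always predicting the mean gives RSS = TSS, hence R^2 = 0.
y_bar = sum(y) / len(y)
r2_baseline = r_squared(y, [y_bar] * len(y))
print(r2, r2_baseline)
```

A value close to 1 means the fitted line removes most of the variability that the mean-only baseline leaves unexplained.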
Example: Advertising
Quantity                  Value
Residual Standard Error   3.26
$R^2$                     0.612
RSE: actual sales in each market deviate from the true regression line by approx. 3,260 units on average (sales are recorded in thousands of units)
$R^2$: about 61% (just under 2/3) of the variability in sales is explained by the linear regression on the TV budget