
Course in Advanced Statistics:

Recap on linear regression

Irene Cozzolino

Winter semester 2021



The linear regression model

Restricting attention to linear relationships, we specify a statistical model as:

y_i = β_0 + β_1 x_i1 + · · · + β_K x_iK + ε_i    (1)
y_i = x_i'β + ε_i    (2)
y = Xβ + ε    −→ matrix form    (3)

where x_i is a vector of observable variables; ε_i is unobserved and referred to
as an error term; and β = (β_0, β_1, . . . , β_K)' is a vector of unknown
parameters characterizing the population.
The above equations are supposed to hold for any possible
observation, while we only observe a sample of N observations.
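
As an illustration of the data-generating process in (1)-(3), here is a minimal simulation sketch in Python/NumPy. The sample size, coefficients and noise level are arbitrary choices for illustration, not values from the course.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 100, 2                        # sample size and number of regressors (illustrative)
    beta = np.array([1.0, 2.0, -0.5])    # (beta_0, beta_1, beta_2), chosen arbitrarily
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])  # design matrix with intercept
    eps = rng.normal(scale=1.0, size=N)  # error terms with E(eps_i) = 0 and V(eps_i) = sigma^2
    y = X @ beta + eps                   # y = X beta + eps, equation (3)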



Model assumptions

It is important to realize that without additional restrictions the
statistical model in equations (1)-(2)-(3) is a tautology: for any value
of β one can always define a set of ε such that the statistical model
holds exactly for each observation.
We need to impose some assumptions to give the model a meaning.

Ass.1 : E(ε_i) = 0   ∀i    (4)
Ass.2 : V(ε_i) = σ²   ∀i    (5)
Ass.3 : Cov(ε_i, ε_j) = 0   ∀i ≠ j    (6)



Model assumptions

Ass.1 −→ It says that the expected value of the error term is zero,
which means that, on average, the regression line should be correct.
Under Ass.1 it holds that

E(y_i) = x_i'β

The coefficient β_K measures how the expected value of y_i changes
after a unit variation in the value of x_iK, keeping the other elements
in x_i constant (ceteris paribus condition).
Ass.2 −→ Homoscedasticity: all error terms have the same variance σ².
Ass.3 −→ This excludes any form of autocorrelation between the error
terms.



Model assumptions

Other, more general assumptions (not on the error terms):

Ass.4 : The matrix of predictors X is deterministic.    (7)
Ass.5 : X has full rank.    (8)
Ass.6 : N > K.    (9)
Ass.7 : The regressors are uncorrelated with the error terms.    (10)



Collinearity vs. multicollinearity (or approximate collinearity)

Perfect collinearity emerges when there are some predictors whose
information is redundant since it is perfectly contained in the others.

Predictors: [X_1, X_2, X_3] −→ X_2 = 2X_1

Multicollinearity emerges when there is at least one predictor which is
highly correlated with (a linear combination of) the others. In other words,
the information contained in that predictor is approximately contained in the others.

Predictors: [X_1, X_2, X_3] −→ X_2 ≈ 2X_1 + 4X_3

It is still possible to perform the linear regression, but there can be some
problems −→ instability of the estimates & huge standard errors.
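
A quick numerical illustration (not from the slides): when one column is nearly a linear combination of the others, X'X becomes close to singular, which is exactly what inflates the variance of the OLS estimates. The numbers below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x3 = rng.normal(size=200)
    x2 = 2 * x1 + 4 * x3 + rng.normal(scale=0.01, size=200)  # almost an exact linear combination
    X_demo = np.column_stack([np.ones(200), x1, x2, x3])

    print(np.linalg.cond(X_demo.T @ X_demo))  # condition number explodes as collinearity approaches exact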



The ordinary least squares (OLS) estimator

The residuals are defined as the difference between the observed
values and the fitted ones.

ε̂_i = y_i − x_i'β̂    (11)

We would like to identify the estimator of β which minimizes the
residuals −→ minimize the residual sum of squares (RSS).

min_{β̂} Σ_{i=1}^N ε̂_i² = min_{β̂} Σ_{i=1}^N (y_i − x_i'β̂)²    (12)

β̂_OLS = (Σ_{i=1}^N x_i x_i')^{−1} Σ_{i=1}^N x_i y_i −→ β̂_OLS = (X'X)^{−1} X'y

Fitted values: ŷ_i = x_i'β̂_OLS.
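
A minimal NumPy sketch of the closed-form estimator, reusing X and y from the simulation above (in practice one would rely on a routine such as np.linalg.lstsq rather than forming the inverse explicitly):

    # closed-form OLS: beta_hat = (X'X)^{-1} X'y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve() is numerically safer than inv()
    y_fitted = X @ beta_hat                       # fitted values y_hat_i = x_i' beta_hat
    residuals = y - y_fitted                      # residuals eps_hat_i, equation (11)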



Properties of OLS estimator

β̂_OLS is unbiased. This means that, in repeated sampling, we can
expect that our estimator is on average equal to the true value β.

E(β̂_OLS) = β

We would also like to make statements about how (un)likely it is to
be far off in a given sample. This means we would like to know the
distribution of β̂_OLS (which is not yet possible without further
assumptions). But we can calculate the variance of β̂_OLS.

V(β̂_OLS) = σ² (Σ_{i=1}^N x_i x_i')^{−1} = σ² (X'X)^{−1}    (13)



Properties of OLS estimator

Equation (13) cannot be calculated directly since we do not know σ².
To estimate the variance of β̂_OLS we need to replace the unknown
error variance by an estimate. An obvious candidate is the sample
variance of the residuals:

σ̂² = 1/(N − K) Σ_{i=1}^N ε̂_i²    (14)

To conclude, β̂_OLS is B.L.U.E. −→ Best Linear Unbiased Estimator (Gauss–Markov theorem).
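
Continuing the same sketch, equations (13)-(14) translate into estimated standard errors as follows (variable names are my own):

    N_obs, n_par = X.shape                                 # n_par counts all columns of X, as K does in (14)
    sigma2_hat = residuals @ residuals / (N_obs - n_par)   # equation (14)
    cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)         # estimated version of equation (13)
    std_errors = np.sqrt(np.diag(cov_beta))                # standard error of each coefficient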



Distribution of the error terms

So far, we have made no assumption about the shape of the distribution of
the error terms.
The most common assumption is that the errors are jointly normally
distributed.

ε_i ∼ N(0, σ²)

If the error terms are assumed to follow a normal distribution, this means
that y_i (for a given value of x_i) also follows a normal distribution.
In this setting we can make inference on β (CI & hypothesis tests).



Distribution of the error terms & Max.Likelihood approach

In this new framework, we can estimate β and σ² using the maximum
likelihood approach.

max_{β,σ²} L(β, σ²) = (2πσ²)^{−N/2} exp( −1/(2σ²) (y − Xβ)'(y − Xβ) )    (15)

β̂_MLE = β̂_OLS.
σ̂²_MLE = 1/N Σ_{i=1}^N ε̂_i².
Most important property:

β̂_MLE ∼ N(β, σ²(X'X)^{−1})
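
Comparing the two variance estimators (a direct consequence of their definitions, not a separate result from the slides):

σ̂²_MLE = (1/N) Σ_{i=1}^N ε̂_i² = ((N − K)/N) · σ̂²

so the maximum likelihood estimator of σ² is biased downwards in finite samples, whereas σ̂² in (14) is unbiased; the two coincide asymptotically.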



Inference on β: tests
T-test: applied to the single predictors.

H_0 : β_j = 0 −→ The j-th regressor is not statistically significant in explaining the response variable
H_1 : β_j ≠ 0 −→ The j-th regressor is statistically significant

F-test: used to evaluate the whole model.

H_0 : β_1 = β_2 = · · · = β_K = 0 −→ None of the K regressors is statistically significant
H_1 : ∃ j > 0 : β_j ≠ 0 −→ There exists at least one regressor, other than the intercept, that is statistically significant

p-value

p > α −→ Do not reject H_0
p < α −→ Reject H_0
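
A minimal illustration of these tests with statsmodels (a sketch, assuming y and a regressor matrix X_raw without an intercept column are available as NumPy arrays; the names are placeholders):

    import statsmodels.api as sm

    X_design = sm.add_constant(X_raw)        # prepend an intercept column to the raw regressors
    results = sm.OLS(y, X_design).fit()      # ordinary least squares fit

    print(results.tvalues)                   # t-statistics of the individual coefficients
    print(results.pvalues)                   # their p-values (t-tests)
    print(results.fvalue, results.f_pvalue)  # F-statistic of the whole model and its p-value
    print(results.summary())                 # full regression table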



Inference on β: confidence interval

CI_{1−α}: It returns an interval that contains the true value of β_j with
probability 1 − α. The probability is calculated before observing
the data.
Loosely speaking, a confidence interval gives a range of values for β_j
that are not unlikely given the data.
Usually α = 0.05, hence 1 − α = 0.95.
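
Under the normality assumption the interval has the usual explicit form (my addition, using the same degrees of freedom as in (14)):

CI_{1−α}(β_j) = β̂_j ± t_{N−K, 1−α/2} · se(β̂_j)

where se(β̂_j) is the square root of the j-th diagonal element of σ̂²(X'X)^{−1}. In statsmodels, results.conf_int(alpha=0.05) returns these intervals for the fit from the previous example.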



Deciding on important variables

As discussed in the previous section, in a multiple regression analysis
it is important to compute the F-statistic and to examine the associated
p-value. If we conclude on the basis of that p-value that at least one
of the predictors is related to the response, then it is natural to
wonder which are the guilty ones!
We could look at the individual p-values but if K is large we are likely
to make some false discoveries.
Ideally, we would like to perform variable selection by trying out a lot
of different models, each containing a different subset of the
predictors. For instance, if K = 2, then we can consider four models:
(1) a model containing no variables, (2) a model containing X1 only,
(3) a model containing X2 only, and (4) a model containing both X1
and X2 .



Deciding on Important Variables

We can then select the best model out of all of the models that we
have considered.
Various statistics can be used to judge the quality of a model; these
include the Akaike information criterion (AIC).
Unfortunately, there are a total of 2^K models that contain subsets of
K variables. This means that even for moderate K, trying out every
possible subset of the predictors is infeasible. For instance, we saw
that if K = 2, then there are 2² = 4 models to consider. But if
K = 30, then we must consider 2^30 = 1,073,741,824 models! This is
not practical.



Deciding on Important Variables

We need an automated and efficient approach (stepwise selection) to
choose a smaller set of models to consider.
There are three classical selection approaches for this task (a minimal
sketch of forward selection follows after the list):
1 Forward selection: starts with no predictors in the model; iteratively
adds the most contributive predictors and stops when the improvement
is no longer statistically significant.
2 Backward selection: starts with all predictors in the model (full model);
iteratively removes the least contributive predictors and stops when
there is a model where all predictors are statistically significant.
3 Mixed selection: is a combination of forward and backward selections.
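
As a concrete sketch, forward selection based on individual p-values could look as follows (my own illustration of the idea, not code from the course; statsmodels is assumed and the 0.05 threshold is an arbitrary choice):

    import numpy as np
    import statsmodels.api as sm

    def forward_selection(X_raw, y, threshold=0.05):
        """Greedy forward selection: at each step add the predictor with the
        smallest p-value; stop when no remaining predictor is significant."""
        selected, remaining = [], list(range(X_raw.shape[1]))
        while remaining:
            best_p, best_j = np.inf, None
            for j in remaining:
                cols = selected + [j]
                fit = sm.OLS(y, sm.add_constant(X_raw[:, cols])).fit()
                p = fit.pvalues[-1]          # p-value of the candidate predictor (last column)
                if p < best_p:
                    best_p, best_j = p, j
            if best_p >= threshold:          # no candidate improves significantly: stop
                break
            selected.append(best_j)
            remaining.remove(best_j)
        return selected                      # column indices of the retained predictors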



Assessing the Accuracy of the Model

It is natural to quantify the extent to which the model fits the data.
R² statistic: It provides a measure of goodness of fit of the model.
It takes the form of a proportion - the proportion of variance
explained - and so it always takes on a value between 0 and 1.
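
Explicitly (a standard definition, where ŷ_i are the fitted values and ȳ is the sample mean of the response):

R² = 1 − RSS/TSS = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²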
R² measures the proportion of variability in y that can be explained
using X. An R² statistic that is close to 1 indicates that a large
proportion of the variability in the response has been explained by the
regression. A number near 0 indicates that the regression did not
explain much of the variability in the response; this might occur
because the linear model is wrong.
However, it can still be challenging to determine what is a good R²
value.



Assessing the Accuracy of the Model

Problem of R²: It turns out that R² will always increase when
more variables are added to the model, even if those variables are only
weakly associated with the response.
The adjusted R² corrects this drawback. We can use it to compare
models with different numbers of predictors.
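
A common form of the adjustment, assuming a model with an intercept and K regressors estimated on N observations, is

R²_adj = 1 − (1 − R²) · (N − 1)/(N − K − 1)

so that adding a weak predictor can decrease R²_adj even though R² always rises.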



Assessing the Accuracy of the Model

Variance inflation factor: It is calculated for each of the K
predictors and is used to identify whether there is a problem of
perfect/approximate collinearity between the predictors. For each j,
regress X_j on all the other predictors:

X_j = α_0 + α_1 X_1 + · · · + α_{j−1} X_{j−1} + α_{j+1} X_{j+1} + · · · + α_K X_K + ε    ∀j

Estimate the parameters of each of these auxiliary models and calculate R_j².
R_j² = 1 −→ X_j is a perfect linear combination of the remaining K − 1
variables. We have a problem of collinearity. We should delete it since
its information is perfectly contained in the other K − 1 variables.
R_j² ≥ 0.8 −→ approximate collinearity.
R_j² = 0 −→ no problems of collinearity.

VIF_j = 1/(1 − R_j²). Usually, those variables with VIF_j > 10 are deleted
from the model.
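
A small NumPy sketch of this computation (my own illustration; X_raw is a placeholder for the N × K matrix of predictors without the intercept column):

    import numpy as np

    def vif(X_raw):
        """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other predictors."""
        n, k = X_raw.shape
        out = np.empty(k)
        for j in range(k):
            y_j = X_raw[:, j]
            others = np.column_stack([np.ones(n), np.delete(X_raw, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, y_j, rcond=None)   # auxiliary regression
            resid = y_j - others @ coef
            tss = (y_j - y_j.mean()) @ (y_j - y_j.mean())
            r2_j = 1.0 - (resid @ resid) / tss
            out[j] = 1.0 / (1.0 - r2_j)
        return out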



Diagnostic checking on the residuals

A good model will yield residuals with the following properties:


1 The residuals are uncorrelated.
2 The residuals have zero mean.
3 The residuals have constant variance.
4 The residuals have a normal distribution.

Residual plots are also a useful graphical tool for identifying non-linearity.
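
For example, a residuals-vs-fitted plot (a minimal matplotlib sketch, reusing y_fitted and residuals from the earlier OLS example):

    import matplotlib.pyplot as plt

    plt.scatter(y_fitted, residuals, s=10)
    plt.axhline(0.0, color="grey", linestyle="--")  # residuals should scatter evenly around zero
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()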

