
Course in Advanced Statistics:

Recap on linear regression

Irene Cozzolino

Winter semester 2021



The linear regression model

Restricting attention to linear relationships, we specify a statistical model as:

y_i = β_0 + β_1 x_i1 + · · · + β_K x_iK + ε_i    (1)
y_i = x_i'β + ε_i    (2)
y = Xβ + ε    −→ matrix form    (3)

where x_i is a vector of observable variables; ε_i is unobserved and referred to
as an error term; and β = (β_0, β_1, . . . , β_K)' is a vector of unknown
parameters characterizing the population.
The above equations are supposed to hold for any possible
observation, while we only observe a sample of N observations.
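
As an illustration of the data-generating process in (1)-(3), here is a minimal simulation sketch in Python/NumPy. The sample size, coefficients and noise level are arbitrary choices for illustration, not values from the course.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 100, 2                        # sample size and number of regressors (illustrative)
    beta = np.array([1.0, 2.0, -0.5])    # (beta_0, beta_1, beta_2), chosen arbitrarily
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])  # design matrix with intercept
    eps = rng.normal(scale=1.0, size=N)  # error terms with E(eps_i) = 0 and V(eps_i) = sigma^2
    y = X @ beta + eps                   # y = X beta + eps, equation (3)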



Model assumptions

It is important to realize that without additional restrictions the
statistical model in equations (1)-(2)-(3) is a tautology: for any value
of β one can always define a set of ε such that the statistical model
holds exactly for each observation.
We need to impose some assumptions to give the model a meaning.

Ass.1 : E(ε_i) = 0   ∀i    (4)
Ass.2 : V(ε_i) = σ²   ∀i    (5)
Ass.3 : Cov(ε_i, ε_j) = 0   ∀i ≠ j    (6)



Model assumptions

Ass.1 −→ It says that the expected value of the error term is zero,
which means that, on average, the regression line should be correct.
Under Ass.1 it holds that

E(y_i) = x_i'β

The coefficient β_K measures how the expected value of y_i changes
after a unit variation in the value of x_iK, keeping the other elements
in x_i constant (ceteris paribus condition).
Ass.2 −→ Homoscedasticity: all error terms have the same variance σ².
Ass.3 −→ This excludes any form of autocorrelation between the error
terms.



Model assumptions

Other, more general assumptions (not on the error terms):

Ass.4 : The matrix of predictors X is deterministic.    (7)
Ass.5 : X has full rank.    (8)
Ass.6 : N > K.    (9)
Ass.7 : The regressors are uncorrelated with the error terms.    (10)



Collinearity vs. multicollinearity (or approximate collinearity)

Perfect collinearity emerges when there are some predictors whose
information is redundant since it is perfectly contained in the others.

Predictors: [X_1, X_2, X_3] −→ X_2 = 2X_1

Multicollinearity emerges when there is at least one predictor which is
highly correlated with (a linear combination of) the others. In other words,
the information contained in that predictor is approximately contained in the others.

Predictors: [X_1, X_2, X_3] −→ X_2 ≈ 2X_1 + 4X_3

It is still possible to perform the linear regression, but there can be some
problems −→ instability of the estimates & huge standard errors.
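
A quick numerical illustration (not from the slides): when one column is nearly a linear combination of the others, X'X becomes close to singular, which is exactly what inflates the variance of the OLS estimates. The numbers below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x3 = rng.normal(size=200)
    x2 = 2 * x1 + 4 * x3 + rng.normal(scale=0.01, size=200)  # almost an exact linear combination
    X_demo = np.column_stack([np.ones(200), x1, x2, x3])

    print(np.linalg.cond(X_demo.T @ X_demo))  # condition number explodes as collinearity approaches exact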



The ordinary least squares (OLS) estimator

The residuals are defined as the difference between the observed
values and the fitted ones.

ε̂_i = y_i − x_i'β̂    (11)

We would like to identify the estimator of β which minimizes the
residuals −→ minimize the residual sum of squares (RSS).

min_{β̂} Σ_{i=1}^N ε̂_i² = min_{β̂} Σ_{i=1}^N (y_i − x_i'β̂)²    (12)

β̂_OLS = (Σ_{i=1}^N x_i x_i')^{−1} Σ_{i=1}^N x_i y_i −→ β̂_OLS = (X'X)^{−1} X'y

Fitted values: ŷ_i = x_i'β̂_OLS.
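
A minimal NumPy sketch of the closed-form estimator, reusing X and y from the simulation above (in practice one would rely on a routine such as np.linalg.lstsq rather than forming the inverse explicitly):

    # closed-form OLS: beta_hat = (X'X)^{-1} X'y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve() is numerically safer than inv()
    y_fitted = X @ beta_hat                       # fitted values y_hat_i = x_i' beta_hat
    residuals = y - y_fitted                      # residuals eps_hat_i, equation (11)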



Properties of OLS estimator

β̂_OLS is unbiased. This means that, in repeated sampling, we can
expect that our estimator is on average equal to the true value β.

E(β̂_OLS) = β

We would also like to make statements about how (un)likely it is to
be far off in a given sample. This means we would like to know the
distribution of β̂_OLS (which is not yet possible without further
assumptions). But we can calculate the variance of β̂_OLS.

V(β̂_OLS) = σ² (Σ_{i=1}^N x_i x_i')^{−1} = σ² (X'X)^{−1}    (13)



Properties of OLS estimator

Equation (13) cannot be calculated directly since we do not know σ².
To estimate the variance of β̂_OLS we need to replace the unknown
error variance by an estimate. An obvious candidate is the sample
variance of the residuals:

σ̂² = 1/(N − K) Σ_{i=1}^N ε̂_i²    (14)

To conclude, β̂_OLS is B.L.U.E. −→ Best Linear Unbiased Estimator (Gauss–Markov theorem).
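
Continuing the same sketch, equations (13)-(14) translate into estimated standard errors as follows (variable names are my own):

    N_obs, n_par = X.shape                                 # n_par counts all columns of X, as K does in (14)
    sigma2_hat = residuals @ residuals / (N_obs - n_par)   # equation (14)
    cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)         # estimated version of equation (13)
    std_errors = np.sqrt(np.diag(cov_beta))                # standard error of each coefficient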



Distribution of the error terms

So far, we have made no assumption about the shape of the distribution of
the error terms.
The most common assumption is that the errors are jointly normally
distributed.

ε_i ∼ N(0, σ²)

If the error terms are assumed to follow a normal distribution, this means
that y_i (for a given value of x_i) also follows a normal distribution.
In this setting we can make inference on β (CI & hypothesis tests).



Distribution of the error terms & Max.Likelihood approach

In this new framework, we can estimate β and σ² using the maximum
likelihood approach.

max_{β,σ²} L(β, σ²) = (2πσ²)^{−N/2} exp( −1/(2σ²) (y − Xβ)'(y − Xβ) )    (15)

β̂_MLE = β̂_OLS.
σ̂²_MLE = 1/N Σ_{i=1}^N ε̂_i².
Most important property:

β̂_MLE ∼ N(β, σ²(X'X)^{−1})
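
Comparing the two variance estimators (a direct consequence of their definitions, not a separate result from the slides):

σ̂²_MLE = (1/N) Σ_{i=1}^N ε̂_i² = ((N − K)/N) · σ̂²

so the maximum likelihood estimator of σ² is biased downwards in finite samples, whereas σ̂² in (14) is unbiased; the two coincide asymptotically.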



Inference on β: tests
T-test: applied to the single predictors.

H_0 : β_j = 0 −→ The j-th regressor is not statistically significant in explaining the response variable
H_1 : β_j ≠ 0 −→ The j-th regressor is statistically significant

F-test: used to evaluate the whole model.

H_0 : β_1 = β_2 = · · · = β_K = 0 −→ None of the K regressors is statistically significant
H_1 : ∃ j > 0 : β_j ≠ 0 −→ There exists at least one regressor, other than the intercept, that is statistically significant

p-value

p > α −→ Do not reject H_0
p < α −→ Reject H_0
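
A minimal illustration of these tests with statsmodels (a sketch, assuming y and a regressor matrix X_raw without an intercept column are available as NumPy arrays; the names are placeholders):

    import statsmodels.api as sm

    X_design = sm.add_constant(X_raw)        # prepend an intercept column to the raw regressors
    results = sm.OLS(y, X_design).fit()      # ordinary least squares fit

    print(results.tvalues)                   # t-statistics of the individual coefficients
    print(results.pvalues)                   # their p-values (t-tests)
    print(results.fvalue, results.f_pvalue)  # F-statistic of the whole model and its p-value
    print(results.summary())                 # full regression table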



Inference on β: confidence interval

CI_{1−α}: It returns an interval that contains the true value of β_j with
probability 1 − α. The probability is calculated before observing
the data.
Loosely speaking, a confidence interval gives a range of values for β_j
that are not unlikely given the data.
Usually α = 0.05, hence 1 − α = 0.95.
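
Under the normality assumption the interval has the usual explicit form (my addition, using the same degrees of freedom as in (14)):

CI_{1−α}(β_j) = β̂_j ± t_{N−K, 1−α/2} · se(β̂_j)

where se(β̂_j) is the square root of the j-th diagonal element of σ̂²(X'X)^{−1}. In statsmodels, results.conf_int(alpha=0.05) returns these intervals for the fit from the previous example.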



Deciding on important variables

As discussed in the previous section, in a multiple regression analysis
it is important to compute the F-statistic and to examine the associated
p-value. If we conclude on the basis of that p-value that at least one
of the predictors is related to the response, then it is natural to
wonder which are the guilty ones!
We could look at the individual p-values but if K is large we are likely
to make some false discoveries.
Ideally, we would like to perform variable selection by trying out a lot
of different models, each containing a different subset of the
predictors. For instance, if K = 2, then we can consider four models:
(1) a model containing no variables, (2) a model containing X1 only,
(3) a model containing X2 only, and (4) a model containing both X1
and X2 .



Deciding on Important Variables

We can then select the best model out of all of the models that we
have considered.
Various statistics can be used to judge the quality of a model; these
include the Akaike information criterion (AIC).
Unfortunately, there are a total of 2^K models that contain subsets of
K variables. This means that even for moderate K, trying out every
possible subset of the predictors is infeasible. For instance, we saw
that if K = 2, then there are 2² = 4 models to consider. But if
K = 30, then we must consider 2^30 = 1,073,741,824 models! This is
not practical.



Deciding on Important Variables

We need an automated and efficient approach (stepwise selection) to
choose a smaller set of models to consider.
There are three classical selection approaches for this task (a minimal
sketch of forward selection follows after the list):
1 Forward selection: starts with no predictors in the model; iteratively
adds the most contributive predictors and stops when the improvement
is no longer statistically significant.
2 Backward selection: starts with all predictors in the model (full model);
iteratively removes the least contributive predictors and stops when
there is a model where all predictors are statistically significant.
3 Mixed selection: is a combination of forward and backward selections.
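
As a concrete sketch, forward selection based on individual p-values could look as follows (my own illustration of the idea, not code from the course; statsmodels is assumed and the 0.05 threshold is an arbitrary choice):

    import numpy as np
    import statsmodels.api as sm

    def forward_selection(X_raw, y, threshold=0.05):
        """Greedy forward selection: at each step add the predictor with the
        smallest p-value; stop when no remaining predictor is significant."""
        selected, remaining = [], list(range(X_raw.shape[1]))
        while remaining:
            best_p, best_j = np.inf, None
            for j in remaining:
                cols = selected + [j]
                fit = sm.OLS(y, sm.add_constant(X_raw[:, cols])).fit()
                p = fit.pvalues[-1]          # p-value of the candidate predictor (last column)
                if p < best_p:
                    best_p, best_j = p, j
            if best_p >= threshold:          # no candidate improves significantly: stop
                break
            selected.append(best_j)
            remaining.remove(best_j)
        return selected                      # column indices of the retained predictors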



Assessing the Accuracy of the Model

It is natural to quantify the extent to which the model fits the data.
R² statistic: It provides a measure of goodness of fit of the model.
It takes the form of a proportion - the proportion of variance
explained - and so it always takes on a value between 0 and 1.
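
Explicitly (a standard definition, where ŷ_i are the fitted values and ȳ is the sample mean of the response):

R² = 1 − RSS/TSS = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²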
R² measures the proportion of variability in y that can be explained
using X. An R² statistic that is close to 1 indicates that a large
proportion of the variability in the response has been explained by the
regression. A number near 0 indicates that the regression did not
explain much of the variability in the response; this might occur
because the linear model is wrong.
However, it can still be challenging to determine what is a good R²
value.



Assessing the Accuracy of the Model

Problem of R²: It turns out that R² will always increase when
more variables are added to the model, even if those variables are only
weakly associated with the response.
The adjusted R² corrects this drawback. We can use it to compare
models with different numbers of predictors.
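
A common form of the adjustment, assuming a model with an intercept and K regressors estimated on N observations, is

R²_adj = 1 − (1 − R²) · (N − 1)/(N − K − 1)

so that adding a weak predictor can decrease R²_adj even though R² always rises.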



Assessing the Accuracy of the Model

Variance inflation factor: It is calculated for each of the K
predictors and is used to identify whether there is a problem of
perfect/approximate collinearity between the predictors. For each j,
regress X_j on all the other predictors:

X_j = α_0 + α_1 X_1 + · · · + α_{j−1} X_{j−1} + α_{j+1} X_{j+1} + · · · + α_K X_K + ε    ∀j

Estimate the parameters of each of these auxiliary models and calculate R_j².
R_j² = 1 −→ X_j is a perfect linear combination of the remaining K − 1
variables. We have a problem of collinearity. We should delete it since
its information is perfectly contained in the other K − 1 variables.
R_j² ≥ 0.8 −→ approximate collinearity.
R_j² = 0 −→ no problems of collinearity.

VIF_j = 1/(1 − R_j²). Usually, those variables with VIF_j > 10 are deleted
from the model.
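
A small NumPy sketch of this computation (my own illustration; X_raw is a placeholder for the N × K matrix of predictors without the intercept column):

    import numpy as np

    def vif(X_raw):
        """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other predictors."""
        n, k = X_raw.shape
        out = np.empty(k)
        for j in range(k):
            y_j = X_raw[:, j]
            others = np.column_stack([np.ones(n), np.delete(X_raw, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, y_j, rcond=None)   # auxiliary regression
            resid = y_j - others @ coef
            tss = (y_j - y_j.mean()) @ (y_j - y_j.mean())
            r2_j = 1.0 - (resid @ resid) / tss
            out[j] = 1.0 / (1.0 - r2_j)
        return out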



Diagnostic checking on the residuals

A good model will yield residuals with the following properties:


1 The residuals are uncorrelated.
2 The residuals have zero mean.
3 The residuals have constant variance.
4 The residuals have a normal distribution.

Residual plots are also a useful graphical tool for identifying non-linearity.
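
For example, a residuals-vs-fitted plot (a minimal matplotlib sketch, reusing y_fitted and residuals from the earlier OLS example):

    import matplotlib.pyplot as plt

    plt.scatter(y_fitted, residuals, s=10)
    plt.axhline(0.0, color="grey", linestyle="--")  # residuals should scatter evenly around zero
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()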

