
How to do a linear regression?

Tom Broekel
Diagnostics: Residuals

1 © TBroekel
Diagnostics of linear regression: Residuals

Regression (residual) diagnostics show whether the requirements for tests of statistical significance are met

If the requirements are not met, inference about the statistical significance of the regression parameters (~p-values) is not valid

Requirements

Normally distributed residuals

No heteroscedasticity (constant variance of residuals)

No auto-correlation (independence of residuals)



First impression of the residuals by looking at their scatterplot

Regression on NUTS2 regions: GDP ~ Pop_Den + Patents

[Figure: scatterplot of the regression residuals (y-axis: Residual) against the observation index (x-axis: Index)]
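A minimal sketch of how such an index plot of residuals could be produced in R. The data are simulated and all values are invented; only the formula GDP ~ Pop_Den + Patents is taken from the slide, and the data frame name region.data is borrowed from a later slide.

```r
# Simulated stand-in for the regional data (all values invented for illustration)
set.seed(1)
n <- 200
region.data <- data.frame(
  Pop_Den = runif(n, 10, 1000),
  Patents = runif(n, 0, 500)
)
region.data$GDP <- 15000 + 10 * region.data$Pop_Den + 20 * region.data$Patents +
  rnorm(n, sd = 8000)

# Fit the regression from the slide and extract its residuals
model <- lm(GDP ~ Pop_Den + Patents, data = region.data)
res <- residuals(model)

# Index plot: residuals in the order of the observations
plot(res, xlab = "Index", ylab = "Residual")
abline(h = 0, lty = 2)  # residuals should scatter around zero without a visible pattern
```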



Test if residuals are normally distributed

Normal distribution: No systematic biases & just random noise

Explicit test: Shapiro-Wilk test compares the distribution of the residuals with a normal distribution

Significant result (p-value below chosen level of significance) indicates rejection of the normal distribution hypothesis

Insignificant result (p-value above chosen level of significance) indicates that the normal distribution hypothesis is not rejected
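The decision rule can be sketched with shapiro.test() from base R. The two simulated outcomes (one with normal errors, one with strongly skewed errors) and the 0.05 level of significance are assumptions made for this illustration only.

```r
set.seed(42)
n <- 200
x <- runif(n)

# Two invented outcomes: one with normal errors, one with strongly skewed errors
y_normal <- 1 + 2 * x + rnorm(n)
y_skewed <- 1 + 2 * x + rexp(n, rate = 1)^2

alpha <- 0.05  # chosen level of significance (an assumption for this sketch)

for (y in list(y_normal, y_skewed)) {
  res <- residuals(lm(y ~ x))
  sw <- shapiro.test(res)  # H0: the residuals are normally distributed
  decision <- if (sw$p.value < alpha) "reject normality" else "do not reject normality"
  cat(sprintf("W = %.3f, p = %.4f -> %s\n", sw$statistic, sw$p.value, decision))
}
```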


Rejection of normal distribution hypothesis

Coefficients correct

But: Test of coefficients’ significance not reliable

Results cannot be interpreted


Rejection of normal distribution hypothesis

What to do?

Wrong functional relation? (Non-linearity?)

Missing variables?

Inherent characteristics of data ➡ different empirical approach (e.g., bootstrapping)
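One common form of bootstrapping is to resample observations with replacement and re-estimate the regression each time. The sketch below is only an illustration under assumptions: the data are simulated, the number of replications is arbitrary, and a simple percentile interval is used; the slides do not specify which bootstrap variant is meant.

```r
set.seed(7)
n <- 150
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + (rexp(n) - 1) * 2  # skewed, non-normal errors (invented)

B <- 1000  # number of bootstrap replications (an arbitrary choice)
boot_slopes <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample.int(n, size = n, replace = TRUE)  # resample observations with replacement
  boot_slopes[b] <- coef(lm(y ~ x, data = dat[idx, ]))["x"]
}

# Percentile interval for the slope; it does not rely on normally distributed residuals
ci <- quantile(boot_slopes, c(0.025, 0.975))
print(coef(lm(y ~ x, data = dat))["x"])
print(ci)
```

If the interval excludes zero, the slope would be judged different from zero at roughly the 5% level.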



Testing for normal distribution of regression residuals in R

Function ols_test_normality() in package olsrr directly applicable to regression results object

Function reports additional tests with usually little differences and similar interpretation

[Output annotation: normal distribution hypothesis to be rejected because p-value below 0.01]
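A short sketch of the call, assuming the olsrr package is installed. The data are simulated with heavy-tailed errors, and the variable names are hypothetical. If olsrr is missing, the sketch falls back to the Shapiro-Wilk test from base R.

```r
set.seed(3)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + dat$x1 - dat$x2 + rt(n, df = 3)  # heavy-tailed errors (invented data)
model <- lm(y ~ x1 + x2, data = dat)

if (requireNamespace("olsrr", quietly = TRUE)) {
  # Applied directly to the regression results object;
  # reports Shapiro-Wilk alongside further normality tests
  print(olsrr::ols_test_normality(model))
} else {
  # Fallback in base R: Shapiro-Wilk only
  print(shapiro.test(residuals(model)))
}
```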


Test for homoscedastic residuals

Test of significance based on the assumption of residuals’ variance being constant across their distribution

“No heteroscedasticity”

No relationship between residuals and explanatory variables

Variance of residuals should be constant across the distribution of fitted values of the dependent variable


Comparison of fitted values of dependent variable with residuals

Source: https://clevertap.com/blog/a-brief-primer-on-linear-regression-part-ii/


Breusch-Pagan test of heteroscedasticity

Regression of squared residuals on same explanatory variables

If “too” much variance explained by regression - residuals not independent of explanatory variables

Significant result of BP-Test suggests rejection of homoscedasticity assumption

More tests available
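The logic of the test can be sketched by hand in base R. This is the studentized (Koenker) variant with the statistic n * R² from the auxiliary regression; packaged implementations may use a different variant, so their values can differ. The data are simulated so that the error spread grows with x1.

```r
set.seed(11)
n <- 300
dat <- data.frame(x1 = runif(n, 1, 10), x2 = runif(n, 1, 10))
# Error spread grows with x1, so the residuals are heteroscedastic by construction
dat$y <- 1 + 2 * dat$x1 + 3 * dat$x2 + rnorm(n, sd = dat$x1)

model <- lm(y ~ x1 + x2, data = dat)
dat$res_sq <- residuals(model)^2

# Auxiliary regression: squared residuals on the same explanatory variables
aux <- lm(res_sq ~ x1 + x2, data = dat)
r2 <- summary(aux)$r.squared

# LM statistic n * R^2, compared with a chi-squared distribution
# (degrees of freedom = number of explanatory variables)
bp_stat <- n * r2
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
cat(sprintf("BP = %.2f, p-value = %.4g\n", bp_stat, bp_p))
```

A small p-value here means the explanatory variables explain "too" much of the squared residuals, which is the case in this constructed example.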


Rejection of homoscedasticity

Implications

Coefficients correct

Test of significance not reliable

Results cannot be interpreted


Rejection of homoscedasticity - Causes and consequences

Wrong functional relation? (Test for non-linearities?)

Missing variables?

Inherent characteristics of data ➡ different empirical approach (e.g., bootstrapping)


Testing for heteroscedasticity in R

Function ols_test_breusch_pagan() in package olsrr directly applicable to regression results object

[Output annotation: Prob>Chi2 = p-value; p-value above 0.01 implies that homoscedasticity cannot be rejected]
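A short sketch of the call, assuming the olsrr package is installed. The data are simulated with constant error variance. To my understanding, the default call relates the residuals to the fitted values, while rhs = TRUE uses the explanatory variables themselves; treat that reading of the argument as an assumption and check the package documentation.

```r
set.seed(5)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 1 + dat$x1 + dat$x2 + rnorm(n)  # constant error variance (invented data)
model <- lm(y ~ x1 + x2, data = dat)

if (requireNamespace("olsrr", quietly = TRUE)) {
  # Default call; the printed output lists Chi2 and Prob > Chi2 (the p-value)
  print(olsrr::ols_test_breusch_pagan(model))
  # Variant using the explanatory variables of the model (assumed meaning of rhs)
  print(olsrr::ols_test_breusch_pagan(model, rhs = TRUE))
} else {
  message("Package olsrr is not installed; the Breusch-Pagan call was skipped.")
}
```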



Autocorrelation: Observations correlate with themselves

Observations in some kind of order

Temporal: observations’ values in t correlate with their values in t-1 (temporal autocorrelation) -> Panel & time series analysis

Spatial: observations correlate with others in geographical proximity (spatial autocorrelation)

Autocorrelation implies correlated residuals, i.e., residuals are not independent of each other and hence include structural biases and not just random noise

Tests of significance not reliable


[Map: high & low GDP values geographically clustered]


Spatial autocorrelation: Residuals of observation i (region i) correlate with those from its neighbouring regions

Almost always a problem in case of spatial data

Spatial autocorrelation hints at similarities across regions or strong relations between them

Unobserved regional characteristics or relations

Regions part of “larger” regions - regional borders not optimal



Spatially autocorrelated residuals: mapping regression residuals

Residuals clearly geographically structured / clustered

North Europe: under-estimating GDP from population density & patents

East Europe: over-estimating GDP from population density & patents

[Map of regression residuals, $\text{Residual} = Y - \hat{Y}$]


Exact test of spatial autocorrelation: Moran’s I

Extension of Pearson’s correlation coefficient to spatial structures

Comparison of value for region i with those of (direct) neighbouring regions

Moran-correlation coefficient I
$$I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$



Example: spatial relations reflected by direct neighbourhood

Region A’s neighbours: B, C, D

Region D’s neighbours: A, E

Region E’s neighbours: D, F, G, H

Arranging residuals according to spatial neighbourhood

Region      Neighbours
ResidualA   ResidualB
ResidualA   ResidualC
ResidualA   ResidualD
ResidualD   ResidualA
ResidualD   ResidualE
ResidualE   ResidualD
ResidualE   ResidualF
ResidualE   ResidualG
ResidualE   ResidualH

Estimation of correlation coefficient = Moran’s I
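The example can be turned into a small base R calculation of Moran's I with binary neighbourhood weights. The residual values are invented, and two assumptions are made about the neighbourhood: relations are symmetric (B is a neighbour of A because A is a neighbour of B), and no links exist beyond those listed for regions A, D and E.

```r
regions <- c("A", "B", "C", "D", "E", "F", "G", "H")

# Neighbour pairs taken from the example; relations are treated as symmetric
edges <- rbind(c("A", "B"), c("A", "C"), c("A", "D"), c("D", "E"),
               c("E", "F"), c("E", "G"), c("E", "H"))

# Binary spatial weights matrix: w_ij = 1 if i and j are direct neighbours, else 0
W <- matrix(0, length(regions), length(regions),
            dimnames = list(regions, regions))
for (k in seq_len(nrow(edges))) {
  W[edges[k, 1], edges[k, 2]] <- 1
  W[edges[k, 2], edges[k, 1]] <- 1
}

# Invented residuals for the eight regions
res <- c(A = 4, B = 3, C = 5, D = 1, E = -4, F = -3, G = -5, H = -1)

# Moran's I: weighted cross-products of deviations, scaled by n / sum of weights
morans_i <- function(x, W) {
  z <- x - mean(x)  # deviations from the mean
  n <- length(x)
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

moran_value <- morans_i(res[regions], W)
print(moran_value)
```

The value is positive here because the invented residuals of neighbouring regions mostly share the same sign.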


Moran’s I test of spatial autocorrelation

Values between -1 (negative autocorrelation) and 1 (positive autocorrelation)

Significant result indicates presence of autocorrelation


Problem with Moran’s I

Different ways to define “neighbourhood”

Direct neighbourhood (weight of neighbouring values = 1, all others = 0)

Weighting based on distance (growing distance implies less weight in region i’s estimation)

Neighbourhood definition impacts estimation results

Motivate choice from theory: What type of dependencies are relevant?
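How much the neighbourhood definition matters can be sketched in base R by computing Moran's I twice for the same values. Everything here is invented for illustration: six regions placed along a line, their residuals, and the choice of inverse distance as the distance-based weighting.

```r
# Moran's I for values x and a spatial weights matrix W
morans_i <- function(x, W) {
  z <- x - mean(x)
  n <- length(x)
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Invented example: six regions located along a line, with invented residuals
coords <- c(0, 1, 2, 3, 4, 5)
res <- c(3, 2, 2, -1, -3, -3)

D <- abs(outer(coords, coords, "-"))  # pairwise distances between regions

# (a) Direct neighbourhood: weight 1 for adjacent regions (distance 1), else 0
W_binary <- (D == 1) * 1

# (b) Distance-based: weight shrinks with distance; a region gets no weight on itself
W_dist <- ifelse(D > 0, 1 / D, 0)

I_binary <- morans_i(res, W_binary)
I_dist <- morans_i(res, W_dist)
cat(sprintf("Direct neighbourhood: I = %.3f\n", I_binary))
cat(sprintf("Inverse distance:     I = %.3f\n", I_dist))
```

In this constructed case the distance-based weights give a clearly lower value, because distant regions with opposite residuals also enter the calculation.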


When spatial autocorrelation present

Use different spatial units (definition of regions)

Consideration of spatial characteristics, e.g., urban vs. rural

Model spatial dependencies with dummy variables (e.g., country dummies)

Use of spatial regression models (not this class)

Multi-level regression (not this class)



How to test for spatial autocorrelation in R

Load spatial information concerning geographical locations of observations

Usually, a “map”

Maps = so-called “shapefiles” that link geographical information (latitude and longitude data) to empirical observations, e.g., regions

R has excellent capabilities for handling spatial information using the package sf

Use of sf (simple features) library makes working with such data easy

Full compatibility with tidyverse and all its features

Easy integration with ggplot



How to test for spatial autocorrelation in R

Load shapefile with read_sf() of sf library

Add regression residuals to original data set using add_residuals() from modelr library (arguments: regression object, name of new column)

Merge extended data set with shapefile to obtain “geolocated” observations (via the ID columns in shapefile & region.data)
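These three steps can be sketched as follows. The regional data are simulated, and the ID column name NUTS_ID and the file name nuts2_regions.shp are hypothetical placeholders, not taken from the slides. Residuals are added with base R here; the equivalent modelr call is shown in a comment. The shapefile part only runs if sf is installed and the file exists.

```r
set.seed(2)
# Invented regional data; "NUTS_ID" stands in for the ID column shared with the shapefile
region.data <- data.frame(
  NUTS_ID = sprintf("R%02d", 1:50),
  Pop_Den = runif(50, 10, 1000),
  Patents = runif(50, 0, 500)
)
region.data$GDP <- 15000 + 10 * region.data$Pop_Den + 20 * region.data$Patents +
  rnorm(50, sd = 5000)

model <- lm(GDP ~ Pop_Den + Patents, data = region.data)

# Add the residuals as a new column "resid"; with modelr this would be
# modelr::add_residuals(region.data, model, var = "resid")
region.data$resid <- residuals(model)

# Load the shapefile and merge by the shared ID column
shape_path <- "nuts2_regions.shp"  # hypothetical file name
if (requireNamespace("sf", quietly = TRUE) && file.exists(shape_path)) {
  shape <- sf::read_sf(shape_path)
  geo.data <- merge(shape, region.data, by = "NUTS_ID")  # sf object keeps its geometry
  print(head(geo.data))
}
```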


Shapefile (sf) object includes data.frame with merged data

Merged columns of regional data


Before calculating Moran’s I, create information on spatial relations (who is neighbour of whom?)

Extract neighbourhood information from shapefile with poly2nb() and transform into spatial dependency object (spatial weights) with nb2listw()

Some regions with no neighbours (islands)


Moran’s I test implemented in spdep library with function


moran.test()

Highly significant: spatial autocorrelation!

Neighbours’ residuals correlate with 0.65 (correlation coefficient)
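The sequence from neighbourhood information to the test can be sketched with spdep, assuming the package is installed. No shapefile is available here, so a regular 10 x 10 grid built with cell2nb() stands in for the neighbour list that would normally be extracted from the shapefile. The "residuals" are invented and smoothed over neighbours so that they are spatially clustered by construction.

```r
if (requireNamespace("spdep", quietly = TRUE)) {
  set.seed(9)

  # Stand-in for the neighbour list from a shapefile: a 10 x 10 grid of "regions"
  nb <- spdep::cell2nb(10, 10, type = "rook")

  # Spatial weights object; zero.policy = TRUE tolerates regions without neighbours (islands)
  lw <- spdep::nb2listw(nb, style = "W", zero.policy = TRUE)

  # Invented residuals: random noise plus the average noise of the neighbours,
  # which makes neighbouring values similar
  noise <- rnorm(length(nb))
  res <- noise + 2 * spdep::lag.listw(lw, noise)

  # Moran's I test; the printed estimate is the Moran I statistic
  mt <- spdep::moran.test(res, lw, zero.policy = TRUE)
  print(mt)
} else {
  message("Package spdep is not installed; the Moran's I test was skipped.")
}
```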

Estimating regression model considering the full set of country


dummies (almost) solves the issue

Slightly significant: weak spatial autocorrelation!
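Why country dummies help can be sketched in base R with simulated data. The five countries, their effects and all values are invented, and only one explanatory variable is used to keep the example short. Including the country as a factor makes lm() add the full set of dummies.

```r
set.seed(21)
n_country <- 5
n_per <- 40
dat <- data.frame(
  Country = factor(rep(paste0("C", 1:n_country), each = n_per)),
  Pop_Den = runif(n_country * n_per, 10, 1000)
)

# Unobserved country-level differences shift GDP for all regions of a country
country_effect <- c(-20000, -10000, 0, 10000, 20000)
dat$GDP <- 20000 + 15 * dat$Pop_Den + country_effect[as.integer(dat$Country)] +
  rnorm(nrow(dat), sd = 3000)

m_plain <- lm(GDP ~ Pop_Den, data = dat)
m_dummy <- lm(GDP ~ Pop_Den + Country, data = dat)  # factor expands into country dummies

# Mean residual per country: clustered by country without dummies, zero with dummies
mean_plain <- tapply(residuals(m_plain), dat$Country, mean)
mean_dummy <- tapply(residuals(m_dummy), dat$Country, mean)
print(round(mean_plain))
print(round(mean_dummy, 6))
```

The dummies absorb differences between countries only; clustering of residuals within a country can remain, which fits the "(almost)" on the slide.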


How to do linear regressions!


Ex-ante checks

Number of observations

Type of dependent variable

Linearity

Ex-post checks with potential model refinement

Multicollinearity

Outliers

Normal distribution of residuals

Heteroscedasticity

Autocorrelation (spatial/temporal)
