28/09/2021

Lecture 11

GIS220
Introduction to regression I

Prof Gregory Breetzke


greg.breetzke@up.ac.za
Room 1-19, Geography

Lecture outline

o Up until now…

o Regression analysis

o Regression components

o GeoDa example


…we’ve seen the “where”

o Suicides, crime, health

o 10111 emergency calls, or fires

o Traffic accidents, or mine dumps

…what about the “why”?

• Why are there places in South Africa with high crime rates? What might be causing this?

• Can we model the characteristics of places that experience lots of emergency calls or fire events, in order to help reduce these incidents?

• What are the factors contributing to higher-than-expected traffic accidents?

• Where is TB highest in Tshwane, and why?

• Why is erosion more prevalent in this catchment?


Regression analysis

o What is regression?
- Dependent variable (variable of interest)
- Independent variables (explanatory/predictor variables)

o Often referred to as Ordinary Least Squares (OLS) regression
– Simple regression: one independent variable, so the fitted line lies in two-dimensional space
– Multiple regression: two or more independent variables, so the fit lies in three (or more) dimensional space

Regression analysis: example


Residuals
• A residual is the difference between the observed y-value (from the scatter plot) and the predicted y-value (from the regression line).

• It is the vertical distance from the actual plotted point to the point on the regression line.
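
As a minimal sketch of this definition, the Python snippet below computes each residual as observed minus predicted. The data points and the line y = 2 + 3x are hypothetical and chosen only for illustration.

```python
# Hypothetical observations (x, y) and a hypothetical regression line y = 2 + 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.5, 7.0, 11.5, 14.0]


def predict(x, intercept=2.0, slope=3.0):
    """Predicted y-value on the (assumed) regression line."""
    return intercept + slope * x


# Residual = observed y minus predicted y (the vertical distance, with sign).
residuals = [y - predict(x) for x, y in zip(xs, ys)]
print(residuals)
```

A positive residual means the point lies above the line; a negative residual means it lies below.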


Minimising Residuals
• Regression finds a line such that the sum of the squared (vertical) distances between the points and the line is minimised
– Hence the term ordinary least squares (OLS)
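
The idea can be sketched in Python for the simple (one-predictor) case. The closed-form slope and intercept below are the standard least-squares solution; the five data points are hypothetical. The last lines compare the fitted line with a slightly shifted line to show that the fitted one has the smaller sum of squared residuals.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_simple_ols(xs, ys)
best = sum_squared_residuals(xs, ys, intercept, slope)
shifted = sum_squared_residuals(xs, ys, intercept + 0.5, slope)
print(intercept, slope, best, shifted)
```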

Regression analysis

o Relationships are either positive or negative, as with correlation


Correlation vs. Regression


Correlation
• Correlation (co-relation) measures the strength of a linear association between two variables
– Causality not implied
– Constrained to -1 to 1

Regression
• Regression studies the hypothesised causal relationship between a dependent variable and a set of independent, explanatory variables
– Causality is assumed by the model (x is taken to explain y); the fit alone does not prove it
– Coefficients not constrained to -1 to 1
– Simple regression looks like a scatterplot showing the correlation coefficient, but regression treats one variable as depending on the other
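
The difference in scale can be illustrated with a short Python sketch. The income and burglary figures are hypothetical. Changing the units of x (thousands of Rands to Rands) leaves the correlation coefficient unchanged, while the regression slope changes by the same factor, which is why coefficients are not constrained to -1 to 1.

```python
from math import sqrt


def centred_sums(xs, ys):
    """Return (Sxy, Sxx, Syy): sums of centred cross-products and squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    sxy, sxx, syy = centred_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def ols_slope(xs, ys):
    sxy, sxx, _ = centred_sums(xs, ys)
    return sxy / sxx


# Hypothetical data: income in thousands of Rands, burglary counts.
income_thousands = [10.0, 20.0, 30.0, 40.0, 50.0]
burglaries = [12.0, 15.0, 21.0, 24.0, 33.0]
income_rands = [v * 1000.0 for v in income_thousands]

r_thousands = pearson_r(income_thousands, burglaries)
r_rands = pearson_r(income_rands, burglaries)
slope_thousands = ols_slope(income_thousands, burglaries)
slope_rands = ols_slope(income_rands, burglaries)
print(r_thousands, r_rands, slope_thousands, slope_rands)
```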


Why Regression?
1. Provides a simplified view of the relationship between variables.
2. Provides a way of fitting the model with our data.
3. Provides a means for evaluating the importance of the variables and the correctness of the model.


Regression analysis

o More than one predictor...?

o Example

The "broken windows theory" holds that defacement of public property (graffiti, damaged structures, etc.) invites other crimes. Will there be a positive relationship between vandalism incidents, the wealth of the neighbourhood and residential burglary?

o Is a person at greater risk of burglary if they live in a rich or a poor neighbourhood?

Stepping through the analysis


1) Come up with a list of possible x (independent) variables that may be helpful in estimating y (the dependent variable)

y – dependent variable

Residential burglary

x – independent variables

Median income (in thousands of Rands)
No. of vandalism incidents
No. of household units


Stepping through the analysis


2) Collect data on the y variable and your x variables from step 1

y – dependent variable

Residential burglary (source: SAPS)

x – independent variables

Median income (source: Statistics South Africa)
No. of vandalism incidents (source: SAPS)
No. of household units (source: Statistics South Africa)

Stepping through the analysis


3) Check the relationships between each x (independent) variable and y (using scatterplots and correlations), and use the results to eliminate those variables that aren’t strongly related to y

y – dependent variable

Residential burglary

x – independent variables (correlation with y)

Median income +0.45
No. of vandalism incidents +0.23
No. of household units +0.34


Stepping through the analysis


4) Look at the possible relationships between the x (independent) variables to make sure you aren’t being redundant (avoid multicollinearity)

x – independent variables

                             Median income   No. of vandalism incidents   No. of household units
Median income                1
No. of vandalism incidents   +0.32           1
No. of household units       +0.12           +0.67                        1
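
A minimal Python sketch of this check is given below. The correlations are the ones in the matrix above, and the short variable names follow the model equation used in this lecture. The cut-off of 0.6 is an assumption made purely for illustration; the lecture does not prescribe a threshold.

```python
# Pairwise correlations between the x variables, taken from the matrix above.
CORRELATIONS = {
    ("MED_INC", "VAND"): 0.32,
    ("MED_INC", "HH_UNITS"): 0.12,
    ("VAND", "HH_UNITS"): 0.67,
}

# Assumed cut-off for "possibly redundant"; not a value given in the lecture.
THRESHOLD = 0.6


def flag_redundant_pairs(correlations, threshold):
    """Return the (var_a, var_b, r) triples whose |r| meets the threshold."""
    return [
        (a, b, r)
        for (a, b), r in correlations.items()
        if abs(r) >= threshold
    ]


flagged = flag_redundant_pairs(CORRELATIONS, THRESHOLD)
print(flagged)
```

A flagged pair is a prompt to look more closely, not an automatic reason to drop a variable.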


Stepping through the analysis


5) Use those x variables (from step 4) in a multiple OLS regression
analysis to find the best-fitting model for your data
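
What a multiple OLS fit does can be sketched in plain Python by solving the normal equations (X'X)b = X'y. This is an illustrative sketch only, not the routine GeoDa or any other package uses. The data are synthetic and noise-free (y = 1 + 2·x1 + 3·x2), so the recovered coefficients can be checked by eye; all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_ols(rows, ys):
    """Return [intercept, b1, b2, ...] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
ys = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]

coefficients = fit_multiple_ols(rows, ys)
print(coefficients)
```

With real data the fit is not exact, and the leftover differences are the residuals.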

Stepping through the analysis


6) Use the best-fitting model (from step 5) to predict y for given x-values by plugging those x-values into the model

RES_BURG = 5.79 + 2.162 (MED_INC) + 1.221 (VAND) + 0.766 (HH_UNITS) + e

The intercept and coefficients collectively minimise the sum of squared residuals; e is the residual (error) term.
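
Plugging values into the model can be sketched in Python as follows. The intercept and coefficients are those of the equation above; the neighbourhood values passed in are hypothetical. The error term e is unknown for a new observation, so it is left out of the prediction.

```python
# Intercept and coefficients from the RES_BURG equation above.
INTERCEPT = 5.79
COEFFICIENTS = {"MED_INC": 2.162, "VAND": 1.221, "HH_UNITS": 0.766}


def predict_res_burg(med_inc, vand, hh_units):
    """Predicted residential burglary for the given x-values (no error term)."""
    return (
        INTERCEPT
        + COEFFICIENTS["MED_INC"] * med_inc
        + COEFFICIENTS["VAND"] * vand
        + COEFFICIENTS["HH_UNITS"] * hh_units
    )


# Hypothetical neighbourhood, for illustration only.
prediction = predict_res_burg(med_inc=10.0, vand=5.0, hh_units=20.0)
print(round(prediction, 3))
```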


GeoDa: example

Columbus neighbourhood crime base map (49 census tracts)


CRIME = 68.61 – 1.59 (INC) – 0.27 (HOVAL)
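
One way to read the fitted equation is through the effect of a one-unit change in a single variable while the other is held fixed. The Python sketch below uses the coefficients of the equation above; the INC and HOVAL input values are hypothetical.

```python
def predict_crime(inc, hoval):
    """Predicted CRIME from the fitted equation above."""
    return 68.61 - 1.59 * inc - 0.27 * hoval


# Hypothetical tract, then the same tract with INC one unit higher.
base = predict_crime(inc=10.0, hoval=40.0)
richer = predict_crime(inc=11.0, hoval=40.0)

# The change equals the INC coefficient: negative sign, so predicted CRIME falls.
change = richer - base
print(round(base, 2), round(change, 2))
```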

Interpreting the Results


R²
• Indicator of how well the regression line fits the data; ranges from 0 (no fit) to 1 (best fit).
• The proportion of variability in the dataset that is accounted for by the regression equation. The higher the number, the better the fit.
• Adjusted R² is preferred to R².
• Note: Outliers or non-linear data could decrease R².

Coefficients
• Sign: positive or negative relationship.
• Indicator of strength of the relationship.
• P-values (probability) indicate significance.
• Remove variables with high P-values to see if adjusted R² increases.
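
A minimal Python sketch of R² and adjusted R² is given below, using the standard definitions R² = 1 - SSE/SST and adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. The observed and predicted values are hypothetical.

```python
def r_squared(observed, predicted):
    """Proportion of variability in y accounted for by the model."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, k):
    """R² penalised for the number of independent variables k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical observed y-values and predictions from a one-predictor model.
observed = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(observed, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(observed), k=1)
print(round(r2, 3), round(adj_r2, 3))
```

Adding a variable never lowers R², but it can lower adjusted R², which is one reason the adjusted value is preferred when comparing models.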

© UP 2018: GIS 220 26


Interpreting the Results


AICc
• Allows for comparison of models.
• AICc is a measure of the relative goodness of fit of a statistical model.
• A lower AICc value means the model is a better fit for the data.
• Does not test the null hypothesis.

Multicollinearity
• For ArcMap: use the Variance Inflation Factor (VIF)
– Larger than 7.5 could indicate redundancy among variables.
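
Both measures can be sketched in Python. The AICc below is one common least-squares form (defined up to an additive constant, with k counting the estimated parameters); ArcMap or GeoDa may compute it differently, so only comparisons between models fitted to the same data are meaningful. The VIF for variable j is 1 / (1 - R²_j), where R²_j comes from regressing x_j on the other x variables. The SSE values are hypothetical; n = 49 matches the Columbus example.

```python
from math import log


def aicc_least_squares(sse, n, k):
    """One common small-sample-corrected AIC for a least-squares fit.

    sse: sum of squared residuals, n: observations, k: estimated parameters.
    This form is an assumption for illustration, not a package's exact formula.
    """
    aic = n * log(sse / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def vif(r_squared_j):
    """Variance Inflation Factor from the R² of x_j regressed on the other x's."""
    return 1.0 / (1.0 - r_squared_j)


# Two hypothetical models on the same 49 observations, both with 3 parameters.
model_a = aicc_least_squares(sse=100.0, n=49, k=3)
model_b = aicc_least_squares(sse=150.0, n=49, k=3)
print(model_a < model_b)  # the lower AICc indicates the better relative fit

# With only two predictors, R²_j is the squared pairwise correlation (here 0.67).
pairwise_vif = vif(0.67 ** 2)
print(round(pairwise_vif, 2), pairwise_vif > 7.5)
```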


Residual Errors
Tests
• Jarque-Bera Test
– Tests the normality of errors.
– If significant: may indicate a missing explanatory variable.
• Breusch-Pagan, Koenker-Bassett, White
– Test for heteroskedasticity (non-constant variance).
– If significant: may indicate non-stationarity.
• Spatial Autocorrelation
– Indicates missing variables.
– Indicates the need for alternative regression models.

Mapping Residuals
• If there is significant clustering, there could be misspecification (a variable is missing from the model).
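
As an illustration of the first test, the Jarque-Bera statistic can be computed from the skewness S and kurtosis K of the residuals as JB = (n/6)(S² + (K - 3)²/4). The Python sketch below uses hypothetical residuals. Under normality the statistic approximately follows a chi-squared distribution with 2 degrees of freedom (5% critical value about 5.99), although that approximation is poor for a sample as small as this one.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic from the sample skewness and kurtosis."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n  # variance (biased)
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Hypothetical residuals, symmetric around zero.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb = jarque_bera(residuals)

CRITICAL_5_PERCENT = 5.991  # chi-squared, 2 degrees of freedom
print(round(jb, 4), jb > CRITICAL_5_PERCENT)
```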


