2021 - Lecture 11 - Introduction To Regression I - Slides

28/09/2021
Lecture 11
GIS220
Introduction to regression I
Prof Gregory Breetzke

greg.breetzke@up.ac.za
Room 1-19, Geography
Lecture outline
o Up until now…
o Regression analysis
o Regression components
o GeoDa example
…we’ve seen the “where”
o Suicides, crime, health
o 10111 emergency calls, or fires
o Traffic accidents, or mine dumps
…what about the “why”?
• Why are there places in South Africa with high crime rates? What might be causing this?
• Can we model the characteristics of places that experience lots of emergency calls or fire events in order to help reduce these incidents?
• What are the factors contributing to higher-than-expected traffic accidents?
• Where is TB highest in Tshwane, and why?
• Why is erosion more prevalent in this catchment?
Regression analysis
o What is regression?
- Dependent variable (variable of interest)
- Independent variables (explanatory/predictor variables)
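o Often referred to as Ordinary Least Squares (OLS) regression
– Simple regression: one independent variable, so the fit is a line in two-dimensional space
– Multiple regression: two or more independent variables, so the fit is a plane (or hyperplane) in three- (or higher-) dimensional space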
Regression analysis: example
Residuals
• A residual is the difference between the observed y-value (from the scatter plot) and the predicted y-value (from the regression line).
• It is the vertical distance from the actual plotted point to the point on the regression line.
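
The definition above can be sketched in a few lines of Python. The four points and the line y = 0 + 2x are hypothetical values chosen for illustration; they are not data from the lecture.

```python
# Hypothetical observations (x, y); not data from the lecture.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

# Hypothetical regression line: predicted y = intercept + slope * x
intercept, slope = 0.0, 2.0


def predict(x):
    """Predicted y-value on the regression line at a given x."""
    return intercept + slope * x


# Residual = observed y minus predicted y (vertical distance, with sign).
residuals = [y - predict(x) for x, y in points]

for (x, y), e in zip(points, residuals):
    print(f"x={x}: observed={y}, predicted={predict(x)}, residual={e:+.2f}")
```

A positive residual means the observed point lies above the line; a negative residual means it lies below.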
Minimising Residuals
• Regression finds the line for which the sum of the squared (vertical) distances between the points and the line is minimised
– Hence the term ordinary least squares (OLS)
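
As a minimal sketch of what "minimising" means here, the code below fits a simple regression line with the standard closed-form least-squares formulas and then checks that nudging the intercept or slope away from the fitted values only increases the sum of squared residuals. The data are hypothetical, and the function names are illustrative rather than taken from any package used in the course.

```python
def fit_line(xs, ys):
    """Closed-form OLS estimates for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 5.8, 8.3, 9.9]

b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)
print(f"fitted line: y = {b0:.2f} + {b1:.2f} x, SSR = {best:.3f}")

# Any other line gives a larger sum of squared residuals.
for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    worse = sum_squared_residuals(xs, ys, b0 + d0, b1 + d1)
    print(f"shift intercept {d0:+}, slope {d1:+}: SSR = {worse:.3f}")
```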
Regression analysis
o Relationships are either positive or negative, as with correlation
Correlation vs. Regression

Correlation
• Correlation (co-relation) measures the strength of a linear association between two variables
– Causality not implied
– Coefficient constrained to the range -1 to 1

Regression
• Regression studies the relationship between a dependent variable and a set of independent (explanatory) variables, which the analyst assumes to be causal
– Causality implied by the direction of the model (x explains y), though a good fit does not prove it
– Coefficients not constrained to -1 to 1
– A simple regression plot looks like a scatterplot showing the correlation coefficient, but regression assigns a direction: which variable explains which
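
The point about constraints can be illustrated with a small sketch. Using hypothetical data, it computes Pearson's r and the simple-regression slope, then rescales y (for example, a change of units). The correlation stays the same and remains within -1 to 1, while the regression coefficient scales with the units.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Slope of the simple OLS regression of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 5.8, 8.3, 9.9]
ys_rescaled = [100 * y for y in ys]  # same relationship, different units

r1, r2 = pearson_r(xs, ys), pearson_r(xs, ys_rescaled)
slope1, slope2 = ols_slope(xs, ys), ols_slope(xs, ys_rescaled)

print(f"correlation: {r1:.4f} vs {r2:.4f} (unchanged by rescaling)")
print(f"slope:       {slope1:.2f} vs {slope2:.2f} (scales with the units)")
```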
Why Regression?
1. Provides a simplified view of the relationship between variables.
2. Provides a way of fitting a model to our data.
3. Provides a means for evaluating the importance of the variables and the correctness of the model.
Regression analysis
o More than one predictor...?
o Example
"Broken Windows Theory" holds that defacement of public property (graffiti, damaged structures, etc.) invites other crimes. Will there be a positive relationship between vandalism incidents, the wealth of the neighbourhood and residential burglary?
o Is a person at greater risk of burglary if they live in a rich or a poor neighbourhood?
Stepping through the analysis

1) Come up with a list of possible x (independent) variables that may be helpful in estimating y (the dependent variable)
y – dependent variable
• Residential burglary
x – independent variables
• Median income (in thousands of Rands)
• No of vandalism incidents
• No of household units

2) Collect data on the y variable and your x variables from step 1
Variable                      Source
Residential burglary          SAPS
Median income                 Statistics South Africa
No of vandalism incidents     SAPS
No of household units         Statistics South Africa

3) Check the relationship between each x (independent) variable and y (using scatterplots and correlations), and use the results to eliminate those variables that aren’t strongly related to y
Variable                      Correlation with y
Median income                 +0.45
No of vandalism incidents     +0.23
No of household units         +0.34

4) Look at the possible relationships between the x (independent) variables to make sure you aren’t being redundant (avoid multicollinearity)
                              Median income   No of vandalism incidents   No of household units
Median income                 1
No of vandalism incidents     +0.32           1
No of household units         +0.12           +0.67                       1
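
One way to put a number on this redundancy check is the Variance Inflation Factor (VIF), which these slides mention later for ArcMap with a rule of thumb of 7.5. The sketch below relies on a standard result: the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' correlation matrix. The correlations are the ones from step 4; the short labels and the helper function `invert` are illustrative names, and this is not necessarily how ArcMap or GeoDa compute the statistic internally.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Correlations between the x variables, taken from step 4.
labels = ["MED_INC", "VAND", "HH_UNITS"]
corr = [
    [1.00, 0.32, 0.12],
    [0.32, 1.00, 0.67],
    [0.12, 0.67, 1.00],
]

inverse = invert(corr)
vif = {label: inverse[i][i] for i, label in enumerate(labels)}

for label, value in vif.items():
    flag = "possible redundancy" if value > 7.5 else "ok"
    print(f"{label}: VIF = {value:.2f} ({flag})")
```

With these correlations all three VIF values are well below 7.5, so even the +0.67 correlation between vandalism incidents and household units would not, on this rule of thumb, be flagged as redundant.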

5) Use those x variables (from step 4) in a multiple OLS regression analysis to find the best-fitting model for your data

6) Use the best-fitting model (from step 5) to predict y for given x-values by plugging those x-values into the model
RES_BURG = 5.79 + 2.162 (MED_INC) + 1.221 (VAND) + 0.766 (HH_UNITS) + e
The intercept and the three coefficients are chosen collectively to minimise the sum of squared residuals; e is the residual (error) term
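
Step 6 can be sketched as a small function that plugs x-values into the fitted equation. The coefficients come from the slide; the input values and the "observed" count are hypothetical, and the units of VAND and HH_UNITS are assumed to match whatever units were used when the model was fitted.

```python
def predict_res_burg(med_inc, vand, hh_units):
    """Predicted residential burglary from the fitted model on the slide.

    med_inc  : median income, in thousands of Rands
    vand     : number of vandalism incidents
    hh_units : number of household units
    """
    return 5.79 + 2.162 * med_inc + 1.221 * vand + 0.766 * hh_units


# Hypothetical neighbourhood, for illustration only.
predicted = predict_res_burg(med_inc=20, vand=10, hh_units=50)
print(f"predicted RES_BURG = {predicted:.2f}")

# If the observed value were 105 (also hypothetical), the residual e is
# the observed value minus the prediction.
observed = 105
print(f"residual e = {observed - predicted:+.2f}")
```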
GeoDa: example
Columbus neighbourhood crime base map (49 census tracts)
CRIME = 68.61 – 1.59 (INC) – 0.27 (HOVAL)
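
To make the signs in this equation concrete, the sketch below evaluates it for hypothetical INC and HOVAL values and shows the effect of a one-unit change in each variable while the other is held constant. The input values are assumptions for illustration, not tracts from the Columbus data.

```python
def predict_crime(inc, hoval):
    """Predicted CRIME from the fitted equation on the slide."""
    return 68.61 - 1.59 * inc - 0.27 * hoval


# Hypothetical tract, for illustration only.
base = predict_crime(inc=10, hoval=40)
print(f"predicted CRIME = {base:.2f}")

# Both coefficients are negative: raising either variable by one unit,
# with the other held constant, lowers the predicted value.
effect_inc = predict_crime(inc=11, hoval=40) - base
effect_hoval = predict_crime(inc=10, hoval=41) - base
print(f"effect of +1 INC:   {effect_inc:+.2f}")
print(f"effect of +1 HOVAL: {effect_hoval:+.2f}")
```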
Interpreting the Results

R²
• Indicator of how well the regression line fits the data; ranges from 0 (no fit) to 1 (best fit).
• The proportion of variability in the dependent variable that is accounted for by the regression equation. The higher the number, the better the fit.
• Adjusted R² is preferred to R².
• Note: Outliers or non-linear data could decrease R².

Coefficients
• Sign: positive or negative relationship.
• Indicator of strength of the relationship.
• P-values (probability) indicate significance.
• Remove variables with high P-values to see if adjusted R² increases (plain R² cannot increase when a variable is removed).
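
A minimal sketch of how R² and adjusted R² are computed, using the usual textbook formulas. The observed values and the fitted line y = 0.18 + 1.96x are hypothetical; in practice GeoDa or ArcMap reports both statistics directly.

```python
def r_squared(observed, predicted):
    """Proportion of the variability in y accounted for by the model."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalise R² for the number of explanatory variables.

    n : number of observations
    k : number of explanatory variables (the intercept is not counted)
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical data and a hypothetical fitted line y = 0.18 + 1.96 x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [2.2, 4.1, 5.8, 8.3, 9.9]
predicted = [0.18 + 1.96 * x for x in xs]

r2 = r_squared(observed, predicted)
adj = adjusted_r_squared(r2, n=len(observed), k=1)
print(f"R² = {r2:.4f}, adjusted R² = {adj:.4f}")
```

Adjusted R² is never larger than R², and the gap widens as more explanatory variables are added relative to the number of observations, which is why it is the fairer statistic for comparing models with different numbers of variables.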
© UP 2018: GIS 220 26
Interpreting the Results

AICc
• Allows for comparison of models.
• AICc (corrected Akaike Information Criterion) is a measure of the relative goodness of fit of a statistical model.
• A lower AICc value means the model is a better fit for the data.
• Does not test the null hypothesis.

Multicollinearity
• For ArcMap: use the Variance Inflation Factor (VIF)
– Larger than 7.5 could indicate redundancy among variables.
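
The model comparison can be sketched with one common least-squares form of AICc: AIC = n ln(SSR/n) + 2p, plus the small-sample correction 2p(p+1)/(n-p-1), where p is the number of estimated coefficients including the intercept. This form is an assumption: software packages differ in the constants they include, so only differences in AICc between models fitted to the same data are meaningful. The two SSR values below are hypothetical; n = 49 matches the number of Columbus census tracts.

```python
import math


def aicc(ssr, n, p):
    """Corrected AIC for a least-squares model (one common form).

    ssr : sum of squared residuals
    n   : number of observations
    p   : number of estimated coefficients, including the intercept
    """
    aic = n * math.log(ssr / n) + 2 * p
    return aic + (2 * p * (p + 1)) / (n - p - 1)


n = 49  # number of census tracts in the Columbus example

# Hypothetical fits: model B adds one variable and lowers SSR only slightly.
aicc_a = aicc(ssr=6000.0, n=n, p=3)  # intercept + 2 explanatory variables
aicc_b = aicc(ssr=5950.0, n=n, p=4)  # intercept + 3 explanatory variables

print(f"model A: AICc = {aicc_a:.2f}")
print(f"model B: AICc = {aicc_b:.2f}")
print("preferred:", "A" if aicc_a < aicc_b else "B")
```

In this hypothetical case model A has the lower AICc, because the small improvement in fit does not outweigh the penalty for the extra variable.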
Residual Errors
Tests
• Jarque-Bera Test
– Tests the normality of errors.
– If significant, an explanatory variable may be missing.
• Breusch-Pagan, Koenker-Bassett, White
– Test for heteroskedasticity (non-constant variance).
– If significant, the relationships may be non-stationary.
• Spatial Autocorrelation
– Indicates missing variables.
– Indicates the need for alternative regression models.

Mapping Residuals
• If there is significant clustering, there could be misspecification (a variable is missing from the model).
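
As an illustration of the first test, the Jarque-Bera statistic can be computed from the skewness S and kurtosis K of the residuals as JB = n/6 (S² + (K - 3)²/4). Under normality it approximately follows a chi-squared distribution with 2 degrees of freedom, for which the p-value is exp(-JB/2). The residuals below are hypothetical, and the approximation is rough for so small a sample; GeoDa reports this test in its regression output, so the sketch only shows what lies behind the number.

```python
import math


def jarque_bera(residuals):
    """Jarque-Bera statistic and its asymptotic p-value (chi-squared, 2 df)."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments of the residuals.
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of the chi-squared distribution with 2 df.
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Hypothetical residuals, for illustration only.
residuals = [-2, -1, -1, 0, 0, 0, 1, 1, 2]

jb, p_value = jarque_bera(residuals)
print(f"JB = {jb:.4f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Significant: the errors do not look normally distributed.")
else:
    print("Not significant: no evidence against normal errors.")
```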