
Linear Regression and Geographically Weighted Regression

By Richard Yang

Report
The tutorial I picked covered linear and geographically weighted regression. In this
tutorial, I learned about ordinary least squares (OLS) regression and geographically
weighted regression (GWR). The difference between the two is that OLS is a global regression
method, while GWR is a local, spatial regression method that allows the relationship
I am modeling to vary across the study area. The study area for this tutorial was the
Portland Metropolitan Area, and the focus was 911 calls. First, I ran an OLS regression on
the volume of 911 calls to see which variables contribute to the high volume. Is it driven by
population? Education? Income? I was going to find out. Using the variables that turned out
to be important, I then ran a GWR to analyze where future
calls will come from.
The point of the tutorial was to see where 911 calls are coming from now and
where they will come from in the future. Will the response stations, as currently located,
still be effective when the volume of calls grows? First I looked at the hot spot
locations of the current calls and where the response stations are located. The tutorial
included a file with the hot spots already mapped out. Based on the locations of
911 calls, the map shows call volume from cold spots (blue) to hot spots (red)
(Figure 1). To continue with the regression analysis, I was asked the question: what
factors cause the high volume of calls in the hot spot areas? To find out, I had
to run an OLS regression. Instead of
using the individual calls as points, I used the file in which calls have been
aggregated to census tracts. This shapefile is better to use because it carries
more information (variables) that could help explain the high volume of
calls in the hot spots. The first time, I used only population as the variable to try to explain
the high volume of calls in the hot spots. The OLS tool produced a results table with
many different figures. The most important one to focus on is the R-squared value, which
was .393460 (Figure 2). To put it another way,
population accounted for only 39% of the story of why there was a high volume of calls in
the hot spots. If the figure were higher, say 90%, further analysis wouldn't be needed
because population alone would explain 90% of the variation in call volume. Since it was only
39%, other factors must also be contributing to the high volume of calls.
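To make the R-squared figure concrete, here is a minimal sketch of a one-variable OLS fit and its R-squared value. It is not the tutorial's tool, and the tract numbers are made up for illustration; the function names `fit_simple_ols` and `r_squared` are my own.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variation in y that the fitted line explains."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical census tracts (made-up values, not the tutorial's data).
population = [1200, 2500, 3100, 4000, 5200, 6100]
calls = [14, 20, 41, 30, 62, 49]

intercept, slope = fit_simple_ols(population, calls)
r2 = r_squared(population, calls, intercept, slope)
print(f"slope={slope:.4f}  R-squared={r2:.3f}")
```

An R-squared near 1 would mean population alone explains almost all of the variation in calls; a value like the tutorial's 0.39 leaves most of the variation unexplained.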
To find what other factors contributed to the high volume of calls, I needed to create a
scatterplot matrix. Using the scatterplot matrix, I chose the variables population, jobs, low
education, and distance to urban centers, and ran the OLS tool again with these four variables.
This time the R-squared value was .831080, or about 83% (Figure 3). Now that
these four variables tell 83% of the story, a spatial autocorrelation tool needs to be run to see
whether the model's over- and under-predictions show a random spatial pattern. This step is
important because a clustered pattern means that there is bias in the model, while
a random pattern means little or no bias. Now I have to check
whether I have a properly specified model. To do so, I go back to the OLS results table and
look at several figures. First I have to see whether each coefficient is positive or negative.
This is important
because a positive coefficient for population means that as population grows, 911 calls will also
grow. A negative coefficient means that as population grows, 911 calls will go down. Since
population has a positive coefficient, which is the relationship I would expect, that is a good
sign (Figure 5). Next I have to look at
the VIF (variance inflation factor) of each variable. The VIF shows whether variables are
redundant, that is, telling the same story. A high number (over 7.5; smaller is better) means
redundancy that can bias the model. Since all the values are around 1.1 to 1.7, I am good there
(Figure 6). Next I had to check
whether all the explanatory variables have statistically significant coefficients. An
asterisk next to a value shows that it is statistically significant (Figure 7). One figure
that must not be statistically significant is the Jarque-Bera test, so it should not have an
asterisk (Figure 8). A significant Jarque-Bera result would mean the residuals are not normally
distributed. Now I have to check the model performance by looking at the R-squared
value (between 0 and 1; closer to 1 is better) and the AIC (Akaike's Information
Criterion) value (the lower the better). The R-squared value was .83, while the AIC value went
down to 680 from 788 (Figure 9 and Figure 2). Running the spatial
autocorrelation tool is the final step, to see whether the model is free from spatial clustering
of over- and under-predictions. With the model passing all of these checks, it is now known to
have very little bias, and the variables that I picked account for a large portion of the
variation in the data. Now that I have
figured out which variables are important, I can apply those same variables to see what will
happen in the future.
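The VIF check described above can be sketched as follows. This is a minimal illustration, not the OLS tool's own code: it assumes the common definition VIF = 1 / (1 - R-squared), where R-squared comes from regressing one explanatory variable on all the others. The helper names and the tract values are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols_r_squared(columns, y):
    """R-squared of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(design[i][p] * y[i] for i in range(n)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def variance_inflation_factors(variables):
    """VIF per variable: regress it on the other variables, then 1 / (1 - R2)."""
    vifs = {}
    for name, values in variables.items():
        others = [v for other, v in variables.items() if other != name]
        vifs[name] = 1.0 / (1.0 - ols_r_squared(others, values))
    return vifs


# Hypothetical tract values (made up; population and jobs in thousands).
tracts = {
    "population": [1.2, 2.5, 3.1, 4.0, 5.2, 6.1, 2.2, 4.8],
    "jobs": [0.9, 0.4, 2.0, 1.1, 0.7, 2.6, 1.8, 0.5],
    "low_education": [0.30, 0.12, 0.25, 0.08, 0.22, 0.15, 0.05, 0.28],
}

vifs = variance_inflation_factors(tracts)
for name, value in vifs.items():
    flag = "redundant" if value > 7.5 else "ok"  # 7.5 is the tutorial's threshold
    print(f"{name}: VIF={value:.2f} ({flag})")
```

A VIF of exactly 1 means a variable shares nothing with the others; two variables that nearly duplicate each other push the value far above 7.5.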
To see where the most calls will come from in the future, the geographically weighted
regression (GWR) tool is used. The GWR tool is meant to yield optimal results by minimizing
bias and maximizing model fit. Running this tool gave me an output file that shows AIC and
R-squared values. Comparing it with the OLS output file, I noticed that the R-squared value has gone
up by 3 percentage points (83% to 86%) and the AIC has gone down 6 points (from 680 to 674) (Figure
10). Both of these are good signs. Finally, I fed the output of this tool into the GWR
Prediction model to see changes for the future. Because the GWR results were
good, this model can now show me the predicted number of calls in the future (Figure 11).
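The idea behind GWR, one local regression per location with nearby observations weighted more heavily, can be sketched as below. This is only an illustration under stated assumptions: a single explanatory variable, a Gaussian distance kernel, and a fixed bandwidth, none of which is confirmed as what the GWR tool uses internally. The coordinates, values, and function names are hypothetical.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Weight that decays smoothly with distance from the target location."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def local_fit(target_xy, coords, x, y, bandwidth):
    """Weighted least-squares line y = intercept + slope * x around one location."""
    w = [gaussian_weight(math.dist(target_xy, c), bandwidth) for c in coords]
    w_sum = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def gwr_predict(coords, x, y, bandwidth, x_future):
    """Fit a local model at each location and apply it to that location's future x."""
    predictions = []
    for location, xf in zip(coords, x_future):
        intercept, slope = local_fit(location, coords, x, y, bandwidth)
        predictions.append(intercept + slope * xf)
    return predictions


# Hypothetical tracts: a western cluster and an eastern cluster (made-up data).
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
population_now = [1, 2, 3, 1, 2, 3]     # thousands of residents
calls_now = [1, 2, 3, 3, 6, 9]          # calls rise faster with population in the east
population_future = [2, 3, 4, 2, 3, 4]  # assumed future population

_, west_slope = local_fit(coords[0], coords, population_now, calls_now, 1.0)
_, east_slope = local_fit(coords[5], coords, population_now, calls_now, 1.0)
_, global_slope = local_fit(coords[0], coords, population_now, calls_now, 1e6)
print(f"west slope={west_slope:.2f}, east slope={east_slope:.2f}, "
      f"global slope={global_slope:.2f}")

predicted = gwr_predict(coords, population_now, calls_now, 1.0, population_future)
print([round(p, 2) for p in predicted])
```

With a small bandwidth, the two clusters get different slopes, which is the local variation a single global OLS line cannot capture. With a very large bandwidth, every weight is close to 1 and the fit collapses back to the global OLS line.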

Figures

Figure 1. This is a map of the hot spot analysis of 911 calls in the Portland Metropolitan Area. The
green plus signs represent response stations. The blue areas represent a low volume of calls, while the
red spots represent a high volume of calls.

Figure 2. This is the output file of the OLS results using only population as a variable. As the
circle points out, the R-squared value is only 39%. That means population accounts for only
39% of the story of the data.

Figure 3. This is the output file of the OLS results using the variables population, jobs, low education,
and distance to urban centers. As the circle shows, the R-squared value is now 83%. This is
a much more acceptable figure because those four variables now account for 83% of the story of the
data.

Figure 4. This image maps the difference between using population alone and using
population, jobs, low education, and distance to urban centers. The R-squared values are shown underneath
to show how well each model fits. The colors represent over-prediction (red) and
under-prediction (blue). These colors should be in a random pattern so that there isn't bias in the
data.

Figure 5. This is the output file of the OLS results using the variables population, jobs, low education,
and distance to urban centers. The figures in the box show the coefficients of those four variables.
The value for population is positive, meaning that as population goes up, 911 calls will also go
up, which is good. That means the model passed the first check of its validity.

Figure 6. This is the output file of the OLS results using the variables population, jobs, low education,
and distance to urban centers. The figures in the box show the VIF (variance inflation factor). A
number below 7.5 is good, and all four variables fall well below that mark. The model passes the
second test.

Figure 7. This is the output file of the OLS results using the variables population, jobs, low education,
and distance to urban centers. This image checks for statistically significant figures. An
asterisk next to a figure is a good thing. The model passes the third test.

Figure 8. This figure shows the one exception to the asterisk rule. The Jarque-Bera statistic
should not have an asterisk next to it. The model passes the fourth test.

Figure 9. This is the output file of the OLS results using the variables population, jobs, low education,
and distance to urban centers. It shows the R-squared value and the AIC value. Compared
to Figure 2, which had only population as a variable, the R-squared value has gone up to 83% from
39%. The AIC value has gone from 788 to 680. A higher R-squared value is good, and a lower AIC is good.
The model passes the fifth test.

Figure 10. This is the output file from the GWR tool. It shows that the AIC has gone down from 680 in
the OLS results to 674. It also shows that the R-squared value went from 83% in the OLS results to 89%.
This shows that a geographically weighted regression gives better results and that the four
variables I chose fit the model well.

Figure 11. This image shows the output file from the GWR tool. The current prediction image applies
the model to current data. The second image shows the prediction of where future 911 calls will come
from, based on future census data.

Applications
I cannot see right now how linear and geographically weighted regression can be
applied to my project for next quarter. I can see applications for it in other studies, though.
With demographics as the underlying factor, many aspects of society can be analyzed. For
instance, a study could look at the incarceration rates for a certain race and try to find
the variables that drive those rates. They might be related to location, education, age, crime
rates, or many other factors. I could do a study of the incarceration rates of African Americans
in Louisiana. I could follow this same tutorial with different variables and use the OLS
results to see how well the model works. Using a GWR analysis, I could show which census tracts have
the most incarcerated African American residents and what factors might have
led to their incarceration. I could also map which census tracts will see an increase or
decrease in incarcerated people, based on future predictions of income, education, crime
rates, or other factors. Regression analysis could also be applied to
homelessness numbers, to see whether race, education, age, or another factor explains why and where
homeless people are found now and where they will be found in the future. Using gender, age,
education, and
location, I could see how well the model works and tweak it to get a better R-squared value
and a lower AIC value. Then I could use a GWR analysis to show where homeless
people can be found now and where they can be found in the future, based on future predictions
of median income, job growth, and population increase.