
Population Research Institute

GWR WORKSHOP

University Park, PA – 1st - 6th June 2008

GEOGRAPHICALLY
WEIGHTED
REGRESSION
WORKBOOK

Martin Charlton

A Stewart Fotheringham

Chris Brunsdon

National Centre for Geocomputation


National University of Ireland Maynooth
Maynooth
Co. Kildare
IRELAND

The first two authors acknowledge generous funding from Science Foundation Ireland which helped
create the National Centre for Geocomputation.
© The contents of this book are
the copyright of the authors and
may not be reproduced or used
without their permission. This
extends to both the software and
the data files distributed with the
workbook.

Contents

1 Introduction
1.1 Introduction
1.2 The Basic Operation of GWR and its Linkage to GIS
1.3 Program and Data Locations
1.3.1 The Program
1.3.2 Data for the Labs
1.3.3 Other Programs

2 A Primer on Running GWR 3
2.1 Introduction
2.2 Model Specification
2.3 Data Organisation
2.4 Parameter Estimate Files
2.5 The Model Editor
2.6 Printed Outputs

3 Lab 0: Visualising the Output from GWR with ArcMap
3.1 Importing a Coverage from an Interchange File
3.2 Visualizing your Coverage
3.3 Visualizing Variation with Shaded Polygons
3.4 Spatial Join
3.5 Checklist
Endnote: Assigning Projections

4 Lab 1: GWR with Educational Attainment Data in Georgia
4.1 Introduction
4.2 The Modelling Process
4.3 Some Initial Choices
4.4 Choices in the Model Editor
4.5 Saving and Running the Model
4.6 Examining the Outputs
4.7 Mapping the Results
4.7.1 Getting started
4.7.2 Adding Boundaries
4.7.3 Choropleth Mapping
4.8 Finishing

5 Lab 2: GWR and House Price Determinants
5.1 Introduction
5.2 The Data
5.3 The Exercise
5.4 Modelling House Price Variation
5.5 Exploration, then Modelling
5.6 Model 1 – Price~Floorspace
5.7 Model 2 – Price~Floorspace+Type
5.8 Model 3 – Price~Floorspace+Age
5.9 Model 4 – Price~Floorspace+Type+Age
5.10 Points or Surfaces

6 Lab 3: GW Poisson Regression with Tokyo Mortality Data
6.1 Introduction
6.2 Mortality in Tokyo
6.3 Setting Up the Model
6.4 Examine the Output
6.5 Mapping the Results
6.6 Tasks

7 Lab 4: Logistic GWR with Landslides in Clearwater, Idaho
7.1 Introduction
7.2 Landslide Hazard: Clearwater National Forest
7.3 Examining the Data
7.4 Setting up the Model
7.5 Examine the Output
7.6 Mapping the Results
7.7 Further Tasks

8 Finally

Appendix: Geographically Weighted Descriptive Statistics
8.1 Introduction
8.2 Local Statistics
8.3 Educational Attainment
8.4 GIS Manipulation
8.5 Tasks
1 Introduction

1.1 Introduction
This series of linked labs is designed to introduce you to applying
Geographically Weighted Regression to problems with your spatial data. They
assume a little familiarity with ArcGIS and none with the GWR software.

There are 5 labs as follows:

Lab 0: The Use of ArcGIS
Lab 1: GWR and Educational Attainment Levels in Georgia
Lab 2: GWR and House Price Determinants in Tyne & Wear
Lab 3: Poisson GWR and Mortality Data in Tokyo
Lab 4: Logistic GWR and Landslide Hazard in Clearwater National Forest, Idaho

The workshops will be led by Stewart Fotheringham and Martin Charlton.

Before explaining each lab in detail (see subsequent chapters of this workbook),
we first describe some basic operational issues with the software package used
in each lab – GWR 3.0. Some of this material is covered in the lectures.

1.2 The Basic Operation of GWR 3.0 and its Linkage to GIS

The following diagram summarises the basic operation of GWR 3.0 and
how its outputs are linked to a GIS.

The user supplies a data file, plus ideas on what form of model to calibrate, to
the user-friendly GWR Model Editor, which is completed in a series of ‘Windows-
style’ menus and tick boxes. Unseen to the user, this creates a control file for a
large FORTRAN program which produces two types of output. A Listing File is
written to the screen and an Output File is saved in the user’s workspace. This
latter file contains location-specific parameter estimates and other diagnostics
which can be read into a GIS (along with other spatially referenced data) for
mapping.

1.3 Program and Data Locations

1.3.1 The Program


To run the GWR software you will need to click on the GWR icon, which you will
usually find on the Start/Programs menu or on the desktop; the program itself is
Gwr30.exe.

1.3.2 Data for the Labs


There are three sets of data that will be used in the labs. These are contained in
three subfolders named:

Georgia
Housing
Tokyo

within the SampleData folder. Each subfolder has a number of files containing
data on variables used in the GWR software and data used for mapping the
results. The various files to be used are listed in each of the lab descriptions in
subsequent sections.

1.3.3 Other Programs


You may need some other software as well; you will need to make a note of how
to start these here:

1. ArcGIS: this will probably be Start/Programs/ArcGis/ArcMap;
however, check this and write down the correct path below:

________________________________________________________________________

2. ArcToolBox: this will probably be Start/Programs/ArcGis/ArcToolBox;
however, check this and write down the correct path below:

________________________________________________________________________

3. My Computer/Explorer: we will need this in order to change some
filenames. There’s usually an icon at the top left of the desktop or it might
be on the Start menu. Write down the correct path below:

________________________________________________________________________

4. The software and the SampleData folder you will need will usually be in the
C:\GWR3 folder, but may be somewhere else. Note here where this folder is:

________________________________________________________________________

5. The Work folder, which you will need, is usually c:\GWR3\Work but it may
be somewhere else – if it is, note its location here.

________________________________________________________________________

6. Note how to access Excel in case you want to manipulate some data files:

_______________________________________________________________________

7. Note how to access SPSS, R, or some other statistical software with which
you are familiar:

_____________________________________________________________________

8. We may make some announcements prior to the workshop, so note these
below:

2 A Primer on Running GWR 3

What you will learn:
1 How to set up, run and interpret a GWR model
2 How to specify a GWR model and understand the workflow
3 How your data file should be organised
4 The content of the parameter estimate (output) file
5 Using the Model Editor to specify the model and associated options
6 Interpreting the Listing File

2.1 Introduction

This section shows how to set up and run a GWR model using the Visual Basic
GWR Model Editor. Much of this has already been covered in the lecture material
so feel free to skip it if you want. There are several different varieties of
regression model that can be run – here we will assume that you wish to run a
geographically weighted regression with a Gaussian error term. This is the
geographically weighted equivalent of an ordinary least squares regression, such
as you might find in SPSS, and is probably the most frequently encountered
application of GWR.
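To make the mechanics concrete, here is a minimal sketch, in Python with NumPy,
of the local estimation that any GWR of this kind performs at a single
regression point: an ordinary least squares fit in which each observation is
weighted by its proximity to that point. This is an illustration of the
technique, not the GWR 3.0 code (which is FORTRAN), and all names are invented
for the example.

import numpy as np

def gwr_local_fit(X, y, weights):
    # X: (n, k) design matrix with a leading column of ones for the intercept
    # y: (n,) response vector
    # weights: (n,) geographic weights for this regression point
    # Returns beta_hat(u) = (X'WX)^-1 X'Wy, the local parameter estimates.
    XtW = X.T * weights          # equivalent to X.T @ diag(weights)
    return np.linalg.solve(XtW @ X, XtW @ y)

Repeating this fit at every regression point, with the weights recomputed from
that point's distances, yields the location-specific estimates that the Output
File stores.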

The GWR software is located in a folder on your machine or on a network. The


folder is usually named GWR3. There are two binaries in this folder; the GWR
Model Editor is GWR30.exe. A link to GWR30.exe can be created on the desktop or
in a toolbar, which will create a GWR icon; selecting this icon runs the
program. There is also a subfolder named SampleData which contains some test
data for the software. We assume that you will place your own data, and the
results from any analyses on those data, in a different folder.

The main GWR program window is shown on the right; it has four items in the
menu bar: ‘File’, ‘Analysis’, ‘Tools’ and ‘Help’. The program assumes that the
user will wish to proceed with one of five initial options, and provides a
‘Wizard’ for guidance through the processes.

2.2 Model Specification


The general outline of specifying a GWR model is shown below. The actual
program that computes the GWR is a FORTRAN program, and the software you
are using is a front end to help you through the following steps:

1. Select a task
2. Select a data file
3. Decide where to estimate the parameters
4. Specify the name of the parameter estimate file
5. Use the Model Editor to:
5.1 Title the run
5.2 Specify the dependent variable
5.3 Specify the independent variable(s)
5.4 Specify the data point location variables
5.5 Specify the weighting scheme
5.6 Specify the calibration method
5.7 Specify the type of parameter estimate file
5.8 Save the model control file
5.9 Run the model
6. Examine the diagnostics

Following this you import the parameter estimate file into a mapping package so
that you can examine any spatial variation in parameter estimates.

2.3 Data Organisation
The data file for GWR is an ASCII file which will normally have the filetype of .dat
or .csv. The assumptions in the software are as follows:

1. The first line of the data file is a comma separated list of the names of the
variables in the remainder of the file
2. The variable names should not contain any spaces
3. The variable names should be no more than 8 characters in length
4. The variable names should be formed from upper and lower case
alphabetic characters and the numbers 0 … 9 inclusive
5. The only other character which is allowed is the underscore (_)
6. The remaining lines in the file contain the data
7. There are as many lines as there are observations (“data points”)
8. Each line contains the same number of attributes as there are variables
9. Attributes are separated by commas
10. All attributes are numeric
11. At least one of the attributes will be a dependent variable
12. There are two variables which specify the location of each data point

As an example, here are the first few lines of the data file for the Georgia
educational attainment data to be used later in the labs:

ID,Latitude,Longitud,TotPop90,PctRural,PctBach,PctEld,PctFB,PctPov,PctBlack
13001,31.753389,-82.285580,15744,75.6,8.2,11.43,0.635,19.9,20.76
13003,31.294857,-82.874736,6213,100.0,6.4,11.77,1.577,26.0,26.86
13005,31.556775,-82.451152,9566,61.7,6.6,11.11,0.272,24.1,15.42
13007,31.330837,-84.454013,3615,100.0,9.4,13.17,0.111,24.8,51.67
13009,33.071932,-83.250851,39530,42.7,13.3,8.64,1.432,17.5,42.39
13011,34.352696,-83.500539,10308,100.0,6.4,11.37,0.340,15.1,3.49
13013,33.993471,-83.711811,29721,64.6,9.2,10.63,0.922,14.7,11.44
13015,34.238402,-84.839182,55911,75.2,9.0,9.66,0.816,10.7,9.21
13017,31.759395,-83.219755,16245,47.0,7.6,12.81,0.332,22.0,31.33

If you have been using ArcMap to integrate your data for an analysis, you can
export a .dbf file as a .txt file; this can be renamed in Explorer. When
ArcGIS does this it places quotes around the variable names. These are not,
however, stripped off by the FORTRAN program, so the files will need further
editing – a sketch of one way to do this follows. You can also create .csv
files in Excel (save your data in comma-separated variable form), Notepad, and
other applications capable of writing ASCII files.
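As a sketch of that further editing, the following Python fragment strips the
quotes ArcGIS places around variable names and checks the names against the
rules above. The file names are hypothetical; adapt them to your own workspace.

import re

with open("gdata_export.txt") as src:
    lines = src.read().splitlines()

# strip any quotes from the header line and validate each variable name
header = [name.strip().strip('"') for name in lines[0].split(",")]
for name in header:
    # rules above: at most 8 characters; letters, digits and underscore only
    assert re.fullmatch(r"[A-Za-z0-9_]{1,8}", name), "bad variable name: " + name

with open("gdata.csv", "w") as dst:
    dst.write(",".join(header) + "\n")
    dst.write("\n".join(lines[1:]) + "\n")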

2.4 Parameter Estimate Files


The output from GWR can be voluminous. At every regression point there will
be a set of parameter estimates, a set of associated standard errors, and some
diagnostic statistics. For this reason we have decided to make these outputs
available as a file which can then be post-processed.

The outputs are:

PARM_1 … PARM_n   Values of the estimates of the parameters at each
                  regression point. n is one more than the number of
                  independent variables, with PARM_1 containing the
                  values of the intercept term.
SVAL_1 … SVAL_n   Values of the estimates of the standard errors of the
                  parameters at each regression point. The numbering of
                  these variables is as for the parameter estimate
                  variables.
TVAL_1 … TVAL_n   Pseudo-t values
OBS               Observed y variable value
PRED              Predicted y variable value
RESID             Unstandardised residual
HAT               Leverage value
STDRES            Standardised residual
COOKSD            Cook’s Distance
LOCRSQ            Pseudo-R2 values

Three types of output format are available:

1. ArcINFO uncompressed export format. This may be imported into Arc/INFO to
create a point coverage (where the coordinates of each point are those of the
regression points). On a PC the coverage can be created using ArcToolBox. The
filetype is .e00.

2. Comma-separated-variable format. This may be imported into Excel or SPSS
for further processing. The names of the variables are included at the head of
the file. Small numbers are not dealt with very elegantly and may be converted
to scientific notation – ArcToolBox has trouble with these conversions. You
should note that some numbers may be printed using FORTRAN-style scientific
notation – the exponent may be written as D+04 to represent x10^4. You will
need to change these to E+04, otherwise Excel will treat the values as text
(see the sketch after this list).

3. MapInfo Interchange Format. A .mif/.mid pair of files is created. These can
be imported into MapInfo. The files are ASCII files and can be hand edited to
remove any anomalies. This is somewhat experimental at the moment.
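The exponent fix mentioned under format 2 can be automated. This is a small
sketch (file names hypothetical) that rewrites FORTRAN-style D exponents into
the E form Excel understands, e.g. 1.234D+04 becomes 1.234E+04.

import re

with open("georgiaout.csv") as src:
    text = src.read()

# replace a D exponent only where it follows a digit, so that variable
# names containing the letter D are left untouched
text = re.sub(r"(?<=\d)D([+-]\d+)", r"E\1", text)

with open("georgiaout_fixed.csv", "w") as dst:
    dst.write(text)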

2.5 The Model Editor


The first step is to create a new
model to use with your data. If
there is an existing model control
file, then this can be run or the
model editor can be invoked to
change the variables or some other
control parameters. Geographically
weighted descriptive statistics may
also be requested. At this point the
user has the option of clicking on
‘Go’ to proceed with the new model, ‘Cancel’ to close the Wizard, or ‘Help’ to
obtain some assistance on what to do next.

In this example, we have checked ‘Create a new model’ and clicked on ‘Go’. We
next need to determine what type of GWR model we wish to fit. In many cases
this will be a Gaussian model. Select Gaussian and then ‘Go’.

Before a new model can be created, a data file must be selected from the data
folder (see section 2.3 for details of the data file structure). The model
editor will extract the names of the variables from the first line of the data
file that is selected. We will base this description of the use of the Model
Editor around the data concerning educational attainment in the counties of
the state of Georgia, USA. These data have been described briefly in the
previous presentation and further information is given in section 4 (Lab 1).

The form shown on the right will now appear – it is a standard Windows-style
‘File Open’ form. There are only two data files in the data folder, one for
the example we are using and one which is supplied with the software. Click on
the relevant data file name to highlight it and click on ‘Open’ to proceed.

GWR estimates may be produced at locations other
than those at which data are
sampled. Locations where
observations are recorded are
referred to as data points (or as sample points) and the locations at which the
estimates are produced as regression points. In most instances, the regression
points and the data points will be the same. However, there is an option in
GWR3.x to produce estimates of local parameters at locations other than those
at which data are recorded, for example at the mesh points of a regular grid.
The prompt shown above allows the user to make this decision. In this instance,
we click on ‘Yes’. Clicking ‘No’ brings up another form to allow the user to select
a separate file of regression point locations. Note that using this second option
means the automatic bandwidth selection and a range of diagnostic statistics
will not be available.

Next, the name of the file into which the results will be written
must be specified. This file can
be in one of several formats
(comma-separated variable,
ArcInfo uncompressed export,
and MapInfo Interchange). The
user also needs to specify the
appropriate filetype - .e00 for an
ArcInfo export file, .csv for a comma-separated variable file, and .mif for a
MapInfo Interchange File. You will need to navigate to the appropriate folder for
the output file. Note that you cannot proceed without specifying a filename here.

The Data Preview window allows you to check that you have loaded the correct file – it lists
the variable names which it has found in the
first line of the file and gives you the location of
the file.

As well as a check on the names of the
variables, GWR also prints the names of
the files which you selected thus far. If
you have made a mistake, you have the
option of correcting this before you
continue. (Note: the various folder names
we use here may be different from the
ones you will use!). As we have decided
to fit the model at the data points, the
calibration location filename is blank.

The Model Editor Window appears next and is shown on the left. It
allows a GWR model to be created,
saved and run. The Title box
allows the user to input a title which
will then appear in the output
listing. The list of Variables is read
automatically from the comma-
separated list on the first line of the
data file that has been specified.
From this, a Dependent Variable
and one or more Independent
Variable(s) are selected by
highlighting the variable name and
moving it with the appropriate
arrow key. Next, two variables
representing the coordinates of the data points, the Location Variables, need to
be assigned, and an optional Weight Variable can be selected. Note that this
weight variable is not a geographical weight but simply allows data points to be
weighted by some attribute reflecting different levels of uncertainty about the
measurements taken across the data points. In most cases, this will be left
empty. In the special case of Poisson regression, this variable will be used as an
offset variable.

Once the variables have been selected, which essentially defines the model, the
Kernel Type is chosen for the GWR. The choices are either ‘Fixed’ (Gaussian) or
‘Adaptive’ (bi-square); both weighting functions are sketched below. The kernel
bandwidth is determined by either cross-validation (CV) or AICc minimisation.
Alternatively, an a priori value for
the bandwidth can be entered by clicking on the Bandwidth option and entering
the bandwidth in the window. If you are using a Fixed kernel, the bandwidth
needs to be specified in terms of the distance units used in your model. If you
are using an Adaptive kernel, the bandwidth is specified as the number of data
points in the local sample used to estimate the parameters. If you specify too
small a bandwidth, you may get unpredictable results, or the program may be
unable to estimate the model. With a very large data set, bandwidth selection
can be made using a sample of data points in order to save time. This is
achieved by clicking on Sample (%) and entering the desired percentage of the
data used for the bandwidth selection procedure. The default is that the
procedure will use All data.
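For reference, the two weighting functions have the following commonly used
forms in the GWR literature; this Python sketch is illustrative rather than a
transcript of the program's internals. Here d is an array of distances from
the regression point to every data point.

import numpy as np

def fixed_gaussian(d, bandwidth):
    # fixed kernel: weights decay smoothly with distance,
    # with the same bandwidth everywhere
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def adaptive_bisquare(d, n_neighbours):
    # adaptive kernel: the bandwidth is the distance to the n-th nearest
    # data point, and weights fall to zero beyond it
    b = np.sort(d)[n_neighbours - 1]
    return np.where(d < b, (1 - (d / b) ** 2) ** 2, 0.0)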

If your coordinates are in some projected coordinate system (UTM, for example)
then the Coordinate Type should be Cartesian. If your measurements are in
Lat/Lon, then select Spherical. If you have Lat/Lon coordinates, but your study
area is in a relatively low latitude, then you can use Cartesian as the type. With
Spherical coordinates, the distance computations in the geographical weighting
use Great Circle distances.
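If you want to reproduce such distances yourself, the haversine formula is one
standard way to compute a Great Circle distance from coordinates in decimal
degrees; whether GWR 3.0 uses exactly this formula internally is not stated
here, so treat the sketch as illustrative.

import numpy as np

def great_circle_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    # haversine formula; inputs in decimal degrees, result in kilometres
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))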

The Model Options include specifying the type of output required and the type
of significance test to be employed on the local parameter estimates. Apart
from the default output listings (described later), the user has the option of
outputting List Bandwidth Selection, List Predictions and List Pointwise
Diagnostics. Examples of these are shown below. The significance testing
options are: Monte Carlo, Leung, or None (see above). Finally, the format of the
output file needs to be specified: this should be compatible with the previous
selection of an output filetype (see above). (Note: although the Leung test
appears in the model editor, it is not really supported as it is very
cumbersome, and we do not recommend its use.)

A completed example of the GWR Editor is shown on the left. The dependent variable is
the proportion of the county population with
education to degree level. Suppose we are
interested to see how this is related to total
population within each county, the percentage
rural, the percentage elderly, the percentage
foreign born, the percentage below the
poverty line and the percentage black. We would also like to see if there are any
geographical variations in the relationships between educational attainment and
these variables.

The sample point location variables are Longitud (x) and Latitude (y). There is
no aspatial weight variable. We have chosen an adaptive kernel and the
bandwidth will be chosen by AICc minimisation using all the data. A Monte Carlo
significance testing procedure has also been selected for the local parameter
estimates. Printing of a range of diagnostics has been requested and the output
will be written to an ArcInfo export file. Some of the output will, by default, also
be written to the screen in a listing file.

Before the model can be run, it must be saved. Clicking on Save Model will
open the standard window shown on
the left which depicts the contents of
the model folder where the model
control files are stored. Type the name
of the file in the Filename box or click
on an existing filename and then click
on Save.

Once the model has been saved, it can be run. Simply click on the Run button
in the Model Editor window and this
brings up the form shown on the left.
A name must be specified for the
Model Listing File (.txt). This file will
be placed in the listing folder. To
specify a filename click on the …
button to the right of the filename box. Once this is done, click on the Run
button. The model control file is now passed to the GWR program and the
program is invoked and run in a DOS window as shown below. 1

1 You may need to make a small alteration in your Windows setup so that the DOS box closes on program
termination.

With small data sets and simple models, the program runs very quickly. For
instance, calibrating a bivariate GWR model using the 159 counties of Georgia
on a Pentium III PC took less time than it has taken to type this sentence.
However, the time requirements increase rapidly as both model complexity and
the number of data points increases. One solution to very slow run times is to
use the option in the Model Editor which allows the user to supply a percentage
of the data points on which to base the bandwidth selection procedure.

When the run has completed, the DOS window closes, and you are asked whether
you wish to examine the listing file:

2.6 Printed Outputs


Once the program has run, the user
is asked if the output listing is to be
viewed. This listing appears in a
separate window; an example of this
for the Georgia educational
attainment model is shown on the
left. The user can scroll down the
file to view other sections. The
listing file is a text file with the
filetype of .txt so that it can also be opened in MS Word or Notepad for viewing
or printing.

Following a description of the model that has been calibrated, the first section of
the output from GWR3 contains the parameter estimates and their standard
errors from a global model fitted to the data. This is shown below.

**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 1816.210715
Effective number of parameters.. 7.000000
Sigma........................... 3.456697
Akaike Information Criterion.... 855.443391
Coefficient of Determination.... 0.645830

Parameter Estimate Std Err T
--------- ------------ ------------ ------------
Intercept 14.779297592328 1.705507562188 8.665630340576
TotPop90 0.000023567534 0.000004746089 4.965675354004
PctRural -0.043878182061 0.013715372112 -3.199197292328
PctEld -0.061925096691 0.121460075458 -0.509839117527
PctFB 1.255536084016 0.309690422174 4.054164886475
PctPov -0.155421764065 0.070388091758 -2.208069086075
PctBlack 0.021917908085 0.025251694359 0.867977738380

There are two parts to the output from the global model. In the first panel,
some useful diagnostic information is printed which includes the residual sum of
squares (eᵀe), the number of parameters in the global model, the standard error
of the estimate (σ), the Akaike Information Criterion (corrected version) and the
coefficient of determination. In the second panel the matrix contains one line of
information for each variable in the model. The columns are:

(a) the name of the variable whose parameter is being estimated
(b) the estimate of the parameter
(c) the standard error of the parameter estimate and
(d) the t statistic for the hypothesis that the parameter is zero.

These global results suggest that educational attainment is positively related to
total population and percentage foreign born and is negatively related to
percentage rural and percentage below the poverty line. Educational attainment
does not appear to be related to the remaining two variables, percentage elderly
and percentage black. The model replicates the data reasonably well (65% of the
variance in educational attainment is explained by the model) but there are
clearly some factors that are not captured adequately by the global model.

From this point, the output listing contains the results of the GWR. The first
section is an optional calibration report which lists the calculated value of the
criterion statistic at various bandwidths, as shown below. The utility of printing
this section is to observe the speed of convergence and also to plot the results
to see the shape of the convergence function. If the calibration report is not
requested, the program will print only the optimal value of the bandwidth.

Dependent mean= 10.9471693
Number of observations, nobs= 159
Number of predictors, nvar= 6
Observation Easting extent: 4.41947222
Observation Northing extent: 4.20193577
*Finding bandwidth...
... using all regression points
This can take some time...
*Calibration will be based on 159 cases
*Adaptive kernel sample size limits: 10 159
*AICc minimisation begins...

Bandwidth AICc
56.043532255000 952.763365832809
84.500000000000 894.827422579517
112.956467745000 872.102336481384
130.543532046749 862.364688964195
141.412935569545 859.863227740004
148.130596397659 857.532739228028
152.282339122725 856.699997311380
154.848257244551 855.820209809022
** Convergence after 8 function calls
** Convergence: Local Sample Size= 155

The next section of the output presents diagnostics for the GWR estimation.
There are two panels in this section. The first panel provides some general
information on the model: it includes (a) a count of the number of data points or
observations (b) the number of predictor variables (this is the number of
columns in the design matrix) (c) the bandwidth for the type of kernel specified
(here it is the number of nearest neighbours to be included in the bisquare
kernel) and (d) the number of regression points. The second panel contains
similar information to the corresponding panel for the global model. This
includes (a) the residual sum of squares (b) the effective number of parameters,
(c) the standard error of the estimate, (d) the Akaike Information Criterion
(corrected) and (e) the coefficient of determination. The latter is constructed
from a comparison of the predicted values from different models at each
regression point and the observed values. The coefficient has increased from
0.646 to 0.706, although an increase is to be expected given the difference in
degrees of freedom. However, the reduction in the AIC from the global model
suggests that the local model is better even accounting for differences in
degrees of freedom.

**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 159
Number of independent variables... 7
(Intercept is variable 1)
Number of nearest neighbours...... 155
Number of locations to fit model.. 159

Diagnostic information...
Residual sum of squares......... 1506.219121
Effective number of parameters.. 12.814342
Sigma........................... 3.209901
Akaike Information Criterion.... 839.193981
Coefficient of Determination.... 0.706280

Casewise diagnostics can also be requested (as shown below for the first 10
observations in the Georgia data set). These include:

1. the observation sequence number
2. the observed data
3. the predicted data
4. the residual
5. the standardised residual
6. the local pseudo r-square
7. the influence and
8. Cook’s D.

Whilst in general it might be helpful to look at a printout of these
statistics, it is probably a little more useful to be able to map them: with a
large data set you
run the risk of being swamped in output. All of these statistics are saved
automatically in the output results file so that requesting them in the listing file
should be done judiciously. This panel is not available when the regression
points are different from the data points.

**********************************************************
* CASEWISE DIAGNOSTICS *
**********************************************************

Obs Observed Predicted Residual Std Resid R-Square Influence Cook's D


----- -------------- -------------- -------------- ----------- ----------- ----------- -----------
1 8.20000 9.26692 -1.06692 -0.258875 0.819218 0.021879 0.000117
2 6.40000 7.33714 -0.93714 -0.232802 0.820589 0.066868 0.000303
3 6.60000 8.70596 -2.10596 -0.525272 0.819776 0.074367 0.001730
4 9.40000 8.11559 1.28441 0.319607 0.840207 0.069997 0.000600
5 13.30000 13.58140 -0.28140 -0.070091 0.839357 0.071855 0.000030
6 6.40000 8.79625 -2.39625 -0.591102 0.844322 0.053656 0.001546
7 9.20000 11.61571 -2.41571 -0.587443 0.846859 0.026203 0.000725
8 9.00000 11.61646 -2.61646 -0.636924 0.852840 0.028236 0.000920
9 7.60000 10.26846 -2.66846 -0.654270 0.826147 0.042107 0.001468
10 7.50000 9.48755 -1.98755 -0.489605 0.822446 0.051028 0.001006

Another optional set of information that can be printed to the screen concerns
the predicted values (as shown below for the first 10 observations in the Georgia
data set). If this option is selected, the following data are printed to the screen:

1. Obs the sequence number of the observation
2. Y(i) the observed value
3. Yhat(i) the predicted value
4. Res(i) the residual
5. X(i) the x-coordinate of the regression point
6. Y(i) the y-coordinate of the regression point and
7. an indicator of whether the matrix inverse was computed using either
the Gauss-Jordan method (F) or a generalised inverse (T). The latter is
only used if there is severe multicollinearity in the design matrix

This set of output is not available when the regression points are different from
the sample points.

Predictions from this model...

Obs Y(i) Yhat(i) Res(i) X(i) Y(i)
1 8.200 9.267 -1.067 -82.286 31.753 F
2 6.400 7.337 -0.937 -82.875 31.295 F
3 6.600 8.706 -2.106 -82.451 31.557 F
4 9.400 8.116 1.284 -84.454 31.331 F
5 13.300 13.581 -0.281 -83.251 33.072 F
6 6.400 8.796 -2.396 -83.501 34.353 F
7 9.200 11.616 -2.416 -83.712 33.993 F
8 9.000 11.616 -2.616 -84.839 34.238 F
9 7.600 10.268 -2.668 -83.220 31.759 F
10 7.500 9.488 -1.988 -83.232 31.274 F

Next in the output listing is a panel of results of an ANOVA in which the global
model is compared with the GWR model. The ANOVA tests the null hypothesis
that the GWR model represents no improvement over a global model. The results
are shown below where it can be seen that the F test suggests that the GWR
model is a significant improvement on the global model for the Georgia data.

**********************************************************
* ANOVA *
**********************************************************
Source SS DF MS F
OLS Residuals 1816.2 7.00
GWR Improvement 310.0 5.81 53.3150
GWR Residuals 1506.2 146.19 10.3035 5.1745
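The panel can be reproduced from the two residual sums of squares and the
effective numbers of parameters. A sketch using the printed Georgia values
(small discrepancies against the panel are rounding in the displayed figures):

rss_ols, enp_ols = 1816.2, 7.00     # global model
rss_gwr, enp_gwr = 1506.2, 12.81    # GWR model (effective number of parameters)
n = 159

ss_improve = rss_ols - rss_gwr         # 310.0
df_improve = enp_gwr - enp_ols         # 5.81
df_resid = n - enp_gwr                 # 146.19
ms_improve = ss_improve / df_improve   # about 53.3
ms_resid = rss_gwr / df_resid          # about 10.3
f_stat = ms_improve / ms_resid         # about 5.17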

The main output from GWR is a set of local parameter estimates for each
relationship. Because of the volume of output these local parameter estimates
and their local standard errors generate, they are not printed in the listing file
but are automatically saved to the output file. However, as a convenient
indication of the extent of the variability in the local parameter estimates, a 5-
number summary of the local parameter estimates is printed. For the Georgia
data, this is shown below. The 5-number summary of a distribution presents
the median, upper and lower quartiles, and the minimum and maximum values
of the data. This is helpful to get a ‘feel’ for the degree of spatial non-
stationarity in a relationship by comparing the range of the local parameter
estimates with a confidence interval around the global estimate of the equivalent
parameter.

Recall that 50% of the local parameter values will be between the upper and
lower quartiles and that approximately 68% of values in a normal distribution
will be within ±1 standard deviation of the mean. This gives us a reasonable,
although very informal, means of comparison. We can compare the range of
values of the local estimates between the lower and upper quartiles with the
range of values within ±1 standard deviation of the respective global estimate
(which is simply 2 x S.E. of each global estimate). Given that 68% of the
values would be expected to lie within this latter interval, compared to 50%
in the interquartile range, if the range of local estimates between the
quartiles is greater than 2 standard errors of the global estimate, this
suggests the relationship might be non-stationary.

**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
Intrcept 12.620986 13.754251 15.823232 16.312238 16.489399
TotPop90 0.000014 0.000018 0.000022 0.000025 0.000028
PctRural -0.060218 -0.051780 -0.039342 -0.031651 -0.025801
PctEld -0.255508 -0.203092 -0.164197 -0.129393 -0.058400
PctFB 0.504876 0.825190 1.432738 2.003490 2.417666
PctPov -0.204510 -0.164793 -0.110038 -0.056264 -0.004242
PctBlack -0.036187 -0.013582 0.006294 0.031046 0.076566

As an example, consider the parameter estimates for the two variables PctEld
(percentage elderly) and PctFB (percentage foreign born) in the Georgia study.

The global results provide the following information:

Variable    Global Standard Error    2 x Global Standard Error

PctEld      0.121                    0.242
PctFB       0.310                    0.620

while the 5-number summary yields:

Variable    Lower quartile    Upper quartile    Range

PctEld      -0.203            -0.129            0.074
PctFB       0.825             2.003             1.178

For PctEld the interquartile range of the local estimates is much less than 2 x
S.E. of the global estimate suggesting a stationary relationship.

For PctFB the interquartile range of the local estimates is much greater than 2 x
S.E. of the global estimate suggesting a non-stationary relationship.
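The same comparison as arithmetic, should you wish to script it for all the
parameters (the numbers below are copied from the tables above):

comparisons = {
    # variable: (lower quartile, upper quartile, global standard error)
    "PctEld": (-0.203, -0.129, 0.121),
    "PctFB": (0.825, 2.003, 0.310),
}
for name, (lq, uq, se) in comparisons.items():
    iqr = uq - lq
    verdict = "non-stationary?" if iqr > 2 * se else "stationary?"
    print(name, "IQR =", round(iqr, 3), "2 x S.E. =", round(2 * se, 3), verdict)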

Finally, we can examine the significance of the spatial variability in the local
parameter estimates more formally by conducting a Monte Carlo test. The
results of a Monte Carlo test on the local estimates indicate that there is
significant spatial variation in the local parameter estimates for the variables
PctFB and PctBlack. The spatial variation in the remaining variables is not
significant and in each case there is a reasonably high probability that the
variation occurred by chance. This is useful information because now in terms
of mapping the local estimates, we can concentrate on the two variables, PctFB
and PctBlack, for which the local estimates exhibit significant spatial non-
stationarity. It is interesting to note that these results reinforce the conclusions
reached above with the informal examination of local parameter variation for the
variables PctEld and PctFB.

*************************************************
* *
* Test for spatial variability of parameters *
* *
*************************************************

Tests based on the Monte Carlo significance test


procedure due to Hope [1968,JRSB,30(3),582-598]

Parameter P-value
---------- ------------------
Intercept 0.22000
TotPop90 0.09000
PctRural 0.17000
PctEld 0.68000
PctFB 0.00000
PctPov 0.50000
PctBlack 0.00000
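In outline, Hope's procedure randomly reallocates the observations among the
locations, refits the model, and asks how unusual the observed spatial
variability of each local estimate is against the shuffled replicates. A
schematic sketch follows, in which fit_gwr is a hypothetical function returning
the matrix of local estimates (one row per regression point, one column per
parameter); it is an illustration of the idea, not the program's own test.

import numpy as np

def monte_carlo_pvalues(X, y, coords, fit_gwr, n_sims=99, seed=0):
    rng = np.random.default_rng(seed)
    observed = fit_gwr(X, y, coords).var(axis=0)   # variability per parameter
    exceed = np.zeros_like(observed)
    for _ in range(n_sims):
        # shuffle the locations among the data points and refit
        shuffled = coords[rng.permutation(len(coords))]
        exceed += fit_gwr(X, y, shuffled).var(axis=0) >= observed
    return (exceed + 1.0) / (n_sims + 1)   # Hope (1968)-style p-values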

3 Lab 0: Visualising the Output from GWR with ArcMap
What you will learn
1 Importing an interchange file into ArcGIS
2 Options for visualising a point coverage
3 Adding a shapefile into an ArcGIS project
4 How to carry out a spatial join
5 Symbolising a polygon shapefile
6 [Optional: Assigning map projection information]

The main output from GWR is a set of localised parameter estimates and
associated diagnostics. Unlike the single global values traditionally obtained in
modelling, these local values lend themselves to being mapped. Indeed, in large
data sets, mapping, or some other form of visualisation, is the only way to make
sense of the large volume of output that will be generated. We now describe
ways of visualising the output from GWR. Although we concentrate only on
displays of the local parameter estimates, in many instances it might be
instructive to plot other local statistics such as the influence and Cook’s D
statistics. Similarly, it might be useful to plot the local r-square statistic or the
local standard deviation. No matter which local statistic is mapped, however,
there is a choice of map types that can be employed. We now describe some of
these briefly after first discussing mapping the results in a commonly used, PC-
based, Geographic Information System (GIS), ArcMap.

3.1 Importing a Coverage from an Interchange File


We assume that the user has available some software for
visualising the results. Most commonly, this will be some
mapping package, or preferably, a GIS in which both the
results and the data can be manipulated. Saving the
output file as either an uncompressed ArcInfo export file or a
MapInfo interchange file means the output can be viewed
relatively easily within a GIS. For instance, if we convert
the .e00 file to a coverage, it can be viewed in ArcMap.
The conversion is carried out using the ArcToolBox program.

First, start ArcToolBox and in the Conversion Tools kit, select Import to
Coverage. Then select ArcView Import from Interchange file. This brings up
another dialog box which must
be completed. The Input file is
the ArcInfo Export File (also
known as an Interchange file).
Normally the file and path you
specify are those which have already been specified in the GWR program. The
output dataset (or coverage) is probably best located in the same folder.
However, in this lab the input file should be Georgia.e00 which you will find in
the SampleData\Georgia folder; the output file should be located in your Work
folder and named Georgia. When you have specified these, you click on [OK].
Wait a few moments until the software has finished the conversion – an
hourglass will appear while conversion is taking place. Close the ArcToolBox
application.

3.2 Visualizing your Coverage


Now you can start the ArcMap application and
examine your data. The ArcMap window is as
on the right (start with A new empty map). The
converted coverage is now known as a data
layer. We shall add another data layer shortly.
Click on the Add Data icon (it’s the black cross
on a yellow diamond icon in the menu bar).

Navigate to your Work folder and click on the name of the coverage you have
just created. In the case of this
example, the coverage name is Georgia.
You will get an error message which
you can ignore for the time being.

The points which you can see represent the locations of the regression points –
in this case, they are the centroids of the counties of Georgia. To visualise the
spatial variation in the Intercept term (this is called PARM_1 in the coverage):

1. Right click on the georgia point entry in the Table of Contents
2. Select Properties from the list (it’s at the bottom)
3. Click on the Symbology tab
4. Select Quantities/Graduated Symbols from the Show: box
5. Select PARM_1 from the Fields/Value dropdown list

The completed dialog should be as above. If you click on [OK] you will be
rewarded with a display of graduated
circles. The circle size is related to the
value of the intercept term. The
resulting pattern suggests that in this
case, there is a broad regional pattern
with higher values of the intercept in
southern Georgia, and lower values
elsewhere.

You can use the Identify tool (6th down in the left hand column of tools in the
Toolbar) to click on one of the circles to bring up the values of all the attributes
that you see.

3.3 Visualizing Variation with Shaded Polygons


This is all very well, but it might be desirable to attach the attributes to some
polygons. How can this be achieved?

Click on the Add Data icon once more, and navigate to the SampleData\Georgia
folder. You will find a data layer called
G_utm.shp. Select this and click OK.
ArcMap places this on the end of the
Table of Contents, and assigns some
default shading. If you use the
Identify tool, you can find what attributes

the data has. Here’s a typical entry on the right. Unfortunately, the AREAKEY
item is not present in the Georgia point coverage. We need to use a “spatial join”
to match the attributes of the points with the attributes of the polygons.

3.4 Spatial Join


Click on G_utm in the Table of Contents, and then right click. Select Joins
and Relates and then Joins… from the list of options. Complete the dialog that
appears as below.

1. You are joining data from another layer based on spatial location
2. The layer from which you wish to copy attributes to the polygon layer is
the one you name in box 1
3. In 2, you are joining Points to Polygons; make sure the second option is
checked
4. In box 3, navigate to your Work folder and name the output file
georgiaj.shp

Click on [OK]

The join then takes place, and the shapefile (files of type .shp are called
shapefiles) is added to the table of contents.

Use the following steps to create shading for the polygons. Here is the display
for PARM_1, which illustrates the same spatial variation as in the previous
example but in a slightly different form.

1 Right click on georgiaj
2 Click on the Symbology tab
3 Select Quantities/Graduated Colors
4 Select PARM_1 from Field/Value List
5 Click on OK

3.5 Checklist

It might be worthwhile looking back over the decisions we have had to make in
carrying this out.

1. Check whether any of the data layers you need has projection
information. If so, make a note of it and see the notes below on
assigning a projection.
2. Convert the parameter estimate interchange file to a coverage using
ArcToolBox
3. In ArcMap add the parameter estimate coverage layer
4. Add in any other layers you might need
5. Carry out any spatial joins you need to do
6. Visualize the parameter estimate variation, either as point data with
graduated symbols, or as polygon data with graduated colors.

Endnote: Assigning Projections

When mapping the results from your own data sets, depending on the source of
your boundary files, you may find ArcMap automatically assumes the data to be
projected. Checking its properties, for example, you may find something like…

Data Type: Shapefile Feature Class
Shapefile: C:\GWR3\SampleData\georgia
Geometry Type: Polygon
Coordinate System: GCS_Assumed_Geographic_1
Datum: D_North_American_1927
Prime Meridian: 0

This may cause a problem in that the output from the GWR program will not
have such a projection assigned and the two data sets will therefore not be
compatible. If such a problem occurs, you will need to assign the same
projection to your GWR output coverage as ArcMap has assigned to your
boundary data. This can be done as follows:

You will need to close the ArcMap application and assign a projection to the
parameter estimate coverage. Suppose the projection you wish to assign is that
of geographic NAD 1927. The projection conversion is carried out in
ArcToolBox. You may need to do this every time you Import an interchange file.

1. Close the ArcMap application
2. Start the ArcToolBox application
3. Select Data Management Tools/Projections
4. Select Define Projection Wizard (coverages, grids, TINs)
5. Check define the coordinate system interactively: click [Next]
6. Navigate to your work folder and select Georgia: click [Next]
7. Select Geographic as the dataset projection: click [Next]
8. Select DD as the Units parameter: click [Next]
9. Select NAD 1927 (US-NADCON) as the Datum: click [Next]
10. Check the settings and click [Finish]
11. Exit the ArcToolBox application

You will have noticed that the Wizard also allows you to copy the information
about the projection from another coverage, so if you create a ‘master’ coverage
and assign the projection information, then you can use this as the source for a
copy.

4 Lab 1: GWR with Educational Attainment Data in Georgia

4.1 Introduction
The context of this particular modelling exercise has been outlined earlier. You
now have a chance to use GWR and some associated software to explore the
data and to model the relationships yourself.

We’ll use the Georgia data on educational attainment by county for our first
foray into GWR. The idea is to predict the level of educational attainment from
some social attributes of the counties in the State of Georgia and then to map
the variation in the local parameter estimates and some diagnostics.

4.2 The Modelling Process


We envisage that GWR will be used in some larger data exploration and
modelling exercise. Typically the steps either side of GWR may include the
following:

1. Prepare the data – may involve Excel, SPSS, SAS, or a GIS program.
2. Model relationships in GWR: examine printed diagnostics
3. Save the parameter estimates in a suitable format
4. Import the parameter estimates into a GIS program
5. Display the parameter variation – further analysis
6. Display the diagnostic variation – further analysis

It should be stressed that these are not the only routes. However these
workshops are based loosely around this approach.

4.3 Some Initial Choices


You need to make some initial choices

1. Run GWR by clicking on the GWR icon
2. Check the Create a new model option and click on Go
3. Check Gaussian as the model type and click Go

4. Change the file type to .csv then in the SampleData\Georgia folder find
gdata_utm.csv, select it and click on Open
5. In the Analysis Point Selection form click on Yes
6. Change the file type to ArcInfo Export File then navigate to your Work
folder and enter GeorgiaOut.e00 in the Output File form.
7. After the Data Preview and file confirmation, the model editor will appear.

4.4 Choices in the Model Editor


The model editor is central to your analysis. It will have read the first line of the
data file and extracted the names of the variables. These appear in the
Variables box. Completing the rest of the options appears somewhat complex
but will become familiar with practice.

1. Enter something like Georgia Educational Attainment in the title area.
You can put anything you like here: it’s printed on the output.
2. From the list of variables click on PctBach followed by the [->] symbol
beside the Dependent Variable box
3. From the list of variables click on TotPop90 then the [->] symbol to
the left of the Independent Variables box.
4. Do this for the variables PctRural, PctEld, PctFB, PctPov, PctBlack. If
you enter a variable in error, highlight it in the Independent
Variables list and click on the [<-] symbol.
5. Next we specify the Location Variables. Enter X as the x variable
and Y as the y variable.
6. Change the Kernel Type to Adaptive
7. In Model Options Check the List Bandwidth Selection option…
8. …and the List Predictions option
9. …and the List Pointwise Diagnostics option
10. Check Monte Carlo in the Significance test option: click Continue
11. Check AIC in the Bandwidth Selection Method option (there should
be nothing in the bandwidth box)
12. Check All Data in the method box
13. Check the Arc/INFO Exp (*.e00) option in the Output Format list

You have now completely specified a GWR model.

4.5 Saving and Running the Model
The control file which is created by the Model Editor has to be saved to the Work
folder before you can run it.

1. Click on the Save Model option at the bottom of the Model Editor
2. Enter Georgia.gwr as the name of the control file in your Work folder
and click on Save
3. Click on Run Model
4. In the Run the Model form enter Georgia.txt as the Model Listing
File name (again in your Work folder)
5. Click on the Run button

A DOS window will appear – it should take less than a minute for the program to
run with this dataset. The window title bar will tell you when the program has
finished and the command Exit appears in this window. The DOS window will
then disappear. WAIT until this has happened and then:

1. Click on Yes on the Run Completed form to view the Listing file
2. Click on End on the Run the Model form

4.6 Examining the Outputs


Although you might think it would be illuminating to start plotting the
parameter estimates, it’s worth taking some time to examine the output.

1. Are you using the correct files in the correct folders?
2. Is the list of variables correct?
3. Are the dependent, location, and independent variables correctly
specified?
4. Have you checked the correct options?

Some initial values are reported: (you will need to scroll down to see these)

Dependent mean= 10.9471693
Number of observations, nobs= 159
Number of predictors, nvar= 6
Observation Easting extent: 423741.688
Observation Northing extent: 471492

5. Next check the bandwidth selection. The current values of the bandwidth
and the associated AIC are printed on the output (you can cut these out
and put them in Excel if you wish to create a plot of the minimisation
function). These functions can be quite messy at times.
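As an alternative to Excel, a few lines of Python will plot the convergence
function; the values below are transcribed (rounded) from the listing in
section 2.6 and are merely an example.

import matplotlib.pyplot as plt

bandwidth = [56.0, 84.5, 113.0, 130.5, 141.4, 148.1, 152.3, 154.8]
aicc = [952.8, 894.8, 872.1, 862.4, 859.9, 857.5, 856.7, 855.8]

plt.plot(bandwidth, aicc, marker="o")
plt.xlabel("Bandwidth (number of nearest neighbours)")
plt.ylabel("AICc")
plt.show()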

The program has converged at 155 nearest neighbours. As there are only 159
counties, the GWR results may be fairly close to the global results – but
remember that in the GWR model the data are weighted by geographic location.

6. Have a look at the global model parameters and diagnostic statistics. The
AIC for the global model is 855.44 and the global coefficient of
determination is 0.65. It looks as if the global model is a reasonable one,
although 35% of the variation in our dependent variable is from sources
other than the ones in our model.

7. Check the global parameter values: there is a positive association with
Population, Foreign Born and Black, and a negative association with Rural,
Elderly and Poverty. However, the t statistics for Elderly and Black are
small enough (|t| < ~1.96) for us to regard these variables as having no
effect in the model.

8. Check the GWR parameter estimates and diagnostics. The AIC for this
model is 840.07. This is less than that for the global model and this
suggests that the GWR model is “better” at modelling the data. The
coefficient of determination is a little higher at 0.72.

9. You have requested the complete listing of the pointwise diagnostics and
predictions. We’ll skip these and move down to the ANOVA. The
computed value of 5.01 is in excess of the critical value for F with 7.0 and
146.2 degrees of freedom so we reject the null hypothesis that the GWR
represents no improvement over the global model. This conclusion is in
line with the AIC results above.

10. Finally check the five number summaries.
(a) Are the medians different from the global values?
(b) Are the signs similar?
(c) Do the signs of the coefficients change?
(d) How large is the interquartile range?
(e) Compare the interquartile ranges with the 2 x S.E. of
the global estimates (you’ll have to do this by hand)

11 If you have requested a significance test, which parameters exhibit
significant spatial variation?

12 If you want to close GWR at this point you can, but you need not do
so.

4.7 Mapping the Results


At this point we turn to ArcMap.

4.7.1 Getting started


The Arc/INFO .e00 file which you saved earlier has to be converted into an
Arc/INFO coverage2. Find the ArcToolBox utility3 which is part of the ArcMap
system and run it following the procedures we described in Lab 0. The input file
is georgiaout.e00 in your Work folder and the output file is gparms also in
your Work folder.
We’ll store the imported coverages in your Work folder for convenience.
Alternatively, however, you may place them in any convenient folder.

When you’re ready click on OK.

The above will create an Arc/INFO point coverage (called gparms in this case).

Start the ArcMap program and click on the Add Data icon.

Navigate to whatever folder you have saved your parameter coverage and
highlight the name of the file and click on Add.

The name of the theme (gparms point in this case) appears in the Table of
Contents. Right-click its name and select Open Attribute Table from the list of
options. The theme table contains the values of:

the pointwise parameter estimates (PARM_1…)
the pointwise standard errors (SVAL_1…)
the pointwise pseudo-t values (TVAL_1…)
the observed y value OBS
the predicted y value PRED

2 We will eventually support shapefiles but this will be later rather than sooner.
3 We’ll put the location on an overhead.

the residual RESID
the standardised residual STDRES
the trace of the hat matrix HAT
Cook’s D COOKSD

There are 7 sets of data for the PARM, SVAL and TVAL items, numbered thus:

1. Intercept
2. Totpop90
3. PctRural
4. PctEld
5. PctFB
6. PctPov
7. PctBlack

PARM_1 contains the values of the Intercept term and SVAL_1 contains the
values of the corresponding standard errors.

To show the variation in the intercept term with proportional symbols located at
the centroids of the regression points:

1. Right click on gparms point again and select Properties, then
2. Select Symbology/Quantities/Graduated symbols
3. Select Value/Field: Parm_1
4. Click Apply to apply this symbolism to your data.

Experiment with different types of symbolism:

Graduated Color symbols may be easier to interpret
You can change the number of classes and the classification method in Classify
You can also change symbol colours.

4.7.2 Adding Boundaries


Whilst it is clear from the map that the value of the intercept term increases
gently from North West to South East Georgia, it will enhance your map to have
some county boundaries.

1. Click on the Add Data icon

2. Highlight g_utm.shp in the SampleData\Georgia folder and click on Add:
3. Right click on G_utm to bring up Properties/Symbology, click in the
middle of the Symbol (which will be colored as the polygons on the map)
4. In the Symbol selector change Options/Fill Color to No Color
5. Click on the various OKs to return you to your map

4.7.3 Choropleth Mapping


The data refer to the counties of Georgia rather than point locations within
them. It would be desirable to attach the parameter values to the attribute table
for the county boundaries. One way of doing this is to use a spatial join
between the gparms point coverage and the g_utm shapefile.

1. Right click on g_utm in the Table of Contents and select Joins and
Relates/Joins…
2. In the first box under the question “What do you want to do to this
layer?” Select “Join data from another layer based on spatial
location”
3. Choice 1: the layer you wish to join will be gparms point
4. Choice 2: you are joining Points to Polygon. Check the second
option here “Each polygon will be given all the attributes
of the points…”
5. Choice 3: name the output layer gparmsj.shp in your Work folder
6. Click OK
7. gparmsj is then added as a layer to the Table of Contents. Use
Properties/Symbology to assign suitable shading to display the
parameter estimates.

An interesting diagnostic is the STDRES – this is the standardised residual. Try
setting Manual class breaks, with 5 classes at –1.96, 0, 1.96, 2.58 and 3.53. The
counties of Clarke and Oconee have unusually high positive residuals; Hall and
Clayton have rather large negative residuals – why might this be? Why have we
chosen 1.96 and 2.58 as class breaks?

4.8 Finishing
We’ve now completed our lightning tour of GWR. The next labs explore in some
greater detail the operation of GWR and its relationship with GIS as well as
reinforcing what has already been learned.

5 Lab 2: GWR and House Price Determinants

5.1 Introduction
This workshop introduces you to geographically weighted model selection by
exploring price variation in the UK housing market using a set of explanatory
variables.

5.2 The Data

The data have been extracted from the anonymised records of a UK Building
Society4. They form a sample of all mortgages approved on properties sold in
the UK in 1991. There are 519 properties in the sample from a population of
about 78000. The variables of interest are as follows:

Variable   Description
--------   ----------------------------
Easting    x-coordinate of the property
Northing   y-coordinate of the property
Purprice   Purchase price in £ sterling
BldIntWr   Built between 1914 and 1939
BldPostW   Built between 1939 and 1959
Bld60s     Built between 1960 and 1970
Bld70s     Built between 1970 and 1979
Bld80s     Built between 1980 and 1990
TypDetch   Detached building
TypSemiD   Semi-detached building
TypFlat    Flat/Apartment
FlrArea    Floor area in m²

There are 5 binary variables indicating the age of the property. If a property is
recorded with 0 in all 5 fields, it is by default one that was built pre-1914.

4
We are grateful to the Nationwide Anglia Building Society for making their data available for academic
use.

TypDetch is a binary variable indicating a detached property (i.e. a stand-alone
property with no shared walls with neighbouring properties).
TypSemiD is a binary variable indicating a semi-detached property (i.e. a
property that shares one common wall with an adjoining property – a ‘duplex’ in
US terminology).
TypFlat is a binary variable indicating a flat (an apartment in US terminology).
Flats include both purpose-built flats and those converted from older
single-occupancy buildings.
If a property has 0 values recorded for all three of the above property types, it is
by default a terraced property (‘row house’ in US parlance) – i.e. a property
joined to its neighbours on both sides.
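
To make the dummy coding concrete, here is how a single categorical type
field could be recovered from the three dummies in Python (a sketch; the CSV
export and the helper function are hypothetical, but the column names follow
the table above):

import pandas as pd

df = pd.read_csv("housdata.csv")  # hypothetical export of the housing data

def property_type(row):
    # At most one of the three dummies is 1; all zeros means terraced.
    if row["TypDetch"] == 1:
        return "detached"
    if row["TypSemiD"] == 1:
        return "semi-detached"
    if row["TypFlat"] == 1:
        return "flat"
    return "terraced"  # the omitted reference category

df["type"] = df.apply(property_type, axis=1)
print(df["type"].value_counts())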

5.3 The Exercise

In this exercise we will first explore the housing data using ArcMap, and then
attempt to predict the price using GWR. The results will be loaded back into
ArcMap so that we can examine the spatial variation in the parameter estimates
and some diagnostic variables. We shall look at the use of the AIC statistic to
determine whether to include different groups of variables in the model.

5.4 Modelling House Price Variation

There are many ways of exploring the linkage between the price of a property and
its attributes, although the most commonly employed technique is known as
hedonic modelling. It might not be unreasonable to assume that the price of a
house is related in some fashion to the size of the property – bigger houses tend
to be more expensive than smaller ones. The coefficient for floorspace in a
hedonic model represents the price per unit area for property.

However the type of property may also influence its price – in a small island such
as Britain the ubiquity of semi-detached and terraced housing means that
homeowners may be prepared to pay extra for the isolation from the neighbours
that a detached property offers. The relationship between price and floor area
may well be different therefore across different types of property. We can
examine this through the use of dummy variables to represent property type.

We can also explore the relationship between house price and age. People may
well be interested in paying more for a new property – there are fewer
immediate maintenance and decoration costs. However older properties may
also be seen to be more desirable by some who might for example rather live in
a medieval cottage than a more recent building. The coefficients on the age
dummies represent the added value of owning a property built during the age
range to which the dummy refers.

People often refer to some areas as being “more expensive” than others. Some
suburbs are seen as more desirable and properties in the more sought after
suburbs may have higher prices than ones with the same attributes in less
desirable areas. To explore this with a global model all we have are the
residuals. With GWR, rather than confine the geography of the phenomenon to
the error term, we can model it directly. Also rather than impose some pre-
determined geography on the model (sometimes dummies for regions are
included) we let the model tell us where the desirable areas are.
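
As a point of reference before the GWR runs, the global hedonic model
described here can be fitted with ordinary least squares; a minimal Python
sketch using statsmodels (the data file name is an assumption; the column
names follow the table above):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("housdata.csv")  # hypothetical export of HousData

# Global hedonic model: price on floorspace plus type and age dummies.
X = sm.add_constant(df[["FlrArea", "TypDetch", "TypSemiD", "TypFlat",
                        "BldIntWr", "BldPostW", "Bld60s", "Bld70s", "Bld80s"]])
model = sm.OLS(df["Purprice"], X).fit()
print(model.summary())  # the coefficients are the implicit attribute prices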

5.5 Exploration, then Modelling

We’ll begin by loading the data into ArcMap and looking at some of the variation
within England and Wales. You might find it helpful to have the Regional
Development Agency map as an additional layer to help you get your bearings.
The following steps explain how to do this.

1. Start ArcMap
2. Click the Add Data icon, and add regions.shp, HousData.txt and
Places.txt from the SampleData/Housing folder
3. For a brief overview of Britain’s regions, right click on the layer name
(regions) and select Label features. You might wish to use the Zoom
tool from the Toolbar (it’s a magnifying glass with a + sign on it) to zoom
in a little to exclude Scotland, since we will not be concerned with the
housing market in Scotland.
4. The region names probably won’t mean too much, so we’ll add the names
of some well known towns and cities. Right click on Places.txt in the
Table of Contents and select Display X-Y Data…; the X Field and Y Field
choices should have been chosen as X and Y. Click OK.
5. Remove the region names (right click on regions in the Table of
Contents and uncheck Label Features).
6. Right click on Places.txt Events in the Table of Contents, and select
Label Features; right click on this layer again and select Zoom to
Layer. Spend a few seconds examining the names. You can use the
Identify tool from the ToolBar to check the grid references – they are in
metres (which city is further north, Carlisle or Newcastle?).
7. Uncheck the Places.txt Events layer in the Table of Contents
8. Right click on HousData.txt in the Table of Contents and select Display
X-Y Data. The X Field should be Easting and the Y Field should be
Northing, so change these from those that ArcMap has selected. Click on
OK.
9. You can examine the variation in housing cost by right clicking on
HousData.txt Events in the Table of Contents, selecting Properties
and then clicking on the Symbology tab.
10. Click on Quantities/Graduated Symbols
11. The Value Field should be PurPrice
12. Click on Classify, and change the classification method from Natural
Breaks (Jenks) to Quantile. Click on OK, and OK.

There doesn’t seem to be a great deal of variation apparent in the displayed
map. However, here is a regional summary of prices, sorted by descending price:

Region Count Ave_Purprice StdDev_Purprice
London 80 83943.9375 39376.9694
South East 147 77435.7143 44835.3742
East 50 65257.8000 37198.2244
South West 52 64928.8462 41602.1426
Yorkshire 24 56852.0833 33520.0540
North West 33 56209.0909 20818.8633
North East 14 55617.8571 44601.8480
West Midlands 46 53582.6087 22381.5976
East Midlands 51 52384.1176 25647.3060
Wales 15 49430.0000 25525.7279

Housing in London and the South East is notably more expensive than elsewhere
in Britain; indeed, there is a clear divide between northern and southern
England in terms of average price. A regional dummy variable would be a fairly
crude method of dealing with this spatial variation.

Let’s examine the determinants of house prices more formally.

5.6 Model 1 – Price ~ Floorspace

We’ll start with a very simple model: predicting price from floorspace.

1. Start GWR3
2. From GWR Wizard select Create a New Model, then [Go]
3. Select Gaussian as the Model Type, then [Go]
4. Navigate to SampleData/Housing and select HousData.dat as the data file
(with a type of *.dat) and Yes when asked whether you wish to fit the
model at the data points.
5. Navigate to your Work folder and choose housing1.e00 as the parameter
output filename; click on Save
6. Check the Data Preview and File Confirmation choices to make sure that
you have the correct input and output data.
7. Choose the following options in the model editor:
(1) Title Price ~ Area model
(2) dependent variable: PurPrice
(3) independent variable: FlrArea
(4) X-variable: Easting
(5) Y-variable: Northing
(6) Kernel type: Adaptive
(7) All three listing options from Model Options
(8) Bandwidth selection: AICc
(9) Output format: Arc/INFO Export
(10) Save your model as housing1.gwr (in Work)
(11) Click Run Model
(12) Save the listing file as housing1.txt
(13) Click Run
Examine the listing – the calibration has converged with 29 properties as the
local sample size. You can examine the relationship by plotting the AICc value
against the bandwidth from the listing file, as shown below.

[Figure: AICc plotted against the bandwidth (local sample size), with the
minimum at 29 properties.]
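
One way to see the shape of this function is to plot the bandwidth/AICc pairs
in Python (a sketch; it assumes you have copied the two columns from the
listing into a plain text file):

import numpy as np
import matplotlib.pyplot as plt

# Two whitespace-separated columns: bandwidth, AICc (from the listing).
bw, aicc = np.loadtxt("bandwidth_aicc.txt", unpack=True)

order = np.argsort(bw)
plt.plot(bw[order], aicc[order], marker="o")
plt.xlabel("Bandwidth (local sample size)")
plt.ylabel("AICc")
plt.show()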

Below are the results for the global model:

**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 454113189550.444030
Effective number of parameters.. 2.000000
Sigma........................... 29637.173765
Akaike Information Criterion.... 12164.963478
Coefficient of Determination.... 0.416307

Parameter Estimate Std Err T
--------- ------------ ------------ ------------
Intercept 1069.526702543610 3688.514945216667 0.289961338043
FlrArea 656.494375576436 34.187763561637 19.202611923218

The AIC is 12,164 and the model accounts for about 42% of the variation in the
Price variable. Nationally the price per square metre of floorspace is just over
£650. The parameter estimate for the floorspace variable is significantly
different from zero but the intercept term is not. It would seem that non-
existent houses are free for the taking!

The equivalent results for the local model are:

**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 2
(Intercept is variable 1)
Number of nearest neighbours...... 29
Number of locations to fit model.. 519

Diagnostic information...
Residual sum of squares......... 164532408987.248870
Effective number of parameters.. 92.663317
Sigma........................... 19644.879856
Akaike Information Criterion.... 11861.124348
Coefficient of Determination.... 0.788519

Notice that the local AIC is much lower and the local coefficient of determination
is much higher. We can be more confident that we have a ‘better’ model with
GWR. Note that because the effective number of parameters is different for the
global and local models, we can’t compare the r² values directly.

The parameter 5-number summary is also interesting:

**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
Intrcept -64327.508064 -6189.706002 4492.787561 15304.148108 99446.460090
FlrArea -19.135883 447.434134 583.102579 730.316064 1540.075262

For 50% of the properties in the analysis the floorspace parameter varies from
£447.43 to £730.31. The average size of property is 101 m². This implies a
price variation from £45,190 to £73,761 depending on where the property was
located.
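
A quick check of this arithmetic, using the quartile values from the listing
above:

# Implied floorspace contribution to price at the quartiles of the
# local FlrArea estimate, for an average-sized (101 m2) property.
avg_floor_area = 101.0
lower_q, upper_q = 447.434134, 730.316064  # from the 5-number summary

print(avg_floor_area * lower_q)  # ~45,190
print(avg_floor_area * upper_q)  # ~73,761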

Notice that for the FlrArea parameter, the inter-quartile range of the local
estimates is 282.9 whereas 2 x SE of the global estimate is only 68.4, so there
appears to be a great deal of spatial variation in the local estimates. We can
explore this, and other features of the results, by mapping the output data as
follows:

1. Use ArcToolBox to convert the housing1.e00 file in your Work folder into
an Arc/INFO coverage. Call the resulting coverage housing1 and leave it
in the Work folder.
2. Add the coverage as a layer in the Table of Contents.
3. Change the fill on the regions layer to ‘No Fill’ and uncheck the other
layers except housing1.
4. Examine the standardised residuals (STDRES) – there appears to be very
little geographical pattern within them. The model has captured much of
the geographical variation through its weighting scheme although there
are a handful of outliers all west of London.
5. Now plot the PARM_2 coefficients. These are for the floorspace variable.
Under Classify change the Classification method to Standard Deviation.
6. You can choose a Color Ramp that looks pleasing to the eye – green to
red, or red to blue are good choices. If you right click on one of the
symbols in the Symbology dialog, you can Flip the color ramp round if
necessary (try it and see).

Some patterns immediately stand out. Some of the strongest relationships are
in the areas to the west of London as far as the Bristol Channel and,
interestingly, in Northern England. Some of this, of course, may be an artefact
of the particular sample chosen.

5.7 Model 2 – Price ~ Floorspace + Property Type

We will now examine the effect of adding a group of dummy variables
representing the type of property. The dummy variable for detached properties
takes the value 1 for a detached property and 0 otherwise. In a global
regression its effect is to create a separate intercept for detached houses
compared to other properties.

If you look back to the above table describing the data for this lab you will
observe that we have included dummy variables in the dataset for detached
properties, semi-detached properties and flats/apartments. Most of the other
types of property in the dataset are terraced (row houses). One of the more
famous terraces in Britain is the Royal Crescent in Bath, an elegant Georgian
terrace of very desirable property. At the other end of the market are the large
areas of low-cost terraced housing built by factory owners adjacent to their
factories in the mid to late 19th century. Such housing is usually cramped and
poorly constructed. We might therefore expect to see some spatial variation in
the GWR parameter for this type of property. This would appear in the intercept
term.

[Photographs: Royal Crescent, Bath (Mary Ann Sullivan); 19thC workers’
housing, Newcastle (BBC)]

The steps for this analysis are as follows:

1. Close all the windows in the GWR main window and select Tools/Analysis
Wizard/Open an existing model in the GWR Model Editor.
2. Navigate to your Work folder and select housing1.gwr and click OK
3. Add the variables TypDetch, TypSemiD and TypFlat into the list of
independent variables
4. Select Save the Model and overwrite the existing control file. The
existing interchange file will be reused.
5. Select Run Model, name the Listing File housing2.txt and proceed as
before

Here are some of the diagnostics for the global model:

**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 391372117830.287660
Effective number of parameters.. 5.000000
Sigma........................... 27593.918782
Akaike Information Criterion.... 12093.912039
Coefficient of Determination.... 0.496951

Parameter Estimate Std Err T
--------- ------------ ------------ ------------
Intercept 14739.342677934595 4038.893840041817 3.649351358414
TypDetch 15135.424430767123 4692.434978262239 3.225494861603
TypSemiD -10496.662116166888 4030.874273729064 -2.604065895081
TypFlat -14828.893957866609 3894.269647828749 -3.807875394821
FlrArea 569.641342931521 36.800583930017 15.479138374329

The AIC is lower than the AIC for the global ‘floorspace-only’ model so there
appears to have been some benefit in adding the extra variables. All the
parameter estimates would appear to be significant. Whilst the coefficient on
floorspace has declined, there are adjustments for property type to be made. We
add 14,739 to 569.64 x floor area to obtain the regression line for terraced
housing, add 14,739 + 15,135 to obtain the regression line for detached
housing, add 14,739 - 10,496 for semi-detached housing, and 14,739 - 14,828
for flats/apartments. Thus compared with terraced housing, there is a premium
on average of just over £15,000 on a detached house.
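
To make these adjustments concrete, here is the global prediction for a
100 m² property of each type, computed from the estimates above (a worked
sketch; values are rounded to the nearest pound):

intercept, flr_rate = 14739.34, 569.64
type_effect = {"terraced": 0.0, "detached": 15135.42,
               "semi-detached": -10496.66, "flat": -14828.89}

area = 100.0
for ptype, effect in type_effect.items():
    price = intercept + effect + flr_rate * area
    print(f"{ptype:14s} £{price:,.0f}")
# terraced £71,703; detached £86,839; semi-detached £61,207; flat £56,874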

The local diagnostics are:

**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 5
(Intercept is variable 1)
Number of nearest neighbours...... 97
Number of locations to fit model.. 519

Diagnostic information...
Residual sum of squares......... 158230094116.817840
Effective number of parameters.. 71.094406
Sigma........................... 18795.388194
Akaike Information Criterion.... 11779.561886
Coefficient of Determination.... 0.796620

With a lower AIC we appear to be justified in using the GWR model. Perhaps
with the image of the terraced properties above in mind we might map the
intercept parameter.

6. In ArcToolBox, import the housing1.e00 file as housing2 and in ArcMap
display the intercept parameter. There appears to be a strong regional
pattern dominated by high values in London and areas to the immediate
west and low values in the south-west. Terraced housing is less well
represented in rural areas which perhaps accounts for the negative values
of these parameters in such areas. With 5 variables you’ll need to check
the numbering of the PARM_ variables – the numbering follows the order
of the variables as they are entered into the model. I’ve entered the type
variables before the floorspace variable; thus in this example the
parameters will be named as follows in the .dbf file:
(a) Intercept Parm_1
(b) TypDetch Parm_2

(c) TypSemiD Parm_3
(d) TypFlat Parm_4
(e) FlrArea Parm_5
7. You can alter the classification in the Legend Editor from the default
(“natural breaks”) to some other, such as quantile. Choosing 4 quantile
classes then gives us class breaks based around the 5-number summary
with the breaks being at lowest value:lower quartile:median:upper
quartile:largest value. These values are reported in the legend.

5.8 Model 3 – Price ~ Floorspace + Property Age

We have two more models to create. The first will exclude property type but we
will add the property age variables. Follow steps 1-5 for the previous model, but
with the following substitutions:

3a Remove the property type variables so that you only have floorspace
included
3b Add the property age variables – the excluded category is pre-1914.
5 Name your listing file output housing3.txt

Having run the model, examine the global and local diagnostics.

The global values are below:

**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 441439313504.563230
Effective number of parameters.. 7.000000
Sigma........................... 29363.006644
Akaike Information Criterion.... 12160.508453
Coefficient of Determination.... 0.432598

Parameter Estimate Std Err T
--------- ------------ ------------ ------------
Intercept -5903.483170905958 4264.795021581204 -1.384236097336
BldIntWr 8376.051578354800 3828.795639408614 2.187646627426
BldPostW 6598.760667498580 4604.236709139186 1.433193206787
Bld60s 5394.226336688858 4354.567153213901 1.238751411438
Bld70s 8251.992808252189 4634.968743638682 1.780377268791
Bld80s 14458.818474638565 3879.960816152878 3.726537227631
FlrArea 660.031066096202 33.998587801254 19.413484573364

The AIC is 12,160. It might be reasonable to suggest that adding this group of
variables to the global model results in almost no improvement to the model.

Indeed most of the parameters are not significant. These variables are
interpreted relative to the excluded category of pre-1914; apart from inter-war
(BldIntWr) and recently built (Bld80s) properties there is no premium on age. It
also suggests that there is a non-linear price response to age: putting age into
the model as a continuous variable might not be altogether wise.

The local diagnostics are below:

**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 7
(Intercept is variable 1)
Number of nearest neighbours...... 111
Number of locations to fit model.. 519

Diagnostic information...
Residual sum of squares......... 201523445943.890960
Effective number of parameters.. 88.941047
Sigma........................... 21647.053653
Akaike Information Criterion.... 11955.358137
Coefficient of Determination.... 0.740973

The AIC is larger than the AICs for either the floorspace only or floorspace+type
models. As such, incorporating all the age groups does not yield a better
model.

5.9 Model 4 – Price ~ Floorspace + Property Type + Property Age

The final stage in this modelling exercise is to consider the inclusion of both
sets of attribute variables: age and type. The excluded category is then pre-
1914 terraced properties. These are not always low quality poorly constructed
houses – recall the contrasting examples from Bath and Benwell above.

Following steps 1-5 from Model 2, add in the rest of the variables and re-run
the model, calling your listing file housing4.txt.

The global diagnostic values are below:

**********************************************************
* GLOBAL REGRESSION PARAMETERS *
**********************************************************
Diagnostic information...
Residual sum of squares......... 387658088547.870180
Effective number of parameters.. 10.000000
Sigma........................... 27597.232591
Akaike Information Criterion.... 12099.319981
Coefficient of Determination.... 0.501725

Parameter Estimate Std Err T
--------- ------------ ------------ ------------
Intercept 10841.440705901732 4658.873780335065 2.327051877975
BldIntWr 7377.915831973120 3888.083657451756 1.897571325302
BldPostW 4448.988850693599 4615.216412426111 0.963982701302
Bld60s 1948.867733338220 4366.949551122227 0.446276664734
Bld70s 2503.678352971071 4602.211643175193 0.544016361237
Bld80s 6239.912888906404 3944.833890925199 1.581793546677
TypDetch 12702.104677614816 4962.522032414549 2.559606790543
TypSemiD -12716.365572053077 4310.968223855657 -2.949770212173
TypFlat -15038.310025890560 3911.276249199043 -3.844860076904
FlrArea 585.128599565044 37.757060098477 15.497197151184

The AIC is lower than that for the global floorspace-only model, but is higher
than the floorspace+type model’s. This suggests that we are gaining
little by adding the extra variables, and that we are being penalised for
creating an unnecessarily complex model.

The local diagnostics are:

**********************************************************
* GWR ESTIMATION *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 519
Number of independent variables... 10
(Intercept is variable 1)
Number of nearest neighbours...... 100
Number of locations to fit model.. 519

Diagnostic information...
Residual sum of squares......... 129983243217.320330
Effective number of parameters.. 131.099269
Sigma........................... 18305.575423
Akaike Information Criterion.... 11865.000583
Coefficient of Determination.... 0.832927

The AIC is higher than the AICs of both the floorspace-only and floorspace+type
models. This suggests that there is little to be gained by adding the age variables
into the model either with or without the property type variables.

5.10 Points or Surfaces

The point displays are not always easy to interpret. Another way of visualising
the output is as a surface which appears to vary smoothly across the map. We
will create a surface for the Floorspace parameter from the housing1 point
coverage (a scripted sketch of the same interpolation follows the steps below).

1. Uncheck all the layers in the Table of Contents except housing1
2. Add the coast.shp layer from the SampleData/Housing folder
3. From the menu bar select View/ToolBars/3D Analyst

4. From 3D Analyst, select Options
5. In General change the Working Directory to C:\Temp; set the Analysis
Mask to coast.shp
6. In Extent, set Analysis Extent to “Same as layer coast”
7. In Cell Size, choose “As specified below” and enter 5000 in the Cell
Size box; click OK
8. From 3D Analyst, select Interpolate to Raster/Inverse Distance
Weighted
9. Input points should be housing1 point
10. Z value field should be PARM_2
11. Search type radius should be Variable
12. Number of points should be 12
13. Output raster should be <temporary>
14. Click on OK – the surface appears on the map with suitable shading to
show variation in the intensity of the relationship
15. Check the Places.txt Events layer to bring up the place names
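
For completeness, the same inverse-distance interpolation can be sketched in
Python with numpy and scipy. Synthetic coordinates stand in for the real point
coverage, and the inverse-distance-squared weighting is an assumption (it is
the usual IDW default):

import numpy as np
from scipy.spatial import cKDTree

# Stand-ins for the regression-point coordinates and PARM_2 values.
rng = np.random.default_rng(0)
x, y = rng.uniform(0, 5e5, 519), rng.uniform(0, 5e5, 519)
z = rng.normal(600, 150, 519)

# A 5000 m grid over the bounding box, as in the 3D Analyst settings.
gx, gy = np.meshgrid(np.arange(x.min(), x.max(), 5000),
                     np.arange(y.min(), y.max(), 5000))

# For each grid node, find the 12 nearest points (the Variable search).
tree = cKDTree(np.column_stack([x, y]))
dist, idx = tree.query(np.column_stack([gx.ravel(), gy.ravel()]), k=12)

# Inverse-distance-squared weighted average of the 12 neighbours.
w = 1.0 / np.maximum(dist, 1e-9) ** 2
surface = ((w * z[idx]).sum(axis=1) / w.sum(axis=1)).reshape(gx.shape)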

The surface does reveal the patterns in a slightly different way. West of London,
there is a corridor of highly priced floorspace with peaks in Reading and Bristol –
this follows the route of the M4 motorway and the high speed rail link to Bristol
and Cardiff. Reading is both a commuter dormitory in its own right as well as
home to a number of companies in the computing sector. There is a small peak
between London and Brighton, again reflecting the importance of the rail link
which connects Brighton, Crawley, Gatwick Airport and London. The apparent
rise in floorspace prices in northern England is probably a reflection of basing
the analysis on a 0.7% sample.

6
Lab 3: GW Poisson Regression with Tokyo Mortality Data

6.1 Introduction
Poisson Regression is used when the response variable refers to counts of some
phenomenon and the covariates are either continuous or binary measurements.
Typical examples of count data might include numbers of people with a
particular disease, numbers of crimes in an area or numbers of derelict houses
in a neighbourhood. In GWR such data are modelled using a Poisson regression
model. Each observation has an integer-valued response variable, a number of
explanatory variables, and two locational variables (X and Y coordinates). If you
wish to model counts which relate to some underlying areal population, for
example, the number of children aged 0-14 with leukaemia, you should use
what is referred to in the Poisson regression literature as an ‘offset variable’. In
this case, the offset variable would be the number of 0-14 children in each
spatial unit. There will be more detail about this below. You should not attempt
to model data which are continuous, such as the educational attainment data we
have used in a previous workshop, in a Poisson regression framework – use
Gaussian regression for this. The GW Poisson regression model is set up using
the GWR Model Editor as in previous examples.

A Poisson model takes the form

$$\hat{y}_i = O_i \exp(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})$$

which, on taking logs of both sides, becomes

$$\log \hat{y}_i = \log O_i + \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}$$

The $O_i$ is known as the offset – in the second equation $\log O_i + \beta_0$ forms the
constant term.
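
For comparison, the global (non-geographically-weighted) version of such a
model can be fitted in Python with statsmodels, where the offset enters on the
log scale. This is a sketch, not part of the GWR3 workflow; the variable names
anticipate the Tokyo dataset used later in this lab.

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("Mortality.csv")

X = sm.add_constant(df[["Professl", "Elderly", "OwnHome", "Unemply"]])
model = sm.GLM(df["Mort2564"], X,
               family=sm.families.Poisson(),
               offset=np.log(df["Exp_2564"])).fit()  # offset on the log scale
print(model.summary())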

In Poisson GWR, as with Gaussian GWR, the resulting parameter estimates are
specific to each location i. Unlike Gaussian GWR, however, the model is fitted
using a technique known as iteratively reweighted least squares. This has the
following implications. First, the fitting technique is iterative and so a typical
GW Poisson regression will take approximately 5 times as long to run as an
equivalent Gaussian regression. Second, to compute the standard errors, the
observed counts are required, so parameter estimates may only be obtained at
the data points.

The output file contains the parameter estimates from the Poisson model and
their standard errors as well as the exponentials of the parameter estimates and
the standard errors of these exponentiated values. Positive parameter values
when exponentiated are greater than unity, a parameter of zero when
exponentiated yields a value of unity, and negative parameter values when
exponentiated are less than unity. In all cases, the exponentiated values are
positive.

6.2 Mortality in Tokyo

For this workshop, we will be using a dataset concerned with premature
mortality levels in 262 municipalities in Tokyo5. These data refer to areas of
different underlying populations. In order to account for this we shall use the
expected number of deaths (based on population age composition and national
death rates) in each area as an offset. There is, as in the previous lab, a set of
continuous explanatory variables – these are the proportion of the population
aged over 65, the proportion of the population who are unemployed, the
proportion of the population who own their dwelling, and the proportion of the
population in professional occupations. One might expect higher mortality
counts in areas associated with economic problems and lower mortality counts
in more prosperous areas.

Mapping
Before beginning the analysis, you should examine the spatial distribution of the
variable.

1. Start ArcMap
2. From the SampleData/Tokyo folder add the layer TMABSU.shp – these are
the boundaries of the polygons which form municipalities in the Tokyo
Metropolitan Area.

3. Check the Attribute table – there is not much attribute data here – one of
the attributes is an ID field named ID_. Each ID begins with the string
SUGIURA
4. Add the table SOC90DAT.dbf and examine its contents – the area IDs are
in a field called ID
5. Right click on the TMABSU entry in the Table of Contents and select Joins
and Relates/Join.
6. Change the activity to Join attributes from a table
7. The field name in box 1 should be ID_
8. The table name in box 2 should be SOC90DAT.dbf
9. The field name in box 3 should be ID
10. Click on OK.
11. In Properties/Symbology select Categories/Unique Values, then
select SOC90DAT.PREFNAME; click on Add All Values then OK, and you’ll
then get a map showing the Prefectures in Tokyo.
12. Examine the distributions of the following variables SOC90DAT.POP65,
SOC90DAT.OWNH, SOC90DAT.UNEMP and SOC90DAT.OCC_TEC (which
are population over 65, homeowners, unemployed, and professional
occupations, respectively). The homeowners variable shows quite clearly
the transition between the urban centre (where very few people own their
own home) and the rural periphery (where home ownership is much
higher).
13. Add the SMR90.dbf table, and join it to TMABSU (remember to examine
the SMR90 table first to identify the ID field).
14. Examine the distribution of the SMR90.SMR variable – this will be the
distribution that we will model as a function of the explanatory variables
above.

5 Premature mortality counts are defined here as the number of deaths occurring to people aged 25-65 in
each area. We are grateful to Dr Tomoki Nakaya of the Department of Geography, Ritsumeikan University,
Kita-Ku, Kyoto for this dataset.

6.3 Setting up the Model

The data are in SampleData/Tokyo/Mortality.csv. We will use Mort2564 as
the response variable, Exp_2564 as the offset, and Professl, Elderly, OwnHome
and Unemply as the predictor variables. The geography of Tokyo suggests that
it would be sensible to use an adaptive kernel rather than a fixed kernel.

1. Start GWR3
2. In the Wizard check Create a new model and click Go
3. … select Poisson for Model Type and click Go

4. In Open Data File, change the filter to Comma Separated Variable
(*.csv), and select the file Mortality.csv from the SampleData/Tokyo
folder
5. In Analysis Point Selection, click on Yes
6. The Output File should be ArcInfo Export File (*.e00); navigate to
your Work folder, and name the file tokyo.e00
7. Check that the correct variables are present in the Data Preview, and
check that the correct files are being used in Confirm – in particular,
check that the filenames for the Observed Data File and Parameter
Diagnostic File are different. The former should be Mortality.csv and
the latter should be tokyo.e00.
8. In the Model Editor enter a title, select Mort2564 as the Dependent
Variable and Professl, Elderly, OwnHome and Unemply as the predictor
variables; select X and Y as the Location Variables; select Exp_2564 as the
weight variable6; select an Adaptive kernel, with Cartesian coordinates,
and Crossvalidation as the calibration method. Check all three Model
Options (but don’t check the Monte Carlo test as it is not available yet for
this model type). Change the output format to ArcInfo Exp, and save the
model in your Work folder as tokyo.gwr.
9. Run the model and name the listing file tokyo. Click on Run and wait a
minute or two for the model to run.

6.4 Examine the Output

The output is broadly similar to that for a Gaussian GWR.

The opening panel is shown below:

***************************************************************
* *
* GEOGRAPHICALLY WEIGHTED POISSON REGRESSION *
* *
***************************************************************
Number of data cases read: 262
Sample data file read...
*Number of observations, nobs= 262
*Number of predictors, nvar= 4
Observation Easting extent: 131840.781
Observation Northing extent: 120125.898

Data has been read for 262 municipalities. The study region lies within an area
some 132 km from west to east, and some 121 km from south to north. There
will be 4 predictor variables in the model.

6 In Poisson GWR, if you specify a weight variable, the variable will be used as an offset.
The next panel is from the calibration – we have chosen to use minimising the
crossvalidation score as the criterion. Again, you can graph this function if you
wish to see how the software has found the minimum.

*Finding bandwidth...
... using all regression points
This can take some time...
*Calibration will be based on 262 cases
*Adaptive kernel sample size limits: 65 262
*Crossvalidation begins...
Bandwidth CV Score
125.876348015000 77093.504704727718
163.500000000000 78852.510762618243
102.623652260339 76090.041761883127
88.252695924830 75486.337205487056
79.370956440679 75159.569289465624
73.881739549149 75717.804317078335
82.763479033300 75179.304206706714
77.274262166597 75473.565361751971
80.666784759217 75218.352828294883
** Convergence after 9 function calls
** Convergence: Local Sample Size= 79

The optimal bandwidth is about 79 objects in the local sample. This is about
30% of the number of observations, so we will obtain a moderate degree of
smoothing with this. If you have Excel available, you can examine the shape of
the crossvalidation function. The number of function calls refers to the number
of times the different models have been fitted to obtain the optimal bandwidth.
Note the slight wobble near the minimum – this sometimes happens with
adaptive kernels – what the curve does suggest is that anywhere between 79 and
82 objects is a reasonable local sample size.

[Figure: crossvalidation score plotted against bandwidth, with the minimum
near a local sample size of 79.]

The next panel shows the results from the Global Model.

********** Global Poisson Model Diagnostics **********


Convergence after 3 iterations
Log-likelihood: -194.640790
Deviance (-2LogLikelihood): 389.281580
Trace of the Hat Matrix: 5.000000
Number of parameters in model: 5.000000
Akaike Information Criterion: 399.281580
Corrected AIC (AICc) 399.515955
Bayesian Information Criterion: 417.123303

Parameter Estimate Std Err T Exp(B) Sd(Exp(B))
--------- ------------ ------------ ------------ ------------ ------------
Intercept 0.007 0.065 0.115 1.007 0.066
Professl -2.288 0.162 -14.123 0.101 0.016
Elderly 2.199 0.198 11.093 9.019 1.788
OwnHome -0.260 0.047 -5.519 0.771 0.036
Unemply 0.064 0.011 5.822 1.066 0.012

The AIC is about 399.3 (as it should be, since it is the deviance plus twice the
number of parameters in the model). Areas of lower premature mortality would
appear to be those with higher levels of professional employees and home
owners, whereas an ageing population and economic problems would appear to
drive up mortality.

The next panel shows the output from the Local Model.

********** Local Poisson Model Diagnostics **********


Log Likelihood: -145.678378
Deviance: 291.356756
Trace of the hat matrix......... 32.639607
Residual sum of squares......... 51466.082227
Effective number of parameters.. 32.639607
Akaike Information Criterion.... 356.635970
Corrected AIC................... 366.252205
Bayesian Information Criterion.. 473.105336

First of all, note that there has been a substantial decrease in the AIC – this
suggests that the local model provides a better fit to the data than the global
model despite the increase in the effective number of parameters (32.6 in the
local model compared to 5 in the global model).

It is also worth checking the 5-number summaries for the parameter estimates.
The Elderly parameter estimates are consistently positive. At least 75% of the
values of the Professional and Home owner parameters are negative, and 75% of
the Unemployed parameter estimates are positive. In general, this is
encouraging.

**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
Intrcept -1.179779 -0.033398 0.105704 0.254133 0.591215
Professl -4.348735 -2.712938 -2.450336 -1.633525 2.501697
Elderly 0.915348 1.529775 2.022919 2.544694 5.226847
OwnHome -0.678879 -0.397862 -0.289317 -0.193223 0.242000
Unemply -0.064240 0.015048 0.033627 0.080714 0.196946

Again, we can informally check on the spatial variation of the local estimates by
comparing the inter-quartile range with 2 x S.E. of the respective global
estimates. In doing this, we get…

Parameter   2 x S.E. (Global)   Inter-Quartile Range (Local)
Intercept        0.130                0.287
Professl         0.324                1.079
Elderly          0.396                1.015
OwnHome          0.094                0.205
Unemply          0.022                0.066

which suggests there might be significant spatial non-stationarity in the local
estimates of all the parameters.
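
The comparison can be reproduced directly from the figures in the two listings
above; a minimal sketch:

# Quartiles of the local estimates (from the 5-number summary) and
# standard errors from the global model output.
quartiles = {"Professl": (-2.712938, -1.633525),
             "Elderly":  ( 1.529775,  2.544694),
             "OwnHome":  (-0.397862, -0.193223),
             "Unemply":  ( 0.015048,  0.080714)}
global_se = {"Professl": 0.162, "Elderly": 0.198,
             "OwnHome": 0.047, "Unemply": 0.011}

for name, (lq, uq) in quartiles.items():
    print(f"{name:9s} 2 x SE = {2 * global_se[name]:.3f}  IQR = {uq - lq:.3f}")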

Further diagnostic analysis should involve mapping the parameter estimates and
other diagnostic data.

6.5 Mapping the Results

The municipality boundaries and the data we have used for the GWR are in two
different projections. The boundaries are stored in ‘geographic’ coordinates,
that is, measurements of latitude and longitude. The data we have used for the
Poisson modelling have coordinates which have been transformed to a local
projection based on UTM7. For this reason we shall use a slightly different
approach to map the results, based on a lookup-table. The lookup table
contains the ID field used in the municipality map, together with their
corresponding names and the name of the prefectures in which they lie, as well
as the sequence number that is allocated by the GWR software. We carry out two
joins. The first joins the lookup table to the municipality attribute table, and the
second joins the point coverage attribute table to this. After that we can then
map the GWR parameter estimates and associated diagnostics.

1. Use ArcToolBox to Import the Interchange File to a coverage. Both should
be in your Work folder – call the coverage tokyo
2. Add the new tokyo coverage into the Table of Contents
3. Add the table GeogIndex.csv into the Table of Contents from
SampleData/Tokyo

4. Click on TMABSU, and then right-click and select Joins and
Relates/Join.
5. Select Join attributes from a table
6. Box 1: select TMABSU.ID_
7. Box 2: select GeogIndex.csv
8. Box 3: select ID and click on OK – this has linked the GeogIndex table and
the TMABSU table
9. Select Joins and Relates/Join again
10. Select Join attributes from a table
11. Box 1: select GWR_ID
12. Box 2: select Tokyo point
13. Box 3: select TOKYO-ID and click on OK – this links the GWR results as an
attribute of the TMABSU shapefile.

7 UTM stands for Universal Transverse Mercator. There appears to be a local shift in the Y coordinates
southwards.

Use ArcMap to examine the spatial variation in the parameters. With a smaller
bandwidth, some interesting local spatial variations are revealed. The negative
influence of the Professional parameter is most marked in Chiba-ken prefecture
and affluent suburbs of Tokyo-to. The influence of the Elderly is most noticeable
in southern Chiba-ken and the north-western part of Saitama-ken. Owner
occupation is most influential in Kanagawa-ken and the central part of Tokyo-to,
and Unemployment is also most influential in Kanagawa-ken and parts of
Saitama-ken. Examine each parameter estimate in turn. The residuals appear
to be fairly patchy.

The parameter names refer to the variables in the data thus:

point.PARM_1 Intercept
point.PARM_2 Professional
point.PARM_3 Elderly
point.PARM_4 Own Home
point.PARM_5 Unemployed

If you are intending to map the results from another analysis, you can remove
the join to the Tokyo point coverage, and then join another coverage in its place.

6.6 Tasks
Which of the set of independent variables produces the greatest reduction in the
corrected AIC? Experiment with each singly, and in pairs.

For a given model (perhaps take a simple one with only Unemployed as the
explanatory variable) what is the effect of choosing different calibration
methods?

Experiment with different bandwidths. What is the effect on the AICc and the
effective number of parameters of reducing the bandwidth? Try 17000 as the
bandwidth for a Fixed kernel, and 40 as the Local Sample Size for an Adaptive
kernel.

7
Lab 4: Logistic GWR with Landslides in Clearwater, Idaho

7.1 Introduction
Logistic Regression (also known as Binary Logit Regression) is used when your
response variable is binary and the covariates are either continuous or binary
measurements. Typical examples of binary data might include yes/no,
alive/dead or above/below. In GWR such data are modelled using a logistic
regression model. You will find the term binary or dichotomous used as names
for the 0/1 data – note, however, that these data are not binomial, and the
model is not binomial. In GWR each observation has a 1/0 valued response
variable, a number of explanatory variables, and some locational variables. The
problem is set up using the GWR Model Editor as before.

Mathematically, the situation is that we need a model form that predicts $y_i$,
where $i = 1 \ldots n$, as a value in the interval 0–1 based on a set of explanatory
variables $x_{1i} \ldots x_{ki}$. Such a model is

$$\hat{y}_i = \frac{\exp(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})}$$

where $\hat{y}_i$ is the predicted value of $y_i$, because as
$\exp(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}) \to 0$, $\hat{y}_i \to 0$, and as
$\exp(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}) \to \infty$, $\hat{y}_i \to 1$.

Note that

$$1 - \hat{y}_i = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})}$$

so that

$$\frac{\hat{y}_i}{1 - \hat{y}_i} = \exp(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki})$$

and

$$\log\left(\frac{\hat{y}_i}{1 - \hat{y}_i}\right) = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki}$$

The term on the left hand side of the equation is known as the logit
transformation and this produces a linear function in terms of the right hand
side of the equation.
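
The algebra above is easy to verify numerically; a minimal sketch with
hypothetical coefficient values:

import numpy as np

beta = np.array([0.5, -1.2, 0.8])    # hypothetical beta0, beta1, beta2
x = np.array([1.0, 0.3, 2.0])        # leading 1 for the intercept

eta = beta @ x                       # linear predictor
p = np.exp(eta) / (1 + np.exp(eta))  # predicted probability, in (0, 1)

# The logit transformation recovers the linear predictor.
print(np.log(p / (1 - p)), eta)      # identical up to floating point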

In logistic GWR, as with Gaussian GWR, the local parameter estimates are
specific to each location i. The model calibration, however, is more complicated
than in OLS regression and the model is fitted using a technique known as
iteratively reweighted least squares. This has two implications for our analysis.
First, the fitting technique is iterative, and a typical problem will take
approximately 5 times as long to run as a Gaussian GWR. Second, to compute
the standard errors, the observed 0/1 values are required, so parameter
estimates may only be obtained at the data points – that is, we no longer have
the option of producing local parameter estimates at points other than the data
points.

The output file from logistic GWR contains the parameter estimates and their
standard errors from (7.4). Positive parameter values when exponentiated are
greater than unity, a parameter of zero yields an exponentiated value of unity,
and negative parameter values when exponentiated are less than unity. In all
cases, the exponentiated values are positive. Exponentiating the parameter
values is an add-on task (see section 7.7).

7.2 Landslide Hazard: Clearwater National Forest

For this workshop, we will revisit some research into the probabilities of
landslide occurrence. In November 1995 and in February 1996 there were some
900 landslides in Clearwater National Forest in northern Idaho. Following the
landslides there were a number of attempts at predicting landslide hazard. Some
interesting work was carried out by Peter Gorsevski and Paul Gessler of the
Department of Forest Resources at the University of Idaho in conjunction with
Randy Folz of the USDA Forest Service Rocky Mountain Research Station. Using
logistic regression they attempted to create a hazard map for landslides in the
Forest. They found they could predict the probability of landslide occurrence
reliably using information on the landscape morphology and proximity to water.
We’ll use GWR to look at a subset of the landslide data with a similar set of data.

The data for this workshop are in the SampleData/Clearwater folder in the file
landslides.csv. They are the landslides in a small part of the Forest, about
33 x 29 km in size, extracted from the Clearwater National Forest GIS Library. A
digital elevation model was downloaded from the Shuttle Radar Topography
Mission data archive at the Jet Propulsion Laboratory and re-projected into UTM
Zone 11 with a 25m grid interval – this gives us an estimate of the elevation of
every 25m grid square in the forest. The DEM was then used to create slope and
aspect grids; the slope is given in percent, and the aspect in degrees. Two more
grids were created with the sine and cosine of the aspect. Digitised streams data
were downloaded from the CWNF website. A basemap extracted from TerraServer
is included to help in interpreting the results.

For the 138 landslide sites in the study area we extracted the following data:
(a) elevation in metres
(b) slope (%)
(c) sine of the aspect
(d) cosine of the aspect
(e) absolute deviation of the aspect from due south
(f) distance in metres to the nearest watercourse

We also need observations from sites where there were no landslides. 101
random locations were sampled in the area for the same variables. We’ll refer
to the first sample as the landslide sites and the second sample as the control
sites.

The two samples were merged. A 7th variable was added with the value of 1 for
the landslide sites and 0 for the control sites. This will be used as the y variable
in the logistic regression. The topographic variables will be used as predictors
and the output will include the probability of a landslide occurring given the
characteristics of the site, whether landslide or not. A satisfactory outcome will
be high probabilities at the landslide sites and low probabilities at the control
sites.

We have taken a small area of the Forest to keep the GWR run times short. The
models take about 2 minutes to fit on a 1.70GHz laptop. On the same system
fitting the dataset of 875 sample sites and the same number of controls takes
an hour.

7.3 Examining the Data
Before beginning the analysis, you should map the spatial distribution of the
landslides on the basemap.

1. Start ArcMap
2. From the SampleData/Clearwater folder add basemap.jpg and
landslides.csv.
3. Right-click on landslides.csv in the Table of Contents, and select
Display XY Data from the options that are presented. A form
will be presented – there’s not much to do except note that
ArcMap has selected X as the X variable and Y as Y variable. Click
on OK at the bottom of the screen
4. Right click on landslides.csv Events in the Table of Contents and
select Properties/Symbology.
5. Select Categories/Unique Values in the Show: list
6. Select Landslid as the Value Field
7. Click on the Add Values button and add the values 0 and 1.
8. Uncheck the symbol for ‘All Other Values’
9. Change the fill colours to red for 1 and blue or green for 0, and click
OK. Remember that landslide sites are coded 1 and control sites are
coded 0.
10. You can also examine the spatial variation in some of the explanatory
variables. As these are point data, you should use Graduated
Colors as the symbolism.

7.4 Setting up the Model

We will use Landslid as the response variable and initially Elev and Slope as
the predictor variables.

1. Start GWR3
2. In the Wizard check Create a new model and click Go
3. … select Logistic for Model Type and click Go
4. In Open Data File, change the filter to Comma Separated Variable
(*.csv), and select the file landslides.csv from the
SampleData/Clearwater folder
5. In Analysis Point Selection, click on Yes
6. The Output File should be an ArcInfo Export File (*.e00); navigate to
your Work folder, and name the file logmodela.e00

7. Check that the correct variables are present in the Data Preview, and
check that the correct files are being used in Confirm – in particular,
check that the filenames for the Observed Data File and Parameter
Diagnostic File are different. The former should be landslides.csv
and the latter should be logmodela.e00.
8. In the Model Editor enter a title, select Landslid as the Dependent
Variable and Elev and Slope as the predictor variables; select X and Y
as the Location Variables; select a Variable kernel, with Cartesian
coordinates, and AICc as the Bandwidth selection method. Check all
three Model Options (but don’t check the Monte Carlo test as it is not
available yet for this model type). Change the output format to
ArcInfo Exp, and save the model in your Work folder as
logmodela.gwr.
9. Run the model and name the listing file logmodela.txt. Click on Run
and wait a couple of minutes for the model to run.

7.5 Examine the Output

The output is broadly similar to that for a Gaussian GWR.

The opening panel is shown below:

***************************************************************
* *
* GEOGRAPHICALLY WEIGHTED LOGISTIC REGRESSION *
* *
***************************************************************
Number of data cases read: 239
Sample data file read...
*Number of observations, nobs= 239
*Number of predictors, nvar= 4

P(Landslid=1 | X) = 0.577406
Observation Easting extent: 33543.0625
Observation Northing extent: 28802.6289

Data has been read for 239 locations. 57.7% of the locations were landslide
sites. The study area is rectangular, 33.5km from east to west and 28.8km from
north to south.

*Finding bandwidth...
... using all regression points
This can take some time...
*Calibration will be based on 239 cases
*Adaptive kernel sample size limits: 59 239
*AICc minimisation begins...
Bandwidth AICc

114.623059100000 261.564370933634
149.000000000000 263.977389197345
93.376941151579 259.338341719932
80.246118103905 259.206909758644
72.130823143768 259.220048698357
85.261646191441 259.090515014747
88.361413027338 259.178113791002
83.345884939802 259.175696235611
** Convergence after 8 function calls
** Convergence: Local Sample Size= 85

The optimal bandwidth is 85 observations. We will obtain a relatively high
degree of smoothing with a bandwidth of this size as it’s over a third of the
dataset. If you have Excel available, you can examine the shape of the
minimisation function. The number of function calls refers to the number of
times the different models have been fitted to obtain the optimal bandwidth.

The next panel shows the results from the Global Model.

********** Global Logistic Model Diagnostics **********


Convergence after 5 iterations
Log-likelihood: -136.074826
Deviance (-2LogLikelihood): 272.149653
Number of parameters in model: 3.000000
Akaike Information Criterion: 278.149653
Corrected AIC (AICc) 278.251781
Bayesian Information Criterion: 288.579043

Parameter Estimate Std Err T Exp(B) Sd(Exp(B))
--------- ------------ ------------ ------------ ------------ ------------
Intercept 0.884 0.780 1.134 2.420 1.887
Elev -0.002 0.001 -4.711 0.998 0.001
Slope 0.089 0.019 4.693 1.094 0.021

The AIC is about 278.2, and the model has 3 parameters. Elevation has a
negative effect on landslide probability. Slope has a positive influence on
landslide probability. Certainly one would expect steeper hillsides to be
more prone to landslides than flatter ones – and landslides may have a greater
chance of occurring at lower elevations than at higher ones. However, remember
that these coefficients represent an average over the study area, and other
factors may be influencing hazard variation. (See the note in section 7.7)

The next panel shows the output from the Local Model.

********** Local Logistic Model Diagnostics **********


Log Likelihood: -109.945257
Deviance: 219.890514
Residual sum of squares......... 35.881650
Effective number of parameters.. 18.038670
Akaike Information Criterion.... 255.967853
Corrected AIC................... 259.090513

Bayesian Information Criterion.. 318.678626

First of all, note that there has been a substantial decrease in the AICc – this
suggests that the local model provides a better fit to the data than the global
model. The deviance is lower, but the price of using GWR is a greater number of
effective parameters (about 18.0 now instead of 3 in the global model).

It is also worth checking the 5-number summaries for the parameter estimates
(these are the values before exponentiation). At least 75% of the values of the
Slope parameter are positive, and a similar proportion of the Elevation
parameters are negative.

**********************************************************
* PARAMETER 5-NUMBER SUMMARIES *
**********************************************************
Label Minimum Lwr Quartile Median Upr Quartile Maximum
-------- ------------- ------------- ------------- ------------- -------------
Intrcept -5.856532 -0.240704 1.719492 3.787028 16.416200
Elev -0.013785 -0.004208 -0.002330 -0.000893 0.000381
Slope -0.027931 0.035047 0.078778 0.133318 0.243211

Again, we can informally check on the spatial variation of the local estimates by
comparing the inter-quartile range with 2xS.E. of the respective global
estimates. In doing this, we get…

Parameter   2 x S.E. (Global)   Inter-Quartile Range (Local)
Intercept        1.560                4.028
Elev             0.002                0.0033
Slope            0.038                0.098

which suggests there might be non-stationarity in the local estimates of all
the parameters.

Further diagnostic analysis should involve mapping the parameter estimates and
other diagnostic data.

7.6 Mapping the Results

This proceeds much as in previous Labs.

1. Use ArcToolBox to Import the Interchange File to a coverage. Both
should be in your Work folder – call the coverage logmodela
2. Add the new logmodela coverage into the Table of Contents

The new layer is added to the Table of Contents. Use ArcMap to examine the
spatial variation in the parameters. As you might expect given the relatively
large bandwidth, there are some broad regional variations present in the
parameter estimates. Remember that we have taken a small sample of landslides
in the Forest, so it might be unwise to draw general conclusions about
landslides in Clearwater. The influence of elevation on landslide hazard is most
negative towards the east and western parts of the study area – higher
elevations imply lower hazard. There is a north-south pattern to the variation in
the slope parameter, with slopes in the south part of the study area having the
greatest influence on hazard.

Examine the spatial variation in the predicted probabilities:

1. In Symbology choose Quantities/Graduated Colors
2. Select PRED as the Field/Value
3. Under Classify, select Equal Interval as the Method, with 5
classes. This should give class breaks at about 0.2, 0.4, 0.6, and 0.8.
4. Click OK

An interesting task is to examine the relationship between the observed and
predicted values. The easiest way of doing this is by examining the residuals.
Remember that these are the difference between the observed y and the
predicted y. The observed ys have the values 0 and 1, whereas the predicted ys
are probabilities in the range 0…1. A residual close to -1 would occur when the
model has predicted a landslide at a location where one did not occur (a false
positive). A residual close to +1 would occur when the model predicts no
landslide at a landslide site. This second case is the more worrying, and we
might wish to plot their locations.

1. In Symbology choose Quantities/Graduated Color
2. Select RESID as the Value Field
3. Click on OK

There is a small cluster of high positive residuals about 3 km west of Sheep
Mountain Work Center in the south west of the map. These are landslides which
the model has failed to predict (use the Properties/Symbology/Classify option to
change the number of classes to 3, and set the two breaks at -0.5 and 0.5). You
can use a red/blue color ramp but you may need to flip the symbol order (right
click in the symbol/key area).

7.7 Further Tasks
We have included both the sine and cosine of the site aspect, as well as the
distance to the nearest watercourse, in the data file. Create two further
models, including (a) the distance to the nearest watercourse and (b) the
aspect. Is there evidence for the hypothesis that either of these variables
contributes to improvements in either the local or global model?

Note: The coefficient on a variable in logistic regression represents the change
in the log odds of the response given a unit change in the predictor variable. If
we take the anti-log then we obtain the odds ratio corresponding to a unit
change in the variable. In the global model the coefficient for slope was 0.089,
whose anti-log is 1.094. The odds that a steeper sloped site will be a landslide
site increase by about 9.4% over those of a gentler slope with each percentage
point increase in slope. For a 5-point difference the increase is 1.094^5 = 1.567,
or about 57%.
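This arithmetic is easy to verify with a few lines of Python; the coefficient
value below is the one quoted from the global model listing above.

    import math

    b_slope = 0.089                 # global logistic coefficient for Slope
    print(math.exp(b_slope))        # odds ratio per 1-point change: ~1.09
    print(math.exp(b_slope * 5))    # 5-point change: ~1.56
    # The text's 1.567 comes from rounding the odds ratio to 1.094
    # before raising it to the fifth power.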

You can experiment with taking the anti-logs of the coefficients. You will need
to add another field to the attribute table (use a type of double, a precision of
12, and a scale of 7). The Exp() function will compute the anti-logs for you.

8
Finally

We hope that these labs have given you some of the flavour of GWR in action
and the relationship it has with GIS.

Further information about GWR can be found at the GWR website:

http://ncg.nuim.ie/GWR

Not only is there a brief description of the methodology, but there is also
information on how to obtain your own copy of the software. The GWR manual,
including information on installing the software, is available from this website.

The contact address for GWR is:

Stewart.Fotheringham@nuim.ie

Appendix

Geographically Weighted Descriptive Statistics

8.1 Introduction
This appendix introduces you to geographically weighted descriptive statistics.
We shall explore a simple set of these which are incorporated within the GWR
software. Some of this exploration will involve the GWR program, and some will
involve the GIS program.

8.2 Local Statistics

The geographically weighted mean is the starting point for thinking about
geographically weighted statistics. Let us consider the arithmetic mean – its
formula is:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

This is simply the sum of the values making up a batch of numbers divided by
the size of the batch. More generally, we can consider a weighted mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

where the $w_i$ are the weights. Here we multiply each value by its weight, and
divide by the sum of the weights. In the case that each observation has a weight
of unity, this formula and the one above are equivalent.

In many cases the weights are integers, but they may also be non-integer
numbers. In particular, we can use weights generated from the same
geographical weighting scheme that we have used for geographically weighted
regression. Rather than being a whole-map statistic, a geographically weighted
mean is available at a particular location, say u. Thus the formula for the
geographically weighted mean at location u is:

$$\bar{x}(u) = \frac{\sum_{i=1}^{n} w_i(u)\, x_i}{\sum_{i=1}^{n} w_i(u)}$$

where $w_i(u)$ is the geographical weight of the $i$th observation relative to
the location $u$. The weights may be generated using a fixed radius or an
adaptive kernel.

By analogy, the local geographically weighted variance is

$$\sigma^2(u) = \frac{\sum_{i=1}^{n} w_i(u)\,(x_i - \bar{x}(u))^2}{\sum_{i=1}^{n} w_i(u)}$$

and the locally weighted standard deviation is the square root of this. Notice
that the mean here is the geographically weighted mean around point $u$ and NOT
the global mean of the data.
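As a concrete (if simplified) illustration, the Python sketch below computes
these statistics at a single location. A Gaussian fixed kernel is assumed here
purely for illustration – GWR 3.0 offers other kernel choices – and the data
are randomly generated stand-ins, not the Georgia data.

    import numpy as np

    def gw_stats(x, coords, u, bandwidth):
        # Geographically weighted mean, variance and standard deviation
        # of x at location u, using a Gaussian fixed kernel
        d = np.sqrt(((coords - u) ** 2).sum(axis=1))   # distances from u
        w = np.exp(-0.5 * (d / bandwidth) ** 2)        # weights w_i(u)
        mean = (w * x).sum() / w.sum()
        var = (w * (x - mean) ** 2).sum() / w.sum()
        return mean, var, np.sqrt(var)

    # Randomly generated stand-in data: 159 points in a 100 km square
    rng = np.random.default_rng(0)
    coords = rng.uniform(0, 100_000, size=(159, 2))    # map units: metres
    x = rng.normal(10.0, 3.0, size=159)
    print(gw_stats(x, coords, u=np.array([50_000.0, 50_000.0]), bandwidth=94_000))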

The GWR software currently supplied (GWR 3.0) allows the user to compute
geographically weighted means, variances, and standard deviations for a set of
input data, and for either a fixed or an adaptive kernel. However, as there is no
concept here of an optimal bandwidth, the bandwidth must be supplied by the
user. If a fixed kernel is used, the bandwidth must be in the same units as the
coordinates of the input data; if an adaptive kernel is used, the bandwidth is the
number of objects to include in the local sample. If the bandwidth is very small,
the degree of smoothing from the weighting scheme will be very small: the local
means will approach the original data values and the variances will be very
small. A zero bandwidth may cause premature and possibly inelegant
termination of the program. The larger the bandwidth, the greater will be the
degree of smoothing in the resulting geographically weighted statistic. With a
fixed kernel the bandwidth can be as large as you wish, although anything
greater than the width of the study area will result in an almost identical set of
means everywhere. With an adaptive kernel, the bandwidth should not be greater
than the number of observations in your dataset.
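To see this bandwidth effect for yourself, the following sketch (again with
randomly generated stand-in data rather than any of the workshop datasets)
evaluates a geographically weighted mean at one point for three fixed
bandwidths:

    import numpy as np

    rng = np.random.default_rng(2)
    coords = rng.uniform(0, 100_000, size=(200, 2))    # map units: metres
    x = rng.normal(10.0, 3.0, size=200)
    u = coords[0]                                      # evaluate at the first point

    for bw in (100.0, 10_000.0, 1e7):                  # tiny, moderate, huge
        d = np.sqrt(((coords - u) ** 2).sum(axis=1))
        w = np.exp(-0.5 * (d / bw) ** 2)
        print(bw, (w * x).sum() / w.sum())
    # A tiny bandwidth returns almost exactly x[0]; a huge one returns
    # almost exactly the global mean x.mean()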

8.3 Educational Attainment

We will start by considering the educational attainment data set we have already
used. The objective will be to examine the variation in the proportion of
residents holding a bachelor’s degree across the counties of Georgia.

1. Start GWR
2. Select “Create new descriptive statistics” and click Go
3. Select the data file gdata_utm.csv from SampleData\Georgia and click Open
4. Click Yes to undertake analysis at the data point locations
5. For the output file, navigate to your Work folder, change the file type to
ArcInfo Export File, and enter gstats.e00 as the file name, and click Save
6. Check that the variable names and file locations are correct in the data
preview and file confirmation windows
7. In the Model Editor, enter a title…
8. Enter PctBach, PctEld, PctFB and PctPov into the Variable(s) list
9. The location variables should be X and Y
10. The kernel shape should be adaptive with Cartesian coordinates
11. Enter 30 as the bandwidth
12. Select ArcInfo Exp as the output file type
13. Save the control file as gstats.gwr in your Work folder
14. Run the model with gstats.txt as the listing file

The output from the program is fairly brief:

***************************************************************
* *
* GEOGRAPHICALLY WEIGHTED SUMMARY STATISTICS *
* *
***************************************************************
Number of data cases read: 159
** Data file read...
Number of observations, nobs= 159
Number of predictors, nvar= 4
** Adaptive kernel: local sample size is 30
** Results written to .e00 file

To examine these data, we need first to convert the Export file to a point
coverage, and then assign the values from the attribute table to the county
boundary polygons.

1. Using ArcToolbox, convert the gstats.e00 file to a point coverage called gstats in your Work folder
2. Add the gstats point coverage and g_utm.shp as layers into a data frame in ArcMap
3. Using a spatial join, create a new layer gstats30.shp in your Work folder
4. Plot a choropleth map of MEAN_1 (the locally weighted mean of the educational attainment variable)

The map suggests that there is a broad regional pattern with the greatest
proportions being in those counties around the University of Georgia.
Educational attainment is much lower in the rural south.

Plot the variance (VAR_1). Again there is a pattern here – the high locally
weighted means are also associated with high locally weighted variances. This
suggests that there is distinct local variation among the counties in the local
samples. It may be that 30 is too large a bandwidth, and that while broad
regional patterns in the means are being picked up, the confidence intervals on
them may be rather wide. A similar tale is told if you plot the local standard
deviation (STD_1) – try this for yourself.

Another diagnostic might be the local coefficient of variation – the ratio of
the local standard deviation to the local mean. This is not computed by GWR,
so you will have to compute it yourself. However, you can use the GIS to help
you.
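(If you prefer, the same ratio is easy to compute outside the GIS. The Python
sketch below uses hypothetical stand-ins for the MEAN_1 and STD_1 columns of
the joined attribute table; the GIS-based route follows in the next section.)

    import numpy as np

    # Hypothetical stand-ins for the MEAN_1 and STD_1 attribute columns
    mean_1 = np.array([12.4, 8.1, 21.7, 9.9])
    std_1 = np.array([3.2, 2.9, 4.1, 3.8])
    cv_1 = std_1 / mean_1           # local coefficient of variation
    print(cv_1.round(3))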

8.4 GIS Manipulation

While GIS software is useful for manipulating spatial data, some simple
operations can also be carried out on the attributes. We will add an extra item
to the attribute table for gstats30, compute the coefficient of variation, and
then examine the distribution of this statistic. This will provide some useful
practice at using ArcMap in a little more detail than we have done thus far.

1. Open the Attribute Table for gstats30.shp (right-click, then Open Attribute Table)
2. Click on the Options button below the attribute table data listing and select Add Field
3. Complete the Add Field dialog as shown on the right. The Name should be CV_1, the Type should be Float, the Precision should be 12 and the Scale should be 11. Then click OK.
4. Scroll along the Attribute Table listing, and then right-click on the Field heading CV_1

5. Select Calculate Values and then click Yes to the warning which follows.
6. Complete the Field Calculator as shown on the right. The calculation should be [STD_1]/[MEAN_1] – select these from the Fields list with a single mouse click. Click on OK

What you now see is the spatial variation in the locally weighted coefficient
of variation. Whilst this ratio is higher in the northern than in the southern
parts of the state, in the area around the University of Georgia the coefficient
is lower, implying that, given the local mean values, the local standard
deviations are smaller there than elsewhere in the north.

8.5 Tasks

1. Examine the spatial distributions of the other variables in the coverage, compute local coefficients of variation, and examine the regional patterns.

2. ArcMap uses a local sample size of 12 in its computation of the surface we used in the housing workshop. Rerun the GWR stats with an adaptive kernel as before, but with a bandwidth of 12. You can use the same model control file (gstats.gwr), the same output file (gstats.e00) and the same listing file (gstats.txt). Remove gstats point from the Table of Contents, and then use ArcToolBox to convert the output file to a coverage called gstats12. In the spatial join to the boundary data, the joined shapefile should be named gstats12. Examine the spatial patterns for at least PctBach (MEAN_1) and notice what effects changing the bandwidth has.

3. In the first GWR exercise, we used a fixed kernel and allowed the calibration routine to find the optimal bandwidth. This was 94km. Rerun the local statistics with a fixed kernel, with a bandwidth of 94000 (the map units are in metres!). Convert the output file to a coverage called gstats94; the joined shapefile should also be called gstats94. Again, examine the effect of changing the bandwidth. (The kernel smoothing literature suggests that changes in bandwidth have a much greater impact on the smoothing than changes in kernel shape.)

By this time, you should be adept in using both GWR and ArcGIS in tandem. You
should also have an appreciation of local statistics and the ways in which you
can use the GIS software to assist you with your tasks.

