Citation:
Gisela Bichler; Stefanie Balchak, Address Matching
Bias: Ignorance is Not Bliss, 30 Policing: Int'l J.
Police Strat. & Mgmt. 32 (2007)
PIJPSM 30,1
Address matching bias: ignorance is not bliss
Gisela Bichler
Center for Criminal Justice Research, California State University,
San Bernardino, California, USA, and
Stefanie Balchak
Chula Vista Fire Department, Chula Vista, California, USA
Abstract
Purpose - The purpose of this paper is to show that despite the critical importance of using accurate
data when identifying geographic patterns and studying hotspots, few have explored the data quality
issues introduced by Geographic Information Systems (GIS) software applications. While software
manufacturers provide some information about the address matching process, critical details are left
out or are buried in technical, and sometimes proprietary, jargon. The purpose of this paper is to
address these issues.
Design/methodology/approach - The paper demonstrates, with three datasets of 100 cases each,
how the assumptions built into popular GIS software produce systematically missing data during the
data importing process commonly referred to as address matching.
Findings - Inclusion of directional indicators and zip codes is more important than previously
thought. The results highlight the critical need to provide complete descriptions of research
methodology. All geographic analyses must be accompanied by: information about the hit rate
(percent of cases plotted), details about the software and process used to import tabular crime data,
information about the software parameters set for the importation process (geocoding preferences),
reference information about the street file used; and, an examination of the missing cases to identify
some of the sampling error. When forecasting crime issues or identifying hot spots, analysts must be
cognizant of the differential impact this bias will have on the generalizability of the results.
Originality/value - The paper explores previously neglected issues in data quality introduced by
GIS software applications.
Keywords Bias, Geographic Information Systems
Paper type Research paper
Introduction
One of the fundamental tenets of the scientific method is that a detailed discussion of
the research methodology, including procedures used, summary data, and research
limitations, must accompany all results. This is necessary to evaluate findings, enable
replication and ultimately, to advance the development of knowledge. For example, all
survey research must report the response rate associated with the sample generation
process. As simple as this tenet may appear, scholars examining the geographic
properties of crime have faltered on this basic principle. While the proprietary concerns
of Geographic Information System software vendors have contributed to this situation,
it is not an excuse for bad science. Currently, scholars and crime analysts routinely
neglect to report the "response rate" or other procedural and analytic parameters,
rendering it near impossible to accurately interpret study results or assess the
generalizability of conclusions reached.
Policing: An International Journal of Police Strategies & Management
Vol. 30 No. 1, 2007, pp. 32-60
© Emerald Group Publishing Limited
1363-951X
DOI 10.1108/1363951071072,613
The research initiative presented here began as a question, raised by an
undergraduate student learning how to plot crime events using GIS software. The
student's question went something like this: "How exactly is the computer selecting
street segments in order to map my event locations? Because, the addresses are written
correctly and I checked, they match the street file, but the computer put the dot on the
wrong street." After careful examination, the student was indeed correct, for some
reason the dots were not on the correct street segments. Little did we know that it
would take a great deal of effort to figure out exactly what the computer was doing.
And, in answering this question, a surprising revelation emerged with regard to the
rigor of scholarly reports about the geographic analysis of crime.
It is fair to say that most geographic analysis of crime relies on incomplete data. In
fact, an informal survey of practicing crime analysts showed that many analysts used
crime data that do not achieve geocoding[1] hit rates above the conventionally accepted
90 percent[2] threshold (Ratcliffe, 2003)[3]. Given that self-report and victimization
studies (particularly in reference to commercial burglary) show that only a small
fraction of the total crime actually is recorded in official data sources (e.g. Pease and
Laycock, 1996; Ratcliffe and McCullagh, 1998; Townsley et al., 2000), the starting point
of using "all known crime events" extracted from a records management system, is
already a non-random sample of crime incidents. Furthermore, the degree of sampling
error is unknown. Having less than 90 percent of a non-random sample, particularly
when that missing 10 percent is systematically missing, becomes an important
methodological limitation that may not be remedied by imputation methods (Myrtveit
et al., 2001). The fact that law enforcement policies are increasingly grounded on
rudimentary analysis of datasets that are systematically missing a great deal of data
heightens the importance of this topic; systematically missing 10 percent of 350,000
incidents is a serious analytic limitation. Until more is known about what makes these
cases systematically missing, statistical adjustments to improve analytic reliability are
not possible.
What do we currently know about factors that render crime data "systematically"
unmappable? Prior discussions highlighted: the limitations posed by the quality of
police data in relation to data entry (e.g. Harries, 1999; Block and Block, 1995)[4]; the
imprecision introduced by GIS based plotting of events compared with actual property
boundaries (Ratcliffe, 2001; Martin and Higgs, 1997)[5]; and more recently, that
analyses to identify crime concentration are affected by the number of repeat
victimizations captured in the police data (Bichler, 2004). Together this body of work
suggests that fluctuation in crime incidents available for analysis is generated by
factors that may be unrelated to the volume of crime occurrence; and, this can radically
alter the crime patterns identified (Bichler, 2004; Ratcliffe, 2001; Harries, 1999; and
Block and Block, 1995). Moreover, these examples support the assertion that a closer
inspection of the process by which crime data (in tabular form) is translated into
spatially referenced data in GIS software is warranted to explore the effects of
interactions between GIS software and data quality.
Little is known about the data quality issues generated by the algorithms and
default settings of software. Published scholarship offers few details as to how GIS
software actually selects addresses from a reference street file in order to map data.
And while software manufacturers provide some information about the process in user
manuals, critical details are left out or are buried in technical, and sometimes
proprietary, jargon. This is important because the address matching hit rate is vital to
the quality of data made available for analysis. While the effects of missing data and
measurement error are widely understood and acknowledged by criminologists using
traditional statistics, when similar issues impact on geographic data, these issues are
often neglected[6].
This report details the results of an inquiry into answering how exactly the software
selects candidates from the street file upon which to place events during the address
matching process[7]. In doing so, it provides many neglected details and demonstrates
how data quality and software algorithms interact to affect the address matching
process. Secondary research questions included: how much of a problem (bias) is likely
to be introduced by this systematically missing data; what features of the address lead
to non-matching cases; what information must scholars report to inform research
consumers about the bias that may have been introduced; and what implications does
this bias raise for hot spot identification and forecasting? This detailed treatment of
address matching is necessary as many important details are lost during the
interdisciplinary exchange of analytic techniques. Geographers adept at using various
geographic statistics are well aware of these issues; criminologists and crime analysts,
on the other hand, are only beginning to fully appreciate the implications of this
problem.
Address matching
Many issues associated with the geographic analysis of crime (i.e. how to obtain
funding, improve data quality at entry, and train staff on software) have been
addressed[8]; yet, few have examined fundamental issues associated with the
geocoding process. While great strides have been made in the accuracy and availability
of GPS systems that enable the use of more precise latitude and longitude coordinates
to plot locations, address matching[9] is still the most frequently used geocoding
process among criminal justice users of GIS.
Address matching requires the assembly of crime data into a spreadsheet with the
street address written in a single column in this order: prefix (address number), space,
directional indicator (e.g. N for North), space, street name, space, and suffix (e.g. avenue
or street). When multiple areas are being mapped at the same time, another variable
(e.g. zip code, city, or county) is necessary. Crime event addresses are then compared
with a source database including the entire street network of a jurisdiction. Drummond
(1995) suggests that there are four possible outcomes in address matching:
(1) Correct matches include occasions where events are plotted on street segments
that contain the address in question.
(2) Unmatched records are those situations wherein no street segment exists in the
street file that matches the event address.
(3) Incorrect matches (false positives) occur when the crime is located on the wrong
street segment.
(4) Non-match outcomes (false negatives) involve events that should have matched,
but did not.
The address matching completion rate records the number of events plotted; it is
referred to as the hit rate. Hit rates include the number, or percent, of correct and
incorrect matches - all events that plotted.
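As an illustration of this definition, the following Python sketch tallies the four outcomes and computes the hit rate alongside the share of correct matches. The outcome labels, function names, and sample counts are hypothetical and serve only to show the arithmetic; they are not results from this study.

```python
from collections import Counter

# Hypothetical labels for Drummond's (1995) four outcome categories.
PLOTTED = ("correct", "incorrect")  # events that received a point on the map


def hit_rate(outcomes):
    """Return the percent of events that plotted (correct + incorrect)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events supplied")
    plotted = sum(counts[label] for label in PLOTTED)
    return 100.0 * plotted / total


def correct_rate(outcomes):
    """Return the percent of events plotted on the right street segment."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events supplied")
    return 100.0 * counts["correct"] / total


# Made-up batch of 100 events.
sample = (["correct"] * 80 + ["incorrect"] * 8
          + ["unmatched"] * 7 + ["non_match"] * 5)
print(hit_rate(sample), correct_rate(sample))
```

Because the hit rate counts incorrect matches as plotted, it can overstate how many events sit on the right segment; reporting both figures makes that gap visible.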
Essential elements of address matching success
As with all data generation procedures there are many factors that can affect the
address matching process. Match scores are the byproduct of the interaction
between the target database accuracy (crime data), the reference database accuracy
(street file), and software issues including user-defined parameters and software
algorithms.
Positional accuracy
TIGER® files are produced from the compilation of information developed through a
range of encoding procedures (i.e. automated map scanning and manual "digitizing")
by various local and state level entities (US Census Bureau, 2004). Given local variation
in the quality of GIS data, some areas are better represented than others; positional
accuracy varies with the source materials and procedures. However, overall, the data
meets the established National Map Accuracy standards of approximately ±167 ft
(US Census Bureau, 2003)[10]. Figure 1 shows the overlap of the street centerline
compared against an aerial photo. The inaccuracy of the street centerline placement on
a large-scale map is clearly evident.
While this level of positional accuracy is not sufficient for high-precision
engineering applications, it is permissible with general social science research,
particularly when data points are aggregated to administrative units following
address-based geocoding. While GIS scientists commonly understand this,
criminologists tend to be less informed about these concerns. With one exception
(Ratcliffe, 2002), published discussion tends to focus on general issues of data quality
rather than location precision[11].
Practical implication 3. If crime intensity maps are generated (e.g. density maps or
statistical ellipses) and then placed on top of aerial photos, the analysis may
misidentify the locus of the high crime area. To enhance the interpretability of such
analysis, additional information should be included. This may involve a small table
that contains the actual address and crimes occurring within the high crime area. Or,
the street file and crime points could be included to illustrate the degree of warping that
may be affecting the intensity map.
Figure 1. Inaccuracy of TIGER files compared with aerial imagery
Source: Eagle Aerial Imaging, Inc., Coachella Valley (2001)
Figure 2. Haphazard nature of missing information from the Census 2000 street file
Additionally, to enhance comparability of geographic analyses it is important that
results include a note that assesses the quality of the underlying street file used to
geocode. This could be done by reporting the percent of street segments that are
missing address ranges and/or names (along with the percent of cases that matched
well and matched partially). Thus, when comparing analyses for the same jurisdiction
that were completed at different times it is possible to assess the impact that using
different street files or using a cleaned street file may have made to the results.
A related concern is that the same address, geocoded twice, may be assigned
slightly different positions on the map. Software-specific interpolation
procedures - estimation of the location based on the beginning and end address range
of the block - and the presence of zero (0) values in the address to be mapped create
inconsistencies in plotted location (Block and Block, 1995; Mills, 1999). This may lead
to systematically missing data if there is a reason why some streets or sections of the
jurisdiction are consistently missing the numeric portion of the street address. One
possible example is when crime is recorded as occurring at a park or other large open
area.
A temporary fix would be to geocode problem addresses to intersections. This will
reduce the number of missing cases considerably; however, care must be taken when
interpreting geographic analyses using this data as the addresses plotted to
intersections will introduce a degree of geographical inaccuracy to visual
representations of the crime intensity.
Deviations in the target dataset (crime events) - addresses that are in an unusual format
or that have too few or too many components - will create partial or unmatchable events;
this may contribute to the level of systematic bias. For example, Spanish-named streets cause
problems because the address components are in reverse order to typical default
standardization rules (Bichler-Robertson and Conley, 2003). Street names
such as "Avenida 52" will fail when address matched because the first part of the street
name (Avenida) will be treated as the name of the street while the second part (52) would
be considered the type of street. Adjustments must be made in order to map this data
(see, for example, Bichler-Robertson and Conley, 2003)[12]. Failing to make these
adjustments in jurisdictions with sections of the city containing Spanish-named streets
will result in the systematic loss of cases.
Identifying important Spanish-named streets (those where a lot of crime occurs),
spot checking the number of events recorded there, and comparing this figure
to the number of events plotted on
the map of crime will enable the analyst to conduct a quality check. The percent of
cases missing (not matching) can be used as a weighting factor when interpreting the
geographic analysis that uses only matched cases. For example, if a segment of
Avenida 52 were missing 12 percent of crime events, then you could multiply the
number of events plotted by 1.12 to determine the actual number of events that should
plot. This may be a useful technique when assessing the crime volume of
neighborhoods. Generally, spot-checking important streets will provide a rudimentary
assessment of the degree of error in the analysis.
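The spot-check adjustment can be worked through in a few lines. The sketch below is illustrative only: the function names and the count of 88 plotted events are hypothetical. The first function applies the multiplier described above (1 plus the missing share). The second divides by the matched share instead, which is the appropriate correction if the 12 percent is measured against the number of recorded events rather than the number plotted; the two estimates diverge as the missing share grows.

```python
def weighted_by_multiplier(plotted, missing_share):
    """Scale plotted events by (1 + missing share), as described in the text."""
    return plotted * (1.0 + missing_share)


def weighted_by_matched_share(plotted, missing_share):
    """Scale plotted events by 1 / (1 - missing share).

    Assumes the missing share is a fraction of all recorded events.
    """
    if not 0.0 <= missing_share < 1.0:
        raise ValueError("missing_share must be in [0, 1)")
    return plotted / (1.0 - missing_share)


# Hypothetical segment: 88 events plotted, 12 percent of events missing.
print(round(weighted_by_multiplier(88, 0.12), 2))
print(round(weighted_by_matched_share(88, 0.12), 2))
```

With these made-up numbers the first estimate is about 98.6 events and the second is 100, so analysts may wish to state which denominator the missing percentage refers to.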
Acknowledging the weakness of address matching (e.g. out-of-date street
directories) Ratcliffe (2001) sought to determine the level of geocoding imprecision.
Using MapInfo (version 5.5) the study found that there was an error level of 72 percent
when performing a standard point-in-polygon operation when matching geocoded
points with their corresponding area. Two explanations were offered for this error
level: the accuracy of the property boundaries and the degree of uncertainty for cases
(points) that fell on or close to the boundaries (Ratcliffe, 2001). The implication raised
here is that events falling on boundary street segments may be located in the wrong
area if crime events are aggregated to political units, i.e. police beats.
When redrawing police reporting districts or policing zones, agencies may want to
consider using minor streets rather than arterial roads - that are likely to generate
high volumes of crime calls - to demarcate boundaries. Additionally, agencies should
consider land use when crafting reporting districts. Keeping reporting districts
homogenous in terms of land use will help to address this problem. Care should be
taken not to include splinters of commercial property or high-density residential
property within single family housing areas.
Software algorithms
This is the critical element to answering the student's questions about why the incident
mapped to the wrong location. To understand the limitations posed by the software it
is important to take a step back in the process. Before possible candidates are pulled
from the street file, the software makes some "on-the-fly" adjustments to all addresses
(crime events and the street file).
First, target addresses (crime events) are compared to a transformation table to
identify allowable substitutions. Then, the street name is converted to a phonetic
equivalent for comparison with the reference street file. Points are assigned to the
address based on the similarity of each field of the address to each field of the street file.
Finally, a weighted cumulative score for each possible match is calculated and
normalized. The reference street segment with the highest score is selected for the
match. At this point, the location of the event address is interpolated and located along
the reference street segment. Two elements of this process require detailed treatment:
transformation tables and phonetic equivalents.
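The final interpolation step can be sketched as follows. This is a minimal illustration, assuming a straight street segment, a single address range per segment, and simple linear interpolation; commercial GIS packages also handle odd/even sides of the street, offsets from the centerline, and multi-vertex lines. The function and parameter names are hypothetical.

```python
def interpolate_address(house_number, from_addr, to_addr, start_xy, end_xy):
    """Place a house number proportionally along a straight street segment.

    from_addr/to_addr are the segment's address range; start_xy/end_xy are
    the coordinates of the segment's two end points.
    """
    if from_addr == to_addr:
        fraction = 0.5  # assumption: a one-number range plots at the midpoint
    else:
        fraction = (house_number - from_addr) / (to_addr - from_addr)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("house number falls outside the segment's range")
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical block running from 100 to 200 along a 100-unit segment.
print(interpolate_address(150, 100, 200, (0.0, 0.0), (100.0, 0.0)))
```

Under this scheme the plotted point depends entirely on the address range stored in the street file, so an inaccurate or missing range shifts or drops the event even when the street name matches perfectly.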
Transformation tables
Inconsistency in the recording of addresses is unavoidable; consequently, the software
has a built-in mechanism to "clean addresses" by allowing for substitutions for key
address components. Transformation tables help to standardize the address of the events
you wish to plot. It is similar to "data scrubbing" except the original data file is left intact
(no permanent changes are made). These substitutions are stored in transformation tables
located in program files. Table I supplies a series of example substitutions for prefixes,
directional indicators, and street type. The actual transformation table is considerably
larger and can be expanded or adjusted by the user. Use of transformation tables enables
the software user to save time during the data scrubbing phase of data collection if there
are standard, known differences between the crime event addresses and the reference
street file. Once the addresses are temporarily "clean" the software looks for suitable street
segments by street name; this is where the soundex becomes an issue.
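A small sketch of how such a substitution lookup might work is given below. The entries are a handful of rows taken from Table I; the dictionary layout, the function name, and the token-by-token strategy are assumptions made for illustration and do not reproduce the vendor's implementation.

```python
# Standard form -> accepted variants (a few rows drawn from Table I).
SUBSTITUTIONS = {
    "N": ["N", "NO", "NORTH"],
    "W": ["W", "WST", "WEST"],
    "AVE": ["AV", "AVE", "AVEN", "AVENUE", "AVNUE"],
    "BLVD": ["BD", "BL", "BLV", "BLVD", "BOUL", "BOULEVARD"],
    "ST": ["STR", "STREET", "STRT", "STEET"],
}

# Invert the table to variant -> standard form for quick lookup.
LOOKUP = {variant: standard
          for standard, variants in SUBSTITUTIONS.items()
          for variant in variants}


def standardize(address):
    """Return a cleaned copy of the address; the input string is not changed."""
    tokens = address.upper().replace(".", "").split()
    # Tokens with no entry in the table pass through unchanged.
    return " ".join(LOOKUP.get(token, token) for token in tokens)


print(standardize("123 North Main Street"))
print(standardize("45 W. Foothill Boulevard"))
```

A naive token-by-token replacement like this one would also rewrite a street that is actually named "North" or "Center", which is one way an automated cleaning step can introduce its own errors.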
Soundex properties
Identifying possible street matches, using street names (English language), involves
creating a phonetic equivalent (referred to as a soundex value) for the written spelling
of each street name (listed in the table of crime events). Then, ESRI software compares
event street name to the reference street names based upon their soundex values
instead of their literal spellings (ESRI Support Center, 2004; Drummond, 1995).
To create a soundex value, the first letter of the street name is retained, and all
occurrences of a, e, h, i, o, u, w, and y in the remainder of the street name are dropped (e.g.
"Pennsylvania" becomes "Pnnslvn"). The next three remaining letters are converted to
numbers based on the values presented in Table II. These three numbers are appended to the
first letter of the street name. To continue with the example, Pnnslvn becomes P552.
Table I. Sample of acceptable substitutions for street type, directional indicators, and suffixes

Standard    Accepted variants
Prefix/direction
W           W, WST, WEST
N           N, NO, NORTH
S           S, SO, SOUTH
NW          NW
Street types
AVE         AV, AVD, AVE, AVEFLR, AVEN, AVENIDA, AVENUE, AVNUE
BLVD        BD, BL, BLV, BLVD, BOUL, BVD, BOULEVARD
CIR         CI, CIR, CIRC, CIRCLE, CORCLE, CR, CRCL, CRCLE, CRL
CTR         CEN, CENTE, CENTER, CENTERS, CENTR, CENTRE, CENTRES, CNTR, CTER, CTR
DR          DR, DRIVE
RD          RD, ROAD
RT          ROUTE, ROUTES, RT, RTE, RTES, RTS
ST          STAPT, STBOX, STBX, STEET, STLOT, STPOBOX, STR, STRD, STREET, STREETS, STRFD, STRR, STRT
Other terms
CANYON      CN, CYN, CNYN, CANYN
HARBOR      HBR, HARB, HARBR, HRBOR
MOUNTAIN    MTN, MNTN, MNTAIN
SPRINGS     SPGS, SPNGS, SPRNGS
TUNNEL      TUNEL, TUNL, TUNNEL, TUNNL
VALLEY      VALLY, VLEY, VLLY, VLY, VLYS
According to ESRI software manuals, spaces in street names are ignored if the street
name contains multiple words (e.g. Mt Victoria, Mtvctr, M312). If the street name has
fewer than four letters following the removal of vowels, h, w and y, then zeros are added to
the end in order to produce a four-character soundex value (ESRI Support Center, 2004).
Streets that have different names but the same soundex values will be returned as
candidates for that address. For example, when looking for a match for the street name
New York Street (with a soundex value of N620), the address matching process returns
all street segments in the reference database (i.e. street file) with the name New York
Street and Newark Street (Nrk0 or N620) since they have the same soundex value.
Thus, spelling errors could change the soundex value and influence which candidates
are identified for a possible match. For example, misspelling York (New Yotk Street)
will produce a soundex value of N320 rather than N620. Given that vowels are not used
in the soundex transformation process, mistakes in consonants are of critical
importance. Also, soundex does not take directional prefixes into consideration (e.g.
North). The street segments with names that have the same soundex value as the event
to be plotted will be "flagged" in the order that the street segments appear in the
underlying data table that supports the street file[15]. Thus, geocoding with a street file
sorted in alphabetical order will potentially place different candidates in the first
position of the list of possible matches than would a street file sorted geographically.
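The soundex steps described above can be sketched in Python. This follows the simplified procedure as the text presents it (keep the first letter, drop a, e, h, i, o, u, w and y, code the next three letters, pad with zeros); it is not a reconstruction of ESRI's code. Unlike standard Soundex, it does not collapse adjacent letters with the same code, which is why "Pennsylvania" yields P552 here. The value 4 for the letter l is taken from the standard Soundex coding and is an assumption with respect to this paper.

```python
# Letter values from Table II; "l" -> 4 follows the standard Soundex coding.
LETTER_VALUES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        LETTER_VALUES[letter] = digit

DROPPED = set("aehiouwy")


def simplified_soundex(street_name):
    """Return the four-character soundex value described in the text."""
    # Keep only the letters a-z, so spaces between words are ignored.
    letters = [ch for ch in street_name.lower() if "a" <= ch <= "z"]
    if not letters:
        raise ValueError("street name contains no letters")
    first, rest = letters[0], letters[1:]
    kept = [ch for ch in rest if ch not in DROPPED]
    digits = "".join(LETTER_VALUES[ch] for ch in kept[:3])
    return first.upper() + digits.ljust(3, "0")  # pad short values with zeros


for name in ("Pennsylvania", "Mt Victoria", "New York", "Newark", "New Yotk"):
    print(name, simplified_soundex(name))
```

Running the loop reproduces the values quoted in the text: New York and Newark share N620, while the misspelling New Yotk moves to N320 and would therefore pull a different set of candidate segments.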
Points assigned
Candidates are then ranked according to the total weighted score developed that
represents the statistical probability of the street segment (based on all components of
the address) identified being the correct match of the event address (this is common to
several commercial GIS packages)[16]. Each component of an address record in the
reference theme table (aka street file) is scored based on how closely it matches the
corresponding component of the event address. Each address field has two
probabilities: the probability that the fields agree for a matched pair (m), and the
probability that the fields agree for unmatched pairs - agree at random (u). The m and
u probabilities are specific for each address style[17] (e.g. "US Streets with Zone"
address style probabilities are different than "US streets" address style). Field
probabilities are weighted[18] and this can be adjusted to reflect the importance of
matching on that component of the address[19]. The scores of all the address
components (all fields used in an address style) are combined to create the total score
(Table III) which is then "normalized" to improve interpretability; probabilities are
converted to a value out of 100[20]. Candidates with the highest score are considered
the best match. When scores are tied, the candidates will appear in the order that they
appear in the street file.
Table II. Letter values

Letters                   Number
b, f, p, v                1
c, g, j, k, q, s, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6

Table III. Points for address computation

Address components        Points
Street address
House number              10
Pre-direction             2
Pre-type                  4
Street name               26
Suffix type               6
Suffix direction          2
Zip code                  10
Zip code +4               10
Intersections (for streets 1 and 2)
Prefix direction          2
Prefix street type        4
Street name               26
Suffix name               6
Suffix direction          2
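To make the scoring step concrete, the sketch below combines the street-address component points listed above into a score out of 100 and selects the best candidate, keeping street-file order when scores tie. It is a deliberate simplification: each component is assumed to either agree (1) or disagree (0), whereas the software derives its weights from the m and u probabilities. The candidate records, field names, and function names are hypothetical, and the Zip code +4 component is left out.

```python
# Street-address component points as listed in the points table.
POINTS = {
    "house_number": 10, "pre_direction": 2, "pre_type": 4,
    "street_name": 26, "suffix_type": 6, "suffix_direction": 2,
    "zip_code": 10,
}


def normalized_score(agreement, fields=POINTS):
    """Convert per-field agreement (1 = agrees, 0 = does not) to a 0-100 score."""
    total = sum(fields.values())
    earned = sum(points * agreement.get(name, 0)
                 for name, points in fields.items())
    return 100.0 * earned / total


def best_candidate(candidates):
    """Return the top-scoring candidate; ties keep street-file order."""
    # max() returns the first maximal item, so earlier rows win ties.
    return max(candidates, key=lambda c: normalized_score(c["agreement"]))


all_agree = {name: 1 for name in POINTS}
wrong_type = dict(all_agree, suffix_type=0)
candidates = [
    {"segment": "A", "agreement": wrong_type},
    {"segment": "B", "agreement": wrong_type},  # ties with A
]
print(normalized_score(wrong_type), best_candidate(candidates)["segment"])
```

In this toy example a wrong street type costs 6 of 60 points, so both segments score 90 and the first one in file order is returned, which mirrors the tie behavior described in the text.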
Recall that software users can set the minimum match score when setting the
geocoding preferences; this raises or lowers the score that potential candidates must
achieve to be considered a partial match. This does not impact on what is considered a
"good match." To be a good match, the candidate must achieve a combined score of 75
for ArcView and 80 for ArcGIS. This means that adjusting the geocoding preferences
higher than 75 will result in more partial matches and nonmatches - all partials are
moved into the nonmatch category. Lowering the user-defined preference will generate
more partial matches that will be automatically accepted, unless otherwise directed,
when the process is complete.
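One reading of these threshold rules is sketched below. The good-match thresholds (75 for ArcView, 80 for ArcGIS) come from the text; the function name, the category labels, and the exact handling of scores that equal a threshold are assumptions made for illustration.

```python
# Good-match thresholds reported in the text.
GOOD_MATCH = {"ArcView": 75, "ArcGIS": 80}


def classify(score, min_match_score, software="ArcView"):
    """Label a candidate score as a good match, partial match, or no match."""
    if score < min_match_score:
        return "no match"        # below the user-defined minimum
    if score >= GOOD_MATCH[software]:
        return "good match"      # fixed threshold, not affected by preferences
    return "partial match"       # accepted only because the minimum allows it


for minimum in (60, 75):
    print(minimum, classify(70, minimum), classify(90, minimum))
```

With the minimum at 60, a score of 70 is kept as a partial match; raising the minimum to 75 turns the same score into a non-match, while a score of 90 is a good match in both cases.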
At the conclusion of the address matching process, four columns are appended to
the end of the crime event file: "AV_ADD", "AV_STATUS", "AV_SCORE", and
"AV_SIDE". The AV_ADD variable contains the address that was used (i.e. if you
manually adjusted the crime event address during the process this would be the
adjusted address). The AV_STATUS column includes coding for whether the event
was matched ("M"), unmatched ("U"), or if the address had two or more candidates
with the best score ("T" for tied) signaling that the first in the list was used. The
AV_SCORE is the total score for the candidate selected (where the crime event plotted).
The final variable, AV_SIDE, informs the user which side of the street the event was
plotted to: "R" for right and "L" for the left side of the street.
Practical implication 5. Exploring these columns will provide useful information
about the potential error introduced into the analysis by the matching process.
Developing a ratio between the number of T cases in relation to M cases will convey
the degree of potential geographic error when comparing various maps.
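A quick way to produce these summary figures from the status column is sketched below. It assumes the status codes have already been read into a Python list, and it treats tied (T) cases as plotted because the first candidate in the list is used; the function name and the sample values are hypothetical.

```python
from collections import Counter


def status_summary(statuses):
    """Summarize M / T / U status codes from the geocoded event file."""
    counts = Counter(code.strip().upper() for code in statuses)
    matched, tied, unmatched = counts["M"], counts["T"], counts["U"]
    total = matched + tied + unmatched
    return {
        "matched": matched,
        "tied": tied,
        "unmatched": unmatched,
        # Tied cases are plotted to the first candidate, so they count as hits.
        "hit_rate": 100.0 * (matched + tied) / total if total else None,
        "tied_to_matched": tied / matched if matched else None,
    }


# Made-up batch: 80 matched, 10 tied, 10 unmatched.
print(status_summary(["M"] * 80 + ["T"] * 10 + ["U"] * 10))
```

Reporting the tied-to-matched ratio next to the hit rate lets a reader see how much of the plotted data rests on an arbitrary first-in-list choice.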
Figure 3. Address matching process
Note: Width of the arrow represents the amount of sampling bias
introduced to the outcome data file available for geographic analyses
determination lead to secondary research questions: how much of a problem (bias) is
likely to be introduced by this systematically missing data; and, what information
must scholars report to inform research consumers about the bias that may have been
introduced? Answering these queries required testing the software.
Methodology
To figure out which address components most impact this process and the nature of
the sampling error typically introduced, this inquiry set out to adjust user-defined
settings while geocoding three different datasets in ArcView 3.3 and ArcGIS. Each
dataset was geocoded at different preference settings to determine the accuracy of the
plotted points. Plotted points were then compared to actual locations to determine the
accuracy of the position. In total, 76 tests were conducted.
Data
Three distinct datasets with 100 cases each were generated. The first was a fixed
comparison of one street in Corona, California. The second was a random sampling of
addresses from a data set for the city of Corona. The third analysis was a regional test
of four cities that contain a main street with the same name and similar street ranges.
All datasets were address matched with the US Census Bureau TIGER/Line® street file
from 2000.
Fixed. In total, 100 addresses were chosen for the street, Ontario Ave in the SW
section of Corona, CA. To identify verified (city registered) addresses, a section of
parcels (one block with 13 street segments) located in the downtown section of Corona
was selected from the Riverside County web site[22]. This dataset approximates a
clean data file with no anticipated errors in the "crime data" as the addresses were
drawn from an electronic, georeferenced file of parcels. To ensure the data were clean,
street names listed in the street file for these segments were checked to ensure that the
event address street names matched the street file. Using addresses with a common
street name permitted the analysis to control for various sources of error (e.g. spelling
issues) in order to assess the impact of software parameters on the matching process.
This provides a baseline estimate of possible error assuming the street file and the
address file are clean[23].
Random. The second dataset included addresses for the city of Corona and two
unincorporated areas (Home Gardens and El Cerrito). Using an existing database for
the city of Corona, a random sample was chosen from probation case files using
SPSS (George and Mallery, 2001). This dataset approximates the quality of data
extracted from the typical law enforcement RMS. It was expected that this dataset
would contain some inaccuracy in the address fields. Reasonably one would expect
that probation records would contain accurate information about the home address
of clients; however, close inspection of juvenile home addresses recorded for 4,000
juveniles residing in a county in Southern California revealed that probation address
recording practices suffer the same quality issues as typical crime data; there are
many cases of: incomplete addresses, inconsistent street name abbreviations,
inverted address ranges, etc. Additionally, juvenile delinquents are mobile; they
move frequently between communities and guardians, so address fields require
constant updating.
Regional. To generate a regional dataset, 25 addresses were selected from four
different cities - Claremont, Rialto, and San Bernardino (in San Bernardino County,
CA), and Upland (a bordering city in Los Angeles County) - for an East/West
thoroughfare with substantial crime and public safety issues. To begin the address
creation, 25 addresses on Foothill Boulevard in the City of San Bernardino were
generated from a map commonly used in this region (the Thomas Bros. Guide 2003
edition). Then, addresses within similar ranges that exist in the three other cities (on
the same street) were produced. There was approximately an even number of East and
West addresses. The actual locations were not visited to verify the presence of a
structure. Of note, the street names recorded were compared against the street file to
ensure that inconsistencies in street names did not generate error. The intent of using
this data file was primarily to examine the impact of using zip codes.
Geocoding process
By examining the addresses geocoded at different user defined parameters, it is
possible to assess the impact of altering the geocoding preferences for the spelling
sensitivity, match rate, and candidate controls respectively[24]. The ESRI default
setting of the geocoding preferences was 80/60/30. In addition, tests were also done at
levels below the default (10/10/10 through 60/60/60) and above (70/70/70 through to
exact match of 100 all)[25].
Results
Each batch of geocoded and non-matched points was examined to verify that all
addresses were mapped to the correct place by confirming that the point was placed in
the correct range and segment of the street[26]. Addresses that were not correctly
matched or did not match were then examined to identify what went wrong. As
anticipated, important variations were found with each dataset (see Tables IV and V).
Table IV. Comparison of matching results for fixed, random and regional datasets (ARCVIEW)
Table V. Comparison of the matching results for fixed, random and regional datasets (ARCGIS)
Random dataset
Analysis of the random addresses revealed a greater degree of missing cases and error.
For ArcView, addresses that had missing or incorrect information about the street type
would map only if the geocoding preferences were set at 60/60/60 or lower. However,
examination of match accuracy indicated that many were plotted in the wrong location.
Partially matched addresses were plotted on the wrong street segment due to missing
or incorrect street type and name similarities (i.e. Winter Rd mapped to Wonderview
Rd, and Blue Mesa Court mapped to Blue Diamond).
Moving from this setting to parameters of 70/70/70 led to the "no match" of 20-21
addresses due to incorrect suffix information. However, the plotted events were
mapped correctly. This means that the addresses that plotted incorrectly due to
incorrect or missing street types, name similarity issues (similar soundex values), and
problematic directional prefixes at the minimum settings became missing cases.
Raising to 70 all eliminated error.
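The name-similarity errors described above can be illustrated with the standard American Soundex code. The sketch below is a generic implementation and not ESRI's; whether the software uses exactly this variant is an assumption. It shows that "Winter" and "Wonderview" reduce to the same four-character code, which is consistent with the mis-plotting reported at the relaxed settings.

```python
def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after a vowel is kept
    return (result + "000")[:4]


# Street names taken from the example in the text.
print(soundex("Winter"), soundex("Wonderview"))  # W536 W536
```

Because both names share one code, a matcher that relies on phonetic similarity and accepts low scores has no way to tell them apart without the street type or other address components.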
Raising preferences to 80/80/80 and higher led to the loss of matched addresses
that were plotted to the correct place; this resulted from incorrectly spelled street
names, street type problems, or prefix issues. All plotted events were correctly located.
Increasing settings to a near perfect match (90 all and 100 all) led to the loss of
additional events that would have mapped correctly at the 70 all setting due to minor
spelling issues. Spelling errors that changed match scores include:
* A wrong letter in a word scored at 81, 82, or 87 depending on the letter.
* An extra letter in a name led to a match score of 92.
* A missing letter at the end of a street name resulted in a match score of 94.
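The interaction between these scores and the minimum match score can be sketched in a few lines of Python. The scores are those reported above; the labels, the function name, and the rule that a candidate is accepted when its score is at or above the threshold are assumptions made for illustration, not a description of ESRI's internal logic.

```python
# Match scores reported in the text for single spelling errors.
# The acceptance rule (score >= threshold) is an illustrative assumption.
SPELLING_ERROR_SCORES = {
    "wrong letter (lowest reported)": 81,
    "wrong letter (middle reported)": 82,
    "wrong letter (highest reported)": 87,
    "extra letter": 92,
    "missing final letter": 94,
}


def accepted_at(threshold, scores=SPELLING_ERROR_SCORES):
    """Return the error types whose score meets the minimum match score."""
    return sorted(name for name, score in scores.items() if score >= threshold)


for threshold in (70, 80, 90, 100):
    print(threshold, len(accepted_at(threshold)), "of", len(SPELLING_ERROR_SCORES))
```

Under this simplified rule, every listed error survives at 70 and 80, only the extra-letter and missing-letter cases survive at 90, and none survive an exact-match setting.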
Tests of ArcGIS software revealed a similar pattern, though the overall number of
non-matching cases differs. A higher number of addresses were classed as partial
matches; six addresses changed from matches to partial matches as
settings were raised from ten all to 80/70/70. In addition, cases that did not map in ArcView were
mapped in ArcGIS. ArcGIS was able to map more events (nonmatches from ArcView
were classed as partial matches in ArcGIS). This shows that there are slight differences
in the algorithms built into these software products. ArcGIS is more sensitive to the
absence of prefixes and slightly less sensitive to street name issues.
Regional dataset. The initial test of the regional dataset showed that 100 percent of
the addresses matched at all preference settings tested. However, because the initial tests
used US street address only, several problems were found. Most cases mapped in one
or two cities (rather than 25 percent in each city) and only 21 percent of the addresses
actually mapped to the correct location.
Adding the correct zip code did not reduce the percent matching (100 percent of
addresses matched at ten all to 80 all). However, 10 percent mapped incorrectly due to
missing directional prefixes (i.e. E or W). At 90 all, the 10 percent that had mapped
incorrectly at the lower settings were rendered incapable of mapping. These cases
could not map during a batch match due to incorrect and missing directional prefixes
in the street file. ArcGIS software produced similar results. When cities contain streets
that change direction (e.g. N/S running street turns a bend and begins to run E/W) or
crime data includes cases from other jurisdictions (e.g. general assists to other agencies
or multi-jurisdictional crime crackdowns), zip codes and directional prefixes must be
correct if their use is going to improve match rates.
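The gap between the percent matched and the percent placed correctly can be made concrete with a short calculation. The sketch below uses the regional figures reported above (100 addresses, all matched; 21 correctly placed with street address only, and 90 correctly placed once zip codes were added at the lower settings). The function and key names are hypothetical.

```python
def match_summary(total, matched, correctly_placed):
    """Return hit rate and placement accuracy as percentages."""
    if total <= 0 or not (0 <= correctly_placed <= matched <= total):
        raise ValueError("need total > 0 and 0 <= correctly_placed <= matched <= total")
    wrong = matched - correctly_placed
    return {
        "hit_rate": 100.0 * matched / total,                  # percent of cases plotted
        "correct_of_total": 100.0 * correctly_placed / total,  # percent plotted in the right place
        "wrong_of_matched": 100.0 * wrong / matched if matched else 0.0,
    }


# Figures reported for the regional dataset.
street_only = match_summary(total=100, matched=100, correctly_placed=21)
with_zip = match_summary(total=100, matched=100, correctly_placed=90)
print(street_only)
print(with_zip)
```

Both runs report a 100 percent hit rate, yet the share of points in the wrong place differs sharply, which is why a hit rate on its own says little about data quality.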
Discussion
Drummond (1995) presents three strategies for dealing with address matching issues:
* Strategy A. Ignore all records that are not perfect matches.
* Strategy B. Include perfect and the most reliable partial matches.
* Strategy C. Use perfect matches and every possible partial match (regardless of
reliability).
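The three strategies can be expressed as simple selection rules over geocoding results. The sketch below is illustrative only: the record structure, the sample addresses, and the cutoff of 70 used to stand in for "most reliable" are all assumptions, since Drummond's strategies do not prescribe a numeric threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeocodeResult:
    address: str
    score: Optional[int]  # best candidate score (0-100), or None if no candidate was found


def select_records(results: List[GeocodeResult], strategy: str,
                   reliable_cutoff: int = 70) -> List[GeocodeResult]:
    """Keep the records that a given strategy would place on the map."""
    if strategy == "A":    # perfect matches only
        keep = lambda r: r.score == 100
    elif strategy == "B":  # perfect matches plus partial matches at or above the cutoff
        keep = lambda r: r.score is not None and r.score >= reliable_cutoff
    elif strategy == "C":  # every record with any candidate at all
        keep = lambda r: r.score is not None
    else:
        raise ValueError("strategy must be 'A', 'B' or 'C'")
    return [r for r in results if keep(r)]


# Hypothetical records for illustration only.
sample = [
    GeocodeResult("100 E Foothill Blvd", 100),
    GeocodeResult("250 Foothil Blvd", 94),
    GeocodeResult("12 Winter", 55),
    GeocodeResult("9999 Unknown Pl", None),
]
counts = {s: len(select_records(sample, s)) for s in "ABC"}
print(counts)
```

Moving from A to C trades missing cases for a growing risk of misplaced points, and the cutoff chosen for Strategy B decides where along that tradeoff an analysis sits.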
To illustrate the costs associated with each strategy Drummond (1995) estimated the
degree of bias and error that could reasonably be expected from each. Table VI presents
the theoretical estimates generated by Drummond compared to what was found in this
study for the random dataset (as this is the type most crime analysts work with).
Selecting a strategy
Using only perfect matches is not a feasible strategy in the world of crime analysis
given the degree of clerical error in the address name that typically exists. It would
generate far too many missing cases that may otherwise map somewhat correctly.
Further, going through a manual interactive matching process on a case-by-case basis
with all of the non-matching items is prohibitive; in many situations, this would
involve thousands, if not tens of thousands, of cases. The third strategy, of using every
possible match, is also not advisable in a criminal justice setting given the level of error
introduced; limited resources may be misallocated thereby impacting on public safety.
Thus, the remaining strategy is to take exact matches and the most reliable partial
matches. The question remaining is what is a reliable partial match?
One of the first responses often suggested to those learning how to geocode events
against addresses in a street file when faced with a high number of "no match" and
"partial match" cases is to adjust the geocoding preferences as suggested by ESRI
(Ormsby et al., 2004). It is a reasonable and logical step given the likelihood of small
clerical errors associated with addresses (i.e. Block and Block, 1995) that may be
overcome with slightly relaxed matching criteria. This can be a viable tradeoff given
that the systematic loss of data is critical to the strength of crime pattern modeling; the
extensive "listwise" deletion of cases due to matching criteria may introduce undue
bias into the analysis. However, the results of this exploration suggest that geocoding
preferences should not be raised or lowered beyond a clearly identifiable "optimal
threshold". Lowering geographic preferences (or relaxing the sensitivity of matching
criteria) will radically reduce the number of accurately matched events, leading to
Concluding remarks
A shockingly large void exists within research looking into the spatial properties of
crime distribution. To date, while data quality issues were raised previously, there has
been virtually no public discussion of the importance of systematically missing data
and how analysts must account for this limitation to the validity and reliability of
geographic crime analysis.
Missing data is a problem that all research must deal with. Convention suggests that
when 15 percent or more of the data for a variable are missing, estimation methods -
that work fairly well when a few cases have missing information - may introduce undue
bias into the analysis. While public debate on the issue of systematically missing data
Notes
1. Geocoding is the term given to describe various procedures used to place tabular
information, in this case crime data, onto an electronic map (Harries, 1999).
2. While some suggest that achieving an 80 percent match rate is satisfactory (Blosser, 1995),
matching at least 90 percent is a more widely accepted standard due to bias introduced by
missing data.
3. A recent impromptu survey of active crime analysts subscribing to the list serve of NIJ's
Mapping and Analysis for Public Safety program (formerly the Crime Mapping Research
Center), showed that the average geocoding hit rate among 43 different agencies was 87.5
percent with a standard deviation of 14 percent (Ratcliffe, 2002b). This survey found that
approximately 33 percent of respondents did not achieve the convention of 90 percent
geocoding success on a regular basis. Ratcliffe also noted that the hit rate was dependent on
the crime type with domestic crimes plotting better than other crimes (e.g. street crime)
(Ratcliffe, 2002b).
4. Clearly, the quality of tabular crime data and the reference street file is the most important
source of address matching problems. This issue has already been discussed at length; as
such, this topic will only be touched on briefly at this juncture.
5. Geographic precision is impacted greatly by interpolation algorithms; this is a complex and
important issue that is beyond the scope of the present paper.
6. While geographers adept at using GIS software for geostatistical analyses are familiar with
issues raised by this interaction effect, few criminologists have thought carefully about this
problem, as evidenced by the lack of methodological details in published work.
Further, discussions with crime analysts working within police agencies together with
examination of analyses produced by these applied researchers suggest that this problem is
foreign to many, if not most, crime analysts. What geographers specializing in geographic
information science know about data quality and software peculiarities must be imported
into the criminal justice arena if geostatistical tools are to be used properly.
7. Parcel matching - which involves linking events to actual polygons or parcels - and event
plotting with latitude and longitude coordinates are not discussed here. These techniques are
much more accurate and are much less likely to be used by the typical crime analyst as the
data captured by most records management systems (RMS) do not have the capacity to
permit the use of these geocoding procedures.
8. For example, the Office of Community Oriented Policing Services (COPS) was awarded
grants in excess of $7 billion over the past nine years for start up and operational costs.
In addition, various organizations have provided software training including the Police
Foundation, JUSTNET's Crime Mapping and Analysis Program (CMAP), COPS, NIJ's Crime
Mapping Research Center (CMRC), and Carolinas Institute for Community Policing (CICP).
9. Address matching is the most prevalent form of geocoding in the US because law
enforcement agencies fairly consistently have a street address or other locational attribute
(i.e. the name of a store which can be linked to a street address), for crime incidents - this is a
byproduct of the introduction of CAD systems.
10. The US Census Bureau is aware that the file is not 100 percent accurate. Consequently, the
Census Bureau is currently in the initial stages of a multi-year project in updating the
features in the file (US Census Bureau, 2004).
11. This locational bias will offset or distort micro-level or large-scale geographic distributions
of crime patterns.
12. Alterations to the street file can be made so that the first part of the street name is in the
pretype column, with significant increases in the geocoding hit rate without cleaning the
address fields.
13. A number of surveys have queried law enforcement crime analysts across the country about
their crime analysis activity and use of GIS software. A Police Foundation (2000) survey of
51 law enforcement agencies receiving COPS MORE grant funding, found that 71 percent
use or will be using ESRI software (ArcView or ArcInfo) to analyze crime data. This finding
is similar to the results of an earlier study by the National Institute of Justice (Mamalian and
La Vigne, 1999). A broader study using two national samples found that 61 percent of large
agencies and 15 percent of small agencies mapped crime (O'Shea and Nicholls, 2002). And,
only about 6.7 percent of large agencies and 5 percent of small agencies use special hotspot
software like STAC (O'Shea and Nicholls, 2002) suggesting that most use the built in
functions of their GIS software.
14. It is also possible to set the software to permit manual examination of all situations where
more than one possible match was found. This is an optional setting in ESRI products and a
default setting in MapInfo.
15. For this reason the underlying organization of the reference street file will impact on the
address matching process. While no ESRI materials revealed this, careful exploration of
geocoding results demonstrated this feature.
16. This information was gathered from the help topics files of the ArcView program.
17. Address styles differ in relation to the placement and number of address components
required. For example, the US streets address style simply includes the conventional street
address minus the unit number, city, and zip codes; whereas, the US streets with zone
includes the street address and the zip code but excludes the unit number and city.
18. When pairs agree, the agreement weight, $\log_2(m/u)$, is applied to the probability for the field.
If pairs disagree, the disagreement weight involves a negative value, $\log_2[(1 - m)/(1 - u)]$.
Simply, agreements add to the composite score and disagreements subtract from it. For
example, the m and u probabilities of a directional indicator (for ArcView using the US
streets with Zone address style) are 0.8 and 0.1 respectively, thus the probability for the
directional indicator when there is agreement between the fields is $\log_2(0.8/0.1) = 3$. If the
fields disagree, the weighted probability used is $\log_2[(1 - 0.8)/(1 - 0.1)] = -2.17$.
19. However, the value must remain between 0.0 and 1.0.
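The agreement and disagreement weights described in note 18 can be checked numerically. The sketch below is a generic calculation of the two log-ratio weights, using m = 0.8 and u = 0.1, the values that reproduce the reported weights of 3 and -2.17. The function names are hypothetical, and the range check reflects the requirement that the probabilities lie between 0.0 and 1.0.

```python
import math


def _check(m, u):
    # Both probabilities must lie strictly between 0 and 1 for the logs to be defined.
    if not (0.0 < m < 1.0 and 0.0 < u < 1.0):
        raise ValueError("m and u must lie strictly between 0.0 and 1.0")


def agreement_weight(m, u):
    """Weight added to the composite score when the field values agree."""
    _check(m, u)
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight (negative whenever m > u) applied when the field values disagree."""
    _check(m, u)
    return math.log2((1.0 - m) / (1.0 - u))


# Directional indicator example from note 18.
m, u = 0.8, 0.1
print(round(agreement_weight(m, u), 2))
print(round(disagreement_weight(m, u), 2))
```

With these inputs, an agreeing directional prefix adds about 3 to the composite score and a disagreeing one subtracts a little over 2, which helps explain why missing or wrong prefixes shift so many records between match classes.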
References
Bichler, G. (2004), "Acknowledging the influence of geographic outliers: an analytic caution in
examining chronic repeat victimization", Security Journal, Vol. 17 No. 1, pp. 31-49.
Bichler-Robertson, G. and Conley, J. (2003), "Se Habla Espanol? Reconciling geocoding conflict
between census street files and ESRI software", Crime Mapping News, Vol. 5 No. 2,
pp. 10-13.
Block, R. (1995), "Geocoding of crime incidents using the 1990 TIGER file: the Chicago example",
in Block, C.R., Dabdoub, M. and Fregly, S. (Eds), Crime Analysis Through Computer
Mapping (189-193), Police Executive Research Forum, Washington, DC.
Block, R.L. and Block, C.R. (1995), "Space, place, and crime: hot spot areas and hot places of
liquor-related crime", in Eck, J.E. and Weisburd, D. (Eds), Crime Prevention Studies
(145-197), Willow Tree Press, Inc, Monsey, New York, NY.
Blosser, M. (1995), "Modeling address ranges", Proceedings of the Fifteenth Annual ESRI User
Conference, available at: http://gis.esri.com/library/userconf/proc95/to300/p258.html
Boba, R. (2001), Introductory Guide to Crime Analysis and Mapping, Office of Community
Oriented Policing Services, US Department of Justice, Washington, DC.
Drummond, W.J. (1995), "Address matching", Journal of the American Planning Association,
Vol. 61 No. 2, pp. 240-51.
ESRI Support Center (2004), available at: http://support.esri.com (accessed June 2002 to June
2004).
George, D. and Mallery, P. (2001), SPSS for Windows Step by Step: A Simple Guide and Reference,
Allyn and Bacon, Boston, MA.
Groff, E.R. and La Vigne, N.G. (2001), "Mapping an opportunity surface of residential burglary",
Journal of Research in Crime and Delinquency, Vol. 38 No. 3, pp. 257-78.
Harries, K. (1999), Mapping Crime: Principle and Practice, National Institute of Justice, Office of
Justice Programs, Washington, DC.
La Vigne, N. and Wartell, J. (2001), "Tracking gang activity in Orange County, California",
Mapping Across Boundaries, Police Executive Research Forum, Washington, DC,
pp. 25-40.
McCleary, R., Nienstedt, B.C. and Erven, J.M. (1982), "Uniform crime reports as organizational
outcomes: three time series experiments", Social Problems, Vol. 29 No. 4, pp. 361-72.
Mamalian, C. and La Vigne, N. (1999), "National Institute of Justice Research Preview", The Use
of Computerized Crime Mapping by Law Enforcement: Survey Results, National Institute of
Justice, Washington, DC.
Martin, D. and Higgs, G. (1997), "Population georeferencing in England and Wales: basic spatial
units reconsidered", Environment and Planning A, Vol. 29, pp. 333-47.
Mills, J. (1999), "Tain't necessarily so: address geocoding in the real world", Proceedings of the
Nineteenth Annual ESRI User Conference, available at:
http://gis.esri.com/library/userconf/proc99 (accessed in June 2003).
Myrtveit, I., Stensrud, E. and Olsson, U. (2001), "Analyzing data sets with missing data:
an empirical evaluation of imputation methods and likelihood-based methods", IEEE
Transactions on Software Engineering, Vol. 27 No. 11, pp. 999-1014.
O'Shea, T. and Nicholls, K. (2002), Crime Analysis in America, Center for Public Policy,
University of South Alabama, Mobile, AL.
Ormsby, T., Napoleon, E., Groess, C. and Feaster, L. (2004), Getting to Know ArcGIS Desktop,
ESRI Press, Redlands, CA (updated for ArcGIS 9).
Pease, K. and Laycock, G. (1996), Reducing the Heat on Hot Victims, Bureau of Justice Statistics,
Washington, DC.
Police Foundation (2000), Integrating Community Policing and Computer Mapping: Assessing
Issues and Needs Among COPS Office Grantees, Crime Mapping Laboratory, Washington,
DC.
Ratcliffe, J. (2001), "On the accuracy of TIGER-type geocoded address data in relation to
cadastral and census areal units", International Journal of Geographic Information Science,
Vol. 15 No. 5, pp. 473-85.
Ratcliffe, J. (2002a), "Damned if you don't, damned if you do: crime mapping and its implications
in the real world", Policing and Society, Vol. 12 No. 3, pp. 211-25.
Ratcliffe, J. (2002b), personal communications listed on the Crime Mapping Listserve, results of
an informal survey of geocoding hit rates by active crime analysts.
Ratcliffe, J. (2003), The Hotspot Matrix as a Framework for the Spatio-Temporal Targeting of
Crime Reduction, paper presented at the 11th International Symposium on Environmental
Criminology and Crime Analysis, Cincinnati, OH, June 19-21.
Ratcliffe, J. (2004), "The Hotspot Matrix: a framework for the spatio-temporal targeting of crime
reduction", Police Practice and Research, Vol. 5 No. 1, pp. 5-23.
Ratcliffe, J. and McCullagh, M. (1998), "Identifying repeat victimization with GIS", British Journal
of Criminology, Vol. 38 No. 4, pp. 651-62.
Townsley, M., Homel, R. and Chaseling, J. (2000), "Repeat burglary victimisation: spatial and
temporal patterns", The Australian and New Zealand Journal of Criminology, Vol. 33 No. 1,
pp. 37-63.
US Census Bureau (2004), TIGER®, TIGER/Line® and TIGER-Related Products,
www.census.gov/www/tiger/index.html (accessed repeatedly June 2002 to June 2004).
Verma, A. and Lodha, S.K. (2002), "A topological representation of the criminal event", Western
Criminology Review, Vol. 3 No. 2, available at: http://wcr.sonoma.edu/v3n2/verma.html
Further reading
La Vigne, N., Fleury, J.K. and Szakas, J. (2000), "Auto theft and detecting chop shop locations",
in Turnbull, L.S., Hendrix, E.H. and Dent, B.D. (Eds), Atlas of Crime: Mapping the Criminal
Landscape, Oryx, Phoenix, AZ, pp. 60-7.
Sherman, L.W., Gartin, P.R. and Buerger, M.E. (1989), "Hot spots of predatory crime: routine
activities and the criminology of place", Criminology, Vol. 27 No. 1, pp. 27-55.
Corresponding author
Gisela Bichler can be contacted at: gbichler@csusb.edu