
Citation:
Gisela Bichler; Stefanie Balchak, Address Matching
Bias: Ignorance is Not Bliss, 30 Policing: Int'l J.
Police Strat. & Mgmt. 32 (2007)
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/1363-951X.htm

PIJPSM 30,1

Address matching bias: ignorance is not bliss
Gisela Bichler
Center for Criminal Justice Research, California State University,
San Bernardino, California, USA, and
Stefanie Balchak
Chula Vista Fire Department, Chula Vista, California, USA

Abstract
Purpose - The purpose of this paper is to show that despite the critical importance of using accurate
data when identifying geographic patterns and studying hotspots, few have explored the data quality
issues introduced by Geographic Information Systems (GIS) software applications. While software
manufacturers provide some information about the address matching process, critical details are left
out or are buried in technical, and sometimes proprietary, jargon. The purpose of this paper is to
address these issues.
Design/methodology/approach - The paper demonstrates, with three datasets of 100 cases each,
how the assumptions built into popular GIS software produce systematically missing data during the
data importing process commonly referred to as address matching.
Findings - The inclusion of directional indicators and zip codes is more important than previously
thought. The results highlight the critical need to provide complete descriptions of research
methodology. All geographic analyses must be accompanied with: information about the hit rate
(percent of cases plotted), details about the software and process used to import tabular crime data,
information about the software parameters set for the importation process (geocoding preferences),
reference information about the street file used; and, an examination of the missing cases to identify
some of the sampling error. When forecasting crime issues or identifying hot spots, analysts must be
cognizant of the differential impact this bias will have on the generalizability of the results.
Originality/value - The paper explores previously neglected issues in data quality introduced by
GIS software applications.
Keywords Bias, Geographic Information Systems
Paper type Research paper

Policing: An International Journal of Police Strategies & Management
Vol. 30 No. 1, 2007, pp. 32-60
© Emerald Group Publishing Limited, 1363-951X
DOI 10.1108/13639510710725613

Introduction
One of the fundamental tenets of the scientific method is that a detailed discussion of
the research methodology, including procedures used, summary data, and research
limitations, must accompany all results. This is necessary to evaluate findings, enable
replication and ultimately, to advance the development of knowledge. For example, all
survey research must report the response rate associated with the sample generation
process. As simple as this tenet may appear, scholars examining the geographic
properties of crime have faltered on this basic principle. While the proprietary concerns
of Geographic Information System software vendors have contributed to this situation,
it is not an excuse for bad science. Currently, scholars and crime analysts routinely
neglect to report the "response rate" or other procedural and analytic parameters
rendering it near impossible to accurately interpret study results or assess the
generalizability of conclusions reached.
The research initiative presented here began as a question, raised by an
undergraduate student learning how to plot crime events using GIS software. The
student's question went something like this: "How exactly is the computer selecting
street segments in order to map my event locations? Because, the addresses are written
correctly and I checked, they match the street file, but the computer put the dot on the
wrong street." After careful examination, we found that the student was indeed correct: for some
reason the dots were not on the correct street segments. Little did we know that it
would take a great deal of effort to figure out exactly what the computer was doing.
And, in answering this question, a surprising revelation emerged with regard to the
rigor of scholarly reports about the geographic analysis of crime.
It is fair to say that most geographic analysis of crime relies on incomplete data. In
fact, an informal survey of practicing crime analysts showed that many analysts used
crime data that do not achieve geocoding[1] hit rates above the conventionally accepted
90 percent[2] threshold (Ratcliffe, 2003)[3]. Given that self-report and victimization
studies (particularly in reference to commercial burglary) show that only a small
fraction of the total crime actually is recorded in official data sources (e.g. Pease and
Laycock, 1996; Ratcliffe and McCullagh, 1998; Townsley et al., 2000), the starting point
of using "all known crime events" extracted from a records management system, is
already a non-random sample of crime incidents. Furthermore, the degree of sampling
error is unknown. Having less than 90 percent of a non-random sample, particularly
when that missing 10 percent is systematically missing, becomes an important
methodological limitation that may not be remedied by imputation methods (Myrtveit
et al., 2001). The fact that law enforcement policies are increasingly grounded on
rudimentary analysis of datasets that are systematically missing a great deal of data
heightens the importance of this topic; systematically missing 10 percent of 350,000
incidents is a serious analytic limitation. Until more is known about what makes these
cases systematically missing, statistical adjustments to improve analytic reliability are
not possible.
What do we currently know about factors that render crime data "systematically"
unmappable? Prior discussions highlighted: the limitations posed by the quality of
police data in relation to data entry (e.g. Harries, 1999; Block and Block, 1995)[4]; the
imprecision introduced by GIS based plotting of events compared with actual property
boundaries (Ratcliffe, 2001; Martin and Higgs, 1997)[5]; and more recently, that
analyses to identify crime concentration are affected by the number of repeat
victimizations captured in the police data (Bichler, 2004). Together this body of work
suggests that fluctuation in crime incidents available for analysis is generated by
factors that may be unrelated to the volume of crime occurrence; and, this can radically
alter the crime patterns identified (Bichler, 2004; Ratcliffe, 2001; Harries, 1999; and
Block and Block, 1995). Moreover, these examples support the assertion that a closer
inspection of the process by which crime data (in tabular form) is translated into
spatially referenced data in GIS software is warranted to explore the effects of
interactions between GIS software and data quality.
Little is known about the data quality issues generated by the algorithms and
default settings of software. Published scholarship offers few details as to how GIS
software actually selects addresses from a reference street file in order to map data.
And while software manufacturers provide some information about the process in user
manuals, critical details are left out or are buried in technical, and sometimes
proprietary, jargon. This is important because the address matching hit rate is vital to
the quality of data made available for analysis. While the effects of missing data and
measurement error are widely understood and acknowledged by criminologists using
traditional statistics, when similar issues impact on geographic data, these issues are
often neglected[6].
This report details the results of an inquiry into how exactly the software
selects candidates from the street file upon which to place events during the address
matching process[7]. In doing so, it provides many neglected details and demonstrates
how data quality and software algorithms interact to affect the address matching
process. Secondary research questions included: how much of a problem (bias) is likely
to be introduced by this systematically missing data; what features of the address lead
to non-matching cases; what information must scholars report to inform research
consumers about the bias that may have been introduced; and what implications does
this bias raise for hot spot identification and forecasting? This detailed treatment of
address matching is necessary as many important details are lost during the
interdisciplinary exchange of analytic techniques. Geographers adept at using various
geographic statistics are well aware of these issues; criminologists and crime analysts,
on the other hand, are only beginning to fully appreciate the implications of this
problem.

Address matching
Many issues associated with the geographic analysis of crime (i.e. how to obtain
funding, improve data quality at entry, and train staff on software) have been
addressed[8]; yet, few have examined fundamental issues associated with the
geocoding process. While great strides have been made in the accuracy and availability
of GPS systems that enable the use of more precise latitude and longitude coordinates
to plot locations, address matching[9] is still the most frequently used geocoding
process among criminal justice users of GIS.
Address matching requires the assembly of crime data into a spreadsheet with the
street address written in a single column in this order: prefix (address number), space,
directional indicator (i.e. N for North), space, street name, space, and suffix (i.e. avenue
or street). When multiple areas are being mapped at the same time, another variable
(i.e. zip code, city, county, etc.) is necessary. Crime event addresses are then compared
with a source database including the entire street network of a jurisdiction. Drummond
(1995) suggests that there are four possible outcomes in address matching:
(1) Correct matches include occasions where events are plotted on street segments
that contain the address in question.
(2) Unmatched records are those situations wherein no street segment exists in the
street file that matches the event address.
(3) Incorrect matches (false positives) occur when the crime is located on the wrong
street segment.
(4) Non-match outcomes (false negatives) involve events that should have matched
but did not.
The address matching completion rate records the number of events plotted; it is
referred to as the hit rate. Hit rates include the number, or percent, of correct and
incorrect matches - all events that plotted.
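
To make the hit rate definition concrete, the short Python sketch below tallies the four outcomes and computes the hit rate as the share of events that plotted. The outcome labels, the function name, and the example counts are illustrative assumptions; they do not come from any GIS package or from the study data.

```python
from collections import Counter

# Hypothetical labels for Drummond's (1995) four outcomes.
PLOTTED_OUTCOMES = {"correct_match", "incorrect_match"}


def hit_rate(outcomes):
    """Return the percent of events that plotted (correct + incorrect matches)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    plotted = sum(counts[label] for label in PLOTTED_OUTCOMES)
    return 100.0 * plotted / total


# Invented example: 100 events, 85 of which plotted.
example = (["correct_match"] * 80 + ["incorrect_match"] * 5
           + ["unmatched"] * 10 + ["non_match"] * 5)
print(hit_rate(example))
```

In this invented example 85 percent of events plot, yet only 80 percent sit on the correct segment, which is why a hit rate alone can overstate data quality.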

Essential elements of address matching success
As with all data generation procedures there are many factors that can affect the
address matching process. Match scores are the byproduct of the interaction
between the target database accuracy (crime data), the reference database accuracy
(street file), and software issues including user-defined parameters and software
algorithms.

Target database accuracy (aka crime events)


Assembling a complete list of crime incidents is no easy task. It is well established
that crime data are fraught with potential errors and biases introduced by
non-random reporting patterns (Myrtveit et al., 2001), investigation as a mediating
factor (McCleary et al., 1982), and clerical error (Block, 1995). Of primary concern to
the geographic analysis of crime is the data quality associated with the recording of
addresses (La Vigne and Wartell, 2001). Accurate and consistent entry of the
address information is necessary; however, most police record management systems
were not designed with this in mind. As such, there are often many common errors
associated with the address information for crime events: spelling errors, variation
in the recording of street names (i.e. Martin Luther King Blvd, MLK, and Martin
Luther Boulevard), the wrong directional identifier, wrong or missing suffix or
prefix, inconsistent abbreviations, and out-of-range or impossible address numbers
(Block, 1995). In addition, some events may be coded to the nearest corner and/or an
approximate address.
These factors together introduce potential bias. Investigation as a mediating factor,
non-random reporting, and clerical error affect the number of events identified
and the nature of those incidents. Additionally, errors associated with the recording of
addresses affect the number of cases that are available for study; this is an issue of
systematically missing data that leads to geographic bias. In conventional
statistics, the rule of thumb suggests that when 15 percent or more of the data are
missing, and when many of these events are systematically missing (meaning that a
particular address is repeatedly recorded incorrectly), a serious bias is introduced into
the analysis (Myrtveit et al., 2001) that can dramatically affect the identification
and location of crime problems. This situation exists when events are missing from the
analysis due to problems encountered when we attempt to match events to street
segments contained in the base map.
Practical implication 1. As records management systems are redesigned and
updated, efforts are being made to improve the entry of street addresses in order to fix
some accuracy issues at the source (Boba, 2001). In the meantime, most analysts must
spend a considerable amount of time scrubbing address fields or engaged in manual
interactive cleaning and matching procedures. Two implications follow. First, when
conducting geographic analysis of crime data one must incorporate sufficient time to
clean data. Second, to aid the process when repeated data dumps must be scrubbed, it
is imperative that analysts develop specific, written cleaning instructions. Following
these procedures each time data are cleaned will enhance efficiency and instill a degree
of consistency in the quality of data made available for the analyses. Analyses from
different data sets (usually different time periods such as a pre- and post-test to assess
the impact of a policing initiative) will then be comparable.

Reference database accuracy (aka base map)
Currently, the most commonly used street reference database in the USA is the US
Census Bureau's Topologically Integrated Geographic Encoding and Referencing line
files (TIGER/Line® files). The street network produced by the Census Bureau is a
graphical representation of street centerlines for the entire USA and is downloadable
free of charge (which partially explains its widespread use). In this file, streets are
broken into segments. Behind the graphical display of a street network is a supporting
tabular database wherein each street segment is linked to a series of fields (variables)
that contain information about the: directional prefix (Fedirp), street type (Pretype),
street name (Name), street type (FeName), suffix (FeType), and directional suffix
(Fedirs). Additional fields - usually four - store the high and low address ranges for
both sides of the street (Drummond, 1995). Noteworthy is the absence of unit numbers
from the address fields. This means that events cannot be plotted precisely to units
within a property.
Practical implication 2. TIGER/Line® files cannot be used to study crime distribution
within parcels or areas such as an apartment complex. If a detailed city street file or
detailed parcel map is not available, then analysts must create their own street/alley
file using aerial photography or some other information source for the property in
question. This new file can be used to plot the crime for a single development. Given
the time requirements to do this, such detailed treatment of parcels should be reserved for
important program evaluations or when diagnosing a crime problem (e.g. during a
Problem-Oriented Policing project).

Positional accuracy
TIGER® files are produced from the compilation of information developed through a
range of encoding procedures (i.e. automated map scanning and manual "digitizing")
by various local and state level entities (US Census Bureau, 2004). Given local variation
in the quality of GIS data, some areas are better represented than others; positional
accuracy varies with the source materials and procedures. However, overall, the data
meets the established National Map Accuracy standards of approximately +/- 167 ft
(US Census Bureau, 2003)[10]. Figure 1 shows the overlap of the street center line
compared against an aerial photo. The inaccuracy of the street centerline placement on
a large-scale map is clearly evident.
While this level of positional accuracy is not sufficient for high-precision
engineering applications, it is permissible with general social science research,
particularly when data points are aggregated to administrative units following
address-based geocoding. While GIS scientists commonly understand this,
criminologists tend to be less informed about these concerns. With one exception
(Ratcliffe, 2002), published discussion tends to focus on general issues of data quality
rather than location precision[11].
Practical implication 3. If crime intensity maps are generated (e.g. density maps or
statistical ellipses) and then placed on top of aerial photos, the analysis may
misidentify the locus of the high crime area. To enhance the interpretability of such
analysis, additional information should be included. This may involve a small table
that contains the actual address and crimes occurring within the high crime area. Or,
the street file and crime points could be included to illustrate the degree of warping that
may be affecting the intensity map.

Figure 1. Inaccuracy of TIGER files compared with aerial imagery
Source: Eagle Aerial Imaging, Inc., Coachella Valley (2001)

Gaps in the reference file


Many crime scientists were formally introduced to the inherent inaccuracy of TIGER®
files by Block (1995) following a Chicago based initiative to implement a GIS system
referred to as GEOArchive. Researchers found that the streets underlying the city
boundaries in the TIGER® files were very inaccurate. About half of the street
segments required adjustments (additions and changes) in order to plot the crime
incidents (Block, 1995). Common errors in this street file included: problems with street
segments (e.g. missing and unnamed street segments), inaccurate prefix ranges (e.g.
ranges on the wrong side of the street, and incorrectly ranged streets) and new streets.
Some of these issues may occur within a section of a street, one whole street, a block or
even entire neighborhoods. Figure 2 graphically displays the haphazard manner in
which street segments may be missing critical information.
Practical implication 4. There are various strategies for dealing with inaccuracy of
TIGER® files when working in an applied setting.
Analysts should identify through a simple frequency distribution the street
segments associated with addresses that do not match. Periodically, the analyst could
clean/update the reference street file using other maps to ensure that the critical street
segments are correct. Starting with the addresses or street segments with the most
missing data (highest frequency), an analyst can clean a street file. Dealing with the
entire list in one sitting is prohibitive; however, setting aside a small amount
of time on a regular basis to tackle the addresses that routinely do not map will, in a
short time, lead to greatly improved hit rates.
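
The frequency distribution described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the unmatched addresses have been exported as plain strings; the parsing rule (drop a leading house number, ignore case) and both function names are hypothetical simplifications, and the sample addresses are invented.

```python
from collections import Counter


def street_of(address):
    """Drop a leading address number and normalize case (a crude, assumed parser)."""
    parts = address.upper().split()
    if parts and parts[0].isdigit():
        parts = parts[1:]
    return " ".join(parts)


def unmatched_street_counts(unmatched_addresses):
    """Return (street, count) pairs, most frequent first."""
    return Counter(street_of(a) for a in unmatched_addresses).most_common()


# Invented addresses for illustration only.
unmatched = ["100 N Main St", "250 n main st", "12 Avenida 52"]
for street, count in unmatched_street_counts(unmatched):
    print(street, count)
```

Working down such a list from the top is one way to prioritize which street segments to repair first.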

Figure 2. Haphazard nature of missing information from the Census 2000 street file (map legend: Scattered Segments)

Additionally, to enhance comparability of geographic analyses it is important that
results include a note that assesses the quality of the underlying street file used to
geocode. This could be done by reporting the percent of street segments that are
missing address ranges and/or names (along with the percent of cases that matched
well and matched partially). Thus, when comparing analyses for the same jurisdiction
that were completed at different times it is possible to assess the effect that using
different street files or using a cleaned street file may have had on the results.
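
A street file quality note of this kind could be produced with a sketch like the following. It assumes each segment is available as a dictionary; the field names (`name`, `left_low`, and so on) are hypothetical stand-ins for whatever the street file actually calls them, and zero-valued ranges are treated as missing.

```python
# Assumed field names for the four address-range fields of a street segment.
RANGE_FIELDS = ("left_low", "left_high", "right_low", "right_high")


def street_file_quality(segments):
    """Report the percent of segments missing a name or all address ranges."""
    total = len(segments)
    if total == 0:
        return {"pct_missing_name": 0.0, "pct_missing_range": 0.0}
    missing_name = sum(1 for s in segments if not s.get("name"))
    # A segment counts as missing its range when all four fields are empty or zero.
    missing_range = sum(
        1 for s in segments if not any(s.get(f) for f in RANGE_FIELDS)
    )
    return {
        "pct_missing_name": 100.0 * missing_name / total,
        "pct_missing_range": 100.0 * missing_range / total,
    }


# Invented segments for illustration only.
full = {"left_low": 101, "left_high": 199, "right_low": 100, "right_high": 198}
segments = [
    {"name": "MAIN", **full},
    {"name": "", **full},                                   # unnamed segment
    {"name": "OAK", "left_low": 0, "left_high": 0,
     "right_low": 0, "right_high": 0},                      # zero ranges
    {"name": "ELM"},                                        # no range fields
]
print(street_file_quality(segments))
```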
A related concern is that the same address, geocoded twice,
may be assigned slightly different positions on the map. Software specific interpolation
procedures - estimation of the location based on the beginning and end address range
of the block - and the presence of zero (0) values in the address to be mapped, create
inconsistencies in plotted location (Block and Block, 1995; Mills, 1999). This may lead
to systematically missing data if there is a reason why some streets or sections of the
jurisdiction are consistently missing the numeric portion of the street address. One
possible example is when crime is recorded as occurring at a park or other large open
area.
A temporary fix would be to geocode problem addresses to intersections. This will
reduce the number of missing cases considerably; however, care must be taken when
interpreting geographic analyses using this data as the addresses plotted to
intersections will introduce a degree of geographical inaccuracy to visual
representations of the crime intensity.
Deviations in the target dataset (crime events), such as addresses in an unusual format or with too
few or too many components, will create partial or unmatchable events; this may
contribute to the level of systematic bias. For example, Spanish-named streets cause
problems because the address components are in reverse order to typical default
standardization rules (Bichler-Robertson and Conley, 2003). Street names
such as "Avenida 52" will fail when address matched because the first part of the street
name (Avenida) will be treated as the name of the street while the second part (52) would
be considered the type of street. Adjustments must be made in order to map these data
(see, for example, Bichler-Robertson and Conley, 2003)[12]. Failing to make these
adjustments in jurisdictions with sections of the city containing Spanish-named streets
will result in the systematic loss of cases.
An analyst can conduct a quality check by identifying important Spanish-named
streets (meaning those where a lot of crime occurs), spot checking the number of events
recorded there, and comparing this figure to the number of events that are plotted on
the map of crime. The percent of
cases missing (not matching) can be used as a weighting factor when interpreting the
geographic analysis that uses only matched cases. For example, if a segment of
Avenida 52 were missing 12 percent of crime events, then you could multiply the
number of events plotted by 1.12 to approximate the number of events that should
plot (strictly, if the 12 percent is a share of all recorded events, dividing the plotted
count by 0.88 recovers the recorded total, a factor of about 1.14). This may be a useful
technique when assessing the crime volume of
neighborhoods. Generally, spot-checking important streets will provide a rudimentary
assessment of the degree of error in the analysis.
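
The weighting arithmetic can be written out as a short sketch. Two readings are shown, and both function names are hypothetical: the rule of thumb from the text, which inflates the plotted count by the missing percentage, and the exact inverse that applies when the missing percentage is measured against all recorded events. For small percentages the two are close; they diverge as the share of missing cases grows.

```python
def weight_by_missing_share(plotted, pct_missing):
    """Rule of thumb from the text: multiply plotted events by 1 + missing share."""
    return plotted * (1 + pct_missing / 100)


def recover_recorded_total(plotted, pct_missing):
    """Exact inverse when pct_missing is a share of all recorded events."""
    if not 0 <= pct_missing < 100:
        raise ValueError("pct_missing must be in [0, 100)")
    return plotted / (1 - pct_missing / 100)


# Invented example: 100 recorded events, 12 percent of which failed to plot.
plotted = 88
print(weight_by_missing_share(plotted, 12))
print(recover_recorded_total(plotted, 12))
```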
Acknowledging the weakness of address matching (e.g. out-of-date street
directories) Ratcliffe (2001) sought to determine the level of geocoding imprecision.
Using MapInfo (version 5.5) the study found that there was an error level of 72 percent
when performing a standard point-in-polygon operation when matching geocoded
points with their corresponding area. Two explanations were offered for this error
level: the accuracy of the property boundaries and the degree of uncertainty for cases
(points) that fell on or close to the boundaries (Ratcliffe, 2001). The implication raised
here is that events falling on boundary street segments may be located in the wrong
area if crime events are aggregated to political units, i.e. police beats.
When redrawing police reporting districts or policing zones, agencies may want to
consider using minor streets rather than arterial roads - that are likely to generate
high volumes of crime calls - to demarcate boundaries. Additionally, agencies should
consider land use when crafting reporting districts. Keeping reporting districts
homogenous in terms of land use will help to address this problem. Care should be
taken not to include splinters of commercial property or high-density residential
property within single family housing areas.

Software based accuracy


Each GIS software package handles the geocoding process in a similar fashion, though
notable differences exist that can dramatically impact on the quality of spatial data
produced. Two vendors dominate the software market for crime analysis:
Environmental Systems Research Institute (ESRI) and MapInfo Corporation. The
following discussion will detail ESRI's address matching process because it is the
more complicated process (understanding this method will aid in understanding
MapInfo's street-level geocoding process). Furthermore, ESRI products are currently
more widely used by active crime analysts[13].

User defined parameters


Address matching, as used by ESRI, involves finding similarity between addresses in a
table (crime events) and the address attributes of a theme (street reference file).
Address matching tools allow for both automatic "batch matches" and interactive or
manual matching. During a batch match, the software selects possible candidates for
each event without user intervention. If only a small number of unmatched records
remain, the residual records can be interactively matched. Interactive matching
displays each unmatched crime address so that the user can
correct misspellings, expand abbreviations, or delete extraneous information that has
prevented a correct match[14].
To initiate the process, the software user defines matching criteria; these settings
are referred to as "geocoding preferences." First, the spelling sensitivity preference
relaxes or tightens the tolerance on spelling errors in the street names, and therefore,
controls the number of candidates that may be found. The candidate score becomes
important when there are partial matches as it sets a "similarity standard" for the
entire address of a potential match; thus, this score controls which candidates will be
displayed as possible matches during the interactive matching of partial matches. The
minimum match score sets the lowest possible limit that a street segment must achieve
before it can be considered a candidate for the match (for the entire address).
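
One way to picture how these three preferences interact is the sketch below. It is an interpretation of the description above, not ESRI's implementation: the class, the function, and the threshold values are all hypothetical placeholders. Spelling sensitivity is carried in the settings but is not used here, because it acts earlier, when candidates are drawn from the street file by name.

```python
from dataclasses import dataclass


@dataclass
class GeocodingPreferences:
    # Placeholder values on a 0-100 scale, not vendor defaults.
    spelling_sensitivity: int = 80   # applied when candidates are found by name
    min_candidate_score: int = 30    # floor for showing a candidate for review
    min_match_score: int = 60        # floor for accepting a match


def triage(candidate_scores, prefs):
    """Return (matched_segment, review_list) for one event address.

    candidate_scores maps a segment id to its overall score for the address.
    """
    if not candidate_scores:
        return None, []
    best = max(candidate_scores, key=candidate_scores.get)
    if candidate_scores[best] >= prefs.min_match_score:
        return best, []
    # No acceptable match: list partial matches for interactive review.
    review = sorted(
        (seg for seg, score in candidate_scores.items()
         if score >= prefs.min_candidate_score),
        key=candidate_scores.get,
        reverse=True,
    )
    return None, review


prefs = GeocodingPreferences()
print(triage({"seg_a": 95, "seg_b": 40}, prefs))
print(triage({"seg_a": 55, "seg_b": 40, "seg_c": 10}, prefs))
```

Lowering either floor raises the hit rate but admits weaker candidates, which is why the settings used should be reported alongside the results.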
If the target address (i.e. crime event location) has been matched to a single segment
in the reference database (that is, only one perfect, 100 percent matching candidate exists), the
address matching software checks the parity (odd/even street
number) of the crime event and determines on which side of the street to plot the case. The
software then interpolates the address's approximate distance along the street segment
by using the low and high range.
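
The parity check and interpolation step can be illustrated with a minimal sketch. It assumes a segment carries four address-range values, one low and one high per side of the street; the class and field names are hypothetical, and linear interpolation along the range is an assumption about how the distance is estimated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Segment:
    # Assumed field names for the low and high address ranges on each side.
    left_low: int
    left_high: int
    right_low: int
    right_high: int


def locate(number: int, seg: Segment) -> Optional[Tuple[str, float]]:
    """Return (side, fraction along the segment), or None if out of range."""
    sides = (("left", seg.left_low, seg.left_high),
             ("right", seg.right_low, seg.right_high))
    for side, low, high in sides:
        same_parity = low % 2 == number % 2
        in_range = min(low, high) <= number <= max(low, high)
        if same_parity and in_range:
            # Linear interpolation; a single-address range is placed mid-segment.
            fraction = 0.5 if high == low else (number - low) / (high - low)
            return side, fraction
    return None


seg = Segment(left_low=101, left_high=199, right_low=100, right_high=198)
print(locate(150, seg))
print(locate(250, seg))
```

Because the position is estimated from the range rather than measured, two street files with different ranges for the same block will place the same address at different points.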
Given inconsistencies in spelling, missing ranges (street file), and the presence or
absence of directional indicators, perfect matches are not always possible; although,
there may be street segments that are close. These street segments may be considered
partial matches. ESRI software algorithms involve two steps in the process for
developing match candidates: transformation tools (soundex) and a probabilistic
scoring process.

Software algorithms
This is the critical element in answering the student's question about why the incident
mapped to the wrong location. To understand the limitations posed by the software it
is important to take a step back in the process. Before possible candidates are pulled
from the street file, the software makes some "on-the-fly" adjustments to all addresses
(crime events and the street file).
First, target addresses (crime events) are compared to a transformation table to
identify allowable substitutions. Then, the street name is converted to a phonetic
equivalent for comparison with the reference street file. Points are assigned to the
address based on the similarity of each field of the address to each field of the street file.
Finally, a weighted cumulative score for each possible match is calculated and
normalized. The reference street segment with the highest score is selected for the
match. At this point, the location of the event address is interpolated and located along
the reference street segment. Two elements of this process require detailed treatment:
transformation tables and phonetic equivalents.

Transformation tables
Inconsistency in the recording of addresses is unavoidable; consequently, the software
has a built-in mechanism to "clean addresses" by allowing for substitutions for key
address components. Transformation tables help to standardize the address of the events
you wish to plot. It is similar to "data scrubbing" except the original data file is left intact
(no permanent changes are made). These substitutions are stored in transformation tables
located in program files. Table I supplies a series of example substitutions for prefixes,
directional indicators, and street type. The actual transformation table is considerably
larger and can be expanded or adjusted by the user. Use of transformation tables enables
the software user to save time during the data scrubbing phase of data collection if there
are standard, known differences between the crime event addresses and the reference
street file. Once the addresses are temporarily "clean" the software looks for suitable street
segments by street name; this is where the soundex becomes an issue.
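
The idea of a transformation table can be sketched as a simple lookup. The entries below are a handful of the substitutions listed in Table I; the dictionary layout and the function name are illustrative assumptions and do not reflect the file format the software uses. The sketch substitutes token by token, so it would also rewrite a street that is actually named "North"; real software applies substitutions by address component.

```python
# Variant -> standard form, taken from a few rows of Table I.
SUBSTITUTIONS = {
    "WST": "W", "WEST": "W",
    "NO": "N", "NORTH": "N",
    "SO": "S", "SOUTH": "S",
    "AV": "AVE", "AVENUE": "AVE", "AVENIDA": "AVE",
    "BL": "BLVD", "BOULEVARD": "BLVD",
    "DRIVE": "DR",
    "ROAD": "RD",
    "STR": "ST", "STREET": "ST",
}


def standardize(address):
    """Return a cleaned copy of the address; the input itself is left intact."""
    tokens = address.upper().replace(".", "").split()
    return " ".join(SUBSTITUTIONS.get(token, token) for token in tokens)


original = "123 North Main Street"
print(standardize(original))
print(original)  # unchanged, mirroring the "no permanent changes" behavior
```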

Soundex properties
Identifying possible street matches, using street names (English language), involves
creating a phonetic equivalent (referred to as a soundex value) for the written spelling
of each street name (listed in the table of crime events). Then, ESRI software compares
the event street name to the reference street names based upon their soundex values
instead of their literal spellings (ESRI Support Center, 2004; Drummond, 1995).
To create a soundex value, the first letter of the street name is retained, and all
occurrences of a, e, h, i, o, u, w, and y in the remainder of the street name are dropped (e.g.
"Pennsylvania" becomes "Pnnslvn"). The next three remaining letters are converted to
numbers based on the values presented in Table II. These three numbers are appended to the
first letter of the street name. To continue with the example, Pnnslvn becomes P552.

Table I. Sample of acceptable substitutions for street type, directional indicators, and suffixes

ESRI abbreviation   Text read by software

Prefix/direction
W                   W, WST, WEST
N                   N, NO, NORTH
S                   S, SO, SOUTH
NW                  NW

Street types
AVE                 AV, AVD, AVE, AVEFLR, AVEN, AVENIDA, AVENUE, AVNUE
BLVD                BD, BL, BLV, BLVD, BOUL, BVD, BOULEVARD
CIR                 CI, CIR, CIRC, CIRCLE, CORCLE, CR, CRCL, CRCLE, CRL
CTR                 CEN, CENTE, CENTER, CENTERS, CENTR, CENTRE, CENTRES, CNTR, CTER, CTR
DR                  DR, DRIVE
RD                  RD, ROAD
RT                  ROUTE, ROUTES, RT, RTE, RTES, RTS
ST                  STAPT, STBOX, STBX, STEET, STLOT, STPOBOX, STR, STRD, STREET, STREETS, STRFD, STRR, STRT

Other terms
CANYON              CN, CYN, CNYN, CANYN
HARBOR              HBR, HARB, HARBR, HRBOR
MOUNTAIN            MTN, MNTN, MNTAIN
SPRINGS             SPGS, SPNGS, SPRNGS
TUNNEL              TUNEL, TUNL, TUNNEL, TUNNL
VALLEY              VALLY, VLEY, VLLY, VLY, VLYS

According to ESRI software manuals, spaces in street names are ignored if the street
name contains multiple words (e.g. Mt Victoria, Mtvctr, M312). If the street name has
less than four letters following the removal of vowels, h and w, then zeros are added to
the end in order to produce a four-digit soundex value (ESRI Support Center, 2004).
Streets that have different names but the same soundex values will be returned as
candidates for that address. For example, when looking for a match for the street name
New York Street (with a soundex value of N620), the address matching process returns
all street segments in the reference database (i.e. streetfile) with the name New York
Street and Newark Street (NrkO or N620) since they have the same soundex value.
Thus, spelling errors could change the soundex value and influence which candidates
are identified for a possible match. For example, misspelling York (New Yotk Street)
will produce a soundex value of N320 rather than N620. Given that vowels are not used
in the soundex transformation process, mistakes in consonants are of critical
importance. Also, soundex does not take directional prefixes into consideration (i.e.
North). The street segments with names that have the same soundex value as the event
to be plotted will be "flagged" in the order that the street segments appear in the
underlying data table that supports the street file[15]. Thus, geocoding with a street file
sorted in alphabetical order will potentially place different candidates in the first
position of the list of possible matches then would a street file sorted geographically.
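
The soundex steps described above can be sketched in a few lines of Python. This is a minimal illustration of the procedure as the text describes it, not ESRI's actual code: the function name is hypothetical, and the value 4 for the letter l follows the standard Soundex scheme. Unlike textbook Soundex, the procedure described here does not collapse repeated codes, which is why Pennsylvania yields P552.

```python
# Letter-to-digit values as in Table II. The value 4 for "l" follows the
# standard Soundex scheme (an assumption for this sketch).
LETTER_VALUES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        LETTER_VALUES[letter] = digit

DROPPED = set("aehiouwy")  # letters removed from the remainder of the name


def simplified_soundex(street_name):
    """Return the four-character code described in the text (hypothetical helper)."""
    # Spaces and any other non-letter characters are ignored.
    letters = [ch for ch in street_name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    # Drop a, e, h, i, o, u, w, y, then code the next three remaining letters.
    kept = [ch for ch in rest if ch not in DROPPED]
    digits = "".join(LETTER_VALUES[ch] for ch in kept[:3])
    # Pad with zeros so the result always has one letter and three digits.
    return first.upper() + digits.ljust(3, "0")


for name in ("Pennsylvania", "Mt Victoria", "New York", "Newark", "New Yotk"):
    print(name, simplified_soundex(name))
```

Running the loop reproduces the values quoted in the text: New York and Newark share one code, while the misspelled New Yotk receives a different one and would therefore pull a different set of candidates.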

Points assigned
Candidates are then ranked according to a total weighted score that represents the
statistical probability that the identified street segment is the correct match for the
event address, based on all components of the address (this approach is common to
several commercial GIS packages)[16]. Each component of an address record in the
reference theme table (aka street file) is scored based on how closely it matches the
corresponding component of the event address. Each address field has two
probabilities: the probability that the fields agree for a matched pair (m), and the
probability that the fields agree for unmatched pairs, i.e. agree at random (u). The m and
u probabilities are specific to each address style[17] (e.g. "US Streets with Zone"
address style probabilities are different from "US Streets" address style probabilities). Field
probabilities are weighted[18] and this can be adjusted to reflect the importance of
matching on that component of the address[19]. The scores of all the address
components (all fields used in an address style) are combined to create the total score
(Table III), which is then "normalized" to improve interpretability; probabilities are
converted to a value out of 100[20]. Candidates with the highest score are considered
the best match. When scores are tied, the candidates will appear in the order that they
appear in the street file.
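
To make the ranking and tie behaviour concrete, the following sketch combines the street-address point values from Table III into a score out of 100 and ranks candidates. It rests on simplifying assumptions that are not taken from ESRI documentation: each component either agrees fully or not at all, the normalized score is points earned divided by points possible, and all function and field names are hypothetical. The actual software works from m and u probabilities and gives partial credit.

```python
# Street-address point values as listed in Table III.
POINTS = {"house_number": 10, "pre_direction": 2, "pre_type": 4,
          "street_name": 26, "suffix_type": 6, "suffix_direction": 2,
          "zip": 10, "zip4": 10}


def normalized_score(event, candidate, fields):
    """Points earned on agreeing fields, rescaled to a value out of 100."""
    earned = sum(POINTS[f] for f in fields if event.get(f) == candidate.get(f))
    possible = sum(POINTS[f] for f in fields)
    return round(100 * earned / possible)


def rank_candidates(event, street_file, fields):
    """Return (score, segment) pairs, best score first."""
    scored = [(normalized_score(event, seg, fields), seg) for seg in street_file]
    # sorted() is stable, so tied candidates keep their street-file order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


event = {"pre_direction": "E", "street_name": "FOOTHILL", "suffix_type": "BLVD"}
fields = ["pre_direction", "street_name", "suffix_type"]
west = {"id": "A", "pre_direction": "W", "street_name": "FOOTHILL", "suffix_type": "BLVD"}
blank = {"id": "B", "pre_direction": "", "street_name": "FOOTHILL", "suffix_type": "BLVD"}

print([(score, seg["id"]) for score, seg in rank_candidates(event, [west, blank], fields)])
print([(score, seg["id"]) for score, seg in rank_candidates(event, [blank, west], fields)])
```

Under these assumptions a wrong or missing directional prefix costs only 2 of 34 points, so both segments still score 94, and which one is listed first depends solely on the order of the street file.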

Table II. Letter values

Letters                   Number
b, f, p, v                1
c, g, j, k, q, s, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6

Table III. Points for address computation

Address components     Points

Street address
House number           10
Pre-direction          2
Pre-type               4
Street name            26
Suffix type            6
Suffix direction       2
Zip code               10
Zip code +4            10
Intersections (for streets 1 and 2)
Prefix direction       2
Prefix street type     4
Street name            26
Suffix name            6
Suffix direction       2

Recall that software users can set the minimum match score when setting the
geocoding preferences; this raises or lowers the score that potential candidates must
achieve to be considered a partial match. This does not impact on what is considered a
"good match." To be a good match, the candidate must achieve a combined score of 75
for ArcView and 80 for ArcGIS. This means that adjusting the geocoding preferences
higher than 75 will result in more partial matches and nonmatches - all partials are
moved into the nonmatch category. Lowering the user-defined preference will generate
more partial matches that will be automatically accepted, unless otherwise directed,
when the process is complete.
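
One way to read the interaction between the user-defined minimum match score and the fixed "good match" threshold is sketched below. The decision rule is an interpretation of the paragraph above, not documented ESRI behaviour, and the function name is hypothetical.

```python
# "Good match" thresholds reported in the text.
GOOD_MATCH = {"ArcView": 75, "ArcGIS": 80}


def classify(score, min_match_score, software="ArcView"):
    """Classify a candidate score under the reading described in the lead-in."""
    if score < min_match_score:
        # Below the user-defined floor the candidate is not accepted at all.
        return "no match"
    if score >= GOOD_MATCH[software]:
        return "good match"
    # Accepted by the user-defined floor but below the good-match threshold.
    return "partial match"


for minimum in (60, 80):
    print(minimum, [classify(s, minimum) for s in (70, 78, 90)])
```

With the floor at 60, a score of 70 is a partial match. With the floor at 80, scores of 70 and 78 are both unmatched, so no partial matches remain in ArcView.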
At the conclusion of the address matching process, four columns are appended to
the end of the crime event file: "AV_ADD", "AV_STATUS", "AV_SCORE", and
"AV_SIDE". The AV_ADD variable contains the address that was used (i.e. if you
manually adjusted the crime event address during the process this would be the
adjusted address). The AV_STATUS column includes coding for whether the event
was matched ("M"), unmatched ("U"), or if the address had two or more candidates
with the best score ("T" for tied) signaling that the first in the list was used. The
AV_SCORE is the total score for the candidate selected (where the crime event plotted).
The final variable, AV_SIDE, informs the user which side of the street the event was
plotted to: "R" for right and "L" for the left side of the street.
Practical implication 5. Exploring these columns will provide useful information
about the potential error introduced into the analysis by the matching process.
Developing a ratio between the number of T cases in relation to M cases will convey
the degree of potential geographic error when comparing various maps.
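
The ratio suggested here is simple to compute once the status codes have been exported from the event table. The sketch below assumes the codes are available as a plain list of "M", "U", and "T" values; the function name and the sample counts are illustrative only.

```python
from collections import Counter


def status_summary(statuses):
    """Count M, T, and U codes and report the ratio of tied to matched cases."""
    counts = Counter(statuses)
    matched, tied, unmatched = counts["M"], counts["T"], counts["U"]
    # The ratio is undefined when nothing matched outright.
    ratio = tied / matched if matched else None
    return {"M": matched, "T": tied, "U": unmatched, "tied_to_matched": ratio}


# Illustrative codes, not data from the study.
codes = ["M"] * 8 + ["T"] * 2 + ["U"] * 3
print(status_summary(codes))
```

A higher tied-to-matched ratio means that more points were placed on whichever candidate happened to appear first in the street file.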

A note about MAPINFO


The default process built into the current version of MapInfo Pro requires an exact
match (literal spelling used) on the text portion of an address (directional indicator,
street name and type). Any error, space, extra punctuation, etc. will result in a failure to
match. This occurs because the software matches the text exactly and does not use the
soundex function. An extensive abbreviation table can be adjusted so that the batch
matching process substitutes known abbreviations or ignores characters in an address
field. This permits the user to "clean" addresses on the fly. Amending the abbreviation
table will greatly improve hit rates in this exact match process[21]. Similarly, MapInfo
also provides "result codes" for matched cases so that the user can assess why an exact
match was not found. Consulting these codes will aid the analyst in adjusting
parameters or address records to improve geocoding results.
Numeric components of the address are the only exception. The user may opt to
take the closest number to the one being sought. Following an initial attempt to match
cases, the user may invoke an interactive process to manually match the non-matching
cases (similar to the process described above). Generally, the data plotted using
MapInfo are likely to be more accurately placed than when using ESRI products;
however, the trade off is that there could be a far greater number of cases that will
require manual remediation or interactive matching or risk being left out of the
analyses (systematically missing).
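
The general idea of exact matching after abbreviation substitution can be illustrated as follows. This is not MapInfo's implementation: the abbreviation entries, the punctuation handling, and the function names are all hypothetical, and only the text portion of the address is compared.

```python
# Hypothetical abbreviation table; a real one would be far more extensive.
ABBREVIATIONS = {"STREET": "ST", "STR": "ST", "AVENUE": "AVE", "AV": "AVE",
                 "NORTH": "N", "NO": "N"}


def normalize(street_text):
    """Upper-case, strip simple punctuation, and substitute known abbreviations."""
    cleaned = street_text.upper().replace(".", " ").replace(",", " ")
    return " ".join(ABBREVIATIONS.get(token, token) for token in cleaned.split())


def exact_match(event_street, reference_streets):
    """Return reference streets whose normalized text equals the event's exactly."""
    target = normalize(event_street)
    return [ref for ref in reference_streets if normalize(ref) == target]


reference = ["N MAIN ST", "N MAINE ST"]
print(exact_match("North Main Street", reference))  # abbreviation differences are absorbed
print(exact_match("N Mian St", reference))          # a spelling error finds nothing
```

The second call shows the trade-off described above: no wrong street is selected, but the misspelled record drops out unless it is repaired manually.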

Putting it all together


Figure 3 presents a graphical representation of how all of these elements work together
to introduce bias into geographical analysis of events geocoded with the address
matching process. The thickness of Arrow 1 indicates the degree of potential error
introduced by the interaction of the inaccuracy in: (a) the target data (crime events), and
(b) the reference file (street file) used. This error feeds the interaction between (c)
parameters set (geocoding preferences) and (d) the software algorithms (Arrow 2).
Clearly, the impact of the address matching process is cumulative. Successful address
matching involves knowing the limitations of the data you are working with and
understanding the limitations introduced by the interaction between the factors
discussed because without fail, there will be systematically missing data from any
address matching process.
In the case of the inquisitive undergraduate, the soundex function and the quirk of
the tabular order of the street file data led to the selection of the street segment with a
score that was considered a good match. Unfortunately, it was the wrong choice. This
determination led to secondary research questions: how much of a problem (bias) is
likely to be introduced by this systematically missing data; and, what information
must scholars report to inform research consumers about the bias that may have been
introduced? Answering these queries required testing the software.

Figure 3. Address matching process. Labels: (a) Target Data, (b) Reference File, Hit Rate.
Note: Width of the arrow represents the amount of sampling bias introduced to the
outcome data file available for geographic analyses.

Methodology
To figure out which address components most impact this process and the nature of
the sampling error typically introduced, this inquiry set out to adjust user-defined
settings while geocoding three different datasets in ArcView 3.3 and ArcGIS. Each
dataset was geocoded at different preference settings to determine the accuracy of the
plotted points. Plotted points were then compared to actual locations to determine the
accuracy of the position. In total, 76 tests were conducted.

Data
Three distinct datasets with 100 cases each were generated. The first was a fixed
comparison of one street in Corona, California. The second was a random sampling of
addresses from a data set for the city of Corona. The third analysis was a regional test
of four cities that contain a main street with the same name and similar street ranges.
All datasets were address matched with the US Census Bureau Tiger/line® street file
from 2000.
Fixed. In total, 100 addresses were chosen for the street, Ontario Ave in the SW
section of Corona, CA. To identify verified (city registered) addresses, a section of
parcels (one block with 13 street segments) located in the downtown section of Corona
was selected from the Riverside County web site[22]. This dataset approximates a
clean data file with no anticipated errors in the "crime data" as the addresses were
drawn from an electronic, georeferenced file of parcels. To ensure the data were clean,
street names listed in the street file for these segments were checked to ensure that the
event address street names matched the street file. Using addresses with a common
street name permitted the analysis to control for various sources of error (e.g. spelling
issues) in order to assess the impact of software parameters on the matching process.
This provides a baseline estimate of possible error assuming the street file and the
address file are clean[23].
Random. The second dataset included addresses for the city of Corona and two
unincorporated areas (Home Gardens and El Cerrito). Using an existing database for
the city of Corona, a random sample was chosen from probation case files using
SPSS (George and Mallery, 2001). This dataset approximates the quality of data
extracted from the typical law enforcement RMS. It was expected that this dataset
would contain some inaccuracy in the address fields. Reasonably one would expect
that probation records would contain accurate information about the home address
of clients; however, close inspection of juvenile home addresses recorded for 4,000
juveniles residing in a county in Southern California revealed that probation address
recording practices suffer the same quality issues as typical crime data; there are
many cases of incomplete addresses, inconsistent street name abbreviations,
inverted address ranges, etc. Additionally, juvenile delinquents are mobile; they
move frequently between communities and guardians, so address fields require
constant updating.
Regional. To generate a regional dataset, 25 addresses were selected from each of four
different cities - Claremont, Rialto, and San Bernardino (in San Bernardino County,
CA), and Upland (a bordering city in Los Angeles County) - for an East/West
thoroughfare with substantial crime and public safety issues. To begin the address
creation, 25 addresses on Foothill Boulevard in the City of San Bernardino were
generated from a map commonly used in this region (the Thomas Bros. Guide 2003
edition). Then, addresses within similar ranges that exist in the three other cities (on
the same street) were produced. There was approximately an even number of East and
West addresses. The actual locations were not visited to verify the presence of a
structure. Of note, the street names recorded were compared against the street file to
ensure that inconsistencies in street names did not generate error. The intent of using
this data file was primarily to examine the impact of using zip codes.

Geocoding process
By examining the addresses geocoded at different user defined parameters, it is
possible to assess the impact of altering the geocoding preferences for the spelling
sensitivity, match rate, and candidate controls respectively[24]. The ESRI default
setting of the geocoding preferences was 80/60/30. In addition, tests were also done at
levels below the default (10/10/10 through 60/60/60) and above (70/70/70 through to
exact match of 100 all)[25].

Results
Each batch of geocoded and non-matched points was examined to verify that all
addresses were mapped to the correct place by confirming that the point was placed in
the correct range and segment of the street[26]. Addresses that were not correctly
matched or did not match were then examined to identify what went wrong. As
anticipated, important variations were found with each dataset (see Tables IV and V).

Fixed address dataset


Analysis of the geocoding accuracy of the fixed addresses revealed that all cases
mapped up to the setting of 80/70/70. From a setting of 80/80/80 and higher, ArcView
lost 7-8 percent of cases and ArcGIS lost one case. ArcView lost cases due to suffix
inconsistencies (street type) and ArcGIS lost one case due to a missing directional
prefix. As expected, clean data for verified addresses in a single jurisdiction map well
(Table IV); however, raising the preference settings above 80 for all settings will result
in the loss of reasonably reliable cases. Using the Tiger® files for Corona led to the loss
of 7-8 percent of cases. It can be anticipated that matching event addresses with clean
street names and correct address ranges will still lead to missing data when geocoding
preferences are set at 80 all and above. Relaxing geocoding preferences to settings
above 70 all (but lower than 80 all) may lead to near perfect match rates. A cautionary
note: it is possible that missing or incorrect suffixes could lead to the placement of
events on the wrong street segment. For example, if there are many like-named streets
such as Palm Drive, Palm Street, and Palm Boulevard, then it is possible that 7-8
percent of the events occurring on these streets might be misplaced.

Table IV. Comparison of matching results for fixed, random and regional datasets (ArcView)

Table V. Comparison of the matching results for fixed, random and regional datasets (ArcGIS)

Random dataset
Analysis of the random addresses revealed a greater degree of missing cases and error.
For ArcView, addresses that had missing or incorrect information about the street type
would map only if the geocoding preferences were set at 60/60/60 or lower. However,
examination of match accuracy indicated that many were plotted in the wrong location.
Partially matched addresses were plotted on the wrong street segment due to missing
or incorrect street type and name similarities (e.g. Winter Rd mapped to Wonderview
Rd, and Blue Mesa Court mapped to Blue Diamond).
Moving from this setting to parameters of 70/70/70 led to the "no match" of 20-21
addresses due to incorrect suffix information. However, the plotted events were
mapped correctly. This means that the addresses that plotted incorrectly due to
incorrect or missing street types, name similarity issues (similar soundex values), and
problematic directional prefixes at the minimum settings became missing cases.
Raising to 70 all eliminated error.
Raising preferences to 80/80/80 and higher led to the loss of matched addresses
that were plotted to the correct place; this resulted from incorrectly spelled street
names, street type problems, or prefix issues. All plotted events were correctly located.
Increasing settings to a near perfect match (90 all and 100 all) led to the loss of
additional events that would have mapped correctly at the 70 all setting due to minor
spelling issues. Spelling errors that changed match scores include:
* A wrong letter in a word scored at 81, 82, or 87 depending on the letter.
* An extra letter in a name led to a match score of 92.
* A missing letter at the end of a street name resulted in a match score of 94.
Tests of ArcGIS software revealed a similar pattern, though the overall numbers of
non-matching cases differ. A higher number of addresses were classed as partial
matches, meaning that six addresses changed from matches to partial matches as
settings moved from ten all to 80/70/70. In addition, cases that did not map in ArcView were
mapped in ArcGIS. ArcGIS was able to map more events (nonmatches from ArcView
were classed as partial matches in ArcGIS). This shows that there are slight differences
in the algorithms built into these software products. ArcGIS is more sensitive to the
absence of prefixes and slightly less sensitive to street name issues.
Regional dataset
The initial test of the regional dataset showed that 100 percent of
the addresses matched at all preference settings tested. However, because the initial tests
used US street address only, several problems were found. Most cases mapped in one
or two cities (rather than 25 percent in each city) and only 21 percent of the addresses
actually mapped to the correct location.
Adding the correct zip code did not reduce the percent matching (100 percent of
addresses matched at ten all to 80 all). However, 10 percent mapped incorrectly due to
missing directional prefixes (i.e. E or W). At 90 all, the 10 percent that had mapped
incorrectly at the lower settings were rendered incapable of mapping. These cases
could not map during a batch match due to incorrect and missing directional prefixes
in the street file. ArcGIS software produced similar results. When cities contain streets
that change direction (e.g. N/S running street turns a bend and begins to run E/W) or
crime data includes cases from other jurisdictions (e.g. general assists to other agencies
or multi-jurisdictional crime crackdowns), zip codes and directional prefixes must be
correct if their use is going to improve match rates.
Discussion
Drummond (1995) presents three strategies for dealing with address matching issues:
* Strategy A. Ignore all records that are not perfect matches.
* Strategy B. Include perfect and the most reliable partial matches.
* Strategy C. Use perfect matches and every possible partial match (regardless of
reliability).
To illustrate the costs associated with each strategy Drummond (1995) estimated the
degree of bias and error that could reasonably be expected from each. Table VI presents
the theoretical estimates generated by Drummond compared to what was found in this
study for the random dataset (as this is the type most crime analysts work with).

Selecting a strategy
Using only perfect matches is not a feasible strategy in the world of crime analysis
given the degree of clerical error in the address name that typically exists. It would
generate far too many missing cases that may otherwise map somewhat correctly.
Further, going through a manual interactive matching process on a case-by-case basis
with all of the non-matching items is prohibitive; in many situations, this would
involve thousands, if not tens of thousands, of cases. The third strategy, of using every
possible match, is also not advisable in a criminal justice setting given the level of error
introduced; limited resources may be misallocated thereby impacting on public safety.
Thus, the remaining strategy is to take exact matches and the most reliable partial
matches. The question remaining is what is a reliable partial match?
One of the first responses often suggested to those learning how to geocode events
against addresses in a street file when faced with a high number of "no match" and
"partial match" cases is to adjust the geocoding preferences as suggested by ESRI
(Ormsby et al., 2004). It is a reasonable and logical step given the likelihood of small
clerical errors associated with addresses (i.e. Block and Block, 1995) that may be
overcome with slightly relaxed matching criteria. This can be a viable tradeoff given
that the systematic loss of data is critical to the strength of crime pattern modeling; the
extensive "listwise" deletion of cases due to matching criteria may introduce undue
bias into the analysis. However, the results of this exploration suggest that geocoding
preferences should not be raised or lowered beyond a clearly identifiable "optimal
threshold". Lowering geocoding preferences (or relaxing the sensitivity of matching
criteria) will radically reduce the number of accurately matched events, leading to
higher error. Further, the current default settings are too low to lead to high quality
matches of the greatest volume of cases. However, increasing the settings for spelling
sensitivity, candidate score, and minimum match score above 80/80/80 will result in the
loss of events that would have mapped correctly. Future research should identify the
optimal thresholds for other software (such as MapInfo) for data that approximates the
quality typically used in crime analysis. Until then, the optimal threshold should be
between 70 all and 80 all.

Table VI. Theoretical estimates of bias and error for Drummond's mediation strategies compared with study results (random crime data)

                          Strategy A               Strategy B           Strategy C
                          Perfect (set 100 all)a   Reliable (70 all)    All partials (50 all)
                          Theory   Results         Theory   Results     Theory   Results
Matched correctly (%)     50       61              70       78          75       78
Incorrect match (%)       5        0               10       0           20       9
Could have matched (%)    40       29              15       12          0        3
Never match (%)           5        10              5        10          5        10

Notes: a Geocoding preferences set at 100 for spelling sensitivity; 100 for minimum candidate score;
and 100 for minimum match score

Additional implications for the geographic analysis of crime


Four additional geocoding implications were raised by these geocoding tests that
directly impact on crime analysis that is conducted to support police operations.
Practical implication 6. The "random dataset" developed to represent calls for
service data were associated with a considerable degree of mismatched events. The
immediate implication is that more attention must be paid to continuing efforts to
improve the quality of data entered into record management systems (La Vigne and
Wartell, 2001; Boba, 2001). Prior attention has focused on correctly naming streets and
has somewhat neglected the importance of directional indicators. Directional indicators
were the locus of many errors in this study and given the widespread use of directional
indicators to designate different "sections" of a major street, fixing this one small
element of an address holds considerable promise for improving match rates. This is
interesting given that the weight assigned to this element of a street address is
considerably less than a street name. Crime analysts should take measures to ensure
that the directional indicators are included where necessary and that they match the
street file being used to plot events.
Practical implication 7. Regional mapping must include zip codes. However, care
must be taken to ensure that the zip codes used are accurate in both the crime data file
and the street file. Like other address components, the reference street file is likely to
contain a substantial degree of missing or inaccurate zip codes. Adding this component
to the address will diminish the importance of the other address components and could
lead to additional "false negatives"; cases that should have mapped. Future efforts
should examine/test the improved accuracy of address matching in regional situations
comparing the use of zip codes that include the four-digit extensions with the five digit
zip codes. Multi-jurisdictional analyses are most vulnerable to zip code error; this is a
more pressing issue to mitigate for state level law enforcement and Sheriff's
departments.
Practical implication 8. In addition to reporting the hit rate on all visual displays of
crime data or other analyses, it is vital that the settings and other details of the
geocoding process be recorded as well. For example, analysis produced with data that
achieved a hit rate of 95 percent has an entirely different interpretation if the geocoding
preferences were set to 30/30/30 than if an exact match process was used. Address
matching is sufficiently important to the analysis that it must become routinely
documented. Thus, when presenting results that involve geographic analysis, each
image (or map) and the narrative describing this research must report:
* The geocoding hit rate displayed as a percent mapped - both the percent of
partial and percent of complete matches - as well as the total sample size (N) that
the analysis started with.
* Reasons for missing cases (some analysis of which cases were excluded and why
the cases did not map) must be noted.
* The preferences set and software used to geocode the crime events.
* Reference information for the street file used to geocode the crime events.
These information requirements should be added to the notes section of each map or
image created, as well as to the text of the report if one accompanies the analysis. This
information should be recorded along with the datum, projection, and event data
reference at the bottom right hand corner of the image or in the methods section of the
report. Failure to include this information is inexcusable as the audience must know
this information in order to interpret the findings. These technical details will be
appreciated because law enforcement personnel have strong geographical skills and
they readily understand that statistics can lie. Knowing what cases are not included
and which areas of the jurisdiction are underrepresented will be easily understood and
will aid in the interpretation of the geographic analyses presented.
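
The numeric parts of these reporting requirements can be assembled into a standard map note. The helper below is a minimal sketch; its name, the wording of the note, and the example values are illustrative assumptions, and the reasons for missing cases would still need to be tabulated separately.

```python
def geocoding_note(n_total, n_matched, n_partial, preferences, software, street_file):
    """Build a one-line note covering hit rates, settings, and reference file."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")

    def pct(n):
        # Percent of the starting sample, to one decimal place.
        return round(100 * n / n_total, 1)

    n_missing = n_total - n_matched - n_partial
    return (f"N = {n_total}; complete matches {pct(n_matched)}%; "
            f"partial matches {pct(n_partial)}%; not mapped {pct(n_missing)}%; "
            f"preferences {preferences}; software {software}; "
            f"street file {street_file}")


# Illustrative values only, not results from the study.
print(geocoding_note(200, 150, 30, "70/70/70", "ArcView 3.3", "TIGER/Line 2000"))
```

Generating the note from the same counts used to draw the map keeps the reported hit rate and the plotted points consistent.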
Practical implication 9. The final implication for crime analysis is that the prediction
of crime hotspots can be demonstrably impacted by address matching accuracy.
Table VII reviews the various types of prediction methodologies; essentially there are
three classes: address level based on prior crime data, administrative units with prior
crime data, and multivariate models (Groff and La Vigne, 2001; Verma and Lodha,
2002). Each of these strategies encounters different bias and error related concerns.
Essentially, address level hotspot analysis and prediction will have the least amount
of tolerance for incorrect plotting as analytic techniques typically used are highly
sensitive to the number of events used (Bichler, 2004). A systematic loss of a "hot"
location could radically alter the landscape of crime. The use of administrative units
will be more robust depending on the boundaries used. Here it is essential to have the
correct side of the street; thus, there is no tolerance for errors in the numerical section of
the address. Given that directional prefixes, street names, and suffixes present more
substantive issues in this study, numeric errors are likely to introduce an insignificant
degree of error into research using administrative units. Further, Ratcliffe (2004) noted
that a 25ft offset improves placement (positional accuracy) of crime events. Thus,
offsets should be considered for this kind of analysis (Drummond, 1995; Ratcliffe,
2004). Finally, multivariate analyses have fluctuating degrees of tolerance. The size of
the grids used interacts with the number of dirty data layers thereby creating a situation
wherein the degree of bias is compounded by the level of one's reliance on data sources
that generate poor georeferencing.

Concluding remarks
A shockingly large void exists within research looking into the spatial properties of
crime distribution. To date, while data quality issues were raised previously, there has
been virtually no public discussion of the importance of systematically missing data
and how analysts must account for this limitation to the validity and reliability of
geographic crime analysis.
Missing data is a problem that all research must deal with. Convention suggests that
when 15 percent or more of the data for a variable are missing, estimation methods -
that work fairly well when a few cases have missing information - may introduce undue
bias into the analysis. While public debate on the issue of systematically missing data
and various remedial strategies exists, these notions have yet to be translated into the
geographic and spatial analysis of crime. Considering that the accuracy and quality of
crime data used in spatial analysis is critical to assessing the accuracy of predictive
models, evaluating the effectiveness of intervention strategies, and to understanding the
very nature of crime problems, this dearth of attention is inexcusable.

Table VII. Forecasting models and the likely impact of geocoding results

Missing cases impact on the number of cases and can substantially sway the
analysis in favor of some areas of the jurisdiction. Future research must explore the
degree of bias introduced so as to develop diagnostics and remedial techniques to deal
with bias produced by non-mapping addresses.
Nine practical implications were raised in this study that, if followed, would improve
the interpretability, quality, and generalizability of geographic crime analyses.
(1) Time must be set aside to clean addresses. To make the process more efficient
and consistent (improving the reliability of data produced) when repeated data
dumps must be scrubbed, it is imperative that analysts develop specific, written
cleaning instructions (a set of data scrubbing scripts could be used). The
cleaning process must be conducted the same way each time.
(2) Tiger/Line® files cannot be used to study crime distribution within parcels or
areas such as an apartment complex. If a detailed city street file is not available,
then analysts must create their own street/alley file using aerial photography or
some other information source for the property in question.
(3) To enhance the interpretability of crime intensity maps that are used with aerial
imagery, additional information must be included. This may involve a small
table that contains the actual addresses and crime counts for various crime
types occurring within the high crime area. Or, the street file and crime points
could be included to illustrate the degree of warping that may be affecting the
intensity map.
(4) There are various strategies for dealing with inaccuracy of Tiger® files when
working in an applied setting.
* Analysts should identify through a simple frequency distribution the street
segments associated with addresses that do not match. Periodically, the
analyst could clean/update the reference street file using other maps to
ensure that the critical street segments are correct. Starting with the
addresses or street segments with the most missing data (highest frequency),
an analyst can clean a street file. Dealing with the entire list in one sitting is
prohibitive; however, routinely setting aside a small amount of time to
tackle the addresses that repeatedly do not map will, in a short
time, lead to greatly improved hit rates.
* To enhance comparability of geographic analyses it is important that results
include a note that assesses the quality of the underlying street file used to
geocode. This could be done by reporting the percent of street segments that
are missing address ranges and/or names (along with the percent of cases
that matched well and matched partially). Thus, when comparing analyses
for the same jurisdiction that were completed at different times it is possible
to assess the impact that using different street files or using a cleaned street
file may have made to the results.
* Problem addresses could be geocoded to intersections as a temporary fix.
This will reduce the number of missing cases considerably; however, care
must be taken when interpreting geographic analyses using this data as the
addresses plotted to intersections will introduce a degree of geographical
inaccuracy to visual representations of the crime intensity.
* Conduct quality checks on Spanish-named streets by comparing the number
of events that are shown to have been recorded on a street segment to the
number of events that are plotted on the map of crime. The percent of cases
missing (not matching) can be used as a weighting factor when interpreting
the geographic analysis that uses only matched cases.
* When redrawing police reporting districts or policing zones, agencies may
want to consider using minor streets rather than arterial roads - which are
likely to generate high volumes of crime calls - to demarcate boundaries.
Additionally, agencies should consider land use when crafting reporting
districts. Keeping reporting districts homogenous in terms of land use will
help to address this problem. Care should be taken not to include splinters of
commercial property or high-density residential property within single
family housing areas.
(5) Exploring fields appended to data files that record information about the
geocoding process provides useful information about the potential error
introduced into the analysis by the matching process. The ratio of tied
cases to matched cases (a single 100 percent matched
candidate) should be reported with analyses that compare multiple maps
(using data from different geocoding runs).
(6) Directional indicators were the locus of many errors in this study and given the
widespread use of directional indicators to designate different "sections" of a
major street, fixing this one small element of an address holds considerable
promise for improving match rates.
(7) Regional mapping must include zip codes; however, the information contained
in this field must be correct or the events will not match.
(8) Geocoding preferences should be set between 70 all and 80 all. When presenting
results that involve geographic analysis, each image (or map) and the narrative
describing this research must report:
* The geocoding hit rate displayed as a percent mapped - both the percent of
partial and percent of complete matches - as well as the total sample size (N)
that the analysis started with;
* Reasons for missing cases (some analysis of which cases were excluded and
why the cases did not map) must be noted;
* The preferences set and software used to geocode the crime events; and,
* Reference information for the street file used to geocode the crime events.
(9) Address level hotspot analysis and prediction will have the least amount of
tolerance for incorrect plotting as analytic techniques typically used are highly
sensitive to the number of events used (Bichler, 2004). A systematic loss of a
"hot" location could radically alter the landscape of crime. Aggregating to small
administrative units is advisable; however, spot-checking the accuracy of this
process is critical to determine whether events have been plotted to the wrong
unit.
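The reporting items in implications (4), (5) and (8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in this study: the record layout (a "status" of "M" for matched, "T" for tied, or "U" for unmatched, a numeric "score", and a "street" name) is hypothetical, as is the function name, and the fields an analyst actually has will depend on the GIS software used.

```python
from collections import Counter


def geocoding_report(records):
    """Summarize one geocoding run for reporting alongside a map.

    Each record is a dict with hypothetical keys:
      "status": "M" (matched), "T" (tied) or "U" (unmatched)
      "score":  match score from 0 to 100
      "street": street name taken from the tabular address
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")

    # Cases that were placed on the map (matched or tied).
    mapped = [r for r in records if r["status"] in ("M", "T")]
    complete = [r for r in mapped if r["score"] == 100]
    partial = [r for r in mapped if r["score"] < 100]

    # Tied cases versus single 100 percent matched candidates.
    tied = [r for r in records if r["status"] == "T"]
    single_full = [
        r for r in records if r["status"] == "M" and r["score"] == 100
    ]

    # Frequency distribution of streets whose addresses did not match,
    # most frequent first, to prioritize street-file cleaning.
    unmatched = Counter(
        r["street"] for r in records if r["status"] == "U"
    )

    return {
        "n": n,
        "pct_complete": 100.0 * len(complete) / n,
        "pct_partial": 100.0 * len(partial) / n,
        "pct_unmatched": 100.0 * sum(unmatched.values()) / n,
        "tied_to_matched_ratio": (
            len(tied) / len(single_full) if single_full else None
        ),
        "unmatched_by_street": unmatched.most_common(),
    }
```

Running the function on each data dump, and printing the summary next to each map, keeps the hit rate, the starting sample size (N), and the list of problem streets together with the analysis they qualify.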
The current study faced a number of limitations that should be avoided in future
research. First, none of the samples included streets that had Spanish names (e.g.
avenida) and so it is not possible to assess what problems would have arisen. In the
regional analysis, the addresses were geocoded based on the zip code but not with the
four-digit extension; and therefore, we cannot speculate what would have occurred if
the four-digit extensions had been available for use. The samples used for each of the
tests included only 100 cases; thus, it is possible that using a larger sample size might
lead to different conclusions. Additionally, the addresses and the street file had been
cleaned slightly prior to geocoding, which may have increased the correct matches and
accuracy of the plotted points (Block, 1995). However, the cleaning primarily involved
cities that were not included in this analysis - with the exception of Corona, but this
was discussed - and thus, the random and regional tests should approximate other
results using Tiger® files that were generated by cities with similar GIS capacity as
San Bernardino, Upland, Corona, Claremont, and Rialto. Lastly, the analysis and
literature review did not include information or comparison of other software
programs, particularly MapInfo.

Notes
1. Geocoding is the term given to describe various procedures used to place tabular
information, in this case crime data, onto an electronic map (Harries, 1999).
2. While some suggest that achieving an 80 percent match rate is satisfactory (Blosser, 1995),
matching at least 90 percent is a more widely accepted standard due to bias introduced by
missing data.
3. A recent impromptu survey of active crime analysts subscribing to the list serve of NIJ's
Mapping and Analysis for Public Safety program (formerly the Crime Mapping Research
Center), showed that the average geocoding hit rate among 43 different agencies was 87.5
percent with a standard deviation of 14 percent (Ratcliffe, 2002b). This survey found that
approximately 33 percent of respondents did not achieve the convention of 90 percent
geocoding success on a regular basis. Ratcliffe also noted that the hit rate was dependent on
the crime type with domestic crimes plotting better than other crimes (e.g. street crime)
(Ratcliffe, 2002b).
4. Clearly, the quality of tabular crime data and the reference street file is the most important
source of address matching problems. This issue has already been discussed at length; as
such, this topic will only be touched on briefly at this juncture.
5. Geographic precision is impacted greatly by interpolation algorithms; this is a complex and
important issue that is beyond the scope of the present paper.
6. While geographers adept at using GIS software for geostatistical analyses are familiar with
issues raised by this interaction effect, few criminologists have thought carefully about this
problem as evidenced by the lack of methodological details in published work.
Further, discussions with crime analysts working within police agencies together with
examination of analyses produced by these applied researchers suggests that this problem is
foreign to many, if not most, crime analysts. What geographers specializing in geographic
information science know about data quality and software peculiarities must be imported
into the criminal justice arena if geostatistical tools are to be used properly.
7. Parcel matching - which involves linking events to actual polygons or parcels - and event
plotting with latitude and longitude coordinates are not discussed here. These techniques are
much more accurate and are much less likely to be used by the typical crime analyst as the
data captured by most records management systems (RMS) do not have the capacity to
permit the use of these geocoding procedures.
8. For example, the Office of Community Oriented Policing Services (COPS) was awarded
grants in excess of $7 billion over the past nine years for start up and operational costs.
In addition, various organizations have provided software training including the Police
Foundation, JUSTNET's Crime Mapping and Analysis Program (CMAP), COPS, NIJ's Crime
Mapping Research Center (CMRC), and Carolinas Institute for Community Policing (CICP).
9. Address matching is the most prevalent form of geocoding in the US because law
enforcement agencies fairly consistently have a street address or other locational attribute
(e.g. the name of a store which can be linked to a street address) for crime incidents - this is a
byproduct of the introduction of CAD systems.
10. The US Census Bureau is aware that the file is not 100 percent accurate. Consequently, the
Census Bureau is currently in the initial stages of a multi-year project to update the
features in the file (US Census Bureau, 2004).
11. This locational bias will offset or distort micro-level or large-scale geographic distributions
of crime patterns.
12. Alterations to the street file can be made so that the first part of the street name is in the
pretype column, with significant increases in the geocoding hit rate without cleaning the
address fields.
13. A number of surveys have queried law enforcement crime analysts across the country about
their crime analysis activity and use of GIS software. A Police Foundation (2000) survey of
51 law enforcement agencies receiving COPS MORE grant funding, found that 71 percent
use or will be using ESRI software (ArcView or ArcInfo) to analyze crime data. This finding
is similar to the results of an earlier study by the National Institute of Justice (Mamalian and
La Vigne, 1999). A broader study using two national samples found that 61 percent of large
agencies and 15 percent of small agencies mapped crime (O'Shea and Nicholls, 2002). And,
only about 6.7 percent of large agencies and 5 percent of small agencies use special hotspot
software like STAC (O'Shea and Nicholls, 2002) suggesting that most use the built in
functions of their GIS software.
14. It is also possible to set the software to permit manual examination of all situations where
more than one possible match was found. This is an optional setting in ESRI products and a
default setting in MapInfo.
15. For this reason the underlying organization of the reference street file will impact on the
address matching process. While no ESRI materials revealed this, careful exploration of
geocoding results demonstrated this feature.
16. This information was gathered from the help topics files of the ArcView program.
17. Address styles differ in relation to the placement and number of address components
required. For example, the US streets address style simply includes the conventional street
address minus the unit number, city, and zip codes; whereas, the US streets with zone
includes the street address and the zip code but excludes the unit number and city.
18. When pairs agree, the agreement weight, $\log_2(m/u)$, is applied to the probability for the field.
If pairs disagree, the disagreement weight involves a negative value, $\log_2((1 - m)/(1 - u))$.
Simply, agreements add to the composite score and disagreements subtract from it. For
example, the m and u probabilities of a directional indicator (for ArcView using the US
streets with Zone address style) are 0.8 and 0.1 respectively, thus the probability for the
directional indicator when there is agreement between the fields is $\log_2(0.8/0.1) = 3$. If the
fields disagree, the weighted probability used is $\log_2((1 - 0.8)/(1 - 0.1)) = -2.17$.
19. However, the value must remain between 0.0 and 1.0.
20. This involves multiplying the weighted composite score by 100.
21. However, this is not necessarily more efficient or more accurate than cleaning addresses
when they are first entered (Boba, 2001; La Vigne and Wartell, 2001).
22. Riverside County maintains a web-based GIS application at:
www.tlma.co.riversideca.us/gis/gisdevelop.html
23. Admittedly, as noted by one of the reviewers, using a random sample of clean addresses
would have been stronger; however, it was beyond the scope of this project for the
researchers to adjust the street file for the entire city of Corona to ensure that the base street
file was clean. Future research should do such an exploration.
24. As noted by an anonymous reviewer, it is obvious that fixing street files and tabular crime
data address fields is the best way to improve hit rates; however, these options are not
typically available to practicing crime analysts doing routine assessments.
25. "All" means that the three controls (spelling sensitivity, match rate, and candidate selection)
were all set to the same number.
26. To assist with this process, plotted locations were also compared to locations shown on the
MapQuest On-Line service and to recorded street segment address ranges printed in the
Thomas Bros. Street Maps (2003) for San Bernardino and Los Angeles counties.
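
The agreement and disagreement weights described in note 18 can be checked with a short Python sketch. It assumes m = 0.8 and u = 0.1 for the directional indicator, the values that reproduce the 3 and -2.17 figures reported in that note. The function names and the (m, u, agrees) tuple layout are hypothetical, and the rescaling of the composite score mentioned in notes 19 and 20 is not reproduced here because its details are not given.

```python
import math


def agreement_weight(m, u):
    """Weight added to the composite score when a field pair agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight (negative when m > u) applied when a field pair disagrees."""
    return math.log2((1.0 - m) / (1.0 - u))


def composite_weight(fields):
    """Sum the per-field weights for one candidate.

    `fields` is an iterable of (m, u, agrees) tuples, one per address
    component; this layout is an assumption made for illustration.
    """
    return sum(
        agreement_weight(m, u) if agrees else disagreement_weight(m, u)
        for m, u, agrees in fields
    )


# Directional indicator from note 18, assuming m = 0.8 and u = 0.1.
print(agreement_weight(0.8, 0.1))
print(round(disagreement_weight(0.8, 0.1), 2))
```

Because a mismatch on a single field subtracts from the composite weight, a wrong or missing directional indicator can be enough to push an otherwise good candidate below the minimum match score.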

References
Bichler, G. (2004), "Acknowledging the influence of geographic outliers: an analytic caution in
examining chronic repeat victimization", Security Journal, Vol. 17 No. 1, pp. 31-49.
Bichler-Robertson, G. and Conley, J. (2003), "Se Habla Espanol? Reconciling geocoding conflict
between census street files and ESRI software", Crime Mapping News, Vol. 5 No. 2,
pp. 10-13.
Block, R. (1995), "Geocoding of crime incidents using the 1990 TIGER file: the Chicago example",
in Block, C.R., Dabdoub, M. and Fregly, S. (Eds), Crime Analysis Through Computer
Mapping (189-193), Police Executive Research Forum, Washington, DC.
Block, R.L. and Block, C.R. (1995), "Space, place, and crime: hot spot areas and hot places of
liquor-related crime", in Eck, J.E. and Weisburd, D. (Eds), Crime Prevention Studies
(145-197), Willow Tree Press, Inc, Monsey, New York, NY.
Blosser, M. (1995), "Modeling address ranges", Proceedings of the Fifteenth Annual ESRI User
Conference, available at: http://gis.esri.com/library/userconf/proc95/to300/p258.html
Boba, R. (2001), Introductory Guide to Crime Analysis and Mapping, Office of Community
Oriented Policing Services, US Department of Justice, Washington, DC.
Drummond, W.J. (1995), "Address matching", Journal of the American Planning Association,
Vol. 61 No. 2, pp. 240-51.
ESRI Support Center (2004), available at: http://support.esri.com (accessed June 2002 to June
2004).
George, D. and Mallery, P. (2001), SPSS for Windows Step by Step: A Simple Guide and Reference,
Allyn and Bacon, Boston, MA.
Groff, E.R. and La Vigne, N.G. (2001), "Mapping an opportunity surface of residential burglary",
Journal of Research in Crime and Delinquency, Vol. 38 No. 3, pp. 257-78.
Harries, K. (1999), Mapping Crime: Principle and Practice, National Institute of Justice, Office of
Justice Programs, Washington, DC.
La Vigne, N. and Wartell, J. (2001), "Tracking gang activity in Orange County, California",
Mapping Across Boundaries, Police Executive Research Forum, Washington, DC,
pp. 25-40.
McCleary, R., Nienstedt, B.C. and Erven, J.M. (1982), "Uniform crime reports as organizational
outcomes: three time series experiments", Social Problems, Vol. 29 No. 4, pp. 361-72.
Mamalian, C. and La Vigne, N. (1999), "National Institute of Justice Research Preview", The Use
of Computerized Crime Mapping by Law Enforcement: Survey Results, National Institute of
Justice, Washington, DC.
Martin, D. and Higgs, G. (1997), "Population georeferencing in England and Wales: basic spatial
units reconsidered", Environment and Planning A, Vol. 29, pp. 333-47.
Mills, J. (1999), "Tain't necessarily so: address geocoding in the real world", Proceedings of the
Nineteenth Annual ESRI User Conference, available at: http://gis.esri.com/library/userconf/proc99
(accessed in June 2003).
Myrtveit, I., Stensrud, E. and Olsson, U. (2001), "Analyzing data sets with missing data:
an empirical evaluation of imputation methods and likelihood-based methods", IEEE
Transactions on Software Engineering, Vol. 27 No. 11, pp. 999-1014.
O'Shea, T. and Nicholls, K. (2002), Crime Analysis in America, Center for Public Policy,
University of South Alabama, Mobile, AL.
Ormsby, T., Napoleon, E., Groess, C. and Feaster, L. (2004), Getting to Know ArcGis Desktop,
ESRI Press, Redlands, CA (updated for ArcGIS 9).
Pease, K. and Laycock, G. (1996), Reducing the Heat on Hot Victims, Bureau of Justice Statistics,
Washington, DC.
Police Foundation (2000), Integrating Community Policing and Computer Mapping: Assessing
Issues and Needs Among COPS Office Grantees, Crime Mapping Laboratory, Washington,
DC.
Ratcliffe, J. (2001), "On the accuracy of TIGER-type geocoded address data in relation to
cadastral and census areal units", International Journal of Geographic Information Science,
Vol. 15 No. 5, pp. 473-85.
Ratcliffe, J. (2002a), "Damned if you don't, damned if you do: crime mapping and its implications
in the real world", Policing and Society, Vol. 12 No. 3, pp. 211-25.
Ratcliffe, J. (2002b), personal communications listed on the Crime Mapping Listserve, results of
an informal survey of geocoding hit rates by active crime analysts.
Ratcliffe, J. (2003), The Hotspot Matrix as a Framework for the Spatio-Temporal Targeting of
Crime Reduction, paper presented at the 11th International Symposium on Environmental
Criminology and Crime Analysis, Cincinnati, OH, June 19-21.
Ratcliffe, J. (2004), "The Hotspot Matrix: a framework for the spatio-temporal targeting of crime
reduction", Police Practice and Research, Vol. 5 No. 1, pp. 5-23.
Ratcliffe, J. and McCullagh, M. (1998), "Identifying repeat victimization with GIS", British Journal
of Criminology, Vol. 38 No. 4, pp. 651-62.
Townsley, M., Homel, R. and Chaseling, J. (2000), "Repeat burglary victimisation: spatial and
temporal patterns", The Australian and New Zealand Journal of Criminology, Vol. 33 No. 1,
pp. 37-63.
US Census Bureau (2004), TIGER®, TIGER/Line® and TIGER-Related Products, www.census.gov/www/tiger/index.html
(accessed repeatedly June 2002 to June 2004).
Verma, A. and Lodha, S.K. (2002), "A topological representation of the criminal event", Western
Criminology Review, Vol. 3 No. 2, available at: http://wcr.sonoma.edu/v3n2/verma.html

Further reading
La Vigne, N., Fleury, J.K. and Szakas, J. (2000), "Auto theft and detecting chop shop locations",
in Turnbull, L.S., Hendrix, E.H. and Dent, B.D. (Eds), Atlas of Crime: Mapping the Criminal
Landscape, Oryx, Phoenix, AZ, pp. 60-7.
Sherman, L.W., Gartin, P.R. and Buerger, M.E. (1989), "Hot spots of predatory crime: routine
activities and the criminology of place", Criminology, Vol. 27 No. 1, pp. 27-55.

Corresponding author
Gisela Bichler can be contacted at: gbichler@csusb.edu

To purchase reprints of this article please e-mail: reprints@emeraldinsight.com


Or visit our web site for further details: www.emeraldinsight.com/reprints
