
Crime Hotspot Prediction using Machine-Learning Algorithms
Abstract
Crime rates have been increasing in many cities over the past few years; therefore,

analyzing hotspots and proactively preventing crime in them is critical. Crime and incident data are being
collected through the Police Data Initiative and are available to the public to encourage joint problem-solving.

This project focuses on analyzing and predicting hotspots for three major cities (Orlando,

Fort Lauderdale and Gainesville) in Florida using classification. The data were cleansed and

mapped based on the National Incident-Based Reporting System (NIBRS) crime types. Tableau

Public software was used to create a visual report with the ability to analyze data along various

dimensions for all the cities. Classification was done using four different algorithms for Fort

Lauderdale and Gainesville in the open source data mining tool Weka 3.8.

The ability to visualize crimes by NIBRS standard crime type at various times allows the

user to analyze multiple scenarios and identify hotspots quickly. Crime spot prediction was done

successfully by all four algorithms and the results were compared to each other to identify the

best one. Classification using k-nearest neighbor was the most precise (60.39%) for Fort

Lauderdale whereas ada boosting was the most precise (84.19%) for Gainesville.

Introduction

Recent statistics show that crime rate is increasing in some cities and is affecting the

quality of life in those communities [3]. In the past few years, some cities have been

implementing Data-Driven Approaches to Crime and Traffic Safety (DDACTS) to improve


public safety by decreasing crime and traffic crashes [1]. DDACTS is a model developed through

a partnership between the US Department of Transportation and the US Department of Justice. It
emphasizes locating crime “hotspots” so that law enforcement can be deployed effectively in
those areas. In 2015, the White House announced the “Smart Cities” initiative to help
communities tackle local challenges and improve city services [5]. As part of this effort, the Police Data
Initiative was identified as a key program that gives local authorities data to improve
community policing [6]. An immense amount of crime data is being collected as part of this
initiative [9]. Predicting crime from this historical data can help police departments
proactively monitor high-crime areas. This research focuses on predicting crime spots using

visualization and machine learning algorithms for three major cities in Florida for which the data

is currently available in the Police Data Initiative repository.

Incident datasets from Orlando, Fort Lauderdale and Gainesville are used for prediction.

Data is visualized using Tableau Public software and analyzed to identify crime “Hot Spots” by

different crime types for all three cities. Four classification algorithms are used to
classify each dataset on a binary class, Crime Status. The machine learning algorithms considered are
Decision Tree, Naïve Bayes, K-Nearest Neighbor and Boosting. Prediction
results from all four algorithms for the Fort Lauderdale and Gainesville incident data were recorded
and analyzed to identify the best algorithm to use. The Orlando data cannot be used by these algorithms
for prediction because its incident data include only crimes, so it provides no non-crime examples
for training the models.


Tools

• Weka: a suite of machine learning software written in Java, developed at the
University of Waikato, New Zealand, for data mining tasks

• Tableau Public: a free service that allows the user to load and
analyze data visually and to publish the visualizations on the web

• Data Sets: for visualization and prediction, city incident data with the case ID, longitude
and latitude of the incident, street name and NIBRS case type were downloaded
from the Police Data Initiative site

Approach

Data Collection

The data were collected from the Police Data Initiative (PDI) open data site. PDI is a law

enforcement community that promotes the use of open data to improve public safety by

collaboration between law enforcement agencies, technologists, and researchers. Currently,
around 130 law enforcement agencies have released more than 200 data sets, and the
inventory is growing. Many kinds of data sets are available on the site, such as accident and crash
data, incident data, complaints, and officer-involved shootings. Incident data includes
information about incidents where the police department responds to an offense and a report of a

crime is generated. In Florida, Gainesville, Orlando, and Fort Lauderdale are the cities that have

provided incident data so far.

Data Cleansing

In data mining, data cleansing is an essential step that prepares the data by removing or correcting
incomplete or inconsistent records. In our datasets, crime type is not consistent across the cities. For
example, Fort Lauderdale has 938 different crime type categories whereas Orlando has only 24.
This is because each law enforcement agency classifies incidents
according to its own offense definitions. To analyze and visualize crimes across multiple cities,
crime types must be standardized across them. NIBRS provides an offense lookup
table that lists various types of crime and the NIBRS crime category covering each offense. Crime types
in all the data sets were mapped to NIBRS types by defining a mapping table for them.
Geographical longitude and latitude are critical for this study, and they were not provided as
separate attributes in every data set. For example, in the Gainesville set, street names and geo
coordinates are listed in one field. That field was parsed and the appropriate substrings were extracted for
analysis.
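The two cleansing steps described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the entries in `NIBRS_MAP`, the helper names, and the assumed layout of the combined location field (street text followed by "(latitude, longitude)") are all hypothetical.

```python
import re

# Hypothetical mapping table: local offense label -> NIBRS category.
NIBRS_MAP = {
    "ARMED ROBBERY": "Robbery",
    "STRONG ARM ROBBERY": "Robbery",
    "DUI": "Driving under the Influence",
}

# Assumed layout of a combined field: street text followed by "(lat, lon)".
COORD_RE = re.compile(r"\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)")


def to_nibrs(local_type):
    """Return the NIBRS category for a local label, or 'Unmapped'."""
    return NIBRS_MAP.get(local_type.strip().upper(), "Unmapped")


def split_location(field):
    """Split a combined field into (street, latitude, longitude)."""
    match = COORD_RE.search(field)
    if match is None:
        # No coordinates found: keep the text and leave the coordinates empty.
        return field.strip(), None, None
    street = field[:match.start()].strip()
    return street, float(match.group(1)), float(match.group(2))
```

Records that come back as "Unmapped" or without coordinates can then be reviewed by hand before visualization.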

Data Exploration

Data exploration is an initial data analysis step to better understand the data and its

specific characteristics. Visual exploration is well suited to this because it allows the analyst to
quickly absorb large amounts of visual information. All three data sets were explored using
Tableau Public software.


Model Building

Prediction models were built using machine learning classification algorithms.

Classification is a supervised learning method used to predict a certain outcome based on a given

input. In general, there are two steps in data classification. The first step is to build a classifier
model from data with a predetermined set of classes. This is a “learning” process in which the
model is trained on known data. In this step, each instance of the data must be
labeled with an appropriate class label. This dataset is called the training set. Because the class label
of each instance in the dataset is provided to the model, this is known as “supervised” learning. In
the second step, the model is used to predict the class label for the test set. The test set has
all the attributes except the class label, and the classification model
predicts the class label for each of its instances. There are many data mining algorithms
that can be used. Various algorithms were analyzed and prioritized based on their limitations and their
ability to classify using supervised learning [13].

Algorithm                Limitations  Classification  Supervised Learning  Final Score
C4.5 Decision Tree       4            5               5                    17
K-means Clustering       3            5               1                    12
Support Vector Machines  3            5               5                    15
Apriori                  4            1               1                    11
EM                       5            1               1                    10
PageRank                 2            5               1                    12
Ada Boosting             5            5               5                    18
K-nearest neighbor       4            5               5                    18
Naïve Bayes              4            5               5                    17
CART                     3            5               5                    16

Based on the final score, the following four classification algorithms were used to predict the
crime hotspots:
1. Decision Tree

2. Naïve Bayes

3. K-Nearest Neighbor

4. Ada Boosting

The following sections give a general description of how each algorithm works.

Decision Trees

The decision tree is a widely used algorithm in data mining. This algorithm uses a flowchart-like
structure to predict the class labels based on several input attributes [6]. Decision trees have 3

kinds of nodes:

1. Root Node – The topmost node in the tree, with no incoming edges and zero or more outgoing edges

2. Internal Node – A node that has exactly one incoming edge and two or more outgoing edges

3. Leaf Node – A node that has exactly one incoming edge and no outgoing edges

In this tree, each internal node represents a test on an input attribute that separates records with
different characteristics, and each branch represents an outcome of that test. Each leaf node of the
tree represents the class label assigned to the instance. The classification problem is resolved by
asking a series of questions (root or internal nodes) until a conclusion about the class label is
reached (leaf node). The diagram above shows a simple example of a decision tree that classifies
whether a species is a mammal or a non-mammal [12]. In this example, two attributes, body temperature
and gives birth, are considered for classification. During training, the tree is built recursively to
optimize the model. For the root split, all the attributes are considered and the training data is
divided into groups based on each candidate split. In the example, the training data is first split on
body temperature and the cost of the split is calculated; the step is then repeated with the other
attribute, gives birth. The attribute with the lowest cost becomes the root node. Many measures can
be used to calculate the cost of a split. Classification error (number of incorrectly classified
instances / total number of instances) is one of them. The process continues recursively with the
other features until a stopping condition is met. One stopping condition is a minimum number of
training instances in each leaf. Another is the maximum depth of the tree, which is the longest path
from the root to a leaf. Once the model is built, testing of each instance in the test set starts at
the root node. Test conditions are applied to the record and the appropriate branch is followed until
a leaf node with a class label is reached.
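The choice of the root attribute by lowest split cost can be sketched as below, using classification error as the cost measure. The function names and the five-row toy data set are assumptions made for illustration; Weka's decision tree learner uses its own split criterion, which this sketch does not reproduce.

```python
from collections import Counter


def classification_error(labels):
    """Fraction of instances that are not in the group's majority class."""
    if not labels:
        return 0.0
    majority = Counter(labels).most_common(1)[0][1]
    return 1.0 - majority / len(labels)


def split_cost(rows, labels, attribute):
    """Size-weighted classification error after splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    total = len(labels)
    return sum(len(g) / total * classification_error(g) for g in groups.values())


def best_attribute(rows, labels, attributes):
    """Attribute whose split has the lowest cost."""
    return min(attributes, key=lambda a: split_cost(rows, labels, a))


# Invented toy data in the spirit of the mammal example.
rows = [
    {"body_temp": "warm", "gives_birth": "yes"},
    {"body_temp": "warm", "gives_birth": "no"},
    {"body_temp": "cold", "gives_birth": "no"},
    {"body_temp": "cold", "gives_birth": "no"},
    {"body_temp": "warm", "gives_birth": "yes"},
]
labels = ["mammal", "non-mammal", "non-mammal", "non-mammal", "mammal"]
root = best_attribute(rows, labels, ["body_temp", "gives_birth"])
```

On this toy data the split on `gives_birth` separates the classes perfectly, so it is chosen as the root; with different data the other attribute could win.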

Naïve Bayes Classifier

Naïve Bayes is a machine learning classification algorithm based on Bayes' probability

theorem. Let X denote the attribute set of the input instances and Y denote the class variable. By
Bayes’ theorem:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

• P(Y|X) is the posterior probability: the probability of hypothesis Y given the data X

• P(X|Y) is the class-conditional probability: the probability of the data X given that hypothesis Y is true

• P(Y) is the prior probability: the probability of hypothesis Y being true regardless of the data X

• P(X) is the probability of the data regardless of the hypothesis

P(X) is constant when comparing posterior probabilities for different values of Y,
so it can be ignored. The prior probability can be estimated from the training set by dividing
the number of training records that belong to a class by the total number of instances. For
example, if the dataset has the same number of instances in each class, the prior probability of each
class is the same. The class-conditional probability is the probability of each input value given each
class value. It is estimated by dividing the frequency of each attribute value for a given class
value by the frequency of instances with that class value. The Naïve Bayes classifier
estimates the class-conditional probability P(X|Y) under the assumption that the attributes are
conditionally independent, given the class label y [12].

It is called naïve because it estimates the conditional probability of each Xi given Y separately, rather
than attempting to estimate the joint probability of all attribute values, P(x1, x2, x3|Y) [7]. It
assumes that the attributes are conditionally independent given the class and calculates P(x1|Y), P(x2|Y), etc.
This can be written as

$$P(X \mid Y = y) = \prod_{i=1}^{d} P(X_i \mid Y = y)$$

where each attribute set X = {X1, X2, …, Xd} consists of d attributes. To classify a test record, the
naïve Bayes classifier computes the posterior probability for each class Y:

$$P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)}$$
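As a rough illustration of these formulas, the sketch below estimates the priors and class-conditional probabilities by counting and then scores each class by P(Y) ∏ P(Xi|Y), dropping the constant P(X). The function names and the four-row toy data are invented for this example, and no smoothing is applied, so an attribute value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Estimate P(Y) and the counts needed for P(X_i | Y) from categorical data."""
    class_counts = Counter(labels)
    priors = {y: n / len(labels) for y, n in class_counts.items()}
    # value_counts[(i, y)][v] = number of class-y rows whose attribute i equals v
    value_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
    return priors, value_counts, class_counts


def predict(row, priors, value_counts, class_counts):
    """Return (best class, scores) where score = P(Y) * prod_i P(X_i | Y)."""
    scores = {}
    for y, prior in priors.items():
        score = prior
        for i, v in enumerate(row):
            score *= value_counts[(i, y)][v] / class_counts[y]
        scores[y] = score
    return max(scores, key=scores.get), scores


# Invented toy data: (time of day, area) -> crime status.
rows = [("night", "downtown"), ("night", "suburb"), ("day", "downtown"), ("day", "suburb")]
labels = ["crime", "crime", "crime", "non-crime"]
model = train_naive_bayes(rows, labels)
label, scores = predict(("night", "downtown"), *model)
```

Here the score for "crime" is 0.75 × 2/3 × 2/3 and the score for "non-crime" is zero, so the record is labeled "crime".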

Ada Boosting

Boosting is an ensemble method that can improve classification accuracy by
aggregating the predictions of multiple classifiers. Multiple training sets are created by
resampling the original data, and a classifier is built from each training set using any standard
classification algorithm (for example, decision trees) [12]. The Ada boosting algorithm attempts to
boost the accuracy of an underlying classification algorithm by assigning a weight to each
training sample and adaptively changing the weights at the end of each round. The
steps of the boosting algorithm are as follows:

1. Assign the same weight to all the instances in the training set

2. Create a training set by sampling with replacement according to the current weights

3. Train a classifier on that sample and calculate its accuracy

4. Update the weights of all samples, assigning higher weights to incorrectly classified instances so
that the next classifier can focus on them

5. Repeat from step 2 until a set number of rounds is completed or the desired accuracy is reached

The following diagram provides a visual representation of how boosting works [8].

In the diagram above, the original data set D1 starts with equal weights for all data points. The first
trained classifier labeled one ‘+’ and two ‘-’ instances incorrectly. In the next round, those data
points are assigned higher weights (drawn larger than the rest of the data points). The second
classifier focuses on predicting them correctly because of their higher weights. This continues round
by round until the final classifier is obtained by combining the multiple classifiers to obtain better
accuracy.
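The reweighting step can be sketched with the commonly cited AdaBoost update, in which the round's classifier receives a vote weight alpha and misclassified instances are scaled up before the weights are renormalized. This is an assumption about the exact rule; the text above does not specify it, and Weka's implementation may differ in detail. The function name and the ten-point example are illustrative only.

```python
import math


def update_weights(weights, correct):
    """One AdaBoost-style reweighting round.

    weights: current instance weights (assumed to sum to 1)
    correct: booleans, True where this round's classifier was right
    Returns (alpha, new_weights), with new_weights normalized to sum to 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error <= 0 or error >= 0.5:
        raise ValueError("weak learner must have 0 < weighted error < 0.5")
    alpha = 0.5 * math.log((1 - error) / error)
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]


# Ten equally weighted points (an assumed count); the first three are misclassified.
weights = [0.1] * 10
correct = [False] * 3 + [True] * 7
alpha, new_weights = update_weights(weights, correct)
```

After the update the three misclassified points together carry half of the total weight, which is what pushes the next classifier to concentrate on them.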

K-Nearest Neighbor Classifier

The nearest neighbor classifier is an instance-based learner that makes predictions
using the specific training instances “closest” to the test instance. Such classifiers are also called “lazy
learners” and do not require model building. However, the classification process is quite
expensive because each test instance is classified by computing its proximity to the training
examples. The most common class label among its K nearest neighbors is chosen.

In the above diagram (a), classification is based on one nearest neighbor and the test instance is
assigned the label ‘-’. In (c), the test instance is assigned ‘+’ because two of the three neighbors
have ‘+’ as a label. In (b), where there is a tie, the algorithm randomly chooses one of the labels
for the test instance [2].
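A minimal sketch of the majority vote is shown below, assuming Euclidean distance and invented two-dimensional points. Unlike the random tie-breaking described above, this sketch resolves ties deterministically (the label encountered first among the nearest neighbors wins), which is a simplification.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Invented toy data: one '-' point sits closest to the query, two '+' points follow.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
labels = ["-", "+", "+", "-", "-"]
query = (0.2, 0.1)
one_nn = knn_predict(points, labels, query, k=1)
three_nn = knn_predict(points, labels, query, k=3)
```

With k = 1 the query takes the label of its single nearest neighbor, while with k = 3 the two ‘+’ neighbors outvote it, mirroring cases (a) and (c).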

Model Evaluation

Once the model is built by a classifier, its performance can be evaluated using several
metrics that are based on counts of instances labeled correctly and incorrectly by the model.
The confusion matrix is a table that summarizes the performance of a
model. In the notation below, the rare class is denoted as the positive class and the majority class is
denoted as the negative class.


                 Predicted Class
                 +      -
Actual Case  +   TP     FN
             -   FP     TN

• True positive (TP): Number of positive examples that are correctly predicted by the

model

• False negative (FN): Number of positive examples that are wrongly predicted as negative

by the model

• False positive (FP): Number of negative examples that are wrongly predicted as positive

by the model

• True negative (TN): Number of negative examples that are correctly predicted by the

model

In this paper, the following metrics are used to evaluate and compare the classification models [7].

Accuracy

Accuracy is the proportion of test-set instances that the model classifies correctly.

$$\text{Accuracy},\ a = \frac{TP + TN}{TP + TN + FP + FN}$$

Error rate

Error rate is the proportion of test-set instances that the model classifies incorrectly.

$$\text{Error Rate},\ e = \frac{FP + FN}{TP + TN + FP + FN}$$

Recall

Recall is the fraction of positive instances correctly predicted by the classifier. It is also
called the true positive rate. A large recall value means that few positive instances are
misclassified as negative.

$$\text{Recall},\ r = \frac{TP}{TP + FN}$$

Precision

Precision is the fraction of records that are actually positive among those the classifier
predicted as positive. The higher the precision, the fewer false positive errors the classifier makes.

$$\text{Precision},\ p = \frac{TP}{TP + FP}$$
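The four metrics follow directly from the confusion-matrix counts, as in this short sketch. The function name and the counts in the example are made up for illustration and are not results from this study.

```python
def evaluate(tp, fn, fp, tn):
    """Compute accuracy, error rate, recall and precision from confusion counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Invented counts: 40 TP, 10 FN, 20 FP, 30 TN.
metrics = evaluate(tp=40, fn=10, fp=20, tn=30)
```

Accuracy and error rate always sum to one, so reporting both is redundant but convenient when comparing classifiers.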

Using these measures, the performance of a model needs to be evaluated on a test set
for which the labels are already known. When no separate test set is
available, alternative methods can be used [10]. The first is called the holdout method.
Here the original data set is split into two disjoint sets: one is used as the training set and the other
as the test set. The proportion of the split between training and test sets is decided by the analyst
based on the problem at hand; it can be 1:1, 2:1, 3:1 or any other ratio. This method
has a few limitations. First, fewer instances are available for training the model because some
portion of the data is kept for testing. Second, if the training set is made too large to compensate,
the test set becomes small and the estimated accuracy is less reliable. Moreover, the training and test
sets are subsets of the same original data, which
means that a class overrepresented in one will be underrepresented in the other. An alternative
method that is widely used is k-fold cross validation. In this approach, every instance is
used for training as well as for testing. The data is randomly divided into k subsets; k-1 subsets are
used as the training set and one subset is used for testing. The procedure is
repeated k times so that each partition is used exactly once for testing. The overall estimate is the
mean of the accuracy (or error) over the k
runs. 10-fold cross validation is the most commonly used setting for evaluating or comparing classifiers.
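The fold construction can be sketched as follows. This is a simplified illustration with hypothetical helper names; it shuffles indices and deals them into k folds without stratifying by class, which may differ from what a given tool does by default.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n, k, score_fn):
    """Average score_fn(train_idx, test_idx) over the k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(25, 10))
```

In practice `score_fn` would train a classifier on the training indices and return its accuracy on the test indices; here it is left as a parameter so the splitting logic stays independent of any particular model.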

Results
For all the cities, data is mapped using the longitude and latitude of the incident location.
These points are overlaid on a US geographical map to visualize the hotspots by NIBRS crime
type. Crime spots are weighted by the number of incidents that occurred at a specific
geographical location. This gives a visual view of “hotspots” in the cities. Selections can be made
by year of the crime, month of the crime, or NIBRS crime type for further analysis.

Visualization provides parameters to allow users to select a subset of data for further analysis

based on a scenario. For example, if the police department would like to focus on drug-related

crime, the chart allows selection of that specific crime and helps to identify the hotspots for

taking targeted action.
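Outside Tableau, the same weighting idea can be approximated by counting incidents per location after applying the selected filters. The sketch below is only an illustration: the record keys (`lat`, `lon`, `nibrs_type`, `year`), the rounding of coordinates into cells, and the sample coordinates are all assumptions, not the structure of the actual data sets.

```python
from collections import Counter


def hotspot_counts(incidents, crime_type=None, year=None, precision=3):
    """Count incidents per rounded (lat, lon) cell, with optional filters."""
    counts = Counter()
    for inc in incidents:
        if crime_type is not None and inc["nibrs_type"] != crime_type:
            continue
        if year is not None and inc["year"] != year:
            continue
        cell = (round(inc["lat"], precision), round(inc["lon"], precision))
        counts[cell] += 1
    return counts


# Invented sample records.
incidents = [
    {"lat": 26.1224, "lon": -80.1373, "nibrs_type": "Robbery", "year": 2016},
    {"lat": 26.1221, "lon": -80.1371, "nibrs_type": "Robbery", "year": 2017},
    {"lat": 26.1660, "lon": -80.1170, "nibrs_type": "Fraud", "year": 2017},
]
robbery_cells = hotspot_counts(incidents, crime_type="Robbery")
top_cell = robbery_cells.most_common(1)
```

The cells with the highest counts correspond to the largest marks on the map for the chosen crime type and period.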

Visualization

Fort Lauderdale Top 10 Crimes


NIBRS Type Number of Crimes % of Crimes
Robbery 26,768 40.61%
Disorderly Conduct 13,374 20.29%
Assault 6,081 9.23%
Fraud 5,361 8.13%
Trespassing 5,000 7.59%
Drug/ Narcotic 3,885 5.89%
Motor Vehicle Theft 2,427 3.68%
Family Offenses 2,300 3.49%
Liquor Violations 431 0.65%
Driving under the Influence 285 0.43%

In the city of Fort Lauderdale, Robbery is the top crime, and the following three areas are
“hotspots” for this crime: West Broward Blvd/ NW 25th Ave, E Oakland Park Blvd/ N Federal
Hwy, and E Sunrise Blvd/ NE 4th Avenue.

Orlando Top 10 Crimes


NIBRS Type Number of Crimes % of Crimes
Larceny 43,018 27.19%
Motor Vehicle Theft 25,334 16.01%
Burglary 24,239 15.32%
Drug/ Narcotics 16,880 10.67%
Assault 16,700 10.56%
Stolen Property 15,407 9.74%
Extortion 7,264 4.59%
Robbery 6,114 3.86%
Fraud 2,997 1.89%
Arson 244 0.15%
In Orlando, Larceny is the top crime, and the visualization shows that it occurs most often at
Orlando International Airport, followed by Orlando International Premium Outlets and Conroy
Rd/ Eastgate Dr (next to The Mall at Millenia).

Gainesville Top 10 Crimes


NIBRS Type Number of Crimes % of Crimes
Robbery 11,122 32.70%
Assault 5,801 17.05%
Curfew/ loitering 4,029 11.84%
Drug/ narcotics 3,238 9.52%
Disorderly conduct 2,784 8.18%
Stolen property 2,541 7.47%
Embezzlement 1,947 5.72%
Fraud 1,419 4.17%
Driving under the influence 617 1.81%
Runaway 518 1.52%

In Gainesville, the top crime is robbery, and the visualization shows that the “hotspots” for
robbery are at NE 12th Ave/ NE 19th Ter, followed by NW 23rd/ NW 34th and SW Archer Rd/ SW
34th St.

Prediction

The table below provides basic statistics of the datasets that were used in this study.

Dataset          Total # of Incidents  # of Crimes  % of Crimes  # of Non-Crimes  % of Non-Crimes
Fort Lauderdale  163,765               73,953       45.16        89,812           54.84
Gainesville      49,516                8,888        17.95        40,628           82.05
Orlando          158,459               158,459      100          0                0

In this study, 10-fold cross validation was used to obtain the results. The tool randomly
divides the data into 10 separate subsets. Nine of them are used for training and the remaining one is
used for testing. The process is repeated 10 times so that each subset is used as the test set exactly
once, and the average of the results is calculated. The table below provides the results of the
classification using the various algorithms.

Dataset          Algorithm         Accuracy (%)  Error Rate (%)  Precision (%)  Recall (%)
Fort Lauderdale  Decision Tree     57.53         42.47           59.45          70.98
Fort Lauderdale  Boosting          57.53         42.47           58.97          74.15
Fort Lauderdale  Nearest Neighbor  58.10         41.90           60.39          68.58
Fort Lauderdale  Naïve Bayes       55.69         44.31           57.97          69.84
Gainesville      Decision Tree     82.05         17.95           82.05          100.00
Gainesville      Boosting          80.01         19.99           84.19          93.13
Gainesville      Nearest Neighbor  79.13         20.87           84.01          92.10
Gainesville      Naïve Bayes       80.63         19.37           82.80          96.42

For Fort Lauderdale, K-Nearest Neighbor (KNN) provided the best accuracy (58.10%) and the lowest
error rate (41.90%). KNN also achieved the best precision, 60.39%. The best recall among the four
classifiers compared was that of Boosting (74.15%); Naïve Bayes reached 69.84%.
For Gainesville, Decision Tree provided the best accuracy (82.05%) and the lowest error rate
(17.95%). The best precision was achieved by Boosting, 84.19%. The recall of Decision
Tree (100%) was the best among the four classifiers compared.
Prediction results are mapped for both Fort Lauderdale and Gainesville, based on the

algorithm results with best precision. The pictures below highlight the predicted hot spots.

Conclusion
The aim of this project was to create an application to predict crime hotspots using
visualization and classification for three Florida cities. To achieve that, incident data for the
cities were obtained, mapped using visualization software and classified using data mining
algorithms. Visualization was done successfully, and the application allows hotspot analysis
along multiple dimensions such as time and crime type. In addition, four different algorithms were
used to build classification models that can predict hotspots, and the models were compared using
evaluation techniques. Based on accuracy, k-nearest neighbor was the most effective algorithm
for Fort Lauderdale and decision tree was the most effective algorithm for Gainesville. When
precision is considered, boosting was the better algorithm for prediction in Gainesville and k-nearest
neighbor was best for Fort Lauderdale.


Future Research

The current research provides a baseline for how machine learning algorithms can be

used for proactive policing. This study can be further developed by adding more Florida
cities as their incident data become available in the Police Data Initiative repository. It could
potentially serve as a single interface for users to quickly access crime spots for various cities. The
predictions can be enhanced by adding more features to the dataset. Adding the actual weather
at the time of the incident may provide more insight. For example, there were more
crimes in certain areas when there was a hurricane. Adding the weather information can help the
model identify these patterns. Similarly, adding census data for the incident location could also
be beneficial. In particular, augmenting the dataset with median income level, ethnic composition of
the population, and education level may help improve the prediction of hotspots.

Limitations

The tool used for prediction (Weka 3.8) did not have a one-class classifier.
Orlando’s data did not include all incidents; it contains only crime data. A one-class
classifier could have been used to predict the hotspots for Orlando. Some of the data points in the
dataset are “outliers” relative to most of the other data points for the city. For example,
some incident data recorded for Gainesville had geo-coordinates in Utah. These outliers tend to
limit the model learning process.

Applications
This application is a beneficial tool, not only for police officers but also for the
public. It provides visualization of crime spots, which can be further analyzed using multiple
parameters such as time, season, and crime type. For example, police officers can filter the data for
certain months when the “snow bird” population is higher in retirement community areas and see if
there is a different crime pattern. They can also use the prediction models to proactively predict
the hotspots and patrol those areas more actively. Members of the public can use this to understand
crime spots and apply that knowledge when they are considering an area in which to buy property or to be
more alert in problem areas while visiting.

References

1. [Special issue]. (2010). Geography Public Safety, 2(3). Retrieved from

https://www.nij.gov/topics/technology/maps/documents/gps-bulletin-v2i3.pdf

2. Brownlee, J. (2016, April 11). Naive Bayes for machine-learning [Online forum post].

Retrieved from https://machinelearningmastery.com/naive-bayes-for-machine-learning/

3. Castillo, M. (2015, June 4). Is a new crime wave on the horizon? CNN. Retrieved from

http://www.cnn.com/2015/06/02/us/crime-in-america/

4. Criminal Justice Information Services Division Uniform Crime Reporting Program.

(n.d.). Retrieved from https://ucr.fbi.gov/nibrs/nibrs-user-manual

5. FACT SHEET: Administration Announces New “Smart Cities” Initiative to Help

Communities Tackle Local Challenges and Improve City Services [Press release]. (2015,

September 14). Retrieved from https://obamawhitehouse.archives.gov/the-press-office/2015/09/14/fact-sheet-administration-announces-new-smart-cities-initiative-help

6. FACT SHEET: Announcing Over $80 million in New Federal Investment and a Doubling

of Participating Communities in the White House Smart Cities Initiative [Press release].
(n.d.). Retrieved from https://obamawhitehouse.archives.gov/the-press-office/2016/09/26/fact-sheet-announcing-over-80-million-new-federal-investment-and

7. Gunawardena, T. (2016, September 4). K Nearest Neighbors. Retrieved January 7, 2018,

from https://www.slideshare.net/tilanigunawardena/k-nearest-neighbors

8. Marsh, B. (2016, September). Multivariate Analysis of the Vector Boson Fusion Higgs

Boson. Retrieved from

https://www.researchgate.net/profile/Brendan_Marsh3/publication/306054843_Multivariate_Analysis_of_the_Vector_Boson_Fusion_Higgs_Boson/links/57ac9d6508ae7a6420c2ffa8/Multivariate-Analysis-of-the-Vector-Boson-Fusion-Higgs-Boson.pdf

9. Police Data Initiative. (2017). Retrieved from https://www.policedatainitiative.org/

10. Schneider, J. (1997, February 7). Cross Validation. Retrieved January 7, 2018, from

https://www.cs.cmu.edu/~schneide/tut5/node42.html

11. Shojaee, S., Mustapha, A., Sidi, F., & Jabar, M. Z. (2013, May). A Study on

Classification Learning Algorithms to Predict Crime Status. Retrieved from

https://www.researchgate.net/profile/Somayeh_Shojaee/publication/266971832_A_Study_on_Classification_Learning_Algorithms_to_Predict_Crime_Status/links/54436e830cf2e6f0c0f94761/A-Study-on-Classification-Learning-Algorithms-to-Predict-Crime-Status.pdf

12. Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston, MA:

Addison-Wesley Longman Publishing Co.

13. Wu, X., Kumar, V., Quinlan, R. J., Ghosh, J., Yang, Q., Motoda, H., . . . Steinberg, D.

(2007, September). Top 10 algorithms in data mining. Retrieved from

https://atasehir.bel.tr/Content/Yuklemeler/Dokuman/Dokuman3_4.pdf
